From mboxrd@z Thu Jan 1 00:00:00 1970 From: Erez Shitrit Subject: Re: [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage Date: Tue, 03 Mar 2015 11:53:21 +0200 Message-ID: <54F58491.6020407@dev.mellanox.co.il> References: <9d657f64ee961ee3b3233520d8b499b234a42bcd.1424562072.git.dledford@redhat.com> <54F2DC81.304@dev.mellanox.co.il> <1425310036.2354.24.camel@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1425310036.2354.24.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Doug Ledford Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org, Or Gerlitz , Erez Shitrit List-Id: linux-rdma@vger.kernel.org On 3/2/2015 5:27 PM, Doug Ledford wrote: > On Sun, 2015-03-01 at 11:31 +0200, Erez Shitrit wrote: >> On 2/22/2015 2:27 AM, Doug Ledford wrote: >>> Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast >>> objects") added a new flag MCAST_JOIN_STARTED, but was not very strict >>> in how it was used. We didn't always initialize the completion struct >>> before we set the flag, and we didn't always call complete on the >>> completion struct from all paths that complete it. And when we did >>> complete it, sometimes we continued to touch the mcast entry after >>> the completion, opening us up to possible use after free issues. >>> >>> This made it less than totally effective, and certainly made its use >>> confusing. And in the flush function we would use the presence of this >>> flag to signal that we should wait on the completion struct, but we never >>> cleared this flag, ever. >>> >>> In order to make things clearer and aid in resolving the rtnl deadlock >>> bug I've been chasing, I cleaned this up a bit. >>> >>> 1) Remove the MCAST_JOIN_STARTED flag entirely >>> 2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight >>> 3) Test mcast->mc directly to see if we have completed >>> ib_sa_join_multicast (using IS_ERR_OR_NULL) >>> 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize >>> the mcast->done completion struct >>> 5) Make sure that before calling complete(&mcast->done), we always clear >>> the MCAST_FLAG_BUSY bit >>> 6) Take the mcast_mutex before we call ib_sa_multicast_join and also >>> take the mutex in our join callback. This forces >>> ib_sa_multicast_join to return and set mcast->mc before we process >>> the callback. This way, our callback can safely clear mcast->mc >>> if there is an error on the join and we will do the right thing as >>> a result in mcast_dev_flush. >>> 7) Because we need the mutex to synchronize mcast->mc, we can no >>> longer call mcast_sendonly_join directly from mcast_send and >>> instead must add sendonly join processing to the mcast_join_task >>> 8) Make MCAST_RUN mean that we have a working mcast subsystem, not that >>> we have a running task. We know when we need to reschedule our >>> join task thread and don't need a flag to tell us. >>> 9) Add a helper for rescheduling the join task thread >>> >>> A number of different races are resolved with these changes. These >>> races existed with the old MCAST_FLAG_BUSY usage, the >>> MCAST_JOIN_STARTED flag was an attempt to address them, and while it >>> helped, a determined effort could still trip things up. >>> >>> One race looks something like this: >>> >>> Thread 1 Thread 2 >>> ib_sa_join_multicast (as part of running restart mcast task) >>> alloc member >>> call callback >>> ifconfig ib0 down >>> wait_for_completion >>> callback call completes >>> wait_for_completion in >>> mcast_dev_flush completes >>> mcast->mc is PTR_ERR_OR_NULL >>> so we skip ib_sa_leave_multicast >>> return from callback >>> return from ib_sa_join_multicast >>> set mcast->mc = return from ib_sa_multicast >>> >>> We now have a permanently unbalanced join/leave issue that trips up the >>> refcounting in core/multicast.c >>> >>> Another like this: >>> >>> Thread 1 Thread 2 Thread 3 >>> ib_sa_multicast_join >>> ifconfig ib0 down >>> priv->broadcast = NULL >>> join_complete >>> wait_for_completion >>> mcast->mc is not yet set, so don't clear >>> return from ib_sa_join_multicast and set mcast->mc >>> complete >>> return -EAGAIN (making mcast->mc invalid) >>> call ib_sa_multicast_leave >>> on invalid mcast->mc, hang >>> forever >>> >>> By holding the mutex around ib_sa_multicast_join and taking the mutex >>> early in the callback, we force mcast->mc to be valid at the time we >>> run the callback. This allows us to clear mcast->mc if there is an >>> error and the join is going to fail. We do this before we complete >>> the mcast. In this way, mcast_dev_flush always sees consistent state >>> in regards to mcast->mc membership at the time that the >>> wait_for_completion() returns. >>> >>> Signed-off-by: Doug Ledford >>> --- >>> drivers/infiniband/ulp/ipoib/ipoib.h | 11 +- >>> drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 355 ++++++++++++++++--------- >>> 2 files changed, 238 insertions(+), 128 deletions(-) >>> >>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h >>> index 9ef432ae72e..c79dcd5ee8a 100644 >>> --- a/drivers/infiniband/ulp/ipoib/ipoib.h >>> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h >>> @@ -98,9 +98,15 @@ enum { >>> >>> IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */ >>> IPOIB_MCAST_FLAG_SENDONLY = 1, >>> - IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ >>> + /* >>> + * For IPOIB_MCAST_FLAG_BUSY >>> + * When set, in flight join and mcast->mc is unreliable >>> + * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or >>> + * haven't started yet >>> + * When clear and mcast->mc is valid pointer, join was successful >>> + */ >>> + IPOIB_MCAST_FLAG_BUSY = 2, >>> IPOIB_MCAST_FLAG_ATTACHED = 3, >>> - IPOIB_MCAST_JOIN_STARTED = 4, >>> >>> MAX_SEND_CQE = 16, >>> IPOIB_CM_COPYBREAK = 256, >>> @@ -148,6 +154,7 @@ struct ipoib_mcast { >>> >>> unsigned long created; >>> unsigned long backoff; >>> + unsigned long delay_until; >>> >>> unsigned long flags; >>> unsigned char logcount; >>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c >>> index bb1b69904f9..277e7ac7c4d 100644 >>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c >>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c >>> @@ -66,6 +66,48 @@ struct ipoib_mcast_iter { >>> unsigned int send_only; >>> }; >>> >>> +/* >>> + * This should be called with the mcast_mutex held >>> + */ >>> +static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv, >>> + struct ipoib_mcast *mcast, >>> + bool delay) >>> +{ >>> + if (!test_bit(IPOIB_MCAST_RUN, &priv->flags)) >> You don't need the flag IPOIB_MCAST_RUN, it is duplicated of >> IPOIB_FLAG_OPER_UP >> probably, need to be taken from all places (including ipoib.h file). > This is probably true, however I skipped it for this series of patches. > It wasn't a requirement of proper operation, and depending on where in > the patch series you tried to inject this change, it had unintended > negative consequences. In particular, up until patch 7/9, > mcast_restart_task used to do a hand rolled stop thread and a matching > start_thread and so couldn't use FLAG_OPER_UP because we needed > FLAG_OPER_UP to tell us whether or not to restart the thread. OK, sounds reasonable. we can send a patch after that that fix that. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html