From mboxrd@z Thu Jan  1 00:00:00 1970
From: Matan Barak
Subject: Re: [PATCH rdma-cm] IB/core: Fix use after free of ifa
Date: Mon, 19 Oct 2015 17:20:03 +0300
Message-ID: <5624FC13.1090200@mellanox.com>
References: <1444910463-5688-1-git-send-email-matanb@mellanox.com>
 <1444910463-5688-2-git-send-email-matanb@mellanox.com>
 <561FE452.3050304@redhat.com>
 <5624E0AE.8050702@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <5624E0AE.8050702-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Doug Ledford
Cc: Matan Barak, linux-rdma, Or Gerlitz, Jason Gunthorpe, Eran Ben Elisha
List-Id: linux-rdma@vger.kernel.org

On 10/19/2015 3:23 PM, Doug Ledford wrote:
> On 10/18/2015 03:49 AM, Matan Barak wrote:
>> On Thu, Oct 15, 2015 at 8:37 PM, Doug Ledford wrote:
>>> On 10/15/2015 08:01 AM, Matan Barak wrote:
>>>> When ifup/ifdown are used while enum_netdev_ipv4_ips is executing,
>>>> ifa could become invalid and cause a use-after-free error.
>>>> Fix it by protecting the ifa iteration with an RCU read lock.
>>>>
>>>> Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
>>>> Signed-off-by: Matan Barak
>>>> ---
>>>>
>>>> Hi Doug,
>>>>
>>>> This patch fixes a bug in the RoCE GID table implementation. Under
>>>> stress conditions where ifup/ifdown are used, the ifa pointer could
>>>> become invalid. We take an RCU read lock in order to avoid the ifa
>>>> node being freed while it is in use, as is done in other inet
>>>> functions (for example, inet_addr_onlink).
>>>>
>>>> Our QA team verified that this patch fixes the issue.
>>>
>>> This doesn't look like a good fix to me.  In particular, I think you
>>> merely shifted the bug around, you didn't actually resolve it.
>>>
>>> In the original code, you called update_gid_ip while holding a
>>> reference to in_dev.  The reference to in_dev was not enough to keep
>>> the ifa list from changing while you were doing your work.  It's not
>>> surprising that you hit a race with the ifa list because update_gid_ip
>>> being called synchronously can both A) sleep because of the mutexes it
>>> takes and B) be slow because of how many locks it takes (and it can
>>> really take a lot due to find_gid) and C) be slow again because
>>> updating the gid table calls into the low level driver and actually
>>> writes a mailbox command to the card.  So, for all those reasons, not
>>> only did you hit this race, but you were *likely* to hit this race.
>>>
>>
>> I don't mind that the list could be changing between the inet event
>> and the work handler.
>> I do mind that the ifa is released while we are working on it. I think
>> the major reason for possible slowness is the vendor call.
>
> No, it's not.
>
>> Most locks are per-entry and are read-write locks.
>
> This is a major cause of the slowness.  Unless you have a specific need
> of them, per-entry rwlocks are *NOT* a good idea.  I was going to bring
> this up separately, so I'll just mention it here.  Per-entry locks help
> reduce contention when you have lots of concurrent accessors.  Using
> rwlocks helps reduce contention when you have a read-mostly entry that
> is only occasionally changed.  But every lock and every unlock (just
> like every atomic access) still requires a locked memory cycle.  That
> means every lock acquire and every lock release requires a
> synchronization event between all CPUs.  Using per-entry locks means
> that every entry triggers two of those synchronization events.
> On modern Intel CPUs, they cost about 32 cycles per event.  If you are
> going to do something, and it can't be done without a lock, then grab a
> single lock and do it as fast as you can.  Only in rare cases would you
> want per-entry locks.
>

I agree that every rwlock costs us a locked access. However, let's look
at the common scenario here. I think that in stable production systems,
the IPs our rdma-cm stack uses (and thus the GIDs) should be pretty
stable, so most accesses should be reads. That's why IMHO read-write
locking makes sense here.
Regarding a single lock vs. per-entry locks, it really depends on how
common it is for one entry to be updated while another entry is being
used by an application. In a (future) dynamic system, you might want to
create containers dynamically, which will add a net device and change
the hardware GID table while another application (maybe in another
container) uses other GIDs, but this might be a rare scenario.

>>> Now, you've put an rcu_read_lock on ndev instead.  And you're no
>>> longer seeing the race.  However, does taking the rcu_read_lock on
>>> ndev actually protect the ifa list on ndev, or is the real fix the
>>> fact that you moved update_gid_ip out of the main loop?  Before, you
>>> blocked while processing the ifa list, making hitting your race
>>> likely.  Now you process the ifa list very fast and build your own
>>> sin_list that is no longer impacted by changes to the ifa list, but I
>>> don't know that the rcu_read_lock you have taken actually makes you
>>> for sure safe here versus the possibility that you have just made the
>>> race much harder to hit and hidden it.
>>>
>>
>> As Jason wrote, the release of the ifa is protected by call_rcu. So
>> protecting the usage of ifa with RCU should be enough to eliminate
>> this bug.
>
> OK, I'm happy enough with the explanation to take this patch.  But
> please think about the per-entry locks you mention above.  Those need
> to go if possible.  I'm relatively certain that you will be able to
> demonstrate a *drastic* speed up in this code with them removed (try
> running the cmtime application from librdmacm-utils and see what the
> results are with per-entry locks and a per-table lock instead).
>

Ok, refactoring this code to use a single per-table lock shouldn't be
problematic (a rough sketch of what that could look like is at the end
of this mail). Regarding performance, I think the results here would be
largely determined by the rate at which we add/remove IPs or upper
net-devices in the background.

>>
>>> And even if the rcu_read_lock is for sure safe in terms of accessing
>>> the ifa list, these changes may have just introduced a totally new
>>> bug that your QE tests haven't exposed but might exist nonetheless.
>>> In particular, we have now queued up adding a bunch of gids to the
>>> ndev.  But we drop our reference to the rcu lock, then we invoke a
>>> (possibly large) number of sleeping iterations.  What's to prevent a
>>> situation where we get enum_netdev_ipv4_ips() called on, say, a vlan
>>> child interface of a primary RoCE netdev, create our address list,
>>> release our lock, then the user destroys our vlan device, and we race
>>> with del_netdev_ips on the vlan device, such that del_netdev_ips
>>> completes and removes all the gids for that netdev, but we still have
>>> backlogged gid add events in enum_netdev_ipv4_ips and so we add back
>>> in what will become permanently stale gids?
>>> I don't think we hold rtnl_lock while running in
>>> enum_netdev_ipv4_ips and that's probably the only lock that would
>>> exclude the user from deleting the vlan device, so as far as I can
>>> tell we can easily call del_netdev_ips while the tail end of
>>> enum_netdev_ipv4_ips is sleeping.  Am I wrong here?  A test would be
>>> to take whatever QE test you have that hit this bug in the first
>>> place, and on a different terminal add a while loop of adding/removing
>>> the same vlan interface that you are updating gids on and see if the
>>> gid table starts filling up with stale, unremovable entries.
>>>
>>
>> The RoCE GID management design uses event handlers and one workqueue.
>> When an event (inet/net) is handled, we hold the net device and queue
>> a work item on the workqueue.
>> The work items are executed in order - first-come, first-served.
>> That's why, if you add/del a vlan (or its IP), we do dev_hold in the
>> event itself. Since the ndev is available in the event and is held
>> when executing the event, it can't be deleted until we handle this
>> event in the workqueue. If the user tries to delete the vlan before
>> our add (inet/ndev) work has completed, we'll get an UNREGISTER event,
>> but since the dev is held, the stack will have to wait until we drop
>> all our references to this device. Using a queue guarantees us the
>> order - we'll first complete adding the vlan and then delete it. Only
>> after all references are dropped can the net device be deleted.
>> Anyway, I'll ask the QA team here to add this test.
>
> OK.  And it works for now, but if an additional means of running one of
> these functions that isn't on the workqueue is ever added, then it can
> break subtly.
>

This workqueue is an important part of the design. We need to impose
ordering, but handle events in a different context (as some IPv6 events
come from an atomic context).

>> Thanks for taking a look at this patch.
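
To make the hold-and-queue pattern described above concrete, here is a
minimal sketch of it. The names (gid_mgmt_wq, netdev_work,
queue_netdev_event) are hypothetical and this is not the actual
roce_gid_mgmt code; it only illustrates why an ndev held in the event
handler cannot go away before its queued work runs, and why a single
ordered workqueue preserves the event order:

/*
 * Hold-and-queue sketch (hypothetical names): the event handler takes a
 * reference on the net_device and queues a work item on one ordered
 * workqueue, so work runs first-come, first-served and the ndev cannot
 * be freed before its pending work has executed.
 */
#include <linux/init.h>
#include <linux/netdevice.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

static struct workqueue_struct *gid_mgmt_wq;

struct netdev_work {
	struct work_struct	work;
	struct net_device	*ndev;	/* held until the work runs */
};

static void netdev_work_handler(struct work_struct *_work)
{
	struct netdev_work *work =
		container_of(_work, struct netdev_work, work);

	/* ... update the GID table for work->ndev here ... */

	dev_put(work->ndev);	/* release the reference taken in the event */
	kfree(work);
}

/* Called from the (possibly atomic) inet/netdev event context. */
static int queue_netdev_event(struct net_device *ndev)
{
	struct netdev_work *work = kzalloc(sizeof(*work), GFP_ATOMIC);

	if (!work)
		return -ENOMEM;

	dev_hold(ndev);		/* keep ndev alive until the work executes */
	work->ndev = ndev;
	INIT_WORK(&work->work, netdev_work_handler);
	queue_work(gid_mgmt_wq, &work->work);
	return 0;
}

static int __init gid_mgmt_wq_init(void)
{
	/* an ordered workqueue executes one work item at a time, in order */
	gid_mgmt_wq = alloc_ordered_workqueue("gid_mgmt_wq", 0);
	return gid_mgmt_wq ? 0 : -ENOMEM;
}
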
>>
>>>> Thanks,
>>>> Matan
>>>>
>>>>  drivers/infiniband/core/roce_gid_mgmt.c | 35 +++++++++++++++++++++++++--------
>>>>  1 file changed, 27 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
>>>> index 6b24cba..178f984 100644
>>>> --- a/drivers/infiniband/core/roce_gid_mgmt.c
>>>> +++ b/drivers/infiniband/core/roce_gid_mgmt.c
>>>> @@ -250,25 +250,44 @@ static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
>>>>  			 u8 port, struct net_device *ndev)
>>>>  {
>>>>  	struct in_device *in_dev;
>>>> +	struct sin_list {
>>>> +		struct list_head	list;
>>>> +		struct sockaddr_in	ip;
>>>> +	};
>>>> +	struct sin_list *sin_iter;
>>>> +	struct sin_list *sin_temp;
>>>>
>>>> +	LIST_HEAD(sin_list);
>>>>  	if (ndev->reg_state >= NETREG_UNREGISTERING)
>>>>  		return;
>>>>
>>>> -	in_dev = in_dev_get(ndev);
>>>> -	if (!in_dev)
>>>> +	rcu_read_lock();
>>>> +	in_dev = __in_dev_get_rcu(ndev);
>>>> +	if (!in_dev) {
>>>> +		rcu_read_unlock();
>>>>  		return;
>>>> +	}
>>>>
>>>>  	for_ifa(in_dev) {
>>>> -		struct sockaddr_in ip;
>>>> +		struct sin_list *entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
>>>>
>>>> -		ip.sin_family = AF_INET;
>>>> -		ip.sin_addr.s_addr = ifa->ifa_address;
>>>> -		update_gid_ip(GID_ADD, ib_dev, port, ndev,
>>>> -			      (struct sockaddr *)&ip);
>>>> +		if (!entry) {
>>>> +			pr_warn("roce_gid_mgmt: couldn't allocate entry for IPv4 update\n");
>>>> +			continue;
>>>> +		}
>>>> +		entry->ip.sin_family = AF_INET;
>>>> +		entry->ip.sin_addr.s_addr = ifa->ifa_address;
>>>> +		list_add_tail(&entry->list, &sin_list);
>>>>  	}
>>>>  	endfor_ifa(in_dev);
>>>> +	rcu_read_unlock();
>>>>
>>>> -	in_dev_put(in_dev);
>>>> +	list_for_each_entry_safe(sin_iter, sin_temp, &sin_list, list) {
>>>> +		update_gid_ip(GID_ADD, ib_dev, port, ndev,
>>>> +			      (struct sockaddr *)&sin_iter->ip);
>>>> +		list_del(&sin_iter->list);
>>>> +		kfree(sin_iter);
>>>> +	}
>>>>  }
>>>>
>>>>  static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,
>>>>
>>>
>>>
>>> --
>>> Doug Ledford
>>> GPG KeyID: 0E572FDD
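
As mentioned earlier in this mail, here is a rough sketch of what a
single per-table rwlock could look like instead of per-entry locks. The
structures and helpers (gid_table, gid_entry, find_gid_locked,
write_gid_locked) are hypothetical and only illustrate the locking
pattern, not the actual ib_core GID table code: lookups still run
concurrently under the read lock, but only one lock/unlock pair is paid
per table walk instead of one per entry visited.

#include <linux/spinlock.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

struct gid_entry {
	union ib_gid	gid;
	bool		valid;
};

struct gid_table {
	rwlock_t		lock;	/* one lock for the whole table,
					 * rwlock_init() at table creation */
	unsigned int		sz;
	struct gid_entry	*entries;
};

/* Return the index of @gid, or -1 if it is not in the table. */
static int find_gid_locked(struct gid_table *table, const union ib_gid *gid)
{
	unsigned int i;
	int ix = -1;

	read_lock(&table->lock);
	for (i = 0; i < table->sz; i++) {
		if (table->entries[i].valid &&
		    !memcmp(&table->entries[i].gid, gid, sizeof(*gid))) {
			ix = i;
			break;
		}
	}
	read_unlock(&table->lock);

	return ix;
}

/* Overwrite entry @ix; writers are serialized by the same table lock. */
static void write_gid_locked(struct gid_table *table, unsigned int ix,
			     const union ib_gid *gid)
{
	write_lock(&table->lock);
	table->entries[ix].gid = *gid;
	table->entries[ix].valid = true;
	write_unlock(&table->lock);
}

With a cmtime-style workload this trades many per-entry lock operations
per lookup for a single pair, which is where the speed-up you describe
should come from.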