From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
To: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
linux-rdma <linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>,
Eran Ben Elisha <eranbe-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: [PATCH rdma-cm] IB/core: Fix use after free of ifa
Date: Mon, 19 Oct 2015 17:20:03 +0300 [thread overview]
Message-ID: <5624FC13.1090200@mellanox.com> (raw)
In-Reply-To: <5624E0AE.8050702-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
On 10/19/2015 3:23 PM, Doug Ledford wrote:
> On 10/18/2015 03:49 AM, Matan Barak wrote:
>> On Thu, Oct 15, 2015 at 8:37 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>> On 10/15/2015 08:01 AM, Matan Barak wrote:
>>>> When using ifup/ifdown while executing enum_netdev_ipv4_ips,
>>>> ifa could become invalid and cause a use-after-free error.
>>>> Fix this by protecting the ifa access with an RCU read lock.
>>>>
>>>> Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
>>>> Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>> ---
>>>>
>>>> Hi Doug,
>>>>
>>>> This patch fixes a bug in the RoCE GID table implementation. Under stress conditions
>>>> where ifup/ifdown are used, the ifa pointer could become invalid. Use an
>>>> RCU read lock to avoid the ifa node being freed while it is in use (as done in
>>>> other inet functions, for example inet_addr_onlink).
>>>>
>>>> Our QA team verified that this patch fixes this issue.
>>>
>>> This doesn't look like a good fix to me. In particular, I think you
>>> merely shifted the bug around, you didn't actually resolve it.
>>>
>>> In the original code, you called update_gid_ip while holding a reference
>>> to in_dev. The reference to in_dev was not enough to keep the ifa list
>>> from changing while you were doing your work. It's not surprising that
>>> you hit a race with the ifa list because update_gid_ip being called
>>> synchronously can both A) sleep because of the mutexes it takes and B)
>>> be slow because of how many locks it takes (and it can really take a lot
>>> due to find_gid) and C) be slow again because of updating the gid table
>>> calling into the low level driver and actually writing a mailbox command
>>> to the card. So, for all those reasons, not only did you hit this race,
>>> but you were *likely* to hit this race.
>>>
>>
>> I don't mind that the list could be changing between the inet event
>> and the work handler.
>> I do mind if the ifa is released while we are working on it. I think the
>> major reason for possible slowness is the vendor call.
>
> No, it's not.
>
>> Most locks are
>> per-entry and are read-write locks.
>
> This is a major cause of the slowness. Unless you have a specific need
> of them, per-entry rwlocks are *NOT* a good idea. I was going to bring
> this up separately, so I'll just mention it here. Per entry locks help
> reduce contention when you have lots of multiple accessors. Using
> rwlocks help reduce contention when you have a read-mostly entry that is
> only occasionally changed. But every lock and every unlock (just like
> every atomic access) still requires a locked memory cycle. That means
> every lock acquire and every lock release requires a synchronization
> event between all CPUs. Using per-entry locks means that every entry
> triggers two of those synchronization events. On modern Intel CPUs,
> they cost about 32 cycles per event. If you are going to do something,
> and it can't be done without a lock, then grab a single lock and do it
> as fast as you can. Only in rare cases would you want per-entry locks.
>
I agree that every rwlock costs us a locked access. However, let's look at
the common scenario here. I think that in production stable systems, the
IPs our rdma-cm stack uses (and thus the GIDs) should be pretty stable.
Thus, most accesses should be reads. That's why IMHO read-write
locking makes sense here.
Regarding a single lock vs. per-entry locks, it really depends on how
often one entry is updated while another entry is being used by an
application. In a (future) dynamic system, you might want to create
containers dynamically, which will add a net device and change the
hardware GID table while an application (maybe in another container)
uses other GIDs, but this might be a rare scenario.
>>> Now, you've put an rcu_read_lock on ndev instead. And you're no longer
>>> seeing the race. However, does taking the rcu_read_lock on ndev
>>> actually protect the ifa list on ndev, or is the real fix the fact that
>>> you moved update_gid_ip out of the main loop? Before, you blocked while
>>> processing the ifa list, making it likely you would hit your race. Now you
>>> process the ifa list very fast and build your own sin_list that is no
>>> longer impacted by changes to the ifa list. But I don't know whether the
>>> rcu_read_lock you have taken actually makes you safe for sure here, or
>>> whether you have just made the race much harder to hit and hidden it.
>>>
>>
>> As Jason wrote, the release of the ifa is protected by call_rcu. So
>> protecting the usage of ifa with RCU should be enough to eliminate
>> this bug.
>
> OK, I'm happy enough with the explanation to take this patch. But
> please think about the per-entry locks you mention above. Those need to
> go if possible. I'm relatively certain that you will be able to
> demonstrate a *drastic* speed up in this code with them removed (try
> running the cmtime application from librdmacm-utils and see what the
> results are with per-entry locks and a per table lock instead).
>
OK, refactoring this code to use a single lock shouldn't be problematic.
Regarding performance, I think the results here are vastly impacted by
the rate at which we add/remove IPs or upper net-devices in the background.
>>
>>> And even if the rcu_read_lock is for sure safe in terms of accessing the
>>> ifa list, these changes may have just introduced a totally new bug that
>>> your QE tests haven't exposed but might exist none the less. In
>>> particular, we have now queued up adding a bunch of gids to the ndev.
>>> But we drop the rcu read lock, then we invoke a (possibly large)
>>> number of sleeping iterations. What's to prevent a situation
>>> where we get enum_netdev_ipv4_ips() called on say a vlan child interface
>>> of a primary RoCE netdev, create our address list, release our lock,
>>> then the user destroys our vlan device, and we race with del_netdev_ips
>>> on the vlan device, such that del_netdev_ips completes and removes all
>>> the gids for that netdev, but we still have backlogged gid add events in
>>> enum_netdev_ipv4_ips and so we add back in what will become permanently
>>> stale gids? I don't think we hold rtnl_lock while running in
>>> enum_netdev_ipv4_ips and that's probably the only lock that would
>>> exclude the user from deleting the vlan device, so as far as I can tell
>>> we can easily call del_netdev_ips while the tail end of
>>> enum_netdev_ipv4_ips is sleeping. Am I wrong here? A test would be to
>>> take whatever QE test you have that hit this bug in the first place, and
>>> on a different terminal add a while loop of adding/removing the same
>>> vlan interface that you are updating gids on and see if the gid table
>>> starts filling up with stale, unremovable entries.
>>>
>>
>> The RoCE GID management design uses event handlers and one workqueue.
>> When an event (inet/net) is handled, we hold the net device and
>> queue a work item on the workqueue.
>> The work items are executed in order - first-come, first-served.
>> That's why, when you add/delete a vlan (or its IP),
>> we do dev_hold in the event handler itself. Since the ndev is available
>> in the event and is held there, it can't be deleted until we handle this
>> event in the workqueue. If the user tries to delete the vlan before
>> our add (inet/ndev) work has completed, we'll get an UNREGISTER event,
>> but since the dev is held, the stack will have to wait until we drop
>> all our references to this device. Using a queue guarantees us the
>> order - we'll first complete adding the vlan and only then delete it.
>> Only after all references are dropped can the net device be
>> deleted.
>> Anyway, I'll ask the QA team here to add this test.
>
> OK. And it works for now, but if there is ever added an additional
> means of running one of these functions that isn't on the work queue,
> then it can break subtly.
>
This workqueue is an important part of this design. We need to impose an
order, but handle events in a different context (as some IPv6 events
come from an atomic context).
>> Thanks for taking a look at this patch.
>>
>>>> Thanks,
>>>> Matan
>>>>
>>>> drivers/infiniband/core/roce_gid_mgmt.c | 35 +++++++++++++++++++++++++--------
>>>> 1 file changed, 27 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
>>>> index 6b24cba..178f984 100644
>>>> --- a/drivers/infiniband/core/roce_gid_mgmt.c
>>>> +++ b/drivers/infiniband/core/roce_gid_mgmt.c
>>>> @@ -250,25 +250,44 @@ static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
>>>> u8 port, struct net_device *ndev)
>>>> {
>>>> struct in_device *in_dev;
>>>> + struct sin_list {
>>>> + struct list_head list;
>>>> + struct sockaddr_in ip;
>>>> + };
>>>> + struct sin_list *sin_iter;
>>>> + struct sin_list *sin_temp;
>>>>
>>>> + LIST_HEAD(sin_list);
>>>> if (ndev->reg_state >= NETREG_UNREGISTERING)
>>>> return;
>>>>
>>>> - in_dev = in_dev_get(ndev);
>>>> - if (!in_dev)
>>>> + rcu_read_lock();
>>>> + in_dev = __in_dev_get_rcu(ndev);
>>>> + if (!in_dev) {
>>>> + rcu_read_unlock();
>>>> return;
>>>> + }
>>>>
>>>> for_ifa(in_dev) {
>>>> - struct sockaddr_in ip;
>>>> + struct sin_list *entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
>>>>
>>>> - ip.sin_family = AF_INET;
>>>> - ip.sin_addr.s_addr = ifa->ifa_address;
>>>> - update_gid_ip(GID_ADD, ib_dev, port, ndev,
>>>> - (struct sockaddr *)&ip);
>>>> + if (!entry) {
>>>> + pr_warn("roce_gid_mgmt: couldn't allocate entry for IPv4 update\n");
>>>> + continue;
>>>> + }
>>>> + entry->ip.sin_family = AF_INET;
>>>> + entry->ip.sin_addr.s_addr = ifa->ifa_address;
>>>> + list_add_tail(&entry->list, &sin_list);
>>>> }
>>>> endfor_ifa(in_dev);
>>>> + rcu_read_unlock();
>>>>
>>>> - in_dev_put(in_dev);
>>>> + list_for_each_entry_safe(sin_iter, sin_temp, &sin_list, list) {
>>>> + update_gid_ip(GID_ADD, ib_dev, port, ndev,
>>>> + (struct sockaddr *)&sin_iter->ip);
>>>> + list_del(&sin_iter->list);
>>>> + kfree(sin_iter);
>>>> + }
>>>> }
>>>>
>>>> static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,
>>>>
>>>
>>>
>>> --
>>> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> GPG KeyID: 0E572FDD
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
>