All of lore.kernel.org
 help / color / mirror / Atom feed
From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
To: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	linux-rdma <linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Jason Gunthorpe
	<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>,
	Eran Ben Elisha <eranbe-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: [PATCH rdma-cm] IB/core: Fix use after free of ifa
Date: Mon, 19 Oct 2015 17:20:03 +0300	[thread overview]
Message-ID: <5624FC13.1090200@mellanox.com> (raw)
In-Reply-To: <5624E0AE.8050702-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On 10/19/2015 3:23 PM, Doug Ledford wrote:
> On 10/18/2015 03:49 AM, Matan Barak wrote:
>> On Thu, Oct 15, 2015 at 8:37 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>> On 10/15/2015 08:01 AM, Matan Barak wrote:
>>>> When using ifup/ifdown while executing enum_netdev_ipv4_ips,
>>>> ifa could become invalid and cause use after free error.
>>>> Fixing it by protecting with RCU lock.
>>>>
>>>> Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management')
>>>> Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>>> ---
>>>>
>>>> Hi Doug,
>>>>
>>>> This patch fixes a bug in RoCE GID table implementation. Under stress conditions
>>>> where ifup/ifdown are used, the ifa pointer could become invalid. Using a
>>>> RCU lock in order to avoid freeing the ifa node (as done in other inet functions
>>>> (for example, inet_addr_onlink).
>>>>
>>>> Our QA team verified that this patch fixes this issue.
>>>
>>> This doesn't look like a good fix to me.  In particular, I think you
>>> merely shifted the bug around, you didn't actually resolve it.
>>>
>>> In the original code, you called update_gid_ip while holding a reference
>>> to in_dev.  The reference to in_dev was not enough to keep the ifa list
>>> from changing while you were doing your work.  It's not surprising that
>>> you hit a race with the ifa list because update_gid_ip being called
>>> synchronously can both A) sleep because of the mutexes it takes and B)
>>> be slow because of how many locks it takes (and it can really take a lot
>>> due to find_gid) and C) be slow again because of updating the gid table
>>> calling into the low level driver and actually writing a mailbox command
>>> to the card.  So, for all those reasons, not only did you hit this race,
>>> but you were *likely* to hit this race.
>>>
>>
>> I don't mind that the list could be changing between the inet event
>> and the work handler.
>> I do mind the ifa is released while working on it. I think the major
>> reason for possible slowness is the vendor call.
>
> No, it's not.
>
>> Most locks are
>> per-entry and are read-write locks.
>
> This is a major cause of the slowness.  Unless you have a specific need
> of them, per-entry rwlocks are *NOT* a good idea.  I was going to bring
> this up separately, so I'll just mention it here.  Per entry locks help
> reduce contention when you have lots of multiple accessors.  Using
> rwlocks help reduce contention when you have a read-mostly entry that is
> only occasionally changed.  But every lock and every unlock (just like
> every atomic access) still requires a locked memory cycle.  That means
> every lock acquire and every lock release requires a synchronization
> event between all CPUs.  Using per-entry locks means that every entry
> triggers two of those synchronization events.  On modern Intel CPUs,
> they cost about 32 cycles per event.  If you are going to do something,
> and it can't be done without a lock, then grab a single lock and do it
> as fast as you can.  Only in rare cases would you want per-entry locks.
>

I agree that every rwlock costs us locked access. However, lets look at 
the common scenario here. I think that in production stable systems, the 
IPs our rdma-cm stack uses (and thus GIDs) should be pretty stable. 
Thus, most accesses should be read calls. That's why IMHO read-write 
access makes sense here.

Regarding single-lock vs per entry lock, it really depends on common an 
entry could be updated while another entry is being used by an 
application. In a (future) dynamic system, you might want to create 
containers dynamically, which will add a net device and change the 
hardware GID table while other application (maybe in another container) 
uses other GIDs, but this might be rare scenario.

>>> Now, you've put an rcu_read_lock on ndev instead.  And you're no longer
>>> seeing the race.  However, does taking the rcu_read_lock on ndev
>>> actually protect the ifa list on ndev, or is the real fix the fact that
>>> you moved update_gid_ip out of the main loop?  Before, you blocked while
>>> processing the ifa list, making hitting your race likely.  Now you
>>> process the ifa list very fast and build your own sin_list that is no
>>> longer impacted by changes to the ifa list, but I don't know that the
>>> rcu_read_lock you have taken actually makes you for sure safe here
>>> versus the possibility that you have just made the race much harder to
>>> hit and hidden it.
>>>
>>
>> As Jason wrote, the release of the ifa is protected by call_rcu. So
>> protecting the usage of ifa with RCU should be enough to eliminate
>> this bug.
>
> OK, I'm happy enough with the explanation to take this patch.  But
> please think about the per-entry locks you mention above.  Those need to
> go if possible.  I'm relatively certain that you will be able to
> demonstrate a *drastic* speed up in this code with them removed (try
> running the cmtime application from librdmacm-utils and see what the
> results are with per-entry locks and a per table lock instead).
>

Ok, refactoring this code for a single lock shouldn't be problematic. 
Regarding performance, I think the results here are vastly impacted by 
the rate we'll add/remove IPs or upper net-devices in the background.

>>
>>> And even if the rcu_read_lock is for sure safe in terms of accessing the
>>> ifa list, these changes may have just introduced a totally new bug that
>>> your QE tests haven't exposed but might exist none the less.  In
>>> particular, we have now queued up adding a bunch of gids to the ndev.
>>> But we drop our reference to the rcu lock, then we invoke a (possibly
>>> large number) of sleeping iterations.  What's to prevent a situation
>>> where we get enum_netdev_ipv4_ips() called on say a vlan child interface
>>> of a primary RoCE netdev, create our address list, release our lock,
>>> then the user destroys our vlan device, and we race with del_netdev_ips
>>> on the vlan device, such that del_netdev_ips completes and removes all
>>> the gids for that netdev, but we still have backlogged gid add events in
>>> enum_netdev_ipv4_ips and so we add back in what will become permanently
>>> stale gids?  I don't think we hold rtnl_lock while running in
>>> enum_netdev_ipv4_ips and that's probably the only lock that would
>>> exclude the user from deleting the vlan device, so as far as I can tell
>>> we can easily call del_netdev_ips while the tail end of
>>> enum_netdev_ipv4_ips is sleeping.  Am I wrong here?  A test would be to
>>> take whatever QE test you have that hit this bug in the first place, and
>>> on a different terminal add a while loop of adding/removing the same
>>> vlan interface that you are updating gids on and see if the gid table
>>> starts filling up with stale, unremovable entries.
>>>
>>
>> The RoCE GID management design uses events handlers and one workqueue.
>> When an event (inet/net) is handled, we hold the net device and
>> execute a work in the workqueue.
>> The works are executed in a queue - thus first-come first-served.
>> That's why if you add/del a vlan (or its IP)
>> we do dev_hold in the event itself. Since the ndev is available in the
>> event and is held when executing the event - it can't be deleted until
>> we handle this
>> event in the workqueue. If the user tries to delete the vlan before
>> our add (inet/ndev) work was completed, we'll get an UNREGISTER event,
>> but since the dev is held, the stack will have to wait until we free
>> all our ref counts to this device. Using a queue guarantees us the
>> order - we'll first complete adding the vlan and then delete it. Only
>> after all reference counts are dropped, the net device could be
>> deleted.
>> Anyway, I'll ask the QA team here to add this test.
>
> OK.  And it works for now, but if there is ever added an additional
> means of running one of these functions that isn't on the work queue,
> then it can break subtly.
>

This workqueue is an important part of this design. We need to impose an 
order, but handle events in a different context (as some IPv6 events 
come from an atomic context).

>> Thanks for taking a look on this patch.
>>
>>>> Thanks,
>>>> Matan
>>>>
>>>>   drivers/infiniband/core/roce_gid_mgmt.c | 35 +++++++++++++++++++++++++--------
>>>>   1 file changed, 27 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c
>>>> index 6b24cba..178f984 100644
>>>> --- a/drivers/infiniband/core/roce_gid_mgmt.c
>>>> +++ b/drivers/infiniband/core/roce_gid_mgmt.c
>>>> @@ -250,25 +250,44 @@ static void enum_netdev_ipv4_ips(struct ib_device *ib_dev,
>>>>                                 u8 port, struct net_device *ndev)
>>>>   {
>>>>        struct in_device *in_dev;
>>>> +     struct sin_list {
>>>> +             struct list_head        list;
>>>> +             struct sockaddr_in      ip;
>>>> +     };
>>>> +     struct sin_list *sin_iter;
>>>> +     struct sin_list *sin_temp;
>>>>
>>>> +     LIST_HEAD(sin_list);
>>>>        if (ndev->reg_state >= NETREG_UNREGISTERING)
>>>>                return;
>>>>
>>>> -     in_dev = in_dev_get(ndev);
>>>> -     if (!in_dev)
>>>> +     rcu_read_lock();
>>>> +     in_dev = __in_dev_get_rcu(ndev);
>>>> +     if (!in_dev) {
>>>> +             rcu_read_unlock();
>>>>                return;
>>>> +     }
>>>>
>>>>        for_ifa(in_dev) {
>>>> -             struct sockaddr_in ip;
>>>> +             struct sin_list *entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
>>>>
>>>> -             ip.sin_family = AF_INET;
>>>> -             ip.sin_addr.s_addr = ifa->ifa_address;
>>>> -             update_gid_ip(GID_ADD, ib_dev, port, ndev,
>>>> -                           (struct sockaddr *)&ip);
>>>> +             if (!entry) {
>>>> +                     pr_warn("roce_gid_mgmt: couldn't allocate entry for IPv4 update\n");
>>>> +                     continue;
>>>> +             }
>>>> +             entry->ip.sin_family = AF_INET;
>>>> +             entry->ip.sin_addr.s_addr = ifa->ifa_address;
>>>> +             list_add_tail(&entry->list, &sin_list);
>>>>        }
>>>>        endfor_ifa(in_dev);
>>>> +     rcu_read_unlock();
>>>>
>>>> -     in_dev_put(in_dev);
>>>> +     list_for_each_entry_safe(sin_iter, sin_temp, &sin_list, list) {
>>>> +             update_gid_ip(GID_ADD, ib_dev, port, ndev,
>>>> +                           (struct sockaddr *)&sin_iter->ip);
>>>> +             list_del(&sin_iter->list);
>>>> +             kfree(sin_iter);
>>>> +     }
>>>>   }
>>>>
>>>>   static void enum_netdev_ipv6_ips(struct ib_device *ib_dev,
>>>>
>>>
>>>
>>> --
>>> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>>                GPG KeyID: 0E572FDD
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2015-10-19 14:20 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-10-15 12:01 [PATCH rdma-cm] IB/core: Fix memory corruption in ib_cache_gid_set_default_gid Matan Barak
     [not found] ` <1444910463-5688-1-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-10-15 12:01   ` [PATCH rdma-cm] IB/core: Fix use after free of ifa Matan Barak
     [not found]     ` <1444910463-5688-2-git-send-email-matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-10-15 17:37       ` Doug Ledford
     [not found]         ` <561FE452.3050304-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-10-15 17:53           ` Jason Gunthorpe
     [not found]             ` <20151015175310.GA17519-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-10-18  7:51               ` Matan Barak
     [not found]                 ` <CAAKD3BCoNmHjUvAR_SuKT_AL-823_y34QyRRV3aZ=T8cw9F9gw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-19 18:26                   ` Jason Gunthorpe
2015-10-18  7:49           ` Matan Barak
     [not found]             ` <CAAKD3BBEfKTHPKyoTzMW3YESKJmGkcUkui=hjhsbyFRY+xDDEw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-10-19 12:23               ` Doug Ledford
     [not found]                 ` <5624E0AE.8050702-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-10-19 14:20                   ` Matan Barak [this message]
     [not found]                     ` <5624FC13.1090200-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-10-19 15:27                       ` Doug Ledford
     [not found]                         ` <56250BD6.2050503-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-10-20 14:50                           ` Matan Barak
     [not found]                             ` <562654B6.8090501-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-10-20 16:52                               ` Doug Ledford
2015-10-20 20:17       ` Doug Ledford
     [not found]         ` <5626A15E.7080800-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-11-16 13:17           ` Matan Barak
2015-10-15 16:27   ` [PATCH rdma-cm] IB/core: Fix memory corruption in ib_cache_gid_set_default_gid Doug Ledford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5624FC13.1090200@mellanox.com \
    --to=matanb-vpraknaxozvwk0htik3j/w@public.gmane.org \
    --cc=dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=eranbe-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
    --cc=jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org \
    --cc=linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.