From mboxrd@z Thu Jan  1 00:00:00 1970
From: Ying Xue <ying.xue@windriver.com>
Subject: Re: [PATCH v2] net: fix a double free issue for neighbour entry
Date: Wed, 20 May 2015 15:01:18 +0800
Message-ID: <555C313E.1020003@windriver.com>
References: <1432085113-25063-1-git-send-email-ying.xue@windriver.com> <1432099637.4060.12.camel@edumazet-glaptop2.roam.corp.google.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
Cc: <netdev@vger.kernel.org>, <davem@davemloft.net>,
	<alexei@purestorage.com>, <joern@purestorage.com>, <ja@ssi.bg>
To: Eric Dumazet <eric.dumazet@gmail.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail1.windriver.com ([147.11.146.13]:40246 "EHLO
	mail1.windriver.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752350AbbETHBp (ORCPT
	<rfc822;netdev@vger.kernel.org>); Wed, 20 May 2015 03:01:45 -0400
In-Reply-To: <1432099637.4060.12.camel@edumazet-glaptop2.roam.corp.google.com>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 05/20/2015 01:27 PM, Eric Dumazet wrote:
> Sorry, this atomic_read() makes no sense to me.
> 
> When rcu is used, this is perfectly fine to use an object which refcount
> is 0. If you believe the opposite, then point me to relevant
> Documentation/RCU/ file.
> 

However, the reality for us is that rcu_read_lock() can guarantee that a neigh
object is not freed within the area covered by rcu read lock, but it cannot
prevent neigh_destroy() from being executed in another context at the same time.

> refcount should only protect the object itself, not the use of it by a
> rcu reader. This has nothing to do with rcu but standard refcounting of
> objects.
> 
> The only forbidden thing would be to try to take a reference on it with
> atomic_inc() instead of the more careful atomic_inc_not_zero(), if the
> caller needs to exit the rcu_read_lock() section safely (as explained in
> Documentation/RCU/rcuref.txt around line 52)
> 
> So far, dst_neigh_output() will not try to take a refcnt.
> 

Please let us take a look at the following log generated by the Joern Engel
written debug patch:

[ 4974.816012]  [<ffffffff8163baae>] dump_stack+0x19/0x1b
[ 4974.816017]  [<ffffffff8103f141>] warn_slowpath_common+0x61/0x80
[ 4974.816019]  [<ffffffff8103f21a>] warn_slowpath_null+0x1a/0x20
[ 4974.816021]  [<ffffffff8163de95>] neigh_hold.part.26+0x1e/0x27
[ 4974.816027]  [<ffffffff8155f08c>] neigh_add_timer+0x3c/0x60
[ 4974.816029]  [<ffffffff8155f1ab>] __neigh_event_send+0xfb/0x220
[ 4974.816031]  [<ffffffff8155f40b>] neigh_resolve_output+0x13b/0x220
[ 4974.816038]  [<ffffffff8158c950>] ip_finish_output+0x1b0/0x3b0
[ 4974.816040]  [<ffffffff8158ddd8>] ip_output+0x58/0x90
[ 4974.816042]  [<ffffffff8158d5a5>] ip_local_out+0x25/0x30
[ 4974.816045]  [<ffffffff8158d8f7>] ip_queue_xmit+0x137/0x380
[ 4974.816051]  [<ffffffff815a47a5>] tcp_transmit_skb+0x445/0x8a0
[ 4974.816054]  [<ffffffff815a4d40>] tcp_write_xmit+0x140/0xb00
[ 4974.816058]  [<ffffffff815a59ae>] __tcp_push_pending_frames+0x2e/0xc0
[ 4974.816062]  [<ffffffff81597198>] tcp_sendmsg+0x118/0xd90
[ 4974.816070]  [<ffffffff81278b55>] ? debug_object_deactivate+0x115/0x170
[ 4974.816076]  [<ffffffff815bf434>] inet_sendmsg+0x64/0xb0
[ 4974.816080]  [<ffffffff8153da56>] sock_sendmsg+0x76/0x90
[ 4974.816086]  [<ffffffff81046e89>] ? local_bh_enable_ip+0x89/0xf0

Above stack trace log tells us the following call chain happens:
ip_finish_output()
  ->ip_finish_output2()
    ->dst_neigh_output()
      ->neigh_resolve_output()
        ->__neigh_event_send()
          ->neigh_add_timer()
            ->neigh_hold()

Moreover, the below debug code is added in Joern Engel written debug patch :
+static inline void neigh_hold(struct neighbour *neigh)
+{
+	WARN_ON_ONCE(atomic_inc_return(&neigh->refcnt) == 1);
+}

So, we can identify above stack trace was printed by WARN_ON_ONCE(), which also
proves that dst_neigh_output() not only tries to take a refcnt of a neigh, but
also it the refcnt of neigh entry that dst_neigh_output() will be touched was
already decremented to zero.

Please consider the below scenario which is very similar to above case met by
Joern Engel:

Time 1:
CPU 0:
neigh_release()
  ->neigh_destroy() //it's called as neigh's refcnt is 0 now
    ->kfree_rcu(neigh, rcu);//freeing neigh object is delayed after a RCU
grace period

Time 2:
CPU 1:
ip_finish_output2()
  rcu_read_lock_bh()
  dst_neigh_output()
   __neigh_event_send()
    neigh_add_timer()
      neigh_hold() //refcnt of neigh is increased from 0 to 1;
     rcu_read_unlock_bh()

Time 3:
CPU 0:
On RCU_SOFTIRQ context:
rcu_process_callbacks()
  neigh object is really freed!

Time 4:
CPU 0:
Release the neigh whose refcnt was increased from 0 to zero.
neigh_release() is called by someone. As the neigh is already freed, panic
happens!!!

This is why I initially added the condition statement in ip_finish_output2() to
check whether the refcnt of neigh returned by __ipv4_neigh_lookup_noref() is
zero or not with atomic_read(). But as you pointed out, atomic_read() was not
safe for us, but we cannot use other atomic routines with implicit memory
barriers as they all increase or decease refcnt while checking refcnt is 0,
meanwhile, we cannot use explicit memory barrier as it seems too expensive for
the hot path. So I try to add another condition to prevent neigh_add_timer()
called by __neigh_event_send() from being take a refcnt which is already zero.

> 
> By the time you check atomic_read() the result is meaningless if another
> cpu decrements the refcount to 0.
> 

If you think checking if refcnt is zero is meaningless in ip_finish_output2(),
why does __ipv4_neigh_lookup() use atomic_inc_not_zero() instead of directly
using neigh_hold() in __ipv4_neigh_lookup()?

Regards,
Ying

> So what is the point, trying to reduce a race window but not properly
> remove it ?
> 
> I repeat again : using atomic_read() in a rcu lookup is absolutely
> useless and is a clear sign you missed a fundamental issue. 
> 
> So now, we are back to the patch I initially sent, but you did not
> address Julian Anastasov feedback.