From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ying Xue Subject: Re: [PATCH v2] net: fix a double free issue for neighbour entry Date: Wed, 20 May 2015 15:01:18 +0800 Message-ID: <555C313E.1020003@windriver.com> References: <1432085113-25063-1-git-send-email-ying.xue@windriver.com> <1432099637.4060.12.camel@edumazet-glaptop2.roam.corp.google.com> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Cc: , , , , To: Eric Dumazet Return-path: Received: from mail1.windriver.com ([147.11.146.13]:40246 "EHLO mail1.windriver.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752350AbbETHBp (ORCPT ); Wed, 20 May 2015 03:01:45 -0400 In-Reply-To: <1432099637.4060.12.camel@edumazet-glaptop2.roam.corp.google.com> Sender: netdev-owner@vger.kernel.org List-ID: On 05/20/2015 01:27 PM, Eric Dumazet wrote: > Sorry, this atomic_read() makes no sense to me. > > When rcu is used, this is perfectly fine to use an object which refcount > is 0. If you believe the opposite, then point me to relevant > Documentation/RCU/ file. > However, the reality for us is that rcu_read_lock() can guarantee that a neigh object is not freed within the area covered by rcu read lock, but it cannot prevent neigh_destroy() from being executed in another context at the same time. > refcount should only protect the object itself, not the use of it by a > rcu reader. This has nothing to do with rcu but standard refcounting of > objects. > > The only forbidden thing would be to try to take a reference on it with > atomic_inc() instead of the more careful atomic_inc_not_zero(), if the > caller needs to exit the rcu_read_lock() section safely (as explained in > Documentation/RCU/rcuref.txt around line 52) > > So far, dst_neigh_output() will not try to take a refcnt. > Please let us take a look at the following log generated by the Joern Engel written debug patch: [ 4974.816012] [] dump_stack+0x19/0x1b [ 4974.816017] [] warn_slowpath_common+0x61/0x80 [ 4974.816019] [] warn_slowpath_null+0x1a/0x20 [ 4974.816021] [] neigh_hold.part.26+0x1e/0x27 [ 4974.816027] [] neigh_add_timer+0x3c/0x60 [ 4974.816029] [] __neigh_event_send+0xfb/0x220 [ 4974.816031] [] neigh_resolve_output+0x13b/0x220 [ 4974.816038] [] ip_finish_output+0x1b0/0x3b0 [ 4974.816040] [] ip_output+0x58/0x90 [ 4974.816042] [] ip_local_out+0x25/0x30 [ 4974.816045] [] ip_queue_xmit+0x137/0x380 [ 4974.816051] [] tcp_transmit_skb+0x445/0x8a0 [ 4974.816054] [] tcp_write_xmit+0x140/0xb00 [ 4974.816058] [] __tcp_push_pending_frames+0x2e/0xc0 [ 4974.816062] [] tcp_sendmsg+0x118/0xd90 [ 4974.816070] [] ? debug_object_deactivate+0x115/0x170 [ 4974.816076] [] inet_sendmsg+0x64/0xb0 [ 4974.816080] [] sock_sendmsg+0x76/0x90 [ 4974.816086] [] ? local_bh_enable_ip+0x89/0xf0 Above stack trace log tells us the following call chain happens: ip_finish_output() ->ip_finish_output2() ->dst_neigh_output() ->neigh_resolve_output() ->__neigh_event_send() ->neigh_add_timer() ->neigh_hold() Moreover, the below debug code is added in Joern Engel written debug patch : +static inline void neigh_hold(struct neighbour *neigh) +{ + WARN_ON_ONCE(atomic_inc_return(&neigh->refcnt) == 1); +} So, we can identify above stack trace was printed by WARN_ON_ONCE(), which also proves that dst_neigh_output() not only tries to take a refcnt of a neigh, but also it the refcnt of neigh entry that dst_neigh_output() will be touched was already decremented to zero. Please consider the below scenario which is very similar to above case met by Joern Engel: Time 1: CPU 0: neigh_release() ->neigh_destroy() //it's called as neigh's refcnt is 0 now ->kfree_rcu(neigh, rcu);//freeing neigh object is delayed after a RCU grace period Time 2: CPU 1: ip_finish_output2() rcu_read_lock_bh() dst_neigh_output() __neigh_event_send() neigh_add_timer() neigh_hold() //refcnt of neigh is increased from 0 to 1; rcu_read_unlock_bh() Time 3: CPU 0: On RCU_SOFTIRQ context: rcu_process_callbacks() neigh object is really freed! Time 4: CPU 0: Release the neigh whose refcnt was increased from 0 to zero. neigh_release() is called by someone. As the neigh is already freed, panic happens!!! This is why I initially added the condition statement in ip_finish_output2() to check whether the refcnt of neigh returned by __ipv4_neigh_lookup_noref() is zero or not with atomic_read(). But as you pointed out, atomic_read() was not safe for us, but we cannot use other atomic routines with implicit memory barriers as they all increase or decease refcnt while checking refcnt is 0, meanwhile, we cannot use explicit memory barrier as it seems too expensive for the hot path. So I try to add another condition to prevent neigh_add_timer() called by __neigh_event_send() from being take a refcnt which is already zero. > > By the time you check atomic_read() the result is meaningless if another > cpu decrements the refcount to 0. > If you think checking if refcnt is zero is meaningless in ip_finish_output2(), why does __ipv4_neigh_lookup() use atomic_inc_not_zero() instead of directly using neigh_hold() in __ipv4_neigh_lookup()? Regards, Ying > So what is the point, trying to reduce a race window but not properly > remove it ? > > I repeat again : using atomic_read() in a rcu lookup is absolutely > useless and is a clear sign you missed a fundamental issue. > > So now, we are back to the patch I initially sent, but you did not > address Julian Anastasov feedback.