From mboxrd@z Thu Jan 1 00:00:00 1970 From: stranche@codeaurora.org Subject: Re: [PATCH net] af_key: free SKBs under RCU protection Date: Fri, 21 Sep 2018 11:09:05 -0600 Message-ID: References: <1537402712-12875-1-git-send-email-stranche@codeaurora.org> <6d194ac2-15e8-76d7-31d0-b4c4eed68d5a@gmail.com> <357e28c3fa0c7bacaffde4e960f58a87@codeaurora.org> <82b2ac27-0f61-3be6-b09e-ec31cf95881f@gmail.com> <6f2e6b62-42dc-f2b3-3758-29f5338f1185@gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org, steffen.klassert@secunet.com To: Eric Dumazet Return-path: Received: from smtp.codeaurora.org ([198.145.29.96]:49684 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1732114AbeIUW6y (ORCPT ); Fri, 21 Sep 2018 18:58:54 -0400 In-Reply-To: <6f2e6b62-42dc-f2b3-3758-29f5338f1185@gmail.com> Sender: netdev-owner@vger.kernel.org List-ID: >> >> As long as one skb has sock_rfree has its destructor, the socket >> attached to >> this skb can not be released. There is no race here. >> >> Note that skb_clone() does not propagate the destructor. >> >> The issue here is that in the rcu lookup, we can find a socket that >> has been >> dismantled, with a 0 refcount. >> >> We must not use sock_hold() in this case, since we are not sure the >> socket refcount is not already 0. >> >> pfkey_broadcast() and pfkey_broadcast_one() violate basic RCU rules. >> >> When in a RCU lookup, one want to increment an object refcount, it >> needs >> to be extra-careful, as I did in my proposal. >> >> Note that the race could be automatically detected with >> CONFIG_REFCOUNT_FULL=y > > Bug was added in commit 7f6b9dbd5afb ("af_key: locking change") Hi Eric, I tried your refcount idea below, but it still results in the same crash. >>>> --- a/net/key/af_key.c >>>> +++ b/net/key/af_key.c >>>> @@ -201,7 +201,9 @@ static int pfkey_broadcast_one(struct sk_buff >>>> *skb, struct sk_buff **skb2, >>>> { >>>> int err = -ENOBUFS; >>>> >>>> - sock_hold(sk); >>>> + if (!refcount_inc_not_zero(&sk->sk_refcnt)) >>>> + return -ENOENT; >>>> + >>>> if (*skb2 == NULL) { >>>> if (refcount_read(&skb->users) != 1) { >>>> *skb2 = skb_clone(skb, allocation); I also tried reverting 7f6b9dbd5afb ("af_key: locking change") and running the test there and I still see the crash, so it doesn't seem to be an RCU specific issue. Is there anything else that could be causing this?