From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from out-181.mta0.migadu.com (out-181.mta0.migadu.com [91.218.175.181])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6B6E581741
	for ; Thu, 22 May 2025 19:25:18 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.181
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1747941924; cv=none;
	b=evRvoDVXG2IrBjK/xMJZAlmZdpvxUHhAR5btoOOEt+0y366nd6GlQkCzEahEj/FxSE6orGRNFNKBPvUNOGdN7vRTq0hJJTZlOGzSWYtjrbf+cU9M0cAG322OpntkiiBltCR/H1mLhqHT56FK8BJsyEYZ9wRy8gxXREXV/ofA0lU=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1747941924; c=relaxed/simple;
	bh=FYTA2YGOVavSHrzNFs6r/5f2tb2plMnbO0mKPNeYSOI=;
	h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From:
	 In-Reply-To:Content-Type;
	b=PXXUGx5PAYQ9zR4It0OOv2uaaVl3rkUI2VVB4Uf2o9fFTeEgTp1JA+eoIkmYsSmsHE3VoO3pF3/K/OgEkhPLmvValsCJHRfY/jGxByIp5RfnkLVG8wWMGRboH9dskOfPpeyL3OevFMFpFOFm/slrir6JYjNuyU21aW/mh5h4HNM=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org;
	dmarc=pass (p=none dis=none) header.from=linux.dev;
	spf=pass smtp.mailfrom=linux.dev;
	dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=lpY0aPgv;
	arc=none smtp.client-ip=91.218.175.181
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev
Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="lpY0aPgv"
Message-ID: <3eb50302-d90c-4477-b296-f5f29a7d1eca@linux.dev>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1747941916;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=fjZ8wY76miqnH4YT71rqtvPLrVN+UZp1HnTbatuwpPM=;
	b=lpY0aPgvIDFd9ZPcgNsY1y8pPpUUptnJ8MPIYDaPDjhCMAOzLa4wlZY0g6dDIf/YyGr1oN
	/qZAi6qOivPaJk0eZinfrZQBG56+VRPGwuj5PxgwIlmI9ktbExsHuvznWwFDZ4kKpSVUQM
	OEtZV5BwK83Wc06kpk45vNpM5GCij4Y=
Date: Thu, 22 May 2025 12:25:10 -0700
Precedence: bulk
X-Mailing-List: bpf@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Subject: Re: [PATCH bpf-next v6] bpf, sockmap: avoid using sk_socket after
 free when sending
To: Jiayuan Chen
Cc: bpf@vger.kernel.org, Michal Luczaj, John Fastabend, Jakub Sitnicki,
 "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
 Simon Horman, Thadeu Lima de Souza Cascardo, netdev@vger.kernel.org,
 linux-kernel@vger.kernel.org
References: <20250516141713.291150-1-jiayuan.chen@linux.dev>
Content-Language: en-US
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Martin KaFai Lau
In-Reply-To: <20250516141713.291150-1-jiayuan.chen@linux.dev>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Migadu-Flow: FLOW_OUT

On 5/16/25 7:17 AM, Jiayuan Chen wrote:
> The sk->sk_socket is not locked or referenced in backlog thread, and
> during the call to skb_send_sock(), there is a race condition with
> the release of sk_socket. All types of sockets(tcp/udp/unix/vsock)
> will be affected.
>
> Race conditions:
> '''
> CPU0                               CPU1
>
> backlog::skb_send_sock
>   sendmsg_unlocked
>     sock_sendmsg
>       sock_sendmsg_nosec
>                                    close(fd):
>                                      ...
>                                      ops->release() -> sock_map_close()
>                                      sk_socket->ops = NULL
>                                      free(socket)
>       sock->ops->sendmsg
>             ^
>             panic here
> '''
>
> The ref of psock become 0 after sock_map_close() executed.
> '''
> void sock_map_close()
> {
>     ...
>     if (likely(psock)) {
>     ...
>     // !! here we remove psock and the ref of psock become 0
>     sock_map_remove_links(sk, psock)
>     psock = sk_psock_get(sk);
>     if (unlikely(!psock))
>         goto no_psock; <=== Control jumps here via goto
>         ...
>         cancel_delayed_work_sync(&psock->work); <=== not executed
>         sk_psock_put(sk, psock);
>         ...
> }
> '''
>
> Based on the fact that we already wait for the workqueue to finish in
> sock_map_close() if psock is held, we simply increase the psock
> reference count to avoid race conditions.
>
> With this patch, if the backlog thread is running, sock_map_close() will
> wait for the backlog thread to complete and cancel all pending work.
>
> If no backlog running, any pending work that hasn't started by then will
> fail when invoked by sk_psock_get(), as the psock reference count have
> been zeroed, and sk_psock_drop() will cancel all jobs via
> cancel_delayed_work_sync().
>
> In summary, we require synchronization to coordinate the backlog thread
> and close() thread.
>
> The panic I catched:
> '''
> Workqueue: events sk_psock_backlog
> RIP: 0010:sock_sendmsg+0x21d/0x440
> RAX: 0000000000000000 RBX: ffffc9000521fad8 RCX: 0000000000000001
> ...
> Call Trace:
>
>  ? die_addr+0x40/0xa0
>  ? exc_general_protection+0x14c/0x230
>  ? asm_exc_general_protection+0x26/0x30
>  ? sock_sendmsg+0x21d/0x440
>  ? sock_sendmsg+0x3e0/0x440
>  ? __pfx_sock_sendmsg+0x10/0x10
>  __skb_send_sock+0x543/0xb70
>  sk_psock_backlog+0x247/0xb80
> ...
> '''
>
> Reported-by: Michal Luczaj
> Fixes: 4b4647add7d3 ("sock_map: avoid race between sock_map_close and sk_psock_put")
> Signed-off-by: Jiayuan Chen
>
> ---
> V5 -> V6: Use correct "Fixes" tag.
> V4 -> V5:
> This patch is extracted from my previous v4 patchset that contained
> multiple fixes, and it remains unchanged. Since this fix is relatively
> simple and easy to review, we want to separate it from other fixes to
> avoid any potential interference.
> ---
>  net/core/skmsg.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 276934673066..34c51eb1a14f 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -656,6 +656,13 @@ static void sk_psock_backlog(struct work_struct *work)
>  	bool ingress;
>  	int ret;
>
> +	/* Increment the psock refcnt to synchronize with close(fd) path in
> +	 * sock_map_close(), ensuring we wait for backlog thread completion
> +	 * before sk_socket freed. If refcnt increment fails, it indicates
> +	 * sock_map_close() completed with sk_socket potentially already freed.
> +	 */
> +	if (!sk_psock_get(psock->sk))

This seems to be the first use case to pass "psock->sk" to "sk_psock_get()".

I could have missed the sock_map details here. Considering it is racing with
sock_map_close() which should also do a sock_put(sk) [?], could you help to
explain what makes it safe to access the psock->sk here?

> +		return;
>  	mutex_lock(&psock->work_mutex);
>  	while ((skb = skb_peek(&psock->ingress_skb))) {
>  		len = skb->len;
> @@ -708,6 +715,7 @@ static void sk_psock_backlog(struct work_struct *work)
>  	}
>  end:
>  	mutex_unlock(&psock->work_mutex);
> +	sk_psock_put(psock->sk, psock);
>  }
>
>  struct sk_psock *sk_psock_init(struct sock *sk, int node)
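
For readers following the thread, the get-at-entry, put-at-exit guard that
the patch adds to sk_psock_backlog() can be modeled in userspace roughly as
below. This is only a sketch under stated assumptions: struct obj,
obj_get_unless_zero(), obj_put() and worker() are hypothetical names, and C11
atomics stand in for the kernel's refcount helpers. It models the counter
alone and assumes the memory holding the counter is still valid when the
worker runs, which is the very point the question above raises about
psock->sk, so it does not settle that question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object guarded by a reference count. */
struct obj {
	atomic_int refcnt;	/* 0 means teardown has already begun */
	int work_done;		/* how many times the worker body ran */
};

/* Take a reference only if the count is still non-zero. */
static bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcnt, 1);

	assert(old > 0);
	return old == 1;
}

/* Worker body: bail out early when no reference can be taken. */
static bool worker(struct obj *o)
{
	if (!obj_get_unless_zero(o))
		return false;	/* close path already dropped the last ref */
	o->work_done++;		/* stands in for the send loop */
	obj_put(o);
	return true;
}
```

While the close path still holds a reference, worker() runs its body; once
the last reference has been dropped, worker() returns without touching the
guarded state.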