From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-173.mta1.migadu.com (out-173.mta1.migadu.com [95.215.58.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2A2C32BD014 for ; Thu, 22 May 2025 22:57:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.173 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747954637; cv=none; b=Z5Viuikltll6kWCYZd33VNLW++FgqXnRKZ/oQCHqa+5SLT9rwoUZHq22YE7XqtViW1IXEDEA3Wj+YqkWOFWMaKmUYm2ir2wUJgkcofysbYfRdH/TMPPq6/Pa/gvmpc1j1Sea/G9FzMDFGui6t5EIGz2mxJ1z57UbegU4phb4vE8= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1747954637; c=relaxed/simple; bh=ShpH2Lrz40L9DOsJ3UQSgxelkKBLo/YVsgqN9p+w034=; h=MIME-Version:Date:Content-Type:From:Message-ID:Subject:To:Cc: In-Reply-To:References; b=Ka6Rpvq4u4ehQDX1xP3G5+KTU8UmTajSE3dr7IRDd1ubQ/oDGQg6kOUqqDSORyhsIBvUbkSpnj6vJWjzKsxAM0cs9/65aRVExZoYdrseo9fzTapkXZ9S1JWEoEiJIAkJTZlXuvX34/ojCIdZVNSOwm6ARn3Y55RM5k8NsBt6tSE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=Hvwi6Dhm; arc=none smtp.client-ip=95.215.58.173 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="Hvwi6Dhm" Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1747954618; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=XJDADEXuHfqfTmAHEHNUhMdj/6JudXWWJl8mcFP32/s=; b=Hvwi6Dhm6PdUO0889euxLcZpJjDxz7rpCSpB/RGawHif/VbiSsxwSQPXU9mg+YDp0qhrDq Lal+XtseX+emD5Ba/1IN93xFlGK26cyLKSGBWYC4xH3cdiWXUVOCWw2Idh0tYxoOr5xga5 9/fTCY8YHHZEmZZwJoGjUHOO68apVPM= Date: Thu, 22 May 2025 22:56:52 +0000 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: "Jiayuan Chen" Message-ID: <2c8ab490e47d44ef5250ac755a5388fe147345d4@linux.dev> TLS-Required: No Subject: Re: [PATCH bpf-next v6] bpf, sockmap: avoid using sk_socket after free when sending To: "Martin KaFai Lau" Cc: bpf@vger.kernel.org, "Michal Luczaj" , "John Fastabend" , "Jakub Sitnicki" , "David S. Miller" , "Eric Dumazet" , "Jakub Kicinski" , "Paolo Abeni" , "Simon Horman" , "Thadeu Lima de Souza Cascardo" , netdev@vger.kernel.org, linux-kernel@vger.kernel.org In-Reply-To: <3eb50302-d90c-4477-b296-f5f29a7d1eca@linux.dev> References: <20250516141713.291150-1-jiayuan.chen@linux.dev> <3eb50302-d90c-4477-b296-f5f29a7d1eca@linux.dev> X-Migadu-Flow: FLOW_OUT 2025/5/23 03:25, "Martin KaFai Lau" wrote: >=20 >=20On 5/16/25 7:17 AM, Jiayuan Chen wrote: >=20 >=20>=20 >=20> The sk->sk_socket is not locked or referenced in backlog thread, an= d > >=20 >=20> during the call to skb_send_sock(), there is a race condition with > >=20 >=20> the release of sk_socket. All types of sockets(tcp/udp/unix/vsock) > >=20 >=20> will be affected. > >=20 >=20> Race conditions: > >=20 >=20> ''' > >=20 >=20> CPU0 CPU1 > >=20 >=20> backlog::skb_send_sock > >=20 >=20> sendmsg_unlocked > >=20 >=20> sock_sendmsg > >=20 >=20> sock_sendmsg_nosec > >=20 >=20> close(fd): > >=20 >=20> ... > >=20 >=20> ops->release() -> sock_map_close() > >=20 >=20> sk_socket->ops =3D NULL > >=20 >=20> free(socket) > >=20 >=20> sock->ops->sendmsg > >=20 >=20> ^ > >=20 >=20> panic here > >=20 >=20> ''' > >=20 >=20> The ref of psock become 0 after sock_map_close() executed. > >=20 >=20> ''' > >=20 >=20> void sock_map_close() > >=20 >=20> { > >=20 >=20> ... > >=20 >=20> if (likely(psock)) { > >=20 >=20> ... > >=20 >=20> // !! here we remove psock and the ref of psock become 0 > >=20 >=20> sock_map_remove_links(sk, psock) > >=20 >=20> psock =3D sk_psock_get(sk); > >=20 >=20> if (unlikely(!psock)) > >=20 >=20> goto no_psock; <=3D=3D=3D Control jumps here via goto > >=20 >=20> ... > >=20 >=20> cancel_delayed_work_sync(&psock->work); <=3D=3D=3D not executed > >=20 >=20> sk_psock_put(sk, psock); > >=20 >=20> ... > >=20 >=20> } > >=20 >=20> ''' > >=20 >=20> Based on the fact that we already wait for the workqueue to finish= in > >=20 >=20> sock_map_close() if psock is held, we simply increase the psock > >=20 >=20> reference count to avoid race conditions. > >=20 >=20> With this patch, if the backlog thread is running, sock_map_close(= ) will > >=20 >=20> wait for the backlog thread to complete and cancel all pending wor= k. > >=20 >=20> If no backlog running, any pending work that hasn't started by the= n will > >=20 >=20> fail when invoked by sk_psock_get(), as the psock reference count = have > >=20 >=20> been zeroed, and sk_psock_drop() will cancel all jobs via > >=20 >=20> cancel_delayed_work_sync(). > >=20 >=20> In summary, we require synchronization to coordinate the backlog t= hread > >=20 >=20> and close() thread. > >=20 >=20> The panic I catched: > >=20 >=20> ''' > >=20 >=20> Workqueue: events sk_psock_backlog > >=20 >=20> RIP: 0010:sock_sendmsg+0x21d/0x440 > >=20 >=20> RAX: 0000000000000000 RBX: ffffc9000521fad8 RCX: 0000000000000001 > >=20 >=20> ... > >=20 >=20> Call Trace: > >=20 >=20> > >=20 >=20> ? die_addr+0x40/0xa0 > >=20 >=20> ? exc_general_protection+0x14c/0x230 > >=20 >=20> ? asm_exc_general_protection+0x26/0x30 > >=20 >=20> ? sock_sendmsg+0x21d/0x440 > >=20 >=20> ? sock_sendmsg+0x3e0/0x440 > >=20 >=20> ? __pfx_sock_sendmsg+0x10/0x10 > >=20 >=20> __skb_send_sock+0x543/0xb70 > >=20 >=20> sk_psock_backlog+0x247/0xb80 > >=20 >=20> ... > >=20 >=20> ''' > >=20 >=20> Reported-by: Michal Luczaj > >=20 >=20> Fixes: 4b4647add7d3 ("sock_map: avoid race between sock_map_close = and sk_psock_put") > >=20 >=20> Signed-off-by: Jiayuan Chen > >=20 >=20> --- > >=20 >=20> V5 -> V6: Use correct "Fixes" tag. > >=20 >=20> V4 -> V5: > >=20 >=20> This patch is extracted from my previous v4 patchset that containe= d > >=20 >=20> multiple fixes, and it remains unchanged. Since this fix is relati= vely > >=20 >=20> simple and easy to review, we want to separate it from other fixes= to > >=20 >=20> avoid any potential interference. > >=20 >=20> --- > >=20 >=20> net/core/skmsg.c | 8 ++++++++ > >=20 >=20> 1 file changed, 8 insertions(+) > >=20 >=20> diff --git a/net/core/skmsg.c b/net/core/skmsg.c > >=20 >=20> index 276934673066..34c51eb1a14f 100644 > >=20 >=20> --- a/net/core/skmsg.c > >=20 >=20> +++ b/net/core/skmsg.c > >=20 >=20> @@ -656,6 +656,13 @@ static void sk_psock_backlog(struct work_stru= ct *work) > >=20 >=20> bool ingress; > >=20 >=20> int ret; > >=20 >=20> > + /* Increment the psock refcnt to synchronize with close(fd) pa= th in > >=20 >=20> + * sock_map_close(), ensuring we wait for backlog thread completi= on > >=20 >=20> + * before sk_socket freed. If refcnt increment fails, it indicate= s > >=20 >=20> + * sock_map_close() completed with sk_socket potentially already = freed. > >=20 >=20> + */ > >=20 >=20> + if (!sk_psock_get(psock->sk)) > >=20 >=20 > This seems to be the first use case to pass "psock->sk" to "sk_psock_ge= t()". >=20 >=20I could have missed the sock_map details here. Considering it is raci= ng with sock_map_close() which should also do a sock_put(sk) [?], >=20 >=20could you help to explain what makes it safe to access the psock->sk = here? >=20 >=20>=20 >=20> + return; > >=20 >=20> mutex_lock(&psock->work_mutex); > >=20 >=20> while ((skb =3D skb_peek(&psock->ingress_skb))) { > >=20 >=20> len =3D skb->len; > >=20 >=20> @@ -708,6 +715,7 @@ static void sk_psock_backlog(struct work_struc= t *work) > >=20 >=20> } > >=20 >=20> end: > >=20 >=20> mutex_unlock(&psock->work_mutex); > >=20 >=20> + sk_psock_put(psock->sk, psock); > >=20 >=20> } > >=20 >=20> > struct sk_psock *sk_psock_init(struct sock *sk, int node) > > > Hi Martin, Using 'sk_psock_get(psock->sk)' in the workqueue is safe because sock_map_close() only reduces the reference count of psock to zero, while the actual memory release is fully handled by the RCU callback: sk_psock_= destroy(). In sk_psock_destroy(), we first cancel_delayed_work_sync() to wait for th= e workqueue to complete, and then perform sock_put(psock->sk). This means w= e already have an explicit synchronization mechanism in place that guarante= es safe access to both psock and psock->sk in the workqueue context. Thanks.