From: Jiayuan Chen
To: bpf@vger.kernel.org
Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, "David S. Miller",
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Alexei Starovoitov, Daniel Borkmann, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH bpf-next v2] bpf, sockmap: Fix psock incorrectly pointing to sk
Date: Mon, 9 Jun 2025 10:59:08 +0800
Message-ID: <20250609025908.79331-1-jiayuan.chen@linux.dev>

We observed an issue in the latest selftest, sockmap_redir, where
sk_psock(psock->sk) != psock in the backlog. The root cause is the special
behavior of sockmap_redir: it frequently performs map_update() and
map_delete() on the same socket. During map_update(), we create a new
psock, and during map_delete(), we eventually free the psock via rcu_work
in sk_psock_drop(). However, pending work items might still be queued and
not yet processed. If users immediately perform another map_update(), a
new psock will be allocated for the same sk, resulting in two psocks
pointing to the same sk.

When the pending work item later runs, it uses the old psock to access sk
for I/O operations, which is incorrect.

Timing Diagram:

cpu0                        cpu1

map_update(sk):
    sk->psock = psock1
    psock1->sk = sk
map_delete(sk):
    rcu_work_free(psock1)

map_update(sk):
    sk->psock = psock2
    psock2->sk = sk
                            workqueue:
                                wakeup with psock1, but the sk of psock1
                                doesn't belong to psock1
rcu_handler:
    clean psock1
    free(psock1)

Previously, we used reference counting to address the concurrency issue
between backlog and sock_map_close(). This logic remains necessary as it
prevents the sk from being freed while processing the backlog. But this
patch prevents pending backlogs from using a psock after it has been
stopped.

Note: We cannot call cancel_delayed_work_sync() in map_delete() since this
might be invoked in BPF context by BPF helper, and the function may sleep.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Jiayuan Chen

---
V1->V2: Use existing flag instead of adding new one.
https://lore.kernel.org/bpf/20250605142448.3llri3w7wbclfxwc@gmail.com/

Thanks to Michal Luczaj for providing the sockmap_redir test case, which
indeed covers almost all sockmap forwarding paths.
---
 net/core/skmsg.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 34c51eb1a14f..83c78379932e 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -656,6 +656,13 @@ static void sk_psock_backlog(struct work_struct *work)
 	bool ingress;
 	int ret;
 
+	/* If sk is quickly removed from the map and then added back, the old
+	 * psock should not be scheduled, because there are now two psocks
+	 * pointing to the same sk.
+	 */
+	if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+		return;
+
 	/* Increment the psock refcnt to synchronize with close(fd) path in
 	 * sock_map_close(), ensuring we wait for backlog thread completion
 	 * before sk_socket freed. If refcnt increment fails, it indicates
-- 
2.47.1
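
For readers who want to step through the timing diagram outside the kernel,
the following is a minimal single-threaded user-space sketch, not kernel
code. Every name in it (struct model_sk, struct model_psock,
model_map_update(), model_map_delete(), model_backlog(),
replay_timing_diagram()) is hypothetical and exists only for illustration.
It assumes that the delete path clears the TX-enabled state when it stops
the psock, which is what the added check relies on, and it replays the
events in the order shown in the diagram rather than reproducing a real
race.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these types are kernel API. */
struct model_psock;

struct model_sk {
	struct model_psock *psock;	/* stands in for sk_psock(sk) */
	int io_count;			/* I/O operations performed on this sk */
};

struct model_psock {
	struct model_sk *sk;		/* stands in for psock->sk */
	bool tx_enabled;		/* stands in for SK_PSOCK_TX_ENABLED */
	bool work_pending;		/* a backlog work item is still queued */
};

/* Models map_update(): attach a fresh psock to sk. */
static void model_map_update(struct model_sk *sk, struct model_psock *psock)
{
	psock->sk = sk;
	psock->tx_enabled = true;
	psock->work_pending = false;
	sk->psock = psock;
}

/* Models map_delete(): stop and detach the psock. Pending work is left
 * queued, because this path is assumed to be unable to sleep and cancel it.
 */
static void model_map_delete(struct model_sk *sk)
{
	struct model_psock *psock = sk->psock;

	assert(psock != NULL);
	psock->tx_enabled = false;
	sk->psock = NULL;
}

/* Models the backlog worker. Returns true if it performed I/O on psock->sk. */
static bool model_backlog(struct model_psock *psock, bool with_guard)
{
	psock->work_pending = false;
	if (with_guard && !psock->tx_enabled)
		return false;		/* stopped psock: bail out early */
	psock->sk->io_count++;		/* I/O through a possibly stale psock */
	return true;
}

/* Replays the diagram and returns how many I/O operations reached sk
 * through the stale psock1.
 */
static int replay_timing_diagram(bool with_guard)
{
	struct model_sk sk = { NULL, 0 };
	struct model_psock psock1, psock2;

	model_map_update(&sk, &psock1);		/* sk->psock = psock1 */
	psock1.work_pending = true;		/* backlog work queued */
	model_map_delete(&sk);			/* psock1 stopped, free deferred */
	model_map_update(&sk, &psock2);		/* sk->psock = psock2 */

	/* The queued work item for psock1 finally runs on the other CPU. */
	model_backlog(&psock1, with_guard);
	return sk.io_count;
}
```

In this model, replay_timing_diagram(false) returns 1, meaning the stale
psock1 touched an sk that now belongs to psock2, while
replay_timing_diagram(true) returns 0 because the worker returns before
doing any I/O.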