From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 84101342CB3 for ; Fri, 12 Jun 2026 01:33:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781227993; cv=none; b=muzphD27RzZg79Js+zWxqD8qVdwspa39CjJScFZUdlipiH6PRnqQ8/l9ZaxCOKeQMKlo6gpH3noXss3hYhSjnU+xzK6FAWfMqpmd2nWRlCFtA4LPKAtbF7DVpuyg7FbhykUMWerRwg3o9zacX6UfH43LYLcKgek0bOskvU2WO2A= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781227993; c=relaxed/simple; bh=PYtfSTiJXntGswecjzXH710Qea2ec0wFnI+TOIfNvBQ=; h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date: Message-Id; b=VpYx1wBtIz4XfvrRclzN9MzuZqHoe/NGsA9R2CFd3w+xcP/5lbAA2z9A5OQWIFPAstJbJ2R4fT9fg7kMKMkOGvdi0IbgFzqd3EKT/sq/gbi7FEPqoSARDK1BhPRR1x73/fw20Y0ENFqM40hcipMWwkrMsz7qKY5+Zmy4Gf7+SGI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=F1F2JGuG; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="F1F2JGuG" Received: by smtp.kernel.org (Postfix) with ESMTPSA id EFC6F1F000E9; Fri, 12 Jun 2026 01:33:11 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1781227992; bh=AdTCDzo3F/hQDF0bkKaIlmhm3o0x8zBqHfbi3KPh1R4=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date; b=F1F2JGuGevGIPgWCsPgS3EfiuwZD6VbJa5RGABz1MdzspfWwGI2nAA9uc5pADqIDH LIhlUttgNu45ShrodzSc9RjwcHUOtPXyAwWGLIps1EfdyqR3fmvPSa3hdg7D3wMxmd dXxGXSpxSEUskXfvlgc9lLkBCE2tBAuhMSJdw1JSi3pa3P1+AcAFfwlT2chvDaQ8qC 0fHnTPv4DhY8sK2ZzlbgoMFADDjy9Q4+zOrGojhbIGaPs4UoALIfziVxJJu5I3oF5h ismH3CjLCkt92Ur+Ts3O668MjOaksANp/AdOoo/3K/SYIoj1Pbuxa/O0w1x9W1FRUe rl85VDaU8zUdg== From: sashiko-bot@kernel.org Subject: Re: [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice Reply-To: sashiko-reviews@lists.linux.dev To: "Cong Wang" Cc: bpf@vger.kernel.org In-Reply-To: <20260612011452.134466-2-xiyou.wangcong@gmail.com> References: <20260612011452.134466-2-xiyou.wangcong@gmail.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Fri, 12 Jun 2026 01:33:11 +0000 Message-Id: <20260612013311.EFC6F1F000E9@smtp.kernel.org> Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Thank you for your contribution! Sashiko AI review found 10 potential issue= (s) to consider: - [High] Missing socket lock in spliced sendmsg/recvmsg leading to SPSC rin= g buffer corruption. - [High] Permanent memory leak of sockets and psocks due to circular refere= nces. - [High] TCP stream ordering violation due to fast path racing with in-flig= ht TCP data. - [High] Use-after-free of socket waitqueue due to missing RCU read lock in= splice_wake. - [High] RCU API Violations and lockless data races when accessing `psock->= splice`. - [High] Silent data blackhole when the receiving spliced socket has an ing= ress BPF program. - [High] Refcount leak and Use-After-Free due to ignoring the return value = of `sk_psock_get()`. - [Medium] Missing __GFP_ACCOUNT on ring buffer allocations allows memcg li= mits bypass. - [Medium] MSG_PEEK semantics broken for spliced sockets, causing hangs or = silent data omission. - [Medium] Socket errors are not cleared when fetched, leading to permanent= error state. -- commit 2f2595b5999c95fe8a664fc82201339a3716f692 Author: Cong Wang tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback spli= ce Two locally-connected TCP sockets can be paired by a sock_ops BPF program at handshake completion. Once paired, sendmsg copies the user payload into a per-direction kernel-side byte ring; recvmsg drains the ring into the user buffer. Both copies happen in their own task's mm, so no cross-mm pin / kmap dance is needed and the splice fast path incurs no skb construction, no softirq, and no TCP protocol-state processing. The TCP wire connection itself never sees the spliced bytes: sequence numbers stay frozen at post-handshake values, so FIN, RST, and keepalive continue to work through the regular paths and the pair tears down via a normal close handshake. > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index e1850caf1a71a..b39fc249a18d5 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c [ ... ] > @@ -881,12 +881,15 @@ static void sk_psock_destroy(struct work_struct *wo= rk) > sock_put(psock->sk_redir); > if (psock->sk_pair) > sock_put(psock->sk_pair); > + tcp_bpf_splice_destroy(psock); > sock_put(psock->sk); > kfree(psock); > } > =20 > void sk_psock_drop(struct sock *sk, struct sk_psock *psock) > { > + tcp_bpf_splice_unpair(psock); > + [Severity: High] Does this code create a permanent memory leak due to a circular reference? In bpf_sock_splice_pair(), both sides of the paired connection acquire a reference to the other's sk_psock via sk_psock_get(). This establishes a circular reference. Since the cross-references are only dropped in tcp_bpf_splice_unpair(), and this is only called from sk_psock_drop() which executes when a psock's reference count reaches 0, the circular reference guarantees neither refcou= nt will ever reach 0. > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c > index cc0bd73f36b6d..549f370772446 100644 > --- a/net/ipv4/tcp_bpf.c > +++ b/net/ipv4/tcp_bpf.c [ ... ] > @@ -626,8 +683,9 @@ static void tcp_bpf_rebuild_protos(struct proto prot[= TCP_BPF_NUM_CFGS], > prot[TCP_BPF_BASE] =3D *base; > prot[TCP_BPF_BASE].destroy =3D sock_map_destroy; > prot[TCP_BPF_BASE].close =3D sock_map_close; > + prot[TCP_BPF_BASE].sendmsg =3D tcp_bpf_splice_sendmsg; > prot[TCP_BPF_BASE].recvmsg =3D tcp_bpf_recvmsg; > - prot[TCP_BPF_BASE].sock_is_readable =3D sk_msg_is_readable; > + prot[TCP_BPF_BASE].sock_is_readable =3D tcp_bpf_is_readable; > prot[TCP_BPF_BASE].ioctl =3D tcp_bpf_ioctl; [Severity: High] Does this cause a silent data blackhole when the receiving spliced socket h= as an ingress BPF program? The TCP_BPF_RX protocol configuration inherits its base configuration here, so its sendmsg is assigned tcp_bpf_splice_sendmsg. However, its recvmsg is explicitly overridden to tcp_bpf_recvmsg_parser, which lacks the splice ring-buffer logic. A sender will correctly push data into the splice ring buffer, but the receiver will only poll the standard sk_receive_queue and ingress_msg queue, never reading the ring buffer. [ ... ] > +static inline void splice_wake(struct sock *sk) > +{ > + wait_queue_head_t *wq =3D sk_sleep(sk); > + > + smp_mb(); > + if (wq && waitqueue_active(wq)) > + wake_up_interruptible_all(wq); > +} [Severity: High] Is it possible for sk_sleep(sk) to access a freed waitqueue here? This function uses sk_sleep(sk) to get the socket's waitqueue, which uses rcu_dereference_raw(). Without the caller holding the RCU read lock, the memory backing the waitqueue could be concurrently freed by sock_release(). [ ... ] > +static bool sk_psock_is_spliced(const struct sk_psock *psock) > +{ > + struct sk_psock_splice *s =3D rcu_dereference(psock->splice); > + > + return s && rcu_access_pointer(s->peer); > +} [Severity: High] Is this missing an RCU read lock? sk_psock_is_spliced() calls rcu_dereference(psock->splice) from tcp_bpf_splice_sendmsg() and tcp_bpf_recvmsg() without an active rcu_read_lock(), which can trigger a lockdep warning. [ ... ] > +static int tcp_bpf_splice_sendmsg(struct sock *sk, struct msghdr *msg, > + size_t size) > +{ > + struct sk_psock *psock; > + int spliced =3D 0; > + int ret; [Severity: High] Could the lack of socket locking here lead to SPSC ring buffer corruption? tcp_bpf_splice_sendmsg() bypasses the standard socket lock (lock_sock(sk)). Since userspace can call sendmsg or recvmsg from multiple threads concurren= tly on the same socket descriptor, multiple producers or consumers will race when accessing the SPSC ring buffer logic in splice_ring_write() and tcp_bpf_splice_recvmsg(). This can lead to non-atomic cursor updates and memory corruption. [ ... ] > +void tcp_bpf_splice_unpair(struct sk_psock *psock) > +{ > + struct sk_psock_splice *self_s, *peer_s; > + struct sk_psock *peer; > + bool was_paired =3D false; > + > + self_s =3D rcu_dereference_protected(psock->splice, 1); > + if (!self_s) > + return; > + > + rcu_read_lock(); > + peer =3D rcu_dereference(self_s->peer); > + if (!peer) { > + rcu_read_unlock(); > + return; > + } > + if (!sk_psock_get(peer->sk)) { [Severity: High] Does ignoring the return value of sk_psock_get() here cause a refcount leak and use-after-free? If a socket's psock is concurrently replaced, sk_psock_get() returns and increments the refcount of the new psock. Because this code continues using the original peer pointer to interact with and ultimately drop the refcount (via sk_psock_put()), it creates an unbalanced increment on the new psock (memory leak) and an unbalanced decrement on the old psock (use-after-free). [ ... ] > + rcu_read_unlock(); [ ... ] > + /* Wake any blocked rendezvous waiters on either side. They will > + * re-check the predicate, see splice->peer =3D=3D NULL, and exit. > + */ > + splice_wake(psock->sk); > + splice_wake(peer->sk); [Severity: High] Is it safe to call splice_wake() here after dropping the RCU read lock? splice_wake() requires the RCU read lock to safely access the socket's waitqueue without it being concurrently freed. [ ... ] > +static int splice_ring_alloc(struct sk_psock_splice *s) > +{ > + void *buf; > + > + if (READ_ONCE(s->ring_buf)) > + return 0; > + > + buf =3D (void *)__get_free_pages(GFP_ATOMIC | __GFP_NOWARN, > + get_order(SPLICE_RING_SIZE)); [Severity: Medium] Should this allocation use the __GFP_ACCOUNT flag? Because this 16 KiB allocation is not charged to the memory cgroup, a local application could create and pair an arbitrary number of sockets to intentionally exhaust host memory without being restricted by its container= 's memory limits. [ ... ] > +static int tcp_bpf_splice_recvmsg(struct sock *sk, > + struct sk_psock *psock, > + struct msghdr *msg, size_t len, > + int flags, int *err) > +{ > + struct sk_psock_splice *s; > + size_t copied; > + long timeo; > + > + *err =3D 0; > + /* PEEK is not implemented against the ring (no peek-without-advance > + * helper). Return 0 with no error so the dispatcher defers to the > + * TCP path; ring contents are invisible to PEEK but the socket > + * continues to work for normal apps. > + */ > + if (flags & MSG_PEEK) > + return 0; [Severity: Medium] Does returning 0 here break MSG_PEEK semantics? Since the data has already been spliced directly into the ring buffer, it is completely invisible to the TCP stack. A MSG_PEEK call falling back to the standard TCP sk_receive_queue will falsely report that no data is available, and a blocking MSG_PEEK could hang indefinitely waiting for normal TCP data. [ ... ] > + s =3D rcu_dereference_protected(psock->splice, 1); [Severity: High] Can using a hardcoded 1 here cause data races? Using rcu_dereference_protected() with a hardcoded 1 in a lockless, concurrently accessible path suppresses lockdep checks and omits the data dependency barriers provided by rcu_dereference() or READ_ONCE(). In a concurrent environment, this creates data races and risks observing uninitialized ring buffer state on architectures that require strict data dependency barriers. [ ... ] > + if (sk->sk_err) { > + *err =3D -sk->sk_err; > + return 0; > + } [Severity: Medium] Should the socket error be cleared when it is fetched? When this observes an error in sk->sk_err, it sets *err and returns, which propagates the error to user space. However, sk->sk_err is never cleared (e.g., via sock_error(sk) or xchg(&sk->sk_err, 0)). Failing to clear the error means every subsequent recvmsg() call will erroneously fail with the exact same error, permanently breaking the socket. [ ... ] > +static int splice_send_ring(struct sock *sk, struct sk_psock *psock, > + struct msghdr *msg, size_t size, int flags) > +{ [ ... ] > + if (ps && READ_ONCE(ps->ring_buf) && > + !sk->sk_err && !(sk->sk_shutdown & SEND_SHUTDOWN) && > + skb_queue_empty(&peer_sk->sk_receive_queue)) { [Severity: High] Does this check violate TCP stream ordering by ignoring in-flight TCP data? To preserve stream ordering when falling back to the standard TCP stack, the sender relies on skb_queue_empty(&peer_sk->sk_receive_queue) to determine if it is safe to write to the ring buffer. However, this check ignores data th= at is actively in-flight (queued in the sender's sk_write_queue, the local qdisc, or softirq backlog). A subsequent sendmsg will see an empty peer_sk->sk_receive_queue and splice new data directly into the ring buffer, allowing the receiver to read it before the older in-flight TCP data arrive= s. [ ... ] > +__bpf_kfunc int bpf_sock_splice_pair(struct sock *peer, > + struct bpf_sock_ops_kern *skops) > +{ [ ... ] > + /* Each side keeps a psock ref on the other for the duration. */ > + if (!sk_psock_get(sk)) { > + ret =3D -ENOENT; > + goto out_unlock; > + } > + if (!sk_psock_get(peer)) { [Severity: High] Could ignoring the return value of sk_psock_get() here cause a refcount leak and use-after-free? Similar to tcp_bpf_splice_unpair(), if a socket's psock is concurrently replaced, sk_psock_get() returns and increments the refcount of the new pso= ck. Because this code continues using the original peer pointer to interact with it, it creates an unbalanced increment on the new psock and an unbalanced decrement on the old psock. --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260612011452.1344= 66-1-xiyou.wangcong@gmail.com?part=3D1