Re: [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice

BPF List
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Cong Wang" <xiyou.wangcong@gmail.com>
Cc: bpf@vger.kernel.org
Subject: Re: [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice
Date: Fri, 12 Jun 2026 01:33:11 +0000	[thread overview]
Message-ID: <20260612013311.EFC6F1F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260612011452.134466-2-xiyou.wangcong@gmail.com>

Thank you for your contribution! Sashiko AI review found 10 potential issue(s) to consider:
- [High] Missing socket lock in spliced sendmsg/recvmsg leading to SPSC ring buffer corruption.
- [High] Permanent memory leak of sockets and psocks due to circular references.
- [High] TCP stream ordering violation due to fast path racing with in-flight TCP data.
- [High] Use-after-free of socket waitqueue due to missing RCU read lock in splice_wake.
- [High] RCU API Violations and lockless data races when accessing `psock->splice`.
- [High] Silent data blackhole when the receiving spliced socket has an ingress BPF program.
- [High] Refcount leak and Use-After-Free due to ignoring the return value of `sk_psock_get()`.
- [Medium] Missing __GFP_ACCOUNT on ring buffer allocations allows memcg limits bypass.
- [Medium] MSG_PEEK semantics broken for spliced sockets, causing hangs or silent data omission.
- [Medium] Socket errors are not cleared when fetched, leading to permanent error state.
--

commit 2f2595b5999c95fe8a664fc82201339a3716f692
Author: Cong Wang <xiyou.wangcong@gmail.com>

    tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice

    Two locally-connected TCP sockets can be paired by a sock_ops BPF
    program at handshake completion. Once paired, sendmsg copies the user
    payload into a per-direction kernel-side byte ring; recvmsg drains the
    ring into the user buffer. Both copies happen in their own task's mm,
    so no cross-mm pin / kmap dance is needed and the splice fast path
    incurs no skb construction, no softirq, and no TCP protocol-state
    processing. The TCP wire connection itself never sees the spliced
    bytes: sequence numbers stay frozen at post-handshake values, so FIN,
    RST, and keepalive continue to work through the regular paths and the
    pair tears down via a normal close handshake.

> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index e1850caf1a71a..b39fc249a18d5 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c

[ ... ]

> @@ -881,12 +881,15 @@ static void sk_psock_destroy(struct work_struct *work)
>  		sock_put(psock->sk_redir);
>  	if (psock->sk_pair)
>  		sock_put(psock->sk_pair);
> +	tcp_bpf_splice_destroy(psock);
>  	sock_put(psock->sk);
>  	kfree(psock);
>  }
>  
>  void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>  {
> +	tcp_bpf_splice_unpair(psock);
> +

[Severity: High]
Does this code create a permanent memory leak due to a circular reference?

In bpf_sock_splice_pair(), both sides of the paired connection acquire a
reference to the other's sk_psock via sk_psock_get(). This establishes a
circular reference.

Since the cross-references are only dropped in tcp_bpf_splice_unpair(), and
this is only called from sk_psock_drop() which executes when a psock's
reference count reaches 0, the circular reference guarantees neither refcount
will ever reach 0.

> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index cc0bd73f36b6d..549f370772446 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c

[ ... ]

> @@ -626,8 +683,9 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
>  	prot[TCP_BPF_BASE]			= *base;
>  	prot[TCP_BPF_BASE].destroy		= sock_map_destroy;
>  	prot[TCP_BPF_BASE].close		= sock_map_close;
> +	prot[TCP_BPF_BASE].sendmsg		= tcp_bpf_splice_sendmsg;
>  	prot[TCP_BPF_BASE].recvmsg		= tcp_bpf_recvmsg;
> -	prot[TCP_BPF_BASE].sock_is_readable	= sk_msg_is_readable;
> +	prot[TCP_BPF_BASE].sock_is_readable	= tcp_bpf_is_readable;
>  	prot[TCP_BPF_BASE].ioctl		= tcp_bpf_ioctl;

[Severity: High]
Does this cause a silent data blackhole when the receiving spliced socket has
an ingress BPF program?

The TCP_BPF_RX protocol configuration inherits its base configuration here,
so its sendmsg is assigned tcp_bpf_splice_sendmsg. However, its recvmsg
is explicitly overridden to tcp_bpf_recvmsg_parser, which lacks the splice
ring-buffer logic.

A sender will correctly push data into the splice ring buffer, but the
receiver will only poll the standard sk_receive_queue and ingress_msg queue,
never reading the ring buffer.

[ ... ]

> +static inline void splice_wake(struct sock *sk)
> +{
> +	wait_queue_head_t *wq = sk_sleep(sk);
> +
> +	smp_mb();
> +	if (wq && waitqueue_active(wq))
> +		wake_up_interruptible_all(wq);
> +}

[Severity: High]
Is it possible for sk_sleep(sk) to access a freed waitqueue here?

This function uses sk_sleep(sk) to get the socket's waitqueue, which uses
rcu_dereference_raw(). Without the caller holding the RCU read lock, the
memory backing the waitqueue could be concurrently freed by sock_release().

[ ... ]

> +static bool sk_psock_is_spliced(const struct sk_psock *psock)
> +{
> +	struct sk_psock_splice *s = rcu_dereference(psock->splice);
> +
> +	return s && rcu_access_pointer(s->peer);
> +}

[Severity: High]
Is this missing an RCU read lock?

sk_psock_is_spliced() calls rcu_dereference(psock->splice) from
tcp_bpf_splice_sendmsg() and tcp_bpf_recvmsg() without an active
rcu_read_lock(), which can trigger a lockdep warning.

[ ... ]

> +static int tcp_bpf_splice_sendmsg(struct sock *sk, struct msghdr *msg,
> +				  size_t size)
> +{
> +	struct sk_psock *psock;
> +	int spliced = 0;
> +	int ret;

[Severity: High]
Could the lack of socket locking here lead to SPSC ring buffer corruption?

tcp_bpf_splice_sendmsg() bypasses the standard socket lock (lock_sock(sk)).
Since userspace can call sendmsg or recvmsg from multiple threads concurrently
on the same socket descriptor, multiple producers or consumers will race
when accessing the SPSC ring buffer logic in splice_ring_write() and
tcp_bpf_splice_recvmsg(). This can lead to non-atomic cursor updates and
memory corruption.

[ ... ]

> +void tcp_bpf_splice_unpair(struct sk_psock *psock)
> +{
> +	struct sk_psock_splice *self_s, *peer_s;
> +	struct sk_psock *peer;
> +	bool was_paired = false;
> +
> +	self_s = rcu_dereference_protected(psock->splice, 1);
> +	if (!self_s)
> +		return;
> +
> +	rcu_read_lock();
> +	peer = rcu_dereference(self_s->peer);
> +	if (!peer) {
> +		rcu_read_unlock();
> +		return;
> +	}
> +	if (!sk_psock_get(peer->sk)) {

[Severity: High]
Does ignoring the return value of sk_psock_get() here cause a refcount leak
and use-after-free?

If a socket's psock is concurrently replaced, sk_psock_get() returns and
increments the refcount of the new psock. Because this code continues using
the original peer pointer to interact with and ultimately drop the refcount
(via sk_psock_put()), it creates an unbalanced increment on the new psock
(memory leak) and an unbalanced decrement on the old psock (use-after-free).

[ ... ]

> +	rcu_read_unlock();

[ ... ]

> +	/* Wake any blocked rendezvous waiters on either side. They will
> +	 * re-check the predicate, see splice->peer == NULL, and exit.
> +	 */
> +	splice_wake(psock->sk);
> +	splice_wake(peer->sk);

[Severity: High]
Is it safe to call splice_wake() here after dropping the RCU read lock?

splice_wake() requires the RCU read lock to safely access the socket's
waitqueue without it being concurrently freed.

[ ... ]

> +static int splice_ring_alloc(struct sk_psock_splice *s)
> +{
> +	void *buf;
> +
> +	if (READ_ONCE(s->ring_buf))
> +		return 0;
> +
> +	buf = (void *)__get_free_pages(GFP_ATOMIC | __GFP_NOWARN,
> +				       get_order(SPLICE_RING_SIZE));

[Severity: Medium]
Should this allocation use the __GFP_ACCOUNT flag?

Because this 16 KiB allocation is not charged to the memory cgroup, a local
application could create and pair an arbitrary number of sockets to
intentionally exhaust host memory without being restricted by its container's
memory limits.

[ ... ]

> +static int tcp_bpf_splice_recvmsg(struct sock *sk,
> +				  struct sk_psock *psock,
> +				  struct msghdr *msg, size_t len,
> +				  int flags, int *err)
> +{
> +	struct sk_psock_splice *s;
> +	size_t copied;
> +	long timeo;
> +
> +	*err = 0;
> +	/* PEEK is not implemented against the ring (no peek-without-advance
> +	 * helper). Return 0 with no error so the dispatcher defers to the
> +	 * TCP path; ring contents are invisible to PEEK but the socket
> +	 * continues to work for normal apps.
> +	 */
> +	if (flags & MSG_PEEK)
> +		return 0;

[Severity: Medium]
Does returning 0 here break MSG_PEEK semantics?

Since the data has already been spliced directly into the ring buffer, it is
completely invisible to the TCP stack. A MSG_PEEK call falling back to the
standard TCP sk_receive_queue will falsely report that no data is available,
and a blocking MSG_PEEK could hang indefinitely waiting for normal TCP data.

[ ... ]

> +	s = rcu_dereference_protected(psock->splice, 1);

[Severity: High]
Can using a hardcoded 1 here cause data races?

Using rcu_dereference_protected() with a hardcoded 1 in a lockless,
concurrently accessible path suppresses lockdep checks and omits the data
dependency barriers provided by rcu_dereference() or READ_ONCE(). In a
concurrent environment, this creates data races and risks observing
uninitialized ring buffer state on architectures that require strict data
dependency barriers.

[ ... ]

> +		if (sk->sk_err) {
> +			*err = -sk->sk_err;
> +			return 0;
> +		}

[Severity: Medium]
Should the socket error be cleared when it is fetched?

When this observes an error in sk->sk_err, it sets *err and returns, which
propagates the error to user space. However, sk->sk_err is never cleared
(e.g., via sock_error(sk) or xchg(&sk->sk_err, 0)). Failing to clear the
error means every subsequent recvmsg() call will erroneously fail with the
exact same error, permanently breaking the socket.

[ ... ]

> +static int splice_send_ring(struct sock *sk, struct sk_psock *psock,
> +			    struct msghdr *msg, size_t size, int flags)
> +{

[ ... ]

> +			if (ps && READ_ONCE(ps->ring_buf) &&
> +			    !sk->sk_err && !(sk->sk_shutdown & SEND_SHUTDOWN) &&
> +			    skb_queue_empty(&peer_sk->sk_receive_queue)) {

[Severity: High]
Does this check violate TCP stream ordering by ignoring in-flight TCP data?

To preserve stream ordering when falling back to the standard TCP stack, the
sender relies on skb_queue_empty(&peer_sk->sk_receive_queue) to determine if
it is safe to write to the ring buffer. However, this check ignores data that
is actively in-flight (queued in the sender's sk_write_queue, the local
qdisc, or softirq backlog). A subsequent sendmsg will see an empty
peer_sk->sk_receive_queue and splice new data directly into the ring buffer,
allowing the receiver to read it before the older in-flight TCP data arrives.

[ ... ]

> +__bpf_kfunc int bpf_sock_splice_pair(struct sock *peer,
> +				     struct bpf_sock_ops_kern *skops)
> +{

[ ... ]

> +	/* Each side keeps a psock ref on the other for the duration. */
> +	if (!sk_psock_get(sk)) {
> +		ret = -ENOENT;
> +		goto out_unlock;
> +	}
> +	if (!sk_psock_get(peer)) {

[Severity: High]
Could ignoring the return value of sk_psock_get() here cause a refcount leak
and use-after-free?

Similar to tcp_bpf_splice_unpair(), if a socket's psock is concurrently
replaced, sk_psock_get() returns and increments the refcount of the new psock.
Because this code continues using the original peer pointer to interact with
it, it creates an unbalanced increment on the new psock and an unbalanced
decrement on the old psock.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260612011452.134466-1-xiyou.wangcong@gmail.com?part=1

next prev parent reply	other threads:[~2026-06-12  1:33 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-12  1:14 [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Cong Wang
2026-06-12  1:14 ` [RFC PATCH bpf-next 1/5] tcp_bpf: add bpf_sock_splice_pair kfunc for opportunistic loopback splice Cong Wang
2026-06-12  1:33   ` sashiko-bot [this message]
2026-06-12  2:10   ` bot+bpf-ci
2026-06-12  1:14 ` [RFC PATCH bpf-next 2/5] tcp_bpf: busy-poll the splice ring before parking the receiver Cong Wang
2026-06-12  3:29   ` sashiko-bot
2026-06-12  1:14 ` [RFC PATCH bpf-next 3/5] selftests/bpf: add tcp_splice basic round-trip test Cong Wang
2026-06-12  1:28   ` sashiko-bot
2026-06-12  1:14 ` [RFC PATCH bpf-next 4/5] bpf: allow SO_BUSY_POLL in bpf_setsockopt() Cong Wang
2026-06-12  1:14 ` [RFC PATCH bpf-next 5/5] selftests/bpf: set SO_BUSY_POLL from the tcp_splice sockops prog Cong Wang
2026-06-12  1:26   ` sashiko-bot
2026-06-12 16:01 ` [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets Alexei Starovoitov
2026-06-12 18:12   ` Cong Wang
2026-06-12 18:34     ` Alexei Starovoitov
2026-06-12 20:17       ` Cong Wang
2026-06-13 17:57   ` Jakub Kicinski
2026-06-13 21:25     ` Cong Wang
2026-06-12 22:10 ` [syzbot ci] " syzbot ci

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260612013311.EFC6F1F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox