From: Jiayuan Chen
To: bpf@vger.kernel.org, john.fastabend@gmail.com, jakub@cloudflare.com
Cc: Jiayuan Chen, "David S. Miller", Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kuniyuki Iwashima, Willem de Bruijn,
	David Ahern, Neal Cardwell, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Jiapeng Chong, Ihor Solodrai, Michal Luczaj,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org
Subject: [PATCH bpf-next v1 5/7] tcp_bpf: optimize splice_read with zero-copy for non-slab pages
Date: Wed, 4 Mar 2026 14:33:56 +0800
Message-ID: <20260304063643.14581-6-jiayuan.chen@linux.dev>
In-Reply-To: <20260304063643.14581-1-jiayuan.chen@linux.dev>
References: <20260304063643.14581-1-jiayuan.chen@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

The previous splice_read implementation copies all data through
intermediate pages (alloc_page + memcpy). This is wasteful for skb
fragment pages which are allocated from the page allocator and can be
safely referenced via get_page().

Optimize by checking PageSlab() to distinguish between linear skb data
(slab-backed) and fragment pages (page allocator-backed):

- For slab pages (skb linear data): copy to a page fragment via
  sk_page_frag, matching what linear_to_page() does in the standard
  TCP splice path (skb_splice_bits). get_page() is invalid on slab
  pages so a copy is unavoidable here.

- For non-slab pages (skb frags): use get_page() directly for true
  zero-copy, same as skb_splice_bits does for fragments.

Both paths use nosteal_pipe_buf_ops. The sk_page_frag approach is more
memory-efficient than alloc_page for small linear copies, as multiple
copies can share a single page fragment.

Benchmark results with rx-verdict-ingress mode (loopback, 8 CPUs):

  splice(2) + always-copy:  ~2770 MB/s  (before this patch)
  splice(2) + zero-copy:    ~4270 MB/s  (after this patch, +54%)
  read(2):                  ~4292 MB/s  (baseline for reference)

Signed-off-by: Jiayuan Chen
---
 net/ipv4/tcp_bpf.c | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index e85a27e32ea7..13506ba7672f 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -447,6 +447,7 @@ static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,

 struct tcp_bpf_splice_ctx {
 	struct pipe_inode_info *pipe;
+	struct sock *sk;
 };

 static int sk_msg_splice_actor(void *arg, struct page *page,
@@ -458,13 +459,33 @@ static int sk_msg_splice_actor(void *arg, struct page *page,
 	};
 	ssize_t ret;

-	buf.page = alloc_page(GFP_KERNEL);
-	if (!buf.page)
-		return 0;
+	if (PageSlab(page)) {
+		/*
+		 * skb linear data is backed by slab memory where
+		 * get_page() is invalid. Copy to a page fragment from
+		 * the socket's page allocator, matching what
+		 * linear_to_page() does in the standard TCP splice
+		 * path (skb_splice_bits).
+		 */
+		struct page_frag *pfrag = sk_page_frag(ctx->sk);
+
+		if (!sk_page_frag_refill(ctx->sk, pfrag))
+			return 0;

-	memcpy(page_address(buf.page), page_address(page) + offset, len);
-	buf.offset = 0;
-	buf.len = len;
+		len = min_t(size_t, len, pfrag->size - pfrag->offset);
+		memcpy(page_address(pfrag->page) + pfrag->offset,
+		       page_address(page) + offset, len);
+		buf.page = pfrag->page;
+		buf.offset = pfrag->offset;
+		buf.len = len;
+		pfrag->offset += len;
+	} else {
+		buf.page = page;
+		buf.offset = offset;
+		buf.len = len;
+	}
+
+	get_page(buf.page);

 	/*
 	 * add_to_pipe() calls pipe_buf_release() on failure, which
@@ -481,9 +502,9 @@ static ssize_t tcp_bpf_splice_read(struct socket *sock, loff_t *ppos,
 				   struct pipe_inode_info *pipe, size_t len,
 				   unsigned int flags)
 {
-	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe };
-	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
 	struct sock *sk = sock->sk;
+	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe, .sk = sk };
+	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
 	struct sk_psock *psock;
 	int ret;

@@ -508,9 +529,9 @@ static ssize_t tcp_bpf_splice_read_parser(struct socket *sock, loff_t *ppos,
 					  struct pipe_inode_info *pipe, size_t len,
 					  unsigned int flags)
 {
-	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe };
-	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
 	struct sock *sk = sock->sk;
+	struct tcp_bpf_splice_ctx ctx = { .pipe = pipe, .sk = sk };
+	int bpf_flags = flags & SPLICE_F_NONBLOCK ? MSG_DONTWAIT : 0;
 	struct sk_psock *psock;
 	int ret;

-- 
2.43.0
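
For readers who want to poke at the copy-vs-reference decision outside
the kernel, here is a rough userspace sketch of what the reworked
sk_msg_splice_actor() appears to do. It is not kernel code: every toy_*
name is hypothetical, the slab flag stands in for PageSlab(), the
refcount increment stands in for get_page(), and the fragment is never
refilled (a full fragment simply returns 0, loosely modelling a failed
sk_page_frag_refill()). Returning the number of bytes described by the
buffer is also an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for struct page: one flag, one refcount. */
struct toy_page {
	bool slab;
	int refcount;
	unsigned char data[TOY_PAGE_SIZE];
};

/* Hypothetical stand-in for struct page_frag. */
struct toy_frag {
	struct toy_page *page;
	size_t offset;
	size_t size;
};

/* Hypothetical stand-in for struct pipe_buffer. */
struct toy_pipe_buf {
	struct toy_page *page;
	size_t offset;
	size_t len;
};

/*
 * Fill *buf for one (page, offset, len) chunk and return the number of
 * bytes it describes. A return of 0 models a failed refill; this toy
 * never refills the fragment.
 */
size_t toy_fill_buf(struct toy_frag *frag, struct toy_page *page,
		    size_t offset, size_t len, struct toy_pipe_buf *buf)
{
	assert(offset <= TOY_PAGE_SIZE && len <= TOY_PAGE_SIZE - offset);

	if (page->slab) {
		/* Slab-backed source: copy into the shared fragment. */
		size_t room = frag->size - frag->offset;

		if (room == 0)
			return 0;
		if (len > room)
			len = room;	/* short copy, as min_t() does in the patch */
		memcpy(frag->page->data + frag->offset,
		       page->data + offset, len);
		buf->page = frag->page;
		buf->offset = frag->offset;
		frag->offset += len;
	} else {
		/* Page-allocator-backed source: reference it in place. */
		buf->page = page;
		buf->offset = offset;
	}
	buf->len = len;
	buf->page->refcount++;	/* models get_page(buf.page) */
	return len;
}
```

Calling toy_fill_buf() twice with small slab-backed chunks leaves both
buffers pointing into the same fragment page at consecutive offsets,
which is the memory-sharing property the commit message attributes to
sk_page_frag. A non-slab chunk leaves the fragment untouched and only
bumps the source page's refcount.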