From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-180.mta0.migadu.com (out-180.mta0.migadu.com [91.218.175.180]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3F52E3F825C for ; Thu, 11 Jun 2026 12:38:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.180 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781181516; cv=none; b=TU7IhAUCT9xInHVmX3GHApwZCU2J7tQ71MLQRtCbDy+97xTnUW8Lj1KPxIUxnXDuuygkPS8/cIhhLyZgKE3tVdwnXU34jTzfNAUDLexasOuQmGQ8slz8lwFJtA1q0zu/REr/7isFFZcX7BDqTIn7eWq3EZOWOtY9caTB+PSae48= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781181516; c=relaxed/simple; bh=haEccmhqC7VmWW+0DSMapViaZBrWcsJHhtn5EvyTuag=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=azPkKNkMgxzLE6VHJCmsDaPyJZum3D23gRdyXitwnCpCsH7+whj43ZP90WxFRIbu0fYbvy56WFgG1tGZ6fFIs3GDqB94pHTfOze1qIPcziJlTsyTLZHcZqrCFmpeRP/XrzuO5epfz6jiT0IeyzmxNPJT7/uHEigyMBlV8+D54K0= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=AKCI8xX6; arc=none smtp.client-ip=91.218.175.180 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="AKCI8xX6" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1781181511; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=z3RQOZCpkPzTNjmfTMEyMYiXFULFskQUrtntUJzV76c=; b=AKCI8xX6vjsjwdWFni0qBmSGXe6SMtpwVhdeEol05oi5RYHPS4DFVD3lxpEms9kT+Y9+LJ gqV5gkY2OmgH8+fcorngwF80CmQk6NE4TI5Ut+tNXl9fbbtCIB8MPIzyyU8Y4YabnTloLf K5ErMtKqZbyfhrpKPycSz9hlSRRSmBI= From: Jiayuan Chen To: bpf@vger.kernel.org Cc: Zhang Cen , stable@vger.kernel.org, Han Guidong <2045gemini@gmail.com>, Jiayuan Chen , John Fastabend , Daniel Borkmann , Stanislav Fomichev , Martin KaFai Lau , Alexei Starovoitov , Andrii Nakryiko , Eduard Zingerman , Kumar Kartikeya Dwivedi , Song Liu , Yonghong Song , Jiri Olsa , Emil Tsalapatis , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , Jakub Sitnicki , Shuah Khan , Jesper Dangaard Brouer , Sechang Lim , Ihor Solodrai , Cong Wang , linux-kernel@vger.kernel.org, netdev@vger.kernel.org, linux-kselftest@vger.kernel.org Subject: [PATCH bpf v2 4/7] bpf, sockmap: keep sk_msg copy state in sync Date: Thu, 11 Jun 2026 20:34:12 +0800 Message-ID: <20260611123538.156005-5-jiayuan.chen@linux.dev> In-Reply-To: <20260611123538.156005-1-jiayuan.chen@linux.dev> References: <20260611123538.156005-1-jiayuan.chen@linux.dev> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT From: Zhang Cen SK_MSG uses msg->sg.copy as per-scatterlist-entry provenance. Entries with this bit set are copied before data/data_end are exposed to SK_MSG BPF programs for direct packet access. bpf_msg_pull_data(), bpf_msg_push_data(), and bpf_msg_pop_data() rewrite the sk_msg scatterlist ring by collapsing, splitting, and shifting entries. These operations move msg->sg.data[] entries, but the parallel copy bitmap can be left behind on the old slot. A copied entry can then return to msg->sg.start with its copy bit clear and be exposed as directly writable packet data. This corruption path requires an attached SK_MSG BPF program that calls the mutating helpers; ordinary sockmap/TLS traffic that never runs push/pop/pull helper sequences is not affected. Keep msg->sg.copy synchronized with scatterlist entry moves, preserve the copy bit when an entry is split, clear it when a helper replaces an entry with a private page, and clear slots vacated by pull-data compaction. Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data") Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data") Fixes: 7246d8ed4dcc ("bpf: helper to pop data from messages") Cc: stable@vger.kernel.org Co-developed-by: Han Guidong <2045gemini@gmail.com> Cc: Jiayuan Chen Reviewed-by: John Fastabend Signed-off-by: Han Guidong <2045gemini@gmail.com> Signed-off-by: Zhang Cen --- net/core/filter.c | 88 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 83 insertions(+), 5 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 6e345ca65ca14..e35e681a15dca 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2654,6 +2654,38 @@ static void sk_msg_reset_curr(struct sk_msg *msg) } } +static bool sk_msg_elem_is_copy(const struct sk_msg *msg, u32 i) +{ + return test_bit(i, msg->sg.copy); +} + +static void sk_msg_clear_elem_copy(struct sk_msg *msg, u32 i) +{ + __clear_bit(i, msg->sg.copy); +} + +static void sk_msg_set_elem_copy(struct sk_msg *msg, u32 i) +{ + __set_bit(i, msg->sg.copy); +} + +static void sk_msg_clear_copy_range(struct sk_msg *msg, u32 start, u32 end) +{ + while (start != end) { + sk_msg_clear_elem_copy(msg, start); + sk_msg_iter_var_next(start); + } +} + +static void sk_msg_sg_move(struct sk_msg *msg, u32 dst, u32 src) +{ + msg->sg.data[dst] = msg->sg.data[src]; + if (sk_msg_elem_is_copy(msg, src)) + sk_msg_set_elem_copy(msg, dst); + else + sk_msg_clear_elem_copy(msg, dst); +} + static const struct bpf_func_proto bpf_msg_cork_bytes_proto = { .func = bpf_msg_cork_bytes, .gpl_only = false, @@ -2692,7 +2724,7 @@ BPF_CALL_4(bpf_msg_pull_data, struct sk_msg *, msg, u32, start, * account for the headroom. */ bytes_sg_total = start - offset + bytes; - if (!test_bit(i, msg->sg.copy) && bytes_sg_total <= len) + if (!sk_msg_elem_is_copy(msg, i) && bytes_sg_total <= len) goto out; /* At this point we need to linearize multiple scatterlist @@ -2738,6 +2770,7 @@ BPF_CALL_4(bpf_msg_pull_data, struct sk_msg *, msg, u32, start, } while (i != last_sge); sg_set_page(&msg->sg.data[first_sge], page, copy, 0); + sk_msg_clear_elem_copy(msg, first_sge); /* To repair sg ring we need to shift entries. If we only * had a single entry though we can just replace it and @@ -2747,8 +2780,14 @@ BPF_CALL_4(bpf_msg_pull_data, struct sk_msg *, msg, u32, start, shift = last_sge > first_sge ? last_sge - first_sge - 1 : NR_MSG_FRAG_IDS - first_sge + last_sge - 1; - if (!shift) + if (!shift) { + sk_msg_clear_elem_copy(msg, msg->sg.end); goto out; + } + + i = first_sge; + sk_msg_iter_var_next(i); + sk_msg_clear_copy_range(msg, i, last_sge); i = first_sge; sk_msg_iter_var_next(i); @@ -2762,16 +2801,18 @@ BPF_CALL_4(bpf_msg_pull_data, struct sk_msg *, msg, u32, start, if (move_from == msg->sg.end) break; - msg->sg.data[i] = msg->sg.data[move_from]; + sk_msg_sg_move(msg, i, move_from); msg->sg.data[move_from].length = 0; msg->sg.data[move_from].page_link = 0; msg->sg.data[move_from].offset = 0; + sk_msg_clear_elem_copy(msg, move_from); sk_msg_iter_var_next(i); } while (1); msg->sg.end = msg->sg.end - shift > msg->sg.end ? msg->sg.end - shift + NR_MSG_FRAG_IDS : msg->sg.end - shift; + sk_msg_clear_elem_copy(msg, msg->sg.end); out: sk_msg_reset_curr(msg); msg->data = sg_virt(&msg->sg.data[first_sge]) + start - offset; @@ -2794,6 +2835,8 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start, { struct scatterlist sge, nsge, nnsge, rsge = {0}, *psge; u32 new, i = 0, l = 0, space, copy = 0, offset = 0; + bool sge_copy = false, nsge_copy = false, nnsge_copy = false; + bool rsge_copy = false; u8 *raw, *to, *from; struct page *page; @@ -2869,6 +2912,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start, sk_msg_iter_var_prev(i); psge = sk_msg_elem(msg, i); rsge = sk_msg_elem_cpy(msg, i); + rsge_copy = sk_msg_elem_is_copy(msg, i); psge->length = start - offset; rsge.length -= psge->length; @@ -2894,23 +2938,34 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start, /* Shift one or two slots as needed */ sge = sk_msg_elem_cpy(msg, new); sg_unmark_end(&sge); + sge_copy = sk_msg_elem_is_copy(msg, new); nsge = sk_msg_elem_cpy(msg, i); + nsge_copy = sk_msg_elem_is_copy(msg, i); if (rsge.length) { sk_msg_iter_var_next(i); nnsge = sk_msg_elem_cpy(msg, i); + nnsge_copy = sk_msg_elem_is_copy(msg, i); sk_msg_iter_next(msg, end); } while (i != msg->sg.end) { msg->sg.data[i] = sge; + if (sge_copy) + sk_msg_set_elem_copy(msg, i); + else + sk_msg_clear_elem_copy(msg, i); sge = nsge; + sge_copy = nsge_copy; sk_msg_iter_var_next(i); if (rsge.length) { nsge = nnsge; + nsge_copy = nnsge_copy; nnsge = sk_msg_elem_cpy(msg, i); + nnsge_copy = sk_msg_elem_is_copy(msg, i); } else { nsge = sk_msg_elem_cpy(msg, i); + nsge_copy = sk_msg_elem_is_copy(msg, i); } } @@ -2918,13 +2973,18 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start, /* Place newly allocated data buffer */ sk_mem_charge(msg->sk, len); msg->sg.size += len; - __clear_bit(new, msg->sg.copy); + sk_msg_clear_elem_copy(msg, new); sg_set_page(&msg->sg.data[new], page, len + copy, 0); if (rsge.length) { get_page(sg_page(&rsge)); sk_msg_iter_var_next(new); msg->sg.data[new] = rsge; + if (rsge_copy) + sk_msg_set_elem_copy(msg, new); + else + sk_msg_clear_elem_copy(msg, new); } + sk_msg_clear_elem_copy(msg, msg->sg.end); sk_msg_reset_curr(msg); sk_msg_compute_data_pointers(msg); @@ -2950,27 +3010,38 @@ static void sk_msg_shift_left(struct sk_msg *msg, int i) do { prev = i; sk_msg_iter_var_next(i); - msg->sg.data[prev] = msg->sg.data[i]; + sk_msg_sg_move(msg, prev, i); } while (i != msg->sg.end); sk_msg_iter_prev(msg, end); + sk_msg_clear_elem_copy(msg, msg->sg.end); } static void sk_msg_shift_right(struct sk_msg *msg, int i) { struct scatterlist tmp, sge; + bool tmp_copy, sge_copy; sk_msg_iter_next(msg, end); sge = sk_msg_elem_cpy(msg, i); + sge_copy = sk_msg_elem_is_copy(msg, i); sk_msg_iter_var_next(i); tmp = sk_msg_elem_cpy(msg, i); + tmp_copy = sk_msg_elem_is_copy(msg, i); while (i != msg->sg.end) { msg->sg.data[i] = sge; + if (sge_copy) + sk_msg_set_elem_copy(msg, i); + else + sk_msg_clear_elem_copy(msg, i); sk_msg_iter_var_next(i); sge = tmp; + sge_copy = tmp_copy; tmp = sk_msg_elem_cpy(msg, i); + tmp_copy = sk_msg_elem_is_copy(msg, i); } + sk_msg_clear_elem_copy(msg, msg->sg.end); } BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start, @@ -3027,8 +3098,10 @@ BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start, */ if (start != offset) { struct scatterlist *nsge, *sge = sk_msg_elem(msg, i); + u32 sge_idx = i; int a = start - offset; int b = sge->length - pop - a; + bool sge_copy = sk_msg_elem_is_copy(msg, sge_idx); sk_msg_iter_var_next(i); @@ -3041,6 +3114,10 @@ BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start, sg_set_page(nsge, sg_page(sge), b, sge->offset + pop + a); + if (sge_copy) + sk_msg_set_elem_copy(msg, i); + else + sk_msg_clear_elem_copy(msg, i); } else { struct page *page, *orig; u8 *to, *from; @@ -3057,6 +3134,7 @@ BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start, memcpy(to, from, a); memcpy(to + a, from + a + pop, b); sg_set_page(sge, page, a + b, 0); + sk_msg_clear_elem_copy(msg, sge_idx); put_page(orig); } pop = 0; -- 2.43.0