public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf] bpf, sockmap: zero-initialize pages allocated in bpf_msg_push_data
@ 2026-04-24 19:03 Weiming Shi
  2026-04-25  3:17 ` Jiayuan Chen
  2026-04-25 19:04 ` sashiko-bot
  0 siblings, 2 replies; 5+ messages in thread
From: Weiming Shi @ 2026-04-24 19:03 UTC (permalink / raw)
  To: Martin KaFai Lau, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: John Fastabend, Stanislav Fomichev, Song Liu, Yonghong Song,
	Jiri Olsa, Simon Horman, bpf, netdev, Xiang Mei, Weiming Shi,
	Xinyu Ma

bpf_msg_push_data() allocates pages via alloc_pages() without
__GFP_ZERO. In the non-copy path, the entire page of uninitialized
heap content is added directly to the sk_msg scatterlist, which is
then transmitted over TCP to userspace via tcp_bpf_push(). In the
copy path, a gap of len bytes between the front and back memcpy
regions is similarly left uninitialized.

This leads to a kernel heap information leak: stale page content
including kernel pointers from the direct-map and vmemmap regions
is transmitted to userspace, which can be used to defeat KASLR.

Add __GFP_ZERO to the alloc_pages() call to ensure the allocated
page is always zeroed before it enters the scatterlist.

Link: https://lore.kernel.org/all/20260424155913.A19FDC19425@smtp.kernel.org
Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
Tested-by: Xiang Mei <xmei5@asu.edu>
Tested-by: Xinyu Ma <mmmxny@gmail.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
---
 net/core/filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index bc96c18df4e0..ea02239892fd 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2820,7 +2820,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
 	if (!space || (space == 1 && start != offset))
 		copy = msg->sg.data[i].length;
 
-	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP | __GFP_ZERO,
 			   get_order(copy + len));
 	if (unlikely(!page))
 		return -ENOMEM;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf] bpf, sockmap: zero-initialize pages allocated in bpf_msg_push_data
  2026-04-24 19:03 [PATCH bpf] bpf, sockmap: zero-initialize pages allocated in bpf_msg_push_data Weiming Shi
@ 2026-04-25  3:17 ` Jiayuan Chen
  2026-04-25 17:59   ` Weiming Shi
  2026-04-25 19:04 ` sashiko-bot
  1 sibling, 1 reply; 5+ messages in thread
From: Jiayuan Chen @ 2026-04-25  3:17 UTC (permalink / raw)
  To: Weiming Shi, Martin KaFai Lau, Daniel Borkmann,
	Alexei Starovoitov, Andrii Nakryiko, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: John Fastabend, Stanislav Fomichev, Song Liu, Yonghong Song,
	Jiri Olsa, Simon Horman, bpf, netdev, Xiang Mei, Xinyu Ma


On 4/25/26 3:03 AM, Weiming Shi wrote:
> bpf_msg_push_data() allocates pages via alloc_pages() without
> __GFP_ZERO. In the non-copy path, the entire page of uninitialized
> heap content is added directly to the sk_msg scatterlist, which is
> then transmitted over TCP to userspace via tcp_bpf_push(). In the
> copy path, a gap of len bytes between the front and back memcpy
> regions is similarly left uninitialized.
>
> This leads to a kernel heap information leak: stale page content
> including kernel pointers from the direct-map and vmemmap regions
> is transmitted to userspace, which can be used to defeat KASLR.
>
> Add __GFP_ZERO to the alloc_pages() call to ensure the allocated
> page is always zeroed before it enters the scatterlist.



As the helper's own documentation says:

     If a program of type BPF_PROG_TYPE_SK_MSG is run on a msg it may
     want to insert metadata or options into the msg. This can later be
     read and used by any of the lower layer BPF hooks.

The inserted region is meant to be written by the BPF program — that's 
the entire point of calling push.

If the program doesn't fill it, the push has no purpose to begin with.


Isn't the uninitialized content a bug in the BPF program rather than 
something the kernel helper should paper over?


> Link: https://lore.kernel.org/all/20260424155913.A19FDC19425@smtp.kernel.org
> Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
> Tested-by: Xiang Mei <xmei5@asu.edu>
> Tested-by: Xinyu Ma <mmmxny@gmail.com>
> Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> ---
>   net/core/filter.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index bc96c18df4e0..ea02239892fd 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2820,7 +2820,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
>   	if (!space || (space == 1 && start != offset))
>   		copy = msg->sg.data[i].length;
>   
> -	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
> +	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP | __GFP_ZERO,
>   			   get_order(copy + len));
>   	if (unlikely(!page))
>   		return -ENOMEM;

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf] bpf, sockmap: zero-initialize pages allocated in bpf_msg_push_data
  2026-04-25  3:17 ` Jiayuan Chen
@ 2026-04-25 17:59   ` Weiming Shi
  2026-04-26  6:31     ` Jiayuan Chen
  0 siblings, 1 reply; 5+ messages in thread
From: Weiming Shi @ 2026-04-25 17:59 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: Martin KaFai Lau, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Fastabend, Stanislav Fomichev, Song Liu, Yonghong Song,
	Jiri Olsa, Simon Horman, bpf, netdev, Xiang Mei, Xinyu Ma

On 26-04-25 11:17, Jiayuan Chen wrote:
> 
> On 4/25/26 3:03 AM, Weiming Shi wrote:
> > bpf_msg_push_data() allocates pages via alloc_pages() without
> > __GFP_ZERO. In the non-copy path, the entire page of uninitialized
> > heap content is added directly to the sk_msg scatterlist, which is
> > then transmitted over TCP to userspace via tcp_bpf_push(). In the
> > copy path, a gap of len bytes between the front and back memcpy
> > regions is similarly left uninitialized.
> > 
> > This leads to a kernel heap information leak: stale page content
> > including kernel pointers from the direct-map and vmemmap regions
> > is transmitted to userspace, which can be used to defeat KASLR.
> > 
> > Add __GFP_ZERO to the alloc_pages() call to ensure the allocated
> > page is always zeroed before it enters the scatterlist.
> 
> 
> 
> As the helper's own documentation says:
> 
>     If a program of type BPF_PROG_TYPE_SK_MSG is run on a msg it may
>     want to insert metadata or options into the msg. This can later be
>     read and used by any of the lower layer BPF hooks.
> 
> The inserted region is meant to be written by the BPF program — that's the
> entire point of calling push.
> 
> If the program doesn't fill it, the push has no purpose to begin with.
> 
> 
> Isn't the uninitialized content a bug in the BPF program rather than
> something the kernel helper should paper over?
> 

Hi, thanks for the review.

In my testing, a process with only CAP_BPF + CAP_NET_ADMIN can receive
kernel heap and vmalloc pointers through recv() from the uninitialized
pushed region. The leaked memory contains critical kernel metadata
such as direct-map and vmalloc pointers, which breaks KASLR.

Kernels without CONFIG_INIT_ON_ALLOC_DEFAULT_ON (e.g. RHEL) are
directly affected; the leak is not masked by any mitigation.

Thanks,
Weiming Shi

> 
> > Link: https://lore.kernel.org/all/20260424155913.A19FDC19425@smtp.kernel.org
> > Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
> > Tested-by: Xiang Mei <xmei5@asu.edu>
> > Tested-by: Xinyu Ma <mmmxny@gmail.com>
> > Signed-off-by: Weiming Shi <bestswngs@gmail.com>
> > ---
> >   net/core/filter.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index bc96c18df4e0..ea02239892fd 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2820,7 +2820,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
> >   	if (!space || (space == 1 && start != offset))
> >   		copy = msg->sg.data[i].length;
> > -	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
> > +	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP | __GFP_ZERO,
> >   			   get_order(copy + len));
> >   	if (unlikely(!page))
> >   		return -ENOMEM;

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf] bpf, sockmap: zero-initialize pages allocated in bpf_msg_push_data
  2026-04-24 19:03 [PATCH bpf] bpf, sockmap: zero-initialize pages allocated in bpf_msg_push_data Weiming Shi
  2026-04-25  3:17 ` Jiayuan Chen
@ 2026-04-25 19:04 ` sashiko-bot
  1 sibling, 0 replies; 5+ messages in thread
From: sashiko-bot @ 2026-04-25 19:04 UTC (permalink / raw)
  To: Weiming Shi; +Cc: bpf

Sashiko AI review found 6 potential issue(s):
- [Critical] Unbounded `len` causes `copy + len` integer overflow, leading to an exploitable out-of-bounds kernel write.
- [Critical] Appending to a full `sk_msg` wraps the ring buffer index, corrupting the last element and causing an out-of-bounds read/write.
- [Critical] Scatterlist element shifting desynchronizes `msg->sg.copy` bitmap, allowing unauthorized direct writes to shared pages.
- [High] Incorrect scatterlist offset calculation when splitting elements causes out-of-bounds memory read.
- [High] Appending to an `sk_msg` with 1 free slot reads garbage length, inflating payload size and leaking uninitialized memory.
- [Medium] Unnecessary zeroing of compound pages in atomic context causes latency spikes.
--

commit b1bf1a9196d24b0457d32d17aa1980fac2aeeebd
Author: Weiming Shi <bestswngs@gmail.com>

bpf, sockmap: zero-initialize pages allocated in bpf_msg_push_data

The commit attempts to fix a kernel heap information leak in
bpf_msg_push_data() by adding __GFP_ZERO to the alloc_pages() call. This
zeroes out the allocated pages before they are added to the scatterlist,
preventing uninitialized memory from being sent to userspace.

Link: https://lore.kernel.org/all/20260424155913.A19FDC19425@smtp.kernel.org

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 5fa9189eb772b2..3422cbf09c7da2 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2820,7 +2820,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
>  	if (!space || (space == 1 && start != offset))
>  		copy = msg->sg.data[i].length;

This isn't a bug introduced by this patch, but when appending data (start ==
msg->sg.size) and exactly 1 free slot remains (space == 1), the search loop
exits with i == msg->sg.end. Does this fallback path read from an unused ring
buffer slot containing uninitialized garbage, since
msg->sg.data[msg->sg.end].length is used here?

Later, this garbage value is added to len to set the new element's length,
inflating it far beyond the copied data.

Also, when appending data to the exact end of a message with no free slots
(space == 0), the search loop terminates with i == msg->sg.end and
msg->sg.end wraps to msg->sg.start. Does this code mistakenly copy the
length of the first element in the ring?

Then when calculating back:

	if (i == msg->sg.end)
		sk_msg_iter_var_prev(i);
	psge = sk_msg_elem(msg, i);
	front = start - offset;
	back = psge->length - front;

If the first element is shorter than the last, couldn't back become negative
and pass a large unsigned size to memcpy(), leading to an out-of-bounds
write?

> -	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
> +	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP | __GFP_ZERO,
>  			   get_order(copy + len));

Does copy + len overflow here if a BPF program passes a very large len
(e.g., 0xFFFFF000)?

Since both are 32-bit integers, couldn't this wrap around to a small positive
value, causing get_order() to allocate a much smaller buffer than required,
and later lead to an out-of-bounds memcpy() when to = raw + front + len?

Additionally, if a BPF program passes a valid len that requires a high-order
compound page (e.g., 32KB or 64KB), will synchronously zeroing the entire
memory block under GFP_ATOMIC with preemption disabled introduce latency spikes
into the network fast-path?

Would it be better to allocate normally and use memset() to zero only the
exact gap of uninitialized len bytes?
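
One possible shape of that narrower alternative, as a rough sketch only
(variable names follow the surrounding function; whether a single
memset() covers the gap in both the copy and non-copy paths would need
checking against the actual control flow):

```c
/* Sketch: keep the allocation flags unchanged and zero only the
 * pushed region that the existing memcpy() calls do not overwrite. */
page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
		   get_order(copy + len));
if (unlikely(!page))
	return -ENOMEM;

raw = page_address(page);
/* bytes [0, front) and [front + len, ...) are filled by the
 * memcpy() calls; zero only the len-byte gap in between. */
memset(raw + front, 0, len);
```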

>  	if (unlikely(!page))
>  		return -ENOMEM;

This isn't a bug introduced by this patch, but later in the same function when
inserting data into the middle of a scatterlist element:

	if (start - offset) {
		if (i == msg->sg.end)
			sk_msg_iter_var_prev(i);
		psge = sk_msg_elem(msg, i);
		rsge = sk_msg_elem_cpy(msg, i);

		psge->length = start - offset;
		rsge.length -= psge->length;
		rsge.offset += start;

Is rsge.offset incorrectly advanced by the absolute logical byte offset
start?

Shouldn't rsge.offset instead be advanced by the relative number of bytes
consumed by the first half (which is start - offset)? Does this absolute
offset addition cause an out-of-bounds memory read when transmitting the
scatterlist element?

Lastly, further down in the function when shifting entries:

	while (i != msg->sg.end) {
		msg->sg.data[i] = sge;
		sge = nsge;
		sk_msg_iter_var_next(i);
		if (rsge.length) {
			nsge = nnsge;
			nnsge = sk_msg_elem_cpy(msg, i);
		} else {

When bpf_msg_push_data() shifts existing scatterlist entries to the right to
make room for new elements, it appears to miss shifting the corresponding bits
in the msg->sg.copy bitmap.

Does this desynchronize the copy-on-write tracking?

If a shared page is shifted to an index with a cleared bitmap bit, could a
subsequent bpf_msg_pull_data() directly overwrite memory belonging to other
sockets or the page cache?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260424190310.1520555-2-bestswngs@gmail.com?part=1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH bpf] bpf, sockmap: zero-initialize pages allocated in bpf_msg_push_data
  2026-04-25 17:59   ` Weiming Shi
@ 2026-04-26  6:31     ` Jiayuan Chen
  0 siblings, 0 replies; 5+ messages in thread
From: Jiayuan Chen @ 2026-04-26  6:31 UTC (permalink / raw)
  To: Weiming Shi, Jiayuan Chen
  Cc: Martin KaFai Lau, Daniel Borkmann, Alexei Starovoitov,
	Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	John Fastabend, Stanislav Fomichev, Song Liu, Yonghong Song,
	Jiri Olsa, Simon Horman, bpf, netdev, Xiang Mei, Xinyu Ma


On 4/26/26 1:59 AM, Weiming Shi wrote:
> On 26-04-25 11:17, Jiayuan Chen wrote:
>> On 4/25/26 3:03 AM, Weiming Shi wrote:
>>> bpf_msg_push_data() allocates pages via alloc_pages() without
>>> __GFP_ZERO. In the non-copy path, the entire page of uninitialized
>>> heap content is added directly to the sk_msg scatterlist, which is
>>> then transmitted over TCP to userspace via tcp_bpf_push(). In the
>>> copy path, a gap of len bytes between the front and back memcpy
>>> regions is similarly left uninitialized.
>>>
>>> This leads to a kernel heap information leak: stale page content
>>> including kernel pointers from the direct-map and vmemmap regions
>>> is transmitted to userspace, which can be used to defeat KASLR.
>>>
>>> Add __GFP_ZERO to the alloc_pages() call to ensure the allocated
>>> page is always zeroed before it enters the scatterlist.
>>
>>
>> As the helper's own documentation says:
>>
>>      If a program of type BPF_PROG_TYPE_SK_MSG is run on a msg it may
>>      want to insert metadata or options into the msg. This can later be
>>      read and used by any of the lower layer BPF hooks.
>>
>> The inserted region is meant to be written by the BPF program — that's the
>> entire point of calling push.
>>
>> If the program doesn't fill it, the push has no purpose to begin with.
>>
>>
>> Isn't the uninitialized content a bug in the BPF program rather than
>> something the kernel helper should paper over?
>>
> Hi, thanks for the review.
>
> In my testing, a process with only CAP_BPF + CAP_NET_ADMIN can receive
> kernel heap and vmalloc pointers through recv() from the uninitialized
> pushed region. The leaked memory contains critical kernel metadata
> such as direct-map and vmalloc pointers, which breaks KASLR.
>
> Kernels without CONFIG_INIT_ON_ALLOC_DEFAULT_ON (e.g. RHEL) are
> directly affected; the leak is not masked by any mitigation.
>
> Thanks,
> Weiming Shi
>

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>


Previously I thought this was the same as bpf_xdp_adjust_head /
bpf_xdp_adjust_meta, but since the function itself allocates a page,
I believe the cost of the __GFP_ZERO flag is negligible.

One more thing: in the future, more and more AI systems will complain
about this kind of problem. I believe the fix is worth it.



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-04-26  6:31 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-24 19:03 [PATCH bpf] bpf, sockmap: zero-initialize pages allocated in bpf_msg_push_data Weiming Shi
2026-04-25  3:17 ` Jiayuan Chen
2026-04-25 17:59   ` Weiming Shi
2026-04-26  6:31     ` Jiayuan Chen
2026-04-25 19:04 ` sashiko-bot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox