* [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack()
@ 2025-08-19 16:26 Arnaud Lecomte
2025-08-19 16:29 ` [PATCH bpf-next RESEND v4 2/2] bpf: fix stackmap overflow check in __bpf_get_stackid() Arnaud Lecomte
2025-08-19 21:15 ` [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Martin KaFai Lau
0 siblings, 2 replies; 7+ messages in thread
From: Arnaud Lecomte @ 2025-08-19 16:26 UTC (permalink / raw)
To: song
Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa,
kpsingh, linux-kernel, martin.lau, sdf,
syzbot+c9b724fbb41cf2538b7b, syzkaller-bugs, yonghong.song,
Arnaud Lecomte
A new helper function, stack_map_calculate_max_depth(), is added to
compute the max depth for a stackmap.

Changes in v2:
- Removed the checking 'map_size % map_elem_size' from
  stack_map_calculate_max_depth
- Changed stack_map_calculate_max_depth params name to be more generic

Changes in v3:
- Changed map size param to size in max depth helper

Changes in v4:
- Fixed indentation in max depth helper for args

Link to v3: https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/

Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 3615c06b7dfa..b9cc6c72a2a5 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct bpf_map *map)
sizeof(struct bpf_stack_build_id) : sizeof(u64);
}
+/**
+ * stack_map_calculate_max_depth - Calculate maximum allowed stack trace depth
+ * @size: Size of the buffer/map value in bytes
+ * @elem_size: Size of each stack trace element
+ * @flags: BPF stack trace flags (BPF_F_USER_STACK, BPF_F_USER_BUILD_ID, ...)
+ *
+ * Return: Maximum number of stack trace entries that can be safely stored
+ */
+static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, u64 flags)
+{
+ u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+ u32 max_depth;
+
+ max_depth = size / elem_size;
+ max_depth += skip;
+ if (max_depth > sysctl_perf_event_max_stack)
+ return sysctl_perf_event_max_stack;
+
+ return max_depth;
+}
+
static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
{
u64 elem_size = sizeof(struct stack_map_bucket) +
@@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
struct perf_callchain_entry *trace_in,
void *buf, u32 size, u64 flags, bool may_fault)
{
- u32 trace_nr, copy_len, elem_size, num_elem, max_depth;
+ u32 trace_nr, copy_len, elem_size, max_depth;
bool user_build_id = flags & BPF_F_USER_BUILD_ID;
bool crosstask = task && task != current;
u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
@@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
goto clear;
}
- num_elem = size / elem_size;
- max_depth = num_elem + skip;
- if (sysctl_perf_event_max_stack < max_depth)
- max_depth = sysctl_perf_event_max_stack;
+ max_depth = stack_map_calculate_max_depth(size, elem_size, flags);
if (may_fault)
rcu_read_lock(); /* need RCU for perf's callchain below */
@@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
}
trace_nr = trace->nr - skip;
- trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
+ trace_nr = min(trace_nr, max_depth - skip);
copy_len = trace_nr * elem_size;
ips = trace->ip + skip;
--
2.43.0
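
For intuition, the arithmetic the new helper centralizes can be exercised in a
standalone userspace sketch. The constants and the main() driver below are
illustrative assumptions; only the size / elem_size + skip computation and the
sysctl clamp mirror the patch above:

#include <stdio.h>

/* Illustrative stand-ins; in the kernel these come from uapi/bpf.h and the
 * perf_event_max_stack sysctl (127 is the usual default). */
#define BPF_F_SKIP_FIELD_MASK 0xffULL
static unsigned int sysctl_perf_event_max_stack = 127;

/* Same computation as stack_map_calculate_max_depth() in the patch. */
static unsigned int calc_max_depth(unsigned int size, unsigned int elem_size,
				   unsigned long long flags)
{
	unsigned int skip = flags & BPF_F_SKIP_FIELD_MASK;
	unsigned int max_depth = size / elem_size + skip;

	if (max_depth > sysctl_perf_event_max_stack)
		return sysctl_perf_event_max_stack;
	return max_depth;
}

int main(void)
{
	/* 512-byte value, 8-byte entries, skip 2 -> 64 + 2 = 66 */
	printf("%u\n", calc_max_depth(512, 8, 2));
	/* a large buffer is clamped to the sysctl limit -> 127 */
	printf("%u\n", calc_max_depth(8192, 8, 2));
	return 0;
}
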
* [PATCH bpf-next RESEND v4 2/2] bpf: fix stackmap overflow check in __bpf_get_stackid()
2025-08-19 16:26 [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Arnaud Lecomte
@ 2025-08-19 16:29 ` Arnaud Lecomte
2025-08-19 21:15 ` [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Martin KaFai Lau
1 sibling, 0 replies; 7+ messages in thread
From: Arnaud Lecomte @ 2025-08-19 16:29 UTC (permalink / raw)
To: song
Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa,
kpsingh, linux-kernel, martin.lau, sdf,
syzbot+c9b724fbb41cf2538b7b, syzkaller-bugs, yonghong.song,
Arnaud Lecomte
Syzkaller reported a KASAN slab-out-of-bounds write in __bpf_get_stackid()
when copying stack trace data. The issue occurs when the perf trace
contains more stack entries than the stack map bucket can hold,
leading to an out-of-bounds write in the bucket's data array.

Changes in v2:
- Fixed max_depth names across get stack id

Changes in v4:
- Removed unnecessary empty line in __bpf_get_stackid

Link to v3: https://lore.kernel.org/all/997d3b8a-4b3a-4720-8fa0-2f91447021bd@linux.dev/

Reported-by: syzbot+c9b724fbb41cf2538b7b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c9b724fbb41cf2538b7b
Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/stackmap.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index b9cc6c72a2a5..318f150460bb 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -246,7 +246,7 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
}
static long __bpf_get_stackid(struct bpf_map *map,
- struct perf_callchain_entry *trace, u64 flags)
+ struct perf_callchain_entry *trace, u64 flags, u32 max_depth)
{
struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map);
struct stack_map_bucket *bucket, *new_bucket, *old_bucket;
@@ -262,6 +262,8 @@ static long __bpf_get_stackid(struct bpf_map *map,
trace_nr = trace->nr - skip;
trace_len = trace_nr * sizeof(u64);
+ trace_nr = min(trace_nr, max_depth - skip);
+
ips = trace->ip + skip;
hash = jhash2((u32 *)ips, trace_len / sizeof(u32), 0);
id = hash & (smap->n_buckets - 1);
@@ -321,19 +323,17 @@ static long __bpf_get_stackid(struct bpf_map *map,
BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
u64, flags)
{
- u32 max_depth = map->value_size / stack_map_data_size(map);
- u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+ u32 elem_size = stack_map_data_size(map);
bool user = flags & BPF_F_USER_STACK;
struct perf_callchain_entry *trace;
bool kernel = !user;
+ u32 max_depth;
if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK |
BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID)))
return -EINVAL;
- max_depth += skip;
- if (max_depth > sysctl_perf_event_max_stack)
- max_depth = sysctl_perf_event_max_stack;
+ max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags);
trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
false, false);
@@ -342,7 +342,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
/* couldn't fetch the stack trace */
return -EFAULT;
- return __bpf_get_stackid(map, trace, flags);
+ return __bpf_get_stackid(map, trace, flags, max_depth);
}
const struct bpf_func_proto bpf_get_stackid_proto = {
@@ -374,6 +374,7 @@ BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, ctx,
bool kernel, user;
__u64 nr_kernel;
int ret;
+ u32 elem_size, max_depth;
/* perf_sample_data doesn't have callchain, use bpf_get_stackid */
if (!(event->attr.sample_type & PERF_SAMPLE_CALLCHAIN))
@@ -392,12 +393,13 @@ BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, ctx,
return -EFAULT;
nr_kernel = count_kernel_ip(trace);
-
+ elem_size = stack_map_data_size(map);
if (kernel) {
__u64 nr = trace->nr;
trace->nr = nr_kernel;
- ret = __bpf_get_stackid(map, trace, flags);
+ max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags);
+ ret = __bpf_get_stackid(map, trace, flags, max_depth);
/* restore nr */
trace->nr = nr;
@@ -409,7 +411,8 @@ BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, ctx,
return -EFAULT;
flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip;
- ret = __bpf_get_stackid(map, trace, flags);
+ max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags);
+ ret = __bpf_get_stackid(map, trace, flags, max_depth);
}
return ret;
}
--
2.43.0
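
To make the reported overflow more concrete: the bucket's data area holds
map->value_size bytes, while the copy length is derived from the number of
entries perf returned. Below is a minimal userspace sketch of the bounding
idea behind the fix; the sizes are illustrative assumptions, not values from
the syzbot reproducer, and the flow is simplified compared to
__bpf_get_stackid():

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Illustrative sizes; in the kernel the bucket data area holds
 * map->value_size bytes and elem_size is 8 (or the build-id record size). */
#define VALUE_SIZE	64
#define ELEM_SIZE	sizeof(uint64_t)
#define BUCKET_SLOTS	(VALUE_SIZE / ELEM_SIZE)	/* 8 entries */

int main(void)
{
	uint64_t ips[32] = { 0 };	/* pretend perf returned 32 entries */
	uint8_t bucket_data[VALUE_SIZE];
	unsigned int skip = 0;
	unsigned int trace_nr = 32;			/* trace->nr - skip */
	unsigned int max_depth = BUCKET_SLOTS + skip;	/* from the helper */

	/* Unclamped, 32 * 8 = 256 bytes would be written into a 64-byte
	 * bucket -- the slab-out-of-bounds write syzbot reported. */
	if (trace_nr > max_depth - skip)
		trace_nr = max_depth - skip;

	memcpy(bucket_data, ips, trace_nr * ELEM_SIZE);	/* bounded copy */
	printf("copied %u of 32 entries\n", trace_nr);
	return 0;
}
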
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack()
2025-08-19 16:26 [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Arnaud Lecomte
2025-08-19 16:29 ` [PATCH bpf-next RESEND v4 2/2] bpf: fix stackmap overflow check in __bpf_get_stackid() Arnaud Lecomte
@ 2025-08-19 21:15 ` Martin KaFai Lau
2025-08-25 16:39 ` Lecomte, Arnaud
1 sibling, 1 reply; 7+ messages in thread
From: Martin KaFai Lau @ 2025-08-19 21:15 UTC (permalink / raw)
To: Arnaud Lecomte, yonghong.song
Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa,
kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b,
syzkaller-bugs, song
On 8/19/25 9:26 AM, Arnaud Lecomte wrote:
> A new helper function stack_map_calculate_max_depth() that
> computes the max depth for a stackmap.
>
> Changes in v2:
> - Removed the checking 'map_size % map_elem_size' from
> stack_map_calculate_max_depth
> - Changed stack_map_calculate_max_depth params name to be more generic
>
> Changes in v3:
> - Changed map size param to size in max depth helper
>
> Changes in v4:
> - Fixed indentation in max depth helper for args
>
> Link to v3: https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/
>
> Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
> Acked-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------
> 1 file changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 3615c06b7dfa..b9cc6c72a2a5 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct bpf_map *map)
> sizeof(struct bpf_stack_build_id) : sizeof(u64);
> }
>
> +/**
> + * stack_map_calculate_max_depth - Calculate maximum allowed stack trace depth
> + * @size: Size of the buffer/map value in bytes
> + * @elem_size: Size of each stack trace element
> + * @flags: BPF stack trace flags (BPF_F_USER_STACK, BPF_F_USER_BUILD_ID, ...)
> + *
> + * Return: Maximum number of stack trace entries that can be safely stored
> + */
> +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, u64 flags)
> +{
> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
> + u32 max_depth;
> +
> + max_depth = size / elem_size;
> + max_depth += skip;
> + if (max_depth > sysctl_perf_event_max_stack)
> + return sysctl_perf_event_max_stack;
hmm... this looks a bit suspicious. Is it possible that
sysctl_perf_event_max_stack is being changed to a larger value in parallel?
> +
> + return max_depth;
> +}
> +
> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
> {
> u64 elem_size = sizeof(struct stack_map_bucket) +
> @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> struct perf_callchain_entry *trace_in,
> void *buf, u32 size, u64 flags, bool may_fault)
> {
> - u32 trace_nr, copy_len, elem_size, num_elem, max_depth;
> + u32 trace_nr, copy_len, elem_size, max_depth;
> bool user_build_id = flags & BPF_F_USER_BUILD_ID;
> bool crosstask = task && task != current;
> u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
> @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> goto clear;
> }
>
> - num_elem = size / elem_size;
> - max_depth = num_elem + skip;
> - if (sysctl_perf_event_max_stack < max_depth)
> - max_depth = sysctl_perf_event_max_stack;
> + max_depth = stack_map_calculate_max_depth(size, elem_size, flags);
>
> if (may_fault)
> rcu_read_lock(); /* need RCU for perf's callchain below */
> @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> }
>
> trace_nr = trace->nr - skip;
> - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
I suspect it was fine because trace_nr was still bounded by num_elem.
> + trace_nr = min(trace_nr, max_depth - skip);
but now the min() is also based on max_depth which could be
sysctl_perf_event_max_stack.
Besides, if I read it correctly, in "max_depth - skip", the max_depth could also
be less than skip. I assume trace->nr is bound by max_depth, so should be less
of a problem but still a bit unintuitive to read.
> copy_len = trace_nr * elem_size;
>
> ips = trace->ip + skip;
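
For reference, the wrap-around being pointed out above is easy to reproduce in
isolation. The numbers below are illustrative: 127 is only the usual
perf_event_max_stack default, and BPF_F_SKIP_FIELD_MASK allows skip values up
to 255:

#include <stdio.h>

int main(void)
{
	unsigned int sysctl_max = 127;	/* assumed perf_event_max_stack */
	unsigned int skip = 200;	/* user-controlled via flags, up to 255 */
	unsigned int size = 512, elem_size = 8;

	unsigned int max_depth = size / elem_size + skip;	/* 264 */
	if (max_depth > sysctl_max)
		max_depth = sysctl_max;				/* clamped: 127 */

	/* 127 - 200 wraps to a huge u32, so min(trace_nr, max_depth - skip)
	 * would not constrain trace_nr at all in that corner case. */
	printf("max_depth - skip = %u\n", max_depth - skip);
	return 0;
}
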
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack()
2025-08-19 21:15 ` [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Martin KaFai Lau
@ 2025-08-25 16:39 ` Lecomte, Arnaud
2025-08-25 18:27 ` Yonghong Song
0 siblings, 1 reply; 7+ messages in thread
From: Lecomte, Arnaud @ 2025-08-25 16:39 UTC (permalink / raw)
To: Martin KaFai Lau, yonghong.song
Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa,
kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b,
syzkaller-bugs, song
On 19/08/2025 22:15, Martin KaFai Lau wrote:
> On 8/19/25 9:26 AM, Arnaud Lecomte wrote:
>> A new helper function stack_map_calculate_max_depth() that
>> computes the max depth for a stackmap.
>>
>> Changes in v2:
>> - Removed the checking 'map_size % map_elem_size' from
>> stack_map_calculate_max_depth
>> - Changed stack_map_calculate_max_depth params name to be more generic
>>
>> Changes in v3:
>> - Changed map size param to size in max depth helper
>>
>> Changes in v4:
>> - Fixed indentation in max depth helper for args
>>
>> Link to v3:
>> https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/
>>
>> Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>> ---
>> kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------
>> 1 file changed, 24 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index 3615c06b7dfa..b9cc6c72a2a5 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct
>> bpf_map *map)
>> sizeof(struct bpf_stack_build_id) : sizeof(u64);
>> }
>> +/**
>> + * stack_map_calculate_max_depth - Calculate maximum allowed stack
>> trace depth
>> + * @size: Size of the buffer/map value in bytes
>> + * @elem_size: Size of each stack trace element
>> + * @flags: BPF stack trace flags (BPF_F_USER_STACK,
>> BPF_F_USER_BUILD_ID, ...)
>> + *
>> + * Return: Maximum number of stack trace entries that can be safely
>> stored
>> + */
>> +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size,
>> u64 flags)
>> +{
>> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>> + u32 max_depth;
>> +
>> + max_depth = size / elem_size;
>> + max_depth += skip;
>> + if (max_depth > sysctl_perf_event_max_stack)
>> + return sysctl_perf_event_max_stack;
>
> hmm... this looks a bit suspicious. Is it possible that
> sysctl_perf_event_max_stack is being changed to a larger value in
> parallel?
>
Hi Martin, this is a valid concern as sysctl_perf_event_max_stack can be
modified at runtime through /proc/sys/kernel/perf_event_max_stack.
What we could maybe do instead is to create a copy: u32 current_max =
READ_ONCE(sysctl_perf_event_max_stack);
Any thoughts on this?
>> +
>> + return max_depth;
>> +}
>> +
>> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
>> {
>> u64 elem_size = sizeof(struct stack_map_bucket) +
>> @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs *regs,
>> struct task_struct *task,
>> struct perf_callchain_entry *trace_in,
>> void *buf, u32 size, u64 flags, bool may_fault)
>> {
>> - u32 trace_nr, copy_len, elem_size, num_elem, max_depth;
>> + u32 trace_nr, copy_len, elem_size, max_depth;
>> bool user_build_id = flags & BPF_F_USER_BUILD_ID;
>> bool crosstask = task && task != current;
>> u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>> @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs
>> *regs, struct task_struct *task,
>> goto clear;
>> }
>> - num_elem = size / elem_size;
>> - max_depth = num_elem + skip;
>> - if (sysctl_perf_event_max_stack < max_depth)
>> - max_depth = sysctl_perf_event_max_stack;
>> + max_depth = stack_map_calculate_max_depth(size, elem_size, flags);
>> if (may_fault)
>> rcu_read_lock(); /* need RCU for perf's callchain below */
>> @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs *regs,
>> struct task_struct *task,
>> }
>> trace_nr = trace->nr - skip;
>> - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
>
> I suspect it was fine because trace_nr was still bounded by num_elem.
>
We should bring back the num_elem bound as an additional safety net.
>> + trace_nr = min(trace_nr, max_depth - skip);
>
> but now the min() is also based on max_depth which could be
> sysctl_perf_event_max_stack.
>
> beside, if I read it correctly, in "max_depth - skip", the max_depth
> could also be less than skip. I assume trace->nr is bound by
> max_depth, so should be less of a problem but still a bit unintuitive
> to read.
>
>> copy_len = trace_nr * elem_size;
>> ips = trace->ip + skip;
>
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack()
2025-08-25 16:39 ` Lecomte, Arnaud
@ 2025-08-25 18:27 ` Yonghong Song
2025-08-25 20:07 ` Lecomte, Arnaud
0 siblings, 1 reply; 7+ messages in thread
From: Yonghong Song @ 2025-08-25 18:27 UTC (permalink / raw)
To: Lecomte, Arnaud, Martin KaFai Lau
Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa,
kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b,
syzkaller-bugs, song
On 8/25/25 9:39 AM, Lecomte, Arnaud wrote:
>
> On 19/08/2025 22:15, Martin KaFai Lau wrote:
>> On 8/19/25 9:26 AM, Arnaud Lecomte wrote:
>>> A new helper function stack_map_calculate_max_depth() that
>>> computes the max depth for a stackmap.
>>>
>>> Changes in v2:
>>> - Removed the checking 'map_size % map_elem_size' from
>>> stack_map_calculate_max_depth
>>> - Changed stack_map_calculate_max_depth params name to be more
>>> generic
>>>
>>> Changes in v3:
>>> - Changed map size param to size in max depth helper
>>>
>>> Changes in v4:
>>> - Fixed indentation in max depth helper for args
>>>
>>> Link to v3:
>>> https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/
>>>
>>> Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
>>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>>> ---
>>> kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------
>>> 1 file changed, 24 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>> index 3615c06b7dfa..b9cc6c72a2a5 100644
>>> --- a/kernel/bpf/stackmap.c
>>> +++ b/kernel/bpf/stackmap.c
>>> @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct
>>> bpf_map *map)
>>> sizeof(struct bpf_stack_build_id) : sizeof(u64);
>>> }
>>> +/**
>>> + * stack_map_calculate_max_depth - Calculate maximum allowed stack
>>> trace depth
>>> + * @size: Size of the buffer/map value in bytes
>>> + * @elem_size: Size of each stack trace element
>>> + * @flags: BPF stack trace flags (BPF_F_USER_STACK,
>>> BPF_F_USER_BUILD_ID, ...)
>>> + *
>>> + * Return: Maximum number of stack trace entries that can be safely
>>> stored
>>> + */
>>> +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size,
>>> u64 flags)
>>> +{
>>> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>>> + u32 max_depth;
>>> +
>>> + max_depth = size / elem_size;
>>> + max_depth += skip;
>>> + if (max_depth > sysctl_perf_event_max_stack)
>>> + return sysctl_perf_event_max_stack;
>>
>> hmm... this looks a bit suspicious. Is it possible that
>> sysctl_perf_event_max_stack is being changed to a larger value in
>> parallel?
>>
> Hi Martin, this is a valid concern as sysctl_perf_event_max_stack can
> be modified at runtime through /proc/sys/kernel/perf_event_max_stack.
> What we could maybe do instead is to create a copy: u32 current_max =
> READ_ONCE(sysctl_perf_event_max_stack);
> Any thoughts on this ?
There is no need to have READ_ONCE. Just do
	int curr_sysctl_max_stack = sysctl_perf_event_max_stack;
	if (max_depth > curr_sysctl_max_stack)
		return curr_sysctl_max_stack;
Because of the above change, the patch is not a refactoring change any more.
>
>>> +
>>> + return max_depth;
>>> +}
>>> +
>>> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
>>> {
>>> u64 elem_size = sizeof(struct stack_map_bucket) +
>>> @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs
>>> *regs, struct task_struct *task,
>>> struct perf_callchain_entry *trace_in,
>>> void *buf, u32 size, u64 flags, bool may_fault)
>>> {
>>> - u32 trace_nr, copy_len, elem_size, num_elem, max_depth;
>>> + u32 trace_nr, copy_len, elem_size, max_depth;
>>> bool user_build_id = flags & BPF_F_USER_BUILD_ID;
>>> bool crosstask = task && task != current;
>>> u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>>> @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs
>>> *regs, struct task_struct *task,
>>> goto clear;
>>> }
>>> - num_elem = size / elem_size;
>>> - max_depth = num_elem + skip;
>>> - if (sysctl_perf_event_max_stack < max_depth)
>>> - max_depth = sysctl_perf_event_max_stack;
>>> + max_depth = stack_map_calculate_max_depth(size, elem_size, flags);
>>> if (may_fault)
>>> rcu_read_lock(); /* need RCU for perf's callchain below */
>>> @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs
>>> *regs, struct task_struct *task,
>>> }
>>> trace_nr = trace->nr - skip;
>>> - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
>>
>> I suspect it was fine because trace_nr was still bounded by num_elem.
>>
> We should bring back the num_elem bound as an additional safe net.
>>> + trace_nr = min(trace_nr, max_depth - skip);
>>
>> but now the min() is also based on max_depth which could be
>> sysctl_perf_event_max_stack.
>>
>> beside, if I read it correctly, in "max_depth - skip", the max_depth
>> could also be less than skip. I assume trace->nr is bound by
>> max_depth, so should be less of a problem but still a bit unintuitive
>> to read.
>>
>>> copy_len = trace_nr * elem_size;
>>> ips = trace->ip + skip;
>>
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack()
2025-08-25 18:27 ` Yonghong Song
@ 2025-08-25 20:07 ` Lecomte, Arnaud
2025-08-25 21:15 ` Yonghong Song
0 siblings, 1 reply; 7+ messages in thread
From: Lecomte, Arnaud @ 2025-08-25 20:07 UTC (permalink / raw)
To: Yonghong Song, Martin KaFai Lau
Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa,
kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b,
syzkaller-bugs, song
On 25/08/2025 19:27, Yonghong Song wrote:
>
>
> On 8/25/25 9:39 AM, Lecomte, Arnaud wrote:
>>
>> On 19/08/2025 22:15, Martin KaFai Lau wrote:
>>> On 8/19/25 9:26 AM, Arnaud Lecomte wrote:
>>>> A new helper function stack_map_calculate_max_depth() that
>>>> computes the max depth for a stackmap.
>>>>
>>>> Changes in v2:
>>>> - Removed the checking 'map_size % map_elem_size' from
>>>> stack_map_calculate_max_depth
>>>> - Changed stack_map_calculate_max_depth params name to be more
>>>> generic
>>>>
>>>> Changes in v3:
>>>> - Changed map size param to size in max depth helper
>>>>
>>>> Changes in v4:
>>>> - Fixed indentation in max depth helper for args
>>>>
>>>> Link to v3:
>>>> https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/
>>>>
>>>> Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
>>>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>>>> ---
>>>> kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------
>>>> 1 file changed, 24 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>>> index 3615c06b7dfa..b9cc6c72a2a5 100644
>>>> --- a/kernel/bpf/stackmap.c
>>>> +++ b/kernel/bpf/stackmap.c
>>>> @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct
>>>> bpf_map *map)
>>>> sizeof(struct bpf_stack_build_id) : sizeof(u64);
>>>> }
>>>> +/**
>>>> + * stack_map_calculate_max_depth - Calculate maximum allowed stack
>>>> trace depth
>>>> + * @size: Size of the buffer/map value in bytes
>>>> + * @elem_size: Size of each stack trace element
>>>> + * @flags: BPF stack trace flags (BPF_F_USER_STACK,
>>>> BPF_F_USER_BUILD_ID, ...)
>>>> + *
>>>> + * Return: Maximum number of stack trace entries that can be
>>>> safely stored
>>>> + */
>>>> +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size,
>>>> u64 flags)
>>>> +{
>>>> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>>>> + u32 max_depth;
>>>> +
>>>> + max_depth = size / elem_size;
>>>> + max_depth += skip;
>>>> + if (max_depth > sysctl_perf_event_max_stack)
>>>> + return sysctl_perf_event_max_stack;
>>>
>>> hmm... this looks a bit suspicious. Is it possible that
>>> sysctl_perf_event_max_stack is being changed to a larger value in
>>> parallel?
>>>
>> Hi Martin, this is a valid concern as sysctl_perf_event_max_stack can
>> be modified at runtime through /proc/sys/kernel/perf_event_max_stack.
>> What we could maybe do instead is to create a copy: u32 current_max =
>> READ_ONCE(sysctl_perf_event_max_stack);
>> Any thoughts on this ?
>
> There is no need to have READ_ONCE. Jut do
> int curr_sysctl_max_stack = sysctl_perf_event_max_stack;
> if (max_depth > curr_sysctl_max_stack)
> return curr_sysctl_max_stack;
>
> Because of the above change, the patch is not a refactoring change any
> more.
>
Why would you not consider it a refactoring change anymore?
>>
>>>> +
>>>> + return max_depth;
>>>> +}
>>>> +
>>>> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
>>>> {
>>>> u64 elem_size = sizeof(struct stack_map_bucket) +
>>>> @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs
>>>> *regs, struct task_struct *task,
>>>> struct perf_callchain_entry *trace_in,
>>>> void *buf, u32 size, u64 flags, bool may_fault)
>>>> {
>>>> - u32 trace_nr, copy_len, elem_size, num_elem, max_depth;
>>>> + u32 trace_nr, copy_len, elem_size, max_depth;
>>>> bool user_build_id = flags & BPF_F_USER_BUILD_ID;
>>>> bool crosstask = task && task != current;
>>>> u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>>>> @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs
>>>> *regs, struct task_struct *task,
>>>> goto clear;
>>>> }
>>>> - num_elem = size / elem_size;
>>>> - max_depth = num_elem + skip;
>>>> - if (sysctl_perf_event_max_stack < max_depth)
>>>> - max_depth = sysctl_perf_event_max_stack;
>>>> + max_depth = stack_map_calculate_max_depth(size, elem_size,
>>>> flags);
>>>> if (may_fault)
>>>> rcu_read_lock(); /* need RCU for perf's callchain below */
>>>> @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs
>>>> *regs, struct task_struct *task,
>>>> }
>>>> trace_nr = trace->nr - skip;
>>>> - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
>>>
>>> I suspect it was fine because trace_nr was still bounded by num_elem.
>>>
>> We should bring back the num_elem bound as an additional safe net.
>>>> + trace_nr = min(trace_nr, max_depth - skip);
>>>
>>> but now the min() is also based on max_depth which could be
>>> sysctl_perf_event_max_stack.
>>>
>>> beside, if I read it correctly, in "max_depth - skip", the max_depth
>>> could also be less than skip. I assume trace->nr is bound by
>>> max_depth, so should be less of a problem but still a bit
>>> unintuitive to read.
>>>
>>>> copy_len = trace_nr * elem_size;
>>>> ips = trace->ip + skip;
>>>
>
>
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack()
2025-08-25 20:07 ` Lecomte, Arnaud
@ 2025-08-25 21:15 ` Yonghong Song
0 siblings, 0 replies; 7+ messages in thread
From: Yonghong Song @ 2025-08-25 21:15 UTC (permalink / raw)
To: Lecomte, Arnaud, Martin KaFai Lau
Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa,
kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b,
syzkaller-bugs, song
On 8/25/25 1:07 PM, Lecomte, Arnaud wrote:
>
> On 25/08/2025 19:27, Yonghong Song wrote:
>>
>>
>> On 8/25/25 9:39 AM, Lecomte, Arnaud wrote:
>>>
>>> On 19/08/2025 22:15, Martin KaFai Lau wrote:
>>>> On 8/19/25 9:26 AM, Arnaud Lecomte wrote:
>>>>> A new helper function stack_map_calculate_max_depth() that
>>>>> computes the max depth for a stackmap.
>>>>>
>>>>> Changes in v2:
>>>>> - Removed the checking 'map_size % map_elem_size' from
>>>>> stack_map_calculate_max_depth
>>>>> - Changed stack_map_calculate_max_depth params name to be more
>>>>> generic
>>>>>
>>>>> Changes in v3:
>>>>> - Changed map size param to size in max depth helper
>>>>>
>>>>> Changes in v4:
>>>>> - Fixed indentation in max depth helper for args
>>>>>
>>>>> Link to v3:
>>>>> https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/
>>>>>
>>>>> Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com>
>>>>> Acked-by: Yonghong Song <yonghong.song@linux.dev>
>>>>> ---
>>>>> kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------
>>>>> 1 file changed, 24 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>>>>> index 3615c06b7dfa..b9cc6c72a2a5 100644
>>>>> --- a/kernel/bpf/stackmap.c
>>>>> +++ b/kernel/bpf/stackmap.c
>>>>> @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct
>>>>> bpf_map *map)
>>>>> sizeof(struct bpf_stack_build_id) : sizeof(u64);
>>>>> }
>>>>> +/**
>>>>> + * stack_map_calculate_max_depth - Calculate maximum allowed
>>>>> stack trace depth
>>>>> + * @size: Size of the buffer/map value in bytes
>>>>> + * @elem_size: Size of each stack trace element
>>>>> + * @flags: BPF stack trace flags (BPF_F_USER_STACK,
>>>>> BPF_F_USER_BUILD_ID, ...)
>>>>> + *
>>>>> + * Return: Maximum number of stack trace entries that can be
>>>>> safely stored
>>>>> + */
>>>>> +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size,
>>>>> u64 flags)
>>>>> +{
>>>>> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>>>>> + u32 max_depth;
>>>>> +
>>>>> + max_depth = size / elem_size;
>>>>> + max_depth += skip;
>>>>> + if (max_depth > sysctl_perf_event_max_stack)
>>>>> + return sysctl_perf_event_max_stack;
>>>>
>>>> hmm... this looks a bit suspicious. Is it possible that
>>>> sysctl_perf_event_max_stack is being changed to a larger value in
>>>> parallel?
>>>>
>>> Hi Martin, this is a valid concern as sysctl_perf_event_max_stack
>>> can be modified at runtime through
>>> /proc/sys/kernel/perf_event_max_stack.
>>> What we could maybe do instead is to create a copy: u32 current_max
>>> = READ_ONCE(sysctl_perf_event_max_stack);
>>> Any thoughts on this ?
>>
>> There is no need to have READ_ONCE. Jut do
>> int curr_sysctl_max_stack = sysctl_perf_event_max_stack;
>> if (max_depth > curr_sysctl_max_stack)
>> return curr_sysctl_max_stack;
>>
>> Because of the above change, the patch is not a refactoring change
>> any more.
>>
> Why would you not consider it as a refactoring change anymore ?
Sorry, I think I made a couple of mistakes in the above.
First, yes, we do want READ_ONCE; otherwise the compiler may potentially optimize
the above back to the original code with two references to sysctl_perf_event_max_stack.
Second, yes, it is indeed a refactoring.
>>>
>>>>> +
>>>>> + return max_depth;
>>>>> +}
>>>>> +
>>>>> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
>>>>> {
>>>>> u64 elem_size = sizeof(struct stack_map_bucket) +
>>>>> @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs
>>>>> *regs, struct task_struct *task,
>>>>> struct perf_callchain_entry *trace_in,
>>>>> void *buf, u32 size, u64 flags, bool may_fault)
>>>>> {
>>>>> - u32 trace_nr, copy_len, elem_size, num_elem, max_depth;
>>>>> + u32 trace_nr, copy_len, elem_size, max_depth;
>>>>> bool user_build_id = flags & BPF_F_USER_BUILD_ID;
>>>>> bool crosstask = task && task != current;
>>>>> u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
>>>>> @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs
>>>>> *regs, struct task_struct *task,
>>>>> goto clear;
>>>>> }
>>>>> - num_elem = size / elem_size;
>>>>> - max_depth = num_elem + skip;
>>>>> - if (sysctl_perf_event_max_stack < max_depth)
>>>>> - max_depth = sysctl_perf_event_max_stack;
>>>>> + max_depth = stack_map_calculate_max_depth(size, elem_size,
>>>>> flags);
>>>>> if (may_fault)
>>>>> rcu_read_lock(); /* need RCU for perf's callchain below */
>>>>> @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs
>>>>> *regs, struct task_struct *task,
>>>>> }
>>>>> trace_nr = trace->nr - skip;
>>>>> - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem;
>>>>
>>>> I suspect it was fine because trace_nr was still bounded by num_elem.
>>>>
>>> We should bring back the num_elem bound as an additional safe net.
>>>>> + trace_nr = min(trace_nr, max_depth - skip);
>>>>
>>>> but now the min() is also based on max_depth which could be
>>>> sysctl_perf_event_max_stack.
>>>>
>>>> beside, if I read it correctly, in "max_depth - skip", the
>>>> max_depth could also be less than skip. I assume trace->nr is bound
>>>> by max_depth, so should be less of a problem but still a bit
>>>> unintuitive to read.
>>>>
>>>>> copy_len = trace_nr * elem_size;
>>>>> ips = trace->ip + skip;
>>>>
>>
>>
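
Putting the thread's conclusion together, the helper would take a single
READ_ONCE() snapshot of the sysctl so that the value compared against is also
the value returned. A compilable userspace sketch of that direction; the
READ_ONCE and BPF_F_SKIP_FIELD_MASK definitions are stand-ins for the kernel
ones, and this is a sketch of the idea agreed above, not an actual kernel
patch:

#include <stdio.h>
#include <stdint.h>

/* Userspace stand-ins so the sketch compiles outside the kernel. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define BPF_F_SKIP_FIELD_MASK	0xffULL

static unsigned int sysctl_perf_event_max_stack = 127;

static uint32_t stack_map_calculate_max_depth(uint32_t size, uint32_t elem_size,
					      uint64_t flags)
{
	/* One snapshot: the value we compare against is the value we return,
	 * even if the sysctl is being changed concurrently. */
	uint32_t curr_max = READ_ONCE(sysctl_perf_event_max_stack);
	uint32_t skip = flags & BPF_F_SKIP_FIELD_MASK;
	uint32_t max_depth = size / elem_size + skip;

	return max_depth > curr_max ? curr_max : max_depth;
}

int main(void)
{
	printf("%u\n", stack_map_calculate_max_depth(8192, 8, 2));	/* 127 */
	return 0;
}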