* [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() @ 2025-08-19 16:26 Arnaud Lecomte 2025-08-19 16:29 ` [PATCH bpf-next RESEND v4 2/2] bpf: fix stackmap overflow check in __bpf_get_stackid() Arnaud Lecomte 2025-08-19 21:15 ` [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Martin KaFai Lau 0 siblings, 2 replies; 7+ messages in thread From: Arnaud Lecomte @ 2025-08-19 16:26 UTC (permalink / raw) To: song Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa, kpsingh, linux-kernel, martin.lau, sdf, syzbot+c9b724fbb41cf2538b7b, syzkaller-bugs, yonghong.song, Arnaud Lecomte A new helper function stack_map_calculate_max_depth() that computes the max depth for a stackmap. Changes in v2: - Removed the checking 'map_size % map_elem_size' from stack_map_calculate_max_depth - Changed stack_map_calculate_max_depth params name to be more generic Changes in v3: - Changed map size param to size in max depth helper Changes in v4: - Fixed indentation in max depth helper for args Link to v3: https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/ Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> --- kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index 3615c06b7dfa..b9cc6c72a2a5 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct bpf_map *map) sizeof(struct bpf_stack_build_id) : sizeof(u64); } +/** + * stack_map_calculate_max_depth - Calculate maximum allowed stack trace depth + * @size: Size of the buffer/map value in bytes + * @elem_size: Size of each stack trace element + * @flags: BPF stack trace flags (BPF_F_USER_STACK, BPF_F_USER_BUILD_ID, ...) + * + * Return: Maximum number of stack trace entries that can be safely stored + */ +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, u64 flags) +{ + u32 skip = flags & BPF_F_SKIP_FIELD_MASK; + u32 max_depth; + + max_depth = size / elem_size; + max_depth += skip; + if (max_depth > sysctl_perf_event_max_stack) + return sysctl_perf_event_max_stack; + + return max_depth; +} + static int prealloc_elems_and_freelist(struct bpf_stack_map *smap) { u64 elem_size = sizeof(struct stack_map_bucket) + @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, struct perf_callchain_entry *trace_in, void *buf, u32 size, u64 flags, bool may_fault) { - u32 trace_nr, copy_len, elem_size, num_elem, max_depth; + u32 trace_nr, copy_len, elem_size, max_depth; bool user_build_id = flags & BPF_F_USER_BUILD_ID; bool crosstask = task && task != current; u32 skip = flags & BPF_F_SKIP_FIELD_MASK; @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, goto clear; } - num_elem = size / elem_size; - max_depth = num_elem + skip; - if (sysctl_perf_event_max_stack < max_depth) - max_depth = sysctl_perf_event_max_stack; + max_depth = stack_map_calculate_max_depth(size, elem_size, flags); if (may_fault) rcu_read_lock(); /* need RCU for perf's callchain below */ @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, } trace_nr = trace->nr - skip; - trace_nr = (trace_nr <= num_elem) ? 
trace_nr : num_elem; + trace_nr = min(trace_nr, max_depth - skip); copy_len = trace_nr * elem_size; ips = trace->ip + skip; -- 2.43.0 ^ permalink raw reply related [flat|nested] 7+ messages in thread
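For readers who are not deep in the stack-map code, the bound the new helper encodes is the one a BPF program implicitly sets through its buffer size: bpf_get_stack() copies at most size / sizeof(u64) entries (or size / sizeof(struct bpf_stack_build_id) with BPF_F_USER_BUILD_ID), and the kernel further clamps the walked depth to sysctl_perf_event_max_stack. A minimal caller-side sketch follows; the map type, attach point and MAX_ENTRIES value are illustrative assumptions, not part of the patch:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_ENTRIES 64	/* illustrative depth, not taken from the patch */

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64[MAX_ENTRIES]);
} stack_scratch SEC(".maps");

SEC("kprobe/do_sys_openat2")
int collect_stack(void *ctx)
{
	__u32 key = 0;
	__u64 *buf = bpf_map_lookup_elem(&stack_scratch, &key);
	long ret;

	if (!buf)
		return 0;

	/* size is passed in bytes; the kernel derives the entry bound from it */
	ret = bpf_get_stack(ctx, buf, MAX_ENTRIES * sizeof(__u64),
			    BPF_F_USER_STACK);
	/* ret < 0 on error, otherwise the number of bytes actually written */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";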
* [PATCH bpf-next RESEND v4 2/2] bpf: fix stackmap overflow check in __bpf_get_stackid() 2025-08-19 16:26 [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Arnaud Lecomte @ 2025-08-19 16:29 ` Arnaud Lecomte 2025-08-19 21:15 ` [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Martin KaFai Lau 1 sibling, 0 replies; 7+ messages in thread From: Arnaud Lecomte @ 2025-08-19 16:29 UTC (permalink / raw) To: song Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa, kpsingh, linux-kernel, martin.lau, sdf, syzbot+c9b724fbb41cf2538b7b, syzkaller-bugs, yonghong.song, Arnaud Lecomte Syzkaller reported a KASAN slab-out-of-bounds write in __bpf_get_stackid() when copying stack trace data. The issue occurs when the perf trace contains more stack entries than the stack map bucket can hold, leading to an out-of-bounds write in the bucket's data array. Changes in v2: - Fixed max_depth names across get stack id Changes in v4: - Removed unnecessary empty line in __bpf_get_stackid Link to v3: https://lore.kernel.org/all/997d3b8a-4b3a-4720-8fa0-2f91447021bd@linux.dev/ Reported-by: syzbot+c9b724fbb41cf2538b7b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c9b724fbb41cf2538b7b Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> --- kernel/bpf/stackmap.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index b9cc6c72a2a5..318f150460bb 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -246,7 +246,7 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth) } static long __bpf_get_stackid(struct bpf_map *map, - struct perf_callchain_entry *trace, u64 flags) + struct perf_callchain_entry *trace, u64 flags, u32 max_depth) { struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map); struct stack_map_bucket *bucket, *new_bucket, *old_bucket; @@ -262,6 +262,8 @@ static long __bpf_get_stackid(struct bpf_map *map, trace_nr = trace->nr - skip; trace_len = trace_nr * sizeof(u64); + trace_nr = min(trace_nr, max_depth - skip); + ips = trace->ip + skip; hash = jhash2((u32 *)ips, trace_len / sizeof(u32), 0); id = hash & (smap->n_buckets - 1); @@ -321,19 +323,17 @@ static long __bpf_get_stackid(struct bpf_map *map, BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, u64, flags) { - u32 max_depth = map->value_size / stack_map_data_size(map); - u32 skip = flags & BPF_F_SKIP_FIELD_MASK; + u32 elem_size = stack_map_data_size(map); bool user = flags & BPF_F_USER_STACK; struct perf_callchain_entry *trace; bool kernel = !user; + u32 max_depth; if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) return -EINVAL; - max_depth += skip; - if (max_depth > sysctl_perf_event_max_stack) - max_depth = sysctl_perf_event_max_stack; + max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); trace = get_perf_callchain(regs, 0, kernel, user, max_depth, false, false); @@ -342,7 +342,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, /* couldn't fetch the stack trace */ return -EFAULT; - return __bpf_get_stackid(map, trace, flags); + return __bpf_get_stackid(map, trace, flags, max_depth); } const struct bpf_func_proto bpf_get_stackid_proto = { @@ -374,6 +374,7 @@ BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, 
ctx, bool kernel, user; __u64 nr_kernel; int ret; + u32 elem_size, max_depth; /* perf_sample_data doesn't have callchain, use bpf_get_stackid */ if (!(event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)) @@ -392,12 +393,13 @@ BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, ctx, return -EFAULT; nr_kernel = count_kernel_ip(trace); - + elem_size = stack_map_data_size(map); if (kernel) { __u64 nr = trace->nr; trace->nr = nr_kernel; - ret = __bpf_get_stackid(map, trace, flags); + max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); + ret = __bpf_get_stackid(map, trace, flags, max_depth); /* restore nr */ trace->nr = nr; @@ -409,7 +411,8 @@ BPF_CALL_3(bpf_get_stackid_pe, struct bpf_perf_event_data_kern *, ctx, return -EFAULT; flags = (flags & ~BPF_F_SKIP_FIELD_MASK) | skip; - ret = __bpf_get_stackid(map, trace, flags); + max_depth = stack_map_calculate_max_depth(map->value_size, elem_size, flags); + ret = __bpf_get_stackid(map, trace, flags, max_depth); } return ret; } -- 2.43.0 ^ permalink raw reply related [flat|nested] 7+ messages in thread
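To make the arithmetic behind the report concrete: a stack-map bucket holds map->value_size bytes, i.e. value_size / elem_size entries (elem_size being sizeof(u64) or sizeof(struct bpf_stack_build_id)), while a perf callchain may carry up to sysctl_perf_event_max_stack entries. Below is a simplified user-space model of the clamp this patch introduces; clamp_trace_nr() is a made-up name, and the extra max_depth > skip guard is a belt-and-braces addition in the sketch, not something the patch itself adds:

#include <assert.h>
#include <stdint.h>

/* models: trace_nr = min(trace_nr, max_depth - skip) */
static uint32_t clamp_trace_nr(uint32_t trace_nr, uint32_t max_depth,
			       uint32_t skip)
{
	uint32_t room = max_depth > skip ? max_depth - skip : 0;

	return trace_nr < room ? trace_nr : room;
}

int main(void)
{
	/*
	 * value_size = 32 bytes of u64 ips -> room for 4 entries, but the
	 * callchain came back with 20 usable entries: without the clamp,
	 * 160 bytes would be copied into a 32-byte bucket, which is the
	 * slab-out-of-bounds write syzbot reported.
	 */
	assert(clamp_trace_nr(20, 4, 0) == 4);
	return 0;
}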
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() 2025-08-19 16:26 [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Arnaud Lecomte 2025-08-19 16:29 ` [PATCH bpf-next RESEND v4 2/2] bpf: fix stackmap overflow check in __bpf_get_stackid() Arnaud Lecomte @ 2025-08-19 21:15 ` Martin KaFai Lau 2025-08-25 16:39 ` Lecomte, Arnaud 1 sibling, 1 reply; 7+ messages in thread From: Martin KaFai Lau @ 2025-08-19 21:15 UTC (permalink / raw) To: Arnaud Lecomte, yonghong.song Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa, kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b, syzkaller-bugs, song On 8/19/25 9:26 AM, Arnaud Lecomte wrote: > A new helper function stack_map_calculate_max_depth() that > computes the max depth for a stackmap. > > Changes in v2: > - Removed the checking 'map_size % map_elem_size' from > stack_map_calculate_max_depth > - Changed stack_map_calculate_max_depth params name to be more generic > > Changes in v3: > - Changed map size param to size in max depth helper > > Changes in v4: > - Fixed indentation in max depth helper for args > > Link to v3: https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/ > > Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> > Acked-by: Yonghong Song <yonghong.song@linux.dev> > --- > kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------ > 1 file changed, 24 insertions(+), 6 deletions(-) > > diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c > index 3615c06b7dfa..b9cc6c72a2a5 100644 > --- a/kernel/bpf/stackmap.c > +++ b/kernel/bpf/stackmap.c > @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct bpf_map *map) > sizeof(struct bpf_stack_build_id) : sizeof(u64); > } > > +/** > + * stack_map_calculate_max_depth - Calculate maximum allowed stack trace depth > + * @size: Size of the buffer/map value in bytes > + * @elem_size: Size of each stack trace element > + * @flags: BPF stack trace flags (BPF_F_USER_STACK, BPF_F_USER_BUILD_ID, ...) > + * > + * Return: Maximum number of stack trace entries that can be safely stored > + */ > +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, u64 flags) > +{ > + u32 skip = flags & BPF_F_SKIP_FIELD_MASK; > + u32 max_depth; > + > + max_depth = size / elem_size; > + max_depth += skip; > + if (max_depth > sysctl_perf_event_max_stack) > + return sysctl_perf_event_max_stack; hmm... this looks a bit suspicious. Is it possible that sysctl_perf_event_max_stack is being changed to a larger value in parallel? 
> + > + return max_depth; > +} > + > static int prealloc_elems_and_freelist(struct bpf_stack_map *smap) > { > u64 elem_size = sizeof(struct stack_map_bucket) + > @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > struct perf_callchain_entry *trace_in, > void *buf, u32 size, u64 flags, bool may_fault) > { > - u32 trace_nr, copy_len, elem_size, num_elem, max_depth; > + u32 trace_nr, copy_len, elem_size, max_depth; > bool user_build_id = flags & BPF_F_USER_BUILD_ID; > bool crosstask = task && task != current; > u32 skip = flags & BPF_F_SKIP_FIELD_MASK; > @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > goto clear; > } > > - num_elem = size / elem_size; > - max_depth = num_elem + skip; > - if (sysctl_perf_event_max_stack < max_depth) > - max_depth = sysctl_perf_event_max_stack; > + max_depth = stack_map_calculate_max_depth(size, elem_size, flags); > > if (may_fault) > rcu_read_lock(); /* need RCU for perf's callchain below */ > @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task, > } > > trace_nr = trace->nr - skip; > - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem; I suspect it was fine because trace_nr was still bounded by num_elem. > + trace_nr = min(trace_nr, max_depth - skip); but now the min() is also based on max_depth which could be sysctl_perf_event_max_stack. beside, if I read it correctly, in "max_depth - skip", the max_depth could also be less than skip. I assume trace->nr is bound by max_depth, so should be less of a problem but still a bit unintuitive to read. > copy_len = trace_nr * elem_size; > > ips = trace->ip + skip; ^ permalink raw reply [flat|nested] 7+ messages in thread
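To illustrate the second point: skip comes from the low byte of flags (BPF_F_SKIP_FIELD_MASK), so it can be as large as 255, and perf_event_max_stack can be lowered at runtime. With made-up numbers, the clamped max_depth can indeed drop below skip, and "max_depth - skip" then wraps around as a u32:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t skip = 10;		/* flags & BPF_F_SKIP_FIELD_MASK */
	uint32_t sysctl_max = 3;	/* perf_event_max_stack lowered at runtime */
	uint32_t max_depth = 4 + skip;	/* size / elem_size == 4 */

	if (max_depth > sysctl_max)
		max_depth = sysctl_max;	/* clamped to 3, which is below skip */

	printf("max_depth - skip = %u\n", max_depth - skip); /* prints 4294967289 */
	return 0;
}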
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() 2025-08-19 21:15 ` [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() Martin KaFai Lau @ 2025-08-25 16:39 ` Lecomte, Arnaud 2025-08-25 18:27 ` Yonghong Song 0 siblings, 1 reply; 7+ messages in thread From: Lecomte, Arnaud @ 2025-08-25 16:39 UTC (permalink / raw) To: Martin KaFai Lau, yonghong.song Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa, kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b, syzkaller-bugs, song On 19/08/2025 22:15, Martin KaFai Lau wrote: > On 8/19/25 9:26 AM, Arnaud Lecomte wrote: >> A new helper function stack_map_calculate_max_depth() that >> computes the max depth for a stackmap. >> >> Changes in v2: >> - Removed the checking 'map_size % map_elem_size' from >> stack_map_calculate_max_depth >> - Changed stack_map_calculate_max_depth params name to be more generic >> >> Changes in v3: >> - Changed map size param to size in max depth helper >> >> Changes in v4: >> - Fixed indentation in max depth helper for args >> >> Link to v3: >> https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/ >> >> Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> >> Acked-by: Yonghong Song <yonghong.song@linux.dev> >> --- >> kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------ >> 1 file changed, 24 insertions(+), 6 deletions(-) >> >> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >> index 3615c06b7dfa..b9cc6c72a2a5 100644 >> --- a/kernel/bpf/stackmap.c >> +++ b/kernel/bpf/stackmap.c >> @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct >> bpf_map *map) >> sizeof(struct bpf_stack_build_id) : sizeof(u64); >> } >> +/** >> + * stack_map_calculate_max_depth - Calculate maximum allowed stack >> trace depth >> + * @size: Size of the buffer/map value in bytes >> + * @elem_size: Size of each stack trace element >> + * @flags: BPF stack trace flags (BPF_F_USER_STACK, >> BPF_F_USER_BUILD_ID, ...) >> + * >> + * Return: Maximum number of stack trace entries that can be safely >> stored >> + */ >> +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, >> u64 flags) >> +{ >> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK; >> + u32 max_depth; >> + >> + max_depth = size / elem_size; >> + max_depth += skip; >> + if (max_depth > sysctl_perf_event_max_stack) >> + return sysctl_perf_event_max_stack; > > hmm... this looks a bit suspicious. Is it possible that > sysctl_perf_event_max_stack is being changed to a larger value in > parallel? > Hi Martin, this is a valid concern as sysctl_perf_event_max_stack can be modified at runtime through /proc/sys/kernel/perf_event_max_stack. What we could maybe do instead is to create a copy: u32 current_max = READ_ONCE(sysctl_perf_event_max_stack); Any thoughts on this ? 
>> + >> + return max_depth; >> +} >> + >> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap) >> { >> u64 elem_size = sizeof(struct stack_map_bucket) + >> @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs *regs, >> struct task_struct *task, >> struct perf_callchain_entry *trace_in, >> void *buf, u32 size, u64 flags, bool may_fault) >> { >> - u32 trace_nr, copy_len, elem_size, num_elem, max_depth; >> + u32 trace_nr, copy_len, elem_size, max_depth; >> bool user_build_id = flags & BPF_F_USER_BUILD_ID; >> bool crosstask = task && task != current; >> u32 skip = flags & BPF_F_SKIP_FIELD_MASK; >> @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs >> *regs, struct task_struct *task, >> goto clear; >> } >> - num_elem = size / elem_size; >> - max_depth = num_elem + skip; >> - if (sysctl_perf_event_max_stack < max_depth) >> - max_depth = sysctl_perf_event_max_stack; >> + max_depth = stack_map_calculate_max_depth(size, elem_size, flags); >> if (may_fault) >> rcu_read_lock(); /* need RCU for perf's callchain below */ >> @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs *regs, >> struct task_struct *task, >> } >> trace_nr = trace->nr - skip; >> - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem; > > I suspect it was fine because trace_nr was still bounded by num_elem. > We should bring back the num_elem bound as an additional safe net. >> + trace_nr = min(trace_nr, max_depth - skip); > > but now the min() is also based on max_depth which could be > sysctl_perf_event_max_stack. > > beside, if I read it correctly, in "max_depth - skip", the max_depth > could also be less than skip. I assume trace->nr is bound by > max_depth, so should be less of a problem but still a bit unintuitive > to read. > >> copy_len = trace_nr * elem_size; >> ips = trace->ip + skip; > ^ permalink raw reply [flat|nested] 7+ messages in thread
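One way the proposed snapshot could look inside the helper — a sketch of the suggestion only, not code that has been applied:

static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, u64 flags)
{
	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
	/* read the sysctl once so the comparison and the returned value
	 * cannot observe two different limits */
	u32 curr_max = READ_ONCE(sysctl_perf_event_max_stack);
	u32 max_depth;

	max_depth = size / elem_size;
	max_depth += skip;
	if (max_depth > curr_max)
		return curr_max;

	return max_depth;
}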
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() 2025-08-25 16:39 ` Lecomte, Arnaud @ 2025-08-25 18:27 ` Yonghong Song 2025-08-25 20:07 ` Lecomte, Arnaud 0 siblings, 1 reply; 7+ messages in thread From: Yonghong Song @ 2025-08-25 18:27 UTC (permalink / raw) To: Lecomte, Arnaud, Martin KaFai Lau Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa, kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b, syzkaller-bugs, song On 8/25/25 9:39 AM, Lecomte, Arnaud wrote: > > On 19/08/2025 22:15, Martin KaFai Lau wrote: >> On 8/19/25 9:26 AM, Arnaud Lecomte wrote: >>> A new helper function stack_map_calculate_max_depth() that >>> computes the max depth for a stackmap. >>> >>> Changes in v2: >>> - Removed the checking 'map_size % map_elem_size' from >>> stack_map_calculate_max_depth >>> - Changed stack_map_calculate_max_depth params name to be more >>> generic >>> >>> Changes in v3: >>> - Changed map size param to size in max depth helper >>> >>> Changes in v4: >>> - Fixed indentation in max depth helper for args >>> >>> Link to v3: >>> https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/ >>> >>> Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> >>> Acked-by: Yonghong Song <yonghong.song@linux.dev> >>> --- >>> kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------ >>> 1 file changed, 24 insertions(+), 6 deletions(-) >>> >>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >>> index 3615c06b7dfa..b9cc6c72a2a5 100644 >>> --- a/kernel/bpf/stackmap.c >>> +++ b/kernel/bpf/stackmap.c >>> @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct >>> bpf_map *map) >>> sizeof(struct bpf_stack_build_id) : sizeof(u64); >>> } >>> +/** >>> + * stack_map_calculate_max_depth - Calculate maximum allowed stack >>> trace depth >>> + * @size: Size of the buffer/map value in bytes >>> + * @elem_size: Size of each stack trace element >>> + * @flags: BPF stack trace flags (BPF_F_USER_STACK, >>> BPF_F_USER_BUILD_ID, ...) >>> + * >>> + * Return: Maximum number of stack trace entries that can be safely >>> stored >>> + */ >>> +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, >>> u64 flags) >>> +{ >>> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK; >>> + u32 max_depth; >>> + >>> + max_depth = size / elem_size; >>> + max_depth += skip; >>> + if (max_depth > sysctl_perf_event_max_stack) >>> + return sysctl_perf_event_max_stack; >> >> hmm... this looks a bit suspicious. Is it possible that >> sysctl_perf_event_max_stack is being changed to a larger value in >> parallel? >> > Hi Martin, this is a valid concern as sysctl_perf_event_max_stack can > be modified at runtime through /proc/sys/kernel/perf_event_max_stack. > What we could maybe do instead is to create a copy: u32 current_max = > READ_ONCE(sysctl_perf_event_max_stack); > Any thoughts on this ? There is no need to have READ_ONCE. Jut do int curr_sysctl_max_stack = sysctl_perf_event_max_stack; if (max_depth > curr_sysctl_max_stack) return curr_sysctl_max_stack; Because of the above change, the patch is not a refactoring change any more. 
> >>> + >>> + return max_depth; >>> +} >>> + >>> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap) >>> { >>> u64 elem_size = sizeof(struct stack_map_bucket) + >>> @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs >>> *regs, struct task_struct *task, >>> struct perf_callchain_entry *trace_in, >>> void *buf, u32 size, u64 flags, bool may_fault) >>> { >>> - u32 trace_nr, copy_len, elem_size, num_elem, max_depth; >>> + u32 trace_nr, copy_len, elem_size, max_depth; >>> bool user_build_id = flags & BPF_F_USER_BUILD_ID; >>> bool crosstask = task && task != current; >>> u32 skip = flags & BPF_F_SKIP_FIELD_MASK; >>> @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs >>> *regs, struct task_struct *task, >>> goto clear; >>> } >>> - num_elem = size / elem_size; >>> - max_depth = num_elem + skip; >>> - if (sysctl_perf_event_max_stack < max_depth) >>> - max_depth = sysctl_perf_event_max_stack; >>> + max_depth = stack_map_calculate_max_depth(size, elem_size, flags); >>> if (may_fault) >>> rcu_read_lock(); /* need RCU for perf's callchain below */ >>> @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs >>> *regs, struct task_struct *task, >>> } >>> trace_nr = trace->nr - skip; >>> - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem; >> >> I suspect it was fine because trace_nr was still bounded by num_elem. >> > We should bring back the num_elem bound as an additional safe net. >>> + trace_nr = min(trace_nr, max_depth - skip); >> >> but now the min() is also based on max_depth which could be >> sysctl_perf_event_max_stack. >> >> beside, if I read it correctly, in "max_depth - skip", the max_depth >> could also be less than skip. I assume trace->nr is bound by >> max_depth, so should be less of a problem but still a bit unintuitive >> to read. >> >>> copy_len = trace_nr * elem_size; >>> ips = trace->ip + skip; >> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() 2025-08-25 18:27 ` Yonghong Song @ 2025-08-25 20:07 ` Lecomte, Arnaud 2025-08-25 21:15 ` Yonghong Song 0 siblings, 1 reply; 7+ messages in thread From: Lecomte, Arnaud @ 2025-08-25 20:07 UTC (permalink / raw) To: Yonghong Song, Martin KaFai Lau Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa, kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b, syzkaller-bugs, song On 25/08/2025 19:27, Yonghong Song wrote: > > > On 8/25/25 9:39 AM, Lecomte, Arnaud wrote: >> >> On 19/08/2025 22:15, Martin KaFai Lau wrote: >>> On 8/19/25 9:26 AM, Arnaud Lecomte wrote: >>>> A new helper function stack_map_calculate_max_depth() that >>>> computes the max depth for a stackmap. >>>> >>>> Changes in v2: >>>> - Removed the checking 'map_size % map_elem_size' from >>>> stack_map_calculate_max_depth >>>> - Changed stack_map_calculate_max_depth params name to be more >>>> generic >>>> >>>> Changes in v3: >>>> - Changed map size param to size in max depth helper >>>> >>>> Changes in v4: >>>> - Fixed indentation in max depth helper for args >>>> >>>> Link to v3: >>>> https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/ >>>> >>>> Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> >>>> Acked-by: Yonghong Song <yonghong.song@linux.dev> >>>> --- >>>> kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------ >>>> 1 file changed, 24 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >>>> index 3615c06b7dfa..b9cc6c72a2a5 100644 >>>> --- a/kernel/bpf/stackmap.c >>>> +++ b/kernel/bpf/stackmap.c >>>> @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct >>>> bpf_map *map) >>>> sizeof(struct bpf_stack_build_id) : sizeof(u64); >>>> } >>>> +/** >>>> + * stack_map_calculate_max_depth - Calculate maximum allowed stack >>>> trace depth >>>> + * @size: Size of the buffer/map value in bytes >>>> + * @elem_size: Size of each stack trace element >>>> + * @flags: BPF stack trace flags (BPF_F_USER_STACK, >>>> BPF_F_USER_BUILD_ID, ...) >>>> + * >>>> + * Return: Maximum number of stack trace entries that can be >>>> safely stored >>>> + */ >>>> +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, >>>> u64 flags) >>>> +{ >>>> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK; >>>> + u32 max_depth; >>>> + >>>> + max_depth = size / elem_size; >>>> + max_depth += skip; >>>> + if (max_depth > sysctl_perf_event_max_stack) >>>> + return sysctl_perf_event_max_stack; >>> >>> hmm... this looks a bit suspicious. Is it possible that >>> sysctl_perf_event_max_stack is being changed to a larger value in >>> parallel? >>> >> Hi Martin, this is a valid concern as sysctl_perf_event_max_stack can >> be modified at runtime through /proc/sys/kernel/perf_event_max_stack. >> What we could maybe do instead is to create a copy: u32 current_max = >> READ_ONCE(sysctl_perf_event_max_stack); >> Any thoughts on this ? > > There is no need to have READ_ONCE. Jut do > int curr_sysctl_max_stack = sysctl_perf_event_max_stack; > if (max_depth > curr_sysctl_max_stack) > return curr_sysctl_max_stack; > > Because of the above change, the patch is not a refactoring change any > more. > Why would you not consider it as a refactoring change anymore ? 
>> >>>> + >>>> + return max_depth; >>>> +} >>>> + >>>> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap) >>>> { >>>> u64 elem_size = sizeof(struct stack_map_bucket) + >>>> @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs >>>> *regs, struct task_struct *task, >>>> struct perf_callchain_entry *trace_in, >>>> void *buf, u32 size, u64 flags, bool may_fault) >>>> { >>>> - u32 trace_nr, copy_len, elem_size, num_elem, max_depth; >>>> + u32 trace_nr, copy_len, elem_size, max_depth; >>>> bool user_build_id = flags & BPF_F_USER_BUILD_ID; >>>> bool crosstask = task && task != current; >>>> u32 skip = flags & BPF_F_SKIP_FIELD_MASK; >>>> @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs >>>> *regs, struct task_struct *task, >>>> goto clear; >>>> } >>>> - num_elem = size / elem_size; >>>> - max_depth = num_elem + skip; >>>> - if (sysctl_perf_event_max_stack < max_depth) >>>> - max_depth = sysctl_perf_event_max_stack; >>>> + max_depth = stack_map_calculate_max_depth(size, elem_size, >>>> flags); >>>> if (may_fault) >>>> rcu_read_lock(); /* need RCU for perf's callchain below */ >>>> @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs >>>> *regs, struct task_struct *task, >>>> } >>>> trace_nr = trace->nr - skip; >>>> - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem; >>> >>> I suspect it was fine because trace_nr was still bounded by num_elem. >>> >> We should bring back the num_elem bound as an additional safe net. >>>> + trace_nr = min(trace_nr, max_depth - skip); >>> >>> but now the min() is also based on max_depth which could be >>> sysctl_perf_event_max_stack. >>> >>> beside, if I read it correctly, in "max_depth - skip", the max_depth >>> could also be less than skip. I assume trace->nr is bound by >>> max_depth, so should be less of a problem but still a bit >>> unintuitive to read. >>> >>>> copy_len = trace_nr * elem_size; >>>> ips = trace->ip + skip; >>> > > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH bpf-next RESEND v4 1/2] bpf: refactor max_depth computation in bpf_get_stack() 2025-08-25 20:07 ` Lecomte, Arnaud @ 2025-08-25 21:15 ` Yonghong Song 0 siblings, 0 replies; 7+ messages in thread From: Yonghong Song @ 2025-08-25 21:15 UTC (permalink / raw) To: Lecomte, Arnaud, Martin KaFai Lau Cc: andrii, ast, bpf, daniel, eddyz87, haoluo, john.fastabend, jolsa, kpsingh, linux-kernel, sdf, syzbot+c9b724fbb41cf2538b7b, syzkaller-bugs, song On 8/25/25 1:07 PM, Lecomte, Arnaud wrote: > > On 25/08/2025 19:27, Yonghong Song wrote: >> >> >> On 8/25/25 9:39 AM, Lecomte, Arnaud wrote: >>> >>> On 19/08/2025 22:15, Martin KaFai Lau wrote: >>>> On 8/19/25 9:26 AM, Arnaud Lecomte wrote: >>>>> A new helper function stack_map_calculate_max_depth() that >>>>> computes the max depth for a stackmap. >>>>> >>>>> Changes in v2: >>>>> - Removed the checking 'map_size % map_elem_size' from >>>>> stack_map_calculate_max_depth >>>>> - Changed stack_map_calculate_max_depth params name to be more >>>>> generic >>>>> >>>>> Changes in v3: >>>>> - Changed map size param to size in max depth helper >>>>> >>>>> Changes in v4: >>>>> - Fixed indentation in max depth helper for args >>>>> >>>>> Link to v3: >>>>> https://lore.kernel.org/all/09dc40eb-a84e-472a-8a68-36a2b1835308@linux.dev/ >>>>> >>>>> Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> >>>>> Acked-by: Yonghong Song <yonghong.song@linux.dev> >>>>> --- >>>>> kernel/bpf/stackmap.c | 30 ++++++++++++++++++++++++------ >>>>> 1 file changed, 24 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c >>>>> index 3615c06b7dfa..b9cc6c72a2a5 100644 >>>>> --- a/kernel/bpf/stackmap.c >>>>> +++ b/kernel/bpf/stackmap.c >>>>> @@ -42,6 +42,27 @@ static inline int stack_map_data_size(struct >>>>> bpf_map *map) >>>>> sizeof(struct bpf_stack_build_id) : sizeof(u64); >>>>> } >>>>> +/** >>>>> + * stack_map_calculate_max_depth - Calculate maximum allowed >>>>> stack trace depth >>>>> + * @size: Size of the buffer/map value in bytes >>>>> + * @elem_size: Size of each stack trace element >>>>> + * @flags: BPF stack trace flags (BPF_F_USER_STACK, >>>>> BPF_F_USER_BUILD_ID, ...) >>>>> + * >>>>> + * Return: Maximum number of stack trace entries that can be >>>>> safely stored >>>>> + */ >>>>> +static u32 stack_map_calculate_max_depth(u32 size, u32 elem_size, >>>>> u64 flags) >>>>> +{ >>>>> + u32 skip = flags & BPF_F_SKIP_FIELD_MASK; >>>>> + u32 max_depth; >>>>> + >>>>> + max_depth = size / elem_size; >>>>> + max_depth += skip; >>>>> + if (max_depth > sysctl_perf_event_max_stack) >>>>> + return sysctl_perf_event_max_stack; >>>> >>>> hmm... this looks a bit suspicious. Is it possible that >>>> sysctl_perf_event_max_stack is being changed to a larger value in >>>> parallel? >>>> >>> Hi Martin, this is a valid concern as sysctl_perf_event_max_stack >>> can be modified at runtime through >>> /proc/sys/kernel/perf_event_max_stack. >>> What we could maybe do instead is to create a copy: u32 current_max >>> = READ_ONCE(sysctl_perf_event_max_stack); >>> Any thoughts on this ? >> >> There is no need to have READ_ONCE. Jut do >> int curr_sysctl_max_stack = sysctl_perf_event_max_stack; >> if (max_depth > curr_sysctl_max_stack) >> return curr_sysctl_max_stack; >> >> Because of the above change, the patch is not a refactoring change >> any more. >> > Why would you not consider it as a refactoring change anymore ? Sorry, I think I made a couple of mistakes in the above. 
First, yes, we do want READ_ONCE, other potentially compiler may optimization the above back to the original code with two references to sysctl_perf_event_max_stack. Second, yes, it is indeed a refactoring. >>> >>>>> + >>>>> + return max_depth; >>>>> +} >>>>> + >>>>> static int prealloc_elems_and_freelist(struct bpf_stack_map *smap) >>>>> { >>>>> u64 elem_size = sizeof(struct stack_map_bucket) + >>>>> @@ -406,7 +427,7 @@ static long __bpf_get_stack(struct pt_regs >>>>> *regs, struct task_struct *task, >>>>> struct perf_callchain_entry *trace_in, >>>>> void *buf, u32 size, u64 flags, bool may_fault) >>>>> { >>>>> - u32 trace_nr, copy_len, elem_size, num_elem, max_depth; >>>>> + u32 trace_nr, copy_len, elem_size, max_depth; >>>>> bool user_build_id = flags & BPF_F_USER_BUILD_ID; >>>>> bool crosstask = task && task != current; >>>>> u32 skip = flags & BPF_F_SKIP_FIELD_MASK; >>>>> @@ -438,10 +459,7 @@ static long __bpf_get_stack(struct pt_regs >>>>> *regs, struct task_struct *task, >>>>> goto clear; >>>>> } >>>>> - num_elem = size / elem_size; >>>>> - max_depth = num_elem + skip; >>>>> - if (sysctl_perf_event_max_stack < max_depth) >>>>> - max_depth = sysctl_perf_event_max_stack; >>>>> + max_depth = stack_map_calculate_max_depth(size, elem_size, >>>>> flags); >>>>> if (may_fault) >>>>> rcu_read_lock(); /* need RCU for perf's callchain below */ >>>>> @@ -461,7 +479,7 @@ static long __bpf_get_stack(struct pt_regs >>>>> *regs, struct task_struct *task, >>>>> } >>>>> trace_nr = trace->nr - skip; >>>>> - trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem; >>>> >>>> I suspect it was fine because trace_nr was still bounded by num_elem. >>>> >>> We should bring back the num_elem bound as an additional safe net. >>>>> + trace_nr = min(trace_nr, max_depth - skip); >>>> >>>> but now the min() is also based on max_depth which could be >>>> sysctl_perf_event_max_stack. >>>> >>>> beside, if I read it correctly, in "max_depth - skip", the >>>> max_depth could also be less than skip. I assume trace->nr is bound >>>> by max_depth, so should be less of a problem but still a bit >>>> unintuitive to read. >>>> >>>>> copy_len = trace_nr * elem_size; >>>>> ips = trace->ip + skip; >>>> >> >> ^ permalink raw reply [flat|nested] 7+ messages in thread
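Spelling out the correction: with a plain local the compiler is free to drop the temporary and read the global twice, which recreates the original window, while READ_ONCE() forces a single load. A user-space model of the two variants — the volatile cast stands in for the kernel's READ_ONCE() and the function names are invented for illustration:

#include <stdint.h>

extern int sysctl_perf_event_max_stack;	/* may be updated concurrently */

uint32_t clamp_plain(uint32_t max_depth)
{
	int curr = sysctl_perf_event_max_stack;	/* the load may be repeated below */

	if (max_depth > (uint32_t)curr)
		return (uint32_t)curr;	/* potentially a second, newer read */
	return max_depth;
}

uint32_t clamp_snapshot(uint32_t max_depth)
{
	/* user-space stand-in for READ_ONCE(): one non-elidable load */
	uint32_t curr = (uint32_t)*(volatile int *)&sysctl_perf_event_max_stack;

	if (max_depth > curr)
		return curr;
	return max_depth;
}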