From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-183.mta1.migadu.com (out-183.mta1.migadu.com [95.215.58.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3CD9E246BC6 for ; Tue, 26 Aug 2025 15:14:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.183 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756221284; cv=none; b=OwCoi2NtOBL+emvLfeunALD2tHWWJ8H5OXi0rwsz6g3n1Et10jhESZZGeUxOLHFFlW3h6OZCSHm5Qsk4JWfAT6/FJ07CRDzPbQUL0PEVlPxhcDWlNwDb41HNtW5K91KdDxWGdSLhVNu7zF50FOYEK08Swp+n59iGzY0KWl6CYjM= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1756221284; c=relaxed/simple; bh=lSyzP83D4gbzZel3LrKsHONLu8+wCEnZ/3tvWAlrn7s=; h=Mime-Version:Content-Type:Date:Message-Id:Cc:Subject:From:To: References:In-Reply-To; b=bzqPwD8RvY0mE0fZLgFzGmSg6lC8/HGuZwweyFk04epYYPPUuMrs6LkZafmvXSDO8+/OHLphcYHlHr/1Gl61P5Uii2uY2GNnTM1MXFz5iuMctL0C/JGi6XSGjJrPRya8pLPtbs4UnRZIca9I0Y2nEhoWi13n3xq1MVaz7LD/4Lw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=iHU9DMbv; arc=none smtp.client-ip=95.215.58.183 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="iHU9DMbv" Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1756221277; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=AqHxo+iR9cy3xz1tS0lle7SItTgcTzunmy6GcF1Vw3A=; b=iHU9DMbva+VEpkAkZkqewOFMjD5QtxEq4mkh2pLGmrfrbjRlp7dKZwFvB5YXrAS9d80SSP DPoaT/GK648pD5neEXC3z9aXM4BAdjK26eglTYH3kFfgzjnuRZwMx41DuEDY1NxVTRAmND BHQnluEdBim5g1oQGVyEFW9FAOzPTMU= Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=UTF-8 Date: Tue, 26 Aug 2025 23:14:26 +0800 Message-Id: Cc: , , , , , , , , , , Subject: Re: [PATCH bpf-next v3 3/6] bpf: Introduce BPF_F_CPU flag for percpu_hash and lru_percpu_hash maps X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: "Leon Hwang" To: "Andrii Nakryiko" References: <20250821160817.70285-1-leon.hwang@linux.dev> <20250821160817.70285-4-leon.hwang@linux.dev> In-Reply-To: X-Migadu-Flow: FLOW_OUT On Sat Aug 23, 2025 at 6:14 AM +08, Andrii Nakryiko wrote: > On Thu, Aug 21, 2025 at 9:08=E2=80=AFAM Leon Hwang = wrote: >> >> Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_has= h >> maps to allow updating values for all CPUs with a single value. >> >> Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash >> maps to allow updating value for specified CPU. >> >> This enhancement enables: >> >> * Efficient update values across all CPUs with a single value when >> BPF_F_ALL_CPUS is set for update_elem and update_batch APIs. >> * Targeted update or lookup for a specified CPU when BPF_F_CPU is set. >> >> The BPF_F_CPU flag is passed via: >> >> * map_flags of lookup_elem and update_elem APIs along with embedded cpu >> field. >> * elem_flags of lookup_batch and update_batch APIs along with embedded >> cpu field. >> >> Signed-off-by: Leon Hwang >> --- >> include/linux/bpf.h | 54 +++++++++++++++++++- >> kernel/bpf/arraymap.c | 29 ++++------- >> kernel/bpf/hashtab.c | 111 +++++++++++++++++++++++++++++------------- >> kernel/bpf/syscall.c | 30 +++--------- >> 4 files changed, 147 insertions(+), 77 deletions(-) >> > > [...] > >> @@ -397,22 +395,14 @@ int bpf_percpu_array_update(struct bpf_map *map, v= oid *key, void *value, >> u64 map_flags) >> { >> struct bpf_array *array =3D container_of(map, struct bpf_array, = map); >> - const u64 cpu_flags =3D BPF_F_CPU | BPF_F_ALL_CPUS; >> u32 index =3D *(u32 *)key; >> void __percpu *pptr; >> + int off =3D 0, err; >> u32 size, cpu; >> - int off =3D 0; >> - >> - if (unlikely((u32)map_flags > BPF_F_ALL_CPUS)) >> - /* unknown flags */ >> - return -EINVAL; >> - if (unlikely((map_flags & cpu_flags) =3D=3D cpu_flags)) >> - return -EINVAL; >> >> - cpu =3D map_flags >> 32; >> - if (unlikely((map_flags & BPF_F_CPU) && cpu >=3D num_possible_cp= us())) >> - /* invalid cpu */ >> - return -ERANGE; >> + err =3D bpf_map_check_cpu_flags(map_flags, true); >> + if (unlikely(err)) >> + return err; > > again, unnecessary churn, why not add this function in previous patch > when you add cpu flags ? > Ack. > >> >> if (unlikely(index >=3D array->map.max_entries)) >> /* all elements were pre-allocated, cannot insert a new = one */ >> @@ -432,6 +422,7 @@ int bpf_percpu_array_update(struct bpf_map *map, voi= d *key, void *value, >> rcu_read_lock(); >> pptr =3D array->pptrs[index & array->index_mask]; >> if (map_flags & BPF_F_CPU) { >> + cpu =3D map_flags >> 32; >> copy_map_value_long(map, per_cpu_ptr(pptr, cpu), value); >> bpf_obj_free_fields(array->map.record, per_cpu_ptr(pptr,= cpu)); >> } else { >> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c >> index 71f9931ac64cd..34a35cdade425 100644 >> --- a/kernel/bpf/hashtab.c >> +++ b/kernel/bpf/hashtab.c >> @@ -937,24 +937,39 @@ static void free_htab_elem(struct bpf_htab *htab, = struct htab_elem *l) >> } >> >> static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr, >> - void *value, bool onallcpus) >> + void *value, bool onallcpus, u64 map_flags) >> { >> + int cpu =3D map_flags & BPF_F_CPU ? map_flags >> 32 : 0; >> + int current_cpu =3D raw_smp_processor_id(); >> + >> if (!onallcpus) { >> /* copy true value_size bytes */ >> - copy_map_value(&htab->map, this_cpu_ptr(pptr), value); >> + copy_map_value(&htab->map, (map_flags & BPF_F_CPU) && cp= u !=3D current_cpu ? >> + per_cpu_ptr(pptr, cpu) : this_cpu_ptr(ppt= r), value); > > is there any benefit to this cpu =3D=3D current_cpu special casing? > Wouldn't per_cpu_ptr() do the right thing even if cpu =3D=3D current_cpu? > per_cpu_ptr() seems doing the same thing of this_cpu_ptr(). However, after having a deep research of per_cpu_ptr() and this_cpu_ptr(), I think this_cpu_ptr() is more efficient than per_cpu_ptr(). e.g. DEFINE_PER_CPU(int, my_percpu_var) =3D 0; int percpu_read(int cpu); int thiscpu_read(void); int percpu_currcpu_read(int cpu); int percpu_read(int cpu) { return *(int *)(void *)per_cpu_ptr(&my_percpu_var, cpu); } EXPORT_SYMBOL(percpu_read); int thiscpu_read(void) { return *(int *)(void *)this_cpu_ptr(&my_percpu_var); } EXPORT_SYMBOL(thiscpu_read); int percpu_currcpu_read(int cpu) { if (cpu =3D=3D raw_smp_processor_id()) return *(int *)(void *)this_cpu_ptr(&my_percpu_var); else return *(int *)(void *)per_cpu_ptr(&my_percpu_var, cpu); } EXPORT_SYMBOL(percpu_currcpu_read); Source code link[0]. With disassembling these functions, the core insns of this_cpu_ptr() are: 0xffffffffc15c5076 <+6/0x6>: 48 c7 c0 04 60 03 00 movq $0x36004, %rax 0xffffffffc15c507d <+13/0xd>: 65 48 03 05 7b 49 a5 3e addq %gs:0x3ea549= 7b(%rip), %rax 0xffffffffc15c5085 <+21/0x15>: 8b 00 movl (%rax), %eax The core insns of per_cpu_ptr() are: 0xffffffffc15c501b <+11/0xb>: 49 c7 c4 04 60 03 00 movq $0x36004, %r12 0xffffffffc15c5022 <+18/0x12>: 53 pushq %rbx 0xffffffffc15c5023 <+19/0x13>: 48 63 df movslq %edi, %rbx 0xffffffffc15c5026 <+22/0x16>: 48 81 fb 00 20 00 00 cmpq $0x2000, %rbx 0xffffffffc15c502d <+29/0x1d>: 73 19 jae 0xffffffffc15c504= 8 ; percpu_read+0x38 0xffffffffc15c502f <+31/0x1f>: 48 8b 04 dd 80 fc 24 a8 movq -0x57db0380(= , %rbx, 8), %rax 0xffffffffc15c5037 <+39/0x27>: 5b popq %rbx 0xffffffffc15c5038 <+40/0x28>: 42 8b 04 20 movl (%rax, %r12), %e= ax 0xffffffffc15c503c <+44/0x2c>: 41 5c popq %r12 0xffffffffc15c503e <+46/0x2e>: 5d popq %rbp 0xffffffffc15c503f <+47/0x2f>: 31 f6 xorl %esi, %esi 0xffffffffc15c5041 <+49/0x31>: 31 ff xorl %edi, %edi 0xffffffffc15c5043 <+51/0x33>: c3 retq 0xffffffffc15c5044 <+52/0x34>: cc int3 0xffffffffc15c5045 <+53/0x35>: cc int3 0xffffffffc15c5046 <+54/0x36>: cc int3 0xffffffffc15c5047 <+55/0x37>: cc int3 0xffffffffc15c5048 <+56/0x38>: 48 89 de movq %rbx, %rsi 0xffffffffc15c504b <+59/0x3b>: 48 c7 c7 a0 70 5c c1 movq $18446744072658= 645152, %rdi 0xffffffffc15c5052 <+66/0x42>: e8 d9 87 ac e5 callq 0xffffffffa708d= 830 ; __ubsan_handle_out_of_bounds+0x0 lib/ubsan.c:336 0xffffffffc15c5057 <+71/0x47>: eb d6 jmp 0xffffffffc15c502= f ; percpu_read+0x1f Disasm log link[1]. As we can see, per_cpu_ptr() calls __ubsan_handle_out_of_bounds() to check index range of percpu offset array. The callq insn is much slower than the addq insn. Therefore, cpu =3D=3D current_cpu + this_cpu_ptr() is much better than per_cpu_ptr(). Links: [0] https://github.com/Asphaltt/kernel-module-fun/blob/master/percpu-ptr.c [1] https://github.com/Asphaltt/kernel-module-fun/blob/master/percpu_ptr.md >> } else { >> u32 size =3D round_up(htab->map.value_size, 8); >> - int off =3D 0, cpu; >> + int off =3D 0; >> + >> + if (map_flags & BPF_F_CPU) { >> + copy_map_value_long(&htab->map, cpu !=3D current= _cpu ? >> + per_cpu_ptr(pptr, cpu) : thi= s_cpu_ptr(pptr), value); >> + return; >> + } > > [...] > >> @@ -1806,10 +1834,17 @@ __htab_map_lookup_and_delete_batch(struct bpf_ma= p *map, >> void __percpu *pptr; >> >> pptr =3D htab_elem_get_ptr(l, map->key_size); >> - for_each_possible_cpu(cpu) { >> - copy_map_value_long(&htab->map, dst_val = + off, per_cpu_ptr(pptr, cpu)); >> - check_and_init_map_value(&htab->map, dst= _val + off); >> - off +=3D size; >> + if (!do_delete && (elem_map_flags & BPF_F_CPU)) = { > > if do_delete is true we can't have BPF_F_CPU set, right? We checked > that above, so why all these complications? > Ack. This !do_delete is unnecessary. I'll drop it in next revision. >> + cpu =3D elem_map_flags >> 32; >> + copy_map_value_long(&htab->map, dst_val,= per_cpu_ptr(pptr, cpu)); >> + check_and_init_map_value(&htab->map, dst= _val); >> + } else { >> + for_each_possible_cpu(cpu) { >> + copy_map_value_long(&htab->map, = dst_val + off, >> + per_cpu_ptr(= pptr, cpu)); >> + check_and_init_map_value(&htab->= map, dst_val + off); >> + off +=3D size; >> + } >> } >> } else { >> value =3D htab_elem_value(l, key_size); >> @@ -2365,14 +2400,18 @@ static void *htab_lru_percpu_map_lookup_percpu_e= lem(struct bpf_map *map, void *k >> return NULL; >> } >> >> -int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value) >> +int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value, u= 64 map_flags) >> { >> + int ret, cpu, off =3D 0; >> struct htab_elem *l; >> void __percpu *pptr; >> - int ret =3D -ENOENT; >> - int cpu, off =3D 0; >> u32 size; >> >> + ret =3D bpf_map_check_cpu_flags(map_flags, false); >> + if (unlikely(ret)) >> + return ret; >> + ret =3D -ENOENT; >> + >> /* per_cpu areas are zero-filled and bpf programs can only >> * access 'value_size' of them, so copying rounded areas >> * will not leak any kernel data >> @@ -2386,10 +2425,16 @@ int bpf_percpu_hash_copy(struct bpf_map *map, vo= id *key, void *value) >> * eviction heuristics when user space does a map walk. >> */ >> pptr =3D htab_elem_get_ptr(l, map->key_size); >> - for_each_possible_cpu(cpu) { >> - copy_map_value_long(map, value + off, per_cpu_ptr(pptr, = cpu)); >> - check_and_init_map_value(map, value + off); >> - off +=3D size; >> + if (map_flags & BPF_F_CPU) { >> + cpu =3D map_flags >> 32; >> + copy_map_value_long(map, value, per_cpu_ptr(pptr, cpu)); >> + check_and_init_map_value(map, value); >> + } else { >> + for_each_possible_cpu(cpu) { >> + copy_map_value_long(map, value + off, per_cpu_pt= r(pptr, cpu)); >> + check_and_init_map_value(map, value + off); >> + off +=3D size; >> + } >> } > > it feels like this whole logic of copying per-cpu value to/from user > should be generic between all per-cpu maps, once we get that `void > __percpu *` pointer, no? See if you can extract it as reusable helper > (but, you know, without all the per-map type special casing, though it > doesn't seem like you should need it, though I might be missing > details, of course). One for bpf_percpu_copy_to_user() and another for > bpf_percpu_copy_from_user(), which would take into account all these > cpu flags? > Good idea. Let's introduce these two helper functions. Thanks, Leon [...]