BPF List
From: Vadim Fedorenko <vadim.fedorenko@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Eduard Zingerman <eddyz87@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>, X86 ML <x86@kernel.org>,
	bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH bpf-next 1/2] bpf: add bpf_get_hw_counter kfunc
Date: Thu, 24 Oct 2024 14:11:10 +0100	[thread overview]
Message-ID: <4d7f00e6-8fab-4274-8121-620820b99f02@linux.dev> (raw)
In-Reply-To: <CAADnVQ+YRj2_wWYkT20yo+5+G5B11d3NCZ8TBuCKJz+SJo37iw@mail.gmail.com>

On 24/10/2024 02:39, Alexei Starovoitov wrote:
> On Wed, Oct 23, 2024 at 2:05 PM Vadim Fedorenko <vadfed@meta.com> wrote:
>>
>> New kfunc to return ARCH-specific timecounter. For x86 BPF JIT converts
>> it into rdtsc ordered call. Other architectures will get JIT
>> implementation too if supported. The fallback is to
>> __arch_get_hw_counter().
> 
> arch_get_hw_counter is a great idea.
> 
>> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
>> ---
>>   arch/x86/net/bpf_jit_comp.c   | 23 +++++++++++++++++++++++
>>   arch/x86/net/bpf_jit_comp32.c | 11 +++++++++++
>>   kernel/bpf/helpers.c          |  7 +++++++
>>   kernel/bpf/verifier.c         | 11 +++++++++++
>>   4 files changed, 52 insertions(+)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index 06b080b61aa5..55595a0fa55b 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -2126,6 +2126,29 @@ st:                      if (is_imm8(insn->off))
>>                  case BPF_JMP | BPF_CALL: {
>>                          u8 *ip = image + addrs[i - 1];
>>
>> +                       if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL && !imm32) {
>> +                               if (insn->dst_reg == 1) {
>> +                                       struct cpuinfo_x86 *c = &cpu_data(get_boot_cpu_id());
>> +
>> +                                       /* Save RDX because RDTSC will use EDX:EAX to return u64 */
>> +                                       emit_mov_reg(&prog, true, AUX_REG, BPF_REG_3);
>> +                                       if (cpu_has(c, X86_FEATURE_LFENCE_RDTSC))
>> +                                               EMIT_LFENCE();
>> +                                       EMIT2(0x0F, 0x31);
>> +
>> +                                       /* shl RDX, 32 */
>> +                                       maybe_emit_1mod(&prog, BPF_REG_3, true);
>> +                                       EMIT3(0xC1, add_1reg(0xE0, BPF_REG_3), 32);
>> +                                       /* or RAX, RDX */
>> +                                       maybe_emit_mod(&prog, BPF_REG_0, BPF_REG_3, true);
>> +                                       EMIT2(0x09, add_2reg(0xC0, BPF_REG_0, BPF_REG_3));
>> +                                       /* restore RDX from R11 */
>> +                                       emit_mov_reg(&prog, true, BPF_REG_3, AUX_REG);
> 
> This doesn't match
> static inline u64 __arch_get_hw_counter(s32 clock_mode,
>                                          const struct vdso_data *vd)
> {
>          if (likely(clock_mode == VDSO_CLOCKMODE_TSC))
>                  return (u64)rdtsc_ordered() & S64_MAX;
> 
> - & is missing

& S64_MAX is only needed to detect a possible wrap-around of the
timecounter early in the vDSO path for CLOCK_MONOTONIC_RAW/CLOCK_COARSE,
which adds a namespace time offset. The TSC is zeroed on CPU reset and
will not overflow within 10 years according to the "Intel 64 and IA-32
Architectures Software Developer's Manual, Vol. 3B", so it doesn't
really matter whether we mask the highest bit when accessing the raw
cycles counter.

> - rdtsc vs rdtscp

rdtscp atomically provides an additional 32 bits of "signature value"
in ECX along with the TSC value. That value is not really usable
outside of the domain that set it during initialization. The kernel
stores an encoded CPU id in IA32_TSC_AUX to provide it back to
user-space applications, but ignores the value itself on reads. The
combination of lfence and rdtsc gives us the same result (an ordered
read of the TSC value, independent of the core) without clobbering ECX.

> but the latter one is much slower (I was told).

It is currently slower on AMD CPUs; the difference is easily visible in
perf traces under load.

> So maybe instead of arch_get_hw_counter() it should be modelled
> as JIT of sched_clock() ?

sched_clock() is much more complicated because it converts the cycles
counter into ns. We don't actually need that conversion, so let's stick
to arch_get_hw_counter().

> 
>> +
>> +                                       break;
>> +                               }
>> +                       }
>> +
>>                          func = (u8 *) __bpf_call_base + imm32;
>>                          if (tail_call_reachable) {
>>                                  LOAD_TAIL_CALL_CNT_PTR(bpf_prog->aux->stack_depth);
>> diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
>> index de0f9e5f9f73..c36ff18a044b 100644
>> --- a/arch/x86/net/bpf_jit_comp32.c
>> +++ b/arch/x86/net/bpf_jit_comp32.c
>> @@ -2091,6 +2091,17 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
>>                          if (insn->src_reg == BPF_PSEUDO_CALL)
>>                                  goto notyet;
>>
>> +                       if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL && !imm32) {
>> +                               if (insn->dst_reg == 1) {
>> +                                       struct cpuinfo_x86 *c = &cpu_data(get_boot_cpu_id());
>> +
>> +                                       if (cpu_has(c, X86_FEATURE_LFENCE_RDTSC))
>> +                                               EMIT3(0x0F, 0xAE, 0xE8);
>> +                                       EMIT2(0x0F, 0x31);
>> +                                       break;
>> +                               }
>> +                       }
>> +
>>                          if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
>>                                  int err;
>>
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 5c3fdb29c1b1..6624b2465484 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -23,6 +23,7 @@
>>   #include <linux/btf_ids.h>
>>   #include <linux/bpf_mem_alloc.h>
>>   #include <linux/kasan.h>
>> +#include <asm/vdso/gettimeofday.h>
>>
>>   #include "../../lib/kstrtox.h"
>>
>> @@ -3023,6 +3024,11 @@ __bpf_kfunc int bpf_copy_from_user_str(void *dst, u32 dst__sz, const void __user
>>          return ret + 1;
>>   }
>>
>> +__bpf_kfunc int bpf_get_hw_counter(void)
>> +{
>> +       return __arch_get_hw_counter(1, NULL);
>> +}
>> +
>>   __bpf_kfunc_end_defs();
>>
>>   BTF_KFUNCS_START(generic_btf_ids)
>> @@ -3112,6 +3118,7 @@ BTF_ID_FLAGS(func, bpf_iter_bits_next, KF_ITER_NEXT | KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_iter_bits_destroy, KF_ITER_DESTROY)
>>   BTF_ID_FLAGS(func, bpf_copy_from_user_str, KF_SLEEPABLE)
>>   BTF_ID_FLAGS(func, bpf_get_kmem_cache)
>> +BTF_ID_FLAGS(func, bpf_get_hw_counter, KF_FASTCALL)
>>   BTF_KFUNCS_END(common_btf_ids)
>>
>>   static const struct btf_kfunc_id_set common_kfunc_set = {
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f514247ba8ba..5f0e4f91ce48 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -11260,6 +11260,7 @@ enum special_kfunc_type {
>>          KF_bpf_iter_css_task_new,
>>          KF_bpf_session_cookie,
>>          KF_bpf_get_kmem_cache,
>> +       KF_bpf_get_hw_counter,
>>   };
>>
>>   BTF_SET_START(special_kfunc_set)
>> @@ -11326,6 +11327,7 @@ BTF_ID(func, bpf_session_cookie)
>>   BTF_ID_UNUSED
>>   #endif
>>   BTF_ID(func, bpf_get_kmem_cache)
>> +BTF_ID(func, bpf_get_hw_counter)
>>
>>   static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
>>   {
>> @@ -20396,6 +20398,15 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>                     desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
>>                  insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
>>                  *cnt = 1;
>> +       } else if (IS_ENABLED(CONFIG_X86) &&
> 
> It's better to introduce bpf_jit_inlines_kfunc_call()
> similar to bpf_jit_inlines_helper_call().

Yep, I thought about introducing it while adding more architectures,
but I can do it from the beginning.

> 
>> +                  desc->func_id == special_kfunc_list[KF_bpf_get_hw_counter]) {
>> +               insn->imm = 0;
>> +               insn->code = BPF_JMP | BPF_CALL;
>> +               insn->src_reg = BPF_PSEUDO_KFUNC_CALL;
>> +               insn->dst_reg = 1; /* Implement enum for inlined fast calls */
> 
> Yes. Pls do it cleanly from the start.
> 
> Why rewrite though?
> Can JIT match the addr of bpf_get_hw_counter ?
> And no need to rewrite call insn ?

I was thinking about doing it that way, but I wasn't able to find easy
examples of matching function addresses in the JIT. I'll try to
implement it, though it may add some extra functions to the JIT.



Thread overview: 10+ messages
2024-10-23 21:04 [PATCH bpf-next 1/2] bpf: add bpf_get_hw_counter kfunc Vadim Fedorenko
2024-10-23 21:04 ` [PATCH bpf-next 2/2] selftests/bpf: add selftest to check rdtsc jit Vadim Fedorenko
2024-10-24  1:39 ` [PATCH bpf-next 1/2] bpf: add bpf_get_hw_counter kfunc Alexei Starovoitov
2024-10-24 13:11   ` Vadim Fedorenko [this message]
2024-10-24 17:03     ` Alexei Starovoitov
2024-10-24 17:31     ` Eduard Zingerman
2024-10-24 17:47       ` Eduard Zingerman
2024-10-24 16:03 ` kernel test robot
2024-10-24 16:13 ` kernel test robot
2024-10-24 16:13 ` kernel test robot
