BPF List
 help / color / mirror / Atom feed
* Re: Idea for "function meta"
       [not found] ` <CAADnVQ+1mSHwUK4rZ_mJP7W72iSXgsVfazurYPRGi=3p5aBVdQ@mail.gmail.com>
@ 2024-12-18  3:52   ` Menglong Dong
  0 siblings, 0 replies; 7+ messages in thread
From: Menglong Dong @ 2024-12-18  3:52 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf

On Tue, Dec 17, 2024 at 2:46 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Dec 12, 2024 at 6:17 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > Hello, Alexei.
> >
> > After the series of "bpf: make tracing program support multi-link", I'm
> > keeping think about the method to make tracing support multi-link.
> > The previous discuss is here:
> >
> > https://lore.kernel.org/netdev/20240311093526.1010158-1-dongmenglong.8@bytedance.com/
> >
> > Now, I have a idea. How about we introduce a "function meta", which
> > will reserve some space (such as 16-bytes) before function, just like
> > what fentry do, except that the space we reserve for fentry is after the
> > function. For example, we have a function "do_test", the layout is just
> > like this:
> >
> > -------------> 16-bytes
> > -------------> do_test
> > -------------> __fentry__(nop)
> >
> > (Of curse, this need the suuport of the compiler, which is not available
> > for now.)
> >
> > Then, we create a global trampoline for BPF. When we need to attach
> > a BPF program to the function do_test(), we can allocate a
> > "progs" of type struct tracing_progs, which is defined like this:
> >
> > struct tracing_progs {
> >     struct bpf_prog *fentry_progs;
> >     int fentry_count;
> >     struct bpf_prog *fexit_progs;
> >     int fexit_count;
> >     struct bpf_prog *fret_progs;
> >     int fret_count;
> > };
> >
> > Then, we store the address of the bpf program to "progs", and
> > store the "progs" to "do_test - 8". And we make the __fentry__
> > of do_test() a call to the global bpf trampoline.
> >
> > In the global bpf trampoline, we can get the "progs" by "ip - 8",
> > and call the bpf progs in it.
> >
> > The function meta can be used in more general way. For example,
> > ftrace can alse use it. Then, we don't need to lookup the filter hash
> > table.
> >
> > (Hope I'm disturbing you)
>
> Not at all. Sorry for the delay.
> It's best to email the mailing list all the time.
> Plenty of eyes there that can help evaluate and refine the idea.
>
> Overall, I think, it makes sense and there is a compiler
> support already.
> See -fpatchable-function-entry.

Awesome! This makes it possible for me to implement
such idea in the kernel right now~

>
> But it's probably a hard sell to unconditionally add 8 or 16 bytes
> to all functions in the kernel for x86,
> though arm64 is already using -fpatchable-function-entry=4,2
> to support jumping to full 64-bit offsets (if I recall correctly).
>
> If it's only 8 extra bytes for x86 in front of the function
> then please measure vmlinux text before/after.
> Then we can decide.

Okay, I'm testing on this pointer, and I'll send the result
out to the list after I finish it.

> Let's discuss this on the list.

Thanks for the replying :/
Menglong Dong

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Idea for "function meta"
@ 2024-12-20 13:57 Menglong Dong
  2024-12-20 14:00 ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Menglong Dong @ 2024-12-20 13:57 UTC (permalink / raw)
  To: Alexei Starovoitov, Peter Zijlstra; +Cc: bpf

Hello, all.

In the previous discussion, I'm trying to reserve some space (such as 16-bytes)
before function to store the information about the function:

https://lore.kernel.org/bpf/CADxym3ZfHv_VdgopE5TBQxhO7RrPTVm83VW07c8bAywp404QPw@mail.gmail.com/T/#u

And let me describe it here again. For example, we have a function
"do_test", the layout is just like this:

-------------> 16-bytes
-------------> do_test
-------------> __fentry__(nop)

Then, we can alloc a memory, store some information that we
need, and store the pointer of this memory to "do_test - 8".
And then, we can get the function information with "ip - 8" in
bpf trampoline or ftrace handler.

After I dig into the code, I find that this space is already reserved
with the config:

CONFIG_MITIGATION_CALL_DEPTH_TRACKING
  CONFIG_CALL_THUNKS
    CALL_PADDING

And the reserved space before the function is 16 bytes. According
to my tests, 9 bytes is used by the call depth tracking if the CPU
support it, and the insn is just like this:

__pfx_do_test:
nop nop nop nop nop nop nop (7 bytes)
sarq (9 bytes)
do_test:
xxx

And all the calls to do_test will be redirected to __pfx_do_test + 7.
I think that there are still 7 bytes for me to use, which is enough.
In fact, 4 bytes is enough for me, as we can allocate a function
meta array, and store the index to the reserved space.

However, the other 5-bytes will be consumed if CFI_CLANG is
enabled, and the space is not enough anymore in this case, and
the insn will be like this:

__cfi_do_test:
mov (5byte)
nop nop (2 bytes)
sarq (9 bytes)
do_test:
xxx

A method that I think is that we can use such "function meta"
feature only if the call depth tracking or the CFI is not enabled.
(I'm not sure if the CFI is a commonly used feature.)

I would appreciate some advice :/

Thanks!
Menglong Dong

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Idea for "function meta"
  2024-12-20 13:57 Idea for "function meta" Menglong Dong
@ 2024-12-20 14:00 ` Peter Zijlstra
  2024-12-25  3:26   ` Menglong Dong
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2024-12-20 14:00 UTC (permalink / raw)
  To: Menglong Dong; +Cc: Alexei Starovoitov, bpf

On Fri, Dec 20, 2024 at 09:57:22PM +0800, Menglong Dong wrote:

> However, the other 5-bytes will be consumed if CFI_CLANG is
> enabled, and the space is not enough anymore in this case, and
> the insn will be like this:
> 
> __cfi_do_test:
> mov (5byte)
> nop nop (2 bytes)
> sarq (9 bytes)
> do_test:
> xxx
> 

FineIBT will fully consume those 16 bytes.

Also, text is ROX, you cannot easily write there. Furthermore, writing
non-instructions there will destroy disassemblers ability to make sense
of the memory.

So no, don't do this.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Idea for "function meta"
  2024-12-20 14:00 ` Peter Zijlstra
@ 2024-12-25  3:26   ` Menglong Dong
  2025-01-03 19:28     ` Alexei Starovoitov
  0 siblings, 1 reply; 7+ messages in thread
From: Menglong Dong @ 2024-12-25  3:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Alexei Starovoitov, bpf

On Fri, Dec 20, 2024 at 10:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Dec 20, 2024 at 09:57:22PM +0800, Menglong Dong wrote:
>
> > However, the other 5-bytes will be consumed if CFI_CLANG is
> > enabled, and the space is not enough anymore in this case, and
> > the insn will be like this:
> >
> > __cfi_do_test:
> > mov (5byte)
> > nop nop (2 bytes)
> > sarq (9 bytes)
> > do_test:
> > xxx
> >
>
> FineIBT will fully consume those 16 bytes.
>
> Also, text is ROX, you cannot easily write there. Furthermore, writing
> non-instructions there will destroy disassemblers ability to make sense
> of the memory.

Thanks for the reply. Your words make sense, and it
seems to be dangerous too.

Thanks!
Menglong Dong

>
> So no, don't do this.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Idea for "function meta"
  2024-12-25  3:26   ` Menglong Dong
@ 2025-01-03 19:28     ` Alexei Starovoitov
  2025-01-07 13:28       ` Menglong Dong
  2025-02-07  8:16       ` Menglong Dong
  0 siblings, 2 replies; 7+ messages in thread
From: Alexei Starovoitov @ 2025-01-03 19:28 UTC (permalink / raw)
  To: Menglong Dong, Andrii Nakryiko; +Cc: Peter Zijlstra, bpf

On Tue, Dec 24, 2024 at 7:25 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> On Fri, Dec 20, 2024 at 10:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Dec 20, 2024 at 09:57:22PM +0800, Menglong Dong wrote:
> >
> > > However, the other 5-bytes will be consumed if CFI_CLANG is
> > > enabled, and the space is not enough anymore in this case, and
> > > the insn will be like this:
> > >
> > > __cfi_do_test:
> > > mov (5byte)
> > > nop nop (2 bytes)
> > > sarq (9 bytes)
> > > do_test:
> > > xxx
> > >
> >
> > FineIBT will fully consume those 16 bytes.
> >
> > Also, text is ROX, you cannot easily write there. Furthermore, writing
> > non-instructions there will destroy disassemblers ability to make sense
> > of the memory.
>
> Thanks for the reply. Your words make sense, and it
> seems to be dangerous too.

Raw bytes are indeed dangerous in the text section, but
I think we can make it work.

We can prepend 5 byte mov %eax, 0x12345678
or 10 byte mov %rax, 0x12345678aabbccdd
instructions before function entry and before FineIBT/kcfi preamble.

Ideally 5 byte insn and use 4 byte as an offset within 4Gb region
for this per-function metadata that we will allocate on demand.
We can prototype with 10 byte insn and full 8 byte pointer to metadata.
Without mitigations it will be
-fpatchable-function-entry=10
with FineIBT
-fpatchable-function-entry=26

but we have to measure the impact on I-cache iTLB first.

Menglong,
could you do performance benchmarking for no-mitigation kernel
with extra 5 and extra 10 bytes of padding ?

Since we have:
select FUNCTION_ALIGNMENT_16B           if X86_64 || X86_ALIGNMENT_16

the functions are aligned to 16 all the time,
so there is some gap between them.
Extra -fpatchable-function-entry=5 might be in the noise
from performance point of view,
but the ability to provide such per function metadata block
will be very useful for all kinds of use cases.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Idea for "function meta"
  2025-01-03 19:28     ` Alexei Starovoitov
@ 2025-01-07 13:28       ` Menglong Dong
  2025-02-07  8:16       ` Menglong Dong
  1 sibling, 0 replies; 7+ messages in thread
From: Menglong Dong @ 2025-01-07 13:28 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Andrii Nakryiko, Peter Zijlstra, bpf

On Sat, Jan 4, 2025 at 3:28 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 24, 2024 at 7:25 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > On Fri, Dec 20, 2024 at 10:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Dec 20, 2024 at 09:57:22PM +0800, Menglong Dong wrote:
> > >
> > > > However, the other 5-bytes will be consumed if CFI_CLANG is
> > > > enabled, and the space is not enough anymore in this case, and
> > > > the insn will be like this:
> > > >
> > > > __cfi_do_test:
> > > > mov (5byte)
> > > > nop nop (2 bytes)
> > > > sarq (9 bytes)
> > > > do_test:
> > > > xxx
> > > >
> > >
> > > FineIBT will fully consume those 16 bytes.
> > >
> > > Also, text is ROX, you cannot easily write there. Furthermore, writing
> > > non-instructions there will destroy disassemblers ability to make sense
> > > of the memory.
> >
> > Thanks for the reply. Your words make sense, and it
> > seems to be dangerous too.
>
> Raw bytes are indeed dangerous in the text section, but
> I think we can make it work.
>
> We can prepend 5 byte mov %eax, 0x12345678
> or 10 byte mov %rax, 0x12345678aabbccdd
> instructions before function entry and before FineIBT/kcfi preamble.

Sounds great, which makes the raw bytes safe enough!

> Ideally 5 byte insn and use 4 byte as an offset within 4Gb region
> for this per-function metadata that we will allocate on demand.
> We can prototype with 10 byte insn and full 8 byte pointer to metadata.
> Without mitigations it will be
> -fpatchable-function-entry=10
> with FineIBT
> -fpatchable-function-entry=26

Yeah! And we even don't need extra bytes if CONFIG_CFI_CLANG
is not enabled and mitigation is enabled (which is enabled by default),
as the remaining 7-bytes is enough for us.

> but we have to measure the impact on I-cache iTLB first.
>
> Menglong,
> could you do performance benchmarking for no-mitigation kernel
> with extra 5 and extra 10 bytes of padding ?

Okay, I'll do this performance benchmarking (maybe a few days later,
as I am busy at my work these days :/).

>
> Since we have:
> select FUNCTION_ALIGNMENT_16B           if X86_64 || X86_ALIGNMENT_16
>
> the functions are aligned to 16 all the time,
> so there is some gap between them.
> Extra -fpatchable-function-entry=5 might be in the noise
> from performance point of view,
> but the ability to provide such per function metadata block
> will be very useful for all kinds of use cases.

Thanks for the advice, which gives me some faith in this idea.

Thanks!
Menglong Dong

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Idea for "function meta"
  2025-01-03 19:28     ` Alexei Starovoitov
  2025-01-07 13:28       ` Menglong Dong
@ 2025-02-07  8:16       ` Menglong Dong
  1 sibling, 0 replies; 7+ messages in thread
From: Menglong Dong @ 2025-02-07  8:16 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Andrii Nakryiko, Peter Zijlstra, bpf

On Sat, Jan 4, 2025 at 3:28 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 24, 2024 at 7:25 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > On Fri, Dec 20, 2024 at 10:01 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Dec 20, 2024 at 09:57:22PM +0800, Menglong Dong wrote:
> > >
> > > > However, the other 5-bytes will be consumed if CFI_CLANG is
> > > > enabled, and the space is not enough anymore in this case, and
> > > > the insn will be like this:
> > > >
> > > > __cfi_do_test:
> > > > mov (5byte)
> > > > nop nop (2 bytes)
> > > > sarq (9 bytes)
> > > > do_test:
> > > > xxx
> > > >
> > >
> > > FineIBT will fully consume those 16 bytes.
> > >
> > > Also, text is ROX, you cannot easily write there. Furthermore, writing
> > > non-instructions there will destroy disassemblers ability to make sense
> > > of the memory.
> >
> > Thanks for the reply. Your words make sense, and it
> > seems to be dangerous too.
>
> Raw bytes are indeed dangerous in the text section, but
> I think we can make it work.
>
> We can prepend 5 byte mov %eax, 0x12345678
> or 10 byte mov %rax, 0x12345678aabbccdd
> instructions before function entry and before FineIBT/kcfi preamble.
>
> Ideally 5 byte insn and use 4 byte as an offset within 4Gb region
> for this per-function metadata that we will allocate on demand.
> We can prototype with 10 byte insn and full 8 byte pointer to metadata.
> Without mitigations it will be
> -fpatchable-function-entry=10
> with FineIBT
> -fpatchable-function-entry=26
>
> but we have to measure the impact on I-cache iTLB first.
>
> Menglong,
> could you do performance benchmarking for no-mitigation kernel
> with extra 5 and extra 10 bytes of padding ?

Hi Alexei,

(Sorry for the late reply, I was celebrating the Spring Festival a few
days ago :/ )

I did some performance benchmarking recently with sysbench.
The only case that I did is threads creating benchmarking:

  sysbench --time=60 threads run

I disabled mitigation, and compile a 5-bytes padding kernel with
following changes:

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2485,10 +2485,10 @@ config FUNCTION_PADDING_CFI
 config FUNCTION_PADDING_BYTES
        int
        default FUNCTION_PADDING_CFI if CFI_CLANG
-       default FUNCTION_ALIGNMENT
+       default 5

 config CALL_PADDING
-       def_bool n
+       def_bool y
        depends on CC_HAS_ENTRY_PADDING && OBJTOOL
        select FUNCTION_ALIGNMENT_16B

I did this testing in a kvm in following steps:

1. isolate 4 cores in the host by adding following params to
   the cmdline:
   isolcpus=0-3
2. run a kvm, and isolate 2 cores in the kvm by adding
   "isolcpus=0,1" to the cmdline of the kvm
3. bind the vcpu threads of the kvm to the CPU that we
    isolated
4. run the following command to performance the benchmarking
   in the kvm:
   taskset -c 0 sysbench --time=60 threads run
   and do the statistics with perf meanwhile:
   perf stat -C 0 -- sleep 10

I did the testing not only for 5-bytes padding, but also for
1-bytes, 3-bytes, 4-bytes,5-bytes, 6-bytes, 7-bytes, 8-bytes,
10-bytes, and following is the results of this testing:

| PADDING(BYTES) | RESULT | cycles    | IPS(insns per seconds) |
stalled cycles per insn | stalled-cycles-frontend |
| ------------ | ------ | --------- | ---------------------- |
----------------------- | ----------------------- |
| 1            | 120577 | 4.790 GHz | 3.05                   | 0.04
                | 12.18%                  |
| 1            | 120657 | 4.815 GHz | 3.05                   | 0.04
                | 12.04%                  |
| 1            | 120172 | 4.789 GHz | 3.05                   | 0.04
                | 12.25%                  |
| 3            | 117454 | 4.804 GHz | 2.98                   | 0.04
                | 12.97%                  |
| 3            | 117418 | 4.815 GHz | 2.98                   | 0.04
                | 13.12%                  |
| 3            | 117864 | 4.815 GHz | 2.98                   | 0.04
                | 13.06%                  |
| 4            | 120825 | 4.767 GHz | 3.08                   | 0.04
                | 11.02%                  |
| 4            | 121361 | 4.816 GHz | 3.08                   | 0.04
                | 11.00%                  |
| 4            | 121227 | 4.792 GHz | 3.08                   | 0.04
                | 11.04%                  |
| 5            | 120214 | 4.804 GHz | 3.05                   | 0.04
                | 10.91%                  |
| 5            | 120295 | 4.772 GHz | 3.07                   | 0.04
                | 10.99%                  |
| 5            | 120980 | 4.798 GHz | 3.07                   | 0.04
                | 11.00%                  |
| 6            | 120151 | 4.776 GHz | 3.05                   | 0.04
                | 11.73%                  |
| 6            | 119700 | 4.803 GHz | 3.04                   | 0.04
                | 11.77%                  |
| 6            | 120030 | 4.789 GHz | 3.05                   | 0.04
                | 11.88%                  |
| 7            | 115081 | 4.789 GHz | 2.93                   | 0.05
                | 13.77%                  |
| 7            | 115681 | 4.795 GHz | 2.94                   | 0.05
                | 13.37%                  |
| 7            | 115954 | 4.817 GHz | 2.95                   | 0.05
                | 13.46%                  |
| 8            | 119675 | 4.768 GHz | 3.04                   | 0.04
                | 12.10%                  |
| 8            | 120442 | 4.824 GHz | 3.05                   | 0.04
                | 12.06%                  |
| 8            | 120260 | 4.793 GHz | 3.04                   | 0.04
                | 12.21%                  |
| 10           | 116292 | 4.788 GHz | 2.97                   | 0.04
                | 12.69%                  |
| 10           | 116543 | 4.815 GHz | 2.97                   | 0.04
                | 12.74%                  |
| 10           | 116654 | 4.794 GHz | 2.97                   | 0.04
                | 12.81%                  |
| 16           | 120051 | 4.786 GHz | 3.05                   | 0.04
                | 11.21%                  |
| 16           | 120450 | 4.808 GHz | 3.05                   | 0.04
                | 11.19%                  |
| 16           | 120562 | 4.831 GHz | 3.05                   | 0.04
                | 11.22%                  |

I haven't found the rule of the impact of the space we padding,
but we can see that the performance is ok for 1,4,5,6,8 bytes
padding, which means that the performance is the same as 16-bytes
padding. But it's not ok for 3-bytes, 7-bytes and 10-bytes padding.

I didn't do the testing for all the possible padding bytes, it consumes
time :/

So it seems that we can add extra 5-bytes to the padding and
don't have performance loss. But I'm not sure if it has any other
impacts.

So we have two ways to implement such a function:
1. add extra 5-bytes padding when necessary. This will make the
   vmlinux is as small as possible.
2. make the FUNCTION_ALIGNMENT 32-bytes, which will make
   the vmlinux ~5% larger.

BTW, we don't need to do anything if CFI_CLANG is not enabled,
as there is 7-bytes spare padding in such cases, which is enough
for us.

What do you think?

Thanks!
Menglong Dong

>
> Since we have:
> select FUNCTION_ALIGNMENT_16B           if X86_64 || X86_ALIGNMENT_16
>
> the functions are aligned to 16 all the time,
> so there is some gap between them.
> Extra -fpatchable-function-entry=5 might be in the noise
> from performance point of view,
> but the ability to provide such per function metadata block
> will be very useful for all kinds of use cases.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-02-07  8:17 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-20 13:57 Idea for "function meta" Menglong Dong
2024-12-20 14:00 ` Peter Zijlstra
2024-12-25  3:26   ` Menglong Dong
2025-01-03 19:28     ` Alexei Starovoitov
2025-01-07 13:28       ` Menglong Dong
2025-02-07  8:16       ` Menglong Dong
     [not found] <CADxym3Yop==sWx2q8448kYkDWcK=P7+fqeZLzyzk8D0GwZEV-A@mail.gmail.com>
     [not found] ` <CAADnVQ+1mSHwUK4rZ_mJP7W72iSXgsVfazurYPRGi=3p5aBVdQ@mail.gmail.com>
2024-12-18  3:52   ` Menglong Dong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox