* [RFC PATCH bpf-next 0/2] bpf: Add generic kfunc bpf_ffs64()
@ 2024-01-31 15:56 Leon Hwang
2024-01-31 15:56 ` [RFC PATCH bpf-next 1/2] " Leon Hwang
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Leon Hwang @ 2024-01-31 15:56 UTC (permalink / raw)
To: bpf; +Cc: ast, daniel, andrii, Leon Hwang
This patchset introduces a new generic kfunc, bpf_ffs64(). This kfunc
allows BPF programs to reuse the kernel's __ffs64() function to improve
ffs performance in BPF.

In the patch "bpf: Add generic kfunc bpf_ffs64()", benchmark data shows
that this kfunc saves around 10ns per call on an "Intel(R) Xeon(R)
Silver 4116 CPU @ 2.10GHz" server, compared with a BPF-implemented
__ffs64().

However, it would be even better to convert this kfunc to "rep bsf" in
the x86 JIT, which would avoid a call. But I haven't figured out how to
do that yet.
Leon Hwang (2):
bpf: Add generic kfunc bpf_ffs64()
selftests/bpf: Add testcases for generic kfunc bpf_ffs64()
kernel/bpf/helpers.c | 7 +++
.../testing/selftests/bpf/prog_tests/bitops.c | 54 +++++++++++++++++++
tools/testing/selftests/bpf/progs/bitops.c | 21 ++++++++
3 files changed, 82 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/bitops.c
create mode 100644 tools/testing/selftests/bpf/progs/bitops.c
base-commit: c5809f0c308111adbcdbf95462a72fa79eb267d1
--
2.42.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* [RFC PATCH bpf-next 1/2] bpf: Add generic kfunc bpf_ffs64()
2024-01-31 15:56 [RFC PATCH bpf-next 0/2] bpf: Add generic kfunc bpf_ffs64() Leon Hwang
@ 2024-01-31 15:56 ` Leon Hwang
2024-01-31 15:56 ` [RFC PATCH bpf-next 2/2] selftests/bpf: Add testcases for " Leon Hwang
2024-02-02 22:18 ` [RFC PATCH bpf-next 0/2] bpf: Add " Andrii Nakryiko
2 siblings, 0 replies; 8+ messages in thread
From: Leon Hwang @ 2024-01-31 15:56 UTC (permalink / raw)
To: bpf; +Cc: ast, daniel, andrii, Leon Hwang
On an XDP-based virtual network gateway, the ffs (find first set)
algorithm is used in the gateway's ACL module to find the index of the
first set bit in a bitmap, which is an array of u64.

The ACL module's design is based on these two papers:

* "eBPF / XDP based firewall and packet filtering"[1]
* "Securing Linux with a Faster and Scalable Iptables"[2]

In the ACL module, the key steps are:

1. Match the source address to get a bitmap.
2. Match the destination address to get a bitmap.
3. Match the L4 protocol to get a bitmap.
4. Match the source port to get a bitmap.
5. Match the destination port to get a bitmap.

Finally, the five bitmaps are traversed in parallel, ANDing the five
u64 words at each position. For every non-zero AND result, ffs finds
the index of the first set bit in it. That index is then converted to a
rule index into a rule policy BPF map, whose type is BPF_MAP_TYPE_ARRAY
or BPF_MAP_TYPE_PERCPU_ARRAY.
If the kernel's __ffs64() function can be reused in BPF, it saves some
time when finding the index of the first set bit in a u64: on x86,
__ffs64() compiles down to a single instruction, "rep bsf" (decoded as
tzcnt on CPUs with BMI).

To compare a BPF-implemented __ffs64() with this kfunc bpf_ffs64(), I
used the following BPF code snippet:
#include "vmlinux.h"
#include "bpf/bpf_helpers.h"

unsigned long bpf_ffs64(u64 word) __ksym;

static __noinline __u64
__ffs64(__u64 word)
{
	__u64 shift = 0;

	if ((word & 0xffffffff) == 0) {
		word >>= 32;
		shift += 32;
	}
	if ((word & 0xffff) == 0) {
		word >>= 16;
		shift += 16;
	}
	if ((word & 0xff) == 0) {
		word >>= 8;
		shift += 8;
	}
	if ((word & 0xf) == 0) {
		word >>= 4;
		shift += 4;
	}
	if ((word & 0x3) == 0) {
		word >>= 2;
		shift += 2;
	}
	if ((word & 0x1) == 0) {
		shift += 1;
	}

	return shift;
}

SEC("tc")
int tc_ffs1(struct __sk_buff *skb)
{
	void *data_end = (void *)(long) skb->data_end;
	u64 *data = (u64 *)(long) skb->data;

	if ((void *)(u64) (data + 1) > data_end)
		return 0;

	return __ffs64(*data);
}

SEC("tc")
int tc_ffs2(struct __sk_buff *skb)
{
	void *data_end = (void *)(long) skb->data_end;
	u64 *data = (u64 *)(long) skb->data;

	if ((void *)(u64) (data + 1) > data_end)
		return 0;

	return bpf_ffs64(*data);
}

char _license[] SEC("license") = "GPL";
I ran them on a KVM-based VM hosted on a 48-core "Intel(R) Xeon(R)
Silver 4116 CPU @ 2.10GHz" server.

With the set bit at offset 0, running the BPF progs 10,000,000 times
per round, the average time cost per run is:
+----------+---------------+-------------------+
| Nth time | bpf __ffs64() | kfunc bpf_ffs64() |
+----------+---------------+-------------------+
| 1 | 164ns | 154ns |
| 2 | 166ns | 155ns |
| 3 | 160ns | 154ns |
| 4 | 161ns | 157ns |
| 5 | 161ns | 155ns |
| 6 | 163ns | 155ns |
| 7 | 164ns | 155ns |
| 8 | 159ns | 159ns |
| 9 | 171ns | 154ns |
| 10 | 164ns | 156ns |
| 11 | 161ns | 155ns |
| 12 | 160ns | 155ns |
| 13 | 161ns | 154ns |
| 14 | 165ns | 154ns |
| 15 | 161ns | 162ns |
| 16 | 161ns | 157ns |
| 17 | 164ns | 154ns |
| 18 | 162ns | 154ns |
| 19 | 159ns | 156ns |
| 20 | 160ns | 154ns |
+----------+---------------+-------------------+
With the set bit at offset 63, running the BPF progs 10,000,000 times
per round, the average time cost per run is:
+----------+---------------+-------------------+
| Nth time | bpf __ffs64() | kfunc bpf_ffs64() |
+----------+---------------+-------------------+
| 1 | 163ns | 157ns |
| 2 | 163ns | 154ns |
| 3 | 165ns | 155ns |
| 4 | 167ns | 155ns |
| 5 | 165ns | 155ns |
| 6 | 163ns | 155ns |
| 7 | 162ns | 155ns |
| 8 | 162ns | 156ns |
| 9 | 174ns | 155ns |
| 10 | 162ns | 156ns |
| 11 | 168ns | 155ns |
| 12 | 169ns | 156ns |
| 13 | 162ns | 155ns |
| 14 | 169ns | 155ns |
| 15 | 162ns | 154ns |
| 16 | 163ns | 155ns |
| 17 | 162ns | 154ns |
| 18 | 166ns | 154ns |
| 19 | 165ns | 154ns |
| 20 | 165ns | 154ns |
+----------+---------------+-------------------+
As the tables show, the BPF-implemented __ffs64() costs around 165ns
per run, and the kfunc bpf_ffs64() costs around 155ns, so the kfunc
saves about 10ns per call.

With 1M PPS on the gateway, kfunc bpf_ffs64() will save a meaningful
amount of CPU.
Links:
[1] http://vger.kernel.org/lpc_net2018_talks/ebpf-firewall-paper-LPC.pdf
[2] https://mbertrone.github.io/documents/21-Securing_Linux_with_a_Faster_and_Scalable_Iptables.pdf
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
kernel/bpf/helpers.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index bcb951a2ecf4b..4db48a6a04a90 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -23,6 +23,7 @@
#include <linux/btf_ids.h>
#include <linux/bpf_mem_alloc.h>
#include <linux/kasan.h>
+#include <linux/bitops.h>
#include "../../lib/kstrtox.h"
@@ -2542,6 +2543,11 @@ __bpf_kfunc void bpf_throw(u64 cookie)
WARN(1, "A call to BPF exception callback should never return\n");
}
+__bpf_kfunc unsigned long bpf_ffs64(u64 word)
+{
+ return __ffs64(word);
+}
+
__bpf_kfunc_end_defs();
BTF_SET8_START(generic_btf_ids)
@@ -2573,6 +2579,7 @@ BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
#endif
BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_throw)
+BTF_ID_FLAGS(func, bpf_ffs64)
BTF_SET8_END(generic_btf_ids)
static const struct btf_kfunc_id_set generic_kfunc_set = {
--
2.42.1
* [RFC PATCH bpf-next 2/2] selftests/bpf: Add testcases for generic kfunc bpf_ffs64()
2024-01-31 15:56 [RFC PATCH bpf-next 0/2] bpf: Add generic kfunc bpf_ffs64() Leon Hwang
2024-01-31 15:56 ` [RFC PATCH bpf-next 1/2] " Leon Hwang
@ 2024-01-31 15:56 ` Leon Hwang
2024-02-02 22:18 ` [RFC PATCH bpf-next 0/2] bpf: Add " Andrii Nakryiko
2 siblings, 0 replies; 8+ messages in thread
From: Leon Hwang @ 2024-01-31 15:56 UTC (permalink / raw)
To: bpf; +Cc: ast, daniel, andrii, Leon Hwang
Add a selftest to confirm that the kfunc bpf_ffs64() works as expected.
./tools/testing/selftests/bpf/test_progs -t bitops
12/1 bitops/bitops_ffs64:OK
12 bitops:OK
Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Leon Hwang <hffilwlqm@gmail.com>
---
.../testing/selftests/bpf/prog_tests/bitops.c | 54 +++++++++++++++++++
tools/testing/selftests/bpf/progs/bitops.c | 21 ++++++++
2 files changed, 75 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/bitops.c
create mode 100644 tools/testing/selftests/bpf/progs/bitops.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bitops.c b/tools/testing/selftests/bpf/prog_tests/bitops.c
new file mode 100644
index 0000000000000..e4c68da626280
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bitops.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Leon Hwang */
+
+#include <test_progs.h>
+
+/* test_ffs64 tests the generic kfunc bpf_ffs64().
+ */
+static void test_ffs64(void)
+{
+ struct bpf_object *obj = NULL;
+ struct bpf_program *prog;
+ char buff[128] = {};
+ int err, prog_fd;
+
+ LIBBPF_OPTS(bpf_test_run_opts, topts,
+ .data_in = buff,
+ .data_size_in = sizeof(buff),
+ .repeat = 1,
+ );
+
+ err = bpf_prog_test_load("bitops.bpf.o", BPF_PROG_TYPE_SCHED_CLS, &obj,
+ &prog_fd);
+ if (!ASSERT_OK(err, "load obj"))
+ return;
+
+ prog = bpf_object__find_program_by_name(obj, "tc_ffs64");
+ if (!ASSERT_OK_PTR(prog, "find tc_ffs64"))
+ goto out;
+
+#define TEST_FFS(n) \
+ do { \
+ u64 __n = 1; \
+ \
+ *(u64 *)(void *) buff = (u64) (__n << n); \
+ err = bpf_prog_test_run_opts(prog_fd, &topts); \
+ ASSERT_OK(err, "run prog"); \
+ ASSERT_EQ(topts.retval, n, "run prog"); \
+ } while (0)
+
+ TEST_FFS(0);
+ TEST_FFS(1);
+ TEST_FFS(31);
+ TEST_FFS(63);
+
+#undef TEST_FFS
+out:
+ bpf_object__close(obj);
+}
+
+void test_bitops(void)
+{
+ if (test__start_subtest("bitops_ffs64"))
+ test_ffs64();
+}
diff --git a/tools/testing/selftests/bpf/progs/bitops.c b/tools/testing/selftests/bpf/progs/bitops.c
new file mode 100644
index 0000000000000..0863d1392b3d4
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bitops.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright Leon Hwang */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+
+unsigned long bpf_ffs64(u64 word) __ksym;
+
+SEC("tc")
+int tc_ffs64(struct __sk_buff *skb)
+{
+ void *data_end = (void *)(long)skb->data_end;
+ u64 *data = (u64 *)(long)skb->data;
+
+ if ((void *)(u64)(data + 1) > data_end)
+ return -1;
+
+ return bpf_ffs64(*data);
+}
+
+char _license[] SEC("license") = "GPL";
--
2.42.1
* Re: [RFC PATCH bpf-next 0/2] bpf: Add generic kfunc bpf_ffs64()
2024-01-31 15:56 [RFC PATCH bpf-next 0/2] bpf: Add generic kfunc bpf_ffs64() Leon Hwang
2024-01-31 15:56 ` [RFC PATCH bpf-next 1/2] " Leon Hwang
2024-01-31 15:56 ` [RFC PATCH bpf-next 2/2] selftests/bpf: Add testcases for " Leon Hwang
@ 2024-02-02 22:18 ` Andrii Nakryiko
2024-02-04 19:19 ` Yonghong Song
2 siblings, 1 reply; 8+ messages in thread
From: Andrii Nakryiko @ 2024-02-02 22:18 UTC (permalink / raw)
To: Leon Hwang; +Cc: bpf, ast, daniel, andrii, Yonghong Song
On Wed, Jan 31, 2024 at 7:56 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>
> This patchset introduces a new generic kfunc bpf_ffs64(). This kfunc
> allows bpf to reuse kernel's __ffs64() function to improve ffs
> performance in bpf.
>
The downside of using a kfunc for this is that the compiler will assume
that R1-R5 have to be spilled/filled, because that's the function call
convention in BPF.
If this was an instruction, though, it would be much more efficient
and would avoid this problem. But I see how something like ffs64 is
useful. I think it would be good to also have popcnt instruction and a
few other fast bit manipulation operations as well.
Perhaps we should think about another BPF ISA extension to add fast
bit manipulation instructions?
[...]
* Re: [RFC PATCH bpf-next 0/2] bpf: Add generic kfunc bpf_ffs64()
2024-02-02 22:18 ` [RFC PATCH bpf-next 0/2] bpf: Add " Andrii Nakryiko
@ 2024-02-04 19:19 ` Yonghong Song
2024-02-05 18:18 ` Andrii Nakryiko
0 siblings, 1 reply; 8+ messages in thread
From: Yonghong Song @ 2024-02-04 19:19 UTC (permalink / raw)
To: Andrii Nakryiko, Leon Hwang; +Cc: bpf, ast, daniel, andrii
On 2/2/24 2:18 PM, Andrii Nakryiko wrote:
> On Wed, Jan 31, 2024 at 7:56 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>> This patchset introduces a new generic kfunc bpf_ffs64(). This kfunc
>> allows bpf to reuse kernel's __ffs64() function to improve ffs
>> performance in bpf.
>>
> The downside of using kfunc for this is that the compiler will assume
> that R1-R5 have to be spilled/filled, because that's function call
> convention in BPF.
>
> If this was an instruction, though, it would be much more efficient
> and would avoid this problem. But I see how something like ffs64 is
> useful. I think it would be good to also have popcnt instruction and a
> few other fast bit manipulation operations as well.
>
> Perhaps we should think about another BPF ISA extension to add fast
> bit manipulation instructions?
Sounds like a good idea to start the conversation. Besides popcnt,
lzcnt is also a candidate. From the llvm perspective, it would be hard
to generate ffs64/popcnt/lzcnt etc. from a generic source
implementation. So most likely, inline asm will be used. libbpf could
define some macros to make adoption easier. The verifier and JIT will
do the proper thing: either use the corresponding arch insns directly,
or the verifier will rewrite them so the JIT won't be aware of these
insns.
[...]
* Re: [RFC PATCH bpf-next 0/2] bpf: Add generic kfunc bpf_ffs64()
2024-02-04 19:19 ` Yonghong Song
@ 2024-02-05 18:18 ` Andrii Nakryiko
2024-02-05 18:34 ` Yonghong Song
0 siblings, 1 reply; 8+ messages in thread
From: Andrii Nakryiko @ 2024-02-05 18:18 UTC (permalink / raw)
To: Yonghong Song; +Cc: Leon Hwang, bpf, ast, daniel, andrii
On Sun, Feb 4, 2024 at 11:20 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 2/2/24 2:18 PM, Andrii Nakryiko wrote:
> > On Wed, Jan 31, 2024 at 7:56 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
> >> This patchset introduces a new generic kfunc bpf_ffs64(). This kfunc
> >> allows bpf to reuse kernel's __ffs64() function to improve ffs
> >> performance in bpf.
> >>
> > The downside of using kfunc for this is that the compiler will assume
> > that R1-R5 have to be spilled/filled, because that's function call
> > convention in BPF.
> >
> > If this was an instruction, though, it would be much more efficient
> > and would avoid this problem. But I see how something like ffs64 is
> > useful. I think it would be good to also have popcnt instruction and a
> > few other fast bit manipulation operations as well.
> >
> > Perhaps we should think about another BPF ISA extension to add fast
> > bit manipulation instructions?
>
> Sounds a good idea to start the conversion. Besides popcnt, lzcnt
> is also a candidate. From llvm perspective, it would be hard to
> generate ffs64/popcnt/lzcnt etc. from source generic implementation.
I'm curious why? I assumed that if a user used __builtin_popcount()
Clang could just generate BPF's popcnt instruction (assuming the right
BPF cpu version is enabled, of course).
> So most likely, inline asm will be used. libbpf could define
> some macros to make adoption easier. Verifier and JIT will do
> proper thing, either using corresponding arch insns directly or
> verifier will rewrite so JIT won't be aware of these insns.
[...]
* Re: [RFC PATCH bpf-next 0/2] bpf: Add generic kfunc bpf_ffs64()
2024-02-05 18:18 ` Andrii Nakryiko
@ 2024-02-05 18:34 ` Yonghong Song
2024-03-03 13:18 ` Leon Hwang
0 siblings, 1 reply; 8+ messages in thread
From: Yonghong Song @ 2024-02-05 18:34 UTC (permalink / raw)
To: Andrii Nakryiko; +Cc: Leon Hwang, bpf, ast, daniel, andrii
On 2/5/24 10:18 AM, Andrii Nakryiko wrote:
> On Sun, Feb 4, 2024 at 11:20 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>>
>> On 2/2/24 2:18 PM, Andrii Nakryiko wrote:
>>> On Wed, Jan 31, 2024 at 7:56 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>>> This patchset introduces a new generic kfunc bpf_ffs64(). This kfunc
>>>> allows bpf to reuse kernel's __ffs64() function to improve ffs
>>>> performance in bpf.
>>>>
>>> The downside of using kfunc for this is that the compiler will assume
>>> that R1-R5 have to be spilled/filled, because that's function call
>>> convention in BPF.
>>>
>>> If this was an instruction, though, it would be much more efficient
>>> and would avoid this problem. But I see how something like ffs64 is
>>> useful. I think it would be good to also have popcnt instruction and a
>>> few other fast bit manipulation operations as well.
>>>
>>> Perhaps we should think about another BPF ISA extension to add fast
>>> bit manipulation instructions?
>> Sounds a good idea to start the conversion. Besides popcnt, lzcnt
>> is also a candidate. From llvm perspective, it would be hard to
>> generate ffs64/popcnt/lzcnt etc. from source generic implementation.
> I'm curious why? I assumed that if a user used __builtin_popcount()
> Clang could just generate BPF's popcnt instruction (assuming the right
> BPF cpu version is enabled, of course).
I wasn't aware of __builtin_popcount(). Yes, the BPF backend should be
able to easily convert __builtin_popcount() to a BPF insn.
>
>> So most likely, inline asm will be used. libbpf could define
>> some macros to make adoption easier. Verifier and JIT will do
>> proper thing, either using corresponding arch insns directly or
>> verifier will rewrite so JIT won't be aware of these insns.
[...]
* Re: [RFC PATCH bpf-next 0/2] bpf: Add generic kfunc bpf_ffs64()
2024-02-05 18:34 ` Yonghong Song
@ 2024-03-03 13:18 ` Leon Hwang
0 siblings, 0 replies; 8+ messages in thread
From: Leon Hwang @ 2024-03-03 13:18 UTC (permalink / raw)
To: Yonghong Song, Andrii Nakryiko; +Cc: bpf, ast, daniel, andrii
On 2024/2/6 02:34, Yonghong Song wrote:
>
> On 2/5/24 10:18 AM, Andrii Nakryiko wrote:
>> On Sun, Feb 4, 2024 at 11:20 AM Yonghong Song
>> <yonghong.song@linux.dev> wrote:
>>>
>>> On 2/2/24 2:18 PM, Andrii Nakryiko wrote:
>>>> On Wed, Jan 31, 2024 at 7:56 AM Leon Hwang <hffilwlqm@gmail.com> wrote:
>>>>> This patchset introduces a new generic kfunc bpf_ffs64(). This kfunc
>>>>> allows bpf to reuse kernel's __ffs64() function to improve ffs
>>>>> performance in bpf.
>>>>>
>>>> The downside of using kfunc for this is that the compiler will assume
>>>> that R1-R5 have to be spilled/filled, because that's function call
>>>> convention in BPF.
>>>>
>>>> If this was an instruction, though, it would be much more efficient
>>>> and would avoid this problem. But I see how something like ffs64 is
>>>> useful. I think it would be good to also have popcnt instruction and a
>>>> few other fast bit manipulation operations as well.
>>>>
>>>> Perhaps we should think about another BPF ISA extension to add fast
>>>> bit manipulation instructions?
>>> Sounds a good idea to start the conversion. Besides popcnt, lzcnt
>>> is also a candidate. From llvm perspective, it would be hard to
>>> generate ffs64/popcnt/lzcnt etc. from source generic implementation.
>> I'm curious why? I assumed that if a user used __builtin_popcount()
>> Clang could just generate BPF's popcnt instruction (assuming the right
>> BPF cpu version is enabled, of course).
>
> Not aware of __builtin_popcount(). Yes, BPF backend should be able easily
> converts __builtin_popcount() to a BPF insn.
>
>>
>>> So most likely, inline asm will be used. libbpf could define
>>> some macros to make adoption easier. Verifier and JIT will do
>>> proper thing, either using corresponding arch insns directly or
>>> verifier will rewrite so JIT won't be aware of these insns.
> [...]
Sorry for the late reply. I was busy trying to fix a tailcall
issue[0], but unluckily it seems hopeless to achieve.
[0] bpf, x64: Fix tailcall hierarchy
https://lore.kernel.org/bpf/20240222085232.62483-1-hffilwlqm@gmail.com/
It would be great for another BPF ISA extension to add fast bit
manipulation instructions. Assuming the right BPF cpu version, clang
would be expected to generate ffs64/popcnt/lzcnt BPF insns for
__builtin_ffsll()/__builtin_popcount()/__builtin_clz().

Then, I did a POC to JIT this kfunc bpf_ffs64(). It is OK to JIT the
kfunc bpf_ffs64() by being aware of CPU features and adding
'BPF_ALU64 | BPF_BITOPS'.

Here's the diff of the POC:
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e1390d1e3..9cd552dd7 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -18,6 +18,7 @@
#include <asm/text-patching.h>
#include <asm/unwind.h>
#include <asm/cfi.h>
+#include <asm/cpufeatures.h>
static bool all_callee_regs_used[4] = {true, true, true, true};
@@ -1131,6 +1132,39 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
*pprog = prog;
}
+static int emit_bitops(u8 **pprog, u32 bitops)
+{
+ u8 *prog = *pprog;
+
+ switch (bitops) {
+#ifdef X86_FEATURE_AVX2
+ case BPF_FFS64:
+ /* identical to tzcnt rax, rdi */
+ /* rep bsf rax, rdi */
+ EMIT1(0xF3);
+ EMIT4(0x48, 0x0F, 0xBC, 0xC7);
+ break;
+#endif
+#ifdef X86_FEATURE_XMM4_1
+ case BPF_POPCNT:
+ /* popcnt rax, rdi */
+ EMIT1(0xF3);
+ EMIT4(0x48, 0x0F, 0xB8, 0xC7);
+ break;
+ case BPF_LZCNT:
+ /* lzcnt rax, rdi */
+ EMIT1(0xF3);
+ EMIT4(0x48, 0x0F, 0xBD, 0xC7);
+ break;
+#endif
+ default:
+ return -EINVAL;
+ }
+
+ *pprog = prog;
+ return 0;
+}
+
#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
/* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
@@ -1521,6 +1555,12 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
}
break;
+ case BPF_ALU64 | BPF_BITOPS:
+ err = emit_bitops(&prog, insn->imm);
+ if (err)
+ return err;
+ break;
+
/* speculation barrier */
case BPF_ST | BPF_NOSPEC:
EMIT_LFENCE();
@@ -3145,6 +3185,11 @@ bool bpf_jit_supports_subprog_tailcalls(void)
return true;
}
+bool bpf_jit_supports_bitops(void)
+{
+ return true;
+}
+
void bpf_jit_free(struct bpf_prog *prog)
{
if (prog->jited) {
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 36cc29a29..27ad34e20 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -959,6 +959,7 @@ bool bpf_jit_supports_kfunc_call(void);
bool bpf_jit_supports_far_kfunc_call(void);
bool bpf_jit_supports_exceptions(void);
bool bpf_jit_supports_ptr_xchg(void);
+bool bpf_jit_supports_bitops(void);
void arch_bpf_stack_walk(bool (*consume_fn)(void *cookie, u64 ip, u64 sp, u64 bp), void *cookie);
bool bpf_helper_changes_pkt_data(void *func);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d96708380..0391e2d94 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -34,6 +34,12 @@
#define BPF_FROM_LE BPF_TO_LE
#define BPF_FROM_BE BPF_TO_BE
+/* bitops on a register */
+#define BPF_BITOPS 0xe0 /* flags for bitops */
+#define BPF_FFS64 0x00 /* opcode for ffs64 */
+#define BPF_POPCNT 0x01 /* opcode for popcnt */
+#define BPF_LZCNT 0x02 /* opcode for lzcnt */
+
/* jmp encodings */
#define BPF_JNE 0x50 /* jump != */
#define BPF_JLT 0xa0 /* LT is unsigned, '<' */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 71c459a51..d90163ede 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2936,6 +2936,12 @@ bool __weak bpf_jit_supports_ptr_xchg(void)
return false;
}
+/* return TRUE if the JIT backend supports bitops with few instructions. */
+bool __weak bpf_jit_supports_bitops(void)
+{
+ return false;
+}
+
/* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call
* skb_copy_bits(), so provide a weak definition of it for NET-less config.
*/
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 93edf730d..f5123e92f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -23,6 +23,7 @@
#include <linux/btf_ids.h>
#include <linux/bpf_mem_alloc.h>
#include <linux/kasan.h>
+#include <linux/bitops.h>
#include "../../lib/kstrtox.h"
@@ -2542,6 +2543,11 @@ __bpf_kfunc void bpf_throw(u64 cookie)
WARN(1, "A call to BPF exception callback should never return\n");
}
+__bpf_kfunc unsigned long bpf_ffs64(u64 word)
+{
+ return __ffs64(word);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(generic_btf_ids)
@@ -2573,6 +2579,7 @@ BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
#endif
BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_throw)
+BTF_ID_FLAGS(func, bpf_ffs64)
BTF_KFUNCS_END(generic_btf_ids)
static const struct btf_kfunc_id_set generic_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 011d54a1d..a5965e1b7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10926,6 +10926,7 @@ enum special_kfunc_type {
KF_bpf_percpu_obj_drop_impl,
KF_bpf_throw,
KF_bpf_iter_css_task_new,
+ KF_bpf_ffs64,
};
BTF_SET_START(special_kfunc_set)
@@ -10952,6 +10953,7 @@ BTF_ID(func, bpf_throw)
#ifdef CONFIG_CGROUPS
BTF_ID(func, bpf_iter_css_task_new)
#endif
+BTF_ID(func, bpf_ffs64)
BTF_SET_END(special_kfunc_set)
BTF_ID_LIST(special_kfunc_list)
@@ -10982,6 +10984,7 @@ BTF_ID(func, bpf_iter_css_task_new)
#else
BTF_ID_UNUSED
#endif
+BTF_ID(func, bpf_ffs64)
static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
{
@@ -19349,6 +19352,10 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
*cnt = 1;
+ } else if (desc->func_id == special_kfunc_list[KF_bpf_ffs64]) {
+ insn_buf[0].code = BPF_ALU64 | BPF_BITOPS;
+ insn_buf[0].imm = BPF_FFS64;
+ *cnt = 1;
}
return 0;
}
Thanks,
Leon