* Re: [PATCH bpf-next v3 1/2] bpf: add skcipher API support to TC/XDP programs
From: Vadim Fedorenko @ 2023-11-02 0:38 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, netdev, linux-crypto, Jakub Kicinski, Andrii Nakryiko,
Alexei Starovoitov, Mykola Lysenko, Vadim Fedorenko
In-Reply-To: <6947046d-27e3-90ee-3419-0b480af0abb0@linux.dev>
On 01.11.2023 23:41, Martin KaFai Lau wrote:
> On 11/1/23 3:50 PM, Vadim Fedorenko wrote:
>>>> +static void *__bpf_dynptr_data_ptr(const struct bpf_dynptr_kern *ptr)
>>>> +{
>>>> + enum bpf_dynptr_type type;
>>>> +
>>>> + if (!ptr->data)
>>>> + return NULL;
>>>> +
>>>> + type = bpf_dynptr_get_type(ptr);
>>>> +
>>>> + switch (type) {
>>>> + case BPF_DYNPTR_TYPE_LOCAL:
>>>> + case BPF_DYNPTR_TYPE_RINGBUF:
>>>> + return ptr->data + ptr->offset;
>>>> + case BPF_DYNPTR_TYPE_SKB:
>>>> + return skb_pointer_if_linear(ptr->data, ptr->offset,
>>>> __bpf_dynptr_size(ptr));
>>>> + case BPF_DYNPTR_TYPE_XDP:
>>>> + {
>>>> + void *xdp_ptr = bpf_xdp_pointer(ptr->data, ptr->offset,
>>>> __bpf_dynptr_size(ptr));
>>>
>>> I suspect what it is doing here (for skb and xdp in particular) is very
>>> similar to bpf_dynptr_slice. Please check if bpf_dynptr_slice(ptr, 0, NULL,
>>> sz) will work.
>>>
>>
>> Well, yes, it's simplified version of bpf_dynptr_slice. The problem is
>> that bpf_dynptr_slice bpf_kfunc which cannot be used in another
>> bpf_kfunc. Should I refactor the code to use it in both places? Like
>
> Sorry, scrolled too fast in my earlier reply :(
>
> I am not aware of this limitation. What error does it have?
> The bpf_dynptr_slice_rdwr kfunc() is also calling the bpf_dynptr_slice() kfunc.
>
Yeah, but they are in the same module. We do not declare prototypes of kfuncs in
linux/bpf.h and that's why we cannot use them outside of helpers.c.
The same problem was with bpf_dynptr_is_rdonly() I believe.
>> create __bpf_dynptr_slice() which will be internal part of bpf_kfunc?
>
>
>
>
>
>
^ permalink raw reply
* Re: [PATCH net] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Palmer Dabbelt @ 2023-11-02 0:41 UTC (permalink / raw)
To: nathan
Cc: edumazet, davem, dsahern, kuba, pabeni, ndesaulniers, trix,
0x7f454c46, fruggeri, noureddine, netdev, linux-kernel, llvm,
patches, nathan
In-Reply-To: <20231031-tcp-ao-fix-label-in-compound-statement-warning-v1-1-c9731d115f17@kernel.org>
On Tue, 31 Oct 2023 13:23:35 PDT (-0700), nathan@kernel.org wrote:
> Clang warns (or errors with CONFIG_WERROR=y) when CONFIG_TCP_AO is set:
>
> net/ipv4/tcp_output.c:663:2: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
> 663 | }
> | ^
> 1 error generated.
>
> On earlier releases (such as clang-11, the current minimum supported
> version for building the kernel) that do not support C23, this was a
> hard error unconditionally:
>
> net/ipv4/tcp_output.c:663:2: error: expected statement
> }
> ^
> 1 error generated.
>
> Add a semicolon after the label to create an empty statement, which
> resolves the warning or error for all compilers.
>
> Closes: https://github.com/ClangBuiltLinux/linux/issues/1953
> Fixes: 1e03d32bea8e ("net/tcp: Add TCP-AO sign to outgoing packets")
> Signed-off-by: Nathan Chancellor <nathan@kernel.org>
> ---
> net/ipv4/tcp_output.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f558c054cf6e..6064895daece 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -658,7 +658,7 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> memset(ptr, TCPOPT_NOP, sizeof(*ptr));
> ptr++;
> }
> -out_ao:
> +out_ao:;
> #endif
> }
> if (unlikely(opts->mss)) {
>
> ---
> base-commit: 55c900477f5b3897d9038446f72a281cae0efd86
> change-id: 20231031-tcp-ao-fix-label-in-compound-statement-warning-ebd6c9978498
>
> Best regards,
This gives me a
linux/net/ipv4/tcp_output.c:663:2: error: expected statement
}
on GCC for me. So I think something like
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f558c054cf6e..ca09763acaa8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -659,6 +659,11 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
ptr++;
}
out_ao:
+ /*
+ * Labels at the end of compound statements are a C23 feature, so
+ * introduce a block to avoid a warning/error on strict toolchains.
+ */
+ {}
#endif
}
if (unlikely(opts->mss)) {
should do it (though it's still build testing...)
^ permalink raw reply related
* Re: [PATCH bpf-next v3 2/2] selftests: bpf: crypto skcipher algo selftests
From: Vadim Fedorenko @ 2023-11-02 0:54 UTC (permalink / raw)
To: Martin KaFai Lau, Vadim Fedorenko
Cc: bpf, netdev, linux-crypto, Jakub Kicinski, Andrii Nakryiko,
Alexei Starovoitov, Mykola Lysenko
In-Reply-To: <0568c28a-f631-a12c-97ce-dbc9e5ee62b4@linux.dev>
On 01.11.2023 22:53, Martin KaFai Lau wrote:
> On 10/31/23 6:49 AM, Vadim Fedorenko wrote:
>> Add simple tc hook selftests to show the way to work with new crypto
>> BPF API. Some weird structre and map are added to setup program to make
>> verifier happy about dynptr initialization from memory. Simple AES-ECB
>> algo is used to demonstrate encryption and decryption of fixed size
>> buffers.
>>
>> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
>> ---
>> v2 -> v3:
>> - disable tests on s390 and aarch64 because of unknown Fatal exception
>> in sg_init_one
>> v1 -> v2:
>> - add CONFIG_CRYPTO_AES and CONFIG_CRYPTO_ECB to selftest build config
>> suggested by Daniel
>>
>> kernel/bpf/crypto.c | 5 +-
>> tools/testing/selftests/bpf/DENYLIST.aarch64 | 1 +
>> tools/testing/selftests/bpf/DENYLIST.s390x | 1 +
>> tools/testing/selftests/bpf/config | 3 +
>> .../selftests/bpf/prog_tests/crypto_sanity.c | 129 +++++++++++++++
>> .../selftests/bpf/progs/crypto_common.h | 103 ++++++++++++
>> .../selftests/bpf/progs/crypto_sanity.c | 154 ++++++++++++++++++
>> 7 files changed, 394 insertions(+), 2 deletions(-)
>> create mode 100644 tools/testing/selftests/bpf/prog_tests/crypto_sanity.c
>> create mode 100644 tools/testing/selftests/bpf/progs/crypto_common.h
>> create mode 100644 tools/testing/selftests/bpf/progs/crypto_sanity.c
>>
>> diff --git a/kernel/bpf/crypto.c b/kernel/bpf/crypto.c
>> index c2a0703934fc..4ee6193165ca 100644
>> --- a/kernel/bpf/crypto.c
>> +++ b/kernel/bpf/crypto.c
>> @@ -65,8 +65,9 @@ static void *__bpf_dynptr_data_ptr(const struct
>> bpf_dynptr_kern *ptr)
>> *
>> * bpf_crypto_skcipher_ctx_create() allocates memory using the BPF memory
>> * allocator, and will not block. It may return NULL if no memory is available.
>> - * @algo: bpf_dynptr which holds string representation of algorithm.
>> - * @key: bpf_dynptr which holds cipher key to do crypto.
>> + * @palgo: bpf_dynptr which holds string representation of algorithm.
>> + * @pkey: bpf_dynptr which holds cipher key to do crypto.
>> + * @err: integer to store error code when NULL is returned
>> */
>> __bpf_kfunc struct bpf_crypto_skcipher_ctx *
>> bpf_crypto_skcipher_ctx_create(const struct bpf_dynptr_kern *palgo,
>> diff --git a/tools/testing/selftests/bpf/DENYLIST.aarch64
>> b/tools/testing/selftests/bpf/DENYLIST.aarch64
>> index 5c2cc7e8c5d0..72c7ef3e5872 100644
>> --- a/tools/testing/selftests/bpf/DENYLIST.aarch64
>> +++ b/tools/testing/selftests/bpf/DENYLIST.aarch64
>> @@ -1,5 +1,6 @@
>> bpf_cookie/multi_kprobe_attach_api #
>> kprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3
>> bpf_cookie/multi_kprobe_link_api #
>> kprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3
>> +crypto_sanity # sg_init_one has exception on aarch64
>
> What is the exception? Is arm64 just lacking the support?
I have no idea, TBH. Might be because of different aligment requirements, but
it's just guess.
>
>> exceptions # JIT does not support calling kfunc
>> bpf_throw: -524
>> fexit_sleep # The test never returns.
>> The remaining tests cannot start.
>> kprobe_multi_bench_attach # needs CONFIG_FPROBE
>> diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x
>> b/tools/testing/selftests/bpf/DENYLIST.s390x
>> index 1a63996c0304..8ab7485bedb1 100644
>> --- a/tools/testing/selftests/bpf/DENYLIST.s390x
>> +++ b/tools/testing/selftests/bpf/DENYLIST.s390x
>> @@ -1,5 +1,6 @@
>> # TEMPORARY
>> # Alphabetical order
>> +crypto_sanity # sg_init_one has exception on
>> s390 (exceptions)
>> exceptions # JIT does not support calling kfunc
>> bpf_throw (exceptions)
>> get_stack_raw_tp # user_stack corrupted user
>> stack (no backchain userspace)
>> stacktrace_build_id # compare_map_keys stackid_hmap vs.
>> stackmap err -2 errno 2 (?)
>> diff --git a/tools/testing/selftests/bpf/config
>> b/tools/testing/selftests/bpf/config
>> index 02dd4409200e..48b570fd1752 100644
>> --- a/tools/testing/selftests/bpf/config
>> +++ b/tools/testing/selftests/bpf/config
>> @@ -14,6 +14,9 @@ CONFIG_CGROUP_BPF=y
>> CONFIG_CRYPTO_HMAC=y
>> CONFIG_CRYPTO_SHA256=y
>> CONFIG_CRYPTO_USER_API_HASH=y
>> +CONFIG_CRYPTO_SKCIPHER=y
>> +CONFIG_CRYPTO_ECB=y
>> +CONFIG_CRYPTO_AES=y
>> CONFIG_DEBUG_INFO=y
>> CONFIG_DEBUG_INFO_BTF=y
>> CONFIG_DEBUG_INFO_DWARF4=y
>> diff --git a/tools/testing/selftests/bpf/prog_tests/crypto_sanity.c
>> b/tools/testing/selftests/bpf/prog_tests/crypto_sanity.c
>> new file mode 100644
>> index 000000000000..a43969da6d15
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/crypto_sanity.c
>> @@ -0,0 +1,129 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
>
> s/2022/2023/ :)
ah, sure
>
>> +
>> +#include <sys/types.h>
>> +#include <sys/socket.h>
>> +#include <net/if.h>
>> +#include <linux/in6.h>
>> +
>> +#include "test_progs.h"
>> +#include "network_helpers.h"
>> +#include "crypto_sanity.skel.h"
>> +
>> +#define NS_TEST "crypto_sanity_ns"
>> +#define IPV6_IFACE_ADDR "face::1"
>> +#define UDP_TEST_PORT 7777
>> +static const char plain_text[] = "stringtoencrypt0";
>> +static const char crypted_data[] =
>> "\x5B\x59\x39\xEA\xD9\x7A\x2D\xAD\xA7\xE0\x43" \
>> + "\x37\x8A\x77\x17\xB2";
>> +
>> +void test_crypto_sanity(void)
>> +{
>> + LIBBPF_OPTS(bpf_tc_hook, qdisc_hook, .attach_point = BPF_TC_EGRESS);
>> + LIBBPF_OPTS(bpf_tc_opts, tc_attach_enc);
>> + LIBBPF_OPTS(bpf_tc_opts, tc_attach_dec);
>> + LIBBPF_OPTS(bpf_test_run_opts, opts,
>> + .data_in = crypted_data,
>> + .data_size_in = sizeof(crypted_data),
>
> This should not be needed now because the test_run is on a tracing program.
yep, will remove it
>
>> + .repeat = 1,
>> + );
>> + struct nstoken *nstoken = NULL;
>> + struct crypto_sanity *skel;
>> + struct sockaddr_in6 addr;
>> + int sockfd, err, pfd;
>> + socklen_t addrlen;
>> +
>> + skel = crypto_sanity__open();
>> + if (!ASSERT_OK_PTR(skel, "skel open"))
>> + return;
>> +
>> + bpf_program__set_autoload(skel->progs.skb_crypto_setup, true);
>
> Remove the '?' from SEC("?fentry.s/bpf_fentry_test1") should save this
> set_autoload call
ok
.
>
>> +
>> + SYS(fail, "ip netns add %s", NS_TEST);
>> + SYS(fail, "ip -net %s -6 addr add %s/128 dev lo nodad", NS_TEST,
>> IPV6_IFACE_ADDR);
>> + SYS(fail, "ip -net %s link set dev lo up", NS_TEST);
>> +
>> + err = crypto_sanity__load(skel);
>> + if (!ASSERT_OK(err, "crypto_sanity__load"))
>> + goto fail;
>> +
>> + nstoken = open_netns(NS_TEST);
>> + if (!ASSERT_OK_PTR(nstoken, "open_netns"))
>> + goto fail;
>> +
>> + qdisc_hook.ifindex = if_nametoindex("lo");
>> + if (!ASSERT_GT(qdisc_hook.ifindex, 0, "if_nametoindex lo"))
>> + goto fail;
>> +
>> + err = crypto_sanity__attach(skel);
>> + if (!ASSERT_OK(err, "crypto_sanity__attach"))
>> + goto fail;
>> +
>> + pfd = bpf_program__fd(skel->progs.skb_crypto_setup);
>> + if (!ASSERT_GT(pfd, 0, "skb_crypto_setup fd"))
>> + goto fail;
>> +
>> + err = bpf_prog_test_run_opts(pfd, &opts);
>> + if (!ASSERT_OK(err, "skb_crypto_setup") ||
>> + !ASSERT_OK(opts.retval, "skb_crypto_setup retval"))
>> + goto fail;
>> +
>> + if (!ASSERT_OK(skel->bss->status, "skb_crypto_setup status"))
>> + goto fail;
>> +
>> + err = bpf_tc_hook_create(&qdisc_hook);
>> + if (!ASSERT_OK(err, "create qdisc hook"))
>> + goto fail;
>> +
>> + addrlen = sizeof(addr);
>> + err = make_sockaddr(AF_INET6, IPV6_IFACE_ADDR, UDP_TEST_PORT,
>> + (void *)&addr, &addrlen);
>> + if (!ASSERT_OK(err, "make_sockaddr"))
>> + goto fail;
>> +
>> + tc_attach_dec.prog_fd = bpf_program__fd(skel->progs.decrypt_sanity);
>> + err = bpf_tc_attach(&qdisc_hook, &tc_attach_dec);
>> + if (!ASSERT_OK(err, "attach decrypt filter"))
>> + goto fail;
>> +
>> + sockfd = socket(AF_INET6, SOCK_DGRAM, 0);
>> + if (!ASSERT_NEQ(sockfd, -1, "decrypt socket"))
>> + goto fail;
>> + err = sendto(sockfd, crypted_data, 16, 0, (void *)&addr, addrlen);
>> + close(sockfd);
>> + if (!ASSERT_EQ(err, 16, "decrypt send"))
>> + goto fail;
>> +
>> + bpf_tc_detach(&qdisc_hook, &tc_attach_dec);
>> + if (!ASSERT_OK(skel->bss->status, "decrypt status"))
>> + goto fail;
>> + if (!ASSERT_STRNEQ(skel->bss->dst, plain_text, sizeof(plain_text),
>> "decrypt"))
>> + goto fail;
>> +
>> + tc_attach_enc.prog_fd = bpf_program__fd(skel->progs.encrypt_sanity);
>> + err = bpf_tc_attach(&qdisc_hook, &tc_attach_enc);
>> + if (!ASSERT_OK(err, "attach encrypt filter"))
>> + goto fail;
>> +
>> + sockfd = socket(AF_INET6, SOCK_DGRAM, 0);
>> + if (!ASSERT_NEQ(sockfd, -1, "encrypt socket"))
>> + goto fail;
>> + err = sendto(sockfd, plain_text, 16, 0, (void *)&addr, addrlen);
>> + close(sockfd);
>> + if (!ASSERT_EQ(err, 16, "encrypt send"))
>> + goto fail;
>> +
>> + bpf_tc_detach(&qdisc_hook, &tc_attach_enc);
>> + if (!ASSERT_OK(skel->bss->status, "encrypt status"))
>> + goto fail;
>> + if (!ASSERT_STRNEQ(skel->bss->dst, crypted_data, sizeof(crypted_data),
>> "encrypt"))
>> + goto fail;
>> +
>> +fail:
>> + if (nstoken) {
>> + bpf_tc_hook_destroy(&qdisc_hook);
>> + close_netns(nstoken);
>> + }
>> + SYS_NOFAIL("ip netns del " NS_TEST " &> /dev/null");
>> + crypto_sanity__destroy(skel);
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/crypto_common.h
>> b/tools/testing/selftests/bpf/progs/crypto_common.h
>> new file mode 100644
>> index 000000000000..584378bb6df8
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/crypto_common.h
>> @@ -0,0 +1,103 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
>> +
>> +#ifndef _CRYPTO_COMMON_H
>> +#define _CRYPTO_COMMON_H
>> +
>> +#include "errno.h"
>> +#include <stdbool.h>
>> +
>> +#define private(name) SEC(".bss." #name) __hidden __attribute__((aligned(8)))
>> +private(CTX) static struct bpf_crypto_skcipher_ctx __kptr * global_crypto_ctx;
>
> Add a test to show how it is used?
I'm thinking if we really need this? Looks like we don't.
>
>> +
>> +struct bpf_crypto_skcipher_ctx *bpf_crypto_skcipher_ctx_create(const struct
>> bpf_dynptr *algo,
>> + const struct bpf_dynptr *key,
>> + int *err) __ksym;
>> +struct bpf_crypto_skcipher_ctx *bpf_crypto_skcipher_ctx_acquire(struct
>> bpf_crypto_skcipher_ctx *ctx) __ksym;
>> +void bpf_crypto_skcipher_ctx_release(struct bpf_crypto_skcipher_ctx *ctx)
>> __ksym;
>
> Same for acquire and release kfunc. Add test cases.
>
> btw, does it have use cases to store the same bpf_crypto_skcipher_ctx in
> multiple maps?
Yeah, sure. You can potentially use the same algo with the same configured key
in parallel. All sensitive data is stored in the request structure.
>
>> +int bpf_crypto_skcipher_encrypt(struct bpf_crypto_skcipher_ctx *ctx,
>> + const struct bpf_dynptr *src, struct bpf_dynptr *dst,
>> + const struct bpf_dynptr *iv) __ksym;
>> +int bpf_crypto_skcipher_decrypt(struct bpf_crypto_skcipher_ctx *ctx,
>> + const struct bpf_dynptr *src, struct bpf_dynptr *dst,
>> + const struct bpf_dynptr *iv) __ksym;
>> +
>> +struct __crypto_skcipher_ctx_value {
>> + struct bpf_crypto_skcipher_ctx __kptr * ctx;
>> +};
>> +
>> +struct crypto_conf_value {
>> + __u8 algo[32];
>> + __u32 algo_size;
>> + __u8 key[32];
>> + __u32 key_size;
>> + __u8 iv[32];
>> + __u32 iv_size;
>> + __u8 dst[32];
>> + __u32 dst_size;
>> +};
>> +
>> +struct array_conf_map {
>> + __uint(type, BPF_MAP_TYPE_ARRAY);
>> + __type(key, int);
>> + __type(value, struct crypto_conf_value);
>> + __uint(max_entries, 1);
>> +} __crypto_conf_map SEC(".maps");
>> +
>> +struct array_map {
>> + __uint(type, BPF_MAP_TYPE_ARRAY);
>> + __type(key, int);
>> + __type(value, struct __crypto_skcipher_ctx_value);
>> + __uint(max_entries, 1);
>> +} __crypto_skcipher_ctx_map SEC(".maps");
>> +
>> +static inline struct crypto_conf_value *crypto_conf_lookup(void)
>> +{
>> + struct crypto_conf_value *v, local = {};
>> + u32 key = 0;
>> +
>> + v = bpf_map_lookup_elem(&__crypto_conf_map, &key);
>> + if (v)
>> + return v;
>> +
>> + bpf_map_update_elem(&__crypto_conf_map, &key, &local, 0);
>> + return bpf_map_lookup_elem(&__crypto_conf_map, &key);
>
> hmm... local is all 0 which became the map value. Where is it initialized?
ahh.. it's actually initialized in ctx_insert(). but looks like this part is
left from previous iteration. I'll adjust the code.
>
>> +}
>> +
>> +static inline struct __crypto_skcipher_ctx_value
>> *crypto_skcipher_ctx_value_lookup(void)
>> +{
>> + u32 key = 0;
>> +
>> + return bpf_map_lookup_elem(&__crypto_skcipher_ctx_map, &key);
>> +}
>> +
>> +static inline int crypto_skcipher_ctx_insert(struct bpf_crypto_skcipher_ctx
>> *ctx)
>> +{
>> + struct __crypto_skcipher_ctx_value local, *v;
>> + long status;
>
> nit. s/status/err/
sure
>
>> + struct bpf_crypto_skcipher_ctx *old;
>> + u32 key = 0;
>> +
>> + local.ctx = NULL;
>> + status = bpf_map_update_elem(&__crypto_skcipher_ctx_map, &key, &local, 0);
>> + if (status) {
>> + bpf_crypto_skcipher_ctx_release(ctx);
>> + return status;
>> + }
>> +
>> + v = bpf_map_lookup_elem(&__crypto_skcipher_ctx_map, &key);
>> + if (!v) {
>> + bpf_crypto_skcipher_ctx_release(ctx);
>> + return -ENOENT;
>> + }
>> +
>> + old = bpf_kptr_xchg(&v->ctx, ctx);
>> + if (old) {
>> + bpf_crypto_skcipher_ctx_release(old);
>> + return -EEXIST;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +#endif /* _CRYPTO_COMMON_H */
>> diff --git a/tools/testing/selftests/bpf/progs/crypto_sanity.c
>> b/tools/testing/selftests/bpf/progs/crypto_sanity.c
>> new file mode 100644
>> index 000000000000..71a172d8d2a2
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/crypto_sanity.c
>> @@ -0,0 +1,154 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
>> +
>> +#include "vmlinux.h"
>> +#include "bpf_tracing_net.h"
>> +#include <bpf/bpf_helpers.h>
>> +#include <bpf/bpf_endian.h>
>> +#include <bpf/bpf_tracing.h>
>> +#include "bpf_misc.h"
>> +#include "bpf_kfuncs.h"
>> +#include "crypto_common.h"
>> +
>> +#define UDP_TEST_PORT 7777
>> +unsigned char crypto_key[16] = "testtest12345678";
>> +char crypto_algo[9] = "ecb(aes)";
>> +char dst[32] = {};
>> +int status;
>> +
>> +static inline int skb_validate_test(const struct __sk_buff *skb)
>> +{
>> + struct ipv6hdr ip6h;
>> + struct udphdr udph;
>> + u32 offset;
>> +
>> + if (skb->protocol != __bpf_constant_htons(ETH_P_IPV6))
>> + return -1;
>> +
>> + if (bpf_skb_load_bytes(skb, ETH_HLEN, &ip6h, sizeof(ip6h)))
>> + return -1;
>> +
>> + if (ip6h.nexthdr != IPPROTO_UDP)
>> + return -1;
>> +
>> + if (bpf_skb_load_bytes(skb, ETH_HLEN + sizeof(ip6h), &udph, sizeof(udph)))
>> + return -1;
>> +
>> + if (udph.dest != __bpf_constant_htons(UDP_TEST_PORT))
>> + return -1;
>> +
>> + offset = ETH_HLEN + sizeof(ip6h) + sizeof(udph);
>> + if (skb->len < offset + 16)
>> + return -1;
>> +
>> + return offset;
>> +}
>> +
>> +SEC("?fentry.s/bpf_fentry_test1")
>
> I wonder if others have better idea how to do this. This is basically setting up
> the algo and key which requires to call a kfunc in sleepable context. Otherwise,
> the tc's test_run would be a better fit.
The other way can be implement crypto API function to allocate with GFP_ATOMIC.
Currently all skcipher allocations are hard-coded to GFP_KERNEL.
>
>> +int BPF_PROG(skb_crypto_setup)
>> +{
>> + struct crypto_conf_value *c;
>> + struct bpf_dynptr algo, key;
>> + int err = 0;
>> +
>> + status = 0;
>> +
>> + c = crypto_conf_lookup();
>
> "c" is not used. Why lookup is needed? May be properly setup the
> __crypto_conf_map so the example is more complete.
>
This is most exciting part. Without this lookup verifier complains on the
initialization of dynptr from memory buffer. I don't really understand the
way it works or changes pointers types.
>> + if (!c) {
>> + status = -EINVAL;
>> + return 0;
>> + }
>> +
>> + bpf_dynptr_from_mem(crypto_algo, sizeof(crypto_algo), 0, &algo);
>> + bpf_dynptr_from_mem(crypto_key, sizeof(crypto_key), 0, &key);
>> + struct bpf_crypto_skcipher_ctx *cctx =
>> bpf_crypto_skcipher_ctx_create(&algo, &key, &err);
>> +
>> + if (!cctx) {
>> + status = err;
>> + return 0;
>> + }
>> +
>> + err = crypto_skcipher_ctx_insert(cctx);
>> + if (err && err != -EEXIST)
>> + status = err;
>> +
>> + return 0;
>> +}
>> +
>> +SEC("tc")
>> +int decrypt_sanity(struct __sk_buff *skb)
>> +{
>> + struct __crypto_skcipher_ctx_value *v;
>> + struct bpf_crypto_skcipher_ctx *ctx;
>> + struct bpf_dynptr psrc, pdst, iv;
>> + int err;
>
> nit. s/err/offset/. "err" is actually the offset for the common case.
>
> btw, considering dynptr psrc has to be created from skb anyway, is it easier to
> use bpf_dynptr_slice(psrc, 0, NULL, ETH_HLEN + sizeof(ip6h) + sizeof(udph)) in
> the above skb_validate_test()?
Yeah, maybe it's better. The only problem is naming then...
>
>> +
>> + err = skb_validate_test(skb);
>> + if (err < 0) {
>> + status = err;
>> + return TC_ACT_SHOT;
>> + }
>> +
>> + v = crypto_skcipher_ctx_value_lookup();
>> + if (!v) {
>> + status = -ENOENT;
>> + return TC_ACT_SHOT;
>> + }
>> +
>> + ctx = v->ctx;
>> + if (!ctx) {
>> + status = -ENOENT;
>> + return TC_ACT_SHOT;
>> + }
>> +
>> + bpf_dynptr_from_skb(skb, 0, &psrc);
>> + bpf_dynptr_adjust(&psrc, err, err + 16);
>> + bpf_dynptr_from_mem(dst, sizeof(dst), 0, &pdst);
>> + bpf_dynptr_from_mem(dst, 0, 0, &iv);
>> +
>> + bpf_crypto_skcipher_decrypt(ctx, &psrc, &pdst, &iv);
>
> check decrypt return value before setting "status = 0;' below.
Agree, will adjust the code.
>> +
>> + status = 0;
>> +
>> + return TC_ACT_SHOT;
>> +}
>> +
>> +SEC("tc")
>> +int encrypt_sanity(struct __sk_buff *skb)
>> +{
>> + struct __crypto_skcipher_ctx_value *v;
>> + struct bpf_crypto_skcipher_ctx *ctx;
>> + struct bpf_dynptr psrc, pdst, iv;
>> + int err;
>> +
>> + status = 0;
>> +
>> + err = skb_validate_test(skb);
>> + if (err < 0) {
>> + status = err;
>> + return TC_ACT_SHOT;
>> + }
>> +
>> + v = crypto_skcipher_ctx_value_lookup();
>> + if (!v) {
>> + status = -ENOENT;
>> + return TC_ACT_SHOT;
>> + }
>> +
>> + ctx = v->ctx;
>> + if (!ctx) {
>> + status = -ENOENT;
>> + return TC_ACT_SHOT;
>> + }
>> +
>> + bpf_dynptr_from_skb(skb, 0, &psrc);
>> + bpf_dynptr_adjust(&psrc, err, err + 16);
>> + bpf_dynptr_from_mem(dst, sizeof(dst), 0, &pdst);
>> + bpf_dynptr_from_mem(dst, 0, 0, &iv);
>> +
>> + bpf_crypto_skcipher_encrypt(ctx, &psrc, &pdst, &iv);
>
> Same here. check return value.
>
Yep.
>> +
>> + return TC_ACT_SHOT;
>> +}
>> +
>> +char __license[] SEC("license") = "GPL";
>
^ permalink raw reply
* Re: [PATCH bpf-next v8 07/10] bpf, net: switch to dynamic registration
From: Kui-Feng Lee @ 2023-11-02 0:59 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: kuifeng, netdev, bpf, ast, song, kernel-team, andrii, thinker.li,
drosen
In-Reply-To: <331802b3-07bd-7fec-32a7-b85a8dae1391@linux.dev>
On 11/1/23 17:17, Martin KaFai Lau wrote:
> On 10/31/23 5:19 PM, Kui-Feng Lee wrote:
>>
>>
>> On 10/31/23 17:02, Martin KaFai Lau wrote:
>>> On 10/31/23 4:34 PM, Kui-Feng Lee wrote:
>>>>>> diff --git a/include/linux/btf.h b/include/linux/btf.h
>>>>>> index a8813605f2f6..954536431e0b 100644
>>>>>> --- a/include/linux/btf.h
>>>>>> +++ b/include/linux/btf.h
>>>>>> @@ -12,6 +12,8 @@
>>>>>> #include <uapi/linux/bpf.h>
>>>>>> #define BTF_TYPE_EMIT(type) ((void)(type *)0)
>>>>>> +#define BTF_STRUCT_OPS_TYPE_EMIT(type) {((void)(struct type
>>>>>> *)0); \
>>>>>
>>>>> ((void)(struct type *)0); is new. Why is it needed?
>>>>
>>>> This is a trick of BTF to force compiler generate type info for
>>>> the given type. Without trick, compiler may skip these types if these
>>>> type are not used at all in the module. For example, modules usually
>>>> don't use value types of struct_ops directly.
>>> It is not the value type and value type emit is understood. It is the
>>> struct_ops type itself and it is new addition in this patchset
>>> afaict. The value type emit is in the next line which was cut out
>>> from the context here.
>>>
>> I mean both of them are required.
>> In the case of a dummy implementation, struct_ops type itself properly
>> never being used, only being declared by the module. Without this line,
>
> Other than bpf_dummy_ops, after reg(), the struct_ops->func() must be
> used somewhere in the kernel or module. Like tcp must be using the
> tcp_congestion_ops after reg(). bpf_dummy_ops is very special and
> probably should be moved out to bpf_testmod somehow but this is for
> later. Even bpf_dummy_ops does not have an issue now. Why it is needed
> after the kmod support change?
>
> or it is a preemptive addition to be future proof only?
>
> Addition is fine if it is required to work. I am trying to understand
> why this new addition is needed after the kmod support change. The
> reason why this is needed after the kmod support change is not obvious
> from looking at the code. The commit message didn't mention why and what
> broke after this kmod change. If someone wants to clean it up a few
> months later, we will need to figure out why it was added in the first
> place.
It is a future proof.
What do you think if I add a comment in the code?
>
>
>> the module developer will fail to load a struct_ops map of the dummy
>> type. This line is added to avoid this awful situation.
>>
>
^ permalink raw reply
* Re: [PATCH net] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Nathan Chancellor @ 2023-11-02 1:07 UTC (permalink / raw)
To: Palmer Dabbelt
Cc: edumazet, davem, dsahern, kuba, pabeni, ndesaulniers, trix,
0x7f454c46, fruggeri, noureddine, netdev, linux-kernel, llvm,
patches
In-Reply-To: <mhng-df0ad8f4-f79c-427c-9798-8682fca4f516@palmer-ri-x1c9a>
On Wed, Nov 01, 2023 at 05:41:10PM -0700, Palmer Dabbelt wrote:
> On Tue, 31 Oct 2023 13:23:35 PDT (-0700), nathan@kernel.org wrote:
> > Clang warns (or errors with CONFIG_WERROR=y) when CONFIG_TCP_AO is set:
> >
> > net/ipv4/tcp_output.c:663:2: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
> > 663 | }
> > | ^
> > 1 error generated.
> >
> > On earlier releases (such as clang-11, the current minimum supported
> > version for building the kernel) that do not support C23, this was a
> > hard error unconditionally:
> >
> > net/ipv4/tcp_output.c:663:2: error: expected statement
> > }
> > ^
> > 1 error generated.
> >
> > Add a semicolon after the label to create an empty statement, which
> > resolves the warning or error for all compilers.
> >
> > Closes: https://github.com/ClangBuiltLinux/linux/issues/1953
> > Fixes: 1e03d32bea8e ("net/tcp: Add TCP-AO sign to outgoing packets")
> > Signed-off-by: Nathan Chancellor <nathan@kernel.org>
> > ---
> > net/ipv4/tcp_output.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index f558c054cf6e..6064895daece 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -658,7 +658,7 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> > memset(ptr, TCPOPT_NOP, sizeof(*ptr));
> > ptr++;
> > }
> > -out_ao:
> > +out_ao:;
> > #endif
> > }
> > if (unlikely(opts->mss)) {
> >
> > ---
> > base-commit: 55c900477f5b3897d9038446f72a281cae0efd86
> > change-id: 20231031-tcp-ao-fix-label-in-compound-statement-warning-ebd6c9978498
> >
> > Best regards,
>
> This gives me a
>
> linux/net/ipv4/tcp_output.c:663:2: error: expected statement
> }
>
> on GCC for me.
What GCC version?
I cannot reproduce that error with my patch applied. I tested mainline
at commit deefd5024f07 ("Merge tag 'vfio-v6.7-rc1' of
https://github.com/awilliam/linux-vfio") using GCC 6 from kernel.org and
I can reproduce a similar failure with ARCH=x86_64 allyesconfig:
net/ipv4/tcp_output.c: In function 'tcp_options_write':
net/ipv4/tcp_output.c:661:1: error: label at end of compound statement
out_ao:
^~~~~~
With this change applied, the error disappears for GCC 6 and GCC 13
continues to build without error. I can try the other supported versions
later, I just did an older and newer one for a quick test.
> So I think something like
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f558c054cf6e..ca09763acaa8 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -659,6 +659,11 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> ptr++;
> }
> out_ao:
> + /*
> + * Labels at the end of compound statements are a C23 feature, so
> + * introduce a block to avoid a warning/error on strict toolchains.
> + */
> + {}
> #endif
> }
> if (unlikely(opts->mss)) {
>
> should do it (though it's still build testing...)
I am not opposed to this once we understand what versions are affected
by this so that we have some timeline of removing this workaround.
Cheers,
Nathan
^ permalink raw reply
* Re: [RFCv2 bpf-next 1/7] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
From: Alexei Starovoitov @ 2023-11-02 1:27 UTC (permalink / raw)
To: Daniel Xu
Cc: Jakub Kicinski, Jesper Dangaard Brouer, Eric Dumazet,
Steffen Klassert, Daniel Borkmann, Herbert Xu, Alexei Starovoitov,
John Fastabend, Paolo Abeni, David S. Miller, antony.antony, LKML,
Network Development, bpf, devel
In-Reply-To: <0a5dc090a098b911bdd19ed0e63c7e466f7054f6.1698875025.git.dxu@dxuuu.xyz>
On Wed, Nov 1, 2023 at 2:58 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> This commit adds an unstable kfunc helper to access internal xfrm_state
> associated with an SA. This is intended to be used for the upcoming
> IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
> words: for custom software RSS.
>
> That being said, the function that this kfunc wraps is fairly generic
> and used for a lot of xfrm tasks. I'm sure people will find uses
> elsewhere over time.
>
> Co-developed-by: Antony Antony <antony.antony@secunet.com>
> Signed-off-by: Antony Antony <antony.antony@secunet.com>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
> include/net/xfrm.h | 9 ++++
> net/xfrm/Makefile | 1 +
> net/xfrm/xfrm_policy.c | 2 +
> net/xfrm/xfrm_state_bpf.c | 105 ++++++++++++++++++++++++++++++++++++++
> 4 files changed, 117 insertions(+)
> create mode 100644 net/xfrm/xfrm_state_bpf.c
>
> diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> index c9bb0f892f55..1d107241b901 100644
> --- a/include/net/xfrm.h
> +++ b/include/net/xfrm.h
> @@ -2190,4 +2190,13 @@ static inline int register_xfrm_interface_bpf(void)
>
> #endif
>
> +#if IS_ENABLED(CONFIG_DEBUG_INFO_BTF)
> +int register_xfrm_state_bpf(void);
> +#else
> +static inline int register_xfrm_state_bpf(void)
> +{
> + return 0;
> +}
> +#endif
> +
> #endif /* _NET_XFRM_H */
> diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
> index cd47f88921f5..547cec77ba03 100644
> --- a/net/xfrm/Makefile
> +++ b/net/xfrm/Makefile
> @@ -21,3 +21,4 @@ obj-$(CONFIG_XFRM_USER_COMPAT) += xfrm_compat.o
> obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
> obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
> obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
> +obj-$(CONFIG_DEBUG_INFO_BTF) += xfrm_state_bpf.o
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index c13dc3ef7910..1b7e75159727 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -4218,6 +4218,8 @@ void __init xfrm_init(void)
> #ifdef CONFIG_XFRM_ESPINTCP
> espintcp_init();
> #endif
> +
> + register_xfrm_state_bpf();
> }
>
> #ifdef CONFIG_AUDITSYSCALL
> diff --git a/net/xfrm/xfrm_state_bpf.c b/net/xfrm/xfrm_state_bpf.c
> new file mode 100644
> index 000000000000..4aaac134b97a
> --- /dev/null
> +++ b/net/xfrm/xfrm_state_bpf.c
since net/xfrm/xfrm_interface_bpf.c is already there and
was meant to be use as a file for interface between xfrm and bpf
may be add new kfuncs there instead of new file?
> @@ -0,0 +1,105 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Unstable XFRM state BPF helpers.
> + *
> + * Note that it is allowed to break compatibility for these functions since the
> + * interface they are exposed through to BPF programs is explicitly unstable.
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/btf_ids.h>
> +#include <net/xdp.h>
> +#include <net/xfrm.h>
> +
> +/* bpf_xfrm_state_opts - Options for XFRM state lookup helpers
> + *
> + * Members:
> + * @error - Out parameter, set for any errors encountered
> + * Values:
> + * -EINVAL - netns_id is less than -1
> + * -EINVAL - Passed NULL for opts
> + * -EINVAL - opts__sz isn't BPF_XFRM_STATE_OPTS_SZ
> + * -ENONET - No network namespace found for netns_id
> + * @netns_id - Specify the network namespace for lookup
> + * Values:
> + * BPF_F_CURRENT_NETNS (-1)
> + * Use namespace associated with ctx
> + * [0, S32_MAX]
> + * Network Namespace ID
> + * @mark - XFRM mark to match on
> + * @daddr - Destination address to match on
> + * @spi - Security parameter index to match on
> + * @proto - L3 protocol to match on
> + * @family - L3 protocol family to match on
> + */
> +struct bpf_xfrm_state_opts {
> + s32 error;
> + s32 netns_id;
> + u32 mark;
> + xfrm_address_t daddr;
> + __be32 spi;
> + u8 proto;
> + u16 family;
> +};
> +
> +enum {
> + BPF_XFRM_STATE_OPTS_SZ = sizeof(struct bpf_xfrm_state_opts),
> +};
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> + "Global functions as their definitions will be in xfrm_state BTF");
> +
> +/* bpf_xdp_get_xfrm_state - Get XFRM state
> + *
> + * Parameters:
> + * @ctx - Pointer to ctx (xdp_md) in XDP program
> + * Cannot be NULL
> + * @opts - Options for lookup (documented above)
> + * Cannot be NULL
> + * @opts__sz - Length of the bpf_xfrm_state_opts structure
> + * Must be BPF_XFRM_STATE_OPTS_SZ
> + */
> +__bpf_kfunc struct xfrm_state *
> +bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts, u32 opts__sz)
> +{
> + struct xdp_buff *xdp = (struct xdp_buff *)ctx;
> + struct net *net = dev_net(xdp->rxq->dev);
> +
> + if (!opts || opts__sz != BPF_XFRM_STATE_OPTS_SZ) {
> + opts->error = -EINVAL;
> + return NULL;
> + }
> +
> + if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS)) {
> + opts->error = -EINVAL;
> + return NULL;
> + }
> +
> + if (opts->netns_id >= 0) {
> + net = get_net_ns_by_id(net, opts->netns_id);
netns is leaking :(
> + if (unlikely(!net)) {
> + opts->error = -ENONET;
> + return NULL;
> + }
> + }
> +
> + return xfrm_state_lookup(net, opts->mark, &opts->daddr, opts->spi,
> + opts->proto, opts->family);
After looking into xfrm internals realized that
refcnt inc/dec and KF_ACQUIRE maybe unnecessary overhead.
XDP progs run under rcu_read_lock.
I think you can make a version of __xfrm_state_lookup()
without xfrm_state_hold_rcu() and avoid two atomics per packet,
but such xfrm_state may have refcnt==0.
Since bpf prog will only read from there and won't pass it anywhere
else it might be ok.
But considering the rest of ipsec overhead this might be
a premature optimization and it's better to stay with clean
acquire/release semantics.
As far as IETF:
https://datatracker.ietf.org/doc/html/draft-ietf-ipsecme-multi-sa-performance-02
it's not clear to me why one Child SA (without new pcpu field)
has to be handled by one cpu.
Sounds like it's possible to implement differently. At least in SW.
In HW, I can see how duplicating multiple crypto state and the rest
in a single queue is difficult.
^ permalink raw reply
* Re: [PATCH bpf-next v8 07/10] bpf, net: switch to dynamic registration
From: Martin KaFai Lau @ 2023-11-02 1:32 UTC (permalink / raw)
To: Kui-Feng Lee
Cc: kuifeng, netdev, bpf, ast, song, kernel-team, andrii, thinker.li,
drosen
In-Reply-To: <c4427a57-aea9-4acc-a6be-e30cfb1dbaad@gmail.com>
On 11/1/23 5:59 PM, Kui-Feng Lee wrote:
>
>
> On 11/1/23 17:17, Martin KaFai Lau wrote:
>> On 10/31/23 5:19 PM, Kui-Feng Lee wrote:
>>>
>>>
>>> On 10/31/23 17:02, Martin KaFai Lau wrote:
>>>> On 10/31/23 4:34 PM, Kui-Feng Lee wrote:
>>>>>>> diff --git a/include/linux/btf.h b/include/linux/btf.h
>>>>>>> index a8813605f2f6..954536431e0b 100644
>>>>>>> --- a/include/linux/btf.h
>>>>>>> +++ b/include/linux/btf.h
>>>>>>> @@ -12,6 +12,8 @@
>>>>>>> #include <uapi/linux/bpf.h>
>>>>>>> #define BTF_TYPE_EMIT(type) ((void)(type *)0)
>>>>>>> +#define BTF_STRUCT_OPS_TYPE_EMIT(type) {((void)(struct type *)0); \
>>>>>>
>>>>>> ((void)(struct type *)0); is new. Why is it needed?
>>>>>
>>>>> This is a trick of BTF to force compiler generate type info for
>>>>> the given type. Without trick, compiler may skip these types if these
>>>>> type are not used at all in the module. For example, modules usually
>>>>> don't use value types of struct_ops directly.
>>>> It is not the value type and value type emit is understood. It is the
>>>> struct_ops type itself and it is new addition in this patchset afaict. The
>>>> value type emit is in the next line which was cut out from the context here.
>>>>
>>> I mean both of them are required.
>>> In the case of a dummy implementation, struct_ops type itself properly never
>>> being used, only being declared by the module. Without this line,
>>
>> Other than bpf_dummy_ops, after reg(), the struct_ops->func() must be used
>> somewhere in the kernel or module. Like tcp must be using the
>> tcp_congestion_ops after reg(). bpf_dummy_ops is very special and probably
>> should be moved out to bpf_testmod somehow but this is for later. Even
>> bpf_dummy_ops does not have an issue now. Why it is needed after the kmod
>> support change?
>>
>> or it is a preemptive addition to be future proof only?
>>
>> Addition is fine if it is required to work. I am trying to understand why this
>> new addition is needed after the kmod support change. The reason why this is
>> needed after the kmod support change is not obvious from looking at the code.
>> The commit message didn't mention why and what broke after this kmod change.
>> If someone wants to clean it up a few months later, we will need to figure out
>> why it was added in the first place.
>
>
> It is a future proof.
> What do you think if I add a comment in the code?
If it is not required to work, I prefer not adding it to avoid confusion and
avoid future cleanup temptation. Even the artificial bpf_dummy_ops does not need
it, so not enough reason to introduce this code redundancy.
Switch topic.
While we are on a new macro topic, I think a new macro will be useful to emit
the value type and register_bpf_struct_ops together. wdyt?
>
>>
>>
>>> the module developer will fail to load a struct_ops map of the dummy
>>> type. This line is added to avoid this awful situation.
>>>
>>
^ permalink raw reply
* Re: [PATCH net] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Palmer Dabbelt @ 2023-11-02 1:42 UTC (permalink / raw)
To: nathan
Cc: edumazet, davem, dsahern, kuba, pabeni, ndesaulniers, trix,
0x7f454c46, fruggeri, noureddine, netdev, linux-kernel, llvm,
patches
In-Reply-To: <20231102010723.GA406542@dev-arch.thelio-3990X>
On Wed, 01 Nov 2023 18:07:23 PDT (-0700), nathan@kernel.org wrote:
> On Wed, Nov 01, 2023 at 05:41:10PM -0700, Palmer Dabbelt wrote:
>> On Tue, 31 Oct 2023 13:23:35 PDT (-0700), nathan@kernel.org wrote:
>> > Clang warns (or errors with CONFIG_WERROR=y) when CONFIG_TCP_AO is set:
>> >
>> > net/ipv4/tcp_output.c:663:2: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
>> > 663 | }
>> > | ^
>> > 1 error generated.
>> >
>> > On earlier releases (such as clang-11, the current minimum supported
>> > version for building the kernel) that do not support C23, this was a
>> > hard error unconditionally:
>> >
>> > net/ipv4/tcp_output.c:663:2: error: expected statement
>> > }
>> > ^
>> > 1 error generated.
>> >
>> > Add a semicolon after the label to create an empty statement, which
>> > resolves the warning or error for all compilers.
>> >
>> > Closes: https://github.com/ClangBuiltLinux/linux/issues/1953
>> > Fixes: 1e03d32bea8e ("net/tcp: Add TCP-AO sign to outgoing packets")
>> > Signed-off-by: Nathan Chancellor <nathan@kernel.org>
>> > ---
>> > net/ipv4/tcp_output.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> > index f558c054cf6e..6064895daece 100644
>> > --- a/net/ipv4/tcp_output.c
>> > +++ b/net/ipv4/tcp_output.c
>> > @@ -658,7 +658,7 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
>> > memset(ptr, TCPOPT_NOP, sizeof(*ptr));
>> > ptr++;
>> > }
>> > -out_ao:
>> > +out_ao:;
>> > #endif
>> > }
>> > if (unlikely(opts->mss)) {
>> >
>> > ---
>> > base-commit: 55c900477f5b3897d9038446f72a281cae0efd86
>> > change-id: 20231031-tcp-ao-fix-label-in-compound-statement-warning-ebd6c9978498
>> >
>> > Best regards,
>>
>> This gives me a
>>
>> linux/net/ipv4/tcp_output.c:663:2: error: expected statement
>> }
>>
>> on GCC for me.
>
> What GCC version?
12.1, though I can't get a smaller reproducer so I'm going to roll back
to your change and double-check. Might take a bit...
> I cannot reproduce that error with my patch applied. I tested mainline
> at commit deefd5024f07 ("Merge tag 'vfio-v6.7-rc1' of
> https://github.com/awilliam/linux-vfio") using GCC 6 from kernel.org and
> I can reproduce a similar failure with ARCH=x86_64 allyesconfig:
>
> net/ipv4/tcp_output.c: In function 'tcp_options_write':
> net/ipv4/tcp_output.c:661:1: error: label at end of compound statement
> out_ao:
> ^~~~~~
>
> With this change applied, the error disappears for GCC 6 and GCC 13
> continues to build without error. I can try the other supported versions
> later, I just did an older and newer one for a quick test.
>
>> So I think something like
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index f558c054cf6e..ca09763acaa8 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -659,6 +659,11 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
>> ptr++;
>> }
>> out_ao:
>> + /*
>> + * Labels at the end of compound statements are a C23 feature, so
>> + * introduce a block to avoid a warning/error on strict toolchains.
>> + */
>> + {}
>> #endif
>> }
>> if (unlikely(opts->mss)) {
>>
>> should do it (though it's still build testing...)
>
> I am not opposed to this once we understand what versions are affected
> by this so that we have some timeline of removing this workaround.
>
> Cheers,
> Nathan
^ permalink raw reply
* Re: Does anyone use Appletalk?
From: Finn Thain @ 2023-11-02 2:13 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Dan Williams, John Paul Adrian Glaubitz, Geert Uytterhoeven,
linux-m68k, Jakub Kicinski, Netdev, linuxppc-dev
In-Reply-To: <d40cd45a-e7e3-49c4-931b-c5ec75a6bf56@app.fastmail.com>
On Wed, 1 Nov 2023, Arnd Bergmann wrote:
>
> If we had not removed all localtalk support already, ipddp might have
> been used to bridge between a pre-ethernet mac running macip and an IP
> based AFP server (netatalk or time machine). Without localtalk support,
> that is not all that interesting of course.
>
That line of reasoning misunderstands the value of the localtalk code (and
conveniently neglects the actual cost of keeping it in-tree).
The existing zilog driver works on all 68k and powerpc Macs with built-in
serial ports. If we were to complete that driver by adding the missing
localtalk support, it would create new opportunities for creative
users/developers who already run Linux on those systems.
Those users/developers would surely derive value from that functionality
in ways we cannot anticipate, as happens over and over again in the
(retrocomputing) community.
So the value of the missing zilog localtalk functionality would be
proportional to the number of Linux systems out there with the necessary
serial hardware. It's value is not a function of the potential business
opportunities for your sponsors, despite the prevailing incentives.
It was the potential value of the missing code for localtalk (Zilog SCC)
and Apple Sound Chip that caused me to place that work near the top of my
to-do list. But that was several years ago. Unfortunately, with bug fixing
and recapping, I still can't find time to write the necessary code.
So I can't object to the removal of the localtalk code. But I do object to
the underhand manner in which it is done.
^ permalink raw reply
* RE: [EXT] Re: [PATCH net v1 2/2] octeontx2-pf: Fix holes in error code
From: Ratheesh Kannoth @ 2023-11-02 2:31 UTC (permalink / raw)
To: Paolo Abeni, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Sunil Kovvuri Goutham, Geethasowjanya Akula,
Subbaraya Sundeep Bhatta, Hariprasad Kelam, davem@davemloft.net,
edumazet@google.com, kuba@kernel.org, wojciech.drewek@intel.com
In-Reply-To: <5a46ffb675addbed8a3dac176effb96eb2c8ca3e.camel@redhat.com>
> From: Paolo Abeni <pabeni@redhat.com>
> Subject: [EXT] Re: [PATCH net v1 2/2] octeontx2-pf: Fix holes in error code
>
> On Fri, 2023-10-27 at 07:49 +0530, Ratheesh Kannoth wrote:
> > Error code strings are not getting printed properly due to holes.
> > Print error code as well.
> error",
> > qidx);
> > if (val & BIT_ULL(NIX_CQERRINT_CQE_FAULT))
> > - netdev_err(pf->netdev, "CQ%lld: Memory
> fault on CQE write to LLC/DRAM",
> > + netdev_err(pf->netdev,
> > + "CQ%lld: Memory fault on CQE
> write to LLC/DRAM",
> > qidx);
> > }
>
> It's not a big deal (no need to repost just for this), but the above chunk (and
> a couple below, too) is not related to the current fix, you should have not
> included it here.
I understand your point. But, Commit message says "Print error code as well". When I added code to print it, it crossed 80 column mark,
so checkpatch script threw warning. So I splitted lines. Actually , there is not much change here.
-Ratheesh
^ permalink raw reply
* Re: [GIT PULL v2] Networking for 6.7
From: pr-tracker-bot @ 2023-11-02 2:51 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: torvalds, kuba, davem, netdev, linux-kernel, pabeni
In-Reply-To: <20231031210948.2651866-1-kuba@kernel.org>
The pull request you sent on Tue, 31 Oct 2023 14:09:48 -0700:
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git tags/net-next-6.7-v2
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/ff269e2cd5adce4ae14f883fc9c8803bc43ee1e9
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* Re: [RFC Draft PATCHv2 net-next] Doc: update bridge doc
From: Hangbin Liu @ 2023-11-02 4:09 UTC (permalink / raw)
To: Nikolay Aleksandrov
Cc: netdev, David S . Miller, David Ahern, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Ido Schimmel, Roopa Prabhu,
Stephen Hemminger
In-Reply-To: <68045f82-4306-165b-c4b2-96432cc8c422@blackwall.org>
Hi Nikolay,
On Wed, Nov 01, 2023 at 01:29:34PM +0200, Nikolay Aleksandrov wrote:
> Hi,
> I have written some initial comments, there will definitely be more.
> One general thing - please split this in 2 patches at least. 1 for the
> documentation, and 1 for the netlink uAPI changes. You can even split it
Sure, I will.
> further into logical parts if you'd like, it will make it easier to
> review and people can focus on different parts better. Please CC DSA
> folks as well.
>
> > diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
> > index c859f3c1636e..b36bd737c05e 100644
> > --- a/Documentation/networking/bridge.rst
> > +++ b/Documentation/networking/bridge.rst
> > -The bridge-utilities are maintained at:
> > - git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
> > +Bridge internals
> > +================
> > -Additionally, the iproute2 utilities can be used to configure
> > -bridge devices.
> > +Here are the core structs of bridge code.
>
> the core structs? These are outdated structures used in ioctl.
Ah, I just notice this. In next patch I will use `struct net_bridge_vlan`
as example.
> > +Bridge sysfs
> > +------------
> > +
> > +All the sysfs parameters are also exported via the bridge netlink API.
> > +Here you can find the explanation based on the correspond netlink attributes.
>
> I don't get this one?
Some users/admins may still config bridge via sysfs. So I added this part
to let users know what's the meaning/usage of each sysfs field.
Do you want to remove this part for the doc?
> Also please mention the sysfs interface is deprecated and should not be
> extended if new options are added.
OK, I will add this note if we will keep this part.
> > +
> > +Switchdev
> > +=========
> > +
> > +Linux Bridge Switchdev is a feature in the Linux kernel that extends the
> > +capabilities of the traditional Linux bridge to work more efficiently with
> > +hardware switches that support switchdev. This technology is particularly
> > +useful in data center and networking environments where high-performance
> > +and low-latency packet forwarding is essential.
>
> The last sentence is misleading, switchdev is used for many different types
> of devices.
> > +
> > +With Linux Bridge Switchdev, certain networking functions like forwarding,
> > +filtering, and learning of Ethernet frames can be offloaded to the hardware
>
> "to a hardware switch"
>
> > +switch. This offloading reduces the burden on the Linux kernel and CPU,
> > +leading to improved network performance and lower latency.
> > +
> > +To use Linux Bridge Switchdev, you need hardware switches that support the
> > +switchdev interface. This means that the switch hardware needs to have the
> > +necessary drivers and functionality to work in conjunction with the Linux
> > +kernel.
>
> I'd add DSA maintainers to the CC list, and also ask switchdev driver
> maintainers to add more here. Switchdev can be explained much better.
Yes, you are right. I will add them in next version.
> > +
> > +Is it protocol independent?
>
> Unclear, what layer?
How about "Is it L3/L4 protocol independent?"
> > +---------------------------
> > +
> > +Yes. The bridge knows nothing about protocols, it only sees Ethernet frames.
>
> It sees all frames, it *uses* only L2 headers/information.
>
> > diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
> > index f95326fce6bb..63e39de1055b 100644
> > --- a/include/uapi/linux/if_bridge.h
> > +++ b/include/uapi/linux/if_bridge.h
> > + *
> > + * @IFLA_BR_MCAST_STARTUP_QUERY_CNT
> > + * Set the number of IGMP queries to send during startup phase.
>
> What is a startup phase?
https://datatracker.ietf.org/doc/html/rfc2236#section-8.7
I think it means when the bridge/switch up.
>
> > + *
> > + * The default value is 1.
>
> What is 1?
1 second.
>
> > + *
> > + * @IFLA_BR_MCAST_MEMBERSHIP_INTVL
> > + * The interval after which the bridge will leave a group, if no membership
> > + * reports for this group are received.
> > + *
> > + * The default value is 260.
>
> What is 260? Please be more specific.
OK, I will.
Thanks
Hangbin
^ permalink raw reply
* Re: [PATCH 2/2] tg3: Fix the TX ring stall
From: alexey.pakhunov @ 2023-11-02 4:10 UTC (permalink / raw)
To: michael.chan
Cc: alexey.pakhunov, linux-kernel, mchan, netdev, prashant,
siva.kallam, vincent.wong2
In-Reply-To: <CACKFLikNku0_9=MNjj=X+RvO_omTtg5TicQeM5oWfk0NSxiqwg@mail.gmail.com>
Hi,
> Thanks for the patch. An alternative fix that may be simpler is to
> add a goto after calling tg3_tso_bug(). Something like this:
>
> tg3_tso_bug();
> goto update_tx_mbox;
> ...
>
> update_tx_mbox:
> if (!netdev_xmit_more() || netif_xmit_stopped())
> tw32_tx_mbox();
> ...
Yes, I considered this approach but in the end it seemed more fragile. All
future updates to tg3_start_xmit() would need to be careful to make sure
all return paths go through "update_tx_mbox". This is much more
straightforward with a separate wrapper function.
The sizes of both patches are roughly the same. The wrapper function
version:
drivers/net/ethernet/broadcom/tg3.c | 46 +++++++++++++++++++++++++++++++++++-----------
1 file changed, 35 insertions(+), 11 deletions(-)
The goto version touches four different return paths: three tg3_tso_bug()
calls and the return at the very top of the function:
drivers/net/ethernet/broadcom/tg3.c | 46 +++++++++++++++++++++++++++++++++-------------
1 file changed, 33 insertions(+), 13 deletions(-)
Let me re-test the goto version and resubmit it as v2. Please let me know
which version of the patch you prefer more.
Alex.
^ permalink raw reply
* Re: [PATCH bpf-next v8 07/10] bpf, net: switch to dynamic registration
From: Kui-Feng Lee @ 2023-11-02 4:19 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: kuifeng, netdev, bpf, ast, song, kernel-team, andrii, thinker.li,
drosen
In-Reply-To: <22051390-2331-ad11-406b-1e5c6dbcd6a2@linux.dev>
On 11/1/23 18:32, Martin KaFai Lau wrote:
> On 11/1/23 5:59 PM, Kui-Feng Lee wrote:
>>
>>
>> On 11/1/23 17:17, Martin KaFai Lau wrote:
>>> On 10/31/23 5:19 PM, Kui-Feng Lee wrote:
>>>>
>>>>
>>>> On 10/31/23 17:02, Martin KaFai Lau wrote:
>>>>> On 10/31/23 4:34 PM, Kui-Feng Lee wrote:
>>>>>>>> diff --git a/include/linux/btf.h b/include/linux/btf.h
>>>>>>>> index a8813605f2f6..954536431e0b 100644
>>>>>>>> --- a/include/linux/btf.h
>>>>>>>> +++ b/include/linux/btf.h
>>>>>>>> @@ -12,6 +12,8 @@
>>>>>>>> #include <uapi/linux/bpf.h>
>>>>>>>> #define BTF_TYPE_EMIT(type) ((void)(type *)0)
>>>>>>>> +#define BTF_STRUCT_OPS_TYPE_EMIT(type) {((void)(struct type
>>>>>>>> *)0); \
>>>>>>>
>>>>>>> ((void)(struct type *)0); is new. Why is it needed?
>>>>>>
>>>>>> This is a trick of BTF to force compiler generate type info for
>>>>>> the given type. Without trick, compiler may skip these types if these
>>>>>> type are not used at all in the module. For example, modules usually
>>>>>> don't use value types of struct_ops directly.
>>>>> It is not the value type and value type emit is understood. It is
>>>>> the struct_ops type itself and it is new addition in this patchset
>>>>> afaict. The value type emit is in the next line which was cut out
>>>>> from the context here.
>>>>>
>>>> I mean both of them are required.
>>>> In the case of a dummy implementation, struct_ops type itself
>>>> properly never being used, only being declared by the module.
>>>> Without this line,
>>>
>>> Other than bpf_dummy_ops, after reg(), the struct_ops->func() must be
>>> used somewhere in the kernel or module. Like tcp must be using the
>>> tcp_congestion_ops after reg(). bpf_dummy_ops is very special and
>>> probably should be moved out to bpf_testmod somehow but this is for
>>> later. Even bpf_dummy_ops does not have an issue now. Why it is
>>> needed after the kmod support change?
>>>
>>> or it is a preemptive addition to be future proof only?
>>>
>>> Addition is fine if it is required to work. I am trying to understand
>>> why this new addition is needed after the kmod support change. The
>>> reason why this is needed after the kmod support change is not
>>> obvious from looking at the code. The commit message didn't mention
>>> why and what broke after this kmod change. If someone wants to clean
>>> it up a few months later, we will need to figure out why it was added
>>> in the first place.
>>
>>
>> It is a future proof.
>> What do you think if I add a comment in the code?
>
> If it is not required to work, I prefer not adding it to avoid confusion
> and avoid future cleanup temptation. Even the artificial bpf_dummy_ops
> does not need it, so not enough reason to introduce this code redundancy.
Got it!
>
> Switch topic.
> While we are on a new macro topic, I think a new macro will be useful to
> emit the value type and register_bpf_struct_ops together. wdyt?
Like this?
#define REGISTER_STRUCT_OPS(st_type, st_ops) { \
BTF_STRUCT_OPS_TYPE_EMIT(st_type); \
register_bpf_struct_ops(st_ops); } while(0)
static int bpf_testmod_init(void) {
....
REGISTER_STRUCT_OPS(bpf_testmod_ops, &bpf_bpf_testmod_ops);
....
}
or you like something more aggressive
#define REGISTER_STRUCT_OPS(st_type) { \
BTF_STRUCT_OPS_TYPE_EMIT(st_type); \
register_bpf_struct_ops(&bpf_##st_type); } while(0)
>
>>
>>>
>>>
>>>> the module developer will fail to load a struct_ops map of the dummy
>>>> type. This line is added to avoid this awful situation.
>>>>
>>>
>
^ permalink raw reply
* Re: [PATCH net-next 4/5] virtio-net: support rx netdim
From: Jason Wang @ 2023-11-02 4:31 UTC (permalink / raw)
To: Heng Qi
Cc: Michael S. Tsirkin, netdev, virtualization, Xuan Zhuo,
Eric Dumazet, David S. Miller, Paolo Abeni,
Jesper Dangaard Brouer, John Fastabend, Alexei Starovoitov,
Jakub Kicinski, Simon Horman, Liu, Yujie
In-Reply-To: <3bc31a3b-a022-4816-a854-7f6b41d2e351@linux.alibaba.com>
On Wed, Nov 1, 2023 at 6:55 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>
>
>
> 在 2023/10/25 上午11:34, Jason Wang 写道:
> > On Thu, Oct 12, 2023 at 3:44 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >> By comparing the traffic information in the complete napi processes,
> >> let the virtio-net driver automatically adjust the coalescing
> >> moderation parameters of each receive queue.
> >>
> >> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
> >> ---
> >> drivers/net/virtio_net.c | 147 +++++++++++++++++++++++++++++++++------
> >> 1 file changed, 126 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index caef78bb3963..6ad2890a7909 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -19,6 +19,7 @@
> >> #include <linux/average.h>
> >> #include <linux/filter.h>
> >> #include <linux/kernel.h>
> >> +#include <linux/dim.h>
> >> #include <net/route.h>
> >> #include <net/xdp.h>
> >> #include <net/net_failover.h>
> >> @@ -172,6 +173,17 @@ struct receive_queue {
> >>
> >> struct virtnet_rq_stats stats;
> >>
> >> + /* The number of rx notifications */
> >> + u16 calls;
> >> +
> >> + /* Is dynamic interrupt moderation enabled? */
> >> + bool dim_enabled;
> >> +
> >> + /* Dynamic Interrupt Moderation */
> >> + struct dim dim;
> >> +
> >> + u32 packets_in_napi;
> >> +
> >> struct virtnet_interrupt_coalesce intr_coal;
> >>
> >> /* Chain pages by the private ptr. */
> >> @@ -305,6 +317,9 @@ struct virtnet_info {
> >> u8 duplex;
> >> u32 speed;
> >>
> >> + /* Is rx dynamic interrupt moderation enabled? */
> >> + bool rx_dim_enabled;
> >> +
> >> /* Interrupt coalescing settings */
> >> struct virtnet_interrupt_coalesce intr_coal_tx;
> >> struct virtnet_interrupt_coalesce intr_coal_rx;
> >> @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq)
> >> struct virtnet_info *vi = rvq->vdev->priv;
> >> struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
> >>
> >> + rq->calls++;
> >> virtqueue_napi_schedule(&rq->napi, rvq);
> >> }
> >>
> >> @@ -2138,6 +2154,25 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
> >> }
> >> }
> >>
> >> +static void virtnet_rx_dim_work(struct work_struct *work);
> >> +
> >> +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct receive_queue *rq)
> >> +{
> >> + struct virtnet_rq_stats *stats = &rq->stats;
> >> + struct dim_sample cur_sample = {};
> >> +
> >> + if (!rq->packets_in_napi)
> >> + return;
> >> +
> >> + u64_stats_update_begin(&rq->stats.syncp);
> >> + dim_update_sample(rq->calls, stats->packets,
> >> + stats->bytes, &cur_sample);
> >> + u64_stats_update_end(&rq->stats.syncp);
> >> +
> >> + net_dim(&rq->dim, cur_sample);
> >> + rq->packets_in_napi = 0;
> >> +}
> >> +
> >> static int virtnet_poll(struct napi_struct *napi, int budget)
> >> {
> >> struct receive_queue *rq =
> >> @@ -2146,17 +2181,22 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
> >> struct send_queue *sq;
> >> unsigned int received;
> >> unsigned int xdp_xmit = 0;
> >> + bool napi_complete;
> >>
> >> virtnet_poll_cleantx(rq);
> >>
> >> received = virtnet_receive(rq, budget, &xdp_xmit);
> >> + rq->packets_in_napi += received;
> >>
> >> if (xdp_xmit & VIRTIO_XDP_REDIR)
> >> xdp_do_flush();
> >>
> >> /* Out of packets? */
> >> - if (received < budget)
> >> - virtqueue_napi_complete(napi, rq->vq, received);
> >> + if (received < budget) {
> >> + napi_complete = virtqueue_napi_complete(napi, rq->vq, received);
> >> + if (napi_complete && rq->dim_enabled)
> >> + virtnet_rx_dim_update(vi, rq);
> >> + }
> >>
> >> if (xdp_xmit & VIRTIO_XDP_TX) {
> >> sq = virtnet_xdp_get_sq(vi);
> >> @@ -2176,6 +2216,7 @@ static void virtnet_disable_queue_pair(struct virtnet_info *vi, int qp_index)
> >> virtnet_napi_tx_disable(&vi->sq[qp_index].napi);
> >> napi_disable(&vi->rq[qp_index].napi);
> >> xdp_rxq_info_unreg(&vi->rq[qp_index].xdp_rxq);
> >> + cancel_work_sync(&vi->rq[qp_index].dim.work);
> >> }
> >>
> >> static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
> >> @@ -2193,6 +2234,9 @@ static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
> >> if (err < 0)
> >> goto err_xdp_reg_mem_model;
> >>
> >> + INIT_WORK(&vi->rq[qp_index].dim.work, virtnet_rx_dim_work);
> >> + vi->rq[qp_index].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
> >> +
> >> virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
> >> virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, &vi->sq[qp_index].napi);
> >>
> >> @@ -3335,23 +3379,42 @@ static int virtnet_send_tx_notf_coal_cmds(struct virtnet_info *vi,
> >> static int virtnet_send_rx_notf_coal_cmds(struct virtnet_info *vi,
> >> struct ethtool_coalesce *ec)
> >> {
> >> + bool rx_ctrl_dim_on = !!ec->use_adaptive_rx_coalesce;
> >> struct scatterlist sgs_rx;
> >> + int i;
> >>
> >> - vi->ctrl->coal_rx.rx_usecs = cpu_to_le32(ec->rx_coalesce_usecs);
> >> - vi->ctrl->coal_rx.rx_max_packets = cpu_to_le32(ec->rx_max_coalesced_frames);
> >> - sg_init_one(&sgs_rx, &vi->ctrl->coal_rx, sizeof(vi->ctrl->coal_rx));
> >> -
> >> - if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
> >> - VIRTIO_NET_CTRL_NOTF_COAL_RX_SET,
> >> - &sgs_rx))
> >> + if (rx_ctrl_dim_on && (ec->rx_coalesce_usecs != vi->intr_coal_rx.max_usecs ||
> >> + ec->rx_max_coalesced_frames != vi->intr_coal_rx.max_packets))
> > Any reason we need to stick a check for usecs/packets? I think it
> > might confuse the user since the value could be modified by netdim
> > actually.
>
> Yes, that's exactly what's done here.
>
> When dim is enabled, the user is prohibited from manually configuring
> parameters because dim may modify the parameters.
So it should be something like
if (rx_ctrl_dim_on)
return -EINVAL;
without the checking of whether it matches the current parameters?
>
> >
> >> return -EINVAL;
> >>
> >> - /* Save parameters */
> >> - vi->intr_coal_rx.max_usecs = ec->rx_coalesce_usecs;
> >> - vi->intr_coal_rx.max_packets = ec->rx_max_coalesced_frames;
> >> - for (i = 0; i < vi->max_queue_pairs; i++) {
> >> - vi->rq[i].intr_coal.max_usecs = ec->rx_coalesce_usecs;
> >> - vi->rq[i].intr_coal.max_packets = ec->rx_max_coalesced_frames;
> >> + if (rx_ctrl_dim_on) {
> >> + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL)) {
> >> + vi->rx_dim_enabled = true;
> >> + for (i = 0; i < vi->max_queue_pairs; i++)
> >> + vi->rq[i].dim_enabled = true;
> >> + } else {
> >> + return -EOPNOTSUPP;
> >> + }
> >> + } else {
> >> + vi->rx_dim_enabled = false;
> >> + for (i = 0; i < vi->max_queue_pairs; i++)
> >> + vi->rq[i].dim_enabled = false;
> >> +
> >> + vi->ctrl->coal_rx.rx_usecs = cpu_to_le32(ec->rx_coalesce_usecs);
> >> + vi->ctrl->coal_rx.rx_max_packets = cpu_to_le32(ec->rx_max_coalesced_frames);
> >> + sg_init_one(&sgs_rx, &vi->ctrl->coal_rx, sizeof(vi->ctrl->coal_rx));
> >> +
> >> + if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
> >> + VIRTIO_NET_CTRL_NOTF_COAL_RX_SET,
> >> + &sgs_rx))
> >> + return -EINVAL;
> >> +
> >> + vi->intr_coal_rx.max_usecs = ec->rx_coalesce_usecs;
> >> + vi->intr_coal_rx.max_packets = ec->rx_max_coalesced_frames;
> >> + for (i = 0; i < vi->max_queue_pairs; i++) {
> >> + vi->rq[i].intr_coal.max_usecs = ec->rx_coalesce_usecs;
> >> + vi->rq[i].intr_coal.max_packets = ec->rx_max_coalesced_frames;
> >> + }
> >> }
> >>
> >> return 0;
> >> @@ -3377,13 +3440,27 @@ static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi,
> >> struct ethtool_coalesce *ec,
> >> u16 queue)
> >> {
> >> + bool rx_ctrl_dim_on;
> >> + u32 max_usecs, max_packets;
> >> int err;
> >>
> >> - err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, queue,
> >> - ec->rx_coalesce_usecs,
> >> - ec->rx_max_coalesced_frames);
> >> - if (err)
> >> - return err;
> >> + rx_ctrl_dim_on = !!ec->use_adaptive_rx_coalesce;
> >> + max_usecs = vi->rq[queue].intr_coal.max_usecs;
> >> + max_packets = vi->rq[queue].intr_coal.max_packets;
> >> + if (rx_ctrl_dim_on && (ec->rx_coalesce_usecs != max_usecs ||
> >> + ec->rx_max_coalesced_frames != max_packets))
> >> + return -EINVAL;
> >> +
> >> + if (rx_ctrl_dim_on) {
> >> + vi->rq[queue].dim_enabled = true;
> >> + } else {
> >> + vi->rq[queue].dim_enabled = false;
> >> + err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, queue,
> >> + ec->rx_coalesce_usecs,
> >> + ec->rx_max_coalesced_frames);
> >> + if (err)
> >> + return err;
> >> + }
> >>
> >> err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, queue,
> >> ec->tx_coalesce_usecs,
> >> @@ -3394,6 +3471,32 @@ static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi,
> >> return 0;
> >> }
> >>
> >> +static void virtnet_rx_dim_work(struct work_struct *work)
> >> +{
> >> + struct dim *dim = container_of(work, struct dim, work);
> >> + struct receive_queue *rq = container_of(dim,
> >> + struct receive_queue, dim);
> >> + struct virtnet_info *vi = rq->vq->vdev->priv;
> >> + struct net_device *dev = vi->dev;
> >> + struct dim_cq_moder update_moder;
> >> + int qnum = rq - vi->rq, err;
> >> +
> >> + update_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
> >> + if (update_moder.usec != vi->rq[qnum].intr_coal.max_usecs ||
> >> + update_moder.pkts != vi->rq[qnum].intr_coal.max_packets) {
> > Is this safe across the e.g vq reset?
>
> I think it might. This will be avoided in the next version using:
> 1. cancel virtnet_rx_dim_work before vq reset.
> 2. restore virtnet_rx_dim_work after vq re-enable.
Ok.
Thanks
>
> Thanks a lot!
>
> >
> > Thanks
> >
> >> + rtnl_lock();
> >> + err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
> >> + update_moder.usec,
> >> + update_moder.pkts);
> >> + if (err)
> >> + pr_debug("%s: Failed to send dim parameters on rxq%d\n",
> >> + dev->name, (int)(rq - vi->rq));
> >> + rtnl_unlock();
> >> + }
> >> +
> >> + dim->state = DIM_START_MEASURE;
> >> +}
> >> +
> >> static int virtnet_coal_params_supported(struct ethtool_coalesce *ec)
> >> {
> >> /* usecs coalescing is supported only if VIRTIO_NET_F_NOTF_COAL
> >> @@ -3475,6 +3578,7 @@ static int virtnet_get_coalesce(struct net_device *dev,
> >> ec->tx_coalesce_usecs = vi->intr_coal_tx.max_usecs;
> >> ec->tx_max_coalesced_frames = vi->intr_coal_tx.max_packets;
> >> ec->rx_max_coalesced_frames = vi->intr_coal_rx.max_packets;
> >> + ec->use_adaptive_rx_coalesce = vi->rx_dim_enabled;
> >> } else {
> >> ec->rx_max_coalesced_frames = 1;
> >>
> >> @@ -3532,6 +3636,7 @@ static int virtnet_get_per_queue_coalesce(struct net_device *dev,
> >> ec->tx_coalesce_usecs = vi->sq[queue].intr_coal.max_usecs;
> >> ec->tx_max_coalesced_frames = vi->sq[queue].intr_coal.max_packets;
> >> ec->rx_max_coalesced_frames = vi->rq[queue].intr_coal.max_packets;
> >> + ec->use_adaptive_rx_coalesce = vi->rq[queue].dim_enabled;
> >> } else {
> >> ec->rx_max_coalesced_frames = 1;
> >>
> >> @@ -3657,7 +3762,7 @@ static int virtnet_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *info)
> >>
> >> static const struct ethtool_ops virtnet_ethtool_ops = {
> >> .supported_coalesce_params = ETHTOOL_COALESCE_MAX_FRAMES |
> >> - ETHTOOL_COALESCE_USECS,
> >> + ETHTOOL_COALESCE_USECS | ETHTOOL_COALESCE_USE_ADAPTIVE_RX,
> >> .get_drvinfo = virtnet_get_drvinfo,
> >> .get_link = ethtool_op_get_link,
> >> .get_ringparam = virtnet_get_ringparam,
> >> --
> >> 2.19.1.6.gb485710b
> >>
>
^ permalink raw reply
* Re: [PATCH net-next 0/5] virtio-net: support dynamic coalescing moderation
From: Jason Wang @ 2023-11-02 4:33 UTC (permalink / raw)
To: Heng Qi
Cc: Michael S. Tsirkin, netdev, virtualization, Xuan Zhuo,
Eric Dumazet, David S. Miller, Paolo Abeni,
Jesper Dangaard Brouer, John Fastabend, Alexei Starovoitov,
Jakub Kicinski, Simon Horman, Liu, Yujie
In-Reply-To: <d3b9e9e8-1ef4-48ac-8a2f-4fa647ae4372@linux.alibaba.com>
On Wed, Nov 1, 2023 at 7:03 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>
>
>
> 在 2023/10/25 下午1:53, Michael S. Tsirkin 写道:
> > On Wed, Oct 25, 2023 at 09:18:27AM +0800, Jason Wang wrote:
> >> On Tue, Oct 24, 2023 at 8:03 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >>>
> >>>
> >>> 在 2023/10/12 下午4:29, Jason Wang 写道:
> >>>> On Thu, Oct 12, 2023 at 3:44 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >>>>> Now, virtio-net already supports per-queue moderation parameter
> >>>>> setting. Based on this, we use the netdim library of linux to support
> >>>>> dynamic coalescing moderation for virtio-net.
> >>>>>
> >>>>> Due to hardware scheduling issues, we only tested rx dim.
> >>>> Do you have PPS numbers? And TX numbers are also important as the
> >>>> throughput could be misleading due to various reasons.
> >>> Hi Jason!
> >>>
> >>> The comparison of rx netdim performance is as follows:
> >>> (the backend supporting tx dim is not yet ready)
> >> Thanks a lot for the numbers.
> >>
> >> I'd still expect the TX result as I did play tx interrupt coalescing
> >> about 10 years ago.
> >>
> >> I will start to review the series but let's try to have some TX numbers as well.
> >>
> >> Btw, it would be more convenient to have a raw PPS benchmark. E.g you
> >> can try to use a software or hardware packet generator.
> >>
> >> Thanks
> > Latency results are also kind of interesting.
>
> I test the latency using sockperf pp:
>
> @Rx cmd
> taskset -c 0 sockperf sr -p 8989
>
> @Tx cmd
> taskset -c 0 sockperf pp -i ${ip} -p 8989 -t 10
>
> After running this cmd 5 times and averaging the results,
> we get the following data:
>
> dim off: 17.7735 usec
> dim on: 18.0110 usec
Let's add those numbers to the changelog of the next version.
Thanks
>
> Thanks!
>
> >
> >
> >>>
> >>> I. Sockperf UDP
> >>> =================================================
> >>> 1. Env
> >>> rxq_0 is affinity to cpu_0
> >>>
> >>> 2. Cmd
> >>> client: taskset -c 0 sockperf tp -p 8989 -i $IP -t 10 -m 16B
> >>> server: taskset -c 0 sockperf sr -p 8989
> >>>
> >>> 3. Result
> >>> dim off: 1143277.00 rxpps, throughput 17.844 MBps, cpu is 100%.
> >>> dim on: 1124161.00 rxpps, throughput 17.610 MBps, cpu is 83.5%.
> >>> =================================================
> >>>
> >>>
> >>> II. Redis
> >>> =================================================
> >>> 1. Env
> >>> There are 8 rxqs and rxq_i is affinity to cpu_i.
> >>>
> >>> 2. Result
> >>> When all cpus are 100%, ops/sec of memtier_benchmark client is
> >>> dim off: 978437.23
> >>> dim on: 1143638.28
> >>> =================================================
> >>>
> >>>
> >>> III. Nginx
> >>> =================================================
> >>> 1. Env
> >>> There are 8 rxqs and rxq_i is affinity to cpu_i.
> >>>
> >>> 2. Result
> >>> When all cpus are 100%, requests/sec of wrk client is
> >>> dim off: 877931.67
> >>> dim on: 1019160.31
> >>> =================================================
> >>>
> >>> Thanks!
> >>>
> >>>> Thanks
> >>>>
> >>>>> @Test env
> >>>>> rxq0 has affinity to cpu0.
> >>>>>
> >>>>> @Test cmd
> >>>>> client: taskset -c 0 sockperf tp -i ${IP} -t 30 --tcp -m ${msg_size}
> >>>>> server: taskset -c 0 sockperf sr --tcp
> >>>>>
> >>>>> @Test res
> >>>>> The second column is the ratio of the result returned by client
> >>>>> when rx dim is enabled to the result returned by client when
> >>>>> rx dim is disabled.
> >>>>> --------------------------------------
> >>>>> | msg_size | rx_dim=on / rx_dim=off |
> >>>>> --------------------------------------
> >>>>> | 14B | + 3% |
> >>>>> --------------------------------------
> >>>>> | 100B | + 16% |
> >>>>> --------------------------------------
> >>>>> | 500B | + 25% |
> >>>>> --------------------------------------
> >>>>> | 1400B | + 28% |
> >>>>> --------------------------------------
> >>>>> | 2048B | + 22% |
> >>>>> --------------------------------------
> >>>>> | 4096B | + 5% |
> >>>>> --------------------------------------
> >>>>>
> >>>>> ---
> >>>>> This patch set was part of the previous netdim patch set[1].
> >>>>> [1] was split into a merged bugfix set[2] and the current set.
> >>>>> The previous relevant commentators have been Cced.
> >>>>>
> >>>>> [1] https://lore.kernel.org/all/20230811065512.22190-1-hengqi@linux.alibaba.com/
> >>>>> [2] https://lore.kernel.org/all/cover.1696745452.git.hengqi@linux.alibaba.com/
> >>>>>
> >>>>> Heng Qi (5):
> >>>>> virtio-net: returns whether napi is complete
> >>>>> virtio-net: separate rx/tx coalescing moderation cmds
> >>>>> virtio-net: extract virtqueue coalescig cmd for reuse
> >>>>> virtio-net: support rx netdim
> >>>>> virtio-net: support tx netdim
> >>>>>
> >>>>> drivers/net/virtio_net.c | 394 ++++++++++++++++++++++++++++++++-------
> >>>>> 1 file changed, 322 insertions(+), 72 deletions(-)
> >>>>>
> >>>>> --
> >>>>> 2.19.1.6.gb485710b
> >>>>>
> >>>>>
> >>>
>
^ permalink raw reply
* Re: [PATCH net-next 0/5] virtio-net: support dynamic coalescing moderation
From: Jason Wang @ 2023-11-02 4:34 UTC (permalink / raw)
To: Heng Qi
Cc: Michael S. Tsirkin, netdev, virtualization, Xuan Zhuo,
Eric Dumazet, David S. Miller, Paolo Abeni,
Jesper Dangaard Brouer, John Fastabend, Alexei Starovoitov,
Jakub Kicinski, Simon Horman, Liu, Yujie
In-Reply-To: <753ac6da-f7f1-4acb-9184-e59271809c6d@linux.alibaba.com>
On Wed, Nov 1, 2023 at 5:38 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>
>
>
> 在 2023/10/25 上午9:18, Jason Wang 写道:
> > On Tue, Oct 24, 2023 at 8:03 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >>
> >>
> >> 在 2023/10/12 下午4:29, Jason Wang 写道:
> >>> On Thu, Oct 12, 2023 at 3:44 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >>>> Now, virtio-net already supports per-queue moderation parameter
> >>>> setting. Based on this, we use the netdim library of linux to support
> >>>> dynamic coalescing moderation for virtio-net.
> >>>>
> >>>> Due to hardware scheduling issues, we only tested rx dim.
> >>> Do you have PPS numbers? And TX numbers are also important as the
> >>> throughput could be misleading due to various reasons.
> >> Hi Jason!
> >>
> >> The comparison of rx netdim performance is as follows:
> >> (the backend supporting tx dim is not yet ready)
> > Thanks a lot for the numbers.
> >
> > I'd still expect the TX result as I did play tx interrupt coalescing
>
> Hi, Jason.
>
> Sorry for the late reply to this! Our team has been blocked by other
> priorities the past few days.
>
> For tx dim, we have a fixed empirical value internally.
> This value performs better overall than manually adjusting the tx timer
> register -->
> I'll do not have tx numbers. :( So in the short term I no longer try to
> push [5/5]
> patch for tx dim and try to return -EOPNOTSUPP for it, sorry for this.
>
> > about 10 years ago.
> >
> > I will start to review the series but let's try to have some TX numbers as well.
> >
> > Btw, it would be more convenient to have a raw PPS benchmark. E.g you
>
> I got some raw pps data using pktgen from linux/sample/pktgen:
>
> 1. tx cmd
> ./pktgen_sample02_multiqueue.sh -i eth1 -s 44 -d ${dst_ip} -m ${dst_mac}
> -t 8 -f 0 -n 0
>
> This uses 8 kpktgend threads to inject data into eth1.
>
> 2. Rx side loads a simple xdp prog which drops all received udp packets.
>
> 3. Data
> pps: ~1000w
For "w" did you mean 10 million? Looks too huge to me?
> rx dim off: cpu idle= ~35%
> rx dim on: cpu idle= ~76%
This looks promising.
Thanks
>
> Thanks!
>
> > can try to use a software or hardware packet generator.
> >
> > Thanks
> >
> >>
> >> I. Sockperf UDP
> >> =================================================
> >> 1. Env
> >> rxq_0 is affinity to cpu_0
> >>
> >> 2. Cmd
> >> client: taskset -c 0 sockperf tp -p 8989 -i $IP -t 10 -m 16B
> >> server: taskset -c 0 sockperf sr -p 8989
> >>
> >> 3. Result
> >> dim off: 1143277.00 rxpps, throughput 17.844 MBps, cpu is 100%.
> >> dim on: 1124161.00 rxpps, throughput 17.610 MBps, cpu is 83.5%.
> >> =================================================
> >>
> >>
> >> II. Redis
> >> =================================================
> >> 1. Env
> >> There are 8 rxqs and rxq_i is affinity to cpu_i.
> >>
> >> 2. Result
> >> When all cpus are 100%, ops/sec of memtier_benchmark client is
> >> dim off: 978437.23
> >> dim on: 1143638.28
> >> =================================================
> >>
> >>
> >> III. Nginx
> >> =================================================
> >> 1. Env
> >> There are 8 rxqs and rxq_i is affinity to cpu_i.
> >>
> >> 2. Result
> >> When all cpus are 100%, requests/sec of wrk client is
> >> dim off: 877931.67
> >> dim on: 1019160.31
> >> =================================================
> >>
> >> Thanks!
> >>
> >>> Thanks
> >>>
> >>>> @Test env
> >>>> rxq0 has affinity to cpu0.
> >>>>
> >>>> @Test cmd
> >>>> client: taskset -c 0 sockperf tp -i ${IP} -t 30 --tcp -m ${msg_size}
> >>>> server: taskset -c 0 sockperf sr --tcp
> >>>>
> >>>> @Test res
> >>>> The second column is the ratio of the result returned by client
> >>>> when rx dim is enabled to the result returned by client when
> >>>> rx dim is disabled.
> >>>> --------------------------------------
> >>>> | msg_size | rx_dim=on / rx_dim=off |
> >>>> --------------------------------------
> >>>> | 14B | + 3% |
> >>>> --------------------------------------
> >>>> | 100B | + 16% |
> >>>> --------------------------------------
> >>>> | 500B | + 25% |
> >>>> --------------------------------------
> >>>> | 1400B | + 28% |
> >>>> --------------------------------------
> >>>> | 2048B | + 22% |
> >>>> --------------------------------------
> >>>> | 4096B | + 5% |
> >>>> --------------------------------------
> >>>>
> >>>> ---
> >>>> This patch set was part of the previous netdim patch set[1].
> >>>> [1] was split into a merged bugfix set[2] and the current set.
> >>>> The previous relevant commentators have been Cced.
> >>>>
> >>>> [1] https://lore.kernel.org/all/20230811065512.22190-1-hengqi@linux.alibaba.com/
> >>>> [2] https://lore.kernel.org/all/cover.1696745452.git.hengqi@linux.alibaba.com/
> >>>>
> >>>> Heng Qi (5):
> >>>> virtio-net: returns whether napi is complete
> >>>> virtio-net: separate rx/tx coalescing moderation cmds
> >>>> virtio-net: extract virtqueue coalescig cmd for reuse
> >>>> virtio-net: support rx netdim
> >>>> virtio-net: support tx netdim
> >>>>
> >>>> drivers/net/virtio_net.c | 394 ++++++++++++++++++++++++++++++++-------
> >>>> 1 file changed, 322 insertions(+), 72 deletions(-)
> >>>>
> >>>> --
> >>>> 2.19.1.6.gb485710b
> >>>>
> >>>>
> >>
>
^ permalink raw reply
* Re: [PATCH] net: usbnet: Fix potential NULL pointer dereference
From: Jakub Kicinski @ 2023-11-02 4:38 UTC (permalink / raw)
To: Ren Mingshuai
Cc: caowangbao, davem, khlebnikov, liaichun, linux-kernel, netdev,
oneukum, yanan
In-Reply-To: <20231101125511.222629-1-renmingshuai@huawei.com>
On Wed, 1 Nov 2023 20:55:11 +0800 Ren Mingshuai wrote:
> >23ba07991dad said SKB can be NULL without describing the triggering
> >scenario. Always Check it before dereference to void potential NULL
> >pointer dereference.
> I've tried to find out the scenarios where SKB is NULL, but failed.
> It seems impossible for SKB to be NULL. If SKB can be NULL, please tell
> me the reason and I'd be very grateful.
What do you mean? Grepping the function name shows call sites with NULL
getting passed as skb.
^ permalink raw reply
* Re: [net-next v2 PATCH] octeontx2-pf: TC flower offload support for ICMP type and code
From: Jakub Kicinski @ 2023-11-02 4:42 UTC (permalink / raw)
To: Geetha sowjanya
Cc: netdev, linux-kernel, davem, pabeni, edumazet, sgoutham, sbhatta,
hkelam
In-Reply-To: <20231031165258.30002-1-gakula@marvell.com>
On Tue, 31 Oct 2023 22:22:58 +0530 Geetha sowjanya wrote:
> Adds tc offload support for matching on ICMP type and code.
>
> Example usage:
> To enable adding tc ingress rules
> tc qdisc add dev eth0 ingress
>
> TC rule drop the ICMP echo reply:
> tc filter add dev eth0 protocol ip parent ffff: \
> flower ip_proto icmp type 8 code 0 skip_sw action drop
>
> TC rule to drop ICMPv6 echo reply:
> tc filter add dev eth0 protocol ipv6 parent ffff: flower \
> indev eth0 ip_proto icmpv6 type 128 code 0 action drop
## Form letter - net-next-closed
The merge window for v6.7 has begun and we have already posted our pull
request. Therefore net-next is closed for new drivers, features, code
refactoring and optimizations. We are currently accepting bug fixes only.
Please repost when net-next reopens after Nov 12th.
RFC patches sent for review only are obviously welcome at any time.
See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
--
pw-bot: defer
^ permalink raw reply
* Re: [PATCH 2/2] tg3: Fix the TX ring stall
From: Michael Chan @ 2023-11-02 4:42 UTC (permalink / raw)
To: alexey.pakhunov
Cc: linux-kernel, mchan, netdev, prashant, siva.kallam, vincent.wong2
In-Reply-To: <20231102041045.3103307-1-alexey.pakhunov@spacex.com>
[-- Attachment #1: Type: text/plain, Size: 1644 bytes --]
On Wed, Nov 1, 2023 at 9:11 PM <alexey.pakhunov@spacex.com> wrote:
>
> Hi,
>
> > Thanks for the patch. An alternative fix that may be simpler is to
> > add a goto after calling tg3_tso_bug(). Something like this:
> >
> > tg3_tso_bug();
> > goto update_tx_mbox;
> > ...
> >
> > update_tx_mbox:
> > if (!netdev_xmit_more() || netif_xmit_stopped())
> > tw32_tx_mbox();
> > ...
>
> Yes, I considered this approach but in the end it seemed more fragile. All
> future updates to tg3_start_xmit() would need to be careful to make sure
> all return paths go through "update_tx_mbox". This is much more
> straightforward with a separate wrapper function.
>
> The sizes of both patches are roughly the same. The wrapper function
> version:
>
> drivers/net/ethernet/broadcom/tg3.c | 46 +++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 35 insertions(+), 11 deletions(-)
>
> The goto version touches four different return paths: three tg3_tso_bug()
> calls and the return at the very top of the function:
>
> drivers/net/ethernet/broadcom/tg3.c | 46 +++++++++++++++++++++++++++++++++-------------
> 1 file changed, 33 insertions(+), 13 deletions(-)
>
> Let me re-test the goto version and resubmit it as v2. Please let me know
> which version of the patch you prefer more.
>
I did not realize the goto version is almost as big. In that case,
your original version is fine.
You might want to declare the variables in reverse Xmas tree style for
any new code. This driver is old and most of the existing code does
not follow that style.
Thanks.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]
^ permalink raw reply
* Re: [PATCH net-next 4/5] virtio-net: support rx netdim
From: Heng Qi @ 2023-11-02 4:46 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, netdev, virtualization, Xuan Zhuo,
Eric Dumazet, David S. Miller, Paolo Abeni,
Jesper Dangaard Brouer, John Fastabend, Alexei Starovoitov,
Jakub Kicinski, Simon Horman, Liu, Yujie
In-Reply-To: <CACGkMEuL+ocjBasSAOvKZA7pzLganK9cwvtr-ErquMZBC0aNDw@mail.gmail.com>
在 2023/11/2 下午12:31, Jason Wang 写道:
> On Wed, Nov 1, 2023 at 6:55 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>
>>
>> 在 2023/10/25 上午11:34, Jason Wang 写道:
>>> On Thu, Oct 12, 2023 at 3:44 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>>> By comparing the traffic information in the complete napi processes,
>>>> let the virtio-net driver automatically adjust the coalescing
>>>> moderation parameters of each receive queue.
>>>>
>>>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
>>>> ---
>>>> drivers/net/virtio_net.c | 147 +++++++++++++++++++++++++++++++++------
>>>> 1 file changed, 126 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>> index caef78bb3963..6ad2890a7909 100644
>>>> --- a/drivers/net/virtio_net.c
>>>> +++ b/drivers/net/virtio_net.c
>>>> @@ -19,6 +19,7 @@
>>>> #include <linux/average.h>
>>>> #include <linux/filter.h>
>>>> #include <linux/kernel.h>
>>>> +#include <linux/dim.h>
>>>> #include <net/route.h>
>>>> #include <net/xdp.h>
>>>> #include <net/net_failover.h>
>>>> @@ -172,6 +173,17 @@ struct receive_queue {
>>>>
>>>> struct virtnet_rq_stats stats;
>>>>
>>>> + /* The number of rx notifications */
>>>> + u16 calls;
>>>> +
>>>> + /* Is dynamic interrupt moderation enabled? */
>>>> + bool dim_enabled;
>>>> +
>>>> + /* Dynamic Interrupt Moderation */
>>>> + struct dim dim;
>>>> +
>>>> + u32 packets_in_napi;
>>>> +
>>>> struct virtnet_interrupt_coalesce intr_coal;
>>>>
>>>> /* Chain pages by the private ptr. */
>>>> @@ -305,6 +317,9 @@ struct virtnet_info {
>>>> u8 duplex;
>>>> u32 speed;
>>>>
>>>> + /* Is rx dynamic interrupt moderation enabled? */
>>>> + bool rx_dim_enabled;
>>>> +
>>>> /* Interrupt coalescing settings */
>>>> struct virtnet_interrupt_coalesce intr_coal_tx;
>>>> struct virtnet_interrupt_coalesce intr_coal_rx;
>>>> @@ -2001,6 +2016,7 @@ static void skb_recv_done(struct virtqueue *rvq)
>>>> struct virtnet_info *vi = rvq->vdev->priv;
>>>> struct receive_queue *rq = &vi->rq[vq2rxq(rvq)];
>>>>
>>>> + rq->calls++;
>>>> virtqueue_napi_schedule(&rq->napi, rvq);
>>>> }
>>>>
>>>> @@ -2138,6 +2154,25 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
>>>> }
>>>> }
>>>>
>>>> +static void virtnet_rx_dim_work(struct work_struct *work);
>>>> +
>>>> +static void virtnet_rx_dim_update(struct virtnet_info *vi, struct receive_queue *rq)
>>>> +{
>>>> + struct virtnet_rq_stats *stats = &rq->stats;
>>>> + struct dim_sample cur_sample = {};
>>>> +
>>>> + if (!rq->packets_in_napi)
>>>> + return;
>>>> +
>>>> + u64_stats_update_begin(&rq->stats.syncp);
>>>> + dim_update_sample(rq->calls, stats->packets,
>>>> + stats->bytes, &cur_sample);
>>>> + u64_stats_update_end(&rq->stats.syncp);
>>>> +
>>>> + net_dim(&rq->dim, cur_sample);
>>>> + rq->packets_in_napi = 0;
>>>> +}
>>>> +
>>>> static int virtnet_poll(struct napi_struct *napi, int budget)
>>>> {
>>>> struct receive_queue *rq =
>>>> @@ -2146,17 +2181,22 @@ static int virtnet_poll(struct napi_struct *napi, int budget)
>>>> struct send_queue *sq;
>>>> unsigned int received;
>>>> unsigned int xdp_xmit = 0;
>>>> + bool napi_complete;
>>>>
>>>> virtnet_poll_cleantx(rq);
>>>>
>>>> received = virtnet_receive(rq, budget, &xdp_xmit);
>>>> + rq->packets_in_napi += received;
>>>>
>>>> if (xdp_xmit & VIRTIO_XDP_REDIR)
>>>> xdp_do_flush();
>>>>
>>>> /* Out of packets? */
>>>> - if (received < budget)
>>>> - virtqueue_napi_complete(napi, rq->vq, received);
>>>> + if (received < budget) {
>>>> + napi_complete = virtqueue_napi_complete(napi, rq->vq, received);
>>>> + if (napi_complete && rq->dim_enabled)
>>>> + virtnet_rx_dim_update(vi, rq);
>>>> + }
>>>>
>>>> if (xdp_xmit & VIRTIO_XDP_TX) {
>>>> sq = virtnet_xdp_get_sq(vi);
>>>> @@ -2176,6 +2216,7 @@ static void virtnet_disable_queue_pair(struct virtnet_info *vi, int qp_index)
>>>> virtnet_napi_tx_disable(&vi->sq[qp_index].napi);
>>>> napi_disable(&vi->rq[qp_index].napi);
>>>> xdp_rxq_info_unreg(&vi->rq[qp_index].xdp_rxq);
>>>> + cancel_work_sync(&vi->rq[qp_index].dim.work);
>>>> }
>>>>
>>>> static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
>>>> @@ -2193,6 +2234,9 @@ static int virtnet_enable_queue_pair(struct virtnet_info *vi, int qp_index)
>>>> if (err < 0)
>>>> goto err_xdp_reg_mem_model;
>>>>
>>>> + INIT_WORK(&vi->rq[qp_index].dim.work, virtnet_rx_dim_work);
>>>> + vi->rq[qp_index].dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
>>>> +
>>>> virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
>>>> virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, &vi->sq[qp_index].napi);
>>>>
>>>> @@ -3335,23 +3379,42 @@ static int virtnet_send_tx_notf_coal_cmds(struct virtnet_info *vi,
>>>> static int virtnet_send_rx_notf_coal_cmds(struct virtnet_info *vi,
>>>> struct ethtool_coalesce *ec)
>>>> {
>>>> + bool rx_ctrl_dim_on = !!ec->use_adaptive_rx_coalesce;
>>>> struct scatterlist sgs_rx;
>>>> + int i;
>>>>
>>>> - vi->ctrl->coal_rx.rx_usecs = cpu_to_le32(ec->rx_coalesce_usecs);
>>>> - vi->ctrl->coal_rx.rx_max_packets = cpu_to_le32(ec->rx_max_coalesced_frames);
>>>> - sg_init_one(&sgs_rx, &vi->ctrl->coal_rx, sizeof(vi->ctrl->coal_rx));
>>>> -
>>>> - if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
>>>> - VIRTIO_NET_CTRL_NOTF_COAL_RX_SET,
>>>> - &sgs_rx))
>>>> + if (rx_ctrl_dim_on && (ec->rx_coalesce_usecs != vi->intr_coal_rx.max_usecs ||
>>>> + ec->rx_max_coalesced_frames != vi->intr_coal_rx.max_packets))
>>> Any reason we need to stick a check for usecs/packets? I think it
>>> might confuse the user since the value could be modified by netdim
>>> actually.
>> Yes, that's exactly what's done here.
>>
>> When dim is enabled, the user is prohibited from manually configuring
>> parameters because dim may modify the parameters.
> So it should be something like
>
> if (rx_ctrl_dim_on)
> return -EINVAL;
>
> without the checking of whether it matches the current parameters?
I think yes.
Thanks!
>
>>>> return -EINVAL;
>>>>
>>>> - /* Save parameters */
>>>> - vi->intr_coal_rx.max_usecs = ec->rx_coalesce_usecs;
>>>> - vi->intr_coal_rx.max_packets = ec->rx_max_coalesced_frames;
>>>> - for (i = 0; i < vi->max_queue_pairs; i++) {
>>>> - vi->rq[i].intr_coal.max_usecs = ec->rx_coalesce_usecs;
>>>> - vi->rq[i].intr_coal.max_packets = ec->rx_max_coalesced_frames;
>>>> + if (rx_ctrl_dim_on) {
>>>> + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL)) {
>>>> + vi->rx_dim_enabled = true;
>>>> + for (i = 0; i < vi->max_queue_pairs; i++)
>>>> + vi->rq[i].dim_enabled = true;
>>>> + } else {
>>>> + return -EOPNOTSUPP;
>>>> + }
>>>> + } else {
>>>> + vi->rx_dim_enabled = false;
>>>> + for (i = 0; i < vi->max_queue_pairs; i++)
>>>> + vi->rq[i].dim_enabled = false;
>>>> +
>>>> + vi->ctrl->coal_rx.rx_usecs = cpu_to_le32(ec->rx_coalesce_usecs);
>>>> + vi->ctrl->coal_rx.rx_max_packets = cpu_to_le32(ec->rx_max_coalesced_frames);
>>>> + sg_init_one(&sgs_rx, &vi->ctrl->coal_rx, sizeof(vi->ctrl->coal_rx));
>>>> +
>>>> + if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
>>>> + VIRTIO_NET_CTRL_NOTF_COAL_RX_SET,
>>>> + &sgs_rx))
>>>> + return -EINVAL;
>>>> +
>>>> + vi->intr_coal_rx.max_usecs = ec->rx_coalesce_usecs;
>>>> + vi->intr_coal_rx.max_packets = ec->rx_max_coalesced_frames;
>>>> + for (i = 0; i < vi->max_queue_pairs; i++) {
>>>> + vi->rq[i].intr_coal.max_usecs = ec->rx_coalesce_usecs;
>>>> + vi->rq[i].intr_coal.max_packets = ec->rx_max_coalesced_frames;
>>>> + }
>>>> }
>>>>
>>>> return 0;
>>>> @@ -3377,13 +3440,27 @@ static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi,
>>>> struct ethtool_coalesce *ec,
>>>> u16 queue)
>>>> {
>>>> + bool rx_ctrl_dim_on;
>>>> + u32 max_usecs, max_packets;
>>>> int err;
>>>>
>>>> - err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, queue,
>>>> - ec->rx_coalesce_usecs,
>>>> - ec->rx_max_coalesced_frames);
>>>> - if (err)
>>>> - return err;
>>>> + rx_ctrl_dim_on = !!ec->use_adaptive_rx_coalesce;
>>>> + max_usecs = vi->rq[queue].intr_coal.max_usecs;
>>>> + max_packets = vi->rq[queue].intr_coal.max_packets;
>>>> + if (rx_ctrl_dim_on && (ec->rx_coalesce_usecs != max_usecs ||
>>>> + ec->rx_max_coalesced_frames != max_packets))
>>>> + return -EINVAL;
>>>> +
>>>> + if (rx_ctrl_dim_on) {
>>>> + vi->rq[queue].dim_enabled = true;
>>>> + } else {
>>>> + vi->rq[queue].dim_enabled = false;
>>>> + err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, queue,
>>>> + ec->rx_coalesce_usecs,
>>>> + ec->rx_max_coalesced_frames);
>>>> + if (err)
>>>> + return err;
>>>> + }
>>>>
>>>> err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, queue,
>>>> ec->tx_coalesce_usecs,
>>>> @@ -3394,6 +3471,32 @@ static int virtnet_send_notf_coal_vq_cmds(struct virtnet_info *vi,
>>>> return 0;
>>>> }
>>>>
>>>> +static void virtnet_rx_dim_work(struct work_struct *work)
>>>> +{
>>>> + struct dim *dim = container_of(work, struct dim, work);
>>>> + struct receive_queue *rq = container_of(dim,
>>>> + struct receive_queue, dim);
>>>> + struct virtnet_info *vi = rq->vq->vdev->priv;
>>>> + struct net_device *dev = vi->dev;
>>>> + struct dim_cq_moder update_moder;
>>>> + int qnum = rq - vi->rq, err;
>>>> +
>>>> + update_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
>>>> + if (update_moder.usec != vi->rq[qnum].intr_coal.max_usecs ||
>>>> + update_moder.pkts != vi->rq[qnum].intr_coal.max_packets) {
>>> Is this safe across the e.g vq reset?
>> I think it might. This will be avoided in the next version using:
>> 1. cancel virtnet_rx_dim_work before vq reset.
>> 2. restore virtnet_rx_dim_work after vq re-enable.
> Ok.
>
> Thanks
>
>> Thanks a lot!
>>
>>> Thanks
>>>
>>>> + rtnl_lock();
>>>> + err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
>>>> + update_moder.usec,
>>>> + update_moder.pkts);
>>>> + if (err)
>>>> + pr_debug("%s: Failed to send dim parameters on rxq%d\n",
>>>> + dev->name, (int)(rq - vi->rq));
>>>> + rtnl_unlock();
>>>> + }
>>>> +
>>>> + dim->state = DIM_START_MEASURE;
>>>> +}
>>>> +
>>>> static int virtnet_coal_params_supported(struct ethtool_coalesce *ec)
>>>> {
>>>> /* usecs coalescing is supported only if VIRTIO_NET_F_NOTF_COAL
>>>> @@ -3475,6 +3578,7 @@ static int virtnet_get_coalesce(struct net_device *dev,
>>>> ec->tx_coalesce_usecs = vi->intr_coal_tx.max_usecs;
>>>> ec->tx_max_coalesced_frames = vi->intr_coal_tx.max_packets;
>>>> ec->rx_max_coalesced_frames = vi->intr_coal_rx.max_packets;
>>>> + ec->use_adaptive_rx_coalesce = vi->rx_dim_enabled;
>>>> } else {
>>>> ec->rx_max_coalesced_frames = 1;
>>>>
>>>> @@ -3532,6 +3636,7 @@ static int virtnet_get_per_queue_coalesce(struct net_device *dev,
>>>> ec->tx_coalesce_usecs = vi->sq[queue].intr_coal.max_usecs;
>>>> ec->tx_max_coalesced_frames = vi->sq[queue].intr_coal.max_packets;
>>>> ec->rx_max_coalesced_frames = vi->rq[queue].intr_coal.max_packets;
>>>> + ec->use_adaptive_rx_coalesce = vi->rq[queue].dim_enabled;
>>>> } else {
>>>> ec->rx_max_coalesced_frames = 1;
>>>>
>>>> @@ -3657,7 +3762,7 @@ static int virtnet_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *info)
>>>>
>>>> static const struct ethtool_ops virtnet_ethtool_ops = {
>>>> .supported_coalesce_params = ETHTOOL_COALESCE_MAX_FRAMES |
>>>> - ETHTOOL_COALESCE_USECS,
>>>> + ETHTOOL_COALESCE_USECS | ETHTOOL_COALESCE_USE_ADAPTIVE_RX,
>>>> .get_drvinfo = virtnet_get_drvinfo,
>>>> .get_link = ethtool_op_get_link,
>>>> .get_ringparam = virtnet_get_ringparam,
>>>> --
>>>> 2.19.1.6.gb485710b
>>>>
^ permalink raw reply
* Re: [PATCH net-next 0/5] virtio-net: support dynamic coalescing moderation
From: Heng Qi @ 2023-11-02 4:47 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, netdev, virtualization, Xuan Zhuo,
Eric Dumazet, David S. Miller, Paolo Abeni,
Jesper Dangaard Brouer, John Fastabend, Alexei Starovoitov,
Jakub Kicinski, Simon Horman, Liu, Yujie
In-Reply-To: <CACGkMEsQ4oDbXPQZ2boB-Bj36qzWs9Sx_Du9ZiJLe+-99DOtwQ@mail.gmail.com>
在 2023/11/2 下午12:33, Jason Wang 写道:
> On Wed, Nov 1, 2023 at 7:03 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>
>>
>> 在 2023/10/25 下午1:53, Michael S. Tsirkin 写道:
>>> On Wed, Oct 25, 2023 at 09:18:27AM +0800, Jason Wang wrote:
>>>> On Tue, Oct 24, 2023 at 8:03 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>>>>
>>>>> 在 2023/10/12 下午4:29, Jason Wang 写道:
>>>>>> On Thu, Oct 12, 2023 at 3:44 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>>>>>> Now, virtio-net already supports per-queue moderation parameter
>>>>>>> setting. Based on this, we use the netdim library of linux to support
>>>>>>> dynamic coalescing moderation for virtio-net.
>>>>>>>
>>>>>>> Due to hardware scheduling issues, we only tested rx dim.
>>>>>> Do you have PPS numbers? And TX numbers are also important as the
>>>>>> throughput could be misleading due to various reasons.
>>>>> Hi Jason!
>>>>>
>>>>> The comparison of rx netdim performance is as follows:
>>>>> (the backend supporting tx dim is not yet ready)
>>>> Thanks a lot for the numbers.
>>>>
>>>> I'd still expect the TX result as I did play tx interrupt coalescing
>>>> about 10 years ago.
>>>>
>>>> I will start to review the series but let's try to have some TX numbers as well.
>>>>
>>>> Btw, it would be more convenient to have a raw PPS benchmark. E.g you
>>>> can try to use a software or hardware packet generator.
>>>>
>>>> Thanks
>>> Latency results are also kind of interesting.
>> I test the latency using sockperf pp:
>>
>> @Rx cmd
>> taskset -c 0 sockperf sr -p 8989
>>
>> @Tx cmd
>> taskset -c 0 sockperf pp -i ${ip} -p 8989 -t 10
>>
>> After running this cmd 5 times and averaging the results,
>> we get the following data:
>>
>> dim off: 17.7735 usec
>> dim on: 18.0110 usec
> Let's add those numbers to the changelog of the next version.
Ok. Thanks!
>
> Thanks
>
>> Thanks!
>>
>>>
>>>>> I. Sockperf UDP
>>>>> =================================================
>>>>> 1. Env
>>>>> rxq_0 is affinity to cpu_0
>>>>>
>>>>> 2. Cmd
>>>>> client: taskset -c 0 sockperf tp -p 8989 -i $IP -t 10 -m 16B
>>>>> server: taskset -c 0 sockperf sr -p 8989
>>>>>
>>>>> 3. Result
>>>>> dim off: 1143277.00 rxpps, throughput 17.844 MBps, cpu is 100%.
>>>>> dim on: 1124161.00 rxpps, throughput 17.610 MBps, cpu is 83.5%.
>>>>> =================================================
>>>>>
>>>>>
>>>>> II. Redis
>>>>> =================================================
>>>>> 1. Env
>>>>> There are 8 rxqs and rxq_i is affinity to cpu_i.
>>>>>
>>>>> 2. Result
>>>>> When all cpus are 100%, ops/sec of memtier_benchmark client is
>>>>> dim off: 978437.23
>>>>> dim on: 1143638.28
>>>>> =================================================
>>>>>
>>>>>
>>>>> III. Nginx
>>>>> =================================================
>>>>> 1. Env
>>>>> There are 8 rxqs and rxq_i is affinity to cpu_i.
>>>>>
>>>>> 2. Result
>>>>> When all cpus are 100%, requests/sec of wrk client is
>>>>> dim off: 877931.67
>>>>> dim on: 1019160.31
>>>>> =================================================
>>>>>
>>>>> Thanks!
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> @Test env
>>>>>>> rxq0 has affinity to cpu0.
>>>>>>>
>>>>>>> @Test cmd
>>>>>>> client: taskset -c 0 sockperf tp -i ${IP} -t 30 --tcp -m ${msg_size}
>>>>>>> server: taskset -c 0 sockperf sr --tcp
>>>>>>>
>>>>>>> @Test res
>>>>>>> The second column is the ratio of the result returned by client
>>>>>>> when rx dim is enabled to the result returned by client when
>>>>>>> rx dim is disabled.
>>>>>>> --------------------------------------
>>>>>>> | msg_size | rx_dim=on / rx_dim=off |
>>>>>>> --------------------------------------
>>>>>>> | 14B | + 3% |
>>>>>>> --------------------------------------
>>>>>>> | 100B | + 16% |
>>>>>>> --------------------------------------
>>>>>>> | 500B | + 25% |
>>>>>>> --------------------------------------
>>>>>>> | 1400B | + 28% |
>>>>>>> --------------------------------------
>>>>>>> | 2048B | + 22% |
>>>>>>> --------------------------------------
>>>>>>> | 4096B | + 5% |
>>>>>>> --------------------------------------
>>>>>>>
>>>>>>> ---
>>>>>>> This patch set was part of the previous netdim patch set[1].
>>>>>>> [1] was split into a merged bugfix set[2] and the current set.
>>>>>>> The previous relevant commentators have been Cced.
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/all/20230811065512.22190-1-hengqi@linux.alibaba.com/
>>>>>>> [2] https://lore.kernel.org/all/cover.1696745452.git.hengqi@linux.alibaba.com/
>>>>>>>
>>>>>>> Heng Qi (5):
>>>>>>> virtio-net: returns whether napi is complete
>>>>>>> virtio-net: separate rx/tx coalescing moderation cmds
>>>>>>> virtio-net: extract virtqueue coalescig cmd for reuse
>>>>>>> virtio-net: support rx netdim
>>>>>>> virtio-net: support tx netdim
>>>>>>>
>>>>>>> drivers/net/virtio_net.c | 394 ++++++++++++++++++++++++++++++++-------
>>>>>>> 1 file changed, 322 insertions(+), 72 deletions(-)
>>>>>>>
>>>>>>> --
>>>>>>> 2.19.1.6.gb485710b
>>>>>>>
>>>>>>>
^ permalink raw reply
* Re: [PATCH net-next 0/5] virtio-net: support dynamic coalescing moderation
From: Heng Qi @ 2023-11-02 4:51 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, netdev, virtualization, Xuan Zhuo,
Eric Dumazet, David S. Miller, Paolo Abeni,
Jesper Dangaard Brouer, John Fastabend, Alexei Starovoitov,
Jakub Kicinski, Simon Horman, Liu, Yujie
In-Reply-To: <CACGkMEsRVV9mgVe2+Qe89QZD807KV8jyBmAz5--Z3NiZBPrPVg@mail.gmail.com>
在 2023/11/2 下午12:34, Jason Wang 写道:
> On Wed, Nov 1, 2023 at 5:38 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>
>>
>> 在 2023/10/25 上午9:18, Jason Wang 写道:
>>> On Tue, Oct 24, 2023 at 8:03 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>>>
>>>> 在 2023/10/12 下午4:29, Jason Wang 写道:
>>>>> On Thu, Oct 12, 2023 at 3:44 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>>>>> Now, virtio-net already supports per-queue moderation parameter
>>>>>> setting. Based on this, we use the netdim library of linux to support
>>>>>> dynamic coalescing moderation for virtio-net.
>>>>>>
>>>>>> Due to hardware scheduling issues, we only tested rx dim.
>>>>> Do you have PPS numbers? And TX numbers are also important as the
>>>>> throughput could be misleading due to various reasons.
>>>> Hi Jason!
>>>>
>>>> The comparison of rx netdim performance is as follows:
>>>> (the backend supporting tx dim is not yet ready)
>>> Thanks a lot for the numbers.
>>>
>>> I'd still expect the TX result as I did play tx interrupt coalescing
>> Hi, Jason.
>>
>> Sorry for the late reply to this! Our team has been blocked by other
>> priorities the past few days.
>>
>> For tx dim, we have a fixed empirical value internally.
>> This value performs better overall than manually adjusting the tx timer
>> register -->
>> I'll do not have tx numbers. :( So in the short term I no longer try to
>> push [5/5]
>> patch for tx dim and try to return -EOPNOTSUPP for it, sorry for this.
>>
>>> about 10 years ago.
>>>
>>> I will start to review the series but let's try to have some TX numbers as well.
>>>
>>> Btw, it would be more convenient to have a raw PPS benchmark. E.g you
>> I got some raw pps data using pktgen from linux/sample/pktgen:
>>
>> 1. tx cmd
>> ./pktgen_sample02_multiqueue.sh -i eth1 -s 44 -d ${dst_ip} -m ${dst_mac}
>> -t 8 -f 0 -n 0
>>
>> This uses 8 kpktgend threads to inject data into eth1.
>>
>> 2. Rx side loads a simple xdp prog which drops all received udp packets.
>>
>> 3. Data
>> pps: ~1000w
> For "w" did you mean 10 million? Looks too huge to me?
Yes, all cpus in tx are 100% sys, rx uses xdp to drop all received udp
packets.
Then this means rx receiving ability is strong.
If there was no xdp in rx, I remember tx sent 10million pps, but rx
could only receive 7.3+ million pps.
Thanks!
>
>> rx dim off: cpu idle= ~35%
>> rx dim on: cpu idle= ~76%
> This looks promising.
>
> Thanks
>
>> Thanks!
>>
>>> can try to use a software or hardware packet generator.
>>>
>>> Thanks
>>>
>>>> I. Sockperf UDP
>>>> =================================================
>>>> 1. Env
>>>> rxq_0 is affinity to cpu_0
>>>>
>>>> 2. Cmd
>>>> client: taskset -c 0 sockperf tp -p 8989 -i $IP -t 10 -m 16B
>>>> server: taskset -c 0 sockperf sr -p 8989
>>>>
>>>> 3. Result
>>>> dim off: 1143277.00 rxpps, throughput 17.844 MBps, cpu is 100%.
>>>> dim on: 1124161.00 rxpps, throughput 17.610 MBps, cpu is 83.5%.
>>>> =================================================
>>>>
>>>>
>>>> II. Redis
>>>> =================================================
>>>> 1. Env
>>>> There are 8 rxqs and rxq_i is affinity to cpu_i.
>>>>
>>>> 2. Result
>>>> When all cpus are 100%, ops/sec of memtier_benchmark client is
>>>> dim off: 978437.23
>>>> dim on: 1143638.28
>>>> =================================================
>>>>
>>>>
>>>> III. Nginx
>>>> =================================================
>>>> 1. Env
>>>> There are 8 rxqs and rxq_i is affinity to cpu_i.
>>>>
>>>> 2. Result
>>>> When all cpus are 100%, requests/sec of wrk client is
>>>> dim off: 877931.67
>>>> dim on: 1019160.31
>>>> =================================================
>>>>
>>>> Thanks!
>>>>
>>>>> Thanks
>>>>>
>>>>>> @Test env
>>>>>> rxq0 has affinity to cpu0.
>>>>>>
>>>>>> @Test cmd
>>>>>> client: taskset -c 0 sockperf tp -i ${IP} -t 30 --tcp -m ${msg_size}
>>>>>> server: taskset -c 0 sockperf sr --tcp
>>>>>>
>>>>>> @Test res
>>>>>> The second column is the ratio of the result returned by client
>>>>>> when rx dim is enabled to the result returned by client when
>>>>>> rx dim is disabled.
>>>>>> --------------------------------------
>>>>>> | msg_size | rx_dim=on / rx_dim=off |
>>>>>> --------------------------------------
>>>>>> | 14B | + 3% |
>>>>>> --------------------------------------
>>>>>> | 100B | + 16% |
>>>>>> --------------------------------------
>>>>>> | 500B | + 25% |
>>>>>> --------------------------------------
>>>>>> | 1400B | + 28% |
>>>>>> --------------------------------------
>>>>>> | 2048B | + 22% |
>>>>>> --------------------------------------
>>>>>> | 4096B | + 5% |
>>>>>> --------------------------------------
>>>>>>
>>>>>> ---
>>>>>> This patch set was part of the previous netdim patch set[1].
>>>>>> [1] was split into a merged bugfix set[2] and the current set.
>>>>>> The previous relevant commentators have been Cced.
>>>>>>
>>>>>> [1] https://lore.kernel.org/all/20230811065512.22190-1-hengqi@linux.alibaba.com/
>>>>>> [2] https://lore.kernel.org/all/cover.1696745452.git.hengqi@linux.alibaba.com/
>>>>>>
>>>>>> Heng Qi (5):
>>>>>> virtio-net: returns whether napi is complete
>>>>>> virtio-net: separate rx/tx coalescing moderation cmds
>>>>>> virtio-net: extract virtqueue coalescig cmd for reuse
>>>>>> virtio-net: support rx netdim
>>>>>> virtio-net: support tx netdim
>>>>>>
>>>>>> drivers/net/virtio_net.c | 394 ++++++++++++++++++++++++++++++++-------
>>>>>> 1 file changed, 322 insertions(+), 72 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.19.1.6.gb485710b
>>>>>>
>>>>>>
^ permalink raw reply
* Re: [PATCH net-next 0/5] virtio-net: support dynamic coalescing moderation
From: Heng Qi @ 2023-11-02 4:53 UTC (permalink / raw)
To: Jason Wang
Cc: Michael S. Tsirkin, netdev, virtualization, Xuan Zhuo,
Eric Dumazet, David S. Miller, Paolo Abeni,
Jesper Dangaard Brouer, John Fastabend, Alexei Starovoitov,
Jakub Kicinski, Simon Horman, Liu, Yujie
In-Reply-To: <6f324788-216e-4998-9fad-f8e7d27b4261@linux.alibaba.com>
在 2023/11/2 下午12:51, Heng Qi 写道:
>
>
> 在 2023/11/2 下午12:34, Jason Wang 写道:
>> On Wed, Nov 1, 2023 at 5:38 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>>
>>>
>>> 在 2023/10/25 上午9:18, Jason Wang 写道:
>>>> On Tue, Oct 24, 2023 at 8:03 PM Heng Qi <hengqi@linux.alibaba.com>
>>>> wrote:
>>>>>
>>>>> 在 2023/10/12 下午4:29, Jason Wang 写道:
>>>>>> On Thu, Oct 12, 2023 at 3:44 PM Heng Qi
>>>>>> <hengqi@linux.alibaba.com> wrote:
>>>>>>> Now, virtio-net already supports per-queue moderation parameter
>>>>>>> setting. Based on this, we use the netdim library of linux to
>>>>>>> support
>>>>>>> dynamic coalescing moderation for virtio-net.
>>>>>>>
>>>>>>> Due to hardware scheduling issues, we only tested rx dim.
>>>>>> Do you have PPS numbers? And TX numbers are also important as the
>>>>>> throughput could be misleading due to various reasons.
>>>>> Hi Jason!
>>>>>
>>>>> The comparison of rx netdim performance is as follows:
>>>>> (the backend supporting tx dim is not yet ready)
>>>> Thanks a lot for the numbers.
>>>>
>>>> I'd still expect the TX result as I did play tx interrupt coalescing
>>> Hi, Jason.
>>>
>>> Sorry for the late reply to this! Our team has been blocked by other
>>> priorities the past few days.
>>>
>>> For tx dim, we have a fixed empirical value internally.
>>> This value performs better overall than manually adjusting the tx timer
>>> register -->
>>> I'll do not have tx numbers. :( So in the short term I no longer try to
>>> push [5/5]
>>> patch for tx dim and try to return -EOPNOTSUPP for it, sorry for this.
>>>
>>>> about 10 years ago.
>>>>
>>>> I will start to review the series but let's try to have some TX
>>>> numbers as well.
>>>>
>>>> Btw, it would be more convenient to have a raw PPS benchmark. E.g you
>>> I got some raw pps data using pktgen from linux/sample/pktgen:
>>>
>>> 1. tx cmd
>>> ./pktgen_sample02_multiqueue.sh -i eth1 -s 44 -d ${dst_ip} -m
>>> ${dst_mac}
>>> -t 8 -f 0 -n 0
>>>
>>> This uses 8 kpktgend threads to inject data into eth1.
>>>
>>> 2. Rx side loads a simple xdp prog which drops all received udp
>>> packets.
>>>
>>> 3. Data
>>> pps: ~1000w
>> For "w" did you mean 10 million? Looks too huge to me?
>
> Yes, all cpus in tx are 100% sys, rx uses xdp to drop all received udp
> packets.
> Then this means rx receiving ability is strong.
>
> If there was no xdp in rx, I remember tx sent 10million pps, but rx
> could only receive 7.3+ million pps.
In addition, in the test environment of pktgen, rx has 8 cpu and 8 queue.
Thanks!
>
> Thanks!
>
>>
>>> rx dim off: cpu idle= ~35%
>>> rx dim on: cpu idle= ~76%
>> This looks promising.
>>
>> Thanks
>>
>>> Thanks!
>>>
>>>> can try to use a software or hardware packet generator.
>>>>
>>>> Thanks
>>>>
>>>>> I. Sockperf UDP
>>>>> =================================================
>>>>> 1. Env
>>>>> rxq_0 is affinity to cpu_0
>>>>>
>>>>> 2. Cmd
>>>>> client: taskset -c 0 sockperf tp -p 8989 -i $IP -t 10 -m 16B
>>>>> server: taskset -c 0 sockperf sr -p 8989
>>>>>
>>>>> 3. Result
>>>>> dim off: 1143277.00 rxpps, throughput 17.844 MBps, cpu is 100%.
>>>>> dim on: 1124161.00 rxpps, throughput 17.610 MBps, cpu is 83.5%.
>>>>> =================================================
>>>>>
>>>>>
>>>>> II. Redis
>>>>> =================================================
>>>>> 1. Env
>>>>> There are 8 rxqs and rxq_i is affinity to cpu_i.
>>>>>
>>>>> 2. Result
>>>>> When all cpus are 100%, ops/sec of memtier_benchmark client is
>>>>> dim off: 978437.23
>>>>> dim on: 1143638.28
>>>>> =================================================
>>>>>
>>>>>
>>>>> III. Nginx
>>>>> =================================================
>>>>> 1. Env
>>>>> There are 8 rxqs and rxq_i is affinity to cpu_i.
>>>>>
>>>>> 2. Result
>>>>> When all cpus are 100%, requests/sec of wrk client is
>>>>> dim off: 877931.67
>>>>> dim on: 1019160.31
>>>>> =================================================
>>>>>
>>>>> Thanks!
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> @Test env
>>>>>>> rxq0 has affinity to cpu0.
>>>>>>>
>>>>>>> @Test cmd
>>>>>>> client: taskset -c 0 sockperf tp -i ${IP} -t 30 --tcp -m
>>>>>>> ${msg_size}
>>>>>>> server: taskset -c 0 sockperf sr --tcp
>>>>>>>
>>>>>>> @Test res
>>>>>>> The second column is the ratio of the result returned by client
>>>>>>> when rx dim is enabled to the result returned by client when
>>>>>>> rx dim is disabled.
>>>>>>> --------------------------------------
>>>>>>> | msg_size | rx_dim=on / rx_dim=off |
>>>>>>> --------------------------------------
>>>>>>> | 14B | + 3% |
>>>>>>> --------------------------------------
>>>>>>> | 100B | + 16% |
>>>>>>> --------------------------------------
>>>>>>> | 500B | + 25% |
>>>>>>> --------------------------------------
>>>>>>> | 1400B | + 28% |
>>>>>>> --------------------------------------
>>>>>>> | 2048B | + 22% |
>>>>>>> --------------------------------------
>>>>>>> | 4096B | + 5% |
>>>>>>> --------------------------------------
>>>>>>>
>>>>>>> ---
>>>>>>> This patch set was part of the previous netdim patch set[1].
>>>>>>> [1] was split into a merged bugfix set[2] and the current set.
>>>>>>> The previous relevant commentators have been Cced.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/all/20230811065512.22190-1-hengqi@linux.alibaba.com/
>>>>>>> [2]
>>>>>>> https://lore.kernel.org/all/cover.1696745452.git.hengqi@linux.alibaba.com/
>>>>>>>
>>>>>>> Heng Qi (5):
>>>>>>> virtio-net: returns whether napi is complete
>>>>>>> virtio-net: separate rx/tx coalescing moderation cmds
>>>>>>> virtio-net: extract virtqueue coalescig cmd for reuse
>>>>>>> virtio-net: support rx netdim
>>>>>>> virtio-net: support tx netdim
>>>>>>>
>>>>>>> drivers/net/virtio_net.c | 394
>>>>>>> ++++++++++++++++++++++++++++++++-------
>>>>>>> 1 file changed, 322 insertions(+), 72 deletions(-)
>>>>>>>
>>>>>>> --
>>>>>>> 2.19.1.6.gb485710b
>>>>>>>
>>>>>>>
>
^ permalink raw reply
* [PATCH net] netlink: fill in missing MODULE_DESCRIPTION()
From: Jakub Kicinski @ 2023-11-02 4:57 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, pabeni, Jakub Kicinski
W=1 builds now warn if a module is built without
a MODULE_DESCRIPTION(). Fill it in for sock_diag.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
net/netlink/diag.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/netlink/diag.c b/net/netlink/diag.c
index 9c4f231be275..1eeff9422856 100644
--- a/net/netlink/diag.c
+++ b/net/netlink/diag.c
@@ -257,5 +257,6 @@ static void __exit netlink_diag_exit(void)
module_init(netlink_diag_init);
module_exit(netlink_diag_exit);
+MODULE_DESCRIPTION("Netlink-based socket monitoring/diagnostic interface (sock_diag)");
MODULE_LICENSE("GPL");
MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG, 16 /* AF_NETLINK */);
--
2.41.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox