Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH bpf-next v8 02/10] bpf, net: introduce bpf_struct_ops_desc.
From: Kui-Feng Lee @ 2023-10-31 16:00 UTC (permalink / raw)
  To: Martin KaFai Lau, thinker.li, bpf, ast, song, kernel-team, andrii,
	drosen
  Cc: kuifeng, netdev
In-Reply-To: <bb7bbfa2-94b1-041d-7255-bb8c7e56e6c7@linux.dev>



On 10/30/23 23:40, Martin KaFai Lau wrote:
> 
> nit. I think this should be renamed to bpf_

Sure! It looks better.

^ permalink raw reply

* Re: [PATCH bpf-next v3 1/2] bpf: add skcipher API support to TC/XDP programs
From: Vadim Fedorenko @ 2023-10-31 15:58 UTC (permalink / raw)
  To: Jiri Olsa, Vadim Fedorenko
  Cc: Jakub Kicinski, Martin KaFai Lau, Andrii Nakryiko,
	Alexei Starovoitov, Mykola Lysenko, bpf, netdev, linux-crypto
In-Reply-To: <ZUEhT7WqRyUARm4s@krava>

On 31/10/2023 15:46, Jiri Olsa wrote:
> On Tue, Oct 31, 2023 at 06:48:59AM -0700, Vadim Fedorenko wrote:
> 
> SNIP
> 
>> diff --git a/kernel/bpf/crypto.c b/kernel/bpf/crypto.c
>> new file mode 100644
>> index 000000000000..c2a0703934fc
>> --- /dev/null
>> +++ b/kernel/bpf/crypto.c
>> @@ -0,0 +1,280 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright (c) 2023 Meta, Inc */
>> +#include <linux/bpf.h>
>> +#include <linux/bpf_mem_alloc.h>
>> +#include <linux/btf.h>
>> +#include <linux/btf_ids.h>
>> +#include <linux/filter.h>
>> +#include <linux/scatterlist.h>
>> +#include <linux/skbuff.h>
>> +#include <crypto/skcipher.h>
>> +
>> +/**
>> + * struct bpf_crypto_skcipher_ctx - refcounted BPF sync skcipher context structure
>> + * @tfm:	The pointer to crypto_sync_skcipher struct.
>> + * @rcu:	The RCU head used to free the crypto context with RCU safety.
>> + * @usage:	Object reference counter. When the refcount goes to 0, the
>> + *		memory is released back to the BPF allocator, which provides
>> + *		RCU safety.
>> + */
>> +struct bpf_crypto_skcipher_ctx {
>> +	struct crypto_sync_skcipher *tfm;
>> +	struct rcu_head rcu;
>> +	refcount_t usage;
>> +};
>> +
>> +__diag_push();
>> +__diag_ignore_all("-Wmissing-prototypes",
>> +		  "Global kfuncs as their definitions will be in BTF");
>> +
> 
> hi,
> just a heads up.. there's [1] change adding macro for this, it's not
> merged yet, but will probably get in soon
> 
> jirka
> 
> [1] https://lore.kernel.org/bpf/20231030210638.2415306-1-davemarchevsky@fb.com/
> 

Hi Jiri,

Thanks for heads up, I'll adjust the patchset when macro patch is merged.


>> +static void *__bpf_dynptr_data_ptr(const struct bpf_dynptr_kern *ptr)
>> +{
>> +	enum bpf_dynptr_type type;
>> +
>> +	if (!ptr->data)
>> +		return NULL;
>> +
>> +	type = bpf_dynptr_get_type(ptr);
>> +
>> +	switch (type) {
>> +	case BPF_DYNPTR_TYPE_LOCAL:
>> +	case BPF_DYNPTR_TYPE_RINGBUF:
>> +		return ptr->data + ptr->offset;
>> +	case BPF_DYNPTR_TYPE_SKB:
>> +		return skb_pointer_if_linear(ptr->data, ptr->offset, __bpf_dynptr_size(ptr));
>> +	case BPF_DYNPTR_TYPE_XDP:
>> +	{
>> +		void *xdp_ptr = bpf_xdp_pointer(ptr->data, ptr->offset, __bpf_dynptr_size(ptr));
>> +		if (!IS_ERR_OR_NULL(xdp_ptr))
>> +			return xdp_ptr;
>> +
>> +		return NULL;
>> +	}
>> +	default:
>> +		WARN_ONCE(true, "unknown dynptr type %d\n", type);
>> +		return NULL;
>> +	}
>> +}
>> +
>> +/**
>> + * bpf_crypto_skcipher_ctx_create() - Create a mutable BPF crypto context.
>> + *
>> + * Allocates a crypto context that can be used, acquired, and released by
>> + * a BPF program. The crypto context returned by this function must either
>> + * be embedded in a map as a kptr, or freed with bpf_crypto_skcipher_ctx_release().
>> + *
>> + * bpf_crypto_skcipher_ctx_create() allocates memory using the BPF memory
>> + * allocator, and will not block. It may return NULL if no memory is available.
>> + * @algo: bpf_dynptr which holds string representation of algorithm.
>> + * @key:  bpf_dynptr which holds cipher key to do crypto.
>> + */
>> +__bpf_kfunc struct bpf_crypto_skcipher_ctx *
>> +bpf_crypto_skcipher_ctx_create(const struct bpf_dynptr_kern *palgo,
>> +			       const struct bpf_dynptr_kern *pkey, int *err)
>> +{
>> +	struct bpf_crypto_skcipher_ctx *ctx;
>> +	char *algo;
>> +
>> +	if (__bpf_dynptr_size(palgo) > CRYPTO_MAX_ALG_NAME) {
>> +		*err = -EINVAL;
>> +		return NULL;
>> +	}
>> +
>> +	algo = __bpf_dynptr_data_ptr(palgo);
>> +
>> +	if (!crypto_has_skcipher(algo, CRYPTO_ALG_TYPE_SKCIPHER, CRYPTO_ALG_TYPE_MASK)) {
>> +		*err = -EOPNOTSUPP;
>> +		return NULL;
>> +	}
>> +
>> +	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
>> +	if (!ctx) {
>> +		*err = -ENOMEM;
>> +		return NULL;
>> +	}
>> +
>> +	memset(ctx, 0, sizeof(*ctx));
>> +
>> +	ctx->tfm = crypto_alloc_sync_skcipher(algo, 0, 0);
>> +	if (IS_ERR(ctx->tfm)) {
>> +		*err = PTR_ERR(ctx->tfm);
>> +		ctx->tfm = NULL;
>> +		goto err;
>> +	}
>> +
>> +	*err = crypto_sync_skcipher_setkey(ctx->tfm, __bpf_dynptr_data_ptr(pkey),
>> +					   __bpf_dynptr_size(pkey));
>> +	if (*err)
>> +		goto err;
>> +
>> +	refcount_set(&ctx->usage, 1);
>> +
>> +	return ctx;
>> +err:
>> +	if (ctx->tfm)
>> +		crypto_free_sync_skcipher(ctx->tfm);
>> +	kfree(ctx);
>> +
>> +	return NULL;
>> +}
>> +
>> +static void crypto_free_sync_skcipher_cb(struct rcu_head *head)
>> +{
>> +	struct bpf_crypto_skcipher_ctx *ctx;
>> +
>> +	ctx = container_of(head, struct bpf_crypto_skcipher_ctx, rcu);
>> +	crypto_free_sync_skcipher(ctx->tfm);
>> +	kfree(ctx);
>> +}
>> +
>> +/**
>> + * bpf_crypto_skcipher_ctx_acquire() - Acquire a reference to a BPF crypto context.
>> + * @ctx: The BPF crypto context being acquired. The ctx must be a trusted
>> + *	     pointer.
>> + *
>> + * Acquires a reference to a BPF crypto context. The context returned by this function
>> + * must either be embedded in a map as a kptr, or freed with
>> + * bpf_crypto_skcipher_ctx_release().
>> + */
>> +__bpf_kfunc struct bpf_crypto_skcipher_ctx *
>> +bpf_crypto_skcipher_ctx_acquire(struct bpf_crypto_skcipher_ctx *ctx)
>> +{
>> +	refcount_inc(&ctx->usage);
>> +	return ctx;
>> +}
>> +
>> +/**
>> + * bpf_crypto_skcipher_ctx_release() - Release a previously acquired BPF crypto context.
>> + * @ctx: The crypto context being released.
>> + *
>> + * Releases a previously acquired reference to a BPF cpumask. When the final
>> + * reference of the BPF cpumask has been released, it is subsequently freed in
>> + * an RCU callback in the BPF memory allocator.
>> + */
>> +__bpf_kfunc void bpf_crypto_skcipher_ctx_release(struct bpf_crypto_skcipher_ctx *ctx)
>> +{
>> +	if (refcount_dec_and_test(&ctx->usage))
>> +		call_rcu(&ctx->rcu, crypto_free_sync_skcipher_cb);
>> +}
>> +
>> +static int bpf_crypto_skcipher_crypt(struct crypto_sync_skcipher *tfm,
>> +				     const struct bpf_dynptr_kern *src,
>> +				     struct bpf_dynptr_kern *dst,
>> +				     const struct bpf_dynptr_kern *iv,
>> +				     bool decrypt)
>> +{
>> +	struct skcipher_request *req = NULL;
>> +	struct scatterlist sgin, sgout;
>> +	int err;
>> +
>> +	if (crypto_sync_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
>> +		return -EINVAL;
>> +
>> +	if (__bpf_dynptr_is_rdonly(dst))
>> +		return -EINVAL;
>> +
>> +	if (!__bpf_dynptr_size(dst) || !__bpf_dynptr_size(src))
>> +		return -EINVAL;
>> +
>> +	if (__bpf_dynptr_size(iv) != crypto_sync_skcipher_ivsize(tfm))
>> +		return -EINVAL;
>> +
>> +	req = skcipher_request_alloc(&tfm->base, GFP_ATOMIC);
>> +	if (!req)
>> +		return -ENOMEM;
>> +
>> +	sg_init_one(&sgin, __bpf_dynptr_data_ptr(src), __bpf_dynptr_size(src));
>> +	sg_init_one(&sgout, __bpf_dynptr_data_ptr(dst), __bpf_dynptr_size(dst));
>> +
>> +	skcipher_request_set_crypt(req, &sgin, &sgout, __bpf_dynptr_size(src),
>> +				   __bpf_dynptr_data_ptr(iv));
>> +
>> +	err = decrypt ? crypto_skcipher_decrypt(req) : crypto_skcipher_encrypt(req);
>> +
>> +	skcipher_request_free(req);
>> +
>> +	return err;
>> +}
>> +
>> +/**
>> + * bpf_crypto_skcipher_decrypt() - Decrypt buffer using configured context and IV provided.
>> + * @ctx:	The crypto context being used. The ctx must be a trusted pointer.
>> + * @src:	bpf_dynptr to the encrypted data. Must be a trusted pointer.
>> + * @dst:	bpf_dynptr to the buffer where to store the result. Must be a trusted pointer.
>> + * @iv:		bpf_dynptr to IV data to be used by decryptor.
>> + *
>> + * Decrypts provided buffer using IV data and the crypto context. Crypto context must be configured.
>> + */
>> +__bpf_kfunc int bpf_crypto_skcipher_decrypt(struct bpf_crypto_skcipher_ctx *ctx,
>> +					    const struct bpf_dynptr_kern *src,
>> +					    struct bpf_dynptr_kern *dst,
>> +					    const struct bpf_dynptr_kern *iv)
>> +{
>> +	return bpf_crypto_skcipher_crypt(ctx->tfm, src, dst, iv, true);
>> +}
>> +
>> +/**
>> + * bpf_crypto_skcipher_encrypt() - Encrypt buffer using configured context and IV provided.
>> + * @ctx:	The crypto context being used. The ctx must be a trusted pointer.
>> + * @src:	bpf_dynptr to the plain data. Must be a trusted pointer.
>> + * @dst:	bpf_dynptr to buffer where to store the result. Must be a trusted pointer.
>> + * @iv:		bpf_dynptr to IV data to be used by decryptor.
>> + *
>> + * Encrypts provided buffer using IV data and the crypto context. Crypto context must be configured.
>> + */
>> +__bpf_kfunc int bpf_crypto_skcipher_encrypt(struct bpf_crypto_skcipher_ctx *ctx,
>> +					    const struct bpf_dynptr_kern *src,
>> +					    struct bpf_dynptr_kern *dst,
>> +					    const struct bpf_dynptr_kern *iv)
>> +{
>> +	return bpf_crypto_skcipher_crypt(ctx->tfm, src, dst, iv, false);
>> +}
>> +
>> +__diag_pop();
>> +
>> +BTF_SET8_START(crypt_skcipher_init_kfunc_btf_ids)
>> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_ctx_create, KF_ACQUIRE | KF_RET_NULL | KF_SLEEPABLE)
>> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_ctx_release, KF_RELEASE)
>> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_ctx_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS)
>> +BTF_SET8_END(crypt_skcipher_init_kfunc_btf_ids)
>> +
>> +static const struct btf_kfunc_id_set crypt_skcipher_init_kfunc_set = {
>> +	.owner = THIS_MODULE,
>> +	.set   = &crypt_skcipher_init_kfunc_btf_ids,
>> +};
>> +
>> +BTF_SET8_START(crypt_skcipher_kfunc_btf_ids)
>> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_decrypt, KF_RCU)
>> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_encrypt, KF_RCU)
>> +BTF_SET8_END(crypt_skcipher_kfunc_btf_ids)
>> +
>> +static const struct btf_kfunc_id_set crypt_skcipher_kfunc_set = {
>> +	.owner = THIS_MODULE,
>> +	.set   = &crypt_skcipher_kfunc_btf_ids,
>> +};
>> +
>> +BTF_ID_LIST(crypto_skcipher_dtor_ids)
>> +BTF_ID(struct, bpf_crypto_skcipher_ctx)
>> +BTF_ID(func, bpf_crypto_skcipher_ctx_release)
>> +
>> +static int __init crypto_skcipher_kfunc_init(void)
>> +{
>> +	int ret;
>> +	const struct btf_id_dtor_kfunc crypto_skcipher_dtors[] = {
>> +		{
>> +			.btf_id	      = crypto_skcipher_dtor_ids[0],
>> +			.kfunc_btf_id = crypto_skcipher_dtor_ids[1]
>> +		},
>> +	};
>> +
>> +	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &crypt_skcipher_kfunc_set);
>> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &crypt_skcipher_kfunc_set);
>> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &crypt_skcipher_kfunc_set);
>> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
>> +					       &crypt_skcipher_init_kfunc_set);
>> +	return  ret ?: register_btf_id_dtor_kfuncs(crypto_skcipher_dtors,
>> +						   ARRAY_SIZE(crypto_skcipher_dtors),
>> +						   THIS_MODULE);
>> +}
>> +
>> +late_initcall(crypto_skcipher_kfunc_init);
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index 62a53ebfedf9..2020884fca3d 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -1429,7 +1429,7 @@ static const struct bpf_func_proto bpf_kptr_xchg_proto = {
>>   #define DYNPTR_SIZE_MASK	0xFFFFFF
>>   #define DYNPTR_RDONLY_BIT	BIT(31)
>>   
>> -static bool __bpf_dynptr_is_rdonly(const struct bpf_dynptr_kern *ptr)
>> +bool __bpf_dynptr_is_rdonly(const struct bpf_dynptr_kern *ptr)
>>   {
>>   	return ptr->size & DYNPTR_RDONLY_BIT;
>>   }
>> @@ -1444,7 +1444,7 @@ static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_typ
>>   	ptr->size |= type << DYNPTR_TYPE_SHIFT;
>>   }
>>   
>> -static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
>> +enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
>>   {
>>   	return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
>>   }
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index bb58987e4844..75d2f47ca3cb 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -5184,6 +5184,7 @@ BTF_ID(struct, prog_test_ref_kfunc)
>>   BTF_ID(struct, cgroup)
>>   BTF_ID(struct, bpf_cpumask)
>>   BTF_ID(struct, task_struct)
>> +BTF_ID(struct, bpf_crypto_skcipher_ctx)
>>   BTF_SET_END(rcu_protected_types)
>>   
>>   static bool rcu_protected_object(const struct btf *btf, u32 btf_id)
>> -- 
>> 2.39.3
>>
>>


^ permalink raw reply

* Re: It is time to put some TCP congestion controls out to pasture
From: Dave Taht @ 2023-10-31 15:52 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20231031084603.7aacfb53@fedora>

[-- Attachment #1: Type: text/plain, Size: 1045 bytes --]

Heh,  I was going to propose moving FlexIS into the kernel to spark
more new development here:

It is a new lower than best effort CC. Also LEDBAT++ would be
potentially useful.

I think you can retire bic at least. I would have to look over the
others, There are a lot of users of vegas! still! but westwood is
better... could there be an upgrade path for those still selecting old
ccs?

My understanding is BBRv3 is going to replace BBR?

On Tue, Oct 31, 2023 at 8:46 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> Linux supports lots of TCP congestion control options which is good for research,
> but in the current world of syszbot, it makes sense to euthanize unmaintained
> especially research only algorithms.
>
> Some options on how to do this:
>     - new config option TCP_CONGESTION_CONTROL_UNSAFE?
>     - move to staging
>     - remove completely
>
> What is best option?
>

--
Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html
Dave Täht CSO, LibreQos

[-- Attachment #2: flexis_final.pdf --]
[-- Type: application/pdf, Size: 783391 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next v3 1/2] bpf: add skcipher API support to TC/XDP programs
From: Jiri Olsa @ 2023-10-31 15:46 UTC (permalink / raw)
  To: Vadim Fedorenko
  Cc: Vadim Fedorenko, Jakub Kicinski, Martin KaFai Lau,
	Andrii Nakryiko, Alexei Starovoitov, Mykola Lysenko, bpf, netdev,
	linux-crypto
In-Reply-To: <20231031134900.1432945-1-vadfed@meta.com>

On Tue, Oct 31, 2023 at 06:48:59AM -0700, Vadim Fedorenko wrote:

SNIP

> diff --git a/kernel/bpf/crypto.c b/kernel/bpf/crypto.c
> new file mode 100644
> index 000000000000..c2a0703934fc
> --- /dev/null
> +++ b/kernel/bpf/crypto.c
> @@ -0,0 +1,280 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2023 Meta, Inc */
> +#include <linux/bpf.h>
> +#include <linux/bpf_mem_alloc.h>
> +#include <linux/btf.h>
> +#include <linux/btf_ids.h>
> +#include <linux/filter.h>
> +#include <linux/scatterlist.h>
> +#include <linux/skbuff.h>
> +#include <crypto/skcipher.h>
> +
> +/**
> + * struct bpf_crypto_skcipher_ctx - refcounted BPF sync skcipher context structure
> + * @tfm:	The pointer to crypto_sync_skcipher struct.
> + * @rcu:	The RCU head used to free the crypto context with RCU safety.
> + * @usage:	Object reference counter. When the refcount goes to 0, the
> + *		memory is released back to the BPF allocator, which provides
> + *		RCU safety.
> + */
> +struct bpf_crypto_skcipher_ctx {
> +	struct crypto_sync_skcipher *tfm;
> +	struct rcu_head rcu;
> +	refcount_t usage;
> +};
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global kfuncs as their definitions will be in BTF");
> +

hi,
just a heads up.. there's [1] change adding macro for this, it's not
merged yet, but will probably get in soon

jirka

[1] https://lore.kernel.org/bpf/20231030210638.2415306-1-davemarchevsky@fb.com/

> +static void *__bpf_dynptr_data_ptr(const struct bpf_dynptr_kern *ptr)
> +{
> +	enum bpf_dynptr_type type;
> +
> +	if (!ptr->data)
> +		return NULL;
> +
> +	type = bpf_dynptr_get_type(ptr);
> +
> +	switch (type) {
> +	case BPF_DYNPTR_TYPE_LOCAL:
> +	case BPF_DYNPTR_TYPE_RINGBUF:
> +		return ptr->data + ptr->offset;
> +	case BPF_DYNPTR_TYPE_SKB:
> +		return skb_pointer_if_linear(ptr->data, ptr->offset, __bpf_dynptr_size(ptr));
> +	case BPF_DYNPTR_TYPE_XDP:
> +	{
> +		void *xdp_ptr = bpf_xdp_pointer(ptr->data, ptr->offset, __bpf_dynptr_size(ptr));
> +		if (!IS_ERR_OR_NULL(xdp_ptr))
> +			return xdp_ptr;
> +
> +		return NULL;
> +	}
> +	default:
> +		WARN_ONCE(true, "unknown dynptr type %d\n", type);
> +		return NULL;
> +	}
> +}
> +
> +/**
> + * bpf_crypto_skcipher_ctx_create() - Create a mutable BPF crypto context.
> + *
> + * Allocates a crypto context that can be used, acquired, and released by
> + * a BPF program. The crypto context returned by this function must either
> + * be embedded in a map as a kptr, or freed with bpf_crypto_skcipher_ctx_release().
> + *
> + * bpf_crypto_skcipher_ctx_create() allocates memory using the BPF memory
> + * allocator, and will not block. It may return NULL if no memory is available.
> + * @algo: bpf_dynptr which holds string representation of algorithm.
> + * @key:  bpf_dynptr which holds cipher key to do crypto.
> + */
> +__bpf_kfunc struct bpf_crypto_skcipher_ctx *
> +bpf_crypto_skcipher_ctx_create(const struct bpf_dynptr_kern *palgo,
> +			       const struct bpf_dynptr_kern *pkey, int *err)
> +{
> +	struct bpf_crypto_skcipher_ctx *ctx;
> +	char *algo;
> +
> +	if (__bpf_dynptr_size(palgo) > CRYPTO_MAX_ALG_NAME) {
> +		*err = -EINVAL;
> +		return NULL;
> +	}
> +
> +	algo = __bpf_dynptr_data_ptr(palgo);
> +
> +	if (!crypto_has_skcipher(algo, CRYPTO_ALG_TYPE_SKCIPHER, CRYPTO_ALG_TYPE_MASK)) {
> +		*err = -EOPNOTSUPP;
> +		return NULL;
> +	}
> +
> +	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +	if (!ctx) {
> +		*err = -ENOMEM;
> +		return NULL;
> +	}
> +
> +	memset(ctx, 0, sizeof(*ctx));
> +
> +	ctx->tfm = crypto_alloc_sync_skcipher(algo, 0, 0);
> +	if (IS_ERR(ctx->tfm)) {
> +		*err = PTR_ERR(ctx->tfm);
> +		ctx->tfm = NULL;
> +		goto err;
> +	}
> +
> +	*err = crypto_sync_skcipher_setkey(ctx->tfm, __bpf_dynptr_data_ptr(pkey),
> +					   __bpf_dynptr_size(pkey));
> +	if (*err)
> +		goto err;
> +
> +	refcount_set(&ctx->usage, 1);
> +
> +	return ctx;
> +err:
> +	if (ctx->tfm)
> +		crypto_free_sync_skcipher(ctx->tfm);
> +	kfree(ctx);
> +
> +	return NULL;
> +}
> +
> +static void crypto_free_sync_skcipher_cb(struct rcu_head *head)
> +{
> +	struct bpf_crypto_skcipher_ctx *ctx;
> +
> +	ctx = container_of(head, struct bpf_crypto_skcipher_ctx, rcu);
> +	crypto_free_sync_skcipher(ctx->tfm);
> +	kfree(ctx);
> +}
> +
> +/**
> + * bpf_crypto_skcipher_ctx_acquire() - Acquire a reference to a BPF crypto context.
> + * @ctx: The BPF crypto context being acquired. The ctx must be a trusted
> + *	     pointer.
> + *
> + * Acquires a reference to a BPF crypto context. The context returned by this function
> + * must either be embedded in a map as a kptr, or freed with
> + * bpf_crypto_skcipher_ctx_release().
> + */
> +__bpf_kfunc struct bpf_crypto_skcipher_ctx *
> +bpf_crypto_skcipher_ctx_acquire(struct bpf_crypto_skcipher_ctx *ctx)
> +{
> +	refcount_inc(&ctx->usage);
> +	return ctx;
> +}
> +
> +/**
> + * bpf_crypto_skcipher_ctx_release() - Release a previously acquired BPF crypto context.
> + * @ctx: The crypto context being released.
> + *
> + * Releases a previously acquired reference to a BPF cpumask. When the final
> + * reference of the BPF cpumask has been released, it is subsequently freed in
> + * an RCU callback in the BPF memory allocator.
> + */
> +__bpf_kfunc void bpf_crypto_skcipher_ctx_release(struct bpf_crypto_skcipher_ctx *ctx)
> +{
> +	if (refcount_dec_and_test(&ctx->usage))
> +		call_rcu(&ctx->rcu, crypto_free_sync_skcipher_cb);
> +}
> +
> +static int bpf_crypto_skcipher_crypt(struct crypto_sync_skcipher *tfm,
> +				     const struct bpf_dynptr_kern *src,
> +				     struct bpf_dynptr_kern *dst,
> +				     const struct bpf_dynptr_kern *iv,
> +				     bool decrypt)
> +{
> +	struct skcipher_request *req = NULL;
> +	struct scatterlist sgin, sgout;
> +	int err;
> +
> +	if (crypto_sync_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
> +		return -EINVAL;
> +
> +	if (__bpf_dynptr_is_rdonly(dst))
> +		return -EINVAL;
> +
> +	if (!__bpf_dynptr_size(dst) || !__bpf_dynptr_size(src))
> +		return -EINVAL;
> +
> +	if (__bpf_dynptr_size(iv) != crypto_sync_skcipher_ivsize(tfm))
> +		return -EINVAL;
> +
> +	req = skcipher_request_alloc(&tfm->base, GFP_ATOMIC);
> +	if (!req)
> +		return -ENOMEM;
> +
> +	sg_init_one(&sgin, __bpf_dynptr_data_ptr(src), __bpf_dynptr_size(src));
> +	sg_init_one(&sgout, __bpf_dynptr_data_ptr(dst), __bpf_dynptr_size(dst));
> +
> +	skcipher_request_set_crypt(req, &sgin, &sgout, __bpf_dynptr_size(src),
> +				   __bpf_dynptr_data_ptr(iv));
> +
> +	err = decrypt ? crypto_skcipher_decrypt(req) : crypto_skcipher_encrypt(req);
> +
> +	skcipher_request_free(req);
> +
> +	return err;
> +}
> +
> +/**
> + * bpf_crypto_skcipher_decrypt() - Decrypt buffer using configured context and IV provided.
> + * @ctx:	The crypto context being used. The ctx must be a trusted pointer.
> + * @src:	bpf_dynptr to the encrypted data. Must be a trusted pointer.
> + * @dst:	bpf_dynptr to the buffer where to store the result. Must be a trusted pointer.
> + * @iv:		bpf_dynptr to IV data to be used by decryptor.
> + *
> + * Decrypts provided buffer using IV data and the crypto context. Crypto context must be configured.
> + */
> +__bpf_kfunc int bpf_crypto_skcipher_decrypt(struct bpf_crypto_skcipher_ctx *ctx,
> +					    const struct bpf_dynptr_kern *src,
> +					    struct bpf_dynptr_kern *dst,
> +					    const struct bpf_dynptr_kern *iv)
> +{
> +	return bpf_crypto_skcipher_crypt(ctx->tfm, src, dst, iv, true);
> +}
> +
> +/**
> + * bpf_crypto_skcipher_encrypt() - Encrypt buffer using configured context and IV provided.
> + * @ctx:	The crypto context being used. The ctx must be a trusted pointer.
> + * @src:	bpf_dynptr to the plain data. Must be a trusted pointer.
> + * @dst:	bpf_dynptr to buffer where to store the result. Must be a trusted pointer.
> + * @iv:		bpf_dynptr to IV data to be used by decryptor.
> + *
> + * Encrypts provided buffer using IV data and the crypto context. Crypto context must be configured.
> + */
> +__bpf_kfunc int bpf_crypto_skcipher_encrypt(struct bpf_crypto_skcipher_ctx *ctx,
> +					    const struct bpf_dynptr_kern *src,
> +					    struct bpf_dynptr_kern *dst,
> +					    const struct bpf_dynptr_kern *iv)
> +{
> +	return bpf_crypto_skcipher_crypt(ctx->tfm, src, dst, iv, false);
> +}
> +
> +__diag_pop();
> +
> +BTF_SET8_START(crypt_skcipher_init_kfunc_btf_ids)
> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_ctx_create, KF_ACQUIRE | KF_RET_NULL | KF_SLEEPABLE)
> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_ctx_release, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_ctx_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS)
> +BTF_SET8_END(crypt_skcipher_init_kfunc_btf_ids)
> +
> +static const struct btf_kfunc_id_set crypt_skcipher_init_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &crypt_skcipher_init_kfunc_btf_ids,
> +};
> +
> +BTF_SET8_START(crypt_skcipher_kfunc_btf_ids)
> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_decrypt, KF_RCU)
> +BTF_ID_FLAGS(func, bpf_crypto_skcipher_encrypt, KF_RCU)
> +BTF_SET8_END(crypt_skcipher_kfunc_btf_ids)
> +
> +static const struct btf_kfunc_id_set crypt_skcipher_kfunc_set = {
> +	.owner = THIS_MODULE,
> +	.set   = &crypt_skcipher_kfunc_btf_ids,
> +};
> +
> +BTF_ID_LIST(crypto_skcipher_dtor_ids)
> +BTF_ID(struct, bpf_crypto_skcipher_ctx)
> +BTF_ID(func, bpf_crypto_skcipher_ctx_release)
> +
> +static int __init crypto_skcipher_kfunc_init(void)
> +{
> +	int ret;
> +	const struct btf_id_dtor_kfunc crypto_skcipher_dtors[] = {
> +		{
> +			.btf_id	      = crypto_skcipher_dtor_ids[0],
> +			.kfunc_btf_id = crypto_skcipher_dtor_ids[1]
> +		},
> +	};
> +
> +	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &crypt_skcipher_kfunc_set);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &crypt_skcipher_kfunc_set);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &crypt_skcipher_kfunc_set);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
> +					       &crypt_skcipher_init_kfunc_set);
> +	return  ret ?: register_btf_id_dtor_kfuncs(crypto_skcipher_dtors,
> +						   ARRAY_SIZE(crypto_skcipher_dtors),
> +						   THIS_MODULE);
> +}
> +
> +late_initcall(crypto_skcipher_kfunc_init);
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 62a53ebfedf9..2020884fca3d 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1429,7 +1429,7 @@ static const struct bpf_func_proto bpf_kptr_xchg_proto = {
>  #define DYNPTR_SIZE_MASK	0xFFFFFF
>  #define DYNPTR_RDONLY_BIT	BIT(31)
>  
> -static bool __bpf_dynptr_is_rdonly(const struct bpf_dynptr_kern *ptr)
> +bool __bpf_dynptr_is_rdonly(const struct bpf_dynptr_kern *ptr)
>  {
>  	return ptr->size & DYNPTR_RDONLY_BIT;
>  }
> @@ -1444,7 +1444,7 @@ static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_typ
>  	ptr->size |= type << DYNPTR_TYPE_SHIFT;
>  }
>  
> -static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
> +enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
>  {
>  	return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
>  }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index bb58987e4844..75d2f47ca3cb 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -5184,6 +5184,7 @@ BTF_ID(struct, prog_test_ref_kfunc)
>  BTF_ID(struct, cgroup)
>  BTF_ID(struct, bpf_cpumask)
>  BTF_ID(struct, task_struct)
> +BTF_ID(struct, bpf_crypto_skcipher_ctx)
>  BTF_SET_END(rcu_protected_types)
>  
>  static bool rcu_protected_object(const struct btf *btf, u32 btf_id)
> -- 
> 2.39.3
> 
> 

^ permalink raw reply

* It is time to put some TCP congestion controls out to pasture
From: Stephen Hemminger @ 2023-10-31 15:46 UTC (permalink / raw)
  To: netdev

Linux supports lots of TCP congestion control options which is good for research,
but in the current world of syszbot, it makes sense to euthanize unmaintained
especially research only algorithms.

Some options on how to do this:
    - new config option TCP_CONGESTION_CONTROL_UNSAFE?
    - move to staging
    - remove completely

What is best option?

^ permalink raw reply

* Re: [PATCH v2] mlx5: fix init stage error handling to avoid double free of same QP and UAF
From: Jason Gunthorpe @ 2023-10-31 15:19 UTC (permalink / raw)
  To: George Kennedy
  Cc: leon, sd, linux-rdma, linux-kernel, netdev, tom.hromatka,
	harshit.m.mogalapalli
In-Reply-To: <1698170518-4006-1-git-send-email-george.kennedy@oracle.com>

On Tue, Oct 24, 2023 at 01:01:58PM -0500, George Kennedy wrote:
> In the unlikely event that workqueue allocation fails and returns
> NULL in mlx5_mkey_cache_init(), delete the call to
> mlx5r_umr_resource_cleanup() (which frees the QP) in
> mlx5_ib_stage_post_ib_reg_umr_init().  This will avoid attempted
> double free of the same QP when __mlx5_ib_add() does its cleanup.
> 
> Syzkaller reported a UAF in ib_destroy_qp_user
> 
> workqueue: Failed to create a rescuer kthread for wq "mkey_cache": -EINTR
> infiniband mlx5_0: mlx5_mkey_cache_init:981:(pid 1642):
> failed to create work queue
> infiniband mlx5_0: mlx5_ib_stage_post_ib_reg_umr_init:4075:(pid 1642):
> mr cache init failed -12
> ==================================================================
> BUG: KASAN: slab-use-after-free in ib_destroy_qp_user (drivers/infiniband/core/verbs.c:2073)
> Read of size 8 at addr ffff88810da310a8 by task repro_upstream/1642
> 
> Call Trace:
> <TASK>
> kasan_report (mm/kasan/report.c:590)
> ib_destroy_qp_user (drivers/infiniband/core/verbs.c:2073)
> mlx5r_umr_resource_cleanup (drivers/infiniband/hw/mlx5/umr.c:198)
> __mlx5_ib_add (drivers/infiniband/hw/mlx5/main.c:4178)
> mlx5r_probe (drivers/infiniband/hw/mlx5/main.c:4402)
> ...
> </TASK>
> 
> Allocated by task 1642:
> __kmalloc (./include/linux/kasan.h:198 mm/slab_common.c:1026
> mm/slab_common.c:1039)
> create_qp (./include/linux/slab.h:603 ./include/linux/slab.h:720
> ./include/rdma/ib_verbs.h:2795 drivers/infiniband/core/verbs.c:1209)
> ib_create_qp_kernel (drivers/infiniband/core/verbs.c:1347)
> mlx5r_umr_resource_init (drivers/infiniband/hw/mlx5/umr.c:164)
> mlx5_ib_stage_post_ib_reg_umr_init (drivers/infiniband/hw/mlx5/main.c:4070)
> __mlx5_ib_add (drivers/infiniband/hw/mlx5/main.c:4168)
> mlx5r_probe (drivers/infiniband/hw/mlx5/main.c:4402)
> ...
> 
> Freed by task 1642:
> __kmem_cache_free (mm/slub.c:1826 mm/slub.c:3809 mm/slub.c:3822)
> ib_destroy_qp_user (drivers/infiniband/core/verbs.c:2112)
> mlx5r_umr_resource_cleanup (drivers/infiniband/hw/mlx5/umr.c:198)
> mlx5_ib_stage_post_ib_reg_umr_init (drivers/infiniband/hw/mlx5/main.c:4076
> drivers/infiniband/hw/mlx5/main.c:4065)
> __mlx5_ib_add (drivers/infiniband/hw/mlx5/main.c:4168)
> mlx5r_probe (drivers/infiniband/hw/mlx5/main.c:4402)
> ...
> 
> The buggy address belongs to the object at ffff88810da31000
> which belongs to the cache kmalloc-2k of size 2048
> The buggy address is located 168 bytes inside of
> freed 2048-byte region [ffff88810da31000, ffff88810da31800)
> 
> The buggy address belongs to the physical page:
> page:000000003b5e469d refcount:1 mapcount:0 mapping:0000000000000000
> index:0x0 pfn:0x10da30
> head:000000003b5e469d order:3 entire_mapcount:0 nr_pages_mapped:0
> pincount:0
> flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
> page_type: 0xffffffff()
> raw: 0017ffffc0000840 ffff888100042f00 ffffea0004180800 dead000000000002
> raw: 0000000000000000 0000000000080008 00000001ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
> ffff88810da30f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff88810da31000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff88810da31080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> 			      ^
> ffff88810da31100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88810da31180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================
> Disabling lock debugging due to kernel taint
> 
> Fixes: 04876c12c19e ("RDMA/mlx5: Move init and cleanup of UMR to umr.c")
> Reported-by: syzkaller <syzkaller@googlegroups.com>
> Suggested-by: Leon Romanovsky <leon@kernel.org>
> Signed-off-by: George Kennedy <george.kennedy@oracle.com>
> ---
> v2: went with fix suggested by: Leon Romanovsky <leon@kernel.org>

Applied to for-next

Thanks,
Jason

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net-next v4 1/6] net: ethtool: allow symmetric-xor RSS hash for any flow type
From: Jakub Kicinski @ 2023-10-31 15:20 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Ahmed Zaki, Alexander H Duyck, mkubecek, andrew,
	willemdebruijn.kernel, Wojciech Drewek, corbet, netdev, linux-doc,
	jesse.brandeburg, edumazet, anthony.l.nguyen, horms,
	vladimir.oltean, Jacob Keller, intel-wired-lan, pabeni, davem
In-Reply-To: <e7679b57-af11-42b1-91c7-b18cbcc70119@intel.com>

On Tue, 31 Oct 2023 09:14:58 -0600 Ahmed Zaki wrote:
> Do you mean add vendor-specific implementation details to common docs? 
> Not sure if I have seen this before. Any examples?
> 
> Or, we can add a note in ethtool doc that each vendor's implementation 
> is different and "Refer to your vendor's specifications for more info".

Gal, can you shed any more detail on who your implementation differs?
It will help the discussion, and also - I'm not sure how you can do
xor differently..

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net-next v4 1/6] net: ethtool: allow symmetric-xor RSS hash for any flow type
From: Ahmed Zaki @ 2023-10-31 15:14 UTC (permalink / raw)
  To: Gal Pressman, Jakub Kicinski, Alexander H Duyck
  Cc: mkubecek, andrew, willemdebruijn.kernel, Wojciech Drewek, corbet,
	netdev, linux-doc, jesse.brandeburg, edumazet, anthony.l.nguyen,
	horms, vladimir.oltean, Jacob Keller, intel-wired-lan, pabeni,
	davem
In-Reply-To: <70132b6f-542f-4fe6-971f-ab9ea80acbe4@nvidia.com>



On 2023-10-31 08:45, Gal Pressman wrote:
> On 31/10/2023 16:40, Ahmed Zaki wrote:
>>
>>
>> On 2023-10-31 06:00, Gal Pressman wrote:
>>> On 29/10/2023 18:59, Ahmed Zaki wrote:
>>>>
>>>>
>>>> On 2023-10-29 06:48, Gal Pressman wrote:
>>>>> On 29/10/2023 14:42, Ahmed Zaki wrote:
>>>>>>
>>>>>>
>>>>>> On 2023-10-29 06:25, Gal Pressman wrote:
>>>>>>> On 21/10/2023 3:00, Ahmed Zaki wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2023-10-20 17:49, Jakub Kicinski wrote:
>>>>>>>>> On Fri, 20 Oct 2023 17:14:11 -0600 Ahmed Zaki wrote:
>>>>>>>>>> I replied to that here:
>>>>>>>>>>
>>>>>>>>>> https://lore.kernel.org/all/afb4a06f-cfba-47ba-adb3-09bea7cb5f00@intel.com/
>>>>>>>>>>
>>>>>>>>>> I am kind of confused now so please bear with me. ethtool either
>>>>>>>>>> sends
>>>>>>>>>> "ethtool_rxfh" or "ethtool_rxnfc". AFAIK "ethtool_rxfh" is the
>>>>>>>>>> interface
>>>>>>>>>> for "ethtool -X" which is used to set the RSS algorithm. But we
>>>>>>>>>> kind of
>>>>>>>>>> agreed to go with "ethtool -U|-N" for symmetric-xor, and that uses
>>>>>>>>>> "ethtool_rxnfc" (as implemented in this series).
>>>>>>>>>
>>>>>>>>> I have no strong preference. Sounds like Alex prefers to keep it
>>>>>>>>> closer
>>>>>>>>> to algo, which is "ethtool_rxfh".
>>>>>>>>>
>>>>>>>>>> Do you mean use "ethtool_rxfh" instead of "ethtool_rxnfc"? how
>>>>>>>>>> would
>>>>>>>>>> that work on the ethtool user interface?
>>>>>>>>>
>>>>>>>>> I don't know what you're asking of us. If you find the code to
>>>>>>>>> confusing
>>>>>>>>> maybe someone at Intel can help you :|
>>>>>>>>
>>>>>>>> The code is straightforward. I am confused by the requirements:
>>>>>>>> don't
>>>>>>>> add a new algorithm but use "ethtool_rxfh".
>>>>>>>>
>>>>>>>> I'll see if I can get more help, may be I am missing something.
>>>>>>>>
>>>>>>>
>>>>>>> What was the decision here?
>>>>>>> Is this going to be exposed through ethtool -N or -X?
>>>>>>
>>>>>> I am working on a new version that uses "ethtool_rxfh" to set the
>>>>>> symmetric-xor. The user will set per-device via:
>>>>>>
>>>>>> ethtool -X eth0 hfunc toeplitz symmetric-xor
>>>>>>
>>>>>> then specify the per-flow type RSS fields as usual:
>>>>>>
>>>>>> ethtool -N|-U eth0 rx-flow-hash <flow_type> s|d|f|n
>>>>>>
>>>>>> The downside is that all flow-types will have to be either
>>>>>> symmetric or
>>>>>> asymmetric.
>>>>>
>>>>> Why are we making the interface less flexible than it can be with -N?
>>>>
>>>> Alexander Duyck prefers to implement the "symmetric-xor" interface as an
>>>> algorithm or extension (please refer to previous messages), but ethtool
>>>> does not provide flowtype/RSS fields setting via "-X". The above was the
>>>> best solution that we (at Intel) could think of.
>>>
>>> OK, it's a weird we're deliberately limiting our interface, given
>>> there's already hardware that supports controlling symmetric hashing per
>>> flow type.
>>>
>>> I saw you mentioned the way ice hardware implements symmetric-xor
>>> somewhere, it definitely needs to be added somewhere in our
>>> documentation to prevent confusion.
>>> mlx5 hardware also does symmetric hashing with xor, but not exactly as
>>> you described, we need the algorithm to be clear.
>>
>> Sure. I will add more ice-specific doc in:
>> Documentation/networking/device_drivers/ethernet/intel/ice.rst
> 
> I was thinking of somewhere more generic, where ethtool users (not
> necessarily ice users) can refer to.
> 
> Perhaps Documentation/networking/ethtool-netlink.rst? Or ethtool man page?

Do you mean add vendor-specific implementation details to common docs? 
Not sure if I have seen this before. Any examples?

Or, we can add a note in ethtool doc that each vendor's implementation 
is different and "Refer to your vendor's specifications for more info".


^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net-next v4 1/6] net: ethtool: allow symmetric-xor RSS hash for any flow type
From: Alexander Duyck @ 2023-10-31 14:59 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Ahmed Zaki, Jakub Kicinski, mkubecek, andrew,
	willemdebruijn.kernel, Wojciech Drewek, corbet, netdev, linux-doc,
	jesse.brandeburg, edumazet, anthony.l.nguyen, horms,
	vladimir.oltean, Jacob Keller, intel-wired-lan, pabeni, davem
In-Reply-To: <d097e7d3-5e16-44ba-aa92-dfb7fbedc600@nvidia.com>

On Tue, Oct 31, 2023 at 5:01 AM Gal Pressman <gal@nvidia.com> wrote:
>
> On 29/10/2023 18:59, Ahmed Zaki wrote:
> >
> >
> > On 2023-10-29 06:48, Gal Pressman wrote:
> >> On 29/10/2023 14:42, Ahmed Zaki wrote:
> >>>
> >>>
> >>> On 2023-10-29 06:25, Gal Pressman wrote:
> >>>> On 21/10/2023 3:00, Ahmed Zaki wrote:
> >>>>>
> >>>>>
> >>>>> On 2023-10-20 17:49, Jakub Kicinski wrote:
> >>>>>> On Fri, 20 Oct 2023 17:14:11 -0600 Ahmed Zaki wrote:
> >>>>>>> I replied to that here:
> >>>>>>>
> >>>>>>> https://lore.kernel.org/all/afb4a06f-cfba-47ba-adb3-09bea7cb5f00@intel.com/
> >>>>>>>
> >>>>>>> I am kind of confused now so please bear with me. ethtool either
> >>>>>>> sends
> >>>>>>> "ethtool_rxfh" or "ethtool_rxnfc". AFAIK "ethtool_rxfh" is the
> >>>>>>> interface
> >>>>>>> for "ethtool -X" which is used to set the RSS algorithm. But we
> >>>>>>> kind of
> >>>>>>> agreed to go with "ethtool -U|-N" for symmetric-xor, and that uses
> >>>>>>> "ethtool_rxnfc" (as implemented in this series).
> >>>>>>
> >>>>>> I have no strong preference. Sounds like Alex prefers to keep it
> >>>>>> closer
> >>>>>> to algo, which is "ethtool_rxfh".
> >>>>>>
> >>>>>>> Do you mean use "ethtool_rxfh" instead of "ethtool_rxnfc"? how would
> >>>>>>> that work on the ethtool user interface?
> >>>>>>
> >>>>>> I don't know what you're asking of us. If you find the code to
> >>>>>> confusing
> >>>>>> maybe someone at Intel can help you :|
> >>>>>
> >>>>> The code is straightforward. I am confused by the requirements: don't
> >>>>> add a new algorithm but use "ethtool_rxfh".
> >>>>>
> >>>>> I'll see if I can get more help, may be I am missing something.
> >>>>>
> >>>>
> >>>> What was the decision here?
> >>>> Is this going to be exposed through ethtool -N or -X?
> >>>
> >>> I am working on a new version that uses "ethtool_rxfh" to set the
> >>> symmetric-xor. The user will set per-device via:
> >>>
> >>> ethtool -X eth0 hfunc toeplitz symmetric-xor
> >>>
> >>> then specify the per-flow type RSS fields as usual:
> >>>
> >>> ethtool -N|-U eth0 rx-flow-hash <flow_type> s|d|f|n
> >>>
> >>> The downside is that all flow-types will have to be either symmetric or
> >>> asymmetric.
> >>
> >> Why are we making the interface less flexible than it can be with -N?
> >
> > Alexander Duyck prefers to implement the "symmetric-xor" interface as an
> > algorithm or extension (please refer to previous messages), but ethtool
> > does not provide flowtype/RSS fields setting via "-X". The above was the
> > best solution that we (at Intel) could think of.
>
> OK, it's a weird we're deliberately limiting our interface, given
> there's already hardware that supports controlling symmetric hashing per
> flow type.
>
> I saw you mentioned the way ice hardware implements symmetric-xor
> somewhere, it definitely needs to be added somewhere in our
> documentation to prevent confusion.
> mlx5 hardware also does symmetric hashing with xor, but not exactly as
> you described, we need the algorithm to be clear.

It is precisely because of the way the symmetric-xor implements it
that I suggested that they change the algo type instead of the input
fields.

Instead of doing something such as rearranging the inputs, what they
do is start XORing them together and then using those values for both
the source and destination ports. It would be one thing if they
swapped them, but instead they destroy the entropy provided by XORing
the values together and then doubling them up in the source and
destination fields. The result is the hash value becomes predictable
in that once you know the target you just have to offset the source
and destination port/IP by the same amount so that they hash out to
the same values, and as a result it would make DDoS attacks based on
the RSS hash much easier.

Where I draw the line in this is if we start losing entropy without
explicitly removing the value then it is part of the algo, whereas if
it is something such as placement or us explicitly saying we don't
want certain fields in there then it would be part of the input.
Adding fields to the input should increase or at least maintain the
entropy is my point of view.

^ permalink raw reply

* Re: [PATCH v2 08/12] riscv: dts: starfive: Add pool for coherent DMA memory on JH7100 boards
From: Emil Renner Berthing @ 2023-10-31 14:56 UTC (permalink / raw)
  To: Andrew Lunn, Cristian Ciocaltea
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Emil Renner Berthing, Samin Guo, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Torgue, Jose Abreu, Maxime Coquelin,
	Richard Cochran, Giuseppe Cavallaro, netdev, devicetree,
	linux-kernel, linux-riscv, linux-stm32, linux-arm-kernel, kernel,
	Emil Renner Berthing
In-Reply-To: <9b8c9846-20be-4cfa-aff5-f9ae8ac2aba4@lunn.ch>

Andrew Lunn wrote:
> On Sun, Oct 29, 2023 at 06:27:08AM +0200, Cristian Ciocaltea wrote:
> > From: Emil Renner Berthing <emil.renner.berthing@canonical.com>
> >
> > The StarFive JH7100 SoC has non-coherent device DMAs, but most drivers
> > expect to be able to allocate coherent memory for DMA descriptors and
> > such. However on the JH7100 DDR memory appears twice in the physical
> > memory map, once cached and once uncached:
> >
> >   0x00_8000_0000 - 0x08_7fff_ffff : Off chip DDR memory, cached
> >   0x10_0000_0000 - 0x17_ffff_ffff : Off chip DDR memory, uncached
> >
> > To use this uncached region we create a global DMA memory pool there and
> > reserve the corresponding area in the cached region.
> >
> > However the uncached region is fully above the 32bit address limit, so add
> > a dma-ranges map so the DMA address used for peripherals is still in the
> > regular cached region below the limit.
> >
> > Link: https://github.com/starfive-tech/JH7100_Docs/blob/main/JH7100%20Data%20Sheet%20V01.01.04-EN%20(4-21-2021).pdf
> > Signed-off-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
> > Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
> > ---
> >  .../boot/dts/starfive/jh7100-common.dtsi      | 24 +++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> >
> > diff --git a/arch/riscv/boot/dts/starfive/jh7100-common.dtsi b/arch/riscv/boot/dts/starfive/jh7100-common.dtsi
> > index b93ce351a90f..504c73f01f14 100644
> > --- a/arch/riscv/boot/dts/starfive/jh7100-common.dtsi
> > +++ b/arch/riscv/boot/dts/starfive/jh7100-common.dtsi
> > @@ -39,6 +39,30 @@ led-ack {
> >  			label = "ack";
> >  		};
> >  	};
> > +
> > +	reserved-memory {
> > +		#address-cells = <2>;
> > +		#size-cells = <2>;
> > +		ranges;
> > +
> > +		dma-reserved {
> > +			reg = <0x0 0xfa000000 0x0 0x1000000>;
>
> If i'm reading this correctly, this is at the top of the first 4G of
> RAM. But this is jh7100-common.dtsi. Is it guaranteed that all boards
> derived from this have at least 4G? What happens is a board only has
> 2G?

Yes, both the BeagleV Starlight and StarFive VisionFive V1 boards have at least
4G of ram and there won't be any more boards with this SoC. It was a test chip
for the JH7110 after all.

There aren't really any limitations on where this pool could be placed, so I
just chose to wedge it between ranges reserved for graphics by the bootloader.
If anyone has a better idea please go ahead and change it.

>
> It might also be worth putting a comment here about the memory being
> mapped twice. In the ARM world that would be illegal, so its maybe not
> seen that often. Yes, the commit message explains that, but when i
> look at the code on its own, it is less obvious.
>
> > +			no-map;
> > +		};
> > +
> > +		linux,dma {
> > +			compatible = "shared-dma-pool";
> > +			reg = <0x10 0x7a000000 0x0 0x1000000>;
> > +			no-map;
> > +			linux,dma-default;
> > +		};
> > +	};

^ permalink raw reply

* Re: [PATCH v2] selftests/net: synchronize udpgso_bench rx and tx
From: Paolo Abeni @ 2023-10-31 14:49 UTC (permalink / raw)
  To: Lucas Karpinski
  Cc: davem, edumazet, kuba, shuah, netdev, linux-kselftest,
	linux-kernel
In-Reply-To: <vzz3qfbfq52qja24y25lopif27sdwyvz3jmmcbx5wm6jc5l53b@fy7ym6xk4zsb>

On Tue, 2023-10-31 at 09:26 -0400, Lucas Karpinski wrote:
> > Since you wrote the same function verbatim in 3 different files, I
> > think it would be better place it in separate, new, net_helper.sh
> > file
> > and include such file from the various callers. Possibly
> > additionally
> > rename the function as wait_local_udp_port_listen.
> > 
> Thanks, I'll move it over. I think it would be best though to leave
> udp out of the name and to just pass the protocol as an argument.

Indeed. I suggested the other option just to keep it the simpler, but
if you have time and will, please go ahead!

Cheers,

Paolo


^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net-next v4 1/6] net: ethtool: allow symmetric-xor RSS hash for any flow type
From: Gal Pressman @ 2023-10-31 14:45 UTC (permalink / raw)
  To: Ahmed Zaki, Jakub Kicinski, Alexander H Duyck
  Cc: mkubecek, andrew, willemdebruijn.kernel, Wojciech Drewek, corbet,
	netdev, linux-doc, jesse.brandeburg, edumazet, anthony.l.nguyen,
	horms, vladimir.oltean, Jacob Keller, intel-wired-lan, pabeni,
	davem
In-Reply-To: <aa1dd347-a16c-44f8-95ad-5d50bcba8f34@intel.com>

On 31/10/2023 16:40, Ahmed Zaki wrote:
> 
> 
> On 2023-10-31 06:00, Gal Pressman wrote:
>> On 29/10/2023 18:59, Ahmed Zaki wrote:
>>>
>>>
>>> On 2023-10-29 06:48, Gal Pressman wrote:
>>>> On 29/10/2023 14:42, Ahmed Zaki wrote:
>>>>>
>>>>>
>>>>> On 2023-10-29 06:25, Gal Pressman wrote:
>>>>>> On 21/10/2023 3:00, Ahmed Zaki wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2023-10-20 17:49, Jakub Kicinski wrote:
>>>>>>>> On Fri, 20 Oct 2023 17:14:11 -0600 Ahmed Zaki wrote:
>>>>>>>>> I replied to that here:
>>>>>>>>>
>>>>>>>>> https://lore.kernel.org/all/afb4a06f-cfba-47ba-adb3-09bea7cb5f00@intel.com/
>>>>>>>>>
>>>>>>>>> I am kind of confused now so please bear with me. ethtool either
>>>>>>>>> sends
>>>>>>>>> "ethtool_rxfh" or "ethtool_rxnfc". AFAIK "ethtool_rxfh" is the
>>>>>>>>> interface
>>>>>>>>> for "ethtool -X" which is used to set the RSS algorithm. But we
>>>>>>>>> kind of
>>>>>>>>> agreed to go with "ethtool -U|-N" for symmetric-xor, and that uses
>>>>>>>>> "ethtool_rxnfc" (as implemented in this series).
>>>>>>>>
>>>>>>>> I have no strong preference. Sounds like Alex prefers to keep it
>>>>>>>> closer
>>>>>>>> to algo, which is "ethtool_rxfh".
>>>>>>>>
>>>>>>>>> Do you mean use "ethtool_rxfh" instead of "ethtool_rxnfc"? how
>>>>>>>>> would
>>>>>>>>> that work on the ethtool user interface?
>>>>>>>>
>>>>>>>> I don't know what you're asking of us. If you find the code to
>>>>>>>> confusing
>>>>>>>> maybe someone at Intel can help you :|
>>>>>>>
>>>>>>> The code is straightforward. I am confused by the requirements:
>>>>>>> don't
>>>>>>> add a new algorithm but use "ethtool_rxfh".
>>>>>>>
>>>>>>> I'll see if I can get more help, may be I am missing something.
>>>>>>>
>>>>>>
>>>>>> What was the decision here?
>>>>>> Is this going to be exposed through ethtool -N or -X?
>>>>>
>>>>> I am working on a new version that uses "ethtool_rxfh" to set the
>>>>> symmetric-xor. The user will set per-device via:
>>>>>
>>>>> ethtool -X eth0 hfunc toeplitz symmetric-xor
>>>>>
>>>>> then specify the per-flow type RSS fields as usual:
>>>>>
>>>>> ethtool -N|-U eth0 rx-flow-hash <flow_type> s|d|f|n
>>>>>
>>>>> The downside is that all flow-types will have to be either
>>>>> symmetric or
>>>>> asymmetric.
>>>>
>>>> Why are we making the interface less flexible than it can be with -N?
>>>
>>> Alexander Duyck prefers to implement the "symmetric-xor" interface as an
>>> algorithm or extension (please refer to previous messages), but ethtool
>>> does not provide flowtype/RSS fields setting via "-X". The above was the
>>> best solution that we (at Intel) could think of.
>>
>> OK, it's a weird we're deliberately limiting our interface, given
>> there's already hardware that supports controlling symmetric hashing per
>> flow type.
>>
>> I saw you mentioned the way ice hardware implements symmetric-xor
>> somewhere, it definitely needs to be added somewhere in our
>> documentation to prevent confusion.
>> mlx5 hardware also does symmetric hashing with xor, but not exactly as
>> you described, we need the algorithm to be clear.
> 
> Sure. I will add more ice-specific doc in:
> Documentation/networking/device_drivers/ethernet/intel/ice.rst

I was thinking of somewhere more generic, where ethtool users (not
necessarily ice users) can refer to.

Perhaps Documentation/networking/ethtool-netlink.rst? Or ethtool man page?

^ permalink raw reply

* Re: [PATCH v2 08/12] riscv: dts: starfive: Add pool for coherent DMA memory on JH7100 boards
From: Emil Renner Berthing @ 2023-10-31 14:40 UTC (permalink / raw)
  To: Cristian Ciocaltea, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Emil Renner Berthing, Samin Guo, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Torgue, Jose Abreu, Maxime Coquelin,
	Richard Cochran, Giuseppe Cavallaro
  Cc: netdev, devicetree, linux-kernel, linux-riscv, linux-stm32,
	linux-arm-kernel, kernel, Emil Renner Berthing
In-Reply-To: <20231029042712.520010-9-cristian.ciocaltea@collabora.com>

Cristian Ciocaltea wrote:
> From: Emil Renner Berthing <emil.renner.berthing@canonical.com>
>
> The StarFive JH7100 SoC has non-coherent device DMAs, but most drivers
> expect to be able to allocate coherent memory for DMA descriptors and
> such. However on the JH7100 DDR memory appears twice in the physical
> memory map, once cached and once uncached:
>
>   0x00_8000_0000 - 0x08_7fff_ffff : Off chip DDR memory, cached
>   0x10_0000_0000 - 0x17_ffff_ffff : Off chip DDR memory, uncached
>
> To use this uncached region we create a global DMA memory pool there and
> reserve the corresponding area in the cached region.
>
> However the uncached region is fully above the 32bit address limit, so add
> a dma-ranges map so the DMA address used for peripherals is still in the
> regular cached region below the limit.

Adding these nodes to the device tree won't actually do anything without
enabling CONFIG_DMA_GLOBAL_POOL as is done here:

https://github.com/esmil/linux/commit/e14ad9ff67fd51dcc76415d4cc7f3a30ffcba379

>
> Link: https://github.com/starfive-tech/JH7100_Docs/blob/main/JH7100%20Data%20Sheet%20V01.01.04-EN%20(4-21-2021).pdf
> Signed-off-by: Emil Renner Berthing <emil.renner.berthing@canonical.com>
> Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
> ---
>  .../boot/dts/starfive/jh7100-common.dtsi      | 24 +++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/arch/riscv/boot/dts/starfive/jh7100-common.dtsi b/arch/riscv/boot/dts/starfive/jh7100-common.dtsi
> index b93ce351a90f..504c73f01f14 100644
> --- a/arch/riscv/boot/dts/starfive/jh7100-common.dtsi
> +++ b/arch/riscv/boot/dts/starfive/jh7100-common.dtsi
> @@ -39,6 +39,30 @@ led-ack {
>  			label = "ack";
>  		};
>  	};
> +
> +	reserved-memory {
> +		#address-cells = <2>;
> +		#size-cells = <2>;
> +		ranges;
> +
> +		dma-reserved {
> +			reg = <0x0 0xfa000000 0x0 0x1000000>;
> +			no-map;
> +		};
> +
> +		linux,dma {
> +			compatible = "shared-dma-pool";
> +			reg = <0x10 0x7a000000 0x0 0x1000000>;
> +			no-map;
> +			linux,dma-default;
> +		};
> +	};
> +
> +	soc {
> +		dma-ranges = <0x00 0x80000000 0x00 0x80000000 0x00 0x7a000000>,
> +			     <0x00 0xfa000000 0x10 0x7a000000 0x00 0x01000000>,
> +			     <0x00 0xfb000000 0x00 0xfb000000 0x07 0x85000000>;
> +	};
>  };
>
>  &gpio {
> --
> 2.42.0
>

^ permalink raw reply

* Re: [PATCH net] net: sched: fix warn on htb offloaded class creation
From: Maxim Mikityanskiy @ 2023-10-31 14:40 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, Jamal Hadi Salim, Cong Wang, Jiri Pirko, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Tariq Toukan, Gal Pressman,
	Saeed Mahameed
In-Reply-To: <131da9645be5ef6ea584da27ecde795c52dfbb00.camel@redhat.com>

On Tue, 31 Oct 2023 at 10:11:14 +0100, Paolo Abeni wrote:
> Hi,
> 
> I'm sorry for the late reply.
> 
> On Fri, 2023-10-27 at 16:57 +0300, Maxim Mikityanskiy wrote:
> > I believe this is not the right fix.
> > 
> > On Thu, 26 Oct 2023 at 17:36:48 +0200, Paolo Abeni wrote:
> > > The following commands:
> > > 
> > > tc qdisc add dev eth1 handle 2: root htb offload
> > > tc class add dev eth1 parent 2: classid 2:1 htb rate 5mbit burst 15k
> > > 
> > > yeld to a WARN in the HTB qdisc:
> > 
> > Something is off here. These are literally the most basic commands one
> > could invoke with HTB offload, I'm sure they worked. Is it something
> > that broke recently? Tariq/Gal/Saeed, could you check them on a Mellanox
> > NIC?
> > 
> > > 
> > >  WARNING: CPU: 2 PID: 1583 at net/sched/sch_htb.c:1959
> > >  CPU: 2 PID: 1583 Comm: tc Kdump: loaded 6.6.0-rc2.mptcp_7895773e5235+ #59
> > >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> > >  RIP: 0010:htb_change_class+0x25c4/0x2e30 [sch_htb]
> > >  Code: 24 58 48 b8 00 00 00 00 00 fc ff df 48 89 ca 48 c1 ea 03 80 3c 02 00 0f 85 92 01 00 00 49 89 8c 24 b0 01 00 00 e9 77 fc ff ff <0f> 0b e9 15 ec ff ff 80 3d f8 35 00 00 00 0f 85 d4 f9 ff ff ba 32
> > >  RSP: 0018:ffffc900015df240 EFLAGS: 00010246
> > >  RAX: 0000000000000000 RBX: ffff88811b4ca000 RCX: ffff88811db42800
> > >  RDX: 1ffff11023b68502 RSI: ffffffffaf2e6a00 RDI: ffff88811db42810
> > >  RBP: ffff88811db45000 R08: 0000000000000001 R09: fffffbfff664bbc9
> > >  R10: ffffffffb325de4f R11: ffffffffb2d33748 R12: 0000000000000000
> > >  R13: ffff88811db43000 R14: ffff88811b4caaac R15: ffff8881252c0030
> > >  FS:  00007f6c1f126740(0000) GS:ffff88815aa00000(0000) knlGS:0000000000000000
> > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >  CR2: 000055dca8e5b4a8 CR3: 000000011bc7a006 CR4: 0000000000370ee0
> > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >  Call Trace:
> > >  <TASK>
> > >   tc_ctl_tclass+0x394/0xeb0
> > >   rtnetlink_rcv_msg+0x2f5/0xaa0
> > >   netlink_rcv_skb+0x12e/0x3a0
> > >   netlink_unicast+0x421/0x730
> > >   netlink_sendmsg+0x79e/0xc60
> > >   ____sys_sendmsg+0x95a/0xc20
> > >   ___sys_sendmsg+0xee/0x170
> > >   __sys_sendmsg+0xc6/0x170
> > >  do_syscall_64+0x58/0x80
> > >  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > > 
> > > The first command creates per TX queue pfifo qdiscs in
> > > tc_modify_qdisc() -> htb_init() and grafts the pfifo to each dev_queue
> > > via tc_modify_qdisc() ->  qdisc_graft() -> htb_attach().
> > 
> > Not exactly; it grafts pfifo to direct queues only. htb_attach_offload
> > explicitly grafts noop to all the remaining queues.
> 
> num_direct_qdiscs == real_num_tx_queues:
> 
> https://elixir.bootlin.com/linux/latest/source/net/sched/sch_htb.c#L1101
> 
> pfifo will be configured on all the TX queues available at TC creation
> time, right?

Yes, all real TX queues will be used as direct queues (for unclassified
traffic). num_tx_queues should be somewhat bigger than
real_num_tx_queues - it should reserve a queue per potential leaf class.

pfifo is configured on direct queues, and the reserved queues have noop.
Then, when a new leaf class is added (TC_HTB_LEAF_ALLOC_QUEUE), the
driver allocates a new queue and increases real_num_tx_queues. HTB
assigns a pfifo qdisc to the newly allocated queue.

Changing the hierarchy (deleting a node or converting an inner node to a
leaf) may reorder the classful queues (with indexes >= the initial
real_num_tx_queues), so that there are no gaps.

> Lacking a mlx card with offload support I hack basic htb support in
> netdevsim and I observe the splat on top of such device. I can as well
> share the netdevsim patch - it will need some clean-up.

I will be happy to review the netdevsim patch, but I don't promise
prompt responsiveness.

> > 
> > > When the command completes, the qdisc_sleeping for each dev_queue is a
> > > pfifo one. The next class creation will trigger the reported splat.
> > > 
> > > Address the issue taking care of old non-builtin qdisc in
> > > htb_change_class().
> > > 
> > > Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > ---
> > >  net/sched/sch_htb.c | 3 +--
> > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > 
> > > diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> > > index 0d947414e616..dc682bd542b4 100644
> > > --- a/net/sched/sch_htb.c
> > > +++ b/net/sched/sch_htb.c
> > > @@ -1955,8 +1955,7 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
> > >  				qdisc_refcount_inc(new_q);
> > >  			}
> > >  			old_q = htb_graft_helper(dev_queue, new_q);
> > > -			/* No qdisc_put needed. */
> > > -			WARN_ON(!(old_q->flags & TCQ_F_BUILTIN));
> > > +			qdisc_put(old_q);
> > 
> > We can get here after one of two cases above:
> > 
> > 1. A new queue is allocated with TC_HTB_LEAF_ALLOC_QUEUE. It's supposed
> > to have a noop qdisc by default (after htb_attach_offload).
> 
> So most likely the trivial netdevsim implementation I used was not good
> enough.
> 
> Which constrains should respect TC_HTB_LEAF_ALLOC_QUEUE WRT the
> returned qid value? should it in the (real_num_tx_queues,
> num_tx_queues] range?

Let's say N is real_num_tx_queues as it was at the moment of attaching.
HTB queues should be allocated from [N, num_tx_queues), and
real_num_tx_queues should be increased accordingly. It should not return
queues number [0, N).

Deletions should fill the gaps: if queue X is being deleted, N <= X <
real_num_tx_queues - 1, then the gap should be filled with queue number
real_num_tx_queues - 1 by swapping the queues (real_num_tx_queues will
be decreased by 1 accordingly). Some care also needs to be taken when
converting inner-to-leaf (TC_HTB_LEAF_DEL_LAST) and leaf-to-inner (it's
better to get insights from [1], there are also some comments).

> Can HTB actually configure H/W shaping on
> real_num_tx_queues?

It will be on real_num_tx_queues, but after it's increased to add new
HTB queues. The original queues [0, N) are used for direct traffic, same
as the non-offloaded HTB's direct_queue (it's not shaped).

> I find no clear documentation WRT the above.

I'm sorry for the lack of documentation. All I have is the commit
message [2] and a netdev talk [3]. Maybe the slides could be of some
use...

I hope the above explanation clarifies something, and feel free to ask
further questions, I'll be glad to explain what hasn't been documented
properly.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/mellanox/mlx5/core/en/htb.c?id=5a6a09e97199d6600d31383055f9d43fbbcbe86f
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d03b195b5aa015f6c11988b86a3625f8d5dbac52
[3]: https://netdevconf.info/0x14/session.html?talk-hierarchical-QoS-hardware-offload

> Thanks!
> 
> Paolo
> 

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH net-next v4 1/6] net: ethtool: allow symmetric-xor RSS hash for any flow type
From: Ahmed Zaki @ 2023-10-31 14:40 UTC (permalink / raw)
  To: Gal Pressman, Jakub Kicinski, Alexander H Duyck
  Cc: mkubecek, andrew, willemdebruijn.kernel, Wojciech Drewek, corbet,
	netdev, linux-doc, jesse.brandeburg, edumazet, anthony.l.nguyen,
	horms, vladimir.oltean, Jacob Keller, intel-wired-lan, pabeni,
	davem
In-Reply-To: <d097e7d3-5e16-44ba-aa92-dfb7fbedc600@nvidia.com>



On 2023-10-31 06:00, Gal Pressman wrote:
> On 29/10/2023 18:59, Ahmed Zaki wrote:
>>
>>
>> On 2023-10-29 06:48, Gal Pressman wrote:
>>> On 29/10/2023 14:42, Ahmed Zaki wrote:
>>>>
>>>>
>>>> On 2023-10-29 06:25, Gal Pressman wrote:
>>>>> On 21/10/2023 3:00, Ahmed Zaki wrote:
>>>>>>
>>>>>>
>>>>>> On 2023-10-20 17:49, Jakub Kicinski wrote:
>>>>>>> On Fri, 20 Oct 2023 17:14:11 -0600 Ahmed Zaki wrote:
>>>>>>>> I replied to that here:
>>>>>>>>
>>>>>>>> https://lore.kernel.org/all/afb4a06f-cfba-47ba-adb3-09bea7cb5f00@intel.com/
>>>>>>>>
>>>>>>>> I am kind of confused now so please bear with me. ethtool either
>>>>>>>> sends
>>>>>>>> "ethtool_rxfh" or "ethtool_rxnfc". AFAIK "ethtool_rxfh" is the
>>>>>>>> interface
>>>>>>>> for "ethtool -X" which is used to set the RSS algorithm. But we
>>>>>>>> kind of
>>>>>>>> agreed to go with "ethtool -U|-N" for symmetric-xor, and that uses
>>>>>>>> "ethtool_rxnfc" (as implemented in this series).
>>>>>>>
>>>>>>> I have no strong preference. Sounds like Alex prefers to keep it
>>>>>>> closer
>>>>>>> to algo, which is "ethtool_rxfh".
>>>>>>>
>>>>>>>> Do you mean use "ethtool_rxfh" instead of "ethtool_rxnfc"? how would
>>>>>>>> that work on the ethtool user interface?
>>>>>>>
>>>>>>> I don't know what you're asking of us. If you find the code to
>>>>>>> confusing
>>>>>>> maybe someone at Intel can help you :|
>>>>>>
>>>>>> The code is straightforward. I am confused by the requirements: don't
>>>>>> add a new algorithm but use "ethtool_rxfh".
>>>>>>
>>>>>> I'll see if I can get more help, may be I am missing something.
>>>>>>
>>>>>
>>>>> What was the decision here?
>>>>> Is this going to be exposed through ethtool -N or -X?
>>>>
>>>> I am working on a new version that uses "ethtool_rxfh" to set the
>>>> symmetric-xor. The user will set per-device via:
>>>>
>>>> ethtool -X eth0 hfunc toeplitz symmetric-xor
>>>>
>>>> then specify the per-flow type RSS fields as usual:
>>>>
>>>> ethtool -N|-U eth0 rx-flow-hash <flow_type> s|d|f|n
>>>>
>>>> The downside is that all flow-types will have to be either symmetric or
>>>> asymmetric.
>>>
>>> Why are we making the interface less flexible than it can be with -N?
>>
>> Alexander Duyck prefers to implement the "symmetric-xor" interface as an
>> algorithm or extension (please refer to previous messages), but ethtool
>> does not provide flowtype/RSS fields setting via "-X". The above was the
>> best solution that we (at Intel) could think of.
> 
> OK, it's a weird we're deliberately limiting our interface, given
> there's already hardware that supports controlling symmetric hashing per
> flow type.
> 
> I saw you mentioned the way ice hardware implements symmetric-xor
> somewhere, it definitely needs to be added somewhere in our
> documentation to prevent confusion.
> mlx5 hardware also does symmetric hashing with xor, but not exactly as
> you described, we need the algorithm to be clear.

Sure. I will add more ice-specific doc in:
Documentation/networking/device_drivers/ethernet/intel/ice.rst

^ permalink raw reply

* Re: [PATCH v2 07/12] riscv: dts: starfive: jh7100: Add ccache DT node
From: Emil Renner Berthing @ 2023-10-31 14:38 UTC (permalink / raw)
  To: Cristian Ciocaltea, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Emil Renner Berthing, Samin Guo, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Torgue, Jose Abreu, Maxime Coquelin,
	Richard Cochran, Giuseppe Cavallaro
  Cc: netdev, devicetree, linux-kernel, linux-riscv, linux-stm32,
	linux-arm-kernel, kernel
In-Reply-To: <20231029042712.520010-8-cristian.ciocaltea@collabora.com>

Cristian Ciocaltea wrote:
> Provide a DT node for the SiFive Composable Cache controller found on
> the StarFive JH7100 SoC.
>
> Note this is also used to support non-coherent DMA, via the
> sifive,cache-ops cache flushing operations.

This property is no longer needed:
https://lore.kernel.org/linux-riscv/20231031141444.53426-1-emil.renner.berthing@canonical.com/

Also it would be nice to mention that these nodes are copied from my
visionfive patches ;)

>
> Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
> ---
>  arch/riscv/boot/dts/starfive/jh7100.dtsi | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/arch/riscv/boot/dts/starfive/jh7100.dtsi b/arch/riscv/boot/dts/starfive/jh7100.dtsi
> index 06bb157ce111..a8a5bb00b0d8 100644
> --- a/arch/riscv/boot/dts/starfive/jh7100.dtsi
> +++ b/arch/riscv/boot/dts/starfive/jh7100.dtsi
> @@ -32,6 +32,7 @@ U74_0: cpu@0 {
>  			i-tlb-sets = <1>;
>  			i-tlb-size = <32>;
>  			mmu-type = "riscv,sv39";
> +			next-level-cache = <&ccache>;
>  			riscv,isa = "rv64imafdc";
>  			riscv,isa-base = "rv64i";
>  			riscv,isa-extensions = "i", "m", "a", "f", "d", "c", "zicntr", "zicsr",
> @@ -60,6 +61,7 @@ U74_1: cpu@1 {
>  			i-tlb-sets = <1>;
>  			i-tlb-size = <32>;
>  			mmu-type = "riscv,sv39";
> +			next-level-cache = <&ccache>;
>  			riscv,isa = "rv64imafdc";
>  			riscv,isa-base = "rv64i";
>  			riscv,isa-extensions = "i", "m", "a", "f", "d", "c", "zicntr", "zicsr",
> @@ -147,6 +149,18 @@ soc {
>  		dma-noncoherent;
>  		ranges;
>
> +		ccache: cache-controller@2010000 {
> +			compatible = "starfive,jh7100-ccache", "sifive,ccache0", "cache";
> +			reg = <0x0 0x2010000 0x0 0x1000>;
> +			interrupts = <128>, <130>, <131>, <129>;
> +			cache-block-size = <64>;
> +			cache-level = <2>;
> +			cache-sets = <2048>;
> +			cache-size = <2097152>;
> +			cache-unified;
> +			sifive,cache-ops;
> +		};
> +
>  		clint: clint@2000000 {
>  			compatible = "starfive,jh7100-clint", "sifive,clint0";
>  			reg = <0x0 0x2000000 0x0 0x10000>;
> --
> 2.42.0
>

^ permalink raw reply

* Re: [PATCH iproute2-next] bridge: mdb: Add get support
From: Ido Schimmel @ 2023-10-31 14:37 UTC (permalink / raw)
  To: Nikolay Aleksandrov; +Cc: netdev, dsahern, stephen, mlxsw
In-Reply-To: <f5823dfb-4ba7-f32c-d8a3-9b8b7cdb7c5d@blackwall.org>

On Tue, Oct 31, 2023 at 04:29:58PM +0200, Nikolay Aleksandrov wrote:
> > @@ -865,6 +960,8 @@ int do_mdb(int argc, char **argv)
> >   		    matches(*argv, "lst") == 0 ||
> >   		    matches(*argv, "list") == 0)
> >   			return mdb_show(argc-1, argv+1);
> > +		if (matches(*argv, "get") == 0)
> > +			return mdb_get(argc-1, argv+1);
> 
> I can't recall if it was agreed to add only strcmp even if the rest uses
> matches()?

Yea, not sure how I missed it... Will fix in v2.

Thanks!

^ permalink raw reply

* Re: [PATCH v2 05/12] net: stmmac: dwmac-starfive: Add support for JH7100 SoC
From: Emil Renner Berthing @ 2023-10-31 14:33 UTC (permalink / raw)
  To: Cristian Ciocaltea, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Emil Renner Berthing, Samin Guo, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexandre Torgue, Jose Abreu, Maxime Coquelin,
	Richard Cochran, Giuseppe Cavallaro
  Cc: netdev, devicetree, linux-kernel, linux-riscv, linux-stm32,
	linux-arm-kernel, kernel
In-Reply-To: <20231029042712.520010-6-cristian.ciocaltea@collabora.com>

Cristian Ciocaltea wrote:
> Add a missing quirk to enable support for the StarFive JH7100 SoC.
>
> Additionally, for greater flexibility in operation, allow using the
> rgmii-rxid and rgmii-txid phy modes.
>
> Co-developed-by: Emil Renner Berthing <kernel@esmil.dk>
> Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
> Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>

Hi Cristian,

Thanks for working on this! This driver has code to update the phy clock for
different line speeds. I don't think that will work without the
CLK_SET_RATE_PARENT flag added to the clock in [1] which in turn depends on
[2].

[1]: https://github.com/esmil/linux/commit/b200c3054b58a49ba25af67aff82d9045e3c3666
[2]: https://github.com/esmil/linux/commit/dce189542c16bf0eb8533d96c0305cb59d149dae

Two more comments below..

> ---
>  drivers/net/ethernet/stmicro/stmmac/Kconfig   |  6 ++--
>  .../ethernet/stmicro/stmmac/dwmac-starfive.c  | 32 ++++++++++++++++---
>  2 files changed, 31 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig b/drivers/net/ethernet/stmicro/stmmac/Kconfig
> index a2b9e289aa36..c3c2c8360047 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
> +++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
> @@ -165,9 +165,9 @@ config DWMAC_STARFIVE
>  	help
>  	  Support for ethernet controllers on StarFive RISC-V SoCs
>
> -	  This selects the StarFive platform specific glue layer support for
> -	  the stmmac device driver. This driver is used for StarFive JH7110
> -	  ethernet controller.
> +	  This selects the StarFive platform specific glue layer support
> +	  for the stmmac device driver. This driver is used for the
> +	  StarFive JH7100 and JH7110 ethernet controllers.
>
>  config DWMAC_STI
>  	tristate "STi GMAC support"
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c
> index 5d630affb4d1..88c431edcea0 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-starfive.c
> @@ -15,13 +15,20 @@
>
>  #include "stmmac_platform.h"
>
> -#define STARFIVE_DWMAC_PHY_INFT_RGMII	0x1
> -#define STARFIVE_DWMAC_PHY_INFT_RMII	0x4
> -#define STARFIVE_DWMAC_PHY_INFT_FIELD	0x7U
> +#define STARFIVE_DWMAC_PHY_INFT_RGMII		0x1
> +#define STARFIVE_DWMAC_PHY_INFT_RMII		0x4
> +#define STARFIVE_DWMAC_PHY_INFT_FIELD		0x7U
> +
> +#define JH7100_SYSMAIN_REGISTER49_DLYCHAIN	0xc8
> +
> +struct starfive_dwmac_data {
> +	unsigned int gtxclk_dlychain;
> +};
>
>  struct starfive_dwmac {
>  	struct device *dev;
>  	struct clk *clk_tx;
> +	const struct starfive_dwmac_data *data;
>  };
>
>  static void starfive_dwmac_fix_mac_speed(void *priv, unsigned int speed, unsigned int mode)
> @@ -67,6 +74,8 @@ static int starfive_dwmac_set_mode(struct plat_stmmacenet_data *plat_dat)
>
>  	case PHY_INTERFACE_MODE_RGMII:
>  	case PHY_INTERFACE_MODE_RGMII_ID:
> +	case PHY_INTERFACE_MODE_RGMII_RXID:
> +	case PHY_INTERFACE_MODE_RGMII_TXID:
>  		mode = STARFIVE_DWMAC_PHY_INFT_RGMII;
>  		break;
>
> @@ -89,6 +98,14 @@ static int starfive_dwmac_set_mode(struct plat_stmmacenet_data *plat_dat)
>  	if (err)
>  		return dev_err_probe(dwmac->dev, err, "error setting phy mode\n");
>
> +	if (dwmac->data) {

I think you want something like this so future quirks don't need to touch this
code:

	if (dwmac->data && dwmac->data->gtxclk_dlychain)

> +		err = regmap_write(regmap, JH7100_SYSMAIN_REGISTER49_DLYCHAIN,
> +				   dwmac->data->gtxclk_dlychain);
> +		if (err)
> +			return dev_err_probe(dwmac->dev, err,
> +					     "error selecting gtxclk delay chain\n");
> +	}
> +
>  	return 0;
>  }
>
> @@ -114,6 +131,8 @@ static int starfive_dwmac_probe(struct platform_device *pdev)
>  	if (!dwmac)
>  		return -ENOMEM;
>
> +	dwmac->data = device_get_match_data(&pdev->dev);
> +
>  	dwmac->clk_tx = devm_clk_get_enabled(&pdev->dev, "tx");
>  	if (IS_ERR(dwmac->clk_tx))
>  		return dev_err_probe(&pdev->dev, PTR_ERR(dwmac->clk_tx),
> @@ -144,8 +163,13 @@ static int starfive_dwmac_probe(struct platform_device *pdev)
>  	return stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
>  }
>
> +static const struct starfive_dwmac_data jh7100_data = {
> +	.gtxclk_dlychain = 4

Please add a , at the end of this line. I know it's unlikely that we need to
add more properties, but it's still good practice to do. This way such patches
won't need to touch this line.

> +};
> +
>  static const struct of_device_id starfive_dwmac_match[] = {
> -	{ .compatible = "starfive,jh7110-dwmac"	},
> +	{ .compatible = "starfive,jh7100-dwmac", .data = &jh7100_data },
> +	{ .compatible = "starfive,jh7110-dwmac" },
>  	{ /* sentinel */ }
>  };
>  MODULE_DEVICE_TABLE(of, starfive_dwmac_match);
> --
> 2.42.0
>

^ permalink raw reply

* Re: [PATCH iproute2-next] bridge: mdb: Add get support
From: Nikolay Aleksandrov @ 2023-10-31 14:29 UTC (permalink / raw)
  To: Ido Schimmel, netdev; +Cc: dsahern, stephen, mlxsw
In-Reply-To: <20231030154654.1202094-1-idosch@nvidia.com>

On 10/30/23 17:46, Ido Schimmel wrote:
> Implement MDB get functionality, allowing user space to query a single
> MDB entry from the kernel instead of dumping all the entries. Example
> usage:
> 
>   # bridge mdb add dev br0 port swp1 grp 239.1.1.1 vid 10
>   # bridge mdb add dev br0 port swp2 grp 239.1.1.1 vid 10
>   # bridge mdb add dev br0 port swp2 grp 239.1.1.5 vid 10
>   # bridge mdb get dev br0 grp 239.1.1.1 vid 10
>   dev br0 port swp1 grp 239.1.1.1 temp vid 10
>   dev br0 port swp2 grp 239.1.1.1 temp vid 10
>   # bridge -j -p mdb get dev br0 grp 239.1.1.1 vid 10
>   [ {
>           "index": 10,
>           "dev": "br0",
>           "port": "swp1",
>           "grp": "239.1.1.1",
>           "state": "temp",
>           "flags": [ ],
>           "vid": 10
>       },{
>           "index": 10,
>           "dev": "br0",
>           "port": "swp2",
>           "grp": "239.1.1.1",
>           "state": "temp",
>           "flags": [ ],
>           "vid": 10
>       } ]
>   # bridge mdb get dev br0 grp 239.1.1.1 vid 20
>   Error: bridge: MDB entry not found.
>   # bridge mdb get dev br0 grp 239.1.1.2 vid 10
>   Error: bridge: MDB entry not found.
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>   bridge/mdb.c      | 99 ++++++++++++++++++++++++++++++++++++++++++++++-
>   man/man8/bridge.8 | 35 +++++++++++++++++
>   2 files changed, 133 insertions(+), 1 deletion(-)
> 

The patch looks good. One side question below.
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>

[snip]
> @@ -865,6 +960,8 @@ int do_mdb(int argc, char **argv)
>   		    matches(*argv, "lst") == 0 ||
>   		    matches(*argv, "list") == 0)
>   			return mdb_show(argc-1, argv+1);
> +		if (matches(*argv, "get") == 0)
> +			return mdb_get(argc-1, argv+1);

I can't recall if it was agreed to add only strcmp even if the rest uses 
matches()?



^ permalink raw reply

* Re: [PATCH bpf-next v6 11/18] ice: put XDP meta sources assignment under a static key condition
From: Larysa Zaremba @ 2023-10-31 14:22 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Willem de Bruijn, Alexei Starovoitov, Tariq Toukan,
	Saeed Mahameed, toke
In-Reply-To: <ZT1nSGYng8sUKQD7@boxer>

On Sat, Oct 28, 2023 at 09:55:52PM +0200, Maciej Fijalkowski wrote:
> On Mon, Oct 23, 2023 at 11:35:46AM +0200, Larysa Zaremba wrote:
> > On Fri, Oct 20, 2023 at 06:32:13PM +0200, Maciej Fijalkowski wrote:
> > > On Thu, Oct 12, 2023 at 07:05:17PM +0200, Larysa Zaremba wrote:
> > > > Usage of XDP hints requires putting additional information after the
> > > > xdp_buff. In basic case, only the descriptor has to be copied on a
> > > > per-packet basis, because xdp_buff permanently resides before per-ring
> > > > metadata (cached time and VLAN protocol ID).
> > > > 
> > > > However, in ZC mode, xdp_buffs come from a pool, so memory after such
> > > > buffer does not contain any reliable information, so everything has to be
> > > > copied, damaging the performance.
> > > > 
> > > > Introduce a static key to enable meta sources assignment only when attached
> > > > XDP program is device-bound.
> > > > 
> > > > This patch eliminates a 6% performance drop in ZC mode, which was a result
> > > > of addition of XDP hints to the driver.
> > > > 
> > > > Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
> > > > ---
> > > >  drivers/net/ethernet/intel/ice/ice.h      |  1 +
> > > >  drivers/net/ethernet/intel/ice/ice_main.c | 14 ++++++++++++++
> > > >  drivers/net/ethernet/intel/ice/ice_txrx.c |  3 ++-
> > > >  drivers/net/ethernet/intel/ice/ice_xsk.c  |  3 +++
> > > >  4 files changed, 20 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > > > index 3d0f15f8b2b8..76d22be878a4 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice.h
> > > > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > > > @@ -210,6 +210,7 @@ enum ice_feature {
> > > >  };
> > > >  
> > > >  DECLARE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > > +DECLARE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > >  
> > > >  struct ice_channel {
> > > >  	struct list_head list;
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > index 47e8920e1727..ee0df86d34b7 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > > > @@ -48,6 +48,9 @@ MODULE_PARM_DESC(debug, "netif level (0=none,...,16=all)");
> > > >  DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
> > > >  EXPORT_SYMBOL(ice_xdp_locking_key);
> > > >  
> > > > +DEFINE_STATIC_KEY_FALSE(ice_xdp_meta_key);
> > > > +EXPORT_SYMBOL(ice_xdp_meta_key);
> > > > +
> > > >  /**
> > > >   * ice_hw_to_dev - Get device pointer from the hardware structure
> > > >   * @hw: pointer to the device HW structure
> > > > @@ -2634,6 +2637,11 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
> > > >  	return -ENOMEM;
> > > >  }
> > > >  
> > > > +static bool ice_xdp_prog_has_meta(struct bpf_prog *prog)
> > > > +{
> > > > +	return prog && prog->aux->dev_bound;
> > > > +}
> > > > +
> > > >  /**
> > > >   * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
> > > >   * @vsi: VSI to set the bpf prog on
> > > > @@ -2644,10 +2652,16 @@ static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
> > > >  	struct bpf_prog *old_prog;
> > > >  	int i;
> > > >  
> > > > +	if (ice_xdp_prog_has_meta(prog))
> > > > +		static_branch_inc(&ice_xdp_meta_key);
> > > 
> > > i thought boolean key would be enough but inc/dec should serve properly
> > > for example prog hotswap cases.
> > >
> > 
> > My thought process on using counting instead of boolean was: there can be 
> > several PFs that use the same driver, so therefore we need to keep track of how 
> > many od them use hints. 
> 
> Very good point. This implies that if PF0 has hints-enabled prog loaded,
> PF1 with non-hints prog will "suffer" from it.
> 
> Sorry for such a long delays in responses but I was having a hard time
> making up my mind about it. In the end I have come up to some conclusions.
> I know the timing for sending this response is not ideal, but I need to
> get this off my chest and bring discussion back to life:)
> 
> IMHO having static keys to eliminate ZC overhead does not scale. I assume
> every other driver would have to follow that.
> 
> XSK pool allows us to avoid initializing various things per each packet.
> Instead, taking xdp_rxq_info as an example, each xdp_buff from pool has
> xdp_rxq_info assigned at init time. With this in mind, we should have some
> mechanism to set hints-specific things in xdp_buff_xsk::cb, at init time
> as well. Such mechanism should not require us to expose driver's private
> xdp_buff hints containers (such as ice_pkt_ctx) to XSK pool.
> 
> Right now you moved phctime down to ice_pkt_ctx and to me that's the main
> reason we have to copy ice_pkt_ctx to each xdp_buff on ZC. What if we keep
> the cached_phctime at original offset in ring but ice_pkt_ctx would get a
> pointer to that?
> 
> This would allow us to init the pointer in each xdp_buff from XSK pool at
> init time. I have come up with a way to program that via so called XSK
> meta descriptors. Each desc would have data to write onto cb, offset
> within cb and amount of bytes to write/copy.
> 
> I'll share the diff below but note that I didn't measure how much lower
> the performance is degraded. My icelake machine where I used to measure
> performance-sensitive code got broke. For now we can't escape initing
> eop_desc per each xdp_buff, but I moved it to alloc side, as we mangle
> descs there anyway.
> 
> I think mlx5 could benefit from that approach as well with initing the rq
> ptr at init time.
> 
> Diff does mostly these things:
> - move cached_phctime to old place in ice_rx_ring and add ptr to that in
>   ice_pkt_ctx
> - introduce xsk_pool_set_meta()
> - use it from ice side.
> 

Thank you for the code! I will probably send v7 with such changes. Are you OK, 
if patch with core changes would go with you as an author?

But also, I see a minor problem with that switching VLAN protocol does not 
trigger buffer allocation, so we have to point to that too, this probably means 
moving cached time back and finding 16 extra bits in CL3. Single pointer to 
{cached time, vlan_proto} would be copied to be after xdp_buff.

> I consider this as a discussion trigger rather than ready code. Any
> feedback will be appreciated.
> 
> ---------------------------------8<---------------------------------
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index 7fa43827a3f0..c192e84bee55 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -519,6 +519,23 @@ static int ice_setup_rx_ctx(struct ice_rx_ring *ring)
>  	return 0;
>  }
>  
> +static void ice_xsk_pool_set_meta(struct ice_rx_ring *ring)
> +{
> +	struct xsk_meta_desc desc = {};
> +
> +	desc.val = (uintptr_t)&ring->cached_phctime;
> +	desc.off = offsetof(struct ice_pkt_ctx, cached_phctime);
> +	desc.bytes = sizeof_field(struct ice_pkt_ctx, cached_phctime);
> +	xsk_pool_set_meta(ring->xsk_pool, &desc);
> +
> +	memset(&desc, 0, sizeof(struct xsk_meta_desc));
> +
> +	desc.val = ring->pkt_ctx.vlan_proto;
> +	desc.off = offsetof(struct ice_pkt_ctx, vlan_proto);
> +	desc.bytes = sizeof_field(struct ice_pkt_ctx, vlan_proto);
> +	xsk_pool_set_meta(ring->xsk_pool, &desc);
> +}
> +
>  /**
>   * ice_vsi_cfg_rxq - Configure an Rx queue
>   * @ring: the ring being configured
> @@ -553,6 +570,7 @@ int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
>  			if (err)
>  				return err;
>  			xsk_pool_set_rxq_info(ring->xsk_pool, &ring->xdp_rxq);
> +			ice_xsk_pool_set_meta(ring);
>  
>  			dev_info(dev, "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n",
>  				 ring->q_index);
> @@ -575,6 +593,7 @@ int ice_vsi_cfg_rxq(struct ice_rx_ring *ring)
>  
>  	xdp_init_buff(&ring->xdp, ice_rx_pg_size(ring) / 2, &ring->xdp_rxq);
>  	ring->xdp.data = NULL;
> +	ring->pkt_ctx.cached_phctime = &ring->cached_phctime;
>  	err = ice_setup_rx_ctx(ring);
>  	if (err) {
>  		dev_err(dev, "ice_setup_rx_ctx failed for RxQ %d, err %d\n",
> diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> index cf5c91ada94c..d3cb08e66dcb 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
> @@ -2846,7 +2846,7 @@ ice_set_ringparam(struct net_device *netdev, struct ethtool_ringparam *ring,
>  		/* clone ring and setup updated count */
>  		rx_rings[i] = *vsi->rx_rings[i];
>  		rx_rings[i].count = new_rx_cnt;
> -		rx_rings[i].pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> +		rx_rings[i].cached_phctime = pf->ptp.cached_phc_time;
>  		rx_rings[i].desc = NULL;
>  		rx_rings[i].rx_buf = NULL;
>  		/* this is to allow wr32 to have something to write to
> diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
> index 2fc97eafd1f6..1f45f0c3963d 100644
> --- a/drivers/net/ethernet/intel/ice/ice_lib.c
> +++ b/drivers/net/ethernet/intel/ice/ice_lib.c
> @@ -1456,7 +1456,7 @@ static int ice_vsi_alloc_rings(struct ice_vsi *vsi)
>  		ring->netdev = vsi->netdev;
>  		ring->dev = dev;
>  		ring->count = vsi->num_rx_desc;
> -		ring->pkt_ctx.cached_phctime = pf->ptp.cached_phc_time;
> +		ring->cached_phctime = pf->ptp.cached_phc_time;
>  		WRITE_ONCE(vsi->rx_rings[i], ring);
>  	}
>  
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index f6444890f0ef..e2fa979830cd 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -955,8 +955,7 @@ static int ice_ptp_update_cached_phctime(struct ice_pf *pf)
>  		ice_for_each_rxq(vsi, j) {
>  			if (!vsi->rx_rings[j])
>  				continue;
> -			WRITE_ONCE(vsi->rx_rings[j]->pkt_ctx.cached_phctime,
> -				   systime);
> +			WRITE_ONCE(vsi->rx_rings[j]->cached_phctime, systime);
>  		}
>  	}
>  	clear_bit(ICE_CFG_BUSY, pf->state);
> @@ -2119,7 +2118,7 @@ u64 ice_ptp_get_rx_hwts(const union ice_32b_rx_flex_desc *rx_desc,
>  	if (!(rx_desc->wb.time_stamp_low & ICE_PTP_TS_VALID))
>  		return 0;
>  
> -	cached_time = READ_ONCE(pkt_ctx->cached_phctime);
> +	cached_time = READ_ONCE(*pkt_ctx->cached_phctime);
>  
>  	/* Do not report a timestamp if we don't have a cached PHC time */
>  	if (!cached_time)
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
> index 41e0b14e6643..94594cc0d3ee 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
> @@ -259,7 +259,7 @@ enum ice_rx_dtype {
>  
>  struct ice_pkt_ctx {
>  	const union ice_32b_rx_flex_desc *eop_desc;
> -	u64 cached_phctime;
> +	u64 *cached_phctime;
>  	__be16 vlan_proto;
>  };
>  
> @@ -356,6 +356,7 @@ struct ice_rx_ring {
>  	struct ice_tx_ring *xdp_ring;
>  	struct xsk_buff_pool *xsk_pool;
>  	dma_addr_t dma;			/* physical address of ring */
> +	u64 cached_phctime;
>  	u16 rx_buf_len;
>  	u8 dcb_tc;			/* Traffic class of ring */
>  	u8 ptp_rx;
> diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> index 49a64bfdd1f6..6fa7a86152d0 100644
> --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> @@ -431,9 +431,18 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
>  	return ret;
>  }
>  
> +static struct ice_xdp_buff *xsk_buff_to_ice_ctx(struct xdp_buff *xdp)
> +{
> +	/* xdp_buff pointer used by ZC code path is alloc as xdp_buff_xsk. The
> +	 * ice_xdp_buff shares its layout with xdp_buff_xsk and private
> +	 * ice_xdp_buff fields fall into xdp_buff_xsk->cb
> +	 */
> +       return (struct ice_xdp_buff *)xdp;
> +}
> +
>  /**
>   * ice_fill_rx_descs - pick buffers from XSK buffer pool and use it
> - * @pool: XSK Buffer pool to pull the buffers from
> + * @rx_ring: rx ring
>   * @xdp: SW ring of xdp_buff that will hold the buffers
>   * @rx_desc: Pointer to Rx descriptors that will be filled
>   * @count: The number of buffers to allocate
> @@ -445,18 +454,21 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
>   *
>   * Returns the amount of allocated Rx descriptors
>   */
> -static u16 ice_fill_rx_descs(struct xsk_buff_pool *pool, struct xdp_buff **xdp,
> +static u16 ice_fill_rx_descs(struct ice_rx_ring *rx_ring, struct xdp_buff **xdp,
>  			     union ice_32b_rx_flex_desc *rx_desc, u16 count)
>  {
> +	struct ice_xdp_buff *ctx;
>  	dma_addr_t dma;
>  	u16 buffs;
>  	int i;
>  
> -	buffs = xsk_buff_alloc_batch(pool, xdp, count);
> +	buffs = xsk_buff_alloc_batch(rx_ring->xsk_pool, xdp, count);
>  	for (i = 0; i < buffs; i++) {
>  		dma = xsk_buff_xdp_get_dma(*xdp);
>  		rx_desc->read.pkt_addr = cpu_to_le64(dma);
>  		rx_desc->wb.status_error0 = 0;
> +		ctx = xsk_buff_to_ice_ctx(*xdp);
> +		ctx->pkt_ctx.eop_desc = rx_desc;
>  
>  		rx_desc++;
>  		xdp++;
> @@ -488,8 +500,7 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
>  	xdp = ice_xdp_buf(rx_ring, ntu);
>  
>  	if (ntu + count >= rx_ring->count) {
> -		nb_buffs_extra = ice_fill_rx_descs(rx_ring->xsk_pool, xdp,
> -						   rx_desc,
> +		nb_buffs_extra = ice_fill_rx_descs(rx_ring, xdp, rx_desc,
>  						   rx_ring->count - ntu);
>  		if (nb_buffs_extra != rx_ring->count - ntu) {
>  			ntu += nb_buffs_extra;
> @@ -502,7 +513,7 @@ static bool __ice_alloc_rx_bufs_zc(struct ice_rx_ring *rx_ring, u16 count)
>  		ice_release_rx_desc(rx_ring, 0);
>  	}
>  
> -	nb_buffs = ice_fill_rx_descs(rx_ring->xsk_pool, xdp, rx_desc, count);
> +	nb_buffs = ice_fill_rx_descs(rx_ring, xdp, rx_desc, count);
>  
>  	ntu += nb_buffs;
>  	if (ntu == rx_ring->count)
> @@ -746,32 +757,6 @@ static int ice_xmit_xdp_tx_zc(struct xdp_buff *xdp,
>  	return ICE_XDP_CONSUMED;
>  }
>  
> -/**
> - * ice_prepare_pkt_ctx_zc - Prepare packet context for XDP hints
> - * @xdp: xdp_buff used as input to the XDP program
> - * @eop_desc: End of packet descriptor
> - * @rx_ring: Rx ring with packet context
> - *
> - * In regular XDP, xdp_buff is placed inside the ring structure,
> - * just before the packet context, so the latter can be accessed
> - * with xdp_buff address only at all times, but in ZC mode,
> - * xdp_buffs come from the pool, so we need to reinitialize
> - * context for every packet.
> - *
> - * We can safely convert xdp_buff_xsk to ice_xdp_buff,
> - * because there are XSK_PRIV_MAX bytes reserved in xdp_buff_xsk
> - * right after xdp_buff, for our private use.
> - * XSK_CHECK_PRIV_TYPE() ensures we do not go above the limit.
> - */
> -static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> -				   union ice_32b_rx_flex_desc *eop_desc,
> -				   struct ice_rx_ring *rx_ring)
> -{
> -	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> -	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> -	ice_xdp_meta_set_desc(xdp, eop_desc);
> -}
> -
>  /**
>   * ice_run_xdp_zc - Executes an XDP program in zero-copy path
>   * @rx_ring: Rx ring
> @@ -784,13 +769,11 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
>   */
>  static int
>  ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> -	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring,
> -	       union ice_32b_rx_flex_desc *rx_desc)
> +	       struct bpf_prog *xdp_prog, struct ice_tx_ring *xdp_ring)
>  {
>  	int err, result = ICE_XDP_PASS;
>  	u32 act;
>  
> -	ice_prepare_pkt_ctx_zc(xdp, rx_desc, rx_ring);
>  	act = bpf_prog_run_xdp(xdp_prog, xdp);
>  
>  	if (likely(act == XDP_REDIRECT)) {
> @@ -930,8 +913,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
>  		if (ice_is_non_eop(rx_ring, rx_desc))
>  			continue;
>  
> -		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring,
> -					 rx_desc);
> +		xdp_res = ice_run_xdp_zc(rx_ring, first, xdp_prog, xdp_ring);
>  		if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
>  			xdp_xmit |= xdp_res;
>  		} else if (xdp_res == ICE_XDP_EXIT) {
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 1f6fc8c7a84c..91fa74a14841 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -14,6 +14,13 @@
>  
>  #ifdef CONFIG_XDP_SOCKETS
>  
> +struct xsk_meta_desc {
> +	u64 val;
> +	u8 off;
> +	u8 bytes;
> +};
> +
> +
>  void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
>  bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc);
>  u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 max);
> @@ -47,6 +54,12 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
>  	xp_set_rxq_info(pool, rxq);
>  }
>  
> +static inline void xsk_pool_set_meta(struct xsk_buff_pool *pool,
> +				     struct xsk_meta_desc *desc)
> +{
> +	xp_set_meta(pool, desc);
> +}
> +
>  static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool)
>  {
>  #ifdef CONFIG_NET_RX_BUSY_POLL
> @@ -250,6 +263,11 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
>  {
>  }
>  
> +static inline void xsk_pool_set_meta(struct xsk_buff_pool *pool,
> +				     struct xsk_meta_desc *desc)
> +{
> +}
> +
>  static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool)
>  {
>  	return 0;
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index b0bdff26fc88..354b1c702a82 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -12,6 +12,7 @@
>  
>  struct xsk_buff_pool;
>  struct xdp_rxq_info;
> +struct xsk_meta_desc;
>  struct xsk_queue;
>  struct xdp_desc;
>  struct xdp_umem;
> @@ -132,6 +133,7 @@ static inline void xp_init_xskb_dma(struct xdp_buff_xsk *xskb, struct xsk_buff_p
>  
>  /* AF_XDP ZC drivers, via xdp_sock_buff.h */
>  void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq);
> +void xp_set_meta(struct xsk_buff_pool *pool, struct xsk_meta_desc *desc);
>  int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
>  	       unsigned long attrs, struct page **pages, u32 nr_pages);
>  void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs);
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 49cb9f9a09be..632fdc247862 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -123,6 +123,18 @@ void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq)
>  }
>  EXPORT_SYMBOL(xp_set_rxq_info);
>  
> +void xp_set_meta(struct xsk_buff_pool *pool, struct xsk_meta_desc *desc)
> +{
> +	u32 i;
> +
> +	for (i = 0; i < pool->heads_cnt; i++) {
> +		struct xdp_buff_xsk *xskb = &pool->heads[i];
> +
> +		memcpy(xskb->cb + desc->off, desc->buf, desc->bytes);
> +	}
> +}
> +EXPORT_SYMBOL(xp_set_meta);
> +
>  static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
>  {
>  	struct netdev_bpf bpf;
> 
> --------------------------------->8---------------------------------
> 
> > And yes, this also looks better for hot-swapping, 
> > because conditions become more straightforward (we do not need to compare old 
> > and new programs).
> > 
> > > > +
> > > >  	old_prog = xchg(&vsi->xdp_prog, prog);
> > > >  	ice_for_each_rxq(vsi, i)
> > > >  		WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
> > > >  
> > > > +	if (ice_xdp_prog_has_meta(old_prog))
> > > > +		static_branch_dec(&ice_xdp_meta_key);
> > > > +
> > > >  	if (old_prog)
> > > >  		bpf_prog_put(old_prog);
> > > >  }
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > > > index 4fd7614f243d..19fc182d1f4c 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > > > @@ -572,7 +572,8 @@ ice_run_xdp(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
> > > >  	if (!xdp_prog)
> > > >  		goto exit;
> > > >  
> > > > -	ice_xdp_meta_set_desc(xdp, eop_desc);
> > > > +	if (static_branch_unlikely(&ice_xdp_meta_key))
> > > 
> > > My only concern is that we might be hurting in a minor way hints path now,
> > > no?
> > 
> > I have thought "unlikely" refers to the default state the code is compiled with 
> > and after static key incrementation this should be patched to "likely". Isn't 
> > this how static keys work?
> 
> I was only referring to that it ends with compiler hint:
> #define unlikely_notrace(x)	__builtin_expect(!!(x), 0)
> 
> see include/linux/jump_label.h
> 
> > 
> > > 
> > > > +		ice_xdp_meta_set_desc(xdp, eop_desc);
> > > >  
> > > >  	act = bpf_prog_run_xdp(xdp_prog, xdp);
> > > >  	switch (act) {
> > > > diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > index 39775bb6cec1..f92d7d33fde6 100644
> > > > --- a/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
> > > > @@ -773,6 +773,9 @@ static void ice_prepare_pkt_ctx_zc(struct xdp_buff *xdp,
> > > >  				   union ice_32b_rx_flex_desc *eop_desc,
> > > >  				   struct ice_rx_ring *rx_ring)
> > > >  {
> > > > +	if (!static_branch_unlikely(&ice_xdp_meta_key))
> > > > +		return;
> > > 
> > > wouldn't it be better to pull it out and avoid calling
> > > ice_prepare_pkt_ctx_zc() unnecessarily?
> > > 
> > > > +
> > > >  	XSK_CHECK_PRIV_TYPE(struct ice_xdp_buff);
> > > >  	((struct ice_xdp_buff *)xdp)->pkt_ctx = rx_ring->pkt_ctx;
> > > >  	ice_xdp_meta_set_desc(xdp, eop_desc);
> > > > -- 
> > > > 2.41.0
> > > > 

^ permalink raw reply

* Re: [PATCH net v2] net: dsa: tag_rtl4_a: Bump min packet size
From: Linus Walleij @ 2023-10-31 14:16 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Andrew Lunn, Florian Fainelli, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <20231030233334.jcd5dnojruo57hfk@skbuf>

On Tue, Oct 31, 2023 at 12:33 AM Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Oct 30, 2023 at 11:57:33PM +0100, Linus Walleij wrote:
> > Here you can incidentally also see what happens if we don't pad the big packets:
> > the packet gets truncated.
>
> Are we sure we are debugging a switch problem? On how many platforms
> with the RTL8366RB can the issue be seen? Is the conduit interface the
> same on all these platforms, or is it different and that makes no
> difference?

I don't have any other RTL8366RB systems than the D-Link DIR-685.

I however have several systems with the same backing ethernet controller
connected directly to a PHY and they all work fine.

Yours,
Linus Walleij

^ permalink raw reply

* Re: [PATCH net v2] net: dsa: tag_rtl4_a: Bump min packet size
From: Linus Walleij @ 2023-10-31 14:06 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Vladimir Oltean, Florian Fainelli, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, netdev, linux-kernel
In-Reply-To: <494a8bb7-7ca1-40bd-b3a7-babeadfd88a0@lunn.ch>

On Tue, Oct 31, 2023 at 1:37 AM Andrew Lunn <andrew@lunn.ch> wrote:
> On Tue, Oct 31, 2023 at 01:09:06AM +0200, Vladimir Oltean wrote:
> > On Mon, Oct 30, 2023 at 11:57:33PM +0100, Linus Walleij wrote:
> > > This of course make no sense, since the padding function should do nothing
> > > when the packet is bigger than 60 bytes.
> >
> > Indeed, this of course makes no sense. Ping doesn't work, or ARP doesn't
> > work? Could you add a static ARP entry for the 192.168.1.137 IP address?
>
> Probably the ARP, since they are short packets and probably need the
> padding.

It seems correct: the reason ping stops working is not that the ping
isn't reaching the host, I can see the pin request on tcpdumps.

The reason is that the host has no idea where to send the reply.
Because it's not getting any ARP replies.

And that is probably because the ARP replies are short and needs
padding to ETH_ZLEN.

I notice the code is probably borrowed from tag_brcm.c which does
exactly the same thing (ETH_ZLEN + tag). It comes with this
explanation:

       /* The Ethernet switch we are interfaced with needs packets to
be at
         * least 64 bytes (including FCS) otherwise they will be
discarded when
         * they enter the switch port logic. When Broadcom tags are
enabled, we
         * need to make sure that packets are at least 68 bytes
         * (including FCS and tag) because the length verification is
done after
         * the Broadcom tag is stripped off the ingress packet.
         *
         * Let dsa_slave_xmit() free the SKB
         */

The switch fabric is dropping smaller packets when CPU tags
(DSA tags) are enabled.

So is the padding to ETH_ZLEN OK or should is happen elsewhere?

Yours,
Linus Walleij

^ permalink raw reply

* [PATCH bpf-next v3 2/2] selftests: bpf: crypto skcipher algo selftests
From: Vadim Fedorenko @ 2023-10-31 13:49 UTC (permalink / raw)
  To: Vadim Fedorenko, Jakub Kicinski, Martin KaFai Lau,
	Andrii Nakryiko, Alexei Starovoitov, Mykola Lysenko
  Cc: Vadim Fedorenko, bpf, netdev, linux-crypto
In-Reply-To: <20231031134900.1432945-1-vadfed@meta.com>

Add simple tc hook selftests to show the way to work with new crypto
BPF API. Some weird structre and map are added to setup program to make
verifier happy about dynptr initialization from memory. Simple AES-ECB
algo is used to demonstrate encryption and decryption of fixed size
buffers.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
v2 -> v3:
- disable tests on s390 and aarch64 because of unknown Fatal exception
  in sg_init_one
v1 -> v2:
- add CONFIG_CRYPTO_AES and CONFIG_CRYPTO_ECB to selftest build config
  suggested by Daniel

 kernel/bpf/crypto.c                           |   5 +-
 tools/testing/selftests/bpf/DENYLIST.aarch64  |   1 +
 tools/testing/selftests/bpf/DENYLIST.s390x    |   1 +
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/crypto_sanity.c  | 129 +++++++++++++++
 .../selftests/bpf/progs/crypto_common.h       | 103 ++++++++++++
 .../selftests/bpf/progs/crypto_sanity.c       | 154 ++++++++++++++++++
 7 files changed, 394 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/crypto_sanity.c
 create mode 100644 tools/testing/selftests/bpf/progs/crypto_common.h
 create mode 100644 tools/testing/selftests/bpf/progs/crypto_sanity.c

diff --git a/kernel/bpf/crypto.c b/kernel/bpf/crypto.c
index c2a0703934fc..4ee6193165ca 100644
--- a/kernel/bpf/crypto.c
+++ b/kernel/bpf/crypto.c
@@ -65,8 +65,9 @@ static void *__bpf_dynptr_data_ptr(const struct bpf_dynptr_kern *ptr)
  *
  * bpf_crypto_skcipher_ctx_create() allocates memory using the BPF memory
  * allocator, and will not block. It may return NULL if no memory is available.
- * @algo: bpf_dynptr which holds string representation of algorithm.
- * @key:  bpf_dynptr which holds cipher key to do crypto.
+ * @palgo: bpf_dynptr which holds string representation of algorithm.
+ * @pkey:  bpf_dynptr which holds cipher key to do crypto.
+ * @err:   integer to store error code when NULL is returned
  */
 __bpf_kfunc struct bpf_crypto_skcipher_ctx *
 bpf_crypto_skcipher_ctx_create(const struct bpf_dynptr_kern *palgo,
diff --git a/tools/testing/selftests/bpf/DENYLIST.aarch64 b/tools/testing/selftests/bpf/DENYLIST.aarch64
index 5c2cc7e8c5d0..72c7ef3e5872 100644
--- a/tools/testing/selftests/bpf/DENYLIST.aarch64
+++ b/tools/testing/selftests/bpf/DENYLIST.aarch64
@@ -1,5 +1,6 @@
 bpf_cookie/multi_kprobe_attach_api               # kprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3
 bpf_cookie/multi_kprobe_link_api                 # kprobe_multi_link_api_subtest:FAIL:fentry_raw_skel_load unexpected error: -3
+crypto_sanity					 # sg_init_one has exception on aarch64
 exceptions					 # JIT does not support calling kfunc bpf_throw: -524
 fexit_sleep                                      # The test never returns. The remaining tests cannot start.
 kprobe_multi_bench_attach                        # needs CONFIG_FPROBE
diff --git a/tools/testing/selftests/bpf/DENYLIST.s390x b/tools/testing/selftests/bpf/DENYLIST.s390x
index 1a63996c0304..8ab7485bedb1 100644
--- a/tools/testing/selftests/bpf/DENYLIST.s390x
+++ b/tools/testing/selftests/bpf/DENYLIST.s390x
@@ -1,5 +1,6 @@
 # TEMPORARY
 # Alphabetical order
+crypto_sanity				 # sg_init_one has exception on s390					       (exceptions)
 exceptions				 # JIT does not support calling kfunc bpf_throw				       (exceptions)
 get_stack_raw_tp                         # user_stack corrupted user stack                                             (no backchain userspace)
 stacktrace_build_id                      # compare_map_keys stackid_hmap vs. stackmap err -2 errno 2                   (?)
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 02dd4409200e..48b570fd1752 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -14,6 +14,9 @@ CONFIG_CGROUP_BPF=y
 CONFIG_CRYPTO_HMAC=y
 CONFIG_CRYPTO_SHA256=y
 CONFIG_CRYPTO_USER_API_HASH=y
+CONFIG_CRYPTO_SKCIPHER=y
+CONFIG_CRYPTO_ECB=y
+CONFIG_CRYPTO_AES=y
 CONFIG_DEBUG_INFO=y
 CONFIG_DEBUG_INFO_BTF=y
 CONFIG_DEBUG_INFO_DWARF4=y
diff --git a/tools/testing/selftests/bpf/prog_tests/crypto_sanity.c b/tools/testing/selftests/bpf/prog_tests/crypto_sanity.c
new file mode 100644
index 000000000000..a43969da6d15
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/crypto_sanity.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <linux/in6.h>
+
+#include "test_progs.h"
+#include "network_helpers.h"
+#include "crypto_sanity.skel.h"
+
+#define NS_TEST "crypto_sanity_ns"
+#define IPV6_IFACE_ADDR "face::1"
+#define UDP_TEST_PORT 7777
+static const char plain_text[] = "stringtoencrypt0";
+static const char crypted_data[] = "\x5B\x59\x39\xEA\xD9\x7A\x2D\xAD\xA7\xE0\x43" \
+				   "\x37\x8A\x77\x17\xB2";
+
+void test_crypto_sanity(void)
+{
+	LIBBPF_OPTS(bpf_tc_hook, qdisc_hook, .attach_point = BPF_TC_EGRESS);
+	LIBBPF_OPTS(bpf_tc_opts, tc_attach_enc);
+	LIBBPF_OPTS(bpf_tc_opts, tc_attach_dec);
+	LIBBPF_OPTS(bpf_test_run_opts, opts,
+		    .data_in = crypted_data,
+		    .data_size_in = sizeof(crypted_data),
+		    .repeat = 1,
+	);
+	struct nstoken *nstoken = NULL;
+	struct crypto_sanity *skel;
+	struct sockaddr_in6 addr;
+	int sockfd, err, pfd;
+	socklen_t addrlen;
+
+	skel = crypto_sanity__open();
+	if (!ASSERT_OK_PTR(skel, "skel open"))
+		return;
+
+	bpf_program__set_autoload(skel->progs.skb_crypto_setup, true);
+
+	SYS(fail, "ip netns add %s", NS_TEST);
+	SYS(fail, "ip -net %s -6 addr add %s/128 dev lo nodad", NS_TEST, IPV6_IFACE_ADDR);
+	SYS(fail, "ip -net %s link set dev lo up", NS_TEST);
+
+	err = crypto_sanity__load(skel);
+	if (!ASSERT_OK(err, "crypto_sanity__load"))
+		goto fail;
+
+	nstoken = open_netns(NS_TEST);
+	if (!ASSERT_OK_PTR(nstoken, "open_netns"))
+		goto fail;
+
+	qdisc_hook.ifindex = if_nametoindex("lo");
+	if (!ASSERT_GT(qdisc_hook.ifindex, 0, "if_nametoindex lo"))
+		goto fail;
+
+	err = crypto_sanity__attach(skel);
+	if (!ASSERT_OK(err, "crypto_sanity__attach"))
+		goto fail;
+
+	pfd = bpf_program__fd(skel->progs.skb_crypto_setup);
+	if (!ASSERT_GT(pfd, 0, "skb_crypto_setup fd"))
+		goto fail;
+
+	err = bpf_prog_test_run_opts(pfd, &opts);
+	if (!ASSERT_OK(err, "skb_crypto_setup") ||
+	    !ASSERT_OK(opts.retval, "skb_crypto_setup retval"))
+		goto fail;
+
+	if (!ASSERT_OK(skel->bss->status, "skb_crypto_setup status"))
+		goto fail;
+
+	err = bpf_tc_hook_create(&qdisc_hook);
+	if (!ASSERT_OK(err, "create qdisc hook"))
+		goto fail;
+
+	addrlen = sizeof(addr);
+	err = make_sockaddr(AF_INET6, IPV6_IFACE_ADDR, UDP_TEST_PORT,
+			    (void *)&addr, &addrlen);
+	if (!ASSERT_OK(err, "make_sockaddr"))
+		goto fail;
+
+	tc_attach_dec.prog_fd = bpf_program__fd(skel->progs.decrypt_sanity);
+	err = bpf_tc_attach(&qdisc_hook, &tc_attach_dec);
+	if (!ASSERT_OK(err, "attach decrypt filter"))
+		goto fail;
+
+	sockfd = socket(AF_INET6, SOCK_DGRAM, 0);
+	if (!ASSERT_NEQ(sockfd, -1, "decrypt socket"))
+		goto fail;
+	err = sendto(sockfd, crypted_data, 16, 0, (void *)&addr, addrlen);
+	close(sockfd);
+	if (!ASSERT_EQ(err, 16, "decrypt send"))
+		goto fail;
+
+	bpf_tc_detach(&qdisc_hook, &tc_attach_dec);
+	if (!ASSERT_OK(skel->bss->status, "decrypt status"))
+		goto fail;
+	if (!ASSERT_STRNEQ(skel->bss->dst, plain_text, sizeof(plain_text), "decrypt"))
+		goto fail;
+
+	tc_attach_enc.prog_fd = bpf_program__fd(skel->progs.encrypt_sanity);
+	err = bpf_tc_attach(&qdisc_hook, &tc_attach_enc);
+	if (!ASSERT_OK(err, "attach encrypt filter"))
+		goto fail;
+
+	sockfd = socket(AF_INET6, SOCK_DGRAM, 0);
+	if (!ASSERT_NEQ(sockfd, -1, "encrypt socket"))
+		goto fail;
+	err = sendto(sockfd, plain_text, 16, 0, (void *)&addr, addrlen);
+	close(sockfd);
+	if (!ASSERT_EQ(err, 16, "encrypt send"))
+		goto fail;
+
+	bpf_tc_detach(&qdisc_hook, &tc_attach_enc);
+	if (!ASSERT_OK(skel->bss->status, "encrypt status"))
+		goto fail;
+	if (!ASSERT_STRNEQ(skel->bss->dst, crypted_data, sizeof(crypted_data), "encrypt"))
+		goto fail;
+
+fail:
+	if (nstoken) {
+		bpf_tc_hook_destroy(&qdisc_hook);
+		close_netns(nstoken);
+	}
+	SYS_NOFAIL("ip netns del " NS_TEST " &> /dev/null");
+	crypto_sanity__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/crypto_common.h b/tools/testing/selftests/bpf/progs/crypto_common.h
new file mode 100644
index 000000000000..584378bb6df8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/crypto_common.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+#ifndef _CRYPTO_COMMON_H
+#define _CRYPTO_COMMON_H
+
+#include "errno.h"
+#include <stdbool.h>
+
+#define private(name) SEC(".bss." #name) __hidden __attribute__((aligned(8)))
+private(CTX) static struct bpf_crypto_skcipher_ctx __kptr * global_crypto_ctx;
+
+struct bpf_crypto_skcipher_ctx *bpf_crypto_skcipher_ctx_create(const struct bpf_dynptr *algo,
+							       const struct bpf_dynptr *key,
+							       int *err) __ksym;
+struct bpf_crypto_skcipher_ctx *bpf_crypto_skcipher_ctx_acquire(struct bpf_crypto_skcipher_ctx *ctx) __ksym;
+void bpf_crypto_skcipher_ctx_release(struct bpf_crypto_skcipher_ctx *ctx) __ksym;
+int bpf_crypto_skcipher_encrypt(struct bpf_crypto_skcipher_ctx *ctx,
+				const struct bpf_dynptr *src, struct bpf_dynptr *dst,
+				const struct bpf_dynptr *iv) __ksym;
+int bpf_crypto_skcipher_decrypt(struct bpf_crypto_skcipher_ctx *ctx,
+				const struct bpf_dynptr *src, struct bpf_dynptr *dst,
+				const struct bpf_dynptr *iv) __ksym;
+
+struct __crypto_skcipher_ctx_value {
+	struct bpf_crypto_skcipher_ctx __kptr * ctx;
+};
+
+struct crypto_conf_value {
+	__u8 algo[32];
+	__u32 algo_size;
+	__u8 key[32];
+	__u32 key_size;
+	__u8 iv[32];
+	__u32 iv_size;
+	__u8 dst[32];
+	__u32 dst_size;
+};
+
+struct array_conf_map {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, int);
+	__type(value, struct crypto_conf_value);
+	__uint(max_entries, 1);
+} __crypto_conf_map SEC(".maps");
+
+struct array_map {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, int);
+	__type(value, struct __crypto_skcipher_ctx_value);
+	__uint(max_entries, 1);
+} __crypto_skcipher_ctx_map SEC(".maps");
+
+static inline struct crypto_conf_value *crypto_conf_lookup(void)
+{
+	struct crypto_conf_value *v, local = {};
+	u32 key = 0;
+
+	v = bpf_map_lookup_elem(&__crypto_conf_map, &key);
+	if (v)
+		return v;
+
+	bpf_map_update_elem(&__crypto_conf_map, &key, &local, 0);
+	return bpf_map_lookup_elem(&__crypto_conf_map, &key);
+}
+
+static inline struct __crypto_skcipher_ctx_value *crypto_skcipher_ctx_value_lookup(void)
+{
+	u32 key = 0;
+
+	return bpf_map_lookup_elem(&__crypto_skcipher_ctx_map, &key);
+}
+
+static inline int crypto_skcipher_ctx_insert(struct bpf_crypto_skcipher_ctx *ctx)
+{
+	struct __crypto_skcipher_ctx_value local, *v;
+	long status;
+	struct bpf_crypto_skcipher_ctx *old;
+	u32 key = 0;
+
+	local.ctx = NULL;
+	status = bpf_map_update_elem(&__crypto_skcipher_ctx_map, &key, &local, 0);
+	if (status) {
+		bpf_crypto_skcipher_ctx_release(ctx);
+		return status;
+	}
+
+	v = bpf_map_lookup_elem(&__crypto_skcipher_ctx_map, &key);
+	if (!v) {
+		bpf_crypto_skcipher_ctx_release(ctx);
+		return -ENOENT;
+	}
+
+	old = bpf_kptr_xchg(&v->ctx, ctx);
+	if (old) {
+		bpf_crypto_skcipher_ctx_release(old);
+		return -EEXIST;
+	}
+
+	return 0;
+}
+
+#endif /* _CRYPTO_COMMON_H */
diff --git a/tools/testing/selftests/bpf/progs/crypto_sanity.c b/tools/testing/selftests/bpf/progs/crypto_sanity.c
new file mode 100644
index 000000000000..71a172d8d2a2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/crypto_sanity.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include "bpf_tracing_net.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "bpf_kfuncs.h"
+#include "crypto_common.h"
+
+#define UDP_TEST_PORT 7777
+unsigned char crypto_key[16] = "testtest12345678";
+char crypto_algo[9] = "ecb(aes)";
+char dst[32] = {};
+int status;
+
+static inline int skb_validate_test(const struct __sk_buff *skb)
+{
+	struct ipv6hdr ip6h;
+	struct udphdr udph;
+	u32 offset;
+
+	if (skb->protocol != __bpf_constant_htons(ETH_P_IPV6))
+		return -1;
+
+	if (bpf_skb_load_bytes(skb, ETH_HLEN, &ip6h, sizeof(ip6h)))
+		return -1;
+
+	if (ip6h.nexthdr != IPPROTO_UDP)
+		return -1;
+
+	if (bpf_skb_load_bytes(skb, ETH_HLEN + sizeof(ip6h), &udph, sizeof(udph)))
+		return -1;
+
+	if (udph.dest != __bpf_constant_htons(UDP_TEST_PORT))
+		return -1;
+
+	offset = ETH_HLEN + sizeof(ip6h) + sizeof(udph);
+	if (skb->len < offset + 16)
+		return -1;
+
+	return offset;
+}
+
+SEC("?fentry.s/bpf_fentry_test1")
+int BPF_PROG(skb_crypto_setup)
+{
+	struct crypto_conf_value *c;
+	struct bpf_dynptr algo, key;
+	int err = 0;
+
+	status = 0;
+
+	c = crypto_conf_lookup();
+	if (!c) {
+		status = -EINVAL;
+		return 0;
+	}
+
+	bpf_dynptr_from_mem(crypto_algo, sizeof(crypto_algo), 0, &algo);
+	bpf_dynptr_from_mem(crypto_key, sizeof(crypto_key), 0, &key);
+	struct bpf_crypto_skcipher_ctx *cctx = bpf_crypto_skcipher_ctx_create(&algo, &key, &err);
+
+	if (!cctx) {
+		status = err;
+		return 0;
+	}
+
+	err = crypto_skcipher_ctx_insert(cctx);
+	if (err && err != -EEXIST)
+		status = err;
+
+	return 0;
+}
+
+SEC("tc")
+int decrypt_sanity(struct __sk_buff *skb)
+{
+	struct __crypto_skcipher_ctx_value *v;
+	struct bpf_crypto_skcipher_ctx *ctx;
+	struct bpf_dynptr psrc, pdst, iv;
+	int err;
+
+	err = skb_validate_test(skb);
+	if (err < 0) {
+		status = err;
+		return TC_ACT_SHOT;
+	}
+
+	v = crypto_skcipher_ctx_value_lookup();
+	if (!v) {
+		status = -ENOENT;
+		return TC_ACT_SHOT;
+	}
+
+	ctx = v->ctx;
+	if (!ctx) {
+		status = -ENOENT;
+		return TC_ACT_SHOT;
+	}
+
+	bpf_dynptr_from_skb(skb, 0, &psrc);
+	bpf_dynptr_adjust(&psrc, err, err + 16);
+	bpf_dynptr_from_mem(dst, sizeof(dst), 0, &pdst);
+	bpf_dynptr_from_mem(dst, 0, 0, &iv);
+
+	bpf_crypto_skcipher_decrypt(ctx, &psrc, &pdst, &iv);
+
+	status = 0;
+
+	return TC_ACT_SHOT;
+}
+
+SEC("tc")
+int encrypt_sanity(struct __sk_buff *skb)
+{
+	struct __crypto_skcipher_ctx_value *v;
+	struct bpf_crypto_skcipher_ctx *ctx;
+	struct bpf_dynptr psrc, pdst, iv;
+	int err;
+
+	status = 0;
+
+	err = skb_validate_test(skb);
+	if (err < 0) {
+		status = err;
+		return TC_ACT_SHOT;
+	}
+
+	v = crypto_skcipher_ctx_value_lookup();
+	if (!v) {
+		status = -ENOENT;
+		return TC_ACT_SHOT;
+	}
+
+	ctx = v->ctx;
+	if (!ctx) {
+		status = -ENOENT;
+		return TC_ACT_SHOT;
+	}
+
+	bpf_dynptr_from_skb(skb, 0, &psrc);
+	bpf_dynptr_adjust(&psrc, err, err + 16);
+	bpf_dynptr_from_mem(dst, sizeof(dst), 0, &pdst);
+	bpf_dynptr_from_mem(dst, 0, 0, &iv);
+
+	bpf_crypto_skcipher_encrypt(ctx, &psrc, &pdst, &iv);
+
+	return TC_ACT_SHOT;
+}
+
+char __license[] SEC("license") = "GPL";
-- 
2.39.3


^ permalink raw reply related

* [PATCH bpf-next v3 1/2] bpf: add skcipher API support to TC/XDP programs
From: Vadim Fedorenko @ 2023-10-31 13:48 UTC (permalink / raw)
  To: Vadim Fedorenko, Jakub Kicinski, Martin KaFai Lau,
	Andrii Nakryiko, Alexei Starovoitov, Mykola Lysenko
  Cc: Vadim Fedorenko, bpf, netdev, linux-crypto

Add crypto API support to BPF to be able to decrypt or encrypt packets
in TC/XDP BPF programs. Only symmetric key ciphers are supported for
now. Special care should be taken for initialization part of crypto algo
because crypto_alloc_sync_skcipher() doesn't work with preemtion
disabled, it can be run only in sleepable BPF program. Also async crypto
is not supported because of the very same issue - TC/XDP BPF programs
are not sleepable.

Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
---
v2 -> v3:
- fix kdoc issues
v1 -> v2:
- use kmalloc in sleepable func, suggested by Alexei
- use __bpf_dynptr_is_rdonly() to check destination, suggested by Jakub
- use __bpf_dynptr_data_ptr() for all dynptr accesses

 include/linux/bpf.h   |   2 +
 kernel/bpf/Makefile   |   3 +
 kernel/bpf/crypto.c   | 280 ++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/helpers.c  |   4 +-
 kernel/bpf/verifier.c |   1 +
 5 files changed, 288 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/crypto.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d3c51a507508..5ad1e83394b3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1222,6 +1222,8 @@ enum bpf_dynptr_type {
 
 int bpf_dynptr_check_size(u32 size);
 u32 __bpf_dynptr_size(const struct bpf_dynptr_kern *ptr);
+enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr);
+bool __bpf_dynptr_is_rdonly(const struct bpf_dynptr_kern *ptr);
 
 #ifdef CONFIG_BPF_JIT
 int bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_trampoline *tr);
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index f526b7573e97..e14b5834c477 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -41,6 +41,9 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumask.o
 obj-${CONFIG_BPF_LSM} += bpf_lsm.o
 endif
+ifeq ($(CONFIG_CRYPTO_SKCIPHER),y)
+obj-$(CONFIG_BPF_SYSCALL) += crypto.o
+endif
 obj-$(CONFIG_BPF_PRELOAD) += preload/
 
 obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
diff --git a/kernel/bpf/crypto.c b/kernel/bpf/crypto.c
new file mode 100644
index 000000000000..c2a0703934fc
--- /dev/null
+++ b/kernel/bpf/crypto.c
@@ -0,0 +1,280 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2023 Meta, Inc */
+#include <linux/bpf.h>
+#include <linux/bpf_mem_alloc.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#include <linux/filter.h>
+#include <linux/scatterlist.h>
+#include <linux/skbuff.h>
+#include <crypto/skcipher.h>
+
+/**
+ * struct bpf_crypto_skcipher_ctx - refcounted BPF sync skcipher context structure
+ * @tfm:	The pointer to crypto_sync_skcipher struct.
+ * @rcu:	The RCU head used to free the crypto context with RCU safety.
+ * @usage:	Object reference counter. When the refcount goes to 0, the
+ *		memory is released back to the BPF allocator, which provides
+ *		RCU safety.
+ */
+struct bpf_crypto_skcipher_ctx {
+	struct crypto_sync_skcipher *tfm;
+	struct rcu_head rcu;
+	refcount_t usage;
+};
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global kfuncs as their definitions will be in BTF");
+
+static void *__bpf_dynptr_data_ptr(const struct bpf_dynptr_kern *ptr)
+{
+	enum bpf_dynptr_type type;
+
+	if (!ptr->data)
+		return NULL;
+
+	type = bpf_dynptr_get_type(ptr);
+
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		return ptr->data + ptr->offset;
+	case BPF_DYNPTR_TYPE_SKB:
+		return skb_pointer_if_linear(ptr->data, ptr->offset, __bpf_dynptr_size(ptr));
+	case BPF_DYNPTR_TYPE_XDP:
+	{
+		void *xdp_ptr = bpf_xdp_pointer(ptr->data, ptr->offset, __bpf_dynptr_size(ptr));
+		if (!IS_ERR_OR_NULL(xdp_ptr))
+			return xdp_ptr;
+
+		return NULL;
+	}
+	default:
+		WARN_ONCE(true, "unknown dynptr type %d\n", type);
+		return NULL;
+	}
+}
+
+/**
+ * bpf_crypto_skcipher_ctx_create() - Create a mutable BPF crypto context.
+ *
+ * Allocates a crypto context that can be used, acquired, and released by
+ * a BPF program. The crypto context returned by this function must either
+ * be embedded in a map as a kptr, or freed with bpf_crypto_skcipher_ctx_release().
+ *
+ * bpf_crypto_skcipher_ctx_create() allocates memory using the BPF memory
+ * allocator, and will not block. It may return NULL if no memory is available.
+ * @algo: bpf_dynptr which holds string representation of algorithm.
+ * @key:  bpf_dynptr which holds cipher key to do crypto.
+ */
+__bpf_kfunc struct bpf_crypto_skcipher_ctx *
+bpf_crypto_skcipher_ctx_create(const struct bpf_dynptr_kern *palgo,
+			       const struct bpf_dynptr_kern *pkey, int *err)
+{
+	struct bpf_crypto_skcipher_ctx *ctx;
+	char *algo;
+
+	if (__bpf_dynptr_size(palgo) > CRYPTO_MAX_ALG_NAME) {
+		*err = -EINVAL;
+		return NULL;
+	}
+
+	algo = __bpf_dynptr_data_ptr(palgo);
+
+	if (!crypto_has_skcipher(algo, CRYPTO_ALG_TYPE_SKCIPHER, CRYPTO_ALG_TYPE_MASK)) {
+		*err = -EOPNOTSUPP;
+		return NULL;
+	}
+
+	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx) {
+		*err = -ENOMEM;
+		return NULL;
+	}
+
+	memset(ctx, 0, sizeof(*ctx));
+
+	ctx->tfm = crypto_alloc_sync_skcipher(algo, 0, 0);
+	if (IS_ERR(ctx->tfm)) {
+		*err = PTR_ERR(ctx->tfm);
+		ctx->tfm = NULL;
+		goto err;
+	}
+
+	*err = crypto_sync_skcipher_setkey(ctx->tfm, __bpf_dynptr_data_ptr(pkey),
+					   __bpf_dynptr_size(pkey));
+	if (*err)
+		goto err;
+
+	refcount_set(&ctx->usage, 1);
+
+	return ctx;
+err:
+	if (ctx->tfm)
+		crypto_free_sync_skcipher(ctx->tfm);
+	kfree(ctx);
+
+	return NULL;
+}
+
+static void crypto_free_sync_skcipher_cb(struct rcu_head *head)
+{
+	struct bpf_crypto_skcipher_ctx *ctx;
+
+	ctx = container_of(head, struct bpf_crypto_skcipher_ctx, rcu);
+	crypto_free_sync_skcipher(ctx->tfm);
+	kfree(ctx);
+}
+
+/**
+ * bpf_crypto_skcipher_ctx_acquire() - Acquire a reference to a BPF crypto context.
+ * @ctx: The BPF crypto context being acquired. The ctx must be a trusted
+ *	     pointer.
+ *
+ * Acquires a reference to a BPF crypto context. The context returned by this function
+ * must either be embedded in a map as a kptr, or freed with
+ * bpf_crypto_skcipher_ctx_release().
+ */
+__bpf_kfunc struct bpf_crypto_skcipher_ctx *
+bpf_crypto_skcipher_ctx_acquire(struct bpf_crypto_skcipher_ctx *ctx)
+{
+	refcount_inc(&ctx->usage);
+	return ctx;
+}
+
+/**
+ * bpf_crypto_skcipher_ctx_release() - Release a previously acquired BPF crypto context.
+ * @ctx: The crypto context being released.
+ *
+ * Releases a previously acquired reference to a BPF cpumask. When the final
+ * reference of the BPF cpumask has been released, it is subsequently freed in
+ * an RCU callback in the BPF memory allocator.
+ */
+__bpf_kfunc void bpf_crypto_skcipher_ctx_release(struct bpf_crypto_skcipher_ctx *ctx)
+{
+	if (refcount_dec_and_test(&ctx->usage))
+		call_rcu(&ctx->rcu, crypto_free_sync_skcipher_cb);
+}
+
+static int bpf_crypto_skcipher_crypt(struct crypto_sync_skcipher *tfm,
+				     const struct bpf_dynptr_kern *src,
+				     struct bpf_dynptr_kern *dst,
+				     const struct bpf_dynptr_kern *iv,
+				     bool decrypt)
+{
+	struct skcipher_request *req = NULL;
+	struct scatterlist sgin, sgout;
+	int err;
+
+	if (crypto_sync_skcipher_get_flags(tfm) & CRYPTO_TFM_NEED_KEY)
+		return -EINVAL;
+
+	if (__bpf_dynptr_is_rdonly(dst))
+		return -EINVAL;
+
+	if (!__bpf_dynptr_size(dst) || !__bpf_dynptr_size(src))
+		return -EINVAL;
+
+	if (__bpf_dynptr_size(iv) != crypto_sync_skcipher_ivsize(tfm))
+		return -EINVAL;
+
+	req = skcipher_request_alloc(&tfm->base, GFP_ATOMIC);
+	if (!req)
+		return -ENOMEM;
+
+	sg_init_one(&sgin, __bpf_dynptr_data_ptr(src), __bpf_dynptr_size(src));
+	sg_init_one(&sgout, __bpf_dynptr_data_ptr(dst), __bpf_dynptr_size(dst));
+
+	skcipher_request_set_crypt(req, &sgin, &sgout, __bpf_dynptr_size(src),
+				   __bpf_dynptr_data_ptr(iv));
+
+	err = decrypt ? crypto_skcipher_decrypt(req) : crypto_skcipher_encrypt(req);
+
+	skcipher_request_free(req);
+
+	return err;
+}
+
+/**
+ * bpf_crypto_skcipher_decrypt() - Decrypt buffer using configured context and IV provided.
+ * @ctx:	The crypto context being used. The ctx must be a trusted pointer.
+ * @src:	bpf_dynptr to the encrypted data. Must be a trusted pointer.
+ * @dst:	bpf_dynptr to the buffer where to store the result. Must be a trusted pointer.
+ * @iv:		bpf_dynptr to IV data to be used by decryptor.
+ *
+ * Decrypts provided buffer using IV data and the crypto context. Crypto context must be configured.
+ */
+__bpf_kfunc int bpf_crypto_skcipher_decrypt(struct bpf_crypto_skcipher_ctx *ctx,
+					    const struct bpf_dynptr_kern *src,
+					    struct bpf_dynptr_kern *dst,
+					    const struct bpf_dynptr_kern *iv)
+{
+	return bpf_crypto_skcipher_crypt(ctx->tfm, src, dst, iv, true);
+}
+
+/**
+ * bpf_crypto_skcipher_encrypt() - Encrypt buffer using configured context and IV provided.
+ * @ctx:	The crypto context being used. The ctx must be a trusted pointer.
+ * @src:	bpf_dynptr to the plain data. Must be a trusted pointer.
+ * @dst:	bpf_dynptr to buffer where to store the result. Must be a trusted pointer.
+ * @iv:		bpf_dynptr to IV data to be used by decryptor.
+ *
+ * Encrypts provided buffer using IV data and the crypto context. Crypto context must be configured.
+ */
+__bpf_kfunc int bpf_crypto_skcipher_encrypt(struct bpf_crypto_skcipher_ctx *ctx,
+					    const struct bpf_dynptr_kern *src,
+					    struct bpf_dynptr_kern *dst,
+					    const struct bpf_dynptr_kern *iv)
+{
+	return bpf_crypto_skcipher_crypt(ctx->tfm, src, dst, iv, false);
+}
+
+__diag_pop();
+
+BTF_SET8_START(crypt_skcipher_init_kfunc_btf_ids)
+BTF_ID_FLAGS(func, bpf_crypto_skcipher_ctx_create, KF_ACQUIRE | KF_RET_NULL | KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_crypto_skcipher_ctx_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_crypto_skcipher_ctx_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS)
+BTF_SET8_END(crypt_skcipher_init_kfunc_btf_ids)
+
+static const struct btf_kfunc_id_set crypt_skcipher_init_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &crypt_skcipher_init_kfunc_btf_ids,
+};
+
+BTF_SET8_START(crypt_skcipher_kfunc_btf_ids)
+BTF_ID_FLAGS(func, bpf_crypto_skcipher_decrypt, KF_RCU)
+BTF_ID_FLAGS(func, bpf_crypto_skcipher_encrypt, KF_RCU)
+BTF_SET8_END(crypt_skcipher_kfunc_btf_ids)
+
+static const struct btf_kfunc_id_set crypt_skcipher_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &crypt_skcipher_kfunc_btf_ids,
+};
+
+BTF_ID_LIST(crypto_skcipher_dtor_ids)
+BTF_ID(struct, bpf_crypto_skcipher_ctx)
+BTF_ID(func, bpf_crypto_skcipher_ctx_release)
+
+static int __init crypto_skcipher_kfunc_init(void)
+{
+	int ret;
+	const struct btf_id_dtor_kfunc crypto_skcipher_dtors[] = {
+		{
+			.btf_id	      = crypto_skcipher_dtor_ids[0],
+			.kfunc_btf_id = crypto_skcipher_dtor_ids[1]
+		},
+	};
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &crypt_skcipher_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &crypt_skcipher_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &crypt_skcipher_kfunc_set);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
+					       &crypt_skcipher_init_kfunc_set);
+	return  ret ?: register_btf_id_dtor_kfuncs(crypto_skcipher_dtors,
+						   ARRAY_SIZE(crypto_skcipher_dtors),
+						   THIS_MODULE);
+}
+
+late_initcall(crypto_skcipher_kfunc_init);
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 62a53ebfedf9..2020884fca3d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1429,7 +1429,7 @@ static const struct bpf_func_proto bpf_kptr_xchg_proto = {
 #define DYNPTR_SIZE_MASK	0xFFFFFF
 #define DYNPTR_RDONLY_BIT	BIT(31)
 
-static bool __bpf_dynptr_is_rdonly(const struct bpf_dynptr_kern *ptr)
+bool __bpf_dynptr_is_rdonly(const struct bpf_dynptr_kern *ptr)
 {
 	return ptr->size & DYNPTR_RDONLY_BIT;
 }
@@ -1444,7 +1444,7 @@ static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_typ
 	ptr->size |= type << DYNPTR_TYPE_SHIFT;
 }
 
-static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
+enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
 {
 	return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
 }
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bb58987e4844..75d2f47ca3cb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5184,6 +5184,7 @@ BTF_ID(struct, prog_test_ref_kfunc)
 BTF_ID(struct, cgroup)
 BTF_ID(struct, bpf_cpumask)
 BTF_ID(struct, task_struct)
+BTF_ID(struct, bpf_crypto_skcipher_ctx)
 BTF_SET_END(rcu_protected_types)
 
 static bool rcu_protected_object(const struct btf *btf, u32 btf_id)
-- 
2.39.3


^ permalink raw reply related

* Re: [PATCH v3 1/1] r8169: Coalesce RTL8411b PHY power-down recovery programming instructions to reduce spinlock stalls
From: Mirsad Todorovac @ 2023-10-31 13:27 UTC (permalink / raw)
  To: Andrew Lunn, Mirsad Todorovac
  Cc: Heiner Kallweit, netdev, linux-kernel, nic_swsd, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Marco Elver
In-Reply-To: <7fa02dd1-c894-4980-8439-4dc1e22d3634@lunn.ch>



On 10/31/2023 1:38 PM, Andrew Lunn wrote:
> On Tue, Oct 31, 2023 at 04:35:19AM +0100, Mirsad Todorovac wrote:
>> On 10/31/23 02:21, Andrew Lunn wrote:
>>>> I will not contradict, but the cummulative amount of memory barriers on each MMIO read/write
>>>> in each single one of the drivers could amount to some degrading of overall performance and
>>>> latency in a multicore system.
>>>
>>> For optimisations, we like to see benchmark results which show some
>>> improvements. Do you have any numbers?
>>
>> Hi, Andrew,
>>
>> Thank you for your interest in RTL NIC driver optimisations.
>>
>> My knowledge about the timing costs of synchronisation is mostly theoretical.
> 
> The kernel tends to be very practical. Maybe try to turn the
> theoretical knowledge into practice. Write a benchmark test, or see if
> any of the existing RT Linux tests show there is a real problem here,
> and your changes fix it.

I stand corrected. Real benchmarks would indeed say more than the visual
inspection of the code.

As I see, as a maintainer of the PHYLIB you certainly have a greater
insight. What I've done is maybe too aggressive "loophole optimisation",
without much consideration of what Mr. Heiner Kallweit addressed as hot 
paths and not so hot (less trodden) paths.

Best regards,
Mirsad Todorovac

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox