Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH 5/9] x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
From: Peter Zijlstra @ 2023-06-08 13:21 UTC (permalink / raw)
  To: Tianyu Lan
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, daniel.lezcano, arnd, michael.h.kelley, Tianyu Lan,
	linux-arch, linux-hyperv, linux-kernel, vkuznets
In-Reply-To: <20230601151624.1757616-6-ltykernel@gmail.com>

On Thu, Jun 01, 2023 at 11:16:18AM -0400, Tianyu Lan wrote:
> From: Tianyu Lan <tiala@microsoft.com>
> 
> In sev-snp enlightened guest, Hyper-V hypercall needs
> to use vmmcall to trigger vmexit and notify hypervisor
> to handle hypercall request.
> 
> There is no x86 SEV SNP feature flag support so far and
> hardware provides MSR_AMD64_SEV register to check SEV-SNP
> capability with MSR_AMD64_SEV_ENABLED bit. ALTERNATIVE can't
> work without SEV-SNP x86 feature flag. May add later when
> the associated flag is introduced. 
> 
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/include/asm/mshyperv.h | 44 ++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 31c476f4e656..d859d7c5f5e8 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -61,16 +61,25 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
>  	u64 hv_status;
>  
>  #ifdef CONFIG_X86_64
> -	if (!hv_hypercall_pg)
> -		return U64_MAX;
> +	if (hv_isolation_type_en_snp()) {
> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     "vmmcall"
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input_address)
> +				     :  "r" (output_address)
> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
> +	} else {
> +		if (!hv_hypercall_pg)
> +			return U64_MAX;
>  
> -	__asm__ __volatile__("mov %4, %%r8\n"
> -			     CALL_NOSPEC
> -			     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> -			       "+c" (control), "+d" (input_address)
> -			     :  "r" (output_address),
> -				THUNK_TARGET(hv_hypercall_pg)
> -			     : "cc", "memory", "r8", "r9", "r10", "r11");
> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     CALL_NOSPEC
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input_address)
> +				     :  "r" (output_address),
> +					THUNK_TARGET(hv_hypercall_pg)
> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
> +	}
>  #else

Remains unanswered:

https://lkml.kernel.org/r/20230516102912.GG2587705%40hirez.programming.kicks-ass.net

Would this not generate better code with an alternative?

^ permalink raw reply

* Re: [PATCH 1/9] x86/hyperv: Add sev-snp enlightened guest static key
From: Tianyu Lan @ 2023-06-08 13:17 UTC (permalink / raw)
  To: Michael Kelley (LINUX), KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, daniel.lezcano@linaro.org,
	arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	vkuznets@redhat.com
In-Reply-To: <BYAPR21MB168808E653CDCE3585EB8858D750A@BYAPR21MB1688.namprd21.prod.outlook.com>

On 6/8/2023 8:56 PM, Michael Kelley (LINUX) wrote:
>> @ -473,7 +477,7 @@ static void __init ms_hyperv_init_platform(void)
>>
>>   #if IS_ENABLED(CONFIG_HYPERV)
>>   	if ((hv_get_isolation_type() == HV_ISOLATION_TYPE_VBS) ||
>> -	    (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP))
>> +	    ms_hyperv.paravisor_present)
> This test needs to be:
> 
>    	if ((hv_get_isolation_type() == HV_ISOLATION_TYPE_VBS) ||
> 	    ((hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) &&
> 	    ms_hyperv.paravisor_present)
> 
> We want to call hv_vtom_init() only when running with VBS, or
> with SEV-SNP*and*  we have a paravisor present.  Testing only for
> paravisor_present risks confusion with future TDX scenarios.

Yes, current paravisor is only available for VBS and SEV-SNP vTOM cases.
TDX may also have paravisor support. Will update.

Thanks.

^ permalink raw reply

* RE: [PATCH 2/9] x86/hyperv: Set Virtual Trust Level in VMBus init message
From: Michael Kelley (LINUX) @ 2023-06-08 13:06 UTC (permalink / raw)
  To: Tianyu Lan, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	daniel.lezcano@linaro.org, arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	vkuznets@redhat.com
In-Reply-To: <20230601151624.1757616-3-ltykernel@gmail.com>

From: Tianyu Lan <ltykernel@gmail.com> Sent: Thursday, June 1, 2023 8:16 AM
> 
> SEV-SNP guest provides vtl(Virtual Trust Level) and
> get it from Hyper-V hvcall via register hvcall HVCALL_
> GET_VP_REGISTERS.
> 
> During initialization of VMBus, vtl needs to be set in the
> VMBus init message.

Let's clean up this commit message a bit.  I would suggest:

SEV-SNP guests on Hyper-V can run at multiple Virtual Trust
Levels (VTL).  During boot, get the VTL at which we're running
using the GET_VP_REGISTERs hypercall, and save the value
for future use.  Then during VMBus initialization, set the VTL
with the saved value as required in the VMBus init message.

Michael

> 
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/hyperv/hv_init.c          | 36 ++++++++++++++++++++++++++++++
>  arch/x86/include/asm/hyperv-tlfs.h |  7 ++++++
>  drivers/hv/connection.c            |  1 +
>  include/asm-generic/mshyperv.h     |  1 +
>  include/linux/hyperv.h             |  4 ++--
>  5 files changed, 47 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index a5f9474f08e1..b4a2327c823b 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -378,6 +378,40 @@ static void __init hv_get_partition_id(void)
>  	local_irq_restore(flags);
>  }
> 
> +static u8 __init get_vtl(void)
> +{
> +	u64 control = HV_HYPERCALL_REP_COMP_1 | HVCALL_GET_VP_REGISTERS;
> +	struct hv_get_vp_registers_input *input;
> +	struct hv_get_vp_registers_output *output;
> +	u64 vtl = 0;
> +	u64 ret;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = (struct hv_get_vp_registers_output *)input;
> +	if (!input) {
> +		local_irq_restore(flags);
> +		goto done;
> +	}
> +
> +	memset(input, 0, struct_size(input, element, 1));
> +	input->header.partitionid = HV_PARTITION_ID_SELF;
> +	input->header.vpindex = HV_VP_INDEX_SELF;
> +	input->header.inputvtl = 0;
> +	input->element[0].name0 = HV_X64_REGISTER_VSM_VP_STATUS;
> +
> +	ret = hv_do_hypercall(control, input, output);
> +	if (hv_result_success(ret))
> +		vtl = output->as64.low & HV_X64_VTL_MASK;
> +	else
> +		pr_err("Hyper-V: failed to get VTL! %lld", ret);
> +	local_irq_restore(flags);
> +
> +done:
> +	return vtl;
> +}
> +
>  /*
>   * This function is to be invoked early in the boot sequence after the
>   * hypervisor has been detected.
> @@ -506,6 +540,8 @@ void __init hyperv_init(void)
>  	/* Query the VMs extended capability once, so that it can be cached. */
>  	hv_query_ext_cap(0);
> 
> +	/* Find the VTL */
> +	ms_hyperv.vtl = get_vtl();
>  	return;
> 
>  clean_guest_os_id:
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index cea95dcd27c2..4bf0b315b0ce 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -301,6 +301,13 @@ enum hv_isolation_type {
>  #define HV_X64_MSR_TIME_REF_COUNT	HV_REGISTER_TIME_REF_COUNT
>  #define HV_X64_MSR_REFERENCE_TSC	HV_REGISTER_REFERENCE_TSC
> 
> +/*
> + * Registers are only accessible via HVCALL_GET_VP_REGISTERS hvcall and
> + * there is not associated MSR address.
> + */
> +#define	HV_X64_REGISTER_VSM_VP_STATUS	0x000D0003
> +#define	HV_X64_VTL_MASK			GENMASK(3, 0)
> +
>  /* Hyper-V memory host visibility */
>  enum hv_mem_host_visibility {
>  	VMBUS_PAGE_NOT_VISIBLE		= 0,
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index 5978e9dbc286..02b54f85dc60 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -98,6 +98,7 @@ int vmbus_negotiate_version(struct vmbus_channel_msginfo
> *msginfo, u32 version)
>  	 */
>  	if (version >= VERSION_WIN10_V5) {
>  		msg->msg_sint = VMBUS_MESSAGE_SINT;
> +		msg->msg_vtl = ms_hyperv.vtl;
>  		vmbus_connection.msg_conn_id = VMBUS_MESSAGE_CONNECTION_ID_4;
>  	} else {
>  		msg->interrupt_page = virt_to_phys(vmbus_connection.int_page);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index d444f831d633..c7a90f91c0d3 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -54,6 +54,7 @@ struct ms_hyperv_info {
>  		};
>  	};
>  	u64 shared_gpa_boundary;
> +	u8 vtl;
>  };
>  extern struct ms_hyperv_info ms_hyperv;
>  extern bool hv_nested;
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index bfbc37ce223b..1f2bfec4abde 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -665,8 +665,8 @@ struct vmbus_channel_initiate_contact {
>  		u64 interrupt_page;
>  		struct {
>  			u8	msg_sint;
> -			u8	padding1[3];
> -			u32	padding2;
> +			u8	msg_vtl;
> +			u8	reserved[6];
>  		};
>  	};
>  	u64 monitor_page1;
> --
> 2.25.1


^ permalink raw reply

* RE: [PATCH 1/9] x86/hyperv: Add sev-snp enlightened guest static key
From: Michael Kelley (LINUX) @ 2023-06-08 12:56 UTC (permalink / raw)
  To: Tianyu Lan, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	daniel.lezcano@linaro.org, arnd@arndb.de
  Cc: Tianyu Lan, linux-arch@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	vkuznets@redhat.com
In-Reply-To: <20230601151624.1757616-2-ltykernel@gmail.com>

From: Tianyu Lan <ltykernel@gmail.com> Sent: Thursday, June 1, 2023 8:16 AM
> 
> Introduce static key isolation_type_en_snp for enlightened
> sev-snp guest check.
> 
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/hyperv/ivm.c           | 11 +++++++++++
>  arch/x86/include/asm/mshyperv.h |  3 +++
>  arch/x86/kernel/cpu/mshyperv.c  |  8 ++++++--
>  drivers/hv/hv_common.c          |  6 ++++++
>  include/asm-generic/mshyperv.h  | 12 +++++++++---
>  5 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index cc92388b7a99..5d3ee3124e00 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -409,3 +409,14 @@ bool hv_isolation_type_snp(void)
>  {
>  	return static_branch_unlikely(&isolation_type_snp);
>  }
> +
> +DEFINE_STATIC_KEY_FALSE(isolation_type_en_snp);
> +/*
> + * hv_isolation_type_en_snp - Check system runs in the AMD SEV-SNP based
> + * isolation enlightened VM.
> + */
> +bool hv_isolation_type_en_snp(void)
> +{
> +	return static_branch_unlikely(&isolation_type_en_snp);
> +}
> +
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 49bb4f2bd300..31c476f4e656 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -26,6 +26,7 @@
>  union hv_ghcb;
> 
>  DECLARE_STATIC_KEY_FALSE(isolation_type_snp);
> +DECLARE_STATIC_KEY_FALSE(isolation_type_en_snp);
> 
>  typedef int (*hyperv_fill_flush_list_func)(
>  		struct hv_guest_mapping_flush_list *flush,
> @@ -45,6 +46,8 @@ extern void *hv_hypercall_pg;
> 
>  extern u64 hv_current_partition_id;
> 
> +extern bool hv_isolation_type_en_snp(void);
> +
>  extern union hv_ghcb * __percpu *hv_ghcb_pg;
> 
>  int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index c7969e806c64..9186453251f7 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -402,8 +402,12 @@ static void __init ms_hyperv_init_platform(void)
>  		pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",
>  			ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);
> 
> -		if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
> +
> +		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
> +			static_branch_enable(&isolation_type_en_snp);
> +		} else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) {
>  			static_branch_enable(&isolation_type_snp);
> +		}
>  	}
> 
>  	if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
> @@ -473,7 +477,7 @@ static void __init ms_hyperv_init_platform(void)
> 
>  #if IS_ENABLED(CONFIG_HYPERV)
>  	if ((hv_get_isolation_type() == HV_ISOLATION_TYPE_VBS) ||
> -	    (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP))
> +	    ms_hyperv.paravisor_present)

This test needs to be:

  	if ((hv_get_isolation_type() == HV_ISOLATION_TYPE_VBS) ||
	    ((hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) &&
	    ms_hyperv.paravisor_present)

We want to call hv_vtom_init() only when running with VBS, or
with SEV-SNP *and* we have a paravisor present.  Testing only for
paravisor_present risks confusion with future TDX scenarios.

>  		hv_vtom_init();
>  	/*
>  	 * Setup the hook to get control post apic initialization.
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 64f9ceca887b..179bc5f5bf52 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -502,6 +502,12 @@ bool __weak hv_isolation_type_snp(void)
>  }
>  EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
> 
> +bool __weak hv_isolation_type_en_snp(void)
> +{
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(hv_isolation_type_en_snp);
> +
>  void __weak hv_setup_vmbus_handler(void (*handler)(void))
>  {
>  }
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 402a8c1c202d..d444f831d633 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -36,15 +36,21 @@ struct ms_hyperv_info {
>  	u32 nested_features;
>  	u32 max_vp_index;
>  	u32 max_lp_index;
> -	u32 isolation_config_a;
> +	union {
> +		u32 isolation_config_a;
> +		struct {
> +			u32 paravisor_present : 1;
> +			u32 reserved1 : 31;
> +		};
> +	};
>  	union {
>  		u32 isolation_config_b;
>  		struct {
>  			u32 cvm_type : 4;
> -			u32 reserved1 : 1;
> +			u32 reserved2 : 1;
>  			u32 shared_gpa_boundary_active : 1;
>  			u32 shared_gpa_boundary_bits : 6;
> -			u32 reserved2 : 20;
> +			u32 reserved3 : 20;
>  		};
>  	};
>  	u64 shared_gpa_boundary;
> --
> 2.25.1


^ permalink raw reply

* Re: [PATCH -next] net: hv_netvsc: Remove duplicated include in rndis_filter.c
From: Simon Horman @ 2023-06-08 12:30 UTC (permalink / raw)
  To: Yang Li
  Cc: davem, edumazet, kuba, pabeni, kys, haiyangz, wei.liu, decui,
	linux-hyperv, netdev, linux-kernel, Abaci Robot
In-Reply-To: <20230608080316.84203-1-yang.lee@linux.alibaba.com>

On Thu, Jun 08, 2023 at 04:03:16PM +0800, Yang Li wrote:
> ./drivers/net/hyperv/rndis_filter.c: linux/slab.h is included more than once.
> 
> Reported-by: Abaci Robot <abaci@linux.alibaba.com>
> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5462
> Fixes: 4cab498f33f7 ("hv_netvsc: Allocate rx indirection table size dynamically")
> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>

Hi Yang Li,

A few nits from my side;

1. The subject should include the target tree, in this case net-next.

	[PATCH net-next] ...

2. I don't think this needs a fixes tag: nothing was broken

I'm not sure this warrants a v2.
If you do decide to post a v2, please allow for 24h since
v1 was posted.

Link: https://kernel.org/doc/html/v6.3/process/maintainer-netdev.html

Lastly, I think at least one other similar change has been posted recently.
Please consider batching up such changes, say in groups of 10 patches,
and posting them as a patch-set.


The change itself seems fine to me.

Reviewed-by: Simon Horman <simon.horman@corigine.com>

> ---
>  drivers/net/hyperv/rndis_filter.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
> index af95947a87c5..ecc2128ca9b7 100644
> --- a/drivers/net/hyperv/rndis_filter.c
> +++ b/drivers/net/hyperv/rndis_filter.c
> @@ -21,7 +21,6 @@
>  #include <linux/rtnetlink.h>
>  #include <linux/ucs2_string.h>
>  #include <linux/string.h>
> -#include <linux/slab.h>
>  
>  #include "hyperv_net.h"
>  #include "netvsc_trace.h"
> -- 
> 2.20.1.7.g153144c
> 
> 

^ permalink raw reply

* RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Wei Hu @ 2023-06-08 11:17 UTC (permalink / raw)
  To: Long Li, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, Ajay Sharma, jgg@ziepe.ca,
	leon@kernel.org, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <PH7PR21MB32634CB06AFF8BFFDBC003B3CE53A@PH7PR21MB3263.namprd21.prod.outlook.com>



> -----Original Message-----
> Subject: RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to
> mana ib driver.
> 
> > Subject: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana
> > ib driver.
> >
> > Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext
> > to receive interrupt. Attach EQ when CQ is created. Call CQ interrupt
> > handler when completion interrupt happens. EQs are destroyed when
> ucontext is deallocated.
> >
> > The change calls some public APIs in mana ethernet driver to allocate
> > EQs and other resources. Ehe EQ process routine is also shared by mana
> > ethernet and mana ib drivers.
> >
> > Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
> > Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
> > Signed-off-by: Wei Hu <weh@microsoft.com>
> > ---
> >
> > v2: Use ibdev_dbg to print error messages and return -ENOMEN
> >     when kzalloc fails.
> >
> >  drivers/infiniband/hw/mana/cq.c               |  32 ++++-
> >  drivers/infiniband/hw/mana/main.c             |  87 ++++++++++++
> >  drivers/infiniband/hw/mana/mana_ib.h          |   4 +
> >  drivers/infiniband/hw/mana/qp.c               |  90 +++++++++++-
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++--------
> >  drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
> >  include/net/mana/gdma.h                       |   9 +-
> >  7 files changed, 290 insertions(+), 64 deletions(-)
> >
> > diff --git a/drivers/infiniband/hw/mana/cq.c
> > b/drivers/infiniband/hw/mana/cq.c index d141cab8a1e6..3cd680e0e753
> > 100644
> > --- a/drivers/infiniband/hw/mana/cq.c
> > +++ b/drivers/infiniband/hw/mana/cq.c
> > @@ -12,13 +12,20 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const
> > struct ib_cq_init_attr *attr,
> >  	struct ib_device *ibdev = ibcq->device;
> >  	struct mana_ib_create_cq ucmd = {};
> >  	struct mana_ib_dev *mdev;
> > +	struct gdma_context *gc;
> > +	struct gdma_dev *gd;
> >  	int err;
> >
> >  	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> > +	gd = mdev->gdma_dev;
> > +	gc = gd->gdma_context;
> >
> >  	if (udata->inlen < sizeof(ucmd))
> >  		return -EINVAL;
> >
> > +	cq->comp_vector = attr->comp_vector > gc->max_num_queues ?
> > +				0 : attr->comp_vector;
> > +
> >  	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata-
> > >inlen));
> >  	if (err) {
> >  		ibdev_dbg(ibdev,
> > @@ -69,11 +76,32 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq, struct
> > ib_udata *udata)
> >  	struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
> >  	struct ib_device *ibdev = ibcq->device;
> >  	struct mana_ib_dev *mdev;
> > +	struct gdma_context *gc;
> > +	struct gdma_dev *gd;
> > +
> >
> >  	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> > +	gd = mdev->gdma_dev;
> > +	gc = gd->gdma_context;
> >
> > -	mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> > -	ib_umem_release(cq->umem);
> > +
> > +
> > +	if (atomic_read(&ibcq->usecnt) == 0) {
> > +		mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> 
> Need to check if this function fails. The following code will call kfree(gc-
> >cq_table[cq->id]), it's possible that IRQ is happening at the same time if CQ
> is not destroyed.
> 

Sure. Will update.

> > +		ibdev_dbg(ibdev, "freeing gdma cq %p\n", gc->cq_table[cq-
> >id]);
> > +		kfree(gc->cq_table[cq->id]);
> > +		gc->cq_table[cq->id] = NULL;
> > +		ib_umem_release(cq->umem);
> > +	}
> >
> >  	return 0;
> >  }
> > +
> > +void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq) {
> > +	struct mana_ib_cq *cq = ctx;
> > +	struct ib_device *ibdev = cq->ibcq.device;
> > +
> > +	ibdev_dbg(ibdev, "Enter %s %d\n", __func__, __LINE__);
> 
> This debug message seems overkill?
> 
> > +	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); }
> > diff --git a/drivers/infiniband/hw/mana/main.c
> > b/drivers/infiniband/hw/mana/main.c
> > index 7be4c3adb4e2..e4efbcaed10e 100644
> > --- a/drivers/infiniband/hw/mana/main.c
> > +++ b/drivers/infiniband/hw/mana/main.c
> > @@ -143,6 +143,81 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct
> > ib_udata *udata)
> >  	return err;
> >  }
> >
> > +static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
> > +			       struct mana_ib_dev *mdev)
> > +{
> > +	struct gdma_context *gc = mdev->gdma_dev->gdma_context;
> > +	struct ib_device *ibdev = ucontext->ibucontext.device;
> > +	struct gdma_queue *eq;
> > +	int i;
> > +
> > +	if (!ucontext->eqs)
> > +		return;
> > +
> > +	for (i = 0; i < gc->max_num_queues; i++) {
> > +		eq = ucontext->eqs[i].eq;
> > +		if (!eq)
> > +			continue;
> > +
> > +		mana_gd_destroy_queue(gc, eq);
> > +	}
> > +
> > +	kfree(ucontext->eqs);
> > +	ucontext->eqs = NULL;
> > +
> > +	ibdev_dbg(ibdev, "destroyed eq's count %d\n", gc-
> >max_num_queues); }
> 
> Will gc->max_num_queues change after destroying a EQ?
> 

I think it will not change. Also the compiler might optimize
the code to just read the value once and store it in a register.

Thanks,
Wei

^ permalink raw reply

* Re: [PATCH 4/9] drivers: hv: Mark shared pages unencrypted in SEV-SNP enlightened guest
From: Vitaly Kuznetsov @ 2023-06-08  8:54 UTC (permalink / raw)
  To: Tianyu Lan, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, daniel.lezcano, arnd, michael.h.kelley
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel
In-Reply-To: <2803e5d6-58bc-57f1-0721-226333238883@gmail.com>

Tianyu Lan <ltykernel@gmail.com> writes:

> On 6/5/2023 8:54 PM, Vitaly Kuznetsov wrote:
>>> @@ -402,7 +417,14 @@ int hv_common_cpu_die(unsigned int cpu)
>>>   
>>>   	local_irq_restore(flags);
>>>   
>>> -	kfree(mem);
>>> +	if (hv_isolation_type_en_snp()) {
>>> +		ret = set_memory_encrypted((unsigned long)mem, pgcount);
>>> +		if (ret)
>>> +			pr_warn("Hyper-V: Failed to encrypt input arg on cpu%d: %d\n",
>>> +				cpu, ret);
>>> +		/* It's unsafe to free 'mem'. */
>>> +		return 0;
>> Why is it unsafe to free 'mem' if ret == 0? Also, why don't we want to
>> proparate non-zero 'ret' from here to fail CPU offlining?
>> 
>
> Based on Michael's patch the mem will not be freed during cpu offline.
> https://lwn.net/ml/linux-kernel/87cz2j5zrc.fsf@redhat.com/
> So I think it's unnessary to encrypt the mem again here.

Good, you can probably include Michael's patch in your next submission
then (unless it gets merged before that).

-- 
Vitaly


^ permalink raw reply

* [PATCH -next] net: hv_netvsc: Remove duplicated include in rndis_filter.c
From: Yang Li @ 2023-06-08  8:03 UTC (permalink / raw)
  To: davem
  Cc: edumazet, kuba, pabeni, kys, haiyangz, wei.liu, decui,
	linux-hyperv, netdev, linux-kernel, Yang Li, Abaci Robot

./drivers/net/hyperv/rndis_filter.c: linux/slab.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5462
Fixes: 4cab498f33f7 ("hv_netvsc: Allocate rx indirection table size dynamically")
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
---
 drivers/net/hyperv/rndis_filter.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index af95947a87c5..ecc2128ca9b7 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -21,7 +21,6 @@
 #include <linux/rtnetlink.h>
 #include <linux/ucs2_string.h>
 #include <linux/string.h>
-#include <linux/slab.h>
 
 #include "hyperv_net.h"
 #include "netvsc_trace.h"
-- 
2.20.1.7.g153144c


^ permalink raw reply related

* Re: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Jakub Kicinski @ 2023-06-08  4:39 UTC (permalink / raw)
  To: Wei Hu
  Cc: netdev, linux-hyperv, linux-rdma, longli, sharmaajay, jgg, leon,
	kys, haiyangz, wei.liu, decui, davem, edumazet, pabeni, vkuznets,
	ssengar, shradhagupta
In-Reply-To: <20230606151747.1649305-1-weh@microsoft.com>

On Tue,  6 Jun 2023 15:17:47 +0000 Wei Hu wrote:
>  drivers/infiniband/hw/mana/cq.c               |  32 ++++-
>  drivers/infiniband/hw/mana/main.c             |  87 ++++++++++++
>  drivers/infiniband/hw/mana/mana_ib.h          |   4 +
>  drivers/infiniband/hw/mana/qp.c               |  90 +++++++++++-
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++--------
>  drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
>  include/net/mana/gdma.h                       |   9 +-

IB and netdev are different subsystem, can you put it on a branch 
and send a PR as the cover letter so that both subsystems can pull?

Examples:
https://lore.kernel.org/all/20230607210410.88209-1-saeed@kernel.org/
https://lore.kernel.org/all/20230602171302.745492-1-anthony.l.nguyen@intel.com/

^ permalink raw reply

* Re: [RFC PATCH 1/1] sched/fair: Fix SMT balance dependency on CPU numbering
From: Chen Yu @ 2023-06-08  2:44 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: Vincent Guittot, mingo@redhat.com, peterz@infradead.org,
	juri.lelli@redhat.com, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <BYAPR21MB1688FE804787663C425C2202D753A@BYAPR21MB1688.namprd21.prod.outlook.com>

On 2023-06-07 at 21:41:46 +0000, Michael Kelley (LINUX) wrote:
> From: Chen Yu <yu.c.chen@intel.com> Sent: Tuesday, June 6, 2023 9:07 PM
> > 
> > On 2023-06-06 at 17:38:28 +0200, Vincent Guittot wrote:
> > > On Tue, 6 Jun 2023 at 16:08, Michael Kelley (LINUX)
> > > <mikelley@microsoft.com> wrote:
> > > >
> > > > From: Vincent Guittot <vincent.guittot@linaro.org> Sent: Monday, June 5, 2023
> > 2:31 AM
> > > > >
> > > > > Hi Michael,
> > > > >
> > > > > On Wed, 31 May 2023 at 19:55, Michael Kelley <mikelley@microsoft.com> wrote:
> > > > > >
> > > > > > With some CPU numbering schemes, the function select_idle_cpu() currently
> > > > > > has a subtle bias to return the first hyper-thread in a core. As a result
> > > > > > work is not evenly balanced across hyper-threads in a core. The difference
> > > > > > is often as much as 15 to 20 percentage points -- i.e., the first
> > > > > > hyper-thread might run at 45% load while the second hyper-thread runs at
> > > > > > only 30% load or less.
> > > > > >
> > > > > > Two likely CPU numbering schemes make sense with today's typical case
> > > > > > of two hyper-threads per core:
> > > > > >
> > > > > > A. Enumerate all the first hyper-theads in a core, then all the second
> > > > > >    hyper-threads in a core.  If a system has 8 cores with 16 hyper-threads,
> > > > > >    CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc.
> > > > > >
> > > > > > B. Enumerate all hyper-threads in a core, then all hyper-threads in the
> > > > > >    next core, etc.  Again with 8 cores and 16 hyper-threads, CPUs #0 and
> > > > > >    #1 are in the same core, as are CPUs #2 and #3, etc.
> > > > > >
> > > > > > Scheme A is used in most ACPI bare metal systems and in VMs running on
> > > > > > KVM.  The enumeration order is determined by the order of the processor
> > > > > > entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT
> > > > > > be set up for scheme A.
> > > > > >
> > > > > > However, for reasons that pre-date the ACPI recommendation, Hyper-V
> > > > > > guests have an ACPI MADT that is set up for scheme B.  When using scheme B,
> > > > > > the subtle bias is evident in select_idle_cpu().  While having Hyper-V
> > > > > > conform to the ACPI spec recommendation would solve the Hyper-V problem,
> > > > > > it is also desirable for the fair scheduler code to be independent of the
> > > > > > CPU numbering scheme.  ACPI is not always the firmware configuration
> > > > > > mechanism, and CPU numbering schemes might vary more in architectures
> > > > > > other than x86/x64.
> > > > > >
> > > > > > The bias occurs with scheme B when "has_idle_cpu" is true and
> > > > >
> > > > > I assume that you mean has_idle_core as I can't find has_idle_cpu in the code
> > > >
> > > > Yes.  You are right.
> > > >
> > > > >
> > > > > > select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless
> > > > > > of where the loop starts, it will almost immediately encountered a CPU
> > > > > > that is the first hyper-thread in a core. If that core is idle, the CPU
> > > > > > number of that first hyper-thread is returned. If that core is not idle,
> > > > > > both hyper-threads are removed from the cpus mask, and the loop iterates
> > > > > > to choose another CPU that is the first hyper-thread in a core.  As a
> > > > > > result, select_idle_core() almost always returns the first hyper-thread
> > > > > > in a core.
> > > > > >
> > > > > > The bias does not occur with scheme A because half of the CPU numbering
> > > > > > space is a series of CPUs that are the second hyper-thread in all the
> > > > > > cores. Assuming that the "target" CPU is evenly distributed throughout
> > > > > > the CPU numbering space, there's a 50/50 chance of starting in the portion
> > > > > > of the CPU numbering space that is all second hyper-threads.  If
> > > > > > select_idle_core() finds a idle core, it will likely return a CPU that
> > > > > > is the second hyper-thread in the core.  On average over the time,
> > > > > > both the first and second hyper-thread are equally likely to be
> > > > > > returned.
> > > > > >
> > > > > > Fix this bias by determining which hyper-thread in a core the "target"
> > > > > > CPU is -- i.e., the "smt index" of that CPU.  Then when select_idle_core()
> > > > > > finds an idle core, it returns the CPU in the core that has the same
> > > > > > smt index. If that CPU is not valid to be chosen, just return the CPU
> > > > > > that was passed into select_idle_core() and don't worry about bias.
> > > > > >
> > > > > > With scheme B, this fix solves the bias problem by making the chosen
> > > > > > CPU be roughly equally likely to either hyper-thread.  With scheme A
> > > > > > there's no real effect as the chosen CPU was already equally likely
> > > > > > to be either hyper-thread, and still is.
> > > > > >
> > > > > > The imbalance in hyper-thread loading was originally observed in a
> > > > > > customer workload, and then reproduced with a synthetic workload.
> > > > > > The change has been tested with the synthetic workload in a Hyper-V VM
> > > > > > running the normal scheme B CPU numbering, and then with the MADT
> > > > > > replaced with a scheme A version using Linux's ability to override
> > > > > > ACPI tables. The testing showed no change hyper-thread loading
> > > > > > balance with the scheme A CPU numbering, but the imbalance is
> > > > > > corrected if the CPU numbering is scheme B.
> > > > >
> > > > > You failed to explain why it's important to evenly select 1st or 2nd
> > > > > hyper-threads on the system.  I don't see any performance figures.
> > > > > What would be the benefit ?
> > > >
> > > > As I noted below, it's not completely clear to me whether this is a
> > > > problem.  I don't have a specific workload where overall runtime is
> > > > longer due to the SMT level imbalance.  I'm not a scheduler expert,
> > > > and wanted input from those who are.  Linux generally does a good
> > > > job of balancing load, and given the code in fair.c that is devoted to
> > > > balancing, I inferred at least some importance.  But maybe balancing
> > > > is more important for the higher-level domains, and less so for the
> > > > SMT domain.
> > >
> > > As long as the hyper-threads on a core are the same, I don't see any
> > > need to add more code and complexity to evenly balance the number of
> > > time that we select each CPU of the same core when the whole core is
> > > idle.
> 
> Vincent -- 
> 
> Fair enough.  We can revisit the topic if we discover a workload where the
> imbalance produces a noticeable difference in performance for some reason.
> Thanks for your review and consideration.
> 
> > >
> > In theory if a core is idle, either the 1st hyper thread or the 2nd hyper
> > thread is ok to run the wakee. Would there be a race condition between the
> > check of has_idle_core + idle core checking in select_idle_cpu() and the
> > task enqueue?  If the 2 hyper threads within 1 core are falling asleep
> > and wake up quickly, we have a false positive of has_idle_core, and
> > incorrectly think coreX is idle, and queue too many tasks on the first hyper
> > thread on coreX in B scheme, although coreX is not idle.
> > 
> 
> Chen --
> 
> I have not tried to think through the scenario you describe.  But I don't
> think such a race condition is the primary reason that we observe the SMT
> imbalance when running with CPU numbering scheme B.   It seems like
> such a race would be relatively rare, and hence not a significant issue.  Or
> is there a more serious consequence if the race were to happen?
>
This race condition is just a suspect to cause the imbalance among SMTs,
it does not have direct consequence if the race happens.

thanks,
Chenyu

^ permalink raw reply

* RE: [RFC PATCH 1/1] sched/fair: Fix SMT balance dependency on CPU numbering
From: Michael Kelley (LINUX) @ 2023-06-07 21:41 UTC (permalink / raw)
  To: Chen Yu, Vincent Guittot
  Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
	mgorman@suse.de, bristot@redhat.com, vschneid@redhat.com,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <ZIACTGpTD7FLfd1K@chenyu5-mobl2.ccr.corp.intel.com>

From: Chen Yu <yu.c.chen@intel.com> Sent: Tuesday, June 6, 2023 9:07 PM
> 
> On 2023-06-06 at 17:38:28 +0200, Vincent Guittot wrote:
> > On Tue, 6 Jun 2023 at 16:08, Michael Kelley (LINUX)
> > <mikelley@microsoft.com> wrote:
> > >
> > > From: Vincent Guittot <vincent.guittot@linaro.org> Sent: Monday, June 5, 2023
> 2:31 AM
> > > >
> > > > Hi Michael,
> > > >
> > > > On Wed, 31 May 2023 at 19:55, Michael Kelley <mikelley@microsoft.com> wrote:
> > > > >
> > > > > With some CPU numbering schemes, the function select_idle_cpu() currently
> > > > > has a subtle bias to return the first hyper-thread in a core. As a result
> > > > > work is not evenly balanced across hyper-threads in a core. The difference
> > > > > is often as much as 15 to 20 percentage points -- i.e., the first
> > > > > hyper-thread might run at 45% load while the second hyper-thread runs at
> > > > > only 30% load or less.
> > > > >
> > > > > Two likely CPU numbering schemes make sense with today's typical case
> > > > > of two hyper-threads per core:
> > > > >
> > > > > A. Enumerate all the first hyper-theads in a core, then all the second
> > > > >    hyper-threads in a core.  If a system has 8 cores with 16 hyper-threads,
> > > > >    CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc.
> > > > >
> > > > > B. Enumerate all hyper-threads in a core, then all hyper-threads in the
> > > > >    next core, etc.  Again with 8 cores and 16 hyper-threads, CPUs #0 and
> > > > >    #1 are in the same core, as are CPUs #2 and #3, etc.
> > > > >
> > > > > Scheme A is used in most ACPI bare metal systems and in VMs running on
> > > > > KVM.  The enumeration order is determined by the order of the processor
> > > > > entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT
> > > > > be set up for scheme A.
> > > > >
> > > > > However, for reasons that pre-date the ACPI recommendation, Hyper-V
> > > > > guests have an ACPI MADT that is set up for scheme B.  When using scheme B,
> > > > > the subtle bias is evident in select_idle_cpu().  While having Hyper-V
> > > > > conform to the ACPI spec recommendation would solve the Hyper-V problem,
> > > > > it is also desirable for the fair scheduler code to be independent of the
> > > > > CPU numbering scheme.  ACPI is not always the firmware configuration
> > > > > mechanism, and CPU numbering schemes might vary more in architectures
> > > > > other than x86/x64.
> > > > >
> > > > > The bias occurs with scheme B when "has_idle_cpu" is true and
> > > >
> > > > I assume that you mean has_idle_core as I can't find has_idle_cpu in the code
> > >
> > > Yes.  You are right.
> > >
> > > >
> > > > > select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless
> > > > > of where the loop starts, it will almost immediately encountered a CPU
> > > > > that is the first hyper-thread in a core. If that core is idle, the CPU
> > > > > number of that first hyper-thread is returned. If that core is not idle,
> > > > > both hyper-threads are removed from the cpus mask, and the loop iterates
> > > > > to choose another CPU that is the first hyper-thread in a core.  As a
> > > > > result, select_idle_core() almost always returns the first hyper-thread
> > > > > in a core.
> > > > >
> > > > > The bias does not occur with scheme A because half of the CPU numbering
> > > > > space is a series of CPUs that are the second hyper-thread in all the
> > > > > cores. Assuming that the "target" CPU is evenly distributed throughout
> > > > > the CPU numbering space, there's a 50/50 chance of starting in the portion
> > > > > of the CPU numbering space that is all second hyper-threads.  If
> > > > > select_idle_core() finds a idle core, it will likely return a CPU that
> > > > > is the second hyper-thread in the core.  On average over the time,
> > > > > both the first and second hyper-thread are equally likely to be
> > > > > returned.
> > > > >
> > > > > Fix this bias by determining which hyper-thread in a core the "target"
> > > > > CPU is -- i.e., the "smt index" of that CPU.  Then when select_idle_core()
> > > > > finds an idle core, it returns the CPU in the core that has the same
> > > > > smt index. If that CPU is not valid to be chosen, just return the CPU
> > > > > that was passed into select_idle_core() and don't worry about bias.
> > > > >
> > > > > With scheme B, this fix solves the bias problem by making the chosen
> > > > > CPU be roughly equally likely to either hyper-thread.  With scheme A
> > > > > there's no real effect as the chosen CPU was already equally likely
> > > > > to be either hyper-thread, and still is.
> > > > >
> > > > > The imbalance in hyper-thread loading was originally observed in a
> > > > > customer workload, and then reproduced with a synthetic workload.
> > > > > The change has been tested with the synthetic workload in a Hyper-V VM
> > > > > running the normal scheme B CPU numbering, and then with the MADT
> > > > > replaced with a scheme A version using Linux's ability to override
> > > > > ACPI tables. The testing showed no change hyper-thread loading
> > > > > balance with the scheme A CPU numbering, but the imbalance is
> > > > > corrected if the CPU numbering is scheme B.
> > > >
> > > > You failed to explain why it's important to evenly select 1st or 2nd
> > > > hyper-threads on the system.  I don't see any performance figures.
> > > > What would be the benefit ?
> > >
> > > As I noted below, it's not completely clear to me whether this is a
> > > problem.  I don't have a specific workload where overall runtime is
> > > longer due to the SMT level imbalance.  I'm not a scheduler expert,
> > > and wanted input from those who are.  Linux generally does a good
> > > job of balancing load, and given the code in fair.c that is devoted to
> > > balancing, I inferred at least some importance.  But maybe balancing
> > > is more important for the higher-level domains, and less so for the
> > > SMT domain.
> >
> > As long as the hyper-threads on a core are the same, I don't see any
> > need to add more code and complexity to evenly balance the number of
> > time that we select each CPU of the same core when the whole core is
> > idle.

Vincent -- 

Fair enough.  We can revisit the topic if we discover a workload where the
imbalance produces a noticeable difference in performance for some reason.
Thanks for your review and consideration.

> >
> In theory if a core is idle, either the 1st hyper thread or the 2nd hyper
> thread is ok to run the wakee. Would there be a race condition between the
> check of has_idle_core + idle core checking in select_idle_cpu() and the
> task enqueue?  If the 2 hyper threads within 1 core are falling asleep
> and wake up quickly, we have a false positive of has_idle_core, and
> incorrectly think coreX is idle, and queue too many tasks on the first hyper
> thread on coreX in B scheme, although coreX is not idle.
> 

Chen --

I have not tried to think through the scenario you describe.  But I don't
think such a race condition is the primary reason that we observe the SMT
imbalance when running with CPU numbering scheme B.   It seems like
such a race would be relatively rare, and hence not a significant issue.  Or
is there a more serious consequence if the race were to happen?

Michael



^ permalink raw reply

* RE: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Long Li @ 2023-06-07 21:03 UTC (permalink / raw)
  To: Wei Hu, netdev@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-rdma@vger.kernel.org, Ajay Sharma, jgg@ziepe.ca,
	leon@kernel.org, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, vkuznets@redhat.com,
	ssengar@linux.microsoft.com, shradhagupta@linux.microsoft.com
In-Reply-To: <20230606151747.1649305-1-weh@microsoft.com>

> Subject: [PATCH v2 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib
> driver.
> 
> Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext to receive
> interrupt. Attach EQ when CQ is created. Call CQ interrupt handler when
> completion interrupt happens. EQs are destroyed when ucontext is deallocated.
> 
> The change calls some public APIs in mana ethernet driver to allocate EQs and
> other resources. Ehe EQ process routine is also shared by mana ethernet and
> mana ib drivers.
> 
> Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
> Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
> Signed-off-by: Wei Hu <weh@microsoft.com>
> ---
> 
> v2: Use ibdev_dbg to print error messages and return -ENOMEN
>     when kzalloc fails.
> 
>  drivers/infiniband/hw/mana/cq.c               |  32 ++++-
>  drivers/infiniband/hw/mana/main.c             |  87 ++++++++++++
>  drivers/infiniband/hw/mana/mana_ib.h          |   4 +
>  drivers/infiniband/hw/mana/qp.c               |  90 +++++++++++-
>  .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++--------
>  drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
>  include/net/mana/gdma.h                       |   9 +-
>  7 files changed, 290 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
> index d141cab8a1e6..3cd680e0e753 100644
> --- a/drivers/infiniband/hw/mana/cq.c
> +++ b/drivers/infiniband/hw/mana/cq.c
> @@ -12,13 +12,20 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct
> ib_cq_init_attr *attr,
>  	struct ib_device *ibdev = ibcq->device;
>  	struct mana_ib_create_cq ucmd = {};
>  	struct mana_ib_dev *mdev;
> +	struct gdma_context *gc;
> +	struct gdma_dev *gd;
>  	int err;
> 
>  	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> +	gd = mdev->gdma_dev;
> +	gc = gd->gdma_context;
> 
>  	if (udata->inlen < sizeof(ucmd))
>  		return -EINVAL;
> 
> +	cq->comp_vector = attr->comp_vector > gc->max_num_queues ?
> +				0 : attr->comp_vector;
> +
>  	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata-
> >inlen));
>  	if (err) {
>  		ibdev_dbg(ibdev,
> @@ -69,11 +76,32 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq, struct
> ib_udata *udata)
>  	struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
>  	struct ib_device *ibdev = ibcq->device;
>  	struct mana_ib_dev *mdev;
> +	struct gdma_context *gc;
> +	struct gdma_dev *gd;
> +
> 
>  	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> +	gd = mdev->gdma_dev;
> +	gc = gd->gdma_context;
> 
> -	mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
> -	ib_umem_release(cq->umem);
> +
> +
> +	if (atomic_read(&ibcq->usecnt) == 0) {
> +		mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);

Need to check if this function fails. The following code will call kfree(gc->cq_table[cq->id]), it's possible that IRQ is happening at the same time if CQ is not destroyed.

> +		ibdev_dbg(ibdev, "freeing gdma cq %p\n", gc->cq_table[cq->id]);
> +		kfree(gc->cq_table[cq->id]);
> +		gc->cq_table[cq->id] = NULL;
> +		ib_umem_release(cq->umem);
> +	}
> 
>  	return 0;
>  }
> +
> +void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq) {
> +	struct mana_ib_cq *cq = ctx;
> +	struct ib_device *ibdev = cq->ibcq.device;
> +
> +	ibdev_dbg(ibdev, "Enter %s %d\n", __func__, __LINE__);

This debug message seems overkill?

> +	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context); }
> diff --git a/drivers/infiniband/hw/mana/main.c
> b/drivers/infiniband/hw/mana/main.c
> index 7be4c3adb4e2..e4efbcaed10e 100644
> --- a/drivers/infiniband/hw/mana/main.c
> +++ b/drivers/infiniband/hw/mana/main.c
> @@ -143,6 +143,81 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct
> ib_udata *udata)
>  	return err;
>  }
> 
> +static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
> +			       struct mana_ib_dev *mdev)
> +{
> +	struct gdma_context *gc = mdev->gdma_dev->gdma_context;
> +	struct ib_device *ibdev = ucontext->ibucontext.device;
> +	struct gdma_queue *eq;
> +	int i;
> +
> +	if (!ucontext->eqs)
> +		return;
> +
> +	for (i = 0; i < gc->max_num_queues; i++) {
> +		eq = ucontext->eqs[i].eq;
> +		if (!eq)
> +			continue;
> +
> +		mana_gd_destroy_queue(gc, eq);
> +	}
> +
> +	kfree(ucontext->eqs);
> +	ucontext->eqs = NULL;
> +
> +	ibdev_dbg(ibdev, "destroyed eq's count %d\n", gc->max_num_queues); }

Will gc->max_num_queues change after destroying a EQ?

> +
> +static int mana_ib_create_eq(struct mana_ib_ucontext *ucontext,
> +			     struct mana_ib_dev *mdev)
> +{
> +	struct gdma_queue_spec spec = {};
> +	struct gdma_queue *queue;
> +	struct gdma_context *gc;
> +	struct ib_device *ibdev;
> +	struct gdma_dev *gd;
> +	int err;
> +	int i;
> +
> +	if (!ucontext || !mdev)
> +		return -EINVAL;
> +
> +	ibdev = ucontext->ibucontext.device;
> +	gd = mdev->gdma_dev;
> +
> +	gc = gd->gdma_context;
> +
> +	ucontext->eqs = kcalloc(gc->max_num_queues, sizeof(struct mana_eq),
> +				GFP_KERNEL);
> +	if (!ucontext->eqs)
> +		return -ENOMEM;
> +
> +	spec.type = GDMA_EQ;
> +	spec.monitor_avl_buf = false;
> +	spec.queue_size = EQ_SIZE;
> +	spec.eq.callback = NULL;
> +	spec.eq.context = ucontext->eqs;
> +	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
> +	spec.eq.msix_allocated = true;
> +
> +	for (i = 0; i < gc->max_num_queues; i++) {
> +		spec.eq.msix_index = i;
> +		err = mana_gd_create_mana_eq(gd, &spec, &queue);
> +		if (err)
> +			goto out;
> +
> +		queue->eq.disable_needed = true;
> +		ucontext->eqs[i].eq = queue;
> +	}
> +
> +	return 0;
> +
> +out:
> +	ibdev_dbg(ibdev, "Failed to allocated eq err %d\n", err);
> +	mana_ib_destroy_eq(ucontext, mdev);
> +	return err;
> +}
> +
>  static int mana_gd_destroy_doorbell_page(struct gdma_context *gc,
>  					 int doorbell_page)
>  {
> @@ -225,7 +300,17 @@ int mana_ib_alloc_ucontext(struct ib_ucontext
> *ibcontext,
> 
>  	ucontext->doorbell = doorbell_page;
> 
> +	ret = mana_ib_create_eq(ucontext, mdev);
> +	if (ret) {
> +		ibdev_dbg(ibdev, "Failed to create eq's , ret %d\n", ret);
> +		goto err;
> +	}
> +
>  	return 0;
> +
> +err:
> +	mana_gd_destroy_doorbell_page(gc, doorbell_page);
> +	return ret;
>  }
> 
>  void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext) @@ -240,6
> +325,8 @@ void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
>  	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
>  	gc = mdev->gdma_dev->gdma_context;
> 
> +	mana_ib_destroy_eq(mana_ucontext, mdev);
> +
>  	ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
>  	if (ret)
>  		ibdev_dbg(ibdev, "Failed to destroy doorbell page %d\n", ret);
> diff --git a/drivers/infiniband/hw/mana/mana_ib.h
> b/drivers/infiniband/hw/mana/mana_ib.h
> index 502cc8672eef..9672fa1670a5 100644
> --- a/drivers/infiniband/hw/mana/mana_ib.h
> +++ b/drivers/infiniband/hw/mana/mana_ib.h
> @@ -67,6 +67,7 @@ struct mana_ib_cq {
>  	int cqe;
>  	u64 gdma_region;
>  	u64 id;
> +	u32 comp_vector;
>  };
> 
>  struct mana_ib_qp {
> @@ -86,6 +87,7 @@ struct mana_ib_qp {
>  struct mana_ib_ucontext {
>  	struct ib_ucontext ibucontext;
>  	u32 doorbell;
> +	struct mana_eq *eqs;
>  };
> 
>  struct mana_ib_rwq_ind_table {
> @@ -159,4 +161,6 @@ int mana_ib_query_gid(struct ib_device *ibdev, u32 port,
> int index,
> 
>  void mana_ib_disassociate_ucontext(struct ib_ucontext *ibcontext);
> 
> +void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq);
> +
>  #endif
> diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
> index 54b61930a7fd..e133d86c0875 100644
> --- a/drivers/infiniband/hw/mana/qp.c
> +++ b/drivers/infiniband/hw/mana/qp.c
> @@ -96,16 +96,20 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
>  	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
>  	struct mana_ib_dev *mdev =
>  		container_of(pd->device, struct mana_ib_dev, ib_dev);
> +	struct ib_ucontext *ib_ucontext = pd->uobject->context;
>  	struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
>  	struct mana_ib_create_qp_rss_resp resp = {};
>  	struct mana_ib_create_qp_rss ucmd = {};
> +	struct mana_ib_ucontext *mana_ucontext;
>  	struct gdma_dev *gd = mdev->gdma_dev;
>  	mana_handle_t *mana_ind_table;
>  	struct mana_port_context *mpc;
> +	struct gdma_queue *gdma_cq;
>  	struct mana_context *mc;
>  	struct net_device *ndev;
>  	struct mana_ib_cq *cq;
>  	struct mana_ib_wq *wq;
> +	struct mana_eq *eq;
>  	unsigned int ind_tbl_size;
>  	struct ib_cq *ibcq;
>  	struct ib_wq *ibwq;
> @@ -114,6 +118,8 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
>  	int ret;
> 
>  	mc = gd->driver_data;
> +	mana_ucontext =
> +		container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
> 
>  	if (!udata || udata->inlen < sizeof(ucmd))
>  		return -EINVAL;
> @@ -180,6 +186,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
>  	for (i = 0; i < ind_tbl_size; i++) {
>  		struct mana_obj_spec wq_spec = {};
>  		struct mana_obj_spec cq_spec = {};
> +		unsigned int max_num_queues = gd->gdma_context-
> >max_num_queues;
> 
>  		ibwq = ind_tbl->ind_tbl[i];
>  		wq = container_of(ibwq, struct mana_ib_wq, ibwq); @@ -193,7
> +200,8 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd
> *pd,
>  		cq_spec.gdma_region = cq->gdma_region;
>  		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
>  		cq_spec.modr_ctx_id = 0;
> -		cq_spec.attached_eq = GDMA_CQ_NO_EQ;
> +		eq = &mana_ucontext->eqs[cq->comp_vector %
> max_num_queues];
> +		cq_spec.attached_eq = eq->eq->id;
> 
>  		ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
>  					 &wq_spec, &cq_spec, &wq-
> >rx_object); @@ -207,6 +215,9 @@ static int mana_ib_create_qp_rss(struct
> ib_qp *ibqp, struct ib_pd *pd,
>  		wq->id = wq_spec.queue_index;
>  		cq->id = cq_spec.queue_index;
> 
> +		ibdev_dbg(&mdev->ib_dev, "attached eq id %u  cq with
> id %llu\n",
> +			eq->eq->id, cq->id);
> +
>  		ibdev_dbg(&mdev->ib_dev,
>  			  "ret %d rx_object 0x%llx wq id %llu cq id %llu\n",
>  			  ret, wq->rx_object, wq->id, cq->id); @@ -215,6
> +226,27 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd
> *pd,
>  		resp.entries[i].wqid = wq->id;
> 
>  		mana_ind_table[i] = wq->rx_object;
> +
> +		if (gd->gdma_context->cq_table[cq->id] == NULL) {
> +
> +			gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
> +			if (!gdma_cq) {
> +				ibdev_dbg(&mdev->ib_dev,
> +					 "failed to allocate gdma_cq\n");
> +				ret = -ENOMEM;
> +				goto free_cq;
> +			}
> +
> +			ibdev_dbg(&mdev->ib_dev, "gdma cq allocated %p\n",
> +				  gdma_cq);
> +
> +			gdma_cq->cq.context = cq;
> +			gdma_cq->type = GDMA_CQ;
> +			gdma_cq->cq.callback = mana_ib_cq_handler;
> +			gdma_cq->id = cq->id;
> +			gd->gdma_context->cq_table[cq->id] = gdma_cq;
> +		}
> +
>  	}
>  	resp.num_entries = i;
> 
> @@ -224,7 +256,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
>  					 ucmd.rx_hash_key_len,
>  					 ucmd.rx_hash_key);
>  	if (ret)
> -		goto fail;
> +		goto free_cq;
> 
>  	ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
>  	if (ret) {
> @@ -238,6 +270,23 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
> 
>  	return 0;
> 
> +free_cq:
> +	{
> +		int j = i;
> +		u64 cqid;
> +
> +		while (j-- > 0) {
> +			cqid = resp.entries[j].cqid;
> +			gdma_cq = gd->gdma_context->cq_table[cqid];
> +			cq = gdma_cq->cq.context;
> +			if (atomic_read(&cq->ibcq.usecnt) == 0) {
> +				kfree(gd->gdma_context->cq_table[cqid]);
> +				gd->gdma_context->cq_table[cqid] = NULL;
> +			}
> +		}
> +
> +	}
> +
>  fail:
>  	while (i-- > 0) {
>  		ibwq = ind_tbl->ind_tbl[i];
> @@ -269,10 +318,12 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp,
> struct ib_pd *ibpd,
>  	struct mana_obj_spec wq_spec = {};
>  	struct mana_obj_spec cq_spec = {};
>  	struct mana_port_context *mpc;
> +	struct gdma_queue *gdma_cq;
>  	struct mana_context *mc;
>  	struct net_device *ndev;
>  	struct ib_umem *umem;
> -	int err;
> +	struct mana_eq *eq;
> +	int err, eq_vec;
>  	u32 port;
> 
>  	mc = gd->driver_data;
> @@ -350,7 +401,9 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp,
> struct ib_pd *ibpd,
>  	cq_spec.gdma_region = send_cq->gdma_region;
>  	cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
>  	cq_spec.modr_ctx_id = 0;
> -	cq_spec.attached_eq = GDMA_CQ_NO_EQ;
> +	eq_vec = send_cq->comp_vector % gd->gdma_context-
> >max_num_queues;
> +	eq = &mana_ucontext->eqs[eq_vec];
> +	cq_spec.attached_eq = eq->eq->id;
> 
>  	err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ,
> &wq_spec,
>  				 &cq_spec, &qp->tx_object);
> @@ -368,6 +421,26 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp,
> struct ib_pd *ibpd,
>  	qp->sq_id = wq_spec.queue_index;
>  	send_cq->id = cq_spec.queue_index;
> 
> +	if (gd->gdma_context->cq_table[send_cq->id] == NULL) {
> +
> +		gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
> +		if (!gdma_cq) {
> +			ibdev_dbg(&mdev->ib_dev,
> +				  "failed to allocate gdma_cq\n");
> +			err = -ENOMEM;
> +			goto err_destroy_wqobj_and_cq;
> +		}
> +
> +		pr_debug("gdma cq allocated %p\n", gdma_cq);
Should use ibdev_dbg

Thanks,
Long

^ permalink raw reply

* RE: [PATCH net-next] net: mana: Add support for vlan tagging
From: Haiyang Zhang @ 2023-06-07 20:44 UTC (permalink / raw)
  To: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org
  Cc: Dexuan Cui, KY Srinivasan, Paul Rosswurm, olaf@aepfle.de,
	vkuznets@redhat.com, davem@davemloft.net, wei.liu@kernel.org,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	leon@kernel.org, Long Li, ssengar@linux.microsoft.com,
	linux-rdma@vger.kernel.org, daniel@iogearbox.net,
	john.fastabend@gmail.com, bpf@vger.kernel.org, ast@kernel.org,
	Ajay Sharma, hawk@kernel.org, tglx@linutronix.de,
	shradhagupta@linux.microsoft.com, linux-kernel@vger.kernel.org
In-Reply-To: <1686163058-25469-1-git-send-email-haiyangz@microsoft.com>



> -----Original Message-----
> From: LKML haiyangz <lkmlhyz@microsoft.com> On Behalf Of Haiyang Zhang
> Sent: Wednesday, June 7, 2023 2:38 PM
> To: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org
> Cc: Haiyang Zhang <haiyangz@microsoft.com>; Dexuan Cui
> <decui@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Paul Rosswurm
> <paulros@microsoft.com>; olaf@aepfle.de; vkuznets@redhat.com;
> davem@davemloft.net; wei.liu@kernel.org; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; leon@kernel.org; Long Li
> <longli@microsoft.com>; ssengar@linux.microsoft.com; linux-
> rdma@vger.kernel.org; daniel@iogearbox.net; john.fastabend@gmail.com;
> bpf@vger.kernel.org; ast@kernel.org; Ajay Sharma
> <sharmaajay@microsoft.com>; hawk@kernel.org; tglx@linutronix.de;
> shradhagupta@linux.microsoft.com; linux-kernel@vger.kernel.org
> Subject: [PATCH net-next] net: mana: Add support for vlan tagging
> 
> To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX
> skb. Then extract the vlan tag from the skb struct or the frame, and
> save it to tx_oob for the NIC to transmit.
> 
> For RX, extract the vlan tag from CQE and put it into skb.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 36
> +++++++++++++++++--
>  1 file changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index d907727c7b7a..1d76ac66908c 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -179,6 +179,31 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb,
> struct net_device *ndev)
>  		pkg.tx_oob.s_oob.short_vp_offset = txq->vp_offset;
>  	}
> 
> +	/* When using AF_PACKET we need to move VLAN header from
> +	 * the frame to the SKB struct to allow the NIC to xmit
> +	 * the 802.1Q packet.
> +	 */
> +	if (skb->protocol == htons(ETH_P_8021Q)) {
> +		u16 vlan_tci;
> +
> +		skb_reset_mac_header(skb);
> +		if (eth_type_vlan(eth_hdr(skb)->h_proto)) {
> +			if (unlikely(__skb_vlan_pop(skb, &vlan_tci)))
> +				goto tx_drop_count;
> +
> +			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
> +					       vlan_tci);
> +		}
> +	}

Not necessary to extract inband tag, because our NIC accepts inband tags too.
The change is in the next version.



^ permalink raw reply

* RE: [PATCH net-next,V2] net: mana: Add support for vlan tagging
From: Haiyang Zhang @ 2023-06-07 20:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: linux-hyperv@vger.kernel.org, netdev@vger.kernel.org, Dexuan Cui,
	KY Srinivasan, Paul Rosswurm, olaf@aepfle.de, vkuznets@redhat.com,
	davem@davemloft.net, wei.liu@kernel.org, edumazet@google.com,
	pabeni@redhat.com, leon@kernel.org, Long Li,
	ssengar@linux.microsoft.com, linux-rdma@vger.kernel.org,
	daniel@iogearbox.net, john.fastabend@gmail.com,
	bpf@vger.kernel.org, ast@kernel.org, Ajay Sharma, hawk@kernel.org,
	tglx@linutronix.de, shradhagupta@linux.microsoft.com,
	linux-kernel@vger.kernel.org
In-Reply-To: <20230607133805.58161672@kernel.org>



> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Wednesday, June 7, 2023 4:38 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: linux-hyperv@vger.kernel.org; netdev@vger.kernel.org; Dexuan Cui
> <decui@microsoft.com>; KY Srinivasan <kys@microsoft.com>; Paul Rosswurm
> <paulros@microsoft.com>; olaf@aepfle.de; vkuznets@redhat.com;
> davem@davemloft.net; wei.liu@kernel.org; edumazet@google.com;
> pabeni@redhat.com; leon@kernel.org; Long Li <longli@microsoft.com>;
> ssengar@linux.microsoft.com; linux-rdma@vger.kernel.org;
> daniel@iogearbox.net; john.fastabend@gmail.com; bpf@vger.kernel.org;
> ast@kernel.org; Ajay Sharma <sharmaajay@microsoft.com>;
> hawk@kernel.org; tglx@linutronix.de; shradhagupta@linux.microsoft.com;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH net-next,V2] net: mana: Add support for vlan tagging
> 
> On Wed,  7 Jun 2023 13:34:02 -0700 Haiyang Zhang wrote:
> > To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX
> > skb. Then extract the vlan tag from the skb struct, and save it to
> > tx_oob for the NIC to transmit. For vlan tags on the payload, they
> > are accepted by the NIC too.
> >
> > For RX, extract the vlan tag from CQE and put it into skb.
> >
> > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> > V2:
> > Removed the code that extracts inband tag, because our NIC accepts
> > inband tags too.
> 
> Please don't rush multiple versions, if your previous version is buggy
> you have to reply to it saying so and then wait before posting v2.

Will do. Thanks.


^ permalink raw reply

* Re: [PATCH net-next,V2] net: mana: Add support for vlan tagging
From: Jakub Kicinski @ 2023-06-07 20:38 UTC (permalink / raw)
  To: Haiyang Zhang
  Cc: linux-hyperv, netdev, decui, kys, paulros, olaf, vkuznets, davem,
	wei.liu, edumazet, pabeni, leon, longli, ssengar, linux-rdma,
	daniel, john.fastabend, bpf, ast, sharmaajay, hawk, tglx,
	shradhagupta, linux-kernel
In-Reply-To: <1686170042-10610-1-git-send-email-haiyangz@microsoft.com>

On Wed,  7 Jun 2023 13:34:02 -0700 Haiyang Zhang wrote:
> To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX
> skb. Then extract the vlan tag from the skb struct, and save it to
> tx_oob for the NIC to transmit. For vlan tags on the payload, they
> are accepted by the NIC too.
> 
> For RX, extract the vlan tag from CQE and put it into skb.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
> V2:
> Removed the code that extracts inband tag, because our NIC accepts
> inband tags too.

Please don't rush multiple versions, if your previous version is buggy
you have to reply to it saying so and then wait before posting v2.

Reviewing something just to find there is a v2 already posting is one
of the more annoying experiences for maintainers and reviewers.

Quoting documentation:

  Resending after review
  ~~~~~~~~~~~~~~~~~~~~~~

  Allow at least 24 hours to pass between postings. This will ensure reviewers
  from all geographical locations have a chance to chime in. Do not wait
  too long (weeks) between postings either as it will make it harder for reviewers
  to recall all the context.

  Make sure you address all the feedback in your new posting. Do not post a new
  version of the code if the discussion about the previous version is still
  ongoing, unless directly instructed by a reviewer.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#resending-after-review

^ permalink raw reply

* [PATCH net-next,V2] net: mana: Add support for vlan tagging
From: Haiyang Zhang @ 2023-06-07 20:34 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, decui, kys, paulros, olaf, vkuznets, davem, wei.liu,
	edumazet, kuba, pabeni, leon, longli, ssengar, linux-rdma, daniel,
	john.fastabend, bpf, ast, sharmaajay, hawk, tglx, shradhagupta,
	linux-kernel

To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX
skb. Then extract the vlan tag from the skb struct, and save it to
tx_oob for the NIC to transmit. For vlan tags on the payload, they
are accepted by the NIC too.

For RX, extract the vlan tag from CQE and put it into skb.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
V2:
Removed the code that extracts inband tag, because our NIC accepts
inband tags too.

---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d907727c7b7a..cd4d5ceb9f2d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -179,6 +179,14 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 		pkg.tx_oob.s_oob.short_vp_offset = txq->vp_offset;
 	}
 
+	if (skb_vlan_tag_present(skb)) {
+		pkt_fmt = MANA_LONG_PKT_FMT;
+		pkg.tx_oob.l_oob.inject_vlan_pri_tag = 1;
+		pkg.tx_oob.l_oob.pcp = skb_vlan_tag_get_prio(skb);
+		pkg.tx_oob.l_oob.dei = skb_vlan_tag_get_cfi(skb);
+		pkg.tx_oob.l_oob.vlan_id = skb_vlan_tag_get_id(skb);
+	}
+
 	pkg.tx_oob.s_oob.pkt_fmt = pkt_fmt;
 
 	if (pkt_fmt == MANA_SHORT_PKT_FMT) {
@@ -1457,6 +1465,12 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 			skb_set_hash(skb, hash_value, PKT_HASH_TYPE_L3);
 	}
 
+	if (cqe->rx_vlantag_present) {
+		u16 vlan_tci = cqe->rx_vlan_id;
+
+		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tci);
+	}
+
 	u64_stats_update_begin(&rx_stats->syncp);
 	rx_stats->packets++;
 	rx_stats->bytes += pkt_len;
@@ -2451,8 +2465,9 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	ndev->hw_features |= NETIF_F_RXCSUM;
 	ndev->hw_features |= NETIF_F_TSO | NETIF_F_TSO6;
 	ndev->hw_features |= NETIF_F_RXHASH;
-	ndev->features = ndev->hw_features;
-	ndev->vlan_features = 0;
+	ndev->features = ndev->hw_features | NETIF_F_HW_VLAN_CTAG_TX |
+			 NETIF_F_HW_VLAN_CTAG_RX;
+	ndev->vlan_features = ndev->features;
 	ndev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
 			     NETDEV_XDP_ACT_NDO_XMIT;
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH net-next] net: mana: Add support for vlan tagging
From: Haiyang Zhang @ 2023-06-07 18:37 UTC (permalink / raw)
  To: linux-hyperv, netdev
  Cc: haiyangz, decui, kys, paulros, olaf, vkuznets, davem, wei.liu,
	edumazet, kuba, pabeni, leon, longli, ssengar, linux-rdma, daniel,
	john.fastabend, bpf, ast, sharmaajay, hawk, tglx, shradhagupta,
	linux-kernel

To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX
skb. Then extract the vlan tag from the skb struct or the frame, and
save it to tx_oob for the NIC to transmit.

For RX, extract the vlan tag from CQE and put it into skb.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 36 +++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index d907727c7b7a..1d76ac66908c 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -179,6 +179,31 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 		pkg.tx_oob.s_oob.short_vp_offset = txq->vp_offset;
 	}
 
+	/* When using AF_PACKET we need to move VLAN header from
+	 * the frame to the SKB struct to allow the NIC to xmit
+	 * the 802.1Q packet.
+	 */
+	if (skb->protocol == htons(ETH_P_8021Q)) {
+		u16 vlan_tci;
+
+		skb_reset_mac_header(skb);
+		if (eth_type_vlan(eth_hdr(skb)->h_proto)) {
+			if (unlikely(__skb_vlan_pop(skb, &vlan_tci)))
+				goto tx_drop_count;
+
+			__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+					       vlan_tci);
+		}
+	}
+
+	if (skb_vlan_tag_present(skb)) {
+		pkt_fmt = MANA_LONG_PKT_FMT;
+		pkg.tx_oob.l_oob.inject_vlan_pri_tag = 1;
+		pkg.tx_oob.l_oob.pcp = skb_vlan_tag_get_prio(skb);
+		pkg.tx_oob.l_oob.dei = skb_vlan_tag_get_cfi(skb);
+		pkg.tx_oob.l_oob.vlan_id = skb_vlan_tag_get_id(skb);
+	}
+
 	pkg.tx_oob.s_oob.pkt_fmt = pkt_fmt;
 
 	if (pkt_fmt == MANA_SHORT_PKT_FMT) {
@@ -1457,6 +1482,12 @@ static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 			skb_set_hash(skb, hash_value, PKT_HASH_TYPE_L3);
 	}
 
+	if (cqe->rx_vlantag_present) {
+		u16 vlan_tci = cqe->rx_vlan_id;
+
+		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tci);
+	}
+
 	u64_stats_update_begin(&rx_stats->syncp);
 	rx_stats->packets++;
 	rx_stats->bytes += pkt_len;
@@ -2451,8 +2482,9 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	ndev->hw_features |= NETIF_F_RXCSUM;
 	ndev->hw_features |= NETIF_F_TSO | NETIF_F_TSO6;
 	ndev->hw_features |= NETIF_F_RXHASH;
-	ndev->features = ndev->hw_features;
-	ndev->vlan_features = 0;
+	ndev->features = ndev->hw_features | NETIF_F_HW_VLAN_CTAG_TX |
+			 NETIF_F_HW_VLAN_CTAG_RX;
+	ndev->vlan_features = ndev->features;
 	ndev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
 			     NETDEV_XDP_ACT_NDO_XMIT;
 
-- 
2.25.1


^ permalink raw reply related

* Re: [RFC PATCH V6 01/14] x86/sev: Add a #HV exception handler
From: Tom Lendacky @ 2023-06-07 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Gupta, Pankaj, Tianyu Lan, luto, tglx, mingo, bp, dave.hansen,
	x86, hpa, seanjc, pbonzini, jgross, tiala, kirill, jiangshan.ljs,
	ashish.kalra, srutherford, akpm, anshuman.khandual,
	pawan.kumar.gupta, adrian.hunter, daniel.sneddon,
	alexander.shishkin, sandipan.das, ray.huang, brijesh.singh,
	michael.roth, venu.busireddy, sterritt, tony.luck, samitolvanen,
	fenghua.yu, pangupta, linux-kernel, kvm, linux-hyperv, linux-arch
In-Reply-To: <20230531091452.GG38236@hirez.programming.kicks-ass.net>

On 5/31/23 04:14, Peter Zijlstra wrote:
> On Tue, May 30, 2023 at 08:52:32PM +0200, Peter Zijlstra wrote:
> 
>>> That should really say that a nested #HV should never be raised by the
>>> hypervisor, but if it is, then the guest should detect that and
>>> self-terminate knowing that the hypervisor is possibly being malicious.
>>
>> I've yet to see code that can do that reliably.
> 
> Tom; could you please investigate if this can be enforced in ucode?
> 
> Ideally #HV would have an internal latch such that a recursive #HV will
> terminate the guest (much like double #MC and tripple-fault).
> 
> But unlike the #MC trainwreck, can we please not leave a glaring hole in
> this latch and use a spare bit in the IRET frame please?
> 
> So have #HV delivery:
>   - check internal latch; if set, terminate machine
>   - set latch
>   - write IRET frame with magic bit set
> 
> have IRET:
>   - check magic bit and reset #HV latch

Hi Peter,

I talked with the hardware team about this and, unfortunately, it is not 
practical to implement. The main concerns are that there are already two 
generations of hardware out there with the current support and, given 
limited patch space, in addition to the ucode support to track and perform 
the latch support, additional ucode support would be required to 
save/restore the latch information when handling a VMEXIT during #HV 
processing.

Thanks,
Tom

> 

^ permalink raw reply

* Re: [PATCH RFC net-next v3 6/8] virtio/vsock: support dgrams
From: Bobby Eshleman @ 2023-06-01  7:54 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Simon Horman, Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv
In-Reply-To: <d2e9c45f-bcbd-4e6a-98c1-c98283450626@kadam.mountain>

On Wed, May 31, 2023 at 09:13:04PM +0300, Dan Carpenter wrote:
> On Wed, May 31, 2023 at 06:09:11PM +0200, Simon Horman wrote:
> > > @@ -102,6 +144,7 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> > 
> > Smatch that err may not be initialised in the out label below.
> > 
> > Just above this context the following appears:
> > 
> > 	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
> > 		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
> > 		goto out;
> > 	}
> > 
> > So I wonder if in that case err may not be initialised.
> > 
> 
> Yep, exactly right.  I commented out the goto and it silenced the
> warning.  I also initialized err to zero at the start hoping that it
> would trigger a different warning but it didn't.  :(
> 
> regards,
> dan carpenter
> 

Thanks for checking that Dan. Fixed in the next rev.

Best,
Bobby

^ permalink raw reply

* Re: [PATCH RFC net-next v3 1/8] vsock/dgram: generalize recvmsg and drop transport->dgram_dequeue
From: Bobby Eshleman @ 2023-06-01  7:53 UTC (permalink / raw)
  To: Simon Horman
  Cc: Bobby Eshleman, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, kvm, virtualization, netdev,
	linux-kernel, linux-hyperv
In-Reply-To: <ZHduQMZG4an6A+DG@corigine.com>

On Wed, May 31, 2023 at 05:56:48PM +0200, Simon Horman wrote:
> On Wed, May 31, 2023 at 12:35:05AM +0000, Bobby Eshleman wrote:
> > This commit drops the transport->dgram_dequeue callback and makes
> > vsock_dgram_recvmsg() generic. It also adds additional transport
> > callbacks for use by the generic vsock_dgram_recvmsg(), such as for
> > parsing skbs for CID/port which vary in format per transport.
> > 
> > Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com>
> 
> ...
> 
> > diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> > index b370070194fa..b6a51afb74b8 100644
> > --- a/net/vmw_vsock/vmci_transport.c
> > +++ b/net/vmw_vsock/vmci_transport.c
> > @@ -1731,57 +1731,40 @@ static int vmci_transport_dgram_enqueue(
> >  	return err - sizeof(*dg);
> >  }
> >  
> > -static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
> > -					struct msghdr *msg, size_t len,
> > -					int flags)
> > +int vmci_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
> >  {
> > -	int err;
> >  	struct vmci_datagram *dg;
> > -	size_t payload_len;
> > -	struct sk_buff *skb;
> >  
> > -	if (flags & MSG_OOB || flags & MSG_ERRQUEUE)
> > -		return -EOPNOTSUPP;
> > +	dg = (struct vmci_datagram *)skb->data;
> > +	if (!dg)
> > +		return -EINVAL;
> >  
> > -	/* Retrieve the head sk_buff from the socket's receive queue. */
> > -	err = 0;
> > -	skb = skb_recv_datagram(&vsk->sk, flags, &err);
> > -	if (!skb)
> > -		return err;
> > +	*cid = dg->src.context;
> > +	return 0;
> > +}
> 
> Hi Bobby,
> 
> clang-16 with W=1 seems a bit unhappy about this.
> 
>   net/vmw_vsock/vmci_transport.c:1734:5: warning: no previous prototype for function 'vmci_transport_dgram_get_cid' [-Wmissing-prototypes]
>   int vmci_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
>       ^
>   net/vmw_vsock/vmci_transport.c:1734:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
>   int vmci_transport_dgram_get_cid(struct sk_buff *skb, unsigned int *cid)
>   ^
>   static 
>   net/vmw_vsock/vmci_transport.c:1746:5: warning: no previous prototype for function 'vmci_transport_dgram_get_port' [-Wmissing-prototypes]
>   int vmci_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
>       ^
>   net/vmw_vsock/vmci_transport.c:1746:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
>   int vmci_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
>   ^
>   static 
>   net/vmw_vsock/vmci_transport.c:1758:5: warning: no previous prototype for function 'vmci_transport_dgram_get_length' [-Wmissing-prototypes]
>   int vmci_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
>       ^
>   net/vmw_vsock/vmci_transport.c:1758:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
>   int vmci_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
>   ^
> 
> I see similar warnings for net/vmw_vsock/af_vsock.c in patch 4/8.
> 
> > +
> > +int vmci_transport_dgram_get_port(struct sk_buff *skb, unsigned int *port)
> > +{
> > +	struct vmci_datagram *dg;
> >  
> >  	dg = (struct vmci_datagram *)skb->data;
> >  	if (!dg)
> > -		/* err is 0, meaning we read zero bytes. */
> > -		goto out;
> > -
> > -	payload_len = dg->payload_size;
> > -	/* Ensure the sk_buff matches the payload size claimed in the packet. */
> > -	if (payload_len != skb->len - sizeof(*dg)) {
> > -		err = -EINVAL;
> > -		goto out;
> > -	}
> > +		return -EINVAL;
> >  
> > -	if (payload_len > len) {
> > -		payload_len = len;
> > -		msg->msg_flags |= MSG_TRUNC;
> > -	}
> > +	*port = dg->src.resource;
> > +	return 0;
> > +}
> >  
> > -	/* Place the datagram payload in the user's iovec. */
> > -	err = skb_copy_datagram_msg(skb, sizeof(*dg), msg, payload_len);
> > -	if (err)
> > -		goto out;
> > +int vmci_transport_dgram_get_length(struct sk_buff *skb, size_t *len)
> > +{
> > +	struct vmci_datagram *dg;
> >  
> > -	if (msg->msg_name) {
> > -		/* Provide the address of the sender. */
> > -		DECLARE_SOCKADDR(struct sockaddr_vm *, vm_addr, msg->msg_name);
> > -		vsock_addr_init(vm_addr, dg->src.context, dg->src.resource);
> > -		msg->msg_namelen = sizeof(*vm_addr);
> > -	}
> > -	err = payload_len;
> > +	dg = (struct vmci_datagram *)skb->data;
> > +	if (!dg)
> > +		return -EINVAL;
> >  
> > -out:
> > -	skb_free_datagram(&vsk->sk, skb);
> > -	return err;
> > +	*len = dg->payload_size;
> > +	return 0;
> >  }
> >  
> >  static bool vmci_transport_dgram_allow(u32 cid, u32 port)
> 
> ...

Thanks for the review! Your feedback from both emails will be
incorporated in the next rev (with C=1 and W=1 output clearing).

Thanks again,
Bobby

^ permalink raw reply

* Re: [PATCH RFC net-next v3 8/8] tests: add vsock dgram tests
From: Bobby Eshleman @ 2023-06-01  7:51 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Bobby Eshleman, Jiang Wang, Stefan Hajnoczi, Stefano Garzarella,
	Michael S. Tsirkin, Jason Wang, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
	VMware PV-Drivers Reviewers, Krasnov Arseniy, kvm, virtualization,
	netdev, linux-kernel, linux-hyperv
In-Reply-To: <0bd40fd8-e666-e2a3-04da-501a0e7b97a9@sberdevices.ru>

On Tue, Jun 06, 2023 at 12:34:22PM +0300, Arseniy Krasnov wrote:
> Sorry, CC mailing lists
> 
> On 06.06.2023 12:29, Arseniy Krasnov wrote:
> > Hello Bobby and Jiang! Small remarks(sorry for this letter layout, I add multiple newline over comments):
> > 

Hey Arseniy!

> > diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
> > index 01b636d3039a..45e35da48b40 100644
> > --- a/tools/testing/vsock/util.c
> > +++ b/tools/testing/vsock/util.c
> > @@ -260,6 +260,57 @@ void send_byte(int fd, int expected_ret, int flags)
> >  	}
> >  }
> >  
> > +/* Transmit one byte and check the return value.
> > + *
> > + * expected_ret:
> > + *  <0 Negative errno (for testing errors)
> > + *   0 End-of-file
> > + *   1 Success
> > + */
> > +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> > +		 int flags)
> > +{
> > +	const uint8_t byte = 'A';
> > +	ssize_t nwritten;
> > +
> > +	timeout_begin(TIMEOUT);
> > +	do {
> > +		nwritten = sendto(fd, &byte, sizeof(byte), flags, dest_addr,
> > +				  len);
> > +		timeout_check("write");
> > +	} while (nwritten < 0 && errno == EINTR);
> > +	timeout_end();
> > +
> > +	if (expected_ret < 0) {
> > +		if (nwritten != -1) {
> > +			fprintf(stderr, "bogus sendto(2) return value %zd\n",
> > +				nwritten);
> > +			exit(EXIT_FAILURE);
> > +		}
> > +		if (errno != -expected_ret) {
> > +			perror("write");
> > +			exit(EXIT_FAILURE);
> > +		}
> > +		return;
> > +	}
> > +
> > +	if (nwritten < 0) {
> > +		perror("write");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	if (nwritten == 0) {
> > +		if (expected_ret == 0)
> > +			return;
> > +
> > +		fprintf(stderr, "unexpected EOF while sending byte\n");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	if (nwritten != sizeof(byte)) {
> > +		fprintf(stderr, "bogus sendto(2) return value %zd\n", nwritten);
> > +		exit(EXIT_FAILURE);
> > +
> > 	}
> > 
> > 
> > 
> > ^^^
> > May be short check that 'nwritten' != 'expected_ret' will be enough? Then print expected and
> > real value. Here and in 'recvfrom_byte()' below.
> > 

Right now this is really just a copy/paste of the send_byte() that
stream uses, so that would probably make the two report errors slightly
different. If desired for some specific reason, I'm open to it.

> > 
> > 
> > 
> > +}
> > +
> >  /* Receive one byte and check the return value.
> >   *
> >   * expected_ret:
> > @@ -313,6 +364,60 @@ void recv_byte(int fd, int expected_ret, int flags)
> >  	}
> >  }
> >  
> > +/* Receive one byte and check the return value.
> > + *
> > + * expected_ret:
> > + *  <0 Negative errno (for testing errors)
> > + *   0 End-of-file
> > + *   1 Success
> > + */
> > +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> > +		   int expected_ret, int flags)
> > +{
> > +	uint8_t byte;
> > +	ssize_t nread;
> > +
> > +	timeout_begin(TIMEOUT);
> > +	do {
> > +		nread = recvfrom(fd, &byte, sizeof(byte), flags, src_addr, addrlen);
> > +		timeout_check("read");
> > +	} while (nread < 0 && errno == EINTR);
> > +	timeout_end();
> > +
> > +	if (expected_ret < 0) {
> > +		if (nread != -1) {
> > +			fprintf(stderr, "bogus recvfrom(2) return value %zd\n",
> > +				nread);
> > +			exit(EXIT_FAILURE);
> > +		}
> > +		if (errno != -expected_ret) {
> > +			perror("read");
> > +			exit(EXIT_FAILURE);
> > +		}
> > +		return;
> > +	}
> > +
> > +	if (nread < 0) {
> > +		perror("read");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	if (nread == 0) {
> > +		if (expected_ret == 0)
> > +			return;
> > +
> > +		fprintf(stderr, "unexpected EOF while receiving byte\n");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	if (nread != sizeof(byte)) {
> > +		fprintf(stderr, "bogus recvfrom(2) return value %zd\n", nread);
> > +		exit(EXIT_FAILURE);
> > +	}
> > +	if (byte != 'A') {
> > +		fprintf(stderr, "unexpected byte read %c\n", byte);
> > +		exit(EXIT_FAILURE);
> > +	}
> > +}
> > +
> >  /* Run test cases.  The program terminates if a failure occurs. */
> >  void run_tests(const struct test_case *test_cases,
> >  	       const struct test_opts *opts)
> > diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
> > index fb99208a95ea..6e5cd610bf05 100644
> > --- a/tools/testing/vsock/util.h
> > +++ b/tools/testing/vsock/util.h
> > @@ -43,7 +43,11 @@ int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
> >  			   struct sockaddr_vm *clientaddrp);
> >  void vsock_wait_remote_close(int fd);
> >  void send_byte(int fd, int expected_ret, int flags);
> > +void sendto_byte(int fd, const struct sockaddr *dest_addr, int len, int expected_ret,
> > +		 int flags);
> >  void recv_byte(int fd, int expected_ret, int flags);
> > +void recvfrom_byte(int fd, struct sockaddr *src_addr, socklen_t *addrlen,
> > +		   int expected_ret, int flags);
> >  void run_tests(const struct test_case *test_cases,
> >  	       const struct test_opts *opts);
> >  void list_tests(const struct test_case *test_cases);
> > diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
> > index ac1bd3ac1533..851c3d65178d 100644
> > --- a/tools/testing/vsock/vsock_test.c
> > +++ b/tools/testing/vsock/vsock_test.c
> > @@ -202,6 +202,113 @@ static void test_stream_server_close_server(const struct test_opts *opts)
> >  	close(fd);
> >  }
> >  
> > +static void test_dgram_sendto_client(const struct test_opts *opts)
> > +{
> > +	union {
> > +		struct sockaddr sa;
> > +		struct sockaddr_vm svm;
> > +	} addr = {
> > +		.svm = {
> > +			.svm_family = AF_VSOCK,
> > +			.svm_port = 1234,
> > +			.svm_cid = opts->peer_cid,
> > +		},
> > +	};
> > +	int fd;
> > +
> > +	/* Wait for the server to be ready */
> > +	control_expectln("BIND");
> > +
> > +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > +	if (fd < 0) {
> > +		perror("socket");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	sendto_byte(fd, &addr.sa, sizeof(addr.svm), 1, 0);
> > +
> > +	/* Notify the server that the client has finished */
> > +	control_writeln("DONE");
> > +
> > +	close(fd);
> > +}
> > +
> > +static void test_dgram_sendto_server(const struct test_opts *opts)
> > +{
> > +	union {
> > +		struct sockaddr sa;
> > +		struct sockaddr_vm svm;
> > +	} addr = {
> > +		.svm = {
> > +			.svm_family = AF_VSOCK,
> > +			.svm_port = 1234,
> > +			.svm_cid = VMADDR_CID_ANY,
> > +		},
> > +	};
> > +	int fd;
> > +	int len = sizeof(addr.sa);
> > +
> > +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > 
> > 
> > 
> > ^^^
> > I think we can check 'socket()' return value;
> > 

Gotcha, I'll add in next rev.

> > 
> > 
> > 
> > +
> > +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> > +		perror("bind");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	/* Notify the client that the server is ready */
> > +	control_writeln("BIND");
> > +
> > +	recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> > +
> > +	/* Wait for the client to finish */
> > +	control_expectln("DONE");
> > +
> > +	close(fd);
> > +}
> > +
> > +static void test_dgram_connect_client(const struct test_opts *opts)
> > +{
> > +	union {
> > +		struct sockaddr sa;
> > +		struct sockaddr_vm svm;
> > +	} addr = {
> > +		.svm = {
> > +			.svm_family = AF_VSOCK,
> > +			.svm_port = 1234,
> > +			.svm_cid = opts->peer_cid,
> > +		},
> > +	};
> > +	int fd;
> > +	int ret;
> > +
> > +	/* Wait for the server to be ready */
> > +	control_expectln("BIND");
> > +
> > +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > +	if (fd < 0) {
> > +		perror("bind");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	ret = connect(fd, &addr.sa, sizeof(addr.svm));
> > +	if (ret < 0) {
> > +		perror("connect");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	send_byte(fd, 1, 0);
> > +
> > +	/* Notify the server that the client has finished */
> > +	control_writeln("DONE");
> > +
> > +	close(fd);
> > +}
> > +
> > +static void test_dgram_connect_server(const struct test_opts *opts)
> > +{
> > +	test_dgram_sendto_server(opts);
> > +}
> > +
> >  /* With the standard socket sizes, VMCI is able to support about 100
> >   * concurrent stream connections.
> >   */
> > @@ -255,6 +362,77 @@ static void test_stream_multiconn_server(const struct test_opts *opts)
> >  		close(fds[i]);
> >  }
> >  
> > +static void test_dgram_multiconn_client(const struct test_opts *opts)
> > +{
> > +	int fds[MULTICONN_NFDS];
> > +	int i;
> > +	union {
> > +		struct sockaddr sa;
> > +		struct sockaddr_vm svm;
> > +	} addr = {
> > +		.svm = {
> > +			.svm_family = AF_VSOCK,
> > +			.svm_port = 1234,
> > +			.svm_cid = opts->peer_cid,
> > +		},
> > +	};
> > +
> > +	/* Wait for the server to be ready */
> > +	control_expectln("BIND");
> > +
> > +	for (i = 0; i < MULTICONN_NFDS; i++) {
> > +		fds[i] = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > +		if (fds[i] < 0) {
> > +			perror("socket");
> > +			exit(EXIT_FAILURE);
> > +		}
> > +	}
> > +
> > +	for (i = 0; i < MULTICONN_NFDS; i++)
> > +		sendto_byte(fds[i], &addr.sa, sizeof(addr.svm), 1, 0);
> > +
> > +	/* Notify the server that the client has finished */
> > +	control_writeln("DONE");
> > +
> > +	for (i = 0; i < MULTICONN_NFDS; i++)
> > +		close(fds[i]);
> > +}
> > +
> > +static void test_dgram_multiconn_server(const struct test_opts *opts)
> > +{
> > +	union {
> > +		struct sockaddr sa;
> > +		struct sockaddr_vm svm;
> > +	} addr = {
> > +		.svm = {
> > +			.svm_family = AF_VSOCK,
> > +			.svm_port = 1234,
> > +			.svm_cid = VMADDR_CID_ANY,
> > +		},
> > +	};
> > +	int fd;
> > +	int len = sizeof(addr.sa);
> > +	int i;
> > +
> > +	fd = socket(AF_VSOCK, SOCK_DGRAM, 0);
> > 
> > 
> > 
> > ^^^
> > I think we can check 'socket()' return value;
> > 
> > 
> > 
> > +
> > +	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
> > +		perror("bind");
> > +		exit(EXIT_FAILURE);
> > +	}
> > +
> > +	/* Notify the client that the server is ready */
> > +	control_writeln("BIND");
> > +
> > +	for (i = 0; i < MULTICONN_NFDS; i++)
> > +		recvfrom_byte(fd, &addr.sa, &len, 1, 0);
> > +
> > +	/* Wait for the client to finish */
> > +	control_expectln("DONE");
> > +
> > +	close(fd);
> > +}
> > +
> >  static void test_stream_msg_peek_client(const struct test_opts *opts)
> >  {
> >  	int fd;
> > @@ -1128,6 +1306,21 @@ static struct test_case test_cases[] = {
> >  		.run_client = test_stream_virtio_skb_merge_client,
> >  		.run_server = test_stream_virtio_skb_merge_server,
> >  	},
> > +	{
> > +		.name = "SOCK_DGRAM client close",
> > +		.run_client = test_dgram_sendto_client,
> > +		.run_server = test_dgram_sendto_server,
> > +	},
> > +	{
> > +		.name = "SOCK_DGRAM client connect",
> > +		.run_client = test_dgram_connect_client,
> > +		.run_server = test_dgram_connect_server,
> > +	},
> > +	{
> > +		.name = "SOCK_DGRAM multiple connections",
> > +		.run_client = test_dgram_multiconn_client,
> > +		.run_server = test_dgram_multiconn_server,
> > +	},
> > 
> > 
> > 
> > 
> > SOCK_DGRAM guarantees message bounds, I think it will be good to add such test like in SOCK_SEQPACKET tests.

Agreed, I'll write one for the next rev.

> > 
> > Thanks, Arseniy

Thanks for the review!
> > 
> > 
> >  	{},
> >  };
> >  
> > 

^ permalink raw reply

* Re: [PATCH] x86/hyperv: Trace hv_set_register()
From: Dave Hansen @ 2023-06-07 11:27 UTC (permalink / raw)
  To: Nischala Yelchuri, linux-hyperv, linux-kernel
  Cc: Tyler Hicks, boqun.feng, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Nischala Yelchuri
In-Reply-To: <1686101757-23985-1-git-send-email-niyelchu@linux.microsoft.com>

On 6/6/23 18:35, Nischala Yelchuri wrote:
>  void hv_set_register(unsigned int reg, u64 value)
>  {
> +	trace_hyperv_set_register(reg, value);
> +
>  	if (hv_nested)
>  		reg = hv_get_nested_reg(reg);

I can't help but wonder if this is just a patch for people that don't
know how to set kprobes.

^ permalink raw reply

* Re: [PATCH v2 04/13] arm64/arch_timer: Provide noinstr sched_clock_read() functions
From: Valentin Schneider @ 2023-06-07  8:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: bigeasy, mark.rutland, maz, catalin.marinas, will, chenhuacai,
	kernel, hca, gor, agordeev, borntraeger, svens, pbonzini,
	wanpengli, vkuznets, tglx, mingo, bp, dave.hansen, x86, hpa,
	jgross, boris.ostrovsky, daniel.lezcano, kys, haiyangz, wei.liu,
	decui, rafael, longman, boqun.feng, pmladek, senozhatsky, rostedt,
	john.ogness, juri.lelli, vincent.guittot, dietmar.eggemann,
	bsegall, mgorman, bristot, jstultz, sboyd, linux-kernel,
	loongarch, linux-s390, kvm, linux-hyperv, linux-pm
In-Reply-To: <20230602115451.GG620383@hirez.programming.kicks-ass.net>

On 02/06/23 13:54, Peter Zijlstra wrote:
> On Wed, May 24, 2023 at 05:40:47PM +0100, Valentin Schneider wrote:
>>
>> So this bit sent me on a little spelunking session :-)
>>
>> From a control flow perspective the initialization isn't required, but then
>> I looked into the comment and found it comes from the
>> arch_timer_read_counter() definition... Which itself doesn't get used by
>> sched_clock() until the sched_clock_register() below!
>>
>> So AFAICT that comment was true as of
>>
>>   220069945b29 ("clocksource: arch_timer: Add support for memory mapped timers")
>>
>> but not after a commit that came 2 months later:
>>
>>   65cd4f6c99c1 ("arch_timer: Move to generic sched_clock framework")
>>
>> which IIUC made arm/arm64 follow the default approach of using the
>> jiffy-based sched_clock() before probing DT/ACPI and registering a "proper"
>> sched_clock.
>>
>> All of that to say: the comment about arch_timer_read_counter() vs early
>> sched_clock() doesn't apply anymore, but I think we need to keep its
>> initalization around for stuff like get_cycles(). This initialization here
>> should be OK to put to the bin, though.
>
> Something like the below folded in then?
>

Much better, thank you!


^ permalink raw reply

* Re: [PATCH v6] hv_netvsc: Allocate rx indirection table size dynamically
From: patchwork-bot+netdevbpf @ 2023-06-07  8:50 UTC (permalink / raw)
  To: Shradha Gupta
  Cc: linux-kernel, linux-hyperv, netdev, edumazet, kuba, pabeni, kys,
	haiyangz, wei.liu, decui, longli, mikelley, davem, steen.hegelund,
	simon.horman
In-Reply-To: <1685964606-24690-1-git-send-email-shradhagupta@linux.microsoft.com>

Hello:

This patch was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:

On Mon,  5 Jun 2023 04:30:06 -0700 you wrote:
> Allocate the size of rx indirection table dynamically in netvsc
> from the value of size provided by OID_GEN_RECEIVE_SCALE_CAPABILITIES
> query instead of using a constant value of ITAB_NUM.
> 
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> Tested-on: Ubuntu22 (azure VM, SKU size: Standard_F72s_v2)
> Testcases:
> 1. ethtool -x eth0 output
> 2. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic
> 3. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV
> 
> [...]

Here is the summary with links:
  - [v6] hv_netvsc: Allocate rx indirection table size dynamically
    https://git.kernel.org/netdev/net-next/c/4cab498f33f7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Negocjacje-wskaźnik skuteczności
From:  Tomasz Wiewiór  @ 2023-06-07  8:11 UTC (permalink / raw)
  To: linux-hyperv

Szanowni Państwo,

jako specjaliści w dziedzinie badania potrzeb i projektowania programów szkoleniowych, wiemy, jak ważne jest zapewnienie odpowiednich narzędzi i wiedzy, aby skutecznie zwiększyć efektywność zespołu, szczególnie w dużych organizacjach.

Zamknięte szkolenia są najlepszym rozwiązaniem dla tych, którzy chcą skupić się na konkretnych potrzebach i zagadnieniach. 

Nasza metodyka opiera się na identyfikacji obszarów, które wymagają poprawy, a następnie dostarczaniu spersonalizowanych rozwiązań, które pomagają osiągnąć cele biznesowe.

Z powodzeniem prowadziliśmy projekty szkoleniowe dla renomowanych marek, takich jak: PZU, Bank Pekao S.A., PWC, Ronson Development, Gedeon Richter i wielu innych.

Możemy niezobowiązująco porozmawiać o możliwościach zorganizowania szkolenia dla Państwa firmy?

Pozdrawiam
Tomasz Wiewiór

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox