Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH 5/9] x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
From: Vitaly Kuznetsov @ 2023-06-05 13:00 UTC (permalink / raw)
  To: Tianyu Lan, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, daniel.lezcano, arnd, michael.h.kelley
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel
In-Reply-To: <20230601151624.1757616-6-ltykernel@gmail.com>

Tianyu Lan <ltykernel@gmail.com> writes:

> From: Tianyu Lan <tiala@microsoft.com>
>
> In sev-snp enlightened guest, Hyper-V hypercall needs
> to use vmmcall to trigger vmexit and notify hypervisor
> to handle hypercall request.
>
> There is no x86 SEV SNP feature flag support so far and
> hardware provides MSR_AMD64_SEV register to check SEV-SNP
> capability with MSR_AMD64_SEV_ENABLED bit. ALTERNATIVE can't
> work without SEV-SNP x86 feature flag. May add later when
> the associated flag is introduced. 
>
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/include/asm/mshyperv.h | 44 ++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 31c476f4e656..d859d7c5f5e8 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -61,16 +61,25 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
>  	u64 hv_status;
>  
>  #ifdef CONFIG_X86_64
> -	if (!hv_hypercall_pg)
> -		return U64_MAX;
> +	if (hv_isolation_type_en_snp()) {

Would it be possible to redo 'hv_isolation_type_en_snp()' into a static
inline doing static_branch_unlikely() so we avoid function call penalty
here?

> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     "vmmcall"
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input_address)
> +				     :  "r" (output_address)
> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
> +	} else {
> +		if (!hv_hypercall_pg)
> +			return U64_MAX;
>  
> -	__asm__ __volatile__("mov %4, %%r8\n"
> -			     CALL_NOSPEC
> -			     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> -			       "+c" (control), "+d" (input_address)
> -			     :  "r" (output_address),
> -				THUNK_TARGET(hv_hypercall_pg)
> -			     : "cc", "memory", "r8", "r9", "r10", "r11");
> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     CALL_NOSPEC
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input_address)
> +				     :  "r" (output_address),
> +					THUNK_TARGET(hv_hypercall_pg)
> +				     : "cc", "memory", "r8", "r9", "r10", "r11");
> +	}
>  #else
>  	u32 input_address_hi = upper_32_bits(input_address);
>  	u32 input_address_lo = lower_32_bits(input_address);
> @@ -104,7 +113,13 @@ static inline u64 _hv_do_fast_hypercall8(u64 control, u64 input1)
>  	u64 hv_status;
>  
>  #ifdef CONFIG_X86_64
> -	{
> +	if (hv_isolation_type_en_snp()) {
> +		__asm__ __volatile__(
> +				"vmmcall"
> +				: "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				"+c" (control), "+d" (input1)
> +				:: "cc", "r8", "r9", "r10", "r11");
> +	} else {
>  		__asm__ __volatile__(CALL_NOSPEC
>  				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
>  				       "+c" (control), "+d" (input1)
> @@ -149,7 +164,14 @@ static inline u64 _hv_do_fast_hypercall16(u64 control, u64 input1, u64 input2)
>  	u64 hv_status;
>  
>  #ifdef CONFIG_X86_64
> -	{
> +	if (hv_isolation_type_en_snp()) {
> +		__asm__ __volatile__("mov %4, %%r8\n"
> +				     "vmmcall"
> +				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,
> +				       "+c" (control), "+d" (input1)
> +				     : "r" (input2)
> +				     : "cc", "r8", "r9", "r10", "r11");
> +	} else {
>  		__asm__ __volatile__("mov %4, %%r8\n"
>  				     CALL_NOSPEC
>  				     : "=a" (hv_status), ASM_CALL_CONSTRAINT,

-- 
Vitaly


^ permalink raw reply

* Re: [PATCH 4/9] drivers: hv: Mark shared pages unencrypted in SEV-SNP enlightened guest
From: Vitaly Kuznetsov @ 2023-06-05 12:54 UTC (permalink / raw)
  To: Tianyu Lan, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, daniel.lezcano, arnd, michael.h.kelley
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel
In-Reply-To: <20230601151624.1757616-5-ltykernel@gmail.com>

Tianyu Lan <ltykernel@gmail.com> writes:

> From: Tianyu Lan <tiala@microsoft.com>
>
> Hypervisor needs to access iput arg, VMBus synic event and
> message pages. Mask these pages unencrypted in the sev-snp
> guest and free them only if they have been marked encrypted
> successfully.
>
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  drivers/hv/hv.c        | 57 +++++++++++++++++++++++++++++++++++++++---
>  drivers/hv/hv_common.c | 24 +++++++++++++++++-
>  2 files changed, 77 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index de6708dbe0df..94406dbe0df0 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -20,6 +20,7 @@
>  #include <linux/interrupt.h>
>  #include <clocksource/hyperv_timer.h>
>  #include <asm/mshyperv.h>
> +#include <linux/set_memory.h>
>  #include "hyperv_vmbus.h"
>  
>  /* The one and only */
> @@ -78,7 +79,7 @@ int hv_post_message(union hv_connection_id connection_id,
>  
>  int hv_synic_alloc(void)
>  {
> -	int cpu;
> +	int cpu, ret = -ENOMEM;
>  	struct hv_per_cpu_context *hv_cpu;
>  
>  	/*
> @@ -123,26 +124,76 @@ int hv_synic_alloc(void)
>  				goto err;
>  			}
>  		}
> +
> +		if (hv_isolation_type_en_snp()) {
> +			ret = set_memory_decrypted((unsigned long)
> +				hv_cpu->synic_message_page, 1);
> +			if (ret) {
> +				pr_err("Failed to decrypt SYNIC msg page: %d\n", ret);
> +				hv_cpu->synic_message_page = NULL;
> +
> +				/*
> +				 * Free the event page here and not encrypt
> +				 * the page in hv_synic_free().
> +				 */
> +				free_page((unsigned long)hv_cpu->synic_event_page);
> +				hv_cpu->synic_event_page = NULL;
> +				goto err;
> +			}
> +
> +			ret = set_memory_decrypted((unsigned long)
> +				hv_cpu->synic_event_page, 1);
> +			if (ret) {
> +				pr_err("Failed to decrypt SYNIC event page: %d\n", ret);
> +				hv_cpu->synic_event_page = NULL;
> +				goto err;
> +			}
> +
> +			memset(hv_cpu->synic_message_page, 0, PAGE_SIZE);
> +			memset(hv_cpu->synic_event_page, 0, PAGE_SIZE);
> +		}
>  	}
>  
>  	return 0;
> +
>  err:
>  	/*
>  	 * Any memory allocations that succeeded will be freed when
>  	 * the caller cleans up by calling hv_synic_free()
>  	 */
> -	return -ENOMEM;
> +	return ret;
>  }
>  
>  
>  void hv_synic_free(void)
>  {
> -	int cpu;
> +	int cpu, ret;
>  
>  	for_each_present_cpu(cpu) {
>  		struct hv_per_cpu_context *hv_cpu
>  			= per_cpu_ptr(hv_context.cpu_context, cpu);
>  
> +		/* It's better to leak the page if the encryption fails. */
> +		if (hv_isolation_type_en_snp()) {
> +			if (hv_cpu->synic_message_page) {
> +				ret = set_memory_encrypted((unsigned long)
> +					hv_cpu->synic_message_page, 1);
> +				if (ret) {
> +					pr_err("Failed to encrypt SYNIC msg page: %d\n", ret);
> +					hv_cpu->synic_message_page = NULL;
> +				}
> +			}
> +
> +			if (hv_cpu->synic_event_page) {
> +				ret = set_memory_encrypted((unsigned long)
> +					hv_cpu->synic_event_page, 1);
> +				if (ret) {
> +					pr_err("Failed to encrypt SYNIC event page: %d\n", ret);
> +					hv_cpu->synic_event_page = NULL;
> +				}
> +			}
> +		}
> +
>  		free_page((unsigned long)hv_cpu->synic_event_page);
>  		free_page((unsigned long)hv_cpu->synic_message_page);
>  	}
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 179bc5f5bf52..bed9aa6ac19a 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -24,6 +24,7 @@
>  #include <linux/kmsg_dump.h>
>  #include <linux/slab.h>
>  #include <linux/dma-map-ops.h>
> +#include <linux/set_memory.h>
>  #include <asm/hyperv-tlfs.h>
>  #include <asm/mshyperv.h>
>  
> @@ -359,6 +360,7 @@ int hv_common_cpu_init(unsigned int cpu)
>  	u64 msr_vp_index;
>  	gfp_t flags;
>  	int pgcount = hv_root_partition ? 2 : 1;
> +	int ret;
>  
>  	/* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
>  	flags = irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL;
> @@ -368,6 +370,17 @@ int hv_common_cpu_init(unsigned int cpu)
>  	if (!(*inputarg))
>  		return -ENOMEM;
>  
> +	if (hv_isolation_type_en_snp()) {
> +		ret = set_memory_decrypted((unsigned long)*inputarg, pgcount);
> +		if (ret) {
> +			kfree(*inputarg);
> +			*inputarg = NULL;
> +			return ret;
> +		}
> +
> +		memset(*inputarg, 0x00, pgcount * PAGE_SIZE);
> +	}
> +
>  	if (hv_root_partition) {
>  		outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg);
>  		*outputarg = (char *)(*inputarg) + HV_HYP_PAGE_SIZE;
> @@ -387,7 +400,9 @@ int hv_common_cpu_die(unsigned int cpu)
>  {
>  	unsigned long flags;
>  	void **inputarg, **outputarg;
> +	int pgcount = hv_root_partition ? 2 : 1;
>  	void *mem;
> +	int ret;
>  
>  	local_irq_save(flags);
>  
> @@ -402,7 +417,14 @@ int hv_common_cpu_die(unsigned int cpu)
>  
>  	local_irq_restore(flags);
>  
> -	kfree(mem);
> +	if (hv_isolation_type_en_snp()) {
> +		ret = set_memory_encrypted((unsigned long)mem, pgcount);
> +		if (ret)
> +			pr_warn("Hyper-V: Failed to encrypt input arg on cpu%d: %d\n",
> +				cpu, ret);
> +		/* It's unsafe to free 'mem'. */
> +		return 0;

Why is it unsafe to free 'mem' if ret == 0? Also, why don't we want to
proparate non-zero 'ret' from here to fail CPU offlining?


> +	}
>  
>  	return 0;
>  }

-- 
Vitaly


^ permalink raw reply

* Re: [PATCH 3/9] x86/hyperv: Mark Hyper-V vp assist page unencrypted in SEV-SNP enlightened guest
From: Vitaly Kuznetsov @ 2023-06-05 12:13 UTC (permalink / raw)
  To: Tianyu Lan, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, daniel.lezcano, arnd, michael.h.kelley
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel
In-Reply-To: <20230601151624.1757616-4-ltykernel@gmail.com>

Tianyu Lan <ltykernel@gmail.com> writes:

> From: Tianyu Lan <tiala@microsoft.com>
>
> hv vp assist page needs to be shared between SEV-SNP guest and Hyper-V.
> So mark the page unencrypted in the SEV-SNP guest.
>
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/hyperv/hv_init.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index b4a2327c823b..331b855314b7 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -18,6 +18,7 @@
>  #include <asm/hyperv-tlfs.h>
>  #include <asm/mshyperv.h>
>  #include <asm/idtentry.h>
> +#include <asm/set_memory.h>
>  #include <linux/kexec.h>
>  #include <linux/version.h>
>  #include <linux/vmalloc.h>
> @@ -113,6 +114,11 @@ static int hv_cpu_init(unsigned int cpu)
>  
>  	}
>  	if (!WARN_ON(!(*hvp))) {
> +		if (hv_isolation_type_en_snp()) {
> +			WARN_ON_ONCE(set_memory_decrypted((unsigned long)(*hvp), 1));
> +			memset(*hvp, 0, PAGE_SIZE);
> +		}

Why do we need to set the page as decrypted here and not when we
allocate the page (a few lines above)? And why do we need to clear it
_after_ we made it decrypted? In case we care about not leaking the
stale content to the hypervisor, we should've cleared it _before_, but
the bigger problem I see is that memset() is problemmatic e.g. for KVM
which uses enlightened VMCS. You put a CPU offline and then back online
and this path will be taken. Clearing VP assist page will likely brake
things. (AFAIU SEV-SNP Hyper-V guests don't expose SVM yet so the
problem is likely theoretical only, but still).

> +
>  		msr.enable = 1;
>  		wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, msr.as_uint64);
>  	}

-- 
Vitaly


^ permalink raw reply

* Re: [PATCH 1/9] x86/hyperv: Add sev-snp enlightened guest static key
From: Vitaly Kuznetsov @ 2023-06-05 12:09 UTC (permalink / raw)
  To: Tianyu Lan
  Cc: Tianyu Lan, linux-arch, linux-hyperv, linux-kernel, kys, haiyangz,
	wei.liu, decui, tglx, mingo, bp, dave.hansen, x86, hpa,
	daniel.lezcano, arnd, michael.h.kelley
In-Reply-To: <20230601151624.1757616-2-ltykernel@gmail.com>

Tianyu Lan <ltykernel@gmail.com> writes:

> From: Tianyu Lan <tiala@microsoft.com>
>
> Introduce static key isolation_type_en_snp for enlightened
> sev-snp guest check.
>
> Signed-off-by: Tianyu Lan <tiala@microsoft.com>
> ---
>  arch/x86/hyperv/ivm.c           | 11 +++++++++++
>  arch/x86/include/asm/mshyperv.h |  3 +++
>  arch/x86/kernel/cpu/mshyperv.c  |  8 ++++++--
>  drivers/hv/hv_common.c          |  6 ++++++
>  include/asm-generic/mshyperv.h  | 12 +++++++++---
>  5 files changed, 35 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index cc92388b7a99..5d3ee3124e00 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -409,3 +409,14 @@ bool hv_isolation_type_snp(void)
>  {
>  	return static_branch_unlikely(&isolation_type_snp);
>  }
> +
> +DEFINE_STATIC_KEY_FALSE(isolation_type_en_snp);
> +/*
> + * hv_isolation_type_en_snp - Check system runs in the AMD SEV-SNP based
> + * isolation enlightened VM.
> + */
> +bool hv_isolation_type_en_snp(void)
> +{
> +	return static_branch_unlikely(&isolation_type_en_snp);
> +}
> +
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 49bb4f2bd300..31c476f4e656 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -26,6 +26,7 @@
>  union hv_ghcb;
>  
>  DECLARE_STATIC_KEY_FALSE(isolation_type_snp);
> +DECLARE_STATIC_KEY_FALSE(isolation_type_en_snp);
>  
>  typedef int (*hyperv_fill_flush_list_func)(
>  		struct hv_guest_mapping_flush_list *flush,
> @@ -45,6 +46,8 @@ extern void *hv_hypercall_pg;
>  
>  extern u64 hv_current_partition_id;
>  
> +extern bool hv_isolation_type_en_snp(void);
> +
>  extern union hv_ghcb * __percpu *hv_ghcb_pg;
>  
>  int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index c7969e806c64..9186453251f7 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -402,8 +402,12 @@ static void __init ms_hyperv_init_platform(void)
>  		pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",
>  			ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);
>  
> -		if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
> +
> +		if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
> +			static_branch_enable(&isolation_type_en_snp);
> +		} else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) {
>  			static_branch_enable(&isolation_type_snp);

Nitpick: In case 'isolation_type_snp' and 'isolation_type_en_snp' are
mutually exclusive, I'd suggest we rename the former: it is quite
un-intuitive that for an enlightened SNP guest '&isolation_type_snp' is
NOT enabled. E.g. we can use

'isol_type_snp_paravisor'
and
'isol_type_snp_enlightened'

(I also don't like 'isolation_type_en_snp' name as 'en' normally stands
for 'enabled')

> +		}
>  	}
>  
>  	if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
> @@ -473,7 +477,7 @@ static void __init ms_hyperv_init_platform(void)
>  
>  #if IS_ENABLED(CONFIG_HYPERV)
>  	if ((hv_get_isolation_type() == HV_ISOLATION_TYPE_VBS) ||
> -	    (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP))
> +	    ms_hyperv.paravisor_present)
>  		hv_vtom_init();
>  	/*
>  	 * Setup the hook to get control post apic initialization.
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 64f9ceca887b..179bc5f5bf52 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -502,6 +502,12 @@ bool __weak hv_isolation_type_snp(void)
>  }
>  EXPORT_SYMBOL_GPL(hv_isolation_type_snp);
>  
> +bool __weak hv_isolation_type_en_snp(void)
> +{
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(hv_isolation_type_en_snp);
> +
>  void __weak hv_setup_vmbus_handler(void (*handler)(void))
>  {
>  }
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index 402a8c1c202d..d444f831d633 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -36,15 +36,21 @@ struct ms_hyperv_info {
>  	u32 nested_features;
>  	u32 max_vp_index;
>  	u32 max_lp_index;
> -	u32 isolation_config_a;
> +	union {
> +		u32 isolation_config_a;
> +		struct {
> +			u32 paravisor_present : 1;
> +			u32 reserved1 : 31;
> +		};
> +	};
>  	union {
>  		u32 isolation_config_b;
>  		struct {
>  			u32 cvm_type : 4;
> -			u32 reserved1 : 1;
> +			u32 reserved2 : 1;
>  			u32 shared_gpa_boundary_active : 1;
>  			u32 shared_gpa_boundary_bits : 6;
> -			u32 reserved2 : 20;
> +			u32 reserved3 : 20;

Maybe use 'reserved_a1', 'reserved_b1', 'reserved_b2',... to avoid the
need to rename in the future when more bits from isolation_config_a get
used?

>  		};
>  	};
>  	u64 shared_gpa_boundary;

-- 
Vitaly


^ permalink raw reply

* [PATCH 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.
From: Wei Hu @ 2023-06-05 11:43 UTC (permalink / raw)
  To: netdev, linux-hyperv, linux-rdma, longli, sharmaajay, jgg, leon,
	kys, haiyangz, wei.liu, decui, davem, edumazet, kuba, pabeni,
	vkuznets, ssengar, shradhagupta, weh

Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext
to receive interrupt. Attach EQ when CQ is created. Call CQ interrupt
handler when completion interrupt happens. EQs are destroyed when
ucontext is deallocated.

The change calls some public APIs in mana ethernet driver to
allocate EQs and other resources. Ehe EQ process routine is also shared
by mana ethernet and mana ib drivers.

Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Wei Hu <weh@microsoft.com>
---
 drivers/infiniband/hw/mana/cq.c               |  32 ++++-
 drivers/infiniband/hw/mana/main.c             |  87 ++++++++++++
 drivers/infiniband/hw/mana/mana_ib.h          |   4 +
 drivers/infiniband/hw/mana/qp.c               |  87 +++++++++++-
 .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++--------
 drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
 include/net/mana/gdma.h                       |   9 +-
 7 files changed, 287 insertions(+), 64 deletions(-)

diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index d141cab8a1e6..3cd680e0e753 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -12,13 +12,20 @@ int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	struct ib_device *ibdev = ibcq->device;
 	struct mana_ib_create_cq ucmd = {};
 	struct mana_ib_dev *mdev;
+	struct gdma_context *gc;
+	struct gdma_dev *gd;
 	int err;
 
 	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+	gd = mdev->gdma_dev;
+	gc = gd->gdma_context;
 
 	if (udata->inlen < sizeof(ucmd))
 		return -EINVAL;
 
+	cq->comp_vector = attr->comp_vector > gc->max_num_queues ?
+				0 : attr->comp_vector;
+
 	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
 	if (err) {
 		ibdev_dbg(ibdev,
@@ -69,11 +76,32 @@ int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
 	struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
 	struct ib_device *ibdev = ibcq->device;
 	struct mana_ib_dev *mdev;
+	struct gdma_context *gc;
+	struct gdma_dev *gd;
+
 
 	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+	gd = mdev->gdma_dev;
+	gc = gd->gdma_context;
 
-	mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
-	ib_umem_release(cq->umem);
+
+
+	if (atomic_read(&ibcq->usecnt) == 0) {
+		mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
+		ibdev_dbg(ibdev, "freeing gdma cq %p\n", gc->cq_table[cq->id]);
+		kfree(gc->cq_table[cq->id]);
+		gc->cq_table[cq->id] = NULL;
+		ib_umem_release(cq->umem);
+	}
 
 	return 0;
 }
+
+void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq)
+{
+	struct mana_ib_cq *cq = ctx;
+	struct ib_device *ibdev = cq->ibcq.device;
+
+	ibdev_dbg(ibdev, "Enter %s %d\n", __func__, __LINE__);
+	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+}
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 7be4c3adb4e2..e4efbcaed10e 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -143,6 +143,81 @@ int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 	return err;
 }
 
+static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
+			       struct mana_ib_dev *mdev)
+{
+	struct gdma_context *gc = mdev->gdma_dev->gdma_context;
+	struct ib_device *ibdev = ucontext->ibucontext.device;
+	struct gdma_queue *eq;
+	int i;
+
+	if (!ucontext->eqs)
+		return;
+
+	for (i = 0; i < gc->max_num_queues; i++) {
+		eq = ucontext->eqs[i].eq;
+		if (!eq)
+			continue;
+
+		mana_gd_destroy_queue(gc, eq);
+	}
+
+	kfree(ucontext->eqs);
+	ucontext->eqs = NULL;
+
+	ibdev_dbg(ibdev, "destroyed eq's count %d\n", gc->max_num_queues);
+}
+
+static int mana_ib_create_eq(struct mana_ib_ucontext *ucontext,
+			     struct mana_ib_dev *mdev)
+{
+	struct gdma_queue_spec spec = {};
+	struct gdma_queue *queue;
+	struct gdma_context *gc;
+	struct ib_device *ibdev;
+	struct gdma_dev *gd;
+	int err;
+	int i;
+
+	if (!ucontext || !mdev)
+		return -EINVAL;
+
+	ibdev = ucontext->ibucontext.device;
+	gd = mdev->gdma_dev;
+
+	gc = gd->gdma_context;
+
+	ucontext->eqs = kcalloc(gc->max_num_queues, sizeof(struct mana_eq),
+				GFP_KERNEL);
+	if (!ucontext->eqs)
+		return -ENOMEM;
+
+	spec.type = GDMA_EQ;
+	spec.monitor_avl_buf = false;
+	spec.queue_size = EQ_SIZE;
+	spec.eq.callback = NULL;
+	spec.eq.context = ucontext->eqs;
+	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
+	spec.eq.msix_allocated = true;
+
+	for (i = 0; i < gc->max_num_queues; i++) {
+		spec.eq.msix_index = i;
+		err = mana_gd_create_mana_eq(gd, &spec, &queue);
+		if (err)
+			goto out;
+
+		queue->eq.disable_needed = true;
+		ucontext->eqs[i].eq = queue;
+	}
+
+	return 0;
+
+out:
+	ibdev_dbg(ibdev, "Failed to allocated eq err %d\n", err);
+	mana_ib_destroy_eq(ucontext, mdev);
+	return err;
+}
+
 static int mana_gd_destroy_doorbell_page(struct gdma_context *gc,
 					 int doorbell_page)
 {
@@ -225,7 +300,17 @@ int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
 
 	ucontext->doorbell = doorbell_page;
 
+	ret = mana_ib_create_eq(ucontext, mdev);
+	if (ret) {
+		ibdev_dbg(ibdev, "Failed to create eq's , ret %d\n", ret);
+		goto err;
+	}
+
 	return 0;
+
+err:
+	mana_gd_destroy_doorbell_page(gc, doorbell_page);
+	return ret;
 }
 
 void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
@@ -240,6 +325,8 @@ void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
 	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
 	gc = mdev->gdma_dev->gdma_context;
 
+	mana_ib_destroy_eq(mana_ucontext, mdev);
+
 	ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
 	if (ret)
 		ibdev_dbg(ibdev, "Failed to destroy doorbell page %d\n", ret);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 502cc8672eef..9672fa1670a5 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -67,6 +67,7 @@ struct mana_ib_cq {
 	int cqe;
 	u64 gdma_region;
 	u64 id;
+	u32 comp_vector;
 };
 
 struct mana_ib_qp {
@@ -86,6 +87,7 @@ struct mana_ib_qp {
 struct mana_ib_ucontext {
 	struct ib_ucontext ibucontext;
 	u32 doorbell;
+	struct mana_eq *eqs;
 };
 
 struct mana_ib_rwq_ind_table {
@@ -159,4 +161,6 @@ int mana_ib_query_gid(struct ib_device *ibdev, u32 port, int index,
 
 void mana_ib_disassociate_ucontext(struct ib_ucontext *ibcontext);
 
+void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq);
+
 #endif
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 54b61930a7fd..0938085bded3 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -96,16 +96,20 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
 	struct mana_ib_dev *mdev =
 		container_of(pd->device, struct mana_ib_dev, ib_dev);
+	struct ib_ucontext *ib_ucontext = pd->uobject->context;
 	struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
 	struct mana_ib_create_qp_rss_resp resp = {};
 	struct mana_ib_create_qp_rss ucmd = {};
+	struct mana_ib_ucontext *mana_ucontext;
 	struct gdma_dev *gd = mdev->gdma_dev;
 	mana_handle_t *mana_ind_table;
 	struct mana_port_context *mpc;
+	struct gdma_queue *gdma_cq;
 	struct mana_context *mc;
 	struct net_device *ndev;
 	struct mana_ib_cq *cq;
 	struct mana_ib_wq *wq;
+	struct mana_eq *eq;
 	unsigned int ind_tbl_size;
 	struct ib_cq *ibcq;
 	struct ib_wq *ibwq;
@@ -114,6 +118,8 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	int ret;
 
 	mc = gd->driver_data;
+	mana_ucontext =
+		container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
 
 	if (!udata || udata->inlen < sizeof(ucmd))
 		return -EINVAL;
@@ -180,6 +186,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	for (i = 0; i < ind_tbl_size; i++) {
 		struct mana_obj_spec wq_spec = {};
 		struct mana_obj_spec cq_spec = {};
+		unsigned int max_num_queues = gd->gdma_context->max_num_queues;
 
 		ibwq = ind_tbl->ind_tbl[i];
 		wq = container_of(ibwq, struct mana_ib_wq, ibwq);
@@ -193,7 +200,8 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		cq_spec.gdma_region = cq->gdma_region;
 		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
 		cq_spec.modr_ctx_id = 0;
-		cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+		eq = &mana_ucontext->eqs[cq->comp_vector % max_num_queues];
+		cq_spec.attached_eq = eq->eq->id;
 
 		ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
 					 &wq_spec, &cq_spec, &wq->rx_object);
@@ -207,6 +215,9 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		wq->id = wq_spec.queue_index;
 		cq->id = cq_spec.queue_index;
 
+		ibdev_dbg(&mdev->ib_dev, "attached eq id %u  cq with id %llu\n",
+			eq->eq->id, cq->id);
+
 		ibdev_dbg(&mdev->ib_dev,
 			  "ret %d rx_object 0x%llx wq id %llu cq id %llu\n",
 			  ret, wq->rx_object, wq->id, cq->id);
@@ -215,6 +226,26 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		resp.entries[i].wqid = wq->id;
 
 		mana_ind_table[i] = wq->rx_object;
+
+		if (gd->gdma_context->cq_table[cq->id] == NULL) {
+
+			gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
+			if (!gdma_cq) {
+				ibdev_dbg(&mdev->ib_dev,
+					 "failed to allocate gdma_cq\n");
+				goto free_cq;
+			}
+
+			ibdev_dbg(&mdev->ib_dev, "gdma cq allocated %p\n",
+				  gdma_cq);
+
+			gdma_cq->cq.context = cq;
+			gdma_cq->type = GDMA_CQ;
+			gdma_cq->cq.callback = mana_ib_cq_handler;
+			gdma_cq->id = cq->id;
+			gd->gdma_context->cq_table[cq->id] = gdma_cq;
+		}
+
 	}
 	resp.num_entries = i;
 
@@ -224,7 +255,7 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 					 ucmd.rx_hash_key_len,
 					 ucmd.rx_hash_key);
 	if (ret)
-		goto fail;
+		goto free_cq;
 
 	ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
 	if (ret) {
@@ -238,6 +269,23 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 
 	return 0;
 
+free_cq:
+	{
+		int j = i;
+		u64 cqid;
+
+		while (j-- > 0) {
+			cqid = resp.entries[j].cqid;
+			gdma_cq = gd->gdma_context->cq_table[cqid];
+			cq = gdma_cq->cq.context;
+			if (atomic_read(&cq->ibcq.usecnt) == 0) {
+				kfree(gd->gdma_context->cq_table[cqid]);
+				gd->gdma_context->cq_table[cqid] = NULL;
+			}
+		}
+
+	}
+
 fail:
 	while (i-- > 0) {
 		ibwq = ind_tbl->ind_tbl[i];
@@ -269,10 +317,12 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	struct mana_obj_spec wq_spec = {};
 	struct mana_obj_spec cq_spec = {};
 	struct mana_port_context *mpc;
+	struct gdma_queue *gdma_cq;
 	struct mana_context *mc;
 	struct net_device *ndev;
 	struct ib_umem *umem;
-	int err;
+	struct mana_eq *eq;
+	int err, eq_vec;
 	u32 port;
 
 	mc = gd->driver_data;
@@ -350,7 +400,9 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	cq_spec.gdma_region = send_cq->gdma_region;
 	cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
 	cq_spec.modr_ctx_id = 0;
-	cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+	eq_vec = send_cq->comp_vector % gd->gdma_context->max_num_queues;
+	eq = &mana_ucontext->eqs[eq_vec];
+	cq_spec.attached_eq = eq->eq->id;
 
 	err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
 				 &cq_spec, &qp->tx_object);
@@ -368,6 +420,24 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	qp->sq_id = wq_spec.queue_index;
 	send_cq->id = cq_spec.queue_index;
 
+	if (gd->gdma_context->cq_table[send_cq->id] == NULL) {
+
+		gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
+		if (!gdma_cq) {
+			pr_err("failed to allocate gdma_cq\n");
+			goto err_destroy_wqobj_and_cq;
+		}
+
+		pr_debug("gdma cq allocated %p\n", gdma_cq);
+		gdma_cq->cq.context = send_cq;
+		gdma_cq->type = GDMA_CQ;
+		gdma_cq->cq.callback = mana_ib_cq_handler;
+		gdma_cq->id = send_cq->id;
+		gd->gdma_context->cq_table[send_cq->id] = gdma_cq;
+	} else {
+		gdma_cq = gd->gdma_context->cq_table[send_cq->id];
+	}
+
 	ibdev_dbg(&mdev->ib_dev,
 		  "ret %d qp->tx_object 0x%llx sq id %llu cq id %llu\n", err,
 		  qp->tx_object, qp->sq_id, send_cq->id);
@@ -381,12 +451,17 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 		ibdev_dbg(&mdev->ib_dev,
 			  "Failed copy udata for create qp-raw, %d\n",
 			  err);
-		goto err_destroy_wq_obj;
+		goto err_destroy_wqobj_and_cq;
 	}
 
 	return 0;
 
-err_destroy_wq_obj:
+err_destroy_wqobj_and_cq:
+	if (atomic_read(&send_cq->ibcq.usecnt) == 0) {
+		kfree(gdma_cq);
+		gd->gdma_context->cq_table[send_cq->id] = NULL;
+	}
+
 	mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
 
 err_destroy_dma_region:
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..8231d77628d9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -368,53 +368,57 @@ static void mana_gd_process_eqe(struct gdma_queue *eq)
 	}
 }
 
-static void mana_gd_process_eq_events(void *arg)
+static void mana_gd_process_eq_events(struct list_head *eq_list)
 {
 	u32 owner_bits, new_bits, old_bits;
 	union gdma_eqe_info eqe_info;
 	struct gdma_eqe *eq_eqe_ptr;
-	struct gdma_queue *eq = arg;
 	struct gdma_context *gc;
+	struct gdma_queue *eq;
 	struct gdma_eqe *eqe;
 	u32 head, num_eqe;
 	int i;
 
-	gc = eq->gdma_dev->gdma_context;
+	list_for_each_entry_rcu(eq, eq_list, entry) {
+		gc = eq->gdma_dev->gdma_context;
 
-	num_eqe = eq->queue_size / GDMA_EQE_SIZE;
-	eq_eqe_ptr = eq->queue_mem_ptr;
+		num_eqe = eq->queue_size / GDMA_EQE_SIZE;
+		eq_eqe_ptr = eq->queue_mem_ptr;
 
-	/* Process up to 5 EQEs at a time, and update the HW head. */
-	for (i = 0; i < 5; i++) {
-		eqe = &eq_eqe_ptr[eq->head % num_eqe];
-		eqe_info.as_uint32 = eqe->eqe_info;
-		owner_bits = eqe_info.owner_bits;
+		/* Process up to 5 EQEs at a time, and update the HW head. */
+		for (i = 0; i < 5; i++) {
+			eqe = &eq_eqe_ptr[eq->head % num_eqe];
+			eqe_info.as_uint32 = eqe->eqe_info;
+			owner_bits = eqe_info.owner_bits;
 
-		old_bits = (eq->head / num_eqe - 1) & GDMA_EQE_OWNER_MASK;
-		/* No more entries */
-		if (owner_bits == old_bits)
-			break;
+			old_bits =
+				(eq->head / num_eqe - 1) & GDMA_EQE_OWNER_MASK;
+			/* No more entries */
+			if (owner_bits == old_bits)
+				break;
 
-		new_bits = (eq->head / num_eqe) & GDMA_EQE_OWNER_MASK;
-		if (owner_bits != new_bits) {
-			dev_err(gc->dev, "EQ %d: overflow detected\n", eq->id);
-			break;
-		}
+			new_bits = (eq->head / num_eqe) & GDMA_EQE_OWNER_MASK;
+			if (owner_bits != new_bits) {
+				dev_err(gc->dev, "EQ %d: overflow detected\n",
+					eq->id);
+				break;
+			}
 
-		/* Per GDMA spec, rmb is necessary after checking owner_bits, before
-		 * reading eqe.
-		 */
-		rmb();
+			/* Per GDMA spec, rmb is necessary after checking
+			 * owner_bits, before reading eqe.
+			 */
+			rmb();
 
-		mana_gd_process_eqe(eq);
+			mana_gd_process_eqe(eq);
 
-		eq->head++;
-	}
+			eq->head++;
+		}
 
-	head = eq->head % (num_eqe << GDMA_EQE_OWNER_BITS);
+		head = eq->head % (num_eqe << GDMA_EQE_OWNER_BITS);
 
-	mana_gd_ring_doorbell(gc, eq->gdma_dev->doorbell, eq->type, eq->id,
-			      head, SET_ARM_BIT);
+		mana_gd_ring_doorbell(gc, eq->gdma_dev->doorbell, eq->type,
+				      eq->id, head, SET_ARM_BIT);
+	}
 }
 
 static int mana_gd_register_irq(struct gdma_queue *queue,
@@ -432,44 +436,49 @@ static int mana_gd_register_irq(struct gdma_queue *queue,
 	gc = gd->gdma_context;
 	r = &gc->msix_resource;
 	dev = gc->dev;
+	msi_index = spec->eq.msix_index;
 
 	spin_lock_irqsave(&r->lock, flags);
 
-	msi_index = find_first_zero_bit(r->map, r->size);
-	if (msi_index >= r->size || msi_index >= gc->num_msix_usable) {
-		err = -ENOSPC;
-	} else {
-		bitmap_set(r->map, msi_index, 1);
-		queue->eq.msix_index = msi_index;
-	}
-
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (!spec->eq.msix_allocated) {
+		msi_index = find_first_zero_bit(r->map, r->size);
+		if (msi_index >= r->size || msi_index >= gc->num_msix_usable)
+			err = -ENOSPC;
+		else
+			bitmap_set(r->map, msi_index, 1);
 
-	if (err) {
-		dev_err(dev, "Register IRQ err:%d, msi:%u rsize:%u, nMSI:%u",
-			err, msi_index, r->size, gc->num_msix_usable);
+		if (err) {
+			dev_err(dev, "Register IRQ err:%d, msi:%u rsize:%u, nMSI:%u",
+				err, msi_index, r->size, gc->num_msix_usable);
 
-		return err;
+			goto out;
+		}
 	}
 
+	queue->eq.msix_index = msi_index;
 	gic = &gc->irq_contexts[msi_index];
 
-	WARN_ON(gic->handler || gic->arg);
+	list_add_rcu(&queue->entry, &gic->eq_list);
 
-	gic->arg = queue;
+	WARN_ON(gic->handler);
 
 	gic->handler = mana_gd_process_eq_events;
 
-	return 0;
+out:
+	spin_unlock_irqrestore(&r->lock, flags);
+
+	return err;
 }
 
-static void mana_gd_deregiser_irq(struct gdma_queue *queue)
+static void mana_gd_deregister_irq(struct gdma_queue *queue)
 {
 	struct gdma_dev *gd = queue->gdma_dev;
 	struct gdma_irq_context *gic;
 	struct gdma_context *gc;
 	struct gdma_resource *r;
 	unsigned int msix_index;
+	struct list_head *p, *n;
+	struct gdma_queue *eq;
 	unsigned long flags;
 
 	gc = gd->gdma_context;
@@ -480,13 +489,25 @@ static void mana_gd_deregiser_irq(struct gdma_queue *queue)
 	if (WARN_ON(msix_index >= gc->num_msix_usable))
 		return;
 
+	spin_lock_irqsave(&r->lock, flags);
+
 	gic = &gc->irq_contexts[msix_index];
-	gic->handler = NULL;
-	gic->arg = NULL;
 
-	spin_lock_irqsave(&r->lock, flags);
-	bitmap_clear(r->map, msix_index, 1);
+	list_for_each_safe(p, n, &gic->eq_list) {
+		eq = list_entry(p, struct gdma_queue, entry);
+		if (queue == eq) {
+			list_del_rcu(&eq->entry);
+			break;
+		}
+	}
+
+	if (list_empty(&gic->eq_list)) {
+		gic->handler = NULL;
+		bitmap_clear(r->map, msix_index, 1);
+	}
+
 	spin_unlock_irqrestore(&r->lock, flags);
+	synchronize_rcu();
 
 	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
 }
@@ -550,7 +571,7 @@ static void mana_gd_destroy_eq(struct gdma_context *gc, bool flush_evenets,
 			dev_warn(gc->dev, "Failed to flush EQ: %d\n", err);
 	}
 
-	mana_gd_deregiser_irq(queue);
+	mana_gd_deregister_irq(queue);
 
 	if (queue->eq.disable_needed)
 		mana_gd_disable_queue(queue);
@@ -565,7 +586,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
 	u32 log2_num_entries;
 	int err;
 
-	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
+	queue->eq.msix_index = spec->eq.msix_index;
 
 	log2_num_entries = ilog2(queue->queue_size / GDMA_EQE_SIZE);
 
@@ -602,6 +623,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
 	mana_gd_destroy_eq(gc, false, queue);
 	return err;
 }
+EXPORT_SYMBOL(mana_gd_create_mana_eq);
 
 static void mana_gd_create_cq(const struct gdma_queue_spec *spec,
 			      struct gdma_queue *queue)
@@ -873,6 +895,7 @@ void mana_gd_destroy_queue(struct gdma_context *gc, struct gdma_queue *queue)
 	mana_gd_free_memory(gmi);
 	kfree(queue);
 }
+EXPORT_SYMBOL(mana_gd_destroy_queue);
 
 int mana_gd_verify_vf_version(struct pci_dev *pdev)
 {
@@ -1188,7 +1211,7 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
 	struct gdma_irq_context *gic = arg;
 
 	if (gic->handler)
-		gic->handler(gic->arg);
+		gic->handler(&gic->eq_list);
 
 	return IRQ_HANDLED;
 }
@@ -1241,7 +1264,7 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev)
 	for (i = 0; i < nvec; i++) {
 		gic = &gc->irq_contexts[i];
 		gic->handler = NULL;
-		gic->arg = NULL;
+		INIT_LIST_HEAD(&gic->eq_list);
 
 		if (!i)
 			snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 06d6292e09b3..85345225813f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1156,6 +1156,7 @@ static int mana_create_eq(struct mana_context *ac)
 	spec.eq.callback = NULL;
 	spec.eq.context = ac->eqs;
 	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
+	spec.eq.msix_allocated = false;
 
 	for (i = 0; i < gc->max_num_queues; i++) {
 		err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 96c120160f15..cc728fc42043 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -6,6 +6,7 @@
 
 #include <linux/dma-mapping.h>
 #include <linux/netdevice.h>
+#include <linux/list.h>
 
 #include "shm_channel.h"
 
@@ -291,6 +292,8 @@ struct gdma_queue {
 	u32 head;
 	u32 tail;
 
+	struct list_head entry;
+
 	/* Extra fields specific to EQ/CQ. */
 	union {
 		struct {
@@ -325,6 +328,8 @@ struct gdma_queue_spec {
 			void *context;
 
 			unsigned long log2_throttle_limit;
+			bool msix_allocated;
+			unsigned int msix_index;
 		} eq;
 
 		struct {
@@ -340,8 +345,8 @@ struct gdma_queue_spec {
 #define MANA_IRQ_NAME_SZ 32
 
 struct gdma_irq_context {
-	void (*handler)(void *arg);
-	void *arg;
+	void (*handler)(struct list_head *arg);
+	struct list_head eq_list;
 	char name[MANA_IRQ_NAME_SZ];
 };
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH v6] hv_netvsc: Allocate rx indirection table size dynamically
From: Shradha Gupta @ 2023-06-05 11:30 UTC (permalink / raw)
  To: linux-kernel, linux-hyperv, netdev
  Cc: Shradha Gupta, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Michael Kelley, David S. Miller, Steen Hegelund, Simon Horman

Allocate the size of rx indirection table dynamically in netvsc
from the value of size provided by OID_GEN_RECEIVE_SCALE_CAPABILITIES
query instead of using a constant value of ITAB_NUM.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Tested-on: Ubuntu22 (azure VM, SKU size: Standard_F72s_v2)
Testcases:
1. ethtool -x eth0 output
2. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic
3. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV

---
Changes in v6:
 * fixed net_device_context declaration in rndis_filter_device_remove
---
 drivers/net/hyperv/hyperv_net.h   |  5 ++++-
 drivers/net/hyperv/netvsc_drv.c   | 10 ++++++----
 drivers/net/hyperv/rndis_filter.c | 29 +++++++++++++++++++++++++----
 3 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index dd5919ec408b..c40868f287a9 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -74,6 +74,7 @@ struct ndis_recv_scale_cap { /* NDIS_RECEIVE_SCALE_CAPABILITIES */
 #define NDIS_RSS_HASH_SECRET_KEY_MAX_SIZE_REVISION_2   40
 
 #define ITAB_NUM 128
+#define ITAB_NUM_MAX 256
 
 struct ndis_recv_scale_param { /* NDIS_RECEIVE_SCALE_PARAMETERS */
 	struct ndis_obj_header hdr;
@@ -1034,7 +1035,9 @@ struct net_device_context {
 
 	u32 tx_table[VRSS_SEND_TAB_SIZE];
 
-	u16 rx_table[ITAB_NUM];
+	u16 *rx_table;
+
+	u32 rx_table_sz;
 
 	/* Ethtool settings */
 	u8 duplex;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 0103ff914024..3ba3c8fb28a5 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1747,7 +1747,9 @@ static u32 netvsc_get_rxfh_key_size(struct net_device *dev)
 
 static u32 netvsc_rss_indir_size(struct net_device *dev)
 {
-	return ITAB_NUM;
+	struct net_device_context *ndc = netdev_priv(dev);
+
+	return ndc->rx_table_sz;
 }
 
 static int netvsc_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
@@ -1766,7 +1768,7 @@ static int netvsc_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
 
 	rndis_dev = ndev->extension;
 	if (indir) {
-		for (i = 0; i < ITAB_NUM; i++)
+		for (i = 0; i < ndc->rx_table_sz; i++)
 			indir[i] = ndc->rx_table[i];
 	}
 
@@ -1792,11 +1794,11 @@ static int netvsc_set_rxfh(struct net_device *dev, const u32 *indir,
 
 	rndis_dev = ndev->extension;
 	if (indir) {
-		for (i = 0; i < ITAB_NUM; i++)
+		for (i = 0; i < ndc->rx_table_sz; i++)
 			if (indir[i] >= ndev->num_chn)
 				return -EINVAL;
 
-		for (i = 0; i < ITAB_NUM; i++)
+		for (i = 0; i < ndc->rx_table_sz; i++)
 			ndc->rx_table[i] = indir[i];
 	}
 
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index eea777ec2541..af95947a87c5 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -21,6 +21,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/ucs2_string.h>
 #include <linux/string.h>
+#include <linux/slab.h>
 
 #include "hyperv_net.h"
 #include "netvsc_trace.h"
@@ -927,7 +928,7 @@ static int rndis_set_rss_param_msg(struct rndis_device *rdev,
 	struct rndis_set_request *set;
 	struct rndis_set_complete *set_complete;
 	u32 extlen = sizeof(struct ndis_recv_scale_param) +
-		     4 * ITAB_NUM + NETVSC_HASH_KEYLEN;
+		     4 * ndc->rx_table_sz + NETVSC_HASH_KEYLEN;
 	struct ndis_recv_scale_param *rssp;
 	u32 *itab;
 	u8 *keyp;
@@ -953,7 +954,7 @@ static int rndis_set_rss_param_msg(struct rndis_device *rdev,
 	rssp->hashinfo = NDIS_HASH_FUNC_TOEPLITZ | NDIS_HASH_IPV4 |
 			 NDIS_HASH_TCP_IPV4 | NDIS_HASH_IPV6 |
 			 NDIS_HASH_TCP_IPV6;
-	rssp->indirect_tabsize = 4*ITAB_NUM;
+	rssp->indirect_tabsize = 4 * ndc->rx_table_sz;
 	rssp->indirect_taboffset = sizeof(struct ndis_recv_scale_param);
 	rssp->hashkey_size = NETVSC_HASH_KEYLEN;
 	rssp->hashkey_offset = rssp->indirect_taboffset +
@@ -961,7 +962,7 @@ static int rndis_set_rss_param_msg(struct rndis_device *rdev,
 
 	/* Set indirection table entries */
 	itab = (u32 *)(rssp + 1);
-	for (i = 0; i < ITAB_NUM; i++)
+	for (i = 0; i < ndc->rx_table_sz; i++)
 		itab[i] = ndc->rx_table[i];
 
 	/* Set hask key values */
@@ -1548,6 +1549,18 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
 	if (ret || rsscap.num_recv_que < 2)
 		goto out;
 
+	if (rsscap.num_indirect_tabent &&
+	    rsscap.num_indirect_tabent <= ITAB_NUM_MAX)
+		ndc->rx_table_sz = rsscap.num_indirect_tabent;
+	else
+		ndc->rx_table_sz = ITAB_NUM;
+
+	ndc->rx_table = kcalloc(ndc->rx_table_sz, sizeof(u16), GFP_KERNEL);
+	if (!ndc->rx_table) {
+		ret = -ENOMEM;
+		goto err_dev_remv;
+	}
+
 	/* This guarantees that num_possible_rss_qs <= num_online_cpus */
 	num_possible_rss_qs = min_t(u32, num_online_cpus(),
 				    rsscap.num_recv_que);
@@ -1558,7 +1571,7 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
 	net_device->num_chn = min(net_device->max_chn, device_info->num_chn);
 
 	if (!netif_is_rxfh_configured(net)) {
-		for (i = 0; i < ITAB_NUM; i++)
+		for (i = 0; i < ndc->rx_table_sz; i++)
 			ndc->rx_table[i] = ethtool_rxfh_indir_default(
 						i, net_device->num_chn);
 	}
@@ -1596,11 +1609,19 @@ void rndis_filter_device_remove(struct hv_device *dev,
 				struct netvsc_device *net_dev)
 {
 	struct rndis_device *rndis_dev = net_dev->extension;
+	struct net_device *net = hv_get_drvdata(dev);
+	struct net_device_context *ndc;
+
+	ndc = netdev_priv(net);
 
 	/* Halt and release the rndis device */
 	rndis_filter_halt_device(net_dev, rndis_dev);
 
 	netvsc_device_remove(dev);
+
+	ndc->rx_table_sz = 0;
+	kfree(ndc->rx_table);
+	ndc->rx_table = NULL;
 }
 
 int rndis_filter_open(struct netvsc_device *nvdev)
-- 
2.34.1


^ permalink raw reply related

* Re: Fwd: nvsp_rndis_pkt_complete error status and net_ratelimit: callbacks suppressed messages on 6.4.0rc4
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-06-05 10:28 UTC (permalink / raw)
  To: Michael Kelley
  Cc: Haiyang Zhang, Paolo Abeni, David S. Miller, Jakub Kicinski,
	Linux Kernel Mailing List, Linux Regressions, Linux BPF,
	Linux on Hyper-V, Linux Kernel Network Developers, Bagas Sanjaya
In-Reply-To: <15dd93af-fcd5-5b9a-a6ba-9781768dbae7@gmail.com>

Hi, Thorsten here, the Linux kernel's regression tracker.

On 30.05.23 14:25, Bagas Sanjaya wrote:
> 
> I notice a regression report on Bugzilla [1]. Quoting from it:

Hmmm, nobody replied to this yet (or am I missing something?). Doesn't
seems like it's something urgent, but nevertheless:

Michael, Bagas didn't make it obvious at all, hence please allow me to
ask: did you notice that this is a regression that is apparently caused
by a commit of yours?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

>> After building 6.4.0rc4 for my VM running on a Windows 10 Hyper-V host, I see the following 
>>
>> [  756.697753] net_ratelimit: 34 callbacks suppressed
>> [  756.697806] hv_netvsc cd9dd876-2fa9-4764-baa7-b44482f85f9f eth0: nvsp_rndis_pkt_complete error status: 2
>> (snipped repeated messages)
>>
>> *but*, I'm only able to reliably reproduce this if I am generating garbage on another terminal, e.g. sudo strings /dev/sda
>>
>>
>> This doesn't appear to affect latency or bandwidth a huge amount, I ran an iperf3 test between the guest and the host while trying to cause these messages.
>> Although you if you take 17-18 gigabit as the "base" speed, you can see it drop a bit to 16 gigabit while the errors happen and "catch up" when I stop spamming the console.
>>
>> [  5]  99.00-100.00 sec  1.89 GBytes  16.2 Gbits/sec
>> [  5] 100.00-101.00 sec  1.91 GBytes  16.4 Gbits/sec
>> [  5] 101.00-102.00 sec  1.91 GBytes  16.4 Gbits/sec
>> [  5] 102.00-103.00 sec  1.91 GBytes  16.4 Gbits/sec
>> [  5] 103.00-104.00 sec  1.92 GBytes  16.5 Gbits/sec
>> [  5] 104.00-105.00 sec  1.94 GBytes  16.6 Gbits/sec
>> [  5] 105.00-106.00 sec  1.89 GBytes  16.2 Gbits/sec
>> [  5] 106.00-107.00 sec  1.90 GBytes  16.3 Gbits/sec
>> [  5] 107.00-108.00 sec  2.23 GBytes  19.2 Gbits/sec
>> [  5] 108.00-109.00 sec  2.57 GBytes  22.0 Gbits/sec
>> [  5] 109.00-110.00 sec  2.66 GBytes  22.9 Gbits/sec
>> [  5] 110.00-111.00 sec  2.64 GBytes  22.7 Gbits/sec
>> [  5] 111.00-112.00 sec  2.65 GBytes  22.7 Gbits/sec
>> [  5] 112.00-113.00 sec  2.65 GBytes  22.8 Gbits/sec
>> [  5] 113.00-114.00 sec  2.65 GBytes  22.8 Gbits/sec
>> [  5] 114.00-115.00 sec  2.65 GBytes  22.8 Gbits/sec
>> [  5] 115.00-116.00 sec  2.66 GBytes  22.9 Gbits/sec
>> [  5] 116.00-117.00 sec  2.63 GBytes  22.6 Gbits/sec
>> [  5] 117.00-118.00 sec  2.69 GBytes  23.1 Gbits/sec
>> [  5] 118.00-119.00 sec  2.66 GBytes  22.9 Gbits/sec
>> [  5] 119.00-120.00 sec  2.67 GBytes  22.9 Gbits/sec
>> [  5] 120.00-121.00 sec  2.66 GBytes  22.9 Gbits/sec
>> [  5] 121.00-122.00 sec  2.49 GBytes  21.4 Gbits/sec
>> [  5] 122.00-123.00 sec  2.15 GBytes  18.5 Gbits/sec
>> [  5] 123.00-124.00 sec  2.16 GBytes  18.6 Gbits/sec
>> [  5] 124.00-125.00 sec  2.16 GBytes  18.6 Gbits/sec
>>
> 
> See bugzilla for the full thread.
> 
> Anyway, I'm adding it to regzbot:
> 
> #regzbot introduced: dca5161f9bd052 https://bugzilla.kernel.org/show_bug.cgi?id=217503
> #regzbot title: net_ratelimit and nvsp_rndis_pkt_complete error due to SEND_RNDIS_PKT status check
> 
> Thanks.
> 
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=217503
> 

^ permalink raw reply

* Re: [PATCH v5] hv_netvsc: Allocate rx indirection table size dynamically
From: kernel test robot @ 2023-06-05  9:46 UTC (permalink / raw)
  To: Shradha Gupta, linux-kernel, linux-hyperv, netdev
  Cc: llvm, oe-kbuild-all, Shradha Gupta, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Michael Kelley, Steen Hegelund, Simon Horman
In-Reply-To: <1685949196-16175-1-git-send-email-shradhagupta@linux.microsoft.com>

Hi Shradha,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on horms-ipvs/master v6.4-rc5 next-20230605]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Shradha-Gupta/hv_netvsc-Allocate-rx-indirection-table-size-dynamically/20230605-151438
base:   linus/master
patch link:    https://lore.kernel.org/r/1685949196-16175-1-git-send-email-shradhagupta%40linux.microsoft.com
patch subject: [PATCH v5] hv_netvsc: Allocate rx indirection table size dynamically
config: i386-randconfig-r002-20230605 (https://download.01.org/0day-ci/archive/20230605/202306051725.FOGcInnl-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce (this is a W=1 build):
        mkdir -p ~/bin
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/cd4dda15951edad50a4ffd51e084863ef2f50bd3
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Shradha-Gupta/hv_netvsc-Allocate-rx-indirection-table-size-dynamically/20230605-151438
        git checkout cd4dda15951edad50a4ffd51e084863ef2f50bd3
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang ~/bin/make.cross W=1 O=build_dir ARCH=i386 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang ~/bin/make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash drivers/net/hyperv/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306051725.FOGcInnl-lkp@intel.com/

All errors (new ones prefixed by >>):

>> drivers/net/hyperv/rndis_filter.c:1612:47: error: use of undeclared identifier 'net'
           struct net_device_context *ndc = netdev_priv(net);
                                                        ^
   1 error generated.


vim +/net +1612 drivers/net/hyperv/rndis_filter.c

  1607	
  1608	void rndis_filter_device_remove(struct hv_device *dev,
  1609					struct netvsc_device *net_dev)
  1610	{
  1611		struct rndis_device *rndis_dev = net_dev->extension;
> 1612		struct net_device_context *ndc = netdev_priv(net);
  1613		struct net_device *net = hv_get_drvdata(dev);
  1614	
  1615		/* Halt and release the rndis device */
  1616		rndis_filter_halt_device(net_dev, rndis_dev);
  1617	
  1618		netvsc_device_remove(dev);
  1619	
  1620		ndc->rx_table_sz = 0;
  1621		kfree(ndc->rx_table);
  1622		ndc->rx_table = NULL;
  1623	}
  1624	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [RFC PATCH 1/1] sched/fair: Fix SMT balance dependency on CPU numbering
From: Vincent Guittot @ 2023-06-05  9:30 UTC (permalink / raw)
  To: Michael Kelley
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, bristot, vschneid, linux-kernel, linux-hyperv
In-Reply-To: <1685555673-2363-1-git-send-email-mikelley@microsoft.com>

Hi Michael,

On Wed, 31 May 2023 at 19:55, Michael Kelley <mikelley@microsoft.com> wrote:
>
> With some CPU numbering schemes, the function select_idle_cpu() currently
> has a subtle bias to return the first hyper-thread in a core. As a result
> work is not evenly balanced across hyper-threads in a core. The difference
> is often as much as 15 to 20 percentage points -- i.e., the first
> hyper-thread might run at 45% load while the second hyper-thread runs at
> only 30% load or less.
>
> Two likely CPU numbering schemes make sense with today's typical case
> of two hyper-threads per core:
>
> A. Enumerate all the first hyper-theads in a core, then all the second
>    hyper-threads in a core.  If a system has 8 cores with 16 hyper-threads,
>    CPUs #0 and #8 are in the same core, as are CPUs #1 and #9, etc.
>
> B. Enumerate all hyper-threads in a core, then all hyper-threads in the
>    next core, etc.  Again with 8 cores and 16 hyper-threads, CPUs #0 and
>    #1 are in the same core, as are CPUs #2 and #3, etc.
>
> Scheme A is used in most ACPI bare metal systems and in VMs running on
> KVM.  The enumeration order is determined by the order of the processor
> entries in the ACPI MADT, and the ACPI spec *recommends* that the MADT
> be set up for scheme A.
>
> However, for reasons that pre-date the ACPI recommendation, Hyper-V
> guests have an ACPI MADT that is set up for scheme B.  When using scheme B,
> the subtle bias is evident in select_idle_cpu().  While having Hyper-V
> conform to the ACPI spec recommendation would solve the Hyper-V problem,
> it is also desirable for the fair scheduler code to be independent of the
> CPU numbering scheme.  ACPI is not always the firmware configuration
> mechanism, and CPU numbering schemes might vary more in architectures
> other than x86/x64.
>
> The bias occurs with scheme B when "has_idle_cpu" is true and

I assume that you mean has_idle_core as I can't find has_idle_cpu in the code

> select_idle_core() is called in the for_each_cpu_wrap() loop. Regardless
> of where the loop starts, it will almost immediately encountered a CPU
> that is the first hyper-thread in a core. If that core is idle, the CPU
> number of that first hyper-thread is returned. If that core is not idle,
> both hyper-threads are removed from the cpus mask, and the loop iterates
> to choose another CPU that is the first hyper-thread in a core.  As a
> result, select_idle_core() almost always returns the first hyper-thread
> in a core.
>
> The bias does not occur with scheme A because half of the CPU numbering
> space is a series of CPUs that are the second hyper-thread in all the
> cores. Assuming that the "target" CPU is evenly distributed throughout
> the CPU numbering space, there's a 50/50 chance of starting in the portion
> of the CPU numbering space that is all second hyper-threads.  If
> select_idle_core() finds a idle core, it will likely return a CPU that
> is the second hyper-thread in the core.  On average over the time,
> both the first and second hyper-thread are equally likely to be
> returned.
>
> Fix this bias by determining which hyper-thread in a core the "target"
> CPU is -- i.e., the "smt index" of that CPU.  Then when select_idle_core()
> finds an idle core, it returns the CPU in the core that has the same
> smt index. If that CPU is not valid to be chosen, just return the CPU
> that was passed into select_idle_core() and don't worry about bias.
>
> With scheme B, this fix solves the bias problem by making the chosen
> CPU be roughly equally likely to either hyper-thread.  With scheme A
> there's no real effect as the chosen CPU was already equally likely
> to be either hyper-thread, and still is.
>
> The imbalance in hyper-thread loading was originally observed in a
> customer workload, and then reproduced with a synthetic workload.
> The change has been tested with the synthetic workload in a Hyper-V VM
> running the normal scheme B CPU numbering, and then with the MADT
> replaced with a scheme A version using Linux's ability to override
> ACPI tables. The testing showed no change hyper-thread loading
> balance with the scheme A CPU numbering, but the imbalance is
> corrected if the CPU numbering is scheme B.

You failed to explain why it's important to evenly select 1st or 2nd
hyper-threads on the system.  I don't see any performance figures.
What would be the benefit ?

>
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> ---
>
> I haven't previously worked in Linux scheduler code, so I'm posting this
> as an RFC to point out the observed problem, and to suggest a solution.
> There may well be considerations in the design of a solution that I'm not
> aware of, so please educate me or suggest an alternative.
>
> It's also not completely clear whether an imbalance in hyper-thread
> loading is actually a problem. It looks weird, and causes customer
> concern when it is observed consistently across all cores in some
> production workload. The fair scheduler strives to balance load evenly, so
> I'm treating it as a problem that should be fixed, if for no other reason
> than general goodness. But again, I'm sure reviewers will feel free to
> tell me otherwise. :-) The fix takes relatively few CPU cycles, but it's
> still a non-zero cost.
>
> FWIW, the same imbalance has been observed with kernels as far back as
> 5.4, and the root cause in the code is essentially the same. So it's not
> a recently introduced issue. I haven't tried anything earlier than 5.4.
>
>  kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++------
>  1 file changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f..8b56e9d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6832,6 +6832,19 @@ static inline bool test_idle_cores(int cpu)
>         return false;
>  }
>
> +static inline int get_smt_index(int core)
> +{
> +       int cpu, n = 0;
> +
> +       for_each_cpu(cpu, cpu_smt_mask(core)) {
> +               if (cpu == core)
> +                       return n;
> +               n++;
> +       }
> +       /* If get here, cpu_smt_mask is set up incorrectly */
> +       return 0;
> +}
> +
>  /*
>   * Scans the local SMT mask to see if the entire core is idle, and records this
>   * information in sd_llc_shared->has_idle_cores.
> @@ -6866,10 +6879,11 @@ void __update_idle_core(struct rq *rq)
>   * there are no idle cores left in the system; tracked through
>   * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
>   */
> -static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> +static int select_idle_core(struct task_struct *p, int core, int smt_index,
> +                           struct cpumask *cpus, int *idle_cpu)
>  {
>         bool idle = true;
> -       int cpu;
> +       int cpu, index_cpu, n = 0;
>
>         for_each_cpu(cpu, cpu_smt_mask(core)) {
>                 if (!available_idle_cpu(cpu)) {
> @@ -6885,10 +6899,13 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>                 }
>                 if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
>                         *idle_cpu = cpu;
> +
> +               if (n++ == smt_index)
> +                       index_cpu = cpu;
>         }
>
>         if (idle)
> -               return core;
> +               return cpumask_test_cpu(index_cpu, p->cpus_ptr) ? index_cpu : core;
>
>         cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
>         return -1;
> @@ -6922,7 +6939,13 @@ static inline bool test_idle_cores(int cpu)
>         return false;
>  }
>
> -static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
> +static inline int get_smt_index(int core)
> +{
> +       return 0;
> +}
> +
> +static inline int select_idle_core(struct task_struct *p, int core, int smt_index,
> +                                  struct cpumask *cpus, int *idle_cpu)
>  {
>         return __select_idle_cpu(core, p);
>  }
> @@ -6942,7 +6965,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
>  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
> -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
> +       int i, cpu, smt_index, idle_cpu = -1, nr = INT_MAX;
>         struct sched_domain_shared *sd_share;
>         struct rq *this_rq = this_rq();
>         int this = smp_processor_id();
> @@ -6994,9 +7017,10 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>                 }
>         }
>
> +       smt_index = get_smt_index(target);
>         for_each_cpu_wrap(cpu, cpus, target + 1) {
>                 if (has_idle_core) {
> -                       i = select_idle_core(p, cpu, cpus, &idle_cpu);
> +                       i = select_idle_core(p, cpu, smt_index, cpus, &idle_cpu);
>                         if ((unsigned int)i < nr_cpumask_bits)
>                                 return i;
>
> --
> 1.8.3.1
>

^ permalink raw reply

* Re: [PATCH v5] hv_netvsc: Allocate rx indirection table size dynamically
From: kernel test robot @ 2023-06-05  9:14 UTC (permalink / raw)
  To: Shradha Gupta, linux-kernel, linux-hyperv, netdev
  Cc: oe-kbuild-all, Shradha Gupta, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, Michael Kelley, Steen Hegelund, Simon Horman
In-Reply-To: <1685949196-16175-1-git-send-email-shradhagupta@linux.microsoft.com>

Hi Shradha,

kernel test robot noticed the following build warnings:

[auto build test WARNING on linus/master]
[also build test WARNING on horms-ipvs/master v6.4-rc5 next-20230605]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Shradha-Gupta/hv_netvsc-Allocate-rx-indirection-table-size-dynamically/20230605-151438
base:   linus/master
patch link:    https://lore.kernel.org/r/1685949196-16175-1-git-send-email-shradhagupta%40linux.microsoft.com
patch subject: [PATCH v5] hv_netvsc: Allocate rx indirection table size dynamically
config: x86_64-rhel-8.3 (https://download.01.org/0day-ci/archive/20230605/202306051754.7zMgBFMX-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/cd4dda15951edad50a4ffd51e084863ef2f50bd3
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Shradha-Gupta/hv_netvsc-Allocate-rx-indirection-table-size-dynamically/20230605-151438
        git checkout cd4dda15951edad50a4ffd51e084863ef2f50bd3
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/net/hyperv/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306051754.7zMgBFMX-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/net/hyperv/rndis_filter.c: In function 'rndis_filter_device_remove':
   drivers/net/hyperv/rndis_filter.c:1612:54: error: 'net' undeclared (first use in this function)
    1612 |         struct net_device_context *ndc = netdev_priv(net);
         |                                                      ^~~
   drivers/net/hyperv/rndis_filter.c:1612:54: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/net/hyperv/rndis_filter.c:1613:28: warning: unused variable 'net' [-Wunused-variable]
    1613 |         struct net_device *net = hv_get_drvdata(dev);
         |                            ^~~


vim +/net +1613 drivers/net/hyperv/rndis_filter.c

  1607	
  1608	void rndis_filter_device_remove(struct hv_device *dev,
  1609					struct netvsc_device *net_dev)
  1610	{
  1611		struct rndis_device *rndis_dev = net_dev->extension;
  1612		struct net_device_context *ndc = netdev_priv(net);
> 1613		struct net_device *net = hv_get_drvdata(dev);
  1614	
  1615		/* Halt and release the rndis device */
  1616		rndis_filter_halt_device(net_dev, rndis_dev);
  1617	
  1618		netvsc_device_remove(dev);
  1619	
  1620		ndc->rx_table_sz = 0;
  1621		kfree(ndc->rx_table);
  1622		ndc->rx_table = NULL;
  1623	}
  1624	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* [PATCH v5] hv_netvsc: Allocate rx indirection table size dynamically
From: Shradha Gupta @ 2023-06-05  7:13 UTC (permalink / raw)
  To: linux-kernel, linux-hyperv, netdev
  Cc: Shradha Gupta, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Michael Kelley, David S. Miller, Steen Hegelund, Simon Horman

Allocate the size of rx indirection table dynamically in netvsc
from the value of size provided by OID_GEN_RECEIVE_SCALE_CAPABILITIES
query instead of using a constant value of ITAB_NUM.

Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Tested-on: Ubuntu22 (azure VM, SKU size: Standard_F72s_v2)
Testcases:
1. ethtool -x eth0 output
2. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-Synthetic
3. LISA testcase:PERF-NETWORK-TCP-THROUGHPUT-MULTICONNECTION-NTTTCP-SRIOV

---
Changes in v5:
 * Follwoed the RCT format for varible declarations in rndix_filter.c
---
 drivers/net/hyperv/hyperv_net.h   |  5 ++++-
 drivers/net/hyperv/netvsc_drv.c   | 10 ++++++----
 drivers/net/hyperv/rndis_filter.c | 27 +++++++++++++++++++++++----
 3 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index dd5919ec408b..c40868f287a9 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -74,6 +74,7 @@ struct ndis_recv_scale_cap { /* NDIS_RECEIVE_SCALE_CAPABILITIES */
 #define NDIS_RSS_HASH_SECRET_KEY_MAX_SIZE_REVISION_2   40
 
 #define ITAB_NUM 128
+#define ITAB_NUM_MAX 256
 
 struct ndis_recv_scale_param { /* NDIS_RECEIVE_SCALE_PARAMETERS */
 	struct ndis_obj_header hdr;
@@ -1034,7 +1035,9 @@ struct net_device_context {
 
 	u32 tx_table[VRSS_SEND_TAB_SIZE];
 
-	u16 rx_table[ITAB_NUM];
+	u16 *rx_table;
+
+	u32 rx_table_sz;
 
 	/* Ethtool settings */
 	u8 duplex;
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 0103ff914024..3ba3c8fb28a5 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1747,7 +1747,9 @@ static u32 netvsc_get_rxfh_key_size(struct net_device *dev)
 
 static u32 netvsc_rss_indir_size(struct net_device *dev)
 {
-	return ITAB_NUM;
+	struct net_device_context *ndc = netdev_priv(dev);
+
+	return ndc->rx_table_sz;
 }
 
 static int netvsc_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
@@ -1766,7 +1768,7 @@ static int netvsc_get_rxfh(struct net_device *dev, u32 *indir, u8 *key,
 
 	rndis_dev = ndev->extension;
 	if (indir) {
-		for (i = 0; i < ITAB_NUM; i++)
+		for (i = 0; i < ndc->rx_table_sz; i++)
 			indir[i] = ndc->rx_table[i];
 	}
 
@@ -1792,11 +1794,11 @@ static int netvsc_set_rxfh(struct net_device *dev, const u32 *indir,
 
 	rndis_dev = ndev->extension;
 	if (indir) {
-		for (i = 0; i < ITAB_NUM; i++)
+		for (i = 0; i < ndc->rx_table_sz; i++)
 			if (indir[i] >= ndev->num_chn)
 				return -EINVAL;
 
-		for (i = 0; i < ITAB_NUM; i++)
+		for (i = 0; i < ndc->rx_table_sz; i++)
 			ndc->rx_table[i] = indir[i];
 	}
 
diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index eea777ec2541..5a5dd5007590 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -21,6 +21,7 @@
 #include <linux/rtnetlink.h>
 #include <linux/ucs2_string.h>
 #include <linux/string.h>
+#include <linux/slab.h>
 
 #include "hyperv_net.h"
 #include "netvsc_trace.h"
@@ -927,7 +928,7 @@ static int rndis_set_rss_param_msg(struct rndis_device *rdev,
 	struct rndis_set_request *set;
 	struct rndis_set_complete *set_complete;
 	u32 extlen = sizeof(struct ndis_recv_scale_param) +
-		     4 * ITAB_NUM + NETVSC_HASH_KEYLEN;
+		     4 * ndc->rx_table_sz + NETVSC_HASH_KEYLEN;
 	struct ndis_recv_scale_param *rssp;
 	u32 *itab;
 	u8 *keyp;
@@ -953,7 +954,7 @@ static int rndis_set_rss_param_msg(struct rndis_device *rdev,
 	rssp->hashinfo = NDIS_HASH_FUNC_TOEPLITZ | NDIS_HASH_IPV4 |
 			 NDIS_HASH_TCP_IPV4 | NDIS_HASH_IPV6 |
 			 NDIS_HASH_TCP_IPV6;
-	rssp->indirect_tabsize = 4*ITAB_NUM;
+	rssp->indirect_tabsize = 4 * ndc->rx_table_sz;
 	rssp->indirect_taboffset = sizeof(struct ndis_recv_scale_param);
 	rssp->hashkey_size = NETVSC_HASH_KEYLEN;
 	rssp->hashkey_offset = rssp->indirect_taboffset +
@@ -961,7 +962,7 @@ static int rndis_set_rss_param_msg(struct rndis_device *rdev,
 
 	/* Set indirection table entries */
 	itab = (u32 *)(rssp + 1);
-	for (i = 0; i < ITAB_NUM; i++)
+	for (i = 0; i < ndc->rx_table_sz; i++)
 		itab[i] = ndc->rx_table[i];
 
 	/* Set hask key values */
@@ -1548,6 +1549,18 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
 	if (ret || rsscap.num_recv_que < 2)
 		goto out;
 
+	if (rsscap.num_indirect_tabent &&
+	    rsscap.num_indirect_tabent <= ITAB_NUM_MAX)
+		ndc->rx_table_sz = rsscap.num_indirect_tabent;
+	else
+		ndc->rx_table_sz = ITAB_NUM;
+
+	ndc->rx_table = kcalloc(ndc->rx_table_sz, sizeof(u16), GFP_KERNEL);
+	if (!ndc->rx_table) {
+		ret = -ENOMEM;
+		goto err_dev_remv;
+	}
+
 	/* This guarantees that num_possible_rss_qs <= num_online_cpus */
 	num_possible_rss_qs = min_t(u32, num_online_cpus(),
 				    rsscap.num_recv_que);
@@ -1558,7 +1571,7 @@ struct netvsc_device *rndis_filter_device_add(struct hv_device *dev,
 	net_device->num_chn = min(net_device->max_chn, device_info->num_chn);
 
 	if (!netif_is_rxfh_configured(net)) {
-		for (i = 0; i < ITAB_NUM; i++)
+		for (i = 0; i < ndc->rx_table_sz; i++)
 			ndc->rx_table[i] = ethtool_rxfh_indir_default(
 						i, net_device->num_chn);
 	}
@@ -1596,11 +1609,17 @@ void rndis_filter_device_remove(struct hv_device *dev,
 				struct netvsc_device *net_dev)
 {
 	struct rndis_device *rndis_dev = net_dev->extension;
+	struct net_device_context *ndc = netdev_priv(net);
+	struct net_device *net = hv_get_drvdata(dev);
 
 	/* Halt and release the rndis device */
 	rndis_filter_halt_device(net_dev, rndis_dev);
 
 	netvsc_device_remove(dev);
+
+	ndc->rx_table_sz = 0;
+	kfree(ndc->rx_table);
+	ndc->rx_table = NULL;
 }
 
 int rndis_filter_open(struct netvsc_device *nvdev)
-- 
2.34.1


^ permalink raw reply related

* RE: [EXTERNAL] Re: [PATCH 1/5] uio: Add hv_vmbus_client driver
From: Saurabh Singh Sengar @ 2023-06-05  3:00 UTC (permalink / raw)
  To: Greg KH, Saurabh Sengar
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	Michael Kelley (LINUX), linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org
In-Reply-To: <2023060421-unclothed-hungry-cb3e@gregkh>



> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Sunday, June 4, 2023 12:40 PM
> To: Saurabh Sengar <ssengar@linux.microsoft.com>
> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org
> Subject: [EXTERNAL] Re: [PATCH 1/5] uio: Add hv_vmbus_client driver
> 
> On Fri, Jun 02, 2023 at 12:57:05AM -0700, Saurabh Sengar wrote:
> > + * Since the driver does not declare any device ids, you must
> > + allocate
> > + * id and bind the device to the driver yourself.  For example:
> > + * driverctl -b vmbus set-override <dev uuid> uio_hv_vmbus_client
> 
> Shouldn't this be documented somewhere?

I will update it in Documentation/driver-api/uio-howto.rst.

> 	
> > + */
> > +
> > +#include <linux/device.h>
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/uio_driver.h>
> > +#include <linux/hyperv.h>
> > +#include <linux/vmalloc.h>
> > +#include <linux/sysfs.h>
> > +
> > +#define DRIVER_VERSION	"0.0.1"
> 
> Why this number?  Why not "1"?
> 
> The whole "driver version" thing doesn't really make sense here, we should
> just drop it from the uio later, right?

will remove in V2.

> 
> > +#define DRIVER_AUTHOR	"Saurabh Sengar <ssengar@microsoft.com>"
> > +#define DRIVER_DESC	"Generic UIO driver for low speed VMBus
> devices"
> > +
> > +#define DEFAULT_HV_RING_SIZE	VMBUS_RING_SIZE(3 *
> HV_HYP_PAGE_SIZE)
> > +static int ring_size = DEFAULT_HV_RING_SIZE;
> > +
> > +struct uio_hv_vmbus_dev {
> > +	struct uio_info info;
> > +	struct hv_device *device;
> > +	atomic_t refcnt;
> 
> Why is this refcnt needed?
> 
> Have you seen the other threads about how attempting to block multiple
> opens in UIO drivers really does not work at all?  Please drop all of that logic,
> it is not correct.
> 

Will remove.

> 
> > +static ssize_t ring_size_show(struct device *dev, struct device_attribute
> *attr,
> > +			      char *buf)
> > +{
> > +	return scnprintf(buf, PAGE_SIZE, "%d\n", ring_size);
> 
> Did you use checkpatch?
> 
> It should have said to use sysfs_emit(), right?

Yes, I am using  "--strict" switch for checkpatch.pl, still didn't get this warning.
However, I will use sysfs_emit() in V2.

> 
> > +	ret = sysfs_create_file(&dev->device.kobj,
> > +&dev_attr_ring_size.attr);
> 
> If you ever have to use a sysfs_* call in a driver, that's a huge hint something
> is wrong.  Please fix this to not race with userspace and loose.

Thanks for pointing this out, will look into this.

> 
> > +	if (ret)
> > +		dev_notice(&dev->device, "sysfs create ring size file failed;
> > +%d\n", ret);
> 
> Why not just use dev_err() for this and other errors?  Why "notice"?

Will fix.

> 
> > +MODULE_VERSION(DRIVER_VERSION);
> 
> This means nothing, please drop.

OK.

Thanks for review.
- Saurabh

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [PATCH 1/5] uio: Add hv_vmbus_client driver
From: Greg KH @ 2023-06-04  7:09 UTC (permalink / raw)
  To: Saurabh Sengar
  Cc: kys, haiyangz, wei.liu, decui, mikelley, linux-kernel,
	linux-hyperv
In-Reply-To: <1685692629-31351-2-git-send-email-ssengar@linux.microsoft.com>

On Fri, Jun 02, 2023 at 12:57:05AM -0700, Saurabh Sengar wrote:
> + * Since the driver does not declare any device ids, you must allocate
> + * id and bind the device to the driver yourself.  For example:
> + * driverctl -b vmbus set-override <dev uuid> uio_hv_vmbus_client

Shouldn't this be documented somewhere?

> + */
> +
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/uio_driver.h>
> +#include <linux/hyperv.h>
> +#include <linux/vmalloc.h>
> +#include <linux/sysfs.h>
> +
> +#define DRIVER_VERSION	"0.0.1"

Why this number?  Why not "1"?

The whole "driver version" thing doesn't really make sense here, we
should just drop it from the uio later, right?

> +#define DRIVER_AUTHOR	"Saurabh Sengar <ssengar@microsoft.com>"
> +#define DRIVER_DESC	"Generic UIO driver for low speed VMBus devices"
> +
> +#define DEFAULT_HV_RING_SIZE	VMBUS_RING_SIZE(3 * HV_HYP_PAGE_SIZE)
> +static int ring_size = DEFAULT_HV_RING_SIZE;
> +
> +struct uio_hv_vmbus_dev {
> +	struct uio_info info;
> +	struct hv_device *device;
> +	atomic_t refcnt;

Why is this refcnt needed?

Have you seen the other threads about how attempting to block multiple
opens in UIO drivers really does not work at all?  Please drop all of
that logic, it is not correct.

> +static ssize_t ring_size_show(struct device *dev, struct device_attribute *attr,
> +			      char *buf)
> +{
> +	return scnprintf(buf, PAGE_SIZE, "%d\n", ring_size);

Did you use checkpatch?

It should have said to use sysfs_emit(), right?

> +	ret = sysfs_create_file(&dev->device.kobj, &dev_attr_ring_size.attr);

If you ever have to use a sysfs_* call in a driver, that's a huge hint
something is wrong.  Please fix this to not race with userspace and
loose.

> +	if (ret)
> +		dev_notice(&dev->device, "sysfs create ring size file failed; %d\n", ret);

Why not just use dev_err() for this and other errors?  Why "notice"?

> +MODULE_VERSION(DRIVER_VERSION);

This means nothing, please drop.

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH 1/5] uio: Add hv_vmbus_client driver
From: kernel test robot @ 2023-06-04  2:05 UTC (permalink / raw)
  To: Saurabh Sengar, kys, haiyangz, wei.liu, decui, mikelley, gregkh,
	linux-kernel, linux-hyperv
  Cc: oe-kbuild-all
In-Reply-To: <1685692629-31351-2-git-send-email-ssengar@linux.microsoft.com>

Hi Saurabh,

kernel test robot noticed the following build warnings:

[auto build test WARNING on char-misc/char-misc-testing]
[also build test WARNING on char-misc/char-misc-next char-misc/char-misc-linus linus/master v6.4-rc4 next-20230602]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Saurabh-Sengar/uio-Add-hv_vmbus_client-driver/20230602-160029
base:   char-misc/char-misc-testing
patch link:    https://lore.kernel.org/r/1685692629-31351-2-git-send-email-ssengar%40linux.microsoft.com
patch subject: [PATCH 1/5] uio: Add hv_vmbus_client driver
config: i386-randconfig-c001-20230604 (https://download.01.org/0day-ci/archive/20230604/202306040910.13oLufQd-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306040910.13oLufQd-lkp@intel.com/

cocci warnings: (new ones prefixed by >>)
>> drivers/uio/uio_hv_vmbus_client.c:195:1-6: WARNING: invalid free of devm_ allocated data

vim +195 drivers/uio/uio_hv_vmbus_client.c

   158	
   159	static int uio_hv_vmbus_probe(struct hv_device *dev, const struct hv_vmbus_device_id *dev_id)
   160	{
   161		struct uio_hv_vmbus_dev *pdata;
   162		int ret = 0;
   163		char *name = NULL;
   164	
   165		pdata = devm_kzalloc(&dev->device, sizeof(*pdata), GFP_KERNEL);
   166		if (!pdata)
   167			return -ENOMEM;
   168	
   169		name = kasprintf(GFP_KERNEL, "%pUl", &dev->dev_instance);
   170	
   171		/* Fill general uio info */
   172		pdata->info.name = name; /* /sys/class/uio/uioX/name */
   173		pdata->info.version = DRIVER_VERSION;
   174		pdata->info.irqcontrol = uio_hv_vmbus_irqcontrol;
   175		pdata->info.open = uio_hv_vmbus_open;
   176		pdata->info.release = uio_hv_vmbus_release;
   177		pdata->info.irq = UIO_IRQ_CUSTOM;
   178		pdata->info.priv = pdata;
   179		pdata->device = dev;
   180	
   181		ret = uio_register_device(&dev->device, &pdata->info);
   182		if (ret) {
   183			dev_err(&dev->device, "uio_hv_vmbus register failed\n");
   184			goto fail;
   185		}
   186	
   187		ret = sysfs_create_file(&dev->device.kobj, &dev_attr_ring_size.attr);
   188		if (ret)
   189			dev_notice(&dev->device, "sysfs create ring size file failed; %d\n", ret);
   190	
   191		hv_set_drvdata(dev, pdata);
   192		return 0;
   193	
   194	fail:
 > 195		kfree(pdata);
   196		return ret;
   197	}
   198	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Wyższa konwersja w e-sklepie 
From: Kamil Durjasz @ 2023-06-02  8:30 UTC (permalink / raw)
  To: linux-hyperv

Dzień dobry,

w jaki sposób docierają Państwo do odbiorców?

Tworzymy potężne narzędzia sprzedaży, które pozwalają kompleksowo rozwiązać problemy potencjalnych klientów i skutecznie wpłynąć na ich decyzje zakupowe. 

Skupiamy się na Państwa potrzebach związanych z obsługą sklepu, oczekiwaniach i planach sprzedażowych. Szczegółowo dopasowujemy grafikę, funkcjonalności, strukturę i mikrointerakcje do Państwa grupy docelowej, co przekłada się na oczekiwane rezultaty.

Chętnie przedstawię dotychczasowe realizacje, aby mogli Państwo przekonać się o naszych możliwościach. Mogę się skontaktować?

Pozdrawiam
Kamil Durjasz

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH] x86/hyperv: add noop functions to x86_init mpparse functions
From: Saurabh Singh Sengar @ 2023-06-02 16:27 UTC (permalink / raw)
  To: Wei Liu, Saurabh Sengar
  Cc: KY Srinivasan, Haiyang Zhang, Dexuan Cui, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, Michael Kelley (LINUX),
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	hpa@zytor.com
In-Reply-To: <ZHoXVAvof+STaIxI@liuwe-devbox-debian-v2>



> -----Original Message-----
> From: Wei Liu <wei.liu@kernel.org>
> Sent: Friday, June 2, 2023 9:53 PM
> To: Saurabh Sengar <ssengar@linux.microsoft.com>
> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; tglx@linutronix.de; mingo@redhat.com;
> bp@alien8.de; dave.hansen@linux.intel.com; x86@kernel.org; Michael Kelley
> (LINUX) <mikelley@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> hyperv@vger.kernel.org; hpa@zytor.com
> Subject: [EXTERNAL] Re: [PATCH] x86/hyperv: add noop functions to x86_init
> mpparse functions
> 
> On Fri, Jun 02, 2023 at 05:41:52AM -0700, Saurabh Sengar wrote:
> > In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE.
> > When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel will scan low
> > memory looking for MP tables. Don't allow this, because low memory is
> > controlled by VTL0 and may contain actual valid tables for VTL0, which
> > can confuse the VTL2 kernel.
> >
> > Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> > ---
> >  arch/x86/hyperv/hv_vtl.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c index
> > 1ba5d3b99b16..ea21d897b5da 100644
> > --- a/arch/x86/hyperv/hv_vtl.c
> > +++ b/arch/x86/hyperv/hv_vtl.c
> > @@ -23,6 +23,10 @@ void __init hv_vtl_init_platform(void)
> >  	x86_init.irqs.pre_vector_init = x86_init_noop;
> >  	x86_init.timers.timer_init = x86_init_noop;
> >
> > +	/* Avoid searching for BIOS MP tables */
> > +	x86_init.mpparse.find_smp_config = x86_init_noop;
> > +	x86_init.mpparse.get_smp_config = x86_init_uint_noop;
> > +
> 
> The code looks fine.
> 
> Can you expand the commit message a bit so that people who are not
> familiar with VTL and your setup can understand what's going on?

Sure, I will add more details in commit message.

> 
> Thanks,
> Wei.
> 
> >  	x86_platform.get_wallclock = get_rtc_noop;
> >  	x86_platform.set_wallclock = set_rtc_noop;
> >  	x86_platform.get_nmi_reason = hv_get_nmi_reason;
> > --
> > 2.34.1
> >

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH 1/2] x86/Kconfig: Allow CONFIG_X86_MPPARSE disable for OF platforms
From: Saurabh Singh Sengar @ 2023-06-02 16:25 UTC (permalink / raw)
  To: Dave Hansen, Saurabh Sengar, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, Michael Kelley (LINUX), linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org
In-Reply-To: <e5ca0f92-0909-68af-16e0-582f47d8e424@intel.com>



> -----Original Message-----
> From: Dave Hansen <dave.hansen@intel.com>
> Sent: Friday, June 2, 2023 7:54 PM
> To: Saurabh Singh Sengar <ssengar@microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; tglx@linutronix.de; mingo@redhat.com;
> bp@alien8.de; dave.hansen@linux.intel.com; x86@kernel.org;
> hpa@zytor.com; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org
> Subject: [EXTERNAL] Re: [PATCH 1/2] x86/Kconfig: Allow
> CONFIG_X86_MPPARSE disable for OF platforms
> 
> On 6/2/23 05:22, Saurabh Singh Sengar wrote:
> > Furthermore, I would like to learn about the rationale behind
> > disallowing the disablement of CONFIG_X86_MPPARSE when MP tables are
> > not in use. Considering that we compile out the features we don't
> > support, wouldn't it be acceptable to allow users to customize their
> > configurations in this manner? Allowing the disablement of
> > CONFIG_X86_MPPARSE would provide greater flexibility and efficiency for
> those who do not utilize MP tables.
> 
> If someone sent a patch, I'd certainly look and figure out what "flexibility" and
> "efficiency" it would provide.  But, honestly, it would just just be noise if it
> doesn't solve an _actual_ problem.
> 
> Would anyone care or notice the "flexibility" and "efficiency" it would
> provide?

After the new approach we have opted, I agree it's not blocking us from any functional
problem. Although one of the things we are trying to achieve is the minimal kernel
where we want to remove all the unnecessary options. Our system is not dependent
MPPARSE and we just don't want this code in our minimal kernel, and this config
option is preventing us from removing it.

Also, in past kernel has provided this flexibility for other options, which made me think
it is fine do it for OF as well.
https://lkml.indiana.edu/hypermail/linux/kernel/1210.3/01987.html

Happy to know your opinion and learn.

^ permalink raw reply

* Re: [PATCH] x86/hyperv: add noop functions to x86_init mpparse functions
From: Wei Liu @ 2023-06-02 16:22 UTC (permalink / raw)
  To: Saurabh Sengar
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	mikelley, linux-kernel, linux-hyperv, hpa
In-Reply-To: <1685709712-13752-1-git-send-email-ssengar@linux.microsoft.com>

On Fri, Jun 02, 2023 at 05:41:52AM -0700, Saurabh Sengar wrote:
> In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE.
> When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel will
> scan low memory looking for MP tables. Don't allow this, because
> low memory is controlled by VTL0 and may contain actual valid
> tables for VTL0, which can confuse the VTL2 kernel.
> 
> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> ---
>  arch/x86/hyperv/hv_vtl.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
> index 1ba5d3b99b16..ea21d897b5da 100644
> --- a/arch/x86/hyperv/hv_vtl.c
> +++ b/arch/x86/hyperv/hv_vtl.c
> @@ -23,6 +23,10 @@ void __init hv_vtl_init_platform(void)
>  	x86_init.irqs.pre_vector_init = x86_init_noop;
>  	x86_init.timers.timer_init = x86_init_noop;
>  
> +	/* Avoid searching for BIOS MP tables */
> +	x86_init.mpparse.find_smp_config = x86_init_noop;
> +	x86_init.mpparse.get_smp_config = x86_init_uint_noop;
> +

The code looks fine.

Can you expand the commit message a bit so that people who are not
familiar with VTL and your setup can understand what's going on?

Thanks,
Wei.

>  	x86_platform.get_wallclock = get_rtc_noop;
>  	x86_platform.set_wallclock = set_rtc_noop;
>  	x86_platform.get_nmi_reason = hv_get_nmi_reason;
> -- 
> 2.34.1
> 

^ permalink raw reply

* Re: [PATCH] x86/hyperv: add noop functions to x86_init mpparse functions
From: Dave Hansen @ 2023-06-02 16:21 UTC (permalink / raw)
  To: Wei Liu
  Cc: Saurabh Sengar, kys, haiyangz, decui, tglx, mingo, bp,
	dave.hansen, x86, mikelley, linux-kernel, linux-hyperv, hpa
In-Reply-To: <ZHoWN1TYicOSGsd3@liuwe-devbox-debian-v2>

On 6/2/23 09:17, Wei Liu wrote:
> On Fri, Jun 02, 2023 at 08:33:13AM -0700, Dave Hansen wrote:
>> On 6/2/23 05:41, Saurabh Sengar wrote:
>>> In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE.
>>> When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel will
>>> scan low memory looking for MP tables. Don't allow this, because
>>> low memory is controlled by VTL0 and may contain actual valid
>>> tables for VTL0, which can confuse the VTL2 kernel.
>> Do you folks have a writeup of this VTL* setup anywhere?  I'm struggling
>> to grasp why VTL0 and VTL2 share the same address space and why they
>> would get confused by each other's data structures.
> Dave, here is some public information about Virtual Trust Level.
> 
> https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm#virtual-trust-level-vtl

Yeah, I can google too. ;)

Seriously, though.  I don't need an architecture document.  This is like
me asking how NMIs work in Linux and someone pointing me to the SDM
describing the hardware architecture.

I need to know about the Linux implementation of these VTL's, not the
overall VTL architecture.

^ permalink raw reply

* Re: [PATCH] x86/hyperv: add noop functions to x86_init mpparse functions
From: Wei Liu @ 2023-06-02 16:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Saurabh Sengar, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, mikelley, linux-kernel, linux-hyperv, hpa
In-Reply-To: <442441b3-803e-0c1f-ff1b-5a49ecc2e423@intel.com>

On Fri, Jun 02, 2023 at 08:33:13AM -0700, Dave Hansen wrote:
> On 6/2/23 05:41, Saurabh Sengar wrote:
> > In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE.
> > When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel will
> > scan low memory looking for MP tables. Don't allow this, because
> > low memory is controlled by VTL0 and may contain actual valid
> > tables for VTL0, which can confuse the VTL2 kernel.
> 
> Do you folks have a writeup of this VTL* setup anywhere?  I'm struggling
> to grasp why VTL0 and VTL2 share the same address space and why they
> would get confused by each other's data structures.

Dave, here is some public information about Virtual Trust Level.

https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/vsm#virtual-trust-level-vtl

I wished it could be more detailed.

With the proper configuration, VTL2 can see memory from VTL0, but not
the other way around.

Saurabh can probably give you more information for this particular setup.

Thanks,
Wei.

> 
> $ grep -r VTL[02] Documentation/
> $
> 
> Either way, this is way better than the #ifdefs.  But the changelog is
> kinda just gibberish to me.

^ permalink raw reply

* Re: [PATCH] x86/hyperv: add noop functions to x86_init mpparse functions
From: Dave Hansen @ 2023-06-02 15:33 UTC (permalink / raw)
  To: Saurabh Sengar, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, mikelley, linux-kernel, linux-hyperv, hpa
In-Reply-To: <1685709712-13752-1-git-send-email-ssengar@linux.microsoft.com>

On 6/2/23 05:41, Saurabh Sengar wrote:
> In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE.
> When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel will
> scan low memory looking for MP tables. Don't allow this, because
> low memory is controlled by VTL0 and may contain actual valid
> tables for VTL0, which can confuse the VTL2 kernel.

Do you folks have a writeup of this VTL* setup anywhere?  I'm struggling
to grasp why VTL0 and VTL2 share the same address space and why they
would get confused by each other's data structures.

$ grep -r VTL[02] Documentation/
$

Either way, this is way better than the #ifdefs.  But the changelog is
kinda just gibberish to me.

^ permalink raw reply

* Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
From: Mickaël Salaün @ 2023-06-02 15:07 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe
  Cc: dave.hansen@linux.intel.com, bp@alien8.de, keescook@chromium.org,
	hpa@zytor.com, mingo@redhat.com, tglx@linutronix.de,
	pbonzini@redhat.com, wanpengli@tencent.com, vkuznets@redhat.com,
	kvm@vger.kernel.org, qemu-devel@nongnu.org, liran.alon@oracle.com,
	marian.c.rotariu@gmail.com, Alexander Graf, John S Andersen,
	madvenka@linux.microsoft.com, ssicleru@bitdefender.com,
	yuanyu@google.com, linux-kernel@vger.kernel.org,
	tgopinath@microsoft.com, jamorris@linux.microsoft.com,
	linux-security-module@vger.kernel.org,
	xen-devel@lists.xenproject.org, will@kernel.org,
	dev@lists.cloudhypervisor.org, mdontu@bitdefender.com,
	linux-hardening@vger.kernel.org, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux-foundation.org, nicu.citu@icloud.com,
	ztarkhani@microsoft.com, x86@kernel.org, James Gowans
In-Reply-To: <ZHes4a73Zg+6JuFB@google.com>


On 31/05/2023 22:24, Sean Christopherson wrote:
> On Tue, May 30, 2023, Rick P Edgecombe wrote:
>> On Fri, 2023-05-26 at 17:22 +0200, Mickaï¿½l Salaï¿½n wrote:
>>>>> Can the guest kernel ask the host VMM's emulated devices to DMA into
>>>>> the protected data? It should go through the host userspace mappings I
>>>>> think, which don't care about EPT permissions. Or did I miss where you
>>>>> are protecting that another way? There are a lot of easy ways to ask
>>>>> the host to write to guest memory that don't involve the EPT. You
>>>>> probably need to protect the host userspace mappings, and also the
>>>>> places in KVM that kmap a GPA provided by the guest.
>>>>
>>>> Good point, I'll check this confused deputy attack. Extended KVM
>>>> protections should indeed handle all ways to map guests' memory.  I'm
>>>> wondering if current VMMs would gracefully handle such new restrictions
>>>> though.
>>>
>>> I guess the host could map arbitrary data to the guest, so that need to be
>>> handled, but how could the VMM (not the host kernel) bypass/update EPT
>>> initially used for the guest (and potentially later mapped to the host)?
>>
>> Well traditionally both QEMU and KVM accessed guest memory via host
>> mappings instead of the EPT.ï¿½So I'm wondering what is stopping the
>> guest from passing a protected gfn when setting up the DMA, and QEMU
>> being enticed to write to it? The emulator as well would use these host
>> userspace mappings and not consult the EPT IIRC.
>>
>> I think Sean was suggesting host userspace should be more involved in
>> this process, so perhaps it could protect its own alias of the
>> protected memory, for example mprotect() it as read-only.
> 
> Ya, though "suggesting" is really "demanding, unless someone provides super strong
> justification for handling this directly in KVM".  It's basically the same argument
> that led to Linux Security Modules: I'm all for KVM providing the framework and
> plumbing, but I don't want KVM to get involved in defining policy, thread models, etc.

I agree that KVM should not provide its own policy but only the building 
blocks to enforce one. There is two complementary points:
- policy definition by the guest, provided to KVM and the host;
- policy enforcement by KVM and the host.

A potential extension of this framework could be to enable the host to 
define it's own policy for guests, but this would be a different threat 
model.

To avoid too much latency because of the host being involved in policy 
enforcement, I'd like to explore an asynchronous approach that would 
especially fit well for dynamic restrictions.

^ permalink raw reply

* Re: [PATCH 1/2] x86/Kconfig: Allow CONFIG_X86_MPPARSE disable for OF platforms
From: Dave Hansen @ 2023-06-02 14:23 UTC (permalink / raw)
  To: Saurabh Singh Sengar, Saurabh Sengar, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Dexuan Cui, Michael Kelley (LINUX),
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org
In-Reply-To: <PUZP153MB07490EDED3B0194F5518B26EBE4EA@PUZP153MB0749.APCP153.PROD.OUTLOOK.COM>

On 6/2/23 05:22, Saurabh Singh Sengar wrote:
> Furthermore, I would like to learn about the rationale behind disallowing the
> disablement of CONFIG_X86_MPPARSE when MP tables are not in use. Considering
> that we compile out the features we don't support, wouldn't it be acceptable to
> allow users to customize their configurations in this manner? Allowing the
> disablement of CONFIG_X86_MPPARSE would provide greater flexibility and
> efficiency for those who do not utilize MP tables.

If someone sent a patch, I'd certainly look and figure out what
"flexibility" and "efficiency" it would provide.  But, honestly, it
would just just be noise if it doesn't solve an _actual_ problem.

Would anyone care or notice the "flexibility" and "efficiency" it would
provide?

^ permalink raw reply

* [PATCH] x86/hyperv: add noop functions to x86_init mpparse functions
From: Saurabh Sengar @ 2023-06-02 12:41 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	mikelley, linux-kernel, linux-hyperv, hpa

In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE.
When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel will
scan low memory looking for MP tables. Don't allow this, because
low memory is controlled by VTL0 and may contain actual valid
tables for VTL0, which can confuse the VTL2 kernel.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
---
 arch/x86/hyperv/hv_vtl.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
index 1ba5d3b99b16..ea21d897b5da 100644
--- a/arch/x86/hyperv/hv_vtl.c
+++ b/arch/x86/hyperv/hv_vtl.c
@@ -23,6 +23,10 @@ void __init hv_vtl_init_platform(void)
 	x86_init.irqs.pre_vector_init = x86_init_noop;
 	x86_init.timers.timer_init = x86_init_noop;
 
+	/* Avoid searching for BIOS MP tables */
+	x86_init.mpparse.find_smp_config = x86_init_noop;
+	x86_init.mpparse.get_smp_config = x86_init_uint_noop;
+
 	x86_platform.get_wallclock = get_rtc_noop;
 	x86_platform.set_wallclock = set_rtc_noop;
 	x86_platform.get_nmi_reason = hv_get_nmi_reason;
-- 
2.34.1


^ permalink raw reply related

* RE: [EXTERNAL] Re: [PATCH 1/2] x86/Kconfig: Allow CONFIG_X86_MPPARSE disable for OF platforms
From: Saurabh Singh Sengar @ 2023-06-02 12:22 UTC (permalink / raw)
  To: Dave Hansen, Saurabh Sengar, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, Michael Kelley (LINUX), linux-kernel@vger.kernel.org,
	linux-hyperv@vger.kernel.org
In-Reply-To: <13922610-b9bc-ab4b-8482-9aae238396c7@intel.com>



> -----Original Message-----
> From: Dave Hansen <dave.hansen@intel.com>
> Sent: Thursday, May 25, 2023 7:26 PM
> To: Saurabh Singh Sengar <ssengar@microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; tglx@linutronix.de; mingo@redhat.com;
> bp@alien8.de; dave.hansen@linux.intel.com; x86@kernel.org;
> hpa@zytor.com; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org
> Subject: Re: [EXTERNAL] Re: [PATCH 1/2] x86/Kconfig: Allow
> CONFIG_X86_MPPARSE disable for OF platforms
> 
> On 5/25/23 00:06, Saurabh Singh Sengar wrote:
> > When CONFIG_X86_MPPARSE is enabled, the kernel will scan low memory,
> > looking for MP tables. In Hyper-V VBS setup, lower memory is assigned to
> VTL0.
> > This lower memory may contain the actual MPPARSE table for VTL0, which
> > can confuse the VTLx kernel and cause issues. (x > 0)
> 
> This still just seems wrong.
> 
> If an action crashes the kernel because of changes, we don't just compile it
> out and move on.  We add enumeration for the new feature that's causing it
> and turn off the action at runtime.
> 

Thanks, I greatly appreciate your suggestion and wanted to let you know that
I am actively working on upstreaming the new fix for the VTL driver to bypass
the need for the MP table scan.

Furthermore, I would like to learn about the rationale behind disallowing the
disablement of CONFIG_X86_MPPARSE when MP tables are not in use. Considering
that we compile out the features we don't support, wouldn't it be acceptable to
allow users to customize their configurations in this manner? Allowing the
disablement of CONFIG_X86_MPPARSE would provide greater flexibility and
efficiency for those who do not utilize MP tables.


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox