Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* Re: [PATCH v8 1/2] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
From: Sathyanarayanan Kuppuswamy @ 2023-06-20 19:44 UTC (permalink / raw)
  To: Dexuan Cui, ak@linux.intel.com, arnd@arndb.de, bp@alien8.de,
	brijesh.singh@amd.com, dan.j.williams@intel.com,
	dave.hansen@intel.com, dave.hansen@linux.intel.com, Haiyang Zhang,
	hpa@zytor.com, jane.chu@oracle.com,
	kirill.shutemov@linux.intel.com, KY Srinivasan,
	linux-arch@vger.kernel.org, linux-hyperv@vger.kernel.org,
	luto@kernel.org, mingo@redhat.com, peterz@infradead.org,
	rostedt@goodmis.org, seanjc@google.com, tglx@linutronix.de,
	tony.luck@intel.com, wei.liu@kernel.org, x86@kernel.org,
	Michael Kelley (LINUX)
  Cc: linux-kernel@vger.kernel.org, Tianyu Lan,
	rick.p.edgecombe@intel.com
In-Reply-To: <SA1PR21MB13359EB1A88FC676C2269318BF5CA@SA1PR21MB1335.namprd21.prod.outlook.com>

Hi,

On 6/20/23 12:23 PM, Dexuan Cui wrote:
>> From: Sathyanarayanan Kuppuswamy
>> Sent: Tuesday, June 20, 2023 11:31 AM
>>> ...
>>> -static bool tdx_enc_status_changed(unsigned long vaddr, int numpages,
>> bool enc)
>>> +static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
>>>  {
>>> -	phys_addr_t start = __pa(vaddr);
>>> -	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
>>> +	const int max_retries_per_page = 3;
>>
>> Add some details about why you chose 3? Maybe you can also use macro for it.
> 
> It's a small number recommended by Kirill:
> https://lwn.net/ml/linux-kernel/20221208194800.n27ak4xj6pmyny46@box.shutemov.name/
> 
> The spec doesn't define a max retry count. Normally I guess a max retry count
> of 2 should be enough, at least for Hyper-V according to my testing.
> 
> Maybe we can add a comment like this:
> 
> /* Retrying the hypercall a second time should succeed; use 3 just in case. */
> 
> Does this look good to all?

Looks fine to me.

> 
>>> +	struct tdx_hypercall_args args;
>>> +	u64 map_fail_paddr, ret;
>>> +	int retry_count = 0;
>>>
>>>  	if (!enc) {
>>>  		/* Set the shared (decrypted) bits: */
>>> @@ -718,12 +720,49 @@ static bool tdx_enc_status_changed(unsigned long
>> vaddr, int numpages, bool enc)
>>>  		end   |= cc_mkdec(0);
>>>  	}
>>>
>>> -	/*
>>> -	 * Notify the VMM about page mapping conversion. More info about ABI
>>> -	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
>>> -	 * section "TDG.VP.VMCALL<MapGPA>"
>>> -	 */
>>> -	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
>>> +	while (retry_count < max_retries_per_page) {
>>> +		memset(&args, 0, sizeof(args));
>>> +		args.r10 = TDX_HYPERCALL_STANDARD;
>>> +		args.r11 = TDVMCALL_MAP_GPA;
>>> +		args.r12 = start;
>>> +		args.r13 = end - start;
>>> +
>>> +		ret = __tdx_hypercall_ret(&args);
>>> +		if (ret != TDVMCALL_STATUS_RETRY)
>>> +			return !ret;
>>> +		/*
>>> +		 * The guest must retry the operation for the pages in the
>>> +		 * region starting at the GPA specified in R11. R11 comes
>>> +		 * from the untrusted VMM. Sanity check it.
>>> +		 */
>>> +		map_fail_paddr = args.r11;
>>
>> Do you really need map_fail_paddr? Why not directly use args.r11?
>>
>>> +		if (map_fail_paddr < start || map_fail_paddr >= end)
>>> +			return false;
> 
> Originally, I used r11. 
> 
> Dave says " 'r11' needs a real, logical name":
> https://lwn.net/ml/linux-kernel/6bb65614-d420-49d3-312f-316dc8ca4cc4@intel.com/

Got it.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply

* RE: [PATCH v8 1/2] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
From: Dexuan Cui @ 2023-06-20 19:23 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, ak@linux.intel.com, arnd@arndb.de,
	bp@alien8.de, brijesh.singh@amd.com, dan.j.williams@intel.com,
	dave.hansen@intel.com, dave.hansen@linux.intel.com, Haiyang Zhang,
	hpa@zytor.com, jane.chu@oracle.com,
	kirill.shutemov@linux.intel.com, KY Srinivasan,
	linux-arch@vger.kernel.org, linux-hyperv@vger.kernel.org,
	luto@kernel.org, mingo@redhat.com, peterz@infradead.org,
	rostedt@goodmis.org, seanjc@google.com, tglx@linutronix.de,
	tony.luck@intel.com, wei.liu@kernel.org, x86@kernel.org,
	Michael Kelley (LINUX)
  Cc: linux-kernel@vger.kernel.org, Tianyu Lan,
	rick.p.edgecombe@intel.com
In-Reply-To: <49cb0f01-f1c2-8812-7f2f-9a70ff576085@linux.intel.com>

> From: Sathyanarayanan Kuppuswamy
> Sent: Tuesday, June 20, 2023 11:31 AM
> > ...
> > -static bool tdx_enc_status_changed(unsigned long vaddr, int numpages,
> bool enc)
> > +static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
> >  {
> > -	phys_addr_t start = __pa(vaddr);
> > -	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
> > +	const int max_retries_per_page = 3;
> 
> Add some details about why you chose 3? Maybe you can also use macro for it.

It's a small number recommended by Kirill:
https://lwn.net/ml/linux-kernel/20221208194800.n27ak4xj6pmyny46@box.shutemov.name/

The spec doesn't define a max retry count. Normally I guess a max retry count
of 2 should be enough, at least for Hyper-V according to my testing.

Maybe we can add a comment like this:

/* Retrying the hypercall a second time should succeed; use 3 just in case. */

Does this look good to all?

> > +	struct tdx_hypercall_args args;
> > +	u64 map_fail_paddr, ret;
> > +	int retry_count = 0;
> >
> >  	if (!enc) {
> >  		/* Set the shared (decrypted) bits: */
> > @@ -718,12 +720,49 @@ static bool tdx_enc_status_changed(unsigned long
> vaddr, int numpages, bool enc)
> >  		end   |= cc_mkdec(0);
> >  	}
> >
> > -	/*
> > -	 * Notify the VMM about page mapping conversion. More info about ABI
> > -	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
> > -	 * section "TDG.VP.VMCALL<MapGPA>"
> > -	 */
> > -	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
> > +	while (retry_count < max_retries_per_page) {
> > +		memset(&args, 0, sizeof(args));
> > +		args.r10 = TDX_HYPERCALL_STANDARD;
> > +		args.r11 = TDVMCALL_MAP_GPA;
> > +		args.r12 = start;
> > +		args.r13 = end - start;
> > +
> > +		ret = __tdx_hypercall_ret(&args);
> > +		if (ret != TDVMCALL_STATUS_RETRY)
> > +			return !ret;
> > +		/*
> > +		 * The guest must retry the operation for the pages in the
> > +		 * region starting at the GPA specified in R11. R11 comes
> > +		 * from the untrusted VMM. Sanity check it.
> > +		 */
> > +		map_fail_paddr = args.r11;
> 
> Do you really need map_fail_paddr? Why not directly use args.r11?
> 
> > +		if (map_fail_paddr < start || map_fail_paddr >= end)
> > +			return false;

Originally, I used r11. 

Dave says " 'r11' needs a real, logical name":
https://lwn.net/ml/linux-kernel/6bb65614-d420-49d3-312f-316dc8ca4cc4@intel.com/

^ permalink raw reply

* [PATCH] x86/hyperv: Improve code for referencing hyperv_pcpu_input_arg
From: Nischala Yelchuri @ 2023-06-20 18:40 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel
  Cc: mikelley, "K. Y. Srinivasan", kys, Haiyang, Zhang,
	haiyangz, Wei, Liu, wei.liu, Dexuan, Cui, decui, Thomas, Gleixner,
	tglx, Ingo, Molnar, mingo, Borislav, Petkov, bp, Dave, Hansen,
	dave.hansen, x86, "H. Peter Anvin", hpa, Tyler, Hicks,
	code, niyelchu

Several places in code for Hyper-V reference the
per-CPU variable hyperv_pcpu_input_arg. Older code uses a multi-line
sequence to reference the variable, and usually includes a cast.
Newer code does a much simpler direct assignment. The latter is
preferable as the complexity of the older code is unnecessary.

Update older code to use the simpler direct assignment.

Signed-off-by: Nischala Yelchuri <niyelchu@linux.microsoft.com>
---
 arch/x86/hyperv/hv_apic.c |  4 +---
 arch/x86/hyperv/ivm.c     |  7 +++----
 arch/x86/hyperv/mmu.c     | 12 ++----------
 arch/x86/hyperv/nested.c  | 11 ++---------
 drivers/hv/hv_balloon.c   |  2 +-
 5 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 1fbda2f94..b21335e6a 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -107,7 +107,6 @@ static bool cpu_is_self(int cpu)
 static bool __send_ipi_mask_ex(const struct cpumask *mask, int vector,
 		bool exclude_self)
 {
-	struct hv_send_ipi_ex **arg;
 	struct hv_send_ipi_ex *ipi_arg;
 	unsigned long flags;
 	int nr_bank = 0;
@@ -117,9 +116,8 @@ static bool __send_ipi_mask_ex(const struct cpumask *mask, int vector,
 		return false;
 
 	local_irq_save(flags);
-	arg = (struct hv_send_ipi_ex **)this_cpu_ptr(hyperv_pcpu_input_arg);
+	ipi_arg = *this_cpu_ptr(hyperv_pcpu_input_arg);
 
-	ipi_arg = *arg;
 	if (unlikely(!ipi_arg))
 		goto ipi_mask_ex_done;
 
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 14f46ad2c..28be6df88 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -247,7 +247,7 @@ EXPORT_SYMBOL_GPL(hv_ghcb_msr_read);
 static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
 			   enum hv_mem_host_visibility visibility)
 {
-	struct hv_gpa_range_for_visibility **input_pcpu, *input;
+	struct hv_gpa_range_for_visibility *input;
 	u16 pages_processed;
 	u64 hv_status;
 	unsigned long flags;
@@ -263,9 +263,8 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
 	}
 
 	local_irq_save(flags);
-	input_pcpu = (struct hv_gpa_range_for_visibility **)
-			this_cpu_ptr(hyperv_pcpu_input_arg);
-	input = *input_pcpu;
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+
 	if (unlikely(!input)) {
 		local_irq_restore(flags);
 		return -EINVAL;
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 8460bd35e..1cc113200 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -61,7 +61,6 @@ static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
 				   const struct flush_tlb_info *info)
 {
 	int cpu, vcpu, gva_n, max_gvas;
-	struct hv_tlb_flush **flush_pcpu;
 	struct hv_tlb_flush *flush;
 	u64 status;
 	unsigned long flags;
@@ -74,10 +73,7 @@ static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
 
 	local_irq_save(flags);
 
-	flush_pcpu = (struct hv_tlb_flush **)
-		     this_cpu_ptr(hyperv_pcpu_input_arg);
-
-	flush = *flush_pcpu;
+	flush = *this_cpu_ptr(hyperv_pcpu_input_arg);
 
 	if (unlikely(!flush)) {
 		local_irq_restore(flags);
@@ -178,17 +174,13 @@ static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
 				      const struct flush_tlb_info *info)
 {
 	int nr_bank = 0, max_gvas, gva_n;
-	struct hv_tlb_flush_ex **flush_pcpu;
 	struct hv_tlb_flush_ex *flush;
 	u64 status;
 
 	if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
 		return HV_STATUS_INVALID_PARAMETER;
 
-	flush_pcpu = (struct hv_tlb_flush_ex **)
-		     this_cpu_ptr(hyperv_pcpu_input_arg);
-
-	flush = *flush_pcpu;
+	flush = *this_cpu_ptr(hyperv_pcpu_input_arg);
 
 	if (info->mm) {
 		/*
diff --git a/arch/x86/hyperv/nested.c b/arch/x86/hyperv/nested.c
index 5d70968c8..9dc259fa3 100644
--- a/arch/x86/hyperv/nested.c
+++ b/arch/x86/hyperv/nested.c
@@ -19,7 +19,6 @@
 
 int hyperv_flush_guest_mapping(u64 as)
 {
-	struct hv_guest_mapping_flush **flush_pcpu;
 	struct hv_guest_mapping_flush *flush;
 	u64 status;
 	unsigned long flags;
@@ -30,10 +29,7 @@ int hyperv_flush_guest_mapping(u64 as)
 
 	local_irq_save(flags);
 
-	flush_pcpu = (struct hv_guest_mapping_flush **)
-		this_cpu_ptr(hyperv_pcpu_input_arg);
-
-	flush = *flush_pcpu;
+	flush = *this_cpu_ptr(hyperv_pcpu_input_arg);
 
 	if (unlikely(!flush)) {
 		local_irq_restore(flags);
@@ -90,7 +86,6 @@ EXPORT_SYMBOL_GPL(hyperv_fill_flush_guest_mapping_list);
 int hyperv_flush_guest_mapping_range(u64 as,
 		hyperv_fill_flush_list_func fill_flush_list_func, void *data)
 {
-	struct hv_guest_mapping_flush_list **flush_pcpu;
 	struct hv_guest_mapping_flush_list *flush;
 	u64 status;
 	unsigned long flags;
@@ -102,10 +97,8 @@ int hyperv_flush_guest_mapping_range(u64 as,
 
 	local_irq_save(flags);
 
-	flush_pcpu = (struct hv_guest_mapping_flush_list **)
-		this_cpu_ptr(hyperv_pcpu_input_arg);
+	flush = *this_cpu_ptr(hyperv_pcpu_input_arg);
 
-	flush = *flush_pcpu;
 	if (unlikely(!flush)) {
 		local_irq_restore(flags);
 		goto fault;
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index dffcc894f..0d7a3ba66 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -1628,7 +1628,7 @@ static int hv_free_page_report(struct page_reporting_dev_info *pr_dev_info,
 	WARN_ON_ONCE(nents > HV_MEMORY_HINT_MAX_GPA_PAGE_RANGES);
 	WARN_ON_ONCE(sgl->length < (HV_HYP_PAGE_SIZE << page_reporting_order));
 	local_irq_save(flags);
-	hint = *(struct hv_memory_hint **)this_cpu_ptr(hyperv_pcpu_input_arg);
+	hint = *this_cpu_ptr(hyperv_pcpu_input_arg);
 	if (!hint) {
 		local_irq_restore(flags);
 		return -ENOSPC;
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH v8 2/2] x86/tdx: Support vmalloc() for tdx_enc_status_changed()
From: Sathyanarayanan Kuppuswamy @ 2023-06-20 18:34 UTC (permalink / raw)
  To: Dexuan Cui, ak, arnd, bp, brijesh.singh, dan.j.williams,
	dave.hansen, dave.hansen, haiyangz, hpa, jane.chu,
	kirill.shutemov, kys, linux-arch, linux-hyperv, luto, mingo,
	peterz, rostedt, seanjc, tglx, tony.luck, wei.liu, x86, mikelley
  Cc: linux-kernel, Tianyu.Lan, rick.p.edgecombe
In-Reply-To: <20230620154830.25442-3-decui@microsoft.com>

Hi,

On 6/20/23 8:48 AM, Dexuan Cui wrote:
> When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> allocates buffers using vzalloc(), and needs to share the buffers with the
> host OS by calling set_memory_decrypted(), which is not working for
> vmalloc() yet. Add the support by handling the pages one by one.
> 
> Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---

Looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>  arch/x86/coco/tdx/tdx.c | 35 +++++++++++++++++++++++++++++------
>  1 file changed, 29 insertions(+), 6 deletions(-)
> 
> Changes in v2:
>   Changed tdx_enc_status_changed() in place.
> 
> Changes in v3:
>   No change since v2.
> 
> Changes in v4:
>   Added Kirill's Co-developed-by since Kirill helped to improve the
>     code by adding tdx_enc_status_changed_phys().
> 
>   Thanks Kirill for the clarification on load_unaligned_zeropad()!
> 
> Changes in v5:
>   Added Kirill's Signed-off-by.
>   Added Michael's Reviewed-by.
> 
> Changes in v6: None.
> 
> Changes in v7: None.
>   Note: there was a race between set_memory_encrypted() and
>   load_unaligned_zeropad(), which has been fixed by the 3 patches of
>   Kirill in the x86/tdx branch of the tip tree.
> 
> Changes in v8:
>   Rebased to tip.git's master branch.
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 0c198ab73aa7..a313d5ab42f1 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -8,6 +8,7 @@
>  #include <linux/export.h>
>  #include <linux/io.h>
> +#include <linux/mm.h>
>  #include <asm/coco.h>
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
>  #include <asm/insn.h>
> @@ -752,6 +753,19 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
>  	return false;
>  }
>  
> +static bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end,
> +					bool enc)
> +{
> +	if (!tdx_map_gpa(start, end, enc))
> +		return false;
> +
> +	/* shared->private conversion requires memory to be accepted before use */
> +	if (enc)
> +		return tdx_accept_memory(start, end);
> +
> +	return true;
> +}
> +
>  /*
>   * Inform the VMM of the guest's intent for this physical page: shared with
>   * the VMM or private to the guest.  The VMM is expected to change its mapping
> @@ -759,15 +773,24 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
>   */
>  static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
>  {
> -	phys_addr_t start = __pa(vaddr);
> -	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
> +	unsigned long start = vaddr;
> +	unsigned long end = start + numpages * PAGE_SIZE;
>  
> -	if (!tdx_map_gpa(start, end, enc))
> +	if (offset_in_page(start) != 0)
>  		return false;
>  
> -	/* shared->private conversion requires memory to be accepted before use */
> -	if (enc)
> -		return tdx_accept_memory(start, end);
> +	if (!is_vmalloc_addr((void *)start))
> +		return tdx_enc_status_changed_phys(__pa(start), __pa(end), enc);
> +
> +	while (start < end) {
> +		phys_addr_t start_pa = slow_virt_to_phys((void *)start);
> +		phys_addr_t end_pa = start_pa + PAGE_SIZE;
> +
> +		if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
> +			return false;
> +
> +		start += PAGE_SIZE;
> +	}
>  
>  	return true;
>  }

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply

* Re: [PATCH v8 1/2] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
From: Sathyanarayanan Kuppuswamy @ 2023-06-20 18:30 UTC (permalink / raw)
  To: Dexuan Cui, ak, arnd, bp, brijesh.singh, dan.j.williams,
	dave.hansen, dave.hansen, haiyangz, hpa, jane.chu,
	kirill.shutemov, kys, linux-arch, linux-hyperv, luto, mingo,
	peterz, rostedt, seanjc, tglx, tony.luck, wei.liu, x86, mikelley
  Cc: linux-kernel, Tianyu.Lan, rick.p.edgecombe
In-Reply-To: <20230620154830.25442-2-decui@microsoft.com>

Hi,

On 6/20/23 8:48 AM, Dexuan Cui wrote:
> GHCI spec for TDX 1.0 says that the MapGPA call may fail with the R10
> error code = TDG.VP.VMCALL_RETRY (1), and the guest must retry this
> operation for the pages in the region starting at the GPA specified
> in R11.
> 
> When a fully enlightened TDX guest runs on Hyper-V, Hyper-V can return
> the retry error when set_memory_decrypted() is called to decrypt up to
> 1GB of swiotlb bounce buffers.
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reviewed-by: Michael Kelley <mikelley@microsoft.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
> 
>  arch/x86/coco/tdx/tdx.c           | 63 +++++++++++++++++++++++++------
>  arch/x86/include/asm/shared/tdx.h |  2 +
>  2 files changed, 53 insertions(+), 12 deletions(-)
> 
> Changes in v2:
>   Used __tdx_hypercall() directly in tdx_map_gpa().
>   Added a max_retry_cnt of 1000.
>   Renamed a few variables, e.g., r11 -> map_fail_paddr.
> 
> Changes in v3:
>   Changed max_retry_cnt from 1000 to 3.
> 
> Changes in v4:
>   __tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT) -> __tdx_hypercall_ret()
>   Added Kirill's Acked-by.
> 
> Changes in v5:
>   Added Michael's Reviewed-by.
> 
> Changes in v6: None.
> 
> Changes in v7:
>   Addressed Dave's comments:
>   see https://lwn.net/ml/linux-kernel/SA1PR21MB1335736123C2BCBBFD7460C3BF46A@SA1PR21MB1335.namprd21.prod.outlook.com
> 
> Changes in v8:
>   Rebased to tip.git's master branch.
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 1d6b863c42b0..0c198ab73aa7 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -703,14 +703,16 @@ static bool tdx_cache_flush_required(void)
>  }
>  
>  /*
> - * Inform the VMM of the guest's intent for this physical page: shared with
> - * the VMM or private to the guest.  The VMM is expected to change its mapping
> - * of the page in response.
> + * Notify the VMM about page mapping conversion. More info about ABI
> + * can be found in TDX Guest-Host-Communication Interface (GHCI),
> + * section "TDG.VP.VMCALL<MapGPA>".
>   */
> -static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> +static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
>  {
> -	phys_addr_t start = __pa(vaddr);
> -	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
> +	const int max_retries_per_page = 3;

Add some details about why you chose 3? Maybe you can also use macro for it.

> +	struct tdx_hypercall_args args;
> +	u64 map_fail_paddr, ret;
> +	int retry_count = 0;
>  
>  	if (!enc) {
>  		/* Set the shared (decrypted) bits: */
> @@ -718,12 +720,49 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
>  		end   |= cc_mkdec(0);
>  	}
>  
> -	/*
> -	 * Notify the VMM about page mapping conversion. More info about ABI
> -	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
> -	 * section "TDG.VP.VMCALL<MapGPA>"
> -	 */
> -	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
> +	while (retry_count < max_retries_per_page) {
> +		memset(&args, 0, sizeof(args));
> +		args.r10 = TDX_HYPERCALL_STANDARD;
> +		args.r11 = TDVMCALL_MAP_GPA;
> +		args.r12 = start;
> +		args.r13 = end - start;
> +
> +		ret = __tdx_hypercall_ret(&args);
> +		if (ret != TDVMCALL_STATUS_RETRY)
> +			return !ret;
> +		/*
> +		 * The guest must retry the operation for the pages in the
> +		 * region starting at the GPA specified in R11. R11 comes
> +		 * from the untrusted VMM. Sanity check it.
> +		 */
> +		map_fail_paddr = args.r11;

Do you really need map_fail_paddr? Why not directly use args.r11?

> +		if (map_fail_paddr < start || map_fail_paddr >= end)
> +			return false;
> +
> +		/* "Consume" a retry without forward progress */
> +		if (map_fail_paddr == start) {
> +			retry_count++;
> +			continue;
> +		}
> +
> +		start = map_fail_paddr;
> +		retry_count = 0;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * Inform the VMM of the guest's intent for this physical page: shared with
> + * the VMM or private to the guest.  The VMM is expected to change its mapping
> + * of the page in response.
> + */
> +static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> +{
> +	phys_addr_t start = __pa(vaddr);
> +	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
> +
> +	if (!tdx_map_gpa(start, end, enc))
>  		return false;
>  
>  	/* shared->private conversion requires memory to be accepted before use */
> diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
> index 90ea813c4b99..9db89a99ae5b 100644
> --- a/arch/x86/include/asm/shared/tdx.h
> +++ b/arch/x86/include/asm/shared/tdx.h
> @@ -24,6 +24,8 @@
>  #define TDVMCALL_MAP_GPA		0x10001
>  #define TDVMCALL_REPORT_FATAL_ERROR	0x10003
>  
> +#define TDVMCALL_STATUS_RETRY		1
> +
>  #ifndef __ASSEMBLY__
>  
>  /*

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply

* RE: [PATCH 1/3] x86/vector: Rename send_cleanup_vector() to vector_schedule_cleanup()
From: Li, Xin3 @ 2023-06-20 17:33 UTC (permalink / raw)
  To: Steve Wahl
  Cc: linux-kernel@vger.kernel.org, platform-driver-x86@vger.kernel.org,
	iommu@lists.linux.dev, linux-hyperv@vger.kernel.org,
	linux-perf-users@vger.kernel.org, x86@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, Sivanich, Dimitri,
	Anderson, Russ, dvhart@infradead.org, andy@infradead.org,
	joro@8bytes.org, suravee.suthikulpanit@amd.com, will@kernel.org,
	robin.murphy@arm.com, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, Cui, Dexuan, dwmw2@infradead.org,
	baolu.lu@linux.intel.com, peterz@infradead.org, acme@kernel.org,
	mark.rutland@arm.com, alexander.shishkin@linux.intel.com,
	jolsa@kernel.org, namhyung@kernel.org, irogers@google.com,
	Hunter, Adrian, Christopherson,, Sean, jiangshanlai@gmail.com,
	jgg@ziepe.ca, yangtiezhu@loongson.cn
In-Reply-To: <ZJHTYUrKA/mclxRZ@swahl-home.5wahls.com>

> Reviewed-by: Steve Wahl <steve.wahl@hpe.com>

Thanks, will add your RB to this patch in the next iteration.

  Xin

^ permalink raw reply

* RE: [PATCH 2/3] x86/vector: Replace IRQ_MOVE_CLEANUP_VECTOR with a timer callback
From: Li, Xin3 @ 2023-06-20 17:30 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel@vger.kernel.org,
	platform-driver-x86@vger.kernel.org, iommu@lists.linux.dev,
	linux-hyperv@vger.kernel.org, linux-perf-users@vger.kernel.org,
	x86@kernel.org
  Cc: mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, steve.wahl@hpe.com, mike.travis@hpe.com,
	Sivanich, Dimitri, Anderson, Russ, dvhart@infradead.org,
	andy@infradead.org, joro@8bytes.org,
	suravee.suthikulpanit@amd.com, will@kernel.org,
	robin.murphy@arm.com, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, Cui, Dexuan, dwmw2@infradead.org,
	baolu.lu@linux.intel.com, peterz@infradead.org, acme@kernel.org,
	mark.rutland@arm.com, alexander.shishkin@linux.intel.com,
	jolsa@kernel.org, namhyung@kernel.org, irogers@google.com,
	Hunter, Adrian, Christopherson,, Sean, jiangshanlai@gmail.com,
	jgg@ziepe.ca, yangtiezhu@loongson.cn
In-Reply-To: <87h6r2pm1a.ffs@tglx>

> > +/*
> > + * Called with vector_lock held
> > + */
> 
> Such a comment is patently bad.
> 
> > +static void __vector_cleanup(struct vector_cleanup *cl, bool
> > +check_irr)
> >  {
>         ....
> 
>         lockdep_assert_held(&vector_lock);
> 
> Documents the requirement clearly _and_ catches any caller which does not hold
> the lock when lockdep is enabled.

Peter Z gave the same comment, I will address in the next iteration.

Thanks!
  Xin

^ permalink raw reply

* Re: [PATCH 1/3] x86/vector: Rename send_cleanup_vector() to vector_schedule_cleanup()
From: Steve Wahl @ 2023-06-20 16:27 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, platform-driver-x86, iommu, linux-hyperv,
	linux-perf-users, x86, tglx, mingo, bp, dave.hansen, hpa,
	steve.wahl, dimitri.sivanich, russ.anderson, dvhart, andy, joro,
	suravee.suthikulpanit, will, robin.murphy, kys, haiyangz, wei.liu,
	decui, dwmw2, baolu.lu, peterz, acme, mark.rutland,
	alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter,
	seanjc, jiangshanlai, jgg, yangtiezhu
In-Reply-To: <20230619231611.2230-2-xin3.li@intel.com>

Reviewed-by: Steve Wahl <steve.wahl@hpe.com>

On Mon, Jun 19, 2023 at 04:16:09PM -0700, Xin Li wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Rename send_cleanup_vector() to vector_schedule_cleanup() for the next
> patch to replace vector cleanup IPI with a timer callback.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>  arch/x86/include/asm/hw_irq.h       | 4 ++--
>  arch/x86/kernel/apic/vector.c       | 8 ++++----
>  arch/x86/platform/uv/uv_irq.c       | 2 +-
>  drivers/iommu/amd/iommu.c           | 2 +-
>  drivers/iommu/hyperv-iommu.c        | 4 ++--
>  drivers/iommu/intel/irq_remapping.c | 2 +-
>  6 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
> index d465ece58151..551829884734 100644
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -97,10 +97,10 @@ extern struct irq_cfg *irqd_cfg(struct irq_data *irq_data);
>  extern void lock_vector_lock(void);
>  extern void unlock_vector_lock(void);
>  #ifdef CONFIG_SMP
> -extern void send_cleanup_vector(struct irq_cfg *);
> +extern void vector_schedule_cleanup(struct irq_cfg *);
>  extern void irq_complete_move(struct irq_cfg *cfg);
>  #else
> -static inline void send_cleanup_vector(struct irq_cfg *c) { }
> +static inline void vector_schedule_cleanup(struct irq_cfg *c) { }
>  static inline void irq_complete_move(struct irq_cfg *c) { }
>  #endif
>  
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index c1efebd27e6c..aa370bd0d933 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -967,7 +967,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_irq_move_cleanup)
>  	raw_spin_unlock(&vector_lock);
>  }
>  
> -static void __send_cleanup_vector(struct apic_chip_data *apicd)
> +static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
>  {
>  	unsigned int cpu;
>  
> @@ -983,13 +983,13 @@ static void __send_cleanup_vector(struct apic_chip_data *apicd)
>  	raw_spin_unlock(&vector_lock);
>  }
>  
> -void send_cleanup_vector(struct irq_cfg *cfg)
> +void vector_schedule_cleanup(struct irq_cfg *cfg)
>  {
>  	struct apic_chip_data *apicd;
>  
>  	apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
>  	if (apicd->move_in_progress)
> -		__send_cleanup_vector(apicd);
> +		__vector_schedule_cleanup(apicd);
>  }
>  
>  void irq_complete_move(struct irq_cfg *cfg)
> @@ -1007,7 +1007,7 @@ void irq_complete_move(struct irq_cfg *cfg)
>  	 * on the same CPU.
>  	 */
>  	if (apicd->cpu == smp_processor_id())
> -		__send_cleanup_vector(apicd);
> +		__vector_schedule_cleanup(apicd);
>  }
>  
>  /*
> diff --git a/arch/x86/platform/uv/uv_irq.c b/arch/x86/platform/uv/uv_irq.c
> index ee21d6a36a80..4221259a5870 100644
> --- a/arch/x86/platform/uv/uv_irq.c
> +++ b/arch/x86/platform/uv/uv_irq.c
> @@ -58,7 +58,7 @@ uv_set_irq_affinity(struct irq_data *data, const struct cpumask *mask,
>  	ret = parent->chip->irq_set_affinity(parent, mask, force);
>  	if (ret >= 0) {
>  		uv_program_mmr(cfg, data->chip_data);
> -		send_cleanup_vector(cfg);
> +		vector_schedule_cleanup(cfg);
>  	}
>  
>  	return ret;
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index dc1ec6849775..b5900e70de60 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -3658,7 +3658,7 @@ static int amd_ir_set_affinity(struct irq_data *data,
>  	 * at the new destination. So, time to cleanup the previous
>  	 * vector allocation.
>  	 */
> -	send_cleanup_vector(cfg);
> +	vector_schedule_cleanup(cfg);
>  
>  	return IRQ_SET_MASK_OK_DONE;
>  }
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index 8302db7f783e..8a5c17b97310 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -51,7 +51,7 @@ static int hyperv_ir_set_affinity(struct irq_data *data,
>  	if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
>  		return ret;
>  
> -	send_cleanup_vector(cfg);
> +	vector_schedule_cleanup(cfg);
>  
>  	return 0;
>  }
> @@ -257,7 +257,7 @@ static int hyperv_root_ir_set_affinity(struct irq_data *data,
>  	if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
>  		return ret;
>  
> -	send_cleanup_vector(cfg);
> +	vector_schedule_cleanup(cfg);
>  
>  	return 0;
>  }
> diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
> index a1b987335b31..55d899f5a14b 100644
> --- a/drivers/iommu/intel/irq_remapping.c
> +++ b/drivers/iommu/intel/irq_remapping.c
> @@ -1180,7 +1180,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
>  	 * at the new destination. So, time to cleanup the previous
>  	 * vector allocation.
>  	 */
> -	send_cleanup_vector(cfg);
> +	vector_schedule_cleanup(cfg);
>  
>  	return IRQ_SET_MASK_OK_DONE;
>  }
> -- 
> 2.34.1
> 

-- 
Steve Wahl, Hewlett Packard Enterprise

^ permalink raw reply

* [PATCH v8 2/2] x86/tdx: Support vmalloc() for tdx_enc_status_changed()
From: Dexuan Cui @ 2023-06-20 15:48 UTC (permalink / raw)
  To: ak, arnd, bp, brijesh.singh, dan.j.williams, dave.hansen,
	dave.hansen, haiyangz, hpa, jane.chu, kirill.shutemov, kys,
	linux-arch, linux-hyperv, luto, mingo, peterz, rostedt,
	sathyanarayanan.kuppuswamy, seanjc, tglx, tony.luck, wei.liu, x86,
	mikelley
  Cc: linux-kernel, Tianyu.Lan, rick.p.edgecombe, Dexuan Cui
In-Reply-To: <20230620154830.25442-1-decui@microsoft.com>

When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
allocates buffers using vzalloc(), and needs to share the buffers with the
host OS by calling set_memory_decrypted(), which is not working for
vmalloc() yet. Add the support by handling the pages one by one.

Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 arch/x86/coco/tdx/tdx.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

Changes in v2:
  Changed tdx_enc_status_changed() in place.

Changes in v3:
  No change since v2.

Changes in v4:
  Added Kirill's Co-developed-by since Kirill helped to improve the
    code by adding tdx_enc_status_changed_phys().

  Thanks Kirill for the clarification on load_unaligned_zeropad()!

Changes in v5:
  Added Kirill's Signed-off-by.
  Added Michael's Reviewed-by.

Changes in v6: None.

Changes in v7: None.
  Note: there was a race between set_memory_encrypted() and
  load_unaligned_zeropad(), which has been fixed by the 3 patches of
  Kirill in the x86/tdx branch of the tip tree.

Changes in v8:
  Rebased to tip.git's master branch.

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 0c198ab73aa7..a313d5ab42f1 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -8,6 +8,7 @@
 #include <linux/export.h>
 #include <linux/io.h>
+#include <linux/mm.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
 #include <asm/insn.h>
@@ -752,6 +753,19 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
 	return false;
 }
 
+static bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end,
+					bool enc)
+{
+	if (!tdx_map_gpa(start, end, enc))
+		return false;
+
+	/* shared->private conversion requires memory to be accepted before use */
+	if (enc)
+		return tdx_accept_memory(start, end);
+
+	return true;
+}
+
 /*
  * Inform the VMM of the guest's intent for this physical page: shared with
  * the VMM or private to the guest.  The VMM is expected to change its mapping
@@ -759,15 +773,24 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
  */
 static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 {
-	phys_addr_t start = __pa(vaddr);
-	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
+	unsigned long start = vaddr;
+	unsigned long end = start + numpages * PAGE_SIZE;
 
-	if (!tdx_map_gpa(start, end, enc))
+	if (offset_in_page(start) != 0)
 		return false;
 
-	/* shared->private conversion requires memory to be accepted before use */
-	if (enc)
-		return tdx_accept_memory(start, end);
+	if (!is_vmalloc_addr((void *)start))
+		return tdx_enc_status_changed_phys(__pa(start), __pa(end), enc);
+
+	while (start < end) {
+		phys_addr_t start_pa = slow_virt_to_phys((void *)start);
+		phys_addr_t end_pa = start_pa + PAGE_SIZE;
+
+		if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
+			return false;
+
+		start += PAGE_SIZE;
+	}
 
 	return true;
 }
-- 
2.25.1


^ permalink raw reply related

* [PATCH v8 1/2] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
From: Dexuan Cui @ 2023-06-20 15:48 UTC (permalink / raw)
  To: ak, arnd, bp, brijesh.singh, dan.j.williams, dave.hansen,
	dave.hansen, haiyangz, hpa, jane.chu, kirill.shutemov, kys,
	linux-arch, linux-hyperv, luto, mingo, peterz, rostedt,
	sathyanarayanan.kuppuswamy, seanjc, tglx, tony.luck, wei.liu, x86,
	mikelley
  Cc: linux-kernel, Tianyu.Lan, rick.p.edgecombe, Dexuan Cui
In-Reply-To: <20230620154830.25442-1-decui@microsoft.com>

GHCI spec for TDX 1.0 says that the MapGPA call may fail with the R10
error code = TDG.VP.VMCALL_RETRY (1), and the guest must retry this
operation for the pages in the region starting at the GPA specified
in R11.

When a fully enlightened TDX guest runs on Hyper-V, Hyper-V can return
the retry error when set_memory_decrypted() is called to decrypt up to
1GB of swiotlb bounce buffers.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---

 arch/x86/coco/tdx/tdx.c           | 63 +++++++++++++++++++++++++------
 arch/x86/include/asm/shared/tdx.h |  2 +
 2 files changed, 53 insertions(+), 12 deletions(-)

Changes in v2:
  Used __tdx_hypercall() directly in tdx_map_gpa().
  Added a max_retry_cnt of 1000.
  Renamed a few variables, e.g., r11 -> map_fail_paddr.

Changes in v3:
  Changed max_retry_cnt from 1000 to 3.

Changes in v4:
  __tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT) -> __tdx_hypercall_ret()
  Added Kirill's Acked-by.

Changes in v5:
  Added Michael's Reviewed-by.

Changes in v6: None.

Changes in v7:
  Addressed Dave's comments:
  see https://lwn.net/ml/linux-kernel/SA1PR21MB1335736123C2BCBBFD7460C3BF46A@SA1PR21MB1335.namprd21.prod.outlook.com

Changes in v8:
  Rebased to tip.git's master branch.

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1d6b863c42b0..0c198ab73aa7 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -703,14 +703,16 @@ static bool tdx_cache_flush_required(void)
 }
 
 /*
- * Inform the VMM of the guest's intent for this physical page: shared with
- * the VMM or private to the guest.  The VMM is expected to change its mapping
- * of the page in response.
+ * Notify the VMM about page mapping conversion. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface (GHCI),
+ * section "TDG.VP.VMCALL<MapGPA>".
  */
-static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
 {
-	phys_addr_t start = __pa(vaddr);
-	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
+	const int max_retries_per_page = 3;
+	struct tdx_hypercall_args args;
+	u64 map_fail_paddr, ret;
+	int retry_count = 0;
 
 	if (!enc) {
 		/* Set the shared (decrypted) bits: */
@@ -718,12 +720,49 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 		end   |= cc_mkdec(0);
 	}
 
-	/*
-	 * Notify the VMM about page mapping conversion. More info about ABI
-	 * can be found in TDX Guest-Host-Communication Interface (GHCI),
-	 * section "TDG.VP.VMCALL<MapGPA>"
-	 */
-	if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+	while (retry_count < max_retries_per_page) {
+		memset(&args, 0, sizeof(args));
+		args.r10 = TDX_HYPERCALL_STANDARD;
+		args.r11 = TDVMCALL_MAP_GPA;
+		args.r12 = start;
+		args.r13 = end - start;
+
+		ret = __tdx_hypercall_ret(&args);
+		if (ret != TDVMCALL_STATUS_RETRY)
+			return !ret;
+		/*
+		 * The guest must retry the operation for the pages in the
+		 * region starting at the GPA specified in R11. R11 comes
+		 * from the untrusted VMM. Sanity check it.
+		 */
+		map_fail_paddr = args.r11;
+		if (map_fail_paddr < start || map_fail_paddr >= end)
+			return false;
+
+		/* "Consume" a retry without forward progress */
+		if (map_fail_paddr == start) {
+			retry_count++;
+			continue;
+		}
+
+		start = map_fail_paddr;
+		retry_count = 0;
+	}
+
+	return false;
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest.  The VMM is expected to change its mapping
+ * of the page in response.
+ */
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+	phys_addr_t start = __pa(vaddr);
+	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
+
+	if (!tdx_map_gpa(start, end, enc))
 		return false;
 
 	/* shared->private conversion requires memory to be accepted before use */
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 90ea813c4b99..9db89a99ae5b 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -24,6 +24,8 @@
 #define TDVMCALL_MAP_GPA		0x10001
 #define TDVMCALL_REPORT_FATAL_ERROR	0x10003
 
+#define TDVMCALL_STATUS_RETRY		1
+
 #ifndef __ASSEMBLY__
 
 /*
-- 
2.25.1


^ permalink raw reply related

* [PATCH v8 0/2] Support TDX guests on Hyper-V (the x86/tdx part)
From: Dexuan Cui @ 2023-06-20 15:48 UTC (permalink / raw)
  To: ak, arnd, bp, brijesh.singh, dan.j.williams, dave.hansen,
	dave.hansen, haiyangz, hpa, jane.chu, kirill.shutemov, kys,
	linux-arch, linux-hyperv, luto, mingo, peterz, rostedt,
	sathyanarayanan.kuppuswamy, seanjc, tglx, tony.luck, wei.liu, x86,
	mikelley
  Cc: linux-kernel, Tianyu.Lan, rick.p.edgecombe, Dexuan Cui

v8 is a rebased version of v7:
  v8 is based on tip.git's master branch, which has commit
      75d090fd167a ("x86/tdx: Add unaccepted memory support"), so I have
      to rebase v7 to the commit and post v8.

  v7 is based on tip.git's x86/tdx branch.

The two patches (which are based on the latest master branch of the tip
tree) are the x86/tdx part of the v6 patchset:
https://lwn.net/ml/linux-kernel/20230504225351.10765-1-decui@microsoft.com/

The other patches of the v6 patchset needs more changes in preparation for
the upcoming paravisor support, so let me post the x86/tdx part first.

This v8 patchset addressed Dave's comments on patch 1:
see https://lwn.net/ml/linux-kernel/SA1PR21MB1335736123C2BCBBFD7460C3BF46A@SA1PR21MB1335.namprd21.prod.outlook.com/

Patch 2 is just a repost. There was a race between set_memory_encrypted()
and load_unaligned_zeropad(), which has been fixed by the 3 patches of
Kirill in the tip tree:
  3f6819dd192e ("x86/mm: Allow guest.enc_status_change_prepare() to fail")
  195edce08b63 ("x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()")
  94142c9d1bdf ("x86/mm: Fix enc_status_change_finish_noop()")
  (see https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=x86/tdx)

If you want to view the patchset on github, it is here:
https://github.com/dcui/tdx/commits/decui/upstream-tip/master/tdx/v8-x86-tdx-only

Dexuan Cui (2):
  x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
  x86/tdx: Support vmalloc() for tdx_enc_status_changed()

 arch/x86/coco/tdx/tdx.c           | 86 ++++++++++++++++++++++++++-----
 arch/x86/include/asm/shared/tdx.h |  2 +
 2 files changed, 76 insertions(+), 12 deletions(-)

-- 
2.25.1

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH v2] x86/hyperv: add noop functions to x86_init mpparse functions
From: Saurabh Singh Sengar @ 2023-06-20 10:09 UTC (permalink / raw)
  To: Saurabh Singh Sengar, Dave Hansen, Saurabh Sengar, KY Srinivasan,
	Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, Michael Kelley (LINUX),
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	hpa@zytor.com
In-Reply-To: <SEZP153MB0742622EE41C33148CB598DDBE5AA@SEZP153MB0742.APCP153.PROD.OUTLOOK.COM>



> -----Original Message-----
> From: Saurabh Singh Sengar <ssengar@microsoft.com>
> Sent: Wednesday, June 14, 2023 9:36 AM
> To: Dave Hansen <dave.hansen@intel.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; KY Srinivasan <kys@microsoft.com>;
> Haiyang Zhang <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; tglx@linutronix.de; mingo@redhat.com;
> bp@alien8.de; dave.hansen@linux.intel.com; x86@kernel.org; Michael Kelley
> (LINUX) <mikelley@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> hyperv@vger.kernel.org; hpa@zytor.com
> Subject: RE: [EXTERNAL] Re: [PATCH v2] x86/hyperv: add noop functions to
> x86_init mpparse functions
> 
> 
> 
> > -----Original Message-----
> > From: Dave Hansen <dave.hansen@intel.com>
> > Sent: Tuesday, June 13, 2023 11:03 PM
> > To: Saurabh Sengar <ssengar@linux.microsoft.com>; KY Srinivasan
> > <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> > wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>;
> > tglx@linutronix.de; mingo@redhat.com; bp@alien8.de;
> > dave.hansen@linux.intel.com; x86@kernel.org; Michael Kelley (LINUX)
> > <mikelley@microsoft.com>; linux- kernel@vger.kernel.org;
> > linux-hyperv@vger.kernel.org; hpa@zytor.com
> > Subject: [EXTERNAL] Re: [PATCH v2] x86/hyperv: add noop functions to
> > x86_init mpparse functions
> >
> > On 6/13/23 00:13, Saurabh Sengar wrote:
> > > In certain configurations, VTL0 and VTL2 can share the same VM
> > > partition and hence share the same memory address space. In such
> > > systems VTL2 has visibility of all of the VTL0 memory space.
> >
> > FWIW, this is pretty much gibberish to me.  The way I suggest avoiding
> > producing gibberish is avoiding acronyms:
> >
> > 	Hyper-V can run VMs at different privilege "levels".  Sometimes,
> > 	it chooses to run two different VMs at different levels but
> > 	they share some of their address space.  This <insert reason
> > 	that someone might want to do this>.
> >
> > That's not gibberish.
> 
> Thanks for your suggestion I can add this in v3.
> 
> >
> > > When CONFIG_X86_MPPARSE is enabled for VTL2, the kernel performs a
> > > scan of low memory to search for MP tables. However, in systems
> > > where
> > > VTL0 controls the low memory and may contain valid tables specific
> > > to VTL0, this scanning process can lead to confusion within the VTL2
> > > kernel.
> >
> > What is the end-user-visible effect of this "confusion"?  A crash?  A
> warning?
> > An error message?  What is the impact on end users?
> 
> The VTL2 kernel is currently scanning the VTL0 MP table and incorporating
> that information, which is incorrect because VTL2 doesn't support MP tables
> and instead, is booted with DT. While I don't have an immediate crash or
> error message to present, this situation could potentially result in incorrect
> behaviour.
> 
> >
> > This information will help the maintainers decide how to disposition
> > your patch.  Should we send it upstream immediately because it's
> > impacting millions of users?  Or can we do it in a bit more leisurely
> > fashion because nobody cares?
> 
> I understand, I have provided all the information I have please consider what
> is appropriate in this case.
> 
> >
> > > In !ACPI system, there is no way to disable CONFIG_X86_MPPARSE hence
> > > add the noop function instead.
> >
> > This makes zero sense to me.
> 
> My intention was to tell that this fix is required because we are in !ACPI
> system.
> If it was ACPI system we could have simply disable this
> CONFIG_X86_MPPARSE Option. But as you suggested I am fine removing
> these 2 lines.
> 
> >
> > Like I told you before, we don't compile things out just because they
> > don't work on one weirdo system.  If we did that, we'd have a billion
> > incompatible
> > x86 kernel images that can't boot across systems.
> >
> 
> Understood, thank you. I was just considering the option of keeping the
> default setting for CONFIG_X86_MPPARSE as 'Y' and allowing the choice to
> change it to 'N' if someone specifically desires to do so. I also considered that
> nowadays, not many kernels are likely to utilize MP tables for booting x86
> machines, which is why I thought this change wouldn't have a significant
> impact. Moreover, there is a potential benefit in terms of reducing the
> kernel's size. However, I completely respect and am open to whatever you
> decide, having better visibility of wider kernel community needs.
> 
> - Saurabh

Below is the updated commit message, if there are no more concerns I will
send the V3 with it.

Hyper-V can run VMs at different privilege "levels" known as Virtual
Trust Levels (VTL). Sometimes, it chooses to run two different VMs
at different levels but they share some of their address space. In
such setups VTL2 (higher level VM) has visibility of all of the
VTL0 (level 0) memory space.

When the CONFIG_X86_MPPARSE is enabled for VTL2, the VTL2 kernel
performs a search within the low memory to locate MP tables. However,
in systems where VTL0 manages the low memory and may contain valid
tables, this scanning can result in incorrect MP table information
being provided to the VTL2 kernel, mistakenly considering VTL0's MP
table as its own

Add noop functions to avoid MP parse scan by VTL2.

- Saurabh

^ permalink raw reply

* Re: [PATCH 2/3] x86/vector: Replace IRQ_MOVE_CLEANUP_VECTOR with a timer callback
From: Thomas Gleixner @ 2023-06-20  8:37 UTC (permalink / raw)
  To: Xin Li, linux-kernel, platform-driver-x86, iommu, linux-hyperv,
	linux-perf-users, x86
  Cc: mingo, bp, dave.hansen, hpa, steve.wahl, mike.travis,
	dimitri.sivanich, russ.anderson, dvhart, andy, joro,
	suravee.suthikulpanit, will, robin.murphy, kys, haiyangz, wei.liu,
	decui, dwmw2, baolu.lu, peterz, acme, mark.rutland,
	alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter,
	xin3.li, seanjc, jiangshanlai, jgg, yangtiezhu
In-Reply-To: <20230619231611.2230-3-xin3.li@intel.com>

On Mon, Jun 19 2023 at 16:16, Xin Li wrote:
> +/*
> + * Called with vector_lock held
> + */

Such a comment is patently bad. 

> +static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
>  {
        ....

        lockdep_assert_held(&vector_lock);

Documents the requirement clearly _and_ catches any caller which does not
hold the lock when lockdep is enabled.

Thanks,

        tglx

^ permalink raw reply

* RE: [PATCH 2/3] x86/vector: Replace IRQ_MOVE_CLEANUP_VECTOR with a timer callback
From: Li, Xin3 @ 2023-06-20  7:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel@vger.kernel.org, platform-driver-x86@vger.kernel.org,
	iommu@lists.linux.dev, linux-hyperv@vger.kernel.org,
	linux-perf-users@vger.kernel.org, x86@kernel.org,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, steve.wahl@hpe.com,
	mike.travis@hpe.com, Sivanich, Dimitri, Anderson, Russ,
	dvhart@infradead.org, andy@infradead.org, joro@8bytes.org,
	suravee.suthikulpanit@amd.com, will@kernel.org,
	robin.murphy@arm.com, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, Cui, Dexuan, dwmw2@infradead.org,
	baolu.lu@linux.intel.com, acme@kernel.org, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	namhyung@kernel.org, irogers@google.com, Hunter, Adrian,
	Christopherson,, Sean, jiangshanlai@gmail.com, jgg@ziepe.ca,
	yangtiezhu@loongson.cn
In-Reply-To: <20230620072047.GS4253@hirez.programming.kicks-ass.net>

> > +/*
> > + * Called with vector_lock held
> > + */
> 
> Instead of comments like that, I tend to add a lockdep_assert*() statement to the
> same effect. Which unlike comment actually validate the claim and since it's code
> it tends to not go stale like comments do.

neat, will do!

> > +	bool rearm = false;
> 
> 	lockdep_assert_held(&vector_lock);

My bad, didn't notice that quite a few functions in the file already do that. 



^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH v2 1/5] uio: Add hv_vmbus_client driver
From: Saurabh Singh Sengar @ 2023-06-20  7:41 UTC (permalink / raw)
  To: Greg KH
  Cc: Saurabh Sengar, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, Michael Kelley (LINUX), corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <2023062007-coral-nicotine-856c@gregkh>



> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Tuesday, June 20, 2023 12:37 PM
> To: Saurabh Singh Sengar <ssengar@microsoft.com>
> Cc: Saurabh Sengar <ssengar@linux.microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Haiyang Zhang <haiyangz@microsoft.com>;
> wei.liu@kernel.org; Dexuan Cui <decui@microsoft.com>; Michael Kelley
> (LINUX) <mikelley@microsoft.com>; corbet@lwn.net; linux-
> kernel@vger.kernel.org; linux-hyperv@vger.kernel.org; linux-
> doc@vger.kernel.org
> Subject: Re: [EXTERNAL] Re: [PATCH v2 1/5] uio: Add hv_vmbus_client driver
> 
> On Tue, Jun 20, 2023 at 05:19:14AM +0000, Saurabh Singh Sengar wrote:
> > > And you are defining a "global" variable that can be modified by an
> > > individual sysfs file for ANY device bound to this driver, messing
> > > with the other device's ring buffer size, right?  This needs to be
> > > per-device, or explain in huge detail here why not.
> >
> > The global variable is expected to be set by userspace per device
> > before opening, the particular uio device. For a particular Hyper-v
> > device this value be same, and once device is open the ring buffer is
> > allocated and there won't be any impact afterwards changing it. I can
> elaborate more of this in sysfs documentation.
> 
> That's totally confusing, please make this per-device properly, as you will find
> out when you try to document it, what you are describing is unlike any other
> per-device interface we have.

Ok, will make this per device. Thanks.

- Saurabh

> 
> greg k-h

^ permalink raw reply

* Re: [PATCH 2/3] x86/vector: Replace IRQ_MOVE_CLEANUP_VECTOR with a timer callback
From: Peter Zijlstra @ 2023-06-20  7:20 UTC (permalink / raw)
  To: Xin Li
  Cc: linux-kernel, platform-driver-x86, iommu, linux-hyperv,
	linux-perf-users, x86, tglx, mingo, bp, dave.hansen, hpa,
	steve.wahl, mike.travis, dimitri.sivanich, russ.anderson, dvhart,
	andy, joro, suravee.suthikulpanit, will, robin.murphy, kys,
	haiyangz, wei.liu, decui, dwmw2, baolu.lu, acme, mark.rutland,
	alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter,
	seanjc, jiangshanlai, jgg, yangtiezhu
In-Reply-To: <20230619231611.2230-3-xin3.li@intel.com>

On Mon, Jun 19, 2023 at 04:16:10PM -0700, Xin Li wrote:

> +/*
> + * Called with vector_lock held
> + */

Instead of comments like that, I tend to add a lockdep_assert*()
statement to the same effect. Which unlike comment actually validate the
claim and since it's code it tends to not go stale like comments do.

> +static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
>  {
>  	struct apic_chip_data *apicd;
>  	struct hlist_node *tmp;
> +	bool rearm = false;

	lockdep_assert_held(&vector_lock);

> +	hlist_for_each_entry_safe(apicd, tmp, &cl->head, clist) {
>  		unsigned int irr, vector = apicd->prev_vector;
>  
>  		/*
>  		 * Paranoia: Check if the vector that needs to be cleaned
> +		 * up is registered at the APICs IRR. That's clearly a
> +		 * hardware issue if the vector arrived on the old target
> +		 * _after_ interrupts were disabled above. Keep @apicd
> +		 * on the list and schedule the timer again to give the CPU
> +		 * a chance to handle the pending interrupt.
> +		 *
> +		 * Do not check IRR when called from lapic_offline(), because
> +		 * fixup_irqs() was just called to scan IRR for set bits and
> +		 * forward them to new destination CPUs via IPIs.
>  		 */
> +		irr = check_irr ? apic_read(APIC_IRR + (vector / 32 * 0x10)) : 0;
>  		if (irr & (1U << (vector % 32))) {
> +			pr_warn_once("Moved interrupt pending in old target APIC %u\n", apicd->irq);
> +			rearm = true;
>  			continue;
>  		}
>  		free_moved_vector(apicd);
>  	}
>  
> +	/*
> +	 * Must happen under vector_lock to make the timer_pending() check
> +	 * in __vector_schedule_cleanup() race free against the rearm here.
> +	 */
> +	if (rearm)
> +		mod_timer(&cl->timer, jiffies + 1);
> +}
> +
> +static void vector_cleanup_callback(struct timer_list *tmr)
> +{
> +	struct vector_cleanup *cl = container_of(tmr, typeof(*cl), timer);
> +
> +	/* Prevent vectors vanishing under us */
> +	raw_spin_lock_irq(&vector_lock);
> +	__vector_cleanup(cl, true);
> +	raw_spin_unlock_irq(&vector_lock);
>  }

^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH v2 1/5] uio: Add hv_vmbus_client driver
From: Greg KH @ 2023-06-20  7:06 UTC (permalink / raw)
  To: Saurabh Singh Sengar
  Cc: Saurabh Sengar, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, Michael Kelley (LINUX), corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <PUZP153MB07490FDBBB8CC3099CFF126BBE5CA@PUZP153MB0749.APCP153.PROD.OUTLOOK.COM>

On Tue, Jun 20, 2023 at 05:19:14AM +0000, Saurabh Singh Sengar wrote:
> > And you are defining a "global" variable that can be modified by an individual
> > sysfs file for ANY device bound to this driver, messing with the other device's
> > ring buffer size, right?  This needs to be per-device, or explain in huge detail
> > here why not.
> 
> The global variable is expected to be set by userspace per device before opening, the
> particular uio device. For a particular Hyper-v device this value be same, and once
> device is open the ring buffer is allocated and there won't be any impact afterwards
> changing it. I can elaborate more of this in sysfs documentation.

That's totally confusing, please make this per-device properly, as you
will find out when you try to document it, what you are describing is
unlike any other per-device interface we have.

greg k-h

^ permalink raw reply

* Re: [EXTERNAL] Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
From: Greg KH @ 2023-06-20  7:05 UTC (permalink / raw)
  To: Saurabh Singh Sengar
  Cc: Saurabh Sengar, KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org,
	Dexuan Cui, Michael Kelley (LINUX), corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <PUZP153MB07492FF43240CFD055CF268ABE5CA@PUZP153MB0749.APCP153.PROD.OUTLOOK.COM>

On Tue, Jun 20, 2023 at 05:25:06AM +0000, Saurabh Singh Sengar wrote:
> > > diff --git a/tools/hv/vmbus_bufring.c b/tools/hv/vmbus_bufring.c new
> > > file mode 100644 index 000000000000..d44a06d45b03
> > > --- /dev/null
> > > +++ b/tools/hv/vmbus_bufring.c
> > > @@ -0,0 +1,322 @@
> > > +// SPDX-License-Identifier: BSD-3-Clause
> > > +/*
> > > + * Copyright (c) 2009-2012,2016,2023 Microsoft Corp.
> > > + * Copyright (c) 2012 NetApp Inc.
> > > + * Copyright (c) 2012 Citrix Inc.
> > > + * All rights reserved.
> > 
> > No copyright for the work you did?
> 
> I have added 2023 Microsoft Corp. Please let me know if I need to add
> anything more.

Please contact your lawyers about EXACTLY what you need to do here,
that's up to them, not me!

greg k-h

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
From: Saurabh Singh Sengar @ 2023-06-20  5:25 UTC (permalink / raw)
  To: Greg KH, Saurabh Sengar
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	Michael Kelley (LINUX), corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <2023061430-facedown-getting-d9f7@gregkh>



> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Thursday, June 15, 2023 2:46 AM
> To: Saurabh Sengar <ssengar@linux.microsoft.com>
> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> corbet@lwn.net; linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org;
> linux-doc@vger.kernel.org
> Subject: [EXTERNAL] Re: [PATCH v2 2/5] tools: hv: Add vmbus_bufring
> 
> On Wed, Jun 14, 2023 at 11:15:09AM -0700, Saurabh Sengar wrote:
> > Common userspace interface for read/write from VMBus ringbuffer.
> > This implementation is open for use by any userspace driver or
> > application seeking direct control over VMBus ring buffers.
> > A significant  part of this code is borrowed from DPDK.
> 
> "  "?
> 
> Anyway, this does not explain what this is at all.

I can elaborate more in next version.

> 
> And if you "borrowed" it from DPDK, that feels odd, are you sure you are
> allowed to do so?

I will confirm this internally before sending next version.

> 
> > Link:
> > https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgith
> >
> ub.com%2FDPDK%2Fdpdk%2F&data=05%7C01%7Cssengar%40microsoft.com
> %7C79975
> >
> a82b7b44c67b0b508db6d1c9301%7C72f988bf86f141af91ab2d7cd011db47%7
> C1%7C0
> >
> %7C638223741757437265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLj
> AwMDAiLCJQ
> >
> IjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata
> =100fd
> > FVed6C5lBrikWqkWwFpfH33LHF0H8fuRb0myL0%3D&reserved=0
> 
> Not what a Link: tag is for, sorry.

Will fix, thanks for pointing this.

> 
> >
> > Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
> > ---
> > [V2]
> > - simpler sysfs path, less parsing
> >
> >  tools/hv/vmbus_bufring.c | 322
> > +++++++++++++++++++++++++++++++++++++++
> >  tools/hv/vmbus_bufring.h | 158 +++++++++++++++++++
> >  2 files changed, 480 insertions(+)
> >  create mode 100644 tools/hv/vmbus_bufring.c  create mode 100644
> > tools/hv/vmbus_bufring.h
> 
> You add new files to the tools directory, yet say nothing about how to use
> them or even how to build them.
> 
> Why is there a .h file for a single .c file?  That seems pointless, right?

This is a header file so that any userspace application can add it. This file
Is used by fcopy application in [PATCH v2 3/5] of this patch series.
If this is confusing, shall I merge 2/5 and 3/5 ? I thought better to keep the
common library code as separate patch. Please let me know your opinion.

> 
> > diff --git a/tools/hv/vmbus_bufring.c b/tools/hv/vmbus_bufring.c new
> > file mode 100644 index 000000000000..d44a06d45b03
> > --- /dev/null
> > +++ b/tools/hv/vmbus_bufring.c
> > @@ -0,0 +1,322 @@
> > +// SPDX-License-Identifier: BSD-3-Clause
> > +/*
> > + * Copyright (c) 2009-2012,2016,2023 Microsoft Corp.
> > + * Copyright (c) 2012 NetApp Inc.
> > + * Copyright (c) 2012 Citrix Inc.
> > + * All rights reserved.
> 
> No copyright for the work you did?

I have added 2023 Microsoft Corp. Please let me know if I need to add
anything more.

> 
> > + */
> > +
> > +#include <errno.h>
> > +#include <fcntl.h>
> > +#include <emmintrin.h>
> > +#include <linux/limits.h>
> > +#include <stdbool.h>
> > +#include <stdint.h>
> > +#include <stdio.h>
> > +#include <string.h>
> > +#include <sys/mman.h>
> > +#include <sys/uio.h>
> > +#include <unistd.h>
> > +#include "vmbus_bufring.h"
> > +
> > +#define	rte_compiler_barrier()	({ asm volatile ("" : : : "memory"); })
> > +
> > +#define	rte_smp_rwmb()		({ asm volatile ("" : : :
> "memory"); })
> 
> These aren't in any common header file already?

I see every userspace application is maintaining their separate copy of this.
Although I can remove this duplicate define and can use only one of these.

- Saurabh

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH v2 1/5] uio: Add hv_vmbus_client driver
From: Saurabh Singh Sengar @ 2023-06-20  5:19 UTC (permalink / raw)
  To: Greg KH, Saurabh Sengar
  Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Dexuan Cui,
	Michael Kelley (LINUX), corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-doc@vger.kernel.org
In-Reply-To: <2023061419-probe-velocity-b276@gregkh>



> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Thursday, June 15, 2023 2:43 AM
> To: Saurabh Sengar <ssengar@linux.microsoft.com>
> Cc: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <decui@microsoft.com>; Michael Kelley (LINUX) <mikelley@microsoft.com>;
> corbet@lwn.net; linux-kernel@vger.kernel.org; linux-hyperv@vger.kernel.org;
> linux-doc@vger.kernel.org
> Subject: [EXTERNAL] Re: [PATCH v2 1/5] uio: Add hv_vmbus_client driver
> 
> On Wed, Jun 14, 2023 at 11:15:08AM -0700, Saurabh Sengar wrote:
> > --- a/Documentation/ABI/stable/sysfs-bus-vmbus
> > +++ b/Documentation/ABI/stable/sysfs-bus-vmbus
> > @@ -153,6 +153,13 @@ Contact:	Stephen Hemminger
> <sthemmin@microsoft.com>
> >  Description:	Binary file created by uio_hv_generic for ring buffer
> >  Users:		Userspace drivers
> >
> > +What:		/sys/bus/vmbus/devices/<UUID>/ring_size
> > +Date:		June. 2023
> 
> No need for the "."

OK

> 
> > +KernelVersion:	6.4
> 
> 6.4 will be released without this, sorry.

Ok will change it to 6.5.

> 
> > +Contact:	Saurabh Sengar <ssengar@microsoft.com>
> > +Description:	File created by uio_hv_vmbus_client for setting device ring
> buffer size
> > +Users:		Userspace drivers
> > +
> >  What:           /sys/bus/vmbus/devices/<UUID>/channels/<N>/intr_in_full
> >  Date:           February 2019
> >  KernelVersion:  5.0
> > diff --git a/Documentation/driver-api/uio-howto.rst
> > b/Documentation/driver-api/uio-howto.rst
> > index 907ffa3b38f5..33b67f876b96 100644
> > --- a/Documentation/driver-api/uio-howto.rst
> > +++ b/Documentation/driver-api/uio-howto.rst
> > @@ -722,6 +722,52 @@ For example::
> >
> >
> > /sys/bus/vmbus/devices/3811fe4d-0fa0-4b62-981a-
> 74fc1084c757/channels/2
> > 1/ring
> >
> > +Generic Hyper-V driver for low speed devices
> > +============================================
> > +
> > +The generic driver is a kernel module named uio_hv_vmbus_client. It
> > +supports slow devices on the Hyper-V VMBus similar to uio_hv_generic
> > +for faster devices. This driver also gives flexibility of customized
> > +ring buffer sizes.
> > +
> > +Making the driver recognize the device
> > +--------------------------------------
> > +
> > +Since the driver does not declare any device GUID's, it will not get
> > +loaded automatically and will not automatically bind to any devices,
> > +you must load it and allocate id to the driver yourself. For example,
> > +to use the fcopy device class GUID::
> > +
> > +        DEV_UUID=eb765408-105f-49b6-b4aa-c123b64d17d4
> > +        driverctl -b vmbus set-override $DEV_UUID uio_hv_vmbus_client
> 
> Why are you adding a dependancy on a 300 line bash script that is not used
> by most distros?
> 
> Why not just show the "real" commands that you can use here that don't
> require an external tool not controlled by the kernel at all.

Ok will mention the regular  "echo" commands as you suggested.

> 
> > --- /dev/null
> > +++ b/drivers/uio/uio_hv_vmbus_client.c
> > @@ -0,0 +1,217 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * uio_hv_vmbus_client - UIO driver for low speed VMBus devices
> > + *
> > + * Copyright (c) 2023, Microsoft Corporation.
> > + *
> > + * Authors:
> > + *   Saurabh Sengar <ssengar@microsoft.com>
> > + *
> > + * Since the driver does not declare any device ids, you must
> > +allocate
> > + * id and bind the device to the driver yourself.  For example:
> > + * driverctl -b vmbus set-override <dev uuid> uio_hv_vmbus_client
> 
> Again, no need to discuss driverctl.

Noted.

> 
> > + */
> > +
> > +#include <linux/device.h>
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/uio_driver.h>
> > +#include <linux/hyperv.h>
> > +
> > +#define DRIVER_AUTHOR	"Saurabh Sengar <ssengar@microsoft.com>"
> > +#define DRIVER_DESC	"Generic UIO driver for low speed VMBus
> devices"
> 
> You only use these defines in one place, so why not just spell them out there,
> no need for 2 extra lines, right?

Sure, will fix

> 
> > +
> > +#define DEFAULT_HV_RING_SIZE	VMBUS_RING_SIZE(3 *
> HV_HYP_PAGE_SIZE)
> > +static int ring_size = DEFAULT_HV_RING_SIZE;
> 
> You only use that #define in one place, why have it at all?

Ok, will fix

> 
> And you are defining a "global" variable that can be modified by an individual
> sysfs file for ANY device bound to this driver, messing with the other device's
> ring buffer size, right?  This needs to be per-device, or explain in huge detail
> here why not.

The global variable is expected to be set by userspace per device before opening, the
particular uio device. For a particular Hyper-v device this value be same, and once
device is open the ring buffer is allocated and there won't be any impact afterwards
changing it. I can elaborate more of this in sysfs documentation.

> 
> > +
> > +struct uio_hv_vmbus_dev {
> > +	struct uio_info info;
> > +	struct hv_device *device;
> > +};
> > +
> > +/* Sysfs API to allow mmap of the ring buffers */ static int
> > +uio_hv_vmbus_mmap(struct file *filp, struct kobject *kobj,
> > +			     struct bin_attribute *attr, struct vm_area_struct
> *vma) {
> > +	struct device *dev = container_of(kobj, struct device, kobj);
> > +	struct hv_device *hv_dev = container_of(dev, struct hv_device,
> device);
> > +	struct vmbus_channel *channel = hv_dev->channel;
> > +	void *ring_buffer = page_address(channel->ringbuffer_page);
> > +
> > +	return vm_iomap_memory(vma, virt_to_phys(ring_buffer),
> > +			       channel->ringbuffer_pagecount << PAGE_SHIFT); }
> > +
> > +static const struct bin_attribute ring_buffer_bin_attr = {
> > +	.attr = {
> > +		.name = "ringbuffer",
> > +		.mode = 0600,
> > +	},
> > +	.mmap = uio_hv_vmbus_mmap,
> > +};
> > +
> > +/*
> > + * This is the irqcontrol callback to be registered to uio_info.
> > + * It can be used to disable/enable interrupt from user space processes.
> > + *
> > + * @param info
> > + *  pointer to uio_info.
> > + * @param irq_state
> > + *  state value. 1 to enable interrupt, 0 to disable interrupt.
> > + */
> > +static int uio_hv_vmbus_irqcontrol(struct uio_info *info, s32
> > +irq_state) {
> > +	struct uio_hv_vmbus_dev *pdata = info->priv;
> > +	struct hv_device *hv_dev = pdata->device;
> > +
> > +	/* Issue a full memory barrier before triggering the notification */
> > +	virt_mb();
> > +
> > +	vmbus_setevent(hv_dev->channel);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Callback from vmbus_event when something is in inbound ring.
> > + */
> > +static void uio_hv_vmbus_channel_cb(void *context) {
> > +	struct uio_hv_vmbus_dev *pdata = context;
> > +
> > +	/* Issue a full memory barrier before sending the event to userspace
> */
> > +	virt_mb();
> > +
> > +	uio_event_notify(&pdata->info);
> > +}
> > +
> > +static int uio_hv_vmbus_open(struct uio_info *info, struct inode
> > +*inode) {
> > +	struct uio_hv_vmbus_dev *pdata = container_of(info, struct
> uio_hv_vmbus_dev, info);
> > +	struct hv_device *hv_dev = pdata->device;
> > +	struct vmbus_channel *channel = hv_dev->channel;
> > +	int ret;
> > +
> > +	ret = vmbus_open(channel, ring_size, ring_size, NULL, 0,
> > +			 uio_hv_vmbus_channel_cb, pdata);
> > +	if (ret) {
> > +		dev_err(&hv_dev->device, "error %d when opening the
> channel\n", ret);
> > +		return ret;
> > +	}
> > +	channel->inbound.ring_buffer->interrupt_mask = 0;
> > +	set_channel_read_mode(channel, HV_CALL_ISR);
> > +
> > +	ret = device_create_bin_file(&hv_dev->device,
> &ring_buffer_bin_attr);
> > +	if (ret)
> > +		dev_err(&hv_dev->device, "sysfs create ring bin file failed;
> %d\n",
> > +ret);
> > +
> > +	return ret;
> > +}
> > +
> > +/* VMbus primary channel is closed on last close */ static int
> > +uio_hv_vmbus_release(struct uio_info *info, struct inode *inode) {
> > +	struct uio_hv_vmbus_dev *pdata = container_of(info, struct
> uio_hv_vmbus_dev, info);
> > +	struct hv_device *hv_dev = pdata->device;
> > +	struct vmbus_channel *channel = hv_dev->channel;
> > +
> > +	device_remove_bin_file(&hv_dev->device, &ring_buffer_bin_attr);
> > +	vmbus_close(channel);
> > +
> > +	return 0;
> > +}
> > +
> > +static ssize_t ring_size_show(struct device *dev, struct device_attribute
> *attr,
> > +			      char *buf)
> > +{
> > +	return sysfs_emit(buf, "%d\n", ring_size); }
> > +
> > +static ssize_t ring_size_store(struct device *dev, struct device_attribute
> *attr,
> > +			       const char *buf, size_t count) {
> > +	unsigned int val;
> > +
> > +	if (kstrtouint(buf, 0, &val) < 0)
> > +		return -EINVAL;
> > +
> > +	if (val < HV_HYP_PAGE_SIZE)
> > +		return -EINVAL;
> > +
> > +	ring_size = val;
> > +
> > +	return count;
> > +}
> > +
> > +static DEVICE_ATTR_RW(ring_size);
> > +
> > +static struct attribute *uio_hv_vmbus_client_attrs[] = {
> > +	&dev_attr_ring_size.attr,
> > +	NULL,
> > +};
> > +ATTRIBUTE_GROUPS(uio_hv_vmbus_client);
> > +
> > +static int uio_hv_vmbus_probe(struct hv_device *dev, const struct
> > +hv_vmbus_device_id *dev_id) {
> > +	struct uio_hv_vmbus_dev *pdata;
> > +	int ret;
> > +	char *name = NULL;
> > +
> > +	pdata = devm_kzalloc(&dev->device, sizeof(*pdata), GFP_KERNEL);
> > +	if (!pdata)
> > +		return -ENOMEM;
> > +
> > +	name = kasprintf(GFP_KERNEL, "%pUl", &dev->dev_instance);
> > +
> > +	/* Fill general uio info */
> > +	pdata->info.name = name; /* /sys/class/uio/uioX/name */
> > +	pdata->info.version = "1";
> > +	pdata->info.irqcontrol = uio_hv_vmbus_irqcontrol;
> > +	pdata->info.open = uio_hv_vmbus_open;
> > +	pdata->info.release = uio_hv_vmbus_release;
> > +	pdata->info.irq = UIO_IRQ_CUSTOM;
> > +	pdata->info.priv = pdata;
> > +	pdata->device = dev;
> > +
> > +	ret = uio_register_device(&dev->device, &pdata->info);
> > +	if (ret) {
> > +		dev_err(&dev->device, "uio_hv_vmbus register failed\n");
> > +		return ret;
> > +	}
> > +
> > +	hv_set_drvdata(dev, pdata);
> > +
> > +	return 0;
> > +}
> > +
> > +static void uio_hv_vmbus_remove(struct hv_device *dev) {
> > +	struct uio_hv_vmbus_dev *pdata = hv_get_drvdata(dev);
> > +
> > +	if (pdata)
> > +		uio_unregister_device(&pdata->info);
> > +}
> > +
> > +static struct hv_driver uio_hv_vmbus_drv = {
> > +	.driver.dev_groups = uio_hv_vmbus_client_groups,
> > +	.name = "uio_hv_vmbus_client",
> > +	.id_table = NULL, /* only dynamic id's */
> 
> No need to set this if it's NULL.

Ok.

Thanks for your review.
- Saurabh

> 
> thanks,
> 
> greg k-h

^ permalink raw reply

* Re: [GIT PULL] Hyper-V fixes for 6.4-rc8
From: pr-tracker-bot @ 2023-06-20  0:14 UTC (permalink / raw)
  To: Wei Liu
  Cc: Linus Torvalds, Wei Liu, Linux on Hyper-V List, Linux Kernel List,
	kys, haiyangz, decui, Michael Kelley
In-Reply-To: <ZJDihKX5BT38Rkab@liuwe-devbox-debian-v2>

The pull request you sent on Mon, 19 Jun 2023 23:19:32 +0000:

> ssh://git@gitolite.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git tags/hyperv-fixes-signed-20230619

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/692b7dc87ca6d55ab254f8259e6f970171dc9d01

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* [PATCH 3/3] tools: Get rid of IRQ_MOVE_CLEANUP_VECTOR from tools
From: Xin Li @ 2023-06-19 23:16 UTC (permalink / raw)
  To: linux-kernel, platform-driver-x86, iommu, linux-hyperv,
	linux-perf-users, x86
  Cc: tglx, mingo, bp, dave.hansen, hpa, steve.wahl, mike.travis,
	dimitri.sivanich, russ.anderson, dvhart, andy, joro,
	suravee.suthikulpanit, will, robin.murphy, kys, haiyangz, wei.liu,
	decui, dwmw2, baolu.lu, peterz, acme, mark.rutland,
	alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter,
	xin3.li, seanjc, jiangshanlai, jgg, yangtiezhu
In-Reply-To: <20230619231611.2230-1-xin3.li@intel.com>

Get rid of IRQ_MOVE_CLEANUP_VECTOR from tools.

Signed-off-by: Xin Li <xin3.li@intel.com>
---
 tools/arch/x86/include/asm/irq_vectors.h               | 7 -------
 tools/perf/trace/beauty/tracepoints/x86_irq_vectors.sh | 2 +-
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/tools/arch/x86/include/asm/irq_vectors.h b/tools/arch/x86/include/asm/irq_vectors.h
index 43dcb9284208..3a19904c2db6 100644
--- a/tools/arch/x86/include/asm/irq_vectors.h
+++ b/tools/arch/x86/include/asm/irq_vectors.h
@@ -35,13 +35,6 @@
  */
 #define FIRST_EXTERNAL_VECTOR		0x20
 
-/*
- * Reserve the lowest usable vector (and hence lowest priority)  0x20 for
- * triggering cleanup after irq migration. 0x21-0x2f will still be used
- * for device interrupts.
- */
-#define IRQ_MOVE_CLEANUP_VECTOR		FIRST_EXTERNAL_VECTOR
-
 #define IA32_SYSCALL_VECTOR		0x80
 
 /*
diff --git a/tools/perf/trace/beauty/tracepoints/x86_irq_vectors.sh b/tools/perf/trace/beauty/tracepoints/x86_irq_vectors.sh
index eed9ce0fcbe6..87dc68c7de0c 100755
--- a/tools/perf/trace/beauty/tracepoints/x86_irq_vectors.sh
+++ b/tools/perf/trace/beauty/tracepoints/x86_irq_vectors.sh
@@ -12,7 +12,7 @@ x86_irq_vectors=${arch_x86_header_dir}/irq_vectors.h
 
 # FIRST_EXTERNAL_VECTOR is not that useful, find what is its number
 # and then replace whatever is using it and that is useful, which at
-# the time of writing of this script was: IRQ_MOVE_CLEANUP_VECTOR.
+# the time of writing of this script was: 0x20.
 
 first_external_regex='^#define[[:space:]]+FIRST_EXTERNAL_VECTOR[[:space:]]+(0x[[:xdigit:]]+)$'
 first_external_vector=$(grep -E ${first_external_regex} ${x86_irq_vectors} | sed -r "s/${first_external_regex}/\1/g")
-- 
2.34.1


^ permalink raw reply related

* [PATCH 2/3] x86/vector: Replace IRQ_MOVE_CLEANUP_VECTOR with a timer callback
From: Xin Li @ 2023-06-19 23:16 UTC (permalink / raw)
  To: linux-kernel, platform-driver-x86, iommu, linux-hyperv,
	linux-perf-users, x86
  Cc: tglx, mingo, bp, dave.hansen, hpa, steve.wahl, mike.travis,
	dimitri.sivanich, russ.anderson, dvhart, andy, joro,
	suravee.suthikulpanit, will, robin.murphy, kys, haiyangz, wei.liu,
	decui, dwmw2, baolu.lu, peterz, acme, mark.rutland,
	alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter,
	xin3.li, seanjc, jiangshanlai, jgg, yangtiezhu
In-Reply-To: <20230619231611.2230-1-xin3.li@intel.com>

From: Thomas Gleixner <tglx@linutronix.de>

Replace IRQ_MOVE_CLEANUP_VECTOR with a timer callback for cleaning
up the leftovers of a moved interrupt.

The only new job incurred is to do vector cleanup in lapic_offline()
in case the vector cleanup timer has not expired.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/idtentry.h    |   1 -
 arch/x86/include/asm/irq_vectors.h |   7 --
 arch/x86/kernel/apic/vector.c      | 101 +++++++++++++++++++++++------
 arch/x86/kernel/idt.c              |   1 -
 4 files changed, 80 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index b241af4ce9b4..cd5c10a74071 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -648,7 +648,6 @@ DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
 
 #ifdef CONFIG_SMP
 DECLARE_IDTENTRY(RESCHEDULE_VECTOR,			sysvec_reschedule_ipi);
-DECLARE_IDTENTRY_SYSVEC(IRQ_MOVE_CLEANUP_VECTOR,	sysvec_irq_move_cleanup);
 DECLARE_IDTENTRY_SYSVEC(REBOOT_VECTOR,			sysvec_reboot);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_SINGLE_VECTOR,	sysvec_call_function_single);
 DECLARE_IDTENTRY_SYSVEC(CALL_FUNCTION_VECTOR,		sysvec_call_function);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 43dcb9284208..3a19904c2db6 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -35,13 +35,6 @@
  */
 #define FIRST_EXTERNAL_VECTOR		0x20
 
-/*
- * Reserve the lowest usable vector (and hence lowest priority)  0x20 for
- * triggering cleanup after irq migration. 0x21-0x2f will still be used
- * for device interrupts.
- */
-#define IRQ_MOVE_CLEANUP_VECTOR		FIRST_EXTERNAL_VECTOR
-
 #define IA32_SYSCALL_VECTOR		0x80
 
 /*
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index aa370bd0d933..23a3f3b6dd2c 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -44,7 +44,18 @@ static cpumask_var_t vector_searchmask;
 static struct irq_chip lapic_controller;
 static struct irq_matrix *vector_matrix;
 #ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
+
+static void vector_cleanup_callback(struct timer_list *tmr);
+
+struct vector_cleanup {
+	struct hlist_head	head;
+	struct timer_list	timer;
+};
+
+static DEFINE_PER_CPU(struct vector_cleanup, vector_cleanup) = {
+	.head	= HLIST_HEAD_INIT,
+	.timer	= __TIMER_INITIALIZER(vector_cleanup_callback, TIMER_PINNED),
+};
 #endif
 
 void lock_vector_lock(void)
@@ -841,10 +852,21 @@ void lapic_online(void)
 		this_cpu_write(vector_irq[vector], __setup_vector_irq(vector));
 }
 
+static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr);
+
 void lapic_offline(void)
 {
+	struct vector_cleanup *cl = this_cpu_ptr(&vector_cleanup);
+
 	lock_vector_lock();
+
+	/* In case the vector cleanup timer has not expired */
+	__vector_cleanup(cl, false);
+
 	irq_matrix_offline(vector_matrix);
+	WARN_ON_ONCE(try_to_del_timer_sync(&cl->timer) < 0);
+	WARN_ON_ONCE(!hlist_empty(&cl->head));
+
 	unlock_vector_lock();
 }
 
@@ -934,49 +956,86 @@ static void free_moved_vector(struct apic_chip_data *apicd)
 	apicd->move_in_progress = 0;
 }
 
-DEFINE_IDTENTRY_SYSVEC(sysvec_irq_move_cleanup)
+/*
+ * Called with vector_lock held
+ */
+static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
 {
-	struct hlist_head *clhead = this_cpu_ptr(&cleanup_list);
 	struct apic_chip_data *apicd;
 	struct hlist_node *tmp;
+	bool rearm = false;
 
-	ack_APIC_irq();
-	/* Prevent vectors vanishing under us */
-	raw_spin_lock(&vector_lock);
-
-	hlist_for_each_entry_safe(apicd, tmp, clhead, clist) {
+	hlist_for_each_entry_safe(apicd, tmp, &cl->head, clist) {
 		unsigned int irr, vector = apicd->prev_vector;
 
 		/*
 		 * Paranoia: Check if the vector that needs to be cleaned
-		 * up is registered at the APICs IRR. If so, then this is
-		 * not the best time to clean it up. Clean it up in the
-		 * next attempt by sending another IRQ_MOVE_CLEANUP_VECTOR
-		 * to this CPU. IRQ_MOVE_CLEANUP_VECTOR is the lowest
-		 * priority external vector, so on return from this
-		 * interrupt the device interrupt will happen first.
+		 * up is registered at the APICs IRR. That's clearly a
+		 * hardware issue if the vector arrived on the old target
+		 * _after_ interrupts were disabled above. Keep @apicd
+		 * on the list and schedule the timer again to give the CPU
+		 * a chance to handle the pending interrupt.
+		 *
+		 * Do not check IRR when called from lapic_offline(), because
+		 * fixup_irqs() was just called to scan IRR for set bits and
+		 * forward them to new destination CPUs via IPIs.
 		 */
-		irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
+		irr = check_irr ? apic_read(APIC_IRR + (vector / 32 * 0x10)) : 0;
 		if (irr & (1U << (vector % 32))) {
-			apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
+			pr_warn_once("Moved interrupt pending in old target APIC %u\n", apicd->irq);
+			rearm = true;
 			continue;
 		}
 		free_moved_vector(apicd);
 	}
 
-	raw_spin_unlock(&vector_lock);
+	/*
+	 * Must happen under vector_lock to make the timer_pending() check
+	 * in __vector_schedule_cleanup() race free against the rearm here.
+	 */
+	if (rearm)
+		mod_timer(&cl->timer, jiffies + 1);
+}
+
+static void vector_cleanup_callback(struct timer_list *tmr)
+{
+	struct vector_cleanup *cl = container_of(tmr, typeof(*cl), timer);
+
+	/* Prevent vectors vanishing under us */
+	raw_spin_lock_irq(&vector_lock);
+	__vector_cleanup(cl, true);
+	raw_spin_unlock_irq(&vector_lock);
 }
 
 static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
 {
-	unsigned int cpu;
+	unsigned int cpu = apicd->prev_cpu;
 
 	raw_spin_lock(&vector_lock);
 	apicd->move_in_progress = 0;
-	cpu = apicd->prev_cpu;
 	if (cpu_online(cpu)) {
-		hlist_add_head(&apicd->clist, per_cpu_ptr(&cleanup_list, cpu));
-		apic->send_IPI(cpu, IRQ_MOVE_CLEANUP_VECTOR);
+		struct vector_cleanup *cl = per_cpu_ptr(&vector_cleanup, cpu);
+
+		hlist_add_head(&apicd->clist, &cl->head);
+
+		/*
+		 * The lockless timer_pending() check is safe here. If it
+		 * returns true, then the callback will observe this new
+		 * apic data in the hlist as everything is serialized by
+		 * vector lock.
+		 *
+		 * If it returns false then the timer is either not armed
+		 * or the other CPU executes the callback, which again
+		 * would be blocked on vector lock. Rearming it in the
+		 * latter case makes it fire for nothing.
+		 *
+		 * This is also safe against the callback rearming the timer
+		 * because that's serialized via vector lock too.
+		 */
+		if (!timer_pending(&cl->timer)) {
+			cl->timer.expires = jiffies + 1;
+			add_timer_on(&cl->timer, cpu);
+		}
 	} else {
 		apicd->prev_vector = 0;
 	}
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..f3958262c725 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -131,7 +131,6 @@ static const __initconst struct idt_data apic_idts[] = {
 	INTG(RESCHEDULE_VECTOR,			asm_sysvec_reschedule_ipi),
 	INTG(CALL_FUNCTION_VECTOR,		asm_sysvec_call_function),
 	INTG(CALL_FUNCTION_SINGLE_VECTOR,	asm_sysvec_call_function_single),
-	INTG(IRQ_MOVE_CLEANUP_VECTOR,		asm_sysvec_irq_move_cleanup),
 	INTG(REBOOT_VECTOR,			asm_sysvec_reboot),
 #endif
 
-- 
2.34.1


^ permalink raw reply related

* [PATCH 1/3] x86/vector: Rename send_cleanup_vector() to vector_schedule_cleanup()
From: Xin Li @ 2023-06-19 23:16 UTC (permalink / raw)
  To: linux-kernel, platform-driver-x86, iommu, linux-hyperv,
	linux-perf-users, x86
  Cc: tglx, mingo, bp, dave.hansen, hpa, steve.wahl, mike.travis,
	dimitri.sivanich, russ.anderson, dvhart, andy, joro,
	suravee.suthikulpanit, will, robin.murphy, kys, haiyangz, wei.liu,
	decui, dwmw2, baolu.lu, peterz, acme, mark.rutland,
	alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter,
	xin3.li, seanjc, jiangshanlai, jgg, yangtiezhu
In-Reply-To: <20230619231611.2230-1-xin3.li@intel.com>

From: Thomas Gleixner <tglx@linutronix.de>

Rename send_cleanup_vector() to vector_schedule_cleanup() for the next
patch to replace vector cleanup IPI with a timer callback.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/hw_irq.h       | 4 ++--
 arch/x86/kernel/apic/vector.c       | 8 ++++----
 arch/x86/platform/uv/uv_irq.c       | 2 +-
 drivers/iommu/amd/iommu.c           | 2 +-
 drivers/iommu/hyperv-iommu.c        | 4 ++--
 drivers/iommu/intel/irq_remapping.c | 2 +-
 6 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index d465ece58151..551829884734 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -97,10 +97,10 @@ extern struct irq_cfg *irqd_cfg(struct irq_data *irq_data);
 extern void lock_vector_lock(void);
 extern void unlock_vector_lock(void);
 #ifdef CONFIG_SMP
-extern void send_cleanup_vector(struct irq_cfg *);
+extern void vector_schedule_cleanup(struct irq_cfg *);
 extern void irq_complete_move(struct irq_cfg *cfg);
 #else
-static inline void send_cleanup_vector(struct irq_cfg *c) { }
+static inline void vector_schedule_cleanup(struct irq_cfg *c) { }
 static inline void irq_complete_move(struct irq_cfg *c) { }
 #endif
 
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index c1efebd27e6c..aa370bd0d933 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -967,7 +967,7 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_irq_move_cleanup)
 	raw_spin_unlock(&vector_lock);
 }
 
-static void __send_cleanup_vector(struct apic_chip_data *apicd)
+static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
 {
 	unsigned int cpu;
 
@@ -983,13 +983,13 @@ static void __send_cleanup_vector(struct apic_chip_data *apicd)
 	raw_spin_unlock(&vector_lock);
 }
 
-void send_cleanup_vector(struct irq_cfg *cfg)
+void vector_schedule_cleanup(struct irq_cfg *cfg)
 {
 	struct apic_chip_data *apicd;
 
 	apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
 	if (apicd->move_in_progress)
-		__send_cleanup_vector(apicd);
+		__vector_schedule_cleanup(apicd);
 }
 
 void irq_complete_move(struct irq_cfg *cfg)
@@ -1007,7 +1007,7 @@ void irq_complete_move(struct irq_cfg *cfg)
 	 * on the same CPU.
 	 */
 	if (apicd->cpu == smp_processor_id())
-		__send_cleanup_vector(apicd);
+		__vector_schedule_cleanup(apicd);
 }
 
 /*
diff --git a/arch/x86/platform/uv/uv_irq.c b/arch/x86/platform/uv/uv_irq.c
index ee21d6a36a80..4221259a5870 100644
--- a/arch/x86/platform/uv/uv_irq.c
+++ b/arch/x86/platform/uv/uv_irq.c
@@ -58,7 +58,7 @@ uv_set_irq_affinity(struct irq_data *data, const struct cpumask *mask,
 	ret = parent->chip->irq_set_affinity(parent, mask, force);
 	if (ret >= 0) {
 		uv_program_mmr(cfg, data->chip_data);
-		send_cleanup_vector(cfg);
+		vector_schedule_cleanup(cfg);
 	}
 
 	return ret;
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index dc1ec6849775..b5900e70de60 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3658,7 +3658,7 @@ static int amd_ir_set_affinity(struct irq_data *data,
 	 * at the new destination. So, time to cleanup the previous
 	 * vector allocation.
 	 */
-	send_cleanup_vector(cfg);
+	vector_schedule_cleanup(cfg);
 
 	return IRQ_SET_MASK_OK_DONE;
 }
diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index 8302db7f783e..8a5c17b97310 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -51,7 +51,7 @@ static int hyperv_ir_set_affinity(struct irq_data *data,
 	if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
 		return ret;
 
-	send_cleanup_vector(cfg);
+	vector_schedule_cleanup(cfg);
 
 	return 0;
 }
@@ -257,7 +257,7 @@ static int hyperv_root_ir_set_affinity(struct irq_data *data,
 	if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
 		return ret;
 
-	send_cleanup_vector(cfg);
+	vector_schedule_cleanup(cfg);
 
 	return 0;
 }
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index a1b987335b31..55d899f5a14b 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1180,7 +1180,7 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	 * at the new destination. So, time to cleanup the previous
 	 * vector allocation.
 	 */
-	send_cleanup_vector(cfg);
+	vector_schedule_cleanup(cfg);
 
 	return IRQ_SET_MASK_OK_DONE;
 }
-- 
2.34.1


^ permalink raw reply related

* [PATCH 0/3] Do IRQ move cleanup with a timer instead of an IPI
From: Xin Li @ 2023-06-19 23:16 UTC (permalink / raw)
  To: linux-kernel, platform-driver-x86, iommu, linux-hyperv,
	linux-perf-users, x86
  Cc: tglx, mingo, bp, dave.hansen, hpa, steve.wahl, mike.travis,
	dimitri.sivanich, russ.anderson, dvhart, andy, joro,
	suravee.suthikulpanit, will, robin.murphy, kys, haiyangz, wei.liu,
	decui, dwmw2, baolu.lu, peterz, acme, mark.rutland,
	alexander.shishkin, jolsa, namhyung, irogers, adrian.hunter,
	xin3.li, seanjc, jiangshanlai, jgg, yangtiezhu

No point to waste a vector for cleaning up the leftovers of a moved
interrupt. Aside of that this must be the lowest priority of all vectors
which makes FRED systems utilizing vectors 0x10-0x1f more complicated
than necessary.

Schedule a timer instead.

Thomas Gleixner (2):
  x86/vector: Rename send_cleanup_vector() to vector_schedule_cleanup()
  x86/vector: Replace IRQ_MOVE_CLEANUP_VECTOR with a timer callback

Xin Li (1):
  tools: Get rid of IRQ_MOVE_CLEANUP_VECTOR from tools

 arch/x86/include/asm/hw_irq.h                 |   4 +-
 arch/x86/include/asm/idtentry.h               |   1 -
 arch/x86/include/asm/irq_vectors.h            |   7 --
 arch/x86/kernel/apic/vector.c                 | 109 ++++++++++++++----
 arch/x86/kernel/idt.c                         |   1 -
 arch/x86/platform/uv/uv_irq.c                 |   2 +-
 drivers/iommu/amd/iommu.c                     |   2 +-
 drivers/iommu/hyperv-iommu.c                  |   4 +-
 drivers/iommu/intel/irq_remapping.c           |   2 +-
 tools/arch/x86/include/asm/irq_vectors.h      |   7 --
 .../beauty/tracepoints/x86_irq_vectors.sh     |   2 +-
 11 files changed, 92 insertions(+), 49 deletions(-)

-- 
2.34.1


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox