[PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs
@ 2023-09-07 18:14 ankita
  2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita
  2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita
  0 siblings, 2 replies; 55+ messages in thread
From: ankita @ 2023-09-07 18:14 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, catalin.marinas, will
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

From: Ankit Agrawal <ankita@nvidia.com>

Today KVM forces all of the memory to either NORMAL or DEVICE_nGnRE
largely based on pfn_is_map_memory() and ignores the per-VMA flags
that indicate the memory attributes.

This means there is no way for a VM to get NORMAL_NC IO memory (what
Linux calls WC memory, which is used for performance in certain NIC
and GPU devices). There is also no way for a VM to get cachable IO
memory (like from a CXL or pre-CXL device). In both cases the memory
will be forced to be DEVICE_nGnRE and the VM's memory attributes will
be ignored.

After some discussions with ARM and CPU architects we reached the
conclusion there was no need for KVM to prevent the VM from selecting
between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
that NORMAL_NC could result in uncontained failures, but upon deeper
analysis it turns out there are already possible cases for uncontained
failures with DEVICE types too. Ultimately the platform must be
implemented in a way that ensures that all DEVICE_* and NORMAL_NC
accesses have no uncontained failures.

Fortunately real platforms do tend to implement this.

This is similar to x86 where various BIOS choices can result in
uncontained failures (machine checks in x86 speak) on IO memory. On
either architecture there is no way for KVM to know if the underlying
platform is fully safe or not.

This series expands the assumption that any uncachable IO will not
have any uncontained failures for DEVICE_* and NORMAL_NC and that
cachable IO will not have uncontained failures for NORMAL. We hope ARM
will publish information helping platform designers follow these
guidelines.

Additionally have KVM drive the decision on cachable or not entirely
on the VMA. If the VMA is cachable then the KVM memory is made NORMAL
regardless of pfn_is_map_memory(). This closes a hole where cachable
VMA mappings in a process could be accessed via an uncached alias
through KVM.

Applied over next-20230906

Ankit Agrawal (2):
  KVM: arm64: determine memory type from VMA
  KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO
    memory

 arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
 arch/arm64/include/asm/memory.h      |  2 ++
 arch/arm64/kvm/hyp/pgtable.c         |  4 +--
 arch/arm64/kvm/mmu.c                 | 40 +++++++++++++++++++++++++---
 4 files changed, 48 insertions(+), 6 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-09-07 18:14 [PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs ankita
@ 2023-09-07 18:14 ` ankita
  2023-09-07 19:12   ` Jason Gunthorpe
                     ` (2 more replies)
  2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita
  1 sibling, 3 replies; 55+ messages in thread
From: ankita @ 2023-09-07 18:14 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, catalin.marinas, will
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

From: Ankit Agrawal <ankita@nvidia.com>

Currently KVM determines if a VMA is pointing at IO memory by checking
pfn_is_map_memory(). However, the MM already gives us a way to tell what
kind of memory it is by inspecting the VMA.

Replace pfn_is_map_memory() with a check on the VMA pgprot to determine if
the memory is IO and thus needs stage-2 device mapping.

The VMA's pgprot is tested to determine the memory type with the
following mapping:

 pgprot_noncached    MT_DEVICE_nGnRnE   device
 pgprot_writecombine MT_NORMAL_NC       device
 pgprot_device       MT_DEVICE_nGnRE    device
 pgprot_tagged       MT_NORMAL_TAGGED   RAM

This patch solves a problems where it is possible for the kernel to
have VMAs pointing at cachable memory without causing
pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
devices. This memory is now properly marked as cachable in KVM.

Unfortunately when FWB is not enabled, the kernel expects to naively do
cache management by flushing the memory using an address in the
kernel's map. This does not work in several of the newly allowed
cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
and its mapping KVA is valid in case the FWB is absent before continuing.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
 arch/arm64/kvm/hyp/pgtable.c         |  2 +-
 arch/arm64/kvm/mmu.c                 | 40 +++++++++++++++++++++++++---
 3 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index d3e354bb8351..0579dbe958b9 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -430,6 +430,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
  */
 u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
 
+/**
+ * stage2_has_fwb() - Determine whether FWB is supported
+ * @pgt:    Page-table structure initialised by kvm_pgtable_stage2_init*()
+ *
+ * Return: True if FWB is supported.
+ */
+bool stage2_has_fwb(struct kvm_pgtable *pgt);
+
 /**
  * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
  * @vtcr:	Content of the VTCR register.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index f155b8c9e98c..ccd291b6893d 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -662,7 +662,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
 	return vtcr;
 }
 
-static bool stage2_has_fwb(struct kvm_pgtable *pgt)
+bool stage2_has_fwb(struct kvm_pgtable *pgt)
 {
 	if (!cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
 		return false;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 482280fe22d7..79f1caaa08a0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1391,6 +1391,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+/*
+ * Determine the memory region cacheability from VMA's pgprot. This
+ * is used to set the stage 2 PTEs.
+ */
+static unsigned long mapping_type(pgprot_t page_prot)
+{
+	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
@@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn = fault_ipa >> PAGE_SHIFT;
 	mte_allowed = kvm_vma_mte_allowed(vma);
 
+	/*
+	 * Figure out the memory type based on the user va mapping properties
+	 * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using
+	 * pgprot_device() and pgprot_noncached() respectively.
+	 */
+	if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) ||
+	    (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) ||
+	    (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC))
+		prot |= KVM_PGTABLE_PROT_DEVICE;
+	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
+		prot |= KVM_PGTABLE_PROT_X;
+
 	/* Don't use the VMA after the unlock -- it may have vanished */
 	vma = NULL;
 
@@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (exec_fault)
 		prot |= KVM_PGTABLE_PROT_X;
 
-	if (device)
-		prot |= KVM_PGTABLE_PROT_DEVICE;
-	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
-		prot |= KVM_PGTABLE_PROT_X;
+	/*
+	 *  When FWB is unsupported KVM needs to do cache flushes
+	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
+	 *  only possible if the memory is already mapped into the kernel map
+	 *  at the usual spot.
+	 *
+	 *  Validate that there is a struct page for the PFN which maps
+	 *  to the KVA that the flushing code expects.
+	 */
+	if (!stage2_has_fwb(pgt) &&
+	    !(pfn_valid(pfn) &&
+	      page_to_virt(pfn_to_page(pfn)) == kvm_host_va(PFN_PHYS(pfn)))) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
 
 	/*
 	 * Under the premise of getting a FSC_PERM fault, we just need to relax
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita
@ 2023-09-07 19:12   ` Jason Gunthorpe
  2023-10-05 16:15   ` Catalin Marinas
  2023-10-23 13:20   ` Shameerali Kolothum Thodi
  2 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2023-09-07 19:12 UTC (permalink / raw)
  To: ankita
  Cc: maz, oliver.upton, catalin.marinas, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Currently KVM determines if a VMA is pointing at IO memory by checking
> pfn_is_map_memory(). However, the MM already gives us a way to tell what
> kind of memory it is by inspecting the VMA.
> 
> Replace pfn_is_map_memory() with a check on the VMA pgprot to determine if
> the memory is IO and thus needs stage-2 device mapping.
> 
> The VMA's pgprot is tested to determine the memory type with the
> following mapping:
> 
>  pgprot_noncached    MT_DEVICE_nGnRnE   device
>  pgprot_writecombine MT_NORMAL_NC       device
>  pgprot_device       MT_DEVICE_nGnRE    device
>  pgprot_tagged       MT_NORMAL_TAGGED   RAM
> 
> This patch solves a problems where it is possible for the kernel to
> have VMAs pointing at cachable memory without causing
> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> devices. This memory is now properly marked as cachable in KVM.
> 
> Unfortunately when FWB is not enabled, the kernel expects to naively do
> cache management by flushing the memory using an address in the
> kernel's map. This does not work in several of the newly allowed
> cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> and its mapping KVA is valid in case the FWB is absent before
> continuing.

Looking at this more closely, it relies on kvm_pte_follow() which
ultimately calls __va(), and the ARM 64 version of page_to_virt() is:

#define page_to_virt(x) ({                                              \
        __typeof__(x) __page = x;                                       \
        void *__addr = __va(page_to_phys(__page));                      \
        (void *)__tag_set((const void *)__addr, page_kasan_tag(__page));\
})

So we don't actually have an issue here, anything with a struct page
will be properly handled by KVM.

Thus this can just be:

	if (!stage2_has_fwb(pgt) && !pfn_valid(pfn)))

Then the last paragraph of the commit message is:

 As cachable vs device memory is now determined by the VMA adjust
 the other checks to match their purpose. In most cases the code needs
 to know if there is a struct page, or if the memory is in the kernel
 map and pfn_valid() is an appropriate API for this.

 Note when FWB is not enabled, the kernel expects to trivially do
 cache management by flushing the memory by linearly converting a
 kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). This is
 only possibile for struct page backed memory. Do not allow non-struct
 page memory to be cachable without FWB.

> @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	gfn = fault_ipa >> PAGE_SHIFT;
>  	mte_allowed = kvm_vma_mte_allowed(vma);
>  
> +	/*
> +	 * Figure out the memory type based on the user va mapping properties
> +	 * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using
> +	 * pgprot_device() and pgprot_noncached() respectively.
> +	 */
> +	if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC))
> +		prot |= KVM_PGTABLE_PROT_DEVICE;
> +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> +		prot |= KVM_PGTABLE_PROT_X;
> +
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
>  
> @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (exec_fault)
>  		prot |= KVM_PGTABLE_PROT_X;

You still didn't remove the kvm_is_device_pfn() check from this code,
I don't think it can really stay and make any sense..

Probably this:

   if (exec_fault && (prot & KVM_PGTABLE_PROT_DEVICE))
		return -ENOEXEC;

And these two should also be pfn_valid() [precompute pfn_valid]

	if (vma_pagesize == PAGE_SIZE && !(force_pte || !pfn_valid(pte))) {

	if (fault_status != ESR_ELx_FSC_PERM && pfn_valid(pte) && kvm_has_mte(kvm)) {

Makes sense?

Check if kvm_is_device_pfn() can be removed entirely.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita
  2023-09-07 19:12   ` Jason Gunthorpe
@ 2023-10-05 16:15   ` Catalin Marinas
  2023-10-05 16:54     ` Jason Gunthorpe
  2023-10-23 13:20   ` Shameerali Kolothum Thodi
  2 siblings, 1 reply; 55+ messages in thread
From: Catalin Marinas @ 2023-10-05 16:15 UTC (permalink / raw)
  To: ankita
  Cc: jgg, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta,
	vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel,
	kvmarm, linux-kernel

On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Currently KVM determines if a VMA is pointing at IO memory by checking
> pfn_is_map_memory(). However, the MM already gives us a way to tell what
> kind of memory it is by inspecting the VMA.

Well, it doesn't. It tells us what attributes the user mapped that
memory with, not whether it's I/O memory or standard RAM.

> Replace pfn_is_map_memory() with a check on the VMA pgprot to determine if
> the memory is IO and thus needs stage-2 device mapping.
> 
> The VMA's pgprot is tested to determine the memory type with the
> following mapping:
> 
>  pgprot_noncached    MT_DEVICE_nGnRnE   device
>  pgprot_writecombine MT_NORMAL_NC       device
>  pgprot_device       MT_DEVICE_nGnRE    device
>  pgprot_tagged       MT_NORMAL_TAGGED   RAM

I would move the second patch to be the first, we could even merge that
independently as it is about relaxing the stage 2 mapping to Normal NC.
It would make it simpler I think to reason about the second patch which
further relaxes the stage 2 mapping to Normal Cacheable under certain
conditions.

> This patch solves a problems where it is possible for the kernel to
> have VMAs pointing at cachable memory without causing
> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> devices. This memory is now properly marked as cachable in KVM.
> 
> Unfortunately when FWB is not enabled, the kernel expects to naively do
> cache management by flushing the memory using an address in the
> kernel's map. This does not work in several of the newly allowed
> cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> and its mapping KVA is valid in case the FWB is absent before continuing.

I would only allow cacheable stage 2 mappings if FWB is enabled.
Otherwise we end up with a mismatch between the VMM mapping and whatever
the guest may do.

> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
>  arch/arm64/kvm/hyp/pgtable.c         |  2 +-
>  arch/arm64/kvm/mmu.c                 | 40 +++++++++++++++++++++++++---
>  3 files changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index d3e354bb8351..0579dbe958b9 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -430,6 +430,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size);
>   */
>  u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
>  
> +/**
> + * stage2_has_fwb() - Determine whether FWB is supported
> + * @pgt:    Page-table structure initialised by kvm_pgtable_stage2_init*()
> + *
> + * Return: True if FWB is supported.
> + */
> +bool stage2_has_fwb(struct kvm_pgtable *pgt);
> +
>  /**
>   * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD
>   * @vtcr:	Content of the VTCR register.
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index f155b8c9e98c..ccd291b6893d 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -662,7 +662,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
>  	return vtcr;
>  }
>  
> -static bool stage2_has_fwb(struct kvm_pgtable *pgt)
> +bool stage2_has_fwb(struct kvm_pgtable *pgt)
>  {
>  	if (!cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>  		return false;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 482280fe22d7..79f1caaa08a0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1391,6 +1391,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>  	return vma->vm_flags & VM_MTE_ALLOWED;
>  }
>  
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type(pgprot_t page_prot)
> +{
> +	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	gfn = fault_ipa >> PAGE_SHIFT;
>  	mte_allowed = kvm_vma_mte_allowed(vma);
>  
> +	/*
> +	 * Figure out the memory type based on the user va mapping properties
> +	 * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using
> +	 * pgprot_device() and pgprot_noncached() respectively.
> +	 */
> +	if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC))
> +		prot |= KVM_PGTABLE_PROT_DEVICE;
> +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> +		prot |= KVM_PGTABLE_PROT_X;

Does this mean that we can end up with some I/O memory also mapped as
executable? Is there a use-case (e.g. using CXL memory as standard guest
RAM, executable)?

> +
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
>  
> @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	if (exec_fault)
>  		prot |= KVM_PGTABLE_PROT_X;
>  
> -	if (device)
> -		prot |= KVM_PGTABLE_PROT_DEVICE;
> -	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> -		prot |= KVM_PGTABLE_PROT_X;
> +	/*
> +	 *  When FWB is unsupported KVM needs to do cache flushes
> +	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> +	 *  only possible if the memory is already mapped into the kernel map
> +	 *  at the usual spot.
> +	 *
> +	 *  Validate that there is a struct page for the PFN which maps
> +	 *  to the KVA that the flushing code expects.
> +	 */
> +	if (!stage2_has_fwb(pgt) &&
> +	    !(pfn_valid(pfn) &&
> +	      page_to_virt(pfn_to_page(pfn)) == kvm_host_va(PFN_PHYS(pfn)))) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}

My preference would be to keep most of the current logic (including
pfn_is_map_memory()) but force stage 2 cacheable for this page if the
user vma_page_prot is MT_NORMAL or MT_NORMAL_TAGGED and we have FWB. It
might be seen as an ABI change but I don't think it matters, it mostly
brings cacheable I/O mem mappings in line with standard RAM (bar the
exec permission unless there is a use-case for it).

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-10-05 16:15   ` Catalin Marinas
@ 2023-10-05 16:54     ` Jason Gunthorpe
  2023-10-10 14:25       ` Catalin Marinas
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-05 16:54 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote:
> On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > Currently KVM determines if a VMA is pointing at IO memory by checking
> > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > kind of memory it is by inspecting the VMA.
> 
> Well, it doesn't. It tells us what attributes the user mapped that
> memory with, not whether it's I/O memory or standard RAM.

There is VM_IO which is intended to be used for address space with
side effects.

And there is VM_PFNMAP which is intended to be used for address space
without struct page (IO or not)

And finally we have the pgprot bit which define the cachability.

Do you have a definition of IO memory that those three things don't
cover?

I would propose that, for KVM's purpose, IO memory is marked with
VM_IO or a non-cachable pgprot

And "standard RAM" is defined by a cachable pgprot. Linux never makes
something that is VM_IO cachable.

?

> I would move the second patch to be the first, we could even merge that
> independently as it is about relaxing the stage 2 mapping to Normal NC.
> It would make it simpler I think to reason about the second patch which
> further relaxes the stage 2 mapping to Normal Cacheable under certain
> conditions.

Make sense
 
> > Unfortunately when FWB is not enabled, the kernel expects to naively do
> > cache management by flushing the memory using an address in the
> > kernel's map. This does not work in several of the newly allowed
> > cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> > and its mapping KVA is valid in case the FWB is absent before continuing.
> 
> I would only allow cacheable stage 2 mappings if FWB is enabled.
> Otherwise we end up with a mismatch between the VMM mapping and whatever
> the guest may do.

Does it need to be stronger? If FWB is disabled and the cache flush
works then what is the issue?

> > +	/*
> > +	 * Figure out the memory type based on the user va mapping properties
> > +	 * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using
> > +	 * pgprot_device() and pgprot_noncached() respectively.
> > +	 */
> > +	if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) ||
> > +	    (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) ||
> > +	    (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC))
> > +		prot |= KVM_PGTABLE_PROT_DEVICE;
> > +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> > +		prot |= KVM_PGTABLE_PROT_X;
> 
> Does this mean that we can end up with some I/O memory also mapped as
> executable?

Yes. We don't have cachable memory with side effects in Linux?

> Is there a use-case (e.g. using CXL memory as standard guest
> RAM, executable)?

Certainly.

> > +
> >  	/* Don't use the VMA after the unlock -- it may have vanished */
> >  	vma = NULL;
> >  
> > @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >  	if (exec_fault)
> >  		prot |= KVM_PGTABLE_PROT_X;
> >  
> > -	if (device)
> > -		prot |= KVM_PGTABLE_PROT_DEVICE;
> > -	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> > -		prot |= KVM_PGTABLE_PROT_X;
> > +	/*
> > +	 *  When FWB is unsupported KVM needs to do cache flushes
> > +	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> > +	 *  only possible if the memory is already mapped into the kernel map
> > +	 *  at the usual spot.
> > +	 *
> > +	 *  Validate that there is a struct page for the PFN which maps
> > +	 *  to the KVA that the flushing code expects.
> > +	 */
> > +	if (!stage2_has_fwb(pgt) &&
> > +	    !(pfn_valid(pfn) &&
> > +	      page_to_virt(pfn_to_page(pfn)) == kvm_host_va(PFN_PHYS(pfn)))) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> 
> My preference would be to keep most of the current logic (including
> pfn_is_map_memory())

Why? I think this pfn_is_map_memory() is actually not desired, it
doesn't deal with modern memory hotplug or pgmap stuff? Isn't
pfn_valid() more appropriate?

> as an ABI change but I don't think it matters, it mostly brings
> cacheable I/O mem mappings in line with standard RAM (bar the exec
> permission unless there is a use-case for it).

I would discourage the concept of "cacheable I/O mem mappings".

Cachable memory located on a NUMA node close to the CPU should have
exactly the same treatement as cachable memory located on a NUMA node
distant from the CPU.

I think when you say "cachable I/O memory" it really just means normal
memory that lives on a NUMA node that is located on an IO device.

At the KVM level we don't care about the NUMA locality, we only care
if it is normal cachable system memory.

I think there are two issues here. 

1) KVM uses pfn_is_map_memory() which does not cover all our modern
NUMA and memory hotplug cases for normal struct page backed cachable
memory.

2) KVM doesn't work with normal cachable memory that does not have
struct pages.


For 1 the test should be 'does the pfn have a struct page, does the
struct page refer to cachable memory?'

For 2 the test should be 'does the VMA have pgprot = cachable,
VM_PFNMAP and not VM_IO (both implied)'

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-10-05 16:54     ` Jason Gunthorpe
@ 2023-10-10 14:25       ` Catalin Marinas
  2023-10-10 15:05         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Catalin Marinas @ 2023-10-10 14:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote:
> > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> > > From: Ankit Agrawal <ankita@nvidia.com>
> > > Currently KVM determines if a VMA is pointing at IO memory by checking
> > > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > > kind of memory it is by inspecting the VMA.
> > 
> > Well, it doesn't. It tells us what attributes the user mapped that
> > memory with, not whether it's I/O memory or standard RAM.
> 
> There is VM_IO which is intended to be used for address space with
> side effects.
> 
> And there is VM_PFNMAP which is intended to be used for address space
> without struct page (IO or not)
> 
> And finally we have the pgprot bit which define the cachability.
> 
> Do you have a definition of IO memory that those three things don't
> cover?
> 
> I would propose that, for KVM's purpose, IO memory is marked with
> VM_IO or a non-cachable pgprot
> 
> And "standard RAM" is defined by a cachable pgprot. Linux never makes
> something that is VM_IO cachable.

I think we can safely set a stage 2 Normal NC for a vma with pgprot
other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is
not that simple. Just because the VMM was allowed to map it as cacheable
does not mean that it supports all the CPU features. One example is MTE
where we can only guarantee that the RAM given to the OS at boot
supports tagged accesses. I've seen something similar in the past with
LSE atomics (or was it exclusives?) not being propagated. These don't
make the memory safe for a guest to use as general purpose RAM.

I don't have a nice solution, it looks more like the host kernel being
able to trust what the VMM maps and gives to a guest (or we map
everything as Device at Stage 2 as we currently do). An alternative
would be for the host to know which physical address ranges support
which attributes and ignore the vma but not sure we have such
information in the ACPI tables (we can make something up for DT).

> > > Unfortunately when FWB is not enabled, the kernel expects to naively do
> > > cache management by flushing the memory using an address in the
> > > kernel's map. This does not work in several of the newly allowed
> > > cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> > > and its mapping KVA is valid in case the FWB is absent before continuing.
> > 
> > I would only allow cacheable stage 2 mappings if FWB is enabled.
> > Otherwise we end up with a mismatch between the VMM mapping and whatever
> > the guest may do.
> 
> Does it need to be stronger? If FWB is disabled and the cache flush
> works then what is the issue?

I was thinking more about keeping things simpler and avoid any lack of
coherence between the VMM and the VM, in case the latter maps it as
Normal NC. But if the VMM doesn't touch it, the initial cache
maintenance by the KVM would do.

> I think there are two issues here. 
> 
> 1) KVM uses pfn_is_map_memory() which does not cover all our modern
> NUMA and memory hotplug cases for normal struct page backed cachable
> memory.
> 
> 2) KVM doesn't work with normal cachable memory that does not have
> struct pages.
> 
> For 1 the test should be 'does the pfn have a struct page, does the
> struct page refer to cachable memory?'
> 
> For 2 the test should be 'does the VMA have pgprot = cachable,
> VM_PFNMAP and not VM_IO (both implied)'

See above on the characteristics of the memory. If some of them are not
supported, it's probably fine (atomics not working) but others like MTE
accesses could either cause external aborts or access random addresses
from elsewhere. Currently we rely on pfn_is_map_memory() for this but we
need a way to tell that other ranges outside the initial RAM supports
all features. IOW, is any of this memory (mapped as cacheable in the
VMM) special purpose with only a subset of the CPU features supported?

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-10-10 14:25       ` Catalin Marinas
@ 2023-10-10 15:05         ` Jason Gunthorpe
  2023-10-10 17:19           ` Catalin Marinas
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-10 15:05 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Oct 10, 2023 at 03:25:22PM +0100, Catalin Marinas wrote:
> On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote:
> > > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> > > > From: Ankit Agrawal <ankita@nvidia.com>
> > > > Currently KVM determines if a VMA is pointing at IO memory by checking
> > > > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > > > kind of memory it is by inspecting the VMA.
> > > 
> > > Well, it doesn't. It tells us what attributes the user mapped that
> > > memory with, not whether it's I/O memory or standard RAM.
> > 
> > There is VM_IO which is intended to be used for address space with
> > side effects.
> > 
> > And there is VM_PFNMAP which is intended to be used for address space
> > without struct page (IO or not)
> > 
> > And finally we have the pgprot bit which define the cachability.
> > 
> > Do you have a definition of IO memory that those three things don't
> > cover?
> > 
> > I would propose that, for KVM's purpose, IO memory is marked with
> > VM_IO or a non-cachable pgprot
> > 
> > And "standard RAM" is defined by a cachable pgprot. Linux never makes
> > something that is VM_IO cachable.
> 
> I think we can safely set a stage 2 Normal NC for a vma with pgprot
> other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is
> not that simple. Just because the VMM was allowed to map it as cacheable
> does not mean that it supports all the CPU features. One example is MTE
> where we can only guarantee that the RAM given to the OS at boot
> supports tagged accesses. 

Is there a use case to supply the VMM with cachable memory that is not
full featured? At least the vfio cases I am aware of do not actually
want to do this and would probably like the arch to prevent these
corner cases upfront.

> I've seen something similar in the past with
> LSE atomics (or was it exclusives?) not being propagated. These don't
> make the memory safe for a guest to use as general purpose RAM.

At least from a mm perspective, I think it is important that cachable
struct pages are all the same and all interchangable. If the arch
cannot provide this it should not allow the pgmap/memremap to succeed
at all. Otherwise drivers using these new APIs are never going to work
fully right..

That leaves open the question of what to do with a cachable VM_PFNMAP,
but if the arch can deal with the memremap issue then it seems like it
can use the same logic when inspecting the VMA contents?

> I was thinking more about keeping things simpler and avoid any lack of
> coherence between the VMM and the VM, in case the latter maps it as
> Normal NC. But if the VMM doesn't touch it, the initial cache
> maintenance by the KVM would do.

I see, in general we usually don't have use cases for the VMM touching
the vfio memory as it is hard to make that secure. I would like it if
that memory could be a FD instead of a VMA. Maybe someday..

> See above on the characteristics of the memory. If some of them are not
> supported, it's probably fine (atomics not working) but others like MTE
> accesses could either cause external aborts or access random addresses
> from elsewhere. Currently we rely on pfn_is_map_memory() for this but we
> need a way to tell that other ranges outside the initial RAM supports
> all features. 

Indeed, I did not realize this about arm. See above on my note that at
the mm level we should not have different types of cachable struct
pages. So I think the right thing to do is fix that and still rely on
the struct page test here in kvm.

> IOW, is any of this memory (mapped as cacheable in the
> VMM) special purpose with only a subset of the CPU features supported?

At least the use cases I am interested in has the memory be
functionally indistinguishable from boot time DDR.

Cachable memory that does not function the same as DDR is something
else, maybe that is your "cachable IO memory" concept. I don't have a
use case for it. If we do decide to create it then it likely should be
tagged as MEMORY_DEVICE_PCI_P2PDMA in the mm side and not get into any
code paths that assume DDR.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-10-10 15:05         ` Jason Gunthorpe
@ 2023-10-10 17:19           ` Catalin Marinas
  2023-10-10 18:23             ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Catalin Marinas @ 2023-10-10 17:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Oct 10, 2023 at 12:05:02PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 10, 2023 at 03:25:22PM +0100, Catalin Marinas wrote:
> > On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote:
> > > > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> > > > > From: Ankit Agrawal <ankita@nvidia.com>
> > > > > Currently KVM determines if a VMA is pointing at IO memory by checking
> > > > > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > > > > kind of memory it is by inspecting the VMA.
> > > > 
> > > > Well, it doesn't. It tells us what attributes the user mapped that
> > > > memory with, not whether it's I/O memory or standard RAM.
> > > 
> > > There is VM_IO which is intended to be used for address space with
> > > side effects.
> > > 
> > > And there is VM_PFNMAP which is intended to be used for address space
> > > without struct page (IO or not)
> > > 
> > > And finally we have the pgprot bit which define the cachability.
> > > 
> > > Do you have a definition of IO memory that those three things don't
> > > cover?
> > > 
> > > I would propose that, for KVM's purpose, IO memory is marked with
> > > VM_IO or a non-cachable pgprot
> > > 
> > > And "standard RAM" is defined by a cachable pgprot. Linux never makes
> > > something that is VM_IO cachable.
> > 
> > I think we can safely set a stage 2 Normal NC for a vma with pgprot
> > other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is
> > not that simple. Just because the VMM was allowed to map it as cacheable
> > does not mean that it supports all the CPU features. One example is MTE
> > where we can only guarantee that the RAM given to the OS at boot
> > supports tagged accesses. 
> 
> Is there a use case to supply the VMM with cachable memory that is not
> full featured? At least the vfio cases I am aware of do not actually
> want to do this and would probably like the arch to prevent these
> corner cases upfront.

The MTE case is the problematic one here. On a data access, the
interconnect shifts (right) the physical address and adds an offset. The
resulting address is used to access tags. Such shift+offset is
configured by firmware at boot and normally only covers the default
memory. If there's some memory on PCIe, it's very unlikely to be
covered and we can't tell whether it simply drops such tag accesses or
makes up some random address that may or may not hit an existing memory
or device.

We don't currently have a way to describe this in ACPI tables (there
were talks about describing special purpose memory, I lost track of the
progress) and the way MTE was first designed doesn't allow a hypervisor
to prevent the guest from generating a tagged access (other than mapping
the memory as non-cacheable at Stage 2). This has been fixed in newer
architecture versions but we haven't added Linux support for it yet (and
there's no hardware available either). AFAIK, there's no MTE support for
CXL-attached memory at the moment in the current interconnects, so
better not pretend it's general purpose memory that supports all the
features.

Other than preventing malicious guest behaviour, it depends what the VM
needs cacheable access for: some GPU memory that's only used for sharing
data and we don't need all features or general purpose memory that a VM
can use to run applications from etc. The former may not need all the
features (e.g. can skip exclusives) but the latter does.

We can probably work around the MTE case by only allowing cacheable
Stage 2 if MTE is disabled for the guest or FEAT_MTE_PERM is
implemented/supported (TBD when we'll add this). For the other cases, it
would be up to the VMM how it presents the mapping to the guest (device
pass-through or general purpose memory).

> > I've seen something similar in the past with
> > LSE atomics (or was it exclusives?) not being propagated. These don't
> > make the memory safe for a guest to use as general purpose RAM.
> 
> At least from a mm perspective, I think it is important that cachable
> struct pages are all the same and all interchangable. If the arch
> cannot provide this it should not allow the pgmap/memremap to succeed
> at all. Otherwise drivers using these new APIs are never going to work
> fully right..

Yes, for struct page backed memory, the current assumption is that all
are the same, support all CPU features. It's the PFN-based memory where
we don't have such guarantees.

> That leaves open the question of what to do with a cachable VM_PFNMAP,
> but if the arch can deal with the memremap issue then it seems like it
> can use the same logic when inspecting the VMA contents?

In the MTE case, memremap() never returns a Normal Tagged mapping and it
would not map it in user-space as PROT_MTE either. But if a page is not
mmap(PROT_MTE) (or VM_MTE in vma->flags) in the VMM, it doesn't mean the
guest should not be allowed to use MTE. Qemu for example doesn't map the
guest memory with mmap(PROT_MTE) since it doesn't have a need for it but
the guest can enable MTE independently (well, if enabled at the vCPU
level).

We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
by struct page. We could probe that in KVM and either fall back to
non-cacheable or allow cacheable if MTE is disable at the vCPU level.

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-10-10 17:19           ` Catalin Marinas
@ 2023-10-10 18:23             ` Jason Gunthorpe
  2023-10-11 17:45               ` Catalin Marinas
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-10 18:23 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Oct 10, 2023 at 06:19:14PM +0100, Catalin Marinas wrote:

> This has been fixed in newer architecture versions but we haven't
> added Linux support for it yet (and there's no hardware available
> either). AFAIK, there's no MTE support for CXL-attached memory at
> the moment in the current interconnects, so better not pretend it's
> general purpose memory that supports all the features.

I don't know much about MTE, but the use case imagined for CXL memory
allows the MM to exchange any system page with a CXL page. So there
cannot be a behavioral difference.

Can usespace continue to use tagged pointers even if the mm has moved
the pages to CXL that doesn't support it?

The main purpose for this is to tier VM memory, so having CXL
behaviorally different in a VM seems fatal to me.

Linux drivers need a way to understand this, like we can't have a CXL
memory pool driver or GPU driver calling memremap_pages() and getting
a somewhat broken system because of MTE incompatibilities. So maybe
ARM really should block memremap_pages() in case of MTE until
everything is resolved?

From the mm perspective we can't have two kinds of cachable struct
pages running around that are functionally different.

> Other than preventing malicious guest behaviour, it depends what the VM
> needs cacheable access for: some GPU memory that's only used for sharing
> data and we don't need all features or general purpose memory that a VM
> can use to run applications from etc. The former may not need all the
> features (e.g. can skip exclusives) but the latter does.

Like CXL memory pooling GPU memory is used interchangably with every
kind of DDR memory in every context. It must be completely transparent
and interchangable via the mm's migration machinery.

> > > I've seen something similar in the past with
> > > LSE atomics (or was it exclusives?) not being propagated. These don't
> > > make the memory safe for a guest to use as general purpose RAM.
> > 
> > At least from a mm perspective, I think it is important that cachable
> > struct pages are all the same and all interchangable. If the arch
> > cannot provide this it should not allow the pgmap/memremap to succeed
> > at all. Otherwise drivers using these new APIs are never going to work
> > fully right..
> 
> Yes, for struct page backed memory, the current assumption is that all
> are the same, support all CPU features. It's the PFN-based memory where
> we don't have such guarantees.

I see it got a bit confused, I am talking about memremap_pages() (ie
include/linux/memremap.h), not memremap (include/linux/io.h) for
getting non-struct page memory. It is confusing :|

memremap_pages() is one of the entry points of the struct page hotplug
machinery. Things like CXL drivers assume they can hot plug in new
memory through these APIs and get new cachable struct pages that are
functionally identical to boot time cachable struct pages.

> We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
> by struct page. We could probe that in KVM and either fall back to
> non-cacheable or allow cacheable if MTE is disable at the vCPU level.

I'm not sure what this does, it is only set by shmem_map? That is
much stricter than "mappings backed by struct page"

Still, I'm not sure how to proceed here - we veered into this MTE
stuff I don't know we have experiance with yet.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-10-10 18:23             ` Jason Gunthorpe
@ 2023-10-11 17:45               ` Catalin Marinas
  2023-10-11 18:38                 ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Catalin Marinas @ 2023-10-11 17:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Oct 10, 2023 at 03:23:28PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 10, 2023 at 06:19:14PM +0100, Catalin Marinas wrote:
> > This has been fixed in newer architecture versions but we haven't
> > added Linux support for it yet (and there's no hardware available
> > either). AFAIK, there's no MTE support for CXL-attached memory at
> > the moment in the current interconnects, so better not pretend it's
> > general purpose memory that supports all the features.
> 
> I don't know much about MTE, but the use case imagined for CXL memory
> allows the MM to exchange any system page with a CXL page. So there
> cannot be a behavioral difference.
> 
> Can usespace continue to use tagged pointers even if the mm has moved
> the pages to CXL that doesn't support it?

That would lead to external aborts in the worst case or just losing tags
in the best. But none of them ideal.

> The main purpose for this is to tier VM memory, so having CXL
> behaviorally different in a VM seems fatal to me.

Yes, that's why if you can't tell the memory supports MTE, either it
should not be given to a guest with KVM_CAP_ARM_MTE or MTE gets disabled
in the guest.

> Linux drivers need a way to understand this, like we can't have a CXL
> memory pool driver or GPU driver calling memremap_pages() and getting
> a somewhat broken system because of MTE incompatibilities. So maybe
> ARM really should block memremap_pages() in case of MTE until
> everything is resolved?

I need to understand this part of the kernel a bit more but see below
(and MTE as it currently stands doesn't play well with server-like
systems, memory hotplug etc.)

> From the mm perspective we can't have two kinds of cachable struct
> pages running around that are functionally different.

From a Linux+MTE perspective, what goes into ZONE_MOVABLE should work
fine, all pages interchangeable (if it doesn't, the hardware is broken).
These are added via the add_memory_resource() hotplug path. If a
platform is known not to support this, it better not advertise MTE as a
feature (the CPUs usually have some tie-off signal when the rest of the
SoC cannot handle MTE). We could claim it's a hardware erratum if it
does.

But for ZONE_DEVICE ranges, these are not guaranteed to support all the
characteristics of the main RAM. I think that's what memremap_pages()
gives us. I'm not too familiar with this part of the kernel but IIUC
that falls under the HMM category, so not interchangeable with the
normal RAM (hotplugged or not).

> > Other than preventing malicious guest behaviour, it depends what the VM
> > needs cacheable access for: some GPU memory that's only used for sharing
> > data and we don't need all features or general purpose memory that a VM
> > can use to run applications from etc. The former may not need all the
> > features (e.g. can skip exclusives) but the latter does.
> 
> Like CXL memory pooling GPU memory is used interchangably with every
> kind of DDR memory in every context. It must be completely transparent
> and interchangable via the mm's migration machinery.

I don't see the mm code doing this but I haven't looked deep enough.
At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel
allocating ZONE_DEVICE pages and passing them to the user.

> > > > I've seen something similar in the past with
> > > > LSE atomics (or was it exclusives?) not being propagated. These don't
> > > > make the memory safe for a guest to use as general purpose RAM.
> > > 
> > > At least from a mm perspective, I think it is important that cachable
> > > struct pages are all the same and all interchangable. If the arch
> > > cannot provide this it should not allow the pgmap/memremap to succeed
> > > at all. Otherwise drivers using these new APIs are never going to work
> > > fully right..
> > 
> > Yes, for struct page backed memory, the current assumption is that all
> > are the same, support all CPU features. It's the PFN-based memory where
> > we don't have such guarantees.
> 
> I see it got a bit confused, I am talking about memremap_pages() (ie
> include/linux/memremap.h), not memremap (include/linux/io.h) for
> getting non-struct page memory. It is confusing :|
> 
> memremap_pages() is one of the entry points of the struct page hotplug
> machinery. Things like CXL drivers assume they can hot plug in new
> memory through these APIs and get new cachable struct pages that are
> functionally identical to boot time cachable struct pages.

We have two mechanisms, one in memremap.c and another in
memory_hotplug.c. So far my assumption is that only the memory added by
the latter ends up in ZONE_MOVABLE and can be used by the kernel as any
of the ZONE_NORMAL RAM, transparently to the user. For ZONE_DEVICE
allocations, one would have to explicitly mmap() it via a device fd.

If a VMM wants to mmap() such GPU memory and give it to the guest as
general purpose RAM, it should make sure it has all the characteristics
as advertised by the CPU or disable certain features (if it can).
Currently we don't have a way to tell what such memory supports (neither
ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is
that it doesn't.

> > We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
> > by struct page. We could probe that in KVM and either fall back to
> > non-cacheable or allow cacheable if MTE is disable at the vCPU level.
> 
> I'm not sure what this does, it is only set by shmem_map? That is
> much stricter than "mappings backed by struct page"

This flag is similar to the VM_MAYWRITE etc. On an mmap(), the vma gets
the VM_MTE_ALLOWED flag if the mapping is MAP_ANONYMOUS (see
arch_calc_vm_flag_bits()) or the (driver) mmap function knows that the
memory supports MTE and sets the flag explicitly. Currently that's only
done in shmem_mmap() as we know where this memory is coming from. When
the user wants an mmap(PROT_MTE), the arch code checks whether
VM_MTE_ALLOWED is set on the vma before allowing tag accesses.

Memory mapped from ZONE_DEVICE won't have such flag set, so
mmap(PROT_MTE) will fail. But for KVM guests, there's no such mmap()
call into the hypervisor. A guest can simply enable MTE at stage 1
without the hypervisor being able to tell.

> Still, I'm not sure how to proceed here - we veered into this MTE
> stuff I don't know we have experiance with yet.

We veered mostly because on arm64 such GPU memory is not guaranteed to
have all the characteristics of the generic RAM. I think only MTE is the
dangerous one and it needs extra care but I wouldn't be surprised if we
notice atomics failing.

It looks like memremap_pages() also takes a memory type and I suspect
it's only safe to map MEMORY_DEVICE_COHERENT into a guest (as generic
RAM). Is there any sanity check on the host kernel side to allow VMM
cacheable mappings only for such memory and not the other
MEMORY_DEVICE_* types?

Going back to KVM, we can relax to cacheable mapping at Stage 2 if the
vma is cacheable and either VM_MTE_ALLOWED is set or KVM_CAP_ARM_MTE is
disabled. From the earlier discussions, we can probably ignore VM_IO
since we won't have a cacheable mapping with this flag. Not sure about
VM_PFNMAP.

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-10-11 17:45               ` Catalin Marinas
@ 2023-10-11 18:38                 ` Jason Gunthorpe
  2023-10-12 16:16                   ` Catalin Marinas
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-11 18:38 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Wed, Oct 11, 2023 at 06:45:52PM +0100, Catalin Marinas wrote:

> > From the mm perspective we can't have two kinds of cachable struct
> > pages running around that are functionally different.
> 
> From a Linux+MTE perspective, what goes into ZONE_MOVABLE should work
> fine, all pages interchangeable (if it doesn't, the hardware is
> broken).

Yes, and we imagine adding GPU and CXL memory to ZONE_MOVABLE (on a
NUMA node)

> These are added via the add_memory_resource() hotplug path. If a
> platform is known not to support this, it better not advertise MTE as a
> feature (the CPUs usually have some tie-off signal when the rest of the
> SoC cannot handle MTE). We could claim it's a hardware erratum if it
> does.

Seems logical
 
> But for ZONE_DEVICE ranges, these are not guaranteed to support all the
> characteristics of the main RAM. I think that's what memremap_pages()
> gives us. I'm not too familiar with this part of the kernel but IIUC
> that falls under the HMM category, so not interchangeable with the
> normal RAM (hotplugged or not).

DAX pages use ZONE_DEVICE and they are cachable, and not "HMM".

They are not fully interchangable, but they get into the page cache,
they can back .data segements, they could be subject atomics/etc. So
they should be fully functional like DDR.

> I don't see the mm code doing this but I haven't looked deep enough.
> At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel
> allocating ZONE_DEVICE pages and passing them to the user.

Not ZONE_DEVICE. One popular coherent GPU approach is to use
ZONE_MOVABLE pages.

> > > > > I've seen something similar in the past with
> > > > > LSE atomics (or was it exclusives?) not being propagated. These don't
> > > > > make the memory safe for a guest to use as general purpose RAM.
> > > > 
> > > > At least from a mm perspective, I think it is important that cachable
> > > > struct pages are all the same and all interchangable. If the arch
> > > > cannot provide this it should not allow the pgmap/memremap to succeed
> > > > at all. Otherwise drivers using these new APIs are never going to work
> > > > fully right..
> > > 
> > > Yes, for struct page backed memory, the current assumption is that all
> > > are the same, support all CPU features. It's the PFN-based memory where
> > > we don't have such guarantees.
> > 
> > I see it got a bit confused, I am talking about memremap_pages() (ie
> > include/linux/memremap.h), not memremap (include/linux/io.h) for
> > getting non-struct page memory. It is confusing :|
> > 
> > memremap_pages() is one of the entry points of the struct page hotplug
> > machinery. Things like CXL drivers assume they can hot plug in new
> > memory through these APIs and get new cachable struct pages that are
> > functionally identical to boot time cachable struct pages.
> 
> We have two mechanisms, one in memremap.c and another in
> memory_hotplug.c.

Yes

>  So far my assumption is that only the memory added by
> the latter ends up in ZONE_MOVABLE and can be used by the kernel as any
> of the ZONE_NORMAL RAM, transparently to the user. 

Probably for now, yes. But CXL/GPU memory goes there too.

> For ZONE_DEVICE allocations, one would have to explicitly mmap() it
> via a device fd.

Not quite for DAX, it gets in through the page cache, but it is still
mmap on a FD and not anonymous memory.

> If a VMM wants to mmap() such GPU memory and give it to the guest as
> general purpose RAM, it should make sure it has all the characteristics
> as advertised by the CPU or disable certain features (if it can).

This is the VFIO flow we are talking about here, I think. PFNMAP
memory that goes into a VM that is cachable.

> Currently we don't have a way to tell what such memory supports (neither
> ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is
> that it doesn't.

Indeed, but my GPU driver hot plugged it as ZONE_MOVABLE and my VFIO
driver turned in into PFNMAP.. So these things seem incompatible.

> > > We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
> > > by struct page. We could probe that in KVM and either fall back to
> > > non-cacheable or allow cacheable if MTE is disable at the vCPU level.
> > 
> > I'm not sure what this does, it is only set by shmem_map? That is
> > much stricter than "mappings backed by struct page"
> 
> This flag is similar to the VM_MAYWRITE etc. On an mmap(), the vma gets
> the VM_MTE_ALLOWED flag if the mapping is MAP_ANONYMOUS (see
> arch_calc_vm_flag_bits()) or the (driver) mmap function knows that the
> memory supports MTE and sets the flag explicitly. Currently that's only
> done in shmem_mmap() as we know where this memory is coming from. When
> the user wants an mmap(PROT_MTE), the arch code checks whether
> VM_MTE_ALLOWED is set on the vma before allowing tag accesses.
> 
> Memory mapped from ZONE_DEVICE won't have such flag set, so
> mmap(PROT_MTE) will fail. But for KVM guests, there's no such mmap()
> call into the hypervisor. A guest can simply enable MTE at stage 1
> without the hypervisor being able to tell.

Yes, so this is all safe for DAX usages, not safe for GPU CXL NUMA
memory hotplug. :|

> > Still, I'm not sure how to proceed here - we veered into this MTE
> > stuff I don't know we have experiance with yet.
> 
> We veered mostly because on arm64 such GPU memory is not guaranteed to
> have all the characteristics of the generic RAM. I think only MTE is the
> dangerous one and it needs extra care but I wouldn't be surprised if we
> notice atomics failing.

So, at least for the system this change is being tested on, the
"pre-CXL" GPU memory is 100% interchangable with DDR memory. It is
surfaced to the OS as ZONE_MOVABLE and it should work in VMs like this
too.

> It looks like memremap_pages() also takes a memory type and I suspect
> it's only safe to map MEMORY_DEVICE_COHERENT into a guest (as generic
> RAM). Is there any sanity check on the host kernel side to allow VMM
> cacheable mappings only for such memory and not the other
> MEMORY_DEVICE_* types?

I guess it is this current KVM code we are discussing, it probably
happens by the pfn_is_map_memory() check?
 
> Going back to KVM, we can relax to cacheable mapping at Stage 2 if the
> vma is cacheable and either VM_MTE_ALLOWED is set or KVM_CAP_ARM_MTE is
> disabled.

This seems logical to me, thanks

> From the earlier discussions, we can probably ignore VM_IO
> since we won't have a cacheable mapping with this flag. Not sure about
> VM_PFNMAP.

PFNMAP is the interesting one for VFIO, at least. Can we use the same
reasoning that it will be !VM_MTE_ALLOWED and we can close the MTE
discussion.

Currently no VFIO driver is doing cachable that has memory that is
different from DDR memory. So this is sort of theoretical discussion
about future cachable HW that does use VFIO that does have a
non-uniformity.

Maybe that HW should set VM_IO on its VFIO PFN map and obviously not
use ZONE_MOVABLE?

Where does that leave us for this patch? We check the VM_MTE_ALLOWED
and check for ZONE_MOVABLE struct pages as one of the conditions for
NORMAL?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-10-11 18:38                 ` Jason Gunthorpe
@ 2023-10-12 16:16                   ` Catalin Marinas
  2024-03-10  3:49                     ` Ankit Agrawal
  0 siblings, 1 reply; 55+ messages in thread
From: Catalin Marinas @ 2023-10-12 16:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Wed, Oct 11, 2023 at 03:38:39PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 11, 2023 at 06:45:52PM +0100, Catalin Marinas wrote:
> > But for ZONE_DEVICE ranges, these are not guaranteed to support all the
> > characteristics of the main RAM. I think that's what memremap_pages()
> > gives us. I'm not too familiar with this part of the kernel but IIUC
> > that falls under the HMM category, so not interchangeable with the
> > normal RAM (hotplugged or not).
> 
> DAX pages use ZONE_DEVICE and they are cachable, and not "HMM".
> 
> They are not fully interchangable, but they get into the page cache,
> they can back .data segements, they could be subject atomics/etc. So
> they should be fully functional like DDR.

Unfortunately the Arm architecture makes the distinction between
"cacheable" and "cacheable tagged". We don't currently have any way of
describing this in firmware tables, so we rely on the hardware or
firmware not advertising MTE if such memory ends up as general purpose.

That said, DAX mappings are safe since the vma would have VM_MTE_ALLOWED
set, so no mmap(PROT_MTE) possible.

> > I don't see the mm code doing this but I haven't looked deep enough.
> > At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel
> > allocating ZONE_DEVICE pages and passing them to the user.
> 
> Not ZONE_DEVICE. One popular coherent GPU approach is to use
> ZONE_MOVABLE pages.

OK, so presumably the driver could tell on which architecture it is
running and plug in the memory appropriately (or refuse to). It's a bit
more arch knowledge in a (generic) driver that I'd like but we don't
have a way to describe or probe this yet. Maybe a firmware config would
just turn MTE off in this case (SCTLR_EL3.ATA=0 and some other chicken
bit or tie-off for the ID registers not to advertise MTE).

> > If a VMM wants to mmap() such GPU memory and give it to the guest as
> > general purpose RAM, it should make sure it has all the characteristics
> > as advertised by the CPU or disable certain features (if it can).
> 
> This is the VFIO flow we are talking about here, I think. PFNMAP
> memory that goes into a VM that is cachable.
> 
> > Currently we don't have a way to tell what such memory supports (neither
> > ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is
> > that it doesn't.
> 
> Indeed, but my GPU driver hot plugged it as ZONE_MOVABLE and my VFIO
> driver turned in into PFNMAP.. So these things seem incompatible.

So can we end up with the same pfn mapped in two different vmas, one
backed by struct page and another with VM_PFNMAP (implying no struct
page)? I don't know how this is supposed to work. Even if the memory
supports MTE, we rely on the struct page to track the status of the tags
(the PG_mte_tagged flag). We may get away with this if user_mem_abort()
-> sanitise_mte_tags() -> pfn_to_page() finds a page structure but I'd
actually prevent this path altogether if VM_PFNMAP (well, we may
actually have a bug if we don't already do this).

> > From the earlier discussions, we can probably ignore VM_IO
> > since we won't have a cacheable mapping with this flag. Not sure about
> > VM_PFNMAP.
> 
> PFNMAP is the interesting one for VFIO, at least. Can we use the same
> reasoning that it will be !VM_MTE_ALLOWED and we can close the MTE
> discussion.
> 
> Currently no VFIO driver is doing cachable that has memory that is
> different from DDR memory. So this is sort of theoretical discussion
> about future cachable HW that does use VFIO that does have a
> non-uniformity.
> 
> Maybe that HW should set VM_IO on its VFIO PFN map and obviously not
> use ZONE_MOVABLE?

I think we should only keep VM_IO for memory mapped I/O with side
effects. Other ranges can be VM_PFNMAP if not backed by struct page.

> Where does that leave us for this patch? We check the VM_MTE_ALLOWED
> and check for ZONE_MOVABLE struct pages as one of the conditions for
> NORMAL?

I think we should keep it as simple as possible and, looking at it
again, maybe even ignore vm_page_prot. Two questions though:

1. Does VM_IO imply vm_page_prot never having MT_NORMAL or
   MT_NORMAL_TAGGED?

2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and
   which end up in user_mem_abort()) imply VM_IO?

If yes to both, I think something like below would do:

	mte_allowed = kvm_vma_mte_allowed(vma);
	noncacheable = false;				// or 'device' as in user_mem_abort()
	...
	if (vma->flags & VM_IO)				// replaces !pfn_is_map_memory()
		noncacheable = true;
	else if (!mte_allowed && kvm_has_mte())
		noncaheable = true;
	...
	if (noncacheable)
		prot |= KVM_PGTABLE_PROT_DEVICE;	// or the new KVM_PGTABLE_PROT_NC

mte_allowed would cover DAX mappings (and, who knows, some future DAX
mapping may allow MTE and the driver explicitly set the flag). Anything
else hot-plugged into ZONE_MOVABLE should have VM_MTE_ALLOWED set or
MTE disabled altogether.

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-10-12 16:16                   ` Catalin Marinas
@ 2024-03-10  3:49                     ` Ankit Agrawal
  2024-03-19 13:38                       ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Ankit Agrawal @ 2024-03-10  3:49 UTC (permalink / raw)
  To: Catalin Marinas, Jason Gunthorpe
  Cc: maz@kernel.org, oliver.upton@linux.dev, will@kernel.org,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard,
	Dan Williams, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org

Bringing this to the fore.

>> Where does that leave us for this patch? We check the VM_MTE_ALLOWED
>> and check for ZONE_MOVABLE struct pages as one of the conditions for
>> NORMAL?
>
> I think we should keep it as simple as possible and, looking at it
> again, maybe even ignore vm_page_prot. Two questions though:
>
> 1. Does VM_IO imply vm_page_prot never having MT_NORMAL or
>    MT_NORMAL_TAGGED?
>
> 2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and
>   which end up in user_mem_abort()) imply VM_IO?
>
> If yes to both, I think something like below would do:

How may we get the answer to these? It seems to be the logical behavior, but
how can we confirm?

While we discuss on that, I am considering to send out the next version of this
patch (incorporating the feedbacks on the thread) that is rebased to a commit
inclusive of the recently applied KVM patch series:

[KVM: arm64: Allow the VM to select DEVICE_* and NORMAL_NC for IO memory]
https://lore.kernel.org/all/20240224150546.368-1-ankita@nvidia.com/

>        mte_allowed = kvm_vma_mte_allowed(vma);
>        noncacheable = false;                           // or 'device' as in user_mem_abort()
>        ...
>        if (vma->flags & VM_IO)                         // replaces !pfn_is_map_memory()
>                noncacheable = true;
>        else if (!mte_allowed && kvm_has_mte())
>                noncaheable = true;
>        ...
>        if (noncacheable)
>                prot |= KVM_PGTABLE_PROT_DEVICE;        // or the new KVM_PGTABLE_PROT_NC
>
> mte_allowed would cover DAX mappings (and, who knows, some future DAX
> mapping may allow MTE and the driver explicitly set the flag). Anything
> else hot-plugged into ZONE_MOVABLE should have VM_MTE_ALLOWED set or
> MTE disabled altogether.

--
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2024-03-10  3:49                     ` Ankit Agrawal
@ 2024-03-19 13:38                       ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2024-03-19 13:38 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Catalin Marinas, maz@kernel.org, oliver.upton@linux.dev,
	will@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple,
	John Hubbard, Dan Williams, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org

On Sun, Mar 10, 2024 at 03:49:36AM +0000, Ankit Agrawal wrote:
> Bringing this to the fore.
> 
> >> Where does that leave us for this patch? We check the VM_MTE_ALLOWED
> >> and check for ZONE_MOVABLE struct pages as one of the conditions for
> >> NORMAL?
> >
> > I think we should keep it as simple as possible and, looking at it
> > again, maybe even ignore vm_page_prot. Two questions though:
> >
> > 1. Does VM_IO imply vm_page_prot never having MT_NORMAL or
> >    MT_NORMAL_TAGGED?

Drivers do crazy things, so I would not be surprised if something in
the tree does this (wrong) thing. We can check the pgprot in the vma
as well as the VM_IO to make the decision.

> > 2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and
> >   which end up in user_mem_abort()) imply VM_IO?

Yes, this is definately true for VFIO.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
  2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita
  2023-09-07 19:12   ` Jason Gunthorpe
  2023-10-05 16:15   ` Catalin Marinas
@ 2023-10-23 13:20   ` Shameerali Kolothum Thodi
  2 siblings, 0 replies; 55+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-10-23 13:20 UTC (permalink / raw)
  To: ankita@nvidia.com, jgg@nvidia.com, maz@kernel.org,
	oliver.upton@linux.dev, catalin.marinas@arm.com, will@kernel.org
  Cc: aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org, tiantao (H), linyufeng (A)

Hi,

> -----Original Message-----
> From: ankita@nvidia.com [mailto:ankita@nvidia.com]
> Sent: 07 September 2023 19:15
> To: ankita@nvidia.com; jgg@nvidia.com; maz@kernel.org;
> oliver.upton@linux.dev; catalin.marinas@arm.com; will@kernel.org
> Cc: aniketa@nvidia.com; cjia@nvidia.com; kwankhede@nvidia.com;
> targupta@nvidia.com; vsethi@nvidia.com; acurrid@nvidia.com;
> apopple@nvidia.com; jhubbard@nvidia.com; danw@nvidia.com;
> linux-arm-kernel@lists.infradead.org; kvmarm@lists.linux.dev;
> linux-kernel@vger.kernel.org
> Subject: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
> 
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Currently KVM determines if a VMA is pointing at IO memory by checking
> pfn_is_map_memory(). However, the MM already gives us a way to tell what
> kind of memory it is by inspecting the VMA.
> 
> Replace pfn_is_map_memory() with a check on the VMA pgprot to
> determine if
> the memory is IO and thus needs stage-2 device mapping.
> 
> The VMA's pgprot is tested to determine the memory type with the
> following mapping:
> 
>  pgprot_noncached    MT_DEVICE_nGnRnE   device
>  pgprot_writecombine MT_NORMAL_NC       device
>  pgprot_device       MT_DEVICE_nGnRE    device
>  pgprot_tagged       MT_NORMAL_TAGGED   RAM
> 
> This patch solves a problems where it is possible for the kernel to
> have VMAs pointing at cachable memory without causing
> pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL
> devices. This memory is now properly marked as cachable in KVM.
> 
> Unfortunately when FWB is not enabled, the kernel expects to naively do
> cache management by flushing the memory using an address in the
> kernel's map. This does not work in several of the newly allowed
> cases such as dcache_clean_inval_poc(). Check whether the targeted pfn
> and its mapping KVA is valid in case the FWB is absent before continuing.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h |  8 ++++++
>  arch/arm64/kvm/hyp/pgtable.c         |  2 +-
>  arch/arm64/kvm/mmu.c                 | 40
> +++++++++++++++++++++++++---
>  3 files changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h
> b/arch/arm64/include/asm/kvm_pgtable.h
> index d3e354bb8351..0579dbe958b9 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -430,6 +430,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable
> *pgt, u64 addr, u64 size);
>   */
>  u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
> 
> +/**
> + * stage2_has_fwb() - Determine whether FWB is supported
> + * @pgt:    Page-table structure initialised by kvm_pgtable_stage2_init*()
> + *
> + * Return: True if FWB is supported.
> + */
> +bool stage2_has_fwb(struct kvm_pgtable *pgt);
> +
>  /**
>   * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2
> PGD
>   * @vtcr:	Content of the VTCR register.
> diff --git a/arch/arm64/kvm/hyp/pgtable.c
> b/arch/arm64/kvm/hyp/pgtable.c
> index f155b8c9e98c..ccd291b6893d 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -662,7 +662,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32
> phys_shift)
>  	return vtcr;
>  }
> 
> -static bool stage2_has_fwb(struct kvm_pgtable *pgt)
> +bool stage2_has_fwb(struct kvm_pgtable *pgt)
>  {
>  	if (!cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
>  		return false;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 482280fe22d7..79f1caaa08a0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1391,6 +1391,15 @@ static bool kvm_vma_mte_allowed(struct
> vm_area_struct *vma)
>  	return vma->vm_flags & VM_MTE_ALLOWED;
>  }
> 
> +/*
> + * Determine the memory region cacheability from VMA's pgprot. This
> + * is used to set the stage 2 PTEs.
> + */
> +static unsigned long mapping_type(pgprot_t page_prot)
> +{
> +	return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot));
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu
> *vcpu, phys_addr_t fault_ipa,
>  	gfn = fault_ipa >> PAGE_SHIFT;
>  	mte_allowed = kvm_vma_mte_allowed(vma);
> 
> +	/*
> +	 * Figure out the memory type based on the user va mapping properties
> +	 * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using
> +	 * pgprot_device() and pgprot_noncached() respectively.
> +	 */
> +	if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) ||
> +	    (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC))
> +		prot |= KVM_PGTABLE_PROT_DEVICE;
> +	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> +		prot |= KVM_PGTABLE_PROT_X;
> +
>  	/* Don't use the VMA after the unlock -- it may have vanished */
>  	vma = NULL;
> 
> @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu
> *vcpu, phys_addr_t fault_ipa,
>  	if (exec_fault)
>  		prot |= KVM_PGTABLE_PROT_X;
> 
> -	if (device)
> -		prot |= KVM_PGTABLE_PROT_DEVICE;
> -	else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
> -		prot |= KVM_PGTABLE_PROT_X;
> +	/*
> +	 *  When FWB is unsupported KVM needs to do cache flushes
> +	 *  (via dcache_clean_inval_poc()) of the underlying memory. This is
> +	 *  only possible if the memory is already mapped into the kernel map
> +	 *  at the usual spot.
> +	 *
> +	 *  Validate that there is a struct page for the PFN which maps
> +	 *  to the KVA that the flushing code expects.
> +	 */
> +	if (!stage2_has_fwb(pgt) &&
> +	    !(pfn_valid(pfn) &&
> +	      page_to_virt(pfn_to_page(pfn)) ==
> kvm_host_va(PFN_PHYS(pfn)))) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}

I don't quite follow the above check. Does no FWB matters for
Non-cacheable/device memory as well?

From a quick check, it breaks a n/w dev assignment on a platform that doesn't 
have FWB.

Qemu reports,

error: kvm run failed Invalid argument
 PC=ffff800080a95ed0 X00=ffff0000c7ff8090 X01=ffff0000c7ff80a0
X02=00000000fe180000 X03=ffff800085327000 X04=0000000000000001
X05=0000000000000040 X06=000000000000003f X07=0000000000000000
X08=ffff0000be190000 X09=0000000000000000 X10=0000000000000040

Please let me know.

Thanks,
Shameer

> 
>  	/*
>  	 * Under the premise of getting a FSC_PERM fault, we just need to relax
> --
> 2.17.1
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-07 18:14 [PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs ankita
  2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita
@ 2023-09-07 18:14 ` ankita
  2023-09-08 16:40   ` Catalin Marinas
                     ` (2 more replies)
  1 sibling, 3 replies; 55+ messages in thread
From: ankita @ 2023-09-07 18:14 UTC (permalink / raw)
  To: ankita, jgg, maz, oliver.upton, catalin.marinas, will
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

From: Ankit Agrawal <ankita@nvidia.com>

Linux allows device drivers to map IO memory on a per-page basis using
"write combining" or WC. This is often done using
pgprot_writecombing(). The driver knows which pages can support WC
access and the proper programming model to generate this IO. Generally
the use case is to boost performance by using write combining to
generate larger PCIe MemWr TLPs.

Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
all IO memory. This puts the VM in charge of the memory attributes,
and removes the KVM override to DEVICE_nGnRE.

Ultimately this makes pgprot_writecombing() work correctly in VMs and
allows drivers like mlx5 to fully operate their HW.

After some discussions with ARM and CPU architects we reached the
conclusion there was no need for KVM to prevent the VM from selecting
between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
that NORMAL_NC could result in uncontained failures, but upon deeper
analysis it turns out there are already possible cases for uncontained
failures with DEVICE types too. Ultimately the platform must be
implemented in a way that ensures that all DEVICE_* and NORMAL_NC
accesses have no uncontained failures.

Fortunately real platforms do tend to implement this.

This patch makes the VM's memory attributes behave as follows:

 S1           |   S2          |  Result
 NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
 NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
 NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
 DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>

See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf
for details.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 arch/arm64/include/asm/memory.h | 2 ++
 arch/arm64/kvm/hyp/pgtable.c    | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index fde4186cc387..c247e5f29d5a 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -147,6 +147,7 @@
  * Memory types for Stage-2 translation
  */
 #define MT_S2_NORMAL		0xf
+#define MT_S2_NORMAL_NC	0x5
 #define MT_S2_DEVICE_nGnRE	0x1

 /*
@@ -154,6 +155,7 @@
  * Stage-2 enforces Normal-WB and Device-nGnRE
  */
 #define MT_S2_FWB_NORMAL	6
+#define MT_S2_FWB_NORMAL_NC	5
 #define MT_S2_FWB_DEVICE_nGnRE	1

 #ifdef CONFIG_ARM64_4K_PAGES
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index ccd291b6893d..a80949002191 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
 				kvm_pte_t *ptep)
 {
 	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
-	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
+	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) :
 			    KVM_S2_MEMATTR(pgt, NORMAL);
 	u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;

-- 
2.17.1

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita
@ 2023-09-08 16:40   ` Catalin Marinas
  2023-09-11 14:57   ` Lorenzo Pieralisi
  2023-10-12 12:27   ` Will Deacon
  2 siblings, 0 replies; 55+ messages in thread
From: Catalin Marinas @ 2023-09-08 16:40 UTC (permalink / raw)
  To: ankita
  Cc: jgg, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta,
	vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel,
	kvmarm, linux-kernel, Lorenzo Pieralisi

+ Lorenzo

On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Linux allows device drivers to map IO memory on a per-page basis using
> "write combining" or WC. This is often done using
> pgprot_writecombing(). The driver knows which pages can support WC
> access and the proper programming model to generate this IO. Generally
> the use case is to boost performance by using write combining to
> generate larger PCIe MemWr TLPs.
> 
> Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> all IO memory. This puts the VM in charge of the memory attributes,
> and removes the KVM override to DEVICE_nGnRE.
> 
> Ultimately this makes pgprot_writecombing() work correctly in VMs and
> allows drivers like mlx5 to fully operate their HW.
> 
> After some discussions with ARM and CPU architects we reached the
> conclusion there was no need for KVM to prevent the VM from selecting
> between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> that NORMAL_NC could result in uncontained failures, but upon deeper
> analysis it turns out there are already possible cases for uncontained
> failures with DEVICE types too. Ultimately the platform must be
> implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> accesses have no uncontained failures.
> 
> Fortunately real platforms do tend to implement this.
> 
> This patch makes the VM's memory attributes behave as follows:
> 
>  S1           |   S2          |  Result
>  NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
>  DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>
> 
> See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf
> for details.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

From the discussions with the hardware people in Arm and Nvidia, we
indeed concluded that relaxing S2 to Normal-NC is not any worse than
Device (and actually Device memory is more prone to generating
uncontained errors, something to do with the Device memory ordering
semantics).

For this change:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita
  2023-09-08 16:40   ` Catalin Marinas
@ 2023-09-11 14:57   ` Lorenzo Pieralisi
  2023-09-11 17:20     ` Jason Gunthorpe
  2023-10-12 12:27   ` Will Deacon
  2 siblings, 1 reply; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-09-11 14:57 UTC (permalink / raw)
  To: ankita
  Cc: jgg, maz, oliver.upton, catalin.marinas, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Linux allows device drivers to map IO memory on a per-page basis using
> "write combining" or WC. This is often done using
> pgprot_writecombing(). The driver knows which pages can support WC

pgprot_writecombine() ?

> access and the proper programming model to generate this IO. Generally
> the use case is to boost performance by using write combining to
> generate larger PCIe MemWr TLPs.

First off, this changeset does not affect *only* Linux guests, obviously.

I understand that's the use case you are after but this change is
targeting all VMs, it must be clear.

Then WC and mapping to PCI TLPs, either you describe that in details
(NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or
you don't describe it at all, as it stands I don't know how to use
this information.

> Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> all IO memory. This puts the VM in charge of the memory attributes,
> and removes the KVM override to DEVICE_nGnRE.
> 
> Ultimately this makes pgprot_writecombing() work correctly in VMs and

pgprot_writecombine() ?

> allows drivers like mlx5 to fully operate their HW.
> 
> After some discussions with ARM and CPU architects we reached the
> conclusion there was no need for KVM to prevent the VM from selecting
> between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> that NORMAL_NC could result in uncontained failures, but upon deeper
> analysis it turns out there are already possible cases for uncontained
> failures with DEVICE types too. Ultimately the platform must be
> implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> accesses have no uncontained failures.

I would reorder/rephrase this changelog as follows:

- Describe what the problem is (ie KVM default s2 mappings)
- Describe how you are solving it
- Add a link to the documentation that states why it is safe to do
  that and the resulting s1/s2 mappings combination

It must be clear why from a legacy standpoint this is a safe change
to apply.

> Fortunately real platforms do tend to implement this.

Remove this sentence, it adds no information for someone who
is chasing bugs or just wants to understand the change itself.

> This patch makes the VM's memory attributes behave as follows:

"The resulting stage 1 and stage 2 memory types and cacheability
attributes are as follows":

>  S1           |   S2          |  Result
>  NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
>  DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>
> 
> See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf
> for details.

D8.5 I assume.

I commented on the log because it must be easy to follow, it is an
important change (and there is a lot of background information
required to understand it - please report it).

The patch itself must be reviewed by arm64/kvm maintainers; the changes
(given the background work that led to them) seem fine to me (please CC
me on next versions).

Thanks
Lorenzo

> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/include/asm/memory.h | 2 ++
>  arch/arm64/kvm/hyp/pgtable.c    | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index fde4186cc387..c247e5f29d5a 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -147,6 +147,7 @@
>   * Memory types for Stage-2 translation
>   */
>  #define MT_S2_NORMAL		0xf
> +#define MT_S2_NORMAL_NC	0x5
>  #define MT_S2_DEVICE_nGnRE	0x1
>  
>  /*
> @@ -154,6 +155,7 @@
>   * Stage-2 enforces Normal-WB and Device-nGnRE
>   */
>  #define MT_S2_FWB_NORMAL	6
> +#define MT_S2_FWB_NORMAL_NC	5
>  #define MT_S2_FWB_DEVICE_nGnRE	1
>  
>  #ifdef CONFIG_ARM64_4K_PAGES
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index ccd291b6893d..a80949002191 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
>  				kvm_pte_t *ptep)
>  {
>  	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
> -	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
> +	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) :
>  			    KVM_S2_MEMATTR(pgt, NORMAL);
>  	u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-11 14:57   ` Lorenzo Pieralisi
@ 2023-09-11 17:20     ` Jason Gunthorpe
  2023-09-13 15:26       ` Lorenzo Pieralisi
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-09-11 17:20 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Mon, Sep 11, 2023 at 04:57:51PM +0200, Lorenzo Pieralisi wrote:
> On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > Linux allows device drivers to map IO memory on a per-page basis using
> > "write combining" or WC. This is often done using
> > pgprot_writecombing(). The driver knows which pages can support WC
> 
> pgprot_writecombine() ?
> 
> > access and the proper programming model to generate this IO. Generally
> > the use case is to boost performance by using write combining to
> > generate larger PCIe MemWr TLPs.
> 
> First off, this changeset does not affect *only* Linux guests, obviously.

I think everyone understands that. It can be clarified.

> I understand that's the use case you are after but this change is
> targeting all VMs, it must be clear.
> 
> Then WC and mapping to PCI TLPs, either you describe that in details
> (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or
> you don't describe it at all, as it stands I don't know how to use
> this information.

How about another pargraph:

 KVM prevents all VMs (including Linux) from accessing NORMAL_NC
 mappings, which is how Linux implements pgprot_writecombine(). This
 prevents using this performance optimization within VMs.

I don't think we need to go into details how it works beyond that it
requires NORMAL_NC.

> > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> > all IO memory. This puts the VM in charge of the memory attributes,
> > and removes the KVM override to DEVICE_nGnRE.
> > 
> > Ultimately this makes pgprot_writecombing() work correctly in VMs and
> 
> pgprot_writecombine() ?
> 
> > allows drivers like mlx5 to fully operate their HW.
> > 
> > After some discussions with ARM and CPU architects we reached the
> > conclusion there was no need for KVM to prevent the VM from selecting
> > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> > that NORMAL_NC could result in uncontained failures, but upon deeper
> > analysis it turns out there are already possible cases for uncontained
> > failures with DEVICE types too. Ultimately the platform must be
> > implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> > accesses have no uncontained failures.
> 
> I would reorder/rephrase this changelog as follows:
> 
> - Describe what the problem is (ie KVM default s2 mappings)

The problem is that pgprot_writecombine() doesn't work in Linux
VMs. That is the first pagraph.

> - Describe how you are solving it

That is the middle paragraph "Allow VMs to select DEVICE_* or
NORMAL_NC on a page by page basis"

> - Add a link to the documentation that states why it is safe to do
>   that and the resulting s1/s2 mappings combination

AFAIK there is no documentation beyond the combining rules. Exactly
what should happen in various error conditions is implementation
defined. Catalin did you ever find anything?

> It must be clear why from a legacy standpoint this is a safe change
> to apply.

This is why:
 
> > Fortunately real platforms do tend to implement this.

It is why it is safe today, because real platforms don't throw
uncontained errors from typical PCI accesses that VFIO allows. I think
the conclusions was it turns out that is just because they don't do
errors at all, not because DEVICE_* prevents it.

> Remove this sentence, it adds no information for someone who
> is chasing bugs or just wants to understand the change itself.

So, if you hit a bug here you might evaluate if there is something
wrong with your platform, ie it is allowing uncontained errors in
unexpected places.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-11 17:20     ` Jason Gunthorpe
@ 2023-09-13 15:26       ` Lorenzo Pieralisi
  2023-09-13 18:54         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-09-13 15:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Mon, Sep 11, 2023 at 02:20:01PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 11, 2023 at 04:57:51PM +0200, Lorenzo Pieralisi wrote:
> > On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> > > From: Ankit Agrawal <ankita@nvidia.com>
> > > 
> > > Linux allows device drivers to map IO memory on a per-page basis using
> > > "write combining" or WC. This is often done using
> > > pgprot_writecombing(). The driver knows which pages can support WC
> > 
> > pgprot_writecombine() ?
> > 
> > > access and the proper programming model to generate this IO. Generally
> > > the use case is to boost performance by using write combining to
> > > generate larger PCIe MemWr TLPs.
> > 
> > First off, this changeset does not affect *only* Linux guests, obviously.
> 
> I think everyone understands that. It can be clarified.

OK, this is not a Linux guest fix *only*, everyone understands that
but I would rather make sure that's crystal clear.

> > I understand that's the use case you are after but this change is
> > targeting all VMs, it must be clear.
> > 
> > Then WC and mapping to PCI TLPs, either you describe that in details
> > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or
> > you don't describe it at all, as it stands I don't know how to use
> > this information.
> 
> How about another pargraph:
> 
>  KVM prevents all VMs (including Linux) from accessing NORMAL_NC
>  mappings, which is how Linux implements pgprot_writecombine(). This
>  prevents using this performance optimization within VMs.
> 
> I don't think we need to go into details how it works beyond that it
> requires NORMAL_NC.

I think it is worth summarizing the discussion that led to this change,
I can write something up.

> > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> > > all IO memory. This puts the VM in charge of the memory attributes,
> > > and removes the KVM override to DEVICE_nGnRE.
> > > 
> > > Ultimately this makes pgprot_writecombing() work correctly in VMs and
> > 
> > pgprot_writecombine() ?
> > 
> > > allows drivers like mlx5 to fully operate their HW.
> > > 
> > > After some discussions with ARM and CPU architects we reached the
> > > conclusion there was no need for KVM to prevent the VM from selecting
> > > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> > > that NORMAL_NC could result in uncontained failures, but upon deeper
> > > analysis it turns out there are already possible cases for uncontained
> > > failures with DEVICE types too. Ultimately the platform must be
> > > implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> > > accesses have no uncontained failures.
> > 
> > I would reorder/rephrase this changelog as follows:
> > 
> > - Describe what the problem is (ie KVM default s2 mappings)
> 
> The problem is that pgprot_writecombine() doesn't work in Linux
> VMs. That is the first pagraph.

s/pagraph/paragraph

Well to me that's how the problem was spotted but OK, I can rewrite
the log and post it here, NP.

> > - Describe how you are solving it
> 
> That is the middle paragraph "Allow VMs to select DEVICE_* or
> NORMAL_NC on a page by page basis"
> 
> > - Add a link to the documentation that states why it is safe to do
> >   that and the resulting s1/s2 mappings combination
> 
> AFAIK there is no documentation beyond the combining rules. Exactly
> what should happen in various error conditions is implementation
> defined. Catalin did you ever find anything?
> 
> > It must be clear why from a legacy standpoint this is a safe change
> > to apply.
> 
> This is why:
>  
> > > Fortunately real platforms do tend to implement this.

Is it possible please to describe what "this" is in details ?

What does "real platforms" mean ?

Let's describe what HW/SW should be implemented to make this
safe.

> It is why it is safe today, because real platforms don't throw

"real platforms", I have a problem with that, see above.

> uncontained errors from typical PCI accesses that VFIO allows. I think
> the conclusions was it turns out that is just because they don't do
> errors at all, not because DEVICE_* prevents it.
> 
> > Remove this sentence, it adds no information for someone who
> > is chasing bugs or just wants to understand the change itself.
> 
> So, if you hit a bug here you might evaluate if there is something
> wrong with your platform, ie it is allowing uncontained errors in
> unexpected places.

Developers looking at commit log in the future must be able to
understand why it was a sound change to make. As it stands IMO
the commit log does not explain it fully.

I can write up the commit log and post it if I manage to summarize
it any better - more important the review on the code (that was already
provided), I will try to write something up asap.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-13 15:26       ` Lorenzo Pieralisi
@ 2023-09-13 18:54         ` Jason Gunthorpe
  2023-09-26  8:31           ` Lorenzo Pieralisi
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-09-13 18:54 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote:
> > > I understand that's the use case you are after but this change is
> > > targeting all VMs, it must be clear.
> > > 
> > > Then WC and mapping to PCI TLPs, either you describe that in details
> > > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or
> > > you don't describe it at all, as it stands I don't know how to use
> > > this information.
> > 
> > How about another pargraph:
> > 
> >  KVM prevents all VMs (including Linux) from accessing NORMAL_NC
> >  mappings, which is how Linux implements pgprot_writecombine(). This
> >  prevents using this performance optimization within VMs.
> > 
> > I don't think we need to go into details how it works beyond that it
> > requires NORMAL_NC.
> 
> I think it is worth summarizing the discussion that led to this change,
> I can write something up.

We tried here, in short the reason we all understood it was like this
today is because there was a belief that DEVICE_nGnRE allowed
fewer/none uncontained failures than NORMAL_NC.

AFIACT it turns out the specs are unclear and actual real world IP
from ARM does not behave this way. We learned that at best they are
equally unsafe, even perhaps NORMAL_NC is bit safer.

What makes it safe in real chips is not the unclear specification, or
the behavior of the ARM IP to generate uncontained failures, but
because the combination of all IP in the platform never triggers an
uncontained error anyhow.

For instance PCIe integration IP does not transform PCIe TLP failures
into uncontained errors. They generate all ones data on the bus
instead.

There is no spec that says this should happen, it is just what people
do - c&p from Intel.

On top of that we see that server focused designs built to run VMs
have put in the RAS work to ensure uncontained errors can't exist and
truely plug this hole.

Thus, at the end of the day the safety of KVM and VFIO is effectively
an unknowable platform specific property no matter what choice we make
in SW here. Thus let's make the choice that enables the broadest VM
functionality.

> > > > Fortunately real platforms do tend to implement this.
> 
> Is it possible please to describe what "this" is in details ?

"prevent uncontained errors for DEVICE_* and NORMAL_NC access"

> What does "real platforms" mean ?

Actual real world deployments of VFIO and KVM in production that
expect to be safe from uncontained errors.

> Let's describe what HW/SW should be implemented to make this
> safe.

The platform must not generate uncontained errors :)

> I can write up the commit log and post it if I manage to summarize
> it any better - more important the review on the code (that was already
> provided), I will try to write something up asap.

Thank you!

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-13 18:54         ` Jason Gunthorpe
@ 2023-09-26  8:31           ` Lorenzo Pieralisi
  2023-09-26 12:25             ` Jason Gunthorpe
  2023-09-26 13:52             ` Catalin Marinas
  0 siblings, 2 replies; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-09-26  8:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Wed, Sep 13, 2023 at 03:54:54PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote:

[...]

> > I can write up the commit log and post it if I manage to summarize
> > it any better - more important the review on the code (that was already
> > provided), I will try to write something up asap.
> 
> Thank you!
> 
> Jason

FWIW, I have come up with the commit log below - please review and
scrutinize/change it as deemed fit - it is not necessarily clearer
than this one and it definitely requires MarcZ/Catalin/Will attention
before it can be considered:

---
Currently, KVM for ARM64 maps at stage 2 memory that is
considered device (ie using pfn_is_map_memory() to discern
between device memory and memory itself) with DEVICE_nGnRE
memory attributes; this setting overrides (as per the ARM
architecture [1]) any device MMIO mapping present at stage
1, resulting in a set-up whereby a guest operating system
can't determine device MMIO mapping memory attributes on its
own but it is always overriden by the KVM stage 2 default.

This set-up does not allow guest operating systems to map
device memory on a page by page basis with combined attributes
other than DEVICE_nGnRE, which turns out to be an issue in that
guest operating systems (eg Linux) may request to map
devices MMIO regions with memory attributes that guarantee
better performance (eg gathering attribute - that for some
devices can generate larger PCIe memory writes TLPs)
and specific operations (eg unaligned transactions) such as
the NormalNC memory type.

The default device stage 2 mapping was chosen in KVM
for ARM64 since it was considered safer (ie it would
not allow guests to trigger uncontained failures
ultimately crashing the machine) but this turned out
to be imprecise.

Failures containability is a property of the platform
and is independent from the memory type used for MMIO
device memory mappings (ie DEVICE_nGnRE memory type is
even more problematic than NormalNC in terms of containability
since eg aborts triggered on loads cannot be made synchronous,
which make them harder to contain); this means that,
regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures; reworded, the default KVM device
stage 2 memory attributes play no role in making device
assignment safer for a given platform and therefore can
be relaxed.

For all these reasons, relax the KVM stage 2 device
memory attributes from DEVICE_nGnRE to NormalNC.

This puts guests in control (thanks to stage1+stage2
combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules
described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]):

 S1           |   S2          |  Result
 NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
 NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
 NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
 DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>

[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-26  8:31           ` Lorenzo Pieralisi
@ 2023-09-26 12:25             ` Jason Gunthorpe
  2023-09-26 13:52             ` Catalin Marinas
  1 sibling, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2023-09-26 12:25 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote:
> On Wed, Sep 13, 2023 at 03:54:54PM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote:
> 
> [...]
> 
> > > I can write up the commit log and post it if I manage to summarize
> > > it any better - more important the review on the code (that was already
> > > provided), I will try to write something up asap.
> > 
> > Thank you!
> > 
> > Jason
> 
> FWIW, I have come up with the commit log below - please review and
> scrutinize/change it as deemed fit - it is not necessarily clearer
> than this one and it definitely requires MarcZ/Catalin/Will attention
> before it can be considered:

I have no issue with this language.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-26  8:31           ` Lorenzo Pieralisi
  2023-09-26 12:25             ` Jason Gunthorpe
@ 2023-09-26 13:52             ` Catalin Marinas
  2023-09-26 16:12               ` Lorenzo Pieralisi
  2023-10-05  9:56               ` Lorenzo Pieralisi
  1 sibling, 2 replies; 55+ messages in thread
From: Catalin Marinas @ 2023-09-26 13:52 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Jason Gunthorpe, ankita, maz, oliver.upton, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote:
> Currently, KVM for ARM64 maps at stage 2 memory that is
> considered device (ie using pfn_is_map_memory() to discern
> between device memory and memory itself) with DEVICE_nGnRE
> memory attributes; this setting overrides (as per the ARM
> architecture [1]) any device MMIO mapping present at stage
> 1, resulting in a set-up whereby a guest operating system
> can't determine device MMIO mapping memory attributes on its
> own but it is always overriden by the KVM stage 2 default.
> 
> This set-up does not allow guest operating systems to map
> device memory on a page by page basis with combined attributes
> other than DEVICE_nGnRE,

Well, it also has the option of DEVICE_nGnRnE ;).

> which turns out to be an issue in that
> guest operating systems (eg Linux) may request to map
> devices MMIO regions with memory attributes that guarantee
> better performance (eg gathering attribute - that for some
> devices can generate larger PCIe memory writes TLPs)
> and specific operations (eg unaligned transactions) such as
> the NormalNC memory type.
> 
> The default device stage 2 mapping was chosen in KVM
> for ARM64 since it was considered safer (ie it would
> not allow guests to trigger uncontained failures
> ultimately crashing the machine) but this turned out
> to be imprecise.
> 
> Failures containability is a property of the platform
> and is independent from the memory type used for MMIO
> device memory mappings (ie DEVICE_nGnRE memory type is
> even more problematic than NormalNC in terms of containability
> since eg aborts triggered on loads cannot be made synchronous,
> which make them harder to contain); this means that,
> regardless of the combined stage1+stage2 mappings a
> platform is safe if and only if device transactions cannot trigger
> uncontained failures; reworded, the default KVM device
> stage 2 memory attributes play no role in making device
> assignment safer for a given platform and therefore can
> be relaxed.
> 
> For all these reasons, relax the KVM stage 2 device
> memory attributes from DEVICE_nGnRE to NormalNC.
> 
> This puts guests in control (thanks to stage1+stage2
> combined memory attributes rules [1]) of device MMIO
> regions memory mappings, according to the rules
> described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]):
> 
> ï¿½S1ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ |ï¿½ï¿½ S2ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ |ï¿½ Result
> ï¿½NORMAL-WBï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NC
> ï¿½NORMAL-WTï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NC
> ï¿½NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NC
> ï¿½DEVICE<attr>ï¿½|ï¿½ NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ DEVICE<attr>

Not sure what's wrong with my font setup as I can't see the above table
but I know it from the Arm ARM.

Anyway, the text looks fine to me. Thanks for putting it together
Lorenzo.

One thing not mentioned here is that pci-vfio still maps such memory as
Device-nGnRnE in user space and relaxing this potentially creates an
alias. But such alias is only relevant of both the VMM and the VM try to
access the same device which I doubt is a realistic scenario.

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-26 13:52             ` Catalin Marinas
@ 2023-09-26 16:12               ` Lorenzo Pieralisi
  2023-10-05  9:56               ` Lorenzo Pieralisi
  1 sibling, 0 replies; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-09-26 16:12 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Jason Gunthorpe, ankita, maz, oliver.upton, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:
> On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote:
> > Currently, KVM for ARM64 maps at stage 2 memory that is
> > considered device (ie using pfn_is_map_memory() to discern
> > between device memory and memory itself) with DEVICE_nGnRE
> > memory attributes; this setting overrides (as per the ARM
> > architecture [1]) any device MMIO mapping present at stage
> > 1, resulting in a set-up whereby a guest operating system
> > can't determine device MMIO mapping memory attributes on its
> > own but it is always overriden by the KVM stage 2 default.
> > 
> > This set-up does not allow guest operating systems to map
> > device memory on a page by page basis with combined attributes
> > other than DEVICE_nGnRE,
> 
> Well, it also has the option of DEVICE_nGnRnE ;).

That's true - we have to fix it up.

"This set-up does not allow guest operating systems to choose
device memory attributes on a page by page basis independently
from KVM stage 2 mappings,..."

> > which turns out to be an issue in that
> > guest operating systems (eg Linux) may request to map
> > devices MMIO regions with memory attributes that guarantee
> > better performance (eg gathering attribute - that for some
> > devices can generate larger PCIe memory writes TLPs)
> > and specific operations (eg unaligned transactions) such as
> > the NormalNC memory type.
> > 
> > The default device stage 2 mapping was chosen in KVM
> > for ARM64 since it was considered safer (ie it would
> > not allow guests to trigger uncontained failures
> > ultimately crashing the machine) but this turned out
> > to be imprecise.
> > 
> > Failures containability is a property of the platform
> > and is independent from the memory type used for MMIO
> > device memory mappings (ie DEVICE_nGnRE memory type is
> > even more problematic than NormalNC in terms of containability
> > since eg aborts triggered on loads cannot be made synchronous,
> > which make them harder to contain); this means that,
> > regardless of the combined stage1+stage2 mappings a
> > platform is safe if and only if device transactions cannot trigger
> > uncontained failures; reworded, the default KVM device
> > stage 2 memory attributes play no role in making device
> > assignment safer for a given platform and therefore can
> > be relaxed.
> > 
> > For all these reasons, relax the KVM stage 2 device
> > memory attributes from DEVICE_nGnRE to NormalNC.
> > 
> > This puts guests in control (thanks to stage1+stage2
> > combined memory attributes rules [1]) of device MMIO
> > regions memory mappings, according to the rules
> > described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]):
> > 
> > ï¿½S1ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ |ï¿½ï¿½ S2ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ |ï¿½ Result
> > ï¿½NORMAL-WBï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NC
> > ï¿½NORMAL-WTï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NC
> > ï¿½NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ NORMAL-NC
> > ï¿½DEVICE<attr>ï¿½|ï¿½ NORMAL-NCï¿½ï¿½ï¿½ï¿½|ï¿½ DEVICE<attr>
> 
> Not sure what's wrong with my font setup as I can't see the above table
> but I know it from the Arm ARM.
> 
> Anyway, the text looks fine to me. Thanks for putting it together
> Lorenzo.
> 
> One thing not mentioned here is that pci-vfio still maps such memory as
> Device-nGnRnE in user space and relaxing this potentially creates an
> alias. But such alias is only relevant of both the VMM and the VM try to
> access the same device which I doubt is a realistic scenario.

I can update the log and repost, fixing the comment above as well (need
to check what's wrong with the table characters).

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-26 13:52             ` Catalin Marinas
  2023-09-26 16:12               ` Lorenzo Pieralisi
@ 2023-10-05  9:56               ` Lorenzo Pieralisi
  2023-10-05 11:56                 ` Jason Gunthorpe
  2023-10-12 12:35                 ` Will Deacon
  1 sibling, 2 replies; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-10-05  9:56 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Jason Gunthorpe, ankita, maz, oliver.upton, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:

[...]

> Anyway, the text looks fine to me. Thanks for putting it together
> Lorenzo.

Thanks !

> One thing not mentioned here is that pci-vfio still maps such memory as
> Device-nGnRnE in user space and relaxing this potentially creates an
> alias. But such alias is only relevant of both the VMM and the VM try to
> access the same device which I doubt is a realistic scenario.

A revised log, FWIW:

---
Currently, KVM for ARM64 maps at stage 2 memory that is
considered device (ie it is not RAM) with DEVICE_nGnRE
memory attributes; this setting overrides (as per the ARM
architecture [1]) any device MMIO mapping present at stage
1, resulting in a set-up whereby a guest operating system
can't determine device MMIO mapping memory attributes on its
own but it is always overriden by the KVM stage 2 default.

This set-up does not allow guest operating systems to select
device memory attributes on a page by page basis independently
from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
2 memory type attributes"), which turns out to be an issue in that
guest operating systems (eg Linux) may request to map
devices MMIO regions with memory attributes that guarantee
better performance (eg gathering attribute - that for some
devices can generate larger PCIe memory writes TLPs)
and specific operations (eg unaligned transactions) such as
the NormalNC memory type.

The default device stage 2 mapping was chosen in KVM
for ARM64 since it was considered safer (ie it would
not allow guests to trigger uncontained failures
ultimately crashing the machine) but this turned out
to be imprecise.

Failures containability is a property of the platform
and is independent from the memory type used for MMIO
device memory mappings (ie DEVICE_nGnRE memory type is
even more problematic than NormalNC in terms of containability
since eg aborts triggered on loads cannot be made synchronous,
which make them harder to contain); this means that,
regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures; reworded, the default KVM device
stage 2 memory attributes play no role in making device
assignment safer for a given platform and therefore can
be relaxed.

For all these reasons, relax the KVM stage 2 device
memory attributes from DEVICE_nGnRE to NormalNC.

This puts guests in control (thanks to stage1+stage2
combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules
described in [1] and summarized here ([(S1) - stage1],
[(S2) - stage 2]):

S1	     |  S2	     | Result
NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>

It is worth noting that currently, to map devices MMIO space to user
space in a device pass-through use case the VFIO framework applies memory
attributes derived from pgprot_noncached() settings applied to VMAs, which
result in device-nGnRnE memory attributes for the stage-1 VMM mappings.

This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].

Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change
anything in this respect, in that the mismatched aliases would only affect
(refer to [2] for a detailed explanation) ordering between the userspace and
GuestOS mappings resulting stream of transactions (ie it does not cause loss of
property for either stream of transactions on its own), which is harmless
given that the userspace and GuestOS access to the device is carried
out through independent transactions streams.

[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-05  9:56               ` Lorenzo Pieralisi
@ 2023-10-05 11:56                 ` Jason Gunthorpe
  2023-10-05 14:08                   ` Lorenzo Pieralisi
  2023-10-12 12:35                 ` Will Deacon
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-05 11:56 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Catalin Marinas, ankita, maz, oliver.upton, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:
> 
> [...]
> 
> > Anyway, the text looks fine to me. Thanks for putting it together
> > Lorenzo.
> 
> Thanks !
> 
> > One thing not mentioned here is that pci-vfio still maps such memory as
> > Device-nGnRnE in user space and relaxing this potentially creates an
> > alias. But such alias is only relevant of both the VMM and the VM try to
> > access the same device which I doubt is a realistic scenario.
> 
> A revised log, FWIW:

What is the plan here, do you want Ankit to resend the series with
this text?

There were no comments on patch 1/2?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-05 11:56                 ` Jason Gunthorpe
@ 2023-10-05 14:08                   ` Lorenzo Pieralisi
  0 siblings, 0 replies; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-10-05 14:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Catalin Marinas, ankita, maz, oliver.upton, will, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 05, 2023 at 08:56:30AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> > On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:
> > 
> > [...]
> > 
> > > Anyway, the text looks fine to me. Thanks for putting it together
> > > Lorenzo.
> > 
> > Thanks !
> > 
> > > One thing not mentioned here is that pci-vfio still maps such memory as
> > > Device-nGnRnE in user space and relaxing this potentially creates an
> > > alias. But such alias is only relevant of both the VMM and the VM try to
> > > access the same device which I doubt is a realistic scenario.
> > 
> > A revised log, FWIW:
> 
> What is the plan here, do you want Ankit to resend the series with
> this text?

I have put together the text to summarize the discussions that led to
this patch; if that helps somehow, yes please, make it the commit log.

> There were no comments on patch 1/2?

I don't have enough background knowledge to make a statement on this
series' code, I just wanted to make sure the architectural details are
in the commit log so that it can help the reviewers and, if it makes
it to the kernel, future developers.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-05  9:56               ` Lorenzo Pieralisi
  2023-10-05 11:56                 ` Jason Gunthorpe
@ 2023-10-12 12:35                 ` Will Deacon
  2023-10-12 13:20                   ` Jason Gunthorpe
  2023-10-12 13:53                   ` Catalin Marinas
  1 sibling, 2 replies; 55+ messages in thread
From: Will Deacon @ 2023-10-12 12:35 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Catalin Marinas, Jason Gunthorpe, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote:
> 
> [...]
> 
> > Anyway, the text looks fine to me. Thanks for putting it together
> > Lorenzo.
> 
> Thanks !
> 
> > One thing not mentioned here is that pci-vfio still maps such memory as
> > Device-nGnRnE in user space and relaxing this potentially creates an
> > alias. But such alias is only relevant of both the VMM and the VM try to
> > access the same device which I doubt is a realistic scenario.
> 
> A revised log, FWIW:

Thanks for putting this together, Lorenzo. Just one thing below:

> ---
> Currently, KVM for ARM64 maps at stage 2 memory that is
> considered device (ie it is not RAM) with DEVICE_nGnRE
> memory attributes; this setting overrides (as per the ARM
> architecture [1]) any device MMIO mapping present at stage
> 1, resulting in a set-up whereby a guest operating system
> can't determine device MMIO mapping memory attributes on its
> own but it is always overriden by the KVM stage 2 default.
> 
> This set-up does not allow guest operating systems to select
> device memory attributes on a page by page basis independently
> from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
> 2 memory type attributes"), which turns out to be an issue in that
> guest operating systems (eg Linux) may request to map
> devices MMIO regions with memory attributes that guarantee
> better performance (eg gathering attribute - that for some
> devices can generate larger PCIe memory writes TLPs)
> and specific operations (eg unaligned transactions) such as
> the NormalNC memory type.
> 
> The default device stage 2 mapping was chosen in KVM
> for ARM64 since it was considered safer (ie it would
> not allow guests to trigger uncontained failures
> ultimately crashing the machine) but this turned out
> to be imprecise.
> 
> Failures containability is a property of the platform
> and is independent from the memory type used for MMIO
> device memory mappings (ie DEVICE_nGnRE memory type is
> even more problematic than NormalNC in terms of containability
> since eg aborts triggered on loads cannot be made synchronous,
> which make them harder to contain); this means that,
> regardless of the combined stage1+stage2 mappings a
> platform is safe if and only if device transactions cannot trigger
> uncontained failures; reworded, the default KVM device
> stage 2 memory attributes play no role in making device
> assignment safer for a given platform and therefore can
> be relaxed.
> 
> For all these reasons, relax the KVM stage 2 device
> memory attributes from DEVICE_nGnRE to NormalNC.

The reasoning above suggests to me that this should probably just be
Normal cacheable, as that is what actually allows the guest to control
the attributes. So what is the rationale behind stopping at Normal-NC?

Will

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 12:35                 ` Will Deacon
@ 2023-10-12 13:20                   ` Jason Gunthorpe
  2023-10-12 14:29                     ` Lorenzo Pieralisi
  2023-10-12 13:53                   ` Catalin Marinas
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-12 13:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: Lorenzo Pieralisi, Catalin Marinas, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:

> > Failures containability is a property of the platform
> > and is independent from the memory type used for MMIO
> > device memory mappings (ie DEVICE_nGnRE memory type is
> > even more problematic than NormalNC in terms of containability
> > since eg aborts triggered on loads cannot be made synchronous,
> > which make them harder to contain); this means that,
> > regardless of the combined stage1+stage2 mappings a
> > platform is safe if and only if device transactions cannot trigger
> > uncontained failures; reworded, the default KVM device
> > stage 2 memory attributes play no role in making device
> > assignment safer for a given platform and therefore can
> > be relaxed.
> > 
> > For all these reasons, relax the KVM stage 2 device
> > memory attributes from DEVICE_nGnRE to NormalNC.
> 
> The reasoning above suggests to me that this should probably just be
> Normal cacheable, as that is what actually allows the guest to control
> the attributes. So what is the rationale behind stopping at Normal-NC?

I agree it would be very nice if the all the memory in the guest could
just be cachable and the guest could select everything.

However, I think Lorenzo over stated the argument. The off-list
discussion was focused on NormalNC for MMIO only. Nobody raised the
idea that cachable was safe from uncontained errors for MMIO.

I'm looking through the conversations and I wouldn't jump to
concluding that "cachable MMIO" is safe from uncontained failures.

Catalin has already raised a number of conerns in the other patch
about making actual "designed to be cachable memory" into KVM
cachable.

Regards,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 13:20                   ` Jason Gunthorpe
@ 2023-10-12 14:29                     ` Lorenzo Pieralisi
  0 siblings, 0 replies; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-10-12 14:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, Catalin Marinas, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 10:20:45AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:
> 
> > > Failures containability is a property of the platform
> > > and is independent from the memory type used for MMIO
> > > device memory mappings (ie DEVICE_nGnRE memory type is
> > > even more problematic than NormalNC in terms of containability
> > > since eg aborts triggered on loads cannot be made synchronous,
> > > which make them harder to contain); this means that,
> > > regardless of the combined stage1+stage2 mappings a
> > > platform is safe if and only if device transactions cannot trigger
> > > uncontained failures; reworded, the default KVM device
> > > stage 2 memory attributes play no role in making device
> > > assignment safer for a given platform and therefore can
> > > be relaxed.
> > > 
> > > For all these reasons, relax the KVM stage 2 device
> > > memory attributes from DEVICE_nGnRE to NormalNC.
> > 
> > The reasoning above suggests to me that this should probably just be
> > Normal cacheable, as that is what actually allows the guest to control
> > the attributes. So what is the rationale behind stopping at Normal-NC?
> 
> I agree it would be very nice if the all the memory in the guest could
> just be cachable and the guest could select everything.
> 
> However, I think Lorenzo over stated the argument.
>
> The off-list discussion was focused on NormalNC for MMIO only. Nobody
> raised the idea that cachable was safe from uncontained errors for
> MMIO.

True, it should be made clearer ie "independent from the
non-cacheable/uncacheable memory type...", please update the
log accordingly, forgive me the confusion I raised.

Lorenzo
 
> I'm looking through the conversations and I wouldn't jump to
> concluding that "cachable MMIO" is safe from uncontained failures.
> 
> Catalin has already raised a number of conerns in the other patch
> about making actual "designed to be cachable memory" into KVM
> cachable.
> 
> Regards,
> Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 12:35                 ` Will Deacon
  2023-10-12 13:20                   ` Jason Gunthorpe
@ 2023-10-12 13:53                   ` Catalin Marinas
  2023-10-12 14:48                     ` Will Deacon
  1 sibling, 1 reply; 55+ messages in thread
From: Catalin Marinas @ 2023-10-12 13:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:
> On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> > For all these reasons, relax the KVM stage 2 device
> > memory attributes from DEVICE_nGnRE to NormalNC.
> 
> The reasoning above suggests to me that this should probably just be
> Normal cacheable, as that is what actually allows the guest to control
> the attributes. So what is the rationale behind stopping at Normal-NC?

It's more like we don't have any clue on what may happen. MTE is
obviously a case where it can go wrong (we can blame the architecture
design here) but I recall years ago where a malicious guest could bring
the platform down by mapping the GIC CPU interface as cacheable.

Not sure how error containment works with cacheable memory. A cacheable
access to a device may stay in the cache a lot longer after the guest
has been scheduled out, only evicted at some random time. We may no
longer be able to associate it with the guest, especially if the guest
exited. Also not sure about claiming back the device after killing the
guest, do we need cache maintenance?

So, for now I'd only relax this if we know there's RAM(-like) on the
other side and won't trigger some potentially uncontainable errors as a
result.

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 13:53                   ` Catalin Marinas
@ 2023-10-12 14:48                     ` Will Deacon
  2023-10-12 15:44                       ` Jason Gunthorpe
  2023-10-12 17:26                       ` Catalin Marinas
  0 siblings, 2 replies; 55+ messages in thread
From: Will Deacon @ 2023-10-12 14:48 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 02:53:21PM +0100, Catalin Marinas wrote:
> On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:
> > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> > > For all these reasons, relax the KVM stage 2 device
> > > memory attributes from DEVICE_nGnRE to NormalNC.
> > 
> > The reasoning above suggests to me that this should probably just be
> > Normal cacheable, as that is what actually allows the guest to control
> > the attributes. So what is the rationale behind stopping at Normal-NC?
> 
> It's more like we don't have any clue on what may happen. MTE is
> obviously a case where it can go wrong (we can blame the architecture
> design here) but I recall years ago where a malicious guest could bring
> the platform down by mapping the GIC CPU interface as cacheable.

... and do we know that isn't the case for non-cacheable? If not, why not?

Also, are you saying we used to map the GIC CPU interface as cacheable
at stage-2? I remember exclusives causing a problem, but I don't remember
the guest having a cacheable mapping.

> Not sure how error containment works with cacheable memory. A cacheable
> access to a device may stay in the cache a lot longer after the guest
> has been scheduled out, only evicted at some random time.

But similarly, non-cacheable stores can be buffered. Why isn't that a
problem?

> We may no longer be able to associate it with the guest, especially if the
> guest exited. Also not sure about claiming back the device after killing
> the guest, do we need cache maintenance?

Claiming back the device also seems strange if the guest has been using
non-cacheable accesses since I think you could get write merging and
reordering with subsequent device accesses trying to reset the device.

> So, for now I'd only relax this if we know there's RAM(-like) on the
> other side and won't trigger some potentially uncontainable errors as a
> result.

I guess my wider point is that I'm not convinced that non-cacheable is
actually much better and I think we're going way off the deep end looking
at what particular implementations do and trying to justify to ourselves
that non-cacheable is safe, even though it's still a normal memory type
at the end of the day.

Obviously, it's up to Marc and Oliver if they want to do this, but I'm
wary without an official statement from Arm to say that Normal-NC is
correct. There's mention of such a statement in the cover letter:

  > We hope ARM will publish information helping platform designers
  > follow these guidelines.

but imo we shouldn't merge this without either:

  (a) _Architectural_ guidance (as opposed to some random whitepaper or
      half-baked certification scheme).

- or -

  (b) A concrete justification based on the current architecture as to
      why Normal-NC is the right thing to do for KVM.

The current wording talks about use-cases (I get this) and error containment
(it's a property of the system) but doesn't talk at all about why Normal-NC
is the right result.

Will

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 14:48                     ` Will Deacon
@ 2023-10-12 15:44                       ` Jason Gunthorpe
  2023-10-12 16:39                         ` Will Deacon
  2023-10-12 17:26                       ` Catalin Marinas
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-12 15:44 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Lorenzo Pieralisi, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:

> I guess my wider point is that I'm not convinced that non-cacheable is
> actually much better and I think we're going way off the deep end looking
> at what particular implementations do and trying to justify to ourselves
> that non-cacheable is safe, even though it's still a normal memory type
> at the end of the day.

When we went over this with ARM it became fairly clear there wasn't an
official statement that Device-* is safe from uncontained
failures. For instance, looking at the actual IP, our architects
pointed out that ARM IP already provides ways for Device-* to trigger
uncontained failures today.

We then mutually concluded that KVM safe implementations must already
be preventing uncontained failures for Device-* at the system level
and that same prevention will carry over to NormalNC as well.

IMHO, this seems to be a gap where ARM has not fully defined when
uncontained failures are allowed and left that as an implementation
choice.

In other words, KVM safety around uncontained failure is not a
property that can be reasoned about from the ARM architecture alone.

> The current wording talks about use-cases (I get this) and error containment
> (it's a property of the system) but doesn't talk at all about why Normal-NC
> is the right result.

Given that Device-* and NormalNC are equally implementation defined
with regards to uncontained failures, NormalNC allows more VM
functionality.

Further, we have a broad agreement that this use case is important,
and that NormalNC is the correct way to adress it.

I think you are right to ask for more formality from ARM team but also
we shouldn't hold up fixing real functional bugs in real shipping
server ARM products.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 15:44                       ` Jason Gunthorpe
@ 2023-10-12 16:39                         ` Will Deacon
  2023-10-12 18:36                           ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Will Deacon @ 2023-10-12 16:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Catalin Marinas, Lorenzo Pieralisi, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 12:44:39PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> 
> > I guess my wider point is that I'm not convinced that non-cacheable is
> > actually much better and I think we're going way off the deep end looking
> > at what particular implementations do and trying to justify to ourselves
> > that non-cacheable is safe, even though it's still a normal memory type
> > at the end of the day.
> 
> When we went over this with ARM it became fairly clear there wasn't an
> official statement that Device-* is safe from uncontained
> failures. For instance, looking at the actual IP, our architects
> pointed out that ARM IP already provides ways for Device-* to trigger
> uncontained failures today.
> 
> We then mutually concluded that KVM safe implementations must already
> be preventing uncontained failures for Device-* at the system level
> and that same prevention will carry over to NormalNC as well.
> 
> IMHO, this seems to be a gap where ARM has not fully defined when
> uncontained failures are allowed and left that as an implementation
> choice.
> 
> In other words, KVM safety around uncontained failure is not a
> property that can be reasoned about from the ARM architecture alone.
> 
> > The current wording talks about use-cases (I get this) and error containment
> > (it's a property of the system) but doesn't talk at all about why Normal-NC
> > is the right result.
> 
> Given that Device-* and NormalNC are equally implementation defined
> with regards to uncontained failures, NormalNC allows more VM
> functionality.
> 
> Further, we have a broad agreement that this use case is important,
> and that NormalNC is the correct way to adress it.
> 
> I think you are right to ask for more formality from ARM team but also
> we shouldn't hold up fixing real functional bugs in real shipping
> server ARM products.

All I'm asking for is justification as to why Normal-NC is the right
memory type rather than any other normal memory type. If it's not possible
to explain that architecturally, then I'm not sure this change belongs in
architecture code.

Ultimately, we need to be able to maintain this stuff, so we can't just
blindly implement changes based on a combination of off-list discussions
and individual product needs. For example, if somebody else rocks up
tomorrow and asks for this to be Normal-writethrough, what grounds do
we have to say no if we've taken this change already?

So please let's get to a point where we can actually reason about this.

Will

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 16:39                         ` Will Deacon
@ 2023-10-12 18:36                           ` Jason Gunthorpe
  2023-10-13  9:29                             ` Will Deacon
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-12 18:36 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Lorenzo Pieralisi, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 05:39:31PM +0100, Will Deacon wrote:

> All I'm asking for is justification as to why Normal-NC is the right
> memory type rather than any other normal memory type. If it's not possible
> to explain that architecturally, then I'm not sure this change belongs in
> architecture code.

Well, I think Catalin summarized it nicely, I second his ask at the end:

We are basically at your scenario below - can you justify why
DEVICE_GRE is correct today, architecturally? We could not. Earlier
someone said uncontained failure prevention, but a deep dive on that
found it is not so.

> Ultimately, we need to be able to maintain this stuff, so we can't just
> blindly implement changes based on a combination of off-list discussions
> and individual product needs. For example, if somebody else rocks up
> tomorrow and asks for this to be Normal-writethrough, what grounds do
> we have to say no if we've taken this change already?

Hmm. Something got lost here.

This patch is talking about the S2 MemAttr[3:0]. There are only 4
relevant values (when FEAT_S2FWB) (see D5.5.5 in ARM DDI 0487F.c)

 0b001 - Today: force VM to be Device nGnRE

 0b101 - Proposed: prevent the VM from selecting cachable, allow it to
         choose Device-* or NormalNC

 0b110 - Force write back. Would totally break MMIO, ruled out

 0b111 - Allow the VM to select anything, including cachable.
         This is nice, but summarizing Catalin's remarks:
           a) The kernel can't do cache maintenance (defeats FWB)
           b) Catalin's concerns about MTE and Atomics triggering
	      uncontained failures
           c) It is unclear about uncontained failures for cachable
              in general

So the only argument is 001 v 110 v 111

Catalin has explained why 111 is not good as a default. Most likely
with some future ACPI/BSA/etc work, and some cache syncing in the
kernel, someone could define a way to allow 111 as a choice. So, I
think we can rule out 111 as being the default choice without more the
kernel getting more detailed system level knowledge.

Further, patch 1 is about making 110 a driver-opt-in choice for VFIO
memory which reduces the need for 111.

For 001 v 110: 001 is less functional in the VM. 001 offers no
advantage.

!FEAT_S2FWB has similar choices and similar argument.

So, IMHO, if someone comes to ask for something it would be to ask for
111 and we do have a set of pretty clear reasons why it should not be
111. (indeed we wanted to ask for that instead of patch 1 but there
are too many things required to get there),

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 18:36                           ` Jason Gunthorpe
@ 2023-10-13  9:29                             ` Will Deacon
  0 siblings, 0 replies; 55+ messages in thread
From: Will Deacon @ 2023-10-13  9:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Catalin Marinas, Lorenzo Pieralisi, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 03:36:24PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 12, 2023 at 05:39:31PM +0100, Will Deacon wrote:
> 
> > All I'm asking for is justification as to why Normal-NC is the right
> > memory type rather than any other normal memory type. If it's not possible
> > to explain that architecturally, then I'm not sure this change belongs in
> > architecture code.
> 
> Well, I think Catalin summarized it nicely, I second his ask at the end:
> 
> We are basically at your scenario below - can you justify why
> DEVICE_GRE is correct today, architecturally? We could not. Earlier
> someone said uncontained failure prevention, but a deep dive on that
> found it is not so.

This logic can be used to justify the use of any other memory type. I'm
asking specifically why Normal-NC is correct.

> > Ultimately, we need to be able to maintain this stuff, so we can't just
> > blindly implement changes based on a combination of off-list discussions
> > and individual product needs. For example, if somebody else rocks up
> > tomorrow and asks for this to be Normal-writethrough, what grounds do
> > we have to say no if we've taken this change already?
> 
> Hmm. Something got lost here.

Well, I didn't assume FWB since these patches change the behaviour
regardless...

> This patch is talking about the S2 MemAttr[3:0]. There are only 4
> relevant values (when FEAT_S2FWB) (see D5.5.5 in ARM DDI 0487F.c)
> 
>  0b001 - Today: force VM to be Device nGnRE
> 
>  0b101 - Proposed: prevent the VM from selecting cachable, allow it to
>          choose Device-* or NormalNC
> 
>  0b110 - Force write back. Would totally break MMIO, ruled out
> 
>  0b111 - Allow the VM to select anything, including cachable.
>          This is nice, but summarizing Catalin's remarks:
>            a) The kernel can't do cache maintenance (defeats FWB)
>            b) Catalin's concerns about MTE and Atomics triggering
> 	      uncontained failures
>            c) It is unclear about uncontained failures for cachable
>               in general
> 
> So the only argument is 001 v 110 v 111

Ok, so I can see why you end up with Normal-NC on a system that implements
FWB. Systems without FWB have other options, but you can reasonably argue
that having consistent behaviour between the two configurations is
desirable. Please can you add that to the commit message?

I still don't understand how to reclaim an MMIO region that the guest has
mapped Normal-NC (see the thread with Catalin).

Will

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 14:48                     ` Will Deacon
  2023-10-12 15:44                       ` Jason Gunthorpe
@ 2023-10-12 17:26                       ` Catalin Marinas
  2023-10-13  9:29                         ` Will Deacon
  1 sibling, 1 reply; 55+ messages in thread
From: Catalin Marinas @ 2023-10-12 17:26 UTC (permalink / raw)
  To: Will Deacon
  Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> On Thu, Oct 12, 2023 at 02:53:21PM +0100, Catalin Marinas wrote:
> > On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:
> > > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote:
> > > > For all these reasons, relax the KVM stage 2 device
> > > > memory attributes from DEVICE_nGnRE to NormalNC.
> > > 
> > > The reasoning above suggests to me that this should probably just be
> > > Normal cacheable, as that is what actually allows the guest to control
> > > the attributes. So what is the rationale behind stopping at Normal-NC?
> > 
> > It's more like we don't have any clue on what may happen. MTE is
> > obviously a case where it can go wrong (we can blame the architecture
> > design here) but I recall years ago where a malicious guest could bring
> > the platform down by mapping the GIC CPU interface as cacheable.
> 
> ... and do we know that isn't the case for non-cacheable? If not, why not?

Trying to get this information from the hw folk and architects is really
hard. So we only relax it one step at a time ;). But given the MTE
problems, I'd not go for cacheable Stage 2 unless we have FEAT_MTE_PERM
implemented (both hw and sw). S2 cacheable allows the guest to map it as
Normal Tagged.

> Also, are you saying we used to map the GIC CPU interface as cacheable
> at stage-2? I remember exclusives causing a problem, but I don't remember
> the guest having a cacheable mapping.

The guest never had a cacheable mapping, IIRC it was more of a
theoretical problem, plugging a hole. Now, maybe I misremember, it's
pretty hard to search the git logs given how the code was moved around
(but I do remember the building we were in when discussing this, it was
on the ground floor ;)).

> > Not sure how error containment works with cacheable memory. A cacheable
> > access to a device may stay in the cache a lot longer after the guest
> > has been scheduled out, only evicted at some random time.
> 
> But similarly, non-cacheable stores can be buffered. Why isn't that a
> problem?

RAS might track this for cacheable mappings as well, I just haven't
figured out the details.

> > We may no longer be able to associate it with the guest, especially if the
> > guest exited. Also not sure about claiming back the device after killing
> > the guest, do we need cache maintenance?
> 
> Claiming back the device also seems strange if the guest has been using
> non-cacheable accesses since I think you could get write merging and
> reordering with subsequent device accesses trying to reset the device.

True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).

> > So, for now I'd only relax this if we know there's RAM(-like) on the
> > other side and won't trigger some potentially uncontainable errors as a
> > result.
> 
> I guess my wider point is that I'm not convinced that non-cacheable is
> actually much better and I think we're going way off the deep end looking
> at what particular implementations do and trying to justify to ourselves
> that non-cacheable is safe, even though it's still a normal memory type
> at the end of the day.

Is this about Device vs NC or Device/NC vs Normal Cacheable? The
justification for the former has been summarised in Lorenzo's write-up.
How the hardware behaves, it depends a lot on the RAS implementation.
The BSA has some statements but not sure it covers everything.

Things can go wrong but that's not because Device does anything better.
Given the RAS implementation, external aborts caused on Device memory
(e.g. wrong size access) is uncontainable. For Normal NC it can be
contained (I can dig out the reasoning behind this if you want, IIUC
something to do with not being able to cancel an already issued Device
access since such accesses don't allow speculation due to side-effects;
for Normal NC, it's just about the software not getting the data).

> Obviously, it's up to Marc and Oliver if they want to do this, but I'm
> wary without an official statement from Arm to say that Normal-NC is
> correct. There's mention of such a statement in the cover letter:
> 
>   > We hope ARM will publish information helping platform designers
>   > follow these guidelines.
> 
> but imo we shouldn't merge this without either:
> 
>   (a) _Architectural_ guidance (as opposed to some random whitepaper or
>       half-baked certification scheme).

Well, you know the story, the architects will probably make it a SoC or
integration issue, PCIe etc., not something that can live in the Arm
ARM. The best we could get is more recommendations in the RAS spec
around containment but not for things that might happen outside the CPU,
e.g. PCIe root complex.

> - or -
> 
>   (b) A concrete justification based on the current architecture as to
>       why Normal-NC is the right thing to do for KVM.

To put it differently, we don't have any strong arguments why Device is
the right thing to do. We chose Device based on some understanding
software people had about how the hardware behaves, which apparently
wasn't entirely correct (and summarised by Lorenzo).

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 17:26                       ` Catalin Marinas
@ 2023-10-13  9:29                         ` Will Deacon
  2023-10-13 13:08                           ` Catalin Marinas
  0 siblings, 1 reply; 55+ messages in thread
From: Will Deacon @ 2023-10-13  9:29 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > Claiming back the device also seems strange if the guest has been using
> > non-cacheable accesses since I think you could get write merging and
> > reordering with subsequent device accesses trying to reset the device.
> 
> True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).

We do have a good story for this part: use Device-nGnRE!

> > > So, for now I'd only relax this if we know there's RAM(-like) on the
> > > other side and won't trigger some potentially uncontainable errors as a
> > > result.
> > 
> > I guess my wider point is that I'm not convinced that non-cacheable is
> > actually much better and I think we're going way off the deep end looking
> > at what particular implementations do and trying to justify to ourselves
> > that non-cacheable is safe, even though it's still a normal memory type
> > at the end of the day.
> 
> Is this about Device vs NC or Device/NC vs Normal Cacheable? The
> justification for the former has been summarised in Lorenzo's write-up.
> How the hardware behaves, it depends a lot on the RAS implementation.
> The BSA has some statements but not sure it covers everything.

I don't know how to be more clear, but I'm asking why Normal-NC is the right
memory type to use. Jason's explanation on the other thread about how it's
basically the only option with FWB (with some hand-waving about why
Normal-cacheable isn't safe) will have to do, but it needs to be in the
commit message.

The host maps MMIO with Device-nGnRE. Allowing a guest to map it as Normal
surely means the host is going to need additional logic to deal with that?
We briefly discussed claiming back a device above and I'm not convinced
that the code we have for doing that today will work correctly if the
guest has issued a bunch of Normal-NC stores prior to the device being
unmapped.

Could we change these patches so that the memory type of the stage-1 VMA
in the VMM is reflected in the stage-2? In other words, continue to use
Device mappings at stage-2 for I/O but relax to Normal-NC if that's
how the VMM has it mapped?

> Things can go wrong but that's not because Device does anything better.
> Given the RAS implementation, external aborts caused on Device memory
> (e.g. wrong size access) is uncontainable. For Normal NC it can be
> contained (I can dig out the reasoning behind this if you want, IIUC
> something to do with not being able to cancel an already issued Device
> access since such accesses don't allow speculation due to side-effects;
> for Normal NC, it's just about the software not getting the data).

I really think these details belong in the commit message.

> > Obviously, it's up to Marc and Oliver if they want to do this, but I'm
> > wary without an official statement from Arm to say that Normal-NC is
> > correct. There's mention of such a statement in the cover letter:
> > 
> >   > We hope ARM will publish information helping platform designers
> >   > follow these guidelines.
> > 
> > but imo we shouldn't merge this without either:
> > 
> >   (a) _Architectural_ guidance (as opposed to some random whitepaper or
> >       half-baked certification scheme).
> 
> Well, you know the story, the architects will probably make it a SoC or
> integration issue, PCIe etc., not something that can live in the Arm
> ARM. The best we could get is more recommendations in the RAS spec
> around containment but not for things that might happen outside the CPU,
> e.g. PCIe root complex.

The Arm ARM _does_ mention PCI config space when talking about early write
acknowledgement, so there's some precedence for providing guidance around
which memory types to use.

> > - or -
> > 
> >   (b) A concrete justification based on the current architecture as to
> >       why Normal-NC is the right thing to do for KVM.
> 
> To put it differently, we don't have any strong arguments why Device is
> the right thing to do. We chose Device based on some understanding
> software people had about how the hardware behaves, which apparently
> wasn't entirely correct (and summarised by Lorenzo).

I think we use Device because that's what the host uses in its stage-1
and mismatched aliases are bad.

Will

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-13  9:29                         ` Will Deacon
@ 2023-10-13 13:08                           ` Catalin Marinas
  2023-10-13 13:45                             ` Jason Gunthorpe
                                               ` (2 more replies)
  0 siblings, 3 replies; 55+ messages in thread
From: Catalin Marinas @ 2023-10-13 13:08 UTC (permalink / raw)
  To: Will Deacon
  Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote:
> On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > > Claiming back the device also seems strange if the guest has been using
> > > non-cacheable accesses since I think you could get write merging and
> > > reordering with subsequent device accesses trying to reset the device.
> > 
> > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).
> 
> We do have a good story for this part: use Device-nGnRE!

Don't we actually need Device-nGnRnE for this, coupled with a DSB for
endpoint completion?

Device-nGnRE may be sufficient as a read from that device would ensure
that the previous write is observable (potentially with a DMB if
accessing separate device regions) but I don't think we do this now
either. Even this, isn't it device-specific? I don't know enough about
PCIe, posted writes, reordering, maybe others can shed some light.

For Normal NC, if the access doesn't have side-effects (or rather the
endpoint is memory-like), I think we are fine. The Stage 2 unmapping +
TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU
was pushed sufficiently far as not to affect subsequent writes by other
CPUs.

For I/O accesses that change some state of the device, I'm not sure the
TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only
nE + DSB as long as the PCIe device plays along nicely.

> Could we change these patches so that the memory type of the stage-1 VMA
> in the VMM is reflected in the stage-2? In other words, continue to use
> Device mappings at stage-2 for I/O but relax to Normal-NC if that's
> how the VMM has it mapped?

We've been through this and it's not feasible. The VMM does not have
detailed knowledge of the BARs of the PCIe device it is mapping (and the
prefetchable BAR attribute is useless). It may end up with a Normal
mapping of a BAR with read side-effects. It's only the guest driver that
knows all the details. The safest is for the VMM to keep it as Device (I
think vfio-pci goes for the strongest nGnRnE).

Yes, we end up with mismatched aliases but they only matter if the VMM
also accesses the I/O range via its own mapping. So far I haven't seen
case that suggests this.

> > Things can go wrong but that's not because Device does anything better.
> > Given the RAS implementation, external aborts caused on Device memory
> > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > contained (I can dig out the reasoning behind this if you want, IIUC
> > something to do with not being able to cancel an already issued Device
> > access since such accesses don't allow speculation due to side-effects;
> > for Normal NC, it's just about the software not getting the data).
> 
> I really think these details belong in the commit message.

I guess another task for Lorenzo ;).

> > > Obviously, it's up to Marc and Oliver if they want to do this, but I'm
> > > wary without an official statement from Arm to say that Normal-NC is
> > > correct. There's mention of such a statement in the cover letter:
> > > 
> > >   > We hope ARM will publish information helping platform designers
> > >   > follow these guidelines.
> > > 
> > > but imo we shouldn't merge this without either:
> > > 
> > >   (a) _Architectural_ guidance (as opposed to some random whitepaper or
> > >       half-baked certification scheme).
> > 
> > Well, you know the story, the architects will probably make it a SoC or
> > integration issue, PCIe etc., not something that can live in the Arm
> > ARM. The best we could get is more recommendations in the RAS spec
> > around containment but not for things that might happen outside the CPU,
> > e.g. PCIe root complex.
> 
> The Arm ARM _does_ mention PCI config space when talking about early write
> acknowledgement, so there's some precedence for providing guidance around
> which memory types to use.

Ah, yes, it looks like it does, though mostly around the config space.
We could ask them to add some notes but I don't think we have the
problem well defined yet.

Trying to restate what we aim: the guest driver knows what attributes it
needs and would set the appropriate attributes: Device or Normal. KVM's
role is not to fix bugs in the guest driver by constraining the
attributes but rather to avoid potential security issues with malicious
(or buggy) guests:

1) triggering uncontained errors

2) accessing memory that it shouldn't (like the MTE tag access)

3) causing delayed side-effects after the host reclaims the device

... anything else?

For (1), Normal NC vs. Device doesn't make any difference, slightly
better for the former. (2) so far is solved by not allowing Cacheable
(or disabling MTE, enabling FEAT_MTE_PERM in the future). I'm now trying
to understand (3), I think it needs more digging.

> > >   (b) A concrete justification based on the current architecture as to
> > >       why Normal-NC is the right thing to do for KVM.
> > 
> > To put it differently, we don't have any strong arguments why Device is
> > the right thing to do. We chose Device based on some understanding
> > software people had about how the hardware behaves, which apparently
> > wasn't entirely correct (and summarised by Lorenzo).
> 
> I think we use Device because that's what the host uses in its stage-1
> and mismatched aliases are bad.

They are "constrained" bad ;).

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-13 13:08                           ` Catalin Marinas
@ 2023-10-13 13:45                             ` Jason Gunthorpe
  2023-10-19 11:07                               ` Catalin Marinas
  2023-10-13 15:28                             ` Lorenzo Pieralisi
  2023-11-09 15:34                             ` Lorenzo Pieralisi
  2 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-13 13:45 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:
> On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote:
> > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > > > Claiming back the device also seems strange if the guest has been using
> > > > non-cacheable accesses since I think you could get write merging and
> > > > reordering with subsequent device accesses trying to reset the device.
> > > 
> > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).
> > 
> > We do have a good story for this part: use Device-nGnRE!
> 
> Don't we actually need Device-nGnRnE for this, coupled with a DSB for
> endpoint completion?
> 
> Device-nGnRE may be sufficient as a read from that device would ensure
> that the previous write is observable (potentially with a DMB if
> accessing separate device regions) but I don't think we do this now
> either. Even this, isn't it device-specific? I don't know enough about
> PCIe, posted writes, reordering, maybe others can shed some light.
> 
> For Normal NC, if the access doesn't have side-effects (or rather the
> endpoint is memory-like), I think we are fine. The Stage 2 unmapping +
> TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU
> was pushed sufficiently far as not to affect subsequent writes by other
> CPUs.
> 
> For I/O accesses that change some state of the device, I'm not sure the
> TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only
> nE + DSB as long as the PCIe device plays along nicely.

Can someone explain this concern a little more simply please?

Let's try something simpler. I have no KVM. My kernel driver 
creates a VMA with pgprot_writecombine (NormalNC). Userpsace does a
write to the NormalNC and immediately unmaps the VMA

What is the issue?

And then how does making KVM the thing that creates the NormalNC
change this?

Not knowing the whole details, here is my story about how it should work:

Unmapping the VMA's must already have some NormalNC friendly ordering
barrier across all CPUs or we have a bigger problem. This barrier
definately must close write combining.

VFIO issues a config space write to reset the PCI function.  Config
space writes MUST NOT write combine with anything. This is already
impossible for PCIe since they are different TLP types at the PCIe
level.

By the PCIe rules, config space write must order strictly after all
other CPU's accesses. Once the reset non-posted write returns back to
VFIO we know that:

 1) There is no reference in any CPU page table to the MMIO PFN
 2) No CPU has pending data in any write buffer
 3) The interconnect and PCIe fabric have no inflight operations
 4) The device is in a clean post-reset state

?

> knows all the details. The safest is for the VMM to keep it as Device (I
> think vfio-pci goes for the strongest nGnRnE).

We are probably going to allow VFIO to let userspace pick if it should
be pgprot_device or pgprot_writecombine.

The alias issue could be resolved by teaching KVM how to insert a
physical PFN based on some VFIO FD/dmabuf rather than a VMA so that
the PFNs are never mapped in the hypervisor side.

Jsaon

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-13 13:45                             ` Jason Gunthorpe
@ 2023-10-19 11:07                               ` Catalin Marinas
  2023-10-19 11:51                                 ` Jason Gunthorpe
  2023-10-19 13:35                                 ` Lorenzo Pieralisi
  0 siblings, 2 replies; 55+ messages in thread
From: Catalin Marinas @ 2023-10-19 11:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 13, 2023 at 10:45:41AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:
> > On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote:
> > > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> > > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > > > > Claiming back the device also seems strange if the guest has been using
> > > > > non-cacheable accesses since I think you could get write merging and
> > > > > reordering with subsequent device accesses trying to reset the device.
> > > > 
> > > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).
> > > 
> > > We do have a good story for this part: use Device-nGnRE!
> > 
> > Don't we actually need Device-nGnRnE for this, coupled with a DSB for
> > endpoint completion?
> > 
> > Device-nGnRE may be sufficient as a read from that device would ensure
> > that the previous write is observable (potentially with a DMB if
> > accessing separate device regions) but I don't think we do this now
> > either. Even this, isn't it device-specific? I don't know enough about
> > PCIe, posted writes, reordering, maybe others can shed some light.
> > 
> > For Normal NC, if the access doesn't have side-effects (or rather the
> > endpoint is memory-like), I think we are fine. The Stage 2 unmapping +
> > TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU
> > was pushed sufficiently far as not to affect subsequent writes by other
> > CPUs.
> > 
> > For I/O accesses that change some state of the device, I'm not sure the
> > TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only
> > nE + DSB as long as the PCIe device plays along nicely.
> 
> Can someone explain this concern a little more simply please?
> 
> Let's try something simpler. I have no KVM. My kernel driver 
> creates a VMA with pgprot_writecombine (NormalNC). Userpsace does a
> write to the NormalNC and immediately unmaps the VMA
> 
> What is the issue?

The issue is when the device is reclaimed to be given to another
user-space process, do we know the previous transaction reached the
device? With the TLBI + DSB in unmap, we can only tell that a subsequent
map + write (in a new process) is ordered after the write in the old
process. Maybe that's sufficient in most cases.

> And then how does making KVM the thing that creates the NormalNC
> change this?

They are similar.

> Not knowing the whole details, here is my story about how it should work:
> 
> Unmapping the VMA's must already have some NormalNC friendly ordering
> barrier across all CPUs or we have a bigger problem. This barrier
> definately must close write combining.

I think what we have is TLBI + DSB generating the DVM/DVMSync messages
should ensure that the Normal NC writes on other CPUs reach some
serialisation point (not necessarily the device, AFAICT we can't
guarantee end-point completion here).

> VFIO issues a config space write to reset the PCI function.  Config
> space writes MUST NOT write combine with anything. This is already
> impossible for PCIe since they are different TLP types at the PCIe
> level.

Yes, config space writes are fine, vfio-pci even maps them as
Device_nGnRnE. But AFAIK a guest won't have direct access to the config
space.

> By the PCIe rules, config space write must order strictly after all
> other CPU's accesses. Once the reset non-posted write returns back to
> VFIO we know that:
> 
>  1) There is no reference in any CPU page table to the MMIO PFN
>  2) No CPU has pending data in any write buffer
>  3) The interconnect and PCIe fabric have no inflight operations
>  4) The device is in a clean post-reset state

I think from the CPU perspective, we can guarantee that a Normal_NC
write on CPU0 for example reaches a serialisation point before a config
space (Device_nGnRnE) write on CPU1 by the host as long as CPU1 issued a
TLBI+DSB. Now, what I'm not sure is what this serialisation point is. If
it is the PCIe root complex, we are probably fine, we hope it deals with
any ordering between the Normal_NC write and the config space one.

Talking to Will earlier, I think we can deem the PCIe scenario
(somewhat) safe but not as a generic mechanism for other non-PCIe
devices (e.g. platform). With this concern, can we make this Stage 2
relaxation in KVM only for vfio-pci mappings? I don't have an example of
non-PCIe device assignment to figure out how this should work though.

> > knows all the details. The safest is for the VMM to keep it as Device (I
> > think vfio-pci goes for the strongest nGnRnE).
> 
> We are probably going to allow VFIO to let userspace pick if it should
> be pgprot_device or pgprot_writecombine.

I guess that's for the direct use by an application rather than VMM+VM.
IIUC people work around this currently by mapping PCIe BARs as
pgprot_writecombine() via sysfs. Getting vfio-pci to allow different
mappings is probably a good idea, though it doesn't currently help with
the KVM case as we can't force the VMM to know the specifics of the
device it is giving to a guest.

> The alias issue could be resolved by teaching KVM how to insert a
> physical PFN based on some VFIO FD/dmabuf rather than a VMA so that
> the PFNs are never mapped in the hypervisor side.

This should work as well and solves the aliasing problem, though it
requires changes to the VMM as well, not just KVM, which currently
relies on vfio-pci mmap().

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-19 11:07                               ` Catalin Marinas
@ 2023-10-19 11:51                                 ` Jason Gunthorpe
  2023-10-20 11:21                                   ` Catalin Marinas
  2023-10-19 13:35                                 ` Lorenzo Pieralisi
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-19 11:51 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:

> The issue is when the device is reclaimed to be given to another
> user-space process, do we know the previous transaction reached the
> device? With the TLBI + DSB in unmap, we can only tell that a subsequent
> map + write (in a new process) is ordered after the write in the old
> process. Maybe that's sufficient in most cases.

I can't think of a case where ordering is insufficient. As long as the
device perceives a strict seperation in order of writes preceeding the
'reset' writel and writes following the reset writel() the absolute
time of completion can not matter.

At least for VFIO PCI the reset sequences uses non-posted config TLPs
followed by config read TLPs.

For something like mlx5 when it reclaims a WC VMA it does a RPC which
is dma_wmb(), a writel(), a device DMA read, a device DMA write and a
CPU read fo DMA'd data.

Both of these are pretty "firm" ordering barriers.

> > Unmapping the VMA's must already have some NormalNC friendly ordering
> > barrier across all CPUs or we have a bigger problem. This barrier
> > definately must close write combining.
> 
> I think what we have is TLBI + DSB generating the DVM/DVMSync messages
> should ensure that the Normal NC writes on other CPUs reach some
> serialisation point (not necessarily the device, AFAICT we can't
> guarantee end-point completion here).

This is what our internal architects thought as well.

> Talking to Will earlier, I think we can deem the PCIe scenario
> (somewhat) safe but not as a generic mechanism for other non-PCIe
> devices (e.g. platform). With this concern, can we make this Stage 2
> relaxation in KVM only for vfio-pci mappings? I don't have an example of
> non-PCIe device assignment to figure out how this should work though.

It is not a KVM problem. As I implied above it is VFIO's
responsibility to reliably reset the device, not KVMs. If for some
reason VFIO cannot do that on some SOC then VFIO devices should not
exist.

It is not KVM's job to double guess VFIO's own security properties.

Specifically about platform the generic VFIO platform driver is the
ACPI based one. If the FW provides an ACPI method for device reset
that is not properly serializing, that is a FW bug. We can quirk it in
VFIO and block using those devices if they actually exist.

I expect the non-generic VFIO platform drivers to take care of this
issue internally with, barriers, read from devices, whatver is
required to make their SOCs order properly. Just as I would expect a
normal Linux platform driver to directly manage whatever
implementation specific ordering quirks the SOC may have.

Generic PCI drivers are the ones that rely on the implicit ordering at
the PCIe TLP level from TLBI + DSB. I would say this is part of the
undocumented arch API around pgprot_wc

> > > knows all the details. The safest is for the VMM to keep it as Device (I
> > > think vfio-pci goes for the strongest nGnRnE).
> > 
> > We are probably going to allow VFIO to let userspace pick if it should
> > be pgprot_device or pgprot_writecombine.
> 
> I guess that's for the direct use by an application rather than
> VMM+VM.

Yes, there have been requests for this.

> IIUC people work around this currently by mapping PCIe BARs as
> pgprot_writecombine() via sysfs. Getting vfio-pci to allow different
> mappings is probably a good idea, though it doesn't currently help with
> the KVM case as we can't force the VMM to know the specifics of the
> device it is giving to a guest.

Yes to all points

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-19 11:51                                 ` Jason Gunthorpe
@ 2023-10-20 11:21                                   ` Catalin Marinas
  2023-10-20 11:47                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Catalin Marinas @ 2023-10-20 11:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:
> > Talking to Will earlier, I think we can deem the PCIe scenario
> > (somewhat) safe but not as a generic mechanism for other non-PCIe
> > devices (e.g. platform). With this concern, can we make this Stage 2
> > relaxation in KVM only for vfio-pci mappings? I don't have an example of
> > non-PCIe device assignment to figure out how this should work though.
> 
> It is not a KVM problem. As I implied above it is VFIO's
> responsibility to reliably reset the device, not KVMs. If for some
> reason VFIO cannot do that on some SOC then VFIO devices should not
> exist.
> 
> It is not KVM's job to double guess VFIO's own security properties.

I'd argue that since KVM is the one relaxing the memory attributes
beyond what the VFIO driver allows the VMM to use, it is KVM's job to
consider the security implications. This is fine for vfio-pci and
Normal_NC but I'm not sure we can generalise.

> Specifically about platform the generic VFIO platform driver is the
> ACPI based one. If the FW provides an ACPI method for device reset
> that is not properly serializing, that is a FW bug. We can quirk it in
> VFIO and block using those devices if they actually exist.
> 
> I expect the non-generic VFIO platform drivers to take care of this
> issue internally with, barriers, read from devices, whatver is
> required to make their SOCs order properly. Just as I would expect a
> normal Linux platform driver to directly manage whatever
> implementation specific ordering quirks the SOC may have.

This would be a new requirement if an existing VFIO platform driver
relied on all mappings being Device. But maybe that's just theoretical
at the moment, are there any concrete examples outside vfio-pci? If not,
we can document it as per Lorenzo's suggestion to summarise this
discussion under Documentation/.

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-20 11:21                                   ` Catalin Marinas
@ 2023-10-20 11:47                                     ` Jason Gunthorpe
  2023-10-20 14:03                                       ` Lorenzo Pieralisi
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-20 11:47 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
	jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 20, 2023 at 12:21:50PM +0100, Catalin Marinas wrote:
> On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:
> > > Talking to Will earlier, I think we can deem the PCIe scenario
> > > (somewhat) safe but not as a generic mechanism for other non-PCIe
> > > devices (e.g. platform). With this concern, can we make this Stage 2
> > > relaxation in KVM only for vfio-pci mappings? I don't have an example of
> > > non-PCIe device assignment to figure out how this should work though.
> > 
> > It is not a KVM problem. As I implied above it is VFIO's
> > responsibility to reliably reset the device, not KVMs. If for some
> > reason VFIO cannot do that on some SOC then VFIO devices should not
> > exist.
> > 
> > It is not KVM's job to double guess VFIO's own security properties.
> 
> I'd argue that since KVM is the one relaxing the memory attributes
> beyond what the VFIO driver allows the VMM to use, it is KVM's job to
> consider the security implications. This is fine for vfio-pci and
> Normal_NC but I'm not sure we can generalise.

I can see that, but I belive we should take this responsibility into
VFIO as a requirement. As I said in the other email we do want to
extend VFIO to support NormalNC VMAs for DPDK, so we need to take this
anyhow.

> > Specifically about platform the generic VFIO platform driver is the
> > ACPI based one. If the FW provides an ACPI method for device reset
> > that is not properly serializing, that is a FW bug. We can quirk it in
> > VFIO and block using those devices if they actually exist.
> > 
> > I expect the non-generic VFIO platform drivers to take care of this
> > issue internally with, barriers, read from devices, whatver is
> > required to make their SOCs order properly. Just as I would expect a
> > normal Linux platform driver to directly manage whatever
> > implementation specific ordering quirks the SOC may have.
> 
> This would be a new requirement if an existing VFIO platform driver
> relied on all mappings being Device. But maybe that's just theoretical
> at the moment, are there any concrete examples outside vfio-pci? If not,
> we can document it as per Lorenzo's suggestion to summarise this
> discussion under Documentation/.

My point is if this becomes a real world concern we have a solid
answer on how to resolve it - fix the VFIO driver to have a stronger
barrier before reset.

I'm confident it is not a problem for PCI and IIRC the remaining ARM
platform drivers were made primarily for DPDK, not KVM.

So I agree with documenting and perhaps a comment someplace in VFIO is
also warranted.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-20 11:47                                     ` Jason Gunthorpe
@ 2023-10-20 14:03                                       ` Lorenzo Pieralisi
  2023-10-20 14:28                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-10-20 14:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Catalin Marinas, Will Deacon, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 20, 2023 at 08:47:19AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 20, 2023 at 12:21:50PM +0100, Catalin Marinas wrote:
> > On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:
> > > > Talking to Will earlier, I think we can deem the PCIe scenario
> > > > (somewhat) safe but not as a generic mechanism for other non-PCIe
> > > > devices (e.g. platform). With this concern, can we make this Stage 2
> > > > relaxation in KVM only for vfio-pci mappings? I don't have an example of
> > > > non-PCIe device assignment to figure out how this should work though.
> > > 
> > > It is not a KVM problem. As I implied above it is VFIO's
> > > responsibility to reliably reset the device, not KVMs. If for some
> > > reason VFIO cannot do that on some SOC then VFIO devices should not
> > > exist.
> > > 
> > > It is not KVM's job to double guess VFIO's own security properties.
> > 
> > I'd argue that since KVM is the one relaxing the memory attributes
> > beyond what the VFIO driver allows the VMM to use, it is KVM's job to
> > consider the security implications. This is fine for vfio-pci and
> > Normal_NC but I'm not sure we can generalise.
> 
> I can see that, but I belive we should take this responsibility into
> VFIO as a requirement. As I said in the other email we do want to
> extend VFIO to support NormalNC VMAs for DPDK, so we need to take this
> anyhow.
> 
> > > Specifically about platform the generic VFIO platform driver is the
> > > ACPI based one. If the FW provides an ACPI method for device reset
> > > that is not properly serializing, that is a FW bug. We can quirk it in
> > > VFIO and block using those devices if they actually exist.
> > > 
> > > I expect the non-generic VFIO platform drivers to take care of this
> > > issue internally with, barriers, read from devices, whatver is
> > > required to make their SOCs order properly. Just as I would expect a
> > > normal Linux platform driver to directly manage whatever
> > > implementation specific ordering quirks the SOC may have.
> > 
> > This would be a new requirement if an existing VFIO platform driver
> > relied on all mappings being Device. But maybe that's just theoretical
> > at the moment, are there any concrete examples outside vfio-pci? If not,
> > we can document it as per Lorenzo's suggestion to summarise this
> > discussion under Documentation/.
> 
> My point is if this becomes a real world concern we have a solid
> answer on how to resolve it - fix the VFIO driver to have a stronger
> barrier before reset.

Just to make sure I am parsing this correctly: this case above is
related to a non-PCI VFIO device passthrough where a guest would want to
map the device MMIO at stage-1 with normal-NC memory type (well, let's
say with a memory attribute != device-nGnRE - that combined with the new
stage-2 default might cause transactions ordering/grouping trouble with
eg device resets), correct ? IIRC, all requests related to honouring
"write-combine" style stage-1 mappings were for PCI(e) devices but
that's as far as what *I* was made aware of goes.

> I'm confident it is not a problem for PCI and IIRC the remaining ARM
> platform drivers were made primarily for DPDK, not KVM.
> 
> So I agree with documenting and perhaps a comment someplace in VFIO is
> also warranted.

We will do that, I will start adding the recent discussions to the
new documentation file. Side note: for those who attend LPC it would be
useful to review the resulting documentation together there, it should
happen around v6.7-rc1.

Lorenzo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-20 14:03                                       ` Lorenzo Pieralisi
@ 2023-10-20 14:28                                         ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-20 14:28 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Catalin Marinas, Will Deacon, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 20, 2023 at 04:03:57PM +0200, Lorenzo Pieralisi wrote:
> > My point is if this becomes a real world concern we have a solid
> > answer on how to resolve it - fix the VFIO driver to have a stronger
> > barrier before reset.
> 
> Just to make sure I am parsing this correctly: this case above is
> related to a non-PCI VFIO device passthrough where a guest would want to
> map the device MMIO at stage-1 with normal-NC memory type (well, let's
> say with a memory attribute != device-nGnRE - that combined with the new
> stage-2 default might cause transactions ordering/grouping trouble with
> eg device resets), correct ? 

This is what I have understood was Will's concern, yes.

> IIRC, all requests related to honouring "write-combine" style
> stage-1 mappings were for PCI(e) devices but that's as far as what
> *I* was made aware of goes.

Yes, this is what I am aware of as well.

Though I do not object to the idea from the VFIO side that platform
devices would also have to support NormalNC too.

The theoretical missing peice is that someone would say they have a
SOC issue XYZ and thus their VFIO platform devices must fully block
NormalNC. I suggest if someone comes with this and really, really
wants VFIO, then we could use a VMA flag to indicate that KVM must not
upgrade it. Currently I have no knowledge of such a thing existing.

With PCI we've made the argument that if NormalNC is broken unsafe for
KVM in the SOC then probably so is Device-*. I think the same basic
argument holds for platform devices too. Thus I'm skeptical that
someone can come and say they have SOC issue XYZ and NormalNC is
broken but Device-* is perfectly safe.

> We will do that, I will start adding the recent discussions to the
> new documentation file. Side note: for those who attend LPC it would be
> useful to review the resulting documentation together there, it should
> happen around v6.7-rc1.

I will be there, let me know

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-19 11:07                               ` Catalin Marinas
  2023-10-19 11:51                                 ` Jason Gunthorpe
@ 2023-10-19 13:35                                 ` Lorenzo Pieralisi
  1 sibling, 0 replies; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-10-19 13:35 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Jason Gunthorpe, Will Deacon, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:

[...]

> > VFIO issues a config space write to reset the PCI function.  Config
> > space writes MUST NOT write combine with anything. This is already
> > impossible for PCIe since they are different TLP types at the PCIe
> > level.
> 
> Yes, config space writes are fine, vfio-pci even maps them as
> Device_nGnRnE. But AFAIK a guest won't have direct access to the config
> space.
> 
> > By the PCIe rules, config space write must order strictly after all
> > other CPU's accesses. Once the reset non-posted write returns back to
> > VFIO we know that:
> > 
> >  1) There is no reference in any CPU page table to the MMIO PFN
> >  2) No CPU has pending data in any write buffer
> >  3) The interconnect and PCIe fabric have no inflight operations
> >  4) The device is in a clean post-reset state
> 
> I think from the CPU perspective, we can guarantee that a Normal_NC
> write on CPU0 for example reaches a serialisation point before a config
> space (Device_nGnRnE) write on CPU1 by the host as long as CPU1 issued a
> TLBI+DSB. Now, what I'm not sure is what this serialisation point is. If
> it is the PCIe root complex, we are probably fine, we hope it deals with
> any ordering between the Normal_NC write and the config space one.

If it is the PCI host bridge (and for PCI it should be since it is the
logic between the ARM world - where ARM ordering rules and barriers
apply - and PCI protocol), either it enforces PCI ordering rules or it
is broken by design; if it is the latter, at that stage device
assignment would be the least of our problems.

For non-PCI device assignment, I am not sure at all we can rely on
anything other than what Jason mentioned, eg resets (and the components
that through eg MMIO are carrying them out) are not architected, the
device MMIO space and the MMIO space used to trigger the reset (again,
it is an example) may well be placed on different interconnect paths,
it is device specific.

Lorenzo

> Talking to Will earlier, I think we can deem the PCIe scenario
> (somewhat) safe but not as a generic mechanism for other non-PCIe
> devices (e.g. platform). With this concern, can we make this Stage 2
> relaxation in KVM only for vfio-pci mappings? I don't have an example of
> non-PCIe device assignment to figure out how this should work though.
> 
> > > knows all the details. The safest is for the VMM to keep it as Device (I
> > > think vfio-pci goes for the strongest nGnRnE).
> > 
> > We are probably going to allow VFIO to let userspace pick if it should
> > be pgprot_device or pgprot_writecombine.
> 
> I guess that's for the direct use by an application rather than VMM+VM.
> IIUC people work around this currently by mapping PCIe BARs as
> pgprot_writecombine() via sysfs. Getting vfio-pci to allow different
> mappings is probably a good idea, though it doesn't currently help with
> the KVM case as we can't force the VMM to know the specifics of the
> device it is giving to a guest.
> 
> > The alias issue could be resolved by teaching KVM how to insert a
> > physical PFN based on some VFIO FD/dmabuf rather than a VMA so that
> > the PFNs are never mapped in the hypervisor side.
> 
> This should work as well and solves the aliasing problem, though it
> requires changes to the VMM as well, not just KVM, which currently
> relies on vfio-pci mmap().
> 
> -- 
> Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-13 13:08                           ` Catalin Marinas
  2023-10-13 13:45                             ` Jason Gunthorpe
@ 2023-10-13 15:28                             ` Lorenzo Pieralisi
  2023-10-19 11:12                               ` Catalin Marinas
  2023-11-09 15:34                             ` Lorenzo Pieralisi
  2 siblings, 1 reply; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-10-13 15:28 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:

[...]

> Yes, we end up with mismatched aliases but they only matter if the VMM
> also accesses the I/O range via its own mapping. So far I haven't seen
> case that suggests this.
> 
> > > Things can go wrong but that's not because Device does anything better.
> > > Given the RAS implementation, external aborts caused on Device memory
> > > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > > contained (I can dig out the reasoning behind this if you want, IIUC
> > > something to do with not being able to cancel an already issued Device
> > > access since such accesses don't allow speculation due to side-effects;
> > > for Normal NC, it's just about the software not getting the data).
> > 
> > I really think these details belong in the commit message.
> 
> I guess another task for Lorenzo ;).

I will do, I start wondering though whether this documentation belongs
in this commit log only or at Documentation/arch/arm64 level (or both),
I am pretty sure this thread can turn out quite useful as a reference
(it is for me) if we manage to summarize it that would benefit
everyone.

Lorenzo

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-13 15:28                             ` Lorenzo Pieralisi
@ 2023-10-19 11:12                               ` Catalin Marinas
  0 siblings, 0 replies; 55+ messages in thread
From: Catalin Marinas @ 2023-10-19 11:12 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Will Deacon, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 13, 2023 at 05:28:10PM +0200, Lorenzo Pieralisi wrote:
> On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:
> 
> [...]
> 
> > Yes, we end up with mismatched aliases but they only matter if the VMM
> > also accesses the I/O range via its own mapping. So far I haven't seen
> > case that suggests this.
> > 
> > > > Things can go wrong but that's not because Device does anything better.
> > > > Given the RAS implementation, external aborts caused on Device memory
> > > > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > > > contained (I can dig out the reasoning behind this if you want, IIUC
> > > > something to do with not being able to cancel an already issued Device
> > > > access since such accesses don't allow speculation due to side-effects;
> > > > for Normal NC, it's just about the software not getting the data).
> > > 
> > > I really think these details belong in the commit message.
> > 
> > I guess another task for Lorenzo ;).
> 
> I will do, I start wondering though whether this documentation belongs
> in this commit log only or at Documentation/arch/arm64 level (or both),
> I am pretty sure this thread can turn out quite useful as a reference
> (it is for me) if we manage to summarize it that would benefit
> everyone.

I think it makes sense to add something in the Documentation for easier
future reference. We can also keep adding to it as we learn more. The
commit log can be shorter in this case.

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-13 13:08                           ` Catalin Marinas
  2023-10-13 13:45                             ` Jason Gunthorpe
  2023-10-13 15:28                             ` Lorenzo Pieralisi
@ 2023-11-09 15:34                             ` Lorenzo Pieralisi
  2023-11-10 14:26                               ` Jason Gunthorpe
  2023-11-13 17:41                               ` Catalin Marinas
  2 siblings, 2 replies; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-11-09 15:34 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:

[...]

> > > Things can go wrong but that's not because Device does anything better.
> > > Given the RAS implementation, external aborts caused on Device memory
> > > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > > contained (I can dig out the reasoning behind this if you want, IIUC
> > > something to do with not being able to cancel an already issued Device
> > > access since such accesses don't allow speculation due to side-effects;
> > > for Normal NC, it's just about the software not getting the data).
> > 
> > I really think these details belong in the commit message.
> 
> I guess another task for Lorenzo ;).

Updated commit log (it might be [is] too verbose) below, it should probably
be moved into a documentation file but to do that I should decouple
it from this changeset (ie a document explaining memory attributes
and error containment for ARM64 - indipendent from KVM S2 defaults).

I'd also add a Link: to the lore archive entry for reference (did not
add it in the log below).

Please let me know what's the best option here.

Thanks,
Lorenzo

-- >8 --
Currently, KVM for ARM64 maps at stage 2 memory that is
considered device (ie it is not RAM) with DEVICE_nGnRE
memory attributes; this setting overrides (as per the ARM
architecture [1]) any device MMIO mapping present at stage
1, resulting in a set-up whereby a guest operating system
can't determine device MMIO mapping memory attributes on its
own but it is always overriden by the KVM stage 2 default.

This set-up does not allow guest operating systems to select
device memory attributes on a page by page basis independently
from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
2 memory type attributes"), which turns out to be an issue in that
guest operating systems (eg Linux) may request to map
devices MMIO regions with memory attributes that guarantee
better performance (eg gathering attribute - that for some
devices can generate larger PCIe memory writes TLPs)
and specific operations (eg unaligned transactions) such as
the NormalNC memory type.

The default device stage 2 mapping was chosen in KVM
for ARM64 since it was considered safer (ie it would
not allow guests to trigger uncontained failures
ultimately crashing the machine) but this turned out
to be imprecise.

Failures containability is a property of the platform
and is independent from the memory type used for MMIO
device memory mappings.

Actually, DEVICE_nGnRE memory type is even more problematic
than eg Normal-NC memory type in terms of faults containability
in that eg aborts triggered on DEVICE_nGnRE loads cannot be made,
architecturally, synchronous (ie that would imply that the processor
should issue at most 1 load transaction at a time - ie it can't pipeline
them - otherwise the synchronous abort semantics would break the
no-speculation attribute attached to DEVICE_XXX memory).

This means that regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures and that in turn relies on platform
capabilities and the device type being assigned (ie PCIe AER/DPC
error containment and RAS architecture[3]); therefore the default
KVM device stage 2 memory attributes play no role in making device
assignment safer for a given platform (if the platform design
adheres to design guidelines outlined in [3]) and therefore can
be relaxed.

For all these reasons, relax the KVM stage 2 device
memory attributes from DEVICE_nGnRE to Normal-NC.

A different Normal memory type default at stage-2
(eg Normal Write-through) could have been chosen
instead of Normal-NC but Normal-NC was chosen
because:

- Its attributes are well-understood compared to
  other Normal memory types for MMIO mappings
- On systems implementing S2FWB (FEAT_S2FWB), that's the only sensible
  default for normal memory types. For systems implementing
  FEAT_S2FWB (see [1] D8.5.5 S2=stage-2 - S2 MemAttr[3:0]), the options
  to choose the memory types are as follows:

  if S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX
  (XXX determined by S2 MemAttr[1:0]). This is what we have
  today (MemAttr[2:0] == 0b001) and therefore it is not described
  any further.

  if S2 MemAttr[2] == 1, there are the following options:

  S2 MemAttr[2:0] | Resulting mapping
  -----------------------------------------------------------------------------
  0b101           | Prevent the guest from selecting cachable memory type, it
		  | allows it to choose Device-* or Normal-NC
  0b110           | It forces write-back memory type; it breaks MMIO.
  0b111           | Allow the VM to select any memory type including cachable.
		  | It is unclear whether this is safe from a platform
		  | perspective, especially wrt uncontained failures and
		  | cacheability (eg device reclaim/reset and cache
		  | maintenance).
  ------------------------------------------------------------------------------

  - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping
    identical to FEAT_S2FWB (that basically would force Normal-NC, see
    option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach
    between the two

Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (ie device
MMIO unmap followed by a device reset) at least for PCIe devices, in that
in PCIe a device reset is architected and carried out through PCI config
space transactions that are naturally ordered wrt MMIO transactions
according to the PCI ordering rules.

Having Normal-NC S2 default puts guests in control (thanks to
stage1+stage2 combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules described in [1]
and summarized here ([(S1) - stage1], [(S2) - stage 2]):

S1	     |  S2	     | Result
NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>

It is worth noting that currently, to map devices MMIO space to user
space in a device pass-through use case the VFIO framework applies memory
attributes derived from pgprot_noncached() settings applied to VMAs, which
result in device-nGnRnE memory attributes for the stage-1 VMM mappings.

This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].

Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change
anything in this respect, in that the mismatched aliases would only affect
(refer to [2] for a detailed explanation) ordering between the userspace and
GuestOS mappings resulting stream of transactions (ie it does not cause loss of
property for either stream of transactions on its own), which is harmless
given that the userspace and GuestOS access to the device is carried
out through independent transactions streams.

[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-11-09 15:34                             ` Lorenzo Pieralisi
@ 2023-11-10 14:26                               ` Jason Gunthorpe
  2023-11-13  0:42                                 ` Lorenzo Pieralisi
  2023-11-13 17:41                               ` Catalin Marinas
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-11-10 14:26 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Catalin Marinas, Will Deacon, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote:

> Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
> trigger any issue on guest device reclaim use cases either (ie device
> MMIO unmap followed by a device reset) at least for PCIe devices, in that
> in PCIe a device reset is architected and carried out through PCI config
> space transactions that are naturally ordered wrt MMIO transactions
> according to the PCI ordering rules.

This is not how I see that thread concluding..

The device reclaim problem belongs solely to VFIO, not KVM. VFIO must
ensure global ordering of access before the VMA is unmaped and access
after, that includes ordering whatever mechanism the VFIO driver uses
for reset.

If there are quirky SOCs, or non-PCI devices that need something
stronger than the TLBI/etc sequence it should be fixed in VFIO (or
maybe even the arch code), not by blocking NORMAL_NC in the KVM. Such
a quirky SOC would broadly have security issues beyond KVM.

> It is worth noting that currently, to map devices MMIO space to user
> space in a device pass-through use case the VFIO framework applies memory
> attributes derived from pgprot_noncached() settings applied to VMAs, which

Sometimes. VFIO uses a mix of pgprot_noncached and pgprot_device. AFAIK
we should change to to always use pgprot_device..

Thanks,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-11-10 14:26                               ` Jason Gunthorpe
@ 2023-11-13  0:42                                 ` Lorenzo Pieralisi
  0 siblings, 0 replies; 55+ messages in thread
From: Lorenzo Pieralisi @ 2023-11-13  0:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Catalin Marinas, Will Deacon, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Nov 10, 2023 at 10:26:49AM -0400, Jason Gunthorpe wrote:
> On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote:
> 
> > Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
> > trigger any issue on guest device reclaim use cases either (ie device
> > MMIO unmap followed by a device reset) at least for PCIe devices, in that
> > in PCIe a device reset is architected and carried out through PCI config
> > space transactions that are naturally ordered wrt MMIO transactions
> > according to the PCI ordering rules.
> 
> This is not how I see that thread concluding..
> 
> The device reclaim problem belongs solely to VFIO, not KVM. VFIO must
> ensure global ordering of access before the VMA is unmaped and access
> after, that includes ordering whatever mechanism the VFIO driver uses
> for reset.
> 
> If there are quirky SOCs, or non-PCI devices that need something
> stronger than the TLBI/etc sequence it should be fixed in VFIO (or
> maybe even the arch code), not by blocking NORMAL_NC in the KVM. Such
> a quirky SOC would broadly have security issues beyond KVM.

https://lore.kernel.org/linux-arm-kernel/20231013092934.GA13524@willie-the-truck

I think that Will's point _was_ related to the change we are making for
KVM S2 mappings and related device transactions on device reclaim - ie
reset, that's what I tried to convey (I probably simplified too much)
that for PCI at least that should not trigger any regression/issue, in
that BAR MMIO and reset transactions are decoupled streams and must
follow the PCI ordering rules.

Yes, it is VFIO responsibility but changing the S2 KVM mappings
may have (for non-PCI devices) side effects compared to what
we have today, I am not saying this should be a blocker I just
summarized the thread above, the paragraph can be expanded.

Lorenzo

> 
> > It is worth noting that currently, to map devices MMIO space to user
> > space in a device pass-through use case the VFIO framework applies memory
> > attributes derived from pgprot_noncached() settings applied to VMAs, which
> 
> Sometimes. VFIO uses a mix of pgprot_noncached and pgprot_device. AFAIK
> we should change to to always use pgprot_device..
> 
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-11-09 15:34                             ` Lorenzo Pieralisi
  2023-11-10 14:26                               ` Jason Gunthorpe
@ 2023-11-13 17:41                               ` Catalin Marinas
  1 sibling, 0 replies; 55+ messages in thread
From: Catalin Marinas @ 2023-11-13 17:41 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Will Deacon, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard,
	danw, linux-arm-kernel, kvmarm, linux-kernel

Hi Lorenzo,

Thanks for putting this together. It looks fine to me, just some general
comments below.

On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote:
> Updated commit log (it might be [is] too verbose) below, it should probably
> be moved into a documentation file but to do that I should decouple
> it from this changeset (ie a document explaining memory attributes
> and error containment for ARM64 - indipendent from KVM S2 defaults).
> 
> I'd also add a Link: to the lore archive entry for reference (did not
> add it in the log below).
> 
> Please let me know what's the best option here.

It would be good to have this under Documentation/arch/arm64/, though as
a document at the end of the series describing what the new behaviour
is, the kernel/KVM expectations (after the patches have been applied).
We could even submit it separately once we agreed on how the series
looks like.

> -- >8 --
> Currently, KVM for ARM64 maps at stage 2 memory that is
> considered device (ie it is not RAM) with DEVICE_nGnRE
> memory attributes; this setting overrides (as per the ARM
> architecture [1]) any device MMIO mapping present at stage
> 1, resulting in a set-up whereby a guest operating system
> can't determine device MMIO mapping memory attributes on its
> own but it is always overriden by the KVM stage 2 default.

To be fully correct, stage 1 and stage 2 attributes combine in a way
that results in the more restrictive properties, so not a simple
override. But to keep things simple, especially if this part only goes
in the commit log, leave it as is.

> This set-up does not allow guest operating systems to select
> device memory attributes on a page by page basis independently

Not sure it's even worth specifying "page by page". It just doesn't
allow the guest to relax the device map memory attributes at all.

> from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
> 2 memory type attributes"), which turns out to be an issue in that
> guest operating systems (eg Linux) may request to map
> devices MMIO regions with memory attributes that guarantee
> better performance (eg gathering attribute - that for some
> devices can generate larger PCIe memory writes TLPs)
> and specific operations (eg unaligned transactions) such as
> the NormalNC memory type.

Another case is correct guest behaviour where it would not expect
unaligned accesses from Normal NC mappings.

> The default device stage 2 mapping was chosen in KVM
> for ARM64 since it was considered safer (ie it would
> not allow guests to trigger uncontained failures
> ultimately crashing the machine) but this turned out
> to be imprecise.

s/imprecise/asynchronous (SError)/

Another reason was probably that it matches the default VFIO mapping
(though the latter is slightly stricter as in Device_nGnRnE).

> Failures containability is a property of the platform
> and is independent from the memory type used for MMIO
> device memory mappings.
> 
> Actually, DEVICE_nGnRE memory type is even more problematic
> than eg Normal-NC memory type in terms of faults containability
> in that eg aborts triggered on DEVICE_nGnRE loads cannot be made,
> architecturally, synchronous (ie that would imply that the processor
> should issue at most 1 load transaction at a time - ie it can't pipeline
> them - otherwise the synchronous abort semantics would break the
> no-speculation attribute attached to DEVICE_XXX memory).
> 
> This means that regardless of the combined stage1+stage2 mappings a
> platform is safe if and only if device transactions cannot trigger
> uncontained failures and that in turn relies on platform
> capabilities and the device type being assigned (ie PCIe AER/DPC
> error containment and RAS architecture[3]); therefore the default
> KVM device stage 2 memory attributes play no role in making device
> assignment safer for a given platform (if the platform design
> adheres to design guidelines outlined in [3]) and therefore can
> be relaxed.
> 
> For all these reasons, relax the KVM stage 2 device
> memory attributes from DEVICE_nGnRE to Normal-NC.

I think this covers the safety aspect w.r.t. uncontained errors.

> A different Normal memory type default at stage-2
> (eg Normal Write-through) could have been chosen
> instead of Normal-NC but Normal-NC was chosen
> because:
> 
> - Its attributes are well-understood compared to
>   other Normal memory types for MMIO mappings

Ideally we'd have gone for Device_GRE from a performance perspective but
it doesn't support unaligned accesses. Basically at this stage we don't
want cacheable accesses to MMIO until we gain a better understanding on
what the impact would be.

> - On systems implementing S2FWB (FEAT_S2FWB), that's the only sensible
>   default for normal memory types. For systems implementing
>   FEAT_S2FWB (see [1] D8.5.5 S2=stage-2 - S2 MemAttr[3:0]), the options
>   to choose the memory types are as follows:
> 
>   if S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX
>   (XXX determined by S2 MemAttr[1:0]). This is what we have
>   today (MemAttr[2:0] == 0b001) and therefore it is not described
>   any further.
> 
>   if S2 MemAttr[2] == 1, there are the following options:
>  
>   S2 MemAttr[2:0] | Resulting mapping
>   -----------------------------------------------------------------------------
>   0b101           | Prevent the guest from selecting cachable memory type, it
> 		  | allows it to choose Device-* or Normal-NC
>   0b110           | It forces write-back memory type; it breaks MMIO.
>   0b111           | Allow the VM to select any memory type including cachable.
> 		  | It is unclear whether this is safe from a platform
> 		  | perspective, especially wrt uncontained failures and
> 		  | cacheability (eg device reclaim/reset and cache
> 		  | maintenance).
>   ------------------------------------------------------------------------------
> 
>   - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping
>     identical to FEAT_S2FWB (that basically would force Normal-NC, see
>     option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach
>     between the two

Is this text only to say that we can't use write-through memory because
with FEAT_S2FWB it is forced cacheable? I'd probably keep in simple and
just state that we went for Normal NC to avoid cache
allocation/snooping.

It just crossed my mind, I think we also assume here that on platforms
with transparent caches, Normal NC of MMIO won't be upgraded to
cacheable (that's irrespective of KVM). This may need to be stated in
some Arm BSA document (I'm not aware of any arch rules that prevent this
from happening). AFAIK, platforms with transparent caches only do this
for the memory ranges where the DRAM is expected.

> Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
> trigger any issue on guest device reclaim use cases either (ie device
> MMIO unmap followed by a device reset) at least for PCIe devices, in that
> in PCIe a device reset is architected and carried out through PCI config
> space transactions that are naturally ordered wrt MMIO transactions
> according to the PCI ordering rules.

We could state this in a doc that for device pass-through, there's an
expectation that the device can be reclaimed along the lines of the PCIe
reset behaviour. If one admin decides to allow device pass-through for
some unsafe devices/platforms, it's their problem really. I don't think
a Device_nGnRE mapping would help in those cases anyway.

This could be expanded to Normal Cacheable at some point but KVM would
have to do cache maintenance after unmapping the device from the guest
(so let's leave this for a separate series).

> Having Normal-NC S2 default puts guests in control (thanks to
> stage1+stage2 combined memory attributes rules [1]) of device MMIO
> regions memory mappings, according to the rules described in [1]
> and summarized here ([(S1) - stage1], [(S2) - stage 2]):
> 
> S1	     |  S2	     | Result
> NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
> NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
> NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
> DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>
> 
> It is worth noting that currently, to map devices MMIO space to user
> space in a device pass-through use case the VFIO framework applies memory
> attributes derived from pgprot_noncached() settings applied to VMAs, which
> result in device-nGnRnE memory attributes for the stage-1 VMM mappings.
> 
> This means that a userspace mapping for device MMIO space carried
> out with the current VFIO framework and a guest OS mapping for the same
> MMIO space may result in a mismatched alias as described in [2].
> 
> Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change
> anything in this respect, in that the mismatched aliases would only affect
> (refer to [2] for a detailed explanation) ordering between the userspace and
> GuestOS mappings resulting stream of transactions (ie it does not cause loss of
> property for either stream of transactions on its own), which is harmless
> given that the userspace and GuestOS access to the device is carried
> out through independent transactions streams.

I think this was a recent clarification from the architects (in private
emails). My understanding for some time has been that the mere presence
of a Normal alias would affect the semantics of the user (VMM) accesses
to the Device mapping. Apparently, this only matters if there is some
expected ordering between the user (VMM) access and the VM one with
different aliases. I can't tell where the mismatched aliases rules state
this but I haven't read them in a while.

-- 
Catalin

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita
  2023-09-08 16:40   ` Catalin Marinas
  2023-09-11 14:57   ` Lorenzo Pieralisi
@ 2023-10-12 12:27   ` Will Deacon
  2 siblings, 0 replies; 55+ messages in thread
From: Will Deacon @ 2023-10-12 12:27 UTC (permalink / raw)
  To: ankita
  Cc: jgg, maz, oliver.upton, catalin.marinas, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, apopple, jhubbard, danw,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> Linux allows device drivers to map IO memory on a per-page basis using
> "write combining" or WC. This is often done using
> pgprot_writecombing(). The driver knows which pages can support WC
> access and the proper programming model to generate this IO. Generally
> the use case is to boost performance by using write combining to
> generate larger PCIe MemWr TLPs.
> 
> Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for
> all IO memory. This puts the VM in charge of the memory attributes,
> and removes the KVM override to DEVICE_nGnRE.
> 
> Ultimately this makes pgprot_writecombing() work correctly in VMs and
> allows drivers like mlx5 to fully operate their HW.
> 
> After some discussions with ARM and CPU architects we reached the
> conclusion there was no need for KVM to prevent the VM from selecting
> between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
> that NORMAL_NC could result in uncontained failures, but upon deeper
> analysis it turns out there are already possible cases for uncontained
> failures with DEVICE types too. Ultimately the platform must be
> implemented in a way that ensures that all DEVICE_* and NORMAL_NC
> accesses have no uncontained failures.
> 
> Fortunately real platforms do tend to implement this.
> 
> This patch makes the VM's memory attributes behave as follows:
> 
>  S1           |   S2          |  Result
>  NORMAL-WB    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-WT    |  NORMAL-NC    |  NORMAL-NC
>  NORMAL-NC    |  NORMAL-NC    |  NORMAL-NC
>  DEVICE<attr> |  NORMAL-NC    |  DEVICE<attr>
> 
> See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf
> for details.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  arch/arm64/include/asm/memory.h | 2 ++
>  arch/arm64/kvm/hyp/pgtable.c    | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index fde4186cc387..c247e5f29d5a 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -147,6 +147,7 @@
>   * Memory types for Stage-2 translation
>   */
>  #define MT_S2_NORMAL		0xf
> +#define MT_S2_NORMAL_NC	0x5
>  #define MT_S2_DEVICE_nGnRE	0x1
>  
>  /*
> @@ -154,6 +155,7 @@
>   * Stage-2 enforces Normal-WB and Device-nGnRE
>   */
>  #define MT_S2_FWB_NORMAL	6
> +#define MT_S2_FWB_NORMAL_NC	5
>  #define MT_S2_FWB_DEVICE_nGnRE	1
>  
>  #ifdef CONFIG_ARM64_4K_PAGES
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index ccd291b6893d..a80949002191 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
>  				kvm_pte_t *ptep)
>  {
>  	bool device = prot & KVM_PGTABLE_PROT_DEVICE;
> -	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) :
> +	kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) :
>  			    KVM_S2_MEMATTR(pgt, NORMAL);
>  	u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS;

I think this is putting the policy into the low-level page-table code,
which isn't where it belongs. KVM_PGTABLE_PROT_DEVICE means that the
mapping should be created with device attributes. If you want other
attributes, then please introduce another entry to 'kvm_pgtable_prot'
and pass that instead (e.g. KVM_PGTABLE_PROT_NC, which coincidentally
we already have in Android [1] ;).

Will

[1] https://android.googlesource.com/kernel/common/+/72cc19df8b71095f9740ff0ca6a75bf7ed27b0cd%5E%21/

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2024-03-19 13:38 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-07 18:14 [PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs ankita
2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita
2023-09-07 19:12   ` Jason Gunthorpe
2023-10-05 16:15   ` Catalin Marinas
2023-10-05 16:54     ` Jason Gunthorpe
2023-10-10 14:25       ` Catalin Marinas
2023-10-10 15:05         ` Jason Gunthorpe
2023-10-10 17:19           ` Catalin Marinas
2023-10-10 18:23             ` Jason Gunthorpe
2023-10-11 17:45               ` Catalin Marinas
2023-10-11 18:38                 ` Jason Gunthorpe
2023-10-12 16:16                   ` Catalin Marinas
2024-03-10  3:49                     ` Ankit Agrawal
2024-03-19 13:38                       ` Jason Gunthorpe
2023-10-23 13:20   ` Shameerali Kolothum Thodi
2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita
2023-09-08 16:40   ` Catalin Marinas
2023-09-11 14:57   ` Lorenzo Pieralisi
2023-09-11 17:20     ` Jason Gunthorpe
2023-09-13 15:26       ` Lorenzo Pieralisi
2023-09-13 18:54         ` Jason Gunthorpe
2023-09-26  8:31           ` Lorenzo Pieralisi
2023-09-26 12:25             ` Jason Gunthorpe
2023-09-26 13:52             ` Catalin Marinas
2023-09-26 16:12               ` Lorenzo Pieralisi
2023-10-05  9:56               ` Lorenzo Pieralisi
2023-10-05 11:56                 ` Jason Gunthorpe
2023-10-05 14:08                   ` Lorenzo Pieralisi
2023-10-12 12:35                 ` Will Deacon
2023-10-12 13:20                   ` Jason Gunthorpe
2023-10-12 14:29                     ` Lorenzo Pieralisi
2023-10-12 13:53                   ` Catalin Marinas
2023-10-12 14:48                     ` Will Deacon
2023-10-12 15:44                       ` Jason Gunthorpe
2023-10-12 16:39                         ` Will Deacon
2023-10-12 18:36                           ` Jason Gunthorpe
2023-10-13  9:29                             ` Will Deacon
2023-10-12 17:26                       ` Catalin Marinas
2023-10-13  9:29                         ` Will Deacon
2023-10-13 13:08                           ` Catalin Marinas
2023-10-13 13:45                             ` Jason Gunthorpe
2023-10-19 11:07                               ` Catalin Marinas
2023-10-19 11:51                                 ` Jason Gunthorpe
2023-10-20 11:21                                   ` Catalin Marinas
2023-10-20 11:47                                     ` Jason Gunthorpe
2023-10-20 14:03                                       ` Lorenzo Pieralisi
2023-10-20 14:28                                         ` Jason Gunthorpe
2023-10-19 13:35                                 ` Lorenzo Pieralisi
2023-10-13 15:28                             ` Lorenzo Pieralisi
2023-10-19 11:12                               ` Catalin Marinas
2023-11-09 15:34                             ` Lorenzo Pieralisi
2023-11-10 14:26                               ` Jason Gunthorpe
2023-11-13  0:42                                 ` Lorenzo Pieralisi
2023-11-13 17:41                               ` Catalin Marinas
2023-10-12 12:27   ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).