* [PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs
@ 2023-09-07 18:14 ankita
2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita
2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita
0 siblings, 2 replies; 55+ messages in thread
From: ankita @ 2023-09-07 18:14 UTC (permalink / raw)
To: ankita, jgg, maz, oliver.upton, catalin.marinas, will
Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel
From: Ankit Agrawal <ankita@nvidia.com>
Today KVM forces all of the memory to either NORMAL or DEVICE_nGnRE
largely based on pfn_is_map_memory() and ignores the per-VMA flags
that indicate the memory attributes.
This means there is no way for a VM to get NORMAL_NC IO memory (what
Linux calls WC memory, which is used for performance in certain NIC
and GPU devices). There is also no way for a VM to get cachable IO
memory (like from a CXL or pre-CXL device). In both cases the memory
will be forced to be DEVICE_nGnRE and the VM's memory attributes will
be ignored.
After some discussions with ARM and CPU architects we reached the
conclusion there was no need for KVM to prevent the VM from selecting
between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear
that NORMAL_NC could result in uncontained failures, but upon deeper
analysis it turns out there are already possible cases for uncontained
failures with DEVICE types too. Ultimately the platform must be
implemented in a way that ensures that all DEVICE_* and NORMAL_NC
accesses have no uncontained failures.
Fortunately real platforms do tend to implement this.
This is similar to x86 where various BIOS choices can result in
uncontained failures (machine checks in x86 speak) on IO memory. On
either architecture there is no way for KVM to know if the underlying
platform is fully safe or not.
This series builds on the assumption that uncachable IO will not
generate uncontained failures for DEVICE_* or NORMAL_NC, and that
cachable IO will not generate uncontained failures for NORMAL. We hope
ARM will publish information helping platform designers follow these
guidelines.
Additionally, KVM now derives the cacheability decision entirely from
the VMA. If the VMA is cachable then the stage-2 mapping is made NORMAL
regardless of pfn_is_map_memory(). This closes a hole where cachable
VMA mappings in a process could be accessed via an uncached alias
through KVM.
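For illustration, a minimal VMM-side sketch of the flow this enables
(the device fd, region offset, size and guest physical address are
placeholders; error handling is omitted): the VMM mmap()s device memory
through a driver whose mmap handler picks the pgprot, then registers
the mapping as an ordinary memslot, and the stage-2 attributes now
follow that pgprot instead of being forced to DEVICE_nGnRE.

/* Sketch only: placeholder fd, offsets and addresses. */
#include <stdint.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int add_device_memslot(int vm_fd, int device_fd, uint64_t offset,
			      uint64_t size, uint64_t guest_phys, uint32_t slot)
{
	/*
	 * The driver's mmap handler decides the mapping attributes
	 * (e.g. write-combine or cacheable); with this series KVM
	 * derives the stage-2 memory type from this VMA.
	 */
	void *va = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			device_fd, offset);
	if (va == MAP_FAILED)
		return -1;

	struct kvm_userspace_memory_region region = {
		.slot = slot,
		.guest_phys_addr = guest_phys,
		.memory_size = size,
		.userspace_addr = (uintptr_t)va,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}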
Applied over next-20230906
Ankit Agrawal (2):
KVM: arm64: determine memory type from VMA
KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO
memory
arch/arm64/include/asm/kvm_pgtable.h | 8 ++++++
arch/arm64/include/asm/memory.h | 2 ++
arch/arm64/kvm/hyp/pgtable.c | 4 +--
arch/arm64/kvm/mmu.c | 40 +++++++++++++++++++++++++---
4 files changed, 48 insertions(+), 6 deletions(-)
--
2.17.1
^ permalink raw reply [flat|nested] 55+ messages in thread* [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-09-07 18:14 [PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs ankita @ 2023-09-07 18:14 ` ankita 2023-09-07 19:12 ` Jason Gunthorpe ` (2 more replies) 2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita 1 sibling, 3 replies; 55+ messages in thread From: ankita @ 2023-09-07 18:14 UTC (permalink / raw) To: ankita, jgg, maz, oliver.upton, catalin.marinas, will Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel From: Ankit Agrawal <ankita@nvidia.com> Currently KVM determines if a VMA is pointing at IO memory by checking pfn_is_map_memory(). However, the MM already gives us a way to tell what kind of memory it is by inspecting the VMA. Replace pfn_is_map_memory() with a check on the VMA pgprot to determine if the memory is IO and thus needs stage-2 device mapping. The VMA's pgprot is tested to determine the memory type with the following mapping: pgprot_noncached MT_DEVICE_nGnRnE device pgprot_writecombine MT_NORMAL_NC device pgprot_device MT_DEVICE_nGnRE device pgprot_tagged MT_NORMAL_TAGGED RAM This patch solves a problems where it is possible for the kernel to have VMAs pointing at cachable memory without causing pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL devices. This memory is now properly marked as cachable in KVM. Unfortunately when FWB is not enabled, the kernel expects to naively do cache management by flushing the memory using an address in the kernel's map. This does not work in several of the newly allowed cases such as dcache_clean_inval_poc(). Check whether the targeted pfn and its mapping KVA is valid in case the FWB is absent before continuing. Signed-off-by: Ankit Agrawal <ankita@nvidia.com> --- arch/arm64/include/asm/kvm_pgtable.h | 8 ++++++ arch/arm64/kvm/hyp/pgtable.c | 2 +- arch/arm64/kvm/mmu.c | 40 +++++++++++++++++++++++++--- 3 files changed, 45 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h index d3e354bb8351..0579dbe958b9 100644 --- a/arch/arm64/include/asm/kvm_pgtable.h +++ b/arch/arm64/include/asm/kvm_pgtable.h @@ -430,6 +430,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size); */ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift); +/** + * stage2_has_fwb() - Determine whether FWB is supported + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*() + * + * Return: True if FWB is supported. + */ +bool stage2_has_fwb(struct kvm_pgtable *pgt); + /** * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD * @vtcr: Content of the VTCR register. 
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index f155b8c9e98c..ccd291b6893d 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -662,7 +662,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift) return vtcr; } -static bool stage2_has_fwb(struct kvm_pgtable *pgt) +bool stage2_has_fwb(struct kvm_pgtable *pgt) { if (!cpus_have_const_cap(ARM64_HAS_STAGE2_FWB)) return false; diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 482280fe22d7..79f1caaa08a0 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -1391,6 +1391,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma) return vma->vm_flags & VM_MTE_ALLOWED; } +/* + * Determine the memory region cacheability from VMA's pgprot. This + * is used to set the stage 2 PTEs. + */ +static unsigned long mapping_type(pgprot_t page_prot) +{ + return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot)); +} + static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_memory_slot *memslot, unsigned long hva, unsigned long fault_status) @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, gfn = fault_ipa >> PAGE_SHIFT; mte_allowed = kvm_vma_mte_allowed(vma); + /* + * Figure out the memory type based on the user va mapping properties + * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using + * pgprot_device() and pgprot_noncached() respectively. + */ + if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) || + (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) || + (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC)) + prot |= KVM_PGTABLE_PROT_DEVICE; + else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC)) + prot |= KVM_PGTABLE_PROT_X; + /* Don't use the VMA after the unlock -- it may have vanished */ vma = NULL; @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, if (exec_fault) prot |= KVM_PGTABLE_PROT_X; - if (device) - prot |= KVM_PGTABLE_PROT_DEVICE; - else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC)) - prot |= KVM_PGTABLE_PROT_X; + /* + * When FWB is unsupported KVM needs to do cache flushes + * (via dcache_clean_inval_poc()) of the underlying memory. This is + * only possible if the memory is already mapped into the kernel map + * at the usual spot. + * + * Validate that there is a struct page for the PFN which maps + * to the KVA that the flushing code expects. + */ + if (!stage2_has_fwb(pgt) && + !(pfn_valid(pfn) && + page_to_virt(pfn_to_page(pfn)) == kvm_host_va(PFN_PHYS(pfn)))) { + ret = -EINVAL; + goto out_unlock; + } /* * Under the premise of getting a FSC_PERM fault, we just need to relax -- 2.17.1 ^ permalink raw reply related [flat|nested] 55+ messages in thread
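For readers following the diff above: the table in the commit message
falls directly out of the arm64 pgprot helpers, which only rewrite the
AttrIndx field that mapping_type() reads back. A sketch of the same
classification written as a table (the helper name is made up and is
not part of the patch):

/*
 * Sketch only: equivalent to the three-way check added to
 * user_mem_abort() above, expressed as a switch over AttrIndx values.
 */
static bool vma_is_device_mapping(struct vm_area_struct *vma)
{
	switch (mapping_type(vma->vm_page_prot)) {
	case MT_DEVICE_nGnRnE:		/* pgprot_noncached()    */
	case MT_DEVICE_nGnRE:		/* pgprot_device()       */
	case MT_NORMAL_NC:		/* pgprot_writecombine() */
		return true;
	case MT_NORMAL_TAGGED:		/* pgprot_tagged()       */
	case MT_NORMAL:			/* default cacheable     */
	default:
		return false;
	}
}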
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita @ 2023-09-07 19:12 ` Jason Gunthorpe 2023-10-05 16:15 ` Catalin Marinas 2023-10-23 13:20 ` Shameerali Kolothum Thodi 2 siblings, 0 replies; 55+ messages in thread From: Jason Gunthorpe @ 2023-09-07 19:12 UTC (permalink / raw) To: ankita Cc: maz, oliver.upton, catalin.marinas, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > Currently KVM determines if a VMA is pointing at IO memory by checking > pfn_is_map_memory(). However, the MM already gives us a way to tell what > kind of memory it is by inspecting the VMA. > > Replace pfn_is_map_memory() with a check on the VMA pgprot to determine if > the memory is IO and thus needs stage-2 device mapping. > > The VMA's pgprot is tested to determine the memory type with the > following mapping: > > pgprot_noncached MT_DEVICE_nGnRnE device > pgprot_writecombine MT_NORMAL_NC device > pgprot_device MT_DEVICE_nGnRE device > pgprot_tagged MT_NORMAL_TAGGED RAM > > This patch solves a problems where it is possible for the kernel to > have VMAs pointing at cachable memory without causing > pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL > devices. This memory is now properly marked as cachable in KVM. > > Unfortunately when FWB is not enabled, the kernel expects to naively do > cache management by flushing the memory using an address in the > kernel's map. This does not work in several of the newly allowed > cases such as dcache_clean_inval_poc(). Check whether the targeted pfn > and its mapping KVA is valid in case the FWB is absent before > continuing. Looking at this more closely, it relies on kvm_pte_follow() which ultimately calls __va(), and the ARM 64 version of page_to_virt() is: #define page_to_virt(x) ({ \ __typeof__(x) __page = x; \ void *__addr = __va(page_to_phys(__page)); \ (void *)__tag_set((const void *)__addr, page_kasan_tag(__page));\ }) So we don't actually have an issue here, anything with a struct page will be properly handled by KVM. Thus this can just be: if (!stage2_has_fwb(pgt) && !pfn_valid(pfn))) Then the last paragraph of the commit message is: As cachable vs device memory is now determined by the VMA adjust the other checks to match their purpose. In most cases the code needs to know if there is a struct page, or if the memory is in the kernel map and pfn_valid() is an appropriate API for this. Note when FWB is not enabled, the kernel expects to trivially do cache management by flushing the memory by linearly converting a kvm_pte to phys_addr to a KVA, see kvm_flush_dcache_to_poc(). This is only possibile for struct page backed memory. Do not allow non-struct page memory to be cachable without FWB. > @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > gfn = fault_ipa >> PAGE_SHIFT; > mte_allowed = kvm_vma_mte_allowed(vma); > > + /* > + * Figure out the memory type based on the user va mapping properties > + * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using > + * pgprot_device() and pgprot_noncached() respectively. 
> + */ > + if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) || > + (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) || > + (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC)) > + prot |= KVM_PGTABLE_PROT_DEVICE; > + else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC)) > + prot |= KVM_PGTABLE_PROT_X; > + > /* Don't use the VMA after the unlock -- it may have vanished */ > vma = NULL; > > @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > if (exec_fault) > prot |= KVM_PGTABLE_PROT_X; You still didn't remove the kvm_is_device_pfn() check from this code, I don't think it can really stay and make any sense.. Probably this: if (exec_fault && (prot & KVM_PGTABLE_PROT_DEVICE)) return -ENOEXEC; And these two should also be pfn_valid() [precompute pfn_valid] if (vma_pagesize == PAGE_SIZE && !(force_pte || !pfn_valid(pte))) { if (fault_status != ESR_ELx_FSC_PERM && pfn_valid(pte) && kvm_has_mte(kvm)) { Makes sense? Check if kvm_is_device_pfn() can be removed entirely. Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
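Pulled together, the shape suggested in the review above is roughly the
following (a sketch assembled from the comments, untested and not part
of the posted patch):

/* Cacheable stage 2 without FWB requires struct-page backed memory */
if (!stage2_has_fwb(pgt) && !pfn_valid(pfn)) {
	ret = -EINVAL;
	goto out_unlock;
}

/* Replaces the kvm_is_device_pfn() based exec check */
if (exec_fault && (prot & KVM_PGTABLE_PROT_DEVICE))
	return -ENOEXEC;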
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita 2023-09-07 19:12 ` Jason Gunthorpe @ 2023-10-05 16:15 ` Catalin Marinas 2023-10-05 16:54 ` Jason Gunthorpe 2023-10-23 13:20 ` Shameerali Kolothum Thodi 2 siblings, 1 reply; 55+ messages in thread From: Catalin Marinas @ 2023-10-05 16:15 UTC (permalink / raw) To: ankita Cc: jgg, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > Currently KVM determines if a VMA is pointing at IO memory by checking > pfn_is_map_memory(). However, the MM already gives us a way to tell what > kind of memory it is by inspecting the VMA. Well, it doesn't. It tells us what attributes the user mapped that memory with, not whether it's I/O memory or standard RAM. > Replace pfn_is_map_memory() with a check on the VMA pgprot to determine if > the memory is IO and thus needs stage-2 device mapping. > > The VMA's pgprot is tested to determine the memory type with the > following mapping: > > pgprot_noncached MT_DEVICE_nGnRnE device > pgprot_writecombine MT_NORMAL_NC device > pgprot_device MT_DEVICE_nGnRE device > pgprot_tagged MT_NORMAL_TAGGED RAM I would move the second patch to be the first, we could even merge that independently as it is about relaxing the stage 2 mapping to Normal NC. It would make it simpler I think to reason about the second patch which further relaxes the stage 2 mapping to Normal Cacheable under certain conditions. > This patch solves a problems where it is possible for the kernel to > have VMAs pointing at cachable memory without causing > pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL > devices. This memory is now properly marked as cachable in KVM. > > Unfortunately when FWB is not enabled, the kernel expects to naively do > cache management by flushing the memory using an address in the > kernel's map. This does not work in several of the newly allowed > cases such as dcache_clean_inval_poc(). Check whether the targeted pfn > and its mapping KVA is valid in case the FWB is absent before continuing. I would only allow cacheable stage 2 mappings if FWB is enabled. Otherwise we end up with a mismatch between the VMM mapping and whatever the guest may do. > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > --- > arch/arm64/include/asm/kvm_pgtable.h | 8 ++++++ > arch/arm64/kvm/hyp/pgtable.c | 2 +- > arch/arm64/kvm/mmu.c | 40 +++++++++++++++++++++++++--- > 3 files changed, 45 insertions(+), 5 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h > index d3e354bb8351..0579dbe958b9 100644 > --- a/arch/arm64/include/asm/kvm_pgtable.h > +++ b/arch/arm64/include/asm/kvm_pgtable.h > @@ -430,6 +430,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable *pgt, u64 addr, u64 size); > */ > u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift); > > +/** > + * stage2_has_fwb() - Determine whether FWB is supported > + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*() > + * > + * Return: True if FWB is supported. > + */ > +bool stage2_has_fwb(struct kvm_pgtable *pgt); > + > /** > * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 PGD > * @vtcr: Content of the VTCR register. 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c > index f155b8c9e98c..ccd291b6893d 100644 > --- a/arch/arm64/kvm/hyp/pgtable.c > +++ b/arch/arm64/kvm/hyp/pgtable.c > @@ -662,7 +662,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift) > return vtcr; > } > > -static bool stage2_has_fwb(struct kvm_pgtable *pgt) > +bool stage2_has_fwb(struct kvm_pgtable *pgt) > { > if (!cpus_have_const_cap(ARM64_HAS_STAGE2_FWB)) > return false; > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index 482280fe22d7..79f1caaa08a0 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -1391,6 +1391,15 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma) > return vma->vm_flags & VM_MTE_ALLOWED; > } > > +/* > + * Determine the memory region cacheability from VMA's pgprot. This > + * is used to set the stage 2 PTEs. > + */ > +static unsigned long mapping_type(pgprot_t page_prot) > +{ > + return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot)); > +} > + > static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > struct kvm_memory_slot *memslot, unsigned long hva, > unsigned long fault_status) > @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > gfn = fault_ipa >> PAGE_SHIFT; > mte_allowed = kvm_vma_mte_allowed(vma); > > + /* > + * Figure out the memory type based on the user va mapping properties > + * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using > + * pgprot_device() and pgprot_noncached() respectively. > + */ > + if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) || > + (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) || > + (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC)) > + prot |= KVM_PGTABLE_PROT_DEVICE; > + else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC)) > + prot |= KVM_PGTABLE_PROT_X; Does this mean that we can end up with some I/O memory also mapped as executable? Is there a use-case (e.g. using CXL memory as standard guest RAM, executable)? > + > /* Don't use the VMA after the unlock -- it may have vanished */ > vma = NULL; > > @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > if (exec_fault) > prot |= KVM_PGTABLE_PROT_X; > > - if (device) > - prot |= KVM_PGTABLE_PROT_DEVICE; > - else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC)) > - prot |= KVM_PGTABLE_PROT_X; > + /* > + * When FWB is unsupported KVM needs to do cache flushes > + * (via dcache_clean_inval_poc()) of the underlying memory. This is > + * only possible if the memory is already mapped into the kernel map > + * at the usual spot. > + * > + * Validate that there is a struct page for the PFN which maps > + * to the KVA that the flushing code expects. > + */ > + if (!stage2_has_fwb(pgt) && > + !(pfn_valid(pfn) && > + page_to_virt(pfn_to_page(pfn)) == kvm_host_va(PFN_PHYS(pfn)))) { > + ret = -EINVAL; > + goto out_unlock; > + } My preference would be to keep most of the current logic (including pfn_is_map_memory()) but force stage 2 cacheable for this page if the user vma_page_prot is MT_NORMAL or MT_NORMAL_TAGGED and we have FWB. It might be seen as an ABI change but I don't think it matters, it mostly brings cacheable I/O mem mappings in line with standard RAM (bar the exec permission unless there is a use-case for it). -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
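The alternative Catalin suggests above could look roughly like this
(sketch only, reusing mapping_type() and stage2_has_fwb() from the
patch; not the posted code):

device = !pfn_is_map_memory(pfn);

/* Upgrade to cacheable stage 2 only for Normal VMM mappings with FWB */
if (device && stage2_has_fwb(pgt) &&
    (mapping_type(vma->vm_page_prot) == MT_NORMAL ||
     mapping_type(vma->vm_page_prot) == MT_NORMAL_TAGGED))
	device = false;

if (device)
	prot |= KVM_PGTABLE_PROT_DEVICE;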
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-10-05 16:15 ` Catalin Marinas @ 2023-10-05 16:54 ` Jason Gunthorpe 2023-10-10 14:25 ` Catalin Marinas 0 siblings, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-10-05 16:54 UTC (permalink / raw) To: Catalin Marinas Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote: > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote: > > From: Ankit Agrawal <ankita@nvidia.com> > > > > Currently KVM determines if a VMA is pointing at IO memory by checking > > pfn_is_map_memory(). However, the MM already gives us a way to tell what > > kind of memory it is by inspecting the VMA. > > Well, it doesn't. It tells us what attributes the user mapped that > memory with, not whether it's I/O memory or standard RAM. There is VM_IO which is intended to be used for address space with side effects. And there is VM_PFNMAP which is intended to be used for address space without struct page (IO or not) And finally we have the pgprot bit which define the cachability. Do you have a definition of IO memory that those three things don't cover? I would propose that, for KVM's purpose, IO memory is marked with VM_IO or a non-cachable pgprot And "standard RAM" is defined by a cachable pgprot. Linux never makes something that is VM_IO cachable. ? > I would move the second patch to be the first, we could even merge that > independently as it is about relaxing the stage 2 mapping to Normal NC. > It would make it simpler I think to reason about the second patch which > further relaxes the stage 2 mapping to Normal Cacheable under certain > conditions. Make sense > > Unfortunately when FWB is not enabled, the kernel expects to naively do > > cache management by flushing the memory using an address in the > > kernel's map. This does not work in several of the newly allowed > > cases such as dcache_clean_inval_poc(). Check whether the targeted pfn > > and its mapping KVA is valid in case the FWB is absent before continuing. > > I would only allow cacheable stage 2 mappings if FWB is enabled. > Otherwise we end up with a mismatch between the VMM mapping and whatever > the guest may do. Does it need to be stronger? If FWB is disabled and the cache flush works then what is the issue? > > + /* > > + * Figure out the memory type based on the user va mapping properties > > + * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using > > + * pgprot_device() and pgprot_noncached() respectively. > > + */ > > + if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) || > > + (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) || > > + (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC)) > > + prot |= KVM_PGTABLE_PROT_DEVICE; > > + else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC)) > > + prot |= KVM_PGTABLE_PROT_X; > > Does this mean that we can end up with some I/O memory also mapped as > executable? Yes. We don't have cachable memory with side effects in Linux? > Is there a use-case (e.g. using CXL memory as standard guest > RAM, executable)? Certainly. 
> > + > > /* Don't use the VMA after the unlock -- it may have vanished */ > > vma = NULL; > > > > @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > > if (exec_fault) > > prot |= KVM_PGTABLE_PROT_X; > > > > - if (device) > > - prot |= KVM_PGTABLE_PROT_DEVICE; > > - else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC)) > > - prot |= KVM_PGTABLE_PROT_X; > > + /* > > + * When FWB is unsupported KVM needs to do cache flushes > > + * (via dcache_clean_inval_poc()) of the underlying memory. This is > > + * only possible if the memory is already mapped into the kernel map > > + * at the usual spot. > > + * > > + * Validate that there is a struct page for the PFN which maps > > + * to the KVA that the flushing code expects. > > + */ > > + if (!stage2_has_fwb(pgt) && > > + !(pfn_valid(pfn) && > > + page_to_virt(pfn_to_page(pfn)) == kvm_host_va(PFN_PHYS(pfn)))) { > > + ret = -EINVAL; > > + goto out_unlock; > > + } > > My preference would be to keep most of the current logic (including > pfn_is_map_memory()) Why? I think this pfn_is_map_memory() is actually not desired, it doesn't deal with modern memory hotplug or pgmap stuff? Isn't pfn_valid() more appropriate? > as an ABI change but I don't think it matters, it mostly brings > cacheable I/O mem mappings in line with standard RAM (bar the exec > permission unless there is a use-case for it). I would discourage the concept of "cacheable I/O mem mappings". Cachable memory located on a NUMA node close to the CPU should have exactly the same treatement as cachable memory located on a NUMA node distant from the CPU. I think when you say "cachable I/O memory" it really just means normal memory that lives on a NUMA node that is located on an IO device. At the KVM level we don't care about the NUMA locality, we only care if it is normal cachable system memory. I think there are two issues here. 1) KVM uses pfn_is_map_memory() which does not cover all our modern NUMA and memory hotplug cases for normal struct page backed cachable memory. 2) KVM doesn't work with normal cachable memory that does not have struct pages. For 1 the test should be 'does the pfn have a struct page, does the struct page refer to cachable memory?' For 2 the test should be 'does the VMA have pgprot = cachable, VM_PFNMAP and not VM_IO (both implied)' Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
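A sketch of the two tests Jason describes above (the helper name is
made up; untested):

static bool fault_is_normal_memory(struct vm_area_struct *vma, kvm_pfn_t pfn)
{
	bool cacheable = mapping_type(vma->vm_page_prot) == MT_NORMAL ||
			 mapping_type(vma->vm_page_prot) == MT_NORMAL_TAGGED;

	if (!cacheable || (vma->vm_flags & VM_IO))
		return false;

	/* Case 1: struct-page backed cachable memory (boot or hotplugged) */
	if (pfn_valid(pfn))
		return true;

	/* Case 2: cachable VM_PFNMAP memory without struct pages */
	return !!(vma->vm_flags & VM_PFNMAP);
}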
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-10-05 16:54 ` Jason Gunthorpe @ 2023-10-10 14:25 ` Catalin Marinas 2023-10-10 15:05 ` Jason Gunthorpe 0 siblings, 1 reply; 55+ messages in thread From: Catalin Marinas @ 2023-10-10 14:25 UTC (permalink / raw) To: Jason Gunthorpe Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote: > On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote: > > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote: > > > From: Ankit Agrawal <ankita@nvidia.com> > > > Currently KVM determines if a VMA is pointing at IO memory by checking > > > pfn_is_map_memory(). However, the MM already gives us a way to tell what > > > kind of memory it is by inspecting the VMA. > > > > Well, it doesn't. It tells us what attributes the user mapped that > > memory with, not whether it's I/O memory or standard RAM. > > There is VM_IO which is intended to be used for address space with > side effects. > > And there is VM_PFNMAP which is intended to be used for address space > without struct page (IO or not) > > And finally we have the pgprot bit which define the cachability. > > Do you have a definition of IO memory that those three things don't > cover? > > I would propose that, for KVM's purpose, IO memory is marked with > VM_IO or a non-cachable pgprot > > And "standard RAM" is defined by a cachable pgprot. Linux never makes > something that is VM_IO cachable. I think we can safely set a stage 2 Normal NC for a vma with pgprot other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is not that simple. Just because the VMM was allowed to map it as cacheable does not mean that it supports all the CPU features. One example is MTE where we can only guarantee that the RAM given to the OS at boot supports tagged accesses. I've seen something similar in the past with LSE atomics (or was it exclusives?) not being propagated. These don't make the memory safe for a guest to use as general purpose RAM. I don't have a nice solution, it looks more like the host kernel being able to trust what the VMM maps and gives to a guest (or we map everything as Device at Stage 2 as we currently do). An alternative would be for the host to know which physical address ranges support which attributes and ignore the vma but not sure we have such information in the ACPI tables (we can make something up for DT). > > > Unfortunately when FWB is not enabled, the kernel expects to naively do > > > cache management by flushing the memory using an address in the > > > kernel's map. This does not work in several of the newly allowed > > > cases such as dcache_clean_inval_poc(). Check whether the targeted pfn > > > and its mapping KVA is valid in case the FWB is absent before continuing. > > > > I would only allow cacheable stage 2 mappings if FWB is enabled. > > Otherwise we end up with a mismatch between the VMM mapping and whatever > > the guest may do. > > Does it need to be stronger? If FWB is disabled and the cache flush > works then what is the issue? I was thinking more about keeping things simpler and avoid any lack of coherence between the VMM and the VM, in case the latter maps it as Normal NC. But if the VMM doesn't touch it, the initial cache maintenance by the KVM would do. > I think there are two issues here. 
> > 1) KVM uses pfn_is_map_memory() which does not cover all our modern > NUMA and memory hotplug cases for normal struct page backed cachable > memory. > > 2) KVM doesn't work with normal cachable memory that does not have > struct pages. > > For 1 the test should be 'does the pfn have a struct page, does the > struct page refer to cachable memory?' > > For 2 the test should be 'does the VMA have pgprot = cachable, > VM_PFNMAP and not VM_IO (both implied)' See above on the characteristics of the memory. If some of them are not supported, it's probably fine (atomics not working) but others like MTE accesses could either cause external aborts or access random addresses from elsewhere. Currently we rely on pfn_is_map_memory() for this but we need a way to tell that other ranges outside the initial RAM supports all features. IOW, is any of this memory (mapped as cacheable in the VMM) special purpose with only a subset of the CPU features supported? -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-10-10 14:25 ` Catalin Marinas @ 2023-10-10 15:05 ` Jason Gunthorpe 2023-10-10 17:19 ` Catalin Marinas 0 siblings, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-10-10 15:05 UTC (permalink / raw) To: Catalin Marinas Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Tue, Oct 10, 2023 at 03:25:22PM +0100, Catalin Marinas wrote: > On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote: > > On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote: > > > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote: > > > > From: Ankit Agrawal <ankita@nvidia.com> > > > > Currently KVM determines if a VMA is pointing at IO memory by checking > > > > pfn_is_map_memory(). However, the MM already gives us a way to tell what > > > > kind of memory it is by inspecting the VMA. > > > > > > Well, it doesn't. It tells us what attributes the user mapped that > > > memory with, not whether it's I/O memory or standard RAM. > > > > There is VM_IO which is intended to be used for address space with > > side effects. > > > > And there is VM_PFNMAP which is intended to be used for address space > > without struct page (IO or not) > > > > And finally we have the pgprot bit which define the cachability. > > > > Do you have a definition of IO memory that those three things don't > > cover? > > > > I would propose that, for KVM's purpose, IO memory is marked with > > VM_IO or a non-cachable pgprot > > > > And "standard RAM" is defined by a cachable pgprot. Linux never makes > > something that is VM_IO cachable. > > I think we can safely set a stage 2 Normal NC for a vma with pgprot > other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is > not that simple. Just because the VMM was allowed to map it as cacheable > does not mean that it supports all the CPU features. One example is MTE > where we can only guarantee that the RAM given to the OS at boot > supports tagged accesses. Is there a use case to supply the VMM with cachable memory that is not full featured? At least the vfio cases I am aware of do not actually want to do this and would probably like the arch to prevent these corner cases upfront. > I've seen something similar in the past with > LSE atomics (or was it exclusives?) not being propagated. These don't > make the memory safe for a guest to use as general purpose RAM. At least from a mm perspective, I think it is important that cachable struct pages are all the same and all interchangable. If the arch cannot provide this it should not allow the pgmap/memremap to succeed at all. Otherwise drivers using these new APIs are never going to work fully right.. That leaves open the question of what to do with a cachable VM_PFNMAP, but if the arch can deal with the memremap issue then it seems like it can use the same logic when inspecting the VMA contents? > I was thinking more about keeping things simpler and avoid any lack of > coherence between the VMM and the VM, in case the latter maps it as > Normal NC. But if the VMM doesn't touch it, the initial cache > maintenance by the KVM would do. I see, in general we usually don't have use cases for the VMM touching the vfio memory as it is hard to make that secure. I would like it if that memory could be a FD instead of a VMA. Maybe someday.. > See above on the characteristics of the memory. 
If some of them are not > supported, it's probably fine (atomics not working) but others like MTE > accesses could either cause external aborts or access random addresses > from elsewhere. Currently we rely on pfn_is_map_memory() for this but we > need a way to tell that other ranges outside the initial RAM supports > all features. Indeed, I did not realize this about arm. See above on my note that at the mm level we should not have different types of cachable struct pages. So I think the right thing to do is fix that and still rely on the struct page test here in kvm. > IOW, is any of this memory (mapped as cacheable in the > VMM) special purpose with only a subset of the CPU features supported? At least the use cases I am interested in has the memory be functionally indistinguishable from boot time DDR. Cachable memory that does not function the same as DDR is something else, maybe that is your "cachable IO memory" concept. I don't have a use case for it. If we do decide to create it then it likely should be tagged as MEMORY_DEVICE_PCI_P2PDMA in the mm side and not get into any code paths that assume DDR. Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
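For reference, a minimal sketch of the memremap_pages() entry point
being discussed (the device, physical range and pagemap type are
placeholders):

#include <linux/memremap.h>

static struct dev_pagemap pgmap;

static void *hypothetical_add_device_memory(struct device *dev,
					    phys_addr_t base, size_t size)
{
	pgmap.type = MEMORY_DEVICE_GENERIC;
	pgmap.range.start = base;
	pgmap.range.end = base + size - 1;
	pgmap.nr_range = 1;

	/* Hotplugs ZONE_DEVICE struct pages covering the range */
	return devm_memremap_pages(dev, &pgmap);
}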
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-10-10 15:05 ` Jason Gunthorpe @ 2023-10-10 17:19 ` Catalin Marinas 2023-10-10 18:23 ` Jason Gunthorpe 0 siblings, 1 reply; 55+ messages in thread From: Catalin Marinas @ 2023-10-10 17:19 UTC (permalink / raw) To: Jason Gunthorpe Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Tue, Oct 10, 2023 at 12:05:02PM -0300, Jason Gunthorpe wrote: > On Tue, Oct 10, 2023 at 03:25:22PM +0100, Catalin Marinas wrote: > > On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote: > > > On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote: > > > > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote: > > > > > From: Ankit Agrawal <ankita@nvidia.com> > > > > > Currently KVM determines if a VMA is pointing at IO memory by checking > > > > > pfn_is_map_memory(). However, the MM already gives us a way to tell what > > > > > kind of memory it is by inspecting the VMA. > > > > > > > > Well, it doesn't. It tells us what attributes the user mapped that > > > > memory with, not whether it's I/O memory or standard RAM. > > > > > > There is VM_IO which is intended to be used for address space with > > > side effects. > > > > > > And there is VM_PFNMAP which is intended to be used for address space > > > without struct page (IO or not) > > > > > > And finally we have the pgprot bit which define the cachability. > > > > > > Do you have a definition of IO memory that those three things don't > > > cover? > > > > > > I would propose that, for KVM's purpose, IO memory is marked with > > > VM_IO or a non-cachable pgprot > > > > > > And "standard RAM" is defined by a cachable pgprot. Linux never makes > > > something that is VM_IO cachable. > > > > I think we can safely set a stage 2 Normal NC for a vma with pgprot > > other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is > > not that simple. Just because the VMM was allowed to map it as cacheable > > does not mean that it supports all the CPU features. One example is MTE > > where we can only guarantee that the RAM given to the OS at boot > > supports tagged accesses. > > Is there a use case to supply the VMM with cachable memory that is not > full featured? At least the vfio cases I am aware of do not actually > want to do this and would probably like the arch to prevent these > corner cases upfront. The MTE case is the problematic one here. On a data access, the interconnect shifts (right) the physical address and adds an offset. The resulting address is used to access tags. Such shift+offset is configured by firmware at boot and normally only covers the default memory. If there's some memory on PCIe, it's very unlikely to be covered and we can't tell whether it simply drops such tag accesses or makes up some random address that may or may not hit an existing memory or device. We don't currently have a way to describe this in ACPI tables (there were talks about describing special purpose memory, I lost track of the progress) and the way MTE was first designed doesn't allow a hypervisor to prevent the guest from generating a tagged access (other than mapping the memory as non-cacheable at Stage 2). This has been fixed in newer architecture versions but we haven't added Linux support for it yet (and there's no hardware available either). 
AFAIK, there's no MTE support for CXL-attached memory at the moment in the current interconnects, so better not pretend it's general purpose memory that supports all the features. Other than preventing malicious guest behaviour, it depends what the VM needs cacheable access for: some GPU memory that's only used for sharing data and we don't need all features or general purpose memory that a VM can use to run applications from etc. The former may not need all the features (e.g. can skip exclusives) but the latter does. We can probably work around the MTE case by only allowing cacheable Stage 2 if MTE is disabled for the guest or FEAT_MTE_PERM is implemented/supported (TBD when we'll add this). For the other cases, it would be up to the VMM how it presents the mapping to the guest (device pass-through or general purpose memory). > > I've seen something similar in the past with > > LSE atomics (or was it exclusives?) not being propagated. These don't > > make the memory safe for a guest to use as general purpose RAM. > > At least from a mm perspective, I think it is important that cachable > struct pages are all the same and all interchangable. If the arch > cannot provide this it should not allow the pgmap/memremap to succeed > at all. Otherwise drivers using these new APIs are never going to work > fully right.. Yes, for struct page backed memory, the current assumption is that all are the same, support all CPU features. It's the PFN-based memory where we don't have such guarantees. > That leaves open the question of what to do with a cachable VM_PFNMAP, > but if the arch can deal with the memremap issue then it seems like it > can use the same logic when inspecting the VMA contents? In the MTE case, memremap() never returns a Normal Tagged mapping and it would not map it in user-space as PROT_MTE either. But if a page is not mmap(PROT_MTE) (or VM_MTE in vma->flags) in the VMM, it doesn't mean the guest should not be allowed to use MTE. Qemu for example doesn't map the guest memory with mmap(PROT_MTE) since it doesn't have a need for it but the guest can enable MTE independently (well, if enabled at the vCPU level). We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed by struct page. We could probe that in KVM and either fall back to non-cacheable or allow cacheable if MTE is disable at the vCPU level. -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-10-10 17:19 ` Catalin Marinas @ 2023-10-10 18:23 ` Jason Gunthorpe 2023-10-11 17:45 ` Catalin Marinas 0 siblings, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-10-10 18:23 UTC (permalink / raw) To: Catalin Marinas Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Tue, Oct 10, 2023 at 06:19:14PM +0100, Catalin Marinas wrote: > This has been fixed in newer architecture versions but we haven't > added Linux support for it yet (and there's no hardware available > either). AFAIK, there's no MTE support for CXL-attached memory at > the moment in the current interconnects, so better not pretend it's > general purpose memory that supports all the features. I don't know much about MTE, but the use case imagined for CXL memory allows the MM to exchange any system page with a CXL page. So there cannot be a behavioral difference. Can usespace continue to use tagged pointers even if the mm has moved the pages to CXL that doesn't support it? The main purpose for this is to tier VM memory, so having CXL behaviorally different in a VM seems fatal to me. Linux drivers need a way to understand this, like we can't have a CXL memory pool driver or GPU driver calling memremap_pages() and getting a somewhat broken system because of MTE incompatibilities. So maybe ARM really should block memremap_pages() in case of MTE until everything is resolved? From the mm perspective we can't have two kinds of cachable struct pages running around that are functionally different. > Other than preventing malicious guest behaviour, it depends what the VM > needs cacheable access for: some GPU memory that's only used for sharing > data and we don't need all features or general purpose memory that a VM > can use to run applications from etc. The former may not need all the > features (e.g. can skip exclusives) but the latter does. Like CXL memory pooling GPU memory is used interchangably with every kind of DDR memory in every context. It must be completely transparent and interchangable via the mm's migration machinery. > > > I've seen something similar in the past with > > > LSE atomics (or was it exclusives?) not being propagated. These don't > > > make the memory safe for a guest to use as general purpose RAM. > > > > At least from a mm perspective, I think it is important that cachable > > struct pages are all the same and all interchangable. If the arch > > cannot provide this it should not allow the pgmap/memremap to succeed > > at all. Otherwise drivers using these new APIs are never going to work > > fully right.. > > Yes, for struct page backed memory, the current assumption is that all > are the same, support all CPU features. It's the PFN-based memory where > we don't have such guarantees. I see it got a bit confused, I am talking about memremap_pages() (ie include/linux/memremap.h), not memremap (include/linux/io.h) for getting non-struct page memory. It is confusing :| memremap_pages() is one of the entry points of the struct page hotplug machinery. Things like CXL drivers assume they can hot plug in new memory through these APIs and get new cachable struct pages that are functionally identical to boot time cachable struct pages. > We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed > by struct page. 
We could probe that in KVM and either fall back to > non-cacheable or allow cacheable if MTE is disable at the vCPU level. I'm not sure what this does, it is only set by shmem_map? That is much stricter than "mappings backed by struct page" Still, I'm not sure how to proceed here - we veered into this MTE stuff I don't know we have experiance with yet. Thanks, Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-10-10 18:23 ` Jason Gunthorpe @ 2023-10-11 17:45 ` Catalin Marinas 2023-10-11 18:38 ` Jason Gunthorpe 0 siblings, 1 reply; 55+ messages in thread From: Catalin Marinas @ 2023-10-11 17:45 UTC (permalink / raw) To: Jason Gunthorpe Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Tue, Oct 10, 2023 at 03:23:28PM -0300, Jason Gunthorpe wrote: > On Tue, Oct 10, 2023 at 06:19:14PM +0100, Catalin Marinas wrote: > > This has been fixed in newer architecture versions but we haven't > > added Linux support for it yet (and there's no hardware available > > either). AFAIK, there's no MTE support for CXL-attached memory at > > the moment in the current interconnects, so better not pretend it's > > general purpose memory that supports all the features. > > I don't know much about MTE, but the use case imagined for CXL memory > allows the MM to exchange any system page with a CXL page. So there > cannot be a behavioral difference. > > Can usespace continue to use tagged pointers even if the mm has moved > the pages to CXL that doesn't support it? That would lead to external aborts in the worst case or just losing tags in the best. But none of them ideal. > The main purpose for this is to tier VM memory, so having CXL > behaviorally different in a VM seems fatal to me. Yes, that's why if you can't tell the memory supports MTE, either it should not be given to a guest with KVM_CAP_ARM_MTE or MTE gets disabled in the guest. > Linux drivers need a way to understand this, like we can't have a CXL > memory pool driver or GPU driver calling memremap_pages() and getting > a somewhat broken system because of MTE incompatibilities. So maybe > ARM really should block memremap_pages() in case of MTE until > everything is resolved? I need to understand this part of the kernel a bit more but see below (and MTE as it currently stands doesn't play well with server-like systems, memory hotplug etc.) > From the mm perspective we can't have two kinds of cachable struct > pages running around that are functionally different. From a Linux+MTE perspective, what goes into ZONE_MOVABLE should work fine, all pages interchangeable (if it doesn't, the hardware is broken). These are added via the add_memory_resource() hotplug path. If a platform is known not to support this, it better not advertise MTE as a feature (the CPUs usually have some tie-off signal when the rest of the SoC cannot handle MTE). We could claim it's a hardware erratum if it does. But for ZONE_DEVICE ranges, these are not guaranteed to support all the characteristics of the main RAM. I think that's what memremap_pages() gives us. I'm not too familiar with this part of the kernel but IIUC that falls under the HMM category, so not interchangeable with the normal RAM (hotplugged or not). > > Other than preventing malicious guest behaviour, it depends what the VM > > needs cacheable access for: some GPU memory that's only used for sharing > > data and we don't need all features or general purpose memory that a VM > > can use to run applications from etc. The former may not need all the > > features (e.g. can skip exclusives) but the latter does. > > Like CXL memory pooling GPU memory is used interchangably with every > kind of DDR memory in every context. It must be completely transparent > and interchangable via the mm's migration machinery. 
I don't see the mm code doing this but I haven't looked deep enough. At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel allocating ZONE_DEVICE pages and passing them to the user. > > > > I've seen something similar in the past with > > > > LSE atomics (or was it exclusives?) not being propagated. These don't > > > > make the memory safe for a guest to use as general purpose RAM. > > > > > > At least from a mm perspective, I think it is important that cachable > > > struct pages are all the same and all interchangable. If the arch > > > cannot provide this it should not allow the pgmap/memremap to succeed > > > at all. Otherwise drivers using these new APIs are never going to work > > > fully right.. > > > > Yes, for struct page backed memory, the current assumption is that all > > are the same, support all CPU features. It's the PFN-based memory where > > we don't have such guarantees. > > I see it got a bit confused, I am talking about memremap_pages() (ie > include/linux/memremap.h), not memremap (include/linux/io.h) for > getting non-struct page memory. It is confusing :| > > memremap_pages() is one of the entry points of the struct page hotplug > machinery. Things like CXL drivers assume they can hot plug in new > memory through these APIs and get new cachable struct pages that are > functionally identical to boot time cachable struct pages. We have two mechanisms, one in memremap.c and another in memory_hotplug.c. So far my assumption is that only the memory added by the latter ends up in ZONE_MOVABLE and can be used by the kernel as any of the ZONE_NORMAL RAM, transparently to the user. For ZONE_DEVICE allocations, one would have to explicitly mmap() it via a device fd. If a VMM wants to mmap() such GPU memory and give it to the guest as general purpose RAM, it should make sure it has all the characteristics as advertised by the CPU or disable certain features (if it can). Currently we don't have a way to tell what such memory supports (neither ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is that it doesn't. > > We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed > > by struct page. We could probe that in KVM and either fall back to > > non-cacheable or allow cacheable if MTE is disable at the vCPU level. > > I'm not sure what this does, it is only set by shmem_map? That is > much stricter than "mappings backed by struct page" This flag is similar to the VM_MAYWRITE etc. On an mmap(), the vma gets the VM_MTE_ALLOWED flag if the mapping is MAP_ANONYMOUS (see arch_calc_vm_flag_bits()) or the (driver) mmap function knows that the memory supports MTE and sets the flag explicitly. Currently that's only done in shmem_mmap() as we know where this memory is coming from. When the user wants an mmap(PROT_MTE), the arch code checks whether VM_MTE_ALLOWED is set on the vma before allowing tag accesses. Memory mapped from ZONE_DEVICE won't have such flag set, so mmap(PROT_MTE) will fail. But for KVM guests, there's no such mmap() call into the hypervisor. A guest can simply enable MTE at stage 1 without the hypervisor being able to tell. > Still, I'm not sure how to proceed here - we veered into this MTE > stuff I don't know we have experiance with yet. We veered mostly because on arm64 such GPU memory is not guaranteed to have all the characteristics of the generic RAM. I think only MTE is the dangerous one and it needs extra care but I wouldn't be surprised if we notice atomics failing. 
It looks like memremap_pages() also takes a memory type and I suspect it's only safe to map MEMORY_DEVICE_COHERENT into a guest (as generic RAM). Is there any sanity check on the host kernel side to allow VMM cacheable mappings only for such memory and not the other MEMORY_DEVICE_* types? Going back to KVM, we can relax to cacheable mapping at Stage 2 if the vma is cacheable and either VM_MTE_ALLOWED is set or KVM_CAP_ARM_MTE is disabled. From the earlier discussions, we can probably ignore VM_IO since we won't have a cacheable mapping with this flag. Not sure about VM_PFNMAP. -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
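The rule Catalin arrives at above could be expressed roughly as follows
(sketch; the helper name is an assumption, while kvm_has_mte() and
kvm_vma_mte_allowed() are the existing helpers):

static bool stage2_can_be_cacheable(struct kvm *kvm, struct vm_area_struct *vma)
{
	bool cacheable = mapping_type(vma->vm_page_prot) == MT_NORMAL ||
			 mapping_type(vma->vm_page_prot) == MT_NORMAL_TAGGED;

	/* Cacheable stage 2 only if MTE tags cannot be lost or corrupted */
	return cacheable && (kvm_vma_mte_allowed(vma) || !kvm_has_mte(kvm));
}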
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-10-11 17:45 ` Catalin Marinas @ 2023-10-11 18:38 ` Jason Gunthorpe 2023-10-12 16:16 ` Catalin Marinas 0 siblings, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-10-11 18:38 UTC (permalink / raw) To: Catalin Marinas Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Wed, Oct 11, 2023 at 06:45:52PM +0100, Catalin Marinas wrote: > > From the mm perspective we can't have two kinds of cachable struct > > pages running around that are functionally different. > > From a Linux+MTE perspective, what goes into ZONE_MOVABLE should work > fine, all pages interchangeable (if it doesn't, the hardware is > broken). Yes, and we imagine adding GPU and CXL memory to ZONE_MOVABLE (on a NUMA node) > These are added via the add_memory_resource() hotplug path. If a > platform is known not to support this, it better not advertise MTE as a > feature (the CPUs usually have some tie-off signal when the rest of the > SoC cannot handle MTE). We could claim it's a hardware erratum if it > does. Seems logical > But for ZONE_DEVICE ranges, these are not guaranteed to support all the > characteristics of the main RAM. I think that's what memremap_pages() > gives us. I'm not too familiar with this part of the kernel but IIUC > that falls under the HMM category, so not interchangeable with the > normal RAM (hotplugged or not). DAX pages use ZONE_DEVICE and they are cachable, and not "HMM". They are not fully interchangable, but they get into the page cache, they can back .data segements, they could be subject atomics/etc. So they should be fully functional like DDR. > I don't see the mm code doing this but I haven't looked deep enough. > At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel > allocating ZONE_DEVICE pages and passing them to the user. Not ZONE_DEVICE. One popular coherent GPU approach is to use ZONE_MOVABLE pages. > > > > > I've seen something similar in the past with > > > > > LSE atomics (or was it exclusives?) not being propagated. These don't > > > > > make the memory safe for a guest to use as general purpose RAM. > > > > > > > > At least from a mm perspective, I think it is important that cachable > > > > struct pages are all the same and all interchangable. If the arch > > > > cannot provide this it should not allow the pgmap/memremap to succeed > > > > at all. Otherwise drivers using these new APIs are never going to work > > > > fully right.. > > > > > > Yes, for struct page backed memory, the current assumption is that all > > > are the same, support all CPU features. It's the PFN-based memory where > > > we don't have such guarantees. > > > > I see it got a bit confused, I am talking about memremap_pages() (ie > > include/linux/memremap.h), not memremap (include/linux/io.h) for > > getting non-struct page memory. It is confusing :| > > > > memremap_pages() is one of the entry points of the struct page hotplug > > machinery. Things like CXL drivers assume they can hot plug in new > > memory through these APIs and get new cachable struct pages that are > > functionally identical to boot time cachable struct pages. > > We have two mechanisms, one in memremap.c and another in > memory_hotplug.c. Yes > So far my assumption is that only the memory added by > the latter ends up in ZONE_MOVABLE and can be used by the kernel as any > of the ZONE_NORMAL RAM, transparently to the user. 
Probably for now, yes. But CXL/GPU memory goes there too. > For ZONE_DEVICE allocations, one would have to explicitly mmap() it > via a device fd. Not quite for DAX, it gets in through the page cache, but it is still mmap on a FD and not anonymous memory. > If a VMM wants to mmap() such GPU memory and give it to the guest as > general purpose RAM, it should make sure it has all the characteristics > as advertised by the CPU or disable certain features (if it can). This is the VFIO flow we are talking about here, I think. PFNMAP memory that goes into a VM that is cachable. > Currently we don't have a way to tell what such memory supports (neither > ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is > that it doesn't. Indeed, but my GPU driver hot plugged it as ZONE_MOVABLE and my VFIO driver turned in into PFNMAP.. So these things seem incompatible. > > > We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed > > > by struct page. We could probe that in KVM and either fall back to > > > non-cacheable or allow cacheable if MTE is disable at the vCPU level. > > > > I'm not sure what this does, it is only set by shmem_map? That is > > much stricter than "mappings backed by struct page" > > This flag is similar to the VM_MAYWRITE etc. On an mmap(), the vma gets > the VM_MTE_ALLOWED flag if the mapping is MAP_ANONYMOUS (see > arch_calc_vm_flag_bits()) or the (driver) mmap function knows that the > memory supports MTE and sets the flag explicitly. Currently that's only > done in shmem_mmap() as we know where this memory is coming from. When > the user wants an mmap(PROT_MTE), the arch code checks whether > VM_MTE_ALLOWED is set on the vma before allowing tag accesses. > > Memory mapped from ZONE_DEVICE won't have such flag set, so > mmap(PROT_MTE) will fail. But for KVM guests, there's no such mmap() > call into the hypervisor. A guest can simply enable MTE at stage 1 > without the hypervisor being able to tell. Yes, so this is all safe for DAX usages, not safe for GPU CXL NUMA memory hotplug. :| > > Still, I'm not sure how to proceed here - we veered into this MTE > > stuff I don't know we have experiance with yet. > > We veered mostly because on arm64 such GPU memory is not guaranteed to > have all the characteristics of the generic RAM. I think only MTE is the > dangerous one and it needs extra care but I wouldn't be surprised if we > notice atomics failing. So, at least for the system this change is being tested on, the "pre-CXL" GPU memory is 100% interchangable with DDR memory. It is surfaced to the OS as ZONE_MOVABLE and it should work in VMs like this too. > It looks like memremap_pages() also takes a memory type and I suspect > it's only safe to map MEMORY_DEVICE_COHERENT into a guest (as generic > RAM). Is there any sanity check on the host kernel side to allow VMM > cacheable mappings only for such memory and not the other > MEMORY_DEVICE_* types? I guess it is this current KVM code we are discussing, it probably happens by the pfn_is_map_memory() check? > Going back to KVM, we can relax to cacheable mapping at Stage 2 if the > vma is cacheable and either VM_MTE_ALLOWED is set or KVM_CAP_ARM_MTE is > disabled. This seems logical to me, thanks > From the earlier discussions, we can probably ignore VM_IO > since we won't have a cacheable mapping with this flag. Not sure about > VM_PFNMAP. PFNMAP is the interesting one for VFIO, at least. Can we use the same reasoning that it will be !VM_MTE_ALLOWED and we can close the MTE discussion. 
Currently no VFIO driver doing cachable mappings has memory that is different from DDR memory. So this is sort of a theoretical discussion about future cachable HW that uses VFIO and does have such a non-uniformity. Maybe that HW should set VM_IO on its VFIO PFN map and obviously not use ZONE_MOVABLE? Where does that leave us for this patch? We check the VM_MTE_ALLOWED and check for ZONE_MOVABLE struct pages as one of the conditions for NORMAL? Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
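For illustration only (not code from this series): a VFIO-style mmap handler for such hypothetical non-uniform cachable hardware could mark its mapping VM_IO explicitly, roughly as below; the function name, the bar_pfn value and the pgprot_noncached() choice are all placeholders for whatever the real device would need.

static int example_vfio_mmap(struct vfio_device *vdev, struct vm_area_struct *vma)
{
        unsigned long bar_pfn = 0; /* hypothetical: BAR physical address >> PAGE_SHIFT */

        /* Advertise MMIO semantics: PFN map, no struct page exposed to the mm */
        vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        return remap_pfn_range(vma, vma->vm_start, bar_pfn + vma->vm_pgoff,
                               vma->vm_end - vma->vm_start, vma->vm_page_prot);
}

With the scheme discussed in this thread, KVM would then see VM_IO on the VMA and keep the stage-2 mapping device/non-cacheable for such a region.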
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-10-11 18:38 ` Jason Gunthorpe @ 2023-10-12 16:16 ` Catalin Marinas 2024-03-10 3:49 ` Ankit Agrawal 0 siblings, 1 reply; 55+ messages in thread From: Catalin Marinas @ 2023-10-12 16:16 UTC (permalink / raw) To: Jason Gunthorpe Cc: ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Wed, Oct 11, 2023 at 03:38:39PM -0300, Jason Gunthorpe wrote: > On Wed, Oct 11, 2023 at 06:45:52PM +0100, Catalin Marinas wrote: > > But for ZONE_DEVICE ranges, these are not guaranteed to support all the > > characteristics of the main RAM. I think that's what memremap_pages() > > gives us. I'm not too familiar with this part of the kernel but IIUC > > that falls under the HMM category, so not interchangeable with the > > normal RAM (hotplugged or not). > > DAX pages use ZONE_DEVICE and they are cachable, and not "HMM". > > They are not fully interchangable, but they get into the page cache, > they can back .data segements, they could be subject atomics/etc. So > they should be fully functional like DDR. Unfortunately the Arm architecture makes the distinction between "cacheable" and "cacheable tagged". We don't currently have any way of describing this in firmware tables, so we rely on the hardware or firmware not advertising MTE if such memory ends up as general purpose. That said, DAX mappings are safe since the vma would have VM_MTE_ALLOWED set, so no mmap(PROT_MTE) possible. > > I don't see the mm code doing this but I haven't looked deep enough. > > At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel > > allocating ZONE_DEVICE pages and passing them to the user. > > Not ZONE_DEVICE. One popular coherent GPU approach is to use > ZONE_MOVABLE pages. OK, so presumably the driver could tell on which architecture it is running and plug in the memory appropriately (or refuse to). It's a bit more arch knowledge in a (generic) driver that I'd like but we don't have a way to describe or probe this yet. Maybe a firmware config would just turn MTE off in this case (SCTLR_EL3.ATA=0 and some other chicken bit or tie-off for the ID registers not to advertise MTE). > > If a VMM wants to mmap() such GPU memory and give it to the guest as > > general purpose RAM, it should make sure it has all the characteristics > > as advertised by the CPU or disable certain features (if it can). > > This is the VFIO flow we are talking about here, I think. PFNMAP > memory that goes into a VM that is cachable. > > > Currently we don't have a way to tell what such memory supports (neither > > ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is > > that it doesn't. > > Indeed, but my GPU driver hot plugged it as ZONE_MOVABLE and my VFIO > driver turned in into PFNMAP.. So these things seem incompatible. So can we end up with the same pfn mapped in two different vmas, one backed by struct page and another with VM_PFNMAP (implying no struct page)? I don't know how this is supposed to work. Even if the memory supports MTE, we rely on the struct page to track the status of the tags (the PG_mte_tagged flag). We may get away with this if user_mem_abort() -> sanitise_mte_tags() -> pfn_to_page() finds a page structure but I'd actually prevent this path altogether if VM_PFNMAP (well, we may actually have a bug if we don't already do this). 
> > From the earlier discussions, we can probably ignore VM_IO > > since we won't have a cacheable mapping with this flag. Not sure about > > VM_PFNMAP. > > PFNMAP is the interesting one for VFIO, at least. Can we use the same > reasoning that it will be !VM_MTE_ALLOWED and we can close the MTE > discussion. > > Currently no VFIO driver is doing cachable that has memory that is > different from DDR memory. So this is sort of theoretical discussion > about future cachable HW that does use VFIO that does have a > non-uniformity. > > Maybe that HW should set VM_IO on its VFIO PFN map and obviously not > use ZONE_MOVABLE? I think we should only keep VM_IO for memory mapped I/O with side effects. Other ranges can be VM_PFNMAP if not backed by struct page. > Where does that leave us for this patch? We check the VM_MTE_ALLOWED > and check for ZONE_MOVABLE struct pages as one of the conditions for > NORMAL? I think we should keep it as simple as possible and, looking at it again, maybe even ignore vm_page_prot. Two questions though: 1. Does VM_IO imply vm_page_prot never having MT_NORMAL or MT_NORMAL_TAGGED? 2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and which end up in user_mem_abort()) imply VM_IO? If yes to both, I think something like below would do: mte_allowed = kvm_vma_mte_allowed(vma); noncacheable = false; // or 'device' as in user_mem_abort() ... if (vma->vm_flags & VM_IO) // replaces !pfn_is_map_memory() noncacheable = true; else if (!mte_allowed && kvm_has_mte()) noncacheable = true; ... if (noncacheable) prot |= KVM_PGTABLE_PROT_DEVICE; // or the new KVM_PGTABLE_PROT_NC mte_allowed would cover DAX mappings (and, who knows, some future DAX mapping may allow MTE and the driver explicitly set the flag). Anything else hot-plugged into ZONE_MOVABLE should have VM_MTE_ALLOWED set or MTE disabled altogether. -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
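Expanding the sketch above only to make the control flow concrete (this is not code from the series): kvm_vma_mte_allowed() is the helper added in patch 1/2, vcpu->kvm is what user_mem_abort() has at hand, and KVM_PGTABLE_PROT_NC is the hypothetical new flag mentioned in the comment.

        bool mte_allowed = kvm_vma_mte_allowed(vma);
        bool noncacheable = false;

        if (vma->vm_flags & VM_IO)              /* replaces !pfn_is_map_memory() */
                noncacheable = true;
        else if (kvm_has_mte(vcpu->kvm) && !mte_allowed)
                noncacheable = true;            /* no tag storage guarantee */

        if (noncacheable)
                prot |= KVM_PGTABLE_PROT_DEVICE;        /* or the new KVM_PGTABLE_PROT_NC */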
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-10-12 16:16 ` Catalin Marinas @ 2024-03-10 3:49 ` Ankit Agrawal 2024-03-19 13:38 ` Jason Gunthorpe 0 siblings, 1 reply; 55+ messages in thread From: Ankit Agrawal @ 2024-03-10 3:49 UTC (permalink / raw) To: Catalin Marinas, Jason Gunthorpe Cc: maz@kernel.org, oliver.upton@linux.dev, will@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org Bringing this to the fore. >> Where does that leave us for this patch? We check the VM_MTE_ALLOWED >> and check for ZONE_MOVABLE struct pages as one of the conditions for >> NORMAL? > > I think we should keep it as simple as possible and, looking at it > again, maybe even ignore vm_page_prot. Two questions though: > > 1. Does VM_IO imply vm_page_prot never having MT_NORMAL or > MT_NORMAL_TAGGED? > > 2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and > which end up in user_mem_abort()) imply VM_IO? > > If yes to both, I think something like below would do: How may we get the answer to these? It seems to be the logical behavior, but how can we confirm? While we discuss on that, I am considering to send out the next version of this patch (incorporating the feedbacks on the thread) that is rebased to a commit inclusive of the recently applied KVM patch series: [KVM: arm64: Allow the VM to select DEVICE_* and NORMAL_NC for IO memory] https://lore.kernel.org/all/20240224150546.368-1-ankita@nvidia.com/ > mte_allowed = kvm_vma_mte_allowed(vma); > noncacheable = false; // or 'device' as in user_mem_abort() > ... > if (vma->flags & VM_IO) // replaces !pfn_is_map_memory() > noncacheable = true; > else if (!mte_allowed && kvm_has_mte()) > noncaheable = true; > ... > if (noncacheable) > prot |= KVM_PGTABLE_PROT_DEVICE; // or the new KVM_PGTABLE_PROT_NC > > mte_allowed would cover DAX mappings (and, who knows, some future DAX > mapping may allow MTE and the driver explicitly set the flag). Anything > else hot-plugged into ZONE_MOVABLE should have VM_MTE_ALLOWED set or > MTE disabled altogether. -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2024-03-10 3:49 ` Ankit Agrawal @ 2024-03-19 13:38 ` Jason Gunthorpe 0 siblings, 0 replies; 55+ messages in thread From: Jason Gunthorpe @ 2024-03-19 13:38 UTC (permalink / raw) To: Ankit Agrawal Cc: Catalin Marinas, maz@kernel.org, oliver.upton@linux.dev, will@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Vikram Sethi, Andy Currid, Alistair Popple, John Hubbard, Dan Williams, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org On Sun, Mar 10, 2024 at 03:49:36AM +0000, Ankit Agrawal wrote: > Bringing this to the fore. > > >> Where does that leave us for this patch? We check the VM_MTE_ALLOWED > >> and check for ZONE_MOVABLE struct pages as one of the conditions for > >> NORMAL? > > > > I think we should keep it as simple as possible and, looking at it > > again, maybe even ignore vm_page_prot. Two questions though: > > > > 1. Does VM_IO imply vm_page_prot never having MT_NORMAL or > > MT_NORMAL_TAGGED? Drivers do crazy things, so I would not be surprised if something in the tree does this (wrong) thing. We can check the pgprot in the vma as well as the VM_IO to make the decision. > > 2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and > > which end up in user_mem_abort()) imply VM_IO? Yes, this is definately true for VFIO. Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
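As a sketch of "check the pgprot in the vma as well as the VM_IO" (illustrative only; the helper name is made up and the attribute lookup is the same one patch 1/2 uses in its mapping_type() helper):

static bool example_vma_is_cacheable(struct vm_area_struct *vma)
{
        switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
        case MT_DEVICE_nGnRnE:
        case MT_DEVICE_nGnRE:
        case MT_NORMAL_NC:
                return false;
        default:
                return true;
        }
}

        /* VM_IO or a non-cacheable stage-1 pgprot => device/NC stage-2 mapping */
        if ((vma->vm_flags & VM_IO) || !example_vma_is_cacheable(vma))
                prot |= KVM_PGTABLE_PROT_DEVICE;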
* RE: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA 2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita 2023-09-07 19:12 ` Jason Gunthorpe 2023-10-05 16:15 ` Catalin Marinas @ 2023-10-23 13:20 ` Shameerali Kolothum Thodi 2 siblings, 0 replies; 55+ messages in thread From: Shameerali Kolothum Thodi @ 2023-10-23 13:20 UTC (permalink / raw) To: ankita@nvidia.com, jgg@nvidia.com, maz@kernel.org, oliver.upton@linux.dev, catalin.marinas@arm.com, will@kernel.org Cc: aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com, targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com, apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, tiantao (H), linyufeng (A) Hi, > -----Original Message----- > From: ankita@nvidia.com [mailto:ankita@nvidia.com] > Sent: 07 September 2023 19:15 > To: ankita@nvidia.com; jgg@nvidia.com; maz@kernel.org; > oliver.upton@linux.dev; catalin.marinas@arm.com; will@kernel.org > Cc: aniketa@nvidia.com; cjia@nvidia.com; kwankhede@nvidia.com; > targupta@nvidia.com; vsethi@nvidia.com; acurrid@nvidia.com; > apopple@nvidia.com; jhubbard@nvidia.com; danw@nvidia.com; > linux-arm-kernel@lists.infradead.org; kvmarm@lists.linux.dev; > linux-kernel@vger.kernel.org > Subject: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA > > From: Ankit Agrawal <ankita@nvidia.com> > > Currently KVM determines if a VMA is pointing at IO memory by checking > pfn_is_map_memory(). However, the MM already gives us a way to tell what > kind of memory it is by inspecting the VMA. > > Replace pfn_is_map_memory() with a check on the VMA pgprot to > determine if > the memory is IO and thus needs stage-2 device mapping. > > The VMA's pgprot is tested to determine the memory type with the > following mapping: > > pgprot_noncached MT_DEVICE_nGnRnE device > pgprot_writecombine MT_NORMAL_NC device > pgprot_device MT_DEVICE_nGnRE device > pgprot_tagged MT_NORMAL_TAGGED RAM > > This patch solves a problems where it is possible for the kernel to > have VMAs pointing at cachable memory without causing > pfn_is_map_memory() to be true, eg DAX memremap cases and CXL/pre-CXL > devices. This memory is now properly marked as cachable in KVM. > > Unfortunately when FWB is not enabled, the kernel expects to naively do > cache management by flushing the memory using an address in the > kernel's map. This does not work in several of the newly allowed > cases such as dcache_clean_inval_poc(). Check whether the targeted pfn > and its mapping KVA is valid in case the FWB is absent before continuing. > > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > --- > arch/arm64/include/asm/kvm_pgtable.h | 8 ++++++ > arch/arm64/kvm/hyp/pgtable.c | 2 +- > arch/arm64/kvm/mmu.c | 40 > +++++++++++++++++++++++++--- > 3 files changed, 45 insertions(+), 5 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h > b/arch/arm64/include/asm/kvm_pgtable.h > index d3e354bb8351..0579dbe958b9 100644 > --- a/arch/arm64/include/asm/kvm_pgtable.h > +++ b/arch/arm64/include/asm/kvm_pgtable.h > @@ -430,6 +430,14 @@ u64 kvm_pgtable_hyp_unmap(struct kvm_pgtable > *pgt, u64 addr, u64 size); > */ > u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift); > > +/** > + * stage2_has_fwb() - Determine whether FWB is supported > + * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*() > + * > + * Return: True if FWB is supported. 
> + */ > +bool stage2_has_fwb(struct kvm_pgtable *pgt); > + > /** > * kvm_pgtable_stage2_pgd_size() - Helper to compute size of a stage-2 > PGD > * @vtcr: Content of the VTCR register. > diff --git a/arch/arm64/kvm/hyp/pgtable.c > b/arch/arm64/kvm/hyp/pgtable.c > index f155b8c9e98c..ccd291b6893d 100644 > --- a/arch/arm64/kvm/hyp/pgtable.c > +++ b/arch/arm64/kvm/hyp/pgtable.c > @@ -662,7 +662,7 @@ u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 > phys_shift) > return vtcr; > } > > -static bool stage2_has_fwb(struct kvm_pgtable *pgt) > +bool stage2_has_fwb(struct kvm_pgtable *pgt) > { > if (!cpus_have_const_cap(ARM64_HAS_STAGE2_FWB)) > return false; > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c > index 482280fe22d7..79f1caaa08a0 100644 > --- a/arch/arm64/kvm/mmu.c > +++ b/arch/arm64/kvm/mmu.c > @@ -1391,6 +1391,15 @@ static bool kvm_vma_mte_allowed(struct > vm_area_struct *vma) > return vma->vm_flags & VM_MTE_ALLOWED; > } > > +/* > + * Determine the memory region cacheability from VMA's pgprot. This > + * is used to set the stage 2 PTEs. > + */ > +static unsigned long mapping_type(pgprot_t page_prot) > +{ > + return FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot)); > +} > + > static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, > struct kvm_memory_slot *memslot, unsigned long hva, > unsigned long fault_status) > @@ -1490,6 +1499,18 @@ static int user_mem_abort(struct kvm_vcpu > *vcpu, phys_addr_t fault_ipa, > gfn = fault_ipa >> PAGE_SHIFT; > mte_allowed = kvm_vma_mte_allowed(vma); > > + /* > + * Figure out the memory type based on the user va mapping properties > + * Only MT_DEVICE_nGnRE and MT_DEVICE_nGnRnE will be set using > + * pgprot_device() and pgprot_noncached() respectively. > + */ > + if ((mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRE) || > + (mapping_type(vma->vm_page_prot) == MT_DEVICE_nGnRnE) || > + (mapping_type(vma->vm_page_prot) == MT_NORMAL_NC)) > + prot |= KVM_PGTABLE_PROT_DEVICE; > + else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC)) > + prot |= KVM_PGTABLE_PROT_X; > + > /* Don't use the VMA after the unlock -- it may have vanished */ > vma = NULL; > > @@ -1576,10 +1597,21 @@ static int user_mem_abort(struct kvm_vcpu > *vcpu, phys_addr_t fault_ipa, > if (exec_fault) > prot |= KVM_PGTABLE_PROT_X; > > - if (device) > - prot |= KVM_PGTABLE_PROT_DEVICE; > - else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC)) > - prot |= KVM_PGTABLE_PROT_X; > + /* > + * When FWB is unsupported KVM needs to do cache flushes > + * (via dcache_clean_inval_poc()) of the underlying memory. This is > + * only possible if the memory is already mapped into the kernel map > + * at the usual spot. > + * > + * Validate that there is a struct page for the PFN which maps > + * to the KVA that the flushing code expects. > + */ > + if (!stage2_has_fwb(pgt) && > + !(pfn_valid(pfn) && > + page_to_virt(pfn_to_page(pfn)) == > kvm_host_va(PFN_PHYS(pfn)))) { > + ret = -EINVAL; > + goto out_unlock; > + } I don't quite follow the above check. Does no FWB matters for Non-cacheable/device memory as well? From a quick check, it breaks a n/w dev assignment on a platform that doesn't have FWB. Qemu reports, error: kvm run failed Invalid argument PC=ffff800080a95ed0 X00=ffff0000c7ff8090 X01=ffff0000c7ff80a0 X02=00000000fe180000 X03=ffff800085327000 X04=0000000000000001 X05=0000000000000040 X06=000000000000003f X07=0000000000000000 X08=ffff0000be190000 X09=0000000000000000 X10=0000000000000040 Please let me know. 
Thanks, Shameer > > /* > * Under the premise of getting a FSC_PERM fault, we just need to relax > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 55+ messages in thread
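One way to read the report above is that the new check also fires for stage-2 device mappings, which never need the cache maintenance it guards; a purely illustrative narrowing, reusing only identifiers from the quoted patch, would skip it unless the mapping is going to be cachable:

        /*
         * Only cachable stage-2 mappings need KVM to do CMOs through the
         * kernel map when FWB is absent; device mappings of MMIO (where
         * pfn_valid() is false) can be left alone. Illustrative only.
         */
        if (!(prot & KVM_PGTABLE_PROT_DEVICE) &&
            !stage2_has_fwb(pgt) &&
            !(pfn_valid(pfn) &&
              page_to_virt(pfn_to_page(pfn)) == kvm_host_va(PFN_PHYS(pfn)))) {
                ret = -EINVAL;
                goto out_unlock;
        }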
* [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-07 18:14 [PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs ankita 2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita @ 2023-09-07 18:14 ` ankita 2023-09-08 16:40 ` Catalin Marinas ` (2 more replies) 1 sibling, 3 replies; 55+ messages in thread From: ankita @ 2023-09-07 18:14 UTC (permalink / raw) To: ankita, jgg, maz, oliver.upton, catalin.marinas, will Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel From: Ankit Agrawal <ankita@nvidia.com> Linux allows device drivers to map IO memory on a per-page basis using "write combining" or WC. This is often done using pgprot_writecombing(). The driver knows which pages can support WC access and the proper programming model to generate this IO. Generally the use case is to boost performance by using write combining to generate larger PCIe MemWr TLPs. Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for all IO memory. This puts the VM in charge of the memory attributes, and removes the KVM override to DEVICE_nGnRE. Ultimately this makes pgprot_writecombing() work correctly in VMs and allows drivers like mlx5 to fully operate their HW. After some discussions with ARM and CPU architects we reached the conclusion there was no need for KVM to prevent the VM from selecting between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear that NORMAL_NC could result in uncontained failures, but upon deeper analysis it turns out there are already possible cases for uncontained failures with DEVICE types too. Ultimately the platform must be implemented in a way that ensures that all DEVICE_* and NORMAL_NC accesses have no uncontained failures. Fortunately real platforms do tend to implement this. This patch makes the VM's memory attributes behave as follows: S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr> See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf for details. Signed-off-by: Ankit Agrawal <ankita@nvidia.com> --- arch/arm64/include/asm/memory.h | 2 ++ arch/arm64/kvm/hyp/pgtable.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index fde4186cc387..c247e5f29d5a 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -147,6 +147,7 @@ * Memory types for Stage-2 translation */ #define MT_S2_NORMAL 0xf +#define MT_S2_NORMAL_NC 0x5 #define MT_S2_DEVICE_nGnRE 0x1 /* @@ -154,6 +155,7 @@ * Stage-2 enforces Normal-WB and Device-nGnRE */ #define MT_S2_FWB_NORMAL 6 +#define MT_S2_FWB_NORMAL_NC 5 #define MT_S2_FWB_DEVICE_nGnRE 1 #ifdef CONFIG_ARM64_4K_PAGES diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c index ccd291b6893d..a80949002191 100644 --- a/arch/arm64/kvm/hyp/pgtable.c +++ b/arch/arm64/kvm/hyp/pgtable.c @@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p kvm_pte_t *ptep) { bool device = prot & KVM_PGTABLE_PROT_DEVICE; - kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) : + kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) : KVM_S2_MEMATTR(pgt, NORMAL); u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS; -- 2.17.1 ^ permalink raw reply related [flat|nested] 55+ messages in thread
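For context on the guest-side pattern the log refers to (not part of the patch): a write-combined BAR mapping in a guest driver is the usual ioremap_wc()/pgprot_writecombine() code, for example:

        /* In-kernel use: bar_start/bar_len are placeholders for
         * pci_resource_start()/pci_resource_len() of a prefetchable BAR. */
        void __iomem *wc_base = ioremap_wc(bar_start, bar_len);
        if (!wc_base)
                return -ENOMEM;

        /* Userspace-visible use, in a driver's ->mmap() handler: */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

With the current forced DEVICE_nGnRE stage-2 mapping, both of these degrade to Device inside the guest; with this patch the guest's choice takes effect.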
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita @ 2023-09-08 16:40 ` Catalin Marinas 2023-09-11 14:57 ` Lorenzo Pieralisi 2023-10-12 12:27 ` Will Deacon 2 siblings, 0 replies; 55+ messages in thread From: Catalin Marinas @ 2023-09-08 16:40 UTC (permalink / raw) To: ankita Cc: jgg, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel, Lorenzo Pieralisi + Lorenzo On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > Linux allows device drivers to map IO memory on a per-page basis using > "write combining" or WC. This is often done using > pgprot_writecombing(). The driver knows which pages can support WC > access and the proper programming model to generate this IO. Generally > the use case is to boost performance by using write combining to > generate larger PCIe MemWr TLPs. > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > all IO memory. This puts the VM in charge of the memory attributes, > and removes the KVM override to DEVICE_nGnRE. > > Ultimately this makes pgprot_writecombing() work correctly in VMs and > allows drivers like mlx5 to fully operate their HW. > > After some discussions with ARM and CPU architects we reached the > conclusion there was no need for KVM to prevent the VM from selecting > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > that NORMAL_NC could result in uncontained failures, but upon deeper > analysis it turns out there are already possible cases for uncontained > failures with DEVICE types too. Ultimately the platform must be > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > accesses have no uncontained failures. > > Fortunately real platforms do tend to implement this. > > This patch makes the VM's memory attributes behave as follows: > > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf > for details. > > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> From the discussions with the hardware people in Arm and Nvidia, we indeed concluded that relaxing S2 to Normal-NC is not any worse than Device (and actually Device memory is more prone to generating uncontained errors, something to do with the Device memory ordering semantics). For this change: Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita 2023-09-08 16:40 ` Catalin Marinas @ 2023-09-11 14:57 ` Lorenzo Pieralisi 2023-09-11 17:20 ` Jason Gunthorpe 2023-10-12 12:27 ` Will Deacon 2 siblings, 1 reply; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-09-11 14:57 UTC (permalink / raw) To: ankita Cc: jgg, maz, oliver.upton, catalin.marinas, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > Linux allows device drivers to map IO memory on a per-page basis using > "write combining" or WC. This is often done using > pgprot_writecombing(). The driver knows which pages can support WC pgprot_writecombine() ? > access and the proper programming model to generate this IO. Generally > the use case is to boost performance by using write combining to > generate larger PCIe MemWr TLPs. First off, this changeset does not affect *only* Linux guests, obviously. I understand that's the use case you are after but this change is targeting all VMs, it must be clear. Then WC and mapping to PCI TLPs, either you describe that in details (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or you don't describe it at all, as it stands I don't know how to use this information. > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > all IO memory. This puts the VM in charge of the memory attributes, > and removes the KVM override to DEVICE_nGnRE. > > Ultimately this makes pgprot_writecombing() work correctly in VMs and pgprot_writecombine() ? > allows drivers like mlx5 to fully operate their HW. > > After some discussions with ARM and CPU architects we reached the > conclusion there was no need for KVM to prevent the VM from selecting > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > that NORMAL_NC could result in uncontained failures, but upon deeper > analysis it turns out there are already possible cases for uncontained > failures with DEVICE types too. Ultimately the platform must be > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > accesses have no uncontained failures. I would reorder/rephrase this changelog as follows: - Describe what the problem is (ie KVM default s2 mappings) - Describe how you are solving it - Add a link to the documentation that states why it is safe to do that and the resulting s1/s2 mappings combination It must be clear why from a legacy standpoint this is a safe change to apply. > Fortunately real platforms do tend to implement this. Remove this sentence, it adds no information for someone who is chasing bugs or just wants to understand the change itself. > This patch makes the VM's memory attributes behave as follows: "The resulting stage 1 and stage 2 memory types and cacheability attributes are as follows": > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf > for details. D8.5 I assume. I commented on the log because it must be easy to follow, it is an important change (and there is a lot of background information required to understand it - please report it). 
The patch itself must be reviewed by arm64/kvm maintainers; the changes (given the background work that led to them) seem fine to me (please CC me on next versions). Thanks Lorenzo > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > --- > arch/arm64/include/asm/memory.h | 2 ++ > arch/arm64/kvm/hyp/pgtable.c | 2 +- > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h > index fde4186cc387..c247e5f29d5a 100644 > --- a/arch/arm64/include/asm/memory.h > +++ b/arch/arm64/include/asm/memory.h > @@ -147,6 +147,7 @@ > * Memory types for Stage-2 translation > */ > #define MT_S2_NORMAL 0xf > +#define MT_S2_NORMAL_NC 0x5 > #define MT_S2_DEVICE_nGnRE 0x1 > > /* > @@ -154,6 +155,7 @@ > * Stage-2 enforces Normal-WB and Device-nGnRE > */ > #define MT_S2_FWB_NORMAL 6 > +#define MT_S2_FWB_NORMAL_NC 5 > #define MT_S2_FWB_DEVICE_nGnRE 1 > > #ifdef CONFIG_ARM64_4K_PAGES > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c > index ccd291b6893d..a80949002191 100644 > --- a/arch/arm64/kvm/hyp/pgtable.c > +++ b/arch/arm64/kvm/hyp/pgtable.c > @@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p > kvm_pte_t *ptep) > { > bool device = prot & KVM_PGTABLE_PROT_DEVICE; > - kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) : > + kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, NORMAL_NC) : > KVM_S2_MEMATTR(pgt, NORMAL); > u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS; > > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-11 14:57 ` Lorenzo Pieralisi @ 2023-09-11 17:20 ` Jason Gunthorpe 2023-09-13 15:26 ` Lorenzo Pieralisi 0 siblings, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-09-11 17:20 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Mon, Sep 11, 2023 at 04:57:51PM +0200, Lorenzo Pieralisi wrote: > On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > > From: Ankit Agrawal <ankita@nvidia.com> > > > > Linux allows device drivers to map IO memory on a per-page basis using > > "write combining" or WC. This is often done using > > pgprot_writecombing(). The driver knows which pages can support WC > > pgprot_writecombine() ? > > > access and the proper programming model to generate this IO. Generally > > the use case is to boost performance by using write combining to > > generate larger PCIe MemWr TLPs. > > First off, this changeset does not affect *only* Linux guests, obviously. I think everyone understands that. It can be clarified. > I understand that's the use case you are after but this change is > targeting all VMs, it must be clear. > > Then WC and mapping to PCI TLPs, either you describe that in details > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or > you don't describe it at all, as it stands I don't know how to use > this information. How about another pargraph: KVM prevents all VMs (including Linux) from accessing NORMAL_NC mappings, which is how Linux implements pgprot_writecombine(). This prevents using this performance optimization within VMs. I don't think we need to go into details how it works beyond that it requires NORMAL_NC. > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > > all IO memory. This puts the VM in charge of the memory attributes, > > and removes the KVM override to DEVICE_nGnRE. > > > > Ultimately this makes pgprot_writecombing() work correctly in VMs and > > pgprot_writecombine() ? > > > allows drivers like mlx5 to fully operate their HW. > > > > After some discussions with ARM and CPU architects we reached the > > conclusion there was no need for KVM to prevent the VM from selecting > > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > > that NORMAL_NC could result in uncontained failures, but upon deeper > > analysis it turns out there are already possible cases for uncontained > > failures with DEVICE types too. Ultimately the platform must be > > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > > accesses have no uncontained failures. > > I would reorder/rephrase this changelog as follows: > > - Describe what the problem is (ie KVM default s2 mappings) The problem is that pgprot_writecombine() doesn't work in Linux VMs. That is the first pagraph. > - Describe how you are solving it That is the middle paragraph "Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis" > - Add a link to the documentation that states why it is safe to do > that and the resulting s1/s2 mappings combination AFAIK there is no documentation beyond the combining rules. Exactly what should happen in various error conditions is implementation defined. Catalin did you ever find anything? > It must be clear why from a legacy standpoint this is a safe change > to apply. 
This is why: > > Fortunately real platforms do tend to implement this. It is why it is safe today, because real platforms don't throw uncontained errors from typical PCI accesses that VFIO allows. I think the conclusion was that this is just because they don't do errors at all, not because DEVICE_* prevents it. > Remove this sentence, it adds no information for someone who > is chasing bugs or just wants to understand the change itself. So, if you hit a bug here you might evaluate if there is something wrong with your platform, ie it is allowing uncontained errors in unexpected places. Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
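For reference on why NORMAL_NC specifically is needed: on arm64, pgprot_writecombine() selects the Normal-NC stage-1 memory type, roughly as defined in arch/arm64/include/asm/pgtable.h:

#define pgprot_writecombine(prot) \
        __pgprot_modify(prot, PTE_ATTRINDX_MASK, \
                        PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)

With stage 2 forced to DEVICE_nGnRE, the combining rules degrade the result back to Device regardless of this stage-1 choice, which is why the guest's pgprot_writecombine() currently has no effect.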
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-11 17:20 ` Jason Gunthorpe @ 2023-09-13 15:26 ` Lorenzo Pieralisi 2023-09-13 18:54 ` Jason Gunthorpe 0 siblings, 1 reply; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-09-13 15:26 UTC (permalink / raw) To: Jason Gunthorpe Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Mon, Sep 11, 2023 at 02:20:01PM -0300, Jason Gunthorpe wrote: > On Mon, Sep 11, 2023 at 04:57:51PM +0200, Lorenzo Pieralisi wrote: > > On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > > > From: Ankit Agrawal <ankita@nvidia.com> > > > > > > Linux allows device drivers to map IO memory on a per-page basis using > > > "write combining" or WC. This is often done using > > > pgprot_writecombing(). The driver knows which pages can support WC > > > > pgprot_writecombine() ? > > > > > access and the proper programming model to generate this IO. Generally > > > the use case is to boost performance by using write combining to > > > generate larger PCIe MemWr TLPs. > > > > First off, this changeset does not affect *only* Linux guests, obviously. > > I think everyone understands that. It can be clarified. OK, this is not a Linux guest fix *only*, everyone understands that but I would rather make sure that's crystal clear. > > I understand that's the use case you are after but this change is > > targeting all VMs, it must be clear. > > > > Then WC and mapping to PCI TLPs, either you describe that in details > > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or > > you don't describe it at all, as it stands I don't know how to use > > this information. > > How about another pargraph: > > KVM prevents all VMs (including Linux) from accessing NORMAL_NC > mappings, which is how Linux implements pgprot_writecombine(). This > prevents using this performance optimization within VMs. > > I don't think we need to go into details how it works beyond that it > requires NORMAL_NC. I think it is worth summarizing the discussion that led to this change, I can write something up. > > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > > > all IO memory. This puts the VM in charge of the memory attributes, > > > and removes the KVM override to DEVICE_nGnRE. > > > > > > Ultimately this makes pgprot_writecombing() work correctly in VMs and > > > > pgprot_writecombine() ? > > > > > allows drivers like mlx5 to fully operate their HW. > > > > > > After some discussions with ARM and CPU architects we reached the > > > conclusion there was no need for KVM to prevent the VM from selecting > > > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > > > that NORMAL_NC could result in uncontained failures, but upon deeper > > > analysis it turns out there are already possible cases for uncontained > > > failures with DEVICE types too. Ultimately the platform must be > > > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > > > accesses have no uncontained failures. > > > > I would reorder/rephrase this changelog as follows: > > > > - Describe what the problem is (ie KVM default s2 mappings) > > The problem is that pgprot_writecombine() doesn't work in Linux > VMs. That is the first pagraph. s/pagraph/paragraph Well to me that's how the problem was spotted but OK, I can rewrite the log and post it here, NP. 
> > - Describe how you are solving it > > That is the middle paragraph "Allow VMs to select DEVICE_* or > NORMAL_NC on a page by page basis" > > > - Add a link to the documentation that states why it is safe to do > > that and the resulting s1/s2 mappings combination > > AFAIK there is no documentation beyond the combining rules. Exactly > what should happen in various error conditions is implementation > defined. Catalin did you ever find anything? > > > It must be clear why from a legacy standpoint this is a safe change > > to apply. > > This is why: > > > > Fortunately real platforms do tend to implement this. Is it possible please to describe what "this" is in details ? What does "real platforms" mean ? Let's describe what HW/SW should be implemented to make this safe. > It is why it is safe today, because real platforms don't throw "real platforms", I have a problem with that, see above. > uncontained errors from typical PCI accesses that VFIO allows. I think > the conclusions was it turns out that is just because they don't do > errors at all, not because DEVICE_* prevents it. > > > Remove this sentence, it adds no information for someone who > > is chasing bugs or just wants to understand the change itself. > > So, if you hit a bug here you might evaluate if there is something > wrong with your platform, ie it is allowing uncontained errors in > unexpected places. Developers looking at commit log in the future must be able to understand why it was a sound change to make. As it stands IMO the commit log does not explain it fully. I can write up the commit log and post it if I manage to summarize it any better - more important the review on the code (that was already provided), I will try to write something up asap. Thanks, Lorenzo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-13 15:26 ` Lorenzo Pieralisi @ 2023-09-13 18:54 ` Jason Gunthorpe 2023-09-26 8:31 ` Lorenzo Pieralisi 0 siblings, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-09-13 18:54 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote: > > > I understand that's the use case you are after but this change is > > > targeting all VMs, it must be clear. > > > > > > Then WC and mapping to PCI TLPs, either you describe that in details > > > (NormalNC vs device-nGnRE and resulting SystemBus<->PCI transactions) or > > > you don't describe it at all, as it stands I don't know how to use > > > this information. > > > > How about another pargraph: > > > > KVM prevents all VMs (including Linux) from accessing NORMAL_NC > > mappings, which is how Linux implements pgprot_writecombine(). This > > prevents using this performance optimization within VMs. > > > > I don't think we need to go into details how it works beyond that it > > requires NORMAL_NC. > > I think it is worth summarizing the discussion that led to this change, > I can write something up. We tried here, in short the reason we all understood it was like this today is because there was a belief that DEVICE_nGnRE allowed fewer/none uncontained failures than NORMAL_NC. AFIACT it turns out the specs are unclear and actual real world IP from ARM does not behave this way. We learned that at best they are equally unsafe, even perhaps NORMAL_NC is bit safer. What makes it safe in real chips is not the unclear specification, or the behavior of the ARM IP to generate uncontained failures, but because the combination of all IP in the platform never triggers an uncontained error anyhow. For instance PCIe integration IP does not transform PCIe TLP failures into uncontained errors. They generate all ones data on the bus instead. There is no spec that says this should happen, it is just what people do - c&p from Intel. On top of that we see that server focused designs built to run VMs have put in the RAS work to ensure uncontained errors can't exist and truely plug this hole. Thus, at the end of the day the safety of KVM and VFIO is effectively an unknowable platform specific property no matter what choice we make in SW here. Thus let's make the choice that enables the broadest VM functionality. > > > > Fortunately real platforms do tend to implement this. > > Is it possible please to describe what "this" is in details ? "prevent uncontained errors for DEVICE_* and NORMAL_NC access" > What does "real platforms" mean ? Actual real world deployments of VFIO and KVM in production that expect to be safe from uncontained errors. > Let's describe what HW/SW should be implemented to make this > safe. The platform must not generate uncontained errors :) > I can write up the commit log and post it if I manage to summarize > it any better - more important the review on the code (that was already > provided), I will try to write something up asap. Thank you! Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-13 18:54 ` Jason Gunthorpe @ 2023-09-26 8:31 ` Lorenzo Pieralisi 2023-09-26 12:25 ` Jason Gunthorpe 2023-09-26 13:52 ` Catalin Marinas 0 siblings, 2 replies; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-09-26 8:31 UTC (permalink / raw) To: Jason Gunthorpe Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Wed, Sep 13, 2023 at 03:54:54PM -0300, Jason Gunthorpe wrote: > On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote: [...] > > I can write up the commit log and post it if I manage to summarize > > it any better - more important the review on the code (that was already > > provided), I will try to write something up asap. > > Thank you! > > Jason FWIW, I have come up with the commit log below - please review and scrutinize/change it as deemed fit - it is not necessarily clearer than this one and it definitely requires MarcZ/Catalin/Will attention before it can be considered: --- Currently, KVM for ARM64 maps at stage 2 memory that is considered device (ie using pfn_is_map_memory() to discern between device memory and memory itself) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system can't determine device MMIO mapping memory attributes on its own but it is always overriden by the KVM stage 2 default. This set-up does not allow guest operating systems to map device memory on a page by page basis with combined attributes other than DEVICE_nGnRE, which turns out to be an issue in that guest operating systems (eg Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (eg gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (eg unaligned transactions) such as the NormalNC memory type. The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (ie it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be imprecise. Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings (ie DEVICE_nGnRE memory type is even more problematic than NormalNC in terms of containability since eg aborts triggered on loads cannot be made synchronous, which make them harder to contain); this means that, regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures; reworded, the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform and therefore can be relaxed. For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to NormalNC. 
This puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]): S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr> [1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-26 8:31 ` Lorenzo Pieralisi @ 2023-09-26 12:25 ` Jason Gunthorpe 2023-09-26 13:52 ` Catalin Marinas 1 sibling, 0 replies; 55+ messages in thread From: Jason Gunthorpe @ 2023-09-26 12:25 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: ankita, maz, oliver.upton, catalin.marinas, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote: > On Wed, Sep 13, 2023 at 03:54:54PM -0300, Jason Gunthorpe wrote: > > On Wed, Sep 13, 2023 at 05:26:01PM +0200, Lorenzo Pieralisi wrote: > > [...] > > > > I can write up the commit log and post it if I manage to summarize > > > it any better - more important the review on the code (that was already > > > provided), I will try to write something up asap. > > > > Thank you! > > > > Jason > > FWIW, I have come up with the commit log below - please review and > scrutinize/change it as deemed fit - it is not necessarily clearer > than this one and it definitely requires MarcZ/Catalin/Will attention > before it can be considered: I have no issue with this language. Thanks, Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-26 8:31 ` Lorenzo Pieralisi 2023-09-26 12:25 ` Jason Gunthorpe @ 2023-09-26 13:52 ` Catalin Marinas 2023-09-26 16:12 ` Lorenzo Pieralisi 2023-10-05 9:56 ` Lorenzo Pieralisi 1 sibling, 2 replies; 55+ messages in thread From: Catalin Marinas @ 2023-09-26 13:52 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: Jason Gunthorpe, ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote: > Currently, KVM for ARM64 maps at stage 2 memory that is > considered device (ie using pfn_is_map_memory() to discern > between device memory and memory itself) with DEVICE_nGnRE > memory attributes; this setting overrides (as per the ARM > architecture [1]) any device MMIO mapping present at stage > 1, resulting in a set-up whereby a guest operating system > can't determine device MMIO mapping memory attributes on its > own but it is always overriden by the KVM stage 2 default. > > This set-up does not allow guest operating systems to map > device memory on a page by page basis with combined attributes > other than DEVICE_nGnRE, Well, it also has the option of DEVICE_nGnRnE ;). > which turns out to be an issue in that > guest operating systems (eg Linux) may request to map > devices MMIO regions with memory attributes that guarantee > better performance (eg gathering attribute - that for some > devices can generate larger PCIe memory writes TLPs) > and specific operations (eg unaligned transactions) such as > the NormalNC memory type. > > The default device stage 2 mapping was chosen in KVM > for ARM64 since it was considered safer (ie it would > not allow guests to trigger uncontained failures > ultimately crashing the machine) but this turned out > to be imprecise. > > Failures containability is a property of the platform > and is independent from the memory type used for MMIO > device memory mappings (ie DEVICE_nGnRE memory type is > even more problematic than NormalNC in terms of containability > since eg aborts triggered on loads cannot be made synchronous, > which make them harder to contain); this means that, > regardless of the combined stage1+stage2 mappings a > platform is safe if and only if device transactions cannot trigger > uncontained failures; reworded, the default KVM device > stage 2 memory attributes play no role in making device > assignment safer for a given platform and therefore can > be relaxed. > > For all these reasons, relax the KVM stage 2 device > memory attributes from DEVICE_nGnRE to NormalNC. > > This puts guests in control (thanks to stage1+stage2 > combined memory attributes rules [1]) of device MMIO > regions memory mappings, according to the rules > described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]): > > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> Not sure what's wrong with my font setup as I can't see the above table but I know it from the Arm ARM. Anyway, the text looks fine to me. Thanks for putting it together Lorenzo. One thing not mentioned here is that pci-vfio still maps such memory as Device-nGnRnE in user space and relaxing this potentially creates an alias.
But such alias is only relevant if both the VMM and the VM try to access the same device, which I doubt is a realistic scenario. -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
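The user-space side of that alias comes from the VFIO PCI mmap path, which applies pgprot_noncached() to BAR mappings (abridged below; the actual driver defers PFN insertion to a fault handler), and on arm64 pgprot_noncached() resolves to Device-nGnRnE:

        /* Abridged from the vfio-pci BAR mmap path */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
        /*
         * arm64: pgprot_noncached() -> MT_DEVICE_nGnRnE, i.e. the
         * Device-nGnRnE user-space alias mentioned above.
         */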
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-26 13:52 ` Catalin Marinas @ 2023-09-26 16:12 ` Lorenzo Pieralisi 2023-10-05 9:56 ` Lorenzo Pieralisi 1 sibling, 0 replies; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-09-26 16:12 UTC (permalink / raw) To: Catalin Marinas Cc: Jason Gunthorpe, ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: > On Tue, Sep 26, 2023 at 10:31:38AM +0200, Lorenzo Pieralisi wrote: > > Currently, KVM for ARM64 maps at stage 2 memory that is > > considered device (ie using pfn_is_map_memory() to discern > > between device memory and memory itself) with DEVICE_nGnRE > > memory attributes; this setting overrides (as per the ARM > > architecture [1]) any device MMIO mapping present at stage > > 1, resulting in a set-up whereby a guest operating system > > can't determine device MMIO mapping memory attributes on its > > own but it is always overriden by the KVM stage 2 default. > > > > This set-up does not allow guest operating systems to map > > device memory on a page by page basis with combined attributes > > other than DEVICE_nGnRE, > > Well, it also has the option of DEVICE_nGnRnE ;). That's true - we have to fix it up. "This set-up does not allow guest operating systems to choose device memory attributes on a page by page basis independently from KVM stage 2 mappings,..." > > which turns out to be an issue in that > > guest operating systems (eg Linux) may request to map > > devices MMIO regions with memory attributes that guarantee > > better performance (eg gathering attribute - that for some > > devices can generate larger PCIe memory writes TLPs) > > and specific operations (eg unaligned transactions) such as > > the NormalNC memory type. > > > > The default device stage 2 mapping was chosen in KVM > > for ARM64 since it was considered safer (ie it would > > not allow guests to trigger uncontained failures > > ultimately crashing the machine) but this turned out > > to be imprecise. > > > > Failures containability is a property of the platform > > and is independent from the memory type used for MMIO > > device memory mappings (ie DEVICE_nGnRE memory type is > > even more problematic than NormalNC in terms of containability > > since eg aborts triggered on loads cannot be made synchronous, > > which make them harder to contain); this means that, > > regardless of the combined stage1+stage2 mappings a > > platform is safe if and only if device transactions cannot trigger > > uncontained failures; reworded, the default KVM device > > stage 2 memory attributes play no role in making device > > assignment safer for a given platform and therefore can > > be relaxed. > > > > For all these reasons, relax the KVM stage 2 device > > memory attributes from DEVICE_nGnRE to NormalNC. 
> > > > This puts guests in control (thanks to stage1+stage2 > > combined memory attributes rules [1]) of device MMIO > > regions memory mappings, according to the rules > > described in [1] and summarized here ([(S1) = Stage1][(S2) = Stage2]): > > > > S1 | S2 | Result > > NORMAL-WB | NORMAL-NC | NORMAL-NC > > NORMAL-WT | NORMAL-NC | NORMAL-NC > > NORMAL-NC | NORMAL-NC | NORMAL-NC > > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > Not sure what's wrong with my font setup as I can't see the above table > but I know it from the Arm ARM. > > Anyway, the text looks fine to me. Thanks for putting it together > Lorenzo. > > One thing not mentioned here is that pci-vfio still maps such memory as > Device-nGnRnE in user space and relaxing this potentially creates an > alias. But such alias is only relevant of both the VMM and the VM try to > access the same device which I doubt is a realistic scenario. I can update the log and repost, fixing the comment above as well (need to check what's wrong with the table characters). Thanks, Lorenzo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-26 13:52 ` Catalin Marinas 2023-09-26 16:12 ` Lorenzo Pieralisi @ 2023-10-05 9:56 ` Lorenzo Pieralisi 2023-10-05 11:56 ` Jason Gunthorpe 2023-10-12 12:35 ` Will Deacon 1 sibling, 2 replies; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-10-05 9:56 UTC (permalink / raw) To: Catalin Marinas Cc: Jason Gunthorpe, ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: [...] > Anyway, the text looks fine to me. Thanks for putting it together > Lorenzo. Thanks ! > One thing not mentioned here is that pci-vfio still maps such memory as > Device-nGnRnE in user space and relaxing this potentially creates an > alias. But such alias is only relevant of both the VMM and the VM try to > access the same device which I doubt is a realistic scenario. A revised log, FWIW: --- Currently, KVM for ARM64 maps at stage 2 memory that is considered device (ie it is not RAM) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system can't determine device MMIO mapping memory attributes on its own but it is always overriden by the KVM stage 2 default. This set-up does not allow guest operating systems to select device memory attributes on a page by page basis independently from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), which turns out to be an issue in that guest operating systems (eg Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (eg gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (eg unaligned transactions) such as the NormalNC memory type. The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (ie it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be imprecise. Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings (ie DEVICE_nGnRE memory type is even more problematic than NormalNC in terms of containability since eg aborts triggered on loads cannot be made synchronous, which make them harder to contain); this means that, regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures; reworded, the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform and therefore can be relaxed. For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to NormalNC. 
This puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) - stage1], [(S2) - stage 2]): S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr> It is worth noting that currently, to map devices MMIO space to user space in a device pass-through use case the VFIO framework applies memory attributes derived from pgprot_noncached() settings applied to VMAs, which result in device-nGnRnE memory attributes for the stage-1 VMM mappings. This means that a userspace mapping for device MMIO space carried out with the current VFIO framework and a guest OS mapping for the same MMIO space may result in a mismatched alias as described in [2]. Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change anything in this respect, in that the mismatched aliases would only affect (refer to [2] for a detailed explanation) ordering between the userspace and GuestOS mappings resulting stream of transactions (ie it does not cause loss of property for either stream of transactions on its own), which is harmless given that the userspace and GuestOS access to the device is carried out through independent transactions streams. [1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf [2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-05 9:56 ` Lorenzo Pieralisi @ 2023-10-05 11:56 ` Jason Gunthorpe 2023-10-05 14:08 ` Lorenzo Pieralisi 2023-10-12 12:35 ` Will Deacon 1 sibling, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-10-05 11:56 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: Catalin Marinas, ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: > > [...] > > > Anyway, the text looks fine to me. Thanks for putting it together > > Lorenzo. > > Thanks ! > > > One thing not mentioned here is that pci-vfio still maps such memory as > > Device-nGnRnE in user space and relaxing this potentially creates an > > alias. But such alias is only relevant of both the VMM and the VM try to > > access the same device which I doubt is a realistic scenario. > > A revised log, FWIW: What is the plan here, do you want Ankit to resend the series with this text? There were no comments on patch 1/2? Thanks, Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-05 11:56 ` Jason Gunthorpe @ 2023-10-05 14:08 ` Lorenzo Pieralisi 0 siblings, 0 replies; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-10-05 14:08 UTC (permalink / raw) To: Jason Gunthorpe Cc: Catalin Marinas, ankita, maz, oliver.upton, will, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 05, 2023 at 08:56:30AM -0300, Jason Gunthorpe wrote: > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > > On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: > > > > [...] > > > > > Anyway, the text looks fine to me. Thanks for putting it together > > > Lorenzo. > > > > Thanks ! > > > > > One thing not mentioned here is that pci-vfio still maps such memory as > > > Device-nGnRnE in user space and relaxing this potentially creates an > > > alias. But such alias is only relevant of both the VMM and the VM try to > > > access the same device which I doubt is a realistic scenario. > > > > A revised log, FWIW: > > What is the plan here, do you want Ankit to resend the series with > this text? I have put together the text to summarize the discussions that led to this patch; if that helps somehow, yes please, make it the commit log. > There were no comments on patch 1/2? I don't have enough background knowledge to make a statement on this series' code, I just wanted to make sure the architectural details are in the commit log so that it can help the reviewers and, if it makes it to the kernel, future developers. Thanks, Lorenzo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-05 9:56 ` Lorenzo Pieralisi 2023-10-05 11:56 ` Jason Gunthorpe @ 2023-10-12 12:35 ` Will Deacon 2023-10-12 13:20 ` Jason Gunthorpe 2023-10-12 13:53 ` Catalin Marinas 1 sibling, 2 replies; 55+ messages in thread From: Will Deacon @ 2023-10-12 12:35 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: Catalin Marinas, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > On Tue, Sep 26, 2023 at 02:52:13PM +0100, Catalin Marinas wrote: > > [...] > > > Anyway, the text looks fine to me. Thanks for putting it together > > Lorenzo. > > Thanks ! > > > One thing not mentioned here is that pci-vfio still maps such memory as > > Device-nGnRnE in user space and relaxing this potentially creates an > > alias. But such alias is only relevant of both the VMM and the VM try to > > access the same device which I doubt is a realistic scenario. > > A revised log, FWIW: Thanks for putting this together, Lorenzo. Just one thing below: > --- > Currently, KVM for ARM64 maps at stage 2 memory that is > considered device (ie it is not RAM) with DEVICE_nGnRE > memory attributes; this setting overrides (as per the ARM > architecture [1]) any device MMIO mapping present at stage > 1, resulting in a set-up whereby a guest operating system > can't determine device MMIO mapping memory attributes on its > own but it is always overriden by the KVM stage 2 default. > > This set-up does not allow guest operating systems to select > device memory attributes on a page by page basis independently > from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage > 2 memory type attributes"), which turns out to be an issue in that > guest operating systems (eg Linux) may request to map > devices MMIO regions with memory attributes that guarantee > better performance (eg gathering attribute - that for some > devices can generate larger PCIe memory writes TLPs) > and specific operations (eg unaligned transactions) such as > the NormalNC memory type. > > The default device stage 2 mapping was chosen in KVM > for ARM64 since it was considered safer (ie it would > not allow guests to trigger uncontained failures > ultimately crashing the machine) but this turned out > to be imprecise. > > Failures containability is a property of the platform > and is independent from the memory type used for MMIO > device memory mappings (ie DEVICE_nGnRE memory type is > even more problematic than NormalNC in terms of containability > since eg aborts triggered on loads cannot be made synchronous, > which make them harder to contain); this means that, > regardless of the combined stage1+stage2 mappings a > platform is safe if and only if device transactions cannot trigger > uncontained failures; reworded, the default KVM device > stage 2 memory attributes play no role in making device > assignment safer for a given platform and therefore can > be relaxed. > > For all these reasons, relax the KVM stage 2 device > memory attributes from DEVICE_nGnRE to NormalNC. The reasoning above suggests to me that this should probably just be Normal cacheable, as that is what actually allows the guest to control the attributes. So what is the rationale behind stopping at Normal-NC? Will ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 12:35 ` Will Deacon
@ 2023-10-12 13:20 ` Jason Gunthorpe
  2023-10-12 14:29 ` Lorenzo Pieralisi
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-12 13:20 UTC (permalink / raw)
To: Will Deacon
Cc: Lorenzo Pieralisi, Catalin Marinas, ankita, maz, oliver.upton,
    aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
    jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote:

> > Failures containability is a property of the platform
> > and is independent from the memory type used for MMIO
> > device memory mappings (ie DEVICE_nGnRE memory type is
> > even more problematic than NormalNC in terms of containability
> > since eg aborts triggered on loads cannot be made synchronous,
> > which make them harder to contain); this means that,
> > regardless of the combined stage1+stage2 mappings a
> > platform is safe if and only if device transactions cannot trigger
> > uncontained failures; reworded, the default KVM device
> > stage 2 memory attributes play no role in making device
> > assignment safer for a given platform and therefore can
> > be relaxed.
> >
> > For all these reasons, relax the KVM stage 2 device
> > memory attributes from DEVICE_nGnRE to NormalNC.
>
> The reasoning above suggests to me that this should probably just be
> Normal cacheable, as that is what actually allows the guest to control
> the attributes. So what is the rationale behind stopping at Normal-NC?

I agree it would be very nice if all the memory in the guest could
just be cachable and the guest could select everything.

However, I think Lorenzo overstated the argument.

The off-list discussion was focused on NormalNC for MMIO only. Nobody
raised the idea that cachable was safe from uncontained errors for
MMIO.

I'm looking through the conversations and I wouldn't jump to
concluding that "cachable MMIO" is safe from uncontained failures.

Catalin has already raised a number of concerns in the other patch
about making actual "designed to be cachable memory" into KVM
cachable.

Regards,
Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-12 13:20 ` Jason Gunthorpe @ 2023-10-12 14:29 ` Lorenzo Pieralisi 0 siblings, 0 replies; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-10-12 14:29 UTC (permalink / raw) To: Jason Gunthorpe Cc: Will Deacon, Catalin Marinas, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 12, 2023 at 10:20:45AM -0300, Jason Gunthorpe wrote: > On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote: > > > > Failures containability is a property of the platform > > > and is independent from the memory type used for MMIO > > > device memory mappings (ie DEVICE_nGnRE memory type is > > > even more problematic than NormalNC in terms of containability > > > since eg aborts triggered on loads cannot be made synchronous, > > > which make them harder to contain); this means that, > > > regardless of the combined stage1+stage2 mappings a > > > platform is safe if and only if device transactions cannot trigger > > > uncontained failures; reworded, the default KVM device > > > stage 2 memory attributes play no role in making device > > > assignment safer for a given platform and therefore can > > > be relaxed. > > > > > > For all these reasons, relax the KVM stage 2 device > > > memory attributes from DEVICE_nGnRE to NormalNC. > > > > The reasoning above suggests to me that this should probably just be > > Normal cacheable, as that is what actually allows the guest to control > > the attributes. So what is the rationale behind stopping at Normal-NC? > > I agree it would be very nice if the all the memory in the guest could > just be cachable and the guest could select everything. > > However, I think Lorenzo over stated the argument. > > The off-list discussion was focused on NormalNC for MMIO only. Nobody > raised the idea that cachable was safe from uncontained errors for > MMIO. True, it should be made clearer ie "independent from the non-cacheable/uncacheable memory type...", please update the log accordingly, forgive me the confusion I raised. Lorenzo > I'm looking through the conversations and I wouldn't jump to > concluding that "cachable MMIO" is safe from uncontained failures. > > Catalin has already raised a number of conerns in the other patch > about making actual "designed to be cachable memory" into KVM > cachable. > > Regards, > Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-12 12:35 ` Will Deacon 2023-10-12 13:20 ` Jason Gunthorpe @ 2023-10-12 13:53 ` Catalin Marinas 2023-10-12 14:48 ` Will Deacon 1 sibling, 1 reply; 55+ messages in thread From: Catalin Marinas @ 2023-10-12 13:53 UTC (permalink / raw) To: Will Deacon Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote: > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > > For all these reasons, relax the KVM stage 2 device > > memory attributes from DEVICE_nGnRE to NormalNC. > > The reasoning above suggests to me that this should probably just be > Normal cacheable, as that is what actually allows the guest to control > the attributes. So what is the rationale behind stopping at Normal-NC? It's more like we don't have any clue on what may happen. MTE is obviously a case where it can go wrong (we can blame the architecture design here) but I recall years ago where a malicious guest could bring the platform down by mapping the GIC CPU interface as cacheable. Not sure how error containment works with cacheable memory. A cacheable access to a device may stay in the cache a lot longer after the guest has been scheduled out, only evicted at some random time. We may no longer be able to associate it with the guest, especially if the guest exited. Also not sure about claiming back the device after killing the guest, do we need cache maintenance? So, for now I'd only relax this if we know there's RAM(-like) on the other side and won't trigger some potentially uncontainable errors as a result. -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-12 13:53 ` Catalin Marinas @ 2023-10-12 14:48 ` Will Deacon 2023-10-12 15:44 ` Jason Gunthorpe 2023-10-12 17:26 ` Catalin Marinas 0 siblings, 2 replies; 55+ messages in thread From: Will Deacon @ 2023-10-12 14:48 UTC (permalink / raw) To: Catalin Marinas Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 12, 2023 at 02:53:21PM +0100, Catalin Marinas wrote: > On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote: > > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > > > For all these reasons, relax the KVM stage 2 device > > > memory attributes from DEVICE_nGnRE to NormalNC. > > > > The reasoning above suggests to me that this should probably just be > > Normal cacheable, as that is what actually allows the guest to control > > the attributes. So what is the rationale behind stopping at Normal-NC? > > It's more like we don't have any clue on what may happen. MTE is > obviously a case where it can go wrong (we can blame the architecture > design here) but I recall years ago where a malicious guest could bring > the platform down by mapping the GIC CPU interface as cacheable. ... and do we know that isn't the case for non-cacheable? If not, why not? Also, are you saying we used to map the GIC CPU interface as cacheable at stage-2? I remember exclusives causing a problem, but I don't remember the guest having a cacheable mapping. > Not sure how error containment works with cacheable memory. A cacheable > access to a device may stay in the cache a lot longer after the guest > has been scheduled out, only evicted at some random time. But similarly, non-cacheable stores can be buffered. Why isn't that a problem? > We may no longer be able to associate it with the guest, especially if the > guest exited. Also not sure about claiming back the device after killing > the guest, do we need cache maintenance? Claiming back the device also seems strange if the guest has been using non-cacheable accesses since I think you could get write merging and reordering with subsequent device accesses trying to reset the device. > So, for now I'd only relax this if we know there's RAM(-like) on the > other side and won't trigger some potentially uncontainable errors as a > result. I guess my wider point is that I'm not convinced that non-cacheable is actually much better and I think we're going way off the deep end looking at what particular implementations do and trying to justify to ourselves that non-cacheable is safe, even though it's still a normal memory type at the end of the day. Obviously, it's up to Marc and Oliver if they want to do this, but I'm wary without an official statement from Arm to say that Normal-NC is correct. There's mention of such a statement in the cover letter: > We hope ARM will publish information helping platform designers > follow these guidelines. but imo we shouldn't merge this without either: (a) _Architectural_ guidance (as opposed to some random whitepaper or half-baked certification scheme). - or - (b) A concrete justification based on the current architecture as to why Normal-NC is the right thing to do for KVM. The current wording talks about use-cases (I get this) and error containment (it's a property of the system) but doesn't talk at all about why Normal-NC is the right result. 
Will ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 14:48 ` Will Deacon
@ 2023-10-12 15:44 ` Jason Gunthorpe
  2023-10-12 16:39 ` Will Deacon
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-12 15:44 UTC (permalink / raw)
To: Will Deacon
Cc: Catalin Marinas, Lorenzo Pieralisi, ankita, maz, oliver.upton,
    aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
    jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:

> I guess my wider point is that I'm not convinced that non-cacheable is
> actually much better and I think we're going way off the deep end looking
> at what particular implementations do and trying to justify to ourselves
> that non-cacheable is safe, even though it's still a normal memory type
> at the end of the day.

When we went over this with ARM it became fairly clear there wasn't an
official statement that Device-* is safe from uncontained failures. For
instance, looking at the actual IP, our architects pointed out that ARM
IP already provides ways for Device-* to trigger uncontained failures
today.

We then mutually concluded that KVM-safe implementations must already
be preventing uncontained failures for Device-* at the system level,
and that the same prevention will carry over to NormalNC as well.

IMHO, this seems to be a gap where ARM has not fully defined when
uncontained failures are allowed and left that as an implementation
choice.

In other words, KVM safety around uncontained failures is not a
property that can be reasoned about from the ARM architecture alone.

> The current wording talks about use-cases (I get this) and error containment
> (it's a property of the system) but doesn't talk at all about why Normal-NC
> is the right result.

Given that Device-* and NormalNC are equally implementation defined
with regards to uncontained failures, NormalNC allows more VM
functionality.

Further, we have a broad agreement that this use case is important,
and that NormalNC is the correct way to address it.

I think you are right to ask for more formality from the ARM team, but
also we shouldn't hold up fixing real functional bugs in real shipping
server ARM products.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-12 15:44 ` Jason Gunthorpe @ 2023-10-12 16:39 ` Will Deacon 2023-10-12 18:36 ` Jason Gunthorpe 0 siblings, 1 reply; 55+ messages in thread From: Will Deacon @ 2023-10-12 16:39 UTC (permalink / raw) To: Jason Gunthorpe Cc: Catalin Marinas, Lorenzo Pieralisi, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 12, 2023 at 12:44:39PM -0300, Jason Gunthorpe wrote: > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > > > I guess my wider point is that I'm not convinced that non-cacheable is > > actually much better and I think we're going way off the deep end looking > > at what particular implementations do and trying to justify to ourselves > > that non-cacheable is safe, even though it's still a normal memory type > > at the end of the day. > > When we went over this with ARM it became fairly clear there wasn't an > official statement that Device-* is safe from uncontained > failures. For instance, looking at the actual IP, our architects > pointed out that ARM IP already provides ways for Device-* to trigger > uncontained failures today. > > We then mutually concluded that KVM safe implementations must already > be preventing uncontained failures for Device-* at the system level > and that same prevention will carry over to NormalNC as well. > > IMHO, this seems to be a gap where ARM has not fully defined when > uncontained failures are allowed and left that as an implementation > choice. > > In other words, KVM safety around uncontained failure is not a > property that can be reasoned about from the ARM architecture alone. > > > The current wording talks about use-cases (I get this) and error containment > > (it's a property of the system) but doesn't talk at all about why Normal-NC > > is the right result. > > Given that Device-* and NormalNC are equally implementation defined > with regards to uncontained failures, NormalNC allows more VM > functionality. > > Further, we have a broad agreement that this use case is important, > and that NormalNC is the correct way to adress it. > > I think you are right to ask for more formality from ARM team but also > we shouldn't hold up fixing real functional bugs in real shipping > server ARM products. All I'm asking for is justification as to why Normal-NC is the right memory type rather than any other normal memory type. If it's not possible to explain that architecturally, then I'm not sure this change belongs in architecture code. Ultimately, we need to be able to maintain this stuff, so we can't just blindly implement changes based on a combination of off-list discussions and individual product needs. For example, if somebody else rocks up tomorrow and asks for this to be Normal-writethrough, what grounds do we have to say no if we've taken this change already? So please let's get to a point where we can actually reason about this. Will ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-12 16:39 ` Will Deacon
@ 2023-10-12 18:36 ` Jason Gunthorpe
  2023-10-13 9:29 ` Will Deacon
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-12 18:36 UTC (permalink / raw)
To: Will Deacon
Cc: Catalin Marinas, Lorenzo Pieralisi, ankita, maz, oliver.upton,
    aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
    jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Thu, Oct 12, 2023 at 05:39:31PM +0100, Will Deacon wrote:

> All I'm asking for is justification as to why Normal-NC is the right
> memory type rather than any other normal memory type. If it's not possible
> to explain that architecturally, then I'm not sure this change belongs in
> architecture code.

Well, I think Catalin summarized it nicely, I second his ask at the
end: we are basically at your scenario below - can you justify why
DEVICE_GRE is correct today, architecturally? We could not. Earlier
someone said uncontained failure prevention, but a deep dive on that
found it is not so.

> Ultimately, we need to be able to maintain this stuff, so we can't just
> blindly implement changes based on a combination of off-list discussions
> and individual product needs. For example, if somebody else rocks up
> tomorrow and asks for this to be Normal-writethrough, what grounds do
> we have to say no if we've taken this change already?

Hmm. Something got lost here. This patch is talking about the S2
MemAttr[3:0]. There are only 4 relevant values when FEAT_S2FWB is in
use (see D5.5.5 in ARM DDI 0487F.c):

0b001 - Today: force VM to be Device nGnRE

0b101 - Proposed: prevent the VM from selecting cachable, allow it to
        choose Device-* or NormalNC

0b110 - Force write back. Would totally break MMIO, ruled out

0b111 - Allow the VM to select anything, including cachable.
        This is nice, but summarizing Catalin's remarks:
         a) The kernel can't do cache maintenance (defeats FWB)
         b) Catalin's concerns about MTE and Atomics triggering
            uncontained failures
         c) It is unclear about uncontained failures for cachable
            in general

So the only argument is 001 v 101 v 111.

Catalin has explained why 111 is not good as a default. Most likely
with some future ACPI/BSA/etc work, and some cache syncing in the
kernel, someone could define a way to allow 111 as a choice. So, I
think we can rule out 111 as being the default choice without the
kernel getting more detailed system level knowledge. Further, patch 1
is about making 110 a driver-opt-in choice for VFIO memory, which
reduces the need for 111.

For 001 v 101: 001 is less functional in the VM. 001 offers no
advantage.

!FEAT_S2FWB has similar choices and similar argument.

So, IMHO, if someone comes to ask for something it would be to ask for
111 and we do have a set of pretty clear reasons why it should not be
111. (indeed we wanted to ask for that instead of patch 1 but there
are too many things required to get there),

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread
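To keep the four FEAT_S2FWB options straight, here is a plain
restatement of the list above as C constants - the numeric values and
their meaning are copied from the email, while the macro names are
invented for the example:

    /*
     * Sketch only: symbolic names (invented here) for the S2 MemAttr[3:0]
     * choices discussed above when FEAT_S2FWB is in use.
     */
    #define S2FWB_MEMATTR_DEVICE_nGnRE   1  /* 0b001: today, VM forced to Device-nGnRE */
    #define S2FWB_MEMATTR_NORMAL_NC_CAP  5  /* 0b101: proposed, guest picks Device-* or NormalNC */
    #define S2FWB_MEMATTR_FORCE_WB       6  /* 0b110: force write-back, breaks MMIO */
    #define S2FWB_MEMATTR_GUEST_CHOICE   7  /* 0b111: guest picks anything, incl. cachable */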
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-12 18:36 ` Jason Gunthorpe @ 2023-10-13 9:29 ` Will Deacon 0 siblings, 0 replies; 55+ messages in thread From: Will Deacon @ 2023-10-13 9:29 UTC (permalink / raw) To: Jason Gunthorpe Cc: Catalin Marinas, Lorenzo Pieralisi, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 12, 2023 at 03:36:24PM -0300, Jason Gunthorpe wrote: > On Thu, Oct 12, 2023 at 05:39:31PM +0100, Will Deacon wrote: > > > All I'm asking for is justification as to why Normal-NC is the right > > memory type rather than any other normal memory type. If it's not possible > > to explain that architecturally, then I'm not sure this change belongs in > > architecture code. > > Well, I think Catalin summarized it nicely, I second his ask at the end: > > We are basically at your scenario below - can you justify why > DEVICE_GRE is correct today, architecturally? We could not. Earlier > someone said uncontained failure prevention, but a deep dive on that > found it is not so. This logic can be used to justify the use of any other memory type. I'm asking specifically why Normal-NC is correct. > > Ultimately, we need to be able to maintain this stuff, so we can't just > > blindly implement changes based on a combination of off-list discussions > > and individual product needs. For example, if somebody else rocks up > > tomorrow and asks for this to be Normal-writethrough, what grounds do > > we have to say no if we've taken this change already? > > Hmm. Something got lost here. Well, I didn't assume FWB since these patches change the behaviour regardless... > This patch is talking about the S2 MemAttr[3:0]. There are only 4 > relevant values (when FEAT_S2FWB) (see D5.5.5 in ARM DDI 0487F.c) > > 0b001 - Today: force VM to be Device nGnRE > > 0b101 - Proposed: prevent the VM from selecting cachable, allow it to > choose Device-* or NormalNC > > 0b110 - Force write back. Would totally break MMIO, ruled out > > 0b111 - Allow the VM to select anything, including cachable. > This is nice, but summarizing Catalin's remarks: > a) The kernel can't do cache maintenance (defeats FWB) > b) Catalin's concerns about MTE and Atomics triggering > uncontained failures > c) It is unclear about uncontained failures for cachable > in general > > So the only argument is 001 v 110 v 111 Ok, so I can see why you end up with Normal-NC on a system that implements FWB. Systems without FWB have other options, but you can reasonably argue that having consistent behaviour between the two configurations is desirable. Please can you add that to the commit message? I still don't understand how to reclaim an MMIO region that the guest has mapped Normal-NC (see the thread with Catalin). Will ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-12 14:48 ` Will Deacon 2023-10-12 15:44 ` Jason Gunthorpe @ 2023-10-12 17:26 ` Catalin Marinas 2023-10-13 9:29 ` Will Deacon 1 sibling, 1 reply; 55+ messages in thread From: Catalin Marinas @ 2023-10-12 17:26 UTC (permalink / raw) To: Will Deacon Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > On Thu, Oct 12, 2023 at 02:53:21PM +0100, Catalin Marinas wrote: > > On Thu, Oct 12, 2023 at 01:35:41PM +0100, Will Deacon wrote: > > > On Thu, Oct 05, 2023 at 11:56:55AM +0200, Lorenzo Pieralisi wrote: > > > > For all these reasons, relax the KVM stage 2 device > > > > memory attributes from DEVICE_nGnRE to NormalNC. > > > > > > The reasoning above suggests to me that this should probably just be > > > Normal cacheable, as that is what actually allows the guest to control > > > the attributes. So what is the rationale behind stopping at Normal-NC? > > > > It's more like we don't have any clue on what may happen. MTE is > > obviously a case where it can go wrong (we can blame the architecture > > design here) but I recall years ago where a malicious guest could bring > > the platform down by mapping the GIC CPU interface as cacheable. > > ... and do we know that isn't the case for non-cacheable? If not, why not? Trying to get this information from the hw folk and architects is really hard. So we only relax it one step at a time ;). But given the MTE problems, I'd not go for cacheable Stage 2 unless we have FEAT_MTE_PERM implemented (both hw and sw). S2 cacheable allows the guest to map it as Normal Tagged. > Also, are you saying we used to map the GIC CPU interface as cacheable > at stage-2? I remember exclusives causing a problem, but I don't remember > the guest having a cacheable mapping. The guest never had a cacheable mapping, IIRC it was more of a theoretical problem, plugging a hole. Now, maybe I misremember, it's pretty hard to search the git logs given how the code was moved around (but I do remember the building we were in when discussing this, it was on the ground floor ;)). > > Not sure how error containment works with cacheable memory. A cacheable > > access to a device may stay in the cache a lot longer after the guest > > has been scheduled out, only evicted at some random time. > > But similarly, non-cacheable stores can be buffered. Why isn't that a > problem? RAS might track this for cacheable mappings as well, I just haven't figured out the details. > > We may no longer be able to associate it with the guest, especially if the > > guest exited. Also not sure about claiming back the device after killing > > the guest, do we need cache maintenance? > > Claiming back the device also seems strange if the guest has been using > non-cacheable accesses since I think you could get write merging and > reordering with subsequent device accesses trying to reset the device. True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)). > > So, for now I'd only relax this if we know there's RAM(-like) on the > > other side and won't trigger some potentially uncontainable errors as a > > result. 
> > I guess my wider point is that I'm not convinced that non-cacheable is > actually much better and I think we're going way off the deep end looking > at what particular implementations do and trying to justify to ourselves > that non-cacheable is safe, even though it's still a normal memory type > at the end of the day. Is this about Device vs NC or Device/NC vs Normal Cacheable? The justification for the former has been summarised in Lorenzo's write-up. How the hardware behaves, it depends a lot on the RAS implementation. The BSA has some statements but not sure it covers everything. Things can go wrong but that's not because Device does anything better. Given the RAS implementation, external aborts caused on Device memory (e.g. wrong size access) is uncontainable. For Normal NC it can be contained (I can dig out the reasoning behind this if you want, IIUC something to do with not being able to cancel an already issued Device access since such accesses don't allow speculation due to side-effects; for Normal NC, it's just about the software not getting the data). > Obviously, it's up to Marc and Oliver if they want to do this, but I'm > wary without an official statement from Arm to say that Normal-NC is > correct. There's mention of such a statement in the cover letter: > > > We hope ARM will publish information helping platform designers > > follow these guidelines. > > but imo we shouldn't merge this without either: > > (a) _Architectural_ guidance (as opposed to some random whitepaper or > half-baked certification scheme). Well, you know the story, the architects will probably make it a SoC or integration issue, PCIe etc., not something that can live in the Arm ARM. The best we could get is more recommendations in the RAS spec around containment but not for things that might happen outside the CPU, e.g. PCIe root complex. > - or - > > (b) A concrete justification based on the current architecture as to > why Normal-NC is the right thing to do for KVM. To put it differently, we don't have any strong arguments why Device is the right thing to do. We chose Device based on some understanding software people had about how the hardware behaves, which apparently wasn't entirely correct (and summarised by Lorenzo). -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-12 17:26 ` Catalin Marinas @ 2023-10-13 9:29 ` Will Deacon 2023-10-13 13:08 ` Catalin Marinas 0 siblings, 1 reply; 55+ messages in thread From: Will Deacon @ 2023-10-13 9:29 UTC (permalink / raw) To: Catalin Marinas Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote: > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > > Claiming back the device also seems strange if the guest has been using > > non-cacheable accesses since I think you could get write merging and > > reordering with subsequent device accesses trying to reset the device. > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)). We do have a good story for this part: use Device-nGnRE! > > > So, for now I'd only relax this if we know there's RAM(-like) on the > > > other side and won't trigger some potentially uncontainable errors as a > > > result. > > > > I guess my wider point is that I'm not convinced that non-cacheable is > > actually much better and I think we're going way off the deep end looking > > at what particular implementations do and trying to justify to ourselves > > that non-cacheable is safe, even though it's still a normal memory type > > at the end of the day. > > Is this about Device vs NC or Device/NC vs Normal Cacheable? The > justification for the former has been summarised in Lorenzo's write-up. > How the hardware behaves, it depends a lot on the RAS implementation. > The BSA has some statements but not sure it covers everything. I don't know how to be more clear, but I'm asking why Normal-NC is the right memory type to use. Jason's explanation on the other thread about how it's basically the only option with FWB (with some hand-waving about why Normal-cacheable isn't safe) will have to do, but it needs to be in the commit message. The host maps MMIO with Device-nGnRE. Allowing a guest to map it as Normal surely means the host is going to need additional logic to deal with that? We briefly discussed claiming back a device above and I'm not convinced that the code we have for doing that today will work correctly if the guest has issued a bunch of Normal-NC stores prior to the device being unmapped. Could we change these patches so that the memory type of the stage-1 VMA in the VMM is reflected in the stage-2? In other words, continue to use Device mappings at stage-2 for I/O but relax to Normal-NC if that's how the VMM has it mapped? > Things can go wrong but that's not because Device does anything better. > Given the RAS implementation, external aborts caused on Device memory > (e.g. wrong size access) is uncontainable. For Normal NC it can be > contained (I can dig out the reasoning behind this if you want, IIUC > something to do with not being able to cancel an already issued Device > access since such accesses don't allow speculation due to side-effects; > for Normal NC, it's just about the software not getting the data). I really think these details belong in the commit message. > > Obviously, it's up to Marc and Oliver if they want to do this, but I'm > > wary without an official statement from Arm to say that Normal-NC is > > correct. 
There's mention of such a statement in the cover letter: > > > > > We hope ARM will publish information helping platform designers > > > follow these guidelines. > > > > but imo we shouldn't merge this without either: > > > > (a) _Architectural_ guidance (as opposed to some random whitepaper or > > half-baked certification scheme). > > Well, you know the story, the architects will probably make it a SoC or > integration issue, PCIe etc., not something that can live in the Arm > ARM. The best we could get is more recommendations in the RAS spec > around containment but not for things that might happen outside the CPU, > e.g. PCIe root complex. The Arm ARM _does_ mention PCI config space when talking about early write acknowledgement, so there's some precedence for providing guidance around which memory types to use. > > - or - > > > > (b) A concrete justification based on the current architecture as to > > why Normal-NC is the right thing to do for KVM. > > To put it differently, we don't have any strong arguments why Device is > the right thing to do. We chose Device based on some understanding > software people had about how the hardware behaves, which apparently > wasn't entirely correct (and summarised by Lorenzo). I think we use Device because that's what the host uses in its stage-1 and mismatched aliases are bad. Will ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-13 9:29 ` Will Deacon @ 2023-10-13 13:08 ` Catalin Marinas 2023-10-13 13:45 ` Jason Gunthorpe ` (2 more replies) 0 siblings, 3 replies; 55+ messages in thread From: Catalin Marinas @ 2023-10-13 13:08 UTC (permalink / raw) To: Will Deacon Cc: Lorenzo Pieralisi, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote: > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote: > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > > > Claiming back the device also seems strange if the guest has been using > > > non-cacheable accesses since I think you could get write merging and > > > reordering with subsequent device accesses trying to reset the device. > > > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)). > > We do have a good story for this part: use Device-nGnRE! Don't we actually need Device-nGnRnE for this, coupled with a DSB for endpoint completion? Device-nGnRE may be sufficient as a read from that device would ensure that the previous write is observable (potentially with a DMB if accessing separate device regions) but I don't think we do this now either. Even this, isn't it device-specific? I don't know enough about PCIe, posted writes, reordering, maybe others can shed some light. For Normal NC, if the access doesn't have side-effects (or rather the endpoint is memory-like), I think we are fine. The Stage 2 unmapping + TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU was pushed sufficiently far as not to affect subsequent writes by other CPUs. For I/O accesses that change some state of the device, I'm not sure the TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only nE + DSB as long as the PCIe device plays along nicely. > Could we change these patches so that the memory type of the stage-1 VMA > in the VMM is reflected in the stage-2? In other words, continue to use > Device mappings at stage-2 for I/O but relax to Normal-NC if that's > how the VMM has it mapped? We've been through this and it's not feasible. The VMM does not have detailed knowledge of the BARs of the PCIe device it is mapping (and the prefetchable BAR attribute is useless). It may end up with a Normal mapping of a BAR with read side-effects. It's only the guest driver that knows all the details. The safest is for the VMM to keep it as Device (I think vfio-pci goes for the strongest nGnRnE). Yes, we end up with mismatched aliases but they only matter if the VMM also accesses the I/O range via its own mapping. So far I haven't seen case that suggests this. > > Things can go wrong but that's not because Device does anything better. > > Given the RAS implementation, external aborts caused on Device memory > > (e.g. wrong size access) is uncontainable. For Normal NC it can be > > contained (I can dig out the reasoning behind this if you want, IIUC > > something to do with not being able to cancel an already issued Device > > access since such accesses don't allow speculation due to side-effects; > > for Normal NC, it's just about the software not getting the data). > > I really think these details belong in the commit message. I guess another task for Lorenzo ;). 
> > > Obviously, it's up to Marc and Oliver if they want to do this, but I'm > > > wary without an official statement from Arm to say that Normal-NC is > > > correct. There's mention of such a statement in the cover letter: > > > > > > > We hope ARM will publish information helping platform designers > > > > follow these guidelines. > > > > > > but imo we shouldn't merge this without either: > > > > > > (a) _Architectural_ guidance (as opposed to some random whitepaper or > > > half-baked certification scheme). > > > > Well, you know the story, the architects will probably make it a SoC or > > integration issue, PCIe etc., not something that can live in the Arm > > ARM. The best we could get is more recommendations in the RAS spec > > around containment but not for things that might happen outside the CPU, > > e.g. PCIe root complex. > > The Arm ARM _does_ mention PCI config space when talking about early write > acknowledgement, so there's some precedence for providing guidance around > which memory types to use. Ah, yes, it looks like it does, though mostly around the config space. We could ask them to add some notes but I don't think we have the problem well defined yet. Trying to restate what we aim: the guest driver knows what attributes it needs and would set the appropriate attributes: Device or Normal. KVM's role is not to fix bugs in the guest driver by constraining the attributes but rather to avoid potential security issues with malicious (or buggy) guests: 1) triggering uncontained errors 2) accessing memory that it shouldn't (like the MTE tag access) 3) causing delayed side-effects after the host reclaims the device ... anything else? For (1), Normal NC vs. Device doesn't make any difference, slightly better for the former. (2) so far is solved by not allowing Cacheable (or disabling MTE, enabling FEAT_MTE_PERM in the future). I'm now trying to understand (3), I think it needs more digging. > > > (b) A concrete justification based on the current architecture as to > > > why Normal-NC is the right thing to do for KVM. > > > > To put it differently, we don't have any strong arguments why Device is > > the right thing to do. We chose Device based on some understanding > > software people had about how the hardware behaves, which apparently > > wasn't entirely correct (and summarised by Lorenzo). > > I think we use Device because that's what the host uses in its stage-1 > and mismatched aliases are bad. They are "constrained" bad ;). -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
  2023-10-13 13:08 ` Catalin Marinas
@ 2023-10-13 13:45 ` Jason Gunthorpe
  2023-10-19 11:07 ` Catalin Marinas
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2023-10-13 13:45 UTC (permalink / raw)
To: Catalin Marinas
Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton,
    aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple,
    jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel

On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:
> On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote:
> > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > > > Claiming back the device also seems strange if the guest has been using
> > > > non-cacheable accesses since I think you could get write merging and
> > > > reordering with subsequent device accesses trying to reset the device.
> > >
> > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).
> >
> > We do have a good story for this part: use Device-nGnRE!
>
> Don't we actually need Device-nGnRnE for this, coupled with a DSB for
> endpoint completion?
>
> Device-nGnRE may be sufficient as a read from that device would ensure
> that the previous write is observable (potentially with a DMB if
> accessing separate device regions) but I don't think we do this now
> either. Even this, isn't it device-specific? I don't know enough about
> PCIe, posted writes, reordering, maybe others can shed some light.
>
> For Normal NC, if the access doesn't have side-effects (or rather the
> endpoint is memory-like), I think we are fine. The Stage 2 unmapping +
> TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU
> was pushed sufficiently far as not to affect subsequent writes by other
> CPUs.
>
> For I/O accesses that change some state of the device, I'm not sure the
> TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only
> nE + DSB as long as the PCIe device plays along nicely.

Can someone explain this concern a little more simply please?

Let's try something simpler. I have no KVM. My kernel driver creates a
VMA with pgprot_writecombine (NormalNC). Userspace does a write to the
NormalNC mapping and immediately unmaps the VMA. What is the issue?

And then how does making KVM the thing that creates the NormalNC
change this?

Not knowing the whole details, here is my story about how it should
work:

Unmapping the VMAs must already have some NormalNC-friendly ordering
barrier across all CPUs or we have a bigger problem. This barrier
definitely must close write combining.

VFIO issues a config space write to reset the PCI function. Config
space writes MUST NOT write combine with anything. This is already
impossible for PCIe since they are different TLP types at the PCIe
level.

By the PCIe rules, config space writes must order strictly after all
other CPUs' accesses. Once the reset non-posted write returns back to
VFIO we know that:

 1) There is no reference in any CPU page table to the MMIO PFN
 2) No CPU has pending data in any write buffer
 3) The interconnect and PCIe fabric have no inflight operations
 4) The device is in a clean post-reset state

?

> knows all the details. The safest is for the VMM to keep it as Device (I
> think vfio-pci goes for the strongest nGnRnE).

We are probably going to allow VFIO to let userspace pick if it should
be pgprot_device or pgprot_writecombine.

The alias issue could be resolved by teaching KVM how to insert a
physical PFN based on some VFIO FD/dmabuf rather than a VMA so that
the PFNs are never mapped on the hypervisor side.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread
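For concreteness, the "no KVM" scenario above could look roughly like
the following kernel-side sketch - the driver name, the bar_phys
variable and the mmap wiring are assumptions for illustration, not
existing code:

    /*
     * Sketch only: a fictional driver handing userspace a write-combining
     * (Normal-NC) mapping of a device BAR. 'bar_phys' is assumed to be
     * filled in by the driver's probe path.
     */
    #include <linux/fs.h>
    #include <linux/mm.h>

    static phys_addr_t bar_phys;    /* set at probe time (assumption) */

    static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
    {
            unsigned long size = vma->vm_end - vma->vm_start;

            /* Normal-NC at stage 1, the case discussed in the thread */
            vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
            return remap_pfn_range(vma, vma->vm_start,
                                   bar_phys >> PAGE_SHIFT, size,
                                   vma->vm_page_prot);
    }

Userspace then simply mmap()s the character device, stores to the
mapping and munmap()s it; the question in the thread is what ordering
the unmap path (TLBI + DSB) provides for those stores.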
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-13 13:45 ` Jason Gunthorpe @ 2023-10-19 11:07 ` Catalin Marinas 2023-10-19 11:51 ` Jason Gunthorpe 2023-10-19 13:35 ` Lorenzo Pieralisi 0 siblings, 2 replies; 55+ messages in thread From: Catalin Marinas @ 2023-10-19 11:07 UTC (permalink / raw) To: Jason Gunthorpe Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Fri, Oct 13, 2023 at 10:45:41AM -0300, Jason Gunthorpe wrote: > On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote: > > On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote: > > > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote: > > > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote: > > > > > Claiming back the device also seems strange if the guest has been using > > > > > non-cacheable accesses since I think you could get write merging and > > > > > reordering with subsequent device accesses trying to reset the device. > > > > > > > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)). > > > > > > We do have a good story for this part: use Device-nGnRE! > > > > Don't we actually need Device-nGnRnE for this, coupled with a DSB for > > endpoint completion? > > > > Device-nGnRE may be sufficient as a read from that device would ensure > > that the previous write is observable (potentially with a DMB if > > accessing separate device regions) but I don't think we do this now > > either. Even this, isn't it device-specific? I don't know enough about > > PCIe, posted writes, reordering, maybe others can shed some light. > > > > For Normal NC, if the access doesn't have side-effects (or rather the > > endpoint is memory-like), I think we are fine. The Stage 2 unmapping + > > TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU > > was pushed sufficiently far as not to affect subsequent writes by other > > CPUs. > > > > For I/O accesses that change some state of the device, I'm not sure the > > TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only > > nE + DSB as long as the PCIe device plays along nicely. > > Can someone explain this concern a little more simply please? > > Let's try something simpler. I have no KVM. My kernel driver > creates a VMA with pgprot_writecombine (NormalNC). Userpsace does a > write to the NormalNC and immediately unmaps the VMA > > What is the issue? The issue is when the device is reclaimed to be given to another user-space process, do we know the previous transaction reached the device? With the TLBI + DSB in unmap, we can only tell that a subsequent map + write (in a new process) is ordered after the write in the old process. Maybe that's sufficient in most cases. > And then how does making KVM the thing that creates the NormalNC > change this? They are similar. > Not knowing the whole details, here is my story about how it should work: > > Unmapping the VMA's must already have some NormalNC friendly ordering > barrier across all CPUs or we have a bigger problem. This barrier > definately must close write combining. I think what we have is TLBI + DSB generating the DVM/DVMSync messages should ensure that the Normal NC writes on other CPUs reach some serialisation point (not necessarily the device, AFAICT we can't guarantee end-point completion here). > VFIO issues a config space write to reset the PCI function. 
Config > space writes MUST NOT write combine with anything. This is already > impossible for PCIe since they are different TLP types at the PCIe > level. Yes, config space writes are fine, vfio-pci even maps them as Device_nGnRnE. But AFAIK a guest won't have direct access to the config space. > By the PCIe rules, config space write must order strictly after all > other CPU's accesses. Once the reset non-posted write returns back to > VFIO we know that: > > 1) There is no reference in any CPU page table to the MMIO PFN > 2) No CPU has pending data in any write buffer > 3) The interconnect and PCIe fabric have no inflight operations > 4) The device is in a clean post-reset state I think from the CPU perspective, we can guarantee that a Normal_NC write on CPU0 for example reaches a serialisation point before a config space (Device_nGnRnE) write on CPU1 by the host as long as CPU1 issued a TLBI+DSB. Now, what I'm not sure is what this serialisation point is. If it is the PCIe root complex, we are probably fine, we hope it deals with any ordering between the Normal_NC write and the config space one. Talking to Will earlier, I think we can deem the PCIe scenario (somewhat) safe but not as a generic mechanism for other non-PCIe devices (e.g. platform). With this concern, can we make this Stage 2 relaxation in KVM only for vfio-pci mappings? I don't have an example of non-PCIe device assignment to figure out how this should work though. > > knows all the details. The safest is for the VMM to keep it as Device (I > > think vfio-pci goes for the strongest nGnRnE). > > We are probably going to allow VFIO to let userspace pick if it should > be pgprot_device or pgprot_writecombine. I guess that's for the direct use by an application rather than VMM+VM. IIUC people work around this currently by mapping PCIe BARs as pgprot_writecombine() via sysfs. Getting vfio-pci to allow different mappings is probably a good idea, though it doesn't currently help with the KVM case as we can't force the VMM to know the specifics of the device it is giving to a guest. > The alias issue could be resolved by teaching KVM how to insert a > physical PFN based on some VFIO FD/dmabuf rather than a VMA so that > the PFNs are never mapped in the hypervisor side. This should work as well and solves the aliasing problem, though it requires changes to the VMM as well, not just KVM, which currently relies on vfio-pci mmap(). -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-19 11:07 ` Catalin Marinas @ 2023-10-19 11:51 ` Jason Gunthorpe 2023-10-20 11:21 ` Catalin Marinas 2023-10-19 13:35 ` Lorenzo Pieralisi 1 sibling, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-10-19 11:51 UTC (permalink / raw) To: Catalin Marinas Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: > The issue is when the device is reclaimed to be given to another > user-space process, do we know the previous transaction reached the > device? With the TLBI + DSB in unmap, we can only tell that a subsequent > map + write (in a new process) is ordered after the write in the old > process. Maybe that's sufficient in most cases. I can't think of a case where ordering is insufficient. As long as the device perceives a strict seperation in order of writes preceeding the 'reset' writel and writes following the reset writel() the absolute time of completion can not matter. At least for VFIO PCI the reset sequences uses non-posted config TLPs followed by config read TLPs. For something like mlx5 when it reclaims a WC VMA it does a RPC which is dma_wmb(), a writel(), a device DMA read, a device DMA write and a CPU read fo DMA'd data. Both of these are pretty "firm" ordering barriers. > > Unmapping the VMA's must already have some NormalNC friendly ordering > > barrier across all CPUs or we have a bigger problem. This barrier > > definately must close write combining. > > I think what we have is TLBI + DSB generating the DVM/DVMSync messages > should ensure that the Normal NC writes on other CPUs reach some > serialisation point (not necessarily the device, AFAICT we can't > guarantee end-point completion here). This is what our internal architects thought as well. > Talking to Will earlier, I think we can deem the PCIe scenario > (somewhat) safe but not as a generic mechanism for other non-PCIe > devices (e.g. platform). With this concern, can we make this Stage 2 > relaxation in KVM only for vfio-pci mappings? I don't have an example of > non-PCIe device assignment to figure out how this should work though. It is not a KVM problem. As I implied above it is VFIO's responsibility to reliably reset the device, not KVMs. If for some reason VFIO cannot do that on some SOC then VFIO devices should not exist. It is not KVM's job to double guess VFIO's own security properties. Specifically about platform the generic VFIO platform driver is the ACPI based one. If the FW provides an ACPI method for device reset that is not properly serializing, that is a FW bug. We can quirk it in VFIO and block using those devices if they actually exist. I expect the non-generic VFIO platform drivers to take care of this issue internally with, barriers, read from devices, whatver is required to make their SOCs order properly. Just as I would expect a normal Linux platform driver to directly manage whatever implementation specific ordering quirks the SOC may have. Generic PCI drivers are the ones that rely on the implicit ordering at the PCIe TLP level from TLBI + DSB. I would say this is part of the undocumented arch API around pgprot_wc > > > knows all the details. The safest is for the VMM to keep it as Device (I > > > think vfio-pci goes for the strongest nGnRnE). 
> > > > We are probably going to allow VFIO to let userspace pick if it should > > be pgprot_device or pgprot_writecombine. > > I guess that's for the direct use by an application rather than > VMM+VM. Yes, there have been requests for this. > IIUC people work around this currently by mapping PCIe BARs as > pgprot_writecombine() via sysfs. Getting vfio-pci to allow different > mappings is probably a good idea, though it doesn't currently help with > the KVM case as we can't force the VMM to know the specifics of the > device it is giving to a guest. Yes to all points Thanks, Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
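The reclaim argument in this mail — tear down the user/VM mappings first, then drive the reset through non-posted config transactions — can be pictured with a simplified kernel-side sketch. This is illustrative only (it is not the actual vfio-pci code path; locking and error handling are omitted):

	#include <linux/mm.h>
	#include <linux/pci.h>

	/* Illustrative sketch only -- not the real vfio-pci reclaim path. */
	static void example_reclaim_and_reset(struct address_space *bar_mapping,
					      struct pci_dev *pdev)
	{
		/*
		 * 1) Tear down every user/VMM mapping of the BARs.  On arm64 the
		 *    unmap path ends with TLBI + DSB, which pushes pending
		 *    Normal-NC (write-combined) data out of the CPUs towards the
		 *    interconnect.
		 */
		unmap_mapping_range(bar_mapping, 0, 0, 1);

		/*
		 * 2) Reset the function.  The reset is driven by non-posted
		 *    config TLPs, which PCIe orders after all earlier memory
		 *    writes to the device, so no stale WC data can arrive at
		 *    the device after the reset completes.
		 */
		pci_reset_function(pdev);
	}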
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-19 11:51 ` Jason Gunthorpe @ 2023-10-20 11:21 ` Catalin Marinas 2023-10-20 11:47 ` Jason Gunthorpe 0 siblings, 1 reply; 55+ messages in thread From: Catalin Marinas @ 2023-10-20 11:21 UTC (permalink / raw) To: Jason Gunthorpe Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote: > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: > > Talking to Will earlier, I think we can deem the PCIe scenario > > (somewhat) safe but not as a generic mechanism for other non-PCIe > > devices (e.g. platform). With this concern, can we make this Stage 2 > > relaxation in KVM only for vfio-pci mappings? I don't have an example of > > non-PCIe device assignment to figure out how this should work though. > > It is not a KVM problem. As I implied above it is VFIO's > responsibility to reliably reset the device, not KVMs. If for some > reason VFIO cannot do that on some SOC then VFIO devices should not > exist. > > It is not KVM's job to double guess VFIO's own security properties. I'd argue that since KVM is the one relaxing the memory attributes beyond what the VFIO driver allows the VMM to use, it is KVM's job to consider the security implications. This is fine for vfio-pci and Normal_NC but I'm not sure we can generalise. > Specifically about platform the generic VFIO platform driver is the > ACPI based one. If the FW provides an ACPI method for device reset > that is not properly serializing, that is a FW bug. We can quirk it in > VFIO and block using those devices if they actually exist. > > I expect the non-generic VFIO platform drivers to take care of this > issue internally with, barriers, read from devices, whatver is > required to make their SOCs order properly. Just as I would expect a > normal Linux platform driver to directly manage whatever > implementation specific ordering quirks the SOC may have. This would be a new requirement if an existing VFIO platform driver relied on all mappings being Device. But maybe that's just theoretical at the moment, are there any concrete examples outside vfio-pci? If not, we can document it as per Lorenzo's suggestion to summarise this discussion under Documentation/. -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-20 11:21 ` Catalin Marinas @ 2023-10-20 11:47 ` Jason Gunthorpe 2023-10-20 14:03 ` Lorenzo Pieralisi 0 siblings, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-10-20 11:47 UTC (permalink / raw) To: Catalin Marinas Cc: Will Deacon, Lorenzo Pieralisi, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Fri, Oct 20, 2023 at 12:21:50PM +0100, Catalin Marinas wrote: > On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote: > > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: > > > Talking to Will earlier, I think we can deem the PCIe scenario > > > (somewhat) safe but not as a generic mechanism for other non-PCIe > > > devices (e.g. platform). With this concern, can we make this Stage 2 > > > relaxation in KVM only for vfio-pci mappings? I don't have an example of > > > non-PCIe device assignment to figure out how this should work though. > > > > It is not a KVM problem. As I implied above it is VFIO's > > responsibility to reliably reset the device, not KVMs. If for some > > reason VFIO cannot do that on some SOC then VFIO devices should not > > exist. > > > > It is not KVM's job to double guess VFIO's own security properties. > > I'd argue that since KVM is the one relaxing the memory attributes > beyond what the VFIO driver allows the VMM to use, it is KVM's job to > consider the security implications. This is fine for vfio-pci and > Normal_NC but I'm not sure we can generalise. I can see that, but I belive we should take this responsibility into VFIO as a requirement. As I said in the other email we do want to extend VFIO to support NormalNC VMAs for DPDK, so we need to take this anyhow. > > Specifically about platform the generic VFIO platform driver is the > > ACPI based one. If the FW provides an ACPI method for device reset > > that is not properly serializing, that is a FW bug. We can quirk it in > > VFIO and block using those devices if they actually exist. > > > > I expect the non-generic VFIO platform drivers to take care of this > > issue internally with, barriers, read from devices, whatver is > > required to make their SOCs order properly. Just as I would expect a > > normal Linux platform driver to directly manage whatever > > implementation specific ordering quirks the SOC may have. > > This would be a new requirement if an existing VFIO platform driver > relied on all mappings being Device. But maybe that's just theoretical > at the moment, are there any concrete examples outside vfio-pci? If not, > we can document it as per Lorenzo's suggestion to summarise this > discussion under Documentation/. My point is if this becomes a real world concern we have a solid answer on how to resolve it - fix the VFIO driver to have a stronger barrier before reset. I'm confident it is not a problem for PCI and IIRC the remaining ARM platform drivers were made primarily for DPDK, not KVM. So I agree with documenting and perhaps a comment someplace in VFIO is also warranted. Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
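For a non-PCI (platform) VFIO driver, the SoC-specific ordering described above typically reduces to a barrier plus a read-back from the device before declaring the reset done. A hypothetical helper might look like the sketch below; the register offset, bit name and settle delay are made up purely for illustration:

	#include <linux/bits.h>
	#include <linux/delay.h>
	#include <linux/io.h>

	/* Hypothetical register layout, purely for illustration. */
	#define EXAMPLE_RESET_REG	0x10
	#define EXAMPLE_RESET_BIT	BIT(0)

	static void example_platform_reset(void __iomem *regs)
	{
		/* Make sure prior posted/Normal-NC writes are issued first. */
		wmb();

		writel(EXAMPLE_RESET_BIT, regs + EXAMPLE_RESET_REG);

		/*
		 * Read back from the device: the read cannot complete until the
		 * posted reset write (and everything ordered before it) is
		 * visible at the endpoint -- the "stronger barrier before
		 * reset" mentioned in the discussion.
		 */
		(void)readl(regs + EXAMPLE_RESET_REG);

		/* Let the device settle; the value is device specific. */
		usleep_range(100, 200);
	}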
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-20 11:47 ` Jason Gunthorpe @ 2023-10-20 14:03 ` Lorenzo Pieralisi 2023-10-20 14:28 ` Jason Gunthorpe 0 siblings, 1 reply; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-10-20 14:03 UTC (permalink / raw) To: Jason Gunthorpe Cc: Catalin Marinas, Will Deacon, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Fri, Oct 20, 2023 at 08:47:19AM -0300, Jason Gunthorpe wrote: > On Fri, Oct 20, 2023 at 12:21:50PM +0100, Catalin Marinas wrote: > > On Thu, Oct 19, 2023 at 08:51:42AM -0300, Jason Gunthorpe wrote: > > > On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: > > > > Talking to Will earlier, I think we can deem the PCIe scenario > > > > (somewhat) safe but not as a generic mechanism for other non-PCIe > > > > devices (e.g. platform). With this concern, can we make this Stage 2 > > > > relaxation in KVM only for vfio-pci mappings? I don't have an example of > > > > non-PCIe device assignment to figure out how this should work though. > > > > > > It is not a KVM problem. As I implied above it is VFIO's > > > responsibility to reliably reset the device, not KVMs. If for some > > > reason VFIO cannot do that on some SOC then VFIO devices should not > > > exist. > > > > > > It is not KVM's job to double guess VFIO's own security properties. > > > > I'd argue that since KVM is the one relaxing the memory attributes > > beyond what the VFIO driver allows the VMM to use, it is KVM's job to > > consider the security implications. This is fine for vfio-pci and > > Normal_NC but I'm not sure we can generalise. > > I can see that, but I belive we should take this responsibility into > VFIO as a requirement. As I said in the other email we do want to > extend VFIO to support NormalNC VMAs for DPDK, so we need to take this > anyhow. > > > > Specifically about platform the generic VFIO platform driver is the > > > ACPI based one. If the FW provides an ACPI method for device reset > > > that is not properly serializing, that is a FW bug. We can quirk it in > > > VFIO and block using those devices if they actually exist. > > > > > > I expect the non-generic VFIO platform drivers to take care of this > > > issue internally with, barriers, read from devices, whatver is > > > required to make their SOCs order properly. Just as I would expect a > > > normal Linux platform driver to directly manage whatever > > > implementation specific ordering quirks the SOC may have. > > > > This would be a new requirement if an existing VFIO platform driver > > relied on all mappings being Device. But maybe that's just theoretical > > at the moment, are there any concrete examples outside vfio-pci? If not, > > we can document it as per Lorenzo's suggestion to summarise this > > discussion under Documentation/. > > My point is if this becomes a real world concern we have a solid > answer on how to resolve it - fix the VFIO driver to have a stronger > barrier before reset. Just to make sure I am parsing this correctly: this case above is related to a non-PCI VFIO device passthrough where a guest would want to map the device MMIO at stage-1 with normal-NC memory type (well, let's say with a memory attribute != device-nGnRE - that combined with the new stage-2 default might cause transactions ordering/grouping trouble with eg device resets), correct ? 
IIRC, all requests related to honouring "write-combine" style stage-1 mappings were for PCI(e) devices but that's as far as what *I* was made aware of goes. > I'm confident it is not a problem for PCI and IIRC the remaining ARM > platform drivers were made primarily for DPDK, not KVM. > > So I agree with documenting and perhaps a comment someplace in VFIO is > also warranted. We will do that, I will start adding the recent discussions to the new documentation file. Side note: for those who attend LPC it would be useful to review the resulting documentation together there, it should happen around v6.7-rc1. Lorenzo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-20 14:03 ` Lorenzo Pieralisi @ 2023-10-20 14:28 ` Jason Gunthorpe 0 siblings, 0 replies; 55+ messages in thread From: Jason Gunthorpe @ 2023-10-20 14:28 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: Catalin Marinas, Will Deacon, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Fri, Oct 20, 2023 at 04:03:57PM +0200, Lorenzo Pieralisi wrote: > > My point is if this becomes a real world concern we have a solid > > answer on how to resolve it - fix the VFIO driver to have a stronger > > barrier before reset. > > Just to make sure I am parsing this correctly: this case above is > related to a non-PCI VFIO device passthrough where a guest would want to > map the device MMIO at stage-1 with normal-NC memory type (well, let's > say with a memory attribute != device-nGnRE - that combined with the new > stage-2 default might cause transactions ordering/grouping trouble with > eg device resets), correct ? This is what I have understood was Will's concern, yes. > IIRC, all requests related to honouring "write-combine" style > stage-1 mappings were for PCI(e) devices but that's as far as what > *I* was made aware of goes. Yes, this is what I am aware of as well. Though I do not object to the idea from the VFIO side that platform devices would also have to support NormalNC. The theoretical missing piece is that someone would say they have a SOC issue XYZ and thus their VFIO platform devices must fully block NormalNC. I suggest that if someone comes along with this and really, really wants VFIO, then we could use a VMA flag to indicate that KVM must not upgrade it. Currently I have no knowledge of such a thing existing. With PCI we've made the argument that if NormalNC is broken/unsafe for KVM on the SOC then probably so is Device-*. I think the same basic argument holds for platform devices too. Thus I'm skeptical that someone can come and say they have SOC issue XYZ and NormalNC is broken but Device-* is perfectly safe. > We will do that, I will start adding the recent discussions to the > new documentation file. Side note: for those who attend LPC it would be > useful to review the resulting documentation together there, it should > happen around v6.7-rc1. I will be there, let me know. Thanks, Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
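As an aside, the VMA-flag idea mentioned above is easy to picture. The sketch below is purely hypothetical — neither the flag name nor the helper exists upstream; the idea is only that a quirky SoC's VFIO driver could mark its VMAs and KVM's stage-2 fault path would then keep the stricter Device attributes:

	#include <linux/mm.h>

	/* Purely hypothetical: the flag and helper are illustrative only. */
	#define VM_EXAMPLE_NO_NC_UPGRADE	VM_ARCH_1	/* placeholder bit */

	static bool example_stage2_may_use_normal_nc(struct vm_area_struct *vma)
	{
		return !(vma->vm_flags & VM_EXAMPLE_NO_NC_UPGRADE);
	}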
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-19 11:07 ` Catalin Marinas 2023-10-19 11:51 ` Jason Gunthorpe @ 2023-10-19 13:35 ` Lorenzo Pieralisi 1 sibling, 0 replies; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-10-19 13:35 UTC (permalink / raw) To: Catalin Marinas Cc: Jason Gunthorpe, Will Deacon, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote: [...] > > VFIO issues a config space write to reset the PCI function. Config > > space writes MUST NOT write combine with anything. This is already > > impossible for PCIe since they are different TLP types at the PCIe > > level. > > Yes, config space writes are fine, vfio-pci even maps them as > Device_nGnRnE. But AFAIK a guest won't have direct access to the config > space. > > > By the PCIe rules, config space write must order strictly after all > > other CPU's accesses. Once the reset non-posted write returns back to > > VFIO we know that: > > > > 1) There is no reference in any CPU page table to the MMIO PFN > > 2) No CPU has pending data in any write buffer > > 3) The interconnect and PCIe fabric have no inflight operations > > 4) The device is in a clean post-reset state > > I think from the CPU perspective, we can guarantee that a Normal_NC > write on CPU0 for example reaches a serialisation point before a config > space (Device_nGnRnE) write on CPU1 by the host as long as CPU1 issued a > TLBI+DSB. Now, what I'm not sure is what this serialisation point is. If > it is the PCIe root complex, we are probably fine, we hope it deals with > any ordering between the Normal_NC write and the config space one. If it is the PCI host bridge (and for PCI it should be since it is the logic between the ARM world - where ARM ordering rules and barriers apply - and PCI protocol), either it enforces PCI ordering rules or it is broken by design; if it is the latter, at that stage device assignment would be the least of our problems. For non-PCI device assignment, I am not sure at all we can rely on anything other than what Jason mentioned, eg resets (and the components that through eg MMIO are carrying them out) are not architected, the device MMIO space and the MMIO space used to trigger the reset (again, it is an example) may well be placed on different interconnect paths, it is device specific. Lorenzo > Talking to Will earlier, I think we can deem the PCIe scenario > (somewhat) safe but not as a generic mechanism for other non-PCIe > devices (e.g. platform). With this concern, can we make this Stage 2 > relaxation in KVM only for vfio-pci mappings? I don't have an example of > non-PCIe device assignment to figure out how this should work though. > > > > knows all the details. The safest is for the VMM to keep it as Device (I > > > think vfio-pci goes for the strongest nGnRnE). > > > > We are probably going to allow VFIO to let userspace pick if it should > > be pgprot_device or pgprot_writecombine. > > I guess that's for the direct use by an application rather than VMM+VM. > IIUC people work around this currently by mapping PCIe BARs as > pgprot_writecombine() via sysfs. Getting vfio-pci to allow different > mappings is probably a good idea, though it doesn't currently help with > the KVM case as we can't force the VMM to know the specifics of the > device it is giving to a guest. 
> > > The alias issue could be resolved by teaching KVM how to insert a > > physical PFN based on some VFIO FD/dmabuf rather than a VMA so that > > the PFNs are never mapped in the hypervisor side. > > This should work as well and solves the aliasing problem, though it > requires changes to the VMM as well, not just KVM, which currently > relies on vfio-pci mmap(). > > -- > Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-13 13:08 ` Catalin Marinas 2023-10-13 13:45 ` Jason Gunthorpe @ 2023-10-13 15:28 ` Lorenzo Pieralisi 2023-10-19 11:12 ` Catalin Marinas 2023-11-09 15:34 ` Lorenzo Pieralisi 2 siblings, 1 reply; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-10-13 15:28 UTC (permalink / raw) To: Catalin Marinas Cc: Will Deacon, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote: [...] > Yes, we end up with mismatched aliases but they only matter if the VMM > also accesses the I/O range via its own mapping. So far I haven't seen > case that suggests this. > > > > Things can go wrong but that's not because Device does anything better. > > > Given the RAS implementation, external aborts caused on Device memory > > > (e.g. wrong size access) is uncontainable. For Normal NC it can be > > > contained (I can dig out the reasoning behind this if you want, IIUC > > > something to do with not being able to cancel an already issued Device > > > access since such accesses don't allow speculation due to side-effects; > > > for Normal NC, it's just about the software not getting the data). > > > > I really think these details belong in the commit message. > > I guess another task for Lorenzo ;). I will do, I start wondering though whether this documentation belongs in this commit log only or at Documentation/arch/arm64 level (or both), I am pretty sure this thread can turn out quite useful as a reference (it is for me) if we manage to summarize it that would benefit everyone. Lorenzo ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-13 15:28 ` Lorenzo Pieralisi @ 2023-10-19 11:12 ` Catalin Marinas 0 siblings, 0 replies; 55+ messages in thread From: Catalin Marinas @ 2023-10-19 11:12 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: Will Deacon, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Fri, Oct 13, 2023 at 05:28:10PM +0200, Lorenzo Pieralisi wrote: > On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote: > > [...] > > > Yes, we end up with mismatched aliases but they only matter if the VMM > > also accesses the I/O range via its own mapping. So far I haven't seen > > case that suggests this. > > > > > > Things can go wrong but that's not because Device does anything better. > > > > Given the RAS implementation, external aborts caused on Device memory > > > > (e.g. wrong size access) is uncontainable. For Normal NC it can be > > > > contained (I can dig out the reasoning behind this if you want, IIUC > > > > something to do with not being able to cancel an already issued Device > > > > access since such accesses don't allow speculation due to side-effects; > > > > for Normal NC, it's just about the software not getting the data). > > > > > > I really think these details belong in the commit message. > > > > I guess another task for Lorenzo ;). > > I will do, I start wondering though whether this documentation belongs > in this commit log only or at Documentation/arch/arm64 level (or both), > I am pretty sure this thread can turn out quite useful as a reference > (it is for me) if we manage to summarize it that would benefit > everyone. I think it makes sense to add something in the Documentation for easier future reference. We can also keep adding to it as we learn more. The commit log can be shorter in this case. -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-10-13 13:08 ` Catalin Marinas 2023-10-13 13:45 ` Jason Gunthorpe 2023-10-13 15:28 ` Lorenzo Pieralisi @ 2023-11-09 15:34 ` Lorenzo Pieralisi 2023-11-10 14:26 ` Jason Gunthorpe 2023-11-13 17:41 ` Catalin Marinas 2 siblings, 2 replies; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-11-09 15:34 UTC (permalink / raw) To: Catalin Marinas Cc: Will Deacon, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote: [...] > > > Things can go wrong but that's not because Device does anything better. > > > Given the RAS implementation, external aborts caused on Device memory > > > (e.g. wrong size access) is uncontainable. For Normal NC it can be > > > contained (I can dig out the reasoning behind this if you want, IIUC > > > something to do with not being able to cancel an already issued Device > > > access since such accesses don't allow speculation due to side-effects; > > > for Normal NC, it's just about the software not getting the data). > > > > I really think these details belong in the commit message. > > I guess another task for Lorenzo ;). Updated commit log (it might be [is] too verbose) below, it should probably be moved into a documentation file but to do that I should decouple it from this changeset (ie a document explaining memory attributes and error containment for ARM64 - indipendent from KVM S2 defaults). I'd also add a Link: to the lore archive entry for reference (did not add it in the log below). Please let me know what's the best option here. Thanks, Lorenzo -- >8 -- Currently, KVM for ARM64 maps at stage 2 memory that is considered device (ie it is not RAM) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system can't determine device MMIO mapping memory attributes on its own but it is always overriden by the KVM stage 2 default. This set-up does not allow guest operating systems to select device memory attributes on a page by page basis independently from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), which turns out to be an issue in that guest operating systems (eg Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (eg gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (eg unaligned transactions) such as the NormalNC memory type. The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (ie it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be imprecise. Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings. 
Actually, DEVICE_nGnRE memory type is even more problematic than eg Normal-NC memory type in terms of faults containability in that eg aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally, synchronous (ie that would imply that the processor should issue at most 1 load transaction at a time - ie it can't pipeline them - otherwise the synchronous abort semantics would break the no-speculation attribute attached to DEVICE_XXX memory). This means that regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures and that in turn relies on platform capabilities and the device type being assigned (ie PCIe AER/DPC error containment and RAS architecture[3]); therefore the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform (if the platform design adheres to design guidelines outlined in [3]) and therefore can be relaxed. For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to Normal-NC. A different Normal memory type default at stage-2 (eg Normal Write-through) could have been chosen instead of Normal-NC but Normal-NC was chosen because: - Its attributes are well-understood compared to other Normal memory types for MMIO mappings - On systems implementing S2FWB (FEAT_S2FWB), that's the only sensible default for normal memory types. For systems implementing FEAT_S2FWB (see [1] D8.5.5 S2=stage-2 - S2 MemAttr[3:0]), the options to choose the memory types are as follows: if S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX (XXX determined by S2 MemAttr[1:0]). This is what we have today (MemAttr[2:0] == 0b001) and therefore it is not described any further. if S2 MemAttr[2] == 1, there are the following options: S2 MemAttr[2:0] | Resulting mapping ----------------------------------------------------------------------------- 0b101 | Prevent the guest from selecting cachable memory type, it | allows it to choose Device-* or Normal-NC 0b110 | It forces write-back memory type; it breaks MMIO. 0b111 | Allow the VM to select any memory type including cachable. | It is unclear whether this is safe from a platform | perspective, especially wrt uncontained failures and | cacheability (eg device reclaim/reset and cache | maintenance). ------------------------------------------------------------------------------ - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping identical to FEAT_S2FWB (that basically would force Normal-NC, see option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach between the two Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to trigger any issue on guest device reclaim use cases either (ie device MMIO unmap followed by a device reset) at least for PCIe devices, in that in PCIe a device reset is architected and carried out through PCI config space transactions that are naturally ordered wrt MMIO transactions according to the PCI ordering rules. 
Having Normal-NC S2 default puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) - stage1], [(S2) - stage 2]): S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE<attr> | NORMAL-NC | DEVICE<attr> It is worth noting that currently, to map devices MMIO space to user space in a device pass-through use case the VFIO framework applies memory attributes derived from pgprot_noncached() settings applied to VMAs, which result in device-nGnRnE memory attributes for the stage-1 VMM mappings. This means that a userspace mapping for device MMIO space carried out with the current VFIO framework and a guest OS mapping for the same MMIO space may result in a mismatched alias as described in [2]. Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change anything in this respect, in that the mismatched aliases would only affect (refer to [2] for a detailed explanation) ordering between the userspace and GuestOS mappings resulting stream of transactions (ie it does not cause loss of property for either stream of transactions on its own), which is harmless given that the userspace and GuestOS access to the device is carried out through independent transactions streams. [1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf [2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf ^ permalink raw reply [flat|nested] 55+ messages in thread
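To make the stage-1/stage-2 combination table above concrete: with the relaxed stage-2 default, the guest driver picks the effective attributes through its own stage-1 mapping. An illustrative guest-side snippet (the BAR number is arbitrary and error handling is trimmed):

	#include <linux/io.h>
	#include <linux/pci.h>

	/*
	 * Illustrative guest driver snippet: the stage-1 attributes the guest
	 * asks for are the ones that take effect once stage 2 is Normal-NC.
	 */
	static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		void __iomem *db;

		/* Stage 1 Normal-NC (ioremap_wc) + stage 2 Normal-NC => Normal-NC */
		db = ioremap_wc(pci_resource_start(pdev, 2),
				pci_resource_len(pdev, 2));
		if (!db)
			return -ENOMEM;

		/* A plain ioremap() would give Device-nGnRE and stay Device. */

		writel_relaxed(0x1, db);	/* eligible for write combining */

		iounmap(db);
		return 0;
	}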
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-11-09 15:34 ` Lorenzo Pieralisi @ 2023-11-10 14:26 ` Jason Gunthorpe 2023-11-13 0:42 ` Lorenzo Pieralisi 2023-11-13 17:41 ` Catalin Marinas 1 sibling, 1 reply; 55+ messages in thread From: Jason Gunthorpe @ 2023-11-10 14:26 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: Catalin Marinas, Will Deacon, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote: > Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to > trigger any issue on guest device reclaim use cases either (ie device > MMIO unmap followed by a device reset) at least for PCIe devices, in that > in PCIe a device reset is architected and carried out through PCI config > space transactions that are naturally ordered wrt MMIO transactions > according to the PCI ordering rules. This is not how I see that thread concluding.. The device reclaim problem belongs solely to VFIO, not KVM. VFIO must ensure global ordering of access before the VMA is unmaped and access after, that includes ordering whatever mechanism the VFIO driver uses for reset. If there are quirky SOCs, or non-PCI devices that need something stronger than the TLBI/etc sequence it should be fixed in VFIO (or maybe even the arch code), not by blocking NORMAL_NC in the KVM. Such a quirky SOC would broadly have security issues beyond KVM. > It is worth noting that currently, to map devices MMIO space to user > space in a device pass-through use case the VFIO framework applies memory > attributes derived from pgprot_noncached() settings applied to VMAs, which Sometimes. VFIO uses a mix of pgprot_noncached and pgprot_device. AFAIK we should change to to always use pgprot_device.. Thanks, Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
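For reference, on arm64 the practical difference between these helpers is just the memory type (MAIR index) selected for the PTE. Abridged from arch/arm64/include/asm/pgtable.h (details trimmed, quoted from memory):

	#define pgprot_noncached(prot) \
		__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
				PTE_ATTRINDX(MT_DEVICE_nGnRnE) | PTE_PXN | PTE_UXN)
	#define pgprot_writecombine(prot) \
		__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
				PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)
	#define pgprot_device(prot) \
		__pgprot_modify(prot, PTE_ATTRINDX_MASK, \
				PTE_ATTRINDX(MT_DEVICE_nGnRE) | PTE_PXN | PTE_UXN)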
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-11-10 14:26 ` Jason Gunthorpe @ 2023-11-13 0:42 ` Lorenzo Pieralisi 0 siblings, 0 replies; 55+ messages in thread From: Lorenzo Pieralisi @ 2023-11-13 0:42 UTC (permalink / raw) To: Jason Gunthorpe Cc: Catalin Marinas, Will Deacon, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Fri, Nov 10, 2023 at 10:26:49AM -0400, Jason Gunthorpe wrote: > On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote: > > > Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to > > trigger any issue on guest device reclaim use cases either (ie device > > MMIO unmap followed by a device reset) at least for PCIe devices, in that > > in PCIe a device reset is architected and carried out through PCI config > > space transactions that are naturally ordered wrt MMIO transactions > > according to the PCI ordering rules. > > This is not how I see that thread concluding.. > > The device reclaim problem belongs solely to VFIO, not KVM. VFIO must > ensure global ordering of access before the VMA is unmaped and access > after, that includes ordering whatever mechanism the VFIO driver uses > for reset. > > If there are quirky SOCs, or non-PCI devices that need something > stronger than the TLBI/etc sequence it should be fixed in VFIO (or > maybe even the arch code), not by blocking NORMAL_NC in the KVM. Such > a quirky SOC would broadly have security issues beyond KVM. https://lore.kernel.org/linux-arm-kernel/20231013092934.GA13524@willie-the-truck I think that Will's point _was_ related to the change we are making for KVM S2 mappings and related device transactions on device reclaim - ie reset, that's what I tried to convey (I probably simplified too much) that for PCI at least that should not trigger any regression/issue, in that BAR MMIO and reset transactions are decoupled streams and must follow the PCI ordering rules. Yes, it is VFIO responsibility but changing the S2 KVM mappings may have (for non-PCI devices) side effects compared to what we have today, I am not saying this should be a blocker I just summarized the thread above, the paragraph can be expanded. Lorenzo > > > It is worth noting that currently, to map devices MMIO space to user > > space in a device pass-through use case the VFIO framework applies memory > > attributes derived from pgprot_noncached() settings applied to VMAs, which > > Sometimes. VFIO uses a mix of pgprot_noncached and pgprot_device. AFAIK > we should change to to always use pgprot_device.. > > Thanks, > Jason ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-11-09 15:34 ` Lorenzo Pieralisi 2023-11-10 14:26 ` Jason Gunthorpe @ 2023-11-13 17:41 ` Catalin Marinas 1 sibling, 0 replies; 55+ messages in thread From: Catalin Marinas @ 2023-11-13 17:41 UTC (permalink / raw) To: Lorenzo Pieralisi Cc: Will Deacon, Jason Gunthorpe, ankita, maz, oliver.upton, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel Hi Lorenzo, Thanks for putting this together. It looks fine to me, just some general comments below. On Thu, Nov 09, 2023 at 04:34:10PM +0100, Lorenzo Pieralisi wrote: > Updated commit log (it might be [is] too verbose) below, it should probably > be moved into a documentation file but to do that I should decouple > it from this changeset (ie a document explaining memory attributes > and error containment for ARM64 - indipendent from KVM S2 defaults). > > I'd also add a Link: to the lore archive entry for reference (did not > add it in the log below). > > Please let me know what's the best option here. It would be good to have this under Documentation/arch/arm64/, though as a document at the end of the series describing what the new behaviour is, the kernel/KVM expectations (after the patches have been applied). We could even submit it separately once we agreed on how the series looks like. > -- >8 -- > Currently, KVM for ARM64 maps at stage 2 memory that is > considered device (ie it is not RAM) with DEVICE_nGnRE > memory attributes; this setting overrides (as per the ARM > architecture [1]) any device MMIO mapping present at stage > 1, resulting in a set-up whereby a guest operating system > can't determine device MMIO mapping memory attributes on its > own but it is always overriden by the KVM stage 2 default. To be fully correct, stage 1 and stage 2 attributes combine in a way that results in the more restrictive properties, so not a simple override. But to keep things simple, especially if this part only goes in the commit log, leave it as is. > This set-up does not allow guest operating systems to select > device memory attributes on a page by page basis independently Not sure it's even worth specifying "page by page". It just doesn't allow the guest to relax the device map memory attributes at all. > from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage > 2 memory type attributes"), which turns out to be an issue in that > guest operating systems (eg Linux) may request to map > devices MMIO regions with memory attributes that guarantee > better performance (eg gathering attribute - that for some > devices can generate larger PCIe memory writes TLPs) > and specific operations (eg unaligned transactions) such as > the NormalNC memory type. Another case is correct guest behaviour where it would not expect unaligned accesses from Normal NC mappings. > The default device stage 2 mapping was chosen in KVM > for ARM64 since it was considered safer (ie it would > not allow guests to trigger uncontained failures > ultimately crashing the machine) but this turned out > to be imprecise. s/imprecise/asynchronous (SError)/ Another reason was probably that it matches the default VFIO mapping (though the latter is slightly stricter as in Device_nGnRnE). > Failures containability is a property of the platform > and is independent from the memory type used for MMIO > device memory mappings. 
> > Actually, DEVICE_nGnRE memory type is even more problematic > than eg Normal-NC memory type in terms of faults containability > in that eg aborts triggered on DEVICE_nGnRE loads cannot be made, > architecturally, synchronous (ie that would imply that the processor > should issue at most 1 load transaction at a time - ie it can't pipeline > them - otherwise the synchronous abort semantics would break the > no-speculation attribute attached to DEVICE_XXX memory). > > This means that regardless of the combined stage1+stage2 mappings a > platform is safe if and only if device transactions cannot trigger > uncontained failures and that in turn relies on platform > capabilities and the device type being assigned (ie PCIe AER/DPC > error containment and RAS architecture[3]); therefore the default > KVM device stage 2 memory attributes play no role in making device > assignment safer for a given platform (if the platform design > adheres to design guidelines outlined in [3]) and therefore can > be relaxed. > > For all these reasons, relax the KVM stage 2 device > memory attributes from DEVICE_nGnRE to Normal-NC. I think this covers the safety aspect w.r.t. uncontained errors. > A different Normal memory type default at stage-2 > (eg Normal Write-through) could have been chosen > instead of Normal-NC but Normal-NC was chosen > because: > > - Its attributes are well-understood compared to > other Normal memory types for MMIO mappings Ideally we'd have gone for Device_GRE from a performance perspective but it doesn't support unaligned accesses. Basically at this stage we don't want cacheable accesses to MMIO until we gain a better understanding on what the impact would be. > - On systems implementing S2FWB (FEAT_S2FWB), that's the only sensible > default for normal memory types. For systems implementing > FEAT_S2FWB (see [1] D8.5.5 S2=stage-2 - S2 MemAttr[3:0]), the options > to choose the memory types are as follows: > > if S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX > (XXX determined by S2 MemAttr[1:0]). This is what we have > today (MemAttr[2:0] == 0b001) and therefore it is not described > any further. > > if S2 MemAttr[2] == 1, there are the following options: > > S2 MemAttr[2:0] | Resulting mapping > ----------------------------------------------------------------------------- > 0b101 | Prevent the guest from selecting cachable memory type, it > | allows it to choose Device-* or Normal-NC > 0b110 | It forces write-back memory type; it breaks MMIO. > 0b111 | Allow the VM to select any memory type including cachable. > | It is unclear whether this is safe from a platform > | perspective, especially wrt uncontained failures and > | cacheability (eg device reclaim/reset and cache > | maintenance). > ------------------------------------------------------------------------------ > > - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping > identical to FEAT_S2FWB (that basically would force Normal-NC, see > option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach > between the two Is this text only to say that we can't use write-through memory because with FEAT_S2FWB it is forced cacheable? I'd probably keep in simple and just state that we went for Normal NC to avoid cache allocation/snooping. It just crossed my mind, I think we also assume here that on platforms with transparent caches, Normal NC of MMIO won't be upgraded to cacheable (that's irrespective of KVM). 
This may need to be stated in some Arm BSA document (I'm not aware of any arch rules that prevent this from happening). AFAIK, platforms with transparent caches only do this for the memory ranges where the DRAM is expected. > Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to > trigger any issue on guest device reclaim use cases either (ie device > MMIO unmap followed by a device reset) at least for PCIe devices, in that > in PCIe a device reset is architected and carried out through PCI config > space transactions that are naturally ordered wrt MMIO transactions > according to the PCI ordering rules. We could state this in a doc that for device pass-through, there's an expectation that the device can be reclaimed along the lines of the PCIe reset behaviour. If one admin decides to allow device pass-through for some unsafe devices/platforms, it's their problem really. I don't think a Device_nGnRE mapping would help in those cases anyway. This could be expanded to Normal Cacheable at some point but KVM would have to do cache maintenance after unmapping the device from the guest (so let's leave this for a separate series). > Having Normal-NC S2 default puts guests in control (thanks to > stage1+stage2 combined memory attributes rules [1]) of device MMIO > regions memory mappings, according to the rules described in [1] > and summarized here ([(S1) - stage1], [(S2) - stage 2]): > > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > It is worth noting that currently, to map devices MMIO space to user > space in a device pass-through use case the VFIO framework applies memory > attributes derived from pgprot_noncached() settings applied to VMAs, which > result in device-nGnRnE memory attributes for the stage-1 VMM mappings. > > This means that a userspace mapping for device MMIO space carried > out with the current VFIO framework and a guest OS mapping for the same > MMIO space may result in a mismatched alias as described in [2]. > > Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change > anything in this respect, in that the mismatched aliases would only affect > (refer to [2] for a detailed explanation) ordering between the userspace and > GuestOS mappings resulting stream of transactions (ie it does not cause loss of > property for either stream of transactions on its own), which is harmless > given that the userspace and GuestOS access to the device is carried > out through independent transactions streams. I think this was a recent clarification from the architects (in private emails). My understanding for some time has been that the mere presence of a Normal alias would affect the semantics of the user (VMM) accesses to the Device mapping. Apparently, this only matters if there is some expected ordering between the user (VMM) access and the VM one with different aliases. I can't tell where the mismatched aliases rules state this but I haven't read them in a while. -- Catalin ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory 2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita 2023-09-08 16:40 ` Catalin Marinas 2023-09-11 14:57 ` Lorenzo Pieralisi @ 2023-10-12 12:27 ` Will Deacon 2 siblings, 0 replies; 55+ messages in thread From: Will Deacon @ 2023-10-12 12:27 UTC (permalink / raw) To: ankita Cc: jgg, maz, oliver.upton, catalin.marinas, aniketa, cjia, kwankhede, targupta, vsethi, acurrid, apopple, jhubbard, danw, linux-arm-kernel, kvmarm, linux-kernel On Thu, Sep 07, 2023 at 11:14:59AM -0700, ankita@nvidia.com wrote: > From: Ankit Agrawal <ankita@nvidia.com> > > Linux allows device drivers to map IO memory on a per-page basis using > "write combining" or WC. This is often done using > pgprot_writecombing(). The driver knows which pages can support WC > access and the proper programming model to generate this IO. Generally > the use case is to boost performance by using write combining to > generate larger PCIe MemWr TLPs. > > Allow VMs to select DEVICE_* or NORMAL_NC on a page by page basis for > all IO memory. This puts the VM in charge of the memory attributes, > and removes the KVM override to DEVICE_nGnRE. > > Ultimately this makes pgprot_writecombing() work correctly in VMs and > allows drivers like mlx5 to fully operate their HW. > > After some discussions with ARM and CPU architects we reached the > conclusion there was no need for KVM to prevent the VM from selecting > between DEVICE_* and NORMAL_NC for IO memory in VMs. There was a fear > that NORMAL_NC could result in uncontained failures, but upon deeper > analysis it turns out there are already possible cases for uncontained > failures with DEVICE types too. Ultimately the platform must be > implemented in a way that ensures that all DEVICE_* and NORMAL_NC > accesses have no uncontained failures. > > Fortunately real platforms do tend to implement this. > > This patch makes the VM's memory attributes behave as follows: > > S1 | S2 | Result > NORMAL-WB | NORMAL-NC | NORMAL-NC > NORMAL-WT | NORMAL-NC | NORMAL-NC > NORMAL-NC | NORMAL-NC | NORMAL-NC > DEVICE<attr> | NORMAL-NC | DEVICE<attr> > > See section D8.5.5 of DDI0487_I_a_a-profile_architecture_reference_manual.pdf > for details. > > Signed-off-by: Ankit Agrawal <ankita@nvidia.com> > --- > arch/arm64/include/asm/memory.h | 2 ++ > arch/arm64/kvm/hyp/pgtable.c | 2 +- > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h > index fde4186cc387..c247e5f29d5a 100644 > --- a/arch/arm64/include/asm/memory.h > +++ b/arch/arm64/include/asm/memory.h > @@ -147,6 +147,7 @@ > * Memory types for Stage-2 translation > */ > #define MT_S2_NORMAL 0xf > +#define MT_S2_NORMAL_NC 0x5 > #define MT_S2_DEVICE_nGnRE 0x1 > > /* > @@ -154,6 +155,7 @@ > * Stage-2 enforces Normal-WB and Device-nGnRE > */ > #define MT_S2_FWB_NORMAL 6 > +#define MT_S2_FWB_NORMAL_NC 5 > #define MT_S2_FWB_DEVICE_nGnRE 1 > > #ifdef CONFIG_ARM64_4K_PAGES > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c > index ccd291b6893d..a80949002191 100644 > --- a/arch/arm64/kvm/hyp/pgtable.c > +++ b/arch/arm64/kvm/hyp/pgtable.c > @@ -696,7 +696,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p > kvm_pte_t *ptep) > { > bool device = prot & KVM_PGTABLE_PROT_DEVICE; > - kvm_pte_t attr = device ? KVM_S2_MEMATTR(pgt, DEVICE_nGnRE) : > + kvm_pte_t attr = device ? 
KVM_S2_MEMATTR(pgt, NORMAL_NC) : > KVM_S2_MEMATTR(pgt, NORMAL); > u32 sh = KVM_PTE_LEAF_ATTR_LO_S2_SH_IS; I think this is putting the policy into the low-level page-table code, which isn't where it belongs. KVM_PGTABLE_PROT_DEVICE means that the mapping should be created with device attributes. If you want other attributes, then please introduce another entry to 'kvm_pgtable_prot' and pass that instead (e.g. KVM_PGTABLE_PROT_NC, which coincidentally we already have in Android [1] ;). Will [1] https://android.googlesource.com/kernel/common/+/72cc19df8b71095f9740ff0ca6a75bf7ed27b0cd%5E%21/ ^ permalink raw reply [flat|nested] 55+ messages in thread
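A rough sketch of the shape Will is suggesting — keep the Device/Normal-NC policy in mmu.c and only teach the page-table code a new prot input — might look like the following; the names and bit values are illustrative, not the final upstream form:

	/* Rough sketch of the suggested direction; names/values illustrative. */
	enum kvm_pgtable_prot {
		KVM_PGTABLE_PROT_X		= BIT(0),
		KVM_PGTABLE_PROT_W		= BIT(1),
		KVM_PGTABLE_PROT_R		= BIT(2),
		KVM_PGTABLE_PROT_DEVICE		= BIT(3),
		KVM_PGTABLE_PROT_NORMAL_NC	= BIT(4),	/* new: relaxed MMIO */
		/* ... */
	};

	static int stage2_set_prot_attr(struct kvm_pgtable *pgt,
					enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
	{
		kvm_pte_t attr;

		if (prot & KVM_PGTABLE_PROT_DEVICE)
			attr = KVM_S2_MEMATTR(pgt, DEVICE_nGnRE);
		else if (prot & KVM_PGTABLE_PROT_NORMAL_NC)
			attr = KVM_S2_MEMATTR(pgt, NORMAL_NC);
		else
			attr = KVM_S2_MEMATTR(pgt, NORMAL);

		/* ... shareability, AP bits etc. unchanged from today ... */
		*ptep = attr;
		return 0;
	}

With that shape, mmu.c would pass KVM_PGTABLE_PROT_NORMAL_NC rather than KVM_PGTABLE_PROT_DEVICE for the mappings where the relaxation is wanted, keeping the policy decision out of the low-level page-table code.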
end of thread, other threads:[~2024-03-19 13:38 UTC | newest] Thread overview: 55+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-09-07 18:14 [PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs ankita 2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita 2023-09-07 19:12 ` Jason Gunthorpe 2023-10-05 16:15 ` Catalin Marinas 2023-10-05 16:54 ` Jason Gunthorpe 2023-10-10 14:25 ` Catalin Marinas 2023-10-10 15:05 ` Jason Gunthorpe 2023-10-10 17:19 ` Catalin Marinas 2023-10-10 18:23 ` Jason Gunthorpe 2023-10-11 17:45 ` Catalin Marinas 2023-10-11 18:38 ` Jason Gunthorpe 2023-10-12 16:16 ` Catalin Marinas 2024-03-10 3:49 ` Ankit Agrawal 2024-03-19 13:38 ` Jason Gunthorpe 2023-10-23 13:20 ` Shameerali Kolothum Thodi 2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita 2023-09-08 16:40 ` Catalin Marinas 2023-09-11 14:57 ` Lorenzo Pieralisi 2023-09-11 17:20 ` Jason Gunthorpe 2023-09-13 15:26 ` Lorenzo Pieralisi 2023-09-13 18:54 ` Jason Gunthorpe 2023-09-26 8:31 ` Lorenzo Pieralisi 2023-09-26 12:25 ` Jason Gunthorpe 2023-09-26 13:52 ` Catalin Marinas 2023-09-26 16:12 ` Lorenzo Pieralisi 2023-10-05 9:56 ` Lorenzo Pieralisi 2023-10-05 11:56 ` Jason Gunthorpe 2023-10-05 14:08 ` Lorenzo Pieralisi 2023-10-12 12:35 ` Will Deacon 2023-10-12 13:20 ` Jason Gunthorpe 2023-10-12 14:29 ` Lorenzo Pieralisi 2023-10-12 13:53 ` Catalin Marinas 2023-10-12 14:48 ` Will Deacon 2023-10-12 15:44 ` Jason Gunthorpe 2023-10-12 16:39 ` Will Deacon 2023-10-12 18:36 ` Jason Gunthorpe 2023-10-13 9:29 ` Will Deacon 2023-10-12 17:26 ` Catalin Marinas 2023-10-13 9:29 ` Will Deacon 2023-10-13 13:08 ` Catalin Marinas 2023-10-13 13:45 ` Jason Gunthorpe 2023-10-19 11:07 ` Catalin Marinas 2023-10-19 11:51 ` Jason Gunthorpe 2023-10-20 11:21 ` Catalin Marinas 2023-10-20 11:47 ` Jason Gunthorpe 2023-10-20 14:03 ` Lorenzo Pieralisi 2023-10-20 14:28 ` Jason Gunthorpe 2023-10-19 13:35 ` Lorenzo Pieralisi 2023-10-13 15:28 ` Lorenzo Pieralisi 2023-10-19 11:12 ` Catalin Marinas 2023-11-09 15:34 ` Lorenzo Pieralisi 2023-11-10 14:26 ` Jason Gunthorpe 2023-11-13 0:42 ` Lorenzo Pieralisi 2023-11-13 17:41 ` Catalin Marinas 2023-10-12 12:27 ` Will Deacon