* Re: [PATCH kernel 1/2] powerpc/iommu: Allocate it_map by vmalloc
From: David Gibson @ 2021-02-17 0:16 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: linuxppc-dev, kvm-ppc
In-Reply-To: <20210216033307.69863-2-aik@ozlabs.ru>
[-- Attachment #1: Type: text/plain, Size: 3078 bytes --]
On Tue, Feb 16, 2021 at 02:33:06PM +1100, Alexey Kardashevskiy wrote:
> The IOMMU table uses the it_map bitmap to keep track of allocated DMA
> pages. This has always been a contiguous array allocated at either
> the boot time or when a passed through device is returned to the host OS.
> The it_map memory is allocated by alloc_pages() which allocates
> contiguous physical memory.
>
> Such allocation method occasionally creates a problem when there is
> no big chunk of memory available (no free memory or too fragmented).
> On powernv/ioda2 the default DMA window requires 16MB for it_map.
>
> This replaces alloc_pages_node() with vzalloc_node() which allocates
> contiguous block but in virtual memory. This should reduce changes of
> failure but should not cause other behavioral changes as it_map is only
> used by the kernel's DMA hooks/api when MMU is on.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> arch/powerpc/kernel/iommu.c | 15 +++------------
> 1 file changed, 3 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index c00214a4355c..8eb6eb0afa97 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -719,7 +719,6 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
> {
> unsigned long sz;
> static int welcomed = 0;
> - struct page *page;
> unsigned int i;
> struct iommu_pool *p;
>
> @@ -728,11 +727,9 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
> /* number of bytes needed for the bitmap */
> sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
>
> - page = alloc_pages_node(nid, GFP_KERNEL, get_order(sz));
> - if (!page)
> + tbl->it_map = vzalloc_node(sz, nid);
> + if (!tbl->it_map)
> panic("iommu_init_table: Can't allocate %ld bytes\n", sz);
> - tbl->it_map = page_address(page);
> - memset(tbl->it_map, 0, sz);
>
> iommu_table_reserve_pages(tbl, res_start, res_end);
>
> @@ -774,8 +771,6 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
>
> static void iommu_table_free(struct kref *kref)
> {
> - unsigned long bitmap_sz;
> - unsigned int order;
> struct iommu_table *tbl;
>
> tbl = container_of(kref, struct iommu_table, it_kref);
> @@ -796,12 +791,8 @@ static void iommu_table_free(struct kref *kref)
> if (!bitmap_empty(tbl->it_map, tbl->it_size))
> pr_warn("%s: Unexpected TCEs\n", __func__);
>
> - /* calculate bitmap size in bytes */
> - bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
> -
> /* free bitmap */
> - order = get_order(bitmap_sz);
> - free_pages((unsigned long) tbl->it_map, order);
> + vfree(tbl->it_map);
>
> /* free table */
> kfree(tbl);
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH v4 2/3] KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE
From: David Gibson @ 2021-02-17 0:38 UTC (permalink / raw)
To: Bharata B Rao; +Cc: farosas, aneesh.kumar, npiggin, kvm-ppc, linuxppc-dev
In-Reply-To: <20210215063542.3642366-3-bharata@linux.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 14794 bytes --]
On Mon, Feb 15, 2021 at 12:05:41PM +0530, Bharata B Rao wrote:
> Implement H_RPT_INVALIDATE hcall and add KVM capability
> KVM_CAP_PPC_RPT_INVALIDATE to indicate the support for the same.
>
> This hcall does two types of TLB invalidations:
>
> 1. Process-scoped invalidations for guests with LPCR[GTSE]=0.
> This is currently not used in KVM as GTSE is not usually
> disabled in KVM.
> 2. Partition-scoped invalidations that an L1 hypervisor does on
> behalf of an L2 guest. This replaces the uses of the existing
> hcall H_TLB_INVALIDATE.
>
> In order to handle process scoped invalidations of L2, we
> intercept the nested exit handling code in L0 only to handle
> H_TLB_INVALIDATE hcall.
>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> ---
> Documentation/virt/kvm/api.rst | 17 +++++
> arch/powerpc/include/asm/kvm_book3s.h | 3 +
> arch/powerpc/include/asm/mmu_context.h | 11 +++
> arch/powerpc/kvm/book3s_hv.c | 91 ++++++++++++++++++++++++
> arch/powerpc/kvm/book3s_hv_nested.c | 96 ++++++++++++++++++++++++++
> arch/powerpc/kvm/powerpc.c | 3 +
> arch/powerpc/mm/book3s64/radix_tlb.c | 25 +++++++
> include/uapi/linux/kvm.h | 1 +
> 8 files changed, 247 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 99ceb978c8b0..416c36aa35d4 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6038,6 +6038,23 @@ KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
> can then handle to implement model specific MSR handling and/or user notifications
> to inform a user that an MSR was not handled.
>
> +7.22 KVM_CAP_PPC_RPT_INVALIDATE
> +------------------------------
> +
> +:Capability: KVM_CAP_PPC_RPT_INVALIDATE
> +:Architectures: ppc
> +:Type: vm
> +
> +This capability indicates that the kernel is capable of handling
> +H_RPT_INVALIDATE hcall.
> +
> +In order to enable the use of H_RPT_INVALIDATE in the guest,
> +user space might have to advertise it for the guest. For example,
> +IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
> +present in the "ibm,hypertas-functions" device-tree property.
> +
> +This capability is always enabled.
I guess that means it's always enabled when it's available - I'm
pretty sure it won't be enabled on POWER8 or on PR KVM.
> +
> 8. Other capabilities.
> ======================
>
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index d32ec9ae73bd..0f1c5fa6e8ce 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -298,6 +298,9 @@ void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
> void kvmhv_release_all_nested(struct kvm *kvm);
> long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
> long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
> +long kvmhv_h_rpti_nested(struct kvm_vcpu *vcpu, unsigned long lpid,
> + unsigned long type, unsigned long pg_sizes,
> + unsigned long start, unsigned long end);
> int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu,
> u64 time_limit, unsigned long lpcr);
> void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index d5821834dba9..fbf3b5b45fe9 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -124,8 +124,19 @@ static inline bool need_extra_context(struct mm_struct *mm, unsigned long ea)
>
> #if defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE) && defined(CONFIG_PPC_RADIX_MMU)
> extern void radix_kvm_prefetch_workaround(struct mm_struct *mm);
> +void do_h_rpt_invalidate(unsigned long pid, unsigned long lpid,
> + unsigned long type, unsigned long page_size,
> + unsigned long psize, unsigned long start,
> + unsigned long end);
> #else
> static inline void radix_kvm_prefetch_workaround(struct mm_struct *mm) { }
> +static inline void do_h_rpt_invalidate(unsigned long pid,
> + unsigned long lpid,
> + unsigned long type,
> + unsigned long page_size,
> + unsigned long psize,
> + unsigned long start,
> + unsigned long end) { }
> #endif
>
> extern void switch_cop(struct mm_struct *next);
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 6f612d240392..802cb77c39cc 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -904,6 +904,64 @@ static int kvmppc_get_yield_count(struct kvm_vcpu *vcpu)
> return yield_count;
> }
>
> +static void do_h_rpt_invalidate_prs(unsigned long pid, unsigned long lpid,
> + unsigned long type, unsigned long pg_sizes,
> + unsigned long start, unsigned long end)
> +{
> + unsigned long psize;
> +
> + if (pg_sizes & H_RPTI_PAGE_64K) {
> + psize = rpti_pgsize_to_psize(pg_sizes & H_RPTI_PAGE_64K);
> + do_h_rpt_invalidate(pid, lpid, type, (1UL << 16), psize,
> + start, end);
> + }
> +
> + if (pg_sizes & H_RPTI_PAGE_2M) {
> + psize = rpti_pgsize_to_psize(pg_sizes & H_RPTI_PAGE_2M);
> + do_h_rpt_invalidate(pid, lpid, type, (1UL << 21), psize,
> + start, end);
> + }
> +
> + if (pg_sizes & H_RPTI_PAGE_1G) {
> + psize = rpti_pgsize_to_psize(pg_sizes & H_RPTI_PAGE_1G);
> + do_h_rpt_invalidate(pid, lpid, type, (1UL << 30), psize,
> + start, end);
> + }
Hrm. Here you're stepping through the hcall defined pagesizes, then
mapping each one to the Linux internal page size defs.
It might be more elegant to step through mmu_psize_defs table, and
conditionally performan an invalidate on that pagesize if the
corresponding bit in pg_sizes is set (as noted earlier you could
easily add the H_RPTI_PAGE bit to the table). That way it's a direct
table lookup rather than a bunch of ifs or switches.
> +}
> +
> +static long kvmppc_h_rpt_invalidate(struct kvm_vcpu *vcpu,
> + unsigned long pid, unsigned long target,
> + unsigned long type, unsigned long pg_sizes,
> + unsigned long start, unsigned long end)
> +{
> + if (!kvm_is_radix(vcpu->kvm))
> + return H_UNSUPPORTED;
> +
> + if (kvmhv_on_pseries())
> + return H_UNSUPPORTED;
This doesn't seem quite right. If you have multiply nested guests,
won't the L2 be issueing H_RPT_INVALIDATE hcalls to the L1 on behalf
of the L3? The L1 would have to implement them by calling the L0, but
the L1 can't just reject them, no?
Likewise for the !H_RPTI_TYPE_NESTED case, but on what happens to be a
nested guest in any case, couldn't this case legitimately arise and
need to be handled?
> +
> + if (end < start)
> + return H_P5;
> +
> + if (type & H_RPTI_TYPE_NESTED) {
> + if (!nesting_enabled(vcpu->kvm))
> + return H_FUNCTION;
> +
> + /* Support only cores as target */
> + if (target != H_RPTI_TARGET_CMMU)
> + return H_P2;
> +
> + return kvmhv_h_rpti_nested(vcpu, pid,
> + (type & ~H_RPTI_TYPE_NESTED),
> + pg_sizes, start, end);
> + }
> +
> + do_h_rpt_invalidate_prs(pid, vcpu->kvm->arch.lpid, type, pg_sizes,
> + start, end);
> + return H_SUCCESS;
> +}
> +
> int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
> {
> unsigned long req = kvmppc_get_gpr(vcpu, 3);
> @@ -1112,6 +1170,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
> */
> ret = kvmppc_h_svm_init_abort(vcpu->kvm);
> break;
> + case H_RPT_INVALIDATE:
> + ret = kvmppc_h_rpt_invalidate(vcpu, kvmppc_get_gpr(vcpu, 4),
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6),
> + kvmppc_get_gpr(vcpu, 7),
> + kvmppc_get_gpr(vcpu, 8),
> + kvmppc_get_gpr(vcpu, 9));
> + break;
>
> default:
> return RESUME_HOST;
> @@ -1158,6 +1224,7 @@ static int kvmppc_hcall_impl_hv(unsigned long cmd)
> case H_XIRR_X:
> #endif
> case H_PAGE_INIT:
> + case H_RPT_INVALIDATE:
> return 1;
> }
>
> @@ -1573,6 +1640,30 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu *vcpu)
> if (!xics_on_xive())
> kvmppc_xics_rm_complete(vcpu, 0);
> break;
> + case BOOK3S_INTERRUPT_SYSCALL:
> + {
> + unsigned long req = kvmppc_get_gpr(vcpu, 3);
> +
> + if (req != H_RPT_INVALIDATE) {
> + r = RESUME_HOST;
> + break;
> + }
> +
> + /*
> + * The H_RPT_INVALIDATE hcalls issued by nested
> + * guest for process scoped invalidations when
> + * GTSE=0 are handled here.
> + */
> + do_h_rpt_invalidate_prs(kvmppc_get_gpr(vcpu, 4),
> + vcpu->arch.nested->shadow_lpid,
> + kvmppc_get_gpr(vcpu, 5),
> + kvmppc_get_gpr(vcpu, 6),
> + kvmppc_get_gpr(vcpu, 7),
> + kvmppc_get_gpr(vcpu, 8));
> + kvmppc_set_gpr(vcpu, 3, H_SUCCESS);
> + r = RESUME_GUEST;
> + break;
> + }
> default:
> r = RESUME_HOST;
> break;
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index 33b58549a9aa..40ed4eb80adb 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -1149,6 +1149,102 @@ long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu)
> return H_SUCCESS;
> }
>
> +static long do_tlb_invalidate_nested_tlb(struct kvm_vcpu *vcpu,
> + unsigned long lpid,
> + unsigned long page_size,
> + unsigned long ap,
> + unsigned long start,
> + unsigned long end)
> +{
> + unsigned long addr = start;
> + int ret;
> +
> + do {
> + ret = kvmhv_emulate_tlbie_tlb_addr(vcpu, lpid, ap,
> + get_epn(addr));
> + if (ret)
> + return ret;
> + addr += page_size;
> + } while (addr < end);
> +
> + return ret;
> +}
> +
> +static long do_tlb_invalidate_nested_all(struct kvm_vcpu *vcpu,
> + unsigned long lpid)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_nested_guest *gp;
> +
> + gp = kvmhv_get_nested(kvm, lpid, false);
> + if (gp) {
> + kvmhv_emulate_tlbie_lpid(vcpu, gp, RIC_FLUSH_ALL);
> + kvmhv_put_nested(gp);
> + }
> + return H_SUCCESS;
> +}
> +
> +long kvmhv_h_rpti_nested(struct kvm_vcpu *vcpu, unsigned long lpid,
> + unsigned long type, unsigned long pg_sizes,
> + unsigned long start, unsigned long end)
> +{
> + struct kvm_nested_guest *gp;
> + long ret;
> + unsigned long psize, ap;
> +
> + /*
> + * If L2 lpid isn't valid, we need to return H_PARAMETER.
> + *
> + * However, nested KVM issues a L2 lpid flush call when creating
> + * partition table entries for L2. This happens even before the
> + * corresponding shadow lpid is created in HV which happens in
> + * H_ENTER_NESTED call. Since we can't differentiate this case from
> + * the invalid case, we ignore such flush requests and return success.
> + */
> + gp = kvmhv_find_nested(vcpu->kvm, lpid);
> + if (!gp)
> + return H_SUCCESS;
> +
> + if ((type & H_RPTI_TYPE_NESTED_ALL) == H_RPTI_TYPE_NESTED_ALL)
> + return do_tlb_invalidate_nested_all(vcpu, lpid);
> +
> + if ((type & H_RPTI_TYPE_TLB) == H_RPTI_TYPE_TLB) {
> + if (pg_sizes & H_RPTI_PAGE_64K) {
> + psize = rpti_pgsize_to_psize(pg_sizes & H_RPTI_PAGE_64K);
> + ap = mmu_get_ap(psize);
> +
> + ret = do_tlb_invalidate_nested_tlb(vcpu, lpid,
> + (1UL << 16),
> + ap, start, end);
> + if (ret)
> + return H_P4;
> + }
> +
> + if (pg_sizes & H_RPTI_PAGE_2M) {
> + psize = rpti_pgsize_to_psize(pg_sizes & H_RPTI_PAGE_2M);
> + ap = mmu_get_ap(psize);
> +
> + ret = do_tlb_invalidate_nested_tlb(vcpu, lpid,
> + (1UL << 21),
> + ap, start, end);
> + if (ret)
> + return H_P4;
> + }
> +
> + if (pg_sizes & H_RPTI_PAGE_1G) {
> + psize = rpti_pgsize_to_psize(pg_sizes & H_RPTI_PAGE_1G);
> + ap = mmu_get_ap(psize);
> +
> + ret = do_tlb_invalidate_nested_tlb(vcpu, lpid,
> + (1UL << 30),
> + ap, start, end);
> + if (ret)
> + return H_P4;
> + }
Again it might be more elegant to step through the pagesizes from the
mmu_psize_defs side, rather than from the pg_sizes side.
> + }
> + return H_SUCCESS;
> +}
> +
> /* Used to convert a nested guest real address to a L1 guest real address */
> static int kvmhv_translate_addr_nested(struct kvm_vcpu *vcpu,
> struct kvm_nested_guest *gp,
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index cf52d26f49cd..5388cd4a206a 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -678,6 +678,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> r = hv_enabled && kvmppc_hv_ops->enable_svm &&
> !kvmppc_hv_ops->enable_svm(NULL);
> break;
> + case KVM_CAP_PPC_RPT_INVALIDATE:
> + r = 1;
> + break;
> #endif
> default:
> r = 0;
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index 097402435303..4f746d34b420 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1400,4 +1400,29 @@ extern void radix_kvm_prefetch_workaround(struct mm_struct *mm)
> }
> }
> EXPORT_SYMBOL_GPL(radix_kvm_prefetch_workaround);
> +
> +/*
> + * Process-scoped invalidations for a given LPID.
> + */
> +void do_h_rpt_invalidate(unsigned long pid, unsigned long lpid,
> + unsigned long type, unsigned long page_size,
> + unsigned long psize, unsigned long start,
> + unsigned long end)
> +{
> + if ((type & H_RPTI_TYPE_ALL) == H_RPTI_TYPE_ALL) {
> + _tlbie_pid_lpid(pid, lpid, RIC_FLUSH_ALL);
> + return;
> + }
> +
> + if (type & H_RPTI_TYPE_PWC)
> + _tlbie_pid_lpid(pid, lpid, RIC_FLUSH_PWC);
> +
> + if (!start && end == -1) /* PID */
> + _tlbie_pid_lpid(pid, lpid, RIC_FLUSH_TLB);
> + else /* EA */
> + _tlbie_va_range_lpid(start, end, pid, lpid, page_size,
> + psize, false);
> +}
> +EXPORT_SYMBOL_GPL(do_h_rpt_invalidate);
> +
> #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 374c67875cdb..6fd530fae452 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1058,6 +1058,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> #define KVM_CAP_SYS_HYPERV_CPUID 191
> #define KVM_CAP_DIRTY_LOG_RING 192
> +#define KVM_CAP_PPC_RPT_INVALIDATE 193
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH kernel 2/2] powerpc/iommu: Do not immediately panic when failed IOMMU table allocation
From: David Gibson @ 2021-02-17 0:16 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: linuxppc-dev, kvm-ppc
In-Reply-To: <20210216033307.69863-3-aik@ozlabs.ru>
[-- Attachment #1: Type: text/plain, Size: 7551 bytes --]
On Tue, Feb 16, 2021 at 02:33:07PM +1100, Alexey Kardashevskiy wrote:
> Most platforms allocate IOMMU table structures (specifically it_map)
> at the boot time and when this fails - it is a valid reason for panic().
>
> However the powernv platform allocates it_map after a device is returned
> to the host OS after being passed through and this happens long after
> the host OS booted. It is quite possible to trigger the it_map allocation
> panic() and kill the host even though it is not necessary - the host OS
> can still use the DMA bypass mode (requires a tiny fraction of it_map's
> memory) and even if that fails, the host OS is runnnable as it was without
> the device for which allocating it_map causes the panic.
>
> Instead of immediately crashing in a powernv/ioda2 system, this prints
> an error and continues. All other platforms still call panic().
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> ---
> arch/powerpc/kernel/iommu.c | 6 ++++--
> arch/powerpc/platforms/cell/iommu.c | 3 ++-
> arch/powerpc/platforms/pasemi/iommu.c | 4 +++-
> arch/powerpc/platforms/powernv/pci-ioda.c | 15 ++++++++-------
> arch/powerpc/platforms/pseries/iommu.c | 10 +++++++---
> arch/powerpc/sysdev/dart_iommu.c | 3 ++-
> 6 files changed, 26 insertions(+), 15 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 8eb6eb0afa97..c1a5c366a664 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -728,8 +728,10 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
> sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
>
> tbl->it_map = vzalloc_node(sz, nid);
> - if (!tbl->it_map)
> - panic("iommu_init_table: Can't allocate %ld bytes\n", sz);
> + if (!tbl->it_map) {
> + pr_err("%s: Can't allocate %ld bytes\n", __func__, sz);
> + return NULL;
> + }
>
> iommu_table_reserve_pages(tbl, res_start, res_end);
>
> diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
> index 2124831cf57c..fa08699aedeb 100644
> --- a/arch/powerpc/platforms/cell/iommu.c
> +++ b/arch/powerpc/platforms/cell/iommu.c
> @@ -486,7 +486,8 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
> window->table.it_size = size >> window->table.it_page_shift;
> window->table.it_ops = &cell_iommu_ops;
>
> - iommu_init_table(&window->table, iommu->nid, 0, 0);
> + if (!iommu_init_table(&window->table, iommu->nid, 0, 0))
> + panic("Failed to initialize iommu table");
>
> pr_debug("\tioid %d\n", window->ioid);
> pr_debug("\tblocksize %ld\n", window->table.it_blocksize);
> diff --git a/arch/powerpc/platforms/pasemi/iommu.c b/arch/powerpc/platforms/pasemi/iommu.c
> index b500a6e47e6b..5be7242fbd86 100644
> --- a/arch/powerpc/platforms/pasemi/iommu.c
> +++ b/arch/powerpc/platforms/pasemi/iommu.c
> @@ -146,7 +146,9 @@ static void iommu_table_iobmap_setup(void)
> */
> iommu_table_iobmap.it_blocksize = 4;
> iommu_table_iobmap.it_ops = &iommu_table_iobmap_ops;
> - iommu_init_table(&iommu_table_iobmap, 0, 0, 0);
> + if (!iommu_init_table(&iommu_table_iobmap, 0, 0, 0))
> + panic("Failed to initialize iommu table");
> +
> pr_debug(" <- %s\n", __func__);
> }
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index f0f901683a2f..66c3c3337334 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1762,7 +1762,8 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
> tbl->it_ops = &pnv_ioda1_iommu_ops;
> pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift;
> pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
> - iommu_init_table(tbl, phb->hose->node, 0, 0);
> + if (!iommu_init_table(tbl, phb->hose->node, 0, 0))
> + panic("Failed to initialize iommu table");
>
> pe->dma_setup_done = true;
> return;
> @@ -1930,16 +1931,16 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
> res_start = pe->phb->ioda.m32_pci_base >> tbl->it_page_shift;
> res_end = min(window_size, SZ_4G) >> tbl->it_page_shift;
> }
> - iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end);
>
> - rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
> + if (iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end))
> + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
> + else
> + rc = -ENOMEM;
> if (rc) {
> - pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
> - rc);
> + pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n", rc);
> iommu_tce_table_put(tbl);
> - return rc;
> + tbl = NULL; /* This clears iommu_table_base below */
> }
> -
> if (!pnv_iommu_bypass_disabled)
> pnv_pci_ioda2_set_bypass(pe, true);
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 9fc5217f0c8e..4d9ac1f181c2 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -638,7 +638,8 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
>
> iommu_table_setparms(pci->phb, dn, tbl);
> tbl->it_ops = &iommu_table_pseries_ops;
> - iommu_init_table(tbl, pci->phb->node, 0, 0);
> + if (!iommu_init_table(tbl, pci->phb->node, 0, 0))
> + panic("Failed to initialize iommu table");
>
> /* Divide the rest (1.75GB) among the children */
> pci->phb->dma_window_size = 0x80000000ul;
> @@ -720,7 +721,8 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
> iommu_table_setparms_lpar(ppci->phb, pdn, tbl,
> ppci->table_group, dma_window);
> tbl->it_ops = &iommu_table_lpar_multi_ops;
> - iommu_init_table(tbl, ppci->phb->node, 0, 0);
> + if (!iommu_init_table(tbl, ppci->phb->node, 0, 0))
> + panic("Failed to initialize iommu table");
> iommu_register_group(ppci->table_group,
> pci_domain_nr(bus), 0);
> pr_debug(" created table: %p\n", ppci->table_group);
> @@ -749,7 +751,9 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
> tbl = PCI_DN(dn)->table_group->tables[0];
> iommu_table_setparms(phb, dn, tbl);
> tbl->it_ops = &iommu_table_pseries_ops;
> - iommu_init_table(tbl, phb->node, 0, 0);
> + if (!iommu_init_table(tbl, phb->node, 0, 0))
> + panic("Failed to initialize iommu table");
> +
> set_iommu_table_base(&dev->dev, tbl);
> return;
> }
> diff --git a/arch/powerpc/sysdev/dart_iommu.c b/arch/powerpc/sysdev/dart_iommu.c
> index 6b4a34b36d98..1d33b7a5ea83 100644
> --- a/arch/powerpc/sysdev/dart_iommu.c
> +++ b/arch/powerpc/sysdev/dart_iommu.c
> @@ -344,7 +344,8 @@ static void iommu_table_dart_setup(void)
> iommu_table_dart.it_index = 0;
> iommu_table_dart.it_blocksize = 1;
> iommu_table_dart.it_ops = &iommu_dart_ops;
> - iommu_init_table(&iommu_table_dart, -1, 0, 0);
> + if (!iommu_init_table(&iommu_table_dart, -1, 0, 0))
> + panic("Failed to initialize iommu table");
>
> /* Reserve the last page of the DART to avoid possible prefetch
> * past the DART mapped area
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH v4 1/3] powerpc/book3s64/radix/tlb: tlbie primitives for process-scoped invalidations from guests
From: David Gibson @ 2021-02-17 0:24 UTC (permalink / raw)
To: Bharata B Rao; +Cc: farosas, aneesh.kumar, npiggin, kvm-ppc, linuxppc-dev
In-Reply-To: <20210215063542.3642366-2-bharata@linux.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 8651 bytes --]
On Mon, Feb 15, 2021 at 12:05:40PM +0530, Bharata B Rao wrote:
> H_RPT_INVALIDATE hcall needs to perform process scoped tlbie
> invalidations of L1 and nested guests from L0. This needs RS register
> for TLBIE instruction to contain both PID and LPID. Introduce
> primitives that execute tlbie instruction with both PID
> and LPID set in prepartion for H_RPT_INVALIDATE hcall.
>
> While we are here, move RIC_FLUSH definitions to header file
> and introduce helper rpti_pgsize_to_psize() that will be needed
> by the upcoming hcall.
>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> ---
> .../include/asm/book3s/64/tlbflush-radix.h | 18 +++
> arch/powerpc/mm/book3s64/radix_tlb.c | 122 +++++++++++++++++-
> 2 files changed, 136 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> index 94439e0cefc9..aace7e9b2397 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
> @@ -4,6 +4,10 @@
>
> #include <asm/hvcall.h>
>
> +#define RIC_FLUSH_TLB 0
> +#define RIC_FLUSH_PWC 1
> +#define RIC_FLUSH_ALL 2
> +
> struct vm_area_struct;
> struct mm_struct;
> struct mmu_gather;
> @@ -21,6 +25,20 @@ static inline u64 psize_to_rpti_pgsize(unsigned long psize)
> return H_RPTI_PAGE_ALL;
> }
>
> +static inline int rpti_pgsize_to_psize(unsigned long page_size)
> +{
> + if (page_size == H_RPTI_PAGE_4K)
> + return MMU_PAGE_4K;
> + if (page_size == H_RPTI_PAGE_64K)
> + return MMU_PAGE_64K;
> + if (page_size == H_RPTI_PAGE_2M)
> + return MMU_PAGE_2M;
> + if (page_size == H_RPTI_PAGE_1G)
> + return MMU_PAGE_1G;
> + else
> + return MMU_PAGE_64K; /* Default */
> +}
Would it make sense to put the H_RPT_PAGE_ tags into the
mmu_psize_defs table and scan that here, rather than open coding the
conversion?
> +
> static inline int mmu_get_ap(int psize)
> {
> return mmu_psize_defs[psize].ap;
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index fb66d154b26c..097402435303 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -18,10 +18,6 @@
> #include <asm/cputhreads.h>
> #include <asm/plpar_wrappers.h>
>
> -#define RIC_FLUSH_TLB 0
> -#define RIC_FLUSH_PWC 1
> -#define RIC_FLUSH_ALL 2
> -
> /*
> * tlbiel instruction for radix, set invalidation
> * i.e., r=1 and is=01 or is=10 or is=11
> @@ -128,6 +124,21 @@ static __always_inline void __tlbie_pid(unsigned long pid, unsigned long ric)
> trace_tlbie(0, 0, rb, rs, ric, prs, r);
> }
>
> +static __always_inline void __tlbie_pid_lpid(unsigned long pid,
> + unsigned long lpid,
> + unsigned long ric)
> +{
> + unsigned long rb, rs, prs, r;
> +
> + rb = PPC_BIT(53); /* IS = 1 */
> + rs = (pid << PPC_BITLSHIFT(31)) | (lpid & ~(PPC_BITMASK(0, 31)));
> + prs = 1; /* process scoped */
> + r = 1; /* radix format */
> +
> + asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
> + : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
> + trace_tlbie(0, 0, rb, rs, ric, prs, r);
> +}
> static __always_inline void __tlbie_lpid(unsigned long lpid, unsigned long ric)
> {
> unsigned long rb,rs,prs,r;
> @@ -188,6 +199,23 @@ static __always_inline void __tlbie_va(unsigned long va, unsigned long pid,
> trace_tlbie(0, 0, rb, rs, ric, prs, r);
> }
>
> +static __always_inline void __tlbie_va_lpid(unsigned long va, unsigned long pid,
> + unsigned long lpid,
> + unsigned long ap, unsigned long ric)
> +{
> + unsigned long rb, rs, prs, r;
> +
> + rb = va & ~(PPC_BITMASK(52, 63));
> + rb |= ap << PPC_BITLSHIFT(58);
> + rs = (pid << PPC_BITLSHIFT(31)) | (lpid & ~(PPC_BITMASK(0, 31)));
> + prs = 1; /* process scoped */
> + r = 1; /* radix format */
> +
> + asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
> + : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
> + trace_tlbie(0, 0, rb, rs, ric, prs, r);
> +}
> +
> static __always_inline void __tlbie_lpid_va(unsigned long va, unsigned long lpid,
> unsigned long ap, unsigned long ric)
> {
> @@ -233,6 +261,22 @@ static inline void fixup_tlbie_va_range(unsigned long va, unsigned long pid,
> }
> }
>
> +static inline void fixup_tlbie_va_range_lpid(unsigned long va,
> + unsigned long pid,
> + unsigned long lpid,
> + unsigned long ap)
> +{
> + if (cpu_has_feature(CPU_FTR_P9_TLBIE_ERAT_BUG)) {
> + asm volatile("ptesync" : : : "memory");
> + __tlbie_pid_lpid(0, lpid, RIC_FLUSH_TLB);
> + }
> +
> + if (cpu_has_feature(CPU_FTR_P9_TLBIE_STQ_BUG)) {
> + asm volatile("ptesync" : : : "memory");
> + __tlbie_va_lpid(va, pid, lpid, ap, RIC_FLUSH_TLB);
> + }
> +}
> +
> static inline void fixup_tlbie_pid(unsigned long pid)
> {
> /*
> @@ -252,6 +296,25 @@ static inline void fixup_tlbie_pid(unsigned long pid)
> }
> }
>
> +static inline void fixup_tlbie_pid_lpid(unsigned long pid, unsigned long lpid)
> +{
> + /*
> + * We can use any address for the invalidation, pick one which is
> + * probably unused as an optimisation.
> + */
> + unsigned long va = ((1UL << 52) - 1);
> +
> + if (cpu_has_feature(CPU_FTR_P9_TLBIE_ERAT_BUG)) {
> + asm volatile("ptesync" : : : "memory");
> + __tlbie_pid_lpid(0, lpid, RIC_FLUSH_TLB);
> + }
> +
> + if (cpu_has_feature(CPU_FTR_P9_TLBIE_STQ_BUG)) {
> + asm volatile("ptesync" : : : "memory");
> + __tlbie_va_lpid(va, pid, lpid, mmu_get_ap(MMU_PAGE_64K),
> + RIC_FLUSH_TLB);
> + }
> +}
>
> static inline void fixup_tlbie_lpid_va(unsigned long va, unsigned long lpid,
> unsigned long ap)
> @@ -342,6 +405,31 @@ static inline void _tlbie_pid(unsigned long pid, unsigned long ric)
> asm volatile("eieio; tlbsync; ptesync": : :"memory");
> }
>
> +static inline void _tlbie_pid_lpid(unsigned long pid, unsigned long lpid,
> + unsigned long ric)
> +{
> + asm volatile("ptesync" : : : "memory");
> +
> + /*
> + * Workaround the fact that the "ric" argument to __tlbie_pid
> + * must be a compile-time contraint to match the "i" constraint
> + * in the asm statement.
> + */
> + switch (ric) {
> + case RIC_FLUSH_TLB:
> + __tlbie_pid_lpid(pid, lpid, RIC_FLUSH_TLB);
> + fixup_tlbie_pid_lpid(pid, lpid);
> + break;
> + case RIC_FLUSH_PWC:
> + __tlbie_pid_lpid(pid, lpid, RIC_FLUSH_PWC);
> + break;
> + case RIC_FLUSH_ALL:
> + default:
> + __tlbie_pid_lpid(pid, lpid, RIC_FLUSH_ALL);
> + fixup_tlbie_pid_lpid(pid, lpid);
> + }
> + asm volatile("eieio; tlbsync; ptesync" : : : "memory");
> +}
> struct tlbiel_pid {
> unsigned long pid;
> unsigned long ric;
> @@ -467,6 +555,20 @@ static inline void __tlbie_va_range(unsigned long start, unsigned long end,
> fixup_tlbie_va_range(addr - page_size, pid, ap);
> }
>
> +static inline void __tlbie_va_range_lpid(unsigned long start, unsigned long end,
> + unsigned long pid, unsigned long lpid,
> + unsigned long page_size,
> + unsigned long psize)
> +{
> + unsigned long addr;
> + unsigned long ap = mmu_get_ap(psize);
> +
> + for (addr = start; addr < end; addr += page_size)
> + __tlbie_va_lpid(addr, pid, lpid, ap, RIC_FLUSH_TLB);
> +
> + fixup_tlbie_va_range_lpid(addr - page_size, pid, lpid, ap);
> +}
> +
> static __always_inline void _tlbie_va(unsigned long va, unsigned long pid,
> unsigned long psize, unsigned long ric)
> {
> @@ -547,6 +649,18 @@ static inline void _tlbie_va_range(unsigned long start, unsigned long end,
> asm volatile("eieio; tlbsync; ptesync": : :"memory");
> }
>
> +static inline void _tlbie_va_range_lpid(unsigned long start, unsigned long end,
> + unsigned long pid, unsigned long lpid,
> + unsigned long page_size,
> + unsigned long psize, bool also_pwc)
> +{
> + asm volatile("ptesync" : : : "memory");
> + if (also_pwc)
> + __tlbie_pid_lpid(pid, lpid, RIC_FLUSH_PWC);
> + __tlbie_va_range_lpid(start, end, pid, lpid, page_size, psize);
> + asm volatile("eieio; tlbsync; ptesync" : : : "memory");
> +}
> +
> static inline void _tlbiel_va_range_multicast(struct mm_struct *mm,
> unsigned long start, unsigned long end,
> unsigned long pid, unsigned long page_size,
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 2/2] powerpc/uaccess: Move might_fault() into user_access_begin()
From: Michael Ellerman @ 2021-02-17 1:58 UTC (permalink / raw)
To: Christophe Leroy, linuxppc-dev; +Cc: aik
In-Reply-To: <b1d3af99-2d38-0794-0694-95a735fccbe3@csgroup.eu>
Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> Le 08/02/2021 à 14:57, Michael Ellerman a écrit :
>> We have a might_fault() check in __unsafe_put_user_goto(), but that is
>> dangerous as it potentially calls lots of code while user access is
>> enabled.
>>
>> It also triggers the check Alexey added to the irq restore path to
>> catch cases like that:
>>
>> WARNING: CPU: 30 PID: 1 at arch/powerpc/include/asm/book3s/64/kup.h:324 arch_local_irq_restore+0x160/0x190
>> NIP arch_local_irq_restore+0x160/0x190
>> LR lock_is_held_type+0x140/0x200
>> Call Trace:
>> 0xc00000007f392ff8 (unreliable)
>> ___might_sleep+0x180/0x320
>> __might_fault+0x50/0xe0
>> filldir64+0x2d0/0x5d0
>> call_filldir+0xc8/0x180
>> ext4_readdir+0x948/0xb40
>> iterate_dir+0x1ec/0x240
>> sys_getdents64+0x80/0x290
>> system_call_exception+0x160/0x280
>> system_call_common+0xf0/0x27c
>>
>> So remove the might fault check from unsafe_put_user().
>>
>> Any call to unsafe_put_user() must be inside a region that's had user
>> access enabled with user_access_begin(), so move the might_fault() in
>> there. That also allows us to drop the is_kernel_addr() test, because
>> there should be no code using user_access_begin() in order to access a
>> kernel address.
>
> x86 and mips only have might_fault() on get_user() and put_user(),
> neither on __get_user() nor on __put_user() nor on the unsafe
> alternative.
Yeah, that's their choice, or perhaps it's by accident.
arm64 on the other hand has might_fault() in all variants.
A __get_user() can fault just as much as a get_user(), so there's no
reason the check should be omitted from __get_user(), other than perhaps
some historical argument about __get_user() being the "fast" case.
> When have might_fault() in __get_user_nocheck() that is used by
> __get_user() and __get_user_allowed() ie by unsafe_get_user().
>
> Shoudln't those be dropped as well ?
That was handled by Alexey's patch, which I ended up merging with this
one:
https://git.kernel.org/torvalds/c/7d506ca97b665b95e698a53697dad99fae813c1a
ie. we still have might_fault() in __get_user_nocheck(), but it's
guarded by a check of do_allow, so we won't call it for
__get_user_allowed().
So I think the code (in my next branch) is correct, we don't have any
might_fault() calls in unsafe regions.
But I'd still be happier if we could simplify our uaccess.h more, it's a
bit of a rats nest. We could start by making __get/put_user() ==
get/put_user() the same way arm64 did.
cheers
^ permalink raw reply
* Re: [PATCH 2/2] powerpc/uaccess: Move might_fault() into user_access_begin()
From: Christophe Leroy @ 2021-02-17 18:29 UTC (permalink / raw)
To: Michael Ellerman; +Cc: aik, linuxppc-dev
In-Reply-To: <87czwzv4io.fsf@mpe.ellerman.id.au>
Michael Ellerman <mpe@ellerman.id.au> a écrit :
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> Le 08/02/2021 à 14:57, Michael Ellerman a écrit :
>>> We have a might_fault() check in __unsafe_put_user_goto(), but that is
>>> dangerous as it potentially calls lots of code while user access is
>>> enabled.
>>>
>>> It also triggers the check Alexey added to the irq restore path to
>>> catch cases like that:
>>>
>>> WARNING: CPU: 30 PID: 1 at
>>> arch/powerpc/include/asm/book3s/64/kup.h:324
>>> arch_local_irq_restore+0x160/0x190
>>> NIP arch_local_irq_restore+0x160/0x190
>>> LR lock_is_held_type+0x140/0x200
>>> Call Trace:
>>> 0xc00000007f392ff8 (unreliable)
>>> ___might_sleep+0x180/0x320
>>> __might_fault+0x50/0xe0
>>> filldir64+0x2d0/0x5d0
>>> call_filldir+0xc8/0x180
>>> ext4_readdir+0x948/0xb40
>>> iterate_dir+0x1ec/0x240
>>> sys_getdents64+0x80/0x290
>>> system_call_exception+0x160/0x280
>>> system_call_common+0xf0/0x27c
>>>
>>> So remove the might fault check from unsafe_put_user().
>>>
>>> Any call to unsafe_put_user() must be inside a region that's had user
>>> access enabled with user_access_begin(), so move the might_fault() in
>>> there. That also allows us to drop the is_kernel_addr() test, because
>>> there should be no code using user_access_begin() in order to access a
>>> kernel address.
>>
>> x86 and mips only have might_fault() on get_user() and put_user(),
>> neither on __get_user() nor on __put_user() nor on the unsafe
>> alternative.
>
> Yeah, that's their choice, or perhaps it's by accident.
>
> arm64 on the other hand has might_fault() in all variants.
>
> A __get_user() can fault just as much as a get_user(), so there's no
> reason the check should be omitted from __get_user(), other than perhaps
> some historical argument about __get_user() being the "fast" case.
>
>> When have might_fault() in __get_user_nocheck() that is used by
>> __get_user() and __get_user_allowed() ie by unsafe_get_user().
>>
>> Shoudln't those be dropped as well ?
>
> That was handled by Alexey's patch, which I ended up merging with this
> one:
>
> https://git.kernel.org/torvalds/c/7d506ca97b665b95e698a53697dad99fae813c1a
>
>
> ie. we still have might_fault() in __get_user_nocheck(), but it's
> guarded by a check of do_allow, so we won't call it for
> __get_user_allowed().
>
> So I think the code (in my next branch) is correct, we don't have any
> might_fault() calls in unsafe regions.
>
> But I'd still be happier if we could simplify our uaccess.h more, it's a
> bit of a rats nest. We could start by making __get/put_user() ==
> get/put_user() the same way arm64 did.
I agree there are several easy simplifications to do there. I'll look
at that in the coming weeks.
I'm not sure it is good to make __get/put_user equal get/put_user as
it would mean calling access_ok() everytime. But we could most likely
make something simpler with get_user() calling access_ok() then
__get_user().
I think we should also audit our use of the _inatomic variants.
might_fault() voids when pagefault is disabled so I think the inatomic
variants should be needed. As far as I can see, powerpc is the only
arch having that.
Need to also check why we still need that is_kernel_addr() check
because since the removal of set_fs(), __get/put_user() helpers
shouldn't be used anymore for kernel addresses
Christophe
^ permalink raw reply
* Re: [PATCH kernel 1/2] powerpc/iommu: Allocate it_map by vmalloc
From: Leonardo Bras @ 2021-02-17 19:11 UTC (permalink / raw)
To: Alexey Kardashevskiy, linuxppc-dev; +Cc: kvm-ppc, David Gibson
In-Reply-To: <20210216033307.69863-2-aik@ozlabs.ru>
On Tue, 2021-02-16 at 14:33 +1100, Alexey Kardashevskiy wrote:
> The IOMMU table uses the it_map bitmap to keep track of allocated DMA
> pages. This has always been a contiguous array allocated at either
> the boot time or when a passed through device is returned to the host OS.
> The it_map memory is allocated by alloc_pages() which allocates
> contiguous physical memory.
>
> Such allocation method occasionally creates a problem when there is
> no big chunk of memory available (no free memory or too fragmented).
> On powernv/ioda2 the default DMA window requires 16MB for it_map.
>
> This replaces alloc_pages_node() with vzalloc_node() which allocates
> contiguous block but in virtual memory. This should reduce changes of
> failure but should not cause other behavioral changes as it_map is only
> used by the kernel's DMA hooks/api when MMU is on.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
It looks a very good change, and also makes code much simpler to read.
FWIW:
Reviewed-by: Leonardo Bras <leobras.c@gmail,com>
> ---
> arch/powerpc/kernel/iommu.c | 15 +++------------
> 1 file changed, 3 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index c00214a4355c..8eb6eb0afa97 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -719,7 +719,6 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
> {
> unsigned long sz;
> static int welcomed = 0;
> - struct page *page;
> unsigned int i;
> struct iommu_pool *p;
>
>
>
>
> @@ -728,11 +727,9 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
> /* number of bytes needed for the bitmap */
> sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
>
>
>
>
> - page = alloc_pages_node(nid, GFP_KERNEL, get_order(sz));
> - if (!page)
> + tbl->it_map = vzalloc_node(sz, nid);
> + if (!tbl->it_map)
> panic("iommu_init_table: Can't allocate %ld bytes\n", sz);
> - tbl->it_map = page_address(page);
> - memset(tbl->it_map, 0, sz);
>
>
>
>
> iommu_table_reserve_pages(tbl, res_start, res_end);
>
>
>
>
> @@ -774,8 +771,6 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
>
>
>
>
> static void iommu_table_free(struct kref *kref)
> {
> - unsigned long bitmap_sz;
> - unsigned int order;
> struct iommu_table *tbl;
>
>
>
>
> tbl = container_of(kref, struct iommu_table, it_kref);
> @@ -796,12 +791,8 @@ static void iommu_table_free(struct kref *kref)
> if (!bitmap_empty(tbl->it_map, tbl->it_size))
> pr_warn("%s: Unexpected TCEs\n", __func__);
>
>
>
>
> - /* calculate bitmap size in bytes */
> - bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
> -
> /* free bitmap */
> - order = get_order(bitmap_sz);
> - free_pages((unsigned long) tbl->it_map, order);
> + vfree(tbl->it_map);
>
>
>
>
> /* free table */
> kfree(tbl);
^ permalink raw reply
* Re: [PATCH v5 06/10] powerpc/signal64: Replace setup_sigcontext() w/ unsafe_setup_sigcontext()
From: Christopher M. Riedl @ 2021-02-17 19:31 UTC (permalink / raw)
To: Daniel Axtens, linuxppc-dev
In-Reply-To: <871rdletbo.fsf@dja-thinkpad.axtens.net>
On Thu Feb 11, 2021 at 11:41 PM CST, Daniel Axtens wrote:
> Hi Chris,
>
> > Previously setup_sigcontext() performed a costly KUAP switch on every
> > uaccess operation. These repeated uaccess switches cause a significant
> > drop in signal handling performance.
> >
> > Rewrite setup_sigcontext() to assume that a userspace write access window
> > is open. Replace all uaccess functions with their 'unsafe' versions
> > which avoid the repeated uaccess switches.
>
> Just to clarify the commit message a bit: you're also changing the
> callers of the old safe versions to first open the window, then call the
> unsafe variants, then close the window again.
Noted!
>
> > Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
> > ---
> > arch/powerpc/kernel/signal_64.c | 70 ++++++++++++++++++++-------------
> > 1 file changed, 43 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> > index 8e1d804ce552..4248e4489ff1 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -101,9 +101,13 @@ static void prepare_setup_sigcontext(struct task_struct *tsk, int ctx_has_vsx_re
> > * Set up the sigcontext for the signal frame.
> > */
> >
> > -static long setup_sigcontext(struct sigcontext __user *sc,
> > - struct task_struct *tsk, int signr, sigset_t *set,
> > - unsigned long handler, int ctx_has_vsx_region)
> > +#define unsafe_setup_sigcontext(sc, tsk, signr, set, handler, \
> > + ctx_has_vsx_region, e) \
> > + unsafe_op_wrap(__unsafe_setup_sigcontext(sc, tsk, signr, set, \
> > + handler, ctx_has_vsx_region), e)
> > +static long notrace __unsafe_setup_sigcontext(struct sigcontext __user *sc,
> > + struct task_struct *tsk, int signr, sigset_t *set,
> > + unsigned long handler, int ctx_has_vsx_region)
> > {
> > /* When CONFIG_ALTIVEC is set, we _always_ setup v_regs even if the
> > * process never used altivec yet (MSR_VEC is zero in pt_regs of
> > @@ -118,20 +122,19 @@ static long setup_sigcontext(struct sigcontext __user *sc,
> > #endif
> > struct pt_regs *regs = tsk->thread.regs;
> > unsigned long msr = regs->msr;
> > - long err = 0;
> > /* Force usr to alway see softe as 1 (interrupts enabled) */
> > unsigned long softe = 0x1;
> >
> > BUG_ON(tsk != current);
> >
> > #ifdef CONFIG_ALTIVEC
> > - err |= __put_user(v_regs, &sc->v_regs);
> > + unsafe_put_user(v_regs, &sc->v_regs, efault_out);
> >
> > /* save altivec registers */
> > if (tsk->thread.used_vr) {
> > /* Copy 33 vec registers (vr0..31 and vscr) to the stack */
> > - err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
> > - 33 * sizeof(vector128));
> > + unsafe_copy_to_user(v_regs, &tsk->thread.vr_state,
> > + 33 * sizeof(vector128), efault_out);
> > /* set MSR_VEC in the MSR value in the frame to indicate that sc->v_reg)
> > * contains valid data.
> > */
> > @@ -140,12 +143,12 @@ static long setup_sigcontext(struct sigcontext __user *sc,
> > /* We always copy to/from vrsave, it's 0 if we don't have or don't
> > * use altivec.
> > */
> > - err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
> > + unsafe_put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33], efault_out);
> > #else /* CONFIG_ALTIVEC */
> > - err |= __put_user(0, &sc->v_regs);
> > + unsafe_put_user(0, &sc->v_regs, efault_out);
> > #endif /* CONFIG_ALTIVEC */
> > /* copy fpr regs and fpscr */
> > - err |= copy_fpr_to_user(&sc->fp_regs, tsk);
> > + unsafe_copy_fpr_to_user(&sc->fp_regs, tsk, efault_out);
> >
> > /*
> > * Clear the MSR VSX bit to indicate there is no valid state attached
> > @@ -160,24 +163,27 @@ static long setup_sigcontext(struct sigcontext __user *sc,
> > */
> > if (tsk->thread.used_vsr && ctx_has_vsx_region) {
> > v_regs += ELF_NVRREG;
> > - err |= copy_vsx_to_user(v_regs, tsk);
> > + unsafe_copy_vsx_to_user(v_regs, tsk, efault_out);
> > /* set MSR_VSX in the MSR value in the frame to
> > * indicate that sc->vs_reg) contains valid data.
> > */
> > msr |= MSR_VSX;
> > }
> > #endif /* CONFIG_VSX */
> > - err |= __put_user(&sc->gp_regs, &sc->regs);
> > + unsafe_put_user(&sc->gp_regs, &sc->regs, efault_out);
> > WARN_ON(!FULL_REGS(regs));
> > - err |= __copy_to_user(&sc->gp_regs, regs, GP_REGS_SIZE);
> > - err |= __put_user(msr, &sc->gp_regs[PT_MSR]);
> > - err |= __put_user(softe, &sc->gp_regs[PT_SOFTE]);
> > - err |= __put_user(signr, &sc->signal);
> > - err |= __put_user(handler, &sc->handler);
> > + unsafe_copy_to_user(&sc->gp_regs, regs, GP_REGS_SIZE, efault_out);
> > + unsafe_put_user(msr, &sc->gp_regs[PT_MSR], efault_out);
> > + unsafe_put_user(softe, &sc->gp_regs[PT_SOFTE], efault_out);
> > + unsafe_put_user(signr, &sc->signal, efault_out);
> > + unsafe_put_user(handler, &sc->handler, efault_out);
> > if (set != NULL)
> > - err |= __put_user(set->sig[0], &sc->oldmask);
> > + unsafe_put_user(set->sig[0], &sc->oldmask, efault_out);
> >
> > - return err;
> > + return 0;
> > +
> > +efault_out:
> > + return -EFAULT;
> > }
> >
> > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > @@ -664,12 +670,15 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> >
> > if (old_ctx != NULL) {
> > prepare_setup_sigcontext(current, ctx_has_vsx_region);
> > - if (!access_ok(old_ctx, ctx_size)
> > - || setup_sigcontext(&old_ctx->uc_mcontext, current, 0, NULL, 0,
> > - ctx_has_vsx_region)
> > - || __copy_to_user(&old_ctx->uc_sigmask,
> > - ¤t->blocked, sizeof(sigset_t)))
> > + if (!user_write_access_begin(old_ctx, ctx_size))
> > return -EFAULT;
>
> I notice we get rid of access_ok, but that's fine because
> user_write_access_begin includes access_ok since Linus decided that was
> a good idea.
>
> > +
> > + unsafe_setup_sigcontext(&old_ctx->uc_mcontext, current, 0, NULL,
> > + 0, ctx_has_vsx_region, efault_out);
> > + unsafe_copy_to_user(&old_ctx->uc_sigmask, ¤t->blocked,
> > + sizeof(sigset_t), efault_out);
> > +
> > + user_write_access_end();
> > }
> > if (new_ctx == NULL)
> > return 0;
> > @@ -698,6 +707,10 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> > /* This returns like rt_sigreturn */
> > set_thread_flag(TIF_RESTOREALL);
> > return 0;
> > +
> > +efault_out:
> > + user_write_access_end();
> > + return -EFAULT;
> > }
> >
> >
> > @@ -850,9 +863,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
> > } else {
> > err |= __put_user(0, &frame->uc.uc_link);
> > prepare_setup_sigcontext(tsk, 1);
> > - err |= setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
> > - NULL, (unsigned long)ksig->ka.sa.sa_handler,
> > - 1);
> > + if (!user_write_access_begin(frame, sizeof(struct rt_sigframe)))
> > + return -EFAULT;
>
> Here you're opening a window for all of `frame`...
>
> > + err |= __unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk,
> ... but here you're only passing in frame->uc.uc_mcontext for writing.
>
> ISTR that the underlying AMR switch is fully on / fully off so I don't
> think it really matters, but in this case should you be calling
> user_write_access_begin() with &frame->uc.uc_mcontext and the size of
> that?
You are correct that it doesn't _really_ matter w/ the current
implementation of KUAP/AMR. It's still inconsistent and could cause
problems in the future so I will fix this to pass the proper size.
>
> > + ksig->sig, NULL,
> > + (unsigned long)ksig->ka.sa.sa_handler, 1);
> > + user_write_access_end();
> > }
> > err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
> > if (err)
>
>
> Apart from the size thing, everything looks good to me. I checked that
> all the things you've changed from safe to unsafe pass the same
> parameters, and they all looked good to me.
Thanks!
>
> With those caveats,
> Reviewed-by: Daniel Axtens <dja@axtens.net>
>
> Kind regards,
> Daniel
^ permalink raw reply
* Re: [PATCH v5 07/10] powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()
From: Christopher M. Riedl @ 2021-02-17 19:32 UTC (permalink / raw)
To: Daniel Axtens, linuxppc-dev
In-Reply-To: <87y2ftc7el.fsf@dja-thinkpad.axtens.net>
On Fri Feb 12, 2021 at 3:17 PM CST, Daniel Axtens wrote:
> Hi Chris,
>
> > Previously restore_sigcontext() performed a costly KUAP switch on every
> > uaccess operation. These repeated uaccess switches cause a significant
> > drop in signal handling performance.
> >
> > Rewrite restore_sigcontext() to assume that a userspace read access
> > window is open. Replace all uaccess functions with their 'unsafe'
> > versions which avoid the repeated uaccess switches.
> >
>
> Much of the same comments apply here as to the last patch:
> - the commit message might be improved by adding that you are also
> changing the calling functions to open the uaccess window before
> calling into the new unsafe functions
>
> - I have checked that the safe to unsafe conversions look right.
>
> - I think you're opening too wide a window in user_read_access_begin,
> it seems to me that it could be reduced to just the
> uc_mcontext. (Again, not that it makes a difference with the current
> HW.)
Ok, I'll fix these in the next version as well.
>
> Kind regards,
> Daniel
>
> > Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
> > ---
> > arch/powerpc/kernel/signal_64.c | 68 ++++++++++++++++++++-------------
> > 1 file changed, 41 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> > index 4248e4489ff1..d668f8af18fe 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -326,14 +326,14 @@ static long setup_tm_sigcontexts(struct sigcontext __user *sc,
> > /*
> > * Restore the sigcontext from the signal frame.
> > */
> > -
> > -static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> > - struct sigcontext __user *sc)
> > +#define unsafe_restore_sigcontext(tsk, set, sig, sc, e) \
> > + unsafe_op_wrap(__unsafe_restore_sigcontext(tsk, set, sig, sc), e)
> > +static long notrace __unsafe_restore_sigcontext(struct task_struct *tsk, sigset_t *set,
> > + int sig, struct sigcontext __user *sc)
> > {
> > #ifdef CONFIG_ALTIVEC
> > elf_vrreg_t __user *v_regs;
> > #endif
> > - unsigned long err = 0;
> > unsigned long save_r13 = 0;
> > unsigned long msr;
> > struct pt_regs *regs = tsk->thread.regs;
> > @@ -348,27 +348,28 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> > save_r13 = regs->gpr[13];
> >
> > /* copy the GPRs */
> > - err |= __copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr));
> > - err |= __get_user(regs->nip, &sc->gp_regs[PT_NIP]);
> > + unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr),
> > + efault_out);
> > + unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
> > /* get MSR separately, transfer the LE bit if doing signal return */
> > - err |= __get_user(msr, &sc->gp_regs[PT_MSR]);
> > + unsafe_get_user(msr, &sc->gp_regs[PT_MSR], efault_out);
> > if (sig)
> > regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
> > - err |= __get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3]);
> > - err |= __get_user(regs->ctr, &sc->gp_regs[PT_CTR]);
> > - err |= __get_user(regs->link, &sc->gp_regs[PT_LNK]);
> > - err |= __get_user(regs->xer, &sc->gp_regs[PT_XER]);
> > - err |= __get_user(regs->ccr, &sc->gp_regs[PT_CCR]);
> > + unsafe_get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3], efault_out);
> > + unsafe_get_user(regs->ctr, &sc->gp_regs[PT_CTR], efault_out);
> > + unsafe_get_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
> > + unsafe_get_user(regs->xer, &sc->gp_regs[PT_XER], efault_out);
> > + unsafe_get_user(regs->ccr, &sc->gp_regs[PT_CCR], efault_out);
> > /* Don't allow userspace to set SOFTE */
> > set_trap_norestart(regs);
> > - err |= __get_user(regs->dar, &sc->gp_regs[PT_DAR]);
> > - err |= __get_user(regs->dsisr, &sc->gp_regs[PT_DSISR]);
> > - err |= __get_user(regs->result, &sc->gp_regs[PT_RESULT]);
> > + unsafe_get_user(regs->dar, &sc->gp_regs[PT_DAR], efault_out);
> > + unsafe_get_user(regs->dsisr, &sc->gp_regs[PT_DSISR], efault_out);
> > + unsafe_get_user(regs->result, &sc->gp_regs[PT_RESULT], efault_out);
> >
> > if (!sig)
> > regs->gpr[13] = save_r13;
> > if (set != NULL)
> > - err |= __get_user(set->sig[0], &sc->oldmask);
> > + unsafe_get_user(set->sig[0], &sc->oldmask, efault_out);
> >
> > /*
> > * Force reload of FP/VEC.
> > @@ -378,29 +379,28 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> > regs->msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1 | MSR_VEC | MSR_VSX);
> >
> > #ifdef CONFIG_ALTIVEC
> > - err |= __get_user(v_regs, &sc->v_regs);
> > - if (err)
> > - return err;
> > + unsafe_get_user(v_regs, &sc->v_regs, efault_out);
> > if (v_regs && !access_ok(v_regs, 34 * sizeof(vector128)))
> > return -EFAULT;
> > /* Copy 33 vec registers (vr0..31 and vscr) from the stack */
> > if (v_regs != NULL && (msr & MSR_VEC) != 0) {
> > - err |= __copy_from_user(&tsk->thread.vr_state, v_regs,
> > - 33 * sizeof(vector128));
> > + unsafe_copy_from_user(&tsk->thread.vr_state, v_regs,
> > + 33 * sizeof(vector128), efault_out);
> > tsk->thread.used_vr = true;
> > } else if (tsk->thread.used_vr) {
> > memset(&tsk->thread.vr_state, 0, 33 * sizeof(vector128));
> > }
> > /* Always get VRSAVE back */
> > if (v_regs != NULL)
> > - err |= __get_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
> > + unsafe_get_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33],
> > + efault_out);
> > else
> > tsk->thread.vrsave = 0;
> > if (cpu_has_feature(CPU_FTR_ALTIVEC))
> > mtspr(SPRN_VRSAVE, tsk->thread.vrsave);
> > #endif /* CONFIG_ALTIVEC */
> > /* restore floating point */
> > - err |= copy_fpr_from_user(tsk, &sc->fp_regs);
> > + unsafe_copy_fpr_from_user(tsk, &sc->fp_regs, efault_out);
> > #ifdef CONFIG_VSX
> > /*
> > * Get additional VSX data. Update v_regs to point after the
> > @@ -409,14 +409,17 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
> > */
> > v_regs += ELF_NVRREG;
> > if ((msr & MSR_VSX) != 0) {
> > - err |= copy_vsx_from_user(tsk, v_regs);
> > + unsafe_copy_vsx_from_user(tsk, v_regs, efault_out);
> > tsk->thread.used_vsr = true;
> > } else {
> > for (i = 0; i < 32 ; i++)
> > tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
> > }
> > #endif
> > - return err;
> > + return 0;
> > +
> > +efault_out:
> > + return -EFAULT;
> > }
> >
> > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > @@ -701,8 +704,14 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> > if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
> > do_exit(SIGSEGV);
> > set_current_blocked(&set);
> > - if (restore_sigcontext(current, NULL, 0, &new_ctx->uc_mcontext))
> > +
> > + if (!user_read_access_begin(new_ctx, ctx_size))
> > + return -EFAULT;
> > + if (__unsafe_restore_sigcontext(current, NULL, 0, &new_ctx->uc_mcontext)) {
> > + user_read_access_end();
> > do_exit(SIGSEGV);
> > + }
> > + user_read_access_end();
> >
> > /* This returns like rt_sigreturn */
> > set_thread_flag(TIF_RESTOREALL);
> > @@ -807,8 +816,13 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > * causing a TM bad thing.
> > */
> > current->thread.regs->msr &= ~MSR_TS_MASK;
> > - if (restore_sigcontext(current, NULL, 1, &uc->uc_mcontext))
> > + if (!user_read_access_begin(uc, sizeof(*uc)))
> > + return -EFAULT;
> > + if (__unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext)) {
> > + user_read_access_end();
> > goto badframe;
> > + }
> > + user_read_access_end();
> > }
> >
> > if (restore_altstack(&uc->uc_stack))
> > --
> > 2.26.1
^ permalink raw reply
* Re: [PATCH kernel 2/2] powerpc/iommu: Do not immediately panic when failed IOMMU table allocation
From: Leonardo Bras @ 2021-02-17 19:32 UTC (permalink / raw)
To: Alexey Kardashevskiy, linuxppc-dev; +Cc: kvm-ppc, David Gibson
In-Reply-To: <20210216033307.69863-3-aik@ozlabs.ru>
On Tue, 2021-02-16 at 14:33 +1100, Alexey Kardashevskiy wrote:
> Most platforms allocate IOMMU table structures (specifically it_map)
> at the boot time and when this fails - it is a valid reason for panic().
>
> However the powernv platform allocates it_map after a device is returned
> to the host OS after being passed through and this happens long after
> the host OS booted. It is quite possible to trigger the it_map allocation
> panic() and kill the host even though it is not necessary - the host OS
> can still use the DMA bypass mode (requires a tiny fraction of it_map's
> memory) and even if that fails, the host OS is runnnable as it was without
> the device for which allocating it_map causes the panic.
>
> Instead of immediately crashing in a powernv/ioda2 system, this prints
> an error and continues. All other platforms still call panic().
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Hello Alexey,
This looks like a good change, that passes panic() decision to platform
code. Everything looks pretty straightforward, but I have a question
regarding this:
> @@ -1930,16 +1931,16 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
> res_start = pe->phb->ioda.m32_pci_base >> tbl->it_page_shift;
> res_end = min(window_size, SZ_4G) >> tbl->it_page_shift;
> }
> - iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end);
> - rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
>
> + if (iommu_init_table(tbl, pe->phb->hose->node, res_start, res_end))
> + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl);
> + else
> + rc = -ENOMEM;
> if (rc) {
> - pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
> - rc);
> + pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n", rc);
> iommu_tce_table_put(tbl);
> - return rc;
> + tbl = NULL; /* This clears iommu_table_base below */
> }
> -
> if (!pnv_iommu_bypass_disabled)
> pnv_pci_ioda2_set_bypass(pe, true);
>
>
If I could understand correctly, previously if iommu_init_table() did
not panic(), and pnv_pci_ioda2_set_window() returned something other
than 0, it would return rc in the if (rc) clause, but now it does not
happen anymore, going through if (!pnv_iommu_bypass_disabled) onwards.
Is that desired?
As far as I could see, returning rc there seems a good procedure after
iommu_init_table returning -ENOMEM.
Best regards,
Leonardo Bras
^ permalink raw reply
* Re: [PATCH v5 05/10] powerpc/signal64: Remove TM ifdefery in middle of if/else block
From: Christopher M. Riedl @ 2021-02-17 19:27 UTC (permalink / raw)
To: Daniel Axtens, linuxppc-dev
In-Reply-To: <874kiheu93.fsf@dja-thinkpad.axtens.net>
On Thu Feb 11, 2021 at 11:21 PM CST, Daniel Axtens wrote:
> Hi Chris,
>
> > Rework the messy ifdef breaking up the if-else for TM similar to
> > commit f1cf4f93de2f ("powerpc/signal32: Remove ifdefery in middle of if/else").
>
> I'm not sure what 'the messy ifdef' and 'the if-else for TM' is (yet):
> perhaps you could start the commit message with a tiny bit of
> background.
Yup good point - I will reword this for the next spin.
>
> > Unlike that commit for ppc32, the ifdef can't be removed entirely since
> > uc_transact in sigframe depends on CONFIG_PPC_TRANSACTIONAL_MEM.
> >
> > Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
> > ---
> > arch/powerpc/kernel/signal_64.c | 16 +++++++---------
> > 1 file changed, 7 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> > index b211a8ea4f6e..8e1d804ce552 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -710,9 +710,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > struct pt_regs *regs = current_pt_regs();
> > struct ucontext __user *uc = (struct ucontext __user *)regs->gpr[1];
> > sigset_t set;
> > -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > unsigned long msr;
> > -#endif
> >
> > /* Always make any pending restarted system calls return -EINTR */
> > current->restart_block.fn = do_no_restart_syscall;
> > @@ -765,7 +763,10 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >
> > if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
> > goto badframe;
> > +#endif
> > +
> > if (MSR_TM_ACTIVE(msr)) {
> > +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > /* We recheckpoint on return. */
> > struct ucontext __user *uc_transact;
> >
> > @@ -778,9 +779,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
> > &uc_transact->uc_mcontext))
> > goto badframe;
> > - } else
> > #endif
> > - {
> > + } else {
> > /*
> > * Fall through, for non-TM restore
> > *
>
> I think you can get rid of all the ifdefs in _this function_ by
> providing a couple of stubs:
>
> diff --git a/arch/powerpc/kernel/process.c
> b/arch/powerpc/kernel/process.c
> index a66f435dabbf..19059a4b798f 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1120,6 +1120,7 @@ void restore_tm_state(struct pt_regs *regs)
> #else
> #define tm_recheckpoint_new_task(new)
> #define __switch_to_tm(prev, new)
> +void tm_reclaim_current(uint8_t cause) {}
> #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
>
> static inline void save_sprs(struct thread_struct *t)
> diff --git a/arch/powerpc/kernel/signal_64.c
> b/arch/powerpc/kernel/signal_64.c
> index 8e1d804ce552..ed58ca019ad9 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -594,6 +594,13 @@ static long restore_tm_sigcontexts(struct
> task_struct *tsk,
>
> return err;
> }
> +#else
> +static long restore_tm_sigcontexts(struct task_struct *tsk,
> + struct sigcontext __user *sc,
> + struct sigcontext __user *tm_sc)
> +{
> + return -EINVAL;
> +}
> #endif
>
> /*
> @@ -722,7 +729,6 @@ SYSCALL_DEFINE0(rt_sigreturn)
> goto badframe;
> set_current_blocked(&set);
>
> -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> /*
> * If there is a transactional state then throw it away.
> * The purpose of a sigreturn is to destroy all traces of the
> @@ -763,10 +769,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
>
> if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
> goto badframe;
> -#endif
>
> if (MSR_TM_ACTIVE(msr)) {
> -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> /* We recheckpoint on return. */
> struct ucontext __user *uc_transact;
>
> @@ -779,7 +783,6 @@ SYSCALL_DEFINE0(rt_sigreturn)
> if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
> &uc_transact->uc_mcontext))
> goto badframe;
> -#endif
> } else {
> /*
> * Fall through, for non-TM restore
>
> My only concern here was whether it was valid to access
> if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
> if CONFIG_PPC_TRANSACTIONAL_MEM was not defined, but I didn't think of
> any obvious reason why it wouldn't be...
Hmm, we don't really need it for the non-TM case and it is another extra
uaccess. I will take your suggestion to remove the need for the other
ifdefs but might keep this one. Keeping it also makes it very clear this
call is only here for TM and possible to remove in a potentially TM-less
future :)
>
> > @@ -818,10 +818,8 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
> > unsigned long newsp = 0;
> > long err = 0;
> > struct pt_regs *regs = tsk->thread.regs;
> > -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > /* Save the thread's msr before get_tm_stackpointer() changes it */
> > unsigned long msr = regs->msr;
> > -#endif
>
> I wondered if that comment still makes sense in the absence of the
> ifdef. It is preserved in commit f1cf4f93de2f ("powerpc/signal32: Remove
> ifdefery in middle of if/else"), and I can't figure out how you would
> improve it, so I guess it's probably good as it is.
>
> > frame = get_sigframe(ksig, tsk, sizeof(*frame), 0);
> > if (!access_ok(frame, sizeof(*frame)))
> > @@ -836,8 +834,9 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
> > /* Create the ucontext. */
> > err |= __put_user(0, &frame->uc.uc_flags);
> > err |= __save_altstack(&frame->uc.uc_stack, regs->gpr[1]);
> > -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > +
> > if (MSR_TM_ACTIVE(msr)) {
> > +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > /* The ucontext_t passed to userland points to the second
> > * ucontext_t (for transactional state) with its uc_link ptr.
> > */
> > @@ -847,9 +846,8 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
> > tsk, ksig->sig, NULL,
> > (unsigned long)ksig->ka.sa.sa_handler,
> > msr);
> > - } else
> > #endif
> > - {
> > + } else {
> > err |= __put_user(0, &frame->uc.uc_link);
> > prepare_setup_sigcontext(tsk, 1);
> > err |= setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
>
> That seems reasonable to me.
Thanks for the feedback!
>
> Kind regards,
> Daniel
^ permalink raw reply
* Re: [PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t
From: Christopher M. Riedl @ 2021-02-17 19:53 UTC (permalink / raw)
To: Daniel Axtens, linuxppc-dev
In-Reply-To: <87v9axc78z.fsf@dja-thinkpad.axtens.net>
On Fri Feb 12, 2021 at 3:21 PM CST, Daniel Axtens wrote:
> "Christopher M. Riedl" <cmr@codefail.de> writes:
>
> > Usually sigset_t is exactly 8B which is a "trivial" size and does not
> > warrant using __copy_from_user(). Use __get_user() directly in
> > anticipation of future work to remove the trivial size optimizations
> > from __copy_from_user(). Calling __get_user() also results in a small
> > boost to signal handling throughput here.
>
> Modulo the comments from Christophe, this looks good to me. It seems to
> do what it says on the tin.
>
> Reviewed-by: Daniel Axtens <dja@axtens.net>
>
> Do you know if this patch is responsible for the slightly increase in
> radix performance you observed in your cover letter? The rest of the
> series shouldn't really make things faster than the no-KUAP case...
No, this patch just results in a really small improvement overall.
I looked over this again and I think the reason for the speedup is that
my implementation of unsafe_copy_from_user() in this series calls
__copy_tofrom_user() directly avoiding a barrier_nospec(). This speeds
up all the unsafe_get_from_user() calls in this version.
Skipping the barrier_nospec() like that is incorrect, but Christophe
recently sent a patch which moves barrier_nospec() into the uaccess
allowance helpers. It looks like mpe has already accepted that patch:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-February/223959.html
That means that my implementation of unsafe_copy_from_user() is _now_
correct _and_ faster. We do not need to call barrier_nospec() since the
precondition for the 'unsafe' routine is that we called one of the
helpers to open up a uaccess window earlier.
>
> Kind regards,
> Daniel
>
> >
> > Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
> > ---
> > arch/powerpc/kernel/signal_64.c | 14 ++++++++++++--
> > 1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> > index 817b64e1e409..42fdc4a7ff72 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -97,6 +97,14 @@ static void prepare_setup_sigcontext(struct task_struct *tsk, int ctx_has_vsx_re
> > #endif /* CONFIG_VSX */
> > }
> >
> > +static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
> > +{
> > + if (sizeof(sigset_t) <= 8)
> > + return __get_user(dst->sig[0], &src->sig[0]);
> > + else
> > + return __copy_from_user(dst, src, sizeof(sigset_t));
> > +}
> > +
> > /*
> > * Set up the sigcontext for the signal frame.
> > */
> > @@ -701,8 +709,9 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> > * We kill the task with a SIGSEGV in this situation.
> > */
> >
> > - if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
> > + if (get_user_sigset(&set, &new_ctx->uc_sigmask))
> > do_exit(SIGSEGV);
> > +
> > set_current_blocked(&set);
> >
> > if (!user_read_access_begin(new_ctx, ctx_size))
> > @@ -740,8 +749,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > if (!access_ok(uc, sizeof(*uc)))
> > goto badframe;
> >
> > - if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
> > + if (get_user_sigset(&set, &uc->uc_sigmask))
> > goto badframe;
> > +
> > set_current_blocked(&set);
> >
> > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > --
> > 2.26.1
^ permalink raw reply
* Re: [PATCH 1/4] add generic builtin command line
From: Andrew Morton @ 2021-02-17 21:16 UTC (permalink / raw)
To: Daniel Gimpelevich
Cc: Christophe Leroy, Maksym Kokhan, linux-kernel, Rob Herring,
Paul Mackerras, xe-linux-external, Daniel Walker, linuxppc-dev,
Daniel Walker
In-Reply-To: <1613417521.3853.5.camel@chimera>
On Mon, 15 Feb 2021 11:32:01 -0800 Daniel Gimpelevich <daniel@gimpelevich.san-francisco.ca.us> wrote:
> On Thu, 2019-03-21 at 15:15 -0700, Andrew Morton wrote:
> > On Thu, 21 Mar 2019 08:13:08 -0700 Daniel Walker <danielwa@cisco.com> wrote:
> > > On Wed, Mar 20, 2019 at 08:14:33PM -0700, Andrew Morton wrote:
> > > > The patches (or some version of them) are already in linux-next,
> > > > which messes me up. I'll disable them for now.
> > >
> > > Those are from my tree, but I remove them when you picked up the series. The
> > > next linux-next should not have them.
> >
> > Yup, thanks, all looks good now.
>
> This patchset is currently neither in mainline nor in -next. May I ask
> what happened to it? Thanks.
Seems that I didn't bring them back after the confict with the powerpc
tree resolved itself.
Please resend everything for -rc1 and let's await the reviewer
feedback,
^ permalink raw reply
* Re: [PATCH v18 00/10] Carry forward IMA measurement log on kexec on ARM64
From: Rob Herring @ 2021-02-18 1:25 UTC (permalink / raw)
To: Lakshmi Ramasubramanian
Cc: mark.rutland, tao.li, zohar, paulus, vincenzo.frascino,
frowand.list, sashal, masahiroy, jmorris, takahiro.akashi,
linux-arm-kernel, catalin.marinas, serge, devicetree,
pasha.tatashin, will, prsriva, hsinyi, allison, christophe.leroy,
mbrugger, balajib, dmitry.kasatkin, linux-kernel, james.morse,
gregkh, joe, linux-integrity, linuxppc-dev, bauerman
In-Reply-To: <20210213161049.6190-1-nramas@linux.microsoft.com>
On Sat, Feb 13, 2021 at 08:10:38AM -0800, Lakshmi Ramasubramanian wrote:
> On kexec file load Integrity Measurement Architecture (IMA) subsystem
> may verify the IMA signature of the kernel and initramfs, and measure
> it. The command line parameters passed to the kernel in the kexec call
> may also be measured by IMA. A remote attestation service can verify
> a TPM quote based on the TPM event log, the IMA measurement list, and
> the TPM PCR data. This can be achieved only if the IMA measurement log
> is carried over from the current kernel to the next kernel across
> the kexec call.
>
> powerpc already supports carrying forward the IMA measurement log on
> kexec. This patch set adds support for carrying forward the IMA
> measurement log on kexec on ARM64.
>
> This patch set moves the platform independent code defined for powerpc
> such that it can be reused for other platforms as well. A chosen node
> "linux,ima-kexec-buffer" is added to the DTB for ARM64 to hold
> the address and the size of the memory reserved to carry
> the IMA measurement log.
>
> This patch set has been tested for ARM64 platform using QEMU.
> I would like help from the community for testing this change on powerpc.
> Thanks.
>
> This patch set is based on
> commit f31e3386a4e9 ("ima: Free IMA measurement buffer after kexec syscall")
> in https://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
> "ima-kexec-fixes" branch.
>
> Changelog:
>
> v18
> - Added a parameter to of_kexec_alloc_and_setup_fdt() for the caller
> to specify additional space needed for the FDT buffer
> - Renamed arm64 and powerpc ELF buffer address field in
> "struct kimage_arch" to elf_load_addr to match x86_64 name.
> - Removed of_ima_add_kexec_buffer() and instead directly set
> ima_buffer_addr and ima_buffer_size in ima_add_kexec_buffer()
> - Moved FDT_EXTRA_SPACE definition from include/linux/of.h to
> drivers/of/kexec.c
>
> v17
> - Renamed of_kexec_setup_new_fdt() to of_kexec_alloc_and_setup_fdt(),
> and moved memory allocation for the new FDT to this function.
>
> v16
> - Defined functions to allocate and free buffer for FDT for powerpc
> and arm64.
> - Moved ima_buffer_addr and ima_buffer_size fields from
> "struct kimage_arch" in powerpc to "struct kimage"
> v15
> - Included Rob's patches in the patch set, and rebased
> the changes to "next-integrity" branch.
> - Allocate memory for DTB, on arm64, using kmalloc() instead of
> vmalloc() to keep it consistent with powerpc implementation.
> - Call of_kexec_setup_new_fdt() from setup_new_fdt_ppc64() and
> remove setup_new_fdt() in the same patch to keep it bisect safe.
>
> v14
> - Select CONFIG_HAVE_IMA_KEXEC for CONFIG_KEXEC_FILE, for powerpc
> and arm64, if CONFIG_IMA is enabled.
> - Use IS_ENABLED() macro instead of "#ifdef" in remove_ima_buffer(),
> ima_get_kexec_buffer(), and ima_free_kexec_buffer().
> - Call of_kexec_setup_new_fdt() from setup_new_fdt_ppc64() and
> remove setup_new_fdt() in "arch/powerpc/kexec/file_load.c".
>
> v13
> - Moved the arch independent functions to drivers/of/kexec.c
> and then refactored the code.
> - Moved arch_ima_add_kexec_buffer() to
> security/integrity/ima/ima_kexec.c
>
> v12
> - Use fdt_appendprop_addrrange() in setup_ima_buffer()
> to setup the IMA measurement list property in
> the device tree.
> - Moved architecture independent functions from
> "arch/powerpc/kexec/ima.c" to "drivers/of/kexec."
> - Deleted "arch/powerpc/kexec/ima.c" and
> "arch/powerpc/include/asm/ima.h".
>
> v11
> - Rebased the changes on the kexec code refactoring done by
> Rob Herring in his "dt/kexec" branch
> - Removed "extern" keyword in function declarations
> - Removed unnecessary header files included in C files
> - Updated patch descriptions per Thiago's comments
>
> v10
> - Moved delete_fdt_mem_rsv(), remove_ima_buffer(),
> get_ima_kexec_buffer, and get_root_addr_size_cells()
> to drivers/of/kexec.c
> - Moved arch_ima_add_kexec_buffer() to
> security/integrity/ima/ima_kexec.c
> - Conditionally define IMA buffer fields in struct kimage_arch
>
> v9
> - Moved delete_fdt_mem_rsv() to drivers/of/kexec_fdt.c
> - Defined a new function get_ima_kexec_buffer() in
> drivers/of/ima_kexec.c to replace do_get_kexec_buffer()
> - Changed remove_ima_kexec_buffer() to the original function name
> remove_ima_buffer()
> - Moved remove_ima_buffer() to drivers/of/ima_kexec.c
> - Moved ima_get_kexec_buffer() and ima_free_kexec_buffer()
> to security/integrity/ima/ima_kexec.c
>
> v8:
> - Moved remove_ima_kexec_buffer(), do_get_kexec_buffer(), and
> delete_fdt_mem_rsv() to drivers/of/fdt.c
> - Moved ima_dump_measurement_list() and ima_add_kexec_buffer()
> back to security/integrity/ima/ima_kexec.c
>
> v7:
> - Renamed remove_ima_buffer() to remove_ima_kexec_buffer() and moved
> this function definition to kernel.
> - Moved delete_fdt_mem_rsv() definition to kernel
> - Moved ima_dump_measurement_list() and ima_add_kexec_buffer() to
> a new file namely ima_kexec_fdt.c in IMA
>
> v6:
> - Remove any existing FDT_PROP_IMA_KEXEC_BUFFER property in the device
> tree and also its corresponding memory reservation in the currently
> running kernel.
> - Moved the function remove_ima_buffer() defined for powerpc to IMA
> and renamed the function to ima_remove_kexec_buffer(). Also, moved
> delete_fdt_mem_rsv() from powerpc to IMA.
>
> v5:
> - Merged get_addr_size_cells() and do_get_kexec_buffer() into a single
> function when moving the arch independent code from powerpc to IMA
> - Reverted the change to use FDT functions in powerpc code and added
> back the original code in get_addr_size_cells() and
> do_get_kexec_buffer() for powerpc.
> - Added fdt_add_mem_rsv() for ARM64 to reserve the memory for
> the IMA log buffer during kexec.
> - Fixed the warning reported by kernel test bot for ARM64
> arch_ima_add_kexec_buffer() - moved this function to a new file
> namely arch/arm64/kernel/ima_kexec.c
>
> v4:
> - Submitting the patch series on behalf of the original author
> Prakhar Srivastava <prsriva@linux.microsoft.com>
> - Moved FDT_PROP_IMA_KEXEC_BUFFER ("linux,ima-kexec-buffer") to
> libfdt.h so that it can be shared by multiple platforms.
>
> v3:
> Breakup patches further into separate patches.
> - Refactoring non architecture specific code out of powerpc
> - Update powerpc related code to use fdt functions
> - Update IMA buffer read related code to use of functions
> - Add support to store the memory information of the IMA
> measurement logs to be carried forward.
> - Update the property strings to align with documented nodes
> https://github.com/devicetree-org/dt-schema/pull/46
>
> v2:
> Break patches into separate patches.
> - Powerpc related Refactoring
> - Updating the docuemntation for chosen node
> - Updating arm64 to support IMA buffer pass
>
> v1:
> Refactoring carrying over IMA measuremnet logs over Kexec. This patch
> moves the non-architecture specific code out of powerpc and adds to
> security/ima.(Suggested by Thiago)
> Add Documentation regarding the ima-kexec-buffer node in the chosen
> node documentation
>
> v0:
> Add a layer of abstraction to use the memory reserved by device tree
> for ima buffer pass.
> Add support for ima buffer pass using reserved memory for arm64 kexec.
> Update the arch sepcific code path in kexec file load to store the
> ima buffer in the reserved memory. The same reserved memory is read
> on kexec or cold boot.
>
> Lakshmi Ramasubramanian (7):
> arm64: Rename kexec elf_headers_mem to elf_load_addr
> powerpc: Move ima buffer fields to struct kimage
> powerpc: Enable passing IMA log to next kernel on kexec
> powerpc: Move arch independent ima kexec functions to
> drivers/of/kexec.c
> kexec: Use fdt_appendprop_addrrange() to add ima buffer to FDT
> powerpc: Delete unused function delete_fdt_mem_rsv()
> arm64: Enable passing IMA log to next kernel on kexec
>
> Rob Herring (4):
> powerpc: Rename kexec elfcorehdr_addr to elf_load_addr
> of: Add a common kexec FDT setup function
> arm64: Use common of_kexec_alloc_and_setup_fdt()
> powerpc: Use common of_kexec_alloc_and_setup_fdt()
I've applied the series. The merge conflict is not too bad, but will
need a follow on patch to change the powerpc FDT size calculation
(dropping the current FDT's size out).
Thanks for persisting!
Rob
^ permalink raw reply
* linux-next: manual merge of the devicetree tree with the powerpc tree
From: Stephen Rothwell @ 2021-02-18 3:48 UTC (permalink / raw)
To: Rob Herring, Michael Ellerman, PowerPC
Cc: Lakshmi Ramasubramanian, Linux Next Mailing List,
Linux Kernel Mailing List, Hari Bathini, Rob Herring
[-- Attachment #1: Type: text/plain, Size: 858 bytes --]
Hi all,
Today's linux-next merge of the devicetree tree got a conflict in:
arch/powerpc/kexec/elf_64.c
between commit:
2377c92e37fe ("powerpc/kexec_file: fix FDT size estimation for kdump kernel")
from the powerpc tree and commit:
130b2d59cec0 ("powerpc: Use common of_kexec_alloc_and_setup_fdt()")
from the devicetree tree.
I can't easily see how to resolve these, so for now I have just used
the latter' changes to this file.
I fixed it up and can carry the fix as necessary. This is now fixed as
far as linux-next is concerned, but any non trivial conflicts should be
mentioned to your upstream maintainer when your tree is submitted for
merging. You may also want to consider cooperating with the maintainer
of the conflicting tree to minimise any particularly complex conflicts.
--
Cheers,
Stephen Rothwell
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [PATCH] arm64: defconfig: enable modern virtio pci device
From: Jason Wang @ 2021-02-18 6:04 UTC (permalink / raw)
To: Arnd Bergmann, Anders Roxell
Cc: Chris Zankel, Thomas Bogendoerfer, Michael S. Tsirkin,
Arnd Bergmann, linuxppc-dev, Catalin Marinas, linux-xtensa,
Paul Walmsley, virtualization, linux-kernel@vger.kernel.org,
Russell King - ARM Linux, Max Filippov, SoC Team, Albert Ou,
Palmer Dabbelt, linux-riscv, open list:BROADCOM NVRAM DRIVER,
Will Deacon, Linux ARM
In-Reply-To: <CAK8P3a2ysNApoG2FDsLdNoWA7nPXvzLMzkjXWdCig9jaSWwuKw@mail.gmail.com>
On 2021/2/11 下午7:52, Arnd Bergmann wrote:
> On Wed, Feb 10, 2021 at 8:05 PM Anders Roxell <anders.roxell@linaro.org> wrote:
>> Since patch ("virtio-pci: introduce modern device module") got added it
>> is not possible to boot a defconfig kernel in qemu with a virtio pci
>> device. Add CONFIG_VIRTIO_PCI_MODERN=y fragment makes the kernel able
>> to boot.
>>
>> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
>> ---
>> arch/arm/configs/multi_v7_defconfig | 1 +
>> arch/arm64/configs/defconfig | 1 +
> Acked-by: Arnd Bergmann <arnd@arndb.de>
>
> Michael, can you pick this up in the vhost tree that introduces the regression?
>
> Arnd
>
Hi:
Based on the discussion previously, the plan is to select
VIRTIO_PCI_MODERN, and document the module that select it must depend on
PCI.
I will post a patch soon.
Thanks
^ permalink raw reply
* Re: [PATCH 4/4] ASoC: fsl: drop unneeded snd_soc_dai_set_drvdata
From: Nicolin Chen @ 2021-02-18 6:30 UTC (permalink / raw)
To: Julia Lawall
Cc: alsa-devel, linuxppc-dev, Timur Tabi, Xiubo Li, Shengjiu Wang,
Takashi Iwai, kernel-janitors, Liam Girdwood, Jaroslav Kysela,
Mark Brown, Fabio Estevam, linux-kernel
In-Reply-To: <20210213101907.1318496-5-Julia.Lawall@inria.fr>
On Sat, Feb 13, 2021 at 11:19:07AM +0100, Julia Lawall wrote:
> snd_soc_dai_set_drvdata is not needed when the set data comes from
> snd_soc_dai_get_drvdata or dev_get_drvdata. The problem was fixed
> usingthe following semantic patch: (http://coccinelle.lip6.fr/)
>
> // <smpl>
> @@
> expression x,y,e;
> @@
> x = dev_get_drvdata(y->dev)
> ... when != x = e
> - snd_soc_dai_set_drvdata(y,x);
>
> @@
> expression x,y,e;
> @@
> x = snd_soc_dai_get_drvdata(y)
> ... when != x = e
> - snd_soc_dai_set_drvdata(y,x);
> // </smpl>
>
> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
^ permalink raw reply
* Re: [PATCH v2 3/7] ASoC: dt-bindings: fsl_rpmsg: Add binding doc for rpmsg cpu dai driver
From: Shengjiu Wang @ 2021-02-18 7:21 UTC (permalink / raw)
To: Rob Herring
Cc: open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
alsa-devel, Timur Tabi, Xiubo Li, Fabio Estevam, Shengjiu Wang,
Takashi Iwai, Liam Girdwood, linux-kernel, Nicolin Chen,
Mark Brown, linuxppc-dev
In-Reply-To: <20210210221252.GA2885308@robh.at.kernel.org>
On Thu, Feb 11, 2021 at 6:13 AM Rob Herring <robh@kernel.org> wrote:
>
> On Sun, Feb 07, 2021 at 06:23:51PM +0800, Shengjiu Wang wrote:
> > fsl_rpmsg cpu dai driver is driver for rpmsg audio, which is mainly used
> > for getting the user's configuration from device tree and configure the
> > clocks which is used by Cortex-M core. So in this document define the
> > needed property.
> >
> > Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
> > ---
> > .../devicetree/bindings/sound/fsl,rpmsg.yaml | 80 +++++++++++++++++++
> > 1 file changed, 80 insertions(+)
> > create mode 100644 Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml
> >
> > diff --git a/Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml b/Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml
> > new file mode 100644
> > index 000000000000..2d3ce10d42fc
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml
> > @@ -0,0 +1,80 @@
> > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/sound/fsl,rpmsg.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: NXP Audio RPMSG CPU DAI Controller
> > +
> > +maintainers:
> > + - Shengjiu Wang <shengjiu.wang@nxp.com>
> > +
> > +properties:
> > + compatible:
> > + enum:
> > + - fsl,imx7ulp-rpmsg
> > + - fsl,imx8mn-rpmsg
> > + - fsl,imx8mm-rpmsg
> > + - fsl,imx8mp-rpmsg
>
> rpmsg is a protocol. What's the h/w block?
On Linux side this driver is a virtual driver, it is running
on Arm Cortex-A core. The h/w block is controlled by
another core (cortex-M core). so this driver actually
doesn't touch any hardware, it just does configuration
for rpmsg channel.
>
> > +
> > + clocks:
> > + items:
> > + - description: Peripheral clock for register access
> > + - description: Master clock
> > + - description: DMA clock for DMA register access
> > + - description: Parent clock for multiple of 8kHz sample rates
> > + - description: Parent clock for multiple of 11kHz sample rates
> > + minItems: 5
> > +
> > + clock-names:
> > + items:
> > + - const: ipg
> > + - const: mclk
> > + - const: dma
> > + - const: pll8k
> > + - const: pll11k
> > + minItems: 5
> > +
> > + power-domains:
> > + maxItems: 1
> > +
> > + fsl,audioindex:
> > + $ref: /schemas/types.yaml#/definitions/uint32
> > + description: instance index for rpmsg image
> > +
> > + fsl,version:
> > + $ref: /schemas/types.yaml#/definitions/uint32
> > + description: rpmsg image version index
>
> What are these 2 used for?
fsl,audioindex: As we may support multiple instance, for example
two sound card with one rpmsg channel, this is the instance index.
fsl,version: There are maybe different image running on M core, this
is the image version, different image has different function.
>
> > +
> > + fsl,buffer-size:
> > + $ref: /schemas/types.yaml#/definitions/uint32
> > + description: pre allocate dma buffer size
> > +
> > + fsl,enable-lpa:
> > + $ref: /schemas/types.yaml#/definitions/flag
> > + description: enable low power audio path.
> > +
> > + fsl,codec-type:
> > + $ref: /schemas/types.yaml#/definitions/uint32
> > + description: Sometimes the codec is registered by
> > + driver not the device tree, this items
> > + can be used to distinguish codecs
>
> 0-2^32 are valid values?
I should add range for it.
>
> > +
> > +required:
> > + - compatible
> > + - fsl,audioindex
> > + - fsl,version
> > + - fsl,buffer-size
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > + - |
> > + rpmsg_audio: rpmsg_audio {
> > + compatible = "fsl,imx8mn-rpmsg";
> > + fsl,audioindex = <0> ;
> > + fsl,version = <2>;
> > + fsl,buffer-size = <0x6000000>;
> > + fsl,enable-lpa;
> > + status = "okay";
>
> Don't show status in examples.
ok, will remove it.
>
> > + };
> > --
> > 2.27.0
> >
^ permalink raw reply
* Re: [PATCH v2 7/7] ASoC: dt-bindings: imx-rpmsg: Add binding doc for rpmsg machine driver
From: Shengjiu Wang @ 2021-02-18 7:23 UTC (permalink / raw)
To: Rob Herring
Cc: open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
alsa-devel, Timur Tabi, Xiubo Li, Fabio Estevam, Shengjiu Wang,
Takashi Iwai, Liam Girdwood, linux-kernel, Nicolin Chen,
Mark Brown, linuxppc-dev
In-Reply-To: <20210210221704.GA2894134@robh.at.kernel.org>
On Thu, Feb 11, 2021 at 6:18 AM Rob Herring <robh@kernel.org> wrote:
>
> On Sun, Feb 07, 2021 at 06:23:55PM +0800, Shengjiu Wang wrote:
> > Imx-rpmsg is a new added machine driver for supporting audio on Cortex-M
> > core. The Cortex-M core will control the audio interface, DMA and audio
> > codec, setup the pipeline, the audio driver on Cortex-A core side is just
> > to communitcate with M core, it is a virtual sound card and don't touch
> > the hardware.
>
> I don't understand why there are 2 nodes for this other than you happen
> to want to split this into 2 Linux drivers. It's 1 h/w thing.
This one is for the sound card machine driver. Another one is
for the sound card cpu dai driver. so there are 2 nodes.
>
> >
> > Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
> > ---
> > .../bindings/sound/imx-audio-rpmsg.yaml | 48 +++++++++++++++++++
> > 1 file changed, 48 insertions(+)
> > create mode 100644 Documentation/devicetree/bindings/sound/imx-audio-rpmsg.yaml
> >
> > diff --git a/Documentation/devicetree/bindings/sound/imx-audio-rpmsg.yaml b/Documentation/devicetree/bindings/sound/imx-audio-rpmsg.yaml
> > new file mode 100644
> > index 000000000000..b941aeb80678
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/sound/imx-audio-rpmsg.yaml
> > @@ -0,0 +1,48 @@
> > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/sound/imx-audio-rpmsg.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: NXP i.MX audio complex with rpmsg
> > +
> > +maintainers:
> > + - Shengjiu Wang <shengjiu.wang@nxp.com>
> > +
> > +properties:
> > + compatible:
> > + enum:
> > + - fsl,imx-audio-rpmsg
> > +
> > + model:
> > + $ref: /schemas/types.yaml#/definitions/string
> > + description: User specified audio sound card name
> > +
> > + audio-cpu:
> > + description: The phandle of an CPU DAI controller
> > +
> > + rpmsg-out:
> > + description: |
> > + This is a boolean property. If present, the transmitting function
> > + will be enabled,
> > +
> > + rpmsg-in:
> > + description: |
> > + This is a boolean property. If present, the receiving function
> > + will be enabled.
> > +
> > +required:
> > + - compatible
> > + - model
> > + - audio-cpu
> > +
> > +additionalProperties: false
> > +
> > +examples:
> > + - |
> > + sound-rpmsg {
> > + compatible = "fsl,imx-audio-rpmsg";
> > + model = "ak4497-audio";
> > + audio-cpu = <&rpmsg_audio>;
> > + rpmsg-out;
> > + };
> > --
> > 2.27.0
> >
^ permalink raw reply
* Re: [PATCH v2 2/7] ASoC: fsl_rpmsg: Add CPU DAI driver for audio base on rpmsg
From: Shengjiu Wang @ 2021-02-18 7:57 UTC (permalink / raw)
To: Mark Brown
Cc: open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
alsa-devel, Timur Tabi, Xiubo Li, Fabio Estevam, Shengjiu Wang,
Takashi Iwai, Liam Girdwood, linux-kernel, Nicolin Chen,
Rob Herring, linuxppc-dev
In-Reply-To: <20210210153808.GB4748@sirena.org.uk>
On Wed, Feb 10, 2021 at 11:39 PM Mark Brown <broonie@kernel.org> wrote:
>
> On Wed, Feb 10, 2021 at 02:35:29PM +0800, Shengjiu Wang wrote:
> > On Wed, Feb 10, 2021 at 6:30 AM Mark Brown <broonie@kernel.org> wrote:
>
> > > Like I say I'd actually recommend moving this control to DAPM.
>
> > I may understand your point, you suggest to use the .set_bias_level
> > interface. But in my case I need to enable the clock in earlier stage
> > and keep the clock on when system go to suspend.
>
> The device can be kept alive over system suspend if that's needed, or
> possibly it sounds like runtime PM is a better fit? There's callbacks
> in the core to keep the device runtime PM enabled while it's open which
> is probably about the time range you're looking for.
Before enabling the clock, I need to reparent the clock according to
the sample rate, Maybe the hw_params is the right place to do
these things.
Can I add a flag:
"rpmsg->mclk_streams & BIT(substream->stream)"
for avoiding multiple calls of hw_params function before enabling
clock?
^ permalink raw reply
* Re: linux-next: manual merge of the devicetree tree with the powerpc tree
From: Michael Ellerman @ 2021-02-18 10:44 UTC (permalink / raw)
To: Stephen Rothwell, Rob Herring, PowerPC
Cc: Lakshmi Ramasubramanian, Linux Next Mailing List,
Linux Kernel Mailing List, Hari Bathini, Rob Herring
In-Reply-To: <20210218144815.5673ae6f@canb.auug.org.au>
Stephen Rothwell <sfr@canb.auug.org.au> writes:
> Hi all,
>
> Today's linux-next merge of the devicetree tree got a conflict in:
>
> arch/powerpc/kexec/elf_64.c
>
> between commit:
>
> 2377c92e37fe ("powerpc/kexec_file: fix FDT size estimation for kdump kernel")
>
> from the powerpc tree and commit:
>
> 130b2d59cec0 ("powerpc: Use common of_kexec_alloc_and_setup_fdt()")
>
> from the devicetree tree.
>
> I can't easily see how to resolve these, so for now I have just used
> the latter' changes to this file.
I think it just needs this?
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 87e34611f93d..0492ca6003f3 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -104,7 +104,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
fdt = of_kexec_alloc_and_setup_fdt(image, initrd_load_addr,
initrd_len, cmdline,
- fdt_totalsize(initial_boot_params));
+ kexec_fdt_totalsize_ppc64(image));
if (!fdt) {
pr_err("Error setting up the new device tree.\n");
ret = -EINVAL;
cheers
^ permalink raw reply related
* Re: [PATCH v5 05/10] powerpc/signal64: Remove TM ifdefery in middle of if/else block
From: Michael Ellerman @ 2021-02-18 10:47 UTC (permalink / raw)
To: Christopher M. Riedl, Daniel Axtens, linuxppc-dev
In-Reply-To: <C9C1XKNVRQC4.LHKIIQAIC3L7@oc8246131445.ibm.com>
"Christopher M. Riedl" <cmr@codefail.de> writes:
> On Thu Feb 11, 2021 at 11:21 PM CST, Daniel Axtens wrote:
...
>>
>> My only concern here was whether it was valid to access
>> if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
>> if CONFIG_PPC_TRANSACTIONAL_MEM was not defined, but I didn't think of
>> any obvious reason why it wouldn't be...
>
> Hmm, we don't really need it for the non-TM case and it is another extra
> uaccess. I will take your suggestion to remove the need for the other
> ifdefs but might keep this one. Keeping it also makes it very clear this
> call is only here for TM and possible to remove in a potentially TM-less
> future :)
Yep I agree.
cheers
^ permalink raw reply
* Re: linux-next: manual merge of the devicetree tree with the powerpc tree
From: Stephen Rothwell @ 2021-02-18 11:34 UTC (permalink / raw)
To: Michael Ellerman
Cc: Rob Herring, Linux Kernel Mailing List, Lakshmi Ramasubramanian,
Linux Next Mailing List, Rob Herring, PowerPC, Hari Bathini
In-Reply-To: <874ki9vene.fsf@mpe.ellerman.id.au>
[-- Attachment #1: Type: text/plain, Size: 1229 bytes --]
Hi Michael,
On Thu, 18 Feb 2021 21:44:37 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> I think it just needs this?
>
> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> index 87e34611f93d..0492ca6003f3 100644
> --- a/arch/powerpc/kexec/elf_64.c
> +++ b/arch/powerpc/kexec/elf_64.c
> @@ -104,7 +104,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>
> fdt = of_kexec_alloc_and_setup_fdt(image, initrd_load_addr,
> initrd_len, cmdline,
> - fdt_totalsize(initial_boot_params));
> + kexec_fdt_totalsize_ppc64(image));
> if (!fdt) {
> pr_err("Error setting up the new device tree.\n");
> ret = -EINVAL;
>
I thought about that, but the last argument to
of_kexec_alloc_and_setup_fdt() is extra_fdt_size and the allocation
done is for this:
fdt_size = fdt_totalsize(initial_boot_params) +
(cmdline ? strlen(cmdline) : 0) +
FDT_EXTRA_SPACE +
extra_fdt_size;
and kexec_fdt_totalsize_ppc64() also includes
fdt_totalsize(initial_boot_params) so I was not sure. Maybe
kexec_fdt_totalsize_ppc64() needs modification as well?
--
Cheers,
Stephen Rothwell
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* [PATCH] powerpc/4xx: Fix build errors from mfdcr()
From: Michael Ellerman @ 2021-02-18 12:30 UTC (permalink / raw)
To: linuxppc-dev; +Cc: feng.tang
lkp reported a build error in fsp2.o:
CC arch/powerpc/platforms/44x/fsp2.o
{standard input}:577: Error: unsupported relocation against base
Which comes from:
pr_err("GESR0: 0x%08x\n", mfdcr(base + PLB4OPB_GESR0));
Where our mfdcr() macro is stringifying "base + PLB4OPB_GESR0", and
passing that to the assembler, which obviously doesn't work.
The mfdcr() macro already checks that the argument is constant using
__builtin_constant_p(), and if not calls the out-of-line version of
mfdcr(). But in this case GCC is smart enough to notice that "base +
PLB4OPB_GESR0" will be constant, even though it's not something we can
immediately stringify into a register number.
Segher pointed out that passing the register number to the inline asm
as a constant would be better, and in fact it fixes the build error,
presumably because it gives GCC a chance to resolve the value.
While we're at it, change mtdcr() similarly.
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
arch/powerpc/include/asm/dcr-native.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/include/asm/dcr-native.h b/arch/powerpc/include/asm/dcr-native.h
index 7141ccea8c94..a92059964579 100644
--- a/arch/powerpc/include/asm/dcr-native.h
+++ b/arch/powerpc/include/asm/dcr-native.h
@@ -53,8 +53,8 @@ static inline void mtdcrx(unsigned int reg, unsigned int val)
#define mfdcr(rn) \
({unsigned int rval; \
if (__builtin_constant_p(rn) && rn < 1024) \
- asm volatile("mfdcr %0," __stringify(rn) \
- : "=r" (rval)); \
+ asm volatile("mfdcr %0, %1" : "=r" (rval) \
+ : "n" (rn)); \
else if (likely(cpu_has_feature(CPU_FTR_INDEXED_DCR))) \
rval = mfdcrx(rn); \
else \
@@ -64,8 +64,8 @@ static inline void mtdcrx(unsigned int reg, unsigned int val)
#define mtdcr(rn, v) \
do { \
if (__builtin_constant_p(rn) && rn < 1024) \
- asm volatile("mtdcr " __stringify(rn) ",%0" \
- : : "r" (v)); \
+ asm volatile("mtdcr %0, %1" \
+ : : "n" (rn), "r" (v)); \
else if (likely(cpu_has_feature(CPU_FTR_INDEXED_DCR))) \
mtdcrx(rn, v); \
else \
--
2.25.1
^ permalink raw reply related
* Re: [PATCH kernel] powerpc/iommu: Annotate nested lock for lockdep
From: Frederic Barrat @ 2021-02-18 12:59 UTC (permalink / raw)
To: Alexey Kardashevskiy, linuxppc-dev; +Cc: kvm-ppc
In-Reply-To: <20210216032000.21642-1-aik@ozlabs.ru>
On 16/02/2021 04:20, Alexey Kardashevskiy wrote:
> The IOMMU table is divided into pools for concurrent mappings and each
> pool has a separate spinlock. When taking the ownership of an IOMMU group
> to pass through a device to a VM, we lock these spinlocks which triggers
> a false negative warning in lockdep (below).
>
> This fixes it by annotating the large pool's spinlock as a nest lock.
>
> ===
> WARNING: possible recursive locking detected
> 5.11.0-le_syzkaller_a+fstn1 #100 Not tainted
> --------------------------------------------
> qemu-system-ppc/4129 is trying to acquire lock:
> c0000000119bddb0 (&(p->lock)/1){....}-{2:2}, at: iommu_take_ownership+0xac/0x1e0
>
> but task is already holding lock:
> c0000000119bdd30 (&(p->lock)/1){....}-{2:2}, at: iommu_take_ownership+0xac/0x1e0
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(&(p->lock)/1);
> lock(&(p->lock)/1);
> ===
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> arch/powerpc/kernel/iommu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 557a09dd5b2f..2ee642a6731a 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1089,7 +1089,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
>
> spin_lock_irqsave(&tbl->large_pool.lock, flags);
> for (i = 0; i < tbl->nr_pools; i++)
> - spin_lock(&tbl->pools[i].lock);
> + spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
We have the same pattern and therefore should have the same problem in
iommu_release_ownership().
But as I understand, we're hacking our way around lockdep here, since
conceptually, those locks are independent. I was wondering why it seems
to fix it by worrying only about the large pool lock. That loop can take
many locks (up to 4 with current config). However, if the dma window is
less than 1GB, we would only have one, so it would make sense for
lockdep to stop complaining. Is it what happened? In which case, this
patch doesn't really fix it. Or I'm missing something :-)
Fred
> iommu_table_release_pages(tbl);
>
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox