LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [Bug 213837] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
From: bugzilla-daemon @ 2021-09-08 12:54 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-213837-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=213837

--- Comment #6 from mpe@ellerman.id.au ---
bugzilla-daemon@bugzilla.kernel.org writes:
> https://bugzilla.kernel.org/show_bug.cgi?id=213837
>
> Erhard F. (erhard_f@mailbox.org) changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            See Also|https://bugzilla.kernel.org |
>                    |/show_bug.cgi?id=213079     |
>
> --- Comment #4 from Erhard F. (erhard_f@mailbox.org) ---
> Checked out whether this has really something to do with bug #213079 or not
> by
> copying this root partition to a regular HDD and use that one instead. As the
> issue still happens it seems these are two seperate bugs.
>
> [...]
> Kernel panic - not syncing: corrupted stack end detected inside scheduler

Can you try this patch, it might help us work out what is corrupting the
stack.

cheers

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4462c454ab9..07bfa25c1b48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5490,8 +5490,14 @@ static noinline void __schedule_bug(struct task_struct
*prev)
 static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
-       if (task_stack_end_corrupted(prev))
+       if (task_stack_end_corrupted(prev)) {
+               char *start = (char *)end_of_stack(prev);
+               pr_err("stack corrupted? stack end = 0x%px\n",
end_of_stack(prev));
+               print_hex_dump(KERN_ERR, "stack: ", DUMP_PREFIX_ADDRESS, 16, 4,
+                              start - SZ_1K, THREAD_SIZE + SZ_1K, true);
+
                panic("corrupted stack end detected inside scheduler\n");
+       }

        if (task_scs_end_corrupted(prev))
                panic("corrupted shadow stack detected inside scheduler\n");

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching someone on the CC list of the bug.

^ permalink raw reply related

* Re: [Bug 213837] "Kernel panic - not syncing: corrupted stack end detected inside scheduler" at building via distcc on a G5
From: Michael Ellerman @ 2021-09-08 12:54 UTC (permalink / raw)
  To: bugzilla-daemon, linuxppc-dev
In-Reply-To: <bug-213837-206035-BEt4KJvSST@https.bugzilla.kernel.org/>

bugzilla-daemon@bugzilla.kernel.org writes:
> https://bugzilla.kernel.org/show_bug.cgi?id=213837
>
> Erhard F. (erhard_f@mailbox.org) changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            See Also|https://bugzilla.kernel.org |
>                    |/show_bug.cgi?id=213079     |
>
> --- Comment #4 from Erhard F. (erhard_f@mailbox.org) ---
> Checked out whether this has really something to do with bug #213079 or not by
> copying this root partition to a regular HDD and use that one instead. As the
> issue still happens it seems these are two seperate bugs.
>
> [...]
> Kernel panic - not syncing: corrupted stack end detected inside scheduler

Can you try this patch, it might help us work out what is corrupting the
stack.

cheers

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c4462c454ab9..07bfa25c1b48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5490,8 +5490,14 @@ static noinline void __schedule_bug(struct task_struct *prev)
 static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
-	if (task_stack_end_corrupted(prev))
+	if (task_stack_end_corrupted(prev)) {
+		char *start = (char *)end_of_stack(prev);
+		pr_err("stack corrupted? stack end = 0x%px\n", end_of_stack(prev));
+		print_hex_dump(KERN_ERR, "stack: ", DUMP_PREFIX_ADDRESS, 16, 4,
+			       start - SZ_1K, THREAD_SIZE + SZ_1K, true);
+
 		panic("corrupted stack end detected inside scheduler\n");
+	}
 
 	if (task_scs_end_corrupted(prev))
 		panic("corrupted shadow stack detected inside scheduler\n");

^ permalink raw reply related

* [RFC PATCH bpf-next] bpf: Make actual max tail call count as MAX_TAIL_CALL_CNT
From: Tiezhu Yang @ 2021-09-08  8:20 UTC (permalink / raw)
  To: Shubham Bansal, Russell King, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Zi Shen Lim, Catalin Marinas,
	Will Deacon, Paul Burton, Thomas Bogendoerfer, naveen.n.rao,
	Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Luke Nelson, Xi Wang, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bjorn, davem
  Cc: netdev, linux-kernel, linux-mips, sparclinux, bpf, linuxppc-dev,
	linux-riscv, linux-arm-kernel

In the current code, the actual max tail call count is 33 which is greater
than MAX_TAIL_CALL_CNT, this is not consistent with the intended meaning
in the commit 04fd61ab36ec ("bpf: allow bpf programs to tail-call other
bpf programs"):

"The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set
to 32."

Additionally, after commit 874be05f525e ("bpf, tests: Add tail call test
suite"), we can see there exists failed testcase.

On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
 # echo 0 > /proc/sys/net/core/bpf_jit_enable
 # modprobe test_bpf
 # dmesg | grep -w FAIL
 Tail call error path, max count reached jited:0 ret 34 != 33 FAIL

On some archs:
 # echo 1 > /proc/sys/net/core/bpf_jit_enable
 # modprobe test_bpf
 # dmesg | grep -w FAIL
 Tail call error path, max count reached jited:1 ret 34 != 33 FAIL

with this patch, make the actual max tail call count as MAX_TAIL_CALL_CNT,
at the same time, the above failed testcase can be fixed.

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
---

Hi all,

This is a RFC patch, if I am wrong or I missed something,
please let me know, thank you!

 arch/arm/net/bpf_jit_32.c         | 11 ++++++-----
 arch/arm64/net/bpf_jit_comp.c     |  7 ++++---
 arch/mips/net/ebpf_jit.c          |  4 ++--
 arch/powerpc/net/bpf_jit_comp32.c |  4 ++--
 arch/powerpc/net/bpf_jit_comp64.c | 12 ++++++------
 arch/riscv/net/bpf_jit_comp32.c   |  4 ++--
 arch/riscv/net/bpf_jit_comp64.c   |  4 ++--
 arch/sparc/net/bpf_jit_comp_64.c  |  8 ++++----
 kernel/bpf/core.c                 |  4 ++--
 9 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index a951276..39d9ae9 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -1180,18 +1180,19 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
 
 	/* tmp2[0] = array, tmp2[1] = index */
 
-	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
-	 *	goto out;
+	/*
 	 * tail_call_cnt++;
+	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *	goto out;
 	 */
+	tc = arm_bpf_get_reg64(tcc, tmp, ctx);
+	emit(ARM_ADDS_I(tc[1], tc[1], 1), ctx);
+	emit(ARM_ADC_I(tc[0], tc[0], 0), ctx);
 	lo = (u32)MAX_TAIL_CALL_CNT;
 	hi = (u32)((u64)MAX_TAIL_CALL_CNT >> 32);
-	tc = arm_bpf_get_reg64(tcc, tmp, ctx);
 	emit(ARM_CMP_I(tc[0], hi), ctx);
 	_emit(ARM_COND_EQ, ARM_CMP_I(tc[1], lo), ctx);
 	_emit(ARM_COND_HI, ARM_B(jmp_offset), ctx);
-	emit(ARM_ADDS_I(tc[1], tc[1], 1), ctx);
-	emit(ARM_ADC_I(tc[0], tc[0], 0), ctx);
 	arm_bpf_put_reg64(tcc, tmp, ctx);
 
 	/* prog = array->ptrs[index]
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 41c23f4..5d6c843 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -286,14 +286,15 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
 	emit(A64_CMP(0, r3, tmp), ctx);
 	emit(A64_B_(A64_COND_CS, jmp_offset), ctx);
 
-	/* if (tail_call_cnt > MAX_TAIL_CALL_CNT)
-	 *     goto out;
+	/*
 	 * tail_call_cnt++;
+	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
+	 *     goto out;
 	 */
+	emit(A64_ADD_I(1, tcc, tcc, 1), ctx);
 	emit_a64_mov_i64(tmp, MAX_TAIL_CALL_CNT, ctx);
 	emit(A64_CMP(1, tcc, tmp), ctx);
 	emit(A64_B_(A64_COND_HI, jmp_offset), ctx);
-	emit(A64_ADD_I(1, tcc, tcc, 1), ctx);
 
 	/* prog = array->ptrs[index];
 	 * if (prog == NULL)
diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c
index 3a73e93..029fc34 100644
--- a/arch/mips/net/ebpf_jit.c
+++ b/arch/mips/net/ebpf_jit.c
@@ -617,14 +617,14 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx, int this_idx)
 	b_off = b_imm(this_idx + 1, ctx);
 	emit_instr(ctx, bne, MIPS_R_AT, MIPS_R_ZERO, b_off);
 	/*
-	 * if (TCC-- < 0)
+	 * if (--TCC < 0)
 	 *     goto out;
 	 */
 	/* Delay slot */
 	tcc_reg = (ctx->flags & EBPF_TCC_IN_V1) ? MIPS_R_V1 : MIPS_R_S4;
 	emit_instr(ctx, daddiu, MIPS_R_T5, tcc_reg, -1);
 	b_off = b_imm(this_idx + 1, ctx);
-	emit_instr(ctx, bltz, tcc_reg, b_off);
+	emit_instr(ctx, bltz, MIPS_R_T5, b_off);
 	/*
 	 * prog = array->ptrs[index];
 	 * if (prog == NULL)
diff --git a/arch/powerpc/net/bpf_jit_comp32.c b/arch/powerpc/net/bpf_jit_comp32.c
index beb12cb..b5585ad 100644
--- a/arch/powerpc/net/bpf_jit_comp32.c
+++ b/arch/powerpc/net/bpf_jit_comp32.c
@@ -221,12 +221,12 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
 	PPC_BCC(COND_GE, out);
 
 	/*
+	 * tail_call_cnt++;
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *   goto out;
 	 */
-	EMIT(PPC_RAW_CMPLWI(_R0, MAX_TAIL_CALL_CNT));
-	/* tail_call_cnt++; */
 	EMIT(PPC_RAW_ADDIC(_R0, _R0, 1));
+	EMIT(PPC_RAW_CMPLWI(_R0, MAX_TAIL_CALL_CNT));
 	PPC_BCC(COND_GT, out);
 
 	/* prog = array->ptrs[index]; */
diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index b87a63d..bb15cc4 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -227,6 +227,12 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
 	PPC_BCC(COND_GE, out);
 
 	/*
+	 * tail_call_cnt++;
+	 */
+	EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], b2p[TMP_REG_1], 1));
+	PPC_BPF_STL(b2p[TMP_REG_1], 1, bpf_jit_stack_tailcallcnt(ctx));
+
+	/*
 	 * if (tail_call_cnt > MAX_TAIL_CALL_CNT)
 	 *   goto out;
 	 */
@@ -234,12 +240,6 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
 	EMIT(PPC_RAW_CMPLWI(b2p[TMP_REG_1], MAX_TAIL_CALL_CNT));
 	PPC_BCC(COND_GT, out);
 
-	/*
-	 * tail_call_cnt++;
-	 */
-	EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], b2p[TMP_REG_1], 1));
-	PPC_BPF_STL(b2p[TMP_REG_1], 1, bpf_jit_stack_tailcallcnt(ctx));
-
 	/* prog = array->ptrs[index]; */
 	EMIT(PPC_RAW_MULI(b2p[TMP_REG_1], b2p_index, 8));
 	EMIT(PPC_RAW_ADD(b2p[TMP_REG_1], b2p[TMP_REG_1], b2p_bpf_array));
diff --git a/arch/riscv/net/bpf_jit_comp32.c b/arch/riscv/net/bpf_jit_comp32.c
index e649742..1608d94 100644
--- a/arch/riscv/net/bpf_jit_comp32.c
+++ b/arch/riscv/net/bpf_jit_comp32.c
@@ -800,12 +800,12 @@ static int emit_bpf_tail_call(int insn, struct rv_jit_context *ctx)
 
 	/*
 	 * temp_tcc = tcc - 1;
-	 * if (tcc < 0)
+	 * if (temp_tcc < 0)
 	 *   goto out;
 	 */
 	emit(rv_addi(RV_REG_T1, RV_REG_TCC, -1), ctx);
 	off = ninsns_rvoff(tc_ninsn - (ctx->ninsns - start_insn));
-	emit_bcc(BPF_JSLT, RV_REG_TCC, RV_REG_ZERO, off, ctx);
+	emit_bcc(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
 
 	/*
 	 * prog = array->ptrs[index];
diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index 3af4131..6e9ba83 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c
@@ -311,12 +311,12 @@ static int emit_bpf_tail_call(int insn, struct rv_jit_context *ctx)
 	off = ninsns_rvoff(tc_ninsn - (ctx->ninsns - start_insn));
 	emit_branch(BPF_JGE, RV_REG_A2, RV_REG_T1, off, ctx);
 
-	/* if (TCC-- < 0)
+	/* if (--TCC < 0)
 	 *     goto out;
 	 */
 	emit_addi(RV_REG_T1, tcc, -1, ctx);
 	off = ninsns_rvoff(tc_ninsn - (ctx->ninsns - start_insn));
-	emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);
+	emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
 
 	/* prog = array->ptrs[index];
 	 * if (!prog)
diff --git a/arch/sparc/net/bpf_jit_comp_64.c b/arch/sparc/net/bpf_jit_comp_64.c
index 9a2f20c..50d914c 100644
--- a/arch/sparc/net/bpf_jit_comp_64.c
+++ b/arch/sparc/net/bpf_jit_comp_64.c
@@ -863,6 +863,10 @@ static void emit_tail_call(struct jit_ctx *ctx)
 	emit_branch(BGEU, ctx->idx, ctx->idx + OFFSET1, ctx);
 	emit_nop(ctx);
 
+	emit_alu_K(ADD, tmp, 1, ctx);
+	off = BPF_TAILCALL_CNT_SP_OFF;
+	emit(ST32 | IMMED | RS1(SP) | S13(off) | RD(tmp), ctx);
+
 	off = BPF_TAILCALL_CNT_SP_OFF;
 	emit(LD32 | IMMED | RS1(SP) | S13(off) | RD(tmp), ctx);
 	emit_cmpi(tmp, MAX_TAIL_CALL_CNT, ctx);
@@ -870,10 +874,6 @@ static void emit_tail_call(struct jit_ctx *ctx)
 	emit_branch(BGU, ctx->idx, ctx->idx + OFFSET2, ctx);
 	emit_nop(ctx);
 
-	emit_alu_K(ADD, tmp, 1, ctx);
-	off = BPF_TAILCALL_CNT_SP_OFF;
-	emit(ST32 | IMMED | RS1(SP) | S13(off) | RD(tmp), ctx);
-
 	emit_alu3_K(SLL, bpf_index, 3, tmp, ctx);
 	emit_alu(ADD, bpf_array, tmp, ctx);
 	off = offsetof(struct bpf_array, ptrs);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 9f4636d..8edb1c3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1564,10 +1564,10 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn)
 
 		if (unlikely(index >= array->map.max_entries))
 			goto out;
-		if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
-			goto out;
 
 		tail_call_cnt++;
+		if (unlikely(tail_call_cnt > MAX_TAIL_CALL_CNT))
+			goto out;
 
 		prog = READ_ONCE(array->ptrs[index]);
 		if (!prog)
-- 
2.1.0


^ permalink raw reply related

* Re: [RFC PATCH v2] powerpc/papr_scm: Move duplicate definitions to common header files
From: Vaibhav Jain @ 2021-09-08 10:46 UTC (permalink / raw)
  To: Michael Ellerman, Shivaprasad G Bhat, linuxppc-dev, linux-kernel
  Cc: nvdimm, dan.j.williams, sbhat, aneesh.kumar
In-Reply-To: <87sfyfmzhh.fsf@mpe.ellerman.id.au>

Hi Mpe,

Thanks for looking into this patch.

Michael Ellerman <mpe@ellerman.id.au> writes:

> Shivaprasad G Bhat <sbhat@linux.ibm.com> writes:
>> papr_scm and ndtest share common PDSM payload structs like
>> nd_papr_pdsm_health. Presently these structs are duplicated across papr_pdsm.h
>> and ndtest.h header files. Since 'ndtest' is essentially arch independent and can
>> run on platforms other than PPC64, a way needs to be deviced to avoid redundancy
>> and duplication of PDSM structs in future.
>>
>> So the patch proposes moving the PDSM header from arch/powerpc/include/uapi/ to
>> the generic include/uapi/linux directory. Also, there are some #defines common
>> between papr_scm and ndtest which are not exported to the user space. So, move
>> them to a header file which can be shared across ndtest and papr_scm via newly
>> introduced include/linux/papr_scm.h.
>>
>> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
>> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
>> Suggested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
>> ---
>> Changelog:
>>
>> Since v1:
>> Link: https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
>> * Removed dependency on this patch for the other patches
>>
>>  MAINTAINERS                               |    2 
>>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 -----------------------------
>>  arch/powerpc/platforms/pseries/papr_scm.c |   43 --------
>>  include/linux/papr_scm.h                  |   48 ++++++++
>>  include/uapi/linux/papr_pdsm.h            |  165 +++++++++++++++++++++++++++++
>
> This doesn't make sense to me.
>
> Anything with papr (or PAPR) in the name is fundamentally powerpc
> specific, it doesn't belong in a generic header, or in a generic
> location.
>
> What's the actual problem you're trying to solve?
>
The ndtest module (tools/testing/nvdimm/test/ndtest.c) is implemented in
an arch independed way to enable testing of PAPR PDSMs on non ppc64
platforms like x86_64. It uses the same PDSM structs as used by papr_scm
to communicate with libndctl userspace. 

Since papr_scm is ppc64 arch specific we were so far duplicating the
PDSM structures between ndtest and papr_scm. The patch tries to solve
this duplication by moving the shared structs to arch independent common
include dirs.

Secondly, PDSMs describes how userspace can use NVDIMM_FAMILY_PAPR to
interact with NVDIMMs. So potentially a new NVDIMM beyond powerpc arch
can add its support and would need access to the same structs used by
papr_scm and ndtest. In that context it would make sense to move PDSM
headers to generic include dirs.

> If it's just including papr_scm bits into ndtest.c then that should be
> as simple as:
>
> #ifdef __powerpc__
> #include <asm/papr_scm.h>
> #endif
>
> Shouldn't it?
>
No, as ndtest implements support for NVDIMM_FAMILY_PAPR and would need
access to PDSM related structs which presently are only available for
powerpc.

> cheers
>

-- 
Cheers
~ Vaibhav

^ permalink raw reply

* [PATCH v1 1/2] powerpc/64s: system call rfscv workaround for TM bugs
From: Nicholas Piggin @ 2021-09-08 10:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Eirik Fuller, kvm-ppc, Nicholas Piggin

The rfscv instruction does not work correctly with the fake-suspend mode
in POWER9, which can end up with the hypervisor restoring an incorrect
checkpoint.

Work around this by setting the _TIF_RESTOREALL flag if a system call
returns to a transaction active state, causing rfid to be used instead
of rfscv to return, which will do the right thing. The contents of the
registers are irrelevant because they will be overwritten in this case
anyway.

Reported-by: Eirik Fuller <efuller@redhat.com>
Fixes: 7fa95f9adaee7 ("powerpc/64s: system call support for scv/rfscv instructions")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/interrupt.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index c77c80214ad3..917a2ac4def6 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -139,6 +139,19 @@ notrace long system_call_exception(long r3, long r4, long r5,
 	 */
 	irq_soft_mask_regs_set_state(regs, IRQS_ENABLED);
 
+	/*
+	 * If system call is called with TM active, set _TIF_RESTOREALL to
+	 * prevent RFSCV being used to return to userspace, because POWER9
+	 * TM implementation has problems with this instruction returning to
+	 * transactional state. Final register values are not relevant because
+	 * the transaction will be aborted upon return anyway. Or in the case
+	 * of unsupported_scv SIGILL fault, the return state does not much
+	 * matter because it's an edge case.
+	 */
+	if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM) &&
+			unlikely(MSR_TM_TRANSACTIONAL(regs->msr)))
+		current_thread_info()->flags |= _TIF_RESTOREALL;
+
 	/*
 	 * If the system call was made with a transaction active, doom it and
 	 * return without performing the system call. Unless it was an
-- 
2.23.0


^ permalink raw reply related

* [PATCH v1 2/2] KVM: PPC: Book3S HV: Tolerate treclaim. in fake-suspend mode changing registers
From: Nicholas Piggin @ 2021-09-08 10:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Eirik Fuller, kvm-ppc, Nicholas Piggin
In-Reply-To: <20210908101718.118522-1-npiggin@gmail.com>

POWER9 DD2.2 and 2.3 hardware implements a "fake-suspend" mode where
certain TM instructions executed in HV=0 mode cause softpatch interrupts
so the hypervisor can emulate them and prevent problematic processor
conditions. In this fake-suspend mode, the treclaim. instruction does
not modify registers.

Unfortunately the rfscv instruction executed by the guest do not
generate softpatch interrupts, which can cause the hypervisor to lose
track of the fake-suspend mode, and it can execute this treclaim. while
not in fake-suspend mode. This modifies GPRs and crashes the hypervisor.

It's not trivial to disable scv in the guest with HFSCR now, because
they assume a POWER9 has scv available. So this fix saves and restores
checkpointed registers across the treclaim.

Fixes: 7854f7545bff ("KVM: PPC: Book3S: Rework TM save/restore code and make it C-callable")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 36 +++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 8dd437d7a2c6..dd18e1c44751 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2578,7 +2578,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_P9_TM_HV_ASSIST)
 	/* The following code handles the fake_suspend = 1 case */
 	mflr	r0
 	std	r0, PPC_LR_STKOFF(r1)
-	stdu	r1, -PPC_MIN_STKFRM(r1)
+	stdu	r1, -TM_FRAME_SIZE(r1)
 
 	/* Turn on TM. */
 	mfmsr	r8
@@ -2593,10 +2593,42 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
 	nop
 
+	/*
+	 * It's possible that treclaim. may modify registers, if we have lost
+	 * track of fake-suspend state in the guest due to it using rfscv.
+	 * Save and restore registers in case this occurs.
+	 */
+	mfspr	r3, SPRN_DSCR
+	mfspr	r4, SPRN_XER
+	mfspr	r5, SPRN_AMR
+	/* SPRN_TAR would need to be saved here if the kernel ever used it */
+	mfcr	r12
+	SAVE_NVGPRS(r1)
+	SAVE_GPR(2, r1)
+	SAVE_GPR(3, r1)
+	SAVE_GPR(4, r1)
+	SAVE_GPR(5, r1)
+	stw	r12, 8(r1)
+	std	r1, HSTATE_HOST_R1(r13)
+
 	/* We have to treclaim here because that's the only way to do S->N */
 	li	r3, TM_CAUSE_KVM_RESCHED
 	TRECLAIM(R3)
 
+	GET_PACA(r13)
+	ld	r1, HSTATE_HOST_R1(r13)
+	REST_GPR(2, r1)
+	REST_GPR(3, r1)
+	REST_GPR(4, r1)
+	REST_GPR(5, r1)
+	lwz	r12, 8(r1)
+	REST_NVGPRS(r1)
+	mtspr	SPRN_DSCR, r3
+	mtspr	SPRN_XER, r4
+	mtspr	SPRN_AMR, r5
+	mtcr	r12
+	HMT_MEDIUM
+
 	/*
 	 * We were in fake suspend, so we are not going to save the
 	 * register state as the guest checkpointed state (since
@@ -2624,7 +2656,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
 	std	r5, VCPU_TFHAR(r9)
 	std	r6, VCPU_TFIAR(r9)
 
-	addi	r1, r1, PPC_MIN_STKFRM
+	addi	r1, r1, TM_FRAME_SIZE
 	ld	r0, PPC_LR_STKOFF(r1)
 	mtlr	r0
 	blr
-- 
2.23.0


^ permalink raw reply related

* Re: [PATCH 1/3] perf: Add macros to specify onchip L2/L3 accesses
From: Peter Zijlstra @ 2021-09-08  9:26 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: mark.rutland, atrajeev, ak, daniel, rnsastry, alexander.shishkin,
	Kajol Jain, linux-kernel, acme, ast, linux-perf-users, yao.jin,
	mingo, paulus, maddy, jolsa, namhyung, songliubraving,
	linuxppc-dev, kan.liang
In-Reply-To: <87ilzbmt7i.fsf@mpe.ellerman.id.au>

On Wed, Sep 08, 2021 at 05:17:53PM +1000, Michael Ellerman wrote:
> Kajol Jain <kjain@linux.ibm.com> writes:

> > diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> > index f92880a15645..030b3e990ac3 100644
> > --- a/include/uapi/linux/perf_event.h
> > +++ b/include/uapi/linux/perf_event.h
> > @@ -1265,7 +1265,9 @@ union perf_mem_data_src {
> >  #define PERF_MEM_LVLNUM_L2	0x02 /* L2 */
> >  #define PERF_MEM_LVLNUM_L3	0x03 /* L3 */
> >  #define PERF_MEM_LVLNUM_L4	0x04 /* L4 */
> > -/* 5-0xa available */
> > +#define PERF_MEM_LVLNUM_OC_L2	0x05 /* On Chip L2 */
> > +#define PERF_MEM_LVLNUM_OC_L3	0x06 /* On Chip L3 */
> 
> The obvious use for 5 is for "L5" and so on.
> 
> I'm not sure adding new levels is the best idea, because these don't fit
> neatly into the hierarchy, they are off to the side.
> 
> 
> I wonder if we should use the remote field.
> 
> ie. for another core's L2 we set:
> 
>   mem_lvl = PERF_MEM_LVL_L2
>   mem_remote = 1

This mixes APIs (see below), IIUC the correct usage would be something
like: lvl_num=L2 remote=1

> Which would mean "remote L2", but not remote enough to be
> lvl = PERF_MEM_LVL_REM_CCE1.
> 
> It would be printed by the existing tools/perf code as "Remote L2", vs
> "Remote cache (1 hop)", which seems OK.
> 
> 
> ie. we'd be able to express:
> 
>   Current core's L2: LVL_L2
>   Other core's L2:   LVL_L2 | REMOTE
>   Other chip's L2:   LVL_REM_CCE1 | REMOTE
> 
> And similarly for L3.
> 
> I think that makes sense? Unless people think remote should be reserved
> to mean on another chip, though we already have REM_CCE1 for that.

IIRC the PERF_MEM_LVL_* namespace is somewhat depricated in favour of
the newer composite PERF_MEM_{LVLNUM_,REMOTE_,SNOOPX_} fields. Of
course, ABIs being what they are, we get to support both :/ But I'm not
sure mixing them is a great idea.

Also, clearly this could use a comment...

The 'new' composite doesnt have a hops field because the hardware that
nessecitated that change doesn't report it, but we could easily add a
field there.

Suppose we add, mem_hops:3 (would 6 hops be too small?) and the
corresponding PERF_MEM_HOPS_{NA, 0..6}

Then I suppose you can encode things like:

	L2			- local L2
	L2 | REMOTE		- remote L2 at an unspecified distance (NA)
	L2 | REMOTE | HOPS_0	- remote L2 on the same node
	L2 | REMOTE | HOPS_1	- remote L2 on a node 1 removed

Would that work?

^ permalink raw reply

* Re: [PATCH 0/5] s390/pci: automatic error recovery
From: Niklas Schnelle @ 2021-09-08  8:09 UTC (permalink / raw)
  To: Oliver O'Halloran
  Cc: linux-s390, Pierre Morel, Matthew Rosato,
	Linux Kernel Mailing List, Bjorn Helgaas, Linas Vepstas,
	linuxppc-dev
In-Reply-To: <CAOSf1CH2T-R44qx1mGpJQ8WgD0upxG8sQNud_5L3SHYZJm9LRA@mail.gmail.com>

On Wed, 2021-09-08 at 11:37 +1000, Oliver O'Halloran wrote:
> On Tue, Sep 7, 2021 at 10:21 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
> > On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> > > On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > > > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
> > > > > Patch 3 I already sent separately resulting in the discussion below but without
> > > > > a final conclusion.
> > > > > 
> > > > > https://lore.kernel.org/lkml/20210720150145.640727-1-schnelle@linux.ibm.com/
> > > > > 
> > > > > I believe even though there were some doubts about the use of
> > > > > pci_dev_is_added() by arch code the existing uses as well as the use in the
> > > > > final patch of this series warrant this export.
> > > > 
> > > > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > > > pci_bus_add_device() could be called before pci_device_add(). That was
> > > > fixed a while ago so It should be safe to remove those calls now.
> > > 
> > > Hmm, ok that confirms Bjorns suspicion and explains how it came to be.
> > > I can certainly sent a patch for that. This would then leave only the
> > > existing use in s390 which I added because of a dead lock prevention
> > > and explained here:
> > > https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.camel@linux.ibm.com/
> > > 
> > > Plus the need to use it in the recovery code of this series. I think in
> > > the EEH code the need for a similar check is alleviated by the checks
> > > in the beginning of
> > > arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> > > eeh_slot_presence_check() which checks presence via the hotplug slot.
> > > I guess we could use our own state tracking in a similar way but felt
> > > like pci_dev_is_added() is the more logical choice.
> 
> The slot check is mainly there to prevent attempts to "recover"
> devices that have been surprise removed (i.e NVMe hot-unplug). The
> actual recovery process operates off the eeh_pe tree which is frozen
> in place when an error is detected. If a pci_dev is added or removed
> it's not really a problem since those are only ever looked at when
> notifying drivers which is done with the rescan_remove lock held. 

Thanks for the explanation.

> That
> said, I wouldn't really encourage anyone to follow the EEH model since
> it's pretty byzantine.
> 
> > Looking into this again, I think we actually can't easily track this
> > state ourselves outside struct pci_dev. The reason for this is that
> > when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
> > pci_dev and scans it again the new struct pci_dev re-uses the same
> > struct zpci_dev because from a platform point of view the PCI device
> > was never removed but only disabled and re-enabled. Thus we can only
> > distinguish the stale struct pci_dev by looking at things stored in
> > struct pci_dev itself.
> 
> IMO the real problem is removing and re-adding the pci_dev. I think
> it's something that's done largely because the PCI core doesn't really
> provide any better mechanism for getting a device back into a
> known-good state so it's abused to implement error recovery. This is
> something that's always annoyed me since it conflates recovery with
> hotplug. After a hot-(un)plug we might have a different device or no
> device. In the recovery case we expect to start and end with the same
> device. Why not apply the same logic to the pci_dev?

For us there are two cases. First The existing
/sys/bus/pci/devices/<dev>/recover attribute. This does the pci_dev
remove and re-add that you mention and thus we end up with a ne pci_dev
afterwards and I agree that is kind of a dumb way to recover which
(too?) closely resembles unplug/re-plug.

Secondly the automatic error recovery added in this series. Here we
only attempt recovery if we have a driver bound that supports the error
callbacks thus always keeping the same pci_dev. If there is no driver
we give up automatic recovery and are back at the situation without
this series.

> 
> Something I was tinkering with before I left IBM was re-working the
> way EEH handles recovering devices that don't have a driver with error
> handling callbacks to something like:
> 
> 1. unbind the driver
> 2. pci_save_state()
> 3. do the reset
> 4. pci_restore_state()
> 5. re-bind the driver
> 
> That would allow keeping the pci_dev around and let me delete a pile
> of confusing code which handles binding the eeh_dev to the new
> pci_dev.

This sounds like an interesting future approach for us too. Thankfully
our binding of the zpci_dev to the new pci_dev is pretty simple by now.
The main trouble with removing and re-adding a pci_dev is then that
upper layers like block devices are also re-created which really only
happens if we have a driver bound.

>  The obvious problem with that approach is the assumption the
> device is functional enough to allow saving the config space, but I
> don't think that's a deal breaker. We could stash a copy of the device
> state before we allow drivers to attach and use that to restore the
> device after the reset. The end result would be the same known-good
> state that we'd get after a re-scan.
> 
> > That said, I think for the recovery case we might be able to drop the
> > pci_dev_is_added() and rely on pdev->driver != NULL which we check
> > anyway and that should catch any PCI device that was already removed.
> 
> Would that work if there was an error on a device without a driver
> bound? 

For the automatic recovery flow introduced by this series we only
recover if such a driver is bound anyway so that is already a
requirement. Luckily all physical PCI devices we support on our
platform have drivers with that support.

> If you're just trying to stop races between recovery and device
> removal then pci_dev_is_added() is probably the right tool for the
> job. Trying to substitute it with a proxy seems like a bad idea.

Yes I believe at least for the existing recover attribute that does not
require a bound driver we still need pci_dev_is_added().

For the automatic recovery flow I think it would be okay to rely on the
fact that removed devices don't have a driver bound since the recovery
requires a bound driver anyway but yes an explicit pci_dev_is_added()
check as in this patch does feel more clean.




^ permalink raw reply

* [PATCH 1/3] soc: mediatek: pwrap: Make use of the helper function devm_platform_ioremap_resource_byname()
From: Cai Huoqing @ 2021-09-08  7:11 UTC (permalink / raw)
  To: caihuoqing
  Cc: Nishanth Menon, Neil Armstrong, linux-kernel, linux-tegra,
	Thierry Reding, Rafał Miłecki, Jerome Brunet,
	Florian Fainelli, Kevin Hilman, Jernej Skrabec, Jonathan Hunter,
	Chen-Yu Tsai, bcm-kernel-feedback-list, linux-sunxi, linux-pm,
	Martin Blumenstingl, Maxime Ripard, Krzysztof Halasa,
	Santosh Shilimkar, Matthias Brugger, linux-amlogic,
	linux-arm-kernel, linux-mips, Li Yang, linux-mediatek,
	linuxppc-dev
In-Reply-To: <20210908071123.348-1-caihuoqing@baidu.com>

Use the devm_platform_ioremap_resource_byname() helper instead of
calling platform_get_resource_byname() and devm_ioremap_resource()
separately

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
---
 drivers/soc/mediatek/mtk-pmic-wrap.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/soc/mediatek/mtk-pmic-wrap.c b/drivers/soc/mediatek/mtk-pmic-wrap.c
index 952bc554f443..29f1bd42f7a8 100644
--- a/drivers/soc/mediatek/mtk-pmic-wrap.c
+++ b/drivers/soc/mediatek/mtk-pmic-wrap.c
@@ -2116,7 +2116,6 @@ static int pwrap_probe(struct platform_device *pdev)
 	struct pmic_wrapper *wrp;
 	struct device_node *np = pdev->dev.of_node;
 	const struct of_device_id *of_slave_id = NULL;
-	struct resource *res;
 
 	if (np->child)
 		of_slave_id = of_match_node(of_slave_match_tbl, np->child);
@@ -2136,8 +2135,7 @@ static int pwrap_probe(struct platform_device *pdev)
 	wrp->slave = of_slave_id->data;
 	wrp->dev = &pdev->dev;
 
-	res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "pwrap");
-	wrp->base = devm_ioremap_resource(wrp->dev, res);
+	wrp->base = devm_platform_ioremap_resource_byname(pdev, "pwrap");
 	if (IS_ERR(wrp->base))
 		return PTR_ERR(wrp->base);
 
@@ -2151,9 +2149,7 @@ static int pwrap_probe(struct platform_device *pdev)
 	}
 
 	if (HAS_CAP(wrp->master->caps, PWRAP_CAP_BRIDGE)) {
-		res = platform_get_resource_byname(pdev, IORESOURCE_MEM,
-				"pwrap-bridge");
-		wrp->bridge_base = devm_ioremap_resource(wrp->dev, res);
+		wrp->bridge_base = devm_platform_ioremap_resource_byname(pdev, "pwrap-bridge");
 		if (IS_ERR(wrp->bridge_base))
 			return PTR_ERR(wrp->bridge_base);
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH] soc: sunxi_sram: Make use of the helper function devm_platform_ioremap_resource()
From: Cai Huoqing @ 2021-09-08  7:11 UTC (permalink / raw)
  To: caihuoqing
  Cc: Nishanth Menon, Neil Armstrong, linux-kernel, linux-tegra,
	Thierry Reding, Rafał Miłecki, Jerome Brunet,
	Florian Fainelli, Kevin Hilman, Jernej Skrabec, Jonathan Hunter,
	Chen-Yu Tsai, bcm-kernel-feedback-list, linux-sunxi, linux-pm,
	Martin Blumenstingl, Maxime Ripard, Krzysztof Halasa,
	Santosh Shilimkar, Matthias Brugger, linux-amlogic,
	linux-arm-kernel, linux-mips, Li Yang, linux-mediatek,
	linuxppc-dev
In-Reply-To: <20210908071123.348-1-caihuoqing@baidu.com>

Use the devm_platform_ioremap_resource() helper instead of
calling platform_get_resource() and devm_ioremap_resource()
separately

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
---
 drivers/soc/sunxi/sunxi_sram.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/soc/sunxi/sunxi_sram.c b/drivers/soc/sunxi/sunxi_sram.c
index 42833e33a96c..a8f3876963a0 100644
--- a/drivers/soc/sunxi/sunxi_sram.c
+++ b/drivers/soc/sunxi/sunxi_sram.c
@@ -331,7 +331,6 @@ static struct regmap_config sunxi_sram_emac_clock_regmap = {
 
 static int sunxi_sram_probe(struct platform_device *pdev)
 {
-	struct resource *res;
 	struct dentry *d;
 	struct regmap *emac_clock;
 	const struct sunxi_sramc_variant *variant;
@@ -342,8 +341,7 @@ static int sunxi_sram_probe(struct platform_device *pdev)
 	if (!variant)
 		return -EINVAL;
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	base = devm_ioremap_resource(&pdev->dev, res);
+	base = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(base))
 		return PTR_ERR(base);
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH] soc: ixp4xx/qmgr: Make use of the helper function devm_platform_ioremap_resource()
From: Cai Huoqing @ 2021-09-08  7:11 UTC (permalink / raw)
  To: caihuoqing
  Cc: Nishanth Menon, Neil Armstrong, linux-kernel, linux-tegra,
	Thierry Reding, Rafał Miłecki, Jerome Brunet,
	Florian Fainelli, Kevin Hilman, Jernej Skrabec, Jonathan Hunter,
	Chen-Yu Tsai, bcm-kernel-feedback-list, linux-sunxi, linux-pm,
	Martin Blumenstingl, Maxime Ripard, Krzysztof Halasa,
	Santosh Shilimkar, Matthias Brugger, linux-amlogic,
	linux-arm-kernel, linux-mips, Li Yang, linux-mediatek,
	linuxppc-dev
In-Reply-To: <20210908071123.348-1-caihuoqing@baidu.com>

Use the devm_platform_ioremap_resource() helper instead of
calling platform_get_resource() and devm_ioremap_resource()
separately

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
---
 drivers/soc/ixp4xx/ixp4xx-qmgr.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/soc/ixp4xx/ixp4xx-qmgr.c b/drivers/soc/ixp4xx/ixp4xx-qmgr.c
index 9154c7029b05..72b5a10e3104 100644
--- a/drivers/soc/ixp4xx/ixp4xx-qmgr.c
+++ b/drivers/soc/ixp4xx/ixp4xx-qmgr.c
@@ -377,13 +377,9 @@ static int ixp4xx_qmgr_probe(struct platform_device *pdev)
 	int i, err;
 	irq_handler_t handler1, handler2;
 	struct device *dev = &pdev->dev;
-	struct resource *res;
 	int irq1, irq2;
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	if (!res)
-		return -ENODEV;
-	qmgr_regs = devm_ioremap_resource(dev, res);
+	qmgr_regs = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(qmgr_regs))
 		return PTR_ERR(qmgr_regs);
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH 1/2] soc: bcm: bcm-pmb: Make use of the helper function devm_platform_ioremap_resource()
From: Cai Huoqing @ 2021-09-08  7:11 UTC (permalink / raw)
  To: caihuoqing
  Cc: Nishanth Menon, Neil Armstrong, linux-kernel, linux-tegra,
	Thierry Reding, Rafał Miłecki, Jerome Brunet,
	Florian Fainelli, Kevin Hilman, Jernej Skrabec, Jonathan Hunter,
	Chen-Yu Tsai, bcm-kernel-feedback-list, linux-sunxi, linux-pm,
	Martin Blumenstingl, Maxime Ripard, Krzysztof Halasa,
	Santosh Shilimkar, Matthias Brugger, linux-amlogic,
	linux-arm-kernel, linux-mips, Li Yang, linux-mediatek,
	linuxppc-dev
In-Reply-To: <20210908071123.348-1-caihuoqing@baidu.com>

Use the devm_platform_ioremap_resource() helper instead of
calling platform_get_resource() and devm_ioremap_resource()
separately

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
---
 drivers/soc/bcm/bcm63xx/bcm-pmb.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/soc/bcm/bcm63xx/bcm-pmb.c b/drivers/soc/bcm/bcm63xx/bcm-pmb.c
index 774465c119be..7bbe46ea5f94 100644
--- a/drivers/soc/bcm/bcm63xx/bcm-pmb.c
+++ b/drivers/soc/bcm/bcm63xx/bcm-pmb.c
@@ -276,7 +276,6 @@ static int bcm_pmb_probe(struct platform_device *pdev)
 	struct device *dev = &pdev->dev;
 	const struct bcm_pmb_pd_data *table;
 	const struct bcm_pmb_pd_data *e;
-	struct resource *res;
 	struct bcm_pmb *pmb;
 	int max_id;
 	int err;
@@ -287,8 +286,7 @@ static int bcm_pmb_probe(struct platform_device *pdev)
 
 	pmb->dev = dev;
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	pmb->base = devm_ioremap_resource(&pdev->dev, res);
+	pmb->base = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(pmb->base))
 		return PTR_ERR(pmb->base);
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH 1/2] soc: amlogic: canvas: Make use of the helper function devm_platform_ioremap_resource()
From: Cai Huoqing @ 2021-09-08  7:11 UTC (permalink / raw)
  To: caihuoqing
  Cc: Nishanth Menon, Neil Armstrong, linux-kernel, linux-tegra,
	Thierry Reding, Rafał Miłecki, Jerome Brunet,
	Florian Fainelli, Kevin Hilman, Jernej Skrabec, Jonathan Hunter,
	Chen-Yu Tsai, bcm-kernel-feedback-list, linux-sunxi, linux-pm,
	Martin Blumenstingl, Maxime Ripard, Krzysztof Halasa,
	Santosh Shilimkar, Matthias Brugger, linux-amlogic,
	linux-arm-kernel, linux-mips, Li Yang, linux-mediatek,
	linuxppc-dev

Use the devm_platform_ioremap_resource() helper instead of
calling platform_get_resource() and devm_ioremap_resource()
separately

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
---
 drivers/soc/amlogic/meson-canvas.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/soc/amlogic/meson-canvas.c b/drivers/soc/amlogic/meson-canvas.c
index d0329ad170d1..383b0cfc584e 100644
--- a/drivers/soc/amlogic/meson-canvas.c
+++ b/drivers/soc/amlogic/meson-canvas.c
@@ -168,7 +168,6 @@ EXPORT_SYMBOL_GPL(meson_canvas_free);
 
 static int meson_canvas_probe(struct platform_device *pdev)
 {
-	struct resource *res;
 	struct meson_canvas *canvas;
 	struct device *dev = &pdev->dev;
 
@@ -176,8 +175,7 @@ static int meson_canvas_probe(struct platform_device *pdev)
 	if (!canvas)
 		return -ENOMEM;
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	canvas->reg_base = devm_ioremap_resource(dev, res);
+	canvas->reg_base = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(canvas->reg_base))
 		return PTR_ERR(canvas->reg_base);
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH 1/2] soc: fsl: guts: Make use of the helper function devm_platform_ioremap_resource()
From: Cai Huoqing @ 2021-09-08  7:11 UTC (permalink / raw)
  To: caihuoqing
  Cc: Nishanth Menon, Neil Armstrong, linux-kernel, linux-tegra,
	Thierry Reding, Rafał Miłecki, Jerome Brunet,
	Florian Fainelli, Kevin Hilman, Jernej Skrabec, Jonathan Hunter,
	Chen-Yu Tsai, bcm-kernel-feedback-list, linux-sunxi, linux-pm,
	Martin Blumenstingl, Maxime Ripard, Krzysztof Halasa,
	Santosh Shilimkar, Matthias Brugger, linux-amlogic,
	linux-arm-kernel, linux-mips, Li Yang, linux-mediatek,
	linuxppc-dev
In-Reply-To: <20210908071123.348-1-caihuoqing@baidu.com>

Use the devm_platform_ioremap_resource() helper instead of
calling platform_get_resource() and devm_ioremap_resource()
separately

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
---
 drivers/soc/fsl/guts.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
index d5e9a5f2c087..072473a16f4d 100644
--- a/drivers/soc/fsl/guts.c
+++ b/drivers/soc/fsl/guts.c
@@ -140,7 +140,6 @@ static int fsl_guts_probe(struct platform_device *pdev)
 {
 	struct device_node *np = pdev->dev.of_node;
 	struct device *dev = &pdev->dev;
-	struct resource *res;
 	const struct fsl_soc_die_attr *soc_die;
 	const char *machine;
 	u32 svr;
@@ -152,8 +151,7 @@ static int fsl_guts_probe(struct platform_device *pdev)
 
 	guts->little_endian = of_property_read_bool(np, "little-endian");
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	guts->regs = devm_ioremap_resource(dev, res);
+	guts->regs = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(guts->regs))
 		return PTR_ERR(guts->regs);
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH 1/3] perf: Add macros to specify onchip L2/L3 accesses
From: Michael Ellerman @ 2021-09-08  7:17 UTC (permalink / raw)
  To: Kajol Jain, linuxppc-dev, linux-kernel, peterz, mingo, acme,
	jolsa, namhyung, linux-perf-users, ak
  Cc: mark.rutland, songliubraving, atrajeev, daniel, rnsastry,
	alexander.shishkin, kjain, ast, yao.jin, maddy, paulus, kan.liang
In-Reply-To: <20210904064932.307610-1-kjain@linux.ibm.com>

Kajol Jain <kjain@linux.ibm.com> writes:
> Add couple of new macros to represent onchip L2 and onchip L3 accesses.

It would be "on chip". But I think this needs much more explanation,
this is a generic header so these definitions need to make sense, and
have an understood meaning, across all architectures.

I think most people are going to read "on chip" as differentiating
between an L2/L3 that is "on chip" vs "off chip".

But the case you're trying to express is "another core's L2/L3 on the
same chip as the CPU", vs "the current CPU's L2/L3".


> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index f92880a15645..030b3e990ac3 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1265,7 +1265,9 @@ union perf_mem_data_src {
>  #define PERF_MEM_LVLNUM_L2	0x02 /* L2 */
>  #define PERF_MEM_LVLNUM_L3	0x03 /* L3 */
>  #define PERF_MEM_LVLNUM_L4	0x04 /* L4 */
> -/* 5-0xa available */
> +#define PERF_MEM_LVLNUM_OC_L2	0x05 /* On Chip L2 */
> +#define PERF_MEM_LVLNUM_OC_L3	0x06 /* On Chip L3 */

The obvious use for 5 is for "L5" and so on.

I'm not sure adding new levels is the best idea, because these don't fit
neatly into the hierarchy, they are off to the side.


I wonder if we should use the remote field.

ie. for another core's L2 we set:

  mem_lvl = PERF_MEM_LVL_L2
  mem_remote = 1

Which would mean "remote L2", but not remote enough to be
lvl = PERF_MEM_LVL_REM_CCE1.

It would be printed by the existing tools/perf code as "Remote L2", vs
"Remote cache (1 hop)", which seems OK.


ie. we'd be able to express:

  Current core's L2: LVL_L2
  Other core's L2:   LVL_L2 | REMOTE
  Other chip's L2:   LVL_REM_CCE1 | REMOTE

And similarly for L3.

I think that makes sense? Unless people think remote should be reserved
to mean on another chip, though we already have REM_CCE1 for that.

cheers



^ permalink raw reply

* [PATCH 2/2] soc: fsl: rcpm: Make use of the helper function devm_platform_ioremap_resource()
From: Cai Huoqing @ 2021-09-08  7:16 UTC (permalink / raw)
  To: caihuoqing; +Cc: linuxppc-dev, linux-kernel, linux-arm-kernel, Li Yang
In-Reply-To: <20210908071631.660-1-caihuoqing@baidu.com>

Use the devm_platform_ioremap_resource() helper instead of
calling platform_get_resource() and devm_ioremap_resource()
separately

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
---
 drivers/soc/fsl/rcpm.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c
index 90d3f4060b0c..3d0cae30c769 100644
--- a/drivers/soc/fsl/rcpm.c
+++ b/drivers/soc/fsl/rcpm.c
@@ -146,7 +146,6 @@ static const struct dev_pm_ops rcpm_pm_ops = {
 static int rcpm_probe(struct platform_device *pdev)
 {
 	struct device	*dev = &pdev->dev;
-	struct resource *r;
 	struct rcpm	*rcpm;
 	int ret;
 
@@ -154,11 +153,7 @@ static int rcpm_probe(struct platform_device *pdev)
 	if (!rcpm)
 		return -ENOMEM;
 
-	r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	if (!r)
-		return -ENODEV;
-
-	rcpm->ippdexpcr_base = devm_ioremap_resource(&pdev->dev, r);
+	rcpm->ippdexpcr_base = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(rcpm->ippdexpcr_base)) {
 		ret =  PTR_ERR(rcpm->ippdexpcr_base);
 		return ret;
-- 
2.25.1


^ permalink raw reply related

* [PATCH 1/2] soc: fsl: guts: Make use of the helper function devm_platform_ioremap_resource()
From: Cai Huoqing @ 2021-09-08  7:16 UTC (permalink / raw)
  To: caihuoqing; +Cc: linuxppc-dev, linux-kernel, linux-arm-kernel, Li Yang

Use the devm_platform_ioremap_resource() helper instead of
calling platform_get_resource() and devm_ioremap_resource()
separately

Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
---
 drivers/soc/fsl/guts.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
index d5e9a5f2c087..072473a16f4d 100644
--- a/drivers/soc/fsl/guts.c
+++ b/drivers/soc/fsl/guts.c
@@ -140,7 +140,6 @@ static int fsl_guts_probe(struct platform_device *pdev)
 {
 	struct device_node *np = pdev->dev.of_node;
 	struct device *dev = &pdev->dev;
-	struct resource *res;
 	const struct fsl_soc_die_attr *soc_die;
 	const char *machine;
 	u32 svr;
@@ -152,8 +151,7 @@ static int fsl_guts_probe(struct platform_device *pdev)
 
 	guts->little_endian = of_property_read_bool(np, "little-endian");
 
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	guts->regs = devm_ioremap_resource(dev, res);
+	guts->regs = devm_platform_ioremap_resource(pdev, 0);
 	if (IS_ERR(guts->regs))
 		return PTR_ERR(guts->regs);
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] powerpc/mce: Fix access error in mce handler
From: Ganesh @ 2021-09-08  6:49 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: mahesh, npiggin
In-Reply-To: <87mtonmxp6.fsf@mpe.ellerman.id.au>

[-- Attachment #1: Type: text/plain, Size: 5679 bytes --]

On 9/8/21 11:10 AM, Michael Ellerman wrote:

> Ganesh <ganeshgr@linux.ibm.com> writes:
>> On 9/6/21 6:03 PM, Michael Ellerman wrote:
>>
>>> Ganesh Goudar <ganeshgr@linux.ibm.com> writes
>>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>>> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>>>> CPU: 5 PID: 1883 Comm: insmod Tainted: G        OE     5.14.0-mce+ #137
>>>> NIP:  c000000000735d60 LR: c000000000318640 CTR: 0000000000000000
>>>> REGS: c00000001ebff9a0 TRAP: 0300   Tainted: G       OE      (5.14.0-mce+)
>>>> MSR:  8000000000001003 <SF,ME,RI,LE>  CR: 28008228  XER: 00000001
>>>> CFAR: c00000000031863c DAR: c00000027fa8fe08 DSISR: 40000000 IRQMASK: 0
>>>> GPR00: c0000000003186d0 c00000001ebffc40 c000000001b0df00 c0000000016337e8
>>>> GPR04: c0000000016337e8 c00000027fa8fe08 0000000000000023 c0000000016337f0
>>>> GPR08: 0000000000000023 c0000000012ffe08 0000000000000000 c008000001460240
>>>> GPR12: 0000000000000000 c00000001ec9a900 c00000002ac4bd00 0000000000000000
>>>> GPR16: 00000000000005a0 c0080000006b0000 c0080000006b05a0 c000000000ff3068
>>>> GPR20: c00000002ac4bbc0 0000000000000001 c00000002ac4bbc0 c008000001490298
>>>> GPR24: c008000001490108 c000000001636198 c008000001470090 c008000001470058
>>>> GPR28: 0000000000000510 c008000001000000 c008000008000019 0000000000000019
>>>> NIP [c000000000735d60] llist_add_batch+0x0/0x40
>>>> LR [c000000000318640] __irq_work_queue_local+0x70/0xc0
>>>> Call Trace:
>>>> [c00000001ebffc40] [c00000001ebffc0c] 0xc00000001ebffc0c (unreliable)
>>>> [c00000001ebffc60] [c0000000003186d0] irq_work_queue+0x40/0x70
>>>> [c00000001ebffc80] [c00000000004425c] machine_check_queue_event+0xbc/0xd0
>>>> [c00000001ebffcf0] [c00000000000838c] machine_check_early_common+0x16c/0x1f4
>>>>
>>>> Fixes: 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before returning from handler")
>>> Please explain in more detail why that commit caused this breakage.
>> After enabling translation in mce_handle_error() we used to leave it enabled to avoid
>> crashing here, but now with this commit we are restoring the MSR to disable translation.
> Are you sure we left the MMU enabled to avoid crashing there, or we just
> left it enabled by accident?

No, I think we left it enabled intentionally, I mentioned about leaving it enabled
in my comment and commit message of a95a0a1654 "powerpc/pseries: Fix MCE handling on pseries".

>
> But yeah, previously the MMU was enabled when we got here whereas now
> it's not, because of that change.
>
>> Missed to mention it in commit log, I will add it.
> Thanks.
>
>>>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>>>> index 47a683cd00d2..9d1e39d42e3e 100644
>>>> --- a/arch/powerpc/kernel/mce.c
>>>> +++ b/arch/powerpc/kernel/mce.c
>>>> @@ -249,6 +249,7 @@ void machine_check_queue_event(void)
>>>>    {
>>>>    	int index;
>>>>    	struct machine_check_event evt;
>>>> +	unsigned long msr;
>>>>    
>>>>    	if (!get_mce_event(&evt, MCE_EVENT_RELEASE))
>>>>    		return;
>>>> @@ -262,8 +263,19 @@ void machine_check_queue_event(void)
>>>>    	memcpy(&local_paca->mce_info->mce_event_queue[index],
>>>>    	       &evt, sizeof(evt));
>>>>    
>>>> -	/* Queue irq work to process this event later. */
>>>> -	irq_work_queue(&mce_event_process_work);
>>>> +	/* Queue irq work to process this event later. Before
>>>> +	 * queuing the work enable translation for non radix LPAR,
>>>> +	 * as irq_work_queue may try to access memory outside RMO
>>>> +	 * region.
>>>> +	 */
>>>> +	if (!radix_enabled() && firmware_has_feature(FW_FEATURE_LPAR)) {
>>>> +		msr = mfmsr();
>>>> +		mtmsr(msr | MSR_IR | MSR_DR);
>>>> +		irq_work_queue(&mce_event_process_work);
>>>> +		mtmsr(msr);
>>>> +	} else {
>>>> +		irq_work_queue(&mce_event_process_work);
>>>> +	}
>>>>    }
>>> We already went to virtual mode and queued (different) irq work in
>>> arch/powerpc/platforms/pseries/ras.c:mce_handle_error()
>>>
>>> We also called save_mce_event() which also might have queued irq work,
>>> via machine_check_ue_event().
>>>
>>> So it really feels like something about the design is wrong if we have
>>> to go to virtual mode again and queue more irq work here.
>>>
>>> I guess we can probably merge this as a backportable fix, doing anything
>>> else would be a bigger change.
>> I agree.
>>
>>> Looking at ras.c there's the comment:
>>>
>>> 	 * Enable translation as we will be accessing per-cpu variables
>>> 	 * in save_mce_event() which may fall outside RMO region, also
>>>
>>> But AFAICS it's only irq_work_queue() that touches anything percpu?
>> Yeah, we left the comment unchanged after doing some modifications around it,
>> It needs to be updated, ill send a separate patch for it.
> Thanks.
>
> I see some other comments that look out of date, ie. the one above
> machine_check_process_queued_event() mentions syscall exit, which is no
> longer true.

ill take care of it.

>
> There's also comments in pseries/ras.c about fwnmi_release_errinfo()
> crashing in real mode, but we call it in real mode now so that must be
> fixed?

Yes, it is fixed now.

>
>>> So maybe we should just not be using irq_work_queue(). It's a pretty
>>> thin wrapper around set_dec(1), perhaps we just need to hand-roll some
>>> real-mode friendly way of doing that.
>> You mean, have separate queue and run the work from timer handler?
> Yeah something like that.
>
> We don't even need a queue, we already have local_paca->mce_info->mce_queue_count.
>
> So it could just be:
>
>    if (local_paca->mce_info->mce_queue_count)
>    	machine_check_process_queued_event();
>
> Though it would need a wrapper because local_paca only exists for 64-bit.

Thanks, ill look into it.

>
> cheers

[-- Attachment #2: Type: text/html, Size: 7804 bytes --]

^ permalink raw reply

* Re: [PATCH kernel v2] KVM: PPC: Merge powerpc's debugfs entry content into generic entry
From: Alexey Kardashevskiy @ 2021-09-08  6:20 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Paolo Bonzini, Fabiano Rosas, kvm-ppc, kvm
In-Reply-To: <872d75a4-08e2-f597-0bee-6be9fdce0ac1@ozlabs.ru>

[hopefulle fixed my thunderbird now]

Huh, not sure anymore after reading d56f5136b0102 "KVM: let 
kvm_destroy_vm_debugfs clean up vCPU debugfs directories" which remove
debugfs_dentry from vcpu. Paolo?


On 05/09/2021 12:27, Alexey Kardashevskiy wrote:
> Please ignore this one, v3 is coming.
> 
> After I posted this, I suddenly realized that the vcpu debugfs entry
> remain until the VM exists and this does not handle vcpu
> hotunplug+hotplug (the ppc book3e did handle this). Thanks,
> 
> 
> On 04/09/2021 23:35, Alexey Kardashevskiy wrote:
>> At the moment the generic KVM code creates an "%pid-%fd" entry per a KVM
>> instance; and the PPC HV KVM creates its own at "vm%pid". The Book3E KVM
>> creates its own entry for timings.
>>
>> The problems with the PPC entries are:
>> 1. they do not allow multiple VMs in the same process (which is extremely
>> rare case mostly used by syzkaller fuzzer);
>> 2. prone to race bugs like the generic KVM code had fixed in
>> commit 85cd39af14f4 ("KVM: Do not leak memory for duplicate debugfs
>> directories").
>>
>> This defines kvm_arch_create_kvm_debugfs() similar to one for vcpus.
>>
>> This defines 2 hooks in kvmppc_ops for allowing specific KVM
>> implementations to add necessary entries. This defines handlers
>> for HV KVM and defines the Book3E debugfs vcpu helper as a handler.
>>
>> This makes use of already existing kvm_arch_create_vcpu_debugfs
>> on PPC.
>>
>> This removes no more used debugfs_dir pointers from PPC kvm_arch structs.
>>
>> Suggested-by: Fabiano Rosas <farosas@linux.ibm.com>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v2:
>> * handled powerpc-booke
>> * s/kvm/vm/ in arch hooks
>> ---
>>    arch/powerpc/include/asm/kvm_host.h    |  7 +++---
>>    arch/powerpc/include/asm/kvm_ppc.h     |  2 ++
>>    arch/powerpc/kvm/timing.h              |  7 +++---
>>    include/linux/kvm_host.h               |  3 +++
>>    arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
>>    arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
>>    arch/powerpc/kvm/book3s_hv.c           | 30 +++++++++-----------------
>>    arch/powerpc/kvm/e500.c                |  1 +
>>    arch/powerpc/kvm/e500mc.c              |  1 +
>>    arch/powerpc/kvm/powerpc.c             | 15 ++++++++++---
>>    arch/powerpc/kvm/timing.c              | 20 ++++-------------
>>    virt/kvm/kvm_main.c                    |  3 +++
>>    12 files changed, 44 insertions(+), 49 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 2bcac6da0a4b..f29b66cc2163 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -296,7 +296,6 @@ struct kvm_arch {
>>    	bool dawr1_enabled;
>>    	pgd_t *pgtable;
>>    	u64 process_table;
>> -	struct dentry *debugfs_dir;
>>    	struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
>>    #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
>>    #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
>> @@ -672,7 +671,6 @@ struct kvm_vcpu_arch {
>>    	u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
>>    	u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
>>    	u64 timing_last_exit;
>> -	struct dentry *debugfs_exit_timing;
>>    #endif
>>    
>>    #ifdef CONFIG_PPC_BOOK3S
>> @@ -828,8 +826,6 @@ struct kvm_vcpu_arch {
>>    	struct kvmhv_tb_accumulator rm_exit;	/* real-mode exit code */
>>    	struct kvmhv_tb_accumulator guest_time;	/* guest execution */
>>    	struct kvmhv_tb_accumulator cede_time;	/* time napping inside guest */
>> -
>> -	struct dentry *debugfs_dir;
>>    #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
>>    };
>>    
>> @@ -868,4 +864,7 @@ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
>>    static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
>>    static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
>>    
>> +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
>> +#define __KVM_HAVE_ARCH_KVM_DEBUGFS
>> +
>>    #endif /* __POWERPC_KVM_HOST_H__ */
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 6355a6980ccf..fd841e844b90 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -316,6 +316,8 @@ struct kvmppc_ops {
>>    	int (*svm_off)(struct kvm *kvm);
>>    	int (*enable_dawr1)(struct kvm *kvm);
>>    	bool (*hash_v3_possible)(void);
>> +	void (*create_vm_debugfs)(struct kvm *kvm);
>> +	void (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
>>    };
>>    
>>    extern struct kvmppc_ops *kvmppc_hv_ops;
>> diff --git a/arch/powerpc/kvm/timing.h b/arch/powerpc/kvm/timing.h
>> index feef7885ba82..36f7c201c6f1 100644
>> --- a/arch/powerpc/kvm/timing.h
>> +++ b/arch/powerpc/kvm/timing.h
>> @@ -14,8 +14,8 @@
>>    #ifdef CONFIG_KVM_EXIT_TIMING
>>    void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu);
>>    void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu);
>> -void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id);
>> -void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu);
>> +void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu,
>> +				struct dentry *debugfs_dentry);
>>    
>>    static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
>>    {
>> @@ -27,8 +27,7 @@ static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type)
>>    static inline void kvmppc_init_timing_stats(struct kvm_vcpu *vcpu) {}
>>    static inline void kvmppc_update_timing_stats(struct kvm_vcpu *vcpu) {}
>>    static inline void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu,
>> -						unsigned int id) {}
>> -static inline void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
>> +					      struct dentry *debugfs_dentry) {}
>>    static inline void kvmppc_set_exit_type(struct kvm_vcpu *vcpu, int type) {}
>>    #endif /* CONFIG_KVM_EXIT_TIMING */
>>    
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index ae7735b490b4..4f22b1201a0d 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -1021,6 +1021,9 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state);
>>    #ifdef __KVM_HAVE_ARCH_VCPU_DEBUGFS
>>    void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
>>    #endif
>> +#ifdef __KVM_HAVE_ARCH_KVM_DEBUGFS
>> +void kvm_arch_create_vm_debugfs(struct kvm *kvm);
>> +#endif
>>    
>>    int kvm_arch_hardware_enable(void);
>>    void kvm_arch_hardware_disable(void);
>> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>> index c63e263312a4..33dae253a0ac 100644
>> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
>> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
>> @@ -2112,7 +2112,7 @@ static const struct file_operations debugfs_htab_fops = {
>>    
>>    void kvmppc_mmu_debugfs_init(struct kvm *kvm)
>>    {
>> -	debugfs_create_file("htab", 0400, kvm->arch.debugfs_dir, kvm,
>> +	debugfs_create_file("htab", 0400, kvm->debugfs_dentry, kvm,
>>    			    &debugfs_htab_fops);
>>    }
>>    
>> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
>> index c5508744e14c..f4e083c20872 100644
>> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
>> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
>> @@ -1452,7 +1452,7 @@ static const struct file_operations debugfs_radix_fops = {
>>    
>>    void kvmhv_radix_debugfs_init(struct kvm *kvm)
>>    {
>> -	debugfs_create_file("radix", 0400, kvm->arch.debugfs_dir, kvm,
>> +	debugfs_create_file("radix", 0400, kvm->debugfs_dentry, kvm,
>>    			    &debugfs_radix_fops);
>>    }
>>    
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index c8f12b056968..046df9e0d462 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -2771,19 +2771,14 @@ static const struct file_operations debugfs_timings_ops = {
>>    };
>>    
>>    /* Create a debugfs directory for the vcpu */
>> -static void debugfs_vcpu_init(struct kvm_vcpu *vcpu, unsigned int id)
>> +static void kvmppc_arch_create_vcpu_debugfs_hv(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
>>    {
>> -	char buf[16];
>> -	struct kvm *kvm = vcpu->kvm;
>> -
>> -	snprintf(buf, sizeof(buf), "vcpu%u", id);
>> -	vcpu->arch.debugfs_dir = debugfs_create_dir(buf, kvm->arch.debugfs_dir);
>> -	debugfs_create_file("timings", 0444, vcpu->arch.debugfs_dir, vcpu,
>> +	debugfs_create_file("timings", 0444, debugfs_dentry, vcpu,
>>    			    &debugfs_timings_ops);
>>    }
>>    
>>    #else /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
>> -static void debugfs_vcpu_init(struct kvm_vcpu *vcpu, unsigned int id)
>> +static void kvmppc_arch_create_vcpu_debugfs_hv(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
>>    {
>>    }
>>    #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */
>> @@ -2907,8 +2902,6 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu *vcpu)
>>    	vcpu->arch.cpu_type = KVM_CPU_3S_64;
>>    	kvmppc_sanity_check(vcpu);
>>    
>> -	debugfs_vcpu_init(vcpu, id);
>> -
>>    	return 0;
>>    }
>>    
>> @@ -5186,7 +5179,6 @@ void kvmppc_free_host_rm_ops(void)
>>    static int kvmppc_core_init_vm_hv(struct kvm *kvm)
>>    {
>>    	unsigned long lpcr, lpid;
>> -	char buf[32];
>>    	int ret;
>>    
>>    	mutex_init(&kvm->arch.uvmem_lock);
>> @@ -5319,16 +5311,14 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
>>    		kvm->arch.smt_mode = 1;
>>    	kvm->arch.emul_smt_mode = 1;
>>    
>> -	/*
>> -	 * Create a debugfs directory for the VM
>> -	 */
>> -	snprintf(buf, sizeof(buf), "vm%d", current->pid);
>> -	kvm->arch.debugfs_dir = debugfs_create_dir(buf, kvm_debugfs_dir);
>> +	return 0;
>> +}
>> +
>> +static void kvmppc_arch_create_vm_debugfs_hv(struct kvm *kvm)
>> +{
>>    	kvmppc_mmu_debugfs_init(kvm);
>>    	if (radix_enabled())
>>    		kvmhv_radix_debugfs_init(kvm);
>> -
>> -	return 0;
>>    }
>>    
>>    static void kvmppc_free_vcores(struct kvm *kvm)
>> @@ -5342,8 +5332,6 @@ static void kvmppc_free_vcores(struct kvm *kvm)
>>    
>>    static void kvmppc_core_destroy_vm_hv(struct kvm *kvm)
>>    {
>> -	debugfs_remove_recursive(kvm->arch.debugfs_dir);
>> -
>>    	if (!cpu_has_feature(CPU_FTR_ARCH_300))
>>    		kvm_hv_vm_deactivated();
>>    
>> @@ -5996,6 +5984,8 @@ static struct kvmppc_ops kvm_ops_hv = {
>>    	.svm_off = kvmhv_svm_off,
>>    	.enable_dawr1 = kvmhv_enable_dawr1,
>>    	.hash_v3_possible = kvmppc_hash_v3_possible,
>> +	.create_vcpu_debugfs = kvmppc_arch_create_vcpu_debugfs_hv,
>> +	.create_vm_debugfs = kvmppc_arch_create_vm_debugfs_hv,
>>    };
>>    
>>    static int kvm_init_subcore_bitmap(void)
>> diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
>> index 7e8b69015d20..d82e70c3e0a9 100644
>> --- a/arch/powerpc/kvm/e500.c
>> +++ b/arch/powerpc/kvm/e500.c
>> @@ -495,6 +495,7 @@ static struct kvmppc_ops kvm_ops_e500 = {
>>    	.emulate_op = kvmppc_core_emulate_op_e500,
>>    	.emulate_mtspr = kvmppc_core_emulate_mtspr_e500,
>>    	.emulate_mfspr = kvmppc_core_emulate_mfspr_e500,
>> +	.create_vcpu_debugfs = kvmppc_create_vcpu_debugfs,
>>    };
>>    
>>    static int __init kvmppc_e500_init(void)
>> diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
>> index 1c189b5aadcc..45eacd949f4b 100644
>> --- a/arch/powerpc/kvm/e500mc.c
>> +++ b/arch/powerpc/kvm/e500mc.c
>> @@ -381,6 +381,7 @@ static struct kvmppc_ops kvm_ops_e500mc = {
>>    	.emulate_op = kvmppc_core_emulate_op_e500,
>>    	.emulate_mtspr = kvmppc_core_emulate_mtspr_e500,
>>    	.emulate_mfspr = kvmppc_core_emulate_mfspr_e500,
>> +	.create_vcpu_debugfs = kvmppc_create_vcpu_debugfs,
>>    };
>>    
>>    static int __init kvmppc_e500mc_init(void)
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index c248d6d8b9e3..c895521ac6e9 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -763,7 +763,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>>    		goto out_vcpu_uninit;
>>    
>>    	vcpu->arch.waitp = &vcpu->wait;
>> -	kvmppc_create_vcpu_debugfs(vcpu, vcpu->vcpu_id);
>>    	return 0;
>>    
>>    out_vcpu_uninit:
>> @@ -780,8 +779,6 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>>    	/* Make sure we're not using the vcpu anymore */
>>    	hrtimer_cancel(&vcpu->arch.dec_timer);
>>    
>> -	kvmppc_remove_vcpu_debugfs(vcpu);
>> -
>>    	switch (vcpu->arch.irq_type) {
>>    	case KVMPPC_IRQ_MPIC:
>>    		kvmppc_mpic_disconnect_vcpu(vcpu->arch.mpic, vcpu);
>> @@ -2505,3 +2502,15 @@ int kvm_arch_init(void *opaque)
>>    }
>>    
>>    EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ppc_instr);
>> +
>> +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
>> +{
>> +	if (vcpu->kvm->arch.kvm_ops->create_vcpu_debugfs)
>> +		vcpu->kvm->arch.kvm_ops->create_vcpu_debugfs(vcpu, debugfs_dentry);
>> +}
>> +
>> +void kvm_arch_create_vm_debugfs(struct kvm *kvm)
>> +{
>> +	if (kvm->arch.kvm_ops->create_vm_debugfs)
>> +		kvm->arch.kvm_ops->create_vm_debugfs(kvm);
>> +}
>> diff --git a/arch/powerpc/kvm/timing.c b/arch/powerpc/kvm/timing.c
>> index ba56a5cbba97..e1c17afc714d 100644
>> --- a/arch/powerpc/kvm/timing.c
>> +++ b/arch/powerpc/kvm/timing.c
>> @@ -204,21 +204,9 @@ static const struct file_operations kvmppc_exit_timing_fops = {
>>    	.release = single_release,
>>    };
>>    
>> -void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu, unsigned int id)
>> +void kvmppc_create_vcpu_debugfs(struct kvm_vcpu *vcpu,
>> +				struct dentry *debugfs_dentry)
>>    {
>> -	static char dbg_fname[50];
>> -	struct dentry *debugfs_file;
>> -
>> -	snprintf(dbg_fname, sizeof(dbg_fname), "vm%u_vcpu%u_timing",
>> -		 current->pid, id);
>> -	debugfs_file = debugfs_create_file(dbg_fname, 0666, kvm_debugfs_dir,
>> -						vcpu, &kvmppc_exit_timing_fops);
>> -
>> -	vcpu->arch.debugfs_exit_timing = debugfs_file;
>> -}
>> -
>> -void kvmppc_remove_vcpu_debugfs(struct kvm_vcpu *vcpu)
>> -{
>> -	debugfs_remove(vcpu->arch.debugfs_exit_timing);
>> -	vcpu->arch.debugfs_exit_timing = NULL;
>> +	debugfs_create_file("timing", 0666, debugfs_dentry,
>> +			    vcpu, &kvmppc_exit_timing_fops);
>>    }
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index b50dbe269f4b..85b2550e18e7 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -954,6 +954,9 @@ static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
>>    				    kvm->debugfs_dentry, stat_data,
>>    				    &stat_fops_per_vm);
>>    	}
>> +#ifdef __KVM_HAVE_ARCH_KVM_DEBUGFS
>> +	kvm_arch_create_vm_debugfs(kvm);
>> +#endif
>>    	return 0;
>>    }
>>    
>>
> 

-- 
Alexey

^ permalink raw reply

* Re: [PATCH] powerpc/mce: Fix access error in mce handler
From: Michael Ellerman @ 2021-09-08  5:40 UTC (permalink / raw)
  To: Ganesh, linuxppc-dev; +Cc: mahesh, npiggin
In-Reply-To: <f14cb57a-5ae0-f867-1e18-004f34a3b320@linux.ibm.com>

Ganesh <ganeshgr@linux.ibm.com> writes:
> On 9/6/21 6:03 PM, Michael Ellerman wrote:
>
>> Ganesh Goudar <ganeshgr@linux.ibm.com> writes:
>>> We queue an irq work for deferred processing of mce event
>>> in realmode mce handler, where translation is disabled.
>>> Queuing of the work may result in accessing memory outside
>>> RMO region, such access needs the translation to be enabled
>>> for an LPAR running with hash mmu else the kernel crashes.
>>>
>>> So enable the translation before queuing the work.
>>>
>>> Without this change following trace is seen on injecting machine
>>> check error in an LPAR running with hash mmu.
>> What type of error are you injecting?
>
> SLB multihit in kernel mode.
>
>>
>>> Oops: Kernel access of bad area, sig: 11 [#1]
>>> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>>> CPU: 5 PID: 1883 Comm: insmod Tainted: G        OE     5.14.0-mce+ #137
>>> NIP:  c000000000735d60 LR: c000000000318640 CTR: 0000000000000000
>>> REGS: c00000001ebff9a0 TRAP: 0300   Tainted: G       OE      (5.14.0-mce+)
>>> MSR:  8000000000001003 <SF,ME,RI,LE>  CR: 28008228  XER: 00000001
>>> CFAR: c00000000031863c DAR: c00000027fa8fe08 DSISR: 40000000 IRQMASK: 0
>>> GPR00: c0000000003186d0 c00000001ebffc40 c000000001b0df00 c0000000016337e8
>>> GPR04: c0000000016337e8 c00000027fa8fe08 0000000000000023 c0000000016337f0
>>> GPR08: 0000000000000023 c0000000012ffe08 0000000000000000 c008000001460240
>>> GPR12: 0000000000000000 c00000001ec9a900 c00000002ac4bd00 0000000000000000
>>> GPR16: 00000000000005a0 c0080000006b0000 c0080000006b05a0 c000000000ff3068
>>> GPR20: c00000002ac4bbc0 0000000000000001 c00000002ac4bbc0 c008000001490298
>>> GPR24: c008000001490108 c000000001636198 c008000001470090 c008000001470058
>>> GPR28: 0000000000000510 c008000001000000 c008000008000019 0000000000000019
>>> NIP [c000000000735d60] llist_add_batch+0x0/0x40
>>> LR [c000000000318640] __irq_work_queue_local+0x70/0xc0
>>> Call Trace:
>>> [c00000001ebffc40] [c00000001ebffc0c] 0xc00000001ebffc0c (unreliable)
>>> [c00000001ebffc60] [c0000000003186d0] irq_work_queue+0x40/0x70
>>> [c00000001ebffc80] [c00000000004425c] machine_check_queue_event+0xbc/0xd0
>>> [c00000001ebffcf0] [c00000000000838c] machine_check_early_common+0x16c/0x1f4
>>>
>>> Fixes: 74c3354bc1d89 ("powerpc/pseries/mce: restore msr before returning from handler")
>> Please explain in more detail why that commit caused this breakage.
>
> After enabling translation in mce_handle_error() we used to leave it enabled to avoid
> crashing here, but now with this commit we are restoring the MSR to disable translation.

Are you sure we left the MMU enabled to avoid crashing there, or we just
left it enabled by accident?

But yeah, previously the MMU was enabled when we got here whereas now
it's not, because of that change.

> Missed to mention it in commit log, I will add it.

Thanks.

>>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>>> index 47a683cd00d2..9d1e39d42e3e 100644
>>> --- a/arch/powerpc/kernel/mce.c
>>> +++ b/arch/powerpc/kernel/mce.c
>>> @@ -249,6 +249,7 @@ void machine_check_queue_event(void)
>>>   {
>>>   	int index;
>>>   	struct machine_check_event evt;
>>> +	unsigned long msr;
>>>   
>>>   	if (!get_mce_event(&evt, MCE_EVENT_RELEASE))
>>>   		return;
>>> @@ -262,8 +263,19 @@ void machine_check_queue_event(void)
>>>   	memcpy(&local_paca->mce_info->mce_event_queue[index],
>>>   	       &evt, sizeof(evt));
>>>   
>>> -	/* Queue irq work to process this event later. */
>>> -	irq_work_queue(&mce_event_process_work);
>>> +	/* Queue irq work to process this event later. Before
>>> +	 * queuing the work enable translation for non radix LPAR,
>>> +	 * as irq_work_queue may try to access memory outside RMO
>>> +	 * region.
>>> +	 */
>>> +	if (!radix_enabled() && firmware_has_feature(FW_FEATURE_LPAR)) {
>>> +		msr = mfmsr();
>>> +		mtmsr(msr | MSR_IR | MSR_DR);
>>> +		irq_work_queue(&mce_event_process_work);
>>> +		mtmsr(msr);
>>> +	} else {
>>> +		irq_work_queue(&mce_event_process_work);
>>> +	}
>>>   }
>> We already went to virtual mode and queued (different) irq work in
>> arch/powerpc/platforms/pseries/ras.c:mce_handle_error()
>>
>> We also called save_mce_event() which also might have queued irq work,
>> via machine_check_ue_event().
>>
>> So it really feels like something about the design is wrong if we have
>> to go to virtual mode again and queue more irq work here.
>>
>> I guess we can probably merge this as a backportable fix, doing anything
>> else would be a bigger change.
>
> I agree.
>
>>
>> Looking at ras.c there's the comment:
>>
>> 	 * Enable translation as we will be accessing per-cpu variables
>> 	 * in save_mce_event() which may fall outside RMO region, also
>>
>> But AFAICS it's only irq_work_queue() that touches anything percpu?
>
> Yeah, we left the comment unchanged after doing some modifications around it,
> It needs to be updated, ill send a separate patch for it.

Thanks.

I see some other comments that look out of date, ie. the one above
machine_check_process_queued_event() mentions syscall exit, which is no
longer true.

There's also comments in pseries/ras.c about fwnmi_release_errinfo()
crashing in real mode, but we call it in real mode now so that must be
fixed?

>> So maybe we should just not be using irq_work_queue(). It's a pretty
>> thin wrapper around set_dec(1), perhaps we just need to hand-roll some
>> real-mode friendly way of doing that.
>
> You mean, have separate queue and run the work from timer handler?

Yeah something like that.

We don't even need a queue, we already have local_paca->mce_info->mce_queue_count.

So it could just be:

  if (local_paca->mce_info->mce_queue_count)
  	machine_check_process_queued_event();

Though it would need a wrapper because local_paca only exists for 64-bit.

cheers

^ permalink raw reply

* Re: [PATCH 1/2] powerpc/perf: Expose instruction and data address registers as part of extended regs
From: Michael Ellerman @ 2021-09-08  5:17 UTC (permalink / raw)
  To: Athira Rajeev, acme, jolsa; +Cc: kjain, maddy, linuxppc-dev, rnsastry
In-Reply-To: <1624200360-1429-2-git-send-email-atrajeev@linux.vnet.ibm.com>

Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
> Patch adds support to include Sampled Instruction Address Register
> (SIAR) and Sampled Data Address Register (SDAR) SPRs as part of extended
> registers. Update the definition of PERF_REG_PMU_MASK_300/31 and
> PERF_REG_EXTENDED_MAX to include these SPR's.
>
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/uapi/asm/perf_regs.h | 12 +++++++-----
>  arch/powerpc/perf/perf_regs.c             |  4 ++++
>  2 files changed, 11 insertions(+), 5 deletions(-)
>
...
> diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
> index b931eed..51d31b6 100644
> --- a/arch/powerpc/perf/perf_regs.c
> +++ b/arch/powerpc/perf/perf_regs.c
> @@ -90,7 +90,11 @@ static u64 get_ext_regs_value(int idx)
>  		return mfspr(SPRN_SIER2);
>  	case PERF_REG_POWERPC_SIER3:
>  		return mfspr(SPRN_SIER3);
> +	case PERF_REG_POWERPC_SDAR:
> +		return mfspr(SPRN_SDAR);
>  #endif
> +	case PERF_REG_POWERPC_SIAR:
> +		return mfspr(SPRN_SIAR);
>  	default: return 0;
>  	}

This file is built for all powerpc configs that have PERF_EVENTS. Which
includes CPUs that don't have SDAR or SIAR.

Don't we need checks in perf_reg_value() like we do for SIER?

I guess we already got this wrong when we added the Power10 registers,
SIER2/3 etc.

cheers

^ permalink raw reply

* Re: [RFC PATCH v2] powerpc/papr_scm: Move duplicate definitions to common header files
From: Michael Ellerman @ 2021-09-08  5:02 UTC (permalink / raw)
  To: Shivaprasad G Bhat, linuxppc-dev, linux-kernel
  Cc: nvdimm, dan.j.williams, vaibhav, sbhat, aneesh.kumar
In-Reply-To: <163092037510.812.12838160593592476913.stgit@82313cf9f602>

Shivaprasad G Bhat <sbhat@linux.ibm.com> writes:
> papr_scm and ndtest share common PDSM payload structs like
> nd_papr_pdsm_health. Presently these structs are duplicated across papr_pdsm.h
> and ndtest.h header files. Since 'ndtest' is essentially arch independent and can
> run on platforms other than PPC64, a way needs to be deviced to avoid redundancy
> and duplication of PDSM structs in future.
>
> So the patch proposes moving the PDSM header from arch/powerpc/include/uapi/ to
> the generic include/uapi/linux directory. Also, there are some #defines common
> between papr_scm and ndtest which are not exported to the user space. So, move
> them to a header file which can be shared across ndtest and papr_scm via newly
> introduced include/linux/papr_scm.h.
>
> Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> Suggested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
> ---
> Changelog:
>
> Since v1:
> Link: https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
> * Removed dependency on this patch for the other patches
>
>  MAINTAINERS                               |    2 
>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 -----------------------------
>  arch/powerpc/platforms/pseries/papr_scm.c |   43 --------
>  include/linux/papr_scm.h                  |   48 ++++++++
>  include/uapi/linux/papr_pdsm.h            |  165 +++++++++++++++++++++++++++++

This doesn't make sense to me.

Anything with papr (or PAPR) in the name is fundamentally powerpc
specific, it doesn't belong in a generic header, or in a generic
location.

What's the actual problem you're trying to solve?

If it's just including papr_scm bits into ndtest.c then that should be
as simple as:

#ifdef __powerpc__
#include <asm/papr_scm.h>
#endif

Shouldn't it?

cheers

^ permalink raw reply

* Re: [PATCH v3] ftrace: Cleanup ftrace_dyn_arch_init()
From: Weizhao Ouyang @ 2021-09-08  3:52 UTC (permalink / raw)
  To: LEROY Christophe, Steven Rostedt, Ingo Molnar
  Cc: Rich Felker, linux-ia64@vger.kernel.org, linux-sh@vger.kernel.org,
	Catalin Marinas, linux-kernel@vger.kernel.org,
	James E.J. Bottomley, Guo Ren, H. Peter Anvin,
	sparclinux@vger.kernel.org, linux-riscv@lists.infradead.org,
	Vincent Chen, Will Deacon, linux-s390@vger.kernel.org,
	Yoshinori Sato, Helge Deller, x86@kernel.org, Russell King,
	linux-csky@vger.kernel.org, Christian Borntraeger, Albert Ou,
	Vasily Gorbik, Heiko Carstens, Borislav Petkov, Greentime Hu,
	Paul Walmsley, Thomas Gleixner,
	linux-arm-kernel@lists.infradead.org, Michal Simek,
	Thomas Bogendoerfer, Nick Hu, linux-parisc@vger.kernel.org,
	linux-mips@vger.kernel.org, Palmer Dabbelt, Paul Mackerras,
	linuxppc-dev@lists.ozlabs.org, David S. Miller
In-Reply-To: <MRZP264MB298824D80E6C0ADCB5EA1D9AEDD39@MRZP264MB2988.FRAP264.PROD.OUTLOOK.COM>

Thanks for reply.

On 2021/9/7 23:55, LEROY Christophe wrote:
>
>> -----Message d'origine-----
>> De : Linuxppc-dev <linuxppc-dev-
>> bounces+christophe.leroy=csgroup.eu@lists.ozlabs.org> De la part de Weizhao
>> Ouyang
>>
>> Most of ARCHs use empty ftrace_dyn_arch_init(), introduce a weak common
>> ftrace_dyn_arch_init() to cleanup them.
>>
>> Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
>> Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390)
>> Acked-by: Helge Deller <deller@gmx.de> (parisc)
>>
>> ---
>> Changes in v3:
>> -- fix unrecognized opcode on PowerPC
>>
>> Changes in v2:
>> -- correct CONFIG_DYNAMIC_FTRACE on PowerPC
>> -- add Acked-by tag
>>
>> ---
>>  arch/arm/kernel/ftrace.c          | 5 -----
>>  arch/arm64/kernel/ftrace.c        | 5 -----
>>  arch/csky/kernel/ftrace.c         | 5 -----
>>  arch/ia64/kernel/ftrace.c         | 6 ------
>>  arch/microblaze/kernel/ftrace.c   | 5 -----
>>  arch/mips/include/asm/ftrace.h    | 2 ++
>>  arch/nds32/kernel/ftrace.c        | 5 -----
>>  arch/parisc/kernel/ftrace.c       | 5 -----
>>  arch/powerpc/include/asm/ftrace.h | 4 ++++
>>  arch/riscv/kernel/ftrace.c        | 5 -----
>>  arch/s390/kernel/ftrace.c         | 5 -----
>>  arch/sh/kernel/ftrace.c           | 5 -----
>>  arch/sparc/kernel/ftrace.c        | 5 -----
>>  arch/x86/kernel/ftrace.c          | 5 -----
>>  include/linux/ftrace.h            | 1 -
>>  kernel/trace/ftrace.c             | 5 +++++
>>  16 files changed, 11 insertions(+), 62 deletions(-)
>>
>> diff --git a/arch/mips/include/asm/ftrace.h b/arch/mips/include/asm/ftrace.h
>> index b463f2aa5a61..ed013e767390 100644
>> --- a/arch/mips/include/asm/ftrace.h
>> +++ b/arch/mips/include/asm/ftrace.h
>> @@ -76,6 +76,8 @@ do {                                                \
>>
>>
>>  #ifdef CONFIG_DYNAMIC_FTRACE
>> +int __init ftrace_dyn_arch_init(void);
>> +
> Why ?
>
>
>>  static inline unsigned long ftrace_call_adjust(unsigned long addr)
>>  {
>>       return addr;
>> diff --git a/arch/powerpc/include/asm/ftrace.h
>> b/arch/powerpc/include/asm/ftrace.h
>> index debe8c4f7062..b05c43f13a4d 100644
>> --- a/arch/powerpc/include/asm/ftrace.h
>> +++ b/arch/powerpc/include/asm/ftrace.h
>> @@ -126,6 +126,10 @@ static inline void this_cpu_enable_ftrace(void) { }
>>  static inline void this_cpu_set_ftrace_enabled(u8 ftrace_enabled) { }
>>  static inline u8 this_cpu_get_ftrace_enabled(void) { return 1; }
>>  #endif /* CONFIG_PPC64 */
>> +
>> +#ifdef CONFIG_DYNAMIC_FTRACE
>> +int __init ftrace_dyn_arch_init(void);
>> +#endif /* CONFIG_DYNAMIC_FTRACE */
> Why ?
>
>>  #endif /* !__ASSEMBLY__ */
>>
>>  #endif /* _ASM_POWERPC_FTRACE */
>> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
>> index 832e65f06754..f1eca123d89d 100644
>> --- a/include/linux/ftrace.h
>> +++ b/include/linux/ftrace.h
>> @@ -573,7 +573,6 @@ ftrace_set_early_filter(struct ftrace_ops *ops, char
>> *buf, int enable);
>>
>>  /* defined in arch */
>>  extern int ftrace_ip_converted(unsigned long ip);
>> -extern int ftrace_dyn_arch_init(void);
> Why removing that ?
>
> Have you tried to build kernel/trace/ftrace.o with C=2 ? It will likely tell you that the function is not declared and that it should be static

Yes I missed this check. Under the situation, the function should be static.

> We could eventually consider that in the past, this generic declaration was unrelevant because the definitions where in the arch specific sections.
> Now that you are implementing a generic weak version of this function, it would make sense to have a generic declaration as well.
>
> I really don't see the point in duplicating the declaration of the function in the arch specific headers.

I use declaration in arch specific headers in tend to clarify the arch has implement ftrace_dyn_arch_init().
Anyway, it maybe pointless, a generic declaration is enough. Will update it later.

>>  extern void ftrace_replace_code(int enable);
>>  extern int ftrace_update_ftrace_func(ftrace_func_t func);
>>  extern void ftrace_caller(void);
> Christophe
>
> CS Group - Document Interne

^ permalink raw reply

* Re: [PATCH 0/5] s390/pci: automatic error recovery
From: Oliver O'Halloran @ 2021-09-08  1:37 UTC (permalink / raw)
  To: Niklas Schnelle
  Cc: linux-s390, Pierre Morel, Matthew Rosato,
	Linux Kernel Mailing List, Bjorn Helgaas, Linas Vepstas,
	linuxppc-dev
In-Reply-To: <0c9326c943c0e6aa572cc132ee2deb952bf41c7f.camel@linux.ibm.com>

On Tue, Sep 7, 2021 at 10:21 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
>
> On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> > On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle <schnelle@linux.ibm.com> wrote:
> > > > Patch 3 I already sent separately resulting in the discussion below but without
> > > > a final conclusion.
> > > >
> > > > https://lore.kernel.org/lkml/20210720150145.640727-1-schnelle@linux.ibm.com/
> > > >
> > > > I believe even though there were some doubts about the use of
> > > > pci_dev_is_added() by arch code the existing uses as well as the use in the
> > > > final patch of this series warrant this export.
> > >
> > > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > > pci_bus_add_device() could be called before pci_device_add(). That was
> > > fixed a while ago so It should be safe to remove those calls now.
> >
> > Hmm, ok that confirms Bjorns suspicion and explains how it came to be.
> > I can certainly sent a patch for that. This would then leave only the
> > existing use in s390 which I added because of a dead lock prevention
> > and explained here:
> > https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.camel@linux.ibm.com/
> >
> > Plus the need to use it in the recovery code of this series. I think in
> > the EEH code the need for a similar check is alleviated by the checks
> > in the beginning of
> > arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> > eeh_slot_presence_check() which checks presence via the hotplug slot.
> > I guess we could use our own state tracking in a similar way but felt
> > like pci_dev_is_added() is the more logical choice.

The slot check is mainly there to prevent attempts to "recover"
devices that have been surprise removed (i.e NVMe hot-unplug). The
actual recovery process operates off the eeh_pe tree which is frozen
in place when an error is detected. If a pci_dev is added or removed
it's not really a problem since those are only ever looked at when
notifying drivers which is done with the rescan_remove lock held. That
said, I wouldn't really encourage anyone to follow the EEH model since
it's pretty byzantine.

> Looking into this again, I think we actually can't easily track this
> state ourselves outside struct pci_dev. The reason for this is that
> when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
> pci_dev and scans it again the new struct pci_dev re-uses the same
> struct zpci_dev because from a platform point of view the PCI device
> was never removed but only disabled and re-enabled. Thus we can only
> distinguish the stale struct pci_dev by looking at things stored in
> struct pci_dev itself.

IMO the real problem is removing and re-adding the pci_dev. I think
it's something that's done largely because the PCI core doesn't really
provide any better mechanism for getting a device back into a
known-good state so it's abused to implement error recovery. This is
something that's always annoyed me since it conflates recovery with
hotplug. After a hot-(un)plug we might have a different device or no
device. In the recovery case we expect to start and end with the same
device. Why not apply the same logic to the pci_dev?

Something I was tinkering with before I left IBM was re-working the
way EEH handles recovering devices that don't have a driver with error
handling callbacks to something like:

1. unbind the driver
2. pci_save_state()
3. do the reset
4. pci_restore_state()
5. re-bind the driver

That would allow keeping the pci_dev around and let me delete a pile
of confusing code which handles binding the eeh_dev to the new
pci_dev. The obvious problem with that approach is the assumption the
device is functional enough to allow saving the config space, but I
don't think that's a deal breaker. We could stash a copy of the device
state before we allow drivers to attach and use that to restore the
device after the reset. The end result would be the same known-good
state that we'd get after a re-scan.

> That said, I think for the recovery case we might be able to drop the
> pci_dev_is_added() and rely on pdev->driver != NULL which we check
> anyway and that should catch any PCI device that was already removed.

Would that work if there was an error on a device without a driver
bound? If you're just trying to stop races between recovery and device
removal then pci_dev_is_added() is probably the right tool for the
job. Trying to substitute it with a proxy seems like a bad idea.

^ permalink raw reply

* Re: [RESEND PATCH v4 4/4] powerpc/papr_scm: Document papr_scm sysfs event format entries
From: Dan Williams @ 2021-09-08  1:03 UTC (permalink / raw)
  To: Kajol Jain
  Cc: Linux NVDIMM, Santosh Sivaraj, maddy, Weiny, Ira, rnsastry,
	Peter Zijlstra, Linux Kernel Mailing List, atrajeev,
	Aneesh Kumar K.V, Vishal L Verma, Vaibhav Jain, Thomas Gleixner,
	linuxppc-dev
In-Reply-To: <20210903050914.273525-5-kjain@linux.ibm.com>

On Thu, Sep 2, 2021 at 10:11 PM Kajol Jain <kjain@linux.ibm.com> wrote:
>
> Details is added for the event, cpumask and format attributes
> in the ABI documentation.
>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
> Tested-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
> Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
> ---
>  Documentation/ABI/testing/sysfs-bus-papr-pmem | 31 +++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> index 95254cec92bf..4d86252448f8 100644
> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> @@ -61,3 +61,34 @@ Description:
>                 * "CchRHCnt" : Cache Read Hit Count
>                 * "CchWHCnt" : Cache Write Hit Count
>                 * "FastWCnt" : Fast Write Count
> +
> +What:          /sys/devices/nmemX/format
> +Date:          June 2021
> +Contact:       linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, nvdimm@lists.linux.dev,
> +Description:   (RO) Attribute group to describe the magic bits
> +                that go into perf_event_attr.config for a particular pmu.
> +                (See ABI/testing/sysfs-bus-event_source-devices-format).
> +
> +                Each attribute under this group defines a bit range of the
> +                perf_event_attr.config. Supported attribute is listed
> +                below::
> +
> +                   event  = "config:0-4"  - event ID
> +
> +               For example::
> +                   noopstat = "event=0x1"
> +
> +What:          /sys/devices/nmemX/events

That's not a valid sysfs path. Did you mean /sys/bus/nd/devices/nmemX?

> +Date:          June 2021
> +Contact:       linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, nvdimm@lists.linux.dev,
> +Description:    (RO) Attribute group to describe performance monitoring
> +                events specific to papr-scm. Each attribute in this group describes
> +                a single performance monitoring event supported by this nvdimm pmu.
> +                The name of the file is the name of the event.
> +                (See ABI/testing/sysfs-bus-event_source-devices-events).

Given these events are in the generic namespace the ABI documentation
should be generic as well. So I think move these entries to
Documentation/ABI/testing/sysfs-bus-nvdimm directly.

You can still mention papr-scm, but I would expect something like:

What:           /sys/bus/nd/devices/nmemX/events
Date:           September 2021
KernelVersion:  5.16
Contact:        Kajol Jain <kjain@linux.ibm.com>
Description:
                (RO) Attribute group to describe performance monitoring events
                for the nvdimm memory device. Each attribute in this group
                describes a single performance monitoring event supported by
                this nvdimm pmu.  The name of the file is the name of the event.
                (See ABI/testing/sysfs-bus-event_source-devices-events). A
                listing of the events supported by a given nvdimm provider type
                can be found in Documentation/driver-api/nvdimm/$provider, for
                example: Documentation/driver-api/nvdimm/papr-scm.



> +
> +What:          /sys/devices/nmemX/cpumask
> +Date:          June 2021
> +Contact:       linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, nvdimm@lists.linux.dev,
> +Description:   (RO) This sysfs file exposes the cpumask which is designated to make
> +                HCALLs to retrieve nvdimm pmu event counter data.

Seems this one would be provider generic, so no need to refer to PPC
specific concepts like HCALLs.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox