LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs
From: Nicholas Piggin @ 2021-02-04  9:05 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev, Michael Ellerman
In-Reply-To: <C90JVYFOGWU0.1C2DRATSDH0FM@geist>

Excerpts from Christopher M. Riedl's message of February 4, 2021 4:59 pm:
> On Sat Jan 30, 2021 at 7:44 AM CST, Nicholas Piggin wrote:
>> Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm:
>> > "Christopher M. Riedl" <cmr@codefail.de> writes:
>> >> The idle entry/exit code saves/restores GPRs in the stack "red zone"
>> >> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset
>> >> used for the first GPR is incorrect and overwrites the back chain - the
>> >> Protected Zone actually starts below the current SP. In practice this is
>> >> probably not an issue, but it's still incorrect so fix it.
>> > 
>> > Nice catch.
>> > 
>> > Corrupting the back chain means you can't backtrace from there, which
>> > could be confusing for debugging one day.
>>
>> Yeah, we seem to have got away without noticing because the CPU will
>> wake up and return out of here before it tries to unwind the stack,
>> but if you tried to walk it by hand if the CPU got stuck in idle or
>> something, then we'd get confused.
>>
>> > It does make me wonder why we don't just create a stack frame and use
>> > the normal macros? It would use a bit more stack space, but we shouldn't
>> > be short of stack space when going idle.
>> > 
>> > Nick, was there a particular reason for using the red zone?
>>
>> I don't recall a particular reason, I think a normal stack frame is
>> probably a good idea.
> 
> I'll send a version using STACKFRAMESIZE - I assume that's the "normal"
> stack frame :)
> 

I think STACKFRAMESIZE is STACK_FRAME_OVERHEAD + NVGPRs. LR and CR can 
be saved in the caller's frame so that should be okay.

> I admit I am a bit confused when I saw the similar but much smaller
> STACK_FRAME_OVERHEAD which is also used in _some_ cases to save/restore
> a few registers.

Yeah if you don't need to save all nvgprs you can use caller's frame 
plus a few bytes in the minimum frame as volatile storage.

STACK_FRAME_OVERHEAD should be 32 on LE, but I think the problem is a 
lot of asm uses it and hasn't necessarily been audited to make sure it's 
not assuming it's bigger. You could actually use STACK_FRAME_MIN_SIZE
for new code, maybe we add a STACK_FRAME_MIN_NVGPR_SIZE to match and
use that.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
From: Nicholas Piggin @ 2021-02-04  8:53 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: Michal Suchanek
In-Reply-To: <65686b53-feb4-2788-88e1-76c3714d3e97@csgroup.eu>

Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:
> 
> 
> Le 04/02/2021 à 04:27, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
>>>
>>>
>>> Le 25/02/2020 à 18:35, Nicholas Piggin a écrit :
>>>> Implement the bulk of interrupt return logic in C. The asm return code
>>>> must handle a few cases: restoring full GPRs, and emulating stack store.
>>>>
>>>
>>>
>>>> +notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, unsigned long msr)
>>>> +{
>>>> +	unsigned long *ti_flagsp = &current_thread_info()->flags;
>>>> +	unsigned long flags;
>>>> +
>>>> +	if (IS_ENABLED(CONFIG_PPC_BOOK3S) && unlikely(!(regs->msr & MSR_RI)))
>>>> +		unrecoverable_exception(regs);
>>>> +	BUG_ON(regs->msr & MSR_PR);
>>>> +	BUG_ON(!FULL_REGS(regs));
>>>> +
>>>> +	local_irq_save(flags);
>>>> +
>>>> +	if (regs->softe == IRQS_ENABLED) {
>>>> +		/* Returning to a kernel context with local irqs enabled. */
>>>> +		WARN_ON_ONCE(!(regs->msr & MSR_EE));
>>>> +again:
>>>> +		if (IS_ENABLED(CONFIG_PREEMPT)) {
>>>> +			/* Return to preemptible kernel context */
>>>> +			if (unlikely(*ti_flagsp & _TIF_NEED_RESCHED)) {
>>>> +				if (preempt_count() == 0)
>>>> +					preempt_schedule_irq();
>>>> +			}
>>>> +		}
>>>> +
>>>> +		trace_hardirqs_on();
>>>> +		__hard_EE_RI_disable();
>>>> +		if (unlikely(lazy_irq_pending())) {
>>>> +			__hard_RI_enable();
>>>> +			irq_soft_mask_set(IRQS_ALL_DISABLED);
>>>> +			trace_hardirqs_off();
>>>> +			local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
>>>> +			/*
>>>> +			 * Can't local_irq_enable in case we are in interrupt
>>>> +			 * context. Must replay directly.
>>>> +			 */
>>>> +			replay_soft_interrupts();
>>>> +			irq_soft_mask_set(flags);
>>>> +			/* Took an interrupt, may have more exit work to do. */
>>>> +			goto again;
>>>> +		}
>>>> +		local_paca->irq_happened = 0;
>>>> +		irq_soft_mask_set(IRQS_ENABLED);
>>>> +	} else {
>>>> +		/* Returning to a kernel context with local irqs disabled. */
>>>> +		trace_hardirqs_on();
>>>> +		__hard_EE_RI_disable();
>>>> +		if (regs->msr & MSR_EE)
>>>> +			local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
>>>> +	}
>>>> +
>>>> +
>>>> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>>>> +	local_paca->tm_scratch = regs->msr;
>>>> +#endif
>>>> +
>>>> +	/*
>>>> +	 * We don't need to restore AMR on the way back to userspace for KUAP.
>>>> +	 * The value of AMR only matters while we're in the kernel.
>>>> +	 */
>>>> +	kuap_restore_amr(regs);
>>>
>>> Is that correct to restore KUAP state here ? Shouldn't we have it at lower level in assembly ?
>>>
>>> Isn't there a risk that someone manages to call interrupt_exit_kernel_prepare() or the end of it in
>>> a way or another, and get the previous KUAP state restored by this way ?
>> 
>> I'm not sure if there much more risk if it's here rather than the
>> instruction being in another place in the code.
>> 
>> There's a lot of user access around the kernel too if you want to find a
>> gadget to unlock KUAP then I suppose there is a pretty large attack
>> surface.
> 
> My understanding is that user access scope is strictly limited, for instance we enforce the 
> begin/end of user access to be in the same function, and we refrain from calling any other function 
> inside the user access window. x86 even have 'objtool' to enforce it at build time. So in theory 
> there is no way to get out of the function while user access is open.
> 
> Here with the interrupt exit function it is free beer. You have a place where you re-open user 
> access and return with a simple blr. So that's open bar. If someone manages to just call the 
> interrupt exit function, then user access remains open

Hmm okay maybe that's a good point.

>>> Also, it looks a bit strange to have kuap_save_amr_and_lock() done at lowest level in assembly, and
>>> kuap_restore_amr() done in upper level. That looks unbalanced.
>> 
>> I'd like to bring the entry assembly into C.
>> 
> 
> I really think it's not a good idea. We'll get better control and readability by keeping it at the 
> lowest possible level in assembly.
> 
> x86 only save and restore SMAC state in assembly.

Okay we could go the other way and move the unlock into asm then.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays
From: Christoph Hellwig @ 2021-02-04  8:40 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: ulf.hansson, airlied, joonas.lahtinen, dri-devel, linux-kernel,
	bhelgaas, paulus, hpa, hch, m.szyprowski, sstabellini,
	adrian.hunter, x86, joe.jin, mingo, peterz, mingo, bskeggs,
	linux-pci, xen-devel, matthew.auld, thomas.lendacky, konrad.wilk,
	intel-gfx, jani.nikula, bp, rodrigo.vivi, nouveau,
	boris.ostrovsky, chris, jgross, tsbogend, robin.murphy, linux-mmc,
	linux-mips, iommu, tglx, bauerman, daniel, akpm, linuxppc-dev,
	rppt
In-Reply-To: <20210203233709.19819-6-dongli.zhang@oracle.com>

So one thing that has been on my mind for a while:  I'd really like
to kill the separate dma ops in Xen swiotlb.  If we compare xen-swiotlb
to swiotlb the main difference seems to be:

 - additional reasons to bounce I/O vs the plain DMA capable
 - the possibility to do a hypercall on arm/arm64
 - an extra translation layer before doing the phys_to_dma and vice
   versa
 - an special memory allocator

I wonder if inbetween a few jump labels or other no overhead enablement
options and possibly better use of the dma_range_map we could kill
off most of swiotlb-xen instead of maintaining all this code duplication?

^ permalink raw reply

* Re: [PATCH v4 1/2] powerpc: sstep: Fix load-store and update emulation
From: Naveen N. Rao @ 2021-02-04  8:31 UTC (permalink / raw)
  To: Sandipan Das; +Cc: ravi.bangoria, ananth, jniethe5, paulus, linuxppc-dev, dja
In-Reply-To: <20210204080744.135785-1-sandipan@linux.ibm.com>

On 2021/02/04 01:37PM, Sandipan Das wrote:
> The Power ISA says that the fixed-point load and update
> instructions must neither use R0 for the base address (RA)
> nor have the destination (RT) and the base address (RA) as
> the same register. Similarly, for fixed-point stores and
> floating-point loads and stores, the instruction is invalid
> when R0 is used as the base address (RA).
> 
> This is applicable to the following instructions.
>   * Load Byte and Zero with Update (lbzu)
>   * Load Byte and Zero with Update Indexed (lbzux)
>   * Load Halfword and Zero with Update (lhzu)
>   * Load Halfword and Zero with Update Indexed (lhzux)
>   * Load Halfword Algebraic with Update (lhau)
>   * Load Halfword Algebraic with Update Indexed (lhaux)
>   * Load Word and Zero with Update (lwzu)
>   * Load Word and Zero with Update Indexed (lwzux)
>   * Load Word Algebraic with Update Indexed (lwaux)
>   * Load Doubleword with Update (ldu)
>   * Load Doubleword with Update Indexed (ldux)
>   * Load Floating Single with Update (lfsu)
>   * Load Floating Single with Update Indexed (lfsux)
>   * Load Floating Double with Update (lfdu)
>   * Load Floating Double with Update Indexed (lfdux)
>   * Store Byte with Update (stbu)
>   * Store Byte with Update Indexed (stbux)
>   * Store Halfword with Update (sthu)
>   * Store Halfword with Update Indexed (sthux)
>   * Store Word with Update (stwu)
>   * Store Word with Update Indexed (stwux)
>   * Store Doubleword with Update (stdu)
>   * Store Doubleword with Update Indexed (stdux)
>   * Store Floating Single with Update (stfsu)
>   * Store Floating Single with Update Indexed (stfsux)
>   * Store Floating Double with Update (stfdu)
>   * Store Floating Double with Update Indexed (stfdux)
> 
> E.g. the following behaviour is observed for an invalid
> load and update instruction having RA = RT.
> 
> While a userspace program having an instruction word like
> 0xe9ce0001, i.e. ldu r14, 0(r14), runs without getting
> receiving a SIGILL on a Power system (observed on P8 and
> P9), the outcome of executing that instruction word varies
> and its behaviour can be considered to be undefined.
> 
> Attaching an uprobe at that instruction's address results
> in emulation which currently performs the load as well as
> writes the effective address back to the base register.
> This might not match the outcome from hardware.
> 
> To remove any inconsistencies, this adds additional checks
> for the aforementioned instructions to make sure that the
> emulation infrastructure treats them as unknown. The kernel
> can then fallback to executing such instructions on hardware.
> 
> Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in emulate_step()")
> Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
> ---
> Previous versions can be found at:
> v3: https://lore.kernel.org/linuxppc-dev/20210204071432.116439-1-sandipan@linux.ibm.com/
> v2: https://lore.kernel.org/linuxppc-dev/20210203063841.431063-1-sandipan@linux.ibm.com/
> v1: https://lore.kernel.org/linuxppc-dev/20201119054139.244083-1-sandipan@linux.ibm.com/
> 
> Changes in v4:
> - Fixed grammar and switch-case alignment.
> 
> Changes in v3:
> - Consolidated the checks as suggested by Naveen.
> - Consolidated load/store changes into a single patch.
> - Included floating-point load/store and update instructions.
> 
> Changes in v2:
> - Jump to unknown_opcode instead of returning -1 for invalid
>   instruction forms.
> 
> ---
>  arch/powerpc/lib/sstep.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)

For the series:
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>

- Naveen


^ permalink raw reply

* Re: [PATCH v2 1/3] powerpc: sstep: Fix load and update emulation
From: Naveen N. Rao @ 2021-02-04  8:27 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: ravi.bangoria, ananth, jniethe5, paulus, Sandipan Das,
	linuxppc-dev, dja
In-Reply-To: <20210203211732.GD30983@gate.crashing.org>

On 2021/02/03 03:17PM, Segher Boessenkool wrote:
> On Wed, Feb 03, 2021 at 03:19:09PM +0530, Naveen N. Rao wrote:
> > On 2021/02/03 12:08PM, Sandipan Das wrote:
> > > The Power ISA says that the fixed-point load and update
> > > instructions must neither use R0 for the base address (RA)
> > > nor have the destination (RT) and the base address (RA) as
> > > the same register. In these cases, the instruction is
> > > invalid.
> 
> > > However, the following behaviour is observed using some
> > > invalid opcodes where RA = RT.
> > > 
> > > An userspace program using an invalid instruction word like
> > > 0xe9ce0001, i.e. "ldu r14, 0(r14)", runs and exits without
> > > getting terminated abruptly. The instruction performs the
> > > load operation but does not write the effective address to
> > > the base address register. 
> > 
> > While the processor (p8 in my test) doesn't seem to be throwing an 
> > exception, I don't think it is necessarily loading the value. Qemu 
> > throws an exception though. It's probably best to term the behavior as 
> > being undefined.
> 
> Power8 does:
> 
>   Load with Update Instructions (RA = 0)
>     EA is placed into R0.
>   Load with Update Instructions (RA = RT)
>     EA is placed into RT. The storage operand addressed by EA is
>     accessed, but the data returned by the load is discarded.

I'm actually not seeing that. This is what I am testing with:
	li      8,0xaaa
	mr      6,1
	std     8,64(6)
	#ldu    6,64(6)
	.long	0xe8c60041

And, r6 always ends up with 0xaea. It changes with the value I put into 
r6 though.

Granted, this is all up in the air, but it does look like there is more 
going on and the value isn't the EA or the value at the address.

> 
> Power9 does:
> 
>   Load with Update Instructions (RA = 0)
>     EA is placed into R0.
>   Load with Update Instructions (RA = RT)
>     The storage operand addressed by EA is accessed. The displacement
>     field is added to the data returned by the load and placed into RT.
> 
> Both UMs also say
> 
>   Invalid Forms
>     In general, the POWER9 core handles invalid forms of instructions in
>     the manner that is most convenient for the particular case (within
>     the scope of meeting the boundedly-undefined definition described in
>     the Power ISA). This document specifies the behavior for these
>     cases.  However, it is not recommended that software or other system
>     facilities make use of the POWER9 behavior in these cases because
>     such behavior might be different in another processor that
>     implements the Power ISA.
> 
> (or POWER8 instead of POWER9 of course).  Always complaining about most
> invalid forms seems wise, certainly if not all recent CPUs behave the
> same :-)

Agreed.

- Naveen


^ permalink raw reply

* Re: [PATCH v2 0/5] shoot lazy tlbs
From: Nicholas Piggin @ 2021-02-04  8:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-arch, linux-mm, linuxppc-dev, Andy Lutomirski
In-Reply-To: <20201214065312.270062-1-npiggin@gmail.com>

I'll ask Andrew to put this in -mm if no objections.

The series now doesn't touch other archs in non-trivial ways, and core code
is functionally not changed much / at all if the option is not selected so
it's actually pretty simple aside from the powerpc change.

Thanks,
Nick

Excerpts from Nicholas Piggin's message of December 14, 2020 4:53 pm:
> This is another rebase, on top of mainline now (don't need the
> asm-generic tree), and without any x86 or membarrier changes.
> This makes the series far smaller and more manageable and
> without the controversial bits.
> 
> Thanks,
> Nick
> 
> Nicholas Piggin (5):
>   lazy tlb: introduce lazy mm refcount helper functions
>   lazy tlb: allow lazy tlb mm switching to be configurable
>   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>   powerpc: use lazy mm refcount helper functions
>   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
> 
>  arch/Kconfig                         | 30 ++++++++++
>  arch/arm/mach-rpc/ecard.c            |  2 +-
>  arch/powerpc/Kconfig                 |  1 +
>  arch/powerpc/kernel/smp.c            |  2 +-
>  arch/powerpc/mm/book3s64/radix_tlb.c |  4 +-
>  fs/exec.c                            |  4 +-
>  include/linux/sched/mm.h             | 20 +++++++
>  kernel/cpu.c                         |  2 +-
>  kernel/exit.c                        |  2 +-
>  kernel/fork.c                        | 52 ++++++++++++++++
>  kernel/kthread.c                     | 11 ++--
>  kernel/sched/core.c                  | 88 ++++++++++++++++++++--------
>  kernel/sched/sched.h                 |  4 +-
>  13 files changed, 184 insertions(+), 38 deletions(-)
> 
> -- 
> 2.23.0
> 
> 

^ permalink raw reply

* [PATCH v4 2/2] powerpc: sstep: Fix darn emulation
From: Sandipan Das @ 2021-02-04  8:07 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, ananth, jniethe5, paulus, naveen.n.rao,
	linuxppc-dev, dja
In-Reply-To: <20210204080744.135785-1-sandipan@linux.ibm.com>

Commit 8813ff49607e ("powerpc/sstep: Check instruction
validity against ISA version before emulation") introduced
a proper way to skip unknown instructions. This makes sure
that the same is used for the darn instruction when the
range selection bits have a reserved value.

Fixes: a23987ef267a ("powerpc: sstep: Add support for darn instruction")
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
---
 arch/powerpc/lib/sstep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 11f14b209d7f..683f7c20f74b 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1916,7 +1916,7 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 				goto compute_done;
 			}
 
-			return -1;
+			goto unknown_opcode;
 #ifdef __powerpc64__
 		case 777:	/* modsd */
 			if (!cpu_has_feature(CPU_FTR_ARCH_300))
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 1/2] powerpc: sstep: Fix load-store and update emulation
From: Sandipan Das @ 2021-02-04  8:07 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, ananth, jniethe5, paulus, naveen.n.rao,
	linuxppc-dev, dja

The Power ISA says that the fixed-point load and update
instructions must neither use R0 for the base address (RA)
nor have the destination (RT) and the base address (RA) as
the same register. Similarly, for fixed-point stores and
floating-point loads and stores, the instruction is invalid
when R0 is used as the base address (RA).

This is applicable to the following instructions.
  * Load Byte and Zero with Update (lbzu)
  * Load Byte and Zero with Update Indexed (lbzux)
  * Load Halfword and Zero with Update (lhzu)
  * Load Halfword and Zero with Update Indexed (lhzux)
  * Load Halfword Algebraic with Update (lhau)
  * Load Halfword Algebraic with Update Indexed (lhaux)
  * Load Word and Zero with Update (lwzu)
  * Load Word and Zero with Update Indexed (lwzux)
  * Load Word Algebraic with Update Indexed (lwaux)
  * Load Doubleword with Update (ldu)
  * Load Doubleword with Update Indexed (ldux)
  * Load Floating Single with Update (lfsu)
  * Load Floating Single with Update Indexed (lfsux)
  * Load Floating Double with Update (lfdu)
  * Load Floating Double with Update Indexed (lfdux)
  * Store Byte with Update (stbu)
  * Store Byte with Update Indexed (stbux)
  * Store Halfword with Update (sthu)
  * Store Halfword with Update Indexed (sthux)
  * Store Word with Update (stwu)
  * Store Word with Update Indexed (stwux)
  * Store Doubleword with Update (stdu)
  * Store Doubleword with Update Indexed (stdux)
  * Store Floating Single with Update (stfsu)
  * Store Floating Single with Update Indexed (stfsux)
  * Store Floating Double with Update (stfdu)
  * Store Floating Double with Update Indexed (stfdux)

E.g. the following behaviour is observed for an invalid
load and update instruction having RA = RT.

While a userspace program having an instruction word like
0xe9ce0001, i.e. ldu r14, 0(r14), runs without getting
receiving a SIGILL on a Power system (observed on P8 and
P9), the outcome of executing that instruction word varies
and its behaviour can be considered to be undefined.

Attaching an uprobe at that instruction's address results
in emulation which currently performs the load as well as
writes the effective address back to the base register.
This might not match the outcome from hardware.

To remove any inconsistencies, this adds additional checks
for the aforementioned instructions to make sure that the
emulation infrastructure treats them as unknown. The kernel
can then fallback to executing such instructions on hardware.

Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in emulate_step()")
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
---
Previous versions can be found at:
v3: https://lore.kernel.org/linuxppc-dev/20210204071432.116439-1-sandipan@linux.ibm.com/
v2: https://lore.kernel.org/linuxppc-dev/20210203063841.431063-1-sandipan@linux.ibm.com/
v1: https://lore.kernel.org/linuxppc-dev/20201119054139.244083-1-sandipan@linux.ibm.com/

Changes in v4:
- Fixed grammar and switch-case alignment.

Changes in v3:
- Consolidated the checks as suggested by Naveen.
- Consolidated load/store changes into a single patch.
- Included floating-point load/store and update instructions.

Changes in v2:
- Jump to unknown_opcode instead of returning -1 for invalid
  instruction forms.

---
 arch/powerpc/lib/sstep.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index e96cff845ef7..11f14b209d7f 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -3017,6 +3017,20 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,

 	}

+	if (OP_IS_LOAD_STORE(op->type) && (op->type & UPDATE)) {
+		switch (GETTYPE(op->type)) {
+		case LOAD:
+			if (ra == rd)
+				goto unknown_opcode;
+			fallthrough;
+		case STORE:
+		case LOAD_FP:
+		case STORE_FP:
+			if (ra == 0)
+				goto unknown_opcode;
+		}
+	}
+
 #ifdef CONFIG_VSX
 	if ((GETTYPE(op->type) == LOAD_VSX ||
 	     GETTYPE(op->type) == STORE_VSX) &&
-- 
2.25.1

^ permalink raw reply related

* Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
From: Christophe Leroy @ 2021-02-04  8:03 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Michal Suchanek
In-Reply-To: <1612409077.fadt3kvld9.astroid@bobo.none>



Le 04/02/2021 à 04:27, Nicholas Piggin a écrit :
> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
>>
>>
>> Le 25/02/2020 à 18:35, Nicholas Piggin a écrit :
>>> Implement the bulk of interrupt return logic in C. The asm return code
>>> must handle a few cases: restoring full GPRs, and emulating stack store.
>>>
>>
>>
>>> +notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, unsigned long msr)
>>> +{
>>> +	unsigned long *ti_flagsp = &current_thread_info()->flags;
>>> +	unsigned long flags;
>>> +
>>> +	if (IS_ENABLED(CONFIG_PPC_BOOK3S) && unlikely(!(regs->msr & MSR_RI)))
>>> +		unrecoverable_exception(regs);
>>> +	BUG_ON(regs->msr & MSR_PR);
>>> +	BUG_ON(!FULL_REGS(regs));
>>> +
>>> +	local_irq_save(flags);
>>> +
>>> +	if (regs->softe == IRQS_ENABLED) {
>>> +		/* Returning to a kernel context with local irqs enabled. */
>>> +		WARN_ON_ONCE(!(regs->msr & MSR_EE));
>>> +again:
>>> +		if (IS_ENABLED(CONFIG_PREEMPT)) {
>>> +			/* Return to preemptible kernel context */
>>> +			if (unlikely(*ti_flagsp & _TIF_NEED_RESCHED)) {
>>> +				if (preempt_count() == 0)
>>> +					preempt_schedule_irq();
>>> +			}
>>> +		}
>>> +
>>> +		trace_hardirqs_on();
>>> +		__hard_EE_RI_disable();
>>> +		if (unlikely(lazy_irq_pending())) {
>>> +			__hard_RI_enable();
>>> +			irq_soft_mask_set(IRQS_ALL_DISABLED);
>>> +			trace_hardirqs_off();
>>> +			local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
>>> +			/*
>>> +			 * Can't local_irq_enable in case we are in interrupt
>>> +			 * context. Must replay directly.
>>> +			 */
>>> +			replay_soft_interrupts();
>>> +			irq_soft_mask_set(flags);
>>> +			/* Took an interrupt, may have more exit work to do. */
>>> +			goto again;
>>> +		}
>>> +		local_paca->irq_happened = 0;
>>> +		irq_soft_mask_set(IRQS_ENABLED);
>>> +	} else {
>>> +		/* Returning to a kernel context with local irqs disabled. */
>>> +		trace_hardirqs_on();
>>> +		__hard_EE_RI_disable();
>>> +		if (regs->msr & MSR_EE)
>>> +			local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
>>> +	}
>>> +
>>> +
>>> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>>> +	local_paca->tm_scratch = regs->msr;
>>> +#endif
>>> +
>>> +	/*
>>> +	 * We don't need to restore AMR on the way back to userspace for KUAP.
>>> +	 * The value of AMR only matters while we're in the kernel.
>>> +	 */
>>> +	kuap_restore_amr(regs);
>>
>> Is that correct to restore KUAP state here ? Shouldn't we have it at lower level in assembly ?
>>
>> Isn't there a risk that someone manages to call interrupt_exit_kernel_prepare() or the end of it in
>> a way or another, and get the previous KUAP state restored by this way ?
> 
> I'm not sure if there much more risk if it's here rather than the
> instruction being in another place in the code.
> 
> There's a lot of user access around the kernel too if you want to find a
> gadget to unlock KUAP then I suppose there is a pretty large attack
> surface.

My understanding is that user access scope is strictly limited, for instance we enforce the 
begin/end of user access to be in the same function, and we refrain from calling any other function 
inside the user access window. x86 even have 'objtool' to enforce it at build time. So in theory 
there is no way to get out of the function while user access is open.

Here with the interrupt exit function it is free beer. You have a place where you re-open user 
access and return with a simple blr. So that's open bar. If someone manages to just call the 
interrupt exit function, then user access remains open

> 
>> Also, it looks a bit strange to have kuap_save_amr_and_lock() done at lowest level in assembly, and
>> kuap_restore_amr() done in upper level. That looks unbalanced.
> 
> I'd like to bring the entry assembly into C.
> 

I really think it's not a good idea. We'll get better control and readability by keeping it at the 
lowest possible level in assembly.

x86 only save and restore SMAC state in assembly.

Christophe

^ permalink raw reply

* Re: [PATCH v3 1/2] powerpc: sstep: Fix load-store and update emulation
From: Sandipan Das @ 2021-02-04  7:59 UTC (permalink / raw)
  To: Naveen N. Rao; +Cc: ravi.bangoria, ananth, jniethe5, paulus, linuxppc-dev, dja
In-Reply-To: <20210204073954.GH210@DESKTOP-TDPLP67.localdomain>

On 04/02/21 1:09 pm, Naveen N. Rao wrote:
> [...]
> 
> I'm afraid there is one more thing. scripts/checkpatch.pl reports:
> 
> WARNING: 'an userspace' may be misspelled - perhaps 'a userspace'?
> #52:
> While an userspace program having an instruction word like
>       ^^^^^^^^^^^^
> 
> ERROR: switch and case should be at the same indent
> #96: FILE: arch/powerpc/lib/sstep.c:3021:
> +               switch (GETTYPE(op->type)) {
> +                       case LOAD:
> [...]
> +                       case STORE:
> +                       case LOAD_FP:
> +                       case STORE_FP:
> 
> 

Yikes! Thanks for pointing that out. Sending v4.

- Sandipan

^ permalink raw reply

* Re: [PATCH v3 1/2] powerpc: sstep: Fix load-store and update emulation
From: Naveen N. Rao @ 2021-02-04  7:39 UTC (permalink / raw)
  To: Sandipan Das; +Cc: ravi.bangoria, ananth, jniethe5, paulus, linuxppc-dev, dja
In-Reply-To: <20210204071432.116439-1-sandipan@linux.ibm.com>

On 2021/02/04 12:44PM, Sandipan Das wrote:
> The Power ISA says that the fixed-point load and update
> instructions must neither use R0 for the base address (RA)
> nor have the destination (RT) and the base address (RA) as
> the same register. Similarly, for fixed-point stores and
> floating-point loads and stores, the instruction is invalid
> when R0 is used as the base address (RA).
> 
> This is applicable to the following instructions.
>   * Load Byte and Zero with Update (lbzu)
>   * Load Byte and Zero with Update Indexed (lbzux)
>   * Load Halfword and Zero with Update (lhzu)
>   * Load Halfword and Zero with Update Indexed (lhzux)
>   * Load Halfword Algebraic with Update (lhau)
>   * Load Halfword Algebraic with Update Indexed (lhaux)
>   * Load Word and Zero with Update (lwzu)
>   * Load Word and Zero with Update Indexed (lwzux)
>   * Load Word Algebraic with Update Indexed (lwaux)
>   * Load Doubleword with Update (ldu)
>   * Load Doubleword with Update Indexed (ldux)
>   * Load Floating Single with Update (lfsu)
>   * Load Floating Single with Update Indexed (lfsux)
>   * Load Floating Double with Update (lfdu)
>   * Load Floating Double with Update Indexed (lfdux)
>   * Store Byte with Update (stbu)
>   * Store Byte with Update Indexed (stbux)
>   * Store Halfword with Update (sthu)
>   * Store Halfword with Update Indexed (sthux)
>   * Store Word with Update (stwu)
>   * Store Word with Update Indexed (stwux)
>   * Store Doubleword with Update (stdu)
>   * Store Doubleword with Update Indexed (stdux)
>   * Store Floating Single with Update (stfsu)
>   * Store Floating Single with Update Indexed (stfsux)
>   * Store Floating Double with Update (stfdu)
>   * Store Floating Double with Update Indexed (stfdux)
> 
> E.g. the following behaviour is observed for an invalid
> load and update instruction having RA = RT.
> 
> While an userspace program having an instruction word like
> 0xe9ce0001, i.e. ldu r14, 0(r14), runs without getting
> receiving a SIGILL on a Power system (observed on P8 and
> P9), the outcome of executing that instruction word varies
> and its behaviour can be considered to be undefined.
> 
> Attaching an uprobe at that instruction's address results
> in emulation which currently performs the load as well as
> writes the effective address back to the base register.
> This might not match the outcome from hardware.
> 
> To remove any inconsistencies, this adds additional checks
> for the aforementioned instructions to make sure that the
> emulation infrastructure treats them as unknown. The kernel
> can then fallback to executing such instructions on hardware.
> 
> Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in emulate_step()")
> Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
> ---
> Previous versions can be found at:
> v2: https://lore.kernel.org/linuxppc-dev/20210203063841.431063-1-sandipan@linux.ibm.com/
> v1: https://lore.kernel.org/linuxppc-dev/20201119054139.244083-1-sandipan@linux.ibm.com/
> 
> Changes in v3:
> - Dropped CONFIG_PPC_FPU check as suggested by Michael.
> - Consolidated the checks as suggested by Naveen.
> 
> Changes in v2:
> - Jump to unknown_opcode instead of returning -1 for invalid
>   instruction forms.
> 
> ---
>  arch/powerpc/lib/sstep.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index e96cff845ef7..9138967eb82e 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -3017,6 +3017,20 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
> 
>  	}
> 
> +	if (OP_IS_LOAD_STORE(op->type) && (op->type & UPDATE)) {
> +		switch (GETTYPE(op->type)) {
> +			case LOAD:
> +				if (ra == rd)
> +					goto unknown_opcode;
> +				fallthrough;
> +			case STORE:
> +			case LOAD_FP:
> +			case STORE_FP:
> +				if (ra == 0)
> +					goto unknown_opcode;
> +		}
> +	}
> +
>  #ifdef CONFIG_VSX
>  	if ((GETTYPE(op->type) == LOAD_VSX ||
>  	     GETTYPE(op->type) == STORE_VSX) &&

I'm afraid there is one more thing. scripts/checkpatch.pl reports:

WARNING: 'an userspace' may be misspelled - perhaps 'a userspace'?
#52:
While an userspace program having an instruction word like
      ^^^^^^^^^^^^

ERROR: switch and case should be at the same indent
#96: FILE: arch/powerpc/lib/sstep.c:3021:
+               switch (GETTYPE(op->type)) {
+                       case LOAD:
[...]
+                       case STORE:
+                       case LOAD_FP:
+                       case STORE_FP:


- Naveen

^ permalink raw reply

* Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays
From: Christoph Hellwig @ 2021-02-04  7:29 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: ulf.hansson, airlied, joonas.lahtinen, dri-devel, linux-kernel,
	bhelgaas, paulus, hpa, hch, m.szyprowski, sstabellini,
	adrian.hunter, x86, joe.jin, mingo, peterz, mingo, bskeggs,
	linux-pci, xen-devel, matthew.auld, thomas.lendacky, konrad.wilk,
	intel-gfx, jani.nikula, bp, rodrigo.vivi, nouveau, Claire Chang,
	boris.ostrovsky, chris, jgross, tsbogend, robin.murphy, linux-mmc,
	linux-mips, iommu, tglx, bauerman, daniel, akpm, linuxppc-dev,
	rppt
In-Reply-To: <20210203233709.19819-3-dongli.zhang@oracle.com>

On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote:
> This patch converts several swiotlb related variables to arrays, in
> order to maintain stat/status for different swiotlb buffers. Here are
> variables involved:
> 
> - io_tlb_start and io_tlb_end
> - io_tlb_nslabs and io_tlb_used
> - io_tlb_list
> - io_tlb_index
> - max_segment
> - io_tlb_orig_addr
> - no_iotlb_memory
> 
> There is no functional change and this is to prepare to enable 64-bit
> swiotlb.

Claire Chang (on Cc) already posted a patch like this a month ago,
which looks much better because it actually uses a struct instead
of all the random variables. 

^ permalink raw reply

* Re: [PATCH v3 1/2] powerpc: sstep: Fix load-store and update emulation
From: Sandipan Das @ 2021-02-04  7:23 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, ananth, jniethe5, paulus, naveen.n.rao,
	linuxppc-dev, dja
In-Reply-To: <20210204071432.116439-1-sandipan@linux.ibm.com>


On 04/02/21 12:44 pm, Sandipan Das wrote:
> The Power ISA says that the fixed-point load and update
> instructions must neither use R0 for the base address (RA)
> nor have the destination (RT) and the base address (RA) as
> the same register. Similarly, for fixed-point stores and
> floating-point loads and stores, the instruction is invalid
> when R0 is used as the base address (RA).
> 
> This is applicable to the following instructions.
>   * Load Byte and Zero with Update (lbzu)
>   * Load Byte and Zero with Update Indexed (lbzux)
>   * Load Halfword and Zero with Update (lhzu)
>   * Load Halfword and Zero with Update Indexed (lhzux)
>   * Load Halfword Algebraic with Update (lhau)
>   * Load Halfword Algebraic with Update Indexed (lhaux)
>   * Load Word and Zero with Update (lwzu)
>   * Load Word and Zero with Update Indexed (lwzux)
>   * Load Word Algebraic with Update Indexed (lwaux)
>   * Load Doubleword with Update (ldu)
>   * Load Doubleword with Update Indexed (ldux)
>   * Load Floating Single with Update (lfsu)
>   * Load Floating Single with Update Indexed (lfsux)
>   * Load Floating Double with Update (lfdu)
>   * Load Floating Double with Update Indexed (lfdux)
>   * Store Byte with Update (stbu)
>   * Store Byte with Update Indexed (stbux)
>   * Store Halfword with Update (sthu)
>   * Store Halfword with Update Indexed (sthux)
>   * Store Word with Update (stwu)
>   * Store Word with Update Indexed (stwux)
>   * Store Doubleword with Update (stdu)
>   * Store Doubleword with Update Indexed (stdux)
>   * Store Floating Single with Update (stfsu)
>   * Store Floating Single with Update Indexed (stfsux)
>   * Store Floating Double with Update (stfdu)
>   * Store Floating Double with Update Indexed (stfdux)
> 
> E.g. the following behaviour is observed for an invalid
> load and update instruction having RA = RT.
> 
> While an userspace program having an instruction word like
> 0xe9ce0001, i.e. ldu r14, 0(r14), runs without getting
> receiving a SIGILL on a Power system (observed on P8 and
> P9), the outcome of executing that instruction word varies
> and its behaviour can be considered to be undefined.
> 
> Attaching an uprobe at that instruction's address results
> in emulation which currently performs the load as well as
> writes the effective address back to the base register.
> This might not match the outcome from hardware.
> 
> To remove any inconsistencies, this adds additional checks
> for the aforementioned instructions to make sure that the
> emulation infrastructure treats them as unknown. The kernel
> can then fallback to executing such instructions on hardware.
> 
> Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in emulate_step()")
> Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
> ---
> Previous versions can be found at:
> v2: https://lore.kernel.org/linuxppc-dev/20210203063841.431063-1-sandipan@linux.ibm.com/
> v1: https://lore.kernel.org/linuxppc-dev/20201119054139.244083-1-sandipan@linux.ibm.com/
> 
> Changes in v3:
> - Dropped CONFIG_PPC_FPU check as suggested by Michael.
> - Consolidated the checks as suggested by Naveen.

Sorry. Missed these in the changelog:
- Consolidated load/store changes into a single patch.
- Included floating-point load/store and update instructions.


- Sandipan

^ permalink raw reply

* [PATCH v3 2/2] powerpc: sstep: Fix darn emulation
From: Sandipan Das @ 2021-02-04  7:14 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, ananth, jniethe5, paulus, naveen.n.rao,
	linuxppc-dev, dja
In-Reply-To: <20210204071432.116439-1-sandipan@linux.ibm.com>

Commit 8813ff49607e ("powerpc/sstep: Check instruction
validity against ISA version before emulation") introduced
a proper way to skip unknown instructions. This makes sure
that the same is used for the darn instruction when the
range selection bits have a reserved value.

Fixes: a23987ef267a ("powerpc: sstep: Add support for darn instruction")
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
---
 arch/powerpc/lib/sstep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 9138967eb82e..e7e8bc974c0f 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1916,7 +1916,7 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 				goto compute_done;
 			}
 
-			return -1;
+			goto unknown_opcode;
 #ifdef __powerpc64__
 		case 777:	/* modsd */
 			if (!cpu_has_feature(CPU_FTR_ARCH_300))
-- 
2.25.1


^ permalink raw reply related

* [PATCH v3 1/2] powerpc: sstep: Fix load-store and update emulation
From: Sandipan Das @ 2021-02-04  7:14 UTC (permalink / raw)
  To: mpe
  Cc: ravi.bangoria, ananth, jniethe5, paulus, naveen.n.rao,
	linuxppc-dev, dja

The Power ISA says that the fixed-point load and update
instructions must neither use R0 for the base address (RA)
nor have the destination (RT) and the base address (RA) as
the same register. Similarly, for fixed-point stores and
floating-point loads and stores, the instruction is invalid
when R0 is used as the base address (RA).

This is applicable to the following instructions.
  * Load Byte and Zero with Update (lbzu)
  * Load Byte and Zero with Update Indexed (lbzux)
  * Load Halfword and Zero with Update (lhzu)
  * Load Halfword and Zero with Update Indexed (lhzux)
  * Load Halfword Algebraic with Update (lhau)
  * Load Halfword Algebraic with Update Indexed (lhaux)
  * Load Word and Zero with Update (lwzu)
  * Load Word and Zero with Update Indexed (lwzux)
  * Load Word Algebraic with Update Indexed (lwaux)
  * Load Doubleword with Update (ldu)
  * Load Doubleword with Update Indexed (ldux)
  * Load Floating Single with Update (lfsu)
  * Load Floating Single with Update Indexed (lfsux)
  * Load Floating Double with Update (lfdu)
  * Load Floating Double with Update Indexed (lfdux)
  * Store Byte with Update (stbu)
  * Store Byte with Update Indexed (stbux)
  * Store Halfword with Update (sthu)
  * Store Halfword with Update Indexed (sthux)
  * Store Word with Update (stwu)
  * Store Word with Update Indexed (stwux)
  * Store Doubleword with Update (stdu)
  * Store Doubleword with Update Indexed (stdux)
  * Store Floating Single with Update (stfsu)
  * Store Floating Single with Update Indexed (stfsux)
  * Store Floating Double with Update (stfdu)
  * Store Floating Double with Update Indexed (stfdux)

E.g. the following behaviour is observed for an invalid
load and update instruction having RA = RT.

While an userspace program having an instruction word like
0xe9ce0001, i.e. ldu r14, 0(r14), runs without getting
receiving a SIGILL on a Power system (observed on P8 and
P9), the outcome of executing that instruction word varies
and its behaviour can be considered to be undefined.

Attaching an uprobe at that instruction's address results
in emulation which currently performs the load as well as
writes the effective address back to the base register.
This might not match the outcome from hardware.

To remove any inconsistencies, this adds additional checks
for the aforementioned instructions to make sure that the
emulation infrastructure treats them as unknown. The kernel
can then fallback to executing such instructions on hardware.

Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in emulate_step()")
Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
---
Previous versions can be found at:
v2: https://lore.kernel.org/linuxppc-dev/20210203063841.431063-1-sandipan@linux.ibm.com/
v1: https://lore.kernel.org/linuxppc-dev/20201119054139.244083-1-sandipan@linux.ibm.com/

Changes in v3:
- Dropped CONFIG_PPC_FPU check as suggested by Michael.
- Consolidated the checks as suggested by Naveen.

Changes in v2:
- Jump to unknown_opcode instead of returning -1 for invalid
  instruction forms.

---
 arch/powerpc/lib/sstep.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index e96cff845ef7..9138967eb82e 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -3017,6 +3017,20 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,

 	}

+	if (OP_IS_LOAD_STORE(op->type) && (op->type & UPDATE)) {
+		switch (GETTYPE(op->type)) {
+			case LOAD:
+				if (ra == rd)
+					goto unknown_opcode;
+				fallthrough;
+			case STORE:
+			case LOAD_FP:
+			case STORE_FP:
+				if (ra == 0)
+					goto unknown_opcode;
+		}
+	}
+
 #ifdef CONFIG_VSX
 	if ((GETTYPE(op->type) == LOAD_VSX ||
 	     GETTYPE(op->type) == STORE_VSX) &&
-- 
2.25.1

^ permalink raw reply related

* Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs
From: Christopher M. Riedl @ 2021-02-04  6:59 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev, Michael Ellerman
In-Reply-To: <1612014056.e1qcnzac7c.astroid@bobo.none>

On Sat Jan 30, 2021 at 7:44 AM CST, Nicholas Piggin wrote:
> Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm:
> > "Christopher M. Riedl" <cmr@codefail.de> writes:
> >> The idle entry/exit code saves/restores GPRs in the stack "red zone"
> >> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset
> >> used for the first GPR is incorrect and overwrites the back chain - the
> >> Protected Zone actually starts below the current SP. In practice this is
> >> probably not an issue, but it's still incorrect so fix it.
> > 
> > Nice catch.
> > 
> > Corrupting the back chain means you can't backtrace from there, which
> > could be confusing for debugging one day.
>
> Yeah, we seem to have got away without noticing because the CPU will
> wake up and return out of here before it tries to unwind the stack,
> but if you tried to walk it by hand if the CPU got stuck in idle or
> something, then we'd get confused.
>
> > It does make me wonder why we don't just create a stack frame and use
> > the normal macros? It would use a bit more stack space, but we shouldn't
> > be short of stack space when going idle.
> > 
> > Nick, was there a particular reason for using the red zone?
>
> I don't recall a particular reason, I think a normal stack frame is
> probably a good idea.

I'll send a version using STACKFRAMESIZE - I assume that's the "normal"
stack frame :)

I admit I am a bit confused when I saw the similar but much smaller
STACK_FRAME_OVERHEAD which is also used in _some_ cases to save/restore
a few registers.

>
> Thanks,
> Nick


^ permalink raw reply

* Re: [PATCH v2] powerpc/64s: Fix pte update for kernel memory on radix
From: Naveen N. Rao @ 2021-02-04  6:41 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev, npiggin, cmr
In-Reply-To: <20210203235907.961243-1-jniethe5@gmail.com>

Hi Jordan,

On 2021/02/04 10:59AM, Jordan Niethe wrote:
> When adding a pte a ptesync is needed to order the update of the pte
> with subsequent accesses otherwise a spurious fault may be raised.
> 
> radix__set_pte_at() does not do this for performance gains. For
> non-kernel memory this is not an issue as any faults of this kind are
> corrected by the page fault handler.  For kernel memory these faults are
> not handled.  The current solution is that there is a ptesync in
> flush_cache_vmap() which should be called when mapping from the vmalloc
> region.
> 
> However, map_kernel_page() does not call flush_cache_vmap(). This is
> troublesome in particular for code patching with Strict RWX on radix. In
> do_patch_instruction() the page frame that contains the instruction to
> be patched is mapped and then immediately patched. With no ordering or
> synchronization between setting up the pte and writing to the page it is
> possible for faults.
> 
> As the code patching is done using __put_user_asm_goto() the resulting
> fault is obscured - but using a normal store instead it can be seen:
> 
> [  418.498768][  T757] BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
> [  418.498790][  T757] Faulting instruction address: 0xc00000000008bd74
> [  418.498805][  T757] Oops: Kernel access of bad area, sig: 11 [#1]
> [  418.498828][  T757] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
> [  418.498843][  T757] Modules linked in: nop_module(PO+) [last unloaded: nop_module]
> [  418.498872][  T757] CPU: 4 PID: 757 Comm: sh Tainted: P           O      5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
> [  418.498936][  T757] NIP:  c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
> [  418.498979][  T757] REGS: c000000016f634a0 TRAP: 0300   Tainted: P           O       (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
> [  418.499033][  T757] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44002884  XER: 00000000
> [  418.499084][  T757] CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1
> 
> This results in the kind of issue reported here:
> https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/
> 

Thanks for fixing this!

- Naveen


^ permalink raw reply

* Re: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
From: Michael Ellerman @ 2021-02-04  6:09 UTC (permalink / raw)
  To: Christopher M. Riedl, Gabriel Paubert
  Cc: David Laight, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <C8YHSKQ99VC4.M9Y0WOFVUTBQ@geist>

"Christopher M. Riedl" <cmr@codefail.de> writes:
> On Mon Feb 1, 2021 at 10:54 AM CST, Gabriel Paubert wrote:
>> On Mon, Feb 01, 2021 at 09:55:44AM -0600, Christopher M. Riedl wrote:
>> > On Thu Jan 28, 2021 at 4:38 AM CST, David Laight wrote:
>> > > From: Christopher M. Riedl
>> > > > Sent: 28 January 2021 04:04
>> > > > 
>> > > > Reuse the "safe" implementation from signal.c except for calling
>> > > > unsafe_copy_from_user() to copy into a local buffer.
>> > > > 
>> > > > Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
>> > > > ---
>> > > >  arch/powerpc/kernel/signal.h | 33 +++++++++++++++++++++++++++++++++
>> > > >  1 file changed, 33 insertions(+)
>> > > > 
>> > > > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
>> > > > index 2559a681536e..c18402d625f1 100644
>> > > > --- a/arch/powerpc/kernel/signal.h
>> > > > +++ b/arch/powerpc/kernel/signal.h
>> > > > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct *task, void __user *from);
>> > > >  				&buf[i], label);\
>> > > >  } while (0)
>> > > > 
>> > > > +#define unsafe_copy_fpr_from_user(task, from, label)	do {		\
>> > > > +	struct task_struct *__t = task;					\
>> > > > +	u64 __user *__f = (u64 __user *)from;				\
>> > > > +	u64 buf[ELF_NFPREG];						\
>> > >
>> > > How big is that buffer?
>> > > Isn't is likely to be reasonably large compared to a reasonable
>> > > kernel stack frame.
>> > > Especially since this isn't even a leaf function.
>> > >
>> > 
>> > I think Christophe answered this - I don't really have an opinion either
>> > way. What would be a 'reasonable' kernel stack frame for reference?
>>
>> See include/linux/poll.h, where the limit is of the order of 800 bytes
>> and the number of entries in an on stack array is chosen at compile time
>> (different between 32 and 64 bit for example).
>>
>> The values are used in do_sys_poll, which, with almost 1000 bytes of
>> stack footprint, appears close to the top of "make checkstack". In
>> addition do_sys_poll has to call the ->poll function of every file
>> descriptor in its table, so it is not a tail function.
>>
>> This 264 bytes array looks reasonable, but please use 'make checkstack'
>> to verify that the function's total stack usage stays within reason.
>
> Neat, looks like total usage is a bit larger but still reasonable and
> less than half of 800B:
>
> 0xc000000000017e900 __unsafe_restore_sigcontext.constprop.0 [vmlinux]:352

We warn for frames larger than 2KB on 64-bit, see FRAME_WARN in
lib/Kconfig.debug.

So 264 bytes is entirely reasonable IMHO.

cheers


^ permalink raw reply

* Re: [PATCH v2] powerpc/64s: Fix pte update for kernel memory on radix
From: Jordan Niethe @ 2021-02-04  4:45 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev, cmr
In-Reply-To: <1612409379.adcakadx0b.astroid@bobo.none>

On Thu, Feb 4, 2021 at 2:31 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Excerpts from Jordan Niethe's message of February 4, 2021 9:59 am:
> > When adding a pte a ptesync is needed to order the update of the pte
> > with subsequent accesses otherwise a spurious fault may be raised.
> >
> > radix__set_pte_at() does not do this for performance gains. For
> > non-kernel memory this is not an issue as any faults of this kind are
> > corrected by the page fault handler.  For kernel memory these faults are
> > not handled.  The current solution is that there is a ptesync in
> > flush_cache_vmap() which should be called when mapping from the vmalloc
> > region.
> >
> > However, map_kernel_page() does not call flush_cache_vmap(). This is
> > troublesome in particular for code patching with Strict RWX on radix. In
> > do_patch_instruction() the page frame that contains the instruction to
> > be patched is mapped and then immediately patched. With no ordering or
> > synchronization between setting up the pte and writing to the page it is
> > possible for faults.
> >
> > As the code patching is done using __put_user_asm_goto() the resulting
> > fault is obscured - but using a normal store instead it can be seen:
> >
> > [  418.498768][  T757] BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
> > [  418.498790][  T757] Faulting instruction address: 0xc00000000008bd74
> > [  418.498805][  T757] Oops: Kernel access of bad area, sig: 11 [#1]
> > [  418.498828][  T757] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
> > [  418.498843][  T757] Modules linked in: nop_module(PO+) [last unloaded: nop_module]
> > [  418.498872][  T757] CPU: 4 PID: 757 Comm: sh Tainted: P           O      5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
> > [  418.498936][  T757] NIP:  c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
> > [  418.498979][  T757] REGS: c000000016f634a0 TRAP: 0300   Tainted: P           O       (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
> > [  418.499033][  T757] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44002884  XER: 00000000
> > [  418.499084][  T757] CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1
> >
> > This results in the kind of issue reported here:
> > https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/
> >
> > Chris Riedl suggested a reliable way to reproduce the issue:
> > $ mount -t debugfs none /sys/kernel/debug
> > $ (while true; do echo function > /sys/kernel/debug/tracing/current_tracer ; echo nop > /sys/kernel/debug/tracing/current_tracer ; done)&
> >
> > Turning ftrace on and off does a large amount of code patching which in
> > usually less then 5min will crash giving a trace like:
> >
> > [  146.668988][  T809] ftrace-powerpc: (____ptrval____): replaced (4b473b11) != old (60000000)
> > [  146.668995][  T809] ------------[ ftrace bug ]------------
> > [  146.669031][  T809] ftrace failed to modify
> > [  146.669039][  T809] [<c000000000bf8e5c>] napi_busy_loop+0xc/0x390
> > [  146.669045][  T809]  actual:   11:3b:47:4b
> > [  146.669070][  T809] Setting ftrace call site to call ftrace function
> > [  146.669075][  T809] ftrace record flags: 80000001
> > [  146.669081][  T809]  (1)
> > [  146.669081][  T809]  expected tramp: c00000000006c96c
> > [  146.669096][  T809] ------------[ cut here ]------------
> > [  146.669104][  T809] WARNING: CPU: 4 PID: 809 at kernel/trace/ftrace.c:2065 ftrace_bug+0x28c/0x2e8
> > [  146.669109][  T809] Modules linked in: nop_module(PO-) [last unloaded: nop_module]
> > [  146.669130][  T809] CPU: 4 PID: 809 Comm: sh Tainted: P           O      5.10.0-rc5-01360-gf878ccaf250a #1
> > [  146.669136][  T809] NIP:  c00000000024f334 LR: c00000000024f330 CTR: c0000000001a5af0
> > [  146.669142][  T809] REGS: c000000004c8b760 TRAP: 0700   Tainted: P           O       (5.10.0-rc5-01360-gf878ccaf250a)
> > [  146.669147][  T809] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28008848  XER: 20040000
> > [  146.669208][  T809] CFAR: c0000000001a9c98 IRQMASK: 0
> > [  146.669208][  T809] GPR00: c00000000024f330 c000000004c8b9f0 c000000002770600 0000000000000022
> > [  146.669208][  T809] GPR04: 00000000ffff7fff c000000004c8b6d0 0000000000000027 c0000007fe9bcdd8
> > [  146.669208][  T809] GPR08: 0000000000000023 ffffffffffffffd8 0000000000000027 c000000002613118
> > [  146.669208][  T809] GPR12: 0000000000008000 c0000007fffdca00 0000000000000000 0000000000000000
> > [  146.669208][  T809] GPR16: 0000000023ec37c5 0000000000000000 0000000000000000 0000000000000008
> > [  146.669208][  T809] GPR20: c000000004c8bc90 c0000000027a2d20 c000000004c8bcd0 c000000002612fe8
> > [  146.669208][  T809] GPR24: 0000000000000038 0000000000000030 0000000000000028 0000000000000020
> > [  146.669208][  T809] GPR28: c000000000ff1b68 c000000000bf8e5c c00000000312f700 c000000000fbb9b0
> > [  146.669384][  T809] NIP [c00000000024f334] ftrace_bug+0x28c/0x2e8
> > [  146.669391][  T809] LR [c00000000024f330] ftrace_bug+0x288/0x2e8
> > [  146.669396][  T809] Call Trace:
> > [  146.669403][  T809] [c000000004c8b9f0] [c00000000024f330] ftrace_bug+0x288/0x2e8 (unreliable)
> > [  146.669418][  T809] [c000000004c8ba80] [c000000000248778] ftrace_modify_all_code+0x168/0x210
> > [  146.669429][  T809] [c000000004c8bab0] [c00000000006c528] arch_ftrace_update_code+0x18/0x30
> > [  146.669440][  T809] [c000000004c8bad0] [c000000000248954] ftrace_run_update_code+0x44/0xc0
> > [  146.669451][  T809] [c000000004c8bb00] [c00000000024dc88] ftrace_startup+0xf8/0x1c0
> > [  146.669461][  T809] [c000000004c8bb40] [c00000000024dd9c] register_ftrace_function+0x4c/0xc0
> > [  146.669472][  T809] [c000000004c8bb70] [c00000000026e750] function_trace_init+0x80/0xb0
> > [  146.669484][  T809] [c000000004c8bba0] [c000000000266b84] tracing_set_tracer+0x2a4/0x4f0
> > [  146.669495][  T809] [c000000004c8bc70] [c000000000266ea4] tracing_set_trace_write+0xd4/0x130
> > [  146.669506][  T809] [c000000004c8bd20] [c000000000422790] vfs_write+0xf0/0x330
> > [  146.669518][  T809] [c000000004c8bd70] [c000000000422bb4] ksys_write+0x84/0x140
> > [  146.669529][  T809] [c000000004c8bdc0] [c00000000003499c] system_call_exception+0x14c/0x230
> > [  146.669540][  T809] [c000000004c8be20] [c00000000000d860] system_call_common+0xf0/0x27c
> > [  146.669549][  T809] Instruction dump:
> > [  146.669558][  T809] 48000014 3c62fe88 38631718 4bf5a941 60000000 7fc3f378 4bff877d 7c641b78
> > [  146.669598][  T809] 3c62fe88 38631730 4bf5a925 60000000 <0fe00000> 38210090 3d22fd90 39000001
> > [  146.669638][  T809] ---[ end trace 5ea7076ea28c0fbd ]---
> >
> > To fix this when updating kernel memory ptes using ptesync.
> >
> > Fixes: 37bc3e5fd764 ("powerpc/lib/code-patching: Use alternate map for patch_instruction()")
> > Fixes: f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
> > Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
>
> Good catch. I would say it just fixes the latter patch doesn't it? The
> previous one just happens to break because it's using the broken API?
Yes that is true. I will send another revision that removes that from
the 'Fixes' and also add a self test.
>
> Anyhow,
>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>
> > ---
> > v2: Only ptesync is needed
> > ---
> >  arch/powerpc/include/asm/book3s/64/radix.h | 6 ++++--
> >  arch/powerpc/mm/book3s64/radix_pgtable.c   | 4 ++--
> >  2 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> > index c7813dc628fc..59cab558e2f0 100644
> > --- a/arch/powerpc/include/asm/book3s/64/radix.h
> > +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> > @@ -222,8 +222,10 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
> >        * from ptesync, it should probably go into update_mmu_cache, rather
> >        * than set_pte_at (which is used to set ptes unrelated to faults).
> >        *
> > -      * Spurious faults to vmalloc region are not tolerated, so there is
> > -      * a ptesync in flush_cache_vmap.
> > +      * Spurious faults from the kernel memory are not tolerated, so there
> > +      * is a ptesync in flush_cache_vmap, and __map_kernel_page() follows
> > +      * the pte update sequence from ISA Book III 6.10 Translation Table
> > +      * Update Synchronization Requirements.
> >        */
> >  }
> >
> > diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> > index 3adcf730f478..1d5eec847b88 100644
> > --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> > +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> > @@ -108,7 +108,7 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
> >
> >  set_the_pte:
> >       set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
> > -     smp_wmb();
> > +     asm volatile("ptesync": : :"memory");
> >       return 0;
> >  }
> >
> > @@ -168,7 +168,7 @@ static int __map_kernel_page(unsigned long ea, unsigned long pa,
> >
> >  set_the_pte:
> >       set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
> > -     smp_wmb();
> > +     asm volatile("ptesync": : :"memory");
> >       return 0;
> >  }
> >
> > --
> > 2.25.1
> >
> >

^ permalink raw reply

* Re: [PATCH v2] powerpc/64s: Fix pte update for kernel memory on radix
From: Nicholas Piggin @ 2021-02-04  3:31 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev; +Cc: cmr
In-Reply-To: <20210203235907.961243-1-jniethe5@gmail.com>

Excerpts from Jordan Niethe's message of February 4, 2021 9:59 am:
> When adding a pte a ptesync is needed to order the update of the pte
> with subsequent accesses otherwise a spurious fault may be raised.
> 
> radix__set_pte_at() does not do this for performance gains. For
> non-kernel memory this is not an issue as any faults of this kind are
> corrected by the page fault handler.  For kernel memory these faults are
> not handled.  The current solution is that there is a ptesync in
> flush_cache_vmap() which should be called when mapping from the vmalloc
> region.
> 
> However, map_kernel_page() does not call flush_cache_vmap(). This is
> troublesome in particular for code patching with Strict RWX on radix. In
> do_patch_instruction() the page frame that contains the instruction to
> be patched is mapped and then immediately patched. With no ordering or
> synchronization between setting up the pte and writing to the page it is
> possible for faults.
> 
> As the code patching is done using __put_user_asm_goto() the resulting
> fault is obscured - but using a normal store instead it can be seen:
> 
> [  418.498768][  T757] BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
> [  418.498790][  T757] Faulting instruction address: 0xc00000000008bd74
> [  418.498805][  T757] Oops: Kernel access of bad area, sig: 11 [#1]
> [  418.498828][  T757] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
> [  418.498843][  T757] Modules linked in: nop_module(PO+) [last unloaded: nop_module]
> [  418.498872][  T757] CPU: 4 PID: 757 Comm: sh Tainted: P           O      5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
> [  418.498936][  T757] NIP:  c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
> [  418.498979][  T757] REGS: c000000016f634a0 TRAP: 0300   Tainted: P           O       (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
> [  418.499033][  T757] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44002884  XER: 00000000
> [  418.499084][  T757] CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1
> 
> This results in the kind of issue reported here:
> https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/
> 
> Chris Riedl suggested a reliable way to reproduce the issue:
> $ mount -t debugfs none /sys/kernel/debug
> $ (while true; do echo function > /sys/kernel/debug/tracing/current_tracer ; echo nop > /sys/kernel/debug/tracing/current_tracer ; done)&
> 
> Turning ftrace on and off does a large amount of code patching which in
> usually less then 5min will crash giving a trace like:
> 
> [  146.668988][  T809] ftrace-powerpc: (____ptrval____): replaced (4b473b11) != old (60000000)
> [  146.668995][  T809] ------------[ ftrace bug ]------------
> [  146.669031][  T809] ftrace failed to modify
> [  146.669039][  T809] [<c000000000bf8e5c>] napi_busy_loop+0xc/0x390
> [  146.669045][  T809]  actual:   11:3b:47:4b
> [  146.669070][  T809] Setting ftrace call site to call ftrace function
> [  146.669075][  T809] ftrace record flags: 80000001
> [  146.669081][  T809]  (1)
> [  146.669081][  T809]  expected tramp: c00000000006c96c
> [  146.669096][  T809] ------------[ cut here ]------------
> [  146.669104][  T809] WARNING: CPU: 4 PID: 809 at kernel/trace/ftrace.c:2065 ftrace_bug+0x28c/0x2e8
> [  146.669109][  T809] Modules linked in: nop_module(PO-) [last unloaded: nop_module]
> [  146.669130][  T809] CPU: 4 PID: 809 Comm: sh Tainted: P           O      5.10.0-rc5-01360-gf878ccaf250a #1
> [  146.669136][  T809] NIP:  c00000000024f334 LR: c00000000024f330 CTR: c0000000001a5af0
> [  146.669142][  T809] REGS: c000000004c8b760 TRAP: 0700   Tainted: P           O       (5.10.0-rc5-01360-gf878ccaf250a)
> [  146.669147][  T809] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28008848  XER: 20040000
> [  146.669208][  T809] CFAR: c0000000001a9c98 IRQMASK: 0
> [  146.669208][  T809] GPR00: c00000000024f330 c000000004c8b9f0 c000000002770600 0000000000000022
> [  146.669208][  T809] GPR04: 00000000ffff7fff c000000004c8b6d0 0000000000000027 c0000007fe9bcdd8
> [  146.669208][  T809] GPR08: 0000000000000023 ffffffffffffffd8 0000000000000027 c000000002613118
> [  146.669208][  T809] GPR12: 0000000000008000 c0000007fffdca00 0000000000000000 0000000000000000
> [  146.669208][  T809] GPR16: 0000000023ec37c5 0000000000000000 0000000000000000 0000000000000008
> [  146.669208][  T809] GPR20: c000000004c8bc90 c0000000027a2d20 c000000004c8bcd0 c000000002612fe8
> [  146.669208][  T809] GPR24: 0000000000000038 0000000000000030 0000000000000028 0000000000000020
> [  146.669208][  T809] GPR28: c000000000ff1b68 c000000000bf8e5c c00000000312f700 c000000000fbb9b0
> [  146.669384][  T809] NIP [c00000000024f334] ftrace_bug+0x28c/0x2e8
> [  146.669391][  T809] LR [c00000000024f330] ftrace_bug+0x288/0x2e8
> [  146.669396][  T809] Call Trace:
> [  146.669403][  T809] [c000000004c8b9f0] [c00000000024f330] ftrace_bug+0x288/0x2e8 (unreliable)
> [  146.669418][  T809] [c000000004c8ba80] [c000000000248778] ftrace_modify_all_code+0x168/0x210
> [  146.669429][  T809] [c000000004c8bab0] [c00000000006c528] arch_ftrace_update_code+0x18/0x30
> [  146.669440][  T809] [c000000004c8bad0] [c000000000248954] ftrace_run_update_code+0x44/0xc0
> [  146.669451][  T809] [c000000004c8bb00] [c00000000024dc88] ftrace_startup+0xf8/0x1c0
> [  146.669461][  T809] [c000000004c8bb40] [c00000000024dd9c] register_ftrace_function+0x4c/0xc0
> [  146.669472][  T809] [c000000004c8bb70] [c00000000026e750] function_trace_init+0x80/0xb0
> [  146.669484][  T809] [c000000004c8bba0] [c000000000266b84] tracing_set_tracer+0x2a4/0x4f0
> [  146.669495][  T809] [c000000004c8bc70] [c000000000266ea4] tracing_set_trace_write+0xd4/0x130
> [  146.669506][  T809] [c000000004c8bd20] [c000000000422790] vfs_write+0xf0/0x330
> [  146.669518][  T809] [c000000004c8bd70] [c000000000422bb4] ksys_write+0x84/0x140
> [  146.669529][  T809] [c000000004c8bdc0] [c00000000003499c] system_call_exception+0x14c/0x230
> [  146.669540][  T809] [c000000004c8be20] [c00000000000d860] system_call_common+0xf0/0x27c
> [  146.669549][  T809] Instruction dump:
> [  146.669558][  T809] 48000014 3c62fe88 38631718 4bf5a941 60000000 7fc3f378 4bff877d 7c641b78
> [  146.669598][  T809] 3c62fe88 38631730 4bf5a925 60000000 <0fe00000> 38210090 3d22fd90 39000001
> [  146.669638][  T809] ---[ end trace 5ea7076ea28c0fbd ]---
> 
> To fix this when updating kernel memory ptes using ptesync.
> 
> Fixes: 37bc3e5fd764 ("powerpc/lib/code-patching: Use alternate map for patch_instruction()")
> Fixes: f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
> Signed-off-by: Jordan Niethe <jniethe5@gmail.com>

Good catch. I would say it just fixes the latter patch doesn't it? The 
previous one just happens to break because it's using the broken API?

Anyhow,

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> ---
> v2: Only ptesync is needed
> ---
>  arch/powerpc/include/asm/book3s/64/radix.h | 6 ++++--
>  arch/powerpc/mm/book3s64/radix_pgtable.c   | 4 ++--
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> index c7813dc628fc..59cab558e2f0 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -222,8 +222,10 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
>  	 * from ptesync, it should probably go into update_mmu_cache, rather
>  	 * than set_pte_at (which is used to set ptes unrelated to faults).
>  	 *
> -	 * Spurious faults to vmalloc region are not tolerated, so there is
> -	 * a ptesync in flush_cache_vmap.
> +	 * Spurious faults from the kernel memory are not tolerated, so there
> +	 * is a ptesync in flush_cache_vmap, and __map_kernel_page() follows
> +	 * the pte update sequence from ISA Book III 6.10 Translation Table
> +	 * Update Synchronization Requirements.
>  	 */
>  }
>  
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 3adcf730f478..1d5eec847b88 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -108,7 +108,7 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
>  
>  set_the_pte:
>  	set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
> -	smp_wmb();
> +	asm volatile("ptesync": : :"memory");
>  	return 0;
>  }
>  
> @@ -168,7 +168,7 @@ static int __map_kernel_page(unsigned long ea, unsigned long pa,
>  
>  set_the_pte:
>  	set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
> -	smp_wmb();
> +	asm volatile("ptesync": : :"memory");
>  	return 0;
>  }
>  
> -- 
> 2.25.1
> 
> 

^ permalink raw reply

* Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
From: Nicholas Piggin @ 2021-02-04  3:27 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: Michal Suchanek
In-Reply-To: <37c2a8e1-2c4b-2e55-6753-0a804ce00cac@csgroup.eu>

Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
> 
> 
> Le 25/02/2020 à 18:35, Nicholas Piggin a écrit :
>> Implement the bulk of interrupt return logic in C. The asm return code
>> must handle a few cases: restoring full GPRs, and emulating stack store.
>> 
> 
> 
>> +notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, unsigned long msr)
>> +{
>> +	unsigned long *ti_flagsp = &current_thread_info()->flags;
>> +	unsigned long flags;
>> +
>> +	if (IS_ENABLED(CONFIG_PPC_BOOK3S) && unlikely(!(regs->msr & MSR_RI)))
>> +		unrecoverable_exception(regs);
>> +	BUG_ON(regs->msr & MSR_PR);
>> +	BUG_ON(!FULL_REGS(regs));
>> +
>> +	local_irq_save(flags);
>> +
>> +	if (regs->softe == IRQS_ENABLED) {
>> +		/* Returning to a kernel context with local irqs enabled. */
>> +		WARN_ON_ONCE(!(regs->msr & MSR_EE));
>> +again:
>> +		if (IS_ENABLED(CONFIG_PREEMPT)) {
>> +			/* Return to preemptible kernel context */
>> +			if (unlikely(*ti_flagsp & _TIF_NEED_RESCHED)) {
>> +				if (preempt_count() == 0)
>> +					preempt_schedule_irq();
>> +			}
>> +		}
>> +
>> +		trace_hardirqs_on();
>> +		__hard_EE_RI_disable();
>> +		if (unlikely(lazy_irq_pending())) {
>> +			__hard_RI_enable();
>> +			irq_soft_mask_set(IRQS_ALL_DISABLED);
>> +			trace_hardirqs_off();
>> +			local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
>> +			/*
>> +			 * Can't local_irq_enable in case we are in interrupt
>> +			 * context. Must replay directly.
>> +			 */
>> +			replay_soft_interrupts();
>> +			irq_soft_mask_set(flags);
>> +			/* Took an interrupt, may have more exit work to do. */
>> +			goto again;
>> +		}
>> +		local_paca->irq_happened = 0;
>> +		irq_soft_mask_set(IRQS_ENABLED);
>> +	} else {
>> +		/* Returning to a kernel context with local irqs disabled. */
>> +		trace_hardirqs_on();
>> +		__hard_EE_RI_disable();
>> +		if (regs->msr & MSR_EE)
>> +			local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
>> +	}
>> +
>> +
>> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>> +	local_paca->tm_scratch = regs->msr;
>> +#endif
>> +
>> +	/*
>> +	 * We don't need to restore AMR on the way back to userspace for KUAP.
>> +	 * The value of AMR only matters while we're in the kernel.
>> +	 */
>> +	kuap_restore_amr(regs);
> 
> Is that correct to restore KUAP state here ? Shouldn't we have it at lower level in assembly ?
> 
> Isn't there a risk that someone manages to call interrupt_exit_kernel_prepare() or the end of it in 
> a way or another, and get the previous KUAP state restored by this way ?

I'm not sure if there much more risk if it's here rather than the 
instruction being in another place in the code.

There's a lot of user access around the kernel too if you want to find a 
gadget to unlock KUAP then I suppose there is a pretty large attack
surface.

> Also, it looks a bit strange to have kuap_save_amr_and_lock() done at lowest level in assembly, and 
> kuap_restore_amr() done in upper level. That looks unbalanced.

I'd like to bring the entry assembly into C.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] mm/memory.c: Remove pte_sw_mkyoung()
From: Nicholas Piggin @ 2021-02-04  3:23 UTC (permalink / raw)
  To: Andrew Morton, Christophe Leroy
  Cc: linux-arch, Thomas Bogendoerfer, Jia He, linux-kernel, Bibo Mao,
	linux-mips, linux-mm, linuxppc-dev
In-Reply-To: <20210203164630.ada46d0c84e0e9f0a474b283@linux-foundation.org>

Excerpts from Andrew Morton's message of February 4, 2021 10:46 am:
> On Wed,  3 Feb 2021 10:19:44 +0000 (UTC) Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> 
>> Commit 83d116c53058 ("mm: fix double page fault on arm64 if PTE_AF
>> is cleared") introduced arch_faults_on_old_pte() helper to identify
>> platforms that don't set page access bit in HW and require a page
>> fault to set it.
>> 
>> Commit 44bf431b47b4 ("mm/memory.c: Add memory read privilege on page
>> fault handling") added pte_sw_mkyoung() which is yet another way to
>> manage platforms that don't set page access bit in HW and require a
>> page fault to set it.
>> 
>> Remove that pte_sw_mkyoung() helper and use the already existing
>> arch_faults_on_old_pte() helper together with pte_mkyoung() instead.
> 
> This conflicts with mm/memory.c changes in linux-next.  In
> do_set_pte().  Please check my efforts:

I wanted to just get rid of it completely --

https://marc.info/?l=linux-mm&m=160860750115163&w=2

Waiting for MIPs to get that patch mentioned merged or nacked but
as yet seems to be no response from maintainers.

https://lore.kernel.org/linux-arch/20201019081257.32127-1-huangpei@loongson.cn/

Thanks,
Nick

> 
> --- a/arch/mips/include/asm/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
> +++ a/arch/mips/include/asm/pgtable.h
> @@ -406,8 +406,6 @@ static inline pte_t pte_mkyoung(pte_t pt
>  	return pte;
>  }
>  
> -#define pte_sw_mkyoung	pte_mkyoung
> -
>  #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
>  static inline int pte_huge(pte_t pte)	{ return pte_val(pte) & _PAGE_HUGE; }
>  
> --- a/include/linux/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
> +++ a/include/linux/pgtable.h
> @@ -424,22 +424,6 @@ static inline void ptep_set_wrprotect(st
>  }
>  #endif
>  
> -/*
> - * On some architectures hardware does not set page access bit when accessing
> - * memory page, it is responsibilty of software setting this bit. It brings
> - * out extra page fault penalty to track page access bit. For optimization page
> - * access bit can be set during all page fault flow on these arches.
> - * To be differentiate with macro pte_mkyoung, this macro is used on platforms
> - * where software maintains page access bit.
> - */
> -#ifndef pte_sw_mkyoung
> -static inline pte_t pte_sw_mkyoung(pte_t pte)
> -{
> -	return pte;
> -}
> -#define pte_sw_mkyoung	pte_sw_mkyoung
> -#endif
> -
>  #ifndef pte_savedwrite
>  #define pte_savedwrite pte_write
>  #endif
> --- a/mm/memory.c~mm-memoryc-remove-pte_sw_mkyoung
> +++ a/mm/memory.c
> @@ -2902,7 +2902,8 @@ static vm_fault_t wp_page_copy(struct vm
>  		}
>  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
> -		entry = pte_sw_mkyoung(entry);
> +		if (arch_faults_on_old_pte())
> +			entry = pte_mkyoung(entry);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  
>  		/*
> @@ -3560,7 +3561,8 @@ static vm_fault_t do_anonymous_page(stru
>  	__SetPageUptodate(page);
>  
>  	entry = mk_pte(page, vma->vm_page_prot);
> -	entry = pte_sw_mkyoung(entry);
> +	if (arch_faults_on_old_pte())
> +		entry = pte_mkyoung(entry);
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry));
>  
> @@ -3745,8 +3747,8 @@ void do_set_pte(struct vm_fault *vmf, st
>  
>  	if (prefault && arch_wants_old_prefaulted_pte())
>  		entry = pte_mkold(entry);
> -	else
> -		entry = pte_sw_mkyoung(entry);
> +	else if (arch_faults_on_old_pte())
> +		entry = pte_mkyoung(entry);
>  
>  	if (write)
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> _
> 
> 
> 

^ permalink raw reply

* Re: [PATCH] powerpc/perf: Record counter overflow always if SAMPLE_IP is unset
From: Michael Ellerman @ 2021-02-04  2:55 UTC (permalink / raw)
  To: Athira Rajeev; +Cc: maddy, linuxppc-dev
In-Reply-To: <1612342484-1404-1-git-send-email-atrajeev@linux.vnet.ibm.com>

Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
> While sampling for marked events, currently we record the sample only
> if the SIAR valid bit of Sampled Instruction Event Register (SIER) is
> set. SIAR_VALID bit is used for fetching the instruction address from
> Sampled Instruction Address Register(SIAR). But there are some usecases,
> where the user is interested only in the PMU stats at each counter
> overflow and the exact IP of the overflow event is not required.
> Dropping SIAR invalid samples will fail to record some of the counter
> overflows in such cases.
>
> Example of such usecase is dumping the PMU stats (event counts)
> after some regular amount of instructions/events from the userspace
> (ex: via ptrace). Here counter overflow is indicated to userspace via
> signal handler, and captured by monitoring and enabling I/O
> signaling on the event file descriptor. In these cases, we expect to
> get sample/overflow indication after each specified sample_period.
>
> Perf event attribute will not have PERF_SAMPLE_IP set in the
> sample_type if exact IP of the overflow event is not requested. So
> while profiling if SAMPLE_IP is not set, just record the counter overflow
> irrespective of SIAR_VALID check.
>
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> ---
>  arch/powerpc/perf/core-book3s.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 28206b1fe172..bb4828a05e4d 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2166,10 +2166,16 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
>  	 * address even when freeze on supervisor state (kernel) is set in
>  	 * MMCR2. Check attr.exclude_kernel and address to drop the sample in
>  	 * these cases.
> +	 *
> +	 * If address is not requested in the sample
> +	 * via PERF_SAMPLE_IP, just record that sample
> +	 * irrespective of SIAR valid check.
>  	 */
> -	if (event->attr.exclude_kernel && record)
> -		if (is_kernel_addr(mfspr(SPRN_SIAR)))
> +	if (event->attr.exclude_kernel && record) {
> +		if (is_kernel_addr(mfspr(SPRN_SIAR)) && (event->attr.sample_type & PERF_SAMPLE_IP))
>  			record = 0;
> +	} else if (!record && !(event->attr.sample_type & PERF_SAMPLE_IP))
> +		record = 1;

This seems wrong, you're assuming that record was set previously to
= siar_valid(), but it may be that record is still 0 from the
initialisation and we weren't going to record.

Don't we need something more like this?

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 9fd06010e8b6..e4e8a017d339 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2136,7 +2136,12 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
 			left += period;
 			if (left <= 0)
 				left = period;
-			record = siar_valid(regs);
+
+			if (event->attr.sample_type & PERF_SAMPLE_IP)
+				record = siar_valid(regs);
+			else
+				record = 1;
+
 			event->hw.last_period = event->hw.sample_period;
 		}
 		if (left < 0x80000000LL)
@@ -2154,9 +2159,10 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
 	 * MMCR2. Check attr.exclude_kernel and address to drop the sample in
 	 * these cases.
 	 */
-	if (event->attr.exclude_kernel && record)
-		if (is_kernel_addr(mfspr(SPRN_SIAR)))
-			record = 0;
+	if (event->attr.exclude_kernel &&
+	    (event->attr.sample_type & PERF_SAMPLE_IP) &&
+	    is_kernel_addr(mfspr(SPRN_SIAR)))
+		record = 0;
 
 	/*
 	 * Finally record data if requested.



cheers

^ permalink raw reply related

* Re: [PATCH v2 1/3] powerpc: sstep: Fix load and update emulation
From: Michael Ellerman @ 2021-02-04  0:53 UTC (permalink / raw)
  To: Sandipan Das, Naveen N. Rao
  Cc: ravi.bangoria, ananth, jniethe5, paulus, linuxppc-dev, dja
In-Reply-To: <8e675de3-51df-0711-c90c-f450a1585f65@linux.ibm.com>

Sandipan Das <sandipan@linux.ibm.com> writes:
> On 03/02/21 3:19 pm, Naveen N. Rao wrote:
>> [...]
>> 
>> Wouldn't it be easier to just do the below at the end? Or, am I missing something?
>> 
>> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
>> index ede093e9623472..a2d726d2a5e9d1 100644
>> --- a/arch/powerpc/lib/sstep.c
>> +++ b/arch/powerpc/lib/sstep.c
>> @@ -2980,6 +2980,10 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
>>         }
>>  #endif /* CONFIG_VSX */
>> 
>> +       if (GETTYPE(op->type) == LOAD && (op->type & UPDATE) &&
>> +                       (ra == 0 || ra == rd))
>> +               goto unknown_opcode;
>> +
>>         return 0;
>> 
>>   logical_done:
>> 
>
> This looks good?
>
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index e96cff845ef7..a9c149bfd2f5 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -3017,6 +3017,21 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
>  
>         }
>  
> +       if (op->type & UPDATE) {
> +               if (ra == rd && GETTYPE(op->type) == LOAD)
> +                       goto unknown_opcode;
> +               else if (ra == 0)
> +                       switch(GETTYPE(op->type)) {
> +                       case LOAD:
> +                       case STORE:
> +#ifdef CONFIG_PPC_FPU
> +                       case LOAD_FP:
> +                       case STORE_FP:
> +#endif

Why make it conditional?

cheers

^ permalink raw reply

* Re: [PATCH] mm/memory.c: Remove pte_sw_mkyoung()
From: Andrew Morton @ 2021-02-04  0:46 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-arch, Thomas Bogendoerfer, Jia He, linux-kernel, Bibo Mao,
	linux-mips, linux-mm, linuxppc-dev
In-Reply-To: <f302ef92c48d1f08a0459aaee1c568ca11213814.1612345700.git.christophe.leroy@csgroup.eu>

On Wed,  3 Feb 2021 10:19:44 +0000 (UTC) Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> Commit 83d116c53058 ("mm: fix double page fault on arm64 if PTE_AF
> is cleared") introduced arch_faults_on_old_pte() helper to identify
> platforms that don't set page access bit in HW and require a page
> fault to set it.
> 
> Commit 44bf431b47b4 ("mm/memory.c: Add memory read privilege on page
> fault handling") added pte_sw_mkyoung() which is yet another way to
> manage platforms that don't set page access bit in HW and require a
> page fault to set it.
> 
> Remove that pte_sw_mkyoung() helper and use the already existing
> arch_faults_on_old_pte() helper together with pte_mkyoung() instead.

This conflicts with mm/memory.c changes in linux-next.  In
do_set_pte().  Please check my efforts:

--- a/arch/mips/include/asm/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
+++ a/arch/mips/include/asm/pgtable.h
@@ -406,8 +406,6 @@ static inline pte_t pte_mkyoung(pte_t pt
 	return pte;
 }
 
-#define pte_sw_mkyoung	pte_mkyoung
-
 #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
 static inline int pte_huge(pte_t pte)	{ return pte_val(pte) & _PAGE_HUGE; }
 
--- a/include/linux/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
+++ a/include/linux/pgtable.h
@@ -424,22 +424,6 @@ static inline void ptep_set_wrprotect(st
 }
 #endif
 
-/*
- * On some architectures hardware does not set page access bit when accessing
- * memory page, it is responsibilty of software setting this bit. It brings
- * out extra page fault penalty to track page access bit. For optimization page
- * access bit can be set during all page fault flow on these arches.
- * To be differentiate with macro pte_mkyoung, this macro is used on platforms
- * where software maintains page access bit.
- */
-#ifndef pte_sw_mkyoung
-static inline pte_t pte_sw_mkyoung(pte_t pte)
-{
-	return pte;
-}
-#define pte_sw_mkyoung	pte_sw_mkyoung
-#endif
-
 #ifndef pte_savedwrite
 #define pte_savedwrite pte_write
 #endif
--- a/mm/memory.c~mm-memoryc-remove-pte_sw_mkyoung
+++ a/mm/memory.c
@@ -2902,7 +2902,8 @@ static vm_fault_t wp_page_copy(struct vm
 		}
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
-		entry = pte_sw_mkyoung(entry);
+		if (arch_faults_on_old_pte())
+			entry = pte_mkyoung(entry);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
 		/*
@@ -3560,7 +3561,8 @@ static vm_fault_t do_anonymous_page(stru
 	__SetPageUptodate(page);
 
 	entry = mk_pte(page, vma->vm_page_prot);
-	entry = pte_sw_mkyoung(entry);
+	if (arch_faults_on_old_pte())
+		entry = pte_mkyoung(entry);
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
@@ -3745,8 +3747,8 @@ void do_set_pte(struct vm_fault *vmf, st
 
 	if (prefault && arch_wants_old_prefaulted_pte())
 		entry = pte_mkold(entry);
-	else
-		entry = pte_sw_mkyoung(entry);
+	else if (arch_faults_on_old_pte())
+		entry = pte_mkyoung(entry);
 
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
_


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox