LinuxPPC-Dev Archive on lore.kernel.org

* Re: [PATCH v2 5/5] KVM: selftests: Remove __NR_userfaultfd syscall fallback
From: Ben Gardon @ 2021-08-23 23:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Peter Zijlstra, Catalin Marinas, LKML, Will Deacon, Guo Ren,
	linux-kselftest, Shuah Khan, Paul Mackerras, linux-s390,
	Shakeel Butt, Vasily Gorbik, Russell King, linux-csky,
	Christian Borntraeger, Ingo Molnar, linux-mips, Boqun Feng,
	Paul E. McKenney, Heiko Carstens, Steven Rostedt,
	Mathieu Desnoyers, Andy Lutomirski, Thomas Gleixner, Peter Foley,
	linux-arm-kernel, Thomas Bogendoerfer, Oleg Nesterov,
	Paolo Bonzini, linuxppc-dev
In-Reply-To: <20210820225002.310652-6-seanjc@google.com>

On Fri, Aug 20, 2021 at 3:50 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Revert the __NR_userfaultfd syscall fallback added for KVM selftests now
> that x86's unistd_{32,64}.h overrides are under uapi/ and thus not in
> KVM selftests' search path, i.e. now that KVM gets x86 syscall numbers
> from the installed kernel headers.
>
> No functional change intended.
>
> Cc: Ben Gardon <bgardon@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Ben Gardon <bgardon@google.com>

> ---
>  tools/arch/x86/include/uapi/asm/unistd_64.h | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/tools/arch/x86/include/uapi/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h
> index 4205ed4158bf..cb52a3a8b8fc 100644
> --- a/tools/arch/x86/include/uapi/asm/unistd_64.h
> +++ b/tools/arch/x86/include/uapi/asm/unistd_64.h
> @@ -1,7 +1,4 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef __NR_userfaultfd
> -#define __NR_userfaultfd 282
> -#endif
>  #ifndef __NR_perf_event_open
>  # define __NR_perf_event_open 298
>  #endif
> --
> 2.33.0.rc2.250.ged5fa647cd-goog
>

^ permalink raw reply

* Re: linux-next: build warning after merge of the powerpc tree
From: Stephen Rothwell @ 2021-08-23 23:21 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Daniel Henrique Barboza, Linux Kernel Mailing List,
	Linux Next Mailing List, Aneesh Kumar K.V, PowerPC
In-Reply-To: <87a6l8p7kd.fsf@meer.lwn.net>

[-- Attachment #1: Type: text/plain, Size: 366 bytes --]

Hi Jon,

On Mon, 23 Aug 2021 08:19:30 -0600 Jonathan Corbet <corbet@lwn.net> wrote:
>
> Stephen Rothwell <sfr@canb.auug.org.au> writes:
> 
> > Hi all,
> >
> > [cc'ing Jon in case he can fix the sphinx hang - or knows anything about it]  
> 
> That's new to me.  Which version of sphinx?

3.4.3-2, it's a Debian version.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v2 1/3] powerpc/pseries: Parse control memory access error
From: Ganesh @ 2021-08-23 18:53 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: mikey, mahesh, npiggin
In-Reply-To: <20210805092025.272871-1-ganeshgr@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 2353 bytes --]

Hi mpe, Any comments on this patchset?

On 8/5/21 2:50 PM, Ganesh Goudar wrote:

> Add support to parse and log control memory access
> error for pseries.
>
> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
> ---
> v2: No changes in this patch.
> ---
>   arch/powerpc/platforms/pseries/ras.c | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
> index 167f2e1b8d39..608c35cad0c3 100644
> --- a/arch/powerpc/platforms/pseries/ras.c
> +++ b/arch/powerpc/platforms/pseries/ras.c
> @@ -80,6 +80,7 @@ struct pseries_mc_errorlog {
>   #define MC_ERROR_TYPE_TLB		0x04
>   #define MC_ERROR_TYPE_D_CACHE		0x05
>   #define MC_ERROR_TYPE_I_CACHE		0x07
> +#define MC_ERROR_TYPE_CTRL_MEM_ACCESS	0x08
>   
>   /* RTAS pseries MCE error sub types */
>   #define MC_ERROR_UE_INDETERMINATE		0
> @@ -103,6 +104,9 @@ struct pseries_mc_errorlog {
>   #define MC_ERROR_TLB_MULTIHIT		2
>   #define MC_ERROR_TLB_INDETERMINATE	3
>   
> +#define MC_ERROR_CTRL_MEM_ACCESS_PTABLE_WALK	0
> +#define MC_ERROR_CTRL_MEM_ACCESS_OP_ACCESS	1
> +
>   static inline u8 rtas_mc_error_sub_type(const struct pseries_mc_errorlog *mlog)
>   {
>   	switch (mlog->error_type) {
> @@ -112,6 +116,8 @@ static inline u8 rtas_mc_error_sub_type(const struct pseries_mc_errorlog *mlog)
>   	case	MC_ERROR_TYPE_ERAT:
>   	case	MC_ERROR_TYPE_TLB:
>   		return (mlog->sub_err_type & 0x03);
> +	case	MC_ERROR_TYPE_CTRL_MEM_ACCESS:
> +		return (mlog->sub_err_type & 0x70) >> 4;
>   	default:
>   		return 0;
>   	}
> @@ -699,6 +705,21 @@ static int mce_handle_err_virtmode(struct pt_regs *regs,
>   	case MC_ERROR_TYPE_I_CACHE:
>   		mce_err.error_type = MCE_ERROR_TYPE_ICACHE;
>   		break;
> +	case MC_ERROR_TYPE_CTRL_MEM_ACCESS:
> +		mce_err.error_type = MCE_ERROR_TYPE_RA;
> +		if (mce_log->sub_err_type & 0x80)
> +			eaddr = be64_to_cpu(mce_log->effective_address);
> +		switch (err_sub_type) {
> +		case MC_ERROR_CTRL_MEM_ACCESS_PTABLE_WALK:
> +			mce_err.u.ra_error_type =
> +				MCE_RA_ERROR_PAGE_TABLE_WALK_LOAD_STORE_FOREIGN;
> +			break;
> +		case MC_ERROR_CTRL_MEM_ACCESS_OP_ACCESS:
> +			mce_err.u.ra_error_type =
> +				MCE_RA_ERROR_LOAD_STORE_FOREIGN;
> +			break;
> +		}
> +		break;
>   	case MC_ERROR_TYPE_UNKNOWN:
>   	default:
>   		mce_err.error_type = MCE_ERROR_TYPE_UNKNOWN;

[-- Attachment #2: Type: text/html, Size: 2676 bytes --]

^ permalink raw reply
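The sub_err_type handling in the patch above can be tried out in user space. The following is a minimal sketch, not kernel code: it only reuses the masks visible in the diff (0x80 for "effective address provided", 0x70 shifted right by 4 for the sub type). The helper names ctrl_mem_has_eaddr() and ctrl_mem_sub_type() are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sub type values as added by the patch. */
#define MC_ERROR_CTRL_MEM_ACCESS_PTABLE_WALK	0
#define MC_ERROR_CTRL_MEM_ACCESS_OP_ACCESS	1

/* Hypothetical helper: bit 0x80 says the effective address field is valid. */
static bool ctrl_mem_has_eaddr(uint8_t sub_err_type)
{
	return (sub_err_type & 0x80) != 0;
}

/* Hypothetical helper: bits 0x70 hold the sub type, same mask/shift as the patch. */
static uint8_t ctrl_mem_sub_type(uint8_t sub_err_type)
{
	return (sub_err_type & 0x70) >> 4;
}
```

With these, a raw value of 0x90 decodes as "effective address valid" with sub type MC_ERROR_CTRL_MEM_ACCESS_OP_ACCESS, while the low nibble never influences the result.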

* Re: [PATCH] powerpc/booke: Avoid link stack corruption in several places
From: Segher Boessenkool @ 2021-08-23 20:12 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <67a5be3f-a443-03eb-aa8e-a1fa6c0b3d3f@csgroup.eu>

On Mon, Aug 23, 2021 at 07:05:38PM +0200, Christophe Leroy wrote:
> Le 23/08/2021 à 17:58, Segher Boessenkool a écrit :
> >On Mon, Aug 23, 2021 at 07:53:01AM +0000, Christophe Leroy wrote:
> >>  /* Be careful, this will clobber the lr register. */
> >>  #define LOAD_REG_ADDR_PIC(reg, name)		\
> >>-	bl	0f;				\
> >>+	bcl	20,31,0f			\
> >>  0:	mflr	reg;				\
> >>  	addis	reg,reg,(name - 0b)@ha;		\
> >>  	addi	reg,reg,(name - 0b)@l;
> >
> >The code ended each line with a semicolon before, for absolutely no
> >reason that I can see, but still.  Fixing that would be nice, but only
> >doing it on one line isn't good.
> 
> Sure, forgetting the semicolon broke the build. That's because the 
> backslash removes the newline.

Ah right, one of the surprises you get from using the C preprocessor on
non-C code :-)

> The cleanest way I found to fix that kind of stuff is by using a GAS macro, 
> as I did for LOAD_REG_IMMEDIATE() some time ago.

Yeah, good plan.  You can use loops and saner parameters etc. as well if
you do :-)

> >Btw.  Both the 7450 and the modern cores implementing this really need
> >this to be $+4, so it is a lot clearer to write that instead of 1f or
> >a named label.
> 
> I like that, removing unneeded labels will make it smoother and clearer. 
> I'll do it.

Cool, thanks!


Segher

^ permalink raw reply

* Re: [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
From: Segher Boessenkool @ 2021-08-23 18:46 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <316c543b8906712c108985c8463eec09c8db577b.1629732542.git.christophe.leroy@csgroup.eu>

On Mon, Aug 23, 2021 at 03:29:12PM +0000, Christophe Leroy wrote:
> Instructions lmw/stmw are interesting for functions that are rarely
> used and not in the cache, because only one instruction is to be
> copied into the instruction cache instead of 19. However those
> instructions are less performant than 19x raw lwz/stw as they require
> synchronisation plus one additional cycle.

lmw takes N+2 cycles for loading N words on 603/604/750/7400, and N+3 on
7450.  stmw takes N+1 cycles for storing N words on 603, N+2 on 604/750/
7400, and N+3 on 7450 (load latency is 3 instead of 2 on 7450).

There is no synchronisation needed, although there is some serialisation,
which of course doesn't mean much since there can be only 6 or 8 or so
insns executing at once anyway.

So, these insns are almost never slower, they can easily win cycles back
because of the smaller code, too.

What 32-bit core do you see where load/store multiple are more than a
fraction of a cycle (per memory access) slower?

> SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are
> mostly in interrupts entries/exits and in task switch so they are
> likely already in the cache.

Nothing is likely in the cache on the older cores (except in
microbenchmarks), the caches are not big enough for that!

> Using standard lwz improves null_syscall selftest by:
> - 10 cycles on mpc832x.
> - 2 cycles on mpc8xx.

And in real benchmarks?

On mpccore both lmw and stmw are only N+1 btw.  But the serialization
might cost another cycle here?

Segher

^ permalink raw reply
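The cycle counts quoted in the reply above can be written down as a small table in C. This is only a sketch of the arithmetic stated in that message (N words, plus a per-core constant); the enum and function names are invented here, and nothing about real pipeline behaviour beyond those figures is implied.

```c
/* Cores named in the message; the enum itself is illustrative. */
enum ppc_core { CORE_603, CORE_604, CORE_750, CORE_7400, CORE_7450 };

/* lmw: N+2 cycles on 603/604/750/7400, N+3 on 7450. */
static int lmw_cycles(enum ppc_core core, int nwords)
{
	return nwords + (core == CORE_7450 ? 3 : 2);
}

/* stmw: N+1 cycles on 603, N+2 on 604/750/7400, N+3 on 7450. */
static int stmw_cycles(enum ppc_core core, int nwords)
{
	switch (core) {
	case CORE_603:
		return nwords + 1;
	case CORE_7450:
		return nwords + 3;
	default:
		return nwords + 2;
	}
}
```

For the 19 non-volatile registers r13..r31, lmw_cycles(CORE_750, 19) gives 21, i.e. two cycles more than the 19 loads themselves under these figures.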

* Re: [PATCH] powerpc/booke: Avoid link stack corruption in several places
From: Christophe Leroy @ 2021-08-23 17:05 UTC (permalink / raw)
  To: Segher Boessenkool; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <20210823155837.GX1583@gate.crashing.org>



Le 23/08/2021 à 17:58, Segher Boessenkool a écrit :
> On Mon, Aug 23, 2021 at 07:53:01AM +0000, Christophe Leroy wrote:
>>   /* Be careful, this will clobber the lr register. */
>>   #define LOAD_REG_ADDR_PIC(reg, name)		\
>> -	bl	0f;				\
>> +	bcl	20,31,0f			\
>>   0:	mflr	reg;				\
>>   	addis	reg,reg,(name - 0b)@ha;		\
>>   	addi	reg,reg,(name - 0b)@l;
> 
> The code ended each line with a semicolon before, for absolutely no
> reason that I can see, but still.  Fixing that would be nice, but only
> doing it on one line isn't good.

Sure, forgetting the semicolon broke the build. That's because the backslash removes the newline.

The cleanest way I found to fix that kind of stuff is by using a GAS macro, as I did for 
LOAD_REG_IMMEDIATE() some time ago.

> 
> Btw.  Both the 7450 and the modern cores implementing this really need
> this to be $+4, so it is a lot clearer to write that instead of 1f or
> a named label.

I like that, removing unneeded labels will make it smoother and clearer. I'll do it.

Christophe

^ permalink raw reply
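The point about the backslash removing the newline can be seen with plain C. The sketch below stringifies a multi-line macro so the spliced result becomes visible as text. LOAD_SEQ_MODEL is a cut-down stand-in for LOAD_REG_ADDR_PIC(), with r3 chosen arbitrarily, and STR()/XSTR() are ordinary helper macros, not anything from the kernel.

```c
#include <string.h>

/* Variadic so that the commas in "20,31,0f" survive stringification. */
#define STR(...)	#__VA_ARGS__
#define XSTR(...)	STR(__VA_ARGS__)

/*
 * Stand-in for the kernel macro. Each backslash-newline is deleted before
 * tokenization, so the body below is a single logical line and only the
 * semicolons keep the instructions apart for the assembler.
 */
#define LOAD_SEQ_MODEL		\
	bcl	20,31,0f;	\
0:	mflr	r3;

/* Returns the macro body as the preprocessor sees it: one line, no '\n'. */
static const char *load_seq_text(void)
{
	return XSTR(LOAD_SEQ_MODEL);
}
```

Printing load_seq_text() shows both instructions on one line. Dropping the semicolon after 0f would leave the assembler with "bcl 20,31,0f 0: mflr r3" as a single statement, which is why the v1 change broke the build.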

* Re: [PATCH] powerpc/booke: Avoid link stack corruption in several places
From: Segher Boessenkool @ 2021-08-23 15:58 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <d7435e616336fd5f07bb19ec61e97d71e5c53568.1629705153.git.christophe.leroy@csgroup.eu>

On Mon, Aug 23, 2021 at 07:53:01AM +0000, Christophe Leroy wrote:
>  /* Be careful, this will clobber the lr register. */
>  #define LOAD_REG_ADDR_PIC(reg, name)		\
> -	bl	0f;				\
> +	bcl	20,31,0f			\
>  0:	mflr	reg;				\
>  	addis	reg,reg,(name - 0b)@ha;		\
>  	addi	reg,reg,(name - 0b)@l;

The code ended each line with a semicolon before, for absolutely no
reason that I can see, but still.  Fixing that would be nice, but only
doing it on one line isn't good.

Btw.  Both the 7450 and the modern cores implementing this really need
this to be $+4, so it is a lot clearer to write that instead of 1f or
a named label.

Segher

^ permalink raw reply

* [PATCH v2 5/5] powerpc/signal: Use unsafe_copy_siginfo_to_user()
From: Christophe Leroy @ 2021-08-23 15:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <fd7938d94008711d441551c06b25a033669a0618.1629732940.git.christophe.leroy@csgroup.eu>

Use unsafe_copy_siginfo_to_user() in order to do the copy
within the user access block.

On an mpc 8321 (book3s/32) the improvement is about 5% on a process
sending a signal to itself.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/signal_32.c | 13 ++++++-------
 arch/powerpc/kernel/signal_64.c |  5 +----
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index ff101e2b3bab..f9e16d108bc8 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -710,12 +710,6 @@ static long restore_tm_user_regs(struct pt_regs *regs, struct mcontext __user *s
 }
 #endif
 
-#ifdef CONFIG_PPC64
-
-#define copy_siginfo_to_user	copy_siginfo_to_user32
-
-#endif /* CONFIG_PPC64 */
-
 /*
  * Set up a signal frame for a "real-time" signal handler
  * (one which gets siginfo).
@@ -779,14 +773,19 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 		asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
 	}
 	unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
+#ifndef CONFIG_COMPAT
+	unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, failed);
+#endif
 
 	/* create a stack frame for the caller of the handler */
 	unsafe_put_user(regs->gpr[1], newsp, failed);
 
 	user_access_end();
 
-	if (copy_siginfo_to_user(&frame->info, &ksig->info))
+#ifdef CONFIG_COMPAT
+	if (copy_siginfo_to_user32(&frame->info, &ksig->info))
 		goto badframe;
+#endif
 
 	regs->link = tramp;
 
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 2cca6c8febe1..82b73fbd937d 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -901,15 +901,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 	}
 
 	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
+	unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, badframe_block);
 	/* Allocate a dummy caller frame for the signal handler. */
 	unsafe_put_user(regs->gpr[1], newsp, badframe_block);
 
 	user_write_access_end();
 
-	/* Save the siginfo outside of the unsafe block. */
-	if (copy_siginfo_to_user(&frame->info, &ksig->info))
-		goto badframe;
-
 	/* Make sure signal handler doesn't get spurious FP exceptions */
 	tsk->thread.fp_state.fpscr = 0;
 
-- 
2.25.0


^ permalink raw reply related

* [PATCH v2 4/5] powerpc/uaccess: Add unsafe_clear_user()
From: Christophe Leroy @ 2021-08-23 15:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <fd7938d94008711d441551c06b25a033669a0618.1629732940.git.christophe.leroy@csgroup.eu>

Implement unsafe_clear_user() for powerpc.
It's a copy/paste of unsafe_copy_to_user() with value 0 as source.

It may be improved in a later patch by using 'dcbz' instruction
to zeroize full cache lines at once.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/uaccess.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 22c79ab40006..962b675485ff 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -467,6 +467,26 @@ do {									\
 		unsafe_put_user(*(u8*)(_src + _i), (u8 __user *)(_dst + _i), e); \
 } while (0)
 
+#define unsafe_clear_user(d, l, e)					\
+do {									\
+	u8 __user *_dst = (u8 __user *)(d);				\
+	size_t _len = (l);						\
+	int _i;								\
+									\
+	for (_i = 0; _i < (_len & ~(sizeof(u64) - 1)); _i += sizeof(u64)) \
+		unsafe_put_user(0, (u64 __user *)(_dst + _i), e);	\
+	if (_len & 4) {							\
+		unsafe_put_user(0, (u32 __user *)(_dst + _i), e);	\
+		_i += 4;						\
+	}								\
+	if (_len & 2) {							\
+		unsafe_put_user(0, (u16 __user *)(_dst + _i), e);	\
+		_i += 2;						\
+	}								\
+	if (_len & 1)							\
+		unsafe_put_user(0, (u8 __user *)(_dst + _i), e);	\
+} while (0)
+
 #define HAVE_GET_KERNEL_NOFAULT
 
 #define __get_kernel_nofault(dst, src, type, err_label)			\
-- 
2.25.0


^ permalink raw reply related
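The chunking done by unsafe_clear_user() above (8-byte steps, then a 4, 2 and 1 byte tail selected by the low bits of the length) can be modelled in user space. This is a sketch under the assumption that memset() on a normal buffer stands in for unsafe_put_user(0, ...); clear_chunked() is a made-up name and there is no fault handling here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space model of the macro: same loop bound and same tail tests. */
static void clear_chunked(uint8_t *dst, size_t len)
{
	size_t i;

	/* Whole 8-byte chunks first. */
	for (i = 0; i < (len & ~(sizeof(uint64_t) - 1)); i += sizeof(uint64_t))
		memset(dst + i, 0, sizeof(uint64_t));
	/* Then at most one 4-byte, one 2-byte and one 1-byte store. */
	if (len & 4) {
		memset(dst + i, 0, 4);
		i += 4;
	}
	if (len & 2) {
		memset(dst + i, 0, 2);
		i += 2;
	}
	if (len & 1)
		memset(dst + i, 0, 1);
}
```

For len = 15 this performs one 8-byte clear followed by the 4, 2 and 1 byte tails, touching exactly bytes 0..14.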

* [PATCH v2 3/5] signal: Add unsafe_copy_siginfo_to_user()
From: Christophe Leroy @ 2021-08-23 15:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <fd7938d94008711d441551c06b25a033669a0618.1629732940.git.christophe.leroy@csgroup.eu>

In the same spirit as commit fb05121fd6a2 ("signal: Add
unsafe_get_compat_sigset()"), implement an 'unsafe' version of
copy_siginfo_to_user() in order to use it within user access blocks.

For that, also add an 'unsafe' version of clear_user().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 include/linux/signal.h  | 15 +++++++++++++++
 include/linux/uaccess.h |  1 +
 kernel/signal.c         |  5 -----
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 3454c7ff0778..659bd43daf10 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -35,6 +35,21 @@ static inline void copy_siginfo_to_external(siginfo_t *to,
 int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from);
 int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from);
 
+static __always_inline char __user *si_expansion(const siginfo_t __user *info)
+{
+	return ((char __user *)info) + sizeof(struct kernel_siginfo);
+}
+
+#define unsafe_copy_siginfo_to_user(to, from, label) do {		\
+	siginfo_t __user *__ucs_to = to;				\
+	const kernel_siginfo_t *__ucs_from = from;			\
+	char __user *__ucs_expansion = si_expansion(__ucs_to);		\
+									\
+	unsafe_copy_to_user(__ucs_to, __ucs_from,			\
+			    sizeof(struct kernel_siginfo), label);	\
+	unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label);	\
+} while (0)
+
 enum siginfo_layout {
 	SIL_KILL,
 	SIL_TIMER,
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index c05e903cef02..37073caac474 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user *unsafe_addr, long count);
 #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e)
 #define unsafe_copy_to_user(d,s,l,e) unsafe_op_wrap(__copy_to_user(d,s,l),e)
 #define unsafe_copy_from_user(d,s,l,e) unsafe_op_wrap(__copy_from_user(d,s,l),e)
+#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e)
 static inline unsigned long user_access_save(void) { return 0UL; }
 static inline void user_access_restore(unsigned long flags) { }
 #endif
diff --git a/kernel/signal.c b/kernel/signal.c
index a3229add4455..83b5971e4304 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3261,11 +3261,6 @@ enum siginfo_layout siginfo_layout(unsigned sig, int si_code)
 	return layout;
 }
 
-static inline char __user *si_expansion(const siginfo_t __user *info)
-{
-	return ((char __user *)info) + sizeof(struct kernel_siginfo);
-}
-
 int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from)
 {
 	char __user *expansion = si_expansion(to);
-- 
2.25.0


^ permalink raw reply related
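What unsafe_copy_siginfo_to_user() does to the destination can be illustrated with plain buffers: the kernel-side part is copied, and the remaining "expansion" area of the user structure is zeroed. The sizes below (128 and 48 bytes) are assumptions picked for the illustration, and the MODEL_* names and copy_siginfo_model() do not exist in the kernel.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_SI_MAX_SIZE	128	/* assumed size of the user-visible siginfo */
#define MODEL_KSI_SIZE		48	/* assumed size of the kernel-side siginfo */
#define MODEL_SI_EXPANSION_SIZE	(MODEL_SI_MAX_SIZE - MODEL_KSI_SIZE)

/* Model of the macro: copy the payload, then clear the expansion area. */
static void copy_siginfo_model(unsigned char *to, const unsigned char *from)
{
	/* Same address computation as si_expansion(): just past the payload. */
	unsigned char *expansion = to + MODEL_KSI_SIZE;

	memcpy(to, from, MODEL_KSI_SIZE);
	memset(expansion, 0, MODEL_SI_EXPANSION_SIZE);
}
```

The zeroing matters because the destination is larger than what the kernel fills in; without it, stale bytes in the tail would be visible to the signal handler.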

* [PATCH v2 2/5] powerpc/signal: Include the new stack frame inside the user access block
From: Christophe Leroy @ 2021-08-23 15:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <fd7938d94008711d441551c06b25a033669a0618.1629732940.git.christophe.leroy@csgroup.eu>

Include the new stack frame inside the user access block and set it up
using unsafe_put_user().

On an mpc 8321 (book3s/32) the improvement is about 4% on a process
sending a signal to itself.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/signal_32.c | 29 +++++++++++++----------------
 arch/powerpc/kernel/signal_64.c | 14 +++++++-------
 2 files changed, 20 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 0608581967f0..ff101e2b3bab 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -726,7 +726,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	struct rt_sigframe __user *frame;
 	struct mcontext __user *mctx;
 	struct mcontext __user *tm_mctx = NULL;
-	unsigned long newsp = 0;
+	unsigned long __user *newsp;
 	unsigned long tramp;
 	struct pt_regs *regs = tsk->thread.regs;
 	/* Save the thread's msr before get_tm_stackpointer() changes it */
@@ -734,6 +734,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 
 	/* Set up Signal Frame */
 	frame = get_sigframe(ksig, tsk, sizeof(*frame), 1);
+	newsp = (unsigned long __user *)((unsigned long)frame - (__SIGNAL_FRAMESIZE + 16));
 	mctx = &frame->uc.uc_mcontext;
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 	tm_mctx = &frame->uc_transact.uc_mcontext;
@@ -743,7 +744,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	else
 		prepare_save_user_regs(1);
 
-	if (!user_access_begin(frame, sizeof(*frame)))
+	if (!user_access_begin(newsp, __SIGNAL_FRAMESIZE + 16 + sizeof(*frame)))
 		goto badframe;
 
 	/* Put the siginfo & fill in most of the ucontext */
@@ -779,6 +780,9 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	}
 	unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
 
+	/* create a stack frame for the caller of the handler */
+	unsafe_put_user(regs->gpr[1], newsp, failed);
+
 	user_access_end();
 
 	if (copy_siginfo_to_user(&frame->info, &ksig->info))
@@ -790,13 +794,8 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	tsk->thread.fp_state.fpscr = 0;	/* turn off all fp exceptions */
 #endif
 
-	/* create a stack frame for the caller of the handler */
-	newsp = ((unsigned long)frame) - (__SIGNAL_FRAMESIZE + 16);
-	if (put_user(regs->gpr[1], (u32 __user *)newsp))
-		goto badframe;
-
 	/* Fill registers for signal handler */
-	regs->gpr[1] = newsp;
+	regs->gpr[1] = (unsigned long)newsp;
 	regs->gpr[3] = ksig->sig;
 	regs->gpr[4] = (unsigned long)&frame->info;
 	regs->gpr[5] = (unsigned long)&frame->uc;
@@ -826,7 +825,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 	struct sigframe __user *frame;
 	struct mcontext __user *mctx;
 	struct mcontext __user *tm_mctx = NULL;
-	unsigned long newsp = 0;
+	unsigned long __user *newsp;
 	unsigned long tramp;
 	struct pt_regs *regs = tsk->thread.regs;
 	/* Save the thread's msr before get_tm_stackpointer() changes it */
@@ -834,6 +833,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 
 	/* Set up Signal Frame */
 	frame = get_sigframe(ksig, tsk, sizeof(*frame), 1);
+	newsp = (unsigned long __user *)((unsigned long)frame - __SIGNAL_FRAMESIZE);
 	mctx = &frame->mctx;
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 	tm_mctx = &frame->mctx_transact;
@@ -843,7 +843,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 	else
 		prepare_save_user_regs(1);
 
-	if (!user_access_begin(frame, sizeof(*frame)))
+	if (!user_access_begin(newsp, __SIGNAL_FRAMESIZE + sizeof(*frame)))
 		goto badframe;
 	sc = (struct sigcontext __user *) &frame->sctx;
 
@@ -873,6 +873,8 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 		unsafe_put_user(PPC_RAW_SC(), &mctx->mc_pad[1], failed);
 		asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
 	}
+	/* create a stack frame for the caller of the handler */
+	unsafe_put_user(regs->gpr[1], newsp, failed);
 	user_access_end();
 
 	regs->link = tramp;
@@ -881,12 +883,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 	tsk->thread.fp_state.fpscr = 0;	/* turn off all fp exceptions */
 #endif
 
-	/* create a stack frame for the caller of the handler */
-	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
-	if (put_user(regs->gpr[1], (u32 __user *)newsp))
-		goto badframe;
-
-	regs->gpr[1] = newsp;
+	regs->gpr[1] = (unsigned long)newsp;
 	regs->gpr[3] = ksig->sig;
 	regs->gpr[4] = (unsigned long) sc;
 	regs_set_return_ip(regs, (unsigned long) ksig->ka.sa.sa_handler);
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 790c450c2de8..2cca6c8febe1 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -847,13 +847,14 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 		struct task_struct *tsk)
 {
 	struct rt_sigframe __user *frame;
-	unsigned long newsp = 0;
+	unsigned long __user *newsp;
 	long err = 0;
 	struct pt_regs *regs = tsk->thread.regs;
 	/* Save the thread's msr before get_tm_stackpointer() changes it */
 	unsigned long msr = regs->msr;
 
 	frame = get_sigframe(ksig, tsk, sizeof(*frame), 0);
+	newsp = (unsigned long __user *)((unsigned long)frame - __SIGNAL_FRAMESIZE);
 
 	/*
 	 * This only applies when calling unsafe_setup_sigcontext() and must be
@@ -862,7 +863,7 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 	if (!MSR_TM_ACTIVE(msr))
 		prepare_setup_sigcontext(tsk);
 
-	if (!user_write_access_begin(frame, sizeof(*frame)))
+	if (!user_write_access_begin(newsp, __SIGNAL_FRAMESIZE + sizeof(*frame)))
 		goto badframe;
 
 	unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
@@ -900,6 +901,9 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 	}
 
 	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
+	/* Allocate a dummy caller frame for the signal handler. */
+	unsafe_put_user(regs->gpr[1], newsp, badframe_block);
+
 	user_write_access_end();
 
 	/* Save the siginfo outside of the unsafe block. */
@@ -919,10 +923,6 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 		regs_set_return_ip(regs, (unsigned long) &frame->tramp[0]);
 	}
 
-	/* Allocate a dummy caller frame for the signal handler. */
-	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
-	err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
-
 	/* Set up "regs" so we "return" to the signal handler. */
 	if (is_elf2_task()) {
 		regs->ctr = (unsigned long) ksig->ka.sa.sa_handler;
@@ -952,7 +952,7 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 
 	/* enter the signal handler in native-endian mode */
 	regs_set_return_msr(regs, (regs->msr & ~MSR_LE) | (MSR_KERNEL & MSR_LE));
-	regs->gpr[1] = newsp;
+	regs->gpr[1] = (unsigned long)newsp;
 	regs->gpr[3] = ksig->sig;
 	regs->result = 0;
 	if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
-- 
2.25.0


^ permalink raw reply related
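The widened access window in handle_rt_signal32() above is simple address arithmetic: the window now starts at newsp, below the frame, and must still end where the frame ends. A sketch of that check follows, with the frame size and __SIGNAL_FRAMESIZE passed in as parameters because their real values depend on the configuration; struct access_window and rt_frame_window() are illustrative names only.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative description of the range handed to user_access_begin(). */
struct access_window {
	uintptr_t start;
	size_t len;
};

/* Mirrors the 32-bit rt case: newsp = frame - (__SIGNAL_FRAMESIZE + 16). */
static struct access_window rt_frame_window(uintptr_t frame, size_t frame_size,
					    size_t signal_framesize)
{
	struct access_window w;

	w.start = frame - (signal_framesize + 16);
	w.len = signal_framesize + 16 + frame_size;
	return w;
}
```

Whatever values are plugged in, w.start + w.len equals frame + frame_size, so the window covers the dummy caller frame and the signal frame in one contiguous range.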

* [PATCH v2 1/5] powerpc/signal64: Access function descriptor with user access block
From: Christophe Leroy @ 2021-08-23 15:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

Access the function descriptor of the handler within a
user access block.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/signal_64.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 1831bba0582e..790c450c2de8 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -936,8 +936,18 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 		func_descr_t __user *funct_desc_ptr =
 			(func_descr_t __user *) ksig->ka.sa.sa_handler;
 
-		err |= get_user(regs->ctr, &funct_desc_ptr->entry);
-		err |= get_user(regs->gpr[2], &funct_desc_ptr->toc);
+		if (user_read_access_begin(funct_desc_ptr, sizeof(func_descr_t))) {
+			unsafe_get_user(regs->ctr, &funct_desc_ptr->entry, bad_funct_desc_block);
+			unsafe_get_user(regs->gpr[2], &funct_desc_ptr->toc, bad_funct_desc_block);
+		} else {
+			goto bad_funct_desc;
+bad_funct_desc_block:
+			user_read_access_end();
+bad_funct_desc:
+			signal_fault(current, regs, __func__, funct_desc_ptr);
+			return 1;
+		}
+		user_read_access_end();
 	}
 
 	/* enter the signal handler in native-endian mode */
-- 
2.25.0


^ permalink raw reply related
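The begin / unsafe access / end pattern with goto labels used in the patch above can be mocked in user space to check that the block is closed on every path. Everything prefixed mock_ below is invented for this sketch, a NULL pointer stands in for a faulting user address, and the control flow is laid out linearly rather than inside an else branch as in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock state: greater than zero while a "user access block" is open. */
static int access_depth;

static bool mock_user_read_access_begin(const void *p)
{
	if (!p)
		return false;	/* models a failed access_ok() check */
	access_depth++;
	return true;
}

static void mock_user_read_access_end(void)
{
	access_depth--;
}

/* Mock of unsafe_get_user(): jump to the label instead of returning an error. */
#define mock_unsafe_get_user(x, p, label) do {	\
	if (!(p))				\
		goto label;			\
	(x) = *(p);				\
} while (0)

/* Reads "entry" and "toc"; returns 0 on success, 1 on any failure. */
static int read_descr(const unsigned long *entry_p, const unsigned long *toc_p,
		      unsigned long *ctr, unsigned long *r2)
{
	if (!mock_user_read_access_begin(entry_p))
		return 1;
	mock_unsafe_get_user(*ctr, entry_p, bad_block);
	mock_unsafe_get_user(*r2, toc_p, bad_block);
	mock_user_read_access_end();
	return 0;

bad_block:
	/* A fault inside the block must still close it before bailing out. */
	mock_user_read_access_end();
	return 1;
}
```

The property worth checking is that access_depth is back to zero after both the success path and the failure paths, which is what the separate bad_funct_desc_block and bad_funct_desc labels provide in the patch.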

* [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs
From: Christophe Leroy @ 2021-08-23 15:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

Instructions lmw/stmw are interesting for functions that are rarely
used and not in the cache, because only one instruction is to be
copied into the instruction cache instead of 19. However those
instructions are less performant than 19x raw lwz/stw as they require
synchronisation plus one additional cycle.

SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are
mostly in interrupts entries/exits and in task switch so they are
likely already in the cache.

Using standard lwz improves null_syscall selftest by:
- 10 cycles on mpc832x.
- 2 cycles on mpc8xx.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/ppc_asm.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index ffe712307e11..349fc0ec0dbb 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -28,8 +28,8 @@
 #else
 #define SAVE_GPR(n, base)	stw	n,GPR0+4*(n)(base)
 #define REST_GPR(n, base)	lwz	n,GPR0+4*(n)(base)
-#define SAVE_NVGPRS(base)	stmw	13, GPR0+4*13(base)
-#define REST_NVGPRS(base)	lmw	13, GPR0+4*13(base)
+#define SAVE_NVGPRS(base)	SAVE_GPR(13, base); SAVE_8GPRS(14, base); SAVE_10GPRS(22, base)
+#define REST_NVGPRS(base)	REST_GPR(13, base); REST_8GPRS(14, base); REST_10GPRS(22, base)
 #endif
 
 #define SAVE_2GPRS(n, base)	SAVE_GPR(n, base); SAVE_GPR(n+1, base)
-- 
2.25.0


^ permalink raw reply related
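The new SAVE_NVGPRS() definition above should expand to one store per register from r13 to r31, 19 in total. That can be checked with a C model in which each SAVE_GPR() merely counts a hit. The sketch assumes the 4/8/10-register helpers are composed as shown here, in the same style as the SAVE_2GPRS() line visible in the diff context; SAVE_GPR() itself is redefined for counting and is not the real assembler macro.

```c
/* Counting model: a "store" of register n just bumps base[n]. */
#define SAVE_GPR(n, base)	((base)[(n)]++)
#define SAVE_2GPRS(n, base)	SAVE_GPR(n, base); SAVE_GPR(n+1, base)
#define SAVE_4GPRS(n, base)	SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base)
#define SAVE_8GPRS(n, base)	SAVE_4GPRS(n, base); SAVE_4GPRS(n+4, base)
#define SAVE_10GPRS(n, base)	SAVE_8GPRS(n, base); SAVE_2GPRS(n+8, base)
/* Same composition as the patched definition. */
#define SAVE_NVGPRS(base)	SAVE_GPR(13, base); SAVE_8GPRS(14, base); SAVE_10GPRS(22, base)

/* Records which register numbers the expansion touches. */
static void count_nvgpr_stores(int hits[32])
{
	SAVE_NVGPRS(hits);
}
```

After a call on a zeroed array, entries 13..31 are each 1 and entries 0..12 stay 0, matching the "19x raw lwz/stw" mentioned in the commit message.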

* [PATCH v2] powerpc/booke: Avoid link stack corruption in several places
From: Christophe Leroy @ 2021-08-23 15:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

Use bcl 20,31,+4 instead of bl in order to preserve link stack.

See commit c974809a26a1 ("powerpc/vdso: Avoid link stack corruption
in __get_datapage()") for details.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v2: Added missing ; in LOAD_REG_ADDR_PIC()
---
 arch/powerpc/include/asm/ppc_asm.h            | 2 +-
 arch/powerpc/kernel/exceptions-64e.S          | 6 +++---
 arch/powerpc/kernel/fsl_booke_entry_mapping.S | 8 ++++----
 arch/powerpc/kernel/head_44x.S                | 6 +++---
 arch/powerpc/kernel/head_fsl_booke.S          | 6 +++---
 arch/powerpc/mm/nohash/tlb_low.S              | 4 ++--
 6 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index a771588bb39e..057e5d7c41fb 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -265,7 +265,7 @@ GLUE(.,name):
 
 /* Be careful, this will clobber the lr register. */
 #define LOAD_REG_ADDR_PIC(reg, name)		\
-	bl	0f;				\
+	bcl	20,31,0f;			\
 0:	mflr	reg;				\
 	addis	reg,reg,(name - 0b)@ha;		\
 	addi	reg,reg,(name - 0b)@l;
diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 1401787b0b93..0a1835a0ec12 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -1127,7 +1127,7 @@ found_iprot:
  * r3 = MAS0_TLBSEL (for the iprot array)
  * r4 = SPRN_TLBnCFG
  */
-	bl	invstr				/* Find our address */
+	bcl	20,31,invstr			/* Find our address */
 invstr:	mflr	r6				/* Make it accessible */
 	mfmsr	r7
 	rlwinm	r5,r7,27,31,31			/* extract MSR[IS] */
@@ -1196,7 +1196,7 @@ skpinv:	addi	r6,r6,1				/* Increment */
 	mfmsr	r6
 	xori	r6,r6,MSR_IS
 	mtspr	SPRN_SRR1,r6
-	bl	1f		/* Find our address */
+	bcl	20,31,1f	/* Find our address */
 1:	mflr	r6
 	addi	r6,r6,(2f - 1b)
 	mtspr	SPRN_SRR0,r6
@@ -1256,7 +1256,7 @@ skpinv:	addi	r6,r6,1				/* Increment */
  * r4 = MAS0 w/TLBSEL & ESEL for the temp mapping
  */
 	/* Now we branch the new virtual address mapped by this entry */
-	bl	1f		/* Find our address */
+	bcl	20,31,1f	/* Find our address */
 1:	mflr	r6
 	addi	r6,r6,(2f - 1b)
 	tovirt(r6,r6)
diff --git a/arch/powerpc/kernel/fsl_booke_entry_mapping.S b/arch/powerpc/kernel/fsl_booke_entry_mapping.S
index 8bccce6544b5..a9e2235f6c40 100644
--- a/arch/powerpc/kernel/fsl_booke_entry_mapping.S
+++ b/arch/powerpc/kernel/fsl_booke_entry_mapping.S
@@ -1,7 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
 /* 1. Find the index of the entry we're executing in */
-	bl	invstr				/* Find our address */
+	bcl	20,31,invstr				/* Find our address */
 invstr:	mflr	r6				/* Make it accessible */
 	mfmsr	r7
 	rlwinm	r4,r7,27,31,31			/* extract MSR[IS] */
@@ -85,7 +85,7 @@ skpinv:	addi	r6,r6,1				/* Increment */
 	addi	r6,r6,10
 	slw	r6,r8,r6	/* convert to mask */
 
-	bl	1f		/* Find our address */
+	bcl	20,31,1f	/* Find our address */
 1:	mflr	r7
 
 	mfspr	r8,SPRN_MAS3
@@ -117,7 +117,7 @@ skpinv:	addi	r6,r6,1				/* Increment */
 
 	xori	r6,r4,1
 	slwi	r6,r6,5		/* setup new context with other address space */
-	bl	1f		/* Find our address */
+	bcl	20,31,1f	/* Find our address */
 1:	mflr	r9
 	rlwimi	r7,r9,0,20,31
 	addi	r7,r7,(2f - 1b)
@@ -207,7 +207,7 @@ next_tlb_setup:
 
 	lis	r7,MSR_KERNEL@h
 	ori	r7,r7,MSR_KERNEL@l
-	bl	1f			/* Find our address */
+	bcl	20,31,1f		/* Find our address */
 1:	mflr	r9
 	rlwimi	r6,r9,0,20,31
 	addi	r6,r6,(2f - 1b)
diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S
index ddc978a2d381..b14efa87d1cf 100644
--- a/arch/powerpc/kernel/head_44x.S
+++ b/arch/powerpc/kernel/head_44x.S
@@ -70,7 +70,7 @@ _ENTRY(_start);
  * address.
  * r21 will be loaded with the physical runtime address of _stext
  */
-	bl	0f				/* Get our runtime address */
+	bcl	20,31,0f			/* Get our runtime address */
 0:	mflr	r21				/* Make it accessible */
 	addis	r21,r21,(_stext - 0b)@ha
 	addi	r21,r21,(_stext - 0b)@l 	/* Get our current runtime base */
@@ -853,7 +853,7 @@ _GLOBAL(init_cpu_state)
 wmmucr:	mtspr	SPRN_MMUCR,r3			/* Put MMUCR */
 	sync
 
-	bl	invstr				/* Find our address */
+	bcl	20,31,invstr			/* Find our address */
 invstr:	mflr	r5				/* Make it accessible */
 	tlbsx	r23,0,r5			/* Find entry we are in */
 	li	r4,0				/* Start at TLB entry 0 */
@@ -1045,7 +1045,7 @@ head_start_47x:
 	sync
 
 	/* Find the entry we are running from */
-	bl	1f
+	bcl	20,31,1f
 1:	mflr	r23
 	tlbsx	r23,0,r23
 	tlbre	r24,r23,0
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index 0f9642f36b49..dd197da2ffcc 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -79,7 +79,7 @@ _ENTRY(_start);
 	mr	r23,r3
 	mr	r25,r4
 
-	bl	0f
+	bcl	20,31,0f
 0:	mflr	r8
 	addis	r3,r8,(is_second_reloc - 0b)@ha
 	lwz	r19,(is_second_reloc - 0b)@l(r3)
@@ -1132,7 +1132,7 @@ _GLOBAL(switch_to_as1)
 	bne	1b
 
 	/* Get the tlb entry used by the current running code */
-	bl	0f
+	bcl	20,31,0f
 0:	mflr	r4
 	tlbsx	0,r4
 
@@ -1166,7 +1166,7 @@ _GLOBAL(switch_to_as1)
 _GLOBAL(restore_to_as0)
 	mflr	r0
 
-	bl	0f
+	bcl	20,31,0f
 0:	mflr	r9
 	addi	r9,r9,1f - 0b
 
diff --git a/arch/powerpc/mm/nohash/tlb_low.S b/arch/powerpc/mm/nohash/tlb_low.S
index 4613bf8e9aae..8b225a3df7e3 100644
--- a/arch/powerpc/mm/nohash/tlb_low.S
+++ b/arch/powerpc/mm/nohash/tlb_low.S
@@ -199,7 +199,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_476_DD2)
  * Touch enough instruction cache lines to ensure cache hits
  */
 1:	mflr	r9
-	bl	2f
+	bcl	20,31,2f
 2:	mflr	r6
 	li	r7,32
 	PPC_ICBT(0,R6,R7)		/* touch next cache line */
@@ -414,7 +414,7 @@ _GLOBAL(loadcam_multi)
 	 * Set up temporary TLB entry that is the same as what we're
 	 * running from, but in AS=1.
 	 */
-	bl	1f
+	bcl	20,31,1f
 1:	mflr	r6
 	tlbsx	0,r8
 	mfspr	r6,SPRN_MAS1
-- 
2.25.0


^ permalink raw reply related

* [PATCH v3 2/2] powerpc/32s: Save content of sr0 to avoid 'mfsr'
From: Christophe Leroy @ 2021-08-23 15:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <343d06ab9ea7b8b6358308753ec1e8fa7e3d4f11.1629731144.git.christophe.leroy@csgroup.eu>

Calling 'mfsr' to get the content of segment registers is costly;
in addition, it requires clearing the 'reserved' bits.

To avoid this operation, save the sr0 value in the mm context and in
the thread struct.

The saved sr0 is the one used by the kernel, which means that on
entry, when locking, it can be used as is.

For unlocking, the only thing to do is to clear SR_NX.
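
As a rough illustration of the two statements above, the lock and unlock
values can be derived from one cached word. This is only a sketch in plain C,
not the kernel code: make_saved_sr0(), sr0_for_lock() and sr0_for_unlock() are
hypothetical names, SR_KS is the value visible in the mmu-hash.h context of
this thread, and SR_NX is assumed to be 0x10000000.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SR_NX	0x10000000u	/* No Execute (assumed value) */
#define SR_KS	0x40000000u	/* Supervisor key, as shown in mmu-hash.h */

/* Hypothetical helper: the value cached once, when the context is set up. */
static uint32_t make_saved_sr0(uint32_t vsid, bool kuep, bool kuap)
{
	uint32_t sr0 = vsid;

	/* Sketch precondition: the VSID does not overlap the flag bits. */
	assert((vsid & (SR_NX | SR_KS)) == 0);

	if (kuep)
		sr0 |= SR_NX;
	if (kuap)
		sr0 |= SR_KS;
	return sr0;
}

/* On kernel entry the cached value already has SR_NX set: use it as is. */
static uint32_t sr0_for_lock(uint32_t saved_sr0)
{
	return saved_sr0;
}

/* On return to userspace only SR_NX has to be cleared. */
static uint32_t sr0_for_unlock(uint32_t saved_sr0)
{
	return saved_sr0 & ~SR_NX;
}
```

Neither path needs to read a segment register or mask reserved bits, which is
where the saving described in this changelog would come from.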

This improves null_syscall selftest by 12 cycles, i.e. 4%.

The capability to deactivate KUEP at boot time is re-enabled by this patch.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3: Simplified patching implied by simplified preceding patch
---
 arch/powerpc/include/asm/book3s/32/kup.h      |  2 ++
 arch/powerpc/include/asm/book3s/32/mmu-hash.h |  1 +
 arch/powerpc/include/asm/processor.h          |  1 +
 arch/powerpc/kernel/entry_32.S                | 11 +++++----
 arch/powerpc/mm/book3s32/kuap.c               |  5 +++-
 arch/powerpc/mm/book3s32/kuep.c               | 24 ++++++++++++-------
 arch/powerpc/mm/book3s32/mmu_context.c        | 15 ++++++------
 arch/powerpc/mm/mmu_context.c                 |  3 +++
 8 files changed, 40 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h
index f159efd04ebc..f03fe357471f 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -14,6 +14,8 @@
 extern struct static_key_false disable_kuap_key;
 extern struct static_key_false disable_kuep_key;
 
+extern s32 patch__kuep_lock, patch__kuep_unlock;
+
 static __always_inline bool kuap_is_disabled(void)
 {
 	return !IS_ENABLED(CONFIG_PPC_KUAP) || static_branch_unlikely(&disable_kuap_key);
diff --git a/arch/powerpc/include/asm/book3s/32/mmu-hash.h b/arch/powerpc/include/asm/book3s/32/mmu-hash.h
index e2f7ccc13edb..ecc148c1e795 100644
--- a/arch/powerpc/include/asm/book3s/32/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/32/mmu-hash.h
@@ -175,6 +175,7 @@ struct hash_pte {
 
 typedef struct {
 	unsigned long id;
+	unsigned long sr0;
 	void __user *vdso;
 } mm_context_t;
 
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index f348e564f7dd..4b13f94a4f42 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -157,6 +157,7 @@ struct thread_struct {
 #ifdef CONFIG_PPC_BOOK3S_32
 	unsigned long	r0, r3, r4, r5, r6, r8, r9, r11;
 	unsigned long	lr, ctr;
+	unsigned long	sr0;
 #endif
 #endif /* CONFIG_PPC32 */
 	/* Debug Registers */
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 14269313d5dd..784be0a0dd9d 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -33,6 +33,7 @@
 #include <asm/kup.h>
 #include <asm/bug.h>
 #include <asm/interrupt.h>
+#include <asm/code-patching-asm.h>
 
 #include "head_32.h"
 
@@ -76,17 +77,17 @@ _ASM_NOKPROBE_SYMBOL(prepare_transfer_to_handler)
 #if defined(CONFIG_PPC_KUEP) && defined(CONFIG_PPC_BOOK3S_32)
 	.globl	__kuep_lock
 __kuep_lock:
-	mfsr    r9,0
-	rlwinm  r9,r9,0,8,3
-	oris    r9,r9,SR_NX@h
+0:	blr	/* lwz	r9, current->thread.sr0(r2) */
 	update_user_segments_by_4 r9, r10, r11, r12
 	blr
+	patch_site	0b, patch__kuep_lock
 
 __kuep_unlock:
-	mfsr    r9,0
-	rlwinm  r9,r9,0,8,2
+0:	blr	/* lwz	r9, current->thread.sr0(r2) */
+	rlwinm  r9,r9,0,~SR_NX
 	update_user_segments_by_4 r9, r10, r11, r12
 	blr
+	patch_site	0b, patch__kuep_unlock
 
 .macro	kuep_lock
 	bl	__kuep_lock
diff --git a/arch/powerpc/mm/book3s32/kuap.c b/arch/powerpc/mm/book3s32/kuap.c
index 0f920f09af57..28676cabb005 100644
--- a/arch/powerpc/mm/book3s32/kuap.c
+++ b/arch/powerpc/mm/book3s32/kuap.c
@@ -20,8 +20,11 @@ EXPORT_SYMBOL(kuap_unlock_all_ool);
 
 void setup_kuap(bool disabled)
 {
-	if (!disabled)
+	if (!disabled) {
 		kuap_lock_all_ool();
+		init_mm.context.sr0 |= SR_KS;
+		current->thread.sr0 |= SR_KS;
+	}
 
 	if (smp_processor_id() != boot_cpuid)
 		return;
diff --git a/arch/powerpc/mm/book3s32/kuep.c b/arch/powerpc/mm/book3s32/kuep.c
index 45c9967f9aef..0be25492b42d 100644
--- a/arch/powerpc/mm/book3s32/kuep.c
+++ b/arch/powerpc/mm/book3s32/kuep.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 
+#include <asm/code-patching.h>
 #include <asm/kup.h>
 #include <asm/smp.h>
 
@@ -7,19 +8,26 @@ struct static_key_false disable_kuep_key;
 
 void setup_kuep(bool disabled)
 {
-	if (disabled) {
-		pr_info("KUEP cannot be disabled for the time being\n");
-		disabled = false;
-	}
+	u32 insn;
 
-	if (!disabled)
-		update_user_segments(mfsr(0) | SR_NX);
+	if (!disabled) {
+		init_mm.context.sr0 |= SR_NX;
+		current->thread.sr0 |= SR_NX;
+		update_user_segments(init_mm.context.sr0);
+	}
 
 	if (smp_processor_id() != boot_cpuid)
 		return;
 
 	if (disabled)
 		static_branch_enable(&disable_kuep_key);
-	else
-		pr_info("Activating Kernel Userspace Execution Prevention\n");
+
+	if (disabled)
+		return;
+
+	insn = PPC_RAW_LWZ(_R9, _R2, offsetof(struct task_struct, thread.sr0));
+	patch_instruction_site(&patch__kuep_lock, ppc_inst(insn));
+	patch_instruction_site(&patch__kuep_unlock, ppc_inst(insn));
+
+	pr_info("Activating Kernel Userspace Execution Prevention\n");
 }
diff --git a/arch/powerpc/mm/book3s32/mmu_context.c b/arch/powerpc/mm/book3s32/mmu_context.c
index e2708e387dc3..269a3eb25a73 100644
--- a/arch/powerpc/mm/book3s32/mmu_context.c
+++ b/arch/powerpc/mm/book3s32/mmu_context.c
@@ -69,6 +69,12 @@ EXPORT_SYMBOL_GPL(__init_new_context);
 int init_new_context(struct task_struct *t, struct mm_struct *mm)
 {
 	mm->context.id = __init_new_context();
+	mm->context.sr0 = CTX_TO_VSID(mm->context.id, 0);
+
+	if (!kuep_is_disabled())
+		mm->context.sr0 |= SR_NX;
+	if (!kuap_is_disabled())
+		mm->context.sr0 |= SR_KS;
 
 	return 0;
 }
@@ -108,20 +114,13 @@ void __init mmu_context_init(void)
 void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk)
 {
 	long id = next->context.id;
-	unsigned long val;
 
 	if (id < 0)
 		panic("mm_struct %p has no context ID", next);
 
 	isync();
 
-	val = CTX_TO_VSID(id, 0);
-	if (!kuep_is_disabled())
-		val |= SR_NX;
-	if (!kuap_is_disabled())
-		val |= SR_KS;
-
-	update_user_segments(val);
+	update_user_segments(next->context.sr0);
 
 	if (IS_ENABLED(CONFIG_BDI_SWITCH))
 		abatron_pteptrs[1] = next->pgd;
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 74246536b832..e618d5442a28 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -18,6 +18,9 @@ static inline void switch_mm_pgdir(struct task_struct *tsk,
 {
 	/* 32-bit keeps track of the current PGDIR in the thread struct */
 	tsk->thread.pgdir = mm->pgd;
+#ifdef CONFIG_PPC_BOOK3S_32
+	tsk->thread.sr0 = mm->context.sr0;
+#endif
 }
 #elif defined(CONFIG_PPC_BOOK3E_64)
 static inline void switch_mm_pgdir(struct task_struct *tsk,
-- 
2.25.0


^ permalink raw reply related

* [PATCH v3 1/2] powerpc/32s: Do kuep_lock() and kuep_unlock() in assembly
From: Christophe Leroy @ 2021-08-23 15:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

When interrupt and syscall entries were converted to C, KUEP locking
and unlocking was also converted. It improved performance by unrolling
the loop, and made it easy to implement boot time deactivation of
KUEP.

However, null_syscall selftest shows that KUEP is still heavy
(361 cycles with KUEP, 212 cycles without).

One way to improve further is to group 'mtsr's together, instead of
repeating 'addi' + 'mtsr' several times.

In order to do that, more registers need to be available. In C, GCC
will always be able to provide the requested number of registers, but
at the cost of saving some data on the stack, which is
counterproductive here.

So let's do it in assembly, where we have full control over which
registers can be used. It also has the advantage of locking earlier
and unlocking later, and it helps GCC generate less tricky code.
The only drawback is that it makes boot time deactivation less
straightforward and requires 'hand' instruction patching.

Group 'mtsr's by 4.
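
To make the grouping easier to follow, here is a small C model of the values
involved. It is a sketch, not the assembly: it assumes all 16 segments are
written (the real macro stops at NUM_USER_SEGMENTS), takes the 0x111 stride
and the 0x444 step from update_user_segments_by_4 in this patch, and uses the
hypothetical names sr_values_sequential() and sr_values_by_4().

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SEGS 16	/* assumption: model all 16 segment registers */

/* Reference: segment i receives sr0 + i * 0x111, one at a time. */
static void sr_values_sequential(uint32_t sr0, uint32_t out[NUM_SEGS])
{
	for (int i = 0; i < NUM_SEGS; i++)
		out[i] = sr0 + 0x111u * (uint32_t)i;
}

/* Grouped by 4: four temporaries, four writes, then four additions. */
static void sr_values_by_4(uint32_t sr0, uint32_t out[NUM_SEGS])
{
	uint32_t t[4] = { sr0, sr0 + 0x111u, sr0 + 0x222u, sr0 + 0x333u };

	assert(NUM_SEGS % 4 == 0);

	for (int base = 0; base < NUM_SEGS; base += 4) {
		for (int k = 0; k < 4; k++)
			out[base + k] = t[k];	/* stands in for 4 x 'mtsr' */
		for (int k = 0; k < 4; k++)
			t[k] += 0x444u;		/* stands in for 4 x 'addi' */
	}
}
```

Both functions fill the array with the same values; the grouped form only
changes the order of the additions and writes, and it needs four scratch
registers instead of one.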

With this change, null_syscall selftest reports 336 cycles. Without
the change it was 361 cycles, so that's a 7% reduction.

For the time being, the capability to deactivate at boot time is disabled.
It will be re-enabled in the following patch.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
v3:
- Add isync after updating segments
- Only group by 4. Grouping by 6 only saves one more cycle.
- Implement subfunctions kuep_lock and kuep_unlock.

v2: Fixed build failure for non book3s/32
---
 arch/powerpc/include/asm/book3s/32/kup.h      | 34 --------
 arch/powerpc/include/asm/book3s/32/mmu-hash.h | 77 ++++++++++++++++++-
 arch/powerpc/include/asm/interrupt.h          |  6 +-
 arch/powerpc/include/asm/kup.h                |  5 --
 arch/powerpc/kernel/entry_32.S                | 31 ++++++++
 arch/powerpc/kernel/head_32.h                 |  6 ++
 arch/powerpc/kernel/interrupt.c               |  3 -
 arch/powerpc/mm/book3s32/kuep.c               |  7 +-
 8 files changed, 121 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h b/arch/powerpc/include/asm/book3s/32/kup.h
index d4b145b279f6..f159efd04ebc 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -24,40 +24,6 @@ static __always_inline bool kuep_is_disabled(void)
 	return !IS_ENABLED(CONFIG_PPC_KUEP) || static_branch_unlikely(&disable_kuep_key);
 }
 
-static inline void kuep_lock(void)
-{
-	if (kuep_is_disabled())
-		return;
-
-	update_user_segments(mfsr(0) | SR_NX);
-	/*
-	 * This isync() shouldn't be necessary as the kernel is not excepted to
-	 * run any instruction in userspace soon after the update of segments,
-	 * but hash based cores (at least G3) seem to exhibit a random
-	 * behaviour when the 'isync' is not there. 603 cores don't have this
-	 * behaviour so don't do the 'isync' as it saves several CPU cycles.
-	 */
-	if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
-		isync();	/* Context sync required after mtsr() */
-}
-
-static inline void kuep_unlock(void)
-{
-	if (kuep_is_disabled())
-		return;
-
-	update_user_segments(mfsr(0) & ~SR_NX);
-	/*
-	 * This isync() shouldn't be necessary as a 'rfi' will soon be executed
-	 * to return to userspace, but hash based cores (at least G3) seem to
-	 * exhibit a random behaviour when the 'isync' is not there. 603 cores
-	 * don't have this behaviour so don't do the 'isync' as it saves several
-	 * CPU cycles.
-	 */
-	if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
-		isync();	/* Context sync required after mtsr() */
-}
-
 #ifdef CONFIG_PPC_KUAP
 
 #include <linux/sched.h>
diff --git a/arch/powerpc/include/asm/book3s/32/mmu-hash.h b/arch/powerpc/include/asm/book3s/32/mmu-hash.h
index f5be185cbdf8..e2f7ccc13edb 100644
--- a/arch/powerpc/include/asm/book3s/32/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/32/mmu-hash.h
@@ -64,7 +64,82 @@ struct ppc_bat {
 #define SR_KP	0x20000000	/* User key */
 #define SR_KS	0x40000000	/* Supervisor key */
 
-#ifndef __ASSEMBLY__
+#ifdef __ASSEMBLY__
+
+#include <asm/asm-offsets.h>
+
+.macro uus_addi sr reg1 reg2 imm
+	.if NUM_USER_SEGMENTS > \sr
+	addi	\reg1,\reg2,\imm
+	.endif
+.endm
+
+.macro uus_mtsr sr reg1
+	.if NUM_USER_SEGMENTS > \sr
+	mtsr	\sr, \reg1
+	.endif
+.endm
+
+/*
+ * This isync() shouldn't be necessary as the kernel is not excepted to run
+ * any instruction in userspace soon after the update of segments and 'rfi'
+ * instruction is used to return to userspace, but hash based cores
+ * (at least G3) seem to exhibit a random behaviour when the 'isync' is not
+ * there. 603 cores don't have this behaviour so don't do the 'isync' as it
+ * saves several CPU cycles.
+ */
+.macro uus_isync
+#ifdef CONFIG_PPC_BOOK3S_604
+BEGIN_MMU_FTR_SECTION
+	isync
+END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
+#endif
+.endm
+
+.macro update_user_segments_by_4 tmp1 tmp2 tmp3 tmp4
+	uus_addi	1, \tmp2, \tmp1, 0x111
+	uus_addi	2, \tmp3, \tmp1, 0x222
+	uus_addi	3, \tmp4, \tmp1, 0x333
+
+	uus_mtsr	0, \tmp1
+	uus_mtsr	1, \tmp2
+	uus_mtsr	2, \tmp3
+	uus_mtsr	3, \tmp4
+
+	uus_addi	4, \tmp1, \tmp1, 0x444
+	uus_addi	5, \tmp2, \tmp2, 0x444
+	uus_addi	6, \tmp3, \tmp3, 0x444
+	uus_addi	7, \tmp4, \tmp4, 0x444
+
+	uus_mtsr	4, \tmp1
+	uus_mtsr	5, \tmp2
+	uus_mtsr	6, \tmp3
+	uus_mtsr	7, \tmp4
+
+	uus_addi	8, \tmp1, \tmp1, 0x444
+	uus_addi	9, \tmp2, \tmp2, 0x444
+	uus_addi	10, \tmp3, \tmp3, 0x444
+	uus_addi	11, \tmp4, \tmp4, 0x444
+
+	uus_mtsr	8, \tmp1
+	uus_mtsr	9, \tmp2
+	uus_mtsr	10, \tmp3
+	uus_mtsr	11, \tmp4
+
+	uus_addi	12, \tmp1, \tmp1, 0x444
+	uus_addi	13, \tmp2, \tmp2, 0x444
+	uus_addi	14, \tmp3, \tmp3, 0x444
+	uus_addi	15, \tmp4, \tmp4, 0x444
+
+	uus_mtsr	12, \tmp1
+	uus_mtsr	13, \tmp2
+	uus_mtsr	14, \tmp3
+	uus_mtsr	15, \tmp4
+
+	uus_isync
+.endm
+
+#else
 
 /*
  * This macro defines the mapping from contexts to VSIDs (virtual
diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
index 6b800d3e2681..03afc4e7928e 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -139,12 +139,10 @@ static inline void interrupt_enter_prepare(struct pt_regs *regs, struct interrup
 	if (!arch_irq_disabled_regs(regs))
 		trace_hardirqs_off();
 
-	if (user_mode(regs)) {
-		kuep_lock();
+	if (user_mode(regs))
 		account_cpu_user_entry();
-	} else {
+	else
 		kuap_save_and_lock(regs);
-	}
 #endif
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 1df763002726..34ff86e3686e 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -38,11 +38,6 @@ void setup_kuep(bool disabled);
 static inline void setup_kuep(bool disabled) { }
 #endif /* CONFIG_PPC_KUEP */
 
-#ifndef CONFIG_PPC_BOOK3S_32
-static inline void kuep_lock(void) { }
-static inline void kuep_unlock(void) { }
-#endif
-
 #ifdef CONFIG_PPC_KUAP
 void setup_kuap(bool disabled);
 #else
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 0273a1349006..14269313d5dd 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -73,6 +73,34 @@ prepare_transfer_to_handler:
 _ASM_NOKPROBE_SYMBOL(prepare_transfer_to_handler)
 #endif /* CONFIG_PPC_BOOK3S_32 || CONFIG_E500 */
 
+#if defined(CONFIG_PPC_KUEP) && defined(CONFIG_PPC_BOOK3S_32)
+	.globl	__kuep_lock
+__kuep_lock:
+	mfsr    r9,0
+	rlwinm  r9,r9,0,8,3
+	oris    r9,r9,SR_NX@h
+	update_user_segments_by_4 r9, r10, r11, r12
+	blr
+
+__kuep_unlock:
+	mfsr    r9,0
+	rlwinm  r9,r9,0,8,2
+	update_user_segments_by_4 r9, r10, r11, r12
+	blr
+
+.macro	kuep_lock
+	bl	__kuep_lock
+.endm
+.macro	kuep_unlock
+	bl	__kuep_unlock
+.endm
+#else
+.macro	kuep_lock
+.endm
+.macro	kuep_unlock
+.endm
+#endif
+
 	.globl	transfer_to_syscall
 transfer_to_syscall:
 	stw	r11, GPR1(r1)
@@ -94,6 +122,7 @@ transfer_to_syscall:
 	SAVE_2GPRS(7, r1)
 	addi	r2,r10,-THREAD
 	SAVE_NVGPRS(r1)
+	kuep_lock
 
 	/* Calling convention has r9 = orig r0, r10 = regs */
 	addi	r10,r1,STACK_FRAME_OVERHEAD
@@ -110,6 +139,7 @@ ret_from_syscall:
 	cmplwi	cr0,r5,0
 	bne-	2f
 #endif /* CONFIG_PPC_47x */
+	kuep_unlock
 	lwz	r4,_LINK(r1)
 	lwz	r5,_CCR(r1)
 	mtlr	r4
@@ -273,6 +303,7 @@ interrupt_return:
 	beq	.Lkernel_interrupt_return
 	bl	interrupt_exit_user_prepare
 	cmpwi	r3,0
+	kuep_unlock
 	bne-	.Lrestore_nvgprs
 
 .Lfast_user_interrupt_return:
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 6b1ec9e3541b..133197039775 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -136,6 +136,12 @@ _ASM_NOKPROBE_SYMBOL(\name\()_virt)
 	andi.	r12,r9,MSR_PR
 	bne	777f
 	bl	prepare_transfer_to_handler
+#ifdef CONFIG_PPC_KUEP
+	b	778f
+777:
+	bl	__kuep_lock
+778:
+#endif
 777:
 #endif
 .endm
diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index 21bbd615ca41..cd6139003776 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -81,8 +81,6 @@ notrace long system_call_exception(long r3, long r4, long r5,
 {
 	syscall_fn f;
 
-	kuep_lock();
-
 	regs->orig_gpr3 = r3;
 
 	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
@@ -365,7 +363,6 @@ interrupt_exit_user_prepare_main(unsigned long ret, struct pt_regs *regs)
 
 	/* Restore user access locks last */
 	kuap_user_restore(regs);
-	kuep_unlock();
 
 	return ret;
 }
diff --git a/arch/powerpc/mm/book3s32/kuep.c b/arch/powerpc/mm/book3s32/kuep.c
index c20733d6e02c..45c9967f9aef 100644
--- a/arch/powerpc/mm/book3s32/kuep.c
+++ b/arch/powerpc/mm/book3s32/kuep.c
@@ -7,8 +7,13 @@ struct static_key_false disable_kuep_key;
 
 void setup_kuep(bool disabled)
 {
+	if (disabled) {
+		pr_info("KUEP cannot be disabled for the time being\n");
+		disabled = false;
+	}
+
 	if (!disabled)
-		kuep_lock();
+		update_user_segments(mfsr(0) | SR_NX);
 
 	if (smp_processor_id() != boot_cpuid)
 		return;
-- 
2.25.0


^ permalink raw reply related

* Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
From: Mathieu Desnoyers @ 2021-08-23 15:20 UTC (permalink / raw)
  To: Sean Christopherson, Darren Hart
  Cc: KVM list, Peter Zijlstra, linux-kernel, Will Deacon, Guo Ren,
	linux-kselftest, Ben Gardon, shuah, Paul Mackerras, linux-s390,
	gor, Russell King, ARM Linux, linux-csky, Christian Borntraeger,
	Ingo Molnar, Catalin Marinas, linux-mips, Boqun Feng, paulmck,
	Heiko Carstens, rostedt, Shakeel Butt, Andy Lutomirski,
	Thomas Gleixner, Peter Foley, linux-arm-kernel,
	Thomas Bogendoerfer, Oleg Nesterov, Paolo Bonzini, linuxppc-dev
In-Reply-To: <766990430.21713.1629731934069.JavaMail.zimbra@efficios.com>

[ re-send to Darren Hart ]

----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@google.com wrote:
> 
>> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> without updating rseq, leading to a stale CPU ID and other badness.
>> 
> 
> [...]
> 
> +#define RSEQ_SIG 0xdeadbeef
> 
> Is there any reason for defining a custom signature rather than including
> tools/testing/selftests/rseq/rseq.h ? This should take care of including
> the proper architecture header which will define the appropriate signature.
> 
> Arguably you don't define rseq critical sections in this test per se, but
> I'm wondering why the custom signature here.
> 
> [...]
> 
>> +
>> +static void *migration_worker(void *ign)
>> +{
>> +	cpu_set_t allowed_mask;
>> +	int r, i, nr_cpus, cpu;
>> +
>> +	CPU_ZERO(&allowed_mask);
>> +
>> +	nr_cpus = CPU_COUNT(&possible_mask);
>> +
>> +	for (i = 0; i < 20000; i++) {
>> +		cpu = i % nr_cpus;
>> +		if (!CPU_ISSET(cpu, &possible_mask))
>> +			continue;
>> +
>> +		CPU_SET(cpu, &allowed_mask);
>> +
>> +		/*
>> +		 * Bump the sequence count twice to allow the reader to detect
>> +		 * that a migration may have occurred in between rseq and sched
>> +		 * CPU ID reads.  An odd sequence count indicates a migration
>> +		 * is in-progress, while a completely different count indicates
>> +		 * a migration occurred since the count was last read.
>> +		 */
>> +		atomic_inc(&seq_cnt);
> 
> So technically this atomic_inc contains the required barriers because the selftests
> implementation uses "__sync_add_and_fetch(&addr->val, 1)". But it's rather odd that
> the semantic differs from the kernel implementation in terms of memory barriers: the
> kernel implementation of atomic_inc guarantees no memory barriers, but this one
> happens to provide full barriers pretty much by accident (selftests
> futex/include/atomic.h documents no such guarantee).
> 
> If this full barrier guarantee is indeed provided by the selftests atomic.h header,
> I would really like a comment stating that in the atomic.h header so the carpet is
> not pulled from under our feet by a future optimization.
> 
> 
>> +		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> +		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> +			    errno, strerror(errno));
>> +		atomic_inc(&seq_cnt);
>> +
>> +		CPU_CLR(cpu, &allowed_mask);
>> +
>> +		/*
>> +		 * Let the read-side get back into KVM_RUN to improve the odds
>> +		 * of task migration coinciding with KVM's run loop.
> 
> This comment should be about increasing the odds of letting the seqlock read-side
> complete. Otherwise, the delay between the two back-to-back atomic_inc is so small
> that the seqlock read-side may never have time to complete reading the rseq
> cpu id and the sched_getcpu() call, and can retry forever.
> 
> I'm wondering if 1 microsecond is sufficient on other architectures as well. One
> alternative way to make this depend less on the architecture's implementation of
> sched_getcpu (whether it's a vDSO, or goes through a syscall) would be to read
> the rseq cpu id and call sched_getcpu a few times (e.g. 3 times) in the migration
> thread rather than use usleep, and throw away the value read. This would ensure
> the delay is appropriate on all architectures.
> 
> Thanks!
> 
> Mathieu
> 
>> +		 */
>> +		usleep(1);
>> +	}
>> +	done = true;
>> +	return NULL;
>> +}
>> +
>> +int main(int argc, char *argv[])
>> +{
>> +	struct kvm_vm *vm;
>> +	u32 cpu, rseq_cpu;
>> +	int r, snapshot;
>> +
>> +	/* Tell stdout not to buffer its content */
>> +	setbuf(stdout, NULL);
>> +
>> +	r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
>> +	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
>> +		    strerror(errno));
>> +
>> +	if (CPU_COUNT(&possible_mask) < 2) {
>> +		print_skip("Only one CPU, task migration not possible\n");
>> +		exit(KSFT_SKIP);
>> +	}
>> +
>> +	sys_rseq(0);
>> +
>> +	/*
>> +	 * Create and run a dummy VM that immediately exits to userspace via
>> +	 * GUEST_SYNC, while concurrently migrating the process by setting its
>> +	 * CPU affinity.
>> +	 */
>> +	vm = vm_create_default(VCPU_ID, 0, guest_code);
>> +
>> +	pthread_create(&migration_thread, NULL, migration_worker, 0);
>> +
>> +	while (!done) {
>> +		vcpu_run(vm, VCPU_ID);
>> +		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
>> +			    "Guest failed?");
>> +
>> +		/*
>> +		 * Verify rseq's CPU matches sched's CPU.  Ensure migration
>> +		 * doesn't occur between sched_getcpu() and reading the rseq
>> +		 * cpu_id by rereading both if the sequence count changes, or
>> +		 * if the count is odd (migration in-progress).
>> +		 */
>> +		do {
>> +			/*
>> +			 * Drop bit 0 to force a mismatch if the count is odd,
>> +			 * i.e. if a migration is in-progress.
>> +			 */
>> +			snapshot = atomic_read(&seq_cnt) & ~1;
>> +			smp_rmb();
>> +			cpu = sched_getcpu();
>> +			rseq_cpu = READ_ONCE(__rseq.cpu_id);
>> +			smp_rmb();
>> +		} while (snapshot != atomic_read(&seq_cnt));
>> +
>> +		TEST_ASSERT(rseq_cpu == cpu,
>> +			    "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
>> +	}
>> +
>> +	pthread_join(migration_thread, NULL);
>> +
>> +	kvm_vm_free(vm);
>> +
>> +	sys_rseq(RSEQ_FLAG_UNREGISTER);
>> +
>> +	return 0;
>> +}
>> --
>> 2.33.0.rc2.250.ged5fa647cd-goog
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
From: Mathieu Desnoyers @ 2021-08-23 15:18 UTC (permalink / raw)
  To: Sean Christopherson, Darren Hart
  Cc: KVM list, Peter Zijlstra, linux-kernel, Will Deacon, Guo Ren,
	linux-kselftest, Ben Gardon, shuah, Paul Mackerras, linux-s390,
	gor, Russell King, ARM Linux, linux-csky, Christian Borntraeger,
	Ingo Molnar, Catalin Marinas, linux-mips, Boqun Feng, paulmck,
	Heiko Carstens, rostedt, Shakeel Butt, Andy Lutomirski,
	Thomas Gleixner, Peter Foley, linux-arm-kernel,
	Thomas Bogendoerfer, Oleg Nesterov, Paolo Bonzini, linuxppc-dev
In-Reply-To: <20210820225002.310652-5-seanjc@google.com>

----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@google.com wrote:

> Add a test to verify an rseq's CPU ID is updated correctly if the task is
> migrated while the kernel is handling KVM_RUN.  This is a regression test
> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
> without updating rseq, leading to a stale CPU ID and other badness.
> 

[...]

+#define RSEQ_SIG 0xdeadbeef

Is there any reason for defining a custom signature rather than including
tools/testing/selftests/rseq/rseq.h ? This should take care of including
the proper architecture header which will define the appropriate signature.

Arguably you don't define rseq critical sections in this test per se, but
I'm wondering why the custom signature here.

[...]

> +
> +static void *migration_worker(void *ign)
> +{
> +	cpu_set_t allowed_mask;
> +	int r, i, nr_cpus, cpu;
> +
> +	CPU_ZERO(&allowed_mask);
> +
> +	nr_cpus = CPU_COUNT(&possible_mask);
> +
> +	for (i = 0; i < 20000; i++) {
> +		cpu = i % nr_cpus;
> +		if (!CPU_ISSET(cpu, &possible_mask))
> +			continue;
> +
> +		CPU_SET(cpu, &allowed_mask);
> +
> +		/*
> +		 * Bump the sequence count twice to allow the reader to detect
> +		 * that a migration may have occurred in between rseq and sched
> +		 * CPU ID reads.  An odd sequence count indicates a migration
> +		 * is in-progress, while a completely different count indicates
> +		 * a migration occurred since the count was last read.
> +		 */
> +		atomic_inc(&seq_cnt);

So technically this atomic_inc contains the required barriers because the selftests
implementation uses "__sync_add_and_fetch(&addr->val, 1)". But it's rather odd that
the semantic differs from the kernel implementation in terms of memory barriers: the
kernel implementation of atomic_inc guarantees no memory barriers, but this one
happens to provide full barriers pretty much by accident (selftests
futex/include/atomic.h documents no such guarantee).

If this full barrier guarantee is indeed provided by the selftests atomic.h header,
I would really like a comment stating that in the atomic.h header so the carpet is
not pulled from under our feet by a future optimization.
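
For illustration only, here is what the sequence counter could look like if
the ordering were spelled out rather than inherited from the atomic helper.
This sketch assumes C11 <stdatomic.h> instead of the selftests atomic.h
header, and seq_write_begin(), seq_write_end(), seq_read_begin() and
seq_read_retry() are hypothetical names, not existing selftests helpers.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int seq_cnt;

/* Writer: count becomes odd; the full barrier is requested explicitly. */
static void seq_write_begin(void)
{
	int prev = atomic_fetch_add_explicit(&seq_cnt, 1, memory_order_seq_cst);

	assert(!(prev & 1));	/* no write section may already be open */
}

/* Writer: count becomes even again, also with a full barrier. */
static void seq_write_end(void)
{
	int prev = atomic_fetch_add_explicit(&seq_cnt, 1, memory_order_seq_cst);

	assert(prev & 1);	/* must close an open write section */
}

/* Reader: drop bit 0 so that an in-progress write forces a mismatch. */
static int seq_read_begin(void)
{
	return atomic_load_explicit(&seq_cnt, memory_order_acquire) & ~1;
}

/* Reader: order the data reads before re-reading the counter. */
static int seq_read_retry(int snapshot)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&seq_cnt, memory_order_relaxed) != snapshot;
}
```

With the memory order named at each call site, a later change to the
underlying helper cannot silently weaken the ordering the test relies on.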


> +		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
> +		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
> +			    errno, strerror(errno));
> +		atomic_inc(&seq_cnt);
> +
> +		CPU_CLR(cpu, &allowed_mask);
> +
> +		/*
> +		 * Let the read-side get back into KVM_RUN to improve the odds
> +		 * of task migration coinciding with KVM's run loop.

This comment should be about increasing the odds of letting the seqlock read-side
complete. Otherwise, the delay between the two back-to-back atomic_inc is so small
that the seqlock read-side may never have time to complete reading the rseq
cpu id and the sched_getcpu() call, and can retry forever.

I'm wondering if 1 microsecond is sufficient on other architectures as well. One
alternative way to make this depend less on the architecture's implementation of
sched_getcpu (whether it's a vDSO, or goes through a syscall) would be to read
the rseq cpu id and call sched_getcpu a few times (e.g. 3 times) in the migration
thread rather than use usleep, and throw away the value read. This would ensure
the delay is appropriate on all architectures.
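
A minimal sketch of that alternative, assuming glibc's sched_getcpu() and a
hypothetical helper name settle_delay(). The rseq cpu id read is left out
here because it depends on how the test registers its rseq area:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Burn roughly the time the read side needs by doing the same kind of
 * work it does, instead of sleeping for a fixed 1 microsecond.
 * Returns the number of calls made.
 */
static int settle_delay(int nr_calls)
{
	volatile int sink = 0;	/* volatile keeps the calls from being elided */
	int i;

	assert(nr_calls >= 0);

	for (i = 0; i < nr_calls; i++)
		sink = sched_getcpu();	/* value is deliberately thrown away */

	(void)sink;
	return i;
}
```

The migration thread would call settle_delay(3) where it currently calls
usleep(1), so the pause scales with the cost of sched_getcpu() on whatever
architecture the test runs on.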

Thanks!

Mathieu

> +		 */
> +		usleep(1);
> +	}
> +	done = true;
> +	return NULL;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	struct kvm_vm *vm;
> +	u32 cpu, rseq_cpu;
> +	int r, snapshot;
> +
> +	/* Tell stdout not to buffer its content */
> +	setbuf(stdout, NULL);
> +
> +	r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
> +	TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
> +		    strerror(errno));
> +
> +	if (CPU_COUNT(&possible_mask) < 2) {
> +		print_skip("Only one CPU, task migration not possible\n");
> +		exit(KSFT_SKIP);
> +	}
> +
> +	sys_rseq(0);
> +
> +	/*
> +	 * Create and run a dummy VM that immediately exits to userspace via
> +	 * GUEST_SYNC, while concurrently migrating the process by setting its
> +	 * CPU affinity.
> +	 */
> +	vm = vm_create_default(VCPU_ID, 0, guest_code);
> +
> +	pthread_create(&migration_thread, NULL, migration_worker, 0);
> +
> +	while (!done) {
> +		vcpu_run(vm, VCPU_ID);
> +		TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
> +			    "Guest failed?");
> +
> +		/*
> +		 * Verify rseq's CPU matches sched's CPU.  Ensure migration
> +		 * doesn't occur between sched_getcpu() and reading the rseq
> +		 * cpu_id by rereading both if the sequence count changes, or
> +		 * if the count is odd (migration in-progress).
> +		 */
> +		do {
> +			/*
> +			 * Drop bit 0 to force a mismatch if the count is odd,
> +			 * i.e. if a migration is in-progress.
> +			 */
> +			snapshot = atomic_read(&seq_cnt) & ~1;
> +			smp_rmb();
> +			cpu = sched_getcpu();
> +			rseq_cpu = READ_ONCE(__rseq.cpu_id);
> +			smp_rmb();
> +		} while (snapshot != atomic_read(&seq_cnt));
> +
> +		TEST_ASSERT(rseq_cpu == cpu,
> +			    "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
> +	}
> +
> +	pthread_join(migration_thread, NULL);
> +
> +	kvm_vm_free(vm);
> +
> +	sys_rseq(RSEQ_FLAG_UNREGISTER);
> +
> +	return 0;
> +}
> --
> 2.33.0.rc2.250.ged5fa647cd-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: [PATCH v2 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
From: Mathieu Desnoyers @ 2021-08-23 15:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: KVM list, Peter Zijlstra, linux-kernel, Will Deacon, Guo Ren,
	linux-kselftest, Ben Gardon, shuah, Paul Mackerras, linux-s390,
	gor, Russell King, ARM Linux, linux-csky, Christian Borntraeger,
	Ingo Molnar, Catalin Marinas, linux-mips, Boqun Feng, paulmck,
	Heiko Carstens, rostedt, Shakeel Butt, Andy Lutomirski,
	Thomas Gleixner, Peter Foley, linux-arm-kernel,
	Thomas Bogendoerfer, Oleg Nesterov, Paolo Bonzini, linuxppc-dev
In-Reply-To: <20210820225002.310652-2-seanjc@google.com>

----- On Aug 20, 2021, at 6:49 PM, Sean Christopherson seanjc@google.com wrote:

> Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
> transferring to a KVM guest, which is roughly equivalent to an exit to
> userspace and processes many of the same pending actions.  While the task
> cannot be in an rseq critical section as the KVM path is reachable only
> by via ioctl(KVM_RUN), the side effects that apply to rseq outside of a
> critical section still apply, e.g. the current CPU needs to be updated if
> the task is migrated.
> 
> Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
> and other badness in userspace VMMs that use rseq in combination with KVM,
> e.g. due to the CPU ID being stale after task migration.

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
> Reported-by: Peter Foley <pefoley@google.com>
> Bisected-by: Doug Evans <dje@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> kernel/entry/kvm.c |  4 +++-
> kernel/rseq.c      | 14 +++++++++++---
> 2 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 49972ee99aff..049fd06b4c3d 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
> unsigned long ti_work)
> 		if (ti_work & _TIF_NEED_RESCHED)
> 			schedule();
> 
> -		if (ti_work & _TIF_NOTIFY_RESUME)
> +		if (ti_work & _TIF_NOTIFY_RESUME) {
> 			tracehook_notify_resume(NULL);
> +			rseq_handle_notify_resume(NULL, NULL);
> +		}
> 
> 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
> 		if (ret)
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 35f7bd0fced0..6d45ac3dae7f 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -282,9 +282,17 @@ void __rseq_handle_notify_resume(struct ksignal *ksig,
> struct pt_regs *regs)
> 
> 	if (unlikely(t->flags & PF_EXITING))
> 		return;
> -	ret = rseq_ip_fixup(regs);
> -	if (unlikely(ret < 0))
> -		goto error;
> +
> +	/*
> +	 * regs is NULL if and only if the caller is in a syscall path.  Skip
> +	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> +	 * kill a misbehaving userspace on debug kernels.
> +	 */
> +	if (regs) {
> +		ret = rseq_ip_fixup(regs);
> +		if (unlikely(ret < 0))
> +			goto error;
> +	}
> 	if (unlikely(rseq_update_cpu_id(t)))
> 		goto error;
> 	return;
> --
> 2.33.0.rc2.250.ged5fa647cd-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply

* Re: linux-next: build warning after merge of the powerpc tree
From: Jonathan Corbet @ 2021-08-23 14:19 UTC (permalink / raw)
  To: Stephen Rothwell, Michael Ellerman, PowerPC
  Cc: Aneesh Kumar K.V, Linux Next Mailing List,
	Daniel Henrique Barboza, Linux Kernel Mailing List
In-Reply-To: <20210823204803.7cb76778@canb.auug.org.au>

Stephen Rothwell <sfr@canb.auug.org.au> writes:

> Hi all,
>
> [cc'ing Jon in case he can fix the sphix hang - or knows anything about it]

That's new to me.  Which version of sphinx?

jon

^ permalink raw reply

* Re: [PATCH v2 1/2] KVM: Refactor kvm_arch_vcpu_fault() to return a struct page pointer
From: Christian Borntraeger @ 2021-08-23 14:12 UTC (permalink / raw)
  To: David Hildenbrand, Hou Wenlong, kvm
  Cc: x86, Wanpeng Li, linux-mips, H. Peter Anvin, Claudio Imbrenda,
	Will Deacon, kvmarm, linux-s390, Janosch Frank, Marc Zyngier,
	Joerg Roedel, Huacai Chen, Aleksandar Markovic, Ingo Molnar,
	Catalin Marinas, Vasily Gorbik, Suzuki K Poulose, Heiko Carstens,
	kvm-ppc, Borislav Petkov, Thomas Gleixner, Alexandru Elisei,
	linux-arm-kernel, Jim Mattson, Thomas Bogendoerfer,
	Sean Christopherson, Cornelia Huck, linux-kernel, James Morse,
	Paolo Bonzini, Vitaly Kuznetsov, linuxppc-dev
In-Reply-To: <98adbd3c-ec6f-5689-1686-2a8a7909951a@redhat.com>



On 12.08.21 11:04, David Hildenbrand wrote:
> On 12.08.21 06:02, Hou Wenlong wrote:
>> From: Sean Christopherson <seanjc@google.com>
>>
>> Refactor kvm_arch_vcpu_fault() to return 'struct page *' instead of
>> 'vm_fault_t' to simplify architecture specific implementations that do
>> more than return SIGBUS.  Currently this only applies to s390, but a
>> future patch will move x86's pio_data handling into x86 where it belongs.
>>
>> No functional changed intended.
>>
>> Cc: Hou Wenlong <houwenlong93@linux.alibaba.com>
>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>> Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
>> ---
>>   arch/arm64/kvm/arm.c       |  4 ++--
>>   arch/mips/kvm/mips.c       |  4 ++--
>>   arch/powerpc/kvm/powerpc.c |  4 ++--
>>   arch/s390/kvm/kvm-s390.c   | 12 ++++--------
>>   arch/x86/kvm/x86.c         |  4 ++--
>>   include/linux/kvm_host.h   |  2 +-
>>   virt/kvm/kvm_main.c        |  5 ++++-
>>   7 files changed, 17 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index e9a2b8f27792..83f4ffe3e4f2 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -161,9 +161,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>       return ret;
>>   }
>> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>>   {
>> -    return VM_FAULT_SIGBUS;
>> +    return NULL;
>>   }
>> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
>> index af9dd029a4e1..ae79874e6fd2 100644
>> --- a/arch/mips/kvm/mips.c
>> +++ b/arch/mips/kvm/mips.c
>> @@ -1053,9 +1053,9 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>>       return -ENOIOCTLCMD;
>>   }
>> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>>   {
>> -    return VM_FAULT_SIGBUS;
>> +    return NULL;
>>   }
>>   int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index be33b5321a76..b9c21f9ab784 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -2090,9 +2090,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>>       return r;
>>   }
>> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>>   {
>> -    return VM_FAULT_SIGBUS;
>> +    return NULL;
>>   }
>>   static int kvm_vm_ioctl_get_pvinfo(struct kvm_ppc_pvinfo *pvinfo)
>> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
>> index 02574d7b3612..e1b69833e228 100644
>> --- a/arch/s390/kvm/kvm-s390.c
>> +++ b/arch/s390/kvm/kvm-s390.c
>> @@ -4979,17 +4979,13 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>>       return r;
>>   }
>> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>>   {
>>   #ifdef CONFIG_KVM_S390_UCONTROL
>> -    if ((vmf->pgoff == KVM_S390_SIE_PAGE_OFFSET)
>> -         && (kvm_is_ucontrol(vcpu->kvm))) {
>> -        vmf->page = virt_to_page(vcpu->arch.sie_block);
>> -        get_page(vmf->page);
>> -        return 0;
>> -    }
>> +    if (vmf->pgoff == KVM_S390_SIE_PAGE_OFFSET && kvm_is_ucontrol(vcpu->kvm))
>> +        return virt_to_page(vcpu->arch.sie_block);
>>   #endif
>> -    return VM_FAULT_SIGBUS;
>> +    return NULL;
>>   }
>>   /* Section: memory related */
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 3cedc7cc132a..1e3bbe5cd33a 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -5347,9 +5347,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>>       return r;
>>   }
>> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
>>   {
>> -    return VM_FAULT_SIGBUS;
>> +    return NULL;
>>   }
>>   static int kvm_vm_ioctl_set_tss_addr(struct kvm *kvm, unsigned long addr)
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 492d183dd7d0..a949de534722 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -995,7 +995,7 @@ long kvm_arch_dev_ioctl(struct file *filp,
>>               unsigned int ioctl, unsigned long arg);
>>   long kvm_arch_vcpu_ioctl(struct file *filp,
>>                unsigned int ioctl, unsigned long arg);
>> -vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
>> +struct page *kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf);
>>   int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext);
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 30d322519253..f7d21418971b 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -3448,7 +3448,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
>>               &vcpu->dirty_ring,
>>               vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
>>       else
>> -        return kvm_arch_vcpu_fault(vcpu, vmf);
>> +        page = kvm_arch_vcpu_fault(vcpu, vmf);
>> +    if (!page)
>> +        return VM_FAULT_SIGBUS;
>> +
>>       get_page(page);
>>       vmf->page = page;
>>       return 0;
>>
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> But at the same time I wonder if we should just get rid of CONFIG_KVM_S390_UCONTROL and consequently kvm_arch_vcpu_fault().
> 
> 
> In practice CONFIG_KVM_S390_UCONTROL, is never enabled in any reasonable kernel build and consequently it's never tested; further, exposing the sie_block to user space allows user space to generate random SIE validity intercepts.
> 
> CONFIG_KVM_S390_UCONTROL feels like something that should just be maintained out of tree by someone who really needs to hack deep into hw virtualization for testing purposes etc.

I recently talked to the ucontrol users and they will look into selftests.
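
For readers skimming the quoted patch, here is a rough user-space model
of the calling convention it introduces: the arch hook returns a page
pointer or NULL, and the common code turns NULL into a SIGBUS-style
error and otherwise takes the reference itself. All types and names
below (demo_page, demo_arch_fault, demo_vcpu_fault, DEMO_FAULT_*) are
made up for illustration and are not kernel APIs.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page; only a reference count. */
struct demo_page {
	int refcount;
};

enum {
	DEMO_FAULT_OK = 0,
	DEMO_FAULT_SIGBUS = 1,
};

/* Arch hook: hand back the special page for one offset, NULL otherwise. */
struct demo_page *demo_arch_fault(struct demo_page *special,
				  unsigned long pgoff,
				  unsigned long special_pgoff)
{
	return pgoff == special_pgoff ? special : NULL;
}

/* Common code: NULL means SIGBUS, anything else gets a reference. */
int demo_vcpu_fault(struct demo_page *special, unsigned long pgoff,
		    unsigned long special_pgoff, struct demo_page **out)
{
	struct demo_page *page;

	page = demo_arch_fault(special, pgoff, special_pgoff);
	if (!page)
		return DEMO_FAULT_SIGBUS;

	page->refcount++;	/* models get_page() in the common path */
	*out = page;
	return DEMO_FAULT_OK;
}
```

The point of the shape is that the reference counting lives in one
place, so an arch implementation cannot forget it.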

^ permalink raw reply

* Re: [PATCH v3] PCI: Move pci_dev_is/assign_added() to pci.h
From: Niklas Schnelle @ 2021-08-23 10:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-arch, linux-s390, linux-kernel, Paul Mackerras, linux-pci,
	Bjorn Helgaas, linuxppc-dev
In-Reply-To: <20210820223734.GA3366782@bjorn-Precision-5520>

On Fri, 2021-08-20 at 17:37 -0500, Bjorn Helgaas wrote:
> On Tue, Jul 20, 2021 at 05:01:45PM +0200, Niklas Schnelle wrote:
> > The helper function pci_dev_is_added() from drivers/pci/pci.h is used in
> > PCI arch code of both s390 and powerpc leading to awkward relative
> > includes. Move it to the global include/linux/pci.h and get rid of these
> > includes just for that one function.
> 
> I agree the includes are awkward.
> 
> But the arch code *using* pci_dev_is_added() seems awkward, too.

See below for my interpretation of why s390 has some driver-like
functionality in its arch code, which isn't necessarily awkward.

Independent of that, I have found pci_dev_is_added() to be the only way
to deal with the case that one might be looking at a struct pci_dev
reference that has been removed via pci_stop_and_remove_bus_device() or
has never been fully scanned. This is quite useful when handling error
events, which on s390 are part of the adapter event mechanism shared
with channel I/O devices.

> 
> AFAICS, in powerpc, pci_dev_is_added() is only used by
> pnv_pci_ioda_fixup_iov() and pseries_pci_fixup_iov_resources().  Those
> are only called from pcibios_add_device(), which is only called from
> pci_device_add().
> 
> Is it even possible for pci_dev_is_added() to be true in that path?
> 
> s390 uses pci_dev_is_added() in recover_store()

I'm actually looking into this as I'm working on an s390 implementation
of the PCI recovery flow described in
Documentation/PCI/pci-error-recovery.rst that would also call
pci_dev_is_added(), because when we get a platform notification of a PCI
reset done by firmware it may be that the struct pci_dev is going away,
i.e. we still have a ref count but it is not added to the PCI bus
anymore. And pci_dev_is_added() is the only way I've found to check for
this state.

> , but I don't know what
> that is (looks like a sysfs file, but it's not documented) or why s390
> is the only arch that does this.

Good point about this not being documented, I'll look into adding docs.

This is a sysfs attribute that basically removes the pci_dev and
re-adds it. This has the complication that since the attribute sits at
/sys/bus/pci/devices/<dev>/recover it deletes its own parent directory,
which requires extra caution and means concurrent accesses block on
pci_lock_rescan_remove() instead of a kernfs lock.
Long story short, when concurrently triggering the attribute one thread
proceeds into the pci_lock_rescan_remove() section and does the
removal, while others block on pci_lock_rescan_remove(). By the time
the threads unblock, the removal is done. In this case there is a new
struct pci_dev found in the rescan, but the previously blocked threads
still have references to the old struct pci_dev, which was removed and,
as far as I could tell, can only be distinguished by checking
pci_dev_is_added().
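
A small user-space sketch of that pattern, assuming a pthread mutex in
place of pci_lock_rescan_remove() and a plain flag in place of
pci_dev_is_added(). The names demo_dev, demo_lock and demo_recover are
hypothetical and this is not the s390 code, only the shape of the check:
whoever gets the lock first does the removal, and anyone arriving later
with a stale reference sees the flag cleared and backs off.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev. */
struct demo_dev {
	bool added;	/* models the "added to the bus" state */
};

/* Models the global rescan/remove lock. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Returns 0 if this caller performed the removal, -1 if the device had
 * already been removed by a caller that held the lock earlier.
 */
int demo_recover(struct demo_dev *dev)
{
	int ret = -1;

	pthread_mutex_lock(&demo_lock);
	if (dev->added) {
		/* First caller: remove the device. A rescan would follow. */
		dev->added = false;
		ret = 0;
	}
	/* Otherwise: stale reference, nothing left to do for this device. */
	pthread_mutex_unlock(&demo_lock);

	return ret;
}
```

The check has to happen after the lock is taken; testing the flag before
blocking would still race with the thread doing the removal.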

> 
> Maybe we should make powerpc and s390 less special?

On s390, as I see it, the reason for this is that all of the PCI
functionality is directly defined in the Architecture as special CPU
instructions which are kind of hypercalls but also an ISA extension.

These instructions range from the basic PCI memory accesses (no real
MMIO) to enumeration of the devices and on to reporting of hot-plug
and reset/recovery events. Importantly we do not have any kind of
direct access to a real or virtual PCI controller and the architecture
has no concept of a comparable entity.

So in my opinion while there is some of the functionality of a PCI
controller in arch/s390/pci the cut off between controller
functionality and arch support isn't clear at all and exposing PCI
support as CPU instructions doesn't map well to the controller concept.

That said, in principle I'm open to moving some of that into
drivers/pci/controller/ if you think that would improve things and we
can find a good argument for what should go where. One possible cut-off
would be to have arch/s390/pci/ provide wrappers for the PCI
instructions but move all their uses to e.g.
drivers/pci/controller/s390/. This would of course be a major
refactoring and none of that code would be useful on any other
architecture, but it would move a lot of the accesses to PCI common code
functionality out of the arch code.

> 
> > Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
> > ---
> > Since v1 (and bad v2):
> > - Fixed accidental removal of PCI_DPC_RECOVERED, PCI_DPC_RECOVERING
> >   defines and also move these to include/linux/pci.h
> > 
> >  arch/powerpc/platforms/powernv/pci-sriov.c |  3 ---
> >  arch/powerpc/platforms/pseries/setup.c     |  1 -
> >  arch/s390/pci/pci_sysfs.c                  |  2 --
> >  drivers/pci/hotplug/acpiphp_glue.c         |  1 -
> >  drivers/pci/pci.h                          | 15 ---------------
> >  include/linux/pci.h                        | 15 +++++++++++++++
> >  6 files changed, 15 insertions(+), 22 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c b/arch/powerpc/platforms/powernv/pci-sriov.c
> > index 28aac933a439..2e0ca5451e85 100644
> > --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> > +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> > @@ -9,9 +9,6 @@
> >  
> >  #include "pci.h"
> >  
> > -/* for pci_dev_is_added() */
> > -#include "../../../../drivers/pci/pci.h"
> > 
.. snip ..

^ permalink raw reply

* Re: linux-next: build warning after merge of the powerpc tree
From: Stephen Rothwell @ 2021-08-23 10:48 UTC (permalink / raw)
  To: Michael Ellerman, PowerPC
  Cc: Aneesh Kumar K.V, Linux Next Mailing List,
	Daniel Henrique Barboza, Linux Kernel Mailing List,
	Jonathan Corbet
In-Reply-To: <20210823195540.4d7363ed@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 2640 bytes --]

Hi all,

[cc'ing Jon in case he can fix the sphinx hang - or knows anything about it]

On Mon, 23 Aug 2021 19:55:40 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> After merging the powerpc tree, today's linux-next build (htmldocs)
> produced this warning:
> 

I missed a line:

Sphinx parallel build error:

> docutils.utils.SystemMessage: Documentation/powerpc/associativity.rst:1: (SEVERE/4) Title overline & underline mismatch.
> 
> ============================
> NUMA resource associativity
> =============================
> 
> Introduced by commit
> 
>   1c6b5a7e7405 ("powerpc/pseries: Add support for FORM2 associativity")
> 
> There are other obvious problems with this document (but sphinx seems
> to have hung before it reported them).
> 
> Like
> 
> Form 0
> -----
> 
> and
> 
> Form 1
> -----
> 
> and
> 
> Form 2
> -------

I also get the following warning:

Documentation/powerpc/associativity.rst: WARNING: document isn't included in any toctree

And applying the following patch is enough to allow sphinx to finish
(rather than livelocking):

diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
index 07e7dd3d6c87..b77c6ccbd6cb 100644
--- a/Documentation/powerpc/associativity.rst
+++ b/Documentation/powerpc/associativity.rst
@@ -1,6 +1,6 @@
-============================
+===========================
 NUMA resource associativity
-=============================
+===========================
 
 Associativity represents the groupings of the various platform resources into
 domains of substantially similar mean performance relative to resources outside
@@ -20,11 +20,11 @@ A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativi
 bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
 
 Form 0
------
+------
 Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
 
 Form 1
------
+------
 With Form 1 a combination of ibm,associativity-reference-points, and ibm,associativity
 device tree properties are used to determine the NUMA distance between resource groups/domains.
 
@@ -45,7 +45,7 @@ level of the resource group, the kernel doubles the NUMA distance between the
 comparing domains.
 
 Form 2
--------
+------
 Form 2 associativity format adds separate device tree properties representing NUMA node distance
 thereby making the node distance computation flexible. Form 2 also allows flexible primary
 domain numbering. With numa distance computation now detached from the index value in
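
As an aside, the rule the patch above restores can be written down as a
small check. This is only a sketch of the reStructuredText convention as
I understand it (an overline, if present, must match the underline
exactly, and the underline must be at least as long as the title); the
function name rst_title_lines_ok is made up, and lengths are counted in
bytes, so it is only meaningful for ASCII titles.

```c
#include <stdbool.h>
#include <string.h>

/*
 * overline may be NULL for sections that only have an underline.
 * Returns true if the adornment lines would be accepted for this title.
 */
bool rst_title_lines_ok(const char *overline, const char *title,
			const char *underline)
{
	/* An overline must be identical to the underline. */
	if (overline && strcmp(overline, underline) != 0)
		return false;

	/* The underline must not be shorter than the title text. */
	return strlen(underline) >= strlen(title);
}
```

With this check, "Form 0" over five dashes fails and over six dashes
passes, which matches the hunks in the patch.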

-- 
Cheers,
Stephen Rothwell


^ permalink raw reply related

* [PATCH] powerpc/32s: Fix random crashes by adding isync() after locking/unlocking KUEP
From: Christophe Leroy @ 2021-08-23 10:07 UTC (permalink / raw)
  To: stable; +Cc: fthain, linuxppc-dev, linux-kernel, userm57

Backport for kernel 5.13

(cherry picked from commit ef486bf448a057a6e2d50e40ae879f7add6585da)

Commit b5efec00b671 ("powerpc/32s: Move KUEP locking/unlocking in C")
removed the 'isync' instruction after adding/removing NX bit in user
segments. The reasoning behind this change was that when setting the
NX bit we don't mind it taking effect with delay as the kernel never
executes text from userspace, and when clearing the NX bit this is
to return to userspace and then the 'rfi' should synchronise the
context.

However, it looks like on book3s/32 having a hash page table, at least
on the G3 processor, we get an unexpected fault from userspace, then
this is followed by something wrong in the verification of MSR_PR
at end of another interrupt.

This is fixed by adding back the removed isync() following update
of NX bit in user segment registers. Only do it for cores with a
hash table, as 603 cores don't exhibit that problem and the two isync
increase ./null_syscall selftest by 6 cycles on an MPC 832x.

First problem: unexpected WARN_ON() for mysterious PROTFAULT

  WARNING: CPU: 0 PID: 1660 at arch/powerpc/mm/fault.c:354 do_page_fault+0x6c/0x5b0
  Modules linked in:
  CPU: 0 PID: 1660 Comm: Xorg Not tainted 5.13.0-pmac-00028-gb3c15b60339a #40
  NIP:  c001b5c8 LR: c001b6f8 CTR: 00000000
  REGS: e2d09e40 TRAP: 0700   Not tainted  (5.13.0-pmac-00028-gb3c15b60339a)
  MSR:  00021032 <ME,IR,DR,RI>  CR: 42d04f30  XER: 20000000
  GPR00: c000424c e2d09f00 c301b680 e2d09f40 0000001e 42000000 00cba028 00000000
  GPR08: 08000000 48000010 c301b680 e2d09f30 22d09f30 00c1fff0 00cba000 a7b7ba4c
  GPR16: 00000031 00000000 00000000 00000000 00000000 00000000 a7b7b0d0 00c5c010
  GPR24: a7b7b64c a7b7d2f0 00000004 00000000 c1efa6c0 00cba02c 00000300 e2d09f40
  NIP [c001b5c8] do_page_fault+0x6c/0x5b0
  LR [c001b6f8] do_page_fault+0x19c/0x5b0
  Call Trace:
  [e2d09f00] [e2d09f04] 0xe2d09f04 (unreliable)
  [e2d09f30] [c000424c] DataAccess_virt+0xd4/0xe4
  --- interrupt: 300 at 0xa7a261dc
  NIP:  a7a261dc LR: a7a253bc CTR: 00000000
  REGS: e2d09f40 TRAP: 0300   Not tainted  (5.13.0-pmac-00028-gb3c15b60339a)
  MSR:  0000d032 <EE,PR,ME,IR,DR,RI>  CR: 228428e2  XER: 20000000
  DAR: 00cba02c DSISR: 42000000
  GPR00: a7a27448 afa6b0e0 a74c35c0 a7b7b614 0000001e a7b7b614 00cba028 00000000
  GPR08: 00020fd9 00000031 00cb9ff8 a7a273b0 220028e2 00c1fff0 00cba000 a7b7ba4c
  GPR16: 00000031 00000000 00000000 00000000 00000000 00000000 a7b7b0d0 00c5c010
  GPR24: a7b7b64c a7b7d2f0 00000004 00000002 0000001e a7b7b614 a7b7aff4 00000030
  NIP [a7a261dc] 0xa7a261dc
  LR [a7a253bc] 0xa7a253bc
  --- interrupt: 300
  Instruction dump:
  7c4a1378 810300a0 75278410 83820298 83a300a4 553b018c 551e0036 4082038c
  2e1b0000 40920228 75280800 41820220 <0fe00000> 3b600000 41920214 81420594

Second problem: MSR PR is seen unset although the interrupt frame shows it set

  kernel BUG at arch/powerpc/kernel/interrupt.c:458!
  Oops: Exception in kernel mode, sig: 5 [#1]
  BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
  Modules linked in:
  CPU: 0 PID: 1660 Comm: Xorg Tainted: G        W         5.13.0-pmac-00028-gb3c15b60339a #40
  NIP:  c0011434 LR: c001629c CTR: 00000000
  REGS: e2d09e70 TRAP: 0700   Tainted: G        W          (5.13.0-pmac-00028-gb3c15b60339a)
  MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 42d09f30  XER: 00000000
  GPR00: 00000000 e2d09f30 c301b680 e2d09f40 83440000 c44d0e68 e2d09e8c 00000000
  GPR08: 00000002 00dc228a 00004000 e2d09f30 22d09f30 00c1fff0 afa6ceb4 00c26144
  GPR16: 00c25fb8 00c26140 afa6ceb8 90000000 00c944d8 0000001c 00000000 00200000
  GPR24: 00000000 000001fb afa6d1b4 00000001 00000000 a539a2a0 a530fd80 00000089
  NIP [c0011434] interrupt_exit_kernel_prepare+0x10/0x70
  LR [c001629c] interrupt_return+0x9c/0x144
  Call Trace:
  [e2d09f30] [c000424c] DataAccess_virt+0xd4/0xe4 (unreliable)
  --- interrupt: 300 at 0xa09be008
  NIP:  a09be008 LR: a09bdfe8 CTR: a09bdfc0
  REGS: e2d09f40 TRAP: 0300   Tainted: G        W          (5.13.0-pmac-00028-gb3c15b60339a)
  MSR:  0000d032 <EE,PR,ME,IR,DR,RI>  CR: 420028e2  XER: 20000000
  DAR: a539a308 DSISR: 0a000000
  GPR00: a7b90d50 afa6b2d0 a74c35c0 a0a8b690 a0a8b698 a5365d70 a4fa82a8 00000004
  GPR08: 00000000 a09bdfc0 00000000 a5360000 a09bde7c 00c1fff0 afa6ceb4 00c26144
  GPR16: 00c25fb8 00c26140 afa6ceb8 90000000 00c944d8 0000001c 00000000 00200000
  GPR24: 00000000 000001fb afa6d1b4 00000001 00000000 a539a2a0 a530fd80 00000089
  NIP [a09be008] 0xa09be008
  LR [a09bdfe8] 0xa09bdfe8
  --- interrupt: 300
  Instruction dump:
  80010024 83e1001c 7c0803a6 4bffff80 3bc00800 4bffffd0 486b42fd 4bffffcc
  81430084 71480002 41820038 554a0462 <0f0a0000> 80620060 74630001 40820034

Fixes: b5efec00b671 ("powerpc/32s: Move KUEP locking/unlocking in C")
Cc: stable@vger.kernel.org # v5.13+
Reported-by: Stan Johnson <userm57@yahoo.com>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4856f5574906e2aec0522be17bf3848a22b2cd0b.1629269345.git.christophe.leroy@csgroup.eu
---
 arch/powerpc/mm/book3s32/kuep.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/powerpc/mm/book3s32/kuep.c b/arch/powerpc/mm/book3s32/kuep.c
index 8ed1b8634839..7015bd489063 100644
--- a/arch/powerpc/mm/book3s32/kuep.c
+++ b/arch/powerpc/mm/book3s32/kuep.c
@@ -4,6 +4,7 @@
 #include <asm/reg.h>
 #include <asm/task_size_32.h>
 #include <asm/mmu.h>
+#include <asm/synch.h>
 
 #define KUEP_UPDATE_TWO_USER_SEGMENTS(n) do {		\
 	if (TASK_SIZE > ((n) << 28))			\
@@ -32,9 +33,27 @@ static __always_inline void kuep_update(u32 val)
 void kuep_lock(void)
 {
 	kuep_update(mfsr(0) | SR_NX);
+	/*
+	 * This isync() shouldn't be necessary as the kernel is not excepted to
+	 * run any instruction in userspace soon after the update of segments,
+	 * but hash based cores (at least G3) seem to exhibit a random
+	 * behaviour when the 'isync' is not there. 603 cores don't have this
+	 * behaviour so don't do the 'isync' as it saves several CPU cycles.
+	 */
+	if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
+		isync();	/* Context sync required after mtsr() */
 }
 
 void kuep_unlock(void)
 {
 	kuep_update(mfsr(0) & ~SR_NX);
+	/*
+	 * This isync() shouldn't be necessary as a 'rfi' will soon be executed
+	 * to return to userspace, but hash based cores (at least G3) seem to
+	 * exhibit a random behaviour when the 'isync' is not there. 603 cores
+	 * don't have this behaviour so don't do the 'isync' as it saves several
+	 * CPU cycles.
+	 */
+	if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
+		isync();	/* Context sync required after mtsr() */
 }
-- 
2.25.0


^ permalink raw reply related

* Re: [PATCH 1/3] powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2
From: Srikar Dronamraju @ 2021-08-23 10:04 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Nathan Lynch, Vincent Guittot, Peter Zijlstra, Valentin Schneider,
	Aneesh Kumar K . V, linuxppc-dev, Ingo Molnar
In-Reply-To: <20210823061122.GC8104@in.ibm.com>

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2021-08-23 11:41:22]:

> On Sat, Aug 21, 2021 at 02:54:17PM +0530, Srikar Dronamraju wrote:
> > Aneesh reported a crash with a fairly recent upstream kernel when
> > booting kernel whose commandline was appended with nr_cpus=2
> > 
> > 1:mon> e
> > cpu 0x1: Vector: 300 (Data Access) at [c000000008a67bd0]
> >     pc: c00000000002557c: cpu_to_chip_id+0x3c/0x100
> >     lr: c000000000058380: start_secondary+0x460/0xb00
> >     sp: c000000008a67e70
> >    msr: 8000000000001033
> >    dar: 10
> >  dsisr: 80000
> >   current = 0xc00000000891bb00
> >   paca    = 0xc0000018ff981f80   irqmask: 0x03   irq_happened: 0x01
> >     pid   = 0, comm = swapper/1
> > Linux version 5.13.0-rc3-15704-ga050a6d2b7e8 (kvaneesh@ltc-boston8) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #433 SMP Tue May 25 02:38:49 CDT 2021
> > 1:mon> t
> > [link register   ] c000000000058380 start_secondary+0x460/0xb00
> > [c000000008a67e70] c000000008a67eb0 (unreliable)
> > [c000000008a67eb0] c0000000000589d4 start_secondary+0xab4/0xb00
> > [c000000008a67f90] c00000000000c654 start_secondary_prolog+0x10/0x14
> > 
> > Current code assumes that num_possible_cpus() is always greater than
> > threads_per_core. However this may not be true when using nr_cpus=2 or
> > similar options. Handle the case where num_possible_cpus is smaller than
> > threads_per_core.
> >
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Fixes: c1e53367dab1 ("powerpc/smp: Cache CPU to chip lookup")
> > Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > Debugged-by: Michael Ellerman <mpe@ellerman.id.au>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/smp.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 6c6e4d934d86..3d6874fe1937 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1074,7 +1074,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> >  	}
> > 
> >  	if (cpu_to_chip_id(boot_cpuid) != -1) {
> > -		int idx = num_possible_cpus() / threads_per_core;
> > +		int idx = max((int)num_possible_cpus() / threads_per_core, 1);
> 
> I think this code was assuming that num_possible_cpus() is a multiple
> of threads_per_core.
> 
> So, on a system with threads_per_core=8, if we pass nr_cpus=10, we
> will still get idx=1. Thus, we will allocate only one entry in
> chip_id_lookup_table[] even though there are two cores and
> chip_id_lookup_table[] is expected to have one entry per core.
> 
> Is this a valid scenario ? If yes, should we use
> 
>    idx = DIV_ROUND_UP(num_possible_cpus, threads_per_core);
> 

Yes, this can be done.
I will resend this patch with this change.
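
To make the arithmetic concrete, here is a small stand-alone sketch
comparing the three ways of sizing the table discussed above. The
DIV_ROUND_UP definition mirrors the usual kernel macro; the function
names (entries_truncating, entries_clamped, entries_round_up) are made
up for this example and do not appear in smp.c.

```c
/* Same shape as the kernel's DIV_ROUND_UP(); valid for positive values. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Original code: plain integer division, which truncates. */
int entries_truncating(int possible_cpus, int threads_per_core)
{
	return possible_cpus / threads_per_core;
}

/* First version of the fix: truncate, then clamp to at least one. */
int entries_clamped(int possible_cpus, int threads_per_core)
{
	int idx = possible_cpus / threads_per_core;

	return idx < 1 ? 1 : idx;
}

/* Suggested version: round up so a partial core still gets an entry. */
int entries_round_up(int possible_cpus, int threads_per_core)
{
	return DIV_ROUND_UP(possible_cpus, threads_per_core);
}
```

With threads_per_core=8, nr_cpus=2 gives 0, 1 and 1 entries
respectively, and nr_cpus=10 gives 1, 1 and 2. Only the rounding-up
variant covers both cores in the second case.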

> 
> > 
> >  		/*
> >  		 * All threads of a core will all belong to the same core,
> > -- 
> > 2.18.2
> > 
> 
> --
> Thanks and Regards
> gautham.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply
