linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH 0/8] FP/VEC/VSX switching optimisations
@ 2015-11-18  3:26 Cyril Bur
  2015-11-18  3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Cyril Bur @ 2015-11-18  3:26 UTC (permalink / raw)
  To: mikey, anton, linuxppc-dev

Hi,

These patches are an extension of the work done by Anton
https://patchwork.ozlabs.org/patch/537621/, they'll need to be applied on
top of them.

The goal of these patches is to rework how the 'math' registers (FP, VEC
and VSX) are context switched. Currently the kernel adopts a lazy approach:
it always switches userspace tasks in with all three facilities disabled
and loads each set of registers only upon receiving the corresponding
unavailable exception. The kernel does try to avoid disabling the
facilities in the syscall quick path, but during testing it appears that
even what should be a simple syscall still causes the kernel to use some of
the facilities itself (a vectorised memcpy, for example) and therefore
disable them for the user task.

The lazy approach keeps the time spent restoring userspace state small, and
if tasks don't use any of these facilities it is the correct thing to do.
In recent years, new workloads and new features such as auto-vectorisation
in GCC have increased userspace use of these facilities, so much so that
under some workloads a task can take an FP unavailable exception and a VEC
unavailable exception almost every time slice.

This series removes the general laziness in favour of a more selective
approach. If a task uses any of the 'math' facilities, the kernel will load
the registers and enable the facilities for future time slices, on the
assumption that the use is likely to continue for some time. This removes
the cost of having to take an exception.

This series also adds logic to detect whether a task had been using a
facility and optimises the case where the registers are still hot. This
provides another speedup: not only is the cost of the exception saved, but
the cost of copying up to 64 x 128-bit registers is also removed.

With these patches applied on top of Anton's patches I observe a
significant improvement with Anton's context switch microbenchmark using
yield():

http://ozlabs.org/~anton/junkcode/context_switch2.c

Using an LE kernel compiled with pseries_le_defconfig

Running:
./context_switch2 --test=yield 8 8
and adding one of --fp, --altivec or --vector
Gives a 5% improvement on a POWER8 CPU.

./context_switch2 --test=yield --fp --altivec --vector 8 8
Gives a 15% improvement on a POWER8 CPU.

I'll take this opportunity to note that 15% can be somewhat misleading. It
may be reasonable to assume that each of the optimisations has had a
compounding effect; this isn't incorrect, and the reason behind the
apparent compounding reveals a lot about where the current bottleneck lies.

The tests always touch FP first, then VEC, then VSX, which is the
guaranteed worst case for the way the kernel currently operates: this
ordering triggers three successive unavailable exceptions. Since the kernel
currently enables all three facilities after taking a VSX unavailable
exception, if the tests are modified to touch the facilities in the order
VSX->VEC->FP, the improvement when touching all three drops to only 5%.
There is a compounding effect in so far as the cost of taking multiple
unavailable exceptions is removed. This testing also demonstrates that the
cost of the exception is by far the most expensive part of the current
lazy approach.

Cyril Bur (8):
  selftests/powerpc: Test the preservation of FPU and VMX regs across
    syscall
  selftests/powerpc: Test preservation of FPU and VMX regs across
    preemption
  selftests/powerpc: Test FPU and VMX regs in signal ucontext
  powerpc: Explicitly disable math features when copying thread
  powerpc: Restore FPU/VEC/VSX if previously used
  powerpc: Add the ability to save FPU without giving it up
  powerpc: Add the ability to save Altivec without giving it up
  powerpc: Add the ability to save VSX without giving it up

 arch/powerpc/include/asm/processor.h               |   2 +
 arch/powerpc/include/asm/switch_to.h               |   5 +-
 arch/powerpc/kernel/asm-offsets.c                  |   2 +
 arch/powerpc/kernel/entry_64.S                     |  55 +++++-
 arch/powerpc/kernel/fpu.S                          |  25 +--
 arch/powerpc/kernel/ppc_ksyms.c                    |   4 -
 arch/powerpc/kernel/process.c                      | 144 ++++++++++++--
 arch/powerpc/kernel/vector.S                       |  45 +----
 tools/testing/selftests/powerpc/Makefile           |   3 +-
 tools/testing/selftests/powerpc/math/Makefile      |  19 ++
 tools/testing/selftests/powerpc/math/basic_asm.h   |  26 +++
 tools/testing/selftests/powerpc/math/fpu_asm.S     | 185 +++++++++++++++++
 tools/testing/selftests/powerpc/math/fpu_preempt.c |  92 +++++++++
 tools/testing/selftests/powerpc/math/fpu_signal.c  | 119 +++++++++++
 tools/testing/selftests/powerpc/math/fpu_syscall.c |  79 ++++++++
 tools/testing/selftests/powerpc/math/vmx_asm.S     | 219 +++++++++++++++++++++
 tools/testing/selftests/powerpc/math/vmx_preempt.c |  92 +++++++++
 tools/testing/selftests/powerpc/math/vmx_signal.c  | 124 ++++++++++++
 tools/testing/selftests/powerpc/math/vmx_syscall.c |  81 ++++++++
 19 files changed, 1240 insertions(+), 81 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/math/Makefile
 create mode 100644 tools/testing/selftests/powerpc/math/basic_asm.h
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_preempt.c
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_signal.c
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_syscall.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_preempt.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_signal.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_syscall.c

-- 
2.6.2


* [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
@ 2015-11-18  3:26 ` Cyril Bur
  2015-11-23  0:23   ` Michael Neuling
  2015-11-18  3:26 ` [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption Cyril Bur
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Cyril Bur @ 2015-11-18  3:26 UTC (permalink / raw)
  To: mikey, anton, linuxppc-dev

Test that the non-volatile floating point and Altivec registers are
correctly preserved across the fork() syscall.

fork() works nicely for this purpose as the registers should be the same
for both parent and child.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 tools/testing/selftests/powerpc/Makefile           |   3 +-
 tools/testing/selftests/powerpc/math/Makefile      |  14 ++
 tools/testing/selftests/powerpc/math/basic_asm.h   |  26 +++
 tools/testing/selftests/powerpc/math/fpu_asm.S     | 151 +++++++++++++++++
 tools/testing/selftests/powerpc/math/fpu_syscall.c |  79 +++++++++
 tools/testing/selftests/powerpc/math/vmx_asm.S     | 183 +++++++++++++++++++++
 tools/testing/selftests/powerpc/math/vmx_syscall.c |  81 +++++++++
 7 files changed, 536 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/math/Makefile
 create mode 100644 tools/testing/selftests/powerpc/math/basic_asm.h
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_syscall.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_syscall.c

diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
index 0c2706b..19e8191 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -22,7 +22,8 @@ SUB_DIRS = benchmarks 		\
 	   switch_endian	\
 	   syscalls		\
 	   tm			\
-	   vphn
+	   vphn         \
+	   math
 
 endif
 
diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile
new file mode 100644
index 0000000..896d9e2
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/Makefile
@@ -0,0 +1,14 @@
+TEST_PROGS := fpu_syscall vmx_syscall
+
+all: $(TEST_PROGS)
+
+$(TEST_PROGS): ../harness.c
+$(TEST_PROGS): CFLAGS += -O2 -g
+
+fpu_syscall: fpu_asm.S
+vmx_syscall: vmx_asm.S
+
+include ../../lib.mk
+
+clean:
+	rm -f $(TEST_PROGS) *.o
diff --git a/tools/testing/selftests/powerpc/math/basic_asm.h b/tools/testing/selftests/powerpc/math/basic_asm.h
new file mode 100644
index 0000000..27aca79
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/basic_asm.h
@@ -0,0 +1,26 @@
+#include <ppc-asm.h>
+#include <asm/unistd.h>
+
+#define LOAD_REG_IMMEDIATE(reg,expr) \
+	lis	reg,(expr)@highest;	\
+	ori	reg,reg,(expr)@higher;	\
+	rldicr	reg,reg,32,31;	\
+	oris	reg,reg,(expr)@high;	\
+	ori	reg,reg,(expr)@l;
+
+#define PUSH_BASIC_STACK(size) \
+	std	2,24(sp); \
+	mflr	r0; \
+	std	r0,16(sp); \
+	mfcr	r0; \
+	stw	r0,8(sp); \
+	stdu	sp,-size(sp);
+
+#define POP_BASIC_STACK(size) \
+	addi	sp,sp,size; \
+	ld	2,24(sp); \
+	ld	r0,16(sp); \
+	mtlr	r0; \
+	lwz	r0,8(sp); \
+	mtcr	r0; \
+
diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testing/selftests/powerpc/math/fpu_asm.S
new file mode 100644
index 0000000..d5412c1
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/fpu_asm.S
@@ -0,0 +1,151 @@
+#include "basic_asm.h"
+
+#define PUSH_FPU(pos) \
+	stfd	f14,pos(sp); \
+	stfd	f15,pos+8(sp); \
+	stfd	f16,pos+16(sp); \
+	stfd	f17,pos+24(sp); \
+	stfd	f18,pos+32(sp); \
+	stfd	f19,pos+40(sp); \
+	stfd	f20,pos+48(sp); \
+	stfd	f21,pos+56(sp); \
+	stfd	f22,pos+64(sp); \
+	stfd	f23,pos+72(sp); \
+	stfd	f24,pos+80(sp); \
+	stfd	f25,pos+88(sp); \
+	stfd	f26,pos+96(sp); \
+	stfd	f27,pos+104(sp); \
+	stfd	f28,pos+112(sp); \
+	stfd	f29,pos+120(sp); \
+	stfd	f30,pos+128(sp); \
+	stfd	f31,pos+136(sp);
+
+#define POP_FPU(pos) \
+	lfd	f14,pos(sp); \
+	lfd	f15,pos+8(sp); \
+	lfd	f16,pos+16(sp); \
+	lfd	f17,pos+24(sp); \
+	lfd	f18,pos+32(sp); \
+	lfd	f19,pos+40(sp); \
+	lfd	f20,pos+48(sp); \
+	lfd	f21,pos+56(sp); \
+	lfd	f22,pos+64(sp); \
+	lfd	f23,pos+72(sp); \
+	lfd	f24,pos+80(sp); \
+	lfd	f25,pos+88(sp); \
+	lfd	f26,pos+96(sp); \
+	lfd	f27,pos+104(sp); \
+	lfd	f28,pos+112(sp); \
+	lfd	f29,pos+120(sp); \
+	lfd	f30,pos+128(sp); \
+	lfd	f31,pos+136(sp);
+
+#Careful calling this, it will 'clobber' fpu (by design)
+#Don't call this from C
+FUNC_START(load_fpu)
+	lfd	f14,0(r3)
+	lfd	f15,8(r3)
+	lfd	f16,16(r3)
+	lfd	f17,24(r3)
+	lfd	f18,32(r3)
+	lfd	f19,40(r3)
+	lfd	f20,48(r3)
+	lfd	f21,56(r3)
+	lfd	f22,64(r3)
+	lfd	f23,72(r3)
+	lfd	f24,80(r3)
+	lfd	f25,88(r3)
+	lfd	f26,96(r3)
+	lfd	f27,104(r3)
+	lfd	f28,112(r3)
+	lfd	f29,120(r3)
+	lfd	f30,128(r3)
+	lfd	f31,136(r3)
+	blr
+FUNC_END(load_fpu)
+
+FUNC_START(check_fpu)
+	mr r4,r3
+	li	r3,1 #assume a bad result
+	lfd	f0,0(r4)
+	fcmpu	cr1,f0,f14
+	bne	cr1,1f
+	lfd	f0,8(r4)
+	fcmpu	cr1,f0,f15
+	bne	cr1,1f
+	lfd	f0,16(r4)
+	fcmpu	cr1,f0,f16
+	bne	cr1,1f
+	lfd	f0,24(r4)
+	fcmpu	cr1,f0,f17
+	bne	cr1,1f
+	lfd	f0,32(r4)
+	fcmpu	cr1,f0,f18
+	bne	cr1,1f
+	lfd	f0,40(r4)
+	fcmpu	cr1,f0,f19
+	bne	cr1,1f
+	lfd	f0,48(r4)
+	fcmpu	cr1,f0,f20
+	bne	cr1,1f
+	lfd	f0,56(r4)
+	fcmpu	cr1,f0,f21
+	bne	cr1,1f
+	lfd	f0,64(r4)
+	fcmpu	cr1,f0,f22
+	bne	cr1,1f
+	lfd	f0,72(r4)
+	fcmpu	cr1,f0,f23
+	bne	cr1,1f
+	lfd	f0,80(r4)
+	fcmpu	cr1,f0,f24
+	bne	cr1,1f
+	lfd	f0,88(r4)
+	fcmpu	cr1,f0,f25
+	bne	cr1,1f
+	lfd	f0,96(r4)
+	fcmpu	cr1,f0,f26
+	bne	cr1,1f
+	lfd	f0,104(r4)
+	fcmpu	cr1,f0,f27
+	bne	cr1,1f
+	lfd	f0,112(r4)
+	fcmpu	cr1,f0,f28
+	bne	cr1,1f
+	lfd	f0,120(r4)
+	fcmpu	cr1,f0,f29
+	bne	cr1,1f
+	lfd	f0,128(r4)
+	fcmpu	cr1,f0,f30
+	bne	cr1,1f
+	lfd	f0,136(r4)
+	fcmpu	cr1,f0,f31
+	bne	cr1,1f
+	li	r3,0 #Success
+1:	blr
+
+FUNC_START(test_fpu)
+	#r3 holds pointer to where to put the result of fork
+	#f14-f31 are non volatiles
+	PUSH_BASIC_STACK(256)
+	std	r3,40(sp) #Address of darray
+	std r4,48(sp) #Address of pid
+	PUSH_FPU(56)
+
+	bl load_fpu
+	nop
+	li	r0,__NR_fork
+	sc
+
+	#pass the result of the fork to the caller
+	ld	r9,48(sp)
+	std	r3,0(r9)
+
+	ld r3,40(sp)
+	bl check_fpu
+	nop
+
+	POP_FPU(56)
+	POP_BASIC_STACK(256)
+	blr
+FUNC_END(test_fpu)
diff --git a/tools/testing/selftests/powerpc/math/fpu_syscall.c b/tools/testing/selftests/powerpc/math/fpu_syscall.c
new file mode 100644
index 0000000..a967fd6
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/fpu_syscall.c
@@ -0,0 +1,79 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <stdlib.h>
+
+#include "utils.h"
+
+extern int test_fpu(double *darray, pid_t *pid);
+
+double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
+		     1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0,
+		     2.1};
+
+int syscall_fpu(void)
+{
+	pid_t fork_pid;
+	int i;
+	int ret;
+	int child_ret;
+	for (i = 0; i < 1000; i++) {
+		/* test_fpu will fork() */
+		ret = test_fpu(darray, &fork_pid);
+		if (fork_pid == -1)
+			return -1;
+		if (fork_pid == 0)
+			exit(ret);
+		waitpid(fork_pid, &child_ret, 0);
+		if (ret || child_ret)
+			return 1;
+	}
+
+	return 0;
+}
+
+int test_syscall_fpu(void)
+{
+	/*
+	 * Setup an environment with much context switching
+	 */
+	pid_t pid2;
+	pid_t pid = fork();
+	int ret;
+	int child_ret;
+	FAIL_IF(pid == -1);
+
+	pid2 = fork();
+	/* Can't FAIL_IF(pid2 == -1); because already forked once */
+	if (pid2 == -1) {
+		/*
+		 * Couldn't fork, ensure test is a fail
+		 */
+		child_ret = ret = 1;
+	} else {
+		ret = syscall_fpu();
+		if (pid2)
+			waitpid(pid2, &child_ret, 0);
+		else
+			exit(ret);
+	}
+
+	ret |= child_ret;
+
+	if (pid)
+		waitpid(pid, &child_ret, 0);
+	else
+		exit(ret);
+
+	FAIL_IF(ret || child_ret);
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	return test_harness(test_syscall_fpu, "syscall_fpu");
+
+}
diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testing/selftests/powerpc/math/vmx_asm.S
new file mode 100644
index 0000000..e642e67
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/vmx_asm.S
@@ -0,0 +1,183 @@
+#include "basic_asm.h"
+
+#define PUSH_VMX(pos,reg) \
+	li	reg,pos; \
+	stvx	v20,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v21,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v22,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v23,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v24,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v25,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v26,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v27,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v28,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v29,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v30,reg,sp; \
+	addi	reg,reg,16; \
+	stvx	v31,reg,sp;
+
+#define POP_VMX(pos,reg) \
+	li	reg,pos; \
+	lvx	v20,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v21,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v22,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v23,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v24,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v25,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v26,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v27,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v28,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v29,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v30,reg,sp; \
+	addi	reg,reg,16; \
+	lvx	v31,reg,sp;
+
+#Careful, this will 'clobber' vmx (by design)
+#Don't call this from C
+FUNC_START(load_vmx)
+	li	r5,0
+	lvx	v20,r5,r3
+	addi	r5,r5,16
+	lvx	v21,r5,r3
+	addi	r5,r5,16
+	lvx	v22,r5,r3
+	addi	r5,r5,16
+	lvx	v23,r5,r3
+	addi	r5,r5,16
+	lvx	v24,r5,r3
+	addi	r5,r5,16
+	lvx	v25,r5,r3
+	addi	r5,r5,16
+	lvx	v26,r5,r3
+	addi	r5,r5,16
+	lvx	v27,r5,r3
+	addi	r5,r5,16
+	lvx	v28,r5,r3
+	addi	r5,r5,16
+	lvx	v29,r5,r3
+	addi	r5,r5,16
+	lvx	v30,r5,r3
+	addi	r5,r5,16
+	lvx	v31,r5,r3
+	blr
+FUNC_END(load_vmx)
+
+#Should be safe from C, only touches r4, r5 and v0,v1,v2
+FUNC_START(check_vmx)
+	PUSH_BASIC_STACK(16)
+	mr r4,r3
+	li	r3,1 #assume a bad result
+	li	r5,0
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v20
+	vmr	v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v21
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v22
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v23
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v24
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v25
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v26
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v27
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v28
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v29
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v30
+	vand	v2,v2,v1
+
+	addi	r5,r5,16
+	lvx	v0,r5,r4
+	vcmpequd.	v1,v0,v31
+	vand	v2,v2,v1
+
+	li r5,0
+	stvx	v2,r5,sp
+	ldx	r0,r5,sp
+	cmpdi	r0,0xffffffff
+	bne	1f
+	li	r3,0
+1:	POP_BASIC_STACK(16)
+	blr
+FUNC_END(check_vmx)
+
+#Safe from C
+FUNC_START(test_vmx)
+	#r3 holds pointer to where to put the result of fork
+	#v20-v31 are non-volatile
+	PUSH_BASIC_STACK(512)
+	std	r3,40(sp) #Address of varray
+	std r4,48(sp) #address of pid
+	PUSH_VMX(56, r4)
+
+	bl load_vmx
+
+	li	r0,__NR_fork
+	sc
+	#Pass the result of fork back to the caller
+	ld	r9,48(sp)
+	std	r3,0(r9)
+
+	ld r3,40(sp)
+	bl check_vmx
+
+	POP_VMX(56,r4)
+	POP_BASIC_STACK(512)
+	blr
+FUNC_END(test_vmx)
diff --git a/tools/testing/selftests/powerpc/math/vmx_syscall.c b/tools/testing/selftests/powerpc/math/vmx_syscall.c
new file mode 100644
index 0000000..7adff05
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/vmx_syscall.c
@@ -0,0 +1,81 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#include "utils.h"
+
+typedef int v4si __attribute__ ((vector_size (16)));
+v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12},
+	{13,14,15,16},{17,18,19,20},{21,22,23,24},
+	{25,26,27,28},{29,30,31,32},{33,34,35,36},
+	{37,38,39,40},{41,42,43,44},{45,46,47,48}};
+
+extern int test_vmx(v4si *varray, pid_t *pid);
+
+int vmx_syscall(void)
+{
+	pid_t fork_pid;
+	int i;
+	int ret;
+	int child_ret;
+	for (i = 0; i < 1000; i++) {
+		/* test_vmx will fork() */
+		ret = test_vmx(varray, &fork_pid);
+		if (fork_pid == -1)
+			return -1;
+		if (fork_pid == 0)
+			exit(ret);
+		waitpid(fork_pid, &child_ret, 0);
+		if (ret || child_ret)
+			return 1;
+	}
+
+	return 0;
+}
+
+int test_vmx_syscall(void)
+{
+	/*
+	 * Setup an environment with much context switching
+	 */
+	pid_t pid2;
+	pid_t pid = fork();
+	int ret;
+	int child_ret;
+	FAIL_IF(pid == -1);
+
+	pid2 = fork();
+	ret = vmx_syscall();
+	/* Can't FAIL_IF(pid2 == -1); because we've already forked */
+	if (pid2 == -1) {
+		/*
+		 * Couldn't fork, ensure child_ret is set and is a fail
+		 */
+		ret = child_ret = 1;
+	} else {
+		if (pid2)
+			waitpid(pid2, &child_ret, 0);
+		else
+			exit(ret);
+	}
+
+	ret |= child_ret;
+
+	if (pid)
+		waitpid(pid, &child_ret, 0);
+	else
+		exit(ret);
+
+	FAIL_IF(ret || child_ret);
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	return test_harness(test_vmx_syscall, "vmx_syscall");
+
+}
-- 
2.6.2


* [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
  2015-11-18  3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur
@ 2015-11-18  3:26 ` Cyril Bur
  2015-11-23  0:34   ` Michael Neuling
  2015-11-18  3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Cyril Bur @ 2015-11-18  3:26 UTC (permalink / raw)
  To: mikey, anton, linuxppc-dev

Run many threads that loop in assembly checking that the non-volatile
registers are preserved across preemption.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 tools/testing/selftests/powerpc/math/Makefile      |  7 +-
 tools/testing/selftests/powerpc/math/fpu_asm.S     | 34 ++++++++
 tools/testing/selftests/powerpc/math/fpu_preempt.c | 92 ++++++++++++++++++++++
 tools/testing/selftests/powerpc/math/vmx_asm.S     | 44 ++++++++++-
 tools/testing/selftests/powerpc/math/vmx_preempt.c | 92 ++++++++++++++++++++++
 5 files changed, 263 insertions(+), 6 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_preempt.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_preempt.c

diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile
index 896d9e2..9fa690f 100644
--- a/tools/testing/selftests/powerpc/math/Makefile
+++ b/tools/testing/selftests/powerpc/math/Makefile
@@ -1,12 +1,15 @@
-TEST_PROGS := fpu_syscall vmx_syscall
+TEST_PROGS := fpu_syscall fpu_preempt vmx_syscall vmx_preempt
 
 all: $(TEST_PROGS)
 
 $(TEST_PROGS): ../harness.c
-$(TEST_PROGS): CFLAGS += -O2 -g
+$(TEST_PROGS): CFLAGS += -O2 -g -pthread
 
 fpu_syscall: fpu_asm.S
+fpu_preempt: fpu_asm.S
+
 vmx_syscall: vmx_asm.S
+vmx_preempt: vmx_asm.S
 
 include ../../lib.mk
 
diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testing/selftests/powerpc/math/fpu_asm.S
index d5412c1..5ff0adc 100644
--- a/tools/testing/selftests/powerpc/math/fpu_asm.S
+++ b/tools/testing/selftests/powerpc/math/fpu_asm.S
@@ -149,3 +149,37 @@ FUNC_START(test_fpu)
 	POP_BASIC_STACK(256)
 	blr
 FUNC_END(test_fpu)
+
+#int preempt_fpu(double *darray, volatile int *not_ready, int *sentinal)
+#On starting will (atomically) decrement not_ready as a signal that the FPU
+#has been loaded with darray. Will proceed to check the validity of the FPU
+#registers while sentinal is not zero.
+FUNC_START(preempt_fpu)
+	PUSH_BASIC_STACK(256)
+	std r3,32(sp) #double *darray
+	std r4,40(sp) #volatile int *not_ready
+	std r5,48(sp) #int *sentinal
+	PUSH_FPU(56)
+
+	bl load_fpu
+
+	#Atomic DEC
+	ld r3,40(sp)
+1:	lwarx r4,0,r3
+	addi r4,r4,-1
+	stwcx. r4,0,r3
+	bne- 1b
+
+2:	ld r3, 32(sp)
+	bl check_fpu
+	cmpdi r3,0
+	bne 3f
+	ld r4, 48(sp)
+	ld r5, 0(r4)
+	cmpwi r5,0
+	bne 2b
+
+3:	POP_FPU(56)
+	POP_BASIC_STACK(256)
+	blr
+FUNC_END(preempt_fpu)
diff --git a/tools/testing/selftests/powerpc/math/fpu_preempt.c b/tools/testing/selftests/powerpc/math/fpu_preempt.c
new file mode 100644
index 0000000..e24cf9b
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/fpu_preempt.c
@@ -0,0 +1,92 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <stdlib.h>
+#include <pthread.h>
+
+#include "utils.h"
+
+/* Time to wait for workers to get preempted (seconds) */
+#define PREEMPT_TIME 20
+/*
+ * Factor by which to multiply number of online CPUs for total number of
+ * worker threads
+ */
+#define THREAD_FACTOR 8
+
+
+__thread double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
+		     1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0,
+		     2.1};
+
+volatile int not_ready;
+int running;
+
+extern void preempt_fpu(double *darray, volatile int *not_ready, int *sentinal);
+
+void *preempt_fpu_c(void *p)
+{
+	int i;
+	srand(pthread_self());
+	for (i = 0; i < 21; i++)
+		darray[i] = rand();
+
+	/* Test failed if it ever returns */
+	preempt_fpu(darray, &not_ready, &running);
+
+	return p;
+}
+
+int test_preempt_fpu(void)
+{
+	int i, rc, threads;
+	pthread_t *tids;
+
+	threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR;
+	tids = malloc((threads) * sizeof(pthread_t));
+	FAIL_IF(!tids);
+
+	running = true;
+	not_ready = threads;
+	for (i = 0; i < threads; i++) {
+		rc = pthread_create(&tids[i], NULL, preempt_fpu_c, NULL);
+		FAIL_IF(rc);
+	}
+
+	setbuf(stdout, NULL);
+	/* Not really necessary but nice to wait for every thread to start */
+	printf("\tWaiting for all workers to start...");
+	while(not_ready);
+	printf("done\n");
+
+	printf("\tWaiting for %d seconds to let some workers get preempted...", PREEMPT_TIME);
+	sleep(PREEMPT_TIME);
+	printf("done\n");
+
+	printf("\tKilling workers...");
+	running = 0;
+	for (i = 0; i < threads; i++) {
+		void *rc_p;
+		pthread_join(tids[i], &rc_p);
+
+		/*
+		 * Harness will say the fail was here, look at why preempt_fpu
+		 * returned
+		 */
+		if ((long) rc_p)
+			printf("oops\n");
+		FAIL_IF((long) rc_p);
+	}
+	printf("done\n");
+
+	free(tids);
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	return test_harness(test_preempt_fpu, "fpu_preempt");
+}
diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testing/selftests/powerpc/math/vmx_asm.S
index e642e67..23db4b3 100644
--- a/tools/testing/selftests/powerpc/math/vmx_asm.S
+++ b/tools/testing/selftests/powerpc/math/vmx_asm.S
@@ -1,5 +1,6 @@
 #include "basic_asm.h"
 
+#POS MUST BE 16 ALIGNED!
 #define PUSH_VMX(pos,reg) \
 	li	reg,pos; \
 	stvx	v20,reg,sp; \
@@ -26,6 +27,7 @@
 	addi	reg,reg,16; \
 	stvx	v31,reg,sp;
 
+#POS MUST BE 16 ALIGNED!
 #define POP_VMX(pos,reg) \
 	li	reg,pos; \
 	lvx	v20,reg,sp; \
@@ -84,7 +86,7 @@ FUNC_END(load_vmx)
 
 #Should be safe from C, only touches r4, r5 and v0,v1,v2
 FUNC_START(check_vmx)
-	PUSH_BASIC_STACK(16)
+	PUSH_BASIC_STACK(32)
 	mr r4,r3
 	li	r3,1 #assume a bad result
 	li	r5,0
@@ -153,7 +155,7 @@ FUNC_START(check_vmx)
 	cmpdi	r0,0xffffffff
 	bne	1f
 	li	r3,0
-1:	POP_BASIC_STACK(16)
+1:	POP_BASIC_STACK(32)
 	blr
 FUNC_END(check_vmx)
 
@@ -164,7 +166,7 @@ FUNC_START(test_vmx)
 	PUSH_BASIC_STACK(512)
 	std	r3,40(sp) #Address of varray
 	std r4,48(sp) #address of pid
-	PUSH_VMX(56, r4)
+	PUSH_VMX(64, r4)
 
 	bl load_vmx
 
@@ -177,7 +179,41 @@ FUNC_START(test_vmx)
 	ld r3,40(sp)
 	bl check_vmx
 
-	POP_VMX(56,r4)
+	POP_VMX(64,r4)
 	POP_BASIC_STACK(512)
 	blr
 FUNC_END(test_vmx)
+
+#int preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal)
+#On starting will (atomically) decrement not_ready as a signal that the VMX
+#registers have been loaded with varray. Will proceed to check the validity
+#of the VMX registers while sentinal is not zero.
+FUNC_START(preempt_vmx)
+	PUSH_BASIC_STACK(512)
+	std r3,32(sp) #v4si *varray
+	std r4,40(sp) #volatile int *not_ready
+	std r5,48(sp) #int *sentinal
+	PUSH_VMX(64,r4)
+
+	bl load_vmx
+
+	#Atomic DEC
+	ld r3,40(sp)
+1:	lwarx r4,0,r3
+	addi r4,r4,-1
+	stwcx. r4,0,r3
+	bne- 1b
+
+2:	ld r3,32(sp)
+	bl check_vmx
+	cmpdi r3,0
+	bne 3f
+	ld r4,48(sp)
+	ld r5,0(r4)
+	cmpwi r5,0
+	bne 2b
+
+3:	POP_VMX(64,r4)
+	POP_BASIC_STACK(512)
+	blr
+FUNC_END(preempt_vmx)
diff --git a/tools/testing/selftests/powerpc/math/vmx_preempt.c b/tools/testing/selftests/powerpc/math/vmx_preempt.c
new file mode 100644
index 0000000..342db15
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/vmx_preempt.c
@@ -0,0 +1,92 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <stdlib.h>
+#include <pthread.h>
+
+#include "utils.h"
+
+/* Time to wait for workers to get preempted (seconds) */
+#define PREEMPT_TIME 20
+/*
+ * Factor by which to multiply number of online CPUs for total number of
+ * worker threads
+ */
+#define THREAD_FACTOR 8
+
+typedef int v4si __attribute__ ((vector_size (16)));
+__thread v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12},
+	{13,14,15,16},{17,18,19,20},{21,22,23,24},
+	{25,26,27,28},{29,30,31,32},{33,34,35,36},
+	{37,38,39,40},{41,42,43,44},{45,46,47,48}};
+
+volatile int not_ready;
+int running;
+
+extern void preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal);
+
+void *preempt_vmx_c(void *p)
+{
+	int i, j;
+	srand(pthread_self());
+	for (i = 0; i < 12; i++)
+		for (j = 0; j < 4; j++)
+			varray[i][j] = rand();
+
+	/* Test fails if it ever returns */
+	preempt_vmx(varray, &not_ready, &running);
+	return p;
+}
+
+int test_preempt_vmx(void)
+{
+	int i, rc, threads;
+	pthread_t *tids;
+
+	threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR;
+	tids = malloc(threads * sizeof(pthread_t));
+	FAIL_IF(!tids);
+
+	running = true;
+	not_ready = threads;
+	for (i = 0; i < threads; i++) {
+		rc = pthread_create(&tids[i], NULL, preempt_vmx_c, NULL);
+		FAIL_IF(rc);
+	}
+
+	setbuf(stdout, NULL);
+	/* Not really necessary but nice to wait for every thread to start */
+	printf("\tWaiting for all workers to start...");
+	while(not_ready);
+	printf("done\n");
+
+	printf("\tWaiting for %d seconds to let some workers get preempted...", PREEMPT_TIME);
+	sleep(PREEMPT_TIME);
+	printf("done\n");
+
+	printf("\tKilling workers...");
+	running = 0;
+	for (i = 0; i < threads; i++) {
+		void *rc_p;
+		pthread_join(tids[i], &rc_p);
+
+		/*
+		 * Harness will say the fail was here, look at why preempt_vmx
+		 * returned
+		 */
+		if ((long) rc_p)
+			printf("oops\n");
+		FAIL_IF((long) rc_p);
+	}
+	printf("done\n");
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	return test_harness(test_preempt_vmx, "vmx_preempt");
+}
-- 
2.6.2


* [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
  2015-11-18  3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur
  2015-11-18  3:26 ` [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption Cyril Bur
@ 2015-11-18  3:26 ` Cyril Bur
  2015-11-19 11:36   ` [3/8] " Michael Ellerman
  2015-11-23  1:04   ` [PATCH 3/8] " Michael Neuling
  2015-11-18  3:26 ` [PATCH 4/8] powerpc: Explicitly disable math features when copying thread Cyril Bur
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 23+ messages in thread
From: Cyril Bur @ 2015-11-18  3:26 UTC (permalink / raw)
  To: mikey, anton, linuxppc-dev

Load up the non-volatile FPU and VMX regs and ensure that they hold the
expected values in a signal handler.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 tools/testing/selftests/powerpc/math/Makefile     |   4 +-
 tools/testing/selftests/powerpc/math/fpu_signal.c | 119 +++++++++++++++++++++
 tools/testing/selftests/powerpc/math/vmx_signal.c | 124 ++++++++++++++++++++++
 3 files changed, 246 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_signal.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_signal.c

diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile
index 9fa690f..5ce000bf 100644
--- a/tools/testing/selftests/powerpc/math/Makefile
+++ b/tools/testing/selftests/powerpc/math/Makefile
@@ -1,4 +1,4 @@
-TEST_PROGS := fpu_syscall fpu_preempt vmx_syscall vmx_preempt
+TEST_PROGS := fpu_syscall fpu_preempt fpu_signal vmx_syscall vmx_preempt vmx_signal
 
 all: $(TEST_PROGS)
 
@@ -7,9 +7,11 @@ $(TEST_PROGS): CFLAGS += -O2 -g -pthread
 
 fpu_syscall: fpu_asm.S
 fpu_preempt: fpu_asm.S
+fpu_signal:  fpu_asm.S
 
 vmx_syscall: vmx_asm.S
 vmx_preempt: vmx_asm.S
+vmx_signal: vmx_asm.S
 
 include ../../lib.mk
 
diff --git a/tools/testing/selftests/powerpc/math/fpu_signal.c b/tools/testing/selftests/powerpc/math/fpu_signal.c
new file mode 100644
index 0000000..ca61d1f
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/fpu_signal.c
@@ -0,0 +1,119 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <stdlib.h>
+#include <pthread.h>
+
+#include "utils.h"
+
+/* Number of times each thread should receive the signal */
+#define ITERATIONS 10
+/*
+ * Factor by which to multiply number of online CPUs for total number of
+ * worker threads
+ */
+#define THREAD_FACTOR 8
+
+__thread double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
+		     1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0,
+		     2.1};
+
+bool bad_context;
+int running;
+volatile int not_ready;
+extern long preempt_fpu(double *darray, volatile int *not_ready, int *sentinal);
+
+void signal_fpu_sig(int sig, siginfo_t *info, void *context)
+{
+	int i;
+	ucontext_t *uc = context;
+	mcontext_t *mc = &uc->uc_mcontext;
+
+	/* Only the non-volatiles were loaded up */
+	for (i = 14; i < 32; i++) {
+		if (mc->fp_regs[i] != darray[i - 14]) {
+			bad_context = true;
+			break;
+		}
+	}
+}
+
+void *signal_fpu_c(void *p)
+{
+	int i;
+	long rc;
+	struct sigaction act;
+	act.sa_sigaction = signal_fpu_sig;
+	act.sa_flags = SA_SIGINFO;
+	rc = sigaction(SIGUSR1, &act, NULL);
+	if (rc)
+		return p;
+
+	srand(pthread_self());
+	for (i = 0; i < 21; i++)
+		darray[i] = rand();
+
+	rc = preempt_fpu(darray, &not_ready, &running);
+
+	return (void *) rc;
+}
+
+int test_signal_fpu(void)
+{
+	int i, j, rc, threads;
+	void *rc_p;
+	pthread_t *tids;
+
+	threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR;
+	tids = malloc(threads * sizeof(pthread_t));
+	FAIL_IF(!tids);
+
+	running = true;
+	not_ready = threads;
+	for (i = 0; i < threads; i++) {
+		rc = pthread_create(&tids[i], NULL, signal_fpu_c, NULL);
+		FAIL_IF(rc);
+	}
+
+	setbuf(stdout, NULL);
+	printf("\tWaiting for all workers to start...");
+	while (not_ready);
+	printf("done\n");
+
+	printf("\tSending signals to all threads %d times...", ITERATIONS);
+	for (i = 0; i < ITERATIONS; i++) {
+		for (j = 0; j < threads; j++) {
+			pthread_kill(tids[j], SIGUSR1);
+		}
+		sleep(1);
+	}
+	printf("done\n");
+
+	printf("\tKilling workers...");
+	running = 0;
+	for (i = 0; i < threads; i++) {
+		pthread_join(tids[i], &rc_p);
+
+		/*
+		 * Harness will say the fail was here, look at why signal_fpu
+		 * returned
+		 */
+		if ((long) rc_p || bad_context)
+			printf("oops\n");
+		if (bad_context)
+			fprintf(stderr, "\t!! bad_context is true\n");
+		FAIL_IF((long) rc_p || bad_context);
+	}
+	printf("done\n");
+
+	free(tids);
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	return test_harness(test_signal_fpu, "fpu_signal");
+}
diff --git a/tools/testing/selftests/powerpc/math/vmx_signal.c b/tools/testing/selftests/powerpc/math/vmx_signal.c
new file mode 100644
index 0000000..007ac9e
--- /dev/null
+++ b/tools/testing/selftests/powerpc/math/vmx_signal.c
@@ -0,0 +1,124 @@
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <stdlib.h>
+#include <string.h>
+#include <pthread.h>
+
+#include "utils.h"
+
+/* Number of times each thread should receive the signal */
+#define ITERATIONS 10
+/*
+ * Factor by which to multiply number of online CPUs for total number of
+ * worker threads
+ */
+#define THREAD_FACTOR 8
+
+typedef int v4si __attribute__ ((vector_size (16)));
+__thread v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12},
+	{13,14,15,16},{17,18,19,20},{21,22,23,24},
+	{25,26,27,28},{29,30,31,32},{33,34,35,36},
+	{37,38,39,40},{41,42,43,44},{45,46,47,48}};
+
+bool bad_context;
+int running;
+volatile int not_ready;
+
+extern int preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal);
+
+void signal_vmx_sig(int sig, siginfo_t *info, void *context)
+{
+	int i;
+	ucontext_t *uc = context;
+	mcontext_t *mc = &uc->uc_mcontext;
+
+	/* Only the non-volatiles were loaded up */
+	for (i = 20; i < 32; i++) {
+		if (memcmp(mc->v_regs->vrregs[i], &varray[i - 20], 16)) {
+			bad_context = true;
+			break;
+		}
+	}
+}
+
+void *signal_vmx_c(void *p)
+{
+	int i, j;
+	long rc;
+	struct sigaction act;
+	act.sa_sigaction = signal_vmx_sig;
+	act.sa_flags = SA_SIGINFO;
+	rc = sigaction(SIGUSR1, &act, NULL);
+	if (rc)
+		return p;
+
+	srand(pthread_self());
+	for (i = 0; i < 12; i++)
+		for (j = 0; j < 4; j++)
+			varray[i][j] = rand();
+
+	rc = preempt_vmx(varray, &not_ready, &running);
+
+	return (void *) rc;
+}
+
+int test_signal_vmx(void)
+{
+	int i, j, rc, threads;
+	void *rc_p;
+	pthread_t *tids;
+
+	threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR;
+	tids = malloc(threads * sizeof(pthread_t));
+	FAIL_IF(!tids);
+
+	running = true;
+	not_ready = threads;
+	for (i = 0; i < threads; i++) {
+		rc = pthread_create(&tids[i], NULL, signal_vmx_c, NULL);
+		FAIL_IF(rc);
+	}
+
+	setbuf(stdout, NULL);
+	printf("\tWaiting for all workers to start...");
+	while (not_ready);
+	printf("done\n");
+
+	printf("\tSending signals to all threads %d times...", ITERATIONS);
+	for (i = 0; i < ITERATIONS; i++) {
+		for (j = 0; j < threads; j++) {
+			pthread_kill(tids[j], SIGUSR1);
+		}
+		sleep(1);
+	}
+	printf("done\n");
+
+	printf("\tKilling workers...");
+	running = 0;
+	for (i = 0; i < threads; i++) {
+		pthread_join(tids[i], &rc_p);
+
+		/*
+		 * Harness will say the fail was here, look at why signal_vmx
+		 * returned
+		 */
+		if ((long) rc_p || bad_context)
+			printf("oops\n");
+		if (bad_context)
+			fprintf(stderr, "\t!! bad_context is true\n");
+		FAIL_IF((long) rc_p || bad_context);
+	}
+	printf("done\n");
+
+	free(tids);
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	return test_harness(test_signal_vmx, "vmx_signal");
+}
-- 
2.6.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/8] powerpc: Explicitly disable math features when copying thread
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
                   ` (2 preceding siblings ...)
  2015-11-18  3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur
@ 2015-11-18  3:26 ` Cyril Bur
  2015-11-23  1:08   ` Michael Neuling
  2015-11-18  3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Cyril Bur @ 2015-11-18  3:26 UTC (permalink / raw)
  To: mikey, anton, linuxppc-dev

Now that threads leave the math bits enabled in their saved MSR to indicate
that the hardware state is hot and a restore is not needed, children need
to have those bits turned off: when a child is first scheduled, there is no
way its registers could be hot.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 arch/powerpc/kernel/process.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 398f7bf..441d9e5 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1243,6 +1243,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
 
 		f = ret_from_fork;
 	}
+	childregs->msr &= ~(MSR_FP|MSR_VEC|MSR_VSX);
 	sp -= STACK_FRAME_OVERHEAD;
 
 	/*
-- 
2.6.2


* [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
                   ` (3 preceding siblings ...)
  2015-11-18  3:26 ` [PATCH 4/8] powerpc: Explicitly disable math features when copying thread Cyril Bur
@ 2015-11-18  3:26 ` Cyril Bur
  2015-11-20 11:01   ` Michael Ellerman
  2015-11-23  1:29   ` Michael Neuling
  2015-11-18  3:26 ` [PATCH 6/8] powerpc: Add the ability to save FPU without giving it up Cyril Bur
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 23+ messages in thread
From: Cyril Bur @ 2015-11-18  3:26 UTC (permalink / raw)
  To: mikey, anton, linuxppc-dev

Currently the FPU, VEC and VSX facilities are lazily loaded. This is not a
problem unless a process is using these facilities.

Modern versions of GCC are very good at automatically vectorising code, new
and modernised workloads make use of floating-point and vector facilities,
and even the kernel makes use of a vectorised memcpy.

All this combined greatly increases the cost of a syscall, since the kernel
sometimes uses these facilities even in the syscall fast path. This makes it
increasingly common for a thread to take an *_unavailable exception soon
after a syscall, and potentially to take all three.

The obvious overcompensation for this problem is to simply always load all
the facilities on every exit to userspace. Loading up all the FPU, VEC and
VSX registers every time can be expensive, and if a workload does avoid
using them, it should not be forced to incur this penalty.

An 8-bit counter is used to detect if the registers have been used in the
past; the registers are always loaded until the value wraps back to zero.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 arch/powerpc/include/asm/processor.h |  2 ++
 arch/powerpc/kernel/asm-offsets.c    |  2 ++
 arch/powerpc/kernel/entry_64.S       | 55 ++++++++++++++++++++++++++++--
 arch/powerpc/kernel/fpu.S            |  4 +++
 arch/powerpc/kernel/process.c        | 66 ++++++++++++++++++++++++++++++------
 arch/powerpc/kernel/vector.S         |  4 +++
 6 files changed, 119 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index ac23308..dcab21f 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -236,11 +236,13 @@ struct thread_struct {
 #endif
 	struct arch_hw_breakpoint hw_brk; /* info on the hardware breakpoint */
 	unsigned long	trap_nr;	/* last trap # on this thread */
+	u8 load_fp;
 #ifdef CONFIG_ALTIVEC
 	struct thread_vr_state vr_state;
 	struct thread_vr_state *vr_save_area;
 	unsigned long	vrsave;
 	int		used_vr;	/* set if process has used altivec */
+	u8 load_vec;
 #endif /* CONFIG_ALTIVEC */
 #ifdef CONFIG_VSX
 	/* VSR status */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 221d584..0f593d7 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -95,12 +95,14 @@ int main(void)
 	DEFINE(THREAD_FPSTATE, offsetof(struct thread_struct, fp_state));
 	DEFINE(THREAD_FPSAVEAREA, offsetof(struct thread_struct, fp_save_area));
 	DEFINE(FPSTATE_FPSCR, offsetof(struct thread_fp_state, fpscr));
+	DEFINE(THREAD_LOAD_FP, offsetof(struct thread_struct, load_fp));
 #ifdef CONFIG_ALTIVEC
 	DEFINE(THREAD_VRSTATE, offsetof(struct thread_struct, vr_state));
 	DEFINE(THREAD_VRSAVEAREA, offsetof(struct thread_struct, vr_save_area));
 	DEFINE(THREAD_VRSAVE, offsetof(struct thread_struct, vrsave));
 	DEFINE(THREAD_USED_VR, offsetof(struct thread_struct, used_vr));
 	DEFINE(VRSTATE_VSCR, offsetof(struct thread_vr_state, vscr));
+	DEFINE(THREAD_LOAD_VEC, offsetof(struct thread_struct, load_vec));
 #endif /* CONFIG_ALTIVEC */
 #ifdef CONFIG_VSX
 	DEFINE(THREAD_USED_VSR, offsetof(struct thread_struct, used_vsr));
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index c8b4225..46e9869 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -210,7 +210,54 @@ system_call:			/* label this so stack traces look sane */
 	li	r11,-MAX_ERRNO
 	andi.	r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
 	bne-	syscall_exit_work
-	cmpld	r3,r11
+
+	/*
+	 * This is an assembly version of checks performed in restore_math()
+	 * to avoid calling C unless absolutely necessary.
+	 * Note: In order to simplify the assembly, if the FP or VEC registers
+	 * are hot (and therefore restore_math() isn't called) the
+	 * LOAD_{FP,VEC} thread counter doesn't get incremented.
+	 * This is likely the best thing to do anyway because hot regs indicate
+	 * that the workload is doing a lot of syscalls that can be handled
+	 * quickly and without the need to touch FP or VEC regs (by the kernel).
+	 * a) If this workload is long running then this is exactly what the
+	 * kernel should be doing.
+	 * b) If this workload isn't long running then we'll soon fall back to
+	 * calling into C and the counter will be incremented regularly again
+	 * anyway.
+	 */
+	ld	r9,PACACURRENT(r13)
+	andi.	r0,r8,MSR_FP
+	addi	r9,r9,THREAD
+	lbz	r5,THREAD_LOAD_FP(r9)
+	/*
+	 * Goto 2 if !r0 && r5
+	 * The cmpb works because r5 can only have bits set in the lowest byte,
+	 * and r0 may or may not have bit 13 set (a different byte) but will
+	 * have a zero low byte; therefore the low bytes must differ if r5 is
+	 * set, and the bit-13 byte must be the same if !r0
+	 */
+	cmpb	r7,r0,r5
+	cmpldi	r7,0xff0
+#ifdef CONFIG_ALTIVEC
+	beq	2f
+
+	lbz	r9,THREAD_LOAD_VEC(r9)
+	andis.	r0,r8,MSR_VEC@h
+	/* Skip (goto 3) if r0 || !r9 */
+	bne	3f
+	cmpldi	r9,0
+	beq 3f
+#else
+	bne 3f
+#endif
+2:	addi    r3,r1,STACK_FRAME_OVERHEAD
+	bl	restore_math
+	ld	r8,_MSR(r1)
+	ld	r3,RESULT(r1)
+	li	r11,-MAX_ERRNO
+
+3:	cmpld	r3,r11
 	ld	r5,_CCR(r1)
 	bge-	syscall_error
 .Lsyscall_error_cont:
@@ -592,8 +639,8 @@ _GLOBAL(ret_from_except_lite)
 
 	/* Check current_thread_info()->flags */
 	andi.	r0,r4,_TIF_USER_WORK_MASK
-#ifdef CONFIG_PPC_BOOK3E
 	bne	1f
+#ifdef CONFIG_PPC_BOOK3E
 	/*
 	 * Check to see if the dbcr0 register is set up to debug.
 	 * Use the internal debug mode bit to do this.
@@ -608,7 +655,9 @@ _GLOBAL(ret_from_except_lite)
 	mtspr	SPRN_DBSR,r10
 	b	restore
 #else
-	beq	restore
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	bl	restore_math
+	b	restore
 #endif
 1:	andi.	r0,r4,_TIF_NEED_RESCHED
 	beq	2f
diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index 2117eac..b063524 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -130,6 +130,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 	or	r12,r12,r4
 	std	r12,_MSR(r1)
 #endif
+	/* Don't care if r4 overflows, this is desired behaviour */
+	lbz	r4,THREAD_LOAD_FP(r5)
+	addi	r4,r4,1
+	stb	r4,THREAD_LOAD_FP(r5)
 	addi	r10,r5,THREAD_FPSTATE
 	lfd	fr0,FPSTATE_FPSCR(r10)
 	MTFSF_L(fr0)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 441d9e5..c602b67 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -366,6 +366,53 @@ void giveup_all(struct task_struct *tsk)
 	msr_check_and_clear(msr_all_available);
 }
 
+void restore_math(struct pt_regs *regs)
+{
+	unsigned long msr;
+
+	if (!current->thread.load_fp
+#ifdef CONFIG_ALTIVEC
+		&& !current->thread.load_vec)
+#else
+		)
+#endif
+		return;
+
+	msr = regs->msr;
+	msr_check_and_set(msr_all_available);
+
+	/*
+	 * Only reload if the bit is not set in the user MSR, the bit BEING set
+	 * indicates that the registers are hot
+	 */
+#ifdef CONFIG_PPC_FPU
+	if (current->thread.load_fp && !(msr & MSR_FP)) {
+		load_fp_state(&current->thread.fp_state);
+		msr |= MSR_FP | current->thread.fpexc_mode;
+		current->thread.load_fp++;
+	}
+#endif
+#ifdef CONFIG_ALTIVEC
+	if (current->thread.load_vec && !(msr & MSR_VEC) &&
+			cpu_has_feature(CPU_FTR_ALTIVEC)) {
+		load_vr_state(&current->thread.vr_state);
+		current->thread.used_vr = 1;
+		msr |= MSR_VEC;
+		current->thread.load_vec++;
+	}
+#endif
+#ifdef CONFIG_VSX
+	if (!(msr & MSR_VSX) && (msr & (MSR_FP | MSR_VEC)) == (MSR_FP | MSR_VEC)) {
+		current->thread.used_vsr = 1;
+		msr |= MSR_VSX;
+	}
+#endif
+
+	msr_check_and_clear(msr_all_available);
+
+	regs->msr = msr;
+}
+
 void flush_all_to_thread(struct task_struct *tsk)
 {
 	if (tsk->thread.regs) {
@@ -806,17 +853,9 @@ void restore_tm_state(struct pt_regs *regs)
 
 	msr_diff = current->thread.ckpt_regs.msr & ~regs->msr;
 	msr_diff &= MSR_FP | MSR_VEC | MSR_VSX;
-	if (msr_diff & MSR_FP) {
-		msr_check_and_set(MSR_FP);
-		load_fp_state(&current->thread.fp_state);
-		msr_check_and_clear(MSR_FP);
-		regs->msr |= current->thread.fpexc_mode;
-	}
-	if (msr_diff & MSR_VEC) {
-		msr_check_and_set(MSR_VEC);
-		load_vr_state(&current->thread.vr_state);
-		msr_check_and_clear(MSR_VEC);
-	}
+
+	restore_math(regs);
+
 	regs->msr |= msr_diff;
 }
 
@@ -977,6 +1016,11 @@ struct task_struct *__switch_to(struct task_struct *prev,
 		batch = this_cpu_ptr(&ppc64_tlb_batch);
 		batch->active = 1;
 	}
+
+	/* Don't do this on a kernel thread */
+	if (current_thread_info()->task->thread.regs)
+		restore_math(current_thread_info()->task->thread.regs);
+
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 	return last;
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 162d0f7..038cff8 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -91,6 +91,10 @@ _GLOBAL(load_up_altivec)
 	oris	r12,r12,MSR_VEC@h
 	std	r12,_MSR(r1)
 #endif
+	/* Don't care if r4 overflows, this is desired behaviour */
+	lbz	r4,THREAD_LOAD_VEC(r5)
+	addi	r4,r4,1
+	stb	r4,THREAD_LOAD_VEC(r5)
 	addi	r6,r5,THREAD_VRSTATE
 	li	r4,1
 	li	r10,VRSTATE_VSCR
-- 
2.6.2


* [PATCH 6/8] powerpc: Add the ability to save FPU without giving it up
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
                   ` (4 preceding siblings ...)
  2015-11-18  3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur
@ 2015-11-18  3:26 ` Cyril Bur
  2015-11-18  3:26 ` [PATCH 7/8] powerpc: Add the ability to save Altivec " Cyril Bur
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 23+ messages in thread
From: Cyril Bur @ 2015-11-18  3:26 UTC (permalink / raw)
  To: mikey, anton, linuxppc-dev

This patch adds the ability to save the FPU registers to the thread struct
without giving up the facility (that is, without disabling it for the next
return to userspace).

This patch optimises the thread copy path (as a result of a fork() or
clone()) so that the parent thread can return to userspace with hot
registers, avoiding a possibly pointless reload of FPU register state.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 arch/powerpc/include/asm/switch_to.h |  2 +-
 arch/powerpc/kernel/fpu.S            | 21 ++++------------
 arch/powerpc/kernel/process.c        | 46 +++++++++++++++++++++++++++++++++++-
 3 files changed, 50 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h
index 171ce13..8cf7fd6 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -28,7 +28,7 @@ extern void giveup_all(struct task_struct *);
 extern void enable_kernel_fp(void);
 extern void flush_fp_to_thread(struct task_struct *);
 extern void giveup_fpu(struct task_struct *);
-extern void __giveup_fpu(struct task_struct *);
+extern void save_fpu(struct task_struct *);
 static inline void disable_kernel_fp(void)
 {
 	msr_check_and_clear(MSR_FP);
diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
index b063524..15da2b5 100644
--- a/arch/powerpc/kernel/fpu.S
+++ b/arch/powerpc/kernel/fpu.S
@@ -143,33 +143,20 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 	blr
 
 /*
- * __giveup_fpu(tsk)
- * Disable FP for the task given as the argument,
- * and save the floating-point registers in its thread_struct.
+ * save_fpu(tsk)
+ * Save the floating-point registers in its thread_struct.
  * Enables the FPU for use in the kernel on return.
  */
-_GLOBAL(__giveup_fpu)
+_GLOBAL(save_fpu)
 	addi	r3,r3,THREAD	        /* want THREAD of task */
 	PPC_LL	r6,THREAD_FPSAVEAREA(r3)
 	PPC_LL	r5,PT_REGS(r3)
 	PPC_LCMPI	0,r6,0
 	bne	2f
 	addi	r6,r3,THREAD_FPSTATE
-2:	PPC_LCMPI	0,r5,0
-	SAVE_32FPVSRS(0, R4, R6)
+2:	SAVE_32FPVSRS(0, R4, R6)
 	mffs	fr0
 	stfd	fr0,FPSTATE_FPSCR(r6)
-	beq	1f
-	PPC_LL	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
-	li	r3,MSR_FP|MSR_FE0|MSR_FE1
-#ifdef CONFIG_VSX
-BEGIN_FTR_SECTION
-	oris	r3,r3,MSR_VSX@h
-END_FTR_SECTION_IFSET(CPU_FTR_VSX)
-#endif
-	andc	r4,r4,r3		/* disable FP for previous task */
-	PPC_STL	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
-1:
 	blr
 
 /*
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index c602b67..51e246a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -133,6 +133,16 @@ void __msr_check_and_clear(unsigned long bits)
 EXPORT_SYMBOL(__msr_check_and_clear);
 
 #ifdef CONFIG_PPC_FPU
+void __giveup_fpu(struct task_struct *tsk)
+{
+	save_fpu(tsk);
+	tsk->thread.regs->msr &= ~MSR_FP;
+#ifdef CONFIG_VSX
+	if (cpu_has_feature(CPU_FTR_VSX))
+		tsk->thread.regs->msr &= ~MSR_VSX;
+#endif
+}
+
 void giveup_fpu(struct task_struct *tsk)
 {
 	check_if_tm_restore_required(tsk);
@@ -413,12 +423,46 @@ void restore_math(struct pt_regs *regs)
 	regs->msr = msr;
 }
 
+void save_all(struct task_struct *tsk)
+{
+	unsigned long usermsr;
+
+	if (!tsk->thread.regs)
+		return;
+
+	usermsr = tsk->thread.regs->msr;
+
+	if ((usermsr & msr_all_available) == 0)
+		return;
+
+	msr_check_and_set(msr_all_available);
+
+#ifdef CONFIG_PPC_FPU
+	if (usermsr & MSR_FP)
+		save_fpu(tsk);
+#endif
+#ifdef CONFIG_ALTIVEC
+	if (usermsr & MSR_VEC)
+		__giveup_altivec(tsk);
+#endif
+#ifdef CONFIG_VSX
+	if (usermsr & MSR_VSX)
+		__giveup_vsx(tsk);
+#endif
+#ifdef CONFIG_SPE
+	if (usermsr & MSR_SPE)
+		__giveup_spe(tsk);
+#endif
+
+	msr_check_and_clear(msr_all_available);
+}
+
 void flush_all_to_thread(struct task_struct *tsk)
 {
 	if (tsk->thread.regs) {
 		preempt_disable();
 		BUG_ON(tsk != current);
-		giveup_all(tsk);
+		save_all(tsk);
 
 #ifdef CONFIG_SPE
 		if (tsk->thread.regs->msr & MSR_SPE)
-- 
2.6.2


* [PATCH 7/8] powerpc: Add the ability to save Altivec without giving it up
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
                   ` (5 preceding siblings ...)
  2015-11-18  3:26 ` [PATCH 6/8] powerpc: Add the ability to save FPU without giving it up Cyril Bur
@ 2015-11-18  3:26 ` Cyril Bur
  2015-11-18  3:26 ` [PATCH 8/8] powerpc: Add the ability to save VSX " Cyril Bur
  2015-11-18 14:51 ` [PATCH 0/8] FP/VEC/VSX switching optimisations David Laight
  8 siblings, 0 replies; 23+ messages in thread
From: Cyril Bur @ 2015-11-18  3:26 UTC (permalink / raw)
  To: mikey, anton, linuxppc-dev

This patch adds the ability to save the VEC registers to the thread struct
without giving up the facility (that is, without disabling it for the next
return to userspace).

This patch builds on a previous optimisation for the FPU registers in the
thread copy path to avoid a possibly pointless reload of VEC state.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 arch/powerpc/include/asm/switch_to.h |  2 +-
 arch/powerpc/kernel/process.c        | 12 +++++++++++-
 arch/powerpc/kernel/vector.S         | 24 ++++--------------------
 3 files changed, 16 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h
index 8cf7fd6..372f297 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -39,7 +39,7 @@ static inline void disable_kernel_fp(void)
 extern void enable_kernel_altivec(void);
 extern void flush_altivec_to_thread(struct task_struct *);
 extern void giveup_altivec(struct task_struct *);
-extern void __giveup_altivec(struct task_struct *);
+extern void save_altivec(struct task_struct *);
 static inline void disable_kernel_altivec(void)
 {
 	msr_check_and_clear(MSR_VEC);
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 51e246a..19e803a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -198,6 +198,16 @@ void enable_kernel_fp(void)
 EXPORT_SYMBOL(enable_kernel_fp);
 
 #ifdef CONFIG_ALTIVEC
+void __giveup_altivec(struct task_struct *tsk)
+{
+	save_altivec(tsk);
+	tsk->thread.regs->msr &= ~MSR_VEC;
+#ifdef CONFIG_VSX
+	if (cpu_has_feature(CPU_FTR_VSX))
+		tsk->thread.regs->msr &= ~MSR_VSX;
+#endif
+}
+
 void giveup_altivec(struct task_struct *tsk)
 {
 	check_if_tm_restore_required(tsk);
@@ -443,7 +453,7 @@ void save_all(struct task_struct *tsk)
 #endif
 #ifdef CONFIG_ALTIVEC
 	if (usermsr & MSR_VEC)
-		__giveup_altivec(tsk);
+		save_altivec(tsk);
 #endif
 #ifdef CONFIG_VSX
 	if (usermsr & MSR_VSX)
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 038cff8..51b0c17 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -106,36 +106,20 @@ _GLOBAL(load_up_altivec)
 	blr
 
 /*
- * __giveup_altivec(tsk)
- * Disable VMX for the task given as the argument,
- * and save the vector registers in its thread_struct.
+ * save_altivec(tsk)
+ * Save the vector registers to its thread_struct
  */
-_GLOBAL(__giveup_altivec)
+_GLOBAL(save_altivec)
 	addi	r3,r3,THREAD		/* want THREAD of task */
 	PPC_LL	r7,THREAD_VRSAVEAREA(r3)
 	PPC_LL	r5,PT_REGS(r3)
 	PPC_LCMPI	0,r7,0
 	bne	2f
 	addi	r7,r3,THREAD_VRSTATE
-2:	PPC_LCMPI	0,r5,0
-	SAVE_32VRS(0,r4,r7)
+2:	SAVE_32VRS(0,r4,r7)
 	mfvscr	v0
 	li	r4,VRSTATE_VSCR
 	stvx	v0,r4,r7
-	beq	1f
-	PPC_LL	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
-#ifdef CONFIG_VSX
-BEGIN_FTR_SECTION
-	lis	r3,(MSR_VEC|MSR_VSX)@h
-FTR_SECTION_ELSE
-	lis	r3,MSR_VEC@h
-ALT_FTR_SECTION_END_IFSET(CPU_FTR_VSX)
-#else
-	lis	r3,MSR_VEC@h
-#endif
-	andc	r4,r4,r3		/* disable FP for previous task */
-	PPC_STL	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
-1:
 	blr
 
 #ifdef CONFIG_VSX
-- 
2.6.2


* [PATCH 8/8] powerpc: Add the ability to save VSX without giving it up
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
                   ` (6 preceding siblings ...)
  2015-11-18  3:26 ` [PATCH 7/8] powerpc: Add the ability to save Altivec " Cyril Bur
@ 2015-11-18  3:26 ` Cyril Bur
  2015-11-18 14:51 ` [PATCH 0/8] FP/VEC/VSX switching optimisations David Laight
  8 siblings, 0 replies; 23+ messages in thread
From: Cyril Bur @ 2015-11-18  3:26 UTC (permalink / raw)
  To: mikey, anton, linuxppc-dev

This patch adds the ability to save the VSX registers to the thread struct
without giving up the facility (that is, without disabling it for the next
return to userspace).

This patch builds on a previous optimisation for the FPU and VEC registers
in the thread copy path to avoid a possibly pointless reload of VSX state.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 arch/powerpc/include/asm/switch_to.h |  1 -
 arch/powerpc/kernel/ppc_ksyms.c      |  4 ----
 arch/powerpc/kernel/process.c        | 23 ++++++++++++++++++-----
 arch/powerpc/kernel/vector.S         | 17 -----------------
 4 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h
index 372f297..15843d3 100644
--- a/arch/powerpc/include/asm/switch_to.h
+++ b/arch/powerpc/include/asm/switch_to.h
@@ -50,7 +50,6 @@ static inline void disable_kernel_altivec(void)
 extern void enable_kernel_vsx(void);
 extern void flush_vsx_to_thread(struct task_struct *);
 extern void giveup_vsx(struct task_struct *);
-extern void __giveup_vsx(struct task_struct *);
 static inline void disable_kernel_vsx(void)
 {
 	msr_check_and_clear(MSR_FP|MSR_VEC|MSR_VSX);
diff --git a/arch/powerpc/kernel/ppc_ksyms.c b/arch/powerpc/kernel/ppc_ksyms.c
index 41e1607..ef7024da 100644
--- a/arch/powerpc/kernel/ppc_ksyms.c
+++ b/arch/powerpc/kernel/ppc_ksyms.c
@@ -28,10 +28,6 @@ EXPORT_SYMBOL(load_vr_state);
 EXPORT_SYMBOL(store_vr_state);
 #endif
 
-#ifdef CONFIG_VSX
-EXPORT_SYMBOL_GPL(__giveup_vsx);
-#endif
-
 #ifdef CONFIG_EPAPR_PARAVIRT
 EXPORT_SYMBOL(epapr_hypercall_start);
 #endif
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 19e803a..e0bb3d7 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -248,20 +248,33 @@ EXPORT_SYMBOL_GPL(flush_altivec_to_thread);
 #endif /* CONFIG_ALTIVEC */
 
 #ifdef CONFIG_VSX
-void giveup_vsx(struct task_struct *tsk)
+void __giveup_vsx(struct task_struct *tsk)
 {
-	check_if_tm_restore_required(tsk);
-
-	msr_check_and_set(MSR_FP|MSR_VEC|MSR_VSX);
 	if (tsk->thread.regs->msr & MSR_FP)
 		__giveup_fpu(tsk);
 	if (tsk->thread.regs->msr & MSR_VEC)
 		__giveup_altivec(tsk);
+	tsk->thread.regs->msr &= ~MSR_VSX;
+}
+
+void giveup_vsx(struct task_struct *tsk)
+{
+	check_if_tm_restore_required(tsk);
+
+	msr_check_and_set(MSR_FP|MSR_VEC|MSR_VSX);
 	__giveup_vsx(tsk);
 	msr_check_and_clear(MSR_FP|MSR_VEC|MSR_VSX);
 }
 EXPORT_SYMBOL(giveup_vsx);
 
+void save_vsx(struct task_struct *tsk)
+{
+	if (tsk->thread.regs->msr & MSR_FP)
+		save_fpu(tsk);
+	if (tsk->thread.regs->msr & MSR_VEC)
+		save_altivec(tsk);
+}
+
 void enable_kernel_vsx(void)
 {
 	WARN_ON(preemptible());
@@ -457,7 +470,7 @@ void save_all(struct task_struct *tsk)
 #endif
 #ifdef CONFIG_VSX
 	if (usermsr & MSR_VSX)
-		__giveup_vsx(tsk);
+		save_vsx(tsk);
 #endif
 #ifdef CONFIG_SPE
 	if (usermsr & MSR_SPE)
diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
index 51b0c17..1c2e7a3 100644
--- a/arch/powerpc/kernel/vector.S
+++ b/arch/powerpc/kernel/vector.S
@@ -151,23 +151,6 @@ _GLOBAL(load_up_vsx)
 	std	r12,_MSR(r1)
 	b	fast_exception_return
 
-/*
- * __giveup_vsx(tsk)
- * Disable VSX for the task given as the argument.
- * Does NOT save vsx registers.
- */
-_GLOBAL(__giveup_vsx)
-	addi	r3,r3,THREAD		/* want THREAD of task */
-	ld	r5,PT_REGS(r3)
-	cmpdi	0,r5,0
-	beq	1f
-	ld	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
-	lis	r3,MSR_VSX@h
-	andc	r4,r4,r3		/* disable VSX for previous task */
-	std	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
-1:
-	blr
-
 #endif /* CONFIG_VSX */
 
 
-- 
2.6.2


* RE: [PATCH 0/8] FP/VEC/VSX switching optimisations
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
                   ` (7 preceding siblings ...)
  2015-11-18  3:26 ` [PATCH 8/8] powerpc: Add the ability to save VSX " Cyril Bur
@ 2015-11-18 14:51 ` David Laight
  2015-11-18 23:01   ` Cyril Bur
  8 siblings, 1 reply; 23+ messages in thread
From: David Laight @ 2015-11-18 14:51 UTC (permalink / raw)
  To: 'Cyril Bur', mikey@neuling.org, anton@samba.org,
	linuxppc-dev@ozlabs.org

From: Cyril Bur
> Sent: 18 November 2015 03:27
...
> The goal of these patches is to rework how the 'math' registers (FP, VEC
> and VSX) are context switched. Currently the kernel adopts a lazy approach,
> always switching userspace tasks with all three facilities disabled and
> loads in each set of registers upon receiving each unavailable exception.
> The kernel does try to avoid disabling the features in the syscall quick
> path but it during testing it appears that even what should be a simple
> syscall still causes the kernel to use some facilities (vectorised memcpy
> for example) for its self and therefore disable it for the user task.

Perhaps the kernel should be avoiding using these registers?
I wonder if the gain from using vectorised memcpy is typically
enough to warrant the cost of the save and restore?

There may even be scope for kernel code doing a save/restore
of a small number of registers onto an in-stack save area.
It would need to be linked to the data of the thread
that owns the fpu registers so that a save request could
be honoured.
Pre-emption would probably need to be disabled, but nested
use, and use from ISR should be ok.

	David


* Re: [PATCH 0/8] FP/VEC/VSX switching optimisations
  2015-11-18 14:51 ` [PATCH 0/8] FP/VEC/VSX switching optimisations David Laight
@ 2015-11-18 23:01   ` Cyril Bur
  0 siblings, 0 replies; 23+ messages in thread
From: Cyril Bur @ 2015-11-18 23:01 UTC (permalink / raw)
  To: David Laight; +Cc: mikey@neuling.org, anton@samba.org, linuxppc-dev@ozlabs.org

On Wed, 18 Nov 2015 14:51:25 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Cyril Bur
> > Sent: 18 November 2015 03:27  
> ...
> > The goal of these patches is to rework how the 'math' registers (FP, VEC
> > and VSX) are context switched. Currently the kernel adopts a lazy approach,
> > always switching userspace tasks with all three facilities disabled and
> > loads in each set of registers upon receiving each unavailable exception.
> > The kernel does try to avoid disabling the features in the syscall quick
> > path but it during testing it appears that even what should be a simple
> > syscall still causes the kernel to use some facilities (vectorised memcpy
> > for example) for its self and therefore disable it for the user task.  
> 

Hi David,

> Perhaps the kernel should be avoiding using these registers?
> I wonder if the gain from using vectorised memcpy is typically
> enough to warrant the cost of the save and restore?
> 

Yeah, on smaller copies that might be the way to go.

> There may even be scope for kernel code doing a save/restore
> of a small number of registers onto an in-stack save area.

This has been thrown up in the air; there are also the volatiles/non-volatiles to
consider, and the caveat that glibc doesn't quite respect the ABI here.

As it turns out (and no one is more surprised than me), despite the other
attempts at optimising, this series really has boiled down to removing the need
for processes to take the facility unavailable interrupts.

I do plan to carry on with optimising in this area and will have a look to see
what I can do.

Cyril

> It would need to be linked to the data of the thread
> that owns the fpu registers so that a save request could
> be honoured.
> Pre-emption would probably need to be disabled, but nested
> use, and use from ISR should be ok.
> 
> 	David
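
[Editorial note: David's in-stack save/restore idea can be sketched abstractly. The following is a hypothetical model only — the names (`fp_save_area`, `fp_save`, `fp_restore`) are invented, and the FP register file is modelled as a plain array; real kernel code would use `stfd`/`lfd` instructions, disable preemption, and link the area to the owning thread as David describes.]

```c
#include <string.h>

#define NSAVE 4
static double fpr[32];          /* stand-in for the FP register file */

struct fp_save_area { double regs[NSAVE]; };

/* Save a small number of "registers" into an on-stack area.  In real
 * kernel code this would disable preemption and would need to be
 * linked to the thread that owns the FP state so a save request from
 * the context-switch path could be honoured. */
static void fp_save(struct fp_save_area *a)
{
	memcpy(a->regs, fpr, sizeof a->regs);
}

/* Restore the saved "registers" once the kernel is done with them. */
static void fp_restore(const struct fp_save_area *a)
{
	memcpy(fpr, a->regs, sizeof a->regs);
}
```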
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext
  2015-11-18  3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur
@ 2015-11-19 11:36   ` Michael Ellerman
  2015-11-23  1:04   ` [PATCH 3/8] " Michael Neuling
  1 sibling, 0 replies; 23+ messages in thread
From: Michael Ellerman @ 2015-11-19 11:36 UTC (permalink / raw)
  To: Cyril Bur, mikey, anton, linuxppc-dev


On Wed, 2015-18-11 at 03:26:50 UTC, Cyril Bur wrote:
> Load up the non volatile FPU and VMX regs and ensure that they are the
> expected value in a signal handler

This is giving me:

  $ /usr/bin/powerpc-linux-gnu-gcc -Wall -O2 -flto -Wall -Werror -DGIT_VERSION='"v4.0-rc1-55000-g906d582"' -I/home/buildbot/buildbot/slave/selftests-ppc64-gcc-ubuntu-be/build/tools/testing/selftests/powerpc  -O2 -g -pthread    fpu_signal.c ../harness.c fpu_asm.S   -o fpu_signal
  fpu_signal.c: In function ‘signal_fpu_sig’:
  fpu_signal.c:33:19: error: initialization from incompatible pointer type [-Werror]
    mcontext_t *mc = &uc->uc_mcontext;
                     ^
  fpu_signal.c:37:9: error: ‘mcontext_t’ has no member named ‘fp_regs’
     if (mc->fp_regs[i] != darray[i - 14]) {
           ^


As well as:

  $ /usr/bin/powerpc-linux-gnu-gcc -Wall -O2 -flto -Wall -Werror -DGIT_VERSION='"v4.0-rc1-55000-g906d582"' -I/home/buildbot/buildbot/slave/selftests-ppc64-gcc-ubuntu-be/build/tools/testing/selftests/powerpc  -O2 -g -pthread    vmx_syscall.c ../harness.c vmx_asm.S   -o vmx_syscall
  vmx_asm.S: Assembler messages:
  vmx_asm.S:61: Error: unsupported relocation against v20
  vmx_asm.S:63: Error: unsupported relocation against v21

Times infinity.

And same for vmx_preempt.

And:

  $ /usr/bin/powerpc-linux-gnu-gcc -Wall -O2 -flto -Wall -Werror -DGIT_VERSION='"v4.0-rc1-55000-g906d582"' -I/home/buildbot/buildbot/slave/selftests-ppc64-gcc-ubuntu-be/build/tools/testing/selftests/powerpc  -O2 -g -pthread    vmx_signal.c ../harness.c vmx_asm.S   -o vmx_signal
  make[1]: *** [vmx_preempt] Error 1
  vmx_signal.c: In function ‘signal_vmx_sig’:
  vmx_signal.c:37:19: error: initialization from incompatible pointer type [-Werror]
    mcontext_t *mc = &uc->uc_mcontext;
                     ^
  vmx_signal.c:41:16: error: ‘mcontext_t’ has no member named ‘v_regs’
     if (memcmp(mc->v_regs->vrregs[i], &varray[i - 20], 16)) {
                ^


cheers

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used
  2015-11-18  3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur
@ 2015-11-20 11:01   ` Michael Ellerman
  2015-11-22 22:18     ` Cyril Bur
  2015-11-23  1:29   ` Michael Neuling
  1 sibling, 1 reply; 23+ messages in thread
From: Michael Ellerman @ 2015-11-20 11:01 UTC (permalink / raw)
  To: Cyril Bur, mikey, anton, linuxppc-dev

On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index c8b4225..46e9869 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -210,7 +210,54 @@ system_call:			/* label this so stack traces look sane */
>  	li	r11,-MAX_ERRNO
>  	andi.	r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
>  	bne-	syscall_exit_work
> -	cmpld	r3,r11
> +
> +	/*
> +	 * This is an assembly version of checks performed in restore_math()
> +	 * to avoid calling C unless absolutely necessary.
> +	 * Note: In order to simplify the assembly, if the FP or VEC registers
> +	 * are hot (and therefore restore_math() isn't called) the
> +	 * LOAD_{FP,VEC} thread counter doesn't get incremented.
> +	 * This is likely the best thing to do anyway because hot regs indicate
> +	 * that the workload is doing a lot of syscalls that can be handled
> +	 * quickly and without the need to touch FP or VEC regs (by the kernel).
> +	 * a) If this workload is long running then this is exactly what the
> +	 * kernel should be doing.
> +	 * b) If this workload isn't long running then we'll soon fall back to
> +	 * calling into C and the counter will be incremented regularly again
> +	 * anyway.
> +	 */
> +	ld	r9,PACACURRENT(r13)
> +	andi.	r0,r8,MSR_FP
> +	addi	r9,r9,THREAD
> +	lbz	r5,THREAD_LOAD_FP(r9)
> +	/*
> +	 * Goto 2 if !r0 && r5
> +	 * The cmpb works because r5 can only have bits set in the lowest byte
> +	 * and r0 may or may not have bit 13 set (different byte) but will have
> +	 * a zero low byte therefore the low bytes must differ if r5 == true
> +	 * and the bit 13 byte must be the same if !r0
> +	 */
> +	cmpb	r7,r0,r5

cmpb is new since Power6, which means it doesn't exist on Cell -> Program Check :)

I'm testing a patch using crandc, but I don't like it.

I'm not a big fan of the logic here, it's unpleasantly complicated. Did you
benchmark going to C to do the checks? Or I wonder if we could just check
THREAD_LOAD_FP || THREAD_LOAD_VEC and if either is set we go to restore_math().

Or on the other hand we check !MSR_FP && !MSR_VEC and if so we go to
restore_math()?

> +	cmpldi	r7,0xff0
> +#ifdef CONFIG_ALTIVEC
> +	beq	2f
> +
> +	lbz	r9,THREAD_LOAD_VEC(r9)
> +	andis.	r0,r8,MSR_VEC@h
> +	/* Skip (goto 3) if r0 || !r9 */
> +	bne	3f
> +	cmpldi	r9,0
> +	beq 3f
> +#else
> +	bne 3f
> +#endif
> +2:	addi    r3,r1,STACK_FRAME_OVERHEAD
> +	bl	restore_math
> +	ld	r8,_MSR(r1)
> +	ld	r3,RESULT(r1)
> +	li	r11,-MAX_ERRNO
> +
> +3:	cmpld	r3,r11
>  	ld	r5,_CCR(r1)
>  	bge-	syscall_error
>  .Lsyscall_error_cont:


cheers
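
[Editorial note: for readers unfamiliar with `cmpb` in the hunk above — it compares the two source registers byte by byte and sets each byte of the result to 0xff where they are equal. A C model of the instruction's semantics (illustrative only; it makes no claim about whether the `cmpldi r7,0xff0` comparison in the patch is correct):]

```c
#include <stdint.h>

/* Byte-wise compare, modelling the Power ISA cmpb instruction
 * (introduced in ISA 2.05, hence unavailable on Cell): each result
 * byte is 0xff where the corresponding source bytes are equal, else
 * 0x00. */
static uint64_t cmpb_model(uint64_t a, uint64_t b)
{
	uint64_t r = 0;

	for (int i = 0; i < 8; i++) {
		uint64_t mask = 0xffULL << (8 * i);
		if ((a & mask) == (b & mask))
			r |= mask;
	}
	return r;
}
```

For example, with r0 = MSR_FP (bit 13, value 0x2000) and a non-zero load counter in the low byte, the two interesting bytes are byte 1 (matches only when MSR_FP is clear) and byte 0 (matches only when the counter is zero), which is the property the comment in the hunk relies on.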

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used
  2015-11-20 11:01   ` Michael Ellerman
@ 2015-11-22 22:18     ` Cyril Bur
  2015-11-22 23:07       ` Michael Ellerman
  0 siblings, 1 reply; 23+ messages in thread
From: Cyril Bur @ 2015-11-22 22:18 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: mikey, anton, linuxppc-dev

On Fri, 20 Nov 2015 22:01:04 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index c8b4225..46e9869 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -210,7 +210,54 @@ system_call:			/* label this so stack traces look sane */
> >  	li	r11,-MAX_ERRNO
> >  	andi.	r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> >  	bne-	syscall_exit_work
> > -	cmpld	r3,r11
> > +
> > +	/*
> > +	 * This is an assembly version of checks performed in restore_math()
> > +	 * to avoid calling C unless absolutely necessary.
> > +	 * Note: In order to simplify the assembly, if the FP or VEC registers
> > +	 * are hot (and therefore restore_math() isn't called) the
> > +	 * LOAD_{FP,VEC} thread counter doesn't get incremented.
> > +	 * This is likely the best thing to do anyway because hot regs indicate
> > +	 * that the workload is doing a lot of syscalls that can be handled
> > +	 * quickly and without the need to touch FP or VEC regs (by the kernel).
> > +	 * a) If this workload is long running then this is exactly what the
> > +	 * kernel should be doing.
> > +	 * b) If this workload isn't long running then we'll soon fall back to
> > +	 * calling into C and the counter will be incremented regularly again
> > +	 * anyway.
> > +	 */
> > +	ld	r9,PACACURRENT(r13)
> > +	andi.	r0,r8,MSR_FP
> > +	addi	r9,r9,THREAD
> > +	lbz	r5,THREAD_LOAD_FP(r9)
> > +	/*
> > +	 * Goto 2 if !r0 && r5
> > +	 * The cmpb works because r5 can only have bits set in the lowest byte
> > +	 * and r0 may or may not have bit 13 set (different byte) but will have
> > +	 * a zero low byte therefore the low bytes must differ if r5 == true
> > +	 * and the bit 13 byte must be the same if !r0
> > +	 */
> > +	cmpb	r7,r0,r5  
> 
> cmpb is new since Power6, which means it doesn't exist on Cell -> Program Check :)
> 
Oops, sorry.

> I'm testing a patch using crandc, but I don't like it.
> 
> I'm not a big fan of the logic here, it's unpleasantly complicated. Did you
> benchmark going to C to do the checks? Or I wonder if we could just check
> THREAD_LOAD_FP || THREAD_LOAD_VEC and if either is set we go to restore_math().
> 

I didn't benchmark going to C, mostly because you wanted to avoid calling C
unless necessary in that path. Based on the results I got benchmarking this
series, I expect calling C will also be in the noise of removing the
exception.

> Or on the other hand we check !MSR_FP && !MSR_VEC and if so we go to
> restore_math()?
> 

That seems like the best check to leave in the assembly if you want to avoid
complicated assembly in there.

> > +	cmpldi	r7,0xff0
> > +#ifdef CONFIG_ALTIVEC
> > +	beq	2f
> > +
> > +	lbz	r9,THREAD_LOAD_VEC(r9)
> > +	andis.	r0,r8,MSR_VEC@h
> > +	/* Skip (goto 3) if r0 || !r9 */
> > +	bne	3f
> > +	cmpldi	r9,0
> > +	beq 3f
> > +#else
> > +	bne 3f
> > +#endif
> > +2:	addi    r3,r1,STACK_FRAME_OVERHEAD
> > +	bl	restore_math
> > +	ld	r8,_MSR(r1)
> > +	ld	r3,RESULT(r1)
> > +	li	r11,-MAX_ERRNO
> > +
> > +3:	cmpld	r3,r11
> >  	ld	r5,_CCR(r1)
> >  	bge-	syscall_error
> >  .Lsyscall_error_cont:  
> 
> 
> cheers
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used
  2015-11-22 22:18     ` Cyril Bur
@ 2015-11-22 23:07       ` Michael Ellerman
  0 siblings, 0 replies; 23+ messages in thread
From: Michael Ellerman @ 2015-11-22 23:07 UTC (permalink / raw)
  To: Cyril Bur; +Cc: mikey, anton, linuxppc-dev

On Mon, 2015-11-23 at 09:18 +1100, Cyril Bur wrote:
> On Fri, 20 Nov 2015 22:01:04 +1100
> Michael Ellerman <mpe@ellerman.id.au> wrote:
> > On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> > > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > > index c8b4225..46e9869 100644
> > > --- a/arch/powerpc/kernel/entry_64.S
> > > +++ b/arch/powerpc/kernel/entry_64.S
> > > @@ -210,7 +210,54 @@ system_call:			/* label this so stack traces look sane */
> > >  	li	r11,-MAX_ERRNO
> > >  	andi.	r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> > >  	bne-	syscall_exit_work
> > > -	cmpld	r3,r11
> > > +
> > > +	/*
> > > +	 * This is an assembly version of checks performed in restore_math()
> > > +	 * to avoid calling C unless absolutely necessary.
> > > +	 * Note: In order to simplify the assembly, if the FP or VEC registers
> > > +	 * are hot (and therefore restore_math() isn't called) the
> > > +	 * LOAD_{FP,VEC} thread counter doesn't get incremented.
> > > +	 * This is likely the best thing to do anyway because hot regs indicate
> > > +	 * that the workload is doing a lot of syscalls that can be handled
> > > +	 * quickly and without the need to touch FP or VEC regs (by the kernel).
> > > +	 * a) If this workload is long running then this is exactly what the
> > > +	 * kernel should be doing.
> > > +	 * b) If this workload isn't long running then we'll soon fall back to
> > > +	 * calling into C and the counter will be incremented regularly again
> > > +	 * anyway.
> > > +	 */
> > > +	ld	r9,PACACURRENT(r13)
> > > +	andi.	r0,r8,MSR_FP
> > > +	addi	r9,r9,THREAD
> > > +	lbz	r5,THREAD_LOAD_FP(r9)
> > > +	/*
> > > +	 * Goto 2 if !r0 && r5
> > > +	 * The cmpb works because r5 can only have bits set in the lowest byte
> > > +	 * and r0 may or may not have bit 13 set (different byte) but will have
> > > +	 * a zero low byte therefore the low bytes must differ if r5 == true
> > > +	 * and the bit 13 byte must be the same if !r0
> > > +	 */
> > > +	cmpb	r7,r0,r5  
> > 
> > cmpb is new since Power6, which means it doesn't exist on Cell -> Program Check :)
> > 
> Oops, sorry.

That's fine, there's almost no way for you to know that from reading the
documentation.

> > I'm testing a patch using crandc, but I don't like it.
> > 
> > I'm not a big fan of the logic here, it's unpleasantly complicated. Did you
> > benchmark going to C to do the checks? Or I wonder if we could just check
> > THREAD_LOAD_FP || THREAD_LOAD_VEC and if either is set we go to restore_math().
> > 
> 
> I didn't benchmark going to C mostly because you wanted to avoid calling C
> unless necessary in that path. Based off the results I got benchmarking the
> this series I expect calling C will also be in the noise of removing the
> exception.

Yeah I figured it was probably me that said "avoid C at all costs". But I've
changed my mind ;)

> > Or on the other hand we check !MSR_FP && !MSR_VEC and if so we go to
> > restore_math()?
> 
> That seems like the best check to leave in the assembly if you want to avoid
> complicated assembly in there.

Cool. If you can benchmark that that'd be great, mmkay.

cheers
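
[Editorial note: the simpler check the two settle on here — go to `restore_math()` whenever either facility is disabled in the return MSR — can be stated in a few lines of C. This is an illustrative sketch, not the entry_64.S code; the MSR bit values assumed are the 64-bit Power ones (MSR_FP is bit 13, MSR_VEC bit 25):]

```c
#include <stdint.h>
#include <stdbool.h>

#define MSR_FP  (1UL << 13)   /* FP facility available */
#define MSR_VEC (1UL << 25)   /* AltiVec facility available */

/* Call restore_math() only when at least one facility is cold; if both
 * are already enabled in the MSR the task returns to userspace with,
 * there is nothing to restore. */
static bool need_restore_math(uint64_t msr)
{
	return (msr & (MSR_FP | MSR_VEC)) != (MSR_FP | MSR_VEC);
}
```

The counter checks against THREAD_LOAD_FP/THREAD_LOAD_VEC would then live entirely in the C side of `restore_math()`, which is the simplification being proposed.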

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall
  2015-11-18  3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur
@ 2015-11-23  0:23   ` Michael Neuling
  2015-11-23  0:58     ` Cyril Bur
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Neuling @ 2015-11-23  0:23 UTC (permalink / raw)
  To: Cyril Bur, anton, linuxppc-dev

On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> Test that the non volatile floating point and Altivec registers get
> correctly preserved across the fork() syscall.

Can we add a test for VSX too?  I realise it's the same registers, but
the enable bits in the MSR are different so it's easy to get them wrong
in the kernel.
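
[Editorial note: the point about distinct enable bits can be made concrete. The bit positions below are the 64-bit Power MSR assignments as used by the kernel; this is an illustrative check, not selftest code:]

```c
#include <stdint.h>

/* FP, VSX and VMX share register state, but each has its own MSR
 * enable bit, so a VSX test must flip a bit that neither the FPU nor
 * the VMX test exercises — which is why it is easy to get wrong. */
#define MSR_FP  (1UL << 13)
#define MSR_VSX (1UL << 23)
#define MSR_VEC (1UL << 25)

static int distinct_enable_bits(void)
{
	return (MSR_FP & MSR_VSX) == 0 &&
	       (MSR_FP & MSR_VEC) == 0 &&
	       (MSR_VSX & MSR_VEC) == 0;
}
```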

Additional comments below.

> fork() works nicely for this purpose, the registers should be the same for
> both parent and child
> 
> Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
> ---
>  tools/testing/selftests/powerpc/Makefile           |   3 +-
>  tools/testing/selftests/powerpc/math/Makefile      |  14 ++
>  tools/testing/selftests/powerpc/math/basic_asm.h   |  26 +++
>  tools/testing/selftests/powerpc/math/fpu_asm.S     | 151 +++++++++++++++++
>  tools/testing/selftests/powerpc/math/fpu_syscall.c |  79 +++++++++
>  tools/testing/selftests/powerpc/math/vmx_asm.S     | 183 +++++++++++++++++++++
>  tools/testing/selftests/powerpc/math/vmx_syscall.c |  81 +++++++++
>  7 files changed, 536 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/powerpc/math/Makefile
>  create mode 100644 tools/testing/selftests/powerpc/math/basic_asm.h
>  create mode 100644 tools/testing/selftests/powerpc/math/fpu_asm.S
>  create mode 100644 tools/testing/selftests/powerpc/math/fpu_syscall.c
>  create mode 100644 tools/testing/selftests/powerpc/math/vmx_asm.S
>  create mode 100644 tools/testing/selftests/powerpc/math/vmx_syscall.c
> 
> diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
> index 0c2706b..19e8191 100644
> --- a/tools/testing/selftests/powerpc/Makefile
> +++ b/tools/testing/selftests/powerpc/Makefile
> @@ -22,7 +22,8 @@ SUB_DIRS = benchmarks		\
>  	   switch_endian	\
>  	   syscalls		\
>  	   tm			\
> -	   vphn
> +	   vphn         \
> +	   math
>  
>  endif
>  
> diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile
> new file mode 100644
> index 0000000..896d9e2
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/Makefile
> @@ -0,0 +1,14 @@
> +TEST_PROGS := fpu_syscall vmx_syscall


Add a new .gitignore in this dir for these new build objects.

> +
> +all: $(TEST_PROGS)
> +
> +$(TEST_PROGS): ../harness.c
> +$(TEST_PROGS): CFLAGS += -O2 -g
> +
> +fpu_syscall: fpu_asm.S
> +vmx_syscall: vmx_asm.S
> +
> +include ../../lib.mk
> +
> +clean:
> +	rm -f $(TEST_PROGS) *.o
> diff --git a/tools/testing/selftests/powerpc/math/basic_asm.h b/tools/testing/selftests/powerpc/math/basic_asm.h
> new file mode 100644
> index 0000000..27aca79
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/basic_asm.h

Can you put this up a directory since it's generically useful for
powerpc?

> @@ -0,0 +1,26 @@
> +#include 
> +#include 
> +
> +#define LOAD_REG_IMMEDIATE(reg,expr) \
> +	lis	reg,(expr)@highest;	\
> +	ori	reg,reg,(expr)@higher;	\
> +	rldicr	reg,reg,32,31;	\
> +	oris	reg,reg,(expr)@high;	\
> +	ori	reg,reg,(expr)@l;
> +
> +#define PUSH_BASIC_STACK(size) \
> +	std	2,24(sp); \
> +	mflr	r0; \
> +	std	r0,16(sp); \
> +	mfcr	r0; \
> +	stw	r0,8(sp); \
> +	stdu	sp,-size(sp);
> +
> +#define POP_BASIC_STACK(size) \
> +	addi	sp,sp,size; \
> +	ld	2,24(sp); \
> +	ld	r0,16(sp); \
> +	mtlr	r0; \
> +	lwz	r0,8(sp); \
> +	mtcr	r0; \
> +
> diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testing/selftests/powerpc/math/fpu_asm.S
> new file mode 100644
> index 0000000..d5412c1
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/fpu_asm.S
> @@ -0,0 +1,151 @@
> +#include "basic_asm.h"
> +
> +#define PUSH_FPU(pos) \
> +	stfd	f14,pos(sp); \
> +	stfd	f15,pos+8(sp); \
> +	stfd	f16,pos+16(sp); \
> +	stfd	f17,pos+24(sp); \
> +	stfd	f18,pos+32(sp); \
> +	stfd	f19,pos+40(sp); \
> +	stfd	f20,pos+48(sp); \
> +	stfd	f21,pos+56(sp); \
> +	stfd	f22,pos+64(sp); \
> +	stfd	f23,pos+72(sp); \
> +	stfd	f24,pos+80(sp); \
> +	stfd	f25,pos+88(sp); \
> +	stfd	f26,pos+96(sp); \
> +	stfd	f27,pos+104(sp); \
> +	stfd	f28,pos+112(sp); \
> +	stfd	f29,pos+120(sp); \
> +	stfd	f30,pos+128(sp); \
> +	stfd	f31,pos+136(sp);
> +
> +#define POP_FPU(pos) \
> +	lfd	f14,pos(sp); \
> +	lfd	f15,pos+8(sp); \
> +	lfd	f16,pos+16(sp); \
> +	lfd	f17,pos+24(sp); \
> +	lfd	f18,pos+32(sp); \
> +	lfd	f19,pos+40(sp); \
> +	lfd	f20,pos+48(sp); \
> +	lfd	f21,pos+56(sp); \
> +	lfd	f22,pos+64(sp); \
> +	lfd	f23,pos+72(sp); \
> +	lfd	f24,pos+80(sp); \
> +	lfd	f25,pos+88(sp); \
> +	lfd	f26,pos+96(sp); \
> +	lfd	f27,pos+104(sp); \
> +	lfd	f28,pos+112(sp); \
> +	lfd	f29,pos+120(sp); \
> +	lfd	f30,pos+128(sp); \
> +	lfd	f31,pos+136(sp);
> +
> +#Careful calling this, it will 'clobber' fpu (by design)
> +#Don't call this from C
> +FUNC_START(load_fpu)
> +	lfd	f14,0(r3)
> +	lfd	f15,8(r3)
> +	lfd	f16,16(r3)
> +	lfd	f17,24(r3)
> +	lfd	f18,32(r3)
> +	lfd	f19,40(r3)
> +	lfd	f20,48(r3)
> +	lfd	f21,56(r3)
> +	lfd	f22,64(r3)
> +	lfd	f23,72(r3)
> +	lfd	f24,80(r3)
> +	lfd	f25,88(r3)
> +	lfd	f26,96(r3)
> +	lfd	f27,104(r3)
> +	lfd	f28,112(r3)
> +	lfd	f29,120(r3)
> +	lfd	f30,128(r3)
> +	lfd	f31,136(r3)
> +	blr
> +FUNC_END(load_fpu)
> +
> +FUNC_START(check_fpu)
> +	mr r4,r3
> +	li	r3,1 #assume a bad result
> +	lfd	f0,0(r4)
> +	fcmpu	cr1,f0,f14
> +	bne	cr1,1f
> +	lfd	f0,8(r4)
> +	fcmpu	cr1,f0,f15
> +	bne	cr1,1f
> +	lfd	f0,16(r4)
> +	fcmpu	cr1,f0,f16
> +	bne	cr1,1f
> +	lfd	f0,24(r4)
> +	fcmpu	cr1,f0,f17
> +	bne	cr1,1f
> +	lfd	f0,32(r4)
> +	fcmpu	cr1,f0,f18
> +	bne	cr1,1f
> +	lfd	f0,40(r4)
> +	fcmpu	cr1,f0,f19
> +	bne	cr1,1f
> +	lfd	f0,48(r4)
> +	fcmpu	cr1,f0,f20
> +	bne	cr1,1f
> +	lfd	f0,56(r4)
> +	fcmpu	cr1,f0,f21
> +	bne	cr1,1f
> +	lfd	f0,64(r4)
> +	fcmpu	cr1,f0,f22
> +	bne	cr1,1f
> +	lfd	f0,72(r4)
> +	fcmpu	cr1,f0,f23
> +	bne	cr1,1f
> +	lfd	f0,80(r4)
> +	fcmpu	cr1,f0,f24
> +	bne	cr1,1f
> +	lfd	f0,88(r4)
> +	fcmpu	cr1,f0,f25
> +	bne	cr1,1f
> +	lfd	f0,96(r4)
> +	fcmpu	cr1,f0,f26
> +	bne	cr1,1f
> +	lfd	f0,104(r4)
> +	fcmpu	cr1,f0,f27
> +	bne	cr1,1f
> +	lfd	f0,112(r4)
> +	fcmpu	cr1,f0,f28
> +	bne	cr1,1f
> +	lfd	f0,120(r4)
> +	fcmpu	cr1,f0,f29
> +	bne	cr1,1f
> +	lfd	f0,128(r4)
> +	fcmpu	cr1,f0,f30
> +	bne	cr1,1f
> +	lfd	f0,136(r4)
> +	fcmpu	cr1,f0,f31
> +	bne	cr1,1f
> +	li	r3,0 #Sucess!!!
> +1:	blr
> +
> +FUNC_START(test_fpu)
> +	#r3 holds pointer to where to put the result of fork

#r4 seems to hold a ptr to the pid

> +	#f14-f31 are non volatiles
> +	PUSH_BASIC_STACK(256)
> +	std	r3,40(sp) #Address of darray
> +	std r4,48(sp) #Address of pid
> +	PUSH_FPU(56)
> +
> +	bl load_fpu
> +	nop
> +	li	r0,__NR_fork
> +	sc
> +
> +	#pass the result of the fork to the caller
> +	ld	r9,48(sp)
> +	std	r3,0(r9)
> +
> +	ld r3,40(sp)
> +	bl check_fpu
> +	nop
> +
> +	POP_FPU(56)
> +	POP_BASIC_STACK(256)
> +	blr
> +FUNC_END(test_fpu)
> diff --git a/tools/testing/selftests/powerpc/math/fpu_syscall.c b/tools/testing/selftests/powerpc/math/fpu_syscall.c
> new file mode 100644
> index 0000000..a967fd6
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/fpu_syscall.c
> @@ -0,0 +1,79 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "utils.h"
> +
> +extern int test_fpu(double *darray, pid_t *pid);
> +
> +double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
> +		     1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0,
> +		     2.1};
> +
> +int syscall_fpu(void)
> +{
> +	pid_t fork_pid;
> +	int i;
> +	int ret;
> +	int child_ret;
> +	for (i = 0; i < 1000; i++) {
> +		/* test_fpu will fork() */
> +		ret = test_fpu(darray, &fork_pid);
> +		if (fork_pid == -1)
> +			return -1;
> +		if (fork_pid == 0)
> +			exit(ret);
> +		waitpid(fork_pid, &child_ret, 0);
> +		if (ret || child_ret)
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +int test_syscall_fpu(void)
> +{
> +	/*
> +	 * Setup an environment with much context switching
> +	 */
> +	pid_t pid2;
> +	pid_t pid = fork();
> +	int ret;
> +	int child_ret;
> +	FAIL_IF(pid == -1);
> +
> +	pid2 = fork();
> +	/* Can't FAIL_IF(pid2 == -1); because already forked once */
> +	if (pid2 == -1) {
> +		/*
> +		 * Couldn't fork, ensure test is a fail
> +		 */
> +		child_ret = ret = 1;
> +	} else {
> +		ret = syscall_fpu();
> +		if (pid2)
> +			waitpid(pid2, &child_ret, 0);
> +		else
> +			exit(ret);
> +	}
> +
> +	ret |= child_ret;
> +
> +	if (pid)
> +		waitpid(pid, &child_ret, 0);
> +	else
> +		exit(ret);
> +
> +	FAIL_IF(ret || child_ret);
> +	return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	return test_harness(test_syscall_fpu, "syscall_fpu");
> +
> +}
> diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testing/selftests/powerpc/math/vmx_asm.S
> new file mode 100644
> index 0000000..e642e67
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/vmx_asm.S
> @@ -0,0 +1,183 @@
> +#include "basic_asm.h"
> +
> +#define PUSH_VMX(pos,reg) \
> +	li	reg,pos; \
> +	stvx	v20,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v21,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v22,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v23,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v24,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v25,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v26,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v27,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v28,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v29,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v30,reg,sp; \
> +	addi	reg,reg,16; \
> +	stvx	v31,reg,sp;
> +
> +#define POP_VMX(pos,reg) \
> +	li	reg,pos; \
> +	lvx	v20,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v21,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v22,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v23,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v24,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v25,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v26,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v27,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v28,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v29,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v30,reg,sp; \
> +	addi	reg,reg,16; \
> +	lvx	v31,reg,sp;
> +
> +#Carefull this will 'clobber' vmx (by design)
> +#Don't call this from C
> +FUNC_START(load_vmx)
> +	li	r5,0
> +	lvx	v20,r5,r3
> +	addi	r5,r5,16
> +	lvx	v21,r5,r3
> +	addi	r5,r5,16
> +	lvx	v22,r5,r3
> +	addi	r5,r5,16
> +	lvx	v23,r5,r3
> +	addi	r5,r5,16
> +	lvx	v24,r5,r3
> +	addi	r5,r5,16
> +	lvx	v25,r5,r3
> +	addi	r5,r5,16
> +	lvx	v26,r5,r3
> +	addi	r5,r5,16
> +	lvx	v27,r5,r3
> +	addi	r5,r5,16
> +	lvx	v28,r5,r3
> +	addi	r5,r5,16
> +	lvx	v29,r5,r3
> +	addi	r5,r5,16
> +	lvx	v30,r5,r3
> +	addi	r5,r5,16
> +	lvx	v31,r5,r3
> +	blr
> +FUNC_END(load_vmx)
> +
> +#Should be safe from C, only touches r4, r5 and v0,v1,v2
> +FUNC_START(check_vmx)
> +	PUSH_BASIC_STACK(16)
> +	mr r4,r3
> +	li	r3,1 #assume a bad result
> +	li	r5,0
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v20
> +	vmr	v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v21
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v22
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v23
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v24
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v25
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v26
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v27
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v28
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v29
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v30
> +	vand	v2,v2,v1
> +
> +	addi	r5,r5,16
> +	lvx	v0,r5,r4
> +	vcmpequd.	v1,v0,v31
> +	vand	v2,v2,v1
> +
> +	li r5,0
> +	stvx	v2,r5,sp
> +	ldx	r0,r5,sp
> +	cmpdi	r0,0xffffffff
> +	bne	1f
> +	li	r3,0
> +1:	POP_BASIC_STACK(16)
> +	blr
> +FUNC_END(check_vmx)
> +
> +#Safe from C
> +FUNC_START(test_vmx)
> +	#r3 holds pointer to where to put the result of fork
> +	#v20-v31 are non-volatile
> +	PUSH_BASIC_STACK(512)
> +	std	r3,40(sp) #Address of varray
> +	std r4,48(sp) #address of pid
> +	PUSH_VMX(56, r4)
> +
> +	bl load_vmx
> +
> +	li	r0,__NR_fork
> +	sc
> +	#Pass the result of fork back to the caller
> +	ld	r9,48(sp)
> +	std	r3,0(r9)
> +
> +	ld r3,40(sp)
> +	bl check_vmx
> +
> +	POP_VMX(56,r4)
> +	POP_BASIC_STACK(512)
> +	blr
> +FUNC_END(test_vmx)
> diff --git a/tools/testing/selftests/powerpc/math/vmx_syscall.c b/tools/testing/selftests/powerpc/math/vmx_syscall.c
> new file mode 100644
> index 0000000..7adff05
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/vmx_syscall.c
> @@ -0,0 +1,81 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "utils.h"
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12},
> +	{13,14,15,16},{17,18,19,20},{21,22,23,24},
> +	{25,26,27,28},{29,30,31,32},{33,34,35,36},
> +	{37,38,39,40},{41,42,43,44},{45,46,47,48}};
> +
> +extern int test_vmx(v4si *varray, pid_t *pid);
> +
> +int vmx_syscall(void)
> +{
> +	pid_t fork_pid;
> +	int i;
> +	int ret;
> +	int child_ret;
> +	for (i = 0; i < 1000; i++) {
> +		/* test_vmx will fork() */
> +		ret = test_vmx(varray, &fork_pid);
> +		if (fork_pid == -1)
> +			return -1;
> +		if (fork_pid == 0)
> +			exit(ret);
> +		waitpid(fork_pid, &child_ret, 0);
> +		if (ret || child_ret)
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +int test_vmx_syscall(void)
> +{
> +	/*
> +	 * Setup an environment with much context switching
> +	 */
> +	pid_t pid2;
> +	pid_t pid = fork();
> +	int ret;
> +	int child_ret;
> +	FAIL_IF(pid == -1);
> +
> +	pid2 = fork();
> +	ret = vmx_syscall();
> +	/* Can't FAIL_IF(pid2 == -1); because we've already forked */
> +	if (pid2 == -1) {
> +		/*
> +		 * Couldn't fork, ensure child_ret is set and is a fail
> +		 */
> +		ret = child_ret = 1;
> +	} else {
> +		if (pid2)
> +			waitpid(pid2, &child_ret, 0);
> +		else
> +			exit(ret);
> +	}
> +
> +	ret |= child_ret;
> +
> +	if (pid)
> +		waitpid(pid, &child_ret, 0);
> +	else
> +		exit(ret);
> +
> +	FAIL_IF(ret || child_ret);
> +	return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	return test_harness(test_vmx_syscall, "vmx_syscall");
> +
> +}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption
  2015-11-18  3:26 ` [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption Cyril Bur
@ 2015-11-23  0:34   ` Michael Neuling
  0 siblings, 0 replies; 23+ messages in thread
From: Michael Neuling @ 2015-11-23  0:34 UTC (permalink / raw)
  To: Cyril Bur, anton, linuxppc-dev

On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:

> Loop in assembly checking the registers with many threads.
> 
> Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
> ---
>  tools/testing/selftests/powerpc/math/Makefile      |  7 +-
>  tools/testing/selftests/powerpc/math/fpu_asm.S     | 34 ++++++++
>  tools/testing/selftests/powerpc/math/fpu_preempt.c | 92 ++++++++++++++++++++++
>  tools/testing/selftests/powerpc/math/vmx_asm.S     | 44 ++++++++++-
>  tools/testing/selftests/powerpc/math/vmx_preempt.c | 92 ++++++++++++++++++++++
>  5 files changed, 263 insertions(+), 6 deletions(-)
>  create mode 100644 tools/testing/selftests/powerpc/math/fpu_preempt.c
>  create mode 100644 tools/testing/selftests/powerpc/math/vmx_preempt.c
> 
> diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile
> index 896d9e2..9fa690f 100644
> --- a/tools/testing/selftests/powerpc/math/Makefile
> +++ b/tools/testing/selftests/powerpc/math/Makefile
> @@ -1,12 +1,15 @@
> -TEST_PROGS := fpu_syscall vmx_syscall
> +TEST_PROGS := fpu_syscall fpu_preempt vmx_syscall vmx_preempt

Needs a .gitignore entry for these new build objects.
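For reference, the requested `.gitignore` (a new file under `tools/testing/selftests/powerpc/math/`, contents assumed rather than taken from the thread) would presumably just list the binaries built at this point in the series:

```
fpu_syscall
fpu_preempt
vmx_syscall
vmx_preempt
```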


>  all: $(TEST_PROGS)
> 
>  $(TEST_PROGS): ../harness.c
> -$(TEST_PROGS): CFLAGS += -O2 -g
> +$(TEST_PROGS): CFLAGS += -O2 -g -pthread
> 
>  fpu_syscall: fpu_asm.S
> +fpu_preempt: fpu_asm.S
> +
>  vmx_syscall: vmx_asm.S
> +vmx_preempt: vmx_asm.S
> 
>  include ../../lib.mk
> 
> diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testing/selftests/powerpc/math/fpu_asm.S
> index d5412c1..5ff0adc 100644
> --- a/tools/testing/selftests/powerpc/math/fpu_asm.S
> +++ b/tools/testing/selftests/powerpc/math/fpu_asm.S
> @@ -149,3 +149,37 @@ FUNC_START(test_fpu)
>  > 	> POP_BASIC_STACK(256)
>  > 	> blr
>  FUNC_END(test_fpu)
> +
> +#int preempt_fpu(double *darray, volatile int *not_ready, int *sentinal)
> +#On starting will (atomically) decrement not_ready as a signal that the FPU
> +#has been loaded with darray. Will proceed to check the validity of the FPU
> +#registers while sentinal is not zero.
> +FUNC_START(preempt_fpu)
> +> 	> PUSH_BASIC_STACK(256)
> +> 	> std r3,32(sp) #double *darray
> +> 	> std r4,40(sp) #volatile int *not_ready
> +> 	> std r5,48(sp) #int *sentinal
> +> 	> PUSH_FPU(56)
> +
> +> 	> bl load_fpu
> +

Memory barrier here.

> +> 	> #Atomic DEC
> +> 	> ld r3,40(sp)
> +1:> 	> lwarx r4,0,r3
> +> 	> addi r4,r4,-1
> +> 	> stwcx. r4,0,r3
> +> 	> bne- 1b
> +
> +2:> 	> ld r3, 32(sp)
> +> 	> bl check_fpu
> +> 	> cmpdi r3,0
> +> 	> bne 3f
> +> 	> ld r4, 48(sp)
> +> 	> ld r5, 0(r4)
> +> 	> cmpwi r5,0
> +> 	> bne 2b
> +
> +3:> 	> POP_FPU(56)
> +> 	> POP_BASIC_STACK(256)
> +> 	> blr
> +FUNC_END(preempt_fpu)
> diff --git a/tools/testing/selftests/powerpc/math/fpu_preempt.c b/tools/testing/selftests/powerpc/math/fpu_preempt.c
> new file mode 100644
> index 0000000..e24cf9b
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/fpu_preempt.c

Needs a copyright and a description of the test here.

Same in patch 1.
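As a sketch of what the review is asking for (wording and attribution assumed, not taken from the thread), the file header could look like:

```c
/*
 * Copyright 2015, Cyril Bur, IBM Corp.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 * Test the preservation of the FPU non-volatile registers across
 * preemption: worker threads load known values into f14-f31 and spin
 * re-checking them while the main thread waits for preemptions.
 */
```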

> @@ -0,0 +1,92 @@
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#include <sys/time.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <stdlib.h>
> +#include <pthread.h>
> +
> +#include "utils.h"
> +
> +/* Time to wait for workers to get preempted (seconds) */
> +#define PREEMPT_TIME 20
> +/*
> + * Factor by which to multiply number of online CPUs for total number of
> + * worker threads
> + */
> +#define THREAD_FACTOR 8
> +
> +
> +__thread double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
> +> 	> 	>      1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0,
> +> 	> 	>      2.1};
> +
> +volatile int not_ready;
> +int running;
> +
> +extern void preempt_fpu(double *darray, volatile int *not_ready, int *sentinal);
> +
> +void *preempt_fpu_c(void *p)
> +{
> +> 	> int i;
> +> 	> srand(pthread_self());
> +> 	> for (i = 0; i < 21; i++)
> +> 	> 	> darray[i] = rand();
> +
> +> 	> /* Test failed if it ever returns */
> +> 	> preempt_fpu(darray, &not_ready, &running);
> +
> +> 	> return p;
> +}
> +
> +int test_preempt_fpu(void)
> +{
> +> 	> int i, rc, threads;
> +> 	> pthread_t *tids;
> +
> +> 	> threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR;
> +> 	> tids = malloc((threads) * sizeof(pthread_t));
> +> 	> FAIL_IF(!tids);
> +
> +> 	> running = true;
> +> 	> not_ready = threads;
> +> 	> for (i = 0; i < threads; i++) {
> +> 	> 	> rc = pthread_create(&tids[i], NULL, preempt_fpu_c, NULL);
> +> 	> 	> FAIL_IF(rc);
> +> 	> }
> +
> +> 	> setbuf(stdout, NULL);
> +> 	> /* Not really necessary but nice to wait for every thread to start */
> +> 	> printf("\tWaiting for all workers to start...");
> +> 	> while(not_ready);

You need a memory barrier here and a matching one in the asm which is
decrementing it.  Same in the VSX test.

> +> 	> printf("done\n");
> +
> +> 	> printf("\tWaiting for %d seconds to let some workers get preempted...", PREEMPT_TIME);
> +> 	> sleep(PREEMPT_TIME);
> +> 	> printf("done\n");
> +
> +> 	> printf("\tKilling workers...");
> +> 	> running = 0;
> +> 	> for (i =3D 0; i < threads; i++) {
> +> 	> 	> void *rc_p;
> +> 	> 	> pthread_join(tids[i], &rc_p);
> +
> +> 	> 	> /*
> +> 	> 	>  * Harness will say the fail was here, look at why preempt_fpu
> +> 	> 	>  * returned
> +> 	> 	>  */
> +> 	> 	> if ((long) rc_p)
> +> 	> 	> 	> printf("oops\n");
> +> 	> 	> FAIL_IF((long) rc_p);
> +> 	> }
> +> 	> printf("done\n");
> +
> +> 	> free(tids);
> +> 	> return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +> 	> return test_harness(test_preempt_fpu, "fpu_preempt");
> +}
> diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testing/selftests/powerpc/math/vmx_asm.S
> index e642e67..23db4b3 100644
> --- a/tools/testing/selftests/powerpc/math/vmx_asm.S
> +++ b/tools/testing/selftests/powerpc/math/vmx_asm.S
> @@ -1,5 +1,6 @@
>  #include "basic_asm.h"
> 
> +#POS MUST BE 16 ALIGNED!
>  #define PUSH_VMX(pos,reg) \
>  > 	> li> 	> reg,pos; \
>  > 	> stvx> 	> v20,reg,sp; \
> @@ -26,6 +27,7 @@
>  > 	> addi> 	> reg,reg,16; \
>  > 	> stvx> 	> v31,reg,sp;
> 
> +#POS MUST BE 16 ALIGNED!
>  #define POP_VMX(pos,reg) \
>  > 	> li> 	> reg,pos; \
>  > 	> lvx> 	> v20,reg,sp; \
> @@ -84,7 +86,7 @@ FUNC_END(load_vmx)
> 
>  #Should be safe from C, only touches r4, r5 and v0,v1,v2
>  FUNC_START(check_vmx)
> -> 	> PUSH_BASIC_STACK(16)
> +> 	> PUSH_BASIC_STACK(32)
>  > 	> mr r4,r3
>  > 	> li> 	> r3,1 #assume a bad result
>  > 	> li> 	> r5,0
> @@ -153,7 +155,7 @@ FUNC_START(check_vmx)
>  > 	> cmpdi> 	> r0,0xffffffff
>  > 	> bne> 	> 1f
>  > 	> li> 	> r3,0
> -1:> 	> POP_BASIC_STACK(16)
> +1:> 	> POP_BASIC_STACK(32)
>  > 	> blr
>  FUNC_END(check_vmx)
> 
> @@ -164,7 +166,7 @@ FUNC_START(test_vmx)
>  > 	> PUSH_BASIC_STACK(512)
>  > 	> std> 	> r3,40(sp) #Address of varray
>  > 	> std r4,48(sp) #address of pid
> -> 	> PUSH_VMX(56, r4)
> +> 	> PUSH_VMX(64, r4)
> 
>  > 	> bl load_vmx
> 
> @@ -177,7 +179,41 @@ FUNC_START(test_vmx)
>  > 	> ld r3,40(sp)
>  > 	> bl check_vmx
> 
> -> 	> POP_VMX(56,r4)
> +> 	> POP_VMX(64,r4)
>  > 	> POP_BASIC_STACK(512)
>  > 	> blr
>  FUNC_END(test_vmx)
> +
> +#int preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal)
> +#On starting will (atomically) decrement not_ready as a signal that the VMX
> +#registers have been loaded with varray. Will proceed to check the validity
> +#of the VMX registers while sentinal is not zero.
> +FUNC_START(preempt_vmx)
> +> 	> PUSH_BASIC_STACK(512)
> +> 	> std r3,32(sp) #v4si *varray
> +> 	> std r4,40(sp) #volatile int *not_ready
> +> 	> std r5,48(sp) #int *sentinal
> +> 	> PUSH_VMX(64,r4)
> +
> +> 	> bl load_vmx
> +

Memory barrier here

> +> 	> #Atomic DEC
> +> 	> ld r3,40(sp)
> +1:> 	> lwarx r4,0,r3
> +> 	> addi r4,r4,-1
> +> 	> stwcx. r4,0,r3
> +> 	> bne- 1b
> +
> +2:> 	> ld r3,32(sp)
> +> 	> bl check_vmx
> +> 	> cmpdi r3,0
> +> 	> bne 3f
> +> 	> ld r4,48(sp)
> +> 	> ld r5,0(r4)
> +> 	> cmpwi r5,0
> +> 	> bne 2b
> +
> +3:> 	> POP_VMX(64,r4)
> +> 	> POP_BASIC_STACK(512)
> +> 	> blr
> +FUNC_END(preempt_vmx)
> diff --git a/tools/testing/selftests/powerpc/math/vmx_preempt.c b/tools/testing/selftests/powerpc/math/vmx_preempt.c
> new file mode 100644
> index 0000000..342db15
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/vmx_preempt.c

Needs a copyright and a description of the test here.

> @@ -0,0 +1,92 @@
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#include <sys/time.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <stdlib.h>
> +#include <pthread.h>
> +
> +#include "utils.h"
> +
> +/* Time to wait for workers to get preempted (seconds) */
> +#define PREEMPT_TIME 20
> +/*
> + * Factor by which to multiply number of online CPUs for total number of
> + * worker threads
> + */
> +#define THREAD_FACTOR 8
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +__thread v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12},
> +> 	> {13,14,15,16},{17,18,19,20},{21,22,23,24},
> +> 	> {25,26,27,28},{29,30,31,32},{33,34,35,36},
> +> 	> {37,38,39,40},{41,42,43,44},{45,46,47,48}};
> +
> +volatile int not_ready;

I really hate this name.  How about just "ready", with the logic negated?

> +int running;

How about not_running ;-P

> +
> +extern void preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal);
> +
> +void *preempt_vmx_c(void *p)
> +{
> +> 	> int i, j;
> +> 	> srand(pthread_self());
> +> 	> for (i = 0; i < 12; i++)
> +> 	> 	> for (j = 0; j < 4; j++)
> +> 	> 	> 	> varray[i][j] = rand();
> +
> +> 	> /* Test fails if it ever returns */
> +> 	> preempt_vmx(varray, &not_ready, &running);
> +> 	> return p;
> +}
> +
> +int test_preempt_vmx(void)
> +{
> +> 	> int i, rc, threads;
> +> 	> pthread_t *tids;
> +
> +> 	> threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR;
> +> 	> tids = malloc(threads * sizeof(pthread_t));
> +> 	> FAIL_IF(!tids);
> +
> +> 	> running = true;
> +> 	> not_ready = threads;
> +> 	> for (i = 0; i < threads; i++) {
> +> 	> 	> rc = pthread_create(&tids[i], NULL, preempt_vmx_c, NULL);
> +> 	> 	> FAIL_IF(rc);
> +> 	> }
> +
> +> 	> setbuf(stdout, NULL);
> +> 	> /* Not really necessary but nice to wait for every thread to start */
> +> 	> printf("\tWaiting for all workers to start...");
> +> 	> while(not_ready);

Again, a memory barrier here.

> +> 	> printf("done\n");
> +
> +> 	> printf("\tWaiting for %d seconds to let some workers get preempted...", PREEMPT_TIME);
> +> 	> sleep(PREEMPT_TIME);
> +> 	> printf("done\n");
> +
> +> 	> printf("\tKilling workers...");
> +> 	> running = 0;
> +> 	> for (i =3D 0; i < threads; i++) {
> +> 	> 	> void *rc_p;
> +> 	> 	> pthread_join(tids[i], &rc_p);
> +
> +> 	> 	> /*
> +> 	> 	>  * Harness will say the fail was here, look at why preempt_vmx
> +> 	> 	>  * returned
> +> 	> 	>  */
> +> 	> 	> if ((long) rc_p)
> +> 	> 	> 	> printf("oops\n");
> +> 	> 	> FAIL_IF((long) rc_p);
> +> 	> }
> +> 	> printf("done\n");
> +
> +> 	> return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +> 	> return test_harness(test_preempt_vmx, "vmx_preempt");
> +}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall
  2015-11-23  0:23   ` Michael Neuling
@ 2015-11-23  0:58     ` Cyril Bur
  2015-11-23  1:06       ` Michael Neuling
  0 siblings, 1 reply; 23+ messages in thread
From: Cyril Bur @ 2015-11-23  0:58 UTC (permalink / raw)
  To: Michael Neuling; +Cc: anton, linuxppc-dev

On Mon, 23 Nov 2015 11:23:13 +1100
Michael Neuling <mikey@neuling.org> wrote:

> On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> > Test that the non volatile floating point and Altivec registers get
> > correctly preserved across the fork() syscall.  
> 
> Can we add a test for VSX too?  I realise it's the same registers, but
> the enable bits in the MSR are different so it's easy to get them wrong
> in the kernel.

Yeah, I'm sure I could get that wrong haha.

Hmmmm this got me thinking. Today we always enable FP and Altivec when we
enable VSX but isn't there a world where we could actually run with FP and
Altivec disabled and VSX on? In which case, is the whole thing volatile or
does the kernel still need to save the subset of the matrix which corresponds 
to non-volatile FPs and non-volatile Altivec?
> 
> Additional comments below.
> 
> > fork() works nicely for this purpose, the registers should be the same for
> > both parent and child
> > 
> > Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
> > ---
> >  tools/testing/selftests/powerpc/Makefile           |   3 +-
> >  tools/testing/selftests/powerpc/math/Makefile      |  14 ++
> >  tools/testing/selftests/powerpc/math/basic_asm.h   |  26 +++
> >  tools/testing/selftests/powerpc/math/fpu_asm.S     | 151 +++++++++++++++++
> >  tools/testing/selftests/powerpc/math/fpu_syscall.c |  79 +++++++++
> >  tools/testing/selftests/powerpc/math/vmx_asm.S     | 183 +++++++++++++++++++++
> >  tools/testing/selftests/powerpc/math/vmx_syscall.c |  81 +++++++++
> >  7 files changed, 536 insertions(+), 1 deletion(-)
> >  create mode 100644 tools/testing/selftests/powerpc/math/Makefile
> >  create mode 100644 tools/testing/selftests/powerpc/math/basic_asm.h
> >  create mode 100644 tools/testing/selftests/powerpc/math/fpu_asm.S
> >  create mode 100644 tools/testing/selftests/powerpc/math/fpu_syscall.c
> >  create mode 100644 tools/testing/selftests/powerpc/math/vmx_asm.S
> >  create mode 100644 tools/testing/selftests/powerpc/math/vmx_syscall.c
> > 
> > diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
> > index 0c2706b..19e8191 100644
> > --- a/tools/testing/selftests/powerpc/Makefile
> > +++ b/tools/testing/selftests/powerpc/Makefile
> > @@ -22,7 +22,8 @@ SUB_DIRS = benchmarks > 	> 	> \  
> >  > 	>    switch_endian> 	> \  
> >  > 	>    syscalls> 	> 	> \  
> >  > 	>    tm> 	> 	> 	> \  
> > -> 	>    vphn
> > +> 	>    vphn         \
> > +> 	>    math  
> >  
> >  endif
> >  
> > diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile
> > new file mode 100644
> > index 0000000..896d9e2
> > --- /dev/null
> > +++ b/tools/testing/selftests/powerpc/math/Makefile
> > @@ -0,0 +1,14 @@
> > +TEST_PROGS := fpu_syscall vmx_syscall  
> 
> 
> Add a new .gitignore in this dir for these new build objects.
> 

Yep

> > +
> > +all: $(TEST_PROGS)
> > +
> > +$(TEST_PROGS): ../harness.c
> > +$(TEST_PROGS): CFLAGS += -O2 -g
> > +
> > +fpu_syscall: fpu_asm.S
> > +vmx_syscall: vmx_asm.S
> > +
> > +include ../../lib.mk
> > +
> > +clean:  
> > +> 	> rm -f $(TEST_PROGS) *.o  
> > diff --git a/tools/testing/selftests/powerpc/math/basic_asm.h b/tools/testing/selftests/powerpc/math/basic_asm.h
> > new file mode 100644
> > index 0000000..27aca79
> > --- /dev/null
> > +++ b/tools/testing/selftests/powerpc/math/basic_asm.h  
> 
> Can you put this up a directory since it's generically useful for
> powerpc?
> 

Sure why not.

> > @@ -0,0 +1,26 @@
> > +#include <ppc-asm.h>
> > +#include <asm/unistd.h>
> > +
> > +#define LOAD_REG_IMMEDIATE(reg,expr) \  
> > +> 	> lis> 	> reg,(expr)@highest;> 	> \
> > +> 	> ori> 	> reg,reg,(expr)@higher;> 	> \
> > +> 	> rldicr> 	> reg,reg,32,31;> 	> \
> > +> 	> oris> 	> reg,reg,(expr)@high;> 	> \  
> > +> 	> ori> 	> reg,reg,(expr)@l;  
> > +
> > +#define PUSH_BASIC_STACK(size) \  
> > +> 	> std> 	> 2,24(sp); \
> > +> 	> mflr> 	> r0; \
> > +> 	> std> 	> r0,16(sp); \
> > +> 	> mfcr> 	> r0; \
> > +> 	> stw> 	> r0,8(sp); \
> > +> 	> stdu> 	> sp,-size(sp);  
> > +
> > +#define POP_BASIC_STACK(size) \  
> > +> 	> addi> 	> sp,sp,size; \
> > +> 	> ld> 	> 2,24(sp); \
> > +> 	> ld> 	> r0,16(sp); \
> > +> 	> mtlr> 	> r0; \
> > +> 	> lwz> 	> r0,8(sp); \
> > +> 	> mtcr> 	> r0; \  
> > +
> > diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testing/selftests/powerpc/math/fpu_asm.S
> > new file mode 100644
> > index 0000000..d5412c1
> > --- /dev/null
> > +++ b/tools/testing/selftests/powerpc/math/fpu_asm.S
> > @@ -0,0 +1,151 @@
> > +#include "basic_asm.h"
> > +
> > +#define PUSH_FPU(pos) \  
> > +> 	> stfd> 	> f14,pos(sp); \
> > +> 	> stfd> 	> f15,pos+8(sp); \
> > +> 	> stfd> 	> f16,pos+16(sp); \
> > +> 	> stfd> 	> f17,pos+24(sp); \
> > +> 	> stfd> 	> f18,pos+32(sp); \
> > +> 	> stfd> 	> f19,pos+40(sp); \
> > +> 	> stfd> 	> f20,pos+48(sp); \
> > +> 	> stfd> 	> f21,pos+56(sp); \
> > +> 	> stfd> 	> f22,pos+64(sp); \
> > +> 	> stfd> 	> f23,pos+72(sp); \
> > +> 	> stfd> 	> f24,pos+80(sp); \
> > +> 	> stfd> 	> f25,pos+88(sp); \
> > +> 	> stfd> 	> f26,pos+96(sp); \
> > +> 	> stfd> 	> f27,pos+104(sp); \
> > +> 	> stfd> 	> f28,pos+112(sp); \
> > +> 	> stfd> 	> f29,pos+120(sp); \
> > +> 	> stfd> 	> f30,pos+128(sp); \
> > +> 	> stfd> 	> f31,pos+136(sp);  
> > +
> > +#define POP_FPU(pos) \  
> > +> 	> lfd> 	> f14,pos(sp); \
> > +> 	> lfd> 	> f15,pos+8(sp); \
> > +> 	> lfd> 	> f16,pos+16(sp); \
> > +> 	> lfd> 	> f17,pos+24(sp); \
> > +> 	> lfd> 	> f18,pos+32(sp); \
> > +> 	> lfd> 	> f19,pos+40(sp); \
> > +> 	> lfd> 	> f20,pos+48(sp); \
> > +> 	> lfd> 	> f21,pos+56(sp); \
> > +> 	> lfd> 	> f22,pos+64(sp); \
> > +> 	> lfd> 	> f23,pos+72(sp); \
> > +> 	> lfd> 	> f24,pos+80(sp); \
> > +> 	> lfd> 	> f25,pos+88(sp); \
> > +> 	> lfd> 	> f26,pos+96(sp); \
> > +> 	> lfd> 	> f27,pos+104(sp); \
> > +> 	> lfd> 	> f28,pos+112(sp); \
> > +> 	> lfd> 	> f29,pos+120(sp); \
> > +> 	> lfd> 	> f30,pos+128(sp); \
> > +> 	> lfd> 	> f31,pos+136(sp);  
> > +
> > +#Careful calling this, it will 'clobber' fpu (by design)
> > +#Don't call this from C
> > +FUNC_START(load_fpu)  
> > +> 	> lfd> 	> f14,0(r3)
> > +> 	> lfd> 	> f15,8(r3)
> > +> 	> lfd> 	> f16,16(r3)
> > +> 	> lfd> 	> f17,24(r3)
> > +> 	> lfd> 	> f18,32(r3)
> > +> 	> lfd> 	> f19,40(r3)
> > +> 	> lfd> 	> f20,48(r3)
> > +> 	> lfd> 	> f21,56(r3)
> > +> 	> lfd> 	> f22,64(r3)
> > +> 	> lfd> 	> f23,72(r3)
> > +> 	> lfd> 	> f24,80(r3)
> > +> 	> lfd> 	> f25,88(r3)
> > +> 	> lfd> 	> f26,96(r3)
> > +> 	> lfd> 	> f27,104(r3)
> > +> 	> lfd> 	> f28,112(r3)
> > +> 	> lfd> 	> f29,120(r3)
> > +> 	> lfd> 	> f30,128(r3)
> > +> 	> lfd> 	> f31,136(r3)  
> > +> 	> blr  
> > +FUNC_END(load_fpu)
> > +
> > +FUNC_START(check_fpu)  
> > +> 	> mr r4,r3  
> > +> 	> li> 	> r3,1 #assume a bad result
> > +> 	> lfd> 	> f0,0(r4)
> > +> 	> fcmpu> 	> cr1,f0,f14
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,8(r4)
> > +> 	> fcmpu> 	> cr1,f0,f15
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,16(r4)
> > +> 	> fcmpu> 	> cr1,f0,f16
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,24(r4)
> > +> 	> fcmpu> 	> cr1,f0,f17
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,32(r4)
> > +> 	> fcmpu> 	> cr1,f0,f18
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,40(r4)
> > +> 	> fcmpu> 	> cr1,f0,f19
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,48(r4)
> > +> 	> fcmpu> 	> cr1,f0,f20
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,56(r4)
> > +> 	> fcmpu> 	> cr1,f0,f21
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,64(r4)
> > +> 	> fcmpu> 	> cr1,f0,f22
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,72(r4)
> > +> 	> fcmpu> 	> cr1,f0,f23
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,80(r4)
> > +> 	> fcmpu> 	> cr1,f0,f24
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,88(r4)
> > +> 	> fcmpu> 	> cr1,f0,f25
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,96(r4)
> > +> 	> fcmpu> 	> cr1,f0,f26
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,104(r4)
> > +> 	> fcmpu> 	> cr1,f0,f27
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,112(r4)
> > +> 	> fcmpu> 	> cr1,f0,f28
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,120(r4)
> > +> 	> fcmpu> 	> cr1,f0,f29
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,128(r4)
> > +> 	> fcmpu> 	> cr1,f0,f30
> > +> 	> bne> 	> cr1,1f
> > +> 	> lfd> 	> f0,136(r4)
> > +> 	> fcmpu> 	> cr1,f0,f31
> > +> 	> bne> 	> cr1,1f
> > +> 	> li> 	> r3,0 #Success!!!  
> > +1:> 	> blr  
> > +
> > +FUNC_START(test_fpu)  
> > +> 	> #r3 holds pointer to where to put the result of fork  
> 
> #r4 seems to hold a ptr to the pid
> 

Thanks

> > +	#f14-f31 are non volatiles  
> > +> 	> PUSH_BASIC_STACK(256)  
> > +> 	> std> 	> r3,40(sp) #Address of darray  
> > +> 	> std r4,48(sp) #Address of pid
> > +> 	> PUSH_FPU(56)  
> > +  
> > +> 	> bl load_fpu
> > +> 	> nop  
> > +> 	> li> 	> r0,__NR_fork  
> > +> 	> sc  
> > +  
> > +> 	> #pass the result of the fork to the caller  
> > +> 	> ld> 	> r9,48(sp)
> > +> 	> std> 	> r3,0(r9)  
> > +  
> > +> 	> ld r3,40(sp)
> > +> 	> bl check_fpu
> > +> 	> nop  
> > +  
> > +> 	> POP_FPU(56)
> > +> 	> POP_BASIC_STACK(256)
> > +> 	> blr  
> > +FUNC_END(test_fpu)
> > diff --git a/tools/testing/selftests/powerpc/math/fpu_syscall.c b/tools/testing/selftests/powerpc/math/fpu_syscall.c
> > new file mode 100644
> > index 0000000..a967fd6
> > --- /dev/null
> > +++ b/tools/testing/selftests/powerpc/math/fpu_syscall.c
> > @@ -0,0 +1,79 @@
> > +#include <stdio.h>
> > +#include <unistd.h>
> > +#include <sys/syscall.h>
> > +#include <sys/time.h>
> > +#include <sys/types.h>
> > +#include <sys/wait.h>
> > +#include <stdlib.h>
> > +
> > +#include "utils.h"
> > +
> > +extern int test_fpu(double *darray, pid_t *pid);
> > +
> > +double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,  
> > +> 	> 	>      1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0,
> > +> 	> 	>      2.1};  
> > +
> > +int syscall_fpu(void)
> > +{  
> > +> 	> pid_t fork_pid;
> > +> 	> int i;
> > +> 	> int ret;
> > +> 	> int child_ret;
> > +> 	> for (i = 0; i < 1000; i++) {  
> > +> 	> 	> /* test_fpu will fork() */
> > +> 	> 	> ret = test_fpu(darray, &fork_pid);
> > +> 	> 	> if (fork_pid == -1)  
> > +> 	> 	> 	> return -1;  
> > +> 	> 	> if (fork_pid == 0)  
> > +> 	> 	> 	> exit(ret);  
> > +> 	> 	> waitpid(fork_pid, &child_ret, 0);
> > +> 	> 	> if (ret || child_ret)  
> > +> 	> 	> 	> return 1;  
> > +> 	> }  
> > +  
> > +> 	> return 0;  
> > +}
> > +
> > +int test_syscall_fpu(void)
> > +{  
> > +> 	> /*
> > +> 	>  * Setup an environment with much context switching
> > +> 	>  */
> > +> 	> pid_t pid2;
> > +> 	> pid_t pid = fork();
> > +> 	> int ret;
> > +> 	> int child_ret;
> > +> 	> FAIL_IF(pid == -1);  
> > +  
> > +> 	> pid2 = fork();
> > +> 	> /* Can't FAIL_IF(pid2 == -1); because already forked once */
> > +> 	> if (pid2 == -1) {  
> > +> 	> 	> /*
> > +> 	> 	>  * Couldn't fork, ensure test is a fail
> > +> 	> 	>  */
> > +> 	> 	> child_ret = ret = 1;  
> > +> 	> } else {  
> > +> 	> 	> ret = syscall_fpu();
> > +> 	> 	> if (pid2)  
> > +> 	> 	> 	> waitpid(pid2, &child_ret, 0);  
> > +> 	> 	> else  
> > +> 	> 	> 	> exit(ret);  
> > +> 	> }  
> > +  
> > +> 	> ret |= child_ret;  
> > +  
> > +> 	> if (pid)  
> > +> 	> 	> waitpid(pid, &child_ret, 0);  
> > +> 	> else  
> > +> 	> 	> exit(ret);  
> > +  
> > +> 	> FAIL_IF(ret || child_ret);
> > +> 	> return 0;  
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{  
> > +> 	> return test_harness(test_syscall_fpu, "syscall_fpu");  
> > +
> > +}
> > diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testing/selftests/powerpc/math/vmx_asm.S
> > new file mode 100644
> > index 0000000..e642e67
> > --- /dev/null
> > +++ b/tools/testing/selftests/powerpc/math/vmx_asm.S
> > @@ -0,0 +1,183 @@
> > +#include "basic_asm.h"
> > +
> > +#define PUSH_VMX(pos,reg) \  
> > +> 	> li> 	> reg,pos; \
> > +> 	> stvx> 	> v20,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v21,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v22,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v23,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v24,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v25,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v26,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v27,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v28,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v29,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v30,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> stvx> 	> v31,reg,sp;  
> > +
> > +#define POP_VMX(pos,reg) \  
> > +> 	> li> 	> reg,pos; \
> > +> 	> lvx> 	> v20,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v21,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v22,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v23,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v24,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v25,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v26,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v27,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v28,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v29,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v30,reg,sp; \
> > +> 	> addi> 	> reg,reg,16; \
> > +> 	> lvx> 	> v31,reg,sp;  
> > +
> > +#Careful, this will 'clobber' vmx (by design)
> > +#Don't call this from C
> > +FUNC_START(load_vmx)  
> > +> 	> li> 	> r5,0
> > +> 	> lvx> 	> v20,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v21,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v22,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v23,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v24,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v25,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v26,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v27,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v28,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v29,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v30,r5,r3
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v31,r5,r3  
> > +> 	> blr  
> > +FUNC_END(load_vmx)
> > +
> > +#Should be safe from C, only touches r4, r5 and v0,v1,v2
> > +FUNC_START(check_vmx)  
> > +> 	> PUSH_BASIC_STACK(16)
> > +> 	> mr r4,r3  
> > +> 	> li> 	> r3,1 #assume a bad result
> > +> 	> li> 	> r5,0
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v20
> > +> 	> vmr> 	> v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v21
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v22
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v23
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v24
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v25
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v26
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v27
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v28
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v29
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v30
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> addi> 	> r5,r5,16
> > +> 	> lvx> 	> v0,r5,r4
> > +> 	> vcmpequd.> 	> v1,v0,v31
> > +> 	> vand> 	> v2,v2,v1  
> > +  
> > +> 	> li r5,0  
> > +> 	> stvx> 	> v2,r5,sp
> > +> 	> ldx> 	> r0,r5,sp
> > +> 	> cmpdi> 	> r0,0xffffffff
> > +> 	> bne> 	> 1f
> > +> 	> li> 	> r3,0  
> > +1:> 	> POP_BASIC_STACK(16)
> > +> 	> blr  
> > +FUNC_END(check_vmx)
> > +
> > +#Safe from C
> > +FUNC_START(test_vmx)  
> > +> 	> #r3 holds pointer to where to put the result of fork
> > +> 	> #v20-v31 are non-volatile
> > +> 	> PUSH_BASIC_STACK(512)  
> > +> 	> std> 	> r3,40(sp) #Address of varray  
> > +> 	> std r4,48(sp) #address of pid
> > +> 	> PUSH_VMX(56, r4)  
> > +  
> > +> 	> bl load_vmx  
> > +  
> > +> 	> li> 	> r0,__NR_fork  
> > +> 	> sc
> > +> 	> #Pass the result of fork back to the caller  
> > +> 	> ld> 	> r9,48(sp)
> > +> 	> std> 	> r3,0(r9)  
> > +  
> > +> 	> ld r3,40(sp)
> > +> 	> bl check_vmx  
> > +  
> > +> 	> POP_VMX(56,r4)
> > +> 	> POP_BASIC_STACK(512)
> > +> 	> blr  
> > +FUNC_END(test_vmx)
> > diff --git a/tools/testing/selftests/powerpc/math/vmx_syscall.c b/tools/testing/selftests/powerpc/math/vmx_syscall.c
> > new file mode 100644
> > index 0000000..7adff05
> > --- /dev/null
> > +++ b/tools/testing/selftests/powerpc/math/vmx_syscall.c
> > @@ -0,0 +1,81 @@
> > +#include <stdio.h>
> > +#include <unistd.h>
> > +#include <sys/syscall.h>
> > +#include <sys/time.h>
> > +#include <sys/types.h>
> > +#include <sys/wait.h>
> > +#include <stdlib.h>
> > +
> > +#include "utils.h"
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12},  
> > +> 	> {13,14,15,16},{17,18,19,20},{21,22,23,24},
> > +> 	> {25,26,27,28},{29,30,31,32},{33,34,35,36},
> > +> 	> {37,38,39,40},{41,42,43,44},{45,46,47,48}};  
> > +
> > +extern int test_vmx(v4si *varray, pid_t *pid);
> > +
> > +int vmx_syscall(void)
> > +{  
> > +> 	> pid_t fork_pid;
> > +> 	> int i;
> > +> 	> int ret;
> > +> 	> int child_ret;
> > +> 	> for (i = 0; i < 1000; i++) {  
> > +> 	> 	> /* test_vmx will fork() */
> > +> 	> 	> ret = test_vmx(varray, &fork_pid);
> > +> 	> 	> if (fork_pid == -1)  
> > +> 	> 	> 	> return -1;  
> > +> 	> 	> if (fork_pid == 0)  
> > +> 	> 	> 	> exit(ret);  
> > +> 	> 	> waitpid(fork_pid, &child_ret, 0);
> > +> 	> 	> if (ret || child_ret)  
> > +> 	> 	> 	> return 1;  
> > +> 	> }  
> > +  
> > +> 	> return 0;  
> > +}
> > +
> > +int test_vmx_syscall(void)
> > +{  
> > +> 	> /*
> > +> 	>  * Setup an environment with much context switching
> > +> 	>  */
> > +> 	> pid_t pid2;
> > +> 	> pid_t pid = fork();
> > +> 	> int ret;
> > +> 	> int child_ret;
> > +> 	> FAIL_IF(pid == -1);  
> > +  
> > +> 	> pid2 = fork();
> > +> 	> ret = vmx_syscall();
> > +> 	> /* Can't FAIL_IF(pid2 == -1); because we've already forked */
> > +> 	> if (pid2 == -1) {  
> > +> 	> 	> /*
> > +> 	> 	>  * Couldn't fork, ensure child_ret is set and is a fail
> > +> 	> 	>  */
> > +> 	> 	> ret = child_ret = 1;  
> > +> 	> } else {  
> > +> 	> 	> if (pid2)  
> > +> 	> 	> 	> waitpid(pid2, &child_ret, 0);  
> > +> 	> 	> else  
> > +> 	> 	> 	> exit(ret);  
> > +> 	> }  
> > +  
> > +> 	> ret |= child_ret;  
> > +  
> > +> 	> if (pid)  
> > +> 	> 	> waitpid(pid, &child_ret, 0);  
> > +> 	> else  
> > +> 	> 	> exit(ret);  
> > +  
> > +> 	> FAIL_IF(ret || child_ret);
> > +> 	> return 0;  
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{  
> > +> 	> return test_harness(test_vmx_syscall, "vmx_syscall");  
> > +
> > +}  

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext
  2015-11-18  3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur
  2015-11-19 11:36   ` [3/8] " Michael Ellerman
@ 2015-11-23  1:04   ` Michael Neuling
  1 sibling, 0 replies; 23+ messages in thread
From: Michael Neuling @ 2015-11-23  1:04 UTC (permalink / raw)
  To: Cyril Bur, anton, linuxppc-dev

On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:

> Load up the non volatile FPU and VMX regs and ensure that they are the
> expected value in a signal handler
> 
> Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
> ---
>  tools/testing/selftests/powerpc/math/Makefile     |   4 +-
>  tools/testing/selftests/powerpc/math/fpu_signal.c | 119 +++++++++++++++++++++
>  tools/testing/selftests/powerpc/math/vmx_signal.c | 124 ++++++++++++++++++++++
>  3 files changed, 246 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/powerpc/math/fpu_signal.c
>  create mode 100644 tools/testing/selftests/powerpc/math/vmx_signal.c
> 
> diff --git a/tools/testing/selftests/powerpc/math/Makefile
> b/tools/testing/selftests/powerpc/math/Makefile
> index 9fa690f..5ce000bf 100644
> --- a/tools/testing/selftests/powerpc/math/Makefile
> +++ b/tools/testing/selftests/powerpc/math/Makefile
> @@ -1,4 +1,4 @@
> -TEST_PROGS :=3D fpu_syscall fpu_preempt vmx_syscall vmx_preempt
> +TEST_PROGS :=3D fpu_syscall fpu_preempt fpu_signal vmx_syscall
> vmx_preempt vmx_signal


.gitignore


>  all: $(TEST_PROGS)
>
> @@ -7,9 +7,11 @@ $(TEST_PROGS): CFLAGS += -O2 -g -pthread
>
>  fpu_syscall: fpu_asm.S
>  fpu_preempt: fpu_asm.S
> +fpu_signal:  fpu_asm.S
>
>  vmx_syscall: vmx_asm.S
>  vmx_preempt: vmx_asm.S
> +vmx_signal: vmx_asm.S
>
>  include ../../lib.mk
>
> diff --git a/tools/testing/selftests/powerpc/math/fpu_signal.c b/tools/testing/selftests/powerpc/math/fpu_signal.c
> new file mode 100644
> index 0000000..ca61d1f
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/fpu_signal.c
> @@ -0,0 +1,119 @@

copyright + detailed description of test here.


> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#include <sys/time.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <stdlib.h>
> +#include <pthread.h>
> +
> +#include "utils.h"
> +
> +/* Number of times each thread should recieve the signal */

receive


> +#define ITERATIONS 10
> +/*
> + * Factor by which to multiply number of online CPUs for total number of
> + * worker threads
> + */
> +#define THREAD_FACTOR 8
> +
> +__thread double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
> +		     1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0,
> +		     2.1};
> +
> +bool bad_context;
> +int running;
> +volatile int not_ready;

Don't use volatile...  Read Documentation/volatile-considered-harmful.txt


> +extern long preempt_fpu(double *darray, volatile int *not_ready, int *sentinal);
> +
> +void signal_fpu_sig(int sig, siginfo_t *info, void *context)
> +{
> +	int i;
> +	ucontext_t *uc = context;
> +	mcontext_t *mc = &uc->uc_mcontext;
> +
> +	/* Only the non volatiles were loaded up */
> +	for (i = 14; i < 32; i++) {
> +		if (mc->fp_regs[i] != darray[i - 14]) {
> +			bad_context = true;
> +			break;
> +		}
> +	}
> +}
> +
> +void *signal_fpu_c(void *p)
> +{
> +	int i;
> +	long rc;
> +	struct sigaction act;
> +	act.sa_sigaction = signal_fpu_sig;
> +	act.sa_flags = SA_SIGINFO;
> +	rc = sigaction(SIGUSR1, &act, NULL);
> +	if (rc)
> +		return p;
> +
> +	srand(pthread_self());
> +	for (i = 0; i < 21; i++)
> +		darray[i] = rand();
> +
> +	rc = preempt_fpu(darray, &not_ready, &running);
> +
> +	return (void *) rc;
> +}
> +
> +int test_signal_fpu(void)

This is almost identical to test_signal_vmx.  Can you merge them?

> +{
> +	int i, j, rc, threads;
> +	void *rc_p;
> +	pthread_t *tids;
> +
> +	threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR;
> +	tids = malloc(threads * sizeof(pthread_t));
> +	FAIL_IF(!tids);
> +
> +	running = true;
> +	not_ready = threads;
> +	for (i = 0; i < threads; i++) {
> +		rc = pthread_create(&tids[i], NULL, signal_fpu_c, NULL);
> +		FAIL_IF(rc);
> +	}
> +
> +	setbuf(stdout, NULL);
> +	printf("\tWaiting for all workers to start...");
> +	while (not_ready);

A barrier is needed here.

> +	printf("done\n");
> +
> +	printf("\tSending signals to all threads %d times...", ITERATIONS);
> +	for (i = 0; i < ITERATIONS; i++) {
> +		for (j = 0; j < threads; j++) {
> +			pthread_kill(tids[j], SIGUSR1);
> +		}
> +		sleep(1);
> +	}
> +	printf("done\n");
> +
> +	printf("\tKilling workers...");
> +	running = 0;
> +	for (i = 0; i < threads; i++) {
> +		pthread_join(tids[i], &rc_p);
> +
> +		/*
> +		 * Harness will say the fail was here, look at why signal_fpu
> +		 * returned
> +		 */
> +		if ((long) rc_p || bad_context)

bad_context is set in a signal handler, but isn't that handler run in
another thread (I think)?  You might have some memory barrier issues
with bad_context also.

> +			printf("oops\n");
> +		if (bad_context)
> +			fprintf(stderr, "\t!! bad_context is true\n");
> +		FAIL_IF((long) rc_p || bad_context);
> +	}
> +	printf("done\n");
> +
> +	free(tids);
> +	return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	return test_harness(test_signal_fpu, "fpu_signal");
> +}
> diff --git a/tools/testing/selftests/powerpc/math/vmx_signal.c b/tools/testing/selftests/powerpc/math/vmx_signal.c
> new file mode 100644
> index 0000000..007ac9e
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/vmx_signal.c
> @@ -0,0 +1,124 @@
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/syscall.h>
> +#include <sys/time.h>
> +#include <sys/types.h>
> +#include <sys/wait.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <pthread.h>
> +
> +#include "utils.h"
> +
> +/* Number of times each thread should recieve the signal */
> +#define ITERATIONS 10
> +/*
> + * Factor by which to multiply number of online CPUs for total number of
> + * worker threads
> + */
> +#define THREAD_FACTOR 8
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +__thread v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12},
> +	{13,14,15,16},{17,18,19,20},{21,22,23,24},
> +	{25,26,27,28},{29,30,31,32},{33,34,35,36},
> +	{37,38,39,40},{41,42,43,44},{45,46,47,48}};
> +
> +bool bad_context;
> +int running;
> +volatile int not_ready;
> +
> +extern int preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal);
> +
> +void signal_vmx_sig(int sig, siginfo_t *info, void *context)
> +{
> +	int i;
> +	ucontext_t *uc = context;
> +	mcontext_t *mc = &uc->uc_mcontext;
> +
> +	/* Only the non volatiles were loaded up */
> +	for (i = 20; i < 32; i++) {
> +		if (memcmp(mc->v_regs->vrregs[i], &varray[i - 20], 16)) {
> +			bad_context = true;
> +			break;
> +		}
> +	}
> +}
> +
> +void *signal_vmx_c(void *p)
> +{
> +	int i, j;
> +	long rc;
> +	struct sigaction act;
> +	act.sa_sigaction = signal_vmx_sig;
> +	act.sa_flags = SA_SIGINFO;
> +	rc = sigaction(SIGUSR1, &act, NULL);
> +	if (rc)
> +		return p;
> +
> +	srand(pthread_self());
> +	for (i = 0; i < 12; i++)
> +		for (j = 0; j < 4; j++)
> +			varray[i][j] = rand();
> +
> +	rc = preempt_vmx(varray, &not_ready, &running);
> +
> +	return (void *) rc;
> +}
> +
> +int test_signal_vmx(void)
> +{
> +	int i, j, rc, threads;
> +	void *rc_p;
> +	pthread_t *tids;
> +
> +	threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR;
> +	tids = malloc(threads * sizeof(pthread_t));
> +	FAIL_IF(!tids);
> +
> +	running = true;
> +	not_ready = threads;
> +	for (i = 0; i < threads; i++) {
> +		rc = pthread_create(&tids[i], NULL, signal_vmx_c, NULL);
> +		FAIL_IF(rc);
> +	}
> +
> +	setbuf(stdout, NULL);
> +	printf("\tWaiting for all workers to start...");
> +	while (not_ready);
> +	printf("done\n");
> +
> +	printf("\tSending signals to all threads %d times...", ITERATIONS);
> +	for (i = 0; i < ITERATIONS; i++) {
> +		for (j = 0; j < threads; j++) {
> +			pthread_kill(tids[j], SIGUSR1);
> +		}
> +		sleep(1);
> +	}
> +	printf("done\n");
> +
> +	printf("\tKilling workers...");
> +	running = 0;
> +	for (i = 0; i < threads; i++) {
> +		pthread_join(tids[i], &rc_p);
> +
> +		/*
> +		 * Harness will say the fail was here, look at why signal_vmx
> +		 * returned
> +		 */
> +		if ((long) rc_p || bad_context)
> +			printf("oops\n");
> +		if (bad_context)
> +			fprintf(stderr, "\t!! bad_context is true\n");
> +		FAIL_IF((long) rc_p || bad_context);
> +	}
> +	printf("done\n");
> +
> +	free(tids);
> +	return 0;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	return test_harness(test_signal_vmx, "vmx_signal");
> +}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall
  2015-11-23  0:58     ` Cyril Bur
@ 2015-11-23  1:06       ` Michael Neuling
  0 siblings, 0 replies; 23+ messages in thread
From: Michael Neuling @ 2015-11-23  1:06 UTC (permalink / raw)
  To: Cyril Bur; +Cc: anton, linuxppc-dev

On Mon, 2015-11-23 at 11:58 +1100, Cyril Bur wrote:
> On Mon, 23 Nov 2015 11:23:13 +1100
> Michael Neuling <mikey@neuling.org> wrote:
>
> > On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> > > Test that the non volatile floating point and Altivec registers get
> > > correctly preserved across the fork() syscall.
> >
> > Can we add a test for VSX too?  I realise it's the same registers, but
> > the enable bits in the MSR are different so it's easy to get them wrong
> > in the kernel.
>
> Yeah, I'm sure I could get that wrong haha.
>
> Hmmmm this got me thinking. Today we always enable FP and Altivec when we
> enable VSX but isn't there a world where we could actually run with FP and
> Altivec disabled and VSX on? In which case, is the whole thing volatile or
> does the kernel still need to save the subset of the matrix which
> corresponds to non-volatile FPs and non-volatile Altivec?

The hardware can run with FP and VMX off and VSX on but we should never
do that in Linux.

Mikey

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/8] powerpc: Explicitly disable math features when copying thread
  2015-11-18  3:26 ` [PATCH 4/8] powerpc: Explicitly disable math features when copying thread Cyril Bur
@ 2015-11-23  1:08   ` Michael Neuling
  2015-11-23  3:20     ` Cyril Bur
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Neuling @ 2015-11-23  1:08 UTC (permalink / raw)
  To: Cyril Bur, anton, linuxppc-dev

On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> With threads leaving the math bits enabled in their saved MSR to indicate
> that the hardware is hot and a restore is not needed, children need to turn
> it off as when they do get scheduled, there's no way their registers could
> have been hot.
>
> Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
> ---
>  arch/powerpc/kernel/process.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 398f7bf..441d9e5 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1243,6 +1243,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
>
>  		f = ret_from_fork;
>  	}
> +	childregs->msr &= ~(MSR_FP|MSR_VEC|MSR_VSX);

Is this a current bug?

Mikey

>  	sp -= STACK_FRAME_OVERHEAD;
>
>  	/*

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used
  2015-11-18  3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur
  2015-11-20 11:01   ` Michael Ellerman
@ 2015-11-23  1:29   ` Michael Neuling
  1 sibling, 0 replies; 23+ messages in thread
From: Michael Neuling @ 2015-11-23  1:29 UTC (permalink / raw)
  To: Cyril Bur, anton, linuxppc-dev

On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> Currently the FPU, VEC and VSX facilities are lazily loaded. This is not a
> problem unless a process is using these facilities.

I would prefer to say facilities are "enabled" and registers are
"loaded".  You're mixing the two here.

> Modern versions of GCC are very good at automatically vectorising code, new
> and modernised workloads make use of floating point and vector facilities,
> even the kernel makes use of vectorised memcpy.
>
> All this combined greatly increases the cost of a syscall since the kernel
> uses the facilities sometimes even in syscall fast-path making it
> increasingly common for a thread to take an *_unavailable exception soon
> after a syscall, not to mention potentially taking the trifecta.
>
> The obvious overcompensation to this problem is to simply always load all
> the facilities on every exit to userspace. Loading up all FPU, VEC and VSX
> registers every time can be expensive and if a workload does avoid using
> them, it should not be forced to incur this penalty.
>
> An 8bit counter is used to detect if the registers have been used in the
> past and the registers are always loaded until the value wraps back to
> zero.
>
> Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
> ---
>  arch/powerpc/include/asm/processor.h |  2 ++
>  arch/powerpc/kernel/asm-offsets.c    |  2 ++
>  arch/powerpc/kernel/entry_64.S       | 55 ++++++++++++++++++++++++++++--
>  arch/powerpc/kernel/fpu.S            |  4 +++
>  arch/powerpc/kernel/process.c        | 66 ++++++++++++++++++++++++++++++------
>  arch/powerpc/kernel/vector.S         |  4 +++
>  6 files changed, 119 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
> index ac23308..dcab21f 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -236,11 +236,13 @@ struct thread_struct {
>  #endif
>  	struct arch_hw_breakpoint hw_brk; /* info on the hardware breakpoint */
>  	unsigned long	trap_nr;	/* last trap # on this thread */
> +	u8 load_fp;
>  #ifdef CONFIG_ALTIVEC
>  	struct thread_vr_state vr_state;
>  	struct thread_vr_state *vr_save_area;
>  	unsigned long	vrsave;
>  	int		used_vr;	/* set if process has used altivec */
> +	u8 load_vec;
>  #endif /* CONFIG_ALTIVEC */
>  #ifdef CONFIG_VSX
>  	/* VSR status */
> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> index 221d584..0f593d7 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -95,12 +95,14 @@ int main(void)
>  	DEFINE(THREAD_FPSTATE, offsetof(struct thread_struct, fp_state));
>  	DEFINE(THREAD_FPSAVEAREA, offsetof(struct thread_struct, fp_save_area));
>  	DEFINE(FPSTATE_FPSCR, offsetof(struct thread_fp_state, fpscr));
> +	DEFINE(THREAD_LOAD_FP, offsetof(struct thread_struct, load_fp));
>  #ifdef CONFIG_ALTIVEC
>  	DEFINE(THREAD_VRSTATE, offsetof(struct thread_struct, vr_state));
>  	DEFINE(THREAD_VRSAVEAREA, offsetof(struct thread_struct, vr_save_area));
>  	DEFINE(THREAD_VRSAVE, offsetof(struct thread_struct, vrsave));
>  	DEFINE(THREAD_USED_VR, offsetof(struct thread_struct, used_vr));
>  	DEFINE(VRSTATE_VSCR, offsetof(struct thread_vr_state, vscr));
> +	DEFINE(THREAD_LOAD_VEC, offsetof(struct thread_struct, load_vec));
>  #endif /* CONFIG_ALTIVEC */
>  #ifdef CONFIG_VSX
>  	DEFINE(THREAD_USED_VSR, offsetof(struct thread_struct, used_vsr));
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index c8b4225..46e9869 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -210,7 +210,54 @@ system_call:			/* label this so stack traces look sane */
>  	li	r11,-MAX_ERRNO
>  	andi.	r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
>  	bne-	syscall_exit_work
> -	cmpld	r3,r11
> +
> +	/*
> +	 * This is an assembly version of checks performed in restore_math()
> +	 * to avoid calling C unless absolutely necessary.
> +	 * Note: In order to simplify the assembly, if the FP or VEC registers
> +	 * are hot (and therefore restore_math() isn't called) the
> +	 * LOAD_{FP,VEC} thread counter doesn't get incremented.
> +	 * This is likely the best thing to do anyway because hot regs indicate
> +	 * that the workload is doing a lot of syscalls that can be handled
> +	 * quickly and without the need to touch FP or VEC regs (by the kernel).
> +	 * a) If this workload is long running then this is exactly what the
> +	 * kernel should be doing.
> +	 * b) If this workload isn't long running then we'll soon fall back to
> +	 * calling into C and the counter will be incremented regularly again
> +	 * anyway.
> +	 */
> +	ld	r9,PACACURRENT(r13)
> +	andi.	r0,r8,MSR_FP
> +	addi	r9,r9,THREAD
> +	lbz	r5,THREAD_LOAD_FP(r9)

Would this work as a minor simplification?

	ld	r9,PACACURRENT(r13)
	andi.	r0,r8,MSR_FP
	lbz	r5,THREAD+THREAD_LOAD_FP(r9)

> +	/*
> +	 * Goto 2 if !r0 && r5
> +	 * The cmpb works because r5 can only have bits set in the lowest byte
> +	 * and r0 may or may not have bit 13 set (different byte) but will have
> +	 * a zero low byte therefore the low bytes must differ if r5 == true
> +	 * and the bit 13 byte must be the same if !r0
> +	 */
> +	cmpb	r7,r0,r5
> +	cmpldi	r7,0xff0
> +#ifdef CONFIG_ALTIVEC
> +	beq	2f
> +
> +	lbz	r9,THREAD_LOAD_VEC(r9)
> +	andis.	r0,r8,MSR_VEC@h
> +	/* Skip (goto 3) if r0 || !r9 */
> +	bne	3f
> +	cmpldi	r9,0
> +	beq 3f
> +#else
> +	bne 3f
> +#endif
> +2:	addi    r3,r1,STACK_FRAME_OVERHEAD
> +	bl	restore_math
> +	ld	r8,_MSR(r1)
> +	ld	r3,RESULT(r1)
> +	li	r11,-MAX_ERRNO
> +
> +3:	cmpld	r3,r11
>  	ld	r5,_CCR(r1)
>  	bge-	syscall_error
>  .Lsyscall_error_cont:
> @@ -592,8 +639,8 @@ _GLOBAL(ret_from_except_lite)
>
>  	/* Check current_thread_info()->flags */
>  	andi.	r0,r4,_TIF_USER_WORK_MASK
> -#ifdef CONFIG_PPC_BOOK3E
>  	bne	1f
> +#ifdef CONFIG_PPC_BOOK3E

why this change?

>  	/*
>  	 * Check to see if the dbcr0 register is set up to debug.
>  	 * Use the internal debug mode bit to do this.
> @@ -608,7 +655,9 @@ _GLOBAL(ret_from_except_lite)
>  	mtspr	SPRN_DBSR,r10
>  	b	restore
>  #else
> -	beq	restore
> +	addi	r3,r1,STACK_FRAME_OVERHEAD
> +	bl	restore_math
> +	b	restore
>  #endif
>  1:	andi.	r0,r4,_TIF_NEED_RESCHED
>  	beq	2f
> diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S
> index 2117eac..b063524 100644
> --- a/arch/powerpc/kernel/fpu.S
> +++ b/arch/powerpc/kernel/fpu.S
> @@ -130,6 +130,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
>  	or	r12,r12,r4
>  	std	r12,_MSR(r1)
>  #endif
> +	/* Don't care if r4 overflows, this is desired behaviour */
> +	lbz	r4,THREAD_LOAD_FP(r5)
> +	addi	r4,r4,1
> +	stb	r4,THREAD_LOAD_FP(r5)
>  	addi	r10,r5,THREAD_FPSTATE
>  	lfd	fr0,FPSTATE_FPSCR(r10)
>  	MTFSF_L(fr0)
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 441d9e5..c602b67 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -366,6 +366,53 @@ void giveup_all(struct task_struct *tsk)
>  	msr_check_and_clear(msr_all_available);
>  }
> =20
> +void restore_math(struct pt_regs *regs)
> +{
> +	unsigned long msr;
> +
> +	if (!current->thread.load_fp
> +#ifdef CONFIG_ALTIVEC
> +		&& !current->thread.load_vec)
> +#else
> +		)
> +#endif

That is possibly the most revolting use of #ifdef I have ever seen.
And that's saying something with the crap state that process.c is
already in :-P

> +		return;
> +
> +	msr = regs->msr;
> +	msr_check_and_set(msr_all_available);
> +
> +	/*
> +	 * Only reload if the bit is not set in the user MSR, the bit BEING set
> +	 * indicates that the registers are hot
> +	 */
> +#ifdef CONFIG_PPC_FPU
> +	if (current->thread.load_fp && !(msr & MSR_FP)) {
> +		load_fp_state(&current->thread.fp_state);
> +		msr |= MSR_FP | current->thread.fpexc_mode;
> +		current->thread.load_fp++;
> +	}
> +#endif
> +#ifdef CONFIG_ALTIVEC
> +	if (current->thread.load_vec && !(msr & MSR_VEC) &&
> +			cpu_has_feature(CPU_FTR_ALTIVEC)) {
> +		load_vr_state(&current->thread.vr_state);
> +		current->thread.used_vr = 1;
> +		msr |= MSR_VEC;
> +		current->thread.load_vec++;
> +	}
> +#endif
> +#ifdef CONFIG_VSX
> +	if (!(msr & MSR_VSX) && (msr & (MSR_FP | MSR_VEC)) == (MSR_FP | MSR_VEC)) {
> +		current->thread.used_vsr = 1;
> +		msr |= MSR_VSX;
> +	}
> +#endif
> +
> +	msr_check_and_clear(msr_all_available);
> +
> +	regs->msr = msr;
> +}
> +
>  void flush_all_to_thread(struct task_struct *tsk)
>  {
>  	if (tsk->thread.regs) {
> @@ -806,17 +853,9 @@ void restore_tm_state(struct pt_regs *regs)
>
>  	msr_diff = current->thread.ckpt_regs.msr & ~regs->msr;
>  	msr_diff &= MSR_FP | MSR_VEC | MSR_VSX;
> -	if (msr_diff & MSR_FP) {
> -		msr_check_and_set(MSR_FP);
> -		load_fp_state(&current->thread.fp_state);
> -		msr_check_and_clear(MSR_FP);
> -		regs->msr |= current->thread.fpexc_mode;
> -	}
> -	if (msr_diff & MSR_VEC) {
> -		msr_check_and_set(MSR_VEC);
> -		load_vr_state(&current->thread.vr_state);
> -		msr_check_and_clear(MSR_VEC);
> -	}
> +
> +	restore_math(regs);
> +
>  	regs->msr |= msr_diff;
>
> =20
> @@ -977,6 +1016,11 @@ struct task_struct *__switch_to(struct task_struct *prev,
>  		batch = this_cpu_ptr(&ppc64_tlb_batch);
>  		batch->active = 1;
>  	}
> +
> +	/* Don't do this on a kernel thread */
> +	if (current_thread_info()->task->thread.regs)
> +		restore_math(current_thread_info()->task->thread.regs);
> +
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>
>  	return last;
> diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S
> index 162d0f7..038cff8 100644
> --- a/arch/powerpc/kernel/vector.S
> +++ b/arch/powerpc/kernel/vector.S
> @@ -91,6 +91,10 @@ _GLOBAL(load_up_altivec)
>  	oris	r12,r12,MSR_VEC@h
>  	std	r12,_MSR(r1)
>  #endif
> +	/* Don't care if r4 overflows, this is desired behaviour */
> +	lbz	r4,THREAD_LOAD_VEC(r5)
> +	addi	r4,r4,1
> +	stb	r4,THREAD_LOAD_VEC(r5)
>  	addi	r6,r5,THREAD_VRSTATE
>  	li	r4,1
>  	li	r10,VRSTATE_VSCR

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/8] powerpc: Explicitly disable math features when copying thread
  2015-11-23  1:08   ` Michael Neuling
@ 2015-11-23  3:20     ` Cyril Bur
  0 siblings, 0 replies; 23+ messages in thread
From: Cyril Bur @ 2015-11-23  3:20 UTC (permalink / raw)
  To: Michael Neuling; +Cc: anton, linuxppc-dev

On Mon, 23 Nov 2015 12:08:38 +1100
Michael Neuling <mikey@neuling.org> wrote:

> On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> > With threads leaving the math bits enabled in their saved MSR to indicate
> > that the hardware is hot and a restore is not needed, children need to turn
> > it off as when they do get scheduled, there's no way their registers could
> > have been hot.
> > 
> > Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
> > ---
> >  arch/powerpc/kernel/process.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 398f7bf..441d9e5 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -1243,6 +1243,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
> >  
> >  		f = ret_from_fork;
> >  	}
> > +	childregs->msr &= ~(MSR_FP|MSR_VEC|MSR_VSX);  
> 
> Is this a current bug? 
> 

It is impossible currently because saving the registers (of the parent, before
creating the child) also forces a giveup of the facilities.

The next patch in the series decouples the saving and the giving up which
makes this situation possible.

> Mikey
> 
> >  	sp -= STACK_FRAME_OVERHEAD;
> >  
> >  	/*  

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-11-23  3:20 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
2015-11-18  3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur
2015-11-23  0:23   ` Michael Neuling
2015-11-23  0:58     ` Cyril Bur
2015-11-23  1:06       ` Michael Neuling
2015-11-18  3:26 ` [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption Cyril Bur
2015-11-23  0:34   ` Michael Neuling
2015-11-18  3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur
2015-11-19 11:36   ` [3/8] " Michael Ellerman
2015-11-23  1:04   ` [PATCH 3/8] " Michael Neuling
2015-11-18  3:26 ` [PATCH 4/8] powerpc: Explicitly disable math features when copying thread Cyril Bur
2015-11-23  1:08   ` Michael Neuling
2015-11-23  3:20     ` Cyril Bur
2015-11-18  3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur
2015-11-20 11:01   ` Michael Ellerman
2015-11-22 22:18     ` Cyril Bur
2015-11-22 23:07       ` Michael Ellerman
2015-11-23  1:29   ` Michael Neuling
2015-11-18  3:26 ` [PATCH 6/8] powerpc: Add the ability to save FPU without giving it up Cyril Bur
2015-11-18  3:26 ` [PATCH 7/8] powerpc: Add the ability to save Altivec " Cyril Bur
2015-11-18  3:26 ` [PATCH 8/8] powerpc: Add the ability to save VSX " Cyril Bur
2015-11-18 14:51 ` [PATCH 0/8] FP/VEC/VSX switching optimisations David Laight
2015-11-18 23:01   ` Cyril Bur
