* [PATCH 0/8] FP/VEC/VSX switching optimisations @ 2015-11-18  3:26 Cyril Bur
  2015-11-18  3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur
  ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: Cyril Bur @ 2015-11-18 3:26 UTC (permalink / raw)
To: mikey, anton, linuxppc-dev

Hi,

These patches extend the work done by Anton
(https://patchwork.ozlabs.org/patch/537621/) and need to be applied on top
of that series.

The goal of these patches is to rework how the 'math' registers (FP, VEC
and VSX) are context switched. Currently the kernel adopts a lazy approach:
it always switches userspace tasks in with all three facilities disabled
and loads each set of registers upon receiving the corresponding
unavailable exception. The kernel does try to avoid disabling the features
in the syscall fast path, but during testing it appears that even what
should be a simple syscall still causes the kernel to use some facilities
(vectorised memcpy, for example) for itself, and therefore disable them for
the user task.

The lazy approach keeps the time spent restoring userspace state small, and
if tasks don't use any of these facilities it is the correct thing to do.
In recent years, new workloads and new features such as auto-vectorisation
in GCC have increased userspace use of these facilities, so much so that
some workloads can have a task take an FP unavailable exception and a VEC
unavailable exception almost every time slice.

This series removes the general laziness in favour of a more selective
approach. If a task uses any of the 'math' facilities, the kernel will load
the registers and enable the facilities for future time slices, on the
assumption that use is likely to continue for some time. This removes the
cost of having to take an exception.
These patches also add logic to detect whether a task had been using a
facility, and optimise the case where the registers are still hot. This
provides another speedup: not only is the cost of the exception saved, but
so is the cost of copying up to 64 x 128-bit registers.

With these patches applied on top of Anton's patches I observe a
significant improvement with Anton's context switch microbenchmark using
yield(): http://ozlabs.org/~anton/junkcode/context_switch2.c

Using an LE kernel compiled with pseries_le_defconfig, running:

  ./context_switch2 --test=yield 8 8

with one of --fp, --altivec or --vector added gives a 5% improvement on a
POWER8 CPU, and

  ./context_switch2 --test=yield --fp --altivec --vector 8 8

gives a 15% improvement on a POWER8 CPU.

I'll take this opportunity to note that the 15% figure can be somewhat
misleading. It may be reasonable to assume that each of the optimisations
has had a compounding effect. That isn't incorrect, and the reason behind
the apparent compounding reveals a lot about where the current bottleneck
is. The tests always touch FP first, then VEC, then VSX, which is the
guaranteed worst case for the way the kernel currently operates: it
triggers three successive unavailable exceptions. Since the kernel
currently enables all three facilities after taking a VSX unavailable
exception, the tests can instead be modified to touch VSX, then VEC, then
FP; in that order the difference in performance when touching all three is
only 5%. There is a compounding effect only in so far as the cost of taking
multiple unavailable exceptions is removed. This testing also demonstrates
that the exception itself is by far the most expensive part of the current
lazy approach.
Cyril Bur (8):
  selftests/powerpc: Test the preservation of FPU and VMX regs across syscall
  selftests/powerpc: Test preservation of FPU and VMX regs across preemption
  selftests/powerpc: Test FPU and VMX regs in signal ucontext
  powerpc: Explicitly disable math features when copying thread
  powerpc: Restore FPU/VEC/VSX if previously used
  powerpc: Add the ability to save FPU without giving it up
  powerpc: Add the ability to save Altivec without giving it up
  powerpc: Add the ability to save VSX without giving it up

 arch/powerpc/include/asm/processor.h               |   2 +
 arch/powerpc/include/asm/switch_to.h               |   5 +-
 arch/powerpc/kernel/asm-offsets.c                  |   2 +
 arch/powerpc/kernel/entry_64.S                     |  55 +++++-
 arch/powerpc/kernel/fpu.S                          |  25 +--
 arch/powerpc/kernel/ppc_ksyms.c                    |   4 -
 arch/powerpc/kernel/process.c                      | 144 ++++++++++++--
 arch/powerpc/kernel/vector.S                       |  45 +----
 tools/testing/selftests/powerpc/Makefile           |   3 +-
 tools/testing/selftests/powerpc/math/Makefile      |  19 ++
 tools/testing/selftests/powerpc/math/basic_asm.h   |  26 +++
 tools/testing/selftests/powerpc/math/fpu_asm.S     | 185 +++++++++++++++++
 tools/testing/selftests/powerpc/math/fpu_preempt.c |  92 +++++++++
 tools/testing/selftests/powerpc/math/fpu_signal.c  | 119 +++++++++++
 tools/testing/selftests/powerpc/math/fpu_syscall.c |  79 ++++++++
 tools/testing/selftests/powerpc/math/vmx_asm.S     | 219 +++++++++++++++++++++
 tools/testing/selftests/powerpc/math/vmx_preempt.c |  92 +++++++++
 tools/testing/selftests/powerpc/math/vmx_signal.c  | 124 ++++++++++++
 tools/testing/selftests/powerpc/math/vmx_syscall.c |  81 ++++++++
 19 files changed, 1240 insertions(+), 81 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/math/Makefile
 create mode 100644 tools/testing/selftests/powerpc/math/basic_asm.h
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_preempt.c
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_signal.c
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_syscall.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_preempt.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_signal.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_syscall.c

-- 
2.6.2

^ permalink raw reply	[flat|nested] 23+ messages in thread
* [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall
  2015-11-18  3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur
@ 2015-11-18  3:26 ` Cyril Bur
  2015-11-23  0:23   ` Michael Neuling
  2015-11-18  3:26 ` [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption Cyril Bur
  ` (7 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: Cyril Bur @ 2015-11-18 3:26 UTC (permalink / raw)
To: mikey, anton, linuxppc-dev

Test that the non-volatile floating point and Altivec registers get
correctly preserved across the fork() syscall.

fork() works nicely for this purpose; the registers should be the same for
both parent and child.

Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
---
 tools/testing/selftests/powerpc/Makefile           |   3 +-
 tools/testing/selftests/powerpc/math/Makefile      |  14 ++
 tools/testing/selftests/powerpc/math/basic_asm.h   |  26 +++
 tools/testing/selftests/powerpc/math/fpu_asm.S     | 151 +++++++++++++++++
 tools/testing/selftests/powerpc/math/fpu_syscall.c |  79 +++++++++
 tools/testing/selftests/powerpc/math/vmx_asm.S     | 183 +++++++++++++++++++++
 tools/testing/selftests/powerpc/math/vmx_syscall.c |  81 +++++++++
 7 files changed, 536 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/powerpc/math/Makefile
 create mode 100644 tools/testing/selftests/powerpc/math/basic_asm.h
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/fpu_syscall.c
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_asm.S
 create mode 100644 tools/testing/selftests/powerpc/math/vmx_syscall.c

diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
index 0c2706b..19e8191 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -22,7 +22,8 @@ SUB_DIRS = benchmarks \
 		   switch_endian	\
 		   syscalls		\
 		   tm			\
-		   vphn
+		   vphn \
+		   math
 
 endif
 
diff --git
a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile new file mode 100644 index 0000000..896d9e2 --- /dev/null +++ b/tools/testing/selftests/powerpc/math/Makefile @@ -0,0 +1,14 @@ +TEST_PROGS := fpu_syscall vmx_syscall + +all: $(TEST_PROGS) + +$(TEST_PROGS): ../harness.c +$(TEST_PROGS): CFLAGS += -O2 -g + +fpu_syscall: fpu_asm.S +vmx_syscall: vmx_asm.S + +include ../../lib.mk + +clean: + rm -f $(TEST_PROGS) *.o diff --git a/tools/testing/selftests/powerpc/math/basic_asm.h b/tools/testing/selftests/powerpc/math/basic_asm.h new file mode 100644 index 0000000..27aca79 --- /dev/null +++ b/tools/testing/selftests/powerpc/math/basic_asm.h @@ -0,0 +1,26 @@ +#include <ppc-asm.h> +#include <asm/unistd.h> + +#define LOAD_REG_IMMEDIATE(reg,expr) \ + lis reg,(expr)@highest; \ + ori reg,reg,(expr)@higher; \ + rldicr reg,reg,32,31; \ + oris reg,reg,(expr)@high; \ + ori reg,reg,(expr)@l; + +#define PUSH_BASIC_STACK(size) \ + std 2,24(sp); \ + mflr r0; \ + std r0,16(sp); \ + mfcr r0; \ + stw r0,8(sp); \ + stdu sp,-size(sp); + +#define POP_BASIC_STACK(size) \ + addi sp,sp,size; \ + ld 2,24(sp); \ + ld r0,16(sp); \ + mtlr r0; \ + lwz r0,8(sp); \ + mtcr r0; \ + diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testing/selftests/powerpc/math/fpu_asm.S new file mode 100644 index 0000000..d5412c1 --- /dev/null +++ b/tools/testing/selftests/powerpc/math/fpu_asm.S @@ -0,0 +1,151 @@ +#include "basic_asm.h" + +#define PUSH_FPU(pos) \ + stfd f14,pos(sp); \ + stfd f15,pos+8(sp); \ + stfd f16,pos+16(sp); \ + stfd f17,pos+24(sp); \ + stfd f18,pos+32(sp); \ + stfd f19,pos+40(sp); \ + stfd f20,pos+48(sp); \ + stfd f21,pos+56(sp); \ + stfd f22,pos+64(sp); \ + stfd f23,pos+72(sp); \ + stfd f24,pos+80(sp); \ + stfd f25,pos+88(sp); \ + stfd f26,pos+96(sp); \ + stfd f27,pos+104(sp); \ + stfd f28,pos+112(sp); \ + stfd f29,pos+120(sp); \ + stfd f30,pos+128(sp); \ + stfd f31,pos+136(sp); + +#define POP_FPU(pos) \ + lfd f14,pos(sp); \ + lfd 
f15,pos+8(sp); \ + lfd f16,pos+16(sp); \ + lfd f17,pos+24(sp); \ + lfd f18,pos+32(sp); \ + lfd f19,pos+40(sp); \ + lfd f20,pos+48(sp); \ + lfd f21,pos+56(sp); \ + lfd f22,pos+64(sp); \ + lfd f23,pos+72(sp); \ + lfd f24,pos+80(sp); \ + lfd f25,pos+88(sp); \ + lfd f26,pos+96(sp); \ + lfd f27,pos+104(sp); \ + lfd f28,pos+112(sp); \ + lfd f29,pos+120(sp); \ + lfd f30,pos+128(sp); \ + lfd f31,pos+136(sp); + +#Careful calling this, it will 'clobber' fpu (by design) +#Don't call this from C +FUNC_START(load_fpu) + lfd f14,0(r3) + lfd f15,8(r3) + lfd f16,16(r3) + lfd f17,24(r3) + lfd f18,32(r3) + lfd f19,40(r3) + lfd f20,48(r3) + lfd f21,56(r3) + lfd f22,64(r3) + lfd f23,72(r3) + lfd f24,80(r3) + lfd f25,88(r3) + lfd f26,96(r3) + lfd f27,104(r3) + lfd f28,112(r3) + lfd f29,120(r3) + lfd f30,128(r3) + lfd f31,136(r3) + blr +FUNC_END(load_fpu) + +FUNC_START(check_fpu) + mr r4,r3 + li r3,1 #assume a bad result + lfd f0,0(r4) + fcmpu cr1,f0,f14 + bne cr1,1f + lfd f0,8(r4) + fcmpu cr1,f0,f15 + bne cr1,1f + lfd f0,16(r4) + fcmpu cr1,f0,f16 + bne cr1,1f + lfd f0,24(r4) + fcmpu cr1,f0,f17 + bne cr1,1f + lfd f0,32(r4) + fcmpu cr1,f0,f18 + bne cr1,1f + lfd f0,40(r4) + fcmpu cr1,f0,f19 + bne cr1,1f + lfd f0,48(r4) + fcmpu cr1,f0,f20 + bne cr1,1f + lfd f0,56(r4) + fcmpu cr1,f0,f21 + bne cr1,1f + lfd f0,64(r4) + fcmpu cr1,f0,f22 + bne cr1,1f + lfd f0,72(r4) + fcmpu cr1,f0,f23 + bne cr1,1f + lfd f0,80(r4) + fcmpu cr1,f0,f24 + bne cr1,1f + lfd f0,88(r4) + fcmpu cr1,f0,f25 + bne cr1,1f + lfd f0,96(r4) + fcmpu cr1,f0,f26 + bne cr1,1f + lfd f0,104(r4) + fcmpu cr1,f0,f27 + bne cr1,1f + lfd f0,112(r4) + fcmpu cr1,f0,f28 + bne cr1,1f + lfd f0,120(r4) + fcmpu cr1,f0,f29 + bne cr1,1f + lfd f0,128(r4) + fcmpu cr1,f0,f30 + bne cr1,1f + lfd f0,136(r4) + fcmpu cr1,f0,f31 + bne cr1,1f + li r3,0 #Sucess!!! 
+1: blr + +FUNC_START(test_fpu) + #r3 holds pointer to where to put the result of fork + #f14-f31 are non volatiles + PUSH_BASIC_STACK(256) + std r3,40(sp) #Address of darray + std r4,48(sp) #Address of pid + PUSH_FPU(56) + + bl load_fpu + nop + li r0,__NR_fork + sc + + #pass the result of the fork to the caller + ld r9,48(sp) + std r3,0(r9) + + ld r3,40(sp) + bl check_fpu + nop + + POP_FPU(56) + POP_BASIC_STACK(256) + blr +FUNC_END(test_fpu) diff --git a/tools/testing/selftests/powerpc/math/fpu_syscall.c b/tools/testing/selftests/powerpc/math/fpu_syscall.c new file mode 100644 index 0000000..a967fd6 --- /dev/null +++ b/tools/testing/selftests/powerpc/math/fpu_syscall.c @@ -0,0 +1,79 @@ +#include <stdio.h> +#include <unistd.h> +#include <sys/syscall.h> +#include <sys/time.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <stdlib.h> + +#include "utils.h" + +extern int test_fpu(double *darray, pid_t *pid); + +double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, + 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, + 2.1}; + +int syscall_fpu(void) +{ + pid_t fork_pid; + int i; + int ret; + int child_ret; + for (i = 0; i < 1000; i++) { + /* test_fpu will fork() */ + ret = test_fpu(darray, &fork_pid); + if (fork_pid == -1) + return -1; + if (fork_pid == 0) + exit(ret); + waitpid(fork_pid, &child_ret, 0); + if (ret || child_ret) + return 1; + } + + return 0; +} + +int test_syscall_fpu(void) +{ + /* + * Setup an environment with much context switching + */ + pid_t pid2; + pid_t pid = fork(); + int ret; + int child_ret; + FAIL_IF(pid == -1); + + pid2 = fork(); + /* Can't FAIL_IF(pid2 == -1); because already forked once */ + if (pid2 == -1) { + /* + * Couldn't fork, ensure test is a fail + */ + child_ret = ret = 1; + } else { + ret = syscall_fpu(); + if (pid2) + waitpid(pid2, &child_ret, 0); + else + exit(ret); + } + + ret |= child_ret; + + if (pid) + waitpid(pid, &child_ret, 0); + else + exit(ret); + + FAIL_IF(ret || child_ret); + return 0; +} + 
+int main(int argc, char *argv[]) +{ + return test_harness(test_syscall_fpu, "syscall_fpu"); + +} diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testing/selftests/powerpc/math/vmx_asm.S new file mode 100644 index 0000000..e642e67 --- /dev/null +++ b/tools/testing/selftests/powerpc/math/vmx_asm.S @@ -0,0 +1,183 @@ +#include "basic_asm.h" + +#define PUSH_VMX(pos,reg) \ + li reg,pos; \ + stvx v20,reg,sp; \ + addi reg,reg,16; \ + stvx v21,reg,sp; \ + addi reg,reg,16; \ + stvx v22,reg,sp; \ + addi reg,reg,16; \ + stvx v23,reg,sp; \ + addi reg,reg,16; \ + stvx v24,reg,sp; \ + addi reg,reg,16; \ + stvx v25,reg,sp; \ + addi reg,reg,16; \ + stvx v26,reg,sp; \ + addi reg,reg,16; \ + stvx v27,reg,sp; \ + addi reg,reg,16; \ + stvx v28,reg,sp; \ + addi reg,reg,16; \ + stvx v29,reg,sp; \ + addi reg,reg,16; \ + stvx v30,reg,sp; \ + addi reg,reg,16; \ + stvx v31,reg,sp; + +#define POP_VMX(pos,reg) \ + li reg,pos; \ + lvx v20,reg,sp; \ + addi reg,reg,16; \ + lvx v21,reg,sp; \ + addi reg,reg,16; \ + lvx v22,reg,sp; \ + addi reg,reg,16; \ + lvx v23,reg,sp; \ + addi reg,reg,16; \ + lvx v24,reg,sp; \ + addi reg,reg,16; \ + lvx v25,reg,sp; \ + addi reg,reg,16; \ + lvx v26,reg,sp; \ + addi reg,reg,16; \ + lvx v27,reg,sp; \ + addi reg,reg,16; \ + lvx v28,reg,sp; \ + addi reg,reg,16; \ + lvx v29,reg,sp; \ + addi reg,reg,16; \ + lvx v30,reg,sp; \ + addi reg,reg,16; \ + lvx v31,reg,sp; + +#Carefull this will 'clobber' vmx (by design) +#Don't call this from C +FUNC_START(load_vmx) + li r5,0 + lvx v20,r5,r3 + addi r5,r5,16 + lvx v21,r5,r3 + addi r5,r5,16 + lvx v22,r5,r3 + addi r5,r5,16 + lvx v23,r5,r3 + addi r5,r5,16 + lvx v24,r5,r3 + addi r5,r5,16 + lvx v25,r5,r3 + addi r5,r5,16 + lvx v26,r5,r3 + addi r5,r5,16 + lvx v27,r5,r3 + addi r5,r5,16 + lvx v28,r5,r3 + addi r5,r5,16 + lvx v29,r5,r3 + addi r5,r5,16 + lvx v30,r5,r3 + addi r5,r5,16 + lvx v31,r5,r3 + blr +FUNC_END(load_vmx) + +#Should be safe from C, only touches r4, r5 and v0,v1,v2 +FUNC_START(check_vmx) + 
PUSH_BASIC_STACK(16) + mr r4,r3 + li r3,1 #assume a bad result + li r5,0 + lvx v0,r5,r4 + vcmpequd. v1,v0,v20 + vmr v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v21 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v22 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v23 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v24 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v25 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v26 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v27 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v28 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v29 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v30 + vand v2,v2,v1 + + addi r5,r5,16 + lvx v0,r5,r4 + vcmpequd. v1,v0,v31 + vand v2,v2,v1 + + li r5,0 + stvx v2,r5,sp + ldx r0,r5,sp + cmpdi r0,0xffffffff + bne 1f + li r3,0 +1: POP_BASIC_STACK(16) + blr +FUNC_END(check_vmx) + +#Safe from C +FUNC_START(test_vmx) + #r3 holds pointer to where to put the result of fork + #v20-v31 are non-volatile + PUSH_BASIC_STACK(512) + std r3,40(sp) #Address of varray + std r4,48(sp) #address of pid + PUSH_VMX(56, r4) + + bl load_vmx + + li r0,__NR_fork + sc + #Pass the result of fork back to the caller + ld r9,48(sp) + std r3,0(r9) + + ld r3,40(sp) + bl check_vmx + + POP_VMX(56,r4) + POP_BASIC_STACK(512) + blr +FUNC_END(test_vmx) diff --git a/tools/testing/selftests/powerpc/math/vmx_syscall.c b/tools/testing/selftests/powerpc/math/vmx_syscall.c new file mode 100644 index 0000000..7adff05 --- /dev/null +++ b/tools/testing/selftests/powerpc/math/vmx_syscall.c @@ -0,0 +1,81 @@ +#include <stdio.h> +#include <unistd.h> +#include <sys/syscall.h> +#include <sys/time.h> +#include <stdlib.h> +#include <sys/types.h> +#include <sys/wait.h> + +#include "utils.h" + +typedef int v4si __attribute__ ((vector_size (16))); +v4si varray[] = {{1, 2, 3, 
4}, {5, 6, 7, 8}, {9, 10,11,12}, + {13,14,15,16},{17,18,19,20},{21,22,23,24}, + {25,26,27,28},{29,30,31,32},{33,34,35,36}, + {37,38,39,40},{41,42,43,44},{45,46,47,48}}; + +extern int test_vmx(v4si *varray, pid_t *pid); + +int vmx_syscall(void) +{ + pid_t fork_pid; + int i; + int ret; + int child_ret; + for (i = 0; i < 1000; i++) { + /* test_vmx will fork() */ + ret = test_vmx(varray, &fork_pid); + if (fork_pid == -1) + return -1; + if (fork_pid == 0) + exit(ret); + waitpid(fork_pid, &child_ret, 0); + if (ret || child_ret) + return 1; + } + + return 0; +} + +int test_vmx_syscall(void) +{ + /* + * Setup an environment with much context switching + */ + pid_t pid2; + pid_t pid = fork(); + int ret; + int child_ret; + FAIL_IF(pid == -1); + + pid2 = fork(); + ret = vmx_syscall(); + /* Can't FAIL_IF(pid2 == -1); because we've already forked */ + if (pid2 == -1) { + /* + * Couldn't fork, ensure child_ret is set and is a fail + */ + ret = child_ret = 1; + } else { + if (pid2) + waitpid(pid2, &child_ret, 0); + else + exit(ret); + } + + ret |= child_ret; + + if (pid) + waitpid(pid, &child_ret, 0); + else + exit(ret); + + FAIL_IF(ret || child_ret); + return 0; +} + +int main(int argc, char *argv[]) +{ + return test_harness(test_vmx_syscall, "vmx_syscall"); + +} -- 2.6.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall
  2015-11-18  3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur
@ 2015-11-23  0:23   ` Michael Neuling
  2015-11-23  0:58     ` Cyril Bur
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Neuling @ 2015-11-23 0:23 UTC (permalink / raw)
To: Cyril Bur, anton, linuxppc-dev

On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote:
> Test that the non volatile floating point and Altivec registers get
> correctly preserved across the fork() syscall.

Can we add a test for VSX too? I realise it's the same registers, but the
enable bits in the MSR are different so it's easy to get them wrong in the
kernel.

Additional comments below.

> fork() works nicely for this purpose, the registers should be the same for
> both parent and child
>
> Signed-off-by: Cyril Bur <cyrilbur@gmail.com>

[...]

> diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile
> new file mode 100644
> index 0000000..896d9e2
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/Makefile
> @@ -0,0 +1,14 @@
> +TEST_PROGS := fpu_syscall vmx_syscall

Add a new .gitignore in this dir for these new build objects.

[...]

> diff --git a/tools/testing/selftests/powerpc/math/basic_asm.h b/tools/testing/selftests/powerpc/math/basic_asm.h
> new file mode 100644
> index 0000000..27aca79
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/math/basic_asm.h

Can you put this up a directory since it's generically useful for powerpc?
> @@ -0,0 +1,26 @@ > +#include=20 > +#include=20 > + > +#define LOAD_REG_IMMEDIATE(reg,expr) \ > +> > lis> > reg,(expr)@highest;> > \ > +> > ori> > reg,reg,(expr)@higher;> > \ > +> > rldicr> > reg,reg,32,31;> > \ > +> > oris> > reg,reg,(expr)@high;> > \ > +> > ori> > reg,reg,(expr)@l; > + > +#define PUSH_BASIC_STACK(size) \ > +> > std> > 2,24(sp); \ > +> > mflr> > r0; \ > +> > std> > r0,16(sp); \ > +> > mfcr> > r0; \ > +> > stw> > r0,8(sp); \ > +> > stdu> > sp,-size(sp); > + > +#define POP_BASIC_STACK(size) \ > +> > addi> > sp,sp,size; \ > +> > ld> > 2,24(sp); \ > +> > ld> > r0,16(sp); \ > +> > mtlr> > r0; \ > +> > lwz> > r0,8(sp); \ > +> > mtcr> > r0; \ > + > diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testi= ng/selftests/powerpc/math/fpu_asm.S > new file mode 100644 > index 0000000..d5412c1 > --- /dev/null > +++ b/tools/testing/selftests/powerpc/math/fpu_asm.S > @@ -0,0 +1,151 @@ > +#include "basic_asm.h" > + > +#define PUSH_FPU(pos) \ > +> > stfd> > f14,pos(sp); \ > +> > stfd> > f15,pos+8(sp); \ > +> > stfd> > f16,pos+16(sp); \ > +> > stfd> > f17,pos+24(sp); \ > +> > stfd> > f18,pos+32(sp); \ > +> > stfd> > f19,pos+40(sp); \ > +> > stfd> > f20,pos+48(sp); \ > +> > stfd> > f21,pos+56(sp); \ > +> > stfd> > f22,pos+64(sp); \ > +> > stfd> > f23,pos+72(sp); \ > +> > stfd> > f24,pos+80(sp); \ > +> > stfd> > f25,pos+88(sp); \ > +> > stfd> > f26,pos+96(sp); \ > +> > stfd> > f27,pos+104(sp); \ > +> > stfd> > f28,pos+112(sp); \ > +> > stfd> > f29,pos+120(sp); \ > +> > stfd> > f30,pos+128(sp); \ > +> > stfd> > f31,pos+136(sp); > + > +#define POP_FPU(pos) \ > +> > lfd> > f14,pos(sp); \ > +> > lfd> > f15,pos+8(sp); \ > +> > lfd> > f16,pos+16(sp); \ > +> > lfd> > f17,pos+24(sp); \ > +> > lfd> > f18,pos+32(sp); \ > +> > lfd> > f19,pos+40(sp); \ > +> > lfd> > f20,pos+48(sp); \ > +> > lfd> > f21,pos+56(sp); \ > +> > lfd> > f22,pos+64(sp); \ > +> > lfd> > f23,pos+72(sp); \ > +> > lfd> > f24,pos+80(sp); \ > +> > lfd> > f25,pos+88(sp); \ > +> > lfd> > 
f26,pos+96(sp); \ > +> > lfd> > f27,pos+104(sp); \ > +> > lfd> > f28,pos+112(sp); \ > +> > lfd> > f29,pos+120(sp); \ > +> > lfd> > f30,pos+128(sp); \ > +> > lfd> > f31,pos+136(sp); > + > +#Careful calling this, it will 'clobber' fpu (by design) > +#Don't call this from C > +FUNC_START(load_fpu) > +> > lfd> > f14,0(r3) > +> > lfd> > f15,8(r3) > +> > lfd> > f16,16(r3) > +> > lfd> > f17,24(r3) > +> > lfd> > f18,32(r3) > +> > lfd> > f19,40(r3) > +> > lfd> > f20,48(r3) > +> > lfd> > f21,56(r3) > +> > lfd> > f22,64(r3) > +> > lfd> > f23,72(r3) > +> > lfd> > f24,80(r3) > +> > lfd> > f25,88(r3) > +> > lfd> > f26,96(r3) > +> > lfd> > f27,104(r3) > +> > lfd> > f28,112(r3) > +> > lfd> > f29,120(r3) > +> > lfd> > f30,128(r3) > +> > lfd> > f31,136(r3) > +> > blr > +FUNC_END(load_fpu) > + > +FUNC_START(check_fpu) > +> > mr r4,r3 > +> > li> > r3,1 #assume a bad result > +> > lfd> > f0,0(r4) > +> > fcmpu> > cr1,f0,f14 > +> > bne> > cr1,1f > +> > lfd> > f0,8(r4) > +> > fcmpu> > cr1,f0,f15 > +> > bne> > cr1,1f > +> > lfd> > f0,16(r4) > +> > fcmpu> > cr1,f0,f16 > +> > bne> > cr1,1f > +> > lfd> > f0,24(r4) > +> > fcmpu> > cr1,f0,f17 > +> > bne> > cr1,1f > +> > lfd> > f0,32(r4) > +> > fcmpu> > cr1,f0,f18 > +> > bne> > cr1,1f > +> > lfd> > f0,40(r4) > +> > fcmpu> > cr1,f0,f19 > +> > bne> > cr1,1f > +> > lfd> > f0,48(r4) > +> > fcmpu> > cr1,f0,f20 > +> > bne> > cr1,1f > +> > lfd> > f0,56(r4) > +> > fcmpu> > cr1,f0,f21 > +> > bne> > cr1,1f > +> > lfd> > f0,64(r4) > +> > fcmpu> > cr1,f0,f22 > +> > bne> > cr1,1f > +> > lfd> > f0,72(r4) > +> > fcmpu> > cr1,f0,f23 > +> > bne> > cr1,1f > +> > lfd> > f0,80(r4) > +> > fcmpu> > cr1,f0,f24 > +> > bne> > cr1,1f > +> > lfd> > f0,88(r4) > +> > fcmpu> > cr1,f0,f25 > +> > bne> > cr1,1f > +> > lfd> > f0,96(r4) > +> > fcmpu> > cr1,f0,f26 > +> > bne> > cr1,1f > +> > lfd> > f0,104(r4) > +> > fcmpu> > cr1,f0,f27 > +> > bne> > cr1,1f > +> > lfd> > f0,112(r4) > +> > fcmpu> > cr1,f0,f28 > +> > bne> > cr1,1f > +> > lfd> > f0,120(r4) > +> > fcmpu> > cr1,f0,f29 > 
+> > bne> > cr1,1f > +> > lfd> > f0,128(r4) > +> > fcmpu> > cr1,f0,f30 > +> > bne> > cr1,1f > +> > lfd> > f0,136(r4) > +> > fcmpu> > cr1,f0,f31 > +> > bne> > cr1,1f > +> > li> > r3,0 #Sucess!!! > +1:> > blr > + > +FUNC_START(test_fpu) > +> > #r3 holds pointer to where to put the result of fork #r4 seems to hold a ptr to the pid > + #f14-f31 are non volatiles > +> > PUSH_BASIC_STACK(256) > +> > std> > r3,40(sp) #Address of darray > +> > std r4,48(sp) #Address of pid > +> > PUSH_FPU(56) > + > +> > bl load_fpu > +> > nop > +> > li> > r0,__NR_fork > +> > sc > + > +> > #pass the result of the fork to the caller > +> > ld> > r9,48(sp) > +> > std> > r3,0(r9) > + > +> > ld r3,40(sp) > +> > bl check_fpu > +> > nop > + > +> > POP_FPU(56) > +> > POP_BASIC_STACK(256) > +> > blr > +FUNC_END(test_fpu) > diff --git a/tools/testing/selftests/powerpc/math/fpu_syscall.c b/tools/t= esting/selftests/powerpc/math/fpu_syscall.c > new file mode 100644 > index 0000000..a967fd6 > --- /dev/null > +++ b/tools/testing/selftests/powerpc/math/fpu_syscall.c > @@ -0,0 +1,79 @@ > +#include=20 > +#include=20 > +#include=20 > +#include=20 > +#include=20 > +#include=20 > +#include=20 > + > +#include "utils.h" > + > +extern int test_fpu(double *darray, pid_t *pid); > + > +double darray[] =3D {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, > +> > > 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, > +> > > 2.1}; > + > +int syscall_fpu(void) > +{ > +> > pid_t fork_pid; > +> > int i; > +> > int ret; > +> > int child_ret; > +> > for (i =3D 0; i < 1000; i++) { > +> > > /* test_fpu will fork() */ > +> > > ret =3D test_fpu(darray, &fork_pid); > +> > > if (fork_pid =3D=3D -1) > +> > > > return -1; > +> > > if (fork_pid =3D=3D 0) > +> > > > exit(ret); > +> > > waitpid(fork_pid, &child_ret, 0); > +> > > if (ret || child_ret) > +> > > > return 1; > +> > } > + > +> > return 0; > +} > + > +int test_syscall_fpu(void) > +{ > +> > /* > +> > * Setup an environment with much context switching > +> > */ > +> > pid_t 
pid2; > +> > pid_t pid =3D fork(); > +> > int ret; > +> > int child_ret; > +> > FAIL_IF(pid =3D=3D -1); > + > +> > pid2 =3D fork(); > +> > /* Can't FAIL_IF(pid2 =3D=3D -1); because already forked once */ > +> > if (pid2 =3D=3D -1) { > +> > > /* > +> > > * Couldn't fork, ensure test is a fail > +> > > */ > +> > > child_ret =3D ret =3D 1; > +> > } else { > +> > > ret =3D syscall_fpu(); > +> > > if (pid2) > +> > > > waitpid(pid2, &child_ret, 0); > +> > > else > +> > > > exit(ret); > +> > } > + > +> > ret |=3D child_ret; > + > +> > if (pid) > +> > > waitpid(pid, &child_ret, 0); > +> > else > +> > > exit(ret); > + > +> > FAIL_IF(ret || child_ret); > +> > return 0; > +} > + > +int main(int argc, char *argv[]) > +{ > +> > return test_harness(test_syscall_fpu, "syscall_fpu"); > + > +} > diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testi= ng/selftests/powerpc/math/vmx_asm.S > new file mode 100644 > index 0000000..e642e67 > --- /dev/null > +++ b/tools/testing/selftests/powerpc/math/vmx_asm.S > @@ -0,0 +1,183 @@ > +#include "basic_asm.h" > + > +#define PUSH_VMX(pos,reg) \ > +> > li> > reg,pos; \ > +> > stvx> > v20,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v21,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v22,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v23,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v24,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v25,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v26,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v27,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v28,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v29,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v30,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > stvx> > v31,reg,sp; > + > +#define POP_VMX(pos,reg) \ > +> > li> > reg,pos; \ > +> > lvx> > v20,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v21,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v22,reg,sp; \ > +> > 
addi> > reg,reg,16; \ > +> > lvx> > v23,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v24,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v25,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v26,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v27,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v28,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v29,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v30,reg,sp; \ > +> > addi> > reg,reg,16; \ > +> > lvx> > v31,reg,sp; > + > +#Carefull this will 'clobber' vmx (by design) > +#Don't call this from C > +FUNC_START(load_vmx) > +> > li> > r5,0 > +> > lvx> > v20,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v21,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v22,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v23,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v24,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v25,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v26,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v27,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v28,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v29,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v30,r5,r3 > +> > addi> > r5,r5,16 > +> > lvx> > v31,r5,r3 > +> > blr > +FUNC_END(load_vmx) > + > +#Should be safe from C, only touches r4, r5 and v0,v1,v2 > +FUNC_START(check_vmx) > +> > PUSH_BASIC_STACK(16) > +> > mr r4,r3 > +> > li> > r3,1 #assume a bad result > +> > li> > r5,0 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v20 > +> > vmr> > v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v21 > +> > vand> > v2,v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v22 > +> > vand> > v2,v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v23 > +> > vand> > v2,v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v24 > +> > vand> > v2,v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v25 > +> > vand> > v2,v2,v1 
> + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v26 > +> > vand> > v2,v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v27 > +> > vand> > v2,v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v28 > +> > vand> > v2,v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v29 > +> > vand> > v2,v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v30 > +> > vand> > v2,v2,v1 > + > +> > addi> > r5,r5,16 > +> > lvx> > v0,r5,r4 > +> > vcmpequd.> > v1,v0,v31 > +> > vand> > v2,v2,v1 > + > +> > li r5,0 > +> > stvx> > v2,r5,sp > +> > ldx> > r0,r5,sp > +> > cmpdi> > r0,0xffffffff > +> > bne> > 1f > +> > li> > r3,0 > +1:> > POP_BASIC_STACK(16) > +> > blr > +FUNC_END(check_vmx) > + > +#Safe from C > +FUNC_START(test_vmx) > +> > #r3 holds pointer to where to put the result of fork > +> > #v20-v31 are non-volatile > +> > PUSH_BASIC_STACK(512) > +> > std> > r3,40(sp) #Address of varray > +> > std r4,48(sp) #address of pid > +> > PUSH_VMX(56, r4) > + > +> > bl load_vmx > + > +> > li> > r0,__NR_fork > +> > sc > +> > #Pass the result of fork back to the caller > +> > ld> > r9,48(sp) > +> > std> > r3,0(r9) > + > +> > ld r3,40(sp) > +> > bl check_vmx > + > +> > POP_VMX(56,r4) > +> > POP_BASIC_STACK(512) > +> > blr > +FUNC_END(test_vmx) > diff --git a/tools/testing/selftests/powerpc/math/vmx_syscall.c b/tools/testing/selftests/powerpc/math/vmx_syscall.c > new file mode 100644 > index 0000000..7adff05 > --- /dev/null > +++ b/tools/testing/selftests/powerpc/math/vmx_syscall.c > @@ -0,0 +1,81 @@ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "utils.h" > + > +typedef int v4si __attribute__ ((vector_size (16))); > +v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12}, > +> > {13,14,15,16},{17,18,19,20},{21,22,23,24}, > +> > 
{25,26,27,28},{29,30,31,32},{33,34,35,36}, > +> > {37,38,39,40},{41,42,43,44},{45,46,47,48}}; > + > +extern int test_vmx(v4si *varray, pid_t *pid); > + > +int vmx_syscall(void) > +{ > +> > pid_t fork_pid; > +> > int i; > +> > int ret; > +> > int child_ret; > +> > for (i = 0; i < 1000; i++) { > +> > > /* test_vmx will fork() */ > +> > > ret = test_vmx(varray, &fork_pid); > +> > > if (fork_pid == -1) > +> > > > return -1; > +> > > if (fork_pid == 0) > +> > > > exit(ret); > +> > > waitpid(fork_pid, &child_ret, 0); > +> > > if (ret || child_ret) > +> > > > return 1; > +> > } > + > +> > return 0; > +} > + > +int test_vmx_syscall(void) > +{ > +> > /* > +> > * Setup an environment with much context switching > +> > */ > +> > pid_t pid2; > +> > pid_t pid = fork(); > +> > int ret; > +> > int child_ret; > +> > FAIL_IF(pid == -1); > + > +> > pid2 = fork(); > +> > ret = vmx_syscall(); > +> > /* Can't FAIL_IF(pid2 == -1); because we've already forked */ > +> > if (pid2 == -1) { > +> > > /* > +> > > * Couldn't fork, ensure child_ret is set and is a fail > +> > > */ > +> > > ret = child_ret = 1; > +> > } else { > +> > > if (pid2) > +> > > > waitpid(pid2, &child_ret, 0); > +> > > else > +> > > > exit(ret); > +> > } > + > +> > ret |= child_ret; > + > +> > if (pid) > +> > > waitpid(pid, &child_ret, 0); > +> > else > +> > > exit(ret); > + > +> > FAIL_IF(ret || child_ret); > +> > return 0; > +} > + > +int main(int argc, char *argv[]) > +{ > +> > return test_harness(test_vmx_syscall, "vmx_syscall"); > + > +} ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall 2015-11-23 0:23 ` Michael Neuling @ 2015-11-23 0:58 ` Cyril Bur 2015-11-23 1:06 ` Michael Neuling 0 siblings, 1 reply; 23+ messages in thread From: Cyril Bur @ 2015-11-23 0:58 UTC (permalink / raw) To: Michael Neuling; +Cc: anton, linuxppc-dev On Mon, 23 Nov 2015 11:23:13 +1100 Michael Neuling <mikey@neuling.org> wrote: > On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > > Test that the non volatile floating point and Altivec registers get > > correctly preserved across the fork() syscall. > > Can we add a test for VSX too? I realise it's the same registers, but > the enable bits in the MSR are different so it's easy to get them wrong > in the kernel. Yeah, I'm sure I could get that wrong haha. Hmmmm this got me thinking. Today we always enable FP and Altivec when we enable VSX but isn't there a world where we could actually run with FP and Altivec disabled and VSX on? In which case, is the whole thing volatile or does the kernel still need to save the subset of the matrix which corresponds to non-volatile FPs and non-volatile Altivec? > > Additional comments below. 
> > > fork() works nicely for this purpose, the registers should be the same for > > both parent and child > > > > Signed-off-by: Cyril Bur <cyrilbur@gmail.com> > > --- > > tools/testing/selftests/powerpc/Makefile | 3 +- > > tools/testing/selftests/powerpc/math/Makefile | 14 ++ > > tools/testing/selftests/powerpc/math/basic_asm.h | 26 +++ > > tools/testing/selftests/powerpc/math/fpu_asm.S | 151 +++++++++++++++++ > > tools/testing/selftests/powerpc/math/fpu_syscall.c | 79 +++++++++ > > tools/testing/selftests/powerpc/math/vmx_asm.S | 183 +++++++++++++++++++++ > > tools/testing/selftests/powerpc/math/vmx_syscall.c | 81 +++++++++ > > 7 files changed, 536 insertions(+), 1 deletion(-) > > create mode 100644 tools/testing/selftests/powerpc/math/Makefile > > create mode 100644 tools/testing/selftests/powerpc/math/basic_asm.h > > create mode 100644 tools/testing/selftests/powerpc/math/fpu_asm.S > > create mode 100644 tools/testing/selftests/powerpc/math/fpu_syscall.c > > create mode 100644 tools/testing/selftests/powerpc/math/vmx_asm.S > > create mode 100644 tools/testing/selftests/powerpc/math/vmx_syscall.c > > > > diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile > > index 0c2706b..19e8191 100644 > > --- a/tools/testing/selftests/powerpc/Makefile > > +++ b/tools/testing/selftests/powerpc/Makefile > > @@ -22,7 +22,8 @@ SUB_DIRS = benchmarks > > > \ > > > > switch_endian> > \ > > > > syscalls> > > \ > > > > tm> > > > \ > > -> > vphn > > +> > vphn \ > > +> > math > > > > endif > > > > diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile > > new file mode 100644 > > index 0000000..896d9e2 > > --- /dev/null > > +++ b/tools/testing/selftests/powerpc/math/Makefile > > @@ -0,0 +1,14 @@ > > +TEST_PROGS := fpu_syscall vmx_syscall > > > Add a new .gitignore in this dirfor these new build objects. 
> Yep > > + > > +all: $(TEST_PROGS) > > + > > +$(TEST_PROGS): ../harness.c > > +$(TEST_PROGS): CFLAGS += -O2 -g > > + > > +fpu_syscall: fpu_asm.S > > +vmx_syscall: vmx_asm.S > > + > > +include ../../lib.mk > > + > > +clean: > > +> > rm -f $(TEST_PROGS) *.o > > diff --git a/tools/testing/selftests/powerpc/math/basic_asm.h b/tools/testing/selftests/powerpc/math/basic_asm.h > > new file mode 100644 > > index 0000000..27aca79 > > --- /dev/null > > +++ b/tools/testing/selftests/powerpc/math/basic_asm.h > > Can you put this up a directory since it's generically useful for > powerpc? > Sure why not. > > @@ -0,0 +1,26 @@ > > +#include > > +#include > > + > > +#define LOAD_REG_IMMEDIATE(reg,expr) \ > > +> > lis> > reg,(expr)@highest;> > \ > > +> > ori> > reg,reg,(expr)@higher;> > \ > > +> > rldicr> > reg,reg,32,31;> > \ > > +> > oris> > reg,reg,(expr)@high;> > \ > > +> > ori> > reg,reg,(expr)@l; > > + > > +#define PUSH_BASIC_STACK(size) \ > > +> > std> > 2,24(sp); \ > > +> > mflr> > r0; \ > > +> > std> > r0,16(sp); \ > > +> > mfcr> > r0; \ > > +> > stw> > r0,8(sp); \ > > +> > stdu> > sp,-size(sp); > > + > > +#define POP_BASIC_STACK(size) \ > > +> > addi> > sp,sp,size; \ > > +> > ld> > 2,24(sp); \ > > +> > ld> > r0,16(sp); \ > > +> > mtlr> > r0; \ > > +> > lwz> > r0,8(sp); \ > > +> > mtcr> > r0; \ > > + > > diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testing/selftests/powerpc/math/fpu_asm.S > > new file mode 100644 > > index 0000000..d5412c1 > > --- /dev/null > > +++ b/tools/testing/selftests/powerpc/math/fpu_asm.S > > @@ -0,0 +1,151 @@ > > +#include "basic_asm.h" > > + > > +#define PUSH_FPU(pos) \ > > +> > stfd> > f14,pos(sp); \ > > +> > stfd> > f15,pos+8(sp); \ > > +> > stfd> > f16,pos+16(sp); \ > > +> > stfd> > f17,pos+24(sp); \ > > +> > stfd> > f18,pos+32(sp); \ > > +> > stfd> > f19,pos+40(sp); \ > > +> > stfd> > f20,pos+48(sp); \ > > +> > stfd> > f21,pos+56(sp); \ > > +> > stfd> > f22,pos+64(sp); \ > > +> > stfd> > f23,pos+72(sp); \ > > +> > 
stfd> > f24,pos+80(sp); \ > > +> > stfd> > f25,pos+88(sp); \ > > +> > stfd> > f26,pos+96(sp); \ > > +> > stfd> > f27,pos+104(sp); \ > > +> > stfd> > f28,pos+112(sp); \ > > +> > stfd> > f29,pos+120(sp); \ > > +> > stfd> > f30,pos+128(sp); \ > > +> > stfd> > f31,pos+136(sp); > > + > > +#define POP_FPU(pos) \ > > +> > lfd> > f14,pos(sp); \ > > +> > lfd> > f15,pos+8(sp); \ > > +> > lfd> > f16,pos+16(sp); \ > > +> > lfd> > f17,pos+24(sp); \ > > +> > lfd> > f18,pos+32(sp); \ > > +> > lfd> > f19,pos+40(sp); \ > > +> > lfd> > f20,pos+48(sp); \ > > +> > lfd> > f21,pos+56(sp); \ > > +> > lfd> > f22,pos+64(sp); \ > > +> > lfd> > f23,pos+72(sp); \ > > +> > lfd> > f24,pos+80(sp); \ > > +> > lfd> > f25,pos+88(sp); \ > > +> > lfd> > f26,pos+96(sp); \ > > +> > lfd> > f27,pos+104(sp); \ > > +> > lfd> > f28,pos+112(sp); \ > > +> > lfd> > f29,pos+120(sp); \ > > +> > lfd> > f30,pos+128(sp); \ > > +> > lfd> > f31,pos+136(sp); > > + > > +#Careful calling this, it will 'clobber' fpu (by design) > > +#Don't call this from C > > +FUNC_START(load_fpu) > > +> > lfd> > f14,0(r3) > > +> > lfd> > f15,8(r3) > > +> > lfd> > f16,16(r3) > > +> > lfd> > f17,24(r3) > > +> > lfd> > f18,32(r3) > > +> > lfd> > f19,40(r3) > > +> > lfd> > f20,48(r3) > > +> > lfd> > f21,56(r3) > > +> > lfd> > f22,64(r3) > > +> > lfd> > f23,72(r3) > > +> > lfd> > f24,80(r3) > > +> > lfd> > f25,88(r3) > > +> > lfd> > f26,96(r3) > > +> > lfd> > f27,104(r3) > > +> > lfd> > f28,112(r3) > > +> > lfd> > f29,120(r3) > > +> > lfd> > f30,128(r3) > > +> > lfd> > f31,136(r3) > > +> > blr > > +FUNC_END(load_fpu) > > + > > +FUNC_START(check_fpu) > > +> > mr r4,r3 > > +> > li> > r3,1 #assume a bad result > > +> > lfd> > f0,0(r4) > > +> > fcmpu> > cr1,f0,f14 > > +> > bne> > cr1,1f > > +> > lfd> > f0,8(r4) > > +> > fcmpu> > cr1,f0,f15 > > +> > bne> > cr1,1f > > +> > lfd> > f0,16(r4) > > +> > fcmpu> > cr1,f0,f16 > > +> > bne> > cr1,1f > > +> > lfd> > f0,24(r4) > > +> > fcmpu> > cr1,f0,f17 > > +> > bne> > cr1,1f > > +> > lfd> > f0,32(r4) > > 
+> > fcmpu> > cr1,f0,f18 > > +> > bne> > cr1,1f > > +> > lfd> > f0,40(r4) > > +> > fcmpu> > cr1,f0,f19 > > +> > bne> > cr1,1f > > +> > lfd> > f0,48(r4) > > +> > fcmpu> > cr1,f0,f20 > > +> > bne> > cr1,1f > > +> > lfd> > f0,56(r4) > > +> > fcmpu> > cr1,f0,f21 > > +> > bne> > cr1,1f > > +> > lfd> > f0,64(r4) > > +> > fcmpu> > cr1,f0,f22 > > +> > bne> > cr1,1f > > +> > lfd> > f0,72(r4) > > +> > fcmpu> > cr1,f0,f23 > > +> > bne> > cr1,1f > > +> > lfd> > f0,80(r4) > > +> > fcmpu> > cr1,f0,f24 > > +> > bne> > cr1,1f > > +> > lfd> > f0,88(r4) > > +> > fcmpu> > cr1,f0,f25 > > +> > bne> > cr1,1f > > +> > lfd> > f0,96(r4) > > +> > fcmpu> > cr1,f0,f26 > > +> > bne> > cr1,1f > > +> > lfd> > f0,104(r4) > > +> > fcmpu> > cr1,f0,f27 > > +> > bne> > cr1,1f > > +> > lfd> > f0,112(r4) > > +> > fcmpu> > cr1,f0,f28 > > +> > bne> > cr1,1f > > +> > lfd> > f0,120(r4) > > +> > fcmpu> > cr1,f0,f29 > > +> > bne> > cr1,1f > > +> > lfd> > f0,128(r4) > > +> > fcmpu> > cr1,f0,f30 > > +> > bne> > cr1,1f > > +> > lfd> > f0,136(r4) > > +> > fcmpu> > cr1,f0,f31 > > +> > bne> > cr1,1f > > +> > li> > r3,0 #Sucess!!! 
> > +1:> > blr > > + > > +FUNC_START(test_fpu) > > +> > #r3 holds pointer to where to put the result of fork > > #r4 seems to hold a ptr to the pid > Thanks > > + #f14-f31 are non volatiles > > +> > PUSH_BASIC_STACK(256) > > +> > std> > r3,40(sp) #Address of darray > > +> > std r4,48(sp) #Address of pid > > +> > PUSH_FPU(56) > > + > > +> > bl load_fpu > > +> > nop > > +> > li> > r0,__NR_fork > > +> > sc > > + > > +> > #pass the result of the fork to the caller > > +> > ld> > r9,48(sp) > > +> > std> > r3,0(r9) > > + > > +> > ld r3,40(sp) > > +> > bl check_fpu > > +> > nop > > + > > +> > POP_FPU(56) > > +> > POP_BASIC_STACK(256) > > +> > blr > > +FUNC_END(test_fpu) > > diff --git a/tools/testing/selftests/powerpc/math/fpu_syscall.c b/tools/testing/selftests/powerpc/math/fpu_syscall.c > > new file mode 100644 > > index 0000000..a967fd6 > > --- /dev/null > > +++ b/tools/testing/selftests/powerpc/math/fpu_syscall.c > > @@ -0,0 +1,79 @@ > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include "utils.h" > > + > > +extern int test_fpu(double *darray, pid_t *pid); > > + > > +double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, > > +> > > 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, > > +> > > 2.1}; > > + > > +int syscall_fpu(void) > > +{ > > +> > pid_t fork_pid; > > +> > int i; > > +> > int ret; > > +> > int child_ret; > > +> > for (i = 0; i < 1000; i++) { > > +> > > /* test_fpu will fork() */ > > +> > > ret = test_fpu(darray, &fork_pid); > > +> > > if (fork_pid == -1) > > +> > > > return -1; > > +> > > if (fork_pid == 0) > > +> > > > exit(ret); > > +> > > waitpid(fork_pid, &child_ret, 0); > > +> > > if (ret || child_ret) > > +> > > > return 1; > > +> > } > > + > > +> > return 0; > > +} > > + > > +int test_syscall_fpu(void) > > +{ > > +> > /* > > +> > * Setup an environment with much context switching > > +> > */ > > +> > pid_t pid2; > > +> > pid_t pid = fork(); > > +> > int ret; > > +> > int 
child_ret; > > +> > FAIL_IF(pid == -1); > > + > > +> > pid2 = fork(); > > +> > /* Can't FAIL_IF(pid2 == -1); because already forked once */ > > +> > if (pid2 == -1) { > > +> > > /* > > +> > > * Couldn't fork, ensure test is a fail > > +> > > */ > > +> > > child_ret = ret = 1; > > +> > } else { > > +> > > ret = syscall_fpu(); > > +> > > if (pid2) > > +> > > > waitpid(pid2, &child_ret, 0); > > +> > > else > > +> > > > exit(ret); > > +> > } > > + > > +> > ret |= child_ret; > > + > > +> > if (pid) > > +> > > waitpid(pid, &child_ret, 0); > > +> > else > > +> > > exit(ret); > > + > > +> > FAIL_IF(ret || child_ret); > > +> > return 0; > > +} > > + > > +int main(int argc, char *argv[]) > > +{ > > +> > return test_harness(test_syscall_fpu, "syscall_fpu"); > > + > > +} > > diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testing/selftests/powerpc/math/vmx_asm.S > > new file mode 100644 > > index 0000000..e642e67 > > --- /dev/null > > +++ b/tools/testing/selftests/powerpc/math/vmx_asm.S > > @@ -0,0 +1,183 @@ > > +#include "basic_asm.h" > > + > > +#define PUSH_VMX(pos,reg) \ > > +> > li> > reg,pos; \ > > +> > stvx> > v20,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v21,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v22,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v23,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v24,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v25,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v26,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v27,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v28,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v29,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v30,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > stvx> > v31,reg,sp; > > + > > +#define POP_VMX(pos,reg) \ > > +> > li> > reg,pos; \ > > +> > lvx> > v20,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v21,reg,sp; \ > > +> > 
addi> > reg,reg,16; \ > > +> > lvx> > v22,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v23,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v24,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v25,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v26,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v27,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v28,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v29,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v30,reg,sp; \ > > +> > addi> > reg,reg,16; \ > > +> > lvx> > v31,reg,sp; > > + > > +#Carefull this will 'clobber' vmx (by design) > > +#Don't call this from C > > +FUNC_START(load_vmx) > > +> > li> > r5,0 > > +> > lvx> > v20,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v21,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v22,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v23,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v24,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v25,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v26,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v27,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v28,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v29,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v30,r5,r3 > > +> > addi> > r5,r5,16 > > +> > lvx> > v31,r5,r3 > > +> > blr > > +FUNC_END(load_vmx) > > + > > +#Should be safe from C, only touches r4, r5 and v0,v1,v2 > > +FUNC_START(check_vmx) > > +> > PUSH_BASIC_STACK(16) > > +> > mr r4,r3 > > +> > li> > r3,1 #assume a bad result > > +> > li> > r5,0 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v20 > > +> > vmr> > v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v21 > > +> > vand> > v2,v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v22 > > +> > vand> > v2,v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v23 > > +> > vand> > v2,v2,v1 > > + 
> > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v24 > > +> > vand> > v2,v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v25 > > +> > vand> > v2,v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v26 > > +> > vand> > v2,v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v27 > > +> > vand> > v2,v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v28 > > +> > vand> > v2,v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v29 > > +> > vand> > v2,v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v30 > > +> > vand> > v2,v2,v1 > > + > > +> > addi> > r5,r5,16 > > +> > lvx> > v0,r5,r4 > > +> > vcmpequd.> > v1,v0,v31 > > +> > vand> > v2,v2,v1 > > + > > +> > li r5,0 > > +> > stvx> > v2,r5,sp > > +> > ldx> > r0,r5,sp > > +> > cmpdi> > r0,0xffffffff > > +> > bne> > 1f > > +> > li> > r3,0 > > +1:> > POP_BASIC_STACK(16) > > +> > blr > > +FUNC_END(check_vmx) > > + > > +#Safe from C > > +FUNC_START(test_vmx) > > +> > #r3 holds pointer to where to put the result of fork > > +> > #v20-v31 are non-volatile > > +> > PUSH_BASIC_STACK(512) > > +> > std> > r3,40(sp) #Address of varray > > +> > std r4,48(sp) #address of pid > > +> > PUSH_VMX(56, r4) > > + > > +> > bl load_vmx > > + > > +> > li> > r0,__NR_fork > > +> > sc > > +> > #Pass the result of fork back to the caller > > +> > ld> > r9,48(sp) > > +> > std> > r3,0(r9) > > + > > +> > ld r3,40(sp) > > +> > bl check_vmx > > + > > +> > POP_VMX(56,r4) > > +> > POP_BASIC_STACK(512) > > +> > blr > > +FUNC_END(test_vmx) > > diff --git a/tools/testing/selftests/powerpc/math/vmx_syscall.c b/tools/testing/selftests/powerpc/math/vmx_syscall.c > > new file mode 100644 > > index 0000000..7adff05 > > --- /dev/null > > +++ b/tools/testing/selftests/powerpc/math/vmx_syscall.c > > @@ 
-0,0 +1,81 @@ > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include "utils.h" > > + > > +typedef int v4si __attribute__ ((vector_size (16))); > > +v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12}, > > +> > {13,14,15,16},{17,18,19,20},{21,22,23,24}, > > +> > {25,26,27,28},{29,30,31,32},{33,34,35,36}, > > +> > {37,38,39,40},{41,42,43,44},{45,46,47,48}}; > > + > > +extern int test_vmx(v4si *varray, pid_t *pid); > > + > > +int vmx_syscall(void) > > +{ > > +> > pid_t fork_pid; > > +> > int i; > > +> > int ret; > > +> > int child_ret; > > +> > for (i = 0; i < 1000; i++) { > > +> > > /* test_vmx will fork() */ > > +> > > ret = test_vmx(varray, &fork_pid); > > +> > > if (fork_pid == -1) > > +> > > > return -1; > > +> > > if (fork_pid == 0) > > +> > > > exit(ret); > > +> > > waitpid(fork_pid, &child_ret, 0); > > +> > > if (ret || child_ret) > > +> > > > return 1; > > +> > } > > + > > +> > return 0; > > +} > > + > > +int test_vmx_syscall(void) > > +{ > > +> > /* > > +> > * Setup an environment with much context switching > > +> > */ > > +> > pid_t pid2; > > +> > pid_t pid = fork(); > > +> > int ret; > > +> > int child_ret; > > +> > FAIL_IF(pid == -1); > > + > > +> > pid2 = fork(); > > +> > ret = vmx_syscall(); > > +> > /* Can't FAIL_IF(pid2 == -1); because we've already forked */ > > +> > if (pid2 == -1) { > > +> > > /* > > +> > > * Couldn't fork, ensure child_ret is set and is a fail > > +> > > */ > > +> > > ret = child_ret = 1; > > +> > } else { > > +> > > if (pid2) > > +> > > > waitpid(pid2, &child_ret, 0); > > +> > > else > > +> > > > exit(ret); > > +> > } > > + > > +> > ret |= child_ret; > > + > > +> > if (pid) > > +> > > waitpid(pid, &child_ret, 0); > > +> > else > > +> > > exit(ret); > > + > > +> > FAIL_IF(ret || child_ret); > > +> > return 0; > > +} > > + > > +int main(int argc, char *argv[]) > > +{ > > +> > return test_harness(test_vmx_syscall, "vmx_syscall"); > > + > > +} ^ permalink raw 
reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall 2015-11-23 0:58 ` Cyril Bur @ 2015-11-23 1:06 ` Michael Neuling 0 siblings, 0 replies; 23+ messages in thread From: Michael Neuling @ 2015-11-23 1:06 UTC (permalink / raw) To: Cyril Bur; +Cc: anton, linuxppc-dev On Mon, 2015-11-23 at 11:58 +1100, Cyril Bur wrote: > On Mon, 23 Nov 2015 11:23:13 +1100 > Michael Neuling <mikey@neuling.org> wrote: > > > On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > > > Test that the non volatile floating point and Altivec registers > > > get > > > correctly preserved across the fork() syscall. > > > > Can we add a test for VSX too? I realise it's the same registers, > > but > > the enable bits in the MSR are different so it's easy to get them > > wrong > > in the kernel. > > Yeah, I'm sure I could get that wrong haha. > > Hmmmm this got me thinking. Today we always enable FP and Altivec > when we > enable VSX but isn't there a world where we could actually run with > FP and > Altivec disabled and VSX on? In which case, is the whole thing > volatile or > does the kernel still need to save the subset of the matrix which > corresponds > to non-volatile FPs and non-volatile Altivec? The hardware can run with FP and VMX off and VSX on but we should never do that in Linux. Mikey ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption 2015-11-18 3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur 2015-11-18 3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur @ 2015-11-18 3:26 ` Cyril Bur 2015-11-23 0:34 ` Michael Neuling 2015-11-18 3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur ` (6 subsequent siblings) 8 siblings, 1 reply; 23+ messages in thread From: Cyril Bur @ 2015-11-18 3:26 UTC (permalink / raw) To: mikey, anton, linuxppc-dev Loop in assembly checking the registers with many threads. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> --- tools/testing/selftests/powerpc/math/Makefile | 7 +- tools/testing/selftests/powerpc/math/fpu_asm.S | 34 ++++++++ tools/testing/selftests/powerpc/math/fpu_preempt.c | 92 ++++++++++++++++++++++ tools/testing/selftests/powerpc/math/vmx_asm.S | 44 ++++++++++- tools/testing/selftests/powerpc/math/vmx_preempt.c | 92 ++++++++++++++++++++++ 5 files changed, 263 insertions(+), 6 deletions(-) create mode 100644 tools/testing/selftests/powerpc/math/fpu_preempt.c create mode 100644 tools/testing/selftests/powerpc/math/vmx_preempt.c diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile index 896d9e2..9fa690f 100644 --- a/tools/testing/selftests/powerpc/math/Makefile +++ b/tools/testing/selftests/powerpc/math/Makefile @@ -1,12 +1,15 @@ -TEST_PROGS := fpu_syscall vmx_syscall +TEST_PROGS := fpu_syscall fpu_preempt vmx_syscall vmx_preempt all: $(TEST_PROGS) $(TEST_PROGS): ../harness.c -$(TEST_PROGS): CFLAGS += -O2 -g +$(TEST_PROGS): CFLAGS += -O2 -g -pthread fpu_syscall: fpu_asm.S +fpu_preempt: fpu_asm.S + vmx_syscall: vmx_asm.S +vmx_preempt: vmx_asm.S include ../../lib.mk diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testing/selftests/powerpc/math/fpu_asm.S index d5412c1..5ff0adc 100644 --- 
a/tools/testing/selftests/powerpc/math/fpu_asm.S +++ b/tools/testing/selftests/powerpc/math/fpu_asm.S @@ -149,3 +149,37 @@ FUNC_START(test_fpu) POP_BASIC_STACK(256) blr FUNC_END(test_fpu) + +#int preempt_fpu(double *darray, volatile int *not_ready, int *sentinal) +#On starting will (atomically) decrement not_ready as a signal that the FPU +#has been loaded with darray. Will proceed to check the validity of the FPU +#registers while sentinal is not zero. +FUNC_START(preempt_fpu) + PUSH_BASIC_STACK(256) + std r3,32(sp) #double *darray + std r4,40(sp) #volatile int *not_ready + std r5,48(sp) #int *sentinal + PUSH_FPU(56) + + bl load_fpu + + #Atomic DEC + ld r3,40(sp) +1: lwarx r4,0,r3 + addi r4,r4,-1 + stwcx. r4,0,r3 + bne- 1b + +2: ld r3, 32(sp) + bl check_fpu + cmpdi r3,0 + bne 3f + ld r4, 48(sp) + ld r5, 0(r4) + cmpwi r5,0 + bne 2b + +3: POP_FPU(56) + POP_BASIC_STACK(256) + blr +FUNC_END(preempt_fpu) diff --git a/tools/testing/selftests/powerpc/math/fpu_preempt.c b/tools/testing/selftests/powerpc/math/fpu_preempt.c new file mode 100644 index 0000000..e24cf9b --- /dev/null +++ b/tools/testing/selftests/powerpc/math/fpu_preempt.c @@ -0,0 +1,92 @@ +#include <stdio.h> +#include <unistd.h> +#include <sys/syscall.h> +#include <sys/time.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <stdlib.h> +#include <pthread.h> + +#include "utils.h" + +/* Time to wait for workers to get preempted (seconds) */ +#define PREEMPT_TIME 20 +/* + * Factor by which to multiply number of online CPUs for total number of + * worker threads + */ +#define THREAD_FACTOR 8 + + +__thread double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, + 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, + 2.1}; + +volatile int not_ready; +int running; + +extern void preempt_fpu(double *darray, volatile int *not_ready, int *sentinal); + +void *preempt_fpu_c(void *p) +{ + int i; + srand(pthread_self()); + for (i = 0; i < 21; i++) + darray[i] = rand(); + + /* Test failed if it ever 
returns */ + preempt_fpu(darray, ¬_ready, &running); + + return p; +} + +int test_preempt_fpu(void) +{ + int i, rc, threads; + pthread_t *tids; + + threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR; + tids = malloc((threads) * sizeof(pthread_t)); + FAIL_IF(!tids); + + running = true; + not_ready = threads; + for (i = 0; i < threads; i++) { + rc = pthread_create(&tids[i], NULL, preempt_fpu_c, NULL); + FAIL_IF(rc); + } + + setbuf(stdout, NULL); + /* Not really nessesary but nice to wait for every thread to start */ + printf("\tWaiting for all workers to start..."); + while(not_ready); + printf("done\n"); + + printf("\tWaiting for %d seconds to let some workers get preempted...", PREEMPT_TIME); + sleep(PREEMPT_TIME); + printf("done\n"); + + printf("\tKilling workers..."); + running = 0; + for (i = 0; i < threads; i++) { + void *rc_p; + pthread_join(tids[i], &rc_p); + + /* + * Harness will say the fail was here, look at why preempt_fpu + * returned + */ + if ((long) rc_p) + printf("oops\n"); + FAIL_IF((long) rc_p); + } + printf("done\n"); + + free(tids); + return 0; +} + +int main(int argc, char *argv[]) +{ + return test_harness(test_preempt_fpu, "fpu_preempt"); +} diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testing/selftests/powerpc/math/vmx_asm.S index e642e67..23db4b3 100644 --- a/tools/testing/selftests/powerpc/math/vmx_asm.S +++ b/tools/testing/selftests/powerpc/math/vmx_asm.S @@ -1,5 +1,6 @@ #include "basic_asm.h" +#POS MUST BE 16 ALIGNED! #define PUSH_VMX(pos,reg) \ li reg,pos; \ stvx v20,reg,sp; \ @@ -26,6 +27,7 @@ addi reg,reg,16; \ stvx v31,reg,sp; +#POS MUST BE 16 ALIGNED! 
#define POP_VMX(pos,reg) \ li reg,pos; \ lvx v20,reg,sp; \ @@ -84,7 +86,7 @@ FUNC_END(load_vmx) #Should be safe from C, only touches r4, r5 and v0,v1,v2 FUNC_START(check_vmx) - PUSH_BASIC_STACK(16) + PUSH_BASIC_STACK(32) mr r4,r3 li r3,1 #assume a bad result li r5,0 @@ -153,7 +155,7 @@ FUNC_START(check_vmx) cmpdi r0,0xffffffff bne 1f li r3,0 -1: POP_BASIC_STACK(16) +1: POP_BASIC_STACK(32) blr FUNC_END(check_vmx) @@ -164,7 +166,7 @@ FUNC_START(test_vmx) PUSH_BASIC_STACK(512) std r3,40(sp) #Address of varray std r4,48(sp) #address of pid - PUSH_VMX(56, r4) + PUSH_VMX(64, r4) bl load_vmx @@ -177,7 +179,41 @@ FUNC_START(test_vmx) ld r3,40(sp) bl check_vmx - POP_VMX(56,r4) + POP_VMX(64,r4) POP_BASIC_STACK(512) blr FUNC_END(test_vmx) + +#int preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal) +#On starting will (atomically) decrement not_ready as a signal that the FPU +#has been loaded with varray. Will proceed to check the validity of the FPU +#registers while sentinal is not zero. +FUNC_START(preempt_vmx) + PUSH_BASIC_STACK(512) + std r3,32(sp) #v4si *varray + std r4,40(sp) #volatile int *not_ready + std r5,48(sp) #int *sentinal + PUSH_VMX(64,r4) + + bl load_vmx + + #Atomic DEC + ld r3,40(sp) +1: lwarx r4,0,r3 + addi r4,r4,-1 + stwcx. 
r4,0,r3 + bne- 1b + +2: ld r3,32(sp) + bl check_vmx + cmpdi r3,0 + bne 3f + ld r4,48(sp) + ld r5,0(r4) + cmpwi r5,0 + bne 2b + +3: POP_VMX(64,r4) + POP_BASIC_STACK(512) + blr +FUNC_END(preempt_vmx) diff --git a/tools/testing/selftests/powerpc/math/vmx_preempt.c b/tools/testing/selftests/powerpc/math/vmx_preempt.c new file mode 100644 index 0000000..342db15 --- /dev/null +++ b/tools/testing/selftests/powerpc/math/vmx_preempt.c @@ -0,0 +1,92 @@ +#include <stdio.h> +#include <unistd.h> +#include <sys/syscall.h> +#include <sys/time.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <stdlib.h> +#include <pthread.h> + +#include "utils.h" + +/* Time to wait for workers to get preempted (seconds) */ +#define PREEMPT_TIME 20 +/* + * Factor by which to multiply number of online CPUs for total number of + * worker threads + */ +#define THREAD_FACTOR 8 + +typedef int v4si __attribute__ ((vector_size (16))); +__thread v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12}, + {13,14,15,16},{17,18,19,20},{21,22,23,24}, + {25,26,27,28},{29,30,31,32},{33,34,35,36}, + {37,38,39,40},{41,42,43,44},{45,46,47,48}}; + +volatile int not_ready; +int running; + +extern void preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal); + +void *preempt_vmx_c(void *p) +{ + int i, j; + srand(pthread_self()); + for (i = 0; i < 12; i++) + for (j = 0; j < 4; j++) + varray[i][j] = rand(); + + /* Test fails if it ever returns */ + preempt_vmx(varray, ¬_ready, &running); + return p; +} + +int test_preempt_vmx(void) +{ + int i, rc, threads; + pthread_t *tids; + + threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR; + tids = malloc(threads * sizeof(pthread_t)); + FAIL_IF(!tids); + + running = true; + not_ready = threads; + for (i = 0; i < threads; i++) { + rc = pthread_create(&tids[i], NULL, preempt_vmx_c, NULL); + FAIL_IF(rc); + } + + setbuf(stdout, NULL); + /* Not really nessesary but nice to wait for every thread to start */ + printf("\tWaiting for all workers to start..."); + 
while(not_ready); + printf("done\n"); + + printf("\tWaiting for %d seconds to let some workers get preempted...", PREEMPT_TIME); + sleep(PREEMPT_TIME); + printf("done\n"); + + printf("\tKilling workers..."); + running = 0; + for (i = 0; i < threads; i++) { + void *rc_p; + pthread_join(tids[i], &rc_p); + + /* + * Harness will say the fail was here, look at why preempt_vmx + * returned + */ + if ((long) rc_p) + printf("oops\n"); + FAIL_IF((long) rc_p); + } + printf("done\n"); + + return 0; +} + +int main(int argc, char *argv[]) +{ + return test_harness(test_preempt_vmx, "vmx_preempt"); +} -- 2.6.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption 2015-11-18 3:26 ` [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption Cyril Bur @ 2015-11-23 0:34 ` Michael Neuling 0 siblings, 0 replies; 23+ messages in thread From: Michael Neuling @ 2015-11-23 0:34 UTC (permalink / raw) To: Cyril Bur, anton, linuxppc-dev On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > Loop in assembly checking the registers with many threads. > > Signed-off-by: Cyril Bur <cyrilbur@gmail.com> > --- > tools/testing/selftests/powerpc/math/Makefile | 7 +- > tools/testing/selftests/powerpc/math/fpu_asm.S | 34 ++++++++ > tools/testing/selftests/powerpc/math/fpu_preempt.c | 92 ++++++++++++++++++++++ > tools/testing/selftests/powerpc/math/vmx_asm.S | 44 ++++++++++- > tools/testing/selftests/powerpc/math/vmx_preempt.c | 92 ++++++++++++++++++++++ > 5 files changed, 263 insertions(+), 6 deletions(-) > create mode 100644 tools/testing/selftests/powerpc/math/fpu_preempt.c > create mode 100644 tools/testing/selftests/powerpc/math/vmx_preempt.c > > diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile > index 896d9e2..9fa690f 100644 > --- a/tools/testing/selftests/powerpc/math/Makefile > +++ b/tools/testing/selftests/powerpc/math/Makefile > @@ -1,12 +1,15 @@ > -TEST_PROGS := fpu_syscall vmx_syscall > +TEST_PROGS := fpu_syscall fpu_preempt vmx_syscall vmx_preempt .gitignore for this new build object > all: $(TEST_PROGS) > > $(TEST_PROGS): ../harness.c > -$(TEST_PROGS): CFLAGS += -O2 -g > +$(TEST_PROGS): CFLAGS += -O2 -g -pthread > > fpu_syscall: fpu_asm.S > +fpu_preempt: fpu_asm.S > + > vmx_syscall: vmx_asm.S > +vmx_preempt: vmx_asm.S > > include ../../lib.mk > > diff --git a/tools/testing/selftests/powerpc/math/fpu_asm.S b/tools/testing/selftests/powerpc/math/fpu_asm.S > index d5412c1..5ff0adc 100644 > ---
a/tools/testing/selftests/powerpc/math/fpu_asm.S > +++ b/tools/testing/selftests/powerpc/math/fpu_asm.S > @@ -149,3 +149,37 @@ FUNC_START(test_fpu) > > > POP_BASIC_STACK(256) > > > blr > FUNC_END(test_fpu) > + > +#int preempt_fpu(double *darray, volatile int *not_ready, int *sentinal) > +#On starting will (atomically) decrement not_ready as a signal that the FPU > +#has been loaded with darray. Will proceed to check the validity of the FPU > +#registers while sentinal is not zero. > +FUNC_START(preempt_fpu) > +> > PUSH_BASIC_STACK(256) > +> > std r3,32(sp) #double *darray > +> > std r4,40(sp) #volatile int *not_ready > +> > std r5,48(sp) #int *sentinal > +> > PUSH_FPU(56) > + > +> > bl load_fpu > + Memory barrier here. > +> > #Atomic DEC > +> > ld r3,40(sp) > +1:> > lwarx r4,0,r3 > +> > addi r4,r4,-1 > +> > stwcx. r4,0,r3 > +> > bne- 1b > + > +2:> > ld r3, 32(sp) > +> > bl check_fpu > +> > cmpdi r3,0 > +> > bne 3f > +> > ld r4, 48(sp) > +> > ld r5, 0(r4) > +> > cmpwi r5,0 > +> > bne 2b > + > +3:> > POP_FPU(56) > +> > POP_BASIC_STACK(256) > +> > blr > +FUNC_END(preempt_fpu) > diff --git a/tools/testing/selftests/powerpc/math/fpu_preempt.c b/tools/testing/selftests/powerpc/math/fpu_preempt.c > new file mode 100644 > index 0000000..e24cf9b > --- /dev/null > +++ b/tools/testing/selftests/powerpc/math/fpu_preempt.c Needs a copyright and a description of the test here. Same in patch 1.
> @@ -0,0 +1,92 @@ > +#include <stdio.h> > +#include <unistd.h> > +#include <sys/syscall.h> > +#include <sys/time.h> > +#include <sys/types.h> > +#include <sys/wait.h> > +#include <stdlib.h> > +#include <pthread.h> > + > +#include "utils.h" > + > +/* Time to wait for workers to get preempted (seconds) */ > +#define PREEMPT_TIME 20 > +/* > + * Factor by which to multiply number of online CPUs for total number of > + * worker threads > + */ > +#define THREAD_FACTOR 8 > + > + > +__thread double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, > +> > > 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, > +> > > 2.1}; > + > +volatile int not_ready; > +int running; > + > +extern void preempt_fpu(double *darray, volatile int *not_ready, int *sentinal); > + > +void *preempt_fpu_c(void *p) > +{ > +> > int i; > +> > srand(pthread_self()); > +> > for (i = 0; i < 21; i++) > +> > > darray[i] = rand(); > + > +> > /* Test failed if it ever returns */ > +> > preempt_fpu(darray, &not_ready, &running); > + > +> > return p; > +} > + > +int test_preempt_fpu(void) > +{ > +> > int i, rc, threads; > +> > pthread_t *tids; > + > +> > threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR; > +> > tids = malloc((threads) * sizeof(pthread_t)); > +> > FAIL_IF(!tids); > + > +> > running = true; > +> > not_ready = threads; > +> > for (i = 0; i < threads; i++) { > +> > > rc = pthread_create(&tids[i], NULL, preempt_fpu_c, NULL); > +> > > FAIL_IF(rc); > +> > } > + > +> > setbuf(stdout, NULL); > +> > /* Not really nessesary but nice to wait for every thread to start */ > +> > printf("\tWaiting for all workers to start..."); > +> > while(not_ready); You need a memory barrier here and a matching one in the asm which is decrementing it. Same in the VSX test.
> +> > printf("done\n"); > + > +> > printf("\tWaiting for %d seconds to let some workers get preempted...", PREEMPT_TIME); > +> > sleep(PREEMPT_TIME); > +> > printf("done\n"); > + > +> > printf("\tKilling workers..."); > +> > running = 0; > +> > for (i = 0; i < threads; i++) { > +> > > void *rc_p; > +> > > pthread_join(tids[i], &rc_p); > + > +> > > /* > +> > > * Harness will say the fail was here, look at why preempt_fpu > +> > > * returned > +> > > */ > +> > > if ((long) rc_p) > +> > > > printf("oops\n"); > +> > > FAIL_IF((long) rc_p); > +> > } > +> > printf("done\n"); > + > +> > free(tids); > +> > return 0; > +} > + > +int main(int argc, char *argv[]) > +{ > +> > return test_harness(test_preempt_fpu, "fpu_preempt"); > +} > diff --git a/tools/testing/selftests/powerpc/math/vmx_asm.S b/tools/testing/selftests/powerpc/math/vmx_asm.S > index e642e67..23db4b3 100644 > --- a/tools/testing/selftests/powerpc/math/vmx_asm.S > +++ b/tools/testing/selftests/powerpc/math/vmx_asm.S > @@ -1,5 +1,6 @@ > #include "basic_asm.h" > > +#POS MUST BE 16 ALIGNED! > #define PUSH_VMX(pos,reg) \ > > > li> > reg,pos; \ > > > stvx> > v20,reg,sp; \ > @@ -26,6 +27,7 @@ > > > addi> > reg,reg,16; \ > > > stvx> > v31,reg,sp; > > +#POS MUST BE 16 ALIGNED!
> #define POP_VMX(pos,reg) \ > > > li> > reg,pos; \ > > > lvx> > v20,reg,sp; \ > @@ -84,7 +86,7 @@ FUNC_END(load_vmx) > > #Should be safe from C, only touches r4, r5 and v0,v1,v2 > FUNC_START(check_vmx) > -> > PUSH_BASIC_STACK(16) > +> > PUSH_BASIC_STACK(32) > > > mr r4,r3 > > > li> > r3,1 #assume a bad result > > > li> > r5,0 > @@ -153,7 +155,7 @@ FUNC_START(check_vmx) > > > cmpdi> > r0,0xffffffff > > > bne> > 1f > > > li> > r3,0 > -1:> > POP_BASIC_STACK(16) > +1:> > POP_BASIC_STACK(32) > > > blr > FUNC_END(check_vmx) > > @@ -164,7 +166,7 @@ FUNC_START(test_vmx) > > > PUSH_BASIC_STACK(512) > > > std> > r3,40(sp) #Address of varray > > > std r4,48(sp) #address of pid > -> > PUSH_VMX(56, r4) > +> > PUSH_VMX(64, r4) > > > > bl load_vmx > > @@ -177,7 +179,41 @@ FUNC_START(test_vmx) > > > ld r3,40(sp) > > > bl check_vmx > > -> > POP_VMX(56,r4) > +> > POP_VMX(64,r4) > > > POP_BASIC_STACK(512) > > > blr > FUNC_END(test_vmx) > + > +#int preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal) > +#On starting will (atomically) decrement not_ready as a signal that the FPU > +#has been loaded with varray. Will proceed to check the validity of the FPU > +#registers while sentinal is not zero. > +FUNC_START(preempt_vmx) > +> > PUSH_BASIC_STACK(512) > +> > std r3,32(sp) #v4si *varray > +> > std r4,40(sp) #volatile int *not_ready > +> > std r5,48(sp) #int *sentinal > +> > PUSH_VMX(64,r4) > + > +> > bl load_vmx > + Memory barrier here > +> > #Atomic DEC > +> > ld r3,40(sp) > +1:> > lwarx r4,0,r3 > +> > addi r4,r4,-1 > +> > stwcx.
r4,0,r3 > +> > bne- 1b > + > +2:> > ld r3,32(sp) > +> > bl check_vmx > +> > cmpdi r3,0 > +> > bne 3f > +> > ld r4,48(sp) > +> > ld r5,0(r4) > +> > cmpwi r5,0 > +> > bne 2b > + > +3:> > POP_VMX(64,r4) > +> > POP_BASIC_STACK(512) > +> > blr > +FUNC_END(preempt_vmx) > diff --git a/tools/testing/selftests/powerpc/math/vmx_preempt.c b/tools/testing/selftests/powerpc/math/vmx_preempt.c > new file mode 100644 > index 0000000..342db15 > --- /dev/null > +++ b/tools/testing/selftests/powerpc/math/vmx_preempt.c Needs a copyright and a description of the test here. > @@ -0,0 +1,92 @@ > +#include <stdio.h> > +#include <unistd.h> > +#include <sys/syscall.h> > +#include <sys/time.h> > +#include <sys/types.h> > +#include <sys/wait.h> > +#include <stdlib.h> > +#include <pthread.h> > + > +#include "utils.h" > + > +/* Time to wait for workers to get preempted (seconds) */ > +#define PREEMPT_TIME 20 > +/* > + * Factor by which to multiply number of online CPUs for total number of > + * worker threads > + */ > +#define THREAD_FACTOR 8 > + > +typedef int v4si __attribute__ ((vector_size (16))); > +__thread v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12}, > +> > {13,14,15,16},{17,18,19,20},{21,22,23,24}, > +> > {25,26,27,28},{29,30,31,32},{33,34,35,36}, > +> > {37,38,39,40},{41,42,43,44},{45,46,47,48}}; > + > +volatile int not_ready; I really hate this name. How about just "ready" and you negate the code.
> +int running; Now about not_running ;-P > + > +extern void preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal); > + > +void *preempt_vmx_c(void *p) > +{ > +> > int i, j; > +> > srand(pthread_self()); > +> > for (i = 0; i < 12; i++) > +> > > for (j = 0; j < 4; j++) > +> > > > varray[i][j] = rand(); > + > +> > /* Test fails if it ever returns */ > +> > preempt_vmx(varray, &not_ready, &running); > +> > return p; > +} > + > +int test_preempt_vmx(void) > +{ > +> > int i, rc, threads; > +> > pthread_t *tids; > + > +> > threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR; > +> > tids = malloc(threads * sizeof(pthread_t)); > +> > FAIL_IF(!tids); > + > +> > running = true; > +> > not_ready = threads; > +> > for (i = 0; i < threads; i++) { > +> > > rc = pthread_create(&tids[i], NULL, preempt_vmx_c, NULL); > +> > > FAIL_IF(rc); > +> > } > + > +> > setbuf(stdout, NULL); > +> > /* Not really nessesary but nice to wait for every thread to start */ > +> > printf("\tWaiting for all workers to start..."); > +> > while(not_ready); Again, a memory barrier here. > +> > printf("done\n"); > + > +> > printf("\tWaiting for %d seconds to let some workers get preempted...", PREEMPT_TIME); > +> > sleep(PREEMPT_TIME); > +> > printf("done\n"); > + > +> > printf("\tKilling workers..."); > +> > running = 0; > +> > for (i = 0; i < threads; i++) { > +> > > void *rc_p; > +> > > pthread_join(tids[i], &rc_p); > + > +> > > /* > +> > > * Harness will say the fail was here, look at why preempt_vmx > +> > > * returned > +> > > */ > +> > > if ((long) rc_p) > +> > > > printf("oops\n"); > +> > > FAIL_IF((long) rc_p); > +> > } > +> > printf("done\n"); > + > +> > return 0; > +} > + > +int main(int argc, char *argv[]) > +{ > +> > return test_harness(test_preempt_vmx, "vmx_preempt"); > +} ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext 2015-11-18 3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur 2015-11-18 3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur 2015-11-18 3:26 ` [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption Cyril Bur @ 2015-11-18 3:26 ` Cyril Bur 2015-11-19 11:36 ` [3/8] " Michael Ellerman 2015-11-23 1:04 ` [PATCH 3/8] " Michael Neuling 2015-11-18 3:26 ` [PATCH 4/8] powerpc: Explicitly disable math features when copying thread Cyril Bur ` (5 subsequent siblings) 8 siblings, 2 replies; 23+ messages in thread From: Cyril Bur @ 2015-11-18 3:26 UTC (permalink / raw) To: mikey, anton, linuxppc-dev Load up the non volatile FPU and VMX regs and ensure that they are the expected value in a signal handler Signed-off-by: Cyril Bur <cyrilbur@gmail.com> --- tools/testing/selftests/powerpc/math/Makefile | 4 +- tools/testing/selftests/powerpc/math/fpu_signal.c | 119 +++++++++++++++++++++ tools/testing/selftests/powerpc/math/vmx_signal.c | 124 ++++++++++++++++++++++ 3 files changed, 246 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/math/fpu_signal.c create mode 100644 tools/testing/selftests/powerpc/math/vmx_signal.c diff --git a/tools/testing/selftests/powerpc/math/Makefile b/tools/testing/selftests/powerpc/math/Makefile index 9fa690f..5ce000bf 100644 --- a/tools/testing/selftests/powerpc/math/Makefile +++ b/tools/testing/selftests/powerpc/math/Makefile @@ -1,4 +1,4 @@ -TEST_PROGS := fpu_syscall fpu_preempt vmx_syscall vmx_preempt +TEST_PROGS := fpu_syscall fpu_preempt fpu_signal vmx_syscall vmx_preempt vmx_signal all: $(TEST_PROGS) @@ -7,9 +7,11 @@ $(TEST_PROGS): CFLAGS += -O2 -g -pthread fpu_syscall: fpu_asm.S fpu_preempt: fpu_asm.S +fpu_signal: fpu_asm.S vmx_syscall: vmx_asm.S vmx_preempt: vmx_asm.S +vmx_signal: vmx_asm.S include ../../lib.mk diff --git 
a/tools/testing/selftests/powerpc/math/fpu_signal.c b/tools/testing/selftests/powerpc/math/fpu_signal.c new file mode 100644 index 0000000..ca61d1f --- /dev/null +++ b/tools/testing/selftests/powerpc/math/fpu_signal.c @@ -0,0 +1,119 @@ +#include <stdio.h> +#include <unistd.h> +#include <sys/syscall.h> +#include <sys/time.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <stdlib.h> +#include <pthread.h> + +#include "utils.h" + +/* Number of times each thread should recieve the signal */ +#define ITERATIONS 10 +/* + * Factor by which to multiply number of online CPUs for total number of + * worker threads + */ +#define THREAD_FACTOR 8 + +__thread double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, + 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, + 2.1}; + +bool bad_context; +int running; +volatile int not_ready; +extern long preempt_fpu(double *darray, volatile int *not_ready, int *sentinal); + +void signal_fpu_sig(int sig, siginfo_t *info, void *context) +{ + int i; + ucontext_t *uc = context; + mcontext_t *mc = &uc->uc_mcontext; + + /* Only the non volatiles were loaded up */ + for (i = 14; i < 32; i++) { + if (mc->fp_regs[i] != darray[i - 14]) { + bad_context = true; + break; + } + } +} + +void *signal_fpu_c(void *p) +{ + int i; + long rc; + struct sigaction act; + act.sa_sigaction = signal_fpu_sig; + act.sa_flags = SA_SIGINFO; + rc = sigaction(SIGUSR1, &act, NULL); + if (rc) + return p; + + srand(pthread_self()); + for (i = 0; i < 21; i++) + darray[i] = rand(); + + rc = preempt_fpu(darray, &not_ready, &running); + + return (void *) rc; +} + +int test_signal_fpu(void) +{ + int i, j, rc, threads; + void *rc_p; + pthread_t *tids; + + threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR; + tids = malloc(threads * sizeof(pthread_t)); + FAIL_IF(!tids); + + running = true; + not_ready = threads; + for (i = 0; i < threads; i++) { + rc = pthread_create(&tids[i], NULL, signal_fpu_c, NULL); + FAIL_IF(rc); + } + + setbuf(stdout, NULL); + 
printf("\tWaiting for all workers to start..."); + while (not_ready); + printf("done\n"); + + printf("\tSending signals to all threads %d times...", ITERATIONS); + for (i = 0; i < ITERATIONS; i++) { + for (j = 0; j < threads; j++) { + pthread_kill(tids[j], SIGUSR1); + } + sleep(1); + } + printf("done\n"); + + printf("\tKilling workers..."); + running = 0; + for (i = 0; i < threads; i++) { + pthread_join(tids[i], &rc_p); + + /* + * Harness will say the fail was here, look at why signal_fpu + * returned + */ + if ((long) rc_p || bad_context) + printf("oops\n"); + if (bad_context) + fprintf(stderr, "\t!! bad_context is true\n"); + FAIL_IF((long) rc_p || bad_context); + } + printf("done\n"); + + free(tids); + return 0; +} + +int main(int argc, char *argv[]) +{ + return test_harness(test_signal_fpu, "fpu_signal"); +} diff --git a/tools/testing/selftests/powerpc/math/vmx_signal.c b/tools/testing/selftests/powerpc/math/vmx_signal.c new file mode 100644 index 0000000..007ac9e --- /dev/null +++ b/tools/testing/selftests/powerpc/math/vmx_signal.c @@ -0,0 +1,124 @@ +#include <stdio.h> +#include <unistd.h> +#include <sys/syscall.h> +#include <sys/time.h> +#include <sys/types.h> +#include <sys/wait.h> +#include <stdlib.h> +#include <string.h> +#include <pthread.h> + +#include "utils.h" + +/* Number of times each thread should recieve the signal */ +#define ITERATIONS 10 +/* + * Factor by which to multiply number of online CPUs for total number of + * worker threads + */ +#define THREAD_FACTOR 8 + +typedef int v4si __attribute__ ((vector_size (16))); +__thread v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12}, + {13,14,15,16},{17,18,19,20},{21,22,23,24}, + {25,26,27,28},{29,30,31,32},{33,34,35,36}, + {37,38,39,40},{41,42,43,44},{45,46,47,48}}; + +bool bad_context; +int running; +volatile int not_ready; + +extern int preempt_vmx(v4si *varray, volatile int *not_ready, int *sentinal); + +void signal_vmx_sig(int sig, siginfo_t *info, void *context) +{ + int i; + ucontext_t 
*uc = context; + mcontext_t *mc = &uc->uc_mcontext; + + /* Only the non volatiles were loaded up */ + for (i = 20; i < 32; i++) { + if (memcmp(mc->v_regs->vrregs[i], &varray[i - 20], 16)) { + bad_context = true; + break; + } + } +} + +void *signal_vmx_c(void *p) +{ + int i, j; + long rc; + struct sigaction act; + act.sa_sigaction = signal_vmx_sig; + act.sa_flags = SA_SIGINFO; + rc = sigaction(SIGUSR1, &act, NULL); + if (rc) + return p; + + srand(pthread_self()); + for (i = 0; i < 12; i++) + for (j = 0; j < 4; j++) + varray[i][j] = rand(); + + rc = preempt_vmx(varray, &not_ready, &running); + + return (void *) rc; +} + +int test_signal_vmx(void) +{ + int i, j, rc, threads; + void *rc_p; + pthread_t *tids; + + threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR; + tids = malloc(threads * sizeof(pthread_t)); + FAIL_IF(!tids); + + running = true; + not_ready = threads; + for (i = 0; i < threads; i++) { + rc = pthread_create(&tids[i], NULL, signal_vmx_c, NULL); + FAIL_IF(rc); + } + + setbuf(stdout, NULL); + printf("\tWaiting for all workers to start..."); + while (not_ready); + printf("done\n"); + + printf("\tSending signals to all threads %d times...", ITERATIONS); + for (i = 0; i < ITERATIONS; i++) { + for (j = 0; j < threads; j++) { + pthread_kill(tids[j], SIGUSR1); + } + sleep(1); + } + printf("done\n"); + + printf("\tKilling workers..."); + running = 0; + for (i = 0; i < threads; i++) { + pthread_join(tids[i], &rc_p); + + /* + * Harness will say the fail was here, look at why signal_vmx + * returned + */ + if ((long) rc_p || bad_context) + printf("oops\n"); + if (bad_context) + fprintf(stderr, "\t!! bad_context is true\n"); + FAIL_IF((long) rc_p || bad_context); + } + printf("done\n"); + + free(tids); + return 0; +} + +int main(int argc, char *argv[]) +{ + return test_harness(test_signal_vmx, "vmx_signal"); +} -- 2.6.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext 2015-11-18 3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur @ 2015-11-19 11:36 ` Michael Ellerman 2015-11-23 1:04 ` [PATCH 3/8] " Michael Neuling 1 sibling, 0 replies; 23+ messages in thread From: Michael Ellerman @ 2015-11-19 11:36 UTC (permalink / raw) To: Cyril Bur, mikey, anton, linuxppc-dev [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain, Size: 1975 bytes --] On Wed, 2015-18-11 at 03:26:50 UTC, Cyril Bur wrote: > Load up the non volatile FPU and VMX regs and ensure that they are the > expected value in a signal handler This is giving me: $ /usr/bin/powerpc-linux-gnu-gcc -Wall -O2 -flto -Wall -Werror -DGIT_VERSION='"v4.0-rc1-55000-g906d582"' -I/home/buildbot/buildbot/slave/selftests-ppc64-gcc-ubuntu-be/build/tools/testing/selftests/powerpc -O2 -g -pthread fpu_signal.c ../harness.c fpu_asm.S -o fpu_signal fpu_signal.c: In function ‘signal_fpu_sig’: fpu_signal.c:33:19: error: initialization from incompatible pointer type [-Werror] mcontext_t *mc = &uc->uc_mcontext; ^ fpu_signal.c:37:9: error: ‘mcontext_t’ has no member named ‘fp_regs’ if (mc->fp_regs[i] != darray[i - 14]) { ^ As well as: $ /usr/bin/powerpc-linux-gnu-gcc -Wall -O2 -flto -Wall -Werror -DGIT_VERSION='"v4.0-rc1-55000-g906d582"' -I/home/buildbot/buildbot/slave/selftests-ppc64-gcc-ubuntu-be/build/tools/testing/selftests/powerpc -O2 -g -pthread vmx_syscall.c ../harness.c vmx_asm.S -o vmx_syscall vmx_asm.S: Assembler messages: vmx_asm.S:61: Error: unsupported relocation against v20 vmx_asm.S:63: Error: unsupported relocation against v21 Times infinity. And same for vmx_preempt. 
And: $ /usr/bin/powerpc-linux-gnu-gcc -Wall -O2 -flto -Wall -Werror -DGIT_VERSION='"v4.0-rc1-55000-g906d582"' -I/home/buildbot/buildbot/slave/selftests-ppc64-gcc-ubuntu-be/build/tools/testing/selftests/powerpc -O2 -g -pthread vmx_signal.c ../harness.c vmx_asm.S -o vmx_signal make[1]: *** [vmx_preempt] Error 1 vmx_signal.c: In function ‘signal_vmx_sig’: vmx_signal.c:37:19: error: initialization from incompatible pointer type [-Werror] mcontext_t *mc = &uc->uc_mcontext; ^ vmx_signal.c:41:16: error: ‘mcontext_t’ has no member named ‘v_regs’ if (memcmp(mc->v_regs->vrregs[i], &varray[i - 20], 16)) { ^ cheers ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext 2015-11-18 3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur 2015-11-19 11:36 ` [3/8] " Michael Ellerman @ 2015-11-23 1:04 ` Michael Neuling 1 sibling, 0 replies; 23+ messages in thread From: Michael Neuling @ 2015-11-23 1:04 UTC (permalink / raw) To: Cyril Bur, anton, linuxppc-dev On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > Load up the non volatile FPU and VMX regs and ensure that they are > the > expected value in a signal handler > > Signed-off-by: Cyril Bur <cyrilbur@gmail.com> > --- > tools/testing/selftests/powerpc/math/Makefile | 4 +- > tools/testing/selftests/powerpc/math/fpu_signal.c | 119 > +++++++++++++++++++++ > tools/testing/selftests/powerpc/math/vmx_signal.c | 124 > ++++++++++++++++++++++ > 3 files changed, 246 insertions(+), 1 deletion(-) > create mode 100644 tools/testing/selftests/powerpc/math/fpu_signal.c > create mode 100644 tools/testing/selftests/powerpc/math/vmx_signal.c > > diff --git a/tools/testing/selftests/powerpc/math/Makefile > b/tools/testing/selftests/powerpc/math/Makefile > index 9fa690f..5ce000bf 100644 > --- a/tools/testing/selftests/powerpc/math/Makefile > +++ b/tools/testing/selftests/powerpc/math/Makefile > @@ -1,4 +1,4 @@ > -TEST_PROGS := fpu_syscall fpu_preempt vmx_syscall vmx_preempt > +TEST_PROGS := fpu_syscall fpu_preempt fpu_signal vmx_syscall > vmx_preempt vmx_signal .gitignore > all: $(TEST_PROGS) > > @@ -7,9 +7,11 @@ $(TEST_PROGS): CFLAGS += -O2 -g -pthread > > fpu_syscall: fpu_asm.S > fpu_preempt: fpu_asm.S > +fpu_signal: fpu_asm.S > > vmx_syscall: vmx_asm.S > vmx_preempt: vmx_asm.S > +vmx_signal: vmx_asm.S > > include ../../lib.mk > > diff --git a/tools/testing/selftests/powerpc/math/fpu_signal.c > b/tools/testing/selftests/powerpc/math/fpu_signal.c > new file mode 100644 > index 0000000..ca61d1f > --- /dev/null > +++
b/tools/testing/selftests/powerpc/math/fpu_signal.c > @@ -0,0 +1,119 @@ copyright + detailed description of test here. > +#include <stdio.h> > +#include <unistd.h> > +#include <sys/syscall.h> > +#include <sys/time.h> > +#include <sys/types.h> > +#include <sys/wait.h> > +#include <stdlib.h> > +#include <pthread.h> > + > +#include "utils.h" > + > +/* Number of times each thread should recieve the signal */ receive > +#define ITERATIONS 10 > +/* > + * Factor by which to multiply number of online CPUs for total > number of > + * worker threads > + */ > +#define THREAD_FACTOR 8 > + > +__thread double darray[] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, > 0.9, 1.0, > + 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, > 2.0, > + 2.1}; > + > +bool bad_context; > +int running; > +volatile int not_ready; Don't use volatile... Read Documentation/volatile-considered-harmful.txt > +extern long preempt_fpu(double *darray, volatile int *not_ready, int > *sentinal); > + > +void signal_fpu_sig(int sig, siginfo_t *info, void *context) > +{ > + int i; > + ucontext_t *uc = context; > + mcontext_t *mc = &uc->uc_mcontext; > + > + /* Only the non volatiles were loaded up */ > + for (i = 14; i < 32; i++) { > + if (mc->fp_regs[i] != darray[i - 14]) { > + bad_context = true; > + break; > + } > + } > +} > + > +void *signal_fpu_c(void *p) > +{ > + int i; > + long rc; > + struct sigaction act; > + act.sa_sigaction = signal_fpu_sig; > + act.sa_flags = SA_SIGINFO; > + rc = sigaction(SIGUSR1, &act, NULL); > + if (rc) > + return p; > + > + srand(pthread_self()); > + for (i = 0; i < 21; i++) > + darray[i] = rand(); > + > + rc = preempt_fpu(darray, &not_ready, &running); > + > + return (void *) rc; > +} > + > +int test_signal_fpu(void) This is almost identical to test_signal_vmx. Can you merge them?
> +{ > + int i, j, rc, threads; > + void *rc_p; > + pthread_t *tids; > + > + threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR; > + tids = malloc(threads * sizeof(pthread_t)); > + FAIL_IF(!tids); > + > + running = true; > + not_ready = threads; > + for (i = 0; i < threads; i++) { > + rc = pthread_create(&tids[i], NULL, signal_fpu_c, > NULL); > + FAIL_IF(rc); > + } > + > + setbuf(stdout, NULL); > + printf("\tWaiting for all workers to start..."); > + while (not_ready); A barrier needed here. > + printf("done\n"); > + > + printf("\tSending signals to all threads %d times...", > ITERATIONS); > + for (i = 0; i < ITERATIONS; i++) { > + for (j = 0; j < threads; j++) { > + pthread_kill(tids[j], SIGUSR1); > + } > + sleep(1); > + } > + printf("done\n"); > + > + printf("\tKilling workers..."); > + running = 0; > + for (i = 0; i < threads; i++) { > + pthread_join(tids[i], &rc_p); > + > + /* > + * Harness will say the fail was here, look at why > signal_fpu > + * returned > + */ > + if ((long) rc_p || bad_context) bad_context is set in a signal handler but is that run in another thread (I think)? You might have some memory barrier issues with bad_context also. > + printf("oops\n"); > + if (bad_context) > + fprintf(stderr, "\t!!
bad_context is > true\n"); > + FAIL_IF((long) rc_p || bad_context); > + } > + printf("done\n"); > + > + free(tids); > + return 0; > +} > + > +int main(int argc, char *argv[]) > +{ > + return test_harness(test_signal_fpu, "fpu_signal"); > +} > diff --git a/tools/testing/selftests/powerpc/math/vmx_signal.c > b/tools/testing/selftests/powerpc/math/vmx_signal.c > new file mode 100644 > index 0000000..007ac9e > --- /dev/null > +++ b/tools/testing/selftests/powerpc/math/vmx_signal.c > @@ -0,0 +1,124 @@ > +#include <stdio.h> > +#include <unistd.h> > +#include <sys/syscall.h> > +#include <sys/time.h> > +#include <sys/types.h> > +#include <sys/wait.h> > +#include <stdlib.h> > +#include <string.h> > +#include <pthread.h> > + > +#include "utils.h" > + > +/* Number of times each thread should recieve the signal */ > +#define ITERATIONS 10 > +/* > + * Factor by which to multiply number of online CPUs for total > number of > + * worker threads > + */ > +#define THREAD_FACTOR 8 > + > +typedef int v4si __attribute__ ((vector_size (16))); > +__thread v4si varray[] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10,11,12}, > + {13,14,15,16},{17,18,19,20},{21,22,23,24}, > + {25,26,27,28},{29,30,31,32},{33,34,35,36}, > + {37,38,39,40},{41,42,43,44},{45,46,47,48}}; > + > +bool bad_context; > +int running; > +volatile int not_ready; > + > +extern int preempt_vmx(v4si *varray, volatile int *not_ready, int > *sentinal); > + > +void signal_vmx_sig(int sig, siginfo_t *info, void *context) > +{ > + int i; > + ucontext_t *uc = context; > + mcontext_t *mc = &uc->uc_mcontext; > + > + /* Only the non volatiles were loaded up */ > + for (i = 20; i < 32; i++) { > + if (memcmp(mc->v_regs->vrregs[i], &varray[i - 20], > 16)) { > + bad_context = true; > + break; > + } > + } > +} > + > +void *signal_vmx_c(void *p) > +{ > + int i, j; > + long rc; > + struct sigaction act; > + act.sa_sigaction = signal_vmx_sig; > + act.sa_flags = SA_SIGINFO; > + rc = sigaction(SIGUSR1, &act, NULL); > + if (rc) > +
return p; > + > + srand(pthread_self()); > + for (i = 0; i < 12; i++) > + for (j = 0; j < 4; j++) > + varray[i][j] = rand(); > + > + rc = preempt_vmx(varray, &not_ready, &running); > + > + return (void *) rc; > +} > + > +int test_signal_vmx(void) > +{ > + int i, j, rc, threads; > + void *rc_p; > + pthread_t *tids; > + > + threads = sysconf(_SC_NPROCESSORS_ONLN) * THREAD_FACTOR; > + tids = malloc(threads * sizeof(pthread_t)); > + FAIL_IF(!tids); > + > + running = true; > + not_ready = threads; > + for (i = 0; i < threads; i++) { > + rc = pthread_create(&tids[i], NULL, signal_vmx_c, > NULL); > + FAIL_IF(rc); > + } > + > + setbuf(stdout, NULL); > + printf("\tWaiting for all workers to start..."); > + while (not_ready); > + printf("done\n"); > + > + printf("\tSending signals to all threads %d times...", > ITERATIONS); > + for (i = 0; i < ITERATIONS; i++) { > + for (j = 0; j < threads; j++) { > + pthread_kill(tids[j], SIGUSR1); > + } > + sleep(1); > + } > + printf("done\n"); > + > + printf("\tKilling workers..."); > + running = 0; > + for (i = 0; i < threads; i++) { > + pthread_join(tids[i], &rc_p); > + > + /* > + * Harness will say the fail was here, look at why > signal_vmx > + * returned > + */ > + if ((long) rc_p || bad_context) > + printf("oops\n"); > + if (bad_context) > + fprintf(stderr, "\t!! bad_context is > true\n"); > + FAIL_IF((long) rc_p || bad_context); > + } > + printf("done\n"); > + > + free(tids); > + return 0; > +} > + > +int main(int argc, char *argv[]) > +{ > + return test_harness(test_signal_vmx, "vmx_signal"); > +} ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 4/8] powerpc: Explicitly disable math features when copying thread 2015-11-18 3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur ` (2 preceding siblings ...) 2015-11-18 3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur @ 2015-11-18 3:26 ` Cyril Bur 2015-11-23 1:08 ` Michael Neuling 2015-11-18 3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur ` (4 subsequent siblings) 8 siblings, 1 reply; 23+ messages in thread From: Cyril Bur @ 2015-11-18 3:26 UTC (permalink / raw) To: mikey, anton, linuxppc-dev With threads leaving the math bits enabled in their saved MSR to indicate that the hardware is hot and a restore is not needed, children need to turn it off as when they do get scheduled, there's no way their registers could have been hot. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> --- arch/powerpc/kernel/process.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 398f7bf..441d9e5 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -1243,6 +1243,7 @@ int copy_thread(unsigned long clone_flags, unsigned long usp, f = ret_from_fork; } + childregs->msr &= ~(MSR_FP|MSR_VEC|MSR_VSX); sp -= STACK_FRAME_OVERHEAD; /* -- 2.6.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 4/8] powerpc: Explicitly disable math features when copying thread 2015-11-18 3:26 ` [PATCH 4/8] powerpc: Explicitly disable math features when copying thread Cyril Bur @ 2015-11-23 1:08 ` Michael Neuling 2015-11-23 3:20 ` Cyril Bur 0 siblings, 1 reply; 23+ messages in thread From: Michael Neuling @ 2015-11-23 1:08 UTC (permalink / raw) To: Cyril Bur, anton, linuxppc-dev On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > With threads leaving the math bits enabled in their saved MSR to > indicate > that the hardware is hot and a restore is not needed, children need > to turn > it off as when they do get scheduled, there's no way their registers > could > have been hot. > > Signed-off-by: Cyril Bur <cyrilbur@gmail.com> > --- > arch/powerpc/kernel/process.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/powerpc/kernel/process.c > b/arch/powerpc/kernel/process.c > index 398f7bf..441d9e5 100644 > --- a/arch/powerpc/kernel/process.c > +++ b/arch/powerpc/kernel/process.c > @@ -1243,6 +1243,7 @@ int copy_thread(unsigned long clone_flags, > unsigned long usp, > > f = ret_from_fork; > } > + childregs->msr &= ~(MSR_FP|MSR_VEC|MSR_VSX); Is this a current bug? Mikey > sp -= STACK_FRAME_OVERHEAD; > > /* ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 4/8] powerpc: Explicitly disable math features when copying thread 2015-11-23 1:08 ` Michael Neuling @ 2015-11-23 3:20 ` Cyril Bur 0 siblings, 0 replies; 23+ messages in thread From: Cyril Bur @ 2015-11-23 3:20 UTC (permalink / raw) To: Michael Neuling; +Cc: anton, linuxppc-dev On Mon, 23 Nov 2015 12:08:38 +1100 Michael Neuling <mikey@neuling.org> wrote: > On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > > With threads leaving the math bits enabled in their saved MSR to > > indicate > > that the hardware is hot and a restore is not needed, children need > > to turn > > it off as when they do get scheduled, there's no way their registers > > could > > have been hot. > > > > Signed-off-by: Cyril Bur <cyrilbur@gmail.com> > > --- > > arch/powerpc/kernel/process.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/arch/powerpc/kernel/process.c > > b/arch/powerpc/kernel/process.c > > index 398f7bf..441d9e5 100644 > > --- a/arch/powerpc/kernel/process.c > > +++ b/arch/powerpc/kernel/process.c > > @@ -1243,6 +1243,7 @@ int copy_thread(unsigned long clone_flags, > > unsigned long usp, > > > > f = ret_from_fork; > > } > > + childregs->msr &= ~(MSR_FP|MSR_VEC|MSR_VSX); > > Is this a current bug? > It is impossible currently because saving the registers (of the parent, before creating the child) also forces a giveup of the facilities. The next patch in the series decouples the saving from the giving up, which makes this situation possible. > Mikey > > > sp -= STACK_FRAME_OVERHEAD; > > > > /* ^ permalink raw reply [flat|nested] 23+ messages in thread
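Cyril's distinction between saving and giving up can be sketched in miniature C (struct, field, and helper names here are hypothetical stand-ins, not the kernel's; only the save/giveup coupling being discussed is from the thread):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one FP register and a "registers are hot" flag */
struct thread { double saved_fpr0; bool fp_hot; };
static double hw_fpr0;	/* stands in for the live FPU register file */

/*
 * Current behaviour: saving also gives the facility up, so immediately
 * after a save (e.g. of the parent at fork) the state can never be hot.
 */
static void giveup_fp(struct thread *t)
{
	t->saved_fpr0 = hw_fpr0;
	t->fp_hot = false;
}

/*
 * Decoupled behaviour (next patch): save the state but leave it hot for
 * the parent -- exactly what makes clearing the child's MSR necessary.
 */
static void save_fp(struct thread *t)
{
	t->saved_fpr0 = hw_fpr0;
}
```

With only giveup_fp() available, a freshly copied child could never observe a hot MSR; once save_fp() exists, it could, hence the explicit clear in patch 4.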
* [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used 2015-11-18 3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur ` (3 preceding siblings ...) 2015-11-18 3:26 ` [PATCH 4/8] powerpc: Explicitly disable math features when copying thread Cyril Bur @ 2015-11-18 3:26 ` Cyril Bur 2015-11-20 11:01 ` Michael Ellerman 2015-11-23 1:29 ` Michael Neuling 2015-11-18 3:26 ` [PATCH 6/8] powerpc: Add the ability to save FPU without giving it up Cyril Bur ` (3 subsequent siblings) 8 siblings, 2 replies; 23+ messages in thread From: Cyril Bur @ 2015-11-18 3:26 UTC (permalink / raw) To: mikey, anton, linuxppc-dev Currently the FPU, VEC and VSX facilities are lazily loaded. This is not a problem unless a process is using these facilities. Modern versions of GCC are very good at automatically vectorising code, new and modernised workloads make use of floating point and vector facilities, even the kernel makes use of vectorised memcpy. All this combined greatly increases the cost of a syscall since the kernel uses the facilities sometimes even in the syscall fast path, making it increasingly common for a thread to take an *_unavailable exception soon after a syscall, not to mention potentially taking the trifecta. The obvious overcompensation to this problem is to simply always load all the facilities on every exit to userspace. Loading up all FPU, VEC and VSX registers every time can be expensive and if a workload does avoid using them, it should not be forced to incur this penalty. An 8-bit counter is used to detect if the registers have been used in the past and the registers are always loaded until the value wraps back to zero. 
Signed-off-by: Cyril Bur <cyrilbur@gmail.com> --- arch/powerpc/include/asm/processor.h | 2 ++ arch/powerpc/kernel/asm-offsets.c | 2 ++ arch/powerpc/kernel/entry_64.S | 55 ++++++++++++++++++++++++++++-- arch/powerpc/kernel/fpu.S | 4 +++ arch/powerpc/kernel/process.c | 66 ++++++++++++++++++++++++++++++------ arch/powerpc/kernel/vector.S | 4 +++ 6 files changed, 119 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index ac23308..dcab21f 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -236,11 +236,13 @@ struct thread_struct { #endif struct arch_hw_breakpoint hw_brk; /* info on the hardware breakpoint */ unsigned long trap_nr; /* last trap # on this thread */ + u8 load_fp; #ifdef CONFIG_ALTIVEC struct thread_vr_state vr_state; struct thread_vr_state *vr_save_area; unsigned long vrsave; int used_vr; /* set if process has used altivec */ + u8 load_vec; #endif /* CONFIG_ALTIVEC */ #ifdef CONFIG_VSX /* VSR status */ diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 221d584..0f593d7 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -95,12 +95,14 @@ int main(void) DEFINE(THREAD_FPSTATE, offsetof(struct thread_struct, fp_state)); DEFINE(THREAD_FPSAVEAREA, offsetof(struct thread_struct, fp_save_area)); DEFINE(FPSTATE_FPSCR, offsetof(struct thread_fp_state, fpscr)); + DEFINE(THREAD_LOAD_FP, offsetof(struct thread_struct, load_fp)); #ifdef CONFIG_ALTIVEC DEFINE(THREAD_VRSTATE, offsetof(struct thread_struct, vr_state)); DEFINE(THREAD_VRSAVEAREA, offsetof(struct thread_struct, vr_save_area)); DEFINE(THREAD_VRSAVE, offsetof(struct thread_struct, vrsave)); DEFINE(THREAD_USED_VR, offsetof(struct thread_struct, used_vr)); DEFINE(VRSTATE_VSCR, offsetof(struct thread_vr_state, vscr)); + DEFINE(THREAD_LOAD_VEC, offsetof(struct thread_struct, load_vec)); #endif /* CONFIG_ALTIVEC */ #ifdef 
CONFIG_VSX DEFINE(THREAD_USED_VSR, offsetof(struct thread_struct, used_vsr)); diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index c8b4225..46e9869 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -210,7 +210,54 @@ system_call: /* label this so stack traces look sane */ li r11,-MAX_ERRNO andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK) bne- syscall_exit_work - cmpld r3,r11 + + /* + * This is an assembly version of checks performed in restore_math() + * to avoid calling C unless absolutely necessary. + * Note: In order to simplify the assembly, if the FP or VEC registers + * are hot (and therefore restore_math() isn't called) the + * LOAD_{FP,VEC} thread counter doesn't get incremented. + * This is likely the best thing to do anyway because hot regs indicate + * that the workload is doing a lot of syscalls that can be handled + * quickly and without the need to touch FP or VEC regs (by the kernel). + * a) If this workload is long running then this is exactly what the + * kernel should be doing. + * b) If this workload isn't long running then we'll soon fall back to + * calling into C and the counter will be incremented regularly again + * anyway. + */ + ld r9,PACACURRENT(r13) + andi. r0,r8,MSR_FP + addi r9,r9,THREAD + lbz r5,THREAD_LOAD_FP(r9) + /* + * Goto 2 if !r0 && r5 + * The cmpb works because r5 can only have bits set in the lowest byte + * and r0 may or may not have bit 13 set (different byte) but will have + * a zero low byte therefore the low bytes must differ if r5 == true + * and the bit 13 byte must be the same if !r0 + */ + cmpb r7,r0,r5 + cmpldi r7,0xff0 +#ifdef CONFIG_ALTIVEC + beq 2f + + lbz r9,THREAD_LOAD_VEC(r9) + andis. 
r0,r8,MSR_VEC@h + /* Skip (goto 3) if r0 || !r9 */ + bne 3f + cmpldi r9,0 + beq 3f +#else + bne 3f +#endif +2: addi r3,r1,STACK_FRAME_OVERHEAD + bl restore_math + ld r8,_MSR(r1) + ld r3,RESULT(r1) + li r11,-MAX_ERRNO + +3: cmpld r3,r11 ld r5,_CCR(r1) bge- syscall_error .Lsyscall_error_cont: @@ -592,8 +639,8 @@ _GLOBAL(ret_from_except_lite) /* Check current_thread_info()->flags */ andi. r0,r4,_TIF_USER_WORK_MASK -#ifdef CONFIG_PPC_BOOK3E bne 1f +#ifdef CONFIG_PPC_BOOK3E /* * Check to see if the dbcr0 register is set up to debug. * Use the internal debug mode bit to do this. @@ -608,7 +655,9 @@ _GLOBAL(ret_from_except_lite) mtspr SPRN_DBSR,r10 b restore #else - beq restore + addi r3,r1,STACK_FRAME_OVERHEAD + bl restore_math + b restore #endif 1: andi. r0,r4,_TIF_NEED_RESCHED beq 2f diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S index 2117eac..b063524 100644 --- a/arch/powerpc/kernel/fpu.S +++ b/arch/powerpc/kernel/fpu.S @@ -130,6 +130,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX) or r12,r12,r4 std r12,_MSR(r1) #endif + /* Don't care if r4 overflows, this is desired behaviour */ + lbz r4,THREAD_LOAD_FP(r5) + addi r4,r4,1 + stb r4,THREAD_LOAD_FP(r5) addi r10,r5,THREAD_FPSTATE lfd fr0,FPSTATE_FPSCR(r10) MTFSF_L(fr0) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 441d9e5..c602b67 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -366,6 +366,53 @@ void giveup_all(struct task_struct *tsk) msr_check_and_clear(msr_all_available); } +void restore_math(struct pt_regs *regs) +{ + unsigned long msr; + + if (!current->thread.load_fp +#ifdef CONFIG_ALTIVEC + && !current->thread.load_vec) +#else + ) +#endif + return; + + msr = regs->msr; + msr_check_and_set(msr_all_available); + + /* + * Only reload if the bit is not set in the user MSR, the bit BEING set + * indicates that the registers are hot + */ +#ifdef CONFIG_PPC_FPU + if (current->thread.load_fp && !(msr & MSR_FP)) { + 
load_fp_state(&current->thread.fp_state); + msr |= MSR_FP | current->thread.fpexc_mode; + current->thread.load_fp++; + } +#endif +#ifdef CONFIG_ALTIVEC + if (current->thread.load_vec && !(msr & MSR_VEC) && + cpu_has_feature(CPU_FTR_ALTIVEC)) { + load_vr_state(&current->thread.vr_state); + current->thread.used_vr = 1; + msr |= MSR_VEC; + current->thread.load_vec++; + } +#endif +#ifdef CONFIG_VSX + if (!(msr & MSR_VSX) && (msr & (MSR_FP | MSR_VEC)) == (MSR_FP | MSR_VEC)) { + current->thread.used_vsr = 1; + msr |= MSR_VSX; + } +#endif + + msr_check_and_clear(msr_all_available); + + regs->msr = msr; +} + void flush_all_to_thread(struct task_struct *tsk) { if (tsk->thread.regs) { @@ -806,17 +853,9 @@ void restore_tm_state(struct pt_regs *regs) msr_diff = current->thread.ckpt_regs.msr & ~regs->msr; msr_diff &= MSR_FP | MSR_VEC | MSR_VSX; - if (msr_diff & MSR_FP) { - msr_check_and_set(MSR_FP); - load_fp_state(&current->thread.fp_state); - msr_check_and_clear(MSR_FP); - regs->msr |= current->thread.fpexc_mode; - } - if (msr_diff & MSR_VEC) { - msr_check_and_set(MSR_VEC); - load_vr_state(&current->thread.vr_state); - msr_check_and_clear(MSR_VEC); - } + + restore_math(regs); + regs->msr |= msr_diff; } @@ -977,6 +1016,11 @@ struct task_struct *__switch_to(struct task_struct *prev, batch = this_cpu_ptr(&ppc64_tlb_batch); batch->active = 1; } + + /* Don't do this on a kernel thread */ + if (current_thread_info()->task->thread.regs) + restore_math(current_thread_info()->task->thread.regs); + #endif /* CONFIG_PPC_BOOK3S_64 */ return last; diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S index 162d0f7..038cff8 100644 --- a/arch/powerpc/kernel/vector.S +++ b/arch/powerpc/kernel/vector.S @@ -91,6 +91,10 @@ _GLOBAL(load_up_altivec) oris r12,r12,MSR_VEC@h std r12,_MSR(r1) #endif + /* Don't care if r4 overflows, this is desired behaviour */ + lbz r4,THREAD_LOAD_VEC(r5) + addi r4,r4,1 + stb r4,THREAD_LOAD_VEC(r5) addi r6,r5,THREAD_VRSTATE li r4,1 li r10,VRSTATE_VSCR -- 2.6.2 ^ permalink 
raw reply related [flat|nested] 23+ messages in thread
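The wrap-to-zero behaviour of the 8-bit counter described in the commit message can be modelled in a few lines of C (the mock struct and helper names are hypothetical; only the u8 field and its wrap semantics are from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock of the thread_struct field added by the patch */
struct thread { uint8_t load_fp; };

/* Bumped each time the FP registers are actually loaded for the task */
static void note_fp_load(struct thread *t)
{
	t->load_fp++;	/* u8 arithmetic: 255 + 1 wraps to 0, by design */
}

/*
 * While the counter is nonzero the kernel keeps restoring the FP state
 * eagerly on return to userspace; once the counter wraps back to zero
 * the task falls back to the lazy, exception-driven scheme.
 */
static bool want_eager_restore(const struct thread *t)
{
	return t->load_fp != 0;
}
```

The wrap is the ageing mechanism: a task that keeps taking unavailable exceptions keeps re-arming the counter, while one that stops using the facility eventually drops back to lazy handling for free.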
* Re: [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used 2015-11-18 3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur @ 2015-11-20 11:01 ` Michael Ellerman 2015-11-22 22:18 ` Cyril Bur 2015-11-23 1:29 ` Michael Neuling 1 sibling, 1 reply; 23+ messages in thread From: Michael Ellerman @ 2015-11-20 11:01 UTC (permalink / raw) To: Cyril Bur, mikey, anton, linuxppc-dev On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S > index c8b4225..46e9869 100644 > --- a/arch/powerpc/kernel/entry_64.S > +++ b/arch/powerpc/kernel/entry_64.S > @@ -210,7 +210,54 @@ system_call: /* label this so stack traces look sane */ > li r11,-MAX_ERRNO > andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK) > bne- syscall_exit_work > - cmpld r3,r11 > + > + /* > + * This is an assembly version of checks performed in restore_math() > + * to avoid calling C unless absolutely necessary. > + * Note: In order to simplify the assembly, if the FP or VEC registers > + * are hot (and therefore restore_math() isn't called) the > + * LOAD_{FP,VEC} thread counter doesn't get incremented. > + * This is likely the best thing to do anyway because hot regs indicate > + * that the workload is doing a lot of syscalls that can be handled > + * quickly and without the need to touch FP or VEC regs (by the kernel). > + * a) If this workload is long running then this is exactly what the > + * kernel should be doing. > + * b) If this workload isn't long running then we'll soon fall back to > + * calling into C and the counter will be incremented regularly again > + * anyway. > + */ > + ld r9,PACACURRENT(r13) > + andi. 
r0,r8,MSR_FP > + addi r9,r9,THREAD > + lbz r5,THREAD_LOAD_FP(r9) > + /* > + * Goto 2 if !r0 && r5 > + * The cmpb works because r5 can only have bits set in the lowest byte > + * and r0 may or may not have bit 13 set (different byte) but will have > + * a zero low byte therefore the low bytes must differ if r5 == true > + * and the bit 13 byte must be the same if !r0 > + */ > + cmpb r7,r0,r5 cmpb is new since Power6, which means it doesn't exist on Cell -> Program Check :) I'm testing a patch using crandc, but I don't like it. I'm not a big fan of the logic here, it's unpleasantly complicated. Did you benchmark going to C to do the checks? Or I wonder if we could just check THREAD_LOAD_FP || THREAD_LOAD_VEC and if either is set we go to restore_math(). Or on the other hand we check !MSR_FP && !MSR_VEC and if so we go to restore_math()? > + cmpldi r7,0xff0 > +#ifdef CONFIG_ALTIVEC > + beq 2f > + > + lbz r9,THREAD_LOAD_VEC(r9) > + andis. r0,r8,MSR_VEC@h > + /* Skip (goto 3) if r0 || !r9 */ > + bne 3f > + cmpldi r9,0 > + beq 3f > +#else > + bne 3f > +#endif > +2: addi r3,r1,STACK_FRAME_OVERHEAD > + bl restore_math > + ld r8,_MSR(r1) > + ld r3,RESULT(r1) > + li r11,-MAX_ERRNO > + > +3: cmpld r3,r11 > ld r5,_CCR(r1) > bge- syscall_error > .Lsyscall_error_cont: cheers ^ permalink raw reply [flat|nested] 23+ messages in thread
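For readers unfamiliar with the instruction at issue: cmpb compares its two source registers byte by byte, writing 0xff into each result byte whose input bytes match and 0x00 otherwise. A C model of that semantic (the function is a sketch of the instruction, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the PowerPC cmpb instruction: result byte i is 0xff if byte i
 * of ra equals byte i of rb, else 0x00.
 */
static uint64_t cmpb(uint64_t ra, uint64_t rb)
{
	uint64_t rt = 0;

	for (int i = 0; i < 8; i++) {
		if (((ra >> (8 * i)) & 0xff) == ((rb >> (8 * i)) & 0xff))
			rt |= 0xffULL << (8 * i);
	}
	return rt;
}
```

In the patch, one operand is msr & MSR_FP (zero, or 0x2000 with only byte 1 populated) and the other is the load_fp counter (only byte 0 can be nonzero), so a single cmpb plus one compare-immediate encodes both conditions. Michael's objection stands regardless: cmpb arrived with ISA 2.05 (POWER6), so earlier parts such as Cell take a program check on it.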
* Re: [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used 2015-11-20 11:01 ` Michael Ellerman @ 2015-11-22 22:18 ` Cyril Bur 2015-11-22 23:07 ` Michael Ellerman 0 siblings, 1 reply; 23+ messages in thread From: Cyril Bur @ 2015-11-22 22:18 UTC (permalink / raw) To: Michael Ellerman; +Cc: mikey, anton, linuxppc-dev On Fri, 20 Nov 2015 22:01:04 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote: > On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S > > index c8b4225..46e9869 100644 > > --- a/arch/powerpc/kernel/entry_64.S > > +++ b/arch/powerpc/kernel/entry_64.S > > @@ -210,7 +210,54 @@ system_call: /* label this so stack traces look sane */ > > li r11,-MAX_ERRNO > > andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK) > > bne- syscall_exit_work > > - cmpld r3,r11 > > + > > + /* > > + * This is an assembly version of checks performed in restore_math() > > + * to avoid calling C unless absolutely necessary. > > + * Note: In order to simplify the assembly, if the FP or VEC registers > > + * are hot (and therefore restore_math() isn't called) the > > + * LOAD_{FP,VEC} thread counter doesn't get incremented. > > + * This is likely the best thing to do anyway because hot regs indicate > > + * that the workload is doing a lot of syscalls that can be handled > > + * quickly and without the need to touch FP or VEC regs (by the kernel). > > + * a) If this workload is long running then this is exactly what the > > + * kernel should be doing. > > + * b) If this workload isn't long running then we'll soon fall back to > > + * calling into C and the counter will be incremented regularly again > > + * anyway. > > + */ > > + ld r9,PACACURRENT(r13) > > + andi. 
r0,r8,MSR_FP > > + addi r9,r9,THREAD > > + lbz r5,THREAD_LOAD_FP(r9) > > + /* > > + * Goto 2 if !r0 && r5 > > + * The cmpb works because r5 can only have bits set in the lowest byte > > + * and r0 may or may not have bit 13 set (different byte) but will have > > + * a zero low byte therefore the low bytes must differ if r5 == true > > + * and the bit 13 byte must be the same if !r0 > > + */ > > + cmpb r7,r0,r5 > > cmpb is new since Power6, which means it doesn't exist on Cell -> Program Check :) > Oops, sorry. > I'm testing a patch using crandc, but I don't like it. > > I'm not a big fan of the logic here, it's unpleasantly complicated. Did you > benchmark going to C to do the checks? Or I wonder if we could just check > THREAD_LOAD_FP || THREAD_LOAD_VEC and if either is set we go to restore_math(). > I didn't benchmark going to C mostly because you wanted to avoid calling C unless necessary in that path. Based off the results I got benchmarking this series I expect calling C will also be in the noise of removing the exception. > Or on the other hand we check !MSR_FP && !MSR_VEC and if so we go to > restore_math()? > That seems like the best check to leave in the assembly if you want to avoid complicated assembly in there. > > + cmpldi r7,0xff0 > > +#ifdef CONFIG_ALTIVEC > > + beq 2f > > + > > + lbz r9,THREAD_LOAD_VEC(r9) > > + andis. r0,r8,MSR_VEC@h > > + /* Skip (goto 3) if r0 || !r9 */ > > + bne 3f > > + cmpldi r9,0 > > + beq 3f > > +#else > > + bne 3f > > +#endif > > +2: addi r3,r1,STACK_FRAME_OVERHEAD > > + bl restore_math > > + ld r8,_MSR(r1) > > + ld r3,RESULT(r1) > > + li r11,-MAX_ERRNO > > + > > +3: cmpld r3,r11 > > ld r5,_CCR(r1) > > bge- syscall_error > > .Lsyscall_error_cont: > > > cheers > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used 2015-11-22 22:18 ` Cyril Bur @ 2015-11-22 23:07 ` Michael Ellerman 0 siblings, 0 replies; 23+ messages in thread From: Michael Ellerman @ 2015-11-22 23:07 UTC (permalink / raw) To: Cyril Bur; +Cc: mikey, anton, linuxppc-dev On Mon, 2015-11-23 at 09:18 +1100, Cyril Bur wrote: > On Fri, 20 Nov 2015 22:01:04 +1100 > Michael Ellerman <mpe@ellerman.id.au> wrote: > > On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > > > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S > > > index c8b4225..46e9869 100644 > > > --- a/arch/powerpc/kernel/entry_64.S > > > +++ b/arch/powerpc/kernel/entry_64.S > > > @@ -210,7 +210,54 @@ system_call: /* label this so stack traces look sane */ > > > li r11,-MAX_ERRNO > > > andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK) > > > bne- syscall_exit_work > > > - cmpld r3,r11 > > > + > > > + /* > > > + * This is an assembly version of checks performed in restore_math() > > > + * to avoid calling C unless absolutely necessary. > > > + * Note: In order to simplify the assembly, if the FP or VEC registers > > > + * are hot (and therefore restore_math() isn't called) the > > > + * LOAD_{FP,VEC} thread counter doesn't get incremented. > > > + * This is likely the best thing to do anyway because hot regs indicate > > > + * that the workload is doing a lot of syscalls that can be handled > > > + * quickly and without the need to touch FP or VEC regs (by the kernel). > > > + * a) If this workload is long running then this is exactly what the > > > + * kernel should be doing. > > > + * b) If this workload isn't long running then we'll soon fall back to > > > + * calling into C and the counter will be incremented regularly again > > > + * anyway. > > > + */ > > > + ld r9,PACACURRENT(r13) > > > + andi. 
r0,r8,MSR_FP > > > + addi r9,r9,THREAD > > > + lbz r5,THREAD_LOAD_FP(r9) > > > + /* > > > + * Goto 2 if !r0 && r5 > > > + * The cmpb works because r5 can only have bits set in the lowest byte > > > + * and r0 may or may not have bit 13 set (different byte) but will have > > > + * a zero low byte therefore the low bytes must differ if r5 == true > > > + * and the bit 13 byte must be the same if !r0 > > > + */ > > > + cmpb r7,r0,r5 > > > > cmpb is new since Power6, which means it doesn't exist on Cell -> Program Check :) > > > Oops, sorry. That's fine, there's almost no way for you to know that from reading the documentation. > > I'm testing a patch using crandc, but I don't like it. > > > > I'm not a big fan of the logic here, it's unpleasantly complicated. Did you > > benchmark going to C to do the checks? Or I wonder if we could just check > > THREAD_LOAD_FP || THREAD_LOAD_VEC and if either is set we go to restore_math(). > > > > I didn't benchmark going to C mostly because you wanted to avoid calling C > unless necessary in that path. Based off the results I got benchmarking the > this series I expect calling C will also be in the noise of removing the > exception. Yeah I figured it was probably me that said "avoid C at all costs". But I've changed my mind ;) > > Or on the other hand we check !MSR_FP && !MSR_VEC and if so we go to > > restore_math()? > > That seems like the best check to leave in the assembly if you want to avoid > complicated assembly in there. Cool. If you can benchmark that that'd be great, mmkay. cheers ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used 2015-11-18 3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur 2015-11-20 11:01 ` Michael Ellerman @ 2015-11-23 1:29 ` Michael Neuling 1 sibling, 0 replies; 23+ messages in thread From: Michael Neuling @ 2015-11-23 1:29 UTC (permalink / raw) To: Cyril Bur, anton, linuxppc-dev On Wed, 2015-11-18 at 14:26 +1100, Cyril Bur wrote: > Currently the FPU, VEC and VSX facilities are lazily loaded. This is > not a > problem unless a process is using these facilities. I would prefer to say facilities are "enabled" and registers are "loaded". You're mixing the two here. > Modern versions of GCC are very good at automatically vectorising > code, new > and modernised workloads make use of floating point and vector > facilities, > even the kernel makes use of vectorised memcpy. > > All this combined greatly increases the cost of a syscall since the > kernel > uses the facilities sometimes even in syscall fast-path making it > increasingly common for a thread to take an *_unavailable exception > soon > after a syscall, not to mention potentially taking the trifecta. > > The obvious overcompensation to this problem is to simply always load > all > the facilities on every exit to userspace. Loading up all FPU, VEC > and VSX > registers every time can be expensive and if a workload does avoid > using > them, it should not be forced to incur this penalty. > > An 8bit counter is used to detect if the registers have been used in > the > past and the registers are always loaded until the value wraps to > back to > zero. 
> > Signed-off-by: Cyril Bur <cyrilbur@gmail.com> > --- > arch/powerpc/include/asm/processor.h | 2 ++ > arch/powerpc/kernel/asm-offsets.c | 2 ++ > arch/powerpc/kernel/entry_64.S | 55 > ++++++++++++++++++++++++++++-- > arch/powerpc/kernel/fpu.S | 4 +++ > arch/powerpc/kernel/process.c | 66 > ++++++++++++++++++++++++++++++------ > arch/powerpc/kernel/vector.S | 4 +++ > 6 files changed, 119 insertions(+), 14 deletions(-) > > diff --git a/arch/powerpc/include/asm/processor.h > b/arch/powerpc/include/asm/processor.h > index ac23308..dcab21f 100644 > --- a/arch/powerpc/include/asm/processor.h > +++ b/arch/powerpc/include/asm/processor.h > @@ -236,11 +236,13 @@ struct thread_struct { > #endif > struct arch_hw_breakpoint hw_brk; /* info on the hardware > breakpoint */ > unsigned long trap_nr; /* last trap # on this > thread */ > + u8 load_fp; > #ifdef CONFIG_ALTIVEC > struct thread_vr_state vr_state; > struct thread_vr_state *vr_save_area; > unsigned long vrsave; > int used_vr; /* set if process has > used altivec */ > + u8 load_vec; > #endif /* CONFIG_ALTIVEC */ > #ifdef CONFIG_VSX > /* VSR status */ > diff --git a/arch/powerpc/kernel/asm-offsets.c > b/arch/powerpc/kernel/asm-offsets.c > index 221d584..0f593d7 100644 > --- a/arch/powerpc/kernel/asm-offsets.c > +++ b/arch/powerpc/kernel/asm-offsets.c > @@ -95,12 +95,14 @@ int main(void) > DEFINE(THREAD_FPSTATE, offsetof(struct thread_struct, > fp_state)); > DEFINE(THREAD_FPSAVEAREA, offsetof(struct thread_struct, > fp_save_area)); > DEFINE(FPSTATE_FPSCR, offsetof(struct thread_fp_state, > fpscr)); > + DEFINE(THREAD_LOAD_FP, offsetof(struct thread_struct, > load_fp)); > #ifdef CONFIG_ALTIVEC > DEFINE(THREAD_VRSTATE, offsetof(struct thread_struct, > vr_state)); > DEFINE(THREAD_VRSAVEAREA, offsetof(struct thread_struct, > vr_save_area)); > DEFINE(THREAD_VRSAVE, offsetof(struct thread_struct, > vrsave)); > DEFINE(THREAD_USED_VR, offsetof(struct thread_struct, > used_vr)); > DEFINE(VRSTATE_VSCR, offsetof(struct 
thread_vr_state, > vscr)); > + DEFINE(THREAD_LOAD_VEC, offsetof(struct thread_struct, > load_vec)); > #endif /* CONFIG_ALTIVEC */ > #ifdef CONFIG_VSX > DEFINE(THREAD_USED_VSR, offsetof(struct thread_struct, > used_vsr)); > diff --git a/arch/powerpc/kernel/entry_64.S > b/arch/powerpc/kernel/entry_64.S > index c8b4225..46e9869 100644 > --- a/arch/powerpc/kernel/entry_64.S > +++ b/arch/powerpc/kernel/entry_64.S > @@ -210,7 +210,54 @@ system_call: /* label > this so stack traces look sane */ > li r11,-MAX_ERRNO > andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TI > F_USER_WORK_MASK|_TIF_PERSYSCALL_MASK) > bne- syscall_exit_work > - cmpld r3,r11 > + > + /* > + * This is an assembly version of checks performed in > restore_math() > + * to avoid calling C unless absolutely necessary. > + * Note: In order to simplify the assembly, if the FP or VEC > registers > + * are hot (and therefore restore_math() isn't called) the > + * LOAD_{FP,VEC} thread counter doesn't get incremented. > + * This is likely the best thing to do anyway because hot > regs indicate > + * that the workload is doing a lot of syscalls that can be > handled > + * quickly and without the need to touch FP or VEC regs (by > the kernel). > + * a) If this workload is long running then this is exactly > what the > + * kernel should be doing. > + * b) If this workload isn't long running then we'll soon > fall back to > + * calling into C and the counter will be incremented > regularly again > + * anyway. > + */ > + ld r9,PACACURRENT(r13) > + andi. r0,r8,MSR_FP > + addi r9,r9,THREAD > + lbz r5,THREAD_LOAD_FP(r9) Would this work as a minor simplification? ld r9,PACACURRENT(r13) andi. 
r0,r8,MSR_FP lbz r5,THREAD+THREAD_LOAD_FP(r9) > + /* > + * Goto 2 if !r0 && r5 > + * The cmpb works because r5 can only have bits set in the lowest byte > + * and r0 may or may not have bit 13 set (different byte) but will have > + * a zero low byte therefore the low bytes must differ if r5 == true > + * and the bit 13 byte must be the same if !r0 > + */ > + cmpb r7,r0,r5 > + cmpldi r7,0xff0 > +#ifdef CONFIG_ALTIVEC > + beq 2f > + > + lbz r9,THREAD_LOAD_VEC(r9) > + andis. r0,r8,MSR_VEC@h > + /* Skip (goto 3) if r0 || !r9 */ > + bne 3f > + cmpldi r9,0 > + beq 3f > +#else > + bne 3f > +#endif > +2: addi r3,r1,STACK_FRAME_OVERHEAD > + bl restore_math > + ld r8,_MSR(r1) > + ld r3,RESULT(r1) > + li r11,-MAX_ERRNO > + > +3: cmpld r3,r11 > ld r5,_CCR(r1) > bge- syscall_error > .Lsyscall_error_cont: > @@ -592,8 +639,8 @@ _GLOBAL(ret_from_except_lite) > > /* Check current_thread_info()->flags */ > andi. r0,r4,_TIF_USER_WORK_MASK > -#ifdef CONFIG_PPC_BOOK3E > bne 1f > +#ifdef CONFIG_PPC_BOOK3E why this change? > /* > * Check to see if the dbcr0 register is set up to debug. > * Use the internal debug mode bit to do this. > @@ -608,7 +655,9 @@ _GLOBAL(ret_from_except_lite) > mtspr SPRN_DBSR,r10 > b restore > #else > - beq restore > + addi r3,r1,STACK_FRAME_OVERHEAD > + bl restore_math > + b restore > #endif > 1: andi. 
r0,r4,_TIF_NEED_RESCHED > beq 2f > diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S > index 2117eac..b063524 100644 > --- a/arch/powerpc/kernel/fpu.S > +++ b/arch/powerpc/kernel/fpu.S > @@ -130,6 +130,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX) > or r12,r12,r4 > std r12,_MSR(r1) > #endif > + /* Don't care if r4 overflows, this is desired behaviour */ > + lbz r4,THREAD_LOAD_FP(r5) > + addi r4,r4,1 > + stb r4,THREAD_LOAD_FP(r5) > addi r10,r5,THREAD_FPSTATE > lfd fr0,FPSTATE_FPSCR(r10) > MTFSF_L(fr0) > diff --git a/arch/powerpc/kernel/process.c > b/arch/powerpc/kernel/process.c > index 441d9e5..c602b67 100644 > --- a/arch/powerpc/kernel/process.c > +++ b/arch/powerpc/kernel/process.c > @@ -366,6 +366,53 @@ void giveup_all(struct task_struct *tsk) > msr_check_and_clear(msr_all_available); > } > > +void restore_math(struct pt_regs *regs) > +{ > + unsigned long msr; > + > + if (!current->thread.load_fp > +#ifdef CONFIG_ALTIVEC > + && !current->thread.load_vec) > +#else > + ) > +#endif That is possibly the most revolting use of #ifdef I have ever seen. And that's saying something with the crap state that process.c is already in :-P > + return; > + > + msr = regs->msr; > + msr_check_and_set(msr_all_available); > + > + /* > + * Only reload if the bit is not set in the user MSR, the > bit BEING set > + * indicates that the registers are hot > + */ > +#ifdef CONFIG_PPC_FPU > + if (current->thread.load_fp && !(msr & MSR_FP)) { > + load_fp_state(&current->thread.fp_state); > + msr |= MSR_FP | current->thread.fpexc_mode; > + current->thread.load_fp++; > + } > +#endif > +#ifdef CONFIG_ALTIVEC > + if (current->thread.load_vec && !(msr & MSR_VEC) && > + cpu_has_feature(CPU_FTR_ALTIVEC)) { > + load_vr_state(&current->thread.vr_state); > + current->thread.used_vr = 1; > + msr |= MSR_VEC; > + current->thread.load_vec++; > + } > +#endif > +#ifdef CONFIG_VSX > + if (!(msr & MSR_VSX) && (msr & (MSR_FP | MSR_VEC)) == > (MSR_FP | MSR_VEC)) { > + 
current->thread.used_vsr = 1; > + msr |= MSR_VSX; > + } > +#endif > + > + msr_check_and_clear(msr_all_available); > + > + regs->msr = msr; > +} > + > void flush_all_to_thread(struct task_struct *tsk) > { > if (tsk->thread.regs) { > @@ -806,17 +853,9 @@ void restore_tm_state(struct pt_regs *regs) > > msr_diff = current->thread.ckpt_regs.msr & ~regs->msr; > msr_diff &= MSR_FP | MSR_VEC | MSR_VSX; > - if (msr_diff & MSR_FP) { > - msr_check_and_set(MSR_FP); > - load_fp_state(&current->thread.fp_state); > - msr_check_and_clear(MSR_FP); > - regs->msr |= current->thread.fpexc_mode; > - } > - if (msr_diff & MSR_VEC) { > - msr_check_and_set(MSR_VEC); > - load_vr_state(&current->thread.vr_state); > - msr_check_and_clear(MSR_VEC); > - } > + > + restore_math(regs); > + > regs->msr |= msr_diff; > } > > @@ -977,6 +1016,11 @@ struct task_struct *__switch_to(struct > task_struct *prev, > batch = this_cpu_ptr(&ppc64_tlb_batch); > batch->active = 1; > } > + > + /* Don't do this on a kernel thread */ > + if (current_thread_info()->task->thread.regs) > + restore_math(current_thread_info()->task > ->thread.regs); > + > #endif /* CONFIG_PPC_BOOK3S_64 */ > > return last; > diff --git a/arch/powerpc/kernel/vector.S > b/arch/powerpc/kernel/vector.S > index 162d0f7..038cff8 100644 > --- a/arch/powerpc/kernel/vector.S > +++ b/arch/powerpc/kernel/vector.S > @@ -91,6 +91,10 @@ _GLOBAL(load_up_altivec) > oris r12,r12,MSR_VEC@h > std r12,_MSR(r1) > #endif > + /* Don't care if r4 overflows, this is desired behaviour */ > + lbz r4,THREAD_LOAD_VEC(r5) > + addi r4,r4,1 > + stb r4,THREAD_LOAD_VEC(r5) > addi r6,r5,THREAD_VRSTATE > li r4,1 > li r10,VRSTATE_VSCR ^ permalink raw reply [flat|nested] 23+ messages in thread
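The "#ifdef inside a condition" that Mikey objects to has a mechanical fix with identical behaviour; one hypothetical refactoring (mock struct, CONFIG_ALTIVEC assumed defined for this sketch) keeps each configuration as a complete statement:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CONFIG_ALTIVEC 1	/* assumption for this sketch */

/* Mock of the relevant thread_struct fields */
struct thread_struct {
	uint8_t load_fp;
#ifdef CONFIG_ALTIVEC
	uint8_t load_vec;
#endif
};

/*
 * Early-out check for restore_math(): the #ifdef selects between two
 * whole return statements instead of splicing fragments of one
 * expression together.
 */
static bool math_restore_wanted(const struct thread_struct *t)
{
#ifdef CONFIG_ALTIVEC
	return t->load_fp || t->load_vec;
#else
	return t->load_fp != 0;
#endif
}
```

The condition is unchanged; only its shape is, which is usually the easiest way to answer this style of review comment.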
* [PATCH 6/8] powerpc: Add the ability to save FPU without giving it up 2015-11-18 3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur ` (4 preceding siblings ...) 2015-11-18 3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur @ 2015-11-18 3:26 ` Cyril Bur 2015-11-18 3:26 ` [PATCH 7/8] powerpc: Add the ability to save Altivec " Cyril Bur ` (2 subsequent siblings) 8 siblings, 0 replies; 23+ messages in thread From: Cyril Bur @ 2015-11-18 3:26 UTC (permalink / raw) To: mikey, anton, linuxppc-dev This patch adds the ability to save the FPU registers to the thread struct without giving them up (disabling the facility) the next time the process returns to userspace. This patch optimises the thread copy path (as a result of a fork() or clone()) so that the parent thread can return to userspace with hot registers, avoiding a possibly pointless reload of FPU register state. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> --- arch/powerpc/include/asm/switch_to.h | 2 +- arch/powerpc/kernel/fpu.S | 21 ++++------------ arch/powerpc/kernel/process.c | 46 +++++++++++++++++++++++++++++++++++- 3 files changed, 50 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h index 171ce13..8cf7fd6 100644 --- a/arch/powerpc/include/asm/switch_to.h +++ b/arch/powerpc/include/asm/switch_to.h @@ -28,7 +28,7 @@ extern void giveup_all(struct task_struct *); extern void enable_kernel_fp(void); extern void flush_fp_to_thread(struct task_struct *); extern void giveup_fpu(struct task_struct *); -extern void __giveup_fpu(struct task_struct *); +extern void save_fpu(struct task_struct *); static inline void disable_kernel_fp(void) { msr_check_and_clear(MSR_FP); diff --git a/arch/powerpc/kernel/fpu.S b/arch/powerpc/kernel/fpu.S index b063524..15da2b5 100644 --- a/arch/powerpc/kernel/fpu.S +++ b/arch/powerpc/kernel/fpu.S @@ -143,33 +143,20 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX) blr /* - * __giveup_fpu(tsk)
- * Disable FP for the task given as the argument, - * and save the floating-point registers in its thread_struct. + * save_fpu(tsk) + * Save the floating-point registers in its thread_struct. * Enables the FPU for use in the kernel on return. */ -_GLOBAL(__giveup_fpu) +_GLOBAL(save_fpu) addi r3,r3,THREAD /* want THREAD of task */ PPC_LL r6,THREAD_FPSAVEAREA(r3) PPC_LL r5,PT_REGS(r3) PPC_LCMPI 0,r6,0 bne 2f addi r6,r3,THREAD_FPSTATE -2: PPC_LCMPI 0,r5,0 - SAVE_32FPVSRS(0, R4, R6) +2: SAVE_32FPVSRS(0, R4, R6) mffs fr0 stfd fr0,FPSTATE_FPSCR(r6) - beq 1f - PPC_LL r4,_MSR-STACK_FRAME_OVERHEAD(r5) - li r3,MSR_FP|MSR_FE0|MSR_FE1 -#ifdef CONFIG_VSX -BEGIN_FTR_SECTION - oris r3,r3,MSR_VSX@h -END_FTR_SECTION_IFSET(CPU_FTR_VSX) -#endif - andc r4,r4,r3 /* disable FP for previous task */ - PPC_STL r4,_MSR-STACK_FRAME_OVERHEAD(r5) -1: blr /* diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index c602b67..51e246a 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -133,6 +133,16 @@ void __msr_check_and_clear(unsigned long bits) EXPORT_SYMBOL(__msr_check_and_clear); #ifdef CONFIG_PPC_FPU +void __giveup_fpu(struct task_struct *tsk) +{ + save_fpu(tsk); + tsk->thread.regs->msr &= ~MSR_FP; +#ifdef CONFIG_VSX + if (cpu_has_feature(CPU_FTR_VSX)) + tsk->thread.regs->msr &= ~MSR_VSX; +#endif +} + void giveup_fpu(struct task_struct *tsk) { check_if_tm_restore_required(tsk); @@ -413,12 +423,46 @@ void restore_math(struct pt_regs *regs) regs->msr = msr; } +void save_all(struct task_struct *tsk) +{ + unsigned long usermsr; + + if (!tsk->thread.regs) + return; + + usermsr = tsk->thread.regs->msr; + + if ((usermsr & msr_all_available) == 0) + return; + + msr_check_and_set(msr_all_available); + +#ifdef CONFIG_PPC_FPU + if (usermsr & MSR_FP) + save_fpu(tsk); +#endif +#ifdef CONFIG_ALTIVEC + if (usermsr & MSR_VEC) + __giveup_altivec(tsk); +#endif +#ifdef CONFIG_VSX + if (usermsr & MSR_VSX) + __giveup_vsx(tsk); +#endif +#ifdef CONFIG_SPE + 
if (usermsr & MSR_SPE) + __giveup_spe(tsk); +#endif + + msr_check_and_clear(msr_all_available); +} + void flush_all_to_thread(struct task_struct *tsk) { if (tsk->thread.regs) { preempt_disable(); BUG_ON(tsk != current); - giveup_all(tsk); + save_all(tsk); #ifdef CONFIG_SPE if (tsk->thread.regs->msr & MSR_SPE) -- 2.6.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 7/8] powerpc: Add the ability to save Altivec without giving it up 2015-11-18 3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur ` (5 preceding siblings ...) 2015-11-18 3:26 ` [PATCH 6/8] powerpc: Add the ability to save FPU without giving it up Cyril Bur @ 2015-11-18 3:26 ` Cyril Bur 2015-11-18 3:26 ` [PATCH 8/8] powerpc: Add the ability to save VSX " Cyril Bur 2015-11-18 14:51 ` [PATCH 0/8] FP/VEC/VSX switching optimisations David Laight 8 siblings, 0 replies; 23+ messages in thread From: Cyril Bur @ 2015-11-18 3:26 UTC (permalink / raw) To: mikey, anton, linuxppc-dev This patch adds the ability to save the VEC registers to the thread struct without giving them up (disabling the facility) the next time the process returns to userspace. This patch builds on a previous optimisation for the FPU registers in the thread copy path to avoid a possibly pointless reload of VEC state. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> --- arch/powerpc/include/asm/switch_to.h | 2 +- arch/powerpc/kernel/process.c | 12 +++++++++++- arch/powerpc/kernel/vector.S | 24 ++++-------------------- 3 files changed, 16 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h index 8cf7fd6..372f297 100644 --- a/arch/powerpc/include/asm/switch_to.h +++ b/arch/powerpc/include/asm/switch_to.h @@ -39,7 +39,7 @@ static inline void disable_kernel_fp(void) extern void enable_kernel_altivec(void); extern void flush_altivec_to_thread(struct task_struct *); extern void giveup_altivec(struct task_struct *); -extern void __giveup_altivec(struct task_struct *); +extern void save_altivec(struct task_struct *); static inline void disable_kernel_altivec(void) { msr_check_and_clear(MSR_VEC); diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 51e246a..19e803a 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -198,6 +198,16 @@ void enable_kernel_fp(void)
EXPORT_SYMBOL(enable_kernel_fp); #ifdef CONFIG_ALTIVEC +void __giveup_altivec(struct task_struct *tsk) +{ + save_altivec(tsk); + tsk->thread.regs->msr &= ~MSR_VEC; +#ifdef CONFIG_VSX + if (cpu_has_feature(CPU_FTR_VSX)) + tsk->thread.regs->msr &= ~MSR_VSX; +#endif +} + void giveup_altivec(struct task_struct *tsk) { check_if_tm_restore_required(tsk); @@ -443,7 +453,7 @@ void save_all(struct task_struct *tsk) #endif #ifdef CONFIG_ALTIVEC if (usermsr & MSR_VEC) - __giveup_altivec(tsk); + save_altivec(tsk); #endif #ifdef CONFIG_VSX if (usermsr & MSR_VSX) diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S index 038cff8..51b0c17 100644 --- a/arch/powerpc/kernel/vector.S +++ b/arch/powerpc/kernel/vector.S @@ -106,36 +106,20 @@ _GLOBAL(load_up_altivec) blr /* - * __giveup_altivec(tsk) - * Disable VMX for the task given as the argument, - * and save the vector registers in its thread_struct. + * save_altivec(tsk) + * Save the vector registers to its thread_struct */ -_GLOBAL(__giveup_altivec) +_GLOBAL(save_altivec) addi r3,r3,THREAD /* want THREAD of task */ PPC_LL r7,THREAD_VRSAVEAREA(r3) PPC_LL r5,PT_REGS(r3) PPC_LCMPI 0,r7,0 bne 2f addi r7,r3,THREAD_VRSTATE -2: PPC_LCMPI 0,r5,0 - SAVE_32VRS(0,r4,r7) +2: SAVE_32VRS(0,r4,r7) mfvscr v0 li r4,VRSTATE_VSCR stvx v0,r4,r7 - beq 1f - PPC_LL r4,_MSR-STACK_FRAME_OVERHEAD(r5) -#ifdef CONFIG_VSX -BEGIN_FTR_SECTION - lis r3,(MSR_VEC|MSR_VSX)@h -FTR_SECTION_ELSE - lis r3,MSR_VEC@h -ALT_FTR_SECTION_END_IFSET(CPU_FTR_VSX) -#else - lis r3,MSR_VEC@h -#endif - andc r4,r4,r3 /* disable FP for previous task */ - PPC_STL r4,_MSR-STACK_FRAME_OVERHEAD(r5) -1: blr #ifdef CONFIG_VSX -- 2.6.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 8/8] powerpc: Add the ability to save VSX without giving it up 2015-11-18 3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur ` (6 preceding siblings ...) 2015-11-18 3:26 ` [PATCH 7/8] powerpc: Add the ability to save Altivec " Cyril Bur @ 2015-11-18 3:26 ` Cyril Bur 2015-11-18 14:51 ` [PATCH 0/8] FP/VEC/VSX switching optimisations David Laight 8 siblings, 0 replies; 23+ messages in thread From: Cyril Bur @ 2015-11-18 3:26 UTC (permalink / raw) To: mikey, anton, linuxppc-dev This patch adds the ability to save the VSX registers to the thread struct without giving them up (disabling the facility) the next time the process returns to userspace. This patch builds on a previous optimisation for the FPU and VEC registers in the thread copy path to avoid a possibly pointless reload of VSX state. Signed-off-by: Cyril Bur <cyrilbur@gmail.com> --- arch/powerpc/include/asm/switch_to.h | 1 - arch/powerpc/kernel/ppc_ksyms.c | 4 ---- arch/powerpc/kernel/process.c | 23 ++++++++++++++++++----- arch/powerpc/kernel/vector.S | 17 ----------------- 4 files changed, 18 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h index 372f297..15843d3 100644 --- a/arch/powerpc/include/asm/switch_to.h +++ b/arch/powerpc/include/asm/switch_to.h @@ -50,7 +50,6 @@ static inline void disable_kernel_altivec(void) extern void enable_kernel_vsx(void); extern void flush_vsx_to_thread(struct task_struct *); extern void giveup_vsx(struct task_struct *); -extern void __giveup_vsx(struct task_struct *); static inline void disable_kernel_vsx(void) { msr_check_and_clear(MSR_FP|MSR_VEC|MSR_VSX); diff --git a/arch/powerpc/kernel/ppc_ksyms.c b/arch/powerpc/kernel/ppc_ksyms.c index 41e1607..ef7024da 100644 --- a/arch/powerpc/kernel/ppc_ksyms.c +++ b/arch/powerpc/kernel/ppc_ksyms.c @@ -28,10 +28,6 @@ EXPORT_SYMBOL(load_vr_state); EXPORT_SYMBOL(store_vr_state); #endif -#ifdef CONFIG_VSX -EXPORT_SYMBOL_GPL(__giveup_vsx);
-#endif - #ifdef CONFIG_EPAPR_PARAVIRT EXPORT_SYMBOL(epapr_hypercall_start); #endif diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 19e803a..e0bb3d7 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -248,20 +248,33 @@ EXPORT_SYMBOL_GPL(flush_altivec_to_thread); #endif /* CONFIG_ALTIVEC */ #ifdef CONFIG_VSX -void giveup_vsx(struct task_struct *tsk) +void __giveup_vsx(struct task_struct *tsk) { - check_if_tm_restore_required(tsk); - - msr_check_and_set(MSR_FP|MSR_VEC|MSR_VSX); if (tsk->thread.regs->msr & MSR_FP) __giveup_fpu(tsk); if (tsk->thread.regs->msr & MSR_VEC) __giveup_altivec(tsk); + tsk->thread.regs->msr &= ~MSR_VSX; +} + +void giveup_vsx(struct task_struct *tsk) +{ + check_if_tm_restore_required(tsk); + + msr_check_and_set(MSR_FP|MSR_VEC|MSR_VSX); __giveup_vsx(tsk); msr_check_and_clear(MSR_FP|MSR_VEC|MSR_VSX); } EXPORT_SYMBOL(giveup_vsx); +void save_vsx(struct task_struct *tsk) +{ + if (tsk->thread.regs->msr & MSR_FP) + save_fpu(tsk); + if (tsk->thread.regs->msr & MSR_VEC) + save_altivec(tsk); +} + void enable_kernel_vsx(void) { WARN_ON(preemptible()); @@ -457,7 +470,7 @@ void save_all(struct task_struct *tsk) #endif #ifdef CONFIG_VSX if (usermsr & MSR_VSX) - __giveup_vsx(tsk); + save_vsx(tsk); #endif #ifdef CONFIG_SPE if (usermsr & MSR_SPE) diff --git a/arch/powerpc/kernel/vector.S b/arch/powerpc/kernel/vector.S index 51b0c17..1c2e7a3 100644 --- a/arch/powerpc/kernel/vector.S +++ b/arch/powerpc/kernel/vector.S @@ -151,23 +151,6 @@ _GLOBAL(load_up_vsx) std r12,_MSR(r1) b fast_exception_return -/* - * __giveup_vsx(tsk) - * Disable VSX for the task given as the argument. - * Does NOT save vsx registers. 
- */ -_GLOBAL(__giveup_vsx) - addi r3,r3,THREAD /* want THREAD of task */ - ld r5,PT_REGS(r3) - cmpdi 0,r5,0 - beq 1f - ld r4,_MSR-STACK_FRAME_OVERHEAD(r5) - lis r3,MSR_VSX@h - andc r4,r4,r3 /* disable VSX for previous task */ - std r4,_MSR-STACK_FRAME_OVERHEAD(r5) -1: - blr - #endif /* CONFIG_VSX */ -- 2.6.2 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* RE: [PATCH 0/8] FP/VEC/VSX switching optimisations 2015-11-18 3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur ` (7 preceding siblings ...) 2015-11-18 3:26 ` [PATCH 8/8] powerpc: Add the ability to save VSX " Cyril Bur @ 2015-11-18 14:51 ` David Laight 2015-11-18 23:01 ` Cyril Bur 8 siblings, 1 reply; 23+ messages in thread From: David Laight @ 2015-11-18 14:51 UTC (permalink / raw) To: 'Cyril Bur', mikey@neuling.org, anton@samba.org, linuxppc-dev@ozlabs.org From: Cyril Bur > Sent: 18 November 2015 03:27 ... > The goal of these patches is to rework how the 'math' registers (FP, VEC > and VSX) are context switched. Currently the kernel adopts a lazy approach, > always switching userspace tasks with all three facilities disabled and > loads in each set of registers upon receiving each unavailable exception. > The kernel does try to avoid disabling the features in the syscall quick > path but it during testing it appears that even what should be a simple > syscall still causes the kernel to use some facilities (vectorised memcpy > for example) for its self and therefore disable it for the user task. Perhaps the kernel should be avoiding using these registers? I wonder if the gain from using vectorised memcpy is typically enough to warrant the cost of the save and restore? There may even be scope for kernel code doing a save/restore of a small number of registers onto an in-stack save area. It would need to be linked to the data of the thread that owns the fpu registers so that a save request could be honoured. Pre-emption would probably need to be disabled, but nested use, and use from ISR should be ok. David ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/8] FP/VEC/VSX switching optimisations 2015-11-18 14:51 ` [PATCH 0/8] FP/VEC/VSX switching optimisations David Laight @ 2015-11-18 23:01 ` Cyril Bur 0 siblings, 0 replies; 23+ messages in thread From: Cyril Bur @ 2015-11-18 23:01 UTC (permalink / raw) To: David Laight; +Cc: mikey@neuling.org, anton@samba.org, linuxppc-dev@ozlabs.org On Wed, 18 Nov 2015 14:51:25 +0000 David Laight <David.Laight@ACULAB.COM> wrote: > From: Cyril Bur > > Sent: 18 November 2015 03:27 > ... > > The goal of these patches is to rework how the 'math' registers (FP, VEC > > and VSX) are context switched. Currently the kernel adopts a lazy approach, > > always switching userspace tasks with all three facilities disabled and > > loads in each set of registers upon receiving each unavailable exception. > > The kernel does try to avoid disabling the features in the syscall quick > > path but it during testing it appears that even what should be a simple > > syscall still causes the kernel to use some facilities (vectorised memcpy > > for example) for its self and therefore disable it for the user task. > Hi David, > Perhaps the kernel should be avoiding using these registers? > I wonder if the gain from using vectorised memcpy is typically > enough to warrant the cost of the save and restore? > Yeah, on smaller copies that might be the way to go. > There may even be scope for kernel code doing a save/restore > of a small number of registers onto an in-stack save area. This has been thrown up in the air, there's also the volatile/non-volatiles to consider and the caveat that glibc doesn't quite respect the ABI here. As it turns out (and no one is more surprised than me), despite the other attempts at optimising, this series really has boiled down to removing the need for processes to take the facility unavailable interrupts. I do plan to carry on with optimising in this area and will have a look to see what I can do. 
Cyril > It would need to be linked to the data of the thread > that owns the fpu registers so that a save request could > be honoured. > Pre-emption would probably need to be disabled, but nested > use, and use from ISR should be ok. > > David > ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2015-11-23 3:20 UTC | newest] Thread overview: 23+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-11-18 3:26 [PATCH 0/8] FP/VEC/VSX switching optimisations Cyril Bur 2015-11-18 3:26 ` [PATCH 1/8] selftests/powerpc: Test the preservation of FPU and VMX regs across syscall Cyril Bur 2015-11-23 0:23 ` Michael Neuling 2015-11-23 0:58 ` Cyril Bur 2015-11-23 1:06 ` Michael Neuling 2015-11-18 3:26 ` [PATCH 2/8] selftests/powerpc: Test preservation of FPU and VMX regs across preemption Cyril Bur 2015-11-23 0:34 ` Michael Neuling 2015-11-18 3:26 ` [PATCH 3/8] selftests/powerpc: Test FPU and VMX regs in signal ucontext Cyril Bur 2015-11-19 11:36 ` [3/8] " Michael Ellerman 2015-11-23 1:04 ` [PATCH 3/8] " Michael Neuling 2015-11-18 3:26 ` [PATCH 4/8] powerpc: Explicitly disable math features when copying thread Cyril Bur 2015-11-23 1:08 ` Michael Neuling 2015-11-23 3:20 ` Cyril Bur 2015-11-18 3:26 ` [PATCH 5/8] powerpc: Restore FPU/VEC/VSX if previously used Cyril Bur 2015-11-20 11:01 ` Michael Ellerman 2015-11-22 22:18 ` Cyril Bur 2015-11-22 23:07 ` Michael Ellerman 2015-11-23 1:29 ` Michael Neuling 2015-11-18 3:26 ` [PATCH 6/8] powerpc: Add the ability to save FPU without giving it up Cyril Bur 2015-11-18 3:26 ` [PATCH 7/8] powerpc: Add the ability to save Altivec " Cyril Bur 2015-11-18 3:26 ` [PATCH 8/8] powerpc: Add the ability to save VSX " Cyril Bur 2015-11-18 14:51 ` [PATCH 0/8] FP/VEC/VSX switching optimisations David Laight 2015-11-18 23:01 ` Cyril Bur
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).