[PATCH 0/6] FP improvements

* [PATCH 0/6] FP improvements
From: Paul Burton @ 2013-11-07 12:48 UTC
  To: linux-mips; +Cc: Paul Burton

This series includes a few improvements to floating point support. The
first 2 patches add support for missing instructions to the FPU
emulator. The 3rd is a small cleanup. The 4th introduces support for
O32 binaries using 64-bit floating point. The 5th modifies the FPU
emulator to stop executing code from the user stack. The 6th & final
patch is not strictly FP-related but is a consequence of the 5th patch,
and allows us to mark the stack & allocated heap memory as
non-executable by default.

Leonid Yegoshin (1):
  mips: mfhc1 & mthc1 support for the FPU emulator

Paul Burton (4):
  mips: remove unused {en,dis}able_fpu macros
  mips: support for 64-bit FP with O32 binaries
  mips: use per-mm page to execute FP branch delay slots
  mips: non-exec stack & heap when non-exec PT_GNU_STACK is present

Steven J. Hill (1):
  mips: microMIPS: mfhc1 & mthc1 support for the FPU emulator

 arch/mips/Kconfig                    |  17 ++
 arch/mips/include/asm/asmmacro-32.h  |  42 -----
 arch/mips/include/asm/asmmacro-64.h  |  96 ----------
 arch/mips/include/asm/asmmacro.h     | 107 +++++++++++
 arch/mips/include/asm/elf.h          |  22 ++-
 arch/mips/include/asm/fpu.h          | 104 ++++++++---
 arch/mips/include/asm/fpu_emulator.h |   2 +
 arch/mips/include/asm/mmu.h          |  12 ++
 arch/mips/include/asm/mmu_context.h  |   7 +
 arch/mips/include/asm/page.h         |   6 +-
 arch/mips/include/asm/processor.h    |   7 +-
 arch/mips/include/asm/thread_info.h  |   6 +-
 arch/mips/include/uapi/asm/inst.h    |   7 +-
 arch/mips/kernel/Makefile            |   7 +-
 arch/mips/kernel/cpu-probe.c         |   2 +-
 arch/mips/kernel/elf.c               |  28 +++
 arch/mips/kernel/entry.S             |  13 +-
 arch/mips/kernel/process.c           |   5 +-
 arch/mips/kernel/ptrace.c            |   8 +-
 arch/mips/kernel/ptrace32.c          |   4 +-
 arch/mips/kernel/r4k_fpu.S           |  74 +++++++-
 arch/mips/kernel/r4k_switch.S        |  45 ++++-
 arch/mips/kernel/signal.c            |  10 +-
 arch/mips/kernel/signal32.c          |  10 +-
 arch/mips/kernel/traps.c             |  20 +-
 arch/mips/kernel/vdso.c              |   2 +-
 arch/mips/math-emu/cp1emu.c          |  37 +++-
 arch/mips/math-emu/dsemul.c          | 346 ++++++++++++++++++++++++++---------
 arch/mips/math-emu/kernel_linkage.c  |   6 +-
 29 files changed, 743 insertions(+), 309 deletions(-)
 create mode 100644 arch/mips/kernel/elf.c

-- 
1.8.4.1

* [PATCH 1/6] mips: mfhc1 & mthc1 support for the FPU emulator
From: Paul Burton @ 2013-11-07 12:48 UTC
  To: linux-mips; +Cc: Leonid Yegoshin, Steven J. Hill, Paul Burton

From: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>

This patch adds support for the mfhc1 & mthc1 instructions to the FPU
emulator. These instructions were introduced in release 2 of the mips32
& mips64 architectures, and allow access to the most significant 32 bits
of a 64-bit FP register.
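
As a rough illustration of what the emulator has to do for these two
instructions, the sketch below models a 64-bit FP register as a plain
uint64_t. The helper names fpr_read_high() and fpr_write_high() are made
up for this example and are not part of the patch, which expresses the
same idea with the SIFROMHREG and SITOHREG macros.

```c
#include <stdint.h>

/* Hypothetical helper: the value mfhc1 moves out, i.e. the upper 32 bits. */
static uint32_t fpr_read_high(uint64_t fpr)
{
	return (uint32_t)(fpr >> 32);
}

/*
 * Hypothetical helper: the effect of mthc1, i.e. replace the upper
 * 32 bits of the register while leaving the lower 32 bits untouched.
 */
static uint64_t fpr_write_high(uint64_t fpr, uint32_t hi)
{
	return (fpr & 0xffffffffULL) | ((uint64_t)hi << 32);
}
```

Note that SIFROMHREG in the patch additionally casts the result to int
before storing it in the general-purpose register; the sketch leaves
that step out.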

Signed-off-by: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
 arch/mips/include/uapi/asm/inst.h |  5 +++--
 arch/mips/math-emu/cp1emu.c       | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/mips/include/uapi/asm/inst.h b/arch/mips/include/uapi/asm/inst.h
index e5a676e..0ee9656 100644
--- a/arch/mips/include/uapi/asm/inst.h
+++ b/arch/mips/include/uapi/asm/inst.h
@@ -98,8 +98,9 @@ enum rt_op {
  */
 enum cop_op {
 	mfc_op	      = 0x00, dmfc_op	    = 0x01,
-	cfc_op	      = 0x02, mtc_op	    = 0x04,
-	dmtc_op	      = 0x05, ctc_op	    = 0x06,
+	cfc_op	      = 0x02, mfhc_op	    = 0x03,
+	mtc_op        = 0x04, dmtc_op	    = 0x05,
+	ctc_op	      = 0x06, mthc_op	    = 0x07,
 	bc_op	      = 0x08, cop_op	    = 0x10,
 	copm_op	      = 0x18
 };
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index efe0088..20a51d0 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -878,6 +878,10 @@ static inline int cop1_64bit(struct pt_regs *xcp)
 			ctx->fpr[x & ~1] >> 32 << 32 | (u32)(si) : \
 			ctx->fpr[x & ~1] << 32 >> 32 | (u64)(si) << 32)
 
+#define SIFROMHREG(si, x)	((si) = (int)(ctx->fpr[x] >> 32))
+#define SITOHREG(si, x)		(ctx->fpr[x] = \
+				ctx->fpr[x] << 32 >> 32 | (u64)(si) << 32)
+
 #define DIFROMREG(di, x) ((di) = ctx->fpr[x & ~(cop1_64bit(xcp) == 0)])
 #define DITOREG(di, x)	(ctx->fpr[x & ~(cop1_64bit(xcp) == 0)] = (di))
 
@@ -1055,6 +1059,21 @@ static int cop1Emulate(struct pt_regs *xcp, struct mips_fpu_struct *ctx,
 			break;
 #endif
 
+#ifdef CONFIG_CPU_MIPSR2
+		case mfhc_op:
+			/* copregister rd -> gpr[rt] */
+			if (MIPSInst_RT(ir) != 0) {
+				SIFROMHREG(xcp->regs[MIPSInst_RT(ir)],
+					MIPSInst_RD(ir));
+			}
+			break;
+
+		case mthc_op:
+			/* copregister rd <- gpr[rt] */
+			SITOHREG(xcp->regs[MIPSInst_RT(ir)], MIPSInst_RD(ir));
+			break;
+#endif
+
 		case mfc_op:
 			/* copregister rd -> gpr[rt] */
 			if (MIPSInst_RT(ir) != 0) {
-- 
1.8.4.1

* [PATCH 2/6] mips: microMIPS: mfhc1 & mthc1 support for the FPU emulator
From: Paul Burton @ 2013-11-07 12:48 UTC
  To: linux-mips; +Cc: Steven J. Hill, Paul Burton

From: "Steven J. Hill" <Steven.Hill@imgtec.com>

This patch adds support for microMIPS encodings of the mfhc1 & mthc1
instructions introduced in release 2 of the mips32 & mips64
architectures, converting them to their mips32 equivalents for the FPU
emulator.
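
The conversion described above amounts to a small opcode mapping. The
sketch below covers only the two new instructions, using the opcode
values that appear in the inst.h changes of this series; the function
name and the -1 return convention are assumptions made for the example,
and the real converter handles many more encodings.

```c
/* microMIPS minor opcodes added by this patch. */
enum { MM_MFHC1_OP = 0xc0, MM_MTHC1_OP = 0xe0 };

/* MIPS32 cop_op values added by the previous patch. */
enum { MFHC_OP = 0x03, MTHC_OP = 0x07 };

/*
 * Hypothetical helper: translate the microMIPS minor opcode of
 * mfhc1/mthc1 into the MIPS32 cop_op field. Returns -1 for any
 * opcode this sketch does not know about.
 */
static int mm_high_move_to_mips32(unsigned int mm_op)
{
	switch (mm_op) {
	case MM_MFHC1_OP:
		return MFHC_OP;
	case MM_MTHC1_OP:
		return MTHC_OP;
	default:
		return -1;
	}
}
```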

Signed-off-by: Steven J. Hill <Steven.Hill@imgtec.com>
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
 arch/mips/include/uapi/asm/inst.h | 2 ++
 arch/mips/math-emu/cp1emu.c       | 8 +++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/mips/include/uapi/asm/inst.h b/arch/mips/include/uapi/asm/inst.h
index 0ee9656..b39ba25 100644
--- a/arch/mips/include/uapi/asm/inst.h
+++ b/arch/mips/include/uapi/asm/inst.h
@@ -398,8 +398,10 @@ enum mm_32f_73_minor_op {
 	mm_movt1_op = 0xa5,
 	mm_ftruncw_op = 0xac,
 	mm_fneg1_op = 0xad,
+	mm_mfhc1_op = 0xc0,
 	mm_froundl_op = 0xcc,
 	mm_fcvtd1_op = 0xcd,
+	mm_mthc1_op = 0xe0,
 	mm_froundw_op = 0xec,
 	mm_fcvts1_op = 0xed,
 };
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 20a51d0..4b37961 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -417,14 +417,20 @@ static int microMIPS32_to_MIPS32(union mips_instruction *insn_ptr)
 			case mm_mtc1_op:
 			case mm_cfc1_op:
 			case mm_ctc1_op:
+			case mm_mfhc1_op:
+			case mm_mthc1_op:
 				if (insn.mm_fp1_format.op == mm_mfc1_op)
 					op = mfc_op;
 				else if (insn.mm_fp1_format.op == mm_mtc1_op)
 					op = mtc_op;
 				else if (insn.mm_fp1_format.op == mm_cfc1_op)
 					op = cfc_op;
-				else
+				else if (insn.mm_fp1_format.op == mm_ctc1_op)
 					op = ctc_op;
+				else if (insn.mm_fp1_format.op == mm_mfhc1_op)
+					op = mfhc_op;
+				else
+					op = mthc_op;
 				mips32_insn.fp1_format.opcode = cop1_op;
 				mips32_insn.fp1_format.op = op;
 				mips32_insn.fp1_format.rt =
-- 
1.8.4.1

* [PATCH 3/6] mips: remove unused {en,dis}able_fpu macros
From: Paul Burton @ 2013-11-07 12:48 UTC
  To: linux-mips; +Cc: Paul Burton

These macros are not used anywhere in the kernel. Remove them.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
 arch/mips/include/asm/fpu.h | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/arch/mips/include/asm/fpu.h b/arch/mips/include/asm/fpu.h
index d088e5d..3bf023f 100644
--- a/arch/mips/include/asm/fpu.h
+++ b/arch/mips/include/asm/fpu.h
@@ -45,19 +45,6 @@ do {									\
 	disable_fpu_hazard();						\
 } while (0)
 
-#define enable_fpu()							\
-do {									\
-	if (cpu_has_fpu)						\
-		__enable_fpu();						\
-} while (0)
-
-#define disable_fpu()							\
-do {									\
-	if (cpu_has_fpu)						\
-		__disable_fpu();					\
-} while (0)
-
-
 #define clear_fpu_owner()	clear_thread_flag(TIF_USEDFPU)
 
 static inline int __is_fpu_owner(void)
-- 
1.8.4.1

* [PATCH 4/6] mips: support for 64-bit FP with O32 binaries
From: Paul Burton @ 2013-11-07 12:48 UTC
  To: linux-mips; +Cc: Paul Burton

CPUs implementing mips32r2 may include a 64-bit FPU, just as mips64 CPUs
do. To preserve backwards compatibility, a 64-bit FPU acts like a 32-bit
FPU (accessing doubles from the least significant 32 bits of an even-odd
pair of FP registers) when the Status.FR bit is zero, again just like on
a mips64 CPU. The standard O32 ABI is defined expecting a 32-bit FPU;
however, recent toolchains support use of a 64-bit FPU from an O32
mips32 executable. When an ELF executable is built to use a 64-bit FPU,
a new flag (EF_MIPS_FP64) is set in the ELF header.
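
To make the two register models concrete, here is a minimal sketch that
treats the FP register file as 32 slots of 64 bits each. The struct and
function names are hypothetical, and the sketch assumes that with FR = 0
the even register of a pair supplies the low word of a double and the
odd register the high word.

```c
#include <stdint.h>

/* Simplified model of the FP register file: 32 registers, 64 bits each. */
struct fp_regs {
	uint64_t fpr[32];
};

/*
 * Hypothetical helper: return the 64-bit pattern that a double-precision
 * instruction would see in register `reg`.
 *
 * fr == 0: the value is assembled from the low 32 bits of an even/odd
 *          pair (assumed: even register = low word, odd = high word).
 * fr != 0: each register holds a full 64-bit value on its own.
 */
static uint64_t read_double_bits(const struct fp_regs *r, unsigned int reg,
				 int fr)
{
	if (fr)
		return r->fpr[reg & 31];

	reg &= 30;	/* force an even register number below 32 */
	return (r->fpr[reg] & 0xffffffffULL) |
	       ((r->fpr[reg + 1] & 0xffffffffULL) << 32);
}
```

With FR = 0 the upper halves of the 64-bit slots play no part in the
result, which is why code built for a 32-bit FPU keeps working on a
64-bit one.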

With this patch the kernel checks the EF_MIPS_FP64 flag when executing
an O32 binary and sets Status.FR accordingly. The addition of O32 64-bit
FP support lessens the opportunity for optimisation in the FPU emulator,
so a CONFIG_MIPS_O32_FP64_SUPPORT Kconfig option is introduced to allow
this support to be disabled by those who don't require it.
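
The decision itself is a single bit test on the ELF header flags. The
sketch below shows it in isolation; the enum and function names are
invented for the example, and only the EF_MIPS_FP64 value is taken from
this patch.

```c
#include <stdint.h>

#define EF_MIPS_FP64	0x00000200	/* value as defined in this patch */

/* Mirrors the idea of enum fpu_mode: the value equals Status.FR. */
enum sketch_fpu_mode {
	SKETCH_FPU_32BIT = 0,	/* FR = 0 */
	SKETCH_FPU_64BIT = 1,	/* FR = 1 */
};

/* Hypothetical helper: pick the FPU mode for an O32 binary from e_flags. */
static enum sketch_fpu_mode o32_fpu_mode(uint32_t e_flags)
{
	return (e_flags & EF_MIPS_FP64) ? SKETCH_FPU_64BIT : SKETCH_FPU_32BIT;
}
```

In the patch the outcome is not returned directly: SET_PERSONALITY
records it in the TIF_32BIT_FPREGS thread flag, and __own_fpu() later
turns that flag into the mode passed to __enable_fpu().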

Inspired by an earlier patch by Leonid Yegoshin, but implemented more
cleanly & correctly.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
 arch/mips/Kconfig                   |  17 ++++++
 arch/mips/include/asm/asmmacro-32.h |  42 --------------
 arch/mips/include/asm/asmmacro-64.h |  96 --------------------------------
 arch/mips/include/asm/asmmacro.h    | 107 ++++++++++++++++++++++++++++++++++++
 arch/mips/include/asm/elf.h         |  17 +++++-
 arch/mips/include/asm/fpu.h         |  91 +++++++++++++++++++++++++-----
 arch/mips/include/asm/thread_info.h |   4 +-
 arch/mips/kernel/cpu-probe.c        |   2 +-
 arch/mips/kernel/process.c          |   3 -
 arch/mips/kernel/ptrace.c           |   8 +--
 arch/mips/kernel/ptrace32.c         |   4 +-
 arch/mips/kernel/r4k_fpu.S          |  74 +++++++++++++++++++++++--
 arch/mips/kernel/r4k_switch.S       |  45 ++++++++++++++-
 arch/mips/kernel/signal.c           |  10 ++--
 arch/mips/kernel/signal32.c         |  10 ++--
 arch/mips/kernel/traps.c            |  20 +++++--
 arch/mips/math-emu/cp1emu.c         |  10 ++--
 arch/mips/math-emu/kernel_linkage.c |   6 +-
 18 files changed, 373 insertions(+), 193 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 17cc7ff..aa2e03a 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2335,6 +2335,23 @@ config CC_STACKPROTECTOR
 
 	  This feature requires gcc version 4.2 or above.
 
+config MIPS_O32_FP64_SUPPORT
+	bool "Support for O32 binaries using 64-bit FP"
+	depends on (32BIT && CPU_MIPSR2) || MIPS32_O32
+	default y
+	help
+	  When this is enabled, the kernel will support use of 64-bit floating
+	  point registers with binaries using the O32 ABI along with the
+	  EF_MIPS_FP64 ELF header flag (typically built with -mfp64). On
+	  mips32 systems this support is at the cost of increasing the size
+	  and complexity of the compiled FPU emulator. Thus if you are running
+	  a mips32 system and know that none of your userland binaries will
+	  require 64-bit floating point, you may wish to reduce the size of
+	  your kernel & potentially improve FP emulation performance by saying
+	  N here.
+
+	  If unsure, say Y.
+
 config USE_OF
 	bool
 	select OF
diff --git a/arch/mips/include/asm/asmmacro-32.h b/arch/mips/include/asm/asmmacro-32.h
index 2413afe..70e1f17 100644
--- a/arch/mips/include/asm/asmmacro-32.h
+++ b/arch/mips/include/asm/asmmacro-32.h
@@ -12,27 +12,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_double thread status tmp1=t0
-	cfc1	\tmp1,  fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp1, THREAD_FCR31(\thread)
-	.endm
-
 	.macro	fpu_save_single thread tmp=t0
 	cfc1	\tmp,  fcr31
 	swc1	$f0,  THREAD_FPR0(\thread)
@@ -70,27 +49,6 @@
 	sw	\tmp, THREAD_FCR31(\thread)
 	.endm
 
-	.macro	fpu_restore_double thread status tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
 	.macro	fpu_restore_single thread tmp=t0
 	lw	\tmp, THREAD_FCR31(\thread)
 	lwc1	$f0,  THREAD_FPR0(\thread)
diff --git a/arch/mips/include/asm/asmmacro-64.h b/arch/mips/include/asm/asmmacro-64.h
index 08a527d..38ea609 100644
--- a/arch/mips/include/asm/asmmacro-64.h
+++ b/arch/mips/include/asm/asmmacro-64.h
@@ -13,102 +13,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_16even thread tmp=t0
-	cfc1	\tmp, fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp, THREAD_FCR31(\thread)
-	.endm
-
-	.macro	fpu_save_16odd thread
-	sdc1	$f1,  THREAD_FPR1(\thread)
-	sdc1	$f3,  THREAD_FPR3(\thread)
-	sdc1	$f5,  THREAD_FPR5(\thread)
-	sdc1	$f7,  THREAD_FPR7(\thread)
-	sdc1	$f9,  THREAD_FPR9(\thread)
-	sdc1	$f11, THREAD_FPR11(\thread)
-	sdc1	$f13, THREAD_FPR13(\thread)
-	sdc1	$f15, THREAD_FPR15(\thread)
-	sdc1	$f17, THREAD_FPR17(\thread)
-	sdc1	$f19, THREAD_FPR19(\thread)
-	sdc1	$f21, THREAD_FPR21(\thread)
-	sdc1	$f23, THREAD_FPR23(\thread)
-	sdc1	$f25, THREAD_FPR25(\thread)
-	sdc1	$f27, THREAD_FPR27(\thread)
-	sdc1	$f29, THREAD_FPR29(\thread)
-	sdc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_save_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 2f
-	fpu_save_16odd \thread
-2:
-	fpu_save_16even \thread \tmp
-	.endm
-
-	.macro	fpu_restore_16even thread tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
-	.macro	fpu_restore_16odd thread
-	ldc1	$f1,  THREAD_FPR1(\thread)
-	ldc1	$f3,  THREAD_FPR3(\thread)
-	ldc1	$f5,  THREAD_FPR5(\thread)
-	ldc1	$f7,  THREAD_FPR7(\thread)
-	ldc1	$f9,  THREAD_FPR9(\thread)
-	ldc1	$f11, THREAD_FPR11(\thread)
-	ldc1	$f13, THREAD_FPR13(\thread)
-	ldc1	$f15, THREAD_FPR15(\thread)
-	ldc1	$f17, THREAD_FPR17(\thread)
-	ldc1	$f19, THREAD_FPR19(\thread)
-	ldc1	$f21, THREAD_FPR21(\thread)
-	ldc1	$f23, THREAD_FPR23(\thread)
-	ldc1	$f25, THREAD_FPR25(\thread)
-	ldc1	$f27, THREAD_FPR27(\thread)
-	ldc1	$f29, THREAD_FPR29(\thread)
-	ldc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_restore_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 1f				# 16 register mode?
-
-	fpu_restore_16odd \thread
-1:	fpu_restore_16even \thread \tmp
-	.endm
-
 	.macro	cpu_save_nonscratch thread
 	LONG_S	s0, THREAD_REG16(\thread)
 	LONG_S	s1, THREAD_REG17(\thread)
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index 6c8342a..3220c93 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -62,6 +62,113 @@
 	.endm
 #endif /* CONFIG_MIPS_MT_SMTC */
 
+	.macro	fpu_save_16even thread tmp=t0
+	cfc1	\tmp, fcr31
+	sdc1	$f0,  THREAD_FPR0(\thread)
+	sdc1	$f2,  THREAD_FPR2(\thread)
+	sdc1	$f4,  THREAD_FPR4(\thread)
+	sdc1	$f6,  THREAD_FPR6(\thread)
+	sdc1	$f8,  THREAD_FPR8(\thread)
+	sdc1	$f10, THREAD_FPR10(\thread)
+	sdc1	$f12, THREAD_FPR12(\thread)
+	sdc1	$f14, THREAD_FPR14(\thread)
+	sdc1	$f16, THREAD_FPR16(\thread)
+	sdc1	$f18, THREAD_FPR18(\thread)
+	sdc1	$f20, THREAD_FPR20(\thread)
+	sdc1	$f22, THREAD_FPR22(\thread)
+	sdc1	$f24, THREAD_FPR24(\thread)
+	sdc1	$f26, THREAD_FPR26(\thread)
+	sdc1	$f28, THREAD_FPR28(\thread)
+	sdc1	$f30, THREAD_FPR30(\thread)
+	sw	\tmp, THREAD_FCR31(\thread)
+	.endm
+
+	.macro	fpu_save_16odd thread
+	.set	push
+	.set	mips64r2
+	sdc1	$f1,  THREAD_FPR1(\thread)
+	sdc1	$f3,  THREAD_FPR3(\thread)
+	sdc1	$f5,  THREAD_FPR5(\thread)
+	sdc1	$f7,  THREAD_FPR7(\thread)
+	sdc1	$f9,  THREAD_FPR9(\thread)
+	sdc1	$f11, THREAD_FPR11(\thread)
+	sdc1	$f13, THREAD_FPR13(\thread)
+	sdc1	$f15, THREAD_FPR15(\thread)
+	sdc1	$f17, THREAD_FPR17(\thread)
+	sdc1	$f19, THREAD_FPR19(\thread)
+	sdc1	$f21, THREAD_FPR21(\thread)
+	sdc1	$f23, THREAD_FPR23(\thread)
+	sdc1	$f25, THREAD_FPR25(\thread)
+	sdc1	$f27, THREAD_FPR27(\thread)
+	sdc1	$f29, THREAD_FPR29(\thread)
+	sdc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_save_double thread status tmp
+#if defined(CONFIG_MIPS64) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f
+	fpu_save_16odd \thread
+10:
+#endif
+	fpu_save_16even \thread \tmp
+	.endm
+
+	.macro	fpu_restore_16even thread tmp=t0
+	lw	\tmp, THREAD_FCR31(\thread)
+	ldc1	$f0,  THREAD_FPR0(\thread)
+	ldc1	$f2,  THREAD_FPR2(\thread)
+	ldc1	$f4,  THREAD_FPR4(\thread)
+	ldc1	$f6,  THREAD_FPR6(\thread)
+	ldc1	$f8,  THREAD_FPR8(\thread)
+	ldc1	$f10, THREAD_FPR10(\thread)
+	ldc1	$f12, THREAD_FPR12(\thread)
+	ldc1	$f14, THREAD_FPR14(\thread)
+	ldc1	$f16, THREAD_FPR16(\thread)
+	ldc1	$f18, THREAD_FPR18(\thread)
+	ldc1	$f20, THREAD_FPR20(\thread)
+	ldc1	$f22, THREAD_FPR22(\thread)
+	ldc1	$f24, THREAD_FPR24(\thread)
+	ldc1	$f26, THREAD_FPR26(\thread)
+	ldc1	$f28, THREAD_FPR28(\thread)
+	ldc1	$f30, THREAD_FPR30(\thread)
+	ctc1	\tmp, fcr31
+	.endm
+
+	.macro	fpu_restore_16odd thread
+	.set	push
+	.set	mips64r2
+	ldc1	$f1,  THREAD_FPR1(\thread)
+	ldc1	$f3,  THREAD_FPR3(\thread)
+	ldc1	$f5,  THREAD_FPR5(\thread)
+	ldc1	$f7,  THREAD_FPR7(\thread)
+	ldc1	$f9,  THREAD_FPR9(\thread)
+	ldc1	$f11, THREAD_FPR11(\thread)
+	ldc1	$f13, THREAD_FPR13(\thread)
+	ldc1	$f15, THREAD_FPR15(\thread)
+	ldc1	$f17, THREAD_FPR17(\thread)
+	ldc1	$f19, THREAD_FPR19(\thread)
+	ldc1	$f21, THREAD_FPR21(\thread)
+	ldc1	$f23, THREAD_FPR23(\thread)
+	ldc1	$f25, THREAD_FPR25(\thread)
+	ldc1	$f27, THREAD_FPR27(\thread)
+	ldc1	$f29, THREAD_FPR29(\thread)
+	ldc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_restore_double thread status tmp
+#if defined(CONFIG_MIPS64) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f				# 16 register mode?
+
+	fpu_restore_16odd \thread
+10:
+#endif
+	fpu_restore_16even \thread \tmp
+	.endm
+
 /*
  * Temporary until all gas have MT ASE support
  */
diff --git a/arch/mips/include/asm/elf.h b/arch/mips/include/asm/elf.h
index a66359e..17163cf 100644
--- a/arch/mips/include/asm/elf.h
+++ b/arch/mips/include/asm/elf.h
@@ -36,6 +36,7 @@
 #define EF_MIPS_ABI2		0x00000020
 #define EF_MIPS_OPTIONS_FIRST	0x00000080
 #define EF_MIPS_32BITMODE	0x00000100
+#define EF_MIPS_FP64		0x00000200
 #define EF_MIPS_ABI		0x0000f000
 #define EF_MIPS_ARCH		0xf0000000
 
@@ -249,6 +250,11 @@ extern struct mips_abi mips_abi_n32;
 
 #define SET_PERSONALITY(ex)						\
 do {									\
+	if ((ex).e_flags & EF_MIPS_FP64)				\
+		clear_thread_flag(TIF_32BIT_FPREGS);			\
+	else								\
+		set_thread_flag(TIF_32BIT_FPREGS);			\
+									\
 	if (personality(current->personality) != PER_LINUX)		\
 		set_personality(PER_LINUX);				\
 									\
@@ -271,14 +277,18 @@ do {									\
 #endif
 
 #ifdef CONFIG_MIPS32_O32
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do {								\
 		set_thread_flag(TIF_32BIT_REGS);			\
 		set_thread_flag(TIF_32BIT_ADDR);			\
+									\
+		if (!((ex).e_flags & EF_MIPS_FP64))			\
+			set_thread_flag(TIF_32BIT_FPREGS);		\
+									\
 		current->thread.abi = &mips_abi_32;			\
 	} while (0)
 #else
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do { } while (0)
 #endif
 
@@ -289,7 +299,7 @@ do {									\
 	     ((ex).e_flags & EF_MIPS_ABI) == 0)				\
 		__SET_PERSONALITY32_N32();				\
 	else								\
-		__SET_PERSONALITY32_O32();				\
+		__SET_PERSONALITY32_O32(ex);                            \
 } while (0)
 #else
 #define __SET_PERSONALITY32(ex) do { } while (0)
@@ -300,6 +310,7 @@ do {									\
 	unsigned int p;							\
 									\
 	clear_thread_flag(TIF_32BIT_REGS);				\
+	clear_thread_flag(TIF_32BIT_FPREGS);				\
 	clear_thread_flag(TIF_32BIT_ADDR);				\
 									\
 	if ((ex).e_ident[EI_CLASS] == ELFCLASS32)			\
diff --git a/arch/mips/include/asm/fpu.h b/arch/mips/include/asm/fpu.h
index 3bf023f..cfe092f 100644
--- a/arch/mips/include/asm/fpu.h
+++ b/arch/mips/include/asm/fpu.h
@@ -33,11 +33,48 @@ extern void _init_fpu(void);
 extern void _save_fp(struct task_struct *);
 extern void _restore_fp(struct task_struct *);
 
-#define __enable_fpu()							\
-do {									\
-	set_c0_status(ST0_CU1);						\
-	enable_fpu_hazard();						\
-} while (0)
+/*
+ * This enum specifies a mode in which we want the FPU to operate, for cores
+ * which implement the Status.FR bit. Note that FPU_32BIT & FPU_64BIT
+ * purposefully have the values 0 & 1 respectively, so that an integer value
+ * of Status.FR can be trivially cast to the corresponding enum fpu_mode.
+ */
+enum fpu_mode {
+	FPU_32BIT = 0,		/* FR = 0 */
+	FPU_64BIT,		/* FR = 1 */
+	FPU_AS_IS,
+};
+
+static inline int __enable_fpu(enum fpu_mode mode)
+{
+	int fr;
+
+	switch (mode) {
+	case FPU_AS_IS:
+		/* just enable the FPU in its current mode */
+		set_c0_status(ST0_CU1);
+		enable_fpu_hazard();
+		return 0;
+
+	case FPU_64BIT:
+#if !(defined(CONFIG_CPU_MIPS32_R2) || defined(CONFIG_MIPS64))
+		/* we only have a 32-bit FPU */
+		return SIGFPE;
+#endif
+		/* fall through */
+	case FPU_32BIT:
+		/* set CU1 & change FR appropriately */
+		fr = (int)mode;
+		change_c0_status(ST0_CU1 | ST0_FR, ST0_CU1 | (fr ? ST0_FR : 0));
+		enable_fpu_hazard();
+
+		/* check FR has the desired value */
+		return (!!(read_c0_status() & ST0_FR) == !!fr) ? 0 : SIGFPE;
+
+	default:
+		BUG();
+	}
+}
 
 #define __disable_fpu()							\
 do {									\
@@ -57,27 +94,46 @@ static inline int is_fpu_owner(void)
 	return cpu_has_fpu && __is_fpu_owner();
 }
 
-static inline void __own_fpu(void)
+static inline int __own_fpu(void)
 {
-	__enable_fpu();
+	enum fpu_mode mode;
+	int ret;
+
+	mode = !test_thread_flag(TIF_32BIT_FPREGS);
+	ret = __enable_fpu(mode);
+	if (ret)
+		return ret;
+
 	KSTK_STATUS(current) |= ST0_CU1;
+	if (mode == FPU_64BIT)
+		KSTK_STATUS(current) |= ST0_FR;
+	else /* mode == FPU_32BIT */
+		KSTK_STATUS(current) &= ~ST0_FR;
+
 	set_thread_flag(TIF_USEDFPU);
+	return 0;
 }
 
-static inline void own_fpu_inatomic(int restore)
+static inline int own_fpu_inatomic(int restore)
 {
+	int ret = 0;
+
 	if (cpu_has_fpu && !__is_fpu_owner()) {
-		__own_fpu();
-		if (restore)
+		ret = __own_fpu();
+		if (restore && !ret)
 			_restore_fp(current);
 	}
+	return ret;
 }
 
-static inline void own_fpu(int restore)
+static inline int own_fpu(int restore)
 {
+	int ret;
+
 	preempt_disable();
-	own_fpu_inatomic(restore);
+	ret = own_fpu_inatomic(restore);
 	preempt_enable();
+	return ret;
 }
 
 static inline void lose_fpu(int save)
@@ -93,16 +149,21 @@ static inline void lose_fpu(int save)
 	preempt_enable();
 }
 
-static inline void init_fpu(void)
+static inline int init_fpu(void)
 {
+	int ret = 0;
+
 	preempt_disable();
 	if (cpu_has_fpu) {
-		__own_fpu();
-		_init_fpu();
+		ret = __own_fpu();
+		if (!ret)
+			_init_fpu();
 	} else {
 		fpu_emulator_init_fpu();
 	}
+
 	preempt_enable();
+	return ret;
 }
 
 static inline void save_fp(struct task_struct *tsk)
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index f9b24bf..b6da8b7 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -112,11 +112,12 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
-#define TIF_32BIT_REGS		22	/* also implies 16/32 fprs */
+#define TIF_32BIT_REGS		22	/* 32-bit general purpose registers */
 #define TIF_32BIT_ADDR		23	/* 32-bit address space (o32/n32) */
 #define TIF_FPUBOUND		24	/* thread bound to FPU-full CPU set */
 #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
 #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
+#define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -133,6 +134,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT_ADDR		(1<<TIF_32BIT_ADDR)
 #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
 #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
+#define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 
 #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c
index 8168e29..116102c 100644
--- a/arch/mips/kernel/cpu-probe.c
+++ b/arch/mips/kernel/cpu-probe.c
@@ -112,7 +112,7 @@ static inline unsigned long cpu_get_fpu_id(void)
 	unsigned long tmp, fpu_id;
 
 	tmp = read_c0_status();
-	__enable_fpu();
+	__enable_fpu(FPU_AS_IS);
 	fpu_id = read_32bit_cp1_register(CP1_REVISION);
 	write_c0_status(tmp);
 	return fpu_id;
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index ddc7610..747a6cf 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -60,9 +60,6 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
 
 	/* New thread loses kernel privileges. */
 	status = regs->cp0_status & ~(ST0_CU0|ST0_CU1|ST0_FR|KU_MASK);
-#ifdef CONFIG_64BIT
-	status |= test_thread_flag(TIF_32BIT_REGS) ? 0 : ST0_FR;
-#endif
 	status |= KU_USER;
 	regs->cp0_status = status;
 	clear_used_math();
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index b52e1d2..30b1a43 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -137,13 +137,13 @@ int ptrace_getfpregs(struct task_struct *child, __u32 __user *data)
 		if (cpu_has_mipsmt) {
 			unsigned int vpflags = dvpe();
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 			evpe(vpflags);
 		} else {
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 		}
@@ -483,13 +483,13 @@ long arch_ptrace(struct task_struct *child, long request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
diff --git a/arch/mips/kernel/ptrace32.c b/arch/mips/kernel/ptrace32.c
index 9486055..020342a 100644
--- a/arch/mips/kernel/ptrace32.c
+++ b/arch/mips/kernel/ptrace32.c
@@ -147,13 +147,13 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
diff --git a/arch/mips/kernel/r4k_fpu.S b/arch/mips/kernel/r4k_fpu.S
index 55ffe14..253b2fb 100644
--- a/arch/mips/kernel/r4k_fpu.S
+++ b/arch/mips/kernel/r4k_fpu.S
@@ -35,7 +35,15 @@
 LEAF(_save_fp_context)
 	cfc1	t1, fcr31
 
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+#endif
 	/* Store the 16 odd double precision registers */
 	EX	sdc1 $f1, SC_FPREGS+8(a0)
 	EX	sdc1 $f3, SC_FPREGS+24(a0)
@@ -53,6 +61,7 @@ LEAF(_save_fp_context)
 	EX	sdc1 $f27, SC_FPREGS+216(a0)
 	EX	sdc1 $f29, SC_FPREGS+232(a0)
 	EX	sdc1 $f31, SC_FPREGS+248(a0)
+1:	.set	pop
 #endif
 
 	/* Store the 16 even double precision registers */
@@ -82,7 +91,31 @@ LEAF(_save_fp_context)
 LEAF(_save_fp_context32)
 	cfc1	t1, fcr31
 
-	EX	sdc1 $f0, SC32_FPREGS+0(a0)
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+
+	/* Store the 16 odd double precision registers */
+	EX      sdc1 $f1, SC32_FPREGS+8(a0)
+	EX      sdc1 $f3, SC32_FPREGS+24(a0)
+	EX      sdc1 $f5, SC32_FPREGS+40(a0)
+	EX      sdc1 $f7, SC32_FPREGS+56(a0)
+	EX      sdc1 $f9, SC32_FPREGS+72(a0)
+	EX      sdc1 $f11, SC32_FPREGS+88(a0)
+	EX      sdc1 $f13, SC32_FPREGS+104(a0)
+	EX      sdc1 $f15, SC32_FPREGS+120(a0)
+	EX      sdc1 $f17, SC32_FPREGS+136(a0)
+	EX      sdc1 $f19, SC32_FPREGS+152(a0)
+	EX      sdc1 $f21, SC32_FPREGS+168(a0)
+	EX      sdc1 $f23, SC32_FPREGS+184(a0)
+	EX      sdc1 $f25, SC32_FPREGS+200(a0)
+	EX      sdc1 $f27, SC32_FPREGS+216(a0)
+	EX      sdc1 $f29, SC32_FPREGS+232(a0)
+	EX      sdc1 $f31, SC32_FPREGS+248(a0)
+
+	/* Store the 16 even double precision registers */
+1:	EX	sdc1 $f0, SC32_FPREGS+0(a0)
 	EX	sdc1 $f2, SC32_FPREGS+16(a0)
 	EX	sdc1 $f4, SC32_FPREGS+32(a0)
 	EX	sdc1 $f6, SC32_FPREGS+48(a0)
@@ -114,7 +147,16 @@ LEAF(_save_fp_context32)
  */
 LEAF(_restore_fp_context)
 	EX	lw t0, SC_FPC_CSR(a0)
-#ifdef CONFIG_64BIT
+
+#if defined(CONFIG_64BIT) || defined(CONFIG_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+#endif
 	EX	ldc1 $f1, SC_FPREGS+8(a0)
 	EX	ldc1 $f3, SC_FPREGS+24(a0)
 	EX	ldc1 $f5, SC_FPREGS+40(a0)
@@ -131,6 +173,7 @@ LEAF(_restore_fp_context)
 	EX	ldc1 $f27, SC_FPREGS+216(a0)
 	EX	ldc1 $f29, SC_FPREGS+232(a0)
 	EX	ldc1 $f31, SC_FPREGS+248(a0)
+1:	.set pop
 #endif
 	EX	ldc1 $f0, SC_FPREGS+0(a0)
 	EX	ldc1 $f2, SC_FPREGS+16(a0)
@@ -157,7 +200,30 @@ LEAF(_restore_fp_context)
 LEAF(_restore_fp_context32)
 	/* Restore an o32 sigcontext.  */
 	EX	lw t0, SC32_FPC_CSR(a0)
-	EX	ldc1 $f0, SC32_FPREGS+0(a0)
+
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+
+	EX      ldc1 $f1, SC32_FPREGS+8(a0)
+	EX      ldc1 $f3, SC32_FPREGS+24(a0)
+	EX      ldc1 $f5, SC32_FPREGS+40(a0)
+	EX      ldc1 $f7, SC32_FPREGS+56(a0)
+	EX      ldc1 $f9, SC32_FPREGS+72(a0)
+	EX      ldc1 $f11, SC32_FPREGS+88(a0)
+	EX      ldc1 $f13, SC32_FPREGS+104(a0)
+	EX      ldc1 $f15, SC32_FPREGS+120(a0)
+	EX      ldc1 $f17, SC32_FPREGS+136(a0)
+	EX      ldc1 $f19, SC32_FPREGS+152(a0)
+	EX      ldc1 $f21, SC32_FPREGS+168(a0)
+	EX      ldc1 $f23, SC32_FPREGS+184(a0)
+	EX      ldc1 $f25, SC32_FPREGS+200(a0)
+	EX      ldc1 $f27, SC32_FPREGS+216(a0)
+	EX      ldc1 $f29, SC32_FPREGS+232(a0)
+	EX      ldc1 $f31, SC32_FPREGS+248(a0)
+
+1:	EX	ldc1 $f0, SC32_FPREGS+0(a0)
 	EX	ldc1 $f2, SC32_FPREGS+16(a0)
 	EX	ldc1 $f4, SC32_FPREGS+32(a0)
 	EX	ldc1 $f6, SC32_FPREGS+48(a0)
diff --git a/arch/mips/kernel/r4k_switch.S b/arch/mips/kernel/r4k_switch.S
index 078de5e..cc78dd9 100644
--- a/arch/mips/kernel/r4k_switch.S
+++ b/arch/mips/kernel/r4k_switch.S
@@ -123,7 +123,7 @@
  * Save a thread's fp context.
  */
 LEAF(_save_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_save_double a0 t0 t1		# clobbers t1
@@ -134,7 +134,7 @@ LEAF(_save_fp)
  * Restore a thread's fp context.
  */
 LEAF(_restore_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_restore_double a0 t0 t1		# clobbers t1
@@ -228,6 +228,47 @@ LEAF(_init_fpu)
 	mtc1	t1, $f29
 	mtc1	t1, $f30
 	mtc1	t1, $f31
+
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set    push
+	.set    mips64r2
+	sll     t0, t0, 5			# is Status.FR set?
+	bgez    t0, 1f				# no: skip setting upper 32b
+
+	mthc1   t1, $f0
+	mthc1   t1, $f1
+	mthc1   t1, $f2
+	mthc1   t1, $f3
+	mthc1   t1, $f4
+	mthc1   t1, $f5
+	mthc1   t1, $f6
+	mthc1   t1, $f7
+	mthc1   t1, $f8
+	mthc1   t1, $f9
+	mthc1   t1, $f10
+	mthc1   t1, $f11
+	mthc1   t1, $f12
+	mthc1   t1, $f13
+	mthc1   t1, $f14
+	mthc1   t1, $f15
+	mthc1   t1, $f16
+	mthc1   t1, $f17
+	mthc1   t1, $f18
+	mthc1   t1, $f19
+	mthc1   t1, $f20
+	mthc1   t1, $f21
+	mthc1   t1, $f22
+	mthc1   t1, $f23
+	mthc1   t1, $f24
+	mthc1   t1, $f25
+	mthc1   t1, $f26
+	mthc1   t1, $f27
+	mthc1   t1, $f28
+	mthc1   t1, $f29
+	mthc1   t1, $f30
+	mthc1   t1, $f31
+1:	.set    pop
+#endif /* CONFIG_CPU_MIPS32_R2 */
 #else
 	.set	mips3
 	dmtc1	t1, $f0
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index 2f285ab..5199563 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -71,8 +71,9 @@ static int protected_save_fp_context(struct sigcontext __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -91,8 +92,9 @@ static int protected_restore_fp_context(struct sigcontext __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c
index 57de8b7..7c1024b 100644
--- a/arch/mips/kernel/signal32.c
+++ b/arch/mips/kernel/signal32.c
@@ -85,8 +85,9 @@ static int protected_save_fp_context32(struct sigcontext32 __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -105,8 +106,9 @@ static int protected_restore_fp_context32(struct sigcontext32 __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index cc20415..eb28423 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -1080,7 +1080,7 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 	unsigned long old_epc, old31;
 	unsigned int opcode;
 	unsigned int cpid;
-	int status;
+	int status, err;
 	unsigned long __maybe_unused flags;
 
 	prev_state = exception_enter();
@@ -1153,19 +1153,29 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 
 	case 1:
 		if (used_math())	/* Using the FPU again.	 */
-			own_fpu(1);
+			err = own_fpu(1);
 		else {			/* First time FPU user.	 */
-			init_fpu();
+			err = init_fpu();
 			set_used_math();
 		}
 
-		if (!raw_cpu_has_fpu) {
+#ifndef CONFIG_MIPS_O32_FP64_SUPPORT
+		/*
+		 * This assumes that either all FPUs in the system support
+		 * Status.FR (ie. both 32-bit & 64-bit) or none of them do.
+		 */
+		if (err) {
+			force_sig(SIGFPE, current);
+			goto out;
+		}
+#endif
+		if (!raw_cpu_has_fpu || err) {
 			int sig;
 			void __user *fault_addr = NULL;
 			sig = fpu_emulator_cop1Handler(regs,
 						       &current->thread.fpu,
 						       0, &fault_addr);
-			if (!process_fpemu_return(sig, fault_addr))
+			if (!process_fpemu_return(sig, fault_addr) && !err)
 				mt_ase_fp_affinity();
 		}
 
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 4b37961..22f7b11 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -859,20 +859,20 @@ static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
  * In the Linux kernel, we support selection of FPR format on the
  * basis of the Status.FR bit.	If an FPU is not present, the FR bit
  * is hardwired to zero, which would imply a 32-bit FPU even for
- * 64-bit CPUs so we rather look at TIF_32BIT_REGS.
+ * 64-bit CPUs so we rather look at TIF_32BIT_FPREGS.
  * FPU emu is slow and bulky and optimizing this function offers fairly
  * sizeable benefits so we try to be clever and make this function return
  * a constant whenever possible, that is on 64-bit kernels without O32
- * compatibility enabled and on 32-bit kernels.
+ * compatibility enabled and on 32-bit without 64-bit FPU support.
  */
 static inline int cop1_64bit(struct pt_regs *xcp)
 {
 #if defined(CONFIG_64BIT) && !defined(CONFIG_MIPS32_O32)
 	return 1;
-#elif defined(CONFIG_64BIT) && defined(CONFIG_MIPS32_O32)
-	return !test_thread_flag(TIF_32BIT_REGS);
-#else
+#elif defined(CONFIG_32BIT) && !defined(CONFIG_MIPS_O32_FP64_SUPPORT)
 	return 0;
+#else
+	return !test_thread_flag(TIF_32BIT_FPREGS);
 #endif
 }
 
diff --git a/arch/mips/math-emu/kernel_linkage.c b/arch/mips/math-emu/kernel_linkage.c
index 1c58657..3aeae07 100644
--- a/arch/mips/math-emu/kernel_linkage.c
+++ b/arch/mips/math-emu/kernel_linkage.c
@@ -89,8 +89,9 @@ int fpu_emulator_save_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __put_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
@@ -103,8 +104,9 @@ int fpu_emulator_restore_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __get_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
-- 
1.8.4.1


* [PATCH 4/6] mips: support for 64-bit FP with O32 binaries
@ 2013-11-07 12:48   ` Paul Burton
From: Paul Burton @ 2013-11-07 12:48 UTC
  To: linux-mips; +Cc: Paul Burton

CPUs implementing mips32r2 may include a 64-bit FPU, just as mips64 CPUs
do. To preserve backwards compatibility, a 64-bit FPU acts like a 32-bit
FPU when the Status.FR bit is zero (doubles are accessed from the least
significant 32 bits of an even-odd pair of FP registers), again just
like a mips64 CPU. The standard O32 ABI is defined expecting a 32-bit
FPU; however, recent toolchains support use of a 64-bit FPU from an O32
mips32 executable. When an ELF executable is built to use a 64-bit FPU,
a new flag (EF_MIPS_FP64) is set in the ELF header.
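
To illustrate the two register views described above, here is a minimal
userspace sketch, not kernel code. The names struct fpr_file and
read_double_bits() are hypothetical and exist only for this example. It
models the register file as 32 registers of 64 bits each and assumes
that, with FR=0, the even register supplies the low word and the odd
register the high word. Rounding an odd register number down to its even
partner is a simplification made by this model.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit FPU register file: 32 x 64 bits. */
struct fpr_file {
	uint64_t fpr[32];
};

/*
 * Return the raw bits of the double held in register n.
 *
 * fr != 0: all 64 bits come from fpr[n].
 * fr == 0: the value is spread over an even-odd pair. The low word is
 *          the least significant 32 bits of the even register and the
 *          high word is the least significant 32 bits of the odd one.
 */
static uint64_t read_double_bits(const struct fpr_file *f, unsigned int n,
				 int fr)
{
	if (fr)
		return f->fpr[n];

	n &= ~1u;	/* model assumption: doubles start at even registers */
	return ((f->fpr[n + 1] & 0xffffffffull) << 32) |
	       (f->fpr[n] & 0xffffffffull);
}
```

With FR=0 the upper halves of both registers are ignored, which is why
an O32 binary built for a 32-bit FPU sees the same results on a 64-bit
FPU.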

With this patch the kernel checks the EF_MIPS_FP64 flag when executing
an O32 binary and sets Status.FR accordingly. Adding O32 64-bit FP
support reduces the opportunity for optimisation in the FPU emulator,
so a CONFIG_MIPS_O32_FP64_SUPPORT Kconfig option is introduced to allow
this support to be disabled by those who don't require it.
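
The flag check described above can be sketched in plain C as follows.
This is a userspace illustration rather than the code in the patch: the
helpers status_for_o32() and status_fr_set() are hypothetical, while
the constants are the EF_MIPS_FP64 value added by this patch and the
architectural positions of Status.CU1 (bit 29) and Status.FR (bit 26).
The second helper mirrors the "sll t0, t0, 5; bgez t0, 1f" sequence used
in the assembly, which moves FR into the sign bit before testing it.

```c
#include <stdint.h>

#define EF_MIPS_FP64	0x00000200u	/* ELF header flag from this patch */
#define ST0_CU1		0x20000000u	/* Status bit 29: coprocessor 1 usable */
#define ST0_FR		0x04000000u	/* Status bit 26: 64-bit FP registers */

/*
 * Hypothetical helper: compute the Status value wanted for an O32
 * binary with the given ELF e_flags. CU1 is always set, FR only when
 * the binary asks for 64-bit FP registers.
 */
static uint32_t status_for_o32(uint32_t e_flags, uint32_t status)
{
	status &= ~ST0_FR;
	status |= ST0_CU1;
	if (e_flags & EF_MIPS_FP64)
		status |= ST0_FR;
	return status;
}

/*
 * Hypothetical helper: test FR the way the assembly does. Shifting
 * left by 5 moves bit 26 to bit 31, so FR becomes the sign bit.
 */
static int status_fr_set(uint32_t status)
{
	return ((status << 5) & 0x80000000u) != 0;
}
```

In the patch itself the decision is recorded per thread in
TIF_32BIT_FPREGS and applied by __enable_fpu(), which also reads Status
back to confirm that the hardware accepted the requested FR value.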

Inspired by an earlier patch by Leonid Yegoshin, but implemented more
cleanly & correctly.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
 arch/mips/Kconfig                   |  17 ++++++
 arch/mips/include/asm/asmmacro-32.h |  42 --------------
 arch/mips/include/asm/asmmacro-64.h |  96 --------------------------------
 arch/mips/include/asm/asmmacro.h    | 107 ++++++++++++++++++++++++++++++++++++
 arch/mips/include/asm/elf.h         |  17 +++++-
 arch/mips/include/asm/fpu.h         |  91 +++++++++++++++++++++++++-----
 arch/mips/include/asm/thread_info.h |   4 +-
 arch/mips/kernel/cpu-probe.c        |   2 +-
 arch/mips/kernel/process.c          |   3 -
 arch/mips/kernel/ptrace.c           |   8 +--
 arch/mips/kernel/ptrace32.c         |   4 +-
 arch/mips/kernel/r4k_fpu.S          |  74 +++++++++++++++++++++++--
 arch/mips/kernel/r4k_switch.S       |  45 ++++++++++++++-
 arch/mips/kernel/signal.c           |  10 ++--
 arch/mips/kernel/signal32.c         |  10 ++--
 arch/mips/kernel/traps.c            |  20 +++++--
 arch/mips/math-emu/cp1emu.c         |  10 ++--
 arch/mips/math-emu/kernel_linkage.c |   6 +-
 18 files changed, 373 insertions(+), 193 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 17cc7ff..aa2e03a 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2335,6 +2335,23 @@ config CC_STACKPROTECTOR
 
 	  This feature requires gcc version 4.2 or above.
 
+config MIPS_O32_FP64_SUPPORT
+	bool "Support for O32 binaries using 64-bit FP"
+	depends on (32BIT && CPU_MIPSR2) || MIPS32_O32
+	default y
+	help
+	  When this is enabled, the kernel will support use of 64-bit floating
+	  point registers with binaries using the O32 ABI along with the
+	  EF_MIPS_FP64 ELF header flag (typically built with -mfp64). On
+	  mips32 systems this support is at the cost of increasing the size
+	  and complexity of the compiled FPU emulator. Thus if you are running
+	  a mips32 system and know that none of your userland binaries will
+	  require 64-bit floating point, you may wish to reduce the size of
+	  your kernel & potentially improve FP emulation performance by saying
+	  N here.
+
+	  If unsure, say Y.
+
 config USE_OF
 	bool
 	select OF
diff --git a/arch/mips/include/asm/asmmacro-32.h b/arch/mips/include/asm/asmmacro-32.h
index 2413afe..70e1f17 100644
--- a/arch/mips/include/asm/asmmacro-32.h
+++ b/arch/mips/include/asm/asmmacro-32.h
@@ -12,27 +12,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_double thread status tmp1=t0
-	cfc1	\tmp1,  fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp1, THREAD_FCR31(\thread)
-	.endm
-
 	.macro	fpu_save_single thread tmp=t0
 	cfc1	\tmp,  fcr31
 	swc1	$f0,  THREAD_FPR0(\thread)
@@ -70,27 +49,6 @@
 	sw	\tmp, THREAD_FCR31(\thread)
 	.endm
 
-	.macro	fpu_restore_double thread status tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
 	.macro	fpu_restore_single thread tmp=t0
 	lw	\tmp, THREAD_FCR31(\thread)
 	lwc1	$f0,  THREAD_FPR0(\thread)
diff --git a/arch/mips/include/asm/asmmacro-64.h b/arch/mips/include/asm/asmmacro-64.h
index 08a527d..38ea609 100644
--- a/arch/mips/include/asm/asmmacro-64.h
+++ b/arch/mips/include/asm/asmmacro-64.h
@@ -13,102 +13,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_16even thread tmp=t0
-	cfc1	\tmp, fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp, THREAD_FCR31(\thread)
-	.endm
-
-	.macro	fpu_save_16odd thread
-	sdc1	$f1,  THREAD_FPR1(\thread)
-	sdc1	$f3,  THREAD_FPR3(\thread)
-	sdc1	$f5,  THREAD_FPR5(\thread)
-	sdc1	$f7,  THREAD_FPR7(\thread)
-	sdc1	$f9,  THREAD_FPR9(\thread)
-	sdc1	$f11, THREAD_FPR11(\thread)
-	sdc1	$f13, THREAD_FPR13(\thread)
-	sdc1	$f15, THREAD_FPR15(\thread)
-	sdc1	$f17, THREAD_FPR17(\thread)
-	sdc1	$f19, THREAD_FPR19(\thread)
-	sdc1	$f21, THREAD_FPR21(\thread)
-	sdc1	$f23, THREAD_FPR23(\thread)
-	sdc1	$f25, THREAD_FPR25(\thread)
-	sdc1	$f27, THREAD_FPR27(\thread)
-	sdc1	$f29, THREAD_FPR29(\thread)
-	sdc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_save_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 2f
-	fpu_save_16odd \thread
-2:
-	fpu_save_16even \thread \tmp
-	.endm
-
-	.macro	fpu_restore_16even thread tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
-	.macro	fpu_restore_16odd thread
-	ldc1	$f1,  THREAD_FPR1(\thread)
-	ldc1	$f3,  THREAD_FPR3(\thread)
-	ldc1	$f5,  THREAD_FPR5(\thread)
-	ldc1	$f7,  THREAD_FPR7(\thread)
-	ldc1	$f9,  THREAD_FPR9(\thread)
-	ldc1	$f11, THREAD_FPR11(\thread)
-	ldc1	$f13, THREAD_FPR13(\thread)
-	ldc1	$f15, THREAD_FPR15(\thread)
-	ldc1	$f17, THREAD_FPR17(\thread)
-	ldc1	$f19, THREAD_FPR19(\thread)
-	ldc1	$f21, THREAD_FPR21(\thread)
-	ldc1	$f23, THREAD_FPR23(\thread)
-	ldc1	$f25, THREAD_FPR25(\thread)
-	ldc1	$f27, THREAD_FPR27(\thread)
-	ldc1	$f29, THREAD_FPR29(\thread)
-	ldc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_restore_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 1f				# 16 register mode?
-
-	fpu_restore_16odd \thread
-1:	fpu_restore_16even \thread \tmp
-	.endm
-
 	.macro	cpu_save_nonscratch thread
 	LONG_S	s0, THREAD_REG16(\thread)
 	LONG_S	s1, THREAD_REG17(\thread)
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index 6c8342a..3220c93 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -62,6 +62,113 @@
 	.endm
 #endif /* CONFIG_MIPS_MT_SMTC */
 
+	.macro	fpu_save_16even thread tmp=t0
+	cfc1	\tmp, fcr31
+	sdc1	$f0,  THREAD_FPR0(\thread)
+	sdc1	$f2,  THREAD_FPR2(\thread)
+	sdc1	$f4,  THREAD_FPR4(\thread)
+	sdc1	$f6,  THREAD_FPR6(\thread)
+	sdc1	$f8,  THREAD_FPR8(\thread)
+	sdc1	$f10, THREAD_FPR10(\thread)
+	sdc1	$f12, THREAD_FPR12(\thread)
+	sdc1	$f14, THREAD_FPR14(\thread)
+	sdc1	$f16, THREAD_FPR16(\thread)
+	sdc1	$f18, THREAD_FPR18(\thread)
+	sdc1	$f20, THREAD_FPR20(\thread)
+	sdc1	$f22, THREAD_FPR22(\thread)
+	sdc1	$f24, THREAD_FPR24(\thread)
+	sdc1	$f26, THREAD_FPR26(\thread)
+	sdc1	$f28, THREAD_FPR28(\thread)
+	sdc1	$f30, THREAD_FPR30(\thread)
+	sw	\tmp, THREAD_FCR31(\thread)
+	.endm
+
+	.macro	fpu_save_16odd thread
+	.set	push
+	.set	mips64r2
+	sdc1	$f1,  THREAD_FPR1(\thread)
+	sdc1	$f3,  THREAD_FPR3(\thread)
+	sdc1	$f5,  THREAD_FPR5(\thread)
+	sdc1	$f7,  THREAD_FPR7(\thread)
+	sdc1	$f9,  THREAD_FPR9(\thread)
+	sdc1	$f11, THREAD_FPR11(\thread)
+	sdc1	$f13, THREAD_FPR13(\thread)
+	sdc1	$f15, THREAD_FPR15(\thread)
+	sdc1	$f17, THREAD_FPR17(\thread)
+	sdc1	$f19, THREAD_FPR19(\thread)
+	sdc1	$f21, THREAD_FPR21(\thread)
+	sdc1	$f23, THREAD_FPR23(\thread)
+	sdc1	$f25, THREAD_FPR25(\thread)
+	sdc1	$f27, THREAD_FPR27(\thread)
+	sdc1	$f29, THREAD_FPR29(\thread)
+	sdc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_save_double thread status tmp
+#if defined(CONFIG_MIPS64) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f
+	fpu_save_16odd \thread
+10:
+#endif
+	fpu_save_16even \thread \tmp
+	.endm
+
+	.macro	fpu_restore_16even thread tmp=t0
+	lw	\tmp, THREAD_FCR31(\thread)
+	ldc1	$f0,  THREAD_FPR0(\thread)
+	ldc1	$f2,  THREAD_FPR2(\thread)
+	ldc1	$f4,  THREAD_FPR4(\thread)
+	ldc1	$f6,  THREAD_FPR6(\thread)
+	ldc1	$f8,  THREAD_FPR8(\thread)
+	ldc1	$f10, THREAD_FPR10(\thread)
+	ldc1	$f12, THREAD_FPR12(\thread)
+	ldc1	$f14, THREAD_FPR14(\thread)
+	ldc1	$f16, THREAD_FPR16(\thread)
+	ldc1	$f18, THREAD_FPR18(\thread)
+	ldc1	$f20, THREAD_FPR20(\thread)
+	ldc1	$f22, THREAD_FPR22(\thread)
+	ldc1	$f24, THREAD_FPR24(\thread)
+	ldc1	$f26, THREAD_FPR26(\thread)
+	ldc1	$f28, THREAD_FPR28(\thread)
+	ldc1	$f30, THREAD_FPR30(\thread)
+	ctc1	\tmp, fcr31
+	.endm
+
+	.macro	fpu_restore_16odd thread
+	.set	push
+	.set	mips64r2
+	ldc1	$f1,  THREAD_FPR1(\thread)
+	ldc1	$f3,  THREAD_FPR3(\thread)
+	ldc1	$f5,  THREAD_FPR5(\thread)
+	ldc1	$f7,  THREAD_FPR7(\thread)
+	ldc1	$f9,  THREAD_FPR9(\thread)
+	ldc1	$f11, THREAD_FPR11(\thread)
+	ldc1	$f13, THREAD_FPR13(\thread)
+	ldc1	$f15, THREAD_FPR15(\thread)
+	ldc1	$f17, THREAD_FPR17(\thread)
+	ldc1	$f19, THREAD_FPR19(\thread)
+	ldc1	$f21, THREAD_FPR21(\thread)
+	ldc1	$f23, THREAD_FPR23(\thread)
+	ldc1	$f25, THREAD_FPR25(\thread)
+	ldc1	$f27, THREAD_FPR27(\thread)
+	ldc1	$f29, THREAD_FPR29(\thread)
+	ldc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_restore_double thread status tmp
+#if defined(CONFIG_MIPS64) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f				# 16 register mode?
+
+	fpu_restore_16odd \thread
+10:
+#endif
+	fpu_restore_16even \thread \tmp
+	.endm
+
 /*
  * Temporary until all gas have MT ASE support
  */
diff --git a/arch/mips/include/asm/elf.h b/arch/mips/include/asm/elf.h
index a66359e..17163cf 100644
--- a/arch/mips/include/asm/elf.h
+++ b/arch/mips/include/asm/elf.h
@@ -36,6 +36,7 @@
 #define EF_MIPS_ABI2		0x00000020
 #define EF_MIPS_OPTIONS_FIRST	0x00000080
 #define EF_MIPS_32BITMODE	0x00000100
+#define EF_MIPS_FP64		0x00000200
 #define EF_MIPS_ABI		0x0000f000
 #define EF_MIPS_ARCH		0xf0000000
 
@@ -249,6 +250,11 @@ extern struct mips_abi mips_abi_n32;
 
 #define SET_PERSONALITY(ex)						\
 do {									\
+	if ((ex).e_flags & EF_MIPS_FP64)				\
+		clear_thread_flag(TIF_32BIT_FPREGS);			\
+	else								\
+		set_thread_flag(TIF_32BIT_FPREGS);			\
+									\
 	if (personality(current->personality) != PER_LINUX)		\
 		set_personality(PER_LINUX);				\
 									\
@@ -271,14 +277,18 @@ do {									\
 #endif
 
 #ifdef CONFIG_MIPS32_O32
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do {								\
 		set_thread_flag(TIF_32BIT_REGS);			\
 		set_thread_flag(TIF_32BIT_ADDR);			\
+									\
+		if (!((ex).e_flags & EF_MIPS_FP64))			\
+			set_thread_flag(TIF_32BIT_FPREGS);		\
+									\
 		current->thread.abi = &mips_abi_32;			\
 	} while (0)
 #else
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do { } while (0)
 #endif
 
@@ -289,7 +299,7 @@ do {									\
 	     ((ex).e_flags & EF_MIPS_ABI) == 0)				\
 		__SET_PERSONALITY32_N32();				\
 	else								\
-		__SET_PERSONALITY32_O32();				\
+		__SET_PERSONALITY32_O32(ex);                            \
 } while (0)
 #else
 #define __SET_PERSONALITY32(ex) do { } while (0)
@@ -300,6 +310,7 @@ do {									\
 	unsigned int p;							\
 									\
 	clear_thread_flag(TIF_32BIT_REGS);				\
+	clear_thread_flag(TIF_32BIT_FPREGS);				\
 	clear_thread_flag(TIF_32BIT_ADDR);				\
 									\
 	if ((ex).e_ident[EI_CLASS] == ELFCLASS32)			\
diff --git a/arch/mips/include/asm/fpu.h b/arch/mips/include/asm/fpu.h
index 3bf023f..cfe092f 100644
--- a/arch/mips/include/asm/fpu.h
+++ b/arch/mips/include/asm/fpu.h
@@ -33,11 +33,48 @@ extern void _init_fpu(void);
 extern void _save_fp(struct task_struct *);
 extern void _restore_fp(struct task_struct *);
 
-#define __enable_fpu()							\
-do {									\
-	set_c0_status(ST0_CU1);						\
-	enable_fpu_hazard();						\
-} while (0)
+/*
+ * This enum specifies a mode in which we want the FPU to operate, for cores
+ * which implement the Status.FR bit. Note that FPU_32BIT & FPU_64BIT
+ * purposefully have the values 0 & 1 respectively, so that an integer value
+ * of Status.FR can be trivially cast to the corresponding enum fpu_mode.
+ */
+enum fpu_mode {
+	FPU_32BIT = 0,		/* FR = 0 */
+	FPU_64BIT,		/* FR = 1 */
+	FPU_AS_IS,
+};
+
+static inline int __enable_fpu(enum fpu_mode mode)
+{
+	int fr;
+
+	switch (mode) {
+	case FPU_AS_IS:
+		/* just enable the FPU in its current mode */
+		set_c0_status(ST0_CU1);
+		enable_fpu_hazard();
+		return 0;
+
+	case FPU_64BIT:
+#if !(defined(CONFIG_CPU_MIPS32_R2) || defined(CONFIG_MIPS64))
+		/* we only have a 32-bit FPU */
+		return SIGFPE;
+#endif
+		/* fall through */
+	case FPU_32BIT:
+		/* set CU1 & change FR appropriately */
+		fr = (int)mode;
+		change_c0_status(ST0_CU1 | ST0_FR, ST0_CU1 | (fr ? ST0_FR : 0));
+		enable_fpu_hazard();
+
+		/* check FR has the desired value */
+		return (!!(read_c0_status() & ST0_FR) == !!fr) ? 0 : SIGFPE;
+
+	default:
+		BUG();
+	}
+}
 
 #define __disable_fpu()							\
 do {									\
@@ -57,27 +94,46 @@ static inline int is_fpu_owner(void)
 	return cpu_has_fpu && __is_fpu_owner();
 }
 
-static inline void __own_fpu(void)
+static inline int __own_fpu(void)
 {
-	__enable_fpu();
+	enum fpu_mode mode;
+	int ret;
+
+	mode = !test_thread_flag(TIF_32BIT_FPREGS);
+	ret = __enable_fpu(mode);
+	if (ret)
+		return ret;
+
 	KSTK_STATUS(current) |= ST0_CU1;
+	if (mode == FPU_64BIT)
+		KSTK_STATUS(current) |= ST0_FR;
+	else /* mode == FPU_32BIT */
+		KSTK_STATUS(current) &= ~ST0_FR;
+
 	set_thread_flag(TIF_USEDFPU);
+	return 0;
 }
 
-static inline void own_fpu_inatomic(int restore)
+static inline int own_fpu_inatomic(int restore)
 {
+	int ret = 0;
+
 	if (cpu_has_fpu && !__is_fpu_owner()) {
-		__own_fpu();
-		if (restore)
+		ret = __own_fpu();
+		if (restore && !ret)
 			_restore_fp(current);
 	}
+	return ret;
 }
 
-static inline void own_fpu(int restore)
+static inline int own_fpu(int restore)
 {
+	int ret;
+
 	preempt_disable();
-	own_fpu_inatomic(restore);
+	ret = own_fpu_inatomic(restore);
 	preempt_enable();
+	return ret;
 }
 
 static inline void lose_fpu(int save)
@@ -93,16 +149,21 @@ static inline void lose_fpu(int save)
 	preempt_enable();
 }
 
-static inline void init_fpu(void)
+static inline int init_fpu(void)
 {
+	int ret = 0;
+
 	preempt_disable();
 	if (cpu_has_fpu) {
-		__own_fpu();
-		_init_fpu();
+		ret = __own_fpu();
+		if (!ret)
+			_init_fpu();
 	} else {
 		fpu_emulator_init_fpu();
 	}
+
 	preempt_enable();
+	return ret;
 }
 
 static inline void save_fp(struct task_struct *tsk)
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index f9b24bf..b6da8b7 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -112,11 +112,12 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
-#define TIF_32BIT_REGS		22	/* also implies 16/32 fprs */
+#define TIF_32BIT_REGS		22	/* 32-bit general purpose registers */
 #define TIF_32BIT_ADDR		23	/* 32-bit address space (o32/n32) */
 #define TIF_FPUBOUND		24	/* thread bound to FPU-full CPU set */
 #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
 #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
+#define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -133,6 +134,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT_ADDR		(1<<TIF_32BIT_ADDR)
 #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
 #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
+#define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 
 #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c
index 8168e29..116102c 100644
--- a/arch/mips/kernel/cpu-probe.c
+++ b/arch/mips/kernel/cpu-probe.c
@@ -112,7 +112,7 @@ static inline unsigned long cpu_get_fpu_id(void)
 	unsigned long tmp, fpu_id;
 
 	tmp = read_c0_status();
-	__enable_fpu();
+	__enable_fpu(FPU_AS_IS);
 	fpu_id = read_32bit_cp1_register(CP1_REVISION);
 	write_c0_status(tmp);
 	return fpu_id;
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index ddc7610..747a6cf 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -60,9 +60,6 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
 
 	/* New thread loses kernel privileges. */
 	status = regs->cp0_status & ~(ST0_CU0|ST0_CU1|ST0_FR|KU_MASK);
-#ifdef CONFIG_64BIT
-	status |= test_thread_flag(TIF_32BIT_REGS) ? 0 : ST0_FR;
-#endif
 	status |= KU_USER;
 	regs->cp0_status = status;
 	clear_used_math();
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index b52e1d2..30b1a43 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -137,13 +137,13 @@ int ptrace_getfpregs(struct task_struct *child, __u32 __user *data)
 		if (cpu_has_mipsmt) {
 			unsigned int vpflags = dvpe();
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 			evpe(vpflags);
 		} else {
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 		}
@@ -483,13 +483,13 @@ long arch_ptrace(struct task_struct *child, long request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
diff --git a/arch/mips/kernel/ptrace32.c b/arch/mips/kernel/ptrace32.c
index 9486055..020342a 100644
--- a/arch/mips/kernel/ptrace32.c
+++ b/arch/mips/kernel/ptrace32.c
@@ -147,13 +147,13 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
diff --git a/arch/mips/kernel/r4k_fpu.S b/arch/mips/kernel/r4k_fpu.S
index 55ffe14..253b2fb 100644
--- a/arch/mips/kernel/r4k_fpu.S
+++ b/arch/mips/kernel/r4k_fpu.S
@@ -35,7 +35,15 @@
 LEAF(_save_fp_context)
 	cfc1	t1, fcr31
 
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+#endif
 	/* Store the 16 odd double precision registers */
 	EX	sdc1 $f1, SC_FPREGS+8(a0)
 	EX	sdc1 $f3, SC_FPREGS+24(a0)
@@ -53,6 +61,7 @@ LEAF(_save_fp_context)
 	EX	sdc1 $f27, SC_FPREGS+216(a0)
 	EX	sdc1 $f29, SC_FPREGS+232(a0)
 	EX	sdc1 $f31, SC_FPREGS+248(a0)
+1:	.set	pop
 #endif
 
 	/* Store the 16 even double precision registers */
@@ -82,7 +91,31 @@ LEAF(_save_fp_context)
 LEAF(_save_fp_context32)
 	cfc1	t1, fcr31
 
-	EX	sdc1 $f0, SC32_FPREGS+0(a0)
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+
+	/* Store the 16 odd double precision registers */
+	EX      sdc1 $f1, SC32_FPREGS+8(a0)
+	EX      sdc1 $f3, SC32_FPREGS+24(a0)
+	EX      sdc1 $f5, SC32_FPREGS+40(a0)
+	EX      sdc1 $f7, SC32_FPREGS+56(a0)
+	EX      sdc1 $f9, SC32_FPREGS+72(a0)
+	EX      sdc1 $f11, SC32_FPREGS+88(a0)
+	EX      sdc1 $f13, SC32_FPREGS+104(a0)
+	EX      sdc1 $f15, SC32_FPREGS+120(a0)
+	EX      sdc1 $f17, SC32_FPREGS+136(a0)
+	EX      sdc1 $f19, SC32_FPREGS+152(a0)
+	EX      sdc1 $f21, SC32_FPREGS+168(a0)
+	EX      sdc1 $f23, SC32_FPREGS+184(a0)
+	EX      sdc1 $f25, SC32_FPREGS+200(a0)
+	EX      sdc1 $f27, SC32_FPREGS+216(a0)
+	EX      sdc1 $f29, SC32_FPREGS+232(a0)
+	EX      sdc1 $f31, SC32_FPREGS+248(a0)
+
+	/* Store the 16 even double precision registers */
+1:	EX	sdc1 $f0, SC32_FPREGS+0(a0)
 	EX	sdc1 $f2, SC32_FPREGS+16(a0)
 	EX	sdc1 $f4, SC32_FPREGS+32(a0)
 	EX	sdc1 $f6, SC32_FPREGS+48(a0)
@@ -114,7 +147,16 @@ LEAF(_save_fp_context32)
  */
 LEAF(_restore_fp_context)
 	EX	lw t0, SC_FPC_CSR(a0)
-#ifdef CONFIG_64BIT
+
+#if defined(CONFIG_64BIT) || defined(CONFIG_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+#endif
 	EX	ldc1 $f1, SC_FPREGS+8(a0)
 	EX	ldc1 $f3, SC_FPREGS+24(a0)
 	EX	ldc1 $f5, SC_FPREGS+40(a0)
@@ -131,6 +173,7 @@ LEAF(_restore_fp_context)
 	EX	ldc1 $f27, SC_FPREGS+216(a0)
 	EX	ldc1 $f29, SC_FPREGS+232(a0)
 	EX	ldc1 $f31, SC_FPREGS+248(a0)
+1:	.set pop
 #endif
 	EX	ldc1 $f0, SC_FPREGS+0(a0)
 	EX	ldc1 $f2, SC_FPREGS+16(a0)
@@ -157,7 +200,30 @@ LEAF(_restore_fp_context)
 LEAF(_restore_fp_context32)
 	/* Restore an o32 sigcontext.  */
 	EX	lw t0, SC32_FPC_CSR(a0)
-	EX	ldc1 $f0, SC32_FPREGS+0(a0)
+
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+
+	EX      ldc1 $f1, SC32_FPREGS+8(a0)
+	EX      ldc1 $f3, SC32_FPREGS+24(a0)
+	EX      ldc1 $f5, SC32_FPREGS+40(a0)
+	EX      ldc1 $f7, SC32_FPREGS+56(a0)
+	EX      ldc1 $f9, SC32_FPREGS+72(a0)
+	EX      ldc1 $f11, SC32_FPREGS+88(a0)
+	EX      ldc1 $f13, SC32_FPREGS+104(a0)
+	EX      ldc1 $f15, SC32_FPREGS+120(a0)
+	EX      ldc1 $f17, SC32_FPREGS+136(a0)
+	EX      ldc1 $f19, SC32_FPREGS+152(a0)
+	EX      ldc1 $f21, SC32_FPREGS+168(a0)
+	EX      ldc1 $f23, SC32_FPREGS+184(a0)
+	EX      ldc1 $f25, SC32_FPREGS+200(a0)
+	EX      ldc1 $f27, SC32_FPREGS+216(a0)
+	EX      ldc1 $f29, SC32_FPREGS+232(a0)
+	EX      ldc1 $f31, SC32_FPREGS+248(a0)
+
+1:	EX	ldc1 $f0, SC32_FPREGS+0(a0)
 	EX	ldc1 $f2, SC32_FPREGS+16(a0)
 	EX	ldc1 $f4, SC32_FPREGS+32(a0)
 	EX	ldc1 $f6, SC32_FPREGS+48(a0)
diff --git a/arch/mips/kernel/r4k_switch.S b/arch/mips/kernel/r4k_switch.S
index 078de5e..cc78dd9 100644
--- a/arch/mips/kernel/r4k_switch.S
+++ b/arch/mips/kernel/r4k_switch.S
@@ -123,7 +123,7 @@
  * Save a thread's fp context.
  */
 LEAF(_save_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_save_double a0 t0 t1		# clobbers t1
@@ -134,7 +134,7 @@ LEAF(_save_fp)
  * Restore a thread's fp context.
  */
 LEAF(_restore_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_restore_double a0 t0 t1		# clobbers t1
@@ -228,6 +228,47 @@ LEAF(_init_fpu)
 	mtc1	t1, $f29
 	mtc1	t1, $f30
 	mtc1	t1, $f31
+
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set    push
+	.set    mips64r2
+	sll     t0, t0, 5			# is Status.FR set?
+	bgez    t0, 1f				# no: skip setting upper 32b
+
+	mthc1   t1, $f0
+	mthc1   t1, $f1
+	mthc1   t1, $f2
+	mthc1   t1, $f3
+	mthc1   t1, $f4
+	mthc1   t1, $f5
+	mthc1   t1, $f6
+	mthc1   t1, $f7
+	mthc1   t1, $f8
+	mthc1   t1, $f9
+	mthc1   t1, $f10
+	mthc1   t1, $f11
+	mthc1   t1, $f12
+	mthc1   t1, $f13
+	mthc1   t1, $f14
+	mthc1   t1, $f15
+	mthc1   t1, $f16
+	mthc1   t1, $f17
+	mthc1   t1, $f18
+	mthc1   t1, $f19
+	mthc1   t1, $f20
+	mthc1   t1, $f21
+	mthc1   t1, $f22
+	mthc1   t1, $f23
+	mthc1   t1, $f24
+	mthc1   t1, $f25
+	mthc1   t1, $f26
+	mthc1   t1, $f27
+	mthc1   t1, $f28
+	mthc1   t1, $f29
+	mthc1   t1, $f30
+	mthc1   t1, $f31
+1:	.set    pop
+#endif /* CONFIG_CPU_MIPS32_R2 */
 #else
 	.set	mips3
 	dmtc1	t1, $f0
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index 2f285ab..5199563 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -71,8 +71,9 @@ static int protected_save_fp_context(struct sigcontext __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -91,8 +92,9 @@ static int protected_restore_fp_context(struct sigcontext __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c
index 57de8b7..7c1024b 100644
--- a/arch/mips/kernel/signal32.c
+++ b/arch/mips/kernel/signal32.c
@@ -85,8 +85,9 @@ static int protected_save_fp_context32(struct sigcontext32 __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -105,8 +106,9 @@ static int protected_restore_fp_context32(struct sigcontext32 __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index cc20415..eb28423 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -1080,7 +1080,7 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 	unsigned long old_epc, old31;
 	unsigned int opcode;
 	unsigned int cpid;
-	int status;
+	int status, err;
 	unsigned long __maybe_unused flags;
 
 	prev_state = exception_enter();
@@ -1153,19 +1153,29 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 
 	case 1:
 		if (used_math())	/* Using the FPU again.	 */
-			own_fpu(1);
+			err = own_fpu(1);
 		else {			/* First time FPU user.	 */
-			init_fpu();
+			err = init_fpu();
 			set_used_math();
 		}
 
-		if (!raw_cpu_has_fpu) {
+#ifndef CONFIG_MIPS_O32_FP64_SUPPORT
+		/*
+		 * This assumes that either all FPUs in the system support
+		 * Status.FR (ie. both 32-bit & 64-bit) or none of them do.
+		 */
+		if (err) {
+			force_sig(SIGFPE, current);
+			goto out;
+		}
+#endif
+		if (!raw_cpu_has_fpu || err) {
 			int sig;
 			void __user *fault_addr = NULL;
 			sig = fpu_emulator_cop1Handler(regs,
 						       &current->thread.fpu,
 						       0, &fault_addr);
-			if (!process_fpemu_return(sig, fault_addr))
+			if (!process_fpemu_return(sig, fault_addr) && !err)
 				mt_ase_fp_affinity();
 		}
 
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 4b37961..22f7b11 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -859,20 +859,20 @@ static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
  * In the Linux kernel, we support selection of FPR format on the
  * basis of the Status.FR bit.	If an FPU is not present, the FR bit
  * is hardwired to zero, which would imply a 32-bit FPU even for
- * 64-bit CPUs so we rather look at TIF_32BIT_REGS.
+ * 64-bit CPUs so we rather look at TIF_32BIT_FPREGS.
  * FPU emu is slow and bulky and optimizing this function offers fairly
  * sizeable benefits so we try to be clever and make this function return
  * a constant whenever possible, that is on 64-bit kernels without O32
- * compatibility enabled and on 32-bit kernels.
+ * compatibility enabled and on 32-bit without 64-bit FPU support.
  */
 static inline int cop1_64bit(struct pt_regs *xcp)
 {
 #if defined(CONFIG_64BIT) && !defined(CONFIG_MIPS32_O32)
 	return 1;
-#elif defined(CONFIG_64BIT) && defined(CONFIG_MIPS32_O32)
-	return !test_thread_flag(TIF_32BIT_REGS);
-#else
+#elif defined(CONFIG_32BIT) && !defined(CONFIG_MIPS_O32_FP64_SUPPORT)
 	return 0;
+#else
+	return !test_thread_flag(TIF_32BIT_FPREGS);
 #endif
 }
 
diff --git a/arch/mips/math-emu/kernel_linkage.c b/arch/mips/math-emu/kernel_linkage.c
index 1c58657..3aeae07 100644
--- a/arch/mips/math-emu/kernel_linkage.c
+++ b/arch/mips/math-emu/kernel_linkage.c
@@ -89,8 +89,9 @@ int fpu_emulator_save_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __put_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
@@ -103,8 +104,9 @@ int fpu_emulator_restore_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __get_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
-- 
1.8.4.1

* [PATCH 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2013-11-07 12:48   ` Paul Burton
From: Paul Burton @ 2013-11-07 12:48 UTC (permalink / raw)
  To: linux-mips; +Cc: Paul Burton

If a floating point branch instruction (bc1[ft]l?) is emulated,
typically because we're running on a core with no FPU, then we need to
execute the instruction in its branch delay slot too. This is done by
writing that instruction to memory followed by a trap, as part of an
"emuframe", and executing it. This avoids the requirement of an emulator
for the entire MIPS instruction set. Prior to this patch such emuframes
are written to the user stack and executed from there.
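
As a rough illustration of what such a frame holds (this is not code
from the patch), the sketch below fills a two-word frame with the delay
slot instruction followed by a trap. The sk_ names are hypothetical,
and the use of a MIPS32 `break` with code 0 as the trap is an
assumption made for the example only.

```c
#include <stdint.h>

/* Assumed trap encoding for this sketch: MIPS32 `break` with code 0. */
#define SK_MIPS_BREAK 0x0000000du

/* Hypothetical stand-in for an emuframe: two consecutive instructions. */
struct sk_emuframe {
	uint32_t emul;		/* instruction taken from the delay slot */
	uint32_t badinst;	/* trap that returns control to the kernel */
};

/* Write the delay slot instruction, then the trap that follows it. */
static void sk_fill_emuframe(struct sk_emuframe *fr, uint32_t delay_insn)
{
	fr->emul = delay_insn;
	fr->badinst = SK_MIPS_BREAK;
}
```

Execution would then resume at the first word of the frame, with the
second word bringing us back into the kernel once the first has run.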

This patch moves FP branch delay emuframes off the user stack and into
a per-mm page. Allocating a page per-mm leaves userland with access to
only what it had access to previously, and prevents processes from
interfering with each other as they might if a single system-wide page
were used. The book-keeping required to track the allocation of
emuframes is not cheap, but given that invoking the FP emulator is
already very expensive, I don't expect this to be an issue.
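
The book-keeping amounts to a bitmap with one bit per frame in the
page. A minimal userland sketch of that idea follows; the sketch_
names and the 4096-byte page size are assumptions for illustration,
and the mutex and wait queue used by the patch are deliberately left
out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

struct sketch_emuframe {
	uint32_t emul;		/* instruction from the delay slot */
	uint32_t badinst;	/* trap back to the kernel */
};

#define SKETCH_FRAME_COUNT (SKETCH_PAGE_SIZE / sizeof(struct sketch_emuframe))
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_MAP_WORDS \
	((SKETCH_FRAME_COUNT + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD)

/* One bit per frame; a set bit means the frame is in use. */
struct sketch_emupage {
	unsigned long allocmap[SKETCH_MAP_WORDS];
};

/* Claim the lowest free frame; returns its index, or -1 if all are in use. */
static int sketch_alloc_frame(struct sketch_emupage *pg)
{
	for (size_t idx = 0; idx < SKETCH_FRAME_COUNT; idx++) {
		unsigned long *word = &pg->allocmap[idx / SKETCH_BITS_PER_WORD];
		unsigned long bit = 1ul << (idx % SKETCH_BITS_PER_WORD);

		if (!(*word & bit)) {
			*word |= bit;
			return (int)idx;
		}
	}
	return -1;
}

/* Release a frame previously returned by sketch_alloc_frame(). */
static void sketch_free_frame(struct sketch_emupage *pg, int idx)
{
	assert(idx >= 0 && (size_t)idx < SKETCH_FRAME_COUNT);
	pg->allocmap[(size_t)idx / SKETCH_BITS_PER_WORD] &=
		~(1ul << ((size_t)idx % SKETCH_BITS_PER_WORD));
}
```

In the patch itself a thread that finds no free frame sleeps until
another thread frees one, rather than failing as this sketch does.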

The biggest issue with executing the instruction from an FP branch delay
slot is that we must ensure that we free the frame from which we ran it.
That means that we must trap back to the kernel after executing that
instruction, which in turn means that we must take special care not to
let the PC be changed as a result of that instruction. Fortunately,
since we're executing an instruction we found in a branch delay slot,
the result is unpredictable if that instruction is a branch or jump, so
we can simply treat those as NOPs and avoid them causing a problem.
However there is still the possibility that a signal may be handled
whilst executing the branch delay instruction. This would usually be
fine as we would simply execute our trap back to the kernel after
sigreturn, however it is possible for userland to simply not return
from the signal handler, for example if it executes something like a
longjmp. In that case we would never trap back to the kernel and never
free the frame. For that reason a TIF_FP_BD_EMU flag is introduced and
set whilst we are executing an FP branch delay instruction. Whilst this
flag is set, pending signals are not handled. This isn't exactly
pretty, but it's simpler than most of the alternatives. One other
simple option I considered would be to just kill a process if we find a
branch in an FP branch delay slot, but I chose the current approach
because its result is closer to what would previously happen.
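
The intended effect on the return-to-userland path can be modelled
roughly as below. This is a simplified C model of the intent, not a
translation of the entry.S change; the sk_ names are hypothetical and
the bit positions other than bit 28 for TIF_FP_BD_EMU are chosen for
illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_TIF_SIGPENDING	(1u << 1)	/* illustrative bit position */
#define SK_TIF_NOTIFY_RESUME	(1u << 5)	/* illustrative bit position */
#define SK_TIF_FP_BD_EMU	(1u << 28)	/* bit 28, as in this patch */

/*
 * Decide whether the notify/signal path needs to run before returning
 * to userland. While an FP branch delay instruction is being executed,
 * a pending signal alone is not a reason to run it; the pending bit in
 * the caller's flags is left untouched.
 */
static bool sk_should_notify(uint32_t flags)
{
	uint32_t work = flags & (SK_TIF_SIGPENDING | SK_TIF_NOTIFY_RESUME);

	if (flags & SK_TIF_FP_BD_EMU)
		work &= ~SK_TIF_SIGPENDING;

	return work != 0;
}
```

Other notify-resume work is still serviced in this model; only signal
delivery is held back whilst the flag is set.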

The primary benefit of this patch is that we are now free to mark the
user stack non-executable where that is possible.

Additionally the FP emuframes themselves are simplified somewhat. The
cookie field is removed since we can be pretty certain that we're
looking at an emuframe by virtue of it being located in the page
allocated for them. The PC to continue from is moved into struct
thread_struct since the control flow of a thread can no longer be
modified for the duration of the 'emulation', meaning there will now
only ever be a single emuframe required for a thread at any given time.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
 arch/mips/include/asm/fpu_emulator.h |   2 +
 arch/mips/include/asm/mmu.h          |  12 ++
 arch/mips/include/asm/mmu_context.h  |   7 +
 arch/mips/include/asm/processor.h    |   7 +-
 arch/mips/include/asm/thread_info.h  |   2 +
 arch/mips/kernel/entry.S             |  13 +-
 arch/mips/kernel/process.c           |   2 +
 arch/mips/kernel/vdso.c              |   2 +-
 arch/mips/math-emu/dsemul.c          | 346 ++++++++++++++++++++++++++---------
 9 files changed, 298 insertions(+), 95 deletions(-)

diff --git a/arch/mips/include/asm/fpu_emulator.h b/arch/mips/include/asm/fpu_emulator.h
index 2abb587..7aef609 100644
--- a/arch/mips/include/asm/fpu_emulator.h
+++ b/arch/mips/include/asm/fpu_emulator.h
@@ -51,6 +51,8 @@ do {									\
 #define MIPS_FPU_EMU_INC_STATS(M) do { } while (0)
 #endif /* CONFIG_DEBUG_FS */
 
+extern void dsemul_thread_cleanup(void);
+extern void dsemul_mm_cleanup(struct mm_struct *mm);
 extern int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
 	unsigned long cpc);
 extern int do_dsemulret(struct pt_regs *xcp);
diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
index c436138..08214da 100644
--- a/arch/mips/include/asm/mmu.h
+++ b/arch/mips/include/asm/mmu.h
@@ -1,9 +1,21 @@
 #ifndef __ASM_MMU_H
 #define __ASM_MMU_H
 
+#include <linux/mutex.h>
+#include <linux/wait.h>
+
 typedef struct {
 	unsigned long asid[NR_CPUS];
 	void *vdso;
+
+	/* address of page used to hold FP branch delay emulation frames */
+	unsigned long fp_bd_emupage;
+	/* bitmap tracking allocation of fp_bd_emupage */
+	unsigned long *fp_bd_emupage_allocmap;
+	/* mutex to be held whilst modifying fp_bd_emupage(_allocmap) */
+	struct mutex fp_bd_emupage_mutex;
+	/* wait queue for threads requiring an emuframe */
+	wait_queue_head_t fp_bd_emupage_queue;
 } mm_context_t;
 
 #endif /* __ASM_MMU_H */
diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
index e277bba..c55e864 100644
--- a/arch/mips/include/asm/mmu_context.h
+++ b/arch/mips/include/asm/mmu_context.h
@@ -16,6 +16,7 @@
 #include <linux/smp.h>
 #include <linux/slab.h>
 #include <asm/cacheflush.h>
+#include <asm/fpu_emulator.h>
 #include <asm/hazards.h>
 #include <asm/tlbflush.h>
 #ifdef CONFIG_MIPS_MT_SMTC
@@ -133,6 +134,11 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	for_each_possible_cpu(i)
 		cpu_context(i, mm) = 0;
 
+	mm->context.fp_bd_emupage = 0;
+	mm->context.fp_bd_emupage_allocmap = NULL;
+	mutex_init(&mm->context.fp_bd_emupage_mutex);
+	init_waitqueue_head(&mm->context.fp_bd_emupage_queue);
+
 	return 0;
 }
 
@@ -199,6 +205,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
  */
 static inline void destroy_context(struct mm_struct *mm)
 {
+	dsemul_mm_cleanup(mm);
 }
 
 #define deactivate_mm(tsk, mm)	do { } while (0)
diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
index 3605b84..683a3d6 100644
--- a/arch/mips/include/asm/processor.h
+++ b/arch/mips/include/asm/processor.h
@@ -38,9 +38,10 @@ extern unsigned int vced_count, vcei_count;
 
 /*
  * A special page (the vdso) is mapped into all processes at the very
- * top of the virtual memory space.
+ * top of the virtual memory space. The page below it is used for FP
+ * emulator branch delay slot executions.
  */
-#define SPECIAL_PAGES_SIZE PAGE_SIZE
+#define SPECIAL_PAGES_SIZE (PAGE_SIZE * 2)
 
 #ifdef CONFIG_32BIT
 #ifdef CONFIG_KVM_GUEST
@@ -226,6 +227,8 @@ struct thread_struct {
 
 	/* Saved fpu/fpu emulator stuff. */
 	struct mips_fpu_struct fpu;
+	/* PC to continue from following an FP branch delay 'emulation' */
+	unsigned long fp_bd_emu_cpc;
 #ifdef CONFIG_MIPS_MT_FPAFF
 	/* Emulated instruction count */
 	unsigned long emulated_fp;
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index b6da8b7..eee6e18 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -118,6 +118,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
 #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
 #define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
+#define TIF_FP_BD_EMU		28	/* executing an FP branch delay */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -135,6 +136,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
 #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
 #define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
+#define _TIF_FP_BD_EMU		(1<<TIF_FP_BD_EMU)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 
 #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
diff --git a/arch/mips/kernel/entry.S b/arch/mips/kernel/entry.S
index e578685..24707d7 100644
--- a/arch/mips/kernel/entry.S
+++ b/arch/mips/kernel/entry.S
@@ -168,10 +168,15 @@ work_resched:
 	andi	t0, a2, _TIF_NEED_RESCHED
 	bnez	t0, work_resched
 
-work_notifysig:				# deal with pending signals and
-					# notify-resume requests
-	move	a0, sp
-	li	a1, 0
+work_notifysig:
+	and	t0, a2, _TIF_FP_BD_EMU	# are we currently 'emulating' the
+					# delay slot of an FP branch?
+	beqz	t0, 1f			# no, continue below
+	and	a2, a2, ~_TIF_SIGPENDING	# yes, skip handling signals
+	beqz	a2, restore_all		# which leaves us nothing to do
+
+1:	move	a0, sp			# deal with pending signals and
+	li	a1, 0			# notify-resume requests
 	jal	do_notify_resume	# a2 already loaded
 	j	resume_userspace_check
 
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 747a6cf..0219502 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -32,6 +32,7 @@
 #include <asm/cpu.h>
 #include <asm/dsp.h>
 #include <asm/fpu.h>
+#include <asm/fpu_emulator.h>
 #include <asm/pgtable.h>
 #include <asm/mipsregs.h>
 #include <asm/processor.h>
@@ -72,6 +73,7 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
 
 void exit_thread(void)
 {
+	dsemul_thread_cleanup();
 }
 
 void flush_thread(void)
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 0f1af58..213d871 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -78,7 +78,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 
 	down_write(&mm->mmap_sem);
 
-	addr = vdso_addr(mm->start_stack);
+	addr = vdso_addr(mm->start_stack) + PAGE_SIZE;
 
 	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
index 7ea622a..05b74b3 100644
--- a/arch/mips/math-emu/dsemul.c
+++ b/arch/mips/math-emu/dsemul.c
@@ -1,6 +1,8 @@
 #include <linux/compiler.h>
+#include <linux/err.h>
 #include <linux/mm.h>
 #include <linux/signal.h>
+#include <linux/slab.h>
 #include <linux/smp.h>
 
 #include <asm/asm.h>
@@ -45,52 +47,245 @@
 struct emuframe {
 	mips_instruction	emul;
 	mips_instruction	badinst;
-	mips_instruction	cookie;
-	unsigned long		epc;
 };
 
-int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
+static const int emupage_frame_count = PAGE_SIZE / sizeof(struct emuframe);
+
+static struct emuframe __user *alloc_emuframe(void)
 {
-	extern asmlinkage void handle_dsemulret(void);
-	struct emuframe __user *fr;
-	int err;
+	mm_context_t *mm_ctx = &current->mm->context;
+	struct emuframe __user *fr = NULL;
+	unsigned long addr;
+	int idx;
+
+retry:
+	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
 
-	if ((get_isa16_mode(regs->cp0_epc) && ((ir >> 16) == MM_NOP16)) ||
-		(ir == 0)) {
-		/* NOP is easy */
-		regs->cp0_epc = cpc;
-		regs->cp0_cause &= ~CAUSEF_BD;
-		return 0;
+	/* Ensure we have a page allocated for emuframes */
+	if (!mm_ctx->fp_bd_emupage) {
+		addr = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
+				   VM_READ|VM_WRITE|VM_EXEC|
+				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+				   0);
+		if (IS_ERR_VALUE(addr))
+			goto out_unlock;
+
+		mm_ctx->fp_bd_emupage = addr;
+		pr_debug("allocate emupage at 0x%08lx to %d\n", addr,
+			 current->pid);
 	}
-#ifdef DSEMUL_TRACE
-	printk("dsemul %lx %lx\n", regs->cp0_epc, cpc);
 
-#endif
+	/* Ensure we have an allocation bitmap */
+	if (!mm_ctx->fp_bd_emupage_allocmap) {
+		mm_ctx->fp_bd_emupage_allocmap =
+			kcalloc(BITS_TO_LONGS(emupage_frame_count),
+					      sizeof(unsigned long),
+				GFP_KERNEL);
+
+		if (!mm_ctx->fp_bd_emupage_allocmap)
+			goto out_unlock;
+	}
+
+	/* Attempt to allocate a single bit/frame */
+	idx = bitmap_find_free_region(mm_ctx->fp_bd_emupage_allocmap,
+				      emupage_frame_count, 0);
+	if (idx < 0) {
+		/*
+		 * Failed to allocate a frame. We'll wait until one becomes
+		 * available. The mutex is unlocked so that other threads
+		 * actually get the opportunity to free their frames, which
+		 * means technically the result of bitmap_full may be incorrect.
+		 * However the worst case is that we repeat all this and end up
+		 * back here again.
+		 */
+		mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
+		if (!wait_event_killable(mm_ctx->fp_bd_emupage_queue,
+			!bitmap_full(mm_ctx->fp_bd_emupage_allocmap,
+				     emupage_frame_count)))
+			goto retry;
+
+		/* Received a fatal signal - just give in */
+		return NULL;
+	}
+
+	/* Success! */
+	fr = (struct emuframe __user *)mm_ctx->fp_bd_emupage + idx;
+	pr_debug("allocate emuframe %d to %d\n", idx, current->pid);
+out_unlock:
+	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
+	return fr;
+}
+
+static void free_emuframe(struct emuframe __user *frame)
+{
+	mm_context_t *mm_ctx = &current->mm->context;
+	int idx;
+
+	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
 
+	idx = frame - (struct emuframe __user *)mm_ctx->fp_bd_emupage;
+	pr_debug("free emuframe %d from %d\n", idx, current->pid);
+	bitmap_clear(mm_ctx->fp_bd_emupage_allocmap, idx, 1);
+
+	/* If some thread is waiting for a frame, now's its chance */
+	wake_up(&mm_ctx->fp_bd_emupage_queue);
+
+	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
+}
+
+void dsemul_thread_cleanup(void)
+{
 	/*
-	 * The strategy is to push the instruction onto the user stack
-	 * and put a trap after it which we can catch and jump to
-	 * the required address any alternative apart from full
-	 * instruction emulation!!.
+	 * We should always have passed through do_dsemulret prior to the
+	 * thread exiting, so TIF_FP_BD_EMU should never be set here.
+	 */
+	BUG_ON(test_thread_flag(TIF_FP_BD_EMU));
+}
+
+void dsemul_mm_cleanup(struct mm_struct *mm)
+{
+	mm_context_t *mm_ctx = &mm->context;
+
+	kfree(mm_ctx->fp_bd_emupage_allocmap);
+}
+
+int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
+{
+	union mips_instruction inst = { .word = ir };
+	struct emuframe __user *fr;
+	int err;
+
+	/*
+	 * In order for us to clean up the emuframe properly, we'll need to
+	 * execute a break instruction after ir. If ir is a branch then we may
+	 * never reach that break instruction and thus never free the emuframe.
+	 *
+	 * Fortunately we know that ir is in a branch delay slot and thus if
+	 * it is a branch then its operation is unpredictable. So we can just
+	 * treat branches as NOPs and skip the 'emulation' entirely.
+	 *
+	 * If the worst happens and we miss a branch/jump instruction here, or
+	 * some processor implements a custom one, then it would be possible
+	 * for us to allocate an emuframe and never free it. Fortunately this
+	 * would:
 	 *
-	 * Algorithmics used a system call instruction, and
-	 * borrowed that vector.  MIPS/Linux version is a bit
-	 * more heavyweight in the interests of portability and
-	 * multiprocessor support.  For Linux we generate a
-	 * an unaligned access and force an address error exception.
+	 *  1) Be a bug in the userland code, because it has a branch/jump in
+	 *     a branch delay slot. So if we run out of emuframes and the
+	 *     userland code hangs it's not exactly the kernel's fault.
 	 *
-	 * For embedded systems (stand-alone) we prefer to use a
-	 * non-existing CP1 instruction. This prevents us from emulating
-	 * branches, but gives us a cleaner interface to the exception
-	 * handler (single entry point).
+	 *  2) Only affect that userland process, since emuframes are allocated
+	 *     per-mm and kernel threads don't use them at all.
 	 */
+	if (!get_isa16_mode(regs->cp0_epc)) {
+		if (!ir) {
+			/* typical NOP encoding: sll r0, r0, 0 */
+is_nop:
+			regs->cp0_epc = cpc;
+			regs->cp0_cause &= ~CAUSEF_BD;
+			return 0;
+		}
 
-	/* Ensure that the two instructions are in the same cache line */
-	fr = (struct emuframe __user *)
-		((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
+		switch (inst.j_format.opcode) {
+		case bcond_op:
+			switch (inst.i_format.rt) {
+			case bltz_op:
+			case bgez_op:
+			case bltzl_op:
+			case bgezl_op:
+			case bltzal_op:
+			case bgezal_op:
+			case bltzall_op:
+			case bgezall_op:
+				goto is_branch;
+			}
+			break;
 
-	/* Verify that the stack pointer is not competely insane */
-	if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
+		case cop1_op:
+			switch (inst.i_format.rs) {
+			case bc_op:
+				goto is_branch;
+			}
+			break;
+
+		case j_op:
+		case jal_op:
+		case beq_op:
+		case bne_op:
+		case blez_op:
+		case bgtz_op:
+		case beql_op:
+		case bnel_op:
+		case blezl_op:
+		case bgtzl_op:
+		case jalx_op:
+is_branch:
+			pr_warn("PID %d has a branch in an FP branch delay slot at 0x%08lx\n",
+				current->pid, regs->cp0_epc);
+			goto is_nop;
+		}
+	} else {
+		if ((ir >> 16) == MM_NOP16)
+			goto is_nop;
+
+		switch (inst.mm_i_format.opcode) {
+		case mm_beqz16_op:
+		case mm_beq32_op:
+		case mm_bnez16_op:
+		case mm_bne32_op:
+		case mm_b16_op:
+		case mm_j32_op:
+		case mm_jalx32_op:
+		case mm_jal32_op:
+			goto is_branch;
+
+		case mm_pool32i_op:
+			switch (inst.mm_i_format.rt) {
+			case mm_bltz_op:
+			case mm_bltzal_op:
+			case mm_bgez_op:
+			case mm_bgezal_op:
+			case mm_blez_op:
+			case mm_bnezc_op:
+			case mm_bgtz_op:
+			case mm_beqzc_op:
+			case mm_bltzals_op:
+			case mm_bgezals_op:
+			case mm_bc2f_op:
+			case mm_bc2t_op:
+			case mm_bc1f_op:
+			case mm_bc1t_op:
+				goto is_branch;
+			}
+			break;
+
+		case mm_pool16c_op:
+			switch (inst.mm16_r5_format.rt) {
+			case mm_jr16_op:
+			case mm_jrc_op:
+			case mm_jalr16_op:
+			case mm_jalrs16_op:
+			case mm_jraddiusp_op:
+				goto is_branch;
+			}
+			break;
+		}
+	}
+
+	pr_debug("dsemul 0x%08lx cont at 0x%08lx\n", regs->cp0_epc, cpc);
+
+	/*
+	 * The strategy is to write the instruction to a per-mm page followed
+	 * by a trap which we can catch to return to the required address. Any
+	 * alternative to full instruction emulation!!
+	 *
+	 * Algorithmics used a system call instruction, and borrowed that
+	 * vector.  MIPS/Linux version is a bit more heavyweight in the
+	 * interests of portability and multiprocessor support.  For Linux we
+	 * generate a BREAK instruction with a break code reserved for this
+	 * purpose.
+	 */
+	fr = alloc_emuframe();
+	if (!fr)
 		return SIGBUS;
 
 	if (get_isa16_mode(regs->cp0_epc)) {
@@ -103,17 +298,18 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 		err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
 	}
 
-	err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
-	err |= __put_user(cpc, &fr->epc);
-
 	if (unlikely(err)) {
 		MIPS_FPU_EMU_INC_STATS(errors);
+		free_emuframe(fr);
 		return SIGBUS;
 	}
 
 	regs->cp0_epc = ((unsigned long) &fr->emul) |
 		get_isa16_mode(regs->cp0_epc);
 
+	current->thread.fp_bd_emu_cpc = cpc;
+	set_thread_flag(TIF_FP_BD_EMU);
+
 	flush_cache_sigtramp((unsigned long)&fr->badinst);
 
 	return SIGILL;		/* force out of emulation loop */
@@ -121,64 +317,38 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 
 int do_dsemulret(struct pt_regs *xcp)
 {
-	struct emuframe __user *fr;
-	unsigned long epc;
-	u32 insn, cookie;
-	int err = 0;
-	u16 instr[2];
+	mm_context_t *mm_ctx = &current->mm->context;
+	struct emuframe __user *fr = NULL;
+	unsigned long fr_addr;
+	int success = 0;
 
-	fr = (struct emuframe __user *)
-		(msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction));
+	/* If we don't have TIF_FP_BD_EMU set... */
+	if (!test_and_clear_thread_flag(TIF_FP_BD_EMU))
+		goto out;
 
 	/*
-	 * If we can't even access the area, something is very wrong, but we'll
-	 * leave that to the default handling
+	 * ...or EPC is outside of the expected page or misaligned then
+	 * something is wrong. Leave it to the default trap/break code to
+	 * handle.
 	 */
-	if (!access_ok(VERIFY_READ, fr, sizeof(struct emuframe)))
-		return 0;
+	fr_addr = msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction);
+	if ((fr_addr < mm_ctx->fp_bd_emupage) ||
+	    (fr_addr > (mm_ctx->fp_bd_emupage + PAGE_SIZE - sizeof(*fr))) ||
+	    (fr_addr & (sizeof(*fr) - 1)))
+		goto out;
 
-	/*
-	 * Do some sanity checking on the stackframe:
-	 *
-	 *  - Is the instruction pointed to by the EPC an BREAK_MATH?
-	 *  - Is the following memory word the BD_COOKIE?
-	 */
-	if (get_isa16_mode(xcp->cp0_epc)) {
-		err = __get_user(instr[0], (u16 __user *)(&fr->badinst));
-		err |= __get_user(instr[1], (u16 __user *)((long)(&fr->badinst) + 2));
-		insn = (instr[0] << 16) | instr[1];
-	} else {
-		err = __get_user(insn, &fr->badinst);
-	}
-	err |= __get_user(cookie, &fr->cookie);
-
-	if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE))) {
-		MIPS_FPU_EMU_INC_STATS(errors);
-		return 0;
-	}
-
-	/*
-	 * At this point, we are satisfied that it's a BD emulation trap.  Yes,
-	 * a user might have deliberately put two malformed and useless
-	 * instructions in a row in his program, in which case he's in for a
-	 * nasty surprise - the next instruction will be treated as a
-	 * continuation address!  Alas, this seems to be the only way that we
-	 * can handle signals, recursion, and longjmps() in the context of
-	 * emulating the branch delay instruction.
-	 */
-
-#ifdef DSEMUL_TRACE
-	printk("dsemulret\n");
-#endif
-	if (__get_user(epc, &fr->epc)) {		/* Saved EPC */
-		/* This is not a good situation to be in */
-		force_sig(SIGBUS, current);
-
-		return 0;
-	}
+	/* At this point, we are satisfied that it's a BD emulation trap. */
+	fr = (struct emuframe __user *)fr_addr;
 
 	/* Set EPC to return to post-branch instruction */
-	xcp->cp0_epc = epc;
+	xcp->cp0_epc = current->thread.fp_bd_emu_cpc;
+	success = 1;
 
-	return 1;
+	pr_debug("dsemulret to 0x%08lx\n", xcp->cp0_epc);
+out:
+	if (fr)
+		free_emuframe(fr);
+	if (!success)
+		MIPS_FPU_EMU_INC_STATS(errors);
+	return success;
 }
-- 
1.8.4.1
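
For anyone reviewing the address check in do_dsemulret above, here is a
minimal userspace sketch of the same arithmetic. It is not kernel code:
EMU_PAGE_SIZE, struct emuframe_model and both helper names are made up
for illustration, it assumes 4 KiB pages and 32-bit instruction words,
and it ignores the ISA mode bit which the kernel masks off first. It
shows that a break is only accepted when the frame address lies inside
the per-mm page and is aligned to the frame size, and how a frame
address maps back to its bit in the allocation bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the kernel uses PAGE_SIZE for the real check. */
#define EMU_PAGE_SIZE 4096UL

/* Userspace stand-in for struct emuframe: two 32-bit instruction words. */
struct emuframe_model {
	uint32_t emul;		/* the delay slot instruction */
	uint32_t badinst;	/* the break that traps back to the kernel */
};

/*
 * Mirror of the range and alignment test: break_epc is the address of the
 * break instruction, so the frame starts one instruction word before it.
 * Returns 1 if the frame address is acceptable, 0 otherwise.
 */
static int emuframe_addr_ok(unsigned long page, unsigned long break_epc)
{
	unsigned long fr_addr = break_epc - sizeof(uint32_t);
	size_t fr_size = sizeof(struct emuframe_model);

	if (fr_addr < page)
		return 0;	/* below the emuframe page */
	if (fr_addr > page + EMU_PAGE_SIZE - fr_size)
		return 0;	/* beyond the last frame in the page */
	if (fr_addr & (fr_size - 1))
		return 0;	/* not on a frame boundary */
	return 1;
}

/* Index of a validated frame, i.e. its bit in the allocation bitmap. */
static unsigned long emuframe_index(unsigned long page, unsigned long fr_addr)
{
	assert(fr_addr >= page && fr_addr < page + EMU_PAGE_SIZE);
	return (fr_addr - page) / sizeof(struct emuframe_model);
}
```

Under these assumptions a page holds 4096 / 8 = 512 frames, which is
the value emupage_frame_count would take in the patch.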

* [PATCH 6/6] mips: non-exec stack & heap when non-exec PT_GNU_STACK is present
@ 2013-11-07 12:48   ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2013-11-07 12:48 UTC (permalink / raw)
  To: linux-mips; +Cc: Paul Burton

The stack and heap have both been executable by default on MIPS until
now. This patch changes the default to be non-executable, but only for
ELF binaries with a non-executable PT_GNU_STACK header present, and only
on CPUs with RIXI support (cpu_has_rixi) since others cannot enforce
non-executable memory. This does apply to both the heap & the stack,
despite the name PT_GNU_STACK, and this matches the behaviour of other
architectures like ARM & x86.

Current MIPS toolchains do not produce the PT_GNU_STACK header, which
means that we can rely upon this patch not changing the behaviour of
existing binaries. The new default will only take effect for newly
compiled binaries once toolchains are updated to support PT_GNU_STACK,
and since those binaries are newly compiled they can be compiled
expecting the change in default behaviour. Again this matches the way in
which the ARM & x86 architectures handled their implementations of
non-executable memory.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
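To make the PT_GNU_STACK condition concrete, here is a rough userspace
sketch of how the generic ELF loader classifies a binary before the arch
hook is consulted. The numeric constants mirror the ELF definitions as I
understand them, but struct phdr_model, the *_MODEL names and
exstack_from_phdrs() are illustrative only, and the real loader works on
full program headers rather than this two-field stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PT_GNU_STACK_MODEL	0x6474e551u	/* ELF segment type PT_GNU_STACK */
#define PF_X_MODEL		0x1u		/* segment flag: executable */

/* Same meaning as the kernel's EXSTACK_* values. */
enum exstack_model {
	EXSTACK_MODEL_DEFAULT,		/* no PT_GNU_STACK header found */
	EXSTACK_MODEL_DISABLE_X,	/* header present, PF_X clear */
	EXSTACK_MODEL_ENABLE_X,		/* header present, PF_X set */
};

/* Only the two program header fields this decision needs. */
struct phdr_model {
	uint32_t p_type;
	uint32_t p_flags;
};

static enum exstack_model
exstack_from_phdrs(const struct phdr_model *ph, size_t count)
{
	enum exstack_model result = EXSTACK_MODEL_DEFAULT;
	size_t i;

	assert(ph != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (ph[i].p_type != PT_GNU_STACK_MODEL)
			continue;
		/* The first PT_GNU_STACK header decides the outcome. */
		result = (ph[i].p_flags & PF_X_MODEL) ?
			EXSTACK_MODEL_ENABLE_X : EXSTACK_MODEL_DISABLE_X;
		break;
	}
	return result;
}
```

A toolchain which does not emit the header leaves the binary in the
DEFAULT case, which is why existing MIPS binaries keep their executable
stack & heap.
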
 arch/mips/include/asm/elf.h  |  5 +++++
 arch/mips/include/asm/page.h |  6 ++++--
 arch/mips/kernel/Makefile    |  7 ++++---
 arch/mips/kernel/elf.c       | 28 ++++++++++++++++++++++++++++
 4 files changed, 41 insertions(+), 5 deletions(-)
 create mode 100644 arch/mips/kernel/elf.c

diff --git a/arch/mips/include/asm/elf.h b/arch/mips/include/asm/elf.h
index 17163cf..d6c91dd 100644
--- a/arch/mips/include/asm/elf.h
+++ b/arch/mips/include/asm/elf.h
@@ -393,4 +393,9 @@ struct mm_struct;
 extern unsigned long arch_randomize_brk(struct mm_struct *mm);
 #define arch_randomize_brk arch_randomize_brk
 
+#define elf_read_implies_exec(ex, stk) mips_elf_read_implies_exec(&(ex), stk)
+struct elf32_hdr;
+extern int mips_elf_read_implies_exec(const struct elf32_hdr *elf_ex,
+				      int exstack);
+
 #endif /* _ASM_ELF_H */
diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index f6be474..87f862d 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -202,8 +202,10 @@ extern int __virt_addr_valid(const volatile void *kaddr);
 #define virt_addr_valid(kaddr)						\
 	__virt_addr_valid((const volatile void *) (kaddr))
 
-#define VM_DATA_DEFAULT_FLAGS	(VM_READ | VM_WRITE | VM_EXEC | \
-				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+#define VM_DATA_DEFAULT_FLAGS \
+	(VM_READ | VM_WRITE | \
+	 ((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0) | \
+	 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
 
 #define UNCAC_ADDR(addr)	((addr) - PAGE_OFFSET + UNCAC_BASE)
 #define CAC_ADDR(addr)		((addr) - UNCAC_BASE + PAGE_OFFSET)
diff --git a/arch/mips/kernel/Makefile b/arch/mips/kernel/Makefile
index 1c1b717..97d3bf7 100644
--- a/arch/mips/kernel/Makefile
+++ b/arch/mips/kernel/Makefile
@@ -4,9 +4,10 @@
 
 extra-y		:= head.o vmlinux.lds
 
-obj-y		+= cpu-probe.o branch.o entry.o genex.o idle.o irq.o process.o \
-		   prom.o ptrace.o reset.o setup.o signal.o syscall.o \
-		   time.o topology.o traps.o unaligned.o watch.o vdso.o
+obj-y		+= cpu-probe.o branch.o elf.o entry.o genex.o idle.o irq.o \
+		   process.o prom.o ptrace.o reset.o setup.o signal.o \
+		   syscall.o time.o topology.o traps.o unaligned.o watch.o \
+		   vdso.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_ftrace.o = -pg
diff --git a/arch/mips/kernel/elf.c b/arch/mips/kernel/elf.c
new file mode 100644
index 0000000..92212ba
--- /dev/null
+++ b/arch/mips/kernel/elf.c
@@ -0,0 +1,28 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License.  See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (C) 2013  Imagination Technologies Ltd.
+ */
+
+#include <linux/binfmts.h>
+#include <linux/elf.h>
+#include <linux/export.h>
+#include <asm/cpu-features.h>
+
+int mips_elf_read_implies_exec(const struct elf32_hdr *elf_ex, int exstack)
+{
+	if (exstack != EXSTACK_DISABLE_X) {
+		/* the binary doesn't request a non-executable stack */
+		return 1;
+	}
+
+	if (!cpu_has_rixi) {
+		/* the CPU doesn't support non-executable memory */
+		return 1;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(mips_elf_read_implies_exec);
-- 
1.8.4.1


* Re: [PATCH 0/6] FP improvements
  2013-11-07 12:48 ` Paul Burton
                   ` (6 preceding siblings ...)
  (?)
@ 2013-11-07 13:57 ` Ralf Baechle
  -1 siblings, 0 replies; 42+ messages in thread
From: Ralf Baechle @ 2013-11-07 13:57 UTC (permalink / raw)
  To: Paul Burton; +Cc: linux-mips

On Thu, Nov 07, 2013 at 12:48:27PM +0000, Paul Burton wrote:

> This series includes a few improvements to floating point support. The
> first 2 patches add support for missing instructions to the FPU
> emulator. The 3rd is a small cleanup. The 4th introduces support for
> O32 binaries using 64-bit floating point. The 5th modifies the FPU
> emulator to stop executing code from the user stack. The 6th & final
> patch is not strictly FP-related but is a consequence of the 5th patch,
> and allows us to mark the stack & allocated heap memory as
> non-executable by default.

Very interesting, in particular #5.  More once I and others have had a
chance to review the series.

  Ralf


* Re: [PATCH 5/6] mips: use per-mm page to execute FP branch delay slots
  2013-11-07 12:48   ` Paul Burton
  (?)
@ 2013-11-07 18:00   ` David Daney
  2013-11-08 12:07       ` Paul Burton
  -1 siblings, 1 reply; 42+ messages in thread
From: David Daney @ 2013-11-07 18:00 UTC (permalink / raw)
  To: Paul Burton, linux-mips

Nice work...

On 11/07/2013 04:48 AM, Paul Burton wrote:
[...]
> -	 * Algorithmics used a system call instruction, and
> -	 * borrowed that vector.  MIPS/Linux version is a bit
> -	 * more heavyweight in the interests of portability and
> -	 * multiprocessor support.  For Linux we generate a
> -	 * an unaligned access and force an address error exception.
> +	 *  1) Be a bug in the userland code, because it has a branch/jump in
> +	 *     a branch delay slot. So if we run out of emuframes and the
> +	 *     userland code hangs it's not exactly the kernels fault.

s/kernels/kernel's/


>   	 *
> -	 * For embedded systems (stand-alone) we prefer to use a
> -	 * non-existing CP1 instruction. This prevents us from emulating
> -	 * branches, but gives us a cleaner interface to the exception
> -	 * handler (single entry point).
> +	 *  2) Only affect that userland process, since emuframes are allocated
> +	 *     per-mm and kernel threads don't use them at all.
>   	 */
> +	if (!get_isa16_mode(regs->cp0_epc)) {
> +		if (!ir) {
> +			/* typical NOP encoding: sll r0, r0, r0 */
> +is_nop:
> +			regs->cp0_epc = cpc;
> +			regs->cp0_cause &= ~CAUSEF_BD;
> +			return 0;
> +		}
>
> -	/* Ensure that the two instructions are in the same cache line */
> -	fr = (struct emuframe __user *)
> -		((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
> +		switch (inst.j_format.opcode) {
> +		case bcond_op:
> +			switch (inst.i_format.rt) {
> +			case bltz_op:
> +			case bgez_op:
> +			case bltzl_op:
> +			case bgezl_op:
> +			case bltzal_op:
> +			case bgezal_op:
> +			case bltzall_op:
> +			case bgezall_op:
> +				goto is_branch;
> +			}
> +			break;

Is there any way to use the support in arch/mips/kernel/branch.c instead 
of duplicating the code here?

It may require some refactoring to make it work, but I think it would be 
worth it.

>
> -	/* Verify that the stack pointer is not competely insane */
> -	if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
> +		case cop1_op:
> +			switch (inst.i_format.rs) {
> +			case bc_op:
> +				goto is_branch;
> +			}
> +			break;
> +
> +		case j_op:
> +		case jal_op:
> +		case beq_op:
> +		case bne_op:
> +		case blez_op:
> +		case bgtz_op:
> +		case beql_op:
> +		case bnel_op:
> +		case blezl_op:
> +		case bgtzl_op:
> +		case jalx_op:
> +is_branch:
> +			pr_warn("PID %d has a branch in an FP branch delay slot at 0x%08lx\n",
> +				current->pid, regs->cp0_epc);
> +			goto is_nop;
> +		}
> +	} else {
> +		if ((ir >> 16) == MM_NOP16)
> +			goto is_nop;
> +
> +		switch (inst.mm_i_format.opcode) {
> +		case mm_beqz16_op:
> +		case mm_beq32_op:
> +		case mm_bnez16_op:
> +		case mm_bne32_op:
> +		case mm_b16_op:
> +		case mm_j32_op:
> +		case mm_jalx32_op:
> +		case mm_jal32_op:
> +			goto is_branch;
> +
> +		case mm_pool32i_op:
> +			switch (inst.mm_i_format.rt) {
> +			case mm_bltz_op:
> +			case mm_bltzal_op:
> +			case mm_bgez_op:
> +			case mm_bgezal_op:
> +			case mm_blez_op:
> +			case mm_bnezc_op:
> +			case mm_bgtz_op:
> +			case mm_beqzc_op:
> +			case mm_bltzals_op:
> +			case mm_bgezals_op:
> +			case mm_bc2f_op:
> +			case mm_bc2t_op:
> +			case mm_bc1f_op:
> +			case mm_bc1t_op:
> +				goto is_branch;
> +			}
> +			break;
> +
> +		case mm_pool16c_op:
> +			switch (inst.mm16_r5_format.rt) {
> +			case mm_jr16_op:
> +			case mm_jrc_op:
> +			case mm_jalr16_op:
> +			case mm_jalrs16_op:
> +			case mm_jraddiusp_op:
> +				goto is_branch;
> +			}
> +			break;
> +		}
> +	}
> +
>


* Re: [PATCH 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2013-11-08 12:07       ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2013-11-08 12:07 UTC (permalink / raw)
  To: David Daney; +Cc: linux-mips

On 07/11/13 18:00, David Daney wrote:
> Nice work...
>
> On 11/07/2013 04:48 AM, Paul Burton wrote:
> [...]
>> -     * Algorithmics used a system call instruction, and
>> -     * borrowed that vector.  MIPS/Linux version is a bit
>> -     * more heavyweight in the interests of portability and
>> -     * multiprocessor support.  For Linux we generate a
>> -     * an unaligned access and force an address error exception.
>> +     *  1) Be a bug in the userland code, because it has a
>> branch/jump in
>> +     *     a branch delay slot. So if we run out of emuframes and the
>> +     *     userland code hangs it's not exactly the kernels fault.
>
> s/kernels/kernel's/
>

Yup, thanks.

>
>>        *
>> -     * For embedded systems (stand-alone) we prefer to use a
>> -     * non-existing CP1 instruction. This prevents us from emulating
>> -     * branches, but gives us a cleaner interface to the exception
>> -     * handler (single entry point).
>> +     *  2) Only affect that userland process, since emuframes are
>> allocated
>> +     *     per-mm and kernel threads don't use them at all.
>>        */
>> +    if (!get_isa16_mode(regs->cp0_epc)) {
>> +        if (!ir) {
>> +            /* typical NOP encoding: sll r0, r0, r0 */
>> +is_nop:
>> +            regs->cp0_epc = cpc;
>> +            regs->cp0_cause &= ~CAUSEF_BD;
>> +            return 0;
>> +        }
>>
>> -    /* Ensure that the two instructions are in the same cache line */
>> -    fr = (struct emuframe __user *)
>> -        ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
>> +        switch (inst.j_format.opcode) {
>> +        case bcond_op:
>> +            switch (inst.i_format.rt) {
>> +            case bltz_op:
>> +            case bgez_op:
>> +            case bltzl_op:
>> +            case bgezl_op:
>> +            case bltzal_op:
>> +            case bgezal_op:
>> +            case bltzall_op:
>> +            case bgezall_op:
>> +                goto is_branch;
>> +            }
>> +            break;
>
> Is there any way to use the support in arch/mips/kernel/branch.c instead
> of duplicating the code here?
>
> It may require some refactoring to make it work, but I think it would be
> worth it.
>

Ah (how had I not spotted that code?) :)

It may fit better with the (mm_)isBranchInstr functions in 
arch/mips/math-emu/cp1emu.c since they already return a value specifying 
whether or not the instruction is a branch. The microMIPS variant is 
already used elsewhere too. I'll take a look at it.

Thanks,
     Paul

>>
>> -    /* Verify that the stack pointer is not competely insane */
>> -    if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
>> +        case cop1_op:
>> +            switch (inst.i_format.rs) {
>> +            case bc_op:
>> +                goto is_branch;
>> +            }
>> +            break;
>> +
>> +        case j_op:
>> +        case jal_op:
>> +        case beq_op:
>> +        case bne_op:
>> +        case blez_op:
>> +        case bgtz_op:
>> +        case beql_op:
>> +        case bnel_op:
>> +        case blezl_op:
>> +        case bgtzl_op:
>> +        case jalx_op:
>> +is_branch:
>> +            pr_warn("PID %d has a branch in an FP branch delay slot
>> at 0x%08lx\n",
>> +                current->pid, regs->cp0_epc);
>> +            goto is_nop;
>> +        }
>> +    } else {
>> +        if ((ir >> 16) == MM_NOP16)
>> +            goto is_nop;
>> +
>> +        switch (inst.mm_i_format.opcode) {
>> +        case mm_beqz16_op:
>> +        case mm_beq32_op:
>> +        case mm_bnez16_op:
>> +        case mm_bne32_op:
>> +        case mm_b16_op:
>> +        case mm_j32_op:
>> +        case mm_jalx32_op:
>> +        case mm_jal32_op:
>> +            goto is_branch;
>> +
>> +        case mm_pool32i_op:
>> +            switch (inst.mm_i_format.rt) {
>> +            case mm_bltz_op:
>> +            case mm_bltzal_op:
>> +            case mm_bgez_op:
>> +            case mm_bgezal_op:
>> +            case mm_blez_op:
>> +            case mm_bnezc_op:
>> +            case mm_bgtz_op:
>> +            case mm_beqzc_op:
>> +            case mm_bltzals_op:
>> +            case mm_bgezals_op:
>> +            case mm_bc2f_op:
>> +            case mm_bc2t_op:
>> +            case mm_bc1f_op:
>> +            case mm_bc1t_op:
>> +                goto is_branch;
>> +            }
>> +            break;
>> +
>> +        case mm_pool16c_op:
>> +            switch (inst.mm16_r5_format.rt) {
>> +            case mm_jr16_op:
>> +            case mm_jrc_op:
>> +            case mm_jalr16_op:
>> +            case mm_jalrs16_op:
>> +            case mm_jraddiusp_op:
>> +                goto is_branch;
>> +            }
>> +            break;
>> +        }
>> +    }
>> +
>>


* [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2013-11-08 14:50         ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2013-11-08 14:50 UTC (permalink / raw)
  To: linux-mips; +Cc: ddaney.cavm, Paul Burton

If a floating point branch instruction (bc1[ft]l?) is emulated,
typically because we're running on a core with no FPU, then we need to
execute the instruction in its branch delay slot too. This is done by
writing that instruction to memory followed by a trap, as part of an
"emuframe", and executing it. This avoids the requirement of an emulator
for the entire MIPS instruction set. Prior to this patch such emuframes
are written to the user stack and executed from there.
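
As a rough illustration of the frame described above, the sketch below
builds a two-word emuframe in plain C. The names build_emuframe and
SKETCH_TRAP_INSN are made up for this example, and the trap encoding is
an assumption (a plain `break` with code 0), not the break code the
kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t mips_instruction;

/* Mirrors the two-word frame layout used by this patch. */
struct emuframe {
	mips_instruction emul;    /* instruction taken from the delay slot */
	mips_instruction badinst; /* trap that hands control back to the kernel */
};

/* ASSUMPTION: plain MIPS `break` with code 0, for illustration only. */
#define SKETCH_TRAP_INSN UINT32_C(0x0000000d)

/* Fill a frame so that executing it runs `insn` and then traps. */
void build_emuframe(struct emuframe *fr, mips_instruction insn)
{
	assert(fr != NULL);
	fr->emul = insn;
	fr->badinst = SKETCH_TRAP_INSN;
}
```

Execution would start at the address of `emul` and fall through into
`badinst`, which is why the two words must be adjacent.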

This patch moves FP branch delay emuframes off the user stack and
into a per-mm page. Allocating a page per-mm leaves userland with access
to only what it had access to previously, and prevents processes
interfering with each other as they might if a single system-wide page
were used. The book-keeping required to track the allocation of
emuframes is not cheap, but given that invoking the FP emulator is
already very expensive I don't expect this to be an issue.
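
The book-keeping can be pictured as a bitmap with one bit per frame. The
following is a user-space sketch only, assuming 4 KiB pages and 8-byte
frames; frame_alloc and frame_free are hypothetical names, and the
locking and sleeping done by the real code are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE   4096u /* ASSUMPTION: 4 KiB pages */
#define SKETCH_FRAME_SIZE  8u    /* two 32-bit instructions per frame */
#define SKETCH_FRAME_COUNT (SKETCH_PAGE_SIZE / SKETCH_FRAME_SIZE)
#define BITS_PER_WORD      (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS \
	((SKETCH_FRAME_COUNT + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct frame_map {
	unsigned long bits[MAP_WORDS]; /* bit set means frame in use */
};

/* Claim the lowest free frame; return its index, or -1 if none is free. */
int frame_alloc(struct frame_map *map)
{
	assert(map != NULL);
	for (unsigned int idx = 0; idx < SKETCH_FRAME_COUNT; idx++) {
		unsigned long mask = 1ul << (idx % BITS_PER_WORD);
		unsigned long *word = &map->bits[idx / BITS_PER_WORD];

		if (!(*word & mask)) {
			*word |= mask;
			return (int)idx;
		}
	}
	return -1;
}

/* Release a frame previously returned by frame_alloc(). */
void frame_free(struct frame_map *map, int idx)
{
	assert(map != NULL);
	assert(idx >= 0 && (unsigned int)idx < SKETCH_FRAME_COUNT);
	map->bits[idx / BITS_PER_WORD] &= ~(1ul << (idx % BITS_PER_WORD));
}
```

When frame_alloc returns -1 the caller has to wait for another thread to
release a frame; the patch does that with a wait queue.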

The biggest issue with executing the instruction from an FP branch delay
slot is that we must ensure that we free the frame from which we ran it.
That means we must trap back to the kernel after executing that
instruction, which in turn means we must take special care not to let
the PC be changed as a result of that instruction. Fortunately, since
we're executing an instruction found in a branch delay slot, the result
is unpredictable if that instruction is a branch or jump, so we can
treat those as NOPs and avoid them causing a problem. However there is
still the possibility that a signal may be handled whilst executing the
branch delay instruction. This would usually be fine as we would simply
execute our trap back to the kernel after sigreturn, however it is
possible for userland to simply not return from the signal handler - for
example if it executes something like a longjmp. In that case we would
never trap back to the kernel and never free the frame. For that reason
a TIF_FP_BD_EMU flag is introduced and set whilst we are executing an FP
branch delay instruction. Whilst this flag is set, signals will be
ignored. This isn't exactly pretty, but it's simpler than most of the
alternatives. One other simple option I considered would be to just
kill a process if we find a branch in an FP branch delay slot, but I
chose the current approach because its result is closer to what would
previously happen.
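
The effect of the flag on the return-to-user path can be modelled in C
as below. This is a sketch of the decision only, not the assembly in
entry.S; the helper names are invented and the bit number used for the
pending-signal flag is an assumption. As far as the entry.S hunk shows,
the pending bit is only masked in a local copy of the flags, so the
sketch models signal handling being skipped for now, not the signal
being discarded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: bit number chosen for illustration only. */
#define SKETCH_TIF_SIGPENDING (UINT32_C(1) << 1)
/* Bit 28, matching TIF_FP_BD_EMU in this patch. */
#define SKETCH_TIF_FP_BD_EMU  (UINT32_C(1) << 28)

/* Work flags that the notify path should act on right now. */
uint32_t sketch_notify_work(uint32_t flags)
{
	if (flags & SKETCH_TIF_FP_BD_EMU)
		flags &= ~SKETCH_TIF_SIGPENDING; /* local copy only */
	return flags;
}

/* True if pending signals would be handled on this return to userland. */
bool sketch_signals_handled_now(uint32_t flags)
{
	return (sketch_notify_work(flags) & SKETCH_TIF_SIGPENDING) != 0;
}

/* Self-check of the three interesting cases. */
void sketch_signal_selfcheck(void)
{
	assert(sketch_signals_handled_now(SKETCH_TIF_SIGPENDING));
	assert(!sketch_signals_handled_now(SKETCH_TIF_SIGPENDING |
					   SKETCH_TIF_FP_BD_EMU));
	assert(!sketch_signals_handled_now(SKETCH_TIF_FP_BD_EMU));
}
```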

The primary benefit of this patch is that we are now free to mark the
user stack non-executable where that is possible.

Additionally the FP emuframes themselves are simplified somewhat. The
cookie field is removed since we can be pretty certain that we're
looking at an emuframe by virtue of it being located in the page
allocated for them. The PC to continue from is moved into struct
thread_struct since the control flow of a thread can no longer be
modified for the duration of the 'emulation', meaning there will now
only ever be a single emuframe required for a thread at any given time.
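
With the cookie gone, recognising an emuframe comes down to an address
check against the per-mm page. A minimal sketch of such a check,
assuming 4 KiB pages and 8-byte frames, with emuframe_index as a
hypothetical helper name:

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE  4096ul /* ASSUMPTION: 4 KiB pages */
#define SKETCH_FRAME_SIZE 8ul    /* two 32-bit instructions per frame */

/*
 * Return the frame index for `addr`, or -1 if it does not point at the
 * start of a frame inside the page beginning at `page_base`.
 */
long emuframe_index(unsigned long page_base, unsigned long addr)
{
	unsigned long off;

	assert(page_base % SKETCH_PAGE_SIZE == 0); /* page-aligned base */

	if (addr < page_base)
		return -1;
	off = addr - page_base;
	if (off >= SKETCH_PAGE_SIZE || off % SKETCH_FRAME_SIZE != 0)
		return -1;
	return (long)(off / SKETCH_FRAME_SIZE);
}
```

An index obtained this way is also what would be needed to clear the
matching bit in the allocation bitmap when the frame is released.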

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
Changes in v2:
  - s/kernels/kernel's/
  - Use (mm_)isBranchInstr in mips_dsemul rather than duplicating
    similar logic.
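
For readers unfamiliar with what such a check involves, here is a
hedged, stand-alone sketch for the classic (non-microMIPS) encoding. It
is not the kernel's isBranchInstr and the name sketch_is_branch is made
up; it only covers the opcodes that the open-coded switch in v1 listed,
so register jumps such as jr and jalr are not recognised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decide whether a classic-encoding MIPS instruction is a branch/jump. */
bool sketch_is_branch(uint32_t insn)
{
	uint32_t opcode = insn >> 26;          /* bits 31..26 */
	uint32_t rs = (insn >> 21) & 0x1f;     /* bits 25..21 */
	uint32_t rt = (insn >> 16) & 0x1f;     /* bits 20..16 */

	switch (opcode) {
	case 0x01: /* REGIMM: the condition is selected by rt */
		switch (rt) {
		case 0x00: case 0x01: /* bltz, bgez */
		case 0x02: case 0x03: /* bltzl, bgezl */
		case 0x10: case 0x11: /* bltzal, bgezal */
		case 0x12: case 0x13: /* bltzall, bgezall */
			return true;
		}
		return false;
	case 0x11: /* COP1: bc1f/bc1t and likely forms have rs == 8 */
		return rs == 0x08;
	case 0x02: case 0x03:                       /* j, jal */
	case 0x04: case 0x05: case 0x06: case 0x07: /* beq, bne, blez, bgtz */
	case 0x14: case 0x15: case 0x16: case 0x17: /* likely variants */
	case 0x1d:                                  /* jalx */
		return true;
	}
	return false;
}

/* Self-check against a few hand-encoded instructions. */
void sketch_branch_selfcheck(void)
{
	assert(sketch_is_branch(UINT32_C(0x10000000)));  /* beq $0, $0, 0 */
	assert(sketch_is_branch(UINT32_C(0x45010000)));  /* bc1t */
	assert(!sketch_is_branch(UINT32_C(0x00000000))); /* nop */
	assert(!sketch_is_branch(UINT32_C(0x24420001))); /* addiu $2, $2, 1 */
}
```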
---
 arch/mips/include/asm/fpu_emulator.h |   4 +
 arch/mips/include/asm/mmu.h          |  12 ++
 arch/mips/include/asm/mmu_context.h  |   7 +
 arch/mips/include/asm/processor.h    |   7 +-
 arch/mips/include/asm/thread_info.h  |   2 +
 arch/mips/kernel/entry.S             |  13 +-
 arch/mips/kernel/process.c           |   2 +
 arch/mips/kernel/vdso.c              |   2 +-
 arch/mips/math-emu/cp1emu.c          |   4 +-
 arch/mips/math-emu/dsemul.c          | 266 ++++++++++++++++++++++++-----------
 10 files changed, 226 insertions(+), 93 deletions(-)

diff --git a/arch/mips/include/asm/fpu_emulator.h b/arch/mips/include/asm/fpu_emulator.h
index 2abb587..16f7b0b 100644
--- a/arch/mips/include/asm/fpu_emulator.h
+++ b/arch/mips/include/asm/fpu_emulator.h
@@ -51,6 +51,8 @@ do {									\
 #define MIPS_FPU_EMU_INC_STATS(M) do { } while (0)
 #endif /* CONFIG_DEBUG_FS */
 
+extern void dsemul_thread_cleanup(void);
+extern void dsemul_mm_cleanup(struct mm_struct *mm);
 extern int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
 	unsigned long cpc);
 extern int do_dsemulret(struct pt_regs *xcp);
@@ -58,6 +60,8 @@ extern int fpu_emulator_cop1Handler(struct pt_regs *xcp,
 				    struct mips_fpu_struct *ctx, int has_fpu,
 				    void *__user *fault_addr);
 int process_fpemu_return(int sig, void __user *fault_addr);
+int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
+		  unsigned long *contpc);
 int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
 		     unsigned long *contpc);
 
diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
index c436138..08214da 100644
--- a/arch/mips/include/asm/mmu.h
+++ b/arch/mips/include/asm/mmu.h
@@ -1,9 +1,21 @@
 #ifndef __ASM_MMU_H
 #define __ASM_MMU_H
 
+#include <linux/mutex.h>
+#include <linux/wait.h>
+
 typedef struct {
 	unsigned long asid[NR_CPUS];
 	void *vdso;
+
+	/* address of page used to hold FP branch delay emulation frames */
+	unsigned long fp_bd_emupage;
+	/* bitmap tracking allocation of fp_bd_emupage */
+	unsigned long *fp_bd_emupage_allocmap;
+	/* mutex to be held whilst modifying fp_bd_emupage(_allocmap) */
+	struct mutex fp_bd_emupage_mutex;
+	/* wait queue for threads requiring an emuframe */
+	wait_queue_head_t fp_bd_emupage_queue;
 } mm_context_t;
 
 #endif /* __ASM_MMU_H */
diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
index e277bba..c55e864 100644
--- a/arch/mips/include/asm/mmu_context.h
+++ b/arch/mips/include/asm/mmu_context.h
@@ -16,6 +16,7 @@
 #include <linux/smp.h>
 #include <linux/slab.h>
 #include <asm/cacheflush.h>
+#include <asm/fpu_emulator.h>
 #include <asm/hazards.h>
 #include <asm/tlbflush.h>
 #ifdef CONFIG_MIPS_MT_SMTC
@@ -133,6 +134,11 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	for_each_possible_cpu(i)
 		cpu_context(i, mm) = 0;
 
+	mm->context.fp_bd_emupage = 0;
+	mm->context.fp_bd_emupage_allocmap = NULL;
+	mutex_init(&mm->context.fp_bd_emupage_mutex);
+	init_waitqueue_head(&mm->context.fp_bd_emupage_queue);
+
 	return 0;
 }
 
@@ -199,6 +205,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
  */
 static inline void destroy_context(struct mm_struct *mm)
 {
+	dsemul_mm_cleanup(mm);
 }
 
 #define deactivate_mm(tsk, mm)	do { } while (0)
diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
index 3605b84..683a3d6 100644
--- a/arch/mips/include/asm/processor.h
+++ b/arch/mips/include/asm/processor.h
@@ -38,9 +38,10 @@ extern unsigned int vced_count, vcei_count;
 
 /*
  * A special page (the vdso) is mapped into all processes at the very
- * top of the virtual memory space.
+ * top of the virtual memory space. The page below it is used for FP
+ * emulator branch delay slot executions.
  */
-#define SPECIAL_PAGES_SIZE PAGE_SIZE
+#define SPECIAL_PAGES_SIZE (PAGE_SIZE * 2)
 
 #ifdef CONFIG_32BIT
 #ifdef CONFIG_KVM_GUEST
@@ -226,6 +227,8 @@ struct thread_struct {
 
 	/* Saved fpu/fpu emulator stuff. */
 	struct mips_fpu_struct fpu;
+	/* PC to continue from following an FP branch delay 'emulation' */
+	unsigned long fp_bd_emu_cpc;
 #ifdef CONFIG_MIPS_MT_FPAFF
 	/* Emulated instruction count */
 	unsigned long emulated_fp;
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index b6da8b7..eee6e18 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -118,6 +118,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
 #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
 #define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
+#define TIF_FP_BD_EMU		28	/* executing an FP branch delay */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -135,6 +136,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
 #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
 #define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
+#define _TIF_FP_BD_EMU		(1<<TIF_FP_BD_EMU)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 
 #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
diff --git a/arch/mips/kernel/entry.S b/arch/mips/kernel/entry.S
index e578685..24707d7 100644
--- a/arch/mips/kernel/entry.S
+++ b/arch/mips/kernel/entry.S
@@ -168,10 +168,15 @@ work_resched:
 	andi	t0, a2, _TIF_NEED_RESCHED
 	bnez	t0, work_resched
 
-work_notifysig:				# deal with pending signals and
-					# notify-resume requests
-	move	a0, sp
-	li	a1, 0
+work_notifysig:
+	and	t0, a2, _TIF_FP_BD_EMU	# are we currently 'emulating' the
+					# delay slot of an FP branch?
+	beqz	t0, 1f			# no, continue below
+	and	a2, a2, ~_TIF_SIGPENDING	# yes, skip handling signals
+	beqz	a2, restore_all		# which leaves us nothing to do
+
+1:	move	a0, sp			# deal with pending signals and
+	li	a1, 0			# notify-resume requests
 	jal	do_notify_resume	# a2 already loaded
 	j	resume_userspace_check
 
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 747a6cf..0219502 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -32,6 +32,7 @@
 #include <asm/cpu.h>
 #include <asm/dsp.h>
 #include <asm/fpu.h>
+#include <asm/fpu_emulator.h>
 #include <asm/pgtable.h>
 #include <asm/mipsregs.h>
 #include <asm/processor.h>
@@ -72,6 +73,7 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
 
 void exit_thread(void)
 {
+	dsemul_thread_cleanup();
 }
 
 void flush_thread(void)
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 0f1af58..213d871 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -78,7 +78,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 
 	down_write(&mm->mmap_sem);
 
-	addr = vdso_addr(mm->start_stack);
+	addr = vdso_addr(mm->start_stack) + PAGE_SIZE;
 
 	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 22f7b11..a0566c8 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -665,8 +665,8 @@ int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
  * a single subroutine should be used across both
  * modules.
  */
-static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
-			 unsigned long *contpc)
+int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
+		  unsigned long *contpc)
 {
 	union mips_instruction insn = (union mips_instruction)dec_insn.insn;
 	unsigned int fcr31;
diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
index 7ea622a..3e64b17 100644
--- a/arch/mips/math-emu/dsemul.c
+++ b/arch/mips/math-emu/dsemul.c
@@ -1,6 +1,8 @@
 #include <linux/compiler.h>
+#include <linux/err.h>
 #include <linux/mm.h>
 #include <linux/signal.h>
+#include <linux/slab.h>
 #include <linux/smp.h>
 
 #include <asm/asm.h>
@@ -45,52 +47,173 @@
 struct emuframe {
 	mips_instruction	emul;
 	mips_instruction	badinst;
-	mips_instruction	cookie;
-	unsigned long		epc;
 };
 
+static const int emupage_frame_count = PAGE_SIZE / sizeof(struct emuframe);
+
+static struct emuframe __user *alloc_emuframe(void)
+{
+	mm_context_t *mm_ctx = &current->mm->context;
+	struct emuframe __user *fr = NULL;
+	unsigned long addr;
+	int idx;
+
+retry:
+	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
+
+	/* Ensure we have a page allocated for emuframes */
+	if (!mm_ctx->fp_bd_emupage) {
+		addr = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
+				   VM_READ|VM_WRITE|VM_EXEC|
+				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+				   0);
+		if (IS_ERR_VALUE(addr))
+			goto out_unlock;
+
+		mm_ctx->fp_bd_emupage = addr;
+		pr_debug("allocate emupage at 0x%08lx to %d\n", addr,
+			 current->pid);
+	}
+
+	/* Ensure we have an allocation bitmap */
+	if (!mm_ctx->fp_bd_emupage_allocmap) {
+		mm_ctx->fp_bd_emupage_allocmap =
+			kcalloc(BITS_TO_LONGS(emupage_frame_count),
+				sizeof(unsigned long),
+				GFP_KERNEL);
+
+		if (!mm_ctx->fp_bd_emupage_allocmap)
+			goto out_unlock;
+	}
+
+	/* Attempt to allocate a single bit/frame */
+	idx = bitmap_find_free_region(mm_ctx->fp_bd_emupage_allocmap,
+				      emupage_frame_count, 0);
+	if (idx < 0) {
+		/*
+		 * Failed to allocate a frame. We'll wait until one becomes
+		 * available. The mutex is unlocked so that other threads
+		 * actually get the opportunity to free their frames, which
+		 * means technically the result of bitmap_full may be incorrect.
+		 * However the worst case is that we repeat all this and end up
+		 * back here again.
+		 */
+		mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
+		if (!wait_event_killable(mm_ctx->fp_bd_emupage_queue,
+			!bitmap_full(mm_ctx->fp_bd_emupage_allocmap,
+				     emupage_frame_count)))
+			goto retry;
+
+		/* Received a fatal signal - just give in */
+		return NULL;
+	}
+
+	/* Success! */
+	fr = (struct emuframe __user *)mm_ctx->fp_bd_emupage + idx;
+	pr_debug("allocate emuframe %d to %d\n", idx, current->pid);
+out_unlock:
+	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
+	return fr;
+}
+
+static void free_emuframe(struct emuframe __user *frame)
+{
+	mm_context_t *mm_ctx = &current->mm->context;
+	int idx;
+
+	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
+
+	idx = frame - (struct emuframe __user *)mm_ctx->fp_bd_emupage;
+	pr_debug("free emuframe %d from %d\n", idx, current->pid);
+	bitmap_clear(mm_ctx->fp_bd_emupage_allocmap, idx, 1);
+
+	/* If some thread is waiting for a frame, now's its chance */
+	wake_up(&mm_ctx->fp_bd_emupage_queue);
+
+	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
+}
+
+void dsemul_thread_cleanup(void)
+{
+	/*
+	 * We should always have passed through do_dsemulret prior to the
+	 * thread exiting, so TIF_FP_BD_EMU should never be set here.
+	 */
+	BUG_ON(test_thread_flag(TIF_FP_BD_EMU));
+}
+
+void dsemul_mm_cleanup(struct mm_struct *mm)
+{
+	mm_context_t *mm_ctx = &mm->context;
+
+	kfree(mm_ctx->fp_bd_emupage_allocmap);
+}
+
 int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 {
-	extern asmlinkage void handle_dsemulret(void);
+	struct mm_decoded_insn mm_inst = { .insn = ir };
 	struct emuframe __user *fr;
-	int err;
+	struct pt_regs dummy_regs;
+	unsigned long dummy_cpc;
+	int err, is_mm;
 
-	if ((get_isa16_mode(regs->cp0_epc) && ((ir >> 16) == MM_NOP16)) ||
-		(ir == 0)) {
-		/* NOP is easy */
+	/*
+	 * Trivially handle typical NOP encodings:
+	 *
+	 *   MIPS32:		sll	r0, r0, 0
+	 *   microMIPS:		move16	r0, r0
+	 */
+	is_mm = get_isa16_mode(regs->cp0_epc);
+	if ((!is_mm && !ir) || (is_mm && ((ir >> 16) == MM_NOP16))) {
+is_nop:
 		regs->cp0_epc = cpc;
 		regs->cp0_cause &= ~CAUSEF_BD;
 		return 0;
 	}
-#ifdef DSEMUL_TRACE
-	printk("dsemul %lx %lx\n", regs->cp0_epc, cpc);
-
-#endif
 
 	/*
-	 * The strategy is to push the instruction onto the user stack
-	 * and put a trap after it which we can catch and jump to
-	 * the required address any alternative apart from full
-	 * instruction emulation!!.
+	 * In order for us to clean up the emuframe properly, we'll need to
+	 * execute a break instruction after ir. If ir is a branch then we may
+	 * never reach that break instruction and thus never free the emuframe.
 	 *
-	 * Algorithmics used a system call instruction, and
-	 * borrowed that vector.  MIPS/Linux version is a bit
-	 * more heavyweight in the interests of portability and
-	 * multiprocessor support.  For Linux we generate a
-	 * an unaligned access and force an address error exception.
+	 * Fortunately we know that ir is in a branch delay slot and thus if
+	 * it is a branch then its operation is unpredictable. So we can just
+	 * treat branches as NOPs and skip the 'emulation' entirely.
 	 *
-	 * For embedded systems (stand-alone) we prefer to use a
-	 * non-existing CP1 instruction. This prevents us from emulating
-	 * branches, but gives us a cleaner interface to the exception
-	 * handler (single entry point).
+	 * If the worst happens and we miss a branch/jump instruction here, or
+	 * some processor implements a custom one, then it would be possible
+	 * for us to allocate an emuframe and never free it. Fortunately this
+	 * would:
+	 *
+	 *  1) Be a bug in the userland code, because it has a branch/jump in
+	 *     a branch delay slot. So if we run out of emuframes and the
+	 *     userland code hangs it's not exactly the kernel's fault.
+	 *
+	 *  2) Only affect that userland process, since emuframes are allocated
+	 *     per-mm and kernel threads don't use them at all.
 	 */
+	if ((!is_mm && isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc)) ||
+	    (is_mm && mm_isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc))) {
+		pr_warn("PID %d has a branch in an FP branch delay slot at 0x%08lx\n",
+			current->pid, regs->cp0_epc);
+		goto is_nop;
+	}
 
-	/* Ensure that the two instructions are in the same cache line */
-	fr = (struct emuframe __user *)
-		((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
+	pr_debug("dsemul 0x%08lx cont at 0x%08lx\n", regs->cp0_epc, cpc);
 
-	/* Verify that the stack pointer is not competely insane */
-	if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
+	/*
+	 * The strategy is to write the instruction to a per-mm page followed
+	 * by a trap which we can catch to return to the required address.
+	 * This avoids the need for full instruction emulation.
+	 *
+	 * Algorithmics used a system call instruction, and borrowed that
+	 * vector.  MIPS/Linux version is a bit more heavyweight in the
+	 * interests of portability and multiprocessor support.  For Linux we
+	 * generate a BREAK instruction with a break code reserved for this
+	 * purpose.
+	 */
+	fr = alloc_emuframe();
+	if (!fr)
 		return SIGBUS;
 
 	if (get_isa16_mode(regs->cp0_epc)) {
@@ -103,17 +226,18 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 		err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
 	}
 
-	err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
-	err |= __put_user(cpc, &fr->epc);
-
 	if (unlikely(err)) {
 		MIPS_FPU_EMU_INC_STATS(errors);
+		free_emuframe(fr);
 		return SIGBUS;
 	}
 
 	regs->cp0_epc = ((unsigned long) &fr->emul) |
 		get_isa16_mode(regs->cp0_epc);
 
+	current->thread.fp_bd_emu_cpc = cpc;
+	set_thread_flag(TIF_FP_BD_EMU);
+
 	flush_cache_sigtramp((unsigned long)&fr->badinst);
 
 	return SIGILL;		/* force out of emulation loop */
@@ -121,64 +245,38 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 
 int do_dsemulret(struct pt_regs *xcp)
 {
-	struct emuframe __user *fr;
-	unsigned long epc;
-	u32 insn, cookie;
-	int err = 0;
-	u16 instr[2];
-
-	fr = (struct emuframe __user *)
-		(msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction));
-
-	/*
-	 * If we can't even access the area, something is very wrong, but we'll
-	 * leave that to the default handling
-	 */
-	if (!access_ok(VERIFY_READ, fr, sizeof(struct emuframe)))
-		return 0;
-
-	/*
-	 * Do some sanity checking on the stackframe:
-	 *
-	 *  - Is the instruction pointed to by the EPC an BREAK_MATH?
-	 *  - Is the following memory word the BD_COOKIE?
-	 */
-	if (get_isa16_mode(xcp->cp0_epc)) {
-		err = __get_user(instr[0], (u16 __user *)(&fr->badinst));
-		err |= __get_user(instr[1], (u16 __user *)((long)(&fr->badinst) + 2));
-		insn = (instr[0] << 16) | instr[1];
-	} else {
-		err = __get_user(insn, &fr->badinst);
-	}
-	err |= __get_user(cookie, &fr->cookie);
+	mm_context_t *mm_ctx = &current->mm->context;
+	struct emuframe __user *fr = NULL;
+	unsigned long fr_addr;
+	int success = 0;
 
-	if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE))) {
-		MIPS_FPU_EMU_INC_STATS(errors);
-		return 0;
-	}
+	/* If we don't have TIF_FP_BD_EMU set... */
+	if (!test_and_clear_thread_flag(TIF_FP_BD_EMU))
+		goto out;
 
 	/*
-	 * At this point, we are satisfied that it's a BD emulation trap.  Yes,
-	 * a user might have deliberately put two malformed and useless
-	 * instructions in a row in his program, in which case he's in for a
-	 * nasty surprise - the next instruction will be treated as a
-	 * continuation address!  Alas, this seems to be the only way that we
-	 * can handle signals, recursion, and longjmps() in the context of
-	 * emulating the branch delay instruction.
+	 * ...or EPC is outside of the expected page or misaligned then
+	 * something is wrong. Leave it to the default trap/break code to
+	 * handle.
 	 */
+	fr_addr = msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction);
+	if ((fr_addr < mm_ctx->fp_bd_emupage) ||
+	    (fr_addr > (mm_ctx->fp_bd_emupage + PAGE_SIZE - sizeof(*fr))) ||
+	    (fr_addr & (sizeof(*fr) - 1)))
+		goto out;
 
-#ifdef DSEMUL_TRACE
-	printk("dsemulret\n");
-#endif
-	if (__get_user(epc, &fr->epc)) {		/* Saved EPC */
-		/* This is not a good situation to be in */
-		force_sig(SIGBUS, current);
-
-		return 0;
-	}
+	/* At this point, we are satisfied that it's a BD emulation trap. */
+	fr = (struct emuframe __user *)fr_addr;
 
 	/* Set EPC to return to post-branch instruction */
-	xcp->cp0_epc = epc;
+	xcp->cp0_epc = current->thread.fp_bd_emu_cpc;
+	success = 1;
 
-	return 1;
+	pr_debug("dsemulret to 0x%08lx\n", xcp->cp0_epc);
+out:
+	if (fr)
+		free_emuframe(fr);
+	if (!success)
+		MIPS_FPU_EMU_INC_STATS(errors);
+	return success;
 }
-- 
1.8.4.1


* [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2013-11-08 14:50         ` Paul Burton
From: Paul Burton @ 2013-11-08 14:50 UTC (permalink / raw)
  To: linux-mips; +Cc: ddaney.cavm, Paul Burton

If a floating point branch instruction (bc1[ft]l?) is emulated,
typically because we're running on a core with no FPU, then we need to
execute the instruction in its branch delay slot too. This is done by
writing that instruction to memory followed by a trap, as part of an
"emuframe", and executing it. This avoids the requirement of an emulator
for the entire MIPS instruction set. Prior to this patch such emuframes
are written to the user stack and executed from there.
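
As a rough illustration of the emuframe described above, here is a minimal
userspace sketch. The names emuframe_sketch, mips_break_insn and
build_emuframe are hypothetical and not part of this patch; the break code
passed in is a placeholder rather than the code the kernel reserves. Only
the architectural MIPS32 BREAK encoding is relied upon.

```c
#include <stdint.h>

/* Hypothetical userspace mirror of the patch's two-word struct emuframe. */
struct emuframe_sketch {
	uint32_t emul;    /* the instruction taken from the branch delay slot */
	uint32_t badinst; /* the trap that returns control to the kernel */
};

/* MIPS32 BREAK: SPECIAL opcode (0), 20-bit code in bits 25..6, funct 0x0d. */
static uint32_t mips_break_insn(uint32_t code)
{
	return ((code & 0xfffffu) << 6) | 0x0du;
}

/* Fill a frame: the delay-slot instruction first, the trap right after it. */
static void build_emuframe(struct emuframe_sketch *fr, uint32_t delay_insn,
			   uint32_t break_code)
{
	fr->emul = delay_insn;
	fr->badinst = mips_break_insn(break_code);
}
```

Executing from &fr->emul runs the delay-slot instruction natively and then
falls straight through into the trap, which is why only those two words
are needed.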

This patch moves FP branch delay emuframes off of the user stack and
into a per-mm page. Allocating a page per-mm leaves userland with access
to only what it had access to previously, and prevents processes
interfering with each other as they might if a single system-wide page
were used. The book-keeping required to track the allocation of
emuframes is not cheap, but given that invoking the FP emulator is
already very expensive I don't expect this to be an issue.
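
The book-keeping amounts to one bit per frame in the page. A minimal
userspace sketch of that idea follows, assuming a 4096 byte page and
8 byte frames; all SKETCH_* names and the emupage_sketch type are
hypothetical, and the mutex and wait queue used by the real code are
deliberately left out.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE   4096u  /* assumed page size */
#define SKETCH_FRAME_SIZE  8u     /* two 32-bit instruction words */
#define SKETCH_FRAME_COUNT (SKETCH_PAGE_SIZE / SKETCH_FRAME_SIZE)
#define SKETCH_LONG_BITS   (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_MAP_LONGS \
	((SKETCH_FRAME_COUNT + SKETCH_LONG_BITS - 1) / SKETCH_LONG_BITS)

struct emupage_sketch {
	unsigned long base;                       /* user address of the page */
	unsigned long allocmap[SKETCH_MAP_LONGS]; /* one bit per frame */
};

/* Claim the lowest free frame; returns its index, or -1 if the page is full. */
static int emuframe_alloc(struct emupage_sketch *pg)
{
	for (unsigned int idx = 0; idx < SKETCH_FRAME_COUNT; idx++) {
		unsigned long *word = &pg->allocmap[idx / SKETCH_LONG_BITS];
		unsigned long bit = 1UL << (idx % SKETCH_LONG_BITS);

		if (!(*word & bit)) {
			*word |= bit;
			return (int)idx;
		}
	}
	return -1;
}

/* Release a previously claimed frame so another thread can reuse it. */
static void emuframe_free(struct emupage_sketch *pg, int idx)
{
	pg->allocmap[idx / SKETCH_LONG_BITS] &=
		~(1UL << (idx % SKETCH_LONG_BITS));
}

/* Translate a frame index into the user address of that frame. */
static unsigned long emuframe_addr(const struct emupage_sketch *pg, int idx)
{
	return pg->base + (unsigned long)idx * SKETCH_FRAME_SIZE;
}
```

With these assumed sizes a single page holds 512 frames, so a process
would need 512 threads inside a delay slot 'emulation' at once before
emuframe_alloc() reports the page as full.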

The biggest issue with executing the instruction from an FP branch delay
slot is that we must ensure that we free the frame from which we ran it.
That means that we must trap back to the kernel after executing that
instruction, which in turn means that we must take special care not to
let the PC be changed as a result of that instruction. Fortunately,
since we're executing an instruction we found in a branch delay slot,
the result is unpredictable if that instruction is a branch or jump, so
we can simply treat those as NOPs and avoid them causing a problem.

However there is still the possibility that a signal may be handled
whilst executing the branch delay instruction. This would usually be
fine as we would simply execute our trap back to the kernel after
sigreturn, however it is possible for userland to simply not return from
the signal handler - for example if it executes something like a
longjmp. In that case we would never trap back to the kernel and never
free the frame. For that reason a TIF_FP_BD_EMU flag is introduced and
set whilst we are executing an FP branch delay instruction. Whilst this
flag is set, pending signals are not handled on return to userland; they
are left pending until the trap back to the kernel clears the flag. This
isn't exactly pretty, but it's simpler than most of the alternatives.

One other simple option I considered would be to just kill a process if
we find a branch in an FP branch delay slot, but I chose the current
approach because its result is closer to what would previously happen.
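
A minimal model of how the flag gates signal handling on the way back to
userland, written as plain C rather than the entry.S assembly. It assumes
that only a local copy of the thread flags is masked, so a pending signal
stays pending and is handled once the flag has been cleared. The SK_*
names are hypothetical, and the bit position used for the signal-pending
flag is an assumption made for illustration; bit 28 for TIF_FP_BD_EMU is
taken from this patch.

```c
#define SK_TIF_SIGPENDING (1u << 1)   /* bit position assumed for illustration */
#define SK_TIF_FP_BD_EMU  (1u << 28)  /* bit 28, as defined in this patch */

/*
 * Decide whether pending signals should be handled now, given the
 * thread flags sampled on the return-to-userland path.
 */
static int should_deliver_signals(unsigned int flags)
{
	if (flags & SK_TIF_FP_BD_EMU)
		return 0;	/* mid-'emulation': leave signals pending */
	return (flags & SK_TIF_SIGPENDING) != 0;
}

/* Model of the trap back to the kernel: the emulation flag is cleared. */
static unsigned int finish_bd_emulation(unsigned int flags)
{
	return flags & ~SK_TIF_FP_BD_EMU;
}
```

Since the delay-slot instruction is immediately followed by the trap, the
window in which a signal is held back should be very short in practice.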

The primary benefit of this patch is that we are now free to mark the
user stack non-executable where that is possible.

Additionally the FP emuframes themselves are simplified somewhat. The
cookie field is removed since we can be pretty certain that we're
looking at an emuframe by virtue of it being located in the page
allocated for them. The PC to continue from is moved into struct
thread_struct since the control flow of a thread can no longer be
modified for the duration of the 'emulation', meaning there will now
only ever be a single emuframe required for a thread at any given time.
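
With the cookie gone, recognising an emuframe trap reduces to a range and
alignment check on the trapping PC. The following userspace sketch mirrors
the checks made in do_dsemulret; is_emuframe_trap and the SK_* constants
are hypothetical, and the 4096 byte page size is an assumption.

```c
#include <stdint.h>

#define SK_PAGE_SIZE  4096u  /* assumed page size */
#define SK_FRAME_SIZE 8u     /* two 32-bit instruction words */
#define SK_INSN_SIZE  4u     /* the trap is the second word of a frame */

/* Returns 1 if trap_pc is the trap word of an aligned frame in the page. */
static int is_emuframe_trap(uintptr_t page_base, uintptr_t trap_pc)
{
	uintptr_t fr_addr = trap_pc - SK_INSN_SIZE;

	if (fr_addr < page_base)
		return 0;	/* below the page */
	if (fr_addr > page_base + SK_PAGE_SIZE - SK_FRAME_SIZE)
		return 0;	/* beyond the last frame in the page */
	if (fr_addr & (SK_FRAME_SIZE - 1))
		return 0;	/* not on a frame boundary */
	return 1;
}
```

A trap at any other address, including the first word of a frame, fails
these checks and is left to the default break handling.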

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
Changes in v2:
  - s/kernels/kernel's/
  - Use (mm_)isBranchInstr in mips_dsemul rather than duplicating
    similar logic.
---
 arch/mips/include/asm/fpu_emulator.h |   4 +
 arch/mips/include/asm/mmu.h          |  12 ++
 arch/mips/include/asm/mmu_context.h  |   7 +
 arch/mips/include/asm/processor.h    |   7 +-
 arch/mips/include/asm/thread_info.h  |   2 +
 arch/mips/kernel/entry.S             |  13 +-
 arch/mips/kernel/process.c           |   2 +
 arch/mips/kernel/vdso.c              |   2 +-
 arch/mips/math-emu/cp1emu.c          |   4 +-
 arch/mips/math-emu/dsemul.c          | 266 ++++++++++++++++++++++++-----------
 10 files changed, 226 insertions(+), 93 deletions(-)

diff --git a/arch/mips/include/asm/fpu_emulator.h b/arch/mips/include/asm/fpu_emulator.h
index 2abb587..16f7b0b 100644
--- a/arch/mips/include/asm/fpu_emulator.h
+++ b/arch/mips/include/asm/fpu_emulator.h
@@ -51,6 +51,8 @@ do {									\
 #define MIPS_FPU_EMU_INC_STATS(M) do { } while (0)
 #endif /* CONFIG_DEBUG_FS */
 
+extern void dsemul_thread_cleanup(void);
+extern void dsemul_mm_cleanup(struct mm_struct *mm);
 extern int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
 	unsigned long cpc);
 extern int do_dsemulret(struct pt_regs *xcp);
@@ -58,6 +60,8 @@ extern int fpu_emulator_cop1Handler(struct pt_regs *xcp,
 				    struct mips_fpu_struct *ctx, int has_fpu,
 				    void *__user *fault_addr);
 int process_fpemu_return(int sig, void __user *fault_addr);
+int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
+		  unsigned long *contpc);
 int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
 		     unsigned long *contpc);
 
diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
index c436138..08214da 100644
--- a/arch/mips/include/asm/mmu.h
+++ b/arch/mips/include/asm/mmu.h
@@ -1,9 +1,21 @@
 #ifndef __ASM_MMU_H
 #define __ASM_MMU_H
 
+#include <linux/mutex.h>
+#include <linux/wait.h>
+
 typedef struct {
 	unsigned long asid[NR_CPUS];
 	void *vdso;
+
+	/* address of page used to hold FP branch delay emulation frames */
+	unsigned long fp_bd_emupage;
+	/* bitmap tracking allocation of fp_bd_emupage */
+	unsigned long *fp_bd_emupage_allocmap;
+	/* mutex to be held whilst modifying fp_bd_emupage(_allocmap) */
+	struct mutex fp_bd_emupage_mutex;
+	/* wait queue for threads requiring an emuframe */
+	wait_queue_head_t fp_bd_emupage_queue;
 } mm_context_t;
 
 #endif /* __ASM_MMU_H */
diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
index e277bba..c55e864 100644
--- a/arch/mips/include/asm/mmu_context.h
+++ b/arch/mips/include/asm/mmu_context.h
@@ -16,6 +16,7 @@
 #include <linux/smp.h>
 #include <linux/slab.h>
 #include <asm/cacheflush.h>
+#include <asm/fpu_emulator.h>
 #include <asm/hazards.h>
 #include <asm/tlbflush.h>
 #ifdef CONFIG_MIPS_MT_SMTC
@@ -133,6 +134,11 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	for_each_possible_cpu(i)
 		cpu_context(i, mm) = 0;
 
+	mm->context.fp_bd_emupage = 0;
+	mm->context.fp_bd_emupage_allocmap = NULL;
+	mutex_init(&mm->context.fp_bd_emupage_mutex);
+	init_waitqueue_head(&mm->context.fp_bd_emupage_queue);
+
 	return 0;
 }
 
@@ -199,6 +205,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
  */
 static inline void destroy_context(struct mm_struct *mm)
 {
+	dsemul_mm_cleanup(mm);
 }
 
 #define deactivate_mm(tsk, mm)	do { } while (0)
diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
index 3605b84..683a3d6 100644
--- a/arch/mips/include/asm/processor.h
+++ b/arch/mips/include/asm/processor.h
@@ -38,9 +38,10 @@ extern unsigned int vced_count, vcei_count;
 
 /*
  * A special page (the vdso) is mapped into all processes at the very
- * top of the virtual memory space.
+ * top of the virtual memory space. The page below it is used for FP
+ * emulator branch delay slot executions.
  */
-#define SPECIAL_PAGES_SIZE PAGE_SIZE
+#define SPECIAL_PAGES_SIZE (PAGE_SIZE * 2)
 
 #ifdef CONFIG_32BIT
 #ifdef CONFIG_KVM_GUEST
@@ -226,6 +227,8 @@ struct thread_struct {
 
 	/* Saved fpu/fpu emulator stuff. */
 	struct mips_fpu_struct fpu;
+	/* PC to continue from following an FP branch delay 'emulation' */
+	unsigned long fp_bd_emu_cpc;
 #ifdef CONFIG_MIPS_MT_FPAFF
 	/* Emulated instruction count */
 	unsigned long emulated_fp;
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index b6da8b7..eee6e18 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -118,6 +118,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
 #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
 #define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
+#define TIF_FP_BD_EMU		28	/* executing an FP branch delay */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -135,6 +136,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
 #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
 #define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
+#define _TIF_FP_BD_EMU		(1<<TIF_FP_BD_EMU)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 
 #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
diff --git a/arch/mips/kernel/entry.S b/arch/mips/kernel/entry.S
index e578685..24707d7 100644
--- a/arch/mips/kernel/entry.S
+++ b/arch/mips/kernel/entry.S
@@ -168,10 +168,15 @@ work_resched:
 	andi	t0, a2, _TIF_NEED_RESCHED
 	bnez	t0, work_resched
 
-work_notifysig:				# deal with pending signals and
-					# notify-resume requests
-	move	a0, sp
-	li	a1, 0
+work_notifysig:
+	and	t0, a2, _TIF_FP_BD_EMU	# are we currently 'emulating' the
+					# delay slot of an FP branch?
+	beqz	t0, 1f			# no, continue below
+	and	a2, a2, ~_TIF_SIGPENDING	# yes, skip handling signals
+	beqz	a2, restore_all		# which leaves us nothing to do
+
+1:	move	a0, sp			# deal with pending signals and
+	li	a1, 0			# notify-resume requests
 	jal	do_notify_resume	# a2 already loaded
 	j	resume_userspace_check
 
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index 747a6cf..0219502 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -32,6 +32,7 @@
 #include <asm/cpu.h>
 #include <asm/dsp.h>
 #include <asm/fpu.h>
+#include <asm/fpu_emulator.h>
 #include <asm/pgtable.h>
 #include <asm/mipsregs.h>
 #include <asm/processor.h>
@@ -72,6 +73,7 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
 
 void exit_thread(void)
 {
+	dsemul_thread_cleanup();
 }
 
 void flush_thread(void)
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 0f1af58..213d871 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -78,7 +78,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 
 	down_write(&mm->mmap_sem);
 
-	addr = vdso_addr(mm->start_stack);
+	addr = vdso_addr(mm->start_stack) + PAGE_SIZE;
 
 	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 22f7b11..a0566c8 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -665,8 +665,8 @@ int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
  * a single subroutine should be used across both
  * modules.
  */
-static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
-			 unsigned long *contpc)
+int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
+		  unsigned long *contpc)
 {
 	union mips_instruction insn = (union mips_instruction)dec_insn.insn;
 	unsigned int fcr31;
diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
index 7ea622a..3e64b17 100644
--- a/arch/mips/math-emu/dsemul.c
+++ b/arch/mips/math-emu/dsemul.c
@@ -1,6 +1,8 @@
 #include <linux/compiler.h>
+#include <linux/err.h>
 #include <linux/mm.h>
 #include <linux/signal.h>
+#include <linux/slab.h>
 #include <linux/smp.h>
 
 #include <asm/asm.h>
@@ -45,52 +47,173 @@
 struct emuframe {
 	mips_instruction	emul;
 	mips_instruction	badinst;
-	mips_instruction	cookie;
-	unsigned long		epc;
 };
 
+static const int emupage_frame_count = PAGE_SIZE / sizeof(struct emuframe);
+
+static struct emuframe __user *alloc_emuframe(void)
+{
+	mm_context_t *mm_ctx = &current->mm->context;
+	struct emuframe __user *fr = NULL;
+	unsigned long addr;
+	int idx;
+
+retry:
+	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
+
+	/* Ensure we have a page allocated for emuframes */
+	if (!mm_ctx->fp_bd_emupage) {
+		addr = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
+				   VM_READ|VM_WRITE|VM_EXEC|
+				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+				   0);
+		if (IS_ERR_VALUE(addr))
+			goto out_unlock;
+
+		mm_ctx->fp_bd_emupage = addr;
+		pr_debug("allocate emupage at 0x%08lx to %d\n", addr,
+			 current->pid);
+	}
+
+	/* Ensure we have an allocation bitmap */
+	if (!mm_ctx->fp_bd_emupage_allocmap) {
+		mm_ctx->fp_bd_emupage_allocmap =
+			kcalloc(BITS_TO_LONGS(emupage_frame_count),
+				sizeof(unsigned long),
+				GFP_KERNEL);
+
+		if (!mm_ctx->fp_bd_emupage_allocmap)
+			goto out_unlock;
+	}
+
+	/* Attempt to allocate a single bit/frame */
+	idx = bitmap_find_free_region(mm_ctx->fp_bd_emupage_allocmap,
+				      emupage_frame_count, 0);
+	if (idx < 0) {
+		/*
+		 * Failed to allocate a frame. We'll wait until one becomes
+		 * available. The mutex is unlocked so that other threads
+		 * actually get the opportunity to free their frames, which
+		 * means technically the result of bitmap_full may be incorrect.
+		 * However the worst case is that we repeat all this and end up
+		 * back here again.
+		 */
+		mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
+		if (!wait_event_killable(mm_ctx->fp_bd_emupage_queue,
+			!bitmap_full(mm_ctx->fp_bd_emupage_allocmap,
+				     emupage_frame_count)))
+			goto retry;
+
+		/* Received a fatal signal - just give in */
+		return NULL;
+	}
+
+	/* Success! */
+	fr = (struct emuframe __user *)mm_ctx->fp_bd_emupage + idx;
+	pr_debug("allocate emuframe %d to %d\n", idx, current->pid);
+out_unlock:
+	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
+	return fr;
+}
+
+static void free_emuframe(struct emuframe __user *frame)
+{
+	mm_context_t *mm_ctx = &current->mm->context;
+	int idx;
+
+	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
+
+	idx = frame - (struct emuframe __user *)mm_ctx->fp_bd_emupage;
+	pr_debug("free emuframe %d from %d\n", idx, current->pid);
+	bitmap_clear(mm_ctx->fp_bd_emupage_allocmap, idx, 1);
+
+	/* If some thread is waiting for a frame, now's its chance */
+	wake_up(&mm_ctx->fp_bd_emupage_queue);
+
+	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
+}
+
+void dsemul_thread_cleanup(void)
+{
+	/*
+	 * We should always have passed through do_dsemulret prior to the
+	 * thread exiting, so TIF_FP_BD_EMU should never be set here.
+	 */
+	BUG_ON(test_thread_flag(TIF_FP_BD_EMU));
+}
+
+void dsemul_mm_cleanup(struct mm_struct *mm)
+{
+	mm_context_t *mm_ctx = &mm->context;
+
+	kfree(mm_ctx->fp_bd_emupage_allocmap);
+}
+
 int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 {
-	extern asmlinkage void handle_dsemulret(void);
+	struct mm_decoded_insn mm_inst = { .insn = ir };
 	struct emuframe __user *fr;
-	int err;
+	struct pt_regs dummy_regs;
+	unsigned long dummy_cpc;
+	int err, is_mm;
 
-	if ((get_isa16_mode(regs->cp0_epc) && ((ir >> 16) == MM_NOP16)) ||
-		(ir == 0)) {
-		/* NOP is easy */
+	/*
+	 * Trivially handle typical NOP encodings:
+	 *
+	 *   MIPS32:		sll	r0, r0, 0
+	 *   microMIPS:		move16	r0, r0
+	 */
+	is_mm = get_isa16_mode(regs->cp0_epc);
+	if ((!is_mm && !ir) || (is_mm && ((ir >> 16) == MM_NOP16))) {
+is_nop:
 		regs->cp0_epc = cpc;
 		regs->cp0_cause &= ~CAUSEF_BD;
 		return 0;
 	}
-#ifdef DSEMUL_TRACE
-	printk("dsemul %lx %lx\n", regs->cp0_epc, cpc);
-
-#endif
 
 	/*
-	 * The strategy is to push the instruction onto the user stack
-	 * and put a trap after it which we can catch and jump to
-	 * the required address any alternative apart from full
-	 * instruction emulation!!.
+	 * In order for us to clean up the emuframe properly, we'll need to
+	 * execute a break instruction after ir. If ir is a branch then we may
+	 * never reach that break instruction and thus never free the emuframe.
 	 *
-	 * Algorithmics used a system call instruction, and
-	 * borrowed that vector.  MIPS/Linux version is a bit
-	 * more heavyweight in the interests of portability and
-	 * multiprocessor support.  For Linux we generate a
-	 * an unaligned access and force an address error exception.
+	 * Fortunately we know that ir is in a branch delay slot and thus if
+	 * it is a branch then its operation is unpredictable. So we can just
+	 * treat branches as NOPs and skip the 'emulation' entirely.
 	 *
-	 * For embedded systems (stand-alone) we prefer to use a
-	 * non-existing CP1 instruction. This prevents us from emulating
-	 * branches, but gives us a cleaner interface to the exception
-	 * handler (single entry point).
+	 * If the worst happens and we miss a branch/jump instruction here, or
+	 * some processor implements a custom one, then it would be possible
+	 * for us to allocate an emuframe and never free it. Fortunately this
+	 * would:
+	 *
+	 *  1) Be a bug in the userland code, because it has a branch/jump in
+	 *     a branch delay slot. So if we run out of emuframes and the
+	 *     userland code hangs it's not exactly the kernel's fault.
+	 *
+	 *  2) Only affect that userland process, since emuframes are allocated
+	 *     per-mm and kernel threads don't use them at all.
 	 */
+	if ((!is_mm && isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc)) ||
+	    (is_mm && mm_isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc))) {
+		pr_warn("PID %d has a branch in an FP branch delay slot at 0x%08lx\n",
+			current->pid, regs->cp0_epc);
+		goto is_nop;
+	}
 
-	/* Ensure that the two instructions are in the same cache line */
-	fr = (struct emuframe __user *)
-		((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
+	pr_debug("dsemul 0x%08lx cont at 0x%08lx\n", regs->cp0_epc, cpc);
 
-	/* Verify that the stack pointer is not competely insane */
-	if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
+	/*
+	 * The strategy is to write the instruction to a per-mm page followed
+	 * by a trap which we can catch to return to the required address.
+	 * This avoids the need for full instruction emulation.
+	 *
+	 * Algorithmics used a system call instruction, and borrowed that
+	 * vector.  MIPS/Linux version is a bit more heavyweight in the
+	 * interests of portability and multiprocessor support.  For Linux we
+	 * generate a BREAK instruction with a break code reserved for this
+	 * purpose.
+	 */
+	fr = alloc_emuframe();
+	if (!fr)
 		return SIGBUS;
 
 	if (get_isa16_mode(regs->cp0_epc)) {
@@ -103,17 +226,18 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 		err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
 	}
 
-	err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
-	err |= __put_user(cpc, &fr->epc);
-
 	if (unlikely(err)) {
 		MIPS_FPU_EMU_INC_STATS(errors);
+		free_emuframe(fr);
 		return SIGBUS;
 	}
 
 	regs->cp0_epc = ((unsigned long) &fr->emul) |
 		get_isa16_mode(regs->cp0_epc);
 
+	current->thread.fp_bd_emu_cpc = cpc;
+	set_thread_flag(TIF_FP_BD_EMU);
+
 	flush_cache_sigtramp((unsigned long)&fr->badinst);
 
 	return SIGILL;		/* force out of emulation loop */
@@ -121,64 +245,38 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
 
 int do_dsemulret(struct pt_regs *xcp)
 {
-	struct emuframe __user *fr;
-	unsigned long epc;
-	u32 insn, cookie;
-	int err = 0;
-	u16 instr[2];
-
-	fr = (struct emuframe __user *)
-		(msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction));
-
-	/*
-	 * If we can't even access the area, something is very wrong, but we'll
-	 * leave that to the default handling
-	 */
-	if (!access_ok(VERIFY_READ, fr, sizeof(struct emuframe)))
-		return 0;
-
-	/*
-	 * Do some sanity checking on the stackframe:
-	 *
-	 *  - Is the instruction pointed to by the EPC an BREAK_MATH?
-	 *  - Is the following memory word the BD_COOKIE?
-	 */
-	if (get_isa16_mode(xcp->cp0_epc)) {
-		err = __get_user(instr[0], (u16 __user *)(&fr->badinst));
-		err |= __get_user(instr[1], (u16 __user *)((long)(&fr->badinst) + 2));
-		insn = (instr[0] << 16) | instr[1];
-	} else {
-		err = __get_user(insn, &fr->badinst);
-	}
-	err |= __get_user(cookie, &fr->cookie);
+	mm_context_t *mm_ctx = &current->mm->context;
+	struct emuframe __user *fr = NULL;
+	unsigned long fr_addr;
+	int success = 0;
 
-	if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE))) {
-		MIPS_FPU_EMU_INC_STATS(errors);
-		return 0;
-	}
+	/* If we don't have TIF_FP_BD_EMU set... */
+	if (!test_and_clear_thread_flag(TIF_FP_BD_EMU))
+		goto out;
 
 	/*
-	 * At this point, we are satisfied that it's a BD emulation trap.  Yes,
-	 * a user might have deliberately put two malformed and useless
-	 * instructions in a row in his program, in which case he's in for a
-	 * nasty surprise - the next instruction will be treated as a
-	 * continuation address!  Alas, this seems to be the only way that we
-	 * can handle signals, recursion, and longjmps() in the context of
-	 * emulating the branch delay instruction.
+	 * ...or EPC is outside of the expected page or misaligned then
+	 * something is wrong. Leave it to the default trap/break code to
+	 * handle.
 	 */
+	fr_addr = msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction);
+	if ((fr_addr < mm_ctx->fp_bd_emupage) ||
+	    (fr_addr > (mm_ctx->fp_bd_emupage + PAGE_SIZE - sizeof(*fr))) ||
+	    (fr_addr & (sizeof(*fr) - 1)))
+		goto out;
 
-#ifdef DSEMUL_TRACE
-	printk("dsemulret\n");
-#endif
-	if (__get_user(epc, &fr->epc)) {		/* Saved EPC */
-		/* This is not a good situation to be in */
-		force_sig(SIGBUS, current);
-
-		return 0;
-	}
+	/* At this point, we are satisfied that it's a BD emulation trap. */
+	fr = (struct emuframe __user *)fr_addr;
 
 	/* Set EPC to return to post-branch instruction */
-	xcp->cp0_epc = epc;
+	xcp->cp0_epc = current->thread.fp_bd_emu_cpc;
+	success = 1;
 
-	return 1;
+	pr_debug("dsemulret to 0x%08lx\n", xcp->cp0_epc);
+out:
+	if (fr)
+		free_emuframe(fr);
+	if (!success)
+		MIPS_FPU_EMU_INC_STATS(errors);
+	return success;
 }
-- 
1.8.4.1

* [PATCH v2 4/6] mips: support for 64-bit FP with O32 binaries
@ 2013-11-15 12:35     ` Paul Burton
From: Paul Burton @ 2013-11-15 12:35 UTC (permalink / raw)
  To: linux-mips; +Cc: Paul Burton

CPUs implementing mips32r2 may include a 64-bit FPU, just as mips64 CPUs
do. In order to preserve backwards compatibility a 64-bit FPU will act
like a 32-bit FPU (by accessing doubles from the least significant 32
bits of an even-odd pair of FP registers) when the Status.FR bit is
zero, again just like a mips64 CPU. The standard O32 ABI is defined
expecting a 32-bit FPU; however, recent toolchains support use of a
64-bit FPU from an O32 mips32 executable. When an ELF executable is
built to use a 64-bit FPU a new flag (EF_MIPS_FP64) is set in the ELF
header.
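
To make the FR=0 register layout concrete, here is a minimal C sketch of
how one 32-bit word could be read out of a saved register array in each
mode. It mirrors the PTRACE_PEEKUSR logic in this patch but is not kernel
code: fpureg_model_t, fpr_peek32() and the fr32 parameter are hypothetical
names invented for illustration.

```c
#include <stdint.h>

/* Hypothetical model of the saved FP state: 32 slots of 64 bits each. */
typedef uint64_t fpureg_model_t;

/*
 * Return the 32-bit word seen for FP register n.
 *
 * fr32 != 0 models Status.FR = 0: an odd register is the high half of
 * the preceding even slot, an even register is the low half of its own
 * slot.
 *
 * fr32 == 0 models Status.FR = 1: every register has its own slot and
 * the low word of that slot is returned.
 */
static uint32_t fpr_peek32(const fpureg_model_t *fregs, unsigned int n,
			   int fr32)
{
	if (fr32 && (n & 1))
		return (uint32_t)(fregs[n & ~1u] >> 32);

	return (uint32_t)fregs[n];
}
```

With slot 0 holding 0x1122334455667788, register 1 reads as 0x11223344 in
the FR=0 model, whereas in the FR=1 model it reads from slot 1 instead.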

With this patch the kernel will check the EF_MIPS_FP64 flag when
executing an O32 binary, and set Status.FR accordingly. The addition
of O32 64-bit FP support lessens the opportunity for optimisation in
the FPU emulator, so a CONFIG_MIPS_O32_FP64_SUPPORT Kconfig option is
introduced to allow this support to be disabled for those that don't
require it.
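
As a rough illustration of the check described above, the following C
sketch maps an ELF header's e_flags to the Status.FR value a task would
want. The EF_MIPS_FP64 value is the one this patch adds to elf.h;
elf_wants_fp64() and desired_status_fr() are hypothetical helpers. The
patch itself records the decision in the TIF_32BIT_FPREGS thread flag and
applies it in __enable_fpu() rather than computing FR directly like this.

```c
#include <stdint.h>

/* Value taken from the elf.h hunk in this patch. */
#define EF_MIPS_FP64	0x00000200u

/* Hypothetical helper: nonzero if the binary asks for 64-bit FP regs. */
static int elf_wants_fp64(uint32_t e_flags)
{
	return (e_flags & EF_MIPS_FP64) != 0;
}

/* Hypothetical helper: the Status.FR value to aim for (0 or 1). */
static int desired_status_fr(uint32_t e_flags)
{
	return elf_wants_fp64(e_flags) ? 1 : 0;
}
```

A binary whose e_flags lacks the bit keeps the traditional FR=0 view, so
existing O32 executables are unaffected.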

Inspired by an earlier patch by Leonid Yegoshin, but implemented more
cleanly & correctly.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
Changes in v2:
  - Handle TIF_32BIT_FPREGS in PTRACE_P{EE,OK}EUSR.
---
 arch/mips/Kconfig                   |  17 ++++++
 arch/mips/include/asm/asmmacro-32.h |  42 --------------
 arch/mips/include/asm/asmmacro-64.h |  96 --------------------------------
 arch/mips/include/asm/asmmacro.h    | 107 ++++++++++++++++++++++++++++++++++++
 arch/mips/include/asm/elf.h         |  17 +++++-
 arch/mips/include/asm/fpu.h         |  91 +++++++++++++++++++++++++-----
 arch/mips/include/asm/thread_info.h |   4 +-
 arch/mips/kernel/cpu-probe.c        |   2 +-
 arch/mips/kernel/process.c          |   3 -
 arch/mips/kernel/ptrace.c           |  60 +++++++++++---------
 arch/mips/kernel/ptrace32.c         |  53 ++++++++++--------
 arch/mips/kernel/r4k_fpu.S          |  74 +++++++++++++++++++++++--
 arch/mips/kernel/r4k_switch.S       |  45 ++++++++++++++-
 arch/mips/kernel/signal.c           |  10 ++--
 arch/mips/kernel/signal32.c         |  10 ++--
 arch/mips/kernel/traps.c            |  20 +++++--
 arch/mips/math-emu/cp1emu.c         |  10 ++--
 arch/mips/math-emu/kernel_linkage.c |   6 +-
 18 files changed, 431 insertions(+), 236 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 17cc7ff..aa2e03a 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2335,6 +2335,23 @@ config CC_STACKPROTECTOR
 
 	  This feature requires gcc version 4.2 or above.
 
+config MIPS_O32_FP64_SUPPORT
+	bool "Support for O32 binaries using 64-bit FP"
+	depends on (32BIT && CPU_MIPSR2) || MIPS32_O32
+	default y
+	help
+	  When this is enabled, the kernel will support use of 64-bit floating
+	  point registers with binaries using the O32 ABI along with the
+	  EF_MIPS_FP64 ELF header flag (typically built with -mfp64). On
+	  mips32 systems this support is at the cost of increasing the size
+	  and complexity of the compiled FPU emulator. Thus if you are running
+	  a mips32 system and know that none of your userland binaries will
+	  require 64-bit floating point, you may wish to reduce the size of
+	  your kernel & potentially improve FP emulation performance by saying
+	  N here.
+
+	  If unsure, say Y.
+
 config USE_OF
 	bool
 	select OF
diff --git a/arch/mips/include/asm/asmmacro-32.h b/arch/mips/include/asm/asmmacro-32.h
index 2413afe..70e1f17 100644
--- a/arch/mips/include/asm/asmmacro-32.h
+++ b/arch/mips/include/asm/asmmacro-32.h
@@ -12,27 +12,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_double thread status tmp1=t0
-	cfc1	\tmp1,  fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp1, THREAD_FCR31(\thread)
-	.endm
-
 	.macro	fpu_save_single thread tmp=t0
 	cfc1	\tmp,  fcr31
 	swc1	$f0,  THREAD_FPR0(\thread)
@@ -70,27 +49,6 @@
 	sw	\tmp, THREAD_FCR31(\thread)
 	.endm
 
-	.macro	fpu_restore_double thread status tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
 	.macro	fpu_restore_single thread tmp=t0
 	lw	\tmp, THREAD_FCR31(\thread)
 	lwc1	$f0,  THREAD_FPR0(\thread)
diff --git a/arch/mips/include/asm/asmmacro-64.h b/arch/mips/include/asm/asmmacro-64.h
index 08a527d..38ea609 100644
--- a/arch/mips/include/asm/asmmacro-64.h
+++ b/arch/mips/include/asm/asmmacro-64.h
@@ -13,102 +13,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_16even thread tmp=t0
-	cfc1	\tmp, fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp, THREAD_FCR31(\thread)
-	.endm
-
-	.macro	fpu_save_16odd thread
-	sdc1	$f1,  THREAD_FPR1(\thread)
-	sdc1	$f3,  THREAD_FPR3(\thread)
-	sdc1	$f5,  THREAD_FPR5(\thread)
-	sdc1	$f7,  THREAD_FPR7(\thread)
-	sdc1	$f9,  THREAD_FPR9(\thread)
-	sdc1	$f11, THREAD_FPR11(\thread)
-	sdc1	$f13, THREAD_FPR13(\thread)
-	sdc1	$f15, THREAD_FPR15(\thread)
-	sdc1	$f17, THREAD_FPR17(\thread)
-	sdc1	$f19, THREAD_FPR19(\thread)
-	sdc1	$f21, THREAD_FPR21(\thread)
-	sdc1	$f23, THREAD_FPR23(\thread)
-	sdc1	$f25, THREAD_FPR25(\thread)
-	sdc1	$f27, THREAD_FPR27(\thread)
-	sdc1	$f29, THREAD_FPR29(\thread)
-	sdc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_save_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 2f
-	fpu_save_16odd \thread
-2:
-	fpu_save_16even \thread \tmp
-	.endm
-
-	.macro	fpu_restore_16even thread tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
-	.macro	fpu_restore_16odd thread
-	ldc1	$f1,  THREAD_FPR1(\thread)
-	ldc1	$f3,  THREAD_FPR3(\thread)
-	ldc1	$f5,  THREAD_FPR5(\thread)
-	ldc1	$f7,  THREAD_FPR7(\thread)
-	ldc1	$f9,  THREAD_FPR9(\thread)
-	ldc1	$f11, THREAD_FPR11(\thread)
-	ldc1	$f13, THREAD_FPR13(\thread)
-	ldc1	$f15, THREAD_FPR15(\thread)
-	ldc1	$f17, THREAD_FPR17(\thread)
-	ldc1	$f19, THREAD_FPR19(\thread)
-	ldc1	$f21, THREAD_FPR21(\thread)
-	ldc1	$f23, THREAD_FPR23(\thread)
-	ldc1	$f25, THREAD_FPR25(\thread)
-	ldc1	$f27, THREAD_FPR27(\thread)
-	ldc1	$f29, THREAD_FPR29(\thread)
-	ldc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_restore_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 1f				# 16 register mode?
-
-	fpu_restore_16odd \thread
-1:	fpu_restore_16even \thread \tmp
-	.endm
-
 	.macro	cpu_save_nonscratch thread
 	LONG_S	s0, THREAD_REG16(\thread)
 	LONG_S	s1, THREAD_REG17(\thread)
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index 6c8342a..3220c93 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -62,6 +62,113 @@
 	.endm
 #endif /* CONFIG_MIPS_MT_SMTC */
 
+	.macro	fpu_save_16even thread tmp=t0
+	cfc1	\tmp, fcr31
+	sdc1	$f0,  THREAD_FPR0(\thread)
+	sdc1	$f2,  THREAD_FPR2(\thread)
+	sdc1	$f4,  THREAD_FPR4(\thread)
+	sdc1	$f6,  THREAD_FPR6(\thread)
+	sdc1	$f8,  THREAD_FPR8(\thread)
+	sdc1	$f10, THREAD_FPR10(\thread)
+	sdc1	$f12, THREAD_FPR12(\thread)
+	sdc1	$f14, THREAD_FPR14(\thread)
+	sdc1	$f16, THREAD_FPR16(\thread)
+	sdc1	$f18, THREAD_FPR18(\thread)
+	sdc1	$f20, THREAD_FPR20(\thread)
+	sdc1	$f22, THREAD_FPR22(\thread)
+	sdc1	$f24, THREAD_FPR24(\thread)
+	sdc1	$f26, THREAD_FPR26(\thread)
+	sdc1	$f28, THREAD_FPR28(\thread)
+	sdc1	$f30, THREAD_FPR30(\thread)
+	sw	\tmp, THREAD_FCR31(\thread)
+	.endm
+
+	.macro	fpu_save_16odd thread
+	.set	push
+	.set	mips64r2
+	sdc1	$f1,  THREAD_FPR1(\thread)
+	sdc1	$f3,  THREAD_FPR3(\thread)
+	sdc1	$f5,  THREAD_FPR5(\thread)
+	sdc1	$f7,  THREAD_FPR7(\thread)
+	sdc1	$f9,  THREAD_FPR9(\thread)
+	sdc1	$f11, THREAD_FPR11(\thread)
+	sdc1	$f13, THREAD_FPR13(\thread)
+	sdc1	$f15, THREAD_FPR15(\thread)
+	sdc1	$f17, THREAD_FPR17(\thread)
+	sdc1	$f19, THREAD_FPR19(\thread)
+	sdc1	$f21, THREAD_FPR21(\thread)
+	sdc1	$f23, THREAD_FPR23(\thread)
+	sdc1	$f25, THREAD_FPR25(\thread)
+	sdc1	$f27, THREAD_FPR27(\thread)
+	sdc1	$f29, THREAD_FPR29(\thread)
+	sdc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_save_double thread status tmp
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f
+	fpu_save_16odd \thread
+10:
+#endif
+	fpu_save_16even \thread \tmp
+	.endm
+
+	.macro	fpu_restore_16even thread tmp=t0
+	lw	\tmp, THREAD_FCR31(\thread)
+	ldc1	$f0,  THREAD_FPR0(\thread)
+	ldc1	$f2,  THREAD_FPR2(\thread)
+	ldc1	$f4,  THREAD_FPR4(\thread)
+	ldc1	$f6,  THREAD_FPR6(\thread)
+	ldc1	$f8,  THREAD_FPR8(\thread)
+	ldc1	$f10, THREAD_FPR10(\thread)
+	ldc1	$f12, THREAD_FPR12(\thread)
+	ldc1	$f14, THREAD_FPR14(\thread)
+	ldc1	$f16, THREAD_FPR16(\thread)
+	ldc1	$f18, THREAD_FPR18(\thread)
+	ldc1	$f20, THREAD_FPR20(\thread)
+	ldc1	$f22, THREAD_FPR22(\thread)
+	ldc1	$f24, THREAD_FPR24(\thread)
+	ldc1	$f26, THREAD_FPR26(\thread)
+	ldc1	$f28, THREAD_FPR28(\thread)
+	ldc1	$f30, THREAD_FPR30(\thread)
+	ctc1	\tmp, fcr31
+	.endm
+
+	.macro	fpu_restore_16odd thread
+	.set	push
+	.set	mips64r2
+	ldc1	$f1,  THREAD_FPR1(\thread)
+	ldc1	$f3,  THREAD_FPR3(\thread)
+	ldc1	$f5,  THREAD_FPR5(\thread)
+	ldc1	$f7,  THREAD_FPR7(\thread)
+	ldc1	$f9,  THREAD_FPR9(\thread)
+	ldc1	$f11, THREAD_FPR11(\thread)
+	ldc1	$f13, THREAD_FPR13(\thread)
+	ldc1	$f15, THREAD_FPR15(\thread)
+	ldc1	$f17, THREAD_FPR17(\thread)
+	ldc1	$f19, THREAD_FPR19(\thread)
+	ldc1	$f21, THREAD_FPR21(\thread)
+	ldc1	$f23, THREAD_FPR23(\thread)
+	ldc1	$f25, THREAD_FPR25(\thread)
+	ldc1	$f27, THREAD_FPR27(\thread)
+	ldc1	$f29, THREAD_FPR29(\thread)
+	ldc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_restore_double thread status tmp
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f				# 16 register mode?
+
+	fpu_restore_16odd \thread
+10:
+#endif
+	fpu_restore_16even \thread \tmp
+	.endm
+
 /*
  * Temporary until all gas have MT ASE support
  */
diff --git a/arch/mips/include/asm/elf.h b/arch/mips/include/asm/elf.h
index a66359e..17163cf 100644
--- a/arch/mips/include/asm/elf.h
+++ b/arch/mips/include/asm/elf.h
@@ -36,6 +36,7 @@
 #define EF_MIPS_ABI2		0x00000020
 #define EF_MIPS_OPTIONS_FIRST	0x00000080
 #define EF_MIPS_32BITMODE	0x00000100
+#define EF_MIPS_FP64		0x00000200
 #define EF_MIPS_ABI		0x0000f000
 #define EF_MIPS_ARCH		0xf0000000
 
@@ -249,6 +250,11 @@ extern struct mips_abi mips_abi_n32;
 
 #define SET_PERSONALITY(ex)						\
 do {									\
+	if ((ex).e_flags & EF_MIPS_FP64)				\
+		clear_thread_flag(TIF_32BIT_FPREGS);			\
+	else								\
+		set_thread_flag(TIF_32BIT_FPREGS);			\
+									\
 	if (personality(current->personality) != PER_LINUX)		\
 		set_personality(PER_LINUX);				\
 									\
@@ -271,14 +277,18 @@ do {									\
 #endif
 
 #ifdef CONFIG_MIPS32_O32
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do {								\
 		set_thread_flag(TIF_32BIT_REGS);			\
 		set_thread_flag(TIF_32BIT_ADDR);			\
+									\
+		if (!((ex).e_flags & EF_MIPS_FP64))			\
+			set_thread_flag(TIF_32BIT_FPREGS);		\
+									\
 		current->thread.abi = &mips_abi_32;			\
 	} while (0)
 #else
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do { } while (0)
 #endif
 
@@ -289,7 +299,7 @@ do {									\
 	     ((ex).e_flags & EF_MIPS_ABI) == 0)				\
 		__SET_PERSONALITY32_N32();				\
 	else								\
-		__SET_PERSONALITY32_O32();				\
+		__SET_PERSONALITY32_O32(ex);				\
 } while (0)
 #else
 #define __SET_PERSONALITY32(ex) do { } while (0)
@@ -300,6 +310,7 @@ do {									\
 	unsigned int p;							\
 									\
 	clear_thread_flag(TIF_32BIT_REGS);				\
+	clear_thread_flag(TIF_32BIT_FPREGS);				\
 	clear_thread_flag(TIF_32BIT_ADDR);				\
 									\
 	if ((ex).e_ident[EI_CLASS] == ELFCLASS32)			\
diff --git a/arch/mips/include/asm/fpu.h b/arch/mips/include/asm/fpu.h
index 3bf023f..cfe092f 100644
--- a/arch/mips/include/asm/fpu.h
+++ b/arch/mips/include/asm/fpu.h
@@ -33,11 +33,48 @@ extern void _init_fpu(void);
 extern void _save_fp(struct task_struct *);
 extern void _restore_fp(struct task_struct *);
 
-#define __enable_fpu()							\
-do {									\
-	set_c0_status(ST0_CU1);						\
-	enable_fpu_hazard();						\
-} while (0)
+/*
+ * This enum specifies a mode in which we want the FPU to operate, for cores
+ * which implement the Status.FR bit. Note that FPU_32BIT & FPU_64BIT
+ * purposefully have the values 0 & 1 respectively, so that an integer value
+ * of Status.FR can be trivially cast to the corresponding enum fpu_mode.
+ */
+enum fpu_mode {
+	FPU_32BIT = 0,		/* FR = 0 */
+	FPU_64BIT,		/* FR = 1 */
+	FPU_AS_IS,
+};
+
+static inline int __enable_fpu(enum fpu_mode mode)
+{
+	int fr;
+
+	switch (mode) {
+	case FPU_AS_IS:
+		/* just enable the FPU in its current mode */
+		set_c0_status(ST0_CU1);
+		enable_fpu_hazard();
+		return 0;
+
+	case FPU_64BIT:
+#if !(defined(CONFIG_CPU_MIPS32_R2) || defined(CONFIG_64BIT))
+		/* we only have a 32-bit FPU */
+		return SIGFPE;
+#endif
+		/* fall through */
+	case FPU_32BIT:
+		/* set CU1 & change FR appropriately */
+		fr = (int)mode;
+		change_c0_status(ST0_CU1 | ST0_FR, ST0_CU1 | (fr ? ST0_FR : 0));
+		enable_fpu_hazard();
+
+		/* check FR has the desired value */
+		return (!!(read_c0_status() & ST0_FR) == !!fr) ? 0 : SIGFPE;
+
+	default:
+		BUG();
+	}
+}
 
 #define __disable_fpu()							\
 do {									\
@@ -57,27 +94,46 @@ static inline int is_fpu_owner(void)
 	return cpu_has_fpu && __is_fpu_owner();
 }
 
-static inline void __own_fpu(void)
+static inline int __own_fpu(void)
 {
-	__enable_fpu();
+	enum fpu_mode mode;
+	int ret;
+
+	mode = !test_thread_flag(TIF_32BIT_FPREGS);
+	ret = __enable_fpu(mode);
+	if (ret)
+		return ret;
+
 	KSTK_STATUS(current) |= ST0_CU1;
+	if (mode == FPU_64BIT)
+		KSTK_STATUS(current) |= ST0_FR;
+	else /* mode == FPU_32BIT */
+		KSTK_STATUS(current) &= ~ST0_FR;
+
 	set_thread_flag(TIF_USEDFPU);
+	return 0;
 }
 
-static inline void own_fpu_inatomic(int restore)
+static inline int own_fpu_inatomic(int restore)
 {
+	int ret = 0;
+
 	if (cpu_has_fpu && !__is_fpu_owner()) {
-		__own_fpu();
-		if (restore)
+		ret = __own_fpu();
+		if (restore && !ret)
 			_restore_fp(current);
 	}
+	return ret;
 }
 
-static inline void own_fpu(int restore)
+static inline int own_fpu(int restore)
 {
+	int ret;
+
 	preempt_disable();
-	own_fpu_inatomic(restore);
+	ret = own_fpu_inatomic(restore);
 	preempt_enable();
+	return ret;
 }
 
 static inline void lose_fpu(int save)
@@ -93,16 +149,21 @@ static inline void lose_fpu(int save)
 	preempt_enable();
 }
 
-static inline void init_fpu(void)
+static inline int init_fpu(void)
 {
+	int ret = 0;
+
 	preempt_disable();
 	if (cpu_has_fpu) {
-		__own_fpu();
-		_init_fpu();
+		ret = __own_fpu();
+		if (!ret)
+			_init_fpu();
 	} else {
 		fpu_emulator_init_fpu();
 	}
+
 	preempt_enable();
+	return ret;
 }
 
 static inline void save_fp(struct task_struct *tsk)
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index f9b24bf..b6da8b7 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -112,11 +112,12 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
-#define TIF_32BIT_REGS		22	/* also implies 16/32 fprs */
+#define TIF_32BIT_REGS		22	/* 32-bit general purpose registers */
 #define TIF_32BIT_ADDR		23	/* 32-bit address space (o32/n32) */
 #define TIF_FPUBOUND		24	/* thread bound to FPU-full CPU set */
 #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
 #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
+#define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -133,6 +134,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT_ADDR		(1<<TIF_32BIT_ADDR)
 #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
 #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
+#define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 
 #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c
index 8168e29..116102c 100644
--- a/arch/mips/kernel/cpu-probe.c
+++ b/arch/mips/kernel/cpu-probe.c
@@ -112,7 +112,7 @@ static inline unsigned long cpu_get_fpu_id(void)
 	unsigned long tmp, fpu_id;
 
 	tmp = read_c0_status();
-	__enable_fpu();
+	__enable_fpu(FPU_AS_IS);
 	fpu_id = read_32bit_cp1_register(CP1_REVISION);
 	write_c0_status(tmp);
 	return fpu_id;
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index ddc7610..747a6cf 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -60,9 +60,6 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
 
 	/* New thread loses kernel privileges. */
 	status = regs->cp0_status & ~(ST0_CU0|ST0_CU1|ST0_FR|KU_MASK);
-#ifdef CONFIG_64BIT
-	status |= test_thread_flag(TIF_32BIT_REGS) ? 0 : ST0_FR;
-#endif
 	status |= KU_USER;
 	regs->cp0_status = status;
 	clear_used_math();
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index b52e1d2..7da9b76 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -137,13 +137,13 @@ int ptrace_getfpregs(struct task_struct *child, __u32 __user *data)
 		if (cpu_has_mipsmt) {
 			unsigned int vpflags = dvpe();
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 			evpe(vpflags);
 		} else {
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 		}
@@ -408,6 +408,7 @@ long arch_ptrace(struct task_struct *child, long request,
 	/* Read the word at location addr in the USER area. */
 	case PTRACE_PEEKUSR: {
 		struct pt_regs *regs;
+		fpureg_t *fregs;
 		unsigned long tmp = 0;
 
 		regs = task_pt_regs(child);
@@ -418,26 +419,28 @@ long arch_ptrace(struct task_struct *child, long request,
 			tmp = regs->regs[addr];
 			break;
 		case FPR_BASE ... FPR_BASE + 31:
-			if (tsk_used_math(child)) {
-				fpureg_t *fregs = get_fpu_regs(child);
+			if (!tsk_used_math(child)) {
+				/* FP not yet used */
+				tmp = -1;
+				break;
+			}
+			fregs = get_fpu_regs(child);
 
 #ifdef CONFIG_32BIT
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
 				/*
 				 * The odd registers are actually the high
 				 * order bits of the values stored in the even
 				 * registers - unless we're using r2k_switch.S.
 				 */
 				if (addr & 1)
-					tmp = (unsigned long) (fregs[((addr & ~1) - 32)] >> 32);
+					tmp = fregs[(addr & ~1) - 32] >> 32;
 				else
-					tmp = (unsigned long) (fregs[(addr - 32)] & 0xffffffff);
-#endif
-#ifdef CONFIG_64BIT
-				tmp = fregs[addr - FPR_BASE];
-#endif
-			} else {
-				tmp = -1;	/* FP not yet used  */
+					tmp = fregs[addr - 32];
+				break;
 			}
+#endif
+			tmp = fregs[addr - FPR_BASE];
 			break;
 		case PC:
 			tmp = regs->cp0_epc;
@@ -483,13 +486,13 @@ long arch_ptrace(struct task_struct *child, long request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
@@ -554,22 +557,25 @@ long arch_ptrace(struct task_struct *child, long request,
 				child->thread.fpu.fcr31 = 0;
 			}
 #ifdef CONFIG_32BIT
-			/*
-			 * The odd registers are actually the high order bits
-			 * of the values stored in the even registers - unless
-			 * we're using r2k_switch.S.
-			 */
-			if (addr & 1) {
-				fregs[(addr & ~1) - FPR_BASE] &= 0xffffffff;
-				fregs[(addr & ~1) - FPR_BASE] |= ((unsigned long long) data) << 32;
-			} else {
-				fregs[addr - FPR_BASE] &= ~0xffffffffLL;
-				fregs[addr - FPR_BASE] |= data;
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
+				/*
+				 * The odd registers are actually the high
+				 * order bits of the values stored in the even
+				 * registers - unless we're using r2k_switch.S.
+				 */
+				if (addr & 1) {
+					fregs[(addr & ~1) - FPR_BASE] &=
+						0xffffffff;
+					fregs[(addr & ~1) - FPR_BASE] |=
+						((u64)data) << 32;
+				} else {
+					fregs[addr - FPR_BASE] &= ~0xffffffffLL;
+					fregs[addr - FPR_BASE] |= data;
+				}
+				break;
 			}
 #endif
-#ifdef CONFIG_64BIT
 			fregs[addr - FPR_BASE] = data;
-#endif
 			break;
 		}
 		case PC:
diff --git a/arch/mips/kernel/ptrace32.c b/arch/mips/kernel/ptrace32.c
index 9486055..b8aa2dd 100644
--- a/arch/mips/kernel/ptrace32.c
+++ b/arch/mips/kernel/ptrace32.c
@@ -80,6 +80,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 	/* Read the word at location addr in the USER area. */
 	case PTRACE_PEEKUSR: {
 		struct pt_regs *regs;
+		fpureg_t *fregs;
 		unsigned int tmp;
 
 		regs = task_pt_regs(child);
@@ -90,21 +91,25 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			tmp = regs->regs[addr];
 			break;
 		case FPR_BASE ... FPR_BASE + 31:
-			if (tsk_used_math(child)) {
-				fpureg_t *fregs = get_fpu_regs(child);
-
+			if (!tsk_used_math(child)) {
+				/* FP not yet used */
+				tmp = -1;
+				break;
+			}
+			fregs = get_fpu_regs(child);
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
 				/*
 				 * The odd registers are actually the high
 				 * order bits of the values stored in the even
 				 * registers - unless we're using r2k_switch.S.
 				 */
 				if (addr & 1)
-					tmp = (unsigned long) (fregs[((addr & ~1) - 32)] >> 32);
+					tmp = fregs[(addr & ~1) - 32] >> 32;
 				else
-					tmp = (unsigned long) (fregs[(addr - 32)] & 0xffffffff);
-			} else {
-				tmp = -1;	/* FP not yet used  */
+					tmp = fregs[addr - 32];
+				break;
 			}
+			tmp = fregs[addr - FPR_BASE];
 			break;
 		case PC:
 			tmp = regs->cp0_epc;
@@ -147,13 +152,13 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
@@ -236,20 +241,24 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 				       sizeof(child->thread.fpu));
 				child->thread.fpu.fcr31 = 0;
 			}
-			/*
-			 * The odd registers are actually the high order bits
-			 * of the values stored in the even registers - unless
-			 * we're using r2k_switch.S.
-			 */
-			if (addr & 1) {
-				fregs[(addr & ~1) - FPR_BASE] &= 0xffffffff;
-				fregs[(addr & ~1) - FPR_BASE] |= ((unsigned long long) data) << 32;
-			} else {
-				fregs[addr - FPR_BASE] &= ~0xffffffffLL;
-				/* Must cast, lest sign extension fill upper
-				   bits!  */
-				fregs[addr - FPR_BASE] |= (unsigned int)data;
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
+				/*
+				 * The odd registers are actually the high
+				 * order bits of the values stored in the even
+				 * registers - unless we're using r2k_switch.S.
+				 */
+				if (addr & 1) {
+					fregs[(addr & ~1) - FPR_BASE] &=
+						0xffffffff;
+					fregs[(addr & ~1) - FPR_BASE] |=
+						((u64)data) << 32;
+				} else {
+					fregs[addr - FPR_BASE] &= ~0xffffffffLL;
+					fregs[addr - FPR_BASE] |= (u32)data;
+				}
+				break;
 			}
+			fregs[addr - FPR_BASE] = data;
 			break;
 		}
 		case PC:
diff --git a/arch/mips/kernel/r4k_fpu.S b/arch/mips/kernel/r4k_fpu.S
index 55ffe14..253b2fb 100644
--- a/arch/mips/kernel/r4k_fpu.S
+++ b/arch/mips/kernel/r4k_fpu.S
@@ -35,7 +35,15 @@
 LEAF(_save_fp_context)
 	cfc1	t1, fcr31
 
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+#endif
 	/* Store the 16 odd double precision registers */
 	EX	sdc1 $f1, SC_FPREGS+8(a0)
 	EX	sdc1 $f3, SC_FPREGS+24(a0)
@@ -53,6 +61,7 @@ LEAF(_save_fp_context)
 	EX	sdc1 $f27, SC_FPREGS+216(a0)
 	EX	sdc1 $f29, SC_FPREGS+232(a0)
 	EX	sdc1 $f31, SC_FPREGS+248(a0)
+1:	.set	pop
 #endif
 
 	/* Store the 16 even double precision registers */
@@ -82,7 +91,31 @@ LEAF(_save_fp_context)
 LEAF(_save_fp_context32)
 	cfc1	t1, fcr31
 
-	EX	sdc1 $f0, SC32_FPREGS+0(a0)
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+
+	/* Store the 16 odd double precision registers */
+	EX      sdc1 $f1, SC32_FPREGS+8(a0)
+	EX      sdc1 $f3, SC32_FPREGS+24(a0)
+	EX      sdc1 $f5, SC32_FPREGS+40(a0)
+	EX      sdc1 $f7, SC32_FPREGS+56(a0)
+	EX      sdc1 $f9, SC32_FPREGS+72(a0)
+	EX      sdc1 $f11, SC32_FPREGS+88(a0)
+	EX      sdc1 $f13, SC32_FPREGS+104(a0)
+	EX      sdc1 $f15, SC32_FPREGS+120(a0)
+	EX      sdc1 $f17, SC32_FPREGS+136(a0)
+	EX      sdc1 $f19, SC32_FPREGS+152(a0)
+	EX      sdc1 $f21, SC32_FPREGS+168(a0)
+	EX      sdc1 $f23, SC32_FPREGS+184(a0)
+	EX      sdc1 $f25, SC32_FPREGS+200(a0)
+	EX      sdc1 $f27, SC32_FPREGS+216(a0)
+	EX      sdc1 $f29, SC32_FPREGS+232(a0)
+	EX      sdc1 $f31, SC32_FPREGS+248(a0)
+
+	/* Store the 16 even double precision registers */
+1:	EX	sdc1 $f0, SC32_FPREGS+0(a0)
 	EX	sdc1 $f2, SC32_FPREGS+16(a0)
 	EX	sdc1 $f4, SC32_FPREGS+32(a0)
 	EX	sdc1 $f6, SC32_FPREGS+48(a0)
@@ -114,7 +147,16 @@ LEAF(_save_fp_context32)
  */
 LEAF(_restore_fp_context)
 	EX	lw t0, SC_FPC_CSR(a0)
-#ifdef CONFIG_64BIT
+
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set	mips64r2
+	mfc0	t1, CP0_STATUS		# t0 holds the saved FCSR
+	sll	t1, t1, 5
+	bgez	t1, 1f			# skip loading odd if FR=0
+	 nop
+#endif
 	EX	ldc1 $f1, SC_FPREGS+8(a0)
 	EX	ldc1 $f3, SC_FPREGS+24(a0)
 	EX	ldc1 $f5, SC_FPREGS+40(a0)
@@ -131,6 +173,7 @@ LEAF(_restore_fp_context)
 	EX	ldc1 $f27, SC_FPREGS+216(a0)
 	EX	ldc1 $f29, SC_FPREGS+232(a0)
 	EX	ldc1 $f31, SC_FPREGS+248(a0)
+1:	.set pop
 #endif
 	EX	ldc1 $f0, SC_FPREGS+0(a0)
 	EX	ldc1 $f2, SC_FPREGS+16(a0)
@@ -157,7 +200,30 @@ LEAF(_restore_fp_context)
 LEAF(_restore_fp_context32)
 	/* Restore an o32 sigcontext.  */
 	EX	lw t0, SC32_FPC_CSR(a0)
-	EX	ldc1 $f0, SC32_FPREGS+0(a0)
+
+	mfc0	t1, CP0_STATUS		# t0 holds the saved FCSR
+	sll	t1, t1, 5
+	bgez	t1, 1f			# skip loading odd if FR=0
+	 nop
+
+	EX      ldc1 $f1, SC32_FPREGS+8(a0)
+	EX      ldc1 $f3, SC32_FPREGS+24(a0)
+	EX      ldc1 $f5, SC32_FPREGS+40(a0)
+	EX      ldc1 $f7, SC32_FPREGS+56(a0)
+	EX      ldc1 $f9, SC32_FPREGS+72(a0)
+	EX      ldc1 $f11, SC32_FPREGS+88(a0)
+	EX      ldc1 $f13, SC32_FPREGS+104(a0)
+	EX      ldc1 $f15, SC32_FPREGS+120(a0)
+	EX      ldc1 $f17, SC32_FPREGS+136(a0)
+	EX      ldc1 $f19, SC32_FPREGS+152(a0)
+	EX      ldc1 $f21, SC32_FPREGS+168(a0)
+	EX      ldc1 $f23, SC32_FPREGS+184(a0)
+	EX      ldc1 $f25, SC32_FPREGS+200(a0)
+	EX      ldc1 $f27, SC32_FPREGS+216(a0)
+	EX      ldc1 $f29, SC32_FPREGS+232(a0)
+	EX      ldc1 $f31, SC32_FPREGS+248(a0)
+
+1:	EX	ldc1 $f0, SC32_FPREGS+0(a0)
 	EX	ldc1 $f2, SC32_FPREGS+16(a0)
 	EX	ldc1 $f4, SC32_FPREGS+32(a0)
 	EX	ldc1 $f6, SC32_FPREGS+48(a0)
diff --git a/arch/mips/kernel/r4k_switch.S b/arch/mips/kernel/r4k_switch.S
index 078de5e..cc78dd9 100644
--- a/arch/mips/kernel/r4k_switch.S
+++ b/arch/mips/kernel/r4k_switch.S
@@ -123,7 +123,7 @@
  * Save a thread's fp context.
  */
 LEAF(_save_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_save_double a0 t0 t1		# clobbers t1
@@ -134,7 +134,7 @@ LEAF(_save_fp)
  * Restore a thread's fp context.
  */
 LEAF(_restore_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_restore_double a0 t0 t1		# clobbers t1
@@ -228,6 +228,47 @@ LEAF(_init_fpu)
 	mtc1	t1, $f29
 	mtc1	t1, $f30
 	mtc1	t1, $f31
+
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set    push
+	.set    mips64r2
+	sll     t0, t0, 5			# is Status.FR set?
+	bgez    t0, 1f				# no: skip setting upper 32b
+
+	mthc1   t1, $f0
+	mthc1   t1, $f1
+	mthc1   t1, $f2
+	mthc1   t1, $f3
+	mthc1   t1, $f4
+	mthc1   t1, $f5
+	mthc1   t1, $f6
+	mthc1   t1, $f7
+	mthc1   t1, $f8
+	mthc1   t1, $f9
+	mthc1   t1, $f10
+	mthc1   t1, $f11
+	mthc1   t1, $f12
+	mthc1   t1, $f13
+	mthc1   t1, $f14
+	mthc1   t1, $f15
+	mthc1   t1, $f16
+	mthc1   t1, $f17
+	mthc1   t1, $f18
+	mthc1   t1, $f19
+	mthc1   t1, $f20
+	mthc1   t1, $f21
+	mthc1   t1, $f22
+	mthc1   t1, $f23
+	mthc1   t1, $f24
+	mthc1   t1, $f25
+	mthc1   t1, $f26
+	mthc1   t1, $f27
+	mthc1   t1, $f28
+	mthc1   t1, $f29
+	mthc1   t1, $f30
+	mthc1   t1, $f31
+1:	.set    pop
+#endif /* CONFIG_CPU_MIPS32_R2 */
 #else
 	.set	mips3
 	dmtc1	t1, $f0
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index 2f285ab..5199563 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -71,8 +71,9 @@ static int protected_save_fp_context(struct sigcontext __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -91,8 +92,9 @@ static int protected_restore_fp_context(struct sigcontext __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c
index 57de8b7..7c1024b 100644
--- a/arch/mips/kernel/signal32.c
+++ b/arch/mips/kernel/signal32.c
@@ -85,8 +85,9 @@ static int protected_save_fp_context32(struct sigcontext32 __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -105,8 +106,9 @@ static int protected_restore_fp_context32(struct sigcontext32 __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index cc20415..eb28423 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -1080,7 +1080,7 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 	unsigned long old_epc, old31;
 	unsigned int opcode;
 	unsigned int cpid;
-	int status;
+	int status, err;
 	unsigned long __maybe_unused flags;
 
 	prev_state = exception_enter();
@@ -1153,19 +1153,29 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 
 	case 1:
 		if (used_math())	/* Using the FPU again.	 */
-			own_fpu(1);
+			err = own_fpu(1);
 		else {			/* First time FPU user.	 */
-			init_fpu();
+			err = init_fpu();
 			set_used_math();
 		}
 
-		if (!raw_cpu_has_fpu) {
+#ifndef CONFIG_MIPS_O32_FP64_SUPPORT
+		/*
+		 * This assumes that either all FPUs in the system support
+		 * Status.FR (i.e. both 32-bit & 64-bit) or none of them do.
+		 */
+		if (err) {
+			force_sig(SIGFPE, current);
+			goto out;
+		}
+#endif
+		if (!raw_cpu_has_fpu || err) {
 			int sig;
 			void __user *fault_addr = NULL;
 			sig = fpu_emulator_cop1Handler(regs,
 						       &current->thread.fpu,
 						       0, &fault_addr);
-			if (!process_fpemu_return(sig, fault_addr))
+			if (!process_fpemu_return(sig, fault_addr) && !err)
 				mt_ase_fp_affinity();
 		}
 
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 4b37961..22f7b11 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -859,20 +859,20 @@ static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
  * In the Linux kernel, we support selection of FPR format on the
  * basis of the Status.FR bit.	If an FPU is not present, the FR bit
  * is hardwired to zero, which would imply a 32-bit FPU even for
- * 64-bit CPUs so we rather look at TIF_32BIT_REGS.
+ * 64-bit CPUs so we rather look at TIF_32BIT_FPREGS.
  * FPU emu is slow and bulky and optimizing this function offers fairly
  * sizeable benefits so we try to be clever and make this function return
  * a constant whenever possible, that is on 64-bit kernels without O32
- * compatibility enabled and on 32-bit kernels.
+ * compatibility enabled and on 32-bit without 64-bit FPU support.
  */
 static inline int cop1_64bit(struct pt_regs *xcp)
 {
 #if defined(CONFIG_64BIT) && !defined(CONFIG_MIPS32_O32)
 	return 1;
-#elif defined(CONFIG_64BIT) && defined(CONFIG_MIPS32_O32)
-	return !test_thread_flag(TIF_32BIT_REGS);
-#else
+#elif defined(CONFIG_32BIT) && !defined(CONFIG_MIPS_O32_FP64_SUPPORT)
 	return 0;
+#else
+	return !test_thread_flag(TIF_32BIT_FPREGS);
 #endif
 }
 
diff --git a/arch/mips/math-emu/kernel_linkage.c b/arch/mips/math-emu/kernel_linkage.c
index 1c58657..3aeae07 100644
--- a/arch/mips/math-emu/kernel_linkage.c
+++ b/arch/mips/math-emu/kernel_linkage.c
@@ -89,8 +89,9 @@ int fpu_emulator_save_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __put_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
@@ -103,8 +104,9 @@ int fpu_emulator_restore_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __get_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
-- 
1.8.4.1


* [PATCH v2 4/6] mips: support for 64-bit FP with O32 binaries
@ 2013-11-15 12:35     ` Paul Burton
From: Paul Burton @ 2013-11-15 12:35 UTC (permalink / raw)
  To: linux-mips; +Cc: Paul Burton

CPUs implementing mips32r2 may include a 64-bit FPU, just as mips64 CPUs
do. To preserve backwards compatibility, a 64-bit FPU acts like a 32-bit
FPU when the Status.FR bit is zero, again just like a mips64 CPU: each
double is held in the least significant 32 bits of an even-odd pair of
FP registers. The standard O32 ABI is defined expecting a 32-bit FPU,
but recent toolchains support use of a 64-bit FPU from an O32 mips32
executable. When an ELF executable is built to use a 64-bit FPU, a new
flag (EF_MIPS_FP64) is set in the ELF header.

With this patch the kernel checks the EF_MIPS_FP64 flag when executing
an O32 binary and sets Status.FR accordingly. Supporting 64-bit FP for
O32 lessens the opportunity for optimisation in the FPU emulator,
because cop1_64bit() can no longer return a constant on 32-bit kernels.
A CONFIG_MIPS_O32_FP64_SUPPORT Kconfig option is therefore introduced so
that this support can be disabled by those who don't require it.
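
For illustration only, the decision described above can be sketched in
user-space C. The helper names (fp_mode_for_flags, status_for_mode,
status_fr_set) are hypothetical and not part of this patch. The
EF_MIPS_FP64 value is taken from the elf.h hunk, and the Status.CU1
(bit 29) and Status.FR (bit 26) positions are the architectural ones.

```c
#include <assert.h>
#include <stdint.h>

#define EF_MIPS_FP64	0x00000200u	/* value from the elf.h hunk */
#define ST0_CU1		0x20000000u	/* Status.CU1: coprocessor 1 usable */
#define ST0_FR		0x04000000u	/* Status.FR: 64-bit FP registers */

enum fpu_mode { FPU_32BIT = 0, FPU_64BIT = 1 };

/* Hypothetical helper: pick the FP register mode from ELF e_flags. */
static enum fpu_mode fp_mode_for_flags(uint32_t e_flags)
{
	return (e_flags & EF_MIPS_FP64) ? FPU_64BIT : FPU_32BIT;
}

/* Hypothetical helper: Status with CU1 set and FR matching the mode. */
static uint32_t status_for_mode(uint32_t status, enum fpu_mode mode)
{
	status &= ~(ST0_CU1 | ST0_FR);
	status |= ST0_CU1;
	if (mode == FPU_64BIT)
		status |= ST0_FR;
	return status;
}

/*
 * The assembly tests FR with "sll t0, t0, 5; bgez t0, 1f": shifting
 * left by 5 moves bit 26 into bit 31 (the sign bit), so a result with
 * bit 31 set means FR is set.
 */
static int status_fr_set(uint32_t status)
{
	return ((status << 5) & 0x80000000u) != 0;
}
```

A binary without EF_MIPS_FP64 therefore maps to FPU_32BIT and leaves
FR clear, matching the behaviour that existing O32 binaries expect.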

Inspired by an earlier patch by Leonid Yegoshin, but implemented more
cleanly & correctly.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
Changes in v2:
  - Handle TIF_32BIT_FPREGS in PTRACE_P{EE,OK}EUSR.
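
A rough user-space sketch of the register mapping that the
PTRACE_P{EE,OK}EUSR change deals with, for readers unfamiliar with it.
The names peek_fpr and poke_fpr and the fr32 argument (standing in for
TIF_32BIT_FPREGS) are hypothetical. Unlike the kernel code, the even
register read masks explicitly rather than relying on truncation to a
32-bit variable.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved FP context: 32 64-bit slots. */
typedef uint64_t fpureg_t;

/* Read FP register idx (0..31) as a tracer would see it. */
static uint64_t peek_fpr(const fpureg_t *fregs, unsigned int idx, int fr32)
{
	if (fr32) {
		/* Odd registers are the high half of the even slot. */
		if (idx & 1)
			return fregs[idx & ~1u] >> 32;
		return fregs[idx] & 0xffffffffu;
	}
	/* 64-bit mode: every register has its own full slot. */
	return fregs[idx];
}

/* Write FP register idx (0..31) as a tracer would. */
static void poke_fpr(fpureg_t *fregs, unsigned int idx, int fr32,
		     uint64_t data)
{
	if (fr32) {
		if (idx & 1) {
			fregs[idx & ~1u] &= 0xffffffffu;
			fregs[idx & ~1u] |= (data & 0xffffffffu) << 32;
		} else {
			fregs[idx] &= ~(uint64_t)0xffffffffu;
			fregs[idx] |= data & 0xffffffffu;
		}
		return;
	}
	fregs[idx] = data;
}
```

With fr32 set, writes to registers 0 and 1 both land in slot 0. With
fr32 clear, register 1 is stored independently in slot 1.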
---
 arch/mips/Kconfig                   |  17 ++++++
 arch/mips/include/asm/asmmacro-32.h |  42 --------------
 arch/mips/include/asm/asmmacro-64.h |  96 --------------------------------
 arch/mips/include/asm/asmmacro.h    | 107 ++++++++++++++++++++++++++++++++++++
 arch/mips/include/asm/elf.h         |  17 +++++-
 arch/mips/include/asm/fpu.h         |  91 +++++++++++++++++++++++++-----
 arch/mips/include/asm/thread_info.h |   4 +-
 arch/mips/kernel/cpu-probe.c        |   2 +-
 arch/mips/kernel/process.c          |   3 -
 arch/mips/kernel/ptrace.c           |  60 +++++++++++---------
 arch/mips/kernel/ptrace32.c         |  53 ++++++++++--------
 arch/mips/kernel/r4k_fpu.S          |  74 +++++++++++++++++++++++--
 arch/mips/kernel/r4k_switch.S       |  45 ++++++++++++++-
 arch/mips/kernel/signal.c           |  10 ++--
 arch/mips/kernel/signal32.c         |  10 ++--
 arch/mips/kernel/traps.c            |  20 +++++--
 arch/mips/math-emu/cp1emu.c         |  10 ++--
 arch/mips/math-emu/kernel_linkage.c |   6 +-
 18 files changed, 431 insertions(+), 236 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 17cc7ff..aa2e03a 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2335,6 +2335,23 @@ config CC_STACKPROTECTOR
 
 	  This feature requires gcc version 4.2 or above.
 
+config MIPS_O32_FP64_SUPPORT
+	bool "Support for O32 binaries using 64-bit FP"
+	depends on (32BIT && CPU_MIPSR2) || MIPS32_O32
+	default y
+	help
+	  When this is enabled, the kernel will support use of 64-bit floating
+	  point registers with binaries using the O32 ABI along with the
+	  EF_MIPS_FP64 ELF header flag (typically built with -mfp64). On
+	  mips32 systems this support is at the cost of increasing the size
+	  and complexity of the compiled FPU emulator. Thus if you are running
+	  a mips32 system and know that none of your userland binaries will
+	  require 64-bit floating point, you may wish to reduce the size of
+	  your kernel & potentially improve FP emulation performance by saying
+	  N here.
+
+	  If unsure, say Y.
+
 config USE_OF
 	bool
 	select OF
diff --git a/arch/mips/include/asm/asmmacro-32.h b/arch/mips/include/asm/asmmacro-32.h
index 2413afe..70e1f17 100644
--- a/arch/mips/include/asm/asmmacro-32.h
+++ b/arch/mips/include/asm/asmmacro-32.h
@@ -12,27 +12,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_double thread status tmp1=t0
-	cfc1	\tmp1,  fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp1, THREAD_FCR31(\thread)
-	.endm
-
 	.macro	fpu_save_single thread tmp=t0
 	cfc1	\tmp,  fcr31
 	swc1	$f0,  THREAD_FPR0(\thread)
@@ -70,27 +49,6 @@
 	sw	\tmp, THREAD_FCR31(\thread)
 	.endm
 
-	.macro	fpu_restore_double thread status tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
 	.macro	fpu_restore_single thread tmp=t0
 	lw	\tmp, THREAD_FCR31(\thread)
 	lwc1	$f0,  THREAD_FPR0(\thread)
diff --git a/arch/mips/include/asm/asmmacro-64.h b/arch/mips/include/asm/asmmacro-64.h
index 08a527d..38ea609 100644
--- a/arch/mips/include/asm/asmmacro-64.h
+++ b/arch/mips/include/asm/asmmacro-64.h
@@ -13,102 +13,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_16even thread tmp=t0
-	cfc1	\tmp, fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp, THREAD_FCR31(\thread)
-	.endm
-
-	.macro	fpu_save_16odd thread
-	sdc1	$f1,  THREAD_FPR1(\thread)
-	sdc1	$f3,  THREAD_FPR3(\thread)
-	sdc1	$f5,  THREAD_FPR5(\thread)
-	sdc1	$f7,  THREAD_FPR7(\thread)
-	sdc1	$f9,  THREAD_FPR9(\thread)
-	sdc1	$f11, THREAD_FPR11(\thread)
-	sdc1	$f13, THREAD_FPR13(\thread)
-	sdc1	$f15, THREAD_FPR15(\thread)
-	sdc1	$f17, THREAD_FPR17(\thread)
-	sdc1	$f19, THREAD_FPR19(\thread)
-	sdc1	$f21, THREAD_FPR21(\thread)
-	sdc1	$f23, THREAD_FPR23(\thread)
-	sdc1	$f25, THREAD_FPR25(\thread)
-	sdc1	$f27, THREAD_FPR27(\thread)
-	sdc1	$f29, THREAD_FPR29(\thread)
-	sdc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_save_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 2f
-	fpu_save_16odd \thread
-2:
-	fpu_save_16even \thread \tmp
-	.endm
-
-	.macro	fpu_restore_16even thread tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
-	.macro	fpu_restore_16odd thread
-	ldc1	$f1,  THREAD_FPR1(\thread)
-	ldc1	$f3,  THREAD_FPR3(\thread)
-	ldc1	$f5,  THREAD_FPR5(\thread)
-	ldc1	$f7,  THREAD_FPR7(\thread)
-	ldc1	$f9,  THREAD_FPR9(\thread)
-	ldc1	$f11, THREAD_FPR11(\thread)
-	ldc1	$f13, THREAD_FPR13(\thread)
-	ldc1	$f15, THREAD_FPR15(\thread)
-	ldc1	$f17, THREAD_FPR17(\thread)
-	ldc1	$f19, THREAD_FPR19(\thread)
-	ldc1	$f21, THREAD_FPR21(\thread)
-	ldc1	$f23, THREAD_FPR23(\thread)
-	ldc1	$f25, THREAD_FPR25(\thread)
-	ldc1	$f27, THREAD_FPR27(\thread)
-	ldc1	$f29, THREAD_FPR29(\thread)
-	ldc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_restore_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 1f				# 16 register mode?
-
-	fpu_restore_16odd \thread
-1:	fpu_restore_16even \thread \tmp
-	.endm
-
 	.macro	cpu_save_nonscratch thread
 	LONG_S	s0, THREAD_REG16(\thread)
 	LONG_S	s1, THREAD_REG17(\thread)
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index 6c8342a..3220c93 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -62,6 +62,113 @@
 	.endm
 #endif /* CONFIG_MIPS_MT_SMTC */
 
+	.macro	fpu_save_16even thread tmp=t0
+	cfc1	\tmp, fcr31
+	sdc1	$f0,  THREAD_FPR0(\thread)
+	sdc1	$f2,  THREAD_FPR2(\thread)
+	sdc1	$f4,  THREAD_FPR4(\thread)
+	sdc1	$f6,  THREAD_FPR6(\thread)
+	sdc1	$f8,  THREAD_FPR8(\thread)
+	sdc1	$f10, THREAD_FPR10(\thread)
+	sdc1	$f12, THREAD_FPR12(\thread)
+	sdc1	$f14, THREAD_FPR14(\thread)
+	sdc1	$f16, THREAD_FPR16(\thread)
+	sdc1	$f18, THREAD_FPR18(\thread)
+	sdc1	$f20, THREAD_FPR20(\thread)
+	sdc1	$f22, THREAD_FPR22(\thread)
+	sdc1	$f24, THREAD_FPR24(\thread)
+	sdc1	$f26, THREAD_FPR26(\thread)
+	sdc1	$f28, THREAD_FPR28(\thread)
+	sdc1	$f30, THREAD_FPR30(\thread)
+	sw	\tmp, THREAD_FCR31(\thread)
+	.endm
+
+	.macro	fpu_save_16odd thread
+	.set	push
+	.set	mips64r2
+	sdc1	$f1,  THREAD_FPR1(\thread)
+	sdc1	$f3,  THREAD_FPR3(\thread)
+	sdc1	$f5,  THREAD_FPR5(\thread)
+	sdc1	$f7,  THREAD_FPR7(\thread)
+	sdc1	$f9,  THREAD_FPR9(\thread)
+	sdc1	$f11, THREAD_FPR11(\thread)
+	sdc1	$f13, THREAD_FPR13(\thread)
+	sdc1	$f15, THREAD_FPR15(\thread)
+	sdc1	$f17, THREAD_FPR17(\thread)
+	sdc1	$f19, THREAD_FPR19(\thread)
+	sdc1	$f21, THREAD_FPR21(\thread)
+	sdc1	$f23, THREAD_FPR23(\thread)
+	sdc1	$f25, THREAD_FPR25(\thread)
+	sdc1	$f27, THREAD_FPR27(\thread)
+	sdc1	$f29, THREAD_FPR29(\thread)
+	sdc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_save_double thread status tmp
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f
+	fpu_save_16odd \thread
+10:
+#endif
+	fpu_save_16even \thread \tmp
+	.endm
+
+	.macro	fpu_restore_16even thread tmp=t0
+	lw	\tmp, THREAD_FCR31(\thread)
+	ldc1	$f0,  THREAD_FPR0(\thread)
+	ldc1	$f2,  THREAD_FPR2(\thread)
+	ldc1	$f4,  THREAD_FPR4(\thread)
+	ldc1	$f6,  THREAD_FPR6(\thread)
+	ldc1	$f8,  THREAD_FPR8(\thread)
+	ldc1	$f10, THREAD_FPR10(\thread)
+	ldc1	$f12, THREAD_FPR12(\thread)
+	ldc1	$f14, THREAD_FPR14(\thread)
+	ldc1	$f16, THREAD_FPR16(\thread)
+	ldc1	$f18, THREAD_FPR18(\thread)
+	ldc1	$f20, THREAD_FPR20(\thread)
+	ldc1	$f22, THREAD_FPR22(\thread)
+	ldc1	$f24, THREAD_FPR24(\thread)
+	ldc1	$f26, THREAD_FPR26(\thread)
+	ldc1	$f28, THREAD_FPR28(\thread)
+	ldc1	$f30, THREAD_FPR30(\thread)
+	ctc1	\tmp, fcr31
+	.endm
+
+	.macro	fpu_restore_16odd thread
+	.set	push
+	.set	mips64r2
+	ldc1	$f1,  THREAD_FPR1(\thread)
+	ldc1	$f3,  THREAD_FPR3(\thread)
+	ldc1	$f5,  THREAD_FPR5(\thread)
+	ldc1	$f7,  THREAD_FPR7(\thread)
+	ldc1	$f9,  THREAD_FPR9(\thread)
+	ldc1	$f11, THREAD_FPR11(\thread)
+	ldc1	$f13, THREAD_FPR13(\thread)
+	ldc1	$f15, THREAD_FPR15(\thread)
+	ldc1	$f17, THREAD_FPR17(\thread)
+	ldc1	$f19, THREAD_FPR19(\thread)
+	ldc1	$f21, THREAD_FPR21(\thread)
+	ldc1	$f23, THREAD_FPR23(\thread)
+	ldc1	$f25, THREAD_FPR25(\thread)
+	ldc1	$f27, THREAD_FPR27(\thread)
+	ldc1	$f29, THREAD_FPR29(\thread)
+	ldc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_restore_double thread status tmp
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f				# 16 register mode?
+
+	fpu_restore_16odd \thread
+10:
+#endif
+	fpu_restore_16even \thread \tmp
+	.endm
+
 /*
  * Temporary until all gas have MT ASE support
  */
diff --git a/arch/mips/include/asm/elf.h b/arch/mips/include/asm/elf.h
index a66359e..17163cf 100644
--- a/arch/mips/include/asm/elf.h
+++ b/arch/mips/include/asm/elf.h
@@ -36,6 +36,7 @@
 #define EF_MIPS_ABI2		0x00000020
 #define EF_MIPS_OPTIONS_FIRST	0x00000080
 #define EF_MIPS_32BITMODE	0x00000100
+#define EF_MIPS_FP64		0x00000200
 #define EF_MIPS_ABI		0x0000f000
 #define EF_MIPS_ARCH		0xf0000000
 
@@ -249,6 +250,11 @@ extern struct mips_abi mips_abi_n32;
 
 #define SET_PERSONALITY(ex)						\
 do {									\
+	if ((ex).e_flags & EF_MIPS_FP64)				\
+		clear_thread_flag(TIF_32BIT_FPREGS);			\
+	else								\
+		set_thread_flag(TIF_32BIT_FPREGS);			\
+									\
 	if (personality(current->personality) != PER_LINUX)		\
 		set_personality(PER_LINUX);				\
 									\
@@ -271,14 +277,18 @@ do {									\
 #endif
 
 #ifdef CONFIG_MIPS32_O32
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do {								\
 		set_thread_flag(TIF_32BIT_REGS);			\
 		set_thread_flag(TIF_32BIT_ADDR);			\
+									\
+		if (!((ex).e_flags & EF_MIPS_FP64))			\
+			set_thread_flag(TIF_32BIT_FPREGS);		\
+									\
 		current->thread.abi = &mips_abi_32;			\
 	} while (0)
 #else
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do { } while (0)
 #endif
 
@@ -289,7 +299,7 @@ do {									\
 	     ((ex).e_flags & EF_MIPS_ABI) == 0)				\
 		__SET_PERSONALITY32_N32();				\
 	else								\
-		__SET_PERSONALITY32_O32();				\
+		__SET_PERSONALITY32_O32(ex);				\
 } while (0)
 #else
 #define __SET_PERSONALITY32(ex) do { } while (0)
@@ -300,6 +310,7 @@ do {									\
 	unsigned int p;							\
 									\
 	clear_thread_flag(TIF_32BIT_REGS);				\
+	clear_thread_flag(TIF_32BIT_FPREGS);				\
 	clear_thread_flag(TIF_32BIT_ADDR);				\
 									\
 	if ((ex).e_ident[EI_CLASS] == ELFCLASS32)			\
diff --git a/arch/mips/include/asm/fpu.h b/arch/mips/include/asm/fpu.h
index 3bf023f..cfe092f 100644
--- a/arch/mips/include/asm/fpu.h
+++ b/arch/mips/include/asm/fpu.h
@@ -33,11 +33,48 @@ extern void _init_fpu(void);
 extern void _save_fp(struct task_struct *);
 extern void _restore_fp(struct task_struct *);
 
-#define __enable_fpu()							\
-do {									\
-	set_c0_status(ST0_CU1);						\
-	enable_fpu_hazard();						\
-} while (0)
+/*
+ * This enum specifies a mode in which we want the FPU to operate, for cores
+ * which implement the Status.FR bit. Note that FPU_32BIT & FPU_64BIT
+ * purposefully have the values 0 & 1 respectively, so that an integer value
+ * of Status.FR can be trivially cast to the corresponding enum fpu_mode.
+ */
+enum fpu_mode {
+	FPU_32BIT = 0,		/* FR = 0 */
+	FPU_64BIT,		/* FR = 1 */
+	FPU_AS_IS,
+};
+
+static inline int __enable_fpu(enum fpu_mode mode)
+{
+	int fr;
+
+	switch (mode) {
+	case FPU_AS_IS:
+		/* just enable the FPU in its current mode */
+		set_c0_status(ST0_CU1);
+		enable_fpu_hazard();
+		return 0;
+
+	case FPU_64BIT:
+#if !(defined(CONFIG_CPU_MIPS32_R2) || defined(CONFIG_64BIT))
+		/* we only have a 32-bit FPU */
+		return SIGFPE;
+#endif
+		/* fall through */
+	case FPU_32BIT:
+		/* set CU1 & change FR appropriately */
+		fr = (int)mode;
+		change_c0_status(ST0_CU1 | ST0_FR, ST0_CU1 | (fr ? ST0_FR : 0));
+		enable_fpu_hazard();
+
+		/* check FR has the desired value */
+		return (!!(read_c0_status() & ST0_FR) == !!fr) ? 0 : SIGFPE;
+
+	default:
+		BUG();
+	}
+}
 
 #define __disable_fpu()							\
 do {									\
@@ -57,27 +94,46 @@ static inline int is_fpu_owner(void)
 	return cpu_has_fpu && __is_fpu_owner();
 }
 
-static inline void __own_fpu(void)
+static inline int __own_fpu(void)
 {
-	__enable_fpu();
+	enum fpu_mode mode;
+	int ret;
+
+	mode = !test_thread_flag(TIF_32BIT_FPREGS);
+	ret = __enable_fpu(mode);
+	if (ret)
+		return ret;
+
 	KSTK_STATUS(current) |= ST0_CU1;
+	if (mode == FPU_64BIT)
+		KSTK_STATUS(current) |= ST0_FR;
+	else /* mode == FPU_32BIT */
+		KSTK_STATUS(current) &= ~ST0_FR;
+
 	set_thread_flag(TIF_USEDFPU);
+	return 0;
 }
 
-static inline void own_fpu_inatomic(int restore)
+static inline int own_fpu_inatomic(int restore)
 {
+	int ret = 0;
+
 	if (cpu_has_fpu && !__is_fpu_owner()) {
-		__own_fpu();
-		if (restore)
+		ret = __own_fpu();
+		if (restore && !ret)
 			_restore_fp(current);
 	}
+	return ret;
 }
 
-static inline void own_fpu(int restore)
+static inline int own_fpu(int restore)
 {
+	int ret;
+
 	preempt_disable();
-	own_fpu_inatomic(restore);
+	ret = own_fpu_inatomic(restore);
 	preempt_enable();
+	return ret;
 }
 
 static inline void lose_fpu(int save)
@@ -93,16 +149,21 @@ static inline void lose_fpu(int save)
 	preempt_enable();
 }
 
-static inline void init_fpu(void)
+static inline int init_fpu(void)
 {
+	int ret = 0;
+
 	preempt_disable();
 	if (cpu_has_fpu) {
-		__own_fpu();
-		_init_fpu();
+		ret = __own_fpu();
+		if (!ret)
+			_init_fpu();
 	} else {
 		fpu_emulator_init_fpu();
 	}
+
 	preempt_enable();
+	return ret;
 }
 
 static inline void save_fp(struct task_struct *tsk)
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index f9b24bf..b6da8b7 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -112,11 +112,12 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
-#define TIF_32BIT_REGS		22	/* also implies 16/32 fprs */
+#define TIF_32BIT_REGS		22	/* 32-bit general purpose registers */
 #define TIF_32BIT_ADDR		23	/* 32-bit address space (o32/n32) */
 #define TIF_FPUBOUND		24	/* thread bound to FPU-full CPU set */
 #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
 #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
+#define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -133,6 +134,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT_ADDR		(1<<TIF_32BIT_ADDR)
 #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
 #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
+#define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 
 #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c
index 8168e29..116102c 100644
--- a/arch/mips/kernel/cpu-probe.c
+++ b/arch/mips/kernel/cpu-probe.c
@@ -112,7 +112,7 @@ static inline unsigned long cpu_get_fpu_id(void)
 	unsigned long tmp, fpu_id;
 
 	tmp = read_c0_status();
-	__enable_fpu();
+	__enable_fpu(FPU_AS_IS);
 	fpu_id = read_32bit_cp1_register(CP1_REVISION);
 	write_c0_status(tmp);
 	return fpu_id;
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index ddc7610..747a6cf 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -60,9 +60,6 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
 
 	/* New thread loses kernel privileges. */
 	status = regs->cp0_status & ~(ST0_CU0|ST0_CU1|ST0_FR|KU_MASK);
-#ifdef CONFIG_64BIT
-	status |= test_thread_flag(TIF_32BIT_REGS) ? 0 : ST0_FR;
-#endif
 	status |= KU_USER;
 	regs->cp0_status = status;
 	clear_used_math();
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index b52e1d2..7da9b76 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -137,13 +137,13 @@ int ptrace_getfpregs(struct task_struct *child, __u32 __user *data)
 		if (cpu_has_mipsmt) {
 			unsigned int vpflags = dvpe();
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 			evpe(vpflags);
 		} else {
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 		}
@@ -408,6 +408,7 @@ long arch_ptrace(struct task_struct *child, long request,
 	/* Read the word at location addr in the USER area. */
 	case PTRACE_PEEKUSR: {
 		struct pt_regs *regs;
+		fpureg_t *fregs;
 		unsigned long tmp = 0;
 
 		regs = task_pt_regs(child);
@@ -418,26 +419,28 @@ long arch_ptrace(struct task_struct *child, long request,
 			tmp = regs->regs[addr];
 			break;
 		case FPR_BASE ... FPR_BASE + 31:
-			if (tsk_used_math(child)) {
-				fpureg_t *fregs = get_fpu_regs(child);
+			if (!tsk_used_math(child)) {
+				/* FP not yet used */
+				tmp = -1;
+				break;
+			}
+			fregs = get_fpu_regs(child);
 
 #ifdef CONFIG_32BIT
+			if (test_tsk_thread_flag(child, TIF_32BIT_FPREGS)) {
 				/*
 				 * The odd registers are actually the high
 				 * order bits of the values stored in the even
 				 * registers - unless we're using r2k_switch.S.
 				 */
 				if (addr & 1)
-					tmp = (unsigned long) (fregs[((addr & ~1) - 32)] >> 32);
+					tmp = fregs[(addr & ~1) - 32] >> 32;
 				else
-					tmp = (unsigned long) (fregs[(addr - 32)] & 0xffffffff);
-#endif
-#ifdef CONFIG_64BIT
-				tmp = fregs[addr - FPR_BASE];
-#endif
-			} else {
-				tmp = -1;	/* FP not yet used  */
+					tmp = fregs[addr - 32];
+				break;
 			}
+#endif
+			tmp = fregs[addr - FPR_BASE];
 			break;
 		case PC:
 			tmp = regs->cp0_epc;
@@ -483,13 +486,13 @@ long arch_ptrace(struct task_struct *child, long request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
@@ -554,22 +557,25 @@ long arch_ptrace(struct task_struct *child, long request,
 				child->thread.fpu.fcr31 = 0;
 			}
 #ifdef CONFIG_32BIT
-			/*
-			 * The odd registers are actually the high order bits
-			 * of the values stored in the even registers - unless
-			 * we're using r2k_switch.S.
-			 */
-			if (addr & 1) {
-				fregs[(addr & ~1) - FPR_BASE] &= 0xffffffff;
-				fregs[(addr & ~1) - FPR_BASE] |= ((unsigned long long) data) << 32;
-			} else {
-				fregs[addr - FPR_BASE] &= ~0xffffffffLL;
-				fregs[addr - FPR_BASE] |= data;
+			if (test_tsk_thread_flag(child, TIF_32BIT_FPREGS)) {
+				/*
+				 * The odd registers are actually the high
+				 * order bits of the values stored in the even
+				 * registers - unless we're using r2k_switch.S.
+				 */
+				if (addr & 1) {
+					fregs[(addr & ~1) - FPR_BASE] &=
+						0xffffffff;
+					fregs[(addr & ~1) - FPR_BASE] |=
+						((u64)data) << 32;
+				} else {
+					fregs[addr - FPR_BASE] &= ~0xffffffffLL;
+					fregs[addr - FPR_BASE] |= data;
+				}
+				break;
 			}
 #endif
-#ifdef CONFIG_64BIT
 			fregs[addr - FPR_BASE] = data;
-#endif
 			break;
 		}
 		case PC:
diff --git a/arch/mips/kernel/ptrace32.c b/arch/mips/kernel/ptrace32.c
index 9486055..b8aa2dd 100644
--- a/arch/mips/kernel/ptrace32.c
+++ b/arch/mips/kernel/ptrace32.c
@@ -80,6 +80,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 	/* Read the word at location addr in the USER area. */
 	case PTRACE_PEEKUSR: {
 		struct pt_regs *regs;
+		fpureg_t *fregs;
 		unsigned int tmp;
 
 		regs = task_pt_regs(child);
@@ -90,21 +91,25 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			tmp = regs->regs[addr];
 			break;
 		case FPR_BASE ... FPR_BASE + 31:
-			if (tsk_used_math(child)) {
-				fpureg_t *fregs = get_fpu_regs(child);
-
+			if (!tsk_used_math(child)) {
+				/* FP not yet used */
+				tmp = -1;
+				break;
+			}
+			fregs = get_fpu_regs(child);
+			if (test_tsk_thread_flag(child, TIF_32BIT_FPREGS)) {
 				/*
 				 * The odd registers are actually the high
 				 * order bits of the values stored in the even
 				 * registers - unless we're using r2k_switch.S.
 				 */
 				if (addr & 1)
-					tmp = (unsigned long) (fregs[((addr & ~1) - 32)] >> 32);
+					tmp = fregs[(addr & ~1) - 32] >> 32;
 				else
-					tmp = (unsigned long) (fregs[(addr - 32)] & 0xffffffff);
-			} else {
-				tmp = -1;	/* FP not yet used  */
+					tmp = fregs[addr - 32];
+				break;
 			}
+			tmp = fregs[addr - FPR_BASE];
 			break;
 		case PC:
 			tmp = regs->cp0_epc;
@@ -147,13 +152,13 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
@@ -236,20 +241,24 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 				       sizeof(child->thread.fpu));
 				child->thread.fpu.fcr31 = 0;
 			}
-			/*
-			 * The odd registers are actually the high order bits
-			 * of the values stored in the even registers - unless
-			 * we're using r2k_switch.S.
-			 */
-			if (addr & 1) {
-				fregs[(addr & ~1) - FPR_BASE] &= 0xffffffff;
-				fregs[(addr & ~1) - FPR_BASE] |= ((unsigned long long) data) << 32;
-			} else {
-				fregs[addr - FPR_BASE] &= ~0xffffffffLL;
-				/* Must cast, lest sign extension fill upper
-				   bits!  */
-				fregs[addr - FPR_BASE] |= (unsigned int)data;
+			if (test_tsk_thread_flag(child, TIF_32BIT_FPREGS)) {
+				/*
+				 * The odd registers are actually the high
+				 * order bits of the values stored in the even
+				 * registers - unless we're using r2k_switch.S.
+				 */
+				if (addr & 1) {
+					fregs[(addr & ~1) - FPR_BASE] &=
+						0xffffffff;
+					fregs[(addr & ~1) - FPR_BASE] |=
+						((u64)data) << 32;
+				} else {
+					fregs[addr - FPR_BASE] &= ~0xffffffffLL;
+					fregs[addr - FPR_BASE] |= (u32)data;
+				}
+				break;
 			}
+			fregs[addr - FPR_BASE] = data;
 			break;
 		}
 		case PC:
diff --git a/arch/mips/kernel/r4k_fpu.S b/arch/mips/kernel/r4k_fpu.S
index 55ffe14..253b2fb 100644
--- a/arch/mips/kernel/r4k_fpu.S
+++ b/arch/mips/kernel/r4k_fpu.S
@@ -35,7 +35,15 @@
 LEAF(_save_fp_context)
 	cfc1	t1, fcr31
 
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+#endif
 	/* Store the 16 odd double precision registers */
 	EX	sdc1 $f1, SC_FPREGS+8(a0)
 	EX	sdc1 $f3, SC_FPREGS+24(a0)
@@ -53,6 +61,7 @@ LEAF(_save_fp_context)
 	EX	sdc1 $f27, SC_FPREGS+216(a0)
 	EX	sdc1 $f29, SC_FPREGS+232(a0)
 	EX	sdc1 $f31, SC_FPREGS+248(a0)
+1:	.set	pop
 #endif
 
 	/* Store the 16 even double precision registers */
@@ -82,7 +91,31 @@ LEAF(_save_fp_context)
 LEAF(_save_fp_context32)
 	cfc1	t1, fcr31
 
-	EX	sdc1 $f0, SC32_FPREGS+0(a0)
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+
+	/* Store the 16 odd double precision registers */
+	EX      sdc1 $f1, SC32_FPREGS+8(a0)
+	EX      sdc1 $f3, SC32_FPREGS+24(a0)
+	EX      sdc1 $f5, SC32_FPREGS+40(a0)
+	EX      sdc1 $f7, SC32_FPREGS+56(a0)
+	EX      sdc1 $f9, SC32_FPREGS+72(a0)
+	EX      sdc1 $f11, SC32_FPREGS+88(a0)
+	EX      sdc1 $f13, SC32_FPREGS+104(a0)
+	EX      sdc1 $f15, SC32_FPREGS+120(a0)
+	EX      sdc1 $f17, SC32_FPREGS+136(a0)
+	EX      sdc1 $f19, SC32_FPREGS+152(a0)
+	EX      sdc1 $f21, SC32_FPREGS+168(a0)
+	EX      sdc1 $f23, SC32_FPREGS+184(a0)
+	EX      sdc1 $f25, SC32_FPREGS+200(a0)
+	EX      sdc1 $f27, SC32_FPREGS+216(a0)
+	EX      sdc1 $f29, SC32_FPREGS+232(a0)
+	EX      sdc1 $f31, SC32_FPREGS+248(a0)
+
+	/* Store the 16 even double precision registers */
+1:	EX	sdc1 $f0, SC32_FPREGS+0(a0)
 	EX	sdc1 $f2, SC32_FPREGS+16(a0)
 	EX	sdc1 $f4, SC32_FPREGS+32(a0)
 	EX	sdc1 $f6, SC32_FPREGS+48(a0)
@@ -114,7 +147,16 @@ LEAF(_save_fp_context32)
  */
 LEAF(_restore_fp_context)
 	EX	lw t0, SC_FPC_CSR(a0)
-#ifdef CONFIG_64BIT
+
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+#endif
 	EX	ldc1 $f1, SC_FPREGS+8(a0)
 	EX	ldc1 $f3, SC_FPREGS+24(a0)
 	EX	ldc1 $f5, SC_FPREGS+40(a0)
@@ -131,6 +173,7 @@ LEAF(_restore_fp_context)
 	EX	ldc1 $f27, SC_FPREGS+216(a0)
 	EX	ldc1 $f29, SC_FPREGS+232(a0)
 	EX	ldc1 $f31, SC_FPREGS+248(a0)
+1:	.set pop
 #endif
 	EX	ldc1 $f0, SC_FPREGS+0(a0)
 	EX	ldc1 $f2, SC_FPREGS+16(a0)
@@ -157,7 +200,30 @@ LEAF(_restore_fp_context)
 LEAF(_restore_fp_context32)
 	/* Restore an o32 sigcontext.  */
 	EX	lw t0, SC32_FPC_CSR(a0)
-	EX	ldc1 $f0, SC32_FPREGS+0(a0)
+
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+
+	EX      ldc1 $f1, SC32_FPREGS+8(a0)
+	EX      ldc1 $f3, SC32_FPREGS+24(a0)
+	EX      ldc1 $f5, SC32_FPREGS+40(a0)
+	EX      ldc1 $f7, SC32_FPREGS+56(a0)
+	EX      ldc1 $f9, SC32_FPREGS+72(a0)
+	EX      ldc1 $f11, SC32_FPREGS+88(a0)
+	EX      ldc1 $f13, SC32_FPREGS+104(a0)
+	EX      ldc1 $f15, SC32_FPREGS+120(a0)
+	EX      ldc1 $f17, SC32_FPREGS+136(a0)
+	EX      ldc1 $f19, SC32_FPREGS+152(a0)
+	EX      ldc1 $f21, SC32_FPREGS+168(a0)
+	EX      ldc1 $f23, SC32_FPREGS+184(a0)
+	EX      ldc1 $f25, SC32_FPREGS+200(a0)
+	EX      ldc1 $f27, SC32_FPREGS+216(a0)
+	EX      ldc1 $f29, SC32_FPREGS+232(a0)
+	EX      ldc1 $f31, SC32_FPREGS+248(a0)
+
+1:	EX	ldc1 $f0, SC32_FPREGS+0(a0)
 	EX	ldc1 $f2, SC32_FPREGS+16(a0)
 	EX	ldc1 $f4, SC32_FPREGS+32(a0)
 	EX	ldc1 $f6, SC32_FPREGS+48(a0)
diff --git a/arch/mips/kernel/r4k_switch.S b/arch/mips/kernel/r4k_switch.S
index 078de5e..cc78dd9 100644
--- a/arch/mips/kernel/r4k_switch.S
+++ b/arch/mips/kernel/r4k_switch.S
@@ -123,7 +123,7 @@
  * Save a thread's fp context.
  */
 LEAF(_save_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_save_double a0 t0 t1		# clobbers t1
@@ -134,7 +134,7 @@ LEAF(_save_fp)
  * Restore a thread's fp context.
  */
 LEAF(_restore_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_restore_double a0 t0 t1		# clobbers t1
@@ -228,6 +228,47 @@ LEAF(_init_fpu)
 	mtc1	t1, $f29
 	mtc1	t1, $f30
 	mtc1	t1, $f31
+
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set    push
+	.set    mips64r2
+	sll     t0, t0, 5			# is Status.FR set?
+	bgez    t0, 1f				# no: skip setting upper 32b
+
+	mthc1   t1, $f0
+	mthc1   t1, $f1
+	mthc1   t1, $f2
+	mthc1   t1, $f3
+	mthc1   t1, $f4
+	mthc1   t1, $f5
+	mthc1   t1, $f6
+	mthc1   t1, $f7
+	mthc1   t1, $f8
+	mthc1   t1, $f9
+	mthc1   t1, $f10
+	mthc1   t1, $f11
+	mthc1   t1, $f12
+	mthc1   t1, $f13
+	mthc1   t1, $f14
+	mthc1   t1, $f15
+	mthc1   t1, $f16
+	mthc1   t1, $f17
+	mthc1   t1, $f18
+	mthc1   t1, $f19
+	mthc1   t1, $f20
+	mthc1   t1, $f21
+	mthc1   t1, $f22
+	mthc1   t1, $f23
+	mthc1   t1, $f24
+	mthc1   t1, $f25
+	mthc1   t1, $f26
+	mthc1   t1, $f27
+	mthc1   t1, $f28
+	mthc1   t1, $f29
+	mthc1   t1, $f30
+	mthc1   t1, $f31
+1:	.set    pop
+#endif /* CONFIG_CPU_MIPS32_R2 */
 #else
 	.set	mips3
 	dmtc1	t1, $f0
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index 2f285ab..5199563 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -71,8 +71,9 @@ static int protected_save_fp_context(struct sigcontext __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -91,8 +92,9 @@ static int protected_restore_fp_context(struct sigcontext __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c
index 57de8b7..7c1024b 100644
--- a/arch/mips/kernel/signal32.c
+++ b/arch/mips/kernel/signal32.c
@@ -85,8 +85,9 @@ static int protected_save_fp_context32(struct sigcontext32 __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -105,8 +106,9 @@ static int protected_restore_fp_context32(struct sigcontext32 __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index cc20415..eb28423 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -1080,7 +1080,7 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 	unsigned long old_epc, old31;
 	unsigned int opcode;
 	unsigned int cpid;
-	int status;
+	int status, err;
 	unsigned long __maybe_unused flags;
 
 	prev_state = exception_enter();
@@ -1153,19 +1153,29 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 
 	case 1:
 		if (used_math())	/* Using the FPU again.	 */
-			own_fpu(1);
+			err = own_fpu(1);
 		else {			/* First time FPU user.	 */
-			init_fpu();
+			err = init_fpu();
 			set_used_math();
 		}
 
-		if (!raw_cpu_has_fpu) {
+#ifndef CONFIG_MIPS_O32_FP64_SUPPORT
+		/*
+		 * This assumes that either all FPUs in the system support
+		 * Status.FR (ie. both 32-bit & 64-bit) or none of them do.
+		 */
+		if (err) {
+			force_sig(SIGFPE, current);
+			goto out;
+		}
+#endif
+		if (!raw_cpu_has_fpu || err) {
 			int sig;
 			void __user *fault_addr = NULL;
 			sig = fpu_emulator_cop1Handler(regs,
 						       &current->thread.fpu,
 						       0, &fault_addr);
-			if (!process_fpemu_return(sig, fault_addr))
+			if (!process_fpemu_return(sig, fault_addr) && !err)
 				mt_ase_fp_affinity();
 		}
 
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 4b37961..22f7b11 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -859,20 +859,20 @@ static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
  * In the Linux kernel, we support selection of FPR format on the
  * basis of the Status.FR bit.	If an FPU is not present, the FR bit
  * is hardwired to zero, which would imply a 32-bit FPU even for
- * 64-bit CPUs so we rather look at TIF_32BIT_REGS.
+ * 64-bit CPUs so we rather look at TIF_32BIT_FPREGS.
  * FPU emu is slow and bulky and optimizing this function offers fairly
  * sizeable benefits so we try to be clever and make this function return
  * a constant whenever possible, that is on 64-bit kernels without O32
- * compatibility enabled and on 32-bit kernels.
+ * compatibility enabled and on 32-bit without 64-bit FPU support.
  */
 static inline int cop1_64bit(struct pt_regs *xcp)
 {
 #if defined(CONFIG_64BIT) && !defined(CONFIG_MIPS32_O32)
 	return 1;
-#elif defined(CONFIG_64BIT) && defined(CONFIG_MIPS32_O32)
-	return !test_thread_flag(TIF_32BIT_REGS);
-#else
+#elif defined(CONFIG_32BIT) && !defined(CONFIG_MIPS_O32_FP64_SUPPORT)
 	return 0;
+#else
+	return !test_thread_flag(TIF_32BIT_FPREGS);
 #endif
 }
 
diff --git a/arch/mips/math-emu/kernel_linkage.c b/arch/mips/math-emu/kernel_linkage.c
index 1c58657..3aeae07 100644
--- a/arch/mips/math-emu/kernel_linkage.c
+++ b/arch/mips/math-emu/kernel_linkage.c
@@ -89,8 +89,9 @@ int fpu_emulator_save_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __put_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
@@ -103,8 +104,9 @@ int fpu_emulator_restore_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __get_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
-- 
1.8.4.1
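
For reference, the register pairing handled by the ptrace hunk above can
be sketched in plain user-space C. This is only an illustration, not the
kernel code: poke_fpr() and its parameters are hypothetical names, the
fr32 argument stands in for test_thread_flag(TIF_32BIT_FPREGS), and data
is assumed to be 32 bits wide as it would be on a 32-bit kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the ptrace FP register write.
 * fregs models the 32 saved 64-bit FP register slots, idx is the
 * register number (addr - FPR_BASE in the patch) and fr32 models
 * TIF_32BIT_FPREGS being set for the thread.
 */
static void poke_fpr(uint64_t fregs[32], unsigned int idx, uint32_t data,
		     int fr32)
{
	assert(idx < 32);

	if (!fr32) {
		/* 64-bit FP registers: each register owns a whole slot. */
		fregs[idx] = data;
		return;
	}

	if (idx & 1) {
		/* Odd register: upper 32 bits of the even neighbour. */
		fregs[idx & ~1u] &= 0xffffffffULL;
		fregs[idx & ~1u] |= (uint64_t)data << 32;
	} else {
		/* Even register: replace only the lower 32 bits. */
		fregs[idx] &= ~0xffffffffULL;
		fregs[idx] |= data;
	}
}
```

With fr32 set, writing $f2 and then $f3 fills the low and high halves of
slot 2 and leaves slot 3 untouched; with fr32 clear, each write lands in
its own slot.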

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2013-11-21 16:48           ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2013-11-21 16:48 UTC (permalink / raw)
  To: linux-mips; +Cc: ddaney.cavm, Paul Burton, ralf@linux-mips.org

Hmm, I believe there may still be an issue with this patch. If the
instruction in the branch delay slot being "emulated" traps to the
kernel, and the kernel does a force_sig, then that signal won't get
processed because signals are being temporarily ignored. So I think
we'd go back off to userland & execute the same instruction from the
branch delay slot again, trap again, force_sig again, go back to
userland, and so on indefinitely. I need to think about this...
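
A minimal user-space model of the loop described above, assuming the
work_notifysig change from this patch behaves as described in its
commit message. All names here (struct task, return_to_user,
run_faulting_delay_slot) are hypothetical and exist only for the
sketch; the max_traps cap is there so the model terminates.

```c
#include <assert.h>
#include <stdbool.h>

struct task {
	bool fp_bd_emu;		/* models TIF_FP_BD_EMU */
	bool sigpending;	/* models TIF_SIGPENDING */
	bool alive;
};

/* Models the patched work_notifysig path on return to userland. */
static bool return_to_user(struct task *t)
{
	if (t->sigpending && !t->fp_bd_emu) {
		t->sigpending = false;
		t->alive = false;	/* the forced signal is fatal */
		return true;		/* signal was delivered */
	}
	return false;			/* signal skipped during emulation */
}

/*
 * Models a delay slot instruction that faults every time it runs.
 * Returns the number of traps taken, capped at max_traps.
 */
static unsigned int run_faulting_delay_slot(struct task *t,
					    unsigned int max_traps)
{
	unsigned int traps = 0;

	while (t->alive && traps < max_traps) {
		traps++;		/* instruction traps to the kernel */
		t->sigpending = true;	/* kernel does force_sig() */
		return_to_user(t);	/* back to the same instruction */
	}
	return traps;
}
```

With fp_bd_emu clear the task takes one trap and dies; with it set the
trap count only stops at the cap, which is the livelock in question.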

In the meantime, Ralf: if you get to merging this series please drop
this patch & 6/6 (the stack exec change) for the time being. The "Some
(mostly FP) cleanups" series I submitted will still apply after only
the first 4 patches of this series.

Thanks,
     Paul

On 08/11/13 14:50, Paul Burton wrote:
> If a floating point branch instruction (bc1[ft]l?) is emulated,
> typically because we're running on a core with no FPU, then we need to
> execute the instruction in its branch delay slot too. This is done by
> writing that instruction to memory followed by a trap, as part of an
> "emuframe", and executing it. This avoids the requirement of an emulator
> for the entire MIPS instruction set. Prior to this patch such emuframes
> are written to the user stack and executed from there.
>
> This patch moves FP branch delay emuframes off of the user stack and
> into a per-mm page. Allocating a page per-mm leaves userland with access
> to only what it had access to previously, and prevents processes
> interfering with each other as they might if a single system-wide page
> were used. The book-keeping required to track the allocation of
> emuframes is not cheap, but given that invoking the FP emulator is
> already very expensive I don't expect this to be an issue.
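
The book-keeping described in the paragraph above amounts to a one bit
per frame allocation map. A rough user-space sketch follows, assuming a
4096-byte page and 8-byte frames; frame_alloc(), frame_free() and
struct emupage are hypothetical names, and the mutex and wait queue
used by the patch are left out entirely.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define FRAME_SIZE	8u	/* two 32-bit instructions per frame */
#define PAGE_BYTES	4096u	/* assumed page size */
#define FRAME_COUNT	(PAGE_BYTES / FRAME_SIZE)
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS	((FRAME_COUNT + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct emupage {
	unsigned long allocmap[MAP_WORDS];	/* one bit per frame */
};

/* Returns the index of a newly allocated frame, or -1 if none is free. */
static int frame_alloc(struct emupage *p)
{
	for (size_t i = 0; i < FRAME_COUNT; i++) {
		unsigned long mask = 1UL << (i % BITS_PER_WORD);
		unsigned long *word = &p->allocmap[i / BITS_PER_WORD];

		if (!(*word & mask)) {
			*word |= mask;
			return (int)i;
		}
	}
	return -1;
}

/* Marks frame idx as free again. */
static void frame_free(struct emupage *p, int idx)
{
	assert(idx >= 0 && idx < (int)FRAME_COUNT);
	p->allocmap[(size_t)idx / BITS_PER_WORD] &=
		~(1UL << ((size_t)idx % BITS_PER_WORD));
}
```

In the patch itself a thread that finds the map full sleeps on a wait
queue until another thread frees a frame; the sketch simply returns -1.
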
>
> The biggest issue with executing the instruction from an FP branch delay
> is that we must ensure that we free the frame from which we ran it. That
> means that we must trap back to the kernel after executing that
> instruction, which means that we must take special care not to let the
> PC be changed as a result of that instruction. Fortunately since we're
> executing an instruction we found in a branch delay the result is
> unpredictable if that instruction is a branch or jump, so we can simply
> treat those as NOPs and avoid them causing a problem. However there is
> still the possibility that a signal may be handled whilst executing the
> branch delay instruction. This would usually be fine as we would simply
> execute our trap back to the kernel after sigreturn, however it is
> possible for userland to simply not return from the signal handler - for
> example if it executes something like a longjmp. In that case we would
> never trap back to the kernel and never free the frame. For that reason
> a TIF_FP_BD_EMU flag is introduced and set whilst we are executing an FP
> branch delay instruction. Whilst this flag is set, signals will be
> ignored. This isn't exactly pretty, but it's simpler than most of the
> alternatives. One other simple option I considered would be to just
> kill a process if we find a branch in an FP branch delay slot, but I
> chose the current approach because its result is closer to what would
> previously happen.
>
> The primary benefit of this patch is that we are now free to mark the
> user stack non-executable where that is possible.
>
> Additionally the FP emuframes themselves are simplified somewhat. The
> cookie field is removed since we can be pretty certain that we're
> looking at an emuframe by virtue of it being located in the page
> allocated for them. The PC to continue from is moved into struct
> thread_struct since the control flow of a thread can no longer be
> modified for the duration of the 'emulation', meaning there will now
> only ever be a single emuframe required for a thread at any given time.
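
With the cookie gone, recognising an emuframe trap comes down to the
address check made in do_dsemulret. A stand-alone sketch of that check,
assuming a 4096-byte page, 4-byte instructions and 8-byte frames;
epc_is_emuframe_break() is a hypothetical name, and epc is taken to
have its ISA mode bit already masked off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES	4096u	/* assumed page size */
#define INSN_BYTES	4u	/* size of one instruction */
#define FRAME_BYTES	8u	/* emul + badinst */

/*
 * epc:  address of the trapping break instruction (badinst)
 * page: base address of the per-mm emuframe page
 */
static bool epc_is_emuframe_break(uintptr_t epc, uintptr_t page)
{
	/* The frame starts one instruction before the break. */
	uintptr_t fr_addr = epc - INSN_BYTES;

	if (fr_addr < page)
		return false;	/* below the page */
	if (fr_addr > page + PAGE_BYTES - FRAME_BYTES)
		return false;	/* past the last frame */
	if (fr_addr & (FRAME_BYTES - 1))
		return false;	/* not on a frame boundary */
	return true;
}
```

A break at page + 4 or page + 12 passes, since those are the badinst
slots of frames 0 and 1; a break at page + 8 fails the alignment test.
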
>
> Signed-off-by: Paul Burton <paul.burton@imgtec.com>
> ---
> Changes in v2:
>    - s/kernels/kernel's/
>    - Use (mm_)isBranchInstr in mips_dsemul rather than duplicating
>      similar logic.
> ---
>   arch/mips/include/asm/fpu_emulator.h |   4 +
>   arch/mips/include/asm/mmu.h          |  12 ++
>   arch/mips/include/asm/mmu_context.h  |   7 +
>   arch/mips/include/asm/processor.h    |   7 +-
>   arch/mips/include/asm/thread_info.h  |   2 +
>   arch/mips/kernel/entry.S             |  13 +-
>   arch/mips/kernel/process.c           |   2 +
>   arch/mips/kernel/vdso.c              |   2 +-
>   arch/mips/math-emu/cp1emu.c          |   4 +-
>   arch/mips/math-emu/dsemul.c          | 266 ++++++++++++++++++++++++-----------
>   10 files changed, 226 insertions(+), 93 deletions(-)
>
> diff --git a/arch/mips/include/asm/fpu_emulator.h b/arch/mips/include/asm/fpu_emulator.h
> index 2abb587..16f7b0b 100644
> --- a/arch/mips/include/asm/fpu_emulator.h
> +++ b/arch/mips/include/asm/fpu_emulator.h
> @@ -51,6 +51,8 @@ do {									\
>   #define MIPS_FPU_EMU_INC_STATS(M) do { } while (0)
>   #endif /* CONFIG_DEBUG_FS */
>
> +extern void dsemul_thread_cleanup(void);
> +extern void dsemul_mm_cleanup(struct mm_struct *mm);
>   extern int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
>   	unsigned long cpc);
>   extern int do_dsemulret(struct pt_regs *xcp);
> @@ -58,6 +60,8 @@ extern int fpu_emulator_cop1Handler(struct pt_regs *xcp,
>   				    struct mips_fpu_struct *ctx, int has_fpu,
>   				    void *__user *fault_addr);
>   int process_fpemu_return(int sig, void __user *fault_addr);
> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> +		  unsigned long *contpc);
>   int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
>   		     unsigned long *contpc);
>
> diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
> index c436138..08214da 100644
> --- a/arch/mips/include/asm/mmu.h
> +++ b/arch/mips/include/asm/mmu.h
> @@ -1,9 +1,21 @@
>   #ifndef __ASM_MMU_H
>   #define __ASM_MMU_H
>
> +#include <linux/mutex.h>
> +#include <linux/wait.h>
> +
>   typedef struct {
>   	unsigned long asid[NR_CPUS];
>   	void *vdso;
> +
> +	/* address of page used to hold FP branch delay emulation frames */
> +	unsigned long fp_bd_emupage;
> +	/* bitmap tracking allocation of fp_bd_emupage */
> +	unsigned long *fp_bd_emupage_allocmap;
> +	/* mutex to be held whilst modifying fp_bd_emupage(_allocmap) */
> +	struct mutex fp_bd_emupage_mutex;
> +	/* wait queue for threads requiring an emuframe */
> +	wait_queue_head_t fp_bd_emupage_queue;
>   } mm_context_t;
>
>   #endif /* __ASM_MMU_H */
> diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
> index e277bba..c55e864 100644
> --- a/arch/mips/include/asm/mmu_context.h
> +++ b/arch/mips/include/asm/mmu_context.h
> @@ -16,6 +16,7 @@
>   #include <linux/smp.h>
>   #include <linux/slab.h>
>   #include <asm/cacheflush.h>
> +#include <asm/fpu_emulator.h>
>   #include <asm/hazards.h>
>   #include <asm/tlbflush.h>
>   #ifdef CONFIG_MIPS_MT_SMTC
> @@ -133,6 +134,11 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>   	for_each_possible_cpu(i)
>   		cpu_context(i, mm) = 0;
>
> +	mm->context.fp_bd_emupage = 0;
> +	mm->context.fp_bd_emupage_allocmap = NULL;
> +	mutex_init(&mm->context.fp_bd_emupage_mutex);
> +	init_waitqueue_head(&mm->context.fp_bd_emupage_queue);
> +
>   	return 0;
>   }
>
> @@ -199,6 +205,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>    */
>   static inline void destroy_context(struct mm_struct *mm)
>   {
> +	dsemul_mm_cleanup(mm);
>   }
>
>   #define deactivate_mm(tsk, mm)	do { } while (0)
> diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
> index 3605b84..683a3d6 100644
> --- a/arch/mips/include/asm/processor.h
> +++ b/arch/mips/include/asm/processor.h
> @@ -38,9 +38,10 @@ extern unsigned int vced_count, vcei_count;
>
>   /*
>    * A special page (the vdso) is mapped into all processes at the very
> - * top of the virtual memory space.
> + * top of the virtual memory space. The page below it is used for FP
> + * emulator branch delay slot executions.
>    */
> -#define SPECIAL_PAGES_SIZE PAGE_SIZE
> +#define SPECIAL_PAGES_SIZE (PAGE_SIZE * 2)
>
>   #ifdef CONFIG_32BIT
>   #ifdef CONFIG_KVM_GUEST
> @@ -226,6 +227,8 @@ struct thread_struct {
>
>   	/* Saved fpu/fpu emulator stuff. */
>   	struct mips_fpu_struct fpu;
> +	/* PC to continue from following an FP branch delay 'emulation' */
> +	unsigned long fp_bd_emu_cpc;
>   #ifdef CONFIG_MIPS_MT_FPAFF
>   	/* Emulated instruction count */
>   	unsigned long emulated_fp;
> diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
> index b6da8b7..eee6e18 100644
> --- a/arch/mips/include/asm/thread_info.h
> +++ b/arch/mips/include/asm/thread_info.h
> @@ -118,6 +118,7 @@ static inline struct thread_info *current_thread_info(void)
>   #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
>   #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
>   #define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
> +#define TIF_FP_BD_EMU		28	/* executing an FP branch delay */
>   #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
>
>   #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
> @@ -135,6 +136,7 @@ static inline struct thread_info *current_thread_info(void)
>   #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
>   #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
>   #define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
> +#define _TIF_FP_BD_EMU		(1<<TIF_FP_BD_EMU)
>   #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
>
>   #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
> diff --git a/arch/mips/kernel/entry.S b/arch/mips/kernel/entry.S
> index e578685..24707d7 100644
> --- a/arch/mips/kernel/entry.S
> +++ b/arch/mips/kernel/entry.S
> @@ -168,10 +168,15 @@ work_resched:
>   	andi	t0, a2, _TIF_NEED_RESCHED
>   	bnez	t0, work_resched
>
> -work_notifysig:				# deal with pending signals and
> -					# notify-resume requests
> -	move	a0, sp
> -	li	a1, 0
> +work_notifysig:
> +	and	t0, a2, _TIF_FP_BD_EMU	# are we currently 'emulating' the
> +					# delay slot of an FP branch?
> +	beqz	t0, 1f			# no, continue below
> +	and	a2, a2, ~_TIF_SIGPENDING	# yes, skip handling signals
> +	beqz	a2, restore_all		# which leaves us nothing to do
> +
> +1:	move	a0, sp			# deal with pending signals and
> +	li	a1, 0			# notify-resume requests
>   	jal	do_notify_resume	# a2 already loaded
>   	j	resume_userspace_check
>
> diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
> index 747a6cf..0219502 100644
> --- a/arch/mips/kernel/process.c
> +++ b/arch/mips/kernel/process.c
> @@ -32,6 +32,7 @@
>   #include <asm/cpu.h>
>   #include <asm/dsp.h>
>   #include <asm/fpu.h>
> +#include <asm/fpu_emulator.h>
>   #include <asm/pgtable.h>
>   #include <asm/mipsregs.h>
>   #include <asm/processor.h>
> @@ -72,6 +73,7 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
>
>   void exit_thread(void)
>   {
> +	dsemul_thread_cleanup();
>   }
>
>   void flush_thread(void)
> diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
> index 0f1af58..213d871 100644
> --- a/arch/mips/kernel/vdso.c
> +++ b/arch/mips/kernel/vdso.c
> @@ -78,7 +78,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>
>   	down_write(&mm->mmap_sem);
>
> -	addr = vdso_addr(mm->start_stack);
> +	addr = vdso_addr(mm->start_stack) + PAGE_SIZE;
>
>   	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
>   	if (IS_ERR_VALUE(addr)) {
> diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
> index 22f7b11..a0566c8 100644
> --- a/arch/mips/math-emu/cp1emu.c
> +++ b/arch/mips/math-emu/cp1emu.c
> @@ -665,8 +665,8 @@ int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
>    * a single subroutine should be used across both
>    * modules.
>    */
> -static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> -			 unsigned long *contpc)
> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> +		  unsigned long *contpc)
>   {
>   	union mips_instruction insn = (union mips_instruction)dec_insn.insn;
>   	unsigned int fcr31;
> diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
> index 7ea622a..3e64b17 100644
> --- a/arch/mips/math-emu/dsemul.c
> +++ b/arch/mips/math-emu/dsemul.c
> @@ -1,6 +1,8 @@
>   #include <linux/compiler.h>
> +#include <linux/err.h>
>   #include <linux/mm.h>
>   #include <linux/signal.h>
> +#include <linux/slab.h>
>   #include <linux/smp.h>
>
>   #include <asm/asm.h>
> @@ -45,52 +47,173 @@
>   struct emuframe {
>   	mips_instruction	emul;
>   	mips_instruction	badinst;
> -	mips_instruction	cookie;
> -	unsigned long		epc;
>   };
>
> +static const int emupage_frame_count = PAGE_SIZE / sizeof(struct emuframe);
> +
> +static struct emuframe __user *alloc_emuframe(void)
> +{
> +	mm_context_t *mm_ctx = &current->mm->context;
> +	struct emuframe __user *fr = NULL;
> +	unsigned long addr;
> +	int idx;
> +
> +retry:
> +	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
> +
> +	/* Ensure we have a page allocated for emuframes */
> +	if (!mm_ctx->fp_bd_emupage) {
> +		addr = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
> +				   VM_READ|VM_WRITE|VM_EXEC|
> +				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> +				   0);
> +		if (IS_ERR_VALUE(addr))
> +			goto out_unlock;
> +
> +		mm_ctx->fp_bd_emupage = addr;
> +		pr_debug("allocate emupage at 0x%08lx to %d\n", addr,
> +			 current->pid);
> +	}
> +
> +	/* Ensure we have an allocation bitmap */
> +	if (!mm_ctx->fp_bd_emupage_allocmap) {
> +		mm_ctx->fp_bd_emupage_allocmap =
> +			kcalloc(BITS_TO_LONGS(emupage_frame_count),
> +					      sizeof(unsigned long),
> +				GFP_KERNEL);
> +
> +		if (!mm_ctx->fp_bd_emupage_allocmap)
> +			goto out_unlock;
> +	}
> +
> +	/* Attempt to allocate a single bit/frame */
> +	idx = bitmap_find_free_region(mm_ctx->fp_bd_emupage_allocmap,
> +				      emupage_frame_count, 0);
> +	if (idx < 0) {
> +		/*
> +		 * Failed to allocate a frame. We'll wait until one becomes
> +		 * available. The mutex is unlocked so that other threads
> +		 * actually get the opportunity to free their frames, which
> +		 * means technically the result of bitmap_full may be incorrect.
> +		 * However the worst case is that we repeat all this and end up
> +		 * back here again.
> +		 */
> +		mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> +		if (!wait_event_killable(mm_ctx->fp_bd_emupage_queue,
> +			!bitmap_full(mm_ctx->fp_bd_emupage_allocmap,
> +				     emupage_frame_count)))
> +			goto retry;
> +
> +		/* Received a fatal signal - just give in */
> +		return NULL;
> +	}
> +
> +	/* Success! */
> +	fr = (struct emuframe __user *)mm_ctx->fp_bd_emupage + idx;
> +	pr_debug("allocate emuframe %d to %d\n", idx, current->pid);
> +out_unlock:
> +	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> +	return fr;
> +}
> +
> +static void free_emuframe(struct emuframe __user *frame)
> +{
> +	mm_context_t *mm_ctx = &current->mm->context;
> +	int idx;
> +
> +	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
> +
> +	idx = frame - (struct emuframe __user *)mm_ctx->fp_bd_emupage;
> +	pr_debug("free emuframe %d from %d\n", idx, current->pid);
> +	bitmap_clear(mm_ctx->fp_bd_emupage_allocmap, idx, 1);
> +
> +	/* If some thread is waiting for a frame, now's its chance */
> +	wake_up(&mm_ctx->fp_bd_emupage_queue);
> +
> +	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> +}
> +
> +void dsemul_thread_cleanup(void)
> +{
> +	/*
> +	 * We should always have passed through do_dsemulret prior to the
> +	 * thread exiting, so TIF_FP_BD_EMU should never be set here.
> +	 */
> +	BUG_ON(test_thread_flag(TIF_FP_BD_EMU));
> +}
> +
> +void dsemul_mm_cleanup(struct mm_struct *mm)
> +{
> +	mm_context_t *mm_ctx = &mm->context;
> +
> +	kfree(mm_ctx->fp_bd_emupage_allocmap);
> +}
> +
>   int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>   {
> -	extern asmlinkage void handle_dsemulret(void);
> +	struct mm_decoded_insn mm_inst = { .insn = ir };
>   	struct emuframe __user *fr;
> -	int err;
> +	struct pt_regs dummy_regs;
> +	unsigned long dummy_cpc;
> +	int err, is_mm;
>
> -	if ((get_isa16_mode(regs->cp0_epc) && ((ir >> 16) == MM_NOP16)) ||
> -		(ir == 0)) {
> -		/* NOP is easy */
> +	/*
> +	 * Trivially handle typical NOP encodings:
> +	 *
> +	 *   MIPS32:		sll	r0, r0, 0
> +	 *   microMIPS:		move16	r0, r0
> +	 */
> +	is_mm = get_isa16_mode(regs->cp0_epc);
> +	if ((!is_mm && !ir) || (is_mm && ((ir >> 16) == MM_NOP16))) {
> +is_nop:
>   		regs->cp0_epc = cpc;
>   		regs->cp0_cause &= ~CAUSEF_BD;
>   		return 0;
>   	}
> -#ifdef DSEMUL_TRACE
> -	printk("dsemul %lx %lx\n", regs->cp0_epc, cpc);
> -
> -#endif
>
>   	/*
> -	 * The strategy is to push the instruction onto the user stack
> -	 * and put a trap after it which we can catch and jump to
> -	 * the required address any alternative apart from full
> -	 * instruction emulation!!.
> +	 * In order for us to clean up the emuframe properly, we'll need to
> +	 * execute a break instruction after ir. If ir is a branch then we may
> +	 * never reach that break instruction and thus never free the emuframe.
>   	 *
> -	 * Algorithmics used a system call instruction, and
> -	 * borrowed that vector.  MIPS/Linux version is a bit
> -	 * more heavyweight in the interests of portability and
> -	 * multiprocessor support.  For Linux we generate a
> -	 * an unaligned access and force an address error exception.
> +	 * Fortunately we know that ir is in a branch delay slot and thus if
> +	 * it is a branch then its operation is unpredictable. So we can just
> +	 * treat branches as NOPs and skip the 'emulation' entirely.
>   	 *
> -	 * For embedded systems (stand-alone) we prefer to use a
> -	 * non-existing CP1 instruction. This prevents us from emulating
> -	 * branches, but gives us a cleaner interface to the exception
> -	 * handler (single entry point).
> +	 * If the worst happens and we miss a branch/jump instruction here, or
> +	 * some processor implements a custom one, then it would be possible
> +	 * for us to allocate an emuframe and never free it. Fortunately this
> +	 * would:
> +	 *
> +	 *  1) Be a bug in the userland code, because it has a branch/jump in
> +	 *     a branch delay slot. So if we run out of emuframes and the
> +	 *     userland code hangs it's not exactly the kernel's fault.
> +	 *
> +	 *  2) Only affect that userland process, since emuframes are allocated
> +	 *     per-mm and kernel threads don't use them at all.
>   	 */
> +	if ((!is_mm && isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc)) ||
> +	    (is_mm && mm_isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc))) {
> +		pr_warn("PID %d has a branch in an FP branch delay slot at 0x%08lx\n",
> +			current->pid, regs->cp0_epc);
> +		goto is_nop;
> +	}
>
> -	/* Ensure that the two instructions are in the same cache line */
> -	fr = (struct emuframe __user *)
> -		((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
> +	pr_debug("dsemul 0x%08lx cont at 0x%08lx\n", regs->cp0_epc, cpc);
>
> -	/* Verify that the stack pointer is not competely insane */
> -	if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
> +	/*
> +	 * The strategy is to write the instruction to a per-mm page followed
> +	 * by a trap which we can catch to return to the required address. Any
> +	 * alternative to full instruction emulation!!
> +	 *
> +	 * Algorithmics used a system call instruction, and borrowed that
> +	 * vector.  MIPS/Linux version is a bit more heavyweight in the
> +	 * interests of portability and multiprocessor support.  For Linux we
> +	 * generate a BREAK instruction with a break code reserved for this
> +	 * purpose.
> +	 */
> +	fr = alloc_emuframe();
> +	if (!fr)
>   		return SIGBUS;
>
>   	if (get_isa16_mode(regs->cp0_epc)) {
> @@ -103,17 +226,18 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>   		err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
>   	}
>
> -	err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
> -	err |= __put_user(cpc, &fr->epc);
> -
>   	if (unlikely(err)) {
>   		MIPS_FPU_EMU_INC_STATS(errors);
> +		free_emuframe(fr);
>   		return SIGBUS;
>   	}
>
>   	regs->cp0_epc = ((unsigned long) &fr->emul) |
>   		get_isa16_mode(regs->cp0_epc);
>
> +	current->thread.fp_bd_emu_cpc = cpc;
> +	set_thread_flag(TIF_FP_BD_EMU);
> +
>   	flush_cache_sigtramp((unsigned long)&fr->badinst);
>
>   	return SIGILL;		/* force out of emulation loop */
> @@ -121,64 +245,38 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>
>   int do_dsemulret(struct pt_regs *xcp)
>   {
> -	struct emuframe __user *fr;
> -	unsigned long epc;
> -	u32 insn, cookie;
> -	int err = 0;
> -	u16 instr[2];
> -
> -	fr = (struct emuframe __user *)
> -		(msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction));
> -
> -	/*
> -	 * If we can't even access the area, something is very wrong, but we'll
> -	 * leave that to the default handling
> -	 */
> -	if (!access_ok(VERIFY_READ, fr, sizeof(struct emuframe)))
> -		return 0;
> -
> -	/*
> -	 * Do some sanity checking on the stackframe:
> -	 *
> -	 *  - Is the instruction pointed to by the EPC an BREAK_MATH?
> -	 *  - Is the following memory word the BD_COOKIE?
> -	 */
> -	if (get_isa16_mode(xcp->cp0_epc)) {
> -		err = __get_user(instr[0], (u16 __user *)(&fr->badinst));
> -		err |= __get_user(instr[1], (u16 __user *)((long)(&fr->badinst) + 2));
> -		insn = (instr[0] << 16) | instr[1];
> -	} else {
> -		err = __get_user(insn, &fr->badinst);
> -	}
> -	err |= __get_user(cookie, &fr->cookie);
> +	mm_context_t *mm_ctx = &current->mm->context;
> +	struct emuframe __user *fr = NULL;
> +	unsigned long fr_addr;
> +	int success = 0;
>
> -	if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE))) {
> -		MIPS_FPU_EMU_INC_STATS(errors);
> -		return 0;
> -	}
> +	/* If we don't have TIF_FP_BD_EMU set... */
> +	if (!test_and_clear_thread_flag(TIF_FP_BD_EMU))
> +		goto out;
>
>   	/*
> -	 * At this point, we are satisfied that it's a BD emulation trap.  Yes,
> -	 * a user might have deliberately put two malformed and useless
> -	 * instructions in a row in his program, in which case he's in for a
> -	 * nasty surprise - the next instruction will be treated as a
> -	 * continuation address!  Alas, this seems to be the only way that we
> -	 * can handle signals, recursion, and longjmps() in the context of
> -	 * emulating the branch delay instruction.
> +	 * ...or EPC is outside of the expected page or misaligned then
> +	 * something is wrong. Leave it to the default trap/break code to
> +	 * handle.
>   	 */
> +	fr_addr = msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction);
> +	if ((fr_addr < mm_ctx->fp_bd_emupage) ||
> +	    (fr_addr > (mm_ctx->fp_bd_emupage + PAGE_SIZE - sizeof(*fr))) ||
> +	    (fr_addr & (sizeof(*fr) - 1)))
> +		goto out;
>
> -#ifdef DSEMUL_TRACE
> -	printk("dsemulret\n");
> -#endif
> -	if (__get_user(epc, &fr->epc)) {		/* Saved EPC */
> -		/* This is not a good situation to be in */
> -		force_sig(SIGBUS, current);
> -
> -		return 0;
> -	}
> +	/* At this point, we are satisfied that it's a BD emulation trap. */
> +	fr = (struct emuframe __user *)fr_addr;
>
>   	/* Set EPC to return to post-branch instruction */
> -	xcp->cp0_epc = epc;
> +	xcp->cp0_epc = current->thread.fp_bd_emu_cpc;
> +	success = 1;
>
> -	return 1;
> +	pr_debug("dsemulret to 0x%08lx\n", xcp->cp0_epc);
> +out:
> +	if (fr)
> +		free_emuframe(fr);
> +	if (!success)
> +		MIPS_FPU_EMU_INC_STATS(errors);
> +	return success;
>   }
>


* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2013-11-21 16:48           ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2013-11-21 16:48 UTC (permalink / raw)
  To: linux-mips; +Cc: ddaney.cavm, Paul Burton, ralf@linux-mips.org

Hmm, I believe there may still be an issue with this patch. If the
instruction in the branch delay slot being "emulated" traps to the
kernel, and the kernel calls force_sig, then that signal won't get
processed because signals are ignored while TIF_FP_BD_EMU is set. So I
think we'd return to userland, execute the same delay slot instruction
again, trap again, force_sig again, return to userland, and so on
indefinitely. I need to think about this...
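
The suspected loop can be modelled with a toy user-space program. This
is only a sketch of the reasoning above, not kernel code: struct
toy_thread, toy_run() and all field names are hypothetical stand-ins
for the thread flags and the return-to-user path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one thread; none of these names exist in the kernel. */
struct toy_thread {
	bool fp_bd_emu;		/* stands in for TIF_FP_BD_EMU */
	bool sig_pending;	/* stands in for TIF_SIGPENDING */
	bool delay_slot_traps;	/* the delay slot instruction itself faults */
};

/*
 * Execute the delay slot instruction up to max_iters times.
 * Returns the number of executions needed before the thread makes
 * progress, or -1 if it is still stuck after max_iters attempts.
 */
static int toy_run(struct toy_thread *t, int max_iters)
{
	assert(max_iters > 0);

	for (int executions = 1; executions <= max_iters; executions++) {
		if (!t->delay_slot_traps) {
			/* Instruction completed; the trap after it ends the emulation. */
			t->fp_bd_emu = false;
			return executions;
		}

		/* The instruction trapped, so the kernel forces a signal. */
		t->sig_pending = true;

		/* Return-to-user path: signals are only handled if the flag is clear. */
		if (!t->fp_bd_emu)
			return executions;

		/* Signal skipped: userland re-executes the same instruction. */
	}
	return -1;
}
```

With fp_bd_emu and delay_slot_traps both set, toy_run() never makes
progress and returns -1, which is the behaviour I'm worried about.
With fp_bd_emu clear the forced signal is handled after the first
execution.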

In the meantime, Ralf: if you get to merging this series please drop
this patch & 6/6 (the stack exec change) for the time being. The "Some
(mostly FP) cleanups" series I submitted will still apply after only
the first 4 patches of this series.

Thanks,
     Paul

On 08/11/13 14:50, Paul Burton wrote:
> If a floating point branch instruction (bc1[ft]l?) is emulated,
> typically because we're running on a core with no FPU, then we need to
> execute the instruction in its branch delay slot too. This is done by
> writing that instruction to memory followed by a trap, as part of an
> "emuframe", and executing it. This avoids the requirement of an emulator
> for the entire MIPS instruction set. Prior to this patch such emuframes
> are written to the user stack and executed from there.
>
> This patch moves FP branch delay emuframes off of the user stack and
> into a per-mm page. Allocating a page per-mm leaves userland with access
> to only what it had access to previously, and prevents processes
> interfering with each other as they might if a single system-wide page
> were used. The book-keeping required to track the allocation of
> emuframes is not cheap, but given that invoking the FP emulator is
> already very expensive I don't expect this to be an issue.
>
> The biggest issue with executing the instruction from an FP branch delay
> is that we must ensure that we free the frame from which we ran it. That
> means that we must trap back to the kernel after executing that
> instruction, which means that we must take special care not to let the
> PC be changed as a result of that instruction. Fortunately since we're
> executing an instruction we found in a branch delay the result is
> unpredictable if that instruction is a branch or jump, so we can simply
> treat those as NOPs and avoid them causing a problem. However there is
> still the possibility that a signal may be handled whilst executing the
> branch delay instruction. This would usually be fine as we would simply
> execute our trap back to the kernel after sigreturn, however it is
> possible for userland to simply not return from the signal handler - for
> example if it executes something like a longjmp. In that case we would
> never trap back to the kernel and never free the frame. For that reason
> a TIF_FP_BD_EMU flag is introduced and set whilst we are executing an FP
> branch delay instruction. Whilst this flag is set, signals will be
> ignored. This isn't exactly pretty, but it's simpler than most of the
> alternatives. One other simple option I considered would be to just
> kill a process if we find a branch in an FP branch delay slot, but I
> chose the current approach because its result is closer to what would
> previously happen.
>
> The primary benefit of this patch is that we are now free to mark the
> user stack non-executable where that is possible.
>
> Additionally the FP emuframes themselves are simplified somewhat. The
> cookie field is removed since we can be pretty certain that we're
> looking at an emuframe by virtue of it being located in the page
> allocated for them. The PC to continue from is moved into struct
> thread_struct since the control flow of a thread can no longer be
> modified for the duration of the 'emulation', meaning there will now
> only ever be a single emuframe required for a thread at any given time.
>
> Signed-off-by: Paul Burton <paul.burton@imgtec.com>
> ---
> Changes in v2:
>    - s/kernels/kernel's/
>    - Use (mm_)isBranchInstr in mips_dsemul rather than duplicating
>      similar logic.
> ---
>   arch/mips/include/asm/fpu_emulator.h |   4 +
>   arch/mips/include/asm/mmu.h          |  12 ++
>   arch/mips/include/asm/mmu_context.h  |   7 +
>   arch/mips/include/asm/processor.h    |   7 +-
>   arch/mips/include/asm/thread_info.h  |   2 +
>   arch/mips/kernel/entry.S             |  13 +-
>   arch/mips/kernel/process.c           |   2 +
>   arch/mips/kernel/vdso.c              |   2 +-
>   arch/mips/math-emu/cp1emu.c          |   4 +-
>   arch/mips/math-emu/dsemul.c          | 266 ++++++++++++++++++++++++-----------
>   10 files changed, 226 insertions(+), 93 deletions(-)
>
> diff --git a/arch/mips/include/asm/fpu_emulator.h b/arch/mips/include/asm/fpu_emulator.h
> index 2abb587..16f7b0b 100644
> --- a/arch/mips/include/asm/fpu_emulator.h
> +++ b/arch/mips/include/asm/fpu_emulator.h
> @@ -51,6 +51,8 @@ do {									\
>   #define MIPS_FPU_EMU_INC_STATS(M) do { } while (0)
>   #endif /* CONFIG_DEBUG_FS */
>
> +extern void dsemul_thread_cleanup(void);
> +extern void dsemul_mm_cleanup(struct mm_struct *mm);
>   extern int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
>   	unsigned long cpc);
>   extern int do_dsemulret(struct pt_regs *xcp);
> @@ -58,6 +60,8 @@ extern int fpu_emulator_cop1Handler(struct pt_regs *xcp,
>   				    struct mips_fpu_struct *ctx, int has_fpu,
>   				    void *__user *fault_addr);
>   int process_fpemu_return(int sig, void __user *fault_addr);
> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> +		  unsigned long *contpc);
>   int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
>   		     unsigned long *contpc);
>
> diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
> index c436138..08214da 100644
> --- a/arch/mips/include/asm/mmu.h
> +++ b/arch/mips/include/asm/mmu.h
> @@ -1,9 +1,21 @@
>   #ifndef __ASM_MMU_H
>   #define __ASM_MMU_H
>
> +#include <linux/mutex.h>
> +#include <linux/wait.h>
> +
>   typedef struct {
>   	unsigned long asid[NR_CPUS];
>   	void *vdso;
> +
> +	/* address of page used to hold FP branch delay emulation frames */
> +	unsigned long fp_bd_emupage;
> +	/* bitmap tracking allocation of fp_bd_emupage */
> +	unsigned long *fp_bd_emupage_allocmap;
> +	/* mutex to be held whilst modifying fp_bd_emupage(_allocmap) */
> +	struct mutex fp_bd_emupage_mutex;
> +	/* wait queue for threads requiring an emuframe */
> +	wait_queue_head_t fp_bd_emupage_queue;
>   } mm_context_t;
>
>   #endif /* __ASM_MMU_H */
> diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
> index e277bba..c55e864 100644
> --- a/arch/mips/include/asm/mmu_context.h
> +++ b/arch/mips/include/asm/mmu_context.h
> @@ -16,6 +16,7 @@
>   #include <linux/smp.h>
>   #include <linux/slab.h>
>   #include <asm/cacheflush.h>
> +#include <asm/fpu_emulator.h>
>   #include <asm/hazards.h>
>   #include <asm/tlbflush.h>
>   #ifdef CONFIG_MIPS_MT_SMTC
> @@ -133,6 +134,11 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>   	for_each_possible_cpu(i)
>   		cpu_context(i, mm) = 0;
>
> +	mm->context.fp_bd_emupage = 0;
> +	mm->context.fp_bd_emupage_allocmap = NULL;
> +	mutex_init(&mm->context.fp_bd_emupage_mutex);
> +	init_waitqueue_head(&mm->context.fp_bd_emupage_queue);
> +
>   	return 0;
>   }
>
> @@ -199,6 +205,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>    */
>   static inline void destroy_context(struct mm_struct *mm)
>   {
> +	dsemul_mm_cleanup(mm);
>   }
>
>   #define deactivate_mm(tsk, mm)	do { } while (0)
> diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
> index 3605b84..683a3d6 100644
> --- a/arch/mips/include/asm/processor.h
> +++ b/arch/mips/include/asm/processor.h
> @@ -38,9 +38,10 @@ extern unsigned int vced_count, vcei_count;
>
>   /*
>    * A special page (the vdso) is mapped into all processes at the very
> - * top of the virtual memory space.
> + * top of the virtual memory space. The page below it is used for FP
> + * emulator branch delay slot executions.
>    */
> -#define SPECIAL_PAGES_SIZE PAGE_SIZE
> +#define SPECIAL_PAGES_SIZE (PAGE_SIZE * 2)
>
>   #ifdef CONFIG_32BIT
>   #ifdef CONFIG_KVM_GUEST
> @@ -226,6 +227,8 @@ struct thread_struct {
>
>   	/* Saved fpu/fpu emulator stuff. */
>   	struct mips_fpu_struct fpu;
> +	/* PC to continue from following an FP branch delay 'emulation' */
> +	unsigned long fp_bd_emu_cpc;
>   #ifdef CONFIG_MIPS_MT_FPAFF
>   	/* Emulated instruction count */
>   	unsigned long emulated_fp;
> diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
> index b6da8b7..eee6e18 100644
> --- a/arch/mips/include/asm/thread_info.h
> +++ b/arch/mips/include/asm/thread_info.h
> @@ -118,6 +118,7 @@ static inline struct thread_info *current_thread_info(void)
>   #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
>   #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
>   #define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
> +#define TIF_FP_BD_EMU		28	/* executing an FP branch delay */
>   #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
>
>   #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
> @@ -135,6 +136,7 @@ static inline struct thread_info *current_thread_info(void)
>   #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
>   #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
>   #define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
> +#define _TIF_FP_BD_EMU		(1<<TIF_FP_BD_EMU)
>   #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
>
>   #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
> diff --git a/arch/mips/kernel/entry.S b/arch/mips/kernel/entry.S
> index e578685..24707d7 100644
> --- a/arch/mips/kernel/entry.S
> +++ b/arch/mips/kernel/entry.S
> @@ -168,10 +168,15 @@ work_resched:
>   	andi	t0, a2, _TIF_NEED_RESCHED
>   	bnez	t0, work_resched
>
> -work_notifysig:				# deal with pending signals and
> -					# notify-resume requests
> -	move	a0, sp
> -	li	a1, 0
> +work_notifysig:
> +	and	t0, a2, _TIF_FP_BD_EMU	# are we currently 'emulating' the
> +					# delay slot of an FP branch?
> +	beqz	t0, 1f			# no, continue below
> +	and	a2, a2, ~_TIF_SIGPENDING	# yes, skip handling signals
> +	beqz	a2, restore_all		# which leaves us nothing to do
> +
> +1:	move	a0, sp			# deal with pending signals and
> +	li	a1, 0			# notify-resume requests
>   	jal	do_notify_resume	# a2 already loaded
>   	j	resume_userspace_check
>
> diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
> index 747a6cf..0219502 100644
> --- a/arch/mips/kernel/process.c
> +++ b/arch/mips/kernel/process.c
> @@ -32,6 +32,7 @@
>   #include <asm/cpu.h>
>   #include <asm/dsp.h>
>   #include <asm/fpu.h>
> +#include <asm/fpu_emulator.h>
>   #include <asm/pgtable.h>
>   #include <asm/mipsregs.h>
>   #include <asm/processor.h>
> @@ -72,6 +73,7 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
>
>   void exit_thread(void)
>   {
> +	dsemul_thread_cleanup();
>   }
>
>   void flush_thread(void)
> diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
> index 0f1af58..213d871 100644
> --- a/arch/mips/kernel/vdso.c
> +++ b/arch/mips/kernel/vdso.c
> @@ -78,7 +78,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>
>   	down_write(&mm->mmap_sem);
>
> -	addr = vdso_addr(mm->start_stack);
> +	addr = vdso_addr(mm->start_stack) + PAGE_SIZE;
>
>   	addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
>   	if (IS_ERR_VALUE(addr)) {
> diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
> index 22f7b11..a0566c8 100644
> --- a/arch/mips/math-emu/cp1emu.c
> +++ b/arch/mips/math-emu/cp1emu.c
> @@ -665,8 +665,8 @@ int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
>    * a single subroutine should be used across both
>    * modules.
>    */
> -static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> -			 unsigned long *contpc)
> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> +		  unsigned long *contpc)
>   {
>   	union mips_instruction insn = (union mips_instruction)dec_insn.insn;
>   	unsigned int fcr31;
> diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
> index 7ea622a..3e64b17 100644
> --- a/arch/mips/math-emu/dsemul.c
> +++ b/arch/mips/math-emu/dsemul.c
> @@ -1,6 +1,8 @@
>   #include <linux/compiler.h>
> +#include <linux/err.h>
>   #include <linux/mm.h>
>   #include <linux/signal.h>
> +#include <linux/slab.h>
>   #include <linux/smp.h>
>
>   #include <asm/asm.h>
> @@ -45,52 +47,173 @@
>   struct emuframe {
>   	mips_instruction	emul;
>   	mips_instruction	badinst;
> -	mips_instruction	cookie;
> -	unsigned long		epc;
>   };
>
> +static const int emupage_frame_count = PAGE_SIZE / sizeof(struct emuframe);
> +
> +static struct emuframe __user *alloc_emuframe(void)
> +{
> +	mm_context_t *mm_ctx = &current->mm->context;
> +	struct emuframe __user *fr = NULL;
> +	unsigned long addr;
> +	int idx;
> +
> +retry:
> +	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
> +
> +	/* Ensure we have a page allocated for emuframes */
> +	if (!mm_ctx->fp_bd_emupage) {
> +		addr = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
> +				   VM_READ|VM_WRITE|VM_EXEC|
> +				   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> +				   0);
> +		if (IS_ERR_VALUE(addr))
> +			goto out_unlock;
> +
> +		mm_ctx->fp_bd_emupage = addr;
> +		pr_debug("allocate emupage at 0x%08lx to %d\n", addr,
> +			 current->pid);
> +	}
> +
> +	/* Ensure we have an allocation bitmap */
> +	if (!mm_ctx->fp_bd_emupage_allocmap) {
> +		mm_ctx->fp_bd_emupage_allocmap =
> +			kcalloc(BITS_TO_LONGS(emupage_frame_count),
> +					      sizeof(unsigned long),
> +				GFP_KERNEL);
> +
> +		if (!mm_ctx->fp_bd_emupage_allocmap)
> +			goto out_unlock;
> +	}
> +
> +	/* Attempt to allocate a single bit/frame */
> +	idx = bitmap_find_free_region(mm_ctx->fp_bd_emupage_allocmap,
> +				      emupage_frame_count, 0);
> +	if (idx < 0) {
> +		/*
> +		 * Failed to allocate a frame. We'll wait until one becomes
> +		 * available. The mutex is unlocked so that other threads
> +		 * actually get the opportunity to free their frames, which
> +		 * means technically the result of bitmap_full may be incorrect.
> +		 * However the worst case is that we repeat all this and end up
> +		 * back here again.
> +		 */
> +		mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> +		if (!wait_event_killable(mm_ctx->fp_bd_emupage_queue,
> +			!bitmap_full(mm_ctx->fp_bd_emupage_allocmap,
> +				     emupage_frame_count)))
> +			goto retry;
> +
> +		/* Received a fatal signal - just give in */
> +		return NULL;
> +	}
> +
> +	/* Success! */
> +	fr = (struct emuframe __user *)mm_ctx->fp_bd_emupage + idx;
> +	pr_debug("allocate emuframe %d to %d\n", idx, current->pid);
> +out_unlock:
> +	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> +	return fr;
> +}
> +
> +static void free_emuframe(struct emuframe __user *frame)
> +{
> +	mm_context_t *mm_ctx = &current->mm->context;
> +	int idx;
> +
> +	mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
> +
> +	idx = frame - (struct emuframe __user *)mm_ctx->fp_bd_emupage;
> +	pr_debug("free emuframe %d from %d\n", idx, current->pid);
> +	bitmap_clear(mm_ctx->fp_bd_emupage_allocmap, idx, 1);
> +
> +	/* If some thread is waiting for a frame, now's its chance */
> +	wake_up(&mm_ctx->fp_bd_emupage_queue);
> +
> +	mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> +}
> +
> +void dsemul_thread_cleanup(void)
> +{
> +	/*
> +	 * We should always have passed through do_dsemulret prior to the
> +	 * thread exiting, so TIF_FP_BD_EMU should never be set here.
> +	 */
> +	BUG_ON(test_thread_flag(TIF_FP_BD_EMU));
> +}
> +
> +void dsemul_mm_cleanup(struct mm_struct *mm)
> +{
> +	mm_context_t *mm_ctx = &mm->context;
> +
> +	kfree(mm_ctx->fp_bd_emupage_allocmap);
> +}
> +
>   int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>   {
> -	extern asmlinkage void handle_dsemulret(void);
> +	struct mm_decoded_insn mm_inst = { .insn = ir };
>   	struct emuframe __user *fr;
> -	int err;
> +	struct pt_regs dummy_regs;
> +	unsigned long dummy_cpc;
> +	int err, is_mm;
>
> -	if ((get_isa16_mode(regs->cp0_epc) && ((ir >> 16) == MM_NOP16)) ||
> -		(ir == 0)) {
> -		/* NOP is easy */
> +	/*
> +	 * Trivially handle typical NOP encodings:
> +	 *
> +	 *   MIPS32:		sll	r0, r0, 0
> +	 *   microMIPS:		move16	r0, r0
> +	 */
> +	is_mm = get_isa16_mode(regs->cp0_epc);
> +	if ((!is_mm && !ir) || (is_mm && ((ir >> 16) == MM_NOP16))) {
> +is_nop:
>   		regs->cp0_epc = cpc;
>   		regs->cp0_cause &= ~CAUSEF_BD;
>   		return 0;
>   	}
> -#ifdef DSEMUL_TRACE
> -	printk("dsemul %lx %lx\n", regs->cp0_epc, cpc);
> -
> -#endif
>
>   	/*
> -	 * The strategy is to push the instruction onto the user stack
> -	 * and put a trap after it which we can catch and jump to
> -	 * the required address any alternative apart from full
> -	 * instruction emulation!!.
> +	 * In order for us to clean up the emuframe properly, we'll need to
> +	 * execute a break instruction after ir. If ir is a branch then we may
> +	 * never reach that break instruction and thus never free the emuframe.
>   	 *
> -	 * Algorithmics used a system call instruction, and
> -	 * borrowed that vector.  MIPS/Linux version is a bit
> -	 * more heavyweight in the interests of portability and
> -	 * multiprocessor support.  For Linux we generate a
> -	 * an unaligned access and force an address error exception.
> +	 * Fortunately we know that ir is in a branch delay slot and thus if
> +	 * it is a branch then its operation is unpredictable. So we can just
> +	 * treat branches as NOPs and skip the 'emulation' entirely.
>   	 *
> -	 * For embedded systems (stand-alone) we prefer to use a
> -	 * non-existing CP1 instruction. This prevents us from emulating
> -	 * branches, but gives us a cleaner interface to the exception
> -	 * handler (single entry point).
> +	 * If the worst happens and we miss a branch/jump instruction here, or
> +	 * some processor implements a custom one, then it would be possible
> +	 * for us to allocate an emuframe and never free it. Fortunately this
> +	 * would:
> +	 *
> +	 *  1) Be a bug in the userland code, because it has a branch/jump in
> +	 *     a branch delay slot. So if we run out of emuframes and the
> +	 *     userland code hangs it's not exactly the kernel's fault.
> +	 *
> +	 *  2) Only affect that userland process, since emuframes are allocated
> +	 *     per-mm and kernel threads don't use them at all.
>   	 */
> +	if ((!is_mm && isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc)) ||
> +	    (is_mm && mm_isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc))) {
> +		pr_warn("PID %d has a branch in an FP branch delay slot at 0x%08lx\n",
> +			current->pid, regs->cp0_epc);
> +		goto is_nop;
> +	}
>
> -	/* Ensure that the two instructions are in the same cache line */
> -	fr = (struct emuframe __user *)
> -		((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
> +	pr_debug("dsemul 0x%08lx cont at 0x%08lx\n", regs->cp0_epc, cpc);
>
> -	/* Verify that the stack pointer is not competely insane */
> -	if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
> +	/*
> +	 * The strategy is to write the instruction to a per-mm page followed
> +	 * by a trap which we can catch to return to the required address. Any
> +	 * alternative to full instruction emulation!!
> +	 *
> +	 * Algorithmics used a system call instruction, and borrowed that
> +	 * vector.  MIPS/Linux version is a bit more heavyweight in the
> +	 * interests of portability and multiprocessor support.  For Linux we
> +	 * generate a BREAK instruction with a break code reserved for this
> +	 * purpose.
> +	 */
> +	fr = alloc_emuframe();
> +	if (!fr)
>   		return SIGBUS;
>
>   	if (get_isa16_mode(regs->cp0_epc)) {
> @@ -103,17 +226,18 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>   		err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
>   	}
>
> -	err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
> -	err |= __put_user(cpc, &fr->epc);
> -
>   	if (unlikely(err)) {
>   		MIPS_FPU_EMU_INC_STATS(errors);
> +		free_emuframe(fr);
>   		return SIGBUS;
>   	}
>
>   	regs->cp0_epc = ((unsigned long) &fr->emul) |
>   		get_isa16_mode(regs->cp0_epc);
>
> +	current->thread.fp_bd_emu_cpc = cpc;
> +	set_thread_flag(TIF_FP_BD_EMU);
> +
>   	flush_cache_sigtramp((unsigned long)&fr->badinst);
>
>   	return SIGILL;		/* force out of emulation loop */
> @@ -121,64 +245,38 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>
>   int do_dsemulret(struct pt_regs *xcp)
>   {
> -	struct emuframe __user *fr;
> -	unsigned long epc;
> -	u32 insn, cookie;
> -	int err = 0;
> -	u16 instr[2];
> -
> -	fr = (struct emuframe __user *)
> -		(msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction));
> -
> -	/*
> -	 * If we can't even access the area, something is very wrong, but we'll
> -	 * leave that to the default handling
> -	 */
> -	if (!access_ok(VERIFY_READ, fr, sizeof(struct emuframe)))
> -		return 0;
> -
> -	/*
> -	 * Do some sanity checking on the stackframe:
> -	 *
> -	 *  - Is the instruction pointed to by the EPC an BREAK_MATH?
> -	 *  - Is the following memory word the BD_COOKIE?
> -	 */
> -	if (get_isa16_mode(xcp->cp0_epc)) {
> -		err = __get_user(instr[0], (u16 __user *)(&fr->badinst));
> -		err |= __get_user(instr[1], (u16 __user *)((long)(&fr->badinst) + 2));
> -		insn = (instr[0] << 16) | instr[1];
> -	} else {
> -		err = __get_user(insn, &fr->badinst);
> -	}
> -	err |= __get_user(cookie, &fr->cookie);
> +	mm_context_t *mm_ctx = &current->mm->context;
> +	struct emuframe __user *fr = NULL;
> +	unsigned long fr_addr;
> +	int success = 0;
>
> -	if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE))) {
> -		MIPS_FPU_EMU_INC_STATS(errors);
> -		return 0;
> -	}
> +	/* If we don't have TIF_FP_BD_EMU set... */
> +	if (!test_and_clear_thread_flag(TIF_FP_BD_EMU))
> +		goto out;
>
>   	/*
> -	 * At this point, we are satisfied that it's a BD emulation trap.  Yes,
> -	 * a user might have deliberately put two malformed and useless
> -	 * instructions in a row in his program, in which case he's in for a
> -	 * nasty surprise - the next instruction will be treated as a
> -	 * continuation address!  Alas, this seems to be the only way that we
> -	 * can handle signals, recursion, and longjmps() in the context of
> -	 * emulating the branch delay instruction.
> +	 * ...or EPC is outside of the expected page or misaligned then
> +	 * something is wrong. Leave it to the default trap/break code to
> +	 * handle.
>   	 */
> +	fr_addr = msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction);
> +	if ((fr_addr < mm_ctx->fp_bd_emupage) ||
> +	    (fr_addr > (mm_ctx->fp_bd_emupage + PAGE_SIZE - sizeof(*fr))) ||
> +	    (fr_addr & (sizeof(*fr) - 1)))
> +		goto out;
>
> -#ifdef DSEMUL_TRACE
> -	printk("dsemulret\n");
> -#endif
> -	if (__get_user(epc, &fr->epc)) {		/* Saved EPC */
> -		/* This is not a good situation to be in */
> -		force_sig(SIGBUS, current);
> -
> -		return 0;
> -	}
> +	/* At this point, we are satisfied that it's a BD emulation trap. */
> +	fr = (struct emuframe __user *)fr_addr;
>
>   	/* Set EPC to return to post-branch instruction */
> -	xcp->cp0_epc = epc;
> +	xcp->cp0_epc = current->thread.fp_bd_emu_cpc;
> +	success = 1;
>
> -	return 1;
> +	pr_debug("dsemulret to 0x%08lx\n", xcp->cp0_epc);
> +out:
> +	if (fr)
> +		free_emuframe(fr);
> +	if (!success)
> +		MIPS_FPU_EMU_INC_STATS(errors);
> +	return success;
>   }
>
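
For anyone following the book-keeping in the quoted patch, the frame
allocation and the address check done on return can be sketched in
plain user-space C. This is a simplified illustration, not the patch
itself: it assumes a 4096-byte page and 8-byte frames (two 32-bit
instructions), omits the mutex and wait queue entirely, and every
toy_/TOY_ name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE		4096u	/* assumed page size */
#define TOY_FRAME_SIZE		8u	/* emul + badinst, 4 bytes each */
#define TOY_FRAME_COUNT		(TOY_PAGE_SIZE / TOY_FRAME_SIZE)
#define TOY_BITS_PER_WORD	32u

struct toy_emupage {
	uintptr_t base;		/* page-aligned address of the frame page */
	uint32_t allocmap[TOY_FRAME_COUNT / TOY_BITS_PER_WORD];
};

/* Claim the lowest free frame; returns its index, or -1 if all are in use. */
static int toy_alloc_frame(struct toy_emupage *p)
{
	for (unsigned int idx = 0; idx < TOY_FRAME_COUNT; idx++) {
		uint32_t bit = UINT32_C(1) << (idx % TOY_BITS_PER_WORD);
		uint32_t *word = &p->allocmap[idx / TOY_BITS_PER_WORD];

		if (!(*word & bit)) {
			*word |= bit;
			return (int)idx;
		}
	}
	return -1;
}

/* Release a frame previously returned by toy_alloc_frame(). */
static void toy_free_frame(struct toy_emupage *p, int idx)
{
	assert(idx >= 0 && (unsigned int)idx < TOY_FRAME_COUNT);

	unsigned int i = (unsigned int)idx;
	p->allocmap[i / TOY_BITS_PER_WORD] &= ~(UINT32_C(1) << (i % TOY_BITS_PER_WORD));
}

/*
 * Same shape as the test in do_dsemulret: the candidate frame address
 * must lie within the page and be aligned to the frame size.
 */
static int toy_frame_addr_valid(const struct toy_emupage *p, uintptr_t fr_addr)
{
	if (fr_addr < p->base)
		return 0;
	if (fr_addr > p->base + TOY_PAGE_SIZE - TOY_FRAME_SIZE)
		return 0;
	if (fr_addr & (TOY_FRAME_SIZE - 1))
		return 0;
	return 1;
}
```

Under these assumptions the page holds 512 frames. Where this sketch
returns -1 on exhaustion, the patch instead sleeps on the wait queue
until another thread frees a frame.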


* [PATCH v3 4/6] mips: support for 64-bit FP with O32 binaries
@ 2013-11-22 13:12       ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2013-11-22 13:12 UTC (permalink / raw)
  To: linux-mips; +Cc: Paul Burton

CPUs implementing mips32r2 may include a 64-bit FPU, just as mips64 CPUs
do. In order to preserve backwards compatibility a 64-bit FPU will act
like a 32-bit FPU (by accessing doubles from the least significant 32
bits of an even-odd pair of FP registers) when the Status.FR bit is
zero, again just like a mips64 CPU. The standard O32 ABI is defined
expecting a 32-bit FPU, however recent toolchains support use of a
64-bit FPU from an O32 mips32 executable. When an ELF executable is
built to use a 64-bit FPU a new flag (EF_MIPS_FP64) is set in the ELF
header.
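
As an illustration of the register pairing described above, here is a
small user-space model of how a double is read from the FP register
file in each mode. It is a sketch only: toy_read_double() and the
array representation of the registers are hypothetical and are not
taken from the kernel or the FPU emulator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Read a double-precision value from a model register file of 32
 * 64-bit FP registers.
 *
 * status_fr == true:  each register holds a full 64-bit value.
 * status_fr == false: a double lives in an even-odd pair; the even
 *                     register supplies the low 32 bits and the odd
 *                     register the high 32 bits.
 */
static uint64_t toy_read_double(const uint64_t fpr[32], unsigned int reg,
				bool status_fr)
{
	assert(reg < 32);

	if (status_fr)
		return fpr[reg];

	/* With FR=0 only even register numbers name a double. */
	assert((reg & 1) == 0);

	uint64_t lo = fpr[reg] & UINT64_C(0xffffffff);
	uint64_t hi = fpr[reg + 1] & UINT64_C(0xffffffff);

	return (hi << 32) | lo;
}
```

The same register contents therefore yield different doubles depending
on Status.FR, which is why the kernel must choose the mode per binary.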

With this patch the kernel will check the EF_MIPS_FP64 flag when
executing an O32 binary, and set Status.FR accordingly. The addition
of O32 64-bit FP support lessens the opportunity for optimisation in
the FPU emulator, so a CONFIG_MIPS_O32_FP64_SUPPORT Kconfig option is
introduced to allow this support to be disabled for those that don't
require it.
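
The decision described above can be sketched as a small standalone
function. This is an illustration rather than the code in the patch:
toy_o32_fp_mode() and the enum are hypothetical, and the flag value
0x200 for EF_MIPS_FP64 is my understanding of the toolchain headers, so
treat it as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of the EF_MIPS_FP64 e_flags bit; check your headers. */
#define TOY_EF_MIPS_FP64	UINT32_C(0x00000200)

enum toy_fp_mode {
	TOY_FP_REFUSE,	/* binary cannot be executed by this kernel */
	TOY_FP_FR0,	/* run with Status.FR = 0 (32-bit FP registers) */
	TOY_FP_FR1,	/* run with Status.FR = 1 (64-bit FP registers) */
};

/*
 * Choose the FP register mode for an O32 binary from its ELF header
 * flags. kernel_has_o32_fp64 models CONFIG_MIPS_O32_FP64_SUPPORT.
 */
static enum toy_fp_mode toy_o32_fp_mode(uint32_t e_flags,
					bool kernel_has_o32_fp64)
{
	/* Plain O32 binaries keep the traditional 32-bit FP register view. */
	if (!(e_flags & TOY_EF_MIPS_FP64))
		return TOY_FP_FR0;

	/* FP64 binaries need kernel support, else they are refused up front. */
	assert(TOY_EF_MIPS_FP64 != 0);
	return kernel_has_o32_fp64 ? TOY_FP_FR1 : TOY_FP_REFUSE;
}
```

Refusing at exec time matches the v3 change noted below, where the
check moved into elf_check_arch instead of killing the process on its
first FP instruction.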

Inspired by an earlier patch by Leonid Yegoshin, but implemented more
cleanly & correctly.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
Changes in v3:
  - Drop dependency on CONFIG_CPU_MIPSR2.
  - Refuse to execute O32 binaries requiring 64-bit FP when the kernel
    doesn't include support for it (via elf_check_arch), rather than
    killing the process later once it executes an FP instruction.

Changes in v2:
  - Handle TIF_32BIT_FPREGS in PTRACE_P{EEK,OKE}USR.
---
 arch/mips/Kconfig                   |  17 ++++++
 arch/mips/include/asm/asmmacro-32.h |  42 --------------
 arch/mips/include/asm/asmmacro-64.h |  96 --------------------------------
 arch/mips/include/asm/asmmacro.h    | 107 ++++++++++++++++++++++++++++++++++++
 arch/mips/include/asm/elf.h         |  31 ++++++++++-
 arch/mips/include/asm/fpu.h         |  91 +++++++++++++++++++++++++-----
 arch/mips/include/asm/thread_info.h |   4 +-
 arch/mips/kernel/binfmt_elfo32.c    |  14 +++++
 arch/mips/kernel/cpu-probe.c        |   2 +-
 arch/mips/kernel/process.c          |   3 -
 arch/mips/kernel/ptrace.c           |  60 +++++++++++---------
 arch/mips/kernel/ptrace32.c         |  53 ++++++++++--------
 arch/mips/kernel/r4k_fpu.S          |  74 +++++++++++++++++++++++--
 arch/mips/kernel/r4k_switch.S       |  45 ++++++++++++++-
 arch/mips/kernel/signal.c           |  10 ++--
 arch/mips/kernel/signal32.c         |  10 ++--
 arch/mips/kernel/traps.c            |  10 ++--
 arch/mips/math-emu/cp1emu.c         |  10 ++--
 arch/mips/math-emu/kernel_linkage.c |   6 +-
 19 files changed, 449 insertions(+), 236 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 17cc7ff..3c3cb32 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2335,6 +2335,23 @@ config CC_STACKPROTECTOR
 
 	  This feature requires gcc version 4.2 or above.
 
+config MIPS_O32_FP64_SUPPORT
+	bool "Support for O32 binaries using 64-bit FP"
+	depends on 32BIT || MIPS32_O32
+	default y
+	help
+	  When this is enabled, the kernel will support use of 64-bit floating
+	  point registers with binaries using the O32 ABI along with the
+	  EF_MIPS_FP64 ELF header flag (typically built with -mfp64). On
+	  mips32 systems this support is at the cost of increasing the size
+	  and complexity of the compiled FPU emulator. Thus if you are running
+	  a mips32 system and know that none of your userland binaries will
+	  require 64-bit floating point, you may wish to reduce the size of
+	  your kernel & potentially improve FP emulation performance by saying
+	  N here.
+
+	  If unsure, say Y.
+
 config USE_OF
 	bool
 	select OF
diff --git a/arch/mips/include/asm/asmmacro-32.h b/arch/mips/include/asm/asmmacro-32.h
index 2413afe..70e1f17 100644
--- a/arch/mips/include/asm/asmmacro-32.h
+++ b/arch/mips/include/asm/asmmacro-32.h
@@ -12,27 +12,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_double thread status tmp1=t0
-	cfc1	\tmp1,  fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp1, THREAD_FCR31(\thread)
-	.endm
-
 	.macro	fpu_save_single thread tmp=t0
 	cfc1	\tmp,  fcr31
 	swc1	$f0,  THREAD_FPR0(\thread)
@@ -70,27 +49,6 @@
 	sw	\tmp, THREAD_FCR31(\thread)
 	.endm
 
-	.macro	fpu_restore_double thread status tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
 	.macro	fpu_restore_single thread tmp=t0
 	lw	\tmp, THREAD_FCR31(\thread)
 	lwc1	$f0,  THREAD_FPR0(\thread)
diff --git a/arch/mips/include/asm/asmmacro-64.h b/arch/mips/include/asm/asmmacro-64.h
index 08a527d..38ea609 100644
--- a/arch/mips/include/asm/asmmacro-64.h
+++ b/arch/mips/include/asm/asmmacro-64.h
@@ -13,102 +13,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_16even thread tmp=t0
-	cfc1	\tmp, fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp, THREAD_FCR31(\thread)
-	.endm
-
-	.macro	fpu_save_16odd thread
-	sdc1	$f1,  THREAD_FPR1(\thread)
-	sdc1	$f3,  THREAD_FPR3(\thread)
-	sdc1	$f5,  THREAD_FPR5(\thread)
-	sdc1	$f7,  THREAD_FPR7(\thread)
-	sdc1	$f9,  THREAD_FPR9(\thread)
-	sdc1	$f11, THREAD_FPR11(\thread)
-	sdc1	$f13, THREAD_FPR13(\thread)
-	sdc1	$f15, THREAD_FPR15(\thread)
-	sdc1	$f17, THREAD_FPR17(\thread)
-	sdc1	$f19, THREAD_FPR19(\thread)
-	sdc1	$f21, THREAD_FPR21(\thread)
-	sdc1	$f23, THREAD_FPR23(\thread)
-	sdc1	$f25, THREAD_FPR25(\thread)
-	sdc1	$f27, THREAD_FPR27(\thread)
-	sdc1	$f29, THREAD_FPR29(\thread)
-	sdc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_save_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 2f
-	fpu_save_16odd \thread
-2:
-	fpu_save_16even \thread \tmp
-	.endm
-
-	.macro	fpu_restore_16even thread tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
-	.macro	fpu_restore_16odd thread
-	ldc1	$f1,  THREAD_FPR1(\thread)
-	ldc1	$f3,  THREAD_FPR3(\thread)
-	ldc1	$f5,  THREAD_FPR5(\thread)
-	ldc1	$f7,  THREAD_FPR7(\thread)
-	ldc1	$f9,  THREAD_FPR9(\thread)
-	ldc1	$f11, THREAD_FPR11(\thread)
-	ldc1	$f13, THREAD_FPR13(\thread)
-	ldc1	$f15, THREAD_FPR15(\thread)
-	ldc1	$f17, THREAD_FPR17(\thread)
-	ldc1	$f19, THREAD_FPR19(\thread)
-	ldc1	$f21, THREAD_FPR21(\thread)
-	ldc1	$f23, THREAD_FPR23(\thread)
-	ldc1	$f25, THREAD_FPR25(\thread)
-	ldc1	$f27, THREAD_FPR27(\thread)
-	ldc1	$f29, THREAD_FPR29(\thread)
-	ldc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_restore_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 1f				# 16 register mode?
-
-	fpu_restore_16odd \thread
-1:	fpu_restore_16even \thread \tmp
-	.endm
-
 	.macro	cpu_save_nonscratch thread
 	LONG_S	s0, THREAD_REG16(\thread)
 	LONG_S	s1, THREAD_REG17(\thread)
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index 6c8342a..3220c93 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -62,6 +62,113 @@
 	.endm
 #endif /* CONFIG_MIPS_MT_SMTC */
 
+	.macro	fpu_save_16even thread tmp=t0
+	cfc1	\tmp, fcr31
+	sdc1	$f0,  THREAD_FPR0(\thread)
+	sdc1	$f2,  THREAD_FPR2(\thread)
+	sdc1	$f4,  THREAD_FPR4(\thread)
+	sdc1	$f6,  THREAD_FPR6(\thread)
+	sdc1	$f8,  THREAD_FPR8(\thread)
+	sdc1	$f10, THREAD_FPR10(\thread)
+	sdc1	$f12, THREAD_FPR12(\thread)
+	sdc1	$f14, THREAD_FPR14(\thread)
+	sdc1	$f16, THREAD_FPR16(\thread)
+	sdc1	$f18, THREAD_FPR18(\thread)
+	sdc1	$f20, THREAD_FPR20(\thread)
+	sdc1	$f22, THREAD_FPR22(\thread)
+	sdc1	$f24, THREAD_FPR24(\thread)
+	sdc1	$f26, THREAD_FPR26(\thread)
+	sdc1	$f28, THREAD_FPR28(\thread)
+	sdc1	$f30, THREAD_FPR30(\thread)
+	sw	\tmp, THREAD_FCR31(\thread)
+	.endm
+
+	.macro	fpu_save_16odd thread
+	.set	push
+	.set	mips64r2
+	sdc1	$f1,  THREAD_FPR1(\thread)
+	sdc1	$f3,  THREAD_FPR3(\thread)
+	sdc1	$f5,  THREAD_FPR5(\thread)
+	sdc1	$f7,  THREAD_FPR7(\thread)
+	sdc1	$f9,  THREAD_FPR9(\thread)
+	sdc1	$f11, THREAD_FPR11(\thread)
+	sdc1	$f13, THREAD_FPR13(\thread)
+	sdc1	$f15, THREAD_FPR15(\thread)
+	sdc1	$f17, THREAD_FPR17(\thread)
+	sdc1	$f19, THREAD_FPR19(\thread)
+	sdc1	$f21, THREAD_FPR21(\thread)
+	sdc1	$f23, THREAD_FPR23(\thread)
+	sdc1	$f25, THREAD_FPR25(\thread)
+	sdc1	$f27, THREAD_FPR27(\thread)
+	sdc1	$f29, THREAD_FPR29(\thread)
+	sdc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_save_double thread status tmp
+#if defined(CONFIG_MIPS64) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f
+	fpu_save_16odd \thread
+10:
+#endif
+	fpu_save_16even \thread \tmp
+	.endm
+
+	.macro	fpu_restore_16even thread tmp=t0
+	lw	\tmp, THREAD_FCR31(\thread)
+	ldc1	$f0,  THREAD_FPR0(\thread)
+	ldc1	$f2,  THREAD_FPR2(\thread)
+	ldc1	$f4,  THREAD_FPR4(\thread)
+	ldc1	$f6,  THREAD_FPR6(\thread)
+	ldc1	$f8,  THREAD_FPR8(\thread)
+	ldc1	$f10, THREAD_FPR10(\thread)
+	ldc1	$f12, THREAD_FPR12(\thread)
+	ldc1	$f14, THREAD_FPR14(\thread)
+	ldc1	$f16, THREAD_FPR16(\thread)
+	ldc1	$f18, THREAD_FPR18(\thread)
+	ldc1	$f20, THREAD_FPR20(\thread)
+	ldc1	$f22, THREAD_FPR22(\thread)
+	ldc1	$f24, THREAD_FPR24(\thread)
+	ldc1	$f26, THREAD_FPR26(\thread)
+	ldc1	$f28, THREAD_FPR28(\thread)
+	ldc1	$f30, THREAD_FPR30(\thread)
+	ctc1	\tmp, fcr31
+	.endm
+
+	.macro	fpu_restore_16odd thread
+	.set	push
+	.set	mips64r2
+	ldc1	$f1,  THREAD_FPR1(\thread)
+	ldc1	$f3,  THREAD_FPR3(\thread)
+	ldc1	$f5,  THREAD_FPR5(\thread)
+	ldc1	$f7,  THREAD_FPR7(\thread)
+	ldc1	$f9,  THREAD_FPR9(\thread)
+	ldc1	$f11, THREAD_FPR11(\thread)
+	ldc1	$f13, THREAD_FPR13(\thread)
+	ldc1	$f15, THREAD_FPR15(\thread)
+	ldc1	$f17, THREAD_FPR17(\thread)
+	ldc1	$f19, THREAD_FPR19(\thread)
+	ldc1	$f21, THREAD_FPR21(\thread)
+	ldc1	$f23, THREAD_FPR23(\thread)
+	ldc1	$f25, THREAD_FPR25(\thread)
+	ldc1	$f27, THREAD_FPR27(\thread)
+	ldc1	$f29, THREAD_FPR29(\thread)
+	ldc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_restore_double thread status tmp
+#if defined(CONFIG_MIPS64) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f				# 16 register mode?
+
+	fpu_restore_16odd \thread
+10:
+#endif
+	fpu_restore_16even \thread \tmp
+	.endm
+
 /*
  * Temporary until all gas have MT ASE support
  */
diff --git a/arch/mips/include/asm/elf.h b/arch/mips/include/asm/elf.h
index a66359e..d414405 100644
--- a/arch/mips/include/asm/elf.h
+++ b/arch/mips/include/asm/elf.h
@@ -36,6 +36,7 @@
 #define EF_MIPS_ABI2		0x00000020
 #define EF_MIPS_OPTIONS_FIRST	0x00000080
 #define EF_MIPS_32BITMODE	0x00000100
+#define EF_MIPS_FP64		0x00000200
 #define EF_MIPS_ABI		0x0000f000
 #define EF_MIPS_ARCH		0xf0000000
 
@@ -176,6 +177,18 @@ typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
 #ifdef CONFIG_32BIT
 
 /*
+ * In order to be sure that we don't attempt to execute an O32 binary which
+ * requires 64 bit FP (FR=1) on a system which does not support it we refuse
+ * to execute any binary which has bits specified by the following macro set
+ * in its ELF header flags.
+ */
+#ifdef CONFIG_MIPS_O32_FP64_SUPPORT
+# define __MIPS_O32_FP64_MUST_BE_ZERO	0
+#else
+# define __MIPS_O32_FP64_MUST_BE_ZERO	EF_MIPS_FP64
+#endif
+
+/*
  * This is used to ensure we don't load something for the wrong architecture.
  */
 #define elf_check_arch(hdr)						\
@@ -192,6 +205,8 @@ typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
 	if (((__h->e_flags & EF_MIPS_ABI) != 0) &&			\
 	    ((__h->e_flags & EF_MIPS_ABI) != EF_MIPS_ABI_O32))		\
 		__res = 0;						\
+	if (__h->e_flags & __MIPS_O32_FP64_MUST_BE_ZERO)		\
+		__res = 0;						\
 									\
 	__res;								\
 })
@@ -249,6 +264,11 @@ extern struct mips_abi mips_abi_n32;
 
 #define SET_PERSONALITY(ex)						\
 do {									\
+	if ((ex).e_flags & EF_MIPS_FP64)				\
+		clear_thread_flag(TIF_32BIT_FPREGS);			\
+	else								\
+		set_thread_flag(TIF_32BIT_FPREGS);			\
+									\
 	if (personality(current->personality) != PER_LINUX)		\
 		set_personality(PER_LINUX);				\
 									\
@@ -271,14 +291,18 @@ do {									\
 #endif
 
 #ifdef CONFIG_MIPS32_O32
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do {								\
 		set_thread_flag(TIF_32BIT_REGS);			\
 		set_thread_flag(TIF_32BIT_ADDR);			\
+									\
+		if (!((ex).e_flags & EF_MIPS_FP64))			\
+			set_thread_flag(TIF_32BIT_FPREGS);		\
+									\
 		current->thread.abi = &mips_abi_32;			\
 	} while (0)
 #else
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do { } while (0)
 #endif
 
@@ -289,7 +313,7 @@ do {									\
 	     ((ex).e_flags & EF_MIPS_ABI) == 0)				\
 		__SET_PERSONALITY32_N32();				\
 	else								\
-		__SET_PERSONALITY32_O32();				\
+		__SET_PERSONALITY32_O32(ex);                            \
 } while (0)
 #else
 #define __SET_PERSONALITY32(ex) do { } while (0)
@@ -300,6 +324,7 @@ do {									\
 	unsigned int p;							\
 									\
 	clear_thread_flag(TIF_32BIT_REGS);				\
+	clear_thread_flag(TIF_32BIT_FPREGS);				\
 	clear_thread_flag(TIF_32BIT_ADDR);				\
 									\
 	if ((ex).e_ident[EI_CLASS] == ELFCLASS32)			\
diff --git a/arch/mips/include/asm/fpu.h b/arch/mips/include/asm/fpu.h
index 3bf023f..cfe092f 100644
--- a/arch/mips/include/asm/fpu.h
+++ b/arch/mips/include/asm/fpu.h
@@ -33,11 +33,48 @@ extern void _init_fpu(void);
 extern void _save_fp(struct task_struct *);
 extern void _restore_fp(struct task_struct *);
 
-#define __enable_fpu()							\
-do {									\
-	set_c0_status(ST0_CU1);						\
-	enable_fpu_hazard();						\
-} while (0)
+/*
+ * This enum specifies a mode in which we want the FPU to operate, for cores
+ * which implement the Status.FR bit. Note that FPU_32BIT & FPU_64BIT
+ * purposefully have the values 0 & 1 respectively, so that an integer value
+ * of Status.FR can be trivially casted to the corresponding enum fpu_mode.
+ */
+enum fpu_mode {
+	FPU_32BIT = 0,		/* FR = 0 */
+	FPU_64BIT,		/* FR = 1 */
+	FPU_AS_IS,
+};
+
+static inline int __enable_fpu(enum fpu_mode mode)
+{
+	int fr;
+
+	switch (mode) {
+	case FPU_AS_IS:
+		/* just enable the FPU in its current mode */
+		set_c0_status(ST0_CU1);
+		enable_fpu_hazard();
+		return 0;
+
+	case FPU_64BIT:
+#if !(defined(CONFIG_CPU_MIPS32_R2) || defined(CONFIG_MIPS64))
+		/* we only have a 32-bit FPU */
+		return SIGFPE;
+#endif
+		/* fall through */
+	case FPU_32BIT:
+		/* set CU1 & change FR appropriately */
+		fr = (int)mode;
+		change_c0_status(ST0_CU1 | ST0_FR, ST0_CU1 | (fr ? ST0_FR : 0));
+		enable_fpu_hazard();
+
+		/* check FR has the desired value */
+		return (!!(read_c0_status() & ST0_FR) == !!fr) ? 0 : SIGFPE;
+
+	default:
+		BUG();
+	}
+}
 
 #define __disable_fpu()							\
 do {									\
@@ -57,27 +94,46 @@ static inline int is_fpu_owner(void)
 	return cpu_has_fpu && __is_fpu_owner();
 }
 
-static inline void __own_fpu(void)
+static inline int __own_fpu(void)
 {
-	__enable_fpu();
+	enum fpu_mode mode;
+	int ret;
+
+	mode = !test_thread_flag(TIF_32BIT_FPREGS);
+	ret = __enable_fpu(mode);
+	if (ret)
+		return ret;
+
 	KSTK_STATUS(current) |= ST0_CU1;
+	if (mode == FPU_64BIT)
+		KSTK_STATUS(current) |= ST0_FR;
+	else /* mode == FPU_32BIT */
+		KSTK_STATUS(current) &= ~ST0_FR;
+
 	set_thread_flag(TIF_USEDFPU);
+	return 0;
 }
 
-static inline void own_fpu_inatomic(int restore)
+static inline int own_fpu_inatomic(int restore)
 {
+	int ret = 0;
+
 	if (cpu_has_fpu && !__is_fpu_owner()) {
-		__own_fpu();
-		if (restore)
+		ret = __own_fpu();
+		if (restore && !ret)
 			_restore_fp(current);
 	}
+	return ret;
 }
 
-static inline void own_fpu(int restore)
+static inline int own_fpu(int restore)
 {
+	int ret;
+
 	preempt_disable();
-	own_fpu_inatomic(restore);
+	ret = own_fpu_inatomic(restore);
 	preempt_enable();
+	return ret;
 }
 
 static inline void lose_fpu(int save)
@@ -93,16 +149,21 @@ static inline void lose_fpu(int save)
 	preempt_enable();
 }
 
-static inline void init_fpu(void)
+static inline int init_fpu(void)
 {
+	int ret = 0;
+
 	preempt_disable();
 	if (cpu_has_fpu) {
-		__own_fpu();
-		_init_fpu();
+		ret = __own_fpu();
+		if (!ret)
+			_init_fpu();
 	} else {
 		fpu_emulator_init_fpu();
 	}
+
 	preempt_enable();
+	return ret;
 }
 
 static inline void save_fp(struct task_struct *tsk)
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index f9b24bf..b6da8b7 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -112,11 +112,12 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
-#define TIF_32BIT_REGS		22	/* also implies 16/32 fprs */
+#define TIF_32BIT_REGS		22	/* 32-bit general purpose registers */
 #define TIF_32BIT_ADDR		23	/* 32-bit address space (o32/n32) */
 #define TIF_FPUBOUND		24	/* thread bound to FPU-full CPU set */
 #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
 #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
+#define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -133,6 +134,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT_ADDR		(1<<TIF_32BIT_ADDR)
 #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
 #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
+#define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 
 #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
diff --git a/arch/mips/kernel/binfmt_elfo32.c b/arch/mips/kernel/binfmt_elfo32.c
index 202e581..7faf5f2 100644
--- a/arch/mips/kernel/binfmt_elfo32.c
+++ b/arch/mips/kernel/binfmt_elfo32.c
@@ -28,6 +28,18 @@ typedef double elf_fpreg_t;
 typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
 
 /*
+ * In order to be sure that we don't attempt to execute an O32 binary which
+ * requires 64 bit FP (FR=1) on a system which does not support it we refuse
+ * to execute any binary which has bits specified by the following macro set
+ * in its ELF header flags.
+ */
+#ifdef CONFIG_MIPS_O32_FP64_SUPPORT
+# define __MIPS_O32_FP64_MUST_BE_ZERO	0
+#else
+# define __MIPS_O32_FP64_MUST_BE_ZERO	EF_MIPS_FP64
+#endif
+
+/*
  * This is used to ensure we don't load something for the wrong architecture.
  */
 #define elf_check_arch(hdr)						\
@@ -44,6 +56,8 @@ typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
 	if (((__h->e_flags & EF_MIPS_ABI) != 0) &&			\
 	    ((__h->e_flags & EF_MIPS_ABI) != EF_MIPS_ABI_O32))		\
 		__res = 0;						\
+	if (__h->e_flags & __MIPS_O32_FP64_MUST_BE_ZERO)		\
+		__res = 0;						\
 									\
 	__res;								\
 })
diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c
index c814287..e2b2d20 100644
--- a/arch/mips/kernel/cpu-probe.c
+++ b/arch/mips/kernel/cpu-probe.c
@@ -112,7 +112,7 @@ static inline unsigned long cpu_get_fpu_id(void)
 	unsigned long tmp, fpu_id;
 
 	tmp = read_c0_status();
-	__enable_fpu();
+	__enable_fpu(FPU_AS_IS);
 	fpu_id = read_32bit_cp1_register(CP1_REVISION);
 	write_c0_status(tmp);
 	return fpu_id;
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index ddc7610..747a6cf 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -60,9 +60,6 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
 
 	/* New thread loses kernel privileges. */
 	status = regs->cp0_status & ~(ST0_CU0|ST0_CU1|ST0_FR|KU_MASK);
-#ifdef CONFIG_64BIT
-	status |= test_thread_flag(TIF_32BIT_REGS) ? 0 : ST0_FR;
-#endif
 	status |= KU_USER;
 	regs->cp0_status = status;
 	clear_used_math();
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index b52e1d2..7da9b76 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -137,13 +137,13 @@ int ptrace_getfpregs(struct task_struct *child, __u32 __user *data)
 		if (cpu_has_mipsmt) {
 			unsigned int vpflags = dvpe();
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 			evpe(vpflags);
 		} else {
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 		}
@@ -408,6 +408,7 @@ long arch_ptrace(struct task_struct *child, long request,
 	/* Read the word at location addr in the USER area. */
 	case PTRACE_PEEKUSR: {
 		struct pt_regs *regs;
+		fpureg_t *fregs;
 		unsigned long tmp = 0;
 
 		regs = task_pt_regs(child);
@@ -418,26 +419,28 @@ long arch_ptrace(struct task_struct *child, long request,
 			tmp = regs->regs[addr];
 			break;
 		case FPR_BASE ... FPR_BASE + 31:
-			if (tsk_used_math(child)) {
-				fpureg_t *fregs = get_fpu_regs(child);
+			if (!tsk_used_math(child)) {
+				/* FP not yet used */
+				tmp = -1;
+				break;
+			}
+			fregs = get_fpu_regs(child);
 
 #ifdef CONFIG_32BIT
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
 				/*
 				 * The odd registers are actually the high
 				 * order bits of the values stored in the even
 				 * registers - unless we're using r2k_switch.S.
 				 */
 				if (addr & 1)
-					tmp = (unsigned long) (fregs[((addr & ~1) - 32)] >> 32);
+					tmp = fregs[(addr & ~1) - 32] >> 32;
 				else
-					tmp = (unsigned long) (fregs[(addr - 32)] & 0xffffffff);
-#endif
-#ifdef CONFIG_64BIT
-				tmp = fregs[addr - FPR_BASE];
-#endif
-			} else {
-				tmp = -1;	/* FP not yet used  */
+					tmp = fregs[addr - 32];
+				break;
 			}
+#endif
+			tmp = fregs[addr - FPR_BASE];
 			break;
 		case PC:
 			tmp = regs->cp0_epc;
@@ -483,13 +486,13 @@ long arch_ptrace(struct task_struct *child, long request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
@@ -554,22 +557,25 @@ long arch_ptrace(struct task_struct *child, long request,
 				child->thread.fpu.fcr31 = 0;
 			}
 #ifdef CONFIG_32BIT
-			/*
-			 * The odd registers are actually the high order bits
-			 * of the values stored in the even registers - unless
-			 * we're using r2k_switch.S.
-			 */
-			if (addr & 1) {
-				fregs[(addr & ~1) - FPR_BASE] &= 0xffffffff;
-				fregs[(addr & ~1) - FPR_BASE] |= ((unsigned long long) data) << 32;
-			} else {
-				fregs[addr - FPR_BASE] &= ~0xffffffffLL;
-				fregs[addr - FPR_BASE] |= data;
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
+				/*
+				 * The odd registers are actually the high
+				 * order bits of the values stored in the even
+				 * registers - unless we're using r2k_switch.S.
+				 */
+				if (addr & 1) {
+					fregs[(addr & ~1) - FPR_BASE] &=
+						0xffffffff;
+					fregs[(addr & ~1) - FPR_BASE] |=
+						((u64)data) << 32;
+				} else {
+					fregs[addr - FPR_BASE] &= ~0xffffffffLL;
+					fregs[addr - FPR_BASE] |= data;
+				}
+				break;
 			}
 #endif
-#ifdef CONFIG_64BIT
 			fregs[addr - FPR_BASE] = data;
-#endif
 			break;
 		}
 		case PC:
diff --git a/arch/mips/kernel/ptrace32.c b/arch/mips/kernel/ptrace32.c
index 9486055..b8aa2dd 100644
--- a/arch/mips/kernel/ptrace32.c
+++ b/arch/mips/kernel/ptrace32.c
@@ -80,6 +80,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 	/* Read the word at location addr in the USER area. */
 	case PTRACE_PEEKUSR: {
 		struct pt_regs *regs;
+		fpureg_t *fregs;
 		unsigned int tmp;
 
 		regs = task_pt_regs(child);
@@ -90,21 +91,25 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			tmp = regs->regs[addr];
 			break;
 		case FPR_BASE ... FPR_BASE + 31:
-			if (tsk_used_math(child)) {
-				fpureg_t *fregs = get_fpu_regs(child);
-
+			if (!tsk_used_math(child)) {
+				/* FP not yet used */
+				tmp = -1;
+				break;
+			}
+			fregs = get_fpu_regs(child);
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
 				/*
 				 * The odd registers are actually the high
 				 * order bits of the values stored in the even
 				 * registers - unless we're using r2k_switch.S.
 				 */
 				if (addr & 1)
-					tmp = (unsigned long) (fregs[((addr & ~1) - 32)] >> 32);
+					tmp = fregs[(addr & ~1) - 32] >> 32;
 				else
-					tmp = (unsigned long) (fregs[(addr - 32)] & 0xffffffff);
-			} else {
-				tmp = -1;	/* FP not yet used  */
+					tmp = fregs[addr - 32];
+				break;
 			}
+			tmp = fregs[addr - FPR_BASE];
 			break;
 		case PC:
 			tmp = regs->cp0_epc;
@@ -147,13 +152,13 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
@@ -236,20 +241,24 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 				       sizeof(child->thread.fpu));
 				child->thread.fpu.fcr31 = 0;
 			}
-			/*
-			 * The odd registers are actually the high order bits
-			 * of the values stored in the even registers - unless
-			 * we're using r2k_switch.S.
-			 */
-			if (addr & 1) {
-				fregs[(addr & ~1) - FPR_BASE] &= 0xffffffff;
-				fregs[(addr & ~1) - FPR_BASE] |= ((unsigned long long) data) << 32;
-			} else {
-				fregs[addr - FPR_BASE] &= ~0xffffffffLL;
-				/* Must cast, lest sign extension fill upper
-				   bits!  */
-				fregs[addr - FPR_BASE] |= (unsigned int)data;
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
+				/*
+				 * The odd registers are actually the high
+				 * order bits of the values stored in the even
+				 * registers - unless we're using r2k_switch.S.
+				 */
+				if (addr & 1) {
+					fregs[(addr & ~1) - FPR_BASE] &=
+						0xffffffff;
+					fregs[(addr & ~1) - FPR_BASE] |=
+						((u64)data) << 32;
+				} else {
+					fregs[addr - FPR_BASE] &= ~0xffffffffLL;
+					fregs[addr - FPR_BASE] |= data;
+				}
+				break;
 			}
+			fregs[addr - FPR_BASE] = data;
 			break;
 		}
 		case PC:
diff --git a/arch/mips/kernel/r4k_fpu.S b/arch/mips/kernel/r4k_fpu.S
index 55ffe14..253b2fb 100644
--- a/arch/mips/kernel/r4k_fpu.S
+++ b/arch/mips/kernel/r4k_fpu.S
@@ -35,7 +35,15 @@
 LEAF(_save_fp_context)
 	cfc1	t1, fcr31
 
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+#endif
 	/* Store the 16 odd double precision registers */
 	EX	sdc1 $f1, SC_FPREGS+8(a0)
 	EX	sdc1 $f3, SC_FPREGS+24(a0)
@@ -53,6 +61,7 @@ LEAF(_save_fp_context)
 	EX	sdc1 $f27, SC_FPREGS+216(a0)
 	EX	sdc1 $f29, SC_FPREGS+232(a0)
 	EX	sdc1 $f31, SC_FPREGS+248(a0)
+1:	.set	pop
 #endif
 
 	/* Store the 16 even double precision registers */
@@ -82,7 +91,31 @@ LEAF(_save_fp_context)
 LEAF(_save_fp_context32)
 	cfc1	t1, fcr31
 
-	EX	sdc1 $f0, SC32_FPREGS+0(a0)
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+
+	/* Store the 16 odd double precision registers */
+	EX      sdc1 $f1, SC32_FPREGS+8(a0)
+	EX      sdc1 $f3, SC32_FPREGS+24(a0)
+	EX      sdc1 $f5, SC32_FPREGS+40(a0)
+	EX      sdc1 $f7, SC32_FPREGS+56(a0)
+	EX      sdc1 $f9, SC32_FPREGS+72(a0)
+	EX      sdc1 $f11, SC32_FPREGS+88(a0)
+	EX      sdc1 $f13, SC32_FPREGS+104(a0)
+	EX      sdc1 $f15, SC32_FPREGS+120(a0)
+	EX      sdc1 $f17, SC32_FPREGS+136(a0)
+	EX      sdc1 $f19, SC32_FPREGS+152(a0)
+	EX      sdc1 $f21, SC32_FPREGS+168(a0)
+	EX      sdc1 $f23, SC32_FPREGS+184(a0)
+	EX      sdc1 $f25, SC32_FPREGS+200(a0)
+	EX      sdc1 $f27, SC32_FPREGS+216(a0)
+	EX      sdc1 $f29, SC32_FPREGS+232(a0)
+	EX      sdc1 $f31, SC32_FPREGS+248(a0)
+
+	/* Store the 16 even double precision registers */
+1:	EX	sdc1 $f0, SC32_FPREGS+0(a0)
 	EX	sdc1 $f2, SC32_FPREGS+16(a0)
 	EX	sdc1 $f4, SC32_FPREGS+32(a0)
 	EX	sdc1 $f6, SC32_FPREGS+48(a0)
@@ -114,7 +147,16 @@ LEAF(_save_fp_context32)
  */
 LEAF(_restore_fp_context)
 	EX	lw t0, SC_FPC_CSR(a0)
-#ifdef CONFIG_64BIT
+
+#if defined(CONFIG_64BIT) || defined(CONFIG_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+#endif
 	EX	ldc1 $f1, SC_FPREGS+8(a0)
 	EX	ldc1 $f3, SC_FPREGS+24(a0)
 	EX	ldc1 $f5, SC_FPREGS+40(a0)
@@ -131,6 +173,7 @@ LEAF(_restore_fp_context)
 	EX	ldc1 $f27, SC_FPREGS+216(a0)
 	EX	ldc1 $f29, SC_FPREGS+232(a0)
 	EX	ldc1 $f31, SC_FPREGS+248(a0)
+1:	.set pop
 #endif
 	EX	ldc1 $f0, SC_FPREGS+0(a0)
 	EX	ldc1 $f2, SC_FPREGS+16(a0)
@@ -157,7 +200,30 @@ LEAF(_restore_fp_context)
 LEAF(_restore_fp_context32)
 	/* Restore an o32 sigcontext.  */
 	EX	lw t0, SC32_FPC_CSR(a0)
-	EX	ldc1 $f0, SC32_FPREGS+0(a0)
+
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+
+	EX      ldc1 $f1, SC32_FPREGS+8(a0)
+	EX      ldc1 $f3, SC32_FPREGS+24(a0)
+	EX      ldc1 $f5, SC32_FPREGS+40(a0)
+	EX      ldc1 $f7, SC32_FPREGS+56(a0)
+	EX      ldc1 $f9, SC32_FPREGS+72(a0)
+	EX      ldc1 $f11, SC32_FPREGS+88(a0)
+	EX      ldc1 $f13, SC32_FPREGS+104(a0)
+	EX      ldc1 $f15, SC32_FPREGS+120(a0)
+	EX      ldc1 $f17, SC32_FPREGS+136(a0)
+	EX      ldc1 $f19, SC32_FPREGS+152(a0)
+	EX      ldc1 $f21, SC32_FPREGS+168(a0)
+	EX      ldc1 $f23, SC32_FPREGS+184(a0)
+	EX      ldc1 $f25, SC32_FPREGS+200(a0)
+	EX      ldc1 $f27, SC32_FPREGS+216(a0)
+	EX      ldc1 $f29, SC32_FPREGS+232(a0)
+	EX      ldc1 $f31, SC32_FPREGS+248(a0)
+
+1:	EX	ldc1 $f0, SC32_FPREGS+0(a0)
 	EX	ldc1 $f2, SC32_FPREGS+16(a0)
 	EX	ldc1 $f4, SC32_FPREGS+32(a0)
 	EX	ldc1 $f6, SC32_FPREGS+48(a0)
diff --git a/arch/mips/kernel/r4k_switch.S b/arch/mips/kernel/r4k_switch.S
index 078de5e..cc78dd9 100644
--- a/arch/mips/kernel/r4k_switch.S
+++ b/arch/mips/kernel/r4k_switch.S
@@ -123,7 +123,7 @@
  * Save a thread's fp context.
  */
 LEAF(_save_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_save_double a0 t0 t1		# clobbers t1
@@ -134,7 +134,7 @@ LEAF(_save_fp)
  * Restore a thread's fp context.
  */
 LEAF(_restore_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_restore_double a0 t0 t1		# clobbers t1
@@ -228,6 +228,47 @@ LEAF(_init_fpu)
 	mtc1	t1, $f29
 	mtc1	t1, $f30
 	mtc1	t1, $f31
+
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set    push
+	.set    mips64r2
+	sll     t0, t0, 5			# is Status.FR set?
+	bgez    t0, 1f				# no: skip setting upper 32b
+
+	mthc1   t1, $f0
+	mthc1   t1, $f1
+	mthc1   t1, $f2
+	mthc1   t1, $f3
+	mthc1   t1, $f4
+	mthc1   t1, $f5
+	mthc1   t1, $f6
+	mthc1   t1, $f7
+	mthc1   t1, $f8
+	mthc1   t1, $f9
+	mthc1   t1, $f10
+	mthc1   t1, $f11
+	mthc1   t1, $f12
+	mthc1   t1, $f13
+	mthc1   t1, $f14
+	mthc1   t1, $f15
+	mthc1   t1, $f16
+	mthc1   t1, $f17
+	mthc1   t1, $f18
+	mthc1   t1, $f19
+	mthc1   t1, $f20
+	mthc1   t1, $f21
+	mthc1   t1, $f22
+	mthc1   t1, $f23
+	mthc1   t1, $f24
+	mthc1   t1, $f25
+	mthc1   t1, $f26
+	mthc1   t1, $f27
+	mthc1   t1, $f28
+	mthc1   t1, $f29
+	mthc1   t1, $f30
+	mthc1   t1, $f31
+1:	.set    pop
+#endif /* CONFIG_CPU_MIPS32_R2 */
 #else
 	.set	mips3
 	dmtc1	t1, $f0
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index 2f285ab..5199563 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -71,8 +71,9 @@ static int protected_save_fp_context(struct sigcontext __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -91,8 +92,9 @@ static int protected_restore_fp_context(struct sigcontext __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c
index 57de8b7..7c1024b 100644
--- a/arch/mips/kernel/signal32.c
+++ b/arch/mips/kernel/signal32.c
@@ -85,8 +85,9 @@ static int protected_save_fp_context32(struct sigcontext32 __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -105,8 +106,9 @@ static int protected_restore_fp_context32(struct sigcontext32 __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index f9c8746..f40f688 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -1080,7 +1080,7 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 	unsigned long old_epc, old31;
 	unsigned int opcode;
 	unsigned int cpid;
-	int status;
+	int status, err;
 	unsigned long __maybe_unused flags;
 
 	prev_state = exception_enter();
@@ -1153,19 +1153,19 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 
 	case 1:
 		if (used_math())	/* Using the FPU again.	 */
-			own_fpu(1);
+			err = own_fpu(1);
 		else {			/* First time FPU user.	 */
-			init_fpu();
+			err = init_fpu();
 			set_used_math();
 		}
 
-		if (!raw_cpu_has_fpu) {
+		if (!raw_cpu_has_fpu || err) {
 			int sig;
 			void __user *fault_addr = NULL;
 			sig = fpu_emulator_cop1Handler(regs,
 						       &current->thread.fpu,
 						       0, &fault_addr);
-			if (!process_fpemu_return(sig, fault_addr))
+			if (!process_fpemu_return(sig, fault_addr) && !err)
 				mt_ase_fp_affinity();
 		}
 
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 4b37961..22f7b11 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -859,20 +859,20 @@ static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
  * In the Linux kernel, we support selection of FPR format on the
  * basis of the Status.FR bit.	If an FPU is not present, the FR bit
  * is hardwired to zero, which would imply a 32-bit FPU even for
- * 64-bit CPUs so we rather look at TIF_32BIT_REGS.
+ * 64-bit CPUs so we rather look at TIF_32BIT_FPREGS.
  * FPU emu is slow and bulky and optimizing this function offers fairly
  * sizeable benefits so we try to be clever and make this function return
  * a constant whenever possible, that is on 64-bit kernels without O32
- * compatibility enabled and on 32-bit kernels.
+ * compatibility enabled and on 32-bit without 64-bit FPU support.
  */
 static inline int cop1_64bit(struct pt_regs *xcp)
 {
 #if defined(CONFIG_64BIT) && !defined(CONFIG_MIPS32_O32)
 	return 1;
-#elif defined(CONFIG_64BIT) && defined(CONFIG_MIPS32_O32)
-	return !test_thread_flag(TIF_32BIT_REGS);
-#else
+#elif defined(CONFIG_32BIT) && !defined(CONFIG_MIPS_O32_FP64_SUPPORT)
 	return 0;
+#else
+	return !test_thread_flag(TIF_32BIT_FPREGS);
 #endif
 }
 
diff --git a/arch/mips/math-emu/kernel_linkage.c b/arch/mips/math-emu/kernel_linkage.c
index 1c58657..3aeae07 100644
--- a/arch/mips/math-emu/kernel_linkage.c
+++ b/arch/mips/math-emu/kernel_linkage.c
@@ -89,8 +89,9 @@ int fpu_emulator_save_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __put_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
@@ -103,8 +104,9 @@ int fpu_emulator_restore_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __get_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
-- 
1.8.4.2


* [PATCH v3 4/6] mips: support for 64-bit FP with O32 binaries
@ 2013-11-22 13:12       ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2013-11-22 13:12 UTC (permalink / raw)
  To: linux-mips; +Cc: Paul Burton

CPUs implementing mips32r2 may include a 64-bit FPU, just as mips64 CPUs
do. In order to preserve backwards compatibility a 64-bit FPU will act
like a 32-bit FPU (by accessing doubles from the least significant 32
bits of an even-odd pair of FP registers) when the Status.FR bit is
zero, again just like a mips64 CPU. The standard O32 ABI is defined
expecting a 32-bit FPU; however, recent toolchains support use of a
64-bit FPU from an O32 mips32 executable. When an ELF executable is
built to use a 64-bit FPU, a new flag (EF_MIPS_FP64) is set in the ELF
header.
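
To illustrate the even-odd pairing described above, here is a minimal
sketch in plain C. It assumes the storage model used by the ptrace
hunks of this patch: each saved FP register occupies a 64-bit slot, and
with FR=0 an odd register is the upper 32 bits of the preceding even
slot. The names NUM_FPR, fpr_word_fr0 and fpr_double_bits are
hypothetical and exist only for this illustration; they are not kernel
symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: 32 saved FP registers, one 64-bit slot each. */
#define NUM_FPR 32

/*
 * 32-bit word that register `idx` exposes when Status.FR = 0.
 * An even register is the low word of its own slot; an odd register
 * is the high word of the preceding even slot.
 */
static uint32_t fpr_word_fr0(const uint64_t fpr[NUM_FPR], unsigned int idx)
{
	assert(idx < NUM_FPR);
	if (idx & 1)
		return (uint32_t)(fpr[idx & ~1u] >> 32);
	return (uint32_t)(fpr[idx] & 0xffffffffu);
}

/*
 * Raw bits of a double held in register `idx`.
 * FR = 0: idx must be even, and the value spans the even-odd pair.
 * FR = 1: any of the 32 registers holds a full 64-bit value.
 */
static uint64_t fpr_double_bits(const uint64_t fpr[NUM_FPR],
				unsigned int idx, int fr)
{
	assert(idx < NUM_FPR);
	if (!fr) {
		assert((idx & 1) == 0);
		return ((uint64_t)fpr_word_fr0(fpr, idx + 1) << 32) |
		       fpr_word_fr0(fpr, idx);
	}
	return fpr[idx];
}
```

With FR=0 only the 16 even-numbered registers can name a double, which
is why the save and restore paths in this patch handle the odd
registers separately and skip them when FR is clear.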

With this patch the kernel will check the EF_MIPS_FP64 flag when
executing an O32 binary, and set Status.FR accordingly. The addition
of O32 64-bit FP support lessens the opportunity for optimisation in
the FPU emulator, so a CONFIG_MIPS_O32_FP64_SUPPORT Kconfig option is
introduced to allow this support to be disabled for those who don't
require it.
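
The decision described above can be sketched as a small standalone C
fragment. This is a simplified model, not the kernel code: the
EF_MIPS_FP64 value is taken from this patch, while enum fp_mode,
o32_binary_acceptable, o32_select_fp_mode and the kernel_has_fp64
parameter (standing in for CONFIG_MIPS_O32_FP64_SUPPORT) are
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define EF_MIPS_FP64	0x00000200u	/* value as defined in this patch */

/* Hypothetical stand-in for the desired Status.FR setting. */
enum fp_mode {
	FP_MODE_FR0 = 0,	/* 32-bit FP register model */
	FP_MODE_FR1 = 1,	/* 64-bit FP register model */
};

/*
 * Load-time check: without FP64 support, a binary carrying the
 * EF_MIPS_FP64 flag is refused instead of failing later.
 */
static int o32_binary_acceptable(uint32_t e_flags, int kernel_has_fp64)
{
	uint32_t must_be_zero = kernel_has_fp64 ? 0 : EF_MIPS_FP64;

	return (e_flags & must_be_zero) == 0;
}

/* Pick the FR mode for an O32 binary that passed the load check. */
static enum fp_mode o32_select_fp_mode(uint32_t e_flags, int kernel_has_fp64)
{
	assert(o32_binary_acceptable(e_flags, kernel_has_fp64));
	return (e_flags & EF_MIPS_FP64) ? FP_MODE_FR1 : FP_MODE_FR0;
}
```

In the patch itself the mode is not stored directly; the ELF flag is
recorded as the inverse thread flag TIF_32BIT_FPREGS, and Status.FR is
only changed when the thread takes ownership of the FPU.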

Inspired by an earlier patch by Leonid Yegoshin, but implemented more
cleanly & correctly.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
---
Changes in v3:
  - Drop dependency on CONFIG_CPU_MIPSR2.
  - Refuse to execute O32 binaries requiring 64-bit FP when the kernel
    doesn't include support for it (via elf_check_arch), rather than
    killing the process later once it executes an FP instruction.

Changes in v2:
  - Handle TIF_32BIT_FPREGS in PTRACE_P{EEK,OKE}USR.
---
 arch/mips/Kconfig                   |  17 ++++++
 arch/mips/include/asm/asmmacro-32.h |  42 --------------
 arch/mips/include/asm/asmmacro-64.h |  96 --------------------------------
 arch/mips/include/asm/asmmacro.h    | 107 ++++++++++++++++++++++++++++++++++++
 arch/mips/include/asm/elf.h         |  31 ++++++++++-
 arch/mips/include/asm/fpu.h         |  91 +++++++++++++++++++++++++-----
 arch/mips/include/asm/thread_info.h |   4 +-
 arch/mips/kernel/binfmt_elfo32.c    |  14 +++++
 arch/mips/kernel/cpu-probe.c        |   2 +-
 arch/mips/kernel/process.c          |   3 -
 arch/mips/kernel/ptrace.c           |  60 +++++++++++---------
 arch/mips/kernel/ptrace32.c         |  53 ++++++++++--------
 arch/mips/kernel/r4k_fpu.S          |  74 +++++++++++++++++++++++--
 arch/mips/kernel/r4k_switch.S       |  45 ++++++++++++++-
 arch/mips/kernel/signal.c           |  10 ++--
 arch/mips/kernel/signal32.c         |  10 ++--
 arch/mips/kernel/traps.c            |  10 ++--
 arch/mips/math-emu/cp1emu.c         |  10 ++--
 arch/mips/math-emu/kernel_linkage.c |   6 +-
 19 files changed, 449 insertions(+), 236 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 17cc7ff..3c3cb32 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2335,6 +2335,23 @@ config CC_STACKPROTECTOR
 
 	  This feature requires gcc version 4.2 or above.
 
+config MIPS_O32_FP64_SUPPORT
+	bool "Support for O32 binaries using 64-bit FP"
+	depends on 32BIT || MIPS32_O32
+	default y
+	help
+	  When this is enabled, the kernel will support use of 64-bit floating
+	  point registers with binaries using the O32 ABI along with the
+	  EF_MIPS_FP64 ELF header flag (typically built with -mfp64). On
+	  mips32 systems this support is at the cost of increasing the size
+	  and complexity of the compiled FPU emulator. Thus if you are running
+	  a mips32 system and know that none of your userland binaries will
+	  require 64-bit floating point, you may wish to reduce the size of
+	  your kernel & potentially improve FP emulation performance by saying
+	  N here.
+
+	  If unsure, say Y.
+
 config USE_OF
 	bool
 	select OF
diff --git a/arch/mips/include/asm/asmmacro-32.h b/arch/mips/include/asm/asmmacro-32.h
index 2413afe..70e1f17 100644
--- a/arch/mips/include/asm/asmmacro-32.h
+++ b/arch/mips/include/asm/asmmacro-32.h
@@ -12,27 +12,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_double thread status tmp1=t0
-	cfc1	\tmp1,  fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp1, THREAD_FCR31(\thread)
-	.endm
-
 	.macro	fpu_save_single thread tmp=t0
 	cfc1	\tmp,  fcr31
 	swc1	$f0,  THREAD_FPR0(\thread)
@@ -70,27 +49,6 @@
 	sw	\tmp, THREAD_FCR31(\thread)
 	.endm
 
-	.macro	fpu_restore_double thread status tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
 	.macro	fpu_restore_single thread tmp=t0
 	lw	\tmp, THREAD_FCR31(\thread)
 	lwc1	$f0,  THREAD_FPR0(\thread)
diff --git a/arch/mips/include/asm/asmmacro-64.h b/arch/mips/include/asm/asmmacro-64.h
index 08a527d..38ea609 100644
--- a/arch/mips/include/asm/asmmacro-64.h
+++ b/arch/mips/include/asm/asmmacro-64.h
@@ -13,102 +13,6 @@
 #include <asm/fpregdef.h>
 #include <asm/mipsregs.h>
 
-	.macro	fpu_save_16even thread tmp=t0
-	cfc1	\tmp, fcr31
-	sdc1	$f0,  THREAD_FPR0(\thread)
-	sdc1	$f2,  THREAD_FPR2(\thread)
-	sdc1	$f4,  THREAD_FPR4(\thread)
-	sdc1	$f6,  THREAD_FPR6(\thread)
-	sdc1	$f8,  THREAD_FPR8(\thread)
-	sdc1	$f10, THREAD_FPR10(\thread)
-	sdc1	$f12, THREAD_FPR12(\thread)
-	sdc1	$f14, THREAD_FPR14(\thread)
-	sdc1	$f16, THREAD_FPR16(\thread)
-	sdc1	$f18, THREAD_FPR18(\thread)
-	sdc1	$f20, THREAD_FPR20(\thread)
-	sdc1	$f22, THREAD_FPR22(\thread)
-	sdc1	$f24, THREAD_FPR24(\thread)
-	sdc1	$f26, THREAD_FPR26(\thread)
-	sdc1	$f28, THREAD_FPR28(\thread)
-	sdc1	$f30, THREAD_FPR30(\thread)
-	sw	\tmp, THREAD_FCR31(\thread)
-	.endm
-
-	.macro	fpu_save_16odd thread
-	sdc1	$f1,  THREAD_FPR1(\thread)
-	sdc1	$f3,  THREAD_FPR3(\thread)
-	sdc1	$f5,  THREAD_FPR5(\thread)
-	sdc1	$f7,  THREAD_FPR7(\thread)
-	sdc1	$f9,  THREAD_FPR9(\thread)
-	sdc1	$f11, THREAD_FPR11(\thread)
-	sdc1	$f13, THREAD_FPR13(\thread)
-	sdc1	$f15, THREAD_FPR15(\thread)
-	sdc1	$f17, THREAD_FPR17(\thread)
-	sdc1	$f19, THREAD_FPR19(\thread)
-	sdc1	$f21, THREAD_FPR21(\thread)
-	sdc1	$f23, THREAD_FPR23(\thread)
-	sdc1	$f25, THREAD_FPR25(\thread)
-	sdc1	$f27, THREAD_FPR27(\thread)
-	sdc1	$f29, THREAD_FPR29(\thread)
-	sdc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_save_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 2f
-	fpu_save_16odd \thread
-2:
-	fpu_save_16even \thread \tmp
-	.endm
-
-	.macro	fpu_restore_16even thread tmp=t0
-	lw	\tmp, THREAD_FCR31(\thread)
-	ldc1	$f0,  THREAD_FPR0(\thread)
-	ldc1	$f2,  THREAD_FPR2(\thread)
-	ldc1	$f4,  THREAD_FPR4(\thread)
-	ldc1	$f6,  THREAD_FPR6(\thread)
-	ldc1	$f8,  THREAD_FPR8(\thread)
-	ldc1	$f10, THREAD_FPR10(\thread)
-	ldc1	$f12, THREAD_FPR12(\thread)
-	ldc1	$f14, THREAD_FPR14(\thread)
-	ldc1	$f16, THREAD_FPR16(\thread)
-	ldc1	$f18, THREAD_FPR18(\thread)
-	ldc1	$f20, THREAD_FPR20(\thread)
-	ldc1	$f22, THREAD_FPR22(\thread)
-	ldc1	$f24, THREAD_FPR24(\thread)
-	ldc1	$f26, THREAD_FPR26(\thread)
-	ldc1	$f28, THREAD_FPR28(\thread)
-	ldc1	$f30, THREAD_FPR30(\thread)
-	ctc1	\tmp, fcr31
-	.endm
-
-	.macro	fpu_restore_16odd thread
-	ldc1	$f1,  THREAD_FPR1(\thread)
-	ldc1	$f3,  THREAD_FPR3(\thread)
-	ldc1	$f5,  THREAD_FPR5(\thread)
-	ldc1	$f7,  THREAD_FPR7(\thread)
-	ldc1	$f9,  THREAD_FPR9(\thread)
-	ldc1	$f11, THREAD_FPR11(\thread)
-	ldc1	$f13, THREAD_FPR13(\thread)
-	ldc1	$f15, THREAD_FPR15(\thread)
-	ldc1	$f17, THREAD_FPR17(\thread)
-	ldc1	$f19, THREAD_FPR19(\thread)
-	ldc1	$f21, THREAD_FPR21(\thread)
-	ldc1	$f23, THREAD_FPR23(\thread)
-	ldc1	$f25, THREAD_FPR25(\thread)
-	ldc1	$f27, THREAD_FPR27(\thread)
-	ldc1	$f29, THREAD_FPR29(\thread)
-	ldc1	$f31, THREAD_FPR31(\thread)
-	.endm
-
-	.macro	fpu_restore_double thread status tmp
-	sll	\tmp, \status, 5
-	bgez	\tmp, 1f				# 16 register mode?
-
-	fpu_restore_16odd \thread
-1:	fpu_restore_16even \thread \tmp
-	.endm
-
 	.macro	cpu_save_nonscratch thread
 	LONG_S	s0, THREAD_REG16(\thread)
 	LONG_S	s1, THREAD_REG17(\thread)
diff --git a/arch/mips/include/asm/asmmacro.h b/arch/mips/include/asm/asmmacro.h
index 6c8342a..3220c93 100644
--- a/arch/mips/include/asm/asmmacro.h
+++ b/arch/mips/include/asm/asmmacro.h
@@ -62,6 +62,113 @@
 	.endm
 #endif /* CONFIG_MIPS_MT_SMTC */
 
+	.macro	fpu_save_16even thread tmp=t0
+	cfc1	\tmp, fcr31
+	sdc1	$f0,  THREAD_FPR0(\thread)
+	sdc1	$f2,  THREAD_FPR2(\thread)
+	sdc1	$f4,  THREAD_FPR4(\thread)
+	sdc1	$f6,  THREAD_FPR6(\thread)
+	sdc1	$f8,  THREAD_FPR8(\thread)
+	sdc1	$f10, THREAD_FPR10(\thread)
+	sdc1	$f12, THREAD_FPR12(\thread)
+	sdc1	$f14, THREAD_FPR14(\thread)
+	sdc1	$f16, THREAD_FPR16(\thread)
+	sdc1	$f18, THREAD_FPR18(\thread)
+	sdc1	$f20, THREAD_FPR20(\thread)
+	sdc1	$f22, THREAD_FPR22(\thread)
+	sdc1	$f24, THREAD_FPR24(\thread)
+	sdc1	$f26, THREAD_FPR26(\thread)
+	sdc1	$f28, THREAD_FPR28(\thread)
+	sdc1	$f30, THREAD_FPR30(\thread)
+	sw	\tmp, THREAD_FCR31(\thread)
+	.endm
+
+	.macro	fpu_save_16odd thread
+	.set	push
+	.set	mips64r2
+	sdc1	$f1,  THREAD_FPR1(\thread)
+	sdc1	$f3,  THREAD_FPR3(\thread)
+	sdc1	$f5,  THREAD_FPR5(\thread)
+	sdc1	$f7,  THREAD_FPR7(\thread)
+	sdc1	$f9,  THREAD_FPR9(\thread)
+	sdc1	$f11, THREAD_FPR11(\thread)
+	sdc1	$f13, THREAD_FPR13(\thread)
+	sdc1	$f15, THREAD_FPR15(\thread)
+	sdc1	$f17, THREAD_FPR17(\thread)
+	sdc1	$f19, THREAD_FPR19(\thread)
+	sdc1	$f21, THREAD_FPR21(\thread)
+	sdc1	$f23, THREAD_FPR23(\thread)
+	sdc1	$f25, THREAD_FPR25(\thread)
+	sdc1	$f27, THREAD_FPR27(\thread)
+	sdc1	$f29, THREAD_FPR29(\thread)
+	sdc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_save_double thread status tmp
+#if defined(CONFIG_MIPS64) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f
+	fpu_save_16odd \thread
+10:
+#endif
+	fpu_save_16even \thread \tmp
+	.endm
+
+	.macro	fpu_restore_16even thread tmp=t0
+	lw	\tmp, THREAD_FCR31(\thread)
+	ldc1	$f0,  THREAD_FPR0(\thread)
+	ldc1	$f2,  THREAD_FPR2(\thread)
+	ldc1	$f4,  THREAD_FPR4(\thread)
+	ldc1	$f6,  THREAD_FPR6(\thread)
+	ldc1	$f8,  THREAD_FPR8(\thread)
+	ldc1	$f10, THREAD_FPR10(\thread)
+	ldc1	$f12, THREAD_FPR12(\thread)
+	ldc1	$f14, THREAD_FPR14(\thread)
+	ldc1	$f16, THREAD_FPR16(\thread)
+	ldc1	$f18, THREAD_FPR18(\thread)
+	ldc1	$f20, THREAD_FPR20(\thread)
+	ldc1	$f22, THREAD_FPR22(\thread)
+	ldc1	$f24, THREAD_FPR24(\thread)
+	ldc1	$f26, THREAD_FPR26(\thread)
+	ldc1	$f28, THREAD_FPR28(\thread)
+	ldc1	$f30, THREAD_FPR30(\thread)
+	ctc1	\tmp, fcr31
+	.endm
+
+	.macro	fpu_restore_16odd thread
+	.set	push
+	.set	mips64r2
+	ldc1	$f1,  THREAD_FPR1(\thread)
+	ldc1	$f3,  THREAD_FPR3(\thread)
+	ldc1	$f5,  THREAD_FPR5(\thread)
+	ldc1	$f7,  THREAD_FPR7(\thread)
+	ldc1	$f9,  THREAD_FPR9(\thread)
+	ldc1	$f11, THREAD_FPR11(\thread)
+	ldc1	$f13, THREAD_FPR13(\thread)
+	ldc1	$f15, THREAD_FPR15(\thread)
+	ldc1	$f17, THREAD_FPR17(\thread)
+	ldc1	$f19, THREAD_FPR19(\thread)
+	ldc1	$f21, THREAD_FPR21(\thread)
+	ldc1	$f23, THREAD_FPR23(\thread)
+	ldc1	$f25, THREAD_FPR25(\thread)
+	ldc1	$f27, THREAD_FPR27(\thread)
+	ldc1	$f29, THREAD_FPR29(\thread)
+	ldc1	$f31, THREAD_FPR31(\thread)
+	.set	pop
+	.endm
+
+	.macro	fpu_restore_double thread status tmp
+#if defined(CONFIG_MIPS64) || defined(CONFIG_CPU_MIPS32_R2)
+	sll	\tmp, \status, 5
+	bgez	\tmp, 10f				# 16 register mode?
+
+	fpu_restore_16odd \thread
+10:
+#endif
+	fpu_restore_16even \thread \tmp
+	.endm
+
 /*
  * Temporary until all gas have MT ASE support
  */
diff --git a/arch/mips/include/asm/elf.h b/arch/mips/include/asm/elf.h
index a66359e..d414405 100644
--- a/arch/mips/include/asm/elf.h
+++ b/arch/mips/include/asm/elf.h
@@ -36,6 +36,7 @@
 #define EF_MIPS_ABI2		0x00000020
 #define EF_MIPS_OPTIONS_FIRST	0x00000080
 #define EF_MIPS_32BITMODE	0x00000100
+#define EF_MIPS_FP64		0x00000200
 #define EF_MIPS_ABI		0x0000f000
 #define EF_MIPS_ARCH		0xf0000000
 
@@ -176,6 +177,18 @@ typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
 #ifdef CONFIG_32BIT
 
 /*
+ * In order to be sure that we don't attempt to execute an O32 binary which
+ * requires 64 bit FP (FR=1) on a system which does not support it we refuse
+ * to execute any binary which has bits specified by the following macro set
+ * in its ELF header flags.
+ */
+#ifdef CONFIG_MIPS_O32_FP64_SUPPORT
+# define __MIPS_O32_FP64_MUST_BE_ZERO	0
+#else
+# define __MIPS_O32_FP64_MUST_BE_ZERO	EF_MIPS_FP64
+#endif
+
+/*
  * This is used to ensure we don't load something for the wrong architecture.
  */
 #define elf_check_arch(hdr)						\
@@ -192,6 +205,8 @@ typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
 	if (((__h->e_flags & EF_MIPS_ABI) != 0) &&			\
 	    ((__h->e_flags & EF_MIPS_ABI) != EF_MIPS_ABI_O32))		\
 		__res = 0;						\
+	if (__h->e_flags & __MIPS_O32_FP64_MUST_BE_ZERO)		\
+		__res = 0;						\
 									\
 	__res;								\
 })
@@ -249,6 +264,11 @@ extern struct mips_abi mips_abi_n32;
 
 #define SET_PERSONALITY(ex)						\
 do {									\
+	if ((ex).e_flags & EF_MIPS_FP64)				\
+		clear_thread_flag(TIF_32BIT_FPREGS);			\
+	else								\
+		set_thread_flag(TIF_32BIT_FPREGS);			\
+									\
 	if (personality(current->personality) != PER_LINUX)		\
 		set_personality(PER_LINUX);				\
 									\
@@ -271,14 +291,18 @@ do {									\
 #endif
 
 #ifdef CONFIG_MIPS32_O32
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do {								\
 		set_thread_flag(TIF_32BIT_REGS);			\
 		set_thread_flag(TIF_32BIT_ADDR);			\
+									\
+		if (!((ex).e_flags & EF_MIPS_FP64))			\
+			set_thread_flag(TIF_32BIT_FPREGS);		\
+									\
 		current->thread.abi = &mips_abi_32;			\
 	} while (0)
 #else
-#define __SET_PERSONALITY32_O32()					\
+#define __SET_PERSONALITY32_O32(ex)					\
 	do { } while (0)
 #endif
 
@@ -289,7 +313,7 @@ do {									\
 	     ((ex).e_flags & EF_MIPS_ABI) == 0)				\
 		__SET_PERSONALITY32_N32();				\
 	else								\
-		__SET_PERSONALITY32_O32();				\
+		__SET_PERSONALITY32_O32(ex);                            \
 } while (0)
 #else
 #define __SET_PERSONALITY32(ex) do { } while (0)
@@ -300,6 +324,7 @@ do {									\
 	unsigned int p;							\
 									\
 	clear_thread_flag(TIF_32BIT_REGS);				\
+	clear_thread_flag(TIF_32BIT_FPREGS);				\
 	clear_thread_flag(TIF_32BIT_ADDR);				\
 									\
 	if ((ex).e_ident[EI_CLASS] == ELFCLASS32)			\
diff --git a/arch/mips/include/asm/fpu.h b/arch/mips/include/asm/fpu.h
index 3bf023f..cfe092f 100644
--- a/arch/mips/include/asm/fpu.h
+++ b/arch/mips/include/asm/fpu.h
@@ -33,11 +33,48 @@ extern void _init_fpu(void);
 extern void _save_fp(struct task_struct *);
 extern void _restore_fp(struct task_struct *);
 
-#define __enable_fpu()							\
-do {									\
-	set_c0_status(ST0_CU1);						\
-	enable_fpu_hazard();						\
-} while (0)
+/*
+ * This enum specifies a mode in which we want the FPU to operate, for cores
+ * which implement the Status.FR bit. Note that FPU_32BIT & FPU_64BIT
+ * purposefully have the values 0 & 1 respectively, so that an integer value
+ * of Status.FR can be trivially casted to the corresponding enum fpu_mode.
+ */
+enum fpu_mode {
+	FPU_32BIT = 0,		/* FR = 0 */
+	FPU_64BIT,		/* FR = 1 */
+	FPU_AS_IS,
+};
+
+static inline int __enable_fpu(enum fpu_mode mode)
+{
+	int fr;
+
+	switch (mode) {
+	case FPU_AS_IS:
+		/* just enable the FPU in its current mode */
+		set_c0_status(ST0_CU1);
+		enable_fpu_hazard();
+		return 0;
+
+	case FPU_64BIT:
+#if !(defined(CONFIG_CPU_MIPS32_R2) || defined(CONFIG_MIPS64))
+		/* we only have a 32-bit FPU */
+		return SIGFPE;
+#endif
+		/* fall through */
+	case FPU_32BIT:
+		/* set CU1 & change FR appropriately */
+		fr = (int)mode;
+		change_c0_status(ST0_CU1 | ST0_FR, ST0_CU1 | (fr ? ST0_FR : 0));
+		enable_fpu_hazard();
+
+		/* check FR has the desired value */
+		return (!!(read_c0_status() & ST0_FR) == !!fr) ? 0 : SIGFPE;
+
+	default:
+		BUG();
+	}
+}
 
 #define __disable_fpu()							\
 do {									\
@@ -57,27 +94,46 @@ static inline int is_fpu_owner(void)
 	return cpu_has_fpu && __is_fpu_owner();
 }
 
-static inline void __own_fpu(void)
+static inline int __own_fpu(void)
 {
-	__enable_fpu();
+	enum fpu_mode mode;
+	int ret;
+
+	mode = !test_thread_flag(TIF_32BIT_FPREGS);
+	ret = __enable_fpu(mode);
+	if (ret)
+		return ret;
+
 	KSTK_STATUS(current) |= ST0_CU1;
+	if (mode == FPU_64BIT)
+		KSTK_STATUS(current) |= ST0_FR;
+	else /* mode == FPU_32BIT */
+		KSTK_STATUS(current) &= ~ST0_FR;
+
 	set_thread_flag(TIF_USEDFPU);
+	return 0;
 }
 
-static inline void own_fpu_inatomic(int restore)
+static inline int own_fpu_inatomic(int restore)
 {
+	int ret = 0;
+
 	if (cpu_has_fpu && !__is_fpu_owner()) {
-		__own_fpu();
-		if (restore)
+		ret = __own_fpu();
+		if (restore && !ret)
 			_restore_fp(current);
 	}
+	return ret;
 }
 
-static inline void own_fpu(int restore)
+static inline int own_fpu(int restore)
 {
+	int ret;
+
 	preempt_disable();
-	own_fpu_inatomic(restore);
+	ret = own_fpu_inatomic(restore);
 	preempt_enable();
+	return ret;
 }
 
 static inline void lose_fpu(int save)
@@ -93,16 +149,21 @@ static inline void lose_fpu(int save)
 	preempt_enable();
 }
 
-static inline void init_fpu(void)
+static inline int init_fpu(void)
 {
+	int ret = 0;
+
 	preempt_disable();
 	if (cpu_has_fpu) {
-		__own_fpu();
-		_init_fpu();
+		ret = __own_fpu();
+		if (!ret)
+			_init_fpu();
 	} else {
 		fpu_emulator_init_fpu();
 	}
+
 	preempt_enable();
+	return ret;
 }
 
 static inline void save_fp(struct task_struct *tsk)
diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
index f9b24bf..b6da8b7 100644
--- a/arch/mips/include/asm/thread_info.h
+++ b/arch/mips/include/asm/thread_info.h
@@ -112,11 +112,12 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOHZ		19	/* in adaptive nohz mode */
 #define TIF_FIXADE		20	/* Fix address errors in software */
 #define TIF_LOGADE		21	/* Log address errors to syslog */
-#define TIF_32BIT_REGS		22	/* also implies 16/32 fprs */
+#define TIF_32BIT_REGS		22	/* 32-bit general purpose registers */
 #define TIF_32BIT_ADDR		23	/* 32-bit address space (o32/n32) */
 #define TIF_FPUBOUND		24	/* thread bound to FPU-full CPU set */
 #define TIF_LOAD_WATCH		25	/* If set, load watch registers */
 #define TIF_SYSCALL_TRACEPOINT	26	/* syscall tracepoint instrumentation */
+#define TIF_32BIT_FPREGS	27	/* 32-bit floating point registers */
 #define TIF_SYSCALL_TRACE	31	/* syscall trace active */
 
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -133,6 +134,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_32BIT_ADDR		(1<<TIF_32BIT_ADDR)
 #define _TIF_FPUBOUND		(1<<TIF_FPUBOUND)
 #define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
+#define _TIF_32BIT_FPREGS	(1<<TIF_32BIT_FPREGS)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 
 #define _TIF_WORK_SYSCALL_ENTRY	(_TIF_NOHZ | _TIF_SYSCALL_TRACE |	\
diff --git a/arch/mips/kernel/binfmt_elfo32.c b/arch/mips/kernel/binfmt_elfo32.c
index 202e581..7faf5f2 100644
--- a/arch/mips/kernel/binfmt_elfo32.c
+++ b/arch/mips/kernel/binfmt_elfo32.c
@@ -28,6 +28,18 @@ typedef double elf_fpreg_t;
 typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
 
 /*
+ * In order to be sure that we don't attempt to execute an O32 binary which
+ * requires 64 bit FP (FR=1) on a system which does not support it we refuse
+ * to execute any binary which has bits specified by the following macro set
+ * in its ELF header flags.
+ */
+#ifdef CONFIG_MIPS_O32_FP64_SUPPORT
+# define __MIPS_O32_FP64_MUST_BE_ZERO	0
+#else
+# define __MIPS_O32_FP64_MUST_BE_ZERO	EF_MIPS_FP64
+#endif
+
+/*
  * This is used to ensure we don't load something for the wrong architecture.
  */
 #define elf_check_arch(hdr)						\
@@ -44,6 +56,8 @@ typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG];
 	if (((__h->e_flags & EF_MIPS_ABI) != 0) &&			\
 	    ((__h->e_flags & EF_MIPS_ABI) != EF_MIPS_ABI_O32))		\
 		__res = 0;						\
+	if (__h->e_flags & __MIPS_O32_FP64_MUST_BE_ZERO)		\
+		__res = 0;						\
 									\
 	__res;								\
 })
diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c
index c814287..e2b2d20 100644
--- a/arch/mips/kernel/cpu-probe.c
+++ b/arch/mips/kernel/cpu-probe.c
@@ -112,7 +112,7 @@ static inline unsigned long cpu_get_fpu_id(void)
 	unsigned long tmp, fpu_id;
 
 	tmp = read_c0_status();
-	__enable_fpu();
+	__enable_fpu(FPU_AS_IS);
 	fpu_id = read_32bit_cp1_register(CP1_REVISION);
 	write_c0_status(tmp);
 	return fpu_id;
diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
index ddc7610..747a6cf 100644
--- a/arch/mips/kernel/process.c
+++ b/arch/mips/kernel/process.c
@@ -60,9 +60,6 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
 
 	/* New thread loses kernel privileges. */
 	status = regs->cp0_status & ~(ST0_CU0|ST0_CU1|ST0_FR|KU_MASK);
-#ifdef CONFIG_64BIT
-	status |= test_thread_flag(TIF_32BIT_REGS) ? 0 : ST0_FR;
-#endif
 	status |= KU_USER;
 	regs->cp0_status = status;
 	clear_used_math();
diff --git a/arch/mips/kernel/ptrace.c b/arch/mips/kernel/ptrace.c
index b52e1d2..7da9b76 100644
--- a/arch/mips/kernel/ptrace.c
+++ b/arch/mips/kernel/ptrace.c
@@ -137,13 +137,13 @@ int ptrace_getfpregs(struct task_struct *child, __u32 __user *data)
 		if (cpu_has_mipsmt) {
 			unsigned int vpflags = dvpe();
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 			evpe(vpflags);
 		} else {
 			flags = read_c0_status();
-			__enable_fpu();
+			__enable_fpu(FPU_AS_IS);
 			__asm__ __volatile__("cfc1\t%0,$0" : "=r" (tmp));
 			write_c0_status(flags);
 		}
@@ -408,6 +408,7 @@ long arch_ptrace(struct task_struct *child, long request,
 	/* Read the word at location addr in the USER area. */
 	case PTRACE_PEEKUSR: {
 		struct pt_regs *regs;
+		fpureg_t *fregs;
 		unsigned long tmp = 0;
 
 		regs = task_pt_regs(child);
@@ -418,26 +419,28 @@ long arch_ptrace(struct task_struct *child, long request,
 			tmp = regs->regs[addr];
 			break;
 		case FPR_BASE ... FPR_BASE + 31:
-			if (tsk_used_math(child)) {
-				fpureg_t *fregs = get_fpu_regs(child);
+			if (!tsk_used_math(child)) {
+				/* FP not yet used */
+				tmp = -1;
+				break;
+			}
+			fregs = get_fpu_regs(child);
 
 #ifdef CONFIG_32BIT
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
 				/*
 				 * The odd registers are actually the high
 				 * order bits of the values stored in the even
 				 * registers - unless we're using r2k_switch.S.
 				 */
 				if (addr & 1)
-					tmp = (unsigned long) (fregs[((addr & ~1) - 32)] >> 32);
+					tmp = fregs[(addr & ~1) - 32] >> 32;
 				else
-					tmp = (unsigned long) (fregs[(addr - 32)] & 0xffffffff);
-#endif
-#ifdef CONFIG_64BIT
-				tmp = fregs[addr - FPR_BASE];
-#endif
-			} else {
-				tmp = -1;	/* FP not yet used  */
+					tmp = fregs[addr - 32];
+				break;
 			}
+#endif
+			tmp = fregs[addr - FPR_BASE];
 			break;
 		case PC:
 			tmp = regs->cp0_epc;
@@ -483,13 +486,13 @@ long arch_ptrace(struct task_struct *child, long request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
@@ -554,22 +557,25 @@ long arch_ptrace(struct task_struct *child, long request,
 				child->thread.fpu.fcr31 = 0;
 			}
 #ifdef CONFIG_32BIT
-			/*
-			 * The odd registers are actually the high order bits
-			 * of the values stored in the even registers - unless
-			 * we're using r2k_switch.S.
-			 */
-			if (addr & 1) {
-				fregs[(addr & ~1) - FPR_BASE] &= 0xffffffff;
-				fregs[(addr & ~1) - FPR_BASE] |= ((unsigned long long) data) << 32;
-			} else {
-				fregs[addr - FPR_BASE] &= ~0xffffffffLL;
-				fregs[addr - FPR_BASE] |= data;
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
+				/*
+				 * The odd registers are actually the high
+				 * order bits of the values stored in the even
+				 * registers - unless we're using r2k_switch.S.
+				 */
+				if (addr & 1) {
+					fregs[(addr & ~1) - FPR_BASE] &=
+						0xffffffff;
+					fregs[(addr & ~1) - FPR_BASE] |=
+						((u64)data) << 32;
+				} else {
+					fregs[addr - FPR_BASE] &= ~0xffffffffLL;
+					fregs[addr - FPR_BASE] |= data;
+				}
+				break;
 			}
 #endif
-#ifdef CONFIG_64BIT
 			fregs[addr - FPR_BASE] = data;
-#endif
 			break;
 		}
 		case PC:
diff --git a/arch/mips/kernel/ptrace32.c b/arch/mips/kernel/ptrace32.c
index 9486055..b8aa2dd 100644
--- a/arch/mips/kernel/ptrace32.c
+++ b/arch/mips/kernel/ptrace32.c
@@ -80,6 +80,7 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 	/* Read the word at location addr in the USER area. */
 	case PTRACE_PEEKUSR: {
 		struct pt_regs *regs;
+		fpureg_t *fregs;
 		unsigned int tmp;
 
 		regs = task_pt_regs(child);
@@ -90,21 +91,25 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			tmp = regs->regs[addr];
 			break;
 		case FPR_BASE ... FPR_BASE + 31:
-			if (tsk_used_math(child)) {
-				fpureg_t *fregs = get_fpu_regs(child);
-
+			if (!tsk_used_math(child)) {
+				/* FP not yet used */
+				tmp = -1;
+				break;
+			}
+			fregs = get_fpu_regs(child);
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
 				/*
 				 * The odd registers are actually the high
 				 * order bits of the values stored in the even
 				 * registers - unless we're using r2k_switch.S.
 				 */
 				if (addr & 1)
-					tmp = (unsigned long) (fregs[((addr & ~1) - 32)] >> 32);
+					tmp = fregs[(addr & ~1) - 32] >> 32;
 				else
-					tmp = (unsigned long) (fregs[(addr - 32)] & 0xffffffff);
-			} else {
-				tmp = -1;	/* FP not yet used  */
+					tmp = fregs[addr - 32];
+				break;
 			}
+			tmp = fregs[addr - FPR_BASE];
 			break;
 		case PC:
 			tmp = regs->cp0_epc;
@@ -147,13 +152,13 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 			if (cpu_has_mipsmt) {
 				unsigned int vpflags = dvpe();
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 				evpe(vpflags);
 			} else {
 				flags = read_c0_status();
-				__enable_fpu();
+				__enable_fpu(FPU_AS_IS);
 				__asm__ __volatile__("cfc1\t%0,$0": "=r" (tmp));
 				write_c0_status(flags);
 			}
@@ -236,20 +241,24 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
 				       sizeof(child->thread.fpu));
 				child->thread.fpu.fcr31 = 0;
 			}
-			/*
-			 * The odd registers are actually the high order bits
-			 * of the values stored in the even registers - unless
-			 * we're using r2k_switch.S.
-			 */
-			if (addr & 1) {
-				fregs[(addr & ~1) - FPR_BASE] &= 0xffffffff;
-				fregs[(addr & ~1) - FPR_BASE] |= ((unsigned long long) data) << 32;
-			} else {
-				fregs[addr - FPR_BASE] &= ~0xffffffffLL;
-				/* Must cast, lest sign extension fill upper
-				   bits!  */
-				fregs[addr - FPR_BASE] |= (unsigned int)data;
+			if (test_thread_flag(TIF_32BIT_FPREGS)) {
+				/*
+				 * The odd registers are actually the high
+				 * order bits of the values stored in the even
+				 * registers - unless we're using r2k_switch.S.
+				 */
+				if (addr & 1) {
+					fregs[(addr & ~1) - FPR_BASE] &=
+						0xffffffff;
+					fregs[(addr & ~1) - FPR_BASE] |=
+						((u64)data) << 32;
+				} else {
+					fregs[addr - FPR_BASE] &= ~0xffffffffLL;
+					fregs[addr - FPR_BASE] |= data;
+				}
+				break;
 			}
+			fregs[addr - FPR_BASE] = data;
 			break;
 		}
 		case PC:
diff --git a/arch/mips/kernel/r4k_fpu.S b/arch/mips/kernel/r4k_fpu.S
index 55ffe14..253b2fb 100644
--- a/arch/mips/kernel/r4k_fpu.S
+++ b/arch/mips/kernel/r4k_fpu.S
@@ -35,7 +35,15 @@
 LEAF(_save_fp_context)
 	cfc1	t1, fcr31
 
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+#endif
 	/* Store the 16 odd double precision registers */
 	EX	sdc1 $f1, SC_FPREGS+8(a0)
 	EX	sdc1 $f3, SC_FPREGS+24(a0)
@@ -53,6 +61,7 @@ LEAF(_save_fp_context)
 	EX	sdc1 $f27, SC_FPREGS+216(a0)
 	EX	sdc1 $f29, SC_FPREGS+232(a0)
 	EX	sdc1 $f31, SC_FPREGS+248(a0)
+1:	.set	pop
 #endif
 
 	/* Store the 16 even double precision registers */
@@ -82,7 +91,31 @@ LEAF(_save_fp_context)
 LEAF(_save_fp_context32)
 	cfc1	t1, fcr31
 
-	EX	sdc1 $f0, SC32_FPREGS+0(a0)
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip storing odd if FR=0
+	 nop
+
+	/* Store the 16 odd double precision registers */
+	EX      sdc1 $f1, SC32_FPREGS+8(a0)
+	EX      sdc1 $f3, SC32_FPREGS+24(a0)
+	EX      sdc1 $f5, SC32_FPREGS+40(a0)
+	EX      sdc1 $f7, SC32_FPREGS+56(a0)
+	EX      sdc1 $f9, SC32_FPREGS+72(a0)
+	EX      sdc1 $f11, SC32_FPREGS+88(a0)
+	EX      sdc1 $f13, SC32_FPREGS+104(a0)
+	EX      sdc1 $f15, SC32_FPREGS+120(a0)
+	EX      sdc1 $f17, SC32_FPREGS+136(a0)
+	EX      sdc1 $f19, SC32_FPREGS+152(a0)
+	EX      sdc1 $f21, SC32_FPREGS+168(a0)
+	EX      sdc1 $f23, SC32_FPREGS+184(a0)
+	EX      sdc1 $f25, SC32_FPREGS+200(a0)
+	EX      sdc1 $f27, SC32_FPREGS+216(a0)
+	EX      sdc1 $f29, SC32_FPREGS+232(a0)
+	EX      sdc1 $f31, SC32_FPREGS+248(a0)
+
+	/* Store the 16 even double precision registers */
+1:	EX	sdc1 $f0, SC32_FPREGS+0(a0)
 	EX	sdc1 $f2, SC32_FPREGS+16(a0)
 	EX	sdc1 $f4, SC32_FPREGS+32(a0)
 	EX	sdc1 $f6, SC32_FPREGS+48(a0)
@@ -114,7 +147,16 @@ LEAF(_save_fp_context32)
  */
 LEAF(_restore_fp_context)
 	EX	lw t0, SC_FPC_CSR(a0)
-#ifdef CONFIG_64BIT
+
+#if defined(CONFIG_64BIT) || defined(CONFIG_MIPS32_R2)
+	.set	push
+#ifdef CONFIG_MIPS32_R2
+	.set	mips64r2
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+#endif
 	EX	ldc1 $f1, SC_FPREGS+8(a0)
 	EX	ldc1 $f3, SC_FPREGS+24(a0)
 	EX	ldc1 $f5, SC_FPREGS+40(a0)
@@ -131,6 +173,7 @@ LEAF(_restore_fp_context)
 	EX	ldc1 $f27, SC_FPREGS+216(a0)
 	EX	ldc1 $f29, SC_FPREGS+232(a0)
 	EX	ldc1 $f31, SC_FPREGS+248(a0)
+1:	.set pop
 #endif
 	EX	ldc1 $f0, SC_FPREGS+0(a0)
 	EX	ldc1 $f2, SC_FPREGS+16(a0)
@@ -157,7 +200,30 @@ LEAF(_restore_fp_context)
 LEAF(_restore_fp_context32)
 	/* Restore an o32 sigcontext.  */
 	EX	lw t0, SC32_FPC_CSR(a0)
-	EX	ldc1 $f0, SC32_FPREGS+0(a0)
+
+	mfc0	t0, CP0_STATUS
+	sll	t0, t0, 5
+	bgez	t0, 1f			# skip loading odd if FR=0
+	 nop
+
+	EX      ldc1 $f1, SC32_FPREGS+8(a0)
+	EX      ldc1 $f3, SC32_FPREGS+24(a0)
+	EX      ldc1 $f5, SC32_FPREGS+40(a0)
+	EX      ldc1 $f7, SC32_FPREGS+56(a0)
+	EX      ldc1 $f9, SC32_FPREGS+72(a0)
+	EX      ldc1 $f11, SC32_FPREGS+88(a0)
+	EX      ldc1 $f13, SC32_FPREGS+104(a0)
+	EX      ldc1 $f15, SC32_FPREGS+120(a0)
+	EX      ldc1 $f17, SC32_FPREGS+136(a0)
+	EX      ldc1 $f19, SC32_FPREGS+152(a0)
+	EX      ldc1 $f21, SC32_FPREGS+168(a0)
+	EX      ldc1 $f23, SC32_FPREGS+184(a0)
+	EX      ldc1 $f25, SC32_FPREGS+200(a0)
+	EX      ldc1 $f27, SC32_FPREGS+216(a0)
+	EX      ldc1 $f29, SC32_FPREGS+232(a0)
+	EX      ldc1 $f31, SC32_FPREGS+248(a0)
+
+1:	EX	ldc1 $f0, SC32_FPREGS+0(a0)
 	EX	ldc1 $f2, SC32_FPREGS+16(a0)
 	EX	ldc1 $f4, SC32_FPREGS+32(a0)
 	EX	ldc1 $f6, SC32_FPREGS+48(a0)
diff --git a/arch/mips/kernel/r4k_switch.S b/arch/mips/kernel/r4k_switch.S
index 078de5e..cc78dd9 100644
--- a/arch/mips/kernel/r4k_switch.S
+++ b/arch/mips/kernel/r4k_switch.S
@@ -123,7 +123,7 @@
  * Save a thread's fp context.
  */
 LEAF(_save_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_save_double a0 t0 t1		# clobbers t1
@@ -134,7 +134,7 @@ LEAF(_save_fp)
  * Restore a thread's fp context.
  */
 LEAF(_restore_fp)
-#ifdef CONFIG_64BIT
+#if defined(CONFIG_64BIT) || defined(CONFIG_CPU_MIPS32_R2)
 	mfc0	t0, CP0_STATUS
 #endif
 	fpu_restore_double a0 t0 t1		# clobbers t1
@@ -228,6 +228,47 @@ LEAF(_init_fpu)
 	mtc1	t1, $f29
 	mtc1	t1, $f30
 	mtc1	t1, $f31
+
+#ifdef CONFIG_CPU_MIPS32_R2
+	.set    push
+	.set    mips64r2
+	sll     t0, t0, 5			# is Status.FR set?
+	bgez    t0, 1f				# no: skip setting upper 32b
+
+	mthc1   t1, $f0
+	mthc1   t1, $f1
+	mthc1   t1, $f2
+	mthc1   t1, $f3
+	mthc1   t1, $f4
+	mthc1   t1, $f5
+	mthc1   t1, $f6
+	mthc1   t1, $f7
+	mthc1   t1, $f8
+	mthc1   t1, $f9
+	mthc1   t1, $f10
+	mthc1   t1, $f11
+	mthc1   t1, $f12
+	mthc1   t1, $f13
+	mthc1   t1, $f14
+	mthc1   t1, $f15
+	mthc1   t1, $f16
+	mthc1   t1, $f17
+	mthc1   t1, $f18
+	mthc1   t1, $f19
+	mthc1   t1, $f20
+	mthc1   t1, $f21
+	mthc1   t1, $f22
+	mthc1   t1, $f23
+	mthc1   t1, $f24
+	mthc1   t1, $f25
+	mthc1   t1, $f26
+	mthc1   t1, $f27
+	mthc1   t1, $f28
+	mthc1   t1, $f29
+	mthc1   t1, $f30
+	mthc1   t1, $f31
+1:	.set    pop
+#endif /* CONFIG_CPU_MIPS32_R2 */
 #else
 	.set	mips3
 	dmtc1	t1, $f0
diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
index 2f285ab..5199563 100644
--- a/arch/mips/kernel/signal.c
+++ b/arch/mips/kernel/signal.c
@@ -71,8 +71,9 @@ static int protected_save_fp_context(struct sigcontext __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -91,8 +92,9 @@ static int protected_restore_fp_context(struct sigcontext __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/signal32.c b/arch/mips/kernel/signal32.c
index 57de8b7..7c1024b 100644
--- a/arch/mips/kernel/signal32.c
+++ b/arch/mips/kernel/signal32.c
@@ -85,8 +85,9 @@ static int protected_save_fp_context32(struct sigcontext32 __user *sc)
 	int err;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(1);
-		err = save_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(1);
+		if (!err)
+			err = save_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
@@ -105,8 +106,9 @@ static int protected_restore_fp_context32(struct sigcontext32 __user *sc)
 	int err, tmp __maybe_unused;
 	while (1) {
 		lock_fpu_owner();
-		own_fpu_inatomic(0);
-		err = restore_fp_context32(sc); /* this might fail */
+		err = own_fpu_inatomic(0);
+		if (!err)
+			err = restore_fp_context32(sc); /* this might fail */
 		unlock_fpu_owner();
 		if (likely(!err))
 			break;
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index f9c8746..f40f688 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -1080,7 +1080,7 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 	unsigned long old_epc, old31;
 	unsigned int opcode;
 	unsigned int cpid;
-	int status;
+	int status, err;
 	unsigned long __maybe_unused flags;
 
 	prev_state = exception_enter();
@@ -1153,19 +1153,19 @@ asmlinkage void do_cpu(struct pt_regs *regs)
 
 	case 1:
 		if (used_math())	/* Using the FPU again.	 */
-			own_fpu(1);
+			err = own_fpu(1);
 		else {			/* First time FPU user.	 */
-			init_fpu();
+			err = init_fpu();
 			set_used_math();
 		}
 
-		if (!raw_cpu_has_fpu) {
+		if (!raw_cpu_has_fpu || err) {
 			int sig;
 			void __user *fault_addr = NULL;
 			sig = fpu_emulator_cop1Handler(regs,
 						       &current->thread.fpu,
 						       0, &fault_addr);
-			if (!process_fpemu_return(sig, fault_addr))
+			if (!process_fpemu_return(sig, fault_addr) && !err)
 				mt_ase_fp_affinity();
 		}
 
diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
index 4b37961..22f7b11 100644
--- a/arch/mips/math-emu/cp1emu.c
+++ b/arch/mips/math-emu/cp1emu.c
@@ -859,20 +859,20 @@ static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
  * In the Linux kernel, we support selection of FPR format on the
  * basis of the Status.FR bit.	If an FPU is not present, the FR bit
  * is hardwired to zero, which would imply a 32-bit FPU even for
- * 64-bit CPUs so we rather look at TIF_32BIT_REGS.
+ * 64-bit CPUs so we rather look at TIF_32BIT_FPREGS.
  * FPU emu is slow and bulky and optimizing this function offers fairly
  * sizeable benefits so we try to be clever and make this function return
  * a constant whenever possible, that is on 64-bit kernels without O32
- * compatibility enabled and on 32-bit kernels.
+ * compatibility enabled and on 32-bit without 64-bit FPU support.
  */
 static inline int cop1_64bit(struct pt_regs *xcp)
 {
 #if defined(CONFIG_64BIT) && !defined(CONFIG_MIPS32_O32)
 	return 1;
-#elif defined(CONFIG_64BIT) && defined(CONFIG_MIPS32_O32)
-	return !test_thread_flag(TIF_32BIT_REGS);
-#else
+#elif defined(CONFIG_32BIT) && !defined(CONFIG_MIPS_O32_FP64_SUPPORT)
 	return 0;
+#else
+	return !test_thread_flag(TIF_32BIT_FPREGS);
 #endif
 }
 
diff --git a/arch/mips/math-emu/kernel_linkage.c b/arch/mips/math-emu/kernel_linkage.c
index 1c58657..3aeae07 100644
--- a/arch/mips/math-emu/kernel_linkage.c
+++ b/arch/mips/math-emu/kernel_linkage.c
@@ -89,8 +89,9 @@ int fpu_emulator_save_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __put_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
@@ -103,8 +104,9 @@ int fpu_emulator_restore_context32(struct sigcontext32 __user *sc)
 {
 	int i;
 	int err = 0;
+	int inc = test_thread_flag(TIF_32BIT_FPREGS) ? 2 : 1;
 
-	for (i = 0; i < 32; i+=2) {
+	for (i = 0; i < 32; i += inc) {
 		err |=
 		    __get_user(current->thread.fpu.fpr[i], &sc->sc_fpregs[i]);
 	}
-- 
1.8.4.2


* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2014-07-03 17:56 Ed Swierk
From: Ed Swierk @ 2014-07-03 17:56 UTC (permalink / raw)
  To: linux-mips, Paul Burton; +Cc: ddaney.cavm, ralf

Hi Paul,

I came across your patch while puzzling over mysterious segvs from
programs that do floating-point compares. I'm running a Debian 32-bit
MIPS userland, which is built with hard floating point operations, on
a Cavium Octeon2, which lacks an FPU. Now that Linux makes user stacks
non-executable by default, the current FP emulation approach is simply
broken.

Have you had the opportunity to revisit your patch in light of the
force_sig issue? I'm wondering if instead of trying to free the page
for the FP branch delay emuframe immediately, it would be simpler to
leave it around until the thread is destroyed.

--Ed

Paul Burton <paul.burton@imgtec.com> wrote:
> Hmm, I believe there may still be an issue with this patch. If the
> instruction in the branch delay slot being "emulated" traps to the
> kernel, and the kernel does a force_sig then that signal won't get
> processed because signals are being temporarily ignored. So I think
> we'd go back off to userland & execute the same instruction from the
> branch delay slot again, trap again, force_sig again, go back to
> userland etc etc. I need to think about this...
>
> In the meantime, Ralf: if you get to merging this series please drop
> this patch & 6/6 (the stack exec change) for the time being. The "Some
> (mostly FP) cleanups" series I submitted will still apply after only
> the first 4 patches of this series.
>
> Thanks,
>      Paul
>
> On 08/11/13 14:50, Paul Burton wrote:
>> If a floating point branch instruction (bc1[ft]l?) is emulated,
>> typically because we're running on a core with no FPU, then we need to
>> execute the instruction in its branch delay slot too. This is done by
>> writing that instruction to memory followed by a trap, as part of an
>> "emuframe", and executing it. This avoids the requirement of an emulator
>> for the entire MIPS instruction set. Prior to this patch such emuframes
>> are written to the user stack and executed from there.
>>
>> This patch moves FP branch delay emuframes off of the user stack and
>> into a per-mm page. Allocating a page per-mm leaves userland with access
>> to only what it had access to previously, and prevents processes
>> interfering with each other as they might if a single system-wide page
>> were used. The book-keeping required to track the allocation of
>> emuframes is not cheap, but given that invoking the FP emulator is
>> already very expensive I don't expect this to be an issue.
>>
>> The biggest issue with executing the instruction from an FP branch delay
>> is that we must ensure that we free the frame from which we ran it. That
>> means that we must trap back to the kernel after executing that
>> instruction, which means that we must take special care not to let the
>> PC be changed as a result of that instruction. Fortunately since we're
>> executing an instruction we found in a branch delay the result is
>> unpredictable if that instruction is a branch or jump, so we can simply
>> treat those as NOPs and avoid them causing a problem. However there is
>> still the possibility that a signal may be handled whilst executing the
>> branch delay instruction. This would usually be fine as we would simply
>> execute our trap back to the kernel after sigreturn, however it is
>> possible for userland to simply not return from the signal handler - for
>> example if it executes something like a longjmp. In that case we would
>> never trap back to the kernel and never free the frame. For that reason
>> a TIF_FP_BD_EMU flag is introduced and set whilst we are executing an FP
>> branch delay instruction. Whilst this flag is set, signals will be
>> ignored. This isn't exactly pretty, but it's simpler than most of the
>> alternatives. One other simple option I considered would be to just
>> kill a process if we find a branch in an FP branch delay slot, but I
>> chose the current approach because its result is closer to what would
>> previously happen.
>>
>> The primary benefit of this patch is that we are now free to mark the
>> user stack non-executable where that is possible.
>>
>> Additionally the FP emuframes themselves are simplified somewhat. The
>> cookie field is removed since we can be pretty certain that we're
>> looking at an emuframe by virtue of it being located in the page
>> allocated for them. The PC to continue from is moved into struct
>> thread_struct since the control flow of a thread can no longer be
>> modified for the duration of the 'emulation', meaning there will now
>> only ever be a single emuframe required for a thread at any given time.
>>
>> Signed-off-by: Paul Burton <paul.burton@imgtec.com>
>> ---
>> Changes in v2:
>>    - s/kernels/kernel's/
>>    - Use (mm_)isBranchInstr in mips_dsemul rather than duplicating
>>      similar logic.
>> ---
>>   arch/mips/include/asm/fpu_emulator.h |   4 +
>>   arch/mips/include/asm/mmu.h          |  12 ++
>>   arch/mips/include/asm/mmu_context.h  |   7 +
>>   arch/mips/include/asm/processor.h    |   7 +-
>>   arch/mips/include/asm/thread_info.h  |   2 +
>>   arch/mips/kernel/entry.S             |  13 +-
>>   arch/mips/kernel/process.c           |   2 +
>>   arch/mips/kernel/vdso.c              |   2 +-
>>   arch/mips/math-emu/cp1emu.c          |   4 +-
>>   arch/mips/math-emu/dsemul.c          | 266 ++++++++++++++++++++++++-----------
>>   10 files changed, 226 insertions(+), 93 deletions(-)
>>
>> diff --git a/arch/mips/include/asm/fpu_emulator.h b/arch/mips/include/asm/fpu_emulator.h
>> index 2abb587..16f7b0b 100644
>> --- a/arch/mips/include/asm/fpu_emulator.h
>> +++ b/arch/mips/include/asm/fpu_emulator.h
>> @@ -51,6 +51,8 @@ do { \
>>   #define MIPS_FPU_EMU_INC_STATS(M) do { } while (0)
>>   #endif /* CONFIG_DEBUG_FS */
>>
>> +extern void dsemul_thread_cleanup(void);
>> +extern void dsemul_mm_cleanup(struct mm_struct *mm);
>>   extern int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
>>   unsigned long cpc);
>>   extern int do_dsemulret(struct pt_regs *xcp);
>> @@ -58,6 +60,8 @@ extern int fpu_emulator_cop1Handler(struct pt_regs *xcp,
>>      struct mips_fpu_struct *ctx, int has_fpu,
>>      void *__user *fault_addr);
>>   int process_fpemu_return(int sig, void __user *fault_addr);
>> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
>> +  unsigned long *contpc);
>>   int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
>>       unsigned long *contpc);
>>
>> diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
>> index c436138..08214da 100644
>> --- a/arch/mips/include/asm/mmu.h
>> +++ b/arch/mips/include/asm/mmu.h
>> @@ -1,9 +1,21 @@
>>   #ifndef __ASM_MMU_H
>>   #define __ASM_MMU_H
>>
>> +#include <linux/mutex.h>
>> +#include <linux/wait.h>
>> +
>>   typedef struct {
>>   unsigned long asid[NR_CPUS];
>>   void *vdso;
>> +
>> + /* address of page used to hold FP branch delay emulation frames */
>> + unsigned long fp_bd_emupage;
>> + /* bitmap tracking allocation of fp_bd_emupage */
>> + unsigned long *fp_bd_emupage_allocmap;
>> + /* mutex to be held whilst modifying fp_bd_emupage(_allocmap) */
>> + struct mutex fp_bd_emupage_mutex;
>> + /* wait queue for threads requiring an emuframe */
>> + wait_queue_head_t fp_bd_emupage_queue;
>>   } mm_context_t;
>>
>>   #endif /* __ASM_MMU_H */
>> diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
>> index e277bba..c55e864 100644
>> --- a/arch/mips/include/asm/mmu_context.h
>> +++ b/arch/mips/include/asm/mmu_context.h
>> @@ -16,6 +16,7 @@
>>   #include <linux/smp.h>
>>   #include <linux/slab.h>
>>   #include <asm/cacheflush.h>
>> +#include <asm/fpu_emulator.h>
>>   #include <asm/hazards.h>
>>   #include <asm/tlbflush.h>
>>   #ifdef CONFIG_MIPS_MT_SMTC
>> @@ -133,6 +134,11 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>>   for_each_possible_cpu(i)
>>   cpu_context(i, mm) = 0;
>>
>> + mm->context.fp_bd_emupage = 0;
>> + mm->context.fp_bd_emupage_allocmap = NULL;
>> + mutex_init(&mm->context.fp_bd_emupage_mutex);
>> + init_waitqueue_head(&mm->context.fp_bd_emupage_queue);
>> +
>>   return 0;
>>   }
>>
>> @@ -199,6 +205,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>>    */
>>   static inline void destroy_context(struct mm_struct *mm)
>>   {
>> + dsemul_mm_cleanup(mm);
>>   }
>>
>>   #define deactivate_mm(tsk, mm) do { } while (0)
>> diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
>> index 3605b84..683a3d6 100644
>> --- a/arch/mips/include/asm/processor.h
>> +++ b/arch/mips/include/asm/processor.h
>> @@ -38,9 +38,10 @@ extern unsigned int vced_count, vcei_count;
>>
>>   /*
>>    * A special page (the vdso) is mapped into all processes at the very
>> - * top of the virtual memory space.
>> + * top of the virtual memory space. The page below it is used for FP
>> + * emulator branch delay slot executions.
>>    */
>> -#define SPECIAL_PAGES_SIZE PAGE_SIZE
>> +#define SPECIAL_PAGES_SIZE (PAGE_SIZE * 2)
>>
>>   #ifdef CONFIG_32BIT
>>   #ifdef CONFIG_KVM_GUEST
>> @@ -226,6 +227,8 @@ struct thread_struct {
>>
>>   /* Saved fpu/fpu emulator stuff. */
>>   struct mips_fpu_struct fpu;
>> + /* PC to continue from following an FP branch delay 'emulation' */
>> + unsigned long fp_bd_emu_cpc;
>>   #ifdef CONFIG_MIPS_MT_FPAFF
>>   /* Emulated instruction count */
>>   unsigned long emulated_fp;
>> diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
>> index b6da8b7..eee6e18 100644
>> --- a/arch/mips/include/asm/thread_info.h
>> +++ b/arch/mips/include/asm/thread_info.h
>> @@ -118,6 +118,7 @@ static inline struct thread_info *current_thread_info(void)
>>   #define TIF_LOAD_WATCH 25 /* If set, load watch registers */
>>   #define TIF_SYSCALL_TRACEPOINT 26 /* syscall tracepoint instrumentation */
>>   #define TIF_32BIT_FPREGS 27 /* 32-bit floating point registers */
>> +#define TIF_FP_BD_EMU 28 /* executing an FP branch delay */
>>   #define TIF_SYSCALL_TRACE 31 /* syscall trace active */
>>
>>   #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
>> @@ -135,6 +136,7 @@ static inline struct thread_info *current_thread_info(void)
>>   #define _TIF_FPUBOUND (1<<TIF_FPUBOUND)
>>   #define _TIF_LOAD_WATCH (1<<TIF_LOAD_WATCH)
>>   #define _TIF_32BIT_FPREGS (1<<TIF_32BIT_FPREGS)
>> +#define _TIF_FP_BD_EMU (1<<TIF_FP_BD_EMU)
>>   #define _TIF_SYSCALL_TRACEPOINT (1<<TIF_SYSCALL_TRACEPOINT)
>>
>>   #define _TIF_WORK_SYSCALL_ENTRY (_TIF_NOHZ | _TIF_SYSCALL_TRACE | \
>> diff --git a/arch/mips/kernel/entry.S b/arch/mips/kernel/entry.S
>> index e578685..24707d7 100644
>> --- a/arch/mips/kernel/entry.S
>> +++ b/arch/mips/kernel/entry.S
>> @@ -168,10 +168,15 @@ work_resched:
>>   andi t0, a2, _TIF_NEED_RESCHED
>>   bnez t0, work_resched
>>
>> -work_notifysig: # deal with pending signals and
>> - # notify-resume requests
>> - move a0, sp
>> - li a1, 0
>> +work_notifysig:
>> + and t0, a2, _TIF_FP_BD_EMU # are we currently 'emulating' the
>> + # delay slot of an FP branch?
>> + beqz t0, 1f # no, continue below
>> + and a2, a2, ~_TIF_SIGPENDING # yes, skip handling signals
>> + beqz a2, restore_all # which leaves us nothing to do
>> +
>> +1: move a0, sp # deal with pending signals and
>> + li a1, 0 # notify-resume requests
>>   jal do_notify_resume # a2 already loaded
>>   j resume_userspace_check
>>
>> diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
>> index 747a6cf..0219502 100644
>> --- a/arch/mips/kernel/process.c
>> +++ b/arch/mips/kernel/process.c
>> @@ -32,6 +32,7 @@
>>   #include <asm/cpu.h>
>>   #include <asm/dsp.h>
>>   #include <asm/fpu.h>
>> +#include <asm/fpu_emulator.h>
>>   #include <asm/pgtable.h>
>>   #include <asm/mipsregs.h>
>>   #include <asm/processor.h>
>> @@ -72,6 +73,7 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
>>
>>   void exit_thread(void)
>>   {
>> + dsemul_thread_cleanup();
>>   }
>>
>>   void flush_thread(void)
>> diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
>> index 0f1af58..213d871 100644
>> --- a/arch/mips/kernel/vdso.c
>> +++ b/arch/mips/kernel/vdso.c
>> @@ -78,7 +78,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
>>
>>   down_write(&mm->mmap_sem);
>>
>> - addr = vdso_addr(mm->start_stack);
>> + addr = vdso_addr(mm->start_stack) + PAGE_SIZE;
>>
>>   addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
>>   if (IS_ERR_VALUE(addr)) {
>> diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
>> index 22f7b11..a0566c8 100644
>> --- a/arch/mips/math-emu/cp1emu.c
>> +++ b/arch/mips/math-emu/cp1emu.c
>> @@ -665,8 +665,8 @@ int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
>>    * a single subroutine should be used across both
>>    * modules.
>>    */
>> -static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
>> - unsigned long *contpc)
>> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
>> +  unsigned long *contpc)
>>   {
>>   union mips_instruction insn = (union mips_instruction)dec_insn.insn;
>>   unsigned int fcr31;
>> diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
>> index 7ea622a..3e64b17 100644
>> --- a/arch/mips/math-emu/dsemul.c
>> +++ b/arch/mips/math-emu/dsemul.c
>> @@ -1,6 +1,8 @@
>>   #include <linux/compiler.h>
>> +#include <linux/err.h>
>>   #include <linux/mm.h>
>>   #include <linux/signal.h>
>> +#include <linux/slab.h>
>>   #include <linux/smp.h>
>>
>>   #include <asm/asm.h>
>> @@ -45,52 +47,173 @@
>>   struct emuframe {
>>   mips_instruction emul;
>>   mips_instruction badinst;
>> - mips_instruction cookie;
>> - unsigned long epc;
>>   };
>>
>> +static const int emupage_frame_count = PAGE_SIZE / sizeof(struct emuframe);
>> +
>> +static struct emuframe __user *alloc_emuframe(void)
>> +{
>> + mm_context_t *mm_ctx = &current->mm->context;
>> + struct emuframe __user *fr = NULL;
>> + unsigned long addr;
>> + int idx;
>> +
>> +retry:
>> + mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
>> +
>> + /* Ensure we have a page allocated for emuframes */
>> + if (!mm_ctx->fp_bd_emupage) {
>> + addr = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
>> +   VM_READ|VM_WRITE|VM_EXEC|
>> +   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
>> +   0);
>> + if (IS_ERR_VALUE(addr))
>> + goto out_unlock;
>> +
>> + mm_ctx->fp_bd_emupage = addr;
>> + pr_debug("allocate emupage at 0x%08lx to %d\n", addr,
>> + current->pid);
>> + }
>> +
>> + /* Ensure we have an allocation bitmap */
>> + if (!mm_ctx->fp_bd_emupage_allocmap) {
>> + mm_ctx->fp_bd_emupage_allocmap =
>> + kcalloc(BITS_TO_LONGS(emupage_frame_count),
>> +      sizeof(unsigned long),
>> + GFP_KERNEL);
>> +
>> + if (!mm_ctx->fp_bd_emupage_allocmap)
>> + goto out_unlock;
>> + }
>> +
>> + /* Attempt to allocate a single bit/frame */
>> + idx = bitmap_find_free_region(mm_ctx->fp_bd_emupage_allocmap,
>> +      emupage_frame_count, 0);
>> + if (idx < 0) {
>> + /*
>> + * Failed to allocate a frame. We'll wait until one becomes
>> + * available. The mutex is unlocked so that other threads
>> + * actually get the opportunity to free their frames, which
>> + * means technically the result of bitmap_full may be incorrect.
>> + * However the worst case is that we repeat all this and end up
>> + * back here again.
>> + */
>> + mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
>> + if (!wait_event_killable(mm_ctx->fp_bd_emupage_queue,
>> + !bitmap_full(mm_ctx->fp_bd_emupage_allocmap,
>> +     emupage_frame_count)))
>> + goto retry;
>> +
>> + /* Received a fatal signal - just give in */
>> + return NULL;
>> + }
>> +
>> + /* Success! */
>> + fr = (struct emuframe __user *)mm_ctx->fp_bd_emupage + idx;
>> + pr_debug("allocate emuframe %d to %d\n", idx, current->pid);
>> +out_unlock:
>> + mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
>> + return fr;
>> +}
>> +
>> +static void free_emuframe(struct emuframe __user *frame)
>> +{
>> + mm_context_t *mm_ctx = &current->mm->context;
>> + int idx;
>> +
>> + mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
>> +
>> + idx = frame - (struct emuframe __user *)mm_ctx->fp_bd_emupage;
>> + pr_debug("free emuframe %d from %d\n", idx, current->pid);
>> + bitmap_clear(mm_ctx->fp_bd_emupage_allocmap, idx, 1);
>> +
>> + /* If some thread is waiting for a frame, now's its chance */
>> + wake_up(&mm_ctx->fp_bd_emupage_queue);
>> +
>> + mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
>> +}
>> +
>> +void dsemul_thread_cleanup(void)
>> +{
>> + /*
>> + * We should always have passed through do_dsemulret prior to the
>> + * thread exiting, so TIF_FP_BD_EMU should never be set here.
>> + */
>> + BUG_ON(test_thread_flag(TIF_FP_BD_EMU));
>> +}
>> +
>> +void dsemul_mm_cleanup(struct mm_struct *mm)
>> +{
>> + mm_context_t *mm_ctx = &mm->context;
>> +
>> + kfree(mm_ctx->fp_bd_emupage_allocmap);
>> +}
>> +
>>   int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>>   {
>> - extern asmlinkage void handle_dsemulret(void);
>> + struct mm_decoded_insn mm_inst = { .insn = ir };
>>   struct emuframe __user *fr;
>> - int err;
>> + struct pt_regs dummy_regs;
>> + unsigned long dummy_cpc;
>> + int err, is_mm;
>>
>> - if ((get_isa16_mode(regs->cp0_epc) && ((ir >> 16) == MM_NOP16)) ||
>> - (ir == 0)) {
>> - /* NOP is easy */
>> + /*
>> + * Trivially handle typical NOP encodings:
>> + *
>> + *   MIPS32: sll r0, r0, r0
>> + *   microMIPS: move16 r0, r0
>> + */
>> + is_mm = get_isa16_mode(regs->cp0_epc);
>> + if ((!is_mm && !ir) || (is_mm && ((ir >> 16) == MM_NOP16))) {
>> +is_nop:
>>   regs->cp0_epc = cpc;
>>   regs->cp0_cause &= ~CAUSEF_BD;
>>   return 0;
>>   }
>> -#ifdef DSEMUL_TRACE
>> - printk("dsemul %lx %lx\n", regs->cp0_epc, cpc);
>> -
>> -#endif
>>
>>   /*
>> - * The strategy is to push the instruction onto the user stack
>> - * and put a trap after it which we can catch and jump to
>> - * the required address any alternative apart from full
>> - * instruction emulation!!.
>> + * In order for us to clean up the emuframe properly, we'll need to
>> + * execute a break instruction after ir. If ir is a branch then we may
>> + * never reach that break instruction and thus never free the emuframe.
>>   *
>> - * Algorithmics used a system call instruction, and
>> - * borrowed that vector.  MIPS/Linux version is a bit
>> - * more heavyweight in the interests of portability and
>> - * multiprocessor support.  For Linux we generate a
>> - * an unaligned access and force an address error exception.
>> + * Fortunately we know that ir is in a branch delay slot and thus if
>> + * it is a branch then its operation is unpredictable. So we can just
>> + * treat branches as NOPs and skip the 'emulation' entirely.
>>   *
>> - * For embedded systems (stand-alone) we prefer to use a
>> - * non-existing CP1 instruction. This prevents us from emulating
>> - * branches, but gives us a cleaner interface to the exception
>> - * handler (single entry point).
>> + * If the worst happens and we miss a branch/jump instruction here, or
>> + * some processor implements a custom one, then it would be possible
>> + * for us to allocate an emuframe and never free it. Fortunately this
>> + * would:
>> + *
>> + *  1) Be a bug in the userland code, because it has a branch/jump in
>> + *     a branch delay slot. So if we run out of emuframes and the
>> + *     userland code hangs it's not exactly the kernel's fault.
>> + *
>> + *  2) Only affect that userland process, since emuframes are allocated
>> + *     per-mm and kernel threads don't use them at all.
>>   */
>> + if ((!is_mm && isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc)) ||
>> +    (is_mm && mm_isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc))) {
>> + pr_warn("PID %d has a branch in an FP branch delay slot at 0x%08lx\n",
>> + current->pid, regs->cp0_epc);
>> + goto is_nop;
>> + }
>>
>> - /* Ensure that the two instructions are in the same cache line */
>> - fr = (struct emuframe __user *)
>> - ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
>> + pr_debug("dsemul 0x%08lx cont at 0x%08lx\n", regs->cp0_epc, cpc);
>>
>> - /* Verify that the stack pointer is not competely insane */
>> - if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
>> + /*
>> + * The strategy is to write the instruction to a per-mm page followed
>> + * by a trap which we can catch to return to the required address. Any
>> + * alternative to full instruction emulation!!
>> + *
>> + * Algorithmics used a system call instruction, and borrowed that
>> + * vector.  MIPS/Linux version is a bit more heavyweight in the
>> + * interests of portability and multiprocessor support.  For Linux we
>> + * generate a BREAK instruction with a break code reserved for this
>> + * purpose.
>> + */
>> + fr = alloc_emuframe();
>> + if (!fr)
>>   return SIGBUS;
>>
>>   if (get_isa16_mode(regs->cp0_epc)) {
>> @@ -103,17 +226,18 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>>   err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
>>   }
>>
>> - err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
>> - err |= __put_user(cpc, &fr->epc);
>> -
>>   if (unlikely(err)) {
>>   MIPS_FPU_EMU_INC_STATS(errors);
>> + free_emuframe(fr);
>>   return SIGBUS;
>>   }
>>
>>   regs->cp0_epc = ((unsigned long) &fr->emul) |
>>   get_isa16_mode(regs->cp0_epc);
>>
>> + current->thread.fp_bd_emu_cpc = cpc;
>> + set_thread_flag(TIF_FP_BD_EMU);
>> +
>>   flush_cache_sigtramp((unsigned long)&fr->badinst);
>>
>>   return SIGILL; /* force out of emulation loop */
>> @@ -121,64 +245,38 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
>>
>>   int do_dsemulret(struct pt_regs *xcp)
>>   {
>> - struct emuframe __user *fr;
>> - unsigned long epc;
>> - u32 insn, cookie;
>> - int err = 0;
>> - u16 instr[2];
>> -
>> - fr = (struct emuframe __user *)
>> - (msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction));
>> -
>> - /*
>> - * If we can't even access the area, something is very wrong, but we'll
>> - * leave that to the default handling
>> - */
>> - if (!access_ok(VERIFY_READ, fr, sizeof(struct emuframe)))
>> - return 0;
>> -
>> - /*
>> - * Do some sanity checking on the stackframe:
>> - *
>> - *  - Is the instruction pointed to by the EPC an BREAK_MATH?
>> - *  - Is the following memory word the BD_COOKIE?
>> - */
>> - if (get_isa16_mode(xcp->cp0_epc)) {
>> - err = __get_user(instr[0], (u16 __user *)(&fr->badinst));
>> - err |= __get_user(instr[1], (u16 __user *)((long)(&fr->badinst) + 2));
>> - insn = (instr[0] << 16) | instr[1];
>> - } else {
>> - err = __get_user(insn, &fr->badinst);
>> - }
>> - err |= __get_user(cookie, &fr->cookie);
>> + mm_context_t *mm_ctx = &current->mm->context;
>> + struct emuframe __user *fr = NULL;
>> + unsigned long fr_addr;
>> + int success = 0;
>>
>> - if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE))) {
>> - MIPS_FPU_EMU_INC_STATS(errors);
>> - return 0;
>> - }
>> + /* If we don't have TIF_FP_BD_EMU set... */
>> + if (!test_and_clear_thread_flag(TIF_FP_BD_EMU))
>> + goto out;
>>
>>   /*
>> - * At this point, we are satisfied that it's a BD emulation trap.  Yes,
>> - * a user might have deliberately put two malformed and useless
>> - * instructions in a row in his program, in which case he's in for a
>> - * nasty surprise - the next instruction will be treated as a
>> - * continuation address!  Alas, this seems to be the only way that we
>> - * can handle signals, recursion, and longjmps() in the context of
>> - * emulating the branch delay instruction.
>> + * ...or EPC is outside of the expected page or misaligned then
>> + * something is wrong. Leave it to the default trap/break code to
>> + * handle.
>>   */
>> + fr_addr = msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction);
>> + if ((fr_addr < mm_ctx->fp_bd_emupage) ||
>> +    (fr_addr > (mm_ctx->fp_bd_emupage + PAGE_SIZE - sizeof(*fr))) ||
>> +    (fr_addr & (sizeof(*fr) - 1)))
>> + goto out;
>>
>> -#ifdef DSEMUL_TRACE
>> - printk("dsemulret\n");
>> -#endif
>> - if (__get_user(epc, &fr->epc)) { /* Saved EPC */
>> - /* This is not a good situation to be in */
>> - force_sig(SIGBUS, current);
>> -
>> - return 0;
>> - }
>> + /* At this point, we are satisfied that it's a BD emulation trap. */
>> + fr = (struct emuframe __user *)fr_addr;
>>
>>   /* Set EPC to return to post-branch instruction */
>> - xcp->cp0_epc = epc;
>> + xcp->cp0_epc = current->thread.fp_bd_emu_cpc;
>> + success = 1;
>>
>> - return 1;
>> + pr_debug("dsemulret to 0x%08lx\n", xcp->cp0_epc);
>> +out:
>> + if (fr)
>> + free_emuframe(fr);
>> + if (!success)
>> + MIPS_FPU_EMU_INC_STATS(errors);
>> + return success;
>>   }
>>
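
For readers following the book-keeping in the quoted patch, here is a
rough userspace model of the per-mm emuframe bitmap, assuming a 4 KiB
page and the two-word frame the patch leaves behind. The names
emupage_model, model_alloc_frame and model_free_frame are hypothetical
and appear nowhere in the patch. The model also leaves out the mutex
and the wait queue: where the kernel code would sleep in
wait_event_killable(), this sketch simply returns -1.

```c
#include <limits.h>
#include <stddef.h>

/* Assumed page size for the model; the kernel uses PAGE_SIZE. */
#define MODEL_PAGE_SIZE	4096u

/* Mirrors the trimmed-down frame: the instruction plus the trap. */
struct emuframe {
	unsigned int emul;
	unsigned int badinst;
};

#define FRAME_COUNT	(MODEL_PAGE_SIZE / sizeof(struct emuframe))
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS	((FRAME_COUNT + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical stand-in for the fp_bd_emupage_allocmap bitmap. */
struct emupage_model {
	unsigned long allocmap[MAP_WORDS];
};

/* Claim the lowest free frame; return its index, or -1 if all are busy. */
int model_alloc_frame(struct emupage_model *p)
{
	size_t idx;

	for (idx = 0; idx < FRAME_COUNT; idx++) {
		unsigned long *word = &p->allocmap[idx / BITS_PER_WORD];
		unsigned long mask = 1UL << (idx % BITS_PER_WORD);

		if (!(*word & mask)) {
			*word |= mask;
			return (int)idx;
		}
	}
	return -1;
}

/* Release a frame so that a later allocation can reuse it. */
void model_free_frame(struct emupage_model *p, int idx)
{
	size_t i = (size_t)idx;

	p->allocmap[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
}
```

With these assumptions a page holds 512 frames, so a thread that never
traps back to free its frame costs the process one slot out of 512
rather than any memory.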


* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2014-07-03 20:12   ` Paul Burton
From: Paul Burton @ 2014-07-03 20:12 UTC (permalink / raw)
  To: Ed Swierk; +Cc: linux-mips, ddaney.cavm, ralf

On Thu, Jul 03, 2014 at 10:56:10AM -0700, Ed Swierk wrote:
> Hi Paul,
> 
> I came across your patch while puzzling over mysterious segvs from
> programs that do floating-point compares. I'm running a Debian 32-bit
> MIPS userland, which is built with hard floating point operations, on
> a Cavium Octeon2, which lacks an FPU. Now that Linux makes user stacks
> non-executable by default, the current FP emulation approach is simply
> broken.

Really? I wasn't aware of any change to the default attributes of the
stack. Do you know what changed? From a quick look at fs/binfmt_elf.c &
arch/mips/include/asm/elf.h I can't see anything relevant having
changed - the stack should be executable unless a non-executable
PT_GNU_STACK header is present in the ELF. I don't suppose the issue
is simply that such a PT_GNU_STACK header is present in your binaries?

Of course if that is the case the question is: what did you build them
with? If some toolchain has started emitting such headers then this
becomes more urgent.
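
To make the rule above concrete, here is a minimal user-space sketch of the decision as I understand it: the stack is treated as executable unless a PT_GNU_STACK program header without PF_X is present. The function name stack_executable is made up for illustration and this is not the kernel's code in fs/binfmt_elf.c; it only uses the standard definitions from <elf.h>.

```c
#include <elf.h>
#include <stddef.h>

/*
 * Illustrative only: return 1 if the stack would be executable for a
 * binary with these program headers, 0 otherwise.
 */
int stack_executable(const Elf32_Phdr *phdr, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		/* A PT_GNU_STACK header decides the matter via PF_X. */
		if (phdr[i].p_type == PT_GNU_STACK)
			return (phdr[i].p_flags & PF_X) != 0;
	}

	/* No PT_GNU_STACK header at all: the executable default applies. */
	return 1;
}
```

Running readelf -l on one of the affected binaries and looking for a GNU_STACK line with flags RW (no E) answers the same question without writing any code.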

> Have you had the opportunity to revisit your patch in light of the
> force_sig issue?

I'm afraid not... it's been sitting somewhere near the bottom of my
rather large TODO list.

> I'm wondering if instead of trying to free the page
> for the FP branch delay emuframe immediately, it would be simpler to
> leave it around until the thread is destroyed.

It's not really an issue of freeing a page - my patch mapped one page
per-mm (per-process) and that page was left intact for the life of that
mm (process). The issue, if I remember correctly, is that:

  - The emuframe needs to be freed after the instruction from the branch
    delay slot has been executed, or we'd simply run out of space in the
    page after a while.

  - In order to guarantee that the emuframe is freed we need to ignore
    signals to the task until it has been. Otherwise if a signal is
    delivered whilst the emuframe is live, userland could do something
    like a longjmp out of the signal handler & control would never
    return back to the emuframe - which would then never be freed.

  - If the instruction from the branch delay slot traps to the kernel
    and userland would normally receive a signal as a result, it won't
    because we're ignoring signals at this point. So we'd just end up
    returning to the same instruction again, trapping again, etc etc.
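
The first two points can be modelled in a few lines of user-space C. This is only a sketch under simplifying assumptions: FRAME_COUNT is a made-up small number (the patch uses PAGE_SIZE / sizeof(struct emuframe)), a single 32-bit word stands in for the allocation bitmap, and there is no locking or waiting. The helper names are hypothetical.

```c
#include <stdint.h>

#define FRAME_COUNT 8	/* hypothetical; far smaller than a real page holds */

static uint32_t allocmap;	/* bit n set => frame n is live */

/* Return the index of a free frame, or -1 if every frame is in use. */
int frame_alloc(void)
{
	int idx;

	for (idx = 0; idx < FRAME_COUNT; idx++) {
		if (!(allocmap & (1u << idx))) {
			allocmap |= 1u << idx;
			return idx;
		}
	}
	return -1;
}

/* Mark a frame as free again. */
void frame_free(int idx)
{
	allocmap &= ~(1u << idx);
}

/*
 * Model n emulated delay slots. If 'returns' is zero the thread never
 * traps back to the kernel (say it longjmp'd out of a signal handler),
 * so the frame is never freed. Returns how many executions succeeded.
 */
int run_delay_slots(int n, int returns)
{
	int i, idx;

	for (i = 0; i < n; i++) {
		idx = frame_alloc();
		if (idx < 0)
			return i;	/* page exhausted */
		if (returns)
			frame_free(idx);
	}
	return n;
}
```

With frames always freed the loop can run indefinitely; once they leak, the model stops after FRAME_COUNT executions, which is the "run out of space in the page" case.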

One solution would be for any trap to the kernel whilst the
TIF_FP_BD_EMU flag is set to free the emuframe, clear the
TIF_FP_BD_EMU flag & point epc back at the original branch. It
sounds quite invasive though since each entry to the kernel that can
be reached by a trap caused by a userland instruction would need to
take this into account.
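
Roughly, each such entry point would need to do something like the following. This is a user-space sketch, not kernel code: struct bd_emu_state and bd_emu_unwind are hypothetical names, 'active' stands in for TIF_FP_BD_EMU, and branch_pc is an assumed extra field (the patch itself only records the continuation PC, not the address of the branch).

```c
#include <stdbool.h>

/* Hypothetical per-thread state; field names are illustrative only. */
struct bd_emu_state {
	bool active;			/* stands in for TIF_FP_BD_EMU */
	int frame_idx;			/* index of the live emuframe, or -1 */
	unsigned long branch_pc;	/* address of the original FP branch */
};

/*
 * To be called on entry to a trap handler. Returns true if a delay slot
 * 'emulation' was in flight and has been unwound, in which case *epc now
 * points at the original branch.
 */
bool bd_emu_unwind(struct bd_emu_state *st, unsigned long *epc)
{
	if (!st->active)
		return false;

	st->frame_idx = -1;	/* release the emuframe */
	st->active = false;	/* signals may be delivered again */
	*epc = st->branch_pc;	/* restart from the branch if we return */
	return true;
}
```

The unwind itself is small; the invasive part is that every handler reachable from a userland instruction would have to call it before doing anything else.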

Another solution would be to bite the bullet & just implement a full-on
MIPS emulator, but that would need to take into account any ASEs or cop2
functionality implemented on a system. So it would be rather large & no
doubt it would be easy to get things subtly wrong.
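
As a small illustration of where the trouble starts, the sketch below classifies an instruction word by its major opcode (bits 31..26 in the MIPS32 encoding), where 0x11 is COP1 and 0x12 is COP2. The enum and function names are made up, and it deliberately ignores the other coprocessor 2 encodings (loads/stores and so on); the point is only that anything landing in the COP2 bucket has implementation-defined behaviour that a generic emulator cannot know.

```c
#include <stdint.h>

enum insn_class { INSN_GENERIC, INSN_COP1, INSN_COP2 };

/* Classify a MIPS32 instruction word by its major opcode field. */
enum insn_class classify(uint32_t insn)
{
	switch (insn >> 26) {
	case 0x11:	/* COP1: FPU operations, already handled by the FP emulator */
		return INSN_COP1;
	case 0x12:	/* COP2: implementation-defined coprocessor */
		return INSN_COP2;
	default:	/* everything else a full emulator would have to cover */
		return INSN_GENERIC;
	}
}
```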

I'll try to find some time to ponder this some more & see if I can think
up something better. I'm open to suggestions if anyone has any :)

Thanks,
    Paul

> --Ed
> 
> Paul Burton <paul.burton@imgtec.com> wrote:
> > Hmm, I believe there may still be an issue with this patch. If the
> > instruction in the branch delay slot being "emulated" traps to the
> > kernel, and the kernel does a force_sig then that signal won't get
> > processed because signals are being temporarily ignored. So I think
> > we'd go back off to userland & execute the same instruction from the
> > branch delay slot again, trap again, force_sig again, go back to
> > userland etc etc. I need to think about this...
> >
> > In the meantime, Ralf: if you get to merging this series please drop
> > this patch & 6/6 (the stack exec change) for the time being. The "Some
> > (mostly FP) cleanups" series I submitted will still apply after only
> > the first 4 patches of this series.
> >
> > Thanks,
> >      Paul
> >
> > On 08/11/13 14:50, Paul Burton wrote:
> >> If a floating point branch instruction (bc1[ft]l?) is emulated,
> >> typically because we're running on a core with no FPU, then we need to
> >> execute the instruction in its branch delay slot too. This is done by
> >> writing that instruction to memory followed by a trap, as part of an
> >> "emuframe", and executing it. This avoids the requirement of an emulator
> >> for the entire MIPS instruction set. Prior to this patch such emuframes
> >> are written to the user stack and executed from there.
> >>
> >> This patch moves FP branch delay emuframes off of the user stack and
> >> into a per-mm page. Allocating a page per-mm leaves userland with access
> >> to only what it had access to previously, and prevents processes
> >> interfering with each other as they might if a single system-wide page
> >> were used. The book-keeping required to track the allocation of
> >> emuframes is not cheap, but given that invoking the FP emulator is
> >> already very expensive I don't expect this to be an issue.
> >>
> >> The biggest issue with executing the instruction from an FP branch delay
> >> is that we must ensure that we free the frame from which we ran it. That
> >> means that we must trap back to the kernel after executing that
> >> instruction, which means that we must take special care not to let the
> >> PC be changed as a result of that instruction. Fortunately since we're
> >> executing an instruction we found in a branch delay the result is
> >> unpredictable if that instruction is a branch or jump, so we can simply
> >> treat those as NOPs and avoid them causing a problem. However there is
> >> still the possibility that a signal may be handled whilst executing the
> >> branch delay instruction. This would usually be fine as we would simply
> >> execute our trap back to the kernel after sigreturn, however it is
> >> possible for userland to simply not return from the signal handler - for
> >> example if it executes something like a longjmp. In that case we would
> >> never trap back to the kernel and never free the frame. For that reason
> >> a TIF_FP_BD_EMU flag is introduced and set whilst we are executing an FP
> >> branch delay instruction. Whilst this flag is set, signals will be
> >> ignored. This isn't exactly pretty, but it's simpler than most of the
> >> alternatives. One other simple option I considered would be to just
> >> kill a process if we find a branch in an FP branch delay slot, but I
> >> chose the current approach because its result is closer to what would
> >> previously happen.
> >>
> >> The primary benefit of this patch is that we are now free to mark the
> >> user stack non-executable where that is possible.
> >>
> >> Additionally the FP emuframes themselves are simplified somewhat. The
> >> cookie field is removed since we can be pretty certain that we're
> >> looking at an emuframe by virtue of it being located in the page
> >> allocated for them. The PC to continue from is moved into struct
> >> thread_struct since the control flow of a thread can no longer be
> >> modified for the duration of the 'emulation', meaning there will now
> >> only ever be a single emuframe required for a thread at any given time.
> >>
> >> Signed-off-by: Paul Burton <paul.burton@imgtec.com>
> >> ---
> >> Changes in v2:
> >>    - s/kernels/kernel's/
> >>    - Use (mm_)isBranchInstr in mips_dsemul rather than duplicating
> >>      similar logic.
> >> ---
> >>   arch/mips/include/asm/fpu_emulator.h |   4 +
> >>   arch/mips/include/asm/mmu.h          |  12 ++
> >>   arch/mips/include/asm/mmu_context.h  |   7 +
> >>   arch/mips/include/asm/processor.h    |   7 +-
> >>   arch/mips/include/asm/thread_info.h  |   2 +
> >>   arch/mips/kernel/entry.S             |  13 +-
> >>   arch/mips/kernel/process.c           |   2 +
> >>   arch/mips/kernel/vdso.c              |   2 +-
> >>   arch/mips/math-emu/cp1emu.c          |   4 +-
> >>   arch/mips/math-emu/dsemul.c          | 266 ++++++++++++++++++++++++-----------
> >>   10 files changed, 226 insertions(+), 93 deletions(-)
> >>
> >> diff --git a/arch/mips/include/asm/fpu_emulator.h b/arch/mips/include/asm/fpu_emulator.h
> >> index 2abb587..16f7b0b 100644
> >> --- a/arch/mips/include/asm/fpu_emulator.h
> >> +++ b/arch/mips/include/asm/fpu_emulator.h
> >> @@ -51,6 +51,8 @@ do { \
> >>   #define MIPS_FPU_EMU_INC_STATS(M) do { } while (0)
> >>   #endif /* CONFIG_DEBUG_FS */
> >>
> >> +extern void dsemul_thread_cleanup(void);
> >> +extern void dsemul_mm_cleanup(struct mm_struct *mm);
> >>   extern int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
> >>   unsigned long cpc);
> >>   extern int do_dsemulret(struct pt_regs *xcp);
> >> @@ -58,6 +60,8 @@ extern int fpu_emulator_cop1Handler(struct pt_regs *xcp,
> >>      struct mips_fpu_struct *ctx, int has_fpu,
> >>      void *__user *fault_addr);
> >>   int process_fpemu_return(int sig, void __user *fault_addr);
> >> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >> +  unsigned long *contpc);
> >>   int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >>       unsigned long *contpc);
> >>
> >> diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
> >> index c436138..08214da 100644
> >> --- a/arch/mips/include/asm/mmu.h
> >> +++ b/arch/mips/include/asm/mmu.h
> >> @@ -1,9 +1,21 @@
> >>   #ifndef __ASM_MMU_H
> >>   #define __ASM_MMU_H
> >>
> >> +#include <linux/mutex.h>
> >> +#include <linux/wait.h>
> >> +
> >>   typedef struct {
> >>   unsigned long asid[NR_CPUS];
> >>   void *vdso;
> >> +
> >> + /* address of page used to hold FP branch delay emulation frames */
> >> + unsigned long fp_bd_emupage;
> >> + /* bitmap tracking allocation of fp_bd_emupage */
> >> + unsigned long *fp_bd_emupage_allocmap;
> >> + /* mutex to be held whilst modifying fp_bd_emupage(_allocmap) */
> >> + struct mutex fp_bd_emupage_mutex;
> >> + /* wait queue for threads requiring an emuframe */
> >> + wait_queue_head_t fp_bd_emupage_queue;
> >>   } mm_context_t;
> >>
> >>   #endif /* __ASM_MMU_H */
> >> diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
> >> index e277bba..c55e864 100644
> >> --- a/arch/mips/include/asm/mmu_context.h
> >> +++ b/arch/mips/include/asm/mmu_context.h
> >> @@ -16,6 +16,7 @@
> >>   #include <linux/smp.h>
> >>   #include <linux/slab.h>
> >>   #include <asm/cacheflush.h>
> >> +#include <asm/fpu_emulator.h>
> >>   #include <asm/hazards.h>
> >>   #include <asm/tlbflush.h>
> >>   #ifdef CONFIG_MIPS_MT_SMTC
> >> @@ -133,6 +134,11 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> >>   for_each_possible_cpu(i)
> >>   cpu_context(i, mm) = 0;
> >>
> >> + mm->context.fp_bd_emupage = 0;
> >> + mm->context.fp_bd_emupage_allocmap = NULL;
> >> + mutex_init(&mm->context.fp_bd_emupage_mutex);
> >> + init_waitqueue_head(&mm->context.fp_bd_emupage_queue);
> >> +
> >>   return 0;
> >>   }
> >>
> >> @@ -199,6 +205,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >>    */
> >>   static inline void destroy_context(struct mm_struct *mm)
> >>   {
> >> + dsemul_mm_cleanup(mm);
> >>   }
> >>
> >>   #define deactivate_mm(tsk, mm) do { } while (0)
> >> diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
> >> index 3605b84..683a3d6 100644
> >> --- a/arch/mips/include/asm/processor.h
> >> +++ b/arch/mips/include/asm/processor.h
> >> @@ -38,9 +38,10 @@ extern unsigned int vced_count, vcei_count;
> >>
> >>   /*
> >>    * A special page (the vdso) is mapped into all processes at the very
> >> - * top of the virtual memory space.
> >> + * top of the virtual memory space. The page below it is used for FP
> >> + * emulator branch delay slot executions.
> >>    */
> >> -#define SPECIAL_PAGES_SIZE PAGE_SIZE
> >> +#define SPECIAL_PAGES_SIZE (PAGE_SIZE * 2)
> >>
> >>   #ifdef CONFIG_32BIT
> >>   #ifdef CONFIG_KVM_GUEST
> >> @@ -226,6 +227,8 @@ struct thread_struct {
> >>
> >>   /* Saved fpu/fpu emulator stuff. */
> >>   struct mips_fpu_struct fpu;
> >> + /* PC to continue from following an FP branch delay 'emulation' */
> >> + unsigned long fp_bd_emu_cpc;
> >>   #ifdef CONFIG_MIPS_MT_FPAFF
> >>   /* Emulated instruction count */
> >>   unsigned long emulated_fp;
> >> diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
> >> index b6da8b7..eee6e18 100644
> >> --- a/arch/mips/include/asm/thread_info.h
> >> +++ b/arch/mips/include/asm/thread_info.h
> >> @@ -118,6 +118,7 @@ static inline struct thread_info *current_thread_info(void)
> >>   #define TIF_LOAD_WATCH 25 /* If set, load watch registers */
> >>   #define TIF_SYSCALL_TRACEPOINT 26 /* syscall tracepoint instrumentation */
> >>   #define TIF_32BIT_FPREGS 27 /* 32-bit floating point registers */
> >> +#define TIF_FP_BD_EMU 28 /* executing an FP branch delay */
> >>   #define TIF_SYSCALL_TRACE 31 /* syscall trace active */
> >>
> >>   #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
> >> @@ -135,6 +136,7 @@ static inline struct thread_info *current_thread_info(void)
> >>   #define _TIF_FPUBOUND (1<<TIF_FPUBOUND)
> >>   #define _TIF_LOAD_WATCH (1<<TIF_LOAD_WATCH)
> >>   #define _TIF_32BIT_FPREGS (1<<TIF_32BIT_FPREGS)
> >> +#define _TIF_FP_BD_EMU (1<<TIF_FP_BD_EMU)
> >>   #define _TIF_SYSCALL_TRACEPOINT (1<<TIF_SYSCALL_TRACEPOINT)
> >>
> >>   #define _TIF_WORK_SYSCALL_ENTRY (_TIF_NOHZ | _TIF_SYSCALL_TRACE | \
> >> diff --git a/arch/mips/kernel/entry.S b/arch/mips/kernel/entry.S
> >> index e578685..24707d7 100644
> >> --- a/arch/mips/kernel/entry.S
> >> +++ b/arch/mips/kernel/entry.S
> >> @@ -168,10 +168,15 @@ work_resched:
> >>   andi t0, a2, _TIF_NEED_RESCHED
> >>   bnez t0, work_resched
> >>
> >> -work_notifysig: # deal with pending signals and
> >> - # notify-resume requests
> >> - move a0, sp
> >> - li a1, 0
> >> +work_notifysig:
> >> + and t0, a2, _TIF_FP_BD_EMU # are we currently 'emulating' the
> >> + # delay slot of an FP branch?
> >> + beqz t0, 1f # no, continue below
> >> + and a2, a2, ~_TIF_SIGPENDING # yes, skip handling signals
> >> + beqz a2, restore_all # which leaves us nothing to do
> >> +
> >> +1: move a0, sp # deal with pending signals and
> >> + li a1, 0 # notify-resume requests
> >>   jal do_notify_resume # a2 already loaded
> >>   j resume_userspace_check
> >>
> >> diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
> >> index 747a6cf..0219502 100644
> >> --- a/arch/mips/kernel/process.c
> >> +++ b/arch/mips/kernel/process.c
> >> @@ -32,6 +32,7 @@
> >>   #include <asm/cpu.h>
> >>   #include <asm/dsp.h>
> >>   #include <asm/fpu.h>
> >> +#include <asm/fpu_emulator.h>
> >>   #include <asm/pgtable.h>
> >>   #include <asm/mipsregs.h>
> >>   #include <asm/processor.h>
> >> @@ -72,6 +73,7 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
> >>
> >>   void exit_thread(void)
> >>   {
> >> + dsemul_thread_cleanup();
> >>   }
> >>
> >>   void flush_thread(void)
> >> diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
> >> index 0f1af58..213d871 100644
> >> --- a/arch/mips/kernel/vdso.c
> >> +++ b/arch/mips/kernel/vdso.c
> >> @@ -78,7 +78,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> >>
> >>   down_write(&mm->mmap_sem);
> >>
> >> - addr = vdso_addr(mm->start_stack);
> >> + addr = vdso_addr(mm->start_stack) + PAGE_SIZE;
> >>
> >>   addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
> >>   if (IS_ERR_VALUE(addr)) {
> >> diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
> >> index 22f7b11..a0566c8 100644
> >> --- a/arch/mips/math-emu/cp1emu.c
> >> +++ b/arch/mips/math-emu/cp1emu.c
> >> @@ -665,8 +665,8 @@ int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >>    * a single subroutine should be used across both
> >>    * modules.
> >>    */
> >> -static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >> - unsigned long *contpc)
> >> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >> +  unsigned long *contpc)
> >>   {
> >>   union mips_instruction insn = (union mips_instruction)dec_insn.insn;
> >>   unsigned int fcr31;
> >> diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
> >> index 7ea622a..3e64b17 100644
> >> --- a/arch/mips/math-emu/dsemul.c
> >> +++ b/arch/mips/math-emu/dsemul.c
> >> @@ -1,6 +1,8 @@
> >>   #include <linux/compiler.h>
> >> +#include <linux/err.h>
> >>   #include <linux/mm.h>
> >>   #include <linux/signal.h>
> >> +#include <linux/slab.h>
> >>   #include <linux/smp.h>
> >>
> >>   #include <asm/asm.h>
> >> @@ -45,52 +47,173 @@
> >>   struct emuframe {
> >>   mips_instruction emul;
> >>   mips_instruction badinst;
> >> - mips_instruction cookie;
> >> - unsigned long epc;
> >>   };
> >>
> >> +static const int emupage_frame_count = PAGE_SIZE / sizeof(struct emuframe);
> >> +
> >> +static struct emuframe __user *alloc_emuframe(void)
> >> +{
> >> + mm_context_t *mm_ctx = &current->mm->context;
> >> + struct emuframe __user *fr = NULL;
> >> + unsigned long addr;
> >> + int idx;
> >> +
> >> +retry:
> >> + mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
> >> +
> >> + /* Ensure we have a page allocated for emuframes */
> >> + if (!mm_ctx->fp_bd_emupage) {
> >> + addr = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
> >> +   VM_READ|VM_WRITE|VM_EXEC|
> >> +   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> >> +   0);
> >> + if (IS_ERR_VALUE(addr))
> >> + goto out_unlock;
> >> +
> >> + mm_ctx->fp_bd_emupage = addr;
> >> + pr_debug("allocate emupage at 0x%08lx to %d\n", addr,
> >> + current->pid);
> >> + }
> >> +
> >> + /* Ensure we have an allocation bitmap */
> >> + if (!mm_ctx->fp_bd_emupage_allocmap) {
> >> + mm_ctx->fp_bd_emupage_allocmap =
> >> + kcalloc(BITS_TO_LONGS(emupage_frame_count),
> >> +      sizeof(unsigned long),
> >> + GFP_KERNEL);
> >> +
> >> + if (!mm_ctx->fp_bd_emupage_allocmap)
> >> + goto out_unlock;
> >> + }
> >> +
> >> + /* Attempt to allocate a single bit/frame */
> >> + idx = bitmap_find_free_region(mm_ctx->fp_bd_emupage_allocmap,
> >> +      emupage_frame_count, 0);
> >> + if (idx < 0) {
> >> + /*
> >> + * Failed to allocate a frame. We'll wait until one becomes
> >> + * available. The mutex is unlocked so that other threads
> >> + * actually get the opportunity to free their frames, which
> >> + * means technically the result of bitmap_full may be incorrect.
> >> + * However the worst case is that we repeat all this and end up
> >> + * back here again.
> >> + */
> >> + mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> >> + if (!wait_event_killable(mm_ctx->fp_bd_emupage_queue,
> >> + !bitmap_full(mm_ctx->fp_bd_emupage_allocmap,
> >> +     emupage_frame_count)))
> >> + goto retry;
> >> +
> >> + /* Received a fatal signal - just give in */
> >> + return NULL;
> >> + }
> >> +
> >> + /* Success! */
> >> + fr = (struct emuframe __user *)mm_ctx->fp_bd_emupage + idx;
> >> + pr_debug("allocate emuframe %d to %d\n", idx, current->pid);
> >> +out_unlock:
> >> + mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> >> + return fr;
> >> +}
> >> +
> >> +static void free_emuframe(struct emuframe __user *frame)
> >> +{
> >> + mm_context_t *mm_ctx = &current->mm->context;
> >> + int idx;
> >> +
> >> + mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
> >> +
> >> + idx = frame - (struct emuframe __user *)mm_ctx->fp_bd_emupage;
> >> + pr_debug("free emuframe %d from %d\n", idx, current->pid);
> >> + bitmap_clear(mm_ctx->fp_bd_emupage_allocmap, idx, 1);
> >> +
> >> + /* If some thread is waiting for a frame, now's its chance */
> >> + wake_up(&mm_ctx->fp_bd_emupage_queue);
> >> +
> >> + mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> >> +}
> >> +
> >> +void dsemul_thread_cleanup(void)
> >> +{
> >> + /*
> >> + * We should always have passed through do_dsemulret prior to the
> >> + * thread exiting, so TIF_FP_BD_EMU should never be set here.
> >> + */
> >> + BUG_ON(test_thread_flag(TIF_FP_BD_EMU));
> >> +}
> >> +
> >> +void dsemul_mm_cleanup(struct mm_struct *mm)
> >> +{
> >> + mm_context_t *mm_ctx = &mm->context;
> >> +
> >> + kfree(mm_ctx->fp_bd_emupage_allocmap);
> >> +}
> >> +
> >>   int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
> >>   {
> >> - extern asmlinkage void handle_dsemulret(void);
> >> + struct mm_decoded_insn mm_inst = { .insn = ir };
> >>   struct emuframe __user *fr;
> >> - int err;
> >> + struct pt_regs dummy_regs;
> >> + unsigned long dummy_cpc;
> >> + int err, is_mm;
> >>
> >> - if ((get_isa16_mode(regs->cp0_epc) && ((ir >> 16) == MM_NOP16)) ||
> >> - (ir == 0)) {
> >> - /* NOP is easy */
> >> + /*
> >> + * Trivially handle typical NOP encodings:
> >> + *
> >> + *   MIPS32: sll r0, r0, r0
> >> + *   microMIPS: move16 r0, r0
> >> + */
> >> + is_mm = get_isa16_mode(regs->cp0_epc);
> >> + if ((!is_mm && !ir) || (is_mm && ((ir >> 16) == MM_NOP16))) {
> >> +is_nop:
> >>   regs->cp0_epc = cpc;
> >>   regs->cp0_cause &= ~CAUSEF_BD;
> >>   return 0;
> >>   }
> >> -#ifdef DSEMUL_TRACE
> >> - printk("dsemul %lx %lx\n", regs->cp0_epc, cpc);
> >> -
> >> -#endif
> >>
> >>   /*
> >> - * The strategy is to push the instruction onto the user stack
> >> - * and put a trap after it which we can catch and jump to
> >> - * the required address any alternative apart from full
> >> - * instruction emulation!!.
> >> + * In order for us to clean up the emuframe properly, we'll need to
> >> + * execute a break instruction after ir. If ir is a branch then we may
> >> + * never reach that break instruction and thus never free the emuframe.
> >>   *
> >> - * Algorithmics used a system call instruction, and
> >> - * borrowed that vector.  MIPS/Linux version is a bit
> >> - * more heavyweight in the interests of portability and
> >> - * multiprocessor support.  For Linux we generate a
> >> - * an unaligned access and force an address error exception.
> >> + * Fortunately we know that ir is in a branch delay slot and thus if
> >> + * it is a branch then its operation is unpredictable. So we can just
> >> + * treat branches as NOPs and skip the 'emulation' entirely.
> >>   *
> >> - * For embedded systems (stand-alone) we prefer to use a
> >> - * non-existing CP1 instruction. This prevents us from emulating
> >> - * branches, but gives us a cleaner interface to the exception
> >> - * handler (single entry point).
> >> + * If the worst happens and we miss a branch/jump instruction here, or
> >> + * some processor implements a custom one, then it would be possible
> >> + * for us to allocate an emuframe and never free it. Fortunately this
> >> + * would:
> >> + *
> >> + *  1) Be a bug in the userland code, because it has a branch/jump in
> >> + *     a branch delay slot. So if we run out of emuframes and the
> >> + *     userland code hangs it's not exactly the kernel's fault.
> >> + *
> >> + *  2) Only affect that userland process, since emuframes are allocated
> >> + *     per-mm and kernel threads don't use them at all.
> >>   */
> >> + if ((!is_mm && isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc)) ||
> >> +    (is_mm && mm_isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc))) {
> >> + pr_warn("PID %d has a branch in an FP branch delay slot at 0x%08lx\n",
> >> + current->pid, regs->cp0_epc);
> >> + goto is_nop;
> >> + }
> >>
> >> - /* Ensure that the two instructions are in the same cache line */
> >> - fr = (struct emuframe __user *)
> >> - ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
> >> + pr_debug("dsemul 0x%08lx cont at 0x%08lx\n", regs->cp0_epc, cpc);
> >>
> >> - /* Verify that the stack pointer is not competely insane */
> >> - if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
> >> + /*
> >> + * The strategy is to write the instruction to a per-mm page followed
> >> + * by a trap which we can catch to return to the required address. Any
> >> + * alternative to full instruction emulation!!
> >> + *
> >> + * Algorithmics used a system call instruction, and borrowed that
> >> + * vector.  MIPS/Linux version is a bit more heavyweight in the
> >> + * interests of portability and multiprocessor support.  For Linux we
> >> + * generate a BREAK instruction with a break code reserved for this
> >> + * purpose.
> >> + */
> >> + fr = alloc_emuframe();
> >> + if (!fr)
> >>   return SIGBUS;
> >>
> >>   if (get_isa16_mode(regs->cp0_epc)) {
> >> @@ -103,17 +226,18 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
> >>   err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
> >>   }
> >>
> >> - err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
> >> - err |= __put_user(cpc, &fr->epc);
> >> -
> >>   if (unlikely(err)) {
> >>   MIPS_FPU_EMU_INC_STATS(errors);
> >> + free_emuframe(fr);
> >>   return SIGBUS;
> >>   }
> >>
> >>   regs->cp0_epc = ((unsigned long) &fr->emul) |
> >>   get_isa16_mode(regs->cp0_epc);
> >>
> >> + current->thread.fp_bd_emu_cpc = cpc;
> >> + set_thread_flag(TIF_FP_BD_EMU);
> >> +
> >>   flush_cache_sigtramp((unsigned long)&fr->badinst);
> >>
> >>   return SIGILL; /* force out of emulation loop */
> >> @@ -121,64 +245,38 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
> >>
> >>   int do_dsemulret(struct pt_regs *xcp)
> >>   {
> >> - struct emuframe __user *fr;
> >> - unsigned long epc;
> >> - u32 insn, cookie;
> >> - int err = 0;
> >> - u16 instr[2];
> >> -
> >> - fr = (struct emuframe __user *)
> >> - (msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction));
> >> -
> >> - /*
> >> - * If we can't even access the area, something is very wrong, but we'll
> >> - * leave that to the default handling
> >> - */
> >> - if (!access_ok(VERIFY_READ, fr, sizeof(struct emuframe)))
> >> - return 0;
> >> -
> >> - /*
> >> - * Do some sanity checking on the stackframe:
> >> - *
> >> - *  - Is the instruction pointed to by the EPC an BREAK_MATH?
> >> - *  - Is the following memory word the BD_COOKIE?
> >> - */
> >> - if (get_isa16_mode(xcp->cp0_epc)) {
> >> - err = __get_user(instr[0], (u16 __user *)(&fr->badinst));
> >> - err |= __get_user(instr[1], (u16 __user *)((long)(&fr->badinst) + 2));
> >> - insn = (instr[0] << 16) | instr[1];
> >> - } else {
> >> - err = __get_user(insn, &fr->badinst);
> >> - }
> >> - err |= __get_user(cookie, &fr->cookie);
> >> + mm_context_t *mm_ctx = &current->mm->context;
> >> + struct emuframe __user *fr = NULL;
> >> + unsigned long fr_addr;
> >> + int success = 0;
> >>
> >> - if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE))) {
> >> - MIPS_FPU_EMU_INC_STATS(errors);
> >> - return 0;
> >> - }
> >> + /* If we don't have TIF_FP_BD_EMU set... */
> >> + if (!test_and_clear_thread_flag(TIF_FP_BD_EMU))
> >> + goto out;
> >>
> >>   /*
> >> - * At this point, we are satisfied that it's a BD emulation trap.  Yes,
> >> - * a user might have deliberately put two malformed and useless
> >> - * instructions in a row in his program, in which case he's in for a
> >> - * nasty surprise - the next instruction will be treated as a
> >> - * continuation address!  Alas, this seems to be the only way that we
> >> - * can handle signals, recursion, and longjmps() in the context of
> >> - * emulating the branch delay instruction.
> >> + * ...or EPC is outside of the expected page or misaligned then
> >> + * something is wrong. Leave it to the default trap/break code to
> >> + * handle.
> >>   */
> >> + fr_addr = msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction);
> >> + if ((fr_addr < mm_ctx->fp_bd_emupage) ||
> >> +    (fr_addr > (mm_ctx->fp_bd_emupage + PAGE_SIZE - sizeof(*fr))) ||
> >> +    (fr_addr & (sizeof(*fr) - 1)))
> >> + goto out;
> >>
> >> -#ifdef DSEMUL_TRACE
> >> - printk("dsemulret\n");
> >> -#endif
> >> - if (__get_user(epc, &fr->epc)) { /* Saved EPC */
> >> - /* This is not a good situation to be in */
> >> - force_sig(SIGBUS, current);
> >> -
> >> - return 0;
> >> - }
> >> + /* At this point, we are satisfied that it's a BD emulation trap. */
> >> + fr = (struct emuframe __user *)fr_addr;
> >>
> >>   /* Set EPC to return to post-branch instruction */
> >> - xcp->cp0_epc = epc;
> >> + xcp->cp0_epc = current->thread.fp_bd_emu_cpc;
> >> + success = 1;
> >>
> >> - return 1;
> >> + pr_debug("dsemulret to 0x%08lx\n", xcp->cp0_epc);
> >> +out:
> >> + if (fr)
> >> + free_emuframe(fr);
> >> + if (!success)
> >> + MIPS_FPU_EMU_INC_STATS(errors);
> >> + return success;
> >>   }
> >>


* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2014-07-03 20:12   ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2014-07-03 20:12 UTC (permalink / raw)
  To: Ed Swierk; +Cc: linux-mips, ddaney.cavm, ralf

On Thu, Jul 03, 2014 at 10:56:10AM -0700, Ed Swierk wrote:
> Hi Paul,
> 
> I came across your patch while puzzling over mysterious segvs from
> programs that do floating-point compares. I'm running a Debian 32-bit
> mips userland, which is built with hard floating point operations, on
> a Cavium Octeon2, which lacks an FPU. Now that Linux makes user stacks
> non-executable by default, the current FP emulation approach is simply
> broken.

Really? I wasn't aware of any change to the default attributes of the
stack. Do you know what changed? From a quick look at fs/binfmt_elf.c &
arch/mips/include/asm/elf.h I can't see anything relevant having
changed - the stack should be executable unless a non-executable
PT_GNU_STACK header is present in the ELF. I don't suppose the issue
is simply that such a PT_GNU_STACK header is present in your binaries?

Of course if that is the case the question is: what did you build them
with? If some toolchain has started emitting such headers then this
becomes more urgent.

> Have you had the opportunity to revisit your patch in light of the
> force_sig issue?

I'm afraid not... it's been sitting somewhere near the bottom of my
rather large TODO list.

> I'm wondering if instead of trying to free the page
> for the FP branch delay emuframe immediately, it would be simpler to
> leave it around until the thread is destroyed.

It's not really an issue of freeing a page - my patch mapped one page
per-mm (per-process) and that page was left intact for the life of that
mm (process). The issue, if I remember correctly, is that:

  - The emuframe needs to be freed after the instruction from the branch
    delay slot has been executed, or we'd simply run out of space in the
    page after a while.

  - In order to guarantee that the emuframe is freed we need to ignore
    signals to the task until it has been. Otherwise if a signal is
    delivered whilst the emuframe is live, userland could do something
    like a longjmp out of the signal handler & control would never
    return back to the emuframe - which would then never be freed.

  - If the instruction from the branch delay slot traps to the kernel
    and userland would normally receive a signal as a result, it won't
    because we're ignoring signals at this point. So we'd just end up
    returning to the same instruction again, trapping again, etc etc.

One solution would be for any trap to the kernel whilst the
TIF_FP_BD_EMU flag is set to free the emuframe, clear the
TIF_FP_BD_EMU flag & point epc back at the original branch. It
sounds quite invasive though since each entry to the kernel that can
be reached by a trap caused by a userland instruction would need to
take this into account.

Another solution would be to bite the bullet & just implement a full-on
MIPS emulator, but that would need to take into account any ASEs or cop2
functionality implemented on a system. So it would be rather large & no
doubt it would be easy to get things subtly wrong.

I'll try to find some time to ponder this some more & see if I can think
up something better. I'm open to suggestions if anyone has any :)

Thanks,
    Paul

> --Ed
> 
> Paul Burton <paul.burton@imgtec.com> wrote:
> > Hmm, I believe there may still be an issue with this patch. If the
> > instruction in the branch delay slot being "emulated" traps to the
> > kernel, and the kernel does a force_sig then that signal won't get
> > processed because signals are being temporarily ignored. So I think
> > we'd go back off to userland & execute the same instruction from the
> > branch delay slot again, trap again, force_sig again, go back to
> > userland etc etc. I need to think about this...
> >
> > In the meantime, Ralf: if you get to merging this series please drop
> > this patch & 6/6 (the stack exec change) for the time being. The "Some
> > (mostly FP) cleanups" series I submitted will still apply after only
> > the first 4 patches of this series.
> >
> > Thanks,
> >      Paul
> >
> > On 08/11/13 14:50, Paul Burton wrote:
> >> If a floating point branch instruction (bc1[ft]l?) is emulated,
> >> typically because we're running on a core with no FPU, then we need to
> >> execute the instruction in its branch delay slot too. This is done by
> >> writing that instruction to memory followed by a trap, as part of an
> >> "emuframe", and executing it. This avoids the requirement of an emulator
> >> for the entire MIPS instruction set. Prior to this patch such emuframes
> >> are written to the user stack and executed from there.
> >>
> >> This patch moves FP branch delay emuframes off of the user stack and
> >> into a per-mm page. Allocating a page per-mm leaves userland with access
> >> to only what it had access to previously, and prevents processes
> >> interfering with each other as they might if a single system-wide page
> >> were used. The book-keeping required to track the allocation of
> >> emuframes is not cheap, but given that invoking the FP emulator is
> >> already very expensive I don't expect this to be an issue.
> >>
> >> The biggest issue with executing the instruction from an FP branch delay
> >> is that we must ensure that we free the frame from which we ran it. That
> >> means that we must trap back to the kernel after executing that
> >> instruction, which means that we must take special care not to let the
> >> PC be changed as a result of that instruction. Fortunately since we're
> >> executing an instruction we found in a branch delay the result is
> >> unpredictable if that instruction is a branch or jump, so we can simply
> >> treat those as NOPs and avoid them causing a problem. However there is
> >> still the possibility that a signal may be handled whilst executing the
> >> branch delay instruction. This would usually be fine as we would simply
> >> execute our trap back to the kernel after sigreturn, however it is
> >> possible for userland to simply not return from the signal handler - for
> >> example if it executes something like a longjmp. In that case we would
> >> never trap back to the kernel and never free the frame. For that reason
> >> a TIF_FP_BD_EMU flag is introduced and set whilst we are executing an FP
> >> branch delay instruction. Whilst this flag is set, signals will be
> >> ignored. This isn't exactly pretty, but it's simpler than most of the
> >> alternatives. One other simple option I considered would be to just
> >> kill a process if we find a branch in an FP branch delay slot, but I
> >> chose the current approach because its result is closer to what would
> >> previously happen.
> >>
> >> The primary benefit of this patch is that we are now free to mark the
> >> user stack non-executable where that is possible.
> >>
> >> Additionally the FP emuframes themselves are simplified somewhat. The
> >> cookie field is removed since we can be pretty certain that we're
> >> looking at an emuframe by virtue of it being located in the page
> >> allocated for them. The PC to continue from is moved into struct
> >> thread_struct since the control flow of a thread can no longer be
> >> modified for the duration of the 'emulation', meaning there will now
> >> only ever be a single emuframe required for a thread at any given time.
> >>
> >> Signed-off-by: Paul Burton <paul.burton@imgtec.com>
> >> ---
> >> Changes in v2:
> >>    - s/kernels/kernel's/
> >>    - Use (mm_)isBranchInstr in mips_dsemul rather than duplicating
> >>      similar logic.
> >> ---
> >>   arch/mips/include/asm/fpu_emulator.h |   4 +
> >>   arch/mips/include/asm/mmu.h          |  12 ++
> >>   arch/mips/include/asm/mmu_context.h  |   7 +
> >>   arch/mips/include/asm/processor.h    |   7 +-
> >>   arch/mips/include/asm/thread_info.h  |   2 +
> >>   arch/mips/kernel/entry.S             |  13 +-
> >>   arch/mips/kernel/process.c           |   2 +
> >>   arch/mips/kernel/vdso.c              |   2 +-
> >>   arch/mips/math-emu/cp1emu.c          |   4 +-
> >>   arch/mips/math-emu/dsemul.c          | 266 ++++++++++++++++++++++++-----------
> >>   10 files changed, 226 insertions(+), 93 deletions(-)
> >>
> >> diff --git a/arch/mips/include/asm/fpu_emulator.h b/arch/mips/include/asm/fpu_emulator.h
> >> index 2abb587..16f7b0b 100644
> >> --- a/arch/mips/include/asm/fpu_emulator.h
> >> +++ b/arch/mips/include/asm/fpu_emulator.h
> >> @@ -51,6 +51,8 @@ do { \
> >>   #define MIPS_FPU_EMU_INC_STATS(M) do { } while (0)
> >>   #endif /* CONFIG_DEBUG_FS */
> >>
> >> +extern void dsemul_thread_cleanup(void);
> >> +extern void dsemul_mm_cleanup(struct mm_struct *mm);
> >>   extern int mips_dsemul(struct pt_regs *regs, mips_instruction ir,
> >>   unsigned long cpc);
> >>   extern int do_dsemulret(struct pt_regs *xcp);
> >> @@ -58,6 +60,8 @@ extern int fpu_emulator_cop1Handler(struct pt_regs *xcp,
> >>      struct mips_fpu_struct *ctx, int has_fpu,
> >>      void *__user *fault_addr);
> >>   int process_fpemu_return(int sig, void __user *fault_addr);
> >> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >> +  unsigned long *contpc);
> >>   int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >>       unsigned long *contpc);
> >>
> >> diff --git a/arch/mips/include/asm/mmu.h b/arch/mips/include/asm/mmu.h
> >> index c436138..08214da 100644
> >> --- a/arch/mips/include/asm/mmu.h
> >> +++ b/arch/mips/include/asm/mmu.h
> >> @@ -1,9 +1,21 @@
> >>   #ifndef __ASM_MMU_H
> >>   #define __ASM_MMU_H
> >>
> >> +#include <linux/mutex.h>
> >> +#include <linux/wait.h>
> >> +
> >>   typedef struct {
> >>   unsigned long asid[NR_CPUS];
> >>   void *vdso;
> >> +
> >> + /* address of page used to hold FP branch delay emulation frames */
> >> + unsigned long fp_bd_emupage;
> >> + /* bitmap tracking allocation of fp_bd_emupage */
> >> + unsigned long *fp_bd_emupage_allocmap;
> >> + /* mutex to be held whilst modifying fp_bd_emupage(_allocmap) */
> >> + struct mutex fp_bd_emupage_mutex;
> >> + /* wait queue for threads requiring an emuframe */
> >> + wait_queue_head_t fp_bd_emupage_queue;
> >>   } mm_context_t;
> >>
> >>   #endif /* __ASM_MMU_H */
> >> diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h
> >> index e277bba..c55e864 100644
> >> --- a/arch/mips/include/asm/mmu_context.h
> >> +++ b/arch/mips/include/asm/mmu_context.h
> >> @@ -16,6 +16,7 @@
> >>   #include <linux/smp.h>
> >>   #include <linux/slab.h>
> >>   #include <asm/cacheflush.h>
> >> +#include <asm/fpu_emulator.h>
> >>   #include <asm/hazards.h>
> >>   #include <asm/tlbflush.h>
> >>   #ifdef CONFIG_MIPS_MT_SMTC
> >> @@ -133,6 +134,11 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
> >>   for_each_possible_cpu(i)
> >>   cpu_context(i, mm) = 0;
> >>
> >> + mm->context.fp_bd_emupage = 0;
> >> + mm->context.fp_bd_emupage_allocmap = NULL;
> >> + mutex_init(&mm->context.fp_bd_emupage_mutex);
> >> + init_waitqueue_head(&mm->context.fp_bd_emupage_queue);
> >> +
> >>   return 0;
> >>   }
> >>
> >> @@ -199,6 +205,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >>    */
> >>   static inline void destroy_context(struct mm_struct *mm)
> >>   {
> >> + dsemul_mm_cleanup(mm);
> >>   }
> >>
> >>   #define deactivate_mm(tsk, mm) do { } while (0)
> >> diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
> >> index 3605b84..683a3d6 100644
> >> --- a/arch/mips/include/asm/processor.h
> >> +++ b/arch/mips/include/asm/processor.h
> >> @@ -38,9 +38,10 @@ extern unsigned int vced_count, vcei_count;
> >>
> >>   /*
> >>    * A special page (the vdso) is mapped into all processes at the very
> >> - * top of the virtual memory space.
> >> + * top of the virtual memory space. The page below it is used for FP
> >> + * emulator branch delay slot executions.
> >>    */
> >> -#define SPECIAL_PAGES_SIZE PAGE_SIZE
> >> +#define SPECIAL_PAGES_SIZE (PAGE_SIZE * 2)
> >>
> >>   #ifdef CONFIG_32BIT
> >>   #ifdef CONFIG_KVM_GUEST
> >> @@ -226,6 +227,8 @@ struct thread_struct {
> >>
> >>   /* Saved fpu/fpu emulator stuff. */
> >>   struct mips_fpu_struct fpu;
> >> + /* PC to continue from following an FP branch delay 'emulation' */
> >> + unsigned long fp_bd_emu_cpc;
> >>   #ifdef CONFIG_MIPS_MT_FPAFF
> >>   /* Emulated instruction count */
> >>   unsigned long emulated_fp;
> >> diff --git a/arch/mips/include/asm/thread_info.h b/arch/mips/include/asm/thread_info.h
> >> index b6da8b7..eee6e18 100644
> >> --- a/arch/mips/include/asm/thread_info.h
> >> +++ b/arch/mips/include/asm/thread_info.h
> >> @@ -118,6 +118,7 @@ static inline struct thread_info *current_thread_info(void)
> >>   #define TIF_LOAD_WATCH 25 /* If set, load watch registers */
> >>   #define TIF_SYSCALL_TRACEPOINT 26 /* syscall tracepoint instrumentation */
> >>   #define TIF_32BIT_FPREGS 27 /* 32-bit floating point registers */
> >> +#define TIF_FP_BD_EMU 28 /* executing an FP branch delay */
> >>   #define TIF_SYSCALL_TRACE 31 /* syscall trace active */
> >>
> >>   #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
> >> @@ -135,6 +136,7 @@ static inline struct thread_info *current_thread_info(void)
> >>   #define _TIF_FPUBOUND (1<<TIF_FPUBOUND)
> >>   #define _TIF_LOAD_WATCH (1<<TIF_LOAD_WATCH)
> >>   #define _TIF_32BIT_FPREGS (1<<TIF_32BIT_FPREGS)
> >> +#define _TIF_FP_BD_EMU (1<<TIF_FP_BD_EMU)
> >>   #define _TIF_SYSCALL_TRACEPOINT (1<<TIF_SYSCALL_TRACEPOINT)
> >>
> >>   #define _TIF_WORK_SYSCALL_ENTRY (_TIF_NOHZ | _TIF_SYSCALL_TRACE | \
> >> diff --git a/arch/mips/kernel/entry.S b/arch/mips/kernel/entry.S
> >> index e578685..24707d7 100644
> >> --- a/arch/mips/kernel/entry.S
> >> +++ b/arch/mips/kernel/entry.S
> >> @@ -168,10 +168,15 @@ work_resched:
> >>   andi t0, a2, _TIF_NEED_RESCHED
> >>   bnez t0, work_resched
> >>
> >> -work_notifysig: # deal with pending signals and
> >> - # notify-resume requests
> >> - move a0, sp
> >> - li a1, 0
> >> +work_notifysig:
> >> + and t0, a2, _TIF_FP_BD_EMU # are we currently 'emulating' the
> >> + # delay slot of an FP branch?
> >> + beqz t0, 1f # no, continue below
> >> + and a2, a2, ~_TIF_SIGPENDING # yes, skip handling signals
> >> + beqz a2, restore_all # which leaves us nothing to do
> >> +
> >> +1: move a0, sp # deal with pending signals and
> >> + li a1, 0 # notify-resume requests
> >>   jal do_notify_resume # a2 already loaded
> >>   j resume_userspace_check
> >>
> >> diff --git a/arch/mips/kernel/process.c b/arch/mips/kernel/process.c
> >> index 747a6cf..0219502 100644
> >> --- a/arch/mips/kernel/process.c
> >> +++ b/arch/mips/kernel/process.c
> >> @@ -32,6 +32,7 @@
> >>   #include <asm/cpu.h>
> >>   #include <asm/dsp.h>
> >>   #include <asm/fpu.h>
> >> +#include <asm/fpu_emulator.h>
> >>   #include <asm/pgtable.h>
> >>   #include <asm/mipsregs.h>
> >>   #include <asm/processor.h>
> >> @@ -72,6 +73,7 @@ void start_thread(struct pt_regs * regs, unsigned long pc, unsigned long sp)
> >>
> >>   void exit_thread(void)
> >>   {
> >> + dsemul_thread_cleanup();
> >>   }
> >>
> >>   void flush_thread(void)
> >> diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
> >> index 0f1af58..213d871 100644
> >> --- a/arch/mips/kernel/vdso.c
> >> +++ b/arch/mips/kernel/vdso.c
> >> @@ -78,7 +78,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> >>
> >>   down_write(&mm->mmap_sem);
> >>
> >> - addr = vdso_addr(mm->start_stack);
> >> + addr = vdso_addr(mm->start_stack) + PAGE_SIZE;
> >>
> >>   addr = get_unmapped_area(NULL, addr, PAGE_SIZE, 0, 0);
> >>   if (IS_ERR_VALUE(addr)) {
> >> diff --git a/arch/mips/math-emu/cp1emu.c b/arch/mips/math-emu/cp1emu.c
> >> index 22f7b11..a0566c8 100644
> >> --- a/arch/mips/math-emu/cp1emu.c
> >> +++ b/arch/mips/math-emu/cp1emu.c
> >> @@ -665,8 +665,8 @@ int mm_isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >>    * a single subroutine should be used across both
> >>    * modules.
> >>    */
> >> -static int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >> - unsigned long *contpc)
> >> +int isBranchInstr(struct pt_regs *regs, struct mm_decoded_insn dec_insn,
> >> +  unsigned long *contpc)
> >>   {
> >>   union mips_instruction insn = (union mips_instruction)dec_insn.insn;
> >>   unsigned int fcr31;
> >> diff --git a/arch/mips/math-emu/dsemul.c b/arch/mips/math-emu/dsemul.c
> >> index 7ea622a..3e64b17 100644
> >> --- a/arch/mips/math-emu/dsemul.c
> >> +++ b/arch/mips/math-emu/dsemul.c
> >> @@ -1,6 +1,8 @@
> >>   #include <linux/compiler.h>
> >> +#include <linux/err.h>
> >>   #include <linux/mm.h>
> >>   #include <linux/signal.h>
> >> +#include <linux/slab.h>
> >>   #include <linux/smp.h>
> >>
> >>   #include <asm/asm.h>
> >> @@ -45,52 +47,173 @@
> >>   struct emuframe {
> >>   mips_instruction emul;
> >>   mips_instruction badinst;
> >> - mips_instruction cookie;
> >> - unsigned long epc;
> >>   };
> >>
> >> +static const int emupage_frame_count = PAGE_SIZE / sizeof(struct emuframe);
> >> +
> >> +static struct emuframe __user *alloc_emuframe(void)
> >> +{
> >> + mm_context_t *mm_ctx = &current->mm->context;
> >> + struct emuframe __user *fr = NULL;
> >> + unsigned long addr;
> >> + int idx;
> >> +
> >> +retry:
> >> + mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
> >> +
> >> + /* Ensure we have a page allocated for emuframes */
> >> + if (!mm_ctx->fp_bd_emupage) {
> >> + addr = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
> >> +   VM_READ|VM_WRITE|VM_EXEC|
> >> +   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> >> +   0);
> >> + if (IS_ERR_VALUE(addr))
> >> + goto out_unlock;
> >> +
> >> + mm_ctx->fp_bd_emupage = addr;
> >> + pr_debug("allocate emupage at 0x%08lx to %d\n", addr,
> >> + current->pid);
> >> + }
> >> +
> >> + /* Ensure we have an allocation bitmap */
> >> + if (!mm_ctx->fp_bd_emupage_allocmap) {
> >> + mm_ctx->fp_bd_emupage_allocmap =
> >> + kcalloc(BITS_TO_LONGS(emupage_frame_count),
> >> +      sizeof(unsigned long),
> >> + GFP_KERNEL);
> >> +
> >> + if (!mm_ctx->fp_bd_emupage_allocmap)
> >> + goto out_unlock;
> >> + }
> >> +
> >> + /* Attempt to allocate a single bit/frame */
> >> + idx = bitmap_find_free_region(mm_ctx->fp_bd_emupage_allocmap,
> >> +      emupage_frame_count, 0);
> >> + if (idx < 0) {
> >> + /*
> >> + * Failed to allocate a frame. We'll wait until one becomes
> >> + * available. The mutex is unlocked so that other threads
> >> + * actually get the opportunity to free their frames, which
> >> + * means technically the result of bitmap_full may be incorrect.
> >> + * However the worst case is that we repeat all this and end up
> >> + * back here again.
> >> + */
> >> + mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> >> + if (!wait_event_killable(mm_ctx->fp_bd_emupage_queue,
> >> + !bitmap_full(mm_ctx->fp_bd_emupage_allocmap,
> >> +     emupage_frame_count)))
> >> + goto retry;
> >> +
> >> + /* Received a fatal signal - just give in */
> >> + return NULL;
> >> + }
> >> +
> >> + /* Success! */
> >> + fr = (struct emuframe __user *)mm_ctx->fp_bd_emupage + idx;
> >> + pr_debug("allocate emuframe %d to %d\n", idx, current->pid);
> >> +out_unlock:
> >> + mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> >> + return fr;
> >> +}
> >> +
> >> +static void free_emuframe(struct emuframe __user *frame)
> >> +{
> >> + mm_context_t *mm_ctx = &current->mm->context;
> >> + int idx;
> >> +
> >> + mutex_lock(&mm_ctx->fp_bd_emupage_mutex);
> >> +
> >> + idx = frame - (struct emuframe __user *)mm_ctx->fp_bd_emupage;
> >> + pr_debug("free emuframe %d from %d\n", idx, current->pid);
> >> + bitmap_clear(mm_ctx->fp_bd_emupage_allocmap, idx, 1);
> >> +
> >> + /* If some thread is waiting for a frame, now's its chance */
> >> + wake_up(&mm_ctx->fp_bd_emupage_queue);
> >> +
> >> + mutex_unlock(&mm_ctx->fp_bd_emupage_mutex);
> >> +}
> >> +
> >> +void dsemul_thread_cleanup(void)
> >> +{
> >> + /*
> >> + * We should always have passed through do_dsemulret prior to the
> >> + * thread exiting, so TIF_FP_BD_EMU should never be set here.
> >> + */
> >> + BUG_ON(test_thread_flag(TIF_FP_BD_EMU));
> >> +}
> >> +
> >> +void dsemul_mm_cleanup(struct mm_struct *mm)
> >> +{
> >> + mm_context_t *mm_ctx = &mm->context;
> >> +
> >> + kfree(mm_ctx->fp_bd_emupage_allocmap);
> >> +}
> >> +
> >>   int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
> >>   {
> >> - extern asmlinkage void handle_dsemulret(void);
> >> + struct mm_decoded_insn mm_inst = { .insn = ir };
> >>   struct emuframe __user *fr;
> >> - int err;
> >> + struct pt_regs dummy_regs;
> >> + unsigned long dummy_cpc;
> >> + int err, is_mm;
> >>
> >> - if ((get_isa16_mode(regs->cp0_epc) && ((ir >> 16) == MM_NOP16)) ||
> >> - (ir == 0)) {
> >> - /* NOP is easy */
> >> + /*
> >> + * Trivially handle typical NOP encodings:
> >> + *
> >> + *   MIPS32: sll r0, r0, 0
> >> + *   microMIPS: move16 r0, r0
> >> + */
> >> + is_mm = get_isa16_mode(regs->cp0_epc);
> >> + if ((!is_mm && !ir) || (is_mm && ((ir >> 16) == MM_NOP16))) {
> >> +is_nop:
> >>   regs->cp0_epc = cpc;
> >>   regs->cp0_cause &= ~CAUSEF_BD;
> >>   return 0;
> >>   }
> >> -#ifdef DSEMUL_TRACE
> >> - printk("dsemul %lx %lx\n", regs->cp0_epc, cpc);
> >> -
> >> -#endif
> >>
> >>   /*
> >> - * The strategy is to push the instruction onto the user stack
> >> - * and put a trap after it which we can catch and jump to
> >> - * the required address any alternative apart from full
> >> - * instruction emulation!!.
> >> + * In order for us to clean up the emuframe properly, we'll need to
> >> + * execute a break instruction after ir. If ir is a branch then we may
> >> + * never reach that break instruction and thus never free the emuframe.
> >>   *
> >> - * Algorithmics used a system call instruction, and
> >> - * borrowed that vector.  MIPS/Linux version is a bit
> >> - * more heavyweight in the interests of portability and
> >> - * multiprocessor support.  For Linux we generate a
> >> - * an unaligned access and force an address error exception.
> >> + * Fortunately we know that ir is in a branch delay slot and thus if
> >> + * it is a branch then its operation is unpredictable. So we can just
> >> + * treat branches as NOPs and skip the 'emulation' entirely.
> >>   *
> >> - * For embedded systems (stand-alone) we prefer to use a
> >> - * non-existing CP1 instruction. This prevents us from emulating
> >> - * branches, but gives us a cleaner interface to the exception
> >> - * handler (single entry point).
> >> + * If the worst happens and we miss a branch/jump instruction here, or
> >> + * some processor implements a custom one, then it would be possible
> >> + * for us to allocate an emuframe and never free it. Fortunately this
> >> + * would:
> >> + *
> >> + *  1) Be a bug in the userland code, because it has a branch/jump in
> >> + *     a branch delay slot. So if we run out of emuframes and the
> >> + *     userland code hangs it's not exactly the kernel's fault.
> >> + *
> >> + *  2) Only affect that userland process, since emuframes are allocated
> >> + *     per-mm and kernel threads don't use them at all.
> >>   */
> >> + if ((!is_mm && isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc)) ||
> >> +    (is_mm && mm_isBranchInstr(&dummy_regs, mm_inst, &dummy_cpc))) {
> >> + pr_warn("PID %d has a branch in an FP branch delay slot at 0x%08lx\n",
> >> + current->pid, regs->cp0_epc);
> >> + goto is_nop;
> >> + }
> >>
> >> - /* Ensure that the two instructions are in the same cache line */
> >> - fr = (struct emuframe __user *)
> >> - ((regs->regs[29] - sizeof(struct emuframe)) & ~0x7);
> >> + pr_debug("dsemul 0x%08lx cont at 0x%08lx\n", regs->cp0_epc, cpc);
> >>
> >> - /* Verify that the stack pointer is not competely insane */
> >> - if (unlikely(!access_ok(VERIFY_WRITE, fr, sizeof(struct emuframe))))
> >> + /*
> >> + * The strategy is to write the instruction to a per-mm page followed
> >> + * by a trap which we can catch to return to the required address. This
> >> + * avoids the need for full instruction emulation.
> >> + *
> >> + * Algorithmics used a system call instruction, and borrowed that
> >> + * vector.  MIPS/Linux version is a bit more heavyweight in the
> >> + * interests of portability and multiprocessor support.  For Linux we
> >> + * generate a BREAK instruction with a break code reserved for this
> >> + * purpose.
> >> + */
> >> + fr = alloc_emuframe();
> >> + if (!fr)
> >>   return SIGBUS;
> >>
> >>   if (get_isa16_mode(regs->cp0_epc)) {
> >> @@ -103,17 +226,18 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
> >>   err |= __put_user((mips_instruction)BREAK_MATH, &fr->badinst);
> >>   }
> >>
> >> - err |= __put_user((mips_instruction)BD_COOKIE, &fr->cookie);
> >> - err |= __put_user(cpc, &fr->epc);
> >> -
> >>   if (unlikely(err)) {
> >>   MIPS_FPU_EMU_INC_STATS(errors);
> >> + free_emuframe(fr);
> >>   return SIGBUS;
> >>   }
> >>
> >>   regs->cp0_epc = ((unsigned long) &fr->emul) |
> >>   get_isa16_mode(regs->cp0_epc);
> >>
> >> + current->thread.fp_bd_emu_cpc = cpc;
> >> + set_thread_flag(TIF_FP_BD_EMU);
> >> +
> >>   flush_cache_sigtramp((unsigned long)&fr->badinst);
> >>
> >>   return SIGILL; /* force out of emulation loop */
> >> @@ -121,64 +245,38 @@ int mips_dsemul(struct pt_regs *regs, mips_instruction ir, unsigned long cpc)
> >>
> >>   int do_dsemulret(struct pt_regs *xcp)
> >>   {
> >> - struct emuframe __user *fr;
> >> - unsigned long epc;
> >> - u32 insn, cookie;
> >> - int err = 0;
> >> - u16 instr[2];
> >> -
> >> - fr = (struct emuframe __user *)
> >> - (msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction));
> >> -
> >> - /*
> >> - * If we can't even access the area, something is very wrong, but we'll
> >> - * leave that to the default handling
> >> - */
> >> - if (!access_ok(VERIFY_READ, fr, sizeof(struct emuframe)))
> >> - return 0;
> >> -
> >> - /*
> >> - * Do some sanity checking on the stackframe:
> >> - *
> >> - *  - Is the instruction pointed to by the EPC an BREAK_MATH?
> >> - *  - Is the following memory word the BD_COOKIE?
> >> - */
> >> - if (get_isa16_mode(xcp->cp0_epc)) {
> >> - err = __get_user(instr[0], (u16 __user *)(&fr->badinst));
> >> - err |= __get_user(instr[1], (u16 __user *)((long)(&fr->badinst) + 2));
> >> - insn = (instr[0] << 16) | instr[1];
> >> - } else {
> >> - err = __get_user(insn, &fr->badinst);
> >> - }
> >> - err |= __get_user(cookie, &fr->cookie);
> >> + mm_context_t *mm_ctx = &current->mm->context;
> >> + struct emuframe __user *fr = NULL;
> >> + unsigned long fr_addr;
> >> + int success = 0;
> >>
> >> - if (unlikely(err || (insn != BREAK_MATH) || (cookie != BD_COOKIE))) {
> >> - MIPS_FPU_EMU_INC_STATS(errors);
> >> - return 0;
> >> - }
> >> + /* If we don't have TIF_FP_BD_EMU set... */
> >> + if (!test_and_clear_thread_flag(TIF_FP_BD_EMU))
> >> + goto out;
> >>
> >>   /*
> >> - * At this point, we are satisfied that it's a BD emulation trap.  Yes,
> >> - * a user might have deliberately put two malformed and useless
> >> - * instructions in a row in his program, in which case he's in for a
> >> - * nasty surprise - the next instruction will be treated as a
> >> - * continuation address!  Alas, this seems to be the only way that we
> >> - * can handle signals, recursion, and longjmps() in the context of
> >> - * emulating the branch delay instruction.
> >> + * ...or EPC is outside of the expected page or misaligned then
> >> + * something is wrong. Leave it to the default trap/break code to
> >> + * handle.
> >>   */
> >> + fr_addr = msk_isa16_mode(xcp->cp0_epc) - sizeof(mips_instruction);
> >> + if ((fr_addr < mm_ctx->fp_bd_emupage) ||
> >> +    (fr_addr > (mm_ctx->fp_bd_emupage + PAGE_SIZE - sizeof(*fr))) ||
> >> +    (fr_addr & (sizeof(*fr) - 1)))
> >> + goto out;
> >>
> >> -#ifdef DSEMUL_TRACE
> >> - printk("dsemulret\n");
> >> -#endif
> >> - if (__get_user(epc, &fr->epc)) { /* Saved EPC */
> >> - /* This is not a good situation to be in */
> >> - force_sig(SIGBUS, current);
> >> -
> >> - return 0;
> >> - }
> >> + /* At this point, we are satisfied that it's a BD emulation trap. */
> >> + fr = (struct emuframe __user *)fr_addr;
> >>
> >>   /* Set EPC to return to post-branch instruction */
> >> - xcp->cp0_epc = epc;
> >> + xcp->cp0_epc = current->thread.fp_bd_emu_cpc;
> >> + success = 1;
> >>
> >> - return 1;
> >> + pr_debug("dsemulret to 0x%08lx\n", xcp->cp0_epc);
> >> +out:
> >> + if (fr)
> >> + free_emuframe(fr);
> >> + if (!success)
> >> + MIPS_FPU_EMU_INC_STATS(errors);
> >> + return success;
> >>   }
> >>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2014-07-03 22:31 Ed Swierk
  2014-07-04  8:06   ` Paul Burton
  0 siblings, 1 reply; 42+ messages in thread
From: Ed Swierk @ 2014-07-03 22:31 UTC (permalink / raw)
  To: Paul Burton; +Cc: linux-mips, ddaney.cavm, ralf

On Thu, Jul 3, 2014 at 1:12 PM, Paul Burton <paul.burton@imgtec.com> wrote:
> On Thu, Jul 03, 2014 at 10:56:10AM -0700, Ed Swierk wrote:
>> Now that Linux makes user stacks
>> non-executable by default, the current FP emulation approach is simply
>> broken.
>
> Really? I wasn't aware of any change to the default attributes of the
> stack. Do you know what changed? From a quick look at fs/binfmt_elf.c &
> arch/mips/include/asm/elf.h I can't see anything relevant having
> changed - the stack should be executable unless a non-executable
> PT_GNU_STACK header is present in the ELF. I don't suppose the issue
> is simply that such a PT_GNU_STACK header is present in your binaries?

Actually that was a completely unsupported assertion on my part. I
have no reason to believe there was a change in behavior in the kernel
or the toolchain (gcc 4.9.0, x86_64 host, mips target; binutils
2.24.51.20140425).

What I do notice is that mips-linux-gnu-gcc generates no
.note.GNU-stack section, while x86_64-linux-gnu-gcc does. In turn, ld
produces no GNU_STACK program header on the mips executable, while for
x86_64 it produces GNU_STACK with RW (no E) flags.
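
For anyone wanting to check their own binaries, the difference comes
down to whether a PT_GNU_STACK program header exists and, if so,
whether it carries the execute flag. A minimal sketch of that check,
assuming 32-bit ELF program headers that have already been read into
memory. The enum and function names are made up for illustration, and
the three-way result is modelled on what the generic ELF loader does
rather than copied from it.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* Hypothetical names for the three possible outcomes. */
enum model_stack_exec {
	MODEL_STACK_DEFAULT,  /* no PT_GNU_STACK: arch default applies */
	MODEL_STACK_EXEC,     /* PT_GNU_STACK present with PF_X */
	MODEL_STACK_NOEXEC,   /* PT_GNU_STACK present without PF_X */
};

/* Classify a binary by scanning its program headers for PT_GNU_STACK. */
enum model_stack_exec model_classify_stack(const Elf32_Phdr *phdrs, size_t n)
{
	assert(phdrs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (phdrs[i].p_type != PT_GNU_STACK)
			continue;
		return (phdrs[i].p_flags & PF_X) ? MODEL_STACK_EXEC
						 : MODEL_STACK_NOEXEC;
	}
	return MODEL_STACK_DEFAULT;
}
```

By this classification the mips executable described above is the
MODEL_STACK_DEFAULT case and the x86_64 one is MODEL_STACK_NOEXEC.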

The toolchain behavior is the same for gccgo as for gcc. But I get a
segv on the Octeon2 target only when running a gccgo-generated
executable. A C program compiled with gcc works fine performing the
same FP operations.

And when I add the following hack to mips/include/asm/elf.h in the
kernel, the segv goes away:

   #define elf_read_implies_exec(ex, have_pt_gnu_stack) 1

So I assume gccgo or libgo is doing some extra magic that makes the
stack non-executable on mips at least.

>> I'm wondering if instead of trying to free the page
>> for the FP branch delay emuframe immediately, it would be simpler to
>> leave it around until the thread is destroyed.
>
> It's not really an issue of freeing a page - my patch mapped one page
> per-mm (per-process) and that page was left intact for the life of that
> mm (process).

Ah, I see. What if we allocate a page per thread rather than per
process? Then the bookkeeping becomes a lot simpler, as there can be
only a single emuframe in the page at one time. And we can defer
freeing the page until the thread exits.

Assuming we could tolerate the overhead of an entire page for a puny
little emuframe, do you think the approach would work?
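
A rough userspace model of the per-thread idea, to show how little
bookkeeping is left. This is a sketch under assumptions, not a
proposed implementation: the model_* names are hypothetical, the
trapping instruction is passed in rather than hard-coded, and what
should happen if a second frame is needed while one is live (for
example from a signal handler) is deliberately left open, so the
sketch simply refuses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same layout as the patch's struct emuframe: two instructions. */
struct model_emuframe {
	uint32_t emul;     /* instruction from the branch delay slot */
	uint32_t badinst;  /* trapping instruction that returns to the kernel */
};

/* The only frame a thread ever has, so no bitmap, mutex or wait queue. */
struct model_thread_frame {
	struct model_emuframe frame;
	bool live;              /* true between begin and end */
	unsigned long cont_pc;  /* where to continue after the delay slot */
};

/* Set up the frame. Returns false if one is already live. */
bool model_thread_begin(struct model_thread_frame *tf, uint32_t insn,
			uint32_t trap_insn, unsigned long cont_pc)
{
	if (tf->live)
		return false;

	tf->frame.emul = insn;
	tf->frame.badinst = trap_insn;
	tf->cont_pc = cont_pc;
	tf->live = true;
	return true;
}

/* Called when the trap is taken. Returns the PC to continue from. */
unsigned long model_thread_end(struct model_thread_frame *tf)
{
	assert(tf->live);
	tf->live = false;
	return tf->cont_pc;
}
```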

--Ed

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2014-07-04  8:06   ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2014-07-04  8:06 UTC (permalink / raw)
  To: Ed Swierk; +Cc: linux-mips, ddaney.cavm, ralf

On Thu, Jul 03, 2014 at 03:31:48PM -0700, Ed Swierk wrote:
> On Thu, Jul 3, 2014 at 1:12 PM, Paul Burton <paul.burton@imgtec.com> wrote:
> > On Thu, Jul 03, 2014 at 10:56:10AM -0700, Ed Swierk wrote:
> >> Now that Linux makes user stacks
> >> non-executable by default, the current FP emulation approach is simply
> >> broken.
> >
> > Really? I wasn't aware of any change to the default attributes of the
> > stack. Do you know what changed? From a quick look at fs/binfmt_elf.c &
> > arch/mips/include/asm/elf.h I can't see anything relevant having
> > changed - the stack should be executable unless a non-executable
> > PT_GNU_STACK header is present in the ELF. I don't suppose the issue
> > is simply that such a PT_GNU_STACK header is present in your binaries?
> 
> Actually that was a completely unsupported assertion on my part. I
> have no reason to believe there was a change in behavior in the kernel
> or the toolchain (gcc 4.9.0, x86_64 host, mips target; binutils
> 2.24.51.20140425).
> 
> What I do notice is that mips-linux-gnu-gcc generates no
> .note.GNU-stack section, while x86_64-linux-gnu-gcc does. In turn, ld
> produces no GNU_STACK program header on the mips executable, while for
> x86_64 it produces GNU_STACK with RW (no E) flags.
> 
> The toolchain behavior is the same for gccgo as for gcc. But I get a
> segv on the Octeon2 target only when running a gccgo-generated
> executable. A C program compiled with gcc works fine performing the
> same FP operations.
> 
> And when I add the following hack to mips/include/asm/elf.h in the
> kernel, the segv goes away:
> 
>    #define elf_read_implies_exec(ex, have_pt_gnu_stack) 1
> 
> So I assume gccgo or libgo is doing some extra magic that makes the
> stack non-executable on mips at least.

Ah, interesting :)

I haven't tried running any Go executables before, but Go and Rust are
two languages I've been curious about for a while.

> >> I'm wondering if instead of trying to free the page
> >> for the FP branch delay emuframe immediately, it would be simpler to
> >> leave it around until the thread is destroyed.
> >
> > It's not really an issue of freeing a page - my patch mapped one page
> > per-mm (per-process) and that page was left intact for the life of that
> > mm (process).
> 
> Ah, I see. What if we allocate a page per thread rather than per
> process? Then the bookkeeping becomes a lot simpler, as there can be
> only a single emuframe in the page at one time. And we can defer
> freeing the page until the thread exits.
> 
> Assuming we could tolerate the overhead of an entire page for a puny
> little emuframe, do you think the approach would work?

Yes, I think it would. The reason I went with the per-mm approach though
was to try to avoid so much overhead. I suppose we could possibly
allocate the page on demand so that threads which don't use FP don't pay
for it, and maybe use the shrinker interface to free the page if we run
low on memory and aren't currently executing from it. That would
mean that the FP branch delay "emulation" could fail if memory is tight,
but I suppose that's no worse than now, where it could blow the (user)
stack.
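
The lifetime described above can be modelled in userspace roughly as
follows. This is only a sketch with hypothetical model_* names: calloc
and free stand in for mapping and unmapping the page, and
model_lazy_reclaim stands in for whatever a shrinker callback would
do. It is not the shrinker API itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE ((size_t)4096)  /* assumption: 4KB pages */

struct model_lazy_page {
	void *page;   /* NULL until the first FP branch delay 'emulation' */
	bool in_use;  /* true whilst the thread is executing from the page */
};

/* Get the page, allocating it on first use. NULL if memory is tight. */
void *model_lazy_get(struct model_lazy_page *lp)
{
	if (!lp->page) {
		lp->page = calloc(1, MODEL_PAGE_SIZE);
		if (!lp->page)
			return NULL;
	}
	lp->in_use = true;
	return lp->page;
}

/* Mark the page idle once the thread has trapped back to the kernel. */
void model_lazy_put(struct model_lazy_page *lp)
{
	assert(lp->in_use);
	lp->in_use = false;
}

/* Reclaim hook: frees an idle page and returns the number of bytes freed. */
size_t model_lazy_reclaim(struct model_lazy_page *lp)
{
	if (!lp->page || lp->in_use)
		return 0;

	free(lp->page);
	lp->page = NULL;
	return MODEL_PAGE_SIZE;
}
```

Threads that never hit an emulated FP branch never allocate anything,
and a page that is being executed from is never reclaimed. The cost is
the failure path in model_lazy_get.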

I'll try to get a v3 out at some point soon.

Paul

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
  2014-07-04  8:06   ` Paul Burton
  (?)
@ 2014-07-04  8:52   ` Ralf Baechle
  2014-07-04  9:06       ` Paul Burton
  -1 siblings, 1 reply; 42+ messages in thread
From: Ralf Baechle @ 2014-07-04  8:52 UTC (permalink / raw)
  To: Paul Burton; +Cc: Ed Swierk, linux-mips, ddaney.cavm

On Fri, Jul 04, 2014 at 09:06:41AM +0100, Paul Burton wrote:

> Yes, I think it would. The reason I went with the per-mm approach though
> was to try to avoid so much overhead. I suppose we could possibly
> allocate the page on demand so that threads which don't use FP don't pay
> for it, and maybe use the shrinker interface to free the page if we run
> low on memory and aren't currently executing from it. Though it would
> mean that the FP branch delay "emulation" could fail if memory is tight,
> but I suppose that's no worse than now where it could blow the (user)
> stack.
> 
> I'll try to get a v3 out at some point soon.

The actual piece of code that needs to be installed is tiny.  So the page
could be shared between many threads.  In fact a single page would
suffice for most processes and only threads would require more slots
than provided by a single page so more pages could be allocated or the
process could sleep until a slot becomes available.

Assuming the smallest supported page size of 4k and slots of 128 bytes
(that is the largest S-cache line size in common use) that's 32 slots.
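
As a rough sketch of that arithmetic (all names here are hypothetical and
this is not taken from the patch), 4096 / 128 = 32 slots fit a single
32-bit bitmap, one bit per slot:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ   4096u               /* smallest supported page size */
#define SLOT_SZ   128u                /* slot size suggested above */
#define NR_SLOTS  (PAGE_SZ / SLOT_SZ) /* 4096 / 128 = 32 */

/* One bit per slot; 32 slots fit exactly in a uint32_t. */
struct slot_page {
    uint32_t used;
};

/* Claim the lowest free slot; return its index, or -1 if the page is full. */
static int slot_alloc(struct slot_page *p)
{
    for (unsigned int i = 0; i < NR_SLOTS; i++) {
        if (!(p->used & (UINT32_C(1) << i))) {
            p->used |= UINT32_C(1) << i;
            return (int)i;
        }
    }
    return -1;
}

/* Mark a previously claimed slot as free again. */
static void slot_free(struct slot_page *p, int slot)
{
    assert(slot >= 0 && (unsigned int)slot < NR_SLOTS);
    p->used &= ~(UINT32_C(1) << slot);
}

/* Byte offset of a slot within the page. */
static unsigned int slot_offset(int slot)
{
    assert(slot >= 0 && (unsigned int)slot < NR_SLOTS);
    return (unsigned int)slot * SLOT_SZ;
}
```

A return of -1 from slot_alloc() is the point at which a real
implementation would have to allocate another page or sleep until a slot
is freed. The sketch ignores locking entirely.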

I'm also wondering how insane emulation would be.  We already have the
capability to emulate a fair fraction of the instruction set.

  Ralf

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2014-07-04  9:06       ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2014-07-04  9:06 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Ed Swierk, linux-mips, ddaney.cavm

On Fri, Jul 04, 2014 at 10:52:46AM +0200, Ralf Baechle wrote:
> On Fri, Jul 04, 2014 at 09:06:41AM +0100, Paul Burton wrote:
> 
> > Yes, I think it would. The reason I went with the per-mm approach though
> > was to try to avoid so much overhead. I suppose we could possibly
> > allocate the page on demand so that threads which don't use FP don't pay
> > for it, and maybe use the shrinker interface to free the page if we run
> > low on memory and aren't currently executing from it. Though it would
> > mean that the FP branch delay "emulation" could fail if memory is tight,
> > but I suppose that's no worse than now where it could blow the (user)
> > stack.
> > 
> > I'll try to get a v3 out at some point soon.
> 
> The actual piece of code that needs to be installed is tiny.  So the page
> could be shared between many threads.  In fact a single page would
> suffice for most processes and only threads would require more slots
> than provided by a single page so more pages could be allocated or the
> process could sleep until a slot becomes available.

You just roughly described the v2 patch that we're replying to :)

The problem is how to reliably free the frame after it has been used.
I can see ways to do it, but none that are particularly "nice".

> Assuming the smallest supported page size of 4k and slots of 128 bytes
> (that is the largest S-cache line size in common use) that's 32 slots.

Why S-cache line sized slots? I suppose it could simplify updating the
page slightly at the cost of space.

> I'm also wondering how insane emulation would be.  We already have the
> capability to emulate a fair fraction of the instruction set.

Yeah, and I'm reasonably sure we're going to need some more once MIPSr6
is supported. I guess (perhaps only for the short term?) it could be
done in stages - if systems include ASEs or cop2 that the emulation
didn't implement then it could fall back to the current emuframe code.
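
A minimal sketch of that staged approach, with made-up names and a
deliberately coarse split of the instruction set (the assumption that only
core and FPU instructions are emulated is purely illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical instruction classes; the real ISA split is far finer. */
enum insn_class { INSN_CORE, INSN_FPU, INSN_ASE, INSN_COP2 };

/* Which mechanism handles the branch delay slot instruction. */
enum ds_path { DS_EMULATE, DS_EMUFRAME };

/* Assumption for this sketch: only core and FPU instructions are emulated. */
static bool emulator_supports(enum insn_class c)
{
    return c == INSN_CORE || c == INSN_FPU;
}

/* Emulate in the kernel when possible, else fall back to an emuframe. */
static enum ds_path ds_choose_path(enum insn_class c)
{
    assert(c <= INSN_COP2);
    return emulator_supports(c) ? DS_EMULATE : DS_EMUFRAME;
}
```

As emulation coverage grows, emulator_supports() would accept more classes
and the emuframe path would be taken less often.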

I'm in 2 minds about this - it sounds crazy but perhaps it's the most
sane option available :)

Paul

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
  2014-07-04  9:06       ` Paul Burton
  (?)
@ 2014-07-04  9:38       ` Ralf Baechle
  2014-07-04 11:30           ` Paul Burton
  -1 siblings, 1 reply; 42+ messages in thread
From: Ralf Baechle @ 2014-07-04  9:38 UTC (permalink / raw)
  To: Paul Burton; +Cc: Ed Swierk, linux-mips, ddaney.cavm

On Fri, Jul 04, 2014 at 10:06:01AM +0100, Paul Burton wrote:

> > The actual piece of code that needs to be installed is tiny.  So the page
> > could be shared between many threads.  In fact a single page would
> > suffice for most processes and only threads would require more slots
> > than provided by a single page so more pages could be allocated or the
> > process could sleep until a slot becomes available.
> 
> You just roughly described the v2 patch that we're replying to :)

Can't be that wrong then :-)

I seem to only have replies to that patch in my mail folder not the
patch itself.

> The problem is how to reliably free the frame after it has been used.
> I can see ways to do it, but none that are particularly "nice".
> 
> > Assuming the smallest supported page size of 4k and slots of 128 bytes
> > (that is the largest S-cache line size in common use) that's 32 slots.
> 
> Why S-cache line sized slots? I suppose it could simplify updating the
> page slightly at the cost of space.

That's to handle the worst case - on R4000/R4400 SC and MC variants it is
possible to split the S-cache into an SI-cache and an SD-cache.  That means
modified instructions need to be written back all the way to memory,
otherwise potentially stale instructions might be fetched from the
SI-cache.

That's more theoretical - I'm not aware of any system that's using split
S-caches.  Still using S-cache line sized slots might reduce the cache
line ping pong on multi-core systems a bit.

> > I'm also wondering how insane emulation would be.  We already have the
> > capability to emulate a fair fraction of the instruction set.
> 
> Yeah, and I'm reasonably sure we're going to need some more once MIPSr6
> is supported. I guess (perhaps only for the short term?) it could be
> done in stages - if systems include ASEs or cop2 that the emulation
> didn't implement then it could fall back to the current emuframe code.

And its dependence on an executable stack frame ...

> I'm in 2 minds about this - it sounds crazy but perhaps it's the most
> sane option available :)

Sanity is overrated anyway ;-)

  Ralf

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2014-07-04 11:30           ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2014-07-04 11:30 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Ed Swierk, linux-mips, ddaney.cavm

On Fri, Jul 04, 2014 at 11:38:09AM +0200, Ralf Baechle wrote:
> On Fri, Jul 04, 2014 at 10:06:01AM +0100, Paul Burton wrote:
> 
> > > The actual piece of code that needs to be installed is tiny.  So the page
> > > could be shared between many threads.  In fact a single page would
> > > suffice for most processes and only threads would require more slots
> > > than provided by a single page so more pages could be allocated or the
> > > process could sleep until a slot becomes available.
> > 
> > You just roughly described the v2 patch that we're replying to :)
> 
> Can't be that wrong then :-)
> 
> I seem to only have replies to that patch in my mail folder not the
> patch itself.

You can find it in patchwork if you're interested:

  http://patchwork.linux-mips.org/patch/6125/

> > > I'm also wondering how insane emulation would be.  We already have the
> > > capability to emulate a fair fraction of the instruction set.
> > 
> > Yeah, and I'm reasonably sure we're going to need some more once MIPSr6
> > is supported. I guess (perhaps only for the short term?) it could be
> > done in stages - if systems include ASEs or cop2 that the emulation
> > didn't implement then it could fall back to the current emuframe code.
> 
> And its dependence on an executable stack frame ...

...yup.

> > I'm in 2 minds about this - it sounds crazy but perhaps it's the most
> > sane option available :)
> 
> Sanity is overrated anyway ;-)

I had originally left this patch at the point I started considering
implementing emulation for the whole ISA in the kernel, figuring I was
going insane & should probably do something else for a while. Perhaps I
shouldn't worry so much ;)

Thanks,
    Paul

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
  2014-07-04 11:30           ` Paul Burton
  (?)
@ 2014-07-04 15:42           ` Ralf Baechle
  -1 siblings, 0 replies; 42+ messages in thread
From: Ralf Baechle @ 2014-07-04 15:42 UTC (permalink / raw)
  To: Paul Burton; +Cc: Ed Swierk, linux-mips, ddaney.cavm

On Fri, Jul 04, 2014 at 12:30:07PM +0100, Paul Burton wrote:

> I had originally left this patch at the point I started considering
> implementing emulation for the whole ISA in the kernel, figuring I was
> going insane & should probably do something else for a while. Perhaps I
> shouldn't worry so much ;)

Full emulation is a bit problematic in particular for some of the older
CPUs which have random odd ISA extensions.  Think like a MIPS I plus a
sprinkling of MIPS II and a bucket of oddness or similar.  Getting that
right would be tedious.

  Ralf

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
  2014-07-04 11:30           ` Paul Burton
  (?)
  (?)
@ 2014-09-13 23:06           ` Maciej W. Rozycki
  2014-09-18  8:57               ` Paul Burton
  -1 siblings, 1 reply; 42+ messages in thread
From: Maciej W. Rozycki @ 2014-09-13 23:06 UTC (permalink / raw)
  To: Paul Burton; +Cc: Ralf Baechle, Ed Swierk, linux-mips, ddaney.cavm

On Fri, 4 Jul 2014, Paul Burton wrote:

> > > I'm in 2 minds about this - it sounds crazy but perhaps it's the most
> > > sane option available :)
> > 
> > Sanity is overrated anyway ;-)
> 
> I had originally left this patch at the point I started considering
> implementing emulation for the whole ISA in the kernel, figuring I was
> going insane & should probably do something else for a while. Perhaps I
> shouldn't worry so much ;)

 One question: does this emulation handle PC-relative instructions placed 
in a branch delay slot correctly?  This only applies to microMIPS ADDIUPC 
at the moment I believe, but still that has to work correctly whether on 
FP hardware or emulated.

  Maciej

* Re: [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots
@ 2014-09-18  8:57               ` Paul Burton
  0 siblings, 0 replies; 42+ messages in thread
From: Paul Burton @ 2014-09-18  8:57 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: Ralf Baechle, Ed Swierk, linux-mips, ddaney.cavm

On Sun, Sep 14, 2014 at 12:06:03AM +0100, Maciej W. Rozycki wrote:
> On Fri, 4 Jul 2014, Paul Burton wrote:
> 
> > > > I'm in 2 minds about this - it sounds crazy but perhaps it's the most
> > > > sane option available :)
> > > 
> > > Sanity is overrated anyway ;-)
> > 
> > I had originally left this patch at the point I started considering
> > implementing emulation for the whole ISA in the kernel, figuring I was
> > going insane & should probably do something else for a while. Perhaps I
> > shouldn't worry so much ;)
> 
>  One question: does this emulation handle PC-relative instructions placed 
> in a branch delay slot correctly?  This only applies to microMIPS ADDIUPC 
> at the moment I believe, but still that has to work correctly whether on 
> FP hardware or emulated.
> 
>   Maciej

Hi Maciej,

That's a good question, and no I don't believe the current dsemul code
will handle that correctly. Perhaps that's another argument in favor of
the full on emulator...
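
To make the problem concrete, here is a simplified model, assuming a
PC-relative add that computes (PC with its low two bits cleared) + offset.
It is not an encoding-accurate ADDIUPC, and pcrel_add() and pcrel_error()
are names made up for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of a PC-relative add, assumed here to compute
 * (pc with its low two bits cleared) + offset. This is an illustration,
 * not an encoding-accurate ADDIUPC implementation.
 */
static uint32_t pcrel_add(uint32_t pc, int32_t offset)
{
    return (pc & ~UINT32_C(3)) + (uint32_t)offset;
}

/*
 * Error in the result if the instruction runs from a relocated copy at
 * frame_pc instead of its original location orig_pc.
 */
static uint32_t pcrel_error(uint32_t orig_pc, uint32_t frame_pc,
                            int32_t offset)
{
    return pcrel_add(frame_pc, offset) - pcrel_add(orig_pc, offset);
}
```

Unless the frame happens to sit at the original address, pcrel_error() is
non-zero: the copied instruction computes an address relative to the frame
rather than to the branch it followed.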

Paul

Thread overview: 42+ messages
2013-11-07 12:48 [PATCH 0/6] FP improvements Paul Burton
2013-11-07 12:48 ` Paul Burton
2013-11-07 12:48 ` [PATCH 1/6] mips: mfhc1 & mthc1 support for the FPU emulator Paul Burton
2013-11-07 12:48   ` Paul Burton
2013-11-07 12:48 ` [PATCH 2/6] mips: microMIPS: " Paul Burton
2013-11-07 12:48   ` Paul Burton
2013-11-07 12:48 ` [PATCH 3/6] mips: remove unused {en,dis}able_fpu macros Paul Burton
2013-11-07 12:48   ` Paul Burton
2013-11-07 12:48 ` [PATCH 4/6] mips: support for 64-bit FP with O32 binaries Paul Burton
2013-11-07 12:48   ` Paul Burton
2013-11-15 12:35   ` [PATCH v2 " Paul Burton
2013-11-15 12:35     ` Paul Burton
2013-11-22 13:12     ` [PATCH v3 " Paul Burton
2013-11-22 13:12       ` Paul Burton
2013-11-07 12:48 ` [PATCH 5/6] mips: use per-mm page to execute FP branch delay slots Paul Burton
2013-11-07 12:48   ` Paul Burton
2013-11-07 18:00   ` David Daney
2013-11-08 12:07     ` Paul Burton
2013-11-08 12:07       ` Paul Burton
2013-11-08 14:50       ` [PATCH v2 " Paul Burton
2013-11-08 14:50         ` Paul Burton
2013-11-21 16:48         ` Paul Burton
2013-11-21 16:48           ` Paul Burton
2013-11-07 12:48 ` [PATCH 6/6] mips: non-exec stack & heap when non-exec PT_GNU_STACK is present Paul Burton
2013-11-07 12:48   ` Paul Burton
2013-11-07 13:57 ` [PATCH 0/6] FP improvements Ralf Baechle
  -- strict thread matches above, loose matches on Subject: below --
2014-07-03 17:56 [PATCH v2 5/6] mips: use per-mm page to execute FP branch delay slots Ed Swierk
2014-07-03 20:12 ` Paul Burton
2014-07-03 20:12   ` Paul Burton
2014-07-03 22:31 Ed Swierk
2014-07-04  8:06 ` Paul Burton
2014-07-04  8:06   ` Paul Burton
2014-07-04  8:52   ` Ralf Baechle
2014-07-04  9:06     ` Paul Burton
2014-07-04  9:06       ` Paul Burton
2014-07-04  9:38       ` Ralf Baechle
2014-07-04 11:30         ` Paul Burton
2014-07-04 11:30           ` Paul Burton
2014-07-04 15:42           ` Ralf Baechle
2014-09-13 23:06           ` Maciej W. Rozycki
2014-09-18  8:57             ` Paul Burton
2014-09-18  8:57               ` Paul Burton
