All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86/mce: Reduce number of machine checks taken during recovery
@ 2021-12-15 22:20 Tony Luck
  2021-12-18  0:53 ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Tony Luck @ 2021-12-15 22:20 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: x86, linux-kernel, patches, Youquan Song, Tony Luck

From: Youquan Song <youquan.song@intel.com>

A machine check recovery test case injects poison into memory and then
tries a write(2) system call with a buffer that begins 64 bytes before
the poison and includes the first poison byte:

$ sudo ./einj_mem_uc -f -m 64:65:0 copyin
0: copyin   vaddr = 0x7f13639cc400 paddr = 834fa0d400
./einj_mem_uc: short (64 bytes) write to temp file
page not present
Big surprise ... still running. Thought that would be fatal
Unusual number of MCEs seen: 4
Test passed

While it executes correctly (just the 64 non-poison bytes are written)
and recovers, four machine checks seems like a lot.

When any of the copy functions in arch/x86/lib/copy_user_64.S consume
poison and take a fault the fixup code copies the remaining byte count
from %ecx to %edx and unconditionally jumps to .Lcopy_user_handle_tail
to continue the copy in case any more bytes can be copied.

If the fault was #PF this may copy more bytes (because the page fault
handler might have fixed the fault). But when the fault is a machine
check the original copy code will have copied all the way to the poisoned
cache line. So .Lcopy_user_handle_tail will just take another machine
check for no good reason.

There are five distinct blocks of fixup code that branch to
.Lcopy_user_handle_tail, so add a check there to check the trap type
(in %eax) and simply return the count of remaining bytes.

As well as reducing the number of machine checks, this also allows
Skylake generation Xeons to recover some cases that currently fail. The
is because "rep movsb" is only recoverable when source and destination
are well aligned and the byte count is large. That useless call to
.Lcopy_user_handle_tail may violate one or more of these conditions and
generate a fatal machine check.

With this change applied the count of machine checks to recover is
reduced from four to three:

$ sudo ./einj_mem_uc -f -m 64:65:0 copyin
0: copyin   vaddr = 0x7f1e44845400 paddr = 55930b400
./einj_mem_uc: short (64 bytes) write to temp file
page not present
Big surprise ... still running. Thought that would be fatal
Unusual number of MCEs seen: 3
Test passed

[Tony: Added more details to commit message]

Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/lib/copy_user_64.S | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 2797e630b9b1..8c53be99faa0 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -233,12 +233,19 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
  * eax uncopied bytes or 0 if successful.
  */
 SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
+	cmp $X86_TRAP_MC,%eax
+	je 3f
 	movl %edx,%ecx
 1:	rep movsb
 2:	mov %ecx,%eax
 	ASM_CLAC
 	ret
 
+3:
+	movl %edx,%eax
+	ASM_CLAC
+	ret
+
 	_ASM_EXTABLE_CPY(1b, 2b)
 SYM_CODE_END(.Lcopy_user_handle_tail)
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-12-31 17:34 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-12-15 22:20 [PATCH] x86/mce: Reduce number of machine checks taken during recovery Tony Luck
2021-12-18  0:53 ` Peter Zijlstra
2021-12-23 20:07   ` [PATCH v2] " Luck, Tony
2021-12-24 11:50     ` Peter Zijlstra
2021-12-27 10:52       ` Borislav Petkov
2021-12-31 17:34     ` [tip: ras/core] " tip-bot2 for Youquan Song

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.