From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf0-f175.google.com (mail-pf0-f175.google.com [209.85.192.175]) by kanga.kvack.org (Postfix) with ESMTP id 51C4F6B0255 for ; Fri, 11 Dec 2015 14:32:07 -0500 (EST)
Received: by pfbu66 with SMTP id u66so26810457pfb.3 for ; Fri, 11 Dec 2015 11:32:07 -0800 (PST)
Received: from mga01.intel.com (mga01.intel.com. [192.55.52.88]) by mx.google.com with ESMTP id q2si2921504pfi.136.2015.12.11.11.32.06 for ; Fri, 11 Dec 2015 11:32:06 -0800 (PST)
Message-Id:
From: Tony Luck
Date: Fri, 11 Dec 2015 11:13:23 -0800
Subject: [PATCHV2 0/3] Machine check recovery when kernel accesses poison
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: Borislav Petkov, Andrew Morton, Andy Lutomirski, Dan Williams, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org

This series is initially targeted at the folks doing filesystems
on top of NVDIMMs. They really want to be able to return -EIO
when there is a h/w error (just as spinning rust and SSDs do).

I plan to use the same infrastructure in parts 1 and 2 to write a
machine check aware "copy_from_user()" that will SIGBUS the
calling application when a syscall touches poison in user space
(just like we do when the application touches the poison itself).

Changes V1->V2:

0-day:	Reported build errors and warnings on 32-bit systems. Fixed
0-day:	Reported bloat to tinyconfig. Fixed
Boris:	Suggestions to use extra macros to reduce code duplication in _ASM_*EXTABLE. Done
Boris:	Re-write "tolerant==3" check to reduce indentation level. See below.
Andy:	Check IP is valid before searching kernel exception tables. Done.
Andy:	Explain use of BIT(63) on return value from mcsafe_memcpy(). Done (added decode macros).
Andy:	Untangle mess of code in tail of do_machine_check() to make it
	clear what is going on (e.g. that we only enter the ist_begin_non_atomic()
	if we were called from user code, not from kernel!). Done

Tony Luck (3):
  x86, ras: Add new infrastructure for machine check fixup tables
  x86, ras: Extend machine check recovery code to annotated ring0 areas
  x86, ras: Add mcsafe_memcpy() function to recover from machine checks

 arch/x86/Kconfig                          |  4 ++
 arch/x86/include/asm/asm.h                | 10 +++-
 arch/x86/include/asm/uaccess.h            |  8 +++
 arch/x86/include/asm/uaccess_64.h         |  5 ++
 arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++-
 arch/x86/kernel/cpu/mcheck/mce.c          | 69 +++++++++++------------
 arch/x86/kernel/x8664_ksyms_64.c          |  2 +
 arch/x86/lib/copy_user_64.S               | 91 +++++++++++++++++++++++++++++++
 arch/x86/mm/extable.c                     | 19 +++++++
 include/asm-generic/vmlinux.lds.h         |  6 ++
 include/linux/module.h                    |  1 +
 kernel/extable.c                          | 20 +++++++
 12 files changed, 219 insertions(+), 38 deletions(-)

--
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f46.google.com (mail-pa0-f46.google.com [209.85.220.46]) by kanga.kvack.org (Postfix) with ESMTP id C6FBD6B0258 for ; Fri, 11 Dec 2015 14:32:32 -0500 (EST)
Received: by pabur14 with SMTP id ur14so70065819pab.0 for ; Fri, 11 Dec 2015 11:32:32 -0800 (PST)
Received: from mga03.intel.com (mga03.intel.com.
[134.134.136.65]) by mx.google.com with ESMTP id 82si2919150pft.132.2015.12.11.11.32.31 for ; Fri, 11 Dec 2015 11:32:31 -0800 (PST)
Message-Id: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com>
In-Reply-To:
References:
From: Tony Luck
Date: Thu, 10 Dec 2015 13:58:04 -0800
Subject: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: Borislav Petkov, Andrew Morton, Andy Lutomirski, Dan Williams, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org

Copy the existing page fault fixup mechanisms to create a new table
to be used when fixing machine checks. Note:
1) At this time we only provide a macro to annotate assembly code
2) We assume all fixups will be in code built in to the kernel.
3) Only for x86_64
4) New code under CONFIG_MCE_KERNEL_RECOVERY

Signed-off-by: Tony Luck
---
 arch/x86/Kconfig                  |  4 ++++
 arch/x86/include/asm/asm.h        | 10 ++++++++--
 arch/x86/include/asm/uaccess.h    |  8 ++++++++
 arch/x86/mm/extable.c             | 19 +++++++++++++++++++
 include/asm-generic/vmlinux.lds.h |  6 ++++++
 include/linux/module.h            |  1 +
 kernel/extable.c                  | 20 ++++++++++++++++++++
 7 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 96d058a87100..db5c6e1d6e37 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1001,6 +1001,10 @@ config X86_MCE_INJECT
 	  If you don't know what a machine check is and you don't do kernel
 	  QA it is safe to say n.
+config MCE_KERNEL_RECOVERY + depends on X86_MCE && X86_64 + def_bool y + config X86_THERMAL_VECTOR def_bool y depends on X86_MCE_INTEL diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h index 189679aba703..a5d483ac11fa 100644 --- a/arch/x86/include/asm/asm.h +++ b/arch/x86/include/asm/asm.h @@ -44,13 +44,19 @@ /* Exception table entry */ #ifdef __ASSEMBLY__ -# define _ASM_EXTABLE(from,to) \ - .pushsection "__ex_table","a" ; \ +# define __ASM_EXTABLE(from, to, table) \ + .pushsection table, "a" ; \ .balign 8 ; \ .long (from) - . ; \ .long (to) - . ; \ .popsection +# define _ASM_EXTABLE(from, to) \ + __ASM_EXTABLE(from, to, "__ex_table") + +# define _ASM_MCEXTABLE(from, to) \ + __ASM_EXTABLE(from, to, "__mcex_table") + # define _ASM_EXTABLE_EX(from,to) \ .pushsection "__ex_table","a" ; \ .balign 8 ; \ diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index a8df874f3e88..7b02ca1991b4 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -111,6 +111,14 @@ struct exception_table_entry { #define ARCH_HAS_SEARCH_EXTABLE extern int fixup_exception(struct pt_regs *regs); +#ifdef CONFIG_MCE_KERNEL_RECOVERY +extern int fixup_mcexception(struct pt_regs *regs, u64 addr); +#else +static inline int fixup_mcexception(struct pt_regs *regs, u64 addr) +{ + return 0; +} +#endif extern int early_fixup_exception(unsigned long *ip); /* diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c index 903ec1e9c326..a461c4212758 100644 --- a/arch/x86/mm/extable.c +++ b/arch/x86/mm/extable.c @@ -49,6 +49,25 @@ int fixup_exception(struct pt_regs *regs) return 0; } +#ifdef CONFIG_MCE_KERNEL_RECOVERY +int fixup_mcexception(struct pt_regs *regs, u64 addr) +{ + const struct exception_table_entry *fixup; + unsigned long new_ip; + + fixup = search_mcexception_tables(regs->ip); + if (fixup) { + new_ip = ex_fixup_addr(fixup); + + regs->ip = new_ip; + regs->ax = BIT(63) | addr; + return 1; + } + + return 0; +} +#endif + /* 
Restricted version used during very early boot */ int __init early_fixup_exception(unsigned long *ip) { diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 1781e54ea6d3..21bb20d1172a 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -473,6 +473,12 @@ VMLINUX_SYMBOL(__start___ex_table) = .; \ *(__ex_table) \ VMLINUX_SYMBOL(__stop___ex_table) = .; \ + } \ + . = ALIGN(align); \ + __mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \ + VMLINUX_SYMBOL(__start___mcex_table) = .; \ + *(__mcex_table) \ + VMLINUX_SYMBOL(__stop___mcex_table) = .; \ } /* diff --git a/include/linux/module.h b/include/linux/module.h index 3a19c79918e0..ffecbfcc462c 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -270,6 +270,7 @@ extern const typeof(name) __mod_##type##__##name##_device_table \ /* Given an address, look for it in the exception tables */ const struct exception_table_entry *search_exception_tables(unsigned long add); +const struct exception_table_entry *search_mcexception_tables(unsigned long a); struct notifier_block; diff --git a/kernel/extable.c b/kernel/extable.c index e820ccee9846..7b224fbcb708 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -34,6 +34,10 @@ DEFINE_MUTEX(text_mutex); extern struct exception_table_entry __start___ex_table[]; extern struct exception_table_entry __stop___ex_table[]; +#ifdef CONFIG_MCE_KERNEL_RECOVERY +extern struct exception_table_entry __start___mcex_table[]; +extern struct exception_table_entry __stop___mcex_table[]; +#endif /* Cleared by build time tools if the table is already sorted. 
*/ u32 __initdata __visible main_extable_sort_needed = 1; @@ -45,6 +49,10 @@ void __init sort_main_extable(void) pr_notice("Sorting __ex_table...\n"); sort_extable(__start___ex_table, __stop___ex_table); } +#ifdef CONFIG_MCE_KERNEL_RECOVERY + if (__stop___mcex_table > __start___mcex_table) + sort_extable(__start___mcex_table, __stop___mcex_table); +#endif } /* Given an address, look for it in the exception tables. */ @@ -58,6 +66,18 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) return e; } +#ifdef CONFIG_MCE_KERNEL_RECOVERY +/* Given an address, look for it in the machine check exception tables. */ +const struct exception_table_entry *search_mcexception_tables( + unsigned long addr) +{ + const struct exception_table_entry *e; + + e = search_extable(__start___mcex_table, __stop___mcex_table-1, addr); + return e; +} +#endif + static inline int init_kernel_text(unsigned long addr) { if (addr >= (unsigned long)_sinittext && -- 2.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f178.google.com (mail-pf0-f178.google.com [209.85.192.178]) by kanga.kvack.org (Postfix) with ESMTP id E07AD6B0259 for ; Fri, 11 Dec 2015 14:32:33 -0500 (EST) Received: by pfd5 with SMTP id 5so8967861pfd.2 for ; Fri, 11 Dec 2015 11:32:33 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com. 
[134.134.136.24]) by mx.google.com with ESMTP id r79si2948968pfi.230.2015.12.11.11.32.33 for ; Fri, 11 Dec 2015 11:32:33 -0800 (PST)
Message-Id:
In-Reply-To:
References:
From: Tony Luck
Date: Thu, 10 Dec 2015 16:14:44 -0800
Subject: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: Borislav Petkov, Andrew Morton, Andy Lutomirski, Dan Williams, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org

Extend the severity checking code to add a new context IN_KERNEL_RECOV
which is used to indicate that the machine check was triggered by code
in the kernel with a fixup entry.

Add code to check for this situation and respond by altering the return
IP to the fixup address and setting regs->ax so that the recovery
code knows the physical address of the error. Note that we also set bit
63 because 0x0 is a legal physical address.

Major re-work to the tail code in do_machine_check() to make all this
readable/maintainable. One functional change is that tolerant=3 no longer
stops recovery actions. It reverts to only skipping the SIGBUS sent to
the current process.
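The context classification described above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel code: `struct fixup_entry` and `search_table()` are hypothetical simplifications (absolute addresses and a linear search, where the kernel stores relative offsets and searches a sorted table), and `error_context()` takes its inputs as plain arguments instead of a `struct mce`. The `MCG_STATUS_RIPV` and `MCG_STATUS_EIPV` values are bits 0 and 1 of the MCG_STATUS register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Context values as listed in the patch. */
enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 };

/* MCG_STATUS bits: RIPV is bit 0, EIPV is bit 1. */
#define MCG_STATUS_RIPV (UINT64_C(1) << 0)
#define MCG_STATUS_EIPV (UINT64_C(1) << 1)

/* Hypothetical, simplified table entry using absolute addresses. */
struct fixup_entry {
	uint64_t insn;	/* address of an annotated instruction */
	uint64_t fixup;	/* address to resume at after a machine check */
};

/* Hypothetical linear lookup standing in for the machine check table search. */
const struct fixup_entry *search_table(const struct fixup_entry *tbl,
				       size_t n, uint64_t ip)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == ip)
			return &tbl[i];
	return NULL;
}

/* Both RIPV and EIPV must be set before the saved IP can be trusted. */
bool mc_recoverable(uint64_t mcgstatus)
{
	const uint64_t both = MCG_STATUS_RIPV | MCG_STATUS_EIPV;

	return (mcgstatus & both) == both;
}

/* Same decision order as the patch: user first, then annotated kernel code. */
enum context error_context(uint64_t cs, uint64_t mcgstatus, uint64_t ip,
			   const struct fixup_entry *tbl, size_t n)
{
	if ((cs & 3) == 3)
		return IN_USER;
	if (mc_recoverable(mcgstatus) && search_table(tbl, n, ip))
		return IN_KERNEL_RECOV;
	return IN_KERNEL;
}
```

In this sketch a kernel-mode fault is only classed as recoverable when both status bits are set and the faulting IP has a table entry; anything else falls back to IN_KERNEL.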
Signed-off-by: Tony Luck --- arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++++- arch/x86/kernel/cpu/mcheck/mce.c | 69 ++++++++++++++++--------------- 2 files changed, 55 insertions(+), 36 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c index 9c682c222071..ac7fbb0689fb 100644 --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -29,7 +30,7 @@ * panic situations) */ -enum context { IN_KERNEL = 1, IN_USER = 2 }; +enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 }; enum ser { SER_REQUIRED = 1, NO_SER = 2 }; enum exception { EXCP_CONTEXT = 1, NO_EXCP = 2 }; @@ -48,6 +49,7 @@ static struct severity { #define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c } #define KERNEL .context = IN_KERNEL #define USER .context = IN_USER +#define KERNEL_RECOV .context = IN_KERNEL_RECOV #define SER .ser = SER_REQUIRED #define NOSER .ser = NO_SER #define EXCP .excp = EXCP_CONTEXT @@ -87,6 +89,10 @@ static struct severity { EXCP, KERNEL, MCGMASK(MCG_STATUS_RIPV, 0) ), MCESEV( + PANIC, "In kernel and no restart IP", + EXCP, KERNEL_RECOV, MCGMASK(MCG_STATUS_RIPV, 0) + ), + MCESEV( DEFERRED, "Deferred error", NOSER, MASK(MCI_STATUS_UC|MCI_STATUS_DEFERRED|MCI_STATUS_POISON, MCI_STATUS_DEFERRED) ), @@ -123,6 +129,11 @@ static struct severity { MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, MCG_STATUS_RIPV) ), MCESEV( + AR, "Action required: data load error recoverable area of kernel", + SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA), + KERNEL_RECOV + ), + MCESEV( AR, "Action required: data load error in a user process", SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA), USER @@ -170,6 +181,9 @@ static struct severity { ) /* always matches. 
keep at end */ }; +#define mc_recoverable(mcg) (((mcg) & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) == \ + (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) + /* * If mcgstatus indicated that ip/cs on the stack were * no good, then "m->cs" will be zero and we will have @@ -183,7 +197,11 @@ static struct severity { */ static int error_context(struct mce *m) { - return ((m->cs & 3) == 3) ? IN_USER : IN_KERNEL; + if ((m->cs & 3) == 3) + return IN_USER; + if (mc_recoverable(m->mcgstatus) && search_mcexception_tables(m->ip)) + return IN_KERNEL_RECOV; + return IN_KERNEL; } /* diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 9d014b82a124..f2f568ad6409 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include @@ -958,6 +959,20 @@ static void mce_clear_state(unsigned long *toclear) } } +static int do_memory_failure(struct mce *m) +{ + int flags = MF_ACTION_REQUIRED; + int ret; + + pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr); + if (!(m->mcgstatus & MCG_STATUS_RIPV)) + flags |= MF_MUST_KILL; + ret = memory_failure(m->addr >> PAGE_SHIFT, MCE_VECTOR, flags); + if (ret) + pr_err("Memory error not recovered"); + return ret; +} + /* * The actual machine check handler. This only handles real * exceptions when something got corrupted coming in through int 18. @@ -995,8 +1010,6 @@ void do_machine_check(struct pt_regs *regs, long error_code) DECLARE_BITMAP(toclear, MAX_NR_BANKS); DECLARE_BITMAP(valid_banks, MAX_NR_BANKS); char *msg = "Unknown"; - u64 recover_paddr = ~0ull; - int flags = MF_ACTION_REQUIRED; int lmce = 0; ist_enter(regs); @@ -1123,22 +1136,13 @@ void do_machine_check(struct pt_regs *regs, long error_code) } /* - * At insane "tolerant" levels we take no action. Otherwise - * we only die if we have no other choice. For less serious - * issues we try to recover, or limit damage to the current - * process. 
+ * If tolerant is at an insane level we drop requests to kill + * processes and continue even when there is no way out */ - if (cfg->tolerant < 3) { - if (no_way_out) - mce_panic("Fatal machine check on current CPU", &m, msg); - if (worst == MCE_AR_SEVERITY) { - recover_paddr = m.addr; - if (!(m.mcgstatus & MCG_STATUS_RIPV)) - flags |= MF_MUST_KILL; - } else if (kill_it) { - force_sig(SIGBUS, current); - } - } + if (cfg->tolerant == 3) + kill_it = 0; + else if (no_way_out) + mce_panic("Fatal machine check on current CPU", &m, msg); if (worst > 0) mce_report_event(regs); @@ -1146,25 +1150,22 @@ void do_machine_check(struct pt_regs *regs, long error_code) out: sync_core(); - if (recover_paddr == ~0ull) - goto done; + /* Fault was in user mode and we need to take some action */ + if ((m.cs & 3) == 3 && (worst == MCE_AR_SEVERITY || kill_it)) { + ist_begin_non_atomic(regs); + local_irq_enable(); - pr_err("Uncorrected hardware memory error in user-access at %llx", - recover_paddr); - /* - * We must call memory_failure() here even if the current process is - * doomed. We still need to mark the page as poisoned and alert any - * other users of the page. - */ - ist_begin_non_atomic(regs); - local_irq_enable(); - if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) { - pr_err("Memory error not recovered"); - force_sig(SIGBUS, current); + if (kill_it || do_memory_failure(&m)) + force_sig(SIGBUS, current); + local_irq_disable(); + ist_end_non_atomic(); } - local_irq_disable(); - ist_end_non_atomic(); -done: + + /* Fault was in recoverable area of the kernel */ + if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY) + if (!fixup_mcexception(regs, m.addr)) + mce_panic("Failed kernel mode recovery", &m, NULL); + ist_exit(regs); } EXPORT_SYMBOL_GPL(do_machine_check); -- 2.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f50.google.com (mail-pa0-f50.google.com [209.85.220.50]) by kanga.kvack.org (Postfix) with ESMTP id 41B576B025A for ; Fri, 11 Dec 2015 14:32:40 -0500 (EST)
Received: by pacdm15 with SMTP id dm15so69913336pac.3 for ; Fri, 11 Dec 2015 11:32:40 -0800 (PST)
Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTP id t67si2918676pfa.123.2015.12.11.11.32.39 for ; Fri, 11 Dec 2015 11:32:39 -0800 (PST)
Message-Id: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com>
In-Reply-To:
References:
From: Tony Luck
Date: Thu, 10 Dec 2015 16:21:50 -0800
Subject: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: Borislav Petkov, Andrew Morton, Andy Lutomirski, Dan Williams, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org

Using __copy_user_nocache() as inspiration, create a memory copy
routine for use by kernel code with annotations to allow for
recovery from machine checks.

Notes:
1) Unlike the original we make no attempt to copy all the bytes
   up to the faulting address. The original achieves that by
   re-executing the failing part as a byte-by-byte copy,
   which will take another page fault. We don't want to have
   a second machine check!
2) Likewise the return value for the original indicates exactly
   how many bytes were not copied. Instead we provide the physical
   address of the fault (thanks to help from do_machine_check()).
3) Provide helpful macros to decode the return value.
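The return convention in notes 2 and 3 can be sketched in user space as follows. This is a minimal illustration under stated assumptions: `fake_mcsafe_memcpy()`, its `poison_paddr` parameter, `NO_POISON` and `copy_or_eio()` are hypothetical names invented for the example, the stand-in copies nothing when poison is simulated, and `BIT()` plus the two decode macros are re-declared locally (mirroring the ones the patch adds) so the snippet builds outside the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local re-declarations so this builds in user space (assumption:
 * they mirror the macros the patch adds to uaccess_64.h). */
#define BIT(n)                 (UINT64_C(1) << (n))
#define COPY_HAD_MCHECK(ret)   ((ret) & BIT(63))
#define COPY_MCHECK_PADDR(ret) ((ret) & ~BIT(63))

/* Hypothetical marker meaning "no poison is simulated for this call". */
#define NO_POISON              UINT64_MAX

/*
 * Hypothetical stand-in for mcsafe_memcpy(). With NO_POISON the copy
 * completes and 0 is returned. Otherwise nothing is copied and the
 * result models what the fixup path leaves in regs->ax:
 * BIT(63) | physical address.
 */
static uint64_t fake_mcsafe_memcpy(void *dst, const void *src, size_t size,
				   uint64_t poison_paddr)
{
	if (poison_paddr != NO_POISON) {
		/* The encoding relies on bit 63 being free in the address. */
		assert(!(poison_paddr & BIT(63)));
		return BIT(63) | poison_paddr;
	}
	memcpy(dst, src, size);
	return 0;
}

/* Hypothetical caller: turn the encoded result into 0 or -EIO (-5). */
int copy_or_eio(void *dst, const void *src, size_t size,
		uint64_t poison_paddr, uint64_t *bad_paddr)
{
	uint64_t ret = fake_mcsafe_memcpy(dst, src, size, poison_paddr);

	if (COPY_HAD_MCHECK(ret)) {
		*bad_paddr = COPY_MCHECK_PADDR(ret);
		return -5;
	}
	return 0;
}
```

Because a successful copy returns 0 and physical address 0x0 is legal, the flag bit is what lets a caller tell "no error" apart from "poison at address 0".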
Signed-off-by: Tony Luck --- arch/x86/include/asm/uaccess_64.h | 5 +++ arch/x86/kernel/x8664_ksyms_64.c | 2 + arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++ 3 files changed, 98 insertions(+) diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index f2f9b39b274a..779cb0e77ecc 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -216,6 +216,11 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size) extern long __copy_user_nocache(void *dst, const void __user *src, unsigned size, int zerorest); +extern u64 mcsafe_memcpy(void *dst, const void __user *src, + unsigned size); +#define COPY_HAD_MCHECK(ret) ((ret) & BIT(63)) +#define COPY_MCHECK_PADDR(ret) ((ret) & ~BIT(63)) + static inline int __copy_from_user_nocache(void *dst, const void __user *src, unsigned size) { diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c index a0695be19864..ec988c92c055 100644 --- a/arch/x86/kernel/x8664_ksyms_64.c +++ b/arch/x86/kernel/x8664_ksyms_64.c @@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache); EXPORT_SYMBOL(_copy_from_user); EXPORT_SYMBOL(_copy_to_user); +EXPORT_SYMBOL(mcsafe_memcpy); + EXPORT_SYMBOL(copy_page); EXPORT_SYMBOL(clear_page); diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index 982ce34f4a9b..ffce93cbc9a5 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -319,3 +319,94 @@ ENTRY(__copy_user_nocache) _ASM_EXTABLE(21b,50b) _ASM_EXTABLE(22b,50b) ENDPROC(__copy_user_nocache) + +/* + * mcsafe_memcpy - Uncached memory copy with machine check exception handling + * Note that we only catch machine checks when reading the source addresses. + * Writes to target are posted and don't generate machine checks. + * This will force destination/source out of cache for more performance. 
+ */
+ENTRY(mcsafe_memcpy)
+	cmpl $8,%edx
+	jb 20f		/* less than 8 bytes, go to byte copy loop */
+
+	/* check for bad alignment of destination */
+	movl %edi,%ecx
+	andl $7,%ecx
+	jz 102f				/* already aligned */
+	subl $8,%ecx
+	negl %ecx
+	subl %ecx,%edx
+0:	movb (%rsi),%al
+	movb %al,(%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz 0b
+102:
+	movl %edx,%ecx
+	andl $63,%edx
+	shrl $6,%ecx
+	jz 17f
+1:	movq (%rsi),%r8
+2:	movq 1*8(%rsi),%r9
+3:	movq 2*8(%rsi),%r10
+4:	movq 3*8(%rsi),%r11
+	movnti %r8,(%rdi)
+	movnti %r9,1*8(%rdi)
+	movnti %r10,2*8(%rdi)
+	movnti %r11,3*8(%rdi)
+9:	movq 4*8(%rsi),%r8
+10:	movq 5*8(%rsi),%r9
+11:	movq 6*8(%rsi),%r10
+12:	movq 7*8(%rsi),%r11
+	movnti %r8,4*8(%rdi)
+	movnti %r9,5*8(%rdi)
+	movnti %r10,6*8(%rdi)
+	movnti %r11,7*8(%rdi)
+	leaq 64(%rsi),%rsi
+	leaq 64(%rdi),%rdi
+	decl %ecx
+	jnz 1b
+17:	movl %edx,%ecx
+	andl $7,%edx
+	shrl $3,%ecx
+	jz 20f
+18:	movq (%rsi),%r8
+	movnti %r8,(%rdi)
+	leaq 8(%rsi),%rsi
+	leaq 8(%rdi),%rdi
+	decl %ecx
+	jnz 18b
+20:	andl %edx,%edx
+	jz 23f
+	movl %edx,%ecx
+21:	movb (%rsi),%al
+	movb %al,(%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz 21b
+23:	xorl %eax,%eax
+	sfence
+	ret
+
+	.section .fixup,"ax"
+30:
+	sfence
+	/* do_machine_check() sets %rax return value */
+	ret
+	.previous
+
+	_ASM_MCEXTABLE(0b,30b)
+	_ASM_MCEXTABLE(1b,30b)
+	_ASM_MCEXTABLE(2b,30b)
+	_ASM_MCEXTABLE(3b,30b)
+	_ASM_MCEXTABLE(4b,30b)
+	_ASM_MCEXTABLE(9b,30b)
+	_ASM_MCEXTABLE(10b,30b)
+	_ASM_MCEXTABLE(11b,30b)
+	_ASM_MCEXTABLE(12b,30b)
+	_ASM_MCEXTABLE(18b,30b)
+	_ASM_MCEXTABLE(21b,30b)
+ENDPROC(mcsafe_memcpy)
--
2.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f178.google.com (mail-ob0-f178.google.com [209.85.214.178]) by kanga.kvack.org (Postfix) with ESMTP id C33AE6B0253 for ; Fri, 11 Dec 2015 15:07:03 -0500 (EST) Received: by obbsd4 with SMTP id sd4so41682145obb.0 for ; Fri, 11 Dec 2015 12:07:03 -0800 (PST) Received: from mail-ob0-x233.google.com (mail-ob0-x233.google.com. [2607:f8b0:4003:c01::233]) by mx.google.com with ESMTPS id a78si18698676oib.141.2015.12.11.12.07.02 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 12:07:02 -0800 (PST) Received: by obciw8 with SMTP id iw8so91339804obc.1 for ; Fri, 11 Dec 2015 12:07:02 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 12:06:42 -0800 Message-ID: Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Tony Luck Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , Dan Williams , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML On Thu, Dec 10, 2015 at 1:58 PM, Tony Luck wrote: > Copy the existing page fault fixup mechanisms to create a new table > to be used when fixing machine checks. Note: > 1) At this time we only provide a macro to annotate assembly code > 2) We assume all fixups will in code builtin to the kernel. 
> 3) Only for x86_64 > 4) New code under CONFIG_MCE_KERNEL_RECOVERY > > Signed-off-by: Tony Luck > --- > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > +int fixup_mcexception(struct pt_regs *regs, u64 addr) > +{ > + const struct exception_table_entry *fixup; > + unsigned long new_ip; > + > + fixup = search_mcexception_tables(regs->ip); > + if (fixup) { > + new_ip = ex_fixup_addr(fixup); > + > + regs->ip = new_ip; > + regs->ax = BIT(63) | addr; Can this be an actual #define? --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f54.google.com (mail-oi0-f54.google.com [209.85.218.54]) by kanga.kvack.org (Postfix) with ESMTP id D23E16B0253 for ; Fri, 11 Dec 2015 15:08:25 -0500 (EST) Received: by oifz134 with SMTP id z134so308059oif.0 for ; Fri, 11 Dec 2015 12:08:25 -0800 (PST) Received: from mail-ob0-x229.google.com (mail-ob0-x229.google.com. 
[2607:f8b0:4003:c01::229]) by mx.google.com with ESMTPS id d66si4406312oia.2.2015.12.11.12.08.24 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 12:08:25 -0800 (PST) Received: by obciw8 with SMTP id iw8so91365461obc.1 for ; Fri, 11 Dec 2015 12:08:24 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: From: Andy Lutomirski Date: Fri, 11 Dec 2015 12:08:05 -0800 Message-ID: Subject: Re: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Tony Luck Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , Dan Williams , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML On Thu, Dec 10, 2015 at 4:14 PM, Tony Luck wrote: > Extend the severity checking code to add a new context IN_KERN_RECOV > which is used to indicate that the machine check was triggered by code > in the kernel with a fixup entry. > > Add code to check for this situation and respond by altering the return > IP to the fixup address and changing the regs->ax so that the recovery > code knows the physical address of the error. Note that we also set bit > 63 because 0x0 is a legal physical address. > > Major re-work to the tail code in do_machine_check() to make all this > readable/maintainable. One functional change is that tolerant=3 no longer > stops recovery actions. Revert to only skipping sending SIGBUS to the > current process. This is IMO much, much nicer than the old code. Thanks! --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f176.google.com (mail-ob0-f176.google.com [209.85.214.176]) by kanga.kvack.org (Postfix) with ESMTP id 82E526B0255 for ; Fri, 11 Dec 2015 15:09:30 -0500 (EST) Received: by obbsd4 with SMTP id sd4so41728816obb.0 for ; Fri, 11 Dec 2015 12:09:30 -0800 (PST) Received: from mail-ob0-x234.google.com (mail-ob0-x234.google.com. [2607:f8b0:4003:c01::234]) by mx.google.com with ESMTPS id u85si18680836oie.43.2015.12.11.12.09.29 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 12:09:30 -0800 (PST) Received: by obc18 with SMTP id 18so90553180obc.2 for ; Fri, 11 Dec 2015 12:09:29 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 12:09:10 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Tony Luck Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , Dan Williams , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML On Thu, Dec 10, 2015 at 4:21 PM, Tony Luck wrote: > Using __copy_user_nocache() as inspiration create a memory copy > routine for use by kernel code with annotations to allow for > recovery from machine checks. > > Notes: > 1) Unlike the original we make no attempt to copy all the bytes > up to the faulting address. The original achieves that by > re-executing the failing part as a byte-by-byte copy, > which will take another page fault. We don't want to have > a second machine check! > 2) Likewise the return value for the original indicates exactly > how many bytes were not copied. 
Instead we provide the physical > address of the fault (thanks to help from do_machine_check() > 3) Provide helpful macros to decode the return value. > > Signed-off-by: Tony Luck > --- > arch/x86/include/asm/uaccess_64.h | 5 +++ > arch/x86/kernel/x8664_ksyms_64.c | 2 + > arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++ > 3 files changed, 98 insertions(+) > > diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h > index f2f9b39b274a..779cb0e77ecc 100644 > --- a/arch/x86/include/asm/uaccess_64.h > +++ b/arch/x86/include/asm/uaccess_64.h > @@ -216,6 +216,11 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size) > extern long __copy_user_nocache(void *dst, const void __user *src, > unsigned size, int zerorest); > > +extern u64 mcsafe_memcpy(void *dst, const void __user *src, > + unsigned size); > +#define COPY_HAD_MCHECK(ret) ((ret) & BIT(63)) > +#define COPY_MCHECK_PADDR(ret) ((ret) & ~BIT(63)) > + > static inline int > __copy_from_user_nocache(void *dst, const void __user *src, unsigned size) > { > diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c > index a0695be19864..ec988c92c055 100644 > --- a/arch/x86/kernel/x8664_ksyms_64.c > +++ b/arch/x86/kernel/x8664_ksyms_64.c > @@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache); > EXPORT_SYMBOL(_copy_from_user); > EXPORT_SYMBOL(_copy_to_user); > > +EXPORT_SYMBOL(mcsafe_memcpy); > + > EXPORT_SYMBOL(copy_page); > EXPORT_SYMBOL(clear_page); > > diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S > index 982ce34f4a9b..ffce93cbc9a5 100644 > --- a/arch/x86/lib/copy_user_64.S > +++ b/arch/x86/lib/copy_user_64.S > @@ -319,3 +319,94 @@ ENTRY(__copy_user_nocache) > _ASM_EXTABLE(21b,50b) > _ASM_EXTABLE(22b,50b) > ENDPROC(__copy_user_nocache) > + > +/* > + * mcsafe_memcpy - Uncached memory copy with machine check exception handling > + * Note that we only catch machine checks when reading the source addresses. 
> + * Writes to target are posted and don't generate machine checks. > + * This will force destination/source out of cache for more performance. > + */ > +ENTRY(mcsafe_memcpy) > + cmpl $8,%edx > + jb 20f /* less then 8 bytes, go to byte copy loop */ > + > + /* check for bad alignment of destination */ > + movl %edi,%ecx > + andl $7,%ecx > + jz 102f /* already aligned */ > + subl $8,%ecx > + negl %ecx > + subl %ecx,%edx > +0: movb (%rsi),%al > + movb %al,(%rdi) > + incq %rsi > + incq %rdi > + decl %ecx > + jnz 100b > +102: > + movl %edx,%ecx > + andl $63,%edx > + shrl $6,%ecx > + jz 17f > +1: movq (%rsi),%r8 > +2: movq 1*8(%rsi),%r9 > +3: movq 2*8(%rsi),%r10 > +4: movq 3*8(%rsi),%r11 > + movnti %r8,(%rdi) > + movnti %r9,1*8(%rdi) > + movnti %r10,2*8(%rdi) > + movnti %r11,3*8(%rdi) > +9: movq 4*8(%rsi),%r8 > +10: movq 5*8(%rsi),%r9 > +11: movq 6*8(%rsi),%r10 > +12: movq 7*8(%rsi),%r11 > + movnti %r8,4*8(%rdi) > + movnti %r9,5*8(%rdi) > + movnti %r10,6*8(%rdi) > + movnti %r11,7*8(%rdi) > + leaq 64(%rsi),%rsi > + leaq 64(%rdi),%rdi > + decl %ecx > + jnz 1b > +17: movl %edx,%ecx > + andl $7,%edx > + shrl $3,%ecx > + jz 20f > +18: movq (%rsi),%r8 > + movnti %r8,(%rdi) > + leaq 8(%rsi),%rsi > + leaq 8(%rdi),%rdi > + decl %ecx > + jnz 18b > +20: andl %edx,%edx > + jz 23f > + movl %edx,%ecx > +21: movb (%rsi),%al > + movb %al,(%rdi) > + incq %rsi > + incq %rdi > + decl %ecx > + jnz 21b > +23: xorl %eax,%eax > + sfence > + ret > + > + .section .fixup,"ax" > +30: > + sfence > + /* do_machine_check() sets %eax return value */ > + ret > + .previous > + > + _ASM_MCEXTABLE(0b,30b) > + _ASM_MCEXTABLE(1b,30b) > + _ASM_MCEXTABLE(2b,30b) > + _ASM_MCEXTABLE(3b,30b) > + _ASM_MCEXTABLE(4b,30b) > + _ASM_MCEXTABLE(9b,30b) > + _ASM_MCEXTABLE(10b,30b) > + _ASM_MCEXTABLE(11b,30b) > + _ASM_MCEXTABLE(12b,30b) > + _ASM_MCEXTABLE(18b,30b) > + _ASM_MCEXTABLE(21b,30b) > +ENDPROC(mcsafe_memcpy) I still don't get the BIT(63) thing. Can you explain it? 
--Andy

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f47.google.com (mail-pa0-f47.google.com [209.85.220.47]) by kanga.kvack.org (Postfix) with ESMTP id 3D3A46B0253 for ; Fri, 11 Dec 2015 16:01:51 -0500 (EST) Received: by pacwq6 with SMTP id wq6so70705442pac.1 for ; Fri, 11 Dec 2015 13:01:51 -0800 (PST) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id rf10si3351326pab.94.2015.12.11.13.01.50 for ; Fri, 11 Dec 2015 13:01:50 -0800 (PST) From: "Luck, Tony" Subject: RE: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables Date: Fri, 11 Dec 2015 21:01:49 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82D35@ORSMSX114.amr.corp.intel.com> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> In-Reply-To: Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML

>> +               regs->ip = new_ip;
>> +               regs->ax = BIT(63) | addr;
>
> Can this be an actual #define?

Doh!  Yes, of course. That would be much better.

Now I need to think of a good name for it.

-Tony

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by kanga.kvack.org (Postfix) with ESMTP id 673156B0257 for ; Fri, 11 Dec 2015 16:19:19 -0500 (EST) Received: by pabur14 with SMTP id ur14so71242176pab.0 for ; Fri, 11 Dec 2015 13:19:19 -0800 (PST) Received: from mga14.intel.com (mga14.intel.com. [192.55.52.115]) by mx.google.com with ESMTP id rq5si3430051pab.160.2015.12.11.13.19.18 for ; Fri, 11 Dec 2015 13:19:18 -0800 (PST) From: "Luck, Tony" Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Date: Fri, 11 Dec 2015 21:19:17 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> In-Reply-To: Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML

> I still don't get the BIT(63) thing.  Can you explain it?

It will be more obvious when I get around to writing copy_from_user().

Then we will have a function that can take page faults if there are pages
that are not present.  If the page faults can't be fixed we have a -EFAULT
condition. We can also take machine checks if we reads from a location with an
uncorrected error.

We need to distinguish these two cases because the action we take is
different. For the unresolved page fault we already have the ABI that the
copy_to/from_user() functions return zero for success, and a non-zero
return is the number of not-copied bytes.

So for my new case I'm setting bit63 ... this is never going to be set for
a failed page fault.

copy_from_user() conceptually will look like this:

int copy_from_user(void *to, void *from, unsigned long n)
{
	u64 ret = mcsafe_memcpy(to, from, n);

	if (COPY_HAD_MCHECK(r)) {
		if (memory_failure(COPY_MCHECK_PADDR(ret) >> PAGE_SIZE, ...))
			force_sig(SIGBUS, current);
		return something;
	} else
		return ret;
}

-Tony

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pa0-f52.google.com (mail-pa0-f52.google.com [209.85.220.52]) by kanga.kvack.org (Postfix) with ESMTP id E79146B0253 for ; Fri, 11 Dec 2015 16:32:52 -0500 (EST) Received: by padhk6 with SMTP id hk6so31084651pad.2 for ; Fri, 11 Dec 2015 13:32:52 -0800 (PST) Received: from aserp1040.oracle.com (aserp1040.oracle.com.
[141.146.126.69]) by mx.google.com with ESMTPS id s131si3486343pfs.12.2015.12.11.13.32.51 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 13:32:52 -0800 (PST) Date: Fri, 11 Dec 2015 16:32:27 -0500 From: Konrad Rzeszutek Wilk Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151211213227.GA22996@char.us.oracle.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Luck, Tony" Cc: Andy Lutomirski , linux-nvdimm , X86 ML , "linux-kernel@vger.kernel.org" , Ingo Molnar , "linux-mm@kvack.org" , Borislav Petkov , Andy Lutomirski , Andrew Morton On Fri, Dec 11, 2015 at 09:19:17PM +0000, Luck, Tony wrote: > > I still don't get the BIT(63) thing. Can you explain it? > > It will be more obvious when I get around to writing copy_from_user(). > > Then we will have a function that can take page faults if there are pages > that are not present. If the page faults can't be fixed we have a -EFAULT > condition. We can also take machine checks if we reads from a location with an > uncorrected error. > > We need to distinguish these two cases because the action we take is > different. For the unresolved page fault we already have the ABI that the > copy_to/from_user() functions return zero for success, and a non-zero > return is the number of not-copied bytes. > > So for my new case I'm setting bit63 ... this is never going to be set for > a failed page fault. Isn't 63 NX? 
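For reference, a minimal user-space sketch of the return-value convention being discussed. It assumes the two decode macros from the patch and uses stand-ins for the kernel's BIT() and u64; encode_mcheck(), classify_copy() and enum copy_result are hypothetical names introduced only for illustration. Bit 63 is NX only inside a page-table entry. The value here is a physical address, which on x86-64 is architecturally limited to 52 bits, so bit 63 is free to serve as a flag.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-in for the kernel's BIT(63) (assumption for this sketch). */
#define MC_BIT63 (UINT64_C(1) << 63)

/* Decode macros as they appear in the patch, with BIT(63) spelled out. */
#define COPY_HAD_MCHECK(ret)   ((ret) & MC_BIT63)
#define COPY_MCHECK_PADDR(ret) ((ret) & ~MC_BIT63)

/* Hypothetical helper: the value the fixup path stores in %rax (BIT(63) | addr). */
static uint64_t encode_mcheck(uint64_t paddr)
{
	/* A physical address never has bit 63 set, so the flag cannot collide. */
	assert((paddr & MC_BIT63) == 0);
	return MC_BIT63 | paddr;
}

/* Hypothetical names for the three outcomes described in the thread. */
enum copy_result { COPY_OK, COPY_SHORT, COPY_MCHECK };

static enum copy_result classify_copy(uint64_t ret)
{
	if (COPY_HAD_MCHECK(ret))
		return COPY_MCHECK;         /* machine check: low bits hold the paddr */
	return ret ? COPY_SHORT : COPY_OK;  /* non-zero: number of not-copied bytes */
}
```

A caller would test COPY_HAD_MCHECK() first and only then interpret the remaining bits, either as a physical address or as a count of bytes not copied. The COPY_SHORT case applies to the planned copy_from_user() variant; mcsafe_memcpy() as posted returns either zero or a flagged address.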
> > copy_from_user() conceptually will look like this: > > int copy_from_user(void *to, void *from, unsigned long n) > { > u64 ret = mcsafe_memcpy(to, from, n); > > if (COPY_HAD_MCHECK(r)) { > if (memory_failure(COPY_MCHECK_PADDR(ret) >> PAGE_SIZE, ...)) > force_sig(SIGBUS, current); > return something; > } else > return ret; > } > > -Tony > _______________________________________________ > Linux-nvdimm mailing list > Linux-nvdimm@lists.01.org > https://lists.01.org/mailman/listinfo/linux-nvdimm -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f47.google.com (mail-oi0-f47.google.com [209.85.218.47]) by kanga.kvack.org (Postfix) with ESMTP id 8D6126B0253 for ; Fri, 11 Dec 2015 16:51:12 -0500 (EST) Received: by oifz134 with SMTP id z134so1684432oif.0 for ; Fri, 11 Dec 2015 13:51:12 -0800 (PST) Received: from mail-ob0-x233.google.com (mail-ob0-x233.google.com. 
[2607:f8b0:4003:c01::233]) by mx.google.com with ESMTPS id s125si19023922oif.132.2015.12.11.13.51.11 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 13:51:11 -0800 (PST) Received: by obc18 with SMTP id 18so92344256obc.2 for ; Fri, 11 Dec 2015 13:51:11 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 13:50:52 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: "Luck, Tony" Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML On Fri, Dec 11, 2015 at 1:19 PM, Luck, Tony wrote: >> I still don't get the BIT(63) thing. Can you explain it? > > It will be more obvious when I get around to writing copy_from_user(). > > Then we will have a function that can take page faults if there are pages > that are not present. If the page faults can't be fixed we have a -EFAULT > condition. We can also take machine checks if we reads from a location with an > uncorrected error. > > We need to distinguish these two cases because the action we take is > different. For the unresolved page fault we already have the ABI that the > copy_to/from_user() functions return zero for success, and a non-zero > return is the number of not-copied bytes. I'm missing something, though. The normal fixup_exception path doesn't touch rax at all. The memory_failure path does. But couldn't you distinguish them by just pointing the exception handlers at different landing pads? 
Also, would it be more straightforward if the mcexception landing pad looked up the va -> pa mapping by itself? Or is that somehow not reliable?

--Andy

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f177.google.com (mail-pf0-f177.google.com [209.85.192.177]) by kanga.kvack.org (Postfix) with ESMTP id C0FCE6B0253 for ; Fri, 11 Dec 2015 17:17:33 -0500 (EST) Received: by pfd5 with SMTP id 5so10790556pfd.2 for ; Fri, 11 Dec 2015 14:17:33 -0800 (PST) Received: from mga03.intel.com (mga03.intel.com. [134.134.136.65]) by mx.google.com with ESMTP id t67si3702438pfa.123.2015.12.11.14.17.32 for ; Fri, 11 Dec 2015 14:17:32 -0800 (PST) From: "Luck, Tony" Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Date: Fri, 11 Dec 2015 22:17:10 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> In-Reply-To: Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML

> I'm missing something, though.  The normal fixup_exception path
> doesn't touch rax at all.  The memory_failure path does.  But couldn't
> you distinguish them by just pointing the exception handlers at
> different landing pads?

Perhaps I'm just trying to take a short cut to avoid writing
some clever fixup code for the target ip that goes into the
exception table.

For __copy_user_nocache() we have four possible targets
for fixup depending on where we were in the function.

        .section .fixup,"ax"
30:     shll $6,%ecx
        addl %ecx,%edx
        jmp 60f
40:     lea (%rdx,%rcx,8),%rdx
        jmp 60f
50:     movl %ecx,%edx
60:     sfence
        jmp copy_user_handle_tail
        .previous

Note that this code also takes a shortcut
by jumping to copy_user_handle_tail() to
finish up the copy a byte at a time ... and
running back into the same page fault a 2nd
time to make sure the byte count is exactly
right.

I really, really, don't want to run back into
the poison again.  It would probably work, but
because current generation Intel cpus broadcast machine
checks to every logical cpu, it is a lot of overhead,
and potentially risky.

> Also, would it be more straightforward if the mcexception landing pad
> looked up the va -> pa mapping by itself?  Or is that somehow not
> reliable?

If we did get all the above right, then we could have
target use virt_to_phys() to convert to physical ...
I don't see that this part would be a problem.

-Tony

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f43.google.com (mail-qg0-f43.google.com [209.85.192.43]) by kanga.kvack.org (Postfix) with ESMTP id BF1F56B0253 for ; Fri, 11 Dec 2015 17:20:17 -0500 (EST) Received: by qgz52 with SMTP id 52so20842703qgz.1 for ; Fri, 11 Dec 2015 14:20:17 -0800 (PST) Received: from mail-qk0-x232.google.com (mail-qk0-x232.google.com. [2607:f8b0:400d:c09::232]) by mx.google.com with ESMTPS id p104si22301267qgd.126.2015.12.11.14.20.17 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 14:20:17 -0800 (PST) Received: by qkck189 with SMTP id k189so24925813qkc.0 for ; Fri, 11 Dec 2015 14:20:16 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> Date: Fri, 11 Dec 2015 14:20:16 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: "Luck, Tony" Cc: Andy Lutomirski , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML On Fri, Dec 11, 2015 at 2:17 PM, Luck, Tony wrote: >> Also, would it be more straightforward if the mcexception landing pad >> looked up the va -> pa mapping by itself? Or is that somehow not >> reliable? > > If we did get all the above right, then we could have > target use virt_to_phys() to convert to physical ... > I don't see that this part would be a problem. virt_to_phys() implies a linear address. 
In the case of the use in the pmem driver we'll be using an ioremap()'d address off somewherein vmalloc space. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f171.google.com (mail-ob0-f171.google.com [209.85.214.171]) by kanga.kvack.org (Postfix) with ESMTP id BFD876B0253 for ; Fri, 11 Dec 2015 17:27:13 -0500 (EST) Received: by obbsd4 with SMTP id sd4so44104989obb.0 for ; Fri, 11 Dec 2015 14:27:13 -0800 (PST) Received: from mail-ob0-x229.google.com (mail-ob0-x229.google.com. [2607:f8b0:4003:c01::229]) by mx.google.com with ESMTPS id fj3si8933749obc.64.2015.12.11.14.27.13 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 14:27:13 -0800 (PST) Received: by obc18 with SMTP id 18so92925926obc.2 for ; Fri, 11 Dec 2015 14:27:13 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 14:26:53 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Dan Williams Cc: "Luck, Tony" , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML On Fri, Dec 11, 2015 at 2:20 PM, Dan Williams wrote: > On Fri, Dec 11, 2015 at 2:17 PM, Luck, Tony wrote: >>> Also, would it be more straightforward if the mcexception landing pad >>> looked up the va -> pa mapping by itself? Or is that somehow not >>> reliable? 
>> >> If we did get all the above right, then we could have >> target use virt_to_phys() to convert to physical ... >> I don't see that this part would be a problem. > > virt_to_phys() implies a linear address. In the case of the use in > the pmem driver we'll be using an ioremap()'d address off somewherein > vmalloc space. There's always slow_virt_to_phys. Note that I don't fundamentally object to passing the pa to the fixup handler. I just think we should try to disentangle that from figuring out what exactly the failure was. Also, are there really PCOMMIT-capable CPUs that still forcibly broadcast MCE? If, so, that's unfortunate. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f180.google.com (mail-pf0-f180.google.com [209.85.192.180]) by kanga.kvack.org (Postfix) with ESMTP id 4DD406B0253 for ; Fri, 11 Dec 2015 17:35:19 -0500 (EST) Received: by pfee188 with SMTP id e188so2316165pfe.1 for ; Fri, 11 Dec 2015 14:35:19 -0800 (PST) Received: from mga11.intel.com (mga11.intel.com. 
[192.55.52.93]) by mx.google.com with ESMTP id 68si3786931pfi.137.2015.12.11.14.35.18 for ; Fri, 11 Dec 2015 14:35:18 -0800 (PST) From: "Luck, Tony" Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Date: Fri, 11 Dec 2015 22:35:17 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> In-Reply-To: Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski , "Williams, Dan J" Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML

> Also, are there really PCOMMIT-capable CPUs that still forcibly
> broadcast MCE?  If, so, that's unfortunate.

PCOMMIT and LMCE arrive together ... though BIOS is in the decision
path to enable LMCE, so it is possible that some systems could still
broadcast if the BIOS writer decides to not allow local.

But a machine check safe copy_from_user() would be useful
current generation cpus that broadcast all the time.

-Tony

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f173.google.com (mail-ob0-f173.google.com [209.85.214.173]) by kanga.kvack.org (Postfix) with ESMTP id CB8A16B0253 for ; Fri, 11 Dec 2015 17:38:33 -0500 (EST) Received: by obc18 with SMTP id 18so93097490obc.2 for ; Fri, 11 Dec 2015 14:38:33 -0800 (PST) Received: from mail-oi0-x236.google.com (mail-oi0-x236.google.com. [2607:f8b0:4003:c06::236]) by mx.google.com with ESMTPS id u132si2774013oif.139.2015.12.11.14.38.33 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 14:38:33 -0800 (PST) Received: by oiww189 with SMTP id w189so70700695oiw.3 for ; Fri, 11 Dec 2015 14:38:33 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 14:38:13 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: "Luck, Tony" Cc: "Williams, Dan J" , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML On Fri, Dec 11, 2015 at 2:35 PM, Luck, Tony wrote: >> Also, are there really PCOMMIT-capable CPUs that still forcibly >> broadcast MCE? If, so, that's unfortunate. > > PCOMMIT and LMCE arrive together ... though BIOS is in the decision > path to enable LMCE, so it is possible that some systems could still > broadcast if the BIOS writer decides to not allow local. I really wish Intel would stop doing that. 
>
> But a machine check safe copy_from_user() would be useful
> current generation cpus that broadcast all the time.

Fair enough.

--Andy

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f172.google.com (mail-pf0-f172.google.com [209.85.192.172]) by kanga.kvack.org (Postfix) with ESMTP id 9AE2C6B0254 for ; Fri, 11 Dec 2015 17:45:35 -0500 (EST) Received: by pfee188 with SMTP id e188so2426505pfe.1 for ; Fri, 11 Dec 2015 14:45:35 -0800 (PST) Received: from mga09.intel.com (mga09.intel.com. [134.134.136.24]) by mx.google.com with ESMTP id ie7si3839185pad.155.2015.12.11.14.45.34 for ; Fri, 11 Dec 2015 14:45:35 -0800 (PST) From: "Luck, Tony" Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Date: Fri, 11 Dec 2015 22:45:33 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> In-Reply-To: Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Andy Lutomirski Cc: "Williams, Dan J" , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML

>> But a machine check safe copy_from_user() would be useful
>> current generation cpus that broadcast all the time.
>
> Fair enough.

Thanks for spending the time to look at this.  Coaxing me to re-write the
tail of do_machine_check() has made that code much better. Too many
years of one patch on top of another without looking at the whole context.

Cogitate on this series over the weekend and see if you can give me
an Acked-by or Reviewed-by (I'll be adding a #define for BIT(63)).

-Tony

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ob0-f181.google.com (mail-ob0-f181.google.com [209.85.214.181]) by kanga.kvack.org (Postfix) with ESMTP id 38F026B0253 for ; Fri, 11 Dec 2015 17:56:06 -0500 (EST) Received: by obbsd4 with SMTP id sd4so44524424obb.0 for ; Fri, 11 Dec 2015 14:56:06 -0800 (PST) Received: from mail-ob0-x231.google.com (mail-ob0-x231.google.com.
[2607:f8b0:4003:c01::231]) by mx.google.com with ESMTPS id d3si711821obo.16.2015.12.11.14.56.05 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Dec 2015 14:56:05 -0800 (PST) Received: by obber4 with SMTP id er4so10968783obb.3 for ; Fri, 11 Dec 2015 14:56:05 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 14:55:45 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: "Luck, Tony" Cc: "Williams, Dan J" , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML On Fri, Dec 11, 2015 at 2:45 PM, Luck, Tony wrote: >>> But a machine check safe copy_from_user() would be useful >>> current generation cpus that broadcast all the time. >> >> Fair enough. > > Thanks for spending the time to look at this. Coaxing me to re-write the > tail of do_machine_check() has made that code much better. Too many > years of one patch on top of another without looking at the whole context. > > Cogitate on this series over the weekend and see if you can give me > an Acked-by or Reviewed-by (I'll be adding a #define for BIT(63)). I can't review the MCE decoding part, because I don't understand it nearly well enough. The interaction with the core fault handling looks fine, modulo any need to bikeshed on the macro naming (which I'll refrain from doing). 
I still think it would be better if you get rid of BIT(63) and use a pair of landing pads, though. They could be as simple as: .Lpage_fault_goes_here: xorq %rax, %rax jmp .Lbad .Lmce_goes_here: /* set high bit of rax or whatever */ /* fall through */ .Lbad: /* deal with it */ That way the magic is isolated to the function that needs the magic. Also, at least renaming the macro to EXTABLE_MC_PA_IN_AX might be nice. It'll keep future users honest. Maybe some day there'll be a PA_IN_AX flag, and, heck, maybe some day there'll be ways to get info for non-MCE faults delivered through fixup_exception. --Andy -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f53.google.com (mail-wm0-f53.google.com [74.125.82.53]) by kanga.kvack.org (Postfix) with ESMTP id CE5366B0253 for ; Sat, 12 Dec 2015 05:11:46 -0500 (EST) Received: by wmnn186 with SMTP id n186so62431708wmn.0 for ; Sat, 12 Dec 2015 02:11:46 -0800 (PST) Received: from mail.skyhub.de (mail.skyhub.de. 
[2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id y72si9145667wmd.20.2015.12.12.02.11.45 for ; Sat, 12 Dec 2015 02:11:45 -0800 (PST) Date: Sat, 12 Dec 2015 11:11:42 +0100 From: Borislav Petkov Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables Message-ID: <20151212101142.GA3867@pd.tnic> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tony Luck Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org On Thu, Dec 10, 2015 at 01:58:04PM -0800, Tony Luck wrote: > Copy the existing page fault fixup mechanisms to create a new table > to be used when fixing machine checks. Note: > 1) At this time we only provide a macro to annotate assembly code > 2) We assume all fixups will in code builtin to the kernel. > 3) Only for x86_64 > 4) New code under CONFIG_MCE_KERNEL_RECOVERY > > Signed-off-by: Tony Luck > --- > arch/x86/Kconfig | 4 ++++ > arch/x86/include/asm/asm.h | 10 ++++++++-- > arch/x86/include/asm/uaccess.h | 8 ++++++++ > arch/x86/mm/extable.c | 19 +++++++++++++++++++ > include/asm-generic/vmlinux.lds.h | 6 ++++++ > include/linux/module.h | 1 + > kernel/extable.c | 20 ++++++++++++++++++++ > 7 files changed, 66 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 96d058a87100..db5c6e1d6e37 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -1001,6 +1001,10 @@ config X86_MCE_INJECT > If you don't know what a machine check is and you don't do kernel > QA it is safe to say n. > > +config MCE_KERNEL_RECOVERY > + depends on X86_MCE && X86_64 > + def_bool y Shouldn't that depend on NVDIMM or whatnot? 
Looks too generic now. > + > config X86_THERMAL_VECTOR > def_bool y > depends on X86_MCE_INTEL > diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h > index 189679aba703..a5d483ac11fa 100644 > --- a/arch/x86/include/asm/asm.h > +++ b/arch/x86/include/asm/asm.h > @@ -44,13 +44,19 @@ > > /* Exception table entry */ > #ifdef __ASSEMBLY__ > -# define _ASM_EXTABLE(from,to) \ > - .pushsection "__ex_table","a" ; \ > +# define __ASM_EXTABLE(from, to, table) \ > + .pushsection table, "a" ; \ > .balign 8 ; \ > .long (from) - . ; \ > .long (to) - . ; \ > .popsection > > +# define _ASM_EXTABLE(from, to) \ > + __ASM_EXTABLE(from, to, "__ex_table") > + > +# define _ASM_MCEXTABLE(from, to) \ > + __ASM_EXTABLE(from, to, "__mcex_table") > + > # define _ASM_EXTABLE_EX(from,to) \ > .pushsection "__ex_table","a" ; \ > .balign 8 ; \ > diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h > index a8df874f3e88..7b02ca1991b4 100644 > --- a/arch/x86/include/asm/uaccess.h > +++ b/arch/x86/include/asm/uaccess.h > @@ -111,6 +111,14 @@ struct exception_table_entry { > #define ARCH_HAS_SEARCH_EXTABLE > > extern int fixup_exception(struct pt_regs *regs); > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > +extern int fixup_mcexception(struct pt_regs *regs, u64 addr); > +#else > +static inline int fixup_mcexception(struct pt_regs *regs, u64 addr) > +{ > + return 0; > +} > +#endif > extern int early_fixup_exception(unsigned long *ip); No need for "extern" > > /* > diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c > index 903ec1e9c326..a461c4212758 100644 > --- a/arch/x86/mm/extable.c > +++ b/arch/x86/mm/extable.c > @@ -49,6 +49,25 @@ int fixup_exception(struct pt_regs *regs) > return 0; > } > > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > +int fixup_mcexception(struct pt_regs *regs, u64 addr) > +{ If you move the #ifdef here, you can save yourself the ifdeffery in the header above. 
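One way to read the suggestion above, as a user-space sketch: define the function unconditionally and put the #ifdef inside its body, so the header needs only a plain prototype and no inline stub. The struct layouts, the one-entry table and the linear search are simplified stand-ins (assumptions for this sketch), not the kernel's actual exception-table code, which stores relative offsets and searches a sorted table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel types used by the patch. */
struct pt_regs { uint64_t ip; uint64_t ax; };
struct exception_table_entry { uint64_t insn; uint64_t fixup; };

#define CONFIG_MCE_KERNEL_RECOVERY 1 /* remove to build the "always return 0" variant */

#ifdef CONFIG_MCE_KERNEL_RECOVERY
/* Stand-in table with one annotated instruction; the real one is built by the linker. */
static const struct exception_table_entry mcex_table[] = {
	{ .insn = 0x1000, .fixup = 0x2000 },
};

static const struct exception_table_entry *search_mcexception_tables(uint64_t ip)
{
	for (size_t i = 0; i < sizeof(mcex_table) / sizeof(mcex_table[0]); i++)
		if (mcex_table[i].insn == ip)
			return &mcex_table[i];
	return NULL;
}
#endif

/* Always defined, so callers and the header need no #ifdef of their own. */
int fixup_mcexception(struct pt_regs *regs, uint64_t addr)
{
	assert(regs != NULL);
#ifdef CONFIG_MCE_KERNEL_RECOVERY
	const struct exception_table_entry *fixup;

	fixup = search_mcexception_tables(regs->ip);
	if (fixup) {
		regs->ip = fixup->fixup;               /* resume at the landing pad */
		regs->ax = (UINT64_C(1) << 63) | addr; /* BIT(63) | addr, as in the patch */
		return 1;
	}
#else
	(void)addr; /* unused when recovery is compiled out */
#endif
	return 0;
}
```

The trade-off is that the function is always emitted, even when recovery is configured out, in exchange for a header with one unconditional declaration.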

> +	const struct exception_table_entry *fixup;
> +	unsigned long new_ip;
> +
> +	fixup = search_mcexception_tables(regs->ip);
> +	if (fixup) {
> +		new_ip = ex_fixup_addr(fixup);
> +
> +		regs->ip = new_ip;
> +		regs->ax = BIT(63) | addr;
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +#endif
> +
>  /* Restricted version used during very early boot */
>  int __init early_fixup_exception(unsigned long *ip)
>  {
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 1781e54ea6d3..21bb20d1172a 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -473,6 +473,12 @@
>  		VMLINUX_SYMBOL(__start___ex_table) = .; \
>  		*(__ex_table) \
>  		VMLINUX_SYMBOL(__stop___ex_table) = .; \
> +	} \
> +	. = ALIGN(align); \
> +	__mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \
> +		VMLINUX_SYMBOL(__start___mcex_table) = .; \
> +		*(__mcex_table) \
> +		VMLINUX_SYMBOL(__stop___mcex_table) = .; \

Of all the places, this one is missing #ifdef CONFIG_MCE_KERNEL_RECOVERY.

>  	}
>
>  /*
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 3a19c79918e0..ffecbfcc462c 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -270,6 +270,7 @@ extern const typeof(name) __mod_##type##__##name##_device_table \
>
>  /* Given an address, look for it in the exception tables */
>  const struct exception_table_entry *search_exception_tables(unsigned long add);
> +const struct exception_table_entry *search_mcexception_tables(unsigned long a);
>
>  struct notifier_block;
>
> diff --git a/kernel/extable.c b/kernel/extable.c
> index e820ccee9846..7b224fbcb708 100644
> --- a/kernel/extable.c
> +++ b/kernel/extable.c
> @@ -34,6 +34,10 @@ DEFINE_MUTEX(text_mutex);
>
>  extern struct exception_table_entry __start___ex_table[];
>  extern struct exception_table_entry __stop___ex_table[];
> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> +extern struct exception_table_entry __start___mcex_table[];
> +extern struct exception_table_entry __stop___mcex_table[];
> +#endif
>
>  /* Cleared by build time tools if the table is already sorted. */
>  u32 __initdata __visible main_extable_sort_needed = 1;
> @@ -45,6 +49,10 @@ void __init sort_main_extable(void)
>  		pr_notice("Sorting __ex_table...\n");
>  		sort_extable(__start___ex_table, __stop___ex_table);
>  	}
> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> +	if (__stop___mcex_table > __start___mcex_table)
> +		sort_extable(__start___mcex_table, __stop___mcex_table);
> +#endif
>  }
>
>  /* Given an address, look for it in the exception tables. */
> @@ -58,6 +66,18 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
>  	return e;
>  }
>
> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> +/* Given an address, look for it in the machine check exception tables. */
> +const struct exception_table_entry *search_mcexception_tables(
> +				unsigned long addr)
> +{
> +	const struct exception_table_entry *e;
> +
> +	e = search_extable(__start___mcex_table, __stop___mcex_table-1, addr);
> +	return e;
> +}
> +#endif

You can make this one a bit more readable by doing:

/* Given an address, look for it in the machine check exception tables. */
const struct exception_table_entry *
search_mcexception_tables(unsigned long addr)
{
#ifdef CONFIG_MCE_KERNEL_RECOVERY
	return search_extable(__start___mcex_table,
			      __stop___mcex_table - 1, addr);
#endif
}

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f43.google.com (mail-wm0-f43.google.com [74.125.82.43]) by kanga.kvack.org (Postfix) with ESMTP id 1340B6B0038 for ; Mon, 14 Dec 2015 03:36:30 -0500 (EST)
Received: by wmnn186 with SMTP id n186so110040892wmn.0 for ; Mon, 14 Dec 2015 00:36:29 -0800 (PST)
Received: from mail-wm0-x229.google.com (mail-wm0-x229.google.com.
[2a00:1450:400c:c09::229]) by mx.google.com with ESMTPS id x9si44449502wje.220.2015.12.14.00.36.28 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Dec 2015 00:36:28 -0800 (PST)
Received: by mail-wm0-x229.google.com with SMTP id n186so34247723wmn.0 for ; Mon, 14 Dec 2015 00:36:28 -0800 (PST)
Date: Mon, 14 Dec 2015 09:36:25 +0100
From: Ingo Molnar
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks
Message-ID: <20151214083625.GA28073@gmail.com>
References: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andy Lutomirski
Cc: "Luck, Tony", "Williams, Dan J", Borislav Petkov, Andrew Morton, Andy Lutomirski, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", linux-nvdimm, X86 ML

* Andy Lutomirski wrote:

> I still think it would be better if you get rid of BIT(63) and use a
> pair of landing pads, though. They could be as simple as:
>
> .Lpage_fault_goes_here:
>     xorq %rax, %rax
>     jmp .Lbad
>
> .Lmce_goes_here:
>     /* set high bit of rax or whatever */
>     /* fall through */
>
> .Lbad:
>     /* deal with it */
>
> That way the magic is isolated to the function that needs the magic.

Seconded - this is the usual pattern we use in all assembly functions.

Thanks,

	Ingo
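
For readers following the BIT(63) discussion, here is a minimal C sketch of that return-value convention as the patch describes it: bit 63 flags a machine check and the remaining bits carry the physical address, so that address 0x0 stays distinguishable from success. The helper names are hypothetical; the decode macros the series actually adds are not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: flag bit meaning "a machine check hit this copy". */
#define DEMO_MCSAFE_FAULT_FLAG (UINT64_C(1) << 63)

/* Pack a faulting physical address the way fixup_mcexception() fills regs->ax. */
static inline uint64_t demo_mcsafe_encode(uint64_t paddr)
{
	return DEMO_MCSAFE_FAULT_FLAG | paddr;
}

/* True if the return value reports a fault; plain 0 means success. */
static inline bool demo_mcsafe_faulted(uint64_t ret)
{
	return (ret & DEMO_MCSAFE_FAULT_FLAG) != 0;
}

/* Strip the flag to recover the physical address of the error. */
static inline uint64_t demo_mcsafe_fault_addr(uint64_t ret)
{
	return ret & ~DEMO_MCSAFE_FAULT_FLAG;
}
```

With a pair of landing pads, as suggested above, the flag would be set inside the copy routine itself rather than by the generic fixup code.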

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-io0-f173.google.com (mail-io0-f173.google.com [209.85.223.173]) by kanga.kvack.org (Postfix) with ESMTP id 964C66B0038 for ; Mon, 14 Dec 2015 12:58:46 -0500 (EST)
Received: by iow186 with SMTP id 186so34640696iow.0 for ; Mon, 14 Dec 2015 09:58:46 -0800 (PST)
Received: from mail-ig0-x22e.google.com (mail-ig0-x22e.google.com. [2607:f8b0:4001:c05::22e]) by mx.google.com with ESMTPS id z134si19141987iod.50.2015.12.14.09.58.45 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Dec 2015 09:58:46 -0800 (PST)
Received: by igbxm8 with SMTP id xm8so89862756igb.1 for ; Mon, 14 Dec 2015 09:58:45 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20151212101142.GA3867@pd.tnic>
References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic>
Date: Mon, 14 Dec 2015 10:58:45 -0700
Message-ID:
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables
From: Ross Zwisler
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
List-ID:
To: Borislav Petkov
Cc: Tony Luck, linux-nvdimm, X86 ML, linux-kernel@vger.kernel.org, Ingo Molnar, linux-mm@kvack.org, Andy Lutomirski, Andrew Morton, Ross Zwisler

On Sat, Dec 12, 2015 at 3:11 AM, Borislav Petkov wrote:
> On Thu, Dec 10, 2015 at 01:58:04PM -0800, Tony Luck wrote:
<>
>> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
>> +/* Given an address, look for it in the machine check exception tables. */
>> +const struct exception_table_entry *search_mcexception_tables(
>> +				unsigned long addr)
>> +{
>> +	const struct exception_table_entry *e;
>> +
>> +	e = search_extable(__start___mcex_table, __stop___mcex_table-1, addr);
>> +	return e;
>> +}
>> +#endif
>
> You can make this one a bit more readable by doing:
>
> /* Given an address, look for it in the machine check exception tables. */
> const struct exception_table_entry *
> search_mcexception_tables(unsigned long addr)
> {
> #ifdef CONFIG_MCE_KERNEL_RECOVERY
> 	return search_extable(__start___mcex_table,
> 			      __stop___mcex_table - 1, addr);
> #endif
> }

With this code if CONFIG_MCE_KERNEL_RECOVERY isn't defined you'll get
a compiler error that the function doesn't have a return statement,
right? I think we need an #else to return NULL, or to have the #ifdef
encompass the whole function definition as it was in Tony's version.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f46.google.com (mail-pa0-f46.google.com [209.85.220.46]) by kanga.kvack.org (Postfix) with ESMTP id CFBA16B026D for ; Mon, 14 Dec 2015 14:46:49 -0500 (EST)
Received: by pacwq6 with SMTP id wq6so108578787pac.1 for ; Mon, 14 Dec 2015 11:46:49 -0800 (PST)
Received: from mga09.intel.com (mga09.intel.com.
[134.134.136.24]) by mx.google.com with ESMTP id qy7si8457386pab.169.2015.12.14.11.46.49 for ; Mon, 14 Dec 2015 11:46:49 -0800 (PST)
Date: Mon, 14 Dec 2015 11:46:48 -0800
From: "Luck, Tony"
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks
Message-ID: <20151214194648.GA15222@agluck-desk.sc.intel.com>
References: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com> <20151214083625.GA28073@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20151214083625.GA28073@gmail.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ingo Molnar
Cc: Andy Lutomirski, "Williams, Dan J", Borislav Petkov, Andrew Morton, Andy Lutomirski, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", linux-nvdimm, X86 ML

On Mon, Dec 14, 2015 at 09:36:25AM +0100, Ingo Molnar wrote:
> >     /* deal with it */
> >
> > That way the magic is isolated to the function that needs the magic.
>
> Seconded - this is the usual pattern we use in all assembly functions.

Ok - you want me to write some x86 assembly code (you may regret that).

Initial question ... here's the fixup for __copy_user_nocache()

	.section .fixup,"ax"
30:	shll $6,%ecx
	addl %ecx,%edx
	jmp 60f
40:	lea (%rdx,%rcx,8),%rdx
	jmp 60f
50:	movl %ecx,%edx
60:	sfence
	jmp copy_user_handle_tail
	.previous

Are %ecx and %rcx synonyms for the same register? Is there some
super subtle reason we use the 'r' names in the "40" fixup, but
the 'e' names everywhere else in this code (and the 'e' names in
the body of the original function)?

-Tony
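
The arithmetic in the three quoted fixup entries can be modelled in C as below. This is only a sketch: the function and enum names are hypothetical, and reading the result as "bytes left to copy" is an assumption based on the jump to copy_user_handle_tail. The one architectural fact relied on is that, in 64-bit mode, an instruction that writes a 32-bit register clears the upper 32 bits of the full register.

```c
#include <stdint.h>

/* Hypothetical labels mirroring the 30/40/50 fixup entries quoted above. */
enum demo_fixup { DEMO_FIXUP_30, DEMO_FIXUP_40, DEMO_FIXUP_50 };

/*
 * Return the value left in %rdx after the chosen fixup runs, given the
 * %rcx and %rdx values at the time of the fault.
 */
static uint64_t demo_fixup_rdx(enum demo_fixup which, uint64_t rcx, uint64_t rdx)
{
	uint32_t ecx = (uint32_t)rcx;	/* low 32 bits of rcx */
	uint32_t edx = (uint32_t)rdx;	/* low 32 bits of rdx */

	switch (which) {
	case DEMO_FIXUP_30:		/* shll $6,%ecx ; addl %ecx,%edx */
		ecx <<= 6;
		edx += ecx;
		return edx;		/* 32-bit write: upper half becomes zero */
	case DEMO_FIXUP_40:		/* lea (%rdx,%rcx,8),%rdx */
		return rdx + rcx * 8;	/* full 64-bit address arithmetic */
	case DEMO_FIXUP_50:		/* movl %ecx,%edx */
	default:
		return ecx;		/* 32-bit write: upper half becomes zero */
	}
}
```

For counts that fit in 32 bits, the 'e' and 'r' forms in this model give the same result; they differ only when the upper halves of the registers hold stale bits.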

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-oi0-f46.google.com (mail-oi0-f46.google.com [209.85.218.46]) by kanga.kvack.org (Postfix) with ESMTP id A3A0F6B0254 for ; Mon, 14 Dec 2015 15:12:14 -0500 (EST)
Received: by oian133 with SMTP id n133so21308516oia.3 for ; Mon, 14 Dec 2015 12:12:14 -0800 (PST)
Received: from mail-ob0-x22d.google.com (mail-ob0-x22d.google.com. [2607:f8b0:4003:c01::22d]) by mx.google.com with ESMTPS id h5si4124640obe.20.2015.12.14.12.12.13 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 14 Dec 2015 12:12:14 -0800 (PST)
Received: by obciw8 with SMTP id iw8so140914627obc.1 for ; Mon, 14 Dec 2015 12:12:13 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <20151214194648.GA15222@agluck-desk.sc.intel.com>
References: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com> <20151214083625.GA28073@gmail.com> <20151214194648.GA15222@agluck-desk.sc.intel.com>
From: Andy Lutomirski
Date: Mon, 14 Dec 2015 12:11:53 -0800
Message-ID:
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Luck, Tony"
Cc: Ingo Molnar, "Williams, Dan J", Borislav Petkov, Andrew Morton, Andy Lutomirski, "linux-kernel@vger.kernel.org", "linux-mm@kvack.org", linux-nvdimm, X86 ML

On Mon, Dec 14, 2015 at 11:46 AM, Luck, Tony wrote:
> On Mon, Dec 14, 2015 at 09:36:25AM +0100, Ingo Molnar wrote:
>> >     /* deal with it */
>> >
>> > That way the magic is isolated to the function that needs the magic.
>>
>> Seconded - this is the usual pattern we use in all assembly functions.
>
> Ok - you want me to write some x86 assembly code (you may regret that).
>

All you have to do is erase all of the ia64 asm knowledge from your
brain and repurpose 1% of that space for x86 asm. You'll be a
world-class expert!

> Initial question ... here's the fixup for __copy_user_nocache()
>
>	.section .fixup,"ax"
> 30:	shll $6,%ecx
>	addl %ecx,%edx
>	jmp 60f
> 40:	lea (%rdx,%rcx,8),%rdx
>	jmp 60f
> 50:	movl %ecx,%edx
> 60:	sfence
>	jmp copy_user_handle_tail
>	.previous
>
> Are %ecx and %rcx synonyms for the same register? Is there some
> super subtle reason we use the 'r' names in the "40" fixup, but
> the 'e' names everywhere else in this code (and the 'e' names in
> the body of the original function)?

rcx is a 64-bit register. ecx is the low 32 bits of it. If you read
from ecx, you get the low 32 bits, but if you write to ecx, you zero
the high bits as a side-effect.

--Andy

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f51.google.com (mail-wm0-f51.google.com [74.125.82.51]) by kanga.kvack.org (Postfix) with ESMTP id CDC1D6B0038 for ; Mon, 14 Dec 2015 17:28:05 -0500 (EST)
Received: by mail-wm0-f51.google.com with SMTP id n186so68577911wmn.0 for ; Mon, 14 Dec 2015 14:28:05 -0800 (PST)
Received: from mail.skyhub.de (mail.skyhub.de.
[2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id xw2si48879112wjc.40.2015.12.14.14.28.04 for ; Mon, 14 Dec 2015 14:28:04 -0800 (PST)
Date: Mon, 14 Dec 2015 23:27:59 +0100
From: Borislav Petkov
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables
Message-ID: <20151214222759.GF10520@pd.tnic>
References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Ross Zwisler
Cc: Tony Luck, linux-nvdimm, X86 ML, linux-kernel@vger.kernel.org, Ingo Molnar, linux-mm@kvack.org, Andy Lutomirski, Andrew Morton, Ross Zwisler

On Mon, Dec 14, 2015 at 10:58:45AM -0700, Ross Zwisler wrote:
> With this code if CONFIG_MCE_KERNEL_RECOVERY isn't defined you'll get
> a compiler error that the function doesn't have a return statement,
> right? I think we need an #else to return NULL, or to have the #ifdef
> encompass the whole function definition as it was in Tony's version.

Right, correct.

Thanks.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf0-f179.google.com (mail-pf0-f179.google.com [209.85.192.179]) by kanga.kvack.org (Postfix) with ESMTP id 0B6DB6B0038 for ; Mon, 14 Dec 2015 20:01:01 -0500 (EST)
Received: by pff63 with SMTP id 63so20494775pff.2 for ; Mon, 14 Dec 2015 17:01:00 -0800 (PST)
Received: from mga01.intel.com (mga01.intel.com.
[192.55.52.88]) by mx.google.com with ESMTP id n1si12408577pap.152.2015.12.14.17.01.00 for ; Mon, 14 Dec 2015 17:01:00 -0800 (PST)
Date: Mon, 14 Dec 2015 17:00:59 -0800
From: "Luck, Tony"
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables
Message-ID: <20151215010059.GA17353@agluck-desk.sc.intel.com>
References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20151212101142.GA3867@pd.tnic>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Borislav Petkov
Cc: Ingo Molnar, Andrew Morton, Andy Lutomirski, Dan Williams, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org

On Sat, Dec 12, 2015 at 11:11:42AM +0100, Borislav Petkov wrote:
> > +config MCE_KERNEL_RECOVERY
> > +	depends on X86_MCE && X86_64
> > +	def_bool y
>
> Shouldn't that depend on NVDIMM or whatnot? Looks too generic now.

Not sure what the "whatnot" would be though. Making it depend on
X86_MCE should keep it out of the tiny configurations. By the time
you have MCE support, this seems like a pretty small incremental
change.

> > +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> > +int fixup_mcexception(struct pt_regs *regs, u64 addr)
> > +{
>
> If you move the #ifdef here, you can save yourself the ifdeffery in the
> header above.

I realized I didn't need the inline stub function in the header.

> > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> > index 1781e54ea6d3..21bb20d1172a 100644
> > --- a/include/asm-generic/vmlinux.lds.h
> > +++ b/include/asm-generic/vmlinux.lds.h
> > @@ -473,6 +473,12 @@
> >  		VMLINUX_SYMBOL(__start___ex_table) = .; \
> >  		*(__ex_table) \
> >  		VMLINUX_SYMBOL(__stop___ex_table) = .; \
> > +	} \
> > +	. = ALIGN(align); \
> > +	__mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \
> > +		VMLINUX_SYMBOL(__start___mcex_table) = .; \
> > +		*(__mcex_table) \
> > +		VMLINUX_SYMBOL(__stop___mcex_table) = .; \
>
> Of all the places, this one is missing #ifdef CONFIG_MCE_KERNEL_RECOVERY.

Is there some cpp magic to use an #ifdef inside a multi-line macro like this?
Impact of not having the #ifdef is two extra symbols (the start/stop ones)
in the symbol table of the final binary. If that's unacceptable I can fall
back to an earlier unpublished version that had separate EXCEPTION_TABLE and
MCEXCEPTION_TABLE macros with both invoked in the x86 vmlinux.lds.S file.

> You can make this one a bit more readable by doing:
>
> /* Given an address, look for it in the machine check exception tables. */
> const struct exception_table_entry *
> search_mcexception_tables(unsigned long addr)
> {
> #ifdef CONFIG_MCE_KERNEL_RECOVERY
> 	return search_extable(__start___mcex_table,
> 			      __stop___mcex_table - 1, addr);
> #endif
> }

I got rid of the local variable and the return ... but left the
#ifdef/#endif around the whole function.

-Tony

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f43.google.com (mail-wm0-f43.google.com [74.125.82.43]) by kanga.kvack.org (Postfix) with ESMTP id 693CF6B0038 for ; Tue, 15 Dec 2015 04:47:01 -0500 (EST)
Received: by mail-wm0-f43.google.com with SMTP id p66so82243813wmp.0 for ; Tue, 15 Dec 2015 01:47:01 -0800 (PST)
Received: from mail.skyhub.de (mail.skyhub.de.
[2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id jo5si770885wjb.37.2015.12.15.01.46.59 for ; Tue, 15 Dec 2015 01:47:00 -0800 (PST)
Date: Tue, 15 Dec 2015 10:46:53 +0100
From: Borislav Petkov
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables
Message-ID: <20151215094653.GA25973@pd.tnic>
References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic> <20151215010059.GA17353@agluck-desk.sc.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20151215010059.GA17353@agluck-desk.sc.intel.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Luck, Tony"
Cc: Ingo Molnar, Andrew Morton, Andy Lutomirski, Dan Williams, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org

On Mon, Dec 14, 2015 at 05:00:59PM -0800, Luck, Tony wrote:
> Not sure what the "whatnot" would be though. Making it depend on
> X86_MCE should keep it out of the tiny configurations. By the time
> you have MCE support, this seems like a pretty small incremental
> change.

Ok, so it is called CONFIG_LIBNVDIMM. Do you see a use case for this
stuff except on machines with NVDIMM hw? CONFIG_LIBNVDIMM can select it
but on !NVDIMM systems you don't really need it enabled.

> Is there some cpp magic to use an #ifdef inside a multi-line macro like this?
> Impact of not having the #ifdef is two extra symbols (the start/stop ones)
> in the symbol table of the final binary. If that's unacceptable I can fall
> back to an earlier unpublished version that had separate EXCEPTION_TABLE and
> MCEXCEPTION_TABLE macros with both invoked in the x86 vmlinux.lds.S file.

I think what is more important is that this should be in the
x86-specific linker script, not in the generic one. And yes, we should
strive to be clean and not pollute the kernel image with symbols which
are unused, i.e.
when CONFIG_MCE_KERNEL_RECOVERY is not enabled.

This below seems to build ok here, on top of yours. It could be a
MCEXCEPTION_TABLE macro, as you say:

Index: b/include/asm-generic/vmlinux.lds.h
===================================================================
--- a/include/asm-generic/vmlinux.lds.h	2015-12-15 10:17:25.568046033 +0100
+++ b/include/asm-generic/vmlinux.lds.h	2015-12-15 10:07:06.064034490 +0100
@@ -484,12 +484,6 @@
 		*(__ex_table) \
 		VMLINUX_SYMBOL(__stop___ex_table) = .; \
 	} \
-	. = ALIGN(align); \
-	__mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \
-		VMLINUX_SYMBOL(__start___mcex_table) = .; \
-		*(__mcex_table) \
-		VMLINUX_SYMBOL(__stop___mcex_table) = .; \
-	}
 /*
  * Init task
Index: b/arch/x86/kernel/vmlinux.lds.S
===================================================================
--- a/arch/x86/kernel/vmlinux.lds.S	2015-12-14 11:38:58.188150070 +0100
+++ b/arch/x86/kernel/vmlinux.lds.S	2015-12-15 10:09:04.624036699 +0100
@@ -110,7 +110,17 @@ SECTIONS
 	NOTES :text :note
-	EXCEPTION_TABLE(16) :text = 0x9090
+	EXCEPTION_TABLE(16)
+
+#ifdef CONFIG_MCE_KERNEL_RECOVERY
+	. = ALIGN(16);
+	__mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) {
+		VMLINUX_SYMBOL(__start___mcex_table) = .;
+		*(__mcex_table)
+		VMLINUX_SYMBOL(__stop___mcex_table) = .;
+	}
+#endif
+	:text = 0x9090
 #if defined(CONFIG_DEBUG_RODATA)
 	/* .text should occupy whole number of pages */

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
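
On the "cpp magic" question: a preprocessor directive cannot appear inside a macro body, so the usual workaround is the one both mails point toward, a separate macro that expands to the optional piece or to nothing. The sketch below shows that pattern in plain C rather than in a linker script. All names are hypothetical, and DEMO_MCE_KERNEL_RECOVERY merely stands in for CONFIG_MCE_KERNEL_RECOVERY.

```c
#include <stddef.h>

/* Uncomment to model a build with the option enabled. */
/* #define DEMO_MCE_KERNEL_RECOVERY 1 */

/*
 * The optional part lives in its own macro, defined one way or the other
 * outside of any macro body.
 */
#ifdef DEMO_MCE_KERNEL_RECOVERY
#define DEMO_MCEXCEPTION_TABLE "__mcex_table",
#else
#define DEMO_MCEXCEPTION_TABLE
#endif

/* The multi-line macro just names the helper; no #ifdef is needed here. */
#define DEMO_TABLE_SECTIONS \
	"__ex_table", \
	DEMO_MCEXCEPTION_TABLE

static const char *const demo_sections[] = { DEMO_TABLE_SECTIONS };

/* Number of table sections that made it into this build. */
static size_t demo_section_count(void)
{
	return sizeof(demo_sections) / sizeof(demo_sections[0]);
}
```

With the option off, the helper expands to nothing and only the first entry is emitted, which is the property wanted for the start/stop symbols.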

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f44.google.com (mail-wm0-f44.google.com [74.125.82.44]) by kanga.kvack.org (Postfix) with ESMTP id 11EE56B0254 for ; Tue, 15 Dec 2015 05:44:10 -0500 (EST)
Received: by mail-wm0-f44.google.com with SMTP id n186so158543280wmn.1 for ; Tue, 15 Dec 2015 02:44:10 -0800 (PST)
Received: from mail.skyhub.de (mail.skyhub.de. [2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id w2si1009577wjf.153.2015.12.15.02.44.08 for ; Tue, 15 Dec 2015 02:44:08 -0800 (PST)
Date: Tue, 15 Dec 2015 11:44:02 +0100
From: Borislav Petkov
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables
Message-ID: <20151215104402.GC25973@pd.tnic>
References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic> <20151215010059.GA17353@agluck-desk.sc.intel.com> <20151215094653.GA25973@pd.tnic>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20151215094653.GA25973@pd.tnic>
Sender: owner-linux-mm@kvack.org
List-ID:
To: "Luck, Tony"
Cc: Ingo Molnar, Andrew Morton, Andy Lutomirski, Dan Williams, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org

On Tue, Dec 15, 2015 at 10:46:53AM +0100, Borislav Petkov wrote:
> I think what is more important is that this should be in the
> x86-specific linker script, not in the generic one.

And related to that, I think all those additions to kernel/extable.c
should be somewhere in arch/x86/ and not in generic code.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
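
For context, the additions to kernel/extable.c under discussion do two things: sort the new table once at boot and look up a faulting instruction address in it. A self-contained C sketch of that sort-then-search idea follows. It is a simplified model, not the kernel code: the names are hypothetical, entries hold absolute addresses instead of the relative offsets the x86 tables use, and the range is half-open where the kernel's search_extable() takes an inclusive last entry.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified entry: faulting instruction address and where to resume. */
struct demo_extable_entry {
	unsigned long insn;
	unsigned long fixup;
};

/* Order entries by instruction address. */
static int demo_cmp_entry(const void *a, const void *b)
{
	const struct demo_extable_entry *ea = a;
	const struct demo_extable_entry *eb = b;

	return (ea->insn > eb->insn) - (ea->insn < eb->insn);
}

/* Sort the half-open range [start, stop) so it can be binary searched. */
static void demo_sort_extable(struct demo_extable_entry *start,
			      struct demo_extable_entry *stop)
{
	qsort(start, (size_t)(stop - start), sizeof(*start), demo_cmp_entry);
}

/* Binary search [start, stop) for addr; NULL means "no fixup registered". */
static const struct demo_extable_entry *
demo_search_extable(const struct demo_extable_entry *start,
		    const struct demo_extable_entry *stop,
		    unsigned long addr)
{
	size_t lo = 0;
	size_t hi = (size_t)(stop - start);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (start[mid].insn < addr)
			lo = mid + 1;
		else if (start[mid].insn > addr)
			hi = mid;
		else
			return &start[mid];
	}
	return NULL;
}
```

A NULL result corresponds to the case where fixup_mcexception() returns 0 and the machine check cannot be recovered in the kernel.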

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f44.google.com (mail-wm0-f44.google.com [74.125.82.44]) by kanga.kvack.org (Postfix) with ESMTP id EAE386B0254 for ; Tue, 15 Dec 2015 06:43:23 -0500 (EST)
Received: by mail-wm0-f44.google.com with SMTP id p66so21172967wmp.1 for ; Tue, 15 Dec 2015 03:43:23 -0800 (PST)
Received: from mail.skyhub.de (mail.skyhub.de. [2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id em7si1313607wjd.150.2015.12.15.03.43.22 for ; Tue, 15 Dec 2015 03:43:22 -0800 (PST)
Date: Tue, 15 Dec 2015 12:43:14 +0100
From: Borislav Petkov
Subject: Re: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas
Message-ID: <20151215114314.GD25973@pd.tnic>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Tony Luck
Cc: Ingo Molnar, Andrew Morton, Andy Lutomirski, Dan Williams, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org

On Thu, Dec 10, 2015 at 04:14:44PM -0800, Tony Luck wrote:
> Extend the severity checking code to add a new context IN_KERNEL_RECOV
> which is used to indicate that the machine check was triggered by code
> in the kernel with a fixup entry.
>
> Add code to check for this situation and respond by altering the return
> IP to the fixup address and changing the regs->ax so that the recovery
> code knows the physical address of the error. Note that we also set bit
> 63 because 0x0 is a legal physical address.
>
> Major re-work to the tail code in do_machine_check() to make all this
> readable/maintainable. One functional change is that tolerant=3 no longer
> stops recovery actions. Revert to only skipping sending SIGBUS to the
> current process.
>
> Signed-off-by: Tony Luck
> ---
>  arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++++-
>  arch/x86/kernel/cpu/mcheck/mce.c          | 69 ++++++++++++++++---------------
>  2 files changed, 55 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> index 9c682c222071..ac7fbb0689fb 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> @@ -12,6 +12,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>
> @@ -29,7 +30,7 @@
>   * panic situations)
>   */
>
> -enum context { IN_KERNEL = 1, IN_USER = 2 };
> +enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 };
>  enum ser { SER_REQUIRED = 1, NO_SER = 2 };
>  enum exception { EXCP_CONTEXT = 1, NO_EXCP = 2 };
>
> @@ -48,6 +49,7 @@ static struct severity {
>  #define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c }
>  #define KERNEL .context = IN_KERNEL
>  #define USER .context = IN_USER
> +#define KERNEL_RECOV .context = IN_KERNEL_RECOV
>  #define SER .ser = SER_REQUIRED
>  #define NOSER .ser = NO_SER
>  #define EXCP .excp = EXCP_CONTEXT
> @@ -87,6 +89,10 @@ static struct severity {
>  		EXCP, KERNEL, MCGMASK(MCG_STATUS_RIPV, 0)
>  		),
>  	MCESEV(
> +		PANIC, "In kernel and no restart IP",
> +		EXCP, KERNEL_RECOV, MCGMASK(MCG_STATUS_RIPV, 0)
> +		),
> +	MCESEV(
>  		DEFERRED, "Deferred error",
>  		NOSER, MASK(MCI_STATUS_UC|MCI_STATUS_DEFERRED|MCI_STATUS_POISON, MCI_STATUS_DEFERRED)
>  		),
> @@ -123,6 +129,11 @@ static struct severity {
>  		MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, MCG_STATUS_RIPV)
>  		),
>  	MCESEV(
> +		AR, "Action required: data load error recoverable area of kernel",

... in ...

> +		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
> +		KERNEL_RECOV
> +		),
> +	MCESEV(
>  		AR, "Action required: data load error in a user process",
>  		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
>  		USER
> @@ -170,6 +181,9 @@ static struct severity {
>  		) /* always matches. keep at end */
>  };
>
> +#define mc_recoverable(mcg) (((mcg) & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) == \
> +				(MCG_STATUS_RIPV|MCG_STATUS_EIPV))
> +
>  /*
>   * If mcgstatus indicated that ip/cs on the stack were
>   * no good, then "m->cs" will be zero and we will have
> @@ -183,7 +197,11 @@ static struct severity {
>   */
>  static int error_context(struct mce *m)
>  {
> -	return ((m->cs & 3) == 3) ? IN_USER : IN_KERNEL;
> +	if ((m->cs & 3) == 3)
> +		return IN_USER;
> +	if (mc_recoverable(m->mcgstatus) && search_mcexception_tables(m->ip))
> +		return IN_KERNEL_RECOV;
> +	return IN_KERNEL;
>  }
>
>  /*
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 9d014b82a124..f2f568ad6409 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -31,6 +31,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -958,6 +959,20 @@ static void mce_clear_state(unsigned long *toclear)
>  	}
>  }
>
> +static int do_memory_failure(struct mce *m)
> +{
> +	int flags = MF_ACTION_REQUIRED;
> +	int ret;
> +
> +	pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr);
> +	if (!(m->mcgstatus & MCG_STATUS_RIPV))
> +		flags |= MF_MUST_KILL;
> +	ret = memory_failure(m->addr >> PAGE_SHIFT, MCE_VECTOR, flags);
> +	if (ret)
> +		pr_err("Memory error not recovered");
> +	return ret;
> +}
> +
>  /*
>   * The actual machine check handler. This only handles real
>   * exceptions when something got corrupted coming in through int 18.
> @@ -995,8 +1010,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	DECLARE_BITMAP(toclear, MAX_NR_BANKS);
>  	DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
>  	char *msg = "Unknown";
> -	u64 recover_paddr = ~0ull;
> -	int flags = MF_ACTION_REQUIRED;
>  	int lmce = 0;
>
>  	ist_enter(regs);
> @@ -1123,22 +1136,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  	}
>
>  	/*
> -	 * At insane "tolerant" levels we take no action. Otherwise
> -	 * we only die if we have no other choice. For less serious
> -	 * issues we try to recover, or limit damage to the current
> -	 * process.
> +	 * If tolerant is at an insane level we drop requests to kill
> +	 * processes and continue even when there is no way out
                                                                ^
                                                                |
                                                                . Fullstop here.

>  	 */
> -	if (cfg->tolerant < 3) {
> -		if (no_way_out)
> -			mce_panic("Fatal machine check on current CPU", &m, msg);
> -		if (worst == MCE_AR_SEVERITY) {
> -			recover_paddr = m.addr;
> -			if (!(m.mcgstatus & MCG_STATUS_RIPV))
> -				flags |= MF_MUST_KILL;
> -		} else if (kill_it) {
> -			force_sig(SIGBUS, current);
> -		}
> -	}
> +	if (cfg->tolerant == 3)

Btw, I don't see where we limit the input values for that tolerant
setting, i.e., user could easily enter something > 3. I think we should
add a check in a separate patch to not allow anything except [0-3].

> +		kill_it = 0;
> +	else if (no_way_out)
> +		mce_panic("Fatal machine check on current CPU", &m, msg);
>
>  	if (worst > 0)
>  		mce_report_event(regs);
> @@ -1146,25 +1150,22 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  out:
>  	sync_core();
>
> -	if (recover_paddr == ~0ull)
> -		goto done;
> +	/* Fault was in user mode and we need to take some action */
> +	if ((m.cs & 3) == 3 && (worst == MCE_AR_SEVERITY || kill_it)) {
> +		ist_begin_non_atomic(regs);
> +		local_irq_enable();
>
> -	pr_err("Uncorrected hardware memory error in user-access at %llx",
> -	       recover_paddr);
> -	/*
> -	 * We must call memory_failure() here even if the current process is
> -	 * doomed. We still need to mark the page as poisoned and alert any
> -	 * other users of the page.
> -	 */
> -	ist_begin_non_atomic(regs);
> -	local_irq_enable();
> -	if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) {
> -		pr_err("Memory error not recovered");
> -		force_sig(SIGBUS, current);
> +		if (kill_it || do_memory_failure(&m))
> +			force_sig(SIGBUS, current);
> +		local_irq_disable();
> +		ist_end_non_atomic();
>  	}
> -	local_irq_disable();
> -	ist_end_non_atomic();
> -done:
> +
> +	/* Fault was in recoverable area of the kernel */
> +	if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY)
> +		if (!fixup_mcexception(regs, m.addr))
> +			mce_panic("Failed kernel mode recovery", &m, NULL);
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Does that always imply a failed kernel mode recovery? I don't see

	(m.cs == 0 and MCE_AR_SEVERITY)

MCEs always meaning that a recovery should be attempted there. I think
this should simply say

	mce_panic("Fatal machine check on current CPU", &m, msg);

Also, how about taking out that worst and kill_it check. It is a bit
more readable this way IMO:

---
out:
	sync_core();

	if (worst < MCE_AR_SEVERITY && !kill_it)
		goto out_ist;

	/* Fault was in user mode and we need to take some action */
	if ((m.cs & 3) == 3) {
		ist_begin_non_atomic(regs);
		local_irq_enable();

		if (kill_it || do_memory_failure(&m))
			force_sig(SIGBUS, current);

		local_irq_disable();
		ist_end_non_atomic();
	} else {
		if (!fixup_mcexception(regs, m.addr))
			mce_panic("Fatal machine check on current CPU", &m, NULL);
	}

out_ist:
	ist_exit(regs);
}
EXPORT_SYMBOL_GPL(do_machine_check);
---

Hmm...

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f53.google.com (mail-wm0-f53.google.com [74.125.82.53]) by kanga.kvack.org (Postfix) with ESMTP id D005E6B0254 for ; Tue, 15 Dec 2015 08:11:53 -0500 (EST) Received: by mail-wm0-f53.google.com with SMTP id n186so25016611wmn.0 for ; Tue, 15 Dec 2015 05:11:53 -0800 (PST) Received: from mail.skyhub.de (mail.skyhub.de. [2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id wx2si1833235wjc.78.2015.12.15.05.11.41 for ; Tue, 15 Dec 2015 05:11:42 -0800 (PST) Date: Tue, 15 Dec 2015 14:11:35 +0100 From: Borislav Petkov Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151215131135.GE25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: Tony Luck Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org On Thu, Dec 10, 2015 at 04:21:50PM -0800, Tony Luck wrote: > Using __copy_user_nocache() as inspiration create a memory copy > routine for use by kernel code with annotations to allow for > recovery from machine checks. > > Notes: > 1) Unlike the original we make no attempt to copy all the bytes > up to the faulting address. The original achieves that by > re-executing the failing part as a byte-by-byte copy, > which will take another page fault. We don't want to have > a second machine check! > 2) Likewise the return value for the original indicates exactly > how many bytes were not copied. Instead we provide the physical > address of the fault (thanks to help from do_machine_check() > 3) Provide helpful macros to decode the return value. 
> > Signed-off-by: Tony Luck > --- > arch/x86/include/asm/uaccess_64.h | 5 +++ > arch/x86/kernel/x8664_ksyms_64.c | 2 + > arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++ > 3 files changed, 98 insertions(+) ... > + * mcsafe_memcpy - Uncached memory copy with machine check exception handling > + * Note that we only catch machine checks when reading the source addresses. > + * Writes to target are posted and don't generate machine checks. > + * This will force destination/source out of cache for more performance. ... and the non-temporal version is the optimal one even though we're defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel CPUs...? Btw, it should be also inside an ifdef if we're going to ifdef CONFIG_MCE_KERNEL_RECOVERY everywhere else. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f170.google.com (mail-qk0-f170.google.com [209.85.220.170]) by kanga.kvack.org (Postfix) with ESMTP id A27076B0253 for ; Tue, 15 Dec 2015 12:45:06 -0500 (EST) Received: by mail-qk0-f170.google.com with SMTP id p187so24927979qkd.1 for ; Tue, 15 Dec 2015 09:45:06 -0800 (PST) Received: from mail-qk0-x22e.google.com (mail-qk0-x22e.google.com. 
[2607:f8b0:400d:c09::22e]) by mx.google.com with ESMTPS id x4si2079663qkx.21.2015.12.15.09.45.04 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Dec 2015 09:45:04 -0800 (PST) Received: by mail-qk0-x22e.google.com with SMTP id p187so24926605qkd.1 for ; Tue, 15 Dec 2015 09:45:04 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20151215131135.GE25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> Date: Tue, 15 Dec 2015 09:45:04 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov Cc: Tony Luck , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML On Tue, Dec 15, 2015 at 5:11 AM, Borislav Petkov wrote: > On Thu, Dec 10, 2015 at 04:21:50PM -0800, Tony Luck wrote: >> Using __copy_user_nocache() as inspiration create a memory copy >> routine for use by kernel code with annotations to allow for >> recovery from machine checks. >> >> Notes: >> 1) Unlike the original we make no attempt to copy all the bytes >> up to the faulting address. The original achieves that by >> re-executing the failing part as a byte-by-byte copy, >> which will take another page fault. We don't want to have >> a second machine check! >> 2) Likewise the return value for the original indicates exactly >> how many bytes were not copied. Instead we provide the physical >> address of the fault (thanks to help from do_machine_check() >> 3) Provide helpful macros to decode the return value. >> >> Signed-off-by: Tony Luck >> --- >> arch/x86/include/asm/uaccess_64.h | 5 +++ >> arch/x86/kernel/x8664_ksyms_64.c | 2 + >> arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++ >> 3 files changed, 98 insertions(+) > > ... 
> >> + * mcsafe_memcpy - Uncached memory copy with machine check exception handling >> + * Note that we only catch machine checks when reading the source addresses. >> + * Writes to target are posted and don't generate machine checks. >> + * This will force destination/source out of cache for more performance. > > ... and the non-temporal version is the optimal one even though we're > defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel > CPUs...? At least the pmem driver use case does not want caching of the source-buffer since that is the raw "disk" media. I.e. in pmem_do_bvec() we'd use this to implement memcpy_from_pmem(). However, caching the destination-buffer may prove beneficial since that data is likely to be consumed immediately by the thread that submitted the i/o. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f172.google.com (mail-pf0-f172.google.com [209.85.192.172]) by kanga.kvack.org (Postfix) with ESMTP id 6A9176B0254 for ; Tue, 15 Dec 2015 12:53:34 -0500 (EST) Received: by mail-pf0-f172.google.com with SMTP id n128so8429769pfn.0 for ; Tue, 15 Dec 2015 09:53:34 -0800 (PST) Received: from mga14.intel.com (mga14.intel.com. 
[192.55.52.115]) by mx.google.com with ESMTP id 69si3089209pfc.197.2015.12.15.09.53.33 for ; Tue, 15 Dec 2015 09:53:33 -0800 (PST) From: "Luck, Tony" Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Date: Tue, 15 Dec 2015 17:53:31 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> In-Reply-To: Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: "Williams, Dan J" , Borislav Petkov Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML

>> ... and the non-temporal version is the optimal one even though we're
>> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel
>> CPUs...?

My current generation cpu has a bit of an issue with recovering from a
machine check in a "rep mov" ... so I'm working with a version of memcpy
that unrolls into individual mov instructions for now.

> At least the pmem driver use case does not want caching of the
> source-buffer since that is the raw "disk" media.  I.e. in
> pmem_do_bvec() we'd use this to implement memcpy_from_pmem().
> However, caching the destination-buffer may prove beneficial since
> that data is likely to be consumed immediately by the thread that
> submitted the i/o.

I can drop the "nti" from the destination moves.  Does "nti" work
on the load from source address side to avoid cache allocation?

On another topic raised by Boris ... is there some CONFIG_PMEM*
that I should use as a dependency to enable all this?

-Tony

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f51.google.com (mail-wm0-f51.google.com [74.125.82.51]) by kanga.kvack.org (Postfix) with ESMTP id 5AE7A6B025D for ; Tue, 15 Dec 2015 13:21:09 -0500 (EST) Received: by mail-wm0-f51.google.com with SMTP id l126so6246962wml.0 for ; Tue, 15 Dec 2015 10:21:09 -0800 (PST) Received: from mail.skyhub.de (mail.skyhub.de. [2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id fq10si3557401wjc.228.2015.12.15.10.21.07 for ; Tue, 15 Dec 2015 10:21:08 -0800 (PST) Date: Tue, 15 Dec 2015 19:21:00 +0100 From: Borislav Petkov Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151215182059.GH25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> Sender: owner-linux-mm@kvack.org List-ID: To: "Luck, Tony" Cc: "Williams, Dan J" , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML On Tue, Dec 15, 2015 at 05:53:31PM +0000, Luck, Tony wrote: > My current generation cpu has a bit of an issue with recovering from a > machine check in a "rep mov" ... so I'm working with a version of memcpy > that unrolls into individual mov instructions for now. Ah. > I can drop the "nti" from the destination moves.
Does "nti" work > on the load from source address side to avoid cache allocation? I don't think so: +1: movq (%rsi),%r8 +2: movq 1*8(%rsi),%r9 +3: movq 2*8(%rsi),%r10 +4: movq 3*8(%rsi),%r11 ... You need to load the data into registers first because MOVNTI needs them there as it does reg -> mem movement. That first load from memory into registers with a normal MOV will pull the data into the cache. Perhaps the first thing to try would be to see what slowdown normal MOVs bring and if not really noticeable, use those instead. > On another topic raised by Boris ... is there some CONFIG_PMEM* > that I should use as a dependency to enable all this? I found CONFIG_LIBNVDIMM only today: drivers/nvdimm/Kconfig:1:menuconfig LIBNVDIMM drivers/nvdimm/Kconfig:2: tristate "NVDIMM (Non-Volatile Memory Device) Support" -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f170.google.com (mail-qk0-f170.google.com [209.85.220.170]) by kanga.kvack.org (Postfix) with ESMTP id 229BB6B025F for ; Tue, 15 Dec 2015 13:27:32 -0500 (EST) Received: by mail-qk0-f170.google.com with SMTP id t125so26816771qkh.3 for ; Tue, 15 Dec 2015 10:27:32 -0800 (PST) Received: from mail-qk0-x234.google.com (mail-qk0-x234.google.com. 
[2607:f8b0:400d:c09::234]) by mx.google.com with ESMTPS id t184si2244689qht.9.2015.12.15.10.27.31 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Dec 2015 10:27:31 -0800 (PST) Received: by mail-qk0-x234.google.com with SMTP id k189so27001850qkc.0 for ; Tue, 15 Dec 2015 10:27:31 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> Date: Tue, 15 Dec 2015 10:27:31 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: "Luck, Tony" Cc: Borislav Petkov , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML On Tue, Dec 15, 2015 at 9:53 AM, Luck, Tony wrote: >>> ... and the non-temporal version is the optimal one even though we're >>> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel >>> CPUs...? > > My current generation cpu has a bit of an issue with recovering from a > machine check in a "rep mov" ... so I'm working with a version of memcpy > that unrolls into individual mov instructions for now. > >> At least the pmem driver use case does not want caching of the >> source-buffer since that is the raw "disk" media. I.e. in >> pmem_do_bvec() we'd use this to implement memcpy_from_pmem(). >> However, caching the destination-buffer may prove beneficial since >> that data is likely to be consumed immediately by the thread that >> submitted the i/o. > > I can drop the "nti" from the destination moves. Does "nti" work > on the load from source address side to avoid cache allocation? 
My mistake, I don't think we have an uncached load capability, only store. > On another topic raised by Boris ... is there some CONFIG_PMEM* > that I should use as a dependency to enable all this? I'd rather make this a "select ARCH_MCSAFE_MEMCPY". Since it's not a hard dependency and the details will be hidden behind memcpy_from_pmem(). Specifically, the details will be handled by a new arch_memcpy_from_pmem() in arch/x86/include/asm/pmem.h to supplement the existing arch_memcpy_to_pmem(). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f50.google.com (mail-qg0-f50.google.com [209.85.192.50]) by kanga.kvack.org (Postfix) with ESMTP id 4AFBF6B0253 for ; Tue, 15 Dec 2015 13:35:50 -0500 (EST) Received: by mail-qg0-f50.google.com with SMTP id v16so14787701qge.0 for ; Tue, 15 Dec 2015 10:35:50 -0800 (PST) Received: from mail-qk0-x234.google.com (mail-qk0-x234.google.com.
[2607:f8b0:400d:c09::234]) by mx.google.com with ESMTPS id c13si2242298qkb.79.2015.12.15.10.35.49 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Dec 2015 10:35:49 -0800 (PST) Received: by mail-qk0-x234.google.com with SMTP id u65so8236890qkh.2 for ; Tue, 15 Dec 2015 10:35:49 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> Date: Tue, 15 Dec 2015 10:35:49 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: "Luck, Tony" Cc: Borislav Petkov , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML On Tue, Dec 15, 2015 at 10:27 AM, Dan Williams wrote: > On Tue, Dec 15, 2015 at 9:53 AM, Luck, Tony wrote: >>>> ... and the non-temporal version is the optimal one even though we're >>>> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel >>>> CPUs...? >> >> My current generation cpu has a bit of an issue with recovering from a >> machine check in a "rep mov" ... so I'm working with a version of memcpy >> that unrolls into individual mov instructions for now. >> >>> At least the pmem driver use case does not want caching of the >>> source-buffer since that is the raw "disk" media. I.e. in >>> pmem_do_bvec() we'd use this to implement memcpy_from_pmem(). >>> However, caching the destination-buffer may prove beneficial since >>> that data is likely to be consumed immediately by the thread that >>> submitted the i/o. >> >> I can drop the "nti" from the destination moves. Does "nti" work >> on the load from source address side to avoid cache allocation? 
> > My mistake, I don't think we have an uncached load capability, only store. Correction we have MOVNTDQA, but that requires saving the fpu state and marking the memory as WC, i.e. probably not worth it. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f47.google.com (mail-wm0-f47.google.com [74.125.82.47]) by kanga.kvack.org (Postfix) with ESMTP id 8B6846B0253 for ; Tue, 15 Dec 2015 13:39:32 -0500 (EST) Received: by mail-wm0-f47.google.com with SMTP id n186so178196604wmn.1 for ; Tue, 15 Dec 2015 10:39:32 -0800 (PST) Received: from mail.skyhub.de (mail.skyhub.de. [2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id k185si4923163wmf.19.2015.12.15.10.39.31 for ; Tue, 15 Dec 2015 10:39:31 -0800 (PST) Date: Tue, 15 Dec 2015 19:39:24 +0100 From: Borislav Petkov Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151215183924.GJ25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Dan Williams Cc: "Luck, Tony" , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML On Tue, Dec 15, 2015 at 10:35:49AM -0800, Dan Williams wrote: > Correction we have MOVNTDQA, but that requires saving the fpu state > and marking the memory as WC, i.e. probably not worth it. Not really. Last time I tried an SSE3 memcpy in the kernel like glibc does, it wasn't worth it. The enhanced REP; MOVSB is hands down faster. -- Regards/Gruss, Boris.
ECO tip #101: Trim your mails when you reply. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f51.google.com (mail-oi0-f51.google.com [209.85.218.51]) by kanga.kvack.org (Postfix) with ESMTP id 2E6696B0253 for ; Tue, 15 Dec 2015 14:20:54 -0500 (EST) Received: by mail-oi0-f51.google.com with SMTP id y66so11119844oig.0 for ; Tue, 15 Dec 2015 11:20:54 -0800 (PST) Received: from g9t5009.houston.hp.com (g9t5009.houston.hp.com. [15.240.92.67]) by mx.google.com with ESMTPS id m8si1303345obq.22.2015.12.15.11.20.53 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Dec 2015 11:20:53 -0800 (PST) From: "Elliott, Robert (Persistent Memory)" Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Date: Tue, 15 Dec 2015 19:19:58 +0000 Message-ID: <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> <20151215183924.GJ25973@pd.tnic> In-Reply-To: <20151215183924.GJ25973@pd.tnic> Content-Language: en-US Content-Type: text/plain; charset="Windows-1252" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov , Dan Williams Cc: "Luck, Tony" , linux-nvdimm , X86 ML , "linux-kernel@vger.kernel.org" , Linux MM , Andy Lutomirski , Andrew Morton , Ingo Molnar > -----Original Message----- > From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf > Of Borislav Petkov > Sent: Tuesday, December 15, 2015 12:39 PM > To: Dan Williams > Cc: Luck, Tony ; linux-nvdimm nvdimm@ml01.01.org>; X86 ML ; linux- > kernel@vger.kernel.org; 
Linux MM ; Andy Lutomirski
> ; Andrew Morton ; Ingo Molnar
>
> Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to
> recover from machine checks
>
> On Tue, Dec 15, 2015 at 10:35:49AM -0800, Dan Williams wrote:
> > Correction we have MOVNTDQA, but that requires saving the fpu state
> > and marking the memory as WC, i.e. probably not worth it.
>
> Not really. Last time I tried an SSE3 memcpy in the kernel like glibc
> does, it wasn't worth it. The enhanced REP; MOVSB is hands down faster.

Reading from NVDIMM, rep movsb is efficient, but it fills the CPU caches with the NVDIMM addresses. For large data moves (not uncommon for storage) this will crowd out more important cacheable data.

For normal block device reads made through the pmem block device driver, this CPU cache consumption is wasteful, since it is unlikely the application will ask pmem to read the same addresses anytime soon. Due to the historic long latency of storage devices, applications don't re-read from storage again; they save the results.

So, the streaming-load instructions are beneficial:
* movntdqa (16-byte xmm registers)
* vmovntdqa (32-byte ymm registers)
* vmovntdqa (64-byte zmm registers)

Dan Williams wrote:
> Correction we have MOVNTDQA, but that requires
> saving the fpu state and marking the memory as WC
> i.e. probably not worth it.

Although the WC memory type is described in the SDM in the most detail:

"An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type. ... may optimize cache reads generated by (V)MOVNTDQA on WB memory type to reduce cache evictions."

For applications doing loads from mmap() DAX memory, the CPU cache usage could be worthwhile, because applications expect mmap() regions to consist of traditional writeback-cached memory and might do lots of loads/stores.
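As a rough userspace illustration of the streaming-load idea above (this is not code from the thread or from the patch series): the sketch below copies a buffer with MOVNTDQA through the SSE4.1 intrinsic _mm_stream_load_si128() when the compiler targets SSE4.1, and falls back to plain memcpy() otherwise. The name stream_load_copy() is hypothetical and the source buffer is assumed to be 16-byte aligned. On ordinary write-back memory the CPU may simply ignore the non-temporal hint, so this shows the mechanics only, not a guaranteed cache benefit. Kernel code could not touch the SSE registers like this without saving FPU state first, which is the cost Dan mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#ifdef __SSE4_1__
#include <smmintrin.h>	/* _mm_stream_load_si128() -> MOVNTDQA */
#endif

/*
 * Hypothetical helper (illustration only): copy len bytes from src to dst.
 * Assumption: src is 16-byte aligned, as MOVNTDQA requires.
 * Full 16-byte blocks use streaming loads when SSE4.1 is enabled at
 * compile time; any remaining tail (or everything, without SSE4.1) is
 * copied with memcpy().
 */
static void stream_load_copy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t i = 0;

#ifdef __SSE4_1__
	for (; i + 16 <= len; i += 16) {
		/* Non-temporal load of one aligned 16-byte block. */
		__m128i v = _mm_stream_load_si128((__m128i *)(s + i));

		/* Ordinary (cached) unaligned store to the destination. */
		_mm_storeu_si128((__m128i *)(d + i), v);
	}
#endif
	/* Copy whatever the streaming loop did not cover. */
	memcpy(d + i, s + i, len - i);
}
```

The stores are deliberately left as normal cached stores, matching Dan's point that the destination buffer is likely to be consumed right away by the thread that submitted the I/O.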
Writing to the NVDIMM requires either: * non-temporal stores; or * normal stores + cache flushes + fences movnti is OK for small transfers, but these are better for bulk moves: * movntdq (16-byte xmm registers) * vmovntdq (32-byte ymm registers) * vmovntdq (64-byte zmm registers) --- Robert Elliott, HPE Persistent Memory -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f42.google.com (mail-wm0-f42.google.com [74.125.82.42]) by kanga.kvack.org (Postfix) with ESMTP id 988AF6B0254 for ; Tue, 15 Dec 2015 14:28:46 -0500 (EST) Received: by mail-wm0-f42.google.com with SMTP id l126so8655228wml.0 for ; Tue, 15 Dec 2015 11:28:46 -0800 (PST) Received: from mail.skyhub.de (mail.skyhub.de. [2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id js6si3980438wjb.211.2015.12.15.11.28.45 for ; Tue, 15 Dec 2015 11:28:45 -0800 (PST) Date: Tue, 15 Dec 2015 20:28:37 +0100 From: Borislav Petkov Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151215192837.GL25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> <20151215183924.GJ25973@pd.tnic> <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org List-ID: To: "Elliott, Robert (Persistent Memory)" Cc: Dan Williams , "Luck, Tony" , linux-nvdimm , X86 ML , "linux-kernel@vger.kernel.org" , Linux MM , Andy Lutomirski , Andrew Morton , Ingo Molnar On Tue, Dec 15, 2015 at 07:19:58PM +0000, Elliott, 
Robert (Persistent Memory) wrote: ... > Due to the historic long latency of storage devices, > applications don't re-read from storage again; they > save the results. > So, the streaming-load instructions are beneficial: That's the theory... Do you also have some actual performance numbers where non-temporal operations are better than the REP; MOVSB and *actually* show improvements? And no microbenchmarks please. Thanks. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-oi0-f48.google.com (mail-oi0-f48.google.com [209.85.218.48]) by kanga.kvack.org (Postfix) with ESMTP id 701926B0253 for ; Tue, 15 Dec 2015 15:26:34 -0500 (EST) Received: by mail-oi0-f48.google.com with SMTP id i186so12321424oia.2 for ; Tue, 15 Dec 2015 12:26:34 -0800 (PST) Received: from g9t5008.houston.hp.com (g9t5008.houston.hp.com.
[15.240.92.66]) by mx.google.com with ESMTPS id i9si3231311oia.94.2015.12.15.12.26.33 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 15 Dec 2015 12:26:33 -0800 (PST) From: "Elliott, Robert (Persistent Memory)" Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Date: Tue, 15 Dec 2015 20:25:37 +0000 Message-ID: <94D0CD8314A33A4D9D801C0FE68B40295BE9F3D5@G4W3202.americas.hpqcorp.net> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> <20151215183924.GJ25973@pd.tnic> <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> <20151215192837.GL25973@pd.tnic> In-Reply-To: <20151215192837.GL25973@pd.tnic> Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov Cc: Dan Williams , "Luck, Tony" , linux-nvdimm , X86 ML , "linux-kernel@vger.kernel.org" , Linux MM , Andy Lutomirski , Andrew Morton , Ingo Molnar

---
Robert Elliott, HPE Persistent Memory


> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Tuesday, December 15, 2015 1:29 PM
> To: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
> Cc: Dan Williams <dan.j.williams@intel.com>; Luck, Tony
> <tony.luck@intel.com>; linux-nvdimm <linux-nvdimm@ml01.01.org>; X86 ML
> <x86@kernel.org>; linux-kernel@vger.kernel.org; Linux MM <linux-
> mm@kvack.org>; Andy Lutomirski <luto@kernel.org>; Andrew Morton
> <akpm@linux-foundation.org>; Ingo Molnar <mingo@kernel.org>
> Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to
> recover from machine checks
>
> On Tue, Dec 15, 2015 at 07:19:58PM +0000, Elliott, Robert (Persistent
> Memory) wrote:
>
> ...
>
> > Due to the historic long latency of storage devices,
> > applications don't re-read from storage again; they
> > save the results.
> > So, the streaming-load instructions are beneficial:
>
> That's the theory...
>
> Do you also have some actual performance numbers where non-temporal
> operations are better than the REP; MOVSB and *actually* show
> improvements? And no microbenchmarks please.
>
> Thanks.
>

This isn't exactly what you're looking for, but here is
an example of fio doing reads from pmem devices (reading
from NVDIMMs, writing to DIMMs) with various transfer
sizes.

At 256 KiB, all the main memory buffers fit in the CPU
caches, so no write traffic appears on DDR (just the reads
from the NVDIMMs).  At 1 MiB, the data spills out of the
caches, and writes to the DIMMs end up on DDR.

Although DDR is busier, fio gets a lot less work done:
* 256 KiB: 90 GiB/s by fio
*   1 MiB: 49 GiB/s by fio

We could try modifying pmem to use its own non-temporal
memcpy functions (I've posted experimental patches
before that did this) to see if that transition point
shifts.  We can also watch the CPU cache statistics
while running.

Here are statistics from Intel's pcm-memory.x
(pardon the wide formatting):

256 KiB
=======
pmem0: (groupid=0, jobs=40): err= 0: pid=20867: Tue Nov 24 18:20:08 2015
  read : io=5219.1GB, bw=89079MB/s, iops=356314, runt= 60006msec
  cpu          : usr=1.74%, sys=96.16%, ctx=49576, majf=0, minf=21997

Run status group 0 (all jobs):
   READ: io=5219.1GB, aggrb=89079MB/s, minb=89079MB/s, maxb=89079MB/s, mint=60006msec, maxt=60006msec

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s): 11778.11 --||-- Mem Ch  0: Reads (MB/s): 11743.99 --|
|--            Writes(MB/s):    51.83 --||--            Writes(MB/s):    43.25 --|
|-- Mem Ch  1: Reads (MB/s): 11779.90 --||-- Mem Ch  1: Reads (MB/s): 11736.06 --|
|--            Writes(MB/s):    48.73 --||--            Writes(MB/s):    37.86 --|
|-- Mem Ch  4: Reads (MB/s): 11784.79 --||-- Mem Ch  4: Reads (MB/s): 11746.94 --|
|--            Writes(MB/s):    52.90 --||--            Writes(MB/s):    43.73 --|
|-- Mem Ch  5: Reads (MB/s): 11778.48 --||-- Mem Ch  5: Reads (MB/s): 11741.55 --|
|--            Writes(MB/s):    47.62 --||--            Writes(MB/s):    37.80 --|
|-- NODE 0 Mem Read (MB/s) : 47121.27 --||-- NODE 1 Mem Read (MB/s) : 46968.53 --|
|-- NODE 0 Mem Write(MB/s) :   201.08 --||-- NODE 1 Mem Write(MB/s) :   162.65 --|
|-- NODE 0 P. Write (T/s):     190927 --||-- NODE 1 P. Write (T/s):     182961 --|
|-- NODE 0 Memory (MB/s):    47322.36 --||-- NODE 1 Memory (MB/s):    47131.17 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):  94089.80                  --|
|--                  System Write Throughput(MB/s):    363.73                  --|
|--                 System Memory Throughput(MB/s):  94453.52                  --|
|---------------------------------------||---------------------------------------|

1 MiB
=====
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  7227.83 --||-- Mem Ch  0: Reads (MB/s):  7047.45 --|
|--            Writes(MB/s):  5894.47 --||--            Writes(MB/s):  6010.66 --|
|-- Mem Ch  1: Reads (MB/s):  7229.32 --||-- Mem Ch  1: Reads (MB/s):  7041.79 --|
|--            Writes(MB/s):  5891.38 --||--            Writes(MB/s):  6003.19 --|
|-- Mem Ch  4: Reads (MB/s):  7230.70 --||-- Mem Ch  4: Reads (MB/s):  7052.44 --|
|--            Writes(MB/s):  5888.63 --||--            Writes(MB/s):  6012.49 --|
|-- Mem Ch  5: Reads (MB/s):  7229.16 --||-- Mem Ch  5: Reads (MB/s):  7047.19 --|
|--            Writes(MB/s):  5882.45 --||--            Writes(MB/s):  6008.11 --|
|-- NODE 0 Mem Read (MB/s) : 28917.01 --||-- NODE 1 Mem Read (MB/s) : 28188.87 --|
|-- NODE 0 Mem Write(MB/s) : 23556.93 --||-- NODE 1 Mem Write(MB/s) : 24034.46 --|
|-- NODE 0 P. Write (T/s):     238713 --||-- NODE 1 P. Write (T/s):     228040 --|
|-- NODE 0 Memory (MB/s):    52473.94 --||-- NODE 1 Memory (MB/s):    52223.33 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):  57105.87                  --|
|--                  System Write Throughput(MB/s):  47591.39                  --|
|--                 System Memory Throughput(MB/s): 104697.27                  --|
|---------------------------------------||---------------------------------------|

-- To unsubscribe, send a message
with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f177.google.com (mail-pf0-f177.google.com [209.85.192.177]) by kanga.kvack.org (Postfix) with ESMTP id E506B6B025B for ; Tue, 15 Dec 2015 18:46:05 -0500 (EST) Received: by mail-pf0-f177.google.com with SMTP id e66so1858442pfe.0 for ; Tue, 15 Dec 2015 15:46:05 -0800 (PST) Received: from mga03.intel.com (mga03.intel.com. [134.134.136.65]) by mx.google.com with ESMTP id 86si806614pfs.88.2015.12.15.15.46.05 for ; Tue, 15 Dec 2015 15:46:05 -0800 (PST) From: "Luck, Tony" Subject: RE: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas Date: Tue, 15 Dec 2015 23:46:03 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F85DBE@ORSMSX114.amr.corp.intel.com> References: <20151215114314.GD25973@pd.tnic> In-Reply-To: <20151215114314.GD25973@pd.tnic> Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: Borislav Petkov Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "linux-nvdimm@ml01.01.org" , "x86@kernel.org"

>> +	/* Fault was in recoverable area of the kernel */
>> +	if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY)
>> +		if (!fixup_mcexception(regs, m.addr))
>> +			mce_panic("Failed kernel mode recovery", &m, NULL);
>				   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Does that always imply a failed kernel mode recovery? I don't see
>
>	(m.cs == 0 and MCE_AR_SEVERITY)
>
> MCEs always meaning that a recovery should be attempted there. I think
> this should simply say
>
>	mce_panic("Fatal machine check on current CPU", &m, msg);

I don't think this can ever happen. If we were in kernel mode and decided
that the severity was AR_SEVERITY ... then search_mcexception_table()
found an entry for the IP where the machine check happened.

The only way for fixup_exception to fail is if search_mcexception_table()
now suddenly doesn't find the entry it found earlier.

But if this "can't happen" thing actually does happen ... I'd like the panic
message to be different from other mce_panic() so you'll know to blame
me.

Applied all the other suggestions.

-Tony

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf0-f44.google.com (mail-lf0-f44.google.com [209.85.215.44]) by kanga.kvack.org (Postfix) with ESMTP id 10A1D6B0007 for ; Mon, 21 Dec 2015 12:33:28 -0500 (EST) Received: by mail-lf0-f44.google.com with SMTP id l133so114891756lfd.2 for ; Mon, 21 Dec 2015 09:33:28 -0800 (PST) Received: from mail.skyhub.de (mail.skyhub.de. 
[2a01:4f8:120:8448::d00d]) by mx.google.com with ESMTP id vq10si18887934lbb.180.2015.12.21.09.33.26 for ; Mon, 21 Dec 2015 09:33:26 -0800 (PST) Date: Mon, 21 Dec 2015 18:33:10 +0100 From: Borislav Petkov Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151221173310.GD21582@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> <20151215183924.GJ25973@pd.tnic> <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> <20151215192837.GL25973@pd.tnic> <94D0CD8314A33A4D9D801C0FE68B40295BE9F3D5@G4W3202.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295BE9F3D5@G4W3202.americas.hpqcorp.net> Sender: owner-linux-mm@kvack.org List-ID: To: "Elliott, Robert (Persistent Memory)" Cc: Dan Williams , "Luck, Tony" , linux-nvdimm , X86 ML , "linux-kernel@vger.kernel.org" , Linux MM , Andy Lutomirski , Andrew Morton , Ingo Molnar On Tue, Dec 15, 2015 at 08:25:37PM +0000, Elliott, Robert (Persistent Memory) wrote: > This isn't exactly what you're looking for, but here is > an example of fio doing reads from pmem devices (reading > from NVDIMMs, writing to DIMMs) with various transfer > sizes. ... and "fio" is? > At 256 KiB, all the main memory buffers fit in the CPU > caches, so no write traffic appears on DDR (just the reads > from the NVDIMMs). At 1 MiB, the data spills out of the > caches, and writes to the DIMMs end up on DDR. > > Although DDR is busier, fio gets a lot less work done: > * 256 KiB: 90 GiB/s by fio > * 1 MiB: 49 GiB/s by fio Yeah, I don't think that answers the question I had: whether REP; MOVSB is faster/better than using non-temporal stores. But you say that already above. 
Also, if you do non-temporal stores then you're expected to have *more* memory controller and DIMM traffic as you're pushing everything out through the WCC. What would need to be measured instead is, IMO, two things: * compare NTI vs REP; MOVSB data movement to see the differences in performance aspects * run a benchmark (no idea which one) which would measure the positive impact of the NTI versions which do not pollute the cache and thus do not hurt other workloads' working set being pushed out of the cache. Also, we don't really know (at least I don't) what REP; MOVSB improvements hide behind those enhanced fast string optimizations. It could be that microcode is doing some aggregation into cachelines and doing much bigger writes which could compensate for the cache pollution. Questions over questions... > We could try modifying pmem to use its own non-temporal > memcpy functions (I've posted experimental patches > before that did this) to see if that transition point > shifts. We can also watch the CPU cache statistics > while running. 
> > Here are statistics from Intel's pcm-memory.x > (pardon the wide formatting): > > 256 KiB > ======= > pmem0: (groupid=0, jobs=40): err= 0: pid=20867: Tue Nov 24 18:20:08 2015 > read : io=5219.1GB, bw=89079MB/s, iops=356314, runt= 60006msec > cpu : usr=1.74%, sys=96.16%, ctx=49576, majf=0, minf=21997 > > Run status group 0 (all jobs): > READ: io=5219.1GB, aggrb=89079MB/s, minb=89079MB/s, maxb=89079MB/s, mint=60006msec, maxt=60006msec > > |---------------------------------------||---------------------------------------| > |-- Socket 0 --||-- Socket 1 --| > |---------------------------------------||---------------------------------------| > |-- Memory Channel Monitoring --||-- Memory Channel Monitoring --| > |---------------------------------------||---------------------------------------| > |-- Mem Ch 0: Reads (MB/s): 11778.11 --||-- Mem Ch 0: Reads (MB/s): 11743.99 --| > |-- Writes(MB/s): 51.83 --||-- Writes(MB/s): 43.25 --| > |-- Mem Ch 1: Reads (MB/s): 11779.90 --||-- Mem Ch 1: Reads (MB/s): 11736.06 --| > |-- Writes(MB/s): 48.73 --||-- Writes(MB/s): 37.86 --| > |-- Mem Ch 4: Reads (MB/s): 11784.79 --||-- Mem Ch 4: Reads (MB/s): 11746.94 --| > |-- Writes(MB/s): 52.90 --||-- Writes(MB/s): 43.73 --| > |-- Mem Ch 5: Reads (MB/s): 11778.48 --||-- Mem Ch 5: Reads (MB/s): 11741.55 --| > |-- Writes(MB/s): 47.62 --||-- Writes(MB/s): 37.80 --| > |-- NODE 0 Mem Read (MB/s) : 47121.27 --||-- NODE 1 Mem Read (MB/s) : 46968.53 --| > |-- NODE 0 Mem Write(MB/s) : 201.08 --||-- NODE 1 Mem Write(MB/s) : 162.65 --| > |-- NODE 0 P. Write (T/s): 190927 --||-- NODE 1 P. Write (T/s): 182961 --| What does T/s mean? 
> |-- NODE 0 Memory (MB/s): 47322.36 --||-- NODE 1 Memory (MB/s): 47131.17 --| > |---------------------------------------||---------------------------------------| > |---------------------------------------||---------------------------------------| > |-- System Read Throughput(MB/s): 94089.80 --| > |-- System Write Throughput(MB/s): 363.73 --| > |-- System Memory Throughput(MB/s): 94453.52 --| > |---------------------------------------||---------------------------------------| > > 1 MiB > ===== > |---------------------------------------||---------------------------------------| > |-- Socket 0 --||-- Socket 1 --| > |---------------------------------------||---------------------------------------| > |-- Memory Channel Monitoring --||-- Memory Channel Monitoring --| > |---------------------------------------||---------------------------------------| > |-- Mem Ch 0: Reads (MB/s): 7227.83 --||-- Mem Ch 0: Reads (MB/s): 7047.45 --| > |-- Writes(MB/s): 5894.47 --||-- Writes(MB/s): 6010.66 --| > |-- Mem Ch 1: Reads (MB/s): 7229.32 --||-- Mem Ch 1: Reads (MB/s): 7041.79 --| > |-- Writes(MB/s): 5891.38 --||-- Writes(MB/s): 6003.19 --| > |-- Mem Ch 4: Reads (MB/s): 7230.70 --||-- Mem Ch 4: Reads (MB/s): 7052.44 --| > |-- Writes(MB/s): 5888.63 --||-- Writes(MB/s): 6012.49 --| > |-- Mem Ch 5: Reads (MB/s): 7229.16 --||-- Mem Ch 5: Reads (MB/s): 7047.19 --| > |-- Writes(MB/s): 5882.45 --||-- Writes(MB/s): 6008.11 --| > |-- NODE 0 Mem Read (MB/s) : 28917.01 --||-- NODE 1 Mem Read (MB/s) : 28188.87 --| > |-- NODE 0 Mem Write(MB/s) : 23556.93 --||-- NODE 1 Mem Write(MB/s) : 24034.46 --| > |-- NODE 0 P. Write (T/s): 238713 --||-- NODE 1 P. 
Write (T/s): 228040 --| > |-- NODE 0 Memory (MB/s): 52473.94 --||-- NODE 1 Memory (MB/s): 52223.33 --| > |---------------------------------------||---------------------------------------| > |---------------------------------------||---------------------------------------| > |-- System Read Throughput(MB/s): 57105.87 --| > |-- System Write Throughput(MB/s): 47591.39 --| > |-- System Memory Throughput(MB/s): 104697.27 --| > |---------------------------------------||---------------------------------------| Looks to me like, because writes have increased, the read bandwidth has dropped too, which makes sense. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751990AbbLKTcS (ORCPT ); Fri, 11 Dec 2015 14:32:18 -0500 Received: from mga11.intel.com ([192.55.52.93]:63698 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751064AbbLKTcR (ORCPT ); Fri, 11 Dec 2015 14:32:17 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,414,1444719600"; d="scan'208";a="871837575" Message-Id: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> In-Reply-To: References: From: Tony Luck Date: Thu, 10 Dec 2015 13:58:04 -0800 Subject: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables To: Ingo Molnar Cc: Borislav Petkov , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Copy the existing page fault fixup mechanisms to create a new table to be used when fixing machine checks. 
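For readers unfamiliar with the page fault fixup mechanism being copied, the following is a minimal user-space sketch of a relative-offset fixup table and its lookup. It is an illustration only, not kernel code and not part of this patch: the names ex_entry, image, add_entry, search_table, lookup_fixup and build_demo_table are invented for the example, and a small byte array stands in for kernel text. The only detail taken from the patch is the entry layout, two 32-bit offsets that are each relative to their own location (".long (from) - ." and ".long (to) - .").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one table entry; not the kernel's definition. */
struct ex_entry {
	int32_t insn;	/* offset of the faulting instruction from &insn */
	int32_t fixup;	/* offset of the recovery code from &fixup */
};

/* Fake "image": a few text bytes followed by a two-entry table. */
static struct {
	unsigned char text[64];
	struct ex_entry table[2];
} image;

/* Offset of target as seen from field; must fit in 32 bits. */
static int32_t rel(const void *target, const void *field)
{
	intptr_t delta = (intptr_t)target - (intptr_t)field;

	assert(delta >= INT32_MIN && delta <= INT32_MAX);
	return (int32_t)delta;
}

static uintptr_t insn_addr(const struct ex_entry *e)
{
	return (uintptr_t)((intptr_t)&e->insn + e->insn);
}

static uintptr_t fixup_addr(const struct ex_entry *e)
{
	return (uintptr_t)((intptr_t)&e->fixup + e->fixup);
}

/* Records what the assembler macro would emit for (from, to). */
static void add_entry(struct ex_entry *e, const void *from, const void *to)
{
	e->insn = rel(from, &e->insn);
	e->fixup = rel(to, &e->fixup);
}

/* Binary search over n entries sorted by instruction address. */
static const struct ex_entry *search_table(const struct ex_entry *tbl,
					   size_t n, uintptr_t ip)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		uintptr_t addr = insn_addr(&tbl[mid]);

		if (addr == ip)
			return &tbl[mid];
		if (addr < ip)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

/* Two annotated "instructions" sharing one recovery address. */
void build_demo_table(void)
{
	add_entry(&image.table[0], &image.text[8], &image.text[48]);
	add_entry(&image.table[1], &image.text[20], &image.text[48]);
}

/* Returns the address to resume at, or 0 when ip has no fixup entry. */
uintptr_t lookup_fixup(uintptr_t ip)
{
	const struct ex_entry *e = search_table(image.table, 2, ip);

	return e ? fixup_addr(e) : 0;
}
```

A handler built on this idea would overwrite the saved instruction pointer with the value from lookup_fixup() when it is non-zero, and treat a zero result as "no recovery possible".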
Note: 1) At this time we only provide a macro to annotate assembly code 2) We assume all fixups will in code builtin to the kernel. 3) Only for x86_64 4) New code under CONFIG_MCE_KERNEL_RECOVERY Signed-off-by: Tony Luck --- arch/x86/Kconfig | 4 ++++ arch/x86/include/asm/asm.h | 10 ++++++++-- arch/x86/include/asm/uaccess.h | 8 ++++++++ arch/x86/mm/extable.c | 19 +++++++++++++++++++ include/asm-generic/vmlinux.lds.h | 6 ++++++ include/linux/module.h | 1 + kernel/extable.c | 20 ++++++++++++++++++++ 7 files changed, 66 insertions(+), 2 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 96d058a87100..db5c6e1d6e37 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1001,6 +1001,10 @@ config X86_MCE_INJECT If you don't know what a machine check is and you don't do kernel QA it is safe to say n. +config MCE_KERNEL_RECOVERY + depends on X86_MCE && X86_64 + def_bool y + config X86_THERMAL_VECTOR def_bool y depends on X86_MCE_INTEL diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h index 189679aba703..a5d483ac11fa 100644 --- a/arch/x86/include/asm/asm.h +++ b/arch/x86/include/asm/asm.h @@ -44,13 +44,19 @@ /* Exception table entry */ #ifdef __ASSEMBLY__ -# define _ASM_EXTABLE(from,to) \ - .pushsection "__ex_table","a" ; \ +# define __ASM_EXTABLE(from, to, table) \ + .pushsection table, "a" ; \ .balign 8 ; \ .long (from) - . ; \ .long (to) - . 
; \ .popsection +# define _ASM_EXTABLE(from, to) \ + __ASM_EXTABLE(from, to, "__ex_table") + +# define _ASM_MCEXTABLE(from, to) \ + __ASM_EXTABLE(from, to, "__mcex_table") + # define _ASM_EXTABLE_EX(from,to) \ .pushsection "__ex_table","a" ; \ .balign 8 ; \ diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index a8df874f3e88..7b02ca1991b4 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -111,6 +111,14 @@ struct exception_table_entry { #define ARCH_HAS_SEARCH_EXTABLE extern int fixup_exception(struct pt_regs *regs); +#ifdef CONFIG_MCE_KERNEL_RECOVERY +extern int fixup_mcexception(struct pt_regs *regs, u64 addr); +#else +static inline int fixup_mcexception(struct pt_regs *regs, u64 addr) +{ + return 0; +} +#endif extern int early_fixup_exception(unsigned long *ip); /* diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c index 903ec1e9c326..a461c4212758 100644 --- a/arch/x86/mm/extable.c +++ b/arch/x86/mm/extable.c @@ -49,6 +49,25 @@ int fixup_exception(struct pt_regs *regs) return 0; } +#ifdef CONFIG_MCE_KERNEL_RECOVERY +int fixup_mcexception(struct pt_regs *regs, u64 addr) +{ + const struct exception_table_entry *fixup; + unsigned long new_ip; + + fixup = search_mcexception_tables(regs->ip); + if (fixup) { + new_ip = ex_fixup_addr(fixup); + + regs->ip = new_ip; + regs->ax = BIT(63) | addr; + return 1; + } + + return 0; +} +#endif + /* Restricted version used during very early boot */ int __init early_fixup_exception(unsigned long *ip) { diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 1781e54ea6d3..21bb20d1172a 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -473,6 +473,12 @@ VMLINUX_SYMBOL(__start___ex_table) = .; \ *(__ex_table) \ VMLINUX_SYMBOL(__stop___ex_table) = .; \ + } \ + . 
= ALIGN(align); \ + __mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \ + VMLINUX_SYMBOL(__start___mcex_table) = .; \ + *(__mcex_table) \ + VMLINUX_SYMBOL(__stop___mcex_table) = .; \ } /* diff --git a/include/linux/module.h b/include/linux/module.h index 3a19c79918e0..ffecbfcc462c 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -270,6 +270,7 @@ extern const typeof(name) __mod_##type##__##name##_device_table \ /* Given an address, look for it in the exception tables */ const struct exception_table_entry *search_exception_tables(unsigned long add); +const struct exception_table_entry *search_mcexception_tables(unsigned long a); struct notifier_block; diff --git a/kernel/extable.c b/kernel/extable.c index e820ccee9846..7b224fbcb708 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -34,6 +34,10 @@ DEFINE_MUTEX(text_mutex); extern struct exception_table_entry __start___ex_table[]; extern struct exception_table_entry __stop___ex_table[]; +#ifdef CONFIG_MCE_KERNEL_RECOVERY +extern struct exception_table_entry __start___mcex_table[]; +extern struct exception_table_entry __stop___mcex_table[]; +#endif /* Cleared by build time tools if the table is already sorted. */ u32 __initdata __visible main_extable_sort_needed = 1; @@ -45,6 +49,10 @@ void __init sort_main_extable(void) pr_notice("Sorting __ex_table...\n"); sort_extable(__start___ex_table, __stop___ex_table); } +#ifdef CONFIG_MCE_KERNEL_RECOVERY + if (__stop___mcex_table > __start___mcex_table) + sort_extable(__start___mcex_table, __stop___mcex_table); +#endif } /* Given an address, look for it in the exception tables. */ @@ -58,6 +66,18 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) return e; } +#ifdef CONFIG_MCE_KERNEL_RECOVERY +/* Given an address, look for it in the machine check exception tables. 
*/ +const struct exception_table_entry *search_mcexception_tables( + unsigned long addr) +{ + const struct exception_table_entry *e; + + e = search_extable(__start___mcex_table, __stop___mcex_table-1, addr); + return e; +} +#endif + static inline int init_kernel_text(unsigned long addr) { if (addr >= (unsigned long)_sinittext && -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752935AbbLKTcv (ORCPT ); Fri, 11 Dec 2015 14:32:51 -0500 Received: from mga14.intel.com ([192.55.52.115]:53243 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752547AbbLKTct (ORCPT ); Fri, 11 Dec 2015 14:32:49 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,414,1444719600"; d="scan'208";a="859000632" Message-Id: In-Reply-To: References: From: Tony Luck Date: Thu, 10 Dec 2015 16:14:44 -0800 Subject: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas To: Ingo Molnar Cc: Borislav Petkov , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Extend the severity checking code to add a new context IN_KERNEL_RECOV which is used to indicate that the machine check was triggered by code in the kernel with a fixup entry. Add code to check for this situation and respond by altering the return IP to the fixup address and changing the regs->ax so that the recovery code knows the physical address of the error. Note that we also set bit 63 because 0x0 is a legal physical address. Major re-work to the tail code in do_machine_check() to make all this readable/maintainable. One functional change is that tolerant=3 no longer stops recovery actions. Revert to only skipping sending SIGBUS to the current process. 
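The bit 63 convention described above can be sketched in plain C as follows. This is a user-space illustration under stated assumptions, not the kernel code: MCHECK_FLAG, encode_mcheck, copy_had_mcheck and copy_mcheck_paddr are hypothetical names chosen for the example, and the sketch assumes physical addresses never use bit 63.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; same value as BIT(63) on a 64-bit kernel. */
#define MCHECK_FLAG (UINT64_C(1) << 63)

/*
 * Value the fixup path would place in the return register: the
 * physical address of the error with bit 63 set, so that an error
 * at physical address 0 differs from the "no error" value 0.
 */
uint64_t encode_mcheck(uint64_t paddr)
{
	assert((paddr & MCHECK_FLAG) == 0);	/* assumed never set in a paddr */
	return MCHECK_FLAG | paddr;
}

/* True when the returned value reports a machine check. */
bool copy_had_mcheck(uint64_t ret)
{
	return (ret & MCHECK_FLAG) != 0;
}

/* Physical address carried in a value that reports a machine check. */
uint64_t copy_mcheck_paddr(uint64_t ret)
{
	return ret & ~MCHECK_FLAG;
}
```

A caller of a recovering copy routine could test the result with copy_had_mcheck() and, if it is set, report an I/O error to its own caller after logging copy_mcheck_paddr().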
Signed-off-by: Tony Luck --- arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++++- arch/x86/kernel/cpu/mcheck/mce.c | 69 ++++++++++++++++--------------- 2 files changed, 55 insertions(+), 36 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c index 9c682c222071..ac7fbb0689fb 100644 --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -29,7 +30,7 @@ * panic situations) */ -enum context { IN_KERNEL = 1, IN_USER = 2 }; +enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 }; enum ser { SER_REQUIRED = 1, NO_SER = 2 }; enum exception { EXCP_CONTEXT = 1, NO_EXCP = 2 }; @@ -48,6 +49,7 @@ static struct severity { #define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c } #define KERNEL .context = IN_KERNEL #define USER .context = IN_USER +#define KERNEL_RECOV .context = IN_KERNEL_RECOV #define SER .ser = SER_REQUIRED #define NOSER .ser = NO_SER #define EXCP .excp = EXCP_CONTEXT @@ -87,6 +89,10 @@ static struct severity { EXCP, KERNEL, MCGMASK(MCG_STATUS_RIPV, 0) ), MCESEV( + PANIC, "In kernel and no restart IP", + EXCP, KERNEL_RECOV, MCGMASK(MCG_STATUS_RIPV, 0) + ), + MCESEV( DEFERRED, "Deferred error", NOSER, MASK(MCI_STATUS_UC|MCI_STATUS_DEFERRED|MCI_STATUS_POISON, MCI_STATUS_DEFERRED) ), @@ -123,6 +129,11 @@ static struct severity { MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, MCG_STATUS_RIPV) ), MCESEV( + AR, "Action required: data load error recoverable area of kernel", + SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA), + KERNEL_RECOV + ), + MCESEV( AR, "Action required: data load error in a user process", SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA), USER @@ -170,6 +181,9 @@ static struct severity { ) /* always matches. 
keep at end */ }; +#define mc_recoverable(mcg) (((mcg) & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) == \ + (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) + /* * If mcgstatus indicated that ip/cs on the stack were * no good, then "m->cs" will be zero and we will have @@ -183,7 +197,11 @@ static struct severity { */ static int error_context(struct mce *m) { - return ((m->cs & 3) == 3) ? IN_USER : IN_KERNEL; + if ((m->cs & 3) == 3) + return IN_USER; + if (mc_recoverable(m->mcgstatus) && search_mcexception_tables(m->ip)) + return IN_KERNEL_RECOV; + return IN_KERNEL; } /* diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 9d014b82a124..f2f568ad6409 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include @@ -958,6 +959,20 @@ static void mce_clear_state(unsigned long *toclear) } } +static int do_memory_failure(struct mce *m) +{ + int flags = MF_ACTION_REQUIRED; + int ret; + + pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr); + if (!(m->mcgstatus & MCG_STATUS_RIPV)) + flags |= MF_MUST_KILL; + ret = memory_failure(m->addr >> PAGE_SHIFT, MCE_VECTOR, flags); + if (ret) + pr_err("Memory error not recovered"); + return ret; +} + /* * The actual machine check handler. This only handles real * exceptions when something got corrupted coming in through int 18. @@ -995,8 +1010,6 @@ void do_machine_check(struct pt_regs *regs, long error_code) DECLARE_BITMAP(toclear, MAX_NR_BANKS); DECLARE_BITMAP(valid_banks, MAX_NR_BANKS); char *msg = "Unknown"; - u64 recover_paddr = ~0ull; - int flags = MF_ACTION_REQUIRED; int lmce = 0; ist_enter(regs); @@ -1123,22 +1136,13 @@ void do_machine_check(struct pt_regs *regs, long error_code) } /* - * At insane "tolerant" levels we take no action. Otherwise - * we only die if we have no other choice. For less serious - * issues we try to recover, or limit damage to the current - * process. 
+ * If tolerant is at an insane level we drop requests to kill + * processes and continue even when there is no way out */ - if (cfg->tolerant < 3) { - if (no_way_out) - mce_panic("Fatal machine check on current CPU", &m, msg); - if (worst == MCE_AR_SEVERITY) { - recover_paddr = m.addr; - if (!(m.mcgstatus & MCG_STATUS_RIPV)) - flags |= MF_MUST_KILL; - } else if (kill_it) { - force_sig(SIGBUS, current); - } - } + if (cfg->tolerant == 3) + kill_it = 0; + else if (no_way_out) + mce_panic("Fatal machine check on current CPU", &m, msg); if (worst > 0) mce_report_event(regs); @@ -1146,25 +1150,22 @@ void do_machine_check(struct pt_regs *regs, long error_code) out: sync_core(); - if (recover_paddr == ~0ull) - goto done; + /* Fault was in user mode and we need to take some action */ + if ((m.cs & 3) == 3 && (worst == MCE_AR_SEVERITY || kill_it)) { + ist_begin_non_atomic(regs); + local_irq_enable(); - pr_err("Uncorrected hardware memory error in user-access at %llx", - recover_paddr); - /* - * We must call memory_failure() here even if the current process is - * doomed. We still need to mark the page as poisoned and alert any - * other users of the page. 
- */ - ist_begin_non_atomic(regs); - local_irq_enable(); - if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) { - pr_err("Memory error not recovered"); - force_sig(SIGBUS, current); + if (kill_it || do_memory_failure(&m)) + force_sig(SIGBUS, current); + local_irq_disable(); + ist_end_non_atomic(); } - local_irq_disable(); - ist_end_non_atomic(); -done: + + /* Fault was in recoverable area of the kernel */ + if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY) + if (!fixup_mcexception(regs, m.addr)) + mce_panic("Failed kernel mode recovery", &m, NULL); + ist_exit(regs); } EXPORT_SYMBOL_GPL(do_machine_check); -- 2.1.4 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752759AbbLKTcl (ORCPT ); Fri, 11 Dec 2015 14:32:41 -0500 Received: from mga09.intel.com ([134.134.136.24]:14308 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752547AbbLKTcj (ORCPT ); Fri, 11 Dec 2015 14:32:39 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,414,1444719600"; d="scan'208";a="705674514" Message-Id: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> In-Reply-To: References: From: Tony Luck Date: Thu, 10 Dec 2015 16:21:50 -0800 Subject: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks To: Ingo Molnar Cc: Borislav Petkov , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Using __copy_user_nocache() as inspiration create a memory copy routine for use by kernel code with annotations to allow for recovery from machine checks. Notes: 1) Unlike the original we make no attempt to copy all the bytes up to the faulting address. 
The original achieves that by re-executing the failing part as a byte-by-byte copy, which will take another page fault. We don't want to have a second machine check! 2) Likewise the return value for the original indicates exactly how many bytes were not copied. Instead we provide the physical address of the fault (thanks to help from do_machine_check() 3) Provide helpful macros to decode the return value. Signed-off-by: Tony Luck --- arch/x86/include/asm/uaccess_64.h | 5 +++ arch/x86/kernel/x8664_ksyms_64.c | 2 + arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++ 3 files changed, 98 insertions(+) diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h index f2f9b39b274a..779cb0e77ecc 100644 --- a/arch/x86/include/asm/uaccess_64.h +++ b/arch/x86/include/asm/uaccess_64.h @@ -216,6 +216,11 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size) extern long __copy_user_nocache(void *dst, const void __user *src, unsigned size, int zerorest); +extern u64 mcsafe_memcpy(void *dst, const void __user *src, + unsigned size); +#define COPY_HAD_MCHECK(ret) ((ret) & BIT(63)) +#define COPY_MCHECK_PADDR(ret) ((ret) & ~BIT(63)) + static inline int __copy_from_user_nocache(void *dst, const void __user *src, unsigned size) { diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c index a0695be19864..ec988c92c055 100644 --- a/arch/x86/kernel/x8664_ksyms_64.c +++ b/arch/x86/kernel/x8664_ksyms_64.c @@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache); EXPORT_SYMBOL(_copy_from_user); EXPORT_SYMBOL(_copy_to_user); +EXPORT_SYMBOL(mcsafe_memcpy); + EXPORT_SYMBOL(copy_page); EXPORT_SYMBOL(clear_page); diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index 982ce34f4a9b..ffce93cbc9a5 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -319,3 +319,94 @@ ENTRY(__copy_user_nocache) _ASM_EXTABLE(21b,50b) _ASM_EXTABLE(22b,50b) ENDPROC(__copy_user_nocache) + 
+/*
+ * mcsafe_memcpy - Uncached memory copy with machine check exception handling
+ * Note that we only catch machine checks when reading the source addresses.
+ * Writes to target are posted and don't generate machine checks.
+ * This will force destination/source out of cache for more performance.
+ */
+ENTRY(mcsafe_memcpy)
+	cmpl $8,%edx
+	jb 20f		/* less than 8 bytes, go to byte copy loop */
+
+	/* check for bad alignment of destination */
+	movl %edi,%ecx
+	andl $7,%ecx
+	jz 102f				/* already aligned */
+	subl $8,%ecx
+	negl %ecx
+	subl %ecx,%edx
+0:	movb (%rsi),%al
+	movb %al,(%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz 0b
+102:
+	movl %edx,%ecx
+	andl $63,%edx
+	shrl $6,%ecx
+	jz 17f
+1:	movq (%rsi),%r8
+2:	movq 1*8(%rsi),%r9
+3:	movq 2*8(%rsi),%r10
+4:	movq 3*8(%rsi),%r11
+	movnti %r8,(%rdi)
+	movnti %r9,1*8(%rdi)
+	movnti %r10,2*8(%rdi)
+	movnti %r11,3*8(%rdi)
+9:	movq 4*8(%rsi),%r8
+10:	movq 5*8(%rsi),%r9
+11:	movq 6*8(%rsi),%r10
+12:	movq 7*8(%rsi),%r11
+	movnti %r8,4*8(%rdi)
+	movnti %r9,5*8(%rdi)
+	movnti %r10,6*8(%rdi)
+	movnti %r11,7*8(%rdi)
+	leaq 64(%rsi),%rsi
+	leaq 64(%rdi),%rdi
+	decl %ecx
+	jnz 1b
+17:	movl %edx,%ecx
+	andl $7,%edx
+	shrl $3,%ecx
+	jz 20f
+18:	movq (%rsi),%r8
+	movnti %r8,(%rdi)
+	leaq 8(%rsi),%rsi
+	leaq 8(%rdi),%rdi
+	decl %ecx
+	jnz 18b
+20:	andl %edx,%edx
+	jz 23f
+	movl %edx,%ecx
+21:	movb (%rsi),%al
+	movb %al,(%rdi)
+	incq %rsi
+	incq %rdi
+	decl %ecx
+	jnz 21b
+23:	xorl %eax,%eax
+	sfence
+	ret
+
+	.section .fixup,"ax"
+30:
+	sfence
+	/* do_machine_check() sets %rax return value */
+	ret
+	.previous
+
+	_ASM_MCEXTABLE(0b,30b)
+	_ASM_MCEXTABLE(1b,30b)
+	_ASM_MCEXTABLE(2b,30b)
+	_ASM_MCEXTABLE(3b,30b)
+	_ASM_MCEXTABLE(4b,30b)
+	_ASM_MCEXTABLE(9b,30b)
+	_ASM_MCEXTABLE(10b,30b)
+	_ASM_MCEXTABLE(11b,30b)
+	_ASM_MCEXTABLE(12b,30b)
+	_ASM_MCEXTABLE(18b,30b)
+	_ASM_MCEXTABLE(21b,30b)
+ENDPROC(mcsafe_memcpy)
-- 
2.1.4
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org 
via listexpand id S1752515AbbLKUHF (ORCPT ); Fri, 11 Dec 2015 15:07:05 -0500 Received: from mail-ob0-f180.google.com ([209.85.214.180]:36805 "EHLO mail-ob0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751358AbbLKUHD (ORCPT ); Fri, 11 Dec 2015 15:07:03 -0500 MIME-Version: 1.0 In-Reply-To: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 12:06:42 -0800 Message-ID: Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables To: Tony Luck Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , Dan Williams , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 10, 2015 at 1:58 PM, Tony Luck wrote: > Copy the existing page fault fixup mechanisms to create a new table > to be 
used when fixing machine checks. Note: > 1) At this time we only provide a macro to annotate assembly code > 2) We assume all fixups will in code builtin to the kernel. > 3) Only for x86_64 > 4) New code under CONFIG_MCE_KERNEL_RECOVERY > > Signed-off-by: Tony Luck > --- > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > +int fixup_mcexception(struct pt_regs *regs, u64 addr) > +{ > + const struct exception_table_entry *fixup; > + unsigned long new_ip; > + > + fixup = search_mcexception_tables(regs->ip); > + if (fixup) { > + new_ip = ex_fixup_addr(fixup); > + > + regs->ip = new_ip; > + regs->ax = BIT(63) | addr; Can this be an actual #define? --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753071AbbLKUI0 (ORCPT ); Fri, 11 Dec 2015 15:08:26 -0500 Received: from mail-ob0-f180.google.com ([209.85.214.180]:34494 "EHLO mail-ob0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752451AbbLKUIZ (ORCPT ); Fri, 11 Dec 2015 15:08:25 -0500 MIME-Version: 1.0 In-Reply-To: References: From: Andy Lutomirski Date: Fri, 11 Dec 2015 12:08:05 -0800 Message-ID: Subject: Re: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas To: Tony Luck Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , Dan Williams , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 10, 2015 at 4:14 PM, Tony Luck wrote: > Extend the severity checking code to add a new context IN_KERN_RECOV > which is used to indicate that the machine check was triggered by code > in the kernel with a fixup entry. > > Add code to check for this situation and respond by altering the return > IP to the fixup address and changing the regs->ax so that the recovery > code knows the physical address of the error. 
Note that we also set bit > 63 because 0x0 is a legal physical address. > > Major re-work to the tail code in do_machine_check() to make all this > readable/maintainable. One functional change is that tolerant=3 no longer > stops recovery actions. Revert to only skipping sending SIGBUS to the > current process. This is IMO much, much nicer than the old code. Thanks! --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753530AbbLKUJb (ORCPT ); Fri, 11 Dec 2015 15:09:31 -0500 Received: from mail-ob0-f177.google.com ([209.85.214.177]:33013 "EHLO mail-ob0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751358AbbLKUJa (ORCPT ); Fri, 11 Dec 2015 15:09:30 -0500 MIME-Version: 1.0 In-Reply-To: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 12:09:10 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks To: Tony Luck Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , Dan Williams , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 10, 2015 at 4:21 PM, Tony Luck wrote: > Using __copy_user_nocache() as inspiration create a memory copy > routine for use by kernel code with annotations to allow for > recovery from machine checks. > > Notes: > 1) Unlike the original we make no attempt to copy all the bytes > up to the faulting address. The original achieves that by > re-executing the failing part as a byte-by-byte copy, > which will take another page fault. We don't want to have > a second machine check! 
> 2) Likewise the return value for the original indicates exactly > how many bytes were not copied. Instead we provide the physical > address of the fault (thanks to help from do_machine_check() > 3) Provide helpful macros to decode the return value. > > Signed-off-by: Tony Luck > --- > arch/x86/include/asm/uaccess_64.h | 5 +++ > arch/x86/kernel/x8664_ksyms_64.c | 2 + > arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++ > 3 files changed, 98 insertions(+) > > diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h > index f2f9b39b274a..779cb0e77ecc 100644 > --- a/arch/x86/include/asm/uaccess_64.h > +++ b/arch/x86/include/asm/uaccess_64.h > @@ -216,6 +216,11 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size) > extern long __copy_user_nocache(void *dst, const void __user *src, > unsigned size, int zerorest); > > +extern u64 mcsafe_memcpy(void *dst, const void __user *src, > + unsigned size); > +#define COPY_HAD_MCHECK(ret) ((ret) & BIT(63)) > +#define COPY_MCHECK_PADDR(ret) ((ret) & ~BIT(63)) > + > static inline int > __copy_from_user_nocache(void *dst, const void __user *src, unsigned size) > { > diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c > index a0695be19864..ec988c92c055 100644 > --- a/arch/x86/kernel/x8664_ksyms_64.c > +++ b/arch/x86/kernel/x8664_ksyms_64.c > @@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache); > EXPORT_SYMBOL(_copy_from_user); > EXPORT_SYMBOL(_copy_to_user); > > +EXPORT_SYMBOL(mcsafe_memcpy); > + > EXPORT_SYMBOL(copy_page); > EXPORT_SYMBOL(clear_page); > > diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S > index 982ce34f4a9b..ffce93cbc9a5 100644 > --- a/arch/x86/lib/copy_user_64.S > +++ b/arch/x86/lib/copy_user_64.S > @@ -319,3 +319,94 @@ ENTRY(__copy_user_nocache) > _ASM_EXTABLE(21b,50b) > _ASM_EXTABLE(22b,50b) > ENDPROC(__copy_user_nocache) > + > +/* > + * mcsafe_memcpy - Uncached memory copy with machine 
check exception handling > + * Note that we only catch machine checks when reading the source addresses. > + * Writes to target are posted and don't generate machine checks. > + * This will force destination/source out of cache for more performance. > + */ > +ENTRY(mcsafe_memcpy) > + cmpl $8,%edx > + jb 20f /* less then 8 bytes, go to byte copy loop */ > + > + /* check for bad alignment of destination */ > + movl %edi,%ecx > + andl $7,%ecx > + jz 102f /* already aligned */ > + subl $8,%ecx > + negl %ecx > + subl %ecx,%edx > +0: movb (%rsi),%al > + movb %al,(%rdi) > + incq %rsi > + incq %rdi > + decl %ecx > + jnz 100b > +102: > + movl %edx,%ecx > + andl $63,%edx > + shrl $6,%ecx > + jz 17f > +1: movq (%rsi),%r8 > +2: movq 1*8(%rsi),%r9 > +3: movq 2*8(%rsi),%r10 > +4: movq 3*8(%rsi),%r11 > + movnti %r8,(%rdi) > + movnti %r9,1*8(%rdi) > + movnti %r10,2*8(%rdi) > + movnti %r11,3*8(%rdi) > +9: movq 4*8(%rsi),%r8 > +10: movq 5*8(%rsi),%r9 > +11: movq 6*8(%rsi),%r10 > +12: movq 7*8(%rsi),%r11 > + movnti %r8,4*8(%rdi) > + movnti %r9,5*8(%rdi) > + movnti %r10,6*8(%rdi) > + movnti %r11,7*8(%rdi) > + leaq 64(%rsi),%rsi > + leaq 64(%rdi),%rdi > + decl %ecx > + jnz 1b > +17: movl %edx,%ecx > + andl $7,%edx > + shrl $3,%ecx > + jz 20f > +18: movq (%rsi),%r8 > + movnti %r8,(%rdi) > + leaq 8(%rsi),%rsi > + leaq 8(%rdi),%rdi > + decl %ecx > + jnz 18b > +20: andl %edx,%edx > + jz 23f > + movl %edx,%ecx > +21: movb (%rsi),%al > + movb %al,(%rdi) > + incq %rsi > + incq %rdi > + decl %ecx > + jnz 21b > +23: xorl %eax,%eax > + sfence > + ret > + > + .section .fixup,"ax" > +30: > + sfence > + /* do_machine_check() sets %eax return value */ > + ret > + .previous > + > + _ASM_MCEXTABLE(0b,30b) > + _ASM_MCEXTABLE(1b,30b) > + _ASM_MCEXTABLE(2b,30b) > + _ASM_MCEXTABLE(3b,30b) > + _ASM_MCEXTABLE(4b,30b) > + _ASM_MCEXTABLE(9b,30b) > + _ASM_MCEXTABLE(10b,30b) > + _ASM_MCEXTABLE(11b,30b) > + _ASM_MCEXTABLE(12b,30b) > + _ASM_MCEXTABLE(18b,30b) > + _ASM_MCEXTABLE(21b,30b) > +ENDPROC(mcsafe_memcpy) 
I still don't get the BIT(63) thing. Can you explain it? --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754118AbbLKVBw (ORCPT ); Fri, 11 Dec 2015 16:01:52 -0500 Received: from mga11.intel.com ([192.55.52.93]:50136 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752912AbbLKVBu (ORCPT ); Fri, 11 Dec 2015 16:01:50 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,415,1444719600"; d="scan'208";a="871887718" From: "Luck, Tony" To: Andy Lutomirski CC: Ingo Molnar , Borislav Petkov , "Andrew Morton" , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Subject: RE: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables Thread-Topic: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables Thread-Index: AQHRNE+BlJPZAmI3pUu+7trbxTcoXZ7GRSHg Date: Fri, 11 Dec 2015 21:01:49 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82D35@ORSMSX114.amr.corp.intel.com> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsIiwiaWQiOiIxN2Y2YWFkNS1iMGNmLTQ2ZjEtYTMxMi1mNTdhODUzNjI4NTIiLCJwcm9wcyI6W3sibiI6IkludGVsRGF0YUNsYXNzaWZpY2F0aW9uIiwidmFscyI6W3sidmFsdWUiOiJDVFBfSUMifV19XX0sIlN1YmplY3RMYWJlbHMiOltdLCJUTUNWZXJzaW9uIjoiMTUuNC4xMC4xOSIsIlRydXN0ZWRMYWJlbEhhc2giOiJETGI1cDd3VTZ4b0d6Zk9QUXdwR0lvWTdXN1wvRnBDSUhrZEdnd1BrUUdSYz0ifQ== x-inteldataclassification: CTP_IC x-originating-ip: [10.22.254.140] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from 
base64 to 8bit by mail.home.local id tBBL20MI013565 >> + regs->ip = new_ip; >> + regs->ax = BIT(63) | addr; > > Can this be an actual #define? Doh! Yes, of course. That would be much better. Now I need to think of a good name for it. -Tony From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754137AbbLKVTU (ORCPT ); Fri, 11 Dec 2015 16:19:20 -0500 Received: from mga03.intel.com ([134.134.136.65]:11150 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752753AbbLKVTS (ORCPT ); Fri, 11 Dec 2015 16:19:18 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,415,1444719600"; d="scan'208";a="839483246" From: "Luck, Tony" To: Andy Lutomirski CC: Ingo Molnar , Borislav Petkov , "Andrew Morton" , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Topic: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Index: AQHRNE/aOhVlt+R/OUCiyPbQy9P3LZ7GRdgw Date: Fri, 11 Dec 2015 21:19:17 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsIiwiaWQiOiI2ODAwMmFkNi03ZjhlLTQ3ODQtOGY3Mi01MDE3NGJjYTcyMjYiLCJwcm9wcyI6W3sibiI6IkludGVsRGF0YUNsYXNzaWZpY2F0aW9uIiwidmFscyI6W3sidmFsdWUiOiJDVFBfSUMifV19XX0sIlN1YmplY3RMYWJlbHMiOltdLCJUTUNWZXJzaW9uIjoiMTUuNC4xMC4xOSIsIlRydXN0ZWRMYWJlbEhhc2giOiJyYUNjT3VcL0FKQmgwVGZVYURXVkFReXJyK0lqNVpKMWVJdW9wbUdHTVFlOD0ifQ== x-inteldataclassification: CTP_IC x-originating-ip: 
[10.22.254.140] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id tBBLJNvX013641 > I still don't get the BIT(63) thing. Can you explain it? It will be more obvious when I get around to writing copy_from_user(). Then we will have a function that can take page faults if there are pages that are not present. If the page faults can't be fixed we have a -EFAULT condition. We can also take machine checks if we read from a location with an uncorrected error. We need to distinguish these two cases because the action we take is different. For the unresolved page fault we already have the ABI that the copy_to/from_user() functions return zero for success, and a non-zero return is the number of not-copied bytes. So for my new case I'm setting bit 63 ... this is never going to be set for a failed page fault. 
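The return-value convention described above can be modelled in user space. What follows is only a sketch under stated assumptions, not code from the series: MODEL_BIT63 stands in for the kernel's BIT(63), and model_mcheck_ret() and model_classify() are hypothetical helpers invented for illustration. Only the two COPY_* macros mirror the patch.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-in for the kernel's BIT(63) (assumption for this sketch). */
#define MODEL_BIT63 (UINT64_C(1) << 63)

/* Same shape as the decode macros added by the patch. */
#define COPY_HAD_MCHECK(ret)   ((ret) & MODEL_BIT63)
#define COPY_MCHECK_PADDR(ret) ((ret) & ~MODEL_BIT63)

/*
 * Hypothetical helper: the value the machine check fixup path would leave
 * in the return register, i.e. the flag bit ORed with the physical address.
 */
static uint64_t model_mcheck_ret(uint64_t paddr)
{
	/* A physical address must never have bit 63 set itself. */
	assert((paddr & MODEL_BIT63) == 0);
	return MODEL_BIT63 | paddr;
}

enum model_copy_status { MODEL_COPY_OK, MODEL_COPY_FAULT, MODEL_COPY_MCHECK };

/*
 * Hypothetical classifier for the three outcomes discussed in the thread:
 * zero is success, bit 63 set is a machine check, and any other non-zero
 * value is a count of bytes not copied after an unresolved page fault.
 */
static enum model_copy_status model_classify(uint64_t ret)
{
	if (COPY_HAD_MCHECK(ret))
		return MODEL_COPY_MCHECK;
	return ret ? MODEL_COPY_FAULT : MODEL_COPY_OK;
}
```

The flag bit matters because physical address 0x0 is legal: without it, a machine check at address zero would be indistinguishable from a successful copy.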
copy_from_user() conceptually will look like this:

int copy_from_user(void *to, void *from, unsigned long n)
{
	u64 ret = mcsafe_memcpy(to, from, n);

	if (COPY_HAD_MCHECK(ret)) {
		if (memory_failure(COPY_MCHECK_PADDR(ret) >> PAGE_SHIFT, ...))
			force_sig(SIGBUS, current);
		return something;
	} else
		return ret;
}

-Tony From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753805AbbLKVvN (ORCPT ); Fri, 11 Dec 2015 16:51:13 -0500 Received: from mail-ob0-f169.google.com ([209.85.214.169]:35140 "EHLO mail-ob0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751024AbbLKVvM (ORCPT ); Fri, 11 Dec 2015 16:51:12 -0500 MIME-Version: 1.0 In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 13:50:52 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks To: "Luck, Tony" Cc: Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Dec 11, 2015 at 1:19 PM, Luck, Tony wrote: >> I still don't get the BIT(63) thing. Can you explain it? > > It will be more obvious when I get around to writing copy_from_user(). > > Then we will have a function that can take page faults if there are pages > that are not present. If the page faults can't be fixed we have a -EFAULT > condition. We can also take machine checks if we reads from a location with an > uncorrected error. 
> > We need to distinguish these two cases because the action we take is > different. For the unresolved page fault we already have the ABI that the > copy_to/from_user() functions return zero for success, and a non-zero > return is the number of not-copied bytes. I'm missing something, though. The normal fixup_exception path doesn't touch rax at all. The memory_failure path does. But couldn't you distinguish them by just pointing the exception handlers at different landing pads? Also, would it be more straightforward if the mcexception landing pad looked up the va -> pa mapping by itself? Or is that somehow not reliable? --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754499AbbLKWRP (ORCPT ); Fri, 11 Dec 2015 17:17:15 -0500 Received: from mga01.intel.com ([192.55.52.88]:52727 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754000AbbLKWRN (ORCPT ); Fri, 11 Dec 2015 17:17:13 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,415,1444719600"; d="scan'208";a="859083441" From: "Luck, Tony" To: Andy Lutomirski CC: Ingo Molnar , Borislav Petkov , "Andrew Morton" , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Topic: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Index: AQHRNE/aOhVlt+R/OUCiyPbQy9P3LZ7GRdgwgACTpgD//34voA== Date: Fri, 11 Dec 2015 22:17:10 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: 
eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsIiwiaWQiOiJhYzM0OTAyOS1lNWI0LTRiZTAtODlhMS01NWEzYmNjMzE4NDYiLCJwcm9wcyI6W3sibiI6IkludGVsRGF0YUNsYXNzaWZpY2F0aW9uIiwidmFscyI6W3sidmFsdWUiOiJDVFBfSUMifV19XX0sIlN1YmplY3RMYWJlbHMiOltdLCJUTUNWZXJzaW9uIjoiMTUuNC4xMC4xOSIsIlRydXN0ZWRMYWJlbEhhc2giOiJEU3RSVEU0NFRLUEM0R1J6cCtyWlExd2R2clBneW5Lc3NIdm5XUG1jRHpNPSJ9 x-inteldataclassification: CTP_IC x-originating-ip: [10.22.254.140] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id tBBMHKfN013980 > I'm missing something, though. The normal fixup_exception path > doesn't touch rax at all. The memory_failure path does. But couldn't > you distinguish them by just pointing the exception handlers at > different landing pads? Perhaps I'm just trying to take a short cut to avoid writing some clever fixup code for the target ip that goes into the exception table. For __copy_user_nocache() we have four possible targets for fixup depending on where we were in the function. .section .fixup,"ax" 30: shll $6,%ecx addl %ecx,%edx jmp 60f 40: lea (%rdx,%rcx,8),%rdx jmp 60f 50: movl %ecx,%edx 60: sfence jmp copy_user_handle_tail .previous Note that this code also takes a shortcut by jumping to copy_user_handle_tail() to finish up the copy a byte at a time ... and running back into the same page fault a 2nd time to make sure the byte count is exactly right. I really, really, don't want to run back into the poison again. It would probably work, but because current generation Intel cpus broadcast machine checks to every logical cpu, it is a lot of overhead, and potentially risky. > Also, would it be more straightforward if the mcexception landing pad > looked up the va -> pa mapping by itself? Or is that somehow not > reliable? 
If we did get all the above right, then we could have target use virt_to_phys() to convert to physical ... I don't see that this part would be a problem. -Tony From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754353AbbLKWUT (ORCPT ); Fri, 11 Dec 2015 17:20:19 -0500 Received: from mail-qk0-f175.google.com ([209.85.220.175]:34739 "EHLO mail-qk0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750724AbbLKWUR (ORCPT ); Fri, 11 Dec 2015 17:20:17 -0500 MIME-Version: 1.0 In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> Date: Fri, 11 Dec 2015 14:20:16 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks From: Dan Williams To: "Luck, Tony" Cc: Andy Lutomirski , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Dec 11, 2015 at 2:17 PM, Luck, Tony wrote: >> Also, would it be more straightforward if the mcexception landing pad >> looked up the va -> pa mapping by itself? Or is that somehow not >> reliable? > > If we did get all the above right, then we could have > target use virt_to_phys() to convert to physical ... > I don't see that this part would be a problem. virt_to_phys() implies a linear address. In the case of the use in the pmem driver we'll be using an ioremap()'d address off somewhere in vmalloc space. 
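Dan's objection can be illustrated with a simplified user-space model. This is a sketch under assumptions, not kernel code: the MODEL_* constants are taken from the x86_64 memory-map documentation of that era (4-level paging, no KASLR) and are treated here as assumptions, the model_* helpers are hypothetical, and the separate kernel-text mapping is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout constants, for illustration only. */
#define MODEL_PAGE_OFFSET   UINT64_C(0xffff880000000000) /* start of the direct map */
#define MODEL_DIRECT_END    UINT64_C(0xffffc7ffffffffff) /* last byte of the direct map */
#define MODEL_VMALLOC_START UINT64_C(0xffffc90000000000) /* vmalloc/ioremap area begins */

/* True when the address lies in the linear (direct) mapping of physical memory. */
static bool model_in_direct_map(uint64_t va)
{
	return va >= MODEL_PAGE_OFFSET && va <= MODEL_DIRECT_END;
}

/*
 * Hypothetical model of a linear virt-to-phys translation: a plain
 * subtraction, valid only inside the direct map. Returns false for any
 * other address, such as one handed out by ioremap(), where a page table
 * walk would be needed instead.
 */
static bool model_virt_to_phys(uint64_t va, uint64_t *pa)
{
	if (!model_in_direct_map(va))
		return false;
	*pa = va - MODEL_PAGE_OFFSET;
	return true;
}
```

In this model an address in the vmalloc/ioremap range is rejected rather than translated, which is the case the pmem driver would hit.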
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754764AbbLKW1P (ORCPT ); Fri, 11 Dec 2015 17:27:15 -0500 Received: from mail-ob0-f173.google.com ([209.85.214.173]:34427 "EHLO mail-ob0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751803AbbLKW1N (ORCPT ); Fri, 11 Dec 2015 17:27:13 -0500 MIME-Version: 1.0 In-Reply-To: References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 14:26:53 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks To: Dan Williams Cc: "Luck, Tony" , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Dec 11, 2015 at 2:20 PM, Dan Williams wrote: > On Fri, Dec 11, 2015 at 2:17 PM, Luck, Tony wrote: >>> Also, would it be more straightforward if the mcexception landing pad >>> looked up the va -> pa mapping by itself? Or is that somehow not >>> reliable? >> >> If we did get all the above right, then we could have >> target use virt_to_phys() to convert to physical ... >> I don't see that this part would be a problem. > > virt_to_phys() implies a linear address. In the case of the use in > the pmem driver we'll be using an ioremap()'d address off somewherein > vmalloc space. There's always slow_virt_to_phys. Note that I don't fundamentally object to passing the pa to the fixup handler. I just think we should try to disentangle that from figuring out what exactly the failure was. 
Also, are there really PCOMMIT-capable CPUs that still forcibly broadcast MCE? If, so, that's unfortunate. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754602AbbLKWfU (ORCPT ); Fri, 11 Dec 2015 17:35:20 -0500 Received: from mga01.intel.com ([192.55.52.88]:51201 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751949AbbLKWfS (ORCPT ); Fri, 11 Dec 2015 17:35:18 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,415,1444719600"; d="scan'208";a="839520741" From: "Luck, Tony" To: Andy Lutomirski , "Williams, Dan J" CC: Ingo Molnar , Borislav Petkov , "Andrew Morton" , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Topic: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Index: AQHRNE/aOhVlt+R/OUCiyPbQy9P3LZ7GRdgwgACTpgD//34voIAAiggAgAAB2YD//3tu8A== Date: Fri, 11 Dec 2015 22:35:17 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsIiwiaWQiOiI5ZDVmZjMyYi1mNjk4LTRjYmEtYWZkMy01ZWNjMGY1MTM5MzIiLCJwcm9wcyI6W3sibiI6IkludGVsRGF0YUNsYXNzaWZpY2F0aW9uIiwidmFscyI6W3sidmFsdWUiOiJDVFBfSUMifV19XX0sIlN1YmplY3RMYWJlbHMiOltdLCJUTUNWZXJzaW9uIjoiMTUuNC4xMC4xOSIsIlRydXN0ZWRMYWJlbEhhc2giOiIwRDU5V250VTZxTTZWdCtyTDUxQUtZTGtYTUk3TWltaVl3SVFtVUliU2JrPSJ9 x-inteldataclassification: CTP_IC 
x-originating-ip: [10.22.254.140] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id tBBMZO6m014104 > Also, are there really PCOMMIT-capable CPUs that still forcibly > broadcast MCE? If, so, that's unfortunate. PCOMMIT and LMCE arrive together ... though BIOS is in the decision path to enable LMCE, so it is possible that some systems could still broadcast if the BIOS writer decides to not allow local. But a machine check safe copy_from_user() would be useful on current generation cpus that broadcast all the time. -Tony From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754335AbbLKWie (ORCPT ); Fri, 11 Dec 2015 17:38:34 -0500 Received: from mail-oi0-f53.google.com ([209.85.218.53]:35036 "EHLO mail-oi0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750724AbbLKWid (ORCPT ); Fri, 11 Dec 2015 17:38:33 -0500 MIME-Version: 1.0 In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> From: Andy Lutomirski Date: Fri, 11 Dec 2015 14:38:13 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks To: "Luck, Tony" Cc: "Williams, Dan J" , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Dec 11, 2015 at 2:35 PM, Luck, Tony wrote: >> Also, are there really PCOMMIT-capable CPUs that still forcibly >> broadcast MCE? If, so, that's unfortunate. > > PCOMMIT and LMCE arrive together ... though BIOS is in the decision > path to enable LMCE, so it is possible that some systems could still > broadcast if the BIOS writer decides to not allow local. I really wish Intel would stop doing that. > > But a machine check safe copy_from_user() would be useful > current generation cpus that broadcast all the time. Fair enough. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754913AbbLKWpg (ORCPT ); Fri, 11 Dec 2015 17:45:36 -0500 Received: from mga03.intel.com ([134.134.136.65]:47776 "EHLO mga03.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751588AbbLKWpf (ORCPT ); Fri, 11 Dec 2015 17:45:35 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,415,1444719600"; d="scan'208";a="839525111" From: "Luck, Tony" To: Andy Lutomirski CC: "Williams, Dan J" , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Topic: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Index: AQHRNE/aOhVlt+R/OUCiyPbQy9P3LZ7GRdgwgACTpgD//34voIAAiggAgAAB2YD//3tu8IAAh72A//96OnA= Date: Fri, 11 Dec 2015 22:45:33 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> 
<3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsIiwiaWQiOiI2MWFhNGZiMS1lNzgwLTRiYWItYmYxOS0yOGVlNzA0ZGZhZjkiLCJwcm9wcyI6W3sibiI6IkludGVsRGF0YUNsYXNzaWZpY2F0aW9uIiwidmFscyI6W3sidmFsdWUiOiJDVFBfSUMifV19XX0sIlN1YmplY3RMYWJlbHMiOltdLCJUTUNWZXJzaW9uIjoiMTUuNC4xMC4xOSIsIlRydXN0ZWRMYWJlbEhhc2giOiI2dlliTUVwNlBIVkZ2azhXRjNrcjJWY3E5R0hFNnA1SmpGcWJ6U2dIM0k0PSJ9 x-inteldataclassification: CTP_IC x-originating-ip: [10.22.254.140] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id tBBMjfsr014173 >> But a machine check safe copy_from_user() would be useful >> current generation cpus that broadcast all the time. > > Fair enough. Thanks for spending the time to look at this. Coaxing me to re-write the tail of do_machine_check() has made that code much better. Too many years of one patch on top of another without looking at the whole context. Cogitate on this series over the weekend and see if you can give me an Acked-by or Reviewed-by (I'll be adding a #define for BIT(63)). 
-Tony

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754134AbbLKW4I (ORCPT ); Fri, 11 Dec 2015 17:56:08 -0500
Received: from mail-ob0-f171.google.com ([209.85.214.171]:34470 "EHLO mail-ob0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751626AbbLKW4F (ORCPT ); Fri, 11 Dec 2015 17:56:05 -0500
MIME-Version: 1.0
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com>
References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com>
 <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com>
 <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com>
 <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com>
 <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com>
From: Andy Lutomirski
Date: Fri, 11 Dec 2015 14:55:45 -0800
Message-ID:
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks
To: "Luck, Tony"
Cc: "Williams, Dan J" , Ingo Molnar , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Dec 11, 2015 at 2:45 PM, Luck, Tony wrote:
>>> But a machine check safe copy_from_user() would be useful
>>> current generation cpus that broadcast all the time.
>>
>> Fair enough.
>
> Thanks for spending the time to look at this. Coaxing me to re-write the
> tail of do_machine_check() has made that code much better. Too many
> years of one patch on top of another without looking at the whole context.
>
> Cogitate on this series over the weekend and see if you can give me
> an Acked-by or Reviewed-by (I'll be adding a #define for BIT(63)).

I can't review the MCE decoding part, because I don't understand it
nearly well enough. The interaction with the core fault handling
looks fine, modulo any need to bikeshed on the macro naming (which
I'll refrain from doing).

I still think it would be better if you get rid of BIT(63) and use a
pair of landing pads, though. They could be as simple as:

.Lpage_fault_goes_here:
	xorq %rax, %rax
	jmp .Lbad

.Lmce_goes_here:
	/* set high bit of rax or whatever */
	/* fall through */

.Lbad:
	/* deal with it */

That way the magic is isolated to the function that needs the magic.

Also, at least renaming the macro to EXTABLE_MC_PA_IN_AX might be
nice. It'll keep future users honest. Maybe some day there'll be a
PA_IN_AX flag, and, heck, maybe some day there'll be ways to get info
for non-MCE faults delivered through fixup_exception.

--Andy

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752410AbbLLKLt (ORCPT ); Sat, 12 Dec 2015 05:11:49 -0500
Received: from mail.skyhub.de ([78.46.96.112]:57594 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751724AbbLLKLq (ORCPT ); Sat, 12 Dec 2015 05:11:46 -0500
Date: Sat, 12 Dec 2015 11:11:42 +0100
From: Borislav Petkov
To: Tony Luck
Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables
Message-ID: <20151212101142.GA3867@pd.tnic>
References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Dec 10,
2015 at 01:58:04PM -0800, Tony Luck wrote: > Copy the existing page fault fixup mechanisms to create a new table > to be used when fixing machine checks. Note: > 1) At this time we only provide a macro to annotate assembly code > 2) We assume all fixups will in code builtin to the kernel. > 3) Only for x86_64 > 4) New code under CONFIG_MCE_KERNEL_RECOVERY > > Signed-off-by: Tony Luck > --- > arch/x86/Kconfig | 4 ++++ > arch/x86/include/asm/asm.h | 10 ++++++++-- > arch/x86/include/asm/uaccess.h | 8 ++++++++ > arch/x86/mm/extable.c | 19 +++++++++++++++++++ > include/asm-generic/vmlinux.lds.h | 6 ++++++ > include/linux/module.h | 1 + > kernel/extable.c | 20 ++++++++++++++++++++ > 7 files changed, 66 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 96d058a87100..db5c6e1d6e37 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -1001,6 +1001,10 @@ config X86_MCE_INJECT > If you don't know what a machine check is and you don't do kernel > QA it is safe to say n. > > +config MCE_KERNEL_RECOVERY > + depends on X86_MCE && X86_64 > + def_bool y Shouldn't that depend on NVDIMM or whatnot? Looks too generic now. > + > config X86_THERMAL_VECTOR > def_bool y > depends on X86_MCE_INTEL > diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h > index 189679aba703..a5d483ac11fa 100644 > --- a/arch/x86/include/asm/asm.h > +++ b/arch/x86/include/asm/asm.h > @@ -44,13 +44,19 @@ > > /* Exception table entry */ > #ifdef __ASSEMBLY__ > -# define _ASM_EXTABLE(from,to) \ > - .pushsection "__ex_table","a" ; \ > +# define __ASM_EXTABLE(from, to, table) \ > + .pushsection table, "a" ; \ > .balign 8 ; \ > .long (from) - . ; \ > .long (to) - . 
; \ > .popsection > > +# define _ASM_EXTABLE(from, to) \ > + __ASM_EXTABLE(from, to, "__ex_table") > + > +# define _ASM_MCEXTABLE(from, to) \ > + __ASM_EXTABLE(from, to, "__mcex_table") > + > # define _ASM_EXTABLE_EX(from,to) \ > .pushsection "__ex_table","a" ; \ > .balign 8 ; \ > diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h > index a8df874f3e88..7b02ca1991b4 100644 > --- a/arch/x86/include/asm/uaccess.h > +++ b/arch/x86/include/asm/uaccess.h > @@ -111,6 +111,14 @@ struct exception_table_entry { > #define ARCH_HAS_SEARCH_EXTABLE > > extern int fixup_exception(struct pt_regs *regs); > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > +extern int fixup_mcexception(struct pt_regs *regs, u64 addr); > +#else > +static inline int fixup_mcexception(struct pt_regs *regs, u64 addr) > +{ > + return 0; > +} > +#endif > extern int early_fixup_exception(unsigned long *ip); No need for "extern" > > /* > diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c > index 903ec1e9c326..a461c4212758 100644 > --- a/arch/x86/mm/extable.c > +++ b/arch/x86/mm/extable.c > @@ -49,6 +49,25 @@ int fixup_exception(struct pt_regs *regs) > return 0; > } > > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > +int fixup_mcexception(struct pt_regs *regs, u64 addr) > +{ If you move the #ifdef here, you can save yourself the ifdeffery in the header above. 
> + const struct exception_table_entry *fixup; > + unsigned long new_ip; > + > + fixup = search_mcexception_tables(regs->ip); > + if (fixup) { > + new_ip = ex_fixup_addr(fixup); > + > + regs->ip = new_ip; > + regs->ax = BIT(63) | addr; > + return 1; > + } > + > + return 0; > +} > +#endif > + > /* Restricted version used during very early boot */ > int __init early_fixup_exception(unsigned long *ip) > { > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h > index 1781e54ea6d3..21bb20d1172a 100644 > --- a/include/asm-generic/vmlinux.lds.h > +++ b/include/asm-generic/vmlinux.lds.h > @@ -473,6 +473,12 @@ > VMLINUX_SYMBOL(__start___ex_table) = .; \ > *(__ex_table) \ > VMLINUX_SYMBOL(__stop___ex_table) = .; \ > + } \ > + . = ALIGN(align); \ > + __mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \ > + VMLINUX_SYMBOL(__start___mcex_table) = .; \ > + *(__mcex_table) \ > + VMLINUX_SYMBOL(__stop___mcex_table) = .; \ Of all the places, this one is missing #ifdef CONFIG_MCE_KERNEL_RECOVERY. 
> } > > /* > diff --git a/include/linux/module.h b/include/linux/module.h > index 3a19c79918e0..ffecbfcc462c 100644 > --- a/include/linux/module.h > +++ b/include/linux/module.h > @@ -270,6 +270,7 @@ extern const typeof(name) __mod_##type##__##name##_device_table \ > > /* Given an address, look for it in the exception tables */ > const struct exception_table_entry *search_exception_tables(unsigned long add); > +const struct exception_table_entry *search_mcexception_tables(unsigned long a); > > struct notifier_block; > > diff --git a/kernel/extable.c b/kernel/extable.c > index e820ccee9846..7b224fbcb708 100644 > --- a/kernel/extable.c > +++ b/kernel/extable.c > @@ -34,6 +34,10 @@ DEFINE_MUTEX(text_mutex); > > extern struct exception_table_entry __start___ex_table[]; > extern struct exception_table_entry __stop___ex_table[]; > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > +extern struct exception_table_entry __start___mcex_table[]; > +extern struct exception_table_entry __stop___mcex_table[]; > +#endif > > /* Cleared by build time tools if the table is already sorted. */ > u32 __initdata __visible main_extable_sort_needed = 1; > @@ -45,6 +49,10 @@ void __init sort_main_extable(void) > pr_notice("Sorting __ex_table...\n"); > sort_extable(__start___ex_table, __stop___ex_table); > } > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > + if (__stop___mcex_table > __start___mcex_table) > + sort_extable(__start___mcex_table, __stop___mcex_table); > +#endif > } > > /* Given an address, look for it in the exception tables. */ > @@ -58,6 +66,18 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr) > return e; > } > > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > +/* Given an address, look for it in the machine check exception tables. 
*/ > +const struct exception_table_entry *search_mcexception_tables( > + unsigned long addr) > +{ > + const struct exception_table_entry *e; > + > + e = search_extable(__start___mcex_table, __stop___mcex_table-1, addr); > + return e; > +} > +#endif You can make this one a bit more readable by doing: /* Given an address, look for it in the machine check exception tables. */ const struct exception_table_entry * search_mcexception_tables(unsigned long addr) { #ifdef CONFIG_MCE_KERNEL_RECOVERY return search_extable(__start___mcex_table, __stop___mcex_table - 1, addr); #endif } -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752720AbbLNIgb (ORCPT ); Mon, 14 Dec 2015 03:36:31 -0500 Received: from mail-wm0-f42.google.com ([74.125.82.42]:35569 "EHLO mail-wm0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752093AbbLNIg3 (ORCPT ); Mon, 14 Dec 2015 03:36:29 -0500 Date: Mon, 14 Dec 2015 09:36:25 +0100 From: Ingo Molnar To: Andy Lutomirski Cc: "Luck, Tony" , "Williams, Dan J" , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151214083625.GA28073@gmail.com> References: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Andy Lutomirski wrote: > I 
still think it would be better if you get rid of BIT(63) and use a > pair of landing pads, though. They could be as simple as: > > .Lpage_fault_goes_here: > xorq %rax, %rax > jmp .Lbad > > .Lmce_goes_here: > /* set high bit of rax or whatever */ > /* fall through */ > > .Lbad: > /* deal with it */ > > That way the magic is isolated to the function that needs the magic. Seconded - this is the usual pattern we use in all assembly functions. Thanks, Ingo From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753339AbbLNR6r (ORCPT ); Mon, 14 Dec 2015 12:58:47 -0500 Received: from mail-ig0-f178.google.com ([209.85.213.178]:36990 "EHLO mail-ig0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752830AbbLNR6q (ORCPT ); Mon, 14 Dec 2015 12:58:46 -0500 MIME-Version: 1.0 In-Reply-To: <20151212101142.GA3867@pd.tnic> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic> Date: Mon, 14 Dec 2015 10:58:45 -0700 Message-ID: Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables From: Ross Zwisler To: Borislav Petkov Cc: Tony Luck , linux-nvdimm , X86 ML , linux-kernel@vger.kernel.org, Ingo Molnar , linux-mm@kvack.org, Andy Lutomirski , Andrew Morton , Ross Zwisler Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Dec 12, 2015 at 3:11 AM, Borislav Petkov wrote: > On Thu, Dec 10, 2015 at 01:58:04PM -0800, Tony Luck wrote: <> >> +#ifdef CONFIG_MCE_KERNEL_RECOVERY >> +/* Given an address, look for it in the machine check exception tables. 
*/ >> +const struct exception_table_entry *search_mcexception_tables( >> + unsigned long addr) >> +{ >> + const struct exception_table_entry *e; >> + >> + e = search_extable(__start___mcex_table, __stop___mcex_table-1, addr); >> + return e; >> +} >> +#endif > > You can make this one a bit more readable by doing: > > /* Given an address, look for it in the machine check exception tables. */ > const struct exception_table_entry * > search_mcexception_tables(unsigned long addr) > { > #ifdef CONFIG_MCE_KERNEL_RECOVERY > return search_extable(__start___mcex_table, > __stop___mcex_table - 1, addr); > #endif > } With this code if CONFIG_MCE_KERNEL_RECOVERY isn't defined you'll get a compiler error that the function doesn't have a return statement, right? I think we need an #else to return NULL, or to have the #ifdef encompass the whole function definition as it was in Tony's version. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753285AbbLNTqu (ORCPT ); Mon, 14 Dec 2015 14:46:50 -0500 Received: from mga02.intel.com ([134.134.136.20]:18635 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750801AbbLNTqt (ORCPT ); Mon, 14 Dec 2015 14:46:49 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,428,1444719600"; d="scan'208";a="841020380" Date: Mon, 14 Dec 2015 11:46:48 -0800 From: "Luck, Tony" To: Ingo Molnar Cc: Andy Lutomirski , "Williams, Dan J" , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151214194648.GA15222@agluck-desk.sc.intel.com> References: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> 
<3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com> <20151214083625.GA28073@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151214083625.GA28073@gmail.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 14, 2015 at 09:36:25AM +0100, Ingo Molnar wrote: > > /* deal with it */ > > > > That way the magic is isolated to the function that needs the magic. > > Seconded - this is the usual pattern we use in all assembly functions. Ok - you want me to write some x86 assembly code (you may regret that). Initial question ... here's the fixup for __copy_user_nocache() .section .fixup,"ax" 30: shll $6,%ecx addl %ecx,%edx jmp 60f 40: lea (%rdx,%rcx,8),%rdx jmp 60f 50: movl %ecx,%edx 60: sfence jmp copy_user_handle_tail .previous Are %ecx and %rcx synonyms for the same register? Is there some super subtle reason we use the 'r' names in the "40" fixup, but the 'e' names everywhere else in this code (and the 'e' names in the body of the original function)? 
-Tony From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932151AbbLNUMW (ORCPT ); Mon, 14 Dec 2015 15:12:22 -0500 Received: from mail-ob0-f175.google.com ([209.85.214.175]:36602 "EHLO mail-ob0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753306AbbLNUMO (ORCPT ); Mon, 14 Dec 2015 15:12:14 -0500 MIME-Version: 1.0 In-Reply-To: <20151214194648.GA15222@agluck-desk.sc.intel.com> References: <3908561D78D1C84285E8C5FCA982C28F39F82D87@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82EEF@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82F97@ORSMSX114.amr.corp.intel.com> <3908561D78D1C84285E8C5FCA982C28F39F82FED@ORSMSX114.amr.corp.intel.com> <20151214083625.GA28073@gmail.com> <20151214194648.GA15222@agluck-desk.sc.intel.com> From: Andy Lutomirski Date: Mon, 14 Dec 2015 12:11:53 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks To: "Luck, Tony" Cc: Ingo Molnar , "Williams, Dan J" , Borislav Petkov , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 14, 2015 at 11:46 AM, Luck, Tony wrote: > On Mon, Dec 14, 2015 at 09:36:25AM +0100, Ingo Molnar wrote: >> > /* deal with it */ >> > >> > That way the magic is isolated to the function that needs the magic. >> >> Seconded - this is the usual pattern we use in all assembly functions. > > Ok - you want me to write some x86 assembly code (you may regret that). > All you have to do is erase all of the ia64 asm knowledge from your brain and repurpose 1% of that space for x86 asm. You'll be a world-class expert! > Initial question ... 
here's the fixup for __copy_user_nocache() > > .section .fixup,"ax" > 30: shll $6,%ecx > addl %ecx,%edx > jmp 60f > 40: lea (%rdx,%rcx,8),%rdx > jmp 60f > 50: movl %ecx,%edx > 60: sfence > jmp copy_user_handle_tail > .previous > > Are %ecx and %rcx synonyms for the same register? Is there some > super subtle reason we use the 'r' names in the "40" fixup, but > the 'e' names everywhere else in this code (and the 'e' names in > the body of the original function)? rcx is a 64-bit register. ecx is the low 32 bits of it. If you read from ecx, you get the low 32 bits, but if you write to ecx, you zero the high bits as a side-effect. --Andy From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932743AbbLNW2H (ORCPT ); Mon, 14 Dec 2015 17:28:07 -0500 Received: from mail.skyhub.de ([78.46.96.112]:58735 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753567AbbLNW2F (ORCPT ); Mon, 14 Dec 2015 17:28:05 -0500 Date: Mon, 14 Dec 2015 23:27:59 +0100 From: Borislav Petkov To: Ross Zwisler Cc: Tony Luck , linux-nvdimm , X86 ML , linux-kernel@vger.kernel.org, Ingo Molnar , linux-mm@kvack.org, Andy Lutomirski , Andrew Morton , Ross Zwisler Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables Message-ID: <20151214222759.GF10520@pd.tnic> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 14, 2015 at 10:58:45AM -0700, Ross Zwisler wrote: > With this code if CONFIG_MCE_KERNEL_RECOVERY isn't defined you'll get > a compiler error that the function doesn't have a return statement, > right? 
I think we need an #else to return NULL, or to have the #ifdef > encompass the whole function definition as it was in Tony's version. Right, correct. Thanks. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933114AbbLOBB2 (ORCPT ); Mon, 14 Dec 2015 20:01:28 -0500 Received: from mga09.intel.com ([134.134.136.24]:35411 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932795AbbLOBB1 (ORCPT ); Mon, 14 Dec 2015 20:01:27 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,429,1444719600"; d="scan'208";a="871508126" Date: Mon, 14 Dec 2015 17:00:59 -0800 From: "Luck, Tony" To: Borislav Petkov Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables Message-ID: <20151215010059.GA17353@agluck-desk.sc.intel.com> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20151212101142.GA3867@pd.tnic> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Dec 12, 2015 at 11:11:42AM +0100, Borislav Petkov wrote: > > +config MCE_KERNEL_RECOVERY > > + depends on X86_MCE && X86_64 > > + def_bool y > > Shouldn't that depend on NVDIMM or whatnot? Looks too generic now. Not sure what the "whatnot" would be though. Making it depend on X86_MCE should keep it out of the tiny configurations. By the time you have MCE support, this seems like a pretty small incremental change. 
> > +#ifdef CONFIG_MCE_KERNEL_RECOVERY > > +int fixup_mcexception(struct pt_regs *regs, u64 addr) > > +{ > > If you move the #ifdef here, you can save yourself the ifdeffery in the > header above. I realized I didn't need the inline stub function in the header. > > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h > > index 1781e54ea6d3..21bb20d1172a 100644 > > --- a/include/asm-generic/vmlinux.lds.h > > +++ b/include/asm-generic/vmlinux.lds.h > > @@ -473,6 +473,12 @@ > > VMLINUX_SYMBOL(__start___ex_table) = .; \ > > *(__ex_table) \ > > VMLINUX_SYMBOL(__stop___ex_table) = .; \ > > + } \ > > + . = ALIGN(align); \ > > + __mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \ > > + VMLINUX_SYMBOL(__start___mcex_table) = .; \ > > + *(__mcex_table) \ > > + VMLINUX_SYMBOL(__stop___mcex_table) = .; \ > > Of all the places, this one is missing #ifdef CONFIG_MCE_KERNEL_RECOVERY. Is there some cpp magic to use an #ifdef inside a multi-line macro like this? Impact of not having the #ifdef is two extra symbols (the start/stop ones) in the symbol table of the final binary. If that's unacceptable I can fall back to an earlier unpublished version that had separate EXCEPTION_TABLE and MCEXCEPTION_TABLE macros with both invoked in the x86 vmlinux.lds.S file. > You can make this one a bit more readable by doing: > > /* Given an address, look for it in the machine check exception tables. */ > const struct exception_table_entry * > search_mcexception_tables(unsigned long addr) > { > #ifdef CONFIG_MCE_KERNEL_RECOVERY > return search_extable(__start___mcex_table, > __stop___mcex_table - 1, addr); > #endif > } I got rid of the local variable and the return ... but left the #ifdef/#endif around the whole function. 
-Tony From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933299AbbLOJrE (ORCPT ); Tue, 15 Dec 2015 04:47:04 -0500 Received: from mail.skyhub.de ([78.46.96.112]:59656 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933146AbbLOJrA (ORCPT ); Tue, 15 Dec 2015 04:47:00 -0500 Date: Tue, 15 Dec 2015 10:46:53 +0100 From: Borislav Petkov To: "Luck, Tony" Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables Message-ID: <20151215094653.GA25973@pd.tnic> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic> <20151215010059.GA17353@agluck-desk.sc.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20151215010059.GA17353@agluck-desk.sc.intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Dec 14, 2015 at 05:00:59PM -0800, Luck, Tony wrote: > Not sure what the "whatnot" would be though. Making it depend on > X86_MCE should keep it out of the tiny configurations. By the time > you have MCE support, this seems like a pretty small incremental > change. Ok, so it is called CONFIG_LIBNVDIMM. Do you see a use case for this stuff except on machines with NVDIMM hw? CONFIG_LIBNVDIMM can select it but on !NVDIMM systems you don't really need it enabled. > Is there some cpp magic to use an #ifdef inside a multi-line macro like this? > Impact of not having the #ifdef is two extra symbols (the start/stop ones) > in the symbol table of the final binary. 
If that's unacceptable I can fall
> back to an earlier unpublished version that had separate EXCEPTION_TABLE and
> MCEXCEPTION_TABLE macros with both invoked in the x86 vmlinux.lds.S file.

I think what is more important is that this should be in the
x86-specific linker script, not in the generic one. And yes, we should
strive to be clean and not pollute the kernel image with symbols which
are unused, i.e. when CONFIG_MCE_KERNEL_RECOVERY is not enabled.

This below seems to build ok here, on top of yours. It could be a
MCEXCEPTION_TABLE macro, as you say:

Index: b/include/asm-generic/vmlinux.lds.h
===================================================================
--- a/include/asm-generic/vmlinux.lds.h	2015-12-15 10:17:25.568046033 +0100
+++ b/include/asm-generic/vmlinux.lds.h	2015-12-15 10:07:06.064034490 +0100
@@ -484,12 +484,6 @@
 		*(__ex_table)						\
 		VMLINUX_SYMBOL(__stop___ex_table) = .;			\
 	}								\
-	. = ALIGN(align);						\
-	__mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) {		\
-		VMLINUX_SYMBOL(__start___mcex_table) = .;		\
-		*(__mcex_table)						\
-		VMLINUX_SYMBOL(__stop___mcex_table) = .;		\
-	}

 /*
  * Init task
Index: b/arch/x86/kernel/vmlinux.lds.S
===================================================================
--- a/arch/x86/kernel/vmlinux.lds.S	2015-12-14 11:38:58.188150070 +0100
+++ b/arch/x86/kernel/vmlinux.lds.S	2015-12-15 10:09:04.624036699 +0100
@@ -110,7 +110,17 @@ SECTIONS

 	NOTES :text :note

-	EXCEPTION_TABLE(16) :text = 0x9090
+	EXCEPTION_TABLE(16)
+
+#ifdef CONFIG_MCE_KERNEL_RECOVERY
+	. = ALIGN(16);
+	__mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) {
+		VMLINUX_SYMBOL(__start___mcex_table) = .;
+		*(__mcex_table)
+		VMLINUX_SYMBOL(__stop___mcex_table) = .;
+	}
+#endif
+	:text = 0x9090

 #if defined(CONFIG_DEBUG_RODATA)
 	/* .text should occupy whole number of pages */

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
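
[Archive note] For readers following the #ifdef discussion, here is a
rough user-space sketch of the lookup that the __start___mcex_table /
__stop___mcex_table bounds feed, with the disabled case returning NULL
as Ross suggested earlier in the thread. It is not the kernel code.
The names search_mcex_sketch, struct extable_entry, the macro
MCE_KERNEL_RECOVERY and the table contents are all made up for
illustration, and the entries hold absolute addresses, whereas the
x86 tables in the patch store 32-bit relative offsets.

```c
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_MCE_KERNEL_RECOVERY.
 * Remove this define to build the stub variant. */
#define MCE_KERNEL_RECOVERY 1

/* Simplified entry (assumption): absolute addresses instead of the
 * relative offsets the real x86 exception tables use. */
struct extable_entry {
	unsigned long insn;	/* address of the instruction that may fault */
	unsigned long fixup;	/* address to resume execution at */
};

#ifdef MCE_KERNEL_RECOVERY
/* In the kernel the bounds come from linker-script symbols; here a
 * small array, already sorted by insn, plays that role. */
static const struct extable_entry mcex_table[] = {
	{ 0x1000, 0x9000 },
	{ 0x1010, 0x9010 },
	{ 0x2000, 0x9020 },
};
#define MCEX_COUNT (sizeof(mcex_table) / sizeof(mcex_table[0]))
#endif

/* Binary search for an entry whose insn equals addr; NULL if none. */
const struct extable_entry *search_mcex_sketch(unsigned long addr)
{
#ifdef MCE_KERNEL_RECOVERY
	size_t lo = 0, hi = MCEX_COUNT;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (mcex_table[mid].insn == addr)
			return &mcex_table[mid];
		if (mcex_table[mid].insn < addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
#else
	/* Feature compiled out: every path still returns a value. */
	(void)addr;
	return NULL;
#endif
}
```

A caller would treat a non-NULL result as "this instruction is
annotated, resume at ->fixup" and a NULL result as "not recoverable
here". Keeping the #else branch means the function has a return on
every path regardless of the config option.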
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753816AbbLOKoL (ORCPT ); Tue, 15 Dec 2015 05:44:11 -0500 Received: from mail.skyhub.de ([78.46.96.112]:49582 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753432AbbLOKoJ (ORCPT ); Tue, 15 Dec 2015 05:44:09 -0500 Date: Tue, 15 Dec 2015 11:44:02 +0100 From: Borislav Petkov To: "Luck, Tony" Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables Message-ID: <20151215104402.GC25973@pd.tnic> References: <456153d09e85f2f139020a051caed3ca8f8fca73.1449861203.git.tony.luck@intel.com> <20151212101142.GA3867@pd.tnic> <20151215010059.GA17353@agluck-desk.sc.intel.com> <20151215094653.GA25973@pd.tnic> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20151215094653.GA25973@pd.tnic> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 15, 2015 at 10:46:53AM +0100, Borislav Petkov wrote: > I think what is more important is that this should be in the > x86-specific linker script, not in the generic one. And related to that, I think all those additions to kernel/extable.c should be somewhere in arch/x86/ and not in generic code. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965152AbbLOLnZ (ORCPT ); Tue, 15 Dec 2015 06:43:25 -0500 Received: from mail.skyhub.de ([78.46.96.112]:56951 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964922AbbLOLnX (ORCPT ); Tue, 15 Dec 2015 06:43:23 -0500 Date: Tue, 15 Dec 2015 12:43:14 +0100 From: Borislav Petkov To: Tony Luck Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org Subject: Re: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas Message-ID: <20151215114314.GD25973@pd.tnic> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 10, 2015 at 04:14:44PM -0800, Tony Luck wrote: > Extend the severity checking code to add a new context IN_KERN_RECOV > which is used to indicate that the machine check was triggered by code > in the kernel with a fixup entry. > > Add code to check for this situation and respond by altering the return > IP to the fixup address and changing the regs->ax so that the recovery > code knows the physical address of the error. Note that we also set bit > 63 because 0x0 is a legal physical address. > > Major re-work to the tail code in do_machine_check() to make all this > readable/maintainable. One functional change is that tolerant=3 no longer > stops recovery actions. Revert to only skipping sending SIGBUS to the > current process. 
> > Signed-off-by: Tony Luck > --- > arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++++- > arch/x86/kernel/cpu/mcheck/mce.c | 69 ++++++++++++++++--------------- > 2 files changed, 55 insertions(+), 36 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c > index 9c682c222071..ac7fbb0689fb 100644 > --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c > +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c > @@ -12,6 +12,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -29,7 +30,7 @@ > * panic situations) > */ > > -enum context { IN_KERNEL = 1, IN_USER = 2 }; > +enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 }; > enum ser { SER_REQUIRED = 1, NO_SER = 2 }; > enum exception { EXCP_CONTEXT = 1, NO_EXCP = 2 }; > > @@ -48,6 +49,7 @@ static struct severity { > #define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c } > #define KERNEL .context = IN_KERNEL > #define USER .context = IN_USER > +#define KERNEL_RECOV .context = IN_KERNEL_RECOV > #define SER .ser = SER_REQUIRED > #define NOSER .ser = NO_SER > #define EXCP .excp = EXCP_CONTEXT > @@ -87,6 +89,10 @@ static struct severity { > EXCP, KERNEL, MCGMASK(MCG_STATUS_RIPV, 0) > ), > MCESEV( > + PANIC, "In kernel and no restart IP", > + EXCP, KERNEL_RECOV, MCGMASK(MCG_STATUS_RIPV, 0) > + ), > + MCESEV( > DEFERRED, "Deferred error", > NOSER, MASK(MCI_STATUS_UC|MCI_STATUS_DEFERRED|MCI_STATUS_POISON, MCI_STATUS_DEFERRED) > ), > @@ -123,6 +129,11 @@ static struct severity { > MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, MCG_STATUS_RIPV) > ), > MCESEV( > + AR, "Action required: data load error recoverable area of kernel", ... in ... 
> + SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA), > + KERNEL_RECOV > + ), > + MCESEV( > AR, "Action required: data load error in a user process", > SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA), > USER > @@ -170,6 +181,9 @@ static struct severity { > ) /* always matches. keep at end */ > }; > > +#define mc_recoverable(mcg) (((mcg) & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) == \ > + (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) > + > /* > * If mcgstatus indicated that ip/cs on the stack were > * no good, then "m->cs" will be zero and we will have > @@ -183,7 +197,11 @@ static struct severity { > */ > static int error_context(struct mce *m) > { > - return ((m->cs & 3) == 3) ? IN_USER : IN_KERNEL; > + if ((m->cs & 3) == 3) > + return IN_USER; > + if (mc_recoverable(m->mcgstatus) && search_mcexception_tables(m->ip)) > + return IN_KERNEL_RECOV; > + return IN_KERNEL; > } > > /* > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c > index 9d014b82a124..f2f568ad6409 100644 > --- a/arch/x86/kernel/cpu/mcheck/mce.c > +++ b/arch/x86/kernel/cpu/mcheck/mce.c > @@ -31,6 +31,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -958,6 +959,20 @@ static void mce_clear_state(unsigned long *toclear) > } > } > > +static int do_memory_failure(struct mce *m) > +{ > + int flags = MF_ACTION_REQUIRED; > + int ret; > + > + pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr); > + if (!(m->mcgstatus & MCG_STATUS_RIPV)) > + flags |= MF_MUST_KILL; > + ret = memory_failure(m->addr >> PAGE_SHIFT, MCE_VECTOR, flags); > + if (ret) > + pr_err("Memory error not recovered"); > + return ret; > +} > + > /* > * The actual machine check handler. This only handles real > * exceptions when something got corrupted coming in through int 18. 
> @@ -995,8 +1010,6 @@ void do_machine_check(struct pt_regs *regs, long error_code) > DECLARE_BITMAP(toclear, MAX_NR_BANKS); > DECLARE_BITMAP(valid_banks, MAX_NR_BANKS); > char *msg = "Unknown"; > - u64 recover_paddr = ~0ull; > - int flags = MF_ACTION_REQUIRED; > int lmce = 0; > > ist_enter(regs); > @@ -1123,22 +1136,13 @@ void do_machine_check(struct pt_regs *regs, long error_code) > } > > /* > - * At insane "tolerant" levels we take no action. Otherwise > - * we only die if we have no other choice. For less serious > - * issues we try to recover, or limit damage to the current > - * process. > + * If tolerant is at an insane level we drop requests to kill > + * processes and continue even when there is no way out ^ | . Fullstop here. > */ > - if (cfg->tolerant < 3) { > - if (no_way_out) > - mce_panic("Fatal machine check on current CPU", &m, msg); > - if (worst == MCE_AR_SEVERITY) { > - recover_paddr = m.addr; > - if (!(m.mcgstatus & MCG_STATUS_RIPV)) > - flags |= MF_MUST_KILL; > - } else if (kill_it) { > - force_sig(SIGBUS, current); > - } > - } > + if (cfg->tolerant == 3) Btw, I don't see where we limit the input values for that tolerant setting, i.e., user could easily enter something > 3. I think we should add a check in a separate patch to not allow anything except [0-3]. > + kill_it = 0; > + else if (no_way_out) > + mce_panic("Fatal machine check on current CPU", &m, msg); > > if (worst > 0) > mce_report_event(regs); > @@ -1146,25 +1150,22 @@ void do_machine_check(struct pt_regs *regs, long error_code) > out: > sync_core(); > > - if (recover_paddr == ~0ull) > - goto done; > + /* Fault was in user mode and we need to take some action */ > + if ((m.cs & 3) == 3 && (worst == MCE_AR_SEVERITY || kill_it)) { > + ist_begin_non_atomic(regs); > + local_irq_enable(); > > - pr_err("Uncorrected hardware memory error in user-access at %llx", > - recover_paddr); > - /* > - * We must call memory_failure() here even if the current process is > - * doomed. 
We still need to mark the page as poisoned and alert any > - * other users of the page. > - */ > - ist_begin_non_atomic(regs); > - local_irq_enable(); > - if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) { > - pr_err("Memory error not recovered"); > - force_sig(SIGBUS, current); > + if (kill_it || do_memory_failure(&m)) > + force_sig(SIGBUS, current); > + local_irq_disable(); > + ist_end_non_atomic(); > } > - local_irq_disable(); > - ist_end_non_atomic(); > -done: > + > + /* Fault was in recoverable area of the kernel */ > + if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY) > + if (!fixup_mcexception(regs, m.addr)) > + mce_panic("Failed kernel mode recovery", &m, NULL); ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Does that always imply a failed kernel mode recovery? I don't see (m.cs == 0 and MCE_AR_SEVERITY) MCEs always meaning that a recovery should be attempted there. I think this should simply say mce_panic("Fatal machine check on current CPU", &m, msg); Also, how about taking out that worst and kill_it check. It is a bit more readable this way IMO: --- out: sync_core(); if (worst < MCE_AR_SEVERITY && !kill_it) goto out_ist; /* Fault was in user mode and we need to take some action */ if ((m.cs & 3) == 3) { ist_begin_non_atomic(regs); local_irq_enable(); if (kill_it || do_memory_failure(&m)) force_sig(SIGBUS, current); local_irq_disable(); ist_end_non_atomic(); } else { if (!fixup_mcexception(regs, m.addr)) mce_panic("Fatal machine check on current CPU", &m, NULL); } out_ist: ist_exit(regs); } EXPORT_SYMBOL_GPL(do_machine_check); --- Hmm... -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. 
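To make the control flow of the restructured tail proposed above easier to follow, here is a minimal userspace model in C. It is a sketch under stated assumptions, not kernel code: mce_tail(), the TAIL_* results and the SEV_AR value are hypothetical names invented for illustration, and the two boolean parameters stand in for the results of do_memory_failure() and fixup_mcexception().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical severity threshold; only the ordering matters in this model. */
enum { SEV_AR = 7 };

/* Hypothetical outcome labels for the tail of do_machine_check(). */
enum tail_action {
	TAIL_NONE,           /* nothing to do, fall through to ist_exit() */
	TAIL_USER_RECOVERED, /* memory_failure() handled the page */
	TAIL_SIGBUS,         /* force_sig(SIGBUS, current) */
	TAIL_KERNEL_FIXUP,   /* return IP redirected to the fixup address */
	TAIL_PANIC           /* mce_panic() */
};

/*
 * Model of the proposed flow:
 *   mf_failed  stands in for a non-zero return from do_memory_failure()
 *   have_fixup stands in for fixup_mcexception() finding a table entry
 */
static enum tail_action mce_tail(unsigned int cs, int worst, bool kill_it,
				 bool mf_failed, bool have_fixup)
{
	if (worst < SEV_AR && !kill_it)
		return TAIL_NONE;

	if ((cs & 3) == 3) {
		/* Fault was in user mode. */
		if (kill_it || mf_failed)
			return TAIL_SIGBUS;
		return TAIL_USER_RECOVERED;
	}

	/* Fault was in kernel mode. */
	return have_fixup ? TAIL_KERNEL_FIXUP : TAIL_PANIC;
}

/* Worked cases: 0x33 models a ring 3 CS value, 0x10 a ring 0 one. */
static void mce_tail_examples(void)
{
	assert(mce_tail(0x33, SEV_AR, false, false, false) == TAIL_USER_RECOVERED);
	assert(mce_tail(0x33, SEV_AR, false, true, false) == TAIL_SIGBUS);
	assert(mce_tail(0x10, SEV_AR, false, false, true) == TAIL_KERNEL_FIXUP);
	assert(mce_tail(0x10, SEV_AR, false, false, false) == TAIL_PANIC);
}
```

As far as this model goes, a kernel-mode event with kill_it set and worst below the AR severity also reaches the fixup-or-panic branch. The two separate if statements in the patch required worst == MCE_AR_SEVERITY for the kernel-mode case, so the two forms are not strictly equivalent.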
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933448AbbLONLo (ORCPT ); Tue, 15 Dec 2015 08:11:44 -0500 Received: from mail.skyhub.de ([78.46.96.112]:51244 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932176AbbLONLm (ORCPT ); Tue, 15 Dec 2015 08:11:42 -0500 Date: Tue, 15 Dec 2015 14:11:35 +0100 From: Borislav Petkov To: Tony Luck Cc: Ingo Molnar , Andrew Morton , Andy Lutomirski , Dan Williams , linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-nvdimm@ml01.01.org, x86@kernel.org Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151215131135.GE25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 10, 2015 at 04:21:50PM -0800, Tony Luck wrote: > Using __copy_user_nocache() as inspiration create a memory copy > routine for use by kernel code with annotations to allow for > recovery from machine checks. > > Notes: > 1) Unlike the original we make no attempt to copy all the bytes > up to the faulting address. The original achieves that by > re-executing the failing part as a byte-by-byte copy, > which will take another page fault. We don't want to have > a second machine check! > 2) Likewise the return value for the original indicates exactly > how many bytes were not copied. Instead we provide the physical > address of the fault (thanks to help from do_machine_check() > 3) Provide helpful macros to decode the return value. 
> > Signed-off-by: Tony Luck > --- > arch/x86/include/asm/uaccess_64.h | 5 +++ > arch/x86/kernel/x8664_ksyms_64.c | 2 + > arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++ > 3 files changed, 98 insertions(+) ... > + * mcsafe_memcpy - Uncached memory copy with machine check exception handling > + * Note that we only catch machine checks when reading the source addresses. > + * Writes to target are posted and don't generate machine checks. > + * This will force destination/source out of cache for more performance. ... and the non-temporal version is the optimal one even though we're defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel CPUs...? Btw, it should be also inside an ifdef if we're going to ifdef CONFIG_MCE_KERNEL_RECOVERY everywhere else. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933635AbbLORpH (ORCPT ); Tue, 15 Dec 2015 12:45:07 -0500 Received: from mail-qk0-f176.google.com ([209.85.220.176]:34749 "EHLO mail-qk0-f176.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932211AbbLORpF (ORCPT ); Tue, 15 Dec 2015 12:45:05 -0500 MIME-Version: 1.0 In-Reply-To: <20151215131135.GE25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> Date: Tue, 15 Dec 2015 09:45:04 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks From: Dan Williams To: Borislav Petkov Cc: Tony Luck , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 15, 2015 at 5:11 AM, Borislav Petkov wrote: > On Thu, Dec 10, 2015 at 04:21:50PM 
-0800, Tony Luck wrote: >> Using __copy_user_nocache() as inspiration create a memory copy >> routine for use by kernel code with annotations to allow for >> recovery from machine checks. >> >> Notes: >> 1) Unlike the original we make no attempt to copy all the bytes >> up to the faulting address. The original achieves that by >> re-executing the failing part as a byte-by-byte copy, >> which will take another page fault. We don't want to have >> a second machine check! >> 2) Likewise the return value for the original indicates exactly >> how many bytes were not copied. Instead we provide the physical >> address of the fault (thanks to help from do_machine_check() >> 3) Provide helpful macros to decode the return value. >> >> Signed-off-by: Tony Luck >> --- >> arch/x86/include/asm/uaccess_64.h | 5 +++ >> arch/x86/kernel/x8664_ksyms_64.c | 2 + >> arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++ >> 3 files changed, 98 insertions(+) > > ... > >> + * mcsafe_memcpy - Uncached memory copy with machine check exception handling >> + * Note that we only catch machine checks when reading the source addresses. >> + * Writes to target are posted and don't generate machine checks. >> + * This will force destination/source out of cache for more performance. > > ... and the non-temporal version is the optimal one even though we're > defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel > CPUs...? At least the pmem driver use case does not want caching of the source-buffer since that is the raw "disk" media. I.e. in pmem_do_bvec() we'd use this to implement memcpy_from_pmem(). However, caching the destination-buffer may prove beneficial since that data is likely to be consumed immediately by the thread that submitted the i/o. 
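Notes 2) and 3) in the quoted patch description say that the return value carries the physical address of the fault and that decode macros are provided, and the 2/3 description explains that bit 63 is set because 0x0 is a legal physical address. A minimal sketch of such an encoding in plain C follows. The names MC_FAULT_FLAG, mc_encode_fault(), mc_faulted() and mc_fault_paddr() are hypothetical, not the macros from the patch, and the sketch assumes that a return value of 0 means the copy completed without a machine check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: bit 63 marks "a fault happened". */
#define MC_FAULT_FLAG (UINT64_C(1) << 63)

/* Build the value handed back to the caller when a fault is hit at paddr. */
static inline uint64_t mc_encode_fault(uint64_t paddr)
{
	/* Physical addresses are assumed never to use bit 63. */
	assert((paddr & MC_FAULT_FLAG) == 0);
	return paddr | MC_FAULT_FLAG;
}

/* Non-zero if the return value reports a fault (assumes 0 == success). */
static inline int mc_faulted(uint64_t ret)
{
	return (ret & MC_FAULT_FLAG) != 0;
}

/* Recover the physical address from a faulting return value. */
static inline uint64_t mc_fault_paddr(uint64_t ret)
{
	return ret & ~MC_FAULT_FLAG;
}
```

With this layout a fault at physical address 0 encodes to a non-zero value, so it cannot be confused with a successful copy.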
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965405AbbLORxu (ORCPT ); Tue, 15 Dec 2015 12:53:50 -0500 Received: from mga02.intel.com ([134.134.136.20]:3013 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965235AbbLORxr (ORCPT ); Tue, 15 Dec 2015 12:53:47 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,433,1444719600"; d="scan'208";a="861317693" From: "Luck, Tony" To: "Williams, Dan J" , Borislav Petkov CC: Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Topic: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Index: AQHRNzofOhVlt+R/OUCiyPbQy9P3LZ7M2FAA//96T6A= Date: Tue, 15 Dec 2015 17:53:31 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> In-Reply-To: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsIiwiaWQiOiJhYjI4NDJlMS0yZTNhLTQ2NzYtYTFjYS1mYjY4NWJlNGI3NTMiLCJwcm9wcyI6W3sibiI6IkludGVsRGF0YUNsYXNzaWZpY2F0aW9uIiwidmFscyI6W3sidmFsdWUiOiJDVFBfSUMifV19XX0sIlN1YmplY3RMYWJlbHMiOltdLCJUTUNWZXJzaW9uIjoiMTUuNC4xMC4xOSIsIlRydXN0ZWRMYWJlbEhhc2giOiJHQVFwWlZXSWpEU1l5Vk05NHZOOURvRTQxK3N6QmUwY0h1ZFN0NkhxN29rPSJ9 x-inteldataclassification: CTP_IC x-originating-ip: [10.22.254.140] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id tBFHrtG4003623 
>> ... and the non-temporal version is the optimal one even though we're >> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel >> CPUs...? My current generation cpu has a bit of an issue with recovering from a machine check in a "rep mov" ... so I'm working with a version of memcpy that unrolls into individual mov instructions for now. > At least the pmem driver use case does not want caching of the > source-buffer since that is the raw "disk" media. I.e. in > pmem_do_bvec() we'd use this to implement memcpy_from_pmem(). > However, caching the destination-buffer may prove beneficial since > that data is likely to be consumed immediately by the thread that > submitted the i/o. I can drop the "nti" from the destination moves. Does "nti" work on the load from source address side to avoid cache allocation? On another topic raised by Boris ... is there some CONFIG_PMEM* that I should use as a dependency to enable all this? -Tony From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754313AbbLOSVK (ORCPT ); Tue, 15 Dec 2015 13:21:10 -0500 Received: from mail.skyhub.de ([78.46.96.112]:57146 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754289AbbLOSVH (ORCPT ); Tue, 15 Dec 2015 13:21:07 -0500 Date: Tue, 15 Dec 2015 19:21:00 +0100 From: Borislav Petkov To: "Luck, Tony" Cc: "Williams, Dan J" , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151215182059.GH25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 
Content-Disposition: inline In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 15, 2015 at 05:53:31PM +0000, Luck, Tony wrote: > My current generation cpu has a bit of an issue with recovering from a > machine check in a "rep mov" ... so I'm working with a version of memcpy > that unrolls into individual mov instructions for now. Ah. > I can drop the "nti" from the destination moves. Does "nti" work > on the load from source address side to avoid cache allocation? I don't think so: +1: movq (%rsi),%r8 +2: movq 1*8(%rsi),%r9 +3: movq 2*8(%rsi),%r10 +4: movq 3*8(%rsi),%r11 ... You need to load the data into registers first because MOVNTI needs them there as it does reg -> mem movement. That first load from memory into registers with a normal MOV will pull the data into the cache. Perhaps the first thing to try would be to see what slowdown normal MOVs bring and if not really noticeable, use those instead. > On another topic raised by Boris ... is there some CONFIG_PMEM* > that I should use as a dependency to enable all this? I found CONFIG_LIBNVDIMM only today: drivers/nvdimm/Kconfig:1:menuconfig LIBNVDIMM drivers/nvdimm/Kconfig:2: tristate "NVDIMM (Non-Volatile Memory Device) Support" -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. 
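The point made above about MOVNTI (the load into a register is an ordinary cached load, only the store carries the non-temporal hint) can be illustrated from userspace with compiler intrinsics. This is a sketch, not the copy loop from the patch: copy_qwords_nt() is a hypothetical name, and the code assumes a GCC- or Clang-style toolchain where _mm_stream_si64() is available on x86-64 and typically compiles to MOVNTI. On other targets it falls back to a plain store.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si64(); pulls in _mm_sfence() too */
#define HAVE_MOVNTI 1
#else
#define HAVE_MOVNTI 0
#endif

/* Copy n 8-byte words: normal loads from src, non-temporal stores to dst. */
static void copy_qwords_nt(long long *dst, const long long *src, size_t n)
{
	assert(dst != NULL && src != NULL);

	for (size_t i = 0; i < n; i++) {
		/* Ordinary load: the source line is brought into the cache. */
		long long v = src[i];
#if HAVE_MOVNTI
		/* Register -> memory store with a non-temporal hint. */
		_mm_stream_si64(&dst[i], v);
#else
		dst[i] = v;
#endif
	}
#if HAVE_MOVNTI
	/* Order the weakly-ordered streaming stores before later stores. */
	_mm_sfence();
#endif
}
```

Swapping the streaming store for a plain assignment gives the "normal MOVs" variant suggested above, which makes it straightforward to compare the two on a real workload.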
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965338AbbLOS1m (ORCPT ); Tue, 15 Dec 2015 13:27:42 -0500 Received: from mail-qk0-f180.google.com ([209.85.220.180]:34538 "EHLO mail-qk0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932489AbbLOS1b (ORCPT ); Tue, 15 Dec 2015 13:27:31 -0500 MIME-Version: 1.0 In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> Date: Tue, 15 Dec 2015 10:27:31 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks From: Dan Williams To: "Luck, Tony" Cc: Borislav Petkov , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 15, 2015 at 9:53 AM, Luck, Tony wrote: >>> ... and the non-temporal version is the optimal one even though we're >>> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel >>> CPUs...? > > My current generation cpu has a bit of an issue with recovering from a > machine check in a "rep mov" ... so I'm working with a version of memcpy > that unrolls into individual mov instructions for now. > >> At least the pmem driver use case does not want caching of the >> source-buffer since that is the raw "disk" media. I.e. in >> pmem_do_bvec() we'd use this to implement memcpy_from_pmem(). >> However, caching the destination-buffer may prove beneficial since >> that data is likely to be consumed immediately by the thread that >> submitted the i/o. > > I can drop the "nti" from the destination moves. 
Does "nti" work > on the load from source address side to avoid cache allocation? My mistake, I don't think we have an uncached load capability, only store. > On another topic raised by Boris ... is there some CONFIG_PMEM* > that I should use as a dependency to enable all this? I'd rather make this a "select ARCH_MCSAFE_MEMCPY". Since it's not a hard dependency and the details will be hidden behind memcpy_from_pmem(). Specifically, the details will be handled by a new arch_memcpy_from_pmem() in arch/x86/include/asm/pmem.h to supplement the existing arch_memcpy_to_pmem(). From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933651AbbLOSfv (ORCPT ); Tue, 15 Dec 2015 13:35:51 -0500 Received: from mail-qk0-f173.google.com ([209.85.220.173]:34169 "EHLO mail-qk0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932620AbbLOSfu (ORCPT ); Tue, 15 Dec 2015 13:35:50 -0500 MIME-Version: 1.0 In-Reply-To: References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> Date: Tue, 15 Dec 2015 10:35:49 -0800 Message-ID: Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks From: Dan Williams To: "Luck, Tony" Cc: Borislav Petkov , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 15, 2015 at 10:27 AM, Dan Williams wrote: > On Tue, Dec 15, 2015 at 9:53 AM, Luck, Tony wrote: >>>> ... and the non-temporal version is the optimal one even though we're >>>> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel >>>> CPUs...? 
>> >> My current generation cpu has a bit of an issue with recovering from a >> machine check in a "rep mov" ... so I'm working with a version of memcpy >> that unrolls into individual mov instructions for now. >> >>> At least the pmem driver use case does not want caching of the >>> source-buffer since that is the raw "disk" media. I.e. in >>> pmem_do_bvec() we'd use this to implement memcpy_from_pmem(). >>> However, caching the destination-buffer may prove beneficial since >>> that data is likely to be consumed immediately by the thread that >>> submitted the i/o. >> >> I can drop the "nti" from the destination moves. Does "nti" work >> on the load from source address side to avoid cache allocation? > > My mistake, I don't think we have an uncached load capability, only store. Correction we have MOVNTDQA, but that requires saving the fpu state and marking the memory as WC, i.e. probably not worth it. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965352AbbLOSjd (ORCPT ); Tue, 15 Dec 2015 13:39:33 -0500 Received: from mail.skyhub.de ([78.46.96.112]:36635 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932620AbbLOSjc (ORCPT ); Tue, 15 Dec 2015 13:39:32 -0500 Date: Tue, 15 Dec 2015 19:39:24 +0100 From: Borislav Petkov To: Dan Williams Cc: "Luck, Tony" , Ingo Molnar , Andrew Morton , Andy Lutomirski , "linux-kernel@vger.kernel.org" , Linux MM , linux-nvdimm , X86 ML Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151215183924.GJ25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: 
linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 15, 2015 at 10:35:49AM -0800, Dan Williams wrote: > Correction we have MOVNTDQA, but that requires saving the fpu state > and marking the memory as WC, i.e. probably not worth it. Not really. Last time I tried an SSE3 memcpy in the kernel like glibc does, it wasn't worth it. The enhanced REP; MOVSB is hands down faster. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754240AbbLOTUy (ORCPT ); Tue, 15 Dec 2015 14:20:54 -0500 Received: from g9t5009.houston.hp.com ([15.240.92.67]:60610 "EHLO g9t5009.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751259AbbLOTUx convert rfc822-to-8bit (ORCPT ); Tue, 15 Dec 2015 14:20:53 -0500 From: "Elliott, Robert (Persistent Memory)" To: Borislav Petkov , Dan Williams CC: "Luck, Tony" , linux-nvdimm , X86 ML , "linux-kernel@vger.kernel.org" , Linux MM , Andy Lutomirski , Andrew Morton , Ingo Molnar Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Topic: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Index: AQHRNEq4ghk1LvWIwU2qH4H+2NJLyJ7MC6mAgABMagCAAAJcgIAACYCAgAACUYCAAAEBAIAAATVQ Date: Tue, 15 Dec 2015 19:19:58 +0000 Message-ID: <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> <20151215183924.GJ25973@pd.tnic> In-Reply-To: <20151215183924.GJ25973@pd.tnic> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [16.210.48.36] Content-Type: text/plain; charset="Windows-1252" Content-Transfer-Encoding: 
8BIT MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > -----Original Message----- > From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf > Of Borislav Petkov > Sent: Tuesday, December 15, 2015 12:39 PM > To: Dan Williams > Cc: Luck, Tony ; linux-nvdimm nvdimm@ml01.01.org>; X86 ML ; linux- > kernel@vger.kernel.org; Linux MM ; Andy Lutomirski > ; Andrew Morton ; Ingo Molnar > > Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to > recover from machine checks > > On Tue, Dec 15, 2015 at 10:35:49AM -0800, Dan Williams wrote: > > Correction we have MOVNTDQA, but that requires saving the fpu state > > and marking the memory as WC, i.e. probably not worth it. > > Not really. Last time I tried an SSE3 memcpy in the kernel like glibc > does, it wasn't worth it. The enhanced REP; MOVSB is hands down faster. Reading from NVDIMM, rep movsb is efficient, but it fills the CPU caches with the NVDIMM addresses. For large data moves (not uncommon for storage) this will crowd out more important cacheable data. For normal block device reads made through the pmem block device driver, this CPU cache consumption is wasteful, since it is unlikely the application will ask pmem to read the same addresses anytime soon. Due to the historic long latency of storage devices, applications don't re-read from storage again; they save the results. So, the streaming-load instructions are beneficial: * movntdqa (16-byte xmm registers) * vmovntdqa (32-byte ymm registers) * vmovntdqa (64-byte zmm registers) Dan Williams wrote: > Correction we have MOVNTDQA, but that requires > saving the fpu state and marking the memory as WC > i.e. probably not worth it. Although the WC memory type is described in the SDM in the most detail: "An implementation may also make use of the non-temporal hint associated with this instruction if the memory source is WB (write back) memory type. ... 
may optimize cache reads generated by (V)MOVNTDQA on WB memory type to reduce cache evictions." For applications doing loads from mmap() DAX memory, the CPU cache usage could be worthwhile, because applications expect mmap() regions to consist of traditional writeback-cached memory and might do lots of loads/stores. Writing to the NVDIMM requires either: * non-temporal stores; or * normal stores + cache flushes + fences movnti is OK for small transfers, but these are better for bulk moves: * movntdq (16-byte xmm registers) * vmovntdq (32-byte ymm registers) * vmovntdq (64-byte zmm registers) --- Robert Elliott, HPE Persistent Memory From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754326AbbLOT2r (ORCPT ); Tue, 15 Dec 2015 14:28:47 -0500 Received: from mail.skyhub.de ([78.46.96.112]:53389 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754142AbbLOT2q (ORCPT ); Tue, 15 Dec 2015 14:28:46 -0500 Date: Tue, 15 Dec 2015 20:28:37 +0100 From: Borislav Petkov To: "Elliott, Robert (Persistent Memory)" Cc: Dan Williams , "Luck, Tony" , linux-nvdimm , X86 ML , "linux-kernel@vger.kernel.org" , Linux MM , Andy Lutomirski , Andrew Morton , Ingo Molnar Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151215192837.GL25973@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> <20151215183924.GJ25973@pd.tnic> <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: 
linux-kernel@vger.kernel.org On Tue, Dec 15, 2015 at 07:19:58PM +0000, Elliott, Robert (Persistent Memory) wrote: ... > Due to the historic long latency of storage devices, > applications don't re-read from storage again; they > save the results. > So, the streaming-load instructions are beneficial: That's the theory... Do you also have some actual performance numbers where non-temporal operations are better than the REP; MOVSB and *actually* show improvements? And no microbenchmarks please. Thanks. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933643AbbLOU0f (ORCPT ); Tue, 15 Dec 2015 15:26:35 -0500 Received: from g9t5008.houston.hp.com ([15.240.92.66]:51619 "EHLO g9t5008.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932456AbbLOU0d (ORCPT ); Tue, 15 Dec 2015 15:26:33 -0500 From: "Elliott, Robert (Persistent Memory)" To: Borislav Petkov CC: Dan Williams , "Luck, Tony" , linux-nvdimm , X86 ML , "linux-kernel@vger.kernel.org" , Linux MM , "Andy Lutomirski" , Andrew Morton , Ingo Molnar Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Topic: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Thread-Index: AQHRNEq4ghk1LvWIwU2qH4H+2NJLyJ7MC6mAgABMagCAAAJcgIAACYCAgAACUYCAAAEBAIAAATVQgAAMi4CAAAtmAA== Date: Tue, 15 Dec 2015 20:25:37 +0000 Message-ID: <94D0CD8314A33A4D9D801C0FE68B40295BE9F3D5@G4W3202.americas.hpqcorp.net> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> <20151215183924.GJ25973@pd.tnic> <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> <20151215192837.GL25973@pd.tnic> In-Reply-To: <20151215192837.GL25973@pd.tnic> Accept-Language: 
en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [16.210.48.36] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id tBFKQduP004374 --- Robert Elliott, HPE Persistent Memory > -----Original Message----- > From: Borislav Petkov [mailto:bp@alien8.de] > Sent: Tuesday, December 15, 2015 1:29 PM > To: Elliott, Robert (Persistent Memory) > Cc: Dan Williams ; Luck, Tony > ; linux-nvdimm ; X86 ML > ; linux-kernel@vger.kernel.org; Linux MM mm@kvack.org>; Andy Lutomirski ; Andrew Morton > ; Ingo Molnar > Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to > recover from machine checks > > On Tue, Dec 15, 2015 at 07:19:58PM +0000, Elliott, Robert (Persistent > Memory) wrote: > > ... > > > Due to the historic long latency of storage devices, > > applications don't re-read from storage again; they > > save the results. > > So, the streaming-load instructions are beneficial: > > That's the theory... > > Do you also have some actual performance numbers where non-temporal > operations are better than the REP; MOVSB and *actually* show > improvements? And no microbenchmarks please. > > Thanks. > This isn't exactly what you're looking for, but here is an example of fio doing reads from pmem devices (reading from NVDIMMs, writing to DIMMs) with various transfer sizes. At 256 KiB, all the main memory buffers fit in the CPU caches, so no write traffic appears on DDR (just the reads from the NVDIMMs). At 1 MiB, the data spills out of the caches, and writes to the DIMMs end up on DDR. 
Although DDR is busier, fio gets a lot less work done:
* 256 KiB: 90 GiB/s by fio
* 1 MiB: 49 GiB/s by fio

We could try modifying pmem to use its own non-temporal memcpy functions (I've posted experimental patches before that did this) to see if that transition point shifts. We can also watch the CPU cache statistics while running.

Here are statistics from Intel's pcm-memory.x (pardon the wide formatting):

256 KiB
=======
pmem0: (groupid=0, jobs=40): err= 0: pid=20867: Tue Nov 24 18:20:08 2015
  read : io=5219.1GB, bw=89079MB/s, iops=356314, runt= 60006msec
  cpu : usr=1.74%, sys=96.16%, ctx=49576, majf=0, minf=21997

Run status group 0 (all jobs):
   READ: io=5219.1GB, aggrb=89079MB/s, minb=89079MB/s, maxb=89079MB/s, mint=60006msec, maxt=60006msec

|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s): 11778.11 --||-- Mem Ch  0: Reads (MB/s): 11743.99 --|
|--            Writes(MB/s):    51.83 --||--            Writes(MB/s):    43.25 --|
|-- Mem Ch  1: Reads (MB/s): 11779.90 --||-- Mem Ch  1: Reads (MB/s): 11736.06 --|
|--            Writes(MB/s):    48.73 --||--            Writes(MB/s):    37.86 --|
|-- Mem Ch  4: Reads (MB/s): 11784.79 --||-- Mem Ch  4: Reads (MB/s): 11746.94 --|
|--            Writes(MB/s):    52.90 --||--            Writes(MB/s):    43.73 --|
|-- Mem Ch  5: Reads (MB/s): 11778.48 --||-- Mem Ch  5: Reads (MB/s): 11741.55 --|
|--            Writes(MB/s):    47.62 --||--            Writes(MB/s):    37.80 --|
|-- NODE 0 Mem Read (MB/s) : 47121.27 --||-- NODE 1 Mem Read (MB/s) : 46968.53 --|
|-- NODE 0 Mem Write(MB/s) :   201.08 --||-- NODE 1 Mem Write(MB/s) :   162.65 --|
|-- NODE 0 P. Write (T/s):     190927 --||-- NODE 1 P. Write (T/s):     182961 --|
|-- NODE 0 Memory (MB/s):    47322.36 --||-- NODE 1 Memory (MB/s):    47131.17 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):   94089.80                 --|
|--                  System Write Throughput(MB/s):     363.73                 --|
|--                 System Memory Throughput(MB/s):   94453.52                 --|
|---------------------------------------||---------------------------------------|

1 MiB
=====
|---------------------------------------||---------------------------------------|
|--             Socket  0             --||--             Socket  1             --|
|---------------------------------------||---------------------------------------|
|--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch  0: Reads (MB/s):  7227.83 --||-- Mem Ch  0: Reads (MB/s):  7047.45 --|
|--            Writes(MB/s):  5894.47 --||--            Writes(MB/s):  6010.66 --|
|-- Mem Ch  1: Reads (MB/s):  7229.32 --||-- Mem Ch  1: Reads (MB/s):  7041.79 --|
|--            Writes(MB/s):  5891.38 --||--            Writes(MB/s):  6003.19 --|
|-- Mem Ch  4: Reads (MB/s):  7230.70 --||-- Mem Ch  4: Reads (MB/s):  7052.44 --|
|--            Writes(MB/s):  5888.63 --||--            Writes(MB/s):  6012.49 --|
|-- Mem Ch  5: Reads (MB/s):  7229.16 --||-- Mem Ch  5: Reads (MB/s):  7047.19 --|
|--            Writes(MB/s):  5882.45 --||--            Writes(MB/s):  6008.11 --|
|-- NODE 0 Mem Read (MB/s) : 28917.01 --||-- NODE 1 Mem Read (MB/s) : 28188.87 --|
|-- NODE 0 Mem Write(MB/s) : 23556.93 --||-- NODE 1 Mem Write(MB/s) : 24034.46 --|
|-- NODE 0 P. Write (T/s):     238713 --||-- NODE 1 P. Write (T/s):     228040 --|
|-- NODE 0 Memory (MB/s):    52473.94 --||-- NODE 1 Memory (MB/s):    52223.33 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|--                   System Read Throughput(MB/s):   57105.87                 --|
|--                  System Write Throughput(MB/s):   47591.39                 --|
|--                 System Memory Throughput(MB/s):  104697.27                 --|
|---------------------------------------||---------------------------------------|

From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933868AbbLOXqH (ORCPT ); Tue, 15 Dec 2015 18:46:07 -0500 Received: from mga14.intel.com ([192.55.52.115]:46613 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751842AbbLOXqG (ORCPT ); Tue, 15 Dec 2015 18:46:06 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,434,1444719600"; d="scan'208";a="872291217" From: "Luck, Tony" To: Borislav Petkov CC: Ingo Molnar , Andrew Morton , Andy Lutomirski , "Williams, Dan J" , "linux-kernel@vger.kernel.org" , "linux-mm@kvack.org" , "linux-nvdimm@ml01.01.org" , "x86@kernel.org" Subject: RE: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas Thread-Topic: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas Thread-Index: AQHRNy3VWd7zJHLLfUmdzBfwI2jiqJ7MtWLQ Date: Tue, 15 Dec 2015 23:46:03 +0000 Message-ID: <3908561D78D1C84285E8C5FCA982C28F39F85DBE@ORSMSX114.amr.corp.intel.com> References: <20151215114314.GD25973@pd.tnic> In-Reply-To: <20151215114314.GD25973@pd.tnic> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-titus-metadata-40:
eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsIiwiaWQiOiI2MTg1ZGQwNy1hNWY2LTQ1M2UtYTg4Yi02ZGQ5ZWI3YWE2ZjciLCJwcm9wcyI6W3sibiI6IkludGVsRGF0YUNsYXNzaWZpY2F0aW9uIiwidmFscyI6W3sidmFsdWUiOiJDVFBfSUMifV19XX0sIlN1YmplY3RMYWJlbHMiOltdLCJUTUNWZXJzaW9uIjoiMTUuNC4xMC4xOSIsIlRydXN0ZWRMYWJlbEhhc2giOiJZWUl2Q3k3VG1XXC9XS0NLR0FxZkZuNVRxMU55QUl3RzJlSDF0akJqVXA2VT0ifQ== x-inteldataclassification: CTP_IC x-originating-ip: [10.22.254.140] Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by mail.home.local id tBFNkCOm005428 >> + /* Fault was in recoverable area of the kernel */ >> + if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY) >> + if (!fixup_mcexception(regs, m.addr)) >> + mce_panic("Failed kernel mode recovery", &m, NULL); > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > Does that always imply a failed kernel mode recovery? I don't see > > (m.cs == 0 and MCE_AR_SEVERITY) > > MCEs always meaning that a recovery should be attempted there. I think > this should simply say > > mce_panic("Fatal machine check on current CPU", &m, msg); I don't think this can ever happen. If we were in kernel mode and decided that the severity was AR_SEVERITY ... then search_mcexception_table() found an entry for the IP where the machine check happened. The only way for fixup_exception to fail is if search_mcexception_table() now suddenly doesn't find the entry it found earlier. But if this "can't happen" thing actually does happen ... I'd like the panic message to be different from other mce_panic() so you'll know to blame me. Applied all the other suggestions. 
-Tony From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751615AbbLURda (ORCPT ); Mon, 21 Dec 2015 12:33:30 -0500 Received: from mail.skyhub.de ([78.46.96.112]:39329 "EHLO mail.skyhub.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751434AbbLURd1 (ORCPT ); Mon, 21 Dec 2015 12:33:27 -0500 Date: Mon, 21 Dec 2015 18:33:10 +0100 From: Borislav Petkov To: "Elliott, Robert (Persistent Memory)" Cc: Dan Williams , "Luck, Tony" , linux-nvdimm , X86 ML , "linux-kernel@vger.kernel.org" , Linux MM , Andy Lutomirski , Andrew Morton , Ingo Molnar Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks Message-ID: <20151221173310.GD21582@pd.tnic> References: <23b2515da9d06b198044ad83ca0a15ba38c24e6e.1449861203.git.tony.luck@intel.com> <20151215131135.GE25973@pd.tnic> <3908561D78D1C84285E8C5FCA982C28F39F8566E@ORSMSX114.amr.corp.intel.com> <20151215183924.GJ25973@pd.tnic> <94D0CD8314A33A4D9D801C0FE68B40295BE9F290@G4W3202.americas.hpqcorp.net> <20151215192837.GL25973@pd.tnic> <94D0CD8314A33A4D9D801C0FE68B40295BE9F3D5@G4W3202.americas.hpqcorp.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295BE9F3D5@G4W3202.americas.hpqcorp.net> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 15, 2015 at 08:25:37PM +0000, Elliott, Robert (Persistent Memory) wrote: > This isn't exactly what you're looking for, but here is > an example of fio doing reads from pmem devices (reading > from NVDIMMs, writing to DIMMs) with various transfer > sizes. ... and "fio" is? > At 256 KiB, all the main memory buffers fit in the CPU > caches, so no write traffic appears on DDR (just the reads > from the NVDIMMs).
At 1 MiB, the data spills out of the
> caches, and writes to the DIMMs end up on DDR.
>
> Although DDR is busier, fio gets a lot less work done:
> * 256 KiB: 90 GiB/s by fio
> * 1 MiB: 49 GiB/s by fio

Yeah, I don't think that answers the question I had: whether REP; MOVSB is faster/better than using non-temporal stores. But you say that already above.

Also, if you do non-temporal stores then you're expected to have *more* memory controller and DIMM traffic as you're pushing everything out through the WCC.

What would need to be measured instead is, IMO, two things:

* compare NTI vs REP; MOVSB data movement to see the differences in performance aspects

* run a benchmark (no idea which one) which would measure the positive impact of the NTI versions which do not pollute the cache and thus do not hurt other workloads' working set being pushed out of the cache.

Also, we don't really know (at least I don't) what REP; MOVSB improvements hide behind those enhanced fast string optimizations. It could be that microcode is doing some aggregation into cachelines and doing much bigger writes which could compensate for the cache pollution.

Questions over questions...

> We could try modifying pmem to use its own non-temporal
> memcpy functions (I've posted experimental patches
> before that did this) to see if that transition point
> shifts. We can also watch the CPU cache statistics
> while running.
>
> Here are statistics from Intel's pcm-memory.x
> (pardon the wide formatting):
>
> 256 KiB
> =======
> pmem0: (groupid=0, jobs=40): err= 0: pid=20867: Tue Nov 24 18:20:08 2015
>   read : io=5219.1GB, bw=89079MB/s, iops=356314, runt= 60006msec
>   cpu : usr=1.74%, sys=96.16%, ctx=49576, majf=0, minf=21997
>
> Run status group 0 (all jobs):
>    READ: io=5219.1GB, aggrb=89079MB/s, minb=89079MB/s, maxb=89079MB/s, mint=60006msec, maxt=60006msec
>
> |---------------------------------------||---------------------------------------|
> |--             Socket  0             --||--             Socket  1             --|
> |---------------------------------------||---------------------------------------|
> |--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
> |---------------------------------------||---------------------------------------|
> |-- Mem Ch  0: Reads (MB/s): 11778.11 --||-- Mem Ch  0: Reads (MB/s): 11743.99 --|
> |--            Writes(MB/s):    51.83 --||--            Writes(MB/s):    43.25 --|
> |-- Mem Ch  1: Reads (MB/s): 11779.90 --||-- Mem Ch  1: Reads (MB/s): 11736.06 --|
> |--            Writes(MB/s):    48.73 --||--            Writes(MB/s):    37.86 --|
> |-- Mem Ch  4: Reads (MB/s): 11784.79 --||-- Mem Ch  4: Reads (MB/s): 11746.94 --|
> |--            Writes(MB/s):    52.90 --||--            Writes(MB/s):    43.73 --|
> |-- Mem Ch  5: Reads (MB/s): 11778.48 --||-- Mem Ch  5: Reads (MB/s): 11741.55 --|
> |--            Writes(MB/s):    47.62 --||--            Writes(MB/s):    37.80 --|
> |-- NODE 0 Mem Read (MB/s) : 47121.27 --||-- NODE 1 Mem Read (MB/s) : 46968.53 --|
> |-- NODE 0 Mem Write(MB/s) :   201.08 --||-- NODE 1 Mem Write(MB/s) :   162.65 --|
> |-- NODE 0 P. Write (T/s):     190927 --||-- NODE 1 P. Write (T/s):     182961 --|

What does T/s mean?

> |-- NODE 0 Memory (MB/s):    47322.36 --||-- NODE 1 Memory (MB/s):    47131.17 --|
> |---------------------------------------||---------------------------------------|
> |---------------------------------------||---------------------------------------|
> |--                   System Read Throughput(MB/s):   94089.80                 --|
> |--                  System Write Throughput(MB/s):     363.73                 --|
> |--                 System Memory Throughput(MB/s):   94453.52                 --|
> |---------------------------------------||---------------------------------------|
>
> 1 MiB
> =====
> |---------------------------------------||---------------------------------------|
> |--             Socket  0             --||--             Socket  1             --|
> |---------------------------------------||---------------------------------------|
> |--     Memory Channel Monitoring     --||--     Memory Channel Monitoring     --|
> |---------------------------------------||---------------------------------------|
> |-- Mem Ch  0: Reads (MB/s):  7227.83 --||-- Mem Ch  0: Reads (MB/s):  7047.45 --|
> |--            Writes(MB/s):  5894.47 --||--            Writes(MB/s):  6010.66 --|
> |-- Mem Ch  1: Reads (MB/s):  7229.32 --||-- Mem Ch  1: Reads (MB/s):  7041.79 --|
> |--            Writes(MB/s):  5891.38 --||--            Writes(MB/s):  6003.19 --|
> |-- Mem Ch  4: Reads (MB/s):  7230.70 --||-- Mem Ch  4: Reads (MB/s):  7052.44 --|
> |--            Writes(MB/s):  5888.63 --||--            Writes(MB/s):  6012.49 --|
> |-- Mem Ch  5: Reads (MB/s):  7229.16 --||-- Mem Ch  5: Reads (MB/s):  7047.19 --|
> |--            Writes(MB/s):  5882.45 --||--            Writes(MB/s):  6008.11 --|
> |-- NODE 0 Mem Read (MB/s) : 28917.01 --||-- NODE 1 Mem Read (MB/s) : 28188.87 --|
> |-- NODE 0 Mem Write(MB/s) : 23556.93 --||-- NODE 1 Mem Write(MB/s) : 24034.46 --|
> |-- NODE 0 P. Write (T/s):     238713 --||-- NODE 1 P. Write (T/s):     228040 --|
> |-- NODE 0 Memory (MB/s):    52473.94 --||-- NODE 1 Memory (MB/s):    52223.33 --|
> |---------------------------------------||---------------------------------------|
> |---------------------------------------||---------------------------------------|
> |--                   System Read Throughput(MB/s):   57105.87                 --|
> |--                  System Write Throughput(MB/s):   47591.39                 --|
> |--                 System Memory Throughput(MB/s):  104697.27                 --|
> |---------------------------------------||---------------------------------------|

Looks to me like, because writes have increased, the read bandwidth has dropped too, which makes sense.

--
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
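
As a rough cross-check of that last observation, the system-wide totals from the quoted pcm-memory.x output can be compared directly. The sketch below is not part of the thread: the struct and function names are made up for illustration, and only the four MB/s figures are taken from the tables quoted above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * System-wide totals in MB/s, copied from the pcm-memory.x output quoted
 * above. The type and all names here are illustrative assumptions; they do
 * not come from pcm, fio, or the kernel.
 */
struct pcm_totals {
    double read_mbps;
    double write_mbps;
};

static const struct pcm_totals run_256k = { 94089.80, 363.73 };
static const struct pcm_totals run_1m   = { 57105.87, 47591.39 };

/* Combined read + write traffic seen on the memory channels. */
static double total_mbps(struct pcm_totals t)
{
    return t.read_mbps + t.write_mbps;
}

/* Fraction of the memory traffic that is writes (0.0 to 1.0). */
static double write_share(struct pcm_totals t)
{
    double total = total_mbps(t);

    assert(total > 0.0);
    return t.write_mbps / total;
}

/* Relative change from 'before' to 'after'; negative means a drop. */
static double rel_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before;
}

/* Print the comparison between the 256 KiB and 1 MiB runs. */
void print_summary(void)
{
    printf("read throughput change:  %+.1f%%\n",
           100.0 * rel_change(run_256k.read_mbps, run_1m.read_mbps));
    printf("total traffic change:    %+.1f%%\n",
           100.0 * rel_change(total_mbps(run_256k), total_mbps(run_1m)));
    printf("write share at 256 KiB:  %.1f%%\n", 100.0 * write_share(run_256k));
    printf("write share at 1 MiB:    %.1f%%\n", 100.0 * write_share(run_1m));
}
```

Calling print_summary() from a main() shows read throughput falling by roughly 39% between the two runs while combined read and write traffic rises by roughly 11%, with writes going from under 1% to about 45% of the traffic. That matches the reading that the extra write traffic is displacing reads, but it says nothing about whether non-temporal stores would change the picture.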