* [PATCH v4 0/5] Revisit MCE handling for UE Errors
@ 2017-09-29 4:26 Balbir Singh
2017-09-29 4:26 ` [PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Balbir Singh @ 2017-09-29 4:26 UTC (permalink / raw)
To: mpe; +Cc: linuxppc-dev, Balbir Singh
This patch series is designed to hook up memory_failure on
UE errors, this is specially helpful for user_mode UE errors.
The first two patches cleanup bits, remove dead code.
I could not find any users of get_mce_fault_addr().
The second one improves printing of physical address
The third patch walks kernel/user mode page tables in
real mode to extract the effective address of the instruction
that caused the UE error and the effective address it was
trying to access (for load/store). The fourth patch hooks
up the pfn for instruction UE errors (ierror).
The fifth patch hooks up memory_failure to the MCE patch.
TODO:
Log the address in NVRAM, so that we can recover from
bad pages at boot and keep the blacklist persistent.
Changelog v4:
- Add a #define for MAX_MCE_DEPTH instead of hard
coding the value
- Refactor addr_to_pfn to deduce the right parameters
Changelog v3:
- Check for recursive MCE invocations (suggested
by Nick)
Changelog v2:
- Remove pr_warn from real mode to a more delayed
context, where its OK to warn.
- Add a trivial patch to align prints of physical
and effective address
Changelog v1:
- Address review comments from Nick and Mahesh
(initialization of pfn and more comments on failure
when addr_to_pfn() or anaylse_instr() fail)
- Hookup ierrors to the framework as well
(comments from Mahesh)
Balbir Singh (5):
powerpc/mce.c: Remove unused function get_mce_fault_addr()
powerpc/mce: align the print of physical address better
powerpc/mce: Hookup derror (load/store) UE errors
powerpc/mce: Hookup ierror (instruction) UE errors
powerpc/mce: hookup memory_failure for UE errors
arch/powerpc/include/asm/exception-64s.h | 5 ++
arch/powerpc/include/asm/mce.h | 4 +-
arch/powerpc/kernel/exceptions-64s.S | 2 +-
arch/powerpc/kernel/mce.c | 117 +++++++++++++++++++------------
arch/powerpc/kernel/mce_power.c | 109 +++++++++++++++++++++++++---
5 files changed, 181 insertions(+), 56 deletions(-)
--
2.9.5
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr()
2017-09-29 4:26 [PATCH v4 0/5] Revisit MCE handling for UE Errors Balbir Singh
@ 2017-09-29 4:26 ` Balbir Singh
2017-10-16 5:13 ` Michael Ellerman
2017-10-19 4:48 ` [v4, " Michael Ellerman
2017-09-29 4:26 ` [PATCH v4 2/5] powerpc/mce: align the print of physical address better Balbir Singh
` (3 subsequent siblings)
4 siblings, 2 replies; 16+ messages in thread
From: Balbir Singh @ 2017-09-29 4:26 UTC (permalink / raw)
To: mpe; +Cc: linuxppc-dev, Balbir Singh
There are no users of get_mce_fault_addr()
Fixes: b63a0ff ("powerpc/powernv: Machine check exception handling.")
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/mce.h | 2 --
arch/powerpc/kernel/mce.c | 39 ---------------------------------------
2 files changed, 41 deletions(-)
diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 190d69a..75292c7 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -210,6 +210,4 @@ extern void release_mce_event(void);
extern void machine_check_queue_event(void);
extern void machine_check_print_event_info(struct machine_check_event *evt,
bool user_mode);
-extern uint64_t get_mce_fault_addr(struct machine_check_event *evt);
-
#endif /* __ASM_PPC64_MCE_H__ */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index 9b2ea7e..e254399 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -411,45 +411,6 @@ void machine_check_print_event_info(struct machine_check_event *evt,
}
EXPORT_SYMBOL_GPL(machine_check_print_event_info);
-uint64_t get_mce_fault_addr(struct machine_check_event *evt)
-{
- switch (evt->error_type) {
- case MCE_ERROR_TYPE_UE:
- if (evt->u.ue_error.effective_address_provided)
- return evt->u.ue_error.effective_address;
- break;
- case MCE_ERROR_TYPE_SLB:
- if (evt->u.slb_error.effective_address_provided)
- return evt->u.slb_error.effective_address;
- break;
- case MCE_ERROR_TYPE_ERAT:
- if (evt->u.erat_error.effective_address_provided)
- return evt->u.erat_error.effective_address;
- break;
- case MCE_ERROR_TYPE_TLB:
- if (evt->u.tlb_error.effective_address_provided)
- return evt->u.tlb_error.effective_address;
- break;
- case MCE_ERROR_TYPE_USER:
- if (evt->u.user_error.effective_address_provided)
- return evt->u.user_error.effective_address;
- break;
- case MCE_ERROR_TYPE_RA:
- if (evt->u.ra_error.effective_address_provided)
- return evt->u.ra_error.effective_address;
- break;
- case MCE_ERROR_TYPE_LINK:
- if (evt->u.link_error.effective_address_provided)
- return evt->u.link_error.effective_address;
- break;
- default:
- case MCE_ERROR_TYPE_UNKNOWN:
- break;
- }
- return 0;
-}
-EXPORT_SYMBOL(get_mce_fault_addr);
-
/*
* This function is called in real mode. Strictly no printk's please.
*
--
2.9.5
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v4 2/5] powerpc/mce: align the print of physical address better
2017-09-29 4:26 [PATCH v4 0/5] Revisit MCE handling for UE Errors Balbir Singh
2017-09-29 4:26 ` [PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
@ 2017-09-29 4:26 ` Balbir Singh
2017-10-16 5:18 ` Michael Ellerman
2017-09-29 4:26 ` [PATCH v4 3/5] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
` (2 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Balbir Singh @ 2017-09-29 4:26 UTC (permalink / raw)
To: mpe; +Cc: linuxppc-dev, Balbir Singh
Use the same alignment as Effective address and rename
phyiscal address to Page Frame Number
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/kernel/mce.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index e254399..fef1408 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -340,7 +340,7 @@ void machine_check_print_event_info(struct machine_check_event *evt,
printk("%s Effective address: %016llx\n",
level, evt->u.ue_error.effective_address);
if (evt->u.ue_error.physical_address_provided)
- printk("%s Physical address: %016llx\n",
+ printk("%s Physical address: %016llx\n",
level, evt->u.ue_error.physical_address);
break;
case MCE_ERROR_TYPE_SLB:
--
2.9.5
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v4 3/5] powerpc/mce: Hookup derror (load/store) UE errors
2017-09-29 4:26 [PATCH v4 0/5] Revisit MCE handling for UE Errors Balbir Singh
2017-09-29 4:26 ` [PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
2017-09-29 4:26 ` [PATCH v4 2/5] powerpc/mce: align the print of physical address better Balbir Singh
@ 2017-09-29 4:26 ` Balbir Singh
2017-10-16 5:36 ` Michael Ellerman
2017-09-29 4:26 ` [PATCH v4 4/5] powerpc/mce: Hookup ierror (instruction) " Balbir Singh
2017-09-29 4:26 ` [PATCH v4 5/5] powerpc/mce: hookup memory_failure for " Balbir Singh
4 siblings, 1 reply; 16+ messages in thread
From: Balbir Singh @ 2017-09-29 4:26 UTC (permalink / raw)
To: mpe; +Cc: linuxppc-dev, Balbir Singh
Extract physical_address for UE errors by walking the page
tables for the mm and address at the NIP, to extract the
instruction. Then use the instruction to find the effective
address via analyse_instr().
We might have page table walking races, but we expect them to
be rare, the physical address extraction is best effort. The idea
is to then hook up this infrastructure to memory failure eventually.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/exception-64s.h | 5 ++
arch/powerpc/include/asm/mce.h | 2 +-
arch/powerpc/kernel/exceptions-64s.S | 2 +-
arch/powerpc/kernel/mce.c | 6 ++-
arch/powerpc/kernel/mce_power.c | 87 ++++++++++++++++++++++++++++++--
5 files changed, 94 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
index 9a31897..b272052 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -55,6 +55,11 @@
#endif
/*
+ * maximum recursive depth of MCE exceptions
+ */
+#define MAX_MCE_DEPTH 4
+
+/*
* EX_LR is only used in EXSLB and where it does not overlap with EX_DAR
* EX_CCR similarly with DSISR, but being 4 byte registers there is a hole
* in the save area so it's not necessary to overlap them. Could be used
diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 75292c7..3a1226e 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -204,7 +204,7 @@ struct mce_error_info {
extern void save_mce_event(struct pt_regs *regs, long handled,
struct mce_error_info *mce_err, uint64_t nip,
- uint64_t addr);
+ uint64_t addr, uint64_t phys_addr);
extern int get_mce_event(struct machine_check_event *mce, bool release);
extern void release_mce_event(void);
extern void machine_check_queue_event(void);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 48da0f5..3a6c8c8 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -232,7 +232,7 @@ BEGIN_FTR_SECTION
addi r10,r10,1 /* increment paca->in_mce */
sth r10,PACA_IN_MCE(r13)
/* Limit nested MCE to level 4 to avoid stack overflow */
- cmpwi r10,4
+ cmpwi r10,MAX_MCE_DEPTH
bgt 2f /* Check if we hit limit of 4 */
std r11,GPR1(r1) /* Save r1 on the stack. */
std r11,0(r1) /* make stack chain pointer */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index fef1408..ee148f4 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -82,7 +82,7 @@ static void mce_set_error_info(struct machine_check_event *mce,
*/
void save_mce_event(struct pt_regs *regs, long handled,
struct mce_error_info *mce_err,
- uint64_t nip, uint64_t addr)
+ uint64_t nip, uint64_t addr, uint64_t phys_addr)
{
int index = __this_cpu_inc_return(mce_nest_count) - 1;
struct machine_check_event *mce = this_cpu_ptr(&mce_event[index]);
@@ -140,6 +140,10 @@ void save_mce_event(struct pt_regs *regs, long handled,
} else if (mce->error_type == MCE_ERROR_TYPE_UE) {
mce->u.ue_error.effective_address_provided = true;
mce->u.ue_error.effective_address = addr;
+ if (phys_addr != ULONG_MAX) {
+ mce->u.ue_error.physical_address_provided = true;
+ mce->u.ue_error.physical_address = phys_addr;
+ }
}
return;
}
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index b76ca19..0e584d5 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -27,6 +27,36 @@
#include <asm/mmu.h>
#include <asm/mce.h>
#include <asm/machdep.h>
+#include <asm/pgtable.h>
+#include <asm/pte-walk.h>
+#include <asm/sstep.h>
+#include <asm/exception-64s.h>
+
+/*
+ * Convert an address related to an mm to a PFN. NOTE: we are in real
+ * mode, we could potentially race with page table updates.
+ */
+static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
+{
+ pte_t *ptep;
+ unsigned long flags;
+ struct mm_struct *mm;
+
+ if (user_mode(regs))
+ mm = current->mm;
+ else
+ mm = &init_mm;
+
+ local_irq_save(flags);
+ if (mm == current->mm)
+ ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
+ else
+ ptep = find_init_mm_pte(addr, NULL);
+ local_irq_restore(flags);
+ if (!ptep || pte_special(*ptep))
+ return ULONG_MAX;
+ return pte_pfn(*ptep);
+}
static void flush_tlb_206(unsigned int num_sets, unsigned int action)
{
@@ -421,6 +451,41 @@ static const struct mce_derror_table mce_p9_derror_table[] = {
MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, },
{ 0, false, 0, 0, 0, 0 } };
+static int mce_find_instr_ea_and_pfn(struct pt_regs *regs, uint64_t *addr,
+ uint64_t *phys_addr)
+{
+ /*
+ * Carefully look at the NIP to determine
+ * the instruction to analyse. Reading the NIP
+ * in real-mode is tricky and can lead to recursive
+ * faults
+ */
+ int instr;
+ unsigned long pfn, instr_addr;
+ struct instruction_op op;
+ struct pt_regs tmp = *regs;
+
+ pfn = addr_to_pfn(regs, regs->nip);
+ if (pfn != ULONG_MAX) {
+ instr_addr = (pfn << PAGE_SHIFT) + (regs->nip & ~PAGE_MASK);
+ instr = *(unsigned int *)(instr_addr);
+ if (!analyse_instr(&op, &tmp, instr)) {
+ pfn = addr_to_pfn(regs, op.ea);
+ *addr = op.ea;
+ *phys_addr = (pfn << PAGE_SHIFT);
+ return 0;
+ }
+ /*
+ * analyse_instr() might fail if the instruction
+ * is not a load/store, although this is unexpected
+ * for load/store errors or if we got the NIP
+ * wrong
+ */
+ }
+ *addr = 0;
+ return -1;
+}
+
static int mce_handle_ierror(struct pt_regs *regs,
const struct mce_ierror_table table[],
struct mce_error_info *mce_err, uint64_t *addr)
@@ -489,7 +554,8 @@ static int mce_handle_ierror(struct pt_regs *regs,
static int mce_handle_derror(struct pt_regs *regs,
const struct mce_derror_table table[],
- struct mce_error_info *mce_err, uint64_t *addr)
+ struct mce_error_info *mce_err, uint64_t *addr,
+ uint64_t *phys_addr)
{
uint64_t dsisr = regs->dsisr;
int handled = 0;
@@ -555,7 +621,17 @@ static int mce_handle_derror(struct pt_regs *regs,
mce_err->initiator = table[i].initiator;
if (table[i].dar_valid)
*addr = regs->dar;
-
+ else if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
+ table[i].error_type == MCE_ERROR_TYPE_UE) {
+ /*
+ * We do a maximum of 4 nested MCE calls, see
+ * kernel/exception-64s.h
+ */
+ if (get_paca()->in_mce < MAX_MCE_DEPTH)
+ if (!mce_find_instr_ea_and_pfn(regs, addr,
+ phys_addr))
+ handled = 1;
+ }
found = 1;
}
@@ -592,19 +668,20 @@ static long mce_handle_error(struct pt_regs *regs,
const struct mce_ierror_table itable[])
{
struct mce_error_info mce_err = { 0 };
- uint64_t addr;
+ uint64_t addr, phys_addr;
uint64_t srr1 = regs->msr;
long handled;
if (SRR1_MC_LOADSTORE(srr1))
- handled = mce_handle_derror(regs, dtable, &mce_err, &addr);
+ handled = mce_handle_derror(regs, dtable, &mce_err, &addr,
+ &phys_addr);
else
handled = mce_handle_ierror(regs, itable, &mce_err, &addr);
if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
handled = mce_handle_ue_error(regs);
- save_mce_event(regs, handled, &mce_err, regs->nip, addr);
+ save_mce_event(regs, handled, &mce_err, regs->nip, addr, phys_addr);
return handled;
}
--
2.9.5
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v4 4/5] powerpc/mce: Hookup ierror (instruction) UE errors
2017-09-29 4:26 [PATCH v4 0/5] Revisit MCE handling for UE Errors Balbir Singh
` (2 preceding siblings ...)
2017-09-29 4:26 ` [PATCH v4 3/5] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
@ 2017-09-29 4:26 ` Balbir Singh
2017-09-29 4:26 ` [PATCH v4 5/5] powerpc/mce: hookup memory_failure for " Balbir Singh
4 siblings, 0 replies; 16+ messages in thread
From: Balbir Singh @ 2017-09-29 4:26 UTC (permalink / raw)
To: mpe; +Cc: linuxppc-dev, Balbir Singh
Hookup instruction errors (UE) for memory offling via memory_failure()
in a manner similar to load/store errors (derror). Since we have access
to the NIP, the conversion is a one step process in this case.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/kernel/mce_power.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 0e584d5..2a888f6 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -488,7 +488,8 @@ static int mce_find_instr_ea_and_pfn(struct pt_regs *regs, uint64_t *addr,
static int mce_handle_ierror(struct pt_regs *regs,
const struct mce_ierror_table table[],
- struct mce_error_info *mce_err, uint64_t *addr)
+ struct mce_error_info *mce_err, uint64_t *addr,
+ uint64_t *phys_addr)
{
uint64_t srr1 = regs->msr;
int handled = 0;
@@ -540,8 +541,22 @@ static int mce_handle_ierror(struct pt_regs *regs,
}
mce_err->severity = table[i].severity;
mce_err->initiator = table[i].initiator;
- if (table[i].nip_valid)
+ if (table[i].nip_valid) {
*addr = regs->nip;
+ if (mce_err->severity == MCE_SEV_ERROR_SYNC &&
+ table[i].error_type == MCE_ERROR_TYPE_UE) {
+ unsigned long pfn;
+
+ if (get_paca()->in_mce < MAX_MCE_DEPTH) {
+ pfn = addr_to_pfn(regs, regs->nip);
+ if (pfn != ULONG_MAX) {
+ *phys_addr =
+ (pfn << PAGE_SHIFT);
+ handled = 1;
+ }
+ }
+ }
+ }
return handled;
}
@@ -676,7 +691,8 @@ static long mce_handle_error(struct pt_regs *regs,
handled = mce_handle_derror(regs, dtable, &mce_err, &addr,
&phys_addr);
else
- handled = mce_handle_ierror(regs, itable, &mce_err, &addr);
+ handled = mce_handle_ierror(regs, itable, &mce_err, &addr,
+ &phys_addr);
if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
handled = mce_handle_ue_error(regs);
--
2.9.5
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v4 5/5] powerpc/mce: hookup memory_failure for UE errors
2017-09-29 4:26 [PATCH v4 0/5] Revisit MCE handling for UE Errors Balbir Singh
` (3 preceding siblings ...)
2017-09-29 4:26 ` [PATCH v4 4/5] powerpc/mce: Hookup ierror (instruction) " Balbir Singh
@ 2017-09-29 4:26 ` Balbir Singh
2017-10-16 5:38 ` Michael Ellerman
4 siblings, 1 reply; 16+ messages in thread
From: Balbir Singh @ 2017-09-29 4:26 UTC (permalink / raw)
To: mpe; +Cc: linuxppc-dev, Balbir Singh
If we are in user space and hit a UE error, we now have the
basic infrastructure to walk the page tables and find out
the effective address that was accessed, since the DAR
is not valid.
We use a work_queue content to hookup the bad pfn, any
other context causes problems, since memory_failure itself
can call into schedule() via lru_drain_ bits.
We could probably poison the struct page to avoid a race
between detection and taking corrective action.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/kernel/mce.c | 70 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ee148f4..299553e 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -39,11 +39,21 @@ static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT], mce_event);
static DEFINE_PER_CPU(int, mce_queue_count);
static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT], mce_event_queue);
+/* Queue for delayed MCE UE events. */
+static DEFINE_PER_CPU(int, mce_ue_count);
+static DEFINE_PER_CPU(struct machine_check_event[MAX_MC_EVT],
+ mce_ue_event_queue);
+
static void machine_check_process_queued_event(struct irq_work *work);
+void machine_check_ue_event(struct machine_check_event *evt);
+static void machine_process_ue_event(struct work_struct *work);
+
static struct irq_work mce_event_process_work = {
.func = machine_check_process_queued_event,
};
+DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
+
static void mce_set_error_info(struct machine_check_event *mce,
struct mce_error_info *mce_err)
{
@@ -143,6 +153,7 @@ void save_mce_event(struct pt_regs *regs, long handled,
if (phys_addr != ULONG_MAX) {
mce->u.ue_error.physical_address_provided = true;
mce->u.ue_error.physical_address = phys_addr;
+ machine_check_ue_event(mce);
}
}
return;
@@ -197,6 +208,26 @@ void release_mce_event(void)
get_mce_event(NULL, true);
}
+
+/*
+ * Queue up the MCE event which then can be handled later.
+ */
+void machine_check_ue_event(struct machine_check_event *evt)
+{
+ int index;
+
+ index = __this_cpu_inc_return(mce_ue_count) - 1;
+ /* If queue is full, just return for now. */
+ if (index >= MAX_MC_EVT) {
+ __this_cpu_dec(mce_ue_count);
+ return;
+ }
+ memcpy(this_cpu_ptr(&mce_ue_event_queue[index]), evt, sizeof(*evt));
+
+ /* Queue work to process this event later. */
+ schedule_work(&mce_ue_event_work);
+}
+
/*
* Queue up the MCE event which then can be handled later.
*/
@@ -219,7 +250,39 @@ void machine_check_queue_event(void)
/* Queue irq work to process this event later. */
irq_work_queue(&mce_event_process_work);
}
-
+/*
+ * process pending MCE event from the mce event queue. This function will be
+ * called during syscall exit.
+ */
+static void machine_process_ue_event(struct work_struct *work)
+{
+ int index;
+ struct machine_check_event *evt;
+
+ while (__this_cpu_read(mce_ue_count) > 0) {
+ index = __this_cpu_read(mce_ue_count) - 1;
+ evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+#ifdef CONFIG_MEMORY_FAILURE
+ /*
+ * This should probably queued elsewhere, but
+ * oh! well
+ */
+ if (evt->error_type == MCE_ERROR_TYPE_UE) {
+ if (evt->u.ue_error.physical_address_provided) {
+ unsigned long pfn;
+
+ pfn = evt->u.ue_error.physical_address >>
+ PAGE_SHIFT;
+ memory_failure(pfn, SIGBUS, 0);
+ } else
+ pr_warn("Failed to identify bad address from "
+ "where the uncorrectable error (UE) "
+ "was generated\n");
+ }
+#endif
+ __this_cpu_dec(mce_ue_count);
+ }
+}
/*
* process pending MCE event from the mce event queue. This function will be
* called during syscall exit.
@@ -227,6 +290,7 @@ void machine_check_queue_event(void)
static void machine_check_process_queued_event(struct irq_work *work)
{
int index;
+ struct machine_check_event *evt;
add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
@@ -236,8 +300,8 @@ static void machine_check_process_queued_event(struct irq_work *work)
*/
while (__this_cpu_read(mce_queue_count) > 0) {
index = __this_cpu_read(mce_queue_count) - 1;
- machine_check_print_event_info(
- this_cpu_ptr(&mce_event_queue[index]), false);
+ evt = this_cpu_ptr(&mce_event_queue[index]);
+ machine_check_print_event_info(evt, false);
__this_cpu_dec(mce_queue_count);
}
}
--
2.9.5
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr()
2017-09-29 4:26 ` [PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
@ 2017-10-16 5:13 ` Michael Ellerman
2017-10-16 5:23 ` Balbir Singh
2017-10-19 4:48 ` [v4, " Michael Ellerman
1 sibling, 1 reply; 16+ messages in thread
From: Michael Ellerman @ 2017-10-16 5:13 UTC (permalink / raw)
To: Balbir Singh; +Cc: linuxppc-dev, Balbir Singh
Balbir Singh <bsingharora@gmail.com> writes:
> There are no users of get_mce_fault_addr()
>
> Fixes: b63a0ff ("powerpc/powernv: Machine check exception handling.")
That fixes line is wrong, get_mce_fault_addr() was used in that commit.
The last usage was removed in:
1363875bdb63 ("powerpc/64s: fix handling of non-synchronous machine checks")
So that's what the fixes tag should point at if anything, I'll update it
here.
cheers
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 2/5] powerpc/mce: align the print of physical address better
2017-09-29 4:26 ` [PATCH v4 2/5] powerpc/mce: align the print of physical address better Balbir Singh
@ 2017-10-16 5:18 ` Michael Ellerman
2017-10-16 5:36 ` Balbir Singh
0 siblings, 1 reply; 16+ messages in thread
From: Michael Ellerman @ 2017-10-16 5:18 UTC (permalink / raw)
To: Balbir Singh; +Cc: linuxppc-dev, Balbir Singh
Balbir Singh <bsingharora@gmail.com> writes:
> Use the same alignment as Effective address
> and rename phyiscal address to Page Frame Number
You didn't do the 2nd part AFAICS?
Will fix it up.
cheers
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index e254399..fef1408 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -340,7 +340,7 @@ void machine_check_print_event_info(struct machine_check_event *evt,
> printk("%s Effective address: %016llx\n",
> level, evt->u.ue_error.effective_address);
> if (evt->u.ue_error.physical_address_provided)
> - printk("%s Physical address: %016llx\n",
> + printk("%s Physical address: %016llx\n",
> level, evt->u.ue_error.physical_address);
> break;
> case MCE_ERROR_TYPE_SLB:
> --
> 2.9.5
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr()
2017-10-16 5:13 ` Michael Ellerman
@ 2017-10-16 5:23 ` Balbir Singh
0 siblings, 0 replies; 16+ messages in thread
From: Balbir Singh @ 2017-10-16 5:23 UTC (permalink / raw)
To: Michael Ellerman; +Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
On Mon, Oct 16, 2017 at 4:13 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Balbir Singh <bsingharora@gmail.com> writes:
>
>> There are no users of get_mce_fault_addr()
>>
>> Fixes: b63a0ff ("powerpc/powernv: Machine check exception handling.")
>
> That fixes line is wrong, get_mce_fault_addr() was used in that commit.
>
> The last usage was removed in:
>
> 1363875bdb63 ("powerpc/64s: fix handling of non-synchronous machine checks")
>
> So that's what the fixes tag should point at if anything, I'll update it
> here.
Sorry, my git search was really awe-full! Thanks for fixing it.
Balbir Singh
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 3/5] powerpc/mce: Hookup derror (load/store) UE errors
2017-09-29 4:26 ` [PATCH v4 3/5] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
@ 2017-10-16 5:36 ` Michael Ellerman
2017-10-16 5:38 ` Balbir Singh
0 siblings, 1 reply; 16+ messages in thread
From: Michael Ellerman @ 2017-10-16 5:36 UTC (permalink / raw)
To: Balbir Singh; +Cc: linuxppc-dev, Balbir Singh
Balbir Singh <bsingharora@gmail.com> writes:
> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
> index b76ca19..0e584d5 100644
> --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -27,6 +27,36 @@
> #include <asm/mmu.h>
> #include <asm/mce.h>
> #include <asm/machdep.h>
> +#include <asm/pgtable.h>
> +#include <asm/pte-walk.h>
> +#include <asm/sstep.h>
> +#include <asm/exception-64s.h>
> +
> +/*
> + * Convert an address related to an mm to a PFN. NOTE: we are in real
> + * mode, we could potentially race with page table updates.
> + */
That will blow some serious chunks if we're on pseries.
It seems it can't be triggered on pseries at the moment, but only
because of the feature section around the call to
machine_check_handle_early().
We should maybe move the powernv only code in here inside an #ifdef
POWERNV, just for documentation purposes. But we can do that later.
> +static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
> +{
> + pte_t *ptep;
> + unsigned long flags;
> + struct mm_struct *mm;
> +
> + if (user_mode(regs))
> + mm = current->mm;
> + else
> + mm = &init_mm;
> +
> + local_irq_save(flags);
> + if (mm == current->mm)
> + ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
> + else
> + ptep = find_init_mm_pte(addr, NULL);
> + local_irq_restore(flags);
> + if (!ptep || pte_special(*ptep))
> + return ULONG_MAX;
> + return pte_pfn(*ptep);
> +}
cheers
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 2/5] powerpc/mce: align the print of physical address better
2017-10-16 5:18 ` Michael Ellerman
@ 2017-10-16 5:36 ` Balbir Singh
0 siblings, 0 replies; 16+ messages in thread
From: Balbir Singh @ 2017-10-16 5:36 UTC (permalink / raw)
To: Michael Ellerman; +Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
On Mon, Oct 16, 2017 at 4:18 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Balbir Singh <bsingharora@gmail.com> writes:
>
>> Use the same alignment as Effective address
>> and rename phyiscal address to Page Frame Number
>
> You didn't do the 2nd part AFAICS?
>
> Will fix it up.
Sorry the changelog is buggy :( Nick asked me to keep the physical address.
I should have removed the changes when I reverted the changes
Balbir
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 3/5] powerpc/mce: Hookup derror (load/store) UE errors
2017-10-16 5:36 ` Michael Ellerman
@ 2017-10-16 5:38 ` Balbir Singh
0 siblings, 0 replies; 16+ messages in thread
From: Balbir Singh @ 2017-10-16 5:38 UTC (permalink / raw)
To: Michael Ellerman; +Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
On Mon, Oct 16, 2017 at 4:36 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Balbir Singh <bsingharora@gmail.com> writes:
>
>> diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
>> index b76ca19..0e584d5 100644
>> --- a/arch/powerpc/kernel/mce_power.c
>> +++ b/arch/powerpc/kernel/mce_power.c
>> @@ -27,6 +27,36 @@
>> #include <asm/mmu.h>
>> #include <asm/mce.h>
>> #include <asm/machdep.h>
>> +#include <asm/pgtable.h>
>> +#include <asm/pte-walk.h>
>> +#include <asm/sstep.h>
>> +#include <asm/exception-64s.h>
>> +
>> +/*
>> + * Convert an address related to an mm to a PFN. NOTE: we are in real
>> + * mode, we could potentially race with page table updates.
>> + */
>
> That will blow some serious chunks if we're on pseries.
>
> It seems it can't be triggered on pseries at the moment, but only
> because of the feature section around the call to
> machine_check_handle_early().
>
> We should maybe move the powernv only code in here inside an #ifdef
> POWERNV, just for documentation purposes. But we can do that later.
>
Will do that as a follow up thanks! I think the entire mce_power.c and mce.c
can move to pseries, will validate with Mahesh and Nick as well.
Balbir
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 5/5] powerpc/mce: hookup memory_failure for UE errors
2017-09-29 4:26 ` [PATCH v4 5/5] powerpc/mce: hookup memory_failure for " Balbir Singh
@ 2017-10-16 5:38 ` Michael Ellerman
2017-10-16 5:40 ` Balbir Singh
0 siblings, 1 reply; 16+ messages in thread
From: Michael Ellerman @ 2017-10-16 5:38 UTC (permalink / raw)
To: Balbir Singh; +Cc: linuxppc-dev, Balbir Singh
Balbir Singh <bsingharora@gmail.com> writes:
> If we are in user space and hit a UE error, we now have the
> basic infrastructure to walk the page tables and find out
> the effective address that was accessed, since the DAR
> is not valid.
>
> We use a work_queue content to hookup the bad pfn, any
> other context causes problems, since memory_failure itself
> can call into schedule() via lru_drain_ bits.
>
> We could probably poison the struct page to avoid a race
> between detection and taking corrective action.
>
> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> arch/powerpc/kernel/mce.c | 70 +++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 67 insertions(+), 3 deletions(-)
I'm not sure why this is in mce.c but the rest was in mce_power.c ?
cheers
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 5/5] powerpc/mce: hookup memory_failure for UE errors
2017-10-16 5:38 ` Michael Ellerman
@ 2017-10-16 5:40 ` Balbir Singh
2017-10-16 12:07 ` Michael Ellerman
0 siblings, 1 reply; 16+ messages in thread
From: Balbir Singh @ 2017-10-16 5:40 UTC (permalink / raw)
To: Michael Ellerman; +Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
On Mon, Oct 16, 2017 at 4:38 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Balbir Singh <bsingharora@gmail.com> writes:
>
>> If we are in user space and hit a UE error, we now have the
>> basic infrastructure to walk the page tables and find out
>> the effective address that was accessed, since the DAR
>> is not valid.
>>
>> We use a work_queue content to hookup the bad pfn, any
>> other context causes problems, since memory_failure itself
>> can call into schedule() via lru_drain_ bits.
>>
>> We could probably poison the struct page to avoid a race
>> between detection and taking corrective action.
>>
>> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> arch/powerpc/kernel/mce.c | 70 +++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 67 insertions(+), 3 deletions(-)
>
> I'm not sure why this is in mce.c but the rest was in mce_power.c ?
The way the code is organized is that save_mce_event is implemented here
and called from mce_power.c. save_mce_event() does the processing
of the event.
Balbir Singh.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4 5/5] powerpc/mce: hookup memory_failure for UE errors
2017-10-16 5:40 ` Balbir Singh
@ 2017-10-16 12:07 ` Michael Ellerman
0 siblings, 0 replies; 16+ messages in thread
From: Michael Ellerman @ 2017-10-16 12:07 UTC (permalink / raw)
To: Balbir Singh; +Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
Balbir Singh <bsingharora@gmail.com> writes:
> On Mon, Oct 16, 2017 at 4:38 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
>> Balbir Singh <bsingharora@gmail.com> writes:
>>
>>> If we are in user space and hit a UE error, we now have the
>>> basic infrastructure to walk the page tables and find out
>>> the effective address that was accessed, since the DAR
>>> is not valid.
>>>
>>> We use a work_queue content to hookup the bad pfn, any
>>> other context causes problems, since memory_failure itself
>>> can call into schedule() via lru_drain_ bits.
>>>
>>> We could probably poison the struct page to avoid a race
>>> between detection and taking corrective action.
>>>
>>> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
>>> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>> arch/powerpc/kernel/mce.c | 70 +++++++++++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 67 insertions(+), 3 deletions(-)
>>
>> I'm not sure why this is in mce.c but the rest was in mce_power.c ?
>
> The way the code is organized is that save_mce_event is implemented here
> and called from mce_power.c. save_mce_event() does the processing
> of the event.
save_mce_event() is only called from mce_power.c :/
I'd be happy if it was mce_real_mode.c and mce.c, that would be a good
distinction to make.
Though even then, most of the code is Book3s64 specific I think so maybe
it shouldn't be in just mce.c.
Anyway I'll merge this and we can improve things later.
cheers
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [v4, 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr()
2017-09-29 4:26 ` [PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
2017-10-16 5:13 ` Michael Ellerman
@ 2017-10-19 4:48 ` Michael Ellerman
1 sibling, 0 replies; 16+ messages in thread
From: Michael Ellerman @ 2017-10-19 4:48 UTC (permalink / raw)
To: Balbir Singh; +Cc: linuxppc-dev
On Fri, 2017-09-29 at 04:26:51 UTC, Balbir Singh wrote:
> There are no users of get_mce_fault_addr()
>
> Fixes: b63a0ff ("powerpc/powernv: Machine check exception handling.")
>
> Signed-off-by: Balbir Singh <bsingharora@gmail.com>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Series applied to powerpc next, thanks.
https://git.kernel.org/powerpc/c/73e341eb6bea01fde706d10d7edba9
cheers
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2017-10-19 4:48 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-09-29 4:26 [PATCH v4 0/5] Revisit MCE handling for UE Errors Balbir Singh
2017-09-29 4:26 ` [PATCH v4 1/5] powerpc/mce.c: Remove unused function get_mce_fault_addr() Balbir Singh
2017-10-16 5:13 ` Michael Ellerman
2017-10-16 5:23 ` Balbir Singh
2017-10-19 4:48 ` [v4, " Michael Ellerman
2017-09-29 4:26 ` [PATCH v4 2/5] powerpc/mce: align the print of physical address better Balbir Singh
2017-10-16 5:18 ` Michael Ellerman
2017-10-16 5:36 ` Balbir Singh
2017-09-29 4:26 ` [PATCH v4 3/5] powerpc/mce: Hookup derror (load/store) UE errors Balbir Singh
2017-10-16 5:36 ` Michael Ellerman
2017-10-16 5:38 ` Balbir Singh
2017-09-29 4:26 ` [PATCH v4 4/5] powerpc/mce: Hookup ierror (instruction) " Balbir Singh
2017-09-29 4:26 ` [PATCH v4 5/5] powerpc/mce: hookup memory_failure for " Balbir Singh
2017-10-16 5:38 ` Michael Ellerman
2017-10-16 5:40 ` Balbir Singh
2017-10-16 12:07 ` Michael Ellerman
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).