* [PATCH 0/3] Fix MCE handling for AMD multi-node processors
@ 2014-12-22 20:10 Aravind Gopalakrishnan
From: Aravind Gopalakrishnan @ 2014-12-22 20:10 UTC
  To: tglx, mingo, hpa, tony.luck, bp, dougthompson, mchehab, x86,
	linux-kernel, linux-edac
  Cc: dave.hansen, mgorman, bp, riel, jacob.w.shin,
	Aravind Gopalakrishnan

When an MCE that is to be logged in bank 4 occurs on an AMD multi-node
processor, it is reported only to the node base core (NBC) of the node
containing the CPU on which the error occurred.

Refer to D18F3x44[NbMcaToMstCpuEn] in the BKDGs for Fam10h and later for
details on how MC4 errors are reported only to the NBC MSRs.

We currently don't have the exception handler wired up to handle this.
As a consequence, do_machine_check only runs on the core on which the
error occurred and, since according to the BKDGs reads of the MC4_STATUS
MSR on a non-NBC core are simply RAZ (read as zero), the exception is
ignored on that core.

This is a problem because it means we drop MCEs.
I tested this on Fam10h and Fam15h using mce_amd_inj and by triggering
a real HW MCE using Boris' new interface, and can confirm the behavior.

This patch set fixes the issue by looking at the NBC MSRs when bank 4
errors happen on AMD multi-node processors.
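
To illustrate the idea, here is a minimal user-space sketch, not the
code from the patches. It models the MCi_STATUS registers as a plain
array and assumes a fixed CORES_PER_NODE layout; node_base_core(),
read_status() and read_status_fixed() are hypothetical names invented
for this example and do not correspond to kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS        8
#define CORES_PER_NODE 4	/* assumption for this toy model */
#define NR_BANKS       6
#define MC4_BANK       4

/* Toy per-CPU MCi_STATUS storage; stands in for the real MSRs. */
static uint64_t mc_status[NR_CPUS][NR_BANKS];

/* Hypothetical helper: first core of the node that 'cpu' belongs to. */
int node_base_core(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (cpu / CORES_PER_NODE) * CORES_PER_NODE;
}

/* Model the hardware: bank 4 reads on a non-NBC core are read-as-zero. */
uint64_t read_status(int cpu, int bank)
{
	if (bank == MC4_BANK && cpu != node_base_core(cpu))
		return 0;
	return mc_status[cpu][bank];
}

/* Handler-side lookup: for bank 4, consult the NBC instead of 'cpu'. */
uint64_t read_status_fixed(int cpu, int bank)
{
	int src = (bank == MC4_BANK) ? node_base_core(cpu) : cpu;

	return read_status(src, bank);
}
```

In this model, an error taken on CPU 6 is recorded in bank 4 of CPU 4
(its NBC). A plain read_status(6, MC4_BANK) returns zero, so a handler
running on CPU 6 would see nothing to do, while read_status_fixed()
finds the logged status.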

Patch 1: Refactor AMD cpu topology functions so that we can get some
	 relevant info that we need to use in EDAC, MC handler routines
Patch 2: The fix to our problem
Patch 3: Modify mce_amd_inj interfaces to write to only NBC for bank 4
	 errors. Only then will they be picked up for error handling.

Aravind Gopalakrishnan (3):
  x86,amd: Refactor amd cpu topology functions for multi-node processors
  x86, mce: Handle AMD MCE on bank4 on NBC for multi-node processors
  edac, mce_amd_inj: Inject errors only on NBC for bank 4 errors

 arch/x86/include/asm/processor.h |   1 +
 arch/x86/kernel/cpu/amd.c        |  78 ++++++++++++++----
 arch/x86/kernel/cpu/mcheck/mce.c | 167 +++++++++++++++++++++++++++++++++++----
 drivers/edac/mce_amd_inj.c       |  21 ++++-
 4 files changed, 235 insertions(+), 32 deletions(-)

-- 
2.0.2



Thread overview: 11+ messages
2014-12-22 20:10 [PATCH 0/3] Fix MCE handling for AMD multi-node processors Aravind Gopalakrishnan
2014-12-22 20:10 ` [PATCH 1/3] x86,amd: Refactor amd cpu topology functions for " Aravind Gopalakrishnan
2014-12-22 20:10 ` [PATCH 2/3] x86, mce: Handle AMD MCE on bank4 on NBC " Aravind Gopalakrishnan
2014-12-22 20:10 ` [PATCH 3/3] edac, mce_amd_inj: Inject errors only on NBC for bank 4 errors Aravind Gopalakrishnan
2014-12-22 20:15 ` [PATCH 0/3] Fix MCE handling for AMD multi-node processors Borislav Petkov
2014-12-22 20:56   ` Aravind Gopalakrishnan
2014-12-22 23:19     ` Borislav Petkov
2014-12-23 19:41       ` Aravind Gopalakrishnan
2015-01-06 23:54         ` Aravind Gopalakrishnan
2015-01-07 17:06           ` Borislav Petkov
2015-01-08  0:18             ` Aravind Gopalakrishnan