public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mce-apei: do not rely on ACPI_ERST_GET_RECORD_ID for record id
@ 2016-03-03 19:03 Josef Bacik
  2016-03-03 21:49 ` Luck, Tony
  0 siblings, 1 reply; 3+ messages in thread
From: Josef Bacik @ 2016-03-03 19:03 UTC (permalink / raw)
  To: tony.luck, bp, tglx, mingo, hpa, x86, linux-edac, linux-kernel

We have a bunch of boxes where ACPI_ERST_GET_RECORD_ID doesn't advance after
reading an error record, it only advances on reboot.  So we get a bunch of MCE's
from a bad CPU, swap out the CPU, and then /var/log/mcelog has a new entry on
every reboot, which makes our monitoring send the box back to be fixed when it
has already been fixed.

So instead of using ACPI_ERST_GET_RECORD_ID to walk through the records on the
box, simply pass 0 into erst_read() which will read the first entry every time,
and then store the record id from the cpe header so that we clear out the right
record.  With this patch these broken boxes now spit out all of records in the
ERST in one go instead of one per reboot.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 arch/x86/kernel/cpu/mcheck/mce-apei.c | 43 +++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index 34c89a3..6de662f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -113,27 +113,50 @@ ssize_t apei_read_mce(struct mce *m, u64 *record_id)
 {
 	struct cper_mce_record rcd;
 	int rc, pos;
+	bool lookup_record = false;
 
 	rc = erst_get_record_id_begin(&pos);
 	if (rc)
 		return rc;
 retry:
-	rc = erst_get_record_id_next(&pos, record_id);
-	if (rc)
-		goto out;
+	/*
+	 * Some hardware is broken and doesn't actually advance the record id
+	 * returned by ACPI_ERST_GET_RECORD_ID when we read a record like the
+	 * spec says it is supposed to.  So instead use record_id == 0 to just
+	 * grab the first record in the erst, and fall back only if we trip over
+	 * a record that isn't a MCE record.
+	 */
+	if (lookup_record) {
+		rc = erst_get_record_id_next(&pos, record_id);
+		if (rc)
+			goto out;
+	} else {
+		*record_id = 0;
+	}
 	/* no more record */
 	if (*record_id == APEI_ERST_INVALID_RECORD_ID)
 		goto out;
 	rc = erst_read(*record_id, &rcd.hdr, sizeof(rcd));
-	/* someone else has cleared the record, try next one */
-	if (rc == -ENOENT)
-		goto retry;
-	else if (rc < 0)
+	/*
+	 * someone else has cleared the record, try next one if we are looking
+	 * up records.  If we aren't looking up the record id's then just bail
+	 * since this means we have an empty table.
+	 */
+	if (rc == -ENOENT) {
+		if (lookup_record)
+			goto retry;
+		rc = 0;
+		goto out;
+	} else if (rc < 0) {
 		goto out;
-	/* try to skip other type records in storage */
-	else if (rc != sizeof(rcd) ||
-		 uuid_le_cmp(rcd.hdr.creator_id, CPER_CREATOR_MCE))
+	} else if (rc != sizeof(rcd) ||
+		 uuid_le_cmp(rcd.hdr.creator_id, CPER_CREATOR_MCE)) {
+		/* try to skip other type records in storage */
+		lookup_record = true;
 		goto retry;
+	}
+	/* Use the record header as the source of truth for the record id. */
+	*record_id = rcd.hdr.record_id;
 	memcpy(m, &rcd.mce, sizeof(*m));
 	rc = sizeof(*m);
 out:
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-03-03 21:54 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-03 19:03 [PATCH] mce-apei: do not rely on ACPI_ERST_GET_RECORD_ID for record id Josef Bacik
2016-03-03 21:49 ` Luck, Tony
2016-03-03 21:53   ` Josef Bacik

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox