From: Shuai Xue <xueshuai@linux.alibaba.com>
To: mchehab@kernel.org, bp@alien8.de, tony.luck@intel.com,
james.morse@arm.com, rric@kernel.org, ardb@kernel.org,
linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-efi@vger.kernel.org
Cc: xueshuai@linux.alibaba.com, zhangliguang@linux.alibaba.com,
zhuo.song@linux.alibaba.com
Subject: [PATCH v2 1/3] ghes_edac: unify memory error report format with cper
Date: Fri, 10 Dec 2021 21:40:17 +0800 [thread overview]
Message-ID: <20211210134019.28536-2-xueshuai@linux.alibaba.com> (raw)
In-Reply-To: <20211210134019.28536-1-xueshuai@linux.alibaba.com>
The changes are to:
- add device info into ghes_edac
- change bit_pos to bit_position, col to column, requestorID to
requestor_id, etc in ghes_edac
- move requestor_id, responder_id, target_id and chip_id into memory error
location in ghes_edac
- add "DIMM location: not present." for DIMM location in ghes_edac
- remove the 'space' delimiter after the colon in ghes_edac and cper
The original EDAC and cper error log are as follows (all Validation Bits
are enabled):
[31940.060454] EDAC MC0: 1 CE Single-symbol ChipKill ECC on unknown memory (node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 page:0x93724c offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:257 bank_group:1 bank_address:1 row:75492 col:8 bit_pos:0 DIMM DMI handle: 0x0000 chipID: 0 status(0x0000000000000000): reserved requestorID: 0x0000000000000000 responderID: 0x0000000000000000 targetID: 0x0000000000000000)
[31940.060459] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[31940.060460] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[31940.060462] {3}[Hardware Error]: event severity: corrected
[31940.060463] {3}[Hardware Error]: Error 0, type: corrected
[31940.060464] {3}[Hardware Error]: section_type: memory error
[31940.060465] {3}[Hardware Error]: error_status: 0x0000000000000000
[31940.060466] {3}[Hardware Error]: physical_address: 0x000000093724c020
[31940.060466] {3}[Hardware Error]: physical_address_mask: 0x0000000000000000
[31940.060469] {3}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 257 bank_group: 1 bank_address: 1 device: 0 row: 75492 column: 8 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
[31940.060470] {3}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[31940.060471] {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
Now, the EDAC and cper error log are properly reporting the error as
follows (all Validation Bits are enabled):
[ 117.973657] EDAC MC0: 1 CE Single-symbol ChipKill ECC on 0x0000 (node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM DMI handle:0x0000 page:0x8d2ef4 offset:0x20 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0 DIMM location:not present. DIMM DMI handle:0x0000 status(0x0000000000000000):reserved)
[ 117.973663] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 117.973664] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 117.973665] {2}[Hardware Error]: event severity: corrected
[ 117.973666] {2}[Hardware Error]: Error 0, type: corrected
[ 117.973667] {2}[Hardware Error]: section_type: memory error
[ 117.973668] {2}[Hardware Error]: error_status: 0x0000000000000000
[ 117.973669] {2}[Hardware Error]: physical_address: 0x00000008d2ef4020
[ 117.973670] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
[ 117.973672] {2}[Hardware Error]: node:0 card:0 module:0 rank:0 bank:1026 bank_group:4 bank_address:2 device:0 row:6749 column:8 bit_position:0 requestor_id:0x0000000000000000 responder_id:0x0000000000000000 target_id:0x0000000000000000 chip_id:0
[ 117.973673] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 117.973674] {2}[Hardware Error]: DIMM location: not present. DMI handle:0x0000
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
drivers/edac/ghes_edac.c | 35 +++++++++++++++++++----------------
drivers/firmware/efi/cper.c | 34 +++++++++++++++++-----------------
2 files changed, 36 insertions(+), 33 deletions(-)
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 6d1ddecbf0da..526a28cbb19b 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -378,6 +378,8 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
if (mem_err->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
p += sprintf(p, "bank_address:%d ",
mem_err->bank & CPER_MEM_BANK_ADDRESS_MASK);
+ if (mem_err->validation_bits & CPER_MEM_VALID_DEVICE)
+ p += sprintf(p, "device:%d ", mem_err->device);
if (mem_err->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem_err->row;
@@ -385,9 +387,21 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
p += sprintf(p, "row:%d ", row);
}
if (mem_err->validation_bits & CPER_MEM_VALID_COLUMN)
- p += sprintf(p, "col:%d ", mem_err->column);
+ p += sprintf(p, "column:%d ", mem_err->column);
if (mem_err->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos);
+ p += sprintf(p, "bit_position:%d ", mem_err->bit_pos);
+ if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
+ p += sprintf(p, "requestor_id:0x%016llx ",
+ (long long)mem_err->requestor_id);
+ if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
+ p += sprintf(p, "responder_id:0x%016llx ",
+ (long long)mem_err->responder_id);
+ if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
+ p += sprintf(p, "target_id:0x%016llx ",
+ (long long)mem_err->responder_id);
+ if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
+ p += sprintf(p, "chip_id:%d ",
+ mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) {
const char *bank = NULL, *device = NULL;
struct dimm_info *dimm;
@@ -396,7 +410,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
if (bank != NULL && device != NULL)
p += sprintf(p, "DIMM location:%s %s ", bank, device);
else
- p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
+ p += sprintf(p, "DIMM location:not present. DIMM DMI handle:0x%.4x ",
mem_err->mem_dev_handle);
dimm = find_dimm_by_handle(mci, mem_err->mem_dev_handle);
@@ -405,9 +419,6 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
strcpy(e->label, dimm->label);
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_CHIP_ID)
- p += sprintf(p, "chipID: %d ",
- mem_err->extended >> CPER_MEM_CHIP_ID_SHIFT);
if (p > e->location)
*(p - 1) = '\0';
@@ -421,7 +432,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_STATUS) {
u64 status = mem_err->error_status;
- p += sprintf(p, "status(0x%016llx): ", (long long)status);
+ p += sprintf(p, "status(0x%016llx):", (long long)status);
switch ((status >> 8) & 0xff) {
case 1:
p += sprintf(p, "Error detected internal to the component ");
@@ -479,15 +490,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
break;
}
}
- if (mem_err->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- p += sprintf(p, "requestorID: 0x%016llx ",
- (long long)mem_err->requestor_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- p += sprintf(p, "responderID: 0x%016llx ",
- (long long)mem_err->responder_id);
- if (mem_err->validation_bits & CPER_MEM_VALID_TARGET_ID)
- p += sprintf(p, "targetID: 0x%016llx ",
- (long long)mem_err->responder_id);
+
if (p > pvt->other_detail)
*(p - 1) = '\0';
diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
index 6ec8edec6329..77b39b058924 100644
--- a/drivers/firmware/efi/cper.c
+++ b/drivers/firmware/efi/cper.c
@@ -221,45 +221,45 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
n = 0;
len = CPER_REC_LEN;
if (mem->validation_bits & CPER_MEM_VALID_NODE)
- n += scnprintf(msg + n, len - n, "node: %d ", mem->node);
+ n += scnprintf(msg + n, len - n, "node:%d ", mem->node);
if (mem->validation_bits & CPER_MEM_VALID_CARD)
- n += scnprintf(msg + n, len - n, "card: %d ", mem->card);
+ n += scnprintf(msg + n, len - n, "card:%d ", mem->card);
if (mem->validation_bits & CPER_MEM_VALID_MODULE)
- n += scnprintf(msg + n, len - n, "module: %d ", mem->module);
+ n += scnprintf(msg + n, len - n, "module:%d ", mem->module);
if (mem->validation_bits & CPER_MEM_VALID_RANK_NUMBER)
- n += scnprintf(msg + n, len - n, "rank: %d ", mem->rank);
+ n += scnprintf(msg + n, len - n, "rank:%d ", mem->rank);
if (mem->validation_bits & CPER_MEM_VALID_BANK)
- n += scnprintf(msg + n, len - n, "bank: %d ", mem->bank);
+ n += scnprintf(msg + n, len - n, "bank:%d ", mem->bank);
if (mem->validation_bits & CPER_MEM_VALID_BANK_GROUP)
- n += scnprintf(msg + n, len - n, "bank_group: %d ",
+ n += scnprintf(msg + n, len - n, "bank_group:%d ",
mem->bank >> CPER_MEM_BANK_GROUP_SHIFT);
if (mem->validation_bits & CPER_MEM_VALID_BANK_ADDRESS)
- n += scnprintf(msg + n, len - n, "bank_address: %d ",
+ n += scnprintf(msg + n, len - n, "bank_address:%d ",
mem->bank & CPER_MEM_BANK_ADDRESS_MASK);
if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
- n += scnprintf(msg + n, len - n, "device: %d ", mem->device);
+ n += scnprintf(msg + n, len - n, "device:%d ", mem->device);
if (mem->validation_bits & (CPER_MEM_VALID_ROW | CPER_MEM_VALID_ROW_EXT)) {
u32 row = mem->row;
row |= cper_get_mem_extension(mem->validation_bits, mem->extended);
- n += scnprintf(msg + n, len - n, "row: %d ", row);
+ n += scnprintf(msg + n, len - n, "row:%d ", row);
}
if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
- n += scnprintf(msg + n, len - n, "column: %d ", mem->column);
+ n += scnprintf(msg + n, len - n, "column:%d ", mem->column);
if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
- n += scnprintf(msg + n, len - n, "bit_position: %d ",
+ n += scnprintf(msg + n, len - n, "bit_position:%d ",
mem->bit_pos);
if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
- n += scnprintf(msg + n, len - n, "requestor_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "requestor_id:0x%016llx ",
mem->requestor_id);
if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
- n += scnprintf(msg + n, len - n, "responder_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "responder_id:0x%016llx ",
mem->responder_id);
if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
- n += scnprintf(msg + n, len - n, "target_id: 0x%016llx ",
+ n += scnprintf(msg + n, len - n, "target_id:0x%016llx ",
mem->target_id);
if (mem->validation_bits & CPER_MEM_VALID_CHIP_ID)
- n += scnprintf(msg + n, len - n, "chip_id: %d ",
+ n += scnprintf(msg + n, len - n, "chip_id:%d ",
mem->extended >> CPER_MEM_CHIP_ID_SHIFT);
return n;
@@ -276,10 +276,10 @@ static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
len = CPER_REC_LEN;
dmi_memdev_name(mem->mem_dev_handle, &bank, &device);
if (bank && device)
- n = snprintf(msg, len, "DIMM location: %s %s ", bank, device);
+ n = snprintf(msg, len, "DIMM location:%s %s ", bank, device);
else
n = snprintf(msg, len,
- "DIMM location: not present. DMI handle: 0x%.4x ",
+ "DIMM location: not present. DMI handle:0x%.4x ",
mem->mem_dev_handle);
return n;
--
2.20.1.12.g72788fdb
next prev parent reply other threads:[~2021-12-10 13:40 UTC|newest]
Thread overview: 50+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-12-10 13:40 [PATCH v2 0/3] ghes_edac: refactor memory error reporting to avoid code duplication Shuai Xue
2021-12-10 13:40 ` Shuai Xue [this message]
2021-12-28 17:12 ` [PATCH v2 1/3] ghes_edac: unify memory error report format with cper Borislav Petkov
2021-12-29 3:22 ` Shuai Xue
2021-12-30 17:57 ` Borislav Petkov
2021-12-31 3:04 ` Shuai Xue
2021-12-10 13:40 ` [PATCH v2 2/3] ghes_edac: refactor memory error location processing Shuai Xue
2021-12-10 13:40 ` [PATCH v2 3/3] ghes_edac: refactor error status fields decoding Shuai Xue
2021-12-16 12:54 ` [PATCH v2 0/3] ghes_edac: refactor memory error reporting to avoid code duplication Shuai Xue
2022-01-16 7:36 ` [PATCH v3 0/2] EDAC/ghes: refactor memory error reporting to avoid Shuai Xue
2022-01-16 7:36 ` [PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description Shuai Xue
2022-01-16 7:36 ` [PATCH v3 2/2] EDAC/ghes: use cper functions to avoid code duplication Shuai Xue
2022-01-24 2:47 ` [RESEND PATCH v3 0/2] EDAC/ghes: refactor memory error reporting to avoid Shuai Xue
2022-01-24 2:47 ` [RESEND PATCH v3 1/2] efi/cper: add cper_mem_err_status_str to decode error description Shuai Xue
2022-01-24 21:16 ` Borislav Petkov
2022-01-25 2:45 ` Shuai Xue
2022-01-25 10:21 ` Borislav Petkov
2022-01-25 11:49 ` Shuai Xue
2022-01-25 12:37 ` Borislav Petkov
2022-01-25 15:02 ` Shuai Xue
2022-01-24 2:47 ` [RESEND PATCH v3 2/2] EDAC/ghes: use cper functions to avoid code duplication Shuai Xue
2022-01-25 2:49 ` [PATCH v4 0/2] EDAC/ghes: refactor memory error reporting " Shuai Xue
2022-01-25 2:49 ` [PATCH v4 1/2] efi/cper: add cper_mem_err_status_str to decode error description Shuai Xue
2022-01-25 19:08 ` Borislav Petkov
2022-01-26 5:25 ` Shuai Xue
2022-01-25 2:49 ` [PATCH v4 2/2] EDAC/ghes: use cper functions to avoid code duplication Shuai Xue
2022-01-26 8:17 ` [PATCH v5 0/2] EDAC/ghes: refactor memory error reporting " Shuai Xue
2022-01-26 8:20 ` Borislav Petkov
2022-01-26 9:26 ` Shuai Xue
2022-01-26 10:16 ` Boris Petkov
2022-01-27 2:28 ` Shuai Xue
2022-01-26 8:17 ` [PATCH v5 1/2] efi/cper: add cper_mem_err_status_str to decode error description Shuai Xue
2022-01-26 8:17 ` [PATCH v5 2/2] EDAC/ghes: use cper functions to avoid code duplication Shuai Xue
2022-01-29 13:09 ` Borislav Petkov
2022-02-12 11:28 ` Shuai Xue
2022-03-03 12:26 ` [PATCH v6 0/2] EDAC/ghes: refactor memory error reporting " Shuai Xue
2022-03-03 12:26 ` [PATCH v6 1/2] efi/cper: add cper_mem_err_status_str to decode error description Shuai Xue
2022-03-03 12:26 ` [PATCH v6 2/2] EDAC/ghes: use cper functions to avoid code duplication Shuai Xue
2022-03-04 10:16 ` Borislav Petkov
2022-03-07 6:16 ` Shuai Xue
2022-03-07 7:08 ` Borislav Petkov
2022-03-08 14:40 ` [PATCH v7 0/3] EDAC/ghes: refactor memory error reporting " Shuai Xue
2022-03-22 6:09 ` Shuai Xue
2022-04-01 6:20 ` Shuai Xue
2022-04-08 15:53 ` Borislav Petkov
2022-03-08 14:40 ` [PATCH v7 1/3] efi/cper: add cper_mem_err_status_str to decode error description Shuai Xue
2022-04-01 20:44 ` Borislav Petkov
2022-04-06 21:37 ` Ard Biesheuvel
2022-03-08 14:40 ` [PATCH v7 2/3] EDAC/ghes: Unify CPER memory error location reporting Shuai Xue
2022-03-08 14:40 ` [PATCH v7 3/3] efi/cper: reformat CPER memory error location to more readable Shuai Xue
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20211210134019.28536-2-xueshuai@linux.alibaba.com \
--to=xueshuai@linux.alibaba.com \
--cc=ardb@kernel.org \
--cc=bp@alien8.de \
--cc=james.morse@arm.com \
--cc=linux-edac@vger.kernel.org \
--cc=linux-efi@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mchehab@kernel.org \
--cc=rric@kernel.org \
--cc=tony.luck@intel.com \
--cc=zhangliguang@linux.alibaba.com \
--cc=zhuo.song@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox