public inbox for linux-acpi@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH -v4 0/3] ACPI, APEI, Report GHES error information via printk
@ 2010-12-07  2:22 Huang Ying
  2010-12-07  2:22 ` [PATCH -v4 1/3] Add CPER PCIe error section structure and constants definition Huang Ying
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Huang Ying @ 2010-12-07  2:22 UTC (permalink / raw)
  To: Len Brown
  Cc: linux-kernel, Andi Kleen, Tony Luck, ying.huang, linux-acpi,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, Ingo Molnar

The recent consensus is that hardware error should be reported via printk.
This patch adds printk support to APEI GHES.

v4:

- Remove pr_pfx, use printk directly

v3:

- Add document for output format per Andrew's comments
- Move pr_pfx into printk.h, hope it is useful for others too
- Fixes some issues according to comments

v2:

- Some minor adjustment of PCIe error section definition and printk format

[PATCH -v4 1/3] Add CPER PCIe error section structure and constants definition
[PATCH -v4 2/3] ACPI, APEI, Add APEI generic error status printing support
[PATCH -v4 3/3] ACPI, APEI, Report GHES error information via printk

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH -v4 1/3] Add CPER PCIe error section structure and constants definition
  2010-12-07  2:22 [PATCH -v4 0/3] ACPI, APEI, Report GHES error information via printk Huang Ying
@ 2010-12-07  2:22 ` Huang Ying
  2010-12-07  2:22 ` [PATCH -v4 2/3] ACPI, APEI, Add APEI generic error status printing support Huang Ying
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Huang Ying @ 2010-12-07  2:22 UTC (permalink / raw)
  To: Len Brown
  Cc: linux-kernel, Andi Kleen, Tony Luck, ying.huang, linux-acpi,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, Ingo Molnar

On some machine, PCIe error is reported via APEI (ACPI Platform Error
Interface).  The error data is passed from firmware to Linux via CPER
PCIe error section structure.

This patch adds CPER PCIe error section structure and constants
definition.

Signed-off-by: Huang Ying <ying.huang@intel.com>
---
 include/linux/cper.h |   86 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 82 insertions(+), 4 deletions(-)

--- a/include/linux/cper.h
+++ b/include/linux/cper.h
@@ -39,10 +39,12 @@
  * Severity difinition for error_severity in struct cper_record_header
  * and section_severity in struct cper_section_descriptor
  */
-#define CPER_SEV_RECOVERABLE			0x0
-#define CPER_SEV_FATAL				0x1
-#define CPER_SEV_CORRECTED			0x2
-#define CPER_SEV_INFORMATIONAL			0x3
+enum {
+	CPER_SEV_RECOVERABLE,
+	CPER_SEV_FATAL,
+	CPER_SEV_CORRECTED,
+	CPER_SEV_INFORMATIONAL,
+};
 
 /*
  * Validation bits difinition for validation_bits in struct
@@ -201,6 +203,47 @@
 	UUID_LE(0x036F84E1, 0x7F37, 0x428c, 0xA7, 0x9E, 0x57, 0x5F,	\
 		0xDF, 0xAA, 0x84, 0xEC)
 
+#define CPER_PROC_VALID_TYPE			0x0001
+#define CPER_PROC_VALID_ISA			0x0002
+#define CPER_PROC_VALID_ERROR_TYPE		0x0004
+#define CPER_PROC_VALID_OPERATION		0x0008
+#define CPER_PROC_VALID_FLAGS			0x0010
+#define CPER_PROC_VALID_LEVEL			0x0020
+#define CPER_PROC_VALID_VERSION			0x0040
+#define CPER_PROC_VALID_BRAND_INFO		0x0080
+#define CPER_PROC_VALID_ID			0x0100
+#define CPER_PROC_VALID_TARGET_ADDRESS		0x0200
+#define CPER_PROC_VALID_REQUESTOR_ID		0x0400
+#define CPER_PROC_VALID_RESPONDER_ID		0x0800
+#define CPER_PROC_VALID_IP			0x1000
+
+#define CPER_MEM_VALID_ERROR_STATUS		0x0001
+#define CPER_MEM_VALID_PHYSICAL_ADDRESS		0x0002
+#define CPER_MEM_VALID_PHYSICAL_ADDRESS_MASK	0x0004
+#define CPER_MEM_VALID_NODE			0x0008
+#define CPER_MEM_VALID_CARD			0x0010
+#define CPER_MEM_VALID_MODULE			0x0020
+#define CPER_MEM_VALID_BANK			0x0040
+#define CPER_MEM_VALID_DEVICE			0x0080
+#define CPER_MEM_VALID_ROW			0x0100
+#define CPER_MEM_VALID_COLUMN			0x0200
+#define CPER_MEM_VALID_BIT_POSITION		0x0400
+#define CPER_MEM_VALID_REQUESTOR_ID		0x0800
+#define CPER_MEM_VALID_RESPONDER_ID		0x1000
+#define CPER_MEM_VALID_TARGET_ID		0x2000
+#define CPER_MEM_VALID_ERROR_TYPE		0x4000
+
+#define CPER_PCIE_VALID_PORT_TYPE		0x0001
+#define CPER_PCIE_VALID_VERSION			0x0002
+#define CPER_PCIE_VALID_COMMAND_STATUS		0x0004
+#define CPER_PCIE_VALID_DEVICE_ID		0x0008
+#define CPER_PCIE_VALID_SERIAL_NUMBER		0x0010
+#define CPER_PCIE_VALID_BRIDGE_CONTROL_STATUS	0x0020
+#define CPER_PCIE_VALID_CAPABILITY		0x0040
+#define CPER_PCIE_VALID_AER_INFO		0x0080
+
+#define CPER_PCIE_SLOT_SHIFT			3
+
 /*
  * All tables and structs must be byte-packed to match CPER
  * specification, since the tables are provided by the system BIOS
@@ -306,6 +349,41 @@ struct cper_sec_mem_err {
 	__u8	error_type;
 };
 
+struct cper_sec_pcie {
+	__u64		validation_bits;
+	__u32		port_type;
+	struct {
+		__u8	minor;
+		__u8	major;
+		__u8	reserved[2];
+	}		version;
+	__u16		command;
+	__u16		status;
+	__u32		reserved;
+	struct {
+		__u16	vendor_id;
+		__u16	device_id;
+		__u8	class_code[3];
+		__u8	function;
+		__u8	device;
+		__u16	segment;
+		__u8	bus;
+		__u8	secondary_bus;
+		__u16	slot;
+		__u8	reserved;
+	}		device_id;
+	struct {
+		__u32	lower;
+		__u32	upper;
+	}		serial_number;
+	struct {
+		__u16	secondary_status;
+		__u16	control;
+	}		bridge;
+	__u8	capability[60];
+	__u8	aer_info[96];
+};
+
 /* Reset to default packing */
 #pragma pack()
 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH -v4 2/3] ACPI, APEI, Add APEI generic error status printing support
  2010-12-07  2:22 [PATCH -v4 0/3] ACPI, APEI, Report GHES error information via printk Huang Ying
  2010-12-07  2:22 ` [PATCH -v4 1/3] Add CPER PCIe error section structure and constants definition Huang Ying
@ 2010-12-07  2:22 ` Huang Ying
  2010-12-07  2:22 ` [PATCH -v4 3/3] ACPI, APEI, Report GHES error information via printk Huang Ying
  2010-12-14  4:43 ` [PATCH -v4 0/3] " Len Brown
  3 siblings, 0 replies; 5+ messages in thread
From: Huang Ying @ 2010-12-07  2:22 UTC (permalink / raw)
  To: Len Brown
  Cc: linux-kernel, Andi Kleen, Tony Luck, ying.huang, linux-acpi,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, Ingo Molnar

In APEI, Hardware error information reported by firmware to Linux
kernel is in the data structure of APEI generic error status (struct
acpi_hes_generic_status).  While now printk is used by Linux kernel to
report hardware error information to user space.

So, this patch adds printing support for the data structure, so that
the corresponding hardware error information can be reported to user
space via printk.

PCIe AER information printing is not implemented yet.  Will refactor the
original PCIe AER information printing code to avoid code duplicating.

The output format is as follow:

<error record> :=
APEI generic hardware error status
severity: <integer>, <severity string>
section: <integer>, severity: <integer>, <severity string>
flags: <integer>
<section flags strings>
fru_id: <uuid string>
fru_text: <string>
section_type: <section type string>
<section data>

<severity string>* := recoverable | fatal | corrected | info

<section flags strings># :=
[primary][, containment warning][, reset][, threshold exceeded]\
[, resource not accessible][, latent error]

<section type string> := generic processor error | memory error | \
PCIe error | unknown, <uuid string>

<section data> :=
<generic processor section data> | <memory section data> | \
<pcie section data> | <null>

<generic processor section data> :=
[processor_type: <integer>, <proc type string>]
[processor_isa: <integer>, <proc isa string>]
[error_type: <integer>
<proc error type strings>]
[operation: <integer>, <proc operation string>]
[flags: <integer>
<proc flags strings>]
[level: <integer>]
[version_info: <integer>]
[processor_id: <integer>]
[target_address: <integer>]
[requestor_id: <integer>]
[responder_id: <integer>]
[IP: <integer>]

<proc type string>* := IA32/X64 | IA64

<proc isa string>* := IA32 | IA64 | X64

<processor error type strings># :=
[cache error][, TLB error][, bus error][, micro-architectural error]

<proc operation string>* := unknown or generic | data read | data write | \
instruction execution

<proc flags strings># :=
[restartable][, precise IP][, overflow][, corrected]

<memory section data> :=
[error_status: <integer>]
[physical_address: <integer>]
[physical_address_mask: <integer>]
[node: <integer>]
[card: <integer>]
[module: <integer>]
[bank: <integer>]
[device: <integer>]
[row: <integer>]
[column: <integer>]
[bit_position: <integer>]
[requestor_id: <integer>]
[responder_id: <integer>]
[target_id: <integer>]
[error_type: <integer>, <mem error type string>]

<mem error type string>* :=
unknown | no error | single-bit ECC | multi-bit ECC | \
single-symbol chipkill ECC | multi-symbol chipkill ECC | master abort | \
target abort | parity error | watchdog timeout | invalid address | \
mirror Broken | memory sparing | scrub corrected error | \
scrub uncorrected error

<pcie section data> :=
[port_type: <integer>, <pcie port type string>]
[version: <integer>.<integer>]
[command: <integer>, status: <integer>]
[device_id: <integer>:<integer>:<integer>.<integer>
slot: <integer>
secondary_bus: <integer>
vendor_id: <integer>, device_id: <integer>
class_code: <integer>]
[serial number: <integer>, <integer>]
[bridge: secondary_status: <integer>, control: <integer>]

<pcie port type string>* := PCIe end point | legacy PCI end point | \
unknown | unknown | root port | upstream switch port | \
downstream switch port | PCIe to PCI/PCI-X bridge | \
PCI/PCI-X to PCIe bridge | root complex integrated endpoint device | \
root complex event collector

Where, [] designate corresponding content is optional

All <field string> description with * has the following format:

field: <integer>, <field string>

Where value of <integer> should be the position of "string" in <field
string> description. Otherwise, <field string> will be "unknown".

All <field strings> description with # has the following format:

field: <integer>
<field strings>

Where each string in <fields strings> corresponding to one set bit of
<integer>. The bit position is the position of "string" in <field
strings> description.

For more detailed explanation of every field, please refer to UEFI
specification version 2.3 or later, section Appendix N: Common
Platform Error Record.

Signed-off-by: Huang Ying <ying.huang@intel.com>
---
 Documentation/acpi/apei/output_format.txt |  122 +++++++++++
 drivers/acpi/apei/apei-internal.h         |    2 
 drivers/acpi/apei/cper.c                  |  311 ++++++++++++++++++++++++++++++
 3 files changed, 435 insertions(+)
 create mode 100644 Documentation/acpi/apei/output_format.txt

--- /dev/null
+++ b/Documentation/acpi/apei/output_format.txt
@@ -0,0 +1,122 @@
+                     APEI output format
+                     ~~~~~~~~~~~~~~~~~~
+
+APEI uses printk as hardware error reporting interface, the output
+format is as follow.
+
+<error record> :=
+APEI generic hardware error status
+severity: <integer>, <severity string>
+section: <integer>, severity: <integer>, <severity string>
+flags: <integer>
+<section flags strings>
+fru_id: <uuid string>
+fru_text: <string>
+section_type: <section type string>
+<section data>
+
+<severity string>* := recoverable | fatal | corrected | info
+
+<section flags strings># :=
+[primary][, containment warning][, reset][, threshold exceeded]\
+[, resource not accessible][, latent error]
+
+<section type string> := generic processor error | memory error | \
+PCIe error | unknown, <uuid string>
+
+<section data> :=
+<generic processor section data> | <memory section data> | \
+<pcie section data> | <null>
+
+<generic processor section data> :=
+[processor_type: <integer>, <proc type string>]
+[processor_isa: <integer>, <proc isa string>]
+[error_type: <integer>
+<proc error type strings>]
+[operation: <integer>, <proc operation string>]
+[flags: <integer>
+<proc flags strings>]
+[level: <integer>]
+[version_info: <integer>]
+[processor_id: <integer>]
+[target_address: <integer>]
+[requestor_id: <integer>]
+[responder_id: <integer>]
+[IP: <integer>]
+
+<proc type string>* := IA32/X64 | IA64
+
+<proc isa string>* := IA32 | IA64 | X64
+
+<processor error type strings># :=
+[cache error][, TLB error][, bus error][, micro-architectural error]
+
+<proc operation string>* := unknown or generic | data read | data write | \
+instruction execution
+
+<proc flags strings># :=
+[restartable][, precise IP][, overflow][, corrected]
+
+<memory section data> :=
+[error_status: <integer>]
+[physical_address: <integer>]
+[physical_address_mask: <integer>]
+[node: <integer>]
+[card: <integer>]
+[module: <integer>]
+[bank: <integer>]
+[device: <integer>]
+[row: <integer>]
+[column: <integer>]
+[bit_position: <integer>]
+[requestor_id: <integer>]
+[responder_id: <integer>]
+[target_id: <integer>]
+[error_type: <integer>, <mem error type string>]
+
+<mem error type string>* :=
+unknown | no error | single-bit ECC | multi-bit ECC | \
+single-symbol chipkill ECC | multi-symbol chipkill ECC | master abort | \
+target abort | parity error | watchdog timeout | invalid address | \
+mirror Broken | memory sparing | scrub corrected error | \
+scrub uncorrected error
+
+<pcie section data> :=
+[port_type: <integer>, <pcie port type string>]
+[version: <integer>.<integer>]
+[command: <integer>, status: <integer>]
+[device_id: <integer>:<integer>:<integer>.<integer>
+slot: <integer>
+secondary_bus: <integer>
+vendor_id: <integer>, device_id: <integer>
+class_code: <integer>]
+[serial number: <integer>, <integer>]
+[bridge: secondary_status: <integer>, control: <integer>]
+
+<pcie port type string>* := PCIe end point | legacy PCI end point | \
+unknown | unknown | root port | upstream switch port | \
+downstream switch port | PCIe to PCI/PCI-X bridge | \
+PCI/PCI-X to PCIe bridge | root complex integrated endpoint device | \
+root complex event collector
+
+Where, [] designate corresponding content is optional
+
+All <field string> description with * has the following format:
+
+field: <integer>, <field string>
+
+Where value of <integer> should be the position of "string" in <field
+string> description. Otherwise, <field string> will be "unknown".
+
+All <field strings> description with # has the following format:
+
+field: <integer>
+<field strings>
+
+Where each string in <fields strings> corresponding to one set bit of
+<integer>. The bit position is the position of "string" in <field
+strings> description.
+
+For more detailed explanation of every field, please refer to UEFI
+specification version 2.3 or later, section Appendix N: Common
+Platform Error Record.
--- a/drivers/acpi/apei/apei-internal.h
+++ b/drivers/acpi/apei/apei-internal.h
@@ -109,6 +109,8 @@ static inline u32 apei_estatus_len(struc
 		return sizeof(*estatus) + estatus->data_length;
 }
 
+void apei_estatus_print(const char *pfx,
+			const struct acpi_hest_generic_status *estatus);
 int apei_estatus_check_header(const struct acpi_hest_generic_status *estatus);
 int apei_estatus_check(const struct acpi_hest_generic_status *estatus);
 #endif
--- a/drivers/acpi/apei/cper.c
+++ b/drivers/acpi/apei/cper.c
@@ -46,6 +46,317 @@ u64 cper_next_record_id(void)
 }
 EXPORT_SYMBOL_GPL(cper_next_record_id);
 
+static const char *cper_severity_strs[] = {
+	"recoverable",
+	"fatal",
+	"corrected",
+	"info",
+};
+
+static const char *cper_severity_str(unsigned int severity)
+{
+	return severity < ARRAY_SIZE(cper_severity_strs) ?
+		cper_severity_strs[severity] : "unknown";
+}
+
+/*
+ * cper_print_bits - print strings for set bits
+ * @pfx: prefix for each line, including log level and prefix string
+ * @bits: bit mask
+ * @strs: string array, indexed by bit position
+ * @strs_size: size of the string array: @strs
+ *
+ * For each set bit in @bits, print the corresponding string in @strs.
+ * If the output length is longer than 80, multiple line will be
+ * printed, with @pfx is printed at the beginning of each line.
+ */
+static void cper_print_bits(const char *pfx, unsigned int bits,
+			    const char *strs[], unsigned int strs_size)
+{
+	int i, len = 0;
+	const char *str;
+	char buf[84];
+
+	for (i = 0; i < strs_size; i++) {
+		if (!(bits & (1U << i)))
+			continue;
+		str = strs[i];
+		if (len && len + strlen(str) + 2 > 80) {
+			printk("%s\n", buf);
+			len = 0;
+		}
+		if (!len)
+			len = snprintf(buf, sizeof(buf), "%s%s", pfx, str);
+		else
+			len += snprintf(buf+len, sizeof(buf)-len, ", %s", str);
+	}
+	if (len)
+		printk("%s\n", buf);
+}
+
+static const char *cper_proc_type_strs[] = {
+	"IA32/X64",
+	"IA64",
+};
+
+static const char *cper_proc_isa_strs[] = {
+	"IA32",
+	"IA64",
+	"X64",
+};
+
+static const char *cper_proc_error_type_strs[] = {
+	"cache error",
+	"TLB error",
+	"bus error",
+	"micro-architectural error",
+};
+
+static const char *cper_proc_op_strs[] = {
+	"unknown or generic",
+	"data read",
+	"data write",
+	"instruction execution",
+};
+
+static const char *cper_proc_flag_strs[] = {
+	"restartable",
+	"precise IP",
+	"overflow",
+	"corrected",
+};
+
+static void cper_print_proc_generic(const char *pfx,
+				    const struct cper_sec_proc_generic *proc)
+{
+	if (proc->validation_bits & CPER_PROC_VALID_TYPE)
+		printk("%s""processor_type: %d, %s\n", pfx, proc->proc_type,
+		       proc->proc_type < ARRAY_SIZE(cper_proc_type_strs) ?
+		       cper_proc_type_strs[proc->proc_type] : "unknown");
+	if (proc->validation_bits & CPER_PROC_VALID_ISA)
+		printk("%s""processor_isa: %d, %s\n", pfx, proc->proc_isa,
+		       proc->proc_isa < ARRAY_SIZE(cper_proc_isa_strs) ?
+		       cper_proc_isa_strs[proc->proc_isa] : "unknown");
+	if (proc->validation_bits & CPER_PROC_VALID_ERROR_TYPE) {
+		printk("%s""error_type: 0x%02x\n", pfx, proc->proc_error_type);
+		cper_print_bits(pfx, proc->proc_error_type,
+				cper_proc_error_type_strs,
+				ARRAY_SIZE(cper_proc_error_type_strs));
+	}
+	if (proc->validation_bits & CPER_PROC_VALID_OPERATION)
+		printk("%s""operation: %d, %s\n", pfx, proc->operation,
+		       proc->operation < ARRAY_SIZE(cper_proc_op_strs) ?
+		       cper_proc_op_strs[proc->operation] : "unknown");
+	if (proc->validation_bits & CPER_PROC_VALID_FLAGS) {
+		printk("%s""flags: 0x%02x\n", pfx, proc->flags);
+		cper_print_bits(pfx, proc->flags, cper_proc_flag_strs,
+				ARRAY_SIZE(cper_proc_flag_strs));
+	}
+	if (proc->validation_bits & CPER_PROC_VALID_LEVEL)
+		printk("%s""level: %d\n", pfx, proc->level);
+	if (proc->validation_bits & CPER_PROC_VALID_VERSION)
+		printk("%s""version_info: 0x%016llx\n", pfx, proc->cpu_version);
+	if (proc->validation_bits & CPER_PROC_VALID_ID)
+		printk("%s""processor_id: 0x%016llx\n", pfx, proc->proc_id);
+	if (proc->validation_bits & CPER_PROC_VALID_TARGET_ADDRESS)
+		printk("%s""target_address: 0x%016llx\n",
+		       pfx, proc->target_addr);
+	if (proc->validation_bits & CPER_PROC_VALID_REQUESTOR_ID)
+		printk("%s""requestor_id: 0x%016llx\n",
+		       pfx, proc->requestor_id);
+	if (proc->validation_bits & CPER_PROC_VALID_RESPONDER_ID)
+		printk("%s""responder_id: 0x%016llx\n",
+		       pfx, proc->responder_id);
+	if (proc->validation_bits & CPER_PROC_VALID_IP)
+		printk("%s""IP: 0x%016llx\n", pfx, proc->ip);
+}
+
+static const char *cper_mem_err_type_strs[] = {
+	"unknown",
+	"no error",
+	"single-bit ECC",
+	"multi-bit ECC",
+	"single-symbol chipkill ECC",
+	"multi-symbol chipkill ECC",
+	"master abort",
+	"target abort",
+	"parity error",
+	"watchdog timeout",
+	"invalid address",
+	"mirror Broken",
+	"memory sparing",
+	"scrub corrected error",
+	"scrub uncorrected error",
+};
+
+static void cper_print_mem(const char *pfx, const struct cper_sec_mem_err *mem)
+{
+	if (mem->validation_bits & CPER_MEM_VALID_ERROR_STATUS)
+		printk("%s""error_status: 0x%016llx\n", pfx, mem->error_status);
+	if (mem->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS)
+		printk("%s""physical_address: 0x%016llx\n",
+		       pfx, mem->physical_addr);
+	if (mem->validation_bits & CPER_MEM_VALID_PHYSICAL_ADDRESS_MASK)
+		printk("%s""physical_address_mask: 0x%016llx\n",
+		       pfx, mem->physical_addr_mask);
+	if (mem->validation_bits & CPER_MEM_VALID_NODE)
+		printk("%s""node: %d\n", pfx, mem->node);
+	if (mem->validation_bits & CPER_MEM_VALID_CARD)
+		printk("%s""card: %d\n", pfx, mem->card);
+	if (mem->validation_bits & CPER_MEM_VALID_MODULE)
+		printk("%s""module: %d\n", pfx, mem->module);
+	if (mem->validation_bits & CPER_MEM_VALID_BANK)
+		printk("%s""bank: %d\n", pfx, mem->bank);
+	if (mem->validation_bits & CPER_MEM_VALID_DEVICE)
+		printk("%s""device: %d\n", pfx, mem->device);
+	if (mem->validation_bits & CPER_MEM_VALID_ROW)
+		printk("%s""row: %d\n", pfx, mem->row);
+	if (mem->validation_bits & CPER_MEM_VALID_COLUMN)
+		printk("%s""column: %d\n", pfx, mem->column);
+	if (mem->validation_bits & CPER_MEM_VALID_BIT_POSITION)
+		printk("%s""bit_position: %d\n", pfx, mem->bit_pos);
+	if (mem->validation_bits & CPER_MEM_VALID_REQUESTOR_ID)
+		printk("%s""requestor_id: 0x%016llx\n", pfx, mem->requestor_id);
+	if (mem->validation_bits & CPER_MEM_VALID_RESPONDER_ID)
+		printk("%s""responder_id: 0x%016llx\n", pfx, mem->responder_id);
+	if (mem->validation_bits & CPER_MEM_VALID_TARGET_ID)
+		printk("%s""target_id: 0x%016llx\n", pfx, mem->target_id);
+	if (mem->validation_bits & CPER_MEM_VALID_ERROR_TYPE) {
+		u8 etype = mem->error_type;
+		printk("%s""error_type: %d, %s\n", pfx, etype,
+		       etype < ARRAY_SIZE(cper_mem_err_type_strs) ?
+		       cper_mem_err_type_strs[etype] : "unknown");
+	}
+}
+
+static const char *cper_pcie_port_type_strs[] = {
+	"PCIe end point",
+	"legacy PCI end point",
+	"unknown",
+	"unknown",
+	"root port",
+	"upstream switch port",
+	"downstream switch port",
+	"PCIe to PCI/PCI-X bridge",
+	"PCI/PCI-X to PCIe bridge",
+	"root complex integrated endpoint device",
+	"root complex event collector",
+};
+
+static void cper_print_pcie(const char *pfx, const struct cper_sec_pcie *pcie)
+{
+	if (pcie->validation_bits & CPER_PCIE_VALID_PORT_TYPE)
+		printk("%s""port_type: %d, %s\n", pfx, pcie->port_type,
+		       pcie->port_type < ARRAY_SIZE(cper_pcie_port_type_strs) ?
+		       cper_pcie_port_type_strs[pcie->port_type] : "unknown");
+	if (pcie->validation_bits & CPER_PCIE_VALID_VERSION)
+		printk("%s""version: %d.%d\n", pfx,
+		       pcie->version.major, pcie->version.minor);
+	if (pcie->validation_bits & CPER_PCIE_VALID_COMMAND_STATUS)
+		printk("%s""command: 0x%04x, status: 0x%04x\n", pfx,
+		       pcie->command, pcie->status);
+	if (pcie->validation_bits & CPER_PCIE_VALID_DEVICE_ID) {
+		const __u8 *p;
+		printk("%s""device_id: %04x:%02x:%02x.%x\n", pfx,
+		       pcie->device_id.segment, pcie->device_id.bus,
+		       pcie->device_id.device, pcie->device_id.function);
+		printk("%s""slot: %d\n", pfx,
+		       pcie->device_id.slot >> CPER_PCIE_SLOT_SHIFT);
+		printk("%s""secondary_bus: 0x%02x\n", pfx,
+		       pcie->device_id.secondary_bus);
+		printk("%s""vendor_id: 0x%04x, device_id: 0x%04x\n", pfx,
+		       pcie->device_id.vendor_id, pcie->device_id.device_id);
+		p = pcie->device_id.class_code;
+		printk("%s""class_code: %02x%02x%02x\n", pfx, p[0], p[1], p[2]);
+	}
+	if (pcie->validation_bits & CPER_PCIE_VALID_SERIAL_NUMBER)
+		printk("%s""serial number: 0x%04x, 0x%04x\n", pfx,
+		       pcie->serial_number.lower, pcie->serial_number.upper);
+	if (pcie->validation_bits & CPER_PCIE_VALID_BRIDGE_CONTROL_STATUS)
+		printk(
+	"%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
+	pfx, pcie->bridge.secondary_status, pcie->bridge.control);
+}
+
+static const char *apei_estatus_section_flag_strs[] = {
+	"primary",
+	"containment warning",
+	"reset",
+	"threshold exceeded",
+	"resource not accessible",
+	"latent error",
+};
+
+static void apei_estatus_print_section(
+	const char *pfx, const struct acpi_hest_generic_data *gdata, int sec_no)
+{
+	uuid_le *sec_type = (uuid_le *)gdata->section_type;
+	__u16 severity;
+
+	severity = gdata->error_severity;
+	printk("%s""section: %d, severity: %d, %s\n", pfx, sec_no, severity,
+	       cper_severity_str(severity));
+	printk("%s""flags: 0x%02x\n", pfx, gdata->flags);
+	cper_print_bits(pfx, gdata->flags, apei_estatus_section_flag_strs,
+			ARRAY_SIZE(apei_estatus_section_flag_strs));
+	if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
+		printk("%s""fru_id: %pUl\n", pfx, (uuid_le *)gdata->fru_id);
+	if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
+		printk("%s""fru_text: %.20s\n", pfx, gdata->fru_text);
+
+	if (!uuid_le_cmp(*sec_type, CPER_SEC_PROC_GENERIC)) {
+		struct cper_sec_proc_generic *proc_err = (void *)(gdata + 1);
+		printk("%s""section_type: general processor error\n", pfx);
+		if (gdata->error_data_length >= sizeof(*proc_err))
+			cper_print_proc_generic(pfx, proc_err);
+		else
+			goto err_section_too_small;
+	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PLATFORM_MEM)) {
+		struct cper_sec_mem_err *mem_err = (void *)(gdata + 1);
+		printk("%s""section_type: memory error\n", pfx);
+		if (gdata->error_data_length >= sizeof(*mem_err))
+			cper_print_mem(pfx, mem_err);
+		else
+			goto err_section_too_small;
+	} else if (!uuid_le_cmp(*sec_type, CPER_SEC_PCIE)) {
+		struct cper_sec_pcie *pcie = (void *)(gdata + 1);
+		printk("%s""section_type: PCIe error\n", pfx);
+		if (gdata->error_data_length >= sizeof(*pcie))
+			cper_print_pcie(pfx, pcie);
+		else
+			goto err_section_too_small;
+	} else
+		printk("%s""section type: unknown, %pUl\n", pfx, sec_type);
+
+	return;
+
+err_section_too_small:
+	pr_err(FW_WARN "error section length is too small\n");
+}
+
+void apei_estatus_print(const char *pfx,
+			const struct acpi_hest_generic_status *estatus)
+{
+	struct acpi_hest_generic_data *gdata;
+	unsigned int data_len, gedata_len;
+	int sec_no = 0;
+	__u16 severity;
+
+	printk("%s""APEI generic hardware error status\n", pfx);
+	severity = estatus->error_severity;
+	printk("%s""severity: %d, %s\n", pfx, severity,
+	       cper_severity_str(severity));
+	data_len = estatus->data_length;
+	gdata = (struct acpi_hest_generic_data *)(estatus + 1);
+	while (data_len > sizeof(*gdata)) {
+		gedata_len = gdata->error_data_length;
+		apei_estatus_print_section(pfx, gdata, sec_no);
+		data_len -= gedata_len + sizeof(*gdata);
+		sec_no++;
+	}
+}
+EXPORT_SYMBOL_GPL(apei_estatus_print);
+
 int apei_estatus_check_header(const struct acpi_hest_generic_status *estatus)
 {
 	if (estatus->data_length &&

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH -v4 3/3] ACPI, APEI, Report GHES error information via printk
  2010-12-07  2:22 [PATCH -v4 0/3] ACPI, APEI, Report GHES error information via printk Huang Ying
  2010-12-07  2:22 ` [PATCH -v4 1/3] Add CPER PCIe error section structure and constants definition Huang Ying
  2010-12-07  2:22 ` [PATCH -v4 2/3] ACPI, APEI, Add APEI generic error status printing support Huang Ying
@ 2010-12-07  2:22 ` Huang Ying
  2010-12-14  4:43 ` [PATCH -v4 0/3] " Len Brown
  3 siblings, 0 replies; 5+ messages in thread
From: Huang Ying @ 2010-12-07  2:22 UTC (permalink / raw)
  To: Len Brown
  Cc: linux-kernel, Andi Kleen, Tony Luck, ying.huang, linux-acpi,
	Peter Zijlstra, Andrew Morton, Linus Torvalds, Ingo Molnar

printk is one of the methods to report hardware errors to user space.
This patch implements hardware error reporting for GHES via printk.

Signed-off-by: Huang Ying <ying.huang@intel.com>
---
 drivers/acpi/apei/ghes.c |   25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -43,6 +43,7 @@
 #include <linux/kdebug.h>
 #include <linux/platform_device.h>
 #include <linux/mutex.h>
+#include <linux/ratelimit.h>
 #include <acpi/apei.h>
 #include <acpi/atomicio.h>
 #include <acpi/hed.h>
@@ -255,11 +256,26 @@ static void ghes_do_proc(struct ghes *gh
 		}
 #endif
 	}
+}
+
+static void ghes_print_estatus(const char *pfx, struct ghes *ghes)
+{
+	/* Not more than 2 messages every 5 seconds */
+	static DEFINE_RATELIMIT_STATE(ratelimit, 5*HZ, 2);
 
-	if (!processed && printk_ratelimit())
-		pr_warning(GHES_PFX
-		"Unknown error record from generic hardware error source: %d\n",
-			   ghes->generic->header.source_id);
+	if (pfx == NULL) {
+		if (ghes_severity(ghes->estatus->error_severity) <=
+		    GHES_SEV_CORRECTED)
+			pfx = KERN_WARNING HW_ERR;
+		else
+			pfx = KERN_ERR HW_ERR;
+	}
+	if (__ratelimit(&ratelimit)) {
+		printk(
+	"%s""Hardware error from APEI Generic Hardware Error Source: %d\n",
+	pfx, ghes->generic->header.source_id);
+		apei_estatus_print(pfx, ghes->estatus);
+	}
 }
 
 static int ghes_proc(struct ghes *ghes)
@@ -269,6 +285,7 @@ static int ghes_proc(struct ghes *ghes)
 	rc = ghes_read_estatus(ghes, 0);
 	if (rc)
 		goto out;
+	ghes_print_estatus(NULL, ghes);
 	ghes_do_proc(ghes);
 
 out:

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH -v4 0/3] ACPI, APEI, Report GHES error information via printk
  2010-12-07  2:22 [PATCH -v4 0/3] ACPI, APEI, Report GHES error information via printk Huang Ying
                   ` (2 preceding siblings ...)
  2010-12-07  2:22 ` [PATCH -v4 3/3] ACPI, APEI, Report GHES error information via printk Huang Ying
@ 2010-12-14  4:43 ` Len Brown
  3 siblings, 0 replies; 5+ messages in thread
From: Len Brown @ 2010-12-14  4:43 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-kernel, Andi Kleen, Tony Luck, linux-acpi, Peter Zijlstra,
	Andrew Morton, Linus Torvalds, Ingo Molnar

On Tue, 7 Dec 2010, Huang Ying wrote:

> The recent consensus is that hardware error should be reported via printk.
> This patch adds printk support to APEI GHES.
> 
> v4:
> 
> - Remove pr_pfx, use printk directly

v4 applied to acpi-test

thanks,
Len Brown, Intel Open Source Technology Center

> 
> v3:
> 
> - Add document for output format per Andrew's comments
> - Move pr_pfx into printk.h, hope it is useful for others too
> - Fixes some issues according to comments
> 
> v2:
> 
> - Some minor adjustment of PCIe error section definition and printk format
> 
> [PATCH -v4 1/3] Add CPER PCIe error section structure and constants definition
> [PATCH -v4 2/3] ACPI, APEI, Add APEI generic error status printing support
> [PATCH -v4 3/3] ACPI, APEI, Report GHES error information via printk
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2010-12-14  4:43 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-12-07  2:22 [PATCH -v4 0/3] ACPI, APEI, Report GHES error information via printk Huang Ying
2010-12-07  2:22 ` [PATCH -v4 1/3] Add CPER PCIe error section structure and constants definition Huang Ying
2010-12-07  2:22 ` [PATCH -v4 2/3] ACPI, APEI, Add APEI generic error status printing support Huang Ying
2010-12-07  2:22 ` [PATCH -v4 3/3] ACPI, APEI, Report GHES error information via printk Huang Ying
2010-12-14  4:43 ` [PATCH -v4 0/3] " Len Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox