[PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record

public inbox for linux-kernel@vger.kernel.org

* [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier()
@ 2026-03-19 11:13 Kai-Heng Feng
  2026-03-19 11:13 ` [PATCH v2 2/3] PCI: hisi: Use devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Kai-Heng Feng @ 2026-03-19 11:13 UTC
  To: rafael
  Cc: Kai-Heng Feng, Jonathan Cameron, Tony Luck, Borislav Petkov,
	Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue, Len Brown,
	Robert Moore, Ard Biesheuvel, Breno Leitao, Fabio M. De Francesco,
	Jason Tian, linux-acpi, linux-kernel, acpica-devel

Add a device-managed wrapper around ghes_register_vendor_record_notifier()
so drivers can avoid manual cleanup on device removal or probe failure.

Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
---
v2:
 - New patch.

 drivers/acpi/apei/ghes.c | 25 +++++++++++++++++++++++++
 include/acpi/ghes.h      |  3 +++
 2 files changed, 28 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 8acd2742bb27..d31a70a05538 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -689,6 +689,31 @@ void ghes_unregister_vendor_record_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(ghes_unregister_vendor_record_notifier);
 
+static void ghes_vendor_record_notifier_destroy(void *nb)
+{
+	ghes_unregister_vendor_record_notifier(nb);
+}
+
+/**
+ * devm_ghes_register_vendor_record_notifier - device-managed vendor record notifier registration
+ * @dev: device that owns the notifier lifetime
+ * @nb: pointer to the notifier_block structure of the vendor record handler
+ *
+ * Return: 0 on success, negative errno on failure.
+ */
+int devm_ghes_register_vendor_record_notifier(struct device *dev,
+					      struct notifier_block *nb)
+{
+	int ret;
+
+	ret = ghes_register_vendor_record_notifier(nb);
+	if (ret)
+		return ret;
+
+	return devm_add_action_or_reset(dev, ghes_vendor_record_notifier_destroy, nb);
+}
+EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
+
 static void ghes_vendor_record_work_func(struct work_struct *work)
 {
 	struct ghes_vendor_record_entry *entry;
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 7bea522c0657..ca3ace828c1c 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -71,6 +71,9 @@ int ghes_register_vendor_record_notifier(struct notifier_block *nb);
  */
 void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
 
+int devm_ghes_register_vendor_record_notifier(struct device *dev,
+					      struct notifier_block *nb);
+
 struct list_head *ghes_get_devices(void);
 
 void ghes_estatus_pool_region_free(unsigned long addr, u32 size);
-- 
2.50.1 (Apple Git-155)
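
The wrapper above registers first and then hands the matching unregister to devm_add_action_or_reset(), which queues it for device teardown or runs it at once if queuing fails. A minimal userspace sketch of that register-then-queue shape follows; every fake_* name and MAX_ACTIONS are hypothetical stand-ins for illustration, not kernel APIs, and the real devres code is more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MAX_ACTIONS 8

typedef void (*cleanup_fn)(void *data);

struct fake_device {
	struct {
		cleanup_fn fn;
		void *data;
	} actions[MAX_ACTIONS];
	size_t nr_actions;
};

/* Queue a cleanup action, or run it immediately if it cannot be queued. */
static int fake_add_action_or_reset(struct fake_device *dev,
				    cleanup_fn fn, void *data)
{
	assert(dev && fn);

	if (dev->nr_actions == MAX_ACTIONS) {
		fn(data);	/* "reset": undo the registration right away */
		return -ENOMEM;
	}

	dev->actions[dev->nr_actions].fn = fn;
	dev->actions[dev->nr_actions].data = data;
	dev->nr_actions++;
	return 0;
}

/* Run queued actions in reverse order, as a model of device teardown. */
static void fake_device_release(struct fake_device *dev)
{
	while (dev->nr_actions > 0) {
		dev->nr_actions--;
		dev->actions[dev->nr_actions].fn(dev->actions[dev->nr_actions].data);
	}
}

/* Stand-ins for notifier register/unregister: they only flip a flag. */
static int fake_register(int *registered)
{
	*registered = 1;
	return 0;
}

static void fake_unregister(void *registered)
{
	*(int *)registered = 0;
}

/* Same shape as the wrapper in the patch: register, then queue the undo. */
static int fake_devm_register(struct fake_device *dev, int *registered)
{
	int ret = fake_register(registered);

	if (ret)
		return ret;

	return fake_add_action_or_reset(dev, fake_unregister, registered);
}
```

In this model the caller never unregisters by hand: either fake_device_release() does it at teardown, or the failed queue attempt has already done it before the error is returned.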



* [PATCH v2 2/3] PCI: hisi: Use devm_ghes_register_vendor_record_notifier()
  2026-03-19 11:13 [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
@ 2026-03-19 11:13 ` Kai-Heng Feng
  2026-03-20  9:57   ` Jonathan Cameron
  2026-03-19 11:13 ` [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler Kai-Heng Feng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Kai-Heng Feng @ 2026-03-19 11:13 UTC
  To: rafael
  Cc: Kai-Heng Feng, Jonathan Cameron, Lorenzo Pieralisi,
	Krzysztof Wilczyński, Manivannan Sadhasivam, Rob Herring,
	Bjorn Helgaas, linux-pci, linux-kernel

Switch to the device-managed variant so the notifier is automatically
unregistered on device removal, allowing the open-coded remove callback
to be dropped entirely.

Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
---
v2:
 - New patch.

 drivers/pci/controller/pcie-hisi-error.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/pci/controller/pcie-hisi-error.c b/drivers/pci/controller/pcie-hisi-error.c
index aaf1ed2b6e59..36be86d827a8 100644
--- a/drivers/pci/controller/pcie-hisi-error.c
+++ b/drivers/pci/controller/pcie-hisi-error.c
@@ -287,25 +287,16 @@ static int hisi_pcie_error_handler_probe(struct platform_device *pdev)
 
 	priv->nb.notifier_call = hisi_pcie_notify_error;
 	priv->dev = &pdev->dev;
-	ret = ghes_register_vendor_record_notifier(&priv->nb);
+	ret = devm_ghes_register_vendor_record_notifier(&pdev->dev, &priv->nb);
 	if (ret) {
 		dev_err(&pdev->dev,
 			"Failed to register hisi pcie controller error handler with apei\n");
 		return ret;
 	}
 
-	platform_set_drvdata(pdev, priv);
-
 	return 0;
 }
 
-static void hisi_pcie_error_handler_remove(struct platform_device *pdev)
-{
-	struct hisi_pcie_error_private *priv = platform_get_drvdata(pdev);
-
-	ghes_unregister_vendor_record_notifier(&priv->nb);
-}
-
 static const struct acpi_device_id hisi_pcie_acpi_match[] = {
 	{ "HISI0361", 0 },
 	{ }
@@ -317,7 +308,6 @@ static struct platform_driver hisi_pcie_error_handler_driver = {
 		.acpi_match_table = hisi_pcie_acpi_match,
 	},
 	.probe		= hisi_pcie_error_handler_probe,
-	.remove		= hisi_pcie_error_handler_remove,
 };
 module_platform_driver(hisi_pcie_error_handler_driver);
 
-- 
2.50.1 (Apple Git-155)



* [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-19 11:13 [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
  2026-03-19 11:13 ` [PATCH v2 2/3] PCI: hisi: Use devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
@ 2026-03-19 11:13 ` Kai-Heng Feng
  2026-03-20 10:13   ` Jonathan Cameron
  2026-03-20 14:52   ` Bjorn Helgaas
  2026-03-20  9:55 ` [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier() Jonathan Cameron
  2026-03-23 12:28 ` Hanjun Guo
  3 siblings, 2 replies; 17+ messages in thread
From: Kai-Heng Feng @ 2026-03-19 11:13 UTC
  To: rafael
  Cc: Kai-Heng Feng, Jonathan Cameron, Shiju Jose, Tony Luck,
	Borislav Petkov, Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue,
	Len Brown, Kees Cook, Gustavo A. R. Silva, Will Deacon,
	Huang Yiwei, Dave Jiang, Nathan Chancellor, Fabio M. De Francesco,
	linux-kernel, linux-acpi, linux-hardening

Add support for decoding NVIDIA-specific CPER sections delivered via
the APEI GHES vendor record notifier chain. NVIDIA hardware generates
vendor-specific CPER sections containing error signatures and diagnostic
register dumps. This implementation registers a notifier_block with the
GHES vendor record notifier and decodes these sections, printing error
details via dev_info().

The driver binds to ACPI device NVDA2012, present on NVIDIA server
platforms. The NVIDIA CPER section contains a fixed header with error
metadata (signature, error type, severity, socket) followed by
variable-length register address-value pairs for hardware diagnostics.

This work is based on libcper [0].

Example output:
nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
nvidia-ghes NVDA2012:00: signature: CMET-INFO
nvidia-ghes NVDA2012:00: error_type: 0
nvidia-ghes NVDA2012:00: error_instance: 0
nvidia-ghes NVDA2012:00: severity: 3
nvidia-ghes NVDA2012:00: socket: 0
nvidia-ghes NVDA2012:00: number_regs: 32
nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000

[0] https://github.com/openbmc/libcper/commit/683e055061ce
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
---
v2:
  - Use right headers.
  - Use embedded struct and __counted_by.
  - Drop __packed.
  - Remove unnecessary casts.
  - Use * in sizeof().
  - Use devm_kmalloc() and struct assignment.
  - Use dev_err_probe and new devm helper.

 MAINTAINERS                     |   6 ++
 drivers/acpi/apei/Kconfig       |  14 +++
 drivers/acpi/apei/Makefile      |   1 +
 drivers/acpi/apei/nvidia-ghes.c | 146 ++++++++++++++++++++++++++++++++
 4 files changed, 167 insertions(+)
 create mode 100644 drivers/acpi/apei/nvidia-ghes.c

diff --git a/MAINTAINERS b/MAINTAINERS
index d7241695df96..a9be03fbf1a9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18908,6 +18908,12 @@ S:	Maintained
 F:	drivers/video/fbdev/nvidia/
 F:	drivers/video/fbdev/riva/
 
+NVIDIA GHES VENDOR CPER RECORD HANDLER
+M:	Kai-Heng Feng <kaihengf@nvidia.com>
+L:	linux-acpi@vger.kernel.org
+S:	Maintained
+F:	drivers/acpi/apei/nvidia-ghes.c
+
 NVIDIA VRS RTC DRIVER
 M:	Shubhi Garg <shgarg@nvidia.com>
 L:	linux-tegra@vger.kernel.org
diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
index 070c07d68dfb..7dc49f14f223 100644
--- a/drivers/acpi/apei/Kconfig
+++ b/drivers/acpi/apei/Kconfig
@@ -74,6 +74,20 @@ config ACPI_APEI_EINJ_CXL
 
 	  If unsure say 'n'
 
+config ACPI_APEI_NVIDIA_GHES
+	tristate "NVIDIA GHES vendor record handler"
+	depends on ACPI_APEI_GHES
+	help
+	  Support for decoding NVIDIA-specific CPER sections delivered via
+	  the APEI GHES vendor record notifier chain. Registers a handler
+	  for the NVIDIA section GUID and logs error signatures, severity,
+	  socket, and diagnostic register address-value pairs.
+
+	  Enable on NVIDIA server platforms (e.g. DGX, HGX) that expose
+	  ACPI device NVDA2012 in their firmware tables.
+
+	  If unsure, say N.
+
 config ACPI_APEI_ERST_DEBUG
 	tristate "APEI Error Record Serialization Table (ERST) Debug Support"
 	depends on ACPI_APEI
diff --git a/drivers/acpi/apei/Makefile b/drivers/acpi/apei/Makefile
index 1a0b85923cd4..4a883f67d698 100644
--- a/drivers/acpi/apei/Makefile
+++ b/drivers/acpi/apei/Makefile
@@ -10,5 +10,6 @@ obj-$(CONFIG_ACPI_APEI_EINJ)	+= einj.o
 einj-y				:= einj-core.o
 einj-$(CONFIG_ACPI_APEI_EINJ_CXL) += einj-cxl.o
 obj-$(CONFIG_ACPI_APEI_ERST_DEBUG) += erst-dbg.o
+obj-$(CONFIG_ACPI_APEI_NVIDIA_GHES) += nvidia-ghes.o
 
 apei-y := apei-base.o hest.o erst.o bert.o
diff --git a/drivers/acpi/apei/nvidia-ghes.c b/drivers/acpi/apei/nvidia-ghes.c
new file mode 100644
index 000000000000..aa2e3a387b49
--- /dev/null
+++ b/drivers/acpi/apei/nvidia-ghes.c
@@ -0,0 +1,146 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * NVIDIA GHES vendor record handler
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ */
+
+#include <linux/acpi.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/types.h>
+#include <linux/uuid.h>
+#include <acpi/ghes.h>
+
+static const guid_t nvidia_sec_guid =
+	GUID_INIT(0x6d5244f2, 0x2712, 0x11ec,
+		  0xbe, 0xa7, 0xcb, 0x3f, 0xdb, 0x95, 0xc7, 0x86);
+
+struct cper_sec_nvidia {
+	char	signature[16];
+	__le16	error_type;
+	__le16	error_instance;
+	u8	severity;
+	u8	socket;
+	u8	number_regs;
+	u8	reserved;
+	__le64	instance_base;
+	struct {
+		__le64	addr;
+		__le64	val;
+	} regs[] __counted_by(number_regs);
+};
+
+struct nvidia_ghes_private {
+	struct notifier_block	nb;
+	struct device		*dev;
+};
+
+static void nvidia_ghes_print_error(struct device *dev,
+				    const struct cper_sec_nvidia *nvidia_err,
+				    size_t error_data_length, bool fatal)
+{
+	const char *level = fatal ? KERN_ERR : KERN_INFO;
+	size_t min_size;
+	int i;
+
+	dev_printk(level, dev, "signature: %.16s\n", nvidia_err->signature);
+	dev_printk(level, dev, "error_type: %u\n", le16_to_cpu(nvidia_err->error_type));
+	dev_printk(level, dev, "error_instance: %u\n", le16_to_cpu(nvidia_err->error_instance));
+	dev_printk(level, dev, "severity: %u\n", nvidia_err->severity);
+	dev_printk(level, dev, "socket: %u\n", nvidia_err->socket);
+	dev_printk(level, dev, "number_regs: %u\n", nvidia_err->number_regs);
+	dev_printk(level, dev, "instance_base: 0x%016llx\n",
+		   le64_to_cpu(nvidia_err->instance_base));
+
+	if (nvidia_err->number_regs == 0)
+		return;
+
+	/*
+	 * Validate that all registers fit within error_data_length.
+	 * Each register pair is two little-endian u64s.
+	 */
+	min_size = struct_size(nvidia_err, regs, nvidia_err->number_regs);
+	if (error_data_length < min_size) {
+		dev_err(dev, "Invalid number_regs %u (section size %zu, need %zu)\n",
+			nvidia_err->number_regs, error_data_length, min_size);
+		return;
+	}
+
+	for (i = 0; i < nvidia_err->number_regs; i++)
+		dev_printk(level, dev, "register[%d]: address=0x%016llx value=0x%016llx\n",
+			   i, le64_to_cpu(nvidia_err->regs[i].addr),
+			   le64_to_cpu(nvidia_err->regs[i].val));
+}
+
+static int nvidia_ghes_notify(struct notifier_block *nb,
+			      unsigned long event, void *data)
+{
+	struct acpi_hest_generic_data *gdata = data;
+	struct nvidia_ghes_private *priv;
+	const struct cper_sec_nvidia *nvidia_err;
+	guid_t sec_guid;
+
+	import_guid(&sec_guid, gdata->section_type);
+	if (!guid_equal(&sec_guid, &nvidia_sec_guid))
+		return NOTIFY_DONE;
+
+	priv = container_of(nb, struct nvidia_ghes_private, nb);
+
+	if (acpi_hest_get_error_length(gdata) < sizeof(*nvidia_err)) {
+		dev_err(priv->dev, "Section too small (%u < %zu)\n",
+			acpi_hest_get_error_length(gdata), sizeof(*nvidia_err));
+		return NOTIFY_OK;
+	}
+
+	nvidia_err = acpi_hest_get_payload(gdata);
+
+	if (event >= GHES_SEV_RECOVERABLE)
+		dev_err(priv->dev, "NVIDIA CPER section, error_data_length: %u\n",
+			acpi_hest_get_error_length(gdata));
+	else
+		dev_info(priv->dev, "NVIDIA CPER section, error_data_length: %u\n",
+			 acpi_hest_get_error_length(gdata));
+
+	nvidia_ghes_print_error(priv->dev, nvidia_err, acpi_hest_get_error_length(gdata),
+				event >= GHES_SEV_RECOVERABLE);
+
+	return NOTIFY_OK;
+}
+
+static int nvidia_ghes_probe(struct platform_device *pdev)
+{
+	struct nvidia_ghes_private *priv;
+
+	priv = devm_kmalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	*priv = (struct nvidia_ghes_private) {
+		.nb.notifier_call = nvidia_ghes_notify,
+		.dev = &pdev->dev,
+	};
+
+	return dev_err_probe(&pdev->dev,
+			     devm_ghes_register_vendor_record_notifier(&pdev->dev, &priv->nb),
+			     "Failed to register NVIDIA GHES vendor record notifier\n");
+}
+
+static const struct acpi_device_id nvidia_ghes_acpi_match[] = {
+	{ "NVDA2012" },
+	{ }
+};
+MODULE_DEVICE_TABLE(acpi, nvidia_ghes_acpi_match);
+
+static struct platform_driver nvidia_ghes_driver = {
+	.driver = {
+		.name		= "nvidia-ghes",
+		.acpi_match_table = nvidia_ghes_acpi_match,
+	},
+	.probe	= nvidia_ghes_probe,
+};
+module_platform_driver(nvidia_ghes_driver);
+
+MODULE_AUTHOR("Kai-Heng Feng <kaihengf@nvidia.com>");
+MODULE_DESCRIPTION("NVIDIA GHES vendor CPER record handler");
+MODULE_LICENSE("GPL");
-- 
2.50.1 (Apple Git-155)
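
The length check in nvidia_ghes_print_error() can be cross-checked against the example output in the commit message: a 32-byte fixed header plus 32 register pairs of 16 bytes each gives 544 bytes, matching "error_data_length: 544". A minimal userspace sketch, assuming fixed-width integer types in place of __le16/__le64 and the hypothetical names nvidia_section, nvidia_reg_pair, nvidia_section_min_size(), nvidia_section_fits() and nvidia_section_selfcheck():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the section layout in the patch (names are hypothetical). */
struct nvidia_reg_pair {
	uint64_t addr;
	uint64_t val;
};

struct nvidia_section {
	char     signature[16];
	uint16_t error_type;
	uint16_t error_instance;
	uint8_t  severity;
	uint8_t  socket;
	uint8_t  number_regs;
	uint8_t  reserved;
	uint64_t instance_base;
	struct nvidia_reg_pair regs[];	/* number_regs entries follow the header */
};

_Static_assert(sizeof(struct nvidia_section) == 32, "fixed header is 32 bytes");
_Static_assert(sizeof(struct nvidia_reg_pair) == 16, "each pair is two u64s");

/* Bytes needed for the header plus number_regs address/value pairs. */
static size_t nvidia_section_min_size(uint8_t number_regs)
{
	return sizeof(struct nvidia_section) +
	       (size_t)number_regs * sizeof(struct nvidia_reg_pair);
}

/* True when a section of error_data_length bytes can hold all the pairs. */
static bool nvidia_section_fits(size_t error_data_length, uint8_t number_regs)
{
	return error_data_length >= nvidia_section_min_size(number_regs);
}

/* Worked example from the commit message: 32 pairs in a 544-byte section. */
static void nvidia_section_selfcheck(void)
{
	assert(nvidia_section_min_size(32) == 544);
	assert(nvidia_section_fits(544, 32));
	assert(!nvidia_section_fits(543, 32));
}
```

Because number_regs is a u8, the largest possible requirement in this model is 32 + 255 * 16 = 4112 bytes, so the size computation cannot overflow size_t.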



* Re: [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier()
  2026-03-19 11:13 [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
  2026-03-19 11:13 ` [PATCH v2 2/3] PCI: hisi: Use devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
  2026-03-19 11:13 ` [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler Kai-Heng Feng
@ 2026-03-20  9:55 ` Jonathan Cameron
  2026-03-24 10:14   ` Kai-Heng Feng
  2026-03-23 12:28 ` Hanjun Guo
  3 siblings, 1 reply; 17+ messages in thread
From: Jonathan Cameron @ 2026-03-20  9:55 UTC
  To: Kai-Heng Feng
  Cc: rafael, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Robert Moore,
	Ard Biesheuvel, Breno Leitao, Fabio M. De Francesco, Jason Tian,
	linux-acpi, linux-kernel, acpica-devel

On Thu, 19 Mar 2026 19:13:07 +0800
Kai-Heng Feng <kaihengf@nvidia.com> wrote:

> Add a device-managed wrapper around ghes_register_vendor_record_notifier()
> so drivers can avoid manual cleanup on device removal or probe failure.
> 
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
Hi,

My only comment is about following local style. I think that
means moving the docs to the header.  Unfortunately whether things
are in the header or the c file is a subsystem specific thing.

My preference is in the c file, but local style overrides that!
Better to have all the docs in the same place.

Jonathan

> ---
> v2:
>  - New patch.
> 
>  drivers/acpi/apei/ghes.c | 25 +++++++++++++++++++++++++
>  include/acpi/ghes.h      |  3 +++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 8acd2742bb27..d31a70a05538 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -689,6 +689,31 @@ void ghes_unregister_vendor_record_notifier(struct notifier_block *nb)
>  }
>  EXPORT_SYMBOL_GPL(ghes_unregister_vendor_record_notifier);
>  
> +static void ghes_vendor_record_notifier_destroy(void *nb)
> +{
> +	ghes_unregister_vendor_record_notifier(nb);
> +}
> +
> +/**
> + * devm_ghes_register_vendor_record_notifier - device-managed vendor record notifier registration

There is also quite a bit of kernel doc in header.  So I guess
local convention is put it there not in the C code?

Hence I would move the docs there.


> + * @dev: device that owns the notifier lifetime
> + * @nb: pointer to the notifier_block structure of the vendor record handler
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int devm_ghes_register_vendor_record_notifier(struct device *dev,
> +					      struct notifier_block *nb)
> +{
> +	int ret;
> +
> +	ret = ghes_register_vendor_record_notifier(nb);
> +	if (ret)
> +		return ret;
> +
> +	return devm_add_action_or_reset(dev, ghes_vendor_record_notifier_destroy, nb);
> +}
> +EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
> +
>  static void ghes_vendor_record_work_func(struct work_struct *work)
>  {
>  	struct ghes_vendor_record_entry *entry;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 7bea522c0657..ca3ace828c1c 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -71,6 +71,9 @@ int ghes_register_vendor_record_notifier(struct notifier_block *nb);
>   */
>  void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
>  
> +int devm_ghes_register_vendor_record_notifier(struct device *dev,
> +					      struct notifier_block *nb);
> +
>  struct list_head *ghes_get_devices(void);
>  
>  void ghes_estatus_pool_region_free(unsigned long addr, u32 size);



* Re: [PATCH v2 2/3] PCI: hisi: Use devm_ghes_register_vendor_record_notifier()
  2026-03-19 11:13 ` [PATCH v2 2/3] PCI: hisi: Use devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
@ 2026-03-20  9:57   ` Jonathan Cameron
  0 siblings, 0 replies; 17+ messages in thread
From: Jonathan Cameron @ 2026-03-20  9:57 UTC
  To: Kai-Heng Feng
  Cc: rafael, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Rob Herring, Bjorn Helgaas, linux-pci,
	linux-kernel, shiju.jose

On Thu, 19 Mar 2026 19:13:08 +0800
Kai-Heng Feng <kaihengf@nvidia.com> wrote:

> Switch to the device-managed variant so the notifier is automatically
> unregistered on device removal, allowing the open-coded remove callback
> to be dropped entirely.
> 
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
+CC Shiju.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> ---
> v2:
>  - New patch.
> 
>  drivers/pci/controller/pcie-hisi-error.c | 12 +-----------
>  1 file changed, 1 insertion(+), 11 deletions(-)
> 
> diff --git a/drivers/pci/controller/pcie-hisi-error.c b/drivers/pci/controller/pcie-hisi-error.c
> index aaf1ed2b6e59..36be86d827a8 100644
> --- a/drivers/pci/controller/pcie-hisi-error.c
> +++ b/drivers/pci/controller/pcie-hisi-error.c
> @@ -287,25 +287,16 @@ static int hisi_pcie_error_handler_probe(struct platform_device *pdev)
>  
>  	priv->nb.notifier_call = hisi_pcie_notify_error;
>  	priv->dev = &pdev->dev;
> -	ret = ghes_register_vendor_record_notifier(&priv->nb);
> +	ret = devm_ghes_register_vendor_record_notifier(&pdev->dev, &priv->nb);
>  	if (ret) {
>  		dev_err(&pdev->dev,
>  			"Failed to register hisi pcie controller error handler with apei\n");
>  		return ret;
>  	}
>  
> -	platform_set_drvdata(pdev, priv);
> -
>  	return 0;
>  }
>  
> -static void hisi_pcie_error_handler_remove(struct platform_device *pdev)
> -{
> -	struct hisi_pcie_error_private *priv = platform_get_drvdata(pdev);
> -
> -	ghes_unregister_vendor_record_notifier(&priv->nb);
> -}
> -
>  static const struct acpi_device_id hisi_pcie_acpi_match[] = {
>  	{ "HISI0361", 0 },
>  	{ }
> @@ -317,7 +308,6 @@ static struct platform_driver hisi_pcie_error_handler_driver = {
>  		.acpi_match_table = hisi_pcie_acpi_match,
>  	},
>  	.probe		= hisi_pcie_error_handler_probe,
> -	.remove		= hisi_pcie_error_handler_remove,
>  };
>  module_platform_driver(hisi_pcie_error_handler_driver);
>  



* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-19 11:13 ` [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler Kai-Heng Feng
@ 2026-03-20 10:13   ` Jonathan Cameron
  2026-03-24  9:10     ` Kai-Heng Feng
  2026-03-20 14:52   ` Bjorn Helgaas
  1 sibling, 1 reply; 17+ messages in thread
From: Jonathan Cameron @ 2026-03-20 10:13 UTC
  To: Kai-Heng Feng
  Cc: rafael, Shiju Jose, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Kees Cook,
	Gustavo A. R. Silva, Will Deacon, Huang Yiwei, Dave Jiang,
	Nathan Chancellor, Fabio M. De Francesco, linux-kernel,
	linux-acpi, linux-hardening

On Thu, 19 Mar 2026 19:13:09 +0800
Kai-Heng Feng <kaihengf@nvidia.com> wrote:

> Add support for decoding NVIDIA-specific CPER sections delivered via
> the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> vendor-specific CPER sections containing error signatures and diagnostic
> register dumps. This implementation registers a notifier_block with the
> GHES vendor record notifier and decodes these sections, printing error
> details via dev_info().
> 
> The driver binds to ACPI device NVDA2012, present on NVIDIA server
> platforms. The NVIDIA CPER section contains a fixed header with error
> metadata (signature, error type, severity, socket) followed by
> variable-length register address-value pairs for hardware diagnostics.
> 
> This work is based on libcper [0].
> 
> Example output:
> nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> nvidia-ghes NVDA2012:00: signature: CMET-INFO
> nvidia-ghes NVDA2012:00: error_type: 0
> nvidia-ghes NVDA2012:00: error_instance: 0
> nvidia-ghes NVDA2012:00: severity: 3
> nvidia-ghes NVDA2012:00: socket: 0
> nvidia-ghes NVDA2012:00: number_regs: 32
> nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000
> 
> [0] https://github.com/openbmc/libcper/commit/683e055061ce
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Cc: Shiju Jose <shiju.jose@huawei.com>
> Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
Only significant thing is around use of dev_err_probe().

I'm surprised that didn't give you error messages in the log even on success.

With that fixed (other stuff is all up to you).
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


>  apei-y := apei-base.o hest.o erst.o bert.o
> diff --git a/drivers/acpi/apei/nvidia-ghes.c b/drivers/acpi/apei/nvidia-ghes.c
> new file mode 100644
> index 000000000000..aa2e3a387b49
> --- /dev/null
> +++ b/drivers/acpi/apei/nvidia-ghes.c

> +static void nvidia_ghes_print_error(struct device *dev,
> +				    const struct cper_sec_nvidia *nvidia_err,
> +				    size_t error_data_length, bool fatal)
> +{
> +	const char *level = fatal ? KERN_ERR : KERN_INFO;
> +	size_t min_size;
> +	int i;
...


> +	 * Validate that all registers fit within error_data_length.
> +	 * Each register pair is two little-endian u64s.
> +	 */
> +	min_size = struct_size(nvidia_err, regs, nvidia_err->number_regs);
> +	if (error_data_length < min_size) {
> +		dev_err(dev, "Invalid number_regs %u (section size %zu, need %zu)\n",
> +			nvidia_err->number_regs, error_data_length, min_size);
> +		return;
> +	}
> +
> +	for (i = 0; i < nvidia_err->number_regs; i++)

Trivial but I'd take advantage of it now being acceptable (in general) to do
	for (int i = 0; i < ....)

> +		dev_printk(level, dev, "register[%d]: address=0x%016llx value=0x%016llx\n",
> +			   i, le64_to_cpu(nvidia_err->regs[i].addr),
> +			   le64_to_cpu(nvidia_err->regs[i].val));
> +}

> +static int nvidia_ghes_probe(struct platform_device *pdev)
> +{
> +	struct nvidia_ghes_private *priv;
> +
> +	priv = devm_kmalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	*priv = (struct nvidia_ghes_private) {
> +		.nb.notifier_call = nvidia_ghes_notify,
> +		.dev = &pdev->dev,
> +	};
> +
> +	return dev_err_probe(&pdev->dev,
> +			     devm_ghes_register_vendor_record_notifier(&pdev->dev, &priv->nb),
That's not great for readability, and dev_err_probe() should only be called on errors.
I'm fairly sure it doesn't have special handling for 0, so it will call dev_err() or dev_warn()
and print some stuff before saying 'no error'.

	int ret;
	...

	ret = devm_ghes_register_vendor_record_notifier(&pdev->dev, &priv->nb);
	if (ret)
		return dev_err_probe(&pdev->dev, ret,
				     "Failed to register NVIDIA GHES vendor record notifier\n");

	return 0;



> +			     "Failed to register NVIDIA GHES vendor record notifier\n");
> +}
> +
> +static const struct acpi_device_id nvidia_ghes_acpi_match[] = {
> +	{ "NVDA2012" },

London Olympics :)

> +	{ }
> +};
> +MODULE_DEVICE_TABLE(acpi, nvidia_ghes_acpi_match);
> +
> +static struct platform_driver nvidia_ghes_driver = {
> +	.driver = {
> +		.name		= "nvidia-ghes",
> +		.acpi_match_table = nvidia_ghes_acpi_match,
> +	},
> +	.probe	= nvidia_ghes_probe,

I'd just not attempt to align the = 
static struct platform_driver nvidia_ghes_driver = {
	.driver = {
		.name = "nvidia-ghes",
		.acpi_match_table = nvidia_ghes_acpi_match,
	},
	.probe = nvidia_ghes_probe,

There aren't enough of them to make it much of a readability improvement
and doing this often results in unnecessary churn as a driver evolves.
Also it's already broken!

> +};
> +module_platform_driver(nvidia_ghes_driver);
> +
> +MODULE_AUTHOR("Kai-Heng Feng <kaihengf@nvidia.com>");
> +MODULE_DESCRIPTION("NVIDIA GHES vendor CPER record handler");
> +MODULE_LICENSE("GPL");
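
The concern above is that wrapping the registration call directly in dev_err_probe() invokes the logging helper even when the call returns 0. A minimal sketch of the two shapes, using a hypothetical mock_err_probe() that logs unconditionally; that behaviour is an assumption made for illustration, not the kernel's dev_err_probe() implementation, and probe_unconditional() and probe_guarded() are likewise invented names:

```c
#include <assert.h>
#include <stdio.h>

/* Counts how often the mock logging helper was invoked. */
static int mock_log_calls;

/*
 * Hypothetical mock, not the kernel's dev_err_probe(): it logs on
 * every call and returns err unchanged.
 */
static int mock_err_probe(int err, const char *msg)
{
	assert(msg != NULL);

	mock_log_calls++;
	fprintf(stderr, "error %d: %s\n", err, msg);
	return err;
}

/* Shape used in the patch: the helper runs even when register_ret is 0. */
static int probe_unconditional(int register_ret)
{
	return mock_err_probe(register_ret, "failed to register notifier");
}

/* Shape suggested in the review: the helper runs only on failure. */
static int probe_guarded(int register_ret)
{
	if (register_ret)
		return mock_err_probe(register_ret, "failed to register notifier");

	return 0;
}
```

Both variants return the same value to the caller; under the stated assumption they differ only in whether a success path produces a log line.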



* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-19 11:13 ` [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler Kai-Heng Feng
  2026-03-20 10:13   ` Jonathan Cameron
@ 2026-03-20 14:52   ` Bjorn Helgaas
  2026-03-20 15:13     ` Bjorn Helgaas
  2026-03-24  9:33     ` Kai-Heng Feng
  1 sibling, 2 replies; 17+ messages in thread
From: Bjorn Helgaas @ 2026-03-20 14:52 UTC
  To: Kai-Heng Feng
  Cc: rafael, Jonathan Cameron, Shiju Jose, Tony Luck, Borislav Petkov,
	Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue, Len Brown,
	Kees Cook, Gustavo A. R. Silva, Will Deacon, Huang Yiwei,
	Dave Jiang, Nathan Chancellor, Fabio M. De Francesco,
	linux-kernel, linux-acpi, linux-hardening

On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:
> Add support for decoding NVIDIA-specific CPER sections delivered via
> the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> vendor-specific CPER sections containing error signatures and diagnostic
> register dumps. This implementation registers a notifier_block with the
> GHES vendor record notifier and decodes these sections, printing error
> details via dev_info().
> 
> The driver binds to ACPI device NVDA2012, present on NVIDIA server
> platforms. The NVIDIA CPER section contains a fixed header with error
> metadata (signature, error type, severity, socket) followed by
> variable-length register address-value pairs for hardware diagnostics.
> 
> This work is based on libcper [0].
> 
> Example output:
> nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> nvidia-ghes NVDA2012:00: signature: CMET-INFO
> nvidia-ghes NVDA2012:00: error_type: 0
> nvidia-ghes NVDA2012:00: error_instance: 0
> nvidia-ghes NVDA2012:00: severity: 3
> nvidia-ghes NVDA2012:00: socket: 0
> nvidia-ghes NVDA2012:00: number_regs: 32
> nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000

Is there a convenient way to connect NVDA2012:00 with the actual device?  I
assume this is typically a PCIe device?  How would we relate this with PCIe
errors?

Consider a cover letter.  Some of these comments apply to the series.

Wrap commit logs to fit in 75 columns.  When indented by "git log", all of
these overflow 80 columns by just a few characters.

Possibly reorder so the acpi/apei patches are together.  I don't think the
NVIDIA record handler depends on the PCI patch.

Typical subject line style in drivers/acpi/apei appears to be:

  ACPI: APEI: GHES: Add ...

> +config ACPI_APEI_NVIDIA_GHES
> +	tristate "NVIDIA GHES vendor record handler"
> +	depends on ACPI_APEI_GHES

Maybe s/ACPI_APEI_NVIDIA_GHES/ACPI_APEI_GHES_NVIDIA/ since there will
likely be more, and they'll sort nicely if the vendor is at the end.

> +	help
> +	  Support for decoding NVIDIA-specific CPER sections delivered via
> +	  the APEI GHES vendor record notifier chain. Registers a handler
> +	  for the NVIDIA section GUID and logs error signatures, severity,
> +	  socket, and diagnostic register address-value pairs.
> +
> +	  Enable on NVIDIA server platforms (e.g. DGX, HGX) that expose
> +	  ACPI device NVDA2012 in their firmware tables.

Wrap to fit in 80 columns like the rest of this file.

> +++ b/drivers/acpi/apei/nvidia-ghes.c

Maybe rename to "ghes-nvidia.c" so future decoders for other vendors are
grouped?


* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-20 14:52   ` Bjorn Helgaas
@ 2026-03-20 15:13     ` Bjorn Helgaas
  2026-03-24  9:33     ` Kai-Heng Feng
  1 sibling, 0 replies; 17+ messages in thread
From: Bjorn Helgaas @ 2026-03-20 15:13 UTC
  To: Kai-Heng Feng
  Cc: rafael, Jonathan Cameron, Shiju Jose, Tony Luck, Borislav Petkov,
	Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue, Len Brown,
	Kees Cook, Gustavo A. R. Silva, Will Deacon, Huang Yiwei,
	Dave Jiang, Nathan Chancellor, Fabio M. De Francesco,
	linux-kernel, linux-acpi, linux-hardening

On Fri, Mar 20, 2026 at 09:52:54AM -0500, Bjorn Helgaas wrote:
> On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:
> > Add support for decoding NVIDIA-specific CPER sections delivered via
> ...

> Wrap commit logs to fit in 75 columns.  When indented by "git log", all of
> these overflow 80 columns by just a few characters.

I'm so sorry, I had accidentally resized my window narrower.  These are all
fine; please ignore this and the similar Kconfig comment.


* Re: [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier()
  2026-03-19 11:13 [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
                   ` (2 preceding siblings ...)
  2026-03-20  9:55 ` [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier() Jonathan Cameron
@ 2026-03-23 12:28 ` Hanjun Guo
  3 siblings, 0 replies; 17+ messages in thread
From: Hanjun Guo @ 2026-03-23 12:28 UTC
  To: Kai-Heng Feng, rafael
  Cc: Jonathan Cameron, Tony Luck, Borislav Petkov,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Robert Moore,
	Ard Biesheuvel, Breno Leitao, Fabio M. De Francesco, Jason Tian,
	linux-acpi, linux-kernel, acpica-devel

Hi Kai-Heng Feng,

Sorry, where is [PATCH 2/3]? Did I miss something?

I reviewed the patch set; it seems we don't need another patch?

Thanks
Hanjun

On 2026/3/19 19:13, Kai-Heng Feng wrote:
> Add a device-managed wrapper around ghes_register_vendor_record_notifier()
> so drivers can avoid manual cleanup on device removal or probe failure.
> 
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
> ---
> v2:
>   - New patch.
> 
>   drivers/acpi/apei/ghes.c | 25 +++++++++++++++++++++++++
>   include/acpi/ghes.h      |  3 +++
>   2 files changed, 28 insertions(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 8acd2742bb27..d31a70a05538 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -689,6 +689,31 @@ void ghes_unregister_vendor_record_notifier(struct notifier_block *nb)
>   }
>   EXPORT_SYMBOL_GPL(ghes_unregister_vendor_record_notifier);
>   
> +static void ghes_vendor_record_notifier_destroy(void *nb)
> +{
> +	ghes_unregister_vendor_record_notifier(nb);
> +}
> +
> +/**
> + * devm_ghes_register_vendor_record_notifier - device-managed vendor record notifier registration
> + * @dev: device that owns the notifier lifetime
> + * @nb: pointer to the notifier_block structure of the vendor record handler
> + *
> + * Return: 0 on success, negative errno on failure.
> + */
> +int devm_ghes_register_vendor_record_notifier(struct device *dev,
> +					      struct notifier_block *nb)
> +{
> +	int ret;
> +
> +	ret = ghes_register_vendor_record_notifier(nb);
> +	if (ret)
> +		return ret;
> +
> +	return devm_add_action_or_reset(dev, ghes_vendor_record_notifier_destroy, nb);
> +}
> +EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
> +
>   static void ghes_vendor_record_work_func(struct work_struct *work)
>   {
>   	struct ghes_vendor_record_entry *entry;
> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> index 7bea522c0657..ca3ace828c1c 100644
> --- a/include/acpi/ghes.h
> +++ b/include/acpi/ghes.h
> @@ -71,6 +71,9 @@ int ghes_register_vendor_record_notifier(struct notifier_block *nb);
>    */
>   void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
>   
> +int devm_ghes_register_vendor_record_notifier(struct device *dev,
> +					      struct notifier_block *nb);
> +
>   struct list_head *ghes_get_devices(void);
>   
>   void ghes_estatus_pool_region_free(unsigned long addr, u32 size);
> 
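
For readers unfamiliar with the pattern in the quoted patch, here is a minimal userspace sketch of "register, then attach the undo to the device". Everything prefixed fake_ is a hypothetical stand-in, not kernel API; fake_add_action_or_reset() only models the documented behaviour of devm_add_action_or_reset(), which runs the action immediately if it cannot be queued.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_ACTIONS 4

/* Hypothetical userspace stand-in for a device's list of cleanup actions. */
struct fake_device {
	void (*action[MAX_ACTIONS])(void *);
	void *data[MAX_ACTIONS];
	size_t nr_actions;
};

/* Counts live registrations so the effect of cleanup is observable. */
static int registered_notifiers;

static int fake_register_notifier(void *nb)
{
	(void)nb;
	registered_notifiers++;
	return 0;
}

static void fake_unregister_notifier(void *nb)
{
	(void)nb;
	registered_notifiers--;
}

/*
 * Models devm_add_action_or_reset(): queue the action, or run it
 * immediately and return an error when it cannot be queued.
 */
static int fake_add_action_or_reset(struct fake_device *dev,
				    void (*action)(void *), void *data)
{
	if (dev->nr_actions == MAX_ACTIONS) {
		action(data);
		return -ENOMEM;
	}
	dev->action[dev->nr_actions] = action;
	dev->data[dev->nr_actions] = data;
	dev->nr_actions++;
	return 0;
}

/* Same shape as the wrapper in the patch: register, then tie undo to dev. */
static int fake_devm_register_notifier(struct fake_device *dev, void *nb)
{
	int ret = fake_register_notifier(nb);

	if (ret)
		return ret;

	return fake_add_action_or_reset(dev, fake_unregister_notifier, nb);
}

/* Models driver detach: run the queued actions in reverse order. */
static void fake_device_release(struct fake_device *dev)
{
	while (dev->nr_actions) {
		dev->nr_actions--;
		dev->action[dev->nr_actions](dev->data[dev->nr_actions]);
	}
}
```

In this model the caller never unregisters explicitly: either the release path does it, or the failed "add action" step has already done it.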

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-20 10:13   ` Jonathan Cameron
@ 2026-03-24  9:10     ` Kai-Heng Feng
  0 siblings, 0 replies; 17+ messages in thread
From: Kai-Heng Feng @ 2026-03-24  9:10 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: rafael, Shiju Jose, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Kees Cook,
	Gustavo A. R. Silva, Will Deacon, Huang Yiwei, Dave Jiang,
	Nathan Chancellor, Fabio M. De Francesco, linux-kernel,
	linux-acpi, linux-hardening

On 2026-03-20 10:13, Jonathan Cameron wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, 19 Mar 2026 19:13:09 +0800
> Kai-Heng Feng <kaihengf@nvidia.com> wrote:
>
> > Add support for decoding NVIDIA-specific CPER sections delivered via
> > the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> > vendor-specific CPER sections containing error signatures and diagnostic
> > register dumps. This implementation registers a notifier_block with the
> > GHES vendor record notifier and decodes these sections, printing error
> > details via dev_info().
> >
> > The driver binds to ACPI device NVDA2012, present on NVIDIA server
> > platforms. The NVIDIA CPER section contains a fixed header with error
> > metadata (signature, error type, severity, socket) followed by
> > variable-length register address-value pairs for hardware diagnostics.
> >
> > This work is based on libcper [0].
> >
> > Example output:
> > nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> > nvidia-ghes NVDA2012:00: signature: CMET-INFO
> > nvidia-ghes NVDA2012:00: error_type: 0
> > nvidia-ghes NVDA2012:00: error_instance: 0
> > nvidia-ghes NVDA2012:00: severity: 3
> > nvidia-ghes NVDA2012:00: socket: 0
> > nvidia-ghes NVDA2012:00: number_regs: 32
> > nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> > nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000
> >
> > [0] https://github.com/openbmc/libcper/commit/683e055061ce
> > Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Cc: Shiju Jose <shiju.jose@huawei.com>
> > Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
> Only significant thing is around use of dev_err_probe().
>
> I'm surprised that didn't give you error messages in the log even on success.
>
> With that fixed (other stuff is all up to you).
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>
>
> >  apei-y := apei-base.o hest.o erst.o bert.o
> > diff --git a/drivers/acpi/apei/nvidia-ghes.c b/drivers/acpi/apei/nvidia-ghes.c
> > new file mode 100644
> > index 000000000000..aa2e3a387b49
> > --- /dev/null
> > +++ b/drivers/acpi/apei/nvidia-ghes.c
>
> > +static void nvidia_ghes_print_error(struct device *dev,
> > +                                 const struct cper_sec_nvidia *nvidia_err,
> > +                                 size_t error_data_length, bool fatal)
> > +{
> > +     const char *level = fatal ? KERN_ERR : KERN_INFO;
> > +     size_t min_size;
> > +     int i;
> ...
>
>
> > +      * Validate that all registers fit within error_data_length.
> > +      * Each register pair is two little-endian u64s.
> > +      */
> > +     min_size = struct_size(nvidia_err, regs, nvidia_err->number_regs);
> > +     if (error_data_length < min_size) {
> > +             dev_err(dev, "Invalid number_regs %u (section size %zu, need %zu)\n",
> > +                     nvidia_err->number_regs, error_data_length, min_size);
> > +             return;
> > +     }
> > +
> > +     for (i = 0; i < nvidia_err->number_regs; i++)
>
> Trivial but I'd take advantage of it now being acceptable (in general) to do
>         for (int i = 0; i < ....)

Didn't know it's acceptable now. Will change.
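
As a rough userspace illustration of the length check and the loop-scoped counter discussed here: the struct layout and the helper names below are hypothetical, section_min_size() only approximates what struct_size() computes (header plus n elements, saturating on overflow), and endianness conversion is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: fixed header followed by address/value pairs. */
struct reg_pair {
	uint64_t addr;
	uint64_t val;
};

struct vendor_section {
	uint16_t number_regs;
	struct reg_pair regs[];
};

/*
 * Userspace approximation of struct_size(sec, regs, n): header size plus
 * n trailing elements, saturating to SIZE_MAX on overflow.
 */
static size_t section_min_size(size_t n)
{
	size_t hdr = sizeof(struct vendor_section);

	if (n > (SIZE_MAX - hdr) / sizeof(struct reg_pair))
		return SIZE_MAX;
	return hdr + n * sizeof(struct reg_pair);
}

/* Reject sections too short to hold the header or the claimed registers. */
static bool section_is_valid(const struct vendor_section *sec, size_t len)
{
	if (len < sizeof(struct vendor_section))
		return false;
	return len >= section_min_size(sec->number_regs);
}

/* Walks the register pairs with a counter scoped to the loop. */
static uint64_t sum_values(const struct vendor_section *sec)
{
	uint64_t sum = 0;

	for (int i = 0; i < sec->number_regs; i++)
		sum += sec->regs[i].val;
	return sum;
}
```

The point of validating before the loop is that number_regs comes from firmware-provided data and must not be trusted to stay inside the section.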

>
> > +             dev_printk(level, dev, "register[%d]: address=0x%016llx value=0x%016llx\n",
> > +                        i, le64_to_cpu(nvidia_err->regs[i].addr),
> > +                        le64_to_cpu(nvidia_err->regs[i].val));
> > +}
>
> > +static int nvidia_ghes_probe(struct platform_device *pdev)
> > +{
> > +     struct nvidia_ghes_private *priv;
> > +
> > +     priv = devm_kmalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> > +     if (!priv)
> > +             return -ENOMEM;
> > +
> > +     *priv = (struct nvidia_ghes_private) {
> > +             .nb.notifier_call = nvidia_ghes_notify,
> > +             .dev = &pdev->dev,
> > +     };
> > +
> > +     return dev_err_probe(&pdev->dev,
> > +                          devm_ghes_register_vendor_record_notifier(&pdev->dev, &priv->nb),
> That's not great for readability, and dev_err_probe() should only be called on errors.
> I'm fairly sure it doesn't have special handling for 0, so it will call dev_err() or dev_warn()
> and print some stuff before saying 'no error'.
>
>         int ret;
>         ...
>
>         ret = devm_ghes_register_vendor_record_notifier(&pdev->dev, &priv->nb);
>         if (ret)
>                 return dev_err_probe(&pdev->dev, ret,
>                                       "Failed to register NVIDIA GHES vendor record notifier\n");
>
>         return 0;

OK, will change.
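
To make the difference between the two call patterns concrete, here is a small userspace sketch. report_probe_error() is a hypothetical helper that always logs and returns the code it was given; it is not the kernel's dev_err_probe(), whose exact behaviour for a zero argument is only assumed in the review above.

```c
#include <stdio.h>

/* Counts emitted log lines so the behaviour can be checked. */
static int log_lines;

/*
 * Hypothetical stand-in for an error-reporting helper: it always logs
 * and hands the error code back to the caller.
 */
static int report_probe_error(int err, const char *msg)
{
	fprintf(stderr, "error %d: %s\n", err, msg);
	log_lines++;
	return err;
}

/* Stand-in for the registration call; returns whatever it is told to. */
static int fake_register(int result)
{
	return result;
}

/* Unconditional pattern: the helper runs even when registration succeeded. */
static int probe_unconditional(int result)
{
	return report_probe_error(fake_register(result),
				  "failed to register notifier");
}

/* Suggested pattern: report only when there is an actual error. */
static int probe_checked(int result)
{
	int ret = fake_register(result);

	if (ret)
		return report_probe_error(ret, "failed to register notifier");

	return 0;
}
```

With a helper like this, the unconditional form logs one line on success, while the checked form stays silent unless registration fails.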

>
>
>
> > +                          "Failed to register NVIDIA GHES vendor record notifier\n");
> > +}
> > +
> > +static const struct acpi_device_id nvidia_ghes_acpi_match[] = {
> > +     { "NVDA2012" },
>
> London Olympics :)

Michael Phelps did great :)

>
> > +     { }
> > +};
> > +MODULE_DEVICE_TABLE(acpi, nvidia_ghes_acpi_match);
> > +
> > +static struct platform_driver nvidia_ghes_driver = {
> > +     .driver = {
> > +             .name           = "nvidia-ghes",
> > +             .acpi_match_table = nvidia_ghes_acpi_match,
> > +     },
> > +     .probe  = nvidia_ghes_probe,
>
> I'd just not attempt to align the =
> static struct platform_driver nvidia_ghes_driver = {
>         .driver = {
>                 .name = "nvidia-ghes",
>                 .acpi_match_table = nvidia_ghes_acpi_match,
>         },
>         .probe = nvidia_ghes_probe,
>
> There aren't enough of them to make it much of a readability improvement
> and doing this often results in unnecessary churn as a driver evolves.
> Also it's already broken!

OK, will change too.

>
> > +};
> > +module_platform_driver(nvidia_ghes_driver);
> > +
> > +MODULE_AUTHOR("Kai-Heng Feng <kaihengf@nvidia.com>");
> > +MODULE_DESCRIPTION("NVIDIA GHES vendor CPER record handler");
> > +MODULE_LICENSE("GPL");
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-20 14:52   ` Bjorn Helgaas
  2026-03-20 15:13     ` Bjorn Helgaas
@ 2026-03-24  9:33     ` Kai-Heng Feng
  2026-03-24 16:15       ` Bjorn Helgaas
  1 sibling, 1 reply; 17+ messages in thread
From: Kai-Heng Feng @ 2026-03-24  9:33 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: rafael, Jonathan Cameron, Shiju Jose, Tony Luck, Borislav Petkov,
	Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue, Len Brown,
	Kees Cook, Gustavo A. R. Silva, Will Deacon, Huang Yiwei,
	Dave Jiang, Nathan Chancellor, Fabio M. De Francesco,
	linux-kernel, linux-acpi, linux-hardening

On 2026-03-20 09:52, Bjorn Helgaas wrote:
> On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:
> > Add support for decoding NVIDIA-specific CPER sections delivered via
> > the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> > vendor-specific CPER sections containing error signatures and diagnostic
> > register dumps. This implementation registers a notifier_block with the
> > GHES vendor record notifier and decodes these sections, printing error
> > details via dev_info().
> >
> > The driver binds to ACPI device NVDA2012, present on NVIDIA server
> > platforms. The NVIDIA CPER section contains a fixed header with error
> > metadata (signature, error type, severity, socket) followed by
> > variable-length register address-value pairs for hardware diagnostics.
> >
> > This work is based on libcper [0].
> >
> > Example output:
> > nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> > nvidia-ghes NVDA2012:00: signature: CMET-INFO
> > nvidia-ghes NVDA2012:00: error_type: 0
> > nvidia-ghes NVDA2012:00: error_instance: 0
> > nvidia-ghes NVDA2012:00: severity: 3
> > nvidia-ghes NVDA2012:00: socket: 0
> > nvidia-ghes NVDA2012:00: number_regs: 32
> > nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> > nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000
>
> Is there a convenient way to connect NVDA2012:00 with the actual device?  I
> assume this is typically a PCIe device?  How would we relate this with PCIe
> errors?

The CPER report is from ARM RAS firmware and is not necessarily related
to a PCIe device.

>
> Consider a cover letter.  Some of these comments apply to the series.

Will do in next version.

>
> Wrap commit logs to fit in 75 columns.  When indented by "git log", all of
> these overflow 80 columns by just a few characters.
>
> Possibly reorder so the acpi/apei patches are together.  I don't think the
> NVIDIA record handler depends on the PCI patch.
>
> Typical subject line style in drivers/acpi/apei appears to be:
>
>   ACPI: APEI: GHES: Add ...
>
> > +config ACPI_APEI_NVIDIA_GHES
> > +     tristate "NVIDIA GHES vendor record handler"
> > +     depends on ACPI_APEI_GHES
>
> Maybe s/ACPI_APEI_NVIDIA_GHES/ACPI_APEI_GHES_NVIDIA/ since there will
> likely be more, and they'll sort nicely if the vendor is at the end.

OK, will do.

>
> > +     help
> > +       Support for decoding NVIDIA-specific CPER sections delivered via
> > +       the APEI GHES vendor record notifier chain. Registers a handler
> > +       for the NVIDIA section GUID and logs error signatures, severity,
> > +       socket, and diagnostic register address-value pairs.
> > +
> > +       Enable on NVIDIA server platforms (e.g. DGX, HGX) that expose
> > +       ACPI device NVDA2012 in their firmware tables.
>
> Wrap to fit in 80 columns like the rest of this file.
>
> > +++ b/drivers/acpi/apei/nvidia-ghes.c
>
> Maybe rename to "ghes-nvidia.c" so future decoders for other vendors are
> grouped?

Will do.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier()
  2026-03-20  9:55 ` [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier() Jonathan Cameron
@ 2026-03-24 10:14   ` Kai-Heng Feng
  0 siblings, 0 replies; 17+ messages in thread
From: Kai-Heng Feng @ 2026-03-24 10:14 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: rafael, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Robert Moore,
	Ard Biesheuvel, Breno Leitao, Fabio M. De Francesco, Jason Tian,
	linux-acpi, linux-kernel, acpica-devel

On 2026-03-20 09:55, Jonathan Cameron wrote:
> On Thu, 19 Mar 2026 19:13:07 +0800
> Kai-Heng Feng <kaihengf@nvidia.com> wrote:
>
> > Add a device-managed wrapper around ghes_register_vendor_record_notifier()
> > so drivers can avoid manual cleanup on device removal or probe failure.
> >
> > Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Kai-Heng Feng <kaihengf@nvidia.com>
> Hi,
>
> My only comment is about following local style. I think that
> means moving the docs to the header.  Unfortunately whether things
> are in the header or the c file is a subsystem specific thing.
>
> My preference is in the c file, but local style overrides that!
> Better to have all the docs in the same place.

You are right, I didn't notice it.

>
> Jonathan
>
> > ---
> > v2:
> >  - New patch.
> >
> >  drivers/acpi/apei/ghes.c | 25 +++++++++++++++++++++++++
> >  include/acpi/ghes.h      |  3 +++
> >  2 files changed, 28 insertions(+)
> >
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index 8acd2742bb27..d31a70a05538 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -689,6 +689,31 @@ void ghes_unregister_vendor_record_notifier(struct notifier_block *nb)
> >  }
> >  EXPORT_SYMBOL_GPL(ghes_unregister_vendor_record_notifier);
> >
> > +static void ghes_vendor_record_notifier_destroy(void *nb)
> > +{
> > +     ghes_unregister_vendor_record_notifier(nb);
> > +}
> > +
> > +/**
> > + * devm_ghes_register_vendor_record_notifier - device-managed vendor record notifier registration
>
> There is also quite a bit of kernel-doc in the header, so I guess the
> local convention is to put it there, not in the C code?
>
> Hence I would move the docs there.

Sure, will do in next version.

>
>
> > + * @dev: device that owns the notifier lifetime
> > + * @nb: pointer to the notifier_block structure of the vendor record handler
> > + *
> > + * Return: 0 on success, negative errno on failure.
> > + */
> > +int devm_ghes_register_vendor_record_notifier(struct device *dev,
> > +                                           struct notifier_block *nb)
> > +{
> > +     int ret;
> > +
> > +     ret = ghes_register_vendor_record_notifier(nb);
> > +     if (ret)
> > +             return ret;
> > +
> > +     return devm_add_action_or_reset(dev, ghes_vendor_record_notifier_destroy, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(devm_ghes_register_vendor_record_notifier);
> > +
> >  static void ghes_vendor_record_work_func(struct work_struct *work)
> >  {
> >       struct ghes_vendor_record_entry *entry;
> > diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
> > index 7bea522c0657..ca3ace828c1c 100644
> > --- a/include/acpi/ghes.h
> > +++ b/include/acpi/ghes.h
> > @@ -71,6 +71,9 @@ int ghes_register_vendor_record_notifier(struct notifier_block *nb);
> >   */
> >  void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
> >
> > +int devm_ghes_register_vendor_record_notifier(struct device *dev,
> > +                                           struct notifier_block *nb);
> > +
> >  struct list_head *ghes_get_devices(void);
> >
> >  void ghes_estatus_pool_region_free(unsigned long addr, u32 size);
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-24  9:33     ` Kai-Heng Feng
@ 2026-03-24 16:15       ` Bjorn Helgaas
  2026-03-25 11:34         ` Kai-Heng Feng
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2026-03-24 16:15 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: rafael, Jonathan Cameron, Shiju Jose, Tony Luck, Borislav Petkov,
	Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue, Len Brown,
	Kees Cook, Gustavo A. R. Silva, Will Deacon, Huang Yiwei,
	Dave Jiang, Nathan Chancellor, Fabio M. De Francesco,
	linux-kernel, linux-acpi, linux-hardening

On Tue, Mar 24, 2026 at 05:33:06PM +0800, Kai-Heng Feng wrote:
> On 2026-03-20 09:52, Bjorn Helgaas wrote:
> > On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:
> > > Add support for decoding NVIDIA-specific CPER sections delivered via
> > > the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> > > vendor-specific CPER sections containing error signatures and diagnostic
> > > register dumps. This implementation registers a notifier_block with the
> > > GHES vendor record notifier and decodes these sections, printing error
> > > details via dev_info().
> > >
> > > The driver binds to ACPI device NVDA2012, present on NVIDIA server
> > > platforms. The NVIDIA CPER section contains a fixed header with error
> > > metadata (signature, error type, severity, socket) followed by
> > > variable-length register address-value pairs for hardware diagnostics.
> > >
> > > This work is based on libcper [0].
> > >
> > > Example output:
> > > nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> > > nvidia-ghes NVDA2012:00: signature: CMET-INFO
> > > nvidia-ghes NVDA2012:00: error_type: 0
> > > nvidia-ghes NVDA2012:00: error_instance: 0
> > > nvidia-ghes NVDA2012:00: severity: 3
> > > nvidia-ghes NVDA2012:00: socket: 0
> > > nvidia-ghes NVDA2012:00: number_regs: 32
> > > nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> > > nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000
> >
> > Is there a convenient way to connect NVDA2012:00 with the actual
> > device?  I assume this is typically a PCIe device?  How would we
> > relate this with PCIe errors?
> 
> The CPER report is from ARM RAS firmware and is not necessarily
> related to a PCIe device.

Right, I know CPER is more general than just PCI/PCIe.

But in this case, I think NVDA2012 probably *is* a PCIe device.  How
would we figure out which one?  If we have to manually do an acpidump,
figure out which NVDA2012 is :00, and look for an _ADR or something,
that doesn't really seem convenient for multi-NVDA2012 situations.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-24 16:15       ` Bjorn Helgaas
@ 2026-03-25 11:34         ` Kai-Heng Feng
  2026-03-25 15:36           ` Bjorn Helgaas
  0 siblings, 1 reply; 17+ messages in thread
From: Kai-Heng Feng @ 2026-03-25 11:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: rafael, Jonathan Cameron, Shiju Jose, Tony Luck, Borislav Petkov,
	Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue, Len Brown,
	Kees Cook, Gustavo A. R. Silva, Will Deacon, Huang Yiwei,
	Dave Jiang, Nathan Chancellor, Fabio M. De Francesco,
	linux-kernel, linux-acpi, linux-hardening

On Wed Mar 25, 2026 at 12:15 AM CST, Bjorn Helgaas wrote:
> On Tue, Mar 24, 2026 at 05:33:06PM +0800, Kai-Heng Feng wrote:
>> On 2026-03-20 09:52, Bjorn Helgaas wrote:
>> > On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:
>> > > Add support for decoding NVIDIA-specific CPER sections delivered via
>> > > the APEI GHES vendor record notifier chain. NVIDIA hardware generates
>> > > vendor-specific CPER sections containing error signatures and diagnostic
>> > > register dumps. This implementation registers a notifier_block with the
>> > > GHES vendor record notifier and decodes these sections, printing error
>> > > details via dev_info().
>> > >
>> > > The driver binds to ACPI device NVDA2012, present on NVIDIA server
>> > > platforms. The NVIDIA CPER section contains a fixed header with error
>> > > metadata (signature, error type, severity, socket) followed by
>> > > variable-length register address-value pairs for hardware diagnostics.
>> > >
>> > > This work is based on libcper [0].
>> > >
>> > > Example output:
>> > > nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
>> > > nvidia-ghes NVDA2012:00: signature: CMET-INFO
>> > > nvidia-ghes NVDA2012:00: error_type: 0
>> > > nvidia-ghes NVDA2012:00: error_instance: 0
>> > > nvidia-ghes NVDA2012:00: severity: 3
>> > > nvidia-ghes NVDA2012:00: socket: 0
>> > > nvidia-ghes NVDA2012:00: number_regs: 32
>> > > nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
>> > > nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000
>> >
>> > Is there a convenient way to connect NVDA2012:00 with the actual
>> > device?  I assume this is typically a PCIe device?  How would we
>> > relate this with PCIe errors?
>>
>> The CPER report is from ARM RAS firmware and is not necessarily
>> related to a PCIe device.
>
> Right, I know CPER is more general than just PCI/PCIe.
>
> But in this case, I think NVDA2012 probably *is* a PCIe device.  How
> would we figure out which one?  If we have to manually do an acpidump,
> figure out which NVDA2012 is :00, and look for an _ADR or something,
> that doesn't really seem convenient for multi-NVDA2012 situations.

It's actually just an ACPI device:
Device (CPER)
{
  Name (_HID, "NVDA2012")  // _HID: Hardware ID
  Name (_UID, 0x00)  // _UID: Unique ID
  Method (_DSM, 4, Serialized) // _DSM: Device-Specific Method
}

And that's it.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-25 11:34         ` Kai-Heng Feng
@ 2026-03-25 15:36           ` Bjorn Helgaas
  2026-03-25 17:08             ` Jonathan Cameron
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2026-03-25 15:36 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: rafael, Jonathan Cameron, Shiju Jose, Tony Luck, Borislav Petkov,
	Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue, Len Brown,
	Kees Cook, Gustavo A. R. Silva, Will Deacon, Huang Yiwei,
	Dave Jiang, Nathan Chancellor, Fabio M. De Francesco,
	linux-kernel, linux-acpi, linux-hardening

On Wed, Mar 25, 2026 at 07:34:50PM +0800, Kai-Heng Feng wrote:
> On Wed Mar 25, 2026 at 12:15 AM CST, Bjorn Helgaas wrote:
> > On Tue, Mar 24, 2026 at 05:33:06PM +0800, Kai-Heng Feng wrote:
> >> On 2026-03-20 09:52, Bjorn Helgaas wrote:
> >> > On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:
> >> > > Add support for decoding NVIDIA-specific CPER sections delivered via
> >> > > the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> >> > > vendor-specific CPER sections containing error signatures and diagnostic
> >> > > register dumps. This implementation registers a notifier_block with the
> >> > > GHES vendor record notifier and decodes these sections, printing error
> >> > > details via dev_info().
> >> > >
> >> > > The driver binds to ACPI device NVDA2012, present on NVIDIA server
> >> > > platforms. The NVIDIA CPER section contains a fixed header with error
> >> > > metadata (signature, error type, severity, socket) followed by
> >> > > variable-length register address-value pairs for hardware diagnostics.
> >> > >
> >> > > This work is based on libcper [0].
> >> > >
> >> > > Example output:
> >> > > nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> >> > > nvidia-ghes NVDA2012:00: signature: CMET-INFO
> >> > > nvidia-ghes NVDA2012:00: error_type: 0
> >> > > nvidia-ghes NVDA2012:00: error_instance: 0
> >> > > nvidia-ghes NVDA2012:00: severity: 3
> >> > > nvidia-ghes NVDA2012:00: socket: 0
> >> > > nvidia-ghes NVDA2012:00: number_regs: 32
> >> > > nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> >> > > nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000
> >> >
> >> > Is there a convenient way to connect NVDA2012:00 with the actual
> >> > device?  I assume this is typically a PCIe device?  How would we
> >> > relate this with PCIe errors?
> >>
> >> The CPER report is from ARM RAS firmware and is not necessarily
> >> related to a PCIe device.
> >
> > Right, I know CPER is more general than just PCI/PCIe.
> >
> > But in this case, I think NVDA2012 probably *is* a PCIe device.  How
> > would we figure out which one?  If we have to manually do an acpidump,
> > figure out which NVDA2012 is :00, and look for an _ADR or something,
> > that doesn't really seem convenient for multi-NVDA2012 situations.
> 
> It's actually just an ACPI device:
> Device (CPER)
> {
>   Name (_HID, "NVDA2012")  // _HID: Hardware ID
>   Name (_UID, 0x00)  // _UID: Unique ID
>   Method (_DSM, 4, Serialized) // _DSM: Device-Specific Method
> }
> 
> And that's it.

Weird.  There's nothing for a driver to operate the device with except
_DSM?  The device doesn't need any MMIO resources?  I would expect some
resources described by a _CRS method or some native enumeration protocol
like PCI BARs.

The _UID 0x00 matches the "00" in "NVDA2012:00", but I think that's a
coincidence; I think the "00" in the device name came from the ida_alloc()
in acpi_device_set_name(), not from _UID.
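
A toy model of that naming scheme, to show why the suffix can match _UID by coincidence: the instance number is simply the lowest free ID at enumeration time. The table, the function names and the "%s:%02x" format below are assumptions made for illustration, not a copy of the ACPI core code.

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_INSTANCES 256

/* Hypothetical per-HID table of instance numbers already handed out. */
static bool instance_in_use[MAX_INSTANCES];

/* Lowest-free-ID allocation, the same policy ida_alloc() follows. */
static int alloc_instance(void)
{
	for (int i = 0; i < MAX_INSTANCES; i++) {
		if (!instance_in_use[i]) {
			instance_in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/*
 * Build "<HID>:<instance>" from the allocated number. The firmware _UID
 * is accepted only to show that it plays no part in the name.
 */
static int make_device_name(char *buf, size_t len, const char *hid,
			    unsigned int uid)
{
	int instance = alloc_instance();

	(void)uid;
	if (instance < 0)
		return -1;
	snprintf(buf, len, "%s:%02x", hid, instance);
	return instance;
}
```

In this model the first NVDA2012 enumerated is named "NVDA2012:00" whatever its _UID is, so the name alone says nothing about which physical part raised the error.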

So I still don't know how you would identify the correct part in a system
with multiple NVDA2012 devices.  I do see the "socket" and "instance_base"
in the output.  Maybe that would help, but those seem to be
device-specific, and it seems like we should have a generic mechanism.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-25 15:36           ` Bjorn Helgaas
@ 2026-03-25 17:08             ` Jonathan Cameron
  2026-03-25 17:16               ` Rafael J. Wysocki
  0 siblings, 1 reply; 17+ messages in thread
From: Jonathan Cameron @ 2026-03-25 17:08 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kai-Heng Feng, rafael, Shiju Jose, Tony Luck, Borislav Petkov,
	Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue, Len Brown,
	Kees Cook, Gustavo A. R. Silva, Will Deacon, Huang Yiwei,
	Dave Jiang, Nathan Chancellor, Fabio M. De Francesco,
	linux-kernel, linux-acpi, linux-hardening

On Wed, 25 Mar 2026 10:36:28 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Wed, Mar 25, 2026 at 07:34:50PM +0800, Kai-Heng Feng wrote:
> > On Wed Mar 25, 2026 at 12:15 AM CST, Bjorn Helgaas wrote:  
> > > On Tue, Mar 24, 2026 at 05:33:06PM +0800, Kai-Heng Feng wrote:  
> > >> On 2026-03-20 09:52, Bjorn Helgaas wrote:  
> > >> > On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:  
> > >> > > Add support for decoding NVIDIA-specific CPER sections delivered via
> > >> > > the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> > >> > > vendor-specific CPER sections containing error signatures and diagnostic
> > >> > > register dumps. This implementation registers a notifier_block with the
> > >> > > GHES vendor record notifier and decodes these sections, printing error
> > >> > > details via dev_info().
> > >> > >
> > >> > > The driver binds to ACPI device NVDA2012, present on NVIDIA server
> > >> > > platforms. The NVIDIA CPER section contains a fixed header with error
> > >> > > metadata (signature, error type, severity, socket) followed by
> > >> > > variable-length register address-value pairs for hardware diagnostics.
> > >> > >
> > >> > > This work is based on libcper [0].
> > >> > >
> > >> > > Example output:
> > >> > > nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> > >> > > nvidia-ghes NVDA2012:00: signature: CMET-INFO
> > >> > > nvidia-ghes NVDA2012:00: error_type: 0
> > >> > > nvidia-ghes NVDA2012:00: error_instance: 0
> > >> > > nvidia-ghes NVDA2012:00: severity: 3
> > >> > > nvidia-ghes NVDA2012:00: socket: 0
> > >> > > nvidia-ghes NVDA2012:00: number_regs: 32
> > >> > > nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> > >> > > nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000  
> > >> >
> > >> > Is there a convenient way to connect NVDA2012:00 with the actual
> > >> > device?  I assume this is typically a PCIe device?  How would we
> > >> > relate this with PCIe errors?  
> > >>
> > >> The CPER report is from ARM RAS firmware and is not necessarily
> > >> related to a PCIe device.
> > >
> > > Right, I know CPER is more general than just PCI/PCIe.
> > >
> > > But in this case, I think NVDA2012 probably *is* a PCIe device.  How
> > > would we figure out which one?  If we have to manually do an acpidump,
> > > figure out which NVDA2012 is :00, and look for an _ADR or something,
> > > that doesn't really seem convenient for multi-NVDA2012 situations.  
> > 
> > It's actually just an ACPI device:
> > Device (CPER)
> > {
> >   Name (_HID, "NVDA2012")  // _HID: Hardware ID
> >   Name (_UID, 0x00)  // _UID: Unique ID
> >   Method (_DSM, 4, Serialized) // _DSM: Device-Specific Method
> > }
> > 
> > And that's it.  
> 
> Weird.  There's nothing for a driver to operate the device with except
> _DSM?  The device doesn't need any MMIO resources?  I would expect some
> resources described by a _CRS method or some native enumeration protocol
> like PCI BARs.
> 
> The _UID 0x00 matches the "00" in "NVDA2012:00", but I think that's a
> coincidence; I think the "00" in the device name came from the ida_alloc()
> in acpi_device_set_name(), not from _UID.
> 
> So I still don't know how you would identify the correct part in a system
> with multiple NVDA2012 devices.  I do see the "socket" and "instance_base"
> in the output.  Maybe that would help, but those seem to be
> device-specific, and it seems like we should have a generic mechanism.

It's not unique in ACPI terms.  There are a few cases even in the ACPI spec
of IDs that exist just to say some feature is there.

ACPI0017 is an example. Simply says, there be CXL here, go look for the
tables.

Here this device is used to indicate that a platform should be ready to handle
a particular type of error record.  If it happened to expose any other
interfaces, then I agree it would need resources or a _DSM etc.

Basically it's a workaround for the lack of discoverability in APEI /
ACPI error reporting. Could use an _OSC bit for the same job but then
we'd run out of those fast.  Device IDs are near free.

Jonathan


> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
  2026-03-25 17:08             ` Jonathan Cameron
@ 2026-03-25 17:16               ` Rafael J. Wysocki
  0 siblings, 0 replies; 17+ messages in thread
From: Rafael J. Wysocki @ 2026-03-25 17:16 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Bjorn Helgaas, Kai-Heng Feng, rafael, Shiju Jose, Tony Luck,
	Borislav Petkov, Hanjun Guo, Mauro Carvalho Chehab, Shuai Xue,
	Len Brown, Kees Cook, Gustavo A. R. Silva, Will Deacon,
	Huang Yiwei, Dave Jiang, Nathan Chancellor, Fabio M. De Francesco,
	linux-kernel, linux-acpi, linux-hardening

On Wed, Mar 25, 2026 at 6:08 PM Jonathan Cameron
<jonathan.cameron@huawei.com> wrote:
>
> On Wed, 25 Mar 2026 10:36:28 -0500
> Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> > On Wed, Mar 25, 2026 at 07:34:50PM +0800, Kai-Heng Feng wrote:
> > > On Wed Mar 25, 2026 at 12:15 AM CST, Bjorn Helgaas wrote:
> > > > On Tue, Mar 24, 2026 at 05:33:06PM +0800, Kai-Heng Feng wrote:
> > > >> On 2026-03-20 09:52, Bjorn Helgaas wrote:
> > > >> > On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:
> > > >> > > Add support for decoding NVIDIA-specific CPER sections delivered via
> > > >> > > the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> > > >> > > vendor-specific CPER sections containing error signatures and diagnostic
> > > >> > > register dumps. This implementation registers a notifier_block with the
> > > >> > > GHES vendor record notifier and decodes these sections, printing error
> > > >> > > details via dev_info().
> > > >> > >
> > > >> > > The driver binds to ACPI device NVDA2012, present on NVIDIA server
> > > >> > > platforms. The NVIDIA CPER section contains a fixed header with error
> > > >> > > metadata (signature, error type, severity, socket) followed by
> > > >> > > variable-length register address-value pairs for hardware diagnostics.
> > > >> > >
> > > >> > > This work is based on libcper [0].
> > > >> > >
> > > >> > > Example output:
> > > >> > > nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> > > >> > > nvidia-ghes NVDA2012:00: signature: CMET-INFO
> > > >> > > nvidia-ghes NVDA2012:00: error_type: 0
> > > >> > > nvidia-ghes NVDA2012:00: error_instance: 0
> > > >> > > nvidia-ghes NVDA2012:00: severity: 3
> > > >> > > nvidia-ghes NVDA2012:00: socket: 0
> > > >> > > nvidia-ghes NVDA2012:00: number_regs: 32
> > > >> > > nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> > > >> > > nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000
> > > >> >
> > > >> > Is there a convenient way to connect NVDA2012:00 with the actual
> > > >> > device?  I assume this is typically a PCIe device?  How would we
> > > >> > relate this with PCIe errors?
> > > >>
> > > >> The CPER report is from ARM RAS firmware and is not necessarily
> > > >> related to a PCIe device.
> > > >
> > > > Right, I know CPER is more general than just PCI/PCIe.
> > > >
> > > > But in this case, I think NVDA2012 probably *is* a PCIe device.  How
> > > > would we figure out which one?  If we have to manually do an acpidump,
> > > > figure out which NVDA2012 is :00, and look for an _ADR or something,
> > > > that doesn't really seem convenient for multi-NVDA2012 situations.
> > >
> > > It's actually just an ACPI device:
> > > Device (CPER)
> > > {
> > >   Name (_HID, "NVDA2012")  // _HID: Hardware ID
> > >   Name (_UID, 0x00)  // _UID: Unique ID
> > >   Method (_DSM, 4, Serialized) // _DSM: Device-Specific Method
> > > }
> > >
> > > And that's it.
> >
> > Weird.  There's nothing for a driver to operate the device with except
> > _DSM?  The device doesn't need any MMIO resources?  I would expect some
> > resources described by a _CRS method or some native enumeration protocol
> > like PCI BARs.
> >
> > The _UID 0x00 matches the "00" in "NVDA2012:00", but I think that's a
> > coincidence; I think the "00" in the device name came from the ida_alloc()
> > in acpi_device_set_name(), not from _UID.
> >
> > So I still don't know how you would identify the correct part in a system
> > with multiple NVDA2012 devices.  I do see the "socket" and "instance_base"
> > in the output.  Maybe that would help, but those seem to be
> > device-specific, and it seems like we should have a generic mechanism.
>
> It's not unique in ACPI terms.  There are a few cases even in the ACPI spec
> of IDs that exist just to say some feature is there.
>
> ACPI0017 is an example. Simply says, there be CXL here, go look for the
> tables.
>
> Here this device is used to indicate that a platform should be ready to handle
> a particular type of error record.  If it happened to expose any other
> interfaces, then I agree it would need resources or a _DSM etc.
>
> Basically it's a workaround for the lack of discoverability in APEI /
> ACPI error reporting. Could use an _OSC bit for the same job but then
> we'd run out of those fast.  Device IDs are near free.

Well, in principle, an auxiliary device could be registered when a
given ACPI table was present.  It would then trigger a driver load and
the driver would probe against the table in question.


Thread overview: 17+ messages
2026-03-19 11:13 [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
2026-03-19 11:13 ` [PATCH v2 2/3] PCI: hisi: Use devm_ghes_register_vendor_record_notifier() Kai-Heng Feng
2026-03-20  9:57   ` Jonathan Cameron
2026-03-19 11:13 ` [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler Kai-Heng Feng
2026-03-20 10:13   ` Jonathan Cameron
2026-03-24  9:10     ` Kai-Heng Feng
2026-03-20 14:52   ` Bjorn Helgaas
2026-03-20 15:13     ` Bjorn Helgaas
2026-03-24  9:33     ` Kai-Heng Feng
2026-03-24 16:15       ` Bjorn Helgaas
2026-03-25 11:34         ` Kai-Heng Feng
2026-03-25 15:36           ` Bjorn Helgaas
2026-03-25 17:08             ` Jonathan Cameron
2026-03-25 17:16               ` Rafael J. Wysocki
2026-03-20  9:55 ` [PATCH v2 1/3] acpi/apei: Add devm_ghes_register_vendor_record_notifier() Jonathan Cameron
2026-03-24 10:14   ` Kai-Heng Feng
2026-03-23 12:28 ` Hanjun Guo
