public inbox for linux-kernel@vger.kernel.org
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Kai-Heng Feng <kaihengf@nvidia.com>, <rafael@kernel.org>,
	Shiju Jose <shiju.jose@huawei.com>,
	Tony Luck <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>,
	Hanjun Guo <guohanjun@huawei.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Shuai Xue <xueshuai@linux.alibaba.com>,
	Len Brown <lenb@kernel.org>, Kees Cook <kees@kernel.org>,
	"Gustavo A. R. Silva" <gustavoars@kernel.org>,
	Will Deacon <will@kernel.org>,
	Huang Yiwei <quic_hyiwei@quicinc.com>,
	Dave Jiang <dave.jiang@intel.com>,
	"Nathan Chancellor" <nathan@kernel.org>,
	"Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>,
	<linux-kernel@vger.kernel.org>, <linux-acpi@vger.kernel.org>,
	<linux-hardening@vger.kernel.org>
Subject: Re: [PATCH v2 3/3] acpi/apei: Add NVIDIA GHES vendor CPER record handler
Date: Wed, 25 Mar 2026 17:08:05 +0000
Message-ID: <20260325170805.00005ba1@huawei.com>
In-Reply-To: <20260325153628.GA1189053@bhelgaas>

On Wed, 25 Mar 2026 10:36:28 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Wed, Mar 25, 2026 at 07:34:50PM +0800, Kai-Heng Feng wrote:
> > On Wed Mar 25, 2026 at 12:15 AM CST, Bjorn Helgaas wrote:  
> > > On Tue, Mar 24, 2026 at 05:33:06PM +0800, Kai-Heng Feng wrote:  
> > >> On 2026-03-20 09:52, Bjorn Helgaas wrote:  
> > >> > On Thu, Mar 19, 2026 at 07:13:09PM +0800, Kai-Heng Feng wrote:  
> > >> > > Add support for decoding NVIDIA-specific CPER sections delivered via
> > >> > > the APEI GHES vendor record notifier chain. NVIDIA hardware generates
> > >> > > vendor-specific CPER sections containing error signatures and diagnostic
> > >> > > register dumps. This implementation registers a notifier_block with the
> > >> > > GHES vendor record notifier and decodes these sections, printing error
> > >> > > details via dev_info().
> > >> > >
> > >> > > The driver binds to ACPI device NVDA2012, present on NVIDIA server
> > >> > > platforms. The NVIDIA CPER section contains a fixed header with error
> > >> > > metadata (signature, error type, severity, socket) followed by
> > >> > > variable-length register address-value pairs for hardware diagnostics.
> > >> > >
> > >> > > This work is based on libcper [0].
> > >> > >
> > >> > > Example output:
> > >> > > nvidia-ghes NVDA2012:00: NVIDIA CPER section, error_data_length: 544
> > >> > > nvidia-ghes NVDA2012:00: signature: CMET-INFO
> > >> > > nvidia-ghes NVDA2012:00: error_type: 0
> > >> > > nvidia-ghes NVDA2012:00: error_instance: 0
> > >> > > nvidia-ghes NVDA2012:00: severity: 3
> > >> > > nvidia-ghes NVDA2012:00: socket: 0
> > >> > > nvidia-ghes NVDA2012:00: number_regs: 32
> > >> > > nvidia-ghes NVDA2012:00: instance_base: 0x0000000000000000
> > >> > > nvidia-ghes NVDA2012:00: register[0]: address=0x8000000100000000 value=0x0000000100000000  
> > >> >
> > >> > Is there a convenient way to connect NVDA2012:00 with the actual
> > >> > device?  I assume this is typically a PCIe device?  How would we
> > >> > relate this with PCIe errors?  
> > >>
> > >> The CPER report is from ARM RAS firmware and is not necessarily
> > >> related to a PCIe device.  
> > >
> > > Right, I know CPER is more general than just PCI/PCIe.
> > >
> > > But in this case, I think NVDA2012 probably *is* a PCIe device.  How
> > > would we figure out which one?  If we have to manually do an acpidump,
> > > figure out which NVDA2012 is :00, and look for an _ADR or something,
> > > that doesn't really seem convenient for multi-NVDA2012 situations.  
> > 
> > It's actually just an ACPI device:
> > Device (CPER)
> > {
> >   Name (_HID, "NVDA2012")  // _HID: Hardware ID
> >   Name (_UID, 0x00)  // _UID: Unique ID
> >   Method (_DSM, 4, Serialized) // _DSM: Device-Specific Method
> > }
> > 
> > And that's it.  
> 
> Weird.  There's nothing for a driver to operate the device with except
> _DSM?  The device doesn't need any MMIO resources?  I would expect some
> resources described by a _CRS method or some native enumeration protocol
> like PCI BARs.
> 
> The _UID 0x00 matches the "00" in "NVDA2012:00", but I think that's a
> coincidence; I think the "00" in the device name came from the ida_alloc()
> in acpi_device_set_name(), not from _UID.
> 
> So I still don't know how you would identify the correct part in a system
> with multiple NVDA2012 devices.  I do see the "socket" and "instance_base"
> in the output.  Maybe that would help, but those seem to be
> device-specific, and it seems like we should have a generic mechanism.

This isn't unique in ACPI terms.  There are a few cases, even in the ACPI
spec, of IDs that exist just to say some feature is there.

ACPI0017 is an example.  It simply says: there be CXL here, go look for the
tables.

Here this device is used to indicate that a platform should be ready to handle
a particular type of error record.  If it happened to expose any other
interfaces, then I agree it would need resources or a _DSM etc.

Basically it's a workaround for the lack of discoverability in APEI /
ACPI error reporting. Could use an _OSC bit for the same job but then
we'd run out of those fast.  Device IDs are near free.

Jonathan
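
For illustration only, the "device ID as feature marker" idea described above
can be sketched in plain user-space C. This is not kernel code and is not
taken from the patch: the table, the struct, and the function name
marker_feature() are hypothetical, and the feature strings just paraphrase
the thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table: ACPI hardware IDs that only announce a feature. */
struct marker_id {
	const char *hid;	/* _HID string reported by firmware */
	const char *feature;	/* what the OS should be ready for */
};

static const struct marker_id marker_ids[] = {
	{ "ACPI0017", "CXL present: go look for the tables" },
	{ "NVDA2012", "expect NVIDIA vendor CPER records" },
};

/* Return the feature a marker ID announces, or NULL if the ID is unknown. */
static const char *marker_feature(const char *hid)
{
	size_t i;

	for (i = 0; i < sizeof(marker_ids) / sizeof(marker_ids[0]); i++) {
		if (strcmp(marker_ids[i].hid, hid) == 0)
			return marker_ids[i].feature;
	}
	return NULL;
}
```

The point of the sketch is that the ID carries no resources of its own: a
match only tells the OS to enable a handler, so marker_feature("NVDA2012")
yields a description while an unlisted ID yields NULL.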

