public inbox for linux-efi@vger.kernel.org
From: Dan Williams <dan.j.williams@intel.com>
To: Dan Williams <dan.j.williams@intel.com>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Ira Weiny <ira.weiny@intel.com>,
	Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
	Shiju Jose <shiju.jose@huawei.com>,
	Yazen Ghannam <yazen.ghannam@amd.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Dave Jiang <dave.jiang@intel.com>,
	Alison Schofield <alison.schofield@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	"Ard Biesheuvel" <ardb@kernel.org>, <linux-efi@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-cxl@vger.kernel.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>
Subject: Re: [PATCH v5 0/9] efi/cxl-cper: Report CPER CXL component events through trace events
Date: Tue, 9 Jan 2024 15:30:07 -0800
Message-ID: <659dd6ff5ee1_24a8294d0@dwillia2-xfh.jf.intel.com.notmuch>
In-Reply-To: <659db15452090_24a8294f4@dwillia2-xfh.jf.intel.com.notmuch>

Dan Williams wrote:
> Jonathan Cameron wrote:
> > On Mon, 8 Jan 2024 18:59:16 -0800
> > Dan Williams <dan.j.williams@intel.com> wrote:
> > 
> > > Ira Weiny wrote:
> > > > Dan Williams wrote:  
> > > > > Smita Koralahalli wrote:  
> > > > > > On 1/8/2024 8:58 AM, Jonathan Cameron wrote:  
> > > > > > > On Wed, 20 Dec 2023 16:17:27 -0800
> > > > > > > Ira Weiny <ira.weiny@intel.com> wrote:
> > > > > > >   
> > > > > > >> Series status/background
> > > > > > >> ========================
> > > > > > >>
> > > > > > >> Smita has been a great help with this series.  Thank you again!
> > > > > > >>
> > > > > > >> Smita's testing found that the GHES code ended up printing the events
> > > > > > >> twice.  This version avoids the duplicate print by calling the callback
> > > > > > >> from the GHES code instead of the EFI code as suggested by Dan.  
> > > > > > > 
> > > > > > > I'm not sure this is working as intended.
> > > > > > > 
> > > > > > > > Nothing gates the call to ghes_print_estatus() in ghes_proc(), and
> > > > > > > > now that the EFI code that pretty-printed these records is gone, we
> > > > > > > > get the horrible kernel logging for an unknown block instead.
> > > > > > > > 
> > > > > > > > So I think we need some minimal code in cper.c to match the GUIDs and
> > > > > > > > then not log them (on the basis that we are arguing there is no need
> > > > > > > > for new CPER records). Otherwise we are in for some messy kernel logs.
> > > > > > > 
> > > > > > > Something like:
> > > > > > > 
> > > > > > > {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > > > > > > {1}[Hardware Error]: event severity: recoverable
> > > > > > > {1}[Hardware Error]:  Error 0, type: recoverable
> > > > > > > {1}[Hardware Error]:   section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
> > > > > > > {1}[Hardware Error]:   section length: 0x90
> > > > > > > {1}[Hardware Error]:   00000000: 00000090 00000007 00000000 0d938086  ................
> > > > > > > {1}[Hardware Error]:   00000010: 00100000 00000000 00040000 00000000  ................
> > > > > > > {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > > > > > > {1}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
> > > > > > > {1}[Hardware Error]:   00000040: 00000000 00000000 00000000 00000000  ................
> > > > > > > {1}[Hardware Error]:   00000050: 00000000 00000000 00000000 00000000  ................
> > > > > > > {1}[Hardware Error]:   00000060: 00000000 00000000 00000000 00000000  ................
> > > > > > > {1}[Hardware Error]:   00000070: 00000000 00000000 00000000 00000000  ................
> > > > > > > {1}[Hardware Error]:   00000080: 00000000 00000000 00000000 00000000  ................
> > > > > > > cxl_general_media: memdev=mem1 host=0000:10:00.0 serial=4 log=Informational : time=0 uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=0 flags='' handle=0 related_handle=0 maint_op_class=0 : dpa=0 dpa_flags='' descriptor='' type='ECC Error' transaction_type='Unknown' channel=0 rank=0 device=0 comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 validity_flags=''
> > > > > > > 
> > > > > > > (I'm filling the record with 0s currently)  
> > > > > > 
> > > > > > > Yeah, when I tested this, I thought it was okay for the hexdump to be
> > > > > > > there in dmesg from EFI, as the handling is done in trace events from GHES.
> > > > > > > 
> > > > > > > If we need to handle it from EFI, then that would be a good reason to
> > > > > > > move the GUIDs out of GHES and place them in a common location for
> > > > > > > EFI/cper to share, similar to protocol errors.
> > > > > 
> > > > > Ah, yes, my expectation was more aligned with Jonathan's observation to
> > > > > do the processing in GHES code *and* skip the processing in the CPER
> > > > > code, something like:
> > > > >   
> > > > 
> > > > > Agreed, this was intended; I did not realize the above.
> > > >   
> > > > > 
> > > > > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> > > > > index 35c37f667781..0a4eed470750 100644
> > > > > --- a/drivers/firmware/efi/cper.c
> > > > > +++ b/drivers/firmware/efi/cper.c
> > > > > @@ -24,6 +24,7 @@
> > > > >  #include <linux/bcd.h>
> > > > >  #include <acpi/ghes.h>
> > > > >  #include <ras/ras_event.h>
> > > > > +#include <linux/cxl-event.h>
> > > > >  #include "cper_cxl.h"
> > > > >  
> > > > >  /*
> > > > > @@ -607,6 +608,15 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
> > > > >  			cper_print_prot_err(newpfx, prot_err);
> > > > >  		else
> > > > >  			goto err_section_too_small;
> > > > > +	} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
> > > > > +		printk("%ssection_type: CXL General Media Error\n", newpfx);  
> > > > 
> > > > > Do we want the printk's here?  I did not realize that a generic event
> > > > > would be printed.  So the intention was that nothing would be done on this path.
> > > 
> > > > I think we do, otherwise the kernel will say
> > > 
> > >     {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > >     {1}[Hardware Error]: event severity: recoverable
> > >     {1}[Hardware Error]:  Error 0, type: recoverable
> > >     ...
> > > 
> > > ...leaving the user hanging vs:
> > >  
> > >     {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > >     {1}[Hardware Error]: event severity: recoverable
> > >     {1}[Hardware Error]:  Error 0, type: recoverable
> > >     {1}[Hardware Error]:   section type: General Media Error
> > > 
> > > ...as an indicator to go follow up with rasdaemon or whatever else is
> > > doing the detailed monitoring of CXL events.
> > 
> > Agreed. Maybe push it out to a static const table though.
> > As the argument was that we shouldn't be spitting out big logs in this
> > modern world, let's make it easy for people to add more entries.
> > 
> > /* Sections that are handled elsewhere; print only a one-line name. */
> > struct skip_me {
> > 	guid_t guid;
> > 	const char *name;
> > };
> > static const struct skip_me skip_me[] = {
> > 	{ CPER_SEC_CXL_GEN_MEDIA_GUID, "CXL General Media Error" },
> > etc.
> > };
> > 
> > for (i = 0; i < ARRAY_SIZE(skip_me); i++) {
> > 	if (guid_equal(sec_type, &skip_me[i].guid)) {
> > 		printk("%ssection_type: %s\n", newpfx, skip_me[i].name);
> > 		break;
> > 	}
> > }
> > 
> > or something like that in the final else.
> 
> I like it.
> 
> Any concerns with that being an -rc fixup, and moving ahead with the base
> enabling for v6.8? I don't see that follow-on as a reason to push the
> whole thing to v6.9.

I will put it in -next for soak time and make an inclusion decision in a
few days after I hear back.
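
For reference, the table-driven lookup Jonathan sketched above can be
tried out in userspace. What follows is only a rough sketch under
stated assumptions, not the eventual cper.c change: guid_t,
guid_equal() and ARRAY_SIZE() are minimal stand-ins for the kernel
helpers, and known_section, known_sections[], section_name() and
GEN_MEDIA_GUID_BYTES are hypothetical names picked for illustration.
The GUID bytes are the fbcd0a77-c260-417f-85a9-088b1621eba6 value from
the log earlier in the thread, laid out the way a guid_t stores it
(first three fields little-endian).

```c
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's guid_t: 16 raw bytes. */
typedef struct { unsigned char b[16]; } guid_t;

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* fbcd0a77-c260-417f-85a9-088b1621eba6 in guid_t byte order. */
#define GEN_MEDIA_GUID_BYTES \
	{ 0x77, 0x0a, 0xcd, 0xfb, 0x60, 0xc2, 0x7f, 0x41, \
	  0x85, 0xa9, 0x08, 0x8b, 0x16, 0x21, 0xeb, 0xa6 }

/* Stand-in for the kernel's guid_equal(): plain byte comparison. */
static int guid_equal(const guid_t *a, const guid_t *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}

/* Hypothetical table entry: a section GUID and a short display name. */
struct known_section {
	guid_t guid;
	const char *name;
};

/* Add one row per section type that is reported in detail elsewhere. */
static const struct known_section known_sections[] = {
	{ { GEN_MEDIA_GUID_BYTES }, "CXL General Media Error" },
};

/*
 * Return the display name for a known section type, or NULL so the
 * caller can fall back to the generic "unknown" path with a hexdump.
 */
static const char *section_name(const guid_t *sec_type)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(known_sections); i++) {
		if (guid_equal(sec_type, &known_sections[i].guid))
			return known_sections[i].name;
	}
	return NULL;
}
```

A caller would print "section_type: <name>" when section_name()
returns non-NULL and keep the existing unknown-section output
otherwise, so adding a new event type is a one-line table change.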

