From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Ard Biesheuvel <ardb@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
	Shiju Jose <shiju.jose@huawei.com>,
	Yazen Ghannam <yazen.ghannam@amd.com>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	Dave Jiang <dave.jiang@intel.com>,
	"Alison Schofield" <alison.schofield@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	<linux-efi@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-cxl@vger.kernel.org>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>
Subject: Re: [PATCH v5 0/9] efi/cxl-cper: Report CPER CXL component events through trace events
Date: Wed, 10 Jan 2024 14:24:36 +0000
Message-ID: <20240110142436.0000787a@Huawei.com>
In-Reply-To: <CAMj1kXHvJzLksPtj1_O2L+4zH4emEs5tnvFCq=Wysfr842b=Sg@mail.gmail.com>

On Wed, 10 Jan 2024 00:31:17 +0100
Ard Biesheuvel <ardb@kernel.org> wrote:

> On Wed, 10 Jan 2024 at 00:30, Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > Dan Williams wrote:  
> > > Jonathan Cameron wrote:  
> > > > On Mon, 8 Jan 2024 18:59:16 -0800
> > > > Dan Williams <dan.j.williams@intel.com> wrote:
> > > >  
> > > > > Ira Weiny wrote:  
> > > > > > Dan Williams wrote:  
> > > > > > > Smita Koralahalli wrote:  
> > > > > > > > On 1/8/2024 8:58 AM, Jonathan Cameron wrote:  
> > > > > > > > > On Wed, 20 Dec 2023 16:17:27 -0800
> > > > > > > > > Ira Weiny <ira.weiny@intel.com> wrote:
> > > > > > > > >  
> > > > > > > > >> Series status/background
> > > > > > > > >> ========================
> > > > > > > > >>
> > > > > > > > >> Smita has been a great help with this series.  Thank you again!
> > > > > > > > >>
> > > > > > > > >> Smita's testing found that the GHES code ended up printing the events
> > > > > > > > >> twice.  This version avoids the duplicate print by calling the callback
> > > > > > > > >> from the GHES code instead of the EFI code as suggested by Dan.  
> > > > > > > > >
> > > > > > > > > I'm not sure this is working as intended.
> > > > > > > > >
> > > > > > > > > > Nothing gates the call to ghes_print_estatus() in ghes_proc(), and
> > > > > > > > > > now that the EFI code that pretty-printed these records is gone, we
> > > > > > > > > > get the horrible kernel logging for an unknown block instead.
> > > > > > > > >
> > > > > > > > > > So I think we need some minimal code in cper.c to match the GUIDs and
> > > > > > > > > > then not log them (on the basis that we are arguing there is no need
> > > > > > > > > > for new CPER records). Otherwise we are in for some messy kernel logs.
> > > > > > > > >
> > > > > > > > > Something like:
> > > > > > > > >
> > > > > > > > > {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > > > > > > > > {1}[Hardware Error]: event severity: recoverable
> > > > > > > > > {1}[Hardware Error]:  Error 0, type: recoverable
> > > > > > > > > {1}[Hardware Error]:   section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
> > > > > > > > > {1}[Hardware Error]:   section length: 0x90
> > > > > > > > > {1}[Hardware Error]:   00000000: 00000090 00000007 00000000 0d938086  ................
> > > > > > > > > {1}[Hardware Error]:   00000010: 00100000 00000000 00040000 00000000  ................
> > > > > > > > > {1}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
> > > > > > > > > {1}[Hardware Error]:   00000030: 00000000 00000000 00000000 00000000  ................
> > > > > > > > > {1}[Hardware Error]:   00000040: 00000000 00000000 00000000 00000000  ................
> > > > > > > > > {1}[Hardware Error]:   00000050: 00000000 00000000 00000000 00000000  ................
> > > > > > > > > {1}[Hardware Error]:   00000060: 00000000 00000000 00000000 00000000  ................
> > > > > > > > > {1}[Hardware Error]:   00000070: 00000000 00000000 00000000 00000000  ................
> > > > > > > > > {1}[Hardware Error]:   00000080: 00000000 00000000 00000000 00000000  ................
> > > > > > > > > cxl_general_media: memdev=mem1 host=0000:10:00.0 serial=4 log=Informational : time=0 uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=0 flags='' handle=0 related_handle=0 maint_op_class=0 : dpa=0 dpa_flags='' descriptor='' type='ECC Error' transaction_type='Unknown' channel=0 rank=0 device=0 comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 validity_flags=''
> > > > > > > > >
> > > > > > > > > (I'm filling the record with 0s currently)  
> > > > > > > >
> > > > > > > > > Yeah, when I tested this, I thought it's okay for the hexdump to be there
> > > > > > > > > in dmesg from EFI as the handling is done in trace events from GHES.
> > > > > > > > >
> > > > > > > > > If we need to handle this from EFI, then it would be a good reason to move
> > > > > > > > > the GUIDs out of GHES and place them in a common location for EFI/cper
> > > > > > > > > to share, similar to protocol errors.  
> > > > > > >
> > > > > > > Ah, yes, my expectation was more aligned with Jonathan's observation to
> > > > > > > do the processing in GHES code *and* skip the processing in the CPER
> > > > > > > code, something like:
> > > > > > >  
> > > > > >
> > > > > > > Agreed, this was intended; I did not realize the above.
> > > > > >  
> > > > > > >
> > > > > > > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> > > > > > > index 35c37f667781..0a4eed470750 100644
> > > > > > > --- a/drivers/firmware/efi/cper.c
> > > > > > > +++ b/drivers/firmware/efi/cper.c
> > > > > > > @@ -24,6 +24,7 @@
> > > > > > >  #include <linux/bcd.h>
> > > > > > >  #include <acpi/ghes.h>
> > > > > > >  #include <ras/ras_event.h>
> > > > > > > +#include <linux/cxl-event.h>
> > > > > > >  #include "cper_cxl.h"
> > > > > > >
> > > > > > >  /*
> > > > > > > @@ -607,6 +608,15 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
> > > > > > >                       cper_print_prot_err(newpfx, prot_err);
> > > > > > >               else
> > > > > > >                       goto err_section_too_small;
> > > > > > > +     } else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
> > > > > > > +             printk("%ssection_type: CXL General Media Error\n", newpfx);  
> > > > > >
> > > > > > > Do we want the printk's here?  I did not realize that a generic event
> > > > > > > would be printed.  So the intention was that nothing would be done on this path.  
> > > > >
> > > > > > I think we do, otherwise the kernel will say
> > > > >
> > > > >     {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > > > >     {1}[Hardware Error]: event severity: recoverable
> > > > >     {1}[Hardware Error]:  Error 0, type: recoverable
> > > > >     ...
> > > > >
> > > > > ...leaving the user hanging vs:
> > > > >
> > > > >     {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > > > >     {1}[Hardware Error]: event severity: recoverable
> > > > >     {1}[Hardware Error]:  Error 0, type: recoverable
> > > > >     {1}[Hardware Error]:   section type: General Media Error
> > > > >
> > > > > ...as an indicator to go follow up with rasdaemon or whatever else is
> > > > > doing the detailed monitoring of CXL events.  
> > > >
> > > > Agreed. Maybe push it out to a static const table though.
> > > > As the argument was that we shouldn't be spitting out big logs in this
> > > > modern world, let's make it easy for people to add more entries.
> > > >
> > > > > struct skip_me {
> > > > >     const guid_t *guid;
> > > > >     const char *name;
> > > > > };
> > > > > static const struct skip_me skip_me[] = {
> > > > >     { &CPER_SEC_CXL_GEN_MEDIA_GUID, "CXL General Media Error" },
> > > > >     /* etc. */
> > > > > };
> > > > >
> > > > > for (i = 0; i < ARRAY_SIZE(skip_me); i++) {
> > > > >     if (guid_equal(sec_type, skip_me[i].guid)) {
> > > > >             printk("%ssection_type: %s\n", newpfx, skip_me[i].name);
> > > > >             break;
> > > > >     }
> > > > > }
> > > >
> > > > or something like that in the final else.  
> > >
> > > I like it.
> > >
> > > Any concerns with that being an -rc fixup, and move ahead with the base
> > > enabling for v6.8? I don't see that follow-on as a reason to push the
> > > whole thing to v6.9.  
> >
> > I will put it in -next for soak time and make an inclusion decision in a
> > few days after I hear back.
> >  
> 
> For the series and however you want to handle the merge:
> 
> Acked-by: Ard Biesheuvel <ardb@kernel.org>

Any path in works for me as well.
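
For whoever picks up the follow-on, the table lookup quoted above might
look roughly like the sketch below. It is a standalone userspace
illustration, not kernel code: struct guid, gen_media_guid, skip_table
and skip_section_name() are hypothetical names, memcmp() stands in for
guid_equal(), and the GUID bytes are simply stored in the textual order
of the UUID from the log above (the kernel's guid_t stores the first
three fields little-endian).

```c
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's guid_t (assumption, see lead-in). */
struct guid {
	unsigned char b[16];
};

/* fbcd0a77-c260-417f-85a9-088b1621eba6, bytes kept in textual order. */
static const struct guid gen_media_guid = {
	{ 0xfb, 0xcd, 0x0a, 0x77, 0xc2, 0x60, 0x41, 0x7f,
	  0x85, 0xa9, 0x08, 0x8b, 0x16, 0x21, 0xeb, 0xa6 }
};

/* One entry per section type that is named in the log but not dumped. */
struct skip_entry {
	const struct guid *guid;
	const char *name;
};

static const struct skip_entry skip_table[] = {
	{ &gen_media_guid, "CXL General Media Error" },
	/* further section types would be added here */
};

/*
 * Return the short name for a section type listed in skip_table,
 * or NULL if the section type is not in the table.
 */
static const char *skip_section_name(const struct guid *sec_type)
{
	size_t i;

	for (i = 0; i < sizeof(skip_table) / sizeof(skip_table[0]); i++) {
		if (!memcmp(sec_type, skip_table[i].guid, sizeof(*sec_type)))
			return skip_table[i].name;
	}
	return NULL;
}
```

A caller would print the returned name with the usual prefix when it is
non-NULL and fall through to the existing unknown-section path when it
is NULL, which keeps adding a new entry down to a one-line change.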

J
