linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: "Fabio M. De Francesco" <fabio.m.de.francesco@linux.intel.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
	linux-acpi@vger.kernel.org, Oliver O'Halloran <oohall@gmail.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Dan Williams <dan.j.williams@intel.com>,
	linuxppc-dev@lists.ozlabs.org, Len Brown <lenb@kernel.org>
Subject: Re: [PATCH 2/2] ACPI: extlog: Trace CPER PCI Express Error Section
Date: Tue, 6 Aug 2024 16:31:18 -0500	[thread overview]
Message-ID: <20240806213118.GA78822@bhelgaas> (raw)
In-Reply-To: <20240527144356.246220-3-fabio.m.de.francesco@linux.intel.com>

On Mon, May 27, 2024 at 04:43:41PM +0200, Fabio M. De Francesco wrote:
> Currently, extlog_print() (ELOG) only reports CPER PCIe section (UEFI
> v2.10, Appendix N.2.7) to the kernel log via print_extlog_rcd(). Instead,
> the similar ghes_do_proc() (GHES) prints to kernel log and calls
> pci_print_aer() to report via the ftrace infrastructure.
> 
> Add support to report the CPER PCIe Error section also via the ftrace
> infrastructure by calling pci_print_aer() to make ELOG act consistently
> with GHES.
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
> ---
>  drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++
>  drivers/pci/pcie/aer.c     |  2 +-
>  include/linux/aer.h        | 13 +++++++++++--
>  3 files changed, 42 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> index e025ae390737..007ce96f8672 100644
> --- a/drivers/acpi/acpi_extlog.c
> +++ b/drivers/acpi/acpi_extlog.c
> @@ -131,6 +131,32 @@ static int print_extlog_rcd(const char *pfx,
>  	return 1;
>  }
>  
> +static void extlog_print_pcie(struct cper_sec_pcie *pcie_err,
> +			      int severity)
> +{
> +	struct aer_capability_regs *aer;
> +	struct pci_dev *pdev;
> +	unsigned int devfn;
> +	unsigned int bus;
> +	int aer_severity;
> +	int domain;
> +
> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> +	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> +		aer_severity = cper_severity_to_aer(severity);
> +		aer = (struct aer_capability_regs *)pcie_err->aer_info;
> +		domain = pcie_err->device_id.segment;
> +		bus = pcie_err->device_id.bus;
> +		devfn = PCI_DEVFN(pcie_err->device_id.device,
> +				  pcie_err->device_id.function);
> +		pdev = pci_get_domain_bus_and_slot(domain, bus, devfn);
> +		if (!pdev)
> +			return;
> +		pci_print_aer(pdev, aer_severity, aer);
> +		pci_dev_put(pdev);
> +	}

I'm 100% in favor of making error reporting work and look the same
across GHES and ELOG.  But I do have to gripe a bit...

It's already unfortunate that GHES and the native AER handling are
separate paths that lead to the same place (__aer_print_error()).

I'm sorry that we need to add a third path that again does
fundamentally the same thing.  The fact that they're separate means
all the design, reviewing, testing, and maintenance effort is diluted,
and error handling always gets too little love in the first place.
I think this is a recipe for confusion.

  ghes_do_proc                                        # GHES
    apei_estatus_for_each_section
      ...
      if (guid_equal(sec_type, &CPER_SEC_PCIE))
        ghes_handle_aer
          cper_severity_to_aer
          aer_recover_queue
            kfifo_in_spinlocked(&aer_recover_ring)    # add to queue
          aer_recover_work_func                       # another thread
            kfifo_get(&aer_recover_ring)              # remove from queue
            pci_print_aer
              __aer_print_error         <---

  aer_irq                                             # native AER
    kfifo_put(&aer_fifo)                              # add to queue
  aer_isr                                             # another thread
    kfifo_get(&aer_fifo)                              # remove from queue
    ...
    aer_isr_one_error
      aer_process_err_devices
        aer_print_error
          __aer_print_error             <---

  extlog_print                                        # extlog (x86 only)
    apei_estatus_for_each_section
      ...
      if (guid_equal(sec_type, &CPER_SEC_PCIE))
        extlog_print_pcie
          cper_severity_to_aer
          pci_get_domain_bus_and_slot
          pci_print_aer
            __aer_print_error           <---

And we also have CXL paths that lead to __aer_print_error(), although
it seems like they at least start in the native AER (and maybe GHES?)
path and branch out somewhere.  My head is spinning.

Do I *object* to this patch?  No, not really; it's a trivial change in
drivers/pci/, and Rafael can add my

  Acked-by: Bjorn Helgaas <bhelgaas@google.com>

as needed.  But I am afraid we're making ourselves a maintenance
headache.

> +}
> +
>  static int extlog_print(struct notifier_block *nb, unsigned long val,
>  			void *data)
>  {
> @@ -179,6 +205,10 @@ static int extlog_print(struct notifier_block *nb, unsigned long val,
>  			if (gdata->error_data_length >= sizeof(*mem))
>  				trace_extlog_mem_event(mem, err_seq, fru_id, fru_text,
>  						       (u8)gdata->error_severity);
> +		} else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
> +			struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> +			extlog_print_pcie(pcie_err, gdata->error_severity);
>  		} else {
>  			void *err = acpi_hest_get_payload(gdata);
>  
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..794aa15527ba 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -801,7 +801,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
>  			aer_severity, tlp_header_valid, &aer->header_log);
>  }
> -EXPORT_SYMBOL_NS_GPL(pci_print_aer, CXL);
> +EXPORT_SYMBOL_GPL(pci_print_aer);
>  
>  /**
>   * add_error_device - list device to be handled
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 4b97f38f3fcf..fbc82206045c 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -42,17 +42,26 @@ int pcie_read_tlp_log(struct pci_dev *dev, int where, struct pcie_tlp_log *log);
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
> +void pci_print_aer(struct pci_dev *dev, int aer_severity,
> +		   struct aer_capability_regs *aer);
>  #else
>  static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  {
>  	return -EINVAL;
>  }
> +
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
> +static inline void pci_print_aer(struct pci_dev *dev, int aer_severity,
> +				 struct aer_capability_regs *aer)
> +{ }
>  #endif
>  
> -void pci_print_aer(struct pci_dev *dev, int aer_severity,
> -		    struct aer_capability_regs *aer);
> +#if defined(CONFIG_ACPI_APEI_PCIEAER)
>  int cper_severity_to_aer(int cper_severity);
> +#else
> +static inline int cper_severity_to_aer(int cper_severity) { return 0; }
> +#endif
> +
>  void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>  		       int severity, struct aer_capability_regs *aer_regs);
>  #endif //_AER_H_
> -- 
> 2.45.1
> 

  parent reply	other threads:[~2024-08-06 21:32 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-27 14:43 [PATCH 0/2] Make ELOG log and trace consistently with GHES Fabio M. De Francesco
2024-05-27 14:43 ` [PATCH 1/2] ACPI: extlog: Trace CPER Non-standard Section Body Fabio M. De Francesco
2024-08-06 19:21   ` Dan Williams
2024-08-06 21:07   ` Bjorn Helgaas
2024-05-27 14:43 ` [PATCH 2/2] ACPI: extlog: Trace CPER PCI Express Error Section Fabio M. De Francesco
2024-08-06 19:56   ` Dan Williams
2024-10-23 13:35     ` Fabio M. De Francesco
2024-12-11  1:51       ` Dan Williams
2024-08-06 21:31   ` Bjorn Helgaas [this message]
2024-08-07 20:28     ` Dan Williams
2024-08-07 20:31       ` Dan Williams
2024-07-23 13:43 ` [PATCH 0/2] Make ELOG log and trace consistently with GHES Fabio M. De Francesco

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240806213118.GA78822@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=fabio.m.de.francesco@linux.intel.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mahesh@linux.ibm.com \
    --cc=oohall@gmail.com \
    --cc=rafael@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).