From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: <linux-cxl@vger.kernel.org>, <qemu-devel@nongnu.org>
Cc: Igor Mammedov <imammedo@redhat.com>,
	Ani Sinha <anisinha@redhat.com>,
	Shannon Zhao <shannon.zhaosl@gmail.com>,
	Dongjiu Geng <gengdongjiu1@gmail.com>, <linuxarm@huawei.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Peter Maydell <peter.maydell@linaro.org>,
	Fan Ni <fan.ni@samsung.com>,
	Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Subject: [RFC PATCH 00/11 qemu] arm/acpi/pci/cxl: ACPI based FW First error injection.
Date: Mon, 5 Feb 2024 14:19:29 +0000
Message-ID: <20240205141940.31111-1-Jonathan.Cameron@huawei.com>

I've had a version of this code for many years (and occasionally mention it
as a test platform for kernel patches) and it keeps coming in handy, so it's
time to share the CXL version.

What is this?
- ACPI + UEFI specs define a means of notifying the OS of errors that
  firmware has handled (gathered up data etc, reset the relevant error tracking
  units etc) in a set of standard formats (UEFI spec appendix N).
- ARM virt already supports standard HEST ACPI table description of Synchronous
  External Abort (SEA) for memory errors. This series builds on this to
  add a GHESv2 / Generic Error Device / GPIO interrupt path for asynchronous
  error reporting.
- CXL and PCI AER both already have injection commands (via HMP / QMP).
  These are repurposed to perform FW first injection if the guest OS has not
  negotiated OS first handling (so before the CXL / PCIE _OSC is called or
  when it doesn't negotiate control of AER / CXL Memory Errors).
- The OS normally negotiates for control of error registers via _OSC.
  Previously QEMU unconditionally granted control of these registers.
  This series includes a machine parameter to allow the 'FW' to not let the
  OS take control and tracks whether the OS has asked for control or not.
  Note this code relies on the standard handshake - it's not remotely
  correct if the OS does not follow that flow - this can be hardened with some
  more AML magic.
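
A minimal sketch of the decision described in the last two points, not the
code in this series. struct osc_state and the three functions are hypothetical
names made up for illustration; the two bit masks are assumed to match the
values Linux uses for the PCIe AER and CXL memory error reporting _OSC
control bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed _OSC control bits (same values as the Linux definitions). */
#define OSC_PCI_EXPRESS_AER_CONTROL     0x00000008u
#define OSC_CXL_ERROR_REPORTING_CONTROL 0x00000001u

/* Hypothetical per host bridge state, not QEMU's actual structure. */
struct osc_state {
    bool fw_first_ras;    /* machine option: 'FW' keeps error control */
    bool osc_called;      /* has the guest evaluated _OSC yet? */
    uint32_t pci_granted; /* PCIe control bits granted to the guest */
    uint32_t cxl_granted; /* CXL control bits granted to the guest */
};

/* Model of an _OSC evaluation: mask what 'FW' keeps, stash the result. */
static void osc_negotiate(struct osc_state *st, uint32_t pci_req,
                          uint32_t cxl_req)
{
    assert(st != NULL);
    if (st->fw_first_ras) {
        pci_req &= ~OSC_PCI_EXPRESS_AER_CONTROL;
        cxl_req &= ~OSC_CXL_ERROR_REPORTING_CONTROL;
    }
    st->pci_granted = pci_req;
    st->cxl_granted = cxl_req;
    st->osc_called = true;
}

/* FW first if _OSC has not run yet or AER control was not granted. */
static bool aer_inject_fw_first(const struct osc_state *st)
{
    assert(st != NULL);
    return !st->osc_called ||
           !(st->pci_granted & OSC_PCI_EXPRESS_AER_CONTROL);
}

/* Same rule for CXL memory errors, using the CXL control field. */
static bool cxl_mem_inject_fw_first(const struct osc_state *st)
{
    assert(st != NULL);
    return !st->osc_called ||
           !(st->cxl_granted & OSC_CXL_ERROR_REPORTING_CONTROL);
}
```

With fw_first_ras set, both helpers keep returning true whatever the guest
requests. With it clear, an injection goes native only once the guest has
called _OSC and been granted the matching bit.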

Alternatives:
- In theory we could emulate a management controller running appropriate firmware
  and have that actually handle the errors. It's much easier to instead intercept
  them before the error reporting messages are sent and the result is logged in
  the root port's error registers. As far as the guest is concerned it doesn't
  matter if these registers are handled via the firmware or never got written in
  the first place (the guest isn't allowed to touch these registers anyway!).
  This is sort of the same argument for why we build ACPI tables in general in
  QEMU rather than making that an EDK2 problem.

Why?
- The kernel CXL code supports both firmware first and native RAS.
  As only some vendors have adopted a FW first model and hardware
  availability is limited, this code has proven challenging to test.

Why an RFC?
- Small matter that the ARM CXL support isn't upstream.
- I'm assuming adding this support to QEMU will be controversial.
- There are some loose ends, TODOs and FIXMEs in the code.
- Only one type of CXL event is currently handled - should provide them all.
  CXL Protocol and AER error reporting is more complete.
- I should probably figure out how to do this for x86 as apparently people
  also want to use that architecture ;)

Thanks to Shiju Jose for help testing this.

Based on: Random stack of patches on my gitlab.com/jic23/qemu cxl-2024-02-05-draft
branch. Specifically:
https://gitlab.com/jic23/qemu/-/commit/0fa064b9c8eeef468d8a19e87f39f230b4fa4da9

All comments welcome - particularly from anyone who can advise on what the HEST
table should look like on an x86 machine - too many options!

Jonathan Cameron (11):
  hw/pci: Add pcie_find_dvsec() utility.
  hw/acpi: Allow GPEX _OSC to keep fw first control of AER and CXL
    errors.
  arm/virt: Add fw-first-ras property.
  acpi/ghes: Support GPIO error source.
  arm/virt: Wire up GPIO error source for ACPI / GHES
  acpi: pci/cxl: Stash the OSC control parameters.
  pci/aer: Support firmware first error injection via GHESv2
  hw/pci/aer: Default to error handling on.
  cxl/ras: Set registers to sensible state for FW first ras
  cxl/type3: FW first protocol error injection.
  cxl/type3: Add firmware first error reporting for general media
    events.

 include/hw/acpi/cxl.h         |   2 +-
 include/hw/acpi/ghes.h        |  14 +
 include/hw/arm/virt.h         |   1 +
 include/hw/boards.h           |   1 +
 include/hw/cxl/cxl.h          |   2 +
 include/hw/pci-host/gpex.h    |   1 +
 include/hw/pci/pcie.h         |   1 +
 hw/acpi/cxl-stub.c            |   2 +-
 hw/acpi/cxl.c                 |  50 ++-
 hw/acpi/ghes-stub.c           |  25 ++
 hw/acpi/ghes.c                | 634 +++++++++++++++++++++++++++++++++-
 hw/arm/virt-acpi-build.c      |  71 +++-
 hw/arm/virt.c                 |  32 +-
 hw/cxl/cxl-component-utils.c  |   4 +-
 hw/i386/acpi-build.c          |   2 +-
 hw/mem/cxl_type3.c            |  42 ++-
 hw/pci-bridge/cxl_root_port.c |   1 -
 hw/pci-host/gpex-acpi.c       |  17 +-
 hw/pci-host/gpex.c            |   1 +
 hw/pci/pcie.c                 |  30 ++
 hw/pci/pcie_aer.c             |  35 +-
 21 files changed, 914 insertions(+), 54 deletions(-)

-- 
2.39.2

