From: Bjorn Helgaas <helgaas@kernel.org>
To: Huang Ying <ying.huang@intel.com>,
Shiju Jose <shiju.jose@huawei.com>,
James Morse <james.morse@arm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>,
Len Brown <lenb@kernel.org>, James Morse <james.morse@arm.com>,
Tony Luck <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>,
linux-acpi@vger.kernel.org, linux-pci@vger.kernel.org
Subject: GHES/AER synchronization missing?
Date: Fri, 1 Sep 2023 17:57:55 -0500
Message-ID: <20230901225755.GA90053@bhelgaas>
TL;DR: I think ghes_handle_aer() lacks synchronization with
aer_recover_work_func(), so aer_recover_work_func() may use estatus
data after it's been overwritten.
Sorry this is so long; it took me a long time to get this far, and I
might be in the weeds. Here's the execution path I'm looking at:

  ghes_proc(struct ghes *ghes)
    estatus = ghes->estatus                    # linux kernel buffer
    ghes_read_estatus(estatus, &buf_paddr)     # copy fw mem to estatus
    ghes_do_proc(estatus)
      apei_estatus_for_each_section(estatus, gdata)
        if (gdata is CPER_SEC_PCIE)
          ghes_handle_aer(gdata)               # pointer into estatus
            struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata)
            aer_recover_queue(..., pcie_err->aer_info)
              entry.regs = aer_regs            # pointer to struct aer_capability_regs
              kfifo_in(&aer_recover_ring, &entry)  # copy pointer into FIFO
  ...
  aer_recover_work_func
    kfifo_get(&aer_recover_ring, &entry)
    cper_print_aer(entry.regs)                 # use aer_capability_regs values

I'm confused because I don't see what ensures that the
aer_capability_regs values, which I think are somewhere in the
ghes->estatus buffer, are preserved until aer_recover_work_func() is
finished with them.
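
To make the concern concrete, here is a minimal user-space model of the
path above. It is not kernel code, and every model_* name is a
hypothetical stand-in: model_estatus plays the role of ghes->estatus,
model_ring the role of aer_recover_ring, and a single field stands in
for struct aer_capability_regs. It assumes two events are processed
before the worker gets to run.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct aer_capability_regs. */
struct model_aer_regs {
	uint32_t uncor_status;
};

/* Hypothetical stand-in for the FIFO entry: holds a pointer, not a copy. */
struct model_entry {
	const struct model_aer_regs *regs;
};

#define MODEL_RING_SIZE 4

static struct model_aer_regs model_estatus;           /* models ghes->estatus */
static struct model_entry model_ring[MODEL_RING_SIZE]; /* models aer_recover_ring */
static unsigned int model_head, model_tail;

/* Models ghes_proc(): refill the one shared buffer, queue a pointer into it. */
static void model_ghes_proc(uint32_t uncor_status)
{
	assert(model_head - model_tail < MODEL_RING_SIZE);
	model_estatus.uncor_status = uncor_status;
	model_ring[model_head++ % MODEL_RING_SIZE].regs = &model_estatus;
}

/* Models aer_recover_work_func(): dereference the queued pointer later. */
static uint32_t model_work_func(void)
{
	assert(model_head != model_tail);
	return model_ring[model_tail++ % MODEL_RING_SIZE].regs->uncor_status;
}

/* Two events are queued before the worker runs; return what the worker
 * reads when it handles the first one. */
static uint32_t model_first_event_seen(void)
{
	model_head = model_tail = 0;
	model_ghes_proc(0x1);	/* first event */
	model_ghes_proc(0x2);	/* second event reuses the same buffer */
	return model_work_func();
}
```

In this model, model_first_event_seen() returns 0x2: the worker handling
the first event reads the second event's data, because both FIFO entries
point at the same buffer.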
Here's my understanding of the general flow:
  - hest_parse_ghes() adds a GHES platform device for each HEST Error
    Source descriptor of type 9 (Generic Hardware Error Source) or
    type 10 (Generic Hardware Error Source version 2).
  - Each HEST GHES entry has an Error Status Address that tells us
    about some range of firmware reserved memory that will contain
    error status data for the device.
  - ghes_probe() claims each GHES platform device. It maps the Error
    Status Address once (so I guess the address of the firmware memory
    must be fixed for the life of the system?) and allocates a
    ghes->estatus buffer in kernel memory.
  - When the platform notifies OSPM of a GHES event, ghes_proc()
    copies error status data from the Error Status Address firmware
    memory to the ghes->estatus buffer.
  - The error status data may have multiple sections. ghes_do_proc()
    iterates through each section in the ghes->estatus buffer. PCIe
    sections contain a struct aer_capability_regs that has values of
    all the AER Capability registers, and ghes_handle_aer() passes a
    pointer to the struct aer_capability_regs to aer_recover_queue().
  - This struct aer_capability_regs pointer is a pointer into the
    ghes->estatus buffer. aer_recover_queue() copies that pointer
    into the aer_recover_ring fifo and schedules
    aer_recover_work_func() for later execution.
  - aer_recover_work_func() reads the struct aer_capability_regs data
    at some future time.
  - ghes_proc() does not know when aer_recover_work_func() is finished
    with the struct aer_capability_regs data.
Am I missing a mechanism that prevents a second ghes_proc() invocation
from overwriting ghes->estatus before the first aer_recover_work_func()
is finished?
The ghes_defer_non_standard_event() case added by Shiju and James in
9aa9cf3ee945 ("ACPI / APEI: Add a notifier chain for unknown (vendor)
CPER records") also schedules future work, but it copies the data
needed for that work. It seems like ghes_handle_aer() should perhaps
do something similar?
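
As a rough illustration of what "copy the data" could look like, here is
a user-space sketch, again not kernel code. All sketch_* names are
hypothetical, a single field stands in for struct aer_capability_regs,
and storing the registers by value in the queued entry is only one
possible approach; a real change might instead allocate a separate copy
and queue a pointer to that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct aer_capability_regs. */
struct sketch_aer_regs {
	uint32_t uncor_status;
};

/* Hypothetical FIFO entry that owns a by-value copy of the registers. */
struct sketch_entry {
	struct sketch_aer_regs regs;
};

#define SKETCH_RING_SIZE 4

static struct sketch_aer_regs sketch_estatus;             /* models ghes->estatus */
static struct sketch_entry sketch_ring[SKETCH_RING_SIZE]; /* models the recovery FIFO */
static unsigned int sketch_head, sketch_tail;

/* Models ghes_proc(): refill the shared buffer, then queue a snapshot of it. */
static void sketch_ghes_proc(uint32_t uncor_status)
{
	assert(sketch_head - sketch_tail < SKETCH_RING_SIZE);
	sketch_estatus.uncor_status = uncor_status;
	/* Struct assignment copies the data out of the shared buffer. */
	sketch_ring[sketch_head++ % SKETCH_RING_SIZE].regs = sketch_estatus;
}

/* Models the deferred worker: reads only the entry's private copy. */
static uint32_t sketch_work_func(void)
{
	assert(sketch_head != sketch_tail);
	return sketch_ring[sketch_tail++ % SKETCH_RING_SIZE].regs.uncor_status;
}

/* Two events are queued before the worker runs; return what the worker
 * reads when it handles the first one. */
static uint32_t sketch_first_event_seen(void)
{
	sketch_head = sketch_tail = 0;
	sketch_ghes_proc(0x1);	/* first event */
	sketch_ghes_proc(0x2);	/* second event overwrites the shared buffer */
	return sketch_work_func();
}
```

Here sketch_first_event_seen() returns 0x1: each queued entry carries its
own snapshot, so later reuse of the shared buffer no longer affects work
that is already queued. The cost is a larger FIFO entry (or an
allocation per event), which is the trade-off any such change would have
to weigh.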
Bjorn