From: Shiyang Ruan <ruansy.fnst@fujitsu.com>
To: Dave Jiang <dave.jiang@intel.com>,
qemu-devel@nongnu.org, linux-cxl@vger.kernel.org
Cc: jonathan.cameron@huawei.com, dan.j.williams@intel.com,
dave@stgolabs.net, ira.weiny@intel.com,
alison.schofield@intel.com, vishal.l.verma@intel.com
Subject: Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device
Date: Fri, 21 Jun 2024 18:18:08 +0800 [thread overview]
Message-ID: <cc2236cf-ace2-498d-a0b3-2fbc9f7fc317@fujitsu.com> (raw)
In-Reply-To: <e6b3a936-9c45-4d34-bf7c-2b68e2ad79ae@intel.com>
在 2024/6/20 23:51, Dave Jiang 写道:
>
>
> On 6/19/24 2:24 AM, Shiyang Ruan wrote:
>>
>>
>> 在 2024/6/19 7:35, Dave Jiang 写道:
>>>
>>>
>>> On 6/18/24 9:53 AM, Shiyang Ruan wrote:
>>>> Background:
>>>> Since CXL device is a memory device, while CPU consumes a poison page of
>>>> CXL device, it always triggers a MCE by interrupt (INT18), no matter
>>>> which-First path is configured. This is the first report. Then
>>>> currently, in FW-First path, the poison event is transferred according
>>>> to the following process: CXL device -> firmware -> OS:ACPI->APEI->GHES
>>>> -> CPER -> trace report. This is the second one. These two reports
>>>> are indicating the same poisoning page, which is the so-called "duplicate
>>>> report"[1]. And the memory_failure() handling I'm trying to add in
>>>> OS-First path could also be another duplicate report.
>>>>
>>>> Hope the flow below could make it easier to understand:
>>>> CPU accesses bad memory on CXL device, then
>>>> -> MCE (INT18), *always* report (1)
>>>> -> * FW-First (implemented now)
>>>> -> CXL device -> FW
>>>> -> OS:ACPI->APEI->GHES->CPER -> trace report (2.a)
>>>> * OS-First (not implemented yet, I'm working on it)
>>>> -> CXL device -> MSI
>>>> -> OS:CXL driver -> memory_failure() (2.b)
>>>> so, the (1) and (2.a/b) are duplicated.
>>>>
>>>> (I didn't get response in my reply for [1] while I have to make patch to
>>>> solve this problem, so please correct me if my understanding is wrong.)
>>>>
>>>> This patch adds a new notifier_block and MCE_PRIO_CXL, for CXL memdev
>>>> to check whether the current poison page has been reported (if yes,
>>>> stop the notifier chain, won't call the following memory_failure()
>>>> to report), into `x86_mce_decoder_chain`. In this way, if the poison
>>>> page already handled(recorded and reported) in (1) or (2), the other one
>>>> won't duplicate the report. The record could be clear when
>>>> cxl_clear_poison() is called.
>>>>
>>>> [1] https://lore.kernel.org/linux-cxl/664d948fb86f0_e8be294f8@dwillia2-mobl3.amr.corp.intel.com.notmuch/
>>>>
>>
>> ...
>>
>>>> +
>>>> +static bool cxl_contains_hpa(const struct cxl_memdev *cxlmd, u64 hpa)
>>>> +{
>>>> + struct cxl_contains_hpa_context ctx = {
>>>> + .contains = false,
>>>> + .hpa = hpa,
>>>> + };
>>>> + struct cxl_port *port;
>>>> +
>>>> + port = cxlmd->endpoint;
>>>> + if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port))
>>>
>>> Maybe no need to check is_cxl_endpoint() if the port is retrieved from cxlmd->endpoint.
>>
>> OK, I'll remove this.
>>
>>>
>>> Also, in order to use cxl_num_decoders_committed(), cxl_region_rwsem must be held. See the lockdep_assert_held() in the function. Maybe add a
>>> guard(cxl_regoin_rwsem);
>>> before the if statement above.
>>
>> Got it. I didn't realize it before. Will add it.
>>
>>
>> BTW, may I have your opinion on this proposal? I'm not sure if the Background and problem described above are correct or not. If not, it could lead me in the wrong direction.
>
> Patch looks ok to me, but I'm no RAS expert in this area. Lets wait for comments from Jonathan and Dan.
Thanks!
--
Ruan.
>>
>> Thank you very much!
>>
>>
>> --
>> Ruan.
>>
>>>
>>> DJ
>>>
next prev parent reply other threads:[~2024-06-21 10:18 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-06-18 16:53 [RFC PATCH] cxl: avoid duplicating report from MCE & device Shiyang Ruan
2024-06-18 23:35 ` Dave Jiang
2024-06-19 9:24 ` Shiyang Ruan
2024-06-20 15:51 ` Dave Jiang
2024-06-21 10:18 ` Shiyang Ruan [this message]
2024-06-20 17:02 ` Jonathan Cameron
2024-06-21 10:16 ` Shiyang Ruan
2024-06-21 17:21 ` Jonathan Cameron
2024-06-21 17:59 ` Dan Williams
2024-06-21 18:45 ` Jonathan Cameron
2024-06-21 20:44 ` Luck, Tony
2024-06-26 6:03 ` Shiyang Ruan
2024-06-26 15:56 ` Luck, Tony
2024-06-21 17:51 ` Dan Williams
2024-06-25 13:56 ` Shiyang Ruan
2024-07-02 2:12 ` Shiyang Ruan
2024-07-19 16:04 ` Dave Jiang
2024-07-22 7:01 ` Shiyang Ruan
2024-07-25 2:51 ` Yasunori Gotou (Fujitsu)
2024-07-19 6:24 ` Shiyang Ruan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cc2236cf-ace2-498d-a0b3-2fbc9f7fc317@fujitsu.com \
--to=ruansy.fnst@fujitsu.com \
--cc=alison.schofield@intel.com \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=ira.weiny@intel.com \
--cc=jonathan.cameron@huawei.com \
--cc=linux-cxl@vger.kernel.org \
--cc=qemu-devel@nongnu.org \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox