Linux CXL
 help / color / mirror / Atom feed
From: Shiyang Ruan <ruansy.fnst@fujitsu.com>
To: Dave Jiang <dave.jiang@intel.com>,
	qemu-devel@nongnu.org, linux-cxl@vger.kernel.org
Cc: jonathan.cameron@huawei.com, dan.j.williams@intel.com,
	dave@stgolabs.net, ira.weiny@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com
Subject: Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device
Date: Fri, 21 Jun 2024 18:18:08 +0800	[thread overview]
Message-ID: <cc2236cf-ace2-498d-a0b3-2fbc9f7fc317@fujitsu.com> (raw)
In-Reply-To: <e6b3a936-9c45-4d34-bf7c-2b68e2ad79ae@intel.com>



在 2024/6/20 23:51, Dave Jiang 写道:
> 
> 
> On 6/19/24 2:24 AM, Shiyang Ruan wrote:
>>
>>
>> 在 2024/6/19 7:35, Dave Jiang 写道:
>>>
>>>
>>> On 6/18/24 9:53 AM, Shiyang Ruan wrote:
>>>> Background:
>>>> Since CXL device is a memory device, while CPU consumes a poison page of
>>>> CXL device, it always triggers a MCE by interrupt (INT18), no matter
>>>> which-First path is configured.  This is the first report.  Then
>>>> currently, in FW-First path, the poison event is transferred according
>>>> to the following process: CXL device -> firmware -> OS:ACPI->APEI->GHES
>>>>    -> CPER -> trace report.  This is the second one.  These two reports
>>>> are indicating the same poisoning page, which is the so-called "duplicate
>>>> report"[1].  And the memory_failure() handling I'm trying to add in
>>>> OS-First path could also be another duplicate report.
>>>>
>>>> Hope the flow below could make it easier to understand:
>>>> CPU accesses bad memory on CXL device, then
>>>>    -> MCE (INT18), *always* report (1)
>>>>    -> * FW-First (implemented now)
>>>>         -> CXL device -> FW
>>>>            -> OS:ACPI->APEI->GHES->CPER -> trace report (2.a)
>>>>       * OS-First (not implemented yet, I'm working on it)
>>>>         -> CXL device -> MSI
>>>>            -> OS:CXL driver -> memory_failure() (2.b)
>>>> so, the (1) and (2.a/b) are duplicated.
>>>>
>>>> (I didn't get response in my reply for [1] while I have to make patch to
>>>> solve this problem, so please correct me if my understanding is wrong.)
>>>>
>>>> This patch adds a new notifier_block and MCE_PRIO_CXL, for CXL memdev
>>>> to check whether the current poison page has been reported (if yes,
>>>> stop the notifier chain, won't call the following memory_failure()
>>>> to report), into `x86_mce_decoder_chain`.  In this way, if the poison
>>>> page already handled(recorded and reported) in (1) or (2), the other one
>>>> won't duplicate the report.  The record could be clear when
>>>> cxl_clear_poison() is called.
>>>>
>>>> [1] https://lore.kernel.org/linux-cxl/664d948fb86f0_e8be294f8@dwillia2-mobl3.amr.corp.intel.com.notmuch/
>>>>
>>
>> ...
>>
>>>> +
>>>> +static bool cxl_contains_hpa(const struct cxl_memdev *cxlmd, u64 hpa)
>>>> +{
>>>> +    struct cxl_contains_hpa_context ctx = {
>>>> +        .contains = false,
>>>> +        .hpa = hpa,
>>>> +    };
>>>> +    struct cxl_port *port;
>>>> +
>>>> +    port = cxlmd->endpoint;
>>>> +    if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port))
>>>
>>> Maybe no need to check is_cxl_endpoint() if the port is retrieved from cxlmd->endpoint.
>>
>> OK, I'll remove this.
>>
>>>
>>> Also, in order to use cxl_num_decoders_committed(), cxl_region_rwsem must be held. See the lockdep_assert_held() in the function. Maybe add a
>>> guard(cxl_regoin_rwsem);
>>> before the if statement above.
>>
>> Got it.  I didn't realize it before.  Will add it.
>>
>>
>> BTW, may I have your opinion on this proposal?  I'm not sure if the Background and problem described above are correct or not.  If not, it could lead me in the wrong direction.
> 
> Patch looks ok to me, but I'm no RAS expert in this area. Lets wait for comments from Jonathan and Dan.

Thanks!

--
Ruan.

>>
>> Thank you very much!
>>
>>
>> -- 
>> Ruan.
>>
>>>
>>> DJ
>>>

  reply	other threads:[~2024-06-21 10:18 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-18 16:53 [RFC PATCH] cxl: avoid duplicating report from MCE & device Shiyang Ruan
2024-06-18 23:35 ` Dave Jiang
2024-06-19  9:24   ` Shiyang Ruan
2024-06-20 15:51     ` Dave Jiang
2024-06-21 10:18       ` Shiyang Ruan [this message]
2024-06-20 17:02 ` Jonathan Cameron
2024-06-21 10:16   ` Shiyang Ruan
2024-06-21 17:21     ` Jonathan Cameron
2024-06-21 17:59   ` Dan Williams
2024-06-21 18:45     ` Jonathan Cameron
2024-06-21 20:44       ` Luck, Tony
2024-06-26  6:03         ` Shiyang Ruan
2024-06-26 15:56           ` Luck, Tony
2024-06-21 17:51 ` Dan Williams
2024-06-25 13:56   ` Shiyang Ruan
2024-07-02  2:12     ` Shiyang Ruan
2024-07-19 16:04       ` Dave Jiang
2024-07-22  7:01         ` Shiyang Ruan
2024-07-25  2:51           ` Yasunori Gotou (Fujitsu)
2024-07-19  6:24 ` Shiyang Ruan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cc2236cf-ace2-498d-a0b3-2fbc9f7fc317@fujitsu.com \
    --to=ruansy.fnst@fujitsu.com \
    --cc=alison.schofield@intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=qemu-devel@nongnu.org \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox