Linux CXL
 help / color / mirror / Atom feed
From: Dave Jiang <dave.jiang@intel.com>
To: Shiyang Ruan <ruansy.fnst@fujitsu.com>,
	qemu-devel@nongnu.org, linux-cxl@vger.kernel.org
Cc: jonathan.cameron@huawei.com, dan.j.williams@intel.com,
	dave@stgolabs.net, ira.weiny@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com
Subject: Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device
Date: Thu, 20 Jun 2024 08:51:37 -0700	[thread overview]
Message-ID: <e6b3a936-9c45-4d34-bf7c-2b68e2ad79ae@intel.com> (raw)
In-Reply-To: <b9bbf3bc-9125-4cbb-b127-613b841dc9a8@fujitsu.com>



On 6/19/24 2:24 AM, Shiyang Ruan wrote:
> 
> 
> 在 2024/6/19 7:35, Dave Jiang 写道:
>>
>>
>> On 6/18/24 9:53 AM, Shiyang Ruan wrote:
>>> Background:
>>> Since CXL device is a memory device, while CPU consumes a poison page of
>>> CXL device, it always triggers a MCE by interrupt (INT18), no matter
>>> which-First path is configured.  This is the first report.  Then
>>> currently, in FW-First path, the poison event is transferred according
>>> to the following process: CXL device -> firmware -> OS:ACPI->APEI->GHES
>>>   -> CPER -> trace report.  This is the second one.  These two reports
>>> are indicating the same poisoning page, which is the so-called "duplicate
>>> report"[1].  And the memory_failure() handling I'm trying to add in
>>> OS-First path could also be another duplicate report.
>>>
>>> Hope the flow below could make it easier to understand:
>>> CPU accesses bad memory on CXL device, then
>>>   -> MCE (INT18), *always* report (1)
>>>   -> * FW-First (implemented now)
>>>        -> CXL device -> FW
>>>           -> OS:ACPI->APEI->GHES->CPER -> trace report (2.a)
>>>      * OS-First (not implemented yet, I'm working on it)
>>>        -> CXL device -> MSI
>>>           -> OS:CXL driver -> memory_failure() (2.b)
>>> so, the (1) and (2.a/b) are duplicated.
>>>
>>> (I didn't get response in my reply for [1] while I have to make patch to
>>> solve this problem, so please correct me if my understanding is wrong.)
>>>
>>> This patch adds a new notifier_block and MCE_PRIO_CXL, for CXL memdev
>>> to check whether the current poison page has been reported (if yes,
>>> stop the notifier chain, won't call the following memory_failure()
>>> to report), into `x86_mce_decoder_chain`.  In this way, if the poison
>>> page already handled(recorded and reported) in (1) or (2), the other one
>>> won't duplicate the report.  The record could be clear when
>>> cxl_clear_poison() is called.
>>>
>>> [1] https://lore.kernel.org/linux-cxl/664d948fb86f0_e8be294f8@dwillia2-mobl3.amr.corp.intel.com.notmuch/
>>>
> 
> ...
> 
>>> +
>>> +static bool cxl_contains_hpa(const struct cxl_memdev *cxlmd, u64 hpa)
>>> +{
>>> +    struct cxl_contains_hpa_context ctx = {
>>> +        .contains = false,
>>> +        .hpa = hpa,
>>> +    };
>>> +    struct cxl_port *port;
>>> +
>>> +    port = cxlmd->endpoint;
>>> +    if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port))
>>
>> Maybe no need to check is_cxl_endpoint() if the port is retrieved from cxlmd->endpoint.
> 
> OK, I'll remove this.
> 
>>
>> Also, in order to use cxl_num_decoders_committed(), cxl_region_rwsem must be held. See the lockdep_assert_held() in the function. Maybe add a
>> guard(cxl_regoin_rwsem);
>> before the if statement above.
> 
> Got it.  I didn't realize it before.  Will add it.
> 
> 
> BTW, may I have your opinion on this proposal?  I'm not sure if the Background and problem described above are correct or not.  If not, it could lead me in the wrong direction.

Patch looks ok to me, but I'm no RAS expert in this area. Lets wait for comments from Jonathan and Dan. 
> 
> Thank you very much!
> 
> 
> -- 
> Ruan.
> 
>>
>> DJ
>>

  reply	other threads:[~2024-06-20 15:51 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-18 16:53 [RFC PATCH] cxl: avoid duplicating report from MCE & device Shiyang Ruan
2024-06-18 23:35 ` Dave Jiang
2024-06-19  9:24   ` Shiyang Ruan
2024-06-20 15:51     ` Dave Jiang [this message]
2024-06-21 10:18       ` Shiyang Ruan
2024-06-20 17:02 ` Jonathan Cameron
2024-06-21 10:16   ` Shiyang Ruan
2024-06-21 17:21     ` Jonathan Cameron
2024-06-21 17:59   ` Dan Williams
2024-06-21 18:45     ` Jonathan Cameron
2024-06-21 20:44       ` Luck, Tony
2024-06-26  6:03         ` Shiyang Ruan
2024-06-26 15:56           ` Luck, Tony
2024-06-21 17:51 ` Dan Williams
2024-06-25 13:56   ` Shiyang Ruan
2024-07-02  2:12     ` Shiyang Ruan
2024-07-19 16:04       ` Dave Jiang
2024-07-22  7:01         ` Shiyang Ruan
2024-07-25  2:51           ` Yasunori Gotou (Fujitsu)
2024-07-19  6:24 ` Shiyang Ruan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e6b3a936-9c45-4d34-bf7c-2b68e2ad79ae@intel.com \
    --to=dave.jiang@intel.com \
    --cc=alison.schofield@intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=qemu-devel@nongnu.org \
    --cc=ruansy.fnst@fujitsu.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox