From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F2650155351 for ; Tue, 18 Jun 2024 23:35:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=192.198.163.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718753751; cv=none; b=TFC9Rj0qF1pnH8o7fTE2i3wRzdU8gHiq8SSxgs5OM4R8t7AG2pKnL8ZXs4MDkKHbbzgdz8VWnOTcTAHZcXP3Kspxqpri3Khcy07UiNq6TdPsAklPjqsVBhCjNBe9Am4AdZ8DgTibb14ImjJFROHqeSYYJcuD3/7TkawTdt9PjTU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1718753751; c=relaxed/simple; bh=M1FEDauRjB/4gINRNPjojWAMB+egJq4MMB0aPzU2x14=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=pQMqCgQqPymbX4K77IYU7Z4V4C2mLjZqRvUUGcetMC/LitGIB+HPN9Op2xcXvNAfcZPWxyCeG9eIXfCUF6L4SL8QvdZwCj/gC11+K4HLs5CgPpyteEodY1ppL7fAEqU+obmhp5RzPk4ZmYTS/bdrTBg6Sz0ZpI7YHu34AAumGNg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=JxRUkDvX; arc=none smtp.client-ip=192.198.163.18 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="JxRUkDvX" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1718753749; x=1750289749; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=M1FEDauRjB/4gINRNPjojWAMB+egJq4MMB0aPzU2x14=; b=JxRUkDvXImKcTp0hxXCqHixtw5JRll6iNCZPpR/NgbSQDh2LIRqU7JXK pFIpH1EjspdxOwVyd+ahweuQUJLuXKk3RqQ//6/hv901NctQ1ropW8zpB XfDfOmDxQxBzx4WcEKWe/dY9IFmSBQczeW3wHnRIgSUAPCAZgX+u8jf67 IsTJeJJPdjvnuTqXpD5wGW+TAcQyqIl+UwwZq8quJQy6nynB21aMs7Fwq 7o0lOjsf5Vweu0Wjn1gzhEGjP160K4slmP4V3h8yB+06T/YDg3kg8k89N 2ZiAhAE9mY2qbskCdr0gOD5GAmuQsYeS34G6x86wXorumgqikIajNZBIM A==; X-CSE-ConnectionGUID: j5kmBitZSFSzflAr0R7ypg== X-CSE-MsgGUID: sKtPObMASV+jpJyvdb4Hsg== X-IronPort-AV: E=McAfee;i="6700,10204,11107"; a="15382247" X-IronPort-AV: E=Sophos;i="6.08,247,1712646000"; d="scan'208";a="15382247" Received: from orviesa001.jf.intel.com ([10.64.159.141]) by fmvoesa112.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jun 2024 16:35:48 -0700 X-CSE-ConnectionGUID: AMbu1aL7Ru6/aO4Vv+HzMw== X-CSE-MsgGUID: PIcFpnTxRFaosnHDijLJqA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.08,247,1712646000"; d="scan'208";a="79183071" Received: from djiang5-mobl3.amr.corp.intel.com (HELO [10.125.111.236]) ([10.125.111.236]) by smtpauth.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jun 2024 16:35:48 -0700 Message-ID: Date: Tue, 18 Jun 2024 16:35:46 -0700 Precedence: bulk X-Mailing-List: linux-cxl@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device To: Shiyang Ruan , qemu-devel@nongnu.org, linux-cxl@vger.kernel.org Cc: jonathan.cameron@huawei.com, dan.j.williams@intel.com, dave@stgolabs.net, ira.weiny@intel.com, alison.schofield@intel.com, vishal.l.verma@intel.com References: <20240618165310.877974-1-ruansy.fnst@fujitsu.com> Content-Language: en-US From: Dave Jiang In-Reply-To: <20240618165310.877974-1-ruansy.fnst@fujitsu.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit On 6/18/24 9:53 AM, Shiyang Ruan wrote: > Background: > Since CXL device is a memory device, while CPU consumes a poison page of > CXL device, it always triggers a MCE by interrupt (INT18), no matter > which-First path is configured. This is the first report. Then > currently, in FW-First path, the poison event is transferred according > to the following process: CXL device -> firmware -> OS:ACPI->APEI->GHES > -> CPER -> trace report. This is the second one. These two reports > are indicating the same poisoning page, which is the so-called "duplicate > report"[1]. And the memory_failure() handling I'm trying to add in > OS-First path could also be another duplicate report. > > Hope the flow below could make it easier to understand: > CPU accesses bad memory on CXL device, then > -> MCE (INT18), *always* report (1) > -> * FW-First (implemented now) > -> CXL device -> FW > -> OS:ACPI->APEI->GHES->CPER -> trace report (2.a) > * OS-First (not implemented yet, I'm working on it) > -> CXL device -> MSI > -> OS:CXL driver -> memory_failure() (2.b) > so, the (1) and (2.a/b) are duplicated. > > (I didn't get response in my reply for [1] while I have to make patch to > solve this problem, so please correct me if my understanding is wrong.) > > This patch adds a new notifier_block and MCE_PRIO_CXL, for CXL memdev > to check whether the current poison page has been reported (if yes, > stop the notifier chain, won't call the following memory_failure() > to report), into `x86_mce_decoder_chain`. In this way, if the poison > page already handled(recorded and reported) in (1) or (2), the other one > won't duplicate the report. The record could be clear when > cxl_clear_poison() is called. > > [1] https://lore.kernel.org/linux-cxl/664d948fb86f0_e8be294f8@dwillia2-mobl3.amr.corp.intel.com.notmuch/ > > Signed-off-by: Shiyang Ruan > --- > arch/x86/include/asm/mce.h | 1 + > drivers/cxl/core/mbox.c | 130 +++++++++++++++++++++++++++++++++++++ > drivers/cxl/core/memdev.c | 6 +- > drivers/cxl/cxlmem.h | 3 + > 4 files changed, 139 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h > index dfd2e9699bd7..d8109c48e7d9 100644 > --- a/arch/x86/include/asm/mce.h > +++ b/arch/x86/include/asm/mce.h > @@ -182,6 +182,7 @@ enum mce_notifier_prios { > MCE_PRIO_NFIT, > MCE_PRIO_EXTLOG, > MCE_PRIO_UC, > + MCE_PRIO_CXL, > MCE_PRIO_EARLY, > MCE_PRIO_CEC, > MCE_PRIO_HIGHEST = MCE_PRIO_CEC > diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c > index 2626f3fff201..0eb3c5401e81 100644 > --- a/drivers/cxl/core/mbox.c > +++ b/drivers/cxl/core/mbox.c > @@ -4,6 +4,8 @@ > #include > #include > #include > +#include > +#include > #include > #include > #include > @@ -880,6 +882,9 @@ void cxl_event_trace_record(const struct cxl_memdev *cxlmd, > if (cxlr) > hpa = cxl_trace_hpa(cxlr, cxlmd, dpa); > > + if (hpa != ULLONG_MAX && cxl_mce_recorded(hpa)) > + return; > + > if (event_type == CXL_CPER_EVENT_GEN_MEDIA) > trace_cxl_general_media(cxlmd, type, cxlr, hpa, > &evt->gen_media); > @@ -1408,6 +1413,127 @@ int cxl_poison_state_init(struct cxl_memdev_state *mds) > } > EXPORT_SYMBOL_NS_GPL(cxl_poison_state_init, CXL); > > +struct cxl_mce_record { > + struct list_head node; > + u64 hpa; > +}; > +LIST_HEAD(cxl_mce_records); > +DEFINE_MUTEX(cxl_mce_mutex); > + > +bool cxl_mce_recorded(u64 hpa) > +{ > + struct cxl_mce_record *cur, *next, *rec; > + int rc; > + > + rc = mutex_lock_interruptible(&cxl_mce_mutex); > + if (rc) > + return false; > + > + list_for_each_entry_safe(cur, next, &cxl_mce_records, node) { > + if (cur->hpa == hpa) { > + mutex_unlock(&cxl_mce_mutex); > + return true; > + } > + } > + > + rec = kmalloc(sizeof(struct cxl_mce_record), GFP_KERNEL); > + rec->hpa = hpa; > + list_add(&cxl_mce_records, &rec->node); > + > + mutex_unlock(&cxl_mce_mutex); > + > + return false; > +} > + > +void cxl_mce_clear(u64 hpa) > +{ > + struct cxl_mce_record *cur, *next; > + int rc; > + > + rc = mutex_lock_interruptible(&cxl_mce_mutex); > + if (rc) > + return; > + > + list_for_each_entry_safe(cur, next, &cxl_mce_records, node) { > + if (cur->hpa == hpa) { > + list_del(&cur->node); > + break; > + } > + } > + > + mutex_unlock(&cxl_mce_mutex); > +} > + > +struct cxl_contains_hpa_context { > + bool contains; > + u64 hpa; > +}; > + > +static int __cxl_contains_hpa(struct device *dev, void *arg) > +{ > + struct cxl_contains_hpa_context *ctx = arg; > + struct cxl_endpoint_decoder *cxled; > + struct range *range; > + u64 hpa = ctx->hpa; > + > + if (!is_endpoint_decoder(dev)) > + return 0; > + > + cxled = to_cxl_endpoint_decoder(dev); > + range = &cxled->cxld.hpa_range; > + > + if (range->start <= hpa && hpa <= range->end) { > + ctx->contains = true; > + return 1; > + } > + > + return 0; > +} > + > +static bool cxl_contains_hpa(const struct cxl_memdev *cxlmd, u64 hpa) > +{ > + struct cxl_contains_hpa_context ctx = { > + .contains = false, > + .hpa = hpa, > + }; > + struct cxl_port *port; > + > + port = cxlmd->endpoint; > + if (port && is_cxl_endpoint(port) && cxl_num_decoders_committed(port)) Maybe no need to check is_cxl_endpoint() if the port is retrieved from cxlmd->endpoint. Also, in order to use cxl_num_decoders_committed(), cxl_region_rwsem must be held. See the lockdep_assert_held() in the function. Maybe add a guard(cxl_regoin_rwsem); before the if statement above. DJ > + device_for_each_child(&port->dev, &ctx, __cxl_contains_hpa); > + > + return ctx.contains; > +} > + > +static int cxl_handle_mce(struct notifier_block *nb, unsigned long val, > + void *data) > +{ > + struct mce *mce = (struct mce *)data; > + struct cxl_memdev_state *mds = container_of(nb, struct cxl_memdev_state, > + mce_notifier); > + u64 hpa; > + > + if (!mce || !mce_usable_address(mce)) > + return NOTIFY_DONE; > + > + hpa = mce->addr & MCI_ADDR_PHYSADDR; > + > + /* Check if the PFN is located on this CXL device */ > + if (!pfn_valid(hpa >> PAGE_SHIFT) && > + !cxl_contains_hpa(mds->cxlds.cxlmd, hpa)) > + return NOTIFY_DONE; > + > + /* > + * Search PFN in the cxl_mce_records, if already exists, don't continue > + * to do memory_failure() to avoid a poison address being reported > + * more than once. > + */ > + if (cxl_mce_recorded(hpa)) > + return NOTIFY_STOP; > + else > + return NOTIFY_OK; > +} > + > struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > { > struct cxl_memdev_state *mds; > @@ -1427,6 +1553,10 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev) > mds->ram_perf.qos_class = CXL_QOS_CLASS_INVALID; > mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID; > > + mds->mce_notifier.notifier_call = cxl_handle_mce; > + mds->mce_notifier.priority = MCE_PRIO_CXL; > + mce_register_decode_chain(&mds->mce_notifier); > + > return mds; > } > EXPORT_SYMBOL_NS_GPL(cxl_memdev_state_create, CXL); > diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c > index 0277726afd04..aa3ac89d17be 100644 > --- a/drivers/cxl/core/memdev.c > +++ b/drivers/cxl/core/memdev.c > @@ -376,10 +376,14 @@ int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa) > goto out; > > cxlr = cxl_dpa_to_region(cxlmd, dpa); > - if (cxlr) > + if (cxlr) { > + u64 hpa = cxl_trace_hpa(cxlr, cxlmd, dpa); > + > + cxl_mce_clear(hpa); > dev_warn_once(mds->cxlds.dev, > "poison clear dpa:%#llx region: %s\n", dpa, > dev_name(&cxlr->dev)); > + } > > record = (struct cxl_poison_record) { > .address = cpu_to_le64(dpa), > diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h > index 19aba81cdf13..fbf8d9f46984 100644 > --- a/drivers/cxl/cxlmem.h > +++ b/drivers/cxl/cxlmem.h > @@ -501,6 +501,7 @@ struct cxl_memdev_state { > struct cxl_fw_state fw; > > struct rcuwait mbox_wait; > + struct notifier_block mce_notifier; > int (*mbox_send)(struct cxl_memdev_state *mds, > struct cxl_mbox_cmd *cmd); > }; > @@ -836,6 +837,8 @@ int cxl_mem_get_poison(struct cxl_memdev *cxlmd, u64 offset, u64 len, > int cxl_trigger_poison_list(struct cxl_memdev *cxlmd); > int cxl_inject_poison(struct cxl_memdev *cxlmd, u64 dpa); > int cxl_clear_poison(struct cxl_memdev *cxlmd, u64 dpa); > +bool cxl_mce_recorded(u64 pfn); > +void cxl_mce_clear(u64 pfn); > > #ifdef CONFIG_CXL_SUSPEND > void cxl_mem_active_inc(void);