From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 3 Feb 2026 11:49:02 -0700
Precedence: bulk
X-Mailing-List: linux-pci@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler
To: "Bowman, Terry", dave@stgolabs.net, jonathan.cameron@huawei.com,
 alison.schofield@intel.com, dan.j.williams@intel.com, bhelgaas@google.com,
 shiju.jose@huawei.com, ming.li@zohomail.com,
 Smita.KoralahalliChannabasappa@amd.com, rrichter@amd.com,
 dan.carpenter@linaro.org, PradeepVineshReddy.Kodamati@amd.com,
 lukas@wunner.de, Benjamin.Cheatham@amd.com,
 sathyanarayanan.kuppuswamy@linux.intel.com, linux-cxl@vger.kernel.org,
 vishal.l.verma@intel.com, alucerop@amd.com, ira.weiny@intel.com
Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
References: <20260203025244.3093805-1-terry.bowman@amd.com>
 <20260203025244.3093805-8-terry.bowman@amd.com>
 <66927d59-4a27-462c-b281-078967f4fca9@amd.com>
Content-Language: en-US
From: Dave Jiang
In-Reply-To: <66927d59-4a27-462c-b281-078967f4fca9@amd.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 2/3/26 11:35 AM, Bowman, Terry wrote:
> On 2/3/2026 11:31 AM, Dave Jiang wrote:
>>
>>
>> On 2/2/26 7:52 PM, Terry Bowman wrote:
>>> CXL drivers now implement protocol RAS support. PCI protocol errors,
>>> however, continue to be reported via the AER capability and must still be
>>> handled by a PCI error recovery callback.
>>>
>>> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
>>> new cxl_pci_error_detected() implementation that handles only uncorrectable
>>> PCI protocol errors reported through AER.
>>
>> Do we need to explain why only uncorrectable is handled?
>>
>
> Would it be OK if I removed "only" with s/only// ?
>
> After mentioning an important detail I should elaborate. But how about if I
> remove it and not refer to the CE at all here? CE shouldn't be mentioned
> without good reason in a primarily UCE patch.

Is CE handling added later? Maybe just say that.

DJ

>
> - Terry
>
>>>
>>> Introduce a helper named cxl_handle_aer() to handle and log the CXL
>>> device's AER error.
>>>
>>> This cleanly separates CXL protocol error handling from PCI AER handling
>>> and ensures that each subsystem processes only the errors it is
>>> responsible for.
>>>
>>> Signed-off-by: Terry Bowman
>>>
>>> ---
>>>
>>> Changes in v14->v15:
>>> - Title update (Terry)
>>> - Change cxl_pci_error_detected() to handle & log AER (Terry)
>>> - Update commit message (Terry)
>>> - Moved cxl_handle_ras()/cxl_handle_cor_ras() to earlier patch (Terry)
>>>
>>> Changes in v13->v14:
>>> - Update commit headline (Bjorn)
>>> - Rename pci_error_detected()/pci_cor_error_detected() ->
>>>   cxl_pci_error_detected()/cxl_pci_cor_error_detected() (Jonathan)
>>> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
>>> - Split into separate patches for UCE and CE (Terry)
>>>
>>> Changes in v12->v13:
>>> - Update commit message (Terry)
>>> - Updated all the implementation and commit message. (Terry)
>>> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
>>>   pdev (Dave Jiang)
>>>
>>> Changes in v11->v12:
>>> - None
>>>
>>> Changes in v10->v11:
>>> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
>>> - cxl_error_detected() - Remove extra line (Shiju)
>>> - Changes moved to core/ras.c (Terry)
>>> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
>>> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
>>> - Move #include "pci.h" from cxl.h to core.h (Terry)
>>> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
>>> ---
>>>  drivers/cxl/core/ras.c | 68 +++++++++++++++---------------------------
>>>  drivers/cxl/cxlpci.h   |  9 +++---
>>>  drivers/cxl/pci.c      |  6 ++--
>>>  3 files changed, 31 insertions(+), 52 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>> index 970ff3df442c..061e6aaec176 100644
>>> --- a/drivers/cxl/core/ras.c
>>> +++ b/drivers/cxl/core/ras.c
>>> @@ -441,55 +441,35 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>>>  }
>>>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>>>
>>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>> -				    pci_channel_state_t state)
>>> +static bool cxl_handle_aer(struct pci_dev *pdev)
>>
>> For a function that returns a bool, the function name doesn't sound quite
>> right. Maybe cxl_uncor_aer_present()?
>>
>> DJ
>>
>
> I was trying to follow the pattern where the detected() function calls the
> handle() function, as done for cxl_handle_ras() and cxl_handle_cor_ras().
>
> I will change to cxl_uncor_aer_present().
>
> -Terry
>
>>>  {
>>> -	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>> -	struct cxl_memdev *cxlmd = cxlds->cxlmd;
>>> -	struct device *dev = &cxlmd->dev;
>>> -	bool ue;
>>> -
>>> -	scoped_guard(device, dev) {
>>> -		if (!dev->driver) {
>>> -			dev_warn(&pdev->dev,
>>> -				 "%s: memdev disabled, abort error handling\n",
>>> -				 dev_name(dev));
>>> -			return PCI_ERS_RESULT_DISCONNECT;
>>> -		}
>>> +	struct aer_capability_regs aer;
>>> +	u32 aer_cap = pdev->aer_cap;
>>>
>>> -		if (cxlds->rcd)
>>> -			cxl_handle_rdport_errors(cxlds);
>>> -		/*
>>> -		 * A frozen channel indicates an impending reset which is fatal to
>>> -		 * CXL.mem operation, and will likely crash the system. On the off
>>> -		 * chance the situation is recoverable dump the status of the RAS
>>> -		 * capability registers and bounce the active state of the memdev.
>>> -		 */
>>> -		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
>>> -				    cxlmd->endpoint->regs.ras);
>>> +	if (!aer_cap) {
>>> +		pr_warn_ratelimited("%s: AER capability isn't present\n",
>>> +				    pci_name(pdev));
>>> +		return false;
>>>  	}
>>>
>>> -	switch (state) {
>>> -	case pci_channel_io_normal:
>>> -		if (ue) {
>>> -			device_release_driver(dev);
>>> -			return PCI_ERS_RESULT_NEED_RESET;
>>> -		}
>>> -		return PCI_ERS_RESULT_CAN_RECOVER;
>>> -	case pci_channel_io_frozen:
>>> -		dev_warn(&pdev->dev,
>>> -			 "%s: frozen state error detected, disable CXL.mem\n",
>>> -			 dev_name(dev));
>>> -		device_release_driver(dev);
>>> -		return PCI_ERS_RESULT_NEED_RESET;
>>> -	case pci_channel_io_perm_failure:
>>> -		dev_warn(&pdev->dev,
>>> -			 "failure state error detected, request disconnect\n");
>>> -		return PCI_ERS_RESULT_DISCONNECT;
>>> -	}
>>> -	return PCI_ERS_RESULT_NEED_RESET;
>>> +	pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS, &aer.uncor_status);
>>> +	pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK, &aer.uncor_mask);
>>> +
>>> +	/* The AER driver logged the error */
>>> +	pci_aer_clear_nonfatal_status(pdev);
>>> +	pci_aer_clear_fatal_status(pdev);
>>> +
>>> +	return (aer.uncor_status & aer.uncor_mask);
>>> +}
>>> +
>>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>>> +					pci_channel_state_t error)
>>> +{
>>> +	u32 rc = cxl_handle_aer(pdev);
>>> +
>>> +	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_CAN_RECOVER;
>>>  }
>>> -EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
>>> +EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
>>>
>>>  static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
>>>  {
>>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>>> index 970add0256e9..5534422b496c 100644
>>> --- a/drivers/cxl/cxlpci.h
>>> +++ b/drivers/cxl/cxlpci.h
>>> @@ -79,15 +79,14 @@ void read_cdat_data(struct cxl_port *port);
>>>
>>>  #ifdef CONFIG_CXL_RAS
>>>  void cxl_cor_error_detected(struct pci_dev *pdev);
>>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>> -				    pci_channel_state_t state);
>>>  void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
>>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>>> +					pci_channel_state_t error);
>>>  void devm_cxl_port_ras_setup(struct cxl_port *port);
>>>  #else
>>>  static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
>>> -
>>> -static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>> -						  pci_channel_state_t state)
>>> +static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>>> +						      pci_channel_state_t state)
>>>  {
>>>  	return PCI_ERS_RESULT_NONE;
>>>  }
>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>> index acb0eb2a13c3..ff741adc7c7f 100644
>>> --- a/drivers/cxl/pci.c
>>> +++ b/drivers/cxl/pci.c
>>> @@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
>>>  	}
>>>  }
>>>
>>> -static const struct pci_error_handlers cxl_error_handlers = {
>>> -	.error_detected = cxl_error_detected,
>>> +static const struct pci_error_handlers pci_error_handlers = {
>>> +	.error_detected = cxl_pci_error_detected,
>>>  	.slot_reset = cxl_slot_reset,
>>>  	.resume = cxl_error_resume,
>>>  	.cor_error_detected = cxl_cor_error_detected,
>>> @@ -1063,7 +1063,7 @@ static struct pci_driver cxl_pci_driver = {
>>>  	.name = KBUILD_MODNAME,
>>>  	.id_table = cxl_mem_pci_tbl,
>>>  	.probe = cxl_pci_probe,
>>> -	.err_handler = &cxl_error_handlers,
>>> +	.err_handler = &pci_error_handlers,
>>>  	.dev_groups = cxl_rcd_groups,
>>>  	.driver = {
>>>  		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
>>
>
>
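
For reference, a minimal user-space sketch of the "is an uncorrectable AER
error present" check discussed in this thread. It assumes the PCIe convention
that a set bit in the Uncorrectable Error Mask register suppresses reporting
of the matching status bit, so the pending errors would be status & ~mask
rather than the status & mask expression in the quoted hunk. The types and
names here (uncor_regs, uncor_aer_present, ers_result, error_detected) are
hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for two pci_ers_result_t values. */
enum ers_result { ERS_CAN_RECOVER, ERS_DISCONNECT };

/* Hypothetical snapshot of the two AER registers the helper reads. */
struct uncor_regs {
	uint32_t status;	/* contents of PCI_ERR_UNCOR_STATUS */
	uint32_t mask;		/* contents of PCI_ERR_UNCOR_MASK; set bit = masked */
};

/* True when at least one uncorrectable error is logged and not masked. */
static bool uncor_aer_present(const struct uncor_regs *regs)
{
	return (regs->status & ~regs->mask) != 0;
}

/* Map the boolean onto a recovery result, mirroring the quoted callback. */
static enum ers_result error_detected(const struct uncor_regs *regs)
{
	return uncor_aer_present(regs) ? ERS_DISCONNECT : ERS_CAN_RECOVER;
}
```

With a status bit set and the same bit masked, error_detected() reports
ERS_CAN_RECOVER; with the mask cleared it reports ERS_DISCONNECT. Keeping the
helper boolean also matches the cxl_uncor_aer_present() naming agreed above.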