From: <dan.j.williams@intel.com>
To: Vikram Sethi <vsethi@nvidia.com>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
Alex Williamson <alex@shazbot.org>,
"Srirangan Madhavan" <smadhavan@nvidia.com>
Cc: "dave@stgolabs.net" <dave@stgolabs.net>,
"jonathan.cameron@huawei.com" <jonathan.cameron@huawei.com>,
"dave.jiang@intel.com" <dave.jiang@intel.com>,
"alison.schofield@intel.com" <alison.schofield@intel.com>,
"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"bhelgaas@google.com" <bhelgaas@google.com>,
"ming.li@zohomail.com" <ming.li@zohomail.com>,
"rrichter@amd.com" <rrichter@amd.com>,
"Smita.KoralahalliChannabasappa@amd.com"
<Smita.KoralahalliChannabasappa@amd.com>,
"huaisheng.ye@intel.com" <huaisheng.ye@intel.com>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
Vishal Aslot <vaslot@nvidia.com>,
"Shanker Donthineni" <sdonthineni@nvidia.com>,
Vidya Sagar <vidyas@nvidia.com>,
"Jason Gunthorpe" <jgg@nvidia.com>, Matt Ochs <mochs@nvidia.com>,
Jason Sequeira <jsequeira@nvidia.com>
Subject: Re: [PATCH v4 0/10] CXL Reset support for Type 2 devices
Date: Tue, 27 Jan 2026 19:42:58 -0800
Message-ID: <697985c2c622_1d33100a2@dwillia2-mobl4.notmuch>
In-Reply-To: <LV8PR12MB9182DEE18D9F83194A6E0034BD90A@LV8PR12MB9182.namprd12.prod.outlook.com>
Vikram Sethi wrote:
> Hi Dan,
>
> >Once the protocol error handling series has landed there is a potential
> >to extend that with CXL Reset recovery. However, that needs a clear
> >error model defined as to which resets have a chance of recovering
> >*system* operation when CXL.cache/mem fails. In Terry's series, panic /
> >reboot is the recovery, not reset.
>
> It's not just about CXL protocol error handling though. Type2 Device
> passthrough will be a common usecase for CXL reset (entire device is
> assigned to VM) across different VM assignment.

I understand. The point about the CXL Protocol Error handling series is
that it at least enlightens the PCIe core about the presence of active
CXL links.

It is also the case that it is much closer to being upstream than this
set, which still has fundamental questions.

> The cache and mem of the device must be cleared via CXL reset for full
> device passthrough in addition to the PCIe/CXL.IO reset. Another
> usecase is when the device is reconfigured either in baremetal or in
> VM for passthrough usecase. Such usecases often require a reset of the
> device, and its cache, mem and not a narrow "memregion" reset, so I'm
> not in favor of exposing it via CXL.mem sysfs entries. IMO, device
> sysfs attribute reset method is appropriate for type2 devices.

That case is already handled today with secondary bus reset to
completely reset an entire device. That path is problematic for CXL
because PCI reset has no idea about how to manage caches or handle
memory unplug.

The administrator is responsible for making sure that event is not a
surprise memory removal or cache protocol corruption event. That is a
level of explosiveness that PCI reset has never needed to consider.

CXL Reset wants to be more surgical than secondary bus reset. The only
way to be more surgical is to cooperate with the CXL core that knows
whether the explosives have been rendered inert.
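
To make the secondary bus reset versus CXL Reset distinction concrete,
here is a minimal userspace sketch of choosing a whole-device reset
method from the space-separated list that the PCI core exposes in
/sys/bus/pci/devices/<BDF>/reset_method. "bus" is the existing name for
secondary bus reset. "cxl_reset" is an assumed, hypothetical name for
the method this series proposes, and both helper functions are
illustrative, not part of any kernel or library API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `name` appears as a whole word in `methods`, a
 * whitespace-separated list such as the contents of
 * /sys/bus/pci/devices/<BDF>/reset_method (for example "flr bus\n").
 */
static bool has_reset_method(const char *methods, const char *name)
{
	size_t len = strlen(name);
	const char *p = methods;

	if (len == 0)
		return false;

	while (*p) {
		size_t tok;

		/* Skip separators, then measure the next method name. */
		p += strspn(p, " \t\n");
		tok = strcspn(p, " \t\n");

		if (tok == len && strncmp(p, name, len) == 0)
			return true;
		p += tok;
	}
	return false;
}

/*
 * Prefer the CXL-aware reset when the device advertises it, otherwise
 * fall back to secondary bus reset. Returns NULL when neither is
 * listed. "cxl_reset" is a HYPOTHETICAL method name (assumption).
 */
static const char *pick_whole_device_reset(const char *methods)
{
	static const char *const preferred[] = { "cxl_reset", "bus" };
	size_t i;

	for (i = 0; i < sizeof(preferred) / sizeof(preferred[0]); i++) {
		if (has_reset_method(methods, preferred[i]))
			return preferred[i];
	}
	return NULL;
}
```

A caller would read reset_method into a buffer and pass it to
pick_whole_device_reset(). "flr" is deliberately left out of the
preference list because the thread's premise is that FLR does not clear
device cache and memory state. Whole-word matching matters here: the
existing "cxl_bus" method must not be mistaken for "bus".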
> Finally, there is the error usecase, which in the common case is as
> simple as an uncorrected ECC error in the HDM. While not strictly
> necessary, it is common practice to reset the device in such cases, to
> recover the bad page/row via PPR on device reset. You want reset of
> the memory controller as part of the CXL reset here, FLR is not
> enough.
??
CXL memory repair is already upstream without CXL Reset, see
CONFIG_CXL_EDAC_MEM_REPAIR.
> Regarding scope, and "recoverability", we have significant experience
> with recovering "pre-CXL" coherent GB200 devices for device errors:
> memory ECC or other, by killing the application using device memory,
> unloading driver, resetting the device, and reloading driver. The CXL
> protocol error series can use CXL reset series, but I don't see that
> this series needs to wait for protocol error handling to be merged.

I do want to get to the point where device memory error recovery is a
first class citizen. CXL Protocol Error handling just happens to be at
the top of the review queue, and it is safe to assume it can be a
foundation for further RAS features.
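
For reference, the driver-level part of the recovery flow Vikram
describes above (unload driver, reset the device, reload driver) can be
sketched from userspace with the existing PCI sysfs attributes `unbind`,
`reset`, and `drivers_probe`. This is only a sketch under stated
assumptions: the function names and the `sysfs_root` parameter are
hypothetical (the latter exists so the sequence can be exercised against
a scratch directory), killing the application is out of scope, and which
reset actually runs depends on the device's reset_method ordering. It
does nothing to coordinate with the CXL core, which is the gap discussed
in this thread.

```c
#include <stdio.h>

/* Write a short string to a sysfs-style attribute; 0 on success, -1 on error. */
static int write_attr(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Format "<root>/<rel>" (rel may contain one %s for bdf) and write value. */
static int write_under(const char *root, const char *rel_fmt,
		       const char *bdf, const char *value)
{
	char rel[256];
	char path[512];
	int n;

	n = snprintf(rel, sizeof(rel), rel_fmt, bdf);
	if (n < 0 || (size_t)n >= sizeof(rel))
		return -1;
	n = snprintf(path, sizeof(path), "%s/%s", root, rel);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return write_attr(path, value);
}

/*
 * Unbind the driver, reset the device, then ask the PCI core to probe
 * it again. sysfs_root is normally "/sys"; bdf looks like
 * "0000:01:00.0". Stops at the first failing step and returns -1.
 */
static int reset_and_reprobe(const char *sysfs_root, const char *bdf)
{
	/* 1. Detach the driver so nothing touches the device mid-reset. */
	if (write_under(sysfs_root, "bus/pci/devices/%s/driver/unbind", bdf, bdf))
		return -1;
	/* 2. Trigger the reset using the device's configured methods. */
	if (write_under(sysfs_root, "bus/pci/devices/%s/reset", bdf, "1"))
		return -1;
	/* 3. Let the PCI core match and bind a driver again. */
	if (write_under(sysfs_root, "bus/pci/drivers_probe", bdf, bdf))
		return -1;
	return 0;
}
```

On a real system this would be called as
reset_and_reprobe("/sys", "0000:01:00.0") with root privileges, and only
after the administrator has confirmed that no device memory is still in
use by the host.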