public inbox for linux-pci@vger.kernel.org
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: <dan.j.williams@intel.com>
Cc: Vikram Sethi <vsethi@nvidia.com>,
	Alex Williamson <alex@shazbot.org>,
	Srirangan Madhavan <smadhavan@nvidia.com>,
	"dave@stgolabs.net" <dave@stgolabs.net>,
	"dave.jiang@intel.com" <dave.jiang@intel.com>,
	"alison.schofield@intel.com" <alison.schofield@intel.com>,
	"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
	"ira.weiny@intel.com" <ira.weiny@intel.com>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	"ming.li@zohomail.com" <ming.li@zohomail.com>,
	"rrichter@amd.com" <rrichter@amd.com>,
	"Smita.KoralahalliChannabasappa@amd.com"
	<Smita.KoralahalliChannabasappa@amd.com>,
	"huaisheng.ye@intel.com" <huaisheng.ye@intel.com>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	Vishal Aslot <vaslot@nvidia.com>,
	"Shanker Donthineni" <sdonthineni@nvidia.com>,
	Vidya Sagar <vidyas@nvidia.com>,
	"Jason Gunthorpe" <jgg@nvidia.com>, Matt Ochs <mochs@nvidia.com>,
	Jason Sequeira <jsequeira@nvidia.com>
Subject: Re: [PATCH v4 0/10] CXL Reset support for Type 2 devices
Date: Wed, 28 Jan 2026 12:36:47 +0000
Message-ID: <20260128123647.0000602f@huawei.com>
In-Reply-To: <697985c2c622_1d33100a2@dwillia2-mobl4.notmuch>

On Tue, 27 Jan 2026 19:42:58 -0800
dan.j.williams@intel.com wrote:

> Vikram Sethi wrote:
> > Hi Dan,
> >   
> > >Once the protocol error handling series has landed there is a potential
> > >to extend that with CXL Reset recovery. However, that needs a clear
> > >error model defined as to which resets have a chance of recovering
> > >*system* operation when CXL.cache/mem fails. In Terry's series, panic /
> > >reboot is the recovery, not reset.  
> > 
> > It's not just about CXL protocol error handling though. Type2 Device
> > passthrough will be a common usecase for CXL reset (entire device is
> > assigned to VM) across different VM assignment.  
> 
> I understand. The point about CXL Protocol Error handling series is that
> it at least enlightens the PCIe core about the presence of active CXL
> links.
> 
> It is also the case that it is much closer to being upstream than this set,
> which still has fundamental questions.
> 
> > The cache and mem of the device must be cleared via CXL reset for full
> > device passthrough in addition to the PCIe/CXL.IO reset. Another
> > usecase is when the device is reconfigured either in baremetal or in
> > VM for passthrough usecase. Such usecases often require a reset of the
> > device, and its cache, mem and not a narrow "memregion" reset, so I'm
> > not in favor of exposing it via CXL.mem sysfs entries. IMO, device
> > sysfs attribute reset method is appropriate for type2 devices.  
> 
> That case is already handled today with secondary bus reset to
> completely reset an entire device. That path is problematic for CXL
> because PCI reset has no idea about how to manage caches or handle
> memory unplug.
> 
> Administrator is responsible for making sure that event is not a
> surprise memory removal or cache protocol corruption event. That is a
> level of explosiveness that PCI reset has never needed to consider.
> 
> CXL Reset wants to be more surgical than secondary bus reset. The only
> way to be more surgical is to cooperate with the CXL core that knows
> whether the explosives have been rendered inert.
> 
> > Finally, there is the error usecase, which in the common case is as
> > simple as an uncorrected ECC error in the HDM. While not strictly
> > necessary, it is common practice to reset the device in such cases, to
> > recover the bad page/row via PPR on device reset. You want reset of
> > the memory controller as part of the CXL reset here, FLR is not
> > enough.  
> 
> ??
> 
> CXL memory repair is already upstream without CXL Reset, see
> CONFIG_CXL_EDAC_MEM_REPAIR.

To actually do it we rely on tear down of all access to the device
(if it's disruptive).
There is new spec stuff covering asking a device to just do it on the next
reset, which would fit more closely with this flow. No support yet I think.

Mind you we are talking type 2, so who knows what people will implement.
Nice if they used that part of the spec but I doubt we can rely on it!

Jonathan


> 
> > Regarding scope, and "recoverability", we have significant experience
> > with recovering "pre-CXL" coherent GB200 devices for device errors:
> > memory ECC or other, by killing the application using device memory,
> > unloading driver, resetting the device, and reloading driver. The CXL
> > protocol error series can use CXL reset series, but I don't see that
> > this series needs to wait for protocol error handling to be merged.    
> 
> I do want to get to the point where device memory error recovery is a
> first class citizen. CXL Protocol Error handling just happens to be at
> the top of the review queue and safe to assume it can be a foundation
> for further RAS features.
> 
> 

