From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Return-Path: MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Date: Mon, 16 Apr 2018 11:21:15 +0530 From: poza@codeaurora.org To: Bjorn Helgaas Cc: Sinan Kaya , Keith Busch , Bjorn Helgaas , Philippe Ombredanne , Thomas Gleixner , Greg Kroah-Hartman , Kate Stewart , linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Dongdong Liu , Wei Zhang , Timur Tabi , Alex Williamson , linux-pci-owner@vger.kernel.org Subject: Re: [PATCH v13 6/6] PCI/DPC: Do not do recovery for hotplug enabled system In-Reply-To: References: <20180410210349.GG54986@bhelgaas-glaptop.roam.corp.google.com> <13efe2e8-74c8-acb4-ec58-f79b14a1f182@codeaurora.org> <20180412140648.GD145698@bhelgaas-glaptop.roam.corp.google.com> <20180412143954.GB4810@localhost.localdomain> <20180412150231.GD4810@localhost.localdomain> <20180412170911.GA6424@localhost.localdomain> <20180416031726.GB158153@bhelgaas-glaptop.roam.corp.google.com> Message-ID: List-ID: On 2018-04-16 11:03, poza@codeaurora.org wrote: > On 2018-04-16 08:47, Bjorn Helgaas wrote: >> On Sat, Apr 14, 2018 at 11:53:17AM -0400, Sinan Kaya wrote: >> >>> You indicated that you want to unify the AER and DPC behavior. Let's >>> settle on what we want to do one more time. We have been going forth >>> and back on the direction. >> >> My thinking is that as much as possible, similar events should be >> handled similarly, whether the mechanism is AER, DPC, EEH, etc. >> Ideally, drivers shouldn't have to be aware of which mechanism is in >> use. >> >> Error recovery includes conventional PCI as well, but right now I >> think we're only concerned with PCIe. The following error types are >> from PCIe r4.0, sec 6.2.2: >> >> ERR_COR >> Corrected by hardware with no software intervention. Software >> involved for logging only. >> >> Handled by AER via pci_error_handlers; DPC is never involved. >> >> Link is unaffected. >> >> ERR_NONFATAL >> A transaction is unreliable but the link is fully functional. >> >> If DPC is not supported, handled by AER via pci_error_handlers and >> the link is unaffected. >> >> If DPC supported, handled by DPC (because we set >> PCI_EXP_DPC_CTL_EN_NONFATAL) via remove/re-enumerate. >> >> ERR_FATAL >> The link is unreliable. >> >> If DPC is not supported, handled by AER via pci_error_handlers and >> the link is reset. >> >> If DPC supported, handled by DPC via remove/re-enumerate. >> >> It doesn't seem right to me that we handle both ERR_NONFATAL and >> ERR_FATAL events differently if we happen to have DPC support in a >> switch. >> >> Maybe we should consider triggering DPC only on ERR_FATAL? That would >> keep DPC out of the ERR_NONFATAL cases. >> >> For ERR_FATAL, maybe we should bite the bullet and use >> remove/re-enumerate for AER as well as for DPC. That would be painful >> for higher-level software, but if we're willing to accept that pain >> for new systems that support DPC, maybe life would be better overall >> if it worked the same way on systems without DPC? >> >> Bjorn > > This had crossed my mind when I first looked at the code. > DPC is getting triggered for both ERR_NONFATAL and ERR_FATAL case. > I thought the primary purpose of DPC to recover fatal errors, by > triggering HW recovery. > but what if some platform wants to handle both FATAL and NON_FATAL with > DPC ? > > As you said AER FATAL cases and DPC FATAL cases should be handled > similarly. > e.g. remove/re-enumerate the devices. > > while NON_FATAL case; only AER would come into picture. > if some platform would like to handle DPC NON_FATAL then it should > follow AER NON_FATAL path (where it does not do remove/re-enumerate) > > And the case where hotplug is enabled, remove/re-enumerate more sense > in case of ERR_FATAL. > And the case where hotplug is disabled, only re-enumeration is > required. (no need to remove the devices) > but then do we need to handle this case specifically, what is the harm > in removing the devices in all the cases followed by re-enumerate ? To Clarify the last line, what I meant here was, in case of ERR_FATAL we can always remove/re-enumerate the devices irrespective of hotplug is enabled or not. and in case of ERR_NONFATAL, DPC will follow AER path (where it just tries to recover) although I am not very sure that how to handle ERR_NONFATAL case if hotplug is enabled. Because as Keith suggested device might have been changed run-time. > > Regards, > Oza.