From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga18.intel.com ([134.134.136.126]:47546 "EHLO mga18.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728021AbeIJWQb (ORCPT ); Mon, 10 Sep 2018 18:16:31 -0400 Date: Mon, 10 Sep 2018 11:22:46 -0600 From: Keith Busch To: Lukas Wunner Cc: Linux PCI , Bjorn Helgaas , Benjamin Herrenschmidt , Sinan Kaya , Thomas Tai , "poza@codeaurora.org" , Christoph Hellwig Subject: Re: [PATCHv2 16/20] PCI/pciehp: Implement error handling callbacks Message-ID: <20180910172246.GA7753@localhost.localdomain> References: <20180905203546.21921-1-keith.busch@intel.com> <20180905203546.21921-17-keith.busch@intel.com> <20180910132033.ei5nk4iibt7pesd5@wunner.de> <20180910145641.GA7466@localhost.localdomain> <20180910160926.7grtfdyw3iv2xg4x@wunner.de> <20180910164528.GC7466@localhost.localdomain> <20180910170848.7q2qii2mm655eghw@wunner.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <20180910170848.7q2qii2mm655eghw@wunner.de> Sender: linux-pci-owner@vger.kernel.org List-ID: On Mon, Sep 10, 2018 at 07:08:48PM +0200, Lukas Wunner wrote: > On Mon, Sep 10, 2018 at 10:45:28AM -0600, Keith Busch wrote: > > On Mon, Sep 10, 2018 at 06:09:26PM +0200, Lukas Wunner wrote: > > > On Mon, Sep 10, 2018 at 08:56:42AM -0600, Keith Busch wrote: > > > > The sysfs entries still function. Their actions are only temporarily > > > > stalled during error handling. Once the slot reset is called, the > > > > ctrl->pending_events is queried to take requested actions. > > > > > > Okay I see. Still, releasing the IRQ and requesting it again seems fairly > > > heavy-wheight. Why not just acquire ctrl->reset_lock? > > > > That was looking like a nice way to handle it, but it introduces > > circular locking between ctrl->reset_lock and pci_bus_sem: > > > > CPU A CPU B > > --------------------------------- ------------------------ > > pci_walk_bus pciehp_ist > > down_read(&pci_bus_sem) down_read(&ctrl->reset_lock); > > pcie_portdrv_error_detected pciehp_handle_presence_or_link_change > > pciehp_error_detected pciehp_unconfigure_device > > down_write(&ctrl->reset_lock) pci_stop_and_remove_bus_devicea > > down_write(&pci_bus_sem); > > Why is pciehp bringing down the slot? Is that in reaction to a > Link Down caused by the error? This could just be something that happens if a hotplug and error event occur about the same time. > Can this be solved with Sinan's > approach to check in pciehp whether PCI_EXP_DEVSTA_FED is set > and if so, waiting for it to be handled? That only helps if the downstream port detected a fatal error, but error handling happens for any device reported error. > FWIW you can use synchronize_irq() in pciehp_error_detected() > if you need to wait for the IRQ thread to stop before taking > the reset_lock. That would just be a different dead lock: pciehp_error_detected is holding a read lock on pci_bus_sem, and irq thread may request a write lock, so both threads are dead locked on each other.