Date: Mon, 1 Oct 2018 09:14:51 -0600
From: Keith Busch
To: Bjorn Helgaas
Cc: Linux PCI, Bjorn Helgaas, Benjamin Herrenschmidt, Sinan Kaya, Thomas Tai, poza@codeaurora.org, Lukas Wunner, Christoph Hellwig, Mika Westerberg
Subject: Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
Message-ID:
<20181001151450.GB22508@localhost.localdomain>
References: <20180920162717.31066-1-keith.busch@intel.com>
 <20180920162717.31066-9-keith.busch@intel.com>
 <20180926220116.GJ28024@bhelgaas-glaptop.roam.corp.google.com>
 <20180926221924.GA17934@localhost.localdomain>
 <20180927225625.GB18434@bhelgaas-glaptop.roam.corp.google.com>
 <20180928154220.GA21996@localhost.localdomain>
 <20180928205034.GA119911@bhelgaas-glaptop.roam.corp.google.com>
 <20180928213523.GA22508@localhost.localdomain>
 <20180928232801.GB119911@bhelgaas-glaptop.roam.corp.google.com>
In-Reply-To: <20180928232801.GB119911@bhelgaas-glaptop.roam.corp.google.com>
X-Mailing-List: linux-pci@vger.kernel.org

On Fri, Sep 28, 2018 at 06:28:02PM -0500, Bjorn Helgaas wrote:
> On Fri, Sep 28, 2018 at 03:35:23PM -0600, Keith Busch wrote:
> > The assumption I'm making (which I think is a safe assumption with
> > general consensus) is that errors detected on an end point or an upstream
> > port happened because of something wrong with the link going upstream:
> > end devices have no other option,
>
> Is this really true? It looks like "Internal Errors" (sec 6.2.9) may
> be unrelated to a packet or event (though they are supposed to be
> associated with a PCIe interface).
>
> It says the only method of recovering is reset or hardware
> replacement. It doesn't specify, but it seems like a FLR or similar
> reset might be appropriate and we may not have to reset the link.

That is an interesting case we might want to handle better. I have a
couple of concerns to consider before implementing it:

We don't know an ERR_FATAL occurred for an internal reason until we read
the config register across the link, and the AER driver has historically
avoided accessing potentially unhealthy links.
I don't *think* it's harmful to attempt reading the register, but we'd
need to check for an "all 1's" completion before trusting the result.

The other issue with trying to use FLR is that a device may not implement
it, so pci reset has fallback methods depending on the device's
capabilities. We can end up calling pci_parent_bus_reset(), which does the
same secondary bus reset that already happens as part of error recovery.
We'd just need to make sure affected devices and drivers have a chance to
be notified (which is this patch's intention).

> Getting back to the changelog, "error handling can only run on
> bridges" clearly doesn't refer to the driver callbacks (since those
> only apply to endpoints). Maybe "error handling" refers to the
> reset_link(), which can only be done on a bridge?

Yep, that refers to how reset_link is only sent from bridges.

> That would make sense to me, although the current code may be
> resetting more devices than necessary if Internal Errors can be
> handled without a link reset.

That sounds good, I'll test some scenarios out here.
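For illustration only, the "all 1's" check and the Internal Error case
discussed above could be sketched in plain user-space C roughly as follows.
This is not the AER driver's code: classify_fatal(), read_is_valid() and
the recovery_hint enum are hypothetical names, and the status value is
passed in as an argument rather than read from config space. The one detail
taken from the kernel headers is the Uncorrectable Internal Error status
bit, 0x00400000 (PCI_ERR_UNC_INTN in pci_regs.h).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bit 22 of the AER Uncorrectable Error Status register: Uncorrectable
 * Internal Error (PCI_ERR_UNC_INTN in the kernel's pci_regs.h).
 */
#define UNC_INTERNAL_ERROR 0x00400000u

/* Hypothetical result type, used only by this sketch. */
enum recovery_hint {
	HINT_LINK_RESET,     /* status unreadable, or a link-related error */
	HINT_FUNCTION_RESET, /* internal error only: FLR-style reset may do */
};

/* A config read that gets no valid completion reads back as all 1's. */
static bool read_is_valid(uint32_t val)
{
	return val != 0xffffffffu;
}

/*
 * Decide what kind of reset to attempt from the uncorrectable status
 * value. Assumption: any bit other than Internal Error means the link
 * may be involved, so fall back to the link reset in that case.
 */
static enum recovery_hint classify_fatal(uint32_t uncor_status)
{
	if (!read_is_valid(uncor_status))
		return HINT_LINK_RESET;
	if (uncor_status == UNC_INTERNAL_ERROR)
		return HINT_FUNCTION_RESET;
	return HINT_LINK_RESET;
}
```

With this sketch, classify_fatal(0xffffffff) falls back to the link reset
because the read cannot be trusted, while classify_fatal(0x00400000)
suggests a function-level reset. Treating every other status bit as link
related is a deliberately conservative simplification.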