From: Benjamin Herrenschmidt
To: Jason Gunthorpe
Cc: Alexey Kardashevskiy, Leon Romanovsky, linux-rdma@vger.kernel.org,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org, sbest@redhat.com,
    saeedm@mellanox.com, alex.williamson@redhat.com, paulus@samba.org,
    linux-pci@vger.kernel.org, bhelgaas@google.com, ogerlitz@mellanox.com,
    David Gibson, linuxppc-dev@lists.ozlabs.org, davem@davemloft.net,
    tariqt@mellanox.com
Subject: Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
Date: Wed, 09 Jan 2019 16:09:02 +1100
Message-ID: <012d24d58a542ed44c8af9f517f1bd61ab912037.camel@kernel.crashing.org>
In-Reply-To: <20190108040129.GE5336@ziepe.ca>
References: <20181206041951.22413-1-david@gibson.dropbear.id.au>
    <20181206064509.GM15544@mtr-leonro.mtl.com>
    <20190104034401.GA2801@umbus.fritz.box>
    <20190105175116.GB14238@ziepe.ca>
    <20190108040129.GE5336@ziepe.ca>

On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
>
> > In a very cryptic way that requires manual parsing using non-public
> > docs sadly but yes. From the look of it, it's a completion timeout.
> >
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the D3
> > state itself or the read back though (it's probably detected on the
> > read back or a subsequent read, but that doesn't tell me which specific
> > one failed).
>
> If it is just one card doing it (again, check you have latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle?
> Can you tell if the PCI-E link is
> permanently gone or does it eventually return?

No, it's 100% reproducible on systems with that specific card model,
not just one card instance, and maybe on different systems/cards as
well. I'll let David & Alexey comment further on that.

> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?

Nope.

> Everytime or sometimes?
>
> POWER 8 firmware is good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
>
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from
> the card side.

We are IBM :-)

So far, it seems that the card is doing something not quite right, but
we don't know what. We might need to engage Mellanox themselves.

Cheers,
Ben.