From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B6B62C43441 for ; Tue, 13 Nov 2018 14:40:42 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 74FB4223AE for ; Tue, 13 Nov 2018 14:40:42 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 74FB4223AE Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=arm.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-pci-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731338AbeKNAjH (ORCPT ); Tue, 13 Nov 2018 19:39:07 -0500 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:57274 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726846AbeKNAjH (ORCPT ); Tue, 13 Nov 2018 19:39:07 -0500 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id E858CA78; Tue, 13 Nov 2018 06:40:40 -0800 (PST) Received: from big-swifty.misterjones.org (usa-sjc-mx-foss1.foss.arm.com [217.140.101.70]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 8929D3F5BD; Tue, 13 Nov 2018 06:40:40 -0800 (PST) Date: Tue, 13 Nov 2018 14:40:39 +0000 Message-ID: <86h8gldorc.wl-marc.zyngier@arm.com> From: Marc Zyngier To: Gustavo Pimentel Cc: Trent Piepho , "lorenzo.pieralisi@arm.com" , "jpinto@synopsys.com" , "jingoohan1@gmail.com" , "faiz_abbas@ti.com" , "stable@vger.kernel.org" , "linux-pci@vger.kernel.org" , "bhelgaas@google.com" , "vigneshr@ti.com" Subject: Re: [PATCH] PCI: dwc: Fix interrupt race in when handling MSI In-Reply-To: References: <20181027000028.21343-1-tpiepho@impinj.com> <20181106145347.GB19060@e107981-ln.cambridge.arm.com> <1541533217.30311.263.camel@impinj.com> <597d9ebd-f95f-0a4d-e1a3-fe79d4333879@arm.com> <1541621853.30311.294.camel@impinj.com> <268eae88-274e-edfe-5668-5759efae62e6@arm.com> <1541706591.30311.308.camel@impinj.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM/1.14.9 (=?UTF-8?B?R29qxY0=?=) APEL/10.8 EasyPG/1.0.0 Emacs/25.1 (aarch64-unknown-linux-gnu) MULE/6.0 (HANACHIRUSATO) Organization: ARM Ltd MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=US-ASCII Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org On Tue, 13 Nov 2018 00:41:24 +0000, Gustavo Pimentel wrote: > > On 09/11/2018 11:34, Marc Zyngier wrote: > > On 08/11/18 19:49, Trent Piepho wrote: > >> On Thu, 2018-11-08 at 09:49 +0000, Marc Zyngier wrote: > >>> On 07/11/18 20:17, Trent Piepho wrote: > >>>> On Wed, 2018-11-07 at 18:41 +0000, Marc Zyngier wrote: > >>>>> On 06/11/18 19:40, Trent Piepho wrote: > >>>>>> > >>>>>> What about stable kernels that don't have the hierarchical API? > >>>>> > >>>>> My goal is to fix mainline first. Once we have something that works on > >>>>> mainline, we can look at propagating the fix to other versions. But > >>>>> mainline always comes first. > >>>> > >>>> This is a regression that went into 4.14. Wouldn't the appropriate > >>>> action for the stable series be to undo the regression? > >>> > >>> This is not how stable works. Stable kernels *only* contain patches that > >>> are backported from mainline, and do not take standalone patch. > >>> > >>> Furthermore, your fix is to actually undo someone else's fix. Who is > >>> right? In the absence of any documentation, the answer is "nobody". > >> > >> Little more history to this bug. The code was originally the way it is > >> now, but this same bug was fixed in 2013 in https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.kernel.or&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=3nFZMUc2qvOXJht1WuNCDorudHeFiWyeDaO1UbhTXmw&e= > >> g/patch/3333681/ > >> > >> Then that lasted four years until it was changed Aug 2017 in https://urldefense.proofpoint.com/v2/url?u=https-3A__pa&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=yfWJAoFIGjLbTnuE_KLzhW7J_pyMD3X1dXm4eHoy7_o&e= > >> tchwork.kernel.org/patch/9893303/ > >> > >> That lasted just six months until someone tried to revert it, https://urldefense.proofpoint.com/v2/url?u=https-3A__p&d=DwICaQ&c=DPL6_X_6JkXFx7AXWqB0tg&r=bkWxpLoW-f-E3EdiDCCa0_h0PicsViasSlvIpzZvPxs&m=Tk2dtPxJ5WvHjQbWWx35od-JZmur8mRuDrSSVhf8nYs&s=q1LGEb9w_CjcwmWVFOM5VWJ17oS9XHd8SsWTsBF8zfU&e= > >> atchwork.kernel.org/patch/9893303/ > >> > >> Seems pretty clear the way it is now is much worse than the way it was > >> before, even if the previous design may have had another flaw. Though > >> I've yet to see anyone point out something makes the previous design > >> broken. Sub-optimal yes, but not broken. > > > > This is not what was reported by the previous submitter. I guess they > > changed this for a reason, no? I'm prepared to admit this is a end-point > > driver bug, but we need to understand what is happening and stop > > changing this driver randomly. > > > >>> Anything can be backported to stable once we understand the issue. At > >>> the moment, we're just playing games moving stuff around and hope > >>> nothing else will break. That's not a sustainable way of maintaining > >>> this driver. At the moment, the only patch I'm inclined to propose until > >>> we get an actual interrupt handling flow from Synopsys is to mark this > >>> driver as "BROKEN". > >> > >> It feels like you're using this bug to hold designware hostage in a > >> broken kernel, and me along with them. I don't have the documentation, > >> no one does, there's no way for me to give you want you want. But I've > >> got hardware that doesn't work in the mainline kernel. > > > > I take it as a personal offence that I'd be holding anything or anyone > > hostage. I think I have a long enough track record working with the > > Linux kernel not to take any of this nonsense. What's my interest in > > keeping anything in this sorry state? Think about it for a minute. > > > > When I'm asking for documentation, I'm not asking you. I perfectly > > understood that you don't have any. We need Synopsys to step up and give > > us a simple description of what the MSI workflow is. Because no matter > > how you randomly mutate this code, it will still be broken until we get > > a clear answer from *Synopsys*. > > > > Gustavo, here's one simple ask. Please reply to this email with a step > > by step, register by register description of how an MSI must be handled > > on this HW. We do need it, and we need it now. > > Hi Marc, I thought that I did respond to your question on Nov 8. However I will > repeat it and add some extra information to it now. > > About the registers: > > PCIE_MSI_INTR0_MASK > When an MSI is received for a masked interrupt, the corresponding status bit > gets set in the interrupt status register but the controller will not signal it. > As soon as the masked interrupt is unmasked and assuming the status bit is still > set the controller will signal it. So this is *really* a mask. > PCIE_MSI_INTR0_ENABLE > Enables/disables every interrupt. When an MSI is received from a disabled > interrupt, no status bit gets set in MSI controller interrupt status register. And that's not a mask at all. This allows the interrupt to be simply lost. It is just a bit worrying that this is exactly what the DWC driver is using, and faces all kind of interrupt loss when enabling/disabling interrupts (which do have a mask/unmask semantic -- yes, the language is confusing). > > About this new callbacks: > > The dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks were added on > Linux kernel v4.17 on PCI: dwc: Move MSI IRQs allocation to IRQ domains > hierarchical API patch [1] and based on what I've seen so far, before this patch > there were no interrupt masking been performed. > > Based on the information provided, its very likely that I should use the > PCIE_MSI_INTR0_MASK register instead of PCIE_MSI_INTR0_ENABLE on the > dw_pci_bottom_mask() and dw_pci_bottom_unmask() callbacks. As soon as I return > from Linux plumbers conference, I will test the system using the > PCIE_MSI_INTR0_MASK register. I'm at Plumbers as well, so feel free to grab me and we'll sort this out. We've wasted enough time on this broken driver. Thanks, M. -- Jazz is not dead, it just smell funny.