From mboxrd@z Thu Jan 1 00:00:00 1970 From: Will Deacon Subject: Re: [PATCH] iommu: arm-smmu: Set SCTLR.HUPCF bit Date: Mon, 26 Nov 2018 19:31:48 +0000 Message-ID: <20181126193147.GB534@arm.com> References: <20180927224609.19515-1-robdclark@gmail.com> <20181029191000.GD16739@arm.com> <20181113063225.GA3109@brain-police> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Return-path: Content-Disposition: inline In-Reply-To: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: freedreno-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org Sender: "Freedreno" To: Rob Clark Cc: Robin Murphy , "list-Y9sIeH5OGRo@public.gmane.org:IOMMU DRIVERS , Joerg Roedel , " , Linux Kernel Mailing List "list-Y9sIeH5OGRo@public.gmane.org:IOMMU DRIVERS , Joerg Roedel , " , linux-arm-msm , freedreno , "moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE" List-Id: linux-arm-msm@vger.kernel.org SGkgUm9iLAoKT24gVHVlLCBOb3YgMTMsIDIwMTggYXQgMDg6MTI6MzVBTSAtMDUwMCwgUm9iIENs YXJrIHdyb3RlOgo+IE9uIFR1ZSwgTm92IDEzLCAyMDE4IGF0IDE6MzIgQU0gV2lsbCBEZWFjb24g PHdpbGwuZGVhY29uQGFybS5jb20+IHdyb3RlOgo+ID4gT24gRnJpLCBOb3YgMDksIDIwMTggYXQg MDE6MDE6NTVQTSAtMDUwMCwgUm9iIENsYXJrIHdyb3RlOgo+ID4gPiBPbiBNb24sIE9jdCAyOSwg MjAxOCBhdCAzOjA5IFBNIFdpbGwgRGVhY29uIDx3aWxsLmRlYWNvbkBhcm0uY29tPiB3cm90ZToK PiA+ID4gPiBPbiBUaHUsIFNlcCAyNywgMjAxOCBhdCAwNjo0NjowN1BNIC0wNDAwLCBSb2IgQ2xh cmsgd3JvdGU6Cj4gPiA+ID4gPiBXZSBzZWVtIHRvIG5lZWQgdG8gc2V0IGVpdGhlciB0aGlzIG9y IENGQ0ZHIChzdGFsbCksIG90aGVyd2lzZSBncHUKPiA+ID4gPiA+IGZhdWx0cyB0cmlnZ2VyIHBy b2JsZW1zIHdpdGggb3RoZXIgaW4tZmxpZ2h0IHRyYW5zYWN0aW9ucyBmcm9tIHRoZQo+ID4gPiA+ ID4gR1BVIGNhdXNpbmcgQ1AgZXJyb3JzLCBldGMuCj4gPiA+ID4gPgo+ID4gPiA+ID4gSW4gdGhl IEFSTSBTTU1VIHNwZWMsIHRoZSAnSGl0IHVuZGVyIHByZXZpb3VzIGNvbnRleHQgZmF1bHQnIGJp dCBpcwo+ID4gPiA+ID4gZGVzY3JpYmVkIGFzOgo+ID4gPiA+ID4KPiA+ID4gPiA+ICAnMCcgLSBT dGFsbCBvciB0ZXJtaW5hdGUgc3Vic2VxdWVudCB0cmFuc2FjdGlvbnMgaW4gdGhlIHByZXNlbmNl Cj4gPiA+ID4gPiAgICAgICAgb2YgYW4gb3V0c3RhbmRpbmcgY29udGV4dCBmYXVsdAo+ID4gPiA+ ID4gICcxJyAtIFByb2Nlc3MgYWxsIHN1YnNlcXVlbnQgdHJhbnNhY3Rpb25zIGluZGVwZW5kZW50 bHkgb2YgYW55Cj4gPiA+ID4gPiAgICAgICAgb3V0c3RhbmRpbmcgY29udGV4dCBmYXVsdC4KPiA+ ID4gPiA+Cj4gPiA+ID4gPiBTaW5jZSB3ZSBkb24ndCBlbmFibGUgQ0ZDRkcgKHN0YWxsKSB0aGUg YmVoYXZpb3Igb2YgdGVybWluYXRpbmcKPiA+ID4gPiA+IG90aGVyIHRyYW5zYWN0aW9ucyBtYWtl cyBzZW5zZS4gIEFuZCBpcyBwcm9iYWJseSBub3Qgd2hhdCB3ZSB3YW50Cj4gPiA+ID4gPiAoYW5k IGRlZmluYXRlbHkgbm90IHdoYXQgd2Ugd2FudCBmb3IgR1BVKS4KPiA+ID4gPiA+Cj4gPiA+ID4g PiBTaWduZWQtb2ZmLWJ5OiBSb2IgQ2xhcmsgPHJvYmRjbGFya0BnbWFpbC5jb20+Cj4gPiA+ID4g PiAtLS0KPiA+ID4gPiA+IFNvIEkgaGl0IHRoaXMgaXNzdWUgYSBsb25nIHRpbWUgYmFjayBvbiA4 MjAgKG1zbTg5OTYpIGFuZCBhdCB0aGUKPiA+ID4gPiA+IHRpbWUgSSBzb2x2ZWQgaXQgd2l0aCBh IHBhdGNoIHRoYXQgZW5hYmxlZCBDRkNGRy4gIEFuZCBpdCByZXN1cmZhY2VkCj4gPiA+ID4gPiBt b3JlIHJlY2VudGx5IG9uIHNkbTg0NS4gIEJ1dCBhdCB0aGUgdGltZSBDRkNGRyB3YXMgcmVqZWN0 ZWQsIGlpcmMKPiA+ID4gPiA+IGJlY2F1c2Ugb2YgY29uY2VybiB0aGF0IGl0IHdvdWxkIGNhdXNl IHByb2JsZW1zIG9uIG90aGVyIG5vbi1xY29tCj4gPiA+ID4gPiBhcm0gc21tdSBpbXBsZW1lbnRh dGlvbnMuICBBbmQgSSB0aGluayBJIGZvcmdvdCB0byBzZW5kIHRoaXMgdmVyc2lvbgo+ID4gPiA+ ID4gb2YgdGhlIHNvbHV0aW9uLgo+ID4gPiA+ID4KPiA+ID4gPiA+IElmIGVuYWJsaW5nIEhVUENG IGlzIGFudGljaXBhdGVkIHRvIGNhdXNlIHByb2JsZW1zIG9uIG90aGVyIEFSTQo+ID4gPiA+ID4g U01NVSBpbXBsZW1lbnRhdGlvbnMsIEkgdGhpbmsgSSBjYW4gY29tZSB1cCB3aXRoIGEgdmFyaWFu dCBvZiB0aGlzCj4gPiA+ID4gPiBwYXRjaCB3aGljaCBjb25kaXRpb25hbGx5IGVuYWJsZXMgaXQg Zm9yIHNuYXBkcmFnb24uCj4gPiA+ID4gPgo+ID4gPiA+ID4gRWl0aGVyIHdheSwgSSdkIHJlYWxs eSBsaWtlIHRvIGdldCBzb21lIHZhcmlhbnQgb2YgdGhpcyBmaXggbWVyZ2VkCj4gPiA+ID4gPiAo YW5kIHByb2JhYmx5IGl0IHdvdWxkIGJlIGEgZ29vZCBpZGVhIGZvciBzdGFibGUga2VybmVsIGJy YW5jaGVzCj4gPiA+ID4gPiB0b28pLCBzaW5jZSBjdXJyZW50IGJlaGF2aW91ciB3aXRoIHRoZSBH UFUgbWVhbnMgZmF1bHRzIHR1cm4gaW50bwo+ID4gPiA+ID4gYSBmYW50YXN0aWMgY2FzY2FkZSBv ZiBmYWlsLgo+ID4gPiA+Cj4gPiA+ID4gQ2FuIHlvdSBkZXNjcmliZSBob3cgdGhpcyBmYW50YXN0 aWMgY2FzY2FkZSBvZiBmYWlsIGltcHJvdmVzIHdpdGggdGhpcwo+ID4gPiA+IHBhdGNoLCBwbGVh c2U/IElmIHlvdSdyZSBnZXR0aW5nIGNvbnRleHQgZmF1bHRzIHRoZW4gc29tZXRoaW5nIGhhcyBh bHJlYWR5Cj4gPiA+ID4gZ29uZSBob3JyaWJseSB3cm9uZywgc28gSSdtIHRyeWluZyB0byB3b3Jr IG91dCBob3cgdGhpcyBpbXByb3ZlcyB0aGluZ3MuCj4gPiA+ID4KPiA+ID4KPiA+ID4gVGhlcmUg YXJlIHBsZW50eSBvZiBjYXNlcyB3aGVyZSBnZXR0aW5nIGlvbW11IGZhdWx0cyB3aXRoIGEgR1BV IGlzCj4gPiA+ICJub3JtYWwiLCBvciBhdCBsZWFzdCBub3Qgc29tZXRoaW5nIHRoZSBrZXJuZWwg b3IgZXZlbiBHTCBkcml2ZXIgY2FuCj4gPiA+IGNvbnRyb2wuCj4gPgo+ID4gU3VjaCBhcz8gQWxs IHRoZSBtYWlubGluZSBkcml2ZXIgZG9lcyBpcyBwcmludCBhIGRpYWdub3N0aWMgYW5kIGNsZWFy IHRoZQo+ID4gZmF1bHQsIHdoaWNoIGRvZXNuJ3Qgc2VlbSBnZW5lcmFsbHkgdXNlZnVsLgo+IAo+ IGl0IGlzIHVzZWZ1bCB0byBkZWJ1ZyB0aGUgZmF1bHQgOy0pCj4gCj4gQWx0aG91Z2ggZXZlbnR1 YWxseSB3ZSdsbCB3YW50IHRvIGJlIGFibGUgdG8gZG8gbW9yZSB0aGFuIHRoYXQsIGxpa2UKPiBo YXZlIHRoZSBmYXVsdCB0cmlnZ2VyIGJyaW5naW5nIGluIHBhZ2VzIG9mIGEgbW1hcCdkIGZpbGUg YW5kIHRoYXQKPiBzb3J0IG9mIHRoaW5nLgoKUmlnaHQsIGFuZCBmZWVscyB2ZXJ5IHN0cmFuZ2Ug dG8gbWUgaWYgd2UgaGF2ZSB0aGlzIGJpdCBzZXQgYmVjYXVzZSBpdAptZWFucyB0aGF0IHlvdXIg ZmF1bHQgaXMgbm8gbG9uZ2VyIHN5bmNocm9ub3VzIGFuZCB0aGVyZWZvcmUgZGl2ZXJnZXMKZnJv bSB0aGUgZmF1bHQgbW9kZWwgdGhhdCB5b3UgZ2V0IGZyb20gdGhlIENQVSwgd2hlcmUgeW91IGNl cnRhaW5seQp3b3VsZG4ndCBleHBlY3Qgc3RvcmVzIGFwcGVhcmluZyBpbiB0aGUgcHJvZ3JhbSBh ZnRlciBhIGZhdWx0aW5nIGxvYWQgdG8KYmUgdmlzaWJsZSBpbiBtZW1vcnkuIEhvd2V2ZXIsIHRo aW5raW5nIGhhcmRlciBhYm91dCBpdCwgSSBzdXBwb3NlIHdlJ3JlCmFscmVhZHkgaW4gYSBzaXR1 YXRpb24gd2hlcmUgdGhlIHRyYW5zbGF0aW9ucyBhcmUgaGFuZGxlZCBvdXQgb2Ygb3JkZXIKaW4g dGhlIGFic2VuY2Ugb2YgYmFycmllcnMsIHNvIG1heWJlIGl0J3Mgbm90IHRoZSBlbmQgb2YgdGhl IHdvcmxkLgoKQ291bGQgeW91IGR1bXAgdGhlIEZTUiB2YWx1ZSB0aGF0IHlvdSBzZWUgaW4gdGhl IGZhdWx0IGhhbmRsZXIsIHBsZWFzZT8KRnJvbSBteSByZWFkaW5nIG9mIHRoZSBhcmNoaXRlY3R1 cmUgc3BlYywgYXMgbG9uZyBhcyBjbGVhciBhbGwgb2YgdGhlCmZhdWx0IGJpdHMgaW4gdGhlIGly cSBoYW5kbGVyLCB0aGVuIHlvdXIgbWFjaGluZSBzaG91bGRuJ3QgZGllIGxpa2UgaXQKZG9lcyB3 aXRoIEhVUENGRz1DRkNGRz0wLi4KCj4gPiA+IFdpdGggdGhpcyBwYXRjaCwgeW91IHN0aWxsIGdl dCB0aGUgaW9tbXUgZmF1bHQsIGJ1dCBpdCBkb2Vzbid0IGNhdXNlCj4gPiA+IHRoZSBncHUgdG8g Y3Jhc2guICBCdXQgd2l0aG91dCBpdCwgb3RoZXIgbWVtb3J5IGFjY2Vzc2VzIGluIGZsaWdodAo+ ID4gPiB3aGlsZSB0aGUgZmF1bHQgb2NjdXJzLCBsaWtlIHRoZSBHUFUgY29tbWFuZC1wcm9jZXNz b3IgcmVhZGluZyBmdXJ0aGVyCj4gPiA+IGFoZWFkIGluIHRoZSBjbWRzdHJlYW0gdG8gc2V0dXAg bmV4dCBkcmF3LCB3b3VsZCByZXR1cm4gemVybydzLAo+ID4gPiBjYXVzaW5nIHRoZSBHUFUgdG8g Y3Jhc2ggb3IgZ2V0IGludG8gYSBiYWQgc3RhdGUuCj4gPgo+ID4gSSBnZXQgdGhhdCBwYXJ0LCBi dXQgSSBkb24ndCB1bmRlcnN0YW5kIHdoeSB3ZSdyZSBzZWVpbmcgZmF1bHRzIGluIHRoZSBmaXJz dAo+ID4gcGxhY2UgYW5kIEkgd29ycnkgdGhhdCB0aGlzIHBhdGNoIGlzIGp1c3QgdGhlIHRpcCBv ZiB0aGUgaWNlYmVyZy4gSXQncyBhbHNvCj4gPiBub3QgY2xlYXIgdGhhdCBwcm9jZXNzaW5nIHN1 YnNlcXVlbnQgdHJhbnNhY3Rpb25zIGlzIGFsd2F5cyB0aGUgcmlnaHQgdGhpbmcKPiA+IHRvIGRv IGluIGEgd29ybGQgd2hlcmUgd2UgYWN0dWFsbHkgd2FudCB0byByZXBvcnQgKGFuZCBoYW5kbGUp IHN5bmNocm9ub3VzCj4gPiBmYXVsdHMgZnJvbSBkZXZpY2VzLgo+IAo+IFN1cmUsIGl0IGlzIGEg YnVnLi4gYnV0IGl0IGNhbiBiZSBhbiBhcHBsaWNhdGlvbiBidWcgdGhhdCBpcyBub3QKPiBzb21l dGhpbmcgdGhlIHVzZXJzcGFjZSBHTCBkcml2ZXIgb3Iga2VybmVsIGNvdWxkIGRvIGFueXRoaW5n IGFib3V0Lgo+IFdlIHNob3VsZG4ndCBsZXQgdGhpcyBraWxsIHRoZSBHUFUuICBJZiB0aGUgYXBw bGljYXRpb24gZGlkbid0IGhhdmUKPiB0aGlzIG11Y2ggY29udHJvbCwgd2Ugd291bGRuJ3QgbmVl ZCBhbiBJT01NVSBpbiB0aGUgZmlyc3QgcGxhY2VbMV0uCj4gV2l0aCBvcGVuY2wgY29tcHV0ZSwg dGhlIHVzZXJzcGFjZSBjb250cm9sbGVkIHNoYWRlciBoYXMgZnVsbCBibG93bgo+IHBvaW50ZXJz IHRvIEdQVSBtZW1vcnkuCj4gCj4gQW5kIGV2ZW4gaW4gY2FzZXMgd2hlcmUgaXQgaXMgYSB1c2Vy c3BhY2UgR0wgZHJpdmVyIGJ1ZywgaGF2aW5nIHNvbWUKPiByb2J1c3RuZXNzIHRvIG5vdCBjb21w bGV0ZWx5IGtpbGwgdGhlIEdQVSBtYWtlcyBkZWJ1Z2dpbmcgbXVjaCBlYXNpZXIuCj4gU29tZXRo aW5nIEkgZG8gYSBsb3Qgd2hlbiBicmluZ2luZyB1cCBzdXBwb3J0IGZvciBhIG5ldyBnZW5lcmF0 aW9uIG9mCj4gR1BVLgo+IAo+IEknbSBoYXZpbmcgYSBoYXJkIHRpbWUgdW5kZXJzdGFuZGluZyB5 b3VyIG9iamVjdGlvbiB0byB0aGlzLgo+IFJldHVybmluZyB6ZXJvJ3MgZm9yIG5vbi1mYXVsdGlu ZyB0cmFuc2FjdGlvbnMgaXMgYSAqcmVhbGx5IGJhZCBpZGVhKi4KCldhaXQgLS0gd2hvIHNhaWQg YW55dGhpbmcgYWJvdXQgcmV0dXJuaW5nIHplcm9lcz8gV2hlcmUgZG9lcyB0aGF0IGJlaGF2aW91 cgphcHBlYXIgaW4gdGhlIGFyY2hpdGVjdHVyZT8KCldpbGwKX19fX19fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX19fX19fX19fX18KRnJlZWRyZW5vIG1haWxpbmcgbGlzdApGcmVlZHJl bm9AbGlzdHMuZnJlZWRlc2t0b3Aub3JnCmh0dHBzOi8vbGlzdHMuZnJlZWRlc2t0b3Aub3JnL21h aWxtYW4vbGlzdGluZm8vZnJlZWRyZW5vCg== From mboxrd@z Thu Jan 1 00:00:00 1970 From: will.deacon@arm.com (Will Deacon) Date: Mon, 26 Nov 2018 19:31:48 +0000 Subject: [PATCH] iommu: arm-smmu: Set SCTLR.HUPCF bit In-Reply-To: References: <20180927224609.19515-1-robdclark@gmail.com> <20181029191000.GD16739@arm.com> <20181113063225.GA3109@brain-police> Message-ID: <20181126193147.GB534@arm.com> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org Hi Rob, On Tue, Nov 13, 2018 at 08:12:35AM -0500, Rob Clark wrote: > On Tue, Nov 13, 2018 at 1:32 AM Will Deacon wrote: > > On Fri, Nov 09, 2018 at 01:01:55PM -0500, Rob Clark wrote: > > > On Mon, Oct 29, 2018 at 3:09 PM Will Deacon wrote: > > > > On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote: > > > > > We seem to need to set either this or CFCFG (stall), otherwise gpu > > > > > faults trigger problems with other in-flight transactions from the > > > > > GPU causing CP errors, etc. > > > > > > > > > > In the ARM SMMU spec, the 'Hit under previous context fault' bit is > > > > > described as: > > > > > > > > > > '0' - Stall or terminate subsequent transactions in the presence > > > > > of an outstanding context fault > > > > > '1' - Process all subsequent transactions independently of any > > > > > outstanding context fault. > > > > > > > > > > Since we don't enable CFCFG (stall) the behavior of terminating > > > > > other transactions makes sense. And is probably not what we want > > > > > (and definately not what we want for GPU). > > > > > > > > > > Signed-off-by: Rob Clark > > > > > --- > > > > > So I hit this issue a long time back on 820 (msm8996) and at the > > > > > time I solved it with a patch that enabled CFCFG. And it resurfaced > > > > > more recently on sdm845. But at the time CFCFG was rejected, iirc > > > > > because of concern that it would cause problems on other non-qcom > > > > > arm smmu implementations. And I think I forgot to send this version > > > > > of the solution. > > > > > > > > > > If enabling HUPCF is anticipated to cause problems on other ARM > > > > > SMMU implementations, I think I can come up with a variant of this > > > > > patch which conditionally enables it for snapdragon. > > > > > > > > > > Either way, I'd really like to get some variant of this fix merged > > > > > (and probably it would be a good idea for stable kernel branches > > > > > too), since current behaviour with the GPU means faults turn into > > > > > a fantastic cascade of fail. > > > > > > > > Can you describe how this fantastic cascade of fail improves with this > > > > patch, please? If you're getting context faults then something has already > > > > gone horribly wrong, so I'm trying to work out how this improves things. > > > > > > > > > > There are plenty of cases where getting iommu faults with a GPU is > > > "normal", or at least not something the kernel or even GL driver can > > > control. > > > > Such as? All the mainline driver does is print a diagnostic and clear the > > fault, which doesn't seem generally useful. > > it is useful to debug the fault ;-) > > Although eventually we'll want to be able to do more than that, like > have the fault trigger bringing in pages of a mmap'd file and that > sort of thing. Right, and feels very strange to me if we have this bit set because it means that your fault is no longer synchronous and therefore diverges from the fault model that you get from the CPU, where you certainly wouldn't expect stores appearing in the program after a faulting load to be visible in memory. However, thinking harder about it, I suppose we're already in a situation where the translations are handled out of order in the absence of barriers, so maybe it's not the end of the world. Could you dump the FSR value that you see in the fault handler, please? >>From my reading of the architecture spec, as long as clear all of the fault bits in the irq handler, then your machine shouldn't die like it does with HUPCFG=CFCFG=0.. > > > With this patch, you still get the iommu fault, but it doesn't cause > > > the gpu to crash. But without it, other memory accesses in flight > > > while the fault occurs, like the GPU command-processor reading further > > > ahead in the cmdstream to setup next draw, would return zero's, > > > causing the GPU to crash or get into a bad state. > > > > I get that part, but I don't understand why we're seeing faults in the first > > place and I worry that this patch is just the tip of the iceberg. It's also > > not clear that processing subsequent transactions is always the right thing > > to do in a world where we actually want to report (and handle) synchronous > > faults from devices. > > Sure, it is a bug.. but it can be an application bug that is not > something the userspace GL driver or kernel could do anything about. > We shouldn't let this kill the GPU. If the application didn't have > this much control, we wouldn't need an IOMMU in the first place[1]. > With opencl compute, the userspace controlled shader has full blown > pointers to GPU memory. > > And even in cases where it is a userspace GL driver bug, having some > robustness to not completely kill the GPU makes debugging much easier. > Something I do a lot when bringing up support for a new generation of > GPU. > > I'm having a hard time understanding your objection to this. > Returning zero's for non-faulting transactions is a *really bad idea*. Wait -- who said anything about returning zeroes? Where does that behaviour appear in the architecture? Will From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED,USER_AGENT_MUTT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7743FC43441 for ; Mon, 26 Nov 2018 19:31:34 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id D6267205C9 for ; Mon, 26 Nov 2018 19:31:33 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D6267205C9 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=arm.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726985AbeK0G0j (ORCPT ); Tue, 27 Nov 2018 01:26:39 -0500 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:45946 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726630AbeK0G0i (ORCPT ); Tue, 27 Nov 2018 01:26:38 -0500 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 381AF1B55; Mon, 26 Nov 2018 11:31:31 -0800 (PST) Received: from edgewater-inn.cambridge.arm.com (usa-sjc-imap-foss1.foss.arm.com [10.72.51.249]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id 085B33F5AF; Mon, 26 Nov 2018 11:31:31 -0800 (PST) Received: by edgewater-inn.cambridge.arm.com (Postfix, from userid 1000) id 81AB61AE0839; Mon, 26 Nov 2018 19:31:48 +0000 (GMT) Date: Mon, 26 Nov 2018 19:31:48 +0000 From: Will Deacon To: Rob Clark Cc: "list@263.net:IOMMU DRIVERS , Joerg Roedel ," , Robin Murphy , "moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE" , freedreno , linux-arm-msm , "list@263.net:IOMMU DRIVERS , Joerg Roedel ," , Linux Kernel Mailing List Subject: Re: [PATCH] iommu: arm-smmu: Set SCTLR.HUPCF bit Message-ID: <20181126193147.GB534@arm.com> References: <20180927224609.19515-1-robdclark@gmail.com> <20181029191000.GD16739@arm.com> <20181113063225.GA3109@brain-police> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Rob, On Tue, Nov 13, 2018 at 08:12:35AM -0500, Rob Clark wrote: > On Tue, Nov 13, 2018 at 1:32 AM Will Deacon wrote: > > On Fri, Nov 09, 2018 at 01:01:55PM -0500, Rob Clark wrote: > > > On Mon, Oct 29, 2018 at 3:09 PM Will Deacon wrote: > > > > On Thu, Sep 27, 2018 at 06:46:07PM -0400, Rob Clark wrote: > > > > > We seem to need to set either this or CFCFG (stall), otherwise gpu > > > > > faults trigger problems with other in-flight transactions from the > > > > > GPU causing CP errors, etc. > > > > > > > > > > In the ARM SMMU spec, the 'Hit under previous context fault' bit is > > > > > described as: > > > > > > > > > > '0' - Stall or terminate subsequent transactions in the presence > > > > > of an outstanding context fault > > > > > '1' - Process all subsequent transactions independently of any > > > > > outstanding context fault. > > > > > > > > > > Since we don't enable CFCFG (stall) the behavior of terminating > > > > > other transactions makes sense. And is probably not what we want > > > > > (and definately not what we want for GPU). > > > > > > > > > > Signed-off-by: Rob Clark > > > > > --- > > > > > So I hit this issue a long time back on 820 (msm8996) and at the > > > > > time I solved it with a patch that enabled CFCFG. And it resurfaced > > > > > more recently on sdm845. But at the time CFCFG was rejected, iirc > > > > > because of concern that it would cause problems on other non-qcom > > > > > arm smmu implementations. And I think I forgot to send this version > > > > > of the solution. > > > > > > > > > > If enabling HUPCF is anticipated to cause problems on other ARM > > > > > SMMU implementations, I think I can come up with a variant of this > > > > > patch which conditionally enables it for snapdragon. > > > > > > > > > > Either way, I'd really like to get some variant of this fix merged > > > > > (and probably it would be a good idea for stable kernel branches > > > > > too), since current behaviour with the GPU means faults turn into > > > > > a fantastic cascade of fail. > > > > > > > > Can you describe how this fantastic cascade of fail improves with this > > > > patch, please? If you're getting context faults then something has already > > > > gone horribly wrong, so I'm trying to work out how this improves things. > > > > > > > > > > There are plenty of cases where getting iommu faults with a GPU is > > > "normal", or at least not something the kernel or even GL driver can > > > control. > > > > Such as? All the mainline driver does is print a diagnostic and clear the > > fault, which doesn't seem generally useful. > > it is useful to debug the fault ;-) > > Although eventually we'll want to be able to do more than that, like > have the fault trigger bringing in pages of a mmap'd file and that > sort of thing. Right, and feels very strange to me if we have this bit set because it means that your fault is no longer synchronous and therefore diverges from the fault model that you get from the CPU, where you certainly wouldn't expect stores appearing in the program after a faulting load to be visible in memory. However, thinking harder about it, I suppose we're already in a situation where the translations are handled out of order in the absence of barriers, so maybe it's not the end of the world. Could you dump the FSR value that you see in the fault handler, please? >From my reading of the architecture spec, as long as clear all of the fault bits in the irq handler, then your machine shouldn't die like it does with HUPCFG=CFCFG=0.. > > > With this patch, you still get the iommu fault, but it doesn't cause > > > the gpu to crash. But without it, other memory accesses in flight > > > while the fault occurs, like the GPU command-processor reading further > > > ahead in the cmdstream to setup next draw, would return zero's, > > > causing the GPU to crash or get into a bad state. > > > > I get that part, but I don't understand why we're seeing faults in the first > > place and I worry that this patch is just the tip of the iceberg. It's also > > not clear that processing subsequent transactions is always the right thing > > to do in a world where we actually want to report (and handle) synchronous > > faults from devices. > > Sure, it is a bug.. but it can be an application bug that is not > something the userspace GL driver or kernel could do anything about. > We shouldn't let this kill the GPU. If the application didn't have > this much control, we wouldn't need an IOMMU in the first place[1]. > With opencl compute, the userspace controlled shader has full blown > pointers to GPU memory. > > And even in cases where it is a userspace GL driver bug, having some > robustness to not completely kill the GPU makes debugging much easier. > Something I do a lot when bringing up support for a new generation of > GPU. > > I'm having a hard time understanding your objection to this. > Returning zero's for non-faulting transactions is a *really bad idea*. Wait -- who said anything about returning zeroes? Where does that behaviour appear in the architecture? Will