From mboxrd@z Thu Jan  1 00:00:00 1970
From: mark.rutland@arm.com (Mark Rutland)
Date: Thu, 19 Nov 2015 11:31:34 +0000
Subject: [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
In-Reply-To: <564CD206.9040402@arm.com>
References: <1447858999-26665-1-git-send-email-jeremy.linton@arm.com>
 <20151118152044.GD10644@leverpostej> <564CA29A.9050905@arm.com>
 <20151118162932.GA13355@leverpostej> <564CB1DA.4090304@arm.com>
 <20151118180434.GB13355@leverpostej> <564CD206.9040402@arm.com>
Message-ID: <20151119112923.GA24570@leverpostej>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Wed, Nov 18, 2015 at 01:31:18PM -0600, Jeremy Linton wrote:
> On 11/18/2015 12:04 PM, Mark Rutland wrote:
> 
> >You're racing against other parts of the CPU (the page table walker(s),
> >I-caches, etc). The flushing only minimises the window for a race, and
> >does not prevent the race from being possible.
> >
> >Given that the envelope is constantly pushing forward w.r.t. how
> >aggressive CPUs may be in this area, we need to fix the issue by
> >reasoning against what the architecture guarantees us.
> 	Its also not suppose to fault on speculative access, and to me that
> means page table walks/etc that are the result of speculative
> access.

I was under the impression that TLB conflict aborts could be delivered
for asynchronous events (e.g. speculative I-cache fetches rather than
for speculative execution of already fetched instructions).

Having looked at the ARM ARM, I appear to have been mistaken. As you
say, it appears that TLB conflict aborts are always delivered
synchronously.

> Which AFAIK, closes the window significantly. I would only
> really worry about interrupt activity, and updates to the memory
> containing the PTE's themselves. Either way the simple change
> (rather than rewriting the whole code path) is probably to flag the
> fault handler to simply resume from these kinds of faults during
> create_mapping_late().

Unfortunately that may not be sufficient. The conflicting address range
might cover the current stack or the text of the exception handler, and
in those cases trying to handle the exception would result in taking
another TLB conflict abort recursively.

> 	But that isn't what is happening here AFAIK, the faults are long
> after the PTE's have been updated, and are the result of failure to
> flush the TLB..

It's not quite the same case, certainly.

It's also possible that the faults you are seeing could occur earlier,
but are simply less likely, and that we get away without seeing the
other potential issues because of things that may change (e.g. the way
the compiler lays out the text).

If we need to do something more drastic to account for the other issues
above (e.g. ensuring that we can never allocate conflicting TLB entries
in the first place), and that strategy would also fix this problem, then
I think it would be preferable, given that we're going to have to do it
eventually anyway.
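
To illustrate what "never allocating conflicting entries" could look
like, here is a toy user-space model. It is not kernel code: every name
in it (toy_mmu, toy_walk, and so on) is hypothetical, and it assumes a
break-before-make style update (invalidate the entry, flush, then write
the new entry) as one possible strategy. The "walker" is allowed to
cache the live table entry at every step, which is the pessimistic
assumption we have to make about real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TLB_SLOTS 4

/* One translation covering [va, va + size). size == 0 means invalid. */
struct toy_map {
	uint64_t va;
	uint64_t size;
};

struct toy_mmu {
	struct toy_map table;			/* the single live table entry */
	struct toy_map tlb[TOY_TLB_SLOTS];	/* cached copies of it */
};

/* Model a walk that may happen at any time: cache the live entry. */
static void toy_walk(struct toy_mmu *m)
{
	if (!m->table.size)
		return;		/* invalid entries are never cached */

	for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
		if (m->tlb[i].va == m->table.va &&
		    m->tlb[i].size == m->table.size)
			return;	/* already cached */
	}
	for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
		if (!m->tlb[i].size) {
			m->tlb[i] = m->table;
			return;
		}
	}
}

static void toy_tlb_flush(struct toy_mmu *m)
{
	for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
		m->tlb[i].size = 0;
}

/* A conflict is two valid TLB entries whose ranges overlap. */
static bool toy_tlb_conflict(const struct toy_mmu *m)
{
	for (size_t i = 0; i < TOY_TLB_SLOTS; i++) {
		for (size_t j = i + 1; j < TOY_TLB_SLOTS; j++) {
			const struct toy_map *a = &m->tlb[i];
			const struct toy_map *b = &m->tlb[j];

			if (a->size && b->size &&
			    a->va < b->va + b->size &&
			    b->va < a->va + a->size)
				return true;
		}
	}
	return false;
}

/* Rewrite the live entry in place; returns true if a conflict arose. */
static bool toy_update_in_place(struct toy_mmu *m, struct toy_map next)
{
	assert(next.size);
	toy_walk(m);		/* old entry may already be cached */
	m->table = next;
	toy_walk(m);		/* new entry cached alongside the old one */
	bool conflict = toy_tlb_conflict(m);
	toy_tlb_flush(m);	/* too late, the window was already open */
	return conflict;
}

/* Invalidate, flush, then write; returns true if a conflict arose. */
static bool toy_update_bbm(struct toy_mmu *m, struct toy_map next)
{
	bool conflict = false;

	assert(next.size);
	toy_walk(m);
	m->table.size = 0;	/* break: nothing new can be cached */
	toy_walk(m);
	conflict |= toy_tlb_conflict(m);
	toy_tlb_flush(m);	/* drop any stale copy of the old entry */
	toy_walk(m);
	m->table = next;	/* make: only the new entry is visible */
	toy_walk(m);
	conflict |= toy_tlb_conflict(m);
	return conflict;
}
```

Starting from a 4K entry at 0x1000 and replacing it with a 64K entry at
0x0, toy_update_in_place() reports a conflict in this model, while
toy_update_bbm() does not. The model deliberately ignores the hard part
mentioned above: the range being changed may cover the stack or text of
the code performing the update, in which case that code cannot simply
invalidate the entry while running from it.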

Mark.