From mboxrd@z Thu Jan 1 00:00:00 1970
From: mark.rutland@arm.com (Mark Rutland)
Date: Wed, 18 Nov 2015 18:04:35 +0000
Subject: [PATCH] [PATCH] arm64: Boot failure on m400 with new cont PTEs
In-Reply-To: <564CB1DA.4090304@arm.com>
References: <1447858999-26665-1-git-send-email-jeremy.linton@arm.com>
 <20151118152044.GD10644@leverpostej> <564CA29A.9050905@arm.com>
 <20151118162932.GA13355@leverpostej> <564CB1DA.4090304@arm.com>
Message-ID: <20151118180434.GB13355@leverpostej>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

> >We also need to figure out what's happening with the code as it is.
>
> Well, I'm suspect what is happening is that there are conflicting
> TLB's hanging around, one for a cont range that is overlapping a stale
> non cont one. This sort of implies that this has been happening all
> along, AKA RO regions were being "lazy" activated if you will. Its
> only on a core that aborts when it detects that (which i assume
> requires differing size entries for this core) does it cause problems.

I suspect that we may have believed that the TLB maintenance at the end
of paging_init was sufficient, as we evidently did not consider TLB
conflicts.

> The break-before-make issue, seems like it won't cause a big problem
> here as long as there is some way to assure valid TLBs before the
> update, and then assure they are cleared following it.

If at any instant in time you have a valid TLB entry for an address, and
the page tables hold a value that would give rise to a conflicting TLB
entry, you can encounter a TLB conflict abort. It's a race against the
hardware.

> Hence the overly aggressive change works because it
> flushes following every cont block update. Which would bother me
> more if the code were run more than once per boot (or in the future
> per module load/unload if someone gets around to updating the no
> execute reliably).

You're racing against other parts of the CPU (the page table walker(s),
I-caches, etc).
The flushing only minimises the window for a race, and does not prevent
the race from being possible.

Given that the envelope is constantly pushing forward w.r.t. how
aggressive CPUs may be in this area, we need to fix the issue by
reasoning against what the architecture guarantees us.

We're almost certainly going to have to revisit this code in future, and
sprinkling TLB maintenance over it will only make it harder to reason
about.

Mark.
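
For illustration only, the difference between "update, then flush" and
break-before-make can be sketched with a toy model in C. This is not
kernel code and nothing in it comes from the patch under discussion: the
names toy_mmu, toy_walk, toy_tlbi, update_then_flush and
break_before_make are all hypothetical, and the model assumes a single
address, a single page table slot, and a walker that may run between any
two steps. Barriers (DSB/ISB) and broadcast to other CPUs are left out
entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not kernel code): one address, one PTE slot. */
enum entry_kind { ENTRY_INVALID = 0, ENTRY_PAGE, ENTRY_CONT };

struct toy_mmu {
    enum entry_kind pte;   /* what the page table currently holds */
    bool tlb_has_page;     /* TLB caches a non-contiguous entry */
    bool tlb_has_cont;     /* TLB caches a contiguous-range entry */
};

/* The hardware walker may run at any time and caches any valid entry. */
static void toy_walk(struct toy_mmu *m)
{
    if (m->pte == ENTRY_PAGE)
        m->tlb_has_page = true;
    if (m->pte == ENTRY_CONT)
        m->tlb_has_cont = true;
}

/* Stand-in for TLB invalidation: drop every cached entry. */
static void toy_tlbi(struct toy_mmu *m)
{
    m->tlb_has_page = false;
    m->tlb_has_cont = false;
}

/* Two differently sized entries for one address model a TLB conflict. */
static bool toy_conflict(const struct toy_mmu *m)
{
    return m->tlb_has_page && m->tlb_has_cont;
}

/* Write the new entry first and flush afterwards. If the walker runs in
 * the window between the two steps, old and new entries coexist. */
static bool update_then_flush(struct toy_mmu *m, bool walker_races)
{
    bool conflict = false;

    m->pte = ENTRY_CONT;
    if (walker_races) {
        toy_walk(m);
        conflict = toy_conflict(m);
    }
    toy_tlbi(m);           /* narrows the window, does not close it */
    return conflict;
}

/* Break-before-make: invalidate the entry, flush, then write the new
 * entry. The walker is allowed to run after every step. */
static bool break_before_make(struct toy_mmu *m, bool walker_races)
{
    bool conflict = false;

    m->pte = ENTRY_INVALID;                 /* break */
    if (walker_races) {
        toy_walk(m);
        conflict |= toy_conflict(m);
    }
    toy_tlbi(m);
    assert(!m->tlb_has_page && !m->tlb_has_cont);
    if (walker_races) {
        toy_walk(m);                        /* nothing valid to cache */
        conflict |= toy_conflict(m);
    }
    m->pte = ENTRY_CONT;                    /* make */
    if (walker_races) {
        toy_walk(m);
        conflict |= toy_conflict(m);
    }
    return conflict;
}
```

Starting from a state where the table holds ENTRY_PAGE and the TLB has
cached it, update_then_flush() reports a conflict whenever the walker
runs inside the window, while break_before_make() does not, because at
no point do the tables hold a valid entry that conflicts with a cached
one.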