I'm new to this list and likely ignorant of past discussions, so please bear with me :-)

I'm bringing up a new core, one that's heavily pipelined/speculative/out-of-order, all that good stuff, and I've reached the point where Linux is coming up. Most of my issues are my own, but there's one I've come across in the tip-of-tree RISC-V Linux that I think is more general: it's this code in relocate() in head.S. To paraphrase, the code is:

    relocate:
        li      a1, PAGE_OFFSET
        la      a2, _start
        sub     a1, a1, a2      // a1 is the relocation offset
        ....
        la      a2, 1f
        add     a2, a2, a1
        csrw    CSR_TVEC, a2    // vector is 1f, relocated
        ....
        la      a0, trampoline_pg_dir
        srl     a0, a0, PAGE_SHIFT
        or      a0, a0, a1
        sfence.vma
        csrw    CSR_SATP, a0
    .align 2
    1:
        ....

In my world this fails miserably, mostly because the sfence.vma does a pipe flush (as it should), and by the time the csrw CSR_SATP, a0 is executed the pipeline has already fetched (using the old, turned-off MMU mapping) and speculatively executed much of the code up to the following return instruction.

What I think the code is expecting is that the instruction following the write to CSR_SATP will fault and refetch the instruction stream using the new mapping. That likely works on some microarchitectures, and it also probably works by happenstance on some systems where there happens to be an invalid instruction hiding under the ".align 2".

Reading the RISC-V privileged spec, it is very explicit about "csrw CSR_SATP, a0":

"Note that writing satp does not imply any ordering constraints between page-table updates and subsequent address translations. If the new address space's page tables have been modified, or if an ASID is reused, it may be necessary to execute an SFENCE.VMA instruction (see Section 4.2.1) after writing satp."

Section 4.2.1 includes the note:

"A consequence of this specification is that an implementation may use any translation for an address that was valid at any time since the most recent SFENCE.VMA that subsumes that address.
In particular, if a leaf PTE is modified but a subsuming SFENCE.VMA is not executed, either the old translation or the new translation will be used, but the choice is unpredictable. The behavior is otherwise well-defined."

What does this mean? It means that if you SFENCE.VMA and then subsequently write to satp, it is undefined whether the new page-table regime is in place for an arbitrary number of instructions thereafter. That number can be quite large if you are turning on the MMU for the first time, because some larger systems may have hundreds of decoded instructions in flight at a time; in some versions of my current system it can be ~100, though in this particular case it's more likely on the order of 10-12 instructions that manage to pass the instruction TLB between when the sfence is executed and when the satp is written.

In general, I think that for RISC-V MMU code to work we always need to sfence AFTER every write to satp or to the page tables (and, as the spec says, it needs to be for an 'enclosing range'), AND there needs to be a mapping in place in the MMU configuration, both before and after the write to satp, that validly maps the virtual addresses of the code fragment between the write to satp and the sfence instruction.

This last requirement is normally not an issue in the Linux kernel, since all the code is mapped with one big mapping that doesn't change .... except, of course, when you first turn on the MMU and are switching from no MMU to a running MMU, which is the situation where I started this discussion.

-------------------------------------------------------------------------------------------------------------------

So, a proposal: rather than use the 'trampoline' code, which only works on some systems, we should use an initial kernel mapping that maps both the kernel virtual addresses and also maps the initial memory 1:1.
If we do that, then the actual initial switch becomes simple (see the attached code fragment). The other required change is in setup_vm(): instead of making a 'trampoline' mapping and an initial kernel mapping, we just make an initial kernel mapping that also contains a 1:1 mapping for the initially loaded kernel.

Anyway, this has gone on too long; hopefully the right people will read it and understand. As I mentioned above, I'm a noob here (but I've been a kernel hacker since V6, and have been laying gates for almost as long).

	Paul Campbell
	Moonbase Otago
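P.S. To be concrete about the switch itself, the sequence I have in mind is roughly the following sketch (illustrative only, not the attached fragment; it assumes a0 already holds the satp value for the combined mapping, a1 still holds the relocation offset computed earlier in relocate(), and the PC is inside the region that is identity-mapped both before and after the write):

        csrw    CSR_SATP, a0    // install the combined kernel + 1:1 root
        sfence.vma              // order the satp write against later fetches
        // anything speculatively fetched between the csrw and the sfence
        // was identity-mapped under both regimes, so it is harmless
        la      a0, 1f
        add     a0, a0, a1      // relocate 1f to its kernel virtual address
        jr      a0              // jump into the kernel's virtual mapping
    1: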