I'm new to this list and likely ignorant of past discussions, so please bear with me :-)

I'm bringing up a new core, one that's heavily pipelined/speculative/out-of-order, all that good stuff, and I've reached the point where Linux is coming up. Most of my issues are my own, but there's one I've come across in the tip-of-tree RISC-V Linux that I think is more general: it's this code in relocate() in head.S. To paraphrase, the code is:

    relocate:
        li      a1, PAGE_OFFSET
        la      a2, _start
        sub     a1, a1, a2      // a1 is the relocation offset
        ....
        la      a2, 1f
        add     a2, a2, a1
        csrw    CSR_TVEC, a2    // vector is 1f, relocated
        ....
        la      a0, trampoline_pg_dir
        srl     a0, a0, PAGE_SHIFT
        or      a0, a0, a1
        sfence.vma
        csrw    CSR_SATP, a0
    .align 2
    1:
        ....

In my world this fails miserably, mostly because the sfence.vma does a pipe flush (as it should), and by the time the csrw CSR_SATP, a0 is executed the pipeline has already fetched (using the old, turned-off MMU mapping) and speculatively executed much of the code up to the following return instruction.

What I think the code is expecting is that the instruction following the write to CSR_SATP will fault and refetch the instruction stream using the new mapping. That likely works on some microarchitectures, and it also probably works by happenstance on some systems where there happens to be an invalid instruction hiding under the ".align 2".

Reading the RISC-V privileged spec, it is very explicit about "csrw CSR_SATP, a0":

"Note that writing satp does not imply any ordering constraints between page-table updates and subsequent address translations. If the new address space's page tables have been modified, or if an ASID is reused, it may be necessary to execute an SFENCE.VMA instruction (see Section 4.2.1) after writing satp."

Section 4.2.1 includes the note:

"A consequence of this specification is that an implementation may use any translation for an address that was valid at any time since the most recent SFENCE.VMA that subsumes that address.
In particular, if a leaf PTE is modified but a subsuming SFENCE.VMA is not executed, either the old translation or the new translation will be used, but the choice is unpredictable. The behavior is otherwise well-defined."

What does this mean? It means that if you SFENCE.VMA and then subsequently write to satp, it is undefined whether the new page-table regime is in place for an arbitrary number of instructions thereafter. That number can be quite large if you are turning on the MMU for the first time, because some larger systems may have hundreds of decoded instructions in flight at a time; in some versions of my current system it can be ~100, though in this particular case it's more likely on the order of 10-12 instructions that manage to pass the instruction TLB between when the sfence is executed and when the satp is written.

In general, I think that for RISC-V MMU code to work we always need to sfence AFTER every write to satp or to the page tables (and, as the spec says, it needs to be for an 'enclosing range'), AND there needs to be a mapping in place in the MMU configuration, both before and after the write to satp, that validly maps the virtual addresses of the code fragment between the write to satp and the sfence instruction.

This last requirement is normally not an issue in the Linux kernel, since all the code is mapped with one big mapping that doesn't change .... except, of course, when you first turn on the MMU and are switching from no MMU to a running MMU, which is the situation where I started this discussion.

-------------------------------------------------------------------------------------------------------------------

So, a proposal: rather than use the 'trampoline' code, which only works on some systems, we should use an initial kernel mapping that maps both the kernel virtual addresses and also maps the initial memory 1:1.
If we do that, then the actual initial switch becomes simple (see the attached code fragment). The other required change is in setup_vm(): instead of making a 'trampoline' mapping and an initial kernel mapping, we just make an initial kernel mapping that also contains a 1:1 mapping for the initially loaded kernel.

Anyway, this has gone on too long; hopefully the right people will read it and understand. As I mentioned above, I'm a noob here (but I've been a kernel hacker since V6, and have been laying gates for almost as long).

	Paul Campbell
	Moonbase Otago
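P.S. To be concrete about the switch itself, the sequence I have in mind is roughly the following sketch (illustrative only, not the attached fragment; it assumes a0 already holds the satp value for the combined mapping, a1 still holds the relocation offset computed earlier in relocate(), and the PC is inside the region that is identity-mapped both before and after the write):

        csrw    CSR_SATP, a0    // install the combined kernel + 1:1 root
        sfence.vma              // order the satp write against later fetches
        // anything speculatively fetched between the csrw and the sfence
        // was identity-mapped under both regimes, so it is harmless
        la      a0, 1f
        add     a0, a0, a1      // relocate 1f to its kernel virtual address
        jr      a0              // jump into the kernel's virtual mapping
    1: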