From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexander Graf
Date: Wed, 24 Feb 2016 11:19:41 +0100
Subject: [U-Boot] [PATCH 0/9] arm64: Unify MMU code
In-Reply-To:
References: <1456106232-233210-1-git-send-email-agraf@suse.de>
 <14FCFFD6-8691-43C6-AD6A-DAB535E61586@suse.de>
 <0AD586D5-1E50-4B33-B77D-24A42761515A@suse.de>
 <56CB6493.1000804@suse.de>
 <56CB6B05.5050400@suse.de>
Message-ID: <56CD83BD.7060909@suse.de>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: u-boot@lists.denx.de

On 22.02.16 21:15, york sun wrote:
> On 02/22/2016 12:09 PM, Alexander Graf wrote:
>>
>>
>> On 22.02.16 20:52, york sun wrote:
>>> On 02/22/2016 11:42 AM, Alexander Graf wrote:
>>>>
>>>>
>>>> On 22.02.16 19:39, york sun wrote:
>>>>> On 02/22/2016 10:31 AM, Alexander Graf wrote:
>>>>>>
>>>>>> On Feb 22, 2016, at 7:12 PM, york sun wrote:
>>>>>>
>>>>>>> On 02/22/2016 10:02 AM, Alexander Graf wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 22.02.2016 at 18:37, york sun wrote:
>>>>>>>>>
>>>>>>>>>> On 02/21/2016 05:57 PM, Alexander Graf wrote:
>>>>>>>>>> Howdy,
>>>>>>>>>>
>>>>>>>>>> Currently on arm64 there is a big pile of mess when it comes to MMU
>>>>>>>>>> support and page tables. Each board does its own little thing and the
>>>>>>>>>> generic code is pretty dumb and nobody actually uses it.
>>>>>>>>>>
>>>>>>>>>> This patch set tries to clean that up. After this series is applied,
>>>>>>>>>> all boards except for the FSL Layerscape ones are converted to the
>>>>>>>>>> new generic page table logic and have icache+dcache enabled.
>>>>>>>>>>
>>>>>>>>>> The new code always uses 4k page size. It dynamically allocates 1G or
>>>>>>>>>> 2M pages for ranges that fit. When a dcache attribute request comes in
>>>>>>>>>> that requires a smaller granularity than our previous allocation could
>>>>>>>>>> fulfill, pages get automatically split.
>>>>>>>>>>
>>>>>>>>>> I have tested and verified the code works on HiKey (bare metal),
>>>>>>>>>> vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is
>>>>>>>>>> untested, but given the simplicity of the maps I doubt it'll break.
>>>>>>>>>> ThunderX in theory should also work, but I haven't tested it. I would
>>>>>>>>>> be very happy if people with access to those systems could give the patch
>>>>>>>>>> set a try.
>>>>>>>>>>
>>>>>>>>>> With this we're a big step closer to a good base line for EFI payload
>>>>>>>>>> support, since we can now just require that all boards always have dcache
>>>>>>>>>> enabled.
>>>>>>>>>>
>>>>>>>>>> I would also be incredibly happy if some Freescale people could look
>>>>>>>>>> at their MMU code and try to unify it into the now cleaned up generic
>>>>>>>>>> code. I don't think we're far off here.
>>>>>>>>>
>>>>>>>>> Alex,
>>>>>>>>>
>>>>>>>>> Unified MMU will be great for all of us. The reason we started with our own MMU
>>>>>>>>> table was size and performance. I don't know much about other ARMv8 SoCs. For
>>>>>>>>> our use, we enable cache very early to speed up running, especially for
>>>>>>>>> pre-silicon development on emulators. We don't have DDR to use for the early
>>>>>>>>> stage and we have very limited on-chip SRAM. I believe we can use the unified
>>>>>>>>> structure for our 2nd stage MMU when DDR is up.
>>>>>>>>
>>>>>>>> Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
>>>>>>>
>>>>>>> What's the size for the MMU tables? I think it may be simpler to use static
>>>>>>> tables for our early stage.
>>>>>>
>>>>>> The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
>>>>>
>>>>> That's the part I can't live with. Since we have very limited on-chip RAM, we
>>>>> have to know limit the size. But again, I do see the benefit to use unified
>>>>> structure for the 2nd stage.
>>>>
>>>> I'm not quite sure I see how your current code works any differently.
>>>> While the code to determine the page table pool size is dynamic, the
>>>> outcome is static depending on your memory map. So the same memory map
>>>> always means the same page table pool size.
>>>>
>>>> We could also just hard code the size for the early phase for you I guess.
>>>
>>> We can definitely try.
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unlike other boards, and I didn't fully understand why.
>>>>>>>>
>>>>>>> True. We have some complication on the address mapping. For compatibility, each
>>>>>>> device is mapped (partially) under 32-bit space. If the device is too large to
>>>>>>
>>>>>> Compatibility with what? Do we really need this in an AArch64 world?
>>>>>
>>>>> It's not up to me. The SoC was designed this way. By the way, this SoC can work
>>>>> in AArch32 mode.
>>>>
>>>> I think I'm slowly grasping what the problem is.
>>>>
>>>> The fact that the SoC can run in AArch32 mode doesn't actually make a
>>>> difference here though, since we're talking about U-Boot internal memory
>>>> maps. The only reason to keep things mapped reachable from 32bits is if
>>>> you want to run 32bit code with the U-Boot maps. I don't think you'd
>>>> want to do that, no? :)
>>>
>>> I don't really want to run 32-bit code. My point is the SoC was designed that
>>> way. We have DDR under 32-bit space, and in high region. We have the same for
>>> flash controller where NOR is connected. Explained later below.
>>>>
>>>>>
>>>>>>
>>>>>> For 32bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64bit world (which this target really is, no?) I see little benefit on it.
>>>>>>
>>>>>>> fit, the rest is mapped to high regions. I remember one particular case on top
>>>>>>> of my head. It is the NOR flash we use for environmental variables. U-boot uses
>>>>>>> that address for saving, but also uses that for loading during booting. For our
>>>>>>> case, the NOR flash doesn't fit well in the low region, so it is remapped to
>>>>>>> high region after booting. To make the environmental variables accessible during
>>>>>>> boot, we mapped the high region phys with different virt, so u-boot doesn't have
>>>>>>> to know the low region address.
>>>>>>
>>>>>> I might be missing the obvious, but why can't the environmental variables live in high regions?
>>>>>>
>>>>>
>>>>> It is in high region. But as I tried to explain, the default physical mapping of
>>>>> NOR flash (not MMU) is in low region out of reset.
>>>>
>>>> I see. So the problem is during the transitioning phase from uncached to
>>>> MMU enabled, where we'd end up at a different address.
>>>
>>> Not exactly. We enable cache very early for performance boost on emulator. It
>>> may sound trivial but it makes big difference when debugging software on
>>> emulators. Since we still use emulators for new product, I am not ready to drop
>>> the early MMU approach.
>>
>> I'm surprised it is that slow for you. Running the Foundation model
>> (which doesn't do early mmu FWIW) seemed to be fast enough.
>
> Foundation model is a simulator, not an emulator. Our emulator runs on hardware.
> It is much much slower than simulator, but more accurate on lower level.

Ah, I remember the confusion in terminology from the PPC times :).

>
>>
>>> But you get the idea, the difference is before and after relocation. After
>>> u-boot relocates itself into DDR, we remap flash controller physical address to
>>> high region.
>>>
>>>>
>>>> Could we just configure NOR to be in high memory in early asm init code,
>>>> then always use the high physical NOR address range and jump to it from
>>>> asm very early on? Then we could ignore the 32bit map and everything
>>>> could just stay 1:1 mapped.
>>>>
>>>
>>> Out of reset, if booting from NOR flash, the flash controller is pre-configured
>>> to use low region address. We can only reprogram the controller when u-boot is
>>> not running on it.
>>
>> I see, so you keep the low map alive until you make the switch-over to
>> DDR. Makes a lot of sense.
>>
>> I guess I can give the conversion another stab now whenever I get a free
>> night :). If I understand you correctly we'd only need to do non-1:1
>> maps for the early code, right?
>
> So far, yes. But we don't want to block ourselves from using non-1:1 mapping
> down the road, do we?

We're not blocking ourselves at all if we stick to the verbose struct
definition. We can just add a va field later on and default to 1:1 if
it's not set.

I've also reworked the page table pool size calculation now, so it can
properly determine the required size without much RAM overhead at the
expense of a few cycles. If it's too slow, you can always override it in
your machine file with a constant value.


Alex
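
To make the "add a va field later and default to 1:1" idea from the reply above concrete, here is a minimal sketch. It is not code from the patch set: struct mem_region, has_va, region_va(), map_virt_to_phys() and the addresses in example_map are all hypothetical names and values chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical region descriptor. The names are illustrative and are NOT
 * taken from the patch set.
 */
struct mem_region {
	uint64_t pa;     /* physical base address */
	uint64_t size;   /* length of the region in bytes */
	uint64_t va;     /* virtual base address, honoured only if has_va */
	bool has_va;     /* false (the zero default) selects a 1:1 mapping */
};

/* Virtual base of a region: the explicit va if set, otherwise pa (1:1). */
static uint64_t region_va(const struct mem_region *r)
{
	assert(r != NULL);
	return r->has_va ? r->va : r->pa;
}

/*
 * Translate a virtual address through the map.
 * Returns true and stores the physical address in *pa_out on a hit.
 */
static bool map_virt_to_phys(const struct mem_region *map, size_t count,
			     uint64_t va, uint64_t *pa_out)
{
	for (size_t i = 0; i < count; i++) {
		uint64_t base = region_va(&map[i]);

		if (va >= base && va - base < map[i].size) {
			*pa_out = map[i].pa + (va - base);
			return true;
		}
	}
	return false;
}

/* Made-up addresses: DDR mapped 1:1, NOR at a high PA aliased at a low VA. */
static const struct mem_region example_map[] = {
	{ .pa = 0x80000000ULL, .size = 0x40000000ULL },
	{ .pa = 0x580000000ULL, .size = 0x08000000ULL,
	  .va = 0x30000000ULL, .has_va = true },
};
```

Entries that never mention .va keep their 1:1 behaviour, because designated initializers zero every omitted field. The sketch uses a separate has_va flag rather than treating va == 0 as "not set" so that virtual address 0 stays expressible; either convention would fit the description in the mail.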