From mboxrd@z Thu Jan  1 00:00:00 1970
From: robin.murphy@arm.com (Robin Murphy)
Date: Fri, 18 May 2018 12:59:53 +0100
Subject: AArch64 memory
Message-ID: <6f34d5bb-3581-93c3-583b-347e75acf3bf@arm.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

Hi Tim,

On 17/05/18 16:58, Tim Harvey wrote:
> Greetings,
>
> I'm trying to understand some details of the AArch64 memory
> configuration in the kernel.
>
> I've looked at Documentation/arm64/memory.txt which describes the
> virtual memory layout used in terms of translation levels. This
> relates to CONFIG_ARM64_{4K,16K,64K}_PAGES and CONFIG_ARM64_VA_BITS*.
>
> My first question has to do with virtual memory layout: what are the
> advantages and disadvantages for a system with a fixed 2GB of DRAM
> when using 4KB pages + 3 levels (CONFIG_ARM64_4K_PAGES=y,
> CONFIG_ARM64_VA_BITS=39), resulting in 512GB user / 512GB kernel, vs
> 64KB pages + 3 levels (CONFIG_ARM64_64K_PAGES=y,
> CONFIG_ARM64_VA_BITS=48)? The physical memory is far less than what
> any of the combinations would offer, but I'm not clear whether the
> number of levels affects any sort of performance or how fragmentation
> could play into performance.

There have been a number of discussions on the lists about the general
topic in the context of several architectures, and I'm sure the last one
I saw regarding arm64 actually had some measurements in it, although
it's proving remarkably tricky to dig up again this morning :/

I think the rough executive summary remains that for certain
memory-intensive workloads on AArch64, 64K pages *can* give a notable
performance benefit in terms of reduced TLB pressure (and potentially
also some reduction in TLB miss overhead with a 42-bit VA and 2-level
tables). The (major) tradeoff is that for most other workloads,
including much of the kernel itself, the increased allocation
granularity leads to a significant increase in wasted RAM.
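As a back-of-the-envelope illustration of where those level counts come
from (a sketch, not actual arm64 code: it just applies the rule that a
one-page table of 8-byte entries resolves page_shift - 3 bits of VA,
with the low page_shift bits being the in-page offset):

```shell
#!/bin/sh
# Rough translation-level arithmetic for the configs discussed above.
# Illustrative only; the real definitions live in the arm64 arch code.
levels() {
    va_bits=$1
    page_shift=$2
    bits_per_level=$((page_shift - 3))
    # Ceiling division: tables needed to translate the remaining VA bits.
    echo $(( (va_bits - page_shift + bits_per_level - 1) / bits_per_level ))
}

levels 39 12    # 4K pages,  39-bit VA -> 3 levels
levels 48 16    # 64K pages, 48-bit VA -> 3 levels
levels 42 16    # 64K pages, 42-bit VA -> 2 levels
```

which is why 64K pages with a 42-bit VA get away with one fewer walk
level than the other two configurations.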
My gut feeling is that if you have relatively limited RAM and don't know
otherwise, then 39-bit VA is probably the way to go - notably, there are
also still drivers/filesystems/etc. which don't play too well with
PAGE_SIZE != 4096 - but I'm by no means an expert in this area. If
you're targeting a particular application area (e.g. networking) and can
benchmark some representative workloads to look at performance vs. RAM
usage for different configs, that would probably help inform your
decision the most.

> My second question has to do with CMA and coherent_pool. I have
> understood CMA as being a chunk of physical memory carved out by the
> kernel for allocations from dma_alloc_coherent by drivers that need
> chunks of contiguous memory for DMA buffers. I believe that before CMA
> was introduced we had to do this by defining memory holes. I'm not
> understanding the difference between CMA and the coherent pool. I've
> noticed that if CONFIG_DMA_CMA=y then the coherent pool allocates from
> CMA. Is there some disadvantage of CONFIG_DMA_CMA=y other than, if
> enabled, you need to make sure your CMA is larger than coherent_pool?
> What drivers/calls use coherent_pool vs. CMA?

coherent_pool is a special thing which exists for the sake of
non-hardware-coherent devices: normally, for those, we satisfy
DMA-coherent allocations by setting up a non-cacheable remap of the
allocated buffer - see dma_common_contiguous_remap(). However, drivers
may call dma_alloc_coherent(..., GFP_ATOMIC) from interrupt handlers, at
which point we can't call get_vm_area() to remap on demand, since that
might sleep, so we reserve a pool pre-mapped as non-cacheable from which
to satisfy such atomic allocations. I'm not sure why its user-visible
name is "coherent pool" rather than the more descriptive "atomic pool"
by which it's known internally, but there's probably some history there.
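For reference, both areas are sized from the kernel command line; a
hedged illustration (the sizes below are made-up examples, not
recommendations for any particular system):

```shell
# Illustrative bootargs fragment:
#
#   coherent_pool=2M   pre-mapped non-cacheable atomic pool described above
#   cma=64M            global CMA area; with CONFIG_DMA_CMA=y the atomic
#                      pool is carved out of it, so keep cma > coherent_pool
#
# On a running kernel built with CONFIG_CMA, overall CMA usage shows up
# in /proc/meminfo as CmaTotal/CmaFree:
grep -i '^cma' /proc/meminfo || true
```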
If you're lucky enough not to have any non-coherent DMA masters then you
can safely ignore the whole thing; otherwise it's still generally rare
that it should need adjusting.

CMA is, as you surmise, a much more general thing for providing large
physically-contiguous areas, which the arch code correspondingly uses to
get DMA-contiguous buffers. Unless all your DMA masters are behind
IOMMUs (such that we can make any motley collection of pages look
DMA-contiguous), you probably don't want to turn it off.

None of these details should be relevant as far as drivers are
concerned, since from their viewpoint it's all abstracted behind
dma_alloc_coherent().

Robin.