From: Alex Bennée
Date: Mon, 04 Apr 2016 09:54:14 +0100
Message-ID: <87y48tu53d.fsf@linaro.org>
Subject: [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?
To: QEMU Developers, MTTCG Devel
Cc: Peter Maydell, Sergey Fedorov, Richard Henderson, Emilio G. Cota <cota@braap.org>, Paolo Bonzini

Hi,

While reviewing the recent TB patching cleanup patches I wondered if there is a cleaner way of handling TB invalidation.

Currently we have a single code generation buffer which slowly fills up with TBs as we execute code. These TBs are chained together if they exist in the same physical page (so we always exit to the run-loop when crossing a page boundary). We hold a bunch of extra information in the TBs to facilitate looking things up. We have:

  struct TranslationBlock *phys_hash_next;

to facilitate looking up TBs which have matching hashes in the physical address lookup. We also have:

  uintptr_t jmp_list_next[2];
  uintptr_t jmp_list_first;

which are used for unwinding any jump patching should we invalidate a page, so that no code is left jumping to potentially invalid translations.

We also have a number of associated jump caches, held against each CPU, which are used to optimise re-entry into generated code as we go round the main run-loop. These also have to be cleanly invalidated as TBs are marked invalid.

Finally, as TBs are generated on demand, the actual code may not be locally jump-able, which makes atomic patching of the jumps trickier to do.

TB invalidation is almost always due to page mapping changes, although self-modifying code and debugging are also causes for throwing away translations.

I'm wondering if it is time to add a layer of indirection to simplify the process. If we introduce a TranslationRegion, which could initially cover a page's worth of code, it would have its own code generation buffer protected by RCU, making it easy to swap out code buffers in a clean manner:

Normal Execution (cpu_exec):
  - lookup TranslationRegion
  - take RCU read lock
  - lookup-or-generate TB
  - jump into code
  - exit TB
  - release RCU read lock

Invalidation of Page:
  - lookup TranslationRegion
  - take RCU write lock
  - create fresh empty region
  - signal cpu_exit to all vCPUs
  - release RCU write lock
  - take RCU read lock
  - lookup-or-generate TB
  - jump into code
  - exit TB
  - release RCU read lock*

* when the last vCPU releases the read lock on the old code it can be cleanly thrown away. No fiddly jump patching required.
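To make the shape of this more concrete, here is a rough sketch of what the read side could look like using the RCU primitives we already have (rcu_read_lock/unlock, atomic_rcu_read). The TranslationRegion structure and region_lookup_or_generate are of course invented names for illustration, and cpu_tb_exec stands in for however we actually enter the generated code:

  /* Hypothetical per-region structure; the fields are illustrative only. */
  typedef struct TranslationRegion {
      tb_page_addr_t base;        /* start of the covered range */
      size_t size;                /* initially TARGET_PAGE_SIZE */
      void *code_gen_buffer;      /* region-private codegen buffer */
      struct rcu_head rcu;        /* lets stale regions be reclaimed */
  } TranslationRegion;

  /* Normal execution: hold the RCU read lock across the TB lookup and
   * the time spent inside generated code, so a concurrent invalidation
   * can never free the code buffer under our feet. */
  static void region_exec(CPUState *cpu, TranslationRegion **slot)
  {
      rcu_read_lock();
      TranslationRegion *region = atomic_rcu_read(slot);
      TranslationBlock *tb = region_lookup_or_generate(region, cpu);
      cpu_tb_exec(cpu, tb);       /* jump into code, run until TB exit */
      rcu_read_unlock();
  }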
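The invalidation side then becomes a pointer swap plus a deferred free. Again this is only a sketch, with region_alloc/region_free as invented names, but atomic_rcu_set, CPU_FOREACH, cpu_exit and call_rcu are all things we have today:

  /* Invalidation: publish a fresh empty region, kick every vCPU back to
   * the run-loop, and let RCU reclaim the old code buffer once the last
   * reader has dropped its read lock. No jump patching required. */
  static void region_invalidate(TranslationRegion **slot)
  {
      TranslationRegion *old = *slot;   /* caller holds the write side */
      TranslationRegion *fresh = region_alloc(old->base, old->size);

      atomic_rcu_set(slot, fresh);      /* new lookups see the empty region */

      CPUState *cpu;
      CPU_FOREACH(cpu) {
          cpu_exit(cpu);                /* force each vCPU out of old code */
      }

      /* region_free() runs only once every reader is done with 'old'. */
      call_rcu(old, region_free, rcu);
  }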
There are some potential optimisations that could be made to this scheme as well. For one, there is no reason the area covered by a TranslationRegion has to be a page: the kernel segment, for example, will never change once mapped, so all of its internal TBs could still be chained together. Jump patching would also become easier on backends with limited jump ranges, as local code is kept together in a shared code buffer.

I'm sure there is also scope for localising the jump cache to regions, as there are likely to be only a few entry points to any given page, with the rest all being internal branches for loops and conditionals.

The only problem I can currently think of is potential heap fragmentation caused by having a large number of code buffers. This could probably be ameliorated by using custom allocation routines for the code buffers.

I'm going to have a bit of a play to see what this sort of solution would look like in the code, but I thought I'd sketch the idea out first to see if there are any obvious glaring holes or other things to consider.

Thoughts, objections? Discuss ;-)

--
Alex Bennée