From: Lluís Vilanova
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] outlined TLB lookup on x86
Date: Wed, 27 Nov 2013 14:12:58 +0100
Message-ID: <8761rehtt1.fsf@fimbulvetr.bsc.es>
In-Reply-To: (Xin Tong's message of "Wed, 27 Nov 2013 16:41:27 +0900")

Xin Tong writes:

> I am trying to implement an out-of-line TLB lookup for QEMU softmmu-x86-64
> on an x86-64 machine, potentially for better instruction cache performance.
> I have a few questions.

> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
> generated when tcg_out_tb_finalize is called. When a TLB lookup misses, it
> jumps to the generated slow path; the slow path refills the TLB, then does
> the load/store and jumps to the next emulated instruction. I am wondering
> whether it is easy to outline the code for the slow path. I am thinking that
> when the TLB misses, the outlined TLB lookup code should generate a call out
> to qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk the TLB after it is
> refilled? This code is off the critical path, so it is not as important as
> the code for a TLB hit.

> 2. Why not use a TLB of bigger size? Currently the TLB has 1<<8 entries. The
> TLB lookup is 10 x86 instructions, but every miss needs ~450 instructions; I
> measured this using Intel PIN. So even if the miss rate is low (say 3%), the
> overall time spent in cpu_x86_handle_mmu_fault is still significant. I am
> thinking the TLB may need to be organized in a set-associative fashion to
> reduce conflict misses, e.g. 2-way set associative. Or have a victim TLB
> that is 4-way associative and use x86 SIMD instructions to do the lookup
> once the direct-mapped TLB misses. Has anybody done any work on this front?

> 3. What are some of the drawbacks of using a very large TLB, i.e. a TLB with
> 4K entries?

Using vector intrinsics for the TLB lookup will probably make the code less
portable. I don't know how compatible the GCC and LLVM vector intrinsics are
with each other (there have been some efforts to make QEMU also compile with
LLVM).

A larger TLB will make some operations slower (e.g., look for CPU_TLB_SIZE in
cputlb.c), but the higher hit ratio could pay off, although I don't know how
the current size was chosen.

Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
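
For reference, the victim-TLB idea from question 2 can be sketched in portable
C, without any SIMD. This is only an illustration under stated assumptions, not
QEMU's code: the names SoftTLB, TLBEntry, tlb_lookup and tlb_fill, the 4 KiB
page size, the round-robin replacement and the plain scalar scan of the victim
entries are all made up for the example. Only the 1<<8 main-table size and the
4-entry victim table come from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names and sizes below are hypothetical, chosen for illustration. */
#define PAGE_BITS   12                      /* assumed 4 KiB pages */
#define PAGE_MASK   (~(((uint64_t)1 << PAGE_BITS) - 1))
#define MAIN_BITS   8                       /* 1 << 8 entries, as in the thread */
#define MAIN_SIZE   (1u << MAIN_BITS)
#define VICTIM_SIZE 4u                      /* small fully associative victim TLB */
#define INVALID_TAG UINT64_MAX              /* not page aligned, so never matches */

typedef struct {
    uint64_t tag;     /* page-aligned guest virtual address */
    uint64_t addend;  /* stand-in for whatever the translation yields */
} TLBEntry;

typedef struct {
    TLBEntry main_tlb[MAIN_SIZE];   /* direct mapped */
    TLBEntry victim[VICTIM_SIZE];   /* holds entries evicted from main_tlb */
    unsigned next_victim;           /* round-robin replacement cursor */
} SoftTLB;

static unsigned tlb_index(uint64_t vaddr)
{
    return (unsigned)((vaddr >> PAGE_BITS) & (MAIN_SIZE - 1));
}

static void tlb_init(SoftTLB *t)
{
    assert((MAIN_SIZE & (MAIN_SIZE - 1)) == 0);   /* index mask needs a power of two */
    for (unsigned i = 0; i < MAIN_SIZE; i++) {
        t->main_tlb[i].tag = INVALID_TAG;
        t->main_tlb[i].addend = 0;
    }
    for (unsigned i = 0; i < VICTIM_SIZE; i++) {
        t->victim[i].tag = INVALID_TAG;
        t->victim[i].addend = 0;
    }
    t->next_victim = 0;
}

/* Fast path: one compare in the direct-mapped table.
 * Second chance: scan the victim table and swap a hit back into main_tlb.
 * Returns false when both miss, i.e. the caller must take the slow path. */
static bool tlb_lookup(SoftTLB *t, uint64_t vaddr, uint64_t *addend)
{
    uint64_t tag = vaddr & PAGE_MASK;
    TLBEntry *e = &t->main_tlb[tlb_index(vaddr)];

    if (e->tag == tag) {
        *addend = e->addend;
        return true;
    }
    for (unsigned i = 0; i < VICTIM_SIZE; i++) {
        if (t->victim[i].tag == tag) {
            TLBEntry tmp = *e;          /* swap so the next access hits fast */
            *e = t->victim[i];
            t->victim[i] = tmp;
            *addend = e->addend;
            return true;
        }
    }
    return false;
}

/* Slow-path refill: the displaced main entry moves to the victim table. */
static void tlb_fill(SoftTLB *t, uint64_t vaddr, uint64_t addend)
{
    TLBEntry *e = &t->main_tlb[tlb_index(vaddr)];

    if (e->tag != INVALID_TAG) {
        t->victim[t->next_victim] = *e;
        t->next_victim = (t->next_victim + 1) % VICTIM_SIZE;
    }
    e->tag = vaddr & PAGE_MASK;
    e->addend = addend;
}
```

Two pages whose addresses differ by MAIN_SIZE << PAGE_BITS map to the same
direct-mapped slot; after filling both, a lookup of the first one still hits
through the victim table instead of taking the ~450-instruction miss path. The
scalar loop over four entries keeps the code portable across compilers, at the
cost of a few extra compares on that second-chance path.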