From: Alex Bennée
Date: Thu, 23 Jan 2014 11:23:04 +0000
Message-ID: <87k3drc57r.fsf@linaro.org>
Subject: Re: [Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB
To: Xin Tong
Cc: Stefan Hajnoczi, QEMU Developers, aliguori@amazon.com, afaerber@suse.de

trent.tong@gmail.com writes:

> This patch adds a victim TLB to the QEMU system mode TLB.
>
> QEMU system mode page table walks are expensive. Measured by running
> qemu-system-x86_64 in system mode under Intel PIN, a TLB miss followed
> by a walk of the 4-level page tables in a guest Linux OS takes ~450 x86
> instructions on average.
>
> Attached are some performance results taken on the SPECINT2006 train
> dataset and an Intel(R) Xeon(R) CPU E5620 @ 2.40GHz Linux machine. In
> summary, the victim TLB improves the performance of qemu-system-x86_64
> by 11% on average on SPECINT2006, with the highest improvement of 254%
> in 464.h264ref. The victim TLB does not result in any performance
> degradation in any of the measured benchmarks. Furthermore, the
> implemented victim TLB is architecture independent and is expected to
> benefit other architectures in QEMU as well.
>
> Although there are measurement fluctuations, the performance
> improvements are very significant and by no means in the range of
> noise.

I'm curious, as the implication seems to be that entries are evicted from
the initial TLB lookup before they are "done". What would the impact be of
simply growing the size of the main TLB cache?

What's the current state of instrumentation around the system TLB
handling? Can we trace the hit rates of the various caches with
perf/oprofile/whatever (Stefan?)?

-- 
Alex Bennée
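
To make the eviction question concrete, here is a minimal toy sketch of the general victim-TLB idea with per-level hit counters. It is not the code from the patch and not QEMU's TLB implementation: the names (ToyTLB, toy_tlb_access), the sizes, and the round-robin replacement are all assumptions chosen for illustration. A direct-mapped main table is backed by a small fully associative victim buffer, and a "walk" is counted whenever both miss.

```c
#include <assert.h>
#include <stdint.h>

/* All names and sizes below are hypothetical, chosen only for illustration. */
#define TOY_PAGE_BITS   12
#define TOY_MAIN_SIZE   8            /* direct-mapped entries, power of two */
#define TOY_VICTIM_SIZE 4            /* fully associative victim entries */
#define TOY_INVALID     UINT64_MAX   /* never a valid page number after the shift */

enum { TOY_MAIN_HIT, TOY_VICTIM_HIT, TOY_WALK };

typedef struct {
    uint64_t main_tlb[TOY_MAIN_SIZE];
    uint64_t victim[TOY_VICTIM_SIZE];
    unsigned victim_next;            /* round-robin replacement cursor */
    uint64_t main_hits, victim_hits, walks;
} ToyTLB;

static void toy_tlb_init(ToyTLB *t)
{
    assert((TOY_MAIN_SIZE & (TOY_MAIN_SIZE - 1)) == 0);
    for (unsigned i = 0; i < TOY_MAIN_SIZE; i++) {
        t->main_tlb[i] = TOY_INVALID;
    }
    for (unsigned i = 0; i < TOY_VICTIM_SIZE; i++) {
        t->victim[i] = TOY_INVALID;
    }
    t->victim_next = 0;
    t->main_hits = t->victim_hits = t->walks = 0;
}

/* Look up one virtual address and report which level satisfied it. */
static int toy_tlb_access(ToyTLB *t, uint64_t vaddr)
{
    uint64_t page = vaddr >> TOY_PAGE_BITS;
    unsigned idx = (unsigned)(page & (TOY_MAIN_SIZE - 1));

    if (t->main_tlb[idx] == page) {
        t->main_hits++;
        return TOY_MAIN_HIT;
    }

    /* Main miss: search the victim buffer; on a hit, swap the two entries. */
    for (unsigned i = 0; i < TOY_VICTIM_SIZE; i++) {
        if (t->victim[i] == page) {
            t->victim[i] = t->main_tlb[idx];
            t->main_tlb[idx] = page;
            t->victim_hits++;
            return TOY_VICTIM_HIT;
        }
    }

    /* Both missed: move the displaced main entry into the victim buffer,
     * then refill the main slot (this stands in for the page table walk). */
    if (t->main_tlb[idx] != TOY_INVALID) {
        t->victim[t->victim_next] = t->main_tlb[idx];
        t->victim_next = (t->victim_next + 1) % TOY_VICTIM_SIZE;
    }
    t->main_tlb[idx] = page;
    t->walks++;
    return TOY_WALK;
}
```

Two pages that map to the same main slot keep evicting each other in a direct-mapped table. In this sketch the victim buffer absorbs that conflict, and the three counters give the kind of per-level hit rate one would want to compare against a run with a larger TOY_MAIN_SIZE and no victim buffer.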