Date: Tue, 23 Oct 2018 13:11:14 -0400
From: "Emilio G. Cota"
To: Richard Henderson
Cc: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH 00/10] cputlb: track dirty tlbs and general cleanup
Message-ID: <20181023171114.GA10827@flamenco>
In-Reply-To: <20181023070253.6407-1-richard.henderson@linaro.org>
References: <20181023070253.6407-1-richard.henderson@linaro.org>

On Tue, Oct 23, 2018 at 08:02:42 +0100, Richard Henderson wrote:
> The motivation here is reducing the total overhead.
>
> Before a few patches went into target-arm.next, I measured total
> tlb flush overhead for aarch64 at 25%.  This appears to reduce the
> total overhead to about 5% (I do need to re-run the control tests,
> not just watch perf top as I'm doing now).

I'd like to see those absolute perf numbers; I ran a few Ubuntu
aarch64 boots and the noise is just too high to draw any conclusions
(I'm using your tlb-dirty branch on github).

When booting the much smaller Debian image, these patches are
performance-neutral, though. So,

  Reviewed-by: Emilio G. Cota

for the series.

(On a pedantic note: consider s/miniscule/minuscule/ in patches 6-7.)

> The final patch is somewhat of an RFC.  I'd like to know what
> benchmark was used when putting in pending_tlb_flushes, and I
> have not done any archaeology to find out.  I suspect that it
> does not make any measurable difference beyond tlb_c.dirty, and I
> think the code is a bit cleaner without it.

I suspect that pending_tlb_flushes was a premature optimization.
Avoiding the async job sounds like a good idea, since the job is very
expensive for the remote vCPU. In most cases, however, we take the
lock (or, in the original code, issue a full barrier) and still end
up queuing the async job, because a race when flushing other vCPUs
is unlikely; we just waste cycles in the lock (formerly the barrier).
A rough sketch of what I mean is in the P.S. below.

Thanks,

		Emilio
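
P.S. To make the "wasting cycles in the lock" point concrete, here is
an illustrative sketch of the check-before-queue pattern I mean. This
is not verbatim cputlb.c: the field and lock names (tlb_c.pending_flush,
tlb_c.lock) and the function shape are approximations from memory;
only qemu_cpu_is_self(), qemu_spin_lock/unlock() and async_run_on_cpu()
are meant as the real QEMU helpers.

    #include "qemu/osdep.h"
    #include "cpu.h"
    #include "exec/exec-all.h"

    /* Illustrative only; names approximate, not the actual cputlb.c code. */
    static void flush_remote_vcpu(CPUState *cpu, uint16_t idxmap)
    {
        CPUArchState *env = cpu->env_ptr;
        uint16_t to_queue;

        if (qemu_cpu_is_self(cpu)) {
            /* Flushing our own TLB: no cross-vCPU coordination needed. */
            tlb_flush_by_mmuidx_async_work(cpu, RUN_ON_CPU_HOST_INT(idxmap));
            return;
        }

        /*
         * Check-before-queue: only schedule the async job for mmu indexes
         * that are not already pending.  The check itself costs the lock
         * (a full barrier in the pre-tlb_lock code).
         */
        qemu_spin_lock(&env->tlb_c.lock);
        to_queue = idxmap & ~env->tlb_c.pending_flush;
        env->tlb_c.pending_flush |= to_queue;
        qemu_spin_unlock(&env->tlb_c.lock);

        /*
         * In the common case nothing was pending (two vCPUs racing to
         * flush the same remote vCPU is rare), so to_queue == idxmap and
         * we queue the async job anyway; the lock bought us nothing.
         */
        if (to_queue) {
            async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work,
                             RUN_ON_CPU_HOST_INT(to_queue));
        }
    }

That is, the fast path we actually hit pays for the synchronization
without ever skipping the async job, which is why dropping the pending
bookkeeping (and keeping just tlb_c.dirty) looks like a net win to me.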