From: Alex Bennée
To: Aleksandar Markovic
Cc: Lukáš Doktor, Peter Maydell, Stefan Hajnoczi, Richard Henderson,
    QEMU Developers, ahmedkhaledkaraman@gmail.com, "Emilio G. Cota",
    kraxel@redhat.com
Subject: Re: [INFO] Some preliminary performance data
Date: Wed, 06 May 2020 12:26:40 +0100
Message-ID: <87imh95mof.fsf@linaro.org>

Aleksandar Markovic writes:

Some preliminary thoughts....

>> Hi, all.
>>
>> I just want to share with you some bits and pieces of data that I got
>> while doing some preliminary experimentation for the GSoC project "TCG
>> Continuous Benchmarking", which Ahmed Karaman, a student in the fourth
>> and final year at the Electrical Engineering Faculty in Cairo, will
>> execute.
>>
>> *User Mode*
>>
>>   * As expected, for any program dealing with any substantial
>> floating-point calculation, the softfloat library will be the heaviest
>> consumer of CPU cycles.
>>   * We plan to examine the performance behaviour of non-FP programs
>> (integer arithmetic), or even non-numeric programs (sorting strings, for
>> example).

Emilio was the last person to do extensive benchmarking on TCG and he
used a mild fork of the venerable nbench:

  https://github.com/cota/dbt-bench

As the hot code is fairly small, it offers a good way of testing the
quality of the output. Larger programs will differ as they can involve
more code generation.

>>
>> *System Mode*
>>
>>   * I did profiling of booting several machines using a tool called
>> callgrind (a part of valgrind). The tool offers a plethora of
>> information; however, it looks like it is a little confused by the use
>> of coroutines, and that makes some of its reports look very illogical,
>> or plain ugly.

Doesn't running through valgrind inherently serialise execution anyway?
If you are looking for latency caused by locks, we have support for the
QEMU sync profiler built into the code. See "help sync-profile" on the
HMP.

>> Still, it
>> seems valid data can be extracted from it. Without going into details,
>> here is what it says for one machine (bear in mind that results may vary
>> to a great extent between machines):

You can also use perf to sample for hot spots in the code. One of last
year's GSoC students wrote some patches that included the ability to
dump a jit info file for perf to consume. We never got it merged in the
end, but it might be worth having a go at pulling the relevant bits out
from:

  Subject: [PATCH v9 00/13] TCG code quality tracking and perf integration
  Date: Mon, 7 Oct 2019 16:28:26 +0100
  Message-Id: <20191007152839.30804-1-alex.bennee@linaro.org>

>>     ** The booting involved six threads, one for display handling, one
>> for emulation, and four more. The last four did almost nothing during
>> boot, spending almost the entire time sitting idle, waiting for
>> something.
>> As far as "Total Instruction Fetch Count" goes (this is the main
>> measure used in callgrind), the fetches were distributed in proportion
>> 1:3 between the display thread and the emulation thread (the rest of
>> the threads were negligible) (but, interestingly enough, for another
>> machine that proportion was 1:20).
>>     ** The display thread is dominated by the vga_update_display()
>> function (21.5% "self" time, and 51.6% "self + callees" time, called
>> almost 40000 times). Other functions worth mentioning are
>> cpu_physical_memory_snapshot_get_dirty() and
>> memory_region_snapshot_get_dirty(), which are very small functions, but
>> are both invoked over 26 000 000 times, and together contribute over
>> 20% of the display thread's instruction fetch count.

The memory region tracking code will end up forcing the slow path for a
lot of memory accesses to video memory via softmmu. You may want to
measure if there is a difference using one of the virtio based graphics
displays.

>>     ** Focusing now on the emulation thread, "Total Instruction Fetch
>> Counts" were roughly distributed this way:
>>         - 15.7% is execution of JIT-ed code from the translation block
>> buffer
>>         - 39.9% is execution of helpers
>>         - 44.4% is the code translation stage, including some coroutine
>> activities
>>        Top two among helpers:
>>         - helper_le_stl_memory()

I assume that is the MMU slow-path being called from the generated code.

>>         - helper_lookup_tb_ptr() (this one is invoked a whopping
>> 36 000 000 times)

This is an optimisation to avoid exiting the run-loop to find the next
block. From memory I think the two main cases you'll see are:

  - computed jumps (i.e.
    target not known at JIT time)
  - jumps outside of the current page

>>        Single largest instruction consumer of code translation:
>>         - liveness_pass_1(), which constitutes 21.5% of the entire
>> "emulation thread" consumption, or, put another way, almost half of the
>> code translation stage (which sits at 44.4%)

This is very much driven by how much code generation vs running you see.
In most of my personal benchmarks I never really notice code generation
because I give my machines large amounts of RAM, so code tends to stay
resident and does not need to be re-translated. When the optimiser shows
up it's usually accompanied by high TB flush and invalidate counts in
"info jit" because we are doing more translation than we usually do.

I'll also mention my foray into tracking down the performance regression
of DOSBox Doom:

  https://diasp.eu/posts/8659062

It presented a very nice demonstration of the increasing complexity (and
run time) of the optimiser, which was completely wasted due to
self-modifying code causing us to regenerate code all the time.

>>
>> Please take all this with a grain of salt, since these results are
>> just of a preliminary nature.
>>
>> I would like to use this opportunity to welcome Ahmed Karaman, a
>> talented young man from Egypt, into the QEMU development community;
>> he'll work on the "TCG Continuous Benchmarking" project this summer.
>> Please do help him in his first steps as our colleague. Best of luck
>> to Ahmed!

Welcome to the QEMU community, Ahmed. Feel free to CC me on TCG
performance related patches. I like to see things go faster ;-)

-- 
Alex Bennée
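
The emulation-thread percentages quoted in this message can be cross-checked with a few lines of Python. This is only an illustrative sketch: the variable names are made up here, and the numbers are simply copied from the callgrind figures quoted above, not re-measured.

```python
# Shares of the emulation thread's "Total Instruction Fetch Count",
# as quoted in the message (percent). Names are illustrative only.
emulation_split = {
    "jit_code": 15.7,      # execution of JIT-ed code from the TB buffer
    "helpers": 39.9,       # execution of helpers
    "translation": 44.4,   # code translation stage
}

# liveness_pass_1() share of the whole emulation thread (percent).
liveness_pass_1 = 21.5

# The three categories should account for the whole thread.
total = sum(emulation_split.values())

# Fraction of the translation stage spent in liveness_pass_1().
liveness_share_of_translation = liveness_pass_1 / emulation_split["translation"]

print(f"categories add up to {total:.1f}%")
print(f"liveness_pass_1 share of translation: {liveness_share_of_translation:.1%}")
```

The three categories add up to the whole thread, and 21.5 out of 44.4 is a little over 48%, which matches the "almost half of the code translation stage" remark.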