From mboxrd@z Thu Jan 1 00:00:00 1970
From: Lluís Vilanova
References: <150529642278.10902.18234057937634437857.stgit@frigg.lan>
 <150529666493.10902.14830445134051381968.stgit@frigg.lan>
 <87poasgjyh.fsf@frigg.lan>
 <87d16o53xr.fsf@frigg.lan>
Date: Mon, 25 Sep 2017 21:03:39 +0300
In-Reply-To: (Peter Maydell's message of "Mon, 18 Sep 2017 18:42:55 +0100")
Message-ID: <87o9pywt8k.fsf@frigg.lan>
Subject: Re: [Qemu-devel] [PATCH v6 01/22] instrument: Add documentation
To: Peter Maydell
Cc: "Emilio G. Cota", QEMU Developers, Stefan Hajnoczi, Markus Armbruster

First, sorry for the late response; I was away for a few days.

Peter Maydell writes:

> On 18 September 2017 at 18:09, Lluís Vilanova wrote:
>> Peter Maydell writes:
>>> It's also exposing internal QEMU implementation detail.
>>> What if in future we decide to switch from our current
>>> setup to always interpreting guest instructions as a
>>> first pass with JITting done only in the background for
>>> hot code?
>>
>> TCI still has a separation of translation-time (translate.c) and execution-time
>> (interpreting the TCG opcodes), and I don't think that's gonna go away anytime
>> soon.

> I didn't mean TCI, which is nothing like what you'd use for
> this if you did it (TCI is slower than just JITting.)

My point is that even on the cold path you need to decode a guest instruction
(equivalent to translating) and emulate it on the spot (equivalent to
executing).

>> Even if it did, I think there still will be a translation/execution separation
>> easy enough to hook into (even if it's a "fake" one for the cold-path
>> interpreted instructions).

> But what would it mean? You don't have basic blocks any more.

Every instruction emulated on the spot can be seen as a newly translated block
(of one instruction only), which is executed immediately after.

>>> Sticking to instrumentation events that correspond exactly to guest
>>> execution events means they won't break or expose internals.
>>
>> It also means we won't be able to "conditionally" instrument instructions (e.g.,
>> based on their opcode, address range, etc.).

> You can still do that, it's just less efficient (your
> condition-check happens in the callout to the instrumentation
> plugin). We can add "filter" options later if we need them
> (which I would rather do than have translate-time callbacks).

Before answering, a short summary of when knowing about translate/execute makes
a difference:

* Record some information only once when an instruction is translated, instead
  of recording it on every executed instruction (e.g., a study of opcode
  distribution, which you can get from a file of per-TB opcodes - generated at
  translation time - and a list of executed TBs - generated at execution time
  -). The translate/execute separation makes this run faster *and* produces much
  smaller files with the recorded info.

  Other typical examples that benefit from this are writing a simulator that
  feeds off a stream of instruction information (a common reason why people want
  to trace memory accesses and information of executed instructions).

* Conditionally instrumenting instructions.
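
To make the first point concrete, here is a minimal sketch of the offline step
that combines the two recordings into an opcode distribution. It is only an
illustration: the function name, the TB identifiers and the in-memory record
formats are all hypothetical and are not part of QEMU or of the API proposed in
this series.

```python
from collections import Counter

def opcode_distribution(tb_opcodes, executed_tbs):
    """Combine translate-time and execute-time records (hypothetical formats).

    tb_opcodes:   dict of TB id -> list of opcode names, written once per
                  translated TB (translation time).
    executed_tbs: iterable of TB ids, one entry per TB execution
                  (execution time).
    """
    # How many times each TB was executed.
    exec_counts = Counter(executed_tbs)

    # Each opcode in a TB is counted once per execution of that TB.
    dist = Counter()
    for tb_id, times in exec_counts.items():
        for opcode in tb_opcodes[tb_id]:
            dist[opcode] += times
    return dist

# Made-up example data: two TBs, the first one executed three times.
tb_opcodes = {0: ["mov", "add", "jmp"], 1: ["ld", "add", "ret"]}
executed_tbs = [0, 0, 1, 0]
dist = opcode_distribution(tb_opcodes, executed_tbs)
```

The execution-time log only carries one TB id per executed block, while the
opcodes themselves are written once per translated TB; that is where the
smaller files and the lower run-time overhead come from.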

Adding filtering to the instrumentation API would only solve the second point,
but not the first one.

Now, do we need/want to support the first point?

>> Of course we can add the translation/execution differentiation later if we find
>> it necessary for performance, but I would rather avoid leaving "historical"
>> instrumentation points behind on the API.
>>
>> What are the use-cases you're aiming for?

> * I want to be able to point the small stream of people who come
> into qemu-devel asking "how do I trace all my guest's memory
> accesses" at a clean API for it.

> * I want to be able to have less ugly and confusing tracing
> than our current -d output (and perhaps emit tracing in formats
> that other analysis tools want as input)

> * I want to keep this initial tracing API simple enough that
> we can agree on it and get a first working useful version.

Fair enough.

I know it's not exactly what we're discussing, but the plot in [1] compares a
few different ways to trace memory accesses on SPEC benchmarks:

* The first bar uses Intel's Pin tool [2].

* The second calls into an instrumentation function on every executed memory
  access in QEMU.

* The third embeds the hot path of writing the memory access info to an array
  into the TCG opcode stream (more or less equivalent to supporting filtering;
  when the array is full, a user's callback is called - cold path -).

* The fourth bar can be ignored.

This was done on a much older version of the instrumentation for QEMU, but I can
implement something that does the first use-case point above and some filtering
example (second use-case point) to see what the performance difference is.

[1] https://filetea.me/n3wy9WwyCCZR72E9OWXHArHDw
[2] https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool


Thanks!
  Lluis