From: "Alex Bennée" <alex.bennee@linaro.org>
To: Jon Wilson <jonwilson030981@googlemail.com>
Cc: qemu-devel@nongnu.org,
Richard Henderson <richard.henderson@linaro.org>,
Pierrick Bouvier <pierrick.bouvier@linaro.org>
Subject: Re: TCG Address Sanitizer Optimization.
Date: Mon, 02 Jun 2025 17:26:49 +0100 [thread overview]
Message-ID: <87bjr63y9i.fsf@draig.linaro.org> (raw)
In-Reply-To: <CAJHT5-+tAuCoDV2G=-bZfP-j0gvY8Um-8TO8un2uNSGZpA1pcg@mail.gmail.com> (Jon Wilson's message of "Mon, 2 Jun 2025 16:54:16 +0100")
Jon Wilson <jonwilson030981@googlemail.com> writes:
> It would be good if we could have QEMU provide clean APIs to allow the sort of additional instrumentation that fuzzing
> requires. I guess the qemu-libafl-bridge project show the sort of modification which has been required so far...
> https://github.com/AFLplusplus/qemu-libafl-bridge/tree/main/libafl
>
> I would like to conditionally call a helper, or even just insert a breakpoint instruction, but like I say I don't seem to be able to
> make use of any sort of branches.
For TCG plugins we have:
/**
* qemu_plugin_register_vcpu_tb_exec_cond_cb() - register conditional callback
* @tb: the opaque qemu_plugin_tb handle for the translation
* @cb: callback function
* @cond: condition to enable callback
* @entry: first operand for condition
* @imm: second operand for condition
* @flags: does the plugin read or write the CPU's registers?
* @userdata: any plugin data to pass to the @cb?
*
* The @cb function is called when a translated unit executes if
* entry @cond imm is true.
* If condition is QEMU_PLUGIN_COND_ALWAYS, condition is never interpreted and
* this function is equivalent to qemu_plugin_register_vcpu_tb_exec_cb.
* If condition QEMU_PLUGIN_COND_NEVER, condition is never interpreted and
* callback is never installed.
*/
QEMU_PLUGIN_API
void qemu_plugin_register_vcpu_tb_exec_cond_cb(struct qemu_plugin_tb *tb,
qemu_plugin_vcpu_udata_cb_t cb,
enum qemu_plugin_cb_flags flags,
enum qemu_plugin_cond cond,
qemu_plugin_u64 entry,
uint64_t imm,
void *userdata);
Along with qemu_plugin_register_vcpu_insn_exec_cond_cb() for individual
instructions. They are designed work with per-cpu indexed scoreboards
using inline operations (e.g.
qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu) which can either add
or store values.
Currently we inline memory operations are a bit limited:
/**
* qemu_plugin_register_vcpu_mem_inline_per_vcpu() - inline op for mem access
* @insn: handle for instruction to instrument
* @rw: apply to reads, writes or both
* @op: the op, of type qemu_plugin_op
* @entry: entry to run op
* @imm: immediate data for @op
*
* This registers a inline op every memory access generated by the
* instruction.
*/
QEMU_PLUGIN_API
void qemu_plugin_register_vcpu_mem_inline_per_vcpu(
struct qemu_plugin_insn *insn,
enum qemu_plugin_mem_rw rw,
enum qemu_plugin_op op,
qemu_plugin_u64 entry,
uint64_t imm);
Although you can get access to the memory information through helpers:
/**
* qemu_plugin_register_vcpu_mem_cb() - register memory access callback
* @insn: handle for instruction to instrument
* @cb: callback of type qemu_plugin_vcpu_mem_cb_t
* @flags: (currently unused) callback flags
* @rw: monitor reads, writes or both
* @userdata: opaque pointer for userdata
*
* This registers a full callback for every memory access generated by
* an instruction. If the instruction doesn't access memory no
* callback will be made.
*
* The callback reports the vCPU the access took place on, the virtual
* address of the access and a handle for further queries. The user
* can attach some userdata to the callback for additional purposes.
*
* Other execution threads will continue to execute during the
* callback so the plugin is responsible for ensuring it doesn't get
* confused by making appropriate use of locking if required.
*/
QEMU_PLUGIN_API
void qemu_plugin_register_vcpu_mem_cb(struct qemu_plugin_insn *insn,
qemu_plugin_vcpu_mem_cb_t cb,
enum qemu_plugin_cb_flags flags,
enum qemu_plugin_mem_rw rw,
void *userdata);
where you can then process qemu_plugin_meminfo_t info in the callback to
get the value or address of a memory operation. I guess you want some
sort of operation to do an inline transform of the address to use a
lookup to compare and branch to. The trick is coming up with an
interface which is general and flexible enough and not just "what asan
needs for a specific use case".
> Even if I add a benign instrumentation that simply conditionally branches at a label and
> nothing else (e.g. no actual functionality), I still have the same problem.
> e.g.
>
> ////////////////////////////////////////////////////////////////////////////////
>
> TCGLabel *done = gen_new_label();
> TCGv addr_val = temp_tcgv_tl(addr);
> TCGv zero = tcg_constant_tl(0);
> tcg_gen_brcond_tl(TCG_COND_EQ, addr_val, zero, done);
> gen_set_label(done);
>
> ////////////////////////////////////////////////////////////////////////////////
>
> Hence the current implementation is a little clumsy!
>
> Thanks for your advice.
>
> Jon
>
> On Mon, Jun 2, 2025 at 4:09 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Jon Wilson <jonwilson030981@googlemail.com> writes:
>
> (Adding Richard, the TCG maintainer to CC)
>
> > I am attempting to optimize some TCG code which I have previously written for
> > qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge), the
> > component used by the LibAFL project when fuzzing binaries using QEMU to
> > provide runtime instrumentation. The code is used to write additional TCG into
> > basic blocks whenever a load or store operation is performed in order to provide
> > address sanitizer functionality.
>
> I would like to figure out if we can push any of this instrumentation
> into TCG plugins so you can avoid patching QEMU itself. I guess you need
> something that allows you to hook a memory address into some sort of
> gadget?
>
> > Address sanitizer is quite simple and works by mapping each 8-byte region of
> > address space to single byte within a region called the shadow map. The address
> > (on a 64-bit platform) of the shadow map for a given address is:
> >
> > Shadow = (Mem >> 3) + 0x7fff8000;
> >
> > The value in the shadow map encodes the accessibility of an address:
> >
> > 0 - The whole 8 byte region is accessible.
> > 1 .. 7 - The first n bytes are accessible.
> > negative - The whole 8 byte region is inaccessible.
> >
> > The following pseudo code shows the algorithm:
> >
> > ////////////////////////////////////////////////////////////////////////////////
> >
> > https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm
> >
> > byte *shadow_address = MemToShadow(address);
> > byte shadow_value = *shadow_address;
> > if (shadow_value) {
> > if (SlowPathCheck(shadow_value, address, kAccessSize)) {
> > ReportError(address, kAccessSize, kIsWrite);
> > }
> > }
> >
> > // Check the cases where we access first k bytes of the qword
> > // and these k bytes are unpoisoned.
> > bool SlowPathCheck(shadow_value, address, kAccessSize) {
> > last_accessed_byte = (address & 7) + kAccessSize - 1;
> > return (last_accessed_byte >= shadow_value);
> > }
> >
> > ////////////////////////////////////////////////////////////////////////////////
> >
> > My current implementation makes use of conditional move instructions to trigger
> > a segfault by way of null dereference in the event that the shadow map indicates
> > that a memory access is invalid.
> >
> > ////////////////////////////////////////////////////////////////////////////////
> >
> > #if TARGET_LONG_BITS == 32
> > #define SHADOW_BASE (0x20000000)
> > #elif TARGET_LONG_BITS == 64
> > #define SHADOW_BASE (0x7fff8000)
> > #else
> > #error Unhandled TARGET_LONG_BITS value
> > #endif
> >
> > void libafl_tcg_gen_asan(TCGTemp * addr, size_t size)
> > {
> > if (size == 0)
> > return;
> >
> > TCGv addr_val = temp_tcgv_tl(addr);
> > TCGv k = tcg_temp_new();
> > TCGv shadow_addr = tcg_temp_new();
> > TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
> > TCGv shadow_val = tcg_temp_new();
> > TCGv test_addr = tcg_temp_new();
> > TCGv_ptr test_ptr = tcg_temp_new_ptr();
> >
> > tcg_gen_andi_tl(k, addr_val, 7);
> > tcg_gen_addi_tl(k, k, size - 1);
> >
> > tcg_gen_shri_tl(shadow_addr, addr_val, 3);
> > tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
> > tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
> > tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
> >
> > /*
> > * Making conditional branches here appears to cause QEMU issues with dead
> > * temporaries so we will instead avoid branches.
>
> This sounds like a TCG bug that may have been fixed.
>
> > We will cause the guest
> > * to perform a NULL dereference in the event of an ASAN fault. Note that
> > * we will do this by using a store rather than a load, since the TCG may
> > * otherwise determine that the result of the load is unused and simply
> > * discard the operation. In the event that the shadow memory doesn't
> > * detect a fault, we will simply write the value read from the shadow
> > * memory back to it's original location. If, however, the shadow memory
> > * detects an invalid access, we will instead attempt to write the value
> > * at 0x0.
> > */
>
> Why not conditionally call a helper here? Forcing the guest to actually
> fault seems a bit hammer like.
>
> > tcg_gen_movcond_tl(TCG_COND_EQ, test_addr,
> > shadow_val, tcg_constant_tl(0),
> > shadow_addr, tcg_constant_tl(0));
> >
> > if (size < 8)
> > {
> > tcg_gen_movcond_tl(TCG_COND_GE, test_addr,
> > k, shadow_val,
> > test_addr, shadow_addr);
> > }
> >
> > tcg_gen_tl_ptr(test_ptr, test_addr);
> > tcg_gen_st8_tl(shadow_val, test_ptr, 0);
> > }
> >
> > ////////////////////////////////////////////////////////////////////////////////
> >
> > However, I would like test an implementation more like the following to see how
> > the performance compares. Whilst this introduces branches, the fast path is much
> > more likely to be executed than the slow path and hence bypassing the additional
> > checks and unnecessary memory writes I am hopeful it will improve performance.
> >
> > ////////////////////////////////////////////////////////////////////////////////
> >
> > void libafl_tcg_gen_asan(TCGTemp* addr, size_t size)
> > {
> > if (size == 0) {
> > return;
> > }
> >
> > if (size > 8) {
> > size = 8;
> > }
> >
> > TCGLabel *done = gen_new_label();
> >
> > TCGv addr_val = temp_tcgv_tl(addr);
> > TCGv shadow_addr = tcg_temp_new();
> > TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
> > TCGv shadow_val = tcg_temp_new();
> > TCGv k = tcg_temp_new();
> > TCGv zero = tcg_constant_tl(0);
> > TCGv_ptr null_ptr = tcg_temp_new_ptr();
> >
> > tcg_gen_shri_tl(shadow_addr, addr_val, 3);
> > tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
> > tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
> > tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
> >
> > tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done);
> >
> > tcg_gen_andi_tl(k, addr_val, 7);
> > tcg_gen_addi_tl(k, k, size - 1);
> >
> > tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done);
> >
> > tcg_gen_tl_ptr(null_ptr, zero);
> > tcg_gen_st8_tl(zero, null_ptr, 0);
> >
> > gen_set_label(done);
> > }
> >
> > ////////////////////////////////////////////////////////////////////////////////
> >
> > However, when I change to using this implementation, I get the following error.
> > I have tested it with a trivial hello world implementation for x86_64 running in
> > qemu-user. It doesn't occur the first time the block is executed, therefore I
> > think the issue is caused by the surrounding TCG in the block it is injected
> > into?
> >
> > ////////////////////////////////////////////////////////////////////////////////
> > runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion `ts->val_type == TEMP_VAL_REG' failed.
> > Aborted (core dumped)
> > ////////////////////////////////////////////////////////////////////////////////
> >
> > I would be very grateful for any advice of how to resolve this issue, or any
> > alternative approaches I could use to optimize my original implementation. The
> > code is obviously a very hot path and so even a tiny performance improvement
> > could result in a large performance gain overall.
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
next prev parent reply other threads:[~2025-06-02 16:27 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-02 12:02 TCG Address Sanitizer Optimization Jon Wilson
2025-06-02 15:09 ` Alex Bennée
2025-06-02 15:54 ` Jon Wilson
2025-06-02 15:58 ` Richard Henderson
2025-06-02 16:26 ` Alex Bennée [this message]
2025-06-03 8:20 ` Jon Wilson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87bjr63y9i.fsf@draig.linaro.org \
--to=alex.bennee@linaro.org \
--cc=jonwilson030981@googlemail.com \
--cc=pierrick.bouvier@linaro.org \
--cc=qemu-devel@nongnu.org \
--cc=richard.henderson@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).