TCG Address Sanitizer Optimization.

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* TCG Address Sanitizer Optimization.
@ 2025-06-02 12:02 Jon Wilson
  2025-06-02 15:09 ` Alex Bennée
  0 siblings, 1 reply; 6+ messages in thread
From: Jon Wilson @ 2025-06-02 12:02 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 6263 bytes --]

I am attempting to optimize some TCG code which I have previously written
for
qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge), the
component used by the LibAFL project when fuzzing binaries using QEMU to
provide runtime instrumentation. The code is used to write additional TCG
into
basic blocks whenever a load or store operation is performed in order to
provide
address sanitizer functionality.

Address sanitizer is quite simple and works by mapping each 8-byte region of
address space to single byte within a region called the shadow map. The
address
(on a 64-bit platform) of the shadow map for a given address is:

    Shadow = (Mem >> 3) + 0x7fff8000;

The value in the shadow map encodes the accessibility of an address:

    0  - The whole 8 byte region is accessible.
    1 .. 7 - The first n bytes are accessible.
    negative - The whole 8 byte region is inaccessible.

The following pseudo code shows the algorithm:

////////////////////////////////////////////////////////////////////////////////

https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm

byte *shadow_address = MemToShadow(address);
byte shadow_value = *shadow_address;
if (shadow_value) {
  if (SlowPathCheck(shadow_value, address, kAccessSize)) {
    ReportError(address, kAccessSize, kIsWrite);
  }
}

// Check the cases where we access first k bytes of the qword
// and these k bytes are unpoisoned.
bool SlowPathCheck(shadow_value, address, kAccessSize) {
  last_accessed_byte = (address & 7) + kAccessSize - 1;
  return (last_accessed_byte >= shadow_value);
}

////////////////////////////////////////////////////////////////////////////////

My current implementation makes use of conditional move instructions to
trigger
a segfault by way of null dereference in the event that the shadow map
indicates
that a memory access is invalid.

////////////////////////////////////////////////////////////////////////////////

#if TARGET_LONG_BITS == 32
#define SHADOW_BASE (0x20000000)
#elif TARGET_LONG_BITS == 64
#define SHADOW_BASE (0x7fff8000)
#else
#error Unhandled TARGET_LONG_BITS value
#endif

void libafl_tcg_gen_asan(TCGTemp * addr, size_t size)
{
    if (size == 0)
        return;

    TCGv addr_val = temp_tcgv_tl(addr);
    TCGv k = tcg_temp_new();
    TCGv shadow_addr = tcg_temp_new();
    TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
    TCGv shadow_val = tcg_temp_new();
    TCGv test_addr = tcg_temp_new();
    TCGv_ptr test_ptr = tcg_temp_new_ptr();

    tcg_gen_andi_tl(k, addr_val, 7);
    tcg_gen_addi_tl(k, k, size - 1);

    tcg_gen_shri_tl(shadow_addr, addr_val, 3);
    tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
    tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
    tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);

    /*
     * Making conditional branches here appears to cause QEMU issues with
dead
     * temporaries so we will instead avoid branches. We will cause the
guest
     * to perform a NULL dereference in the event of an ASAN fault. Note
that
     * we will do this by using a store rather than a load, since the TCG
may
     * otherwise determine that the result of the load is unused and simply
     * discard the operation. In the event that the shadow memory doesn't
     * detect a fault, we will simply write the value read from the shadow
     * memory back to it's original location. If, however, the shadow memory
     * detects an invalid access, we will instead attempt to write the value
     * at 0x0.
     */
    tcg_gen_movcond_tl(TCG_COND_EQ, test_addr,
        shadow_val, tcg_constant_tl(0),
        shadow_addr, tcg_constant_tl(0));

    if (size < 8)
    {
        tcg_gen_movcond_tl(TCG_COND_GE, test_addr,
            k, shadow_val,
            test_addr, shadow_addr);
    }

    tcg_gen_tl_ptr(test_ptr, test_addr);
    tcg_gen_st8_tl(shadow_val, test_ptr, 0);
}

////////////////////////////////////////////////////////////////////////////////

However, I would like test an implementation more like the following to see
how
the performance compares. Whilst this introduces branches, the fast path is
much
more likely to be executed than the slow path and hence bypassing the
additional
checks and unnecessary memory writes I am hopeful it will improve
performance.

////////////////////////////////////////////////////////////////////////////////

void libafl_tcg_gen_asan(TCGTemp* addr, size_t size)
{
    if (size == 0) {
        return;
    }

    if (size > 8) {
        size = 8;
    }

    TCGLabel *done = gen_new_label();

    TCGv addr_val = temp_tcgv_tl(addr);
    TCGv shadow_addr = tcg_temp_new();
    TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
    TCGv shadow_val = tcg_temp_new();
    TCGv k = tcg_temp_new();
    TCGv zero = tcg_constant_tl(0);
    TCGv_ptr null_ptr = tcg_temp_new_ptr();

    tcg_gen_shri_tl(shadow_addr, addr_val, 3);
    tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
    tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
    tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);

    tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done);

    tcg_gen_andi_tl(k, addr_val, 7);
    tcg_gen_addi_tl(k, k, size - 1);

    tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done);

    tcg_gen_tl_ptr(null_ptr, zero);
    tcg_gen_st8_tl(zero, null_ptr, 0);

    gen_set_label(done);
}

////////////////////////////////////////////////////////////////////////////////

However, when I change to using this implementation, I get the following
error.
I have tested it with a trivial hello world implementation for x86_64
running in
qemu-user. It doesn't occur the first time the block is executed, therefore
I
think the issue is caused by the surrounding TCG in the block it is injected
into?

////////////////////////////////////////////////////////////////////////////////
runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion
`ts->val_type == TEMP_VAL_REG' failed.
Aborted (core dumped)
////////////////////////////////////////////////////////////////////////////////

I would be very grateful for any advice of how to resolve this issue, or any
alternative approaches I could use to optimize my original implementation.
The
code is obviously a very hot path and so even a tiny performance improvement
could result in a large performance gain overall.

[-- Attachment #2: Type: text/html, Size: 7195 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: TCG Address Sanitizer Optimization.
  2025-06-02 12:02 TCG Address Sanitizer Optimization Jon Wilson
@ 2025-06-02 15:09 ` Alex Bennée
  2025-06-02 15:54   ` Jon Wilson
  0 siblings, 1 reply; 6+ messages in thread
From: Alex Bennée @ 2025-06-02 15:09 UTC (permalink / raw)
  To: Jon Wilson; +Cc: qemu-devel, Richard Henderson, Pierrick Bouvier

Jon Wilson <jonwilson030981@googlemail.com> writes:

(Adding Richard, the TCG maintainer to CC)

> I am attempting to optimize some TCG code which I have previously written for
> qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge), the 
> component used by the LibAFL project when fuzzing binaries using QEMU to 
> provide runtime instrumentation. The code is used to write additional TCG into 
> basic blocks whenever a load or store operation is performed in order to provide
> address sanitizer functionality.

I would like to figure out if we can push any of this instrumentation
into TCG plugins so you can avoid patching QEMU itself. I guess you need
something that allows you to hook a memory address into some sort of
gadget?

> Address sanitizer is quite simple and works by mapping each 8-byte region of
> address space to single byte within a region called the shadow map. The address
> (on a 64-bit platform) of the shadow map for a given address is:
>
>     Shadow = (Mem >> 3) + 0x7fff8000;
>
> The value in the shadow map encodes the accessibility of an address:
>
>     0  - The whole 8 byte region is accessible.
>     1 .. 7 - The first n bytes are accessible.
>     negative - The whole 8 byte region is inaccessible.
>
> The following pseudo code shows the algorithm:
>
> ////////////////////////////////////////////////////////////////////////////////
>
> https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm
>
> byte *shadow_address = MemToShadow(address);
> byte shadow_value = *shadow_address;
> if (shadow_value) {
>   if (SlowPathCheck(shadow_value, address, kAccessSize)) {
>     ReportError(address, kAccessSize, kIsWrite);
>   }
> }
>
> // Check the cases where we access first k bytes of the qword
> // and these k bytes are unpoisoned.
> bool SlowPathCheck(shadow_value, address, kAccessSize) {
>   last_accessed_byte = (address & 7) + kAccessSize - 1;
>   return (last_accessed_byte >= shadow_value);
> }
>
> ////////////////////////////////////////////////////////////////////////////////
>
> My current implementation makes use of conditional move instructions to trigger
> a segfault by way of null dereference in the event that the shadow map indicates
> that a memory access is invalid.
>
> ////////////////////////////////////////////////////////////////////////////////
>
> #if TARGET_LONG_BITS == 32
> #define SHADOW_BASE (0x20000000)
> #elif TARGET_LONG_BITS == 64
> #define SHADOW_BASE (0x7fff8000)
> #else
> #error Unhandled TARGET_LONG_BITS value
> #endif
>
> void libafl_tcg_gen_asan(TCGTemp * addr, size_t size)
> {
>     if (size == 0)
>         return;
>
>     TCGv addr_val = temp_tcgv_tl(addr);
>     TCGv k = tcg_temp_new();
>     TCGv shadow_addr = tcg_temp_new();
>     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
>     TCGv shadow_val = tcg_temp_new();
>     TCGv test_addr = tcg_temp_new();
>     TCGv_ptr test_ptr = tcg_temp_new_ptr();
>
>     tcg_gen_andi_tl(k, addr_val, 7);
>     tcg_gen_addi_tl(k, k, size - 1);
>
>     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
>     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
>     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
>     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
>
>     /*
>      * Making conditional branches here appears to cause QEMU issues with dead
>      * temporaries so we will instead avoid branches.

This sounds like a TCG bug that may have been fixed.

>      We will cause the guest
>      * to perform a NULL dereference in the event of an ASAN fault. Note that
>      * we will do this by using a store rather than a load, since the TCG may
>      * otherwise determine that the result of the load is unused and simply
>      * discard the operation. In the event that the shadow memory doesn't
>      * detect a fault, we will simply write the value read from the shadow
>      * memory back to it's original location. If, however, the shadow memory
>      * detects an invalid access, we will instead attempt to write the value
>      * at 0x0.
>      */

Why not conditionally call a helper here? Forcing the guest to actually
fault seems a bit hammer like.

>     tcg_gen_movcond_tl(TCG_COND_EQ, test_addr,
>         shadow_val, tcg_constant_tl(0),
>         shadow_addr, tcg_constant_tl(0));
>
>     if (size < 8)
>     {
>         tcg_gen_movcond_tl(TCG_COND_GE, test_addr,
>             k, shadow_val,
>             test_addr, shadow_addr);
>     }
>
>     tcg_gen_tl_ptr(test_ptr, test_addr);
>     tcg_gen_st8_tl(shadow_val, test_ptr, 0);
> }
>
> ////////////////////////////////////////////////////////////////////////////////
>
> However, I would like test an implementation more like the following to see how
> the performance compares. Whilst this introduces branches, the fast path is much
> more likely to be executed than the slow path and hence bypassing the additional
> checks and unnecessary memory writes I am hopeful it will improve performance.
>
> ////////////////////////////////////////////////////////////////////////////////
>
> void libafl_tcg_gen_asan(TCGTemp* addr, size_t size)
> {
>     if (size == 0) {
>         return;
>     }
>
>     if (size > 8) {
>         size = 8;
>     }
>
>     TCGLabel *done = gen_new_label();
>
>     TCGv addr_val = temp_tcgv_tl(addr);
>     TCGv shadow_addr = tcg_temp_new();
>     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
>     TCGv shadow_val = tcg_temp_new();
>     TCGv k = tcg_temp_new();
>     TCGv zero = tcg_constant_tl(0);
>     TCGv_ptr null_ptr = tcg_temp_new_ptr();
>
>     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
>     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
>     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
>     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
>
>     tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done);
>
>     tcg_gen_andi_tl(k, addr_val, 7);
>     tcg_gen_addi_tl(k, k, size - 1);
>
>     tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done);
>
>     tcg_gen_tl_ptr(null_ptr, zero);
>     tcg_gen_st8_tl(zero, null_ptr, 0);
>
>     gen_set_label(done);
> }
>
> ////////////////////////////////////////////////////////////////////////////////
>
> However, when I change to using this implementation, I get the following error.
> I have tested it with a trivial hello world implementation for x86_64 running in
> qemu-user. It doesn't occur the first time the block is executed, therefore I
> think the issue is caused by the surrounding TCG in the block it is injected
> into?
>
> ////////////////////////////////////////////////////////////////////////////////
> runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion `ts->val_type == TEMP_VAL_REG' failed.
> Aborted (core dumped)
> ////////////////////////////////////////////////////////////////////////////////
>
> I would be very grateful for any advice of how to resolve this issue, or any
> alternative approaches I could use to optimize my original implementation. The
> code is obviously a very hot path and so even a tiny performance improvement
> could result in a large performance gain overall.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: TCG Address Sanitizer Optimization.
  2025-06-02 15:09 ` Alex Bennée
@ 2025-06-02 15:54   ` Jon Wilson
  2025-06-02 15:58     ` Richard Henderson
  2025-06-02 16:26     ` Alex Bennée
  0 siblings, 2 replies; 6+ messages in thread
From: Jon Wilson @ 2025-06-02 15:54 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Richard Henderson, Pierrick Bouvier

[-- Attachment #1: Type: text/plain, Size: 8945 bytes --]

It would be good if we could have QEMU provide clean APIs to allow the sort
of additional instrumentation that fuzzing requires. I guess the
qemu-libafl-bridge project show the sort of modification which has been
required so far...
https://github.com/AFLplusplus/qemu-libafl-bridge/tree/main/libafl

I would like to conditionally call a helper, or even just insert a
breakpoint instruction, but like I say I don't seem to be able to make use
of any sort of branches. Even if I add a benign instrumentation that simply
conditionally branches at a label and nothing else (e.g. no actual
functionality), I still have the same problem.
e.g.

////////////////////////////////////////////////////////////////////////////////


TCGLabel *done = gen_new_label();
TCGv addr_val = temp_tcgv_tl(addr);
TCGv zero = tcg_constant_tl(0);
tcg_gen_brcond_tl(TCG_COND_EQ, addr_val, zero, done);
gen_set_label(done);

////////////////////////////////////////////////////////////////////////////////


Hence the current implementation is a little clumsy!

Thanks for your advice.

Jon



On Mon, Jun 2, 2025 at 4:09 PM Alex Bennée <alex.bennee@linaro.org> wrote:

> Jon Wilson <jonwilson030981@googlemail.com> writes:
>
> (Adding Richard, the TCG maintainer to CC)
>
> > I am attempting to optimize some TCG code which I have previously
> written for
> > qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge),
> the
> > component used by the LibAFL project when fuzzing binaries using QEMU to
> > provide runtime instrumentation. The code is used to write additional
> TCG into
> > basic blocks whenever a load or store operation is performed in order to
> provide
> > address sanitizer functionality.
>
> I would like to figure out if we can push any of this instrumentation
> into TCG plugins so you can avoid patching QEMU itself. I guess you need
> something that allows you to hook a memory address into some sort of
> gadget?
>
> > Address sanitizer is quite simple and works by mapping each 8-byte
> region of
> > address space to single byte within a region called the shadow map. The
> address
> > (on a 64-bit platform) of the shadow map for a given address is:
> >
> >     Shadow = (Mem >> 3) + 0x7fff8000;
> >
> > The value in the shadow map encodes the accessibility of an address:
> >
> >     0  - The whole 8 byte region is accessible.
> >     1 .. 7 - The first n bytes are accessible.
> >     negative - The whole 8 byte region is inaccessible.
> >
> > The following pseudo code shows the algorithm:
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm
> >
> > byte *shadow_address = MemToShadow(address);
> > byte shadow_value = *shadow_address;
> > if (shadow_value) {
> >   if (SlowPathCheck(shadow_value, address, kAccessSize)) {
> >     ReportError(address, kAccessSize, kIsWrite);
> >   }
> > }
> >
> > // Check the cases where we access first k bytes of the qword
> > // and these k bytes are unpoisoned.
> > bool SlowPathCheck(shadow_value, address, kAccessSize) {
> >   last_accessed_byte = (address & 7) + kAccessSize - 1;
> >   return (last_accessed_byte >= shadow_value);
> > }
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > My current implementation makes use of conditional move instructions to
> trigger
> > a segfault by way of null dereference in the event that the shadow map
> indicates
> > that a memory access is invalid.
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > #if TARGET_LONG_BITS == 32
> > #define SHADOW_BASE (0x20000000)
> > #elif TARGET_LONG_BITS == 64
> > #define SHADOW_BASE (0x7fff8000)
> > #else
> > #error Unhandled TARGET_LONG_BITS value
> > #endif
> >
> > void libafl_tcg_gen_asan(TCGTemp * addr, size_t size)
> > {
> >     if (size == 0)
> >         return;
> >
> >     TCGv addr_val = temp_tcgv_tl(addr);
> >     TCGv k = tcg_temp_new();
> >     TCGv shadow_addr = tcg_temp_new();
> >     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
> >     TCGv shadow_val = tcg_temp_new();
> >     TCGv test_addr = tcg_temp_new();
> >     TCGv_ptr test_ptr = tcg_temp_new_ptr();
> >
> >     tcg_gen_andi_tl(k, addr_val, 7);
> >     tcg_gen_addi_tl(k, k, size - 1);
> >
> >     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
> >     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
> >     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
> >     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
> >
> >     /*
> >      * Making conditional branches here appears to cause QEMU issues
> with dead
> >      * temporaries so we will instead avoid branches.
>
> This sounds like a TCG bug that may have been fixed.
>
> >      We will cause the guest
> >      * to perform a NULL dereference in the event of an ASAN fault. Note
> that
> >      * we will do this by using a store rather than a load, since the
> TCG may
> >      * otherwise determine that the result of the load is unused and
> simply
> >      * discard the operation. In the event that the shadow memory doesn't
> >      * detect a fault, we will simply write the value read from the
> shadow
> >      * memory back to it's original location. If, however, the shadow
> memory
> >      * detects an invalid access, we will instead attempt to write the
> value
> >      * at 0x0.
> >      */
>
> Why not conditionally call a helper here? Forcing the guest to actually
> fault seems a bit hammer like.
>
> >     tcg_gen_movcond_tl(TCG_COND_EQ, test_addr,
> >         shadow_val, tcg_constant_tl(0),
> >         shadow_addr, tcg_constant_tl(0));
> >
> >     if (size < 8)
> >     {
> >         tcg_gen_movcond_tl(TCG_COND_GE, test_addr,
> >             k, shadow_val,
> >             test_addr, shadow_addr);
> >     }
> >
> >     tcg_gen_tl_ptr(test_ptr, test_addr);
> >     tcg_gen_st8_tl(shadow_val, test_ptr, 0);
> > }
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > However, I would like test an implementation more like the following to
> see how
> > the performance compares. Whilst this introduces branches, the fast path
> is much
> > more likely to be executed than the slow path and hence bypassing the
> additional
> > checks and unnecessary memory writes I am hopeful it will improve
> performance.
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > void libafl_tcg_gen_asan(TCGTemp* addr, size_t size)
> > {
> >     if (size == 0) {
> >         return;
> >     }
> >
> >     if (size > 8) {
> >         size = 8;
> >     }
> >
> >     TCGLabel *done = gen_new_label();
> >
> >     TCGv addr_val = temp_tcgv_tl(addr);
> >     TCGv shadow_addr = tcg_temp_new();
> >     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
> >     TCGv shadow_val = tcg_temp_new();
> >     TCGv k = tcg_temp_new();
> >     TCGv zero = tcg_constant_tl(0);
> >     TCGv_ptr null_ptr = tcg_temp_new_ptr();
> >
> >     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
> >     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
> >     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
> >     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
> >
> >     tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done);
> >
> >     tcg_gen_andi_tl(k, addr_val, 7);
> >     tcg_gen_addi_tl(k, k, size - 1);
> >
> >     tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done);
> >
> >     tcg_gen_tl_ptr(null_ptr, zero);
> >     tcg_gen_st8_tl(zero, null_ptr, 0);
> >
> >     gen_set_label(done);
> > }
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > However, when I change to using this implementation, I get the following
> error.
> > I have tested it with a trivial hello world implementation for x86_64
> running in
> > qemu-user. It doesn't occur the first time the block is executed,
> therefore I
> > think the issue is caused by the surrounding TCG in the block it is
> injected
> > into?
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> > runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion
> `ts->val_type == TEMP_VAL_REG' failed.
> > Aborted (core dumped)
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > I would be very grateful for any advice of how to resolve this issue, or
> any
> > alternative approaches I could use to optimize my original
> implementation. The
> > code is obviously a very hot path and so even a tiny performance
> improvement
> > could result in a large performance gain overall.
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro
>

[-- Attachment #2: Type: text/html, Size: 10893 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: TCG Address Sanitizer Optimization.
  2025-06-02 15:54   ` Jon Wilson
@ 2025-06-02 15:58     ` Richard Henderson
  2025-06-02 16:26     ` Alex Bennée
  1 sibling, 0 replies; 6+ messages in thread
From: Richard Henderson @ 2025-06-02 15:58 UTC (permalink / raw)
  To: Jon Wilson, Alex Bennée; +Cc: qemu-devel, Pierrick Bouvier

On 6/2/25 16:54, Jon Wilson wrote:
> It would be good if we could have QEMU provide clean APIs to allow the sort of additional 
> instrumentation that fuzzing requires. I guess the qemu-libafl-bridge project show the 
> sort of modification which has been required so far...
> https://github.com/AFLplusplus/qemu-libafl-bridge/tree/main/libafl <https://github.com/ 
> AFLplusplus/qemu-libafl-bridge/tree/main/libafl>
> 
> I would like to conditionally call a helper, or even just insert a breakpoint instruction, 
> but like I say I don't seem to be able to make use of any sort of branches. Even if I add 
> a benign instrumentation that simply conditionally branches at a label and nothing else 
> (e.g. no actual functionality), I still have the same problem.
> e.g.
> 
> ////////////////////////////////////////////////////////////////////////////////
> 
> TCGLabel *done = gen_new_label();
> TCGv addr_val = temp_tcgv_tl(addr);
> TCGv zero = tcg_constant_tl(0);
> tcg_gen_brcond_tl(TCG_COND_EQ, addr_val, zero, done);
> gen_set_label(done);
> 
> ////////////////////////////////////////////////////////////////////////////////
> 
> Hence the current implementation is a little clumsy!

This has not been a problem since 438e685b1, in qemu v8.0.

r~


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: TCG Address Sanitizer Optimization.
  2025-06-02 15:54   ` Jon Wilson
  2025-06-02 15:58     ` Richard Henderson
@ 2025-06-02 16:26     ` Alex Bennée
  2025-06-03  8:20       ` Jon Wilson
  1 sibling, 1 reply; 6+ messages in thread
From: Alex Bennée @ 2025-06-02 16:26 UTC (permalink / raw)
  To: Jon Wilson; +Cc: qemu-devel, Richard Henderson, Pierrick Bouvier

Jon Wilson <jonwilson030981@googlemail.com> writes:

> It would be good if we could have QEMU provide clean APIs to allow the sort of additional instrumentation that fuzzing
> requires. I guess the qemu-libafl-bridge project show the sort of modification which has been required so far...
> https://github.com/AFLplusplus/qemu-libafl-bridge/tree/main/libafl
>
> I would like to conditionally call a helper, or even just insert a breakpoint instruction, but like I say I don't seem to be able to
> make use of any sort of branches.

For TCG plugins we have:

  /**
   * qemu_plugin_register_vcpu_tb_exec_cond_cb() - register conditional callback
   * @tb: the opaque qemu_plugin_tb handle for the translation
   * @cb: callback function
   * @cond: condition to enable callback
   * @entry: first operand for condition
   * @imm: second operand for condition
   * @flags: does the plugin read or write the CPU's registers?
   * @userdata: any plugin data to pass to the @cb?
   *
   * The @cb function is called when a translated unit executes if
   * entry @cond imm is true.
   * If condition is QEMU_PLUGIN_COND_ALWAYS, condition is never interpreted and
   * this function is equivalent to qemu_plugin_register_vcpu_tb_exec_cb.
   * If condition QEMU_PLUGIN_COND_NEVER, condition is never interpreted and
   * callback is never installed.
   */
  QEMU_PLUGIN_API
  void qemu_plugin_register_vcpu_tb_exec_cond_cb(struct qemu_plugin_tb *tb,
                                                 qemu_plugin_vcpu_udata_cb_t cb,
                                                 enum qemu_plugin_cb_flags flags,
                                                 enum qemu_plugin_cond cond,
                                                 qemu_plugin_u64 entry,
                                                 uint64_t imm,
                                                 void *userdata);

Along with qemu_plugin_register_vcpu_insn_exec_cond_cb() for individual
instructions. They are designed work with per-cpu indexed scoreboards
using inline operations (e.g.
qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu) which can either add
or store values.

Currently we inline memory operations are a bit limited:

  /**
   * qemu_plugin_register_vcpu_mem_inline_per_vcpu() - inline op for mem access
   * @insn: handle for instruction to instrument
   * @rw: apply to reads, writes or both
   * @op: the op, of type qemu_plugin_op
   * @entry: entry to run op
   * @imm: immediate data for @op
   *
   * This registers a inline op every memory access generated by the
   * instruction.
   */
  QEMU_PLUGIN_API
  void qemu_plugin_register_vcpu_mem_inline_per_vcpu(
      struct qemu_plugin_insn *insn,
      enum qemu_plugin_mem_rw rw,
      enum qemu_plugin_op op,
      qemu_plugin_u64 entry,
      uint64_t imm);

Although you can get access to the memory information through helpers:

  /**
   * qemu_plugin_register_vcpu_mem_cb() - register memory access callback
   * @insn: handle for instruction to instrument
   * @cb: callback of type qemu_plugin_vcpu_mem_cb_t
   * @flags: (currently unused) callback flags
   * @rw: monitor reads, writes or both
   * @userdata: opaque pointer for userdata
   *
   * This registers a full callback for every memory access generated by
   * an instruction. If the instruction doesn't access memory no
   * callback will be made.
   *
   * The callback reports the vCPU the access took place on, the virtual
   * address of the access and a handle for further queries. The user
   * can attach some userdata to the callback for additional purposes.
   *
   * Other execution threads will continue to execute during the
   * callback so the plugin is responsible for ensuring it doesn't get
   * confused by making appropriate use of locking if required.
   */
  QEMU_PLUGIN_API
  void qemu_plugin_register_vcpu_mem_cb(struct qemu_plugin_insn *insn,
                                        qemu_plugin_vcpu_mem_cb_t cb,
                                        enum qemu_plugin_cb_flags flags,
                                        enum qemu_plugin_mem_rw rw,
                                        void *userdata);

where you can then process qemu_plugin_meminfo_t info in the callback to
get the value or address of a memory operation. I guess you want some
sort of operation to do an inline transform of the address to use a
lookup to compare and branch to. The trick is coming up with an
interface which is general and flexible enough and not just "what asan
needs for a specific use case".

> Even if I add a benign instrumentation that simply conditionally branches at a label and
> nothing else (e.g. no actual functionality), I still have the same problem.
> e.g.
>
> //////////////////////////////////////////////////////////////////////////////// 
>
> TCGLabel *done = gen_new_label();
> TCGv addr_val = temp_tcgv_tl(addr);
> TCGv zero = tcg_constant_tl(0);
> tcg_gen_brcond_tl(TCG_COND_EQ, addr_val, zero, done);
> gen_set_label(done);
>
> //////////////////////////////////////////////////////////////////////////////// 
>
> Hence the current implementation is a little clumsy!
>
> Thanks for your advice.
>
> Jon
>
> On Mon, Jun 2, 2025 at 4:09 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>  Jon Wilson <jonwilson030981@googlemail.com> writes:
>
>  (Adding Richard, the TCG maintainer to CC)
>
>  > I am attempting to optimize some TCG code which I have previously written for
>  > qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge), the 
>  > component used by the LibAFL project when fuzzing binaries using QEMU to 
>  > provide runtime instrumentation. The code is used to write additional TCG into 
>  > basic blocks whenever a load or store operation is performed in order to provide
>  > address sanitizer functionality.
>
>  I would like to figure out if we can push any of this instrumentation
>  into TCG plugins so you can avoid patching QEMU itself. I guess you need
>  something that allows you to hook a memory address into some sort of
>  gadget?
>
>  > Address sanitizer is quite simple and works by mapping each 8-byte region of
>  > address space to single byte within a region called the shadow map. The address
>  > (on a 64-bit platform) of the shadow map for a given address is:
>  >
>  >     Shadow = (Mem >> 3) + 0x7fff8000;
>  >
>  > The value in the shadow map encodes the accessibility of an address:
>  >
>  >     0  - The whole 8 byte region is accessible.
>  >     1 .. 7 - The first n bytes are accessible.
>  >     negative - The whole 8 byte region is inaccessible.
>  >
>  > The following pseudo code shows the algorithm:
>  >
>  > ////////////////////////////////////////////////////////////////////////////////
>  >
>  > https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm
>  >
>  > byte *shadow_address = MemToShadow(address);
>  > byte shadow_value = *shadow_address;
>  > if (shadow_value) {
>  >   if (SlowPathCheck(shadow_value, address, kAccessSize)) {
>  >     ReportError(address, kAccessSize, kIsWrite);
>  >   }
>  > }
>  >
>  > // Check the cases where we access first k bytes of the qword
>  > // and these k bytes are unpoisoned.
>  > bool SlowPathCheck(shadow_value, address, kAccessSize) {
>  >   last_accessed_byte = (address & 7) + kAccessSize - 1;
>  >   return (last_accessed_byte >= shadow_value);
>  > }
>  >
>  > ////////////////////////////////////////////////////////////////////////////////
>  >
>  > My current implementation makes use of conditional move instructions to trigger
>  > a segfault by way of null dereference in the event that the shadow map indicates
>  > that a memory access is invalid.
>  >
>  > ////////////////////////////////////////////////////////////////////////////////
>  >
>  > #if TARGET_LONG_BITS == 32
>  > #define SHADOW_BASE (0x20000000)
>  > #elif TARGET_LONG_BITS == 64
>  > #define SHADOW_BASE (0x7fff8000)
>  > #else
>  > #error Unhandled TARGET_LONG_BITS value
>  > #endif
>  >
>  > void libafl_tcg_gen_asan(TCGTemp * addr, size_t size)
>  > {
>  >     if (size == 0)
>  >         return;
>  >
>  >     TCGv addr_val = temp_tcgv_tl(addr);
>  >     TCGv k = tcg_temp_new();
>  >     TCGv shadow_addr = tcg_temp_new();
>  >     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
>  >     TCGv shadow_val = tcg_temp_new();
>  >     TCGv test_addr = tcg_temp_new();
>  >     TCGv_ptr test_ptr = tcg_temp_new_ptr();
>  >
>  >     tcg_gen_andi_tl(k, addr_val, 7);
>  >     tcg_gen_addi_tl(k, k, size - 1);
>  >
>  >     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
>  >     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
>  >     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
>  >     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
>  >
>  >     /*
>  >      * Making conditional branches here appears to cause QEMU issues with dead
>  >      * temporaries so we will instead avoid branches.
>
>  This sounds like a TCG bug that may have been fixed.
>
>  >      We will cause the guest
>  >      * to perform a NULL dereference in the event of an ASAN fault. Note that
>  >      * we will do this by using a store rather than a load, since the TCG may
>  >      * otherwise determine that the result of the load is unused and simply
>  >      * discard the operation. In the event that the shadow memory doesn't
>  >      * detect a fault, we will simply write the value read from the shadow
>  >      * memory back to it's original location. If, however, the shadow memory
>  >      * detects an invalid access, we will instead attempt to write the value
>  >      * at 0x0.
>  >      */
>
>  Why not conditionally call a helper here? Forcing the guest to actually
>  fault seems a bit hammer like.
>
>  >     tcg_gen_movcond_tl(TCG_COND_EQ, test_addr,
>  >         shadow_val, tcg_constant_tl(0),
>  >         shadow_addr, tcg_constant_tl(0));
>  >
>  >     if (size < 8)
>  >     {
>  >         tcg_gen_movcond_tl(TCG_COND_GE, test_addr,
>  >             k, shadow_val,
>  >             test_addr, shadow_addr);
>  >     }
>  >
>  >     tcg_gen_tl_ptr(test_ptr, test_addr);
>  >     tcg_gen_st8_tl(shadow_val, test_ptr, 0);
>  > }
>  >
>  > ////////////////////////////////////////////////////////////////////////////////
>  >
>  > However, I would like test an implementation more like the following to see how
>  > the performance compares. Whilst this introduces branches, the fast path is much
>  > more likely to be executed than the slow path and hence bypassing the additional
>  > checks and unnecessary memory writes I am hopeful it will improve performance.
>  >
>  > ////////////////////////////////////////////////////////////////////////////////
>  >
>  > void libafl_tcg_gen_asan(TCGTemp* addr, size_t size)
>  > {
>  >     if (size == 0) {
>  >         return;
>  >     }
>  >
>  >     if (size > 8) {
>  >         size = 8;
>  >     }
>  >
>  >     TCGLabel *done = gen_new_label();
>  >
>  >     TCGv addr_val = temp_tcgv_tl(addr);
>  >     TCGv shadow_addr = tcg_temp_new();
>  >     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
>  >     TCGv shadow_val = tcg_temp_new();
>  >     TCGv k = tcg_temp_new();
>  >     TCGv zero = tcg_constant_tl(0);
>  >     TCGv_ptr null_ptr = tcg_temp_new_ptr();
>  >
>  >     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
>  >     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
>  >     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
>  >     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
>  >
>  >     tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done);
>  >
>  >     tcg_gen_andi_tl(k, addr_val, 7);
>  >     tcg_gen_addi_tl(k, k, size - 1);
>  >
>  >     tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done);
>  >
>  >     tcg_gen_tl_ptr(null_ptr, zero);
>  >     tcg_gen_st8_tl(zero, null_ptr, 0);
>  >
>  >     gen_set_label(done);
>  > }
>  >
>  > ////////////////////////////////////////////////////////////////////////////////
>  >
>  > However, when I change to using this implementation, I get the following error.
>  > I have tested it with a trivial hello world implementation for x86_64 running in
>  > qemu-user. It doesn't occur the first time the block is executed, therefore I
>  > think the issue is caused by the surrounding TCG in the block it is injected
>  > into?
>  >
>  > ////////////////////////////////////////////////////////////////////////////////
>  > runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion `ts->val_type == TEMP_VAL_REG' failed.
>  > Aborted (core dumped)
>  > ////////////////////////////////////////////////////////////////////////////////
>  >
>  > I would be very grateful for any advice of how to resolve this issue, or any
>  > alternative approaches I could use to optimize my original implementation. The
>  > code is obviously a very hot path and so even a tiny performance improvement
>  > could result in a large performance gain overall.
>
>  -- 
>  Alex Bennée
>  Virtualisation Tech Lead @ Linaro

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: TCG Address Sanitizer Optimization.
  2025-06-02 16:26     ` Alex Bennée
@ 2025-06-03  8:20       ` Jon Wilson
  0 siblings, 0 replies; 6+ messages in thread
From: Jon Wilson @ 2025-06-03  8:20 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Richard Henderson, Pierrick Bouvier

[-- Attachment #1: Type: text/plain, Size: 15148 bytes --]

I'm not sure what is needed to be fair, I'm kind of focussed on just this
one
aspect of the emulation. The previous implementation of ASAN was to
implement
the shadow map in the host address space and translate all the guest memory
access using g2h using helpers to check the shadow map. This was done
through
necessity since older versions of QEMU didn't work well with allocating the
large sparse memory regions used for the shadow maps.

However, by lowering the shadow map into the guest address space, we can
omit
this additional translation stage and we can avoid the overhead (albeit
small)
of calling a helper function (and any associated register pressure caused
by
needing to put arguments in the correct registers to support the ABI). But
given the
code is on such a hot path, even tiny improvements get significantly
magnified.

I'm hopeful though, that if it is possible to optimize the fast path (e.g.
the
shadow map value is zero) that it may be possible to reduce the performance
overhead of ASAN even further.

Best Regards.

Jon

On Mon, Jun 2, 2025 at 5:26 PM Alex Bennée <alex.bennee@linaro.org> wrote:

> Jon Wilson <jonwilson030981@googlemail.com> writes:
>
> > It would be good if we could have QEMU provide clean APIs to allow the
> sort of additional instrumentation that fuzzing
> > requires. I guess the qemu-libafl-bridge project show the sort of
> modification which has been required so far...
> > https://github.com/AFLplusplus/qemu-libafl-bridge/tree/main/libafl
> >
> > I would like to conditionally call a helper, or even just insert a
> breakpoint instruction, but like I say I don't seem to be able to
> > make use of any sort of branches.
>
> For TCG plugins we have:
>
>   /**
>    * qemu_plugin_register_vcpu_tb_exec_cond_cb() - register conditional
> callback
>    * @tb: the opaque qemu_plugin_tb handle for the translation
>    * @cb: callback function
>    * @cond: condition to enable callback
>    * @entry: first operand for condition
>    * @imm: second operand for condition
>    * @flags: does the plugin read or write the CPU's registers?
>    * @userdata: any plugin data to pass to the @cb?
>    *
>    * The @cb function is called when a translated unit executes if
>    * entry @cond imm is true.
>    * If condition is QEMU_PLUGIN_COND_ALWAYS, condition is never
> interpreted and
>    * this function is equivalent to qemu_plugin_register_vcpu_tb_exec_cb.
>    * If condition QEMU_PLUGIN_COND_NEVER, condition is never interpreted
> and
>    * callback is never installed.
>    */
>   QEMU_PLUGIN_API
>   void qemu_plugin_register_vcpu_tb_exec_cond_cb(struct qemu_plugin_tb *tb,
>
>  qemu_plugin_vcpu_udata_cb_t cb,
>                                                  enum qemu_plugin_cb_flags
> flags,
>                                                  enum qemu_plugin_cond
> cond,
>                                                  qemu_plugin_u64 entry,
>                                                  uint64_t imm,
>                                                  void *userdata);
>
> Along with qemu_plugin_register_vcpu_insn_exec_cond_cb() for individual
> instructions. They are designed work with per-cpu indexed scoreboards
> using inline operations (e.g.
> qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu) which can either add
> or store values.
>
> Currently we inline memory operations are a bit limited:
>
>   /**
>    * qemu_plugin_register_vcpu_mem_inline_per_vcpu() - inline op for mem
> access
>    * @insn: handle for instruction to instrument
>    * @rw: apply to reads, writes or both
>    * @op: the op, of type qemu_plugin_op
>    * @entry: entry to run op
>    * @imm: immediate data for @op
>    *
>    * This registers a inline op every memory access generated by the
>    * instruction.
>    */
>   QEMU_PLUGIN_API
>   void qemu_plugin_register_vcpu_mem_inline_per_vcpu(
>       struct qemu_plugin_insn *insn,
>       enum qemu_plugin_mem_rw rw,
>       enum qemu_plugin_op op,
>       qemu_plugin_u64 entry,
>       uint64_t imm);
>
> Although you can get access to the memory information through helpers:
>
>   /**
>    * qemu_plugin_register_vcpu_mem_cb() - register memory access callback
>    * @insn: handle for instruction to instrument
>    * @cb: callback of type qemu_plugin_vcpu_mem_cb_t
>    * @flags: (currently unused) callback flags
>    * @rw: monitor reads, writes or both
>    * @userdata: opaque pointer for userdata
>    *
>    * This registers a full callback for every memory access generated by
>    * an instruction. If the instruction doesn't access memory no
>    * callback will be made.
>    *
>    * The callback reports the vCPU the access took place on, the virtual
>    * address of the access and a handle for further queries. The user
>    * can attach some userdata to the callback for additional purposes.
>    *
>    * Other execution threads will continue to execute during the
>    * callback so the plugin is responsible for ensuring it doesn't get
>    * confused by making appropriate use of locking if required.
>    */
>   QEMU_PLUGIN_API
>   void qemu_plugin_register_vcpu_mem_cb(struct qemu_plugin_insn *insn,
>                                         qemu_plugin_vcpu_mem_cb_t cb,
>                                         enum qemu_plugin_cb_flags flags,
>                                         enum qemu_plugin_mem_rw rw,
>                                         void *userdata);
>
> where you can then process qemu_plugin_meminfo_t info in the callback to
> get the value or address of a memory operation. I guess you want some
> sort of operation to do an inline transform of the address to use a
> lookup to compare and branch to. The trick is coming up with an
> interface which is general and flexible enough and not just "what asan
> needs for a specific use case".
>
> > Even if I add a benign instrumentation that simply conditionally
> branches at a label and
> > nothing else (e.g. no actual functionality), I still have the same
> problem.
> > e.g.
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
>
> >
> > TCGLabel *done = gen_new_label();
> > TCGv addr_val = temp_tcgv_tl(addr);
> > TCGv zero = tcg_constant_tl(0);
> > tcg_gen_brcond_tl(TCG_COND_EQ, addr_val, zero, done);
> > gen_set_label(done);
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
>
> >
> > Hence the current implementation is a little clumsy!
> >
> > Thanks for your advice.
> >
> > Jon
> >
> > On Mon, Jun 2, 2025 at 4:09 PM Alex Bennée <alex.bennee@linaro.org>
> wrote:
> >
> >  Jon Wilson <jonwilson030981@googlemail.com> writes:
> >
> >  (Adding Richard, the TCG maintainer to CC)
> >
> >  > I am attempting to optimize some TCG code which I have previously
> written for
> >  > qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge),
> the
> >  > component used by the LibAFL project when fuzzing binaries using QEMU
> to
> >  > provide runtime instrumentation. The code is used to write additional
> TCG into
> >  > basic blocks whenever a load or store operation is performed in order
> to provide
> >  > address sanitizer functionality.
> >
> >  I would like to figure out if we can push any of this instrumentation
> >  into TCG plugins so you can avoid patching QEMU itself. I guess you need
> >  something that allows you to hook a memory address into some sort of
> >  gadget?
> >
> >  > Address sanitizer is quite simple and works by mapping each 8-byte
> region of
> >  > address space to single byte within a region called the shadow map.
> The address
> >  > (on a 64-bit platform) of the shadow map for a given address is:
> >  >
> >  >     Shadow = (Mem >> 3) + 0x7fff8000;
> >  >
> >  > The value in the shadow map encodes the accessibility of an address:
> >  >
> >  >     0  - The whole 8 byte region is accessible.
> >  >     1 .. 7 - The first n bytes are accessible.
> >  >     negative - The whole 8 byte region is inaccessible.
> >  >
> >  > The following pseudo code shows the algorithm:
> >  >
> >  >
> ////////////////////////////////////////////////////////////////////////////////
> >  >
> >  > https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm
> >  >
> >  > byte *shadow_address = MemToShadow(address);
> >  > byte shadow_value = *shadow_address;
> >  > if (shadow_value) {
> >  >   if (SlowPathCheck(shadow_value, address, kAccessSize)) {
> >  >     ReportError(address, kAccessSize, kIsWrite);
> >  >   }
> >  > }
> >  >
> >  > // Check the cases where we access first k bytes of the qword
> >  > // and these k bytes are unpoisoned.
> >  > bool SlowPathCheck(shadow_value, address, kAccessSize) {
> >  >   last_accessed_byte = (address & 7) + kAccessSize - 1;
> >  >   return (last_accessed_byte >= shadow_value);
> >  > }
> >  >
> >  >
> ////////////////////////////////////////////////////////////////////////////////
> >  >
> >  > My current implementation makes use of conditional move instructions
> to trigger
> >  > a segfault by way of null dereference in the event that the shadow
> map indicates
> >  > that a memory access is invalid.
> >  >
> >  >
> ////////////////////////////////////////////////////////////////////////////////
> >  >
> >  > #if TARGET_LONG_BITS == 32
> >  > #define SHADOW_BASE (0x20000000)
> >  > #elif TARGET_LONG_BITS == 64
> >  > #define SHADOW_BASE (0x7fff8000)
> >  > #else
> >  > #error Unhandled TARGET_LONG_BITS value
> >  > #endif
> >  >
> >  > void libafl_tcg_gen_asan(TCGTemp * addr, size_t size)
> >  > {
> >  >     if (size == 0)
> >  >         return;
> >  >
> >  >     TCGv addr_val = temp_tcgv_tl(addr);
> >  >     TCGv k = tcg_temp_new();
> >  >     TCGv shadow_addr = tcg_temp_new();
> >  >     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
> >  >     TCGv shadow_val = tcg_temp_new();
> >  >     TCGv test_addr = tcg_temp_new();
> >  >     TCGv_ptr test_ptr = tcg_temp_new_ptr();
> >  >
> >  >     tcg_gen_andi_tl(k, addr_val, 7);
> >  >     tcg_gen_addi_tl(k, k, size - 1);
> >  >
> >  >     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
> >  >     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
> >  >     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
> >  >     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
> >  >
> >  >     /*
> >  >      * Making conditional branches here appears to cause QEMU issues
> with dead
> >  >      * temporaries so we will instead avoid branches.
> >
> >  This sounds like a TCG bug that may have been fixed.
> >
> >  >      We will cause the guest
> >  >      * to perform a NULL dereference in the event of an ASAN fault.
> Note that
> >  >      * we will do this by using a store rather than a load, since the
> TCG may
> >  >      * otherwise determine that the result of the load is unused and
> simply
> >  >      * discard the operation. In the event that the shadow memory
> doesn't
> >  >      * detect a fault, we will simply write the value read from the
> shadow
> >  >      * memory back to it's original location. If, however, the shadow
> memory
> >  >      * detects an invalid access, we will instead attempt to write
> the value
> >  >      * at 0x0.
> >  >      */
> >
> >  Why not conditionally call a helper here? Forcing the guest to actually
> >  fault seems a bit hammer like.
> >
> >  >     tcg_gen_movcond_tl(TCG_COND_EQ, test_addr,
> >  >         shadow_val, tcg_constant_tl(0),
> >  >         shadow_addr, tcg_constant_tl(0));
> >  >
> >  >     if (size < 8)
> >  >     {
> >  >         tcg_gen_movcond_tl(TCG_COND_GE, test_addr,
> >  >             k, shadow_val,
> >  >             test_addr, shadow_addr);
> >  >     }
> >  >
> >  >     tcg_gen_tl_ptr(test_ptr, test_addr);
> >  >     tcg_gen_st8_tl(shadow_val, test_ptr, 0);
> >  > }
> >  >
> >  >
> ////////////////////////////////////////////////////////////////////////////////
> >  >
> >  > However, I would like test an implementation more like the following
> to see how
> >  > the performance compares. Whilst this introduces branches, the fast
> path is much
> >  > more likely to be executed than the slow path and hence bypassing the
> additional
> >  > checks and unnecessary memory writes I am hopeful it will improve
> performance.
> >  >
> >  >
> ////////////////////////////////////////////////////////////////////////////////
> >  >
> >  > void libafl_tcg_gen_asan(TCGTemp* addr, size_t size)
> >  > {
> >  >     if (size == 0) {
> >  >         return;
> >  >     }
> >  >
> >  >     if (size > 8) {
> >  >         size = 8;
> >  >     }
> >  >
> >  >     TCGLabel *done = gen_new_label();
> >  >
> >  >     TCGv addr_val = temp_tcgv_tl(addr);
> >  >     TCGv shadow_addr = tcg_temp_new();
> >  >     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
> >  >     TCGv shadow_val = tcg_temp_new();
> >  >     TCGv k = tcg_temp_new();
> >  >     TCGv zero = tcg_constant_tl(0);
> >  >     TCGv_ptr null_ptr = tcg_temp_new_ptr();
> >  >
> >  >     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
> >  >     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
> >  >     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
> >  >     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
> >  >
> >  >     tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done);
> >  >
> >  >     tcg_gen_andi_tl(k, addr_val, 7);
> >  >     tcg_gen_addi_tl(k, k, size - 1);
> >  >
> >  >     tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done);
> >  >
> >  >     tcg_gen_tl_ptr(null_ptr, zero);
> >  >     tcg_gen_st8_tl(zero, null_ptr, 0);
> >  >
> >  >     gen_set_label(done);
> >  > }
> >  >
> >  >
> ////////////////////////////////////////////////////////////////////////////////
> >  >
> >  > However, when I change to using this implementation, I get the
> following error.
> >  > I have tested it with a trivial hello world implementation for x86_64
> running in
> >  > qemu-user. It doesn't occur the first time the block is executed,
> therefore I
> >  > think the issue is caused by the surrounding TCG in the block it is
> injected
> >  > into?
> >  >
> >  >
> ////////////////////////////////////////////////////////////////////////////////
> >  > runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion
> `ts->val_type == TEMP_VAL_REG' failed.
> >  > Aborted (core dumped)
> >  >
> ////////////////////////////////////////////////////////////////////////////////
> >  >
> >  > I would be very grateful for any advice of how to resolve this issue,
> or any
> >  > alternative approaches I could use to optimize my original
> implementation. The
> >  > code is obviously a very hot path and so even a tiny performance
> improvement
> >  > could result in a large performance gain overall.
> >
> >  --
> >  Alex Bennée
> >  Virtualisation Tech Lead @ Linaro
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro
>

[-- Attachment #2: Type: text/html, Size: 18646 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-06-03  8:20 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-02 12:02 TCG Address Sanitizer Optimization Jon Wilson
2025-06-02 15:09 ` Alex Bennée
2025-06-02 15:54   ` Jon Wilson
2025-06-02 15:58     ` Richard Henderson
2025-06-02 16:26     ` Alex Bennée
2025-06-03  8:20       ` Jon Wilson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).