qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] trace/simple: seperate hot paths of tracing fucntions
@ 2025-05-19 18:51 Tanish Desai
  2025-05-20 19:01 ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Tanish Desai @ 2025-05-19 18:51 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, Tanish Desai

Remove hot paths from .c file and added it in .h file to keep it inline.

Signed-off-by: Tanish Desai <tanishdesai37@gmail.com>
---
 scripts/tracetool/backend/simple.py | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/scripts/tracetool/backend/simple.py b/scripts/tracetool/backend/simple.py
index a74d61fcd6..2688d4b64b 100644
--- a/scripts/tracetool/backend/simple.py
+++ b/scripts/tracetool/backend/simple.py
@@ -36,8 +36,17 @@ def generate_h_begin(events, group):
 
 
 def generate_h(event, group):
-    out('    _simple_%(api)s(%(args)s);',
+    event_id = 'TRACE_' + event.name.upper()
+    if "vcpu" in event.properties:
+        # already checked on the generic format code
+        cond = "true"
+    else:
+        cond = "trace_event_get_state(%s)" % event_id
+    out('    if (%(cond)s) {',
+        '        _simple_%(api)s(%(args)s);',
+        '    }',
         api=event.api(),
+        cond=cond,
         args=", ".join(event.args.names()))
 
 
@@ -72,22 +81,10 @@ def generate_c(event, group):
     if len(event.args) == 0:
         sizestr = '0'
 
-    event_id = 'TRACE_' + event.name.upper()
-    if "vcpu" in event.properties:
-        # already checked on the generic format code
-        cond = "true"
-    else:
-        cond = "trace_event_get_state(%s)" % event_id
-
     out('',
-        '    if (!%(cond)s) {',
-        '        return;',
-        '    }',
-        '',
         '    if (trace_record_start(&rec, %(event_obj)s.id, %(size_str)s)) {',
         '        return; /* Trace Buffer Full, Event Dropped ! */',
         '    }',
-        cond=cond,
         event_obj=event.api(event.QEMU_EVENT),
         size_str=sizestr)
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions
  2025-05-19 18:51 [PATCH] trace/simple: seperate hot paths of tracing fucntions Tanish Desai
@ 2025-05-20 19:01 ` Stefan Hajnoczi
  2025-05-20 20:51   ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2025-05-20 19:01 UTC (permalink / raw)
  To: Tanish Desai; +Cc: qemu-devel, pbonzini, Mads Ynddal

On Mon, May 19, 2025 at 2:52 PM Tanish Desai <tanishdesai37@gmail.com> wrote:
>
> Remove hot paths from .c file and added it in .h file to keep it inline.

Please include performance results in the commit description so it's
clear what impact this change has.

>
> Signed-off-by: Tanish Desai <tanishdesai37@gmail.com>
> ---
>  scripts/tracetool/backend/simple.py | 23 ++++++++++-------------

Please use scripts/get_maintainer.pl to find the emails to CC:

  $ scripts/get_maintainer.pl -f scripts/tracetool/backend/simple.py
  Stefan Hajnoczi <stefanha@redhat.com> (maintainer:Tracing)
  Mads Ynddal <mads@ynddal.dk> (reviewer:Tracing)
  qemu-devel@nongnu.org (open list:All patches CC here)

qemu-devel is a high-traffic list and maintainers may not see your
patches unless you CC them.

>  1 file changed, 10 insertions(+), 13 deletions(-)
>
> diff --git a/scripts/tracetool/backend/simple.py b/scripts/tracetool/backend/simple.py
> index a74d61fcd6..2688d4b64b 100644
> --- a/scripts/tracetool/backend/simple.py
> +++ b/scripts/tracetool/backend/simple.py
> @@ -36,8 +36,17 @@ def generate_h_begin(events, group):
>
>
>  def generate_h(event, group):
> -    out('    _simple_%(api)s(%(args)s);',
> +    event_id = 'TRACE_' + event.name.upper()
> +    if "vcpu" in event.properties:
> +        # already checked on the generic format code
> +        cond = "true"
> +    else:
> +        cond = "trace_event_get_state(%s)" % event_id
> +    out('    if (%(cond)s) {',
> +        '        _simple_%(api)s(%(args)s);',
> +        '    }',
>          api=event.api(),
> +        cond=cond,
>          args=", ".join(event.args.names()))
>
>
> @@ -72,22 +81,10 @@ def generate_c(event, group):
>      if len(event.args) == 0:
>          sizestr = '0'
>
> -    event_id = 'TRACE_' + event.name.upper()
> -    if "vcpu" in event.properties:
> -        # already checked on the generic format code
> -        cond = "true"
> -    else:
> -        cond = "trace_event_get_state(%s)" % event_id
> -
>      out('',
> -        '    if (!%(cond)s) {',
> -        '        return;',
> -        '    }',
> -        '',
>          '    if (trace_record_start(&rec, %(event_obj)s.id, %(size_str)s)) {',
>          '        return; /* Trace Buffer Full, Event Dropped ! */',
>          '    }',
> -        cond=cond,
>          event_obj=event.api(event.QEMU_EVENT),
>          size_str=sizestr)
>
> --
> 2.34.1
>
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions
  2025-05-20 19:01 ` Stefan Hajnoczi
@ 2025-05-20 20:51   ` Paolo Bonzini
  2025-05-21 18:16     ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2025-05-20 20:51 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Tanish Desai, qemu-devel, Mads Ynddal

[-- Attachment #1: Type: text/plain, Size: 3973 bytes --]

Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefanha@gmail.com> ha scritto:

> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <tanishdesai37@gmail.com>
> wrote:
> >
> > Remove hot paths from .c file and added it in .h file to keep it inline.
>
> Please include performance results in the commit description so it's
> clear what impact this change has.
>

Hi Stefan,

I am replying because I take the blame for this :) and as an example of how
he could interact with the maintainer.

For now we mostly looked at differences between the code that tracetool
generates for the various backends, and the observation that some of them
put code in the .h and some in the .c file. I explained to Tanish the
concept of separating hot vs cold code in theory, showed him some examples
in QEMU where performance measurements were done in the past, and suggested
applying this to various backends (starting with the one with the weirdest
choice!). However we didn't do any measurement yet.

Some possibilities that come to mind:

1) maybe the coroutine benchmarks are enough to show a difference, with
some luck

2) a new microbenchmark (or a set of microbenchmarks that try various level
of register pressure around the tracepoint), which would be nice to have
anyway

3) perhaps we could try to check the code size for some object files in
block/ (for example libblock.a.p/*.c.o), as a proxy for how much
instruction cache space is saved when all tracepoints are disabled

We can start from 3, but also try 1+3 and 2+3 if it fails if you think that
would be more persuasive.

Thanks,

Paolo

>
> > Signed-off-by: Tanish Desai <tanishdesai37@gmail.com>
> > ---
> >  scripts/tracetool/backend/simple.py | 23 ++++++++++-------------
>
> Please use scripts/get_maintainer.pl to find the emails to CC:
>
>   $ scripts/get_maintainer.pl -f scripts/tracetool/backend/simple.py
>   Stefan Hajnoczi <stefanha@redhat.com> (maintainer:Tracing)
>   Mads Ynddal <mads@ynddal.dk> (reviewer:Tracing)
>   qemu-devel@nongnu.org (open list:All patches CC here)
>
> qemu-devel is a high-traffic list and maintainers may not see your
> patches unless you CC them.
>
> >  1 file changed, 10 insertions(+), 13 deletions(-)
> >
> > diff --git a/scripts/tracetool/backend/simple.py
> b/scripts/tracetool/backend/simple.py
> > index a74d61fcd6..2688d4b64b 100644
> > --- a/scripts/tracetool/backend/simple.py
> > +++ b/scripts/tracetool/backend/simple.py
> > @@ -36,8 +36,17 @@ def generate_h_begin(events, group):
> >
> >
> >  def generate_h(event, group):
> > -    out('    _simple_%(api)s(%(args)s);',
> > +    event_id = 'TRACE_' + event.name.upper()
> > +    if "vcpu" in event.properties:
> > +        # already checked on the generic format code
> > +        cond = "true"
> > +    else:
> > +        cond = "trace_event_get_state(%s)" % event_id
> > +    out('    if (%(cond)s) {',
> > +        '        _simple_%(api)s(%(args)s);',
> > +        '    }',
> >          api=event.api(),
> > +        cond=cond,
> >          args=", ".join(event.args.names()))
> >
> >
> > @@ -72,22 +81,10 @@ def generate_c(event, group):
> >      if len(event.args) == 0:
> >          sizestr = '0'
> >
> > -    event_id = 'TRACE_' + event.name.upper()
> > -    if "vcpu" in event.properties:
> > -        # already checked on the generic format code
> > -        cond = "true"
> > -    else:
> > -        cond = "trace_event_get_state(%s)" % event_id
> > -
> >      out('',
> > -        '    if (!%(cond)s) {',
> > -        '        return;',
> > -        '    }',
> > -        '',
> >          '    if (trace_record_start(&rec, %(event_obj)s.id,
> %(size_str)s)) {',
> >          '        return; /* Trace Buffer Full, Event Dropped ! */',
> >          '    }',
> > -        cond=cond,
> >          event_obj=event.api(event.QEMU_EVENT),
> >          size_str=sizestr)
> >
> > --
> > 2.34.1
> >
> >
>
>

[-- Attachment #2: Type: text/html, Size: 6427 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions
  2025-05-20 20:51   ` Paolo Bonzini
@ 2025-05-21 18:16     ` Stefan Hajnoczi
  2025-05-27  2:51       ` Tanish Desai
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2025-05-21 18:16 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Tanish Desai, qemu-devel, Mads Ynddal

On Tue, May 20, 2025 at 4:52 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefanha@gmail.com> ha scritto:
>>
>> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <tanishdesai37@gmail.com> wrote:
>> >
>> > Remove hot paths from .c file and added it in .h file to keep it inline.
>>
>> Please include performance results in the commit description so it's
>> clear what impact this change has.
>
>
> Hi Stefan,
>
> I am replying because I take the blame for this :) and as an example of how he could interact with the maintainer.
>
> For now we mostly looked at differences between the code that tracetool generates for the various backends, and the observation that some of them put code in the .h and some in the .c file. I explained to Tanish the concept of separating hot vs cold code in theory, showed him some examples in QEMU where performance measurements were done in the past, and suggested applying this to various backends (starting with the one with the weirdest choice!). However we didn't do any measurement yet.
>
> Some possibilities that come to mind:
>
> 1) maybe the coroutine benchmarks are enough to show a difference, with some luck
>
> 2) a new microbenchmark (or a set of microbenchmarks that try various level of register pressure around the tracepoint), which would be nice to have anyway

2 is going above and beyond :). I'm not trying to make life hard.

Another option if 1 or 2 are difficult to quantify: I'd be happy with
just a "before" and "after" disassembly snippet from objdump showing
that the instructions at a call site changed as intended.

What I'm looking for is a commit description that explains the purpose
of the commit followed by, since this is a performance improvement,
some form of evidence that the change achieved its goal. At that point
it's easy for me to merge because it has a justification and the
nature of the commit will be clear to anyone looking at the git log in
the future.

> 3) perhaps we could try to check the code size for some object files in block/ (for example libblock.a.p/*.c.o), as a proxy for how much instruction cache space is saved when all tracepoints are disabled
>
> We can start from 3, but also try 1+3 and 2+3 if it fails if you think that would be more persuasive.

Can you explain 3 more? Total code size on disk should stay about the
same because most trace events only have one call site. Moving code
between the caller and callee doesn't make a big difference in either
direction. At runtime there is a benefit from inlining the condition
check since the call site no longer needs to execute the callee's
function prologue/epilogue when the trace event is runtime-disabled,
but that CPU cycle and instruction cache effect won't be visible from
the on-disk code size.

Thanks,
Stefan


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions
  2025-05-21 18:16     ` Stefan Hajnoczi
@ 2025-05-27  2:51       ` Tanish Desai
  2025-05-28 18:06         ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Tanish Desai @ 2025-05-27  2:51 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Paolo Bonzini, qemu-devel, Mads Ynddal

[-- Attachment #1: Type: text/plain, Size: 5383 bytes --]

> Can you explain 3 more?
Pablo and I have looked into object files for one of default trace point "
trace_vhost_commit"(first tracepoint of hw/virtio) before and after making
this change and we found that by moves the hot-path check for
_simple_trace_vhost_commit into the caller (in the .h file) and leaves the
full stack-frame prologue and tracepoint body in the cold path. By inlining
only the if (trace_events_enabled()) test at the call site, we eliminate
the 11-instruction prologue overhead (stack allocation, canary load/store,
register spills, and argument masking) on every vhost commit(trace call)
when tracing is disabled.
The old disassembled code looks like this:-
0x10 stp x29, x30, [sp, #-64]! *Prologue:* allocates 64-byte frame and
saves old FP (x29) & LR (x30)
0x14 adrp x3, trace_events_enabled_count *Prologue:* computes page-base of
the trace-enable counter
0x18 adrp x2, __stack_chk_guard *Important (maybe prolog don't
know?)(stack-protector):* starts up the stack-canary load
0x1c mov x29, sp *Prologue:* sets new frame pointer
0x20 ldr x3, [x3] *Prologue:* loads the actual trace-enabled count
0x24 stp x19, x20, [sp, #16] *Prologue:* spills callee-saved regs used by
this function (x19, x20)
0x28 and w20, w0, #0xff *Tracepoint setup:* extracts the low-8 bits of arg0
as the “event boolean”
0x2c ldr x2, [x2] *Prologue (cont’d):* completes loading of the
stack-canary value
0x30 and w19, w1, #0xff *Tracepoint setup:* extracts low-8 bits of arg1
0x34 ldr w0, [x3] *Important:* loads the current trace-enabled flag from
memory
0x38 ldr x1, [x2] *Prologue (cont’d):* reads the canary
0x3c str x1, [sp, #56] *Prologue (cont’d):* writes the canary into the new
frame
0x40 mov x1, #0 *Prologue (cont’d):* zeroes out x1 for the upcoming branch
test
0x44 cbnz w0, 0x88 *Important:* if tracing is disabled (w0==0) skip the
heavy path entirely
Saving 11/14 instructions!Also We have not considered epilog
instructional saves
would be definitely more than 11.
Old flow: Every call does ~11 insns of prologue + tracepoint check, even if
tracing is disabled.
New flow: We inline a tiny if (trace_enabled) at the caller; only when it’s
true do you call into a full function with its prologue.(Extra instructions)

On Wed, May 21, 2025 at 11:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:

> On Tue, May 20, 2025 at 4:52 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefanha@gmail.com> ha
> scritto:
> >>
> >> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <tanishdesai37@gmail.com>
> wrote:
> >> >
> >> > Remove hot paths from .c file and added it in .h file to keep it
> inline.
> >>
> >> Please include performance results in the commit description so it's
> >> clear what impact this change has.
> >
> >
> > Hi Stefan,
> >
> > I am replying because I take the blame for this :) and as an example of
> how he could interact with the maintainer.
> >
> > For now we mostly looked at differences between the code that tracetool
> generates for the various backends, and the observation that some of them
> put code in the .h and some in the .c file. I explained to Tanish the
> concept of separating hot vs cold code in theory, showed him some examples
> in QEMU where performance measurements were done in the past, and suggested
> applying this to various backends (starting with the one with the weirdest
> choice!). However we didn't do any measurement yet.
> >
> > Some possibilities that come to mind:
> >
> > 1) maybe the coroutine benchmarks are enough to show a difference, with
> some luck
> >
> > 2) a new microbenchmark (or a set of microbenchmarks that try various
> level of register pressure around the tracepoint), which would be nice to
> have anyway
>
> 2 is going above and beyond :). I'm not trying to make life hard.
>
> Another option if 1 or 2 are difficult to quantify: I'd be happy with
> just a "before" and "after" disassembly snippet from objdump showing
> that the instructions at a call site changed as intended.
>
> What I'm looking for is a commit description that explains the purpose
> of the commit followed by, since this is a performance improvement,
> some form of evidence that the change achieved its goal. At that point
> it's easy for me to merge because it has a justification and the
> nature of the commit will be clear to anyone looking at the git log in
> the future.
>
> > 3) perhaps we could try to check the code size for some object files in
> block/ (for example libblock.a.p/*.c.o), as a proxy for how much
> instruction cache space is saved when all tracepoints are disabled
> >
> > We can start from 3, but also try 1+3 and 2+3 if it fails if you think
> that would be more persuasive.
>
> Can you explain 3 more? Total code size on disk should stay about the
> same because most trace events only have one call site. Moving code
> between the caller and callee doesn't make a big difference in either
> direction. At runtime there is a benefit from inlining the condition
> check since the call site no longer needs to execute the callee's
> function prologue/epilogue when the trace event is runtime-disabled,
> but that CPU cycle and instruction cache effect won't be visible from
> the on-disk code size.
>
> Thanks,
> Stefan
>

[-- Attachment #2: Type: text/html, Size: 7781 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions
  2025-05-27  2:51       ` Tanish Desai
@ 2025-05-28 18:06         ` Stefan Hajnoczi
  2025-05-28 19:30           ` Tanish Desai
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2025-05-28 18:06 UTC (permalink / raw)
  To: Tanish Desai; +Cc: Paolo Bonzini, qemu-devel, Mads Ynddal

On Mon, May 26, 2025 at 10:51 PM Tanish Desai <tanishdesai37@gmail.com> wrote:
>
> > Can you explain 3 more?
> Pablo and I have looked into object files for one of default trace point "trace_vhost_commit"(first tracepoint of hw/virtio) before and after making this change and we found that by moves the hot-path check for _simple_trace_vhost_commit into the caller (in the .h file) and leaves the full stack-frame prologue and tracepoint body in the cold path. By inlining only the if (trace_events_enabled()) test at the call site, we eliminate the 11-instruction prologue overhead (stack allocation, canary load/store, register spills, and argument masking) on every vhost commit(trace call) when tracing is disabled.
> The old disassembled code looks like this:-
> 0x10stp x29, x30, [sp, #-64]!Prologue: allocates 64-byte frame and saves old FP (x29) & LR (x30)
> 0x14adrp x3, trace_events_enabled_countPrologue: computes page-base of the trace-enable counter
> 0x18adrp x2, __stack_chk_guardImportant (maybe prolog don't know?)(stack-protector): starts up the stack-canary load
> 0x1cmov x29, spPrologue: sets new frame pointer
> 0x20ldr x3, [x3]Prologue: loads the actual trace-enabled count
> 0x24stp x19, x20, [sp, #16]Prologue: spills callee-saved regs used by this function (x19, x20)
> 0x28and w20, w0, #0xffTracepoint setup: extracts the low-8 bits of arg0 as the “event boolean”
> 0x2cldr x2, [x2]Prologue (cont’d): completes loading of the stack-canary value
> 0x30and w19, w1, #0xffTracepoint setup: extracts low-8 bits of arg1
> 0x34ldr w0, [x3]Important: loads the current trace-enabled flag from memory
> 0x38ldr x1, [x2]Prologue (cont’d): reads the canary
> 0x3cstr x1, [sp, #56]Prologue (cont’d): writes the canary into the new frame
> 0x40mov x1, #0Prologue (cont’d): zeroes out x1 for the upcoming branch test
> 0x44cbnz w0, 0x88Important: if tracing is disabled (w0==0) skip the heavy path entirely
> Saving 11/14 instructions!Also We have not considered epilog instructional saves would be definitely more than 11.
> Old flow: Every call does ~11 insns of prologue + tracepoint check, even if tracing is disabled.
> New flow: We inline a tiny if (trace_enabled) at the caller; only when it’s true do you call into a full function with its prologue.(Extra instructions)

Good, inlining worked as intended.

Please send a v2 with an updated commit description explaining the
purpose and with a disassembly snippet of the trace_vhost_commit()
call site showing what happens.

Thanks!

Stefan

>
> On Wed, May 21, 2025 at 11:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>
>> On Tue, May 20, 2025 at 4:52 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>> > Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefanha@gmail.com> ha scritto:
>> >>
>> >> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <tanishdesai37@gmail.com> wrote:
>> >> >
>> >> > Remove hot paths from .c file and added it in .h file to keep it inline.
>> >>
>> >> Please include performance results in the commit description so it's
>> >> clear what impact this change has.
>> >
>> >
>> > Hi Stefan,
>> >
>> > I am replying because I take the blame for this :) and as an example of how he could interact with the maintainer.
>> >
>> > For now we mostly looked at differences between the code that tracetool generates for the various backends, and the observation that some of them put code in the .h and some in the .c file. I explained to Tanish the concept of separating hot vs cold code in theory, showed him some examples in QEMU where performance measurements were done in the past, and suggested applying this to various backends (starting with the one with the weirdest choice!). However we didn't do any measurement yet.
>> >
>> > Some possibilities that come to mind:
>> >
>> > 1) maybe the coroutine benchmarks are enough to show a difference, with some luck
>> >
>> > 2) a new microbenchmark (or a set of microbenchmarks that try various level of register pressure around the tracepoint), which would be nice to have anyway
>>
>> 2 is going above and beyond :). I'm not trying to make life hard.
>>
>> Another option if 1 or 2 are difficult to quantify: I'd be happy with
>> just a "before" and "after" disassembly snippet from objdump showing
>> that the instructions at a call site changed as intended.
>>
>> What I'm looking for is a commit description that explains the purpose
>> of the commit followed by, since this is a performance improvement,
>> some form of evidence that the change achieved its goal. At that point
>> it's easy for me to merge because it has a justification and the
>> nature of the commit will be clear to anyone looking at the git log in
>> the future.
>>
>> > 3) perhaps we could try to check the code size for some object files in block/ (for example libblock.a.p/*.c.o), as a proxy for how much instruction cache space is saved when all tracepoints are disabled
>> >
>> > We can start from 3, but also try 1+3 and 2+3 if it fails if you think that would be more persuasive.
>>
>> Can you explain 3 more? Total code size on disk should stay about the
>> same because most trace events only have one call site. Moving code
>> between the caller and callee doesn't make a big difference in either
>> direction. At runtime there is a benefit from inlining the condition
>> check since the call site no longer needs to execute the callee's
>> function prologue/epilogue when the trace event is runtime-disabled,
>> but that CPU cycle and instruction cache effect won't be visible from
>> the on-disk code size.
>>
>> Thanks,
>> Stefan


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] trace/simple: seperate hot paths of tracing fucntions
@ 2025-05-28 19:25 Tanish Desai
  2025-06-02 22:25 ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Tanish Desai @ 2025-05-28 19:25 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, stefanha, mads, Tanish Desai

This change improves performance by moving the hot path of the trace_vhost_commit()(or any other trace function) logic to the header file.
Previously, even when the trace event was disabled, the function call chain:-
	trace_vhost_commit()(Or any other trace function) →  _nocheck__trace_vhost_commit() →  _simple_trace_vhost_commit()
	incurred a significant function prologue overhead before checking the trace state.

Disassembly of _simple_trace_vhost_commit() (from the .c file) showed that 11 out of the first 14 instructions were prologue-related, including:
0x10	stp x29, x30, [sp, #-64]!	Prologue: allocates 64-byte frame and saves old FP (x29) & LR (x30)
0x14	adrp x3, trace_events_enabled_count	Prologue: computes page-base of the trace-enable counter
0x18	adrp x2, __stack_chk_guard	Important (maybe prolog don't know?)(stack-protector): starts up the stack-canary load
0x1c	mov x29, sp	Prologue: sets new frame pointer
0x20	ldr x3, [x3]	Prologue: loads the actual trace-enabled count
0x24	stp x19, x20, [sp, #16]	Prologue: spills callee-saved regs used by this function (x19, x20)
0x28	and w20, w0, #0xff	Tracepoint setup: extracts the low-8 bits of arg0 as the “event boolean”
0x2c	ldr x2, [x2]	Prologue (cont’d): completes loading of the stack-canary value
0x30	and w19, w1, #0xff	Tracepoint setup: extracts low-8 bits of arg1
0x34	ldr w0, [x3]	Important: loads the current trace-enabled flag from memory
0x38	ldr x1, [x2]	Prologue (cont’d): reads the canary
0x3c	str x1, [sp, #56]	Prologue (cont’d): writes the canary into the new frame
0x40	mov x1, #0	Prologue (cont’d): zeroes out x1 for the upcoming branch test
0x44	cbnz w0, 0x88	Important: if tracing is disabled (w0==0) skip the heavy path entirely

The trace-enabled check happens after the prologue. This is wasteful when tracing is disabled, which is often the case in production.
To optimize this:
_nocheck__trace_vhost_commit() is now fully inlined in the .h file with
the hot path.It checks trace_event_get_state() before calling into _simple_trace_vhost_commit(), which remains in .c.
This avoids calling into the .c function altogether when the tracepoint is disabled, thereby skipping unnecessary prologue instructions.

This results in better performance by removing redundant instructions in the tracing fast path.

Signed-off-by: Tanish Desai <tanishdesai37@gmail.com>
---
 scripts/tracetool/backend/simple.py | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/scripts/tracetool/backend/simple.py b/scripts/tracetool/backend/simple.py
index a74d61fcd6..2688d4b64b 100644
--- a/scripts/tracetool/backend/simple.py
+++ b/scripts/tracetool/backend/simple.py
@@ -36,8 +36,17 @@ def generate_h_begin(events, group):
 
 
 def generate_h(event, group):
-    out('    _simple_%(api)s(%(args)s);',
+    event_id = 'TRACE_' + event.name.upper()
+    if "vcpu" in event.properties:
+        # already checked on the generic format code
+        cond = "true"
+    else:
+        cond = "trace_event_get_state(%s)" % event_id
+    out('    if (%(cond)s) {',
+        '        _simple_%(api)s(%(args)s);',
+        '    }',
         api=event.api(),
+        cond=cond,
         args=", ".join(event.args.names()))
 
 
@@ -72,22 +81,10 @@ def generate_c(event, group):
     if len(event.args) == 0:
         sizestr = '0'
 
-    event_id = 'TRACE_' + event.name.upper()
-    if "vcpu" in event.properties:
-        # already checked on the generic format code
-        cond = "true"
-    else:
-        cond = "trace_event_get_state(%s)" % event_id
-
     out('',
-        '    if (!%(cond)s) {',
-        '        return;',
-        '    }',
-        '',
         '    if (trace_record_start(&rec, %(event_obj)s.id, %(size_str)s)) {',
         '        return; /* Trace Buffer Full, Event Dropped ! */',
         '    }',
-        cond=cond,
         event_obj=event.api(event.QEMU_EVENT),
         size_str=sizestr)
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions
  2025-05-28 18:06         ` Stefan Hajnoczi
@ 2025-05-28 19:30           ` Tanish Desai
  0 siblings, 0 replies; 9+ messages in thread
From: Tanish Desai @ 2025-05-28 19:30 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Paolo Bonzini, qemu-devel, Mads Ynddal

[-- Attachment #1: Type: text/plain, Size: 6202 bytes --]

I have shared v2 with updated commit msg to everyone in form of a new PATCH.
Please check it and let me know if anything is still missing or confusing.


On Wed, 28 May 2025 at 11:36 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:

> On Mon, May 26, 2025 at 10:51 PM Tanish Desai <tanishdesai37@gmail.com>
> wrote:
> >
> > > Can you explain 3 more?
> > Pablo and I have looked into object files for one of default trace point
> "trace_vhost_commit"(first tracepoint of hw/virtio) before and after making
> this change and we found that by moves the hot-path check for
> _simple_trace_vhost_commit into the caller (in the .h file) and leaves the
> full stack-frame prologue and tracepoint body in the cold path. By inlining
> only the if (trace_events_enabled()) test at the call site, we eliminate
> the 11-instruction prologue overhead (stack allocation, canary load/store,
> register spills, and argument masking) on every vhost commit(trace call)
> when tracing is disabled.
> > The old disassembled code looks like this:-
> > 0x10stp x29, x30, [sp, #-64]!Prologue: allocates 64-byte frame and saves
> old FP (x29) & LR (x30)
> > 0x14adrp x3, trace_events_enabled_countPrologue: computes page-base of
> the trace-enable counter
> > 0x18adrp x2, __stack_chk_guardImportant (maybe prolog don't
> know?)(stack-protector): starts up the stack-canary load
> > 0x1cmov x29, spPrologue: sets new frame pointer
> > 0x20ldr x3, [x3]Prologue: loads the actual trace-enabled count
> > 0x24stp x19, x20, [sp, #16]Prologue: spills callee-saved regs used by
> this function (x19, x20)
> > 0x28and w20, w0, #0xffTracepoint setup: extracts the low-8 bits of arg0
> as the “event boolean”
> > 0x2cldr x2, [x2]Prologue (cont’d): completes loading of the stack-canary
> value
> > 0x30and w19, w1, #0xffTracepoint setup: extracts low-8 bits of arg1
> > 0x34ldr w0, [x3]Important: loads the current trace-enabled flag from
> memory
> > 0x38ldr x1, [x2]Prologue (cont’d): reads the canary
> > 0x3cstr x1, [sp, #56]Prologue (cont’d): writes the canary into the new
> frame
> > 0x40mov x1, #0Prologue (cont’d): zeroes out x1 for the upcoming branch
> test
> > 0x44cbnz w0, 0x88Important: if tracing is disabled (w0==0) skip the
> heavy path entirely
> > Saving 11/14 instructions!Also We have not considered epilog
> instructional saves would be definitely more than 11.
> > Old flow: Every call does ~11 insns of prologue + tracepoint check, even
> if tracing is disabled.
> > New flow: We inline a tiny if (trace_enabled) at the caller; only when
> it’s true do you call into a full function with its prologue.(Extra
> instructions)
>
> Good, inlining worked as intended.
>
> Please send a v2 with an updated commit description explaining the
> purpose and with a disassembly snippet of the trace_vhost_commit()
> call site showing what happens.
>
> Thanks!
>
> Stefan
>
> >
> > On Wed, May 21, 2025 at 11:46 PM Stefan Hajnoczi <stefanha@gmail.com>
> wrote:
> >>
> >> On Tue, May 20, 2025 at 4:52 PM Paolo Bonzini <pbonzini@redhat.com>
> wrote:
> >> > Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefanha@gmail.com> ha
> scritto:
> >> >>
> >> >> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <
> tanishdesai37@gmail.com> wrote:
> >> >> >
> >> >> > Remove hot paths from .c file and added it in .h file to keep it
> inline.
> >> >>
> >> >> Please include performance results in the commit description so it's
> >> >> clear what impact this change has.
> >> >
> >> >
> >> > Hi Stefan,
> >> >
> >> > I am replying because I take the blame for this :) and as an example
> of how he could interact with the maintainer.
> >> >
> >> > For now we mostly looked at differences between the code that
> tracetool generates for the various backends, and the observation that some
> of them put code in the .h and some in the .c file. I explained to Tanish
> the concept of separating hot vs cold code in theory, showed him some
> examples in QEMU where performance measurements were done in the past, and
> suggested applying this to various backends (starting with the one with the
> weirdest choice!). However we didn't do any measurement yet.
> >> >
> >> > Some possibilities that come to mind:
> >> >
> >> > 1) maybe the coroutine benchmarks are enough to show a difference,
> with some luck
> >> >
> >> > 2) a new microbenchmark (or a set of microbenchmarks that try various
> level of register pressure around the tracepoint), which would be nice to
> have anyway
> >>
> >> 2 is going above and beyond :). I'm not trying to make life hard.
> >>
> >> Another option if 1 or 2 are difficult to quantify: I'd be happy with
> >> just a "before" and "after" disassembly snippet from objdump showing
> >> that the instructions at a call site changed as intended.
> >>
> >> What I'm looking for is a commit description that explains the purpose
> >> of the commit followed by, since this is a performance improvement,
> >> some form of evidence that the change achieved its goal. At that point
> >> it's easy for me to merge because it has a justification and the
> >> nature of the commit will be clear to anyone looking at the git log in
> >> the future.
> >>
> >> > 3) perhaps we could try to check the code size for some object files
> in block/ (for example libblock.a.p/*.c.o), as a proxy for how much
> instruction cache space is saved when all tracepoints are disabled
> >> >
> >> > We can start from 3, but also try 1+3 and 2+3 if it fails if you
> think that would be more persuasive.
> >>
> >> Can you explain 3 more? Total code size on disk should stay about the
> >> same because most trace events only have one call site. Moving code
> >> between the caller and callee doesn't make a big difference in either
> >> direction. At runtime there is a benefit from inlining the condition
> >> check since the call site no longer needs to execute the callee's
> >> function prologue/epilogue when the trace event is runtime-disabled,
> >> but that CPU cycle and instruction cache effect won't be visible from
> >> the on-disk code size.
> >>
> >> Thanks,
> >> Stefan
>

[-- Attachment #2: Type: text/html, Size: 7632 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions
  2025-05-28 19:25 Tanish Desai
@ 2025-06-02 22:25 ` Stefan Hajnoczi
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2025-06-02 22:25 UTC (permalink / raw)
  To: Tanish Desai; +Cc: qemu-devel, pbonzini, mads

Applied to my tracing tree, thanks!

https://gitlab.com/stefanha/qemu/-/commit/ffcfb0faaa95fc6ca007f7dd989e390dacf936ca

Stefan


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-06-02 22:27 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-19 18:51 [PATCH] trace/simple: seperate hot paths of tracing fucntions Tanish Desai
2025-05-20 19:01 ` Stefan Hajnoczi
2025-05-20 20:51   ` Paolo Bonzini
2025-05-21 18:16     ` Stefan Hajnoczi
2025-05-27  2:51       ` Tanish Desai
2025-05-28 18:06         ` Stefan Hajnoczi
2025-05-28 19:30           ` Tanish Desai
  -- strict thread matches above, loose matches on Subject: below --
2025-05-28 19:25 Tanish Desai
2025-06-02 22:25 ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).