Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Stefan Hajnoczi <stefanha@gmail.com>
To: Tanish Desai <tanishdesai37@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,  Mads Ynddal <mads@ynddal.dk>
Subject: Re: [PATCH] trace/simple: seperate hot paths of tracing fucntions
Date: Wed, 28 May 2025 14:06:44 -0400	[thread overview]
Message-ID: <CAJSP0QWM63GA1LHcTiSF=sivzghtompzJqQORRchveTHWmUW6A@mail.gmail.com> (raw)
In-Reply-To: <CAH_Y1je5u3GoaUJ3LfHxcp0MMpEjVPbGkYOw3mo8As+tEPExdg@mail.gmail.com>

On Mon, May 26, 2025 at 10:51 PM Tanish Desai <tanishdesai37@gmail.com> wrote:
>
> > Can you explain 3 more?
> Pablo and I have looked into object files for one of default trace point "trace_vhost_commit"(first tracepoint of hw/virtio) before and after making this change and we found that by moves the hot-path check for _simple_trace_vhost_commit into the caller (in the .h file) and leaves the full stack-frame prologue and tracepoint body in the cold path. By inlining only the if (trace_events_enabled()) test at the call site, we eliminate the 11-instruction prologue overhead (stack allocation, canary load/store, register spills, and argument masking) on every vhost commit(trace call) when tracing is disabled.
> The old disassembled code looks like this:-
> 0x10stp x29, x30, [sp, #-64]!Prologue: allocates 64-byte frame and saves old FP (x29) & LR (x30)
> 0x14adrp x3, trace_events_enabled_countPrologue: computes page-base of the trace-enable counter
> 0x18adrp x2, __stack_chk_guardImportant (maybe prolog don't know?)(stack-protector): starts up the stack-canary load
> 0x1cmov x29, spPrologue: sets new frame pointer
> 0x20ldr x3, [x3]Prologue: loads the actual trace-enabled count
> 0x24stp x19, x20, [sp, #16]Prologue: spills callee-saved regs used by this function (x19, x20)
> 0x28and w20, w0, #0xffTracepoint setup: extracts the low-8 bits of arg0 as the “event boolean”
> 0x2cldr x2, [x2]Prologue (cont’d): completes loading of the stack-canary value
> 0x30and w19, w1, #0xffTracepoint setup: extracts low-8 bits of arg1
> 0x34ldr w0, [x3]Important: loads the current trace-enabled flag from memory
> 0x38ldr x1, [x2]Prologue (cont’d): reads the canary
> 0x3cstr x1, [sp, #56]Prologue (cont’d): writes the canary into the new frame
> 0x40mov x1, #0Prologue (cont’d): zeroes out x1 for the upcoming branch test
> 0x44cbnz w0, 0x88Important: if tracing is disabled (w0==0) skip the heavy path entirely
> Saving 11/14 instructions!Also We have not considered epilog instructional saves would be definitely more than 11.
> Old flow: Every call does ~11 insns of prologue + tracepoint check, even if tracing is disabled.
> New flow: We inline a tiny if (trace_enabled) at the caller; only when it’s true do you call into a full function with its prologue.(Extra instructions)

Good, inlining worked as intended.

Please send a v2 with an updated commit description explaining the
purpose and with a disassembly snippet of the trace_vhost_commit()
call site showing what happens.

Thanks!

Stefan

>
> On Wed, May 21, 2025 at 11:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>
>> On Tue, May 20, 2025 at 4:52 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>> > Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefanha@gmail.com> ha scritto:
>> >>
>> >> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <tanishdesai37@gmail.com> wrote:
>> >> >
>> >> > Remove hot paths from .c file and added it in .h file to keep it inline.
>> >>
>> >> Please include performance results in the commit description so it's
>> >> clear what impact this change has.
>> >
>> >
>> > Hi Stefan,
>> >
>> > I am replying because I take the blame for this :) and as an example of how he could interact with the maintainer.
>> >
>> > For now we mostly looked at differences between the code that tracetool generates for the various backends, and the observation that some of them put code in the .h and some in the .c file. I explained to Tanish the concept of separating hot vs cold code in theory, showed him some examples in QEMU where performance measurements were done in the past, and suggested applying this to various backends (starting with the one with the weirdest choice!). However we didn't do any measurement yet.
>> >
>> > Some possibilities that come to mind:
>> >
>> > 1) maybe the coroutine benchmarks are enough to show a difference, with some luck
>> >
>> > 2) a new microbenchmark (or a set of microbenchmarks that try various level of register pressure around the tracepoint), which would be nice to have anyway
>>
>> 2 is going above and beyond :). I'm not trying to make life hard.
>>
>> Another option if 1 or 2 are difficult to quantify: I'd be happy with
>> just a "before" and "after" disassembly snippet from objdump showing
>> that the instructions at a call site changed as intended.
>>
>> What I'm looking for is a commit description that explains the purpose
>> of the commit followed by, since this is a performance improvement,
>> some form of evidence that the change achieved its goal. At that point
>> it's easy for me to merge because it has a justification and the
>> nature of the commit will be clear to anyone looking at the git log in
>> the future.
>>
>> > 3) perhaps we could try to check the code size for some object files in block/ (for example libblock.a.p/*.c.o), as a proxy for how much instruction cache space is saved when all tracepoints are disabled
>> >
>> > We can start from 3, but also try 1+3 and 2+3 if it fails if you think that would be more persuasive.
>>
>> Can you explain 3 more? Total code size on disk should stay about the
>> same because most trace events only have one call site. Moving code
>> between the caller and callee doesn't make a big difference in either
>> direction. At runtime there is a benefit from inlining the condition
>> check since the call site no longer needs to execute the callee's
>> function prologue/epilogue when the trace event is runtime-disabled,
>> but that CPU cycle and instruction cache effect won't be visible from
>> the on-disk code size.
>>
>> Thanks,
>> Stefan

next prev parent reply	other threads:[~2025-05-28 18:07 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-19 18:51 [PATCH] trace/simple: seperate hot paths of tracing fucntions Tanish Desai
2025-05-20 19:01 ` Stefan Hajnoczi
2025-05-20 20:51   ` Paolo Bonzini
2025-05-21 18:16     ` Stefan Hajnoczi
2025-05-27  2:51       ` Tanish Desai
2025-05-28 18:06         ` Stefan Hajnoczi [this message]
2025-05-28 19:30           ` Tanish Desai
  -- strict thread matches above, loose matches on Subject: below --
2025-05-28 19:25 Tanish Desai
2025-06-02 22:25 ` Stefan Hajnoczi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJSP0QWM63GA1LHcTiSF=sivzghtompzJqQORRchveTHWmUW6A@mail.gmail.com' \
    --to=stefanha@gmail.com \
    --cc=mads@ynddal.dk \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=tanishdesai37@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).