linux-perf-users.vger.kernel.org archive mirror
* [RFE] Improve handling of framepointer-based backtraces for function prologue samples
From: William Cohen @ 2024-09-09  3:00 UTC (permalink / raw)
  To: perf group; +Cc: wcohen, Serhei Makarov

TL;DR summary: By my estimate, over 5% of cycle-based user-space
samples are taken in function prologues.  I would like perf output to
indicate when that occurs and, if possible, perf to record enough
info to avoid missing functions in the FP-based backtraces.



Frame pointer-based (FP) stack unwinding has been touted as a way of
getting better backtrace samples.  However, this is predicated on the
assumption that the FP for the current function has been initialized.
If the FP for the current function has not been set up yet, the
unwinder will omit the function that called the current function.  For
code layout optimization, that most immediate caller is probably the
one we care most about in the backtrace.  How often FP-based unwinding
omits the caller of the sampled function has not been analyzed.  One
concern is that samples in the function prologue might be more common
than expected in cycle-based sampling, as a call to a function might
trigger expensive TLB or cache misses to pull in the code for the
called function.
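
To make the failure mode concrete, here is a toy Python model of a
frame-pointer walk.  It is only a sketch, not perf's unwinder: the
addresses, the dict standing in for stack memory, and the names main,
parent and leaf are all made up for illustration, and the frame layout
assumed is the x86_64-style one (saved FP at [fp], return address at
[fp + 8]).

```python
# Toy model of frame-pointer unwinding; not perf's actual unwinder.
# Memory is a dict mapping address -> value; the stack grows downward.

def fp_unwind(pc, fp, memory):
    """Walk the saved-FP chain: [fp] holds the caller's FP and
    [fp + 8] holds the return address (x86_64-style layout)."""
    trace = [pc]
    while fp != 0:
        trace.append(memory[fp + 8])  # return address into the caller
        fp = memory[fp]               # caller's frame pointer
    return trace

# Call chain: main -> parent -> leaf. Names stand in for code addresses.
memory = {
    0x1000: 0,         # main's frame: saved FP of its caller (end of chain)
    0x1008: "start",   # return address into the code that called main
    0x0FF0: 0x1000,    # parent's frame: saved FP of main
    0x0FF8: "main",    # return address into main
    0x0FE0: 0x0FF0,    # leaf's frame: saved FP of parent
    0x0FE8: "parent",  # return address into parent
}

# Sample in leaf's body: FP already points at leaf's own frame.
body = fp_unwind("leaf", 0x0FE0, memory)

# Sample in leaf's prologue: FP still points at parent's frame, so the
# first return address read is the one into main and parent is skipped.
prologue = fp_unwind("leaf", 0x0FF0, memory)

print(body)      # ['leaf', 'parent', 'main', 'start']
print(prologue)  # ['leaf', 'main', 'start']
```

The prologue sample still yields a plausible-looking backtrace, which
is why the missing caller is easy to overlook.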

To get some actual measurements of how often samples occur in prologue
code, I examined disassembled x86_64, aarch64, and risc-v code to find
out what the minimal prologues are:

x86_64:
   3532c:	f3 0f 1e fa          	endbr64
   35330:	55                   	push   %rbp
   35331:	48 89 e5             	mov	%rsp,%rbp
   35334:	53                   	push   %rbx    	// FP valid here

aarch64:
   3265c:   	d503233f    	paciasp
   32660:   	a9bc7bfd    	stp 	x29, x30, [sp, #-64]!
   32664:   	910003fd    	mov 	x29, sp
   32668:   	a90153f3    	stp 	x19, x20, [sp, #16] // FP valid

risc-v:
   38eb2:   	1141                	addi	sp,sp,-16
   38eb4:   	e022                	sd  	s0,0(sp)
   38eb6:   	e406                	sd  	ra,8(sp)
   38eb8:   	0800                	addi	s0,sp,16
   38eba:   	0001                	nop // FP valid here

Note that these are the shortest prologues possible.  There are cases
where the prologue has to compute the needed stack space, or where
operations from the body of the function have been moved into the
prologue code.  Thus, for this experiment the FP is not valid for
x86_64 and risc-v until an offset of at least 8 bytes from the start
of the function.  For aarch64 the FP is not valid until at least 12
bytes from the start of the function.
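
The 8- and 12-byte thresholds follow from the instruction lengths in
the listings above.  A small Python sketch of that arithmetic (the
table and function names are made up for illustration; the byte counts
are read off the listed addresses):

```python
# Instruction lengths in bytes, taken from the disassembly listings,
# up to and including the instruction that sets the frame pointer.
PROLOGUE_INSN_BYTES = {
    "x86_64": [4, 1, 3],     # endbr64; push %rbp; mov %rsp,%rbp
    "aarch64": [4, 4, 4],    # paciasp; stp x29,x30; mov x29,sp
    "risc-v": [2, 2, 2, 2],  # addi sp; sd s0; sd ra; addi s0 (compressed)
}

def fp_valid_offset(arch):
    """First byte offset at which the FP of the minimal prologue is valid."""
    return sum(PROLOGUE_INSN_BYTES[arch])

def in_minimal_prologue(arch, offset):
    """True if a sample at this symbol offset precedes a valid FP."""
    return 0 <= offset < fp_valid_offset(arch)

for arch in PROLOGUE_INSN_BYTES:
    print(arch, fp_valid_offset(arch))
```

Offsets 0 through 7 (0 through 11 on aarch64) are therefore the ones
counted as "definitely in the prologue".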

To collect data, I ran system-wide sampling for 10 minutes in one
window:

  sudo perf record -a sleep 600

In another window, I ran the systemtap testsuite to provide some load:

  cd /usr/share/systemtap/testsuite
  sudo make installcheck


Once the data is collected, the number of samples definitely in the
prologue for x86_64 and risc-v can be computed with something like the
following (for aarch64, change the pattern to 0x[0123456789ab]):

  sudo perf report --sort=sample,symoff | grep -E '0x[01234567]$' | grep -v "\\[k\\]" | grep -v "@plt" | awk '{print $2}' | paste -sd+ | bc
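
For anyone who prefers to post-process the report text in a script,
the filter above can be sketched in Python roughly as follows.  This
is an approximation under assumptions: it treats [k] and @plt as
literal markers, takes the sample count from the second
whitespace-separated field just as the awk step does, and the function
name and the sample lines in the usage example are hypothetical, not
real perf output.

```python
import re

def count_prologue_samples(report_lines, offset_digits="01234567"):
    """Sum field 2 of lines whose symbol offset is a single hex digit in
    offset_digits, skipping kernel ([k]) and PLT (@plt) entries."""
    tail = re.compile("0x[" + offset_digits + "]$")
    total = 0
    for line in report_lines:
        line = line.rstrip("\n")
        if not tail.search(line):
            continue          # offset is past the minimal prologue
        if "[k]" in line or "@plt" in line:
            continue          # kernel symbol or PLT stub
        total += int(line.split()[1])
    return total

# Hypothetical lines, made up only to exercise the filter.
example = [
    "  0.10%  120  [.] foo+0x4",
    "  0.05%   60  [.] bar+0x10",
    "  0.02%   30  [k] baz+0x0",
    "  0.01%   10  [.] qux@plt+0x0",
    "  0.01%    7  [.] check_task+0xb",
]
print(count_prologue_samples(example))                  # 120
print(count_prologue_samples(example, "0123456789ab"))  # 127
```

The second call uses the wider aarch64 pattern.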

Obtain total sample count with:

  sudo perf report --sort=sample,dso  | awk '{print $2}' | grep '[[:digit:]]' |  paste -sd+ | bc

Obtain count of kernel samples:

  sudo perf report --sort=sample,symoff |grep "\\[k\\]" | awk '{print $2}' | paste -sd+ | bc

Obtain the count of user samples by subtracting the kernel samples from the total samples.



                     x86-64     aarch64      risc-v
total samples       1955552     4564540     4936702
# k-space            345895     1034297     1323405
# u-space           1609657     3530243     3613297
# prologue            84506      210245      262926
prologue/u-space       5.2%        6.0%        7.3%
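
The derived rows of the table can be rechecked with a few lines of
Python.  The counts are copied from the table; the function name is
just for illustration.

```python
# (total samples, kernel samples, prologue samples) from the table.
COUNTS = {
    "x86-64": (1955552, 345895, 84506),
    "aarch64": (4564540, 1034297, 210245),
    "risc-v": (4936702, 1323405, 262926),
}

def prologue_share(total, kernel, prologue):
    """Return (user samples, prologue samples as % of user samples)."""
    user = total - kernel
    return user, round(100.0 * prologue / user, 1)

for arch, counts in COUNTS.items():
    user, pct = prologue_share(*counts)
    print(arch, user, pct)
# x86-64 1609657 5.2
# aarch64 3530243 6.0
# risc-v 3613297 7.3
```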


Note that these estimates are not exact, and they may be low due to:

  - A number of glibc optimized functions written in assembly without FP
  - Additional code to compute stack space based on arguments
  - Function optimization:
    - Moving loads into prologue to allow more time for load to complete
    - Skipping directly to return if argument is NULL (observed in risc-v libelf gelf_getsymshndx and elf_getscn)


It would be an improvement for perf to indicate, when reporting
backtraces, which of them have skipped a function in the call chain.
For processors that store the return address in a register, like
aarch64 and risc-v, perf could record that register in the sample and
use it for samples in the prologue to get a more complete backtrace.
Unfortunately, for x86_64 the return address could be anywhere on the
stack relative to the current value of the stack pointer.  Maybe perf
could analyze the CFI information of x86_64 programs to determine a
reasonably small amount of stack to record that covers a significant
portion of function prologues.
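
As a rough illustration of the link-register idea for aarch64 and
risc-v, the repair step might look like the Python sketch below.  This
is not existing perf behaviour; the function and parameter names are
hypothetical, and it assumes the sample already carries the symbol
offset and the caller resolved from the recorded return-address
register.

```python
def patch_backtrace(trace, symoff, lr_caller, fp_valid_offset):
    """trace[0] is the sampled function and trace[1:] are the callers
    found by the FP walk.  If the sample lies before the point where the
    FP is set up, splice in the caller named by the link register."""
    if symoff < fp_valid_offset:
        return [trace[0], lr_caller] + trace[1:]
    return trace

# Sample 4 bytes into leaf on aarch64 (FP valid at offset 12): the FP
# walk skipped parent, and the link register supplies it.
print(patch_backtrace(["leaf", "main", "start"], 4, "parent", 12))
# ['leaf', 'parent', 'main', 'start']

# Sample past the prologue: the FP walk is already complete.
print(patch_backtrace(["leaf", "parent", "main", "start"], 16, "parent", 12))
# ['leaf', 'parent', 'main', 'start']
```

The fixed threshold is only a lower bound, since longer prologues set
up the FP later than the minimal ones measured above.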

-Will Cohen

