Re: [PATCH 1/1] tools/bpftrace: Add a simple instrumented kernel profiler

Kernel KVM virtualization development
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Aaron Lewis" <aaronlewis@google.com>
Cc: bpf@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH 1/1] tools/bpftrace: Add a simple instrumented kernel profiler
Date: Tue, 30 Jun 2026 14:23:24 +0000	[thread overview]
Message-ID: <20260630142325.0FDC11F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260630141152.3757722-2-aaronlewis@google.com>

Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
- [Medium] Incomplete timers (DURATION_UNSET) are blindly aggregated, corrupting profile metrics.
- [Medium] The `kvm_mmu_page_fault` probe incorrectly names its timer `kvm_vcpu_block`.
- [Medium] Two-pass parsing architecture breaks when reading from standard input, causing an AttributeError.
- [Medium] Fragile parsing logic crashes the script if any BPF trace map events are dropped.
- [Low] File validation checks evaluate against closed file descriptors, rendering them dead code.
--

commit 991cf3dc11faba4afe46fa2281e4850e5407d45b
Author: Aaron Lewis <aaronlewis@google.com>

tools/bpftrace: Add a simple instrumented kernel profiler

This commit introduces a simple instrumented profiler for the kernel using
BpfTrace. It allows arbitrary tracing of kernel functions and tracepoints,
and includes a Python script to aggregate and display the results in a
hierarchical view with timing data.

> diff --git a/tools/bpftrace/instrumented_profiler/aggregate_bpf_results.py b/tools/bpftrace/instrumented_profiler/aggregate_bpf_results.py
> new file mode 100644
> index 0000000000000..f4ffe8cff4418
> --- /dev/null
> +++ b/tools/bpftrace/instrumented_profiler/aggregate_bpf_results.py

[ ... ]

> +def update_child(r: BpfResult, timer: BpfTimer, tid: int) -> int:
> +  i = r.child_index.get(timer.name)
> +  if i is None:
> +    i = len(r.children)
> +    r.children.append(BpfResult(
> +        timer.name,
> +        timer.start_time_us,
> +        timer.start_time_us + timer.duration_us,
> +        timer.duration_us,
> +        timer.duration_us,
> +        timer.duration_us,
> +        1,
> +        1,
> +        timer.depth,
> +        tid))
> +    r.child_index[timer.name] = i
> +  else:
> +    r.children[i].wall_time_start_us = min(r.children[i].wall_time_start_us, timer.start_time_us)
> +    r.children[i].wall_time_end_us = max(r.children[i].wall_time_end_us, timer.start_time_us + timer.duration_us)
> +    r.children[i].thread_us += timer.duration_us
> +    r.children[i].max_us = max(r.children[i].max_us, timer.duration_us)

[Severity: Medium]
When a traced function does not complete before tracing ends (for example, if
interrupted via Ctrl-C), the duration remains at its initialization value of
DURATION_UNSET (-1 cast to uint64).

Are we blindly aggregating incomplete timers here?

Adding this massive sentinel value directly to thread_us and max_us will
severely corrupt the aggregated timing metrics.

[ ... ]

> +def merge_recursive(r: BpfResult, tid: int, timers: List[BpfTimer], index: int, child_index: int) -> int:
> +  while index < len(timers):
> +    while index < len(timers) and timers[index] == None:
> +      index += 1
> +
> +    if index >= len(timers):
> +      return index
> +
> +    # The timer is a child of 'r'. (Child case)
> +    if timers[index].depth == (r.depth + 1):
> +      # Add child
> +      child_index = update_child(r, timers[index], tid)
> +
> +      # Now that the timer has been handled increment to the next one.
> +      index += 1
> +
> +    # The timer is a child of 'r'. (Grandchild case)
> +    elif timers[index].depth == (r.depth + 2):
> +      if child_index < 0:
> +        raise ValueError("Invalid 'child_index'")
> +      # Recurse down the child becuase that is the parent this timer.
> +      index = merge_recursive(r.children[child_index], tid, timers, index, child_index)
> +
> +    # The timer is a descendant of 'r', but not the child... Invalid case!
> +    elif timers[index].depth > (r.depth + 2):
> +      raise ValueError("Invalid depth.  descendant can only be children. name = '%s', tid = %d, index = %d" % (timers[index].name, tid, index))

[Severity: Medium]
Will this script crash with a ValueError or IndexError if bpftrace drops any
map events due to ring buffer overflow or map eviction under heavy load?

The parsing logic appears to assume a flawless 1:1 correlation across all
dumped maps. If events are dropped, it might create a gap in the call depth
triggering this strict ValueError, or result in an IndexError during array
assignment in generate_from_log() when index expectations are broken.

[ ... ]

> +def generate_from_log(files: Sequence[str]):
> +  """Read the trace and return a list of ProfilerResult objects."""
> +
> +  timer_counts = {}
> +  all_timers = {}
> +
> +  count_regex = re.compile(
> +      r"^@timer_count\[(?P<tid>[0-9]+)\]:\W(?P<count>[-0-9]+)"
> +  )
> +  name_regex = re.compile(
> +      r"^@timer_name\[(?P<tid>[0-9]+),\W*(?P<index>[0-9]+)\]:\W(?P<name>[\w\.0-9]+)"
> +  )
> +  start_time_regex = re.compile(
> +      r"^@timer_start\[(?P<tid>[0-9]+),\W*(?P<index>[0-9]+)\]:\W(?P<start_time_us>[-0-9]+)"
> +  )
> +  duration_regex = re.compile(
> +      r"^@timer_duration\[(?P<tid>[0-9]+),\W*(?P<index>[0-9]+)\]:\W(?P<duration_us>[-0-9]+)"
> +  )
> +  depth_regex = re.compile(
> +      r"^@timer_depth\[(?P<tid>[0-9]+),\W*(?P<index>[0-9]+)\]:\W(?P<depth>[-0-9]+)"
> +  )
> +
> +  attaching_regex = re.compile(
> +      r"^Attaching.*"
> +  )
> +
> +  # Get all counts.  That way we know how many timers each thread used.
> +  for line in fileinput.input(files=files):

[Severity: Medium]
Could this two-pass parsing architecture break when reading trace data directly
from standard input?

If a user pipes bpftrace output into this script, this first loop will consume
the entire stream until EOF.

The second fileinput.input() loop below will then read nothing, leaving the
timer arrays filled with None values. This leads to an AttributeError in
merge_timers() when attempting to read timers[0].depth.

> +    line = line.rstrip()
> +
> +    m = re.fullmatch(duration_regex, line)
> +    if m:
> +      tid = int(m.group("tid"))
> +      index = int(m.group("index"))
> +      duration_str = m.group("duration_us")
> +      duration = int(m.group("duration_us"))
> +
> +      if not tid in timer_counts:
> +        timer_counts[tid] = index
> +      else:
> +        timer_counts[tid] = max(timer_counts[tid], index)
> +
> +  # Allocate enough timers to fit the trace.
> +  for tid, count in timer_counts.items():
> +    all_timers[tid] = [None] * (count + 1)
> +
> +  # Populate the trace.
> +  for line in fileinput.input(files=files):

[ ... ]

> +  root = BpfResult("", 0, 0, 0, 0, 0, 1, 1, -1, -1)
> +  results = []
> +  for tid, timers in all_timers.items():
> +    merge_timers(root, tid, timers)
> +  print_results(root, results)
> +
> +  if fileinput.fileno() == 0:
> +    raise ValueError("Expected a file to read.")

[Severity: Low]
Does this file validation check evaluate against closed file descriptors?

The fileinput loops have already exhausted and closed the files by this point.
Calling fileinput.fileno() returns the active UNIX file descriptor, which is
-1 when closed, not the count of parsed files.

Both this check and the one below will always evaluate to false, rendering
this block dead code.

> +
> +  if fileinput.fileno() > 1:
> +    raise ValueError("Too many files read.  Expected 1")
> +
> +def main() -> None:

[ ... ]

> diff --git a/tools/bpftrace/instrumented_profiler/samples/kvm_trace.bt b/tools/bpftrace/instrumented_profiler/samples/kvm_trace.bt
> new file mode 100644
> index 0000000000000..bd341e408570e
> --- /dev/null
> +++ b/tools/bpftrace/instrumented_profiler/samples/kvm_trace.bt

[ ... ]

> +/*
> + * Time 'kvm_mmu_page_fault()'.
> + *
> + * Check: sudo grep kvm_mmu_page_fault /proc/kallsyms
> + */
> +kfunc:kvm_mmu_page_fault
> +{
> +	START_TIMER("kvm_vcpu_block",
> +		    @timer_name, @trace_stack, @timer_start, @timer_duration,
> +		    @timer_depth, @timer_count, @trace_stack_top);
> +}

[Severity: Medium]
Is there a copy-paste error here where the timer for kvm_mmu_page_fault is
incorrectly named kvm_vcpu_block?

This will falsely attribute page fault durations to kvm_vcpu_block, distorting
the captured metrics for the profile.

> +kretfunc:kvm_mmu_page_fault
> +{
> +	if (@trace_stack_top[tid] > 0) {
> +		STOP_TIMER(@trace_stack_top, @trace_stack, @timer_duration, @timer_start);
> +	}
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260630141152.3757722-2-aaronlewis@google.com?part=1

next prev parent reply	other threads:[~2026-06-30 14:23 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-30 14:11 [PATCH 0/1] Introduce an instrumented profiler using BpfTrace Aaron Lewis
2026-06-30 14:11 ` [PATCH 1/1] tools/bpftrace: Add a simple instrumented kernel profiler Aaron Lewis
2026-06-30 14:23   ` sashiko-bot [this message]
2026-06-30 20:56 ` [PATCH 0/1] Introduce an instrumented profiler using BpfTrace Alexei Starovoitov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260630142325.0FDC11F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=aaronlewis@google.com \
    --cc=bpf@vger.kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox