Date: Tue, 18 Nov 2025 12:08:21 +0900
From: Masami Hiramatsu (Google)
To: Steven Rostedt
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
 Peter Zijlstra, Thomas Gleixner, Ian Rogers, Namhyung Kim,
 Arnaldo Carvalho de Melo, Jiri Olsa, Douglas Raillard
Subject: Re: [POC][RFC][PATCH 0/3] tracing: Add perf events to trace buffer
Message-Id: <20251118120821.0c47ef684b53d5d9a2d6dc83@kernel.org>
In-Reply-To: <20251118002950.680329246@kernel.org>
References: <20251118002950.680329246@kernel.org>

Hi Steve,

Thanks for the great idea!

On Mon, 17 Nov 2025 19:29:50 -0500
Steven Rostedt wrote:

>
> This series adds a perf event to the ftrace ring buffer.
> It is currently a proof of concept as I'm not happy with the interface
> and I also think the recorded perf event format may be changed too.
>
> This proof-of-concept interface (which I have no plans on using), currently
> just adds 6 new trace options.
>
>     event_cache_misses
>     event_cpu_cycles
>     func-cache-misses
>     func-cpu-cycles
>     funcgraph-cache-misses
>     funcgraph-cpu-cycles
>
> The first two trigger a perf event after every event, the second two trigger
> a perf event after every function and the last two trigger a perf event
> right after the start of a function and again at the end of the function.
>
> As this will eventual work with many more perf events than just cache-misses
> and cpu-cycles , using options is not appropriate.
> Especially since the
> options are limited to a 64 bit bitmask, and that can easily go much higher.
> I'm thinking about having a file instead that will act as a way to enable
> perf events for events, function and function graph tracing.
>
>     set_event_perf, set_ftrace_perf, set_fgraph_perf

What about adding a global `trigger` action file that users can write
these "perf" actions into? It would work something like stacktrace does
for events. (Maybe we can move stacktrace/user-stacktrace into it too.)

For pre-defined/software counters:

 # echo "perf:cpu_cycles" >> /sys/kernel/tracing/trigger

For some hardware event sources (see /sys/bus/event_source/devices/):

 # echo "perf:cstate_core.c3-residency" >> /sys/kernel/tracing/trigger

 # echo "perf:my_counter=pmu/config=M,config1=N" >> /sys/kernel/tracing/trigger

If we need to set those counters for tracers and events separately,
we can add `events/trigger` and `tracer-trigger` files.

 # echo "perf:cpu_cycles" >> /sys/kernel/tracing/events/trigger

To disable a counter, we can use '!', the same as for event triggers
(quoted so that the shell does not treat '!' as history expansion).

 # echo '!perf:cpu_cycles' > trigger

To add multiple counters, join them with ':' (or we could allow
appending new perf counters).
This allows users to set perf counter options for each event.

Maybe we should also move the 'stacktrace'/'userstacktrace' option flags
to it eventually.

>
> And an available_perf_events that show what can be written into these files,
> (similar to how set_ftrace_filter works). But for now, it was just easier to
> implement them as options.
>
> As for the perf event that is triggered. It currently is a dynamic array of
> 64 bit values. Each value is broken up into 8 bits for what type of perf
> event it is, and 56 bits for the counter. It only writes a per CPU raw
> counter and does not do any math. That would be needed to be done by any
> post processing.
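
To make the post-processing side concrete, here is a minimal user-space
sketch of decoding one such 64 bit value. It is only an illustration, not
code from the series. It assumes the 8 bit type lives in the top bits and
the 56 bit counter in the low bits (the cover letter does not say which
end holds the type), and the names PERF_VAL_TYPE_SHIFT,
PERF_VAL_COUNT_MASK, perf_val_type(), perf_val_count() and
perf_val_delta() are hypothetical.

```c
#include <stdint.h>

/*
 * Assumed layout (not confirmed by the patches):
 *   bits 63..56  type id of the perf event
 *   bits 55..0   raw per CPU counter value
 */
#define PERF_VAL_TYPE_SHIFT	56
#define PERF_VAL_COUNT_MASK	((UINT64_C(1) << PERF_VAL_TYPE_SHIFT) - 1)

/* Extract the 8 bit event type id from a recorded value. */
static inline unsigned int perf_val_type(uint64_t v)
{
	return (unsigned int)(v >> PERF_VAL_TYPE_SHIFT);
}

/* Extract the 56 bit raw counter from a recorded value. */
static inline uint64_t perf_val_count(uint64_t v)
{
	return v & PERF_VAL_COUNT_MASK;
}

/*
 * Difference between two samples of the same counter, computed
 * modulo 2^56 so that a single wrap of the counter between the
 * two samples still gives the right answer.
 */
static inline uint64_t perf_val_delta(uint64_t start, uint64_t end)
{
	return (perf_val_count(end) - perf_val_count(start)) &
		PERF_VAL_COUNT_MASK;
}
```

Masking the subtraction to 56 bits handles a wrapped counter without a
separate compare-and-adjust step; it cannot detect more than one wrap
between two samples.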
>
> Since the values are for user space to do the subtraction to figure out the
> difference between events, for example, the function_graph tracer may have:
>
>     is_vmalloc_addr() {
>         /* cpu_cycles: 5582263593 cache_misses: 2869004572 */
>         /* cpu_cycles: 5582267527 cache_misses: 2869006049 */
>     }

Just a style question: would the first line be for function entry and
the second one for function return?

>
> User space would subtract 2869006049 - 2869004572 = 1477
>
> Then 56 bits should be plenty.
>
>     2^55 / 1,000,000,000 / 60 / 60 / 24 = 416
>     416 / 4 = 104
>
> If you have a 4GHz machine, the cpu-cycles will overflow the 55 bits in 104
> days. This tooling is not for seeing how many cycles run over 104 days.
> User space tooling would just need to be aware that the vale is 56 bits and
> when calculating the difference between start and end do something like:
>
>     if (start > end)
>         end |= 1ULL << 56;
>
>     delta = end - start;
>
> The next question is how to label the perf events to be in the 8 bit
> portion. It could simply be a value that is registered, and listed in the
> available_perf_events file.
>
>     cpu_cycles:1
>     cach_misses:2
>     [..]

Looks good to me. I think the pre-defined events from `perf list` will be
there and have fixed numbers.

Thank you,

>
> And this would need to be recorded by any tooling reading the events
> so that it knows how to map the events with their attached ids.
>
> But again, this is just a proof-of-concept. How this will eventually be
> implemented is yet to be determined.
>
> But to test these patches (which are based on top of my linux-next branch,
> which should now be in linux-next):
>
>   # cd /sys/kernel/tracing
>   # echo 1 > options/event_cpu_cycles
>   # echo 1 > options/event_cache_misses
>   # echo 1 > events/syscalls/enable
>   # cat trace
> [..]
>     bash-995 [007] ..... 98.255252: sys_write -> 0x2
>     bash-995 [007] ..... 98.255257: cpu_cycles: 1557241774 cache_misses: 449901166
>     bash-995 [007] ..... 98.255284: sys_dup2(oldfd: 0xa, newfd: 1)
>     bash-995 [007] ..... 98.255285: cpu_cycles: 1557260057 cache_misses: 449902679
>     bash-995 [007] ..... 98.255305: sys_dup2 -> 0x1
>     bash-995 [007] ..... 98.255305: cpu_cycles: 1557280203 cache_misses: 449906196
>     bash-995 [007] ..... 98.255343: sys_fcntl(fd: 0xa, cmd: 1, arg: 0)
>     bash-995 [007] ..... 98.255344: cpu_cycles: 1557322304 cache_misses: 449915522
>     bash-995 [007] ..... 98.255352: sys_fcntl -> 0x1
>     bash-995 [007] ..... 98.255353: cpu_cycles: 1557327809 cache_misses: 449916844
>     bash-995 [007] ..... 98.255361: sys_close(fd: 0xa)
>     bash-995 [007] ..... 98.255362: cpu_cycles: 1557335383 cache_misses: 449918232
>     bash-995 [007] ..... 98.255369: sys_close -> 0x0
>
>
>
> Comments welcomed.
>
>
> Steven Rostedt (3):
>       tracing: Add perf events
>       ftrace: Add perf counters to function tracing
>       fgraph: Add perf counters to function graph tracer
>
> ----
>  include/linux/trace_recursion.h      |   5 +-
>  kernel/trace/trace.c                 | 153 ++++++++++++++++++++++++++++++++-
>  kernel/trace/trace.h                 |  38 ++++++++
>  kernel/trace/trace_entries.h         |  13 +++
>  kernel/trace/trace_event_perf.c      | 162 +++++++++++++++++++++++++++++++++++
>  kernel/trace/trace_functions.c       | 124 +++++++++++++++++++++++++--
>  kernel/trace/trace_functions_graph.c | 117 +++++++++++++++++++++++--
>  kernel/trace/trace_output.c          |  70 +++++++++++++++
>  8 files changed, 670 insertions(+), 12 deletions(-)

-- 
Masami Hiramatsu (Google)