* perf filters don't work correctly inside containers
From: Edd Barrett @ 2025-04-29 10:57 UTC
To: linux-perf-users
Hi,
I'm writing a userspace application that interacts directly with
perf_event_open() in order to trace control flow with Intel PT.
Recently I optimised this application by using an ftrace filter to trace only
the segment of code of interest, but to my surprise I couldn't get my work
merged because the tests failed inside the dockerised CI pipeline we use. Upon
closer inspection, the traces look empty (apart from some boring
non-control-flow PT packets).
I was then able to reproduce the problem using the standard userspace perf
tools. I *think* (but I'm not sure) that this might be highlighting a bug in
perf when used in containers.
I'm using Debian stable:
```
# cat /etc/debian_version
12.9
# uname -a
Linux bencher16 6.1.0-31-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.128-1 (2025-02-07) x86_64 GNU/Linux
```
Here's a minimal example that shows the issue. Suppose we want to trace the
executable segment of /bin/ls with Intel PT. First, let's do this *outside* of
docker to see what we expect to get:
```
host$ readelf --segments /bin/ls
...
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000004000 0x0000000000004000 0x0000000000004000
                 0x0000000000015759 0x0000000000015759  R E    0x1000
```
There's our executable segment. To collect a trace of it, we can do:
```
host$ perf record -e intel_pt//u -v --filter 'filter 0x0000000000004000/0x0000000000015759@/bin/ls' /bin/ls -- /dev/null
DEBUGINFOD_URLS=
Address filter: filter 0x4000/0x15759@/usr/bin/ls
nr_cblocks: 0
affinity: SYS
mmap flush: 1
comp level: 0
mmap size 528384B
Control descriptor is not initialized
mmap size 528384B
perf_event__synthesize_bpf_events: can't get next program: Operation not permitted
/dev/null
[ perf record: Woken up 1 times to write data ]
failed to write feature HYBRID_TOPOLOGY
[ perf record: Captured and wrote 0.035 MB perf.data ]
```
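The filter string used above pairs the executable segment's file offset and size with the object path. As a rough sketch of how that string could be derived programmatically rather than read off `readelf` by hand, the following parses the program headers of a little-endian ELF64 image. The helper names `exec_segments` and `filter_string` are hypothetical, and using the file offset (rather than the virtual address) as the start is an assumption; for this `ls` binary the two coincide.

```python
import struct

PT_LOAD = 1   # loadable segment
PF_X = 0x1    # segment is executable


def exec_segments(elf_bytes):
    """Return (file_offset, file_size) for each executable PT_LOAD segment.

    Only little-endian ELF64 images are handled in this sketch.
    """
    if elf_bytes[:4] != b"\x7fELF" or elf_bytes[4] != 2 or elf_bytes[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    # ELF64 header fields: e_phoff at 0x20, e_phentsize and e_phnum at 0x36.
    (e_phoff,) = struct.unpack_from("<Q", elf_bytes, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 0x36)
    segments = []
    for i in range(e_phnum):
        # Leading program header fields: type, flags, offset, vaddr, paddr, filesz.
        p_type, p_flags, p_offset, _vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", elf_bytes, e_phoff + i * e_phentsize
        )
        if p_type == PT_LOAD and p_flags & PF_X:
            segments.append((p_offset, p_filesz))
    return segments


def filter_string(path, offset, size):
    """Format an address filter in the 'filter START/SIZE@OBJECT' form used above."""
    return f"filter {offset:#x}/{size:#x}@{path}"
```

Reading `/bin/ls` with `open(path, "rb").read()` and passing the bytes to `exec_segments` would be expected to give the same offset and size as the `readelf` output above.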
Then we can look at the PT packets and verify we see some TNT and TIP packets,
as one would expect:
```
host$ perf report -D
...
. 00002d98: 06 TNT T (1)
. 00002d99: 59 e6 00 00 00 00 MTC 0xe6
. 00002d9f: c1 b0 d0 66 5f 80 7f 00 00 TIP.PGD 0x7f805f66d0b0
. 00002da8: 00 00 00 00 00 00 00 PAD
. 00002daf: d1 34 4b 7c 4b 82 55 00 00 TIP.PGE 0x55824b7c4b34
. 00002db8: 04 00 00 00 00 00 00 TNT N (1)
. 00002dbf: c1 b0 d0 66 5f 80 7f 00 00 TIP.PGD 0x7f805f66d0b0
...
^C
host$ perf report -D | grep -e 'TNT' | wc -l
130
host$ perf report -D | grep -e 'TIP' | wc -l
190
```
Good. Now let's repeat this experiment inside a docker container (one that has
the PERFMON capability enabled).
As it happens, the executable segment for the container's `ls` is at the same
offset as in the host:
```
container$ readelf --segments /bin/ls
...
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000004000 0x0000000000004000 0x0000000000004000
                 0x0000000000015759 0x0000000000015759  R E    0x1000
```
In light of that, the `perf record` command is the same:
```
container$ perf record -e intel_pt//u -v --filter 'filter 0x0000000000004000/0x0000000000015759@/bin/ls' /bin/ls -- /dev/null
DEBUGINFOD_URLS=
Address filter: filter 0x4000/0x15759@/usr/bin/ls
nr_cblocks: 0
affinity: SYS
mmap flush: 1
comp level: 0
maps__set_modules_path_dir: cannot open /lib/modules/6.1.0-31-amd64 dir
Problems setting modules path maps, continuing anyway...
mmap size 528384B
Control descriptor is not initialized
mmap size 528384B
perf_event__synthesize_bpf_events: can't get next program: Operation not permitted
/dev/null
[ perf record: Woken up 1 times to write data ]
failed to write feature HYBRID_TOPOLOGY
[ perf record: Captured and wrote 0.056 MB perf.data ]
```
But this time there are no TIP or TNT packets to be seen:
```
container$ perf report -D | grep -e 'TNT' | wc -l
0
container$ perf report -D | grep -e 'TIP' | wc -l
0
```
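For a CI test, the `grep | wc -l` check above could be automated along these lines. This is only a sketch: it assumes the text of `perf report -D` has already been captured into a string, and the helper names `count_packets` and `has_control_flow` are made up for illustration. Like `grep`, it counts lines that mention the packet name anywhere.

```python
def count_packets(dump_text, names=("TNT", "TIP")):
    """Count dump lines mentioning each packet name, like `grep -e NAME | wc -l`."""
    lines = dump_text.splitlines()
    return {name: sum(name in line for line in lines) for name in names}


def has_control_flow(dump_text):
    """True if the dump contains at least one TNT or TIP packet line."""
    return any(count_packets(dump_text).values())
```

A trace like the container one above, with zero TNT and zero TIP lines, would make `has_control_flow` return `False`.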
I find this behaviour unexpected. It suggests that the executable segment
wasn't traced. I'd expect a container to be able to trace with a filter the
same as if it weren't a container at all, but maybe my expectations are
wrong(?).
One theory is that perf filtered on the host's /bin/ls, which of course wasn't
executed. However, despite trying, I've been unable to confirm this.
I tried, for example, specifying the host's path to `ls`
(/var/lib/docker/overlay2/...) in the filter string, but perf will reject this,
presumably because it checks the path exists in the guest filesystem. Copying
`ls` to that path in the guest doesn't give us a trace with TIP and TNT
packets, so I'm not sure what's going on.
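One way to probe the theory above might be to compare which file each path resolves to, by device and inode number, both on the host and inside the container. The sketch below does that with `os.stat`. The helper names are hypothetical, and it is an assumption that the identity reported by `stat` is the same one the kernel uses when matching a filter, particularly on overlay filesystems.

```python
import os


def file_identity(path):
    """Return (device, inode) for the file that path resolves to."""
    st = os.stat(path)  # follows symlinks, e.g. /bin/ls -> /usr/bin/ls
    return (st.st_dev, st.st_ino)


def same_file(path_a, path_b):
    """True if both paths resolve to the same device and inode."""
    return file_identity(path_a) == file_identity(path_b)
```

Printing `file_identity("/bin/ls")` in the container and `file_identity` of the corresponding overlay2 path on the host would show whether the two views agree on the file.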
Does anyone know if filters should work in containers?
Thanks
--
Best Regards
Edd Barrett
https://www.theunixzoo.co.uk
* Re: perf filters don't work correctly inside containers
From: Edd Barrett @ 2025-04-30 7:27 UTC
To: linux-perf-users
On Tue, Apr 29, 2025 at 11:57:38AM +0100, Edd Barrett wrote:
> I was then able to reproduce the problem using the standard userspace perf
> tools. I *think* (but I'm not sure) that this might be highlighting a bug in
> perf when used in containers.
Forgot to mention: perf seems to work fine inside the docker container
*without* an ftrace filter, so it's not that the container cannot trace with PT
at all. The problem seems filter-related.
--
Best Regards
Edd Barrett
https://www.theunixzoo.co.uk
* Re: perf filters don't work correctly inside containers
From: Ian Rogers @ 2025-04-30 22:59 UTC
To: Edd Barrett, Adrian Hunter; +Cc: linux-perf-users
On Wed, Apr 30, 2025 at 12:27 AM Edd Barrett <edd@theunixzoo.co.uk> wrote:
>
> On Tue, Apr 29, 2025 at 11:57:38AM +0100, Edd Barrett wrote:
> > I was then able to reproduce the problem using the standard userspace perf
> > tools. I *think* (but I'm not sure) that this might be highlighting a bug in
> > perf when used in containers.
>
> Forgot to mention: perf seems to work fine inside the docker container
> *without* an ftrace filter, so it's not that the container cannot trace with PT
> at all. The problem seems filter-related.
Hi Edd,
Adrian Hunter is the intel-pt expert. I've not tried this but I
believe it should work. On the kernel side the filename is grabbed and
swapped for an inode:
https://github.com/torvalds/linux/blob/master/kernel/events/core.c#L11474
The path should use the container's namespace. Perhaps you can use
"perf ftrace" to see what's going on inside the kernel. You can also
compare the -vv verbose output to see if the events were opened in
different ways; look for things like:
```
------------------------------------------------------------
perf_event_attr:
  type                             12 (intel_pt)
  size                             136
  config                           0x300e601
  { sample_period, sample_freq }   1
  sample_type                      IP|TID|TIME|CPU|IDENTIFIER
  read_format                      ID|LOST
  disabled                         1
  inherit                          1
  exclude_kernel                   1
  exclude_hv                       1
  enable_on_exec                   1
  sample_id_all                    1
  aux_watermark                    32768
------------------------------------------------------------
sys_perf_event_open: pid 288789 cpu 0 group_fd -1 flags 0x8 = 7
```
which is saying the perf_event_open succeeded giving an fd of 7.
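Comparing the two -vv outputs by eye is error-prone, so a small script could do the diff. The following is a sketch only: it assumes each run's verbose output has been saved as text, and that attribute blocks start with a `perf_event_attr:` line and end at a line of dashes, as in the excerpt above. The function names are hypothetical.

```python
def attr_lines(verbose_output):
    """Collect whitespace-normalised lines found inside perf_event_attr blocks."""
    collected, inside = [], False
    for raw in verbose_output.splitlines():
        line = " ".join(raw.split())  # collapse column padding to single spaces
        if line.startswith("perf_event_attr:"):
            inside = True
        elif inside and line.startswith("-----"):
            inside = False
        elif inside and line:
            collected.append(line)
    return collected


def diff_attrs(output_a, output_b):
    """Return (lines only in a, lines only in b) for the attribute blocks."""
    a, b = attr_lines(output_a), attr_lines(output_b)
    return ([l for l in a if l not in b], [l for l in b if l not in a])
```

Running `diff_attrs` on the host and container outputs would return two empty lists if the events were opened with identical attributes.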
Thanks,
Ian
* Re: perf filters don't work correctly inside containers
From: Adrian Hunter @ 2025-05-05 8:28 UTC
To: Ian Rogers, Edd Barrett, Shishkin, Alexander; +Cc: linux-perf-users
+Alex
On 01/05/2025 01:59, Ian Rogers wrote:
> On Wed, Apr 30, 2025 at 12:27 AM Edd Barrett <edd@theunixzoo.co.uk> wrote:
>>
>> On Tue, Apr 29, 2025 at 11:57:38AM +0100, Edd Barrett wrote:
>>> I was then able to reproduce the problem using the standard userspace perf
>>> tools. I *think* (but I'm not sure) that this might be highlighting a bug in
>>> perf when used in containers.
>>
>> Forgot to mention: perf seems to work fine inside the docker container
>> *without* an ftrace filter, so it's not that the container cannot trace with PT
>> at all. The problem seems filter-related.
>
> Hi Edd,
>
> Adrian Hunter is the intel-pt expert. I've not tried this but I
Thanks Ian!
> believe it should work. On the kernel side the filename is grabbed and
> swapped for an inode:
> https://github.com/torvalds/linux/blob/master/kernel/events/core.c#L11474
Looks OK to me too, but Alex might know more.
> the path should use the container's namespace. Perhaps you can use
> "perf ftrace" to see what's going on inside the kernel. You can also
> compare the -vv verbose output to see if the events were opened in
> different ways, look for things like:
> ```
> ------------------------------------------------------------
> perf_event_attr:
> type 12 (intel_pt)
> size 136
> config 0x300e601
> { sample_period, sample_freq } 1
> sample_type IP|TID|TIME|CPU|IDENTIFIER
> read_format ID|LOST
> disabled 1
> inherit 1
> exclude_kernel 1
> exclude_hv 1
> enable_on_exec 1
> sample_id_all 1
> aux_watermark 32768
> ------------------------------------------------------------
> sys_perf_event_open: pid 288789 cpu 0 group_fd -1 flags 0x8 = 7
> ```
> which is saying the perf_event_open succeeded giving an fd of 7.
>
> Thanks,
> Ian