linux-perf-users.vger.kernel.org archive mirror
* perf filters don't work correctly inside containers
@ 2025-04-29 10:57 Edd Barrett
  2025-04-30  7:27 ` Edd Barrett
  0 siblings, 1 reply; 4+ messages in thread
From: Edd Barrett @ 2025-04-29 10:57 UTC (permalink / raw)
  To: linux-perf-users

Hi,

I'm writing a userspace application that interacts directly with
perf_event_open() in order to trace control flow with Intel PT.

Recently I optimised this application by using an ftrace filter to trace only
the segment of code of interest, but to my surprise I couldn't get my work
merged because the tests failed inside the dockerised CI pipeline we use. Upon
closer inspection, the traces look empty (apart from some boring non-control
flow PT packets).

I was then able to reproduce the problem using the standard userspace perf
tools. I *think* (but I'm not sure) that this might be highlighting a bug in
perf when used in containers.

I'm using Debian stable:
```
# cat /etc/debian_version
12.9
# uname -a
Linux bencher16 6.1.0-31-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.128-1 (2025-02-07) x86_64 GNU/Linux
```

Here's a minimal example that shows the issue. Suppose we want to trace the
executable segment of /bin/ls with Intel PT. First, let's do this *outside* of
docker to see what we expect to get:

```
host$ readelf --segments /bin/ls
...
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000004000 0x0000000000004000 0x0000000000004000
                 0x0000000000015759 0x0000000000015759  R E    0x1000
```

There's our executable segment. To collect a trace of it, we can do:

```
host$ perf record -e intel_pt//u -v --filter 'filter 0x0000000000004000/0x0000000000015759@/bin/ls' /bin/ls -- /dev/null
DEBUGINFOD_URLS=
Address filter: filter 0x4000/0x15759@/usr/bin/ls
nr_cblocks: 0
affinity: SYS
mmap flush: 1
comp level: 0
mmap size 528384B
Control descriptor is not initialized
mmap size 528384B
perf_event__synthesize_bpf_events: can't get next program: Operation not permitted
/dev/null
[ perf record: Woken up 1 times to write data ]
failed to write feature HYBRID_TOPOLOGY
[ perf record: Captured and wrote 0.035 MB perf.data ]
```
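The filter string in the command above was assembled by hand from the `readelf` output. As a rough sketch, the same string could be derived programmatically. This assumes a little-endian ELF64 binary and mirrors the command above by using the segment's file offset and FileSiz; the thread does not settle whether the offset or VirtAddr is the right value when the two differ. The helper names `exec_segments` and `filter_string` are made up for illustration.

```python
import struct

PT_LOAD = 1  # p_type of a loadable segment
PF_X = 0x1   # p_flags bit marking a segment executable


def exec_segments(data):
    """Yield (file_offset, file_size) for each executable PT_LOAD segment.

    Only little-endian ELF64 images are handled in this sketch.
    """
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    # ELF64 header layout: e_phoff at byte 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)
    for i in range(e_phnum):
        # Program header: p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz, ...
        p_type, p_flags, p_offset, _vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", data, e_phoff + i * e_phentsize
        )
        if p_type == PT_LOAD and p_flags & PF_X:
            yield p_offset, p_filesz


def filter_string(path, offset, size):
    """Format an address filter the way it is written in the command above."""
    return f"filter {offset:#x}/{size:#x}@{path}"
```

Reading `/bin/ls` in binary mode and passing its bytes to `exec_segments` should yield the same offset and size as the `R E` LOAD line in the `readelf` output.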

Then we can look at the PT packets and verify we see some TNT and TIP packets,
as one would expect:

```
host$ perf report -D
...
.  00002d98:  06                                              TNT T (1)
.  00002d99:  59 e6 00 00 00 00                               MTC 0xe6
.  00002d9f:  c1 b0 d0 66 5f 80 7f 00 00                      TIP.PGD 0x7f805f66d0b0
.  00002da8:  00 00 00 00 00 00 00                            PAD
.  00002daf:  d1 34 4b 7c 4b 82 55 00 00                      TIP.PGE 0x55824b7c4b34
.  00002db8:  04 00 00 00 00 00 00                            TNT N (1)
.  00002dbf:  c1 b0 d0 66 5f 80 7f 00 00                      TIP.PGD 0x7f805f66d0b0
...
^C
host$ perf report -D | grep -e 'TNT' | wc -l
130
host$ perf report -D | grep -e 'TIP' | wc -l
190
```
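For an automated check (in a CI test, say), the two `grep | wc -l` pipelines above could be folded into one small helper. This is only a sketch: it does the same substring match as `grep`, so `TIP` also counts `TIP.PGD` and `TIP.PGE` lines, and the function names are invented here.

```python
def count_packets(dump_text, names=("TNT", "TIP")):
    """Count lines of `perf report -D` output containing each packet name.

    Like `grep NAME | wc -l`, a line is counted once per name it contains.
    """
    counts = {name: 0 for name in names}
    for line in dump_text.splitlines():
        for name in names:
            if name in line:
                counts[name] += 1
    return counts


def has_control_flow(counts):
    """True if at least one TNT or TIP line was seen."""
    return counts.get("TNT", 0) > 0 or counts.get("TIP", 0) > 0
```

The text passed in would be the captured standard output of `perf report -D`. A trace with no control-flow packets gives zero for both names.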

Good. Now let's repeat this experiment inside a docker container (one that has
the PERFMON capability enabled).

As it happens, the executable segment for the container's `ls` is at the same
offset as in the host:
```
container$ readelf --segments /bin/ls
...
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000004000 0x0000000000004000 0x0000000000004000
                 0x0000000000015759 0x0000000000015759  R E    0x1000
```

In light of that, the `perf record` command is the same:
```
container$ perf record -e intel_pt//u -v --filter 'filter 0x0000000000004000/0x0000000000015759@/bin/ls' /bin/ls -- /dev/null
DEBUGINFOD_URLS=
Address filter: filter 0x4000/0x15759@/usr/bin/ls
nr_cblocks: 0
affinity: SYS
mmap flush: 1
comp level: 0
maps__set_modules_path_dir: cannot open /lib/modules/6.1.0-31-amd64 dir
Problems setting modules path maps, continuing anyway...
mmap size 528384B
Control descriptor is not initialized
mmap size 528384B
perf_event__synthesize_bpf_events: can't get next program: Operation not permitted
/dev/null
[ perf record: Woken up 1 times to write data ]
failed to write feature HYBRID_TOPOLOGY
[ perf record: Captured and wrote 0.056 MB perf.data ]
```

But this time there are no TIP or TNT packets to be seen:
```
container$ perf report -D | grep -e 'TNT' | wc -l
0
container$ perf report -D | grep -e 'TIP' | wc -l
0
```

I find this behaviour unexpected. It suggests that the executable segment
wasn't traced. I'd expect tracing with a filter to behave the same inside a
container as it does outside one, but maybe my expectations are wrong(?).

One theory is that perf filtered on the host's /bin/ls, which of course wasn't
executed. However, despite trying, I've been unable to confirm this.

I tried, for example, specifying the host's path to `ls`
(/var/lib/docker/overlay2/...) in the filter string, but perf will reject this,
presumably because it checks the path exists in the guest filesystem. Copying
`ls` to that path in the guest doesn't give us a trace with TIP and TNT
packets, so I'm not sure what's going on.
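One way to probe the theory might be to compare the file identity that `stat` reports for the filter path with the device and inode listed for the mapping in `/proc/<pid>/maps` of the traced process, both inside the container and on the host. The sketch below assumes that a mismatch would be informative, which is unverified here. The function names are hypothetical, and the `maps` parsing follows the field layout documented in proc(5).

```python
import os


def stat_identity(path):
    """Return (dev, inode) for path, with dev formatted as in /proc/<pid>/maps.

    os.stat follows symlinks, so /bin/ls resolves to /usr/bin/ls where
    /bin is a symlink.
    """
    st = os.stat(path)
    dev = f"{os.major(st.st_dev):02x}:{os.minor(st.st_dev):02x}"
    return dev, st.st_ino


def mapped_identities(maps_text, path):
    """Yield (dev, inode) for each mapping of `path` in /proc/<pid>/maps text."""
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].strip() == path:
            yield fields[3], int(fields[4])
```

If `stat_identity("/usr/bin/ls")` inside the container differs from what `mapped_identities` reports for the running `ls`, that would at least be a lead worth raising on the list.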

Does anyone know if filters should work in containers?

Thanks

-- 
Best Regards
Edd Barrett

https://www.theunixzoo.co.uk


* Re: perf filters don't work correctly inside containers
  2025-04-29 10:57 perf filters don't work correctly inside containers Edd Barrett
@ 2025-04-30  7:27 ` Edd Barrett
  2025-04-30 22:59   ` Ian Rogers
  0 siblings, 1 reply; 4+ messages in thread
From: Edd Barrett @ 2025-04-30  7:27 UTC (permalink / raw)
  To: linux-perf-users

On Tue, Apr 29, 2025 at 11:57:38AM +0100, Edd Barrett wrote:
> I was then able to reproduce the problem using the standard userspace perf
> tools. I *think* (but I'm not sure) that this might be highlighting a bug in
> perf when used in containers.

Forgot to mention: perf seems to work fine inside the docker container
*without* an ftrace filter, so it's not that the container cannot trace with PT
at all. The problem seems filter-related.

-- 
Best Regards
Edd Barrett

https://www.theunixzoo.co.uk


* Re: perf filters don't work correctly inside containers
  2025-04-30  7:27 ` Edd Barrett
@ 2025-04-30 22:59   ` Ian Rogers
  2025-05-05  8:28     ` Adrian Hunter
  0 siblings, 1 reply; 4+ messages in thread
From: Ian Rogers @ 2025-04-30 22:59 UTC (permalink / raw)
  To: Edd Barrett, Adrian Hunter; +Cc: linux-perf-users

On Wed, Apr 30, 2025 at 12:27 AM Edd Barrett <edd@theunixzoo.co.uk> wrote:
>
> On Tue, Apr 29, 2025 at 11:57:38AM +0100, Edd Barrett wrote:
> > I was then able to reproduce the problem using the standard userspace perf
> > tools. I *think* (but I'm not sure) that this might be highlighting a bug in
> > perf when used in containers.
>
> Forgot to mention: perf seems to work fine inside the docker container
> *without* an ftrace filter, so it's not that the container cannot trace with PT
> at all. The problem seems filter-related.

Hi Edd,

Adrian Hunter is the intel-pt expert. I've not tried this but I
believe it should work. On the kernel side the filename is grabbed and
swapped for an inode:
https://github.com/torvalds/linux/blob/master/kernel/events/core.c#L11474
the path should use the container's namespace. Perhaps you can use
"perf ftrace" to see what's going on inside the kernel. You can also
compare the -vv verbose output to see if the events were opened in
different ways, look for things like:
```
------------------------------------------------------------
perf_event_attr:
  type                             12 (intel_pt)
  size                             136
  config                           0x300e601
  { sample_period, sample_freq }   1
  sample_type                      IP|TID|TIME|CPU|IDENTIFIER
  read_format                      ID|LOST
  disabled                         1
  inherit                          1
  exclude_kernel                   1
  exclude_hv                       1
  enable_on_exec                   1
  sample_id_all                    1
  aux_watermark                    32768
------------------------------------------------------------
sys_perf_event_open: pid 288789  cpu 0  group_fd -1  flags 0x8 = 7
```
which is saying the perf_event_open succeeded giving an fd of 7.
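To make that comparison less error-prone, the `perf_event_attr` dumps from the host and container `-vv` logs could be parsed and diffed. The following is a sketch only: it assumes the dump keeps the two-column layout shown above, it reads just the first dump in each log, and the function names are invented for illustration.

```python
import re


def parse_attr_dump(log_text):
    """Parse the first perf_event_attr block of `perf record -vv` output.

    Returns a dict mapping each field name to its value string.
    """
    attrs = {}
    in_block = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "perf_event_attr:":
            in_block = True
            continue
        if in_block:
            if not stripped or set(stripped) == {"-"}:
                break  # blank line or dashed rule ends the block
            # Field name and value are separated by a run of two or more spaces.
            parts = re.split(r"\s{2,}", stripped, maxsplit=1)
            if len(parts) == 2:
                attrs[parts[0]] = parts[1]
    return attrs


def diff_attrs(host, container):
    """Return {field: (host_value, container_value)} for fields that differ."""
    keys = sorted(set(host) | set(container))
    return {
        k: (host.get(k), container.get(k))
        for k in keys
        if host.get(k) != container.get(k)
    }
```

An empty result from `diff_attrs` would suggest the events were opened with the same attributes in both environments.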

Thanks,
Ian


* Re: perf filters don't work correctly inside containers
  2025-04-30 22:59   ` Ian Rogers
@ 2025-05-05  8:28     ` Adrian Hunter
  0 siblings, 0 replies; 4+ messages in thread
From: Adrian Hunter @ 2025-05-05  8:28 UTC (permalink / raw)
  To: Ian Rogers, Edd Barrett, Shishkin, Alexander; +Cc: linux-perf-users

+Alex

On 01/05/2025 01:59, Ian Rogers wrote:
> On Wed, Apr 30, 2025 at 12:27 AM Edd Barrett <edd@theunixzoo.co.uk> wrote:
>>
>> On Tue, Apr 29, 2025 at 11:57:38AM +0100, Edd Barrett wrote:
>>> I was then able to reproduce the problem using the standard userspace perf
>>> tools. I *think* (but I'm not sure) that this might be highlighting a bug in
>>> perf when used in containers.
>>
>> Forgot to mention: perf seems to work fine inside the docker container
>> *without* an ftrace filter, so it's not that the container cannot trace with PT
>> at all. The problem seems filter-related.
> 
> Hi Edd,
> 
> Adrian Hunter is the intel-pt expert. I've not tried this but I

Thanks Ian!

> believe it should work. On the kernel side the filename is grabbed and
> swapped for an inode:
> https://github.com/torvalds/linux/blob/master/kernel/events/core.c#L11474

Looks OK to me too, but Alex might know more.

> the path should use the container's namespace. Perhaps you can use
> "perf ftrace" to see what's going on inside the kernel. You can also
> compare the -vv verbose output to see if the events were opened in
> different ways, look for things like:
>  ```
> ------------------------------------------------------------
> perf_event_attr:
>   type                             12 (intel_pt)
>   size                             136
>   config                           0x300e601
>   { sample_period, sample_freq }   1
>   sample_type                      IP|TID|TIME|CPU|IDENTIFIER
>   read_format                      ID|LOST
>   disabled                         1
>   inherit                          1
>   exclude_kernel                   1
>   exclude_hv                       1
>   enable_on_exec                   1
>   sample_id_all                    1
>   aux_watermark                    32768
> ------------------------------------------------------------
> sys_perf_event_open: pid 288789  cpu 0  group_fd -1  flags 0x8 = 7
> ```
> which is saying the perf_event_open succeeded giving an fd of 7.
> 
> Thanks,
> Ian


