Re: Extending bpf_get_ns_current_pid_tgid()

BPF List
 help / color / mirror / Atom feed

From: Yonghong Song <yhs@fb.com>
To: Daniel Xu <dxu@dxuuu.xyz>, <bpf@vger.kernel.org>
Cc: <cneirabustos@gmail.com>, <ebiederm@xmission.com>, <blez@fb.com>
Subject: Re: Extending bpf_get_ns_current_pid_tgid()
Date: Thu, 12 Nov 2020 20:13:21 -0800	[thread overview]
Message-ID: <13b5b2dd-bec0-cef2-7304-7e5a09bafb6c@fb.com> (raw)
In-Reply-To: <C71Q73J0Y8S5.3PXMV3YTPDCL7@maharaja>



On 11/12/20 4:57 PM, Daniel Xu wrote:
> On Thu Nov 12, 2020 at 4:27 PM PST, Yonghong Song wrote:
>>
>>
>> On 11/12/20 2:20 PM, Daniel Xu wrote:
>>> Hi,
>>>
>>> I'm looking at the current implementation of
>>> bpf_get_ns_current_pid_tgid() and the helper seems to be a bit overly
>>> restricting to me. Specifically the following line:
>>>
>>>       if (!ns_match(&pidns->ns, (dev_t)dev, ino))
>>>               goto clear;
>>>
>>> Why bail if the inode # does not match? IIUC from the old discussions,
>>> it was b/c in the future pidns files might belong to different devices.
>>> It's not clear to me (possibly b/c I'm missing something) why the inode
>>> has to match as well.
>>
>> Yes, pidns file might belong to different devices in theory so we need
>> to match dev as well.
>>
>> The inode number needs to match so we can ensure user indeed wants to
>> get the *current pidns* tgid/pid.
> 
> Right, this double-checking at the API level is what feels strange to
> me -- why make the user prove they know what they're doing?

If we do not have this checking, it is possible that in interrupt
context, pidns #10 user may get a tgid/pid actually from pisns #11,
and tgid/pid could be valid for pidns #10. This result will be
actually wrong.

> 
> Furthermore, the "proof" restricts flexibility. It's as if
> bpf_get_current_task() required a (dev,ino) pair. How would you get the
> namespaced pid for a process you don't know about yet? eg when you're
> profiling the system.

Did not fully understand questions here. Do you mean
   bpf_get_current_task(dev, ino)
that will be weird. task is not associated with dev/ino.

> 
>>
>> (dev, ino) input expressed user intention. Without this, in no-process
>> context, it will be hard to interpret the results.
> 
> But bpf_get_current_pid_tgid() doesn't return errors so this shouldn't
> either, right?

Different helpers can have different signatures.

> 
>>
>>>
>>> Would it be possible to instead have the helper return the pid/tgid of
>>> the current task as viewed _from_ the `dev`/`ino` pidns? If the current
>>> task is hidden from the `dev`/`ino` pidns, then return -ENOENT. The use
>>> case is for bpftrace symbolize stacks when run inside a container. For
>>> example:
>>>
>>>       (in-container)# bpftrace -e 'profile:hz:99 { print(ustack) }'
>>
>> I think you try to propose something like below:
>> - user provides dev/ino
>> - the helper will try to go through all pidns'es (not just active
>> one), if any match pidns match, returns tgid/pid in that pidns,
>> otherwise, returns -ENOENT.
> 
> Right, exactly.

If you want to do this, you will need a new helper like
   bpf_get_ns_pid_tgid

It actually will be weird to use this helper as it looks like
you try to get pid/tgid of another ns. So we do need to nail
down the use case here.

> 
>>
>> The current helper is
>> bpf_get_ns_current_pid_tgid
>> you want
>> bpf_get_ns_pid_tgid
>>
>> I think it is possible, you need to check
>> pid->numbers[pid_level].ns
>> for all pid levels. You need to get a reference count for the namespace
>> to ensure valid result.
>>
>> This may work for root inode, but for container inode, it may have
>> issues. For example,
>> container 1: create, inode 2
>> container 1 removed
>> container 2: create, inode 2
>> If you use inode 2, depending on timing you may accidentally targetting
>> wrong container.
> 
> Yeah, so maybe an fd to /proc/<pid>/ns/pid or something.
> 
>>
>> I think you can workaround the issue without this helper. See below.
>>
>>>
>>> This currently does not work b/c bpftrace will generate a prog that gets
>>> the root pidns pid, pack it with the stackid, and pass it up to
>>> userspace. But b/c bpftrace is running inside the container, the root
>>> pidns pid is invalid and symbolization fails.
>>
>> bpftrace can generate a program takes dev/inode as input parameters in
>> map. The bpftrace will supply dev/inode value, by query the current
>> system/container, and then run the program.
> 
> I don't think it's very feasible to have bpftrace integrate with every
> container runtime out there. This also becomes really difficult to
> manage if you want to trace N processes. You'd need N maps or N progs.

Why, just one map to store dev/inode is shared among all progs, right?

> 
>>
>>>
>>> What would be nice is if bpftrace could generate a prog that gets the
>>> current pid as viewed from bpftrace's pidns. Then symbolization would
>>> work.
>>
>> Despite the above workaround, what you really need is although it is
>> running on container, you want to get stack trace interpreted with
>> root pid/tgid for symbolization purpose? But you can already achieve
>> this with bpf_get_pid_tgid()?
> 
> No, this isn't possible when bpftrace runs inside the container. ie
> bpftrace is in a pidns along with the tracees. Bpftrace gets the root
> pidns pid from the kernel but cannot resolve it to the pidns pid. That
> means bpftrace cannot find the executable file to symbolize against.

Not sure whether I understand correct or not. You want root pid to
find exec, right? but bpf_get_pid_tgid() will give your root pid?
Maybe I miss something here...

> 
> [...]
> 
> Thanks,
> Daniel
>

next prev parent reply	other threads:[~2020-11-13  4:13 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-12 22:20 Extending bpf_get_ns_current_pid_tgid() Daniel Xu
2020-11-13  0:27 ` Yonghong Song
2020-11-13  0:57   ` Daniel Xu
2020-11-13  4:13     ` Yonghong Song [this message]
2020-11-13 12:04       ` Blaise Sanouillet
2020-11-13 16:57         ` Yonghong Song
     [not found]         ` <CACiB22i6d2skkJJa7uwVRrYy7dtYOxmLgFwzjtieW4BFn2tzLw@mail.gmail.com>
2020-11-13 16:59           ` Yonghong Song
     [not found]             ` <CACiB22iU3zk4Row=wAween=rSvHJ7j7M5T2KbyFk38arzEwQpQ@mail.gmail.com>
2021-06-16  9:54               ` Blaise Sanouillet
2021-06-16 17:02               ` Yonghong Song
2021-06-16 21:44                 ` Carlos Neira
2021-06-17  2:49                   ` Yonghong Song
2021-06-17 10:59                     ` Blaise Sanouillet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=13b5b2dd-bec0-cef2-7304-7e5a09bafb6c@fb.com \
    --to=yhs@fb.com \
    --cc=blez@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=cneirabustos@gmail.com \
    --cc=dxu@dxuuu.xyz \
    --cc=ebiederm@xmission.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox