* [GIT PULL] execve updates for v6.13-rc1
@ 2024-11-19 17:00 Kees Cook
2024-11-20 22:32 ` Linus Torvalds
0 siblings, 1 reply; 27+ messages in thread
From: Kees Cook @ 2024-11-19 17:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter,
Eric W. Biederman, Kees Cook, Nir Lichtman,
syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum,
Zbigniew Jędrzejewski-Szmek
Hi Linus,
Please pull these execve updates for v6.13-rc1.
Thanks!
-Kees
The following changes since commit 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b:
Linux 6.12-rc2 (2024-10-06 15:32:27 -0700)
are available in the Git repository at:
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git tags/execve-v6.13-rc1
for you to fetch changes up to 8c9870077aac4b141decfb42431415bb32d1fedf:
exec: remove legacy custom binfmt modules autoloading (2024-11-16 20:54:03 -0800)
----------------------------------------------------------------
execve updates for v6.13-rc1
- binfmt: Fix comment typos (Christophe JAILLET)
- exec: Use argv[0] for "comm" with AT_EMPTY_PATH (Tycho Andersen,
Dan Carpenter, Nir Lichtman)
- exec: remove legacy custom binfmt modules autoloading (Nir Lichtman)
- coredump: Do not lock when copying "comm"
- MAINTAINERS: add auxvec.h and set myself as maintainer
----------------------------------------------------------------
Christophe JAILLET (1):
fs: binfmt: Fix a typo
Dan Carpenter (1):
exec: Fix a NULL vs IS_ERR() test in bprm_add_fixup_comm()
Kees Cook (4):
coredump: Do not lock during 'comm' reporting
MAINTAINERS: exec: Add auxvec.h UAPI
MAINTAINERS: exec: Mark Kees as maintainer
exec: NULL out bprm->argv0 when it is an ERR_PTR
Nir Lichtman (1):
exec: remove legacy custom binfmt modules autoloading
Tycho Andersen (2):
exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case
selftests/exec: add a test for execveat()'s comm
nir@lichtman.org (1):
exec: move warning of null argv to be next to the relevant code
MAINTAINERS | 3 +-
fs/binfmt_misc.c | 2 +-
fs/exec.c | 62 ++++++++++++++++++--------
include/linux/binfmts.h | 1 +
include/linux/coredump.h | 4 +-
tools/testing/selftests/exec/execveat.c | 77 +++++++++++++++++++++++++++++++--
6 files changed, 123 insertions(+), 26 deletions(-)
--
Kees Cook
^ permalink raw reply [flat|nested] 27+ messages in thread* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-19 17:00 [GIT PULL] execve updates for v6.13-rc1 Kees Cook @ 2024-11-20 22:32 ` Linus Torvalds 2024-11-20 22:33 ` Linus Torvalds 2024-11-21 0:54 ` Eric W. Biederman 0 siblings, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2024-11-20 22:32 UTC (permalink / raw) To: Kees Cook Cc: linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Eric W. Biederman, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek On Tue, 19 Nov 2024 at 09:00, Kees Cook <kees@kernel.org> wrote: > > - exec: Use argv[0] for "comm" with AT_EMPTY_PATH (Tycho Andersen, > Dan Carpenter, Nir Lichtman) Ugh. I *really* despise this one. People: we *have* a filename. It's right there in the dentry. Which is right there as bprm->file->f_dentry.dentry. And that's actually going to match the actual execcutable, unlike, for example, argv[0], which can be filled in with random data. *AND* we don't need any silly and expensive get_user_arg_ptr() and strndup_user() copy for it, which does that user access twice. And no, we shouldn't fall back to the horrid thing that bprm->fdpath does either. That's the thing that you apparently thought was too ugly to use, and literally the *only* use of it was for this case. The reason that code existed at all was to generate a filename, and because we didn't use to have access to the 'bprm->file' back in the days. But that was changed by commit 978ffcbf00d8 ("execve: open the executable file before doing anything else"). So I really really think that what this code *should* have done is - get rid of fdpath that you made pointless by not using it for 'comm[]' - teach the code to use the dentry name instead because this horrible hack is too broken to live. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-20 22:32 ` Linus Torvalds @ 2024-11-20 22:33 ` Linus Torvalds 2024-11-20 22:50 ` Linus Torvalds 2024-11-21 0:54 ` Eric W. Biederman 1 sibling, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2024-11-20 22:33 UTC (permalink / raw) To: Kees Cook Cc: linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Eric W. Biederman, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek On Wed, 20 Nov 2024 at 14:32, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > People: we *have* a filename. It's right there in the dentry. Which is > right there as bprm->file->f_dentry.dentry. ... that should obviously be '...->f_path.dentry'. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-20 22:33 ` Linus Torvalds @ 2024-11-20 22:50 ` Linus Torvalds 2024-11-21 2:36 ` Al Viro 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2024-11-20 22:50 UTC (permalink / raw) To: Kees Cook Cc: linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Eric W. Biederman, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek On Wed, 20 Nov 2024 at 14:33, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 20 Nov 2024 at 14:32, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > People: we *have* a filename. It's right there in the dentry. Which is > > right there as bprm->file->f_dentry.dentry. > > ... that should obviously be '...->f_path.dentry'. One thing to look out for is that dentry->name can be switched around by renames etc So you probably want to do something like const char *name = smp_load_acquire(&dentry->d_name.name); under the RCU read lock before then copying it with strscpy(). It should always be NULL-terminated. If you want to be extra careful, you might surround it with a read_seqbegin_or_lock(&rename_lock, &seq); .. if (need_seqretry(&rename_lock, seq)) { seq = 1; goto restart; } done_seqretry(&rename_lock, seq); but I seriously doubt we even care that much. If people want to mess with the executable filename, they can just use links or whatever, so this is a "politeness" thing rather than anything else. (And yes, our d_path() code tries to be extra smart and also uses 'name->len' to avoid having to search for the end of the string etc, but that then causes huge complexities with ->len and ->name not matching, so it has to compensate for that with being extra careful. So don't do that) Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-20 22:50 ` Linus Torvalds @ 2024-11-21 2:36 ` Al Viro 2024-11-21 2:41 ` Al Viro 2024-11-21 2:51 ` Linus Torvalds 0 siblings, 2 replies; 27+ messages in thread From: Al Viro @ 2024-11-21 2:36 UTC (permalink / raw) To: Linus Torvalds Cc: Kees Cook, linux-kernel, Christophe JAILLET, Dan Carpenter, Eric W. Biederman, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek On Wed, Nov 20, 2024 at 02:50:39PM -0800, Linus Torvalds wrote: > So you probably want to do something like > > const char *name = smp_load_acquire(&dentry->d_name.name); > > under the RCU read lock before then copying it with strscpy(). It > should always be NULL-terminated. > > If you want to be extra careful, you might surround it with a > > read_seqbegin_or_lock(&rename_lock, &seq); What for? char name[something]; sprintf(name, "%*pD", sizeof(name) - 1, file); and be done with that... ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 2:36 ` Al Viro @ 2024-11-21 2:41 ` Al Viro 2024-11-21 2:51 ` Linus Torvalds 1 sibling, 0 replies; 27+ messages in thread From: Al Viro @ 2024-11-21 2:41 UTC (permalink / raw) To: Linus Torvalds Cc: Kees Cook, linux-kernel, Christophe JAILLET, Dan Carpenter, Eric W. Biederman, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek On Thu, Nov 21, 2024 at 02:36:19AM +0000, Al Viro wrote: > On Wed, Nov 20, 2024 at 02:50:39PM -0800, Linus Torvalds wrote: > > So you probably want to do something like > > > > const char *name = smp_load_acquire(&dentry->d_name.name); > > > > under the RCU read lock before then copying it with strscpy(). It > > should always be NULL-terminated. > > > > If you want to be extra careful, you might surround it with a > > > > read_seqbegin_or_lock(&rename_lock, &seq); > > What for? > > char name[something]; > > sprintf(name, "%*pD", sizeof(name) - 1, file); > > and be done with that... ... or struct name_snapshot n; take_dentry_name_snapshot(&n, file->f_path.dentry); do_something(n.name.name); release_dentry_name_snapshot(&n); ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 2:36 ` Al Viro 2024-11-21 2:41 ` Al Viro @ 2024-11-21 2:51 ` Linus Torvalds 1 sibling, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2024-11-21 2:51 UTC (permalink / raw) To: Al Viro Cc: Kees Cook, linux-kernel, Christophe JAILLET, Dan Carpenter, Eric W. Biederman, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek On Wed, 20 Nov 2024 at 18:36, Al Viro <viro@zeniv.linux.org.uk> wrote: > > If you want to be extra careful, you might surround it with a > > > > read_seqbegin_or_lock(&rename_lock, &seq); > > What for? Yeah, probably no good reason. > char name[something]; > > sprintf(name, "%*pD", sizeof(name) - 1, file); > > and be done with that... I think you might as well just do the much cheaper rcu_read_lock(); strscpy_pad(tsk->comm, sizeof(tsk->comm), smp_read_acquire(&dentry->d_name.name); rcu_read_unlock(); and be done with it. Using snprintf() on tsk->comm[] violates all the rules we have for it, and while our rules aren't great, *that* one we shouldn't do. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-20 22:32 ` Linus Torvalds 2024-11-20 22:33 ` Linus Torvalds @ 2024-11-21 0:54 ` Eric W. Biederman 2024-11-21 2:23 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Eric W. Biederman @ 2024-11-21 0:54 UTC (permalink / raw) To: Linus Torvalds Cc: Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek Linus Torvalds <torvalds@linux-foundation.org> writes: > On Tue, 19 Nov 2024 at 09:00, Kees Cook <kees@kernel.org> wrote: >> >> - exec: Use argv[0] for "comm" with AT_EMPTY_PATH (Tycho Andersen, >> Dan Carpenter, Nir Lichtman) > > Ugh. I *really* despise this one. > > People: we *have* a filename. It's right there in the dentry. Which is > right there as bprm->file->f_dentry.dentry. __set_task_comm cannot be called with bprm->file->f_dentry unconditionally. That will break userspace. On at least debian using /etc/alternatives there are a lot of symlinks to executables. For example if I run "vi", two symlinks are followed: /usr/bin/vi -> /etc/alternatives/vi /etc/alternatives/vi -> usr/bin/vim.nox Seeing "vim.nox" in ps instead "vi" would be weird and break tooks like psgrep. https://lore.kernel.org/all/Zv1OayMEmLP2kjhj@kawka3.in.waw.pl/ The reason bprm->file->f_dentry.dentry was abandoned were concerns about breaking userspace. Limited to just the binaries that systemd wants to call. I don't know if symlinks to binaries matters. At a quick skim I couldn't see anything but *shrug*. > And that's actually going to match the actual execcutable, unlike, for > example, argv[0], which can be filled in with random data. > > > *AND* we don't need any silly and expensive get_user_arg_ptr() and > strndup_user() copy for it, which does that user access twice. > > And no, we shouldn't fall back to the horrid thing that bprm->fdpath > does either. That's the thing that you apparently thought was too ugly > to use, and literally the *only* use of it was for this case. bprm->fdpath is for passing the pathname of the script to an interpreter like /bin/sh. That task->comm wind up being filled from it, is a bit of a surprise. > The reason that code existed at all was to generate a filename, and > because we didn't use to have access to the 'bprm->file' back in the > days. > > But that was changed by commit 978ffcbf00d8 ("execve: open the > executable file before doing anything else"). That should work for binfmt_script. I don't know which solution is less racy, and less prone to problems in the binfmt_script case. No one has complained so I have been assuming that case works just fine. > So I really really think that what this code *should* have done is > > - get rid of fdpath that you made pointless by not using it for 'comm[]' Again binfmt_script still uses it. I can assure you there was no thought at all given to task->comm when the contents of bprm->fdpath were chosen. Which is why there is a problem now. task->comm is for people. bprm->fdpath is for programs. > - teach the code to use the dentry name instead Which is fine for task->comm, when the pathname is NULL. Assuming it actually makes things usable for userspace. That doesn't work for bprm->fdpath with becomes bprm->interp. That would require using d_path. Eric ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 0:54 ` Eric W. Biederman @ 2024-11-21 2:23 ` Linus Torvalds 2024-11-21 2:29 ` Kees Cook ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Linus Torvalds @ 2024-11-21 2:23 UTC (permalink / raw) To: Eric W. Biederman Cc: Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek On Wed, 20 Nov 2024 at 16:55, Eric W. Biederman <ebiederm@xmission.com> wrote: > > __set_task_comm cannot be called with bprm->file->f_dentry > unconditionally. No, no. Only for the "no path" case. > The reason bprm->file->f_dentry.dentry was abandoned were concerns > about breaking userspace. There's no way it can break user space considering that right now comm[] ends up being just garbage. And I do *not* want to replace one garbage with a new one. > > - get rid of fdpath that you made pointless by not using it for 'comm[]' > > Again binfmt_script still uses it. Ahh, yeah, we can't just get rid of it. But we're *not* adding a new complete garbage "copy random stuff badly from user space" thing. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 2:23 ` Linus Torvalds @ 2024-11-21 2:29 ` Kees Cook 2024-11-21 2:45 ` Linus Torvalds 2024-11-21 17:22 ` Zbigniew Jędrzejewski-Szmek 2 siblings, 0 replies; 27+ messages in thread From: Kees Cook @ 2024-11-21 2:29 UTC (permalink / raw) To: Linus Torvalds, Eric W. Biederman Cc: linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek On November 20, 2024 6:23:55 PM PST, Linus Torvalds <torvalds@linux-foundation.org> wrote: >But we're *not* adding a new complete garbage "copy random stuff badly >from user space" thing. I'll drop this and resend the PR. I think using the dentry should work, as you say. Though with a more paranoid hat on, I wonder if it could be an info leak if the original filename isn't "visible" in the context it's running in; containers do weird stuff. (But I'm hard-pressed to imagine how the filename would actually be a useful "secret".) -Kees -- Kees Cook ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 2:23 ` Linus Torvalds 2024-11-21 2:29 ` Kees Cook @ 2024-11-21 2:45 ` Linus Torvalds 2024-11-21 17:22 ` Zbigniew Jędrzejewski-Szmek 2 siblings, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2024-11-21 2:45 UTC (permalink / raw) To: Eric W. Biederman Cc: Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum, Zbigniew Jędrzejewski-Szmek On Wed, 20 Nov 2024 at 18:23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > Again binfmt_script still uses it. > > Ahh, yeah, we can't just get rid of it. Actually, that reminds me: we've had issues with this horrible fdpath hack before due to close-on-exec, and that's why we have BINPRM_FLAGS_PATH_INACCESSIBLE. And that's independent of the whole "/proc isn't always mounted", so that the /dev/fd/%d/.. paths don't work at all. It would probably have been much nicer if we just put the real path of the dentry originally in ->fdpath, but I suspect it's too late to fix now: it would _mostly_ be a more reliable and meaningful path, and it would fix the close-on-exec situation, but I would not be surprised if we have some horrible user that really depends on the 'fd' being the only way to actually access it (either due to permission issues, or because of it having been actively unlinked). Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 2:23 ` Linus Torvalds 2024-11-21 2:29 ` Kees Cook 2024-11-21 2:45 ` Linus Torvalds @ 2024-11-21 17:22 ` Zbigniew Jędrzejewski-Szmek 2024-11-21 17:28 ` Linus Torvalds [not found] ` <87zflrsw1c.fsf@email.froward.int.ebiederm.org> 2 siblings, 2 replies; 27+ messages in thread From: Zbigniew Jędrzejewski-Szmek @ 2024-11-21 17:22 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Wed, Nov 20, 2024 at 06:23:55PM -0800, Linus Torvalds wrote: > On Wed, 20 Nov 2024 at 16:55, Eric W. Biederman <ebiederm@xmission.com> wrote: > > > > __set_task_comm cannot be called with bprm->file->f_dentry > > unconditionally. > > No, no. Only for the "no path" case. > > > The reason bprm->file->f_dentry.dentry was abandoned were concerns > > about breaking userspace. > > There's no way it can break user space considering that right now > comm[] ends up being just garbage. It'll "break userspace" in the sense the the resulting program name visible in /proc/self/{comm,stat,status} would be different than the expected value. Currently userspace is not using fexecve because this string is "just garbage". We'd very much like to start using fexecve, but we cannot do this (in the general case) if that'll result in a changed program name. If we change the value from the current (garbage) value to something that doesn't provide identical behaviour between execve and fexecve, fexecve will unused. As Eric wrote, there are various programs which are symlinked. /etc/alternatives is one group, but we also have "multicall binaries" which present different behaviour depending on the name. Some of those use argv[0], but other may use comm. We really need the name that the user called the program as, not the name after symlink chasing. Even if we end up copying a string from userspace unnecessarilly, does this matter? execve is a heavyweight operation and copying a a dozen bytes extra hardly matters. Zbyszek ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 17:22 ` Zbigniew Jędrzejewski-Szmek @ 2024-11-21 17:28 ` Linus Torvalds 2024-11-21 17:47 ` Linus Torvalds 2024-11-21 18:00 ` Zbigniew Jędrzejewski-Szmek [not found] ` <87zflrsw1c.fsf@email.froward.int.ebiederm.org> 1 sibling, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2024-11-21 17:28 UTC (permalink / raw) To: Zbigniew Jędrzejewski-Szmek Cc: Eric W. Biederman, Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, 21 Nov 2024 at 09:22, Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> wrote: > > It'll "break userspace" in the sense the the resulting program name > visible in /proc/self/{comm,stat,status} would be different than the > expected value. Currently userspace is not using fexecve because this > string is "just garbage". We'd very much like to start using fexecve, > but we cannot do this (in the general case) if that'll result in a > changed program name. If we change the value from the current > (garbage) value to something that doesn't provide identical behaviour > between execve and fexecve, fexecve will unused. Well, then you had better not use fexecve(), because that "identical behavior" is fundamentally impossible. The thing is, "argv[0]" can - and will be - complete garbage. Yes, it's *often* the same as the filename, but there is actually zero guarantee of that. It can be any random thing - it's literally just a user space argument. And the dentry name *will* be the name of the underlying executable. Again, it is *often* the same as the filename, but symlinks have already been brought up as an example when it isn't. See? There is no single solution, but at least the dentry name is a *reliable* thing, not a random garbage thing passed in by user space. End result: if you don't like it, don't use fexecve(). It really is that simple. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 17:28 ` Linus Torvalds @ 2024-11-21 17:47 ` Linus Torvalds 2024-11-21 18:00 ` Zbigniew Jędrzejewski-Szmek 1 sibling, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2024-11-21 17:47 UTC (permalink / raw) To: Zbigniew Jędrzejewski-Szmek Cc: Eric W. Biederman, Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, 21 Nov 2024 at 09:28, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > See? There is no single solution, but at least the dentry name is a > *reliable* thing, not a random garbage thing passed in by user space. Note that the reason I want 'comm[]' to be something actually *reliable* is that that is what the kernel actually uses for various error messages etc. User space tools like 'ps' already do the "dig around in user space to look for argv[]" thing, and tools like "ps" do *not* actually use comm[] at all from normal user space programs. For example, why do you think "ps 3545" says 3545 ? Ssl 0:00 /usr/libexec/gnome-shell-calendar-server but when I do "cat /proc/3545/stat" I see 3545 (gnome-shell-cal) S 3129 ... in the output? That's exactly because comm[] has that "gnome-shell-cal" (limited to 16 characters with the NUL due to TASK_COMM_LEN), but 'ps' will go digging into user space argv[0] by looking at /proc/*/cmdline which gets you much more than just the name of the executable. And I do *not* want anybody to think that "comm[]" should act IN ANY WAY identically to /proc/*/cmdline. It never has, and it never will. Exactly because the two are completely different things, for different uses. And yes, people have historically actively messed with the argv[0] thing, and actively tried to hide what the actual executable was. I am not AT ALL interested in letting people play those kinds of games with "comm[]". There's a very real reason I rejected that original execve() change. It was stupid and wrong. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 17:28 ` Linus Torvalds 2024-11-21 17:47 ` Linus Torvalds @ 2024-11-21 18:00 ` Zbigniew Jędrzejewski-Szmek 2024-11-21 18:02 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Zbigniew Jędrzejewski-Szmek @ 2024-11-21 18:00 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, Nov 21, 2024 at 09:28:09AM -0800, Linus Torvalds wrote: > On Thu, 21 Nov 2024 at 09:22, Zbigniew Jędrzejewski-Szmek > <zbyszek@in.waw.pl> wrote: > > > > It'll "break userspace" in the sense the the resulting program name > > visible in /proc/self/{comm,stat,status} would be different than the > > expected value. Currently userspace is not using fexecve because this > > string is "just garbage". We'd very much like to start using fexecve, > > but we cannot do this (in the general case) if that'll result in a > > changed program name. If we change the value from the current > > (garbage) value to something that doesn't provide identical behaviour > > between execve and fexecve, fexecve will unused. > > Well, then you had better not use fexecve(), because that "identical > behavior" is fundamentally impossible. Identical — as far as the callee is concerned. Basically, we'd like to switch the execve() that we use in systemd to start everything with fexecve(), but this should be invisible to both the programs that are started and users who call ps/pgrep/…. > The thing is, "argv[0]" can - and will be - complete garbage. Yes, > it's *often* the same as the filename, but there is actually zero > guarantee of that. It can be any random thing - it's literally just a > user space argument. Eh, no. I think you're trying to say that argv[0] is user-controlled. Sure, this is a feature. Systemd even exposes this as ExecStart=@program argv0 argv1 argv2… It can be overridden, but it's not "garbage". > And the dentry name *will* be the name of the underlying executable. > Again, it is *often* the same as the filename, but symlinks have > already been brought up as an example when it isn't. Exactly. This is the crux of the problem. We think that fd-based syscalls are great, and would like to use fexecve as a drop-in replacement, but currently can't. I would love to tell the rest of the userspace to stop ever looking at COMM, but, as you very well know, once an API is made public and widely used, it's very hard to redefine it. > See? There is no single solution, but at least the dentry name is a > *reliable* thing, not a random garbage thing passed in by user space. Reliable – yes. Useful – no. Zbyszek ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 18:00 ` Zbigniew Jędrzejewski-Szmek @ 2024-11-21 18:02 ` Linus Torvalds 2024-11-21 18:47 ` Zbigniew Jędrzejewski-Szmek 2024-11-21 18:50 ` Kees Cook 0 siblings, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2024-11-21 18:02 UTC (permalink / raw) To: Zbigniew Jędrzejewski-Szmek Cc: Eric W. Biederman, Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, 21 Nov 2024 at 10:00, Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> wrote: > > Identical — as far as the callee is concerned. > Basically, we'd like to switch the execve() that we use in systemd > to start everything with fexecve(), but this should be invisible to > both the programs that are started and users who call ps/pgrep/…. I'm not discussing this. If you cannot understand the difference between comm[] and argv[0], this discussion is entirely pointless. I'd suggest you just not use fexecve(). Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 18:02 ` Linus Torvalds @ 2024-11-21 18:47 ` Zbigniew Jędrzejewski-Szmek 2024-11-21 18:50 ` Kees Cook 1 sibling, 0 replies; 27+ messages in thread From: Zbigniew Jędrzejewski-Szmek @ 2024-11-21 18:47 UTC (permalink / raw) To: Linus Torvalds Cc: Eric W. Biederman, Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, Nov 21, 2024 at 10:02:03AM -0800, Linus Torvalds wrote: > On Thu, 21 Nov 2024 at 10:00, Zbigniew Jędrzejewski-Szmek > <zbyszek@in.waw.pl> wrote: > > > > Identical — as far as the callee is concerned. > > Basically, we'd like to switch the execve() that we use in systemd > > to start everything with fexecve(), but this should be invisible to > > both the programs that are started and users who call ps/pgrep/…. > > I'm not discussing this. If you cannot understand the difference > between comm[] and argv[0], this discussion is entirely pointless. You brought up argv to say that it's "garbage". I was replying to that part of your message. With execve, the initial values of comm and argv are under full control of the caller, e.g. comm can be set as 'ln -s /bin/sleep /tmp/whatever && /tmp/whatever'. So comm doesn't have to match the "actual executable name" and if one is "garbage" then so is the other one. I very much understand the difference between comm and argv[0]. Once again: the goal is to be able to use fexecve in a way that doesn't cause a visible change to the called programs. Comm was previously set to the basename of the path, and if it's set based on argv[0] with fexecve, we get the behaviour that we want (*) and the rest of userspace is happy. If we set it to f_path.dentry, userspace is unhappy. (*) The exception is when a program is called with an argv[0] that has some special value. But it's such a niche case that nobody cares about it. There were some cups printer plugins which passed something important via argv[0], but that was at least a decade ago. I'm not aware of anybody trying to do that. In the worst case, we can fall back to execve for those cases if absolutely necessary. > I'd suggest you just not use fexecve(). Pffff. That's not helpful. The patch to use fexecve/execveat in systemd was merged in 2020, but we can't make this the default behaviour because of the comm blocker. Zbyszek ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 18:02 ` Linus Torvalds 2024-11-21 18:47 ` Zbigniew Jędrzejewski-Szmek @ 2024-11-21 18:50 ` Kees Cook 2024-11-21 19:23 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Kees Cook @ 2024-11-21 18:50 UTC (permalink / raw) To: Linus Torvalds Cc: Zbigniew Jędrzejewski-Szmek, Eric W. Biederman, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, Nov 21, 2024 at 10:02:03AM -0800, Linus Torvalds wrote: > On Thu, 21 Nov 2024 at 10:00, Zbigniew Jędrzejewski-Szmek > <zbyszek@in.waw.pl> wrote: > > > > Identical — as far as the callee is concerned. > > Basically, we'd like to switch the execve() that we use in systemd > > to start everything with fexecve(), but this should be invisible to > > both the programs that are started and users who call ps/pgrep/…. > > I'm not discussing this. If you cannot understand the difference > between comm[] and argv[0], this discussion is entirely pointless. > > I'd suggest you just not use fexecve(). I think you're talking past each other. Here's my thought process: To Linus's point, comm[] is "garbage" in that a process can change it to anything it wants (i.e. PR_SET_NAME). I think everyone understands this, but that's not what what I see as the issue. What commp[] does have, however, is a "default" (starting) value set by execve(), which is that of the basename of the "pathname" argument of the syscall (yes, not argv[0], but see below). Most process list analysis tools (i.e. "ps") use /proc/*/comm for the short name view of a process. Most process launchers (i.e. shells, systemd), tend to use basename(pathname) as argv[0] when running programs. Again, I think everyone understands these things too, but I think it's worth calling out that while comm[] and argv[0] are obviously different things, in most cases they start out with identical values after execve(). The problem is fexecve(), where the pathname is lost since it was, obviously, only passed to open(). Right now, we re-use bprm->fdpath since it was already there for scripts, but that unhelpfully just shoves the fd number into comm[], making "ps" output unreadable. Nobody likes seeing basename(fdpath) in comm[] today. I don't think it's unreasonable to want to retain the basename(pathname) starting value of comm[] when switching from execve() to fexecve(). Using the f_path name is certainly better than basename(fdpath), but it doesn't really give userspace what they'd like nor expect. Since comm[] is mutable anyway, why not just copy argv[0] for this case, as that would give userspace exactly what it was expecting and does no harm? i.e. yes, comm[] is "garbage", but let's replicate in fexecve() as close to the default behavior we have from execve() as we can. Why expose f_path, which doesn't appear in comm[] nor cmdline currently? The only flip side I can see is that "ps" etc, should just never use comm at all, and instead use argv[0] from cmdline. That would get the same behavior as described here, but it is much more expensive in that is takes the mm lock and has to walk userspace memory. But then we don't need to change anything on the kernel side and just leave basename(fdpath) alone. I would note, of course, that cmdline can also be changed by the process (PR_SET_MM_ARG_START), too, so it is also "garbage". Just more expensive garbage than comm[]. I still think the original proposal is the most helpful to userspace. -Kees -- Kees Cook ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 18:50 ` Kees Cook @ 2024-11-21 19:23 ` Linus Torvalds 2024-11-21 21:31 ` Zbigniew Jędrzejewski-Szmek 2024-11-21 22:06 ` Kees Cook 0 siblings, 2 replies; 27+ messages in thread From: Linus Torvalds @ 2024-11-21 19:23 UTC (permalink / raw) To: Kees Cook Cc: Zbigniew Jędrzejewski-Szmek, Eric W. Biederman, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, 21 Nov 2024 at 10:50, Kees Cook <kees@kernel.org> wrote: > > The only flip side I can see is that "ps" etc, should just never use comm > at all, and instead use argv[0] from cmdline Gods people, what are you all on about? THIS IS WHAT PS ALREADY DOES. Stop this completely inane discussion. It's literally like you don't even know what you are talking about. For user space, comm[] is basically the fallback for when cmdline fails for some reason (for example, /proc/*/cmdline will be empty for kworkers, but there are other situations too) The reason? comm[] has *always* been much too limited for 'ps' output. ALWAYS. Yes, you can literally *force* ps to not do that (eg "ps -eo comm") but if you do that, you get the very limited comm[] output that nobody has ever wanted ps to give exactly because it's so limited. People who want 'argv[0]' will typically want argv[1] etc too, because argv[0] simply IS NOT SPECIAL. And yes, 'top' will give comm[] output because it's so much faster. I'm done with this discussion that apparently was brought on by people not knowing what the hell they were doing. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 19:23 ` Linus Torvalds @ 2024-11-21 21:31 ` Zbigniew Jędrzejewski-Szmek 2024-11-21 21:48 ` Linus Torvalds 2024-11-21 22:06 ` Kees Cook 1 sibling, 1 reply; 27+ messages in thread From: Zbigniew Jędrzejewski-Szmek @ 2024-11-21 21:31 UTC (permalink / raw) To: Linus Torvalds Cc: Kees Cook, Eric W. Biederman, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, Nov 21, 2024 at 11:23:46AM -0800, Linus Torvalds wrote: > On Thu, 21 Nov 2024 at 10:50, Kees Cook <kees@kernel.org> wrote: > > > > The only flip side I can see is that "ps" etc, should just never use comm > > at all, and instead use argv[0] from cmdline > > Gods people, what are you all on about? > > THIS IS WHAT PS ALREADY DOES. Nope. 'ps' uses comm, for example in 'ps -C' which is documented as "select by command name" and quite commonly used. It does not use it in the default listing, because it shows "COMMAND", incl. args, which obviously must use /proc//cmdline. But nobody said that it does. Kees only mentioned that "ps uses comm for the short name view of a process", which is true. _You_ wrote about ps and showed an example with COMMAND which is … fine, but not really relevant. If you wanted, you could ask where is comm used by the userspace? The ones I'm aware of where it's directly visible: - 'ps -C' or 'ps -o comm' - pgrep - top - htop Using sourcegraph.com I see mentions in moby, kubernetes, zsh and ohmyzsh, earlyoom, vmtop, and pages of more obscure stuff. It's also exposed via libsystemd's sd_bus_creds and journald's _COMM field. The generic interfaces are the biggest problem. We could probably update top/htop/pgrep to use argv[0], but we can't fix scripts that are out there that either use 'ps -o comm' or look in /proc//comm directly or filter journalctl output by field. > Stop this completely inane discussion. It's literally like you don't > even know what you are talking about. > > For user space, comm[] is basically the fallback for when cmdline > fails for some reason (for example, /proc/*/cmdline will be empty for > kworkers, but there are other situations too) As mentioned, this is easily disproved by running e.g. top/htop/pgrep. > The reason? comm[] has *always* been much too limited for 'ps' output. ALWAYS. > > Yes, you can literally *force* ps to not do that (eg "ps -eo comm") > but if you do that, you get the very limited comm[] output that nobody > has ever wanted ps to give exactly because it's so limited. > > People who want 'argv[0]' will typically want argv[1] etc too, because > argv[0] simply IS NOT SPECIAL. > > And yes, 'top' will give comm[] output because it's so much faster. Exactly. > I'm done with this discussion that apparently was brought on by people > not knowing what the hell they were doing. You keep disagreeing with things that nobody has said. The point is that comm _is_ used in many places that matter. I'm not sure what you're trying to say really, since in the second half of your mail you actually showed an example where this is true. Zbyszek ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 21:31 ` Zbigniew Jędrzejewski-Szmek @ 2024-11-21 21:48 ` Linus Torvalds 2024-11-21 22:24 ` Zbigniew Jędrzejewski-Szmek 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2024-11-21 21:48 UTC (permalink / raw) To: Zbigniew Jędrzejewski-Szmek Cc: Kees Cook, Eric W. Biederman, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, 21 Nov 2024 at 13:31, Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> wrote: > > The point is that comm _is_ used in many places that matter. I'm not > sure what you're trying to say really, since in the second half of > your mail you actually showed an example where this is true. Really, let me say this one more time: if you don't like the dentry name, don't give execveat() a bare file descriptor that has no other name, and claim that you want to use some completely unrelated third thing that has nothing to do with it except in your little dream-world. The dentry name is not just fine, it's THE TRUTH. It's fundamentally what you are executing. It's better than - and fundamentally different from - argv[0], which is something entirely different and is defined to be available elsewhere. If you have done an open() that followed a symlink, the name *behind* the symlink is the *CORRECT* name. So when you then do an execveat(), it's literally the thing *behind* the symlink that you are executing. You are trying to claim that people want argv[0]. No they don't. If they wanted argv[0], they'd use /proc/*/cmdline, which already give you that, and more. And if they don't want argv[0], then they get the name that the executable was started with. Which is the name *behind* the symlink that was resolved by the (preceding) 'open()'. I'm trying to explain to you that comm[] IS NOT, AND HAS NEVER BEEN, argv[0]. The two have *never* matched up and been the same thing. It's simply not how execve() works. I understand that you seem to *think* that it is how execve() works or should work, but that's your hangup, and has no actual basis in reality. I'm telling you that comm[] is not argv[0], and has never been argv[0]. Ever. I will not take that idiotic patch that tries to change decades of history, based on what is purely your personal misunderstanding of what argv[0] is, and does something *stupid* and slow, when something much more true is right there in the dentry name. I don't even understand why you care about the symlink thing. You did an open, you followed the symlink, the file no longer points to the symlink, trying to claim that it does is simply not *true* in any shape or form. Here's a hint for you: do the same open, but never execute it. Then look in /proc/*/fd and see what that says. Horror of horrors, it doesn't show the symlink. It shows what it pointed to. And before you start arguing about that too, let me just say that we're not changing that *either*. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 21:48 ` Linus Torvalds @ 2024-11-21 22:24 ` Zbigniew Jędrzejewski-Szmek 0 siblings, 0 replies; 27+ messages in thread From: Zbigniew Jędrzejewski-Szmek @ 2024-11-21 22:24 UTC (permalink / raw) To: Linus Torvalds Cc: Kees Cook, Eric W. Biederman, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, Nov 21, 2024 at 01:48:59PM -0800, Linus Torvalds wrote: > On Thu, 21 Nov 2024 at 13:31, Zbigniew Jędrzejewski-Szmek > <zbyszek@in.waw.pl> wrote: > > > > The point is that comm _is_ used in many places that matter. I'm not > > sure what you're trying to say really, since in the second half of > > your mail you actually showed an example where this is true. > > Really, let me say this one more time: if you don't like the dentry > name, don't give execveat() a bare file descriptor that has no other > name, and claim that you want to use some completely unrelated third > thing that has nothing to do with it except in your little > dream-world. > > The dentry name is not just fine, it's THE TRUTH. It's fundamentally > what you are executing. It may be THE TRUTH, but this is not what userspace expects. A user calls e.g. /usr/bin/wget and they really expect 'pgrep wget' to return a PID. They don't care that the dentry name is actually 'wget2', even if the exec is realized through fexecve. > It's better than - and fundamentally different from - argv[0], which > is something entirely different and is defined to be available > elsewhere. Kees explained this really well, please check his mail. Everybody knows that argv[0] is (or rather can be) different. What people expect is for /proc//comm to be the basename of the path that was passed to execve(2). Give me an interface that allows the caller to use fexecve() instead and set comm. I don't really care how that interface looks, as long as it's possible to set comm to mimick previous behaviour. The patch uses argv[0] because in a great majority of cases in normal use basename(argv[0]) is the same as comm. So it provides a good fallback. _Better_ than the dentry name, as far as the userspace is concerned. > If you have done an open() that followed a symlink, the name *behind* > the symlink is the *CORRECT* name. So when you then do an execveat(), > it's literally the thing *behind* the symlink that you are executing. > > You are trying to claim that people want argv[0]. No they don't. If > they wanted argv[0], they'd use /proc/*/cmdline, which already give > you that, and more. That's not what I said. What was said is that argv[0] can be used to set comm in a way that _the userspace expects_. To be clear: the dentry name goes 95% of the way. It certainly is better than the current "3" or whatever. If we cannot fix the remaining 5%, we'll need to adjust some userspace and possibly have some annoyed users. So I'd prefer that the kernel merges a patch that uses the dentry name, but it'd be better to use argv[0] which solves the problem 99+% of the way. > And if they don't want argv[0], then they get the name that the > executable was started with. Which is the name *behind* the symlink > that was resolved by the (preceding) 'open()'. > > I'm trying to explain to you that comm[] IS NOT, AND HAS NEVER BEEN, > argv[0]. The two have *never* matched up and been the same thing. It's > simply not how execve() works. Yes, this is exactly what Kees wrote in his mail. Nobody is disputing that. > I don't even understand why you care about the symlink thing. You did > an open, you followed the symlink, the file no longer points to the > symlink, trying to claim that it does is simply not *true* in any > shape or form. I have a generic execution engine (systemd). Userspace uses it to call arbitrary programs. That userspace expects that comm/pgrep/top/htop/etc/etc/etc show the "name of the program". The _name by which it was called_. For decades, this has been the symlink name, not target. Zbyszek ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 19:23 ` Linus Torvalds 2024-11-21 21:31 ` Zbigniew Jędrzejewski-Szmek @ 2024-11-21 22:06 ` Kees Cook 2024-11-21 22:38 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Kees Cook @ 2024-11-21 22:06 UTC (permalink / raw) To: Linus Torvalds Cc: Zbigniew Jędrzejewski-Szmek, Eric W. Biederman, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, Nov 21, 2024 at 11:23:46AM -0800, Linus Torvalds wrote: > I'm done with this discussion that apparently was brought on by people > not knowing what the hell they were doing. This is disrespectful. If you're frustrated you can just say so. I'm certainly frustrated. > For user space, comm[] is basically the fallback for when cmdline > fails for some reason (for example, /proc/*/cmdline will be empty for > kworkers, but there are other situations too) > > The reason? comm[] has *always* been much too limited for 'ps' output. ALWAYS. > > Yes, you can literally *force* ps to not do that (eg "ps -eo comm") > but if you do that, you get the very limited comm[] output that nobody > has ever wanted ps to give exactly because it's so limited. I think I finally figured out why you keep saying this. I think you mean to imply "ps -e" (or similar), not "ps". Asking for more process details ("ps a", "ps -f", "ps -e", etc) uses cmdline. Without options that turn on more details, the default is comm. If you run just "ps", it shows comm: $ strace ps 2>&1 | grep /cmdline | wc -l 0 If you run with detail options it shows cmdline: $ strace ps a 2>&1 | grep /cmdline | wc -l 1266 $ strace ps -f 2>&1 | grep /cmdline | wc -l 1266 This is procps-ng found on all Debian and Ubuntu systems. AFAICT procps-ng is used on Fedora too. Note I'm not saying comm is GOOD or anything. I'm just saying that it IS regularly visible. > And yes, 'top' will give comm[] output because it's so much faster. Exactly. By default, top and ps both show comm, which in the vast majority of cases is identical to argv[0]. I don't understand why this is causing such angst -- it's just the observable facts: many things in userspace expose comm and many also expose cmdline. Having them be mismatched due to fexecve() is the whole issue. Nobody blinks at: $ ps PID TTY TIME CMD 4125309 pts/1 00:00:47 bash 4171960 pts/1 00:00:00 bash 4171962 pts/1 00:00:00 vim 4171997 pts/1 00:00:00 ps vs $ ps -f UID PID PPID C STIME TTY TIME CMD kees 4125309 3947 0 Jul11 pts/1 00:00:47 -bash kees 4171960 4125309 0 13:30 pts/1 00:00:00 -bash kees 4171962 4171960 0 13:30 pts/1 00:00:00 vim kees 4172004 4125309 0 13:30 pts/1 00:00:00 ps -f But if fexecve() were used now, "ps" would show: $ ps PID TTY TIME CMD 4125309 pts/1 00:00:47 3 4171960 pts/1 00:00:00 3 4171962 pts/1 00:00:00 3 4171997 pts/1 00:00:00 3 This is obviously horrible. Using f_path, we'd get close, but symlink destinations (or weird stuff like "memfd:name-here") are shown: $ realpath $(which vim) /usr/bin/vim.basic $ ps PID TTY TIME CMD 4125309 pts/1 00:00:47 bash 4171960 pts/1 00:00:00 bash 4171962 pts/1 00:00:00 vim.basic 4171997 pts/1 00:00:00 ps Using argv[0], we'd get the original output: $ ps PID TTY TIME CMD 4125309 pts/1 00:00:47 bash 4171960 pts/1 00:00:00 bash 4171962 pts/1 00:00:00 vim 4171997 pts/1 00:00:00 ps IMO, switching to fexecve() shouldn't regress this basic piece of information. Now, I think we have three choices: 1) Leave it as-is. (comm is useless) 2) Use argv[0]. (comm matches what would show with execve() in most cases) 3) Use f_path (comm exposes f_path dentry name, which matches basename(readlink(/proc/*/exe)), but doesn't always match what execve() would show). I think everyone agrees "1" should go away. So it comes down to trying to stay looking more like execve()'s comm, or more like /proc/*/exe's value. Since comm is mutable anyway, I feel like the "friendlier" default for userspace would be option 2. If you still conclude differently, I guess the discussion is over, and we go with 3: diff --git a/fs/exec.c b/fs/exec.c index e0435b31a811..8688bbbaf4af 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1347,7 +1347,21 @@ int begin_new_exec(struct linux_binprm * bprm) set_dumpable(current->mm, SUID_DUMP_USER); perf_event_exec(); - __set_task_comm(me, kbasename(bprm->filename), true); + + /* + * If fdpath was set, alloc_bprm() made up a path that will + * probably not be useful to admins running ps or similar. + * Let's fix it up to be something reasonable. + */ + if (bprm->fdpath) { + rcu_read_lock(); + __set_task_comm(me, smp_load_acquire(&bprm->file->f_path.dentry->d_name.name), + true); + rcu_read_unlock(); + } + else { + __set_task_comm(me, kbasename(bprm->filename), true); + } /* An exec changes our domain. We are no longer part of the thread group */ I've minimally tested this. -Kees -- Kees Cook ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-21 22:06 ` Kees Cook @ 2024-11-21 22:38 ` Linus Torvalds 0 siblings, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2024-11-21 22:38 UTC (permalink / raw) To: Kees Cook Cc: Zbigniew Jędrzejewski-Szmek, Eric W. Biederman, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, 21 Nov 2024 at 14:06, Kees Cook <kees@kernel.org> wrote: > > I think I finally figured out why you keep saying this. I think you mean > to imply "ps -e" (or similar), not "ps". Asking for more process details > ("ps a", "ps -f", "ps -e", etc) uses cmdline. Ah. I never use plain 'ps'. The output is too useless. > Since comm is mutable anyway, I feel like the "friendlier" default for > userspace would be option 2. The thing is, I still violently disagree. I don't see what's "friendlier" in being (a) slower and (b) giving the wrong output. argv[0] isn't what we *normally* use. And I've seen lots of cases where argv[0] is actually plain made-up garbage. Christ, I went and looked at OUR OWN TEST-CASES, and they just happily lie about "argv[0]". Just go check tools/testing/selftests/exec/execveat.c, and see. So no. THERE IS NO WAY I WILL ACCEPT THE GARBAGE THAT IS ARGV[0]. What is so hard to understand about the fact that argv[0] has never *EVER* been meaningful? We're not making it so now. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <87zflrsw1c.fsf@email.froward.int.ebiederm.org>]
* Re: [GIT PULL] execve updates for v6.13-rc1 [not found] ` <87zflrsw1c.fsf@email.froward.int.ebiederm.org> @ 2024-11-22 7:47 ` Zbigniew Jędrzejewski-Szmek 2024-11-22 10:21 ` Harald Arnesen [not found] ` <87frnjqqh6.fsf@email.froward.int.ebiederm.org> 0 siblings, 2 replies; 27+ messages in thread From: Zbigniew Jędrzejewski-Szmek @ 2024-11-22 7:47 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Thu, Nov 21, 2024 at 10:59:59PM -0600, Eric W. Biederman wrote: > Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> writes: > > > On Wed, Nov 20, 2024 at 06:23:55PM -0800, Linus Torvalds wrote: > >> On Wed, 20 Nov 2024 at 16:55, Eric W. Biederman <ebiederm@xmission.com> wrote: > >> > > >> > __set_task_comm cannot be called with bprm->file->f_dentry > >> > unconditionally. > >> > >> No, no. Only for the "no path" case. > >> > >> > The reason bprm->file->f_dentry.dentry was abandoned were concerns > >> > about breaking userspace. > >> > >> There's no way it can break user space considering that right now > >> comm[] ends up being just garbage. > > > > It'll "break userspace" in the sense the the resulting program name > > visible in /proc/self/{comm,stat,status} would be different than the > > expected value. > > This actually has not been shown. > > In the general case of programs on linux it is indeed the case > we have multi-call binaries and symlinks to binaries. > > Limiting things to just fexecve reduces the scope, > I didn't think about the scope reduction when you made this argument the > first time. > > I do not see any evidence that there are daemons started by systemd > where systemd follows the name in /etc/alternatives on debian, or winds > up following a symlink on a multicall binary. The way I understand > /etc/alternatives I don't think you would ever want to use it for the > name of a daemon that is put in a unit file. systemd-udevd is one example that ~everybody has installed. (Though it doesn't use comm, it uses argv[0] to decide behaviour.) We can certainly find more, on my Fedora system, 337/2252 files in /usr/sbin/ are symlinks, quite a few candidates. But even if we went through every one of those, it's not enough, because people have custom unit files and there's also systemd-run and run0. > All of these cases where you get a task->comm that would be different > with fexecve are rare oddball cases today. I think it is quite likely > nothing systemd wants to start with fexecve will have this problem. > > Further you can detect this problem before you get as far as starting > the application. Just pass O_NOFOLLOW to open and you can refuse to > follow the potentially confusing symlinks. We'd have two choices: refuse the command, which is not really a great option, or fall back to execve(), an approach which I find really unappealing. > So short of finding real programs in the real world that actually care > it seems perfectly reasonable to do the cheap and easy thing. > > Right now it feels like you are adopting a very cautious approach and > arguing for an expensive implementation simply because it is a lot of > work to figure out what programs you actually care about and see what > those programs are actually doing. So it's not really "programs" per se. For example, systemd itself doesn't care at all, because it uses cgroups to manage processes. I've been using systemd-with-fexecve for two years and apart from strange 'ps' listings, it's all fine. But we have administrators and scripts. For example, 'pgrep systemd-udevd' and 'pkill -HUP httpd' and a bazillion other calls that we cannot ever find or change. > On the system I am writing this on right now there are about 300 > processes running and only about 17 whose parent is pid 1. Most of > those process whose parent is pid 1 run as root. Which means extra care > needs to be taken with them anyway. This would also affect user process started by 'systemd --user', i.e. most of the graphical user session nowadays (under GNOME, KDE, etc.) > > Even if we end up copying a string from userspace unnecessarilly, > > does this matter? execve is a heavyweight operation and copying a > > a dozen bytes extra hardly matters. > > Generally it is on the person arguing for making the kernel more > expensive to find a compelling case. In this instance my imagination > fails me in finding a binary that systemd will start that will be > affected. So I simply don't see the point in making the code > more expensive. > > I was really hoping we could use the cheap bprm->file->f_dentry > to set task->comm in all cases as that would make the kernel > simpler and faster. > > To me using bprm->file->f_dentry still seems to make more sense than a > field whose value (in the case of login shells) is documented as being > different from what you are looking for. > > There is no solution that doesn't have down sides. As such the kernel > might as well use a solution that is cheap and as close to how > task->comm has been calculated historically as possible. Right. It comes down to judgement / guesswork about the downsides of each approach. As I wrote before, the approach with bprm->file->f_dentry is _much_ better than what we have now. So I'll take that over status quo. But from the POV of a distro maintainer, it's not ideal. Basically, with the patch that uses argv[0], I expect almost no user-visible change, so I can rebuild systemd with -Dfexecve=true in Fedora and maybe send a mail to fedora-devel with "hey, watch out, the way systemd starts processes has been changed, selinux could be affected, please report any errors". With the f_dentry version, we'll still do the transition, but we'll need to make an effort to warn users and do it much more visibly, and based on past experience with such things, there'll still inevitably be users who report that their favourite backup script that they call from the git checkout now doesn't terminate properly and stuff like that. Every kernel maintainer knows that any aspect of behaviour that is visibile to users starts being relied on… I think fexecve is a nice feature (as all the fd-based syscalls in general), but it need to be drop-in replacement if it's supposed to be adopted quickly. Zbyszek ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [GIT PULL] execve updates for v6.13-rc1 2024-11-22 7:47 ` Zbigniew Jędrzejewski-Szmek @ 2024-11-22 10:21 ` Harald Arnesen [not found] ` <87frnjqqh6.fsf@email.froward.int.ebiederm.org> 1 sibling, 0 replies; 27+ messages in thread From: Harald Arnesen @ 2024-11-22 10:21 UTC (permalink / raw) To: Zbigniew Jędrzejewski-Szmek, Eric W. Biederman; +Cc: linux-kernel Zbigniew Jędrzejewski-Szmek [2024-11-22 08:47:30]: >> I do not see any evidence that there are daemons started by systemd >> where systemd follows the name in /etc/alternatives on debian, or winds >> up following a symlink on a multicall binary. The way I understand >> /etc/alternatives I don't think you would ever want to use it for the >> name of a daemon that is put in a unit file. > systemd-udevd is one example that ~everybody has installed. > (Though it doesn't use comm, it uses argv[0] to decide behaviour.) There are quite a few of us who do NOT have systemd-udevd (or anything systemd) installed. -- Hilsen Harald ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <87frnjqqh6.fsf@email.froward.int.ebiederm.org>]
* Re: [GIT PULL] execve updates for v6.13-rc1 [not found] ` <87frnjqqh6.fsf@email.froward.int.ebiederm.org> @ 2024-11-24 15:21 ` Zbigniew Jędrzejewski-Szmek 0 siblings, 0 replies; 27+ messages in thread From: Zbigniew Jędrzejewski-Szmek @ 2024-11-24 15:21 UTC (permalink / raw) To: Eric W. Biederman Cc: Linus Torvalds, Kees Cook, linux-kernel, Alexander Viro, Christophe JAILLET, Dan Carpenter, Nir Lichtman, syzbot+03e1af5c332f7e0eb84b, Tycho Andersen, Vegard Nossum On Fri, Nov 22, 2024 at 08:43:01AM -0600, Eric W. Biederman wrote: > Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> writes: > > > On Thu, Nov 21, 2024 at 10:59:59PM -0600, Eric W. Biederman wrote: > >> Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> writes: > >> > >> > On Wed, Nov 20, 2024 at 06:23:55PM -0800, Linus Torvalds wrote: > >> >> On Wed, 20 Nov 2024 at 16:55, Eric W. Biederman <ebiederm@xmission.com> wrote: > >> >> > > >> >> > __set_task_comm cannot be called with bprm->file->f_dentry > >> >> > unconditionally. > >> >> > >> >> No, no. Only for the "no path" case. > >> >> > >> >> > The reason bprm->file->f_dentry.dentry was abandoned were concerns > >> >> > about breaking userspace. > >> >> > >> >> There's no way it can break user space considering that right now > >> >> comm[] ends up being just garbage. > >> > > >> > It'll "break userspace" in the sense the the resulting program name > >> > visible in /proc/self/{comm,stat,status} would be different than the > >> > expected value. > >> > >> This actually has not been shown. > >> > >> In the general case of programs on linux it is indeed the case > >> we have multi-call binaries and symlinks to binaries. > >> > >> Limiting things to just fexecve reduces the scope, > >> I didn't think about the scope reduction when you made this argument the > >> first time. > >> > >> I do not see any evidence that there are daemons started by systemd > >> where systemd follows the name in /etc/alternatives on debian, or winds > >> up following a symlink on a multicall binary. The way I understand > >> /etc/alternatives I don't think you would ever want to use it for the > >> name of a daemon that is put in a unit file. > > > > systemd-udevd is one example that ~everybody has installed. > > (Though it doesn't use comm, it uses argv[0] to decide behaviour.) > > As far as I can tell using task->comm is a bug. > argv[0] is cheaper and the unix convention for deciding program > behavior. I don't understand what you mean by "a bug". As I wrote, systemd-udevd does NOT use comm. And in other places, no matter what you or I might wish, comm definitely IS used. Any place with pgrep or pkill is doing exactly that. So an assertion that it's against a "unix convention" is unjustified. [snip part about udev] > It is worth remembering that hardlinks are going to be cheaper all > around for this sort of thing. The only reason I see for using > symlinks is that you can see that something is an alias. Yes. Symlinks are used so that that aliasing is visible. The second reason is that symlinks can be used across devices, so for example a symlink can be made from /etc/aliases to /usr, which are sometimes on different devices. The third reason is that when things are split between different packages (in the sense of distro packages), a package can contain multiple files that are hardlinked, and a package can contain a symlink to a file in a different package (or even something that is not packaged), but a package cannot natively provide a hardlink to a file in a different package. (A post-install scriptlet could make a hardlink, but that's much more work and preferrably avoided, even if technically possible.) > > We can certainly find more, on my Fedora system, 337/2252 files > > in /usr/sbin/ are symlinks, quite a few candidates. > > But even if we went through every one of those, it's not enough, > > because people have custom unit files and there's also systemd-run > > and run0. > > >> All of these cases where you get a task->comm that would be different > >> with fexecve are rare oddball cases today. I think it is quite likely > >> nothing systemd wants to start with fexecve will have this problem. > >> > >> Further you can detect this problem before you get as far as starting > >> the application. Just pass O_NOFOLLOW to open and you can refuse to > >> follow the potentially confusing symlinks. > > > > We'd have two choices: refuse the command, which is not really a great > > option, or fall back to execve(), an approach which I find really > > unappealing. > > It also allows you to report the program and quickly find which cases > might be a problem. > >> So short of finding real programs in the real world that actually care > >> it seems perfectly reasonable to do the cheap and easy thing. > >> > >> Right now it feels like you are adopting a very cautious approach and > >> arguing for an expensive implementation simply because it is a lot of > >> work to figure out what programs you actually care about and see what > >> those programs are actually doing. > > > > So it's not really "programs" per se. For example, systemd itself > > doesn't care at all, because it uses cgroups to manage processes. I've > > been using systemd-with-fexecve for two years and apart from strange 'ps' > > listings, it's all fine. But we have administrators and scripts. > > For example, 'pgrep systemd-udevd' and 'pkill -HUP httpd' and a > > bazillion other calls that we cannot ever find or change. > > Hmm. > > Not even my system where I run apache has a file named httpd > so I don't have enough context to understand that reference. It's a long-standing Red Hat vs. Debian naming war. Some want "apache httpd" to be called "apache" and others "httpd". In the Fedora package, we have '/usr/sbin/httpd'. Apache/httpd was (is?) deployed in a few major versions, so it was common to have httpd-2.2 and httpd-2.4 installed, and in those situations one would create a symlink to "decide" which version is the main one. This is the situation I was alluding to. Another common example that comes to mind is python: distros provide /usr/bin/python, /usr/bin/python2, and /usr/bin/python3 as symlinks to specific versions. On my Fedora system, python-unversioned-command.rpm provides the /usr/bin/python → python3 symlink, and python3.13.rpm provides /usr/bin/python3.13 binary and a /usr/bin/python3 → python3.13 symlink. So this is an example of a cross-package "alias" that pretty much must be a symlink. > >> To me using bprm->file->f_dentry still seems to make more sense than a > >> field whose value (in the case of login shells) is documented as being > >> different from what you are looking for. > >> > >> There is no solution that doesn't have down sides. As such the kernel > >> might as well use a solution that is cheap and as close to how > >> task->comm has been calculated historically as possible. > > > > Right. It comes down to judgement / guesswork about the downsides of > > each approach. > > > > As I wrote before, the approach with bprm->file->f_dentry is _much_ > > better than what we have now. So I'll take that over status quo. > > I really think that is where the kernel should start. > It is cheap, and it solves the issue that task->comm was set to > something meaningless from a process standpoint. Ack. > > But from the POV of a distro maintainer, it's not ideal. Basically, > > with the patch that uses argv[0], I expect almost no user-visible > > change, so I can rebuild systemd with -Dfexecve=true in Fedora and > > maybe send a mail to fedora-devel with "hey, watch out, the way > > systemd starts processes has been changed, selinux could be affected, > > please report any errors". > > Huh? If selinux depending upon filenames? I think if any security > module is somehow depending upon task->comm it should be fixed. > Just because task->comm very much is not built so that it can > be depended on that way. I never said anything about selinux depending on comm. Selinux is primarily based on syscalls, so changing execve() to execveat() _could_ require a change to some policies. (I'm not saying that it must. I didn't explore the topic. Just that it's something that'd need to be checked.) > > I think fexecve is a nice feature (as all the fd-based syscalls > > in general), but it need to be drop-in replacement if it's supposed > > to be adopted quickly. > > I don't know why you want to use fexecve. Using an fd based system > call will typically be slower. Using fd-based system calls is generally > what you want if there are extra checks you want to add before > performing an action. Opening and file and then calling exec is going > to be necessarily slower than letting execve handle it all. Not to > mention the cost of the added checks. That's exactly the reason. We do various checks on the binary before calling it, for example SMACK context setup. Another example is that we've been working on adding a .note.package section [1] that uniquely identifies the provenience of ELF files. I plan to add a feature where we read the ELF note and log the specific version of the program being launched. It's more elegant to open a file and use the same fd to do all the preparatory steps without a change of TOCTOU race, for example if packages are being upgraded and the file might be replaced in the background. [1] https://systemd.io/ELF_PACKAGE_METADATA/ > Depending on the checks you want to perform in userspace to the > executing of a binary it is possible to use a dirfd with execveat > instead of the fd of the actual binary. That allows you to have enough > of the full pathname that task->comm will be set the same. Yeah, that's not what we want. A package upgrade will use an atomic rename to replace files within the directory. Zbyszek ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2024-11-24 15:22 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-19 17:00 [GIT PULL] execve updates for v6.13-rc1 Kees Cook
2024-11-20 22:32 ` Linus Torvalds
2024-11-20 22:33 ` Linus Torvalds
2024-11-20 22:50 ` Linus Torvalds
2024-11-21 2:36 ` Al Viro
2024-11-21 2:41 ` Al Viro
2024-11-21 2:51 ` Linus Torvalds
2024-11-21 0:54 ` Eric W. Biederman
2024-11-21 2:23 ` Linus Torvalds
2024-11-21 2:29 ` Kees Cook
2024-11-21 2:45 ` Linus Torvalds
2024-11-21 17:22 ` Zbigniew Jędrzejewski-Szmek
2024-11-21 17:28 ` Linus Torvalds
2024-11-21 17:47 ` Linus Torvalds
2024-11-21 18:00 ` Zbigniew Jędrzejewski-Szmek
2024-11-21 18:02 ` Linus Torvalds
2024-11-21 18:47 ` Zbigniew Jędrzejewski-Szmek
2024-11-21 18:50 ` Kees Cook
2024-11-21 19:23 ` Linus Torvalds
2024-11-21 21:31 ` Zbigniew Jędrzejewski-Szmek
2024-11-21 21:48 ` Linus Torvalds
2024-11-21 22:24 ` Zbigniew Jędrzejewski-Szmek
2024-11-21 22:06 ` Kees Cook
2024-11-21 22:38 ` Linus Torvalds
[not found] ` <87zflrsw1c.fsf@email.froward.int.ebiederm.org>
2024-11-22 7:47 ` Zbigniew Jędrzejewski-Szmek
2024-11-22 10:21 ` Harald Arnesen
[not found] ` <87frnjqqh6.fsf@email.froward.int.ebiederm.org>
2024-11-24 15:21 ` Zbigniew Jędrzejewski-Szmek
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.