* Re: [PATCH RFC v2 0/4] mm/ksm: add option to automerge VMAs
From: Michal Hocko @ 2019-05-16 7:47 UTC (permalink / raw)
To: Greg KH
Cc: Oleksandr Natalenko, linux-kernel, Kirill Tkhai, Vlastimil Babka,
Matthew Wilcox, Pavel Tatashin, Timofey Titovets, Aaron Tomlin,
Grzegorz Halat, linux-mm, linux-api, Hugh Dickins,
Suren Baghdasaryan, Minchan Kim
In-Reply-To: <20190515151557.GA23969@kroah.com>
On Wed 15-05-19 17:15:57, Greg KH wrote:
> On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote:
> > [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com]
> >
> > On Wed 15-05-19 08:53:11, Michal Hocko wrote:
> > [...]
> > > I will try to comment on the interface itself later. But I have to say
> > > that I am not impressed. Abusing sysfs for per process features is quite
> > > gross to be honest.
> >
> > I have already commented on this in other email. I consider sysfs an
> > unsuitable interface for per-process API.
>
> Wait, what? A new sysfs file/directory per process? That's crazy, no
> one must have benchmarked it :)
Just to clarify, that was not a per process file but rather per process API.
Essentially echo $PID > $SYSFS_SPECIAL_FILE
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Arvind Sankar @ 2019-05-16 5:29 UTC (permalink / raw)
To: Roberto Sassu
Cc: Arvind Sankar, James Bottomley, Rob Landley, Andy Lutomirski,
Arvind Sankar, LKML, Linux API, Linux FS Devel, linux-integrity,
initramfs, Silviu Vlasceanu
In-Reply-To: <ce65240a-4df6-8ebc-8360-c01451e724f0@huawei.com>
On Wed, May 15, 2019 at 07:06:52PM +0200, Roberto Sassu wrote:
> On 5/15/2019 6:08 PM, Arvind Sankar wrote:
> > On Wed, May 15, 2019 at 01:19:04PM +0200, Roberto Sassu wrote:
> >> On 5/15/2019 2:52 AM, Arvind Sankar wrote:
> > I don't understand what you mean? The IMA hashes are signed by some key,
> > but I don't see how what that key is needs to be different between the
> > two proposals. If the only files used are from the distro, in my scheme
> > as well you can use the signatures and key provided by the distro. If
> > they're not, then in your scheme as well you would have to allow for a
> > local signing key to be used. Both schemes are using the same
> > .xattr-list file, no?
>
> I was referring to James's proposal to load an external initramfs from
> the embedded initramfs. If the embedded initramfs opens the external
> initramfs when IMA is enabled, the external initramfs needs to be
> signed with a local signing key. But I read your answer that this
> wouldn't be feasible. You have to specify all initramfs in the boot
> loader configuration.
>
> I think deferring IMA initialization is not the safest approach, as it
> cannot be guaranteed for all possible scenarios that there won't be any
> file read before /init is executed.
>
> But if IMA is enabled, there is the problem of who signs .xattr-list.
> There should be a local signing key that it is not necessary if the user
> only accesses distro files.
>
I think that's a separate issue. If you want to allow people to be able
to put files onto the system that will be IMA verified, they need to
have some way to locally sign them whether it's inside an initramfs or
on a real root filesystem.
>
> > Right, I guess this would be sort of the minimal "modification" to the
> > CPIO format to allow it to support xattrs.
>
> I would try to do it without modification of the CPIO format. However,
> at the time .xattr-list is parsed (in do_copy() before .xattr-list is
> closed), it is not guaranteed that all files are extracted. These must
> be created before xattrs are added, but the file type must be correct,
> otherwise clean_path() removes the existing file with xattrs.
>
> Roberto
>
> --
> HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
> Managing Director: Bo PENG, Jian LI, Yanli SHI
Right by "modification" in quotes I meant the format is actually the
same, but the kernel now interprets it a bit differently.
Regarding the order you don't have to handle that in the kernel. The
kernel CPIO format is already restricted in that directories have to be
specified before the files that contain them for example. It can very
well be restricted so that an .xattr-list can only specify xattrs for
files that were already extracted, else you bail out with an error. The
archive creation tooling can easily handle that. If someone wants to
shoot themselves in the foot by trying to add more files/replace
existing files after the .xattr-list its ok, the IMA policy will prevent
such files from being accessed and they can fix the archive for the next
boot.
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Daniel Colascione @ 2019-05-15 17:45 UTC (permalink / raw)
To: Christian Brauner
Cc: Jann Horn, Oleg Nesterov, Al Viro, Linus Torvalds, linux-kernel,
Arnd Bergmann, David Howells, Andrew Morton, Aleksa Sarai,
Eric W. Biederman, elena.reshetova, Kees Cook, Andy Lutomirski,
Andy Lutomirski, Thomas Gleixner, linux-alpha, linux-arm-kernel,
linux-ia64, linux-m68k, linux-mips, linux-parisc, linuxppc-dev
In-Reply-To: <20190515100400.3450-1-christian@brauner.io>
On Wed, May 15, 2019 at 3:04 AM Christian Brauner <christian@brauner.io> wrote:
>
> This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> process that is created via traditional fork()/clone() calls that is only
> referenced by a PID:
Thanks for doing this work. I'm really looking forward to this new
approach to process management.
> int pidfd = pidfd_open(1234, 0);
> ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
>
> With the introduction of pidfds through CLONE_PIDFD it is possible to
> created pidfds at process creation time.
> However, a lot of processes get created with traditional PID-based calls
> such as fork() or clone() (without CLONE_PIDFD). For these processes a
> caller can currently not create a pollable pidfd. This is a huge problem
> for Android's low memory killer (LMK) and service managers such as systemd.
> Both are examples of tools that want to make use of pidfds to get reliable
> notification of process exit for non-parents (pidfd polling) and race-free
> signal sending (pidfd_send_signal()). They intend to switch to this API for
> process supervision/management as soon as possible. Having no way to get
> pollable pidfds from PID-only processes is one of the biggest blockers for
> them in adopting this api. With pidfd_open() making it possible to retrieve
> pidfd for PID-based processes we enable them to adopt this api.
>
> In line with Arnd's recent changes to consolidate syscall numbers across
> architectures, I have added the pidfd_open() syscall to all architectures
> at the same time.
I'm glad it's easier now.
> arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> arch/arm64/include/asm/unistd32.h | 2 +
> arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
> arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> arch/s390/kernel/syscalls/syscall.tbl | 1 +
> arch/sh/kernel/syscalls/syscall.tbl | 1 +
> arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
It'd be nice to arrange the system call tables so that we need to
change only one file when adding a new system call.
[Snip system call wiring]
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -67,6 +67,7 @@ struct pid
> extern struct pid init_struct_pid;
>
> extern const struct file_operations pidfd_fops;
> +extern int pidfd_create(struct pid *pid);
>
> static inline struct pid *get_pid(struct pid *pid)
> {
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..989055e0b501 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
> struct old_timex32 __user *tx);
> asmlinkage long sys_syncfs(int fd);
> asmlinkage long sys_setns(int fd, int nstype);
> +asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
> asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
> unsigned int vlen, unsigned flags);
> asmlinkage long sys_process_vm_readv(pid_t pid,
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index dee7292e1df6..94a257a93d20 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> #define __NR_io_uring_register 427
> __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> +#define __NR_pidfd_open 428
> +__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 428
> +#define __NR_syscalls 429
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 737db1828437..980cc1d2b8d4 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
> * Return: On success, a cloexec pidfd is returned.
> * On error, a negative errno number will be returned.
> */
> -static int pidfd_create(struct pid *pid)
> +int pidfd_create(struct pid *pid)
> {
> int fd;
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..237d18d6ecb8 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -38,6 +38,7 @@
> #include <linux/syscalls.h>
> #include <linux/proc_ns.h>
> #include <linux/proc_fs.h>
> +#include <linux/sched/signal.h>
> #include <linux/sched/task.h>
> #include <linux/idr.h>
>
> @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> return idr_get_next(&ns->idr, &nr);
> }
>
> +/**
> + * pidfd_open() - Open new pid file descriptor.
> + *
> + * @pid: pid for which to retrieve a pidfd
> + * @flags: flags to pass
> + *
> + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> + * the process identified by @pid. Currently, the process identified by
> + * @pid must be a thread-group leader. This restriction currently exists
> + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> + * leaders).
> + *
> + * Return: On success, a cloexec pidfd is returned.
> + * On error, a negative errno number will be returned.
> + */
> +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> +{
> + int fd, ret;
> + struct pid *p;
> + struct task_struct *tsk;
> +
> + if (flags)
> + return -EINVAL;
If we support blocking operations on pidfds, we'll want to be able to
put them in non-blocking mode. Does it make sense to accept and ignore
O_NONBLOCK here now?
> + if (pid <= 0)
> + return -EINVAL;
WDYT of defining pid == 0 to mean "open myself"?
> + p = find_get_pid(pid);
> + if (!p)
> + return -ESRCH;
> +
> + rcu_read_lock();
> + tsk = pid_task(p, PIDTYPE_PID);
> + if (!tsk)
> + ret = -ESRCH;
> + else if (unlikely(!thread_group_leader(tsk)))
> + ret = -EINVAL;
> + else
> + ret = 0;
> + rcu_read_unlock();
> +
> + fd = ret ?: pidfd_create(p);
> + put_pid(p);
> + return fd;
> +}
^ permalink raw reply
* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Roberto Sassu @ 2019-05-15 17:06 UTC (permalink / raw)
To: Arvind Sankar
Cc: James Bottomley, Rob Landley, Andy Lutomirski, Arvind Sankar,
LKML, Linux API, Linux FS Devel, linux-integrity, initramfs,
Silviu Vlasceanu
In-Reply-To: <20190515160834.GA81614@rani.riverdale.lan>
On 5/15/2019 6:08 PM, Arvind Sankar wrote:
> On Wed, May 15, 2019 at 01:19:04PM +0200, Roberto Sassu wrote:
>> On 5/15/2019 2:52 AM, Arvind Sankar wrote:
>>> You can specify multiple initrd's to the boot loader, and they get
>>> loaded in sequence into memory and parsed by the kernel before /init is
>>> launched. Currently I believe later ones will overwrite the earlier
>>> ones, which is why we've been talking about adding an option to prevent
>>> that. You don't have to mess with manually finding/parsing initramfs's
>>> which wouldn't even be feasible since you may not have the drivers
>>> loaded yet to access the device/filesystem on which they live.
>>>
>>> Once that's done, the embedded /init is just going to do in userspace
>>> wht the current patch does in the kernel. So all the files in the
>>> external initramfs(es) would need to have IMA signatures via the special
>>> xattr file.
>>
>> So, the scheme you are proposing is not equivalent: using the distro key
>> to verify signatures, compared to adding a new user key to verify the
>> initramfs he builds. Why would it be necessary for the user to share
>> responsibility with the distro, if the only files he uses come from the
>> distro?
>>
> I don't understand what you mean? The IMA hashes are signed by some key,
> but I don't see how what that key is needs to be different between the
> two proposals. If the only files used are from the distro, in my scheme
> as well you can use the signatures and key provided by the distro. If
> they're not, then in your scheme as well you would have to allow for a
> local signing key to be used. Both schemes are using the same
> .xattr-list file, no?
I was referring to James's proposal to load an external initramfs from
the embedded initramfs. If the embedded initramfs opens the external
initramfs when IMA is enabled, the external initramfs needs to be
signed with a local signing key. But I read your answer that this
wouldn't be feasible. You have to specify all initramfs in the boot
loader configuration.
I think deferring IMA initialization is not the safest approach, as it
cannot be guaranteed for all possible scenarios that there won't be any
file read before /init is executed.
But if IMA is enabled, there is the problem of who signs .xattr-list.
There should be a local signing key that it is not necessary if the user
only accesses distro files.
> If the external initramfs is to be signed, and it is built locally, in
> both schemes there will have to be a provision for a local signing key,
> but this key in any case is verified by the bootloader so there can't
> be a difference between the two schemes since they're the same there.
>
> What is the difference you're seeing here?
>>
>>> Note that if you want the flexibility to be able to load one or both of
>>> two external initramfs's, the current in-kernel proposal wouldn't be
>>> enough -- the xattr specification would have to be more flexible (eg
>>> reading .xattr-list* to allow each initramfs to specifiy its own
>>> xattrs. This sort of enhancement would be much easier to handle with the
>>> userspace variant.
>>
>> Yes, the alternative solution is to parse .xattr-list at the time it is
>> extracted. The .xattr-list of each initramfs will be processed. Also,
>> the CPIO parser doesn't have to reopen the file after all other files
>> have been extracted.
>>
>> Roberto
> Right, I guess this would be sort of the minimal "modification" to the
> CPIO format to allow it to support xattrs.
I would try to do it without modification of the CPIO format. However,
at the time .xattr-list is parsed (in do_copy() before .xattr-list is
closed), it is not guaranteed that all files are extracted. These must
be created before xattrs are added, but the file type must be correct,
otherwise clean_path() removes the existing file with xattrs.
Roberto
--
HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
Managing Director: Bo PENG, Jian LI, Yanli SHI
^ permalink raw reply
* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Arvind Sankar @ 2019-05-15 16:08 UTC (permalink / raw)
To: Roberto Sassu
Cc: Arvind Sankar, James Bottomley, Rob Landley, Andy Lutomirski,
Arvind Sankar, LKML, Linux API, Linux FS Devel, linux-integrity,
initramfs, Silviu Vlasceanu
In-Reply-To: <a138af12-d983-453e-f0b2-661a80b7e837@huawei.com>
On Wed, May 15, 2019 at 01:19:04PM +0200, Roberto Sassu wrote:
> On 5/15/2019 2:52 AM, Arvind Sankar wrote:
> > You can specify multiple initrd's to the boot loader, and they get
> > loaded in sequence into memory and parsed by the kernel before /init is
> > launched. Currently I believe later ones will overwrite the earlier
> > ones, which is why we've been talking about adding an option to prevent
> > that. You don't have to mess with manually finding/parsing initramfs's
> > which wouldn't even be feasible since you may not have the drivers
> > loaded yet to access the device/filesystem on which they live.
> >
> > Once that's done, the embedded /init is just going to do in userspace
> > wht the current patch does in the kernel. So all the files in the
> > external initramfs(es) would need to have IMA signatures via the special
> > xattr file.
>
> So, the scheme you are proposing is not equivalent: using the distro key
> to verify signatures, compared to adding a new user key to verify the
> initramfs he builds. Why would it be necessary for the user to share
> responsibility with the distro, if the only files he uses come from the
> distro?
>
I don't understand what you mean? The IMA hashes are signed by some key,
but I don't see how what that key is needs to be different between the
two proposals. If the only files used are from the distro, in my scheme
as well you can use the signatures and key provided by the distro. If
they're not, then in your scheme as well you would have to allow for a
local signing key to be used. Both schemes are using the same
.xattr-list file, no?
If the external initramfs is to be signed, and it is built locally, in
both schemes there will have to be a provision for a local signing key,
but this key in any case is verified by the bootloader so there can't
be a difference between the two schemes since they're the same there.
What is the difference you're seeing here?
>
> > Note that if you want the flexibility to be able to load one or both of
> > two external initramfs's, the current in-kernel proposal wouldn't be
> > enough -- the xattr specification would have to be more flexible (eg
> > reading .xattr-list* to allow each initramfs to specifiy its own
> > xattrs. This sort of enhancement would be much easier to handle with the
> > userspace variant.
>
> Yes, the alternative solution is to parse .xattr-list at the time it is
> extracted. The .xattr-list of each initramfs will be processed. Also,
> the CPIO parser doesn't have to reopen the file after all other files
> have been extracted.
>
> Roberto
Right, I guess this would be sort of the minimal "modification" to the
CPIO format to allow it to support xattrs.
>
> --
> HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
> Managing Director: Bo PENG, Jian LI, Yanli SHI
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-15 15:40 UTC (permalink / raw)
To: Oleg Nesterov
Cc: jannh, viro, torvalds, linux-kernel, arnd, dhowells, akpm, cyphar,
ebiederm, elena.reshetova, keescook, luto, luto, tglx,
linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
linux-xtensa, linux-api, linux-arch, linux-kselftest
In-Reply-To: <20190515153515.GA20783@redhat.com>
On Wed, May 15, 2019 at 05:35:15PM +0200, Oleg Nesterov wrote:
> On 05/15, Oleg Nesterov wrote:
> >
> > On 05/15, Christian Brauner wrote:
> > >
> > > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > > +{
> > > + int fd, ret;
> > > + struct pid *p;
> > > + struct task_struct *tsk;
> > > +
> > > + if (flags)
> > > + return -EINVAL;
> > > +
> > > + if (pid <= 0)
> > > + return -EINVAL;
> > > +
> > > + p = find_get_pid(pid);
> > > + if (!p)
> > > + return -ESRCH;
> > > +
> > > + rcu_read_lock();
> > > + tsk = pid_task(p, PIDTYPE_PID);
> >
> > You do not need find_get_pid() before rcu_lock and put_pid() at the end.
> > You can just do find_vpid() under rcu_read_lock().
>
> Ah, sorry. Somehow I forgot you need to call pidfd_create(pid), you can't
> do this under rcu_read_lock().
>
> So I was wrong, you can't avoid get/put_pid.
Yeah, I haven't made any changes yet.
Christian
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Oleg Nesterov @ 2019-05-15 15:35 UTC (permalink / raw)
To: Christian Brauner
Cc: jannh, viro, torvalds, linux-kernel, arnd, dhowells, akpm, cyphar,
ebiederm, elena.reshetova, keescook, luto, luto, tglx,
linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
linux-xtensa, linux-api, linux-arch, linux-kselftest
In-Reply-To: <20190515143857.GB18892@redhat.com>
On 05/15, Oleg Nesterov wrote:
>
> On 05/15, Christian Brauner wrote:
> >
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > + int fd, ret;
> > + struct pid *p;
> > + struct task_struct *tsk;
> > +
> > + if (flags)
> > + return -EINVAL;
> > +
> > + if (pid <= 0)
> > + return -EINVAL;
> > +
> > + p = find_get_pid(pid);
> > + if (!p)
> > + return -ESRCH;
> > +
> > + rcu_read_lock();
> > + tsk = pid_task(p, PIDTYPE_PID);
>
> You do not need find_get_pid() before rcu_lock and put_pid() at the end.
> You can just do find_vpid() under rcu_read_lock().
Ah, sorry. Somehow I forgot you need to call pidfd_create(pid), you can't
do this under rcu_read_lock().
So I was wrong, you can't avoid get/put_pid.
Oleg.
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-15 15:30 UTC (permalink / raw)
To: Oleg Nesterov
Cc: jannh, viro, torvalds, linux-kernel, arnd, dhowells, akpm, cyphar,
ebiederm, elena.reshetova, keescook, luto, luto, tglx,
linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
linux-xtensa, linux-api, linux-arch, linux-kselftest
In-Reply-To: <20190515151912.GE18892@redhat.com>
On Wed, May 15, 2019 at 05:19:13PM +0200, Oleg Nesterov wrote:
> On 05/15, Christian Brauner wrote:
> >
> > On Wed, May 15, 2019 at 04:38:58PM +0200, Oleg Nesterov wrote:
> > >
> > > it seems that you can do a single check
> > >
> > > tsk = pid_task(p, PIDTYPE_TGID);
> > > if (!tsk)
> > > ret = -ESRCH;
> > >
> > > this even looks more correct if we race with exec changing the leader.
> >
> > The logic here being that you can only reach the thread_group leader
> > from struct pid if PIDTYPE_PID == PIDTYPE_TGID for this struct pid?
>
> Not exactly... it is not that PIDTYPE_PID == PIDTYPE_TGID for this pid,
> struct pid has no "type" or something like this.
>
> The logic is that pid->tasks[PIDTYPE_XXX] is the list of task which use
> this pid as "XXX" type.
>
> For example, clone(CLONE_THREAD) creates a pid which has a single non-
> empty list, pid->tasks[PIDTYPE_PID]. This pid can't be used as TGID or
> SID.
>
> So if pid_task(PIDTYPE_TGID) returns non-NULL we know that this pid was
> used for a group-leader, see copy_process() which does
Ah, this was what I was asking myself when I worked on thread-specific
signal sending. This clarifies quite a lot of things!
Though I wonder how one reliably gets a the PGID or SID from a
PIDTYPE_PID.
>
> if (thread_group_leader(p))
> attach_pid(p, PIDTYPE_TGID);
>
>
> If we race with exec which changes the leader pid_task(TGID) can return
> the old leader. We do not care, but this means that we should not check
> thread_group_leader().
Nice!
Thank you, Oleg! :)
Christian
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Yann Droneaud @ 2019-05-15 15:29 UTC (permalink / raw)
To: Christian Brauner
Cc: jannh, oleg, viro, torvalds, linux-kernel, arnd, dhowells, akpm,
cyphar, ebiederm, elena.reshetova, keescook, luto, luto, tglx,
linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
linux-xtensa, linux-api, linux-arch, linux-kselftest
In-Reply-To: <20190515141634.lrc5ynllcmjr64mn@brauner.io>
Hi,
Le mercredi 15 mai 2019 à 16:16 +0200, Christian Brauner a écrit :
> On Wed, May 15, 2019 at 04:00:20PM +0200, Yann Droneaud wrote:
> > Le mercredi 15 mai 2019 à 12:03 +0200, Christian Brauner a écrit :
> > > diff --git a/kernel/pid.c b/kernel/pid.c
> > > index 20881598bdfa..237d18d6ecb8 100644
> > > --- a/kernel/pid.c
> > > +++ b/kernel/pid.c
> > > @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct
> > > pid_namespace *ns)
> > > return idr_get_next(&ns->idr, &nr);
> > > }
> > >
> > > +/**
> > > + * pidfd_open() - Open new pid file descriptor.
> > > + *
> > > + * @pid: pid for which to retrieve a pidfd
> > > + * @flags: flags to pass
> > > + *
> > > + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> > > + * the process identified by @pid. Currently, the process identified by
> > > + * @pid must be a thread-group leader. This restriction currently exists
> > > + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> > > + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> > > + * leaders).
> > > + *
> >
> > Would it be possible to create file descriptor with "restricted"
> > operation ?
> >
> > - O_RDONLY: waiting for process completion allowed (for example)
> > - O_WRONLY: sending process signal allowed
>
> Yes, something like this is likely going to be possible in the future.
> We had discussion around this. But mapping this to O_RDONLY and O_WRONLY
> is not the right model. It makes more sense to have specialized flags
> that restrict actions.
Yes, dedicated flags are the way to go. I've used the old open() flags
here as examples as an echo of the O_CLOEXEC flag used in the comment.
> > For example, a process could send over a Unix socket a process a pidfd,
> > allowing this to only wait for completion, but not sending signal ?
> >
> > I see the permission check is not done in pidfd_open(), so what prevent
> > a user from sending a signal to another user owned process ?
>
> That's supposed to be possible. You can do the same right now already
> with pids. Tools like LMK need this probably very much.
> Permission checking for signals is done at send time right now.
> And if you can't signal via a pid you can't signal via a pidfd as
> they're both subject to the same permissions checks.
>
I would have expect it to behave like most other file descriptor,
permission check done at opening time, which allow more privileged
process to open the file descriptor, then pass it to a less privileged
process, or change its own privileged with setuid() and such. Then the
less privileged process can act on behalf of the privileged process
through the file descriptor.
> > If it's in pidfd_send_signal(), then, passing the socket through
> > SCM_RIGHT won't be useful if the target process is not owned by the
> > same user, or root.
> >
If the permission check is done at sending time, the scenario above
case cannot be implemented.
Sending a pidfd through SCM_RIGHT is only useful if the receiver
process is equally or more privileged than the sender then.
For isolation purpose, I would have expect to be able to give a right
to send a signal to a highly privileged process a specific less
privileged process though Unix socket.
But I can't come up with a specific use case. So I dunno.
Regards.
--
Yann Droneaud
OPTEYA
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Oleg Nesterov @ 2019-05-15 15:19 UTC (permalink / raw)
To: Christian Brauner
Cc: jannh, viro, torvalds, linux-kernel, arnd, dhowells, akpm, cyphar,
ebiederm, elena.reshetova, keescook, luto, luto, tglx,
linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
linux-xtensa, linux-api, linux-arch, linux-kselftest
In-Reply-To: <20190515144927.f2yxyi6w6lhn3xx7@brauner.io>
On 05/15, Christian Brauner wrote:
>
> On Wed, May 15, 2019 at 04:38:58PM +0200, Oleg Nesterov wrote:
> >
> > it seems that you can do a single check
> >
> > tsk = pid_task(p, PIDTYPE_TGID);
> > if (!tsk)
> > ret = -ESRCH;
> >
> > this even looks more correct if we race with exec changing the leader.
>
> The logic here being that you can only reach the thread_group leader
> from struct pid if PIDTYPE_PID == PIDTYPE_TGID for this struct pid?
Not exactly... it is not that PIDTYPE_PID == PIDTYPE_TGID for this pid,
struct pid has no "type" or something like this.
The logic is that pid->tasks[PIDTYPE_XXX] is the list of task which use
this pid as "XXX" type.
For example, clone(CLONE_THREAD) creates a pid which has a single non-
empty list, pid->tasks[PIDTYPE_PID]. This pid can't be used as TGID or
SID.
So if pid_task(PIDTYPE_TGID) returns non-NULL we know that this pid was
used for a group-leader, see copy_process() which does
if (thread_group_leader(p))
attach_pid(p, PIDTYPE_TGID);
If we race with exec which changes the leader pid_task(TGID) can return
the old leader. We do not care, but this means that we should not check
thread_group_leader().
Oleg.
^ permalink raw reply
* Re: [PATCH RFC v2 0/4] mm/ksm: add option to automerge VMAs
From: Greg KH @ 2019-05-15 15:15 UTC (permalink / raw)
To: Michal Hocko
Cc: Oleksandr Natalenko, linux-kernel, Kirill Tkhai, Vlastimil Babka,
Matthew Wilcox, Pavel Tatashin, Timofey Titovets, Aaron Tomlin,
Grzegorz Halat, linux-mm, linux-api, Hugh Dickins,
Suren Baghdasaryan, Minchan Kim
In-Reply-To: <20190515145151.GG16651@dhcp22.suse.cz>
On Wed, May 15, 2019 at 04:51:51PM +0200, Michal Hocko wrote:
> [Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com]
>
> On Wed 15-05-19 08:53:11, Michal Hocko wrote:
> [...]
> > I will try to comment on the interface itself later. But I have to say
> > that I am not impressed. Abusing sysfs for per process features is quite
> > gross to be honest.
>
> I have already commented on this in other email. I consider sysfs an
> unsuitable interface for per-process API.
Wait, what? A new sysfs file/directory per process? That's crazy, no
one must have benchmarked it :)
And I agree, sysfs is not for that, please don't abuse it. Proc is for
process apis.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH RFC v2 0/4] mm/ksm: add option to automerge VMAs
From: Michal Hocko @ 2019-05-15 14:51 UTC (permalink / raw)
To: Oleksandr Natalenko
Cc: linux-kernel, Kirill Tkhai, Vlastimil Babka, Matthew Wilcox,
Pavel Tatashin, Timofey Titovets, Aaron Tomlin, Grzegorz Halat,
linux-mm, linux-api, Hugh Dickins, Suren Baghdasaryan,
Minchan Kim
In-Reply-To: <20190515065311.GB16651@dhcp22.suse.cz>
[Cc Suren and Minchan - the email thread starts here 20190514131654.25463-1-oleksandr@redhat.com]
On Wed 15-05-19 08:53:11, Michal Hocko wrote:
[...]
> I will try to comment on the interface itself later. But I have to say
> that I am not impressed. Abusing sysfs for per process features is quite
> gross to be honest.
I have already commented on this in other email. I consider sysfs an
unsuitable interface for per-process API. Not to mention this particular
one is very KSM specific while the question about setting different
hints on memory of a remote process is a more generic question. As
already mentioned there are usecases where people would like to say
that a certain memory is cold from outside of the process context (e.g.
monitor application). So essentially a form of a user space memory
management. And this usecase sounds a bit similar to me and having a
common api sounds more sensible to me.
One thing we were discussing at LSFMM this year was a way to either
provide madvise_remote(pid, addr, length, advice) or a fadvise
alternative over /proc/<pid>/map_vmas/<range> file descriptors
(essentially resembling the existing map_files api) to achieve such a
functionality. This is still a very rough idea but the api would sound
much more generic to me and it would allow much wider range of usecases.
But maybe I am completely wrong and this is just opens a can of worms
that we do not want.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Aleksa Sarai @ 2019-05-15 14:51 UTC (permalink / raw)
To: Christian Brauner
Cc: Yann Droneaud, jannh, oleg, viro, torvalds, linux-kernel, arnd,
dhowells, akpm, ebiederm, elena.reshetova, keescook, luto, luto,
tglx, linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k,
linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
sparclinux, linux-xtensa, linux-api, linux-arch, linux-kselftest
In-Reply-To: <20190515141634.lrc5ynllcmjr64mn@brauner.io>
[-- Attachment #1: Type: text/plain, Size: 805 bytes --]
On 2019-05-15, Christian Brauner <christian@brauner.io> wrote:
> On Wed, May 15, 2019 at 04:00:20PM +0200, Yann Droneaud wrote:
> > Would it be possible to create file descriptor with "restricted"
> > operation ?
> >
> > - O_RDONLY: waiting for process completion allowed (for example)
> > - O_WRONLY: sending process signal allowed
>
> Yes, something like this is likely going to be possible in the future.
> We had discussion around this. But mapping this to O_RDONLY and O_WRONLY
> is not the right model. It makes more sense to have specialized flags
> that restrict actions.
Not to mention that the O_* flags have silly values which we shouldn't
replicate in new syscalls IMHO.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-15 14:49 UTC (permalink / raw)
To: Oleg Nesterov
Cc: jannh, viro, torvalds, linux-kernel, arnd, dhowells, akpm, cyphar,
ebiederm, elena.reshetova, keescook, luto, luto, tglx,
linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
linux-xtensa, linux-api, linux-arch, linux-kselftest
In-Reply-To: <20190515143857.GB18892@redhat.com>
On Wed, May 15, 2019 at 04:38:58PM +0200, Oleg Nesterov wrote:
> On 05/15, Christian Brauner wrote:
> >
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > + int fd, ret;
> > + struct pid *p;
> > + struct task_struct *tsk;
> > +
> > + if (flags)
> > + return -EINVAL;
> > +
> > + if (pid <= 0)
> > + return -EINVAL;
> > +
> > + p = find_get_pid(pid);
> > + if (!p)
> > + return -ESRCH;
> > +
> > + rcu_read_lock();
> > + tsk = pid_task(p, PIDTYPE_PID);
>
> You do not need find_get_pid() before rcu_lock and put_pid() at the end.
> You can just do find_vpid() under rcu_read_lock().
Will do.
>
> > + if (!tsk)
> > + ret = -ESRCH;
> > + else if (unlikely(!thread_group_leader(tsk)))
> > + ret = -EINVAL;
>
> it seems that you can do a single check
>
> tsk = pid_task(p, PIDTYPE_TGID);
> if (!tsk)
> ret = -ESRCH;
>
> this even looks more correct if we race with exec changing the leader.
The logic here being that you can only reach the thread_group leader
from struct pid if PIDTYPE_PID == PIDTYPE_TGID for this struct pid?
Thanks, Oleg.
Christian
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Oleg Nesterov @ 2019-05-15 14:38 UTC (permalink / raw)
To: Christian Brauner
Cc: jannh, viro, torvalds, linux-kernel, arnd, dhowells, akpm, cyphar,
ebiederm, elena.reshetova, keescook, luto, luto, tglx,
linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
linux-xtensa, linux-api, linux-arch, linux-kselftest
In-Reply-To: <20190515100400.3450-1-christian@brauner.io>
On 05/15, Christian Brauner wrote:
>
> +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> +{
> + int fd, ret;
> + struct pid *p;
> + struct task_struct *tsk;
> +
> + if (flags)
> + return -EINVAL;
> +
> + if (pid <= 0)
> + return -EINVAL;
> +
> + p = find_get_pid(pid);
> + if (!p)
> + return -ESRCH;
> +
> + rcu_read_lock();
> + tsk = pid_task(p, PIDTYPE_PID);
You do not need find_get_pid() before rcu_lock and put_pid() at the end.
You can just do find_vpid() under rcu_read_lock().
> + if (!tsk)
> + ret = -ESRCH;
> + else if (unlikely(!thread_group_leader(tsk)))
> + ret = -EINVAL;
it seems that you can do a single check
tsk = pid_task(p, PIDTYPE_TGID);
if (!tsk)
ret = -ESRCH;
this even looks more correct if we race with exec changing the leader.
Oleg.
^ permalink raw reply
* Re: [PATCH RFC v2 0/4] mm/ksm: add option to automerge VMAs
From: Michal Hocko @ 2019-05-15 14:24 UTC (permalink / raw)
To: Oleksandr Natalenko
Cc: linux-kernel, Kirill Tkhai, Vlastimil Babka, Matthew Wilcox,
Pavel Tatashin, Timofey Titovets, Aaron Tomlin, Grzegorz Halat,
linux-mm, linux-api, Hugh Dickins
In-Reply-To: <20190515085158.hyuamrxkxhjhx6go@butterfly.localdomain>
On Wed 15-05-19 10:51:58, Oleksandr Natalenko wrote:
> On Wed, May 15, 2019 at 10:33:21AM +0200, Michal Hocko wrote:
> > > For my current setup with 2 Firefox instances I get 100 to 200 MiB saved
> > > for the second instance depending on the amount of tabs.
> >
> > What does prevent Firefox (an opensource project) to be updated to use
> > the explicit merging?
>
> This was rather an example of a big project. Other big projects may be
> closed source, of course.
Again, specific examples are usually considered a much better
justification than "something might use the feature".
[...]
> > OK, this makes more sense. Please note that there are other people who
> > would like to see certain madvise operations to be done on a remote
> > process - e.g. to allow external memory management (Android would like
> > to control memory aging so something like MADV_DONTNEED without loosing
> > content and more probably) and potentially other madvise operations.
> > Or maybe we need a completely new interface other than madvise.
>
> I didn't know about those intentions. Could you please point me to a
> relevant discussion so that I can check the details?
I am sorry I do not have any specific links to patches under discussion.
We have discussed that topic at LSFMM this year
(https://lwn.net/Articles/787217/) and Google guys should be sending
something soon.
--
Michal Hocko
SUSE Labs
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-15 14:16 UTC (permalink / raw)
To: Yann Droneaud
Cc: linux-ia64, linux-sh, linux-kernel, dhowells, linux-kselftest,
sparclinux, linux-api, elena.reshetova, linux-arch, linux-s390,
linux-xtensa, keescook, arnd, jannh, linux-m68k, viro, luto, oleg,
tglx, linux-arm-kernel, linux-parisc, cyphar, torvalds,
linux-mips, luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <4c5ae46657e1931a832def5645db61eb0bf1accd.camel@opteya.com>
On Wed, May 15, 2019 at 04:00:20PM +0200, Yann Droneaud wrote:
> Hi,
>
> Le mercredi 15 mai 2019 à 12:03 +0200, Christian Brauner a écrit :
> >
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..237d18d6ecb8 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct
> > pid_namespace *ns)
> > return idr_get_next(&ns->idr, &nr);
> > }
> >
> > +/**
> > + * pidfd_open() - Open new pid file descriptor.
> > + *
> > + * @pid: pid for which to retrieve a pidfd
> > + * @flags: flags to pass
> > + *
> > + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> > + * the process identified by @pid. Currently, the process identified by
> > + * @pid must be a thread-group leader. This restriction currently exists
> > + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> > + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> > + * leaders).
> > + *
>
> Would it be possible to create file descriptor with "restricted"
> operation ?
>
> - O_RDONLY: waiting for process completion allowed (for example)
> - O_WRONLY: sending process signal allowed
Yes, something like this is likely going to be possible in the future.
We had discussion around this. But mapping this to O_RDONLY and O_WRONLY
is not the right model. It makes more sense to have specialized flags
that restrict actions.
>
> For example, a process could send over a Unix socket a process a pidfd,
> allowing this to only wait for completion, but not sending signal ?
>
> I see the permission check is not done in pidfd_open(), so what prevent
> a user from sending a signal to another user owned process ?
That's supposed to be possible. You can do the same right now already
with pids. Tools like LMK need this probably very much.
Permission checking for signals is done at send time right now.
And if you can't signal via a pid you can't signal via a pidfd as
they're both subject to the same permissions checks.
>
> If it's in pidfd_send_signal(), then, passing the socket through
> SCM_RIGHT won't be useful if the target process is not owned by the
> same user, or root.
>
> > + * Return: On success, a cloexec pidfd is returned.
> > + * On error, a negative errno number will be returned.
> > + */
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > + int fd, ret;
> > + struct pid *p;
> > + struct task_struct *tsk;
> > +
> > + if (flags)
> > + return -EINVAL;
> > +
> > + if (pid <= 0)
> > + return -EINVAL;
> > +
> > + p = find_get_pid(pid);
> > + if (!p)
> > + return -ESRCH;
> > +
> > + rcu_read_lock();
> > + tsk = pid_task(p, PIDTYPE_PID);
> > + if (!tsk)
> > + ret = -ESRCH;
> > + else if (unlikely(!thread_group_leader(tsk)))
> > + ret = -EINVAL;
> > + else
> > + ret = 0;
> > + rcu_read_unlock();
> > +
> > + fd = ret ?: pidfd_create(p);
> > + put_pid(p);
> > + return fd;
> > +}
> > +
> > void __init pid_idr_init(void)
> > {
> > /* Verify no one has done anything silly: */
>
> Regards.
>
> --
> Yann Droneaud
> OPTEYA
>
>
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Yann Droneaud @ 2019-05-15 14:00 UTC (permalink / raw)
To: Christian Brauner, jannh, oleg, viro, torvalds, linux-kernel,
arnd, dhowells
Cc: linux-ia64, linux-sh, linux-mips, linux-kselftest, sparclinux,
elena.reshetova, linux-arch, linux-s390, linux-xtensa, keescook,
linux-m68k, luto, tglx, linux-arm-kernel, linux-parisc, linux-api,
cyphar, luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515100400.3450-1-christian@brauner.io>
Hi,
Le mercredi 15 mai 2019 à 12:03 +0200, Christian Brauner a écrit :
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 20881598bdfa..237d18d6ecb8 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct
> pid_namespace *ns)
> return idr_get_next(&ns->idr, &nr);
> }
>
> +/**
> + * pidfd_open() - Open new pid file descriptor.
> + *
> + * @pid: pid for which to retrieve a pidfd
> + * @flags: flags to pass
> + *
> + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> + * the process identified by @pid. Currently, the process identified by
> + * @pid must be a thread-group leader. This restriction currently exists
> + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> + * leaders).
> + *
Would it be possible to create file descriptor with "restricted"
operation ?
- O_RDONLY: waiting for process completion allowed (for example)
- O_WRONLY: sending process signal allowed
For example, a process could send over a Unix socket a process a pidfd,
allowing this to only wait for completion, but not sending signal ?
I see the permission check is not done in pidfd_open(), so what prevent
a user from sending a signal to another user owned process ?
If it's in pidfd_send_signal(), then, passing the socket through
SCM_RIGHT won't be useful if the target process is not owned by the
same user, or root.
> + * Return: On success, a cloexec pidfd is returned.
> + * On error, a negative errno number will be returned.
> + */
> +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> +{
> + int fd, ret;
> + struct pid *p;
> + struct task_struct *tsk;
> +
> + if (flags)
> + return -EINVAL;
> +
> + if (pid <= 0)
> + return -EINVAL;
> +
> + p = find_get_pid(pid);
> + if (!p)
> + return -ESRCH;
> +
> + rcu_read_lock();
> + tsk = pid_task(p, PIDTYPE_PID);
> + if (!tsk)
> + ret = -ESRCH;
> + else if (unlikely(!thread_group_leader(tsk)))
> + ret = -EINVAL;
> + else
> + ret = 0;
> + rcu_read_unlock();
> +
> + fd = ret ?: pidfd_create(p);
> + put_pid(p);
> + return fd;
> +}
> +
> void __init pid_idr_init(void)
> {
> /* Verify no one has done anything silly: */
Regards.
--
Yann Droneaud
OPTEYA
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply
* Re: [PATCH 1/2] pid: add pidfd_open()
From: Geert Uytterhoeven @ 2019-05-15 12:29 UTC (permalink / raw)
To: Christian Brauner
Cc: Jann Horn, Oleg Nesterov, Al Viro, torvalds@linux-foundation.org,
Linux Kernel Mailing List, Arnd Bergmann, David Howells,
linux-ia64@vger.kernel.org, Linux-sh list, linux-mips,
open list:KERNEL SELFTEST FRAMEWORK, sparclinux, elena.reshetova,
Linux-Arch, linux-s390, linux-xtensa, Kees Cook, linux-m68k, Andy
In-Reply-To: <20190515100400.3450-1-christian@brauner.io>
On Wed, May 15, 2019 at 12:04 PM Christian Brauner <christian@brauner.io> wrote:
> This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> process that is created via traditional fork()/clone() calls that is only
> referenced by a PID:
>
> int pidfd = pidfd_open(1234, 0);
> ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
>
> With the introduction of pidfds through CLONE_PIDFD it is possible to
> created pidfds at process creation time.
> However, a lot of processes get created with traditional PID-based calls
> such as fork() or clone() (without CLONE_PIDFD). For these processes a
> caller can currently not create a pollable pidfd. This is a huge problem
> for Android's low memory killer (LMK) and service managers such as systemd.
> Both are examples of tools that want to make use of pidfds to get reliable
> notification of process exit for non-parents (pidfd polling) and race-free
> signal sending (pidfd_send_signal()). They intend to switch to this API for
> process supervision/management as soon as possible. Having no way to get
> pollable pidfds from PID-only processes is one of the biggest blockers for
> them in adopting this api. With pidfd_open() making it possible to retrieve
> pidfd for PID-based processes we enable them to adopt this api.
>
> In line with Arnd's recent changes to consolidate syscall numbers across
> architectures, I have added the pidfd_open() syscall to all architectures
> at the same time.
>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> arch/m68k/kernel/syscalls/syscall.tbl | 1 +
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
^ permalink raw reply
* Re: [PATCH v2 0/3] initramfs: add support for xattrs in the initial ram disk
From: Roberto Sassu @ 2019-05-15 11:19 UTC (permalink / raw)
To: Arvind Sankar, James Bottomley
Cc: Rob Landley, Andy Lutomirski, Arvind Sankar, LKML, Linux API,
Linux FS Devel, linux-integrity, initramfs, Silviu Vlasceanu
In-Reply-To: <20190515005221.GB88615@rani.riverdale.lan>
On 5/15/2019 2:52 AM, Arvind Sankar wrote:
> On Tue, May 14, 2019 at 04:54:12PM -0700, James Bottomley wrote:
>> On Tue, 2019-05-14 at 18:39 -0500, Rob Landley wrote:
>>> On 5/14/19 2:18 PM, James Bottomley wrote:
>>>>> I think Rob is right here. If /init was statically built into
>>>>> the kernel image, it has no more ability to compromise the kernel
>>>>> than anything else in the kernel. What's the problem here?
>>>>
>>>> The specific problem is that unless you own the kernel signing key,
>>>> which is really untrue for most distribution consumers because the
>>>> distro owns the key, you cannot build the initrd statically into
>>>> the kernel. You can take the distro signed kernel, link it with
>>>> the initrd then resign the combination with your key, provided you
>>>> insert your key into the MoK variables as a trusted secure boot
>>>> key, but the distros have been unhappy recommending this as
>>>> standard practice.
>>>>
>>>> If our model for security is going to be to link the kernel and the
>>>> initrd statically to give signature protection over the aggregate
>>>> then we need to figure out how to execute this via the distros. If
>>>> we accept that the split model, where the distro owns and signs the
>>>> kernel but the machine owner builds and is responsible for the
>>>> initrd, then we need to explore split security models like this
>>>> proposal.
>>>
>>> You can have a built-in and an external initrd? The second extracts
>>> over the first? (I know because once upon a time conflicting files
>>> would append. It sounds like the desired behavior here is O_EXCL fail
>>> and move on.)
>>
>> Technically yes, because the first initrd could find the second by some
>> predefined means, extract it to a temporary directory and do a
>> pivot_root() and then the second would do some stuff, find the real
>> root and do a pivot_root() again. However, while possible, wouldn't it
>> just add to the rendezvous complexity without adding any benefits? even
>> if the first initrd is built and signed by the distro and the second is
>> built by you, the first has to verify the second somehow. I suppose
>> the second could be tar extracted, which would add xattrs, if that's
>> the goal?
>>
>> James
>>
> You can specify multiple initrd's to the boot loader, and they get
> loaded in sequence into memory and parsed by the kernel before /init is
> launched. Currently I believe later ones will overwrite the earlier
> ones, which is why we've been talking about adding an option to prevent
> that. You don't have to mess with manually finding/parsing initramfs's
> which wouldn't even be feasible since you may not have the drivers
> loaded yet to access the device/filesystem on which they live.
>
> Once that's done, the embedded /init is just going to do in userspace
> wht the current patch does in the kernel. So all the files in the
> external initramfs(es) would need to have IMA signatures via the special
> xattr file.
So, the scheme you are proposing is not equivalent: using the distro key
to verify signatures, compared to adding a new user key to verify the
initramfs he builds. Why would it be necessary for the user to share
responsibility with the distro, if the only files he uses come from the
distro?
> Note that if you want the flexibility to be able to load one or both of
> two external initramfs's, the current in-kernel proposal wouldn't be
> enough -- the xattr specification would have to be more flexible (eg
> reading .xattr-list* to allow each initramfs to specifiy its own
> xattrs. This sort of enhancement would be much easier to handle with the
> userspace variant.
Yes, the alternative solution is to parse .xattr-list at the time it is
extracted. The .xattr-list of each initramfs will be processed. Also,
the CPIO parser doesn't have to reopen the file after all other files
have been extracted.
Roberto
--
HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
Managing Director: Bo PENG, Jian LI, Yanli SHI
^ permalink raw reply
* [PATCH 2/2] tests: add pidfd_open() tests
From: Christian Brauner @ 2019-05-15 10:04 UTC (permalink / raw)
To: jannh, oleg, viro, torvalds, linux-kernel, arnd, dhowells
Cc: akpm, cyphar, ebiederm, elena.reshetova, keescook, luto, luto,
tglx, linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k,
linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
sparclinux, linux-xtensa, linux-api, linux-arch, linux-kselftest,
Christian Brauner, Michael Kerrisk (man-pages)
In-Reply-To: <20190515100400.3450-1-christian@brauner.io>
This adds testing for the new pidfd_open() syscalls. Specifically, we test:
- that no invalid flags can be passed to pidfd_open()
- that no invalid pid can be passed to pidfd_open()
- that a pidfd can be retrieved with pidfd_open()
- that the retrieved pidfd references the correct pid
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
---
tools/testing/selftests/pidfd/Makefile | 2 +-
tools/testing/selftests/pidfd/pidfd.h | 57 ++++++
.../testing/selftests/pidfd/pidfd_open_test.c | 170 ++++++++++++++++++
tools/testing/selftests/pidfd/pidfd_test.c | 41 +----
4 files changed, 229 insertions(+), 41 deletions(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd.h
create mode 100644 tools/testing/selftests/pidfd/pidfd_open_test.c
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index deaf8073bc06..b36c0be70848 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -1,6 +1,6 @@
CFLAGS += -g -I../../../../usr/include/
-TEST_GEN_PROGS := pidfd_test
+TEST_GEN_PROGS := pidfd_test pidfd_open_test
include ../lib.mk
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
new file mode 100644
index 000000000000..8452e910463f
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __PIDFD_H
+#define __PIDFD_H
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/mount.h>
+
+#include "../kselftest.h"
+
+/*
+ * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
+ * That means, when it wraps around any pid < 300 will be skipped.
+ * So we need to use a pid > 300 in order to test recycling.
+ */
+#define PID_RECYCLE 1000
+
+/*
+ * Define a few custom error codes for the child process to clearly indicate
+ * what is happening. This way we can tell the difference between a system
+ * error, a test error, etc.
+ */
+#define PIDFD_PASS 0
+#define PIDFD_FAIL 1
+#define PIDFD_ERROR 2
+#define PIDFD_SKIP 3
+#define PIDFD_XFAIL 4
+
+int wait_for_pid(pid_t pid)
+{
+ int status, ret;
+
+again:
+ ret = waitpid(pid, &status, 0);
+ if (ret == -1) {
+ if (errno == EINTR)
+ goto again;
+
+ return -1;
+ }
+
+ if (!WIFEXITED(status))
+ return -1;
+
+ return WEXITSTATUS(status);
+}
+
+
+#endif /* __PIDFD_H */
diff --git a/tools/testing/selftests/pidfd/pidfd_open_test.c b/tools/testing/selftests/pidfd/pidfd_open_test.c
new file mode 100644
index 000000000000..9b073c1ac618
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_open_test.c
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <linux/wait.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/mount.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "pidfd.h"
+#include "../kselftest.h"
+
+static inline int sys_pidfd_open(pid_t pid, unsigned int flags)
+{
+ return syscall(__NR_pidfd_open, pid, flags);
+}
+
+static int safe_int(const char *numstr, int *converted)
+{
+ char *err = NULL;
+ long sli;
+
+ errno = 0;
+ sli = strtol(numstr, &err, 0);
+ if (errno == ERANGE && (sli == LONG_MAX || sli == LONG_MIN))
+ return -ERANGE;
+
+ if (errno != 0 && sli == 0)
+ return -EINVAL;
+
+ if (err == numstr || *err != '\0')
+ return -EINVAL;
+
+ if (sli > INT_MAX || sli < INT_MIN)
+ return -ERANGE;
+
+ *converted = (int)sli;
+ return 0;
+}
+
+static int char_left_gc(const char *buffer, size_t len)
+{
+ size_t i;
+
+ for (i = 0; i < len; i++) {
+ if (buffer[i] == ' ' ||
+ buffer[i] == '\t')
+ continue;
+
+ return i;
+ }
+
+ return 0;
+}
+
+static int char_right_gc(const char *buffer, size_t len)
+{
+ int i;
+
+ for (i = len - 1; i >= 0; i--) {
+ if (buffer[i] == ' ' ||
+ buffer[i] == '\t' ||
+ buffer[i] == '\n' ||
+ buffer[i] == '\0')
+ continue;
+
+ return i + 1;
+ }
+
+ return 0;
+}
+
+static char *trim_whitespace_in_place(char *buffer)
+{
+ buffer += char_left_gc(buffer, strlen(buffer));
+ buffer[char_right_gc(buffer, strlen(buffer))] = '\0';
+ return buffer;
+}
+
+static pid_t get_pid_from_fdinfo_file(int pidfd, const char *key, size_t keylen)
+{
+ int ret;
+ char path[512];
+ FILE *f;
+ size_t n = 0;
+ pid_t result = -1;
+ char *line = NULL;
+
+ snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
+
+ f = fopen(path, "re");
+ if (!f)
+ return -1;
+
+ while (getline(&line, &n, f) != -1) {
+ char *numstr;
+
+ if (strncmp(line, key, keylen))
+ continue;
+
+ numstr = trim_whitespace_in_place(line + 4);
+ ret = safe_int(numstr, &result);
+ if (ret < 0)
+ goto out;
+
+ break;
+ }
+
+out:
+ free(line);
+ fclose(f);
+ return result;
+}
+
+int main(int argc, char **argv)
+{
+ int pidfd = -1, ret = 1;
+ pid_t pid;
+
+ pidfd = sys_pidfd_open(-1, 0);
+ if (pidfd >= 0) {
+ ksft_print_msg(
+ "%s - succeeded to open pidfd for invalid pid -1\n",
+ strerror(errno));
+ goto on_error;
+ }
+ ksft_test_result_pass("do not allow invalid pid test: passed\n");
+ ksft_inc_pass_cnt();
+
+ pidfd = sys_pidfd_open(getpid(), 1);
+ if (pidfd >= 0) {
+ ksft_print_msg(
+ "%s - succeeded to open pidfd with invalid flag value specified\n",
+ strerror(errno));
+ goto on_error;
+ }
+ ksft_test_result_pass("do not allow invalid flag test: passed\n");
+ ksft_inc_pass_cnt();
+
+ pidfd = sys_pidfd_open(getpid(), 0);
+ if (pidfd < 0) {
+ ksft_print_msg("%s - failed to open pidfd\n", strerror(errno));
+ goto on_error;
+ }
+ ksft_test_result_pass("open a new pidfd test: passed\n");
+ ksft_inc_pass_cnt();
+
+ pid = get_pid_from_fdinfo_file(pidfd, "Pid:", sizeof("Pid:") - 1);
+ ksft_print_msg("pidfd %d refers to process with pid %d\n", pidfd, pid);
+
+ ret = 0;
+
+on_error:
+ if (pidfd >= 0)
+ close(pidfd);
+
+ return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
diff --git a/tools/testing/selftests/pidfd/pidfd_test.c b/tools/testing/selftests/pidfd/pidfd_test.c
index d59378a93782..f01de87249c9 100644
--- a/tools/testing/selftests/pidfd/pidfd_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_test.c
@@ -14,6 +14,7 @@
#include <sys/wait.h>
#include <unistd.h>
+#include "pidfd.h"
#include "../kselftest.h"
static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
@@ -62,28 +63,6 @@ static int test_pidfd_send_signal_simple_success(void)
return 0;
}
-static int wait_for_pid(pid_t pid)
-{
- int status, ret;
-
-again:
- ret = waitpid(pid, &status, 0);
- if (ret == -1) {
- if (errno == EINTR)
- goto again;
-
- return -1;
- }
-
- if (ret != pid)
- goto again;
-
- if (!WIFEXITED(status))
- return -1;
-
- return WEXITSTATUS(status);
-}
-
static int test_pidfd_send_signal_exited_fail(void)
{
int pidfd, ret, saved_errno;
@@ -128,13 +107,6 @@ static int test_pidfd_send_signal_exited_fail(void)
return 0;
}
-/*
- * The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
- * That means, when it wraps around any pid < 300 will be skipped.
- * So we need to use a pid > 300 in order to test recycling.
- */
-#define PID_RECYCLE 1000
-
/*
* Maximum number of cycles we allow. This is equivalent to PID_MAX_DEFAULT.
* If users set a higher limit or we have cycled PIDFD_MAX_DEFAULT number of
@@ -143,17 +115,6 @@ static int test_pidfd_send_signal_exited_fail(void)
*/
#define PIDFD_MAX_DEFAULT 0x8000
-/*
- * Define a few custom error codes for the child process to clearly indicate
- * what is happening. This way we can tell the difference between a system
- * error, a test error, etc.
- */
-#define PIDFD_PASS 0
-#define PIDFD_FAIL 1
-#define PIDFD_ERROR 2
-#define PIDFD_SKIP 3
-#define PIDFD_XFAIL 4
-
static int test_pidfd_send_signal_recycled_pid_fail(void)
{
int i, ret;
--
2.21.0
^ permalink raw reply related
* [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-15 10:03 UTC (permalink / raw)
To: jannh, oleg, viro, torvalds, linux-kernel, arnd, dhowells
Cc: akpm, cyphar, ebiederm, elena.reshetova, keescook, luto, luto,
tglx, linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k,
linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
sparclinux, linux-xtensa, linux-api, linux-arch, linux-kselftest,
Christian Brauner
This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
process that is created via traditional fork()/clone() calls that is only
referenced by a PID:
int pidfd = pidfd_open(1234, 0);
ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
With the introduction of pidfds through CLONE_PIDFD it is possible to
created pidfds at process creation time.
However, a lot of processes get created with traditional PID-based calls
such as fork() or clone() (without CLONE_PIDFD). For these processes a
caller can currently not create a pollable pidfd. This is a huge problem
for Android's low memory killer (LMK) and service managers such as systemd.
Both are examples of tools that want to make use of pidfds to get reliable
notification of process exit for non-parents (pidfd polling) and race-free
signal sending (pidfd_send_signal()). They intend to switch to this API for
process supervision/management as soon as possible. Having no way to get
pollable pidfds from PID-only processes is one of the biggest blockers for
them in adopting this api. With pidfd_open() making it possible to retrieve
pidfd for PID-based processes we enable them to adopt this api.
In line with Arnd's recent changes to consolidate syscall numbers across
architectures, I have added the pidfd_open() syscall to all architectures
at the same time.
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/pid.h | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 2 +-
kernel/pid.c | 48 +++++++++++++++++++++
19 files changed, 69 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 165f268beafc..ddc3c93ad7a7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -467,3 +467,4 @@
535 common io_uring_setup sys_io_uring_setup
536 common io_uring_enter sys_io_uring_enter
537 common io_uring_register sys_io_uring_register
+538 common pidfd_open sys_pidfd_open
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 23f1a44acada..350e2049b4a9 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -874,6 +874,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 56e3d0b685e1..7115f6dd347a 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -348,3 +348,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index df4ec3ec71d1..44bf12b16ffe 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -427,3 +427,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 4964947732af..0d32e5152dc0 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 9392dfe33f97..726e107b3c9f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -366,3 +366,4 @@
425 n32 io_uring_setup sys_io_uring_setup
426 n32 io_uring_enter sys_io_uring_enter
427 n32 io_uring_register sys_io_uring_register
+428 n32 pidfd_open sys_pidfd_open
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index fe8ca623add8..83b46b568d51 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -424,3 +424,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 00f5a63c8d9a..5294d04d7fa5 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -509,3 +509,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 061418f787c3..dcdb838adf49 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open sys_pidfd_open
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 480b057556ee..8e66edfbc521 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index a1dd24307b00..d6f3bc686939 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4cd5f982b1e5..1af6b469160a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup
426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter
427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register
+428 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 64ca0d06259a..c18e6ebe3387 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
425 common io_uring_setup __x64_sys_io_uring_setup
426 common io_uring_enter __x64_sys_io_uring_enter
427 common io_uring_register __x64_sys_io_uring_register
+428 common pidfd_open __x64_sys_pidfd_open
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 30084eaf8422..21ee795f3003 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -398,3 +398,4 @@
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
427 common io_uring_register sys_io_uring_register
+428 common pidfd_open sys_pidfd_open
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..c938a92eab99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -67,6 +67,7 @@ struct pid
extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
+extern int pidfd_create(struct pid *pid);
static inline struct pid *get_pid(struct pid *pid)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..989055e0b501 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
struct old_timex32 __user *tx);
asmlinkage long sys_syncfs(int fd);
asmlinkage long sys_setns(int fd, int nstype);
+asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
unsigned int vlen, unsigned flags);
asmlinkage long sys_process_vm_readv(pid_t pid,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dee7292e1df6..94a257a93d20 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
#define __NR_io_uring_register 427
__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
+#define __NR_pidfd_open 428
+__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#undef __NR_syscalls
-#define __NR_syscalls 428
+#define __NR_syscalls 429
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 737db1828437..980cc1d2b8d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
* Return: On success, a cloexec pidfd is returned.
* On error, a negative errno number will be returned.
*/
-static int pidfd_create(struct pid *pid)
+int pidfd_create(struct pid *pid)
{
int fd;
diff --git a/kernel/pid.c b/kernel/pid.c
index 20881598bdfa..237d18d6ecb8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -38,6 +38,7 @@
#include <linux/syscalls.h>
#include <linux/proc_ns.h>
#include <linux/proc_fs.h>
+#include <linux/sched/signal.h>
#include <linux/sched/task.h>
#include <linux/idr.h>
@@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
return idr_get_next(&ns->idr, &nr);
}
+/**
+ * pidfd_open() - Open new pid file descriptor.
+ *
+ * @pid: pid for which to retrieve a pidfd
+ * @flags: flags to pass
+ *
+ * This creates a new pid file descriptor with the O_CLOEXEC flag set for
+ * the process identified by @pid. Currently, the process identified by
+ * @pid must be a thread-group leader. This restriction currently exists
+ * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
+ * be used with CLONE_THREAD) and pidfd polling (only supports thread group
+ * leaders).
+ *
+ * Return: On success, a cloexec pidfd is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
+{
+ int fd, ret;
+ struct pid *p;
+ struct task_struct *tsk;
+
+ if (flags)
+ return -EINVAL;
+
+ if (pid <= 0)
+ return -EINVAL;
+
+ p = find_get_pid(pid);
+ if (!p)
+ return -ESRCH;
+
+ rcu_read_lock();
+ tsk = pid_task(p, PIDTYPE_PID);
+ if (!tsk)
+ ret = -ESRCH;
+ else if (unlikely(!thread_group_leader(tsk)))
+ ret = -EINVAL;
+ else
+ ret = 0;
+ rcu_read_unlock();
+
+ fd = ret ?: pidfd_create(p);
+ put_pid(p);
+ return fd;
+}
+
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
--
2.21.0
^ permalink raw reply related
* [PATCH v9 16/16] sched/core: uclamp: Update CPU's refcount on TG's clamp changes
From: Patrick Bellasi @ 2019-05-15 9:44 UTC (permalink / raw)
To: linux-kernel, linux-pm, linux-api
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190515094459.10317-1-patrick.bellasi@arm.com>
On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or clamped as requested.
Do that by slightly refactoring uclamp_bucket_inc(). An additional
parameter *cgroup_subsys_state (css) is used to walk the list of tasks
in the TGs and update the RUNNABLE ones. Do that by taking the rq
lock for each task, the same mechanism used for cpu affinity masks
updates.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
---
kernel/sched/core.c | 48 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 354d925a6ba8..0c078d586f36 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1031,6 +1031,51 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}
+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ /*
+ * Lock the task and the rq where the task is (or was) queued.
+ *
+ * We might lock the (previous) rq of a !RUNNABLE task, but that's the
+ * price to pay to safely serialize util_{min,max} updates with
+ * enqueues, dequeues and migration operations.
+ * This is the same locking schema used by __set_cpus_allowed_ptr().
+ */
+ rq = task_rq_lock(p, &rf);
+
+ /*
+ * Setting the clamp bucket is serialized by task_rq_lock().
+ * If the task is not yet RUNNABLE and its task_struct is not
+ * affecting a valid clamp bucket, the next time it's enqueued,
+ * it will already see the updated clamp bucket value.
+ */
+ if (!p->uclamp[clamp_id].active)
+ goto done;
+
+ uclamp_rq_dec_id(rq, p, clamp_id);
+ uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
+ task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css, int clamp_id)
+{
+ struct css_task_iter it;
+ struct task_struct *p;
+
+ css_task_iter_start(css, 0, &it);
+ while ((p = css_task_iter_next(&it)))
+ uclamp_update_active(p, clamp_id);
+ css_task_iter_end(&it);
+}
+
#ifdef CONFIG_UCLAMP_TASK_GROUP
static void cpu_util_update_eff(struct cgroup_subsys_state *css,
unsigned int clamp_id);
@@ -7044,6 +7089,9 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css,
uc_se->value = value;
uc_se->bucket_id = uclamp_bucket_id(value);
+
+ /* Immediately update descendants RUNNABLE tasks */
+ uclamp_update_active_tasks(css, clamp_id);
}
}
--
2.21.0
^ permalink raw reply related
* [PATCH v9 15/16] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
From: Patrick Bellasi @ 2019-05-15 9:44 UTC (permalink / raw)
To: linux-kernel, linux-pm, linux-api
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190515094459.10317-1-patrick.bellasi@arm.com>
When a task specific clamp value is configured via sched_setattr(2),
this value is accounted in the corresponding clamp bucket every time the
task is {en,de}qeued. However, when cgroups are also in use, the task
specific clamp values could be restricted by the task_group (TG)
clamp values.
Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every
time a task is enqueued, it's accounted in the clamp_bucket defining the
smaller clamp between the task specific value and its TG effective
value. This allows to:
1. ensure cgroup clamps are always used to restrict task specific
requests, i.e. boosted only up to the effective granted value or
clamped at least to a certain value
2. implement a "nice-like" policy, where tasks are still allowed to
request less then what enforced by their current TG
This mimics what already happens for a task's CPU affinity mask when the
task is also in a cpuset, i.e. cgroup attributes are always used to
restrict per-task attributes.
Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.
Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, only
system defaults are enforced.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
---
kernel/sched/core.c | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd96a977ed07..354d925a6ba8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -861,16 +861,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
}
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+ struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+ struct uclamp_se uc_max;
+
+ /*
+ * Tasks in autogroups or root task group will be
+ * restricted by system defaults.
+ */
+ if (task_group_is_autogroup(task_group(p)))
+ return uc_req;
+ if (task_group(p) == &root_task_group)
+ return uc_req;
+
+ uc_max = task_group(p)->uclamp[clamp_id];
+ if (uc_req.value > uc_max.value || !uc_req.user_defined)
+ return uc_max;
+#endif
+
+ return uc_req;
+}
+
/*
* The effective clamp bucket index of a task depends on, by increasing
* priority:
* - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ * group or in an autogroup
* - the system default clamp value, defined by the sysadmin
*/
static inline struct uclamp_se
uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
{
- struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+ struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
/* System default restrictions always apply */
--
2.21.0
^ permalink raw reply related
* [PATCH v9 14/16] sched/core: uclamp: Propagate system defaults to root group
From: Patrick Bellasi @ 2019-05-15 9:44 UTC (permalink / raw)
To: linux-kernel, linux-pm, linux-api
Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, Rafael J . Wysocki,
Vincent Guittot, Viresh Kumar, Paul Turner, Quentin Perret,
Dietmar Eggemann, Morten Rasmussen, Juri Lelli, Todd Kjos,
Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190515094459.10317-1-patrick.bellasi@arm.com>
The clamp values are not tunable at the level of the root task group.
That's for two main reasons:
- the root group represents "system resources" which are always
entirely available from the cgroup standpoint.
- when tuning/restricting "system resources" makes sense, tuning must
be done using a system wide API which should also be available when
control groups are not.
When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.
Utilization clamping supports already the concepts of:
- system defaults: which define the maximum possible clamp values
usable by tasks.
- effective clamps: which allows a parent cgroup to constraint (maybe
temporarily) its descendants without losing the information related
to the values "requested" from them.
Exploit these two concepts and bind them together in such a way that,
whenever system default are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
---
kernel/sched/core.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index efedbd3a0ce6..bd96a977ed07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1005,6 +1005,13 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
}
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css,
+ unsigned int clamp_id);
+#else
+#define cpu_util_update_eff(...)
+#endif
+
int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -1038,6 +1045,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
sysctl_sched_uclamp_util_max, false);
}
+ cpu_util_update_eff(&root_task_group.css, UCLAMP_MIN);
+ cpu_util_update_eff(&root_task_group.css, UCLAMP_MAX);
+
/*
* Updating all the RUNNABLE task is expensive, keep it simple and do
* just a lazy update at each next enqueue time.
--
2.21.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox