* [PATCH RFC] pidns: introduce syscall getvpid
From: Konstantin Khlebnikov @ 2015-09-15 12:09 UTC
To: linux-api, containers, linux-kernel
Cc: Andrew Morton, Linus Torvalds, Oleg Nesterov, Eric W. Biederman

pid_t getvpid(pid_t pid, pid_t source, pid_t target);

This syscall converts pid from one pid-ns into pid in another pid-ns:
it takes @pid in namespace of @source task (zero for current) and
returns related pid in namespace of @target task (zero for current too).
If pid is unreachable from target pid-ns then it returns zero.

Such conversion is required for interaction between processes from
different pid-namespaces. For example when system service talks with
client from isolated container via socket about task in container:

getvpid(pid, client_pid, 0) -> pid in our pid namespace
getvpid(pid, 0, client_pid) -> pid in client pid namespace

Also service can get pid of init task and match it with container:

getvpid(1, client_pid, 0) -> pid of init task for client_pid

Seems like gdb and strace could use this too for converting pids of
newly forked tasks (IIRR they get pid from %rax) into pid from
correct namespace for further interaction.

As a bonus syscall getvpid can compare pid namespaces and
test isolation without mounted procfs:

getvpid(1, 0, pid) == 0 -> pid in our sub-pid-namespace
getvpid(1, 0, pid) == 1 -> pid in our pid-namespace
getvpid(1, pid1, pid2) == 0 -> pid1 isolated from pid2
getvpid(1, pid1, pid2) == 1 -> tasks are in one pid-namespace
getvpid(1, pid1, pid2) > 1 -> pid1 is in sub-pidns of pid2

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 include/linux/syscalls.h               |    1 +
 kernel/pid.c                           |   36 ++++++++++++++++++++++++++++++++
 4 files changed, 39 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7663c455b9f6..dadb55d42fc9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -382,3 +382,4 @@
 373	i386	shutdown		sys_shutdown
 374	i386	userfaultfd		sys_userfaultfd
 375	i386	membarrier		sys_membarrier
+376	i386	getvpid			sys_getvpid
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842fdf1f6..0338f2eb3b7c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	common	userfaultfd		sys_userfaultfd
 324	common	membarrier		sys_membarrier
+325	common	getvpid			sys_getvpid
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a460e2ef2843..3405c30999e3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -222,6 +222,7 @@ asmlinkage long sys_nanosleep(struct timespec __user *rqtp, struct timespec __us
 asmlinkage long sys_alarm(unsigned int seconds);
 asmlinkage long sys_getpid(void);
 asmlinkage long sys_getppid(void);
+asmlinkage long sys_getvpid(pid_t pid, pid_t source, pid_t target);
 asmlinkage long sys_getuid(void);
 asmlinkage long sys_geteuid(void);
 asmlinkage long sys_getgid(void);
diff --git a/kernel/pid.c b/kernel/pid.c
index ca368793808e..caa676ff7364 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -567,6 +567,42 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
 	return pid;
 }
 
+/**
+ * sys_getvpid - convert pid from one pid-namespace into pid from another
+ *
+ * @pid    - pid of requested task
+ * @source - pid of task in source pid-namespace, zero for current
+ * @target - pid of task in target pid-namespace, zero for current
+ *
+ * Returns pid from target pid-ns or zero if pid is unreachable.
+ * Returns -ESRCH if some of pids are not found.
+ */
+SYSCALL_DEFINE3(getvpid, pid_t, pid, pid_t, source, pid_t, target)
+{
+#ifdef CONFIG_PID_NS
+	struct pid_namespace *current_ns = task_active_pid_ns(current);
+	struct pid_namespace *source_ns = current_ns, *target_ns = current_ns;
+	struct pid *task_pid;
+	pid_t result = -ESRCH;
+
+	rcu_read_lock();
+	if (source)
+		source_ns = ns_of_pid(find_pid_ns(source, current_ns));
+	if (target)
+		target_ns = ns_of_pid(find_pid_ns(target, current_ns));
+	if (source_ns && target_ns) {
+		task_pid = find_pid_ns(pid, source_ns);
+		if (task_pid)
+			result = pid_nr_ns(task_pid, target_ns);
+	}
+	rcu_read_unlock();
+
+	return result;
+#else
+	return pid;
+#endif /* CONFIG_PID_NS */
+}
+
 /*
  * The pid hash table is scaled according to the amount of memory in the
  * machine.  From a minimum of 16 slots up to 4096 slots at one gigabyte or
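
The return-value rules in the description above (translated pid, zero when the task is not visible from the target namespace, -ESRCH when a pid cannot be found) can be tried out in user space. What follows is a minimal sketch, not kernel code and not part of the patch: `toy_ns`, `toy_task`, `toy_tasks` and `toy_getvpid` are hypothetical names, and the sample hierarchy (one host namespace, one container) is invented for illustration. It only mirrors the lookup order used by the proposed `sys_getvpid`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#define TOY_MAX_LEVEL 4

/* Toy stand-in for a pid namespace: nesting depth plus a link to the parent. */
struct toy_ns {
	int level;                    /* 0 for the root namespace */
	const struct toy_ns *parent;  /* NULL for the root namespace */
};

/* Toy stand-in for a task: nr[i] is its pid as seen from the ancestor
 * namespace at level i, up to and including its own namespace. */
struct toy_task {
	const struct toy_ns *ns;
	pid_t nr[TOY_MAX_LEVEL];
};

/* Hypothetical sample hierarchy: a host and one container. */
static const struct toy_ns host_ns = { 0, NULL };
static const struct toy_ns ct_ns   = { 1, &host_ns };

static const struct toy_task toy_tasks[] = {
	{ &host_ns, { 1 } },        /* host init */
	{ &host_ns, { 100 } },      /* host service */
	{ &ct_ns,   { 500, 1 } },   /* container init */
	{ &ct_ns,   { 501, 7 } },   /* container worker */
};
#define TOY_NTASKS (sizeof(toy_tasks) / sizeof(toy_tasks[0]))

/* Pid of `t` as seen from `ns`, or 0 if `t` is not visible there. */
static pid_t toy_pid_nr_ns(const struct toy_task *t, const struct toy_ns *ns)
{
	const struct toy_ns *anc = t->ns;

	assert(ns->level < TOY_MAX_LEVEL);
	/* Walk up from the task's namespace to the level of `ns`. */
	while (anc && anc->level > ns->level)
		anc = anc->parent;
	return anc == ns ? t->nr[ns->level] : 0;
}

/* Task whose pid inside `ns` equals `nr`, or NULL if there is none. */
static const struct toy_task *toy_find(pid_t nr, const struct toy_ns *ns)
{
	size_t i;

	for (i = 0; nr > 0 && i < TOY_NTASKS; i++)
		if (toy_pid_nr_ns(&toy_tasks[i], ns) == nr)
			return &toy_tasks[i];
	return NULL;
}

/* Model of the proposed call, evaluated on behalf of `caller`. */
pid_t toy_getvpid(const struct toy_task *caller,
		  pid_t pid, pid_t source, pid_t target)
{
	const struct toy_ns *source_ns = caller->ns, *target_ns = caller->ns;
	const struct toy_task *t;

	if (source) {                 /* namespace of the @source task */
		t = toy_find(source, caller->ns);
		if (!t)
			return -ESRCH;
		source_ns = t->ns;
	}
	if (target) {                 /* namespace of the @target task */
		t = toy_find(target, caller->ns);
		if (!t)
			return -ESRCH;
		target_ns = t->ns;
	}
	t = toy_find(pid, source_ns); /* @pid is interpreted in source_ns */
	return t ? toy_pid_nr_ns(t, target_ns) : -ESRCH;
}
```

With the host service (`&toy_tasks[1]`) as caller, `toy_getvpid(caller, 7, 500, 0)` maps the container's pid 7 to the host's pid 501, and `toy_getvpid(caller, 1, 0, 500)` gives 0 because the host's init is not visible from inside the container.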
* Re: [PATCH RFC] pidns: introduce syscall getvpid
From: Oleg Nesterov @ 2015-09-15 14:20 UTC
To: Konstantin Khlebnikov
Cc: linux-api, containers, linux-kernel, Eric W. Biederman, Andrew Morton, Linus Torvalds

On 09/15, Konstantin Khlebnikov wrote:
>
> +SYSCALL_DEFINE3(getvpid, pid_t, pid, pid_t, source, pid_t, target)
> +{
> +#ifdef CONFIG_PID_NS
> +	struct pid_namespace *current_ns = task_active_pid_ns(current);
> +	struct pid_namespace *source_ns = current_ns, *target_ns = current_ns;
> +	struct pid *task_pid;
> +	pid_t result = -ESRCH;
> +
> +	rcu_read_lock();
> +	if (source)
> +		source_ns = ns_of_pid(find_pid_ns(source, current_ns));
> +	if (target)
> +		target_ns = ns_of_pid(find_pid_ns(target, current_ns));
> +	if (source_ns && target_ns) {
> +		task_pid = find_pid_ns(pid, source_ns);
> +		if (task_pid)
> +			result = pid_nr_ns(task_pid, target_ns);
> +	}
> +	rcu_read_unlock();
> +
> +	return result;
> +#else
> +	return pid;
> +#endif /* CONFIG_PID_NS */
> +}

Not sure we actually want ifdef(CONFIG_PID_NS). If this is just
optimization I'd suggest to simply add

	if (!IS_ENABLED(CONFIG_PID_NS))
		return pid;

at the start.

But. Either way this unconditional "return pid" doesn't look right imho.
I think we should return -ESRCH if this pid number is not valid to
ensure we have the same semantics with-or-without CONFIG_PID_NS.

So it seems that you should remove this ifdef, this will also ensure
that we return -ESRCH if (say) source != 0 and find_pid_ns(source)
fails.

Oleg.
* Re: [PATCH RFC] pidns: introduce syscall getvpid
From: Eric W. Biederman @ 2015-09-15 14:27 UTC
To: Konstantin Khlebnikov
Cc: linux-api, containers, linux-kernel, Oleg Nesterov, Andrew Morton, Linus Torvalds

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:

> pid_t getvpid(pid_t pid, pid_t source, pid_t target);
>
> This syscall converts pid from one pid-ns into pid in another pid-ns:
> it takes @pid in namespace of @source task (zero for current) and
> returns related pid in namespace of @target task (zero for current too).
> If pid is unreachable from target pid-ns then it returns zero.

This interface as presented is inherently racy.  It would be better
if source and target were file descriptors referring to the namespaces
you wish to translate between.

> Such conversion is required for interaction between processes from
> different pid-namespaces. For example when system service talks with
> client from isolated container via socket about task in container:

Sockets are already supported.  At least the metadata of sockets is.

Maybe we need this but I am not convinced of its utility.

What are you trying to do that motivates this?

Eric
* Re: [PATCH RFC] pidns: introduce syscall getvpid
From: Konstantin Khlebnikov @ 2015-09-15 15:01 UTC
To: Eric W. Biederman
Cc: linux-api, containers, linux-kernel, Andrew Morton, Linus Torvalds, Oleg Nesterov

On 15.09.2015 17:27, Eric W. Biederman wrote:
> Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
>
>> pid_t getvpid(pid_t pid, pid_t source, pid_t target);
>>
>> This syscall converts pid from one pid-ns into pid in another pid-ns:
>> it takes @pid in namespace of @source task (zero for current) and
>> returns related pid in namespace of @target task (zero for current too).
>> If pid is unreachable from target pid-ns then it returns zero.
>
> This interface as presented is inherently racy.  It would be better
> if source and target were file descriptors referring to the namespaces
> you wish to translate between.

Yep, it's racy. As well as any operation with non-child pids.
With file descriptors for source/target result will be racy anyway.

>
>> Such conversion is required for interaction between processes from
>> different pid-namespaces. For example when system service talks with
>> client from isolated container via socket about task in container:
>
> Sockets are already supported.  At least the metadata of sockets is.
>
> Maybe we need this but I am not convinced of its utility.
>
> What are you trying to do that motivates this?

I'm working on hierarchical container management system which
allows to create and control nested sub-containers from containers
( https://github.com/yandex/porto ). Main server works in host and
has to interact with all levels of nested namespaces. This syscall
makes some operations much easier: server must remember only pid in
host pid namespace and convert it into right vpid on demand.

--
Konstantin
* Re: [PATCH RFC] pidns: introduce syscall getvpid
From: Stéphane Graber @ 2015-09-15 15:17 UTC
To: Konstantin Khlebnikov
Cc: linux-api, containers, Oleg Nesterov, linux-kernel, Eric W. Biederman, Andrew Morton, Linus Torvalds

On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote:
> On 15.09.2015 17:27, Eric W. Biederman wrote:
> >Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> >
> >>pid_t getvpid(pid_t pid, pid_t source, pid_t target);
> >>
> >>This syscall converts pid from one pid-ns into pid in another pid-ns:
> >>it takes @pid in namespace of @source task (zero for current) and
> >>returns related pid in namespace of @target task (zero for current too).
> >>If pid is unreachable from target pid-ns then it returns zero.
> >
> >This interface as presented is inherently racy.  It would be better
> >if source and target were file descriptors referring to the namespaces
> >you wish to translate between.
>
> Yep, it's racy. As well as any operation with non-child pids.
> With file descriptors for source/target result will be racy anyway.
>
> >
> >>Such conversion is required for interaction between processes from
> >>different pid-namespaces. For example when system service talks with
> >>client from isolated container via socket about task in container:
> >
> >Sockets are already supported.  At least the metadata of sockets is.
> >
> >Maybe we need this but I am not convinced of its utility.
> >
> >What are you trying to do that motivates this?
>
> I'm working on hierarchical container management system which
> allows to create and control nested sub-containers from containers
> ( https://github.com/yandex/porto ). Main server works in host and
> has to interact with all levels of nested namespaces. This syscall
> makes some operations much easier: server must remember only pid in
> host pid namespace and convert it into right vpid on demand.

Note that as Eric said earlier, sending a PID inside a ucred through a
unix socket will have the pid translated.

So while your solution certainly should be faster, you can already achieve
what you want today by doing:

== Translate PID in container to PID in host
 - open a socket
 - setns to container's pidns
 - send ucred from that container containing the requested container PID
 - host sees the host PID

== Translate PID on host to PID in container
 - open a socket
 - setns to container's pidns
 - send ucred from the host containing the requested host PID
   (send will fail if the host PID isn't part of that container)
 - container sees the container PID

--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
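
The "send ucred, receiver sees the pid" step of the two recipes above can be sketched with an AF_UNIX socketpair. This is a minimal illustration under stated assumptions rather than a full translator: `ucred_roundtrip` is a hypothetical helper, both ends live in the same pid namespace (so the number arrives unchanged), and the setns/fork steps needed to place the sender or receiver in another namespace are left out. An unprivileged sender may only put its own pid in the credentials.

```c
#define _GNU_SOURCE             /* for struct ucred */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `pid` as SCM_CREDENTIALS over an AF_UNIX socketpair and return the
 * pid the receiving end is handed by the kernel, or -1 on error. */
pid_t ucred_roundtrip(pid_t pid)
{
	union {
		char buf[CMSG_SPACE(sizeof(struct ucred))];
		struct cmsghdr align;   /* forces correct alignment of buf */
	} ctl;
	struct ucred cred = { .pid = pid, .uid = getuid(), .gid = getgid() };
	struct iovec iov;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	char byte = 'x';
	pid_t seen = -1;
	int sv[2], on = 1;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;
	/* Ask the kernel to deliver sender credentials on the receiving end. */
	if (setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) < 0)
		goto out;

	/* Sender: one payload byte plus an explicit credentials message. */
	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	iov.iov_base = &byte;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_CREDENTIALS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(cred));
	memcpy(CMSG_DATA(cmsg), &cred, sizeof(cred));
	if (sendmsg(sv[0], &msg, 0) < 0)
		goto out;

	/* Receiver: read the byte and pick the credentials out of the cmsgs. */
	memset(&msg, 0, sizeof(msg));
	memset(&ctl, 0, sizeof(ctl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);
	if (recvmsg(sv[1], &msg, 0) < 0)
		goto out;
	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_CREDENTIALS) {
			memcpy(&cred, CMSG_DATA(cmsg), sizeof(cred));
			seen = cred.pid;
		}
	}
out:
	close(sv[0]);
	close(sv[1]);
	return seen;
}
```

Calling `ucred_roundtrip(getpid())` returns the caller's own pid here; the translation Stéphane describes only becomes visible once the two socket ends are held by processes in different pid namespaces.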
* Re: [PATCH RFC] pidns: introduce syscall getvpid 2015-09-15 15:17 ` Stéphane Graber @ 2015-09-15 15:51 ` Konstantin Khlebnikov 2015-09-15 17:41 ` Serge Hallyn 1 sibling, 0 replies; 15+ messages in thread From: Konstantin Khlebnikov @ 2015-09-15 15:51 UTC (permalink / raw) To: Stéphane Graber Cc: Eric W. Biederman, linux-api-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Oleg Nesterov, Andrew Morton, Linus Torvalds On 15.09.2015 18:17, Stéphane Graber wrote: > On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote: >> On 15.09.2015 17:27, Eric W. Biederman wrote: >>> Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes: >>> >>>> pid_t getvpid(pid_t pid, pid_t source, pid_t target); >>>> >>>> This syscall converts pid from one pid-ns into pid in another pid-ns: >>>> it takes @pid in namespace of @source task (zero for current) and >>>> returns related pid in namespace of @target task (zero for current too). >>>> If pid is unreachable from target pid-ns then it returns zero. >>> >>> This interface as presented is inherently racy. It would be better >>> if source and target were file descriptors referring to the namespaces >>> you wish to translate between. >> >> Yep, it's racy. As well as any operation with non-child pids. >> With file descriptors for source/target result will be racy anyway. >> >>> >>>> Such conversion is required for interaction between processes from >>>> different pid-namespaces. For example when system service talks with >>>> client from isolated container via socket about task in container: >>> >>> Sockets are already supported. At least the metadata of sockets is. >>> >>> Maybe we need this but I am not convinced of it's utility. >>> >>> What are you trying to do that motivates this? 
>> >> I'm working on hierarchical container management system which >> allows to create and control nested sub-containers from containers >> ( https://github.com/yandex/porto ). Main server works in host and >> have to interact with all levels of nested namespaces. This syscall >> makes some operations much easier: server must remember only pid in >> host pid namespace and convert it into right vpid on demand. > > Note that as Eric said earlier, sending a PID inside a ucred through a > unix socket will have the pid translated. We are using this already: clients in container connect to unix-socket binded in host net-ns and bind-mounted into container =) Server identifies them by pid from SO_PEERCRED > > So while your solution certainly should be faster, you can already achieve > what you want today by doing: > > == Translate PID in container to PID in host > - open a socket > - setns to container's pidns > - send ucred from that container containing the requested container PID > - host sees the host PID > That's funny. But setns isn't enough, task have to fork into pid-namespace. > == Translate PID on host to PID in container > - open a socket > - setns to container's pidns > - send ucred from the host containing the request host PID > (send will fail if the host PID isn't part of that container) > - container sees the container PID > >> >>> >>> Eric >>> >>> >>>> getvpid(pid, client_pid, 0) -> pid in our pid namespace >>>> getvpid(pid, 0, client_pid) -> pid in client pid namespace >>>> >>>> Also service can get pid of init task and match it with container: >>>> >>>> getvpid(1, client_pid, 0) -> pid of init task for client_pid >>>> >>>> Seems like gdb and strace could use this too for converting pids of >>>> newly forked tasks (IIRR they get pid from %rax) into pid from >>>> correct namespace for further interaction. 
>>>> >>>> As a bonus syscall getvpid can compare pid namespaces and >>>> test isolation without mounted procfs: >>>> >>>> getvpid(1, 0, pid) == 0 -> pid in our sub-pid-namespace >>>> getvpid(1, 0, pid) == 1 -> pid in our pid-namespace >>>> getvpid(1, pid1, pid2) == 0 -> pid1 isolated from pid2 >>>> getvpid(1, pid1, pid2) == 1 -> tasks are in one pid-namespace >>>> getvpid(1, pid1, pid2) > 1 -> pid1 is in sub-pidns of pid2 >>>> >>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> >>>> --- >>>> arch/x86/entry/syscalls/syscall_32.tbl | 1 + >>>> arch/x86/entry/syscalls/syscall_64.tbl | 1 + >>>> include/linux/syscalls.h | 1 + >>>> kernel/pid.c | 36 ++++++++++++++++++++++++++++++++ >>>> 4 files changed, 39 insertions(+) >>>> >>>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl >>>> index 7663c455b9f6..dadb55d42fc9 100644 >>>> --- a/arch/x86/entry/syscalls/syscall_32.tbl >>>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl >>>> @@ -382,3 +382,4 @@ >>>> 373 i386 shutdown sys_shutdown >>>> 374 i386 userfaultfd sys_userfaultfd >>>> 375 i386 membarrier sys_membarrier >>>> +376 i386 getvpid sys_getvpid >>>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl >>>> index 278842fdf1f6..0338f2eb3b7c 100644 >>>> --- a/arch/x86/entry/syscalls/syscall_64.tbl >>>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl >>>> @@ -331,6 +331,7 @@ >>>> 322 64 execveat stub_execveat >>>> 323 common userfaultfd sys_userfaultfd >>>> 324 common membarrier sys_membarrier >>>> +325 common getvpid sys_getvpid >>>> >>>> # >>>> # x32-specific system call numbers start at 512 to avoid cache impact >>>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h >>>> index a460e2ef2843..3405c30999e3 100644 >>>> --- a/include/linux/syscalls.h >>>> +++ b/include/linux/syscalls.h >>>> @@ -222,6 +222,7 @@ asmlinkage long sys_nanosleep(struct timespec __user *rqtp, struct timespec __us 
>>>> asmlinkage long sys_alarm(unsigned int seconds); >>>> asmlinkage long sys_getpid(void); >>>> asmlinkage long sys_getppid(void); >>>> +asmlinkage long sys_getvpid(pid_t pid, pid_t source, pid_t target); >>>> asmlinkage long sys_getuid(void); >>>> asmlinkage long sys_geteuid(void); >>>> asmlinkage long sys_getgid(void); >>>> diff --git a/kernel/pid.c b/kernel/pid.c >>>> index ca368793808e..caa676ff7364 100644 >>>> --- a/kernel/pid.c >>>> +++ b/kernel/pid.c >>>> @@ -567,6 +567,42 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns) >>>> return pid; >>>> } >>>> >>>> +/** >>>> + * sys_getvpid - convert pid from one pid-namespace into pid from another >>>> + * >>>> + * @pid - pid of requested task >>>> + * @source - pid of task in source pid-namespace, zero for current >>>> + * @target - pid of task in target pid-namespace, zero for current >>>> + * >>>> + * Returns pid from target pid-ns or zero if pid is unreachable. >>>> + * Returns -ESRCH if some of pids are not found. >>>> + */ >>>> +SYSCALL_DEFINE3(getvpid, pid_t, pid, pid_t, source, pid_t, target) >>>> +{ >>>> +#ifdef CONFIG_PID_NS >>>> + struct pid_namespace *current_ns = task_active_pid_ns(current); >>>> + struct pid_namespace *source_ns = current_ns, *target_ns = current_ns; >>>> + struct pid *task_pid; >>>> + pid_t result = -ESRCH; >>>> + >>>> + rcu_read_lock(); >>>> + if (source) >>>> + source_ns = ns_of_pid(find_pid_ns(source, current_ns)); >>>> + if (target) >>>> + target_ns = ns_of_pid(find_pid_ns(target, current_ns)); >>>> + if (source_ns && target_ns) { >>>> + task_pid = find_pid_ns(pid, source_ns); >>>> + if (task_pid) >>>> + result = pid_nr_ns(task_pid, target_ns); >>>> + } >>>> + rcu_read_unlock(); >>>> + >>>> + return result; >>>> +#else >>>> + return pid; >>>> +#endif /* CONFIG_PID_NS */ >>>> +} >>>> + >>>> /* >>>> * The pid hash table is scaled according to the amount of memory in the >>>> * machine. 
From a minimum of 16 slots up to 4096 slots at one gigabyte or >> >> >> -- >> Konstantin > -- Konstantin ^ permalink raw reply [flat|nested] 15+ messages in thread
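The credential passing discussed above (a pid carried in a ucred over a unix socket) can be sketched roughly as follows. This is only an illustration in Python, not code from porto or from the patch; the helper names send_creds, recv_creds and demo are made up for the example. Both ends run in the same pid namespace here, so the pid arrives unchanged; the kernel rewrites it only when the receiver lives in a different pid namespace. As noted in the message above, setns() into a pid namespace only affects children created afterwards, so a real translator has to fork before sending.

```python
import os
import socket
import struct

# Layout of struct ucred on Linux: pid_t pid, uid_t uid, gid_t gid.
UCRED = struct.Struct("iII")


def send_creds(sock, pid, uid, gid, payload=b"x"):
    """Send one datagram with an explicit SCM_CREDENTIALS control message."""
    anc = [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, UCRED.pack(pid, uid, gid))]
    sock.sendmsg([payload], anc)


def recv_creds(sock):
    """Receive one datagram and return the (pid, uid, gid) attached to it."""
    _data, ancdata, _flags, _addr = sock.recvmsg(16, socket.CMSG_SPACE(UCRED.size))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            return UCRED.unpack(cdata[:UCRED.size])
    raise RuntimeError("no credentials attached to the message")


def demo():
    """Pass our own credentials across a socketpair and return what arrives."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with a, b:
        # The receiver must ask for credentials to be delivered.
        b.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
        send_creds(a, os.getpid(), os.getuid(), os.getgid())
        return recv_creds(b)
```

An unprivileged sender may only claim its own pid; naming some other task's pid in the ucred needs CAP_SYS_ADMIN, which is one reason the socket route is heavier than a single syscall.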
* Re: [PATCH RFC] pidns: introduce syscall getvpid 2015-09-15 15:17 ` Stéphane Graber 2015-09-15 15:51 ` Konstantin Khlebnikov @ 2015-09-15 17:41 ` Serge Hallyn 2015-09-16 7:37 ` Konstantin Khlebnikov 1 sibling, 1 reply; 15+ messages in thread From: Serge Hallyn @ 2015-09-15 17:41 UTC (permalink / raw) To: Stéphane Graber Cc: Konstantin Khlebnikov, linux-api, containers, Oleg Nesterov, linux-kernel, Eric W. Biederman, Andrew Morton, Linus Torvalds Quoting Stéphane Graber (stgraber@ubuntu.com): > On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote: > > On 15.09.2015 17:27, Eric W. Biederman wrote: > > >Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > > > > > >>pid_t getvpid(pid_t pid, pid_t source, pid_t target); > > >> > > >>This syscall converts pid from one pid-ns into pid in another pid-ns: > > >>it takes @pid in namespace of @source task (zero for current) and > > >>returns related pid in namespace of @target task (zero for current too). > > >>If pid is unreachable from target pid-ns then it returns zero. > > > > > >This interface as presented is inherently racy. It would be better > > >if source and target were file descriptors referring to the namespaces > > >you wish to translate between. > > > > Yep, it's racy. As well as any operation with non-child pids. > > With file descriptors for source/target result will be racy anyway. > > > > > > > >>Such conversion is required for interaction between processes from > > >>different pid-namespaces. For example when system service talks with > > >>client from isolated container via socket about task in container: > > > > > >Sockets are already supported. At least the metadata of sockets is. > > > > > >Maybe we need this but I am not convinced of it's utility. > > > > > >What are you trying to do that motivates this? 
> > > > I'm working on hierarchical container management system which > > allows to create and control nested sub-containers from containers > > ( https://github.com/yandex/porto ). Main server works in host and > > have to interact with all levels of nested namespaces. This syscall > > makes some operations much easier: server must remember only pid in > > host pid namespace and convert it into right vpid on demand. > > Note that as Eric said earlier, sending a PID inside a ucred through a > unix socket will have the pid translated. > > So while your solution certainly should be faster, you can already achieve > what you want today by doing: > > == Translate PID in container to PID in host > - open a socket > - setns to container's pidns > - send ucred from that container containing the requested container PID > - host sees the host PID > > == Translate PID on host to PID in container > - open a socket > - setns to container's pidns > - send ucred from the host containing the request host PID > (send will fail if the host PID isn't part of that container) > - container sees the container PID In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns we now also have 'NSpid' etc in /proc/$$/status. -serge ^ permalink raw reply [flat|nested] 15+ messages in thread
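For illustration, the 'NSpid' line mentioned above can be read with a few lines of Python. This is a minimal sketch; parse_nspid and read_nspid are names invented for the example. It assumes a kernel that exposes the field (Linux 4.1 and later). The leftmost value is the pid as seen from the pid namespace of the procfs mount, and each following value is the pid in the next nested namespace.

```python
def parse_nspid(status_text):
    """Return the NSpid values from the text of a /proc/<pid>/status file.

    Returns None when the field is absent (for example on older kernels).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return None


def read_nspid(pid, proc_root="/proc"):
    """Read and parse NSpid for one task; proc_root is overridable for testing."""
    with open(f"{proc_root}/{pid}/status") as f:
        return parse_nspid(f.read())
```

For a task with host pid 4242 that is pid 17 in a container and pid 1 in a nested container, the host's procfs would show "NSpid: 4242 17 1", so the last element is the pid in the task's own namespace.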
* Re: [PATCH RFC] pidns: introduce syscall getvpid 2015-09-15 17:41 ` Serge Hallyn @ 2015-09-16 7:37 ` Konstantin Khlebnikov [not found] ` <55F91C3D.1040209-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Konstantin Khlebnikov @ 2015-09-16 7:37 UTC (permalink / raw) To: Serge Hallyn, Stéphane Graber Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman, Andrew Morton, Linus Torvalds On 15.09.2015 20:41, Serge Hallyn wrote: > Quoting Stéphane Graber (stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org): >> On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote: >>> On 15.09.2015 17:27, Eric W. Biederman wrote: >>>> Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes: >>>> >>>>> pid_t getvpid(pid_t pid, pid_t source, pid_t target); >>>>> >>>>> This syscall converts pid from one pid-ns into pid in another pid-ns: >>>>> it takes @pid in namespace of @source task (zero for current) and >>>>> returns related pid in namespace of @target task (zero for current too). >>>>> If pid is unreachable from target pid-ns then it returns zero. >>>> >>>> This interface as presented is inherently racy. It would be better >>>> if source and target were file descriptors referring to the namespaces >>>> you wish to translate between. >>> >>> Yep, it's racy. As well as any operation with non-child pids. >>> With file descriptors for source/target result will be racy anyway. >>> >>>> >>>>> Such conversion is required for interaction between processes from >>>>> different pid-namespaces. For example when system service talks with >>>>> client from isolated container via socket about task in container: >>>> >>>> Sockets are already supported. At least the metadata of sockets is. >>>> >>>> Maybe we need this but I am not convinced of it's utility. 
>>>> >>>> What are you trying to do that motivates this? >>> >>> I'm working on hierarchical container management system which >>> allows to create and control nested sub-containers from containers >>> ( https://github.com/yandex/porto ). Main server works in host and >>> have to interact with all levels of nested namespaces. This syscall >>> makes some operations much easier: server must remember only pid in >>> host pid namespace and convert it into right vpid on demand. >> >> Note that as Eric said earlier, sending a PID inside a ucred through a >> unix socket will have the pid translated. >> >> So while your solution certainly should be faster, you can already achieve >> what you want today by doing: >> >> == Translate PID in container to PID in host >> - open a socket >> - setns to container's pidns >> - send ucred from that container containing the requested container PID >> - host sees the host PID >> >> == Translate PID on host to PID in container >> - open a socket >> - setns to container's pidns >> - send ucred from the host containing the request host PID >> (send will fail if the host PID isn't part of that container) >> - container sees the container PID > > In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns > we now also have 'NSpid' etc in /proc/$$/status. > As I see it, this works well only for converting a host pid into a virtual one. Backward conversion is troublesome: we have to scan all pids in the host procfs and somehow filter the tasks that belong to the container and its sub-pid-ns. Or am I missing something trivial? -- Konstantin ^ permalink raw reply [flat|nested] 15+ messages in thread
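To make the cost of the backward conversion concrete, here is a rough Python sketch of the procfs scan described above. It is an assumption-laden illustration, not code from any of the projects in this thread: host_pid_for and its helper are invented names, ref_host_pid is the host pid of any task known to be in the container (its init, for example), and only tasks that live directly in the container's pid namespace are matched, not tasks in its sub-pid-namespaces. Reading /proc/<pid>/ns/pid needs sufficient privilege, and the scan is racy, because pids can be recycled while it runs.

```python
import os


def _nspid(proc_root, pid):
    """Return the NSpid list for a task, or None if the field is missing."""
    with open(f"{proc_root}/{pid}/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                return [int(field) for field in line.split()[1:]]
    return None


def host_pid_for(vpid, ref_host_pid, proc_root="/proc"):
    """Find the host pid of the task whose pid is `vpid` inside the pid
    namespace that `ref_host_pid` belongs to. Returns None if not found."""
    target_ns = os.readlink(f"{proc_root}/{ref_host_pid}/ns/pid")
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            # Skip tasks that live in a different pid namespace.
            if os.readlink(f"{proc_root}/{entry}/ns/pid") != target_ns:
                continue
            nspid = _nspid(proc_root, entry)
        except OSError:
            continue  # the task exited or we lack permission
        # The last NSpid entry is the pid in the task's own namespace.
        if nspid and nspid[-1] == vpid:
            return int(entry)
    return None
```

Every lookup walks the whole of /proc, so the work grows with the number of tasks on the host rather than with the size of the container.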
* Re: [PATCH RFC] pidns: introduce syscall getvpid [not found] ` <55F91C3D.1040209-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> @ 2015-09-16 14:39 ` Serge E. Hallyn [not found] ` <20150916143939.GA32226-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Serge E. Hallyn @ 2015-09-16 14:39 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Serge Hallyn, Stéphane Graber, linux-api-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman, Andrew Morton, Linus Torvalds On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote: > On 15.09.2015 20:41, Serge Hallyn wrote: > >Quoting Stéphane Graber (stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org): > >>On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote: > >>>On 15.09.2015 17:27, Eric W. Biederman wrote: > >>>>Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes: > >>>> > >>>>>pid_t getvpid(pid_t pid, pid_t source, pid_t target); > >>>>> > >>>>>This syscall converts pid from one pid-ns into pid in another pid-ns: > >>>>>it takes @pid in namespace of @source task (zero for current) and > >>>>>returns related pid in namespace of @target task (zero for current too). > >>>>>If pid is unreachable from target pid-ns then it returns zero. > >>>> > >>>>This interface as presented is inherently racy. It would be better > >>>>if source and target were file descriptors referring to the namespaces > >>>>you wish to translate between. > >>> > >>>Yep, it's racy. As well as any operation with non-child pids. > >>>With file descriptors for source/target result will be racy anyway. > >>> > >>>> > >>>>>Such conversion is required for interaction between processes from > >>>>>different pid-namespaces. 
For example when system service talks with > >>>>>client from isolated container via socket about task in container: > >>>> > >>>>Sockets are already supported. At least the metadata of sockets is. > >>>> > >>>>Maybe we need this but I am not convinced of it's utility. > >>>> > >>>>What are you trying to do that motivates this? > >>> > >>>I'm working on hierarchical container management system which > >>>allows to create and control nested sub-containers from containers > >>>( https://github.com/yandex/porto ). Main server works in host and > >>>have to interact with all levels of nested namespaces. This syscall > >>>makes some operations much easier: server must remember only pid in > >>>host pid namespace and convert it into right vpid on demand. > >> > >>Note that as Eric said earlier, sending a PID inside a ucred through a > >>unix socket will have the pid translated. > >> > >>So while your solution certainly should be faster, you can already achieve > >>what you want today by doing: > >> > >>== Translate PID in container to PID in host > >> - open a socket > >> - setns to container's pidns > >> - send ucred from that container containing the requested container PID > >> - host sees the host PID > >> > >>== Translate PID on host to PID in container > >> - open a socket > >> - setns to container's pidns > >> - send ucred from the host containing the request host PID > >> (send will fail if the host PID isn't part of that container) > >> - container sees the container PID > > > >In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns > >we now also have 'NSpid' etc in /proc/$$/status. > > > > As I see this works perfectly only for converting host pid into virtual. > > Backward conversion is troublesome: we have to scan all pids in host > procfs and somehow filter tasks from container and its sub-pid-ns. > Or I am missing something trivial? Ah, no that doesn't help with this. 
What Stéphane describes is what I've done in several projects. Getting it right is however actually quite tricky. I'm not convinced it's at the level of "since you can do (sweep hands) all this, we don't need a simple syscall to do it." So I'd encourage you to resend using namespace inode fds for source and target as Eric suggested. We still may decide that the syscall isn't needed, but it's a trivial change to your patch and removes that race. And I'm not convinced it's not needed. -serge ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH RFC] pidns: introduce syscall getvpid [not found] ` <20150916143939.GA32226-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2015-09-16 14:49 ` Eric W. Biederman [not found] ` <87twquzag1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Eric W. Biederman @ 2015-09-16 14:49 UTC (permalink / raw) To: Serge E. Hallyn Cc: Konstantin Khlebnikov, Serge Hallyn, Stéphane Graber, linux-api-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote: >> On 15.09.2015 20:41, Serge Hallyn wrote: >> >Quoting Stéphane Graber (stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org): >> >>On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote: >> >>>On 15.09.2015 17:27, Eric W. Biederman wrote: >> >>>>Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes: >> >>>> >> >>>>>pid_t getvpid(pid_t pid, pid_t source, pid_t target); >> >>>>> >> >>>>>This syscall converts pid from one pid-ns into pid in another pid-ns: >> >>>>>it takes @pid in namespace of @source task (zero for current) and >> >>>>>returns related pid in namespace of @target task (zero for current too). >> >>>>>If pid is unreachable from target pid-ns then it returns zero. >> >>>> >> >>>>This interface as presented is inherently racy. It would be better >> >>>>if source and target were file descriptors referring to the namespaces >> >>>>you wish to translate between. >> >>> >> >>>Yep, it's racy. As well as any operation with non-child pids. >> >>>With file descriptors for source/target result will be racy anyway. >> >>> >> >>>> >> >>>>>Such conversion is required for interaction between processes from >> >>>>>different pid-namespaces. 
For example when system service talks with >> >>>>>client from isolated container via socket about task in container: >> >>>> >> >>>>Sockets are already supported. At least the metadata of sockets is. >> >>>> >> >>>>Maybe we need this but I am not convinced of it's utility. >> >>>> >> >>>>What are you trying to do that motivates this? >> >>> >> >>>I'm working on hierarchical container management system which >> >>>allows to create and control nested sub-containers from containers >> >>>( https://github.com/yandex/porto ). Main server works in host and >> >>>have to interact with all levels of nested namespaces. This syscall >> >>>makes some operations much easier: server must remember only pid in >> >>>host pid namespace and convert it into right vpid on demand. >> >> >> >>Note that as Eric said earlier, sending a PID inside a ucred through a >> >>unix socket will have the pid translated. >> >> >> >>So while your solution certainly should be faster, you can already achieve >> >>what you want today by doing: >> >> >> >>== Translate PID in container to PID in host >> >> - open a socket >> >> - setns to container's pidns >> >> - send ucred from that container containing the requested container PID >> >> - host sees the host PID >> >> >> >>== Translate PID on host to PID in container >> >> - open a socket >> >> - setns to container's pidns >> >> - send ucred from the host containing the request host PID >> >> (send will fail if the host PID isn't part of that container) >> >> - container sees the container PID >> > >> >In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns >> >we now also have 'NSpid' etc in /proc/$$/status. >> > >> >> As I see this works perfectly only for converting host pid into virtual. >> >> Backward conversion is troublesome: we have to scan all pids in host >> procfs and somehow filter tasks from container and its sub-pid-ns. >> Or I am missing something trivial? > > Ah, no that doesn't help with this. 
> > What Stéphane describes is what I've done in several projects. > Getting it right is however actually quite tricky. I'm not > convinced it's at the level of "since you can do (sweep hands) > all this, we don't need a simple syscall to do it." > > So I'd encourage you to resend using namespace inode fds for > source and target as Eric suggested. We still may decide that > the syscall isn't needed, but it's a trivial change to your > patch and removes that race. And I'm not convinced it's not > needed. At this point my primary concern is that a pattern that needs to convert to and from pids quickly is potentially fundamentally racy to the point of being broken, especially since unix domain sockets already pass and convert pids in a way that covers the common case. I am clearly missing some nuance of this use case. Eric ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH RFC] pidns: introduce syscall getvpid [not found] ` <87twquzag1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org> @ 2015-09-16 16:31 ` Serge E. Hallyn [not found] ` <20150916163123.GA1039-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Serge E. Hallyn @ 2015-09-16 16:31 UTC (permalink / raw) To: Eric W. Biederman Cc: Konstantin Khlebnikov, linux-api-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds On Wed, Sep 16, 2015 at 09:49:02AM -0500, Eric W. Biederman wrote: > "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > > > On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote: > >> On 15.09.2015 20:41, Serge Hallyn wrote: > >> >Quoting Stéphane Graber (stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org): > >> >>On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote: > >> >>>On 15.09.2015 17:27, Eric W. Biederman wrote: > >> >>>>Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes: > >> >>>> > >> >>>>>pid_t getvpid(pid_t pid, pid_t source, pid_t target); > >> >>>>> > >> >>>>>This syscall converts pid from one pid-ns into pid in another pid-ns: > >> >>>>>it takes @pid in namespace of @source task (zero for current) and > >> >>>>>returns related pid in namespace of @target task (zero for current too). > >> >>>>>If pid is unreachable from target pid-ns then it returns zero. > >> >>>> > >> >>>>This interface as presented is inherently racy. It would be better > >> >>>>if source and target were file descriptors referring to the namespaces > >> >>>>you wish to translate between. > >> >>> > >> >>>Yep, it's racy. As well as any operation with non-child pids. > >> >>>With file descriptors for source/target result will be racy anyway. 
> >> >>> > >> >>>> > >> >>>>>Such conversion is required for interaction between processes from > >> >>>>>different pid-namespaces. For example when system service talks with > >> >>>>>client from isolated container via socket about task in container: > >> >>>> > >> >>>>Sockets are already supported. At least the metadata of sockets is. > >> >>>> > >> >>>>Maybe we need this but I am not convinced of it's utility. > >> >>>> > >> >>>>What are you trying to do that motivates this? > >> >>> > >> >>>I'm working on hierarchical container management system which > >> >>>allows to create and control nested sub-containers from containers > >> >>>( https://github.com/yandex/porto ). Main server works in host and > >> >>>have to interact with all levels of nested namespaces. This syscall > >> >>>makes some operations much easier: server must remember only pid in > >> >>>host pid namespace and convert it into right vpid on demand. > >> >> > >> >>Note that as Eric said earlier, sending a PID inside a ucred through a > >> >>unix socket will have the pid translated. > >> >> > >> >>So while your solution certainly should be faster, you can already achieve > >> >>what you want today by doing: > >> >> > >> >>== Translate PID in container to PID in host > >> >> - open a socket > >> >> - setns to container's pidns > >> >> - send ucred from that container containing the requested container PID > >> >> - host sees the host PID > >> >> > >> >>== Translate PID on host to PID in container > >> >> - open a socket > >> >> - setns to container's pidns > >> >> - send ucred from the host containing the request host PID > >> >> (send will fail if the host PID isn't part of that container) > >> >> - container sees the container PID > >> > > >> >In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns > >> >we now also have 'NSpid' etc in /proc/$$/status. > >> > > >> > >> As I see this works perfectly only for converting host pid into virtual. 
> >> > >> Backward conversion is troublesome: we have to scan all pids in host > >> procfs and somehow filter tasks from container and its sub-pid-ns. > >> Or I am missing something trivial? > > > > Ah, no that doesn't help with this. > > > > What Stéphane describes is what I've done in several projects. > > Getting it right is however actually quite tricky. I'm not > > convinced it's at the level of "since you can do (sweep hands) > > all this, we don't need a simple syscall to do it." > > > > So I'd encourage you to resend using namespace inode fds for > > source and target as Eric suggested. We still may decide that > > the syscall isn't needed, but it's a trivial change to your > > patch and removes that race. And I'm not convinced it's not > > needed. > > At this point my primary concern is that a pattern that would need to be > convering to and from pids quickly is potentially fundamentally racy to > the point of broken. The cgmanager GetTasks and GetTasksRecursive, and reading of the lxcfs cgroup /tasks files, require converting every pid from the cgmanager's namespace to the reading task's namespace. > Especially with unix domain sockets passing and converting pids in a way > that covers the common case. > > I am clearly missing some nuance of this use case. lxcfs and cgmanager are imo proof that we *can* do without the new syscall. However, the git history will show that there are some complications, and the system load when a few systemds are starting will show that it does take a performance toll on the host at some point. Still as I say it's doable. The syscall implementation was very simple, though. -serge ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Re: [PATCH RFC] pidns: introduce syscall getvpid [not found] ` <20150916163123.GA1039-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2015-09-21 2:49 ` Chen Fan [not found] ` <55FF7043.5020701-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> 0 siblings, 1 reply; 15+ messages in thread From: Chen Fan @ 2015-09-21 2:49 UTC (permalink / raw) To: Serge E. Hallyn, Eric W. Biederman Cc: Konstantin Khlebnikov, Serge Hallyn, Stéphane Graber, linux-api-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds On 09/17/2015 12:31 AM, Serge E. Hallyn wrote: > On Wed, Sep 16, 2015 at 09:49:02AM -0500, Eric W. Biederman wrote: >> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: >> >>> On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote: >>>> On 15.09.2015 20:41, Serge Hallyn wrote: >>>>> Quoting Stéphane Graber (stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org): >>>>>> On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote: >>>>>>> On 15.09.2015 17:27, Eric W. Biederman wrote: >>>>>>>> Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes: >>>>>>>> >>>>>>>>> pid_t getvpid(pid_t pid, pid_t source, pid_t target); >>>>>>>>> >>>>>>>>> This syscall converts pid from one pid-ns into pid in another pid-ns: >>>>>>>>> it takes @pid in namespace of @source task (zero for current) and >>>>>>>>> returns related pid in namespace of @target task (zero for current too). >>>>>>>>> If pid is unreachable from target pid-ns then it returns zero. >>>>>>>> This interface as presented is inherently racy. It would be better >>>>>>>> if source and target were file descriptors referring to the namespaces >>>>>>>> you wish to translate between. >>>>>>> Yep, it's racy. As well as any operation with non-child pids. >>>>>>> With file descriptors for source/target result will be racy anyway. 
>>>>>>> >>>>>>>>> Such conversion is required for interaction between processes from >>>>>>>>> different pid-namespaces. For example when system service talks with >>>>>>>>> client from isolated container via socket about task in container: >>>>>>>> Sockets are already supported. At least the metadata of sockets is. >>>>>>>> >>>>>>>> Maybe we need this but I am not convinced of it's utility. >>>>>>>> >>>>>>>> What are you trying to do that motivates this? >>>>>>> I'm working on hierarchical container management system which >>>>>>> allows to create and control nested sub-containers from containers >>>>>>> ( https://github.com/yandex/porto ). Main server works in host and >>>>>>> have to interact with all levels of nested namespaces. This syscall >>>>>>> makes some operations much easier: server must remember only pid in >>>>>>> host pid namespace and convert it into right vpid on demand. >>>>>> Note that as Eric said earlier, sending a PID inside a ucred through a >>>>>> unix socket will have the pid translated. >>>>>> >>>>>> So while your solution certainly should be faster, you can already achieve >>>>>> what you want today by doing: >>>>>> >>>>>> == Translate PID in container to PID in host >>>>>> - open a socket >>>>>> - setns to container's pidns >>>>>> - send ucred from that container containing the requested container PID >>>>>> - host sees the host PID >>>>>> >>>>>> == Translate PID on host to PID in container >>>>>> - open a socket >>>>>> - setns to container's pidns >>>>>> - send ucred from the host containing the request host PID >>>>>> (send will fail if the host PID isn't part of that container) >>>>>> - container sees the container PID >>>>> In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns >>>>> we now also have 'NSpid' etc in /proc/$$/status. >>>>> >>>> As I see this works perfectly only for converting host pid into virtual. 
>>>> >>>> Backward conversion is troublesome: we have to scan all pids in host >>>> procfs and somehow filter tasks from container and its sub-pid-ns. >>>> Or I am missing something trivial? >>> Ah, no that doesn't help with this. >>> >>> What Stéphane describes is what I've done in several projects. >>> Getting it right is however actually quite tricky. I'm not >>> convinced it's at the level of "since you can do (sweep hands) >>> all this, we don't need a simple syscall to do it." >>> >>> So I'd encourage you to resend using namespace inode fds for >>> source and target as Eric suggested. We still may decide that >>> the syscall isn't needed, but it's a trivial change to your >>> patch and removes that race. And I'm not convinced it's not >>> needed. >> At this point my primary concern is that a pattern that would need to be >> convering to and from pids quickly is potentially fundamentally racy to >> the point of broken. > The cgmanager GetTasks and GetTasksRecursive, and reading of the > lxcfs cgroup /tasks files, require converting every pid from the > cgmanager's namespace to the reading task's namespace. > >> Especially with unix domain sockets passing and converting pids in a way >> that covers the common case. >> >> I am clearly missing some nuance of this use case. > lxcfs and cgmanager are imo proof that we *can* do without the new > syscall. However, the git history will show that there are some > complications, and the system load when a few systemds are starting > will show that it does take a performance toll on the host at some > point. Still as I say it's doable. The syscall implementation was > very simple, though. 
Yes, a previous email thread discussed whether to implement this as a syscall or through procfs: http://www.gossamer-threads.com/lists/linux/kernel/1971723?search_string=chen%20hanxiao;#1971723 The procfs implementation seemed complicated, though; the original discussion is at: http://www.gossamer-threads.com/lists/linux/kernel/2076440?search_string=chen%20hanxiao;#2076440 Thanks, Chen ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Re: [PATCH RFC] pidns: introduce syscall getvpid
       [not found] ` <55FF7043.5020701-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
@ 2015-09-21 14:22 ` Serge E. Hallyn
       [not found]   ` <20150921142222.GA24005-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Serge E. Hallyn @ 2015-09-21 14:22 UTC (permalink / raw)
  To: Chen Fan
  Cc: Konstantin Khlebnikov, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Serge Hallyn, Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Eric W. Biederman, Andrew Morton, Linus Torvalds

On Mon, Sep 21, 2015 at 10:49:39AM +0800, Chen Fan wrote:
>
> On 09/17/2015 12:31 AM, Serge E. Hallyn wrote:
> >On Wed, Sep 16, 2015 at 09:49:02AM -0500, Eric W. Biederman wrote:
> >>"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> >>
> >>>On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote:
> >>>>On 15.09.2015 20:41, Serge Hallyn wrote:
> >>>>>Quoting Stéphane Graber (stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org):
> >>>>>>On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote:
> >>>>>>>On 15.09.2015 17:27, Eric W. Biederman wrote:
> >>>>>>>>Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes:
> >>>>>>>>
> >>>>>>>>>pid_t getvpid(pid_t pid, pid_t source, pid_t target);
> >>>>>>>>>
> >>>>>>>>>This syscall converts pid from one pid-ns into pid in another pid-ns:
> >>>>>>>>>it takes @pid in namespace of @source task (zero for current) and
> >>>>>>>>>returns related pid in namespace of @target task (zero for current too).
> >>>>>>>>>If pid is unreachable from target pid-ns then it returns zero.
> >>>>>>>>This interface as presented is inherently racy. It would be better
> >>>>>>>>if source and target were file descriptors referring to the namespaces
> >>>>>>>>you wish to translate between.
> >>>>>>>Yep, it's racy. As well as any operation with non-child pids.
> >>>>>>>With file descriptors for source/target result will be racy anyway.
> >>>>>>>
> >>>>>>>>>Such conversion is required for interaction between processes from
> >>>>>>>>>different pid-namespaces. For example when system service talks with
> >>>>>>>>>client from isolated container via socket about task in container:
> >>>>>>>>Sockets are already supported. At least the metadata of sockets is.
> >>>>>>>>
> >>>>>>>>Maybe we need this but I am not convinced of it's utility.
> >>>>>>>>
> >>>>>>>>What are you trying to do that motivates this?
> >>>>>>>I'm working on hierarchical container management system which
> >>>>>>>allows to create and control nested sub-containers from containers
> >>>>>>>( https://github.com/yandex/porto ). Main server works in host and
> >>>>>>>have to interact with all levels of nested namespaces. This syscall
> >>>>>>>makes some operations much easier: server must remember only pid in
> >>>>>>>host pid namespace and convert it into right vpid on demand.
> >>>>>>Note that as Eric said earlier, sending a PID inside a ucred through a
> >>>>>>unix socket will have the pid translated.
> >>>>>>
> >>>>>>So while your solution certainly should be faster, you can already achieve
> >>>>>>what you want today by doing:
> >>>>>>
> >>>>>>== Translate PID in container to PID in host
> >>>>>> - open a socket
> >>>>>> - setns to container's pidns
> >>>>>> - send ucred from that container containing the requested container PID
> >>>>>> - host sees the host PID
> >>>>>>
> >>>>>>== Translate PID on host to PID in container
> >>>>>> - open a socket
> >>>>>> - setns to container's pidns
> >>>>>> - send ucred from the host containing the request host PID
> >>>>>>   (send will fail if the host PID isn't part of that container)
> >>>>>> - container sees the container PID
> >>>>>In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns
> >>>>>we now also have 'NSpid' etc in /proc/$$/status.
> >>>>>
> >>>>As I see this works perfectly only for converting host pid into virtual.
> >>>>
> >>>>Backward conversion is troublesome: we have to scan all pids in host
> >>>>procfs and somehow filter tasks from container and its sub-pid-ns.
> >>>>Or I am missing something trivial?
> >>>Ah, no that doesn't help with this.
> >>>
> >>>What Stéphane describes is what I've done in several projects.
> >>>Getting it right is however actually quite tricky. I'm not
> >>>convinced it's at the level of "since you can do (sweep hands)
> >>>all this, we don't need a simple syscall to do it."
> >>>
> >>>So I'd encourage you to resend using namespace inode fds for
> >>>source and target as Eric suggested. We still may decide that
> >>>the syscall isn't needed, but it's a trivial change to your
> >>>patch and removes that race. And I'm not convinced it's not
> >>>needed.
> >>At this point my primary concern is that a pattern that would need to be
> >>convering to and from pids quickly is potentially fundamentally racy to
> >>the point of broken.
> >The cgmanager GetTasks and GetTasksRecursive, and reading of the
> >lxcfs cgroup /tasks files, require converting every pid from the
> >cgmanager's namespace to the reading task's namespace.
> >
> >>Especially with unix domain sockets passing and converting pids in a way
> >>that covers the common case.
> >>
> >>I am clearly missing some nuance of this use case.
> >lxcfs and cgmanager are imo proof that we *can* do without the new
> >syscall. However, the git history will show that there are some
> >complications, and the system load when a few systemds are starting
> >will show that it does take a performance toll on the host at some
> >point. Still as I say it's doable. The syscall implementation was
> >very simple, though.
>
> Yes, previous email discussed about the implementation of syscall or procfs:
> http://www.gossamer-threads.com/lists/linux/kernel/1971723?search_string=chen%20hanxiao;#1971723
>
> but it seems complicated implemented by procfs, the original discussion at:
> http://www.gossamer-threads.com/lists/linux/kernel/2076440?search_string=chen%20hanxiao;#2076440

So please implement it, as Eric suggested, using the ns inode fds
instead of racy pid_t hints for namespaces.

^ permalink raw reply	[flat|nested] 15+ messages in thread
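[Editorial illustration, not part of the original message.] The ucred-based
translation quoted above (a pid sent as socket credentials is rewritten by
the kernel for the receiver's pid namespace) can be sketched in user space
roughly as follows. This is a minimal sketch that assumes Linux and a unix
datagram socket pair; the helper names make_cred_socketpair,
send_pid_as_ucred and recv_translated_pid are made up for this example and
do not come from lxcfs, cgmanager or porto.

```python
import os
import socket
import struct

# Layout of struct ucred: pid_t pid; uid_t uid; gid_t gid
UCRED = struct.Struct("iII")


def make_cred_socketpair():
    """Return (sender, receiver); the receiver asks the kernel for credentials."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    receiver.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
    return sender, receiver


def send_pid_as_ucred(sock, pid):
    """Send `pid`, as seen by the sender, in an SCM_CREDENTIALS message.

    Sending a pid other than the caller's own requires CAP_SYS_ADMIN.
    """
    creds = UCRED.pack(pid, os.getuid(), os.getgid())
    sock.sendmsg([b"p"], [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, creds)])


def recv_translated_pid(sock):
    """Receive one message and return the pid as the kernel reports it to
    the receiver, or None if no credentials were attached."""
    _data, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_SPACE(UCRED.size))
    for level, ctype, payload in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            pid, _uid, _gid = UCRED.unpack(payload[:UCRED.size])
            return pid
    return None
```

When both ends live in the same pid namespace the received pid is simply the
sent one. In the recipe quoted above, one end would be a helper process
running inside the container's pid namespace; since setns(2) on a pid
namespace only affects children created afterwards, that helper has to be
forked after the setns call.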
[parent not found: <20150921142222.GA24005-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]
* Re: [PATCH RFC] pidns: introduce syscall getvpid
       [not found] ` <20150921142222.GA24005-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2015-09-22  7:42 ` Konstantin Khlebnikov
       [not found]   ` <56010680.7000301-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Konstantin Khlebnikov @ 2015-09-22 7:42 UTC (permalink / raw)
  To: Serge E. Hallyn, Chen Fan
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Serge Hallyn, Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Eric W. Biederman, Andrew Morton, Linus Torvalds

On 21.09.2015 17:22, Serge E. Hallyn wrote:
> On Mon, Sep 21, 2015 at 10:49:39AM +0800, Chen Fan wrote:
>>
>> On 09/17/2015 12:31 AM, Serge E. Hallyn wrote:
>>> On Wed, Sep 16, 2015 at 09:49:02AM -0500, Eric W. Biederman wrote:
>>>> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>>>>
>>>>> On Wed, Sep 16, 2015 at 10:37:33AM +0300, Konstantin Khlebnikov wrote:
>>>>>> On 15.09.2015 20:41, Serge Hallyn wrote:
>>>>>>> Quoting Stéphane Graber (stgraber-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org):
>>>>>>>> On Tue, Sep 15, 2015 at 06:01:38PM +0300, Konstantin Khlebnikov wrote:
>>>>>>>>> On 15.09.2015 17:27, Eric W. Biederman wrote:
>>>>>>>>>> Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes:
>>>>>>>>>>
>>>>>>>>>>> pid_t getvpid(pid_t pid, pid_t source, pid_t target);
>>>>>>>>>>>
>>>>>>>>>>> This syscall converts pid from one pid-ns into pid in another pid-ns:
>>>>>>>>>>> it takes @pid in namespace of @source task (zero for current) and
>>>>>>>>>>> returns related pid in namespace of @target task (zero for current too).
>>>>>>>>>>> If pid is unreachable from target pid-ns then it returns zero.
>>>>>>>>>> This interface as presented is inherently racy. It would be better
>>>>>>>>>> if source and target were file descriptors referring to the namespaces
>>>>>>>>>> you wish to translate between.
>>>>>>>>> Yep, it's racy. As well as any operation with non-child pids.
>>>>>>>>> With file descriptors for source/target result will be racy anyway.
>>>>>>>>>
>>>>>>>>>>> Such conversion is required for interaction between processes from
>>>>>>>>>>> different pid-namespaces. For example when system service talks with
>>>>>>>>>>> client from isolated container via socket about task in container:
>>>>>>>>>> Sockets are already supported. At least the metadata of sockets is.
>>>>>>>>>>
>>>>>>>>>> Maybe we need this but I am not convinced of it's utility.
>>>>>>>>>>
>>>>>>>>>> What are you trying to do that motivates this?
>>>>>>>>> I'm working on hierarchical container management system which
>>>>>>>>> allows to create and control nested sub-containers from containers
>>>>>>>>> ( https://github.com/yandex/porto ). Main server works in host and
>>>>>>>>> have to interact with all levels of nested namespaces. This syscall
>>>>>>>>> makes some operations much easier: server must remember only pid in
>>>>>>>>> host pid namespace and convert it into right vpid on demand.
>>>>>>>> Note that as Eric said earlier, sending a PID inside a ucred through a
>>>>>>>> unix socket will have the pid translated.
>>>>>>>>
>>>>>>>> So while your solution certainly should be faster, you can already achieve
>>>>>>>> what you want today by doing:
>>>>>>>>
>>>>>>>> == Translate PID in container to PID in host
>>>>>>>> - open a socket
>>>>>>>> - setns to container's pidns
>>>>>>>> - send ucred from that container containing the requested container PID
>>>>>>>> - host sees the host PID
>>>>>>>>
>>>>>>>> == Translate PID on host to PID in container
>>>>>>>> - open a socket
>>>>>>>> - setns to container's pidns
>>>>>>>> - send ucred from the host containing the request host PID
>>>>>>>>   (send will fail if the host PID isn't part of that container)
>>>>>>>> - container sees the container PID
>>>>>>> In addition, since commit e4bc332451 : /proc/PID/status: show all sets of pid according to ns
>>>>>>> we now also have 'NSpid' etc in /proc/$$/status.
>>>>>>>
>>>>>> As I see this works perfectly only for converting host pid into virtual.
>>>>>>
>>>>>> Backward conversion is troublesome: we have to scan all pids in host
>>>>>> procfs and somehow filter tasks from container and its sub-pid-ns.
>>>>>> Or I am missing something trivial?
>>>>> Ah, no that doesn't help with this.
>>>>>
>>>>> What Stéphane describes is what I've done in several projects.
>>>>> Getting it right is however actually quite tricky. I'm not
>>>>> convinced it's at the level of "since you can do (sweep hands)
>>>>> all this, we don't need a simple syscall to do it."
>>>>>
>>>>> So I'd encourage you to resend using namespace inode fds for
>>>>> source and target as Eric suggested. We still may decide that
>>>>> the syscall isn't needed, but it's a trivial change to your
>>>>> patch and removes that race. And I'm not convinced it's not
>>>>> needed.
>>>> At this point my primary concern is that a pattern that would need to be
>>>> convering to and from pids quickly is potentially fundamentally racy to
>>>> the point of broken.
>>> The cgmanager GetTasks and GetTasksRecursive, and reading of the
>>> lxcfs cgroup /tasks files, require converting every pid from the
>>> cgmanager's namespace to the reading task's namespace.
>>>
>>>> Especially with unix domain sockets passing and converting pids in a way
>>>> that covers the common case.
>>>>
>>>> I am clearly missing some nuance of this use case.
>>> lxcfs and cgmanager are imo proof that we *can* do without the new
>>> syscall. However, the git history will show that there are some
>>> complications, and the system load when a few systemds are starting
>>> will show that it does take a performance toll on the host at some
>>> point. Still as I say it's doable. The syscall implementation was
>>> very simple, though.
>>
>> Yes, previous email discussed about the implementation of syscall or procfs:
>> http://www.gossamer-threads.com/lists/linux/kernel/1971723?search_string=chen%20hanxiao;#1971723
>>
>> but it seems complicated implemented by procfs, the original discussion at:
>> http://www.gossamer-threads.com/lists/linux/kernel/2076440?search_string=chen%20hanxiao;#2076440
>
> So please implement it, as Eric suggested, using the ns inode fds
> instead of racy pid_t hints for namespaces.
>

I don't want to lose the simple way to use it.
Sometimes the caller cannot prevent races (the task is its child or
locked with ptrace) or it doesn't care about them.

What about this design:

pid_t getvpid(pid_t pid, pid_t source, pid_t target)

pid > 0 - get vpid of task
pid = 0 - current pid (= just for symmetry =)
pid < 0 - get vpid of parent task (ppid of -pid)
[ that's really useful for poking isolated pidns ]
source/target > 0 - pid of source/target task
source/target = 0 - use current as source/target
source/target < 0 - use pidns fd (1-arg) as source/target

or the same but without =0 sugar:

pid > 0 - get vpid of task
pid < 0 - get vpid of parent task (ppid of -arg)
source/target > 0 - pid of source/target task
source/target <= 0 - use pidns fd (-arg) as source/target

libc caches current pid, extra getpid shouldn't be a problem.

--
Konstantin

^ permalink raw reply	[flat|nested] 15+ messages in thread
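[Editorial illustration, not part of the original message.] To make the
second variant of the proposal (the one "without =0 sugar") concrete, the
sign-based decoding of the three arguments could look roughly like this.
It is only a sketch of the convention as described in the mail, not kernel
code; the function names and the returned tuples are invented for this
example, and rejecting pid == 0 is an assumption, since the mail does not
say what that value should mean in this variant.

```python
def decode_task_arg(arg):
    """Decode the first argument: which task's vpid is being asked for."""
    if arg > 0:
        return ("task", arg)          # vpid of task `arg`
    if arg < 0:
        return ("parent_of", -arg)    # vpid of the parent of task `-arg`
    # Assumption: the variant without "=0 sugar" leaves pid == 0 undefined.
    raise ValueError("pid == 0 is not defined in this variant")


def decode_ns_arg(arg):
    """Decode source/target: a task pid or a pid-namespace file descriptor."""
    if arg > 0:
        return ("pid", arg)           # namespace of task `arg`
    return ("nsfd", -arg)             # pidns fd number `-arg` (0 means fd 0)
```

For example, decode_ns_arg(1234) selects the namespace of task 1234, while
decode_ns_arg(-5) selects the namespace behind file descriptor 5. The sign
of each argument therefore decides how the same integer is interpreted.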
[parent not found: <56010680.7000301-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org>]
* Re: [PATCH RFC] pidns: introduce syscall getvpid
       [not found] ` <56010680.7000301-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org>
@ 2015-09-22 21:00 ` Eric W. Biederman
  0 siblings, 0 replies; 15+ messages in thread
From: Eric W. Biederman @ 2015-09-22 21:00 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Serge E. Hallyn, Chen Fan, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Serge Hallyn, Oleg Nesterov, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Linus Torvalds

Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> writes:

> On 21.09.2015 17:22, Serge E. Hallyn wrote:
>>
>> So please implement it, as Eric suggested, using the ns inode fds
>> instead of racy pid_t hints for namespaces.
>>
>
> I don't want to lose the simple way to use it.
> Sometimes the caller cannot prevent races (the task is its child or
> locked with ptrace) or it doesn't care about them.
>
> What about this design:
>
> pid_t getvpid(pid_t pid, pid_t source, pid_t target)
>
> pid > 0 - get vpid of task
> pid = 0 - current pid (= just for symmetry =)
> pid < 0 - get vpid of parent task (ppid of -pid)
> [ that's really useful for poking isolated pidns ]
> source/target > 0 - pid of source/target task
> source/target = 0 - use current as source/target
> source/target < 0 - use pidns fd (1-arg) as source/target
>
> or the same but without =0 sugar:
>
> pid > 0 - get vpid of task
> pid < 0 - get vpid of parent task (ppid of -arg)
> source/target > 0 - pid of source/target task
> source/target <= 0 - use pidns fd (-arg) as source/target
>
> libc caches current pid, extra getpid shouldn't be a problem.

Yuck.

An invalid fd like for saying use the current pid namespace is fine.

Using pids to identify namespaces yuck just yuck. That just seems to add
complexity for no gain except to make programs buggier.

We have a couple of old interfaces that use pids because pids were the
best we had, but at this point I don't see anything at all that even
suggests that pids are a good choice for identifying namespaces.

If performance is important then caching file descriptors should be
trivial. If performance is not important it should not be hard to open
"/proc/<pid>/ns/pid".

I do not see the gain of using pids in this interface except to confuse
people, and make the interface brittle.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread
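[Editorial illustration, not part of the original message.] The approach
suggested above, opening "/proc/<pid>/ns/pid" once and caching the file
descriptor, might look roughly like this from user space. It is a minimal
sketch assuming Linux with procfs mounted; open_pidns, same_pidns and
PidnsFdCache are hypothetical names chosen for this example, and the idea
of keying the cache by a container name is likewise an assumption.

```python
import os


def open_pidns(pid):
    """Open /proc/<pid>/ns/pid. Once open, the fd keeps referring to that
    namespace even if the task exits and its pid is later reused."""
    return os.open(f"/proc/{pid}/ns/pid", os.O_RDONLY)


def same_pidns(fd_a, fd_b):
    """Two namespace fds name the same namespace when both the device and
    the inode numbers match."""
    a, b = os.fstat(fd_a), os.fstat(fd_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


class PidnsFdCache:
    """Keep one pid-namespace fd per container, keyed by a caller-chosen name."""

    def __init__(self):
        self._fds = {}

    def get(self, name, init_pid):
        # The pid is only consulted on the first lookup; later calls reuse the fd.
        if name not in self._fds:
            self._fds[name] = open_pidns(init_pid)
        return self._fds[name]

    def close(self):
        for fd in self._fds.values():
            os.close(fd)
        self._fds.clear()
```

The initial open still identifies the task by pid, so the caller has to make
sure the pid is valid at that moment; after that, every lookup goes through
the cached descriptor and no longer depends on the pid.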
end of thread, other threads:[~2015-09-22 21:00 UTC | newest]

Thread overview: 15+ messages
2015-09-15 12:09 [PATCH RFC] pidns: introduce syscall getvpid Konstantin Khlebnikov
2015-09-15 14:20 ` Oleg Nesterov
2015-09-15 14:27 ` Eric W. Biederman
2015-09-15 15:01 ` Konstantin Khlebnikov
2015-09-15 15:17 ` Stéphane Graber
2015-09-15 15:51 ` Konstantin Khlebnikov
2015-09-15 17:41 ` Serge Hallyn
2015-09-16  7:37 ` Konstantin Khlebnikov
2015-09-16 14:39 ` Serge E. Hallyn
2015-09-16 14:49 ` Eric W. Biederman
2015-09-16 16:31 ` Serge E. Hallyn
2015-09-21  2:49 ` Chen Fan
2015-09-21 14:22 ` Serge E. Hallyn
2015-09-22  7:42 ` Konstantin Khlebnikov
2015-09-22 21:00 ` Eric W. Biederman