* [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID @ 2015-02-06 16:23 Konstantin Khlebnikov 2015-02-06 16:23 ` [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID Konstantin Khlebnikov 2015-02-06 19:44 ` [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID Oleg Nesterov 0 siblings, 2 replies; 13+ messages in thread From: Konstantin Khlebnikov @ 2015-02-06 16:23 UTC (permalink / raw) To: linux-api, Andrew Morton, Linus Torvalds, linux-kernel Cc: Roman Gushchin, Nikita Vetoshkin, Oleg Nesterov, Pavel Emelyanov Currently kernel ignores put_user() errors when it writes tid for syscall clone flags CLONE_CHILD_SETTID and CLONE_CHILD_CLEARTID. Unfortunately this code always worked in this way. We cannot abort syscall if client tid pointer is invalid. This patch adds get_user() after failed put_user(): if this address is not even readable then we leave it alone: user-space probably don't care about pid or error will be noticed soon. If address is readable then application might be mislead about it's own pid. In this case CLONE_CHILD_SETTID kills child task. In same condition failed CLONE_CHILD_CLEARTID prints warning. Nikita found script which sometimes hangs inside glibc in function fork() in child task right after returning from syscall some glibc assert fails: #0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93 #1 0x00007fabcc699087 in _L_lock_11477 at malloc.c:5249 #2 0x00007fabcc6973ed in __GI___libc_realloc at malloc.c:3054 #3 0x00007fabcc686dbd in _IO_vasprintf at vasprintf.c:86 #4 0x00007fabcc667747 in ___asprintf at asprintf.c:37 #5 0x00007fabcc642cfb in __assert_fail_base at assert.c:59 #6 0x00007fabcc642e42 in __GI___assert_fail at assert.c:103 #7 0x00007fabcc6d40b6 in __libc_fork at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:142 #8 0x00007fabccad23cd in Perl_pp_system at pp_sys.c:4143c #9 0x00007fabcca7fcd6 in Perl_runops_standard at run.c:41 #10 0x00007fabcca2139a in S_run_body at perl.c:2350 #11 perl_run at perl.c:2268 #12 0x0000000000400db9 in main at perlmain.c:120 Assert checks (THREAD_GETMEM (self, tid) != ppid). In child task pid still equal to the parent pid! Write for CLONE_CHILD_SETTID had failed silently. Unfortunately this assert in glibc is flawed, some internal locks are held at this point and task hangs when tries to print out error message. Usually memory allocations at page-faults are either completely successful or task is killed by out of memory killer: buddy allocator handles 0-order allocations as GFP_NOFAIL and retries them endlessly. OOM-killer in memory cgroup works only for faulting from user-space. If kernel hits memcg limit inside page-fault that will be handled as ordinary "Bad address": -EFAULT. Whole sequence looks like: task calls fork, glibc calls syscall clone with CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument. Child task gets read-only copy of VM including TLS. Child calls put_user() to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page fault and it fails because do_wp_page() hits memcg limit without invoking OOM-killer because this is page-fault from kernel-space. Put_user returns -EFAULT, which is ignored. Child returns into user-space and catches here assert (THREAD_GETMEM (self, tid) != ppid), glibc tries to print something but hangs on deadlock on internal locks. Halt and catch fire. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reported-by: Nikita Vetoshkin <nekto0n@yandex-team.ru> --- kernel/fork.c | 15 ++++++++++++--- kernel/sched/core.c | 16 ++++++++++++++-- 2 files changed, 26 insertions(+), 5 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 4dc2dda..f71305d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -822,11 +822,20 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm) if (tsk->clear_child_tid) { if (!(tsk->flags & PF_SIGNALED) && atomic_read(&mm->mm_users) > 1) { + int dummy; + /* - * We don't check the error code - if userspace has - * not set up a proper pointer then tough luck. + * We cannot report put_user fails - if userspace has + * not set up a proper pointer then tough luck. It's + * much worse if it's failed for some other reason: + * for example memory shortage in CoW area, somebody + * will wait us forever. Let's at least print warning. */ - put_user(0, tsk->clear_child_tid); + if (unlikely(put_user(0, tsk->clear_child_tid)) && + !get_user(dummy, tsk->clear_child_tid) && + !fatal_signal_pending(current)) + WARN_ON_ONCE(dummy); + sys_futex(tsk->clear_child_tid, FUTEX_WAKE, 1, NULL, NULL, 0); } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e628cb1..74b33ff 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev) post_schedule(rq); preempt_enable(); - if (current->set_child_tid) - put_user(task_pid_vnr(current), current->set_child_tid); + if (current->set_child_tid && + unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) { + int dummy; + + /* + * If this address is unreadable then userspace has not set + * proper pointer. Application either doesn't care or will + * notice this soon. If this address is readable then task + * will be mislead about its own tid. It's better to die. + */ + if (!get_user(dummy, current->set_child_tid) && + !fatal_signal_pending(current)) + force_sig(SIGSEGV, current); + } } /* ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID 2015-02-06 16:23 [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID Konstantin Khlebnikov @ 2015-02-06 16:23 ` Konstantin Khlebnikov 2015-02-06 20:49 ` Linus Torvalds 2015-02-06 19:44 ` [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID Oleg Nesterov 1 sibling, 1 reply; 13+ messages in thread From: Konstantin Khlebnikov @ 2015-02-06 16:23 UTC (permalink / raw) To: linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds, linux-kernel-u79uwXL29TY76Z2rM5mHXA Cc: Roman Gushchin, Nikita Vetoshkin, Oleg Nesterov, Pavel Emelyanov Handling of flag CLONE_PARENT_SETTID has the same problem: error returned from put_user() is ignored. Glibc completely relies on that feature and uses value returned from syscall only for error checking. Kernels older than v2.6.24 handled that correctly but check has been removed in commit 30e49c263e36 ("pid namespaces: allow cloning of new namespace"). Commit message tells nothing about reason. I guess that was fix for commit 425fb2b4bf5d ("pid namespaces: move alloc_pid() lower in copy_process()") which breaks this logic: after it kernel writes parent pid as child pid. This patch moves related put_user() from do_fork() back into copy_process() where it was before and where error can be handled. Another problem is that before v2.6.24 CLONE_PARENT_SETTID stored child pid both in parent and child memory. Documentation in man clone(2) also tells so. In v2.6.24 put_user() was placed after copy_mm(), now only parent sees it. LTP test which should check that is buggy too: it clones child with CLONE_VM. It seems nobody have noticed this for seven years, probably we should leave it as is and document existing behavior. Signed-off-by: Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> --- kernel/fork.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index f71305d..1ea2eae 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1194,6 +1194,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid) static struct task_struct *copy_process(unsigned long clone_flags, unsigned long stack_start, unsigned long stack_size, + int __user *parent_tidptr, int __user *child_tidptr, struct pid *pid, int trace) @@ -1416,6 +1417,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto bad_fork_cleanup_io; } + if (clone_flags & CLONE_PARENT_SETTID) { + retval = put_user(pid_vnr(pid), parent_tidptr); + if (retval) + goto bad_fork_free_pid; + } + p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL; /* * Clear TID on mm_release()? @@ -1617,7 +1624,7 @@ static inline void init_idle_pids(struct pid_link *links) struct task_struct *fork_idle(int cpu) { struct task_struct *task; - task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0); + task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0); if (!IS_ERR(task)) { init_idle_pids(task->pids); init_idle(task, cpu); @@ -1661,7 +1668,7 @@ long do_fork(unsigned long clone_flags, } p = copy_process(clone_flags, stack_start, stack_size, - child_tidptr, NULL, trace); + parent_tidptr, child_tidptr, NULL, trace); /* * Do this prior waking up the new thread - the thread pointer * might get invalid after that point, if the thread exits quickly. @@ -1675,9 +1682,6 @@ long do_fork(unsigned long clone_flags, pid = get_task_pid(p, PIDTYPE_PID); nr = pid_vnr(pid); - if (clone_flags & CLONE_PARENT_SETTID) - put_user(nr, parent_tidptr); - if (clone_flags & CLONE_VFORK) { p->vfork_done = &vfork; init_completion(&vfork); ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID 2015-02-06 16:23 ` [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID Konstantin Khlebnikov @ 2015-02-06 20:49 ` Linus Torvalds [not found] ` <CA+55aFxBuf-0UkoYCrwH_vNsWFnUkFOz5c9O_Mswe_w0BTkqbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-02-06 21:13 ` Konstantin Khlebnikov 0 siblings, 2 replies; 13+ messages in thread From: Linus Torvalds @ 2015-02-06 20:49 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Linux API, Andrew Morton, Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin, Oleg Nesterov, Pavel Emelyanov On Fri, Feb 6, 2015 at 8:23 AM, Konstantin Khlebnikov <khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org> wrote: > Handling of flag CLONE_PARENT_SETTID has the same problem: error returned > from put_user() is ignored. Glibc completely relies on that feature and uses > value returned from syscall only for error checking. I'm not seeing the advantage of the error checking part of the pacth patch. It generates extra code, possibly changing existing interfaces, and it doesn't actually buy us anything. What's the upside? If somebody passes in a bad pointer, it's their problem. For all we know, people used to pass in NULL, even if they had the SETTID bit set. This makes it now return EFAULT. So I don't mind moving things into copy_process(), but I *do* mind the new error return thing. It's actually better in this patch than in 1/2, because 1/2 was just insane with the whole "readable vs writable" thing. That I refuse to even look at, for fear of going blind. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <CA+55aFxBuf-0UkoYCrwH_vNsWFnUkFOz5c9O_Mswe_w0BTkqbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID [not found] ` <CA+55aFxBuf-0UkoYCrwH_vNsWFnUkFOz5c9O_Mswe_w0BTkqbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-02-06 21:07 ` Oleg Nesterov 0 siblings, 0 replies; 13+ messages in thread From: Oleg Nesterov @ 2015-02-06 21:07 UTC (permalink / raw) To: Linus Torvalds Cc: Konstantin Khlebnikov, Linux API, Andrew Morton, Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin, Pavel Emelyanov I am not sure about these changes too, but On 02/06, Linus Torvalds wrote: > > What's the upside? If somebody passes in a bad pointer, it's their > problem. Yes. But unless I am totally confused (quite possible) this put_user() can fail even if the pointer is valid. So at least I think Konstantin has found a real problem. Which (I think) shoud be fixed, and this is not connected to fork. > insane with the whole "readable vs writable" thing. Agreed, this part doesn't look nice in any case. Oleg. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID 2015-02-06 20:49 ` Linus Torvalds [not found] ` <CA+55aFxBuf-0UkoYCrwH_vNsWFnUkFOz5c9O_Mswe_w0BTkqbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-02-06 21:13 ` Konstantin Khlebnikov 2015-02-06 21:55 ` Andy Lutomirski [not found] ` <CALYGNiMv021=WC2uXsjo5zT8JwewweZUDdk0x8FGHh9V5j6bFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 2 replies; 13+ messages in thread From: Konstantin Khlebnikov @ 2015-02-06 21:13 UTC (permalink / raw) To: Linus Torvalds Cc: Konstantin Khlebnikov, Linux API, Andrew Morton, Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin, Oleg Nesterov, Pavel Emelyanov On Fri, Feb 6, 2015 at 11:49 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Feb 6, 2015 at 8:23 AM, Konstantin Khlebnikov > <khlebnikov@yandex-team.ru> wrote: >> Handling of flag CLONE_PARENT_SETTID has the same problem: error returned >> from put_user() is ignored. Glibc completely relies on that feature and uses >> value returned from syscall only for error checking. > > I'm not seeing the advantage of the error checking part of the pacth > patch. It generates extra code, possibly changing existing interfaces, > and it doesn't actually buy us anything. > > What's the upside? If somebody passes in a bad pointer, it's their > problem. For all we know, people used to pass in NULL, even if they > had the SETTID bit set. This makes it now return EFAULT. Currently that works fine only because kernel retries 0-order allocations endlessly. But pagefault_out_of_memory() is never called for non-user PF. For kernel PF all oom-kills are triggered by buddy-allocator. If buddy allocator gave up earlier then page-faults from kernel space could fail without OOM. And in CoW area user-space will see stale data. So, either we must handle all put_user/copy_to_user errors (which isn't that bad idea) or kernel must force all PF to success-or-die policy. First patch is that ugly because kernel has never checked errors in that place. So, I've tried to find solution which could fix problem without breaking backward compatibility. > > So I don't mind moving things into copy_process(), but I *do* mind the > new error return thing. > > It's actually better in this patch than in 1/2, because 1/2 was just > insane with the whole "readable vs writable" thing. That I refuse to > even look at, for fear of going blind. > > Linus > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID 2015-02-06 21:13 ` Konstantin Khlebnikov @ 2015-02-06 21:55 ` Andy Lutomirski [not found] ` <CALYGNiMv021=WC2uXsjo5zT8JwewweZUDdk0x8FGHh9V5j6bFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 0 replies; 13+ messages in thread From: Andy Lutomirski @ 2015-02-06 21:55 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Linus Torvalds, Konstantin Khlebnikov, Linux API, Andrew Morton, Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin, Oleg Nesterov, Pavel Emelyanov On Fri, Feb 6, 2015 at 1:13 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote: > On Fri, Feb 6, 2015 at 11:49 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> On Fri, Feb 6, 2015 at 8:23 AM, Konstantin Khlebnikov >> <khlebnikov@yandex-team.ru> wrote: >>> Handling of flag CLONE_PARENT_SETTID has the same problem: error returned >>> from put_user() is ignored. Glibc completely relies on that feature and uses >>> value returned from syscall only for error checking. >> >> I'm not seeing the advantage of the error checking part of the pacth >> patch. It generates extra code, possibly changing existing interfaces, >> and it doesn't actually buy us anything. >> >> What's the upside? If somebody passes in a bad pointer, it's their >> problem. For all we know, people used to pass in NULL, even if they >> had the SETTID bit set. This makes it now return EFAULT. > > Currently that works fine only because kernel retries 0-order allocations > endlessly. But pagefault_out_of_memory() is never called for non-user PF. > For kernel PF all oom-kills are triggered by buddy-allocator. > If buddy allocator gave up earlier then page-faults from kernel space > could fail without OOM. And in CoW area user-space will see stale data. > So, either we must handle all put_user/copy_to_user errors (which isn't > that bad idea) or kernel must force all PF to success-or-die policy. > > First patch is that ugly because kernel has never checked errors > in that place. So, I've tried to find solution which could fix problem > without breaking backward compatibility. If you're really worried about compatibility, it would be possible, if really really ugly, to check whether there's a vma at all at the requested address and to return -EFAULT only in the case where there is a vma but put_user still failed. A less awful approach might be to accept put_user failures if the address is NULL. --Andy ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <CALYGNiMv021=WC2uXsjo5zT8JwewweZUDdk0x8FGHh9V5j6bFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID [not found] ` <CALYGNiMv021=WC2uXsjo5zT8JwewweZUDdk0x8FGHh9V5j6bFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2015-02-06 22:10 ` Linus Torvalds 0 siblings, 0 replies; 13+ messages in thread From: Linus Torvalds @ 2015-02-06 22:10 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: Konstantin Khlebnikov, Linux API, Andrew Morton, Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin, Oleg Nesterov, Pavel Emelyanov On Fri, Feb 6, 2015 at 1:13 PM, Konstantin Khlebnikov <koct9i-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > > Currently that works fine only because kernel retries 0-order allocations > endlessly. But pagefault_out_of_memory() is never called for non-user PF. > For kernel PF all oom-kills are triggered by buddy-allocator. This makes no sense. You're trying to fix what you perceive as a problem in the page fault handling in some totally different place. If *that* is what you are worried about, then damnit, just fix the page fault handler for the kernel case to send a signal or something for VM_FAULT_ENOMEM. Or, better yet, make it just trigger oom at return to user space - we could add a _TIF_OOM flag, for example, and make it part of the user-return logic (do_notify_resume), kind of how _TIF_SIGPENDING triggers a signal. Don't try to make horrible code in insane places that have nothing to do with the fundamental problem. Why did you pick this particular get/put user anyway? There are tons others that we don't test, why did you happen pick these and then make it have that horrible and senseless error handling? Because at *NO* point was it obvious that that patch had anything at all to do with "out of memory". Not in the code, not in your commit messages, *nowhere*. There is no way in hell I will take that kind of changes when you didn't even articulate why you wanted to do them in the commit messages, and the patches didn't look like they had anything to do with oom either. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID 2015-02-06 16:23 [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID Konstantin Khlebnikov 2015-02-06 16:23 ` [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID Konstantin Khlebnikov @ 2015-02-06 19:44 ` Oleg Nesterov [not found] ` <20150206194405.GA13960-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 13+ messages in thread From: Oleg Nesterov @ 2015-02-06 19:44 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roman Gushchin, Nikita Vetoshkin, Pavel Emelyanov On 02/06, Konstantin Khlebnikov wrote: > > Whole sequence looks like: task calls fork, glibc calls syscall clone with > CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument. > Child task gets read-only copy of VM including TLS. Child calls put_user() > to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page > fault and it fails because do_wp_page() hits memcg limit without invoking > OOM-killer because this is page-fault from kernel-space. Because of !FAULT_FLAG_USER? Perhaps we should fix this? Say mem_cgroup_oom_enable/disable around put_user(), I dunno. > Put_user returns > -EFAULT, which is ignored. Child returns into user-space and catches here > assert (THREAD_GETMEM (self, tid) != ppid), If only I understood why else we need CLONE_CHILD_SETTID ;) > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev) > post_schedule(rq); > preempt_enable(); > > - if (current->set_child_tid) > - put_user(task_pid_vnr(current), current->set_child_tid); > + if (current->set_child_tid && > + unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) { > + int dummy; > + > + /* > + * If this address is unreadable then userspace has not set > + * proper pointer. Application either doesn't care or will > + * notice this soon. If this address is readable then task > + * will be mislead about its own tid. It's better to die. > + */ > + if (!get_user(dummy, current->set_child_tid) && > + !fatal_signal_pending(current)) > + force_sig(SIGSEGV, current); > + } Well, get_user() can fail the same way? The page we need to cow can be swapped out. At first glance, to me this problem should be solved somewhere else... I'll try to reread this all tomorrow. Oleg. ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <20150206194405.GA13960-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID [not found] ` <20150206194405.GA13960-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-02-06 19:55 ` Oleg Nesterov [not found] ` <20150206195529.GA15517-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 13+ messages in thread From: Oleg Nesterov @ 2015-02-06 19:55 UTC (permalink / raw) To: Konstantin Khlebnikov Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roman Gushchin, Nikita Vetoshkin, Pavel Emelyanov On 02/06, Oleg Nesterov wrote: > > On 02/06, Konstantin Khlebnikov wrote: > > > > Whole sequence looks like: task calls fork, glibc calls syscall clone with > > CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument. > > Child task gets read-only copy of VM including TLS. Child calls put_user() > > to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page > > fault and it fails because do_wp_page() hits memcg limit without invoking > > OOM-killer because this is page-fault from kernel-space. > > Because of !FAULT_FLAG_USER? > > Perhaps we should fix this? Say mem_cgroup_oom_enable/disable around put_user(), > I dunno. > > > Put_user returns > > -EFAULT, which is ignored. Child returns into user-space and catches here > > assert (THREAD_GETMEM (self, tid) != ppid), > > If only I understood why else we need CLONE_CHILD_SETTID ;) > > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev) > > post_schedule(rq); > > preempt_enable(); > > > > - if (current->set_child_tid) > > - put_user(task_pid_vnr(current), current->set_child_tid); > > + if (current->set_child_tid && > > + unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) { > > + int dummy; > > + > > + /* > > + * If this address is unreadable then userspace has not set > > + * proper pointer. Application either doesn't care or will > > + * notice this soon. If this address is readable then task > > + * will be mislead about its own tid. It's better to die. > > + */ > > + if (!get_user(dummy, current->set_child_tid) && > > + !fatal_signal_pending(current)) > > + force_sig(SIGSEGV, current); > > + } > > Well, get_user() can fail the same way? The page we need to cow can be > swapped out. > > At first glance, to me this problem should be solved somewhere else... > I'll try to reread this all tomorrow. And in fact I think that this is not set_child_tid/etc-specific. Perhaps I am totally confused, but I think that put_user() simply should not fail this way. Say, why a syscall should return -EFAULT if memory allocation "silently" fails? Confused. Oleg. ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <20150206195529.GA15517-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID [not found] ` <20150206195529.GA15517-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-02-06 20:27 ` Konstantin Khlebnikov 2015-02-06 20:32 ` memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID) Oleg Nesterov 1 sibling, 0 replies; 13+ messages in thread From: Konstantin Khlebnikov @ 2015-02-06 20:27 UTC (permalink / raw) To: Oleg Nesterov Cc: Konstantin Khlebnikov, Linux API, Andrew Morton, Linus Torvalds, Linux Kernel Mailing List, Roman Gushchin, Nikita Vetoshkin, Pavel Emelyanov On Fri, Feb 6, 2015 at 10:55 PM, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On 02/06, Oleg Nesterov wrote: >> >> On 02/06, Konstantin Khlebnikov wrote: >> > >> > Whole sequence looks like: task calls fork, glibc calls syscall clone with >> > CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument. >> > Child task gets read-only copy of VM including TLS. Child calls put_user() >> > to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page >> > fault and it fails because do_wp_page() hits memcg limit without invoking >> > OOM-killer because this is page-fault from kernel-space. >> >> Because of !FAULT_FLAG_USER? Yep. As I see memcg triggers OOM only on page-faults and only from user-space. >> >> Perhaps we should fix this? Say mem_cgroup_oom_enable/disable around put_user(), >> I dunno. >> >> > Put_user returns >> > -EFAULT, which is ignored. Child returns into user-space and catches here >> > assert (THREAD_GETMEM (self, tid) != ppid), >> >> If only I understood why else we need CLONE_CHILD_SETTID ;) I dunno, CLONE_PARENT_SETTID should be enough for everybody but it's broken too. Twice. See the next patch =) >> >> > --- a/kernel/sched/core.c >> > +++ b/kernel/sched/core.c >> > @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev) >> > post_schedule(rq); >> > preempt_enable(); >> > >> > - if (current->set_child_tid) >> > - put_user(task_pid_vnr(current), current->set_child_tid); >> > + if (current->set_child_tid && >> > + unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) { >> > + int dummy; >> > + >> > + /* >> > + * If this address is unreadable then userspace has not set >> > + * proper pointer. Application either doesn't care or will >> > + * notice this soon. If this address is readable then task >> > + * will be mislead about its own tid. It's better to die. >> > + */ >> > + if (!get_user(dummy, current->set_child_tid) && >> > + !fatal_signal_pending(current)) >> > + force_sig(SIGSEGV, current); >> > + } >> >> Well, get_user() can fail the same way? The page we need to cow can be >> swapped out. >> >> At first glance, to me this problem should be solved somewhere else... >> I'll try to reread this all tomorrow. > > And in fact I think that this is not set_child_tid/etc-specific. Perhaps > I am totally confused, but I think that put_user() simply should not fail > this way. Say, why a syscall should return -EFAULT if memory allocation > "silently" fails? Confused. That's how memcg works. All other places are handled explicitly and returns into user-space as -ENOMEM or -EFAULT. Probably some strange numa policy / memoryaffinity might trigger this too. Probably all page-faults must be forced to succeed or die mode, in this case all errors in put_user/copy_to_user could be simply ignored. -- Konstantin ^ permalink raw reply [flat|nested] 13+ messages in thread
* memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID) [not found] ` <20150206195529.GA15517-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-02-06 20:27 ` Konstantin Khlebnikov @ 2015-02-06 20:32 ` Oleg Nesterov [not found] ` <20150206203246.GA16924-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 13+ messages in thread From: Oleg Nesterov @ 2015-02-06 20:32 UTC (permalink / raw) To: Konstantin Khlebnikov, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki, KOSAKI Motohiro Cc: linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roman Gushchin, Nikita Vetoshkin, Pavel Emelyanov Add cc's. On 02/06, Oleg Nesterov wrote: > > And in fact I think that this is not set_child_tid/etc-specific. Perhaps > I am totally confused, but I think that put_user() simply should not fail > this way. Say, why a syscall should return -EFAULT if memory allocation > "silently" fails? Confused. Seriously. I must have missed something, but I can't understand 519e52473eb "mm: memcg: enable memcg OOM killer only for user faults". The changelog says: System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. How can a system call know it should return -ENOMEM if put_user() can only return -EFAULT ? Oleg. ^ permalink raw reply [flat|nested] 13+ messages in thread
[parent not found: <20150206203246.GA16924-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID) [not found] ` <20150206203246.GA16924-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2015-02-10 16:19 ` Johannes Weiner 2015-02-10 19:47 ` Oleg Nesterov 0 siblings, 1 reply; 13+ messages in thread From: Johannes Weiner @ 2015-02-10 16:19 UTC (permalink / raw) To: Oleg Nesterov Cc: Konstantin Khlebnikov, Michal Hocko, KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton, Linus Torvalds, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Roman Gushchin, Nikita Vetoshkin, Pavel Emelyanov On Fri, Feb 06, 2015 at 09:32:46PM +0100, Oleg Nesterov wrote: > Add cc's. > > On 02/06, Oleg Nesterov wrote: > > > > And in fact I think that this is not set_child_tid/etc-specific. Perhaps > > I am totally confused, but I think that put_user() simply should not fail > > this way. Say, why a syscall should return -EFAULT if memory allocation > > "silently" fails? Confused. > > Seriously. I must have missed something, but I can't understand 519e52473eb > "mm: memcg: enable memcg OOM killer only for user faults". > > The changelog says: > > System calls and kernel faults (uaccess, gup) can handle an out of > memory situation gracefully and just return -ENOMEM. The cover letter of the original patch series is hidden in the changelog a few commits before this one: commit 94bce453c78996cc4373d5da6cfabe07fcc6d9f9 Author: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> Date: Thu Sep 12 15:13:36 2013 -0700 arch: mm: remove obsolete init OOM protection The memcg code can trap tasks in the context of the failing allocation until an OOM situation is resolved. They can hold all kinds of locks (fs, mm) at this point, which makes it prone to deadlocking. This series converts memcg OOM handling into a two step process that is started in the charge context, but any waiting is done after the fault stack is fully unwound. Patches 1-4 prepare architecture handlers to support the new memcg requirements, but in doing so they also remove old cruft and unify out-of-memory behavior across architectures. Patch 5 disables the memcg OOM handling for syscalls, readahead, kernel faults, because they can gracefully unwind the stack with -ENOMEM. OOM handling is restricted to user triggered faults that have no other option. We had reports of systems deadlocking because the allocating task would wait for the OOM killer to make progress, and the OOM-killed task would wait for a lock held by the allocating task. It's important to note here that the OOM killer currently does not kill a second task until the current victim has exited, because that victim has special rights to access to emergency reserves and we don't want to risk overcommitting those. To address that deadlock, I decided to no longer invoke memcg OOM handling from the allocation context directly, but instead remember the situation in the task_struct and handle it before restarting the fault. However, in-kernel faults do not restart, maybe even succeed despite allocation failures (i.e. readahead), so under the assumption that they have to handle error returns anyway, I disabled invoking the memcg OOM killer for in-kernel faults altogether. > How can a system call know it should return -ENOMEM if put_user() can only > return -EFAULT ? I see the problem, but allocations can not be guaranteed to succeed, not even the OOM killer can reliably make progress, depending on the state the faulting process is in, the locks it is holding. So what can we do if that allocation fails? Even if we go the route that Linus proposes and make OOM situations more generic and check them on *every* return to userspace, the OOM handler at that point might still kill a task more suited to free memory than the faulting one, and so we still have to communicate the proper error value to the syscall. However, I think we could go back to invoking OOM from all allocation contexts again as long as we change allocator and OOM killer to not wait for individual OOM victims to exit indefinitely (unless it's a __GFP_NOFAIL context). Maybe wait for some time on the first victim before moving on to the next one. That would reduce the likelihood of failing allocations without reinstating the original deadlock. put_user() could still technically fail due to allocation failure, but it wouldn't be a very likely event. Would that be okay? This does not just apply to memcg allocations, btw. Physical page allocations have recently been reported to deadlock in a similar fashion because of low orders implying __GFP_NOFAIL. Making the OOM killer more robust and more eager to make progress would allow us to remove the implied __GFP_NOFAIL while maintaining reasonable quality of service in the allocator. What do you think? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID) 2015-02-10 16:19 ` Johannes Weiner @ 2015-02-10 19:47 ` Oleg Nesterov 0 siblings, 0 replies; 13+ messages in thread From: Oleg Nesterov @ 2015-02-10 19:47 UTC (permalink / raw) To: Johannes Weiner Cc: Konstantin Khlebnikov, Michal Hocko, KAMEZAWA Hiroyuki, KOSAKI Motohiro, linux-api, Andrew Morton, Linus Torvalds, linux-kernel, Roman Gushchin, Nikita Vetoshkin, Pavel Emelyanov On 02/10, Johannes Weiner wrote: > > We had reports of systems deadlocking because Yes, yes, to some degree I understand why it was done this way. Not that I understand the details of course. Thanks for your explanations. > > How can a system call know it should return -ENOMEM if put_user() can only > > return -EFAULT ? > > I see the problem, but allocations can not be guaranteed to succeed, > not even the OOM killer can reliably make progress, Yes sure, > So what > can we do if that allocation fails? Even if we go the route that > Linus proposes and make OOM situations more generic and check them on > *every* return to userspace, the OOM handler at that point might still > kill a task more suited to free memory than the faulting one, and so > we still have to communicate the proper error value to the syscall. Yes. To me this means that if a page fault from kernel-space fails because of VM_FAULT_OOM the task should be killed in any case. Except we should obviously exclude gup/kthreads. We can't retry in this case and (say) schedule_tail() simply can't report or handle the failure. Imho it would be better to kill the task loudly, perhaps with a warning. To avoid the confusion. Of course, it is not that I am trying to simply add send_sig(SIGKILL) into the failure paths. My only point is that, whatever we do, the "silent" or misleading failure is worse than SIGKILL. The application can't really "handle an out of memory situation gracefully" as the changlelog says. Even if put_user() (and thus syscall) could return -ENOMEM, this doesn't really matter I think. > However, I think we could go back to invoking OOM from all allocation > contexts again as long as we change allocator and OOM killer to not > wait for individual OOM victims to exit indefinitely (unless it's a > __GFP_NOFAIL context). Maybe wait for some time on the first victim > before moving on to the next one. perhaps... can't really comment, at least right now. > What do you think? So far I only think that this problem is not trivial ;) Oleg. ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2015-02-10 19:47 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-02-06 16:23 [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID Konstantin Khlebnikov 2015-02-06 16:23 ` [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID Konstantin Khlebnikov 2015-02-06 20:49 ` Linus Torvalds [not found] ` <CA+55aFxBuf-0UkoYCrwH_vNsWFnUkFOz5c9O_Mswe_w0BTkqbQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-02-06 21:07 ` Oleg Nesterov 2015-02-06 21:13 ` Konstantin Khlebnikov 2015-02-06 21:55 ` Andy Lutomirski [not found] ` <CALYGNiMv021=WC2uXsjo5zT8JwewweZUDdk0x8FGHh9V5j6bFQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2015-02-06 22:10 ` Linus Torvalds 2015-02-06 19:44 ` [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID Oleg Nesterov [not found] ` <20150206194405.GA13960-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-02-06 19:55 ` Oleg Nesterov [not found] ` <20150206195529.GA15517-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-02-06 20:27 ` Konstantin Khlebnikov 2015-02-06 20:32 ` memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID) Oleg Nesterov [not found] ` <20150206203246.GA16924-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2015-02-10 16:19 ` Johannes Weiner 2015-02-10 19:47 ` Oleg Nesterov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).