* [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
From: Tejun Heo @ 2017-03-15 23:19 UTC
To: Oleg Nesterov, Linus Torvalds, Andrew Morton
Cc: Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrect CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.

This patch introduces KTHREAD_INITIALIZED which is set after the
kthread finishes initialization. cgroup core closes the race window
by testing kthread_initialized() and rejecting migration accordingly.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race < v4.3)
---
 include/linux/kthread.h |    1 +
 kernel/cgroup/cgroup.c  |   10 ++++++----
 kernel/kthread.c        |   21 ++++++++++++++++++++-
 3 files changed, 27 insertions(+), 5 deletions(-)

--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -49,6 +49,7 @@ struct task_struct *kthread_create_on_cp
 })

 void free_kthread_struct(struct task_struct *k);
+bool kthread_initialized(struct task_struct *k);
 void kthread_bind(struct task_struct *k, unsigned int cpu);
 void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
 int kthread_stop(struct task_struct *k);
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2425,11 +2425,13 @@ ssize_t __cgroup_procs_write(struct kern
 	tsk = tsk->group_leader;

 	/*
-	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
-	 * trapped in a cpuset, or RT worker may be born in a cgroup
-	 * with no rt_runtime allocated. Just say no.
+	 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
+	 * If userland migrates such kthread to a non-root cgroup, it can
+	 * become trapped in a cpuset, or RT kthread may be born in a
+	 * cgroup with no rt_runtime allocated. Just say no.
 	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY) ||
+	    ((tsk->flags & PF_KTHREAD) && !kthread_initialized(tsk))) {
 		ret = -EINVAL;
 		goto out_unlock_rcu;
 	}
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -50,6 +50,7 @@ struct kthread {

 enum KTHREAD_BITS {
 	KTHREAD_IS_PER_CPU = 0,
+	KTHREAD_INITIALIZED,
 	KTHREAD_SHOULD_STOP,
 	KTHREAD_SHOULD_PARK,
 	KTHREAD_IS_PARKED,
@@ -57,7 +58,7 @@ enum KTHREAD_BITS {

 static inline void set_kthread_struct(void *kthread)
 {
-	/* paired with smp_read_data_barrier_depends() in to_kthread() */
+	/* paired with smp_read_barrier_depends() in to_kthread() */
 	smp_wmb();

 	/*
@@ -95,6 +96,23 @@ void free_kthread_struct(struct task_str
 }

 /**
+ * kthread_initialized - has the kthread finished initialization?
+ * @k: thread created by kthread_create().
+ *
+ * Test whether @k, which must be a kthread, finished initialization and is
+ * ready to execute the threadfn. The kthread owner finishes
+ * initialization by waking up the new kthread for the first time. If this
+ * function returns %false, the kthread owner could still be configuring
+ * the kthread.
+ */
+bool kthread_initialized(struct task_struct *k)
+{
+	struct kthread *kthread = to_kthread(k);
+
+	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
+}
+
+/**
  * kthread_should_stop - should this kthread return now?
  *
  * When someone calls kthread_stop() on your kthread, it will be woken
@@ -238,6 +256,7 @@ static int kthread(void *_create)
 	create->result = current;
 	complete(done);
 	schedule();
+	set_bit(KTHREAD_INITIALIZED, &self->flags);

 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
From: Oleg Nesterov @ 2017-03-16 15:02 UTC
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

On 03/15, Tejun Heo wrote:
>
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2425,11 +2425,13 @@ ssize_t __cgroup_procs_write(struct kern
>  	tsk = tsk->group_leader;
>
>  	/*
> -	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
> -	 * trapped in a cpuset, or RT worker may be born in a cgroup
> -	 * with no rt_runtime allocated. Just say no.
> +	 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
> +	 * If userland migrates such kthread to a non-root cgroup, it can
> +	 * become trapped in a cpuset, or RT kthread may be born in a
> +	 * cgroup with no rt_runtime allocated. Just say no.
>  	 */
> -	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> +	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY) ||
> +	    ((tsk->flags & PF_KTHREAD) && !kthread_initialized(tsk))) {
>  		ret = -EINVAL;

...

> +bool kthread_initialized(struct task_struct *k)
> +{
> +	struct kthread *kthread = to_kthread(k);
> +
> +	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> +}

Not sure I understand...

With this patch you can no longer migrate a kernel thread created by
kernel_thread() ? Note that to_kthread() is NULL unless it was created
by kthread_create().

Oleg.

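Editorial sketch of the objection above: a minimal userspace model, not kernel code. The names `v1_task`, `v1_kthread`, `v1_kthread_initialized()`, `v1_rejects()` and the flag values are hypothetical stand-ins chosen for illustration, and the `kthreadd_task` comparison is left out. It only mirrors the shape of the v1 condition to show why a kernel thread whose `to_kthread()` is NULL would be rejected on every attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; the real kernel constants differ. */
#define V1_PF_KTHREAD        0x1u
#define V1_PF_NO_SETAFFINITY 0x2u

/* Stand-in for struct kthread; 'initialized' models KTHREAD_INITIALIZED. */
struct v1_kthread {
	bool initialized;
};

/* Stand-in for task_struct; 'kthread' models what to_kthread() returns. */
struct v1_task {
	unsigned int flags;
	struct v1_kthread *kthread;	/* NULL unless made by kthread_create() */
};

/* Same shape as kthread_initialized() in the v1 patch. */
bool v1_kthread_initialized(const struct v1_task *t)
{
	return t->kthread && t->kthread->initialized;
}

/* Same shape as the v1 check in __cgroup_procs_write(); true means -EINVAL. */
bool v1_rejects(const struct v1_task *t)
{
	return (t->flags & V1_PF_NO_SETAFFINITY) ||
	       ((t->flags & V1_PF_KTHREAD) && !v1_kthread_initialized(t));
}

/* Walks through the three cases discussed in the reply. */
void v1_demo(void)
{
	struct v1_kthread k = { .initialized = false };
	struct v1_task created = { .flags = V1_PF_KTHREAD, .kthread = &k };
	struct v1_task legacy = { .flags = V1_PF_KTHREAD, .kthread = NULL };

	assert(v1_rejects(&created));	/* still being configured */
	k.initialized = true;
	assert(!v1_rejects(&created));	/* migratable after first wakeup */
	assert(v1_rejects(&legacy));	/* kernel_thread(): never migratable */
}
```

In this model the `legacy` task has no way to become migratable, which is the behavior change being questioned.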
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
From: Oleg Nesterov @ 2017-03-16 15:39 UTC
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

On 03/16, Oleg Nesterov wrote:
>
> On 03/15, Tejun Heo wrote:
> >
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -2425,11 +2425,13 @@ ssize_t __cgroup_procs_write(struct kern
> >  	tsk = tsk->group_leader;
> >
> >  	/*
> > -	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
> > -	 * trapped in a cpuset, or RT worker may be born in a cgroup
> > -	 * with no rt_runtime allocated. Just say no.
> > +	 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
> > +	 * If userland migrates such kthread to a non-root cgroup, it can
> > +	 * become trapped in a cpuset, or RT kthread may be born in a
> > +	 * cgroup with no rt_runtime allocated. Just say no.
> >  	 */
> > -	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> > +	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY) ||
> > +	    ((tsk->flags & PF_KTHREAD) && !kthread_initialized(tsk))) {
> >  		ret = -EINVAL;
>
> ...
>
> > +bool kthread_initialized(struct task_struct *k)
> > +{
> > +	struct kthread *kthread = to_kthread(k);
> > +
> > +	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> > +}
>
> Not sure I understand...
>
> With this patch you can no longer migrate a kernel thread created by
> kernel_thread() ? Note that to_kthread() is NULL unless it was created
> by kthread_create().

Either way, I am wondering if we can do something really trivial like
the patch below. This way we can also remove the "tsk == kthreadd_task"
check, and we do not need the barriers.

Oleg.

--- x/kernel/kthread.c
+++ x/kernel/kthread.c
@@ -226,6 +226,7 @@
 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
 		__kthread_parkme(self);
+		current->flags &= ~PF_IDONTLIKECGROUPS;
 		ret = threadfn(data);
 	}
 	do_exit(ret);
@@ -537,7 +538,7 @@
 	set_cpus_allowed_ptr(tsk, cpu_all_mask);
 	set_mems_allowed(node_states[N_MEMORY]);

-	current->flags |= PF_NOFREEZE;
+	current->flags |= (PF_NOFREEZE | PF_IDONTLIKECGROUPS);

 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
--- x/kernel/cgroup/cgroup.c
+++ x/kernel/cgroup/cgroup.c
@@ -2429,7 +2429,7 @@
 	 * trapped in a cpuset, or RT worker may be born in a cgroup
 	 * with no rt_runtime allocated. Just say no.
 	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+	if (tsk->flags & (PF_NO_SETAFFINITY | PF_IDONTLIKECGROUPS)) {
 		ret = -EINVAL;
 		goto out_unlock_rcu;
 	}

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
From: Tejun Heo @ 2017-03-16 16:07 UTC
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

Hello,

On Thu, Mar 16, 2017 at 04:39:26PM +0100, Oleg Nesterov wrote:
> Either way, I am wondering if we can do something really trivial like
> the patch below. This way we can also remove the "tsk == kthreadd_task"
> check, and we do not need the barriers.
>
> Oleg.
>
> --- x/kernel/kthread.c
> +++ x/kernel/kthread.c
> @@ -226,6 +226,7 @@
>  	ret = -EINTR;
>  	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
>  		__kthread_parkme(self);
> +		current->flags &= ~PF_IDONTLIKECGROUPS;
>  		ret = threadfn(data);
>  	}
>  	do_exit(ret);
> @@ -537,7 +538,7 @@
>  	set_cpus_allowed_ptr(tsk, cpu_all_mask);
>  	set_mems_allowed(node_states[N_MEMORY]);
>
> -	current->flags |= PF_NOFREEZE;
> +	current->flags |= (PF_NOFREEZE | PF_IDONTLIKECGROUPS);
>
>  	for (;;) {
>  		set_current_state(TASK_INTERRUPTIBLE);
> --- x/kernel/cgroup/cgroup.c
> +++ x/kernel/cgroup/cgroup.c
> @@ -2429,7 +2429,7 @@
>  	 * trapped in a cpuset, or RT worker may be born in a cgroup
>  	 * with no rt_runtime allocated. Just say no.
>  	 */
> -	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> +	if (tsk->flags & (PF_NO_SETAFFINITY | PF_IDONTLIKECGROUPS)) {
>  		ret = -EINVAL;
>  		goto out_unlock_rcu;
>  	}

Absolutely. If we're willing to spend a PF flag on it, we can
properly wait for it too instead of failing it.

Thanks.

--
tejun

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
From: Oleg Nesterov @ 2017-03-16 16:31 UTC
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

On 03/16, Tejun Heo wrote:
>
> > --- x/kernel/kthread.c
> > +++ x/kernel/kthread.c
> > @@ -226,6 +226,7 @@
> >  	ret = -EINTR;
> >  	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
> >  		__kthread_parkme(self);
> > +		current->flags &= ~PF_IDONTLIKECGROUPS;
> >  		ret = threadfn(data);
> >  	}
> >  	do_exit(ret);
> > @@ -537,7 +538,7 @@
> >  	set_cpus_allowed_ptr(tsk, cpu_all_mask);
> >  	set_mems_allowed(node_states[N_MEMORY]);
> >
> > -	current->flags |= PF_NOFREEZE;
> > +	current->flags |= (PF_NOFREEZE | PF_IDONTLIKECGROUPS);
> >
> >  	for (;;) {
> >  		set_current_state(TASK_INTERRUPTIBLE);
> > --- x/kernel/cgroup/cgroup.c
> > +++ x/kernel/cgroup/cgroup.c
> > @@ -2429,7 +2429,7 @@
> >  	 * trapped in a cpuset, or RT worker may be born in a cgroup
> >  	 * with no rt_runtime allocated. Just say no.
> >  	 */
> > -	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> > +	if (tsk->flags & (PF_NO_SETAFFINITY | PF_IDONTLIKECGROUPS)) {
> >  		ret = -EINVAL;
> >  		goto out_unlock_rcu;
> >  	}
>
> Absolutely. If we're willing to spend a PF flag on it, we can
> properly wait for it too instead of failing it.

Or we can add another "unsigned no_cgroups:1" bit into task_struct,
not sure.

Anyway, I do not understand the PF_NO_SETAFFINITY check in
__cgroup_procs_write(). task_can_attach() checks it too, so cgroups
can't change the affinity. Imo something explicit like no_cgroups
makes more sense.

Oleg.

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
From: Tejun Heo @ 2017-03-16 17:41 UTC
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

Hello,

On Thu, Mar 16, 2017 at 05:31:59PM +0100, Oleg Nesterov wrote:
> Or we can add another "unsigned no_cgroups:1" bit into task_struct,
> not sure.

To synchronize around initialization, a PF flag would be easier as we
can use wait_on_bit().

> Anyway, I do not understand the PF_NO_SETAFFINITY check in
> __cgroup_procs_write(). task_can_attach() checks it too, so cgroups
> can't change the affinity. Imo something explicit like no_cgroups
> makes more sense.

task_can_attach() predates the __cgroup_procs_write() and currently
doesn't do anything. We can split the flag or rename it so that it's
more generic. The reasons for disallowing cgroup migration have a lot
of cross-section with affinity, so it's not a complete misnomer.
Either way is fine by me.

Thanks.

--
tejun

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
From: Tejun Heo @ 2017-03-16 16:05 UTC
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

Hello,

On Thu, Mar 16, 2017 at 04:02:34PM +0100, Oleg Nesterov wrote:
> > +bool kthread_initialized(struct task_struct *k)
> > +{
> > +	struct kthread *kthread = to_kthread(k);
> > +
> > +	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> > +}
>
> Not sure I understand...
>
> With this patch you can no longer migrate a kernel thread created by
> kernel_thread() ? Note that to_kthread() is NULL unless it was created
> by kthread_create().

Yeah, what it does is preventing migration of kthreads until the
kthread owner wakes it up for the first time. The problem is that
kthread_bind() seals up future cgroup migrations from userland but
doesn't move back the kthread to the root cgroup, so the userland has
a window where it can mangle with cgroup membership in between and
break things.

The NULL test is there because the test may be performed before the
kthread itself sets up its struct kthread.

An alternative approach could be making kthread_bind() migrate the
kthread back to root cgroup, which btw is why affinity is fine as the
function overwrites it after setting NO_SETAFFINITY; however, the
problem there is that userland can put the kthread into !root cgroup
and starve it before it reaches create->done.

Thanks.

--
tejun

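Editorial sketch of the window described above: a deterministic userspace model of the pre-patch behavior, not kernel code. `race_task`, `race_userland_migrate()`, `race_kthread_bind()` and the flag value are hypothetical names for illustration; the model assumes, as the message states, that `kthread_bind()` seals further migration without moving the task back to the root cgroup.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value; the real kernel constant differs. */
#define RACE_PF_NO_SETAFFINITY 0x1u

enum race_cgroup { RACE_CGROUP_ROOT, RACE_CGROUP_NON_ROOT };

/* Stand-in for the parts of task_struct that matter here. */
struct race_task {
	unsigned int flags;
	enum race_cgroup cgroup;
};

/*
 * Pre-patch migration check: only PF_NO_SETAFFINITY blocks the move
 * (the kthreadd_task comparison is omitted from this model).
 * Returns true when the migration went through.
 */
bool race_userland_migrate(struct race_task *t)
{
	if (t->flags & RACE_PF_NO_SETAFFINITY)
		return false;
	t->cgroup = RACE_CGROUP_NON_ROOT;
	return true;
}

/* Seals future migrations but leaves the current cgroup untouched. */
void race_kthread_bind(struct race_task *t)
{
	t->flags |= RACE_PF_NO_SETAFFINITY;
}

/* Runs both orderings of creator and userland. */
void race_demo(void)
{
	struct race_task ok = { 0, RACE_CGROUP_ROOT };
	struct race_task raced = { 0, RACE_CGROUP_ROOT };

	/* Creator binds first: userland is refused, task stays in root. */
	race_kthread_bind(&ok);
	assert(!race_userland_migrate(&ok));
	assert(ok.cgroup == RACE_CGROUP_ROOT);

	/* Userland wins: the move succeeds and the later bind locks it in. */
	assert(race_userland_migrate(&raced));
	race_kthread_bind(&raced);
	assert(raced.cgroup == RACE_CGROUP_NON_ROOT);
	assert(!race_userland_migrate(&raced));
}
```

The second ordering is the failure case: once the flag is set, the task is stuck in the non-root cgroup.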
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
From: Oleg Nesterov @ 2017-03-16 16:17 UTC
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

On 03/16, Tejun Heo wrote:
>
> Hello,
>
> On Thu, Mar 16, 2017 at 04:02:34PM +0100, Oleg Nesterov wrote:
> > > +bool kthread_initialized(struct task_struct *k)
> > > +{
> > > +	struct kthread *kthread = to_kthread(k);
> > > +
> > > +	return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> > > +}
> >
> > Not sure I understand...
> >
> > With this patch you can no longer migrate a kernel thread created by
> > kernel_thread() ? Note that to_kthread() is NULL unless it was created
> > by kthread_create().
>
> Yeah, what it does is preventing migration of kthreads until the
> kthread owner wakes it up for the first time. The problem is that
> kthread_bind() seals up future cgroup migrations from userland but
> doesn't move back the kthread to the root cgroup, so the userland has
> a window where it can mangle with cgroup membership in between and
> break things.

This is clear,

> The NULL test is there because the test may be performed before the
> kthread itself sets up its struct kthread.

This too.

But this also means that __cgroup_procs_write() will always fail if
this task is a kernel thread which was not created by kthread_create().

Currently you can use kernel_thread() (although you shouldn't) and it
can be migrated, this won't work after your patch.

Oleg.

* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
From: Tejun Heo @ 2017-03-16 17:03 UTC
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

Hello,

On Thu, Mar 16, 2017 at 05:17:57PM +0100, Oleg Nesterov wrote:
> But this also means that __cgroup_procs_write() will always fail if
> this task is a kernel thread which was not created by kthread_create().
>
> Currently you can use kernel_thread() (although you shouldn't) and it
> can be migrated, this won't work after your patch.

I see what you mean now. The only users seem to be init/main.c and
kernel/kmod.c. I'll see if there's a way we can do this only for
kthread_create() users.

Thanks.

--
tejun

* [PATCH v2] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
From: Tejun Heo @ 2017-03-16 20:54 UTC
To: Oleg Nesterov, Linus Torvalds, Andrew Morton
Cc: Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

Hello,

I tried a couple variants but Oleg's suggestion turns out to be the
simplest. This patch doesn't require the first barrier patch.

Oleg, if you're okay with the patch, I can route this through
cgroup/for-4.11-fixes.

Thanks!

------ 8< ------
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrect CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
---
 include/linux/cgroup.h |   21 +++++++++++++++++++++
 include/linux/sched.h  |    4 ++++
 kernel/cgroup/cgroup.c |    9 +++++----
 kernel/kthread.c       |    3 +++
 4 files changed, 33 insertions(+), 4 deletions(-)

--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -570,6 +570,25 @@ static inline void pr_cont_cgroup_path(s
 	pr_cont_kernfs_path(cgrp->kn);
 }

+static inline void cgroup_init_kthreadd(void)
+{
+	/*
+	 * kthreadd is inherited by all kthreads, keep it in the root so
+	 * that the new kthreads are guaranteed to stay in the root until
+	 * initialization is finished.
+	 */
+	current->no_cgroup_migration = 1;
+}
+
+static inline void cgroup_kthread_ready(void)
+{
+	/*
+	 * This kthread finished initialization. The creator should have
+	 * set PF_NO_SETAFFINITY if this kthread should stay in the root.
+	 */
+	current->no_cgroup_migration = 0;
+}
+
 #else /* !CONFIG_CGROUPS */

 struct cgroup_subsys_state;
@@ -590,6 +609,8 @@ static inline void cgroup_free(struct ta

 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
+static inline void cgroup_init_kthreadd(void) {}
+static inline void cgroup_kthread_ready(void) {}

 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 					       struct cgroup *ancestor)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -604,6 +604,10 @@ struct task_struct {
 #ifdef CONFIG_COMPAT_BRK
 	unsigned brk_randomized:1;
 #endif
+#ifdef CONFIG_CGROUPS
+	/* disallow userland-initiated cgroup migration */
+	unsigned no_cgroup_migration:1;
+#endif

 	unsigned long atomic_flags; /* Flags requiring atomic access. */

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2425,11 +2425,12 @@ ssize_t __cgroup_procs_write(struct kern
 	tsk = tsk->group_leader;

 	/*
-	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
-	 * trapped in a cpuset, or RT worker may be born in a cgroup
-	 * with no rt_runtime allocated. Just say no.
+	 * kthreads may acquire PF_NO_SETAFFINITY during initialization.
+	 * If userland migrates such a kthread to a non-root cgroup, it can
+	 * become trapped in a cpuset, or RT kthread may be born in a
+	 * cgroup with no rt_runtime allocated. Just say no.
 	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+	if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
 		ret = -EINVAL;
 		goto out_unlock_rcu;
 	}
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -20,6 +20,7 @@
 #include <linux/freezer.h>
 #include <linux/ptrace.h>
 #include <linux/uaccess.h>
+#include <linux/cgroup.h>
 #include <trace/events/sched.h>

 static DEFINE_SPINLOCK(kthread_create_lock);
@@ -225,6 +226,7 @@ static int kthread(void *_create)

 	ret = -EINTR;
 	if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
+		cgroup_kthread_ready();
 		__kthread_parkme(self);
 		ret = threadfn(data);
 	}
@@ -538,6 +540,7 @@ int kthreadd(void *unused)
 	set_mems_allowed(node_states[N_MEMORY]);

 	current->flags |= PF_NOFREEZE;
+	cgroup_init_kthreadd();

 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);

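Editorial sketch of the v2 lifecycle: a userspace model, not kernel code. The `v2_`-prefixed names and the flag value are hypothetical; the real helpers act on `current`, whereas this model takes an explicit task pointer, and `v2_fork()` simply copies the parent to stand in for the bit being inherited from kthreadd as the patch comment describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag value; the real kernel constant differs. */
#define V2_PF_NO_SETAFFINITY 0x1u

/* Stand-in for the relevant task_struct fields. */
struct v2_task {
	unsigned int flags;
	unsigned int no_cgroup_migration:1;
};

/* Models cgroup_init_kthreadd(): kthreadd itself is pinned to the root. */
void v2_cgroup_init_kthreadd(struct v2_task *kthreadd)
{
	kthreadd->no_cgroup_migration = 1;
}

/* Models a new kthread inheriting its state from kthreadd. */
struct v2_task v2_fork(const struct v2_task *parent)
{
	return *parent;
}

/* Models kthread_bind() setting PF_NO_SETAFFINITY during configuration. */
void v2_kthread_bind(struct v2_task *t)
{
	t->flags |= V2_PF_NO_SETAFFINITY;
}

/* Models cgroup_kthread_ready(): called right before threadfn runs. */
void v2_cgroup_kthread_ready(struct v2_task *self)
{
	self->no_cgroup_migration = 0;
}

/* Same shape as the v2 check in __cgroup_procs_write(); true means -EINVAL. */
bool v2_migration_rejected(const struct v2_task *t)
{
	return t->no_cgroup_migration || (t->flags & V2_PF_NO_SETAFFINITY);
}

/* Follows one unbound and one bound kthread through creation. */
void v2_demo(void)
{
	struct v2_task kthreadd = { 0, 0 };
	struct v2_task plain, bound;

	v2_cgroup_init_kthreadd(&kthreadd);
	plain = v2_fork(&kthreadd);
	bound = v2_fork(&kthreadd);

	/* During configuration both are protected, bound or not. */
	assert(v2_migration_rejected(&plain));
	assert(v2_migration_rejected(&bound));

	v2_kthread_bind(&bound);
	v2_cgroup_kthread_ready(&plain);
	v2_cgroup_kthread_ready(&bound);

	assert(!v2_migration_rejected(&plain));	/* free to move afterwards */
	assert(v2_migration_rejected(&bound));	/* held by PF_NO_SETAFFINITY */
	assert(v2_migration_rejected(&kthreadd));	/* kthreadd never clears it */
}
```

In this model there is no point at which a bound kthread is migratable, and the explicit `tsk == kthreadd_task` comparison becomes unnecessary because kthreadd keeps the bit set.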
* Re: [PATCH v2] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
From: Oleg Nesterov @ 2017-03-17 13:50 UTC
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

On 03/16, Tejun Heo wrote:
>
> Oleg,
> if you're okay with the patch, I can route this through
> cgroup/for-4.11-fixes.

Thanks, looks good to me.

Oleg.

* Re: [PATCH v2] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
From: Tejun Heo @ 2017-03-17 14:44 UTC
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner, cgroups

On Fri, Mar 17, 2017 at 02:50:21PM +0100, Oleg Nesterov wrote:
> On 03/16, Tejun Heo wrote:
> >
> > Oleg,
> > if you're okay with the patch, I can route this through
> > cgroup/for-4.11-fixes.
>
> Thanks, looks good to me.

Applied to cgroup/for-4.11-fixes.

Thanks.

--
tejun