* [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
[not found] <20170315231827.GA13656@htj.duckdns.org>
@ 2017-03-15 23:19 ` Tejun Heo
2017-03-16 15:02 ` Oleg Nesterov
[not found] ` <20170315231920.GB13656-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2017-03-15 23:19 UTC (permalink / raw)
To: Oleg Nesterov, Linus Torvalds, Andrew Morton
Cc: Peter Zijlstra, Thomas Gleixner, Chris Mason, linux-kernel,
kernel-team, Li Zefan, Johannes Weiner, cgroups
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch introduces KTHREAD_INITIALIZED which is set after the
kthread finishes initialization. cgroup core closes the race window
by testing kthread_initialized() and rejecting migration accordingly.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race < v4.3)
---
include/linux/kthread.h | 1 +
kernel/cgroup/cgroup.c | 10 ++++++----
kernel/kthread.c | 21 ++++++++++++++++++++-
3 files changed, 27 insertions(+), 5 deletions(-)
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -49,6 +49,7 @@ struct task_struct *kthread_create_on_cp
})
void free_kthread_struct(struct task_struct *k);
+bool kthread_initialized(struct task_struct *k);
void kthread_bind(struct task_struct *k, unsigned int cpu);
void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
int kthread_stop(struct task_struct *k);
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2425,11 +2425,13 @@ ssize_t __cgroup_procs_write(struct kern
tsk = tsk->group_leader;
/*
- * Workqueue threads may acquire PF_NO_SETAFFINITY and become
- * trapped in a cpuset, or RT worker may be born in a cgroup
- * with no rt_runtime allocated. Just say no.
+ * kthreads may acquire PF_NO_SETAFFINITY during initialization.
+ * If userland migrates such kthread to a non-root cgroup, it can
+ * become trapped in a cpuset, or RT kthread may be born in a
+ * cgroup with no rt_runtime allocated. Just say no.
*/
- if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+ if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY) ||
+ ((tsk->flags & PF_KTHREAD) && !kthread_initialized(tsk))) {
ret = -EINVAL;
goto out_unlock_rcu;
}
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -50,6 +50,7 @@ struct kthread {
enum KTHREAD_BITS {
KTHREAD_IS_PER_CPU = 0,
+ KTHREAD_INITIALIZED,
KTHREAD_SHOULD_STOP,
KTHREAD_SHOULD_PARK,
KTHREAD_IS_PARKED,
@@ -57,7 +58,7 @@ enum KTHREAD_BITS {
static inline void set_kthread_struct(void *kthread)
{
- /* paired with smp_read_data_barrier_depends() in to_kthread() */
+ /* paired with smp_read_barrier_depends() in to_kthread() */
smp_wmb();
/*
@@ -95,6 +96,23 @@ void free_kthread_struct(struct task_str
}
/**
+ * kthread_initialized - has the kthread finished initialization?
+ * @k: thread created by kthread_create().
+ *
+ * Test whether @k, which must be a kthread, finished initialization and is
+ * ready to execute the threadfn. The kthread owner finishes
+ * initialization by waking up the new kthread for the first time. If this
+ * function returns %false, the kthread owner could still be configuring
+ * the kthread.
+ */
+bool kthread_initialized(struct task_struct *k)
+{
+ struct kthread *kthread = to_kthread(k);
+
+ return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
+}
+
+/**
* kthread_should_stop - should this kthread return now?
*
* When someone calls kthread_stop() on your kthread, it will be woken
@@ -238,6 +256,7 @@ static int kthread(void *_create)
create->result = current;
complete(done);
schedule();
+ set_bit(KTHREAD_INITIALIZED, &self->flags);
ret = -EINTR;
if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
2017-03-15 23:19 ` [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups Tejun Heo
@ 2017-03-16 15:02 ` Oleg Nesterov
[not found] ` <20170316150233.GB24478-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-03-16 16:05 ` Tejun Heo
[not found] ` <20170315231920.GB13656-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
1 sibling, 2 replies; 12+ messages in thread
From: Oleg Nesterov @ 2017-03-16 15:02 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner,
cgroups
On 03/15, Tejun Heo wrote:
>
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -2425,11 +2425,13 @@ ssize_t __cgroup_procs_write(struct kern
> tsk = tsk->group_leader;
>
> /*
> - * Workqueue threads may acquire PF_NO_SETAFFINITY and become
> - * trapped in a cpuset, or RT worker may be born in a cgroup
> - * with no rt_runtime allocated. Just say no.
> + * kthreads may acquire PF_NO_SETAFFINITY during initialization.
> + * If userland migrates such kthread to a non-root cgroup, it can
> + * become trapped in a cpuset, or RT kthread may be born in a
> + * cgroup with no rt_runtime allocated. Just say no.
> */
> - if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> + if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY) ||
> + ((tsk->flags & PF_KTHREAD) && !kthread_initialized(tsk))) {
> ret = -EINVAL;
...
> +bool kthread_initialized(struct task_struct *k)
> +{
> + struct kthread *kthread = to_kthread(k);
> +
> + return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> +}
Not sure I understand...
With this patch you can no longer migrate a kernel thread created by
kernel_thread() ? Note that to_kthread() is NULL unless it was created
by kthread_create().
Oleg.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
[not found] ` <20170316150233.GB24478-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-03-16 15:39 ` Oleg Nesterov
[not found] ` <20170316153925.GA26391-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2017-03-16 15:39 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-team-b10kYP2dOMg, Li Zefan, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On 03/16, Oleg Nesterov wrote:
>
> On 03/15, Tejun Heo wrote:
> >
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -2425,11 +2425,13 @@ ssize_t __cgroup_procs_write(struct kern
> > tsk = tsk->group_leader;
> >
> > /*
> > - * Workqueue threads may acquire PF_NO_SETAFFINITY and become
> > - * trapped in a cpuset, or RT worker may be born in a cgroup
> > - * with no rt_runtime allocated. Just say no.
> > + * kthreads may acquire PF_NO_SETAFFINITY during initialization.
> > + * If userland migrates such kthread to a non-root cgroup, it can
> > + * become trapped in a cpuset, or RT kthread may be born in a
> > + * cgroup with no rt_runtime allocated. Just say no.
> > */
> > - if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> > + if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY) ||
> > + ((tsk->flags & PF_KTHREAD) && !kthread_initialized(tsk))) {
> > ret = -EINVAL;
>
> ...
>
> > +bool kthread_initialized(struct task_struct *k)
> > +{
> > + struct kthread *kthread = to_kthread(k);
> > +
> > + return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> > +}
>
> Not sure I understand...
>
> With this patch you can no longer migrate a kernel thread created by
> kernel_thread() ? Note that to_kthread() is NULL unless it was created
> by kthread_create().
Either way, I am wondering if we can do something really trivial like
the patch below. This way we can also remove the "tsk == kthreadd_task"
check, and we do not need the barriers.
Oleg.
--- x/kernel/kthread.c
+++ x/kernel/kthread.c
@@ -226,6 +226,7 @@
ret = -EINTR;
if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
__kthread_parkme(self);
+ current->flags &= ~PF_IDONTLIKECGROUPS;
ret = threadfn(data);
}
do_exit(ret);
@@ -537,7 +538,7 @@
set_cpus_allowed_ptr(tsk, cpu_all_mask);
set_mems_allowed(node_states[N_MEMORY]);
- current->flags |= PF_NOFREEZE;
+ current->flags |= (PF_NOFREEZE | PF_IDONTLIKECGROUPS);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
--- x/kernel/cgroup/cgroup.c
+++ x/kernel/cgroup/cgroup.c
@@ -2429,7 +2429,7 @@
* trapped in a cpuset, or RT worker may be born in a cgroup
* with no rt_runtime allocated. Just say no.
*/
- if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+ if (tsk->flags & (PF_NO_SETAFFINITY | PF_IDONTLIKECGROUPS)) {
ret = -EINVAL;
goto out_unlock_rcu;
}
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
2017-03-16 15:02 ` Oleg Nesterov
[not found] ` <20170316150233.GB24478-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-03-16 16:05 ` Tejun Heo
[not found] ` <20170316160544.GC15810-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
1 sibling, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2017-03-16 16:05 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner,
cgroups
Hello,
On Thu, Mar 16, 2017 at 04:02:34PM +0100, Oleg Nesterov wrote:
> > +bool kthread_initialized(struct task_struct *k)
> > +{
> > + struct kthread *kthread = to_kthread(k);
> > +
> > + return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> > +}
>
> Not sure I understand...
>
> With this patch you can no longer migrate a kernel thread created by
> kernel_thread() ? Note that to_kthread() is NULL unless it was created
> by kthread_create().
Yeah, what it does is preventing migration of kthreads until the
kthread owner wakes it up for the first time. The problem is that
kthread_bind() seals up future cgroup migrations from userland but
doesn't move back the kthread to the root cgroup, so the userland has
a window where it can mangle with cgroup membership inbetween and
break things.
The NULL test is there because the test may be performed before the
kthread itself sets up its struct kthread.
An alternative approach could be making kthread_bind() migrate the
kthread back to root cgroup, which btw is why affinity is fine as the
function overwrites it after setting NO_SETAFFINITY; however, the
problem there is that userland can put the kthread into !root cgroup
and starve it before it reaches create->done.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
[not found] ` <20170316153925.GA26391-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-03-16 16:07 ` Tejun Heo
[not found] ` <20170316160734.GD15810-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2017-03-16 16:07 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-team-b10kYP2dOMg, Li Zefan, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
Hello,
On Thu, Mar 16, 2017 at 04:39:26PM +0100, Oleg Nesterov wrote:
> Either way, I am wondering if we can do something really trivial like
> the patch below. This way we can also remove the "tsk == kthreadd_task"
> check, and we do not need the barriers.
>
> Oleg.
>
> --- x/kernel/kthread.c
> +++ x/kernel/kthread.c
> @@ -226,6 +226,7 @@
> ret = -EINTR;
> if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
> __kthread_parkme(self);
> + current->flags &= ~PF_IDONTLIKECGROUPS;
> ret = threadfn(data);
> }
> do_exit(ret);
> @@ -537,7 +538,7 @@
> set_cpus_allowed_ptr(tsk, cpu_all_mask);
> set_mems_allowed(node_states[N_MEMORY]);
>
> - current->flags |= PF_NOFREEZE;
> + current->flags |= (PF_NOFREEZE | PF_IDONTLIKECGROUPS);
>
> for (;;) {
> set_current_state(TASK_INTERRUPTIBLE);
> --- x/kernel/cgroup/cgroup.c
> +++ x/kernel/cgroup/cgroup.c
> @@ -2429,7 +2429,7 @@
> * trapped in a cpuset, or RT worker may be born in a cgroup
> * with no rt_runtime allocated. Just say no.
> */
> - if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> + if (tsk->flags & (PF_NO_SETAFFINITY | PF_IDONTLIKECGROUPS)) {
> ret = -EINVAL;
> goto out_unlock_rcu;
> }
Absolutely. If we're willing to spend a PF flag on it, we can
properly wait for it too instead of failing it.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
[not found] ` <20170316160544.GC15810-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2017-03-16 16:17 ` Oleg Nesterov
2017-03-16 17:03 ` Tejun Heo
0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2017-03-16 16:17 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-team-b10kYP2dOMg, Li Zefan, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On 03/16, Tejun Heo wrote:
>
> Hello,
>
> On Thu, Mar 16, 2017 at 04:02:34PM +0100, Oleg Nesterov wrote:
> > > +bool kthread_initialized(struct task_struct *k)
> > > +{
> > > + struct kthread *kthread = to_kthread(k);
> > > +
> > > + return kthread && test_bit(KTHREAD_INITIALIZED, &kthread->flags);
> > > +}
> >
> > Not sure I understand...
> >
> > With this patch you can no longer migrate a kernel thread created by
> > kernel_thread() ? Note that to_kthread() is NULL unless it was created
> > by kthread_create().
>
> Yeah, what it does is preventing migration of kthreads until the
> kthread owner wakes it up for the first time. The problem is that
> kthread_bind() seals up future cgroup migrations from userland but
> doesn't move back the kthread to the root cgroup, so the userland has
> a window where it can mangle with cgroup membership inbetween and
> break things.
This is clear,
> The NULL test is there because the test may be performed before the
> kthread itself sets up its struct kthread.
This too.
But this also means that __cgroup_procs_write() will always fail if
this task is a kernel thread which was not created by kthread_create().
Currently you can use kernel_thread() (although you shouldn't) and it
can be migrated, this won't work after your patch.
Oleg.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
[not found] ` <20170316160734.GD15810-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2017-03-16 16:31 ` Oleg Nesterov
2017-03-16 17:41 ` Tejun Heo
0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2017-03-16 16:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
kernel-team-b10kYP2dOMg, Li Zefan, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
On 03/16, Tejun Heo wrote:
>
> > --- x/kernel/kthread.c
> > +++ x/kernel/kthread.c
> > @@ -226,6 +226,7 @@
> > ret = -EINTR;
> > if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
> > __kthread_parkme(self);
> > + current->flags &= ~PF_IDONTLIKECGROUPS;
> > ret = threadfn(data);
> > }
> > do_exit(ret);
> > @@ -537,7 +538,7 @@
> > set_cpus_allowed_ptr(tsk, cpu_all_mask);
> > set_mems_allowed(node_states[N_MEMORY]);
> >
> > - current->flags |= PF_NOFREEZE;
> > + current->flags |= (PF_NOFREEZE | PF_IDONTLIKECGROUPS);
> >
> > for (;;) {
> > set_current_state(TASK_INTERRUPTIBLE);
> > --- x/kernel/cgroup/cgroup.c
> > +++ x/kernel/cgroup/cgroup.c
> > @@ -2429,7 +2429,7 @@
> > * trapped in a cpuset, or RT worker may be born in a cgroup
> > * with no rt_runtime allocated. Just say no.
> > */
> > - if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
> > + if (tsk->flags & (PF_NO_SETAFFINITY | PF_IDONTLIKECGROUPS)) {
> > ret = -EINVAL;
> > goto out_unlock_rcu;
> > }
>
> Absolutely. If we're willing to spend a PF flag on it, we can
> properly wait for it too instead of failing it.
Or we can add another "unsigned no_cgroups:1" bit into task_struct,
not sure.
Anyway, I do not understand the PF_NO_SETAFFINITY check in
__cgroup_procs_write(). task_can_attach() checks it too, so cgroups
can't change the affinity. Imo something explicit like no_cgroups
makes more sense.
Oleg.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
2017-03-16 16:17 ` Oleg Nesterov
@ 2017-03-16 17:03 ` Tejun Heo
0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2017-03-16 17:03 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner,
cgroups
Hello,
On Thu, Mar 16, 2017 at 05:17:57PM +0100, Oleg Nesterov wrote:
> But this also means that __cgroup_procs_write() will always fail if
> this task is a kernel thread which was not created by kthread_create().
>
> Currently you can use kernel_thread() (although you shouldn't) and it
> can be migrated, this won't work after your patch.
I see what you mean now. The only users seem to be init/main.c and
kernel/kmod.c. I'll see if there's a way we can do this only for
kthread_create() users.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups
2017-03-16 16:31 ` Oleg Nesterov
@ 2017-03-16 17:41 ` Tejun Heo
0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2017-03-16 17:41 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner,
cgroups
Hello,
On Thu, Mar 16, 2017 at 05:31:59PM +0100, Oleg Nesterov wrote:
> Or we can add another "unsigned no_cgroups:1" bit into task_struct,
> not sure.
To synchronize around initialization, a PF flag would be easier as we
can use wait_on_bit().
> Anyway, I do not understand the PF_NO_SETAFFINITY check in
> __cgroup_procs_write(). task_can_attach() checks it too, so cgroups
> can't change the affinity. Imo something explicit like no_cgroups
> makes more sense.
task_can_attach() predates the __cgroup_procs_write() and currently
doesn't do anything. We can split the flag or rename it so that it's
more generic. The reasons for disallowing cgroup migration have a lot
of crosssection with affinity, so it's not a complete misnomer.
Either way is fine by me.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
[not found] ` <20170315231920.GB13656-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
@ 2017-03-16 20:54 ` Tejun Heo
2017-03-17 13:50 ` Oleg Nesterov
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2017-03-16 20:54 UTC (permalink / raw)
To: Oleg Nesterov, Linus Torvalds, Andrew Morton
Cc: Peter Zijlstra, Thomas Gleixner, Chris Mason,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-team-b10kYP2dOMg,
Li Zefan, Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA
Hello,
I tried a couple variants but Oleg's suggestion turns out to be the
simplest. This patch doesn't require the first barrier patch. Oleg,
if you're okay with the patch, I can route this through
cgroup/for-4.11-fixes.
Thanks!
------ 8< ------
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Suggested-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Peter Zijlstra (Intel) <peterz-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
Cc: Thomas Gleixner <tglx-hfZtesqFncYOwBW4kG4KsQ@public.gmane.org>
Reported-and-debugged-by: Chris Mason <clm-b10kYP2dOMg@public.gmane.org>
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org # v4.3+ (we can't close the race on < v4.3)
---
include/linux/cgroup.h | 21 +++++++++++++++++++++
include/linux/sched.h | 4 ++++
kernel/cgroup/cgroup.c | 9 +++++----
kernel/kthread.c | 3 +++
4 files changed, 33 insertions(+), 4 deletions(-)
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -570,6 +570,25 @@ static inline void pr_cont_cgroup_path(s
pr_cont_kernfs_path(cgrp->kn);
}
+static inline void cgroup_init_kthreadd(void)
+{
+ /*
+ * kthreadd is inherited by all kthreads, keep it in the root so
+ * that the new kthreads are guaranteed to stay in the root until
+ * initialization is finished.
+ */
+ current->no_cgroup_migration = 1;
+}
+
+static inline void cgroup_kthread_ready(void)
+{
+ /*
+ * This kthread finished initialization. The creator should have
+ * set PF_NO_SETAFFINITY if this kthread should stay in the root.
+ */
+ current->no_cgroup_migration = 0;
+}
+
#else /* !CONFIG_CGROUPS */
struct cgroup_subsys_state;
@@ -590,6 +609,8 @@ static inline void cgroup_free(struct ta
static inline int cgroup_init_early(void) { return 0; }
static inline int cgroup_init(void) { return 0; }
+static inline void cgroup_init_kthreadd(void) {}
+static inline void cgroup_kthread_ready(void) {}
static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
struct cgroup *ancestor)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -604,6 +604,10 @@ struct task_struct {
#ifdef CONFIG_COMPAT_BRK
unsigned brk_randomized:1;
#endif
+#ifdef CONFIG_CGROUPS
+ /* disallow userland-initiated cgroup migration */
+ unsigned no_cgroup_migration:1;
+#endif
unsigned long atomic_flags; /* Flags requiring atomic access. */
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2425,11 +2425,12 @@ ssize_t __cgroup_procs_write(struct kern
tsk = tsk->group_leader;
/*
- * Workqueue threads may acquire PF_NO_SETAFFINITY and become
- * trapped in a cpuset, or RT worker may be born in a cgroup
- * with no rt_runtime allocated. Just say no.
+ * kthreads may acquire PF_NO_SETAFFINITY during initialization.
+ * If userland migrates such a kthread to a non-root cgroup, it can
+ * become trapped in a cpuset, or RT kthread may be born in a
+ * cgroup with no rt_runtime allocated. Just say no.
*/
- if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
+ if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
ret = -EINVAL;
goto out_unlock_rcu;
}
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -20,6 +20,7 @@
#include <linux/freezer.h>
#include <linux/ptrace.h>
#include <linux/uaccess.h>
+#include <linux/cgroup.h>
#include <trace/events/sched.h>
static DEFINE_SPINLOCK(kthread_create_lock);
@@ -225,6 +226,7 @@ static int kthread(void *_create)
ret = -EINTR;
if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
+ cgroup_kthread_ready();
__kthread_parkme(self);
ret = threadfn(data);
}
@@ -538,6 +540,7 @@ int kthreadd(void *unused)
set_mems_allowed(node_states[N_MEMORY]);
current->flags |= PF_NOFREEZE;
+ cgroup_init_kthreadd();
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
2017-03-16 20:54 ` [PATCH v2] cgroup, kthread: " Tejun Heo
@ 2017-03-17 13:50 ` Oleg Nesterov
2017-03-17 14:44 ` Tejun Heo
0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2017-03-17 13:50 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner,
cgroups
On 03/16, Tejun Heo wrote:
>
> Oleg,
> if you're okay with the patch, I can route this through
> cgroup/for-4.11-fixes.
Thanks, looks good to me.
Oleg.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2] cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
2017-03-17 13:50 ` Oleg Nesterov
@ 2017-03-17 14:44 ` Tejun Heo
0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2017-03-17 14:44 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Thomas Gleixner,
Chris Mason, linux-kernel, kernel-team, Li Zefan, Johannes Weiner,
cgroups
On Fri, Mar 17, 2017 at 02:50:21PM +0100, Oleg Nesterov wrote:
> On 03/16, Tejun Heo wrote:
> >
> > Oleg,
> > if you're okay with the patch, I can route this through
> > cgroup/for-4.11-fixes.
>
> Thanks, looks good to me.
Applied to cgroup/for-4.11-fixes.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2017-03-17 14:44 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20170315231827.GA13656@htj.duckdns.org>
2017-03-15 23:19 ` [PATCH 2/2] kthread, cgroup: close race window where new kthreads can be migrated to non-root cgroups Tejun Heo
2017-03-16 15:02 ` Oleg Nesterov
[not found] ` <20170316150233.GB24478-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-03-16 15:39 ` Oleg Nesterov
[not found] ` <20170316153925.GA26391-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-03-16 16:07 ` Tejun Heo
[not found] ` <20170316160734.GD15810-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2017-03-16 16:31 ` Oleg Nesterov
2017-03-16 17:41 ` Tejun Heo
2017-03-16 16:05 ` Tejun Heo
[not found] ` <20170316160544.GC15810-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2017-03-16 16:17 ` Oleg Nesterov
2017-03-16 17:03 ` Tejun Heo
[not found] ` <20170315231920.GB13656-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2017-03-16 20:54 ` [PATCH v2] cgroup, kthread: " Tejun Heo
2017-03-17 13:50 ` Oleg Nesterov
2017-03-17 14:44 ` Tejun Heo
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).