* Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations
From: Suren Baghdasaryan @ 2020-01-10 21:47 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Koutný, Johannes Weiner, Topi Miettinen, Li Zefan,
cgroups mailinglist, security-8fiUuRrzOP0dnm+yROfE0A,
Security Officers, Lennart Poettering, Oleg Nesterov,
Eric W. Biederman, james.hsu-NuS5LvNUpcJWk0Htik3J/w,
linger.lee-NuS5LvNUpcJWk0Htik3J/w, Tom Cherry, Roman Gushchin
Hi Tejun,
On Tue, Jun 11, 2019 at 12:10 PM Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> Hello, Michal.
>
> On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote:
> > Wouldn't it make more sense to call
> > css_set_move_task(tsk, cset, NULL, false);
> > in cgroup_release instead of cgroup_exit then?
> >
> > css_set_move_task triggers the cgroup emptiness notification so if we
> > list group leaders with running siblings as members of the cgroup (IMO
> > correct), is it consistent to deliver the (un)populated event
> > that early?
> > By moving to cgroup_release we would also make this notification
> > analogous to SIGCHLD delivery.
>
> So, the current behavior is mostly historical and I don't think this
> is a better behavior. That said, I'm not sure about the cost benefit
> ratio of changing the behavior at this point given how long the code
> has been this way. Another aspect is that it'd expose zombie tasks
> and possibly migrations of them to each controller.
Sorry for reviving an old discussion, but I got a bug report from a
customer about Android occasionally failing to remove a cgroup after
it killed all processes in it. AndroidManagerService (a privileged
user space process) reads cgroup.procs and sends SIGKILL to all listed
processes in that cgroup until cgroup.procs is empty, at which point
it tries to delete the cgroup directory (it is a leaf cgroup with
empty cgroup.procs). In my testing I was able to reproduce the failure in
which case rmdir() fails with EBUSY from cgroup_destroy_locked()
because cgroup_is_populated() returns true. cgroup_is_populated()
depends on cgrp->nr_populated_xxx counters which IIUC will not be
updated until cgroup_update_populated() is called from cgroup_exit()
and that might get delayed... Meanwhile the tasks counted by
cgrp->nr_populated_xxx will not show up when cgroup.procs or
cgroup.tasks/tasks files are read by user space due to these changes:
+		/* and dying leaders w/o live member threads */
+		if (!atomic_read(&task->signal->live))
+			goto repeat;
+	} else {
+		/* skip all dying ones */
+		if (task->flags & PF_EXITING)
+			goto repeat;
+	}
After removing these checks cgroup.procs shows the processes up until
they are dead and the issue is gone.
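To make the effect of those two checks concrete, here is a small userspace model (not the kernel code; the struct fields and function names are made up for illustration) of how the iterator filters tasks, next to a populated check that only drops once cgroup_exit() has run:

```c
#include <assert.h>

/* Hypothetical userspace model of a task, NOT the kernel's task_struct. */
struct model_task {
	int is_leader;        /* thread-group leader? */
	int pf_exiting;       /* PF_EXITING already set */
	int signal_live;      /* models atomic_read(&task->signal->live) */
	int did_cgroup_exit;  /* reached cgroup_exit() already? */
};

/* Mirrors the quoted checks: a PROCS iteration skips dying leaders
 * w/o live member threads; a THREADS iteration skips all dying ones. */
static int procs_lists(const struct model_task *t)
{
	return t->is_leader && t->signal_live != 0;
}

static int threads_lists(const struct model_task *t)
{
	return !t->pf_exiting;
}

/* cgroup_is_populated() only cares whether cgroup_exit() ran yet. */
static int populated(const struct model_task *tasks, int n)
{
	int count = 0;

	for (int i = 0; i < n; i++)
		if (!tasks[i].did_cgroup_exit)
			count++;
	return count > 0;
}
```

With a dead leader (signal->live == 0) and one PF_EXITING thread that has not reached cgroup_exit() yet, both iterations list nothing while the cgroup still counts as populated, which is exactly the EBUSY scenario described above.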
So from user space's point of view the cgroup is empty and can be
removed but rmdir() fails to do so. I think this behavior is
inconsistent with the claim "cgroup which doesn't have any children
and is associated only with zombie processes is considered empty and
can be removed" from
https://elixir.bootlin.com/linux/v5.4.10/source/Documentation/admin-guide/cgroup-v2.rst#L222
.
Before contemplating what would be the right fix here I wanted to
check whether cgroup.procs being empty is a reasonable indication for
user space to assume the cgroup is empty and can be removed. I think it
is but I might be wrong.
Note that this behavior was reproduced on cgroup v1 hierarchy but I
think it will happen on v2 as well. If needed I can cook up a
reproducer.
Thanks,
Suren.
>
> Thanks.
>
> --
> tejun
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations
From: Tejun Heo @ 2020-01-10 21:56 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Michal Koutný, Johannes Weiner, Topi Miettinen, Li Zefan,
cgroups mailinglist, security-8fiUuRrzOP0dnm+yROfE0A,
Security Officers, Lennart Poettering, Oleg Nesterov,
Eric W. Biederman, james.hsu-NuS5LvNUpcJWk0Htik3J/w,
linger.lee-NuS5LvNUpcJWk0Htik3J/w, Tom Cherry, Roman Gushchin
Hello,
On Fri, Jan 10, 2020 at 01:47:19PM -0800, Suren Baghdasaryan wrote:
> So from user space's point of view the cgroup is empty and can be
> removed but rmdir() fails to do so. I think this behavior is
> inconsistent with the claim "cgroup which doesn't have any children
> and is associated only with zombie processes is considered empty and
> can be removed" from
> https://elixir.bootlin.com/linux/v5.4.10/source/Documentation/admin-guide/cgroup-v2.rst#L222
Yeah, the current behavior isn't quite consistent with the
documentation and what we prolly wanna do is allowing destroying a
cgroup with only dead processes in it. That said, the correct (or at
least workable) signal which indicates that a cgroup is ready for
removal is cgroup.events::populated being zero, which is a poll(2)able
event.
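For cgroup v2 that readiness signal is the `populated` key in cgroup.events; a change in the file raises POLLPRI/POLLERR on an open fd, so a watcher would poll() and re-read on each event. A sketch of the parsing side (the helper name is made up, not from any existing library):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse the "populated" value out of cgroup.events contents, e.g.
 * "populated 1\nfrozen 0\n".  Returns 0/1, or -1 if the key is absent.
 * Illustrative helper; real code would poll() the open fd with
 * POLLPRI and call this on every re-read. */
static int events_populated(const char *contents)
{
	const char *p = contents;

	while (p) {
		if (strncmp(p, "populated ", 10) == 0)
			return atoi(p + 10);
		p = strchr(p, '\n');
		if (p)
			p++;	/* start of the next "key value" line */
	}
	return -1;
}
```

A cgroup would be considered ready for rmdir() once this returns 0, which, as discussed in the thread, is only available on v2 hierarchies.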
Thanks.
--
tejun
* Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations
From: Suren Baghdasaryan @ 2020-01-10 22:14 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Koutný, Johannes Weiner, Topi Miettinen, Li Zefan,
cgroups mailinglist, security-8fiUuRrzOP0dnm+yROfE0A,
Security Officers, Lennart Poettering, Oleg Nesterov,
Eric W. Biederman, james.hsu-NuS5LvNUpcJWk0Htik3J/w,
linger.lee-NuS5LvNUpcJWk0Htik3J/w, Tom Cherry, Roman Gushchin
On Fri, Jan 10, 2020 at 1:56 PM Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> Hello,
>
> On Fri, Jan 10, 2020 at 01:47:19PM -0800, Suren Baghdasaryan wrote:
> > So from user space's point of view the cgroup is empty and can be
> > removed but rmdir() fails to do so. I think this behavior is
> > inconsistent with the claim "cgroup which doesn't have any children
> > and is associated only with zombie processes is considered empty and
> > can be removed" from
> > https://elixir.bootlin.com/linux/v5.4.10/source/Documentation/admin-guide/cgroup-v2.rst#L222
>
> Yeah, the current behavior isn't quite consistent with the
> documentation and what we prolly wanna do is allowing destroying a
> cgroup with only dead processes in it. That said, the correct (or at
> least workable) signal which indicates that a cgroup is ready for
> removal is cgroup.events::populated being zero, which is a poll(2)able
> event.
Unfortunately it would not be workable for us as it's only available
for cgroup v2 controllers.
I can think of other ways to fix it in userspace but there might
be other cgroup API users which would be broken after this change. I
would prefer to fix it in the kernel if at all possible rather than
chasing all possible users.
Thanks,
Suren.
>
> Thanks.
>
> --
> tejun
* Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations
From: Tejun Heo @ 2020-01-10 22:15 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Michal Koutný, Johannes Weiner, Topi Miettinen, Li Zefan,
cgroups mailinglist, security-8fiUuRrzOP0dnm+yROfE0A,
Security Officers, Lennart Poettering, Oleg Nesterov,
Eric W. Biederman, james.hsu-NuS5LvNUpcJWk0Htik3J/w,
linger.lee-NuS5LvNUpcJWk0Htik3J/w, Tom Cherry, Roman Gushchin
On Fri, Jan 10, 2020 at 02:14:19PM -0800, Suren Baghdasaryan wrote:
> > Yeah, the current behavior isn't quite consistent with the
> > documentation and what we prolly wanna do is allowing destroying a
> > cgroup with only dead processes in it. That said, the correct (or at
> > least workable) signal which indicates that a cgroup is ready for
> > removal is cgroup.events::populated being zero, which is a poll(2)able
> > event.
>
> Unfortunately it would not be workable for us as it's only available
> for cgroup v2 controllers.
> I can think of other ways to fix it in the userspace but there might
> be other cgroup API users which are be broken after this change. I
> would prefer to fix it in the kernel if at all possible rather than
> chasing all possible users.
Yeah, the right thing to do is allowing destruction of cgroups w/ only
dead processes in it.
--
tejun
* Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations
From: Suren Baghdasaryan @ 2020-01-10 22:50 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Koutný, Johannes Weiner, Topi Miettinen, Li Zefan,
cgroups mailinglist, security-8fiUuRrzOP0dnm+yROfE0A,
Security Officers, Lennart Poettering, Oleg Nesterov,
Eric W. Biederman, james.hsu-NuS5LvNUpcJWk0Htik3J/w,
linger.lee-NuS5LvNUpcJWk0Htik3J/w, Tom Cherry, Roman Gushchin
On Fri, Jan 10, 2020 at 2:15 PM Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> On Fri, Jan 10, 2020 at 02:14:19PM -0800, Suren Baghdasaryan wrote:
> > > Yeah, the current behavior isn't quite consistent with the
> > > documentation and what we prolly wanna do is allowing destroying a
> > > cgroup with only dead processes in it. That said, the correct (or at
> > > least workable) signal which indicates that a cgroup is ready for
> > > removal is cgroup.events::populated being zero, which is a poll(2)able
> > > event.
> >
> > Unfortunately it would not be workable for us as it's only available
> > for cgroup v2 controllers.
> > I can think of other ways to fix it in userspace but there might
> > be other cgroup API users which would be broken after this change. I
> > would prefer to fix it in the kernel if at all possible rather than
> > chasing all possible users.
>
> Yeah, the right thing to do is allowing destruction of cgroups w/ only
> dead processes in it.
Thanks for confirmation. I'll see if I can figure this out.
>
> --
> tejun
* Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations
From: Michal Koutný @ 2020-01-10 23:16 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Tejun Heo, Johannes Weiner, Topi Miettinen, Li Zefan,
cgroups mailinglist, security-8fiUuRrzOP0dnm+yROfE0A,
Security Officers, Lennart Poettering, Oleg Nesterov,
Eric W. Biederman, james.hsu-NuS5LvNUpcJWk0Htik3J/w,
linger.lee-NuS5LvNUpcJWk0Htik3J/w, Tom Cherry, Roman Gushchin
Hello Suren.
On Fri, Jan 10, 2020 at 01:47:19PM -0800, Suren Baghdasaryan <surenb@google.com> wrote:
> > On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote:
> > > Wouldn't it make more sense to call
> > > css_set_move_task(tsk, cset, NULL, false);
> > > in cgroup_release instead of cgroup_exit then?
> > >
> > > css_set_move_task triggers the cgroup emptiness notification so if we
> > > list group leaders with running siblings as members of the cgroup (IMO
> > > correct), is it consistent to deliver the (un)populated event
> > > that early?
> > > By moving to cgroup_release we would also make this notification
> > > analogous to SIGCHLD delivery.
> >
> > So, the current behavior is mostly historical and I don't think this
> > is a better behavior. That said, I'm not sure about the cost benefit
> > ratio of changing the behavior at this point given how long the code
> > has been this way. Another aspect is that it'd expose zombie tasks
> > and possibly migrations of them to each controller.
>
> Sorry for reviving an old discussion
Since you replied to my remark, I have to share that I later found
myself wrong wrt the emptiness notifications. Moving the task in
cgroup_exit doesn't matter if the thread group contains other live
tasks; the unpopulated notification will be raised when the last task
of the thread group calls cgroup_exit, i.e. it is similar to SIGCHLD.
Now to your issue.
> but I got a bug report from a customer about Android
What kernel version is that? Can it be considered equal to the current
Linux?
> In my testing I was able to reproduce the failure in
> which case rmdir() fails with EBUSY from cgroup_destroy_locked()
> because cgroup_is_populated() returns true.
That'd mean that not all tasks in the cgroup did exit() (cgroup_exit()),
i.e. they're still running. Can you see them in cgroup.threads/tasks?
> cgroup_is_populated() depends on cgrp->nr_populated_xxx counters which
> IIUC will not be updated until cgroup_update_populated() is called
> from cgroup_exit() and that might get delayed...
Why do you think it's delayed?
> So from user space's point of view the cgroup is empty and can be
> removed but rmdir() fails to do so.
As Tejun says, a cgroup with only dead _tasks_ should be removable (and
if I'm not mistaken it is in the current kernel). Unless you do
individual thread migration, where a thread would be separated from its
(dead) leader. Does your case include such migrations?
Michal
* Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations
From: Suren Baghdasaryan @ 2020-01-11 0:00 UTC (permalink / raw)
To: Michal Koutný
Cc: Tejun Heo, Johannes Weiner, Topi Miettinen, Li Zefan,
cgroups mailinglist, security-8fiUuRrzOP0dnm+yROfE0A,
Security Officers, Lennart Poettering, Oleg Nesterov,
Eric W. Biederman, james.hsu-NuS5LvNUpcJWk0Htik3J/w,
linger.lee-NuS5LvNUpcJWk0Htik3J/w, Tom Cherry, Roman Gushchin
Hi Michal,
On Fri, Jan 10, 2020 at 3:16 PM Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org> wrote:
>
> Hello Suren.
>
> On Fri, Jan 10, 2020 at 01:47:19PM -0800, Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote:
> > > > Wouldn't it make more sense to call
> > > > css_set_move_task(tsk, cset, NULL, false);
> > > > in cgroup_release instead of cgroup_exit then?
> > > >
> > > > css_set_move_task triggers the cgroup emptiness notification so if we
> > > > list group leaders with running siblings as members of the cgroup (IMO
> > > > correct), is it consistent to deliver the (un)populated event
> > > > that early?
> > > > By moving to cgroup_release we would also make this notification
> > > > analogous to SIGCHLD delivery.
> > >
> > > So, the current behavior is mostly historical and I don't think this
> > > is a better behavior. That said, I'm not sure about the cost benefit
> > > ratio of changing the behavior at this point given how long the code
> > > has been this way. Another aspect is that it'd expose zombie tasks
> > > and possibly migrations of them to each controller.
> >
> > Sorry for reviving an old discussion
> Since you replied to my remark, I have to share that I later found
> myself wrong wrt the emptiness notifications. Moving the task in
> cgroup_exit doesn't matter if the thread group contains other live
> tasks; the unpopulated notification will be raised when the last task
> of the thread group calls cgroup_exit, i.e. it is similar to SIGCHLD.
>
> Now to your issue.
>
> > but I got a bug report from a customer about Android
> What kernel version is that? Can it be considered equal to the current
> Linux?
I reproduced this with a 4.19 kernel that had Tejun's patches backported
but I believe the same behavior will happen with any kernel that
contains c03cd7738a83 ("cgroup: Include dying leaders with live
threads in PROCS iterations") including the latest. I'll try to
confirm this on 5.4 next week.
>
> > In my testing I was able to reproduce the failure in
> > which case rmdir() fails with EBUSY from cgroup_destroy_locked()
> > because cgroup_is_populated() returns true.
> That'd mean that not all tasks in the cgroup did exit() (cgroup_exit()),
> i.e. they're still running. Can you see them in cgroup.threads/tasks?
Correct, I confirmed that there are still tasks in cset->tasks list
when this happens. Actually when I reproduced this I logged one task
in cset->dying_tasks which was the group leader and another one in
cset->tasks. So the last task to reach cgroup_exit() was not the group
leader.
However these tasks will be ignored when we enumerate them using
css_task_iter_start()/css_task_iter_next() loop because
css_task_iter_advance() ignores tasks with PF_EXITING flag set or
group leaders with task->signal->live==0 (see the code lines I posted
in my first email). So threads are there, they are PF_EXITING but at
least one of them did not reach cgroup_exit() yet. When user space
reads cgroup.procs/tasks they are skipped. This is the change in
behavior which is different from what it used to be before this patch.
Prior to it these tasks would be visible even if they are exiting.
>
> > cgroup_is_populated() depends on cgrp->nr_populated_xxx counters which
> > IIUC will not be updated until cgroup_update_populated() is called
> > from cgroup_exit() and that might get delayed...
> Why do you think it's delayed?
I think exit_mm() can be one of the contributors if the process owns a
considerable amount of memory but there might be other possible
reasons.
> > So from user space's point of view the cgroup is empty and can be
> > removed but rmdir() fails to do so.
> As Tejun says, a cgroup with only dead _tasks_ should be removable (and
> if I'm not mistaken it is in the current kernel). Unless you do
> individual thread migration, where a thread would be separated from its
> (dead) leader. Does your case include such migrations?
>
> Michal
Thanks,
Suren.
* Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations
From: Suren Baghdasaryan @ 2020-01-11 0:55 UTC (permalink / raw)
To: Michal Koutný
Cc: Tejun Heo, Johannes Weiner, Topi Miettinen, Li Zefan,
cgroups mailinglist, security-8fiUuRrzOP0dnm+yROfE0A,
Security Officers, Lennart Poettering, Oleg Nesterov,
Eric W. Biederman, james.hsu-NuS5LvNUpcJWk0Htik3J/w,
linger.lee-NuS5LvNUpcJWk0Htik3J/w, Tom Cherry, Roman Gushchin
On Fri, Jan 10, 2020 at 4:00 PM Suren Baghdasaryan <surenb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>
> Hi Michal,
>
> On Fri, Jan 10, 2020 at 3:16 PM Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org> wrote:
> >
> > Hello Suren.
> >
> > On Fri, Jan 10, 2020 at 01:47:19PM -0800, Suren Baghdasaryan <surenb@google.com> wrote:
> > > > On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote:
> > > > > Wouldn't it make more sense to call
> > > > > css_set_move_task(tsk, cset, NULL, false);
> > > > > in cgroup_release instead of cgroup_exit then?
> > > > >
> > > > > css_set_move_task triggers the cgroup emptiness notification so if we
> > > > > list group leaders with running siblings as members of the cgroup (IMO
> > > > > correct), is it consistent to deliver the (un)populated event
> > > > > that early?
> > > > > By moving to cgroup_release we would also make this notification
> > > > > analogous to SIGCHLD delivery.
> > > >
> > > > So, the current behavior is mostly historical and I don't think this
> > > > is a better behavior. That said, I'm not sure about the cost benefit
> > > > ratio of changing the behavior at this point given how long the code
> > > > has been this way. Another aspect is that it'd expose zombie tasks
> > > > and possibly migrations of them to each controller.
> > >
> > > Sorry for reviving an old discussion
> > Since you replied to my remark, I have to share that I later found
> > myself wrong wrt the emptiness notifications. Moving the task in
> > cgroup_exit doesn't matter if the thread group contains other live
> > tasks; the unpopulated notification will be raised when the last task
> > of the thread group calls cgroup_exit, i.e. it is similar to SIGCHLD.
> >
> > Now to your issue.
> >
> > > but I got a bug report from a customer about Android
> > What kernel version is that? Can it be considered equal to the current
> > Linux?
>
> I reproduced this with a 4.19 kernel that had Tejun's patches backported
> but I believe the same behavior will happen with any kernel that
> contains c03cd7738a83 ("cgroup: Include dying leaders with live
> threads in PROCS iterations") including the latest. I'll try to
> confirm this on 5.4 next week.
>
> >
> > > In my testing I was able to reproduce the failure in
> > > which case rmdir() fails with EBUSY from cgroup_destroy_locked()
> > > because cgroup_is_populated() returns true.
> > That'd mean that not all tasks in the cgroup did exit() (cgroup_exit()),
> > i.e. they're still running. Can you see them in cgroup.threads/tasks?
>
> Correct, I confirmed that there are still tasks in cset->tasks list
> when this happens. Actually when I reproduced this I logged one task
> in cset->dying_tasks which was the group leader and another one in
> cset->tasks. So the last task to reach cgroup_exit() was not the group
> leader.
> However these tasks will be ignored when we enumerate them using
> css_task_iter_start()/css_task_iter_next() loop because
> css_task_iter_advance() ignores tasks with PF_EXITING flag set or
> group leaders with task->signal->live==0 (see the code lines I posted
> in my first email). So threads are there, they are PF_EXITING but at
> least one of them did not reach cgroup_exit() yet. When user space
> reads cgroup.procs/tasks they are skipped. This is the change in
> behavior which is different from what it used to be before this patch.
> Prior to it these tasks would be visible even if they are exiting.
>
> >
> > > cgroup_is_populated() depends on cgrp->nr_populated_xxx counters which
> > > IIUC will not be updated until cgroup_update_populated() is called
> > > from cgroup_exit() and that might get delayed...
> > Why do you think it's delayed?
>
> I think exit_mm() can be one of the contributors if the process owns a
> considerable amount of memory but there might be other possible
> reasons.
>
> > > So from user space's point of view the cgroup is empty and can be
> > > removed but rmdir() fails to do so.
> > As Tejun says, a cgroup with only dead _tasks_ should be removable (and
> > if I'm not mistaken it is in the current kernel).
Sorry, I missed this last part.
Agreed, if all tasks are dead the cgroup should be removable, and it
will be once all tasks have reached cgroup_exit(). The issue happens when
tasks are marked PF_EXITING, userspace reads an empty cgroup.procs, and
issues rmdir() before the last killed task reaches cgroup_exit().
> > Unless you do individual
> > thread migration, where a thread would be separated from its (dead)
> > leader. Does your case include such migrations?
No, my case just kills all tasks in the cgroup until cgroup.procs is
empty and then tries to remove the cgroup itself. No migrations.
> >
> > Michal
>
> Thanks,
> Suren.
* Re: [PATCH 3/3 cgroup/for-5.2-fixes] cgroup: Include dying leaders with live threads in PROCS iterations
From: Suren Baghdasaryan @ 2020-01-16 4:26 UTC (permalink / raw)
To: Michal Koutný
Cc: Tejun Heo, Johannes Weiner, Topi Miettinen, Li Zefan,
cgroups mailinglist, security-8fiUuRrzOP0dnm+yROfE0A,
Security Officers, Lennart Poettering, Oleg Nesterov,
Eric W. Biederman, james.hsu-NuS5LvNUpcJWk0Htik3J/w, JeiFeng Lee,
Tom Cherry, Roman Gushchin
On Fri, Jan 10, 2020 at 4:55 PM Suren Baghdasaryan <surenb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>
> On Fri, Jan 10, 2020 at 4:00 PM Suren Baghdasaryan <surenb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > Hi Michal,
> >
> > On Fri, Jan 10, 2020 at 3:16 PM Michal Koutný <mkoutny-IBi9RG/b67k@public.gmane.org> wrote:
> > >
> > > Hello Suren.
> > >
> > > On Fri, Jan 10, 2020 at 01:47:19PM -0800, Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > On Fri, Jun 07, 2019 at 07:09:53PM +0200, Michal Koutný wrote:
> > > > > > Wouldn't it make more sense to call
> > > > > > css_set_move_task(tsk, cset, NULL, false);
> > > > > > in cgroup_release instead of cgroup_exit then?
> > > > > >
> > > > > > css_set_move_task triggers the cgroup emptiness notification so if we
> > > > > > list group leaders with running siblings as members of the cgroup (IMO
> > > > > > correct), is it consistent to deliver the (un)populated event
> > > > > > that early?
> > > > > > By moving to cgroup_release we would also make this notification
> > > > > > analogous to SIGCHLD delivery.
> > > > >
> > > > > So, the current behavior is mostly historical and I don't think this
> > > > > is a better behavior. That said, I'm not sure about the cost benefit
> > > > > ratio of changing the behavior at this point given how long the code
> > > > > has been this way. Another aspect is that it'd expose zombie tasks
> > > > > and possibly migrations of them to each controller.
> > > >
> > > > Sorry for reviving an old discussion
> > > Since you replied to my remark, I have to share that I later found
> > > myself wrong wrt the emptiness notifications. Moving the task in
> > > cgroup_exit doesn't matter if the thread group contains other live
> > > tasks; the unpopulated notification will be raised when the last task
> > > of the thread group calls cgroup_exit, i.e. it is similar to SIGCHLD.
> > >
> > > Now to your issue.
> > >
> > > > but I got a bug report from a customer about Android
> > > What kernel version is that? Can it be considered equal to the current
> > > Linux?
> >
> > I reproduced this with a 4.19 kernel that had Tejun's patches backported
> > but I believe the same behavior will happen with any kernel that
> > contains c03cd7738a83 ("cgroup: Include dying leaders with live
> > threads in PROCS iterations") including the latest. I'll try to
> > confirm this on 5.4 next week.
Hi Folks,
I confirmed that the issue is happening on 5.4 as well. I did
implement a fix for this and a kselftest to reproduce the problem.
Will send both the fix and the kselftest as a 2-patch series shortly.
Thanks,
Suren.
> >
> > >
> > > > In my testing I was able to reproduce the failure in
> > > > which case rmdir() fails with EBUSY from cgroup_destroy_locked()
> > > > because cgroup_is_populated() returns true.
> > > That'd mean that not all tasks in the cgroup did exit() (cgroup_exit()),
> > > i.e. they're still running. Can you see them in cgroup.threads/tasks?
> >
> > Correct, I confirmed that there are still tasks in cset->tasks list
> > when this happens. Actually when I reproduced this I logged one task
> > in cset->dying_tasks which was the group leader and another one in
> > cset->tasks. So the last task to reach cgroup_exit() was not the group
> > leader.
> > However these tasks will be ignored when we enumerate them using
> > css_task_iter_start()/css_task_iter_next() loop because
> > css_task_iter_advance() ignores tasks with PF_EXITING flag set or
> > group leaders with task->signal->live==0 (see the code lines I posted
> > in my first email). So threads are there, they are PF_EXITING but at
> > least one of them did not reach cgroup_exit() yet. When user space
> > reads cgroup.procs/tasks they are skipped. This is the change in
> > behavior which is different from what it used to be before this patch.
> > Prior to it these tasks would be visible even if they are exiting.
> >
> > >
> > > > cgroup_is_populated() depends on cgrp->nr_populated_xxx counters which
> > > > IIUC will not be updated until cgroup_update_populated() is called
> > > > from cgroup_exit() and that might get delayed...
> > > Why do you think it's delayed?
> >
> > I think exit_mm() can be one of the contributors if the process owns a
> > considerable amount of memory but there might be other possible
> > reasons.
> >
> > > > So from user space's point of view the cgroup is empty and can be
> > > > removed but rmdir() fails to do so.
> > > As Tejun says, a cgroup with only dead _tasks_ should be removable (and
> > > if I'm not mistaken it is in the current kernel).
>
> Sorry, I missed this last part.
> Agreed, if all tasks are dead the cgroup should be removable, and it
> will be once all tasks have reached cgroup_exit(). The issue happens when
> tasks are marked PF_EXITING, userspace reads an empty cgroup.procs, and
> issues rmdir() before the last killed task reaches cgroup_exit().
>
> > > Unless you do individual
> > > thread migration, where a thread would be separated from its (dead)
> > > leader. Does your case include such migrations?
>
> No, my case just kills all tasks in the cgroup until cgroup.procs is
> empty and then tries to remove the cgroup itself. No migrations.
>
> > >
> > > Michal
> >
> > Thanks,
> > Suren.