* Race between cgroup core and the cpuset controller
From: Daniel Jordan @ 2017-05-22 22:14 UTC
To: cgroups-u79uwXL29TY76Z2rM5mHXA
Hi,
I found what looks like a race between cgroup core and the cpuset
controller.
---
/* Mount the cpuset and create parent and child cpusets. */
# mount -t cpuset nodev /dev/cpuset
# cd /dev/cpuset
# mkdir parent
# mkdir parent/child
/* Enable cpu_exclusive in both parent and child cpusets. */
# cd parent
# /bin/echo 1 > cpu_exclusive
# /bin/echo 1 > child/cpu_exclusive
/* Remove the child cpuset and then immediately try making the parent
non-exclusive. */
# rmdir child; /bin/echo 0 > cpu_exclusive
/bin/echo: write error: Device or resource busy
---
I'd expect the last command above to succeed.
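In case it's easier to run, here's the same sequence as a small C
reproducer (illustrative only; it assumes the cpuset mount above and
root privileges):

/* Illustrative reproducer for the steps above; assumes cpuset is
 * already mounted at /dev/cpuset and that we run as root.
 * Build: cc -o cpuset-race cpuset-race.c */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) != (ssize_t)strlen(val)) {
		perror(path);	/* EBUSY fires here on affected kernels */
		exit(1);
	}
	close(fd);
}

int main(void)
{
	if (mkdir("/dev/cpuset/parent", 0755) ||
	    mkdir("/dev/cpuset/parent/child", 0755)) {
		perror("mkdir");
		return 1;
	}
	write_file("/dev/cpuset/parent/cpu_exclusive", "1");
	write_file("/dev/cpuset/parent/child/cpu_exclusive", "1");

	if (rmdir("/dev/cpuset/parent/child")) {
		perror("rmdir");
		return 1;
	}
	/* Write immediately after rmdir, before the child css is offlined. */
	write_file("/dev/cpuset/parent/cpu_exclusive", "0");
	puts("write succeeded");
	return 0;
}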
If I do the same steps as above, but make the last command
# rmdir child; sleep 1; /bin/echo 0 > cpu_exclusive
then it works.
None of the three EBUSY errors from 'man cpuset' apply to this case, so
I added some debug output that shows what's going on:
[2710738.469049] cgroup: [cpu 64] entering kill_css
[2710738.478379] cgroup: [cpu 64] leaving kill_css
[2710738.487659] [cpu 96] entering is_cpuset_subset
[2710738.496830] [cpu 96] triggered in is_cpuset_subset /* is_cpu_exclusive(p) > is_cpu_exclusive(q) */
[2710738.513153] cgroup: [cpu 64] entering css_killed_ref_fn
[2710738.523873] cgroup: [cpu 64] leaving css_killed_ref_fn
[2710738.534716] cgroup: [cpu 64] entering css_killed_work_fn
[2710738.545737] cgroup: [cpu 64] entering offline_css
[2710738.555644] [cpu 64] entering cpuset_css_offline
[2710738.565387] [cpu 64] entering is_cpuset_subset
[2710738.574744] [cpu 64] leaving cpuset_css_offline
[2710738.584297] cgroup: [cpu 64] leaving offline_css
[2710738.594010] cgroup: [cpu 64] leaving css_killed_work_fn
It looks like the task on cpu 64 kicks off the kworker that eventually
calls cpuset_css_offline, but this worker doesn't get to run until after
the "/bin/echo 0 > cpu_exclusive" command on cpu 96 has finished. That
means the check in is_cpuset_subset, which verifies that all child
cpusets are non-exclusive before allowing 'parent' to be made
non-exclusive, fails because 'child' hasn't been offlined yet.
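For reference, the check that triggers is the exclusivity clause in
is_cpuset_subset(), which validate_change() runs against every online
child (lightly paraphrased from kernel/cgroup/cpuset.c; details may
differ across versions):

/* Paraphrased from kernel/cgroup/cpuset.c: p is a "subset" of q only
 * if p is no more exclusive than q; validate_change() returns -EBUSY
 * when any child fails this test against the trial parent. */
static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
{
	return	cpumask_subset(p->cpus_allowed, q->cpus_allowed) &&
		nodes_subset(p->mems_allowed, q->mems_allowed) &&
		is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
		is_mem_exclusive(p) <= is_mem_exclusive(q);
}

With cpu_exclusive cleared in the trial parent q, the not-yet-offlined
child p has is_cpu_exclusive(p) > is_cpu_exclusive(q), hence the EBUSY.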
Is this expected behavior?
Should the user have to sleep between commands or thoroughly clean up
after a cpuset with something like
# /bin/echo 0 > child/cpu_exclusive; rmdir child; /bin/echo 0 > cpu_exclusive
to avoid these kinds of failures?
Thank you,
Daniel
* Re: Race between cgroup core and the cpuset controller
From: Tejun Heo @ 2017-05-23 20:06 UTC
To: Daniel Jordan; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Li Zefan
(cc'ing Li and quoting the whole message)
On Mon, May 22, 2017 at 06:14:31PM -0400, Daniel Jordan wrote:
> Hi,
>
> I found what looks like a race between cgroup core and the cpuset
> controller.
>
> ---
>
> /* Mount the cpuset and create parent and child cpusets. */
> # mount -t cpuset nodev /dev/cpuset
> # cd /dev/cpuset
> # mkdir parent
> # mkdir parent/child
>
> /* Enable cpu_exclusive in both parent and child cpusets. */
> # cd parent
> # /bin/echo 1 > cpu_exclusive
> # /bin/echo 1 > child/cpu_exclusive
>
> /* Remove the child cpuset and then immediately try making the parent
> non-exclusive. */
> # rmdir child; /bin/echo 0 > cpu_exclusive
> /bin/echo: write error: Device or resource busy
>
> ---
>
> I'd expect the last command above to succeed.
>
> If I do the same steps as above, but make the last command
>
> # rmdir child; sleep 1; /bin/echo 0 > cpu_exclusive
>
> then it works.
>
> None of the three EBUSY errors from 'man cpuset' apply to this case, so I
> added some debug output that shows what's going on:
>
> [2710738.469049] cgroup: [cpu 64] entering kill_css
> [2710738.478379] cgroup: [cpu 64] leaving kill_css
> [2710738.487659] [cpu 96] entering is_cpuset_subset
> [2710738.496830] [cpu 96] triggered in is_cpuset_subset /*
> is_cpu_exclusive(p) > is_cpu_exclusive(q) */
> [2710738.513153] cgroup: [cpu 64] entering css_killed_ref_fn
> [2710738.523873] cgroup: [cpu 64] leaving css_killed_ref_fn
> [2710738.534716] cgroup: [cpu 64] entering css_killed_work_fn
> [2710738.545737] cgroup: [cpu 64] entering offline_css
> [2710738.555644] [cpu 64] entering cpuset_css_offline
> [2710738.565387] [cpu 64] entering is_cpuset_subset
> [2710738.574744] [cpu 64] leaving cpuset_css_offline
> [2710738.584297] cgroup: [cpu 64] leaving offline_css
> [2710738.594010] cgroup: [cpu 64] leaving css_killed_work_fn
>
> It looks like the task on cpu 64 kicks off the kworker that eventually
> calls cpuset_css_offline, but this worker doesn't get to run until after
> the "/bin/echo 0 > cpu_exclusive" command on cpu 96 has finished. That
> means the check in is_cpuset_subset, which verifies that all child
> cpusets are non-exclusive before allowing 'parent' to be made
> non-exclusive, fails because 'child' hasn't been offlined yet.
>
> Is this expected behavior?
>
> Should the user have to sleep between commands or thoroughly clean up after
> a cpuset with something like
>
> # /bin/echo 0 > child/cpu_exclusive; rmdir child; /bin/echo 0 > cpu_exclusive
>
> to avoid these kinds of failures?
Can you please see whether the following patch fixes the issue?
Thanks.
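The idea is that rmdir kills the css's percpu ref synchronously in
kill_css() (the kill_css you see in your trace), while the actual
offlining is deferred to the workqueue, so a dying test flips as soon
as rmdir returns. percpu_ref_is_dying() itself is just a flag check,
roughly (paraphrased from include/linux/percpu-refcount.h):

/* Paraphrased from include/linux/percpu-refcount.h: the DEAD bit is
 * set by percpu_ref_kill_and_confirm(), which kill_css() calls during
 * rmdir, well before the RCU-delayed offline work runs. */
static inline bool percpu_ref_is_dying(struct percpu_ref *ref)
{
	return ref->percpu_count_ptr & __PERCPU_REF_DEAD;
}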
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f6501f4f6040..9e29dba49d6c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -176,9 +176,10 @@ typedef enum {
} cpuset_flagbits_t;
/* convenient tests for these bits */
-static inline bool is_cpuset_online(const struct cpuset *cs)
+static inline bool is_cpuset_online(struct cpuset *cs)
{
- return test_bit(CS_ONLINE, &cs->flags);
+ return test_bit(CS_ONLINE, &cs->flags) &&
+ !percpu_ref_is_dying(&cs->css.refcnt);
}
static inline int is_cpu_exclusive(const struct cpuset *cs)
* Re: Race between cgroup core and the cpuset controller
From: Daniel Jordan @ 2017-05-23 21:32 UTC
To: Tejun Heo; +Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Li Zefan
Hi Tejun,
On 05/23/2017 04:06 PM, Tejun Heo wrote:
> Can you please see whether the following patch fixes the issue?
> Thanks.
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index f6501f4f6040..9e29dba49d6c 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -176,9 +176,10 @@ typedef enum {
> } cpuset_flagbits_t;
>
> /* convenient tests for these bits */
> -static inline bool is_cpuset_online(const struct cpuset *cs)
> +static inline bool is_cpuset_online(struct cpuset *cs)
> {
> - return test_bit(CS_ONLINE, &cs->flags);
> + return test_bit(CS_ONLINE, &cs->flags) &&
> + !percpu_ref_is_dying(&cs->css.refcnt);
> }
>
> static inline int is_cpu_exclusive(const struct cpuset *cs)
That does the trick, thanks.
Please let me know if there's anything else I can do to help.
Daniel
* [PATCH cgroup/for-4.12-fixes] cpuset: consider dying css as offline
From: Tejun Heo @ 2017-05-24 16:03 UTC
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Daniel Jordan
In most cases, a cgroup controller doesn't care about the lifetimes of
cgroups. For the controller, a css becomes online when ->css_online()
is called on it and offline when ->css_offline() is called.

However, cpuset is special in that the user interface it exposes cares
whether certain cgroups exist or not. Combined with the RCU delay
between cgroup removal and css offlining, this can lead to user-visible
behavior oddities where operations which should succeed after cgroup
removals fail for some time period. The effects of cgroup removals are
delayed when seen from userland.
This patch adds css_is_dying(), which tests whether offline is pending,
and updates is_cpuset_online() so that the function also returns false
while offline is pending. This gets rid of the userland-visible delays.
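For reference, is_cpuset_online() is what cpuset_for_each_child() uses
to filter which children get walked, so with this change a dying child
immediately drops out of, e.g., the -EBUSY checks in validate_change()
(paraphrased from kernel/cgroup/cpuset.c):

/* Paraphrased from kernel/cgroup/cpuset.c: child walks skip cpusets
 * for which is_cpuset_online() is false, so a dying child stops being
 * considered as soon as rmdir kills its css. */
#define cpuset_for_each_child(child_cs, pos_css, parent_cs)		\
	css_for_each_child((pos_css), &(parent_cs)->css)		\
		if (is_cpuset_online(((child_cs) = css_cs((pos_css)))))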
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Reported-by: Daniel Jordan <daniel.m.jordan-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
Hello,
Applying to cgroup/for-4.12-fixes. Li, I think this is the right fix,
but I'd appreciate it if you could take a look.
Thanks.
include/linux/cgroup.h | 20 ++++++++++++++++++++
kernel/cgroup/cpuset.c | 4 ++--
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed2573e149fa..710a005c6b7a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -344,6 +344,26 @@ static inline bool css_tryget_online(struct cgroup_subsys_state *css)
}
/**
+ * css_is_dying - test whether the specified css is dying
+ * @css: target css
+ *
+ * Test whether @css is in the process of offlining or already offline. In
+ * most cases, ->css_online() and ->css_offline() callbacks should be
+ * enough; however, the actual offline operations are RCU delayed and this
+ * test returns %true also when @css is scheduled to be offlined.
+ *
+ * This is useful, for example, when the use case requires synchronous
+ * behavior with respect to cgroup removal. cgroup removal schedules css
+ * offlining but the css can seem alive while the operation is being
+ * delayed. If the delay affects user visible semantics, this test can be
+ * used to resolve the situation.
+ */
+static inline bool css_is_dying(struct cgroup_subsys_state *css)
+{
+ return !(css->flags & CSS_NO_REF) && percpu_ref_is_dying(&css->refcnt);
+}
+
+/**
* css_put - put a css reference
* @css: target css
*
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f6501f4f6040..ae643412948a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -176,9 +176,9 @@ typedef enum {
} cpuset_flagbits_t;
/* convenient tests for these bits */
-static inline bool is_cpuset_online(const struct cpuset *cs)
+static inline bool is_cpuset_online(struct cpuset *cs)
{
- return test_bit(CS_ONLINE, &cs->flags);
+ return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css);
}
static inline int is_cpu_exclusive(const struct cpuset *cs)