Linux Power Management development
 help / color / mirror / Atom feed
* Re: [PATCHSET cgroup] extend threadgroup locking
From: Rafael J. Wysocki @ 2011-09-05  4:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, fweisbec, containers, lizf, linux-kernel, linux-pm, paul
In-Reply-To: <1315159280-25032-1-git-send-email-htejun@gmail.com>

Hi,

On Sunday, September 04, 2011, Tejun Heo wrote:
> Hello,
> 
> cgroup currently only blocks new threads from joining the target
> threadgroup during migration, and on-going migration could race
> against exec and exit leading to interesting problems - the symmetry
> between various attach methods, task exiting during method execution,
> ->exit() racing against attach methods, migrating task switching basic
> properties during exec and so on.
> 
> This patchset extends threadgroup locking such that it covers all
> operations which can alter the threadgroup - fork, exit and exec, and
> update cgroup to take advantage of it.  rwsem read ops are added to
> exit path but exec is excluded by grabbing the existing
> cred_guard_mutex from threadgroup locking helper.
> 
> This makes threadgroup locking complete and resolves cgroup issues
> stemming from the target taskset being unstable.
> 
> This patchset is on top of the current pm-freezer + "freezer: fixes &
> simplifications" patchset and contains the following four patches.
> Patch list and diffstat follow.

OK

I'm assuming that there will be a branch for me to pull from when
kernel.org starts to work again, right?

Rafael

^ permalink raw reply

* Re: [PATCH 6/6] freezer: kill unused set_freezable_with_signal()
From: Tejun Heo @ 2011-09-05  2:35 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: containers, linux-kernel, linux-pm, paul
In-Reply-To: <20110905023315.GB9807@htj.dyndns.org>

On Mon, Sep 05, 2011 at 11:33:15AM +0900, Tejun Heo wrote:
> > So, yes, I am starting to think this change is fine too ;)
> 
> Yeay. :)

Ooh, can I add your Acked-by before sending pull request to Rafael?

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 6/6] freezer: kill unused set_freezable_with_signal()
From: Tejun Heo @ 2011-09-05  2:33 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: containers, linux-kernel, linux-pm, paul
In-Reply-To: <20110904184626.GA30101@redhat.com>

Hello, Oleg.

On Sun, Sep 04, 2011 at 08:46:26PM +0200, Oleg Nesterov wrote:
> > @@ -72,10 +72,6 @@ bool __refrigerator(bool check_kthr_stop)
> >  		schedule();
> >  	}
> >
> > -	spin_lock_irq(&current->sighand->siglock);
> > -	recalc_sigpending(); /* We sent fake signal, clean it up */
> > -	spin_unlock_irq(&current->sighand->siglock);
> > -
> 
> Why? This recalc_sigpending() makes sense imho. Otherwise the user-space
> tasks (almost) always return with TIF_SIGPENDING. May be we can do this
> under "if (PF_KTRHREAD)".

PF_KTHREAD no longer gets TIF_SIGPENDING set, so...

> For example. Suppose the user-space task does wait_event_freezable()...
> 
> Hmm. OTOH, wait_event_freezable() looks wrong anyway... So probably
> this doesn't matter. ptrace_stop/get_signal_to_deliver doesn't need
> this, probably we do not care about the other callers.

Can you elaborate on it being wrong?  Do you mean the possibility of
leaking spurious TIF_SIGPENDING?

> It seems, a lot of get_signal_to_deliver() calles also call
> try_to_freeze() for no reason.

Yeap, it doesn't really matter.  In most cases userland tasks would
get parked in the signal delivery path anyway and if a task is deep in
the kernel, it should and usually does hit syscall restart path.
Except for few special cases (maybe some of unfailable NFS calls),
there'userland tasks shouldn't take try_to_freeze() path outside of
signal delivery / job control path.

Also, in general, I don't think it's a good idea to add non-trivial
optimization like above to code path which is as cold as it gets
without any backing data.  The PM freezer used to busy loop until all
tasks enter refrigerator.  Nobody cares whether some tasks enter
signal delivery path one more time.

> So, yes, I am starting to think this change is fine too ;)

Yeay. :)

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCHSET pm-freezer] freezer: fixes & simplifications
From: Oleg Nesterov @ 2011-09-04 18:48 UTC (permalink / raw)
  To: Tejun Heo; +Cc: containers, linux-kernel, linux-pm, paul
In-Reply-To: <1314988070-12244-1-git-send-email-tj@kernel.org>

On 09/03, Tejun Heo wrote:
>
> These six patches are fixes and simplifications for the recent freezer
> changes.  The first four have been posted twice.  Changes since the
> last posting[L] are
>
> * freezer->state setting bug fix updated per Matt Helsley so that the
>   actual per-task freeze/thaw operations are always performed.
>
> * fixed a bug caused by forgetting to unlock freezer_lock in "freezer:
>   restructure __refrigerator()".
>
> * Two patches added.  The fifth one is mostly trivial.  The sixth a
>   bit more involved but still shouldn't cause any noticeable
>   functional difference.  This removes the buggy and unused
>   freezable_with_signal.

Afaics, everything is fine.

Oleg.

^ permalink raw reply

* Re: [PATCH 6/6] freezer: kill unused set_freezable_with_signal()
From: Oleg Nesterov @ 2011-09-04 18:46 UTC (permalink / raw)
  To: Tejun Heo; +Cc: containers, linux-kernel, linux-pm, paul
In-Reply-To: <1314988070-12244-7-git-send-email-tj@kernel.org>

On 09/03, Tejun Heo wrote:
>
> This patch removes set_freezable_with_signal() along with
> PF_FREEZER_NOSIG

Great. I never understood PF_FREEZER_NOSIG logic ;)

One question,

> @@ -72,10 +72,6 @@ bool __refrigerator(bool check_kthr_stop)
>  		schedule();
>  	}
>
> -	spin_lock_irq(&current->sighand->siglock);
> -	recalc_sigpending(); /* We sent fake signal, clean it up */
> -	spin_unlock_irq(&current->sighand->siglock);
> -

Why? This recalc_sigpending() makes sense imho. Otherwise the user-space
tasks (almost) always return with TIF_SIGPENDING. May be we can do this
under "if (PF_KTRHREAD)".

For example. Suppose the user-space task does wait_event_freezable()...

Hmm. OTOH, wait_event_freezable() looks wrong anyway... So probably
this doesn't matter. ptrace_stop/get_signal_to_deliver doesn't need
this, probably we do not care about the other callers.

It seems, a lot of get_signal_to_deliver() calles also call
try_to_freeze() for no reason.


So, yes, I am starting to think this change is fine too ;)

Oleg.

^ permalink raw reply

* Re: [PATCH 1/6] cgroup_freezer: fix freezer->state setting bug in freezer_change_state()
From: Oleg Nesterov @ 2011-09-04 18:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: containers, linux-kernel, linux-pm, paul
In-Reply-To: <20110904181139.GA9807@htj.dyndns.org>

On 09/05, Tejun Heo wrote:
>
> On Sun, Sep 04, 2011 at 08:02:06PM +0200, Oleg Nesterov wrote:
> > Cough. Now it doesn't look right to me ;)
> >
> > If the user write "FROZEN" into the control file, we should not
> > turn the CGROUP_FROZEN state into CGROUP_FREEZING. Probably this
> > is harmless, but this looks wrong and this doesn't match the
> > current behaviour, user-visible change.
>
> It's not user-visible as the state is always updated before shown to
> userland.

Ah, indeed, thanks.

OK.

So, the only proble I can see is that update_if_frozen() can hit
BUG_ON(nfrozen > 0), but this is off-topic.

I think this patch is correct.

Oleg.

^ permalink raw reply

* Re: [PATCH 1/6] cgroup_freezer: fix freezer->state setting bug in freezer_change_state()
From: Tejun Heo @ 2011-09-04 18:11 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: containers, linux-kernel, linux-pm, paul
In-Reply-To: <20110904180206.GA28520@redhat.com>

On Sun, Sep 04, 2011 at 08:02:06PM +0200, Oleg Nesterov wrote:
> Cough. Now it doesn't look right to me ;)
> 
> If the user write "FROZEN" into the control file, we should not
> turn the CGROUP_FROZEN state into CGROUP_FREEZING. Probably this
> is harmless, but this looks wrong and this doesn't match the
> current behaviour, user-visible change.

It's not user-visible as the state is always updated before shown to
userland.  If we're gonna be re-priming freezing on each FROZEN write,
resetting the state to freezing to force state re-calculation is
logical thing to do.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 1/6] cgroup_freezer: fix freezer->state setting bug in freezer_change_state()
From: Oleg Nesterov @ 2011-09-04 18:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: containers, linux-kernel, linux-pm, paul
In-Reply-To: <1314988070-12244-2-git-send-email-tj@kernel.org>

On 09/03, Tejun Heo wrote:
>
> @@ -308,24 +308,24 @@ static int freezer_change_state(struct cgroup *cgroup,
>  	spin_lock_irq(&freezer->lock);
>  
>  	update_if_frozen(cgroup, freezer);
> -	if (goal_state == freezer->state)
> -		goto out;
> -
> -	freezer->state = goal_state;
>  
>  	switch (goal_state) {
>  	case CGROUP_THAWED:
> -		atomic_dec(&system_freezing_cnt);
> +		if (freezer->state != CGROUP_THAWED)
> +			atomic_dec(&system_freezing_cnt);
> +		freezer->state = CGROUP_THAWED;
>  		unfreeze_cgroup(cgroup, freezer);
>  		break;
>  	case CGROUP_FROZEN:
> -		atomic_inc(&system_freezing_cnt);
> +		if (freezer->state == CGROUP_THAWED)
> +			atomic_inc(&system_freezing_cnt);
> +		freezer->state = CGROUP_FREEZING;
>  		retval = try_to_freeze_cgroup(cgroup, freezer);

Cough. Now it doesn't look right to me ;)

If the user write "FROZEN" into the control file, we should not
turn the CGROUP_FROZEN state into CGROUP_FREEZING. Probably this
is harmless, but this looks wrong and this doesn't match the
current behaviour, user-visible change.

Oleg.

^ permalink raw reply

* [PATCH 4/4] cgroup: always lock threadgroup during migration
From: Tejun Heo @ 2011-09-04 18:01 UTC (permalink / raw)
  To: rjw, paul, lizf
  Cc: Paul Menage, fweisbec, containers, linux-kernel, Oleg Nesterov,
	Tejun Heo, linux-pm, akpm
In-Reply-To: <1315159280-25032-1-git-send-email-htejun@gmail.com>

From: Tejun Heo <tj@kernel.org>

Update cgroup to take advantage of the fack that threadgroup_lock()
guarantees stable threadgroup.

* Lock threadgroup even if the target is a single task.  This
  guarantees that when the target tasks stay stable during migration
  regardless of the target type.

* Remove PF_EXITING early exit optimization from attach_task_by_pid()
  and check it in cgroup_task_migrate() instead.  The optimization was
  for rather cold path to begin with and PF_EXITING state can be
  trusted throughout migration by checking it after locking
  threadgroup.

* Don't add PF_EXITING tasks to target task array in
  cgroup_attach_proc().  This ensures that task migration is performed
  only for live tasks.

* Remove -ESRCH failure path from cgroup_task_migrate().  With the
  above changes, it's guaranteed to be called only for live tasks.

After the changes, only live tasks are migrated and they're guaranteed
to stay alive until migration is complete.  This removes problems
caused by exec and exit racing against cgroup migration including
symmetry among cgroup attach methods and different cgroup methods
racing each other.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 kernel/cgroup.c |   38 ++++++++++++++++++--------------------
 1 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e81f403..e77dd41 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1744,7 +1744,7 @@ EXPORT_SYMBOL_GPL(cgroup_path);
  *
  * 'guarantee' is set if the caller promises that a new css_set for the task
  * will already exist. If not set, this function might sleep, and can fail with
- * -ENOMEM. Otherwise, it can only fail with -ESRCH.
+ * -ENOMEM. Must be called with cgroup_mutex and threadgroup locked.
  */
 static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
 			       struct task_struct *tsk, bool guarantee)
@@ -1782,13 +1782,9 @@ static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
 	}
 	put_css_set(oldcg);
 
-	/* if PF_EXITING is set, the tsk->cgroups pointer is no longer safe. */
+	/* @tsk can't exit as its threadgroup is locked */
 	task_lock(tsk);
-	if (tsk->flags & PF_EXITING) {
-		task_unlock(tsk);
-		put_css_set(newcg);
-		return -ESRCH;
-	}
+	WARN_ON_ONCE(tsk->flags & PF_EXITING);
 	rcu_assign_pointer(tsk->cgroups, newcg);
 	task_unlock(tsk);
 
@@ -1814,8 +1810,8 @@ static int cgroup_task_migrate(struct cgroup *cgrp, struct cgroup *oldcgrp,
  * @cgrp: the cgroup the task is attaching to
  * @tsk: the task to be attached
  *
- * Call holding cgroup_mutex. May take task_lock of
- * the task 'tsk' during call.
+ * Call with cgroup_mutex and threadgroup locked. May take task_lock of
+ * @tsk during call.
  */
 int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 {
@@ -1824,6 +1820,10 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
 	struct cgroup *oldcgrp;
 	struct cgroupfs_root *root = cgrp->root;
 
+	/* @tsk either already exited or can't exit until the end */
+	if (tsk->flags & PF_EXITING)
+		return -ESRCH;
+
 	/* Nothing to do if the task is already in that cgroup */
 	oldcgrp = task_cgroup_from_root(tsk, root);
 	if (cgrp == oldcgrp)
@@ -2044,6 +2044,10 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	tsk = leader;
 	i = 0;
 	do {
+		/* @tsk either already exited or can't exit until the end */
+		if (tsk->flags & PF_EXITING)
+			continue;
+
 		/* as per above, nr_threads may decrease, but not increase. */
 		BUG_ON(i >= group_size);
 		get_task_struct(tsk);
@@ -2142,7 +2146,7 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 		}
 		/* if the thread is PF_EXITING, it can just get skipped. */
 		retval = cgroup_task_migrate(cgrp, oldcgrp, tsk, true);
-		BUG_ON(retval != 0 && retval != -ESRCH);
+		BUG_ON(retval != 0);
 	}
 	/* nothing is sensitive to fork() after this point. */
 
@@ -2194,8 +2198,8 @@ out_free_group_list:
 
 /*
  * Find the task_struct of the task to attach by vpid and pass it along to the
- * function to attach either it or all tasks in its threadgroup. Will take
- * cgroup_mutex; may take task_lock of task.
+ * function to attach either it or all tasks in its threadgroup. Will lock
+ * cgroup_mutex and threadgroup; may take task_lock of task.
  */
 static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 {
@@ -2218,10 +2222,6 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 			 * detect it later.
 			 */
 			tsk = tsk->group_leader;
-		} else if (tsk->flags & PF_EXITING) {
-			/* optimization for the single-task-only case */
-			rcu_read_unlock();
-			return -ESRCH;
 		}
 
 		/*
@@ -2245,8 +2245,7 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 		get_task_struct(tsk);
 	}
 
-	if (threadgroup)
-		threadgroup_lock(tsk);
+	threadgroup_lock(tsk);
 	ret = -ENODEV;
 	if (cgroup_lock_live_group(cgrp)) {
 		if (threadgroup)
@@ -2255,8 +2254,7 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 			ret = cgroup_attach_task(cgrp, tsk);
 		cgroup_unlock();
 	}
-	if (threadgroup)
-		threadgroup_unlock(tsk);
+	threadgroup_unlock(tsk);
 	put_task_struct(tsk);
 	return ret;
 }
-- 
1.7.6

^ permalink raw reply related

* [PATCH 3/4] threadgroup: extend threadgroup_lock() to cover exit and exec
From: Tejun Heo @ 2011-09-04 18:01 UTC (permalink / raw)
  To: rjw, paul, lizf
  Cc: Paul Menage, fweisbec, containers, linux-kernel, Oleg Nesterov,
	Tejun Heo, linux-pm, akpm
In-Reply-To: <1315159280-25032-1-git-send-email-htejun@gmail.com>

From: Tejun Heo <tj@kernel.org>

threadgroup_lock() protected only protected against new addition to
the threadgroup, which was inherently somewhat incomplete and
problematic for its only user cgroup.  On-going migration could race
against exec and exit leading to interesting problems - the symmetry
between various attach methods, task exiting during method execution,
->exit() racing against attach methods, migrating task switching basic
properties during exec and so on.

This patch extends threadgroup_lock() such that it protects against
all three threadgroup altering operations - fork, exit and exec.  For
exit, threadgroup_change_begin/end() calls are added to exit path.
For exec, threadgroup_[un]lock() are updated to also grab and release
cred_guard_mutex.

With this change, threadgroup_lock() guarantees that the target
threadgroup will remain stable - no new task will be added, no new
PF_EXITING will be set and exec won't happen.

The next patch will update cgroup so that it can take full advantage
of this change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 include/linux/sched.h |   32 ++++++++++++++++++++++++++------
 kernel/exit.c         |   16 ++++++++++++----
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index da74d6f..179cbdc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -635,11 +635,12 @@ struct signal_struct {
 #endif
 #ifdef CONFIG_CGROUPS
 	/*
-	 * The group_rwsem prevents threads from forking with
-	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
-	 * threadgroup-wide operations. It's taken for reading in fork.c in
-	 * copy_process().
-	 * Currently only needed write-side by cgroups.
+	 * group_rwsem prevents new tasks from entering the threadgroup and
+	 * member tasks from exiting.  fork and exit paths are protected
+	 * with this rwsem using threadgroup_change_begin/end().  Users
+	 * which require threadgroup to remain stable should use
+	 * threadgroup_[un]lock() which also takes care of exec path.
+	 * Currently, cgroup is the only user.
 	 */
 	struct rw_semaphore group_rwsem;
 #endif
@@ -2364,7 +2365,6 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
-/* See the declaration of group_rwsem in signal_struct. */
 #ifdef CONFIG_CGROUPS
 static inline void threadgroup_change_begin(struct task_struct *tsk)
 {
@@ -2374,13 +2374,33 @@ static inline void threadgroup_change_done(struct task_struct *tsk)
 {
 	up_read(&tsk->signal->group_rwsem);
 }
+
+/**
+ * threadgroup_lock - lock threadgroup
+ * @tsk: member task of the threadgroup to lock
+ *
+ * Lock the threadgroup @tsk belongs to.  No new task is allowed to enter
+ * and member tasks aren't allowed to exit (as indicated by PF_EXITING) or
+ * perform exec.  This is useful for cases where the threadgroup needs to
+ * stay stable across blockable operations.
+ */
 static inline void threadgroup_lock(struct task_struct *tsk)
 {
+	/* exec uses exit for de-threading, grab cred_guard_mutex first */
+	mutex_lock(&tsk->signal->cred_guard_mutex);
 	down_write(&tsk->signal->group_rwsem);
 }
+
+/**
+ * threadgroup_unlock - unlock threadgroup
+ * @tsk: member task of the threadgroup to unlock
+ *
+ * Reverse threadgroup_lock().
+ */
 static inline void threadgroup_unlock(struct task_struct *tsk)
 {
 	up_write(&tsk->signal->group_rwsem);
+	mutex_unlock(&tsk->signal->cred_guard_mutex);
 }
 #else
 static inline void threadgroup_change_begin(struct task_struct *tsk) {}
diff --git a/kernel/exit.c b/kernel/exit.c
index 7b6c4fa..a5b40b3 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -936,6 +936,12 @@ NORET_TYPE void do_exit(long code)
 		schedule();
 	}
 
+	/*
+	 * @tsk's threadgroup is going through changes - lock out users
+	 * which expect stable threadgroup.
+	 */
+	threadgroup_change_begin(tsk);
+
 	exit_irq_thread();
 
 	exit_signals(tsk);  /* sets PF_EXITING */
@@ -1018,10 +1024,6 @@ NORET_TYPE void do_exit(long code)
 		kfree(current->pi_state_cache);
 #endif
 	/*
-	 * Make sure we are holding no locks:
-	 */
-	debug_check_no_locks_held(tsk);
-	/*
 	 * We can do this unlocked here. The futex code uses this flag
 	 * just to verify whether the pi state cleanup has been done
 	 * or not. In the worst case it loops once more.
@@ -1039,6 +1041,12 @@ NORET_TYPE void do_exit(long code)
 	preempt_disable();
 	exit_rcu();
 
+	/*
+	 * Release threadgroup and make sure we are holding no locks.
+	 */
+	threadgroup_change_done(tsk);
+	debug_check_no_locks_held(tsk);
+
 	/* this task is now dead and freezer should ignore it */
 	current->flags |= PF_NOFREEZE;
 
-- 
1.7.6

^ permalink raw reply related

* [PATCH 2/4] threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
From: Tejun Heo @ 2011-09-04 18:01 UTC (permalink / raw)
  To: rjw, paul, lizf
  Cc: Paul Menage, fweisbec, containers, linux-kernel, Oleg Nesterov,
	Tejun Heo, linux-pm, akpm
In-Reply-To: <1315159280-25032-1-git-send-email-htejun@gmail.com>

From: Tejun Heo <tj@kernel.org>

Make the following renames to prepare for extension of threadgroup
locking.

* s/signal->threadgroup_fork_lock/signal->group_rwsem/
* s/threadgroup_fork_read_lock()/threadgroup_change_begin()/
* s/threadgroup_fork_read_unlock()/threadgroup_change_done()/
* s/threadgroup_fork_write_lock()/threadgroup_lock()/
* s/threadgroup_fork_write_unlock()/threadgroup_unlock()/

This patch doesn't cause any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 include/linux/init_task.h |    9 ++++-----
 include/linux/sched.h     |   30 +++++++++++++++---------------
 kernel/cgroup.c           |   12 ++++++------
 kernel/fork.c             |    8 ++++----
 4 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index d14e058..3a0d9f2 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -23,11 +23,10 @@ extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
 #ifdef CONFIG_CGROUPS
-#define INIT_THREADGROUP_FORK_LOCK(sig)					\
-	.threadgroup_fork_lock =					\
-		__RWSEM_INITIALIZER(sig.threadgroup_fork_lock),
+#define INIT_GROUP_RWSEM(sig)						\
+	.group_rwsem = __RWSEM_INITIALIZER(sig.group_rwsem),
 #else
-#define INIT_THREADGROUP_FORK_LOCK(sig)
+#define INIT_GROUP_RWSEM(sig)
 #endif
 
 #define INIT_SIGNALS(sig) {						\
@@ -46,7 +45,7 @@ extern struct fs_struct init_fs;
 	},								\
 	.cred_guard_mutex =						\
 		 __MUTEX_INITIALIZER(sig.cred_guard_mutex),		\
-	INIT_THREADGROUP_FORK_LOCK(sig)					\
+	INIT_GROUP_RWSEM(sig)						\
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d45ce7..da74d6f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -635,13 +635,13 @@ struct signal_struct {
 #endif
 #ifdef CONFIG_CGROUPS
 	/*
-	 * The threadgroup_fork_lock prevents threads from forking with
+	 * The group_rwsem prevents threads from forking with
 	 * CLONE_THREAD while held for writing. Use this for fork-sensitive
 	 * threadgroup-wide operations. It's taken for reading in fork.c in
 	 * copy_process().
 	 * Currently only needed write-side by cgroups.
 	 */
-	struct rw_semaphore threadgroup_fork_lock;
+	struct rw_semaphore group_rwsem;
 #endif
 
 	int oom_adj;		/* OOM kill score adjustment (bit shift) */
@@ -2364,29 +2364,29 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }
 
-/* See the declaration of threadgroup_fork_lock in signal_struct. */
+/* See the declaration of group_rwsem in signal_struct. */
 #ifdef CONFIG_CGROUPS
-static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
+static inline void threadgroup_change_begin(struct task_struct *tsk)
 {
-	down_read(&tsk->signal->threadgroup_fork_lock);
+	down_read(&tsk->signal->group_rwsem);
 }
-static inline void threadgroup_fork_read_unlock(struct task_struct *tsk)
+static inline void threadgroup_change_done(struct task_struct *tsk)
 {
-	up_read(&tsk->signal->threadgroup_fork_lock);
+	up_read(&tsk->signal->group_rwsem);
 }
-static inline void threadgroup_fork_write_lock(struct task_struct *tsk)
+static inline void threadgroup_lock(struct task_struct *tsk)
 {
-	down_write(&tsk->signal->threadgroup_fork_lock);
+	down_write(&tsk->signal->group_rwsem);
 }
-static inline void threadgroup_fork_write_unlock(struct task_struct *tsk)
+static inline void threadgroup_unlock(struct task_struct *tsk)
 {
-	up_write(&tsk->signal->threadgroup_fork_lock);
+	up_write(&tsk->signal->group_rwsem);
 }
 #else
-static inline void threadgroup_fork_read_lock(struct task_struct *tsk) {}
-static inline void threadgroup_fork_read_unlock(struct task_struct *tsk) {}
-static inline void threadgroup_fork_write_lock(struct task_struct *tsk) {}
-static inline void threadgroup_fork_write_unlock(struct task_struct *tsk) {}
+static inline void threadgroup_change_begin(struct task_struct *tsk) {}
+static inline void threadgroup_change_done(struct task_struct *tsk) {}
+static inline void threadgroup_lock(struct task_struct *tsk) {}
+static inline void threadgroup_unlock(struct task_struct *tsk) {}
 #endif
 
 #ifndef __HAVE_THREAD_FUNCTIONS
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bd1fb5f..e81f403 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1985,8 +1985,8 @@ static int css_set_prefetch(struct cgroup *cgrp, struct css_set *cg,
  * @cgrp: the cgroup to attach to
  * @leader: the threadgroup leader task_struct of the group to be attached
  *
- * Call holding cgroup_mutex and the threadgroup_fork_lock of the leader. Will
- * take task_lock of each thread in leader's threadgroup individually in turn.
+ * Call holding cgroup_mutex and the group_rwsem of the leader. Will take
+ * task_lock of each thread in leader's threadgroup individually in turn.
  */
 int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 {
@@ -2012,8 +2012,8 @@ int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
 	 * step 0: in order to do expensive, possibly blocking operations for
 	 * every thread, we cannot iterate the thread group list, since it needs
 	 * rcu or tasklist locked. instead, build an array of all threads in the
-	 * group - threadgroup_fork_lock prevents new threads from appearing,
-	 * and if threads exit, this will just be an over-estimate.
+	 * group - group_rwsem prevents new threads from appearing, and if
+	 * threads exit, this will just be an over-estimate.
 	 */
 	group_size = get_nr_threads(leader);
 	/* flex_array supports very large thread-groups better than kmalloc. */
@@ -2246,7 +2246,7 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 	}
 
 	if (threadgroup)
-		threadgroup_fork_write_lock(tsk);
+		threadgroup_lock(tsk);
 	ret = -ENODEV;
 	if (cgroup_lock_live_group(cgrp)) {
 		if (threadgroup)
@@ -2256,7 +2256,7 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 		cgroup_unlock();
 	}
 	if (threadgroup)
-		threadgroup_fork_write_unlock(tsk);
+		threadgroup_unlock(tsk);
 	put_task_struct(tsk);
 	return ret;
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index fa7beb3..f0e2e56 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -980,7 +980,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sched_autogroup_fork(sig);
 
 #ifdef CONFIG_CGROUPS
-	init_rwsem(&sig->threadgroup_fork_lock);
+	init_rwsem(&sig->group_rwsem);
 #endif
 
 	sig->oom_adj = current->signal->oom_adj;
@@ -1165,7 +1165,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->io_context = NULL;
 	p->audit_context = NULL;
 	if (clone_flags & CLONE_THREAD)
-		threadgroup_fork_read_lock(current);
+		threadgroup_change_begin(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1377,7 +1377,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	proc_fork_connector(p);
 	cgroup_post_fork(p);
 	if (clone_flags & CLONE_THREAD)
-		threadgroup_fork_read_unlock(current);
+		threadgroup_change_done(current);
 	perf_event_fork(p);
 	return p;
 
@@ -1417,7 +1417,7 @@ bad_fork_cleanup_policy:
 bad_fork_cleanup_cgroup:
 #endif
 	if (clone_flags & CLONE_THREAD)
-		threadgroup_fork_read_unlock(current);
+		threadgroup_change_done(current);
 	cgroup_exit(p, cgroup_callbacks_done);
 	delayacct_tsk_free(p);
 	module_put(task_thread_info(p)->exec_domain->module);
-- 
1.7.6

^ permalink raw reply related

* [PATCH 1/4] cgroup: change locking order in attach_task_by_pid()
From: Tejun Heo @ 2011-09-04 18:01 UTC (permalink / raw)
  To: rjw, paul, lizf
  Cc: Paul Menage, fweisbec, containers, linux-kernel, Oleg Nesterov,
	Tejun Heo, linux-pm, akpm
In-Reply-To: <1315159280-25032-1-git-send-email-htejun@gmail.com>

From: Tejun Heo <tj@kernel.org>

cgroup_mutex is updated to nest inside threadgroup_fork_lock instead
of the other way around.  threadgroup locking is scheduled to be
updated to cover all threadgroup altering operations and nesting it
inside cgroup_mutex complicates locking dependency unnecessarily.
This also simplifies code a bit.

The ugly "if (threadgroup)" conditionals for threadgroup locking will
soon be removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 kernel/cgroup.c |   22 ++++++++++------------
 1 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1d2b6ce..bd1fb5f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2203,15 +2203,11 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 	const struct cred *cred = current_cred(), *tcred;
 	int ret;
 
-	if (!cgroup_lock_live_group(cgrp))
-		return -ENODEV;
-
 	if (pid) {
 		rcu_read_lock();
 		tsk = find_task_by_vpid(pid);
 		if (!tsk) {
 			rcu_read_unlock();
-			cgroup_unlock();
 			return -ESRCH;
 		}
 		if (threadgroup) {
@@ -2225,7 +2221,6 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 		} else if (tsk->flags & PF_EXITING) {
 			/* optimization for the single-task-only case */
 			rcu_read_unlock();
-			cgroup_unlock();
 			return -ESRCH;
 		}
 
@@ -2238,7 +2233,6 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 		    cred->euid != tcred->uid &&
 		    cred->euid != tcred->suid) {
 			rcu_read_unlock();
-			cgroup_unlock();
 			return -EACCES;
 		}
 		get_task_struct(tsk);
@@ -2251,15 +2245,19 @@ static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup)
 		get_task_struct(tsk);
 	}
 
-	if (threadgroup) {
+	if (threadgroup)
 		threadgroup_fork_write_lock(tsk);
-		ret = cgroup_attach_proc(cgrp, tsk);
-		threadgroup_fork_write_unlock(tsk);
-	} else {
-		ret = cgroup_attach_task(cgrp, tsk);
+	ret = -ENODEV;
+	if (cgroup_lock_live_group(cgrp)) {
+		if (threadgroup)
+			ret = cgroup_attach_proc(cgrp, tsk);
+		else
+			ret = cgroup_attach_task(cgrp, tsk);
+		cgroup_unlock();
 	}
+	if (threadgroup)
+		threadgroup_fork_write_unlock(tsk);
 	put_task_struct(tsk);
-	cgroup_unlock();
 	return ret;
 }
 
-- 
1.7.6

^ permalink raw reply related

* [PATCHSET cgroup] extend threadgroup locking
From: Tejun Heo @ 2011-09-04 18:01 UTC (permalink / raw)
  To: rjw, paul, lizf; +Cc: fweisbec, containers, linux-kernel, linux-pm, akpm

Hello,

cgroup currently only blocks new threads from joining the target
threadgroup during migration, and on-going migration could race
against exec and exit leading to interesting problems - the symmetry
between various attach methods, task exiting during method execution,
->exit() racing against attach methods, migrating task switching basic
properties during exec and so on.

This patchset extends threadgroup locking such that it covers all
operations which can alter the threadgroup - fork, exit and exec, and
update cgroup to take advantage of it.  rwsem read ops are added to
exit path but exec is excluded by grabbing the existing
cred_guard_mutex from threadgroup locking helper.

This makes threadgroup locking complete and resolves cgroup issues
stemming from the target taskset being unstable.

This patchset is on top of the current pm-freezer + "freezer: fixes &
simplifications" patchset and contains the following four patches.
Patch list and diffstat follow.

Thanks.

 [PATCH 1/4] cgroup: change locking order in attach_task_by_pid()
 [PATCH 2/4] threadgroup: rename signal->threadgroup_fork_lock to
 [PATCH 3/4] threadgroup: extend threadgroup_lock() to cover exit and
 [PATCH 4/4] cgroup: always lock threadgroup during migration

 include/linux/init_task.h |    9 ++----
 include/linux/sched.h     |   58 ++++++++++++++++++++++++++++---------------
 kernel/cgroup.c           |   62 +++++++++++++++++++++-------------------------
 kernel/exit.c             |   16 ++++++++---
 kernel/fork.c             |    8 ++---
 5 files changed, 88 insertions(+), 65 deletions(-)

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/1187553

^ permalink raw reply

* Re: [PATCH v4 1/2] Input: enable i8042-level wakeup control
From: Daniel Drake @ 2011-09-03 12:45 UTC (permalink / raw)
  To: Dmitry Torokhov; +Cc: linux-pm, dilinger, linux-input
In-Reply-To: <20110817070351.GE29361@core.coreip.homeip.net>

Hi,

Sorry for the delay in replying to this - I took sime time off.


On Wed, Aug 17, 2011 at 8:03 AM, Dmitry Torokhov
<dmitry.torokhov@gmail.com> wrote:
> OK, so we have the following scenario:
>
> - we do not need to worry about SND and LED on OLPC;
>
> - we do not need to worry about releasing the key that was pressed
>  when we went to sleep because they system would not go to sleep while
>  there is a key pressed. So the code in input_dev_resume() won't
>  actually release any keys unless a key was held pressed and we forced
>  system to sleep somehow.
>
> - we need to make sure we do not loose key press (and following events)
>  when we are waking up.

Yes, that seems accurate, with one more addition:

- we need to make sure we do not loose mouse button press or mouse
movement (and following events) when we are waking up.

> Can atkbd driver stash away serio byte data (when not in command mode)
> and replay it when pm notifier signals us that resume process is done?

I'm having trouble understanding this suggestion or relating it to the
problems in question and the previous patches posted. Are you
suggesting stashing of data that is sent from the host  to the
keyboard, or from the keyboard to the host? How does this solve our
issues? What about the mouse?

How is it going to avoid the input-level issues?

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH 3/5] PM / QoS: Add function dev_pm_qos_read_value()
From: Rafael J. Wysocki @ 2011-09-03  8:02 UTC (permalink / raw)
  To: Jean Pihet; +Cc: Linux PM mailing list, LKML, Linux-sh list
In-Reply-To: <201109030155.28031.rjw@sisk.pl>

Hi,

On Saturday, September 03, 2011, Rafael J. Wysocki wrote:
> On Friday, September 02, 2011, Jean Pihet wrote:
> > On Fri, Sep 2, 2011 at 12:07 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
...
> > >> >
> > >> > -       mutex_lock(&dev_pm_qos_mtx);
> > >> >        req->dev = dev;
> > >> >
> > >> > -       /* Return if the device has been removed */
> > >> > -       if (req->dev->power.constraints_state == DEV_PM_QOS_NO_DEVICE) {
> > >> > -               ret = -ENODEV;
> > >> > -               goto out;
> > >> > -       }
> > >> > +       device_pm_lock();
> > >> > +       mutex_lock(&dev_pm_qos_mtx);
> > >> >
> > >> > -       /*
> > >> > -        * Allocate the constraints data on the first call to add_request,
> > >> > -        * i.e. only if the data is not already allocated and if the device has
> > >> > -        * not been removed
> > >> > -        */
> > >> > -       if (dev->power.constraints_state == DEV_PM_QOS_DEVICE_PRESENT)
> > >> > -               ret = dev_pm_qos_constraints_allocate(dev);
> > >> > +       if (dev->power.constraints) {
> > >> > +               device_pm_unlock();
> > >> > +       } else {
> > >> > +               if (list_empty(&dev->power.entry)) {
> > >> > +                       /* The device has been removed from the system. */
> > >> > +                       device_pm_unlock();
> > >> > +                       goto out;
> > >> 0 is silently returned in case the device has been removed. Is that
> > >> the intention?
> > >
> > > Pretty much it is.  Is that a problem?
> > I think the caller needs to know if the constraint has been applied
> > correctly or not.
> 
> However, the removal of the device is a special case.  What would the caller
> be supposed to do when it learned that the device had been removed while it had
> been trying to add a QoS constraing for it?  Not much I guess.

Anyway, since returning an error code in that case won't hurt either,
below is a v2 of the patch doing that and resetting the dev field in the
req structure.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rjw@sisk.pl>
Subject: PM / QoS: Add function dev_pm_qos_read_value() (v2)

To read the current PM QoS value for a given device we need to
make sure that the device's power.constraints object won't be
removed while we're doing that.  For this reason, put the
operation under dev->power.lock and acquire the lock
around the initialization and removal of power.constraints.

Moreover, since we're using the value of power.constraints to
determine whether or not the object is present, the
power.constraints_state field isn't necessary any more and may be
removed.  However, dev_pm_qos_add_request() needs to check if the
device is being removed from the system before allocating a new
PM QoS constraints object for it, so it has to use device_pm_lock()
and the device PM QoS initialization and destruction should be done
under device_pm_lock() as well.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/base/power/main.c |    4 -
 drivers/base/power/qos.c  |  170 ++++++++++++++++++++++++++--------------------
 include/linux/pm.h        |    8 --
 include/linux/pm_qos.h    |    3 
 4 files changed, 104 insertions(+), 81 deletions(-)

Index: linux/drivers/base/power/qos.c
===================================================================
--- linux.orig/drivers/base/power/qos.c
+++ linux/drivers/base/power/qos.c
@@ -30,15 +30,6 @@
  * . To minimize the data usage by the per-device constraints, the data struct
  *   is only allocated at the first call to dev_pm_qos_add_request.
  * . The data is later free'd when the device is removed from the system.
- * . The constraints_state variable from dev_pm_info tracks the data struct
- *    allocation state:
- *    DEV_PM_QOS_NO_DEVICE: No device present or device removed, no data
- *     allocated,
- *    DEV_PM_QOS_DEVICE_PRESENT: Device present, data not allocated and will be
- *     allocated at the first call to dev_pm_qos_add_request,
- *    DEV_PM_QOS_ALLOCATED: Device present, data allocated. The per-device
- *     PM QoS constraints framework is operational and constraints can be
- *     added, updated or removed using the dev_pm_qos_* API.
  *  . A global mutex protects the constraints users from the data being
  *     allocated and free'd.
  */
@@ -51,8 +42,30 @@
 
 
 static DEFINE_MUTEX(dev_pm_qos_mtx);
+
 static BLOCKING_NOTIFIER_HEAD(dev_pm_notifiers);
 
+/**
+ * dev_pm_qos_read_value - Get PM QoS constraint for a given device.
+ * @dev: Device to get the PM QoS constraint value for.
+ */
+s32 dev_pm_qos_read_value(struct device *dev)
+{
+	struct pm_qos_constraints *c;
+	unsigned long flags;
+	s32 ret = 0;
+
+	spin_lock_irqsave(&dev->power.lock, flags);
+
+	c = dev->power.constraints;
+	if (c)
+		ret = pm_qos_read_value(c);
+
+	spin_unlock_irqrestore(&dev->power.lock, flags);
+
+	return ret;
+}
+
 /*
  * apply_constraint
  * @req: constraint request to apply
@@ -105,27 +118,37 @@ static int dev_pm_qos_constraints_alloca
 	}
 	BLOCKING_INIT_NOTIFIER_HEAD(n);
 
+	plist_head_init(&c->list);
+	c->target_value = PM_QOS_DEV_LAT_DEFAULT_VALUE;
+	c->default_value = PM_QOS_DEV_LAT_DEFAULT_VALUE;
+	c->type = PM_QOS_MIN;
+	c->notifiers = n;
+
+	spin_lock_irq(&dev->power.lock);
 	dev->power.constraints = c;
-	plist_head_init(&dev->power.constraints->list);
-	dev->power.constraints->target_value = PM_QOS_DEV_LAT_DEFAULT_VALUE;
-	dev->power.constraints->default_value =	PM_QOS_DEV_LAT_DEFAULT_VALUE;
-	dev->power.constraints->type = PM_QOS_MIN;
-	dev->power.constraints->notifiers = n;
-	dev->power.constraints_state = DEV_PM_QOS_ALLOCATED;
+	spin_unlock_irq(&dev->power.lock);
 
 	return 0;
 }
 
+static void __dev_pm_qos_constraints_init(struct device *dev)
+{
+	spin_lock_irq(&dev->power.lock);
+	dev->power.constraints = NULL;
+	spin_unlock_irq(&dev->power.lock);
+}
+
 /**
- * dev_pm_qos_constraints_init
+ * dev_pm_qos_constraints_init - Initalize device's PM QoS constraints pointer.
  * @dev: target device
  *
- * Called from the device PM subsystem at device insertion
+ * Called from the device PM subsystem at device insertion under
+ * device_pm_lock().
  */
 void dev_pm_qos_constraints_init(struct device *dev)
 {
 	mutex_lock(&dev_pm_qos_mtx);
-	dev->power.constraints_state = DEV_PM_QOS_DEVICE_PRESENT;
+	dev->power.constraints = NULL;
 	mutex_unlock(&dev_pm_qos_mtx);
 }
 
@@ -133,34 +156,35 @@ void dev_pm_qos_constraints_init(struct
  * dev_pm_qos_constraints_destroy
  * @dev: target device
  *
- * Called from the device PM subsystem at device removal
+ * Called from the device PM subsystem at device removal under device_pm_lock().
  */
 void dev_pm_qos_constraints_destroy(struct device *dev)
 {
 	struct dev_pm_qos_request *req, *tmp;
+	struct pm_qos_constraints *c;
 
 	mutex_lock(&dev_pm_qos_mtx);
 
-	if (dev->power.constraints_state == DEV_PM_QOS_ALLOCATED) {
-		/* Flush the constraints list for the device */
-		plist_for_each_entry_safe(req, tmp,
-					  &dev->power.constraints->list,
-					  node) {
-			/*
-			 * Update constraints list and call the notification
-			 * callbacks if needed
-			 */
-			apply_constraint(req, PM_QOS_REMOVE_REQ,
-					 PM_QOS_DEFAULT_VALUE);
-			memset(req, 0, sizeof(*req));
-		}
+	c = dev->power.constraints;
+	if (!c)
+		goto out;
 
-		kfree(dev->power.constraints->notifiers);
-		kfree(dev->power.constraints);
-		dev->power.constraints = NULL;
+	/* Flush the constraints list for the device */
+	plist_for_each_entry_safe(req, tmp, &c->list, node) {
+		/*
+		 * Update constraints list and call the notification
+		 * callbacks if needed
+		 */
+		apply_constraint(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
+		memset(req, 0, sizeof(*req));
 	}
-	dev->power.constraints_state = DEV_PM_QOS_NO_DEVICE;
 
+	__dev_pm_qos_constraints_init(dev);
+
+	kfree(c->notifiers);
+	kfree(c);
+
+ out:
 	mutex_unlock(&dev_pm_qos_mtx);
 }
 
@@ -178,8 +202,9 @@ void dev_pm_qos_constraints_destroy(stru
  *
  * Returns 1 if the aggregated constraint value has changed,
  * 0 if the aggregated constraint value has not changed,
- * -EINVAL in case of wrong parameters, -ENODEV if the device has been
- * removed from the system
+ * -EINVAL in case of wrong parameters, -ENOMEM if there's not enough memory
+ * to allocate for data structures, -ENODEV if the device has just been removed
+ * from the system.
  */
 int dev_pm_qos_add_request(struct device *dev, struct dev_pm_qos_request *req,
 			   s32 value)
@@ -195,28 +220,37 @@ int dev_pm_qos_add_request(struct device
 		return -EINVAL;
 	}
 
-	mutex_lock(&dev_pm_qos_mtx);
 	req->dev = dev;
 
-	/* Return if the device has been removed */
-	if (req->dev->power.constraints_state == DEV_PM_QOS_NO_DEVICE) {
-		ret = -ENODEV;
-		goto out;
-	}
+	device_pm_lock();
+	mutex_lock(&dev_pm_qos_mtx);
 
-	/*
-	 * Allocate the constraints data on the first call to add_request,
-	 * i.e. only if the data is not already allocated and if the device has
-	 * not been removed
-	 */
-	if (dev->power.constraints_state == DEV_PM_QOS_DEVICE_PRESENT)
-		ret = dev_pm_qos_constraints_allocate(dev);
+	if (dev->power.constraints) {
+		device_pm_unlock();
+	} else {
+		if (list_empty(&dev->power.entry)) {
+			/* The device has been removed from the system. */
+			device_pm_unlock();
+			req->dev = NULL;
+			ret = -ENODEV;
+			goto out;
+		} else {
+			device_pm_unlock();
+			/*
+			 * Allocate the constraints data on the first call to
+			 * add_request, i.e. only if the data is not already
+			 * allocated and if the device has not been removed.
+			 */
+			ret = dev_pm_qos_constraints_allocate(dev);
+		}
+	}
 
 	if (!ret)
 		ret = apply_constraint(req, PM_QOS_ADD_REQ, value);
 
-out:
+ out:
 	mutex_unlock(&dev_pm_qos_mtx);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(dev_pm_qos_add_request);
@@ -252,13 +286,13 @@ int dev_pm_qos_update_request(struct dev
 
 	mutex_lock(&dev_pm_qos_mtx);
 
-	if (req->dev->power.constraints_state == DEV_PM_QOS_ALLOCATED) {
+	if (req->dev->power.constraints) {
 		if (new_value != req->node.prio)
 			ret = apply_constraint(req, PM_QOS_UPDATE_REQ,
 					       new_value);
 	} else {
 		/* Return if the device has been removed */
-		ret = -ENODEV;
+		ret = -EINVAL;
 	}
 
 	mutex_unlock(&dev_pm_qos_mtx);
@@ -293,7 +327,7 @@ int dev_pm_qos_remove_request(struct dev
 
 	mutex_lock(&dev_pm_qos_mtx);
 
-	if (req->dev->power.constraints_state == DEV_PM_QOS_ALLOCATED) {
+	if (req->dev->power.constraints) {
 		ret = apply_constraint(req, PM_QOS_REMOVE_REQ,
 				       PM_QOS_DEFAULT_VALUE);
 		memset(req, 0, sizeof(*req));
@@ -323,15 +357,12 @@ int dev_pm_qos_add_notifier(struct devic
 
 	mutex_lock(&dev_pm_qos_mtx);
 
-	/* Silently return if the device has been removed */
-	if (dev->power.constraints_state != DEV_PM_QOS_ALLOCATED)
-		goto out;
-
-	retval = blocking_notifier_chain_register(
-			dev->power.constraints->notifiers,
-			notifier);
+	/* Silently return if the constraints object is not present. */
+	if (dev->power.constraints)
+		retval = blocking_notifier_chain_register(
+				dev->power.constraints->notifiers,
+				notifier);
 
-out:
 	mutex_unlock(&dev_pm_qos_mtx);
 	return retval;
 }
@@ -354,15 +385,12 @@ int dev_pm_qos_remove_notifier(struct de
 
 	mutex_lock(&dev_pm_qos_mtx);
 
-	/* Silently return if the device has been removed */
-	if (dev->power.constraints_state != DEV_PM_QOS_ALLOCATED)
-		goto out;
-
-	retval = blocking_notifier_chain_unregister(
-			dev->power.constraints->notifiers,
-			notifier);
+	/* Silently return if the constraints object is not present. */
+	if (dev->power.constraints)
+		retval = blocking_notifier_chain_unregister(
+				dev->power.constraints->notifiers,
+				notifier);
 
-out:
 	mutex_unlock(&dev_pm_qos_mtx);
 	return retval;
 }
Index: linux/include/linux/pm_qos.h
===================================================================
--- linux.orig/include/linux/pm_qos.h
+++ linux/include/linux/pm_qos.h
@@ -77,6 +77,7 @@ int pm_qos_remove_notifier(int pm_qos_cl
 int pm_qos_request_active(struct pm_qos_request *req);
 s32 pm_qos_read_value(struct pm_qos_constraints *c);
 
+s32 dev_pm_qos_read_value(struct device *dev);
 int dev_pm_qos_add_request(struct device *dev, struct dev_pm_qos_request *req,
 			   s32 value);
 int dev_pm_qos_update_request(struct dev_pm_qos_request *req, s32 new_value);
@@ -117,6 +118,8 @@ static inline int pm_qos_request_active(
 static inline s32 pm_qos_read_value(struct pm_qos_constraints *c)
 			{ return 0; }
 
+static inline s32 dev_pm_qos_read_value(struct device *dev)
+			{ return 0; }
 static inline int dev_pm_qos_add_request(struct device *dev,
 					 struct dev_pm_qos_request *req,
 					 s32 value)
Index: linux/drivers/base/power/main.c
===================================================================
--- linux.orig/drivers/base/power/main.c
+++ linux/drivers/base/power/main.c
@@ -98,8 +98,8 @@ void device_pm_add(struct device *dev)
 		dev_warn(dev, "parent %s should not be sleeping\n",
 			dev_name(dev->parent));
 	list_add_tail(&dev->power.entry, &dpm_list);
-	mutex_unlock(&dpm_list_mtx);
 	dev_pm_qos_constraints_init(dev);
+	mutex_unlock(&dpm_list_mtx);
 }
 
 /**
@@ -110,9 +110,9 @@ void device_pm_remove(struct device *dev
 {
 	pr_debug("PM: Removing info for %s:%s\n",
 		 dev->bus ? dev->bus->name : "No Bus", dev_name(dev));
-	dev_pm_qos_constraints_destroy(dev);
 	complete_all(&dev->power.completion);
 	mutex_lock(&dpm_list_mtx);
+	dev_pm_qos_constraints_destroy(dev);
 	list_del_init(&dev->power.entry);
 	mutex_unlock(&dpm_list_mtx);
 	device_wakeup_disable(dev);
Index: linux/include/linux/pm.h
===================================================================
--- linux.orig/include/linux/pm.h
+++ linux/include/linux/pm.h
@@ -421,13 +421,6 @@ enum rpm_request {
 	RPM_REQ_RESUME,
 };
 
-/* Per-device PM QoS constraints data struct state */
-enum dev_pm_qos_state {
-	DEV_PM_QOS_NO_DEVICE,		/* No device present */
-	DEV_PM_QOS_DEVICE_PRESENT,	/* Device present, data not allocated */
-	DEV_PM_QOS_ALLOCATED,		/* Device present, data allocated */
-};
-
 struct wakeup_source;
 
 struct pm_domain_data {
@@ -489,7 +482,6 @@ struct dev_pm_info {
 #endif
 	struct pm_subsys_data	*subsys_data;  /* Owned by the subsystem. */
 	struct pm_qos_constraints *constraints;
-	enum dev_pm_qos_state	constraints_state;
 };
 
 extern void update_pm_runtime_accounting(struct device *dev);

^ permalink raw reply

* Re: [PATCH 3/5] PM / QoS: Add function dev_pm_qos_read_value()
From: Rafael J. Wysocki @ 2011-09-02 23:55 UTC (permalink / raw)
  To: Jean Pihet; +Cc: Linux PM mailing list, LKML, Linux-sh list
In-Reply-To: <CAORVsuVAzc8Q2tG0-WW+96Ad9GhR5OCq2bAuweggs1rimFMcqg@mail.gmail.com>

On Friday, September 02, 2011, Jean Pihet wrote:
> On Fri, Sep 2, 2011 at 12:07 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > Hi,
> >
> > On Thursday, September 01, 2011, Jean Pihet wrote:
> >> Hi Rafael,
> >>
> >> On Wed, Aug 31, 2011 at 12:21 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> >> > From: Rafael J. Wysocki <rjw@sisk.pl>
> >> >
> >> > To read the current PM QoS value for a given device we need to
> >> > make sure that the device's power.constraints object won't be
> >> > removed while we're doing that.  For this reason, put the
> >> > operation under dev->power.lock and acquire the lock
> >> > around the initialization and removal of power.constraints.
> >> Ok.
> >>
> >> > Moreover, since we're using the value of power.constraints to
> >> > determine whether or not the object is present, the
> >> > power.constraints_state field isn't necessary any more and may be
> >> > removed.  However, dev_pm_qos_add_request() needs to check if the
> >> > device is being removed from the system before allocating a new
> >> > PM QoS constraints object for it, so it has to use device_pm_lock()
> >> > and the device PM QoS initialization and destruction should be done
> >> > under device_pm_lock() as well.
> >> Ok that makes sense.
> >> The constraints_state field can be replaced by a combination of
> >> dev->power.constraints and list_empty(&dev->power.entry), which makes
> >> the code more compact and less redundant.
> >>
> >> >
> >> > Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> >> > ---
> >> >  drivers/base/power/main.c |    4 -
> >> >  drivers/base/power/qos.c  |  167 ++++++++++++++++++++++++++--------------------
> >> >  include/linux/pm.h        |    8 --
> >> >  include/linux/pm_qos.h    |    3
> >> >  4 files changed, 101 insertions(+), 81 deletions(-)
> >> >
> >> > Index: linux/drivers/base/power/qos.c
> >> > ===================================================================
> >> > --- linux.orig/drivers/base/power/qos.c
> >> > +++ linux/drivers/base/power/qos.c
> >> > @@ -30,15 +30,6 @@
> >> ...
> >>
> >> >
> >> > @@ -178,8 +202,8 @@ void dev_pm_qos_constraints_destroy(stru
> >> >  *
> >> >  * Returns 1 if the aggregated constraint value has changed,
> >> >  * 0 if the aggregated constraint value has not changed,
> >> > - * -EINVAL in case of wrong parameters, -ENODEV if the device has been
> >> > - * removed from the system
> >> > + * -EINVAL in case of wrong parameters, -ENOMEM if there's not enough memory
> >> > + * to allocate for data structures.
> >> Why not use -ENODEV in case there is no device?
> >
> > I don't think it's useful for the caller.  If the device is gone, the
> > constraing simply doesn't matter, so there's no error to handle.
> >
> >> >  */
> >> >  int dev_pm_qos_add_request(struct device *dev, struct dev_pm_qos_request *req,
> >> >                           s32 value)
> >> > @@ -195,28 +219,35 @@ int dev_pm_qos_add_request(struct device
> >> >                return -EINVAL;
> >> >        }
> >> >
> >> > -       mutex_lock(&dev_pm_qos_mtx);
> >> >        req->dev = dev;
> >> >
> >> > -       /* Return if the device has been removed */
> >> > -       if (req->dev->power.constraints_state == DEV_PM_QOS_NO_DEVICE) {
> >> > -               ret = -ENODEV;
> >> > -               goto out;
> >> > -       }
> >> > +       device_pm_lock();
> >> > +       mutex_lock(&dev_pm_qos_mtx);
> >> >
> >> > -       /*
> >> > -        * Allocate the constraints data on the first call to add_request,
> >> > -        * i.e. only if the data is not already allocated and if the device has
> >> > -        * not been removed
> >> > -        */
> >> > -       if (dev->power.constraints_state == DEV_PM_QOS_DEVICE_PRESENT)
> >> > -               ret = dev_pm_qos_constraints_allocate(dev);
> >> > +       if (dev->power.constraints) {
> >> > +               device_pm_unlock();
> >> > +       } else {
> >> > +               if (list_empty(&dev->power.entry)) {
> >> > +                       /* The device has been removed from the system. */
> >> > +                       device_pm_unlock();
> >> > +                       goto out;
> >> 0 is silently returned in case the device has been removed. Is that
> >> the intention?
> >
> > Pretty much it is.  Is that a problem?
> I think the caller needs to know if the constraint has been applied
> correctly or not.

However, the removal of the device is a special case.  What would the caller
be supposed to do when it learned that the device had been removed while it had
been trying to add a QoS constraing for it?  Not much I guess.

Thanks,
Rafael

^ permalink raw reply

* Re: [PATCH pm-freezer 1/4] cgroup_freezer: fix freezer->state setting bug in freezer_change_state()
From: Matt Helsley @ 2011-09-02 18:31 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Paul Menage, linux-kernel, Tejun Heo, linux-pm, containers
In-Reply-To: <20110902165839.GA7478@redhat.com>

On Fri, Sep 02, 2011 at 06:58:39PM +0200, Oleg Nesterov wrote:
> On 09/01, Matt Helsley wrote:
> >
> > On Wed, Aug 31, 2011 at 12:21:07PM +0200, Tejun Heo wrote:
> > >  	case CGROUP_FROZEN:
> > > -		atomic_inc(&system_freezing_cnt);
> > > -		retval = try_to_freeze_cgroup(cgroup, freezer);
> > > +		if (freezer->state == CGROUP_THAWED) {
> > > +			freezer->state = CGROUP_FREEZING;
> > > +			atomic_inc(&system_freezing_cnt);
> > > +			retval = try_to_freeze_cgroup(cgroup, freezer);
> >
> > This still doesn't look quite right. If the cgroup is FREEZING it should
> > also call try_to_freeze_cgroup(). I think this is what's needed:
> >
> > 		if (freezer->state == CGROUP_THAWED)
> > 			atomic_inc(&system_freezing_cnt);
> > 		freezer->state = CGROUP_FREEZING;
> > 		retval = try_to_freeze_cgroup(cgroup, freezer);
> 
> This is what I mentioned before, to me this looks like a win.
> 
> Why do we need try_to_freeze_cgroup() in this case? "for safety"
> could actually mean "hide the bug" ;)

Well, I need to check Tejun's latest freezer bits to see if this is
still the case but it was possible to get to the FREEZING state and
not enter FROZEN before returning to userspace. So you could come back
into the state change function in the FREEZING state with FROZEN as
the goal state. Note that for the cgroup freezer the FREEZING state is
optional -- so skipping it is fine so long was we guarantee that by the
time we exit to userspace with a FROZEN state all the tasks in the cgroup
are actually frozen (in the refrigerator loop) or frozen enough (about to
enter the refrigerator loop without causing more IO -- e.g. stopped).

Cheers,
	-Matt

^ permalink raw reply

* [PATCH 6/6] freezer: kill unused set_freezable_with_signal()
From: Tejun Heo @ 2011-09-02 18:27 UTC (permalink / raw)
  To: oleg, matthltc, rjw, paul; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314988070-12244-1-git-send-email-tj@kernel.org>

There's no in-kernel user of set_freezable_with_signal() and there's
no plan to add one.  Mixing TIF_SIGPENDING with kernel threads can
lead to nasty corner cases as kernel threads never travel signal
delivery path on their own.

e.g. the current implementation is buggy in the cancelation path of
__thaw_task().  It calls recalc_sigpending_and_wake() in an attempt to
clear TIF_SIGPENDING but the function never clears it regardless of
sigpending state.  This means that signallable freezable kthreads may
continue executing with !freezing() && stuck TIF_SIGPENDING, which can
be troublesome.

This patch removes set_freezable_with_signal() along with
PF_FREEZER_NOSIG and recalc_sigpending*() calls in freezer.  User
tasks get TIF_SIGPENDING, kernel tasks get woken up and the spurious
sigpending is dealt with in the usual signal delivery path.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/freezer.h |   20 +-------------------
 include/linux/sched.h   |    1 -
 kernel/freezer.c        |   27 ++++++---------------------
 kernel/kthread.c        |    2 +-
 4 files changed, 8 insertions(+), 42 deletions(-)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 9193bae..7ffbf04 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -48,7 +48,7 @@ static inline bool try_to_freeze(void)
 }
 
 extern bool freeze_task(struct task_struct *p);
-extern bool __set_freezable(bool with_signal);
+extern bool set_freezable(void);
 
 #ifdef CONFIG_CGROUP_FREEZER
 extern bool cgroup_freezing(struct task_struct *task);
@@ -104,23 +104,6 @@ static inline int freezer_should_skip(struct task_struct *p)
 }
 
 /*
- * Tell the freezer that the current task should be frozen by it
- */
-static inline bool set_freezable(void)
-{
-	return __set_freezable(false);
-}
-
-/*
- * Tell the freezer that the current task should be frozen by it and that it
- * should send a fake signal to the task to freeze it.
- */
-static inline bool set_freezable_with_signal(void)
-{
-	return __set_freezable(true);
-}
-
-/*
  * Freezer-friendly wrappers around wait_event_interruptible() and
  * wait_event_interruptible_timeout(), originally defined in <linux/wait.h>
  */
@@ -164,7 +147,6 @@ static inline void freezer_do_not_count(void) {}
 static inline void freezer_count(void) {}
 static inline int freezer_should_skip(struct task_struct *p) { return 0; }
 static inline void set_freezable(void) {}
-static inline void set_freezable_with_signal(void) {}
 
 #define wait_event_freezable(wq, condition)				\
 		wait_event_interruptible(wq, condition)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1bb3356..6d45ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1784,7 +1784,6 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
-#define PF_FREEZER_NOSIG 0x80000000	/* Freezer won't send signals to it */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
diff --git a/kernel/freezer.c b/kernel/freezer.c
index da76b64..770e6f5 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -39,7 +39,7 @@ bool freezing_slow_path(struct task_struct *p)
 	if (pm_nosig_freezing || cgroup_freezing(p))
 		return true;
 
-	if (pm_freezing && !(p->flags & PF_FREEZER_NOSIG))
+	if (pm_freezing && !(p->flags & PF_KTHREAD))
 		return true;
 
 	return false;
@@ -72,10 +72,6 @@ bool __refrigerator(bool check_kthr_stop)
 		schedule();
 	}
 
-	spin_lock_irq(&current->sighand->siglock);
-	recalc_sigpending(); /* We sent fake signal, clean it up */
-	spin_unlock_irq(&current->sighand->siglock);
-
 	pr_debug("%s left refrigerator\n", current->comm);
 
 	/*
@@ -120,7 +116,7 @@ bool freeze_task(struct task_struct *p)
 		return false;
 	}
 
-	if (!(p->flags & PF_FREEZER_NOSIG)) {
+	if (!(p->flags & PF_KTHREAD)) {
 		fake_signal_wake_up(p);
 		/*
 		 * fake_signal_wake_up() goes through p's scheduler
@@ -145,28 +141,19 @@ void __thaw_task(struct task_struct *p)
 	 * be visible to @p as waking up implies wmb.  Waking up inside
 	 * freezer_lock also prevents wakeups from leaking outside
 	 * refrigerator.
-	 *
-	 * If !FROZEN, @p hasn't reached refrigerator, recalc sigpending to
-	 * avoid leaving dangling TIF_SIGPENDING behind.
 	 */
 	spin_lock_irqsave(&freezer_lock, flags);
-	if (frozen(p)) {
+	if (frozen(p))
 		wake_up_process(p);
-	} else {
-		spin_lock(&p->sighand->siglock);
-		recalc_sigpending_and_wake(p);
-		spin_unlock(&p->sighand->siglock);
-	}
 	spin_unlock_irqrestore(&freezer_lock, flags);
 }
 
 /**
- * __set_freezable - make %current freezable
- * @with_signal: do we want %TIF_SIGPENDING for notification too?
+ * set_freezable - make %current freezable
  *
  * Mark %current freezable and enter refrigerator if necessary.
  */
-bool __set_freezable(bool with_signal)
+bool set_freezable(void)
 {
 	might_sleep();
 
@@ -177,10 +164,8 @@ bool __set_freezable(bool with_signal)
 	 */
 	spin_lock_irq(&freezer_lock);
 	current->flags &= ~PF_NOFREEZE;
-	if (with_signal)
-		current->flags &= ~PF_FREEZER_NOSIG;
 	spin_unlock_irq(&freezer_lock);
 
 	return try_to_freeze();
 }
-EXPORT_SYMBOL(__set_freezable);
+EXPORT_SYMBOL(set_freezable);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index a6cbeea..7a4c862 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -282,7 +282,7 @@ int kthreadd(void *unused)
 	set_cpus_allowed_ptr(tsk, cpu_all_mask);
 	set_mems_allowed(node_states[N_HIGH_MEMORY]);
 
-	current->flags |= PF_NOFREEZE | PF_FREEZER_NOSIG;
+	current->flags |= PF_NOFREEZE;
 
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-- 
1.7.6

^ permalink raw reply related

* [PATCH 5/6] freezer: remove unused @sig_only from freeze_task()
From: Tejun Heo @ 2011-09-02 18:27 UTC (permalink / raw)
  To: oleg, matthltc, rjw, paul; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314988070-12244-1-git-send-email-tj@kernel.org>

After 25a16d02ba "freezer: make freezing() test freeze conditions in
effect instead of TIF_FREEZE", freezing() returns authoritative answer
on whether the current task should freeze or not and freeze_task()
doesn't need or use @sig_only.  Remove it.

While at it, rewrite function comment for freeze_task() and rename
@sig_only to @user_only in try_to_freeze_tasks().

This patch doesn't cause any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/freezer.h |    2 +-
 kernel/cgroup_freezer.c |    4 ++--
 kernel/freezer.c        |   21 +++++++++------------
 kernel/power/process.c  |    8 ++++----
 4 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 122b5ce..9193bae 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -47,7 +47,7 @@ static inline bool try_to_freeze(void)
 	return __refrigerator(false);
 }
 
-extern bool freeze_task(struct task_struct *p, bool sig_only);
+extern bool freeze_task(struct task_struct *p);
 extern bool __set_freezable(bool with_signal);
 
 #ifdef CONFIG_CGROUP_FREEZER
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 56a457a..8753e7b 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -206,7 +206,7 @@ static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 
 	/* Locking avoids race with FREEZING -> THAWED transitions. */
 	if (freezer->state == CGROUP_FREEZING)
-		freeze_task(task, true);
+		freeze_task(task);
 	spin_unlock_irq(&freezer->lock);
 }
 
@@ -274,7 +274,7 @@ static int try_to_freeze_cgroup(struct cgroup *cgroup, struct freezer *freezer)
 
 	cgroup_iter_start(cgroup, &it);
 	while ((task = cgroup_iter_next(cgroup, &it))) {
-		if (!freeze_task(task, true))
+		if (!freeze_task(task))
 			continue;
 		if (frozen(task))
 			continue;
diff --git a/kernel/freezer.c b/kernel/freezer.c
index 9846133..da76b64 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -100,20 +100,17 @@ static void fake_signal_wake_up(struct task_struct *p)
 }
 
 /**
- *	freeze_task - send a freeze request to given task
- *	@p: task to send the request to
- *	@sig_only: if set, the request will only be sent if the task has the
- *		PF_FREEZER_NOSIG flag unset
- *	Return value: 'false', if @sig_only is set and the task has
- *		PF_FREEZER_NOSIG set or the task is frozen, 'true', otherwise
+ * freeze_task - send a freeze request to given task
+ * @p: task to send the request to
  *
- *	The freeze request is sent by setting the tasks's TIF_FREEZE flag and
- *	either sending a fake signal to it or waking it up, depending on whether
- *	or not it has PF_FREEZER_NOSIG set.  If @sig_only is set and the task
- *	has PF_FREEZER_NOSIG set (ie. it is a typical kernel thread), its
- *	TIF_FREEZE flag will not be set.
+ * If @p is freezing, the freeze request is sent by setting %TIF_FREEZE
+ * flag and either sending a fake signal to it or waking it up, depending
+ * on whether it has %PF_FREEZER_NOSIG set.
+ *
+ * RETURNS:
+ * %false, if @p is not freezing or already frozen; %true, otherwise
  */
-bool freeze_task(struct task_struct *p, bool sig_only)
+bool freeze_task(struct task_struct *p)
 {
 	unsigned long flags;
 
diff --git a/kernel/power/process.c b/kernel/power/process.c
index fe06ddf..ab4fd7b 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -22,7 +22,7 @@
  */
 #define TIMEOUT	(20 * HZ)
 
-static int try_to_freeze_tasks(bool sig_only)
+static int try_to_freeze_tasks(bool user_only)
 {
 	struct task_struct *g, *p;
 	unsigned long end_time;
@@ -37,14 +37,14 @@ static int try_to_freeze_tasks(bool sig_only)
 
 	end_time = jiffies + TIMEOUT;
 
-	if (!sig_only)
+	if (!user_only)
 		freeze_workqueues_begin();
 
 	while (true) {
 		todo = 0;
 		read_lock(&tasklist_lock);
 		do_each_thread(g, p) {
-			if (p == current || !freeze_task(p, sig_only))
+			if (p == current || !freeze_task(p))
 				continue;
 
 			/*
@@ -65,7 +65,7 @@ static int try_to_freeze_tasks(bool sig_only)
 		} while_each_thread(g, p);
 		read_unlock(&tasklist_lock);
 
-		if (!sig_only) {
+		if (!user_only) {
 			wq_busy = freeze_workqueues_busy();
 			todo += wq_busy;
 		}
-- 
1.7.6

^ permalink raw reply related

* [PATCH 4/6] freezer: use lock_task_sighand() in fake_signal_wake_up()
From: Tejun Heo @ 2011-09-02 18:27 UTC (permalink / raw)
  To: oleg, matthltc, rjw, paul; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314988070-12244-1-git-send-email-tj@kernel.org>

cgroup_freezer calls freeze_task() without holding tasklist_lock and,
if the task is exiting, its ->sighand may be gone by the time
fake_signal_wake_up() is called.  Use lock_task_sighand() instead of
accessing ->sighand directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Paul Menage <paul@paulmenage.org>
---
 kernel/freezer.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/freezer.c b/kernel/freezer.c
index ebee1ab..9846133 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -93,9 +93,10 @@ static void fake_signal_wake_up(struct task_struct *p)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&p->sighand->siglock, flags);
-	signal_wake_up(p, 0);
-	spin_unlock_irqrestore(&p->sighand->siglock, flags);
+	if (lock_task_sighand(p, &flags)) {
+		signal_wake_up(p, 0);
+		unlock_task_sighand(p, &flags);
+	}
 }
 
 /**
-- 
1.7.6

^ permalink raw reply related

* [PATCH 3/6] freezer: restructure __refrigerator()
From: Tejun Heo @ 2011-09-02 18:27 UTC (permalink / raw)
  To: oleg, matthltc, rjw, paul; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314988070-12244-1-git-send-email-tj@kernel.org>

If another freeze happens before all tasks leave FROZEN state after
being thawed, the freezer can see the existing FROZEN and consider the
tasks to be frozen but they can clear FROZEN without checking the new
freezing().

Oleg suggested restructuring __refrigerator() such that there's single
condition check section inside freezer_lock and sigpending is cleared
afterwards, which fixes the problem and simplifies the code.
Restructure accordingly.

-v2: Frozen loop exited without releasing freezer_lock.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
---
 kernel/freezer.c |   29 +++++++++++------------------
 1 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/kernel/freezer.c b/kernel/freezer.c
index 466ea6b..ebee1ab 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -52,36 +52,29 @@ bool __refrigerator(bool check_kthr_stop)
 	/* Hmm, should we be allowed to suspend when there are realtime
 	   processes around? */
 	bool was_frozen = false;
-	long save;
+	long save = current->state;
 
-	/*
-	 * No point in checking freezing() again - the caller already did.
-	 * Proceed to enter FROZEN.
-	 */
-	spin_lock_irq(&freezer_lock);
-	current->flags |= PF_FROZEN;
-	spin_unlock_irq(&freezer_lock);
-
-	save = current->state;
 	pr_debug("%s entered refrigerator\n", current->comm);
 
-	spin_lock_irq(&current->sighand->siglock);
-	recalc_sigpending(); /* We sent fake signal, clean it up */
-	spin_unlock_irq(&current->sighand->siglock);
-
 	for (;;) {
 		set_current_state(TASK_UNINTERRUPTIBLE);
+
+		spin_lock_irq(&freezer_lock);
+		current->flags |= PF_FROZEN;
 		if (!freezing(current) ||
 		    (check_kthr_stop && kthread_should_stop()))
+			current->flags &= ~PF_FROZEN;
+		spin_unlock_irq(&freezer_lock);
+
+		if (!(current->flags & PF_FROZEN))
 			break;
 		was_frozen = true;
 		schedule();
 	}
 
-	/* leave FROZEN */
-	spin_lock_irq(&freezer_lock);
-	current->flags &= ~PF_FROZEN;
-	spin_unlock_irq(&freezer_lock);
+	spin_lock_irq(&current->sighand->siglock);
+	recalc_sigpending(); /* We sent fake signal, clean it up */
+	spin_unlock_irq(&current->sighand->siglock);
 
 	pr_debug("%s left refrigerator\n", current->comm);
 
-- 
1.7.6

^ permalink raw reply related

* [PATCH 2/6] freezer: set PF_NOFREEZE on a dying task right before TASK_DEAD setting bug in freezer_change_state()
From: Tejun Heo @ 2011-09-02 18:27 UTC (permalink / raw)
  To: oleg, matthltc, rjw, paul; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314988070-12244-1-git-send-email-tj@kernel.org>

3fb45733df "freezer: make exiting tasks properly unfreezable" removed
clear_freeze_flag() from exit path and set PF_NOFREEZE right after
PTRACE_EVENT_EXIT; however, Oleg pointed out that following exit paths
may cause interaction with device drivers after PM freezer consider
the system frozen.

There's no try_to_freeze() call in the exit path and the only
necessary guarantee is that freezer doesn't hang waiting for zombies.
Set PF_NOFREEZE right before setting tsk->state to TASK_DEAD instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
---
 kernel/exit.c |   10 ++++------
 1 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index ac58259..7b6c4fa 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -913,12 +913,6 @@ NORET_TYPE void do_exit(long code)
 
 	ptrace_event(PTRACE_EVENT_EXIT, code);
 
-	/*
-	 * With ptrace notification done, there's no point in freezing from
-	 * here on.  Disallow freezing.
-	 */
-	current->flags |= PF_NOFREEZE;
-
 	validate_creds_for_do_exit(tsk);
 
 	/*
@@ -1044,6 +1038,10 @@ NORET_TYPE void do_exit(long code)
 
 	preempt_disable();
 	exit_rcu();
+
+	/* this task is now dead and freezer should ignore it */
+	current->flags |= PF_NOFREEZE;
+
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
 	schedule();
-- 
1.7.6

^ permalink raw reply related

* [PATCH 1/6] cgroup_freezer: fix freezer->state setting bug in freezer_change_state()
From: Tejun Heo @ 2011-09-02 18:27 UTC (permalink / raw)
  To: oleg, matthltc, rjw, paul; +Cc: Tejun Heo, containers, linux-pm, linux-kernel
In-Reply-To: <1314988070-12244-1-git-send-email-tj@kernel.org>

d02f52811d0e "cgroup_freezer: prepare for removal of TIF_FREEZE" moved
setting of freezer->state into freezer_change_state(); unfortunately,
while doing so, when it's beginning to freeze tasks, it sets the state
to CGROUP_FROZEN instead of CGROUP_FREEZING ending up skipping the
whole freezing state.  Fix it.

-v2: Oleg pointed out that re-freezing FROZEN cgroup could increment
     system_freezing_cnt.  Fixed.

-v3: Matt pointed out that setting CGROUP_FROZEN always invoked
     try_to_freeze_cgroup() regardless of the current state.  Patch
     updated such that the actual freeze/thaw operations are always
     performed on state changes.  This shouldn't make any difference
     unless something is broken.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Matt Helsley <matthltc@us.ibm.com>
---
 kernel/cgroup_freezer.c |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 4e82525..56a457a 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -308,24 +308,24 @@ static int freezer_change_state(struct cgroup *cgroup,
 	spin_lock_irq(&freezer->lock);
 
 	update_if_frozen(cgroup, freezer);
-	if (goal_state == freezer->state)
-		goto out;
-
-	freezer->state = goal_state;
 
 	switch (goal_state) {
 	case CGROUP_THAWED:
-		atomic_dec(&system_freezing_cnt);
+		if (freezer->state != CGROUP_THAWED)
+			atomic_dec(&system_freezing_cnt);
+		freezer->state = CGROUP_THAWED;
 		unfreeze_cgroup(cgroup, freezer);
 		break;
 	case CGROUP_FROZEN:
-		atomic_inc(&system_freezing_cnt);
+		if (freezer->state == CGROUP_THAWED)
+			atomic_inc(&system_freezing_cnt);
+		freezer->state = CGROUP_FREEZING;
 		retval = try_to_freeze_cgroup(cgroup, freezer);
 		break;
 	default:
 		BUG();
 	}
-out:
+
 	spin_unlock_irq(&freezer->lock);
 
 	return retval;
-- 
1.7.6

^ permalink raw reply related

* [PATCHSET pm-freezer] freezer: fixes & simplifications
From: Tejun Heo @ 2011-09-02 18:27 UTC (permalink / raw)
  To: oleg, matthltc, rjw, paul; +Cc: containers, linux-pm, linux-kernel

Hello,

These six patches are fixes and simplifications for the recent freezer
changes.  The first four have been posted twice.  Changes since the
last posting[L] are

* freezer->state setting bug fix updated per Matt Helsley so that the
  actual per-task freeze/thaw operations are always performed.

* fixed a bug caused by forgetting to unlock freezer_lock in "freezer:
  restructure __refrigerator()".

* Two patches added.  The fifth one is mostly trivial.  The sixth a
  bit more involved but still shouldn't cause any noticeable
  functional difference.  This removes the buggy and unused
  freezable_with_signal.

Properly tested this time.  As hera is still down, no git branch
available at this point.  Patch list and diffstat follow.

Thanks.

 [PATCH 1/6] cgroup_freezer: fix freezer->state setting bug in
 [PATCH 2/6] freezer: set PF_NOFREEZE on a dying task right before
 [PATCH 3/6] freezer: restructure __refrigerator()
 [PATCH 4/6] freezer: use lock_task_sighand() in
 [PATCH 5/6] freezer: remove unused @sig_only from freeze_task()
 [PATCH 6/6] freezer: kill unused set_freezable_with_signal()

 include/linux/freezer.h |   22 +------------
 include/linux/sched.h   |    1
 kernel/cgroup_freezer.c |   18 +++++------
 kernel/exit.c           |   10 ++----
 kernel/freezer.c        |   78 ++++++++++++++++--------------------------------
 kernel/kthread.c        |    2 -
 kernel/power/process.c  |    8 ++--
 7 files changed, 47 insertions(+), 92 deletions(-)

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel/1186387

^ permalink raw reply

* Re: [PATCH pm-freezer 1/4] cgroup_freezer: fix freezer->state setting bug in freezer_change_state()
From: Tejun Heo @ 2011-09-02 17:31 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: containers, linux-kernel, linux-pm, Paul Menage
In-Reply-To: <20110902171517.GA8247@redhat.com>

Hello,

On Fri, Sep 02, 2011 at 07:15:17PM +0200, Oleg Nesterov wrote:
> > I guess it depends on the viewpoint.  A simple analogy would be using
> > WARN_ON_ONCE() instead of BUG_ON() so that the mode of failure is
> > softer.  This change isn't likely to make bugs significantly more
> > difficult to discover so why not?
> 
> I agree either way.
> 
> Personally I prefer your current patch. Because it is not clear why
> do we call try_to_freeze_cgroup() if it was already called. And, the
> 2nd call can silently hide the problem if we have some bug.
> 
> But of course, this is up to you and Matt.

I'm okay either way too.  It would make a bug in that area a bit less
annoying and thus may decrease the chance of bug report, but it means
that we ship with built-in easy work around (if it doesn't work at the
first kick, kick again), which can be beneficial in practice.

> > > But I agree either way. Rafael, I think 1-4 are fine, but I think
> > > we need the simple 5/4, will send in a minute...
> >
> > Can you please wait a bit?  The second one was broken (missing unlock)
> 
> Yes, I just noticed the small problem too, hopefully we mean the same
> bug ;)

Yeap, the same one.  I thought it was the second patch instead of the
third. :)

Thanks.

-- 
tejun

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox