public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: menage@google.com, miaox@cn.fujitsu.com, maxk@qualcomm.com,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] cpuset: fix possible deadlock in async_rebuild_sched_domains
Date: Fri, 16 Jan 2009 12:57:38 -0800	[thread overview]
Message-ID: <20090116125738.22c21bf2.akpm@linux-foundation.org> (raw)
In-Reply-To: <4970000E.7040902@cn.fujitsu.com>

On Fri, 16 Jan 2009 11:33:34 +0800
Lai Jiangshan <laijs@cn.fujitsu.com> wrote:

> 
> But queuing a work to an other thread is adding some overhead for cpuset.
> And a new separate workqueue thread is wasteful, this thread is sleeping
> at most time.
> 
> This is an effective fix:
> 
> This patch add cgroup_queue_defer_work(). And the works will be deferring
> processed with cgroup_mutex released. And this patch just add very very
> little overhead for cgroup_unlock()'s fast path.
> 
> Lai
> 
> From: Lai Jiangshan <laijs@cn.fujitsu.com>
> 
> Lockdep reported some possible circular locking info when we tested cpuset on
> NUMA/fake NUMA box.
> 
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.29-rc1-00224-ga652504 #111
> -------------------------------------------------------
> bash/2968 is trying to acquire lock:
>  (events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8
> 
> but task is already holding lock:
>  (cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29
> 
> which lock already depends on the new lock.
> ......
> -------------------------------------------------------
> 
> Steps to reproduce:
> # mkdir /dev/cpuset
> # mount -t cpuset xxx /dev/cpuset
> # mkdir /dev/cpuset/0
> # echo 0 > /dev/cpuset/0/cpus
> # echo 0 > /dev/cpuset/0/mems
> # echo 1 > /dev/cpuset/0/memory_migrate
> # cat /dev/zero > /dev/null &
> # echo $! > /dev/cpuset/0/tasks
> 
> This is because async_rebuild_sched_domains has the following lock sequence:
> run_workqueue(async_rebuild_sched_domains)
> 	-> do_rebuild_sched_domains -> cgroup_lock
> 
> But, attaching tasks when memory_migrate is set has following:
> cgroup_lock_live_group(cgroup_tasks_write)
> 	-> do_migrate_pages -> flush_work
> 
> This can be fixed by using a separate workqueue thread.
> 
> But queuing a work to an other thread is adding some overhead for cpuset.
> And a new separate workqueue thread is wasteful, this thread is sleeping
> at most time.
> 
> This patch add cgroup_queue_defer_work(). And the works will be deferring
> processed with cgroup_mutex released. And this patch just add very very
> little overhead for cgroup_unlock()'s fast path.

hm.  There's little discussion for such a large-looking patch?

> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -437,6 +437,19 @@ void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
>  int cgroup_scan_tasks(struct cgroup_scanner *scan);
>  int cgroup_attach_task(struct cgroup *, struct task_struct *);
>  
> +struct cgroup_defer_work {
> +	struct list_head list;
> +	void (*func)(struct cgroup_defer_work *);
> +};
> +
> +#define CGROUP_DEFER_WORK(name, function)		\
> +	struct cgroup_defer_work name = {		\
> +		.list = LIST_HEAD_INIT((name).list),	\
> +		.func = (function),			\
> +	};
> +
> +int cgroup_queue_defer_work(struct cgroup_defer_work *defer_work);

These should be called "cgroup_deferred_work" and
"queue_deferred_work", I think?  Maybe not.

> +static void cgroup_flush_defer_work_locked(void);

cgroup_flush_deferred_work_locked()?

>  /**
>   * cgroup_unlock - release lock on cgroup changes
>   *
> @@ -547,9 +548,67 @@ void cgroup_lock(void)
>   */
>  void cgroup_unlock(void)
>  {
> +	cgroup_flush_defer_work_locked();
>  	mutex_unlock(&cgroup_mutex);
>  }
>  
> +static LIST_HEAD(defer_work_list);

Please add a comment telling readers what the locking protocol is for
this list.

> +/* flush deferred works with cgroup_mutex released */
> +static void cgroup_flush_defer_work_locked(void)
> +{
> +	static bool running_dely_work;

"running_delayed_work"

> +	if (likely(list_empty(&defer_work_list)))
> +		return;
> +
> +	/*
> +	 * Insure it's not recursive and also
> +	 * insure deferred works are run orderly.

"ensure"

> +	 */
> +	if (running_dely_work)
> +		return;
> +	running_dely_work = true;
> +
> +	for ( ; ; ) {
> +		struct cgroup_defer_work *defer_work;
> +
> +		defer_work = list_first_entry(&defer_work_list,
> +				struct cgroup_defer_work, list);
> +		list_del_init(&defer_work->list);
> +		mutex_unlock(&cgroup_mutex);
> +
> +		defer_work->func(defer_work);
> +
> +		mutex_lock(&cgroup_mutex);
> +		if (list_empty(&defer_work_list))
> +			break;
> +	}
> +
> +	running_dely_work = false;
> +}
> +
> +/**
> + * cgroup_queue_defer_work - queue a deferred work
> + * @defer_work: work to queue
> + *
> + * Returns 0 if @defer_work was already on the queue, non-zero otherwise.
> + *
> + * Must called when cgroup_mutex held.

"must be"

> + * The defered work will be run after cgroup_mutex released.

"deferred"

> + */
> +int cgroup_queue_defer_work(struct cgroup_defer_work *defer_work)
> +{
> +	int ret = 0;
> +
> +	if (list_empty(&defer_work->list)) {
> +		list_add_tail(&defer_work->list, &defer_work_list);
> +		ret = 1;
> +	}
> +
> +	return ret;
> +}
> +
>  /*
>   * A couple of forward declarations required, due to cyclic reference loop:
>   * cgroup_mkdir -> cgroup_create -> cgroup_populate_dir ->
> @@ -616,7 +675,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
>  		 * agent */
>  		synchronize_rcu();
>  
> -		mutex_lock(&cgroup_mutex);
> +		cgroup_lock();

All these changes add rather a lot of noise.  It would be better to do
this as two patches:

1: "convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls"
2: "cpuset: fix possible deadlock in async_rebuild_sched_domains"

Although it doesn't matter a lot.


Also, as it now appears to be compulsory to use cgroup_lock(), you
should rename cgroup_mutex to something else, to prevent accidental
direct usage of the lock.


  reply	other threads:[~2009-01-16 20:58 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-01-16  2:24 [PATCH] cpuset: fix possible deadlock in async_rebuild_sched_domains Miao Xie
2009-01-16  3:33 ` Lai Jiangshan
2009-01-16 20:57   ` Andrew Morton [this message]
2009-01-18  8:06     ` [PATCH 1/3] cgroup: convert open-coded mutex_lock(&cgroup_mutex) calls into cgroup_lock() calls Lai Jiangshan
2009-01-18  9:10       ` Ingo Molnar
2009-01-19  1:37         ` Paul Menage
2009-01-19  1:41           ` Ingo Molnar
2009-01-20  1:28             ` Paul Menage
2009-01-20 18:22               ` Peter Zijlstra
2009-01-20  1:18       ` Paul Menage
2009-01-18  8:06     ` [PATCH 2/3] cgroup: introduce cgroup_queue_deferred_work() Lai Jiangshan
2009-01-18  9:04       ` Ingo Molnar
2009-01-19  1:55         ` Lai Jiangshan
2009-01-20  1:26       ` Paul Menage
2009-01-18  8:06     ` [PATCH 3/3] cpuset: fix possible deadlock in async_rebuild_sched_domains Lai Jiangshan
2009-01-18  9:06       ` Ingo Molnar
2009-01-19  1:40         ` Lai Jiangshan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090116125738.22c21bf2.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maxk@qualcomm.com \
    --cc=menage@google.com \
    --cc=miaox@cn.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox