[PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
@ 2018-05-23 17:56 Tejun Heo
  2018-05-23 18:39 ` Paul E. McKenney
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Tejun Heo @ 2018-05-23 17:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Paul E. McKenney, Jan Kara, Andrew Morton,
	kernel-team

>From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 23 May 2018 10:29:00 -0700

cgwb_release() punts the actual release to cgwb_release_workfn() on
system_wq.  Depending on the number of cgroups or block devices, there
can be a lot of cgwb_release_workfn() in flight at the same time.

We're periodically seeing close to 256 kworkers getting stuck with the
following stack trace and overtime the entire system gets stuck.

  [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
  [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
  [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
  [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
  [<ffffffff811cd645>] wb_exit+0x85/0xa0
  [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
  [<ffffffff810a68d0>] process_one_work+0x150/0x410
  [<ffffffff810a71fd>] worker_thread+0x6d/0x520
  [<ffffffff810ad3dc>] kthread+0x12c/0x160
  [<ffffffff81969019>] ret_from_fork+0x29/0x40
  [<ffffffffffffffff>] 0xffffffffffffffff

The events leading to the lockup are...

1. A lot of cgwb_release_workfn() is queued at the same time and all
   system_wq kworkers are assigned to execute them.

2. They all end up calling synchronize_rcu_expedited().  One of them
   wins and tries to perform the expedited synchronization.

3. However, that invovles queueing rcu_exp_work to system_wq and
   waiting for it.  Because #1 is holding all available kworkers on
   system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
   is waiting for synchronize_rcu_expedited() which in turn is waiting
   for cgwb_release_workfn() to free up some of the kworkers.

We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
same time.  There's nothing to be gained from that.  This patch
updates cgwb release path to use a dedicated percpu workqueue with
@max_active of 1.

While this resolves the problem at hand, it might be a good idea to
isolate rcu_exp_work to its own workqueue too as it can be used from
various paths and is prone to this sort of indirect A-A deadlocks.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
---
 mm/backing-dev.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7441bd9..8fe3ebd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -412,6 +412,7 @@ static void wb_exit(struct bdi_writeback *wb)
  * protected.
  */
 static DEFINE_SPINLOCK(cgwb_lock);
+static struct workqueue_struct *cgwb_release_wq;

 /**
  * wb_congested_get_create - get or create a wb_congested
@@ -522,7 +523,7 @@ static void cgwb_release(struct percpu_ref *refcnt)
 {
 	struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
 						refcnt);
-	schedule_work(&wb->release_work);
+	queue_work(cgwb_release_wq, &wb->release_work);
 }

 static void cgwb_kill(struct bdi_writeback *wb)
@@ -784,6 +785,21 @@ static void cgwb_bdi_register(struct backing_dev_info *bdi)
 	spin_unlock_irq(&cgwb_lock);
 }

+static int __init cgwb_init(void)
+{
+	/*
+	 * There can be many concurrent release work items overwhelming
+	 * system_wq.  Put them in a separate wq and limit concurrency.
+	 * There's no point in executing many of these in parallel.
+	 */
+	cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
+	if (!cgwb_release_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+subsys_initcall(cgwb_init);
+
 #else	/* CONFIG_CGROUP_WRITEBACK */

 static int cgwb_bdi_init(struct backing_dev_info *bdi)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
  2018-05-23 17:56 [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Tejun Heo
@ 2018-05-23 18:39 ` Paul E. McKenney
  2018-05-23 18:51   ` Tejun Heo
  2018-05-23 21:29 ` Jens Axboe
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2018-05-23 18:39 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-kernel, Jan Kara, Andrew Morton, kernel-team

On Wed, May 23, 2018 at 10:56:32AM -0700, Tejun Heo wrote:
> >From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 23 May 2018 10:29:00 -0700
> 
> cgwb_release() punts the actual release to cgwb_release_workfn() on
> system_wq.  Depending on the number of cgroups or block devices, there
> can be a lot of cgwb_release_workfn() in flight at the same time.
> 
> We're periodically seeing close to 256 kworkers getting stuck with the
> following stack trace and overtime the entire system gets stuck.
> 
>   [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
>   [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
>   [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
>   [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
>   [<ffffffff811cd645>] wb_exit+0x85/0xa0
>   [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
>   [<ffffffff810a68d0>] process_one_work+0x150/0x410
>   [<ffffffff810a71fd>] worker_thread+0x6d/0x520
>   [<ffffffff810ad3dc>] kthread+0x12c/0x160
>   [<ffffffff81969019>] ret_from_fork+0x29/0x40
>   [<ffffffffffffffff>] 0xffffffffffffffff
> 
> The events leading to the lockup are...
> 
> 1. A lot of cgwb_release_workfn() is queued at the same time and all
>    system_wq kworkers are assigned to execute them.
> 
> 2. They all end up calling synchronize_rcu_expedited().  One of them
>    wins and tries to perform the expedited synchronization.
> 
> 3. However, that invovles queueing rcu_exp_work to system_wq and
>    waiting for it.  Because #1 is holding all available kworkers on
>    system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
>    is waiting for synchronize_rcu_expedited() which in turn is waiting
>    for cgwb_release_workfn() to free up some of the kworkers.
> 
> We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> same time.  There's nothing to be gained from that.  This patch
> updates cgwb release path to use a dedicated percpu workqueue with
> @max_active of 1.
> 
> While this resolves the problem at hand, it might be a good idea to
> isolate rcu_exp_work to its own workqueue too as it can be used from
> various paths and is prone to this sort of indirect A-A deadlocks.

Commit ad7c946b35ad4 ("rcu: Create RCU-specific workqueues with rescuers")
was accepted into mainline this past merge window.  Does that do what
you want, or are you looking for something else?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
  2018-05-23 18:39 ` Paul E. McKenney
@ 2018-05-23 18:51   ` Tejun Heo
  2018-05-23 19:10     ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Tejun Heo @ 2018-05-23 18:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Jens Axboe, linux-kernel, Jan Kara, Andrew Morton, kernel-team

On Wed, May 23, 2018 at 11:39:07AM -0700, Paul E. McKenney wrote:
> > While this resolves the problem at hand, it might be a good idea to
> > isolate rcu_exp_work to its own workqueue too as it can be used from
> > various paths and is prone to this sort of indirect A-A deadlocks.
> 
> Commit ad7c946b35ad4 ("rcu: Create RCU-specific workqueues with rescuers")
> was accepted into mainline this past merge window.  Does that do what
> you want, or are you looking for something else?

Ah, that does it.  Sorry, was looking at an older kernel.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
  2018-05-23 18:51   ` Tejun Heo
@ 2018-05-23 19:10     ` Paul E. McKenney
  0 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2018-05-23 19:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux-kernel, Jan Kara, Andrew Morton, kernel-team

On Wed, May 23, 2018 at 11:51:43AM -0700, Tejun Heo wrote:
> On Wed, May 23, 2018 at 11:39:07AM -0700, Paul E. McKenney wrote:
> > > While this resolves the problem at hand, it might be a good idea to
> > > isolate rcu_exp_work to its own workqueue too as it can be used from
> > > various paths and is prone to this sort of indirect A-A deadlocks.
> > 
> > Commit ad7c946b35ad4 ("rcu: Create RCU-specific workqueues with rescuers")
> > was accepted into mainline this past merge window.  Does that do what
> > you want, or are you looking for something else?
> 
> Ah, that does it.  Sorry, was looking at an older kernel.

No worries -- after all, I must confess that I was a bit slow in getting
to this.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
  2018-05-23 17:56 [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Tejun Heo
  2018-05-23 18:39 ` Paul E. McKenney
@ 2018-05-23 21:29 ` Jens Axboe
  2018-05-23 22:03 ` Rik van Riel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2018-05-23 21:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Paul E. McKenney, Jan Kara, Andrew Morton,
	kernel-team

On 5/23/18 11:56 AM, Tejun Heo wrote:
> From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 23 May 2018 10:29:00 -0700
> 
> cgwb_release() punts the actual release to cgwb_release_workfn() on
> system_wq.  Depending on the number of cgroups or block devices, there
> can be a lot of cgwb_release_workfn() in flight at the same time.
> 
> We're periodically seeing close to 256 kworkers getting stuck with the
> following stack trace and overtime the entire system gets stuck.
> 
>   [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
>   [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
>   [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
>   [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
>   [<ffffffff811cd645>] wb_exit+0x85/0xa0
>   [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
>   [<ffffffff810a68d0>] process_one_work+0x150/0x410
>   [<ffffffff810a71fd>] worker_thread+0x6d/0x520
>   [<ffffffff810ad3dc>] kthread+0x12c/0x160
>   [<ffffffff81969019>] ret_from_fork+0x29/0x40
>   [<ffffffffffffffff>] 0xffffffffffffffff
> 
> The events leading to the lockup are...
> 
> 1. A lot of cgwb_release_workfn() is queued at the same time and all
>    system_wq kworkers are assigned to execute them.
> 
> 2. They all end up calling synchronize_rcu_expedited().  One of them
>    wins and tries to perform the expedited synchronization.
> 
> 3. However, that invovles queueing rcu_exp_work to system_wq and
>    waiting for it.  Because #1 is holding all available kworkers on
>    system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
>    is waiting for synchronize_rcu_expedited() which in turn is waiting
>    for cgwb_release_workfn() to free up some of the kworkers.
> 
> We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> same time.  There's nothing to be gained from that.  This patch
> updates cgwb release path to use a dedicated percpu workqueue with
> @max_active of 1.
> 
> While this resolves the problem at hand, it might be a good idea to
> isolate rcu_exp_work to its own workqueue too as it can be used from
> various paths and is prone to this sort of indirect A-A deadlocks.

Applied, thanks.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
  2018-05-23 17:56 [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Tejun Heo
  2018-05-23 18:39 ` Paul E. McKenney
  2018-05-23 21:29 ` Jens Axboe
@ 2018-05-23 22:03 ` Rik van Riel
  2018-05-23 23:17   ` Tejun Heo
  2018-05-23 23:25 ` [PATCH] bdi: Increase the concurrecy level of cgwb_release_wq Tejun Heo
  2018-05-24 10:19 ` [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Jan Kara
  4 siblings, 1 reply; 10+ messages in thread
From: Rik van Riel @ 2018-05-23 22:03 UTC (permalink / raw)
  To: Tejun Heo, Jens Axboe
  Cc: linux-kernel, Paul E. McKenney, Jan Kara, Andrew Morton,
	kernel-team

[-- Attachment #1: Type: text/plain, Size: 1515 bytes --]

On Wed, 2018-05-23 at 10:56 -0700, Tejun Heo wrote:

> The events leading to the lockup are...
> 
> 1. A lot of cgwb_release_workfn() is queued at the same time and all
>    system_wq kworkers are assigned to execute them.
> 
> 2. They all end up calling synchronize_rcu_expedited().  One of them
>    wins and tries to perform the expedited synchronization.
> 
> 3. However, that invovles queueing rcu_exp_work to system_wq and
>    waiting for it.  Because #1 is holding all available kworkers on
>    system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
>    is waiting for synchronize_rcu_expedited() which in turn is
> waiting
>    for cgwb_release_workfn() to free up some of the kworkers.
> 
> We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> same time.  There's nothing to be gained from that.  This patch
> updates cgwb release path to use a dedicated percpu workqueue with
> @max_active of 1.

Dumb question.  Does setting max_active to 1 mean
that every cgwb_release_workfn() ends up forcing
another RCU grace period on the whole system, while
today you might have a bunch of them waiting on the
same RCU grace period advance?

Would it be faster to have some number (up to 16?)
push RCU once, at the same time, instead of having
each of them push RCU into a next grace period one
after another?

I may be overlooking something fundamental here,
but I thought I'd at least ask the question, just
in case :)

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
  2018-05-23 22:03 ` Rik van Riel
@ 2018-05-23 23:17   ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2018-05-23 23:17 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Jens Axboe, linux-kernel, Paul E. McKenney, Jan Kara,
	Andrew Morton, kernel-team

Hello, Rik.

On Wed, May 23, 2018 at 06:03:15PM -0400, Rik van Riel wrote:
> Dumb question.  Does setting max_active to 1 mean
> that every cgwb_release_workfn() ends up forcing
> another RCU grace period on the whole system, while
> today you might have a bunch of them waiting on the
> same RCU grace period advance?
> 
> Would it be faster to have some number (up to 16?)
> push RCU once, at the same time, instead of having
> each of them push RCU into a next grace period one
> after another?

Oh yeah, you're absolutely right.  This would end up doing a lot of
back-to-back synchronize_rcu_expedited() calls which can't be good.
I'll send a patch to push it upto 16.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] bdi: Increase the concurrecy level of cgwb_release_wq
  2018-05-23 17:56 [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Tejun Heo
                   ` (2 preceding siblings ...)
  2018-05-23 22:03 ` Rik van Riel
@ 2018-05-23 23:25 ` Tejun Heo
  2018-05-24 10:19 ` [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Jan Kara
  4 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2018-05-23 23:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Paul E. McKenney, Jan Kara, Andrew Morton,
	kernel-team, Rik van Riel

cgwb_release_wq was added to isolate and limit the concurrency level
of cgwb release work items.  The max concurrency was set to 1 per CPU
but we do want some level of concurrency as otherwise it ends up
calling, through bdi_unregister(), synchronize_sched_expedited()
back-to-back for each release work item.

Let's allow them to bunch up a bit.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Rik van Riel <riel@surriel.com>
---
 mm/backing-dev.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8fe3ebd..f9a2268 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -790,9 +790,10 @@ static int __init cgwb_init(void)
 	/*
 	 * There can be many concurrent release work items overwhelming
 	 * system_wq.  Put them in a separate wq and limit concurrency.
-	 * There's no point in executing many of these in parallel.
+	 * Allow some level of concurrency so that we can batch on
+	 * synchronize_rcu_expedited() calls in bdi_unregister().
 	 */
-	cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
+	cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 16);
 	if (!cgwb_release_wq)
 		return -ENOMEM;
 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
  2018-05-23 17:56 [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Tejun Heo
                   ` (3 preceding siblings ...)
  2018-05-23 23:25 ` [PATCH] bdi: Increase the concurrecy level of cgwb_release_wq Tejun Heo
@ 2018-05-24 10:19 ` Jan Kara
  2018-05-24 14:00   ` Tejun Heo
  4 siblings, 1 reply; 10+ messages in thread
From: Jan Kara @ 2018-05-24 10:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, linux-kernel, Paul E. McKenney, Jan Kara,
	Andrew Morton, kernel-team

On Wed 23-05-18 10:56:32, Tejun Heo wrote:
> From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Wed, 23 May 2018 10:29:00 -0700
> 
> cgwb_release() punts the actual release to cgwb_release_workfn() on
> system_wq.  Depending on the number of cgroups or block devices, there
> can be a lot of cgwb_release_workfn() in flight at the same time.
> 
> We're periodically seeing close to 256 kworkers getting stuck with the
> following stack trace and overtime the entire system gets stuck.

OK, but that means that you have to have 256 block devices, don't you? As
we have a bdi per device and we call synchronize_rcu_expedited() only when
unregistering bdi (and the corresponding request queue must be gone at that
point as well as that's otherwise holding a reference). Am I understanding
the situation correctly?

>   [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
>   [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
>   [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
>   [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
>   [<ffffffff811cd645>] wb_exit+0x85/0xa0
>   [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
>   [<ffffffff810a68d0>] process_one_work+0x150/0x410
>   [<ffffffff810a71fd>] worker_thread+0x6d/0x520
>   [<ffffffff810ad3dc>] kthread+0x12c/0x160
>   [<ffffffff81969019>] ret_from_fork+0x29/0x40
>   [<ffffffffffffffff>] 0xffffffffffffffff
> 
> The events leading to the lockup are...
> 
> 1. A lot of cgwb_release_workfn() is queued at the same time and all
>    system_wq kworkers are assigned to execute them.
> 
> 2. They all end up calling synchronize_rcu_expedited().  One of them
>    wins and tries to perform the expedited synchronization.
> 
> 3. However, that invovles queueing rcu_exp_work to system_wq and
>    waiting for it.  Because #1 is holding all available kworkers on
>    system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
>    is waiting for synchronize_rcu_expedited() which in turn is waiting
>    for cgwb_release_workfn() to free up some of the kworkers.
> 
> We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> same time.  There's nothing to be gained from that.  This patch
> updates cgwb release path to use a dedicated percpu workqueue with
> @max_active of 1.

As Rik wrote, some paralelism is good to reduce number of forced grace
periods so raising this to 16 is good. I was thinking whether we could not
batch rcu grace periods in some explicit way but that would be difficult to
do.

But thinking a bit more about this, if we made bdi RCU freed, we could just
avoid the synchronize_rcu_expedited() in bdi_remove_from_list() altogether.
The uses of bdi list are pretty limited and everybody ends up testing
WB_registered bit before doing anything anyway... What do you think?
 
Other than that you can add:

Reviewed-by: Jan Kara <jack@suse.cz>

to this patch when updated to increase concurrency as that's a good short
term solution (for stable kernels) anyway.

								Honza

> While this resolves the problem at hand, it might be a good idea to
> isolate rcu_exp_work to its own workqueue too as it can be used from
> various paths and is prone to this sort of indirect A-A deadlocks.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: stable@vger.kernel.org



> ---
>  mm/backing-dev.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7441bd9..8fe3ebd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -412,6 +412,7 @@ static void wb_exit(struct bdi_writeback *wb)
>   * protected.
>   */
>  static DEFINE_SPINLOCK(cgwb_lock);
> +static struct workqueue_struct *cgwb_release_wq;
>  
>  /**
>   * wb_congested_get_create - get or create a wb_congested
> @@ -522,7 +523,7 @@ static void cgwb_release(struct percpu_ref *refcnt)
>  {
>  	struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
>  						refcnt);
> -	schedule_work(&wb->release_work);
> +	queue_work(cgwb_release_wq, &wb->release_work);
>  }
>  
>  static void cgwb_kill(struct bdi_writeback *wb)
> @@ -784,6 +785,21 @@ static void cgwb_bdi_register(struct backing_dev_info *bdi)
>  	spin_unlock_irq(&cgwb_lock);
>  }
>  
> +static int __init cgwb_init(void)
> +{
> +	/*
> +	 * There can be many concurrent release work items overwhelming
> +	 * system_wq.  Put them in a separate wq and limit concurrency.
> +	 * There's no point in executing many of these in parallel.
> +	 */
> +	cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
> +	if (!cgwb_release_wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +subsys_initcall(cgwb_init);
> +
>  #else	/* CONFIG_CGROUP_WRITEBACK */
>  
>  static int cgwb_bdi_init(struct backing_dev_info *bdi)
> -- 
> 2.9.5
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue
  2018-05-24 10:19 ` [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Jan Kara
@ 2018-05-24 14:00   ` Tejun Heo
  0 siblings, 0 replies; 10+ messages in thread
From: Tejun Heo @ 2018-05-24 14:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-kernel, Paul E. McKenney, Jan Kara,
	Andrew Morton, kernel-team

Hello, Jan.

On Thu, May 24, 2018 at 12:19:00PM +0200, Jan Kara wrote:
> > We're periodically seeing close to 256 kworkers getting stuck with the
> > following stack trace and overtime the entire system gets stuck.
> 
> OK, but that means that you have to have 256 block devices, don't you? As

It'd be a product of the number of cgroups being deleted at the same
time and the number of block devices they were accessing.  With
multiple loop devices and multiple cgroup removals, it doesn't seem
too difficult to hit.  It's rare but we do see it happening repeatedly
in one of the setups.

> we have a bdi per device and we call synchronize_rcu_expedited() only when
> unregistering bdi (and the corresponding request queue must be gone at that
> point as well as that's otherwise holding a reference). Am I understanding
> the situation correctly?

Yeah, I think so.

> > We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> > same time.  There's nothing to be gained from that.  This patch
> > updates cgwb release path to use a dedicated percpu workqueue with
> > @max_active of 1.
> 
> As Rik wrote, some paralelism is good to reduce number of forced grace
> periods so raising this to 16 is good. I was thinking whether we could not
> batch rcu grace periods in some explicit way but that would be difficult to
> do.

Agreed, already sent a patch to increase @max_active to 16.

> But thinking a bit more about this, if we made bdi RCU freed, we could just
> avoid the synchronize_rcu_expedited() in bdi_remove_from_list() altogether.
> The uses of bdi list are pretty limited and everybody ends up testing
> WB_registered bit before doing anything anyway... What do you think?
>  
> Other than that you can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-05-24 14:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-05-23 17:56 [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Tejun Heo
2018-05-23 18:39 ` Paul E. McKenney
2018-05-23 18:51   ` Tejun Heo
2018-05-23 19:10     ` Paul E. McKenney
2018-05-23 21:29 ` Jens Axboe
2018-05-23 22:03 ` Rik van Riel
2018-05-23 23:17   ` Tejun Heo
2018-05-23 23:25 ` [PATCH] bdi: Increase the concurrecy level of cgwb_release_wq Tejun Heo
2018-05-24 10:19 ` [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue Jan Kara
2018-05-24 14:00   ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox