loop block-mq conversion scalability issues

public inbox for linux-kernel@vger.kernel.org

* loop block-mq conversion scalability issues
@ 2015-04-23 21:04 Justin M. Forbes
  2015-04-24  2:59 ` Ming Lei
  0 siblings, 1 reply; 6+ messages in thread
From: Justin M. Forbes @ 2015-04-23 21:04 UTC (permalink / raw)
  To: ming.lei; +Cc: linux-kernel

The block-mq conversion for loop in 4.0 kernels is showing us an
interesting scalability problem with live CDs (ro, squashfs).  It was
noticed when testing the Fedora beta that the more CPUs a liveCD image
was given, the slower it would boot. A 4 core qemu instance or bare
metal instance took more than twice as long to boot compared to a single
CPU instance.  After investigating, this came down directly to the block-mq
conversion; reverting these 4 patches restores performance. More
details are available at
https://bugzilla.redhat.com/show_bug.cgi?id=1210857
I don't think that reverting the patches is the ideal solution, so I am
looking for other options.  Since you know this code a bit better than I
do, I thought I would run it by you while I am looking as well.

Thanks,
Justin


* Re: loop block-mq conversion scalability issues
  2015-04-23 21:04 loop block-mq conversion scalability issues Justin M. Forbes
@ 2015-04-24  2:59 ` Ming Lei
  2015-04-24 21:46   ` Justin M. Forbes
  0 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2015-04-24  2:59 UTC (permalink / raw)
  To: Justin M. Forbes; +Cc: linux-kernel, tom.leiming

Hi Justin,

Thanks for the report.

I can understand the issue: the default @max_active for
alloc_workqueue() is quite big (512), which may cause too many
context switches and so decrease loop I/O performance.

Actually, I have written a kernel dio/aio based patch that decreases
both CPU and memory utilization without sacrificing I/O performance.
I will try to improve and push the patch during this cycle and hope
it can be merged (the kernel/aio.c change is dropped, and the only fs
change needed is in fs/direct-io.c).

But the following change should help in your case; could you test it?

---
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c6b3726..b1cb41d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1831,7 +1831,7 @@ static int __init loop_init(void)
 	}
 
 	loop_wq = alloc_workqueue("kloopd",
-			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
+			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 32);
 	if (!loop_wq) {
 		err = -ENOMEM;
 		goto misc_out;



Thanks,
Ming Lei


* Re: loop block-mq conversion scalability issues
  2015-04-24  2:59 ` Ming Lei
@ 2015-04-24 21:46   ` Justin M. Forbes
  2015-04-25 10:32     ` Ming Lei
  2015-04-26 15:27     ` Ming Lei
  0 siblings, 2 replies; 6+ messages in thread
From: Justin M. Forbes @ 2015-04-24 21:46 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, tom.leiming

Patch tested, it made things worse (I gave up after 5 minutes and the boot
still seemed hung). I also tried values of 1, 16, 64, and 128.
Everything below 128 was much worse than the current situation. Setting
it to 128 seemed about the same as booting without the patch. I can do
some more testing over the weekend, but I don't think this is the
correct solution.
I would be interested in testing your dio/aio patches as well though.

Thanks,
Justin



* Re: loop block-mq conversion scalability issues
  2015-04-24 21:46   ` Justin M. Forbes
@ 2015-04-25 10:32     ` Ming Lei
  2015-04-26 15:27     ` Ming Lei
  1 sibling, 0 replies; 6+ messages in thread
From: Ming Lei @ 2015-04-25 10:32 UTC (permalink / raw)
  To: Justin M. Forbes; +Cc: linux-kernel

On Sat, Apr 25, 2015 at 5:46 AM, Justin M. Forbes
<jforbes@fedoraproject.org> wrote:
> Patch tested, it made things worse (I gave up after 5 minutes and the boot
> still seemed hung). I also tried values of 1, 16, 64, and 128.
> Everything below 128 was much worse than the current situation. Setting
> it to 128 seemed about the same as booting without the patch. I can do
> some more testing over the weekend, but I don't think this is the
> correct solution.

I saw you mention that there are about 200 kworker threads waiting to run;
that is why I thought it might be related to 'max_active'.

Also, could you change the workqueue to bound by removing the
'WQ_UNBOUND' flag and see if there is any difference?

Could you share how loop is used during the Fedora live boot?

> I would be interested in testing your dio/aio patches as well though.

There have been big changes in loop recently, so I need to port the original
patches to the current loop code, but I guess it shouldn't take much time.

Thanks,
Ming Lei


* Re: loop block-mq conversion scalability issues
  2015-04-24 21:46   ` Justin M. Forbes
  2015-04-25 10:32     ` Ming Lei
@ 2015-04-26 15:27     ` Ming Lei
  2015-04-27 13:48       ` Justin M. Forbes
  1 sibling, 1 reply; 6+ messages in thread
From: Ming Lei @ 2015-04-26 15:27 UTC (permalink / raw)
  To: Justin M. Forbes; +Cc: linux-kernel, tom.leiming

Hi Justin,

On Fri, 24 Apr 2015 16:46:02 -0500
"Justin M. Forbes" <jforbes@fedoraproject.org> wrote:

> Patch tested, it made things worse (I gave up after 5 minutes and the boot
> still seemed hung). I also tried values of 1, 16, 64, and 128.
> Everything below 128 was much worse than the current situation. Setting
> it to 128 seemed about the same as booting without the patch. I can do
> some more testing over the weekend, but I don't think this is the
> correct solution.

To describe the problem more easily, here is the Fedora live CD file structure first:

Fedora-Live-Workstation-x86_64-22_Beta-TC8.iso
	=>LiveOS/
		squashfs.img
			=>LiveOS/
				ext3fs.img

It looks like at least two factors are related to the problem:

- unlike other filesystems (such as ext4), squashfs is a bit special: I
observed that increasing the number of I/O jobs accessing a file in squashfs
doesn't improve I/O performance at all, but it does for ext4

- nested loop: both squashfs.img and ext3fs.img are mounted as loop block devices

One key idea in commit b5dd2f60 (block: loop: improve performance via blk-mq)
is to submit I/O concurrently from more than one context (worker), in POSIX
AIO style. Unfortunately this can't improve I/O performance for squashfs,
it comes with the extra cost of kworker threads, and the nested loop makes it
worse. Meanwhile, during boot there are lots of concurrent tasks requiring CPU,
so the high priority kworker threads for loop can affect other boot tasks, and
boot time increases.

I think the problem may be improved by removing the nested loop, for example
by extracting the files in ext3fs.img into squashfs.img.

> I would be interested in testing your dio/aio patches as well though.

squashfs doesn't support dio, so the dio/aio patch can't help much here, but
the motivation for introducing dio/aio is really to avoid double caching
and decrease CPU utilization[1].

[1] http://marc.info/?l=linux-kernel&m=142116397525668&w=2

The following patch may help the situation, but for this case I am wondering
whether it can compete with the previous loop implementation.
---
From 0af95571a2a066b4f3bacaac2c75b39e3c701c6e Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@canonical.com>
Date: Sun, 26 Apr 2015 17:53:56 +0800
Subject: [PATCH] block: loop: avoiding too many pending per work I/O

If there are too many pending per-work I/O requests, too many
high priority worker threads can be generated, so that
system performance can be affected.

This patch limits the max pending per-work I/O to 16,
and degrades to single queue mode when that number
is reached.

This patch fixes Fedora 22 live booting performance
regression when it is booted from squashfs over dm
based on loop.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/block/loop.c | 21 ++++++++++++++++++---
 drivers/block/loop.h |  2 ++
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c6b3726..55bd04f 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1448,13 +1448,24 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	struct loop_device *lo = cmd->rq->q->queuedata;
+	bool single_queue = !!(cmd->rq->cmd_flags & REQ_WRITE);
+
+	/*
+	 * Degrade to single queue mode if the pending per work
+	 * I/O number reaches 16, otherwise too many high priority
+	 * worker threads may affect system performance as reported
+	 * in fedora live booting from squashfs over loop.
+	 */
+	if (atomic_read(&lo->pending_per_work_io) >= 16)
+		single_queue = true;
 
 	blk_mq_start_request(bd->rq);
 
-	if (cmd->rq->cmd_flags & REQ_WRITE) {
-		struct loop_device *lo = cmd->rq->q->queuedata;
+	if (single_queue) {
 		bool need_sched = true;
 
+		cmd->per_work_io = false;
 		spin_lock_irq(&lo->lo_lock);
 		if (lo->write_started)
 			need_sched = false;
@@ -1466,6 +1477,8 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		if (need_sched)
 			queue_work(loop_wq, &lo->write_work);
 	} else {
+		cmd->per_work_io = true;
+		atomic_inc(&lo->pending_per_work_io);
 		queue_work(loop_wq, &cmd->read_work);
 	}
 
@@ -1490,6 +1503,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
 	if (ret)
 		cmd->rq->errors = -EIO;
 	blk_mq_complete_request(cmd->rq);
+	if (cmd->per_work_io)
+		atomic_dec(&lo->pending_per_work_io);
 }
 
 static void loop_queue_write_work(struct work_struct *work)
@@ -1831,7 +1846,7 @@ static int __init loop_init(void)
 	}
 
 	loop_wq = alloc_workqueue("kloopd",
-			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
+			WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
 	if (!loop_wq) {
 		err = -ENOMEM;
 		goto misc_out;
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index ffb6dd6..06d8f1a 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -57,6 +57,7 @@ struct loop_device {
 	struct list_head	write_cmd_head;
 	struct work_struct	write_work;
 	bool			write_started;
+	atomic_t		pending_per_work_io;
 	int			lo_state;
 	struct mutex		lo_ctl_mutex;
 
@@ -68,6 +69,7 @@ struct loop_device {
 struct loop_cmd {
 	struct work_struct read_work;
 	struct request *rq;
+	bool per_work_io;
 	struct list_head list;
 };
 
-- 
1.9.1
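
For readers following the patch above, its throttling logic can be modeled in
userspace. This is only a sketch under stated assumptions, not the kernel code:
struct loop_model, struct cmd_model, model_queue_rq(), model_complete(),
model_burst() and MAX_PENDING_PER_WORK_IO are hypothetical names invented for
illustration, C11 atomics stand in for the kernel's atomic_t, and no work is
actually queued. It only shows how the pending counter decides between the
per-work path and the single queue path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical constant mirroring the ">= 16" check in the patch. */
#define MAX_PENDING_PER_WORK_IO 16

/* Stand-in for the one field of struct loop_device that matters here. */
struct loop_model {
	atomic_int pending_per_work_io;
};

/* Stand-in for struct loop_cmd. */
struct cmd_model {
	bool is_write;
	bool per_work_io;
};

/*
 * Model of the dispatch decision in loop_queue_rq(): writes always take
 * the single queue path, and reads take it too once the number of pending
 * per-work requests has reached the threshold.
 * Returns true if the command got its own work item.
 */
static bool model_queue_rq(struct loop_model *lo, struct cmd_model *cmd)
{
	bool single_queue = cmd->is_write;

	if (atomic_load(&lo->pending_per_work_io) >= MAX_PENDING_PER_WORK_IO)
		single_queue = true;

	if (single_queue) {
		cmd->per_work_io = false;
	} else {
		cmd->per_work_io = true;
		atomic_fetch_add(&lo->pending_per_work_io, 1);
	}
	return cmd->per_work_io;
}

/* Model of the completion side in loop_handle_cmd(). */
static void model_complete(struct loop_model *lo, struct cmd_model *cmd)
{
	if (cmd->per_work_io) {
		/* Every decrement must pair with an earlier increment. */
		assert(atomic_load(&lo->pending_per_work_io) > 0);
		atomic_fetch_sub(&lo->pending_per_work_io, 1);
	}
}

/*
 * Queue n reads back to back with no completions in between and
 * return how many of them were given their own work item.
 */
static int model_burst(struct loop_model *lo, struct cmd_model *cmds, int n)
{
	int own = 0;

	for (int i = 0; i < n; i++) {
		cmds[i].is_write = false;
		cmds[i].per_work_io = false;
		if (model_queue_rq(lo, &cmds[i]))
			own++;
	}
	return own;
}
```

In this model, a burst of 40 reads on a fresh loop_model gives 16 commands
their own work item and sends the remaining 24 down the single queue path;
once all 40 are passed to model_complete() the counter is back at zero. The
real patch additionally drops WQ_UNBOUND from the workqueue flags, which this
sketch does not model.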


Thanks,
Ming Lei



* Re: loop block-mq conversion scalability issues
  2015-04-26 15:27     ` Ming Lei
@ 2015-04-27 13:48       ` Justin M. Forbes
  0 siblings, 0 replies; 6+ messages in thread
From: Justin M. Forbes @ 2015-04-27 13:48 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, tom.leiming

This patch tests well. It gives comparable boot times to the 3.19 loop
implementation and does not seem to have problems scaling with SMP.  I
would love to see this make 4.1 and 4.0.x stable as well.

Thanks,
Justin

