dm-devel.redhat.com archive mirror
* [PATCH] fix crash in blk_queue_abort
@ 2009-04-16 23:52 Mikulas Patocka
  2009-04-17  6:36 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Mikulas Patocka @ 2009-04-16 23:52 UTC (permalink / raw)
  To: axboe; +Cc: dm-devel, Alasdair G Kergon

Hi

This fixes a crash in blk_abort_queue. The crash can be triggered with 
device mapper multipath.

I believe that since there is already a method make_request_fn, the cleanest 
solution is to add another method, abort_queue_fn. But you can use a 
different solution if you want (like testing some bit that indicates whether 
the device is request-based ... or so).

Mikulas

---

Fix a crash caused by blk_abort_queue being called on a non-request-based device.

The crash can be reproduced in the following way:

# dmsetup create alias1 --table "0 `blockdev --getsize /dev/sda` linear /dev/sda 0"
# dmsetup create alias2 --table "0 `blockdev --getsize /dev/sda` linear /dev/sda 0"
# dmsetup create mpath --table "0 `blockdev --getsize /dev/sda` multipath 0 0 2 1 round-robin 0 1 0 /dev/mapper/alias1 round-robin 0 1 0 /dev/mapper/alias2"
# dmsetup reload alias1 --table "0 `blockdev --getsize /dev/sda` error"
# dmsetup suspend alias1
# dmsetup resume alias1
# less -f /dev/mapper/mpath

TPC: <__lock_acquire+0x5c/0x1c00>
Caller[000000000047f468]: lock_acquire+0xa8/0xc0
Caller[000000000065a978]: _spin_lock_irqsave+0x38/0x60
Caller[000000000054f7b4]: blk_abort_queue+0x34/0x140
Caller[00000000100ffe04]: deactivate_path+0x44/0x60 [dm_multipath]
Caller[0000000000468898]: worker_thread+0x1d8/0x2e0
Caller[000000000046d4ac]: kthread+0x4c/0x80
Caller[000000000042bc1c]: kernel_thread+0x3c/0x60
Caller[000000000046d3e4]: kthreadd+0x104/0x180

The crash happens because the queue spinlock pointer is NULL when
blk_abort_queue is called.

The problem is that blk_abort_queue assumes that the underlying device is
request-based. If the device uses bios, not requests, blk_abort_queue
accesses uninitialized data structures and crashes.

This patch changes it to provide a method, abort_queue_fn, that aborts the
queue. On request-based devices it points to generic_abort_queue; on
non-request-based devices it can be NULL (no abort) or the driver can
register its own abort function there.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 block/blk-core.c       |    1 +
 block/blk-timeout.c    |   10 ++++++++--
 include/linux/blkdev.h |    3 +++
 3 files changed, 12 insertions(+), 2 deletions(-)

Index: linux-2.6.30-rc2-devel/block/blk-core.c
===================================================================
--- linux-2.6.30-rc2-devel.orig/block/blk-core.c	2009-04-16 22:20:59.000000000 +0200
+++ linux-2.6.30-rc2-devel/block/blk-core.c	2009-04-16 23:44:25.000000000 +0200
@@ -598,6 +598,7 @@ blk_init_queue_node(request_fn_proc *rfn
 		lock = &q->__queue_lock;
 
 	q->request_fn		= rfn;
+	q->abort_queue_fn	= generic_abort_queue;
 	q->prep_rq_fn		= NULL;
 	q->unplug_fn		= generic_unplug_device;
 	q->queue_flags		= QUEUE_FLAG_DEFAULT;
Index: linux-2.6.30-rc2-devel/block/blk-timeout.c
===================================================================
--- linux-2.6.30-rc2-devel.orig/block/blk-timeout.c	2009-04-16 22:30:27.000000000 +0200
+++ linux-2.6.30-rc2-devel/block/blk-timeout.c	2009-04-16 23:46:28.000000000 +0200
@@ -205,7 +205,7 @@ void blk_add_timer(struct request *req)
  * @queue:	pointer to queue
  *
  */
-void blk_abort_queue(struct request_queue *q)
+void generic_abort_queue(struct request_queue *q)
 {
 	unsigned long flags;
 	struct request *rq, *tmp;
@@ -227,4 +227,10 @@ void blk_abort_queue(struct request_queu
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
 }
-EXPORT_SYMBOL_GPL(blk_abort_queue);
+
+void blk_abort_queue(struct request_queue *q)
+{
+	if (q->abort_queue_fn)
+		q->abort_queue_fn(q);
+}
+EXPORT_SYMBOL(blk_abort_queue);
Index: linux-2.6.30-rc2-devel/include/linux/blkdev.h
===================================================================
--- linux-2.6.30-rc2-devel.orig/include/linux/blkdev.h	2009-04-16 22:20:43.000000000 +0200
+++ linux-2.6.30-rc2-devel/include/linux/blkdev.h	2009-04-16 22:32:27.000000000 +0200
@@ -267,6 +267,7 @@ struct request_pm_state
 
 typedef void (request_fn_proc) (struct request_queue *q);
 typedef int (make_request_fn) (struct request_queue *q, struct bio *bio);
+typedef void (abort_queue_fn) (struct request_queue *q);
 typedef int (prep_rq_fn) (struct request_queue *, struct request *);
 typedef void (unplug_fn) (struct request_queue *);
 typedef int (prepare_discard_fn) (struct request_queue *, struct request *);
@@ -332,6 +333,7 @@ struct request_queue
 
 	request_fn_proc		*request_fn;
 	make_request_fn		*make_request_fn;
+	abort_queue_fn		*abort_queue_fn;
 	prep_rq_fn		*prep_rq_fn;
 	unplug_fn		*unplug_fn;
 	prepare_discard_fn	*prepare_discard_fn;
@@ -903,6 +905,7 @@ extern bool blk_ordered_complete_seq(str
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
 extern void blk_dump_rq_flags(struct request *, char *);
 extern void generic_unplug_device(struct request_queue *);
+extern void generic_abort_queue(struct request_queue *);
 extern long nr_blockdev_pages(void);
 
 int blk_get_queue(struct request_queue *);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] fix crash in blk_queue_abort
  2009-04-16 23:52 [PATCH] fix crash in blk_queue_abort Mikulas Patocka
@ 2009-04-17  6:36 ` Jens Axboe
  2009-04-20  6:20   ` Mikulas Patocka
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2009-04-17  6:36 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G Kergon

On Thu, Apr 16 2009, Mikulas Patocka wrote:
> Hi
> 
> This fixes a crash in blk_abort_queue. The crash can be triggered with 
> device mapper multipath.
> 
> I believe that since there is method make_request_fn, the cleanest 
> solution is to add another method, abort_queue_fn. But you can use 
> different solution if you want (like testing some bit whether the device 
> is request-based ... or so).
> 
> Mikulas
> 
> ---
> 
> Fix a crash due to blk_abort_queue being called on non-request device.
> 
> The crash can be reproduced in the following way:
> 
> # dmsetup create alias1 --table "0 `blockdev --getsize /dev/sda` linear /dev/sda 0"
> # dmsetup create alias2 --table "0 `blockdev --getsize /dev/sda` linear /dev/sda 0"
> # dmsetup create mpath --table "0 `blockdev --getsize /dev/sda` multipath 0 0 2 1 round-robin 0 1 0 /dev/mapper/alias1 round-robin 0 1 0 /dev/mapper/alias2"
> # dmsetup reload alias1 --table "0 `blockdev --getsize /dev/sda` error"
> # dmsetup suspend alias1
> # dmsetup resume alias1
> # less -f /dev/mapper/mpath
> 
> TPC: <__lock_acquire+0x5c/0x1c00>
> Caller[000000000047f468]: lock_acquire+0xa8/0xc0
> Caller[000000000065a978]: _spin_lock_irqsave+0x38/0x60
> Caller[000000000054f7b4]: blk_abort_queue+0x34/0x140
> Caller[00000000100ffe04]: deactivate_path+0x44/0x60 [dm_multipath]
> Caller[0000000000468898]: worker_thread+0x1d8/0x2e0
> Caller[000000000046d4ac]: kthread+0x4c/0x80
> Caller[000000000042bc1c]: kernel_thread+0x3c/0x60
> Caller[000000000046d3e4]: kthreadd+0x104/0x180
> 
> The crash happens because queue spinlock pointer is NULL and blk_abort_queue is
> called.
> 
> The problem is that blk_abort_queue assumes that the underlying device is
> request-based. If it uses bios, not requests, it accesses uninitialized data
> structures and crashes.
> 
> This patch changes it to provide a method, abort_queue_fn, that will
> abort the queue. On request-based devices, it points to generic_abort_queue,
> on non-request based devices it can be NULL (no abort) or the driver can
> register its own abort function there.

Let's just add a

        if (!q->request_fn)
                return;

at the top of blk_abort_queue(); there's no need to make this generic until
someone actually needs different behaviour.
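
For completeness, a rough sketch of how that check would sit in the current
function (the body is abbreviated from the 2.6.30-rc2 blk-timeout.c quoted
in your patch above; treat this as a sketch, not the final commit):

        void blk_abort_queue(struct request_queue *q)
        {
                unsigned long flags;

                /*
                 * Sketch only: bio-based queues (e.g. dm) never set
                 * request_fn and never initialize queue_lock or the
                 * timeout list, so there is nothing to abort here.
                 */
                if (!q->request_fn)
                        return;

                spin_lock_irqsave(q->queue_lock, flags);
                /* ... existing body: abort the requests on q->timeout_list ... */
                spin_unlock_irqrestore(q->queue_lock, flags);
        }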

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] fix crash in blk_queue_abort
  2009-04-17  6:36 ` Jens Axboe
@ 2009-04-20  6:20   ` Mikulas Patocka
  2009-04-21  7:29     ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Mikulas Patocka @ 2009-04-20  6:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: dm-devel, Alasdair G Kergon

> Lets just add a
> 
>         if (!q->request_fn)
>                 return;
> 
> at the top of blk_abort_queue(), no need in making this generic until
> someone actually needs different behaviour.
> 
> -- 
> Jens Axboe

Good point. Do it this way.

Mikulas

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] fix crash in blk_queue_abort
  2009-04-20  6:20   ` Mikulas Patocka
@ 2009-04-21  7:29     ` Jens Axboe
  2010-07-26 21:53       ` BIO_RW_SYNCIO Mikulas Patocka
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2009-04-21  7:29 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-devel, Alasdair G Kergon

On Mon, Apr 20 2009, Mikulas Patocka wrote:
> > Lets just add a
> > 
> >         if (!q->request_fn)
> >                 return;
> > 
> > at the top of blk_abort_queue(), no need in making this generic until
> > someone actually needs different behaviour.
> > 
> > -- 
> > Jens Axboe
> 
> Good point. Do it this way.

Already did :-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* BIO_RW_SYNCIO
  2009-04-21  7:29     ` Jens Axboe
@ 2010-07-26 21:53       ` Mikulas Patocka
  2010-07-27  2:09         ` BIO_RW_SYNCIO Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Mikulas Patocka @ 2010-07-26 21:53 UTC (permalink / raw)
  To: Jens Axboe; +Cc: dm-devel, Alasdair G Kergon

Hi Jens

I found out that when I remove BIO_RW_SYNCIO from dm-kcopyd.c, performance 
when writing to the snapshot origin doubles. In this workload, the snapshot 
driver reads a lot of 8k chunks from one place on the disk and writes them 
to another place on the same disk. Without BIO_RW_SYNCIO it is twice as fast 
with the CFQ scheduler.
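
For reference, the difference is just whether the SYNCIO bit is set in the
rw flags of the I/O that kcopyd submits through dm-io. Schematically (an
illustration only, not the actual dm-kcopyd code; "bio" here is a
placeholder):

        /* illustration only: current behaviour, writes marked as sync */
        submit_bio(WRITE | (1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG), bio);

        /* the variant I measured: plain async writes */
        submit_bio(WRITE | (1 << BIO_RW_UNPLUG), bio);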

What exactly does BIO_RW_SYNCIO do? Does it try to decrease latency at the 
cost of throughput (so that the disk head seeks more between the source and 
the destination, causing the performance degradation)? Is there some general 
guidance on when to use BIO_RW_SYNCIO and when not?

Mikulas

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BIO_RW_SYNCIO
  2010-07-26 21:53       ` BIO_RW_SYNCIO Mikulas Patocka
@ 2010-07-27  2:09         ` Vivek Goyal
  2010-07-27 19:48           ` BIO_RW_SYNCIO Mikulas Patocka
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2010-07-27  2:09 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Jens Axboe, dm-devel, Alasdair G Kergon

On Mon, Jul 26, 2010 at 05:53:40PM -0400, Mikulas Patocka wrote:
> Hi Jens
> 
[ Jens's mail id has changed. Ccing him on different mail id ]

> I found out that when I remove BIO_RW_SYNCIO from dm-kcopyd.c, performance 
> when writing to the snapshot origin is increased twice. In this workload, 
> snapshot driver reads a lot of 8k chunks from one place on the disk and 
> writes them to the other place on the same disk. Without BIO_RW_SYNCIO, it 
> is twice faster with CFQ scheduler.
> 
> What does BIO_RW_SYNCIO exactly do?  Does it attempt to decrease latency 
> and decrease throughput? (so that disk head seeks more between the source 
> and destination and it causes performance degradation) Is there some 
> general guidance, when use BIO_RW_SYNCIO and when not?

BIO_RW_SYNC marks a request as SYNC. Reads are sync by default. Writes can
be marked as SYNC so that they get priority over ASYNC WRITES.

Marking writes as SYNC has the advantage that they get their own queue; they
can preempt ASYNC WRITES, and CFQ also idles on the cfq queue waiting for
more requests.

I suspect that the idling on SYNC WRITES is the problem here.

- Is it the same thread which is submitting both read and write requests?
- Are you interleaving reads and writes, like reading some data, then
  writing it and reading again?

I guess that after writing some data, CFQ is idling on WRITE_SYNC and the
next READ does not get dispatched to the disk immediately.

Is it possible to provide a 30-second blktrace of the underlying device
where CFQ is running? You can also try setting slice_idle to 0 and see if
it helps.

If setting slice_idle=0 helps you, can you please also give the following
patch a try:

https://patchwork.kernel.org/patch/113061/

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BIO_RW_SYNCIO
  2010-07-27  2:09         ` BIO_RW_SYNCIO Vivek Goyal
@ 2010-07-27 19:48           ` Mikulas Patocka
  2010-07-27 23:09             ` BIO_RW_SYNCIO Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Mikulas Patocka @ 2010-07-27 19:48 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, dm-devel, Alasdair G Kergon



On Mon, 26 Jul 2010, Vivek Goyal wrote:

> On Mon, Jul 26, 2010 at 05:53:40PM -0400, Mikulas Patocka wrote:
> > Hi Jens
> > 
> [ Jens's mail id has changed. Ccing him on different mail id ]
> 
> > I found out that when I remove BIO_RW_SYNCIO from dm-kcopyd.c, performance 
> > when writing to the snapshot origin is increased twice. In this workload, 
> > snapshot driver reads a lot of 8k chunks from one place on the disk and 
> > writes them to the other place on the same disk. Without BIO_RW_SYNCIO, it 
> > is twice faster with CFQ scheduler.
> > 
> > What does BIO_RW_SYNCIO exactly do?  Does it attempt to decrease latency 
> > and decrease throughput? (so that disk head seeks more between the source 
> > and destination and it causes performance degradation) Is there some 
> > general guidance, when use BIO_RW_SYNCIO and when not?
> 
> BIO_RW_SYNC marks a request as SYNC. reads are by default sync. Writes
> can be marked as SYNC so that they get priority of ASYNC WRITES.
> 
> Marking writes as SYNC has advantage that it gets its own queue, it can
> preempt ASYNC WRITES and CFQ also idles on the cfq queue waiting for
> more requests.
> 
> I suspect if it is idling on SYNC WRITES which is a problem here.
> 
> - Is it the same thread which is submitting both read and write requests?

Yes.

> - Are you interleaving reads and writes. Like reading some data, then
>   writting it and reading again...

I issue the write immediately when the read finishes.

> I guess after writting some data, CFQ is idling on WRITE_SYNC and next
> READ does not get dispatched immediately to the disk.
> 
> Is it possible to provide some 30 second blktrace of the underlying device
> where CFQ is running. You can also try setting slice_idle to 0 and see if
> it helps.
> 
> If setting slice_idle=0 helps you, can you please also give a try to
> following patch.
> 
> https://patchwork.kernel.org/patch/113061/
> 
> Thanks
> Vivek

I took the traces and placed them at 
http://people.redhat.com/mpatocka/data/blktrace/

It shows that WRITE requests are merged without the SYNCIO flag and are not 
merged when SYNCIO is used.

Neither slice_idle=0 nor the patch helps.


BTW, blkparse seems to ignore the input file: when I use "blkparse -i 
sda.blktrace.0.sync", blkparse still displays the content of 
"sda.blktrace.0" and not "sda.blktrace.0.sync".

I have to use "cat sda.blktrace.0.sync | blkparse -i -" to read the trace. 
Is this a blkparse bug? I use the blkparse from Debian Lenny.

Mikulas

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BIO_RW_SYNCIO
  2010-07-27 19:48           ` BIO_RW_SYNCIO Mikulas Patocka
@ 2010-07-27 23:09             ` Vivek Goyal
  2010-07-28  1:33               ` BIO_RW_SYNCIO Vivek Goyal
  2010-07-28 12:42               ` BIO_RW_SYNCIO Mikulas Patocka
  0 siblings, 2 replies; 12+ messages in thread
From: Vivek Goyal @ 2010-07-27 23:09 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Jens Axboe, dm-devel, Alasdair G Kergon

On Tue, Jul 27, 2010 at 03:48:52PM -0400, Mikulas Patocka wrote:
> 
> 
> On Mon, 26 Jul 2010, Vivek Goyal wrote:
> 
> > On Mon, Jul 26, 2010 at 05:53:40PM -0400, Mikulas Patocka wrote:
> > > Hi Jens
> > > 
> > [ Jens's mail id has changed. Ccing him on different mail id ]
> > 
> > > I found out that when I remove BIO_RW_SYNCIO from dm-kcopyd.c, performance 
> > > when writing to the snapshot origin is increased twice. In this workload, 
> > > snapshot driver reads a lot of 8k chunks from one place on the disk and 
> > > writes them to the other place on the same disk. Without BIO_RW_SYNCIO, it 
> > > is twice faster with CFQ scheduler.
> > > 
> > > What does BIO_RW_SYNCIO exactly do?  Does it attempt to decrease latency 
> > > and decrease throughput? (so that disk head seeks more between the source 
> > > and destination and it causes performance degradation) Is there some 
> > > general guidance, when use BIO_RW_SYNCIO and when not?
> > 
> > BIO_RW_SYNC marks a request as SYNC. reads are by default sync. Writes
> > can be marked as SYNC so that they get priority of ASYNC WRITES.
> > 
> > Marking writes as SYNC has advantage that it gets its own queue, it can
> > preempt ASYNC WRITES and CFQ also idles on the cfq queue waiting for
> > more requests.
> > 
> > I suspect if it is idling on SYNC WRITES which is a problem here.
> > 
> > - Is it the same thread which is submitting both read and write requests?
> 
> Yes.
> 
> > - Are you interleaving reads and writes. Like reading some data, then
> >   writting it and reading again...
> 
> I issue write immediatelly when read finishes.
> 
> > I guess after writting some data, CFQ is idling on WRITE_SYNC and next
> > READ does not get dispatched immediately to the disk.
> > 
> > Is it possible to provide some 30 second blktrace of the underlying device
> > where CFQ is running. You can also try setting slice_idle to 0 and see if
> > it helps.
> > 
> > If setting slice_idle=0 helps you, can you please also give a try to
> > following patch.
> > 
> > https://patchwork.kernel.org/patch/113061/
> > 
> > Thanks
> > Vivek
> 
> I took the traces and placed them at 
> http://people.redhat.com/mpatocka/data/blktrace/
> 
> It shows that WRITE requests are merged without SYNCIO flags and are not 
> merged if SYNCIO is used.

Yes, you are right. So I think the following is happening.

The key is that requests don't get merged once they are on the dispatch
list. They get merged only while they are still sitting in some cfq queue,
i.e. while they are still with the elevator.

In the sync IO case, both reads and writes go to a single cfq queue. We are
driving a deep dispatch list (drv=65). That means there are 65 reads and
writes on the dispatch list and none of the new requests can be merged with
those.

In the async IO case, reads and writes go to different cfq queues. While
reads are being dispatched from one queue, writes are sitting in CFQ and are
still open to merging. That's why we see a lot more WRITE merging in the
async case.

I'm not sure what we can do about it, though. But I had a couple of
questions:

- You seem to be issuing lots of adjacent 4K READS and WRITES. Is there a
  way that you can club these together and issue bigger requests?

- What kind of device is this where the request queue depth is 65? Can you
  try reducing the queue depth to, say, 16 and see if things improve a bit
  (/sys/block/<dev>/device/queue_depth)?

> 
> slice_idle=0 or the patch don't help.

So it is a case of a single cfq queue, and slice_idle will not help.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BIO_RW_SYNCIO
  2010-07-27 23:09             ` BIO_RW_SYNCIO Vivek Goyal
@ 2010-07-28  1:33               ` Vivek Goyal
  2010-07-28 12:35                 ` BIO_RW_SYNCIO Mikulas Patocka
  2010-07-28 12:42               ` BIO_RW_SYNCIO Mikulas Patocka
  1 sibling, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2010-07-28  1:33 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Jens Axboe, dm-devel, Alasdair G Kergon

On Tue, Jul 27, 2010 at 07:09:50PM -0400, Vivek Goyal wrote:
> On Tue, Jul 27, 2010 at 03:48:52PM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Mon, 26 Jul 2010, Vivek Goyal wrote:
> > 
> > > On Mon, Jul 26, 2010 at 05:53:40PM -0400, Mikulas Patocka wrote:
> > > > Hi Jens
> > > > 
> > > [ Jens's mail id has changed. Ccing him on different mail id ]
> > > 
> > > > I found out that when I remove BIO_RW_SYNCIO from dm-kcopyd.c, performance 
> > > > when writing to the snapshot origin is increased twice. In this workload, 
> > > > snapshot driver reads a lot of 8k chunks from one place on the disk and 
> > > > writes them to the other place on the same disk. Without BIO_RW_SYNCIO, it 
> > > > is twice faster with CFQ scheduler.
> > > > 
> > > > What does BIO_RW_SYNCIO exactly do?  Does it attempt to decrease latency 
> > > > and decrease throughput? (so that disk head seeks more between the source 
> > > > and destination and it causes performance degradation) Is there some 
> > > > general guidance, when use BIO_RW_SYNCIO and when not?
> > > 
> > > BIO_RW_SYNC marks a request as SYNC. reads are by default sync. Writes
> > > can be marked as SYNC so that they get priority of ASYNC WRITES.
> > > 
> > > Marking writes as SYNC has advantage that it gets its own queue, it can
> > > preempt ASYNC WRITES and CFQ also idles on the cfq queue waiting for
> > > more requests.
> > > 
> > > I suspect if it is idling on SYNC WRITES which is a problem here.
> > > 
> > > - Is it the same thread which is submitting both read and write requests?
> > 
> > Yes.
> > 
> > > - Are you interleaving reads and writes. Like reading some data, then
> > >   writting it and reading again...
> > 
> > I issue write immediatelly when read finishes.
> > 
> > > I guess after writting some data, CFQ is idling on WRITE_SYNC and next
> > > READ does not get dispatched immediately to the disk.
> > > 
> > > Is it possible to provide some 30 second blktrace of the underlying device
> > > where CFQ is running. You can also try setting slice_idle to 0 and see if
> > > it helps.
> > > 
> > > If setting slice_idle=0 helps you, can you please also give a try to
> > > following patch.
> > > 
> > > https://patchwork.kernel.org/patch/113061/
> > > 
> > > Thanks
> > > Vivek
> > 
> > I took the traces and placed them at 
> > http://people.redhat.com/mpatocka/data/blktrace/
> > 
> > It shows that WRITE requests are merged without SYNCIO flags and are not 
> > merged if SYNCIO is used.
> 
> Yes you are right. So I think following is happening.
> 
> Key is that requests don't get merged once they are on dispatch list. They
> get merged only when they are still sitting in some cfq queue and are with
> elevator.
> 
> In case of sync IO, both reads and writes are on single cfq queue. We are
> driving good dispatch list depth (drv=65). That means there are 65 reads
> and writes on dispatch list and none of the new requests can be merged with
> those.
> 
> In case of async IO, reads and writes are going on different cfq queues.
> While reads are being dispatched from one queue, writes are sitting in 
> CFQ and are open to merge. That's the reason we are seeing lot more WRITE
> merging with async case.
> 
> Not sure what we can do about it though. But had a couple of questions.
> 
> - You seem to be issuing lots of 4K size adjacent READS and WRITES. Is
>   there a way that you can club these together and issue a bigger request.
> 
> - What kind of device this is where request queue depth is 65. Can you
>   try reducing request queue depth to say 16 and see if things improve
>   a bit. (/sys/block/<dev>/device/queue_depth).
> 

One more thing: have you changed the "quantum" tunable of CFQ? Looking at
the traces of the SYNC case, CFQ seems to be allowing up to 65 requests from
a single cfq queue, but that should not happen; dispatch is limited to 8
requests max from a cfq queue.

Which kernel version are you using?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BIO_RW_SYNCIO
  2010-07-28  1:33               ` BIO_RW_SYNCIO Vivek Goyal
@ 2010-07-28 12:35                 ` Mikulas Patocka
  0 siblings, 0 replies; 12+ messages in thread
From: Mikulas Patocka @ 2010-07-28 12:35 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, dm-devel, Alasdair G Kergon

> One more thing. Have you change the "quantum" tunable of CFQ? Looking
> at the traces of SYNC case, CFQ seems to be allowing up to 65 requests
> from a single cfq queue. But that does not happen. It is limited to 8
> requests max from a cfq queue.
> 
> Which kernel version are you using?
> 
> Thanks
> Vivek

I haven't changed it. It's 8 --- the default. I'm using 2.6.34.

Mikulas

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BIO_RW_SYNCIO
  2010-07-27 23:09             ` BIO_RW_SYNCIO Vivek Goyal
  2010-07-28  1:33               ` BIO_RW_SYNCIO Vivek Goyal
@ 2010-07-28 12:42               ` Mikulas Patocka
  2010-07-28 15:44                 ` BIO_RW_SYNCIO Vivek Goyal
  1 sibling, 1 reply; 12+ messages in thread
From: Mikulas Patocka @ 2010-07-28 12:42 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, dm-devel, Alasdair G Kergon

> > I took the traces and placed them at 
> > http://people.redhat.com/mpatocka/data/blktrace/
> > 
> > It shows that WRITE requests are merged without SYNCIO flags and are not 
> > merged if SYNCIO is used.
> 
> Yes you are right. So I think following is happening.
> 
> Key is that requests don't get merged once they are on dispatch list. They
> get merged only when they are still sitting in some cfq queue and are with
> elevator.
>
> In case of sync IO, both reads and writes are on single cfq queue. We are
> driving good dispatch list depth (drv=65). That means there are 65 reads
> and writes on dispatch list and none of the new requests can be merged with
> those.
> 
> In case of async IO, reads and writes are going on different cfq queues.
> While reads are being dispatched from one queue, writes are sitting in 
> CFQ and are open to merge. That's the reason we are seeing lot more WRITE
> merging with async case.
> 
> Not sure what we can do about it though. But had a couple of questions.
> 
> - You seem to be issuing lots of 4K size adjacent READS and WRITES. Is
>   there a way that you can club these together and issue a bigger request.

It is possible, but it would mean a major increase in code size (replicating 
the merge functionality in dm-kcopyd). We don't have problems with CPU time 
consumption, so we are not planning it now.

It is just simpler to turn off BIO_RW_SYNCIO. I also turned off BIO_RW_UNPLUG 
and unplug the queue only after submitting several requests. That improves 
performance to 22MB/s.
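
The batching I mean is roughly this (a simplified sketch, not the actual
dm-kcopyd code; "i", "bios", "nr_bios" and "bdev" are just placeholders):

        /* sketch: submit a batch of writes without the per-bio unplug hint ... */
        for (i = 0; i < nr_bios; i++)
                submit_bio(WRITE, bios[i]);

        /* ... then kick the queue once for the whole batch */
        blk_unplug(bdev_get_queue(bdev));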

> - What kind of device this is where request queue depth is 65. Can you
>   try reducing request queue depth to say 16 and see if things improve
>   a bit. (/sys/block/<dev>/device/queue_depth).

A Seagate U320 SCSI disk on an MPT controller. It has 64 tags.

Reducing the number of tags improved performance: 16 was good (19MB/s), and 
reducing it to 4 or 1 improved it even more (22MB/s).

Mikulas

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: BIO_RW_SYNCIO
  2010-07-28 12:42               ` BIO_RW_SYNCIO Mikulas Patocka
@ 2010-07-28 15:44                 ` Vivek Goyal
  0 siblings, 0 replies; 12+ messages in thread
From: Vivek Goyal @ 2010-07-28 15:44 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Jens Axboe, dm-devel, Alasdair G Kergon

On Wed, Jul 28, 2010 at 08:42:06AM -0400, Mikulas Patocka wrote:
> > > I took the traces and placed them at 
> > > http://people.redhat.com/mpatocka/data/blktrace/
> > > 
> > > It shows that WRITE requests are merged without SYNCIO flags and are not 
> > > merged if SYNCIO is used.
> > 
> > Yes you are right. So I think following is happening.
> > 
> > Key is that requests don't get merged once they are on dispatch list. They
> > get merged only when they are still sitting in some cfq queue and are with
> > elevator.
> >
> > In case of sync IO, both reads and writes are on single cfq queue. We are
> > driving good dispatch list depth (drv=65). That means there are 65 reads
> > and writes on dispatch list and none of the new requests can be merged with
> > those.
> > 
> > In case of async IO, reads and writes are going on different cfq queues.
> > While reads are being dispatched from one queue, writes are sitting in 
> > CFQ and are open to merge. That's the reason we are seeing lot more WRITE
> > merging with async case.
> > 
> > Not sure what we can do about it though. But had a couple of questions.
> > 
> > - You seem to be issuing lots of 4K size adjacent READS and WRITES. Is
> >   there a way that you can club these together and issue a bigger request.
> 
> It is possible, but it would mean major code size increase (replicate the 
> merge functionality in dm-kcopyd). We don't have problems with CPU time 
> consumption, so we are not planning it now.
> 
> It just simpler to turn off BIO_RW_SYNCIO. I also turned off BIO_RW_UNPLUG 
> and unplug the queue after more requests. It improves performance to 
> 22MB/s.
> 

I am not very sure how effective the unplug thing is, because it only works
when there is no IO happening on the device. From the blktraces it looks
like the device is continuously busy.

I guess that not marking the writes as sync is probably best in this case.
Though it will be interesting to see how this looks in the presence of other
buffered writers on some other partition of the device. I am not sure how
common that case is.

Are you seeing the same issue with deadline as well? I guess deadline might
run into the same issue, and because there is no idling logic there, I think
even turning off BIO_RW_SYNCIO is not going to help.

> > - What kind of device this is where request queue depth is 65. Can you
> >   try reducing request queue depth to say 16 and see if things improve
> >   a bit. (/sys/block/<dev>/device/queue_depth).
> 
> Seagate U320 SCSI disk on MPT controller. It has 64 tags.
> 
> When I reduced the number of tags, it improved performance, 16 was good 
> (19MB/s), reducing it to 4 or 1 improved it even more (22MB/s).

Ok, I looked at the code again and realized that cfq allows unlimited
dispatch from a queue if there are no other competing queues. That's why we
are allowing 64 requests to be dispatched from a single queue at a time in
this case.

Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2010-07-28 15:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-16 23:52 [PATCH] fix crash in blk_queue_abort Mikulas Patocka
2009-04-17  6:36 ` Jens Axboe
2009-04-20  6:20   ` Mikulas Patocka
2009-04-21  7:29     ` Jens Axboe
2010-07-26 21:53       ` BIO_RW_SYNCIO Mikulas Patocka
2010-07-27  2:09         ` BIO_RW_SYNCIO Vivek Goyal
2010-07-27 19:48           ` BIO_RW_SYNCIO Mikulas Patocka
2010-07-27 23:09             ` BIO_RW_SYNCIO Vivek Goyal
2010-07-28  1:33               ` BIO_RW_SYNCIO Vivek Goyal
2010-07-28 12:35                 ` BIO_RW_SYNCIO Mikulas Patocka
2010-07-28 12:42               ` BIO_RW_SYNCIO Mikulas Patocka
2010-07-28 15:44                 ` BIO_RW_SYNCIO Vivek Goyal
