public inbox for linux-kernel@vger.kernel.org
* [patch] block: properly handle flush/fua requests in blk_insert_cloned_request
@ 2011-08-09 15:05 Jeff Moyer
  2011-08-09 15:38 ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Moyer @ 2011-08-09 15:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Jens Axboe, Mike Snitzer, Vivek Goyal

Hi,

Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae ("block: reimplement
FLUSH/FUA to support merge") introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA).  The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec.  It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:

static inline struct request *__elv_next_request(struct request_queue *q)
{
        struct request *rq;

        while (1) {
-               while (!list_empty(&q->queue_head)) {
+               if (!list_empty(&q->queue_head)) {
                        rq = list_entry_rq(q->queue_head.next);
-                       if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
-                           (rq->cmd_flags & REQ_FLUSH_SEQ))
-                               return rq;
-                       rq = blk_do_flush(q, rq);
-                       if (rq)
-                               return rq;
+                       return rq;
                }

Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
        unsigned int fflags = q->flush_flags; /* may change, cache it */
        bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
        bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
        bool do_postflush = has_flush && !has_fua &&
                            (rq->cmd_flags & REQ_FUA);
        unsigned skip = 0;
...
        if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                rq->cmd_flags &= ~REQ_FLUSH;
                if (!has_fua)
                        rq->cmd_flags &= ~REQ_FUA;
                return rq;
        }

So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).

Now, however, we don't get into the flush machinery at all.  Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.
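
To make the old short-circuit concrete, here is a toy model of the
condition described above (illustrative flag values and a hypothetical
helper name, not the actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the real ones live in blk_types.h. */
#define REQ_FLUSH (1u << 0)
#define REQ_FUA   (1u << 1)

/*
 * Toy model of the old blk_do_flush() short-circuit: when the queue
 * advertises no flush capability, FLUSH/FUA are stripped from a data
 * request and it bypasses the flush machinery entirely.
 */
static unsigned old_short_circuit(unsigned q_flags, unsigned cmd_flags,
                                  bool has_data)
{
	bool has_flush = q_flags & REQ_FLUSH;
	bool has_fua   = q_flags & REQ_FUA;
	bool do_preflush  = has_flush && (cmd_flags & REQ_FLUSH);
	bool do_postflush = has_flush && !has_fua && (cmd_flags & REQ_FUA);

	if (has_data && !do_preflush && !do_postflush) {
		cmd_flags &= ~REQ_FLUSH;	/* no cache to flush */
		if (!has_fua)
			cmd_flags &= ~REQ_FUA;	/* cannot honor FUA */
	}
	return cmd_flags;
}
```

With q_flags == 0 (the EVA-behind-multipath case), both flags come off
and the request proceeds directly, which is the behavior that was lost.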

I've attached a fix for this.  Since request-based dm operates below the
elevator, the flush sequencing is done above dm.  So, when a flush
request is cloned and handed off to blk_insert_cloned_request, we need to
preserve the REQ_FLUSH_SEQ flag, and put the request directly on the
queue (no need to go through the flush machinery again).  In the case of
an empty flush where the underlying device does not advertise a
write-back cache, we can simply complete the request.

This patch regains the lost performance.  Comments, as always, are
appreciated.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

diff --git a/block/blk-core.c b/block/blk-core.c
index b850bed..c4213c1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
 
 static int __make_request(struct request_queue *q, struct bio *bio);
+static bool blk_end_bidi_request(struct request *rq, int error,
+				 unsigned int nr_bytes, unsigned int bidi_bytes);
 
 /*
  * For the allocated request tables
@@ -1708,6 +1710,21 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	    should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
 		return -EIO;
 
+	/*
+	 * Check the cmd_flags against the flush flags of the underlying
+	 * request_queue and resolve any differences.
+	 */
+	if (rq->cmd_flags & (REQ_FLUSH|REQ_FUA)) {
+		if (!(q->flush_flags & REQ_FLUSH))
+			rq->cmd_flags &= ~REQ_FLUSH;
+		if (!(q->flush_flags & REQ_FUA))
+			rq->cmd_flags &= ~REQ_FUA;
+		if (!(rq->cmd_flags & REQ_FLUSH) && !blk_rq_sectors(rq)) {
+			blk_end_bidi_request(rq, 0, 0, 0);
+			return 0;
+		}
+	}
+
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	/*
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6395692..4fe753f 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -168,7 +168,7 @@ enum rq_flag_bits {
 #define REQ_COMMON_MASK \
 	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_DISCARD | \
 	 REQ_NOIDLE | REQ_FLUSH | REQ_FUA | REQ_SECURE)
-#define REQ_CLONE_MASK		REQ_COMMON_MASK
+#define REQ_CLONE_MASK		(REQ_COMMON_MASK | REQ_FLUSH_SEQ)
 
 #define REQ_RAHEAD		(1 << __REQ_RAHEAD)
 #define REQ_THROTTLED		(1 << __REQ_THROTTLED)

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [patch] block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 15:05 [patch] block: properly handle flush/fua requests in blk_insert_cloned_request Jeff Moyer
@ 2011-08-09 15:38 ` Tejun Heo
  2011-08-09 15:53   ` Jeff Moyer
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-08-09 15:38 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, Jens Axboe, Mike Snitzer, Vivek Goyal

Hello,

On Tue, Aug 09, 2011 at 11:05:34AM -0400, Jeff Moyer wrote:
> @@ -1708,6 +1710,21 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
>  	    should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
>  		return -EIO;
>  
> +	/*
> +	 * Check the cmd_flags against the flush flags of the underlying
> +	 * request_queue and resolve any differences.
> +	 */
> +	if (rq->cmd_flags & (REQ_FLUSH|REQ_FUA)) {
> +		if (!(q->flush_flags & REQ_FLUSH))
> +			rq->cmd_flags &= ~REQ_FLUSH;
> +		if (!(q->flush_flags & REQ_FUA))
> +			rq->cmd_flags &= ~REQ_FUA;
> +		if (!(rq->cmd_flags & REQ_FLUSH) && !blk_rq_sectors(rq)) {
> +			blk_end_bidi_request(rq, 0, 0, 0);
> +			return 0;
> +		}
> +	}
> +

I'm a bit confused.  We still need ELEVATOR_INSERT_FLUSH fix for
insertion paths, right?  Or is blk_insert_cloned_request() supposed to be
used only by request-based dm which lives under the elevator?  If so,
it would be great to make that explicit in the comment.  Maybe just
renaming it to blk_insert_dm_cloned_request() would be better as it
wouldn't be safe for other cases anyway.

> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 6395692..4fe753f 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -168,7 +168,7 @@ enum rq_flag_bits {
>  #define REQ_COMMON_MASK \
>  	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_DISCARD | \
>  	 REQ_NOIDLE | REQ_FLUSH | REQ_FUA | REQ_SECURE)
> -#define REQ_CLONE_MASK		REQ_COMMON_MASK
> +#define REQ_CLONE_MASK		(REQ_COMMON_MASK | REQ_FLUSH_SEQ)

Given the weirdness, I think it deserves fat comment on why
REQ_FLUSH_SEQ is necessary.

Thanks.

-- 
tejun


* Re: [patch] block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 15:38 ` Tejun Heo
@ 2011-08-09 15:53   ` Jeff Moyer
  2011-08-09 16:13     ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Moyer @ 2011-08-09 15:53 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Jens Axboe, Mike Snitzer, Vivek Goyal

Tejun Heo <tj@kernel.org> writes:

> Hello,
>
> On Tue, Aug 09, 2011 at 11:05:34AM -0400, Jeff Moyer wrote:
>> @@ -1708,6 +1710,21 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
>>  	    should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
>>  		return -EIO;
>>  
>> +	/*
>> +	 * Check the cmd_flags against the flush flags of the underlying
>> +	 * request_queue and resolve any differences.
>> +	 */
>> +	if (rq->cmd_flags & (REQ_FLUSH|REQ_FUA)) {
>> +		if (!(q->flush_flags & REQ_FLUSH))
>> +			rq->cmd_flags &= ~REQ_FLUSH;
>> +		if (!(q->flush_flags & REQ_FUA))
>> +			rq->cmd_flags &= ~REQ_FUA;
>> +		if (!(rq->cmd_flags & REQ_FLUSH) && !blk_rq_sectors(rq)) {
>> +			blk_end_bidi_request(rq, 0, 0, 0);
>> +			return 0;
>> +		}
>> +	}
>> +
>
> I'm a bit confused.  We still need ELEVATOR_INSERT_FLUSH fix for
> insertion paths, right?  Or is blk_insert_cloned_request() supposed to be
> used only by request-based dm which lives under the elevator?  If so,
> it would be great to make that explicit in the comment.  Maybe just
> renaming it to blk_insert_dm_cloned_request() would be better as it
> wouldn't be safe for other cases anyway.

request-based dm is the only caller at present.  I'm not a fan of
renaming the function, but I'm more than willing to comment it.

>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 6395692..4fe753f 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -168,7 +168,7 @@ enum rq_flag_bits {
>>  #define REQ_COMMON_MASK \
>>  	(REQ_WRITE | REQ_FAILFAST_MASK | REQ_SYNC | REQ_META | REQ_DISCARD | \
>>  	 REQ_NOIDLE | REQ_FLUSH | REQ_FUA | REQ_SECURE)
>> -#define REQ_CLONE_MASK		REQ_COMMON_MASK
>> +#define REQ_CLONE_MASK		(REQ_COMMON_MASK | REQ_FLUSH_SEQ)
>
> Given the weirdness, I think it deserves fat comment on why
> REQ_FLUSH_SEQ is necessary.

OK, the next version will have better comments.  ;-)  To answer the
question, without REQ_FLUSH_SEQ, empty flush requests would simply be
completed in blk_insert_cloned_request.  The completion path would then
try to do I/O accounting on the flush request (since REQ_FLUSH_SEQ was
stripped) and would end up dereferencing a null pointer (see below).

Cheers,
Jeff

BUG: unable to handle kernel NULL pointer dereference at 00000000000002b0
IP: [<ffffffff81222871>] blk_account_io_done+0x71/0x170
PGD 1122d6067 PUD 11f069067 PMD 0 
Oops: 0000 [#1] SMP 
CPU 1 
Modules linked in: autofs4 sunrpc p4_clockmod freq_table speedstep_lib ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 ext3 jbd dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath dm_mod ipmi_si ipmi_msghandler hpilo hpwdt tg3 pcspkr serio_raw iTCO_wdt iTCO_vendor_support shpchp sg i3200_edac edac_core ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: mperf]

Pid: 2150, comm: mkfs.ext4 Tainted: G        W   3.0.0+ #2 HP ProLiant DL320 G5p
RIP: 0010:[<ffffffff81222871>]  [<ffffffff81222871>] blk_account_io_done+0x71/0x170
RSP: 0018:ffff88011feb3b98  EFLAGS: 00010046
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8800ca89e338 RSI: 0000000000000001 RDI: 0000000000000003
RBP: ffff88011feb3ba8 R08: ffff8800ca89e338 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000057 R12: 0000000000000001
R13: ffff88011f2d9838 R14: 0000000000000046 R15: 000000000fd00002
FS:  00007fbb9c207760(0000) GS:ffff880127c40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002b0 CR3: 000000011f0ef000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mkfs.ext4 (pid: 2150, threadinfo ffff88011feb2000, task ffff88011e736100)
Stack:
 ffff8800ca89e200 0000000000000000 ffff88011feb3bc8 ffffffff81224e3c
 ffff8800ca89e200 0000000000000000 ffff88011feb3bf8 ffffffff812252e4
 ffff8800ca89e200 ffff88011f2d9838 ffff88011f2d9838 ffffffffffffffff
Call Trace:
 [<ffffffff81224e3c>] blk_finish_request+0x4c/0xe0
 [<ffffffff812252e4>] blk_end_bidi_request+0x54/0x80
 [<ffffffff812254fe>] blk_insert_cloned_request+0xde/0xf0
 [<ffffffffa02f9f4e>] dm_dispatch_request+0x3e/0x70 [dm_mod]
 [<ffffffffa02fab40>] map_request+0xd0/0x130 [dm_mod]
 [<ffffffffa02fac64>] dm_request_fn+0xc4/0x160 [dm_mod]
 [<ffffffff8121ff0b>] __blk_run_queue+0x1b/0x20
 [<ffffffff81225d42>] __make_request+0x2c2/0x300
 [<ffffffffa02fba46>] dm_request+0x36/0x40 [dm_mod]
 [<ffffffff81224660>] generic_make_request+0x2f0/0x5e0
 [<ffffffff81108e03>] ? mempool_alloc+0x63/0x150
 [<ffffffff8112f355>] ? handle_mm_fault+0x1d5/0x350
 [<ffffffff812249d6>] submit_bio+0x86/0x110
 [<ffffffff81198b6b>] ? bio_alloc_bioset+0x5b/0xf0
 [<ffffffff81228472>] blkdev_issue_flush+0xa2/0xe0
 [<ffffffff8119a626>] blkdev_fsync+0x26/0x40
 [<ffffffff81190eab>] vfs_fsync_range+0x2b/0x30
 [<ffffffff81190ecc>] vfs_fsync+0x1c/0x20
 [<ffffffff8119110a>] do_fsync+0x3a/0x60
 [<ffffffff81191160>] sys_fsync+0x10/0x20
 [<ffffffff814ed942>] system_call_fastpath+0x16/0x1b
Code: 00 00 48 8b 9f b8 00 00 00 41 83 e4 01 48 8b 0d a6 e7 9e 00 65 8b 04 25 70 d3 00 00 48 63 f0 48 2b 8f c0 00 00 00 49 8d 7c 24 02 
 8b 93 b0 02 00 00 48 03 14 f5 60 d9 bf 81 48 83 04 fa 01 44 
RIP  [<ffffffff81222871>] blk_account_io_done+0x71/0x170
 RSP <ffff88011feb3b98>
CR2: 00000000000002b0
---[ end trace f8f7146ad3f363dc ]---
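
The failure chain can be modeled in miniature; the types, names, and
flag bit below are hypothetical stand-ins, not the real
blk_account_io_done() internals:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bit, not the real kernel value. */
#define REQ_FLUSH_SEQ (1u << 2)

struct toy_part { unsigned long ios; };		/* stand-in for hd_struct */
struct toy_request {
	unsigned cmd_flags;
	struct toy_part *part;	/* NULL for block-internal flush requests */
};

/*
 * Toy model of the completion-side accounting check: requests carrying
 * REQ_FLUSH_SEQ are skipped; anything else gets its partition counters
 * bumped -- the NULL dereference in the oops above once the flag has
 * been stripped from a cloned empty flush.
 */
static bool toy_account_io_done(struct toy_request *rq)
{
	if (rq->cmd_flags & REQ_FLUSH_SEQ)
		return false;		/* flush-seq requests not accounted */
	rq->part->ios++;		/* oopses here if part == NULL */
	return true;
}
```

Preserving REQ_FLUSH_SEQ through the clone keeps such requests on the
"skip accounting" path.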


* Re: [patch] block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 15:53   ` Jeff Moyer
@ 2011-08-09 16:13     ` Tejun Heo
  2011-08-09 16:19       ` Tejun Heo
  2011-08-09 17:43       ` Mike Snitzer
  0 siblings, 2 replies; 13+ messages in thread
From: Tejun Heo @ 2011-08-09 16:13 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, Jens Axboe, Mike Snitzer, Vivek Goyal

Hello,

On Tue, Aug 09, 2011 at 11:53:51AM -0400, Jeff Moyer wrote:
> Tejun Heo <tj@kernel.org> writes:
> > I'm a bit confused.  We still need ELEVATOR_INSERT_FLUSH fix for
> > insertion paths, right?  Or is blk_insert_cloned_request() supposed to be
> > used only by request-based dm which lives under the elevator?  If so,
> > it would be great to make that explicit in the comment.  Maybe just
> > renaming it to blk_insert_dm_cloned_request() would be better as it
> > wouldn't be safe for other cases anyway.
> 
> request-based dm is the only caller at present.  I'm not a fan of
> renaming the function, but I'm more than willing to comment it.

I'm still confused and don't think the patch is correct (you can't
turn off REQ_FUA without decomposing it to data + post flush).

Going through flush machinery twice is okay and I think is the right
thing to do.  At the upper queue, the request is decomposed to member
requests.  After decomposition, it's either REQ_FLUSH w/o data or data
request w/ or w/o REQ_FUA.  When the decomposed request reaches lower
queue, the lower queue will then either short-circuit it, execute
as-is or decompose data w/ REQ_FUA into data + REQ_FLUSH sequence.
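
A toy model of that two-level resolution (illustrative flag bits and a
hypothetical resolve() helper, not the actual blk-flush.c code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the real ones are in blk_types.h. */
#define REQ_FLUSH (1u << 0)
#define REQ_FUA   (1u << 1)

struct decomposed {
	bool preflush;		/* flush issued before the data */
	bool postflush;		/* flush after the data (FUA emulation) */
	unsigned data_flags;	/* flags left on the data request itself */
};

/*
 * Each queue resolves a request against its own capability (q_flags):
 * short-circuit what the device doesn't need, pass FUA through when
 * supported, or emulate it as data + post-flush.
 */
static struct decomposed resolve(unsigned q_flags, unsigned rq_flags,
				 bool has_data)
{
	struct decomposed d = { false, false, 0 };
	bool q_flush = q_flags & REQ_FLUSH;
	bool q_fua   = q_flags & REQ_FUA;

	if (q_flush && (rq_flags & REQ_FLUSH))
		d.preflush = true;		/* write-back cache present */
	if (has_data && (rq_flags & REQ_FUA)) {
		if (q_fua)
			d.data_flags |= REQ_FUA; /* device honors FUA */
		else if (q_flush)
			d.postflush = true;	/* emulate FUA: data+flush */
		/* no flush support at all: FUA is meaningless, drop it */
	}
	return d;
}
```

With q_flags == 0 everything short-circuits at the lower queue, which is
why going through the machinery twice is safe.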

AFAICS, the breakages are...

* ELEVATOR_INSERT_FLUSH not used properly from insert paths.

* Short circuit not kicking in for the dm requests. (the above and the
  policy patch should solve this, right?)

* BUG(!rq->bio || ...) in blk_insert_flush().  I think we can lift
  this restriction for empty REQ_FLUSH but also dm can just send down
  requests with empty bio.

Thanks.

-- 
tejun


* Re: [patch] block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 16:13     ` Tejun Heo
@ 2011-08-09 16:19       ` Tejun Heo
  2011-08-09 17:43       ` Mike Snitzer
  1 sibling, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2011-08-09 16:19 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, Jens Axboe, Mike Snitzer, Vivek Goyal

Hello,

On Tue, Aug 09, 2011 at 06:13:34PM +0200, Tejun Heo wrote:
> AFAICS, the breakages are...
> 
> * ELEVATOR_INSERT_FLUSH not used properly from insert paths.
> 
> * Short circuit not kicking in for the dm requests. (the above and the
>   policy patch should solve this, right?)

Ooh... right, and we also need to add empty seq detection in
blk_insert_flush() as it would be bypassing the one in
__generic_make_request(), right?

Thanks.

-- 
tejun


* Re: block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 16:13     ` Tejun Heo
  2011-08-09 16:19       ` Tejun Heo
@ 2011-08-09 17:43       ` Mike Snitzer
  2011-08-09 17:51         ` Tejun Heo
  2011-08-09 17:52         ` Vivek Goyal
  1 sibling, 2 replies; 13+ messages in thread
From: Mike Snitzer @ 2011-08-09 17:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Moyer, linux-kernel, Jens Axboe, Vivek Goyal, dm-devel

On Tue, Aug 09 2011 at 12:13pm -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Tue, Aug 09, 2011 at 11:53:51AM -0400, Jeff Moyer wrote:
> > Tejun Heo <tj@kernel.org> writes:
> > > I'm a bit confused.  We still need ELEVATOR_INSERT_FLUSH fix for
> > > insertion paths, right?  Or is blk_insert_cloned_request() supposed to be
> > > used only by request-based dm which lives under the elevator?  If so,
> > > it would be great to make that explicit in the comment.  Maybe just
> > > renaming it to blk_insert_dm_cloned_request() would be better as it
> > > wouldn't be safe for other cases anyway.
> > 
> > request-based dm is the only caller at present.  I'm not a fan of
> > renaming the function, but I'm more than willing to comment it.
> 
> I'm still confused and don't think the patch is correct (you can't
> turn off REQ_FUA without decomposing it to data + post flush).
> 
> Going through flush machinery twice is okay and I think is the right
> thing to do.  At the upper queue, the request is decomposed to member
> requests.  After decomposition, it's either REQ_FLUSH w/o data or data
> request w/ or w/o REQ_FUA.  When the decomposed request reaches lower
> queue, the lower queue will then either short-circuit it, execute
> as-is or decompose data w/ REQ_FUA into data + REQ_FLUSH sequence.
> 
> AFAICS, the breakages are...
> 
> * ELEVATOR_INSERT_FLUSH not used properly from insert paths.
> 
> * Short circuit not kicking in for the dm requests. (the above and the
>   policy patch should solve this, right?)
> 
> * BUG(!rq->bio || ...) in blk_insert_flush().  I think we can lift
>   this restriction for empty REQ_FLUSH but also dm can just send down
>   requests with empty bio.

[cc'ing dm-devel]

All of these issues have come to light because DM was not setting
flush_flags based on the underlying device(s).  Now fixed in v3.1-rc1:
ed8b752 dm table: set flush capability based on underlying devices

Given that commit, and that request-based DM is beneath the elevator, it
seems any additional effort to have DM flushes re-enter the flush
machinery is unnecessary.

We expect:
1) flushes to have gone through the flush machinery
2) no FLUSH/FUA should be entering underlying queues if not supported

I think it best to just document the expectation that any FLUSH/FUA
request that enters blk_insert_cloned_request() will already match the
queue that the request is being sent to.  One way to document it is to
change Jeff's flag stripping into pure BUG_ON()s, e.g.:

---
 block/blk-core.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b627558..201bb27 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1710,6 +1710,14 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
 	    should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
 		return -EIO;
 
+	/*
+	 * All FLUSH/FUA requests are expected to have gone through the
+	 * flush machinery.  If a request's cmd_flags doesn't match the
+	 * flush_flags of the underlying request_queue it is a bug.
+	 */
+	BUG_ON((rq->cmd_flags & REQ_FLUSH) && !(q->flush_flags & REQ_FLUSH));
+	BUG_ON((rq->cmd_flags & REQ_FUA) && !(q->flush_flags & REQ_FUA));
+
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	/*


* Re: block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 17:43       ` Mike Snitzer
@ 2011-08-09 17:51         ` Tejun Heo
  2011-08-09 18:33           ` Mike Snitzer
  2011-08-09 17:52         ` Vivek Goyal
  1 sibling, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-08-09 17:51 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jeff Moyer, linux-kernel, Jens Axboe, Vivek Goyal, dm-devel

Hello,

On Tue, Aug 09, 2011 at 01:43:47PM -0400, Mike Snitzer wrote:
> [cc'ing dm-devel]
> 
> All of these issues have come to light because DM was not setting
> flush_flags based on the underlying device(s).  Now fixed in v3.1-rc1:
> ed8b752 dm table: set flush capability based on underlying devices
> 
> Given that commit, and that request-based DM is beneath the elevator, it
> seems any additional effort to have DM flushes re-enter the flush
> machinery is unnecessary.

Hmmm... what if the underlying devices have different feature sets?  Use
the lowest common denominator?  The flush mechanism is designed to
deal with stacking by going through flush at each level.  Stacking
queue can simply export support for both REQ_FLUSH and FUA and the
lower-level queue will decompose them according to the capability
supported by the actual device.

IIUC, what's broken here is some insert functions aren't using
ELEVATOR_INSERT_FLUSH when needed and there are some issues due to the
special nature of the stacked requests which I think should be fixed.

Thanks.

-- 
tejun


* Re: block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 17:43       ` Mike Snitzer
  2011-08-09 17:51         ` Tejun Heo
@ 2011-08-09 17:52         ` Vivek Goyal
  2011-08-09 17:55           ` Tejun Heo
  2011-08-09 18:55           ` Mike Snitzer
  1 sibling, 2 replies; 13+ messages in thread
From: Vivek Goyal @ 2011-08-09 17:52 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Tejun Heo, Jeff Moyer, linux-kernel, Jens Axboe, dm-devel

On Tue, Aug 09, 2011 at 01:43:47PM -0400, Mike Snitzer wrote:
> On Tue, Aug 09 2011 at 12:13pm -0400,
> Tejun Heo <tj@kernel.org> wrote:
> 
> > Hello,
> > 
> > On Tue, Aug 09, 2011 at 11:53:51AM -0400, Jeff Moyer wrote:
> > > Tejun Heo <tj@kernel.org> writes:
> > > > I'm a bit confused.  We still need ELEVATOR_INSERT_FLUSH fix for
> > > > insertion paths, right?  Or is blk_insert_cloned_request() supposed to be
> > > > used only by request-based dm which lives under the elevator?  If so,
> > > > it would be great to make that explicit in the comment.  Maybe just
> > > > renaming it to blk_insert_dm_cloned_request() would be better as it
> > > > wouldn't be safe for other cases anyway.
> > > 
> > > request-based dm is the only caller at present.  I'm not a fan of
> > > renaming the function, but I'm more than willing to comment it.
> > 
> > I'm still confused and don't think the patch is correct (you can't
> > turn off REQ_FUA without decomposing it to data + post flush).
> > 
> > Going through flush machinery twice is okay and I think is the right
> > thing to do.  At the upper queue, the request is decomposed to member
> > requests.  After decomposition, it's either REQ_FLUSH w/o data or data
> > request w/ or w/o REQ_FUA.  When the decomposed request reaches lower
> > queue, the lower queue will then either short-circuit it, execute
> > as-is or decompose data w/ REQ_FUA into data + REQ_FLUSH sequence.
> > 
> > AFAICS, the breakages are...
> > 
> > * ELEVATOR_INSERT_FLUSH not used properly from insert paths.
> > 
> > * Short circuit not kicking in for the dm requests. (the above and the
> >   policy patch should solve this, right?)
> > 
> > * BUG(!rq->bio || ...) in blk_insert_flush().  I think we can lift
> >   this restriction for empty REQ_FLUSH but also dm can just send down
> >   requests with empty bio.
> 
> [cc'ing dm-devel]
> 
> All of these issues have come to light because DM was not setting
> flush_flags based on the underlying device(s).  Now fixed in v3.1-rc1:
> ed8b752 dm table: set flush capability based on underlying devices
> 
> Given that commit, and that request-based DM is beneath the elevator, it
> seems any additional effort to have DM flushes re-enter the flush
> > machinery is unnecessary.
> 
> We expect:
> > 1) flushes to have gone through the flush machinery
> 2) no FLUSH/FUA should be entering underlying queues if not supported
> 
> I think it best to just document the expectation that any FLUSH/FUA
> request that enters blk_insert_cloned_request() will already match the
> queue that the request is being sent to.  One way to document it is to
> > change Jeff's flag stripping into pure BUG_ON()s, e.g.:
> 
> ---
>  block/blk-core.c |    8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b627558..201bb27 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1710,6 +1710,14 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
>  	    should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
>  		return -EIO;
>  
> +	/*
> +	 * All FLUSH/FUA requests are expected to have gone through the
> +	 * flush machinary.  If a request's cmd_flags doesn't match the
> +	 * flush_flags of the underlying request_queue it is a bug.
> +	 */
> +	BUG_ON((rq->cmd_flags & REQ_FLUSH) && !(q->flush_flags & REQ_FLUSH));
> +	BUG_ON((rq->cmd_flags & REQ_FUA) && !(q->flush_flags & REQ_FUA));
> +

Actually this makes sense and is simple. :-)  Is BUG_ON() too harsh?  How
about the WARN_ONCE() variants?  To me the system continues to work, so a
warning is probably good enough.

Thanks
Vivek


* Re: block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 17:52         ` Vivek Goyal
@ 2011-08-09 17:55           ` Tejun Heo
  2011-08-09 18:55           ` Mike Snitzer
  1 sibling, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2011-08-09 17:55 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Mike Snitzer, Jeff Moyer, linux-kernel, Jens Axboe, dm-devel

On Tue, Aug 09, 2011 at 01:52:37PM -0400, Vivek Goyal wrote:
> > +	/*
> > +	 * All FLUSH/FUA requests are expected to have gone through the
> > +	 * flush machinery.  If a request's cmd_flags doesn't match the
> > +	 * flush_flags of the underlying request_queue it is a bug.
> > +	 */
> > +	BUG_ON((rq->cmd_flags & REQ_FLUSH) && !(q->flush_flags & REQ_FLUSH));
> > +	BUG_ON((rq->cmd_flags & REQ_FUA) && !(q->flush_flags & REQ_FUA));
> > +
> 
> Actually this makes sense and is simple. :-)  Is BUG_ON() too harsh?  How
> about the WARN_ONCE() variants?  To me the system continues to work, so a
> warning is probably good enough.

I don't really think this is the right thing to do.  It makes the
cloning a special case.  It looks easy now but puts it in its own
special box which usually ends up ugly in the long run.  Let's get the
thing fixed and let it play like others do.  There's nothing
fundamentally broken with the usual mechanism.

Thanks.

-- 
tejun


* Re: block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 17:51         ` Tejun Heo
@ 2011-08-09 18:33           ` Mike Snitzer
  0 siblings, 0 replies; 13+ messages in thread
From: Mike Snitzer @ 2011-08-09 18:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jeff Moyer, linux-kernel, Jens Axboe, Vivek Goyal, dm-devel

On Tue, Aug 09 2011 at  1:51pm -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On Tue, Aug 09, 2011 at 01:43:47PM -0400, Mike Snitzer wrote:
> > [cc'ing dm-devel]
> > 
> > All of these issues have come to light because DM was not setting
> > flush_flags based on the underlying device(s).  Now fixed in v3.1-rc1:
> > ed8b752 dm table: set flush capability based on underlying devices
> > 
> > Given that commit, and that request-based DM is beneath the elevator, it
> > seems any additional effort to have DM flushes re-enter the flush
> > machinery is unnecessary.
> 
> Hmmm... what if the underlying devices have different feature sets?  Use
> the lowest common denominator?

Yes, for DM, if any device in the stack requires FLUSH/FUA then
the upper request_queue's flush_flags will be set to reflect that.

Bio-based DM _could_ end up issuing a flush to a device that doesn't
need the flush.  But once the bio emerges from the bottom of the
bio-based stack it'll get handed to the flush mechanism where it'll be
dropped.

Bio-based DM may stack ontop of request-based DM (think LVM ontop of
mpath device).  Request-based DM may _not_ stack on bio-based DM.

Request-based DM, the cause of all this discussion, is only used for the
multipath target and I'm not aware of any plans to provide more complex
stacking via request-based DM.  Doing so would require splitting a
request that spans target boundaries, etc.

So we're left with request-based DM only allowing a single target,
meaning: even if request-based DM were stacked a couple times the
flush_flags would still reflect the underlying physical device's queue.
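
The point about stacked flush_flags can be sketched as a simple union
across member devices (a hypothetical helper mirroring, roughly, what
commit ed8b752 does for DM tables; flag values are illustrative):

```c
#include <assert.h>

/* Illustrative flag bits; the real ones are in blk_types.h. */
#define REQ_FLUSH (1u << 0)
#define REQ_FUA   (1u << 1)

/*
 * A stacked queue advertises a flush capability if ANY underlying
 * device needs it, so flushes are never silently dropped on the way
 * down; each lower queue then short-circuits what it doesn't need.
 */
static unsigned stacked_flush_flags(const unsigned *member_flags, int n)
{
	unsigned flags = 0;
	int i;

	for (i = 0; i < n; i++)
		flags |= member_flags[i] & (REQ_FLUSH | REQ_FUA);
	return flags;
}
```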

> The flush mechanism is designed to
> deal with stacking by going through flush at each level.  Stacking
> queue can simply export support for both REQ_FLUSH and FUA and the
> lower-level queue will decompose them according to the capability
> supported by the actual device.
> 
> IIUC, what's broken here is some insert functions aren't using
> ELEVATOR_INSERT_FLUSH when needed and there are some issues due to the
> special nature of the stacked requests which I think should be fixed.

OK, conceptually pure, just seems the fixes are multiplying.


* Re: block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 17:52         ` Vivek Goyal
  2011-08-09 17:55           ` Tejun Heo
@ 2011-08-09 18:55           ` Mike Snitzer
  2011-08-09 19:05             ` Vivek Goyal
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2011-08-09 18:55 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, Jeff Moyer, linux-kernel, Jens Axboe, dm-devel

On Tue, Aug 09 2011 at  1:52pm -0400,
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Tue, Aug 09, 2011 at 01:43:47PM -0400, Mike Snitzer wrote:
> > On Tue, Aug 09 2011 at 12:13pm -0400,
> > Tejun Heo <tj@kernel.org> wrote:
> > 
> > > Hello,
> > > 
> > > On Tue, Aug 09, 2011 at 11:53:51AM -0400, Jeff Moyer wrote:
> > > > Tejun Heo <tj@kernel.org> writes:
> > > > > I'm a bit confused.  We still need ELEVATOR_INSERT_FLUSH fix for
> > > > > insertion paths, right?  Or is blk_insert_cloned_request() supposed to be
> > > > > used only by request-based dm which lives under the elevator?  If so,
> > > > > it would be great to make that explicit in the comment.  Maybe just
> > > > > renaming it to blk_insert_dm_cloned_request() would be better as it
> > > > > wouldn't be safe for other cases anyway.
> > > > 
> > > > request-based dm is the only caller at present.  I'm not a fan of
> > > > renaming the function, but I'm more than willing to comment it.
> > > 
> > > I'm still confused and don't think the patch is correct (you can't
> > > turn off REQ_FUA without decomposing it to data + post flush).
> > > 
> > > Going through flush machinery twice is okay and I think is the right
> > > thing to do.  At the upper queue, the request is decomposed to member
> > > requests.  After decomposition, it's either REQ_FLUSH w/o data or data
> > > request w/ or w/o REQ_FUA.  When the decomposed request reaches lower
> > > queue, the lower queue will then either short-circuit it, execute
> > > as-is or decompose data w/ REQ_FUA into data + REQ_FLUSH sequence.
> > > 
> > > AFAICS, the breakages are...
> > > 
> > > * ELEVATOR_INSERT_FLUSH not used properly from insert paths.
> > > 
> > > * Short circuit not kicking in for the dm requests. (the above and the
> > >   policy patch should solve this, right?)
> > > 
> > > * BUG(!rq->bio || ...) in blk_insert_flush().  I think we can lift
> > >   this restriction for empty REQ_FLUSH but also dm can just send down
> > >   requests with empty bio.
> > 
> > [cc'ing dm-devel]
> > 
> > All of these issues have come to light because DM was not setting
> > flush_flags based on the underlying device(s).  Now fixed in v3.1-rc1:
> > ed8b752 dm table: set flush capability based on underlying devices
> > 
> > Given that commit, and that request-based DM is beneath the elevator, it
> > seems any additional effort to have DM flushes re-enter the flush
> > machinery is unnecessary.
> > 
> > We expect:
> > 1) flushes to have gone through the flush machinery
> > 2) no FLUSH/FUA should be entering underlying queues if not supported
> > 
> > I think it best to just document the expectation that any FLUSH/FUA
> > request that enters blk_insert_cloned_request() will already match the
> > queue that the request is being sent to.  One way to document it is to
> > change Jeff's flag stripping into pure BUG_ON()s, e.g.:
> > 
> > ---
> >  block/blk-core.c |    8 ++++++++
> >  1 files changed, 8 insertions(+), 0 deletions(-)
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index b627558..201bb27 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1710,6 +1710,14 @@ int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
> >  	    should_fail_request(&rq->rq_disk->part0, blk_rq_bytes(rq)))
> >  		return -EIO;
> >  
> > +	/*
> > +	 * All FLUSH/FUA requests are expected to have gone through the
> > +	 * flush machinery.  If a request's cmd_flags doesn't match the
> > +	 * flush_flags of the underlying request_queue it is a bug.
> > +	 */
> > +	BUG_ON((rq->cmd_flags & REQ_FLUSH) && !(q->flush_flags & REQ_FLUSH));
> > +	BUG_ON((rq->cmd_flags & REQ_FUA) && !(q->flush_flags & REQ_FUA));
> > +
> 
> > Actually this makes sense and is simple. :-) Is BUG_ON() too harsh?
> > How about the WARN_ONCE() variants? To me the system continues to
> > work, so a warning is probably good enough.

Sure, WARN_ONCE() is fine by me.

Seems Tejun wants a more involved fix though.
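
For illustration, the mismatch condition behind the proposed BUG_ON()s (or
their WARN_ONCE() equivalents) can be modeled in user space.  The flag
values below are placeholders, not the kernel's real definitions:

```c
/* Placeholder flag values; the kernel's live in include/linux/blk_types.h. */
#define REQ_FLUSH (1u << 0)
#define REQ_FUA   (1u << 1)

/*
 * Returns the set of FLUSH/FUA flags that the request carries but the
 * target queue does not advertise -- nonzero is exactly the condition
 * that would fire the proposed BUG_ON()/WARN_ONCE().
 */
static unsigned int flush_flag_mismatch(unsigned int cmd_flags,
					unsigned int queue_flush_flags)
{
	return (cmd_flags & (REQ_FLUSH | REQ_FUA)) & ~queue_flush_flags;
}
```

With dm now setting flush_flags from the underlying devices (commit
ed8b752), this should stay zero for cloned requests.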

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 18:55           ` Mike Snitzer
@ 2011-08-09 19:05             ` Vivek Goyal
  2011-08-10 15:49               ` Jeff Moyer
  0 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2011-08-09 19:05 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Tejun Heo, Jeff Moyer, linux-kernel, Jens Axboe, dm-devel

On Tue, Aug 09, 2011 at 02:55:31PM -0400, Mike Snitzer wrote:

[..]
> > > +	/*
> > > +	 * All FLUSH/FUA requests are expected to have gone through the
> > > +	 * flush machinery.  If a request's cmd_flags doesn't match the
> > > +	 * flush_flags of the underlying request_queue it is a bug.
> > > +	 */
> > > +	BUG_ON((rq->cmd_flags & REQ_FLUSH) && !(q->flush_flags & REQ_FLUSH));
> > > +	BUG_ON((rq->cmd_flags & REQ_FUA) && !(q->flush_flags & REQ_FUA));
> > > +
> > 
> > Actually this makes sense and is simple. :-) Is BUG_ON() too harsh?
> > How about the WARN_ONCE() variants? To me the system continues to
> > work, so a warning is probably good enough.
> 
> Sure, WARN_ONCE() is fine by me.
> 
> Seems Tejun wants a more involved fix though.

Fixing it properly doesn't hurt; it makes things more future-proof. In fact
I am wondering what happens with the blk_execute_rq() variants, where one
can prepare a request and send it down. What if the caller sets FLUSH/FUA
flags there?

Thanks
Vivek
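
As an aside, the decomposition Tejun described earlier (a data request
carrying REQ_FUA becomes data plus a post-flush on a queue without FUA
support) can be sketched in user space roughly as follows.  Flag values
are placeholders, not the kernel's:

```c
/* Placeholder flag values; the kernel's live in include/linux/blk_types.h. */
#define REQ_FLUSH (1u << 0)
#define REQ_FUA   (1u << 1)

struct flush_policy {
	int do_preflush;	/* emit a flush before the data */
	int do_postflush;	/* emit a flush after the data (FUA fallback) */
};

/*
 * Mirrors the blk_do_flush() decisions quoted at the top of the thread:
 * fflags is the queue's advertised flush capability, cmd_flags the
 * request's flags.
 */
static struct flush_policy decompose(unsigned int fflags,
				     unsigned int cmd_flags)
{
	unsigned int has_flush = fflags & REQ_FLUSH;
	unsigned int has_fua = fflags & REQ_FUA;
	struct flush_policy p;

	p.do_preflush = has_flush && (cmd_flags & REQ_FLUSH);
	p.do_postflush = has_flush && !has_fua && (cmd_flags & REQ_FUA);
	return p;
}
```

A queue with flush but no FUA turns REQ_FUA into a post-flush; a queue
advertising neither flag short-circuits both.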


* Re: block: properly handle flush/fua requests in blk_insert_cloned_request
  2011-08-09 19:05             ` Vivek Goyal
@ 2011-08-10 15:49               ` Jeff Moyer
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Moyer @ 2011-08-10 15:49 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Mike Snitzer, Tejun Heo, linux-kernel, Jens Axboe, dm-devel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Tue, Aug 09, 2011 at 02:55:31PM -0400, Mike Snitzer wrote:
>
> [..]
>> > > +	/*
>> > > +	 * All FLUSH/FUA requests are expected to have gone through the
>> > > +	 * flush machinery.  If a request's cmd_flags doesn't match the
>> > > +	 * flush_flags of the underlying request_queue it is a bug.
>> > > +	 */
>> > > +	BUG_ON((rq->cmd_flags & REQ_FLUSH) && !(q->flush_flags & REQ_FLUSH));
>> > > +	BUG_ON((rq->cmd_flags & REQ_FUA) && !(q->flush_flags & REQ_FUA));
>> > > +
>> > 
>> > Actually this makes sense and is simple. :-) Is BUG_ON() too harsh?
>> > How about the WARN_ONCE() variants? To me the system continues to
>> > work, so a warning is probably good enough.
>> 
>> Sure, WARN_ONCE() is fine by me.
>> 
>> Seems Tejun wants a more involved fix though.
>
> Fixing it properly doesn't hurt; it makes things more future-proof. In fact
> I am wondering what happens with the blk_execute_rq() variants, where one
> can prepare a request and send it down. What if the caller sets FLUSH/FUA
> flags there?

Callers of blk_execute_rq are special.  Those aren't REQ_TYPE_FS
requests, and so the callers are responsible for doing their own
sequencing.

Cheers,
Jeff


end of thread, other threads:[~2011-08-10 15:50 UTC | newest]

Thread overview: 13+ messages
2011-08-09 15:05 [patch] block: properly handle flush/fua requests in blk_insert_cloned_request Jeff Moyer
2011-08-09 15:38 ` Tejun Heo
2011-08-09 15:53   ` Jeff Moyer
2011-08-09 16:13     ` Tejun Heo
2011-08-09 16:19       ` Tejun Heo
2011-08-09 17:43       ` Mike Snitzer
2011-08-09 17:51         ` Tejun Heo
2011-08-09 18:33           ` Mike Snitzer
2011-08-09 17:52         ` Vivek Goyal
2011-08-09 17:55           ` Tejun Heo
2011-08-09 18:55           ` Mike Snitzer
2011-08-09 19:05             ` Vivek Goyal
2011-08-10 15:49               ` Jeff Moyer
