* Re: [RFC][PATCH 1/3] Add flag and sysfs interfaces
2008-04-23 19:08 [RFC][PATCH 0/3] Skip I/O merges when disabled Alan D. Brunelle
@ 2008-04-23 19:12 ` Alan D. Brunelle
2008-04-23 19:14 ` [RFC][PATCH 2/3] Have __make_request skip merges when disabled Alan D. Brunelle
` (4 subsequent siblings)
5 siblings, 0 replies; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-23 19:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 1 bytes --]
[-- Attachment #2: 0001-Added-QUEUE_FLAG_NOMERGES-and-sysfs-interfaces.patch --]
[-- Type: text/x-patch, Size: 2656 bytes --]
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
---
block/blk-sysfs.c | 27 +++++++++++++++++++++++++++
include/linux/blkdev.h | 2 ++
2 files changed, 29 insertions(+), 0 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 54d0db1..32d917c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -135,6 +135,26 @@ static ssize_t queue_max_hw_sectors_show(struct request_queue *q, char *page)
return queue_var_show(max_hw_sectors_kb, (page));
}
+static ssize_t queue_nomerges_show(struct request_queue *q, char *page)
+{
+ int nm = test_bit(QUEUE_FLAG_NOMERGES, &q->queue_flags);
+ return queue_var_show(nm, page);
+}
+
+static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nm;
+ ssize_t ret = queue_var_store(&nm, page, count);
+
+ if (nm)
+ set_bit(QUEUE_FLAG_NOMERGES, &q->queue_flags);
+ else
+ clear_bit(QUEUE_FLAG_NOMERGES, &q->queue_flags);
+
+ return ret;
+}
+
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -170,6 +190,12 @@ static struct queue_sysfs_entry queue_hw_sector_size_entry = {
.show = queue_hw_sector_size_show,
};
+static struct queue_sysfs_entry queue_nomerges_entry = {
+ .attr = {.name = "nomerges", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_nomerges_show,
+ .store = queue_nomerges_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -177,6 +203,7 @@ static struct attribute *default_attrs[] = {
&queue_max_sectors_entry.attr,
&queue_iosched_entry.attr,
&queue_hw_sector_size_entry.attr,
+ &queue_nomerges_entry.attr,
NULL,
};
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6f79d40..ee577f9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -405,6 +405,7 @@ struct request_queue
#define QUEUE_FLAG_PLUGGED 7 /* queue is plugged */
#define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */
#define QUEUE_FLAG_BIDI 9 /* queue supports bidi requests */
+#define QUEUE_FLAG_NOMERGES 10 /* disable merge attempts */
enum {
/*
@@ -449,6 +450,7 @@ enum {
#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags)
#define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags)
#define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags)
+#define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
#define blk_queue_flushing(q) ((q)->ordseq)
#define blk_fs_request(rq) ((rq)->cmd_type == REQ_TYPE_FS)
--
1.5.2.5
* [RFC][PATCH 2/3] Have __make_request skip merges when disabled
2008-04-23 19:08 [RFC][PATCH 0/3] Skip I/O merges when disabled Alan D. Brunelle
2008-04-23 19:12 ` [RFC][PATCH 1/3] Add flag and sysfs interfaces Alan D. Brunelle
@ 2008-04-23 19:14 ` Alan D. Brunelle
2008-04-23 19:15 ` [RFC][PATCH 3/3] Do not use rqhash when merges disabled Alan D. Brunelle
` (3 subsequent siblings)
5 siblings, 0 replies; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-23 19:14 UTC (permalink / raw)
To: linux-kernel; +Cc: Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 0 bytes --]
[-- Attachment #2: 0002-Have-__make_request-skip-merges-when-disabled.patch --]
[-- Type: text/x-patch, Size: 577 bytes --]
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
---
block/blk-core.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 2a438a9..54a2d8b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1115,7 +1115,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
spin_lock_irq(q->queue_lock);
- if (unlikely(barrier) || elv_queue_empty(q))
+ if (blk_queue_nomerges(q) || unlikely(barrier) || elv_queue_empty(q))
goto get_rq;
el_ret = elv_merge(q, &req, bio);
--
1.5.2.5
* [RFC][PATCH 3/3] Do not use rqhash when merges disabled
2008-04-23 19:08 [RFC][PATCH 0/3] Skip I/O merges when disabled Alan D. Brunelle
2008-04-23 19:12 ` [RFC][PATCH 1/3] Add flag and sysfs interfaces Alan D. Brunelle
2008-04-23 19:14 ` [RFC][PATCH 2/3] Have __make_request skip merges when disabled Alan D. Brunelle
@ 2008-04-23 19:15 ` Alan D. Brunelle
2008-04-24 0:37 ` Aaron Carroll
2008-04-24 7:09 ` [RFC][PATCH 0/3] Skip I/O merges when disabled Jens Axboe
` (2 subsequent siblings)
5 siblings, 1 reply; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-23 19:15 UTC (permalink / raw)
To: linux-kernel; +Cc: Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 1 bytes --]
[-- Attachment #2: 0003-Do-not-use-rqhash-when-merges-disabled.patch --]
[-- Type: text/x-patch, Size: 1079 bytes --]
From b5da8dfdf106ceddc35634f43a02488616a55e7d Mon Sep 17 00:00:00 2001
From: Alan D. Brunelle <alan.brunelle@hp.com>
Date: Wed, 23 Apr 2008 08:50:22 -0400
Subject: [PATCH] Do not use rqhash when merges disabled
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
---
block/elevator.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 88318c3..2ea5fb4 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -313,7 +313,7 @@ static inline void __elv_rqhash_del(struct request *rq)
static void elv_rqhash_del(struct request_queue *q, struct request *rq)
{
- if (ELV_ON_HASH(rq))
+ if (!blk_queue_nomerges(q) && ELV_ON_HASH(rq))
__elv_rqhash_del(rq);
}
@@ -601,7 +601,7 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
BUG_ON(!blk_fs_request(rq));
rq->cmd_flags |= REQ_SORTED;
q->nr_sorted++;
- if (rq_mergeable(rq)) {
+ if (!blk_queue_nomerges(q) && rq_mergeable(rq)) {
elv_rqhash_add(q, rq);
if (!q->last_merge)
q->last_merge = rq;
--
1.5.2.5
* Re: [RFC][PATCH 3/3] Do not use rqhash when merges disabled
2008-04-23 19:15 ` [RFC][PATCH 3/3] Do not use rqhash when merges disabled Alan D. Brunelle
@ 2008-04-24 0:37 ` Aaron Carroll
2008-04-24 0:59 ` Alan D. Brunelle
0 siblings, 1 reply; 30+ messages in thread
From: Aaron Carroll @ 2008-04-24 0:37 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel, Jens Axboe
Hi Alan,
Alan D. Brunelle wrote:
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -313,7 +313,7 @@ static inline void __elv_rqhash_del(struct request *rq)
>
> static void elv_rqhash_del(struct request_queue *q, struct request *rq)
> {
> - if (ELV_ON_HASH(rq))
> + if (!blk_queue_nomerges(q) && ELV_ON_HASH(rq))
> __elv_rqhash_del(rq);
> }
If you switch the nomerges tunable while requests are in flight, it is possible that
a request is put into the rqhash table but not removed here, leading to the BUG_ON
in elv_dequeue_request() triggering. ELV_ON_HASH needs to be checked regardless of
the nomerges state.
-- Aaron
* Re: [RFC][PATCH 3/3] Do not use rqhash when merges disabled
2008-04-24 0:37 ` Aaron Carroll
@ 2008-04-24 0:59 ` Alan D. Brunelle
2008-04-24 2:07 ` Aaron Carroll
0 siblings, 1 reply; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-24 0:59 UTC (permalink / raw)
To: Aaron Carroll; +Cc: linux-kernel, Jens Axboe
Aaron Carroll wrote:
> Hi Alan,
>
> Alan D. Brunelle wrote:
>> --- a/block/elevator.c
>> +++ b/block/elevator.c
>> @@ -313,7 +313,7 @@ static inline void __elv_rqhash_del(struct request
>> *rq)
>>
>> static void elv_rqhash_del(struct request_queue *q, struct request *rq)
>> {
>> - if (ELV_ON_HASH(rq))
>> + if (!blk_queue_nomerges(q) && ELV_ON_HASH(rq))
>> __elv_rqhash_del(rq);
>> }
>
> If you switch the nomerges tunable while requests are in flight, it is
> possible that
> a request is put into the rqhash table but not removed here, leading to
> the BUG_ON
> in elv_dequeue_request() triggering. ELV_ON_HASH needs to be checked
> regardless of
> the nomerges state.
>
>
> -- Aaron
>
>
Hi Aaron -
Good catch - that was a last-minute addition; the ELV_ON_HASH check
should be sufficient without the check for blk_queue_nomerges, right?
Alan
* Re: [RFC][PATCH 3/3] Do not use rqhash when merges disabled
2008-04-24 0:59 ` Alan D. Brunelle
@ 2008-04-24 2:07 ` Aaron Carroll
0 siblings, 0 replies; 30+ messages in thread
From: Aaron Carroll @ 2008-04-24 2:07 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel, Jens Axboe
Alan D. Brunelle wrote:
> Good catch - that was a last minute addition, the ELV_ON_HASH should be
> sufficient without the check for blk_queue_nomerges, right?
Yes. With that line reverted it seems to work fine. FWIW, I'm testing
it like so:
=================================x8=======================================
#!/bin/bash
DEV=sdb
echo deadline > /sys/block/${DEV}/queue/scheduler
~/bin-ia64/fio - <<! &
[reader]
filename=/dev/${DEV}
ioengine=sync
direct=1
norandommap
randrepeat=0
zero_buffers
bs=16k
rw=randread:16
numjobs=128
time_based
runtime=600
!
while /bin/true ; do
echo 0 > /sys/block/${DEV}/queue/nomerges
sleep 0.1
echo 1 > /sys/block/${DEV}/queue/nomerges
sleep 0.1
done
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-23 19:08 [RFC][PATCH 0/3] Skip I/O merges when disabled Alan D. Brunelle
` (2 preceding siblings ...)
2008-04-23 19:15 ` [RFC][PATCH 3/3] Do not use rqhash when merges disabled Alan D. Brunelle
@ 2008-04-24 7:09 ` Jens Axboe
2008-04-24 12:09 ` Alan D. Brunelle
2008-04-24 20:38 ` Alan D. Brunelle
2008-04-24 13:29 ` Andi Kleen
2008-04-24 13:31 ` Alan D. Brunelle
5 siblings, 2 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-24 7:09 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel
On Wed, Apr 23 2008, Alan D. Brunelle wrote:
> The block I/O + elevator + I/O scheduler code spends a lot of time
> trying to merge I/Os -- rightfully so under "normal" circumstances.
> However, if one were to know that the incoming I/O stream was /very/
> random in nature, the cycles are wasted. (This can be the case, for
> example, during OLTP-type runs.)
>
> This patch stream adds a per-request_queue tunable that (when set)
> disables merge attempts, thus freeing up a non-trivial amount of CPU cycles.
>
> I'll be doing some more benchmarking, but this is a representative set
> of data on a two-way Opteron box w/ 4 SATA drives. 'fio' was used to
> generate random 4k asynchronous direct I/Os over the 128GiB of each SATA
> drive. Oprofile was used to collect the results, and we collected
> CPU_CLK_UNHALTED (CPU) and DATA_CACHE_MISSES (DCM) events. The data
> extracted below shows both the percentage for all samples (including
> non-kernel) as well as just those from the block I/O layer + elevator +
> deadline I/O scheduler + SATA modules.
>
> v2.6.25 (not patched): CPU: 5.8330% (total) 7.5644% (I/O code only)
> v2.6.25 + nomerges = 0: CPU: 5.8008% (total) 7.5806% (I/O code only)
> v2.6.25 + nomerges = 1: CPU: 4.5404% (total) 5.9416% (I/O code only)
>
> v2.6.25 (not patched): DCM: 8.1967% (total) 10.5188% (I/O code only)
> v2.6.25 + nomerges = 0: DCM: 7.2291% (total) 9.4087% (I/O code only)
> v2.6.25 + nomerges = 1: DCM: 6.1989% (total) 8.0155% (I/O code only)
>
> I've typically been seeing a good 20-25% reduction in CPU samples, and
> 10-15% in DCM samples for the random load w/ nomerges set to 1 compared
> to set to 0 (looking at just the block code).
>
> [BTW: The I/O performance doesn't change much between the 3 sets of data
> - the seek + I/O times themselves dominate things to such a large
> extent. There is a very small improvement seen w/ nomerges=1, but <<1%.]
>
> It's not clear to me why 2.6.25 (not patched) requires /more/ cycles
> than does the patched kernel w/ nomerges=0 -- it's been consistent in
> the handful of runs I've done. I'm going to do a large set of runs for
> each condition (not patched, nomerges=0 & nomerges=1) to verify that
> this holds over multiple runs. I'm also going to check out sequential
> loads to see what (if any) penalty the extra couple of checks incurs on
> those (probably not noticeable).
>
> The first patch in the series adds the tunable; The second adds in the
> check to skip the merge code; and the third adds in the check to skip
> adding requests to hash lists for merging.
The functionality is fine with me, merging is obviously a non-zero
amount of cycles spent on IO and if you know it's in vain, may as well
turn it off. One suggestion, though - if you add this as a performance
rather than functionality change, I would suggest keeping the one-hit
cache merge as that is essentially free. Better than free actually,
since if you hit that merge point you'll be spending way less cycles
than allocating+setting up a new request.
--
Jens Axboe
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 7:09 ` [RFC][PATCH 0/3] Skip I/O merges when disabled Jens Axboe
@ 2008-04-24 12:09 ` Alan D. Brunelle
2008-04-25 8:38 ` Jens Axboe
2008-04-24 20:38 ` Alan D. Brunelle
1 sibling, 1 reply; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-24 12:09 UTC (permalink / raw)
To: linux-kernel; +Cc: Jens Axboe
Jens Axboe wrote:
> On Wed, Apr 23 2008, Alan D. Brunelle wrote:
>> The block I/O + elevator + I/O scheduler code spends a lot of time
>> trying to merge I/Os -- rightfully so under "normal" circumstances.
>> However, if one were to know that the incoming I/O stream was /very/
>> random in nature, the cycles are wasted. (This can be the case, for
>> example, during OLTP-type runs.)
>>
>> This patch stream adds a per-request_queue tunable that (when set)
>> disables merge attempts, thus freeing up a non-trivial amount of CPU cycles.
>>
>> I'll be doing some more benchmarking, but this is a representative set
>> of data on a two-way Opteron box w/ 4 SATA drives. 'fio' was used to
>> generate random 4k asynchronous direct I/Os over the 128GiB of each SATA
>> drive. Oprofile was used to collect the results, and we collected
>> CPU_CLK_UNHALTED (CPU) and DATA_CACHE_MISSES (DCM) events. The data
>> extracted below shows both the percentage for all samples (including
>> non-kernel) as well as just those from the block I/O layer + elevator +
>> deadline I/O scheduler + SATA modules.
>>
>> v2.6.25 (not patched): CPU: 5.8330% (total) 7.5644% (I/O code only)
>> v2.6.25 + nomerges = 0: CPU: 5.8008% (total) 7.5806% (I/O code only)
>> v2.6.25 + nomerges = 1: CPU: 4.5404% (total) 5.9416% (I/O code only)
>>
>> v2.6.25 (not patched): DCM: 8.1967% (total) 10.5188% (I/O code only)
>> v2.6.25 + nomerges = 0: DCM: 7.2291% (total) 9.4087% (I/O code only)
>> v2.6.25 + nomerges = 1: DCM: 6.1989% (total) 8.0155% (I/O code only)
>>
>> I've typically been seeing a good 20-25% reduction in CPU samples, and
>> 10-15% in DCM samples for the random load w/ nomerges set to 1 compared
>> to set to 0 (looking at just the block code).
>>
>> [BTW: The I/O performance doesn't change much between the 3 sets of data
>> - the seek + I/O times themselves dominate things to such a large
>> extent. There is a very small improvement seen w/ nomerges=1, but <<1%.]
>>
>> It's not clear to me why 2.6.25 (not patched) requires /more/ cycles
>> than does the patched kernel w/ nomerges=0 -- it's been consistent in
>> the handful of runs I've done. I'm going to do a large set of runs for
>> each condition (not patched, nomerges=0 & nomerges=1) to verify that
>> this holds over multiple runs. I'm also going to check out sequential
>> loads to see what (if any) penalty the extra couple of checks incurs on
>> those (probably not noticeable).
>>
>> The first patch in the series adds the tunable; The second adds in the
>> check to skip the merge code; and the third adds in the check to skip
>> adding requests to hash lists for merging.
>
> The functionality is fine with me, merging is obviously a non-zero
> amount of cycles spent on IO and if you know it's in vain, may as well
> turn it off. One suggestion, though - if you add this as a performance
> rather than functionality change, I would suggest keeping the one-hit
> cache merge as that is essentially free. Better than free actually,
> since if you hit that merge point you'll be spending way less cycles
> than allocating+setting up a new request.
>
Hi Jens -
I'll look into retaining the one-hit cache merge functionality, remove
the errant elv_rqhash_del code, and repost w/ the results from the other
tests I've run.
Thanks,
Alan
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 12:09 ` Alan D. Brunelle
@ 2008-04-25 8:38 ` Jens Axboe
2008-04-25 11:17 ` Alan D. Brunelle
0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2008-04-25 8:38 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel
On Thu, Apr 24 2008, Alan D. Brunelle wrote:
> Jens Axboe wrote:
> > On Wed, Apr 23 2008, Alan D. Brunelle wrote:
> >> The block I/O + elevator + I/O scheduler code spends a lot of time
> >> trying to merge I/Os -- rightfully so under "normal" circumstances.
> >> However, if one were to know that the incoming I/O stream was /very/
> >> random in nature, the cycles are wasted. (This can be the case, for
> >> example, during OLTP-type runs.)
> >>
> >> This patch stream adds a per-request_queue tunable that (when set)
> >> disables merge attempts, thus freeing up a non-trivial amount of CPU cycles.
> >>
> >> I'll be doing some more benchmarking, but this is a representative set
> >> of data on a two-way Opteron box w/ 4 SATA drives. 'fio' was used to
> >> generate random 4k asynchronous direct I/Os over the 128GiB of each SATA
> >> drive. Oprofile was used to collect the results, and we collected
> >> CPU_CLK_UNHALTED (CPU) and DATA_CACHE_MISSES (DCM) events. The data
> >> extracted below shows both the percentage for all samples (including
> >> non-kernel) as well as just those from the block I/O layer + elevator +
> >> deadline I/O scheduler + SATA modules.
> >>
> >> v2.6.25 (not patched): CPU: 5.8330% (total) 7.5644% (I/O code only)
> >> v2.6.25 + nomerges = 0: CPU: 5.8008% (total) 7.5806% (I/O code only)
> >> v2.6.25 + nomerges = 1: CPU: 4.5404% (total) 5.9416% (I/O code only)
> >>
> >> v2.6.25 (not patched): DCM: 8.1967% (total) 10.5188% (I/O code only)
> >> v2.6.25 + nomerges = 0: DCM: 7.2291% (total) 9.4087% (I/O code only)
> >> v2.6.25 + nomerges = 1: DCM: 6.1989% (total) 8.0155% (I/O code only)
> >>
> >> I've typically been seeing a good 20-25% reduction in CPU samples, and
> >> 10-15% in DCM samples for the random load w/ nomerges set to 1 compared
> >> to set to 0 (looking at just the block code).
> >>
> >> [BTW: The I/O performance doesn't change much between the 3 sets of data
> >> - the seek + I/O times themselves dominate things to such a large
> >> extent. There is a very small improvement seen w/ nomerges=1, but <<1%.]
> >>
> >> It's not clear to me why 2.6.25 (not patched) requires /more/ cycles
> >> than does the patched kernel w/ nomerges=0 -- it's been consistent in
> >> the handful of runs I've done. I'm going to do a large set of runs for
> >> each condition (not patched, nomerges=0 & nomerges=1) to verify that
> >> this holds over multiple runs. I'm also going to check out sequential
> >> loads to see what (if any) penalty the extra couple of checks incurs on
> >> those (probably not noticeable).
> >>
> >> The first patch in the series adds the tunable; The second adds in the
> >> check to skip the merge code; and the third adds in the check to skip
> >> adding requests to hash lists for merging.
> >
> > The functionality is fine with me, merging is obviously a non-zero
> > amount of cycles spent on IO and if you know it's in vain, may as well
> > turn it off. One suggestion, though - if you add this as a performance
> > rather than functionality change, I would suggest keeping the one-hit
> > cache merge as that is essentially free. Better than free actually,
> > since if you hit that merge point you'll be spending way less cycles
> > than allocating+setting up a new request.
> >
>
> Hi Jens -
>
> I'll look into retaining the one-hit cache merge functionality, remove
> the errant elv_rqhash_del code, and repost w/ the results from the other
> tests I've run.
Also please do a check where you only disable the front merge logic, as
that is the most expensive bit (and the least likely to occur). I would
not be surprised if just removing the front merge bit would get you the
majority of the gain already. I have in the past considered just getting
rid of that bit, as it rarely triggers and it is a costly rbtree lookup
for each IO. The back merge lookup+merge should be cheaper, it's just a
hash lookup.
--
Jens Axboe
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-25 8:38 ` Jens Axboe
@ 2008-04-25 11:17 ` Alan D. Brunelle
2008-04-25 11:25 ` Jens Axboe
2008-04-25 12:17 ` Alan D. Brunelle
0 siblings, 2 replies; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-25 11:17 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1120 bytes --]
>>
>> I'll look into retaining the one-hit cache merge functionality, remove
>> the errant elv_rqhash_del code, and repost w/ the results from the other
>> tests I've run.
>
> Also please do a check where you only disable the front merge logic, as
> that is the most expensive bit (and the least likely to occur). I would
> not be surprised if just removing the front merge bit would get you the
> majority of the gain already. I have in the past considered just getting
> rid of that bit, as it rarely triggers and it is a costly rbtree lookup
> for each IO. The back merge lookup+merge should be cheaper, it's just a
> hash lookup.
>
I have the results from leaving in just the one-hit cache merge
attempts, and started a run leaving in both that and the back-merge
rq_hash checks. (The patch below basically undoes patch 3/3 - putting
back in the addition of rqs onto the hash list, and moves the nomerges
check below the back merge attempts.)
We /could/ change the tunable to a dial (or a mask) - enabling/disabling
specific merge attempts, but that seems a bit confusing/complex.
Jens: What do you think?
Alan
[-- Attachment #2: 0005-Enables-back-merge-checks-and-one-hit-cache-checks.patch --]
[-- Type: text/x-patch, Size: 1485 bytes --]
From eb158393a5fd2eec0582bbba8af588be7e08ef32 Mon Sep 17 00:00:00 2001
From: Alan D. Brunelle <alan.brunelle@hp.com>
Date: Fri, 25 Apr 2008 07:14:42 -0400
Subject: [PATCH] Enables back-merge checks (and one-hit cache checks) for merges
Undoes patch 3/3 -- puts rqs onto the rq_hash list -- and performs simple
hash list checks for back-merges only.
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
---
block/elevator.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/elevator.c b/block/elevator.c
index 557ee38..59be58d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -488,9 +488,6 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
}
}
- if (blk_queue_nomerges(q))
- return ELEVATOR_NO_MERGE;
-
/*
* See if our hash lookup can find a potential backmerge.
*/
@@ -500,6 +497,9 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
return ELEVATOR_BACK_MERGE;
}
+ if (blk_queue_nomerges(q))
+ return ELEVATOR_NO_MERGE;
+
if (e->ops->elevator_merge_fn)
return e->ops->elevator_merge_fn(q, req, bio);
@@ -604,7 +604,7 @@ void elv_insert(struct request_queue *q, struct request *rq, int where)
BUG_ON(!blk_fs_request(rq));
rq->cmd_flags |= REQ_SORTED;
q->nr_sorted++;
- if (!blk_queue_nomerges(q) && rq_mergeable(rq)) {
+ if (rq_mergeable(rq)) {
elv_rqhash_add(q, rq);
if (!q->last_merge)
q->last_merge = rq;
--
1.5.2.5
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-25 11:17 ` Alan D. Brunelle
@ 2008-04-25 11:25 ` Jens Axboe
2008-04-25 12:06 ` Aaron Carroll
2008-04-25 12:17 ` Alan D. Brunelle
1 sibling, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2008-04-25 11:25 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel
On Fri, Apr 25 2008, Alan D. Brunelle wrote:
> >>
> >> I'll look into retaining the one-hit cache merge functionality, remove
> >> the errant elv_rqhash_del code, and repost w/ the results from the other
> >> tests I've run.
> >
> > Also please do a check where you only disable the front merge logic, as
> > that is the most expensive bit (and the least likely to occur). I would
> > not be surprised if just removing the front merge bit would get you the
> > majority of the gain already. I have in the past considered just getting
> > rid of that bit, as it rarely triggers and it is a costly rbtree lookup
> > for each IO. The back merge lookup+merge should be cheaper, it's just a
> > hash lookup.
> >
>
> I have the results from leaving in just the one-hit cache merge
> attempts, and started a run leaving in both that and the back-merge
> rq_hash checks. (The patch below basically undoes patch 3/3 - putting
> back in the addition of rqs onto the hash list, and moves the nomerges
> check below the back merge attempts.)
>
> We /could/ change the tunable to a dial (or a mask) - enabling/disabling
> specific merge attempts, but that seems a bit confusing/complex.
>
> Jens: What do you think?
I think we should keep it simple. I don't particularly like having a
switch to toggle merges; no one will ever use it. So I'm more inclined
to just disable front merges unconditionally if the theory of where
the cycles are spent holds up. We'll still do front merges on the
one-hit cache, just not spend time looking up an io context and request
in the rbtree for basically no gain.
--
Jens Axboe
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-25 11:25 ` Jens Axboe
@ 2008-04-25 12:06 ` Aaron Carroll
2008-04-25 12:14 ` Jens Axboe
0 siblings, 1 reply; 30+ messages in thread
From: Aaron Carroll @ 2008-04-25 12:06 UTC (permalink / raw)
To: Jens Axboe; +Cc: Alan D. Brunelle, linux-kernel
Jens Axboe wrote:
>> I have the results from leaving in just the one-hit cache merge
>> attempts, and started a run leaving in both that and the back-merge
>> rq_hash checks. (The patch below basically undoes patch 3/3 - putting
>> back in the addition of rqs onto the hash list, and moves the nomerges
>> check below the back merge attempts.)
>>
>> We /could/ change the tunable to a dial (or a mask) - enabling/disabling
>> specific merge attempts, but that seems a bit confusing/complex.
>>
>> Jens: What do you think?
>
> I think we should keep it simple. I don't particularly like having a
> switch to toggle merges, no one will ever use it. So I'm more inclined
> to just disable front merges unconditionally if the theory of where
> the cycles are spent holds up. We'll still do front merges on the
> one-hit cache, just not spend time looking up an io context and request
> in the rbtree for basically no gain.
Front merging is probably a waste of time, but it could also be a hash table
lookup if you think the rbtree traversal is sinking too many cycles.
I wonder if there's any merit in junking the merge hash (and front-merging in
the ioscheds proper) and just having per-process one-hit caches. That's going
to catch the majority of merge cases. For requests that happen to be adjacent
by chance, they are just as likely to be back or front merges.
-- Aaron
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-25 12:06 ` Aaron Carroll
@ 2008-04-25 12:14 ` Jens Axboe
0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-25 12:14 UTC (permalink / raw)
To: Aaron Carroll; +Cc: Alan D. Brunelle, linux-kernel
On Fri, Apr 25 2008, Aaron Carroll wrote:
> Jens Axboe wrote:
> >>I have the results from leaving in just the one-hit cache merge
> >>attempts, and started a run leaving in both that and the back-merge
> >>rq_hash checks. (The patch below basically undoes patch 3/3 - putting
> >>back in the addition of rqs onto the hash list, and moves the nomerges
> >>check below the back merge attempts.)
> >>
> >>We /could/ change the tunable to a dial (or a mask) - enabling/disabling
> >>specific merge attempts, but that seems a bit confusing/complex.
> >>
> >>Jens: What do you think?
> >
> >I think we should keep it simple. I don't particularly like having a
> >switch to toggle merges, no one will ever use it. So I'm more inclined
> >to just disable front merges unconditionally if the theory of where
> >the cycles are spent holds up. We'll still do front merges on the
> >one-hit cache, just not spend time looking up an io context and request
> >in the rbtree for basically no gain.
>
> Front merging is probably a waste of time, but it could also be a hash table
> lookup if you think the rbtree traversal is sinking too many cycles.
The front merges weren't considered important enough to add space for a
separate hash table; that is why they are (re)using the normal rb sort
tree for lookups.
> I wonder if there's any merit in junking the merge hash (and
> front-merging in the ioscheds proper) and just having per-process
> one-hit caches. That's going to catch the majority of merge cases.
> For requests that happen to be adjacent by chance, they are just as
> likely to be back or front merges.
It's a possibility, the per-process plugging does that. The merge hash
is fairly cheap though, so unless we ever merge the per-process
plugging, I don't think it's a good idea to change it.
--
Jens Axboe
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-25 11:17 ` Alan D. Brunelle
2008-04-25 11:25 ` Jens Axboe
@ 2008-04-25 12:17 ` Alan D. Brunelle
2008-04-28 16:36 ` Alan D. Brunelle
1 sibling, 1 reply; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-25 12:17 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel
Here are the results. The last kernel (2.6.25-nomerges.nofrontmerges)
had 10 runs of 2 minutes each (as opposed to 25 runs of 10 minutes each
for the other kernels). I'm doing a full run of that kernel w/
25x10 minutes, but wanted to get this out for feedback first:
Increasing the merge attempts decreases the I/Os per second by less than
0.5%.
Kernel                         NM  I/Os per sec
-----------------------------  --  ------------
2.6.25                               472.39
2.6.25-nomerges                 0    472.54
2.6.25-nomerges.onehit          0    472.10
2.6.25-nomerges.nofrontmerges   0    470.38
2.6.25-nomerges                 1    472.58
2.6.25-nomerges.onehit          1    472.02
2.6.25-nomerges.nofrontmerges   1    470.65
The savings in cycles for these random loads compared to the total cycle
costs go from 4.4% up to 4.8% as we add in more merge attempts (as
compared to almost 5.8% for the stock 2.6.25 kernel).
Kernel                         NM  TAG   Total     I/O Code
-----------------------------  --  ----  --------  --------
2.6.25                             CPU:   5.7794%   7.5440%
2.6.25-nomerges                 0  CPU:   5.4957%   7.1987%
2.6.25-nomerges.onehit          0  CPU:   5.7822%   7.5034%
2.6.25-nomerges.nofrontmerges   0  CPU:   5.2041%   6.8534%
2.6.25-nomerges                 1  CPU:   4.4031%   5.7710%
2.6.25-nomerges.onehit          1  CPU:   4.7517%   6.1702%
2.6.25-nomerges.nofrontmerges   1  CPU:   4.8372%   6.3642%

Kernel                         NM  TAG   Total     I/O Code
-----------------------------  --  ----  --------  --------
2.6.25                             DCM:   7.9861%  10.2456%
2.6.25-nomerges                 0  DCM:   8.2134%  10.5145%
2.6.25-nomerges.onehit          0  DCM:   7.5559%   9.7389%
2.6.25-nomerges.nofrontmerges   0  DCM:   7.6436%   9.8934%
2.6.25-nomerges                 1  DCM:   6.6705%   8.5247%
2.6.25-nomerges.onehit          1  DCM:   6.3432%   8.1886%
2.6.25-nomerges.nofrontmerges   1  DCM:   7.2244%   9.3407%
Given that the tunable is meant to be turned on when the admin /knows/
the load is going to be random, it seems to me that adding in the other
merge checks (one-hit, back-merge) is going to be wasted effort the vast
majority of the time.
Thanks,
Alan
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-25 12:17 ` Alan D. Brunelle
@ 2008-04-28 16:36 ` Alan D. Brunelle
2008-04-29 7:37 ` Jens Axboe
0 siblings, 1 reply; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-28 16:36 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel
Jens -
Would you consider taking in a patch set that had the tunable + enabled
one-hit cache tries only? I don't see where the back-merge attempts add
anything in a truly random case.
Thanks,
Alan
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-28 16:36 ` Alan D. Brunelle
@ 2008-04-29 7:37 ` Jens Axboe
0 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-29 7:37 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel
On Mon, Apr 28 2008, Alan D. Brunelle wrote:
> Jens -
>
> Would you consider taking in a patch set that had the tunable + enabled
> one-hit cache tries only? I don't see where the back-merge attempts add
> anything in a truly random case.
>
Yep, I'm fine with that. Back merging won't help on truly random I/O of
course, but it may make the switch more appealing for half-random cases
as well. But let's just go with a switch that turns off all search
merging and only keeps the one-hit cache.
--
Jens Axboe
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 7:09 ` [RFC][PATCH 0/3] Skip I/O merges when disabled Jens Axboe
2008-04-24 12:09 ` Alan D. Brunelle
@ 2008-04-24 20:38 ` Alan D. Brunelle
1 sibling, 0 replies; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-24 20:38 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 772 bytes --]
Jens Axboe wrote:
>
> The functionality is fine with me, merging is obviously a non-zero
> amount of cycles spent on IO and if you know it's in vain, may as well
> turn it off. One suggestion, though - if you add this as a performance
> rather than functionality change, I would suggest keeping the one-hit
> cache merge as that is essentially free. Better than free actually,
> since if you hit that merge point you'll be spending way less cycles
> than allocating+setting up a new request.
>
With the patch below we retain the one-hit cache functionality. On a few
by-hand runs I'm not seeing much movement in the numbers (goodness). I'm
going to do a full 25 by 10-minute set of runs, and if things look OK,
I'll submit a new patch stream tomorrow.
Cheers,
Alan
[-- Attachment #2: 0004-Move-merge-skip-until-after-the-one-hit-check.patch --]
[-- Type: text/x-patch, Size: 1276 bytes --]
From afee0469c6dfc297cc81e38178193aaf0bd3b539 Mon Sep 17 00:00:00 2001
From: Alan D. Brunelle <alan.brunelle@hp.com>
Date: Thu, 24 Apr 2008 16:36:19 -0400
Subject: [PATCH] Move merge skip until after the one-hit check
This undoes patch 2/3, and moves the code to elv_merge after the one-hit
cache check.
Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
---
block/blk-core.c | 2 +-
block/elevator.c | 3 +++
2 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 54a2d8b..2a438a9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1115,7 +1115,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
spin_lock_irq(q->queue_lock);
- if (blk_queue_nomerges(q) || unlikely(barrier) || elv_queue_empty(q))
+ if (unlikely(barrier) || elv_queue_empty(q))
goto get_rq;
el_ret = elv_merge(q, &req, bio);
diff --git a/block/elevator.c b/block/elevator.c
index 2a5e4be..557ee38 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -488,6 +488,9 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
}
}
+ if (blk_queue_nomerges(q))
+ return ELEVATOR_NO_MERGE;
+
/*
* See if our hash lookup can find a potential backmerge.
*/
--
1.5.2.5
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-23 19:08 [RFC][PATCH 0/3] Skip I/O merges when disabled Alan D. Brunelle
` (3 preceding siblings ...)
2008-04-24 7:09 ` [RFC][PATCH 0/3] Skip I/O merges when disabled Jens Axboe
@ 2008-04-24 13:29 ` Andi Kleen
2008-04-24 13:59 ` Jens Axboe
2008-04-24 13:31 ` Alan D. Brunelle
5 siblings, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2008-04-24 13:29 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel, Jens Axboe
"Alan D. Brunelle" <Alan.Brunelle@hp.com> writes:
> The block I/O + elevator + I/O scheduler code spends a lot of time
> trying to merge I/Os -- rightfully so under "normal" circumstances.
> However, if one were to know that the incoming I/O stream was /very/
> random in nature, the cycles are wasted. (This can be the case, for
> example, during OLTP-type runs.)
>
> This patch stream adds a per-request_queue tunable that (when set)
> disables merge attempts, thus freeing up a non-trivial amount of CPU cycles.
It sounds interesting. But explicit tunables are always bad because
they will only be used by an elite few. Do you think it would be
possible instead to keep some statistics on how successful merging is
and, when the success rate is very low, disable it automatically for
some time until a timeout?
This way nearly everybody could get most of the benefit from this
change.
-Andi
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 13:29 ` Andi Kleen
@ 2008-04-24 13:59 ` Jens Axboe
2008-04-24 14:13 ` Alan D. Brunelle
2008-04-24 14:15 ` Andi Kleen
0 siblings, 2 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-24 13:59 UTC (permalink / raw)
To: Andi Kleen; +Cc: Alan D. Brunelle, linux-kernel@vger.kernel.org, Jens Axboe
On 24/04/2008, at 15.29, Andi Kleen <andi@firstfloor.org> wrote:
> "Alan D. Brunelle" <Alan.Brunelle@hp.com> writes:
>
>> The block I/O + elevator + I/O scheduler code spends a lot of time
>> trying to merge I/Os -- rightfully so under "normal" circumstances.
>> However, if one were to know that the incoming I/O stream was /very/
>> random in nature, the cycles are wasted. (This can be the case, for
>> example, during OLTP-type runs.)
>>
>> This patch stream adds a per-request_queue tunable that (when set)
>> disables merge attempts, thus freeing up a non-trivial amount of
>> CPU cycles.
>
> It sounds interesting. But explicit tunables are always bad because
> they will only be used by an elite few. Do you think it would be
> possible instead to keep some statistics on how successful merging is
> and, when the success rate is very low, disable it automatically for
> some time until a timeout?
>
> This way nearly everybody could get most of the benefit from this
> change.
Not a good idea IMHO, it's much better with an explicit setting. That
way you don't introduce nondeterministic behavior.
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 13:59 ` Jens Axboe
@ 2008-04-24 14:13 ` Alan D. Brunelle
2008-04-24 15:05 ` Jens Axboe
` (2 more replies)
2008-04-24 14:15 ` Andi Kleen
1 sibling, 3 replies; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-24 14:13 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org; +Cc: Andi Kleen, Jens Axboe
Jens Axboe wrote:
> On 24/04/2008, at 15.29, Andi Kleen <andi@firstfloor.org> wrote:
>
>> "Alan D. Brunelle" <Alan.Brunelle@hp.com> writes:
>>
>>> The block I/O + elevator + I/O scheduler code spends a lot of time
>>> trying to merge I/Os -- rightfully so under "normal" circumstances.
>>> However, if one were to know that the incoming I/O stream was /very/
>>> random in nature, the cycles are wasted. (This can be the case, for
>>> example, during OLTP-type runs.)
>>>
>>> This patch stream adds a per-request_queue tunable that (when set)
>>> disables merge attempts, thus freeing up a non-trivial amount of CPU
>>> cycles.
>>
>> It sounds interesting. But explicit tunables are always bad because
>> they will only be used by an elite few. Do you think it would be
>> possible instead to keep some statistics on how successful merging is
>> and, when the success rate is very low, disable it automatically for
>> some time until a timeout?
>>
>> This way nearly everybody could get most of the benefit from this
>> change.
>
> Not a good idea IMHO, it's much better with an explicit setting. That
> way you don't introduce indeterministic behavior.
Another way to attack this would be to have a user-level daemon "watch
things" -
o We could leave 'nomerges' alone: if someone set that, they "know"
what they are doing, and we just don't attempt merges. [This tunable
would really be for the "elite few" - those who know which devices are
used in which ways - people who administer Enterprise load environments
tend to need to know this.]
o The kernel already exports stats on merges, so the daemon could watch
those stats in comparison to the number of I/Os submitted. If it
determined that merge attempts were not being very successful, it could
turn off merges for a period of time. Later it could turn them back on,
watch for a while, and repeat.
Does this sound better/worthwhile?
Alan
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 14:13 ` Alan D. Brunelle
@ 2008-04-24 15:05 ` Jens Axboe
2008-04-24 22:04 ` Carl Henrik Lunde
2008-04-25 7:13 ` Andi Kleen
2 siblings, 0 replies; 30+ messages in thread
From: Jens Axboe @ 2008-04-24 15:05 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel@vger.kernel.org, Andi Kleen
On Thu, Apr 24 2008, Alan D. Brunelle wrote:
> Jens Axboe wrote:
> > On 24/04/2008, at 15.29, Andi Kleen <andi@firstfloor.org> wrote:
> >
> >> "Alan D. Brunelle" <Alan.Brunelle@hp.com> writes:
> >>
> >>> The block I/O + elevator + I/O scheduler code spends a lot of time
> >>> trying to merge I/Os -- rightfully so under "normal" circumstances.
> >>> However, if one were to know that the incoming I/O stream was /very/
> >>> random in nature, the cycles are wasted. (This can be the case, for
> >>> example, during OLTP-type runs.)
> >>>
> >>> This patch stream adds a per-request_queue tunable that (when set)
> >>> disables merge attempts, thus freeing up a non-trivial amount of CPU
> >>> cycles.
> >>
> >> It sounds interesting. But explicit tunables are always bad because
> >> they will only be used by an elite few. Do you think it would be
> >> possible instead to keep some statistics on how successful merging is
> >> and, when the success rate is very low, disable it automatically for
> >> some time until a timeout?
> >>
> >> This way nearly everybody could get most of the benefit from this
> >> change.
> >
> > Not a good idea IMHO, it's much better with an explicit setting. That
> > way you don't introduce indeterministic behavior.
>
> Another way to attack this would be to have a user level daemon "watch
> things" -
>
> o We could leave 'nomerges' alone: if someone set that, they "know"
> what they are doing, and we just don't attempt merges. [This tunable
> would really be for the "elite few" - those who know which devices are
> used in which ways - people who administer Enterprise load environments
> tend to need to know this.]
>
> o The kernel already exports stats on merges, so the daemon could watch
> those stats in comparison to the number of I/Os submitted. If it
> determined that merge attempts were not being very successful, it could
> turn off merges for a period of time. Later it could turn them back on,
> watch for a while, and repeat.
>
> Does this sound better/worthwhile?
That's true, you could toggle this from a user daemon if you wish. I
still think it's a really bad idea, but at least then it's entirely up
to the user. I'm not a big fan of such schemes, to say the least.
--
Jens Axboe
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 14:13 ` Alan D. Brunelle
2008-04-24 15:05 ` Jens Axboe
@ 2008-04-24 22:04 ` Carl Henrik Lunde
2008-04-25 7:13 ` Andi Kleen
2 siblings, 0 replies; 30+ messages in thread
From: Carl Henrik Lunde @ 2008-04-24 22:04 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel@vger.kernel.org, Andi Kleen, Jens Axboe
On Thu, Apr 24, 2008 at 4:13 PM, Alan D. Brunelle <Alan.Brunelle@hp.com> wrote:
[...]
> o The kernel already exports stats on merges, so the daemon could watch
> those stats in comparison to the number of I/Os submitted. If it
> determined that merge attempts were not being very successful, it could
> turn off merges for a period of time. Later it could turn them back on,
> watch for a while, and repeat.
>
> Does this sound better/worthwhile?
It may be more interesting to make a program which makes some helpful
suggestions after looking at a blktrace file. The program could, for
example, suggest changing the I/O scheduler, increasing a parameter such
as slice_sync, or disabling merges.
--
Carl Henrik
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 14:13 ` Alan D. Brunelle
2008-04-24 15:05 ` Jens Axboe
2008-04-24 22:04 ` Carl Henrik Lunde
@ 2008-04-25 7:13 ` Andi Kleen
2 siblings, 0 replies; 30+ messages in thread
From: Andi Kleen @ 2008-04-25 7:13 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel@vger.kernel.org, Jens Axboe
> Another way to attack this would be to have a user level daemon "watch
> things" -
I personally don't like having specialized user-mode daemons for things
the kernel can easily do by itself (which is the case here). Linux
setups generally already have too many daemons and they have some
overhead; another one should only be added for very good reasons.
The daemon would just be an excuse here for "we cannot work out a
sensible kernel policy", which would be bad.
-Andi
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 13:59 ` Jens Axboe
2008-04-24 14:13 ` Alan D. Brunelle
@ 2008-04-24 14:15 ` Andi Kleen
2008-04-24 15:04 ` Jens Axboe
1 sibling, 1 reply; 30+ messages in thread
From: Andi Kleen @ 2008-04-24 14:15 UTC (permalink / raw)
To: Jens Axboe; +Cc: Alan D. Brunelle, linux-kernel@vger.kernel.org, Jens Axboe
> Not a good idea IMHO, it's much better with an explicit setting. That
> way you don't introduce indeterministic behavior.
So you would be deterministically slower.
Another way to avoid this problem would be to keep the statistics per
I/O context; then the same run of a program would always get the same
behaviour. The drawback is that if your non-mergeable workload consists
of lots of short-running processes (like a shell script), the
optimization wouldn't work. Not sure if it's really practical, but it
would be an option.
I think in modern systems with caches etc. you typically have enough
not-quite-deterministic and other surprising and hard-to-analyze
behaviour anyway, so a little more doesn't make much difference.
-Andi
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 14:15 ` Andi Kleen
@ 2008-04-24 15:04 ` Jens Axboe
2008-04-24 15:53 ` David Collier-Brown
0 siblings, 1 reply; 30+ messages in thread
From: Jens Axboe @ 2008-04-24 15:04 UTC (permalink / raw)
To: Andi Kleen; +Cc: Alan D. Brunelle, linux-kernel@vger.kernel.org
On Thu, Apr 24 2008, Andi Kleen wrote:
>
> > Not a good idea IMHO, it's much better with an explicit setting. That
> > way you don't introduce indeterministic behavior.
>
> So you would be deterministically slower.
Yes, absolutely. Think about the case for a second - the potential gain is in
fractions of a percent basically, the potential loss however is HUGE.
There's absolutely no way on earth I'd ever make this dynamic.
> Another way to avoid this problem would be to keep the statistics per
> IO context, then the same run of a program would always get the same
> behaviour. Drawback is that if your non mergeable workload consists of
> lots of short running processes (like a shell script) the optimization
> wouldn't work. Not sure if it's really practical, but it would be an option.
Complexity for basically zero gain, no thanks.
> I think in modern systems with caches etc. you typically have enough
> not-quite-deterministic and other surprising and hard-to-analyze
> behaviour anyway, so a little more doesn't make much difference.
Sorry Andi, but that is nonsense. Not merging I/Os when you should can
cut your performance to a fifth or something of that order; it's an
entirely different ballgame.
--
Jens Axboe
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 15:04 ` Jens Axboe
@ 2008-04-24 15:53 ` David Collier-Brown
2008-04-24 16:29 ` Alan D. Brunelle
0 siblings, 1 reply; 30+ messages in thread
From: David Collier-Brown @ 2008-04-24 15:53 UTC (permalink / raw)
To: Jens Axboe; +Cc: Andi Kleen, Alan D. Brunelle, linux-kernel@vger.kernel.org
Jens Axboe wrote:
> On Thu, Apr 24 2008, Andi Kleen wrote:
>
>>>Not a good idea IMHO, it's much better with an explicit setting. That
>>>way you don't introduce indeterministic behavior.
>>
>>So you would be deterministically slower.
>
>
> Yes, absolutely. Think about the case for a second - the potential gain is in
> fractions of a percent basically, the potential loss however is HUGE.
> There's absolutely no way on earth I'd ever make this dynamic.
If this is intended for databases, it might be backwards (;-))
The commercial unix "forcedirectio" option that Oracle and other
database vendors usually ask for turns out to be a benefit
in large sequential data transfers, because it does two things:
1) transfers directly between user address space and disk, avoiding buffering, and
2) allows enthusiastic coalescence of synchronous writes
Is this intended for DBMSs, or for something else?
--dave
--
David Collier-Brown | Always do right. This will gratify
Sun Microsystems, Toronto | some people and astonish the rest
davecb@sun.com | -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 15:53 ` David Collier-Brown
@ 2008-04-24 16:29 ` Alan D. Brunelle
0 siblings, 0 replies; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-24 16:29 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org; +Cc: davecb, Jens Axboe
David Collier-Brown wrote:
> Jens Axboe wrote:
>> On Thu, Apr 24 2008, Andi Kleen wrote:
>>
>>>> Not a good idea IMHO, it's much better with an explicit setting. That
>>>> way you don't introduce indeterministic behavior.
>>>
>>> So you would be deterministically slower.
>>
>>
>> Yes, absolutely. Think about the case for a second - the potential
>> gain is in
>> fractions of a percent basically, the potential loss however is HUGE.
>> There's absolutely no way on earth I'd ever make this dynamic.
>
> If this is intended for databases, it might be backwards (;-))
> The commercial unix "forcedirectio" option that Oracle and other
> database vendors usually ask for turns out to be a benefit
> in large sequential data transfers, because it does two things:
>
> 1) transfers directly between user address space and disk, avoiding
> buffering, and
> 2) allows enthusiastic coalescence of synchronous writes
>
> Is this intended for DBMSs, or for something else?
>
> --dave
No, it's intended for devices being used for /random I/O loads/ - like
index seeks during OLTP and the like. But in general, the idea is that
if you know you have a highly random I/O workload, you can set the
tunable and get back a significant chunk of CPU cycles.
Alan
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-23 19:08 [RFC][PATCH 0/3] Skip I/O merges when disabled Alan D. Brunelle
` (4 preceding siblings ...)
2008-04-24 13:29 ` Andi Kleen
@ 2008-04-24 13:31 ` Alan D. Brunelle
2008-04-24 13:43 ` Alan D. Brunelle
5 siblings, 1 reply; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-24 13:31 UTC (permalink / raw)
To: linux-kernel; +Cc: Jens Axboe
Alan D. Brunelle wrote:
> The block I/O + elevator + I/O scheduler code spends a lot of time
> trying to merge I/Os -- rightfully so under "normal" circumstances.
> However, if one were to know that the incoming I/O stream was /very/
> random in nature, the cycles are wasted. (This can be the case, for
> example, during OLTP-type runs.)
>
> This patch stream adds a per-request_queue tunable that (when set)
> disables merge attempts, thus freeing up a non-trivial amount of CPU
cycles.
>
> I'll be doing some more benchmarking, but this is a representative set
> of data on a two-way Opteron box w/ 4 SATA drives. 'fio' was used to
> generate random 4k asynchronous direct I/Os over the 128GiB of each SATA
> drive. Oprofile was used to collect the results, and we collected
> CPU_CLK_UNHALTED (CPU) and DATA_CACHE_MISSES (DCM) events. The data
> extracted below shows both the percentage for all samples (including
> non-kernel) as well as just those from the block I/O layer + elevator +
> deadline I/O scheduler + SATA modules.
>
> v2.6.25 (not patched): CPU: 5.8330% (total) 7.5644% (I/O code only)
> v2.6.25 + nomerges = 0: CPU: 5.8008% (total) 7.5806% (I/O code only)
> v2.6.25 + nomerges = 1: CPU: 4.5404% (total) 5.9416% (I/O code only)
>
> v2.6.25 (not patched): DCM: 8.1967% (total) 10.5188% (I/O code only)
> v2.6.25 + nomerges = 0: DCM: 7.2291% (total) 9.4087% (I/O code only)
> v2.6.25 + nomerges = 1: DCM: 6.1989% (total) 8.0155% (I/O code only)
>
> I've typically been seeing a good 20-25% reduction in CPU samples, and
> 10-15% in DCM samples for the random load w/ nomerges set to 1 compared
> to set to 0 (looking at just the block code).
>
> [BTW: The I/O performance doesn't change much between the 3 sets of data
> - the seek + I/O times themselves dominate things to such a large
> extent. There is a very small improvement seen w/ nomerges=1, but <<1%.]
>
> It's not clear to me why 2.6.25 (not patched) requires /more/ cycles
> than does the patched kernel w/ nomerges=0 -- it's been consistent in
> the handful of runs I've done. I'm going to do a large set of runs for
> each condition (not patched, nomerges=0 & nomerges=1) to verify that
> this holds over multiple runs. I'm also going to check out sequential
> loads to see what (if any) penalty the extra couple of checks incurs on
> those (probably not noticeable).
>
> The first patch in the series adds the tunable; The second adds in the
> check to skip the merge code; and the third adds in the check to skip
> adding requests to hash lists for merging.
>
> Alan D. Brunelle
> Hewlett-Packard
The results over 25 runs (10 minutes each) look good; as noted
yesterday, /very/ slightly better I/Os per second with nomerges=1:
Kernel                NM  I/Os per second
--------------------  --  ---------------
2.6.25 (not patched)      483,727.36
2.6.25 + nomerges      0  483,880.96
2.6.25 + nomerges      1  483,921.92
The CPU and DCM samples in the block I/O code again were better w/
nomerges=1 averaged over the 25 runs (about 23.8% fewer cycles needed to
do the work in the block I/O code):
v2.6.25 (not patched): CPU: 5.779% (total) 7.544% (I/O code only)
v2.6.25 + nomerges = 0: CPU: 5.496% (total) 7.199% (I/O code only)
v2.6.25 + nomerges = 1: CPU: 4.403% (total) 5.771% (I/O code only)
v2.6.25 (not patched): DCM: 7.986% (total) 10.246% (I/O code only)
v2.6.25 + nomerges = 0: DCM: 8.213% (total) 10.514% (I/O code only)
v2.6.25 + nomerges = 1: DCM: 6.670% (total) 8.525% (I/O code only)
Alan
* Re: [RFC][PATCH 0/3] Skip I/O merges when disabled
2008-04-24 13:31 ` Alan D. Brunelle
@ 2008-04-24 13:43 ` Alan D. Brunelle
0 siblings, 0 replies; 30+ messages in thread
From: Alan D. Brunelle @ 2008-04-24 13:43 UTC (permalink / raw)
To: Alan D. Brunelle; +Cc: linux-kernel, Jens Axboe
Alan D. Brunelle wrote:
>
> Kernel NM I/Os per second
> -------------------- -- ---------------
> 2.6.25 (not patched) 483,727.36
> 2.6.25 + nomerges 0 483,880.96
> 2.6.25 + nomerges 1 483,921.92
Wrong chart, sorry:
Kernel                NM  I/Os per second
--------------------  --  ---------------
2.6.25 (not patched)      472.39
2.6.25 + nomerges      0  472.54
2.6.25 + nomerges      1  472.58
(Don't think a little 2-way box w/ 4 SATA drives can really approach
500K I/Os per second...)
Alan