* [patch]raid5: fix directio regression
@ 2012-08-07 3:22 Shaohua Li
2012-08-07 5:13 ` Jianpeng Ma
0 siblings, 1 reply; 28+ messages in thread
From: Shaohua Li @ 2012-08-07 3:22 UTC (permalink / raw)
To: linux-raid; +Cc: neilb, majianpeng
My directIO 4k random-write workload shows a 10~20% regression caused by commit
895e3c5c58a80bb. directIO is usually random IO, and if the request size isn't big
(which is the common case), delaying handling of the stripe has no advantage.
For big requests, delaying can still reduce IO.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2012-08-07 11:04:48.442834682 +0800
+++ linux/drivers/md/raid5.c 2012-08-07 11:09:08.743562203 +0800
@@ -4076,6 +4076,7 @@ static void make_request(struct mddev *m
struct stripe_head *sh;
const int rw = bio_data_dir(bi);
int remaining;
+ bool large_request;
if (unlikely(bi->bi_rw & REQ_FLUSH)) {
md_flush_request(mddev, bi);
@@ -4089,6 +4090,11 @@ static void make_request(struct mddev *m
chunk_aligned_read(mddev,bi))
return;
+ if (mddev->new_chunk_sectors < mddev->chunk_sectors)
+ large_request = (bi->bi_size >> 9) > mddev->new_chunk_sectors;
+ else
+ large_request = (bi->bi_size >> 9) > mddev->chunk_sectors;
+
logical_sector = bi->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
last_sector = bi->bi_sector + (bi->bi_size>>9);
bi->bi_next = NULL;
@@ -4192,7 +4198,8 @@ static void make_request(struct mddev *m
finish_wait(&conf->wait_for_overlap, &w);
set_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
- if ((bi->bi_rw & REQ_NOIDLE) &&
+ if ((bi->bi_rw & REQ_SYNC) &&
+ ((bi->bi_rw & REQ_NOIDLE) || !large_request) &&
!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe_plug(mddev, sh);
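To make the size test concrete, here is a small stand-alone sketch of the same decision with worked numbers. It is illustrative only, not kernel code; the 512 KiB chunk size is an assumed example, and bio_bytes/chunk_sectors simply stand in for bi->bi_size and min(new_chunk_sectors, chunk_sectors).

/* Illustrative only: mirrors the large_request test from the patch above. */
#include <stdbool.h>
#include <stdio.h>

static bool is_large_request(unsigned int bio_bytes, unsigned int chunk_sectors)
{
	/* bi_size is in bytes; >> 9 converts to 512-byte sectors */
	return (bio_bytes >> 9) > chunk_sectors;
}

int main(void)
{
	unsigned int chunk_sectors = 1024;	/* assumed example: 512 KiB chunk */

	/* 4 KiB random direct write: 8 sectors, not "large", so the stripe
	 * gets STRIPE_PREREAD_ACTIVE at once instead of being delayed. */
	printf("4KiB bio -> large_request=%d\n", is_large_request(4096, chunk_sectors));

	/* 1 MiB direct write: 2048 sectors, "large", so the stripe is still
	 * delayed in the hope of collecting a full-stripe write. */
	printf("1MiB bio -> large_request=%d\n", is_large_request(1 << 20, chunk_sectors));
	return 0;
}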
* Re: [patch]raid5: fix directio regression
2012-08-07 3:22 [patch]raid5: fix directio regression Shaohua Li
@ 2012-08-07 5:13 ` Jianpeng Ma
2012-08-07 5:32 ` Shaohua Li
0 siblings, 1 reply; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-07 5:13 UTC (permalink / raw)
To: shli, linux-raid; +Cc: Neil Brown
On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>(which is the common case), delay handling of the stripe hasn't any advantages.
>For big size request, delay can still reduce IO.
>
>Signed-off-by: Shaohua Li <shli@fusionio.com>
>---
> drivers/md/raid5.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
>Index: linux/drivers/md/raid5.c
>===================================================================
>--- linux.orig/drivers/md/raid5.c 2012-08-07 11:04:48.442834682 +0800
>+++ linux/drivers/md/raid5.c 2012-08-07 11:09:08.743562203 +0800
>@@ -4076,6 +4076,7 @@ static void make_request(struct mddev *m
> struct stripe_head *sh;
> const int rw = bio_data_dir(bi);
> int remaining;
>+ bool large_request;
>
> if (unlikely(bi->bi_rw & REQ_FLUSH)) {
> md_flush_request(mddev, bi);
>@@ -4089,6 +4090,11 @@ static void make_request(struct mddev *m
> chunk_aligned_read(mddev,bi))
> return;
>
>+ if (mddev->new_chunk_sectors < mddev->chunk_sectors)
>+ large_request = (bi->bi_size >> 9) > mddev->new_chunk_sectors;
>+ else
>+ large_request = (bi->bi_size >> 9) > mddev->chunk_sectors;
>+
> logical_sector = bi->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
> last_sector = bi->bi_sector + (bi->bi_size>>9);
> bi->bi_next = NULL;
>@@ -4192,7 +4198,8 @@ static void make_request(struct mddev *m
> finish_wait(&conf->wait_for_overlap, &w);
> set_bit(STRIPE_HANDLE, &sh->state);
> clear_bit(STRIPE_DELAYED, &sh->state);
>- if ((bi->bi_rw & REQ_NOIDLE) &&
>+ if ((bi->bi_rw & REQ_SYNC) &&
>+ ((bi->bi_rw & REQ_NOIDLE) || !large_request) &&
> !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> atomic_inc(&conf->preread_active_stripes);
> release_stripe_plug(mddev, sh);
>--
Maybe using the size to judge is not a good method.
When I first sent this patch, I only wanted to control direct writes to the block device, not to regular files.
I think that if someone uses direct block-device writes on raid5, he should know the characteristics of raid5 and can
arrange his writes to be full-stripe writes.
But at that time I did not know how to differentiate between a regular file and a block device.
I think we should do something to make that possible.
* Re: [patch]raid5: fix directio regression
2012-08-07 5:13 ` Jianpeng Ma
@ 2012-08-07 5:32 ` Shaohua Li
2012-08-07 5:42 ` Jianpeng Ma
2012-08-07 6:21 ` Jianpeng Ma
0 siblings, 2 replies; 28+ messages in thread
From: Shaohua Li @ 2012-08-07 5:32 UTC (permalink / raw)
To: Jianpeng Ma; +Cc: linux-raid, Neil Brown
2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>>(which is the common case), delay handling of the stripe hasn't any advantages.
>>For big size request, delay can still reduce IO.
>>
>>Signed-off-by: Shaohua Li <shli@fusionio.com>
>>---
>> drivers/md/raid5.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>>Index: linux/drivers/md/raid5.c
>>===================================================================
>>--- linux.orig/drivers/md/raid5.c 2012-08-07 11:04:48.442834682 +0800
>>+++ linux/drivers/md/raid5.c 2012-08-07 11:09:08.743562203 +0800
>>@@ -4076,6 +4076,7 @@ static void make_request(struct mddev *m
>> struct stripe_head *sh;
>> const int rw = bio_data_dir(bi);
>> int remaining;
>>+ bool large_request;
>>
>> if (unlikely(bi->bi_rw & REQ_FLUSH)) {
>> md_flush_request(mddev, bi);
>>@@ -4089,6 +4090,11 @@ static void make_request(struct mddev *m
>> chunk_aligned_read(mddev,bi))
>> return;
>>
>>+ if (mddev->new_chunk_sectors < mddev->chunk_sectors)
>>+ large_request = (bi->bi_size >> 9) > mddev->new_chunk_sectors;
>>+ else
>>+ large_request = (bi->bi_size >> 9) > mddev->chunk_sectors;
>>+
>> logical_sector = bi->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
>> last_sector = bi->bi_sector + (bi->bi_size>>9);
>> bi->bi_next = NULL;
>>@@ -4192,7 +4198,8 @@ static void make_request(struct mddev *m
>> finish_wait(&conf->wait_for_overlap, &w);
>> set_bit(STRIPE_HANDLE, &sh->state);
>> clear_bit(STRIPE_DELAYED, &sh->state);
>>- if ((bi->bi_rw & REQ_NOIDLE) &&
>>+ if ((bi->bi_rw & REQ_SYNC) &&
>>+ ((bi->bi_rw & REQ_NOIDLE) || !large_request) &&
>> !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>> atomic_inc(&conf->preread_active_stripes);
>> release_stripe_plug(mddev, sh);
>>--
> May be used size to judge is not a good method.
> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
> for write to full-write.
> But at that time, i did know how to differentiate between regular file and block-device.
> I thik we should do something to do this.
I don't think it's possible for a user to control his writes to be
full-stripe writes, even for raw disk IO. Why does the distinction between
regular-file and block-device IO matter here?
Thanks,
Shaohua
* Re: Re: [patch]raid5: fix directio regression
2012-08-07 5:32 ` Shaohua Li
@ 2012-08-07 5:42 ` Jianpeng Ma
2012-08-07 6:21 ` Jianpeng Ma
1 sibling, 0 replies; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-07 5:42 UTC (permalink / raw)
To: shli; +Cc: linux-raid, Neil Brown
On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>>>(which is the common case), delay handling of the stripe hasn't any advantages.
>>>For big size request, delay can still reduce IO.
>>>
>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
>>>---
>>> drivers/md/raid5.c | 9 ++++++++-
>>> 1 file changed, 8 insertions(+), 1 deletion(-)
[snip]
>>>--
>> May be used size to judge is not a good method.
>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
>> for write to full-write.
>> But at that time, i did know how to differentiate between regular file and block-device.
>> I thik we should do something to do this.
>
>I don't think it's possible user can control his write to be a
>full-write even for
>raw disk IO. Why regular file and block device io matters here?
>
>Thanks,
>Shaohua
The problem is when to set the STRIPE_PREREAD_ACTIVE flag.
When this flag is set, the stripe is handled immediately instead of being delayed in the hope of a full-stripe write.
I think it is a trade-off, like IOPS versus throughput.
In a random small-write workload a full-stripe write is hardly ever achieved, so there is no need to delay; this is like sync writes.
But for a large-write workload, full-stripe writes are easy to achieve.
That is why I sent the original patch; my workload is the latter.
For a regular file this is controlled by the filesystem. But for a raw block device there is no filesystem doing it, so we can arrange the writes ourselves for the best performance.
Of course, my own workload is like this, so my point of view may be limited.
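To illustrate what such a full-stripe direct write looks like from user space, here is a minimal sketch. The device path, chunk size, and data-disk count are made-up examples, not values from the thread; a real array's geometry must be used for the write to actually be full-stripe.

/* Illustrative sketch: issue one O_DIRECT write covering a whole raid5
 * stripe (chunk size * number of data disks), aligned to the stripe,
 * so no pre-read of old data/parity should be needed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const size_t chunk = 512 * 1024;	/* assumed chunk size: 512 KiB */
	const int data_disks = 3;		/* assumed: 4-disk raid5 */
	const size_t stripe = chunk * data_disks;
	void *buf;
	int fd;

	fd = open("/dev/md0", O_WRONLY | O_DIRECT);	/* hypothetical array */
	if (fd < 0)
		return 1;
	if (posix_memalign(&buf, 4096, stripe))
		return 1;
	memset(buf, 0xab, stripe);

	/* stripe-aligned offset, full-stripe length */
	if (pwrite(fd, buf, stripe, 0) != (ssize_t)stripe)
		return 1;

	free(buf);
	close(fd);
	return 0;
}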
* Re: Re: [patch]raid5: fix directio regression
2012-08-07 5:32 ` Shaohua Li
2012-08-07 5:42 ` Jianpeng Ma
@ 2012-08-07 6:21 ` Jianpeng Ma
2012-08-08 2:58 ` Shaohua Li
1 sibling, 1 reply; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-07 6:21 UTC (permalink / raw)
To: shli; +Cc: linux-raid, Neil Brown
On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>>>(which is the common case), delay handling of the stripe hasn't any advantages.
>>>For big size request, delay can still reduce IO.
>>>
>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
[snip]
>>>--
>> May be used size to judge is not a good method.
>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
>> for write to full-write.
>> But at that time, i did know how to differentiate between regular file and block-device.
>> I thik we should do something to do this.
>
>I don't think it's possible user can control his write to be a
>full-write even for
>raw disk IO. Why regular file and block device io matters here?
>
>Thanks,
>Shaohua
Another problem is the size: how do we judge whether a size is large or not?
A write syscall becomes one dio, and a dio may be split into several bios.
For my workload, I usually write chunk-size at a time.
But your patch judges by bio size.
* Re: Re: [patch]raid5: fix directio regression
2012-08-07 6:21 ` Jianpeng Ma
@ 2012-08-08 2:58 ` Shaohua Li
2012-08-08 5:21 ` Jianpeng Ma
0 siblings, 1 reply; 28+ messages in thread
From: Shaohua Li @ 2012-08-08 2:58 UTC (permalink / raw)
To: Jianpeng Ma; +Cc: linux-raid, Neil Brown
2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
>>>>For big size request, delay can still reduce IO.
>>>>
>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
> [snip]
>>>>--
>>> May be used size to judge is not a good method.
>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
>>> for write to full-write.
>>> But at that time, i did know how to differentiate between regular file and block-device.
>>> I thik we should do something to do this.
>>
>>I don't think it's possible user can control his write to be a
>>full-write even for
>>raw disk IO. Why regular file and block device io matters here?
>>
>>Thanks,
>>Shaohua
> Another problem is the size. How to judge the size is large or not?
> A syscall write is a dio and a dio may be split more bios.
> For my workload, i usualy write chunk-size.
> But your patch is judge by bio-size.
I'd ignore workloads that do sequential directIO. Yours is one, but
I bet no real workloads are. So I'd like to consider only big random
directIO. I agree the size threshold is arbitrary. I could refine it to
consider only stripes that are hit on two or more disks by one bio, but
I'm not sure it's worth doing. I'm not aware that big-size directIO is
common, and even if it is, the IOPS of big requests is low, so a bit of
delay may not be a big deal.
* Re: Re: [patch]raid5: fix directio regression
2012-08-08 2:58 ` Shaohua Li
@ 2012-08-08 5:21 ` Jianpeng Ma
2012-08-08 12:53 ` Shaohua Li
0 siblings, 1 reply; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-08 5:21 UTC (permalink / raw)
To: shli; +Cc: linux-raid, Neil Brown
On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
>>>>>For big size request, delay can still reduce IO.
>>>>>
>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
>> [snip]
>>>>>--
>>>> May be used size to judge is not a good method.
>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
>>>> for write to full-write.
>>>> But at that time, i did know how to differentiate between regular file and block-device.
>>>> I thik we should do something to do this.
>>>
>>>I don't think it's possible user can control his write to be a
>>>full-write even for
>>>raw disk IO. Why regular file and block device io matters here?
>>>
>>>Thanks,
>>>Shaohua
>> Another problem is the size. How to judge the size is large or not?
>> A syscall write is a dio and a dio may be split more bios.
>> For my workload, i usualy write chunk-size.
>> But your patch is judge by bio-size.
>
>I'd ignore workload which does sequential directIO, though
>your workload is, but I bet no real workloads are. So I'd like
Sorry, my explanation may not have been correct. I write data in single requests whose size is almost chunk-size * devices, in order to get full-stripe writes
and, as far as possible, avoid pre-read operations.
>only to consider big size random directio. I agree the size
>judge is arbitrary. I can optimize it to be only consider stripe
>which hits two or more disks in one bio, but not sure if it's
>worthy doing. Not ware big size directio is common, and even
>is, big size request IOPS is low, a bit delay maybe not a big
>deal.
What if we added an acc_time to 'stripe_head' to control this?
Whenever get_active_stripe() succeeds, update acc_time.
If a stripe_head has not been accessed for some time, it should pre-read.
* Re: Re: [patch]raid5: fix directio regression
2012-08-08 5:21 ` Jianpeng Ma
@ 2012-08-08 12:53 ` Shaohua Li
2012-08-09 1:20 ` Jianpeng Ma
0 siblings, 1 reply; 28+ messages in thread
From: Shaohua Li @ 2012-08-08 12:53 UTC (permalink / raw)
To: Jianpeng Ma; +Cc: linux-raid, Neil Brown
2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
>>>>>>For big size request, delay can still reduce IO.
>>>>>>
>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
>>> [snip]
>>>>>>--
>>>>> May be used size to judge is not a good method.
>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
>>>>> for write to full-write.
>>>>> But at that time, i did know how to differentiate between regular file and block-device.
>>>>> I thik we should do something to do this.
>>>>
>>>>I don't think it's possible user can control his write to be a
>>>>full-write even for
>>>>raw disk IO. Why regular file and block device io matters here?
>>>>
>>>>Thanks,
>>>>Shaohua
>>> Another problem is the size. How to judge the size is large or not?
>>> A syscall write is a dio and a dio may be split more bios.
>>> For my workload, i usualy write chunk-size.
>>> But your patch is judge by bio-size.
>>
>>I'd ignore workload which does sequential directIO, though
>>your workload is, but I bet no real workloads are. So I'd like
> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
> and as possible as to no pre-read operation.
>>only to consider big size random directio. I agree the size
>>judge is arbitrary. I can optimize it to be only consider stripe
>>which hits two or more disks in one bio, but not sure if it's
>>worthy doing. Not ware big size directio is common, and even
>>is, big size request IOPS is low, a bit delay maybe not a big
>>deal.
> If add a acc_time for 'striep_head' to control?
> When get_active_stripe() is ok, update acc_time.
> For some time, stripe_head did not access and it shold pre-read.
Do you want to add a timer for each stripe? That is even uglier.
How do you choose the expiry time? A time that works for a hard disk
definitely will not work for a fast SSD.
* Re: Re: [patch]raid5: fix directio regression
2012-08-08 12:53 ` Shaohua Li
@ 2012-08-09 1:20 ` Jianpeng Ma
2012-08-09 1:32 ` NeilBrown
0 siblings, 1 reply; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-09 1:20 UTC (permalink / raw)
To: shli; +Cc: linux-raid, Neil Brown
On 2012-08-08 20:53 Shaohua Li <shli@kernel.org> Wrote:
>2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
>> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
>>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
>>>>>>>For big size request, delay can still reduce IO.
>>>>>>>
>>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
>>>> [snip]
>>>>>>>--
>>>>>> May be used size to judge is not a good method.
>>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
>>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
>>>>>> for write to full-write.
>>>>>> But at that time, i did know how to differentiate between regular file and block-device.
>>>>>> I thik we should do something to do this.
>>>>>
>>>>>I don't think it's possible user can control his write to be a
>>>>>full-write even for
>>>>>raw disk IO. Why regular file and block device io matters here?
>>>>>
>>>>>Thanks,
>>>>>Shaohua
>>>> Another problem is the size. How to judge the size is large or not?
>>>> A syscall write is a dio and a dio may be split more bios.
>>>> For my workload, i usualy write chunk-size.
>>>> But your patch is judge by bio-size.
>>>
>>>I'd ignore workload which does sequential directIO, though
>>>your workload is, but I bet no real workloads are. So I'd like
>> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
>> and as possible as to no pre-read operation.
>>>only to consider big size random directio. I agree the size
>>>judge is arbitrary. I can optimize it to be only consider stripe
>>>which hits two or more disks in one bio, but not sure if it's
>>>worthy doing. Not ware big size directio is common, and even
>>>is, big size request IOPS is low, a bit delay maybe not a big
>>>deal.
>> If add a acc_time for 'striep_head' to control?
>> When get_active_stripe() is ok, update acc_time.
>> For some time, stripe_head did not access and it shold pre-read.
>
>Do you want to add a timer for each stripe? This is even ugly.
>How do you choose the expire time? A time works for harddisk
>definitely will not work for a fast SSD.
A time, like the size, is arbitrary.
How about adding an interface in sysfs so the user can control it?
Only the user can judge whether the workload is sequential write or random write.
* Re: [patch]raid5: fix directio regression
2012-08-09 1:20 ` Jianpeng Ma
@ 2012-08-09 1:32 ` NeilBrown
2012-08-09 2:27 ` Jianpeng Ma
2012-08-09 5:07 ` Shaohua Li
0 siblings, 2 replies; 28+ messages in thread
From: NeilBrown @ 2012-08-09 1:32 UTC (permalink / raw)
To: Jianpeng Ma; +Cc: shli, linux-raid
On Thu, 9 Aug 2012 09:20:05 +0800 "Jianpeng Ma" <majianpeng@gmail.com> wrote:
> On 2012-08-08 20:53 Shaohua Li <shli@kernel.org> Wrote:
> >2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
> >> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
> >>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> >>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
> >>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> >>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
> >>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
> >>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
> >>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
> >>>>>>>For big size request, delay can still reduce IO.
> >>>>>>>
> >>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
> >>>> [snip]
> >>>>>>>--
> >>>>>> May be used size to judge is not a good method.
> >>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
> >>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
> >>>>>> for write to full-write.
> >>>>>> But at that time, i did know how to differentiate between regular file and block-device.
> >>>>>> I thik we should do something to do this.
> >>>>>
> >>>>>I don't think it's possible user can control his write to be a
> >>>>>full-write even for
> >>>>>raw disk IO. Why regular file and block device io matters here?
> >>>>>
> >>>>>Thanks,
> >>>>>Shaohua
> >>>> Another problem is the size. How to judge the size is large or not?
> >>>> A syscall write is a dio and a dio may be split more bios.
> >>>> For my workload, i usualy write chunk-size.
> >>>> But your patch is judge by bio-size.
> >>>
> >>>I'd ignore workload which does sequential directIO, though
> >>>your workload is, but I bet no real workloads are. So I'd like
> >> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
> >> and as possible as to no pre-read operation.
> >>>only to consider big size random directio. I agree the size
> >>>judge is arbitrary. I can optimize it to be only consider stripe
> >>>which hits two or more disks in one bio, but not sure if it's
> >>>worthy doing. Not ware big size directio is common, and even
> >>>is, big size request IOPS is low, a bit delay maybe not a big
> >>>deal.
> >> If add a acc_time for 'striep_head' to control?
> >> When get_active_stripe() is ok, update acc_time.
> >> For some time, stripe_head did not access and it shold pre-read.
> >
> >Do you want to add a timer for each stripe? This is even ugly.
> >How do you choose the expire time? A time works for harddisk
> >definitely will not work for a fast SSD.
> A time is like the size which is arbitrary.
> How about add a interface in sysfs to control by user?
> Only user can judge the workload, which sequatial write or random write.
This is getting worse by the minute. A sysfs interface for this is
definitely not a good idea.
The REQ_NOIDLE flag is a pretty clear statement that no more requests that
merge with this one are expected. If some use case sends random requests,
maybe it should be setting REQ_NOIDLE.
Maybe someone should do some research and find out why WRITE_ODIRECT doesn't
include REQ_NOIDLE. Understanding that would help understand the current
problem.
NeilBrown
* Re: Re: [patch]raid5: fix directio regression
2012-08-09 1:32 ` NeilBrown
@ 2012-08-09 2:27 ` Jianpeng Ma
2012-08-09 5:07 ` Shaohua Li
1 sibling, 0 replies; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-09 2:27 UTC (permalink / raw)
To: Neil Brown; +Cc: shli, linux-raid
On 2012-08-09 09:32 NeilBrown <neilb@suse.de> Wrote:
>On Thu, 9 Aug 2012 09:20:05 +0800 "Jianpeng Ma" <majianpeng@gmail.com> wrote:
>
>> On 2012-08-08 20:53 Shaohua Li <shli@kernel.org> Wrote:
>> >2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
>> >> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
>> >>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>> >>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
>> >>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>> >>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>> >>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>> >>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>> >>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
>> >>>>>>>For big size request, delay can still reduce IO.
>> >>>>>>>
>> >>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
>> >>>> [snip]
>> >>>>>>>--
>> >>>>>> May be used size to judge is not a good method.
>> >>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
>> >>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
>> >>>>>> for write to full-write.
>> >>>>>> But at that time, i did know how to differentiate between regular file and block-device.
>> >>>>>> I thik we should do something to do this.
>> >>>>>
>> >>>>>I don't think it's possible user can control his write to be a
>> >>>>>full-write even for
>> >>>>>raw disk IO. Why regular file and block device io matters here?
>> >>>>>
>> >>>>>Thanks,
>> >>>>>Shaohua
>> >>>> Another problem is the size. How to judge the size is large or not?
>> >>>> A syscall write is a dio and a dio may be split more bios.
>> >>>> For my workload, i usualy write chunk-size.
>> >>>> But your patch is judge by bio-size.
>> >>>
>> >>>I'd ignore workload which does sequential directIO, though
>> >>>your workload is, but I bet no real workloads are. So I'd like
>> >> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
>> >> and as possible as to no pre-read operation.
>> >>>only to consider big size random directio. I agree the size
>> >>>judge is arbitrary. I can optimize it to be only consider stripe
>> >>>which hits two or more disks in one bio, but not sure if it's
>> >>>worthy doing. Not ware big size directio is common, and even
>> >>>is, big size request IOPS is low, a bit delay maybe not a big
>> >>>deal.
>> >> If add a acc_time for 'striep_head' to control?
>> >> When get_active_stripe() is ok, update acc_time.
>> >> For some time, stripe_head did not access and it shold pre-read.
>> >
>> >Do you want to add a timer for each stripe? This is even ugly.
>> >How do you choose the expire time? A time works for harddisk
>> >definitely will not work for a fast SSD.
>> A time is like the size which is arbitrary.
>> How about add a interface in sysfs to control by user?
>> Only user can judge the workload, which sequatial write or random write.
>
>This is getting worse by the minute. A sysfs interface for this is
>definitely not a good idea.
>
>The REQ_NOIDLE flag is a pretty clear statement that no more requests that
>merge with this one are expected. If some use cases sends random requests,
>maybe it should be setting REQ_NOIDLE.
>
>Maybe someone should do some research and find out why WRITE_ODIRECT doesn't
>include REQ_NOIDLE. Understanding that would help understand the current
>problem.
>
>NeilBrown
>
Hi Neil:
Thanks for your suggestion.
A direct write can set REQ_NOIDLE, because only after this write operation finishes can the next one be issued.
But a direct write (struct dio) can be broken up into several bios (struct bio).
Those bios are related, so they should not set REQ_NOIDLE except on the last bio.
I think this may improve performance, because a random direct write is at most one bio?
* Re: [patch]raid5: fix directio regression
2012-08-09 1:32 ` NeilBrown
2012-08-09 2:27 ` Jianpeng Ma
@ 2012-08-09 5:07 ` Shaohua Li
2012-08-14 6:33 ` [patch v2]raid5: " Shaohua Li
1 sibling, 1 reply; 28+ messages in thread
From: Shaohua Li @ 2012-08-09 5:07 UTC (permalink / raw)
To: NeilBrown; +Cc: Jianpeng Ma, linux-raid, Jens Axboe
2012/8/9 NeilBrown <neilb@suse.de>:
> On Thu, 9 Aug 2012 09:20:05 +0800 "Jianpeng Ma" <majianpeng@gmail.com> wrote:
>
>> On 2012-08-08 20:53 Shaohua Li <shli@kernel.org> Wrote:
>> >2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
>> >> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
>> >>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>> >>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
>> >>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>> >>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>> >>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>> >>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>> >>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
>> >>>>>>>For big size request, delay can still reduce IO.
>> >>>>>>>
>> >>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
>> >>>> [snip]
>> >>>>>>>--
>> >>>>>> May be used size to judge is not a good method.
>> >>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
>> >>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
>> >>>>>> for write to full-write.
>> >>>>>> But at that time, i did know how to differentiate between regular file and block-device.
>> >>>>>> I thik we should do something to do this.
>> >>>>>
>> >>>>>I don't think it's possible user can control his write to be a
>> >>>>>full-write even for
>> >>>>>raw disk IO. Why regular file and block device io matters here?
>> >>>>>
>> >>>>>Thanks,
>> >>>>>Shaohua
>> >>>> Another problem is the size. How to judge the size is large or not?
>> >>>> A syscall write is a dio and a dio may be split more bios.
>> >>>> For my workload, i usualy write chunk-size.
>> >>>> But your patch is judge by bio-size.
>> >>>
>> >>>I'd ignore workload which does sequential directIO, though
>> >>>your workload is, but I bet no real workloads are. So I'd like
>> >> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
>> >> and as possible as to no pre-read operation.
>> >>>only to consider big size random directio. I agree the size
>> >>>judge is arbitrary. I can optimize it to be only consider stripe
>> >>>which hits two or more disks in one bio, but not sure if it's
>> >>>worthy doing. Not ware big size directio is common, and even
>> >>>is, big size request IOPS is low, a bit delay maybe not a big
>> >>>deal.
>> >> If add a acc_time for 'striep_head' to control?
>> >> When get_active_stripe() is ok, update acc_time.
>> >> For some time, stripe_head did not access and it shold pre-read.
>> >
>> >Do you want to add a timer for each stripe? This is even ugly.
>> >How do you choose the expire time? A time works for harddisk
>> >definitely will not work for a fast SSD.
>> A time is like the size which is arbitrary.
>> How about add a interface in sysfs to control by user?
>> Only user can judge the workload, which sequatial write or random write.
>
> This is getting worse by the minute. A sysfs interface for this is
> definitely not a good idea.
>
> The REQ_NOIDLE flag is a pretty clear statement that no more requests that
> merge with this one are expected. If some use cases sends random requests,
> maybe it should be setting REQ_NOIDLE.
>
> Maybe someone should do some research and find out why WRITE_ODIRECT doesn't
> include REQ_NOIDLE. Understanding that would help understand the current
> problem.
A quick search shows only cfq-iosched uses REQ_NOIDLE. In
cfq, a queue is idled to avoid losing its share. REQ_NOIDLE
tells cfq not to idle, since the task will not dispatch any further
requests. Note this has nothing to do with merging.
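For reference, the relevant write-type flag combinations in include/linux/fs.h of kernels from that era looked roughly like the following (reproduced from memory, so treat the exact composition as approximate); the point is that O_DIRECT writes carry REQ_SYNC but not REQ_NOIDLE:

/* approximate 3.x-era definitions, for illustration only */
#define WRITE_SYNC	(WRITE | REQ_SYNC | REQ_NOIDLE)
#define WRITE_ODIRECT	(WRITE | REQ_SYNC)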
* [patch v2]raid5: fix directio regression
2012-08-09 5:07 ` Shaohua Li
@ 2012-08-14 6:33 ` Shaohua Li
2012-08-15 0:56 ` NeilBrown
0 siblings, 1 reply; 28+ messages in thread
From: Shaohua Li @ 2012-08-14 6:33 UTC (permalink / raw)
To: NeilBrown; +Cc: Jianpeng Ma, linux-raid, Jens Axboe
On Thu, Aug 09, 2012 at 01:07:01PM +0800, Shaohua Li wrote:
> 2012/8/9 NeilBrown <neilb@suse.de>:
> > On Thu, 9 Aug 2012 09:20:05 +0800 "Jianpeng Ma" <majianpeng@gmail.com> wrote:
> >
> >> On 2012-08-08 20:53 Shaohua Li <shli@kernel.org> Wrote:
> >> >2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
> >> >> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
> >> >>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> >> >>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
> >> >>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> >> >>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
> >> >>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
> >> >>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
> >> >>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
> >> >>>>>>>For big size request, delay can still reduce IO.
> >> >>>>>>>
> >> >>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
> >> >>>> [snip]
> >> >>>>>>>--
> >> >>>>>> May be used size to judge is not a good method.
> >> >>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
> >> >>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
> >> >>>>>> for write to full-write.
> >> >>>>>> But at that time, i did know how to differentiate between regular file and block-device.
> >> >>>>>> I thik we should do something to do this.
> >> >>>>>
> >> >>>>>I don't think it's possible user can control his write to be a
> >> >>>>>full-write even for
> >> >>>>>raw disk IO. Why regular file and block device io matters here?
> >> >>>>>
> >> >>>>>Thanks,
> >> >>>>>Shaohua
> >> >>>> Another problem is the size. How to judge the size is large or not?
> >> >>>> A syscall write is a dio and a dio may be split more bios.
> >> >>>> For my workload, i usualy write chunk-size.
> >> >>>> But your patch is judge by bio-size.
> >> >>>
> >> >>>I'd ignore workload which does sequential directIO, though
> >> >>>your workload is, but I bet no real workloads are. So I'd like
> >> >> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
> >> >> and as possible as to no pre-read operation.
> >> >>>only to consider big size random directio. I agree the size
> >> >>>judge is arbitrary. I can optimize it to be only consider stripe
> >> >>>which hits two or more disks in one bio, but not sure if it's
> >> >>>worthy doing. Not ware big size directio is common, and even
> >> >>>is, big size request IOPS is low, a bit delay maybe not a big
> >> >>>deal.
> >> >> If add a acc_time for 'striep_head' to control?
> >> >> When get_active_stripe() is ok, update acc_time.
> >> >> For some time, stripe_head did not access and it shold pre-read.
> >> >
> >> >Do you want to add a timer for each stripe? This is even ugly.
> >> >How do you choose the expire time? A time works for harddisk
> >> >definitely will not work for a fast SSD.
> >> A time is like the size which is arbitrary.
> >> How about add a interface in sysfs to control by user?
> >> Only user can judge the workload, which sequatial write or random write.
> >
> > This is getting worse by the minute. A sysfs interface for this is
> > definitely not a good idea.
> >
> > The REQ_NOIDLE flag is a pretty clear statement that no more requests that
> > merge with this one are expected. If some use cases sends random requests,
> > maybe it should be setting REQ_NOIDLE.
> >
> > Maybe someone should do some research and find out why WRITE_ODIRECT doesn't
> > include REQ_NOIDLE. Understanding that would help understand the current
> > problem.
>
> A quick search shows only cfq-iosched uses REQ_NOIDLE. In
> cfq, a queue is idled to avoid losing its share. REQ_NOIDLE
> tells cfq to avoid idle, since the task will not dispatch further
> requests any more. Note this isn't no merge.
Since REQ_NOIDLE has no relationship with request merging, we'd better remove it.
I came up with a new patch, which doesn't depend on the request size any more. With
this patch, sequential directIO will still cause unnecessary raid5 preread
(especially for small IO), but I bet no app does sequential small-size
directIO.
Thanks,
Shaohua
Subject: raid5: fix directio regression
My directIO 4k random-write workload shows a 10~20% regression caused by commit
895e3c5c58a80bb. That commit isn't friendly to small random IO, because
delaying such requests has no advantage.
DirectIO is usually random IO, and I think we can ignore request merging between
bios from different io_submit calls. So we only need to consider a single bio that
can drive unnecessary preread in raid5, which is a large request. If a bio is large
enough that some of its stripes will be accessed on two or more disks, such stripes
should be delayed, to avoid unnecessary preread, until the part of the bio for the
last disk of the stripe has been added.
REQ_NOIDLE says nothing about request merging, so I deleted it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
drivers/md/raid5.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c 2012-08-13 15:03:16.479473326 +0800
+++ linux/drivers/md/raid5.c 2012-08-14 11:10:37.335982170 +0800
@@ -4076,6 +4076,7 @@ static void make_request(struct mddev *m
struct stripe_head *sh;
const int rw = bio_data_dir(bi);
int remaining;
+ int chunk_sectors;
if (unlikely(bi->bi_rw & REQ_FLUSH)) {
md_flush_request(mddev, bi);
@@ -4089,6 +4090,11 @@ static void make_request(struct mddev *m
chunk_aligned_read(mddev,bi))
return;
+ if (mddev->new_chunk_sectors < mddev->chunk_sectors)
+ chunk_sectors = mddev->new_chunk_sectors;
+ else
+ chunk_sectors = mddev->chunk_sectors;
+
logical_sector = bi->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
last_sector = bi->bi_sector + (bi->bi_size>>9);
bi->bi_next = NULL;
@@ -4192,7 +4198,8 @@ static void make_request(struct mddev *m
finish_wait(&conf->wait_for_overlap, &w);
set_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
- if ((bi->bi_rw & REQ_NOIDLE) &&
+ if ((bi->bi_rw & REQ_SYNC) &&
+ (last_sector - logical_sector < chunk_sectors) &&
!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
release_stripe_plug(mddev, sh);
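Again for illustration, a stand-alone sketch of the v2 test with worked numbers; the 512 KiB chunk is an assumed example and this is not kernel code. A REQ_SYNC bio shorter than one chunk gets STRIPE_PREREAD_ACTIVE immediately; a bio of a chunk or more leaves its stripes delayed.

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the v2 condition: (bi->bi_rw & REQ_SYNC) &&
 * (last_sector - logical_sector < chunk_sectors). */
static bool preread_now(bool req_sync, unsigned long bio_sectors,
			unsigned int chunk_sectors)
{
	return req_sync && bio_sectors < chunk_sectors;
}

int main(void)
{
	unsigned int chunk_sectors = 1024;	/* assumed example: 512 KiB chunk */

	/* 4 KiB O_DIRECT write: 8 sectors -> preread active immediately */
	printf("4KiB: %d\n", preread_now(true, 8, chunk_sectors));
	/* 1 MiB O_DIRECT write: 2048 sectors -> stripes stay delayed */
	printf("1MiB: %d\n", preread_now(true, 2048, chunk_sectors));
	return 0;
}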
* Re: [patch v2]raid5: fix directio regression
2012-08-14 6:33 ` [patch v2]raid5: " Shaohua Li
@ 2012-08-15 0:56 ` NeilBrown
2012-08-15 1:20 ` kedacomkernel
2012-08-15 1:44 ` Shaohua Li
0 siblings, 2 replies; 28+ messages in thread
From: NeilBrown @ 2012-08-15 0:56 UTC (permalink / raw)
To: Shaohua Li; +Cc: Jianpeng Ma, linux-raid, Jens Axboe
On Tue, 14 Aug 2012 14:33:43 +0800 Shaohua Li <shli@kernel.org> wrote:
> On Thu, Aug 09, 2012 at 01:07:01PM +0800, Shaohua Li wrote:
> > 2012/8/9 NeilBrown <neilb@suse.de>:
> > > On Thu, 9 Aug 2012 09:20:05 +0800 "Jianpeng Ma" <majianpeng@gmail.com> wrote:
> > >
> > >> On 2012-08-08 20:53 Shaohua Li <shli@kernel.org> Wrote:
> > >> >2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
> > >> >> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
> > >> >>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> > >> >>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
> > >> >>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> > >> >>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
> > >> >>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
> > >> >>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
> > >> >>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
> > >> >>>>>>>For big size request, delay can still reduce IO.
> > >> >>>>>>>
> > >> >>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
> > >> >>>> [snip]
> > >> >>>>>>>--
> > >> >>>>>> May be used size to judge is not a good method.
> > >> >>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
> > >> >>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
> > >> >>>>>> for write to full-write.
> > >> >>>>>> But at that time, i did know how to differentiate between regular file and block-device.
> > >> >>>>>> I thik we should do something to do this.
> > >> >>>>>
> > >> >>>>>I don't think it's possible user can control his write to be a
> > >> >>>>>full-write even for
> > >> >>>>>raw disk IO. Why regular file and block device io matters here?
> > >> >>>>>
> > >> >>>>>Thanks,
> > >> >>>>>Shaohua
> > >> >>>> Another problem is the size. How to judge the size is large or not?
> > >> >>>> A syscall write is a dio and a dio may be split more bios.
> > >> >>>> For my workload, i usualy write chunk-size.
> > >> >>>> But your patch is judge by bio-size.
> > >> >>>
> > >> >>>I'd ignore workload which does sequential directIO, though
> > >> >>>your workload is, but I bet no real workloads are. So I'd like
> > >> >> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
> > >> >> and as possible as to no pre-read operation.
> > >> >>>only to consider big size random directio. I agree the size
> > >> >>>judge is arbitrary. I can optimize it to be only consider stripe
> > >> >>>which hits two or more disks in one bio, but not sure if it's
> > >> >>>worthy doing. Not ware big size directio is common, and even
> > >> >>>is, big size request IOPS is low, a bit delay maybe not a big
> > >> >>>deal.
> > >> >> If add a acc_time for 'striep_head' to control?
> > >> >> When get_active_stripe() is ok, update acc_time.
> > >> >> For some time, stripe_head did not access and it shold pre-read.
> > >> >
> > >> >Do you want to add a timer for each stripe? This is even ugly.
> > >> >How do you choose the expire time? A time works for harddisk
> > >> >definitely will not work for a fast SSD.
> > >> A time is like the size which is arbitrary.
> > >> How about add a interface in sysfs to control by user?
> > >> Only user can judge the workload, which sequatial write or random write.
> > >
> > > This is getting worse by the minute. A sysfs interface for this is
> > > definitely not a good idea.
> > >
> > > The REQ_NOIDLE flag is a pretty clear statement that no more requests that
> > > merge with this one are expected. If some use cases sends random requests,
> > > maybe it should be setting REQ_NOIDLE.
> > >
> > > Maybe someone should do some research and find out why WRITE_ODIRECT doesn't
> > > include REQ_NOIDLE. Understanding that would help understand the current
> > > problem.
> >
> > A quick search shows only cfq-iosched uses REQ_NOIDLE. In
> > cfq, a queue is idled to avoid losing its share. REQ_NOIDLE
> > tells cfq to avoid idle, since the task will not dispatch further
> > requests any more. Note this isn't no merge.
>
> Since REQ_NOIDLE has no relationship with request merge, we'd better remove it.
> I came out a new patch, which doesn't depend on request size any more. With
> this patch, sequential directio will still introduce unnecessary raid5 preread
> (especially for small size IO), but I bet no app does sequential small size
> directIO.
>
> Thanks,
> Shaohua
>
> Subject: raid5: fix directio regression
>
> My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
> 895e3c5c58a80bb. This commit isn't friendly for small size random IO, because
> delaying such request hasn't any advantages.
>
> DirectIO usually is random IO. I thought we can ignore request merge between
> bios from different io_submit. So we only consider one bio which can drive
> unnecessary preread in raid5, which is large request. If a bio is large enough
> and some of its stripes will access two or more disks, such stripes should be
> delayed to avoid unnecessary preread till bio for the last disk of the strips
> is added.
>
> REQ_NOIDLE doesn't mean about request merge, I deleted it.
Hi,
Have you tested what effect this has on large sequential direct writes?
Because it doesn't make sense to me and I would be surprised if it improves
things.
You are delaying setting the STRIPE_PREREAD_ACTIVE bit until you think you
have submitted all the writes from this bio that apply to the given stripe.
That does make some sense, however it doesn't seem to deal with the
possibility that the one bio covers parts of two different stripes. In that
case the first stripe never gets STRIPE_PREREAD_ACTIVE set, so it is delayed
despite having 'REQ_SYNC' set.
Also, and more significantly, plugging should mean that the various
stripe_heads are not even looked at until all of the original bio is
processed, so while STRIPE_PREREAD_ACTIVE might get set early, it should not
get processed until the whole bio is processed and the queue is unplugged.
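The submission pattern being referred to is roughly the following fragment; it is a sketch of the generic block-layer plugging pattern, not the actual dio code.

struct blk_plug plug;

blk_start_plug(&plug);
/* the submitter issues every bio of the request while plugged;
 * raid5's make_request() parks the stripe_heads via
 * release_stripe_plug() instead of handling them right away */
submit_bio(WRITE, bio);		/* ... repeated for each bio ... */
blk_finish_plug(&plug);		/* unplug: only now are the queued
				 * stripes handed to the raid5 thread */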
So I don't think this patch should make a difference on large direct writes,
and if it does then something strange is going on that I'd like to
understand first.
I suspect that the original patch should be reverted because while it does
improve one case, it causes a regression in another and regressions should
be avoided. It would be nice to find a way for both to go fast though...
Thanks,
NeilBrown
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
> drivers/md/raid5.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c 2012-08-13 15:03:16.479473326 +0800
> +++ linux/drivers/md/raid5.c 2012-08-14 11:10:37.335982170 +0800
> @@ -4076,6 +4076,7 @@ static void make_request(struct mddev *m
> struct stripe_head *sh;
> const int rw = bio_data_dir(bi);
> int remaining;
> + int chunk_sectors;
>
> if (unlikely(bi->bi_rw & REQ_FLUSH)) {
> md_flush_request(mddev, bi);
> @@ -4089,6 +4090,11 @@ static void make_request(struct mddev *m
> chunk_aligned_read(mddev,bi))
> return;
>
> + if (mddev->new_chunk_sectors < mddev->chunk_sectors)
> + chunk_sectors = mddev->new_chunk_sectors;
> + else
> + chunk_sectors = mddev->chunk_sectors;
> +
> logical_sector = bi->bi_sector & ~((sector_t)STRIPE_SECTORS-1);
> last_sector = bi->bi_sector + (bi->bi_size>>9);
> bi->bi_next = NULL;
> @@ -4192,7 +4198,8 @@ static void make_request(struct mddev *m
> finish_wait(&conf->wait_for_overlap, &w);
> set_bit(STRIPE_HANDLE, &sh->state);
> clear_bit(STRIPE_DELAYED, &sh->state);
> - if ((bi->bi_rw & REQ_NOIDLE) &&
> + if ((bi->bi_rw & REQ_SYNC) &&
> + (last_sector - logical_sector < chunk_sectors) &&
> !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> atomic_inc(&conf->preread_active_stripes);
> release_stripe_plug(mddev, sh);
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-15 0:56 ` NeilBrown
@ 2012-08-15 1:20 ` kedacomkernel
2012-08-15 1:44 ` Shaohua Li
1 sibling, 0 replies; 28+ messages in thread
From: kedacomkernel @ 2012-08-15 1:20 UTC (permalink / raw)
To: Neil Brown, shli; +Cc: majianpeng, linux-raid, axboe
On 2012-08-15 08:56 NeilBrown <neilb@suse.de> Wrote:
>On Tue, 14 Aug 2012 14:33:43 +0800 Shaohua Li <shli@kernel.org> wrote:
>
>> On Thu, Aug 09, 2012 at 01:07:01PM +0800, Shaohua Li wrote:
>> > 2012/8/9 NeilBrown <neilb@suse.de>:
>> > > On Thu, 9 Aug 2012 09:20:05 +0800 "Jianpeng Ma" <majianpeng@gmail.com> wrote:
>> > >
>> > >> On 2012-08-08 20:53 Shaohua Li <shli@kernel.org> Wrote:
>> > >> >2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
>> > >> >> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
>> > >> >>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>> > >> >>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
>> > >> >>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
>> > >> >>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
>> > >> >>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>> > >> >>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
>> > >> >>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
>> > >> >>>>>>>For big size request, delay can still reduce IO.
>> > >> >>>>>>>
>> > >> >>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
>> > >> >>>> [snip]
>> > >> >>>>>>>--
>> > >> >>>>>> May be used size to judge is not a good method.
>> > >> >>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
>> > >> >>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
>> > >> >>>>>> for write to full-write.
>> > >> >>>>>> But at that time, i did know how to differentiate between regular file and block-device.
>> > >> >>>>>> I thik we should do something to do this.
>> > >> >>>>>
>> > >> >>>>>I don't think it's possible user can control his write to be a
>> > >> >>>>>full-write even for
>> > >> >>>>>raw disk IO. Why regular file and block device io matters here?
>> > >> >>>>>
>> > >> >>>>>Thanks,
>> > >> >>>>>Shaohua
>> > >> >>>> Another problem is the size. How to judge the size is large or not?
>> > >> >>>> A syscall write is a dio and a dio may be split more bios.
>> > >> >>>> For my workload, i usualy write chunk-size.
>> > >> >>>> But your patch is judge by bio-size.
>> > >> >>>
>> > >> >>>I'd ignore workload which does sequential directIO, though
>> > >> >>>your workload is, but I bet no real workloads are. So I'd like
>> > >> >> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
>> > >> >> and as possible as to no pre-read operation.
>> > >> >>>only to consider big size random directio. I agree the size
>> > >> >>>judge is arbitrary. I can optimize it to be only consider stripe
>> > >> >>>which hits two or more disks in one bio, but not sure if it's
>> > >> >>>worthy doing. Not ware big size directio is common, and even
>> > >> >>>is, big size request IOPS is low, a bit delay maybe not a big
>> > >> >>>deal.
>> > >> >> If add a acc_time for 'striep_head' to control?
>> > >> >> When get_active_stripe() is ok, update acc_time.
>> > >> >> For some time, stripe_head did not access and it shold pre-read.
>> > >> >
>> > >> >Do you want to add a timer for each stripe? This is even ugly.
>> > >> >How do you choose the expire time? A time works for harddisk
>> > >> >definitely will not work for a fast SSD.
>> > >> A time is like the size which is arbitrary.
>> > >> How about add a interface in sysfs to control by user?
>> > >> Only user can judge the workload, which sequatial write or random write.
>> > >
>> > > This is getting worse by the minute. A sysfs interface for this is
>> > > definitely not a good idea.
>> > >
>> > > The REQ_NOIDLE flag is a pretty clear statement that no more requests that
>> > > merge with this one are expected. If some use cases sends random requests,
>> > > maybe it should be setting REQ_NOIDLE.
>> > >
>> > > Maybe someone should do some research and find out why WRITE_ODIRECT doesn't
>> > > include REQ_NOIDLE. Understanding that would help understand the current
>> > > problem.
>> >
>> > A quick search shows only cfq-iosched uses REQ_NOIDLE. In
>> > cfq, a queue is idled to avoid losing its share. REQ_NOIDLE
>> > tells cfq to avoid idle, since the task will not dispatch further
>> > requests any more. Note this isn't no merge.
>>
>> Since REQ_NOIDLE has no relationship with request merge, we'd better remove it.
>> I came out a new patch, which doesn't depend on request size any more. With
>> this patch, sequential directio will still introduce unnecessary raid5 preread
>> (especially for small size IO), but I bet no app does sequential small size
>> directIO.
>>
>> Thanks,
>> Shaohua
>>
>> Subject: raid5: fix directio regression
>>
>> My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
>> 895e3c5c58a80bb. This commit isn't friendly for small size random IO, because
>> delaying such request hasn't any advantages.
>>
>> DirectIO usually is random IO. I thought we can ignore request merge between
>> bios from different io_submit. So we only consider one bio which can drive
>> unnecessary preread in raid5, which is large request. If a bio is large enough
>> and some of its stripes will access two or more disks, such stripes should be
>> delayed to avoid unnecessary preread till bio for the last disk of the strips
>> is added.
>>
>> REQ_NOIDLE doesn't mean about request merge, I deleted it.
>
>Hi,
> Have you tested what effect this has on large sequential direct writes?
> Because it don't make sense to me and I would be surprised if it improves
> things.
>
> You are delaying setting the STRIPE_PREREAD_ACTIVE bit until you think you
> have submitted all the writes from this bio that apply to the give stripe.
> That does make some sense, however it doesn't seem to deal with the
> possibility that the one bio covers parts of two different stripes. In that
> case the first stripe never gets STRIPE_PREREAD_ACTIVE set, so it is delayed
> despite having 'REQ_SYNC' set.
>
> Also, and more significantly, plugging should mean that the various
> stripe_heads are not even looked at until all of the original bio is
> processed, so while STRIPE_PREREAD_ACTIVE might get set early, it should not
> get processed until the whole bio is processed and the queue is unplugged.
>
> So I don't think this patch should make a difference on large direct writes,
> and if it does then something strange is going on that I'd like to
> understand first.
>
> I suspect that the original patch should be reverted because while it does
> improve one case, it causes a regression in another and regressions should
> be avoided. It would be nice to find a way for both to go fast though...
>
>Thanks,
>NeilBrown
Hi all:
In the md layer we can hardly tell a large sequential direct write from a random (mostly small-size) write by
looking at a single bio.
I still hold my opinion: the judgement should be made in the fs layer, not per bio. One direct write can send several bios to the md driver.
Those bios are sequential. So if the last bio set a flag (REQ_NOIDLE was my suggestion) telling the md driver it is the last one, the driver would know that no further bio will arrive until the previous ones complete.
This may be good for a single process, but for multiple threads I think it may not be good.
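
A minimal sketch of the idea being proposed, assuming the direct-IO layer knows which bio of a write is the last one; the helper name and the is_last flag are invented for illustration and this is not existing kernel code:

#include <linux/bio.h>

/* Tag the final bio of a direct write so the md driver knows that no
 * further bio for this write will arrive until the earlier ones complete. */
static void dio_submit_bio_sketch(int rw, struct bio *bio, bool is_last)
{
	if (is_last)
		bio->bi_rw |= REQ_NOIDLE;	/* or a new dedicated hint flag */
	submit_bio(rw, bio);
}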
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch v2]raid5: fix directio regression
2012-08-15 0:56 ` NeilBrown
2012-08-15 1:20 ` kedacomkernel
@ 2012-08-15 1:44 ` Shaohua Li
2012-08-15 1:54 ` Jianpeng Ma
2012-08-16 7:36 ` Jianpeng Ma
1 sibling, 2 replies; 28+ messages in thread
From: Shaohua Li @ 2012-08-15 1:44 UTC (permalink / raw)
To: NeilBrown; +Cc: Jianpeng Ma, linux-raid, Jens Axboe
On Wed, Aug 15, 2012 at 10:56:10AM +1000, NeilBrown wrote:
> On Tue, 14 Aug 2012 14:33:43 +0800 Shaohua Li <shli@kernel.org> wrote:
>
> > On Thu, Aug 09, 2012 at 01:07:01PM +0800, Shaohua Li wrote:
> > > 2012/8/9 NeilBrown <neilb@suse.de>:
> > > > On Thu, 9 Aug 2012 09:20:05 +0800 "Jianpeng Ma" <majianpeng@gmail.com> wrote:
> > > >
> > > >> On 2012-08-08 20:53 Shaohua Li <shli@kernel.org> Wrote:
> > > >> >2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
> > > >> >> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
> > > >> >>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> > > >> >>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
> > > >> >>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> > > >> >>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
> > > >> >>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
> > > >> >>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
> > > >> >>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
> > > >> >>>>>>>For big size request, delay can still reduce IO.
> > > >> >>>>>>>
> > > >> >>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
> > > >> >>>> [snip]
> > > >> >>>>>>>--
> > > >> >>>>>> May be used size to judge is not a good method.
> > > >> >>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
> > > >> >>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
> > > >> >>>>>> for write to full-write.
> > > >> >>>>>> But at that time, i did know how to differentiate between regular file and block-device.
> > > >> >>>>>> I thik we should do something to do this.
> > > >> >>>>>
> > > >> >>>>>I don't think it's possible user can control his write to be a
> > > >> >>>>>full-write even for
> > > >> >>>>>raw disk IO. Why regular file and block device io matters here?
> > > >> >>>>>
> > > >> >>>>>Thanks,
> > > >> >>>>>Shaohua
> > > >> >>>> Another problem is the size. How to judge the size is large or not?
> > > >> >>>> A syscall write is a dio and a dio may be split more bios.
> > > >> >>>> For my workload, i usualy write chunk-size.
> > > >> >>>> But your patch is judge by bio-size.
> > > >> >>>
> > > >> >>>I'd ignore workload which does sequential directIO, though
> > > >> >>>your workload is, but I bet no real workloads are. So I'd like
> > > >> >> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
> > > >> >> and as possible as to no pre-read operation.
> > > >> >>>only to consider big size random directio. I agree the size
> > > >> >>>judge is arbitrary. I can optimize it to be only consider stripe
> > > >> >>>which hits two or more disks in one bio, but not sure if it's
> > > >> >>>worthy doing. Not ware big size directio is common, and even
> > > >> >>>is, big size request IOPS is low, a bit delay maybe not a big
> > > >> >>>deal.
> > > >> >> If add a acc_time for 'striep_head' to control?
> > > >> >> When get_active_stripe() is ok, update acc_time.
> > > >> >> For some time, stripe_head did not access and it shold pre-read.
> > > >> >
> > > >> >Do you want to add a timer for each stripe? This is even ugly.
> > > >> >How do you choose the expire time? A time works for harddisk
> > > >> >definitely will not work for a fast SSD.
> > > >> A time is like the size which is arbitrary.
> > > >> How about add a interface in sysfs to control by user?
> > > >> Only user can judge the workload, which sequatial write or random write.
> > > >
> > > > This is getting worse by the minute. A sysfs interface for this is
> > > > definitely not a good idea.
> > > >
> > > > The REQ_NOIDLE flag is a pretty clear statement that no more requests that
> > > > merge with this one are expected. If some use cases sends random requests,
> > > > maybe it should be setting REQ_NOIDLE.
> > > >
> > > > Maybe someone should do some research and find out why WRITE_ODIRECT doesn't
> > > > include REQ_NOIDLE. Understanding that would help understand the current
> > > > problem.
> > >
> > > A quick search shows only cfq-iosched uses REQ_NOIDLE. In
> > > cfq, a queue is idled to avoid losing its share. REQ_NOIDLE
> > > tells cfq to avoid idle, since the task will not dispatch further
> > > requests any more. Note this isn't no merge.
> >
> > Since REQ_NOIDLE has no relationship with request merge, we'd better remove it.
> > I came out a new patch, which doesn't depend on request size any more. With
> > this patch, sequential directio will still introduce unnecessary raid5 preread
> > (especially for small size IO), but I bet no app does sequential small size
> > directIO.
> >
> > Thanks,
> > Shaohua
> >
> > Subject: raid5: fix directio regression
> >
> > My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
> > 895e3c5c58a80bb. This commit isn't friendly for small size random IO, because
> > delaying such request hasn't any advantages.
> >
> > DirectIO usually is random IO. I thought we can ignore request merge between
> > bios from different io_submit. So we only consider one bio which can drive
> > unnecessary preread in raid5, which is large request. If a bio is large enough
> > and some of its stripes will access two or more disks, such stripes should be
> > delayed to avoid unnecessary preread till bio for the last disk of the strips
> > is added.
> >
> > REQ_NOIDLE doesn't mean about request merge, I deleted it.
>
> Hi,
> Have you tested what effect this has on large sequential direct writes?
> Because it don't make sense to me and I would be surprised if it improves
> things.
>
> You are delaying setting the STRIPE_PREREAD_ACTIVE bit until you think you
> have submitted all the writes from this bio that apply to the give stripe.
> That does make some sense, however it doesn't seem to deal with the
> possibility that the one bio covers parts of two different stripes. In that
> case the first stripe never gets STRIPE_PREREAD_ACTIVE set, so it is delayed
> despite having 'REQ_SYNC' set.
I didn't get your point. Isn't last_sector - logical_sector < chunk_sectors true
in the case?
> Also, and more significantly, plugging should mean that the various
> stripe_heads are not even looked at until all of the original bio is
> processed, so while STRIPE_PREREAD_ACTIVE might get set early, it should not
> get processed until the whole bio is processed and the queue is unplugged.
>
> So I don't think this patch should make a difference on large direct writes,
> and if it does then something strange is going on that I'd like to
> understand first.
Aha, ok, this makes sense. The recent delayed stripe release should make the
problem go away. So Jianpeng, can you please try your workload on a recent
kernel with the commit reverted?
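
For readers unfamiliar with the plugging Neil refers to, here is a rough sketch (not the raid5 code itself) of how a per-task plug defers queued work until the whole bio has been walked:

#include <linux/blkdev.h>

static void plug_illustration(void)
{
	struct blk_plug plug;

	blk_start_plug(&plug);
	/* ... walk the whole bio, parking each stripe_head on the plug ... */
	blk_finish_plug(&plug);	/* queued stripes are only handled here, at unplug */
}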
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-15 1:44 ` Shaohua Li
@ 2012-08-15 1:54 ` Jianpeng Ma
2012-08-16 7:36 ` Jianpeng Ma
1 sibling, 0 replies; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-15 1:54 UTC (permalink / raw)
To: shli, Neil Brown; +Cc: linux-raid, axboe
On 2012-08-15 09:44 Shaohua Li <shli@kernel.org> Wrote:
>On Wed, Aug 15, 2012 at 10:56:10AM +1000, NeilBrown wrote:
>> [snip]
>>
>> Hi,
>> Have you tested what effect this has on large sequential direct writes?
>> Because it don't make sense to me and I would be surprised if it improves
>> things.
>>
>> You are delaying setting the STRIPE_PREREAD_ACTIVE bit until you think you
>> have submitted all the writes from this bio that apply to the give stripe.
>> That does make some sense, however it doesn't seem to deal with the
>> possibility that the one bio covers parts of two different stripes. In that
>> case the first stripe never gets STRIPE_PREREAD_ACTIVE set, so it is delayed
>> despite having 'REQ_SYNC' set.
>
>I didn't get your point. Isn't last_sector - logical_sector < chunk_sectors true
>in the case?
>
>> Also, and more significantly, plugging should mean that the various
>> stripe_heads are not even looked at until all of the original bio is
>> processed, so while STRIPE_PREREAD_ACTIVE might get set early, it should not
>> get processed until the whole bio is processed and the queue is unplugged.
>>
>> So I don't think this patch should make a difference on large direct writes,
>> and if it does then something strange is going on that I'd like to
>> understand first.
>
>Aha, ok, this makes sense. The recent delayed stripe release should make the
>problem go away. So Jianpeng, can you please try your workload on a recent
>kernel with the commit reverted?
Ok.
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-15 1:44 ` Shaohua Li
2012-08-15 1:54 ` Jianpeng Ma
@ 2012-08-16 7:36 ` Jianpeng Ma
2012-08-16 9:42 ` Shaohua Li
1 sibling, 1 reply; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-16 7:36 UTC (permalink / raw)
To: shli, Neil Brown; +Cc: linux-raid, axboe
On 2012-08-15 09:44 Shaohua Li <shli@kernel.org> Wrote:
>On Wed, Aug 15, 2012 at 10:56:10AM +1000, NeilBrown wrote:
>> [snip]
>>
>> Hi,
>> Have you tested what effect this has on large sequential direct writes?
>> Because it don't make sense to me and I would be surprised if it improves
>> things.
>>
>> You are delaying setting the STRIPE_PREREAD_ACTIVE bit until you think you
>> have submitted all the writes from this bio that apply to the give stripe.
>> That does make some sense, however it doesn't seem to deal with the
>> possibility that the one bio covers parts of two different stripes. In that
>> case the first stripe never gets STRIPE_PREREAD_ACTIVE set, so it is delayed
>> despite having 'REQ_SYNC' set.
>
>I didn't get your point. Isn't last_sector - logical_sector < chunk_sectors true
>in the case?
>
>> Also, and more significantly, plugging should mean that the various
>> stripe_heads are not even looked at until all of the original bio is
>> processed, so while STRIPE_PREREAD_ACTIVE might get set early, it should not
>> get processed until the whole bio is processed and the queue is unplugged.
>>
>> So I don't think this patch should make a difference on large direct writes,
>> and if it does then something strange is going on that I'd like to
>> understand first.
>
>Aha, ok, this makes sense. The recent delayed stripe release should make the
>problem go away. So Jianpeng, can you please try your workload on a recent
>kernel with the commit reverted?
>
I tested your patch with my workload.
As Neil said, the performance does not regress.
But if the code is:
> if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> 	release_stripe(sh);
> else
> 	release_stripe_plug(mddev, sh);
the speed is about 76MB/s. With the code as it is in your patch, the speed is 200MB/s.
BTW, why have you and Neil not answered my suggestion of adding REQ_NOIDLE to the last bio of one dio?
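
For readers following the comparison above, an annotated copy of the quoted snippet; the comments describe what the two 3.x raid5 helpers do (a sketch, not a proposed patch):

	if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
		/* hand the stripe straight back for handling, bypassing the plug */
		release_stripe(sh);
	else
		/* park the stripe on the per-task plug; it is handled in a
		 * batch when the queue is unplugged */
		release_stripe_plug(mddev, sh);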
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-16 7:36 ` Jianpeng Ma
@ 2012-08-16 9:42 ` Shaohua Li
2012-08-17 1:00 ` Jianpeng Ma
2012-08-23 6:08 ` Shaohua Li
0 siblings, 2 replies; 28+ messages in thread
From: Shaohua Li @ 2012-08-16 9:42 UTC (permalink / raw)
To: Jianpeng Ma; +Cc: Neil Brown, linux-raid, axboe
2012/8/16 Jianpeng Ma <majianpeng@gmail.com>:
> On 2012-08-15 09:44 Shaohua Li <shli@kernel.org> Wrote:
>>On Wed, Aug 15, 2012 at 10:56:10AM +1000, NeilBrown wrote:
>>> [snip]
>>>
>>> Hi,
>>> Have you tested what effect this has on large sequential direct writes?
>>> Because it don't make sense to me and I would be surprised if it improves
>>> things.
>>>
>>> You are delaying setting the STRIPE_PREREAD_ACTIVE bit until you think you
>>> have submitted all the writes from this bio that apply to the give stripe.
>>> That does make some sense, however it doesn't seem to deal with the
>>> possibility that the one bio covers parts of two different stripes. In that
>>> case the first stripe never gets STRIPE_PREREAD_ACTIVE set, so it is delayed
>>> despite having 'REQ_SYNC' set.
>>
>>I didn't get your point. Isn't last_sector - logical_sector < chunk_sectors true
>>in the case?
>>
>>> Also, and more significantly, plugging should mean that the various
>>> stripe_heads are not even looked at until all of the original bio is
>>> processed, so while STRIPE_PREREAD_ACTIVE might get set early, it should not
>>> get processed until the whole bio is processed and the queue is unplugged.
>>>
>>> So I don't think this patch should make a difference on large direct writes,
>>> and if it does then something strange is going on that I'd like to
>>> understand first.
>>
>>Aha, ok, this makes sense. The recent delayed stripe release should make the
>>problem go away. So Jianpeng, can you please try your workload on a recent
>>kernel with the commit reverted?
>>
> I tested your patch with my workload.
> As Neil said, the performance does not regress.
> But if the code is:
>> if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>> 	release_stripe(sh);
>> else
>> 	release_stripe_plug(mddev, sh);
> the speed is about 76MB/s. With the code as it is in your patch, the speed is 200MB/s.
Hmm, what I want you to test is an upstream kernel with commit 895e3c5c58a80bb
reverted. Don't apply my patch; we want to just revert the commit.
> BTW, why have you and Neil not answered my suggestion of adding REQ_NOIDLE to the last bio of one dio?
I'm not quite positive about this. Each io_submit can submit several
requests, and each request has its own dio; setting the flag for the first
dio, for example, wouldn't make sense.
Thanks,
Shaohua
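
To illustrate Shaohua's point about io_submit, a minimal userspace sketch using libaio; the helper function is only for illustration, and the io_context/file descriptor setup is omitted with arbitrary sizes and offsets:

#include <libaio.h>

/* One io_submit() call queues two independent requests; each becomes its
 * own dio in the kernel, so a "last bio of this dio" flag on the first
 * request would say nothing about whether more IO is about to arrive. */
static void submit_two_writes(io_context_t ctx, int fd, void *buf1, void *buf2)
{
	struct iocb cb1, cb2;
	struct iocb *cbs[2] = { &cb1, &cb2 };

	io_prep_pwrite(&cb1, fd, buf1, 4096, 0);
	io_prep_pwrite(&cb2, fd, buf2, 4096, 1 << 20);
	io_submit(ctx, 2, cbs);
}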
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-16 9:42 ` Shaohua Li
@ 2012-08-17 1:00 ` Jianpeng Ma
2012-08-23 6:08 ` Shaohua Li
1 sibling, 0 replies; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-17 1:00 UTC (permalink / raw)
To: shli; +Cc: Neil Brown, linux-raid
On 2012-08-16 17:42 Shaohua Li <shli@kernel.org> Wrote:
>2012/8/16 Jianpeng Ma <majianpeng@gmail.com>:
>> On 2012-08-15 09:44 Shaohua Li <shli@kernel.org> Wrote:
>>> [snip]
>
>Hmm, what I want you to test is an upstream kernel with commit 895e3c5c58a80bb
>reverted. Don't apply my patch; we want to just revert the commit.
>
>> BTW, why have you and Neil not answered my suggestion of adding REQ_NOIDLE to the last bio of one dio?
>
>I'm not quite positive about this. Each io_submit can submit several
>requests, and each request has its own dio; setting the flag for the first
>dio, for example, wouldn't make sense.
>
Thanks, I didn't know about the aio case. So, hmm. Thanks again.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-16 9:42 ` Shaohua Li
2012-08-17 1:00 ` Jianpeng Ma
@ 2012-08-23 6:08 ` Shaohua Li
2012-08-23 6:46 ` Jianpeng Ma
1 sibling, 1 reply; 28+ messages in thread
From: Shaohua Li @ 2012-08-23 6:08 UTC (permalink / raw)
To: Jianpeng Ma; +Cc: Neil Brown, linux-raid, axboe
2012/8/16 Shaohua Li <shli@kernel.org>:
> 2012/8/16 Jianpeng Ma <majianpeng@gmail.com>:
>> On 2012-08-15 09:44 Shaohua Li <shli@kernel.org> Wrote:
>>> [snip]
>
> Hmm, what I want you to test is an upstream kernel with commit 895e3c5c58a80bb
> reverted. Don't apply my patch; we want to just revert the commit.
Do you have data for your original workload with 895e3c5c58a80bb
reverted now?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-23 6:08 ` Shaohua Li
@ 2012-08-23 6:46 ` Jianpeng Ma
2012-08-23 7:55 ` Shaohua Li
0 siblings, 1 reply; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-23 6:46 UTC (permalink / raw)
To: shli; +Cc: Neil Brown, linux-raid, axboe
On 2012-08-23 14:08 Shaohua Li <shli@kernel.org> Wrote:
>2012/8/16 Shaohua Li <shli@kernel.org>:
>> 2012/8/16 Jianpeng Ma <majianpeng@gmail.com>:
>>> On 2012-08-15 09:44 Shaohua Li <shli@kernel.org> Wrote:
>>>> [snip]
>>
>> Hmm, what I want you to test is an upstream kernel with commit 895e3c5c58a80bb
>> reverted. Don't apply my patch; we want to just revert the commit.
>
>Do you have data for your original workload with 895e3c5c58a80bb
>reverted now?
Our raid5 array has 14 SATA HDDs.

with 895e3c5c58a80bb reverted:
using dd to test: 55MB/s
using our-fs:     200-250MB/s

with 895e3c5c58a80bb:
using dd to test: 275MB/s
using our-fs:     500-550MB/s
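For reference, a direct-write test like the dd runs above can be reproduced with
a small O_DIRECT program along these lines. This is only a sketch; the device
path, block size and write count are assumptions for illustration, not the
settings used for the numbers above.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* WARNING: writes test data to the device - use a scratch array only */
	const char *dev = argc > 1 ? argv[1] : "/dev/md0"; /* assumed device */
	size_t bs = 4096;      /* small direct writes, as in the 4k case */
	int count = 1024;
	void *buf;
	int fd, i;

	fd = open(dev, O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT requires an aligned buffer */
	if (posix_memalign(&buf, 4096, bs)) {
		close(fd);
		return 1;
	}
	memset(buf, 0xab, bs);

	for (i = 0; i < count; i++) {
		/* sequential offsets here; a random-write test would pick
		 * random bs-aligned offsets instead */
		if (pwrite(fd, buf, bs, (off_t)i * bs) != (ssize_t)bs) {
			perror("pwrite");
			break;
		}
	}
	free(buf);
	close(fd);
	return 0;
}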
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-23 6:46 ` Jianpeng Ma
@ 2012-08-23 7:55 ` Shaohua Li
2012-08-23 8:11 ` Jianpeng Ma
2012-08-23 12:17 ` Jianpeng Ma
0 siblings, 2 replies; 28+ messages in thread
From: Shaohua Li @ 2012-08-23 7:55 UTC (permalink / raw)
To: Jianpeng Ma; +Cc: Neil Brown, linux-raid, axboe
2012/8/23 Jianpeng Ma <majianpeng@gmail.com>:
> On 2012-08-23 14:08 Shaohua Li <shli@kernel.org> Wrote:
>>> [snip]
>>
>>Did you have data for your original workload with 895e3c5c58a80bb
>>reverted now?
> our raid5 which had 14 SATA HDDs.
>
> with 895e3c5c58a80bb reverted:
> using dd to test 55MB/s
> using our-fs 200-250Mb/s
>
> with 895e3c5c58a80bb:
> using dd to test 275MB/s
> using our-fs 500-550Mb/s
What's the block size of dd in this test? In your original test your block
size covered chunk_sectors * data_disks; in that case 895e3c5c58a80bb is
likely not required.
I guess you are using a smaller block size this time, so we would need to
merge different bios into a full-stripe overwrite?
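To make the full-stripe condition concrete, here is a small userspace sketch of
the arithmetic in question (not kernel code; the chunk size and disk count are
assumptions for illustration): a write can avoid raid5 preread only if it starts
on a stripe boundary and covers whole stripes of chunk_size * data_disks bytes.

#include <stdio.h>
#include <stdbool.h>

static bool full_stripe_write(unsigned long long offset,
			      unsigned long long len,
			      unsigned long long chunk_size,
			      unsigned int data_disks)
{
	unsigned long long stripe = chunk_size * data_disks;

	/* aligned to a stripe boundary and a whole number of stripes */
	return len != 0 && (offset % stripe) == 0 && (len % stripe) == 0;
}

int main(void)
{
	unsigned long long chunk = 512 * 1024;	/* assumed 512KiB chunk */
	unsigned int data_disks = 13;		/* assumed 14-disk raid5 */
	unsigned long long stripe = chunk * data_disks;

	/* dd with bs = chunk * data_disks, aligned: full-stripe, no preread */
	printf("bs = stripe: %d\n",
	       full_stripe_write(0, stripe, chunk, data_disks));
	/* a 4k random write: partial stripe, parity needs read-modify-write */
	printf("bs = 4k:     %d\n",
	       full_stripe_write(4096, 4096, chunk, data_disks));
	return 0;
}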
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-23 7:55 ` Shaohua Li
@ 2012-08-23 8:11 ` Jianpeng Ma
2012-08-23 12:17 ` Jianpeng Ma
1 sibling, 0 replies; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-23 8:11 UTC (permalink / raw)
To: shli; +Cc: Neil Brown, linux-raid, axboe
On 2012-08-23 15:55 Shaohua Li <shli@kernel.org> Wrote:
>2012/8/23 Jianpeng Ma <majianpeng@gmail.com>:
>> [snip]
>
>what's block size of dd in this test? In your original test, your
>BS covers chunk_sector*data_disks. In that case,
>895e3c5c58a80bb is likely not required.
Maybe it's my fault. The kernel version was 3.4.4, which does not contain patch 8811b5968f6216e97c.
So I will retest.
>
>I guess you are using a smaller bs this time, so we need
>merge different bios to a stripe overwrite?
At present, in our situation, we rely on our-fs to do it.
If md could support this itself, like an io-scheduler does, it would be better for
everyone who can't arrange their writes to cover chunk_sectors * data_disks.
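As a toy userspace illustration of that idea (not a proposal for the md
implementation; the sizes are assumptions), small writes could be staged until
they cover a full stripe and only then submitted, so no preread is ever needed:

#include <stdio.h>
#include <string.h>

#define CHUNK      (512 * 1024)		/* assumed chunk size */
#define DATA_DISKS 13			/* assumed data disks */
#define STRIPE     (CHUNK * DATA_DISKS)

static unsigned char staging[STRIPE];
static size_t filled;

/* Pretend "submit": in md this would become one full-stripe write. */
static void flush_stripe(void)
{
	printf("submitting full stripe of %zu bytes\n", filled);
	filled = 0;
}

/* Accumulate small sequential writes; only submit once a whole stripe is
 * covered, so old data/parity never has to be preread. */
static void small_write(const void *buf, size_t len)
{
	while (len) {
		size_t n = len < STRIPE - filled ? len : STRIPE - filled;

		memcpy(staging + filled, buf, n);
		filled += n;
		buf = (const char *)buf + n;
		len -= n;
		if (filled == STRIPE)
			flush_stripe();
	}
}

int main(void)
{
	unsigned char block[4096] = {0};
	int i;

	/* 4k sequential writes, merged into full-stripe submissions */
	for (i = 0; i < (int)(2 * STRIPE / sizeof(block)); i++)
		small_write(block, sizeof(block));
	return 0;
}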
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-23 7:55 ` Shaohua Li
2012-08-23 8:11 ` Jianpeng Ma
@ 2012-08-23 12:17 ` Jianpeng Ma
2012-08-24 3:12 ` Shaohua Li
1 sibling, 1 reply; 28+ messages in thread
From: Jianpeng Ma @ 2012-08-23 12:17 UTC (permalink / raw)
To: shli; +Cc: Neil Brown, linux-raid, axboe
On 2012-08-23 15:55 Shaohua Li <shli@kernel.org> Wrote:
>2012/8/23 Jianpeng Ma <majianpeng@gmail.com>:
>> [snip]
>
>what's block size of dd in this test? In your original test, your
>BS covers chunk_sector*data_disks. In that case,
>895e3c5c58a80bb is likely not required.
>
With the latest kernel (3.6-rc3), w/ or w/o 895e3c5c58a80bb the result is the same.
The block size of dd is chunk_sectors * data_disks.
Your patch (8811b5968f6216e97) is good.
I think 8811b5968f6216e97 should be reverted.
>I guess you are using a smaller bs this time, so we need
>merge different bios to a stripe overwrite?
If the block size is smaller, w/o 895e3c5c58a80bb looks better than w/ 895e3c5c58a80bb.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-23 12:17 ` Jianpeng Ma
@ 2012-08-24 3:12 ` Shaohua Li
2012-08-24 4:21 ` kedacomkernel
0 siblings, 1 reply; 28+ messages in thread
From: Shaohua Li @ 2012-08-24 3:12 UTC (permalink / raw)
To: Jianpeng Ma; +Cc: Neil Brown, linux-raid, axboe
2012/8/23 Jianpeng Ma <majianpeng@gmail.com>:
> On 2012-08-23 15:55 Shaohua Li <shli@kernel.org> Wrote:
>>> [snip]
>>
>>what's block size of dd in this test? In your original test, your
>>BS covers chunk_sector*data_disks. In that case,
>>895e3c5c58a80bb is likely not required.
>>
> With latest kernel(3.6-rc3), w/ or w/o 895e3c5c58a80bb, the result is the same.
> The block size of dd is chunk_sector * data_disks.
> Your patch(8811b5968f6216e97) is good.
> I think it shoul revert 8811b5968f6216e97.
revert 895e3c5c58a80bb, right?
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Re: [patch v2]raid5: fix directio regression
2012-08-24 3:12 ` Shaohua Li
@ 2012-08-24 4:21 ` kedacomkernel
2012-09-11 0:44 ` NeilBrown
0 siblings, 1 reply; 28+ messages in thread
From: kedacomkernel @ 2012-08-24 4:21 UTC (permalink / raw)
To: shli, majianpeng; +Cc: Neil Brown, linux-raid, axboe
On 2012-08-24 11:12 Shaohua Li <shli@kernel.org> Wrote:
>2012/8/23 Jianpeng Ma <majianpeng@gmail.com>:
>>> [snip]
>> With latest kernel(3.6-rc3), w/ or w/o 895e3c5c58a80bb, the result is the same.
>> The block size of dd is chunk_sector * data_disks.
>> Your patch(8811b5968f6216e97) is good.
>> I think it shoul revert 8811b5968f6216e97.
>
>revert 895e3c5c58a80bb, right?
Yes. Because 8811b5968f6216e97 is in place, it can be reverted.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [patch v2]raid5: fix directio regression
2012-08-24 4:21 ` kedacomkernel
@ 2012-09-11 0:44 ` NeilBrown
0 siblings, 0 replies; 28+ messages in thread
From: NeilBrown @ 2012-09-11 0:44 UTC (permalink / raw)
To: kedacomkernel; +Cc: shli, majianpeng, linux-raid, axboe
On Fri, 24 Aug 2012 12:21:30 +0800 kedacomkernel <kedacomkernel@gmail.com>
wrote:
> On 2012-08-24 11:12 Shaohua Li <shli@kernel.org> Wrote:
> >2012/8/23 Jianpeng Ma <majianpeng@gmail.com>:
> >> On 2012-08-23 15:55 Shaohua Li <shli@kernel.org> Wrote:
> >>>2012/8/23 Jianpeng Ma <majianpeng@gmail.com>:
> >>>> On 2012-08-23 14:08 Shaohua Li <shli@kernel.org> Wrote:
> >>>>>2012/8/16 Shaohua Li <shli@kernel.org>:
> >>>>>> 2012/8/16 Jianpeng Ma <majianpeng@gmail.com>:
> >>>>>>> On 2012-08-15 09:44 Shaohua Li <shli@kernel.org> Wrote:
> >>>>>>>>On Wed, Aug 15, 2012 at 10:56:10AM +1000, NeilBrown wrote:
> >>>>>>>>> On Tue, 14 Aug 2012 14:33:43 +0800 Shaohua Li <shli@kernel.org> wrote:
> >>>>>>>>>
> >>>>>>>>> > On Thu, Aug 09, 2012 at 01:07:01PM +0800, Shaohua Li wrote:
> >>>>>>>>> > > 2012/8/9 NeilBrown <neilb@suse.de>:
> >>>>>>>>> > > > On Thu, 9 Aug 2012 09:20:05 +0800 "Jianpeng Ma" <majianpeng@gmail.com> wrote:
> >>>>>>>>> > > >
> >>>>>>>>> > > >> On 2012-08-08 20:53 Shaohua Li <shli@kernel.org> Wrote:
> >>>>>>>>> > > >> >2012/8/8 Jianpeng Ma <majianpeng@gmail.com>:
> >>>>>>>>> > > >> >> On 2012-08-08 10:58 Shaohua Li <shli@kernel.org> Wrote:
> >>>>>>>>> > > >> >>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> >>>>>>>>> > > >> >>>> On 2012-08-07 13:32 Shaohua Li <shli@kernel.org> Wrote:
> >>>>>>>>> > > >> >>>>>2012/8/7 Jianpeng Ma <majianpeng@gmail.com>:
> >>>>>>>>> > > >> >>>>>> On 2012-08-07 11:22 Shaohua Li <shli@kernel.org> Wrote:
> >>>>>>>>> > > >> >>>>>>>My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
> >>>>>>>>> > > >> >>>>>>>895e3c5c58a80bb. directIO usually is random IO and if request size isn't big
> >>>>>>>>> > > >> >>>>>>>(which is the common case), delay handling of the stripe hasn't any advantages.
> >>>>>>>>> > > >> >>>>>>>For big size request, delay can still reduce IO.
> >>>>>>>>> > > >> >>>>>>>
> >>>>>>>>> > > >> >>>>>>>Signed-off-by: Shaohua Li <shli@fusionio.com>
> >>>>>>>>> > > >> >>>> [snip]
> >>>>>>>>> > > >> >>>>>>>--
> >>>>>>>>> > > >> >>>>>> May be used size to judge is not a good method.
> >>>>>>>>> > > >> >>>>>> I firstly sended this patch, only want to control direct-write-block,not for reqular file.
> >>>>>>>>> > > >> >>>>>> Because i think if someone used direct-write-block for raid5,he should know the feature of raid5 and he can control
> >>>>>>>>> > > >> >>>>>> for write to full-write.
> >>>>>>>>> > > >> >>>>>> But at that time, i did know how to differentiate between regular file and block-device.
> >>>>>>>>> > > >> >>>>>> I thik we should do something to do this.
> >>>>>>>>> > > >> >>>>>
> >>>>>>>>> > > >> >>>>>I don't think it's possible user can control his write to be a
> >>>>>>>>> > > >> >>>>>full-write even for
> >>>>>>>>> > > >> >>>>>raw disk IO. Why regular file and block device io matters here?
> >>>>>>>>> > > >> >>>>>
> >>>>>>>>> > > >> >>>>>Thanks,
> >>>>>>>>> > > >> >>>>>Shaohua
> >>>>>>>>> > > >> >>>> Another problem is the size. How to judge the size is large or not?
> >>>>>>>>> > > >> >>>> A syscall write is a dio and a dio may be split more bios.
> >>>>>>>>> > > >> >>>> For my workload, i usualy write chunk-size.
> >>>>>>>>> > > >> >>>> But your patch is judge by bio-size.
> >>>>>>>>> > > >> >>>
> >>>>>>>>> > > >> >>>I'd ignore workload which does sequential directIO, though
> >>>>>>>>> > > >> >>>your workload is, but I bet no real workloads are. So I'd like
> >>>>>>>>> > > >> >> Sorry,my explain maybe not corcrect. I write data once which size is almost chunks-size * devices,in order to full-write
> >>>>>>>>> > > >> >> and as possible as to no pre-read operation.
> >>>>>>>>> > > >> >>>only to consider big size random directio. I agree the size
> >>>>>>>>> > > >> >>>judge is arbitrary. I can optimize it to be only consider stripe
> >>>>>>>>> > > >> >>>which hits two or more disks in one bio, but not sure if it's
> >>>>>>>>> > > >> >>>worthy doing. Not ware big size directio is common, and even
> >>>>>>>>> > > >> >>>is, big size request IOPS is low, a bit delay maybe not a big
> >>>>>>>>> > > >> >>>deal.
> >>>>>>>>> > > >> >> If add a acc_time for 'striep_head' to control?
> >>>>>>>>> > > >> >> When get_active_stripe() is ok, update acc_time.
> >>>>>>>>> > > >> >> For some time, stripe_head did not access and it shold pre-read.
> >>>>>>>>> > > >> >
> >>>>>>>>> > > >> >Do you want to add a timer for each stripe? This is even ugly.
> >>>>>>>>> > > >> >How do you choose the expire time? A time works for harddisk
> >>>>>>>>> > > >> >definitely will not work for a fast SSD.
> >>>>>>>>> > > >> A time is like the size which is arbitrary.
> >>>>>>>>> > > >> How about add a interface in sysfs to control by user?
> >>>>>>>>> > > >> Only user can judge the workload, which sequatial write or random write.
> >>>>>>>>> > > >
> >>>>>>>>> > > > This is getting worse by the minute. A sysfs interface for this is
> >>>>>>>>> > > > definitely not a good idea.
> >>>>>>>>> > > >
> >>>>>>>>> > > > The REQ_NOIDLE flag is a pretty clear statement that no more requests that
> >>>>>>>>> > > > merge with this one are expected. If some use cases sends random requests,
> >>>>>>>>> > > > maybe it should be setting REQ_NOIDLE.
> >>>>>>>>> > > >
> >>>>>>>>> > > > Maybe someone should do some research and find out why WRITE_ODIRECT doesn't
> >>>>>>>>> > > > include REQ_NOIDLE. Understanding that would help understand the current
> >>>>>>>>> > > > problem.
> >>>>>>>>> > >
> >>>>>>>>> > > A quick search shows only cfq-iosched uses REQ_NOIDLE. In
> >>>>>>>>> > > cfq, a queue is idled to avoid losing its share. REQ_NOIDLE
> >>>>>>>>> > > tells cfq to avoid idle, since the task will not dispatch further
> >>>>>>>>> > > requests any more. Note this isn't no merge.
> >>>>>>>>> >
> >>>>>>>>> > Since REQ_NOIDLE has no relationship with request merge, we'd better remove it.
> >>>>>>>>> > I came out a new patch, which doesn't depend on request size any more. With
> >>>>>>>>> > this patch, sequential directio will still introduce unnecessary raid5 preread
> >>>>>>>>> > (especially for small size IO), but I bet no app does sequential small size
> >>>>>>>>> > directIO.
> >>>>>>>>> >
> >>>>>>>>> > Thanks,
> >>>>>>>>> > Shaohua
> >>>>>>>>> >
> >>>>>>>>> > Subject: raid5: fix directio regression
> >>>>>>>>> >
> >>>>>>>>> > My directIO randomwrite 4k workload shows a 10~20% regression caused by commit
> >>>>>>>>> > 895e3c5c58a80bb. This commit isn't friendly for small size random IO, because
> >>>>>>>>> > delaying such request hasn't any advantages.
> >>>>>>>>> >
> >>>>>>>>> > DirectIO usually is random IO. I thought we can ignore request merge between
> >>>>>>>>> > bios from different io_submit. So we only consider one bio which can drive
> >>>>>>>>> > unnecessary preread in raid5, which is large request. If a bio is large enough
> >>>>>>>>> > and some of its stripes will access two or more disks, such stripes should be
> >>>>>>>>> > delayed to avoid unnecessary preread till bio for the last disk of the strips
> >>>>>>>>> > is added.
> >>>>>>>>> >
> >>>>>>>>> > REQ_NOIDLE doesn't mean about request merge, I deleted it.
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>> Have you tested what effect this has on large sequential direct writes?
> >>>>>>>>> Because it don't make sense to me and I would be surprised if it improves
> >>>>>>>>> things.
> >>>>>>>>>
> >>>>>>>>> You are delaying setting the STRIPE_PREREAD_ACTIVE bit until you think you
> >>>>>>>>> have submitted all the writes from this bio that apply to the give stripe.
> >>>>>>>>> That does make some sense, however it doesn't seem to deal with the
> >>>>>>>>> possibility that the one bio covers parts of two different stripes. In that
> >>>>>>>>> case the first stripe never gets STRIPE_PREREAD_ACTIVE set, so it is delayed
> >>>>>>>>> despite having 'REQ_SYNC' set.
> >>>>>>>>
> >>>>>>>>I didn't get your point. Isn't last_sector - logical_sector < chunk_sectors true
> >>>>>>>>in that case?
> >>>>>>>>
> >>>>>>>>> Also, and more significantly, plugging should mean that the various
> >>>>>>>>> stripe_heads are not even looked at until all of the original bio is
> >>>>>>>>> processed, so while STRIPE_PREREAD_ACTIVE might get set early, it should not
> >>>>>>>>> get processed until the whole bio is processed and the queue is unplugged.
> >>>>>>>>>
> >>>>>>>>> So I don't think this patch should make a difference on large direct writes,
> >>>>>>>>> and if it does then something strange is going on that I'd like to
> >>>>>>>>> understand first.
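
Roughly, the deferral described above relies on the submitting context holding
a block plug; a minimal illustration of the pattern follows (this is not the md
code itself, and whether a given direct-IO path actually plugs its submissions
depends on the kernel version):

	struct blk_plug plug;

	blk_start_plug(&plug);
	submit_bio(WRITE, bi);	/* raid5 make_request() runs here and can
				 * park its stripes while the plug is held */
	blk_finish_plug(&plug);	/* only now are the stripes handed to the
				 * raid5 daemon for handling */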
> >>>>>>>>
> >>>>>>>>Aha, OK, this makes sense. The recent delayed stripe release should make the
> >>>>>>>>problem go away. So Jianpeng, can you please try your workload on a recent
> >>>>>>>>kernel with the commit reverted?
> >>>>>>>>
> >>>>>>> I tested your patch with my workload.
> >>>>>>> As Neil said, the performance does not regress.
> >>>>>>> But if the code is:
> >>>>>>>>	if (test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> >>>>>>>>		release_stripe(sh);
> >>>>>>>>	else
> >>>>>>>>		release_stripe_plug(mddev, sh);
> >>>>>>> the speed is about 76MB/s; with the original code it is 200MB/s.
> >>>>>>
> >>>>>> Hmm, what I want you to test is the upstream kernel with commit 895e3c5c58a80bb
> >>>>>> reverted; don't apply my patch. We want to just revert the commit.
> >>>>>
> >>>>>Do you have data for your original workload with 895e3c5c58a80bb
> >>>>>reverted now?
> >>>> Our raid5 array has 14 SATA HDDs.
> >>>>
> >>>> With 895e3c5c58a80bb reverted:
> >>>> using dd to test: 55MB/s
> >>>> using our-fs: 200-250MB/s
> >>>>
> >>>> With 895e3c5c58a80bb:
> >>>> using dd to test: 275MB/s
> >>>> using our-fs: 500-550MB/s
> >>>
> >>>What's the block size of dd in this test? In your original test, your
> >>>block size covers chunk_sector * data_disks. In that case,
> >>>895e3c5c58a80bb is likely not required.
> >>>
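
(For a rough sense of the sizes involved: with 14 drives in RAID5, one disk's
worth of each stripe is parity, so there are 13 data disks and the full-stripe
write size is chunk_size * 13. The chunk size of this array is not stated in
the thread; with the common 512KB default it would be 13 * 512KB = 6.5MB, so a
dd block size at or above that writes every stripe in full and needs no preread
at all.)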
> >> With the latest kernel (3.6-rc3), with or without 895e3c5c58a80bb, the result is the same.
> >> The block size of dd is chunk_sector * data_disks.
> >> Your patch (8811b5968f6216e97) is good.
> >> I think it should revert 8811b5968f6216e97.
> >
> >revert 895e3c5c58a80bb, right?
> Yes. Because of 8811b5968f6216e97, it can be reverted.
Thanks. I've reverted 895e3c5c58a80bb and will submit to -next later today.
NeilBrown
Thread overview: 28+ messages
2012-08-07 3:22 [patch]raid5: fix directio regression Shaohua Li
2012-08-07 5:13 ` Jianpeng Ma
2012-08-07 5:32 ` Shaohua Li
2012-08-07 5:42 ` Jianpeng Ma
2012-08-07 6:21 ` Jianpeng Ma
2012-08-08 2:58 ` Shaohua Li
2012-08-08 5:21 ` Jianpeng Ma
2012-08-08 12:53 ` Shaohua Li
2012-08-09 1:20 ` Jianpeng Ma
2012-08-09 1:32 ` NeilBrown
2012-08-09 2:27 ` Jianpeng Ma
2012-08-09 5:07 ` Shaohua Li
2012-08-14 6:33 ` [patch v2]raid5: " Shaohua Li
2012-08-15 0:56 ` NeilBrown
2012-08-15 1:20 ` kedacomkernel
2012-08-15 1:44 ` Shaohua Li
2012-08-15 1:54 ` Jianpeng Ma
2012-08-16 7:36 ` Jianpeng Ma
2012-08-16 9:42 ` Shaohua Li
2012-08-17 1:00 ` Jianpeng Ma
2012-08-23 6:08 ` Shaohua Li
2012-08-23 6:46 ` Jianpeng Ma
2012-08-23 7:55 ` Shaohua Li
2012-08-23 8:11 ` Jianpeng Ma
2012-08-23 12:17 ` Jianpeng Ma
2012-08-24 3:12 ` Shaohua Li
2012-08-24 4:21 ` kedacomkernel
2012-09-11 0:44 ` NeilBrown