Raid10 and page cache

* Raid10 and page cache
@ 2011-12-06 21:29 Yucong Sun (叶雨飞)
  2011-12-06 22:01 ` Yucong Sun (叶雨飞)
       [not found] ` <CAJygYd16PWfKe8fK-b150N46CEwzBUqJn1N6dfsGR4yyTgGbTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-06 21:29 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi,

I recently set up raid10 on 4 physical disks and have iSCSI serve it
as a block device, and have been trying to tune it for performance.

The first thing I noticed is that MD seems to rely on the page cache to
flush changes to disk. Is there any way to turn that off so that changes
are flushed straight to the disk, like O_FSYNC|O_DIRECT does? The reason I
want to turn it off is to understand the performance difference. I want to
be sure that the page cache is truly acting as a write-back cache. I know
one can tune the dirty_* settings to control the cache flush, but I want to
make sure that it is actually doing what I think it does.
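
The dirty_* settings mentioned above live under /proc/sys/vm on Linux. A
minimal sketch for listing their current values, assuming a readable
/proc/sys/vm; the helper name read_dirty_tunables is illustrative and not
something from this thread:

```python
import os


def read_dirty_tunables(vm_dir="/proc/sys/vm"):
    """Return {name: int value} for every dirty_* entry found under vm_dir."""
    values = {}
    for name in sorted(os.listdir(vm_dir)):
        if not name.startswith("dirty_"):
            continue
        try:
            with open(os.path.join(vm_dir, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric entry: skip it
    return values


if __name__ == "__main__":
    # Only meaningful on Linux; elsewhere the directory does not exist.
    if os.path.isdir("/proc/sys/vm"):
        for name, value in read_dirty_tunables().items():
            print(f"{name} = {value}")
```

This only reads the values. Changing them requires root and is usually done
with sysctl or by writing to the same files.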

Then I noticed that in the output of free, the number in the Cache column
is very low, while Buffer is very high. My question is: does Buffer here
serve as a read cache? I couldn't find the answer anywhere else.

My last question: since MD already seems to be doing caching, what
effect would it have if I set up a loop (LO) device in front of the MD
device? Is there going to be more caching, and how is it different from
just a plain MD device?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Raid10 and page cache
  2011-12-06 21:29 Raid10 and page cache Yucong Sun (叶雨飞)
@ 2011-12-06 22:01 ` Yucong Sun (叶雨飞)
  2011-12-06 22:26   ` NeilBrown
       [not found] ` <CAJygYd16PWfKe8fK-b150N46CEwzBUqJn1N6dfsGR4yyTgGbTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-06 22:01 UTC (permalink / raw)
  To: linux-raid


* Re: Raid10 and page cache
       [not found] ` <CAJygYd16PWfKe8fK-b150N46CEwzBUqJn1N6dfsGR4yyTgGbTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-12-06 22:01   ` Yucong Sun (叶雨飞)
  0 siblings, 0 replies; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-06 22:01 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

wrong list, sorry!


* Re: Raid10 and page cache
  2011-12-06 22:01 ` Yucong Sun (叶雨飞)
@ 2011-12-06 22:26   ` NeilBrown
  2011-12-06 23:13     ` Yucong Sun (叶雨飞)
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-06 22:26 UTC (permalink / raw)
  To: Yucong Sun (叶雨飞); +Cc: linux-raid

On Tue, 6 Dec 2011 14:01:14 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> Hi,
> 
> I recently setup raid10 on 4 physical disk and have a iscsi serve it
> as a block device, and have been trying to tweak for performance.
> 
> First thing I notice that MD seems to rely on page cache to flush
> changes to disk,  is there any way to turn that off so changes are
> flushed to the disk? like O_FSYNC|O_DIRECT does? The reason I want to
> turn it off is to understand the performance difference,  I want to be
> sure that page cache is truly acting as a write-back cache, I know one
> can tune the dirty_* to control the cache flush, but I want to make
> sure that it is actually doing what I think it does.

Why do you think this?

md/raid10 sends all requests straight through to the relevant underlying
device(s).
Reads are just passed straight down.
Writes are duplicated (the request structure, not the data) and queued to a
separate thread which does the actual write, but it is fairly direct.

> 
> Then I notice in output of free,  the number in Cache column is very
> low, however the Buffer is very high, my question is does Buffer here
> serves as a read cache? I couldn't find the answer anywhere else.

The best place to find the answer is in the source code.

Every page in the page cache is associated with some file.
If that file is a block device (e.g. /dev/sdX) then it is reported as
'Buffer' otherwise it is reported as 'Cache'.

Some filesystems, like ext3, use 'Buffer' memory for metadata but use
'Cache' memory for files and directories.
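
The two numbers that free reports come from the Buffers and Cached fields
of /proc/meminfo, so the split described above can be inspected directly.
A minimal sketch, assuming the usual "Key:   value kB" layout; the function
name meminfo_kb and the sample values are made up for illustration:

```python
def meminfo_kb(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


# Made-up sample values, used only when /proc/meminfo is unavailable.
SAMPLE = """MemTotal:        8165100 kB
Buffers:         6012340 kB
Cached:           152300 kB
Dirty:              4096 kB
Writeback:             0 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = meminfo_kb(f.read())
    except OSError:
        info = meminfo_kb(SAMPLE)
    # Buffers: page cache pages that belong to block devices (e.g. /dev/sdX)
    # Cached:  page cache pages that belong to regular files
    print("Buffers:", info.get("Buffers"), "kB")
    print("Cached: ", info.get("Cached"), "kB")
```

On a machine that serves a raw block device through the page cache, one
would expect Buffers to dominate, which matches what the original poster
reports.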

> 
> My last question is that since MD seems already doing the cache,  what
> effect would it have if I want to setup a LO device in front of MD
> device, Is there going to be more caching, how is different than just
> plain MD device?

MD/raid10 does no caching.
A loop-back over the md device would not add extra caching.

NeilBrown

* Re: Raid10 and page cache
  2011-12-06 22:26   ` NeilBrown
@ 2011-12-06 23:13     ` Yucong Sun (叶雨飞)
  2011-12-06 23:22       ` Marcus Sorensen
  2011-12-07  1:01       ` NeilBrown
  0 siblings, 2 replies; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-06 23:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, Dec 6, 2011 at 2:26 PM, NeilBrown <neilb@suse.de> wrote:
> On Tue, 6 Dec 2011 14:01:14 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
> wrote:
>
>> Hi,
>>
>> I recently setup raid10 on 4 physical disk and have a iscsi serve it
>> as a block device, and have been trying to tweak for performance.
>>
>> First thing I notice that MD seems to rely on page cache to flush
>> changes to disk,  is there any way to turn that off so changes are
>> flushed to the disk? like O_FSYNC|O_DIRECT does? The reason I want to
>> turn it off is to understand the performance difference,  I want to be
>> sure that page cache is truly acting as a write-back cache, I know one
>> can tune the dirty_* to control the cache flush, but I want to make
>> sure that it is actually doing what I think it does.
>
> Why do you think this?
>
> md/raid10 sends all request straight through to the relevant underlying
> device(s).
> reads are just passed straight down.
> Writes are duplicated (the request structure, not the data) and queued to a
> separate thread which does the actual write, but it is fairly direct.

So I know there's page caching/flushing involved because I watch
/proc/meminfo and see the Dirty value growing, and after it reaches the
threshold, write-back kicks in and writes the data.
So if, as you said, md does no page flushing, then it must be because
the iscsi software opens the device without O_DIRECT, so it uses the page
cache, which in turn flushes data to MD. Now it makes more sense.

But the md write is not a SYNC write? Meaning that after a write
call with O_DIRECT to the md device returns, the data is possibly
still in flight to the disk? How does having a bitmap play into
this? Does it work like the ext3 journal? After a power loss, can we
expect crash-consistent data on the disk?

Another thing to note is that I found the IO size on the MD device is
always 4K, which is the page size. Is that normal? I just want to make
sure this isn't bad behavior resulting from the iscsi software.
>
>>
>> Then I notice in output of free,  the number in Cache column is very
>> low, however the Buffer is very high, my question is does Buffer here
>> serves as a read cache? I couldn't find the answer anywhere else.
>
> The best place to find the answer is in the source code.
>
> Every page in the page cache is associated with some file.
> If that file is a block device (e.g. /dev/sdX) then it is reported as
> 'Buffer' otherwise it is reported as 'Cache'.
>
> Some filesystems like ext3 uses 'Buffer' memory for metadata but call use
> 'Cache' memory for files and directories.
>

Thanks, so it is being used as a read cache then. Too bad there's no easy
way to measure/see the hit rate.


* Re: Raid10 and page cache
  2011-12-06 23:13     ` Yucong Sun (叶雨飞)
@ 2011-12-06 23:22       ` Marcus Sorensen
  2011-12-07  1:01       ` NeilBrown
  1 sibling, 0 replies; 16+ messages in thread
From: Marcus Sorensen @ 2011-12-06 23:22 UTC (permalink / raw)
  To: Yucong Sun (叶雨飞); +Cc: NeilBrown, linux-raid

When you write a file, what you see is not MD doing caching. The OS
caches writes as dirty memory before flushing them to MD. If you want
sync or O_DIRECT writes, add the corresponding flag (O_SYNC or O_DIRECT)
to the open() call when you write the file.
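
As a rough illustration of passing these flags to open(), here is a
minimal sketch in Python, which exposes the same flags through os.open().
The helper names, the 4096-byte alignment, and the use of a temporary file
instead of a real md device are all assumptions made for the example:

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment; the real requirement depends on the device


def write_sync(path, data):
    """Write through an O_SYNC descriptor: each write() is synchronous."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)


def write_direct(path, data):
    """Write with O_DIRECT (Linux only), bypassing the page cache."""
    if not data or len(data) % BLOCK:
        raise ValueError("length must be a non-zero multiple of BLOCK")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
    try:
        buf = mmap.mmap(-1, len(data))  # anonymous mapping is page-aligned
        try:
            buf.write(data)
            return os.write(fd, buf)
        finally:
            buf.close()
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.bin")
        print("O_SYNC wrote", write_sync(target, b"x" * BLOCK), "bytes")
        try:
            print("O_DIRECT wrote", write_direct(target, b"y" * BLOCK), "bytes")
        except (OSError, AttributeError) as exc:
            # e.g. tmpfs rejects O_DIRECT, or the platform lacks the flag
            print("O_DIRECT not available here:", exc)
```

O_DIRECT needs an aligned buffer and length, which is why the sketch copies
the data into an anonymous mmap first. O_DIRECT on its own avoids the page
cache but does not by itself promise that the data is on stable media.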


* Re: Raid10 and page cache
  2011-12-06 23:13     ` Yucong Sun (叶雨飞)
  2011-12-06 23:22       ` Marcus Sorensen
@ 2011-12-07  1:01       ` NeilBrown
  2011-12-07  4:04         ` Yucong Sun (叶雨飞)
  1 sibling, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-07  1:01 UTC (permalink / raw)
  To: Yucong Sun (叶雨飞); +Cc: linux-raid

On Tue, 6 Dec 2011 15:13:34 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> On Tue, Dec 6, 2011 at 2:26 PM, NeilBrown <neilb@suse.de> wrote:
> > On Tue, 6 Dec 2011 14:01:14 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> I recently setup raid10 on 4 physical disk and have a iscsi serve it
> >> as a block device, and have been trying to tweak for performance.
> >>
> >> First thing I notice that MD seems to rely on page cache to flush
> >> changes to disk,  is there any way to turn that off so changes are
> >> flushed to the disk? like O_FSYNC|O_DIRECT does? The reason I want to
> >> turn it off is to understand the performance difference,  I want to be
> >> sure that page cache is truly acting as a write-back cache, I know one
> >> can tune the dirty_* to control the cache flush, but I want to make
> >> sure that it is actually doing what I think it does.
> >
> > Why do you think this?
> >
> > md/raid10 sends all request straight through to the relevant underlying
> > device(s).
> > reads are just passed straight down.
> > Writes are duplicated (the request structure, not the data) and queued to a
> > separate thread which does the actual write, but it is fairly direct.
> 
> So I know there's page caching /flush involved because I watch
> /proc/meminfo and see  Dirty value growing up and After reach the
> threshold, Write-back kicks in and wrote data.
> So if as you said md does no page flushing, then it must because of
> the iscsi software opens the device without O_DIRECT, so it uses page
> cache which in turn flush data to MD, now it makes more sense.
> 
> But for the md write, it's not SYNC write? meaning that after write
> call with O_DIRECT to the md device returns, the data is still
> possibility on the fly to the disk? how does having a bitmap plays in
> between? does it work like ext3 jounal? after a power-loss, can we
> expect a crash consistent data on the disk?

When you want sync writes, you need to use fsync.

When md writes the superblock or a bitmap page it uses SYNC and FLUSH writes
to ensure they get to the media before the subsequent data write.
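
For the application side of this advice, a minimal sketch of a write
followed by fsync, assuming an ordinary file path; the helper name
durable_write is illustrative, and a real iSCSI target would apply the same
pattern to its own descriptor for the md device:

```python
import os
import tempfile


def durable_write(path, data):
    """Write all of data to path, then fsync before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        view = memoryview(data)
        written = 0
        while written < len(view):
            # write() may be partial, so loop until everything is handed over
            written += os.write(fd, view[written:])
        os.fsync(fd)  # ask the kernel to push the data to the device
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        n = durable_write(os.path.join(tmp, "demo.bin"), b"hello md\n")
        print("wrote and synced", n, "bytes")
```

Without the fsync call, a successful write() only means the data reached
the page cache, which is the write-back behaviour observed earlier in the
thread.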


> 
> Another thing to note is I found IO size on MD device is always 4K,
> which is the page size, is that normal? just want to making sure this
> isn't a bad behavior result from the iscsi software.

It is normal in some cases.  It depends a bit on the details of the
underlying device.


NeilBrown


* Re: Raid10 and page cache
  2011-12-07  1:01       ` NeilBrown
@ 2011-12-07  4:04         ` Yucong Sun (叶雨飞)
  2011-12-07  4:28           ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07  4:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

The problem with using page flushing as a write cache here is that writes
to MD don't go through an IO scheduler, which is a very big problem: when
the flush thread decides to write to MD, it's impossible to control the
write speed or to prioritize the writes against reads. The requests are
basically handled FIFO, and when the flush is big, no reads can be
served.


* Re: Raid10 and page cache
  2011-12-07  4:04         ` Yucong Sun (叶雨飞)
@ 2011-12-07  4:28           ` NeilBrown
  2011-12-07  4:50             ` Yucong Sun (叶雨飞)
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-07  4:28 UTC (permalink / raw)
  To: Yucong Sun (叶雨飞); +Cc: linux-raid

On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> The problem with using page-flush as a write cache here is that write
> to MD don't go through IO scheduler, which is a very big problem,
> because when flush thread decide to write to MD,  it's impossible to
> control the write speed, or prioritize them with read, every requests
> basically is a fifo,  and when flush size is big, no read can be
> served.
> 

I'm not sure I understand....

Requests don't go through an IO scheduler before they hit md, but they do
after md sends them on down, so they can be re-ordered there.

There was a bug where raid10 would allow an arbitrary number of writes to
queue up so that flushing code didn't know when to stop.

This was fixed by 
   commit 34db0cd60f8a1f4ab73d118a8be3797c20388223

nearly 2 months ago :-)

NeilBrown



* Re: Raid10 and page cache
  2011-12-07  4:28           ` NeilBrown
@ 2011-12-07  4:50             ` Yucong Sun (叶雨飞)
  2011-12-07  5:10               ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07  4:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

I'm not sure whether that is what I mean. To illustrate my problem, here
is the output of iostat -x -d 1:

Device:     rrqm/s    wrqm/s       r/s       w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz     await     svctm     %util
sdb           0.00      0.00    163.00      1.00   1304.00      8.00      8.00      0.26      1.59      1.59     26.00
sdc           0.00      0.00     93.00      1.00    744.00      8.00      8.00      0.24      2.55      2.45     23.00
sde           0.00      0.00     56.00      1.00    448.00      8.00      8.00      0.22      3.86      3.86     22.00
sdd           0.00      0.00     88.00      1.00    704.00      8.00      8.00      0.18      2.02      2.02     18.00
md_d0         0.00      0.00    401.00      0.00   3208.00      0.00      8.00      0.00      0.00      0.00      0.00

==> This is normal operation; because of the page cache, only reads are
being submitted to the MD device.

Device:     rrqm/s    wrqm/s       r/s       w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz     await     svctm     %util
sda           0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
sdb           0.00   1714.00      4.00    277.00     32.00  14810.00     52.82     34.04    105.05      2.92     82.00
sdc           0.00   1685.00     12.00    270.00     96.00  14122.00     50.42     42.56    131.03      3.09     87.00
sde           0.00   1385.00      8.00    261.00     64.00  12426.00     46.43     29.76     99.44      3.35     90.00
sdd           0.00   1350.00      8.00    228.00     64.00  10682.00     45.53     40.93    133.56      3.69     87.00
md_d0         0.00      0.00     32.00  16446.00    256.00 131568.00      8.00      0.00      0.00      0.00      0.00

==> A huge page flush kicks in; note that the read requests are saturated on the MD device.

Device:     rrqm/s    wrqm/s       r/s       w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz     await     svctm     %util
sda           0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
sdb           0.00   1542.00      4.00    264.00     32.00  11760.00     44.00     66.58    230.22      3.73    100.00
sdc           0.00   1185.00      0.00    272.00      0.00   9672.00     35.56     63.40    215.88      3.68    100.00
sde           0.00   1352.00      0.00    298.00      0.00  12488.00     41.91     35.56    126.34      3.36    100.00
sdd           0.00    996.00      0.00    294.00      0.00  10120.00     34.42     76.79    270.37      3.40    100.00
md_d0         0.00      0.00      4.00      0.00     32.00      0.00      8.00      0.00      0.00      0.00      0.00

==> The huge page flush is still running; no reads are being done.

This is the problem: when the page flush kicks in, MD appears to refuse
incoming reads. All the underlying devices are set to the deadline
scheduler and tuned to favor reads, but it still doesn't help, since MD
simply doesn't submit new reads to the underlying devices.
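
The iostat figures above can be converted into bytes to check the "always
4K" observation made earlier in the thread. iostat reports rsec/s, wsec/s
and avgrq-sz in 512-byte sectors. A small sketch using the md_d0 row of the
second sample; the function names are illustrative:

```python
SECTOR_BYTES = 512  # iostat counts sectors of 512 bytes


def avg_request_bytes(avgrq_sz_sectors):
    """Average request size in bytes, from iostat's avgrq-sz column."""
    return avgrq_sz_sectors * SECTOR_BYTES


def throughput_mib_per_s(sectors_per_s):
    """Throughput in MiB/s, from iostat's rsec/s or wsec/s column."""
    return sectors_per_s * SECTOR_BYTES / (1024 * 1024)


if __name__ == "__main__":
    # md_d0 during the flush: w/s = 16446.00, wsec/s = 131568.00, avgrq-sz = 8.00
    print(avg_request_bytes(8.00))                    # 4096.0, i.e. one 4K page
    print(round(throughput_mib_per_s(131568.00), 1))  # 64.2
    print(131568.00 / 16446.00)                       # 8.0 sectors per write
```

So every request seen on md_d0 is one 4 KiB page, while the member disks
show larger average sizes (roughly 45 to 53 sectors) because requests are
merged below md, as the wrqm/s column indicates.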


* Re: Raid10 and page cache
  2011-12-07  4:50             ` Yucong Sun (叶雨飞)
@ 2011-12-07  5:10               ` NeilBrown
  2011-12-07  6:14                 ` Yucong Sun (叶雨飞)
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-07  5:10 UTC (permalink / raw)
  To: Yucong Sun (叶雨飞); +Cc: linux-raid

On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> I'm not sure whether that is what I mean. To illustrate my problem, here
> is the output of iostat -x -d 1:
> 
> Device:     rrqm/s    wrqm/s       r/s       w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz     await     svctm     %util
> sdb           0.00      0.00    163.00      1.00   1304.00      8.00      8.00      0.26      1.59      1.59     26.00
> sdc           0.00      0.00     93.00      1.00    744.00      8.00      8.00      0.24      2.55      2.45     23.00
> sde           0.00      0.00     56.00      1.00    448.00      8.00      8.00      0.22      3.86      3.86     22.00
> sdd           0.00      0.00     88.00      1.00    704.00      8.00      8.00      0.18      2.02      2.02     18.00
> md_d0         0.00      0.00    401.00      0.00   3208.00      0.00      8.00      0.00      0.00      0.00      0.00
> 
> ==> This is normal operation; because of the page cache, only reads are
> being submitted to the MD device.
> 
> Device:     rrqm/s    wrqm/s       r/s       w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz     await     svctm     %util
> sda           0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
> sdb           0.00   1714.00      4.00    277.00     32.00  14810.00     52.82     34.04    105.05      2.92     82.00
> sdc           0.00   1685.00     12.00    270.00     96.00  14122.00     50.42     42.56    131.03      3.09     87.00
> sde           0.00   1385.00      8.00    261.00     64.00  12426.00     46.43     29.76     99.44      3.35     90.00
> sdd           0.00   1350.00      8.00    228.00     64.00  10682.00     45.53     40.93    133.56      3.69     87.00
> md_d0         0.00      0.00     32.00  16446.00    256.00 131568.00      8.00      0.00      0.00      0.00      0.00
> 
> ==> A huge page flush kicks in; note that the read requests are saturated on the MD device.
> 
> Device:     rrqm/s    wrqm/s       r/s       w/s    rsec/s    wsec/s  avgrq-sz  avgqu-sz     await     svctm     %util
> sda           0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
> sdb           0.00   1542.00      4.00    264.00     32.00  11760.00     44.00     66.58    230.22      3.73    100.00
> sdc           0.00   1185.00      0.00    272.00      0.00   9672.00     35.56     63.40    215.88      3.68    100.00
> sde           0.00   1352.00      0.00    298.00      0.00  12488.00     41.91     35.56    126.34      3.36    100.00
> sdd           0.00    996.00      0.00    294.00      0.00  10120.00     34.42     76.79    270.37      3.40    100.00
> md_d0         0.00      0.00      4.00      0.00     32.00      0.00      8.00      0.00      0.00      0.00      0.00
> 
> ==> The huge page flush is still running; no reads are being done.
> 
> This is the problem: when the page flush kicks in, MD appears to refuse
> incoming reads. All the underlying devices are set to the deadline
> scheduler and tuned to favor reads, but it still doesn't help, since MD
> simply doesn't submit new reads to the underlying devices.

The counters are updated when a request completes, not when it is submitted,
so you cannot tell from this data whether md is submitting the read requests
or not.

What kernel are you working with?  If it doesn't contain the commit
identified below can you try with that and see if it makes a difference?

Thanks,
NeilBrown



> 
> 2011/12/6 NeilBrown <neilb@suse.de>:
> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
> > wrote:
> >
> >> The problem with using page-flush as a write cache here is that write
> >> to MD don't go through IO scheduler, which is a very big problem,
> >> because when flush thread decide to write to MD,  it's impossible to
> >> control the write speed, or prioritize them with read, every requests
> >> basically is a fifo,  and when flush size is big, no read can be
> >> served.
> >>
> >
> > I'm not sure I understand....
> >
> > Requests don't go through an IO scheduler before they hit md, but they do
> > after md sends them on down, so they can be re-ordered there.
> >
> > There was a bug where raid10 would allow an arbitrary number of writes to
> > queue up so that flushing code didn't know when to stop.
> >
> > This was fixed by
> >   commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
> >
> > nearly 2 months ago :-)
> >
> > NeilBrown
> >


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Raid10 and page cache
  2011-12-07  5:10               ` NeilBrown
@ 2011-12-07  6:14                 ` Yucong Sun (叶雨飞)
  2011-12-07  9:21                   ` Yucong Sun (叶雨飞)
  0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07  6:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

OK, but still, no reads complete during that time.

I'm on Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
16:26:12 UTC 2011 x86_64 GNU/Linux.
Do you know which kernel version has that commit? 2.6.35?

I think the root cause is that whenever dirty_background_bytes is
reached, the kernel flush thread [flush:254:0] wakes up and causes
md_raid10_d0 to go into state D, which makes everything hang for a
while. I guess the flush thread might be calling fsync() after the
write? That's hard to believe, but it would actually explain the symptom.

BTW, I don't think limiting batched writes to 1024 would solve the
problem. I am effectively doing that already, because I have to set
dirty_background_bytes to 4M, which is exactly 1024 writes every second
or so.
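
The 4M-to-1024 figure above can be checked with a little arithmetic. This
is only a sketch: the helper name bytes_to_pages is made up for
illustration, and the 4096-byte page size is an assumption (it is the
usual value on x86_64, and it matches the 8-sector writes seen on the md
device in the iostat output).

```shell
#!/bin/sh
# Hypothetical helper: how many page-sized writes it takes to flush a
# given number of dirty bytes. The 4096-byte default page size is an
# assumption (typical for x86_64), not something read from the system.
bytes_to_pages() {
    bytes=$1
    page_size=${2:-4096}
    echo $(( bytes / page_size ))
}

# dirty_background_bytes = 4M, as set in this thread.
bytes_to_pages $(( 4 * 1024 * 1024 ))
```

With a 4M threshold and 4K pages this prints 1024, i.e. each background
flush hands roughly 1024 page-sized writes to the md device.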

Cheers.

On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@suse.de> wrote:
> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
> wrote:
>
>> I'm not sure whether it is what I mean,  to illustrate my problem let
>> me put iostat -x -d 1 output  as below
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> avgrq-sz avgqu-sz   await  svctm  %util
>> sdb               0.00     0.00  163.00    1.00  1304.00     8.00
>> 8.00     0.26    1.59   1.59  26.00
>> sdc               0.00     0.00   93.00    1.00   744.00     8.00
>> 8.00     0.24    2.55   2.45  23.00
>> sde               0.00     0.00   56.00    1.00   448.00     8.00
>> 8.00     0.22    3.86 3.86 22.00
>> sdd               0.00     0.00   88.00    1.00   704.00     8.00
>> 8.00     0.18    2.02 2.02 18.00
>> md_d0             0.00     0.00  401.00    0.00  3208.00     0.00
>> 8.00     0.00    0.00   0.00   0.00
>>
>> ==> this is normal operation, because of page cache, there's only read
>> being submitted to the MD device.
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> avgrq-sz avgqu-sz   await  svctm  %util
>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>> 0.00     0.00    0.00   0.00   0.00
>> sdb               0.00  1714.00    4.00  277.00    32.00 14810.00
>> 52.82    34.04  105.05   2.92  82.00
>> sdc               0.00  1685.00   12.00  270.00    96.00 14122.00
>> 50.42    42.56  131.03   3.09  87.00
>> sde               0.00  1385.00    8.00  261.00    64.00 12426.00
>> 46.43    29.76   99.44   3.35  90.00
>> sdd               0.00  1350.00    8.00  228.00    64.00 10682.00
>> 45.53    40.93  133.56   3.69  87.00
>> md_d0             0.00     0.00   32.00 16446.00   256.00 131568.00
>>  8.00     0.00    0.00   0.00   0.00
>>
>> ==> Huge page flush kick in, note the read requests is saturated on MD device.
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> avgrq-sz avgqu-sz   await  svctm  %util
>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>> 0.00     0.00    0.00   0.00   0.00
>> sdb               0.00  1542.00    4.00  264.00    32.00 11760.00
>> 44.00    66.58  230.22   3.73 100.00
>> sdc               0.00  1185.00    0.00  272.00     0.00  9672.00
>> 35.56    63.40  215.88   3.68 100.00
>> sde               0.00  1352.00    0.00  298.00     0.00 12488.00
>> 41.91    35.56  126.34   3.36 100.00
>> sdd               0.00   996.00    0.00  294.00     0.00 10120.00
>> 34.42    76.79  270.37   3.40 100.00
>> md_d0             0.00     0.00    4.00    0.00    32.00     0.00
>> 8.00     0.00    0.00   0.00   0.00
>>
>> ==> Huge page flush still working,  no read is being done.
>>
>> This is the problem , when page flush kick in, MD appears to refuse
>> incoming read,  all under laying device is tuned to deadline scheduler
>> and tuned to favor read, still, it don't work since MD simply don't
>> submit new read to the underlying device.
>
> The counters are update when a request completes, not when it is submitted,
> so you cannot tell from this data if md is submitting the read requests or
> not.
>
> What kernel are you working with?  If it doesn't contain the commit
> identified below can you try with that and see if it makes a difference?
>
> Thanks,
> NeilBrown
>
>
>
>>
>> 2011/12/6 NeilBrown <neilb@suse.de>:
>> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
>> > wrote:
>> >
>> >> The problem with using page-flush as a write cache here is that write
>> >> to MD don't go through IO scheduler, which is a very big problem,
>> >> because when flush thread decide to write to MD,  it's impossible to
>> >> control the write speed, or prioritize them with read, every requests
>> >> basically is a fifo,  and when flush size is big, no read can be
>> >> served.
>> >>
>> >
>> > I'm not sure I understand....
>> >
>> > Requests don't go through an IO scheduler before they hit md, but they do
>> > after md sends them on down, so they can be re-ordered there.
>> >
>> > There was a bug where raid10 would allow an arbitrary number of writes to
>> > queue up so that flushing code didn't know when to stop.
>> >
>> > This was fixed by
>> >   commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>> >
>> > nearly 2 months ago :-)
>> >
>> > NeilBrown
>> >
>
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Raid10 and page cache
  2011-12-07  6:14                 ` Yucong Sun (叶雨飞)
@ 2011-12-07  9:21                   ` Yucong Sun (叶雨飞)
  2011-12-07 23:37                     ` Yucong Sun (叶雨飞)
  0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07  9:21 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

So, I re-read the kernel code, and it looks like backdev.cc is
doing the correct thing by calling writeback with WB_SYNC_NONE. It all
looks good, but I don't understand why reads would appear saturated
on my system.

However, I think your commit would definitely make things better.
Ideally, I think writes should use only the available bandwidth, like
sync does, and adjust automatically.

2011/12/6 Yucong Sun (叶雨飞) <sunyucong@gmail.com>:
> Ok, still, during that time, no read is being finished.
>
> I'm on  Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
> 16:26:12 UTC 2011 x86_64 GNU/Linux
> do you know which kernel version has that commit ? 2.6.35 ?
>
> I think the root cause is that, whenever dirty_background_bytes is
> reached,  kernel flush thread [flush:254:0] wakes up and cause
> md_raid10_d0 to go into state D, which cause everything to hang a
> while, I guess maybe the flush thread is calling fsync() after the
> write? That's hard to believe, but can actually explain the symptom.
>
> BTW I don't think limiting batch write to 1024 would solve the
> problem, I am actually doing it now because I have to set
> dirty_background_bytes to 4M which is exactly 1024 write every second
> or so.
>
> Cheers.
>
> On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@suse.de> wrote:
>> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
>> wrote:
>>
>>> I'm not sure whether it is what I mean,  to illustrate my problem let
>>> me put iostat -x -d 1 output  as below
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>>> avgrq-sz avgqu-sz   await  svctm  %util
>>> sdb               0.00     0.00  163.00    1.00  1304.00     8.00
>>> 8.00     0.26    1.59   1.59  26.00
>>> sdc               0.00     0.00   93.00    1.00   744.00     8.00
>>> 8.00     0.24    2.55   2.45  23.00
>>> sde               0.00     0.00   56.00    1.00   448.00     8.00
>>> 8.00     0.22    3.86 3.86 22.00
>>> sdd               0.00     0.00   88.00    1.00   704.00     8.00
>>> 8.00     0.18    2.02 2.02 18.00
>>> md_d0             0.00     0.00  401.00    0.00  3208.00     0.00
>>> 8.00     0.00    0.00   0.00   0.00
>>>
>>> ==> this is normal operation, because of page cache, there's only read
>>> being submitted to the MD device.
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>>> avgrq-sz avgqu-sz   await  svctm  %util
>>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>>> 0.00     0.00    0.00   0.00   0.00
>>> sdb               0.00  1714.00    4.00  277.00    32.00 14810.00
>>> 52.82    34.04  105.05   2.92  82.00
>>> sdc               0.00  1685.00   12.00  270.00    96.00 14122.00
>>> 50.42    42.56  131.03   3.09  87.00
>>> sde               0.00  1385.00    8.00  261.00    64.00 12426.00
>>> 46.43    29.76   99.44   3.35  90.00
>>> sdd               0.00  1350.00    8.00  228.00    64.00 10682.00
>>> 45.53    40.93  133.56   3.69  87.00
>>> md_d0             0.00     0.00   32.00 16446.00   256.00 131568.00
>>>  8.00     0.00    0.00   0.00   0.00
>>>
>>> ==> Huge page flush kick in, note the read requests is saturated on MD device.
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>>> avgrq-sz avgqu-sz   await  svctm  %util
>>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>>> 0.00     0.00    0.00   0.00   0.00
>>> sdb               0.00  1542.00    4.00  264.00    32.00 11760.00
>>> 44.00    66.58  230.22   3.73 100.00
>>> sdc               0.00  1185.00    0.00  272.00     0.00  9672.00
>>> 35.56    63.40  215.88   3.68 100.00
>>> sde               0.00  1352.00    0.00  298.00     0.00 12488.00
>>> 41.91    35.56  126.34   3.36 100.00
>>> sdd               0.00   996.00    0.00  294.00     0.00 10120.00
>>> 34.42    76.79  270.37   3.40 100.00
>>> md_d0             0.00     0.00    4.00    0.00    32.00     0.00
>>> 8.00     0.00    0.00   0.00   0.00
>>>
>>> ==> Huge page flush still working,  no read is being done.
>>>
>>> This is the problem , when page flush kick in, MD appears to refuse
>>> incoming read,  all under laying device is tuned to deadline scheduler
>>> and tuned to favor read, still, it don't work since MD simply don't
>>> submit new read to the underlying device.
>>
>> The counters are update when a request completes, not when it is submitted,
>> so you cannot tell from this data if md is submitting the read requests or
>> not.
>>
>> What kernel are you working with?  If it doesn't contain the commit
>> identified below can you try with that and see if it makes a difference?
>>
>> Thanks,
>> NeilBrown
>>
>>
>>
>>>
>>> 2011/12/6 NeilBrown <neilb@suse.de>:
>>> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
>>> > wrote:
>>> >
>>> >> The problem with using page-flush as a write cache here is that write
>>> >> to MD don't go through IO scheduler, which is a very big problem,
>>> >> because when flush thread decide to write to MD,  it's impossible to
>>> >> control the write speed, or prioritize them with read, every requests
>>> >> basically is a fifo,  and when flush size is big, no read can be
>>> >> served.
>>> >>
>>> >
>>> > I'm not sure I understand....
>>> >
>>> > Requests don't go through an IO scheduler before they hit md, but they do
>>> > after md sends them on down, so they can be re-ordered there.
>>> >
>>> > There was a bug where raid10 would allow an arbitrary number of writes to
>>> > queue up so that flushing code didn't know when to stop.
>>> >
>>> > This was fixed by
>>> >   commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>>> >
>>> > nearly 2 months ago :-)
>>> >
>>> > NeilBrown
>>> >
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Raid10 and page cache
  2011-12-07  9:21                   ` Yucong Sun (叶雨飞)
@ 2011-12-07 23:37                     ` Yucong Sun (叶雨飞)
  2011-12-08  0:10                       ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07 23:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Neil, I can't compile latest MD against 2.6.32,  and that commit can't
be patched into 2.6.32 directly either, can you help me on this?

Cheers.

2011/12/7 Yucong Sun (叶雨飞) <sunyucong@gmail.com>:
> So, I re-read the kernel code again,  it looks like backdev.cc is
> doing the correct thing by calling writeback with WB_NO_SYNC, it all
> looks good, but I don't understand why it would appear read saturated
> on my system.
>
> However I think your commit would definitely make things better,
> Ideally I think make write only use available bandwidth like sync
> does, and automatically adjusting.
>
> 2011/12/6 Yucong Sun (叶雨飞) <sunyucong@gmail.com>:
>> Ok, still, during that time, no read is being finished.
>>
>> I'm on  Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
>> 16:26:12 UTC 2011 x86_64 GNU/Linux
>> do you know which kernel version has that commit ? 2.6.35 ?
>>
>> I think the root cause is that, whenever dirty_background_bytes is
>> reached,  kernel flush thread [flush:254:0] wakes up and cause
>> md_raid10_d0 to go into state D, which cause everything to hang a
>> while, I guess maybe the flush thread is calling fsync() after the
>> write? That's hard to believe, but can actually explain the symptom.
>>
>> BTW I don't think limiting batch write to 1024 would solve the
>> problem, I am actually doing it now because I have to set
>> dirty_background_bytes to 4M which is exactly 1024 write every second
>> or so.
>>
>> Cheers.
>>
>> On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@suse.de> wrote:
>>> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
>>> wrote:
>>>
>>>> I'm not sure whether it is what I mean,  to illustrate my problem let
>>>> me put iostat -x -d 1 output  as below
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>>>> avgrq-sz avgqu-sz   await  svctm  %util
>>>> sdb               0.00     0.00  163.00    1.00  1304.00     8.00
>>>> 8.00     0.26    1.59   1.59  26.00
>>>> sdc               0.00     0.00   93.00    1.00   744.00     8.00
>>>> 8.00     0.24    2.55   2.45  23.00
>>>> sde               0.00     0.00   56.00    1.00   448.00     8.00
>>>> 8.00     0.22    3.86 3.86 22.00
>>>> sdd               0.00     0.00   88.00    1.00   704.00     8.00
>>>> 8.00     0.18    2.02 2.02 18.00
>>>> md_d0             0.00     0.00  401.00    0.00  3208.00     0.00
>>>> 8.00     0.00    0.00   0.00   0.00
>>>>
>>>> ==> this is normal operation, because of page cache, there's only read
>>>> being submitted to the MD device.
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>>>> avgrq-sz avgqu-sz   await  svctm  %util
>>>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>>>> 0.00     0.00    0.00   0.00   0.00
>>>> sdb               0.00  1714.00    4.00  277.00    32.00 14810.00
>>>> 52.82    34.04  105.05   2.92  82.00
>>>> sdc               0.00  1685.00   12.00  270.00    96.00 14122.00
>>>> 50.42    42.56  131.03   3.09  87.00
>>>> sde               0.00  1385.00    8.00  261.00    64.00 12426.00
>>>> 46.43    29.76   99.44   3.35  90.00
>>>> sdd               0.00  1350.00    8.00  228.00    64.00 10682.00
>>>> 45.53    40.93  133.56   3.69  87.00
>>>> md_d0             0.00     0.00   32.00 16446.00   256.00 131568.00
>>>>  8.00     0.00    0.00   0.00   0.00
>>>>
>>>> ==> Huge page flush kick in, note the read requests is saturated on MD device.
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>>>> avgrq-sz avgqu-sz   await  svctm  %util
>>>> sda               0.00     0.00    0.00    0.00     0.00     0.00
>>>> 0.00     0.00    0.00   0.00   0.00
>>>> sdb               0.00  1542.00    4.00  264.00    32.00 11760.00
>>>> 44.00    66.58  230.22   3.73 100.00
>>>> sdc               0.00  1185.00    0.00  272.00     0.00  9672.00
>>>> 35.56    63.40  215.88   3.68 100.00
>>>> sde               0.00  1352.00    0.00  298.00     0.00 12488.00
>>>> 41.91    35.56  126.34   3.36 100.00
>>>> sdd               0.00   996.00    0.00  294.00     0.00 10120.00
>>>> 34.42    76.79  270.37   3.40 100.00
>>>> md_d0             0.00     0.00    4.00    0.00    32.00     0.00
>>>> 8.00     0.00    0.00   0.00   0.00
>>>>
>>>> ==> Huge page flush still working,  no read is being done.
>>>>
>>>> This is the problem , when page flush kick in, MD appears to refuse
>>>> incoming read,  all under laying device is tuned to deadline scheduler
>>>> and tuned to favor read, still, it don't work since MD simply don't
>>>> submit new read to the underlying device.
>>>
>>> The counters are update when a request completes, not when it is submitted,
>>> so you cannot tell from this data if md is submitting the read requests or
>>> not.
>>>
>>> What kernel are you working with?  If it doesn't contain the commit
>>> identified below can you try with that and see if it makes a difference?
>>>
>>> Thanks,
>>> NeilBrown
>>>
>>>
>>>
>>>>
>>>> 2011/12/6 NeilBrown <neilb@suse.de>:
>>>> > On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
>>>> > wrote:
>>>> >
>>>> >> The problem with using page-flush as a write cache here is that write
>>>> >> to MD don't go through IO scheduler, which is a very big problem,
>>>> >> because when flush thread decide to write to MD,  it's impossible to
>>>> >> control the write speed, or prioritize them with read, every requests
>>>> >> basically is a fifo,  and when flush size is big, no read can be
>>>> >> served.
>>>> >>
>>>> >
>>>> > I'm not sure I understand....
>>>> >
>>>> > Requests don't go through an IO scheduler before they hit md, but they do
>>>> > after md sends them on down, so they can be re-ordered there.
>>>> >
>>>> > There was a bug where raid10 would allow an arbitrary number of writes to
>>>> > queue up so that flushing code didn't know when to stop.
>>>> >
>>>> > This was fixed by
>>>> >   commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>>>> >
>>>> > nearly 2 months ago :-)
>>>> >
>>>> > NeilBrown
>>>> >
>>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Raid10 and page cache
  2011-12-07 23:37                     ` Yucong Sun (叶雨飞)
@ 2011-12-08  0:10                       ` NeilBrown
  2011-12-08  6:31                         ` Yucong Sun (叶雨飞)
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-08  0:10 UTC (permalink / raw)
  To: Yucong Sun (叶雨飞); +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 7881 bytes --]

On Wed, 7 Dec 2011 15:37:30 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> Neil, I can't compile latest MD against 2.6.32,  and that commit can't
> be patched into 2.6.32 directly either, can you help me on this?
> 

This should do it.

NeilBrown

commit ef54b7cf955dc3b7d33248e8591b1a00b4fa998c
Author: NeilBrown <neilb@suse.de>
Date:   Tue Oct 11 16:50:01 2011 +1100

    md: add proper write-congestion reporting to RAID1 and RAID10.
    
    RAID1 and RAID10 handle write requests by queuing them for handling by
    a separate thread.  This is because when a write-intent-bitmap is
    active we might need to update the bitmap first, so it is good to
    queue a lot of writes, then do one big bitmap update for them all.
    
    However writeback requires devices to appear to be congested after a
    while so it can make some guesstimate of throughput.  The infinite
    queue defeats that (note that RAID5 already has a finite queue so
    it doesn't suffer from this problem).
    
    So impose a limit on the number of pending write requests.  By default
    it is 1024 which seems to be generally suitable.  Make it configurable
    via module option just in case someone finds a regression.
    
    Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e07ce2e..fe7ae3c 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -50,6 +50,11 @@
  */
 #define	NR_RAID1_BIOS 256
 
+/* When there are this many requests queued to be written by
+ * the raid1 thread, we become 'congested' to provide back-pressure
+ * for writeback.
+ */
+static int max_queued_requests = 1024;
 
 static void unplug_slaves(mddev_t *mddev);
 
@@ -576,7 +581,8 @@ static int raid1_congested(void *data, int bits)
 	conf_t *conf = mddev->private;
 	int i, ret = 0;
 
-	if (mddev_congested(mddev, bits))
+	if (mddev_congested(mddev, bits) &&
+	    conf->pending_count >= max_queued_requests)
 		return 1;
 
 	rcu_read_lock();
@@ -613,10 +619,12 @@ static int flush_pending_writes(conf_t *conf)
 		struct bio *bio;
 		bio = bio_list_get(&conf->pending_bio_list);
 		blk_remove_plug(conf->mddev->queue);
+		conf->pending_count = 0;
 		spin_unlock_irq(&conf->device_lock);
 		/* flush any pending bitmap writes to
 		 * disk before proceeding w/ I/O */
 		bitmap_unplug(conf->mddev->bitmap);
+		wake_up(&conf->wait_barrier);
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
@@ -789,6 +797,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	int cpu;
 	bool do_barriers;
 	mdk_rdev_t *blocked_rdev;
+	int cnt = 0;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -864,6 +873,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	/*
 	 * WRITE:
 	 */
+	if (conf->pending_count >= max_queued_requests) {
+		md_wakeup_thread(mddev->thread);
+		wait_event(conf->wait_barrier,
+			   conf->pending_count < max_queued_requests);
+	}
 	/* first select target devices under spinlock and
 	 * inc refcount on their rdev.  Record them by setting
 	 * bios[x] to bio
@@ -970,6 +984,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 		atomic_inc(&r1_bio->remaining);
 
 		bio_list_add(&bl, mbio);
+		cnt++;
 	}
 	kfree(behind_pages); /* the behind pages are attached to the bios now */
 
@@ -978,6 +993,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	spin_lock_irqsave(&conf->device_lock, flags);
 	bio_list_merge(&conf->pending_bio_list, &bl);
 	bio_list_init(&bl);
+	conf->pending_count += cnt;
 
 	blk_plug_device(mddev->queue);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
@@ -2021,7 +2037,7 @@ static int run(mddev_t *mddev)
 
 	bio_list_init(&conf->pending_bio_list);
 	bio_list_init(&conf->flushing_bio_list);
-
+	conf->pending_count = 0;
 
 	mddev->degraded = 0;
 	for (i = 0; i < conf->raid_disks; i++) {
@@ -2317,3 +2333,5 @@ MODULE_LICENSE("GPL");
 MODULE_ALIAS("md-personality-3"); /* RAID1 */
 MODULE_ALIAS("md-raid1");
 MODULE_ALIAS("md-level-1");
+
+module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index e87b84d..520288c 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -38,6 +38,7 @@ struct r1_private_data_s {
 	/* queue of writes that have been unplugged */
 	struct bio_list		flushing_bio_list;
 
+	int			pending_count;
 	/* for use when syncing mirrors: */
 
 	spinlock_t		resync_lock;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c2cb7b8..4c7d9b5 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -59,6 +59,11 @@ static void unplug_slaves(mddev_t *mddev);
 
 static void allow_barrier(conf_t *conf);
 static void lower_barrier(conf_t *conf);
+/* When there are this many requests queued to be written by
+ * the raid10 thread, we become 'congested' to provide back-pressure
+ * for writeback.
+ */
+static int max_queued_requests = 1024;
 
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -631,6 +636,10 @@ static int raid10_congested(void *data, int bits)
 	conf_t *conf = mddev->private;
 	int i, ret = 0;
 
+	if ((bits & (1 << BDI_async_congested)) &&
+	    conf->pending_count >= max_queued_requests)
+		return 1;
+
 	if (mddev_congested(mddev, bits))
 		return 1;
 	rcu_read_lock();
@@ -660,10 +669,12 @@ static int flush_pending_writes(conf_t *conf)
 		struct bio *bio;
 		bio = bio_list_get(&conf->pending_bio_list);
 		blk_remove_plug(conf->mddev->queue);
+		conf->pending_count = 0;
 		spin_unlock_irq(&conf->device_lock);
 		/* flush any pending bitmap writes to disk
 		 * before proceeding w/ I/O */
 		bitmap_unplug(conf->mddev->bitmap);
+		wake_up(&conf->wait_barrier);
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
@@ -802,6 +813,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	struct bio_list bl;
 	unsigned long flags;
 	mdk_rdev_t *blocked_rdev;
+	int cnt = 0;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
 		bio_endio(bio, -EOPNOTSUPP);
@@ -894,6 +906,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	/*
 	 * WRITE:
 	 */
+	if (conf->pending_count >= max_queued_requests) {
+		md_wakeup_thread(mddev->thread);
+		wait_event(conf->wait_barrier,
+			   conf->pending_count < max_queued_requests);
+	}
 	/* first select target devices under rcu_lock and
 	 * inc refcount on their rdev.  Record them by setting
 	 * bios[x] to bio
@@ -957,6 +974,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 
 		atomic_inc(&r10_bio->remaining);
 		bio_list_add(&bl, mbio);
+		cnt++;
 	}
 
 	if (unlikely(!atomic_read(&r10_bio->remaining))) {
@@ -970,6 +988,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	spin_lock_irqsave(&conf->device_lock, flags);
 	bio_list_merge(&conf->pending_bio_list, &bl);
 	blk_plug_device(mddev->queue);
+	conf->pending_count += cnt;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	/* In case raid10d snuck in to freeze_array */
@@ -2318,3 +2337,5 @@ MODULE_LICENSE("GPL");
 MODULE_ALIAS("md-personality-9"); /* RAID10 */
 MODULE_ALIAS("md-raid10");
 MODULE_ALIAS("md-level-10");
+
+module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 59cd1ef..e6e1613 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -39,7 +39,7 @@ struct r10_private_data_s {
 	struct list_head	retry_list;
 	/* queue pending writes and submit them on unplug */
 	struct bio_list		pending_bio_list;
-
+	int			pending_count;
 
 	spinlock_t		resync_lock;
 	int nr_pending;
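

Since the patch declares max_queued_requests with module_param() and
S_IRUGO|S_IWUSR, the limit should show up under /sys/module once the
patched modules are loaded. The following is a minimal sketch for
inspecting it; the helper name md_queue_limit_path is made up for
illustration, and the files will simply be absent on an unpatched kernel.

```shell
#!/bin/sh
# Hypothetical helper: sysfs path where a module_param() named
# max_queued_requests is exposed for the given module (raid1 or raid10).
md_queue_limit_path() {
    echo "/sys/module/$1/parameters/max_queued_requests"
}

# Report the current limit for each personality, if it is available.
for mod in raid1 raid10; do
    p=$(md_queue_limit_path "$mod")
    if [ -r "$p" ]; then
        echo "$mod: max_queued_requests=$(cat "$p")"
    else
        echo "$mod: parameter not present (module not loaded or not patched)"
    fi
done
```

Because the parameter is writable by root, a different limit can be tried
at runtime by echoing a new value into the same file.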



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Raid10 and page cache
  2011-12-08  0:10                       ` NeilBrown
@ 2011-12-08  6:31                         ` Yucong Sun (叶雨飞)
  0 siblings, 0 replies; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-08  6:31 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Sadly, the patch didn't help at all; see the following:


Device:  rrqm/s   wrqm/s     r/s      w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00     0.00    0.00     0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb        0.00  2042.00    0.00   345.00      0.00  64112.00   185.83    93.13   93.36   2.12  73.00
sdd        0.00  1704.00    7.00   156.00     56.00  12496.00    77.01    95.71  146.20   3.62  59.00
sdc        0.00  1518.00   16.00   185.00    128.00   9936.00    50.07    98.20  157.41   3.13  63.00
sde      222.00  1997.00  194.00   189.00  51568.00  16488.00   177.69    81.54   99.09   2.25  86.00
md0        0.00     0.00   37.00  4096.00    296.00  32768.00     8.00     0.00    0.00   0.00   0.00


Device:  rrqm/s   wrqm/s     r/s      w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00     0.00    0.00     0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb        0.00   150.00    0.00   194.00      0.00  33336.00   171.84    34.91  492.84   4.59  89.00
sdd        0.00     0.00    0.00   138.00      0.00   3488.00    25.28    32.68  757.75   4.06  56.00
sdc        0.00     0.00    3.00   127.00     24.00   4704.00    36.37    33.68  771.08   4.54  59.00
sde      222.00     0.00   90.00    84.00  39936.00   1672.00   239.13    23.73  386.90   4.08  71.00
md0        0.00     0.00    2.00     0.00     16.00      0.00     8.00     0.00    0.00   0.00   0.00


Device:  rrqm/s   wrqm/s     r/s      w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00     0.00    0.00     0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb        0.00   235.00    0.00   188.00      0.00  54024.00   287.36     0.49    3.78   1.65  31.00
sdd        0.00     0.00   27.00     0.00    216.00      0.00     8.00     0.15    5.56   5.56  15.00
sdc        0.00     0.00   46.00     0.00    368.00      0.00     8.00     0.32    6.52   6.96  32.00
sde      165.00     0.00  200.00     0.00  43480.00      0.00   217.40     7.63   38.15   2.00  40.00
md0        0.00     0.00  101.00     0.00    808.00      0.00     8.00     0.00    0.00   0.00   0.00

I poked around and found this when a big flush comes in:

Every 1.0s: cat /sys/block/sdb/stat /sys/block/sdc/stat /sys/block/sdd/stat /sys/block/sde/stat /sys/block/md0/stat      Wed Dec  7 22:26:14 2011

      32       10       336      270  2792623  5501730 783168880 254952160  284  4815060 255014270
 2993481  2222268 499586400 94384090   493165  1842192  18671608 271311440  290  9942910 365758660
  691727       19   5533896  1507300   501261  1838497  18706544 276987570  262  3254420 278552760
 1458797  1404948 281875858 49664210   483386  1841832  18588928 256627020  259  4997270 306348180
 2797538        0  22380058        0  4652939        0  37223512         0    0        0         0

Every downstream disk shows a huge jump in in-flight I/O, where it is
usually just 0 or 1 the whole time. The kernel documentation says this
count doesn't include queued I/O, so I think the problem is that the I/O
scheduler issued too many requests to the device without throttling
reads/writes, which basically saturated the disk so that no other read
could be scheduled. Do you know why this would happen to me?
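
For reference, the in-flight number is the 9th whitespace-separated field
of each /sys/block/<dev>/stat line, per the kernel's block stat
documentation. A minimal sketch for pulling it out follows; the function
name in_flight is made up for illustration, and the device list is just
the one used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the in_flight column (9th field) of one
# line taken from /sys/block/<dev>/stat.
in_flight() {
    # Left unquoted on purpose so the shell splits the line on whitespace.
    set -- $1
    echo "$9"
}

# Sample once for each device from this thread that exists locally.
for dev in sdb sdc sdd sde md0; do
    stat_file=/sys/block/$dev/stat
    if [ -r "$stat_file" ]; then
        echo "$dev: $(in_flight "$(cat "$stat_file")")"
    fi
done
```

Applied to the first stat line captured above, the function picks out 284;
the other member disks show 290, 262 and 259, while md0 shows 0.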

Here's my relevant scheduler tweak:

for disk in /sys/block/sd[bcde]
do
        echo "changing $disk scheduler"
        echo "deadline" > "$disk/queue/scheduler"

        echo "changing $disk nr_requests to 4096"
        echo 4096 > "$disk/queue/nr_requests"

        echo "setra to 0"
        echo 0 > "$disk/queue/read_ahead_kb"

        echo "tweaking deadline io"
        # requests dispatched per batch
        echo 32 > "$disk/queue/iosched/fifo_batch"
        # read and write deadlines, in milliseconds
        echo 30 > "$disk/queue/iosched/read_expire"
        echo 20000 > "$disk/queue/iosched/write_expire"
        # read batches allowed before a write batch must be served
        echo 256 > "$disk/queue/iosched/writes_starved"
done

echo 0 > /sys/block/md0/queue/read_ahead_kb

My workload profile is 100% random 8K IO.

Come to think of it, the problem is mostly an I/O scheduling issue. Does
nr_requests mean anything to MD? It's not possible to adjust it there
either. Was that the reason MD can't accept more reads?

On Wed, Dec 7, 2011 at 4:10 PM, NeilBrown <neilb@suse.de> wrote:
>
> On Wed, 7 Dec 2011 15:37:30 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
> wrote:
>
> > Neil, I can't compile latest MD against 2.6.32,  and that commit can't
> > be patched into 2.6.32 directly either, can you help me on this?
> >
>
> This should do it.
>
> NeilBrown
>
> commit ef54b7cf955dc3b7d33248e8591b1a00b4fa998c
> Author: NeilBrown <neilb@suse.de>
> Date:   Tue Oct 11 16:50:01 2011 +1100
>
>    md: add proper write-congestion reporting to RAID1 and RAID10.
>
>    RAID1 and RAID10 handle write requests by queuing them for handling by
>    a separate thread.  This is because when a write-intent-bitmap is
>    active we might need to update the bitmap first, so it is good to
>    queue a lot of writes, then do one big bitmap update for them all.
>
>    However writeback requires devices to appear to be congested after a
>    while so it can make some guesstimate of throughput.  The infinite
>    queue defeats that (note that RAID5 already has a finite queue so
>    it doesn't suffer from this problem).
>
>    So impose a limit on the number of pending write requests.  By default
>    it is 1024 which seems to be generally suitable.  Make it configurable
>    via module option just in case someone finds a regression.
>
>    Signed-off-by: NeilBrown <neilb@suse.de>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index e07ce2e..fe7ae3c 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -50,6 +50,11 @@
>  */
>  #define        NR_RAID1_BIOS 256
>
> +/* When there are this many requests queue to be written by
> + * the raid1 thread, we become 'congested' to provide back-pressure
> + * for writeback.
> + */
> +static int max_queued_requests = 1024;
>
>  static void unplug_slaves(mddev_t *mddev);
>
> @@ -576,7 +581,8 @@ static int raid1_congested(void *data, int bits)
>        conf_t *conf = mddev->private;
>        int i, ret = 0;
>
> -       if (mddev_congested(mddev, bits))
> +       if (mddev_congested(mddev, bits) &&
> +           conf->pending_count >= max_queued_requests)
>                return 1;
>
>        rcu_read_lock();
> @@ -613,10 +619,12 @@ static int flush_pending_writes(conf_t *conf)
>                struct bio *bio;
>                bio = bio_list_get(&conf->pending_bio_list);
>                blk_remove_plug(conf->mddev->queue);
> +               conf->pending_count = 0;
>                spin_unlock_irq(&conf->device_lock);
>                /* flush any pending bitmap writes to
>                 * disk before proceeding w/ I/O */
>                bitmap_unplug(conf->mddev->bitmap);
> +               wake_up(&conf->wait_barrier);
>
>                while (bio) { /* submit pending writes */
>                        struct bio *next = bio->bi_next;
> @@ -789,6 +797,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>        int cpu;
>        bool do_barriers;
>        mdk_rdev_t *blocked_rdev;
> +       int cnt = 0;
>
>        /*
>         * Register the new request and wait if the reconstruction
> @@ -864,6 +873,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
>        /*
>         * WRITE:
>         */
> +       if (conf->pending_count >= max_queued_requests) {
> +               md_wakeup_thread(mddev->thread);
> +               wait_event(conf->wait_barrier,
> +                          conf->pending_count < max_queued_requests);
> +       }
>        /* first select target devices under spinlock and
>         * inc refcount on their rdev.  Record them by setting
>         * bios[x] to bio
> @@ -970,6 +984,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>                atomic_inc(&r1_bio->remaining);
>
>                bio_list_add(&bl, mbio);
> +               cnt++;
>        }
>        kfree(behind_pages); /* the behind pages are attached to the bios now */
>
> @@ -978,6 +993,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>        spin_lock_irqsave(&conf->device_lock, flags);
>        bio_list_merge(&conf->pending_bio_list, &bl);
>        bio_list_init(&bl);
> +       conf->pending_count += cnt;
>
>        blk_plug_device(mddev->queue);
>        spin_unlock_irqrestore(&conf->device_lock, flags);
> @@ -2021,7 +2037,7 @@ static int run(mddev_t *mddev)
>
>        bio_list_init(&conf->pending_bio_list);
>        bio_list_init(&conf->flushing_bio_list);
> -
> +       conf->pending_count = 0;
>
>        mddev->degraded = 0;
>        for (i = 0; i < conf->raid_disks; i++) {
> @@ -2317,3 +2333,5 @@ MODULE_LICENSE("GPL");
>  MODULE_ALIAS("md-personality-3"); /* RAID1 */
>  MODULE_ALIAS("md-raid1");
>  MODULE_ALIAS("md-level-1");
> +
> +module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index e87b84d..520288c 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -38,6 +38,7 @@ struct r1_private_data_s {
>        /* queue of writes that have been unplugged */
>        struct bio_list         flushing_bio_list;
>
> +       int                     pending_count;
>        /* for use when syncing mirrors: */
>
>        spinlock_t              resync_lock;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index c2cb7b8..4c7d9b5 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -59,6 +59,11 @@ static void unplug_slaves(mddev_t *mddev);
>
>  static void allow_barrier(conf_t *conf);
>  static void lower_barrier(conf_t *conf);
> +/* When there are this many requests queue to be written by
> + * the raid10 thread, we become 'congested' to provide back-pressure
> + * for writeback.
> + */
> +static int max_queued_requests = 1024;
>
>  static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
>  {
> @@ -631,6 +636,10 @@ static int raid10_congested(void *data, int bits)
>        conf_t *conf = mddev->private;
>        int i, ret = 0;
>
> +       if ((bits & (1 << BDI_async_congested)) &&
> +           conf->pending_count >= max_queued_requests)
> +               return 1;
> +
>        if (mddev_congested(mddev, bits))
>                return 1;
>        rcu_read_lock();
> @@ -660,10 +669,12 @@ static int flush_pending_writes(conf_t *conf)
>                struct bio *bio;
>                bio = bio_list_get(&conf->pending_bio_list);
>                blk_remove_plug(conf->mddev->queue);
> +               conf->pending_count = 0;
>                spin_unlock_irq(&conf->device_lock);
>                /* flush any pending bitmap writes to disk
>                 * before proceeding w/ I/O */
>                bitmap_unplug(conf->mddev->bitmap);
> +               wake_up(&conf->wait_barrier);
>
>                while (bio) { /* submit pending writes */
>                        struct bio *next = bio->bi_next;
> @@ -802,6 +813,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>        struct bio_list bl;
>        unsigned long flags;
>        mdk_rdev_t *blocked_rdev;
> +       int cnt = 0;
>
>        if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
>                bio_endio(bio, -EOPNOTSUPP);
> @@ -894,6 +906,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
>        /*
>         * WRITE:
>         */
> +       if (conf->pending_count >= max_queued_requests) {
> +               md_wakeup_thread(mddev->thread);
> +               wait_event(conf->wait_barrier,
> +                          conf->pending_count < max_queued_requests);
> +       }
>        /* first select target devices under rcu_lock and
>         * inc refcount on their rdev.  Record them by setting
>         * bios[x] to bio
> @@ -957,6 +974,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>
>                atomic_inc(&r10_bio->remaining);
>                bio_list_add(&bl, mbio);
> +               cnt++;
>        }
>
>        if (unlikely(!atomic_read(&r10_bio->remaining))) {
> @@ -970,6 +988,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
>        spin_lock_irqsave(&conf->device_lock, flags);
>        bio_list_merge(&conf->pending_bio_list, &bl);
>        blk_plug_device(mddev->queue);
> +       conf->pending_count += cnt;
>        spin_unlock_irqrestore(&conf->device_lock, flags);
>
>        /* In case raid10d snuck in to freeze_array */
> @@ -2318,3 +2337,5 @@ MODULE_LICENSE("GPL");
>  MODULE_ALIAS("md-personality-9"); /* RAID10 */
>  MODULE_ALIAS("md-raid10");
>  MODULE_ALIAS("md-level-10");
> +
> +module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
> diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
> index 59cd1ef..e6e1613 100644
> --- a/drivers/md/raid10.h
> +++ b/drivers/md/raid10.h
> @@ -39,7 +39,7 @@ struct r10_private_data_s {
>        struct list_head        retry_list;
>        /* queue pending writes and submit them on unplug */
>        struct bio_list         pending_bio_list;
> -
> +       int                     pending_count;
>
>        spinlock_t              resync_lock;
>        int nr_pending;
>
>
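
Since the patch above registers max_queued_requests with module_param()
and S_IRUGO|S_IWUSR, the limit should show up as a root-writable file
under /sys/module/<module>/parameters/ once a patched module is loaded.
The following is a rough sketch for reading or changing it. It assumes
the modules are named raid1 and raid10. The helper name queue_limit and
the PARAM_ROOT override are made up for illustration.

```shell
#!/bin/sh
# Read, and optionally set, max_queued_requests for an md personality module.
# PARAM_ROOT can be overridden so the helper can be tried without the module.
PARAM_ROOT=${PARAM_ROOT:-/sys/module}

queue_limit() {
    module=$1   # e.g. raid10 or raid1
    new=$2      # optional new limit; leave empty to only read
    file=$PARAM_ROOT/$module/parameters/max_queued_requests
    if [ ! -e "$file" ]; then
        echo "$module: max_queued_requests not present" >&2
        return 1
    fi
    if [ -n "$new" ]; then
        echo "$new" > "$file"
    fi
    cat "$file"
}
```

For example, "queue_limit raid10" prints the current limit and
"queue_limit raid10 2048" (as root) raises it. Whether a larger or
smaller value helps a 100% random 8K workload would need to be measured.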

Thread overview: 16+ messages
2011-12-06 21:29 Raid10 and page cache Yucong Sun (叶雨飞)
2011-12-06 22:01 ` Yucong Sun (叶雨飞)
2011-12-06 22:26   ` NeilBrown
2011-12-06 23:13     ` Yucong Sun (叶雨飞)
2011-12-06 23:22       ` Marcus Sorensen
2011-12-07  1:01       ` NeilBrown
2011-12-07  4:04         ` Yucong Sun (叶雨飞)
2011-12-07  4:28           ` NeilBrown
2011-12-07  4:50             ` Yucong Sun (叶雨飞)
2011-12-07  5:10               ` NeilBrown
2011-12-07  6:14                 ` Yucong Sun (叶雨飞)
2011-12-07  9:21                   ` Yucong Sun (叶雨飞)
2011-12-07 23:37                     ` Yucong Sun (叶雨飞)
2011-12-08  0:10                       ` NeilBrown
2011-12-08  6:31                         ` Yucong Sun (叶雨飞)
     [not found] ` <CAJygYd16PWfKe8fK-b150N46CEwzBUqJn1N6dfsGR4yyTgGbTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-12-06 22:01   ` Yucong Sun (叶雨飞)