* Raid10 and page cache
@ 2011-12-06 21:29 Yucong Sun (叶雨飞)
2011-12-06 22:01 ` Yucong Sun (叶雨飞)
[not found] ` <CAJygYd16PWfKe8fK-b150N46CEwzBUqJn1N6dfsGR4yyTgGbTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 2 replies; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-06 21:29 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi,

I recently set up RAID10 on 4 physical disks, with iSCSI serving it
as a block device, and have been trying to tune it for performance.

The first thing I noticed is that MD seems to rely on the page cache to
flush changes to disk. Is there any way to turn that off so that changes
are written straight to the disk, as O_FSYNC|O_DIRECT does? I want to
turn it off to understand the performance difference and to be sure that
the page cache is truly acting as a write-back cache. I know one can tune
the dirty_* settings to control cache flushing, but I want to make sure
it is actually doing what I think it does.

Then I noticed that in the output of free, the number in the Cache column
is very low while Buffer is very high. Does Buffer serve as a read cache
here? I couldn't find the answer anywhere else.

My last question: since MD already seems to be doing caching, what effect
would it have to set up a loop (LO) device in front of the MD device?
Would there be more caching, and how would that differ from a plain MD
device?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 16+ messages in thread
* Raid10 and page cache
2011-12-06 21:29 Raid10 and page cache Yucong Sun (叶雨飞)
@ 2011-12-06 22:01 ` Yucong Sun (叶雨飞)
2011-12-06 22:26 ` NeilBrown
[not found] ` <CAJygYd16PWfKe8fK-b150N46CEwzBUqJn1N6dfsGR4yyTgGbTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
1 sibling, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-06 22:01 UTC (permalink / raw)
To: linux-raid

Hi,

I recently set up RAID10 on 4 physical disks, with iSCSI serving it
as a block device, and have been trying to tune it for performance.

The first thing I noticed is that MD seems to rely on the page cache to
flush changes to disk. Is there any way to turn that off so that changes
are written straight to the disk, as O_FSYNC|O_DIRECT does? I want to
turn it off to understand the performance difference and to be sure that
the page cache is truly acting as a write-back cache. I know one can tune
the dirty_* settings to control cache flushing, but I want to make sure
it is actually doing what I think it does.

Then I noticed that in the output of free, the number in the Cache column
is very low while Buffer is very high. Does Buffer serve as a read cache
here? I couldn't find the answer anywhere else.

My last question: since MD already seems to be doing caching, what effect
would it have to set up a loop (LO) device in front of the MD device?
Would there be more caching, and how would that differ from a plain MD
device?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Raid10 and page cache
2011-12-06 22:01 ` Yucong Sun (叶雨飞)
@ 2011-12-06 22:26 ` NeilBrown
2011-12-06 23:13 ` Yucong Sun (叶雨飞)
0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-06 22:26 UTC (permalink / raw)
To: Yucong Sun (叶雨飞); +Cc: linux-raid

On Tue, 6 Dec 2011 14:01:14 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> Hi,
>
> I recently set up RAID10 on 4 physical disks, with iSCSI serving it
> as a block device, and have been trying to tune it for performance.
>
> The first thing I noticed is that MD seems to rely on the page cache to
> flush changes to disk. Is there any way to turn that off so that changes
> are written straight to the disk, as O_FSYNC|O_DIRECT does? I want to
> turn it off to understand the performance difference and to be sure that
> the page cache is truly acting as a write-back cache. I know one can tune
> the dirty_* settings to control cache flushing, but I want to make sure
> it is actually doing what I think it does.

Why do you think this?

md/raid10 sends all requests straight through to the relevant underlying
device(s).
Reads are just passed straight down.
Writes are duplicated (the request structure, not the data) and queued to a
separate thread which does the actual write, but it is fairly direct.

> Then I noticed that in the output of free, the number in the Cache column
> is very low while Buffer is very high. Does Buffer serve as a read cache
> here? I couldn't find the answer anywhere else.

The best place to find the answer is in the source code.

Every page in the page cache is associated with some file.
If that file is a block device (e.g. /dev/sdX) then it is reported as
'Buffer'; otherwise it is reported as 'Cache'.

Some filesystems, like ext3, use 'Buffer' memory for metadata but use
'Cache' memory for files and directories.

> My last question: since MD already seems to be doing caching, what effect
> would it have to set up a loop (LO) device in front of the MD device?
> Would there be more caching, and how would that differ from a plain MD
> device?

MD/raid10 does no caching.
A loop-back over the md device would not add extra caching.

NeilBrown

^ permalink raw reply [flat|nested] 16+ messages in thread
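The Buffer/Cache distinction described above can be watched directly in
/proc/meminfo, along with the Dirty and Writeback counters. The following is
only a minimal sketch, not something posted in the thread: the helper names
(parse_meminfo, cache_summary, read_meminfo) are made up for illustration, and
it assumes the usual "Name:   value kB" layout of /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most values are in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(fields):
    """Pick out the counters discussed in this thread (values in kB)."""
    wanted = ("Buffers", "Cached", "Dirty", "Writeback")
    return {name: fields.get(name, 0) for name in wanted}


def read_meminfo(path="/proc/meminfo"):
    """Hypothetical convenience wrapper: summarize the live counters."""
    with open(path) as f:
        return cache_summary(parse_meminfo(f.read()))
```

Calling read_meminfo() once a second while the iSCSI target is under load
should show Dirty climbing and then draining as write-back runs, if the page
cache is indeed in the write path.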
* Re: Raid10 and page cache
2011-12-06 22:26 ` NeilBrown
@ 2011-12-06 23:13 ` Yucong Sun (叶雨飞)
2011-12-06 23:22 ` Marcus Sorensen
2011-12-07 1:01 ` NeilBrown
0 siblings, 2 replies; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-06 23:13 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

On Tue, Dec 6, 2011 at 2:26 PM, NeilBrown <neilb@suse.de> wrote:
> Why do you think this?
>
> md/raid10 sends all requests straight through to the relevant underlying
> device(s).
> Reads are just passed straight down.
> Writes are duplicated (the request structure, not the data) and queued to a
> separate thread which does the actual write, but it is fairly direct.

I know there is page caching and flushing involved because I watch
/proc/meminfo and see the Dirty value grow; once it reaches the
threshold, write-back kicks in and writes the data out.
So if, as you said, md does no page flushing, it must be because the
iSCSI software opens the device without O_DIRECT, so it goes through the
page cache, which in turn flushes data to MD. Now it makes more sense.

But is an md write not a SYNC write? That is, after a write call with
O_DIRECT to the md device returns, could the data still be in flight to
the disk? How does having a bitmap play into this? Does it work like the
ext3 journal? After a power loss, can we expect crash-consistent data on
the disk?

Another thing I noticed is that the I/O size on the MD device is always
4K, which is the page size. Is that normal? I just want to make sure this
isn't bad behavior caused by the iSCSI software.

> Every page in the page cache is associated with some file.
> If that file is a block device (e.g. /dev/sdX) then it is reported as
> 'Buffer'; otherwise it is reported as 'Cache'.
>
> Some filesystems, like ext3, use 'Buffer' memory for metadata but use
> 'Cache' memory for files and directories.

Thanks, so it is being used as a read cache. Too bad there's no easy
way to measure or see the hit rate.

> MD/raid10 does no caching.
> A loop-back over the md device would not add extra caching.
>
> NeilBrown

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Raid10 and page cache
2011-12-06 23:13 ` Yucong Sun (叶雨飞)
@ 2011-12-06 23:22 ` Marcus Sorensen
2011-12-07 1:01 ` NeilBrown
1 sibling, 0 replies; 16+ messages in thread
From: Marcus Sorensen @ 2011-12-06 23:22 UTC (permalink / raw)
To: Yucong Sun (叶雨飞); +Cc: NeilBrown, linux-raid

When you write a file, it is not MD doing caching that you see. The OS
caches via dirty memory before flushing to MD. If you want to write
sync or O_DIRECT, do so by adding the flag to the open() call when you
write a file.

^ permalink raw reply [flat|nested] 16+ messages in thread
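Adding the flag to the open() call, as suggested above, might look like the
sketch below. This is an illustration under assumptions rather than anything
from the thread: the function names are made up, the 4096-byte block size is
an assumed alignment (O_DIRECT needs buffer, length and offset aligned to the
device's logical block size), and os.O_DIRECT only exists on Linux. An
anonymous mmap is used because it is page-aligned.

```python
import mmap
import os


def write_osync(path, data):
    """Write through a descriptor opened with O_SYNC.

    Each write() returns only once the kernel reports the data as written
    out; the page cache is still used, but it is flushed synchronously.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o600)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)


def aligned_buffer(data, block_size=4096):
    """Copy data into a page-aligned buffer padded to a multiple of block_size."""
    blocks = max(1, -(-len(data) // block_size))  # ceiling division, at least 1
    buf = mmap.mmap(-1, blocks * block_size)      # anonymous, zero-filled
    buf.write(data)
    return buf


def write_odirect(path, data, block_size=4096):
    """Write with O_DIRECT, bypassing the page cache (Linux only).

    The file ends up padded with zero bytes to a multiple of block_size.
    Some filesystems (tmpfs, for example) reject O_DIRECT with EINVAL.
    """
    buf = aligned_buffer(data, block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o600)
    try:
        return os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
```

With either variant, Dirty in /proc/meminfo should stay roughly flat during
the write, which gives a baseline to compare against the cached path.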
* Re: Raid10 and page cache
2011-12-06 23:13 ` Yucong Sun (叶雨飞)
2011-12-06 23:22 ` Marcus Sorensen
@ 2011-12-07 1:01 ` NeilBrown
2011-12-07 4:04 ` Yucong Sun (叶雨飞)
1 sibling, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-07 1:01 UTC (permalink / raw)
To: Yucong Sun (叶雨飞); +Cc: linux-raid

On Tue, 6 Dec 2011 15:13:34 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> But is an md write not a SYNC write? That is, after a write call with
> O_DIRECT to the md device returns, could the data still be in flight to
> the disk? How does having a bitmap play into this? Does it work like the
> ext3 journal? After a power loss, can we expect crash-consistent data on
> the disk?

When you want sync writes, you need to use fsync.

When md writes the superblock or a bitmap page it uses SYNC and FLUSH writes
to ensure they get to the media before the subsequent data write.

> Another thing I noticed is that the I/O size on the MD device is always
> 4K, which is the page size. Is that normal? I just want to make sure this
> isn't bad behavior caused by the iSCSI software.

It is normal in some cases. It depends a bit on the details of the
underlying device.

NeilBrown

^ permalink raw reply [flat|nested] 16+ messages in thread
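The "use fsync" advice above can be sketched as a buffered write followed by
an explicit fsync() on the same descriptor. This is a minimal illustration,
not code from the thread; write_and_fsync is a made-up name, and the path
would be whatever file or block device is being tested.

```python
import os


def write_and_fsync(path, data):
    """Buffered write followed by fsync().

    The write() calls land in the page cache; fsync() then blocks until the
    kernel has pushed the dirty data for this file down to the device.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        written = 0
        while written < len(view):
            # os.write may write fewer bytes than requested, so loop.
            written += os.write(fd, view[written:])
        os.fsync(fd)
        return written
    finally:
        os.close(fd)
```

Unlike O_SYNC, this pays the flush cost once at the end instead of on every
write() call.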
* Re: Raid10 and page cache
2011-12-07 1:01 ` NeilBrown
@ 2011-12-07 4:04 ` Yucong Sun (叶雨飞)
2011-12-07 4:28 ` NeilBrown
0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07 4:04 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

The problem with using page flushing as a write cache here is that writes
to MD don't go through an I/O scheduler, which is a very big problem:
when the flush thread decides to write to MD, it's impossible to
control the write speed or to prioritize writes against reads. Every
request is basically FIFO, and when the flush is big, no reads can be
served.

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Raid10 and page cache
2011-12-07 4:04 ` Yucong Sun (叶雨飞)
@ 2011-12-07 4:28 ` NeilBrown
2011-12-07 4:50 ` Yucong Sun (叶雨飞)
0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-07 4:28 UTC (permalink / raw)
To: Yucong Sun (叶雨飞); +Cc: linux-raid

On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> The problem with using page-flush as a write cache here is that write
> to MD don't go through IO scheduler, which is a very big problem,
> because when flush thread decide to write to MD, it's impossible to
> control the write speed, or prioritize them with read, every requests
> basically is a fifo, and when flush size is big, no read can be
> served.
>

I'm not sure I understand....

Requests don't go through an IO scheduler before they hit md, but they do
after md sends them on down, so they can be re-ordered there.

There was a bug where raid10 would allow an arbitrary number of writes to
queue up so that flushing code didn't know when to stop.

This was fixed by
commit 34db0cd60f8a1f4ab73d118a8be3797c20388223

nearly 2 months ago :-)

NeilBrown

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Raid10 and page cache
2011-12-07 4:28 ` NeilBrown
@ 2011-12-07 4:50 ` Yucong Sun (叶雨飞)
2011-12-07 5:10 ` NeilBrown
0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07 4:50 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

I'm not sure whether that is what I mean. To illustrate my problem, here
is the output of iostat -x -d 1:

Device:  rrqm/s  wrqm/s    r/s      w/s  rsec/s    wsec/s avgrq-sz avgqu-sz  await svctm  %util
sdb        0.00    0.00 163.00     1.00 1304.00      8.00     8.00     0.26   1.59  1.59  26.00
sdc        0.00    0.00  93.00     1.00  744.00      8.00     8.00     0.24   2.55  2.45  23.00
sde        0.00    0.00  56.00     1.00  448.00      8.00     8.00     0.22   3.86  3.86  22.00
sdd        0.00    0.00  88.00     1.00  704.00      8.00     8.00     0.18   2.02  2.02  18.00
md_d0      0.00    0.00 401.00     0.00 3208.00      0.00     8.00     0.00   0.00  0.00   0.00

==> This is normal operation. Because of the page cache, only reads are
being submitted to the MD device.

Device:  rrqm/s  wrqm/s    r/s      w/s  rsec/s    wsec/s avgrq-sz avgqu-sz  await svctm  %util
sda        0.00    0.00   0.00     0.00    0.00      0.00     0.00     0.00   0.00  0.00   0.00
sdb        0.00 1714.00   4.00   277.00   32.00  14810.00    52.82    34.04 105.05  2.92  82.00
sdc        0.00 1685.00  12.00   270.00   96.00  14122.00    50.42    42.56 131.03  3.09  87.00
sde        0.00 1385.00   8.00   261.00   64.00  12426.00    46.43    29.76  99.44  3.35  90.00
sdd        0.00 1350.00   8.00   228.00   64.00  10682.00    45.53    40.93 133.56  3.69  87.00
md_d0      0.00    0.00  32.00 16446.00  256.00 131568.00     8.00     0.00   0.00  0.00   0.00

==> A huge page flush kicks in. Note that reads on the MD device are
squeezed out (32 r/s against 16446 w/s).

Device:  rrqm/s  wrqm/s    r/s      w/s  rsec/s    wsec/s avgrq-sz avgqu-sz  await svctm  %util
sda        0.00    0.00   0.00     0.00    0.00      0.00     0.00     0.00   0.00  0.00   0.00
sdb        0.00 1542.00   4.00   264.00   32.00  11760.00    44.00    66.58 230.22  3.73 100.00
sdc        0.00 1185.00   0.00   272.00    0.00   9672.00    35.56    63.40 215.88  3.68 100.00
sde        0.00 1352.00   0.00   298.00    0.00  12488.00    41.91    35.56 126.34  3.36 100.00
sdd        0.00  996.00   0.00   294.00    0.00  10120.00    34.42    76.79 270.37  3.40 100.00
md_d0      0.00    0.00   4.00     0.00   32.00      0.00     8.00     0.00   0.00  0.00   0.00

==> The huge page flush is still working; almost no reads are being done.

This is the problem: when the page flush kicks in, MD appears to refuse
incoming reads. All the underlying devices use the deadline scheduler,
tuned to favor reads, but it doesn't help, since MD simply doesn't
submit new reads to the underlying devices.

^ permalink raw reply [flat|nested] 16+ messages in thread
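The "always 4K" I/O size on md_d0 can be read straight off the iostat columns:
avgrq-sz is the average request size in 512-byte sectors, so 8 sectors is
4096 bytes. As a quick check against the numbers in this message (a sketch
only; avg_request_bytes is a made-up helper, not part of iostat):

```python
SECTOR_BYTES = 512  # iostat's rsec/s and wsec/s are counted in 512-byte sectors


def avg_request_bytes(r_per_s, w_per_s, rsec_per_s, wsec_per_s):
    """Average request size in bytes, i.e. avgrq-sz converted from sectors."""
    requests = r_per_s + w_per_s
    if requests == 0:
        return 0.0  # idle device: no requests to average over
    return (rsec_per_s + wsec_per_s) / requests * SECTOR_BYTES


# md_d0 during normal operation: 3208 sectors over 401 reads
normal = avg_request_bytes(401.00, 0.00, 3208.00, 0.00)

# md_d0 during the flush: (256 + 131568) sectors over (32 + 16446) requests
flushing = avg_request_bytes(32.00, 16446.00, 256.00, 131568.00)

# sdb during the flush: merged requests are much larger than one page
sdb_flushing = avg_request_bytes(4.00, 277.00, 32.00, 14810.00)
```

Both md_d0 samples work out to 8 sectors (4096 bytes) per request, while sdb
during the flush averages about 52.8 sectors, which is consistent with the
wrqm/s column showing merges happening below md, on the member disks.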
* Re: Raid10 and page cache
2011-12-07 4:50 ` Yucong Sun (叶雨飞)
@ 2011-12-07 5:10 ` NeilBrown
2011-12-07 6:14 ` Yucong Sun (叶雨飞)
0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-07 5:10 UTC (permalink / raw)
To: Yucong Sun (叶雨飞); +Cc: linux-raid

On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> This is the problem: when the page flush kicks in, MD appears to refuse
> incoming reads. All the underlying devices use the deadline scheduler,
> tuned to favor reads, but it doesn't help, since MD simply doesn't
> submit new reads to the underlying devices.

The counters are updated when a request completes, not when it is submitted,
so you cannot tell from this data if md is submitting the read requests or
not.

What kernel are you working with? If it doesn't contain the commit
identified below, can you try with that and see if it makes a difference?

Thanks,
NeilBrown

> > There was a bug where raid10 would allow an arbitrary number of writes to
> > queue up so that flushing code didn't know when to stop.
> >
> > This was fixed by
> > commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
> >
> > nearly 2 months ago :-)
> >
> > NeilBrown

^ permalink raw reply [flat|nested] 16+ messages in thread
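Since the per-second counters only move on completion, one way to see what has
been submitted but not yet finished is the "I/Os currently in progress" field
of /proc/diskstats. The sketch below is an assumption-laden illustration, not
something suggested in the thread: the names are made up, it relies on the
standard field order (major, minor, name, then eleven or more statistics), and
whether the in-flight count is maintained for the md device itself, as opposed
to the member disks, depends on the kernel version.

```python
from collections import namedtuple

DiskStat = namedtuple("DiskStat", "name reads_completed writes_completed in_flight")


def parse_diskstats(text):
    """Parse /proc/diskstats-style text into {device name: DiskStat}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue  # blank line, or an old short-format partition line
        stats[parts[2]] = DiskStat(
            name=parts[2],
            reads_completed=int(parts[3]),   # 1st statistic
            writes_completed=int(parts[7]),  # 5th statistic
            in_flight=int(parts[11]),        # 9th statistic: I/Os in progress
        )
    return stats


def read_diskstats(path="/proc/diskstats"):
    """Hypothetical convenience wrapper around the live file."""
    with open(path) as f:
        return parse_diskstats(f.read())
```

Sampling read_diskstats() for sdb through sde during a flush would show
whether requests are piling up in flight on the member disks while the
completion counters for reads stay flat.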
* Re: Raid10 and page cache
2011-12-07 5:10 ` NeilBrown
@ 2011-12-07 6:14 ` Yucong Sun (叶雨飞)
2011-12-07 9:21 ` Yucong Sun (叶雨飞)
0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07 6:14 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

OK, but during that time no reads complete.

I'm on Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
16:26:12 UTC 2011 x86_64 GNU/Linux.
Do you know which kernel version has that commit? 2.6.35?

I think the root cause is that whenever dirty_background_bytes is
reached, the kernel flush thread [flush:254:0] wakes up and causes
md_raid10_d0 to go into state D, which makes everything hang for a
while. I guess the flush thread may be calling fsync() after the
write? That's hard to believe, but it would explain the symptom.

BTW, I don't think limiting batched writes to 1024 would solve the
problem. I am effectively doing that already, because I have to set
dirty_background_bytes to 4M, which is exactly 1024 writes every second
or so.

Cheers.

On Tue, Dec 6, 2011 at 9:10 PM, NeilBrown <neilb@suse.de> wrote:
> On Tue, 6 Dec 2011 20:50:48 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
> wrote:
>
>> I'm not sure whether it is what I mean, to illustrate my problem let
>> me put iostat -x -d 1 output as below
>>
>> Device:   rrqm/s   wrqm/s     r/s       w/s    rsec/s     wsec/s avgrq-sz avgqu-sz   await svctm  %util
>> sdb         0.00     0.00  163.00      1.00   1304.00       8.00     8.00     0.26    1.59  1.59  26.00
>> sdc         0.00     0.00   93.00      1.00    744.00       8.00     8.00     0.24    2.55  2.45  23.00
>> sde         0.00     0.00   56.00      1.00    448.00       8.00     8.00     0.22    3.86  3.86  22.00
>> sdd         0.00     0.00   88.00      1.00    704.00       8.00     8.00     0.18    2.02  2.02  18.00
>> md_d0       0.00     0.00  401.00      0.00   3208.00       0.00     8.00     0.00    0.00  0.00   0.00
>>
>> ==> this is normal operation, because of page cache, there's only read
>> being submitted to the MD device.
>>
>> Device:   rrqm/s   wrqm/s     r/s       w/s    rsec/s     wsec/s avgrq-sz avgqu-sz   await svctm  %util
>> sda         0.00     0.00    0.00      0.00      0.00       0.00     0.00     0.00    0.00  0.00   0.00
>> sdb         0.00  1714.00    4.00    277.00     32.00   14810.00    52.82    34.04  105.05  2.92  82.00
>> sdc         0.00  1685.00   12.00    270.00     96.00   14122.00    50.42    42.56  131.03  3.09  87.00
>> sde         0.00  1385.00    8.00    261.00     64.00   12426.00    46.43    29.76   99.44  3.35  90.00
>> sdd         0.00  1350.00    8.00    228.00     64.00   10682.00    45.53    40.93  133.56  3.69  87.00
>> md_d0       0.00     0.00   32.00  16446.00    256.00  131568.00     8.00     0.00    0.00  0.00   0.00
>>
>> ==> Huge page flush kick in, note the read requests is saturated on MD device.
>>
>> Device:   rrqm/s   wrqm/s     r/s       w/s    rsec/s     wsec/s avgrq-sz avgqu-sz   await svctm  %util
>> sda         0.00     0.00    0.00      0.00      0.00       0.00     0.00     0.00    0.00  0.00   0.00
>> sdb         0.00  1542.00    4.00    264.00     32.00   11760.00    44.00    66.58  230.22  3.73 100.00
>> sdc         0.00  1185.00    0.00    272.00      0.00    9672.00    35.56    63.40  215.88  3.68 100.00
>> sde         0.00  1352.00    0.00    298.00      0.00   12488.00    41.91    35.56  126.34  3.36 100.00
>> sdd         0.00   996.00    0.00    294.00      0.00   10120.00    34.42    76.79  270.37  3.40 100.00
>> md_d0       0.00     0.00    4.00      0.00     32.00       0.00     8.00     0.00    0.00  0.00   0.00
>>
>> ==> Huge page flush still working, no read is being done.
>>
>> This is the problem, when page flush kick in, MD appears to refuse
>> incoming read, all underlying device is tuned to deadline scheduler
>> and tuned to favor read, still, it don't work since MD simply don't
>> submit new read to the underlying device.
>
> The counters are updated when a request completes, not when it is submitted,
> so you cannot tell from this data if md is submitting the read requests or
> not.
>
> What kernel are you working with? If it doesn't contain the commit
> identified below can you try with that and see if it makes a difference?
>
> Thanks,
> NeilBrown
>
>>
>> 2011/12/6 NeilBrown <neilb@suse.de>:
>>> On Tue, 6 Dec 2011 20:04:33 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
>>> wrote:
>>>
>>>> The problem with using page-flush as a write cache here is that write
>>>> to MD don't go through IO scheduler, which is a very big problem,
>>>> because when flush thread decide to write to MD, it's impossible to
>>>> control the write speed, or prioritize them with read, every requests
>>>> basically is a fifo, and when flush size is big, no read can be
>>>> served.
>>>>
>>>
>>> I'm not sure I understand....
>>>
>>> Requests don't go through an IO scheduler before they hit md, but they do
>>> after md sends them on down, so they can be re-ordered there.
>>>
>>> There was a bug where raid10 would allow an arbitrary number of writes to
>>> queue up so that flushing code didn't know when to stop.
>>>
>>> This was fixed by
>>> commit 34db0cd60f8a1f4ab73d118a8be3797c20388223
>>>
>>> nearly 2 months ago :-)
>>>
>>> NeilBrown
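The figures in the message above can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes 4 KiB pages (the usual x86_64 default) and the 512-byte sectors that iostat reports in its rsec/s and wsec/s columns, and the helper names are invented here rather than taken from the thread or from any tool.

```python
SECTOR_BYTES = 512   # iostat's rsec/s and wsec/s count 512-byte sectors
PAGE_BYTES = 4096    # assumption: 4 KiB pages


def pages_per_threshold(dirty_background_bytes, page_bytes=PAGE_BYTES):
    """Number of whole pages that fit in a dirty_background_bytes setting."""
    return dirty_background_bytes // page_bytes


def avg_request_sectors(sectors_per_s, requests_per_s):
    """Average request size in sectors (what iostat shows as avgrq-sz)."""
    return sectors_per_s / requests_per_s if requests_per_s else 0.0


def sectors_to_mib(sectors, sector_bytes=SECTOR_BYTES):
    """Convert a sector count to MiB."""
    return sectors * sector_bytes / (1024 * 1024)


# A 4M threshold corresponds to 1024 pages of 4 KiB each.
print(pages_per_threshold(4 * 1024 * 1024))     # 1024

# md_d0 during the flush: 16446 w/s and 131568 wsec/s.
print(avg_request_sectors(131568.0, 16446.0))   # 8.0 sectors, i.e. 4 KiB
print(round(sectors_to_mib(131568), 2))         # 64.24 MiB written per second
```

Under these assumptions the "1024 writes" figure is simply the 4M threshold divided by the page size, and the 8.00 in the avgrq-sz column for md_d0 means every request reaching the array is a single 4 KiB page.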
* Re: Raid10 and page cache
2011-12-07 6:14 ` Yucong Sun (叶雨飞)
@ 2011-12-07 9:21 ` Yucong Sun (叶雨飞)
2011-12-07 23:37 ` Yucong Sun (叶雨飞)
0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07 9:21 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

So, I re-read the kernel code. It looks like backing-dev.c is doing
the correct thing by calling writeback with WB_SYNC_NONE, so that all
looks good, but I don't understand why reads appear to be starved on
my system.

However, I think your commit would definitely make things better.
Ideally, I think writes should only use the available bandwidth, like
resync does, and adjust automatically.

2011/12/6 Yucong Sun (叶雨飞) <sunyucong@gmail.com>:
> OK, but during that time no reads complete.
>
> I'm on Linux vstore-1 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11
> 16:26:12 UTC 2011 x86_64 GNU/Linux.
> Do you know which kernel version has that commit? 2.6.35?
>
> I think the root cause is that whenever dirty_background_bytes is
> reached, the kernel flush thread [flush:254:0] wakes up and causes
> md_raid10_d0 to go into state D, which makes everything hang for a
> while. I guess the flush thread may be calling fsync() after the
> write? That's hard to believe, but it would explain the symptom.
>
> BTW, I don't think limiting batched writes to 1024 would solve the
> problem. I am effectively doing that already, because I have to set
> dirty_background_bytes to 4M, which is exactly 1024 writes every second
> or so.
>
> Cheers.
* Re: Raid10 and page cache
2011-12-07 9:21 ` Yucong Sun (叶雨飞)
@ 2011-12-07 23:37 ` Yucong Sun (叶雨飞)
2011-12-08 0:10 ` NeilBrown
0 siblings, 1 reply; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-07 23:37 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

Neil, I can't compile the latest MD against 2.6.32, and that commit
can't be applied to 2.6.32 directly either. Can you help me with this?

Cheers.

2011/12/7 Yucong Sun (叶雨飞) <sunyucong@gmail.com>:
> So, I re-read the kernel code. It looks like backing-dev.c is doing
> the correct thing by calling writeback with WB_SYNC_NONE, so that all
> looks good, but I don't understand why reads appear to be starved on
> my system.
>
> However, I think your commit would definitely make things better.
> Ideally, I think writes should only use the available bandwidth, like
> resync does, and adjust automatically.
* Re: Raid10 and page cache
2011-12-07 23:37 ` Yucong Sun (叶雨飞)
@ 2011-12-08 0:10 ` NeilBrown
2011-12-08 6:31 ` Yucong Sun (叶雨飞)
0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2011-12-08 0:10 UTC (permalink / raw)
To: Yucong Sun (叶雨飞); +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 7881 bytes --]

On Wed, 7 Dec 2011 15:37:30 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
wrote:

> Neil, I can't compile latest MD against 2.6.32, and that commit can't
> be patched into 2.6.32 directly either, can you help me on this?
>

This should do it.

NeilBrown

commit ef54b7cf955dc3b7d33248e8591b1a00b4fa998c
Author: NeilBrown <neilb@suse.de>
Date:   Tue Oct 11 16:50:01 2011 +1100

    md: add proper write-congestion reporting to RAID1 and RAID10.

    RAID1 and RAID10 handle write requests by queuing them for handling by
    a separate thread.  This is because when a write-intent-bitmap is
    active we might need to update the bitmap first, so it is good to
    queue a lot of writes, then do one big bitmap update for them all.

    However writeback requests devices to appear to be congested after a
    while so it can make some guesstimate of throughput.  The infinite
    queue defeats that (note that RAID5 already has a finite queue so
    it doesn't suffer from this problem).

    So impose a limit on the number of pending write requests.  By default
    it is 1024 which seems to be generally suitable.  Make it configurable
    via module option just in case someone finds a regression.

    Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index e07ce2e..fe7ae3c 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -50,6 +50,11 @@
  */
 #define NR_RAID1_BIOS 256
 
+/* When there are this many requests queued to be written by
+ * the raid1 thread, we become 'congested' to provide back-pressure
+ * for writeback.
+ */
+static int max_queued_requests = 1024;
 
 static void unplug_slaves(mddev_t *mddev);
 
@@ -576,7 +581,8 @@ static int raid1_congested(void *data, int bits)
 	conf_t *conf = mddev->private;
 	int i, ret = 0;
 
-	if (mddev_congested(mddev, bits))
+	if (mddev_congested(mddev, bits) ||
+	    conf->pending_count >= max_queued_requests)
 		return 1;
 
 	rcu_read_lock();
@@ -613,10 +619,12 @@ static int flush_pending_writes(conf_t *conf)
 		struct bio *bio;
 		bio = bio_list_get(&conf->pending_bio_list);
 		blk_remove_plug(conf->mddev->queue);
+		conf->pending_count = 0;
 		spin_unlock_irq(&conf->device_lock);
 		/* flush any pending bitmap writes to
 		 * disk before proceeding w/ I/O */
 		bitmap_unplug(conf->mddev->bitmap);
+		wake_up(&conf->wait_barrier);
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
@@ -789,6 +797,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	int cpu;
 	bool do_barriers;
 	mdk_rdev_t *blocked_rdev;
+	int cnt = 0;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -864,6 +873,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	/*
 	 * WRITE:
 	 */
+	if (conf->pending_count >= max_queued_requests) {
+		md_wakeup_thread(mddev->thread);
+		wait_event(conf->wait_barrier,
+			   conf->pending_count < max_queued_requests);
+	}
 	/* first select target devices under spinlock and
 	 * inc refcount on their rdev.  Record them by setting
 	 * bios[x] to bio
@@ -970,6 +984,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 		atomic_inc(&r1_bio->remaining);
 
 		bio_list_add(&bl, mbio);
+		cnt++;
 	}
 	kfree(behind_pages); /* the behind pages are attached to the bios now */
 
@@ -978,6 +993,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	spin_lock_irqsave(&conf->device_lock, flags);
 	bio_list_merge(&conf->pending_bio_list, &bl);
 	bio_list_init(&bl);
+	conf->pending_count += cnt;
 
 	blk_plug_device(mddev->queue);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
@@ -2021,7 +2037,7 @@ static int run(mddev_t *mddev)
 
 	bio_list_init(&conf->pending_bio_list);
 	bio_list_init(&conf->flushing_bio_list);
-
+	conf->pending_count = 0;
 
 	mddev->degraded = 0;
 	for (i = 0; i < conf->raid_disks; i++) {
@@ -2317,3 +2333,5 @@ MODULE_LICENSE("GPL");
 MODULE_ALIAS("md-personality-3"); /* RAID1 */
 MODULE_ALIAS("md-raid1");
 MODULE_ALIAS("md-level-1");
+
+module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index e87b84d..520288c 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -38,6 +38,7 @@ struct r1_private_data_s {
 	/* queue of writes that have been unplugged */
 	struct bio_list		flushing_bio_list;
 
+	int			pending_count;
 	/* for use when syncing mirrors: */
 
 	spinlock_t		resync_lock;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c2cb7b8..4c7d9b5 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -59,6 +59,11 @@ static void unplug_slaves(mddev_t *mddev);
 
 static void allow_barrier(conf_t *conf);
 static void lower_barrier(conf_t *conf);
+/* When there are this many requests queued to be written by
+ * the raid10 thread, we become 'congested' to provide back-pressure
+ * for writeback.
+ */
+static int max_queued_requests = 1024;
 
 static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -631,6 +636,10 @@ static int raid10_congested(void *data, int bits)
 	conf_t *conf = mddev->private;
 	int i, ret = 0;
 
+	if ((bits & (1 << BDI_async_congested)) &&
+	    conf->pending_count >= max_queued_requests)
+		return 1;
+
 	if (mddev_congested(mddev, bits))
 		return 1;
 	rcu_read_lock();
@@ -660,10 +669,12 @@ static int flush_pending_writes(conf_t *conf)
 		struct bio *bio;
 		bio = bio_list_get(&conf->pending_bio_list);
 		blk_remove_plug(conf->mddev->queue);
+		conf->pending_count = 0;
 		spin_unlock_irq(&conf->device_lock);
 		/* flush any pending bitmap writes to disk
 		 * before proceeding w/ I/O */
 		bitmap_unplug(conf->mddev->bitmap);
+		wake_up(&conf->wait_barrier);
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
@@ -802,6 +813,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	struct bio_list bl;
 	unsigned long flags;
 	mdk_rdev_t *blocked_rdev;
+	int cnt = 0;
 
 	if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER))) {
 		bio_endio(bio, -EOPNOTSUPP);
@@ -894,6 +906,11 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	/*
 	 * WRITE:
 	 */
+	if (conf->pending_count >= max_queued_requests) {
+		md_wakeup_thread(mddev->thread);
+		wait_event(conf->wait_barrier,
+			   conf->pending_count < max_queued_requests);
+	}
 	/* first select target devices under rcu_lock and
 	 * inc refcount on their rdev.  Record them by setting
 	 * bios[x] to bio
@@ -957,6 +974,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 
 		atomic_inc(&r10_bio->remaining);
 		bio_list_add(&bl, mbio);
+		cnt++;
 	}
 
 	if (unlikely(!atomic_read(&r10_bio->remaining))) {
@@ -970,6 +988,7 @@ static int make_request(struct request_queue *q, struct bio * bio)
 	spin_lock_irqsave(&conf->device_lock, flags);
 	bio_list_merge(&conf->pending_bio_list, &bl);
 	blk_plug_device(mddev->queue);
+	conf->pending_count += cnt;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	/* In case raid10d snuck in to freeze_array */
@@ -2318,3 +2337,5 @@ MODULE_LICENSE("GPL");
 MODULE_ALIAS("md-personality-9"); /* RAID10 */
 MODULE_ALIAS("md-raid10");
 MODULE_ALIAS("md-level-10");
+
+module_param(max_queued_requests, int, S_IRUGO|S_IWUSR);
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 59cd1ef..e6e1613 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -39,7 +39,7 @@ struct r10_private_data_s {
 	struct list_head	retry_list;
 	/* queue pending writes and submit them on unplug */
 	struct bio_list		pending_bio_list;
-
+	int			pending_count;
 
 	spinlock_t		resync_lock;
 	int nr_pending;

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
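The back-pressure idea in the patch above can be modelled outside the kernel. What follows is a minimal userspace sketch, not the md code: the `PendingWrites` class and its method names are hypothetical, a `threading.Condition` stands in for `conf->device_lock` plus the `wait_barrier` wait queue, and the limit of 4 is chosen only to keep the example short (the patch defaults to 1024).

```python
import threading


class PendingWrites:
    """Toy model of a bounded pending-write counter (hypothetical names)."""

    def __init__(self, max_queued=1024):
        self.max_queued = max_queued
        self.pending = 0
        self.cond = threading.Condition()

    def congested(self):
        # Analogue of the *_congested() check: report congestion once the
        # number of queued writes reaches the limit.
        with self.cond:
            return self.pending >= self.max_queued

    def queue_write(self, count=1):
        # Analogue of the make_request() write path: block while the queue
        # is full, then account for `count` queued bios (one per mirror).
        # As in the patch, the check happens before adding, so the counter
        # can briefly exceed the limit when count > 1.
        with self.cond:
            while self.pending >= self.max_queued:
                self.cond.wait()
            self.pending += count

    def flush(self):
        # Analogue of flush_pending_writes(): take everything that is
        # queued, reset the counter and wake any blocked writers.
        with self.cond:
            flushed = self.pending
            self.pending = 0
            self.cond.notify_all()
            return flushed


q = PendingWrites(max_queued=4)
for _ in range(4):
    q.queue_write()
print(q.congested())  # True: the limit has been reached
print(q.flush())      # 4: everything queued is handed off at once
print(q.congested())  # False: writers may proceed again
```

The point of the design is that a writer which would previously have queued without limit now either sees a "congested" answer or sleeps until the flushing thread has drained the list, which gives writeback a signal it can use to pace itself.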
* Re: Raid10 and page cache
2011-12-08 0:10 ` NeilBrown
@ 2011-12-08 6:31 ` Yucong Sun (叶雨飞)
0 siblings, 0 replies; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-08 6:31 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

Sadly, the patch didn't help at all; see the following:

Device:   rrqm/s   wrqm/s     r/s       w/s    rsec/s     wsec/s avgrq-sz avgqu-sz   await svctm  %util
sda         0.00     0.00    0.00      0.00      0.00       0.00     0.00     0.00    0.00  0.00   0.00
sdb         0.00  2042.00    0.00    345.00      0.00   64112.00   185.83    93.13   93.36  2.12  73.00
sdd         0.00  1704.00    7.00    156.00     56.00   12496.00    77.01    95.71  146.20  3.62  59.00
sdc         0.00  1518.00   16.00    185.00    128.00    9936.00    50.07    98.20  157.41  3.13  63.00
sde       222.00  1997.00  194.00    189.00  51568.00   16488.00   177.69    81.54   99.09  2.25  86.00
md0         0.00     0.00   37.00   4096.00    296.00   32768.00     8.00     0.00    0.00  0.00   0.00

Device:   rrqm/s   wrqm/s     r/s       w/s    rsec/s     wsec/s avgrq-sz avgqu-sz   await svctm  %util
sda         0.00     0.00    0.00      0.00      0.00       0.00     0.00     0.00    0.00  0.00   0.00
sdb         0.00   150.00    0.00    194.00      0.00   33336.00   171.84    34.91  492.84  4.59  89.00
sdd         0.00     0.00    0.00    138.00      0.00    3488.00    25.28    32.68  757.75  4.06  56.00
sdc         0.00     0.00    3.00    127.00     24.00    4704.00    36.37    33.68  771.08  4.54  59.00
sde       222.00     0.00   90.00     84.00  39936.00    1672.00   239.13    23.73  386.90  4.08  71.00
md0         0.00     0.00    2.00      0.00     16.00       0.00     8.00     0.00    0.00  0.00   0.00

Device:   rrqm/s   wrqm/s     r/s       w/s    rsec/s     wsec/s avgrq-sz avgqu-sz   await svctm  %util
sda         0.00     0.00    0.00      0.00      0.00       0.00     0.00     0.00    0.00  0.00   0.00
sdb         0.00   235.00    0.00    188.00      0.00   54024.00   287.36     0.49    3.78  1.65  31.00
sdd         0.00     0.00   27.00      0.00    216.00       0.00     8.00     0.15    5.56  5.56  15.00
sdc         0.00     0.00   46.00      0.00    368.00       0.00     8.00     0.32    6.52  6.96  32.00
sde       165.00     0.00  200.00      0.00  43480.00       0.00   217.40     7.63   38.15  2.00  40.00
md0         0.00     0.00  101.00      0.00    808.00       0.00     8.00     0.00    0.00  0.00   0.00

I poked around and found this when a big flush comes in:

Every 1.0s: cat /sys/block/sdb/stat /sys/block/sdc/stat /sys/block/sdd/stat /sys/block/sde/stat /sys/block/md0/stat    Wed Dec 7 22:26:14 2011

32 10 336 270 2792623 5501730 783168880 254952160 284 4815060 255014270
2993481 2222268 499586400 94384090 493165 1842192 18671608 271311440 290 9942910 365758660
691727 19 5533896 1507300 501261 1838497 18706544 276987570 262 3254420 278552760
1458797 1404948 281875858 49664210 483386 1841832 18588928 256627020 259 4997270 306348180
2797538 0 22380058 0 4652939 0 37223512 0 0 0 0

Every downstream disk has a huge jump in in-flight I/O, where it is
usually just 0 or 1 the whole time. The kernel documentation says this
count doesn't include queued I/O, so I think the problem is that the
I/O scheduler issued too many requests to the device without
throttling reads against writes. That basically saturated the disk, so
no other read can be scheduled. Do you know why this would happen to
me?

Here's my relevant scheduler tweak:

for disk in /sys/block/sd[bcde]
do
    echo "changing $disk scheduler"
    echo "deadline" > "$disk/queue/scheduler"
    echo "changing $disk nr_requests to 4096"
    echo 4096 > "$disk/queue/nr_requests"
    echo "setra to 0"
    echo 0 > "$disk/queue/read_ahead_kb"
    echo "tweaking deadline io"
    echo 32 > "$disk/queue/iosched/fifo_batch"
    echo 30 > "$disk/queue/iosched/read_expire"
    echo 20000 > "$disk/queue/iosched/write_expire"
    echo 256 > "$disk/queue/iosched/writes_starved"
done
echo 0 > /sys/block/md0/queue/read_ahead_kb

My workload profile is 100% random 8K IO.

Come to think of it, the problem is mostly an I/O scheduling issue.
Does nr_requests mean anything to MD? It's not possible to adjust it
there either. Is that the reason MD can't accept more reads?

On Wed, Dec 7, 2011 at 4:10 PM, NeilBrown <neilb@suse.de> wrote:
>
> On Wed, 7 Dec 2011 15:37:30 -0800 Yucong Sun (叶雨飞) <sunyucong@gmail.com>
> wrote:
>
>> Neil, I can't compile latest MD against 2.6.32, and that commit can't
>> be patched into 2.6.32 directly either, can you help me on this?
>>
>
> This should do it.
>
> NeilBrown
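The in-flight counter discussed in the message above is the ninth whitespace-separated field of /sys/block/<dev>/stat in the kernel's block-layer statistics layout. The sketch below shows one way to pull it out; the function names are made up for this illustration, and the sample line is the second row of the output pasted in the message, which by the order of the `cat` arguments would be sdc.

```python
# Field order follows the kernel's block-layer stat documentation; newer
# kernels append further fields, which this sketch simply ignores.
STAT_FIELDS = (
    "read_ios", "read_merges", "read_sectors", "read_ticks",
    "write_ios", "write_merges", "write_sectors", "write_ticks",
    "in_flight", "io_ticks", "time_in_queue",
)


def parse_block_stat(line):
    """Turn one line of /sys/block/<dev>/stat into a dict of named counters."""
    values = [int(v) for v in line.split()]
    if len(values) < len(STAT_FIELDS):
        raise ValueError("expected at least %d fields, got %d"
                         % (len(STAT_FIELDS), len(values)))
    return dict(zip(STAT_FIELDS, values))


def read_in_flight(device, sysfs_root="/sys/block"):
    """Read the current in-flight request count for a block device."""
    with open("%s/%s/stat" % (sysfs_root, device)) as f:
        return parse_block_stat(f.read())["in_flight"]


sample = ("2993481 2222268 499586400 94384090 493165 1842192 "
          "18671608 271311440 290 9942910 365758660")
print(parse_block_stat(sample)["in_flight"])  # 290
```

On a live system, calling `read_in_flight("sdb")` once a second would give the same view as the `watch` invocation in the message, without having to count columns by eye.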
[parent not found: <CAJygYd16PWfKe8fK-b150N46CEwzBUqJn1N6dfsGR4yyTgGbTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Raid10 and page cache
[not found] ` <CAJygYd16PWfKe8fK-b150N46CEwzBUqJn1N6dfsGR4yyTgGbTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-12-06 22:01 ` Yucong Sun (叶雨飞)
0 siblings, 0 replies; 16+ messages in thread
From: Yucong Sun (叶雨飞) @ 2011-12-06 22:01 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
wrong list, sorry!

On Tue, Dec 6, 2011 at 1:29 PM, Yucong Sun (叶雨飞)
<sunyucong-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi,
>
> I recently setup raid10 on 4 physical disk and have a iscsi serve it
> as a block device, and have been trying to tweak for performance.
>
> First thing I notice that MD seems to rely on page cache to flush
> changes to disk, is there any way to turn that off so changes are
> flushed to the disk? like O_FSYNC|O_DIRECT does? The reason I want to
> turn it off is to understand the performance difference, I want to be
> sure that page cache is truly acting as a write-back cache, I know one
> can tune the dirty_* to control the cache flush, but I want to make
> sure that it is actually doing what I think it does.
>
> Then I notice in output of free, the number in Cache column is very
> low, however the Buffer is very high, my question is does Buffer here
> serves as a read cache? I couldn't find the answer anywhere else.
>
> My last question is that since MD seems already doing the cache, what
> effect would it have if I want to setup a LO device in front of MD
> device, Is there going to be more caching, how is different than just
> plain MD device?
>
> Thanks.
end of thread, other threads:[~2011-12-08 6:31 UTC | newest]
Thread overview: 16+ messages
2011-12-06 21:29 Raid10 and page cache Yucong Sun (叶雨飞)
2011-12-06 22:01 ` Yucong Sun (叶雨飞)
2011-12-06 22:26 ` NeilBrown
2011-12-06 23:13 ` Yucong Sun (叶雨飞)
2011-12-06 23:22 ` Marcus Sorensen
2011-12-07 1:01 ` NeilBrown
2011-12-07 4:04 ` Yucong Sun (叶雨飞)
2011-12-07 4:28 ` NeilBrown
2011-12-07 4:50 ` Yucong Sun (叶雨飞)
2011-12-07 5:10 ` NeilBrown
2011-12-07 6:14 ` Yucong Sun (叶雨飞)
2011-12-07 9:21 ` Yucong Sun (叶雨飞)
2011-12-07 23:37 ` Yucong Sun (叶雨飞)
2011-12-08 0:10 ` NeilBrown
2011-12-08 6:31 ` Yucong Sun (叶雨飞)
[not found] ` <CAJygYd16PWfKe8fK-b150N46CEwzBUqJn1N6dfsGR4yyTgGbTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2011-12-06 22:01 ` Yucong Sun (叶雨飞)