* RE: [PATCH 0/7] Per-bdi writeback flusher threads
@ 2009-04-07 14:03 Jos Houtman
  2009-04-08  0:44 ` Wu Fengguang
  0 siblings, 1 reply; 7+ messages in thread
From: Jos Houtman @ 2009-04-07 14:03 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org, Wu Fengguang

I tried the write-back branch from the 2.6-block tree.

And I can at least confirm that it works, at least in relation to the
writeback not keeping up when the device was congested before it wrote
1024 pages.

See http://lkml.org/lkml/2009/3/22/83 for a bit more information.

But the second problem seen in that thread, a write-starve-read problem,
does not seem to be solved. In this problem the writes of the writeback
algorithm starve the ongoing reads, no matter which io-scheduler is
picked.

For good measure I also applied the blk-latency patches on top of the
writeback branch; this did not improve anything. Nor did lowering
max_sectors_kb, as Linus suggested in the IO latency thread.

As for a reproducible test-case: the simplest I could come up with was
modifying the fsync-tester not to fsync, but letting the normal writeback
handle it, and starting a separate process that tries to sequentially
read a file from the same device. The read performance drops to a bare
minimum as soon as the writeback algorithm kicks in.

Jos

^ permalink raw reply [flat|nested] 7+ messages in thread
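For reference, a toy-scale sketch of that test-case (hypothetical file names, and far smaller sizes than a real run would need; on real hardware both files must live on the device under test and the writer must dirty enough pages to trigger background writeback):

```shell
# Sketch of the write-starve-read test-case: a background writer keeps
# dirtying pagecache without fsync, while a foreground process reads a
# second file sequentially and times the read.

dd if=/dev/zero of=read-victim.dat bs=1M count=4 2>/dev/null   # file the reader will stream
sync                                                           # get it written out first

dd if=/dev/zero of=dirty.dat bs=1M count=16 2>/dev/null &      # writer: dirties pages, no fsync
writer=$!

t0=$(date +%s)
dd if=read-victim.dat of=/dev/null bs=1M 2>/dev/null           # sequential reader
t1=$(date +%s)

wait "$writer"
echo "read took $((t1 - t0))s"
```

On a box showing the problem, the reported read time balloons once writeback of dirty.dat kicks in.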
* Re: [PATCH 0/7] Per-bdi writeback flusher threads
From: Wu Fengguang @ 2009-04-08 0:44 UTC (permalink / raw)
To: Jos Houtman; +Cc: linux-kernel@vger.kernel.org, jens.axboe@oracle.com

[CC Jens]

On Tue, Apr 07, 2009 at 10:03:38PM +0800, Jos Houtman wrote:
>
> I tried the write-back branch from the 2.6-block tree.
>
> And I can at least confirm that it works, at least in relation to the
> writeback not keeping up when the device was congested before it wrote
> 1024 pages.
>
> See http://lkml.org/lkml/2009/3/22/83 for a bit more information.

Hi Jos, you said that this simple patch solved the problem, but you
mentioned somewhat suboptimal performance. Can you elaborate on that, so
that I can push or improve it?

Thanks,
Fengguang
---
 fs/fs-writeback.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
 		 * soon as the queue becomes uncongested.
 		 */
 		inode->i_state |= I_DIRTY_PAGES;
-		if (wbc->nr_to_write <= 0) {
+		if (wbc->nr_to_write <= 0 ||
+		    wbc->encountered_congestion) {
 			/*
 			 * slice used up: queue for next turn
 			 */

> But the second problem seen in that thread, a write-starve-read problem,
> does not seem to be solved. In this problem the writes of the writeback
> algorithm starve the ongoing reads, no matter which io-scheduler is
> picked.
>
> For good measure I also applied the blk-latency patches on top of the
> writeback branch; this did not improve anything. Nor did lowering
> max_sectors_kb, as Linus suggested in the IO latency thread.
>
> As for a reproducible test-case: the simplest I could come up with was
> modifying the fsync-tester not to fsync, but letting the normal
> writeback handle it, and starting a separate process that tries to
> sequentially read a file from the same device. The read performance
> drops to a bare minimum as soon as the writeback algorithm kicks in.
>
> Jos
* Re: [PATCH 0/7] Per-bdi writeback flusher threads
From: Jens Axboe @ 2009-04-08 6:20 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Jos Houtman, linux-kernel@vger.kernel.org

On Wed, Apr 08 2009, Wu Fengguang wrote:
> [CC Jens]
>
> On Tue, Apr 07, 2009 at 10:03:38PM +0800, Jos Houtman wrote:
> >
> > I tried the write-back branch from the 2.6-block tree.
> >
> > And I can at least confirm that it works, at least in relation to the
> > writeback not keeping up when the device was congested before it wrote
> > 1024 pages.
> >
> > See http://lkml.org/lkml/2009/3/22/83 for a bit more information.
>
> Hi Jos, you said that this simple patch solved the problem, but you
> mentioned somewhat suboptimal performance. Can you elaborate on that, so
> that I can push or improve it?
>
> [patch snipped]
>
> > But the second problem seen in that thread, a write-starve-read
> > problem, does not seem to be solved. In this problem the writes of the
> > writeback algorithm starve the ongoing reads, no matter which
> > io-scheduler is picked.

What kind of SSD drive are you using? Does it support queuing or not?

-- 
Jens Axboe
* Re: [PATCH 0/7] Per-bdi writeback flusher threads
From: Jos Houtman @ 2009-04-08 8:57 UTC (permalink / raw)
To: Jens Axboe, Wu Fengguang; +Cc: linux-kernel

>> Hi Jos, you said that this simple patch solved the problem, but you
>> mentioned somewhat suboptimal performance. Can you elaborate on that,
>> so that I can push or improve it?
>>
>> [patch snipped]
>>
>>> But the second problem seen in that thread, a write-starve-read
>>> problem, does not seem to be solved. In this problem the writes of the
>>> writeback algorithm starve the ongoing reads, no matter which
>>> io-scheduler is picked.
>
> What kind of SSD drive are you using? Does it support queuing or not?

First, Jens's question: we use the MTRON PRO 7500 (MTRON MSP-SATA75) with
64GB and 128GB, and I don't know whether it supports queuing or not. How
can I check? The data-sheet doesn't mention NCQ, if you meant that.

As for a more elaborate description of the problem (please bear with me),
there are actually two problems:

The first is that the writeback algorithm couldn't keep up with the
number of pages being dirtied by our database, even though it should. The
number of dirty pages would rise for hours, then level off and stabilize
around the dirty_background_ratio threshold.
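On the "how can I check" question above: the negotiated queue depth is visible in sysfs, and a depth greater than 1 normally means the kernel enabled NCQ for the drive. A small sketch; the sysfs base directory is a parameter only so the function can be exercised against a fake tree, and on a real machine it would be called as `check_ncq sda`.

```shell
# Report whether a disk advertises NCQ, judged by its negotiated queue
# depth (NCQ-capable SATA drives typically show 31; non-NCQ drives 1).
check_ncq() {
  dev=$1
  base=${2:-/sys/block}     # parameterized only to allow dry-runs
  depth=$(cat "$base/$dev/device/queue_depth" 2>/dev/null || echo 1)
  if [ "$depth" -gt 1 ]; then
    echo "$dev: NCQ enabled (queue_depth=$depth)"
  else
    echo "$dev: no NCQ (queue_depth=$depth)"
  fi
}
```

`hdparm -I /dev/sdX` reports the same from the drive's identify data (look for "Queue depth" and the NCQ line in the capabilities section).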
The second problem is that the io-writes triggered by the writeback
algorithm happen in bursts, and all read activity on the device is
starved for the duration of the write-burst, sometimes for periods of up
to 15 seconds. My conclusion: there is no proper interleaving of writes
and reads, _NO_ matter what IO-scheduler I choose to use.

See the graph below for a plot of this behavior: select queries vs disk
read/write operations (measured every second). This was measured using
Wu's patch; the per-bdi writeback patchset somehow wrote back every 5
seconds and as a result created smaller but more frequent drops in the
selects.

http://94.100.113.33/535450001-535500000/535451701-535451800/535451800_5VNp.jpg

The patch posted by Wu and the per-bdi writeback patchset both solve the
first problem, at the cost of increasing the occurrence of problem number
two. Fixed writeback => more write bursts => more frequent starvation of
the reads.

Background:
The machines that have these problems are databases, with large datasets
that need to read quite a lot of data from disk (as it won't fit in the
filecache). These write-bursts lock up queries that normally take only a
few ms for up to several seconds. As a result of this lockup a backlog is
created, and in our current database setup the backlog is actively
purged, forcing a reconnect to the same set of suffering database servers
and further increasing the load.

We are actively working on application-level solutions that don't trigger
the write-starve-read problem, mainly by reducing the physical read load.
But this is a lengthy process.

Besides what we can do ourselves, I think that this write-starve-read
behaviour should not happen, or should at least be controllable by
picking an IO-scheduler that suits you. The most extreme solutions as I
see them:

If your data is sacred: writes have priority, and the IO-scheduler should
do its best to smooth the write bursts and interleave them properly
without hampering the read load too much.
If your data is not so sacred (we have 30 machines with the same
dataset): reads have priority, writes have the lowest priority and are
interleaved whenever possible. This could mean writeback being postponed
until the off-hours.

But I would be really glad if I could just use the deadline scheduler to
do 1 write for every 10 reads and make the write-expire timeout very
high.

Thanks,

Jos
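For what it's worth, the stock deadline scheduler does expose tunables close to this wish: writes_starved (how many read batches may be dispatched before a write batch gets a turn) and write_expire (how long, in ms, a queued write may wait). A hedged sketch; the sysfs directory is a parameter only so it can be dry-run against a scratch directory, and whether this actually helps with dependent reads is questioned downthread.

```shell
# Approximate "1 write per 10 reads, very high write-expire" with the
# deadline scheduler's real tunables. Real path: /sys/block/sdX/queue/iosched.
tune_deadline() {
  d=${1:-/sys/block/sda/queue/iosched}
  echo 10    > "$d/writes_starved"   # let 10 read batches pass per write batch
  echo 60000 > "$d/write_expire"     # a write may wait up to 60 s (value in ms)
}
```

The scheduler itself is selected via `echo deadline > /sys/block/sdX/queue/scheduler` before the tunables appear.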
* Re: [PATCH 0/7] Per-bdi writeback flusher threads
From: Jens Axboe @ 2009-04-08 9:13 UTC (permalink / raw)
To: Jos Houtman; +Cc: Wu Fengguang, linux-kernel

On Wed, Apr 08 2009, Jos Houtman wrote:
> [patch and earlier discussion snipped]
>
> First, Jens's question: we use the MTRON PRO 7500 (MTRON MSP-SATA75)
> with 64GB and 128GB, and I don't know whether it supports queuing or
> not. How can I check? The data-sheet doesn't mention NCQ, if you meant
> that.

They do not. The MTRONs are in the "crap" ssd category, regardless of
their (seemingly undeserved) high price tag. I tested a few of them some
months ago and was less than impressed. It still sits behind a crap pata
bridge, and its random write performance was abysmal.

So in general I find it quite weird that the writeback cannot keep up,
there's not that much to keep up with.
I'm guessing it's because of the quirky nature of the device when it
comes to writes.

As to the other problem, we usually do quite well on read-vs-write
workloads. CFQ performs great for those: if I test the current 2.6.30-rc1
kernel, a read goes at > 90% of full performance with a

  dd if=/dev/zero of=foo bs=1M

running in the background, on both NCQ and non-NCQ drives. Could you try
2.6.30-rc1, just in case it works better for you? At least CFQ will
behave better there in any case. AS should work fine for that as well,
but don't expect very good read-vs-write performance with deadline or
noop. Doing some sort of anticipation is crucial to get that right.

What kind of read workload are you running? Many small files, big files,
one big file, or?

> Background:
> The machines that have these problems are databases, with large
> datasets that need to read quite a lot of data from disk (as it won't
> fit in the filecache). These write-bursts lock up queries that normally
> take only a few ms for up to several seconds. As a result of this
> lockup a backlog is created, and in our current database setup the
> backlog is actively purged, forcing a reconnect to the same set of
> suffering database servers and further increasing the load.

OK, so I'm guessing it's bursty smallish reads. That is the hardest case.
If your MTRON has a write cache, it's very possible that by the time we
stop the writes and issue the read, the device takes a long time to
service that read. And if we then mix reads and writes, it's basically
impossible to get any sort of interactiveness out of it.

With the second-rate SSD devices, you probably need to tweak the IO
scheduling a bit to make that work well. If you try 2.6.30-rc1, you could
try setting 'slice_async_rq' to 1 and 'slice_async' to 5 in
/sys/block/sda/queue/iosched/ (or sdX, whatever your device is) with CFQ
and see if that makes a difference. If the device is really slow, perhaps
try increasing slice_idle as well.
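Spelled out as commands, that suggestion looks like this. The slice_idle value is an assumption (no concrete number is given, only "increase"; the CFQ default is 8 ms), and the sysfs directory is parameterized only so the snippet can be dry-run against a scratch directory.

```shell
# Jens's suggested CFQ tuning for a slow, write-cached SSD.
# Real path: /sys/block/sdX/queue/iosched with CFQ selected.
tune_cfq() {
  d=${1:-/sys/block/sda/queue/iosched}
  echo 1  > "$d/slice_async_rq"   # at most 1 async (writeback) request per slice
  echo 5  > "$d/slice_async"      # short time slices for async IO
  echo 16 > "$d/slice_idle"       # assumed value; default 8, raised for a slow device
}
```

The first two knobs squeeze the writeback stream into small, infrequent batches; the longer idle slice gives the anticipation logic more room to keep serving a reader.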
> But I would be really glad if I could just use the deadline scheduler
> to do 1 write for every 10 reads and make the write-expire timeout very
> high.

It won't help a lot, because of the dependent nature of the reads you are
doing. By the time you issue 1 read and it completes, and until you issue
the next read, you could very well have sent enough writes to the device
that the next read will take equally long to complete.

-- 
Jens Axboe
* Re: [PATCH 0/7] Per-bdi writeback flusher threads
From: Jos Houtman @ 2009-04-09 11:37 UTC (permalink / raw)
To: Jens Axboe; +Cc: Wu Fengguang, linux-kernel

Hi,

As a side note: is it correct that 2.6.30-rc1 points to the 2.6.29 tar
file?

> They do not. The MTRONs are in the "crap" ssd category, regardless of
> their (seemingly undeserved) high price tag. I tested a few of them
> some months ago and was less than impressed. It still sits behind a
> crap pata bridge, and its random write performance was abysmal.

Hmm, that is something to look into on our end. I know we did a
performance comparison mid-summer 2008 and the MTRON came out pretty
good; obviously we did something wrong, or the competition was even
worse. Do you have any top-of-the-head tips on brands/tooling or
comparison points, so we can more easily separate the good from the bad?
Random write IOPS is obviously important.

> So in general I find it quite weird that the writeback cannot keep up,
> there's not that much to keep up with. I'm guessing it's because of the
> quirky nature of the device when it comes to writes.

Wu said that the device was congested before it used up its
MAX_WRITEBACK_PAGES, which could be explained by bad write performance of
the device.

> As to the other problem, we usually do quite well on read-vs-write
> workloads. CFQ performs great for those: if I test the current
> 2.6.30-rc1 kernel, a read goes at > 90% of full performance with a
>
>   dd if=/dev/zero of=foo bs=1M
>
> running in the background, on both NCQ and non-NCQ drives. Could you
> try 2.6.30-rc1, just in case it works better for you? At least CFQ will
> behave better there in any case. AS should work fine for that as well,
> but don't expect very good read-vs-write performance with deadline or
> noop. Doing some sort of anticipation is crucial to get that right.
Running the dd test and an adjusted fsync-tester doing random 4k writes
in a large 8GB file, I come to the same conclusion: CFQ beats noop, and
the averages and std-devs of the reads per interval are better in both
tests. (I append the results below.)

But the application-level stress test performs both better and worse:
better averages with cfq, which are explained by the increased number of
errors due to timeouts (so the worst latencies are hidden by the
timeout). I'm gonna run the tests again without the timeout, but that
takes a few hours.

> What kind of read workload are you running? Many small files, big
> files, one big file, or?

Several big files that get lots of small random updates, next to a steady
load of new inserts, which I guess are sequentially appended in the data
file but randomly inserted in the index file. The reads are also small
and random, in the same files.

> OK, so I'm guessing it's bursty smallish reads. That is the hardest
> case. If your MTRON has a write cache, it's very possible that by the
> time we stop the writes and issue the read, the device takes a long
> time to service that read. And if we then mix reads and writes, it's
> basically impossible to get any sort of interactiveness out of it.

So if I understand correctly: the write cache would speed up the write
from the OS point of view, but each command given after the write still
has to wait until the write cache is processed? Is there any way I can
check for the presence of such a write cache?

> With the second-rate SSD devices, you probably need to tweak the IO
> scheduling a bit to make that work well. If you try 2.6.30-rc1, you
> could try setting 'slice_async_rq' to 1 and 'slice_async' to 5 in
> /sys/block/sda/queue/iosched/ (or sdX, whatever your device is) with
> CFQ and see if that makes a difference. If the device is really slow,
> perhaps try increasing slice_idle as well.

Those tweaks indeed give a performance increase in the dd and write-test
test cases.
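A scaled-down sketch of that write-test (16 MiB instead of the 8 GB used above, a deterministic offset sequence instead of a real RNG, and no fsync so writeback is left entirely to the kernel):

```shell
# Random 4 KiB writes into a preallocated file, fsync-tester style but
# without the fsync. conv=notrunc keeps the file size unchanged.
f=write-test.dat
dd if=/dev/zero of="$f" bs=1M count=16 2>/dev/null
blocks=$((16 * 1024 / 4))            # number of 4 KiB blocks in the file
i=1
while [ "$i" -le 64 ]; do
  off=$(( (i * 997) % blocks ))      # deterministic pseudo-random block offset
  dd if=/dev/zero of="$f" bs=4k count=1 seek="$off" conv=notrunc 2>/dev/null
  i=$((i + 1))
done
echo "dirtied 64 random 4k blocks"
```

Run against the device under test while a reader is active, this reproduces the bursty dirty-page load that the writeback thread later flushes.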
> It won't help a lot, because of the dependent nature of the reads you
> are doing. By the time you issue 1 read and it completes, and until you
> issue the next read, you could very well have sent enough writes to the
> device that the next read will take equally long to complete.

I don't really get this. I assume that by dependent you mean that the
first read gives the information necessary to issue the next read
request? I would expect a database (knowing the table schema) to make a
good estimate of which data it needs to retrieve from the datafile. The
only dependency is on the index, which is needed to know which rows the
query needs. But the majority of the index is cached in memory anyway.

Hmm, but that majority would probably point to the datafile part that is
kept in memory; it is an LRU after all. So a physical read on the index
file would have a high probability of causing a physical read on the
datafile. OK, point taken.

Thanks,

Jos

Performance tests:
  29-dirty   = the writeback branch with the blk-latency patches applied
  30         = 2.6.30-rc1
  write-test = random 4k writes to an 8GB file
  dd         = dd if=/dev/zero of=foo bs=1M

                                         read bytes/sec   std dev
  29-dirty-apr-dd-noop.text:             2.39144e+06      1.54062e+06
  30-apr-dd-noop.txt:                    291065           41556.6
  29-dirty-apr-write-test-noop.text:     2.4075e+07       4.29928e+06
  30-apr-write-test-noop.txt:            5.82404e+07      3.39006e+06
  29-dirty-apr-dd-cfq.text:              6.71294e+07      1.64957e+06
  30-apr-dd-cfq.txt:                     5.31077e+07      3.14862e+06
  29-dirty-apr-write-test-cfq.text:      6.57578e+07      2.31241e+06
  30-apr-write-test-cfq.txt:             6.87535e+07      2.23881e+06
  29-dirty-apr-write-test-cfq-tuned.txt: 7.12343e+07      1.79695e+06
  30-apr-write-test-cfq-tuned.txt:       7.74155e+07      2.16096e+06
  29-dirty-apr-dd-cfq-tuned.txt:         9.98722e+07      2.20931e+06
  30-apr-dd-cfq-tuned.txt:               7.08474e+07      2.58305e+06
* [PATCH 0/7] Per-bdi writeback flusher threads
From: Jens Axboe @ 2009-03-12 14:33 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel; +Cc: chris.mason, david, npiggin

Hi,

This is something I've wanted to play with for a while, and I finally got
it hacked up a few days ago. Consider it a playground for writeback
performance/behaviour testing :-)

There's a full description in the next few patches. They are against
current -git.

-- 
Jens Axboe
Thread overview: 7+ messages

2009-04-07 14:03 [PATCH 0/7] Per-bdi writeback flusher threads Jos Houtman
2009-04-08  0:44 ` Wu Fengguang
2009-04-08  6:20   ` Jens Axboe
2009-04-08  8:57     ` Jos Houtman
2009-04-08  9:13       ` Jens Axboe
2009-04-09 11:37         ` Jos Houtman

2009-03-12 14:33 [PATCH 0/7] Per-bdi writeback flusher threads Jens Axboe