Sorry, forgot the attachment :)

Thanks,
Fengguang

On Tue, Aug 03, 2010 at 11:04:46PM +0800, Wu Fengguang wrote:
> On Tue, Aug 03, 2010 at 08:52:49PM +0800, Jan Kara wrote:
> > On Tue 03-08-10 15:34:49, Wu Fengguang wrote:
> > > On Thu, Jul 29, 2010 at 04:45:23PM +0800, Christoph Hellwig wrote:
> > > > Btw, I'm very happy with all this writeback related progress we've made
> > > > for the 2.6.36 cycle.  The only major thing that's really missing, and
> > > > which should help dramatically with the I/O patterns is stopping direct
> > > > writeback from balance_dirty_pages().  I've seen patches from Wu and
> > > > Jan for this and lots of discussion.  If we get either variant in
> > > > this should be one of the best VM releases from the filesystem point of
> > > > view.
> > >
> > > Sorry for the delay. But I'm not feeling good about the current
> > > patches, both mine and Jan's.
> > >
> > > Accounting overheads/accuracy are the obvious problem. Both patches do
> > > not perform well on large NUMA machines and fast storage. They were found
> > > hard to improve in previous discussions.
> >   Yes, my patch for balance_dirty_pages() has a problem with percpu counter
> > (im)precision and resorting to pure atomic type could result in bouncing
> > of the cache line among CPUs completing the IO (at least that is the reason
> > why all other BDI stats are per-cpu I believe).
> >   We could solve the problem by doing the accounting on page IO submission
> > time (there using the atomic type should be fine as we mostly submit IO
> > from the flusher thread anyway). It's just that doing the accounting on
> > completion time has the nice property that we really hold the throttled
> > thread up to the moment when vm can really reuse the pages.
>
> Could try this and check how it works with NFS. The attached patch
> will also be necessary for the test. It implements a writeback wait
> queue for NFS; without it all dirty pages may be put to writeback.
>
> I suspect the resulting fluctuations will be the same. Because
> balance_dirty_pages() will wait on some background writeback (as you
> proposed), which will block on the NFS writeback queue, which in turn
> waits for the completion of COMMIT RPCs (the current patches directly
> wait here). On the completion of one COMMIT, lots of pages may be
> freed in a burst, which makes the whole stack progress very bumpy.
>
> > > We might do dirty throttling based on throughput, ignoring the
> > > writeback completions totally. The basic idea is, for current process,
> > > we already have a per-bdi-and-task threshold B as the local throttle
> >   Do we? The limit is currently just per-bdi, isn't it? Or do you mean
>
> bdi_dirty_limit() calls task_dirty_limit(), so it's also related to
> the current task. For convenience we called it per-bdi writeback :)
>
> > the ratelimiting - i.e. how often do we call balance_dirty_pages()?
> > That is per-cpu if I'm right.
> > > target.  When dirty pages go beyond B*80% for example, we start
> > > throttling the task's writeback throughput. The closer to B, the
> > > lower the throughput. When it reaches B or the global threshold, we
> > > completely stop it. The hope is, the throughput will be sustained at
> > > some balance point. This will need careful calculation to be
> > > stable/robust.
> >   But what do you exactly mean by throttling the task in your scenario?
> > What would it wait on?
>
> It will simply wait for e.g. 10ms for every N pages written. The
> closer to B, the smaller N will be.
>
> Thanks,
> Fengguang
>
> > > In this way, the throttle can be made very smooth.  My old experiments
> > > show that the current writeback completion based throttling fluctuates
> > > a lot for the stall time. In particular it makes bumpy writeback for
> > > NFS, so that sometimes the network pipe is not active at all and
> > > performance is impacted noticeably.
> > >
> > > By the way, we'll harvest a writeback IO controller :)
> >
> > 								Honza
> > --
> > Jan Kara
> > SUSE Labs, CR
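
For readers following the thread, the throughput-based throttling idea quoted
above (sleep about 10ms for every N pages written, with N shrinking as the
task's dirty pages approach B) could be sketched roughly as below. This is
only an illustration, not code from either patch: the function name, the
constants, and the linear ramp are all assumptions, and the thread itself
notes that the real calculation needs care to be stable.

```c
/* Hypothetical tunables; none of these names exist in the kernel. */
#define THROTTLE_START_PCT 80    /* start throttling at 80% of B (from the thread) */
#define PAUSE_MS           10    /* fixed sleep per batch (from the thread) */
#define MAX_BATCH_PAGES    1024  /* assumed value of N right at the 80% mark */

/*
 * Return N, the number of pages a task may write before sleeping PAUSE_MS.
 *
 *   -1  dirty pages are at or below 80% of B: no throttling
 *    0  B or the global threshold has been reached: stop completely
 *   >0  in between: N shrinks linearly as dirty approaches B
 *
 * thresh_b stands for the per-bdi-and-task threshold "B" from the thread.
 * The linear ramp is an assumption made for this sketch.
 */
static long pages_per_pause(unsigned long dirty, unsigned long thresh_b,
                            unsigned long global_dirty,
                            unsigned long global_thresh)
{
    unsigned long start = thresh_b * THROTTLE_START_PCT / 100;
    unsigned long span, room;
    long n;

    if (global_dirty >= global_thresh || dirty >= thresh_b)
        return 0;
    if (dirty <= start)
        return -1;

    span = thresh_b - start;   /* width of the throttling zone */
    room = thresh_b - dirty;   /* distance left before hitting B */
    n = (long)(MAX_BATCH_PAGES * room / span);

    return n > 0 ? n : 1;      /* always allow some progress below B */
}
```

A caller would sleep PAUSE_MS after every N pages, so the allowed rate is
roughly N pages per 10ms and falls smoothly toward zero near B instead of
stalling on writeback completions.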