* 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
@ 2004-04-27 1:12 Shantanu Goel
2004-04-27 2:15 ` Andrew Morton
0 siblings, 1 reply; 19+ messages in thread
From: Shantanu Goel @ 2004-04-27 1:12 UTC (permalink / raw)
To: Kernel
Hi,
During page reclamation, when the scanner encounters a
dirty page and invokes writepage(), if the FS layer
returns WRITEPAGE_ACTIVATE as NFS does, I think the
page should not be placed on the active list as is
presently done.  This can cause a lot of extraneous
swapout activity because, in the presence of a large
active list, the pages being written out will not be
reclaimed quickly enough.  It also seems counterintuitive,
since the scanner has just determined that the page has
not been recently referenced.
Shouldn't the following code from shrink_list():

	res = mapping->a_ops->writepage(page, &wbc);
	if (res < 0)
		handle_write_error(mapping, page, res);
	if (res == WRITEPAGE_ACTIVATE) {
		ClearPageReclaim(page);
		goto activate_locked;
	}

read:

	res = mapping->a_ops->writepage(page, &wbc);
	if (res < 0)
		handle_write_error(mapping, page, res);
	if (res == WRITEPAGE_ACTIVATE) {
		ClearPageReclaim(page);
		goto keep_locked;
	}
I can observe the benefit of this change if I run a dd
on an NFS mount with the active list full of mostly
mapped pages. The stock kernel ends up paging out
quite a bit of memory whereas the modified kernel does
not.
Comments?
Thanks,
Shantanu
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-27 1:12 ` Shantanu Goel
@ 2004-04-27 2:15 ` Andrew Morton
2004-04-27 3:11 ` Trond Myklebust
0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2004-04-27 2:15 UTC (permalink / raw)
To: Shantanu Goel; +Cc: linux-kernel, Trond Myklebust

Shantanu Goel <sgoel01@yahoo.com> wrote:
>
> Hi,
>
> During page reclamation, when the scanner encounters a
> dirty page and invokes writepage(), if the FS layer
> returns WRITEPAGE_ACTIVATE as NFS does, I think the
> page should not be placed on the active list as is
> presently done.  This can cause a lot of extraneous
> swapout activity because, in the presence of a large
> active list, the pages being written out will not be
> reclaimed quickly enough.  It also seems counterintuitive,
> since the scanner has just determined that the page has
> not been recently referenced.
>
> Shouldn't the following code from shrink_list():
>
>	res = mapping->a_ops->writepage(page, &wbc);
>	if (res < 0)
>		handle_write_error(mapping, page, res);
>	if (res == WRITEPAGE_ACTIVATE) {
>		ClearPageReclaim(page);
>		goto activate_locked;
>	}
>
> read:
>
>	res = mapping->a_ops->writepage(page, &wbc);
>	if (res < 0)
>		handle_write_error(mapping, page, res);
>	if (res == WRITEPAGE_ACTIVATE) {
>		ClearPageReclaim(page);
>		goto keep_locked;
>	}

WRITEPAGE_ACTIVATE is a bit of a hack to fix up specific peculiarities of
the interaction between tmpfs and page reclaim.

Trond, the changelog for that patch does not explain what is going on in
there - can you help out?

Also, what's the theory behind the handling of BDI_write_congested and
nfs_wait_on_write_congestion() in nfs_writepages()?  From a quick peek it
looks like NFS should be waking the sleepers in blk_congestion_wait()
rather than doing it privately?
> I can observe the benefit of this change if I run a dd
> on an NFS mount with the active list full of mostly
> mapped pages.  The stock kernel ends up paging out
> quite a bit of memory whereas the modified kernel does
> not.

yup.  We should be able to handle the throttling and writeback scheduling
from within core VFS/VM.  NFS should set and clear the backing_dev
congestion state appropriately and the VFS should take care of the rest.
The missing bit is the early blk_congestion_wait() termination.
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-27 2:15 ` Andrew Morton
@ 2004-04-27 3:11 ` Trond Myklebust
2004-04-27 3:59 ` Andrew Morton
0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2004-04-27 3:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Shantanu Goel, linux-kernel

On Mon, 2004-04-26 at 22:15, Andrew Morton wrote:
> WRITEPAGE_ACTIVATE is a bit of a hack to fix up specific peculiarities of
> the interaction between tmpfs and page reclaim.
>
> Trond, the changelog for that patch does not explain what is going on in
> there - can you help out?

As far as I understand, the WRITEPAGE_ACTIVATE hack is supposed to allow
filesystems to defer actually starting the I/O until the call to
->writepages().  This is indeed beneficial to NFS, since most servers
will work more efficiently if we cluster NFS write requests into "wsize"
sized chunks.

> Also, what's the theory behind the handling of BDI_write_congested and
> nfs_wait_on_write_congestion() in nfs_writepages()?  From a quick peek it
> looks like NFS should be waking the sleepers in blk_congestion_wait()
> rather than doing it privately?

The idea is mainly to prevent tasks from scheduling new writes if we are
in the situation of wanting to reclaim or otherwise flush out dirty
pages.  IOW: I am assuming that the writepages() method is usually called
only when we are low on memory and/or if pdflush() was triggered.

> yup.  We should be able to handle the throttling and writeback scheduling
> from within core VFS/VM.  NFS should set and clear the backing_dev
> congestion state appropriately and the VFS should take care of the rest.
> The missing bit is the early blk_congestion_wait() termination.

Err...  I appear to be missing something here.  What is it you mean by
*early* blk_congestion_wait() termination?

Cheers,
  Trond
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-27 3:11 ` Trond Myklebust
@ 2004-04-27 3:59 ` Andrew Morton
2004-04-27 5:23 ` Trond Myklebust
2004-04-27 6:10 ` Nick Piggin
0 siblings, 2 replies; 19+ messages in thread
From: Andrew Morton @ 2004-04-27 3:59 UTC (permalink / raw)
To: Trond Myklebust; +Cc: sgoel01, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
> On Mon, 2004-04-26 at 22:15, Andrew Morton wrote:
> > WRITEPAGE_ACTIVATE is a bit of a hack to fix up specific peculiarities of
> > the interaction between tmpfs and page reclaim.
> >
> > Trond, the changelog for that patch does not explain what is going on in
> > there - can you help out?
>
> As far as I understand, the WRITEPAGE_ACTIVATE hack is supposed to allow
> filesystems to defer actually starting the I/O until the call to
> ->writepages().  This is indeed beneficial to NFS, since most servers
> will work more efficiently if we cluster NFS write requests into "wsize"
> sized chunks.

No, it's purely used by tmpfs when we discover the page was under mlock or
we've run out of swapspace.

Yes, single-page writepage off the LRU is inefficient and we want
writepages() to do most of the work.  For most workloads, this is the case.
It's only the whacky mmap-based test which should encounter significant
amounts of vmscan.c writepage activity.  If you're seeing much traffic via
that route I'd be interested in knowing what the workload is.

There's one scenario in which we _do_ do too much writepage off the LRU,
and that's on 1G ia32 highmem boxes: the tiny highmem zone is smaller than
dirty_ratio and vmscan ends up doing work which balance_dirty_pages()
should have done.  Hard to fix, apart from reducing the dirty memory
limits.

> > Also, what's the theory behind the handling of BDI_write_congested and
> > nfs_wait_on_write_congestion() in nfs_writepages()?  From a quick peek it
> > looks like NFS should be waking the sleepers in blk_congestion_wait()
> > rather than doing it privately?
>
> The idea is mainly to prevent tasks from scheduling new writes if we are
> in the situation of wanting to reclaim or otherwise flush out dirty
> pages.  IOW: I am assuming that the writepages() method is usually called
> only when we are low on memory and/or if pdflush() was triggered.

writepages() is called by pdflush when the dirty memory exceeds
dirty_background_ratio and it is called by the write(2) caller when dirty
memory exceeds dirty_ratio.

balance_dirty_pages() will throttle the write(2) caller when we hit
dirty_ratio.  balance_dirty_pages() attempts to prevent the amount of dirty
memory from exceeding dirty_ratio by blocking the write(2) caller.

> > yup.  We should be able to handle the throttling and writeback scheduling
> > from within core VFS/VM.  NFS should set and clear the backing_dev
> > congestion state appropriately and the VFS should take care of the rest.
> > The missing bit is the early blk_congestion_wait() termination.
>
> Err...  I appear to be missing something here.  What is it you mean by
> *early* blk_congestion_wait() termination?

In the various places where the VM/VFS decides "gee, we need to wait for
some write I/O to terminate" it will call blk_congestion_wait() (the
function is seriously misnamed nowadays).  blk_congestion_wait() will wait
for a certain number of milliseconds, but that wait will terminate earlier
if the block layer completes some write requests.  So if writes are being
retired quickly, the sleeps are shorter.

What we've never had, and which I expected we'd need a year ago, was a
connection between network filesystems and blk_congestion_wait().  At
present, if someone calls blk_congestion_wait(HZ/20) when the only write
I/O in flight is NFS, they will sleep for 50 milliseconds no matter what's
going on.

What we should do is to deliver a wakeup from within NFS to the
blk_congestion_wait() sleepers when write I/O has completed.

Throttling of NFS writers should be working OK in 2.6.5.  The only problem
of which I am aware is that the callers of try_to_free_pages() and
balance_dirty_pages() may be sleeping for too long in blk_congestion_wait().
We can address that by calling into clear_backing_dev_congested() (see
below) from the place in NFS where we determine that writes have completed.

Of course there may be additional problems of which I'm unaware.

---

 25-akpm/drivers/block/ll_rw_blk.c |   20 +++++++++++++-------
 25-akpm/include/linux/writeback.h |    1 +
 2 files changed, 14 insertions(+), 7 deletions(-)

diff -puN drivers/block/ll_rw_blk.c~clear_baking_dev_congested drivers/block/ll_rw_blk.c
--- 25/drivers/block/ll_rw_blk.c~clear_baking_dev_congested	2004-04-26 20:52:05.752041120 -0700
+++ 25-akpm/drivers/block/ll_rw_blk.c	2004-04-26 20:54:20.997480688 -0700
@@ -95,22 +95,28 @@ static inline int queue_congestion_off_t
 	return ret;
 }

-/*
- * A queue has just exitted congestion.  Note this in the global counter of
- * congested queues, and wake up anyone who was waiting for requests to be
- * put back.
- */
-static void clear_queue_congested(request_queue_t *q, int rw)
+void clear_backing_dev_congested(struct backing_dev_info *bdi, int rw)
 {
 	enum bdi_state bit;
 	wait_queue_head_t *wqh = &congestion_wqh[rw];

 	bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
-	clear_bit(bit, &q->backing_dev_info.state);
+	clear_bit(bit, &bdi->state);
 	smp_mb__after_clear_bit();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
 }
+EXPORT_SYMBOL(clear_backing_dev_congested);
+
+/*
+ * A queue has just exitted congestion.  Note this in the global counter of
+ * congested queues, and wake up anyone who was waiting for requests to be
+ * put back.
+ */
+static void clear_queue_congested(request_queue_t *q, int rw)
+{
+	clear_backing_dev_congested(&q->backing_dev_info, rw);
+}

 /*
  * A queue has just entered congestion.  Flag that in the queue's VM-visible
diff -puN include/linux/writeback.h~clear_baking_dev_congested include/linux/writeback.h
--- 25/include/linux/writeback.h~clear_baking_dev_congested	2004-04-26 20:53:19.680802240 -0700
+++ 25-akpm/include/linux/writeback.h	2004-04-26 20:53:36.333270680 -0700
@@ -97,5 +97,6 @@ int do_writepages(struct address_space *

 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl read-only. */
+void clear_backing_dev_congested(struct backing_dev_info *bdi, int rw);

 #endif		/* WRITEBACK_H */
_
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-27 3:59 ` Andrew Morton
@ 2004-04-27 5:23 ` Trond Myklebust
2004-04-27 5:58 ` Andrew Morton
2004-04-27 6:10 ` Nick Piggin
1 sibling, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2004-04-27 5:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: sgoel01, linux-kernel

On Mon, 2004-04-26 at 23:59, Andrew Morton wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
> No, it's purely used by tmpfs when we discover the page was under mlock or
> we've run out of swapspace.

No: either it is a bona-fide interface, in which case it belongs in the
mm/vfs, or it is a private interface, in which case it doesn't.

I can see no reason to be playing games purely for the case of tmpfs,
but if indeed there is, then it needs a *lot* more comments in the code
than is currently the case: something along the lines of "don't ever use
this unless you are tmpfs"...

> Yes, single-page writepage off the LRU is inefficient and we want
> writepages() to do most of the work.  For most workloads, this is the case.
> It's only the whacky mmap-based test which should encounter significant
> amounts of vmscan.c writepage activity.  If you're seeing much traffic via
> that route I'd be interested in knowing what the workload is.

Err...  Anything that currently ends up calling writepage() and returning
WRITEPAGE_ACTIVATE.  This is a problem that you believe you are seeing
the effects of, right?

Currently, we use this on virtually anything that triggers writepage()
with the wbc->for_reclaim flag set.

> > The idea is mainly to prevent tasks from scheduling new writes if we are
> > in the situation of wanting to reclaim or otherwise flush out dirty
> > pages.  IOW: I am assuming that the writepages() method is usually called
> > only when we are low on memory and/or if pdflush() was triggered.
>
> writepages() is called by pdflush when the dirty memory exceeds
> dirty_background_ratio and it is called by the write(2) caller when dirty
> memory exceeds dirty_ratio.
>
> balance_dirty_pages() will throttle the write(2) caller when we hit
> dirty_ratio.  balance_dirty_pages() attempts to prevent the amount of dirty
> memory from exceeding dirty_ratio by blocking the write(2) caller.

...but if the pdflush() process hangs, due to an unavailable server or
something like that, how do we prevent other processes from hanging too?
AFAICS the backing-dev->state is the only way to ensure that alternate
pdflush processes try to free memory through other means first.

> In the various places where the VM/VFS decides "gee, we need to wait for
> some write I/O to terminate" it will call blk_congestion_wait() (the
> function is seriously misnamed nowadays).  blk_congestion_wait() will wait
> for a certain number of milliseconds, but that wait will terminate earlier
> if the block layer completes some write requests.  So if writes are being
> retired quickly, the sleeps are shorter.

We're not using the block layer.

> What we've never had, and which I expected we'd need a year ago was a
> connection between network filesystems and blk_congestion_wait().  At
> present, if someone calls blk_congestion_wait(HZ/20) when the only write
> I/O in flight is NFS, they will sleep for 50 milliseconds no matter what's
> going on.  What we should do is to deliver a wakeup from within NFS to the
> blk_congestion_wait() sleepers when write I/O has completed.

So exactly *how* do we define the concept "write I/O has completed"
here?

AFAICS that has to be entirely defined by the MM layer since it alone
can declare that writepages() (plus whatever was racing with
writepages() at the time) has satisfied its request for reclaiming
memory.

I certainly don't want the wretched thing to be scheduling yet more
writes while I'm still busy completing the last batch.  If I happen to be
on a slow network, that can cause me to enter a vicious circle where I
keep scheduling more and more I/O simply because my write speed does not
match the bandwidth expectations of the MM.

This is why I want to be throttling *all* those processes that would
otherwise be triggering yet more writepages() calls.

Cheers,
  Trond
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-27 5:23 ` Trond Myklebust
@ 2004-04-27 5:58 ` Andrew Morton
2004-04-27 11:44 ` Shantanu Goel
2004-04-27 15:36 ` Trond Myklebust
0 siblings, 2 replies; 19+ messages in thread
From: Andrew Morton @ 2004-04-27 5:58 UTC (permalink / raw)
To: Trond Myklebust; +Cc: sgoel01, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
> On Mon, 2004-04-26 at 23:59, Andrew Morton wrote:
> > Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> >
> > No, it's purely used by tmpfs when we discover the page was under mlock or
> > we've run out of swapspace.
>
> No: either it is a bona-fide interface, in which case it belongs in the
> mm/vfs, or it is a private interface, in which case it doesn't.
> I can see no reason to be playing games purely for the case of tmpfs,
> but if indeed there is, then it needs a *lot* more comments in the code
> than is currently the case: something along the lines of "don't ever use
> this unless you are tmpfs"...

It was designed for tmpfs's particular problems - tmpfs is quite
incestuous with the MM.  Yes, I suppose it should be renamed to
TMPFS_WRITEPAGE_FAILED or something.

> > Yes, single-page writepage off the LRU is inefficient and we want
> > writepages() to do most of the work.  For most workloads, this is the case.
> > It's only the whacky mmap-based test which should encounter significant
> > amounts of vmscan.c writepage activity.  If you're seeing much traffic via
> > that route I'd be interested in knowing what the workload is.
>
> Err...  Anything that currently ends up calling writepage() and returning
> WRITEPAGE_ACTIVATE.  This is a problem that you believe you are seeing
> the effects of, right?

I didn't report the problem - Shantanu is reporting problems where all the
NFS pages are stuck on the active list and the VM has nothing to reclaim on
the inactive list.

> Currently, we use this on virtually anything that triggers writepage()
> with the wbc->for_reclaim flag set.

Why?  Sorry, I still do not understand the effect which you're trying to
achieve?  I thought it was an efficiency thing: send more traffic via
writepages().  But your other comments seem to indicate that this is not
the primary reason.

> ...but if the pdflush() process hangs, due to an unavailable server or
> something like that, how do we prevent other processes from hanging too?

We'll lose one pdflush thread if this happens.  After that,
writeback_acquire() will fail and no other pdflush instances will get
stuck.

After that, user processes will start entering that fs trying to write
things back and they too will get stuck.  Not good.

You could do a writeback_acquire() in writepage and writepages perhaps.

> AFAICS the backing-dev->state is the only way to ensure that alternate
> pdflush processes try to free memory through other means first.

Certainly backing_dev_info is the right place to do this.  Again, I'd need
to know more about what problems you're trying to solve to say much more.

If the only problem is the everyone-hangs-because-of-a-dead-server problem
then we can either utilise the write-congestion state or we can introduce
an additional layer of congestion avoidance (similar to writeback_acquire)
to prevent tasks from trying to write to dead NFS servers.

> > In the various places where the VM/VFS decides "gee, we need to wait for
> > some write I/O to terminate" it will call blk_congestion_wait() (the
> > function is seriously misnamed nowadays).  blk_congestion_wait() will wait
> > for a certain number of milliseconds, but that wait will terminate earlier
> > if the block layer completes some write requests.  So if writes are being
> > retired quickly, the sleeps are shorter.
>
> We're not using the block layer.

That's a fairly irritating comment :(

I'm trying to explain how the current congestion/collision avoidance logic
works and how I expected that it could be generalised to network-backed
filesystems.

If we can determine beforehand that writes to the server will block then
this should fall out nicely.  NFS needs to set a backing_dev flag which
says "writes to this backing_dev will block soon" and writers can then
take avoiding action.

I assume such a thing can be implemented - there's an upper bound on the
number of writes which can be in flight on the wire?

> > What we've never had, and which I expected we'd need a year ago was a
> > connection between network filesystems and blk_congestion_wait().  At
> > present, if someone calls blk_congestion_wait(HZ/20) when the only write
> > I/O in flight is NFS, they will sleep for 50 milliseconds no matter what's
> > going on.  What we should do is to deliver a wakeup from within NFS to the
> > blk_congestion_wait() sleepers when write I/O has completed.
>
> So exactly *how* do we define the concept "write I/O has completed"
> here?

Simplistically: wherever NFS calls end_page_writeback().  But it would be
better to do it at a higher level if poss: nfs_writeback_done_full(),
perhaps?

> AFAICS that has to be entirely defined by the MM layer since it alone
> can declare that writepages() (plus whatever was racing with
> writepages() at the time) has satisfied its request for reclaiming
> memory.

Well...  the matter of reclaiming memory is somewhat separate from all of
this.  The code in there is all designed to throttle write()rs and memory
allocators to the rate at which the I/O system can retire writes.

It sounds like the problem which you're trying to solve here is the dead
server problem?  If so, yes, that will need additional logic somewhere.

> I certainly don't want the wretched thing to be scheduling yet more
> writes while I'm still busy completing the last batch.

OK.  In that case it would be preferable for the memory reclaim code to
just skip the page altogether (via may_write_to_queue()) and to go on and
try to reclaim the next page off the LRU.

And it would be preferable for the write(2) caller to skip this superblock
altogether and to go on and start working on a different superblock.

Setting the congestion indicator will cause this to happen.  If there are
no additional superblocks to be written then the write() caller will block
in blk_congestion_wait() and will then retry a writeback pass across the
superblocks.

> If I happen to be on a slow network, that can cause me to enter a
> vicious circle where I keep scheduling more and more I/O simply because
> my write speed does not match the bandwidth expectations of the MM.
> This is why I want to be throttling *all* those processes that would
> otherwise be triggering yet more writepages() calls.

Setting BDI_write_congested should have this effect for write(2) callers.
We may need an additional backing_dev flag for may_write_to_queue() to
inspect.

Bottom line: I _think_ you need to remove the WRITEPAGE_ACTIVATE stuff,
set a bdi->state flag (as well as BDI_write_congested) when the server
gets stuck, test that new flag in may_write_to_queue().
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-27 5:58 ` Andrew Morton
@ 2004-04-27 11:44 ` Shantanu Goel
1 sibling, 0 replies; 19+ messages in thread
From: Shantanu Goel @ 2004-04-27 11:44 UTC (permalink / raw)
To: Andrew Morton, Trond Myklebust; +Cc: sgoel01, linux-kernel

--- Andrew Morton <akpm@osdl.org> wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> >
> > Err...  Anything that currently ends up calling writepage() and returning
> > WRITEPAGE_ACTIVATE.  This is a problem that you believe you are seeing
> > the effects of, right?
>
> I didn't report the problem - Shantanu is reporting problems where all the
> NFS pages are stuck on the active list and the VM has nothing to reclaim on
> the inactive list.

Yup, what happens is the NFS writepage() returns WRITEPAGE_ACTIVATE
causing the scanner to place the page at the head of the active list.
Now the page won't be reclaimed until the scanner has run through the
entire active list.  I do not see such behaviour with ext3 for instance.
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-27 5:58 ` Andrew Morton
2004-04-27 11:44 ` Shantanu Goel
@ 2004-04-27 15:36 ` Trond Myklebust
2004-04-28 0:47 ` Shantanu Goel
2004-04-28 5:29 ` Christoph Hellwig
1 sibling, 2 replies; 19+ messages in thread
From: Trond Myklebust @ 2004-04-27 15:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: sgoel01, linux-kernel

On Tue, 2004-04-27 at 01:58, Andrew Morton wrote:
> > Currently, we use this on virtually anything that triggers writepage()
> > with the wbc->for_reclaim flag set.
>
> Why?  Sorry, I still do not understand the effect which you're trying to
> achieve?  I thought it was an efficiency thing: send more traffic via
> writepages().  But your other comments seem to indicate that this is not
> the primary reason.

That is the primary reason.  Under NFS, writing is a 2-stage process:

 - Stage 1: "schedule" the page for writing.  Basically this entails
   creating an nfs_page struct for each dirty page and putting this on
   the NFS private list of dirty pages.

 - Stage 2: "flush" the NFS private list of dirty pages.  This involves
   coalescing contiguous nfs_page structs into chunks of size "wsize",
   and actually putting them on the wire in the form of RPC requests.

writepage() only deals with one page at a time, so it will work fine for
doing stage 1.  If you also try to force it to do stage 2, then you will
end up with chunks of size <= PAGE_CACHE_SIZE because there will only be
1 page on the NFS private list of dirty pages.  In practice this again
means that you usually have to send 8 times as many RPC requests to the
server in order to process the same amount of data.  This is why I'd
like to try to leave stage 2 to writepages().

Now normally, that is not a problem, since the actual calls to
writepage() are in fact made by means of a call to generic_writepages()
from within nfs_writepages(), so I can do all the correct things once
I'm done with generic_writepages().  However shrink_cache() (where the
MM calls writepage() directly) needs to be treated specially, since that
is not wrapped by a writepages() call.  In that case, we return
WRITEPAGE_ACTIVATE and wait for pdflush to call writepages().

Cheers,
  Trond
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-27 15:36 ` Trond Myklebust
@ 2004-04-28 0:47 ` Shantanu Goel
2004-04-28 1:02 ` Andrew Morton
2004-04-28 5:29 ` Christoph Hellwig
1 sibling, 1 reply; 19+ messages in thread
From: Shantanu Goel @ 2004-04-28 0:47 UTC (permalink / raw)
To: Trond Myklebust, Andrew Morton; +Cc: sgoel01, linux-kernel

Andrew/Trond,

Any consensus as to what the right approach is here?

One cheap though perhaps hackish solution would be to introduce yet
another special error code like WRITEPAGE_ACTIVATE which, instead of
moving the page to the head of the active list, moves it to the head of
the inactive list.

In this case, would it also make sense for balance_pgdat() to
wakeup_bdflush() before blk_congestion_wait()?  Otherwise, kswapd()
would keep looping over the same pages at least until kupdate kicks in
or a process stalls in try_to_free_pages()?

Thanks,
Shantanu
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-28 0:47 ` Shantanu Goel
@ 2004-04-28 1:02 ` Andrew Morton
2004-04-28 1:28 ` Trond Myklebust
0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2004-04-28 1:02 UTC (permalink / raw)
To: Shantanu Goel; +Cc: trond.myklebust, sgoel01, linux-kernel

Shantanu Goel <sgoel01@yahoo.com> wrote:
>
> Andrew/Trond,
>
> Any consensus as to what the right approach is here?

For now I suggest you do

-		err = WRITEPAGE_ACTIVATE;
+		err = 0;

in nfs_writepage().

> One cheap though perhaps hackish solution would be to
> introduce yet another special error code like
> WRITEPAGE_ACTIVATE which, instead of moving the page to
> the head of the active list, moves it to the head of
> the inactive list.

It would be better to set a flag in the backing_dev so that page reclaim
knows to not enter writepage at all.
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-28 1:02 ` Andrew Morton
@ 2004-04-28 1:28 ` Trond Myklebust
2004-04-28 1:38 ` Andrew Morton
0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2004-04-28 1:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: Shantanu Goel, linux-kernel

On Tue, 2004-04-27 at 21:02, Andrew Morton wrote:
> Shantanu Goel <sgoel01@yahoo.com> wrote:
> >
> > Andrew/Trond,
> >
> > Any consensus as to what the right approach is here?
>
> For now I suggest you do
>
> -		err = WRITEPAGE_ACTIVATE;
> +		err = 0;
>
> in nfs_writepage().

That will just cause the page to be put back onto the inactive list
without starting writeback.  Won't that cause precisely those kswapd
loops that Shantanu was worried about?

AFAICS if you want to do this, you probably need to flush the page
immediately to disk on the server using a STABLE write as per the
appended patch.  The problem is that screws the server over pretty hard
as it will get flooded with what are in effect a load of 4k O_SYNC
writes.

Cheers,
  Trond

--- linux-2.6.6-rc2/fs/nfs/write.c.orig	2004-04-27 16:01:01.000000000 -0400
+++ linux-2.6.6-rc2/fs/nfs/write.c	2004-04-27 21:18:58.000000000 -0400
@@ -313,7 +313,7 @@
 		if (err >= 0) {
 			err = 0;
 			if (wbc->for_reclaim)
-				err = WRITEPAGE_ACTIVATE;
+				nfs_flush_inode(inode, 0, 0, FLUSH_STABLE);
 		}
 	} else {
 		err = nfs_writepage_sync(NULL, inode, page, 0,
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
2004-04-28 1:28 ` Trond Myklebust
@ 2004-04-28 1:38 ` Andrew Morton
0 siblings, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2004-04-28 1:38 UTC (permalink / raw)
To: Trond Myklebust; +Cc: sgoel01, linux-kernel

Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
> On Tue, 2004-04-27 at 21:02, Andrew Morton wrote:
> > Shantanu Goel <sgoel01@yahoo.com> wrote:
> > >
> > > Andrew/Trond,
> > >
> > > Any consensus as to what the right approach is here?
> >
> > For now I suggest you do
> >
> > -		err = WRITEPAGE_ACTIVATE;
> > +		err = 0;
> >
> > in nfs_writepage().
>
> That will just cause the page to be put back onto the inactive list
> without starting writeback.  Won't that cause precisely those kswapd
> loops that Shantanu was worried about?

Possibly - but the process which is in reclaim will throttle and will kick
pdflush which will do the writepages() thing.  It needs testing.

> AFAICS if you want to do this, you probably need to flush the page
> immediately to disk on the server using a STABLE write as per the
> appended patch.  The problem is that screws the server over pretty hard
> as it will get flooded with what are in effect a load of 4k O_SYNC
> writes.

Well I'd be interested in discovering which workloads actually suffer
significantly from doing this.

If it's a significant problem then perhaps we should resurrect the
writearound-from-within-writepage thing.  It's pretty simple to do,
especially since we now have efficient ways of finding neighbouring dirty
pages.
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
  2004-04-27 15:36     ` Trond Myklebust
  2004-04-28  0:47       ` Shantanu Goel
@ 2004-04-28  5:29       ` Christoph Hellwig
  2004-04-28 16:17         ` Trond Myklebust
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2004-04-28  5:29 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Andrew Morton, sgoel01, linux-kernel

On Tue, Apr 27, 2004 at 11:36:47AM -0400, Trond Myklebust wrote:
> writepage() only deals with one page at a time, so it will work fine
> for doing stage 1.
> If you also try to force it to do stage 2, then you will end up with
> chunks of size <= PAGE_CACHE_SIZE because there will only be 1 page on
> the NFS private list of dirty pages. In practice this again means that
> you usually have to send 8 times as many RPC requests to the server in
> order to process the same amount of data.

There's nothing that speaks against probing for more dirty pages before
and after the one ->writepage wants to write out, and sending the big
request out.  XFS does this to avoid creating small extents when
converting from delayed allocated space to real extents.
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
  2004-04-28  5:29       ` Christoph Hellwig
@ 2004-04-28 16:17         ` Trond Myklebust
  2004-04-28 16:38           ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Trond Myklebust @ 2004-04-28 16:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Andrew Morton, sgoel01, linux-kernel

On Wed, 2004-04-28 at 01:29, Christoph Hellwig wrote:
> There's nothing that speaks against probing for more dirty pages before
> and after the one ->writepage wants to write out, and sending the big
> request out.  XFS does this to avoid creating small extents when
> converting from delayed allocated space to real extents.

You are referring to xfs_probe_unmapped_page()? True: we could do
that... In fact we could do it a lot more efficiently now that we have
the pagevec_lookup_tag() interface.

Cheers,
  Trond
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
  2004-04-28 16:17         ` Trond Myklebust
@ 2004-04-28 16:38           ` Christoph Hellwig
  2004-04-28 19:07             ` Andrew Morton
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2004-04-28 16:38 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Andrew Morton, sgoel01, linux-kernel

On Wed, Apr 28, 2004 at 12:17:42PM -0400, Trond Myklebust wrote:
> On Wed, 2004-04-28 at 01:29, Christoph Hellwig wrote:
> > There's nothing that speaks against probing for more dirty pages
> > before and after the one ->writepage wants to write out, and sending
> > the big request out.  XFS does this to avoid creating small extents
> > when converting from delayed allocated space to real extents.
>
> You are referring to xfs_probe_unmapped_page()?

Yes, and the two other functions doing similar stuff.

> True: we could do
> that... In fact we could do it a lot more efficiently now that we have
> the pagevec_lookup_tag() interface.

Yes, pagevecs are much more efficient for that.  In fact I have a
workarea switching XFS over to use pagevecs for that probing.

I'm not yet sure where I'm heading with revamping xfs_aops.c, but what
I'd love to see in the end is more or less XFS implementing only
writepages, and a generic writepage implemented as a wrapper around
writepages.
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
  2004-04-28 16:38           ` Christoph Hellwig
@ 2004-04-28 19:07             ` Andrew Morton
  2004-04-29  1:48               ` Nathan Scott
  0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2004-04-28 19:07 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: trond.myklebust, sgoel01, linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:
>
> I'm not yet sure where I'm heading with revamping xfs_aops.c, but what
> I'd love to see in the end is more or less XFS implementing only
> writepages, and a generic writepage implemented as a wrapper around
> writepages.

That might make sense.  One problem is that writepage expects to be
passed a locked page whereas writepages() does not.

Any code which implements writearound-inside-writepage should be
targeted at a generic implementation, not an fs-specific one if
possible.  We could go look at the ->vm_writeback() a_op which was in
2.5.20 or thereabouts.  It was causing problems and had no discernible
benefits, so I ripped it out.

A writearound-within-writepage implementation would need to decide
whether it's going to use lock_page() or TryLockPage().  I expect
lock_page() will be OK - we only call in there for __GFP_FS allocators.
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
  2004-04-28 19:07             ` Andrew Morton
@ 2004-04-29  1:48               ` Nathan Scott
  2004-04-29  8:27                 ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Nathan Scott @ 2004-04-29  1:48 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton; +Cc: trond.myklebust, sgoel01, linux-kernel

On Wed, Apr 28, 2004 at 12:07:15PM -0700, Andrew Morton wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
> >
> > I'm not yet sure where I'm heading with revamping xfs_aops.c, but what
> > I'd love to see in the end is more or less XFS implementing only
> > writepages, and a generic writepage implemented as a wrapper around
> > writepages.
> ...
> Any code which implements writearound-inside-writepage should be
> targeted at a generic implementation, not an fs-specific one if
> possible.  We could go

Another thing I think XFS could make good use of on the generic
buffered IO path would be mpage_readpages and mpage_writepages variants
that can use a get_blocks_t (like the direct IO path) instead of the
single-block-at-a-time get_block_t interface that's used now.  The
existence of such beasts would probably impact the direction
xfs_aops.c takes a fair bit, I'd guess, Christoph?

cheers.

--
Nathan
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
  2004-04-29  1:48               ` Nathan Scott
@ 2004-04-29  8:27                 ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2004-04-29  8:27 UTC (permalink / raw)
To: Nathan Scott
Cc: Christoph Hellwig, Andrew Morton, trond.myklebust, sgoel01, linux-kernel

On Thu, Apr 29, 2004 at 11:48:21AM +1000, Nathan Scott wrote:
> Another thing I think XFS could make good use of on the generic
> buffered IO path would be mpage_readpages and mpage_writepages variants
> that can use a get_blocks_t (like the direct IO path) instead of the
> single-block-at-a-time get_block_t interface that's used now.  The
> existence of such beasts would probably impact the direction
> xfs_aops.c takes a fair bit, I'd guess, Christoph?

For mpage_readpages - yes.

To use mpage_writepages from XFS I'd need a large-scale rewrite to
handle delayed allocations and unwritten extents, so at least for the
2.6.x timeframe we're probably better off keeping our own ->writepages
inside XFS.
* Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback
  2004-04-27  3:59   ` Andrew Morton
  2004-04-27  5:23     ` Trond Myklebust
@ 2004-04-27  6:10     ` Nick Piggin
  1 sibling, 0 replies; 19+ messages in thread
From: Nick Piggin @ 2004-04-27  6:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Trond Myklebust, sgoel01, linux-kernel

Andrew Morton wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
>
>> On Mon, 2004-04-26 at 22:15, Andrew Morton wrote:
>>
>>> WRITEPAGE_ACTIVATE is a bit of a hack to fix up specific peculiarities of
>>> the interaction between tmpfs and page reclaim.
>>>
>>> Trond, the changelog for that patch does not explain what is going on in
>>> there - can you help out?
>>
>> As far as I understand, the WRITEPAGE_ACTIVATE hack is supposed to allow
>> filesystems to defer actually starting the I/O until the call to
>> ->writepages(). This is indeed beneficial to NFS, since most servers
>> will work more efficiently if we cluster NFS write requests into "wsize"
>> sized chunks.
>
> No, it's purely used by tmpfs when we discover the page was under mlock or
> we've run out of swapspace.
>
> Yes, single-page writepage off the LRU is inefficient and we want
> writepages() to do most of the work. For most workloads, this is the case.
> It's only the whacky mmap-based test which should encounter significant
> amounts of vmscan.c writepage activity.

Nikita has a patch to gather a page's dirty bits before taking it off
the active list. It might help here a bit. Looks like it would need
porting to the new rmap stuff.

> If you're seeing much traffic via
> that route I'd be interested in knowing what the workload is.
>
> There's one scenario in which we _do_ do too much writepage off the LRU,
> and that's on 1G ia32 highmem boxes: the tiny highmem zone is smaller than
> dirty_ratio and vmscan ends up doing work which balance_dirty_pages()
> should have done. Hard to fix, apart from reducing the dirty memory
> limits.

You might be able to improve this by not allocating the highmem zone
right down to its scanning watermark while zone normal has lots of free
memory.
Thread overview: 19+ messages
2004-04-27 1:12 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback Shantanu Goel
2004-04-27 2:15 ` Andrew Morton
2004-04-27 3:11 ` Trond Myklebust
2004-04-27 3:59 ` Andrew Morton
2004-04-27 5:23 ` Trond Myklebust
2004-04-27 5:58 ` Andrew Morton
2004-04-27 11:44 ` Shantanu Goel
2004-04-27 15:36 ` Trond Myklebust
2004-04-28 0:47 ` Shantanu Goel
2004-04-28 1:02 ` Andrew Morton
2004-04-28 1:28 ` Trond Myklebust
2004-04-28 1:38 ` Andrew Morton
2004-04-28 5:29 ` Christoph Hellwig
2004-04-28 16:17 ` Trond Myklebust
2004-04-28 16:38 ` Christoph Hellwig
2004-04-28 19:07 ` Andrew Morton
2004-04-29 1:48 ` Nathan Scott
2004-04-29 8:27 ` Christoph Hellwig
2004-04-27 6:10 ` Nick Piggin