* [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  [not found] ` <20090123220009.34DF.KOSAKI.MOTOHIRO@jp.fujitsu.com>
@ 2009-02-10  3:36 ` Mathieu Desnoyers
  2009-02-10  3:55   ` Nick Piggin
  2009-02-10  5:23   ` Linus Torvalds
  0 siblings, 2 replies; 5+ messages in thread
From: Mathieu Desnoyers @ 2009-02-10  3:36 UTC (permalink / raw)
  To: KOSAKI Motohiro, Jens Axboe, akpm, Peter Zijlstra, Linus Torvalds,
      Ingo Molnar, thomas.pi, Yuriy Lalym
  Cc: ltt-dev, linux-kernel, linux-mm

Related to:
http://bugzilla.kernel.org/show_bug.cgi?id=12309

Very annoying I/O latencies (20-30 seconds) have been occurring under heavy
I/O since ~2.6.18.

Yuriy Lalym noticed that the oom killer was eventually called. So I took a
look at /proc/meminfo and noticed that under my test case (a fio job created
from an LTTng block I/O trace, reproducing dd writing to a 20GB file while
ssh sessions are being opened), the Inactive(file) value increased, and the
total memory consumed increased until only 80kB (out of 16GB) were left.

So I first used cgroups to limit the memory usable by fio (or dd). This
seems to fix the problem.

Thomas noted that there seems to be a problem with pages passed to the block
I/O elevator not being counted as dirty. I looked at clear_page_dirty_for_io
and noticed that page_mkclean clears the dirty bit and then
set_page_dirty(page) is called on the page. This calls
mm/page-writeback.c:set_page_dirty(). I assume that
mapping->a_ops->set_page_dirty is NULL, so it calls
buffer.c:__set_page_dirty_buffers(), which calls set_buffer_dirty(bh).

So we come back to clear_page_dirty_for_io, where we decrement the dirty
accounting. This is a problem: we assume that the block layer will
re-increment it when it gets the page, but because the buffer is already
marked dirty, this won't happen.

So this patch fixes this behavior by only decrementing the page accounting
_after_ the block I/O writepage has been done.
The effect on my workload is that memory stops being completely filled by
page cache under heavy I/O. The vfs_cache_pressure value seems to work
again.

However, this does not fully solve the high-latency issue: when there are
enough vfs pages in cache that pages are being written directly to disk
rather than left in the page cache, the CFQ I/O scheduler does not seem able
to correctly prioritize I/O requests. I think this might be because, when
this high-pressure point is reached, all tasks are blocked in the same way
when they try to add pages to the page cache, independently of their I/O
priority. Any idea on how to fix this is welcome.

Related commits:

commit 7658cc289288b8ae7dd2c2224549a048431222b3
Author: Linus Torvalds <torvalds@macmini.osdl.org>
Date:   Fri Dec 29 10:00:58 2006 -0800

    VM: Fix nasty and subtle race in shared mmap'ed page writeback

commit 8c08540f8755c451d8b96ea14cfe796bc3cd712d
Author: Andrew Morton <akpm@osdl.org>
Date:   Sun Dec 10 02:19:24 2006 -0800

    [PATCH] clean up __set_page_dirty_nobuffers()

Both were merged in Dec 2006, which is between kernel v2.6.19 and
v2.6.20-rc3. This patch applies on 2.6.29-rc3.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: akpm@linux-foundation.org
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: thomas.pi@arcor.dea
CC: Yuriy Lalym <ylalym@gmail.com>
---
 mm/page-writeback.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

Index: linux-2.6-lttng/mm/page-writeback.c
===================================================================
--- linux-2.6-lttng.orig/mm/page-writeback.c	2009-02-09 20:18:41.000000000 -0500
+++ linux-2.6-lttng/mm/page-writeback.c	2009-02-09 20:42:39.000000000 -0500
@@ -945,6 +945,7 @@ int write_cache_pages(struct address_spa
 	int cycled;
 	int range_whole = 0;
 	long nr_to_write = wbc->nr_to_write;
+	int lazyaccounting;
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
@@ -1028,10 +1029,18 @@ continue_unlock:
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
+			lazyaccounting = clear_page_dirty_for_io(page);
+			if (!lazyaccounting)
 				goto continue_unlock;
 
 			ret = (*writepage)(page, wbc, data);
+
+			if (lazyaccounting == 2) {
+				dec_zone_page_state(page, NR_FILE_DIRTY);
+				dec_bdi_stat(mapping->backing_dev_info,
+						BDI_RECLAIMABLE);
+			}
+
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
 					unlock_page(page);
@@ -1149,6 +1158,7 @@ int write_one_page(struct page *page, in
 {
 	struct address_space *mapping = page->mapping;
 	int ret = 0;
+	int lazyaccounting;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = 1,
@@ -1159,7 +1169,8 @@ int write_one_page(struct page *page, in
 	if (wait)
 		wait_on_page_writeback(page);
 
-	if (clear_page_dirty_for_io(page)) {
+	lazyaccounting = clear_page_dirty_for_io(page);
+	if (lazyaccounting) {
 		page_cache_get(page);
 		ret = mapping->a_ops->writepage(page, &wbc);
 		if (ret == 0 && wait) {
@@ -1167,6 +1178,11 @@ int write_one_page(struct page *page, in
 			if (PageError(page))
 				ret = -EIO;
 		}
+		if (lazyaccounting == 2) {
+			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
+		}
 		page_cache_release(page);
 	} else {
 		unlock_page(page);
@@ -1312,6 +1328,11 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  *
  * This incoherency between the page's dirty flag and radix-tree tag is
  * unfortunate, but it only exists while the page is locked.
+ *
+ * Return values:
+ * 0 : page is not dirty
+ * 1 : page is dirty, no lazy accounting update remains to be performed
+ * 2 : page is dirty *and* a lazy accounting update must still be performed
  */
 int clear_page_dirty_for_io(struct page *page)
 {
@@ -1358,12 +1379,8 @@ int clear_page_dirty_for_io(struct page
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
-		if (TestClearPageDirty(page)) {
-			dec_zone_page_state(page, NR_FILE_DIRTY);
-			dec_bdi_stat(mapping->backing_dev_info,
-					BDI_RECLAIMABLE);
-			return 1;
-		}
+		if (TestClearPageDirty(page))
+			return 2;
 		return 0;
 	}
 	return TestClearPageDirty(page);

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  2009-02-10  3:36 ` [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O Mathieu Desnoyers
@ 2009-02-10  3:55   ` Nick Piggin
  2009-02-10  5:23   ` Linus Torvalds
  1 sibling, 0 replies; 5+ messages in thread
From: Nick Piggin @ 2009-02-10  3:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: KOSAKI Motohiro, Jens Axboe, akpm, Peter Zijlstra, Linus Torvalds,
      Ingo Molnar, thomas.pi, Yuriy Lalym, ltt-dev, linux-kernel, linux-mm

On Tuesday 10 February 2009 14:36:53 Mathieu Desnoyers wrote:
> Related to:
> http://bugzilla.kernel.org/show_bug.cgi?id=12309
>
> [...]
>
> So we come back to clear_page_dirty_for_io, where we decrement the dirty
> accounting. This is a problem: we assume that the block layer will
> re-increment it when it gets the page, but because the buffer is already
> marked dirty, this won't happen.
>
> So this patch fixes this behavior by only decrementing the page accounting
> _after_ the block I/O writepage has been done.
>
> The effect on my workload is that memory stops being completely filled
> by page cache under heavy I/O. The vfs_cache_pressure value seems to work
> again.

I don't think we're supposed to assume the block layer will re-increment
the dirty count? It should be all in the VM. And the VM should increment
the writeback count before sending the page to the block device, and dirty
page throttling also takes into account the number of writeback pages, so
it should not be allowed to fill up memory with dirty pages even if the
block device queue size is unlimited.
* Re: [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  2009-02-10  3:36 ` [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O Mathieu Desnoyers
  2009-02-10  3:55   ` Nick Piggin
@ 2009-02-10  5:23   ` Linus Torvalds
  2009-02-10  5:56     ` Nick Piggin
  2009-02-10  6:12     ` Mathieu Desnoyers
  1 sibling, 2 replies; 5+ messages in thread
From: Linus Torvalds @ 2009-02-10  5:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: KOSAKI Motohiro, Jens Axboe, akpm, Peter Zijlstra, Ingo Molnar,
      thomas.pi, Yuriy Lalym, ltt-dev, linux-kernel, linux-mm

On Mon, 9 Feb 2009, Mathieu Desnoyers wrote:
>
> So this patch fixes this behavior by only decrementing the page accounting
> _after_ the block I/O writepage has been done.

This makes no sense, really.

Or rather, I don't mind the notion of updating the counters only after IO
per se, and _that_ part of it probably makes sense. But why is it that you
only then fix up two of the call-sites? There's a lot more call-sites than
that for this function.

So if this really makes a big difference, that's an interesting starting
point for discussion, but I don't see how this particular patch could
possibly be the right thing to do.

		Linus
* Re: [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  2009-02-10  5:23 ` Linus Torvalds
@ 2009-02-10  5:56   ` Nick Piggin
  2009-02-10  6:12   ` Mathieu Desnoyers
  1 sibling, 0 replies; 5+ messages in thread
From: Nick Piggin @ 2009-02-10  5:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, KOSAKI Motohiro, Jens Axboe, akpm,
      Peter Zijlstra, Ingo Molnar, thomas.pi, Yuriy Lalym, ltt-dev,
      linux-kernel, linux-mm

On Tuesday 10 February 2009 16:23:56 Linus Torvalds wrote:
> On Mon, 9 Feb 2009, Mathieu Desnoyers wrote:
> > So this patch fixes this behavior by only decrementing the page
> > accounting _after_ the block I/O writepage has been done.
>
> This makes no sense, really.
>
> Or rather, I don't mind the notion of updating the counters only after IO
> per se, and _that_ part of it probably makes sense. But why is it that you
> only then fix up two of the call-sites. There's a lot more call-sites than
> that for this function.

Well, if you do that, then I'd think you also have to change some
calculations that today use dirty+writeback. In some ways it does make
sense, but OTOH it is natural in the pagecache, since writeback accounting
was introduced to treat writeback as basically equivalent to dirty. So
writeback && !dirty pages shouldn't cause things to blow up, or if they do
then hopefully it is a simple bug somewhere.
* Re: [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  2009-02-10  5:23 ` Linus Torvalds
  2009-02-10  5:56   ` Nick Piggin
@ 2009-02-10  6:12   ` Mathieu Desnoyers
  1 sibling, 0 replies; 5+ messages in thread
From: Mathieu Desnoyers @ 2009-02-10  6:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Jens Axboe, akpm, Peter Zijlstra, Ingo Molnar,
      thomas.pi, Yuriy Lalym, ltt-dev, linux-kernel, linux-mm

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Mon, 9 Feb 2009, Mathieu Desnoyers wrote:
> > So this patch fixes this behavior by only decrementing the page accounting
> > _after_ the block I/O writepage has been done.
>
> This makes no sense, really.
>
> Or rather, I don't mind the notion of updating the counters only after IO
> per se, and _that_ part of it probably makes sense. But why is it that you
> only then fix up two of the call-sites. There's a lot more call-sites than
> that for this function.
>
> So if this really makes a big difference, that's an interesting starting
> point for discussion, but I don't see how this particular patch could
> possibly be the right thing to do.

Yes, you are right.

Looking in more detail at /proc/meminfo under the workload, I notice this:

MemTotal:       16028812 kB
MemFree:        13651440 kB
Buffers:            8944 kB
Cached:          2209456 kB   <--- increments up to ~16GB
    (cached = global_page_state(NR_FILE_PAGES)
              - total_swapcache_pages - i.bufferram)
SwapCached:            0 kB
Active:            34668 kB
Inactive:        2200668 kB   <--- also K(pages[LRU_INACTIVE_ANON]
                                        + pages[LRU_INACTIVE_FILE])
Active(anon):      17136 kB
Inactive(anon):        0 kB
Active(file):      17532 kB
Inactive(file):  2200668 kB   <--- also K(pages[LRU_INACTIVE_FILE])
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      19535024 kB
SwapFree:       19535024 kB
Dirty:           1159036 kB
Writeback:             0 kB   <--- stays close to 0
AnonPages:         17060 kB
Mapped:             9476 kB
Slab:              96188 kB
SReclaimable:      79776 kB
SUnreclaim:        16412 kB
PageTables:         3364 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    27549428 kB
Committed_AS:      54292 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        9960 kB
VmallocChunk:   34359727667 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7552 kB
DirectMap2M:    16769024 kB

So I think simply subtracting K(pages[LRU_INACTIVE_FILE]) from avail_dirty
in clip_bdi_dirty_limit(), and considering it in balance_dirty_pages() and
throttle_vm_writeout(), would probably make my problem go away, but I would
like to understand exactly why this is needed and whether I would need to
consider other types of page counts that may have been forgotten.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
end of thread, other threads:[~2009-02-10  6:12 UTC | newest]