* [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  [not found] ` <20090123220009.34DF.KOSAKI.MOTOHIRO@jp.fujitsu.com>
@ 2009-02-10  3:36 ` Mathieu Desnoyers
  2009-02-10  3:55   ` Nick Piggin
  2009-02-10  5:23   ` Linus Torvalds
  0 siblings, 2 replies; 5+ messages in thread
From: Mathieu Desnoyers @ 2009-02-10  3:36 UTC (permalink / raw)
  To: KOSAKI Motohiro, Jens Axboe, akpm, Peter Zijlstra, Linus Torvalds,
      Ingo Molnar, thomas.pi, Yuriy Lalym
  Cc: ltt-dev, linux-kernel, linux-mm

Related to:
http://bugzilla.kernel.org/show_bug.cgi?id=12309

Very annoying I/O latencies (20-30 seconds) have been occurring under heavy
I/O since ~2.6.18.

Yuriy Lalym noticed that the oom killer was eventually called. So I took a
look at /proc/meminfo and noticed that under my test case (a fio job created
from an LTTng block I/O trace, reproducing dd writing to a 20GB file while
ssh sessions are being opened), the Inactive(file) value increased, and the
total memory consumed increased until only 80kB (out of 16GB) were left.

So I first used cgroups to limit the memory usable by fio (or dd). This
seems to fix the problem.

Thomas noted that there seems to be a problem with pages passed to the block
I/O elevator not being counted as dirty. I looked at clear_page_dirty_for_io
and noticed that page_mkclean clears the dirty bit and then
set_page_dirty(page) is called on the page. This calls
mm/page-writeback.c:set_page_dirty(). I assume that
mapping->a_ops->set_page_dirty is NULL, so it calls
buffer.c:__set_page_dirty_buffers(), which calls set_buffer_dirty(bh).

So we come back to clear_page_dirty_for_io, where we decrement the dirty
accounting. This is a problem: we assume that the block layer will
re-increment it when it gets the page, but because the buffer is already
marked dirty, this won't happen.

So this patch fixes this behavior by only decrementing the page accounting
_after_ the block I/O writepage has been done.
The effect on my workload is that memory stops being completely filled by
page cache under heavy I/O. The vfs_cache_pressure value seems to work
again.

However, this does not fully solve the high-latency issue: when there are
enough vfs pages in cache that pages are being written directly to disk
rather than left in the page cache, the CFQ I/O scheduler does not seem able
to correctly prioritize I/O requests. I think this might be because, when
this high-pressure point is reached, all tasks are blocked in the same way
when they try to add pages to the page cache, independently of their I/O
priority. Any idea on how to fix this is welcome.

Related commits:

commit 7658cc289288b8ae7dd2c2224549a048431222b3
Author: Linus Torvalds <torvalds@macmini.osdl.org>
Date:   Fri Dec 29 10:00:58 2006 -0800

    VM: Fix nasty and subtle race in shared mmap'ed page writeback

commit 8c08540f8755c451d8b96ea14cfe796bc3cd712d
Author: Andrew Morton <akpm@osdl.org>
Date:   Sun Dec 10 02:19:24 2006 -0800

    [PATCH] clean up __set_page_dirty_nobuffers()

Both were merged in Dec 2006, which is between kernel v2.6.19 and
v2.6.20-rc3. This patch applies on 2.6.29-rc3.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: akpm@linux-foundation.org
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: thomas.pi@arcor.dea
CC: Yuriy Lalym <ylalym@gmail.com>
---
 mm/page-writeback.c | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

Index: linux-2.6-lttng/mm/page-writeback.c
===================================================================
--- linux-2.6-lttng.orig/mm/page-writeback.c	2009-02-09 20:18:41.000000000 -0500
+++ linux-2.6-lttng/mm/page-writeback.c	2009-02-09 20:42:39.000000000 -0500
@@ -945,6 +945,7 @@ int write_cache_pages(struct address_spa
 	int cycled;
 	int range_whole = 0;
 	long nr_to_write = wbc->nr_to_write;
+	int lazyaccounting;
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
@@ -1028,10 +1029,18 @@ continue_unlock:
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
+			lazyaccounting = clear_page_dirty_for_io(page);
+			if (!lazyaccounting)
 				goto continue_unlock;
 
 			ret = (*writepage)(page, wbc, data);
+
+			if (lazyaccounting == 2) {
+				dec_zone_page_state(page, NR_FILE_DIRTY);
+				dec_bdi_stat(mapping->backing_dev_info,
+						BDI_RECLAIMABLE);
+			}
+
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
 					unlock_page(page);
@@ -1149,6 +1158,7 @@ int write_one_page(struct page *page, in
 {
 	struct address_space *mapping = page->mapping;
 	int ret = 0;
+	int lazyaccounting;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_ALL,
 		.nr_to_write = 1,
@@ -1159,7 +1169,8 @@ int write_one_page(struct page *page, in
 	if (wait)
 		wait_on_page_writeback(page);
 
-	if (clear_page_dirty_for_io(page)) {
+	lazyaccounting = clear_page_dirty_for_io(page);
+	if (lazyaccounting) {
 		page_cache_get(page);
 		ret = mapping->a_ops->writepage(page, &wbc);
 		if (ret == 0 && wait) {
@@ -1167,6 +1178,11 @@ int write_one_page(struct page *page, in
 			if (PageError(page))
 				ret = -EIO;
 		}
+		if (lazyaccounting == 2) {
+			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
+		}
 		page_cache_release(page);
 	} else {
 		unlock_page(page);
@@ -1312,6 +1328,11 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  *
  * This incoherency between the page's dirty flag and radix-tree tag is
  * unfortunate, but it only exists while the page is locked.
+ *
+ * Return values:
+ * 0 : page is not dirty
+ * 1 : page is dirty, no lazy accounting update remains to be performed
+ * 2 : page is dirty *and* a lazy accounting update must still be performed
  */
 int clear_page_dirty_for_io(struct page *page)
 {
@@ -1358,12 +1379,8 @@ int clear_page_dirty_for_io(struct page
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
-		if (TestClearPageDirty(page)) {
-			dec_zone_page_state(page, NR_FILE_DIRTY);
-			dec_bdi_stat(mapping->backing_dev_info,
-					BDI_RECLAIMABLE);
-			return 1;
-		}
+		if (TestClearPageDirty(page))
+			return 2;
 		return 0;
 	}
 	return TestClearPageDirty(page);

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  2009-02-10  3:36 ` [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O Mathieu Desnoyers
@ 2009-02-10  3:55   ` Nick Piggin
  2009-02-10  5:23   ` Linus Torvalds
  1 sibling, 0 replies; 5+ messages in thread
From: Nick Piggin @ 2009-02-10  3:55 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: KOSAKI Motohiro, Jens Axboe, akpm, Peter Zijlstra, Linus Torvalds,
      Ingo Molnar, thomas.pi, Yuriy Lalym, ltt-dev, linux-kernel, linux-mm

On Tuesday 10 February 2009 14:36:53 Mathieu Desnoyers wrote:
> Related to:
> http://bugzilla.kernel.org/show_bug.cgi?id=12309
>
> [...]
>
> So we come back to clear_page_dirty_for_io, where we decrement the dirty
> accounting. This is a problem: we assume that the block layer will
> re-increment it when it gets the page, but because the buffer is already
> marked dirty, this won't happen.
>
> So this patch fixes this behavior by only decrementing the page accounting
> _after_ the block I/O writepage has been done.
>
> The effect on my workload is that memory stops being completely filled
> by page cache under heavy I/O. The vfs_cache_pressure value seems to work
> again.

I don't think we're supposed to assume the block layer will re-increment
the dirty count? It should be all in the VM. And the VM should increment
the writeback count before sending the page to the block device, and dirty
page throttling also takes into account the number of writeback pages, so
it should not be allowed to fill up memory with dirty pages even if the
block device queue size is unlimited.
* Re: [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  2009-02-10  3:36 ` [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O Mathieu Desnoyers
  2009-02-10  3:55   ` Nick Piggin
@ 2009-02-10  5:23   ` Linus Torvalds
  2009-02-10  5:56     ` Nick Piggin
  2009-02-10  6:12     ` Mathieu Desnoyers
  1 sibling, 2 replies; 5+ messages in thread
From: Linus Torvalds @ 2009-02-10  5:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: KOSAKI Motohiro, Jens Axboe, akpm, Peter Zijlstra, Ingo Molnar,
      thomas.pi, Yuriy Lalym, ltt-dev, linux-kernel, linux-mm

On Mon, 9 Feb 2009, Mathieu Desnoyers wrote:
>
> So this patch fixes this behavior by only decrementing the page accounting
> _after_ the block I/O writepage has been done.

This makes no sense, really.

Or rather, I don't mind the notion of updating the counters only after IO
per se, and _that_ part of it probably makes sense. But why is it that you
only then fix up two of the call-sites? There's a lot more call-sites than
that for this function.

So if this really makes a big difference, that's an interesting starting
point for discussion, but I don't see how this particular patch could
possibly be the right thing to do.

		Linus
* Re: [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  2009-02-10  5:23 ` Linus Torvalds
@ 2009-02-10  5:56   ` Nick Piggin
  2009-02-10  6:12   ` Mathieu Desnoyers
  1 sibling, 0 replies; 5+ messages in thread
From: Nick Piggin @ 2009-02-10  5:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, KOSAKI Motohiro, Jens Axboe, akpm,
      Peter Zijlstra, Ingo Molnar, thomas.pi, Yuriy Lalym, ltt-dev,
      linux-kernel, linux-mm

On Tuesday 10 February 2009 16:23:56 Linus Torvalds wrote:
> On Mon, 9 Feb 2009, Mathieu Desnoyers wrote:
> > So this patch fixes this behavior by only decrementing the page
> > accounting _after_ the block I/O writepage has been done.
>
> This makes no sense, really.
>
> Or rather, I don't mind the notion of updating the counters only after IO
> per se, and _that_ part of it probably makes sense. But why is it that you
> only then fix up two of the call-sites. There's a lot more call-sites than
> that for this function.

Well, if you do that, then I'd think you also have to change some
calculations that today use dirty+writeback. In some ways it does make
sense, but OTOH it is natural in the pagecache, since writeback accounting
was introduced to treat writeback as basically equivalent to dirty. So
writeback && !dirty pages shouldn't cause things to blow up, or if they do
then hopefully it is a simple bug somewhere.
* Re: [PATCH] mm fix page writeback accounting to fix oom condition under heavy I/O
  2009-02-10  5:23 ` Linus Torvalds
  2009-02-10  5:56   ` Nick Piggin
@ 2009-02-10  6:12   ` Mathieu Desnoyers
  1 sibling, 0 replies; 5+ messages in thread
From: Mathieu Desnoyers @ 2009-02-10  6:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KOSAKI Motohiro, Jens Axboe, akpm, Peter Zijlstra, Ingo Molnar,
      thomas.pi, Yuriy Lalym, ltt-dev, linux-kernel, linux-mm

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Mon, 9 Feb 2009, Mathieu Desnoyers wrote:
> > So this patch fixes this behavior by only decrementing the page accounting
> > _after_ the block I/O writepage has been done.
>
> This makes no sense, really.
>
> Or rather, I don't mind the notion of updating the counters only after IO
> per se, and _that_ part of it probably makes sense. But why is it that you
> only then fix up two of the call-sites. There's a lot more call-sites than
> that for this function.
>
> So if this really makes a big difference, that's an interesting starting
> point for discussion, but I don't see how this particular patch could
> possibly be the right thing to do.

Yes, you are right.

Looking in more detail at /proc/meminfo under the workload, I notice this:

MemTotal:       16028812 kB
MemFree:        13651440 kB
Buffers:            8944 kB
Cached:          2209456 kB   <--- increments up to ~16GB
    (cached = global_page_state(NR_FILE_PAGES)
              - total_swapcache_pages - i.bufferram)
SwapCached:            0 kB
Active:            34668 kB
Inactive:        2200668 kB   <--- also K(pages[LRU_INACTIVE_ANON]
                                        + pages[LRU_INACTIVE_FILE])
Active(anon):      17136 kB
Inactive(anon):        0 kB
Active(file):      17532 kB
Inactive(file):  2200668 kB   <--- also K(pages[LRU_INACTIVE_FILE])
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      19535024 kB
SwapFree:       19535024 kB
Dirty:           1159036 kB
Writeback:             0 kB   <--- stays close to 0
AnonPages:         17060 kB
Mapped:             9476 kB
Slab:              96188 kB
SReclaimable:      79776 kB
SUnreclaim:        16412 kB
PageTables:         3364 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    27549428 kB
Committed_AS:      54292 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        9960 kB
VmallocChunk:   34359727667 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7552 kB
DirectMap2M:    16769024 kB

So I think simply subtracting K(pages[LRU_INACTIVE_FILE]) from avail_dirty
in clip_bdi_dirty_limit(), and considering it in balance_dirty_pages() and
throttle_vm_writeout(), would probably make my problem go away, but I would
like to understand exactly why this is needed and whether I would need to
consider other types of page counts that may have been forgotten.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
end of thread, other threads:[~2009-02-10  6:12 UTC | newest]