Re: [patch 0/5] mm: per-zone dirty limiting

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@suse.de>
To: Johannes Weiner <jweiner@redhat.com>
Cc: linux-mm@kvack.org, Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Rik van Riel <riel@redhat.com>,
	Minchan Kim <minchan.kim@gmail.com>, Jan Kara <jack@suse.cz>,
	Andi Kleen <ak@linux.intel.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [patch 0/5] mm: per-zone dirty limiting
Date: Tue, 26 Jul 2011 22:54:14 +0100	[thread overview]
Message-ID: <20110726215414.GF3010@suse.de> (raw)
In-Reply-To: <20110726180559.GA667@redhat.com>

On Tue, Jul 26, 2011 at 08:05:59PM +0200, Johannes Weiner wrote:
> On Tue, Jul 26, 2011 at 04:47:41PM +0100, Mel Gorman wrote:
> > On Mon, Jul 25, 2011 at 10:19:14PM +0200, Johannes Weiner wrote:
> > > Hello!
> > > 
> > > Writing back single file pages during reclaim exhibits bad IO
> > > patterns, but we can't just stop doing that before the VM has other
> > > means to ensure the pages in a zone are reclaimable.
> > > 
> > > Over time there were several suggestions of at least doing
> > > write-around of the pages in inode-proximity when the need arises to
> > > clean pages during memory pressure.  But even that would interrupt
> > > writeback from the flushers, without any guarantees that the nearby
> > > inode-pages are even sitting on the same troubled zone.
> > > 
> > > The reason why dirty pages reach the end of LRU lists in the first
> > > place is in part because the dirty limits are a global restriction
> > > while most systems have more than one LRU list that are different in
> > > size. Multiple nodes have multiple zones have multiple file lists but
> > > at the same time there is nothing to balance the dirty pages between
> > > the lists except for reclaim writing them out upon encounter.
> > > 
> > > With around 4G of RAM, a x86_64 machine of mine has a DMA32 zone of a
> > > bit over 3G, a Normal zone of 500M, and a DMA zone of 15M.
> > > 
> > > A linear writer can quickly fill up the Normal zone, then the DMA32
> > > zone, throttled by the dirty limit initially.  The flushers catch up,
> > > the zones are now mostly full of clean pages and memory reclaim kicks
> > > in on subsequent allocations.  The pages it frees from the Normal zone
> > > are quickly filled with dirty pages (unthrottled, as the much bigger
> > > DMA32 zone allows for a huge number of dirty pages in comparison to
> > > the Normal zone).  As there are also anon and active file pages on the
> > > Normal zone, it is not unlikely that a significant amount of its
> > > inactive file pages are now dirty [ foo=zone(global) ]:
> > > 
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=112313(821289) active=9942(10039) isolated=27(27) dirty=59709(146944) writeback=739(4017)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111102(806876) active=9925(10022) isolated=32(32) dirty=72125(146914) writeback=957(3972)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110493(803374) active=9871(9978) isolated=32(32) dirty=57274(146618) writeback=4088(4088)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111957(806559) active=9871(9978) isolated=32(32) dirty=65125(147329) writeback=456(3866)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110601(803978) active=9860(9973) isolated=27(27) dirty=63792(146590) writeback=61(4276)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111786(804032) active=9860(9973) isolated=0(64) dirty=64310(146998) writeback=1282(3847)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111643(805651) active=9860(9982) isolated=32(32) dirty=63778(147217) writeback=1127(4156)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111678(804709) active=9859(10112) isolated=27(27) dirty=81673(148224) writeback=29(4233)
> > > 
> > > [ These prints occur only once per reclaim invocation, so the actual
> > > ->writepage calls are more frequent than the timestamp may suggest. ]
> > > 
> > > In the scenario without the Normal zone, first the DMA32 zone fills
> > > up, then the DMA zone.  When reclaim kicks in, it is presented with a
> > > DMA zone whose inactive pages are all dirty -- and dirtied most
> > > recently at that, so the flushers really had abysmal chances at making
> > > some headway:
> > > 
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=776(430813) active=2(2931) isolated=32(32) dirty=814(68649) writeback=0(18765)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(430344) active=2(2931) isolated=32(32) dirty=764(67790) writeback=0(17146)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=729(430838) active=2(2931) isolated=32(32) dirty=293(65303) writeback=468(20122)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=757(431181) active=2(2931) isolated=32(32) dirty=63(68851) writeback=731(15926)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=758(432808) active=2(2931) isolated=32(32) dirty=645(64106) writeback=0(19666)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(431018) active=2(2931) isolated=32(32) dirty=740(65770) writeback=10(17907)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=697(430467) active=2(2931) isolated=32(32) dirty=743(63757) writeback=0(18826)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=693(430951) active=2(2931) isolated=32(32) dirty=626(54529) writeback=91(16198)
> > > 
> > 
> > Patches 1-7 of the series "Reduce filesystem writeback from page
> > reclaim" should have been able to cope with this as well by marking
> > the dirty pages PageReclaim and continuing on. While it could still
> > take some time before ZONE_DMA is cleaned, it is very unlikely that
> > it is the preferred zone for allocation.
> 
> My changes can not fully prevent dirty pages from reaching the LRU
> tail, so IMO we want your patches in any case (sorry I haven't replied
> yet, but I went through them and they look good to me.  Acks coming
> up).  But this should reduce what reclaim has to skip and shuffle.
> 

No need to be sorry, I guessed this work may be related so figured
you had at least seen them.  While I was reasonable sure the patches
were not mutually exclusive, there was no harm in checking you thought
the same.

> > > The idea behind this patch set is to take the ratio the global dirty
> > > limits have to the global memory state and put it into proportion to
> > > the individual zone.  The allocator ensures that pages allocated for
> > > being written to in the page cache are distributed across zones such
> > > that there are always enough clean pages on a zone to begin with.
> > > 
> > 
> > Ok, I comment on potential lowmem pressure problems with this in the
> > patch itself.
> > 
> > > I am not yet really satisfied as it's not really orthogonal or
> > > integrated with the other writeback throttling much, and has rough
> > > edges here and there, but test results do look rather promising so
> > > far:
> > > 
> > 
> > I'd consider that the idea behind this patchset is independent of
> > patches 1-7 of the "Reduce filesystem writeback from page reclaim"
> > series although it may also allow the application of patch 8 from
> > that series. Would you agree or do you think the series should be
> > mutually exclusive?
> 
> My patchset was triggered by patch 8 of your series, as I think we can
> not simply remove our only measure to stay on top of the dirty pages
> from a per-zone perspective.
> 

Agreed. I fully intend to drop patch 8 until there is a better way of
handling pages from a specific zone.

> But I think your patches 1-7 and this series complement each other in
> that one series tries to keep the dirty pages per-zone on sane levels
> and the other series improves how we deal with what dirty pages still
> end up at the lru tails.
> 

Agreed.

> > > --- Copying 8G to fuse-ntfs on USB stick in 4G machine
> > > 
> > 
> > Unusual choice of filesystem :) It'd also be worth testing ext3, ext4,
> > xfs and btrfs to make sure there are no surprises.
> 
> Yeah, testing has been really shallow so far as my test box is
> occupied with the exclusive memcg lru stuff.
> 
> Also, this is the stick my TV has to be able to read from ;-)
> 

heh, fair enough.

> > > 3.0:
> > > 
> > >  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> > > 
> > >        140,671,831 cache-misses             #      4.923 M/sec   ( +-   0.198% )  (scaled from 82.80%)
> > >        726,265,014 cache-references         #     25.417 M/sec   ( +-   1.104% )  (scaled from 83.06%)
> > >        144,092,383 branch-misses            #      4.157 %       ( +-   0.493% )  (scaled from 83.17%)
> > >      3,466,608,296 branches                 #    121.319 M/sec   ( +-   0.421% )  (scaled from 67.89%)
> > >     17,882,351,343 instructions             #      0.417 IPC     ( +-   0.457% )  (scaled from 84.73%)
> > >     42,848,633,897 cycles                   #   1499.554 M/sec   ( +-   0.604% )  (scaled from 83.08%)
> > >                236 page-faults              #      0.000 M/sec   ( +-   0.323% )
> > >              8,026 CPU-migrations           #      0.000 M/sec   ( +-   6.291% )
> > >          2,372,358 context-switches         #      0.083 M/sec   ( +-   0.003% )
> > >       28574.255540 task-clock-msecs         #      0.031 CPUs    ( +-   0.409% )
> > > 
> > >       912.625436885  seconds time elapsed   ( +-   3.851% )
> > > 
> > >  nr_vmscan_write 667839
> > > 
> > > 3.0-per-zone-dirty:
> > > 
> > >  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> > > 
> > >        140,791,501 cache-misses             #      3.887 M/sec   ( +-   0.186% )  (scaled from 83.09%)
> > >        816,474,193 cache-references         #     22.540 M/sec   ( +-   0.923% )  (scaled from 83.16%)
> > >        154,500,577 branch-misses            #      4.302 %       ( +-   0.495% )  (scaled from 83.15%)
> > >      3,591,344,338 branches                 #     99.143 M/sec   ( +-   0.402% )  (scaled from 67.32%)
> > >     18,713,190,183 instructions             #      0.338 IPC     ( +-   0.448% )  (scaled from 83.96%)
> > >     55,285,320,107 cycles                   #   1526.208 M/sec   ( +-   0.588% )  (scaled from 83.28%)
> > >                237 page-faults              #      0.000 M/sec   ( +-   0.302% )
> > >             28,028 CPU-migrations           #      0.001 M/sec   ( +-   3.070% )
> > >          2,369,897 context-switches         #      0.065 M/sec   ( +-   0.006% )
> > >       36223.970238 task-clock-msecs         #      0.060 CPUs    ( +-   1.062% )
> > > 
> > >       605.909769823  seconds time elapsed   ( +-   0.783% )
> > > 
> > >  nr_vmscan_write 0
> > > 
> > 
> > Very nice!
> > 
> > > That's an increase of throughput by 30% and no writeback interference
> > > from reclaim.
> > > 
> > 
> > Any idea how much dd was varying in performance on each run? I'd
> > still expect a gain but I've found dd to vary wildly at times even
> > if conv=fdatasync,fsync is specified.
> 
> The fluctuation is in the figures after the 'seconds time elapsed'.
> It is less than 1% for the six runs.
> 
> Or did you mean something else?
> 

No, this is what I meant. I sometimes see very large variances but
that is usually on machines that are also doing other work. At the
moment for these tests, I'm see variances of +/- 1.5% for XFS and +/-
3.6% for ext4 which is acceptable.

> > > As not every other allocation has to reclaim from a Normal zone full
> > > of dirty pages anymore, the patched kernel is also more responsive in
> > > general during the copy.
> > > 
> > > I am also running fs_mark on XFS on a 2G machine, but the final
> > > results are not in yet.  The preliminary results appear to be in this
> > > ballpark:
> > > 
> > > --- fs_mark -d fsmark-one -d fsmark-two -D 100 -N 150 -n 150 -L 25 -t 1 -S 0 -s $((10 << 20))
> > > 
> > > 3.0:
> > > 
> > > real    20m43.901s
> > > user    0m8.988s
> > > sys     0m58.227s
> > > nr_vmscan_write 3347
> > > 
> > > 3.0-per-zone-dirty:
> > > 
> > > real    20m8.012s
> > > user    0m8.862s
> > > sys     1m2.585s
> > > nr_vmscan_write 161
> > > 
> > 
> > Thats roughly a 2.8% gain. I was seeing about 4.2% but was testing with
> > mem=1G, not 2G and there are a lot of factors at play.
> 
> [...]
> 
> > > #4 adds per-zone dirty throttling for __GFP_WRITE allocators, #5
> > > passes __GFP_WRITE from the grab_cache_page* functions in the hope to
> > > get most writers and no readers; I haven't checked all sites yet.
> > > 
> > > Discuss! :-)
> > > 
> > 
> > I think the performance gain may be due to flusher threads simply
> > being more aggressive and I suspect it will have a smaller effect on
> > NUMA where the flushers could be cleaning pages on the wrong node.
> 
> I ran this same test with statistics (which I now realize should
> probably become part of this series) and they indicated that the
> flushers were not woken a single time from the new code.
> 

Scratch that theory so!

> All it did in this case was defer future-dirty pages from the Normal
> zone to the DMA32 zone.
> 
> My understanding is that as the dirty pages are forcibly spread out
> into the bigger zone, reclaim and flushers become less likely to step
> on each other's toes.
> 

It makes more sense although I am surprised I didn't see something
similar in the initial tests.

> > That said, your figures are very promising and it is worth
> > an investigation and you should expand the number of filesystems
> > tested. I did a quick set of similar benchmarks locally. I only ran
> > dd once which is a major flaw but wanted to get a quick look.
> 
> Yeah, more testing is definitely going to happen on this.  I tried
> other filesystems with one-shot runs as well, just to see if anything
> stood out, but nothing conclusive.
> 
> > 4 kernels were tested.
> > 
> > vanilla:	3.0
> > lesskswapd	Patches 1-7 from my series
> > perzonedirty	Your patches
> > lessks-pzdirty	Both
> > 
> > Backing storage was a USB key. Kernel was booted with mem=4608M to
> > get a 500M highest zone similar to yours.
> 
> I think what I wrote was a bit misleading.  The zone size example was
> taken from my desktop machine to simply point out the different zones
> sizes in a simple UMA machine.  But I ran this test on my laptop,
> where the Normal zone is ~880MB (226240 present pages).
> 

I don't think that would make a massive difference. At the moment, I'm
testing with mem=512M, mem=1024M and mem=4608M.

> The dirty_background_ratio is 10, dirty_ratio is 20, btw, ISTR that
> you had set them higher and I expect that to be a factor.
> 

I was testing with dirty_ratio=40 to make the writeback-from-reclaim
problem worse so that is another important difference between the
tests.

> The dd throughput is ~14 MB/s on the pzd kernel.
> 
> > SIMPLE WRITEBACK XFS
> >               simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
> >                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
> > 1                    526.83 ( 0.00%) 468.52 (12.45%) 542.05 (-2.81%) 464.42 (13.44%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)          7.27      7.34      7.69      7.96
> > Total Elapsed Time (seconds)                528.64    470.36    543.86    466.33
> > 
> > Direct pages scanned                             0         0         0         0
> > Direct pages reclaimed                           0         0         0         0
> > Kswapd pages scanned                       1058036   1167219   1060288   1169190
> > Kswapd pages reclaimed                      988591    979571    980278    981009
> > Kswapd efficiency                              93%       83%       92%       83%
> > Kswapd velocity                           2001.430  2481.544  1949.561  2507.216
> > Direct efficiency                             100%      100%      100%      100%
> > Direct velocity                              0.000     0.000     0.000     0.000
> > Percentage direct scans                         0%        0%        0%        0%
> > Page writes by reclaim                        4463      4587      4816      4910
> > Page reclaim invalidate                          0    145938         0    136510
> > 
> > Very few pages are being written back so I suspect any difference in
> > performance would be due to dd simply being very variable. I wasn't
> > running the monitoring that would tell me if the "Page writes" were
> > file-backed or anonymous but I assume they are file-backed. Your
> > patches did not seem to have much affect on the number of pages
> > written.
> 
> That's odd.  While it did not completely get rid of all file writes
> from reclaim, it reduced them consistently in all my tests so far.
> 

Do you see the same if dirty_ratio==40?

> I don't have swap space on any of my machines, but I wouldn't expect
> this to make a difference.
> 

No swap would affect the ratio of slab to LRU pages that are
reclaimed by slab shrinkers. It also affects the ratio of anon/file
pages that are isolated from the LRUs based on the calculations in
get_scan_count(). Either would affect results although I'd expect the
reclaiming of anonymous pages, increasing major faults and swapping
to make a bigger difference than shrinkers in a test case involving
dd to a single file.

Do you see the same results if swap is enabled?

> <SNIP>

-- 
Mel Gorman
SUSE Labs

WARNING: multiple messages have this Message-ID (diff)

From: Mel Gorman <mgorman@suse.de>
To: Johannes Weiner <jweiner@redhat.com>
Cc: linux-mm@kvack.org, Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Wu Fengguang <fengguang.wu@intel.com>,
	Rik van Riel <riel@redhat.com>,
	Minchan Kim <minchan.kim@gmail.com>, Jan Kara <jack@suse.cz>,
	Andi Kleen <ak@linux.intel.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [patch 0/5] mm: per-zone dirty limiting
Date: Tue, 26 Jul 2011 22:54:14 +0100	[thread overview]
Message-ID: <20110726215414.GF3010@suse.de> (raw)
In-Reply-To: <20110726180559.GA667@redhat.com>

On Tue, Jul 26, 2011 at 08:05:59PM +0200, Johannes Weiner wrote:
> On Tue, Jul 26, 2011 at 04:47:41PM +0100, Mel Gorman wrote:
> > On Mon, Jul 25, 2011 at 10:19:14PM +0200, Johannes Weiner wrote:
> > > Hello!
> > > 
> > > Writing back single file pages during reclaim exhibits bad IO
> > > patterns, but we can't just stop doing that before the VM has other
> > > means to ensure the pages in a zone are reclaimable.
> > > 
> > > Over time there were several suggestions of at least doing
> > > write-around of the pages in inode-proximity when the need arises to
> > > clean pages during memory pressure.  But even that would interrupt
> > > writeback from the flushers, without any guarantees that the nearby
> > > inode-pages are even sitting on the same troubled zone.
> > > 
> > > The reason why dirty pages reach the end of LRU lists in the first
> > > place is in part because the dirty limits are a global restriction
> > > while most systems have more than one LRU list that are different in
> > > size. Multiple nodes have multiple zones have multiple file lists but
> > > at the same time there is nothing to balance the dirty pages between
> > > the lists except for reclaim writing them out upon encounter.
> > > 
> > > With around 4G of RAM, a x86_64 machine of mine has a DMA32 zone of a
> > > bit over 3G, a Normal zone of 500M, and a DMA zone of 15M.
> > > 
> > > A linear writer can quickly fill up the Normal zone, then the DMA32
> > > zone, throttled by the dirty limit initially.  The flushers catch up,
> > > the zones are now mostly full of clean pages and memory reclaim kicks
> > > in on subsequent allocations.  The pages it frees from the Normal zone
> > > are quickly filled with dirty pages (unthrottled, as the much bigger
> > > DMA32 zone allows for a huge number of dirty pages in comparison to
> > > the Normal zone).  As there are also anon and active file pages on the
> > > Normal zone, it is not unlikely that a significant amount of its
> > > inactive file pages are now dirty [ foo=zone(global) ]:
> > > 
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=112313(821289) active=9942(10039) isolated=27(27) dirty=59709(146944) writeback=739(4017)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111102(806876) active=9925(10022) isolated=32(32) dirty=72125(146914) writeback=957(3972)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110493(803374) active=9871(9978) isolated=32(32) dirty=57274(146618) writeback=4088(4088)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111957(806559) active=9871(9978) isolated=32(32) dirty=65125(147329) writeback=456(3866)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=110601(803978) active=9860(9973) isolated=27(27) dirty=63792(146590) writeback=61(4276)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111786(804032) active=9860(9973) isolated=0(64) dirty=64310(146998) writeback=1282(3847)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111643(805651) active=9860(9982) isolated=32(32) dirty=63778(147217) writeback=1127(4156)
> > > reclaim: blkdev_writepage+0x0/0x20 zone=Normal inactive=111678(804709) active=9859(10112) isolated=27(27) dirty=81673(148224) writeback=29(4233)
> > > 
> > > [ These prints occur only once per reclaim invocation, so the actual
> > > ->writepage calls are more frequent than the timestamp may suggest. ]
> > > 
> > > In the scenario without the Normal zone, first the DMA32 zone fills
> > > up, then the DMA zone.  When reclaim kicks in, it is presented with a
> > > DMA zone whose inactive pages are all dirty -- and dirtied most
> > > recently at that, so the flushers really had abysmal chances at making
> > > some headway:
> > > 
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=776(430813) active=2(2931) isolated=32(32) dirty=814(68649) writeback=0(18765)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(430344) active=2(2931) isolated=32(32) dirty=764(67790) writeback=0(17146)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=729(430838) active=2(2931) isolated=32(32) dirty=293(65303) writeback=468(20122)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=757(431181) active=2(2931) isolated=32(32) dirty=63(68851) writeback=731(15926)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=758(432808) active=2(2931) isolated=32(32) dirty=645(64106) writeback=0(19666)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=726(431018) active=2(2931) isolated=32(32) dirty=740(65770) writeback=10(17907)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=697(430467) active=2(2931) isolated=32(32) dirty=743(63757) writeback=0(18826)
> > > reclaim: xfs_vm_writepage+0x0/0x4f0 zone=DMA inactive=693(430951) active=2(2931) isolated=32(32) dirty=626(54529) writeback=91(16198)
> > > 
> > 
> > Patches 1-7 of the series "Reduce filesystem writeback from page
> > reclaim" should have been able to cope with this as well by marking
> > the dirty pages PageReclaim and continuing on. While it could still
> > take some time before ZONE_DMA is cleaned, it is very unlikely that
> > it is the preferred zone for allocation.
> 
> My changes can not fully prevent dirty pages from reaching the LRU
> tail, so IMO we want your patches in any case (sorry I haven't replied
> yet, but I went through them and they look good to me.  Acks coming
> up).  But this should reduce what reclaim has to skip and shuffle.
> 

No need to be sorry, I guessed this work may be related so figured
you had at least seen them.  While I was reasonable sure the patches
were not mutually exclusive, there was no harm in checking you thought
the same.

> > > The idea behind this patch set is to take the ratio the global dirty
> > > limits have to the global memory state and put it into proportion to
> > > the individual zone.  The allocator ensures that pages allocated for
> > > being written to in the page cache are distributed across zones such
> > > that there are always enough clean pages on a zone to begin with.
> > > 
> > 
> > Ok, I comment on potential lowmem pressure problems with this in the
> > patch itself.
> > 
> > > I am not yet really satisfied as it's not really orthogonal or
> > > integrated with the other writeback throttling much, and has rough
> > > edges here and there, but test results do look rather promising so
> > > far:
> > > 
> > 
> > I'd consider that the idea behind this patchset is independent of
> > patches 1-7 of the "Reduce filesystem writeback from page reclaim"
> > series although it may also allow the application of patch 8 from
> > that series. Would you agree or do you think the series should be
> > mutually exclusive?
> 
> My patchset was triggered by patch 8 of your series, as I think we can
> not simply remove our only measure to stay on top of the dirty pages
> from a per-zone perspective.
> 

Agreed. I fully intend to drop patch 8 until there is a better way of
handling pages from a specific zone.

> But I think your patches 1-7 and this series complement each other in
> that one series tries to keep the dirty pages per-zone on sane levels
> and the other series improves how we deal with what dirty pages still
> end up at the lru tails.
> 

Agreed.

> > > --- Copying 8G to fuse-ntfs on USB stick in 4G machine
> > > 
> > 
> > Unusual choice of filesystem :) It'd also be worth testing ext3, ext4,
> > xfs and btrfs to make sure there are no surprises.
> 
> Yeah, testing has been really shallow so far as my test box is
> occupied with the exclusive memcg lru stuff.
> 
> Also, this is the stick my TV has to be able to read from ;-)
> 

heh, fair enough.

> > > 3.0:
> > > 
> > >  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> > > 
> > >        140,671,831 cache-misses             #      4.923 M/sec   ( +-   0.198% )  (scaled from 82.80%)
> > >        726,265,014 cache-references         #     25.417 M/sec   ( +-   1.104% )  (scaled from 83.06%)
> > >        144,092,383 branch-misses            #      4.157 %       ( +-   0.493% )  (scaled from 83.17%)
> > >      3,466,608,296 branches                 #    121.319 M/sec   ( +-   0.421% )  (scaled from 67.89%)
> > >     17,882,351,343 instructions             #      0.417 IPC     ( +-   0.457% )  (scaled from 84.73%)
> > >     42,848,633,897 cycles                   #   1499.554 M/sec   ( +-   0.604% )  (scaled from 83.08%)
> > >                236 page-faults              #      0.000 M/sec   ( +-   0.323% )
> > >              8,026 CPU-migrations           #      0.000 M/sec   ( +-   6.291% )
> > >          2,372,358 context-switches         #      0.083 M/sec   ( +-   0.003% )
> > >       28574.255540 task-clock-msecs         #      0.031 CPUs    ( +-   0.409% )
> > > 
> > >       912.625436885  seconds time elapsed   ( +-   3.851% )
> > > 
> > >  nr_vmscan_write 667839
> > > 
> > > 3.0-per-zone-dirty:
> > > 
> > >  Performance counter stats for 'dd if=/dev/zero of=zeroes bs=32k count=262144' (6 runs):
> > > 
> > >        140,791,501 cache-misses             #      3.887 M/sec   ( +-   0.186% )  (scaled from 83.09%)
> > >        816,474,193 cache-references         #     22.540 M/sec   ( +-   0.923% )  (scaled from 83.16%)
> > >        154,500,577 branch-misses            #      4.302 %       ( +-   0.495% )  (scaled from 83.15%)
> > >      3,591,344,338 branches                 #     99.143 M/sec   ( +-   0.402% )  (scaled from 67.32%)
> > >     18,713,190,183 instructions             #      0.338 IPC     ( +-   0.448% )  (scaled from 83.96%)
> > >     55,285,320,107 cycles                   #   1526.208 M/sec   ( +-   0.588% )  (scaled from 83.28%)
> > >                237 page-faults              #      0.000 M/sec   ( +-   0.302% )
> > >             28,028 CPU-migrations           #      0.001 M/sec   ( +-   3.070% )
> > >          2,369,897 context-switches         #      0.065 M/sec   ( +-   0.006% )
> > >       36223.970238 task-clock-msecs         #      0.060 CPUs    ( +-   1.062% )
> > > 
> > >       605.909769823  seconds time elapsed   ( +-   0.783% )
> > > 
> > >  nr_vmscan_write 0
> > > 
> > 
> > Very nice!
> > 
> > > That's an increase of throughput by 30% and no writeback interference
> > > from reclaim.
> > > 
> > 
> > Any idea how much dd was varying in performance on each run? I'd
> > still expect a gain but I've found dd to vary wildly at times even
> > if conv=fdatasync,fsync is specified.
> 
> The fluctuation is in the figures after the 'seconds time elapsed'.
> It is less than 1% for the six runs.
> 
> Or did you mean something else?
> 

No, this is what I meant. I sometimes see very large variances but
that is usually on machines that are also doing other work. At the
moment for these tests, I'm see variances of +/- 1.5% for XFS and +/-
3.6% for ext4 which is acceptable.

> > > As not every other allocation has to reclaim from a Normal zone full
> > > of dirty pages anymore, the patched kernel is also more responsive in
> > > general during the copy.
> > > 
> > > I am also running fs_mark on XFS on a 2G machine, but the final
> > > results are not in yet.  The preliminary results appear to be in this
> > > ballpark:
> > > 
> > > --- fs_mark -d fsmark-one -d fsmark-two -D 100 -N 150 -n 150 -L 25 -t 1 -S 0 -s $((10 << 20))
> > > 
> > > 3.0:
> > > 
> > > real    20m43.901s
> > > user    0m8.988s
> > > sys     0m58.227s
> > > nr_vmscan_write 3347
> > > 
> > > 3.0-per-zone-dirty:
> > > 
> > > real    20m8.012s
> > > user    0m8.862s
> > > sys     1m2.585s
> > > nr_vmscan_write 161
> > > 
> > 
> > Thats roughly a 2.8% gain. I was seeing about 4.2% but was testing with
> > mem=1G, not 2G and there are a lot of factors at play.
> 
> [...]
> 
> > > #4 adds per-zone dirty throttling for __GFP_WRITE allocators, #5
> > > passes __GFP_WRITE from the grab_cache_page* functions in the hope to
> > > get most writers and no readers; I haven't checked all sites yet.
> > > 
> > > Discuss! :-)
> > > 
> > 
> > I think the performance gain may be due to flusher threads simply
> > being more aggressive and I suspect it will have a smaller effect on
> > NUMA where the flushers could be cleaning pages on the wrong node.
> 
> I ran this same test with statistics (which I now realize should
> probably become part of this series) and they indicated that the
> flushers were not woken a single time from the new code.
> 

Scratch that theory so!

> All it did in this case was defer future-dirty pages from the Normal
> zone to the DMA32 zone.
> 
> My understanding is that as the dirty pages are forcibly spread out
> into the bigger zone, reclaim and flushers become less likely to step
> on each other's toes.
> 

It makes more sense although I am surprised I didn't see something
similar in the initial tests.

> > That said, your figures are very promising and it is worth
> > an investigation and you should expand the number of filesystems
> > tested. I did a quick set of similar benchmarks locally. I only ran
> > dd once which is a major flaw but wanted to get a quick look.
> 
> Yeah, more testing is definitely going to happen on this.  I tried
> other filesystems with one-shot runs as well, just to see if anything
> stood out, but nothing conclusive.
> 
> > 4 kernels were tested.
> > 
> > vanilla:	3.0
> > lesskswapd	Patches 1-7 from my series
> > perzonedirty	Your patches
> > lessks-pzdirty	Both
> > 
> > Backing storage was a USB key. Kernel was booted with mem=4608M to
> > get a 500M highest zone similar to yours.
> 
> I think what I wrote was a bit misleading.  The zone size example was
> taken from my desktop machine to simply point out the different zones
> sizes in a simple UMA machine.  But I ran this test on my laptop,
> where the Normal zone is ~880MB (226240 present pages).
> 

I don't think that would make a massive difference. At the moment, I'm
testing with mem=512M, mem=1024M and mem=4608M.

> The dirty_background_ratio is 10, dirty_ratio is 20, btw, ISTR that
> you had set them higher and I expect that to be a factor.
> 

I was testing with dirty_ratio=40 to make the writeback-from-reclaim
problem worse so that is another important difference between the
tests.

> The dd throughput is ~14 MB/s on the pzd kernel.
> 
> > SIMPLE WRITEBACK XFS
> >               simple-writeback   writeback-3.0.0   writeback-3.0.0      3.0.0-lessks
> >                  3.0.0-vanilla   lesskswapd-v3r1 perzonedirty-v1r1      pzdirty-v3r1
> > 1                    526.83 ( 0.00%) 468.52 (12.45%) 542.05 (-2.81%) 464.42 (13.44%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)          7.27      7.34      7.69      7.96
> > Total Elapsed Time (seconds)                528.64    470.36    543.86    466.33
> > 
> > Direct pages scanned                             0         0         0         0
> > Direct pages reclaimed                           0         0         0         0
> > Kswapd pages scanned                       1058036   1167219   1060288   1169190
> > Kswapd pages reclaimed                      988591    979571    980278    981009
> > Kswapd efficiency                              93%       83%       92%       83%
> > Kswapd velocity                           2001.430  2481.544  1949.561  2507.216
> > Direct efficiency                             100%      100%      100%      100%
> > Direct velocity                              0.000     0.000     0.000     0.000
> > Percentage direct scans                         0%        0%        0%        0%
> > Page writes by reclaim                        4463      4587      4816      4910
> > Page reclaim invalidate                          0    145938         0    136510
> > 
> > Very few pages are being written back so I suspect any difference in
> > performance would be due to dd simply being very variable. I wasn't
> > running the monitoring that would tell me if the "Page writes" were
> > file-backed or anonymous but I assume they are file-backed. Your
> > patches did not seem to have much affect on the number of pages
> > written.
> 
> That's odd.  While it did not completely get rid of all file writes
> from reclaim, it reduced them consistently in all my tests so far.
> 

Do you see the same if dirty_ratio==40?

> I don't have swap space on any of my machines, but I wouldn't expect
> this to make a difference.
> 

No swap would affect the ratio of slab to LRU pages that are
reclaimed by slab shrinkers. It also affects the ratio of anon/file
pages that are isolated from the LRUs based on the calculations in
get_scan_count(). Either would affect results although I'd expect the
reclaiming of anonymous pages, increasing major faults and swapping
to make a bigger difference than shrinkers in a test case involving
dd to a single file.

Do you see the same results if swap is enabled?

> <SNIP>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2011-07-26 21:54 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-25 20:19 [patch 0/5] mm: per-zone dirty limiting Johannes Weiner
2011-07-25 20:19 ` Johannes Weiner
2011-07-25 20:19 ` [patch 1/5] mm: page_alloc: increase __GFP_BITS_SHIFT to include __GFP_OTHER_NODE Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-25 20:52   ` Andi Kleen
2011-07-25 20:52     ` Andi Kleen
2011-07-25 22:56   ` Minchan Kim
2011-07-25 22:56     ` Minchan Kim
2011-07-26 13:51   ` Mel Gorman
2011-07-26 13:51     ` Mel Gorman
2011-07-27 12:50   ` Michal Hocko
2011-07-27 12:50     ` Michal Hocko
2011-08-05 14:16   ` Rik van Riel
2011-08-05 14:16     ` Rik van Riel
2011-07-25 20:19 ` [patch 2/5] mm: writeback: make determine_dirtyable_memory static again Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-26 13:53   ` Mel Gorman
2011-07-26 13:53     ` Mel Gorman
2011-07-27 12:59   ` Michal Hocko
2011-07-27 12:59     ` Michal Hocko
2011-08-05 14:38   ` Rik van Riel
2011-08-05 14:38     ` Rik van Riel
2011-07-25 20:19 ` [patch 3/5] mm: writeback: remove seriously stale comment on dirty limits Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-27 13:38   ` Michal Hocko
2011-07-27 13:38     ` Michal Hocko
2011-08-05 14:45   ` Rik van Riel
2011-08-05 14:45     ` Rik van Riel
2011-07-25 20:19 ` [patch 4/5] mm: writeback: throttle __GFP_WRITE on per-zone " Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-25 20:37   ` Andi Kleen
2011-07-25 20:37     ` Andi Kleen
2011-07-25 23:40     ` Minchan Kim
2011-07-25 23:40       ` Minchan Kim
2011-08-03 19:06       ` Johannes Weiner
2011-08-03 19:06         ` Johannes Weiner
2011-07-26 14:42   ` Mel Gorman
2011-07-26 14:42     ` Mel Gorman
2011-08-03 20:21     ` Johannes Weiner
2011-08-03 20:21       ` Johannes Weiner
2011-07-27 14:24   ` Michal Hocko
2011-07-27 14:24     ` Michal Hocko
2011-08-03 20:25     ` Johannes Weiner
2011-08-03 20:25       ` Johannes Weiner
2011-08-04  7:27       ` Michal Hocko
2011-08-04  7:27         ` Michal Hocko
2011-07-25 20:19 ` [patch 5/5] mm: filemap: horrid hack to pass __GFP_WRITE for most page cache writers Johannes Weiner
2011-07-25 20:19   ` Johannes Weiner
2011-07-26  0:16 ` [patch 0/5] mm: per-zone dirty limiting Minchan Kim
2011-07-26  0:16   ` Minchan Kim
2011-07-26 15:47 ` Mel Gorman
2011-07-26 15:47   ` Mel Gorman
2011-07-26 18:05   ` Johannes Weiner
2011-07-26 18:05     ` Johannes Weiner
2011-07-26 21:54     ` Mel Gorman [this message]
2011-07-26 21:54       ` Mel Gorman
2011-07-29 11:05     ` Mel Gorman
2011-07-29 11:05       ` Mel Gorman
2011-08-02 12:17       ` Johannes Weiner
2011-08-02 12:17         ` Johannes Weiner
2011-08-03 13:18         ` Mel Gorman
2011-08-03 13:18           ` Mel Gorman
2011-09-20 12:19           ` Johannes Weiner
2011-09-20 12:19             ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110726215414.GF3010@suse.de \
    --to=mgorman@suse.de \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=jweiner@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan.kim@gmail.com \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.