* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
[not found] ` <20110701041851.GN561@dastard>
@ 2011-07-01 9:33 ` Christoph Hellwig
2011-07-01 14:59 ` Mel Gorman
2011-07-01 15:41 ` Wu Fengguang
0 siblings, 2 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-07-01 9:33 UTC (permalink / raw)
To: Mel Gorman, Johannes Weiner, Wu Fengguang; +Cc: Dave Chinner, xfs, linux-mm
Johannes, Mel, Wu,
Dave has been stressing some XFS patches of mine that remove the XFS
internal writeback clustering in favour of using write_cache_pages.
As part of investigating the behaviour he found out that we're still
doing lots of I/O from the end of the LRU in kswapd. Not only is that
pretty bad behaviour in general, but it also means we really can't
just remove the writeback clustering in writepage given how much
I/O is still done through that.
Any chance we could get the writeback vs kswapd behaviour sorted out a bit
better finally?
Some excerpts from the previous discussion:
On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> I'm now only running test 180 on 100 files rather than the 1000 the
> test normally runs on, because it's faster and still shows the
> problem. That means the test is only using 1GB of disk space, and
> I'm running on a VM with 1GB RAM. It appears to be related to the VM
> triggering random page writeback from the LRU - 100x10MB files more
> than fills memory, hence it being the smallest test case I could
> reproduce the problem on.
>
> My triage notes are as follows, and the patch that fixes the bug is
> attached below.
>
> --- 180.out 2010-04-28 15:00:22.000000000 +1000
> +++ 180.out.bad 2011-07-01 12:44:12.000000000 +1000
> @@ -1 +1,9 @@
> QA output created by 180
> +file /mnt/scratch/81 has incorrect size 10473472 - sync failed
> +file /mnt/scratch/86 has incorrect size 10371072 - sync failed
> +file /mnt/scratch/87 has incorrect size 10104832 - sync failed
> +file /mnt/scratch/88 has incorrect size 10125312 - sync failed
> +file /mnt/scratch/89 has incorrect size 10469376 - sync failed
> +file /mnt/scratch/90 has incorrect size 10240000 - sync failed
> +file /mnt/scratch/91 has incorrect size 10362880 - sync failed
> +file /mnt/scratch/92 has incorrect size 10366976 - sync failed
>
> $ ls -li /mnt/scratch/ | awk '/rw/ { printf("0x%x %d %d\n", $1, $6, $10); }'
> 0x244093 10473472 81
> 0x244098 10371072 86
> 0x244099 10104832 87
> 0x24409a 10125312 88
> 0x24409b 10469376 89
> 0x24409c 10240000 90
> 0x24409d 10362880 91
> 0x24409e 10366976 92
>
> So looking at inode 0x244099 (/mnt/scratch/87), the last setfilesize
> call in the trace (got a separate patch for that) is:
>
> <...>-393 [000] 696245.229559: xfs_ilock_nowait: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> <...>-393 [000] 696245.229560: xfs_setfilesize: dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
> <...>-393 [000] 696245.229561: xfs_iunlock: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
>
> For an IO that was from offset 0x600000 for just under 4MB. The end
> of that IO is at byte 10104832, which is _exactly_ what the inode
> size says it is.
>
> It is very clear from the IO completions that we are getting a
> *lot* of kswapd driven writeback directly through .writepage:
>
> $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> 801
> $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> 78
>
> So there's ~900 IO completions that change the file size, and 90% of
> them are single page updates.
>
> $ ps -ef |grep [k]swap
> root 514 2 0 12:43 ? 00:00:00 [kswapd0]
> $ grep "writepage:" t.t | grep "514 " |wc -l
> 799
>
> Oh, now that is too close to just be a coincidence. We're getting
> significant amounts of random page writeback from the ends of
> the LRUs done by the VM.
>
> <sigh>
On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > Looks good. I still wonder why I haven't been able to hit this.
> > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > filesystems and since yesterday 1k as well.
>
> It requires the test to run the VM out of RAM and then force enough
> memory pressure for kswapd to start writeback from the LRU. The
> reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
>
> When kswapd starts doing writeback from the LRU, the iops rate goes
> through the roof (from ~300iops @~320k/io to ~7000iops @4k/io) and
> throughput drops from 100MB/s to ~30MB/s. BBWC is the only reason
> the IOPS stays as high as it does - maybe that is why I saw this and
> you haven't.
>
> As it is, the kswapd writeback behaviour is utterly atrocious and,
> ultimately, quite easy to provoke. I wish the MM folk would fix that
> goddamn problem already - we've only been complaining about it for
> the last 6 or 7 years. As such, I'm wondering if it's a bad idea to
> even consider removing the .writepage clustering...
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-01 9:33 ` [PATCH 03/27] xfs: use write_cache_pages for writeback clustering Christoph Hellwig
@ 2011-07-01 14:59 ` Mel Gorman
2011-07-01 15:15 ` Christoph Hellwig
2011-07-02 2:42 ` Dave Chinner
2011-07-01 15:41 ` Wu Fengguang
1 sibling, 2 replies; 20+ messages in thread
From: Mel Gorman @ 2011-07-01 14:59 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Johannes Weiner, Wu Fengguang, Dave Chinner, xfs, jack, linux-mm
On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> Johannes, Mel, Wu,
Am adding Jan Kara as he has been working on writeback efficiency
recently as well.
> Dave has been stressing some XFS patches of mine that remove the XFS
> internal writeback clustering in favour of using write_cache_pages.
>
Against what kernel? 2.6.38 was a disaster for reclaim I've been
finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.
> As part of investigating the behaviour he found out that we're still
> doing lots of I/O from the end of the LRU in kswapd. Not only is that
> pretty bad behaviour in general, but it also means we really can't
> just remove the writeback clustering in writepage given how much
> I/O is still done through that.
>
> Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> better finally?
>
> Some excerpts from the previous discussion:
>
> On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> > I'm now only running test 180 on 100 files rather than the 1000 the
> > test normally runs on, because it's faster and still shows the
> > problem.
I had stopped looking at writeback problems while Wu and Jan were
working on various writeback patchsets like io-less throttling. I
don't know where they currently stand and while I submitted a number
of reclaim patches since I last looked at this problem around 2.6.37,
they were related to migration, kswapd reclaiming too much memory
and kswapd using too much CPU - not writeback.
At the time I stopped, the tests I was looking at were writing very
few pages off the end of the LRU. Unfortunately I no longer have the
results to check but, for unrelated reasons, I've been running other
regression tests. Here is an example fsmark report over a number of
kernels. The machine used is old but unfortunately it's the only one
I have a full range of results for at the moment.
FS-Mark
                     2.6.32.42-mainline  2.6.34.10-mainline   2.6.37.6-mainline     2.6.38-mainline      2.6.39-mainline
Files/s min 162.80 ( 0.00%) 156.20 (-4.23%) 155.60 (-4.63%) 157.80 (-3.17%) 151.10 (-7.74%)
Files/s mean 173.77 ( 0.00%) 176.27 ( 1.42%) 168.19 (-3.32%) 172.98 (-0.45%) 172.05 (-1.00%)
Files/s stddev 7.64 ( 0.00%) 12.54 (39.05%) 8.55 (10.57%) 8.39 ( 8.90%) 10.30 (25.77%)
Files/s max 190.30 ( 0.00%) 206.80 ( 7.98%) 185.20 (-2.75%) 198.90 ( 4.32%) 201.00 ( 5.32%)
Overhead min 1742851.00 ( 0.00%) 1612311.00 ( 8.10%) 1251552.00 (39.26%) 1239859.00 (40.57%) 1393047.00 (25.11%)
Overhead mean 2443021.87 ( 0.00%) 2486525.60 (-1.75%) 2024365.53 (20.68%) 1849402.47 (32.10%) 1886692.53 (29.49%)
Overhead stddev 744034.70 ( 0.00%) 359446.19 (106.99%) 335986.49 (121.45%) 375627.48 (98.08%) 320901.34 (131.86%)
Overhead max 4744130.00 ( 0.00%) 3082235.00 (53.92%) 2561054.00 (85.24%) 2626346.00 (80.64%) 2559170.00 (85.38%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 624.12 647.61 658.8 670.78 653.98
Total Elapsed Time (seconds) 5767.71 5742.30 5974.45 5852.32 5760.49
MMTests Statistics: vmstat
Page Ins 3143712 3367600 3108596 3371952 3102548
Page Outs 104939296 105255268 105126820 105130540 105226620
Swap Ins 0 0 0 0 0
Swap Outs 0 0 0 0 0
Direct pages scanned 3521 131 7035 0 0
Kswapd pages scanned 23596104 23662641 23588211 23695015 23638226
Kswapd pages reclaimed 23594758 23661359 23587478 23693447 23637005
Direct pages reclaimed 3521 131 7031 0 0
Kswapd efficiency 99% 99% 99% 99% 99%
Kswapd velocity 4091.070 4120.760 3948.181 4048.824 4103.510
Direct efficiency 100% 100% 99% 100% 100%
Direct velocity 0.610 0.023 1.178 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0%
Page writes by reclaim 75 32 37 252 44
Slabs scanned 1843200 1927168 2714112 2801280 2738816
Direct inode steals 0 0 0 0 0
Kswapd inode steals 1827970 1822770 1669879 1819583 1681155
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Compaction pages moved 0 0 0 228180 0
Compaction move failure 0 0 0 637776 0
The number of pages written from reclaim is exceptionally low (2.6.38
was a total disaster but that release was bad for a number of reasons,
haven't tested 2.6.38.8 yet) but reduced by 2.6.37 as expected. Direct
reclaim usage was reduced and efficiency (ratio of pages scanned to
pages reclaimed) was high.
As I look through the results I have at the moment, the number of
pages written back was simply really low which is why the problem fell
off my radar.
> > That means the test is only using 1GB of disk space, and
> > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > triggering random page writeback from the LRU - 100x10MB files more
> > than fills memory, hence it being the smallest test case I could
> > reproduce the problem on.
> >
My tests were on a machine with 8G and ext3. I'm running some of
the tests against ext4 and xfs to see if that makes a difference but
it's possible the tests are simply not aggressive enough so I want to
reproduce Dave's test if possible.
I'm assuming "test 180" is from xfstests, which was not one of the tests
I used previously. To run with 100 files instead of 1000, was the file
"180" simply edited to make it look like this loop instead?
# create files and sync them
i=1;
while [ $i -lt 100 ]
do
file=$SCRATCH_MNT/$i
xfs_io -f -c "pwrite -b 64k -S 0xff 0 10m" $file > /dev/null
if [ $? -ne 0 ]
then
echo error creating/writing file $file
exit
fi
let i=$i+1
done
> > My triage notes are as follows, and the patch that fixes the bug is
> > attached below.
> >
> > <SNIP>
> >
> > <...>-393 [000] 696245.229559: xfs_ilock_nowait: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> > <...>-393 [000] 696245.229560: xfs_setfilesize: dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
> > <...>-393 [000] 696245.229561: xfs_iunlock: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> >
> > For an IO that was from offset 0x600000 for just under 4MB. The end
> > of that IO is at byte 10104832, which is _exactly_ what the inode
> > size says it is.
> >
> > It is very clear from the IO completions that we are getting a
> > *lot* of kswapd driven writeback directly through .writepage:
> >
> > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > 801
> > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > 78
> >
> > So there's ~900 IO completions that change the file size, and 90% of
> > them are single page updates.
> >
> > $ ps -ef |grep [k]swap
> > root 514 2 0 12:43 ? 00:00:00 [kswapd0]
> > $ grep "writepage:" t.t | grep "514 " |wc -l
> > 799
> >
> > Oh, now that is too close to just be a coincidence. We're getting
> > significant amounts of random page writeback from the ends of
> > the LRUs done by the VM.
> >
> > <sigh>
Does the value for nr_vmscan_write in /proc/vmstat correlate? It should,
but let me be sure because I'm using that figure rather than ftrace to
count writebacks at the moment. A more relevant question is this -
how many pages were reclaimed by kswapd and what percentage is 799
pages of that? What do you consider an acceptable percentage?
> On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > Looks good. I still wonder why I haven't been able to hit this.
> > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > filesystems and since yesterday 1k as well.
> >
> > It requires the test to run the VM out of RAM and then force enough
> > memory pressure for kswapd to start writeback from the LRU. The
> > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> >
You say it's a 1G VM but you don't say what architecture. What is
the size of the highest zone? If this is 32-bit x86 for example, the
highest zone is HighMem and it would be really small. Unfortunately
it would always be the first choice for allocating and reclaiming
from which would drastically increase the number of pages written back
from reclaim.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-01 14:59 ` Mel Gorman
@ 2011-07-01 15:15 ` Christoph Hellwig
2011-07-02 2:42 ` Dave Chinner
1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-07-01 15:15 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Hellwig, Johannes Weiner, Wu Fengguang, Dave Chinner,
xfs, jack, linux-mm
On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
>
> Am adding Jan Kara as he has been working on writeback efficiency
> recently as well.
>
> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
> >
>
> Against what kernel? 2.6.38 was a disaster for reclaim I've been
> finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.
The patch series is against current 3.0-rc, I assume that's what Dave
tested as well.
> I'm assuming "test 180" is from xfstests, which was not one of the tests
> I used previously. To run with 100 files instead of 1000, was the file
> "180" simply edited to make it look like this loop instead?
Yes, to both questions.
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-01 9:33 ` [PATCH 03/27] xfs: use write_cache_pages for writeback clustering Christoph Hellwig
2011-07-01 14:59 ` Mel Gorman
@ 2011-07-01 15:41 ` Wu Fengguang
2011-07-04 3:25 ` Dave Chinner
1 sibling, 1 reply; 20+ messages in thread
From: Wu Fengguang @ 2011-07-01 15:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mel Gorman, Johannes Weiner, Dave Chinner, xfs@oss.sgi.com,
linux-mm@kvack.org
Christoph,
On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> Johannes, Mel, Wu,
>
> Dave has been stressing some XFS patches of mine that remove the XFS
> internal writeback clustering in favour of using write_cache_pages.
>
> As part of investigating the behaviour he found out that we're still
> doing lots of I/O from the end of the LRU in kswapd. Not only is that
> pretty bad behaviour in general, but it also means we really can't
> just remove the writeback clustering in writepage given how much
> I/O is still done through that.
>
> Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> better finally?
I once tried this approach:
http://www.spinics.net/lists/linux-mm/msg09202.html
It used a list structure that is not linearly scalable, however that
part should be independently improvable when necessary.
The real problem was, it seemed to not be very effective in my test
runs. I found many ->nr_pages works queued before the ->inode works,
which effectively makes the flusher work on more dispersed pages rather
than focusing on the dirty pages encountered in LRU reclaim.
So for the patch to work efficiently, we'll need to first merge the
->nr_pages works and make them lower priority than the ->inode works.
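For illustration, here is a minimal sketch of that kind of handoff.
The work structure, the list fields on the bdi and the helper name are
hypothetical labels made up for this mail, not code from that patch:

	/*
	 * Sketch: when reclaim trips over a dirty page, queue a work item
	 * naming the inode so the flusher can write a whole cluster around
	 * it, and service these ahead of plain ->nr_pages works.
	 */
	struct wb_reclaim_work {
		struct list_head	list;
		struct inode		*inode;
		pgoff_t			offset;		/* page reclaim stumbled on */
		long			nr_pages;	/* cluster size to write */
	};

	static void reclaim_queue_inode_work(struct backing_dev_info *bdi,
					     struct page *page)
	{
		struct wb_reclaim_work *work;

		work = kzalloc(sizeof(*work), GFP_NOWAIT);
		if (!work)
			return;				/* just skip this page */

		work->inode = igrab(page->mapping->host);	/* pin the inode */
		if (!work->inode) {
			kfree(work);
			return;
		}
		work->offset = page->index;
		work->nr_pages = 1024;	/* write a whole cluster, not one page */

		/* reclaim_work_lock/list are hypothetical bdi fields */
		spin_lock(&bdi->reclaim_work_lock);
		/* queue at the head so ->inode works beat ->nr_pages works */
		list_add(&work->list, &bdi->reclaim_work_list);
		spin_unlock(&bdi->reclaim_work_lock);
		wake_up_process(bdi->wb.task);
	}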
Thanks,
Fengguang
> Some excerpts from the previous discussion:
>
> On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> > I'm now only running test 180 on 100 files rather than the 1000 the
> > test normally runs on, because it's faster and still shows the
> > problem. That means the test is only using 1GB of disk space, and
> > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > triggering random page writeback from the LRU - 100x10MB files more
> > than fills memory, hence it being the smallest test case I could
> > reproduce the problem on.
> >
> > My triage notes are as follows, and the patch that fixes the bug is
> > attached below.
> >
> > --- 180.out 2010-04-28 15:00:22.000000000 +1000
> > +++ 180.out.bad 2011-07-01 12:44:12.000000000 +1000
> > @@ -1 +1,9 @@
> > QA output created by 180
> > +file /mnt/scratch/81 has incorrect size 10473472 - sync failed
> > +file /mnt/scratch/86 has incorrect size 10371072 - sync failed
> > +file /mnt/scratch/87 has incorrect size 10104832 - sync failed
> > +file /mnt/scratch/88 has incorrect size 10125312 - sync failed
> > +file /mnt/scratch/89 has incorrect size 10469376 - sync failed
> > +file /mnt/scratch/90 has incorrect size 10240000 - sync failed
> > +file /mnt/scratch/91 has incorrect size 10362880 - sync failed
> > +file /mnt/scratch/92 has incorrect size 10366976 - sync failed
> >
> > $ ls -li /mnt/scratch/ | awk '/rw/ { printf("0x%x %d %d\n", $1, $6, $10); }'
> > 0x244093 10473472 81
> > 0x244098 10371072 86
> > 0x244099 10104832 87
> > 0x24409a 10125312 88
> > 0x24409b 10469376 89
> > 0x24409c 10240000 90
> > 0x24409d 10362880 91
> > 0x24409e 10366976 92
> >
> > So looking at inode 0x244099 (/mnt/scratch/87), the last setfilesize
> > call in the trace (got a separate patch for that) is:
> >
> > <...>-393 [000] 696245.229559: xfs_ilock_nowait: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> > <...>-393 [000] 696245.229560: xfs_setfilesize: dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
> > <...>-393 [000] 696245.229561: xfs_iunlock: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> >
> > For an IO that was from offset 0x600000 for just under 4MB. The end
> > of that IO is at byte 10104832, which is _exactly_ what the inode
> > size says it is.
> >
> > It is very clear from the IO completions that we are getting a
> > *lot* of kswapd driven writeback directly through .writepage:
> >
> > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > 801
> > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > 78
> >
> > So there's ~900 IO completions that change the file size, and 90% of
> > them are single page updates.
> >
> > $ ps -ef |grep [k]swap
> > root 514 2 0 12:43 ? 00:00:00 [kswapd0]
> > $ grep "writepage:" t.t | grep "514 " |wc -l
> > 799
> >
> > Oh, now that is too close to just be a coincidence. We're getting
> > significant amounts of random page writeback from the ends of
> > the LRUs done by the VM.
> >
> > <sigh>
>
>
> On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > Looks good. I still wonder why I haven't been able to hit this.
> > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > filesystems and since yesterday 1k as well.
> >
> > It requires the test to run the VM out of RAM and then force enough
> > memory pressure for kswapd to start writeback from the LRU. The
> > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> >
> > When kswapd starts doing writeback from the LRU, the iops rate goes
> > through the roof (from ~300iops @~320k/io to ~7000iops @4k/io) and
> > throughput drops from 100MB/s to ~30MB/s. BBWC is the only reason
> > the IOPS stays as high as it does - maybe that is why I saw this and
> > you haven't.
> >
> > As it is, the kswapd writeback behaviour is utterly atrocious and,
> > ultimately, quite easy to provoke. I wish the MM folk would fix that
> > goddamn problem already - we've only been complaining about it for
> > the last 6 or 7 years. As such, I'm wondering if it's a bad idea to
> > even consider removing the .writepage clustering...
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-01 14:59 ` Mel Gorman
2011-07-01 15:15 ` Christoph Hellwig
@ 2011-07-02 2:42 ` Dave Chinner
2011-07-05 14:10 ` Mel Gorman
2011-07-11 10:26 ` Christoph Hellwig
1 sibling, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2011-07-02 2:42 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Hellwig, Johannes Weiner, Wu Fengguang, xfs, jack,
linux-mm
On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
>
> Am adding Jan Kara as he has been working on writeback efficiency
> recently as well.
Writeback looks to be working fine - it's kswapd screwing up the
writeback patterns that appears to be the problem....
> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
>
> Against what kernel? 2.6.38 was a disaster for reclaim I've been
> finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.
3.0-rc4
....
> The number of pages written from reclaim is exceptionally low (2.6.38
> was a total disaster but that release was bad for a number of reasons,
> haven't tested 2.6.38.8 yet) but reduced by 2.6.37 as expected. Direct
> reclaim usage was reduced and efficiency (ratio of pages scanned to
> pages reclaimed) was high.
And is that consistent across ext3/ext4/xfs/btrfs filesystems? I
doubt it very much, as all have very different .writepage
behaviours...
BTW, calling a workload "fsmark" tells us nothing about the workload
being tested - fsmark can do a lot of interesting things. IOWs, you
need to quote the command line for it to be meaningful to anyone...
> As I look through the results I have at the moment, the number of
> pages written back was simply really low which is why the problem fell
> off my radar.
It doesn't take many to completely screw up writeback IO patterns.
Write a few random pages to a 10MB file well before writeback would
get to the file, and instead of getting optimal sequential writeback
patterns when writeback gets to it, we get multiple disjoint IOs
that require multiple seeks to complete.
Slower, less efficient writeback IO causes memory pressure to last
longer and hence more likely to result in kswapd writeback, and it's
just a downward spiral from there....
> > > That means the test is only using 1GB of disk space, and
> > > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > > triggering random page writeback from the LRU - 100x10MB files more
> > > than fills memory, hence it being the smallest test case I could
> > > reproduce the problem on.
> > >
>
> My tests were on a machine with 8G and ext3. I'm running some of
> the tests against ext4 and xfs to see if that makes a difference but
> it's possible the tests are simply not aggressive enough so I want to
> reproduce Dave's test if possible.
To tell the truth, I don't think anyone really cares how ext3
performs these days. XFS seems to be the filesystem that brings out
all the bad behaviour in the mm subsystem....
FWIW, the mm subsystem works well enough when there is RAM
available, so I'd suggest that your reclaim testing needs to focus
on smaller memory configurations to really stress the reclaim
algorithms. That's one of the reasons why I regularly test on 1GB, 1p
machines - they show problems that are hard to reproduce on larger
configs....
> I'm assuming "test 180" is from xfstests, which was not one of the tests
> I used previously. To run with 100 files instead of 1000, was the file
> "180" simply edited to make it look like this loop instead?
I reduced it to 100 files simply to speed up the testing process for
the "bad file size" problem I was trying to find. If you want to
reproduce the IO collapse in a big way, run it with 1000 files, and
it happens about 2/3rds of the way through the test on my hardware.
> > > It is very clear from the IO completions that we are getting a
> > > *lot* of kswapd driven writeback directly through .writepage:
> > >
> > > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > > 801
> > > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > > 78
> > >
> > > So there's ~900 IO completions that change the file size, and 90% of
> > > them are single page updates.
> > >
> > > $ ps -ef |grep [k]swap
> > > root 514 2 0 12:43 ? 00:00:00 [kswapd0]
> > > $ grep "writepage:" t.t | grep "514 " |wc -l
> > > 799
> > >
> > > Oh, now that is too close to just be a coincidence. We're getting
> > > significant amounts of random page writeback from the ends of
> > > the LRUs done by the VM.
> > >
> > > <sigh>
>
> Does the value for nr_vmscan_write in /proc/vmstat correlate? It should,
> but let me be sure because I'm using that figure rather than ftrace to
> count writebacks at the moment.
The number in /proc/vmstat is higher. Much higher. I just ran the
test at 1000 files (only collapsed to ~3000 iops this time because I
ran it on a plain 3.0-rc4 kernel that still has the .writepage
clustering in XFS), and I see:
nr_vmscan_write 6723
after the test. The event trace only captured ~1400 writepage events
from kswapd, but it tends to miss a lot of events as the system is
quite unresponsive at times under this workload - it's not uncommon
to have ssh sessions not echo a character for 10s... e.g: I started
the workload ~11:08:22:
$ while [ 1 ]; do date; sleep 1; done
Sat Jul 2 11:08:15 EST 2011
Sat Jul 2 11:08:16 EST 2011
Sat Jul 2 11:08:17 EST 2011
Sat Jul 2 11:08:18 EST 2011
Sat Jul 2 11:08:19 EST 2011
Sat Jul 2 11:08:20 EST 2011
Sat Jul 2 11:08:21 EST 2011
Sat Jul 2 11:08:22 EST 2011 <<<<<<<< start test here
Sat Jul 2 11:08:23 EST 2011
Sat Jul 2 11:08:24 EST 2011
Sat Jul 2 11:08:25 EST 2011
Sat Jul 2 11:08:26 EST 2011 <<<<<<<<
Sat Jul 2 11:08:27 EST 2011 <<<<<<<<
Sat Jul 2 11:08:30 EST 2011 <<<<<<<<
Sat Jul 2 11:08:35 EST 2011 <<<<<<<<
Sat Jul 2 11:08:36 EST 2011
Sat Jul 2 11:08:37 EST 2011
Sat Jul 2 11:08:38 EST 2011 <<<<<<<<
Sat Jul 2 11:08:40 EST 2011 <<<<<<<<
Sat Jul 2 11:08:41 EST 2011
Sat Jul 2 11:08:42 EST 2011
Sat Jul 2 11:08:43 EST 2011
And there are quite a few more multi-second holdoffs during the
test, too.
> A more relevant question is this -
> how many pages were reclaimed by kswapd and what percentage is 799
> pages of that? What do you consider an acceptable percentage?
I don't care what the percentage is or what the number is. kswapd is
reclaiming pages most of the time without affecting IO patterns, and
when that happens I just don't care because it is working just fine.
What I care about is what kswapd is doing when it finds dirty pages
and it decides they need to be written back. It's not a problem that
they are found or need to be written, the problem is the utterly
crap way that memory reclaim is throwing the pages at the filesystem.
I'm not sure how to get through to you guys that single, random page
writeback is *BAD*. Using .writepage directly is considered harmful
to IO throughput, and memory reclaim needs to stop doing that.
We've got hacks in the filesystems to try to make the IO memory
reclaim executes suck less, but ultimately the problem is the IO
memory reclaim is doing. And now the memory reclaim IO patterns are
getting in the way of further improving the writeback path in XFS
because we're finding the hacks we've been carrying for years are
*still* the only thing that is making IO under memory pressure not
suck completely.
What I find extremely frustrating is that this is not a new issue.
We (filesystem people) have been asking for a long time to have the
memory reclaim subsystem either defer IO to the writeback threads or
to use the .writepages interface. We're not asking this to be
difficult, we're asking for this so that we can cluster IO in an
optimal manner to avoid these IO collapses that memory reclaim
currently triggers. We now have generic methods of handing off IO
to flusher threads that also provide some level of throttling/
blocking while IO is submitted (e.g. writeback_inodes_sb_nr()), so
this shouldn't be a difficult problem to solve for the memory
reclaim subsystem.
Hell, maybe memory reclaim should take a leaf from the IO-less
throttle work we are doing - hit a bunch of dirty pages on the LRU,
just back off and let the writeback subsystem clean a few more pages
before starting another scan. Letting the writeback code clean
pages is the fastest way to get pages cleaned in the system, so if
we've already got a generic method for cleaning and/or waiting for
pages to be cleaned, why not aim to use that?
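To make that concrete, a minimal sketch of such a back-off in 3.0-era
terms - writeback_inodes_sb_nr() and congestion_wait() are the real
interfaces, the surrounding function is illustrative and untested:

	/*
	 * Illustrative only: instead of calling ->writepage on a dirty
	 * page found at the end of the LRU, push a batch of writeback for
	 * the page's superblock to the flusher thread, then throttle.
	 */
	static void reclaim_defer_to_flusher(struct page *page)
	{
		struct super_block *sb = page->mapping->host->i_sb;

		/* hand a chunk of work to the per-bdi flusher thread */
		if (down_read_trylock(&sb->s_umount)) {
			writeback_inodes_sb_nr(sb, 1024);
			up_read(&sb->s_umount);
		}

		/* back off; let the elevator build large sequential IOs */
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	}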
And while I'm ranting, when on earth is the issue-writeback-from-
direct-reclaim problem going to be fixed so we can remove the hacks
in the filesystem .writepage implementations to prevent this from
occurring?
I mean, when we combine the two issues, doesn't it imply that the
memory reclaim subsystem needs to be redesigned around the fact it
*can't clean pages directly*? This IO collapse issue shows that we
really don't want kswapd issuing IO directly via .writepage, and
we already reject IO from direct reclaim in .writepage in ext4, XFS
and BTRFS because we'll overrun the stack on anything other than
trivial storage configurations.
That says to me in a big, flashing bright pink neon sign way that
memory reclaim simply should not be issuing IO at all. Perhaps it's
time to rethink the way memory reclaim deals with dirty pages to
take into account the current reality?
</rant>
> > On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > > Looks good. I still wonder why I haven't been able to hit this.
> > > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > > filesystems and since yesterday 1k as well.
> > >
> > > It requires the test to run the VM out of RAM and then force enough
> > > memory pressure for kswapd to start writeback from the LRU. The
> > > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> > >
>
> > You say it's a 1G VM but you don't say what architecture.
x86-64 for both the guest and the host.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-01 15:41 ` Wu Fengguang
@ 2011-07-04 3:25 ` Dave Chinner
2011-07-05 14:34 ` Mel Gorman
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Dave Chinner @ 2011-07-04 3:25 UTC (permalink / raw)
To: Wu Fengguang
Cc: Christoph Hellwig, Mel Gorman, Johannes Weiner, xfs@oss.sgi.com,
linux-mm@kvack.org
On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> Christoph,
>
> On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
> >
> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
> >
> > As part of investigating the behaviour he found out that we're still
> > doing lots of I/O from the end of the LRU in kswapd. Not only is that
> > pretty bad behaviour in general, but it also means we really can't
> > just remove the writeback clustering in writepage given how much
> > I/O is still done through that.
> >
> > > Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> > > better finally?
>
> I once tried this approach:
>
> http://www.spinics.net/lists/linux-mm/msg09202.html
>
> It used a list structure that is not linearly scalable, however that
> part should be independently improvable when necessary.
I don't think that handing random writeback to the flusher thread is
much better than doing random writeback directly. Yes, you added
some clustering, but I still don't think writing specific pages is
the best solution.
> The real problem was, it seemed to not be very effective in my test
> runs. I found many ->nr_pages works queued before the ->inode works,
> which effectively makes the flusher work on more dispersed pages rather
> than focusing on the dirty pages encountered in LRU reclaim.
But that's really just an implementation issue related to how you
tried to solve the problem. That could be addressed.
However, what I'm questioning is whether we should even care what
page memory reclaim wants to write - it seems to make fundamentally
bad decisions from an IO perspective.
We have to remember that memory reclaim is doing LRU reclaim and the
flusher threads are doing "oldest first" writeback. IOWs, both are trying
to operate in the same direction (oldest to youngest) for the same
purpose. The fundamental problem that occurs when memory reclaim
starts writing pages back from the LRU is this:
- memory reclaim has run ahead of IO writeback -
The LRU usually looks like this:
     oldest                                      youngest
     +---------------+---------------+--------------+
          clean          writeback         dirty
                       ^               ^
                       |               |
                       |               Where flusher will next work from
                       |               Where kswapd is working from
                       |
                       IO submitted by flusher, waiting on completion
If memory reclaim is hitting dirty pages on the LRU, it means it has
got ahead of writeback without being throttled - it's passed over
all the pages currently under writeback and is trying to write back
pages that are *newer* than what writeback is working on. IOWs, it
starts trying to do the job of the flusher threads, and it does that
very badly.
The $100 question is *why is it getting ahead of writeback*?
From a brief look at the vmscan code, it appears that scanning does
not throttle/block until reclaim priority has got pretty high. That
means at low priority reclaim, it *skips pages under writeback*.
However, if it comes across a dirty page, it will trigger writeback
of the page.
Now call me crazy, but if we've already got a large number of pages
under writeback, why would we want to *start more IO* when clearly
the system is taking care of cleaning pages already and all we have
to do is wait for a short while to get clean pages ready for
reclaim?
Indeed, I added this quick hack to prevent the VM from doing
writeback via pageout until after it starts blocking on writeback
pages:
@@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
+				goto keep_locked;
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
IOWs, we don't write pages from kswapd unless there is no IO
writeback going on at all (waited on all the writeback pages or none
exist) and there are dirty pages on the LRU.
This doesn't completely stop the IO collapse (it looks like foreground
throttling is the other cause, which IO-less write throttling fixes),
but the collapse was significantly reduced in duration and intensity
by removing kswapd writeback. In fact, the IO rate only dropped to
~60MB/s instead of 30MB/s, and the improvement is easily measured by
the runtime of the test:
                    run 1    run 2    run 3
3.0-rc5-vanilla      135s     137s     138s
3.0-rc5-patched      117s     115s     115s
That's a pretty massive improvement for a 2-line patch. ;) I expect
the IO-less write throttling patchset will further improve this.
FWIW, the nr_vmscan_write values changed like this:
                    run 1    run 2    run 3
3.0-rc5-vanilla      6751     6893     6465
3.0-rc5-patched         0        0        0
These results support my argument that memory reclaim should not be
doing dirty page writeback at all - deferring writeback to the
writeback infrastructure and just waiting for it to complete
appropriately is the Right Thing To Do. i.e. IO-less memory reclaim
works better than the current code for the same reason IO-less write
throttling works better than the current code....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-02 2:42 ` Dave Chinner
@ 2011-07-05 14:10 ` Mel Gorman
2011-07-05 15:55 ` Dave Chinner
2011-07-11 10:26 ` Christoph Hellwig
1 sibling, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2011-07-05 14:10 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Johannes Weiner, Wu Fengguang, xfs, jack,
linux-mm
On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> > On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> >
> > Am adding Jan Kara as he has been working on writeback efficiency
> > recently as well.
>
> Writeback looks to be working fine - it's kswapd screwing up the
> writeback patterns that appears to be the problem....
>
Not a new complaint.
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> >
> > Against what kernel? 2.6.38 was a disaster for reclaim I've been
> > finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.
>
> 3.0-rc4
>
Ok.
> ....
> > The number of pages written from reclaim is exceptionally low (2.6.38
> > was a total disaster but that release was bad for a number of reasons,
> > haven't tested 2.6.38.8 yet) but reduced by 2.6.37 as expected. Direct
> > reclaim usage was reduced and efficiency (ratio of pages scanned to
> > pages reclaimed) was high.
>
> And is that consistent across ext3/ext4/xfs/btrfs filesystems? I
> doubt it very much, as all have very different .writepage
> behaviours...
>
Some preliminary results are in and it looks like it is close to the
same across filesystems, which was a surprise to me. Sometimes the
filesystem makes a difference to how many pages are written back but
it's not consistent across all tests i.e. in comparing ext3, ext4 and
xfs, there are big differences in performance but moderate differences
in pages written back. This implies that for the configurations I was
testing that pages are generally cleaned before reaching the end of the
LRU.
In all cases, the machines had ample memory. More on that later.
> BTW, calling a workload "fsmark" tells us nothing about the workload
> being tested - fsmark can do a lot of interesting things. IOWs, you
> need to quote the command line for it to be meaningful to anyone...
>
My bad.
./fs_mark -d /tmp/fsmark-14880 -D 225 -N 22500 -n 3125 -L 15 -t 16 -S0 -s 131072
> > As I look through the results I have at the moment, the number of
> > pages written back was simply really low which is why the problem fell
> > off my radar.
>
> It doesn't take many to completely screw up writeback IO patterns.
> Write a few random pages to a 10MB file well before writeback would
> get to the file, and instead of getting optimal sequential writeback
> patterns when writeback gets to it, we get multiple disjoint IOs
> that require multiple seeks to complete.
>
> Slower, less efficient writeback IO causes memory pressure to last
> longer and hence more likely to result in kswapd writeback, and it's
> just a downward spiral from there....
>
Yes, I see the negative feedback loop. This has always been a struggle
in that kswapd needs pages from a particular zone to be cleaned and
freed but calling writepage can make things slower. There were
prototypes in the past to give hints to the flusher threads on which
inodes and pages to free, and they were never met with any degree of
satisfaction.
The consensus (among VM people at least) was as long as that number was
low, it wasn't much of a problem. I know you disagree.
> > > > That means the test is only using 1GB of disk space, and
> > > > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > > > triggering random page writeback from the LRU - 100x10MB files more
> > > > than fills memory, hence it being the smallest test case I could
> > > > reproduce the problem on.
> > > >
> >
> > My tests were on a machine with 8G and ext3. I'm running some of
> > the tests against ext4 and xfs to see if that makes a difference but
> > it's possible the tests are simply not aggressive enough so I want to
> > reproduce Dave's test if possible.
>
> To tell the truth, I don't think anyone really cares how ext3
> performs these days.
I do but the reasoning is weak. I wanted to be able to compare kernels
between 2.6.32 and today with few points of variability. ext3 changed
relatively little between those times.
> XFS seems to be the filesystem that brings out
> all the bad behaviour in the mm subsystem....
>
> FWIW, the mm subsystem works well enough when there is RAM
> available, so I'd suggest that your reclaim testing needs to focus
> on smaller memory configurations to really stress the reclaim
> algorithms. That's one of the reasons why I regularly test on 1GB, 1p
> machines - they show problems that are hard to reproduce on larger
> configs....
>
Based on the results coming in, I fully agree. I'm going to let the
tests run to completion so I'll have the data in the future. I'll then
go back and test for 1G, 1P configurations and it should be
reproducible.
> > I'm assuming "test 180" is from xfstests, which was not one of the tests
> > I used previously. To run with 100 files instead of 1000, was the file
> > "180" simply edited to make it look like this loop instead?
>
> I reduced it to 100 files simply to speed up the testing process for
> the "bad file size" problem I was trying to find. If you want to
> reproduce the IO collapse in a big way, run it with 1000 files, and
> it happens about 2/3rds of the way through the test on my hardware.
>
Ok, I have a test prepared that will run this. At the rate tests are
currently going, it could be Thursday before I can run them
though :(
> > > > It is very clear from the IO completions that we are getting a
> > > > *lot* of kswapd driven writeback directly through .writepage:
> > > >
> > > > $ grep "xfs_setfilesize:" t.t |grep "4096$" | wc -l
> > > > 801
> > > > $ grep "xfs_setfilesize:" t.t |grep -v "4096$" | wc -l
> > > > 78
> > > >
> > > > So there's ~900 IO completions that change the file size, and 90% of
> > > > them are single page updates.
> > > >
> > > > $ ps -ef |grep [k]swap
> > > > root 514 2 0 12:43 ? 00:00:00 [kswapd0]
> > > > $ grep "writepage:" t.t | grep "514 " |wc -l
> > > > 799
> > > >
> > > > Oh, now that is too close to just be a coincidence. We're getting
> > > > significant amounts of random page writeback from the ends of
> > > > the LRUs done by the VM.
> > > >
> > > > <sigh>
> >
> > Does the value for nr_vmscan_write in /proc/vmstat correlate? It should,
> > but let me be sure because I'm using that figure rather than ftrace to
> > count writebacks at the moment.
>
> The number in /proc/vmstat is higher. Much higher. I just ran the
> test at 1000 files (only collapsed to ~3000 iops this time because I
> ran it on a plain 3.0-rc4 kernel that still has the .writepage
> clustering in XFS), and I see:
>
> nr_vmscan_write 6723
>
> after the test. The event trace only captured ~1400 writepage events
> from kswapd, but it tends to miss a lot of events as the system is
> quite unresponsive at times under this workload - it's not uncommon
> to have ssh sessions not echo a character for 10s... e.g: I started
> the workload ~11:08:22:
>
Ok, I'll be looking at nr_vmscan_write as the basis for "badness".
> $ while [ 1 ]; do date; sleep 1; done
> Sat Jul 2 11:08:15 EST 2011
> Sat Jul 2 11:08:16 EST 2011
> Sat Jul 2 11:08:17 EST 2011
> Sat Jul 2 11:08:18 EST 2011
> Sat Jul 2 11:08:19 EST 2011
> Sat Jul 2 11:08:20 EST 2011
> Sat Jul 2 11:08:21 EST 2011
> Sat Jul 2 11:08:22 EST 2011 <<<<<<<< start test here
> Sat Jul 2 11:08:23 EST 2011
> Sat Jul 2 11:08:24 EST 2011
> Sat Jul 2 11:08:25 EST 2011
> Sat Jul 2 11:08:26 EST 2011 <<<<<<<<
> Sat Jul 2 11:08:27 EST 2011 <<<<<<<<
> Sat Jul 2 11:08:30 EST 2011 <<<<<<<<
> Sat Jul 2 11:08:35 EST 2011 <<<<<<<<
> Sat Jul 2 11:08:36 EST 2011
> Sat Jul 2 11:08:37 EST 2011
> Sat Jul 2 11:08:38 EST 2011 <<<<<<<<
> Sat Jul 2 11:08:40 EST 2011 <<<<<<<<
> Sat Jul 2 11:08:41 EST 2011
> Sat Jul 2 11:08:42 EST 2011
> Sat Jul 2 11:08:43 EST 2011
>
> And there are quite a few more multi-second holdoffs during the
> test, too.
>
> > A more relevant question is this -
> > how many pages were reclaimed by kswapd and what percentage is 799
> > pages of that? What do you consider an acceptable percentage?
>
> I don't care what the percentage is or what the number is. kswapd is
> reclaiming pages most of the time without affecting IO patterns, and
> when that happens I just don't care because it is working just fine.
>
I do care. I'm looking at some early XFS results here based on a laptop
(4G). For fsmark with the command line above, the number of pages
written back by kswapd was 0. The worst test by far was sysbench using a
particularly large database. The number of writes was 48745 which is
0.27% of pages scanned or 0.28% of pages reclaimed. Ordinarily I would
ignore that.
If I run this at 1G and get a similar ratio, I will assume that I
am not reproducing your problem at all unless I know what ratio you
are seeing.
So .... How many pages were reclaimed by kswapd and what percentage
is 799 pages of that?
You answered my second question. You consider 0% to be the acceptable
percentage.
> What I care about is what kswapd is doing when it finds dirty pages
> and it decides they need to be written back. It's not a problem that
> they are found or need to be written, the problem is the utterly
> crap way that memory reclaim is throwing the pages at the filesystem.
>
> I'm not sure how to get through to you guys that single, random page
> writeback is *BAD*.
It got through. The feedback during discussions on the VM side was
that as long as the percentage was sufficiently low it wasn't a problem
because on occasion, the VM really needs pages from a particular zone.
A solution that addressed both problems has never been agreed on and
energy and time run out before it gets fixed each time.
> Using .writepage directly is considered harmful
> to IO throughput, and memory reclaim needs to stop doing that.
> We've got hacks in the filesystems to try to make the IO memory
> reclaim executes suck less, but ultimately the problem is the IO
> memory reclaim is doing. And now the memory reclaim IO patterns are
> getting in the way of further improving the writeback path in XFS
> because we're finding the hacks we've been carrying for years are
> *still* the only thing that is making IO under memory pressure not
> suck completely.
>
> What I find extremely frustrating is that this is not a new issue.
I know.
> We (filesystem people) have been asking for a long time to have the
> memory reclaim subsystem either defer IO to the writeback threads or
> to use the .writepages interface.
There was a prototype along these lines. One of the criticisms was
that it was fixing the wrong problem because dirty pages shouldn't be
at the end of the LRU at all. Later work focused on fixing that and
it was never revisited (at least not by me).
There was a bucket of complaints about the initial series at
https://lkml.org/lkml/2010/6/8/82 . Despite the fact I wrote it,
I will have to read back to see why I stopped working on it but I
think it's because I focused on avoiding dirty pages reaching the
end of the LRU judging by https://lkml.org/lkml/2010/6/11/157 and
eventually was satisfied that the ratio of pages scanned to pages
written was acceptable.
> We're not asking this to be
> difficult, we're asking for this so that we can cluster IO in an
> optimal manner to avoid these IO collapses that memory reclaim
> currently triggers. We now have generic methods of handing off IO
> to flusher threads that also provide some level of throttling/
> blocking while IO is submitted (e.g. writeback_inodes_sb_nr()), so
> this shouldn't be a difficult problem to solve for the memory
> reclaim subsystem.
>
> Hell, maybe memory reclaim should take a leaf from the IO-less
> throttle work we are doing - hit a bunch of dirty pages on the LRU,
> just back off and let the writeback subsystem clean a few more pages
> before starting another scan.
Prototyped this before although I can't find it now. I think I
concluded at the time that it didn't really help and another direction
was taken. There was also the problem that the time to clean a page
from a particular zone was potentially unbounded and a solution didn't
present itself.
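A bounded version of that back-off might look like the sketch below.
It is entirely illustrative, and the retry cap is precisely the
papering-over of the unbounded zone problem just mentioned:

	/*
	 * Illustrative only: wait a bounded time for writeback to clean
	 * pages rather than issuing IO from reclaim. The cap exists
	 * because nothing guarantees the pages being cleaned belong to
	 * the zone this reclaim pass actually needs.
	 */
	static void reclaim_wait_for_writeback(struct zone *zone)
	{
		int retries = 10;

		while (retries-- && zone_page_state(zone, NR_WRITEBACK))
			congestion_wait(BLK_RW_ASYNC, HZ/10);
	}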
> Letting the writeback code clean
> pages is the fastest way to get pages cleaned in the system, so if
> we've already got a generic method for cleaning and/or waiting for
> pages to be cleaned, why not aim to use that?
>
> And while I'm ranting, when on earth is the issue-writeback-from-
> direct-reclaim problem going to be fixed so we can remove the hacks
> in the filesystem .writepage implementations to prevent this from
> occurring?
>
Prototyped that too, same thread. Same type of problem, writeback
from direct reclaim should happen so rarely that it should not be
optimised for. See https://lkml.org/lkml/2010/6/11/32
> I mean, when we combine the two issues, doesn't it imply that the
> memory reclaim subsystem needs to be redesigned around the fact it
> *can't clean pages directly*? This IO collapse issue shows that we
> really don't want kswapd issuing IO directly via .writepage, and
> we already reject IO from direct reclaim in .writepage in ext4, XFS
> and BTRFS because we'll overrun the stack on anything other than
> trivial storage configurations.
>
> That says to me in a big, flashing bright pink neon sign way that
> memory reclaim simply should not be issuing IO at all. Perhaps it's
> time to rethink the way memory reclaim deals with dirty pages to
> take into account the current reality?
>
> </rant>
>
At the risk of pissing you off, this isn't new information so I'll
consider myself duly nudged into revisiting.
> > > On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > > > Looks good. I still wonder why I haven't been able to hit this.
> > > > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > > > filesystems and since yesterday 1k as well.
> > > >
> > > > It requires the test to run the VM out of RAM and then force enough
> > > > memory pressure for kswapd to start writeback from the LRU. The
> > > > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > > > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> > > >
> >
> > You say it's a 1G VM but you don't say what architecture.
>
> x86-64 for both the guest and the host.
>
Grand.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-04 3:25 ` Dave Chinner
@ 2011-07-05 14:34 ` Mel Gorman
2011-07-06 1:23 ` Dave Chinner
2011-07-11 11:10 ` Christoph Hellwig
2011-07-06 4:53 ` Wu Fengguang
2011-07-06 15:12 ` Johannes Weiner
2 siblings, 2 replies; 20+ messages in thread
From: Mel Gorman @ 2011-07-05 14:34 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Johannes Weiner, xfs@oss.sgi.com,
linux-mm@kvack.org
On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> >
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > >
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > >
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd. Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > >
> > > Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> > > better finally?
> >
> > I once tried this approach:
> >
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> >
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
>
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly. Yes, you added
> some clustering, but I still don't think writing specific pages is
> the best solution.
>
> > The real problem was, it seemed to not be very effective in my test
> > runs. I found many ->nr_pages works queued before the ->inode works,
> > which effectively makes the flusher work on more dispersed pages rather
> > than focusing on the dirty pages encountered in LRU reclaim.
>
> But that's really just an implementation issue related to how you
> tried to solve the problem. That could be addressed.
>
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO perspective.
>
It sucks from an IO perspective but from the perspective of the VM that
needs memory to be free in a particular zone or node, it's a reasonable
request.
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback. IOWs, both are trying
> to operate in the same direction (oldest to youngest) for the same
> purpose. The fundamental problem that occurs when memory reclaim
> starts writing pages back from the LRU is this:
>
> - memory reclaim has run ahead of IO writeback -
>
This reasoning was the basis for this patch
http://www.gossamer-threads.com/lists/linux/kernel/1251235?do=post_view_threaded#1251235
i.e. if old pages are dirty then the flusher threads are either not
awake or not doing enough work so wake them. It was flawed in a number
of respects and never finished though.
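The shape of that heuristic was roughly the following - paraphrased
from memory for this discussion, not lifted from the actual patch,
and the threshold is purely illustrative:

	/*
	 * Sketch: if reclaim keeps meeting dirty pages at the tail of the
	 * LRU, the flushers are either asleep or not keeping up, so kick
	 * them for more work. wakeup_flusher_threads() is the real API.
	 */
	static void note_dirty_at_lru_tail(unsigned long nr_dirty,
					   unsigned long nr_scanned)
	{
		if (nr_dirty * 10 > nr_scanned)
			wakeup_flusher_threads(nr_dirty);
	}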
> The LRU usually looks like this:
>
> oldest youngest
> +---------------+---------------+--------------+
> clean writeback dirty
> ^ ^
> | |
> | Where flusher will next work from
> | Where kswapd is working from
> |
> IO submitted by flusher, waiting on completion
>
>
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on. IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
>
> The $100 question is *why is it getting ahead of writeback*?
>
Allocating and dirtying memory faster than writeback can clean it. A
large dd to a USB stick would also trigger it.
> From a brief look at the vmscan code, it appears that scanning does
> not throttle/block until reclaim priority has got pretty high. That
> means at low priority reclaim, it *skips pages under writeback*.
> However, if it comes across a dirty page, it will trigger writeback
> of the page.
>
> Now call me crazy, but if we've already got a large number of pages
> under writeback, why would we want to *start more IO* when clearly
> the system is taking care of cleaning pages already and all we have
> to do is wait for a short while to get clean pages ready for
> reclaim?
>
It doesn't check how many pages are under writeback. Direct reclaim
will check if the block device is congested, but that is about
it. Otherwise the expectation was that the elevator would handle the
merging of requests into a sensible pattern. Also, while filesystem
pages are getting cleaned by the flusher threads, that does not cover
anonymous pages being written to swap.
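For reference, the check in question looks roughly like this in
3.0-era mm/vmscan.c (quoted from memory, so treat it as a sketch):
	static int may_write_to_queue(struct backing_dev_info *bdi,
				      struct scan_control *sc)
	{
		/* kswapd writing to swap is always allowed */
		if (current->flags & PF_SWAPWRITE)
			return 1;
		if (!bdi_write_congested(bdi))
			return 1;
		/* throttle on our own bdi rather than skipping it */
		if (bdi == current->backing_dev_info)
			return 1;
		return 0;
	}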
> Indeed, I added this quick hack to prevent the VM from doing
> writeback via pageout until after it starts blocking on writeback
> pages:
>
> @@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l
> if (PageDirty(page)) {
> nr_dirty++;
>
> + if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
> + goto keep_locked;
> if (references == PAGEREF_RECLAIM_CLEAN)
> goto keep_locked;
> if (!may_enter_fs)
>
> IOWs, we don't write pages from kswapd unless there is no IO
> writeback going on at all (waited on all the writeback pages or none
> exist) and there are dirty pages on the LRU.
>
A side effect of this patch is that kswapd is no longer writing
anonymous pages to swap and possibly never will. RECLAIM_MODE_SYNC is
only set for lumpy reclaim which, if you have CONFIG_COMPACTION set,
will never happen.
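For context, the mode selection is roughly this (condensed from
3.0-era mm/vmscan.c; details may differ):
	static void set_reclaim_mode(int priority, struct scan_control *sc,
				     bool sync)
	{
		reclaim_mode_t syncmode = sync ? RECLAIM_MODE_SYNC :
						 RECLAIM_MODE_ASYNC;
		/* lumpy reclaim is only used when compaction is not built in */
		if (COMPACTION_BUILD)
			sc->reclaim_mode = RECLAIM_MODE_COMPACTION;
		else
			sc->reclaim_mode = RECLAIM_MODE_LUMPYRECLAIM;
		/* only costly allocations or high pressure go sync */
		if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
			sc->reclaim_mode |= syncmode;
		else if (sc->order && priority < DEF_PRIORITY - 2)
			sc->reclaim_mode |= syncmode;
		else
			sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
	}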
I see your figures and know why you want this, but it never was that
straightforward :/
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-05 14:10 ` Mel Gorman
@ 2011-07-05 15:55 ` Dave Chinner
0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2011-07-05 15:55 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Hellwig, Johannes Weiner, Wu Fengguang, xfs, jack,
linux-mm
On Tue, Jul 05, 2011 at 03:10:16PM +0100, Mel Gorman wrote:
> On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> > BTW, calling a workload "fsmark" tells us nothing about the workload
> > being tested - fsmark can do a lot of interesting things. IOWs, you
> > need to quote the command line for it to be meaningful to anyone...
> >
>
> My bad.
>
> ./fs_mark -d /tmp/fsmark-14880 -D 225 -N 22500 -n 3125 -L 15 -t 16 -S0 -s 131072
Ok, so 16 threads, 3125 files per thread, 128k per file, all created
into the same directory, which rolls over when it gets to 22500
files in the directory. Yeah, it generates a bit of memory pressure,
but I think the file sizes are too small to really stress writeback
much. You need to use files that are at least 10MB in size to really
start to mix up the writeback lists and the way they juggle new and
old inodes to try not to starve any particular inode of writeback
bandwidth....
Also, I don't use the "-t <num>" threading mechanism because all it
does is bash on the directory mutex without really improving
parallelism for creates. perf top on my system shows:
samples pcnt function DSO
_______ _____ __________________________________ __________________________________
2799.00 9.3% mutex_spin_on_owner [kernel.kallsyms]
2049.00 6.8% copy_user_generic_string [kernel.kallsyms]
1912.00 6.3% _raw_spin_unlock_irqrestore [kernel.kallsyms]
A contended mutex is the prime CPU consumer - that's more CPU than
copying 750MB/s of data.
Hence I normally drive parallelism with fsmark by using multiple "-d
<dir>" options, which runs a thread per directory and a workload
unit per directory and so you don't get directory mutex contention
causing serialisation and interference with what you are really
trying to measure....
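For example, something like this (paths illustrative; scale the -d
list to taste) keeps the file count and size but gives each thread
its own directory:
	$ ./fs_mark -S0 -s 131072 -n 3125 -L 15 \
		-d /mnt/scratch/0 -d /mnt/scratch/1 \
		-d /mnt/scratch/2 -d /mnt/scratch/3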
> > > As I look through the results I have at the moment, the number of
> > > pages written back was simply really low which is why the problem fell
> > > off my radar.
> >
> > It doesn't take many to completely screw up writeback IO patterns.
> > Write a few random pages to a 10MB file well before writeback would
> > get to the file, and instead of getting optimal sequential writeback
> > patterns when writeback gets to it, we get multiple disjoint IOs
> > that require multiple seeks to complete.
> >
> > Slower, less efficient writeback IO causes memory pressure to last
> > longer and hence more likely to result in kswapd writeback, and it's
> > just a downward spiral from there....
> >
>
> Yes, I see the negative feedback loop. This has always been a struggle
> in that kswapd needs pages from a particular zone to be cleaned and
> freed but calling writepage can make things slower. There were
> prototypes in the past to give hints to the flusher threads on what
> inode and pages to be freed and they were never met with any degree of
> satisfaction.
>
> The consensus (among VM people at least) was that as long as that number was
> low, it wasn't much of a problem.
Therein lies the problem. You've got storage people telling you
there is an IO problem with memory reclaim, but the mm community
then put their heads together somewhere private, decide it isn't
a problem worth fixing and do nothing. Rinse, lather, repeat.
I expect memory reclaim to play nicely with writeback that is
already in progress. These subsystems do not work in isolation, yet
memory reclaim treats it that way - as though it is the most
important IO submitter and everything else can suffer while memory
reclaim does its stuff. Memory reclaim needs to co-ordinate with
writeback effectively for the system as a whole to work well
together.
> I know you disagree.
Right, that's because it doesn't have to be a very high number to be
a problem. IO is orders of magnitude slower than the CPU time it
takes to flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost always a
bad flush decision.
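To put rough numbers on that (illustrative figures for a single SATA
disk):
	single-page pageout: ~150 random IOPS x 4KB = ~0.6 MB/s
	flusher writeback:   sequential             = ~100 MB/s
That's over two orders of magnitude, so even a small absolute number
of LRU pageout()s can consume a wildly disproportionate share of the
disk time.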
> > > > > Oh, now that is too close to just be a co-incidence. We're getting
> > > > > significant amounts of random page writeback from the the ends of
> > > > > the LRUs done by the VM.
> > > > >
> > > > > <sigh>
> > >
> > > Does the value for nr_vmscan_write in /proc/vmstat correlate? It must
> > > but lets me sure because I'm using that figure rather than ftrace to
> > > count writebacks at the moment.
> >
> > The number in /proc/vmstat is higher. Much higher. I just ran the
> > test at 1000 files (only collapsed to ~3000 iops this time because I
> > ran it on a plain 3.0-rc4 kernel that still has the .writepage
> > clustering in XFS), and I see:
> >
> > nr_vmscan_write 6723
> >
> > after the test. The event trace only captured ~1400 writepage events
> > from kswapd, but it tends to miss a lot of events as the system is
> > quite unresponsive at times under this workload - it's not uncommon
> > to have ssh sessions not echo a character for 10s... e.g: I started
> > the workload ~11:08:22:
> >
>
> Ok, I'll be looking at nr_vmscan_write as the basis for "badness".
Perhaps you should look at my other reply (and two line "fix") in
the thread about stopping dirty page writeback until after waiting
on pages under writeback.....
> > > A more relevant question is this -
> > > how many pages were reclaimed by kswapd and what percentage is 799
> > > pages of that? What do you consider an acceptable percentage?
> >
> > I don't care what the percentage is or what the number is. kswapd is
> > reclaiming pages most of the time without affecting IO patterns, and
> > when that happens I just don't care because it is working just fine.
> >
>
> I do care. I'm looking at some early XFS results here based on a laptop
> (4G). For fsmark with the command line above, the number of pages
> written back by kswapd was 0. The worst test by far was sysbench using a
> particularly large database. The number of writes was 48745 which is
> 0.27% of pages scanned or 0.28% of pages reclaimed. Ordinarily I would
> ignore that.
>
> If I run this at 1G and get a similar ratio, I will assume that I
> am not reproducing your problem at all unless I know what ratio you
> are seeing.
Single threaded writing of files should -never- cause writeback from
the LRUs. If that is happening, then the memory reclaim throttling
is broken. See my other email.
> So .... How many pages were reclaimed by kswapd and what percentage
> is 799 pages of that?
No idea. That information is long gone....
> You answered my second question. You consider 0% to be the acceptable
> percentage.
No, I expect memory reclaim to behave nicely with writeback that is
already in progress. These subsystems do not work in isolation - they
need to co-ordinate.
> > What I care about is what kswapd is doing when it finds dirty pages
> > and it decides they need to be written back. It's not a problem that
> > they are found or need to be written, the problem is the utterly
> > crap way that memory reclaim is throwing the pages at the filesystem.
> >
> > I'm not sure how to get through to you guys that single, random page
> > writeback is *BAD*.
>
> It got through. The feedback during discussions on the VM side was
> that as long as the percentage was sufficiently low it wasn't a problem
> because on occasion, the VM really needs pages from a particular zone.
> A solution that addressed both problems has never been agreed on and
> energy and time runs out before it gets fixed each time.
<sigh>
> > And while I'm ranting, when on earth is the issue-writeback-from-
> > direct-reclaim problem going to be fixed so we can remove the hacks
> > in the filesystem .writepage implementations to prevent this from
> > occurring?
> >
>
> Prototyped that too, same thread. Same type of problem, writeback
> from direct reclaim should happen so rarely that it should not be
> optimised for. See https://lkml.org/lkml/2010/6/11/32
Writeback from direct reclaim crashes systems by causing stack
overruns - that's why we've disabled it. It's not an "optimisation"
problem - it's a _memory corruption_ bug that needs to be fixed.....
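The guard in question is only a couple of lines in each filesystem's
->writepage; a minimal sketch of it (not the exact XFS code) is:
	/*
	 * Sketch: refuse writepage from direct reclaim contexts to
	 * avoid overflowing the already-deep allocation stack.
	 */
	if ((current->flags & PF_MEMALLOC) && !current_is_kswapd()) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}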
> At the risk of pissing you off, this isn't new information so I'll
> consider myself duly nudged into revisiting.
No, I've had a rant to express my displeasure at the lack of
progress on this front.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-05 14:34 ` Mel Gorman
@ 2011-07-06 1:23 ` Dave Chinner
2011-07-11 11:10 ` Christoph Hellwig
1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2011-07-06 1:23 UTC (permalink / raw)
To: Mel Gorman
Cc: Wu Fengguang, Christoph Hellwig, Johannes Weiner, xfs@oss.sgi.com,
linux-mm@kvack.org
On Tue, Jul 05, 2011 at 03:34:10PM +0100, Mel Gorman wrote:
> On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > > Christoph,
> > >
> > > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > > Johannes, Mel, Wu,
> > > >
> > > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > > internal writeback clustering in favour of using write_cache_pages.
> > > >
> > > > As part of investigating the behaviour he found out that we're still
> > > > doing lots of I/O from the end of the LRU in kswapd. Not only is that
> > > > pretty bad behaviour in general, but it also means we really can't
> > > > just remove the writeback clustering in writepage given how much
> > > > I/O is still done through that.
> > > >
> > > > Any chance we could get the writeback vs kswap behaviour sorted out a bit
> > > > better finally?
> > >
> > > I once tried this approach:
> > >
> > > http://www.spinics.net/lists/linux-mm/msg09202.html
> > >
> > > It used a list structure that is not linearly scalable, however that
> > > part should be independently improvable when necessary.
> >
> > I don't think that handing random writeback to the flusher thread is
> > much better than doing random writeback directly. Yes, you added
> > some clustering, but I still don't think writing specific pages is
> > the best solution.
> >
> > > The real problem was, it seemed to not be very effective in my test runs.
> > > I found many ->nr_pages works queued before the ->inode works, which
> > > effectively makes the flusher work on more dispersed pages rather
> > > than focusing on the dirty pages encountered in LRU reclaim.
> >
> > But that's really just an implementation issue related to how you
> > tried to solve the problem. That could be addressed.
> >
> > However, what I'm questioning is whether we should even care what
> > page memory reclaim wants to write - it seems to make fundamentally
> > bad decisions from an IO persepctive.
> >
>
> It sucks from an IO perspective but from the perspective of the VM that
> needs memory to be free in a particular zone or node, it's a reasonable
> request.
Sure, I'm not suggesting there is anything wrong with the requirement
of being able to clean pages in a particular zone. My comments are
aimed at the fact the implementation of this feature is about as
friendly to the IO subsystem as a game of Roshambeau....
If someone comes to us complaining about an application that causes
this sort of IO behaviour, our answer is always "fix the
application" because it is not something we can fix in the
filesystem. Same here - we need to have the "application" fixed to
play well with others.
> > We have to remember that memory reclaim is doing LRU reclaim and the
> > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > to operate in the same direction (oldest to youngest) for the same
> > purpose. The fundamental problem that occurs when memory reclaim
> > starts writing pages back from the LRU is this:
> >
> > - memory reclaim has run ahead of IO writeback -
> >
>
> This reasoning was the basis for this patch
> http://www.gossamer-threads.com/lists/linux/kernel/1251235?do=post_view_threaded#1251235
>
> i.e. if old pages are dirty then the flusher threads are either not
> awake or not doing enough work so wake them. It was flawed in a number
> of respects and never finished though.
But that's dealing with a different situation - you're assuming that the
writeback threads are not running or are running inefficiently.
What I'm seeing is bad behaviour when the IO subsystem is already
running flat out with perfectly formed IO. No additional IO
submission is going to make it clean pages faster than it already
is. It is in this situation that memory reclaim should never, ever
be trying to write dirty pages.
IIRC, the situation was that there were about 15,000 dirty pages and
~20,000 pages under writeback when memory reclaim started pushing
pages from the LRU. This is on a single node machine, with all IO
being single threaded (so a single source of memory pressure) and
writeback doing its job. Memory reclaim should *never* get ahead
of writeback under such a simple workload on such a simple
configuration....
> > The LRU usually looks like this:
> >
> > oldest youngest
> > +---------------+---------------+--------------+
> > clean writeback dirty
> > ^ ^
> > | |
> > | Where flusher will next work from
> > | Where kswapd is working from
> > |
> > IO submitted by flusher, waiting on completion
> >
> >
> > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > got ahead of writeback without being throttled - it's passed over
> > all the pages currently under writeback and is trying to write back
> > pages that are *newer* than what writeback is working on. IOWs, it
> > starts trying to do the job of the flusher threads, and it does that
> > very badly.
> >
> > The $100 question is *why is it getting ahead of writeback*?
> >
>
> Allocating and dirtying memory faster than writeback. Large dd to USB
> stick would also trigger it.
Write throttling is supposed to prevent that situation from being
problematic. Its entire purpose is to throttle the dirtying rate to
match the writeback rate. If that's a problem, the memory reclaim
subsystem is the wrong place to be trying to fix it.
And as such, that is not the case here; foreground throttling is
definitely occurring and works fine for 70-80s, then memory reclaim
gets ahead of writeback and it all goes to shit.
> > From a brief look at the vmscan code, it appears that scanning does
> > not throttle/block until reclaim priority has got pretty high. That
> > means at low priority reclaim, it *skips pages under writeback*.
> > However, if it comes across a dirty page, it will trigger writeback
> > of the page.
> >
> > Now call me crazy, but if we've already got a large number of pages
> > under writeback, why would we want to *start more IO* when clearly
> > the system is taking care of cleaning pages already and all we have
> > to do is wait for a short while to get clean pages ready for
> > reclaim?
> >
>
> It doesn't check how many pages are under writeback.
Isn't that an indication of a design flaw? You want to clean
pages, but you don't even bother to check on how many pages are
currently being cleaned and will soon be reclaimable?
> Direct reclaim
> will check if the block device is congested but that is about
> it.
FWIW, we've removed all the congestion logic from the writeback
subsystem because IO throttling never really worked well that way.
Writeback IO throttling now works by foreground blocking during IO
submission on request queue slots in the elevator. That's why we
have flusher threads per-bdi - so writeback can block on a congested
bdi and not block writeback to other bdis. It's simpler, more
extensible and far more scalable than the old method.
Anyway, it's a moot point because direct reclaim can't issue IO
through xfs, ext4 or btrfs and as such I have doubts that the
throttling logic in vmscan is completely robust.
> Otherwise the expectation was that the elevator would handle the
> merging of requests into a sensible pattern. Also, while filesystem
> pages are getting cleaned by the flusher threads, that does not cover
> anonymous pages being written to swap.
Anonymous pages written to swap are not the issue here - I couldn't
care less what you do with them. It's writeback of dirty file pages
that I care about...
>
> > Indeed, I added this quick hack to prevent the VM from doing
> > writeback via pageout until after it starts blocking on writeback
> > pages:
> >
> > @@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l
> > if (PageDirty(page)) {
> > nr_dirty++;
> >
> > + if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
> > + goto keep_locked;
> > if (references == PAGEREF_RECLAIM_CLEAN)
> > goto keep_locked;
> > if (!may_enter_fs)
> >
> > IOWs, we don't write pages from kswapd unless there is no IO
> > writeback going on at all (waited on all the writeback pages or none
> > exist) and there are dirty pages on the LRU.
> >
>
> A side effect of this patch is that kswapd is no longer writing
> anonymous pages to swap and possibly never will.
For dirty anon pages to still get written, all that needs to be
done is pass the file parameter to shrink_page_list() and change the
test to:
+ if (file && !(sc->reclaim_mode & RECLAIM_MODE_SYNC))
+ goto keep_locked;
As it is, I haven't had any of my test systems (which run tests that
deliberately cause OOM conditions) fail with this patch. While I
agree it is just a hack, its naivety has also demonstrated that a
working system does not need to write back dirty file pages from
memory reclaim -at all-. i.e. it makes my argument stronger, not
weaker....
> RECLAIM_MODE_SYNC is
> only set for lumpy reclaim which, if you have CONFIG_COMPACTION set, will
> never happen.
Which means that memory reclaim does not throttle reliably on
writeback in progress. Even when the priority has ratcheted right up
and it is obvious that the zone in question has pages being cleaned
and will soon be available for reclaim, memory reclaim won't wait
for them directly.
Once again this points to the throttling mechanism being sub-optimal
- it relies on second order effects (congestion_wait) to try to
block long enough for pages to be cleaned in the zone being
reclaimed from before doing another scan to find those pages. It's a
"wait and hope" approach to throttling, and that's one of the
reasons it never worked well in the writeback subsystem.
Instead, if memory reclaim waits directly on a page under writeback
on the given LRU, it guarantees that when you are woken there was at
least some progress made by the IO subsystem that allows the memory
reclaim subsystem to move forward.
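A minimal sketch of that direct throttling in shrink_page_list()
(assuming the context may enter the filesystem; not the current
code) would be:
	if (PageWriteback(page)) {
		if (may_enter_fs) {
			/* block until this page's IO completes: guaranteed progress */
			wait_on_page_writeback(page);
		} else {
			goto keep_locked;
		}
	}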
What it comes down to is the fact that you can scan tens of
thousands of pages in the time it takes for IO on a single page to
complete. If there are pages already under IO, then why start more
IO when what ends up getting reclaimed is one of the pages that is
already under IO when the new IO was issued?
BTW:
# CONFIG_COMPACTION is not set
> I see your figures and know why you want this but it never was that
> straight-forward :/
If the code is complex enough that implementing a basic policy such
as "don't writeback pages if there are already pages under
writeback" is difficult, then maybe the code needs to be
simplified....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-04 3:25 ` Dave Chinner
2011-07-05 14:34 ` Mel Gorman
@ 2011-07-06 4:53 ` Wu Fengguang
2011-07-06 6:47 ` Minchan Kim
2011-07-06 7:17 ` Dave Chinner
2011-07-06 15:12 ` Johannes Weiner
2 siblings, 2 replies; 20+ messages in thread
From: Wu Fengguang @ 2011-07-06 4:53 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Mel Gorman, Johannes Weiner, xfs@oss.sgi.com,
linux-mm@kvack.org
On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> >
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > >
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > >
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd. Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > >
> > > Any chance we could get the writeback vs kswap behaviour sorted out a bit
> > > better finally?
> >
> > I once tried this approach:
> >
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> >
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
>
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly. Yes, you added
> some clustering, but I still don't think writing specific pages is
> the best solution.
I agree that the VM should avoid writing specific pages as much as
possible. More often than not, it's indeed OK to just skip a sporadically
encountered dirty page and reclaim the clean pages presumably not
far away in the LRU list. So your 2-liner patch is all good if
constraining it to low scan pressure, which will look like
if (priority == DEF_PRIORITY)
tag PG_reclaim on encountered dirty pages and
skip writing it
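Rendered as C against a 3.0-era shrink_page_list() (a sketch of the
heuristic only, not the patch below):
	/* at low scan pressure, leave dirty pages to the flusher */
	if (PageDirty(page) && priority == DEF_PRIORITY) {
		/*
		 * PG_reclaim tells writeback completion to rotate the
		 * page to the tail of the LRU for quick reclaim.
		 */
		SetPageReclaim(page);
		goto keep_locked;
	}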
However the VM in general does need the ability to write specific
pages, such as when reclaiming from specific zone/memcg. So I'll still
propose to do bdi_start_inode_writeback().
Below is the patch rebased to linux-next. It's good enough for testing
purposes, and I guess even with the ->nr_pages work issue, it's
complete enough to get roughly the same performance as your 2-liner
patch.
> > The real problem was, it seemed to not be very effective in my test runs.
> > I found many ->nr_pages works queued before the ->inode works, which
> > effectively makes the flusher work on more dispersed pages rather
> > than focusing on the dirty pages encountered in LRU reclaim.
>
> But that's really just an implementation issue related to how you
> tried to solve the problem. That could be addressed.
>
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO persepctive.
>
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback. IOWs, both are trying
> to operate in the same direction (oldest to youngest) for the same
> purpose. The fundamental problem that occurs when memory reclaim
> starts writing pages back from the LRU is this:
>
> - memory reclaim has run ahead of IO writeback -
>
> The LRU usually looks like this:
>
> oldest youngest
> +---------------+---------------+--------------+
> clean writeback dirty
> ^ ^
> | |
> | Where flusher will next work from
> | Where kswapd is working from
> |
> IO submitted by flusher, waiting on completion
>
>
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on. IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
>
> The $100 question is *why is it getting ahead of writeback*?
The most important case is: faster reader + relatively slow writer.
Assume for every 10 pages read, 1 page is dirtied, and the dirty speed
is fast enough to trigger the 20% dirty ratio and hence dirty balancing.
That pattern is able to evenly distribute dirty pages all over the LRU
list and hence trigger lots of pageout()s. The "skip reclaim writes on
low pressure" approach can fix this case.
Thanks,
Fengguang
---
Subject: writeback: introduce bdi_start_inode_writeback()
Date: Thu Jul 29 14:41:19 CST 2010
This relays ASYNC file writeback IOs to the flusher threads.
Only ASYNC pageout() is relayed; the less frequent SYNC pageout()s
will work as before, as a last resort that provides the throttling
needed to prevent OOM, which may happen if the LRU list is small
and/or the storage is slow, so that the flusher cannot clean enough
pages before the LRU is fully scanned.
The flusher will piggy back more dirty pages for IO
- it's more IO efficient
- it helps clean more pages, a good number of them may sit in the same
LRU list that is being scanned.
To avoid memory allocations at page reclaim, a mempool is created.
Background/periodic works will quit automatically (as done in another
patch), so as to clean the pages under reclaim ASAP. However for now the
sync work can still block us for a long time.
Jan Kara: limit the search scope.
CC: Jan Kara <jack@suse.cz>
CC: Rik van Riel <riel@redhat.com>
CC: Mel Gorman <mel@linux.vnet.ibm.com>
CC: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 156 ++++++++++++++++++++++++++++-
include/linux/backing-dev.h | 1
include/trace/events/writeback.h | 15 ++
mm/vmscan.c | 8 +
4 files changed, 174 insertions(+), 6 deletions(-)
--- linux-next.orig/mm/vmscan.c 2011-06-29 20:43:10.000000000 -0700
+++ linux-next/mm/vmscan.c 2011-07-05 18:30:19.000000000 -0700
@@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st
if (PageDirty(page)) {
nr_dirty++;
+ if (page_is_file_cache(page) && mapping &&
+ sc->reclaim_mode != RECLAIM_MODE_SYNC) {
+ if (flush_inode_page(page, mapping) >= 0) {
+ SetPageReclaim(page);
+ goto keep_locked;
+ }
+ }
+
if (references == PAGEREF_RECLAIM_CLEAN)
goto keep_locked;
if (!may_enter_fs)
--- linux-next.orig/fs/fs-writeback.c 2011-07-05 18:30:16.000000000 -0700
+++ linux-next/fs/fs-writeback.c 2011-07-05 18:30:52.000000000 -0700
@@ -30,12 +30,21 @@
#include "internal.h"
/*
+ * When flushing an inode page (for page reclaim), try to piggy back up to
+ * 4MB nearby pages for IO efficiency. These pages will have good opportunity
+ * to be in the same LRU list.
+ */
+#define WRITE_AROUND_PAGES MIN_WRITEBACK_PAGES
+
+/*
* Passed into wb_writeback(), essentially a subset of writeback_control
*/
struct wb_writeback_work {
long nr_pages;
struct super_block *sb;
unsigned long *older_than_this;
+ struct inode *inode;
+ pgoff_t offset;
enum writeback_sync_modes sync_mode;
unsigned int tagged_writepages:1;
unsigned int for_kupdate:1;
@@ -59,6 +68,27 @@ struct wb_writeback_work {
*/
int nr_pdflush_threads;
+static mempool_t *wb_work_mempool;
+
+static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
+{
+ /*
+ * bdi_start_inode_writeback() may be called on page reclaim
+ */
+ if (current->flags & PF_MEMALLOC)
+ return NULL;
+
+ return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
+}
+
+static __init int wb_work_init(void)
+{
+ wb_work_mempool = mempool_create(1024,
+ wb_work_alloc, mempool_kfree, NULL);
+ return wb_work_mempool ? 0 : -ENOMEM;
+}
+fs_initcall(wb_work_init);
+
/**
* writeback_in_progress - determine whether there is writeback in progress
* @bdi: the device's backing_dev_info structure.
@@ -123,7 +153,7 @@ __bdi_start_writeback(struct backing_dev
* This is WB_SYNC_NONE writeback, so if allocation fails just
* wakeup the thread for old dirty data writeback
*/
- work = kzalloc(sizeof(*work), GFP_ATOMIC);
+ work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
if (!work) {
if (bdi->wb.task) {
trace_writeback_nowork(bdi);
@@ -132,6 +162,7 @@ __bdi_start_writeback(struct backing_dev
return;
}
+ memset(work, 0, sizeof(*work));
work->sync_mode = WB_SYNC_NONE;
work->nr_pages = nr_pages;
work->range_cyclic = range_cyclic;
@@ -177,6 +208,107 @@ void bdi_start_background_writeback(stru
spin_unlock_bh(&bdi->wb_lock);
}
+static bool extend_writeback_range(struct wb_writeback_work *work,
+ pgoff_t offset)
+{
+ pgoff_t end = work->offset + work->nr_pages;
+
+ if (offset >= work->offset && offset < end)
+ return true;
+
+ /* the unsigned comparison helps eliminate one compare */
+ if (work->offset - offset < WRITE_AROUND_PAGES) {
+ work->nr_pages += WRITE_AROUND_PAGES;
+ work->offset -= WRITE_AROUND_PAGES;
+ return true;
+ }
+
+ if (offset - end < WRITE_AROUND_PAGES) {
+ work->nr_pages += WRITE_AROUND_PAGES;
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * schedule writeback on a range of inode pages.
+ */
+static struct wb_writeback_work *
+bdi_flush_inode_range(struct backing_dev_info *bdi,
+ struct inode *inode,
+ pgoff_t offset,
+ pgoff_t len)
+{
+ struct wb_writeback_work *work;
+
+ if (!igrab(inode))
+ return ERR_PTR(-ENOENT);
+
+ work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
+ if (!work)
+ return ERR_PTR(-ENOMEM);
+
+ memset(work, 0, sizeof(*work));
+ work->sync_mode = WB_SYNC_NONE;
+ work->inode = inode;
+ work->offset = offset;
+ work->nr_pages = len;
+
+ bdi_queue_work(bdi, work);
+
+ return work;
+}
+
+/*
+ * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
+ * improve IO throughput. The nearby pages will have good chance to reside in
+ * the same LRU list that vmscan is working on, and even close to each other
+ * inside the LRU list in the common case of sequential read/write.
+ *
+ * ret > 0: success, found/reused a previous writeback work
+ * ret = 0: success, allocated/queued a new writeback work
+ * ret < 0: failed
+ */
+long flush_inode_page(struct page *page, struct address_space *mapping)
+{
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct inode *inode = mapping->host;
+ pgoff_t offset = page->index;
+ pgoff_t len = 0;
+ struct wb_writeback_work *work;
+ long ret = -ENOENT;
+
+ if (unlikely(!inode))
+ goto out;
+
+ len = 1;
+ spin_lock_bh(&bdi->wb_lock);
+ list_for_each_entry_reverse(work, &bdi->work_list, list) {
+ if (work->inode != inode)
+ continue;
+ if (extend_writeback_range(work, offset)) {
+ ret = len;
+ offset = work->offset;
+ len = work->nr_pages;
+ break;
+ }
+ if (len++ > 30) /* do limited search */
+ break;
+ }
+ spin_unlock_bh(&bdi->wb_lock);
+
+ if (ret > 0)
+ goto out;
+
+ offset = round_down(offset, WRITE_AROUND_PAGES);
+ len = WRITE_AROUND_PAGES;
+ work = bdi_flush_inode_range(bdi, inode, offset, len);
+ ret = IS_ERR(work) ? PTR_ERR(work) : 0;
+out:
+ return ret;
+}
+
/*
* Remove the inode from the writeback list it is on.
*/
@@ -830,6 +962,21 @@ static unsigned long get_nr_dirty_pages(
get_nr_dirty_inodes();
}
+static long wb_flush_inode(struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
+{
+ loff_t start = work->offset;
+ loff_t end = work->offset + work->nr_pages - 1;
+ int wrote;
+
+ wrote = __filemap_fdatawrite_range(work->inode->i_mapping,
+ start << PAGE_CACHE_SHIFT,
+ end << PAGE_CACHE_SHIFT,
+ WB_SYNC_NONE);
+ iput(work->inode);
+ return wrote;
+}
+
static long wb_check_background_flush(struct bdi_writeback *wb)
{
if (over_bground_thresh()) {
@@ -900,7 +1047,10 @@ long wb_do_writeback(struct bdi_writebac
trace_writeback_exec(bdi, work);
- wrote += wb_writeback(wb, work);
+ if (work->inode)
+ wrote += wb_flush_inode(wb, work);
+ else
+ wrote += wb_writeback(wb, work);
/*
* Notify the caller of completion if this is a synchronous
@@ -909,7 +1059,7 @@ long wb_do_writeback(struct bdi_writebac
if (work->done)
complete(work->done);
else
- kfree(work);
+ mempool_free(work, wb_work_mempool);
}
/*
--- linux-next.orig/include/linux/backing-dev.h 2011-07-03 20:03:37.000000000 -0700
+++ linux-next/include/linux/backing-dev.h 2011-07-05 18:30:19.000000000 -0700
@@ -109,6 +109,7 @@ void bdi_unregister(struct backing_dev_i
int bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int);
void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages);
void bdi_start_background_writeback(struct backing_dev_info *bdi);
+long flush_inode_page(struct page *page, struct address_space *mapping);
int bdi_writeback_thread(void *data);
int bdi_has_dirty_io(struct backing_dev_info *bdi);
void bdi_arm_supers_timer(void);
--- linux-next.orig/include/trace/events/writeback.h 2011-07-05 18:30:16.000000000 -0700
+++ linux-next/include/trace/events/writeback.h 2011-07-05 18:30:19.000000000 -0700
@@ -28,31 +28,40 @@ DECLARE_EVENT_CLASS(writeback_work_class
TP_ARGS(bdi, work),
TP_STRUCT__entry(
__array(char, name, 32)
+ __field(struct wb_writeback_work*, work)
__field(long, nr_pages)
__field(dev_t, sb_dev)
__field(int, sync_mode)
__field(int, for_kupdate)
__field(int, range_cyclic)
__field(int, for_background)
+ __field(unsigned long, ino)
+ __field(unsigned long, offset)
),
TP_fast_assign(
strncpy(__entry->name, dev_name(bdi->dev), 32);
+ __entry->work = work;
__entry->nr_pages = work->nr_pages;
__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
__entry->sync_mode = work->sync_mode;
__entry->for_kupdate = work->for_kupdate;
__entry->range_cyclic = work->range_cyclic;
__entry->for_background = work->for_background;
+ __entry->ino = work->inode ? work->inode->i_ino : 0;
+ __entry->offset = work->offset;
),
- TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
- "kupdate=%d range_cyclic=%d background=%d",
+ TP_printk("bdi %s: sb_dev %d:%d %p nr_pages=%ld sync_mode=%d "
+ "kupdate=%d range_cyclic=%d background=%d ino=%lu offset=%lu",
__entry->name,
MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
+ __entry->work,
__entry->nr_pages,
__entry->sync_mode,
__entry->for_kupdate,
__entry->range_cyclic,
- __entry->for_background
+ __entry->for_background,
+ __entry->ino,
+ __entry->offset
)
);
#define DEFINE_WRITEBACK_WORK_EVENT(name) \
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-06 4:53 ` Wu Fengguang
@ 2011-07-06 6:47 ` Minchan Kim
2011-07-06 7:17 ` Dave Chinner
1 sibling, 0 replies; 20+ messages in thread
From: Minchan Kim @ 2011-07-06 6:47 UTC (permalink / raw)
To: Wu Fengguang
Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Johannes Weiner,
xfs@oss.sgi.com, linux-mm@kvack.org
On Wed, Jul 6, 2011 at 1:53 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote:
>> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
>> > Christoph,
>> >
>> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
>> > > Johannes, Mel, Wu,
>> > >
>> > > Dave has been stressing some XFS patches of mine that remove the XFS
>> > > internal writeback clustering in favour of using write_cache_pages.
>> > >
>> > > As part of investigating the behaviour he found out that we're still
>> > > doing lots of I/O from the end of the LRU in kswapd. Not only is that
>> > > pretty bad behaviour in general, but it also means we really can't
>> > > just remove the writeback clustering in writepage given how much
>> > > I/O is still done through that.
>> > >
>> > > Any chance we could get the writeback vs kswap behaviour sorted out a bit
>> > > better finally?
>> >
>> > I once tried this approach:
>> >
>> > http://www.spinics.net/lists/linux-mm/msg09202.html
>> >
>> > It used a list structure that is not linearly scalable, however that
>> > part should be independently improvable when necessary.
>>
>> I don't think that handing random writeback to the flusher thread is
>> much better than doing random writeback directly. Yes, you added
>> some clustering, but I still don't think writing specific pages is
>> the best solution.
>
> I agree that the VM should avoid writing specific pages as much as
> possible. Mostly often, it's indeed OK to just skip sporadically
> encountered dirty page and reclaim the clean pages presumably not
> far away in the LRU list. So your 2-liner patch is all good if
> constraining it to low scan pressure, which will look like
>
> if (priority == DEF_PRIORITY)
> tag PG_reclaim on encountered dirty pages and
> skip writing it
>
> However the VM in general does need the ability to write specific
> pages, such as when reclaiming from specific zone/memcg. So I'll still
> propose to do bdi_start_inode_writeback().
>
> Below is the patch rebased to linux-next. It's good enough for testing
> purpose, and I guess even with the ->nr_pages work issue, it's
> complete enough to get roughly the same performance as your 2-liner
> patch.
>
>> > The real problem was, it seemed to not be very effective in my test runs.
>> > I found many ->nr_pages works queued before the ->inode works, which
>> > effectively makes the flusher work on more dispersed pages rather
>> > than focusing on the dirty pages encountered in LRU reclaim.
>>
>> But that's really just an implementation issue related to how you
>> tried to solve the problem. That could be addressed.
>>
>> However, what I'm questioning is whether we should even care what
>> page memory reclaim wants to write - it seems to make fundamentally
>> bad decisions from an IO persepctive.
>>
>> We have to remember that memory reclaim is doing LRU reclaim and the
>> flusher threads are doing "oldest first" writeback. IOWs, both are trying
>> to operate in the same direction (oldest to youngest) for the same
>> purpose. The fundamental problem that occurs when memory reclaim
>> starts writing pages back from the LRU is this:
>>
>> - memory reclaim has run ahead of IO writeback -
>>
>> The LRU usually looks like this:
>>
>> oldest youngest
>> +---------------+---------------+--------------+
>> clean writeback dirty
>> ^ ^
>> | |
>> | Where flusher will next work from
>> | Where kswapd is working from
>> |
>> IO submitted by flusher, waiting on completion
>>
>>
>> If memory reclaim is hitting dirty pages on the LRU, it means it has
>> got ahead of writeback without being throttled - it's passed over
>> all the pages currently under writeback and is trying to write back
>> pages that are *newer* than what writeback is working on. IOWs, it
>> starts trying to do the job of the flusher threads, and it does that
>> very badly.
>>
>> The $100 question is *why is it getting ahead of writeback*?
>
> The most important case is: faster reader + relatively slow writer.
>
> Assume for every 10 pages read, 1 page is dirtied, and the dirty speed
> is fast enough to trigger the 20% dirty ratio and hence dirty balancing.
>
> That pattern is able to evenly distribute dirty pages all over the LRU
> list and hence trigger lots of pageout()s. The "skip reclaim writes on
> low pressure" approach can fix this case.
>
> Thanks,
> Fengguang
> ---
> Subject: writeback: introduce bdi_start_inode_writeback()
> Date: Thu Jul 29 14:41:19 CST 2010
>
> This relays ASYNC file writeback IOs to the flusher threads.
>
> pageout() will continue to serve the SYNC file page writes for necessary
> throttling for preventing OOM, which may happen if the LRU list is small
> and/or the storage is slow, so that the flusher cannot clean enough
> pages before the LRU is full scanned.
>
> Only ASYNC pageout() is relayed to the flusher threads, the less
> frequent SYNC pageout()s will work as before as a last resort.
> This helps to avoid OOM when the LRU list is small and/or the storage is
> slow, and the flusher cannot clean enough pages before the LRU is
> full scanned.
>
> The flusher will piggy back more dirty pages for IO
> - it's more IO efficient
> - it helps clean more pages, a good number of them may sit in the same
> LRU list that is being scanned.
>
> To avoid memory allocations at page reclaim, a mempool is created.
>
> Background/periodic works will quit automatically (as done in another
> patch), so as to clean the pages under reclaim ASAP. However for now the
> sync work can still block us for a long time.
>
> Jan Kara: limit the search scope.
>
> CC: Jan Kara <jack@suse.cz>
> CC: Rik van Riel <riel@redhat.com>
> CC: Mel Gorman <mel@linux.vnet.ibm.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
It seems to be an enhanced version of what Mel did earlier.
I support this approach :) but I have some questions.
> ---
> fs/fs-writeback.c | 156 ++++++++++++++++++++++++++++-
> include/linux/backing-dev.h | 1
> include/trace/events/writeback.h | 15 ++
> mm/vmscan.c | 8 +
> 4 files changed, 174 insertions(+), 6 deletions(-)
>
> --- linux-next.orig/mm/vmscan.c 2011-06-29 20:43:10.000000000 -0700
> +++ linux-next/mm/vmscan.c 2011-07-05 18:30:19.000000000 -0700
> @@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st
> if (PageDirty(page)) {
> nr_dirty++;
>
> + if (page_is_file_cache(page) && mapping &&
> + sc->reclaim_mode != RECLAIM_MODE_SYNC) {
> + if (flush_inode_page(page, mapping) >= 0) {
> + SetPageReclaim(page);
> + goto keep_locked;
keep_locked changes the old behavior.
Normally, in async mode, we do keep_lumpy (i.e., we don't reset
reclaim_mode), but now you are always resetting reclaim_mode, so the
sync call of shrink_page_list() never happens if flush_inode_page()
is successful.
Is that your intention?
> + }
> + }
> +
If flush_inode_page() fails (i.e., the page isn't near any current
work's writeback range), we still do pageout() even though it's async
mode. Is that your intention?
--
Kind regards,
Minchan Kim
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-06 4:53 ` Wu Fengguang
2011-07-06 6:47 ` Minchan Kim
@ 2011-07-06 7:17 ` Dave Chinner
1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2011-07-06 7:17 UTC (permalink / raw)
To: Wu Fengguang
Cc: Christoph Hellwig, Mel Gorman, Johannes Weiner, xfs@oss.sgi.com,
linux-mm@kvack.org
On Tue, Jul 05, 2011 at 09:53:01PM -0700, Wu Fengguang wrote:
> On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > We have to remember that memory reclaim is doing LRU reclaim and the
> > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > to operate in the same direction (oldest to youngest) for the same
> > purpose. The fundamental problem that occurs when memory reclaim
> > starts writing pages back from the LRU is this:
> >
> > - memory reclaim has run ahead of IO writeback -
> >
> > The LRU usually looks like this:
> >
> > oldest youngest
> > +---------------+---------------+--------------+
> > clean writeback dirty
> > ^ ^
> > | |
> > | Where flusher will next work from
> > | Where kswapd is working from
> > |
> > IO submitted by flusher, waiting on completion
> >
> >
> > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > got ahead of writeback without being throttled - it's passed over
> > all the pages currently under writeback and is trying to write back
> > pages that are *newer* than what writeback is working on. IOWs, it
> > starts trying to do the job of the flusher threads, and it does that
> > very badly.
> >
> > The $100 question is *why is it getting ahead of writeback*?
>
> The most important case is: faster reader + relatively slow writer.
Same thing I said to Mel: that is not the workload causing the
problem I am seeing.
> Assume for every 10 pages read, 1 page is dirtied, and the dirty speed
> is fast enough to trigger the 20% dirty ratio and hence dirty balancing.
>
> That pattern is able to evenly distribute dirty pages all over the LRU
> list and hence trigger lots of pageout()s. The "skip reclaim writes on
> low pressure" approach can fix this case.
Sure it can, but even better would be to simply skip the dirty pages
and reclaim the interspersed clean pages which greatly
outnumber the dirty pages. That then lets writeback deal with
cleaning the dirty pages in the most optimal manner, and no
writeback from memory reclaim is needed.
IOWs, I don't think writeback from the LRU is the right solution to
the problem you've described, either.
>
> Thanks,
> Fengguang
> ---
> Subject: writeback: introduce bdi_start_inode_writeback()
> Date: Thu Jul 29 14:41:19 CST 2010
>
> This relays ASYNC file writeback IOs to the flusher threads.
>
> pageout() will continue to serve the SYNC file page writes for necessary
> throttling for preventing OOM, which may happen if the LRU list is small
> and/or the storage is slow, so that the flusher cannot clean enough
> pages before the LRU is full scanned.
>
> Only ASYNC pageout() is relayed to the flusher threads, the less
> frequent SYNC pageout()s will work as before as a last resort.
> This helps to avoid OOM when the LRU list is small and/or the storage is
> slow, and the flusher cannot clean enough pages before the LRU is
> full scanned.
Which ignores the fact that async pageout should not be happening in
most cases. Let's try and fix the root cause of the problem, not
paper over it again...
> The flusher will piggy back more dirty pages for IO
> - it's more IO efficient
> - it helps clean more pages, a good number of them may sit in the same
> LRU list that is being scanned.
>
> To avoid memory allocations at page reclaim, a mempool is created.
>
> Background/periodic works will quit automatically (as done in another
> patch), so as to clean the pages under reclaim ASAP. However for now the
> sync work can still block us for a long time.
> /*
> + * When flushing an inode page (for page reclaim), try to piggy back up to
> + * 4MB nearby pages for IO efficiency. These pages will have good opportunity
> + * to be in the same LRU list.
> + */
> +#define WRITE_AROUND_PAGES MIN_WRITEBACK_PAGES
Regardless of the trigger, I think you're going too far in the other
direction here. If we have to do one IO to clean the page that the
VM wants, then it has to be done with as little latency as possible,
but still be large enough to maintain decent throughput.
With the above patch, for every single dirty page the VM wants
cleaned, we'll clean 4MB of pages around it. Ok, but once the VM has
tripped over pages on 25 different inodes, we've now got 100MB of
writeback work to chew through before we can get to the 26th page
the VM wanted cleaned.
At which point, we may as well just ignore what the VM wants and
continue to clean pages via the existing mechanisms because the
latency for cleaning a specific page will be worse than if the VM just
skipped it in the first place....
FWIW, XFS limited such clustering to 64 pages at a time to try to
balance the bandwidth vs completion latency problem.
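In terms of the patch above, that would mean capping the write-around
window, e.g. (constant name made up):
	/* bound reclaim-triggered write-around: 64 pages = 256KB */
	#define RECLAIM_WRITE_CLUSTER	64
	len = min_t(pgoff_t, WRITE_AROUND_PAGES, RECLAIM_WRITE_CLUSTER);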
> +/*
> + * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
> + * improve IO throughput. The nearby pages will have good chance to reside in
> + * the same LRU list that vmscan is working on, and even close to each other
> + * inside the LRU list in the common case of sequential read/write.
> + *
> + * ret > 0: success, found/reused a previous writeback work
> + * ret = 0: success, allocated/queued a new writeback work
> + * ret < 0: failed
> + */
> +long flush_inode_page(struct page *page, struct address_space *mapping)
> +{
> + struct backing_dev_info *bdi = mapping->backing_dev_info;
> + struct inode *inode = mapping->host;
> + pgoff_t offset = page->index;
> + pgoff_t len = 0;
> + struct wb_writeback_work *work;
> + long ret = -ENOENT;
> +
> + if (unlikely(!inode))
> + goto out;
> +
> + len = 1;
> + spin_lock_bh(&bdi->wb_lock);
> + list_for_each_entry_reverse(work, &bdi->work_list, list) {
> + if (work->inode != inode)
> + continue;
> + if (extend_writeback_range(work, offset)) {
> + ret = len;
> + offset = work->offset;
> + len = work->nr_pages;
> + break;
> + }
> + if (len++ > 30) /* do limited search */
> + break;
> + }
> + spin_unlock_bh(&bdi->wb_lock);
I don't think this is a necessary or scalable optimisation. It won't
be useful when there are lots of dirty inodes and dirty pages are
tripped over in their hundreds or thousands - it'll just burn CPU
doing nothing, and serialise against other reclaim and writeback
work. It looks like a case of premature optimisation to me....
Anyway, if there's a page flush near to an existing piece of work the
IO elevator should merge them appropriately.
> +static long wb_flush_inode(struct bdi_writeback *wb,
> + struct wb_writeback_work *work)
> +{
> + loff_t start = work->offset;
> + loff_t end = work->offset + work->nr_pages - 1;
> + int wrote;
> +
> + wrote = __filemap_fdatawrite_range(work->inode->i_mapping,
> + start << PAGE_CACHE_SHIFT,
> + end << PAGE_CACHE_SHIFT,
> + WB_SYNC_NONE);
> + iput(work->inode);
> + return wrote;
> +}
Out of curiosity, before going down the complex route, did you try
just calling this directly to see if it solved the problem? i.e.
igrab()
get start/end
unlock page
__filemap_fdatawrite_range()
iput()
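i.e., a self-contained sketch like the following (hypothetical helper
name, untested):
	static long flush_page_direct(struct page *page,
				      struct address_space *mapping)
	{
		struct inode *inode = mapping->host;
		loff_t start, end;
		if (!igrab(inode))
			return -ENOENT;
		/* same 4MB write-around window as the patch above */
		start = (loff_t)round_down(page->index, WRITE_AROUND_PAGES)
				<< PAGE_CACHE_SHIFT;
		end = start +
		      ((loff_t)WRITE_AROUND_PAGES << PAGE_CACHE_SHIFT) - 1;
		unlock_page(page);
		__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
		iput(inode);
		return 0;
	}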
I mean, much as I dislike the idea of writeback from the LRU, if all
we need to do is call through .writepages() to get decent IO from
reclaim (when it occurs), then why do we need to add this async
complexity to the generic writeback code to achieve the same end?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-04 3:25 ` Dave Chinner
2011-07-05 14:34 ` Mel Gorman
2011-07-06 4:53 ` Wu Fengguang
@ 2011-07-06 15:12 ` Johannes Weiner
2011-07-08 9:54 ` Dave Chinner
2 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2011-07-06 15:12 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, xfs@oss.sgi.com,
linux-mm@kvack.org
On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> >
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > >
> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> > >
> > > As part of investigating the behaviour he found out that we're still
> > > doing lots of I/O from the end of the LRU in kswapd. Not only is that
> > > pretty bad behaviour in general, but it also means we really can't
> > > just remove the writeback clustering in writepage given how much
> > > I/O is still done through that.
> > >
> > > > Any chance we could get the writeback vs kswap behaviour sorted out a bit
> > > better finally?
> >
> > I once tried this approach:
> >
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> >
> > It used a list structure that is not linearly scalable, however that
> > part should be independently improvable when necessary.
>
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly. Yes, you added
> > some clustering, but I still don't think writing specific pages is
> the best solution.
>
> > The real problem was, it seemed to not be very effective in my test runs.
> > I found many ->nr_pages works queued before the ->inode works, which
> > effectively makes the flusher work on more dispersed pages rather
> > than focusing on the dirty pages encountered in LRU reclaim.
>
> But that's really just an implementation issue related to how you
> tried to solve the problem. That could be addressed.
>
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO persepctive.
>
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback. IOWs, both are trying
> to operate in the same direction (oldest to youngest) for the same
> purpose. The fundamental problem that occurs when memory reclaim
> starts writing pages back from the LRU is this:
>
> - memory reclaim has run ahead of IO writeback -
>
> The LRU usually looks like this:
>
> oldest youngest
> +---------------+---------------+--------------+
> clean writeback dirty
> ^ ^
> | |
> | Where flusher will next work from
> | Where kswapd is working from
> |
> IO submitted by flusher, waiting on completion
>
>
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on. IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
>
> The $100 question is *why is it getting ahead of writeback*?
Unless you have a purely sequential writer, the LRU order is - at
least in theory - diverging away from the writeback order.
According to the reasoning behind generational garbage collection,
they should in fact be inverse to each other. The oldest pages still
in use are the most likely to be still needed in the future.
In practice we only make a generational distinction between used-once
and used-many, which manifests in the inactive and the active list.
But still, when reclaim starts off with a localized writer, the oldest
pages are likely to be at the end of the active list.
So pages from the inactive list are likely to be written in the right
order, but at the same time active pages are even older, thus written
before them. Memory reclaim starts with the inactive pages, and this
is why it gets ahead.
Then there is also the case where a fast writer pushes dirty pages to
the end of the LRU list, of course, but you already said that this is
not applicable to your workload.
My point is that I don't think it's unexpected that dirty pages come
off the inactive list in practice. It just sucks how we handle them.
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-06 15:12 ` Johannes Weiner
@ 2011-07-08 9:54 ` Dave Chinner
2011-07-11 17:20 ` Johannes Weiner
0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2011-07-08 9:54 UTC (permalink / raw)
To: Johannes Weiner
Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, xfs@oss.sgi.com,
linux-mm@kvack.org
On Wed, Jul 06, 2011 at 05:12:29PM +0200, Johannes Weiner wrote:
> On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > We have to remember that memory reclaim is doing LRU reclaim and the
> > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > to operate in the same direction (oldest to youngest) for the same
> > purpose. The fundamental problem that occurs when memory reclaim
> > starts writing pages back from the LRU is this:
> >
> > - memory reclaim has run ahead of IO writeback -
> >
> > The LRU usually looks like this:
> >
> > oldest youngest
> > +---------------+---------------+--------------+
> > clean writeback dirty
> > ^ ^
> > | |
> > | Where flusher will next work from
> > | Where kswapd is working from
> > |
> > IO submitted by flusher, waiting on completion
> >
> >
> > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > got ahead of writeback without being throttled - it's passed over
> > all the pages currently under writeback and is trying to write back
> > pages that are *newer* than what writeback is working on. IOWs, it
> > starts trying to do the job of the flusher threads, and it does that
> > very badly.
> >
> > The $100 question is *why is it getting ahead of writeback*?
>
> Unless you have a purely sequential writer, the LRU order is - at
> least in theory - diverging away from the writeback order.
Which is the root cause of the IO collapse that writeback from the
LRU causes, yes?
> According to the reasoning behind generational garbage collection,
> they should in fact be inverse to each other. The oldest pages still
> in use are the most likely to be still needed in the future.
>
> In practice we only make a generational distinction between used-once
> and used-many, which manifests in the inactive and the active list.
> But still, when reclaim starts off with a localized writer, the oldest
> pages are likely to be at the end of the active list.
Yet the file pages on the active list are unlikely to be dirty -
overwrite-in-place cache hot workloads are pretty scarce in my
experience. Hence writeback of dirty pages from the active LRU is
unlikely to be a problem.
That leaves all the use-once pages cycling through the inactive
list. The oldest pages on this list are the ones that get reclaimed,
and if we are getting lots of dirty pages here it seems pretty clear
that memory demand is mostly for pages being rapidly dirtied. In
which case, trying to speed up the rate at which they are cleaned by
issuing IO is only effective if there is no IO already in progress.
Who knows if IO is already in progress? The writeback subsystem....
> So pages from the inactive list are likely to be written in the right
> order, but at the same time active pages are even older, thus written
> before them. Memory reclaim starts with the inactive pages, and this
> is why it gets ahead.
All right, if the design is such that you can't avoid having reclaim
write back dirty pages as it encounters them on the inactive LRU,
should the dirty pages even be on that LRU?
That is, dirty pages cannot be reclaimed immediately but they are
intertwined with pages that can be reclaimed immediately. We really
want to reclaim pages that can be reclaimed quickly while not
blocking on or continually having to skip over pages that cannot be
reclaimed.
So why not make a distinction between clean and dirty file pages on
the inactive list? That is, consider dirty pages to still be "in
use" and "owned" by the writeback subsystem. while pages are dirty
they are kept on a separate "dirty file page LRU" that memory
reclaim does not ever touch unless it runs out of clean pages on the
inactive list to reclaim. And then when it runs out of clean pages,
it can go find pages under writeback on the dirty list and block on
them before going back to reclaiming off the clean list....
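As a rough kernel-style sketch of that policy (LRU_INACTIVE_FILE_CLEAN,
LRU_DIRTY_FILE, this shrink_list() signature and
wait_for_writeback_progress() are all hypothetical names, not existing
kernel API):

	/*
	 * Sketch only: reclaim never issues IO itself.  Dirty pages
	 * live on their own list, owned by the writeback subsystem.
	 */
	static unsigned long shrink_file_pages(struct zone *zone,
					       unsigned long nr_to_reclaim)
	{
		unsigned long nr_reclaimed;

		/* Reclaim only immediately-reclaimable clean pages. */
		nr_reclaimed = shrink_list(zone, LRU_INACTIVE_FILE_CLEAN,
					   nr_to_reclaim);
		if (nr_reclaimed >= nr_to_reclaim)
			return nr_reclaimed;

		/*
		 * Out of clean pages: block on pages already under
		 * writeback on the dirty list.  IO completion moves
		 * pages back to the clean list, so the retry below
		 * finds freshly cleaned pages.
		 */
		wait_for_writeback_progress(zone, LRU_DIRTY_FILE);

		return nr_reclaimed +
			shrink_list(zone, LRU_INACTIVE_FILE_CLEAN,
				    nr_to_reclaim - nr_reclaimed);
	}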
And given that cgroups have their own LRUs for reclaim now, this
problem of dirty pages being written from the LRUs has a much larger
scope. It's not just whether the global LRU reclaim is hitting
dirty pages, it's a per-cgroup problem and they are much more likely
to have low memory limits that lead to such problems. And
concurrently at that, too. Writeback simply doesn't scale to having
multiple sources of random page IO being despatched concurrently.
> Then there is also the case where a fast writer pushes dirty pages to
> the end of the LRU list, of course, but you already said that this is
> not applicable to your workload.
>
> My point is that I don't think it's unexpected that dirty pages come
> off the inactive list in practice. It just sucks how we handle them.
Exactly what I've been saying.
And what I'm also trying to say is the way to fix the "we do shitty
IO on dirty pages" problem is *not to do IO*. That's -exactly- why
the IO-less write throttling is a significant improvement: we've
turned shitty IO into good IO by *waiting for IO* during throttling
rather than submitting IO.
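As a sketch of that principle applied to page reclaim (not what
shrink_page_list() does today - the policy here is hypothetical,
though PageWriteback(), wait_on_page_writeback(),
wakeup_flusher_threads() and the keep_locked label are real):

	if (PageWriteback(page)) {
		/* IO already in flight: scale as a waiter, not a submitter. */
		wait_on_page_writeback(page);
	} else if (PageDirty(page)) {
		/*
		 * Hint the flusher that we need clean pages, then move
		 * on instead of issuing single-page IO via ->writepage.
		 */
		wakeup_flusher_threads(0);
		goto keep_locked;
	}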
Fundamentally, scaling to N IO waiters is far easier and more
efficient than scaling to N IO submitters. All I'm asking is that
you apply that same principle to memory reclaim, please.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-02 2:42 ` Dave Chinner
2011-07-05 14:10 ` Mel Gorman
@ 2011-07-11 10:26 ` Christoph Hellwig
1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-07-11 10:26 UTC (permalink / raw)
To: Dave Chinner
Cc: Mel Gorman, Christoph Hellwig, Johannes Weiner, Wu Fengguang, xfs,
jack, linux-mm
On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> To tell the truth, I don't think anyone really cares how ext3
> performs these days. XFS seems to be the filesystem that brings out
> all the bad behaviour in the mm subsystem....
Maybe that's because XFS actually plays by the rules?
btrfs simply rejects all attempts from kswapd to write back, as it
has the following check:
if (current->flags & PF_MEMALLOC) {
redirty_page_for_writepage(wbc, page);
unlock_page(page);
return 0;
}
while XFS tries to play nice and allow writeback from kswapd:
if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
goto redirty;
ext4 can't perform delalloc conversions from writepage:
if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
ext4_bh_delay_or_unwritten)) {
/*
* We don't want to do block allocation, so redirty
* the page and return. We may reach here when we do
* a journal commit via journal_submit_inode_data_buffers.
* We can also reach here via shrink_page_list
*/
goto redirty_pages;
}
so any normal workloads that don't involve overwrites will never get
any writeback from kswapd.
This should tell us that the VM can live just fine without doing
writeback from kswapd, as otherwise all systems using btrfs or ext4
would have completely fallen over.
It also suggests we should have standardized helpers in the VFS to work
around the braindead VM behaviour.
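Such a helper might look like the sketch below; the name
vfs_reclaim_refuse_writepage() is made up, while the check itself is
the XFS variant quoted above:

	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/sched.h>
	#include <linux/writeback.h>

	/*
	 * Hypothetical VFS helper consolidating the checks that btrfs,
	 * XFS and ext4 each open-code: refuse writeback from direct
	 * reclaim and redirty the page for the flusher threads instead.
	 * kswapd is still let through here; btrfs would drop the
	 * PF_KSWAPD test and refuse kswapd as well.
	 */
	static inline int vfs_reclaim_refuse_writepage(struct page *page,
					struct writeback_control *wbc)
	{
		if ((current->flags & (PF_MEMALLOC | PF_KSWAPD)) ==
		    PF_MEMALLOC) {
			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
			return 1;	/* caller returns 0 from ->writepage */
		}
		return 0;
	}

A ->writepage implementation would then start with
"if (vfs_reclaim_refuse_writepage(page, wbc)) return 0;" instead of
open-coding the check.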
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-05 14:34 ` Mel Gorman
2011-07-06 1:23 ` Dave Chinner
@ 2011-07-11 11:10 ` Christoph Hellwig
1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-07-11 11:10 UTC (permalink / raw)
To: Mel Gorman
Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Johannes Weiner,
xfs@oss.sgi.com, linux-mm@kvack.org
On Tue, Jul 05, 2011 at 03:34:10PM +0100, Mel Gorman wrote:
> > However, what I'm questioning is whether we should even care what
> > page memory reclaim wants to write - it seems to make fundamentally
> > bad decisions from an IO perspective.
> >
>
> It sucks from an IO perspective but from the perspective of the VM that
> needs memory to be free in a particular zone or node, it's a reasonable
> request.
It might appear reasonable, but it's not.
What the VM wants underneath is generally (1):
- free N pages in zone Z
and it then goes on to free the pages one by one through kswapd,
which leads to freeing those N pages, but unless they already were
clean it will take very long to get there and bog down the whole
system.
So we need a better way to actually perform that underlying request.
Dave's suggestion of keeping different lists for clean vs dirty pages
in the VM and preferentially reclaiming the clean ones under zone
pressure is a first step. The second one will be to tell the
writeback threads to preferentially clean pages from a given zone.
I'm actually not sure how to do that yet, as we could have memory
from different zones on a single inode. Taking an inode that has
memory from the right zone and then writing that out will probably
work fine in 64-bit NUMA systems where zones more or less
equal nodes. It probably won't work very well if we need to free
up memory in the various low memory zones, as those will be spread
over random inodes.
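As an illustration only, a zone check for inode selection could scan
the dirty tag for a page in the target zone - a hypothetical helper
built on real pagevec API, and almost certainly too expensive to run
against every dirty inode:

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pagevec.h>

	/* Sketch: does this inode have dirty pagecache in @zone? */
	static bool inode_has_dirty_pages_in_zone(
			struct address_space *mapping, struct zone *zone)
	{
		struct pagevec pvec;
		pgoff_t index = 0;
		bool found = false;
		int i;

		pagevec_init(&pvec, 0);
		while (!found && pagevec_lookup_tag(&pvec, mapping, &index,
						    PAGECACHE_TAG_DIRTY,
						    PAGEVEC_SIZE)) {
			for (i = 0; i < pagevec_count(&pvec); i++) {
				if (page_zone(pvec.pages[i]) == zone) {
					found = true;
					break;
				}
			}
			pagevec_release(&pvec);
		}
		return found;
	}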
> It doesn't check how many pages are under writeback. Direct reclaim
> will check if the block device is congested but that is about
> it. Otherwise the expectation was the elevator would handle the
> merging of requests into a sensible pattern.
It can't. The elevator has a relatively small window it can operate
on, and can never fix up a bad large scale writeback pattern.
> Also, while filesystem
> pages are getting cleaned by flushers, that does not cover anonymous
> pages being written to swap.
At least for now we will have to keep kswapd writeback for swap. It
is just as inefficient as on a filesystem, but given that people don't
rely on swap performance we can probably live with it. Note that we
can't simply use background flushing for swap, as that would mean
we'd need backing space allocated for all main memory, which isn't
very practical with today's memory sizes. The whole concept of demand
paging anonymous memory leads to pretty bad I/O patterns. If you're
actually making heavy use of it, the old-school Unix full process paging
would be a lot faster.
(1) modulo things like compaction
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-08 9:54 ` Dave Chinner
@ 2011-07-11 17:20 ` Johannes Weiner
2011-07-11 17:24 ` Christoph Hellwig
2011-07-11 19:09 ` Rik van Riel
0 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2011-07-11 17:20 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, Rik van Riel,
xfs@oss.sgi.com, linux-mm@kvack.org
On Fri, Jul 08, 2011 at 07:54:56PM +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2011 at 05:12:29PM +0200, Johannes Weiner wrote:
> > On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > > We have to remember that memory reclaim is doing LRU reclaim and the
> > > flusher threads are doing "oldest first" writeback. IOWs, both are trying
> > > to operate in the same direction (oldest to youngest) for the same
> > > purpose. The fundamental problem that occurs when memory reclaim
> > > starts writing pages back from the LRU is this:
> > >
> > > - memory reclaim has run ahead of IO writeback -
> > >
> > > The LRU usually looks like this:
> > >
> > > oldest youngest
> > > +---------------+---------------+--------------+
> > > clean writeback dirty
> > > ^ ^
> > > | |
> > > | Where flusher will next work from
> > > | Where kswapd is working from
> > > |
> > > IO submitted by flusher, waiting on completion
> > >
> > >
> > > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > > got ahead of writeback without being throttled - it's passed over
> > > all the pages currently under writeback and is trying to write back
> > > pages that are *newer* than what writeback is working on. IOWs, it
> > > starts trying to do the job of the flusher threads, and it does that
> > > very badly.
> > >
> > > The $100 question is *why is it getting ahead of writeback*?
> >
> > Unless you have a purely sequential writer, the LRU order is - at
> > least in theory - diverging away from the writeback order.
>
> Which is the root cause of the IO collapse that writeback from the
> LRU causes, yes?
>
> > According to the reasoning behind generational garbage collection,
> > they should in fact be inverse to each other. The oldest pages still
> > in use are the most likely to be still needed in the future.
> >
> > In practice we only make a generational distinction between used-once
> > and used-many, which manifests in the inactive and the active list.
> > But still, when reclaim starts off with a localized writer, the oldest
> > pages are likely to be at the end of the active list.
>
> Yet the file pages on the active list are unlikely to be dirty -
> overwrite-in-place cache hot workloads are pretty scarce in my
> experience. Hence writeback of dirty pages from the active LRU is
> unlikely to be a problem.
Just to clarify, I looked at this too much from the reclaim POV, where
use-once applies to full pages, not bytes.
Even if you do not overwrite the same bytes over and over again,
issuing two subsequent write()s that end up against the same page will
have it activated.
Are your workloads writing in perfectly page-aligned chunks?
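For example (a userspace sketch; the file name is made up): two
back-to-back 2KB writes dirty the same 4KB page, so the page is
referenced twice and becomes a candidate for the active list even
though no byte was written more than once:

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[2048] = { 0 };
		int fd = open("testfile",
			      O_CREAT | O_WRONLY | O_TRUNC, 0644);

		write(fd, buf, sizeof(buf));	/* bytes 0-2047: first touch  */
		write(fd, buf, sizeof(buf));	/* bytes 2048-4095: same page */
		close(fd);
		return 0;
	}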
This effect may build up slowly, but every page that is written from
the active list makes room for a dirty page on the inactive list wrt
the dirty limit. I.e. without the active pages, you have 10-20% dirty
pages at the head of the inactive list (default dirty ratio), or an
80-90% clean tail, and for every page cleaned, a new dirty page can
appear at the inactive head.
But taking the active list into account, some of these clean pages are
taken away from the headstart the flusher has over the reclaimer: they
sit on the active list. For every page cleaned, a new dirty page can
appear at the inactive head, plus a few deactivated clean pages.
Now, the active list is not scanned anymore until it is bigger than
the inactive list, giving the flushers plenty of time to clean the
pages on it and let them accumulate even while memory pressure is
already occurring. For every page cleaned, a new dirty page can
appear at the inactive head, plus a LOT of deactivated clean pages.
So when memory needs to be reclaimed, the LRU lists in those three
scenarios look like this, shown as [inactive][active] with C for
clean and D for dirty pages:
inactive-only: [CCCCCCCCDD][]
active-small: [CCCCCCDD][CC]
active-huge: [CCCDD][CCCCC]
where the third scenario is the most likely for the reclaimer to run
into dirty pages.
I CC'd Rik for reclaim-wizardry. But if I am not completely off with
this, there is a chance that the change that let the active list grow
unscanned may actually have contributed to this single-page writing
problem becoming worse?
commit 56e49d218890f49b0057710a4b6fef31f5ffbfec
Author: Rik van Riel <riel@redhat.com>
Date: Tue Jun 16 15:32:28 2009 -0700
vmscan: evict use-once pages first
When the file LRU lists are dominated by streaming IO pages, evict those
pages first, before considering evicting other pages.
This should be safe from deadlocks or performance problems
because only three things can happen to an inactive file page:
1) referenced twice and promoted to the active list
2) evicted by the pageout code
3) under IO, after which it will get evicted or promoted
The pages freed in this way can either be reused for streaming IO, or
allocated for something else. If the pages are used for streaming IO,
this pageout pattern continues. Otherwise, we will fall back to the
normal pageout pattern.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Elladan <elladan@eskimo.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-11 17:20 ` Johannes Weiner
@ 2011-07-11 17:24 ` Christoph Hellwig
2011-07-11 19:09 ` Rik van Riel
1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-07-11 17:24 UTC (permalink / raw)
To: Johannes Weiner
Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
Rik van Riel, xfs@oss.sgi.com, linux-mm@kvack.org
On Mon, Jul 11, 2011 at 07:20:50PM +0200, Johannes Weiner wrote:
> > Yet the file pages on the active list are unlikely to be dirty -
> > overwrite-in-place cache hot workloads are pretty scarce in my
> > experience. Hence writeback of dirty pages from the active LRU is
> > unlikely to be a problem.
>
> Just to clarify, I looked at this too much from the reclaim POV, where
> use-once applies to full pages, not bytes.
>
> Even if you do not overwrite the same bytes over and over again,
> issuing two subsequent write()s that end up against the same page will
> have it activated.
>
> Are your workloads writing in perfectly page-aligned chunks?
Many workloads do, given that we already tell them our preferred
I/O size through struct stat, which is always the page size or larger.
That won't help with workloads having to write in small chunksizes.
The performance critical ones using small chunksizes usually use
O_(D)SYNC, so pages will be clean after the write returns to userspace.
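For instance (a userspace sketch; the file name is made up): with
O_DSYNC the write() below returns only once the data is on stable
storage, so the page is already clean by the time reclaim could
encounter it:

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[512] = { 0 };	/* sub-page chunk size */
		int fd = open("log.dat",
			      O_CREAT | O_WRONLY | O_DSYNC, 0644);

		write(fd, buf, sizeof(buf));	/* page is clean on return */
		close(fd);
		return 0;
	}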
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
2011-07-11 17:20 ` Johannes Weiner
2011-07-11 17:24 ` Christoph Hellwig
@ 2011-07-11 19:09 ` Rik van Riel
1 sibling, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2011-07-11 19:09 UTC (permalink / raw)
To: Johannes Weiner
Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
xfs@oss.sgi.com, linux-mm@kvack.org
On 07/11/2011 01:20 PM, Johannes Weiner wrote:
> I CC'd Rik for reclaim-wizardry. But if I am not completely off with
> this, there is a chance that the change that let the active list grow
> unscanned may actually have contributed to this single-page writing
> problem becoming worse?
Yes, the patch probably contributed.
However, the patch does help protect the working set in
the page cache from streaming IO, so on balance I believe
we need to keep this change.
What it changes is that the size of the inactive file list
can no longer grow unbounded, keeping it a little smaller
than it could have grown in the past.
> commit 56e49d218890f49b0057710a4b6fef31f5ffbfec
> Author: Rik van Riel<riel@redhat.com>
> Date: Tue Jun 16 15:32:28 2009 -0700
>
> vmscan: evict use-once pages first
>
> When the file LRU lists are dominated by streaming IO pages, evict those
> pages first, before considering evicting other pages.
>
> This should be safe from deadlocks or performance problems
> because only three things can happen to an inactive file page:
>
> 1) referenced twice and promoted to the active list
> 2) evicted by the pageout code
> 3) under IO, after which it will get evicted or promoted
>
> The pages freed in this way can either be reused for streaming IO, or
> allocated for something else. If the pages are used for streaming IO,
> this pageout pattern continues. Otherwise, we will fall back to the
> normal pageout pattern.
>
> Signed-off-by: Rik van Riel<riel@redhat.com>
> Reported-by: Elladan<elladan@eskimo.com>
> Cc: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
> Cc: Peter Zijlstra<peterz@infradead.org>
> Cc: Lee Schermerhorn<lee.schermerhorn@hp.com>
> Acked-by: Johannes Weiner<hannes@cmpxchg.org>
> Signed-off-by: Andrew Morton<akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds<torvalds@linux-foundation.org>
--
All rights reversed
Thread overview: 20+ messages
[not found] <20110629140109.003209430@bombadil.infradead.org>
[not found] ` <20110629140336.950805096@bombadil.infradead.org>
[not found] ` <20110701022248.GM561@dastard>
[not found] ` <20110701041851.GN561@dastard>
2011-07-01 9:33 ` [PATCH 03/27] xfs: use write_cache_pages for writeback clustering Christoph Hellwig
2011-07-01 14:59 ` Mel Gorman
2011-07-01 15:15 ` Christoph Hellwig
2011-07-02 2:42 ` Dave Chinner
2011-07-05 14:10 ` Mel Gorman
2011-07-05 15:55 ` Dave Chinner
2011-07-11 10:26 ` Christoph Hellwig
2011-07-01 15:41 ` Wu Fengguang
2011-07-04 3:25 ` Dave Chinner
2011-07-05 14:34 ` Mel Gorman
2011-07-06 1:23 ` Dave Chinner
2011-07-11 11:10 ` Christoph Hellwig
2011-07-06 4:53 ` Wu Fengguang
2011-07-06 6:47 ` Minchan Kim
2011-07-06 7:17 ` Dave Chinner
2011-07-06 15:12 ` Johannes Weiner
2011-07-08 9:54 ` Dave Chinner
2011-07-11 17:20 ` Johannes Weiner
2011-07-11 17:24 ` Christoph Hellwig
2011-07-11 19:09 ` Rik van Riel