* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
From: Christoph Hellwig @ 2011-07-01 9:33 UTC
To: Mel Gorman, Johannes Weiner, Wu Fengguang
Cc: Dave Chinner, xfs, linux-mm

Johannes, Mel, Wu,

Dave has been stressing some XFS patches of mine that remove the XFS internal writeback clustering in favour of using write_cache_pages. As part of investigating the behaviour he found out that we're still doing lots of I/O from the end of the LRU in kswapd. Not only is that pretty bad behaviour in general, but it also means we really can't just remove the writeback clustering in writepage given how much I/O is still done through that.

Any chance we could get the writeback vs kswapd behaviour sorted out a bit better finally?

Some excerpts from the previous discussion:

On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> I'm now only running test 180 on 100 files rather than the 1000 the
> test normally runs on, because it's faster and still shows the
> problem. That means the test is only using 1GB of disk space, and
> I'm running on a VM with 1GB RAM. It appears to be related to the VM
> triggering random page writeback from the LRU - 100x10MB files more
> than fills memory, hence it being the smallest test case I could
> reproduce the problem on.
>
> My triage notes are as follows, and the patch that fixes the bug is
> attached below.
>
> --- 180.out     2010-04-28 15:00:22.000000000 +1000
> +++ 180.out.bad 2011-07-01 12:44:12.000000000 +1000
> @@ -1 +1,9 @@
>  QA output created by 180
> +file /mnt/scratch/81 has incorrect size 10473472 - sync failed
> +file /mnt/scratch/86 has incorrect size 10371072 - sync failed
> +file /mnt/scratch/87 has incorrect size 10104832 - sync failed
> +file /mnt/scratch/88 has incorrect size 10125312 - sync failed
> +file /mnt/scratch/89 has incorrect size 10469376 - sync failed
> +file /mnt/scratch/90 has incorrect size 10240000 - sync failed
> +file /mnt/scratch/91 has incorrect size 10362880 - sync failed
> +file /mnt/scratch/92 has incorrect size 10366976 - sync failed
>
> $ ls -li /mnt/scratch/ | awk '/rw/ { printf("0x%x %d %d\n", $1, $6, $10); }'
> 0x244093 10473472 81
> 0x244098 10371072 86
> 0x244099 10104832 87
> 0x24409a 10125312 88
> 0x24409b 10469376 89
> 0x24409c 10240000 90
> 0x24409d 10362880 91
> 0x24409e 10366976 92
>
> So looking at inode 0x244099 (/mnt/scratch/87), the last setfilesize
> call in the trace (got a separate patch for that) is:
>
> <...>-393 [000] 696245.229559: xfs_ilock_nowait: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> <...>-393 [000] 696245.229560: xfs_setfilesize: dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
> <...>-393 [000] 696245.229561: xfs_iunlock: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
>
> For an IO that was from offset 0x600000 for just under 4MB. The end
> of that IO is at byte 10104832, which is _exactly_ what the inode
> size says it is.
> It is very clear from the IO completions that we are getting a
> *lot* of kswapd driven writeback directly through .writepage:
>
> $ grep "xfs_setfilesize:" t.t | grep "4096$" | wc -l
> 801
> $ grep "xfs_setfilesize:" t.t | grep -v "4096$" | wc -l
> 78
>
> So there's ~900 IO completions that change the file size, and 90% of
> them are single page updates.
>
> $ ps -ef | grep [k]swap
> root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> $ grep "writepage:" t.t | grep "514 " | wc -l
> 799
>
> Oh, now that is too close to just be a coincidence. We're getting
> significant amounts of random page writeback from the ends of
> the LRUs done by the VM.
>
> <sigh>

On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > Looks good. I still wonder why I haven't been able to hit this.
> > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > filesystems and since yesterday 1k as well.
>
> It requires the test to run the VM out of RAM and then force enough
> memory pressure for kswapd to start writeback from the LRU. The
> reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
>
> When kswapd starts doing writeback from the LRU, the iops rate goes
> through the roof (from ~300iops @~320k/io to ~7000iops @4k/io) and
> throughput drops from 100MB/s to ~30MB/s. BBWC is the only reason
> the IOPS stays as high as it does - maybe that is why I saw this and
> you haven't.
>
> As it is, the kswapd writeback behaviour is utterly atrocious and,
> ultimately, quite easy to provoke. I wish the MM folk would fix that
> goddamn problem already - we've only been complaining about it for
> the last 6 or 7 years. As such, I'm wondering if it's a bad idea to
> even consider removing the .writepage clustering...
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
From: Mel Gorman @ 2011-07-01 14:59 UTC
To: Christoph Hellwig
Cc: Johannes Weiner, Wu Fengguang, Dave Chinner, xfs, jack, linux-mm

On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> Johannes, Mel, Wu,

Am adding Jan Kara as he has been working on writeback efficiency recently as well.

> Dave has been stressing some XFS patches of mine that remove the XFS
> internal writeback clustering in favour of using write_cache_pages.

Against what kernel? 2.6.38 was a disaster for reclaim, I've been finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.

> As part of investigating the behaviour he found out that we're still
> doing lots of I/O from the end of the LRU in kswapd. Not only is that
> pretty bad behaviour in general, but it also means we really can't
> just remove the writeback clustering in writepage given how much
> I/O is still done through that.
>
> Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> better finally?
>
> Some excerpts from the previous discussion:
>
> On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> > I'm now only running test 180 on 100 files rather than the 1000 the
> > test normally runs on, because it's faster and still shows the
> > problem.

I had stopped looking at writeback problems while Wu and Jan were working on various writeback patchsets like IO-less throttling. I don't know where they currently stand, and while I submitted a number of reclaim patches since I last looked at this problem around 2.6.37, they were related to migration, kswapd reclaiming too much memory and kswapd using too much CPU - not writeback.

At the time I stopped, the tests I was looking at were writing very few pages off the end of the LRU. Unfortunately I no longer have those results to check, but for unrelated reasons I've been running other regression tests. Here is an example fsmark report over a number of kernels. The machine used is old but unfortunately it's the only one I have a full range of results for at the moment.
FS-Mark
                    2.6.32.42-mainline  2.6.34.10-mainline   2.6.37.6-mainline     2.6.38-mainline      2.6.39-mainline
Files/s  min           162.80 ( 0.00%)     156.20 (-4.23%)     155.60 (-4.63%)     157.80 (-3.17%)      151.10 (-7.74%)
Files/s  mean          173.77 ( 0.00%)     176.27 ( 1.42%)     168.19 (-3.32%)     172.98 (-0.45%)      172.05 (-1.00%)
Files/s  stddev          7.64 ( 0.00%)      12.54 (39.05%)       8.55 (10.57%)       8.39 ( 8.90%)       10.30 (25.77%)
Files/s  max           190.30 ( 0.00%)     206.80 ( 7.98%)     185.20 (-2.75%)     198.90 ( 4.32%)      201.00 ( 5.32%)
Overhead min       1742851.00 ( 0.00%) 1612311.00 ( 8.10%) 1251552.00 (39.26%) 1239859.00 (40.57%)  1393047.00 (25.11%)
Overhead mean      2443021.87 ( 0.00%) 2486525.60 (-1.75%) 2024365.53 (20.68%) 1849402.47 (32.10%)  1886692.53 (29.49%)
Overhead stddev     744034.70 ( 0.00%) 359446.19 (106.99%) 335986.49 (121.45%) 375627.48 ( 98.08%)  320901.34 (131.86%)
Overhead max       4744130.00 ( 0.00%) 3082235.00 (53.92%) 2561054.00 (85.24%) 2626346.00 (80.64%)  2559170.00 (85.38%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds)     624.12    647.61    658.8     670.78    653.98
Total Elapsed Time (seconds)            5767.71   5742.30   5974.45   5852.32   5760.49

MMTests Statistics: vmstat
Page Ins                   3143712    3367600    3108596    3371952    3102548
Page Outs                104939296  105255268  105126820  105130540  105226620
Swap Ins                         0          0          0          0          0
Swap Outs                        0          0          0          0          0
Direct pages scanned          3521        131       7035          0          0
Kswapd pages scanned      23596104   23662641   23588211   23695015   23638226
Kswapd pages reclaimed    23594758   23661359   23587478   23693447   23637005
Direct pages reclaimed        3521        131       7031          0          0
Kswapd efficiency              99%        99%        99%        99%        99%
Kswapd velocity           4091.070   4120.760   3948.181   4048.824   4103.510
Direct efficiency             100%       100%        99%       100%       100%
Direct velocity              0.610      0.023      1.178      0.000      0.000
Percentage direct scans         0%         0%         0%         0%         0%
Page writes by reclaim          75         32         37        252         44
Slabs scanned              1843200    1927168    2714112    2801280    2738816
Direct inode steals              0          0          0          0          0
Kswapd inode steals        1827970    1822770    1669879    1819583    1681155
Compaction stalls                0          0          0          0          0
Compaction success               0          0          0          0          0
Compaction failures              0          0          0          0          0
Compaction pages moved           0          0          0     228180          0
Compaction move failure          0          0          0     637776          0

The number of pages written from reclaim is exceptionally low (2.6.38 was a total disaster, but that release was bad for a number of reasons; I haven't tested 2.6.38.8 yet) and was already reduced by 2.6.37 as expected. Direct reclaim usage was reduced and efficiency (the ratio of pages reclaimed to pages scanned) was high. As I look through the results I have at the moment, the number of pages written back was simply really low, which is why the problem fell off my radar.

> > That means the test is only using 1GB of disk space, and
> > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > triggering random page writeback from the LRU - 100x10MB files more
> > than fills memory, hence it being the smallest test case I could
> > reproduce the problem on.

My tests were on a machine with 8G and ext3. I'm running some of the tests against ext4 and xfs to see if that makes a difference, but it's possible the tests are simply not aggressive enough, so I want to reproduce Dave's test if possible.

I'm assuming "test 180" is from xfstests, which was not one of the tests I used previously. To run with 1000 files instead of 100, was the file "180" simply edited to make it look like this loop instead?

# create files and sync them
i=1;
while [ $i -lt 100 ]
do
        file=$SCRATCH_MNT/$i
        xfs_io -f -c "pwrite -b 64k -S 0xff 0 10m" $file > /dev/null
        if [ $? -ne 0 ]
        then
                echo error creating/writing file $file
                exit
        fi
        let i=$i+1
done

> > My triage notes are as follows, and the patch that fixes the bug is
> > attached below.
> >
> > <SNIP>
> >
> > <...>-393 [000] 696245.229559: xfs_ilock_nowait: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> > <...>-393 [000] 696245.229560: xfs_setfilesize: dev 253:16 ino 0x244099 isize 0xa00000 disize 0x94e000 new_size 0x0 offset 0x600000 count 3813376
> > <...>-393 [000] 696245.229561: xfs_iunlock: dev 253:16 ino 0x244099 flags ILOCK_EXCL caller xfs_setfilesize
> >
> > For an IO that was from offset 0x600000 for just under 4MB. The end
> > of that IO is at byte 10104832, which is _exactly_ what the inode
> > size says it is.
> >
> > It is very clear from the IO completions that we are getting a
> > *lot* of kswapd driven writeback directly through .writepage:
> >
> > $ grep "xfs_setfilesize:" t.t | grep "4096$" | wc -l
> > 801
> > $ grep "xfs_setfilesize:" t.t | grep -v "4096$" | wc -l
> > 78
> >
> > So there's ~900 IO completions that change the file size, and 90% of
> > them are single page updates.
> >
> > $ ps -ef | grep [k]swap
> > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > $ grep "writepage:" t.t | grep "514 " | wc -l
> > 799
> >
> > Oh, now that is too close to just be a coincidence. We're getting
> > significant amounts of random page writeback from the ends of
> > the LRUs done by the VM.
> >
> > <sigh>

Does the value for nr_vmscan_write in /proc/vmstat correlate? It must, but let me be sure because I'm using that figure rather than ftrace to count writebacks at the moment. A more relevant question is this - how many pages were reclaimed by kswapd and what percentage is 799 pages of that? What do you consider an acceptable percentage?

> On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > Looks good. I still wonder why I haven't been able to hit this.
> > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > filesystems and since yesterday 1k as well.
> >
> > It requires the test to run the VM out of RAM and then force enough
> > memory pressure for kswapd to start writeback from the LRU. The
> > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.

You say it's a 1G VM but you don't say what architecture. What is the size of the highest zone? If this is 32-bit x86, for example, the highest zone is HighMem and it would be really small. Unfortunately it would always be the first choice for allocating and reclaiming from, which would drastically increase the number of pages written back from reclaim.

--
Mel Gorman
SUSE Labs
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
From: Christoph Hellwig @ 2011-07-01 15:15 UTC
To: Mel Gorman
Cc: Christoph Hellwig, Johannes Weiner, Wu Fengguang, Dave Chinner, xfs, jack, linux-mm

On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
>
> Am adding Jan Kara as he has been working on writeback efficiency
> recently as well.
>
> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
>
> Against what kernel? 2.6.38 was a disaster for reclaim, I've been
> finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.

The patch series is against current 3.0-rc; I assume that's what Dave tested as well.

> I'm assuming "test 180" is from xfstests, which was not one of the tests
> I used previously. To run with 1000 files instead of 100, was the file
> "180" simply edited to make it look like this loop instead?

Yes, to both questions.
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
From: Dave Chinner @ 2011-07-02 2:42 UTC
To: Mel Gorman
Cc: Christoph Hellwig, Johannes Weiner, Wu Fengguang, xfs, jack, linux-mm

On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > Johannes, Mel, Wu,
>
> Am adding Jan Kara as he has been working on writeback efficiency
> recently as well.

Writeback looks to be working fine - it's kswapd screwing up the writeback patterns that appears to be the problem....

> > Dave has been stressing some XFS patches of mine that remove the XFS
> > internal writeback clustering in favour of using write_cache_pages.
>
> Against what kernel? 2.6.38 was a disaster for reclaim, I've been
> finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.

3.0-rc4

....

> The number of pages written from reclaim is exceptionally low (2.6.38
> was a total disaster, but that release was bad for a number of reasons;
> I haven't tested 2.6.38.8 yet) and was already reduced by 2.6.37 as
> expected. Direct reclaim usage was reduced and efficiency (the ratio of
> pages reclaimed to pages scanned) was high.

And is that consistent across ext3/ext4/xfs/btrfs filesystems? I doubt it very much, as all have very different .writepage behaviours...

BTW, calling a workload "fsmark" tells us nothing about the workload being tested - fsmark can do a lot of interesting things. IOWs, you need to quote the command line for it to be meaningful to anyone...

> As I look through the results I have at the moment, the number of
> pages written back was simply really low, which is why the problem fell
> off my radar.

It doesn't take many to completely screw up writeback IO patterns. Write a few random pages to a 10MB file well before writeback would get to the file, and instead of getting optimal sequential writeback patterns when writeback gets to it, we get multiple disjoint IOs that require multiple seeks to complete.

Slower, less efficient writeback IO causes memory pressure to last longer and hence is more likely to result in kswapd writeback, and it's just a downward spiral from there....

> > > That means the test is only using 1GB of disk space, and
> > > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > > triggering random page writeback from the LRU - 100x10MB files more
> > > than fills memory, hence it being the smallest test case I could
> > > reproduce the problem on.
>
> My tests were on a machine with 8G and ext3. I'm running some of
> the tests against ext4 and xfs to see if that makes a difference, but
> it's possible the tests are simply not aggressive enough, so I want to
> reproduce Dave's test if possible.

To tell the truth, I don't think anyone really cares how ext3 performs these days. XFS seems to be the filesystem that brings out all the bad behaviour in the mm subsystem....

FWIW, the mm subsystem works well enough when there is RAM available, so I'd suggest that your reclaim testing needs to focus on smaller memory configurations to really stress the reclaim algorithms. That's one of the reasons why I regularly test on 1GB, 1p machines - they show problems that are hard to reproduce on larger configs....
> I'm assuming "test 180" is from xfstests, which was not one of the tests
> I used previously. To run with 1000 files instead of 100, was the file
> "180" simply edited to make it look like this loop instead?

I reduced it to 100 files simply to speed up the testing process for the "bad file size" problem I was trying to find. If you want to reproduce the IO collapse in a big way, run it with 1000 files; it happens about 2/3rds of the way through the test on my hardware.

> > > It is very clear from the IO completions that we are getting a
> > > *lot* of kswapd driven writeback directly through .writepage:
> > >
> > > $ grep "xfs_setfilesize:" t.t | grep "4096$" | wc -l
> > > 801
> > > $ grep "xfs_setfilesize:" t.t | grep -v "4096$" | wc -l
> > > 78
> > >
> > > So there's ~900 IO completions that change the file size, and 90% of
> > > them are single page updates.
> > >
> > > $ ps -ef | grep [k]swap
> > > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > > $ grep "writepage:" t.t | grep "514 " | wc -l
> > > 799
> > >
> > > Oh, now that is too close to just be a coincidence. We're getting
> > > significant amounts of random page writeback from the ends of
> > > the LRUs done by the VM.
> > >
> > > <sigh>
>
> Does the value for nr_vmscan_write in /proc/vmstat correlate? It must,
> but let me be sure because I'm using that figure rather than ftrace to
> count writebacks at the moment.

The number in /proc/vmstat is higher. Much higher. I just ran the test at 1000 files (only collapsed to ~3000 iops this time because I ran it on a plain 3.0-rc4 kernel that still has the .writepage clustering in XFS), and I see:

nr_vmscan_write 6723

after the test. The event trace only captured ~1400 writepage events from kswapd, but it tends to miss a lot of events as the system is quite unresponsive at times under this workload - it's not uncommon to have ssh sessions not echo a character for 10s... e.g. I started the workload at ~11:08:22:

$ while [ 1 ]; do date; sleep 1; done
Sat Jul  2 11:08:15 EST 2011
Sat Jul  2 11:08:16 EST 2011
Sat Jul  2 11:08:17 EST 2011
Sat Jul  2 11:08:18 EST 2011
Sat Jul  2 11:08:19 EST 2011
Sat Jul  2 11:08:20 EST 2011
Sat Jul  2 11:08:21 EST 2011
Sat Jul  2 11:08:22 EST 2011 <<<<<<<< start test here
Sat Jul  2 11:08:23 EST 2011
Sat Jul  2 11:08:24 EST 2011
Sat Jul  2 11:08:25 EST 2011
Sat Jul  2 11:08:26 EST 2011 <<<<<<<<
Sat Jul  2 11:08:27 EST 2011 <<<<<<<<
Sat Jul  2 11:08:30 EST 2011 <<<<<<<<
Sat Jul  2 11:08:35 EST 2011 <<<<<<<<
Sat Jul  2 11:08:36 EST 2011
Sat Jul  2 11:08:37 EST 2011
Sat Jul  2 11:08:38 EST 2011 <<<<<<<<
Sat Jul  2 11:08:40 EST 2011 <<<<<<<<
Sat Jul  2 11:08:41 EST 2011
Sat Jul  2 11:08:42 EST 2011
Sat Jul  2 11:08:43 EST 2011

And there are quite a few more multi-second holdoffs during the test, too.

> A more relevant question is this -
> how many pages were reclaimed by kswapd and what percentage is 799
> pages of that? What do you consider an acceptable percentage?

I don't care what the percentage is or what the number is. kswapd is reclaiming pages most of the time without affecting IO patterns, and when that happens I just don't care because it is working just fine.

What I care about is what kswapd is doing when it finds dirty pages and decides they need to be written back. It's not a problem that they are found or need to be written; the problem is the utterly crap way that memory reclaim is throwing the pages at the filesystem.

I'm not sure how to get through to you guys that single, random page writeback is *BAD*.
Using .writepage directly is considered harmful to IO throughput, and memory reclaim needs to stop doing that. We've got hacks in the filesystems to try to make the IO memory reclaim executes suck less, but ultimately the problem is the IO memory reclaim is doing. And now the memory reclaim IO patterns are getting in the way of further improving the writeback path in XFS, because we're finding the hacks we've been carrying for years are *still* the only thing that is making IO under memory pressure not suck completely.

What I find extremely frustrating is that this is not a new issue. We (filesystem people) have been asking for a long time to have the memory reclaim subsystem either defer IO to the writeback threads or use the .writepages interface. We're not asking this to be difficult, we're asking for this so that we can cluster IO in an optimal manner to avoid these IO collapses that memory reclaim currently triggers. We now have generic methods of handing off IO to flusher threads that also provide some level of throttling/blocking while IO is submitted (e.g. writeback_inodes_sb_nr()), so this shouldn't be a difficult problem to solve for the memory reclaim subsystem.

Hell, maybe memory reclaim should take a leaf from the IO-less throttle work we are doing - hit a bunch of dirty pages on the LRU, just back off and let the writeback subsystem clean a few more pages before starting another scan. Letting the writeback code clean pages is the fastest way to get pages cleaned in the system, so if we've already got a generic method for cleaning and/or waiting for pages to be cleaned, why not aim to use that?

And while I'm ranting, when on earth is the issue-writeback-from-direct-reclaim problem going to be fixed so we can remove the hacks in the filesystem .writepage implementations to prevent this from occurring?

I mean, when we combine the two issues, doesn't it imply that the memory reclaim subsystem needs to be redesigned around the fact it *can't clean pages directly*? This IO collapse issue shows that we really don't want kswapd issuing IO directly via .writepage, and we already reject IO from direct reclaim in .writepage in ext4, XFS and BTRFS because we'll overrun the stack on anything other than trivial storage configurations. That says to me in a big, flashing bright pink neon sign way that memory reclaim simply should not be issuing IO at all. Perhaps it's time to rethink the way memory reclaim deals with dirty pages to take into account the current reality?

</rant>

> > On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > > Looks good. I still wonder why I haven't been able to hit this.
> > > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > > filesystems and since yesterday 1k as well.
> > >
> > > It requires the test to run the VM out of RAM and then force enough
> > > memory pressure for kswapd to start writeback from the LRU. The
> > > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
>
> You say it's a 1G VM but you don't say what architecture.

x86-64 for both the guest and the host.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
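[The following sketch is an editorial illustration of the handoff Dave is asking for above; it is not from the posted patch series. It assumes the 3.0-era writeback_inodes_sb_nr(sb, nr) interface (the wb_reason argument only appeared in later kernels), and the function name and batch size are invented for this example.]

	/*
	 * Minimal sketch: instead of calling ->writepage on a dirty page
	 * found at the tail of the LRU, hand a batch of oldest-first
	 * writeback to the flusher thread for that page's filesystem and
	 * back off.  Illustrative only; names are made up.
	 */
	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/writeback.h>
	#include <linux/backing-dev.h>

	#define RECLAIM_WRITE_BATCH	1024	/* pages per handoff; arbitrary */

	static void reclaim_defer_to_flusher(struct page *page)
	{
		struct address_space *mapping = page_mapping(page);
		struct super_block *sb;

		if (!mapping || !mapping->host)
			return;
		sb = mapping->host->i_sb;

		/*
		 * Let the flusher clean the oldest dirty pages in file
		 * offset order, rather than issuing single-page random IO
		 * in LRU order.  writeback_inodes_sb_nr() expects s_umount
		 * held for read.
		 */
		if (down_read_trylock(&sb->s_umount)) {
			writeback_inodes_sb_nr(sb, RECLAIM_WRITE_BATCH);
			up_read(&sb->s_umount);
		}

		/* Back off; the page should be found clean on a later pass. */
		congestion_wait(BLK_RW_ASYNC, HZ / 10);
	}

[The shape - hand off a batch, then throttle - is the point rather than the details; a real implementation would need fairness across superblocks and care not to stall kswapd indefinitely.]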
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
From: Mel Gorman @ 2011-07-05 14:10 UTC
To: Dave Chinner
Cc: Christoph Hellwig, Johannes Weiner, Wu Fengguang, xfs, jack, linux-mm

On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> > On Fri, Jul 01, 2011 at 05:33:05AM -0400, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> >
> > Am adding Jan Kara as he has been working on writeback efficiency
> > recently as well.
>
> Writeback looks to be working fine - it's kswapd screwing up the
> writeback patterns that appears to be the problem....

Not a new complaint.

> > > Dave has been stressing some XFS patches of mine that remove the XFS
> > > internal writeback clustering in favour of using write_cache_pages.
> >
> > Against what kernel? 2.6.38 was a disaster for reclaim, I've been
> > finding out this week. I don't know about 2.6.38.8. 2.6.39 was better.
>
> 3.0-rc4

Ok.

> ....
> > The number of pages written from reclaim is exceptionally low (2.6.38
> > was a total disaster, but that release was bad for a number of reasons;
> > I haven't tested 2.6.38.8 yet) and was already reduced by 2.6.37 as
> > expected. Direct reclaim usage was reduced and efficiency (the ratio of
> > pages reclaimed to pages scanned) was high.
>
> And is that consistent across ext3/ext4/xfs/btrfs filesystems? I
> doubt it very much, as all have very different .writepage
> behaviours...

Some preliminary results are in and it looks like it is close to the same across filesystems, which was a surprise to me. Sometimes the filesystem makes a difference to how many pages are written back, but it's not consistent across all tests, i.e. in comparing ext3, ext4 and xfs there are big differences in performance but moderate differences in pages written back. This implies that for the configurations I was testing, pages are generally cleaned before reaching the end of the LRU. In all cases, the machines had ample memory. More on that later.

> BTW, calling a workload "fsmark" tells us nothing about the workload
> being tested - fsmark can do a lot of interesting things. IOWs, you
> need to quote the command line for it to be meaningful to anyone...

My bad.

./fs_mark -d /tmp/fsmark-14880 -D 225 -N 22500 -n 3125 -L 15 -t 16 -S0 -s 131072

> > As I look through the results I have at the moment, the number of
> > pages written back was simply really low, which is why the problem fell
> > off my radar.
>
> It doesn't take many to completely screw up writeback IO patterns.
> Write a few random pages to a 10MB file well before writeback would
> get to the file, and instead of getting optimal sequential writeback
> patterns when writeback gets to it, we get multiple disjoint IOs
> that require multiple seeks to complete.
>
> Slower, less efficient writeback IO causes memory pressure to last
> longer and hence is more likely to result in kswapd writeback, and it's
> just a downward spiral from there....

Yes, I see the negative feedback loop. This has always been a struggle in that kswapd needs pages from a particular zone to be cleaned and freed, but calling writepage can make things slower. There were prototypes in the past to give the flusher threads hints about which inodes and pages should be cleaned and freed, and they were never met with any degree of satisfaction.
The consensus (among VM people at least) was that as long as that number was low, it wasn't much of a problem. I know you disagree.

> > > > That means the test is only using 1GB of disk space, and
> > > > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > > > triggering random page writeback from the LRU - 100x10MB files more
> > > > than fills memory, hence it being the smallest test case I could
> > > > reproduce the problem on.
> >
> > My tests were on a machine with 8G and ext3. I'm running some of
> > the tests against ext4 and xfs to see if that makes a difference, but
> > it's possible the tests are simply not aggressive enough, so I want to
> > reproduce Dave's test if possible.
>
> To tell the truth, I don't think anyone really cares how ext3
> performs these days.

I do, but the reasoning is weak. I wanted to be able to compare kernels between 2.6.32 and today with few points of variability. ext3 changed relatively little between those times.

> XFS seems to be the filesystem that brings out
> all the bad behaviour in the mm subsystem....
>
> FWIW, the mm subsystem works well enough when there is RAM
> available, so I'd suggest that your reclaim testing needs to focus
> on smaller memory configurations to really stress the reclaim
> algorithms. That's one of the reasons why I regularly test on 1GB, 1p
> machines - they show problems that are hard to reproduce on larger
> configs....

Based on the results coming in, I fully agree. I'm going to let the tests run to completion so I'll have the data in the future. I'll then go back and test for 1G, 1P configurations and it should be reproducible.

> > I'm assuming "test 180" is from xfstests, which was not one of the tests
> > I used previously. To run with 1000 files instead of 100, was the file
> > "180" simply edited to make it look like this loop instead?
>
> I reduced it to 100 files simply to speed up the testing process for
> the "bad file size" problem I was trying to find. If you want to
> reproduce the IO collapse in a big way, run it with 1000 files; it
> happens about 2/3rds of the way through the test on my hardware.

Ok, I have a test prepared that will run this. At the rate tests are currently going though, it could be Thursday before I can run them :(

> > > > It is very clear from the IO completions that we are getting a
> > > > *lot* of kswapd driven writeback directly through .writepage:
> > > >
> > > > $ grep "xfs_setfilesize:" t.t | grep "4096$" | wc -l
> > > > 801
> > > > $ grep "xfs_setfilesize:" t.t | grep -v "4096$" | wc -l
> > > > 78
> > > >
> > > > So there's ~900 IO completions that change the file size, and 90% of
> > > > them are single page updates.
> > > >
> > > > $ ps -ef | grep [k]swap
> > > > root       514     2  0 12:43 ?        00:00:00 [kswapd0]
> > > > $ grep "writepage:" t.t | grep "514 " | wc -l
> > > > 799
> > > >
> > > > Oh, now that is too close to just be a coincidence. We're getting
> > > > significant amounts of random page writeback from the ends of
> > > > the LRUs done by the VM.
> > > >
> > > > <sigh>
> >
> > Does the value for nr_vmscan_write in /proc/vmstat correlate? It must,
> > but let me be sure because I'm using that figure rather than ftrace to
> > count writebacks at the moment.
>
> The number in /proc/vmstat is higher. Much higher. I just ran the
> test at 1000 files (only collapsed to ~3000 iops this time because I
> ran it on a plain 3.0-rc4 kernel that still has the .writepage
> clustering in XFS), and I see:
>
> nr_vmscan_write 6723
>
> after the test.
> The event trace only captured ~1400 writepage events
> from kswapd, but it tends to miss a lot of events as the system is
> quite unresponsive at times under this workload - it's not uncommon
> to have ssh sessions not echo a character for 10s... e.g. I started
> the workload at ~11:08:22:

Ok, I'll be looking at nr_vmscan_write as the basis for "badness".

> $ while [ 1 ]; do date; sleep 1; done
> Sat Jul  2 11:08:15 EST 2011
> Sat Jul  2 11:08:16 EST 2011
> Sat Jul  2 11:08:17 EST 2011
> Sat Jul  2 11:08:18 EST 2011
> Sat Jul  2 11:08:19 EST 2011
> Sat Jul  2 11:08:20 EST 2011
> Sat Jul  2 11:08:21 EST 2011
> Sat Jul  2 11:08:22 EST 2011 <<<<<<<< start test here
> Sat Jul  2 11:08:23 EST 2011
> Sat Jul  2 11:08:24 EST 2011
> Sat Jul  2 11:08:25 EST 2011
> Sat Jul  2 11:08:26 EST 2011 <<<<<<<<
> Sat Jul  2 11:08:27 EST 2011 <<<<<<<<
> Sat Jul  2 11:08:30 EST 2011 <<<<<<<<
> Sat Jul  2 11:08:35 EST 2011 <<<<<<<<
> Sat Jul  2 11:08:36 EST 2011
> Sat Jul  2 11:08:37 EST 2011
> Sat Jul  2 11:08:38 EST 2011 <<<<<<<<
> Sat Jul  2 11:08:40 EST 2011 <<<<<<<<
> Sat Jul  2 11:08:41 EST 2011
> Sat Jul  2 11:08:42 EST 2011
> Sat Jul  2 11:08:43 EST 2011
>
> And there are quite a few more multi-second holdoffs during the
> test, too.
>
> > A more relevant question is this -
> > how many pages were reclaimed by kswapd and what percentage is 799
> > pages of that? What do you consider an acceptable percentage?
>
> I don't care what the percentage is or what the number is. kswapd is
> reclaiming pages most of the time without affecting IO patterns, and
> when that happens I just don't care because it is working just fine.

I do care. I'm looking at some early XFS results here based on a laptop (4G). For fsmark with the command line above, the number of pages written back by kswapd was 0. The worst test by far was sysbench using a particularly large database. The number of writes was 48745, which is 0.27% of pages scanned or 0.28% of pages reclaimed. Ordinarily I would ignore that.

If I run this at 1G and get a similar ratio, I will assume that I am not reproducing your problem at all unless I know what ratio you are seeing. So .... how many pages were reclaimed by kswapd and what percentage is 799 pages of that?

You answered my second question. You consider 0% to be the acceptable percentage.
> What I care about is what kswapd is doing when it finds dirty pages
> and decides they need to be written back. It's not a problem that
> they are found or need to be written; the problem is the utterly
> crap way that memory reclaim is throwing the pages at the filesystem.
>
> I'm not sure how to get through to you guys that single, random page
> writeback is *BAD*.

It got through. The feedback during discussions on the VM side was that as long as the percentage was sufficiently low it wasn't a problem, because on occasion the VM really needs pages from a particular zone. A solution that addresses both problems has never been agreed on, and energy and time run out before it gets fixed each time.

> Using .writepage directly is considered harmful
> to IO throughput, and memory reclaim needs to stop doing that.
> We've got hacks in the filesystems to try to make the IO memory
> reclaim executes suck less, but ultimately the problem is the IO
> memory reclaim is doing. And now the memory reclaim IO patterns are
> getting in the way of further improving the writeback path in XFS,
> because we're finding the hacks we've been carrying for years are
> *still* the only thing that is making IO under memory pressure not
> suck completely.
>
> What I find extremely frustrating is that this is not a new issue.

I know.

> We (filesystem people) have been asking for a long time to have the
> memory reclaim subsystem either defer IO to the writeback threads or
> use the .writepages interface.

There were prototypes along these lines. One of the criticisms was that it was fixing the wrong problem, because dirty pages shouldn't be at the end of the LRU at all. Later work focused on fixing that and it was never revisited (at least not by me). There was a bucket of complaints about the initial series at https://lkml.org/lkml/2010/6/8/82 . Despite the fact I wrote it, I will have to read back to see why I stopped working on it, but I think it's because I focused on avoiding dirty pages reaching the end of the LRU, judging by https://lkml.org/lkml/2010/6/11/157 , and eventually was satisfied that the ratio of pages scanned to pages written was acceptable.

> We're not asking this to be
> difficult, we're asking for this so that we can cluster IO in an
> optimal manner to avoid these IO collapses that memory reclaim
> currently triggers. We now have generic methods of handing off IO
> to flusher threads that also provide some level of throttling/
> blocking while IO is submitted (e.g. writeback_inodes_sb_nr()), so
> this shouldn't be a difficult problem to solve for the memory
> reclaim subsystem.
>
> Hell, maybe memory reclaim should take a leaf from the IO-less
> throttle work we are doing - hit a bunch of dirty pages on the LRU,
> just back off and let the writeback subsystem clean a few more pages
> before starting another scan.

Prototyped this before, although I can't find it now. I think I concluded at the time that it didn't really help and another direction was taken. There was also the problem that the time to clean a page from a particular zone was potentially unbounded and a solution didn't present itself.

> Letting the writeback code clean
> pages is the fastest way to get pages cleaned in the system, so if
> we've already got a generic method for cleaning and/or waiting for
> pages to be cleaned, why not aim to use that?
>
> And while I'm ranting, when on earth is the issue-writeback-from-
> direct-reclaim problem going to be fixed so we can remove the hacks
> in the filesystem .writepage implementations to prevent this from
> occurring?

Prototyped that too, same thread. Same type of problem: writeback from direct reclaim should happen so rarely that it should not be optimised for. See https://lkml.org/lkml/2010/6/11/32

> I mean, when we combine the two issues, doesn't it imply that the
> memory reclaim subsystem needs to be redesigned around the fact it
> *can't clean pages directly*? This IO collapse issue shows that we
> really don't want kswapd issuing IO directly via .writepage, and
> we already reject IO from direct reclaim in .writepage in ext4, XFS
> and BTRFS because we'll overrun the stack on anything other than
> trivial storage configurations.
>
> That says to me in a big, flashing bright pink neon sign way that
> memory reclaim simply should not be issuing IO at all. Perhaps it's
> time to rethink the way memory reclaim deals with dirty pages to
> take into account the current reality?
>
> </rant>

At the risk of pissing you off, this isn't new information, so I'll consider myself duly nudged into revisiting.

> > > On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > > Looks good. I still wonder why I haven't been able to hit this.
> > > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > > filesystems and since yesterday 1k as well.
> > >
> > > It requires the test to run the VM out of RAM and then force enough
> > > memory pressure for kswapd to start writeback from the LRU. The
> > > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> >
> > You say it's a 1G VM but you don't say what architecture.
>
> x86-64 for both the guest and the host.

Grand.

--
Mel Gorman
SUSE Labs
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
From: Dave Chinner @ 2011-07-05 15:55 UTC
To: Mel Gorman
Cc: Christoph Hellwig, Johannes Weiner, Wu Fengguang, xfs, jack, linux-mm

On Tue, Jul 05, 2011 at 03:10:16PM +0100, Mel Gorman wrote:
> On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 03:59:35PM +0100, Mel Gorman wrote:
> > BTW, calling a workload "fsmark" tells us nothing about the workload
> > being tested - fsmark can do a lot of interesting things. IOWs, you
> > need to quote the command line for it to be meaningful to anyone...
>
> My bad.
>
> ./fs_mark -d /tmp/fsmark-14880 -D 225 -N 22500 -n 3125 -L 15 -t 16 -S0 -s 131072

Ok, so 16 threads, 3125 files per thread, 128k per file, all created into the same directory, which rolls over when it gets to 22500 files in the directory. Yeah, it generates a bit of memory pressure, but I think the file sizes are too small to really stress writeback much. You need to use files that are at least 10MB in size to really start to mix up the writeback lists and the way they juggle new and old inodes to try not to starve any particular inode of writeback bandwidth....

Also, I don't use the "-t <num>" threading mechanism because all it does is bash on the directory mutex without really improving parallelism for creates. perf top on my system shows:

samples  pcnt function                           DSO
_______ _____ __________________________________ __________________________________

2799.00  9.3% mutex_spin_on_owner                [kernel.kallsyms]
2049.00  6.8% copy_user_generic_string           [kernel.kallsyms]
1912.00  6.3% _raw_spin_unlock_irqrestore        [kernel.kallsyms]

A contended mutex as the prime CPU consumer. That's more CPU than copying 750MB/s of data. Hence I normally drive parallelism with fsmark by using multiple "-d <dir>" options, which runs a thread per directory and a workload unit per directory, so you don't get directory mutex contention causing serialisation and interference with what you are really trying to measure....

> > > As I look through the results I have at the moment, the number of
> > > pages written back was simply really low, which is why the problem fell
> > > off my radar.
> >
> > It doesn't take many to completely screw up writeback IO patterns.
> > Write a few random pages to a 10MB file well before writeback would
> > get to the file, and instead of getting optimal sequential writeback
> > patterns when writeback gets to it, we get multiple disjoint IOs
> > that require multiple seeks to complete.
> >
> > Slower, less efficient writeback IO causes memory pressure to last
> > longer and hence is more likely to result in kswapd writeback, and it's
> > just a downward spiral from there....
>
> Yes, I see the negative feedback loop. This has always been a struggle
> in that kswapd needs pages from a particular zone to be cleaned and
> freed, but calling writepage can make things slower. There were
> prototypes in the past to give the flusher threads hints about which
> inodes and pages should be cleaned and freed, and they were never met
> with any degree of satisfaction.
>
> The consensus (among VM people at least) was that as long as that number
> was low, it wasn't much of a problem.

Therein lies the problem.
You've got storage people telling you there is an IO problem with memory reclaim, but the mm community then put their heads together somewhere private, decide it isn't a problem worth fixing, and do nothing. Rinse, lather, repeat.

I expect memory reclaim to play nicely with writeback that is already in progress. These subsystems do not work in isolation, yet memory reclaim treats it that way - as though it is the most important IO submitter and everything else can suffer while memory reclaim does its stuff. Memory reclaim needs to co-ordinate with writeback effectively for the system as a whole to work well together.

> I know you disagree.

Right, that's because it doesn't have to be a very high number to be a problem. IO is orders of magnitude slower than the CPU time it takes to flush a page, so the cost of making a bad flush decision is very high. And single page writeback from the LRU is almost always a bad flush decision.

> > > > > Oh, now that is too close to just be a coincidence. We're getting
> > > > > significant amounts of random page writeback from the ends of
> > > > > the LRUs done by the VM.
> > > > >
> > > > > <sigh>
> > >
> > > Does the value for nr_vmscan_write in /proc/vmstat correlate? It must,
> > > but let me be sure because I'm using that figure rather than ftrace to
> > > count writebacks at the moment.
> >
> > The number in /proc/vmstat is higher. Much higher. I just ran the
> > test at 1000 files (only collapsed to ~3000 iops this time because I
> > ran it on a plain 3.0-rc4 kernel that still has the .writepage
> > clustering in XFS), and I see:
> >
> > nr_vmscan_write 6723
> >
> > after the test. The event trace only captured ~1400 writepage events
> > from kswapd, but it tends to miss a lot of events as the system is
> > quite unresponsive at times under this workload - it's not uncommon
> > to have ssh sessions not echo a character for 10s... e.g. I started
> > the workload at ~11:08:22:
>
> Ok, I'll be looking at nr_vmscan_write as the basis for "badness".

Perhaps you should look at my other reply (and two line "fix") in the thread about stopping dirty page writeback until after waiting on pages under writeback.....

> > A more relevant question is this -
> > how many pages were reclaimed by kswapd and what percentage is 799
> > pages of that? What do you consider an acceptable percentage?
>
> I don't care what the percentage is or what the number is. kswapd is
> reclaiming pages most of the time without affecting IO patterns, and
> when that happens I just don't care because it is working just fine.

> I do care. I'm looking at some early XFS results here based on a laptop
> (4G). For fsmark with the command line above, the number of pages
> written back by kswapd was 0. The worst test by far was sysbench using a
> particularly large database. The number of writes was 48745, which is
> 0.27% of pages scanned or 0.28% of pages reclaimed. Ordinarily I would
> ignore that.
>
> If I run this at 1G and get a similar ratio, I will assume that I
> am not reproducing your problem at all unless I know what ratio you
> are seeing.

Single threaded writing of files should -never- cause writeback from the LRUs. If that is happening, then the memory reclaim throttling is broken. See my other email.

> So .... how many pages were reclaimed by kswapd and what percentage
> is 799 pages of that?

No idea. That information is long gone....

> You answered my second question. You consider 0% to be the acceptable
> percentage.
No, I expect memory reclaim to behave nicely with writeback that is already in progress. These subsystems do not work in isolation - they need to co-ordinate.

> > What I care about is what kswapd is doing when it finds dirty pages
> > and decides they need to be written back. It's not a problem that
> > they are found or need to be written; the problem is the utterly
> > crap way that memory reclaim is throwing the pages at the filesystem.
> >
> > I'm not sure how to get through to you guys that single, random page
> > writeback is *BAD*.
>
> It got through. The feedback during discussions on the VM side was
> that as long as the percentage was sufficiently low it wasn't a problem,
> because on occasion the VM really needs pages from a particular zone.
> A solution that addresses both problems has never been agreed on, and
> energy and time run out before it gets fixed each time.

<sigh>

> > And while I'm ranting, when on earth is the issue-writeback-from-
> > direct-reclaim problem going to be fixed so we can remove the hacks
> > in the filesystem .writepage implementations to prevent this from
> > occurring?
>
> Prototyped that too, same thread. Same type of problem: writeback
> from direct reclaim should happen so rarely that it should not be
> optimised for. See https://lkml.org/lkml/2010/6/11/32

Writeback from direct reclaim crashes systems by causing stack overruns - that's why we've disabled it. It's not an "optimisation" problem - it's a _memory corruption_ bug that needs to be fixed.....

> At the risk of pissing you off, this isn't new information, so I'll
> consider myself duly nudged into revisiting.

No, I've had a rant to express my displeasure at the lack of progress on this front.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
From: Christoph Hellwig @ 2011-07-11 10:26 UTC
To: Dave Chinner
Cc: Mel Gorman, Christoph Hellwig, Johannes Weiner, Wu Fengguang, xfs, jack, linux-mm

On Sat, Jul 02, 2011 at 12:42:19PM +1000, Dave Chinner wrote:
> To tell the truth, I don't think anyone really cares how ext3
> performs these days. XFS seems to be the filesystem that brings out
> all the bad behaviour in the mm subsystem....

Maybe that's because XFS actually plays by the rules?

btrfs simply rejects all attempts from reclaim to write back, as it has the following check:

	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

while XFS tries to play nice and allow writeback from kswapd:

	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
		goto redirty;

ext4 can't perform delalloc conversions from writepage:

	if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
			      ext4_bh_delay_or_unwritten)) {
		/*
		 * We don't want to do block allocation, so redirty
		 * the page and return. We may reach here when we do
		 * a journal commit via journal_submit_inode_data_buffers.
		 * We can also reach here via shrink_page_list
		 */
		goto redirty_pages;
	}

so normal workloads that don't involve overwrites will never get any writeback from kswapd. This should tell us that the VM can live just fine without doing writeback from kswapd, as otherwise all systems using btrfs or ext4 would have completely fallen over. It also suggests we should have standardized helpers in the VFS to work around the braindead VM behaviour.
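[To make the "standardized helpers" suggestion concrete: a hypothetical VFS helper of roughly the following shape could replace the three open-coded checks quoted above. The name and the allow_kswapd parameter are invented for this illustration - whether kswapd, unlike direct reclaim, should still be allowed to write is exactly the policy question the thread is arguing about.]

	/*
	 * Hypothetical VFS helper -- invented for illustration, not from
	 * any posted patch.  Filesystems would call this at the top of
	 * their ->writepage implementations.
	 */
	#include <linux/sched.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>
	#include <linux/writeback.h>

	static inline bool reject_reclaim_writepage(struct page *page,
						    struct writeback_control *wbc,
						    bool allow_kswapd)
	{
		unsigned int flags = current->flags & (PF_MEMALLOC | PF_KSWAPD);

		/*
		 * Direct reclaim (PF_MEMALLOC without PF_KSWAPD): stack
		 * usage is unbounded, never write back here.  Optionally
		 * reject kswapd as well, as btrfs does.
		 */
		if (flags == PF_MEMALLOC ||
		    (!allow_kswapd && (flags & PF_MEMALLOC))) {
			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
			return true;
		}
		return false;
	}

[With allow_kswapd=true this matches the XFS check above; with allow_kswapd=false it matches the btrfs behaviour.]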
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
From: Wu Fengguang @ 2011-07-01 15:41 UTC
To: Christoph Hellwig
Cc: Mel Gorman, Johannes Weiner, Dave Chinner, xfs@oss.sgi.com, linux-mm@kvack.org

Christoph,

On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> Johannes, Mel, Wu,
>
> Dave has been stressing some XFS patches of mine that remove the XFS
> internal writeback clustering in favour of using write_cache_pages.
>
> As part of investigating the behaviour he found out that we're still
> doing lots of I/O from the end of the LRU in kswapd. Not only is that
> pretty bad behaviour in general, but it also means we really can't
> just remove the writeback clustering in writepage given how much
> I/O is still done through that.
>
> Any chance we could get the writeback vs kswapd behaviour sorted out a bit
> better finally?

I once tried this approach:

http://www.spinics.net/lists/linux-mm/msg09202.html

It used a list structure that is not linearly scalable; however, that part should be independently improvable when necessary.

The real problem was, it seemed not very effective in my test runs. I found many ->nr_pages works queued before the ->inode works, which effectively makes the flusher work on more dispersed pages rather than focusing on the dirty pages encountered in LRU reclaim.

So for the patch to work efficiently, we'll need to first merge the ->nr_pages works and make them lower priority than the ->inode works.

Thanks,
Fengguang

> Some excerpts from the previous discussion:
>
> On Fri, Jul 01, 2011 at 02:18:51PM +1000, Dave Chinner wrote:
> > I'm now only running test 180 on 100 files rather than the 1000 the
> > test normally runs on, because it's faster and still shows the
> > problem. That means the test is only using 1GB of disk space, and
> > I'm running on a VM with 1GB RAM. It appears to be related to the VM
> > triggering random page writeback from the LRU - 100x10MB files more
> > than fills memory, hence it being the smallest test case I could
> > reproduce the problem on.
> >
> > My triage notes are as follows, and the patch that fixes the bug is
> > attached below.
> > <SNIP triage notes quoted in full earlier in the thread>
> >
> > Oh, now that is too close to just be a coincidence. We're getting
> > significant amounts of random page writeback from the ends of
> > the LRUs done by the VM.
> >
> > <sigh>
>
> On Fri, Jul 01, 2011 at 07:20:21PM +1000, Dave Chinner wrote:
> > > Looks good. I still wonder why I haven't been able to hit this.
> > > Haven't seen any 180 failure for a long time, with both 4k and 512 byte
> > > filesystems and since yesterday 1k as well.
> >
> > It requires the test to run the VM out of RAM and then force enough
> > memory pressure for kswapd to start writeback from the LRU. The
> > reproducer I have is a 1p, 1GB RAM VM with its disk image on a
> > 100MB/s HW RAID1 w/ 512MB BBWC disk subsystem.
> >
> > When kswapd starts doing writeback from the LRU, the iops rate goes
> > through the roof (from ~300iops @~320k/io to ~7000iops @4k/io) and
> > throughput drops from 100MB/s to ~30MB/s. BBWC is the only reason
> > the IOPS stays as high as it does - maybe that is why I saw this and
> > you haven't.
> >
> > As it is, the kswapd writeback behaviour is utterly atrocious and,
> > ultimately, quite easy to provoke. I wish the MM folk would fix that
> > goddamn problem already - we've only been complaining about it for
> > the last 6 or 7 years.
> > As such, I'm wondering if it's a bad idea to even consider removing the .writepage clustering... ^ permalink raw reply [flat|nested] 20+ messages in thread
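For illustration of the ordering problem Wu describes above - targeted ->inode works getting stuck behind earlier, unfocused ->nr_pages works - one conceivable fix is to queue the reclaim-driven works at the head of the bdi work list. A minimal, untested sketch against the 3.0-era fs/fs-writeback.c structures; the work->inode test mirrors the patch Wu posts later in this thread, and none of this is committed code:

static void bdi_queue_work_prio(struct backing_dev_info *bdi,
				struct wb_writeback_work *work)
{
	spin_lock_bh(&bdi->wb_lock);
	/*
	 * Targeted inode works come from LRU reclaim and should not
	 * wait behind large, dispersed ->nr_pages background works.
	 */
	if (work->inode)
		list_add(&work->list, &bdi->work_list);
	else
		list_add_tail(&work->list, &bdi->work_list);
	if (bdi->wb.task)
		wake_up_process(bdi->wb.task);
	spin_unlock_bh(&bdi->wb_lock);
}

Merging adjacent ->nr_pages works so they cannot starve the inode works would be the other half of what Wu suggests; that is not shown here.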
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering 2011-07-01 15:41 ` Wu Fengguang @ 2011-07-04 3:25 ` Dave Chinner 2011-07-05 14:34 ` Mel Gorman ` (2 more replies) 0 siblings, 3 replies; 20+ messages in thread From: Dave Chinner @ 2011-07-04 3:25 UTC (permalink / raw) To: Wu Fengguang Cc: Christoph Hellwig, Mel Gorman, Johannes Weiner, xfs@oss.sgi.com, linux-mm@kvack.org On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote: > Christoph, > > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote: > > Johannes, Mel, Wu, > > > > Dave has been stressing some XFS patches of mine that remove the XFS > > internal writeback clustering in favour of using write_cache_pages. > > > > As part of investigating the behaviour he found out that we're still > > doing lots of I/O from the end of the LRU in kswapd. Not only is that > > pretty bad behaviour in general, but it also means we really can't > > just remove the writeback clustering in writepage given how much > > I/O is still done through that. > > > > Any chance we could the writeback vs kswap behaviour sorted out a bit > > better finally? > > I once tried this approach: > > http://www.spinics.net/lists/linux-mm/msg09202.html > > It used a list structure that is not linearly scalable, however that > part should be independently improvable when necessary. I don't think that handing random writeback to the flusher thread is much better than doing random writeback directly. Yes, you added some clustering, but I still don't think writing specific pages is the best solution. > The real problem was, it seem to not very effective in my test runs. > I found many ->nr_pages works queued before the ->inode works, which > effectively makes the flusher working on more dispersed pages rather > than focusing on the dirty pages encountered in LRU reclaim. But that's really just an implementation issue related to how you tried to solve the problem. That could be addressed. However, what I'm questioning is whether we should even care what page memory reclaim wants to write - it seems to make fundamentally bad decisions from an IO perspective. We have to remember that memory reclaim is doing LRU reclaim and the flusher threads are doing "oldest first" writeback. IOWs, both are trying to operate in the same direction (oldest to youngest) for the same purpose. The fundamental problem that occurs when memory reclaim starts writing pages back from the LRU is this: - memory reclaim has run ahead of IO writeback - The LRU usually looks like this:

   oldest                                       youngest
   +---------------+---------------+--------------+
         clean         writeback        dirty
           ^               ^
           |               |
           |               Where flusher will next work from
           |               Where kswapd is working from
           |
           IO submitted by flusher, waiting on completion

If memory reclaim is hitting dirty pages on the LRU, it means it has got ahead of writeback without being throttled - it's passed over all the pages currently under writeback and is trying to write back pages that are *newer* than what writeback is working on. IOWs, it starts trying to do the job of the flusher threads, and it does that very badly. The $100 question is *why is it getting ahead of writeback*? From a brief look at the vmscan code, it appears that scanning does not throttle/block until reclaim priority has got pretty high. That means at low priority reclaim, it *skips pages under writeback*. However, if it comes across a dirty page, it will trigger writeback of the page.
Now call me crazy, but if we've already got a large number of pages under writeback, why would we want to *start more IO* when clearly the system is taking care of cleaning pages already and all we have to do is wait for a short while to get clean pages ready for reclaim? Indeed, I added this quick hack to prevent the VM from doing writeback via pageout until after it starts blocking on writeback pages:

@@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
+				goto keep_locked;
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)

IOWs, we don't write pages from kswapd unless there is no IO writeback going on at all (waited on all the writeback pages or none exist) and there are dirty pages on the LRU. This doesn't completely stop the IO collapse (it looks like foreground throttling is the other cause, which IO-less write throttling fixes), but the collapse was significantly reduced in duration and intensity by removing kswapd writeback. In fact, the IO rate only dropped to ~60MB/s instead of 30MB/s, and the improvement is easily measured by the runtime of the test:

                 run 1    run 2    run 3
3.0-rc5-vanilla   135s     137s     138s
3.0-rc5-patched   117s     115s     115s

That's a pretty massive improvement for a 2-line patch. ;) I expect the IO-less write throttling patchset will further improve this. FWIW, the nr_vmscan_write values changed like this:

                 run 1    run 2    run 3
3.0-rc5-vanilla   6751     6893     6465
3.0-rc5-patched      0        0        0

These results support my argument that memory reclaim should not be doing dirty page writeback at all - deferring writeback to the writeback infrastructure and just waiting for it to complete appropriately is the Right Thing To Do. i.e. IO-less memory reclaim works better than the current code for the same reason IO-less write throttling works better than the current code.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
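The "why start more IO" check Dave argues for can be made concrete with information the zone counters already carry. A rough, untested sketch of such a gate; the halving threshold is an arbitrary assumption for illustration and is not part of Dave's patch:

static bool zone_io_already_in_flight(struct zone *zone)
{
	unsigned long dirty = zone_page_state(zone, NR_FILE_DIRTY);
	unsigned long writeback = zone_page_state(zone, NR_WRITEBACK);

	/*
	 * Plenty of pages are already being cleaned for this zone;
	 * waiting for completions beats submitting random 4k writes.
	 */
	return writeback >= dirty / 2;
}

shrink_page_list() could then "goto keep_locked" for dirty pages while this returns true, which is roughly the behaviour Dave's RECLAIM_MODE_SYNC test achieves indirectly.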
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering 2011-07-04 3:25 ` Dave Chinner @ 2011-07-05 14:34 ` Mel Gorman 2011-07-06 1:23 ` Dave Chinner 2011-07-11 11:10 ` Christoph Hellwig 2011-07-06 4:53 ` Wu Fengguang 2011-07-06 15:12 ` Johannes Weiner 2 siblings, 2 replies; 20+ messages in thread From: Mel Gorman @ 2011-07-05 14:34 UTC (permalink / raw) To: Dave Chinner Cc: Wu Fengguang, Christoph Hellwig, Johannes Weiner, xfs@oss.sgi.com, linux-mm@kvack.org On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote: > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote: > > Christoph, > > > > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote: > > > Johannes, Mel, Wu, > > > > > > Dave has been stressing some XFS patches of mine that remove the XFS > > > internal writeback clustering in favour of using write_cache_pages. > > > > > > As part of investigating the behaviour he found out that we're still > > > doing lots of I/O from the end of the LRU in kswapd. Not only is that > > > pretty bad behaviour in general, but it also means we really can't > > > just remove the writeback clustering in writepage given how much > > > I/O is still done through that. > > > > > > Any chance we could the writeback vs kswap behaviour sorted out a bit > > > better finally? > > > > I once tried this approach: > > > > http://www.spinics.net/lists/linux-mm/msg09202.html > > > > It used a list structure that is not linearly scalable, however that > > part should be independently improvable when necessary. > > I don't think that handing random writeback to the flusher thread is > much better than doing random writeback directly. Yes, you added > some clustering, but I'm still don't think writing specific pages is > the best solution. > > > The real problem was, it seem to not very effective in my test runs. > > I found many ->nr_pages works queued before the ->inode works, which > > effectively makes the flusher working on more dispersed pages rather > > than focusing on the dirty pages encountered in LRU reclaim. > > But that's really just an implementation issue related to how you > tried to solve the problem. That could be addressed. > > However, what I'm questioning is whether we should even care what > page memory reclaim wants to write - it seems to make fundamentally > bad decisions from an IO persepctive. > It sucks from an IO perspective but from the perspective of the VM that needs memory to be free in a particular zone or node, it's a reasonable request. > We have to remember that memory reclaim is doing LRU reclaim and the > flusher threads are doing "oldest first" writeback. IOWs, both are trying > to operate in the same direction (oldest to youngest) for the same > purpose. The fundamental problem that occurs when memory reclaim > starts writing pages back from the LRU is this: > > - memory reclaim has run ahead of IO writeback - > This reasoning was the basis for this patch http://www.gossamer-threads.com/lists/linux/kernel/1251235?do=post_view_threaded#1251235 i.e. if old pages are dirty then the flusher threads are either not awake or not doing enough work so wake them. It was flawed in a number of respects and never finished though. 
> The LRU usually looks like this:
>
>    oldest                                       youngest
>    +---------------+---------------+--------------+
>          clean         writeback        dirty
>            ^               ^
>            |               |
>            |               Where flusher will next work from
>            |               Where kswapd is working from
>            |
>            IO submitted by flusher, waiting on completion
>
> If memory reclaim is hitting dirty pages on the LRU, it means it has > got ahead of writeback without being throttled - it's passed over > all the pages currently under writeback and is trying to write back > pages that are *newer* than what writeback is working on. IOWs, it > starts trying to do the job of the flusher threads, and it does that > very badly. > > The $100 question is *why is it getting ahead of writeback*? Allocating and dirtying memory faster than writeback. Large dd to USB stick would also trigger it. > From a brief look at the vmscan code, it appears that scanning does > not throttle/block until reclaim priority has got pretty high. That > means at low priority reclaim, it *skips pages under writeback*. > However, if it comes across a dirty page, it will trigger writeback > of the page. > > Now call me crazy, but if we've already got a large number of pages > under writeback, why would we want to *start more IO* when clearly > the system is taking care of cleaning pages already and all we have > to do is wait for a short while to get clean pages ready for > reclaim? It doesn't check how many pages are under writeback. Direct reclaim will check if the block device is congested but that is about it. Otherwise the expectation was the elevator would handle the merging of requests into a sensible pattern. Also, while filesystem pages are getting cleaned by the flushers, that does not cover anonymous pages being written to swap. > Indeed, I added this quick hack to prevent the VM from doing > writeback via pageout until after it starts blocking on writeback > pages:
>
> @@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l
> 	if (PageDirty(page)) {
> 		nr_dirty++;
>
> +		if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC))
> +			goto keep_locked;
> 		if (references == PAGEREF_RECLAIM_CLEAN)
> 			goto keep_locked;
> 		if (!may_enter_fs)
>
> IOWs, we don't write pages from kswapd unless there is no IO > writeback going on at all (waited on all the writeback pages or none > exist) and there are dirty pages on the LRU. A side effect of this patch is that kswapd is no longer writing anonymous pages to swap and possibly never will. RECLAIM_MODE_SYNC is only set for lumpy reclaim which, if you have CONFIG_COMPACTION set, will never happen. I see your figures and know why you want this but it never was that straight-forward :/ -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 20+ messages in thread
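The central idea of the patch Mel links above - have reclaim kick the flusher threads rather than issue page-sized IO itself - is small in code terms. An untested sketch using the 3.0-era call; the multiplier is an assumption for illustration, and the original patch's accounting differed:

static void note_dirty_pages_at_lru_tail(unsigned long nr_dirty)
{
	/*
	 * Ask the flushers for more than reclaim scanned: cleaning
	 * only the pages reclaim tripped over would leave the next
	 * scan hitting dirty pages again almost immediately.
	 */
	if (nr_dirty)
		wakeup_flusher_threads(nr_dirty * 8);
}

Later kernels do wake the flushers from reclaim in a broadly similar way; the zone-targeting problem discussed below is what such a wakeup does not solve.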
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering 2011-07-05 14:34 ` Mel Gorman @ 2011-07-06 1:23 ` Dave Chinner 2011-07-11 11:10 ` Christoph Hellwig 1 sibling, 0 replies; 20+ messages in thread From: Dave Chinner @ 2011-07-06 1:23 UTC (permalink / raw) To: Mel Gorman Cc: Wu Fengguang, Christoph Hellwig, Johannes Weiner, xfs@oss.sgi.com, linux-mm@kvack.org On Tue, Jul 05, 2011 at 03:34:10PM +0100, Mel Gorman wrote: > On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote: > > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote: > > > Christoph, > > > > > > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote: > > > > Johannes, Mel, Wu, > > > > > > > > Dave has been stressing some XFS patches of mine that remove the XFS > > > > internal writeback clustering in favour of using write_cache_pages. > > > > > > > > As part of investigating the behaviour he found out that we're still > > > > doing lots of I/O from the end of the LRU in kswapd. Not only is that > > > > pretty bad behaviour in general, but it also means we really can't > > > > just remove the writeback clustering in writepage given how much > > > > I/O is still done through that. > > > > > > > > Any chance we could the writeback vs kswap behaviour sorted out a bit > > > > better finally? > > > > > > I once tried this approach: > > > > > > http://www.spinics.net/lists/linux-mm/msg09202.html > > > > > > It used a list structure that is not linearly scalable, however that > > > part should be independently improvable when necessary. > > > > I don't think that handing random writeback to the flusher thread is > > much better than doing random writeback directly. Yes, you added > > some clustering, but I'm still don't think writing specific pages is > > the best solution. > > > > > The real problem was, it seem to not very effective in my test runs. > > > I found many ->nr_pages works queued before the ->inode works, which > > > effectively makes the flusher working on more dispersed pages rather > > > than focusing on the dirty pages encountered in LRU reclaim. > > > > But that's really just an implementation issue related to how you > > tried to solve the problem. That could be addressed. > > > > However, what I'm questioning is whether we should even care what > > page memory reclaim wants to write - it seems to make fundamentally > > bad decisions from an IO persepctive. > > > > It sucks from an IO perspective but from the perspective of the VM that > needs memory to be free in a particular zone or node, it's a reasonable > request. Sure, I'm not suggesting there is anything wrong with the requirement of being able to clean pages in a particular zone. My comments are aimed at the fact that the implementation of this feature is about as friendly to the IO subsystem as a game of Roshambeau.... If someone comes to us complaining about an application that causes this sort of IO behaviour, our answer is always "fix the application" because it is not something we can fix in the filesystem. Same here - we need to have the "application" fixed to play well with others. > > We have to remember that memory reclaim is doing LRU reclaim and the > > flusher threads are doing "oldest first" writeback. IOWs, both are trying > > to operate in the same direction (oldest to youngest) for the same > > purpose.
> > The fundamental problem that occurs when memory reclaim > > starts writing pages back from the LRU is this: > > > > - memory reclaim has run ahead of IO writeback - > > This reasoning was the basis for this patch > http://www.gossamer-threads.com/lists/linux/kernel/1251235?do=post_view_threaded#1251235 > > i.e. if old pages are dirty then the flusher threads are either not > awake or not doing enough work so wake them. It was flawed in a number > of respects and never finished though. But that's dealing with a different situation - you're assuming that the writeback threads are not running or are running inefficiently. What I'm seeing is bad behaviour when the IO subsystem is already running flat out with perfectly formed IO. No additional IO submission is going to make it clean pages faster than it already is. It is in this situation that memory reclaim should never, ever be trying to write dirty pages. IIRC, the situation was that there were about 15,000 dirty pages and ~20,000 pages under writeback when memory reclaim started pushing pages from the LRU. This is on a single node machine, with all IO being single threaded (so a single source of memory pressure) and writeback doing its job. Memory reclaim should *never* get ahead of writeback under such a simple workload on such a simple configuration.... > > The LRU usually looks like this:
> >
> >    oldest                                       youngest
> >    +---------------+---------------+--------------+
> >          clean         writeback        dirty
> >            ^               ^
> >            |               |
> >            |               Where flusher will next work from
> >            |               Where kswapd is working from
> >            |
> >            IO submitted by flusher, waiting on completion
> >
> > If memory reclaim is hitting dirty pages on the LRU, it means it has > > got ahead of writeback without being throttled - it's passed over > > all the pages currently under writeback and is trying to write back > > pages that are *newer* than what writeback is working on. IOWs, it > > starts trying to do the job of the flusher threads, and it does that > > very badly. > > > > The $100 question is *why is it getting ahead of writeback*? > > Allocating and dirtying memory faster than writeback. Large dd to USB > stick would also trigger it. Write throttling is supposed to prevent that situation from being problematic. Its entire purpose is to throttle the dirtying rate to match the writeback rate. If that's a problem, the memory reclaim subsystem is the wrong place to be trying to fix it. And as such, that is not the case here; foreground throttling is definitely occurring and works fine for 70-80s, then memory reclaim gets ahead of writeback and it all goes to shit. > > From a brief look at the vmscan code, it appears that scanning does > > not throttle/block until reclaim priority has got pretty high. That > > means at low priority reclaim, it *skips pages under writeback*. > > However, if it comes across a dirty page, it will trigger writeback > > of the page. > > > > Now call me crazy, but if we've already got a large number of pages > > under writeback, why would we want to *start more IO* when clearly > > the system is taking care of cleaning pages already and all we have > > to do is wait for a short while to get clean pages ready for > > reclaim? > > It doesnt' check how many pages are under writeback. Isn't that an indication of a design flaw? You want to clean pages, but you don't even bother to check on how many pages are currently being cleaned and will soon be reclaimable? > Direct reclaim > will check if the block device is congested but that is about > it.
FWIW, we've removed all the congestion logic from the writeback subsystem because IO throttling never really worked well that way. Writeback IO throttling now works by foreground blocking during IO submission on request queue slots in the elevator. That's why we have flusher threads per-bdi - so writeback can block on a congested bdi and not block writeback to other bdis. It's simpler, more extensible and far more scalable than the old method. Anyway, it's a moot point because direct reclaim can't issue IO through xfs, ext4 or btrfs and as such I have doubts that the throttling logic in vmscan is completely robust. > Otherwise the expectation was the elevator would handle the > merging of requests into a sensible patter. Also, while filesystem > pages are getting cleaned by flushs, that does not cover anonymous > pages being written to swap. Anonymous pages written to swap are not the issue here - I couldn't care less what you do with them. It's writeback of dirty file pages that I care about... > > > Indeed, I added this quick hack to prevent the VM from doing > > writeback via pageout until after it starts blocking on writeback > > pages: > > > > @@ -825,6 +825,8 @@ static unsigned long shrink_page_list(struct list_head *page_l > > if (PageDirty(page)) { > > nr_dirty++; > > > > + if (!(sc->reclaim_mode & RECLAIM_MODE_SYNC)) > > + goto keep_locked; > > if (references == PAGEREF_RECLAIM_CLEAN) > > goto keep_locked; > > if (!may_enter_fs) > > > > IOWs, we don't write pages from kswapd unless there is no IO > > writeback going on at all (waited on all the writeback pages or none > > exist) and there are dirty pages on the LRU. > > > > A side effect of this patch is that kswapd is no longer writing > anonymous pages to swap and possibly never will. For dirty anon pages to still get written, all that needs to be done is pass the file parameter to shrink_page_list() and change the test to:

+	if (file && !(sc->reclaim_mode & RECLAIM_MODE_SYNC))
+		goto keep_locked;

As it is, I haven't had any of my test systems (which run tests that deliberately cause OOM conditions) fail with this patch. While I agree it is just a hack, its naivety has also demonstrated that a working system does not need to write back dirty file pages from memory reclaim -at all-. i.e. it makes my argument stronger, not weaker.... > RECLAIM_MODE_SYNC is > only set for lumpy reclaim which if you have CONFIG_COMPACTION set, will > never happen. Which means that memory reclaim does not throttle reliably on writeback in progress. Even when the priority has ratcheted right up and it is obvious that the zone in question has pages being cleaned and will soon be available for reclaim, memory reclaim won't wait for them directly. Once again this points to the throttling mechanism being sub-optimal - it relies on second order effects (congestion_wait) to try to block long enough for pages to be cleaned in the zone being reclaimed from before doing another scan to find those pages. It's a "wait and hope" approach to throttling, and that's one of the reasons it never worked well in the writeback subsystem. Instead, if memory reclaim waits directly on a page on the given LRU under writeback, it guarantees that when you are woken there was at least some progress made by the IO subsystem that would allow the memory reclaim subsystem to move forward. What it comes down to is the fact that you can scan tens of thousands of pages in the time it takes for IO on a single page to complete.
If there are pages already under IO, then why start more IO when what ends up getting reclaimed is one of the pages that is already under IO when the new IO was issued? BTW:

# CONFIG_COMPACTION is not set

> I see your figures and know why you want this but it never was that > straight-forward :/ If the code is complex enough that implementing a basic policy such as "don't write back pages if there are already pages under writeback" is difficult, then maybe the code needs to be simplified.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
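What "waiting directly on a page" would mean in shrink_page_list() terms, as opposed to congestion_wait(): an untested sketch; the locking in the real function is more involved, and the page must not be held locked across the wait:

/* throttle on real per-zone progress instead of congestion guesses */
static void reclaim_wait_on_page(struct page *page)
{
	if (!PageWriteback(page))
		return;
	/*
	 * The page came off the LRU being reclaimed, so when this
	 * returns, at least one page in the target zone has become
	 * clean and reclaimable - guaranteed forward progress.
	 */
	wait_on_page_writeback(page);
}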
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering 2011-07-05 14:34 ` Mel Gorman 2011-07-06 1:23 ` Dave Chinner @ 2011-07-11 11:10 ` Christoph Hellwig 1 sibling, 0 replies; 20+ messages in thread From: Christoph Hellwig @ 2011-07-11 11:10 UTC (permalink / raw) To: Mel Gorman Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Johannes Weiner, xfs@oss.sgi.com, linux-mm@kvack.org On Tue, Jul 05, 2011 at 03:34:10PM +0100, Mel Gorman wrote: > > However, what I'm questioning is whether we should even care what > > page memory reclaim wants to write - it seems to make fundamentally > > bad decisions from an IO persepctive. > > > > It sucks from an IO perspective but from the perspective of the VM that > needs memory to be free in a particular zone or node, it's a reasonable > request. It might appear reasonable, but it's not. What the VM wants underneath is generally (1):

 - free N pages in zone Z

and it then goes on to free the pages one by one through kswapd, which leads to freeing those N pages, but unless they already were clean it will take a very long time to get there and bog down the whole system. So we need a better way to actually perform that underlying request. Dave's suggestion of keeping different lists for clean vs dirty pages in the VM and preferably reclaiming from the clean ones under zone pressure is a first step. The second one will be to tell the writeback threads to preferentially write back pages from a given zone. I'm actually not sure how to do that yet, as we could have memory from different zones on a single inode. Taking an inode that has memory from the right zone and then writing that out will probably work fine for different zones in 64-bit NUMA systems where zones more or less equal nodes. It probably won't work very well if we need to free up memory in the various low memory zones, as those will be spread over random inodes. > It doesnt' check how many pages are under writeback. Direct reclaim > will check if the block device is congested but that is about > it. Otherwise the expectation was the elevator would handle the > merging of requests into a sensible patter. It can't. The elevator has a relatively small window it can operate on, and can never fix up a bad large scale writeback pattern. > Also, while filesystem > pages are getting cleaned by flushs, that does not cover anonymous > pages being written to swap. At least for now we will have to keep kswapd writeback for swap. It is just as inefficient as on a filesystem, but given that people don't rely on swap performance we can probably live with it. Note that we can't simply use background flushing for swap, as that would mean we'd need backing space allocated for all main memory, which isn't very practical with today's memory sizes. The whole concept of demand paging anonymous memory leads to pretty bad I/O patterns. If you're actually making heavy use of it, the old-school unix full process paging would be a lot faster. (1) modulo things like compaction ^ permalink raw reply [flat|nested] 20+ messages in thread
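To make the "different lists for clean vs dirty pages" first step concrete: it would mean moving a file page between two inactive sublists as its dirty state changes, so reclaim can consume the clean sublist without scanning past dirty pages. This is purely a hypothetical sketch - the inactive_clean_list/inactive_dirty_list fields do not exist in any kernel, and callers would need to hold zone->lru_lock:

/* hypothetical: per-zone clean and dirty inactive file lists */
static void lru_account_dirtied(struct zone *zone, struct page *page)
{
	if (PageLRU(page) && !PageActive(page) && page_is_file_cache(page))
		list_move(&page->lru, &zone->inactive_dirty_list);
}

static void lru_account_cleaned(struct zone *zone, struct page *page)
{
	/* cleaned pages go to the tail, ready for reclaim */
	if (PageLRU(page) && !PageActive(page) && page_is_file_cache(page))
		list_move_tail(&page->lru, &zone->inactive_clean_list);
}

The zone-targeted writeback second step has no obvious shape in the 3.0 code, as Christoph notes.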
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering 2011-07-04 3:25 ` Dave Chinner 2011-07-05 14:34 ` Mel Gorman @ 2011-07-06 4:53 ` Wu Fengguang 2011-07-06 6:47 ` Minchan Kim 2011-07-06 7:17 ` Dave Chinner 2 siblings, 2 replies; 20+ messages in thread From: Wu Fengguang @ 2011-07-06 4:53 UTC (permalink / raw) To: Dave Chinner Cc: Christoph Hellwig, Mel Gorman, Johannes Weiner, xfs@oss.sgi.com, linux-mm@kvack.org On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote: > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote: > > Christoph, > > > > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote: > > > Johannes, Mel, Wu, > > > > > > Dave has been stressing some XFS patches of mine that remove the XFS > > > internal writeback clustering in favour of using write_cache_pages. > > > > > > As part of investigating the behaviour he found out that we're still > > > doing lots of I/O from the end of the LRU in kswapd. Not only is that > > > pretty bad behaviour in general, but it also means we really can't > > > just remove the writeback clustering in writepage given how much > > > I/O is still done through that. > > > > > > Any chance we could the writeback vs kswap behaviour sorted out a bit > > > better finally? > > > > I once tried this approach: > > > > http://www.spinics.net/lists/linux-mm/msg09202.html > > > > It used a list structure that is not linearly scalable, however that > > part should be independently improvable when necessary. > > I don't think that handing random writeback to the flusher thread is > much better than doing random writeback directly. Yes, you added > some clustering, but I'm still don't think writing specific pages is > the best solution. I agree that the VM should avoid writing specific pages as much as possible. Most often, it's indeed OK to just skip a sporadically encountered dirty page and reclaim the clean pages presumably not far away in the LRU list. So your 2-liner patch is all good if constrained to low scan pressure, which would look like

	if (priority == DEF_PRIORITY)
		tag PG_reclaim on encountered dirty pages and
		skip writing it

However, the VM in general does need the ability to write specific pages, such as when reclaiming from a specific zone/memcg. So I'll still propose to do bdi_start_inode_writeback(). Below is the patch rebased to linux-next. It's good enough for testing purposes, and I guess even with the ->nr_pages work issue, it's complete enough to get roughly the same performance as your 2-liner patch. > > The real problem was, it seem to not very effective in my test runs. > > I found many ->nr_pages works queued before the ->inode works, which > > effectively makes the flusher working on more dispersed pages rather > > than focusing on the dirty pages encountered in LRU reclaim. > > But that's really just an implementation issue related to how you > tried to solve the problem. That could be addressed. > > However, what I'm questioning is whether we should even care what > page memory reclaim wants to write - it seems to make fundamentally > bad decisions from an IO persepctive. > > We have to remember that memory reclaim is doing LRU reclaim and the > flusher threads are doing "oldest first" writeback. IOWs, both are trying > to operate in the same direction (oldest to youngest) for the same > purpose.
> The fundamental problem that occurs when memory reclaim > starts writing pages back from the LRU is this: > > - memory reclaim has run ahead of IO writeback - > > The LRU usually looks like this:
>
>    oldest                                       youngest
>    +---------------+---------------+--------------+
>          clean         writeback        dirty
>            ^               ^
>            |               |
>            |               Where flusher will next work from
>            |               Where kswapd is working from
>            |
>            IO submitted by flusher, waiting on completion
>
> If memory reclaim is hitting dirty pages on the LRU, it means it has > got ahead of writeback without being throttled - it's passed over > all the pages currently under writeback and is trying to write back > pages that are *newer* than what writeback is working on. IOWs, it > starts trying to do the job of the flusher threads, and it does that > very badly. > > The $100 question is *why is it getting ahead of writeback*? The most important case is: faster reader + relatively slow writer. Assume for every 10 pages read, 1 page is dirtied, and the dirty speed is fast enough to trigger the 20% dirty ratio and hence dirty balancing. That pattern is able to evenly distribute dirty pages all over the LRU list and hence trigger lots of pageout()s. The "skip reclaim writes on low pressure" approach can fix this case. Thanks, Fengguang --- Subject: writeback: introduce bdi_start_inode_writeback() Date: Thu Jul 29 14:41:19 CST 2010 This relays ASYNC file writeback IOs to the flusher threads. pageout() will continue to serve the SYNC file page writes for the necessary throttling to prevent OOM. Only ASYNC pageout() is relayed to the flusher threads; the less frequent SYNC pageout()s will work as before as a last resort. This helps to avoid OOM when the LRU list is small and/or the storage is slow, and the flusher cannot clean enough pages before the LRU is fully scanned.

The flusher will piggy back more dirty pages for IO
- it's more IO efficient
- it helps clean more pages, a good number of them may sit in the same LRU list that is being scanned.

To avoid memory allocations at page reclaim, a mempool is created. Background/periodic works will quit automatically (as done in another patch), so as to clean the pages under reclaim ASAP. However for now the sync work can still block us for a long time. Jan Kara: limit the search scope.
CC: Jan Kara <jack@suse.cz> CC: Rik van Riel <riel@redhat.com> CC: Mel Gorman <mel@linux.vnet.ibm.com> CC: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 156 ++++++++++++++++++++++++++++- include/linux/backing-dev.h | 1 include/trace/events/writeback.h | 15 ++ mm/vmscan.c | 8 + 4 files changed, 174 insertions(+), 6 deletions(-) --- linux-next.orig/mm/vmscan.c 2011-06-29 20:43:10.000000000 -0700 +++ linux-next/mm/vmscan.c 2011-07-05 18:30:19.000000000 -0700 @@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st if (PageDirty(page)) { nr_dirty++; + if (page_is_file_cache(page) && mapping && + sc->reclaim_mode != RECLAIM_MODE_SYNC) { + if (flush_inode_page(page, mapping) >= 0) { + SetPageReclaim(page); + goto keep_locked; + } + } + if (references == PAGEREF_RECLAIM_CLEAN) goto keep_locked; if (!may_enter_fs) --- linux-next.orig/fs/fs-writeback.c 2011-07-05 18:30:16.000000000 -0700 +++ linux-next/fs/fs-writeback.c 2011-07-05 18:30:52.000000000 -0700 @@ -30,12 +30,21 @@ #include "internal.h" /* + * When flushing an inode page (for page reclaim), try to piggy back up to + * 4MB nearby pages for IO efficiency. These pages will have good opportunity + * to be in the same LRU list. + */ +#define WRITE_AROUND_PAGES MIN_WRITEBACK_PAGES + +/* * Passed into wb_writeback(), essentially a subset of writeback_control */ struct wb_writeback_work { long nr_pages; struct super_block *sb; unsigned long *older_than_this; + struct inode *inode; + pgoff_t offset; enum writeback_sync_modes sync_mode; unsigned int tagged_writepages:1; unsigned int for_kupdate:1; @@ -59,6 +68,27 @@ struct wb_writeback_work { */ int nr_pdflush_threads; +static mempool_t *wb_work_mempool; + +static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data) +{ + /* + * bdi_start_inode_writeback() may be called on page reclaim + */ + if (current->flags & PF_MEMALLOC) + return NULL; + + return kmalloc(sizeof(struct wb_writeback_work), gfp_mask); +} + +static __init int wb_work_init(void) +{ + wb_work_mempool = mempool_create(1024, + wb_work_alloc, mempool_kfree, NULL); + return wb_work_mempool ? 0 : -ENOMEM; +} +fs_initcall(wb_work_init); + /** * writeback_in_progress - determine whether there is writeback in progress * @bdi: the device's backing_dev_info structure. @@ -123,7 +153,7 @@ __bdi_start_writeback(struct backing_dev * This is WB_SYNC_NONE writeback, so if allocation fails just * wakeup the thread for old dirty data writeback */ - work = kzalloc(sizeof(*work), GFP_ATOMIC); + work = mempool_alloc(wb_work_mempool, GFP_NOWAIT); if (!work) { if (bdi->wb.task) { trace_writeback_nowork(bdi); @@ -132,6 +162,7 @@ __bdi_start_writeback(struct backing_dev return; } + memset(work, 0, sizeof(*work)); work->sync_mode = WB_SYNC_NONE; work->nr_pages = nr_pages; work->range_cyclic = range_cyclic; @@ -177,6 +208,107 @@ void bdi_start_background_writeback(stru spin_unlock_bh(&bdi->wb_lock); } +static bool extend_writeback_range(struct wb_writeback_work *work, + pgoff_t offset) +{ + pgoff_t end = work->offset + work->nr_pages; + + if (offset >= work->offset && offset < end) + return true; + + /* the unsigned comparison helps eliminate one compare */ + if (work->offset - offset < WRITE_AROUND_PAGES) { + work->nr_pages += WRITE_AROUND_PAGES; + work->offset -= WRITE_AROUND_PAGES; + return true; + } + + if (offset - end < WRITE_AROUND_PAGES) { + work->nr_pages += WRITE_AROUND_PAGES; + return true; + } + + return false; +} + +/* + * schedule writeback on a range of inode pages. 
+ */ +static struct wb_writeback_work * +bdi_flush_inode_range(struct backing_dev_info *bdi, + struct inode *inode, + pgoff_t offset, + pgoff_t len) +{ + struct wb_writeback_work *work; + + if (!igrab(inode)) + return ERR_PTR(-ENOENT); + + work = mempool_alloc(wb_work_mempool, GFP_NOWAIT); + if (!work) + return ERR_PTR(-ENOMEM); + + memset(work, 0, sizeof(*work)); + work->sync_mode = WB_SYNC_NONE; + work->inode = inode; + work->offset = offset; + work->nr_pages = len; + + bdi_queue_work(bdi, work); + + return work; +} + +/* + * Called by page reclaim code to flush the dirty page ASAP. Do write-around to + * improve IO throughput. The nearby pages will have good chance to reside in + * the same LRU list that vmscan is working on, and even close to each other + * inside the LRU list in the common case of sequential read/write. + * + * ret > 0: success, found/reused a previous writeback work + * ret = 0: success, allocated/queued a new writeback work + * ret < 0: failed + */ +long flush_inode_page(struct page *page, struct address_space *mapping) +{ + struct backing_dev_info *bdi = mapping->backing_dev_info; + struct inode *inode = mapping->host; + pgoff_t offset = page->index; + pgoff_t len = 0; + struct wb_writeback_work *work; + long ret = -ENOENT; + + if (unlikely(!inode)) + goto out; + + len = 1; + spin_lock_bh(&bdi->wb_lock); + list_for_each_entry_reverse(work, &bdi->work_list, list) { + if (work->inode != inode) + continue; + if (extend_writeback_range(work, offset)) { + ret = len; + offset = work->offset; + len = work->nr_pages; + break; + } + if (len++ > 30) /* do limited search */ + break; + } + spin_unlock_bh(&bdi->wb_lock); + + if (ret > 0) + goto out; + + offset = round_down(offset, WRITE_AROUND_PAGES); + len = WRITE_AROUND_PAGES; + work = bdi_flush_inode_range(bdi, inode, offset, len); + ret = IS_ERR(work) ? PTR_ERR(work) : 0; +out: + return ret; +} + /* * Remove the inode from the writeback list it is on. 
*/ @@ -830,6 +962,21 @@ static unsigned long get_nr_dirty_pages( get_nr_dirty_inodes(); } +static long wb_flush_inode(struct bdi_writeback *wb, + struct wb_writeback_work *work) +{ + loff_t start = work->offset; + loff_t end = work->offset + work->nr_pages - 1; + int wrote; + + wrote = __filemap_fdatawrite_range(work->inode->i_mapping, + start << PAGE_CACHE_SHIFT, + end << PAGE_CACHE_SHIFT, + WB_SYNC_NONE); + iput(work->inode); + return wrote; +} + static long wb_check_background_flush(struct bdi_writeback *wb) { if (over_bground_thresh()) { @@ -900,7 +1047,10 @@ long wb_do_writeback(struct bdi_writebac trace_writeback_exec(bdi, work); - wrote += wb_writeback(wb, work); + if (work->inode) + wrote += wb_flush_inode(wb, work); + else + wrote += wb_writeback(wb, work); /* * Notify the caller of completion if this is a synchronous @@ -909,7 +1059,7 @@ long wb_do_writebac if (work->done) complete(work->done); else - kfree(work); + mempool_free(work, wb_work_mempool); } /* --- linux-next.orig/include/linux/backing-dev.h 2011-07-03 20:03:37.000000000 -0700 +++ linux-next/include/linux/backing-dev.h 2011-07-05 18:30:19.000000000 -0700 @@ -109,6 +109,7 @@ void bdi_unregister(struct backing_dev_i int bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int); void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages); void bdi_start_background_writeback(struct backing_dev_info *bdi); +long flush_inode_page(struct page *page, struct address_space *mapping); int bdi_writeback_thread(void *data); int bdi_has_dirty_io(struct backing_dev_info *bdi); void bdi_arm_supers_timer(void); --- linux-next.orig/include/trace/events/writeback.h 2011-07-05 18:30:16.000000000 -0700 +++ linux-next/include/trace/events/writeback.h 2011-07-05 18:30:19.000000000 -0700 @@ -28,31 +28,40 @@ DECLARE_EVENT_CLASS(writeback_work_class TP_ARGS(bdi, work), TP_STRUCT__entry( __array(char, name, 32) + __field(struct wb_writeback_work*, work) __field(long, nr_pages) __field(dev_t, sb_dev) __field(int, sync_mode) __field(int, for_kupdate) __field(int, range_cyclic) __field(int, for_background) + __field(unsigned long, ino) + __field(unsigned long, offset) ), TP_fast_assign( strncpy(__entry->name, dev_name(bdi->dev), 32); + __entry->work = work; __entry->nr_pages = work->nr_pages; __entry->sb_dev = work->sb ? work->sb->s_dev : 0; __entry->sync_mode = work->sync_mode; __entry->for_kupdate = work->for_kupdate; __entry->range_cyclic = work->range_cyclic; __entry->for_background = work->for_background; + __entry->ino = work->inode ? work->inode->i_ino : 0; + __entry->offset = work->offset; ), - TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d " - "kupdate=%d range_cyclic=%d background=%d", + TP_printk("bdi %s: sb_dev %d:%d %p nr_pages=%ld sync_mode=%d " + "kupdate=%d range_cyclic=%d background=%d ino=%lu offset=%lu", __entry->name, MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev), + __entry->work, __entry->nr_pages, __entry->sync_mode, __entry->for_kupdate, __entry->range_cyclic, - __entry->for_background + __entry->for_background, + __entry->ino, + __entry->offset ) ); #define DEFINE_WRITEBACK_WORK_EVENT(name) \ ^ permalink raw reply [flat|nested] 20+ messages in thread
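For comparison with the patch above, the "skip reclaim writes on low pressure" alternative Wu sketches at the top of his mail would amount to only a few lines in shrink_page_list(). An untested sketch, assuming the scan priority is plumbed down into shrink_page_list(), which the 3.0 code does not do:

@@ shrink_page_list() @@
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			/*
+			 * At low scan pressure, let the flushers do the
+			 * IO: PG_reclaim makes writeback rotate the page
+			 * back to the LRU tail once it has been cleaned.
+			 */
+			if (page_is_file_cache(page) &&
+			    priority == DEF_PRIORITY) {
+				SetPageReclaim(page);
+				goto keep_locked;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;

This is close to the direction mainline eventually took: reclaim now leaves almost all file writeback to the flusher threads and mostly just marks pages with PG_reclaim.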
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering 2011-07-06 4:53 ` Wu Fengguang @ 2011-07-06 6:47 ` Minchan Kim 2011-07-06 7:17 ` Dave Chinner 1 sibling, 0 replies; 20+ messages in thread From: Minchan Kim @ 2011-07-06 6:47 UTC (permalink / raw) To: Wu Fengguang Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Johannes Weiner, xfs@oss.sgi.com, linux-mm@kvack.org On Wed, Jul 6, 2011 at 1:53 PM, Wu Fengguang <fengguang.wu@intel.com> wrote: > On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote: >> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote: >> > Christoph, >> > >> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote: >> > > Johannes, Mel, Wu, >> > > >> > > Dave has been stressing some XFS patches of mine that remove the XFS >> > > internal writeback clustering in favour of using write_cache_pages. >> > > >> > > As part of investigating the behaviour he found out that we're still >> > > doing lots of I/O from the end of the LRU in kswapd. Not only is that >> > > pretty bad behaviour in general, but it also means we really can't >> > > just remove the writeback clustering in writepage given how much >> > > I/O is still done through that. >> > > >> > > Any chance we could the writeback vs kswap behaviour sorted out a bit >> > > better finally? >> > >> > I once tried this approach: >> > >> > http://www.spinics.net/lists/linux-mm/msg09202.html >> > >> > It used a list structure that is not linearly scalable, however that >> > part should be independently improvable when necessary. >> >> I don't think that handing random writeback to the flusher thread is >> much better than doing random writeback directly. Yes, you added >> some clustering, but I'm still don't think writing specific pages is >> the best solution. > > I agree that the VM should avoid writing specific pages as much as > possible. Mostly often, it's indeed OK to just skip sporadically > encountered dirty page and reclaim the clean pages presumably not > far away in the LRU list. So your 2-liner patch is all good if > constraining it to low scan pressure, which will look like > > if (priority == DEF_PRIORITY) > tag PG_reclaim on encountered dirty pages and > skip writing it > > However the VM in general does need the ability to write specific > pages, such as when reclaiming from specific zone/memcg. So I'll still > propose to do bdi_start_inode_writeback(). > > Below is the patch rebased to linux-next. It's good enough for testing > purpose, and I guess even with the ->nr_pages work issue, it's > complete enough to get roughly the same performance as your 2-liner > patch. > >> > The real problem was, it seem to not very effective in my test runs. >> > I found many ->nr_pages works queued before the ->inode works, which >> > effectively makes the flusher working on more dispersed pages rather >> > than focusing on the dirty pages encountered in LRU reclaim. >> >> But that's really just an implementation issue related to how you >> tried to solve the problem. That could be addressed. >> >> However, what I'm questioning is whether we should even care what >> page memory reclaim wants to write - it seems to make fundamentally >> bad decisions from an IO persepctive. >> >> We have to remember that memory reclaim is doing LRU reclaim and the >> flusher threads are doing "oldest first" writeback. IOWs, both are trying >> to operate in the same direction (oldest to youngest) for the same >> purpose. 
>> The fundamental problem that occurs when memory reclaim >> starts writing pages back from the LRU is this: >> >> - memory reclaim has run ahead of IO writeback - >> >> The LRU usually looks like this:
>>
>>    oldest                                       youngest
>>    +---------------+---------------+--------------+
>>          clean         writeback        dirty
>>            ^               ^
>>            |               |
>>            |               Where flusher will next work from
>>            |               Where kswapd is working from
>>            |
>>            IO submitted by flusher, waiting on completion
>>
>> If memory reclaim is hitting dirty pages on the LRU, it means it has >> got ahead of writeback without being throttled - it's passed over >> all the pages currently under writeback and is trying to write back >> pages that are *newer* than what writeback is working on. IOWs, it >> starts trying to do the job of the flusher threads, and it does that >> very badly. >> >> The $100 question is *why is it getting ahead of writeback*? > > The most important case is: faster reader + relatively slow writer. > > Assume for every 10 pages read, 1 page is dirtied, and the dirty speed > is fast enough to trigger the 20% dirty ratio and hence dirty balancing. > > That pattern is able to evenly distribute dirty pages all over the LRU > list and hence trigger lots of pageout()s. The "skip reclaim writes on > low pressure" approach can fix this case. > > Thanks, > Fengguang > --- > Subject: writeback: introduce bdi_start_inode_writeback() > Date: Thu Jul 29 14:41:19 CST 2010 > > This relays ASYNC file writeback IOs to the flusher threads. > > pageout() will continue to serve the SYNC file page writes for necessary > throttling for preventing OOM, which may happen if the LRU list is small > and/or the storage is slow, so that the flusher cannot clean enough > pages before the LRU is full scanned. > > Only ASYNC pageout() is relayed to the flusher threads, the less > frequent SYNC pageout()s will work as before as a last resort. > This helps to avoid OOM when the LRU list is small and/or the storage is > slow, and the flusher cannot clean enough pages before the LRU is > full scanned. > > The flusher will piggy back more dirty pages for IO > - it's more IO efficient > - it helps clean more pages, a good number of them may sit in the same > LRU list that is being scanned. > > To avoid memory allocations at page reclaim, a mempool is created. > > Background/periodic works will quit automatically (as done in another > patch), so as to clean the pages under reclaim ASAP. However for now the > sync work can still block us for long time. > > Jan Kara: limit the search scope. > > CC: Jan Kara <jack@suse.cz> > CC: Rik van Riel <riel@redhat.com> > CC: Mel Gorman <mel@linux.vnet.ibm.com> > CC: Minchan Kim <minchan.kim@gmail.com> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> It seems to be an enhanced version of what Mel did earlier. I support this approach :) but I have some questions. > --- > fs/fs-writeback.c | 156 ++++++++++++++++++++++++++++- > include/linux/backing-dev.h | 1 > include/trace/events/writeback.h | 15 ++ > mm/vmscan.c | 8 + > 4 files changed, 174 insertions(+), 6 deletions(-) > > --- linux-next.orig/mm/vmscan.c 2011-06-29 20:43:10.000000000 -0700 > +++ linux-next/mm/vmscan.c 2011-07-05 18:30:19.000000000 -0700 > @@ -825,6 +825,14 @@ static unsigned long shrink_page_list(st > if (PageDirty(page)) { > nr_dirty++; > > + if (page_is_file_cache(page) && mapping && > + sc->reclaim_mode != RECLAIM_MODE_SYNC) { > + if (flush_inode_page(page, mapping) >= 0) { > + SetPageReclaim(page); > + goto keep_locked
Normally, in case of async mode, we does keep_lumpy(ie, we didn't reset reclaim_mode) but now you are always resetting reclaim_mode. so sync call of shrink_page_list never happen if flush_inode_page is successful. Is it your intention? > + } > + } > + If flush_inode_page fails(ie, the page isn't nearby of current work's writeback range), we still do pageout although it's async mode. Is it your intention? -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering 2011-07-06 4:53 ` Wu Fengguang 2011-07-06 6:47 ` Minchan Kim @ 2011-07-06 7:17 ` Dave Chinner 1 sibling, 0 replies; 20+ messages in thread From: Dave Chinner @ 2011-07-06 7:17 UTC (permalink / raw) To: Wu Fengguang Cc: Christoph Hellwig, Mel Gorman, Johannes Weiner, xfs@oss.sgi.com, linux-mm@kvack.org On Tue, Jul 05, 2011 at 09:53:01PM -0700, Wu Fengguang wrote: > On Mon, Jul 04, 2011 at 11:25:34AM +0800, Dave Chinner wrote: > > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote: > > We have to remember that memory reclaim is doing LRU reclaim and the > > flusher threads are doing "oldest first" writeback. IOWs, both are trying > > to operate in the same direction (oldest to youngest) for the same > > purpose. The fundamental problem that occurs when memory reclaim > > starts writing pages back from the LRU is this: > > > > - memory reclaim has run ahead of IO writeback - > > > > The LRU usually looks like this:
> >
> >    oldest                                       youngest
> >    +---------------+---------------+--------------+
> >          clean         writeback        dirty
> >            ^               ^
> >            |               |
> >            |               Where flusher will next work from
> >            |               Where kswapd is working from
> >            |
> >            IO submitted by flusher, waiting on completion
> >
> > If memory reclaim is hitting dirty pages on the LRU, it means it has > > got ahead of writeback without being throttled - it's passed over > > all the pages currently under writeback and is trying to write back > > pages that are *newer* than what writeback is working on. IOWs, it > > starts trying to do the job of the flusher threads, and it does that > > very badly. > > > > The $100 question is *why is it getting ahead of writeback*? > > The most important case is: faster reader + relatively slow writer. Same thing I said to Mel: that is not the workload that is causing this problem I am seeing. > Assume for every 10 pages read, 1 page is dirtied, and the dirty speed > is fast enough to trigger the 20% dirty ratio and hence dirty balancing. > > That pattern is able to evenly distribute dirty pages all over the LRU > list and hence trigger lots of pageout()s. The "skip reclaim writes on > low pressure" approach can fix this case. Sure it can, but even better would be to simply skip the dirty pages and reclaim the interspersed clean pages which greatly outnumber the dirty pages. That then lets writeback deal with cleaning the dirty pages in the most optimal manner, and no writeback from memory reclaim is needed. IOWs, I don't think writeback from the LRU is the right solution to the problem you've described, either. > > Thanks, > Fengguang > --- > Subject: writeback: introduce bdi_start_inode_writeback() > Date: Thu Jul 29 14:41:19 CST 2010 > > This relays ASYNC file writeback IOs to the flusher threads. > > pageout() will continue to serve the SYNC file page writes for necessary > throttling for preventing OOM, which may happen if the LRU list is small > and/or the storage is slow, so that the flusher cannot clean enough > pages before the LRU is full scanned. > > Only ASYNC pageout() is relayed to the flusher threads, the less > frequent SYNC pageout()s will work as before as a last resort. > This helps to avoid OOM when the LRU list is small and/or the storage is > slow, and the flusher cannot clean enough pages before the LRU is > full scanned. Which ignores the fact that async pageout should not be happening in most cases. Let's try and fix the root cause of the problem, not paper over it again...
> The flusher will piggy back more dirty pages for IO > - it's more IO efficient > - it helps clean more pages, a good number of them may sit in the same > LRU list that is being scanned. > > To avoid memory allocations at page reclaim, a mempool is created. > > Background/periodic works will quit automatically (as done in another > patch), so as to clean the pages under reclaim ASAP. However for now the > sync work can still block us for long time. > /* > + * When flushing an inode page (for page reclaim), try to piggy back up to > + * 4MB nearby pages for IO efficiency. These pages will have good opportunity > + * to be in the same LRU list. > + */ > +#define WRITE_AROUND_PAGES MIN_WRITEBACK_PAGES Regardless of the trigger, I think you're going too far in the other direction, here. If we have to do one IO to clean the page that the VM wants, then it has to be done with as little latency as possible but large enough to still maintain decent throughput. With the above patch, for every single dirty page the VM wants cleaned, we'll clean 4MB of pages around it. Ok, but once the VM has tripped over pages on 25 different inodes, we've now got 100MB of writeback work to chew through before we can get to the 26th page the VM wanted cleaned. At which point, we may as well just ignore what the VM wants and continue to clean pages via the existing mechanisms because the latency for cleaning a specific page will be worse than if the VM just skipped it in the first place.... FWIW, XFS limited such clustering to 64 pages at a time to try to balance the bandwidth vs completion latency problem. > +/* > + * Called by page reclaim code to flush the dirty page ASAP. Do write-around to > + * improve IO throughput. The nearby pages will have good chance to reside in > + * the same LRU list that vmscan is working on, and even close to each other > + * inside the LRU list in the common case of sequential read/write. > + * > + * ret > 0: success, found/reused a previous writeback work > + * ret = 0: success, allocated/queued a new writeback work > + * ret < 0: failed > + */ > +long flush_inode_page(struct page *page, struct address_space *mapping) > +{ > + struct backing_dev_info *bdi = mapping->backing_dev_info; > + struct inode *inode = mapping->host; > + pgoff_t offset = page->index; > + pgoff_t len = 0; > + struct wb_writeback_work *work; > + long ret = -ENOENT; > + > + if (unlikely(!inode)) > + goto out; > + > + len = 1; > + spin_lock_bh(&bdi->wb_lock); > + list_for_each_entry_reverse(work, &bdi->work_list, list) { > + if (work->inode != inode) > + continue; > + if (extend_writeback_range(work, offset)) { > + ret = len; > + offset = work->offset; > + len = work->nr_pages; > + break; > + } > + if (len++ > 30) /* do limited search */ > + break; > + } > + spin_unlock_bh(&bdi->wb_lock); I don't think this is a necessary or scalable optimisation. It won't be useful when there are lots of dirty inodes and dirty pages are tripped over in their hundreds or thousands - it'll just burn CPU doing nothing, and serialise against other reclaim and writeback work. It looks like a case of premature optimisation to me.... Anyway, if there's a page flush near to an existing piece of work the IO elevator should merge them appropriately.
> +static long wb_flush_inode(struct bdi_writeback *wb, > + struct wb_writeback_work *work) > +{ > + loff_t start = work->offset; > + loff_t end = work->offset + work->nr_pages - 1; > + int wrote; > + > + wrote = __filemap_fdatawrite_range(work->inode->i_mapping, > + start << PAGE_CACHE_SHIFT, > + end << PAGE_CACHE_SHIFT, > + WB_SYNC_NONE); > + iput(work->inode); > + return wrote; > +} Out of curiosity, before going down the complex route did you try just calling this directly and seeing if it solved the problem? i.e.

	igrab()
	get start/end
	unlock page
	__filemap_fdatawrite_range()
	iput()

I mean, much as I dislike the idea of writeback from the LRU, if all we need to do is call through .writepages() to get decent IO from reclaim (when it occurs), then why do we need to add this async complexity to the generic writeback code to achieve the same end? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 20+ messages in thread
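Dave's "just call it directly" sequence, spelled out as code: a WB_SYNC_NONE write-around submitted straight from reclaim context, with the igrab/iput lifetime handling he lists. An untested sketch; the function name is invented, and the 1024-page window mirrors Wu's 4MB WRITE_AROUND_PAGES assumption:

/* ~4MB at 4k pages, mirroring Wu's WRITE_AROUND_PAGES */
#define RECLAIM_WRITE_AROUND	1024UL

/* caller holds the page lock; it is dropped before submitting IO */
static int reclaim_flush_page_range(struct page *page)
{
	struct address_space *mapping = page->mapping;
	pgoff_t base;
	loff_t start, end;

	if (!mapping || !igrab(mapping->host))
		return -ENOENT;

	base = page->index & ~(RECLAIM_WRITE_AROUND - 1);
	start = (loff_t)base << PAGE_CACHE_SHIFT;
	end = start + ((loff_t)RECLAIM_WRITE_AROUND << PAGE_CACHE_SHIFT) - 1;

	unlock_page(page);
	__filemap_fdatawrite_range(mapping, start, end, WB_SYNC_NONE);
	iput(mapping->host);
	return 0;
}

Whether reclaim context may call into ->writepages() for a given filesystem is exactly the direct-reclaim stack concern raised earlier in the thread, so this could only ever run from kswapd.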
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-04  3:25 ` Dave Chinner
  2011-07-05 14:34   ` Mel Gorman
  2011-07-06  4:53   ` Wu Fengguang
@ 2011-07-06 15:12   ` Johannes Weiner
  2011-07-08  9:54     ` Dave Chinner
  2 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2011-07-06 15:12 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, xfs@oss.sgi.com,
	linux-mm@kvack.org

On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > Christoph,
> >
> > On Fri, Jul 01, 2011 at 05:33:05PM +0800, Christoph Hellwig wrote:
> > > Johannes, Mel, Wu,
> > >
> > > Dave has been stressing some XFS patches of mine that remove the
> > > XFS internal writeback clustering in favour of using
> > > write_cache_pages.
> > >
> > > As part of investigating the behaviour he found out that we're
> > > still doing lots of I/O from the end of the LRU in kswapd. Not only
> > > is that pretty bad behaviour in general, but it also means we
> > > really can't just remove the writeback clustering in writepage
> > > given how much I/O is still done through that.
> > >
> > > Any chance we could get the writeback vs kswapd behaviour sorted
> > > out a bit better finally?
> >
> > I once tried this approach:
> >
> > http://www.spinics.net/lists/linux-mm/msg09202.html
> >
> > It used a list structure that is not linearly scalable; however, that
> > part should be independently improvable when necessary.
>
> I don't think that handing random writeback to the flusher thread is
> much better than doing random writeback directly. Yes, you added
> some clustering, but I still don't think writing specific pages is
> the best solution.
>
> > The real problem was that it seemed to be not very effective in my
> > test runs. I found many ->nr_pages works queued before the ->inode
> > works, which effectively makes the flusher work on more dispersed
> > pages rather than focusing on the dirty pages encountered in LRU
> > reclaim.
>
> But that's really just an implementation issue related to how you
> tried to solve the problem. That could be addressed.
>
> However, what I'm questioning is whether we should even care what
> page memory reclaim wants to write - it seems to make fundamentally
> bad decisions from an IO perspective.
>
> We have to remember that memory reclaim is doing LRU reclaim and the
> flusher threads are doing "oldest first" writeback. IOWs, both are
> trying to operate in the same direction (oldest to youngest) for the
> same purpose. The fundamental problem that occurs when memory reclaim
> starts writing pages back from the LRU is this:
>
>	- memory reclaim has run ahead of IO writeback -
>
> The LRU usually looks like this:
>
>	oldest						youngest
>	+---------------+---------------+--------------+
>	     clean          writeback        dirty
>	                ^               ^
>	                |               |
>	                |               Where flusher will next work from
>	                |               Where kswapd is working from
>	                |
>	                IO submitted by flusher, waiting on completion
>
> If memory reclaim is hitting dirty pages on the LRU, it means it has
> got ahead of writeback without being throttled - it's passed over
> all the pages currently under writeback and is trying to write back
> pages that are *newer* than what writeback is working on. IOWs, it
> starts trying to do the job of the flusher threads, and it does that
> very badly.
>
> The $100 question is: *why is it getting ahead of writeback*?

Unless you have a purely sequential writer, the LRU order is - at
least in theory - diverging away from the writeback order.
According to the reasoning behind generational garbage collection,
they should in fact be inverse to each other: the oldest pages still
in use are the most likely to be needed again in the future.

In practice, we only make a generational distinction between used-once
and used-many, which manifests in the inactive and the active list.
But still, when reclaim starts off with a localized writer, the oldest
pages are likely to be at the end of the active list.

So pages from the inactive list are likely to be written in the right
order, but at the same time the active pages are even older, and thus
written before them. Memory reclaim starts with the inactive pages,
and this is why it gets ahead.

Then there is also the case where a fast writer pushes dirty pages to
the end of the LRU list, of course, but you already said that this is
not applicable to your workload.

My point is that I don't think it's unexpected that dirty pages come
off the inactive list in practice. It just sucks how we handle them.

^ permalink raw reply	[flat|nested] 20+ messages in thread
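To make the used-once/used-many mechanics above concrete, here is a
toy model, loosely following mark_page_accessed() semantics. It is an
illustration only, not kernel code, and all names are made up:

/*
 * Toy model of the used-once/used-many distinction. A page starts on
 * the inactive list; the first access sets a referenced flag, and the
 * second access promotes it to the active list. This is also why two
 * sub-page write()s landing on the same page activate it.
 */
enum toy_lru { TOY_LRU_INACTIVE, TOY_LRU_ACTIVE };

struct toy_page {
	enum toy_lru list;
	int referenced;		/* models the PG_referenced bit */
};

static void toy_page_accessed(struct toy_page *page)
{
	if (page->list == TOY_LRU_INACTIVE && page->referenced) {
		/* second touch: treat as used-many, promote */
		page->list = TOY_LRU_ACTIVE;
		page->referenced = 0;
	} else {
		page->referenced = 1;
	}
}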
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-06 15:12 ` Johannes Weiner
@ 2011-07-08  9:54   ` Dave Chinner
  2011-07-11 17:20     ` Johannes Weiner
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2011-07-08 9:54 UTC (permalink / raw)
To: Johannes Weiner
Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, xfs@oss.sgi.com,
	linux-mm@kvack.org

On Wed, Jul 06, 2011 at 05:12:29PM +0200, Johannes Weiner wrote:
> On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > We have to remember that memory reclaim is doing LRU reclaim and the
> > flusher threads are doing "oldest first" writeback. IOWs, both are
> > trying to operate in the same direction (oldest to youngest) for the
> > same purpose. The fundamental problem that occurs when memory
> > reclaim starts writing pages back from the LRU is this:
> >
> >	- memory reclaim has run ahead of IO writeback -
> >
> > The LRU usually looks like this:
> >
> >	oldest						youngest
> >	+---------------+---------------+--------------+
> >	     clean          writeback        dirty
> >	                ^               ^
> >	                |               |
> >	                |               Where flusher will next work from
> >	                |               Where kswapd is working from
> >	                |
> >	                IO submitted by flusher, waiting on completion
> >
> > If memory reclaim is hitting dirty pages on the LRU, it means it has
> > got ahead of writeback without being throttled - it's passed over
> > all the pages currently under writeback and is trying to write back
> > pages that are *newer* than what writeback is working on. IOWs, it
> > starts trying to do the job of the flusher threads, and it does that
> > very badly.
> >
> > The $100 question is: *why is it getting ahead of writeback*?
>
> Unless you have a purely sequential writer, the LRU order is - at
> least in theory - diverging away from the writeback order.

Which is the root cause of the IO collapse that writeback from the
LRU causes, yes?

> According to the reasoning behind generational garbage collection,
> they should in fact be inverse to each other: the oldest pages still
> in use are the most likely to be needed again in the future.
>
> In practice, we only make a generational distinction between used-once
> and used-many, which manifests in the inactive and the active list.
> But still, when reclaim starts off with a localized writer, the oldest
> pages are likely to be at the end of the active list.

Yet the file pages on the active list are unlikely to be dirty -
overwrite-in-place cache-hot workloads are pretty scarce in my
experience. Hence writeback of dirty pages from the active LRU is
unlikely to be a problem.

That leaves all the use-once pages cycling through the inactive list.
The oldest pages on this list are the ones that get reclaimed, and if
we are getting lots of dirty pages here, it seems pretty clear that
memory demand is mostly for pages being rapidly dirtied. In that case,
trying to speed up the rate at which they are cleaned by issuing IO is
only effective if there is no IO already in progress. And who knows
whether IO is already in progress? The writeback subsystem....

> So pages from the inactive list are likely to be written in the right
> order, but at the same time the active pages are even older, and thus
> written before them. Memory reclaim starts with the inactive pages,
> and this is why it gets ahead.

All right: if the design is such that you can't avoid having reclaim
write back dirty pages as it encounters them on the inactive LRU,
should the dirty pages even be on that LRU?
That is, dirty pages cannot be reclaimed immediately, yet they are
intertwined with pages that can be reclaimed immediately. We really
want to reclaim the pages that can be reclaimed quickly, while not
blocking on, or continually having to skip over, pages that cannot be
reclaimed.

So why not make a distinction between clean and dirty file pages on
the inactive list? That is, consider dirty pages to still be "in use"
and "owned" by the writeback subsystem. While pages are dirty, they
are kept on a separate "dirty file page LRU" that memory reclaim never
touches unless it runs out of clean pages on the inactive list to
reclaim. And when it does run out of clean pages, it can go find pages
under writeback on the dirty list and block on them before going back
to reclaiming off the clean list.... (See the sketch below for what
this split could look like.)

And given that cgroups have their own LRUs for reclaim now, this
problem of dirty pages being written from the LRUs has a much larger
scope. It's not just a question of whether global LRU reclaim is
hitting dirty pages; it's a per-cgroup problem, and cgroups are much
more likely to have low memory limits that lead to such problems - and
concurrently, at that. Writeback simply doesn't scale to having
multiple sources of random page IO being dispatched concurrently.

> Then there is also the case where a fast writer pushes dirty pages to
> the end of the LRU list, of course, but you already said that this is
> not applicable to your workload.
>
> My point is that I don't think it's unexpected that dirty pages come
> off the inactive list in practice. It just sucks how we handle them.

Exactly what I've been saying. And what I'm also trying to say is that
the way to fix the "we do shitty IO on dirty pages" problem is *not to
do IO*. That's -exactly- why the IO-less write throttling is a
significant improvement: we've turned shitty IO into good IO by
*waiting for IO* during throttling rather than submitting IO.
Fundamentally, scaling to N IO waiters is far easier and more
efficient than scaling to N IO submitters.

All I'm asking is that you apply that same principle to memory
reclaim, please.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
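A very rough sketch of that clean/dirty split - not against any real
kernel tree, with all structure and helper names hypothetical - might
look like:

/*
 * Hypothetical sketch of a split inactive file LRU. Dirty pages stay
 * on their own list, "owned" by writeback; reclaim only turns to
 * them, and then blocks on in-flight IO, once the clean list is
 * empty. lru_to_page() and wait_on_page_writeback() are used as in
 * mm/vmscan.c; everything else here is made up for illustration.
 */
struct file_lru {
	struct list_head inactive_clean;	/* immediately reclaimable */
	struct list_head inactive_dirty;	/* owned by writeback */
};

static struct page *reclaim_next_file_page(struct file_lru *lru)
{
	struct page *page;

	if (!list_empty(&lru->inactive_clean))
		return lru_to_page(&lru->inactive_clean);

	/*
	 * Out of clean pages: wait on the oldest dirty page's IO
	 * instead of submitting random single-page writeback.
	 */
	if (!list_empty(&lru->inactive_dirty)) {
		page = lru_to_page(&lru->inactive_dirty);
		wait_on_page_writeback(page);
		return page;
	}
	return NULL;
}

With this shape, reclaim only ever waits on IO that writeback already
has in flight; it never becomes an IO submitter itself.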
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-08  9:54 ` Dave Chinner
@ 2011-07-11 17:20   ` Johannes Weiner
  2011-07-11 17:24     ` Christoph Hellwig
  2011-07-11 19:09     ` Rik van Riel
  0 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2011-07-11 17:20 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Christoph Hellwig, Mel Gorman, Rik van Riel,
	xfs@oss.sgi.com, linux-mm@kvack.org

On Fri, Jul 08, 2011 at 07:54:56PM +1000, Dave Chinner wrote:
> On Wed, Jul 06, 2011 at 05:12:29PM +0200, Johannes Weiner wrote:
> > On Mon, Jul 04, 2011 at 01:25:34PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 01, 2011 at 11:41:36PM +0800, Wu Fengguang wrote:
> > > We have to remember that memory reclaim is doing LRU reclaim and
> > > the flusher threads are doing "oldest first" writeback. IOWs, both
> > > are trying to operate in the same direction (oldest to youngest)
> > > for the same purpose. The fundamental problem that occurs when
> > > memory reclaim starts writing pages back from the LRU is this:
> > >
> > >	- memory reclaim has run ahead of IO writeback -
> > >
> > > The LRU usually looks like this:
> > >
> > >	oldest						youngest
> > >	+---------------+---------------+--------------+
> > >	     clean          writeback        dirty
> > >	                ^               ^
> > >	                |               |
> > >	                |               Where flusher will next work from
> > >	                |               Where kswapd is working from
> > >	                |
> > >	                IO submitted by flusher, waiting on completion
> > >
> > > If memory reclaim is hitting dirty pages on the LRU, it means it
> > > has got ahead of writeback without being throttled - it's passed
> > > over all the pages currently under writeback and is trying to
> > > write back pages that are *newer* than what writeback is working
> > > on. IOWs, it starts trying to do the job of the flusher threads,
> > > and it does that very badly.
> > >
> > > The $100 question is: *why is it getting ahead of writeback*?
> >
> > Unless you have a purely sequential writer, the LRU order is - at
> > least in theory - diverging away from the writeback order.
>
> Which is the root cause of the IO collapse that writeback from the
> LRU causes, yes?
>
> > According to the reasoning behind generational garbage collection,
> > they should in fact be inverse to each other: the oldest pages still
> > in use are the most likely to be needed again in the future.
> >
> > In practice, we only make a generational distinction between
> > used-once and used-many, which manifests in the inactive and the
> > active list. But still, when reclaim starts off with a localized
> > writer, the oldest pages are likely to be at the end of the active
> > list.
>
> Yet the file pages on the active list are unlikely to be dirty -
> overwrite-in-place cache-hot workloads are pretty scarce in my
> experience. Hence writeback of dirty pages from the active LRU is
> unlikely to be a problem.

Just to clarify, I looked at this too much from the reclaim POV, where
use-once applies to full pages, not bytes.

Even if you do not overwrite the same bytes over and over again,
issuing two subsequent write()s that end up against the same page will
have it activated.

Are your workloads writing in perfectly page-aligned chunks?

This effect may build up slowly, but every page that is written from
the active list makes room for a dirty page on the inactive list wrt
the dirty limit. I.e. without the active pages, you have 10-20% dirty
pages at the head of the inactive list (default dirty ratio), or an
80-90% clean tail, and for every page cleaned, a new dirty page can
appear at the inactive head.
But taking the active list into account, some of those clean pages are
taken away from the head start the flusher has over the reclaimer;
they sit on the active list. For every page cleaned, a new dirty page
can appear at the inactive head, plus a few deactivated clean pages.

Now, the active list is not scanned any more until it is bigger than
the inactive list, giving the flushers plenty of time to clean the
pages on it and letting them accumulate even while memory pressure is
already occurring. For every page cleaned, a new dirty page can appear
at the inactive head, plus a LOT of deactivated clean pages.

So when memory needs to be reclaimed, the LRU lists in those three
scenarios look like this:

	inactive-only:	[CCCCCCCCDD][]
	active-small:	[CCCCCCDD][CC]
	active-huge:	[CCCDD][CCCCC]

where the third scenario is the most likely for the reclaimer to run
into dirty pages.

I CC'd Rik for reclaim wizardry. But if I am not completely off with
this, there is a chance that the change that let the active list grow
unscanned may actually have contributed to this single-page writing
problem becoming worse?

commit 56e49d218890f49b0057710a4b6fef31f5ffbfec
Author: Rik van Riel <riel@redhat.com>
Date:   Tue Jun 16 15:32:28 2009 -0700

    vmscan: evict use-once pages first

    When the file LRU lists are dominated by streaming IO pages, evict
    those pages first, before considering evicting other pages.

    This should be safe from deadlocks or performance problems
    because only three things can happen to an inactive file page:

    1) referenced twice and promoted to the active list
    2) evicted by the pageout code
    3) under IO, after which it will get evicted or promoted

    The pages freed in this way can either be reused for streaming IO,
    or allocated for something else.  If the pages are used for
    streaming IO, this pageout pattern continues.  Otherwise, we will
    fall back to the normal pageout pattern.

    Signed-off-by: Rik van Riel <riel@redhat.com>
    Reported-by: Elladan <elladan@eskimo.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-11 17:20 ` Johannes Weiner
@ 2011-07-11 17:24   ` Christoph Hellwig
  2011-07-11 19:09   ` Rik van Riel
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2011-07-11 17:24 UTC (permalink / raw)
To: Johannes Weiner
Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Rik van Riel, xfs@oss.sgi.com, linux-mm@kvack.org

On Mon, Jul 11, 2011 at 07:20:50PM +0200, Johannes Weiner wrote:
> > Yet the file pages on the active list are unlikely to be dirty -
> > overwrite-in-place cache-hot workloads are pretty scarce in my
> > experience. Hence writeback of dirty pages from the active LRU is
> > unlikely to be a problem.
>
> Just to clarify, I looked at this too much from the reclaim POV, where
> use-once applies to full pages, not bytes.
>
> Even if you do not overwrite the same bytes over and over again,
> issuing two subsequent write()s that end up against the same page will
> have it activated.
>
> Are your workloads writing in perfectly page-aligned chunks?

Many workloads do, given that we already tell them our preferred I/O
size through struct stat, which is always the page size or larger.

That won't help workloads that have to write in small chunk sizes. But
the performance-critical ones that use small chunk sizes usually use
O_(D)SYNC, so the pages will be clean after the write has returned to
userspace.
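For example, a userspace writer can honour that hint by sizing its
write() buffers from the st_blksize field that fstat() reports - a
minimal illustrative snippet, with only basic error handling:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;
	char *buf;
	int i;

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	/* st_blksize is the kernel's preferred I/O size, >= page size */
	buf = malloc(st.st_blksize);
	if (!buf)
		return 1;
	memset(buf, 'x', st.st_blksize);

	/* page-aligned writes: no page is dirtied by two separate write()s */
	for (i = 0; i < 256; i++)
		if (write(fd, buf, st.st_blksize) != (ssize_t)st.st_blksize)
			return 1;

	free(buf);
	return close(fd) ? 1 : 0;
}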
* Re: [PATCH 03/27] xfs: use write_cache_pages for writeback clustering
  2011-07-11 17:20 ` Johannes Weiner
  2011-07-11 17:24   ` Christoph Hellwig
@ 2011-07-11 19:09   ` Rik van Riel
  1 sibling, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2011-07-11 19:09 UTC (permalink / raw)
To: Johannes Weiner
Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	xfs@oss.sgi.com, linux-mm@kvack.org

On 07/11/2011 01:20 PM, Johannes Weiner wrote:

> I CC'd Rik for reclaim wizardry. But if I am not completely off with
> this, there is a chance that the change that let the active list grow
> unscanned may actually have contributed to this single-page writing
> problem becoming worse?

Yes, the patch probably contributed. However, the patch does help
protect the working set in the page cache from streaming IO, so on
balance I believe we need to keep this change.

What it changes is that the size of the inactive file list can no
longer grow unbounded, keeping it a little smaller than it could have
grown in the past.

> commit 56e49d218890f49b0057710a4b6fef31f5ffbfec
> Author: Rik van Riel <riel@redhat.com>
> Date:   Tue Jun 16 15:32:28 2009 -0700
>
>     vmscan: evict use-once pages first
>
>     When the file LRU lists are dominated by streaming IO pages, evict
>     those pages first, before considering evicting other pages.
>
>     This should be safe from deadlocks or performance problems
>     because only three things can happen to an inactive file page:
>
>     1) referenced twice and promoted to the active list
>     2) evicted by the pageout code
>     3) under IO, after which it will get evicted or promoted
>
>     The pages freed in this way can either be reused for streaming IO,
>     or allocated for something else.  If the pages are used for
>     streaming IO, this pageout pattern continues.  Otherwise, we will
>     fall back to the normal pageout pattern.
>
>     Signed-off-by: Rik van Riel <riel@redhat.com>
>     Reported-by: Elladan <elladan@eskimo.com>
>     Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>     Cc: Peter Zijlstra <peterz@infradead.org>
>     Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
>     Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>     Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

--
All rights reversed

^ permalink raw reply	[flat|nested] 20+ messages in thread