* [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Mel Gorman <mgorman@suse.de>
  Date: 2011-11-10 10:06 UTC
  To: Andrew Morton
  Cc: Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled. Reports on this have
been intermittent and not reliable to reproduce locally, but:

Andy Isaacson reported a problem copying to VFAT on SD Card
https://lkml.org/lkml/2011/11/7/2

	In this case, it was stuck in munmap for between 20 and 60
	seconds in compaction. It is also possible that khugepaged
	was holding mmap_sem on this process if CONFIG_NUMA was set.

Johannes Weiner reported stalls on USB
https://lkml.org/lkml/2011/7/25/378

	In this case, there is no stack trace but it looks like the
	same problem. The USB stick may have been using NTFS as a
	filesystem, based on other work done related to writing back
	to USB around the same time.

Internally in SUSE, I received a bug report related to stalls in firefox
when using Java and Flash heavily while copying from NFS to VFAT on USB.
It has not been confirmed to be the same problem, but if it looks like a
duck and quacks like a duck.....

In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] forced that sync compaction
would never be used for THP allocations. This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync migraton
with __GFP_NO_KSWAPD] on the grounds that it was uncertain it was
beneficial.

While user-visible stalls do not happen for me when writing to USB, I
set up a test running postmark while short-lived processes created
anonymous mappings. The objective was to exercise the paths that
allocate transparent huge pages. I then logged when processes were
stalled for more than 1 second, recorded a stack trace and did some
analysis to aggregate unique "stall events", which revealed

Time stalled in this event: 47369 ms
Event count:                   20
usemem               sleep_on_page        3690 ms
usemem               sleep_on_page        2148 ms
usemem               sleep_on_page        1534 ms
usemem               sleep_on_page        1518 ms
usemem               sleep_on_page        1225 ms
usemem               sleep_on_page        2205 ms
usemem               sleep_on_page        2399 ms
usemem               sleep_on_page        2398 ms
usemem               sleep_on_page        3760 ms
usemem               sleep_on_page        1861 ms
usemem               sleep_on_page        2948 ms
usemem               sleep_on_page        1515 ms
usemem               sleep_on_page        1386 ms
usemem               sleep_on_page        1882 ms
usemem               sleep_on_page        1850 ms
usemem               sleep_on_page        3715 ms
usemem               sleep_on_page        3716 ms
usemem               sleep_on_page        4846 ms
usemem               sleep_on_page        1306 ms
usemem               sleep_on_page        1467 ms
[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30

The stall times are approximate at best, but the estimates represent 25%
of the worst stalls, and even if the estimates are off by a factor of
10, it's severe.

This patch once again prevents sync migration for transparent hugepage
allocations, as it is preferable to fail a THP allocation than to stall.
It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD.
This would look less like a special case but would still cause
compaction to run at least once with sync compaction.

If accepted, this is a -stable candidate.

Reported-by: Andy Isaacson <adi@hexapodia.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 963c5de..cddc2d0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2213,7 +2213,13 @@ rebalance:
 						sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+
+	/*
+	 * Do not use sync migration for transparent hugepage allocations as
+	 * it could stall writing back pages which is far worse than simply
+	 * failing to promote a page.
+	 */
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
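For context, the hunk above sits between two calls into compaction in
the allocator slowpath. The following is a simplified sketch of the
surrounding control flow, abridged from the mm/page_alloc.c of this era
with parameter lists elided ("..."), so treat it as an illustration
rather than a verbatim excerpt:

rebalance:
	/*
	 * First pass: compaction runs in asynchronous mode and never
	 * waits on locked pages or pages under writeback.
	 */
	page = __alloc_pages_direct_compact(gfp_mask, order, ...,
					    sync_migration);
	if (page)
		goto got_pg;

	/*
	 * Before this patch, every later pass was synchronous. With the
	 * patch, __GFP_NO_KSWAPD callers (the THP fault path) keep using
	 * asynchronous compaction and fail rather than stall.
	 */
	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);

	/* Direct reclaim follows; retries loop back to rebalance above */
	page = __alloc_pages_direct_reclaim(gfp_mask, order, ...);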
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Johannes Weiner
  Date: 2011-11-10 10:38 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Jan Kara, Andy Isaacson, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
> Occasionally during large file copies to slow storage, there are still
> reports of user-visible stalls when THP is enabled. Reports on this
> have been intermittent and not reliable to reproduce locally, but:
>
> Andy Isaacson reported a problem copying to VFAT on SD Card
> https://lkml.org/lkml/2011/11/7/2
>
> 	In this case, it was stuck in munmap for between 20 and 60
> 	seconds in compaction. It is also possible that khugepaged
> 	was holding mmap_sem on this process if CONFIG_NUMA was set.
>
> Johannes Weiner reported stalls on USB
> https://lkml.org/lkml/2011/7/25/378
>
> 	In this case, there is no stack trace but it looks like the
> 	same problem. The USB stick may have been using NTFS as a
> 	filesystem, based on other work done related to writing back
> 	to USB around the same time.

Sorry, here is a trace from when I first recorded the problem:

[171252.437688] firefox       D 000000010a31e6b7     0  8502   1110 0x00000000
[171252.437691]  ffff88001e91f8a8 0000000000000082 ffff880000000000 ffff88001e91ffd8
[171252.437693]  ffff88001e91ffd8 0000000000004000 ffffffff8180b020 ffff8800456e4680
[171252.437696]  ffff88001e91f7e8 ffffffff810cd4ed ffffea00028d4500 000000000000000e
[171252.437699] Call Trace:
[171252.437701]  [<ffffffff810cd4ed>] ? __pagevec_free+0x2d/0x40
[171252.437703]  [<ffffffff810d044c>] ? release_pages+0x24c/0x280
[171252.437705]  [<ffffffff81099448>] ? ktime_get_ts+0xa8/0xe0
[171252.437707]  [<ffffffff810c5040>] ? file_read_actor+0x170/0x170
[171252.437709]  [<ffffffff8156ac8a>] io_schedule+0x8a/0xd0
[171252.437711]  [<ffffffff810c5049>] sleep_on_page+0x9/0x10
[171252.437713]  [<ffffffff8156b0f7>] __wait_on_bit+0x57/0x80
[171252.437715]  [<ffffffff810c5630>] wait_on_page_bit+0x70/0x80
[171252.437718]  [<ffffffff810916c0>] ? autoremove_wake_function+0x40/0x40
[171252.437720]  [<ffffffff810f98a5>] migrate_pages+0x2a5/0x490
[171252.437722]  [<ffffffff810f5940>] ? suitable_migration_target+0x50/0x50
[171252.437724]  [<ffffffff810f6234>] compact_zone+0x4e4/0x770
[171252.437727]  [<ffffffff810f65e0>] compact_zone_order+0x80/0xb0
[171252.437729]  [<ffffffff810f66cd>] try_to_compact_pages+0xbd/0xf0
[171252.437731]  [<ffffffff810cc608>] __alloc_pages_direct_compact+0xa8/0x180
[171252.437734]  [<ffffffff810ccbdb>] __alloc_pages_nodemask+0x4fb/0x720
[171252.437736]  [<ffffffff810fbfe3>] do_huge_pmd_wp_page+0x4c3/0x740
[171252.437738]  [<ffffffff81094fff>] ? hrtimer_start_range_ns+0xf/0x20
[171252.437740]  [<ffffffff810e3f9c>] handle_mm_fault+0x1bc/0x2f0
[171252.437743]  [<ffffffff8102e6c6>] ? __switch_to+0x1e6/0x2c0
[171252.437745]  [<ffffffff810511b2>] do_page_fault+0x132/0x430
[171252.437747]  [<ffffffff810a4995>] ? sys_futex+0x105/0x1a0
[171252.437749]  [<ffffffff8156cf9f>] page_fault+0x1f/0x30

It could have been vfat, I don't remember for sure. But I would think
the problem depends only on the duration of PageWriteback being set.

> Internally in SUSE, I received a bug report related to stalls in firefox
> when using Java and Flash heavily while copying from NFS to VFAT on USB.
> It has not been confirmed to be the same problem, but if it looks like a
> duck and quacks like a duck.....
>
> In the past, commit [11bc82d6: mm: compaction: Use async migration for
> __GFP_NO_KSWAPD and enforce no writeback] forced that sync compaction
> would never be used for THP allocations. This was reverted in commit
> [c6a140bf: mm/compaction: reverse the change that forbade sync migraton
> with __GFP_NO_KSWAPD] on the grounds that it was uncertain it was
> beneficial.
>
> While user-visible stalls do not happen for me when writing to USB, I
> set up a test running postmark while short-lived processes created
> anonymous mappings. The objective was to exercise the paths that
> allocate transparent huge pages. I then logged when processes were
> stalled for more than 1 second, recorded a stack trace and did some
> analysis to aggregate unique "stall events", which revealed
>
> [stall event table and stack trace snipped; see the original posting above]
>
> The stall times are approximate at best, but the estimates represent 25%
> of the worst stalls, and even if the estimates are off by a factor of
> 10, it's severe.
>
> This patch once again prevents sync migration for transparent hugepage
> allocations, as it is preferable to fail a THP allocation than to stall.
> It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD.
> This would look less like a special case but would still cause
> compaction to run at least once with sync compaction.
>
> If accepted, this is a -stable candidate.
>
> Reported-by: Andy Isaacson <adi@hexapodia.org>
> Reported-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Tested-by: Johannes Weiner <jweiner@redhat.com>
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Alan Cox
  Date: 2011-11-10 10:51 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, 10 Nov 2011 10:06:16 +0000
Mel Gorman <mgorman@suse.de> wrote:

> Occasionally during large file copies to slow storage, there are still
> reports of user-visible stalls when THP is enabled. Reports on this
> have been intermittent and not reliable to reproduce locally, but:

If you want to cause a massive stall, take a cheap 32GB USB flash drive,
plug it into an 8GB box and rsync a lot of small files to it. 400,000
emails in maildir format does the trick and can easily be simulated. The
drive drops to about 1-2 IOPS with all the small mucking around and the
backlog becomes massive.

> Internally in SUSE, I received a bug report related to stalls in firefox
> when using Java and Flash heavily while copying from NFS to VFAT on USB.
> It has not been confirmed to be the same problem, but if it looks like a
> duck and quacks like a duck.....

With the 32GB USB flash rsync I see firefox block for up to 45 minutes
despite operating entirely on an unrelated filesystem. I suspect it may
be a problem that is visible because an fsync is getting jammed up in
the mess.

Alan
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Johannes Weiner
  Date: 2011-11-10 12:06 UTC
  To: Alan Cox
  Cc: Mel Gorman, Andrew Morton, Jan Kara, Andy Isaacson, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, Nov 10, 2011 at 10:51:00AM +0000, Alan Cox wrote:
> On Thu, 10 Nov 2011 10:06:16 +0000
> Mel Gorman <mgorman@suse.de> wrote:
>
> > Occasionally during large file copies to slow storage, there are still
> > reports of user-visible stalls when THP is enabled. Reports on this
> > have been intermittent and not reliable to reproduce locally, but:
>
> If you want to cause a massive stall, take a cheap 32GB USB flash drive,
> plug it into an 8GB box and rsync a lot of small files to it. 400,000
> emails in maildir format does the trick and can easily be simulated. The
> drive drops to about 1-2 IOPS with all the small mucking around and the
> backlog becomes massive.
>
> > Internally in SUSE, I received a bug report related to stalls in firefox
> > when using Java and Flash heavily while copying from NFS to VFAT on USB.
> > It has not been confirmed to be the same problem, but if it looks like a
> > duck and quacks like a duck.....
>
> With the 32GB USB flash rsync I see firefox block for up to 45 minutes
> despite operating entirely on an unrelated filesystem. I suspect it may
> be a problem that is visible because an fsync is getting jammed up in
> the mess.

Compaction walks PFN ranges, oblivious to inode dirtying order, and so
transparent huge page allocations can get stuck repeatedly on pages
under writeback that are behind whatever the bdi's queue allows to be
in flight.

On all hangs I observed while writing to my 16GB USB thumb drive, it
was tasks getting stuck in migration when allocating a THP.

Can you capture /proc/`pidof firefox`/stack while it hangs to see if
what you see is, in fact, the same problem?
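The asymmetry Johannes describes lives in the migration core: when
compaction hands migrate_pages() a page that is under writeback,
asynchronous mode skips it while synchronous mode blocks until the
device finishes the I/O. Below is a paraphrased sketch of the relevant
check in the unmap_and_move() path of mm/migrate.c of this era; the
surrounding logic and error labels are simplified, so treat it as an
illustration rather than a verbatim excerpt:

	if (PageWriteback(page)) {
		/*
		 * Asynchronous compaction skips the page and moves on.
		 * Synchronous compaction waits here; behind a slow
		 * USB/SD write queue that wait can be seconds per page,
		 * and a THP fault may hit many such pages in a row.
		 */
		if (!force || !sync)
			goto out;	/* skip, try the next page */
		wait_on_page_writeback(page);
	}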
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Andrea Arcangeli
  Date: 2011-11-10 14:00 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm

On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 963c5de..cddc2d0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2213,7 +2213,13 @@ rebalance:
>  						sync_migration);
>  	if (page)
>  		goto got_pg;
> -	sync_migration = true;
> +
> +	/*
> +	 * Do not use sync migration for transparent hugepage allocations as
> +	 * it could stall writing back pages which is far worse than simply
> +	 * failing to promote a page.
> +	 */
> +	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>  
>  	/* Try direct reclaim and then allocating */
>  	page = __alloc_pages_direct_reclaim(gfp_mask, order,

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Mel Gorman
  Date: 2011-11-10 14:22 UTC
  To: Andrew Morton
  Cc: Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
> than stall. It was suggested that __GFP_NORETRY be used instead of
> __GFP_NO_KSWAPD. This would look less like a special case but would
> still cause compaction to run at least once with sync compaction.

This comment is bogus - __GFP_NORETRY would have caught THP allocations
and would not call sync compaction. The issue was that it would also
have caught any hypothetical high-order GFP_THISNODE allocations that
end up calling compaction here

	/*
	 * High-order allocations do not necessarily loop after
	 * direct reclaim and reclaim/compaction depends on
	 * compaction being called after reclaim so call directly if
	 * necessary
	 */
	page = __alloc_pages_direct_compact(gfp_mask, order,
					zonelist, high_zoneidx,
					nodemask,
					alloc_flags, preferred_zone,
					migratetype, &did_some_progress,
					sync_migration);

__GFP_NORETRY is used in a bunch of places and while most of them are
not high-order, some of them potentially are, like in
sound/core/memalloc.c. Using __GFP_NO_KSWAPD as the flag allows these
callers to continue using sync compaction. It could be argued that they
would prefer __GFP_NORETRY, but the potential side-effects should be
taken into account and the comment updated if that happens.

-- 
Mel Gorman
SUSE Labs
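To make the trade-off concrete: __GFP_NORETRY does not skip compaction,
it only prevents looping, so a failing high-order __GFP_NORETRY
allocation still reaches the one final direct-compaction call quoted
above. A sketch of the retry decision being discussed, abridged from
the same slowpath (the should-retry helper is inlined and arguments are
elided, so this is not a verbatim excerpt):

	/* Do not loop if specifically requested */
	if (gfp_mask & __GFP_NORETRY)
		should_retry = 0;

	if (should_retry) {
		/* Wait for some write requests to complete then retry */
		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
		goto rebalance;
	} else {
		/*
		 * The final compaction attempt quoted above. Keying
		 * sync_migration off __GFP_NORETRY would have made this
		 * call asynchronous for every __GFP_NORETRY user, not
		 * just THP, which is the side-effect being avoided.
		 */
		page = __alloc_pages_direct_compact(gfp_mask, order, ...,
						    sync_migration);
	}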
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Minchan Kim
  Date: 2011-11-10 15:12 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

Hi Mel,

You should have Cc'ed me because __GFP_NORETRY was raised by me.

On Thu, Nov 10, 2011 at 11:22 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
>> than stall. It was suggested that __GFP_NORETRY be used instead of
>> __GFP_NO_KSWAPD. This would look less like a special case but would
>> still cause compaction to run at least once with sync compaction.
>
> This comment is bogus - __GFP_NORETRY would have caught THP allocations
> and would not call sync compaction. The issue was that it would also
> have caught any hypothetical high-order GFP_THISNODE allocations that
> end up calling compaction here

In fact, I support the patch concept, so I would like to give

Acked-by: Minchan Kim <minchan.kim@gmail.com>

But I still have doubts about the code.

	__GFP_NORETRY: The VM implementation must not retry indefinitely

What could people think if they look at the above comment?
I can imagine at least two readings.

First, it is related to *latency*.
Second, "I can handle it if the VM fails the allocation".

I am biased toward the latter.
Is __GFP_NO_KSWAPD okay, then? Does it mean "let's avoid sync compaction
or long latency"? It's a rather awkward name. Someone has already
started to use __GFP_NO_KSWAPD for such a purpose. See
mtd_kmalloc_up_to. He says in the function's comment:

 * the system page size. This attempts to make sure it does not adversely
 * impact system performance, so when allocating more than one page, we
 * ask the memory allocator to avoid re-trying, swapping, writing back
 * or performing I/O.

That is what concerned me.
In the future, new users of __GFP_NO_KSWAPD will appear and we can't
keep them all under our sight. So I hope we can either change the flag
name, or fix the above code and add a comment above __GFP_NO_KSWAPD:

	/*
	 * __GFP_NO_KSWAPD is a very VM-internal flag, so please don't
	 * use it without the agreement of the mm guys.
	 */
	#define __GFP_NO_KSWAPD xxxx

>
> 	/*
> 	 * High-order allocations do not necessarily loop after
> 	 * direct reclaim and reclaim/compaction depends on
> 	 * compaction being called after reclaim so call directly if
> 	 * necessary
> 	 */
> 	page = __alloc_pages_direct_compact(gfp_mask, order,
> 					zonelist, high_zoneidx,
> 					nodemask,
> 					alloc_flags, preferred_zone,
> 					migratetype, &did_some_progress,
> 					sync_migration);
>
> __GFP_NORETRY is used in a bunch of places and while most of them are
> not high-order, some of them potentially are, like in
> sound/core/memalloc.c. Using __GFP_NO_KSWAPD as the flag allows these
> callers to continue using sync compaction. It could be argued that they

Okay. If I had been biased toward the former, I would have opposed this
comment, because such users might think __GFP_NORETRY is very latency
sensitive; they would want the allocation to be very fast, without any
writeback or retrying. From that viewpoint, __GFP_NORETRY isn't bad, I
think.

Having said that, I am biased toward the latter, as I said earlier.

> would prefer __GFP_NORETRY, but the potential side-effects should be
> taken into account and the comment updated if that happens.

Considering the side-effects, your patch is okay.
But I can't understand the "the comment updated if that happens"
sentence you mentioned. :(

-- 
Kind regards,
Minchan Kim
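The mtd user Minchan cites follows a "try big, shrink on failure"
pattern: it asks for a large buffer with the politest possible flags and
halves the request on failure. A paraphrased sketch from memory of
drivers/mtd/mtdcore.c of this era follows; the exact flag set and the
minimum request size are approximations, not a verbatim excerpt:

	void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
	{
		gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
			      __GFP_NORETRY | __GFP_NO_KSWAPD;
		void *kbuf;

		while (*size > PAGE_SIZE) {
			kbuf = kmalloc(*size, flags);
			if (kbuf)
				return kbuf;
			*size >>= 1;	/* halve the request and retry */
		}
		/* The minimum-sized request may block and retry normally */
		return kmalloc(*size, GFP_KERNEL);
	}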
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Mel Gorman
  Date: 2011-11-10 16:13 UTC
  To: Minchan Kim
  Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

On Fri, Nov 11, 2011 at 12:12:01AM +0900, Minchan Kim wrote:
> Hi Mel,
>
> You should have Cc'ed me because __GFP_NORETRY was raised by me.

I don't understand. __GFP_NORETRY has been around a long time and has
loads of users. Do you have particular concerns about the behaviour of
__GFP_NORETRY?

> On Thu, Nov 10, 2011 at 11:22 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
> >> than stall. It was suggested that __GFP_NORETRY be used instead of
> >> __GFP_NO_KSWAPD. This would look less like a special case but would
> >> still cause compaction to run at least once with sync compaction.
> >
> > This comment is bogus - __GFP_NORETRY would have caught THP allocations
> > and would not call sync compaction. The issue was that it would also
> > have caught any hypothetical high-order GFP_THISNODE allocations that
> > end up calling compaction here
>
> In fact, I support the patch concept, so I would like to give
>
> Acked-by: Minchan Kim <minchan.kim@gmail.com>
>
> But I still have doubts about the code.
>
> 	__GFP_NORETRY: The VM implementation must not retry indefinitely
>
> What could people think if they look at the above comment?
> I can imagine at least two readings.
>
> First, it is related to *latency*.

I never read it to mean that. I always read it to mean "it must not
retry indefinitely". I expected it was still fine to try direct reclaim,
which is not a latency-sensitive operation.

> Second, "I can handle it if the VM fails the allocation".

Fully agreed.

> I am biased toward the latter.
> Is __GFP_NO_KSWAPD okay, then? Does it mean "let's avoid sync compaction
> or long latency"?

and do not wake kswapd to swap additional pages.

> It's a rather awkward name. Someone has already
> started to use __GFP_NO_KSWAPD for such a purpose. See
> mtd_kmalloc_up_to. He says in the function's comment:
>
>  * the system page size. This attempts to make sure it does not adversely
>  * impact system performance, so when allocating more than one page, we
>  * ask the memory allocator to avoid re-trying, swapping, writing back
>  * or performing I/O.

I was not aware of this user, although their reason for using it seems
fine. We are not breaking their expectations with this patch, but it is
a slippery slope.

> That is what concerned me.
> In the future, new users of __GFP_NO_KSWAPD will appear and we can't
> keep them all under our sight. So I hope we can either change the flag
> name, or fix the above code and add a comment above __GFP_NO_KSWAPD:

I can't think of a better name but we can at least improve the comment.

> 	/*
> 	 * __GFP_NO_KSWAPD is a very VM-internal flag, so please don't
> 	 * use it without the agreement of the mm guys.
> 	 */
> 	#define __GFP_NO_KSWAPD xxxx
>
> > [the __alloc_pages_direct_compact call quoted above snipped]
> >
> > __GFP_NORETRY is used in a bunch of places and while most of them are
> > not high-order, some of them potentially are, like in
> > sound/core/memalloc.c. Using __GFP_NO_KSWAPD as the flag allows these
> > callers to continue using sync compaction. It could be argued that they
>
> Okay. If I had been biased toward the former, I would have opposed this
> comment, because such users might think __GFP_NORETRY is very latency
> sensitive.

As __GFP_NORETRY uses direct reclaim, I do not think users expect it
to be latency sensitive. If they were latency sensitive, they would
mask out __GFP_WAIT.

> They would want the allocation to be very fast, without any
> writeback or retrying. From that viewpoint, __GFP_NORETRY isn't bad, I
> think.
>
> Having said that, I am biased toward the latter, as I said earlier.
>
> > would prefer __GFP_NORETRY, but the potential side-effects should be
> > taken into account and the comment updated if that happens.
>
> Considering the side-effects, your patch is okay.
> But I can't understand the "the comment updated if that happens"
> sentence you mentioned. :(

I meant that the comment above sync_migration = !(gfp_mask &
__GFP_NO_KSWAPD) should be updated if it changes to __GFP_NORETRY.

I updated the commentary a bit. Does this help?

==== CUT HERE ====
mm: Do not stall in synchronous compaction for THP allocations

Changelog since V1
o Update changelog with more detail
o Add comment on __GFP_NO_KSWAPD

[The report summaries, commit history, stall event table and stack
trace are unchanged from the original posting above and are snipped
here.]

The stall times are approximate at best, but the estimates represent 25%
of the worst stalls, and even if the estimates are off by a factor of
10, it's severe.

This patch once again prevents sync migration for transparent hugepage
allocations, as it is preferable to fail a THP allocation than to stall.
It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD
to look less like a special case. This would prevent THP allocation
using sync compaction, but it would have other side-effects. There are
existing users of __GFP_NORETRY that are doing high-order allocations,
and while they can handle allocation failure, it seems reasonable that
they continue to use sync compaction unless there is a deliberate
reason to change that. To help clarify this for the future, this patch
updates the comment for __GFP_NO_KSWAPD.

If accepted, this is a -stable candidate.

Reported-by: Andy Isaacson <adi@hexapodia.org>
Reported-and-tested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h |    8 ++++++++
 mm/page_alloc.c     |    9 ++++++++-
 2 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..4470a3c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -83,7 +83,15 @@ struct vm_area_struct;
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
+/*
+ * __GFP_NO_KSWAPD indicates that the VM should favour failing the allocation
+ * over excessive disruption of the system. Currently this means
+ * 1. Do not wake kswapd (hence the flag name)
+ * 2. Do not stall in synchronous compaction for high-order allocations
+ *    as this may cause the caller to stall writing out pages
+ */
 #define __GFP_NO_KSWAPD		((__force gfp_t)___GFP_NO_KSWAPD)
+
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..a3b5da6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2161,7 +2161,14 @@ rebalance:
 						sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+
+	/*
+	 * Do not use sync migration if __GFP_NO_KSWAPD is used to indicate
+	 * the system should not be heavily disrupted. In practice, this is
+	 * to avoid THP callers being stalled in writeback during migration
+	 * as it's preferable for the allocations to fail than to stall
+	 */
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Minchan Kim
  Date: 2011-11-10 16:30 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

On Fri, Nov 11, 2011 at 1:13 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Fri, Nov 11, 2011 at 12:12:01AM +0900, Minchan Kim wrote:
>> Hi Mel,
>>
>> You should have Cc'ed me because __GFP_NORETRY was raised by me.
>
> I don't understand. __GFP_NORETRY has been around a long time and has
> loads of users. Do you have particular concerns about the behaviour of
> __GFP_NORETRY?

It seems that I misunderstood your point. I thought you were referring
to this:

http://www.spinics.net/lists/linux-mm/msg16371.html

> [discussion of the __GFP_NORETRY readings and the mtd user snipped;
> quoted in full in the parent message]
>
> As __GFP_NORETRY uses direct reclaim, I do not think users expect it
> to be latency sensitive. If they were latency sensitive, they would
> mask out __GFP_WAIT.

Indeed.

> I meant that the comment above sync_migration = !(gfp_mask &
> __GFP_NO_KSWAPD) should be updated if it changes to __GFP_NORETRY.
>
> I updated the commentary a bit. Does this help?

I am happy with your updated comment, but there is a concern: we are
adding yet another flag for all page allocation callers to consider,
and there are already too many flags in there.

I really hope this flag can be made VM-internal so that others do not
have to care about it. Until now, users were happy without such a flag,
and it is only a huge page problem.

-- 
Kind regards,
Minchan Kim
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Mel Gorman
  Date: 2011-11-10 16:48 UTC
  To: Minchan Kim
  Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

On Fri, Nov 11, 2011 at 01:30:24AM +0900, Minchan Kim wrote:
> On Fri, Nov 11, 2011 at 1:13 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Fri, Nov 11, 2011 at 12:12:01AM +0900, Minchan Kim wrote:
> >> Hi Mel,
> >>
> >> You should have Cc'ed me because __GFP_NORETRY was raised by me.
> >
> > I don't understand. __GFP_NORETRY has been around a long time and has
> > loads of users. Do you have particular concerns about the behaviour of
> > __GFP_NORETRY?
>
> It seems that I misunderstood your point. I thought you were referring
> to this:
>
> http://www.spinics.net/lists/linux-mm/msg16371.html

Ah, no I wasn't. I was referring to Andrea's mail suggesting the use of
__GFP_NORETRY and I did not look up whether it had already been
discussed. My bad.

> [remainder of the quoted exchange snipped]
>
> I am happy with your updated comment, but there is a concern: we are
> adding yet another flag for all page allocation callers to consider,
> and there are already too many flags in there.
>
> I really hope this flag can be made VM-internal so that others do not
> have to care about it. Until now, users were happy without such a flag,
> and it is only a huge page problem.

I'd rather avoid moving some GFP flags to mm/internal.h unless there is
blatant abuse of flags. I added a warning. How about this?

==== CUT HERE ====
mm: Do not stall in synchronous compaction for THP allocations

Changelog since V1
o Update changelog with more detail
o Add comment on __GFP_NO_KSWAPD

[The changelog is otherwise unchanged from the V2 posting above and is
snipped here.]

Reported-by: Andy Isaacson <adi@hexapodia.org>
Reported-and-tested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h |   11 +++++++++++
 mm/page_alloc.c     |    9 ++++++++-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..ef1b1af 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -83,7 +83,18 @@ struct vm_area_struct;
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
+/*
+ * __GFP_NO_KSWAPD indicates that the VM should favour failing the allocation
+ * over excessive disruption of the system. Currently this means
+ * 1. Do not wake kswapd (hence the flag name)
+ * 2. Do not stall in synchronous compaction for high-order allocations
+ *    as this may cause the caller to stall writing out pages
+ *
+ * This flag is primarily intended for use with transparent hugepage support.
+ * If the flag is used outside the VM, linux-mm should be cc'd for review.
+ */
 #define __GFP_NO_KSWAPD		((__force gfp_t)___GFP_NO_KSWAPD)
+
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..a3b5da6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2161,7 +2161,14 @@ rebalance:
 						sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+
+	/*
+	 * Do not use sync migration if __GFP_NO_KSWAPD is used to indicate
+	 * the system should not be heavily disrupted. In practice, this is
+	 * to avoid THP callers being stalled in writeback during migration
+	 * as it's preferable for the allocations to fail than to stall
+	 */
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Andrew Morton
  Date: 2011-11-10 23:12 UTC
  To: Mel Gorman
  Cc: Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, 10 Nov 2011 16:13:31 +0000 Mel Gorman <mgorman@suse.de> wrote:

> This patch once again prevents sync migration for transparent hugepage
> allocations, as it is preferable to fail a THP allocation than to
> stall.

Who said? ;) Presumably some people would prefer to get lots of huge
pages for their 1000-hour compute job, and waiting a bit to get those
pages is acceptable.

Do we have the accounting in place for us to be able to determine how
many huge page allocation attempts failed due to this change?
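On the accounting question: kernels of this era already export
compact_stall, compact_fail and compact_success in /proc/vmstat, and,
assuming the THP vmstat events merged earlier in 2011 are present,
thp_fault_alloc and thp_fault_fallback as well, which together show how
often THP faults fall back to base pages. A paraphrased sketch of where
the fault-path events are counted, abridged from
do_huge_pmd_anonymous_page() in mm/huge_memory.c (error handling
elided, details approximate):

	page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
				  vma, haddr, numa_node_id(), 0);
	if (unlikely(!page)) {
		/* Shows up as thp_fault_fallback in /proc/vmstat */
		count_vm_event(THP_FAULT_FALLBACK);
		goto out;	/* handle the fault with base pages */
	}
	/* Shows up as thp_fault_alloc in /proc/vmstat */
	count_vm_event(THP_FAULT_ALLOC);

A rising thp_fault_fallback relative to thp_fault_alloc after this
change would be the operator-visible signal.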
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: David Rientjes
  Date: 2011-11-10 23:37 UTC
  To: Andrew Morton
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, 10 Nov 2011, Andrew Morton wrote:

> > This patch once again prevents sync migration for transparent hugepage
> > allocations, as it is preferable to fail a THP allocation than to
> > stall.
>
> Who said? ;) Presumably some people would prefer to get lots of
> huge pages for their 1000-hour compute job, and waiting a bit to get
> those pages is acceptable.

Indeed. It seems like the behavior would better be controlled with
/sys/kernel/mm/transparent_hugepage/defrag, which is set aside
specifically to control defragmentation for transparent hugepages and
to which synchronous compaction should certainly apply.
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: Mel Gorman
  Date: 2011-11-11 10:14 UTC
  To: David Rientjes
  Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

On Thu, Nov 10, 2011 at 03:37:32PM -0800, David Rientjes wrote:
> On Thu, 10 Nov 2011, Andrew Morton wrote:
>
> > > This patch once again prevents sync migration for transparent hugepage
> > > allocations, as it is preferable to fail a THP allocation than to
> > > stall.
> >
> > Who said? ;) Presumably some people would prefer to get lots of
> > huge pages for their 1000-hour compute job, and waiting a bit to get
> > those pages is acceptable.
>
> Indeed. It seems like the behavior would better be controlled with
> /sys/kernel/mm/transparent_hugepage/defrag, which is set aside
> specifically to control defragmentation for transparent hugepages and
> to which synchronous compaction should certainly apply.

With khugepaged in place, it's adding a tunable that is unnecessary and
will not be used. Even if such a tunable was created, the default
behaviour should be "do not stall".

-- 
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  From: David Rientjes
  Date: 2011-11-11 10:39 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm

On Fri, 11 Nov 2011, Mel Gorman wrote:

> > Indeed. It seems like the behavior would better be controlled with
> > /sys/kernel/mm/transparent_hugepage/defrag, which is set aside
> > specifically to control defragmentation for transparent hugepages and
> > to which synchronous compaction should certainly apply.
>
> With khugepaged in place, it's adding a tunable that is unnecessary and
> will not be used. Even if such a tunable was created, the default
> behaviour should be "do not stall".

Not sure what you mean; the tunable already exists and defaults to
"always" if THP is turned on. I've been able to effectively control the
behavior of synchronous compaction with it in combination with
extfrag_threshold, i.e. always compact even if the fragmentation index
is very small, for workloads that really really really want hugepages
at fault when such a latency is permissible, and then disable
khugepaged entirely in the background for cpu-bound tasks.

The history of this boolean is somewhat disturbing: it's introduced in
77f1fe6b back on January 13 to be true after the first attempt at
compaction, then changed to be !(gfp_mask & __GFP_NO_KSWAPD) in 11bc82d6
on March 22, then changed to be true again in c6a140bf on May 24, then
proposed to be changed right back to !(gfp_mask & __GFP_NO_KSWAPD) in
this patch again. When are we going to understand that the admin needs
to tell the kernel when we'd really like to try to allocate a
transparent hugepage and when it's ok to fail?
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-11 10:39 ` David Rientjes @ 2011-11-11 11:17 ` Mel Gorman 2011-11-11 14:21 ` Andrea Arcangeli 1 sibling, 0 replies; 34+ messages in thread From: Mel Gorman @ 2011-11-11 11:17 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm On Fri, Nov 11, 2011 at 02:39:10AM -0800, David Rientjes wrote: > On Fri, 11 Nov 2011, Mel Gorman wrote: > > > > Indeed. It seems like the behavior would better be controlled with > > > /sys/kernel/mm/transparent_hugepage/defrag which is set aside specifically > > > to control defragmentation for transparent hugepages and for that > > > synchronous compaction should certainly apply. > > > > With khugepaged in place, it's adding a tunable that is unnecessary and > > will not be used. Even if such a tuneable was created, the default > > behaviour should be "do not stall". > > > > Not sure what you mean, the tunable already exists and defaults to always > if THP is turned on. Yes, but it does not distinguish between "always including synchronous stalls" and "always but only if you can do it quickly". This patch does change the behaviour of "always" but it's still using compaction to defrag memory. A sysfs file exists, but I wanted to avoid changing the meaning of its values if at all possible. If a new value was to be added that would allow the user of synchronous compaction, what should it be called? always-sync exposes implementation details which is not great. always-force is misleading because it wouldn't actually force anything. always-really-really-mean-it is a bit unwieldly. always-stress might suit but what is the meaning exactly? Using sync compaction would be part of it but it could also mean be more agressive about reclaiming. What level of control is required and how should it be expressed? I don't object to the existence of this tunable as such but the default should still be "no sync compaction for THP" because stalls due to writing to a USB stick sucks. > I've been able to effectively control the behavior > of synchronous compaction with it in combination with extfrag_threshold, extfrag_threshold controls whether compaction runs or not, it does not control if synchronous compaction it used. > i.e. always compact even if the fragmentation index is very small, for > workloads that really really really want hugepages at fault when such a > latency is permissable and then disable khugepaged entirely in the > background for cpu bound tasks. > > The history of this boolean is somewhat disturbing: it's introduced in > 77f1fe6b back on January 13 to be true after the first attempt at It was first introduced because compaction was stalling. This was unacceptable because the stalling cost more than hugepages saved and was user visible. There were other patches related to reducing latency due to compaction. > compaction, then changed to be !(gfp_mask & __GFP_NO_KSWAPD) in 11bc82d6 Because this was a safe option. > on March 22, then changed to be true again in c6a140bf on May 24, then Because testing indicated that stalls due to sync migration were not noticeable. For the most part, this is true but there should have been better recognition that sync migration could also write pages to slow storage which would be unacceptably slow. > proposed to be changed right back to !(gfp_mask & __GFP_NO_KSWAPD) in this > patch again. 
> When are we going to understand that the admin needs to tell the kernel when we'd really like to try to allocate a transparent hugepage and when it's ok to fail? I don't recall a point when this was about the administrator wanting to control synchronous compaction. The objective was to maximise the number of huge pages used while minimising user-visible stalls. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-11 10:39 ` David Rientjes 2011-11-11 11:17 ` Mel Gorman @ 2011-11-11 14:21 ` Andrea Arcangeli 1 sibling, 0 replies; 34+ messages in thread From: Andrea Arcangeli @ 2011-11-11 14:21 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm On Fri, Nov 11, 2011 at 02:39:10AM -0800, David Rientjes wrote: > The history of this boolean is somewhat disturbing: it's introduced in > 77f1fe6b back on January 13 to be true after the first attempt at > compaction, then changed to be !(gfp_mask & __GFP_NO_KSWAPD) in 11bc82d6 > on March 22, then changed to be true again in c6a140bf on May 24, then > proposed to be changed right back to !(gfp_mask & __GFP_NO_KSWAPD) in this > patch again. When are we going to understand that the admin needs to tell > the kernel when we'd really like to try to allocate a transparent hugepage > and when it's ok to fail? Sorry for the confusion but it was reverted by mistake. Mel fixed a compaction bug that caused stalls of several minutes for people. Compaction was overflowing into the next zone by mistake without adjusting its tracking parameters, so it wasn't clear anymore if sync compaction was really a problem or not. So I reverted it to be sure. And well, now we're sure :). Without USB or pathologically slow I/O apparently it's not a noticeable issue. So having sync compaction off by default sounds good to me. We can still add a tunable to force sync compaction in case anybody needs it. Another topic is whether slub and friends also want sync compaction off by default when they allocate with a large order, thus triggering compaction. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-11 10:14 ` Mel Gorman 2011-11-11 10:39 ` David Rientjes @ 2011-11-14 23:44 ` Andrew Morton 2011-11-15 13:25 ` Mel Gorman 1 sibling, 1 reply; 34+ messages in thread From: Andrew Morton @ 2011-11-14 23:44 UTC (permalink / raw) To: Mel Gorman Cc: David Rientjes, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm On Fri, 11 Nov 2011 10:14:14 +0000 Mel Gorman <mgorman@suse.de> wrote: > On Thu, Nov 10, 2011 at 03:37:32PM -0800, David Rientjes wrote: > > On Thu, 10 Nov 2011, Andrew Morton wrote: > > > > > > This patch once again prevents sync migration for transparent > > > > hugepage allocations as it is preferable to fail a THP allocation > > > > than stall. > > > > > > Who said? ;) Presumably some people would prefer to get lots of > > > huge pages for their 1000-hour compute job, and waiting a bit to get > > > those pages is acceptable. > > > > > > > Indeed. It seems like the behavior would better be controlled with > > /sys/kernel/mm/transparent_hugepage/defrag which is set aside specifically > > to control defragmentation for transparent hugepages and for that > > synchronous compaction should certainly apply. > > With khugepaged in place, it's adding a tunable that is unnecessary and > will not be used. Even if such a tuneable was created, the default > behaviour should be "do not stall". (who said?) Let me repeat my cruelly unanswered question: do we have sufficient instrumentation in place so that operators can determine that this change is causing them to get less huge pages than they'd like? Because some people really really want those huge pages. If we go and silently deprive them of those huge pages via changes like this, how do they *know* it's happening? And what are their options for making the kernel try harder to get those pages? And how do we communicate all of this to those operators? ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-14 23:44 ` Andrew Morton @ 2011-11-15 13:25 ` Mel Gorman 2011-11-15 21:07 ` David Rientjes 0 siblings, 1 reply; 34+ messages in thread From: Mel Gorman @ 2011-11-15 13:25 UTC (permalink / raw) To: Andrew Morton Cc: David Rientjes, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm On Mon, Nov 14, 2011 at 03:44:08PM -0800, Andrew Morton wrote: > On Fri, 11 Nov 2011 10:14:14 +0000 > Mel Gorman <mgorman@suse.de> wrote: > > > On Thu, Nov 10, 2011 at 03:37:32PM -0800, David Rientjes wrote: > > > On Thu, 10 Nov 2011, Andrew Morton wrote: > > > > > > > > This patch once again prevents sync migration for transparent > > > > > hugepage allocations as it is preferable to fail a THP allocation > > > > > than stall. > > > > > > > > Who said? ;) Presumably some people would prefer to get lots of > > > > huge pages for their 1000-hour compute job, and waiting a bit to get > > > > those pages is acceptable. > > > > > > > > > > Indeed. It seems like the behavior would better be controlled with > > > /sys/kernel/mm/transparent_hugepage/defrag which is set aside specifically > > > to control defragmentation for transparent hugepages and for that > > > synchronous compaction should certainly apply. > > > > With khugepaged in place, it's adding a tunable that is unnecessary and > > will not be used. Even if such a tuneable was created, the default > > behaviour should be "do not stall". > > (who said?) > > Let me repeat my cruelly unanswered question: do we have sufficient > instrumentation in place so that operators can determine that this > change is causing them to get less huge pages than they'd like? > Unless we add a mel_did_it counter to vmstat, they won't be able to identify that it was this patch in particular. > Because some people really really want those huge pages. If we go and > silently deprive them of those huge pages via changes like this, how do > they *know* it's happening? > The counters in vmstat will give them a hint but it will not tell them *why* they are not getting the huge pages they want. That would require further analysis using a combination of ftrace, /proc/buddyinfo, /proc/pagetypeinfo and maybe /proc/kpageflags depending on how important the issue is. > And what are their options for making the kernel try harder to get > those pages? > Fine control is limited. If it is really needed, I would not oppose a patch that allows the use of sync compaction via a new setting in /sys/kernel/mm/transparent_hugepage/defrag. However, I think it is a slippery slope to expose implementation details like this and I'm not currently planning to implement such a patch. If they have root access, they have the option of writing to /proc/sys/vm/compact_memory to manually trigger compaction. If that does not free enough huge pages, they could use /proc/sys/vm/drop_caches followed by /proc/sys/vm/compact_memory and then start the target application. If that was too heavy, they could write a balloon application which forces some percentage of memory to be reclaimed by allocating anonymous memory, calling mlock on it, unmapping the memory and then writing to /proc/sys/vm/compact_memory . It would be very heavy handed but it could be a preparation step for running a job that absolutely must get huge pages without khugepaged running. > And how do we communicate all of this to those operators? 
The documentation patch will help to some extent, but the more creative manipulation of the system to increase the success rate of huge page allocations, and how to analyse it, is not documented. This is largely because the analysis is conducted on a case-by-case basis. Mailing "help help" to linux-mm and hoping someone on the internet can hear you scream may be the only option. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 34+ messages in thread
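The balloon trick Mel describes above is small enough to sketch in full. A rough editorial illustration in user-space C (not from the thread; it needs root for the procfs write, takes the number of megabytes to claim as an argument, and skips the careful error handling and sizing policy a real tool would want):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 0) : 256;
	size_t len = mb << 20;
	int fd;

	/* Allocate anonymous memory and mlock it; populating the locked
	 * range forces reclaim of that much memory if it is not free. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (mlock(p, len)) {
		perror("mlock");
		return 1;
	}

	/* Release the memory so the freed pages become migration targets,
	 * then ask the kernel to compact what was just freed. */
	munmap(p, len);
	fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
	if (fd < 0 || write(fd, "1", 1) != 1)
		perror("/proc/sys/vm/compact_memory");
	if (fd >= 0)
		close(fd);
	return 0;
}

Run before starting a job that wants hugepages, this approximates "reclaim some percentage of memory, then compact" without khugepaged, i.e. the heavy-handed preparation step described above.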
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-15 13:25 ` Mel Gorman @ 2011-11-15 21:07 ` David Rientjes 2011-11-15 23:48 ` Mel Gorman 0 siblings, 1 reply; 34+ messages in thread From: David Rientjes @ 2011-11-15 21:07 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm On Tue, 15 Nov 2011, Mel Gorman wrote: > Fine control is limited. If it is really needed, I would not oppose > a patch that allows the use of sync compaction via a new setting in > /sys/kernel/mm/transparent_hugepage/defrag. However, I think it is > a slippery slope to expose implementation details like this and I'm > not currently planning to implement such a patch. > This doesn't expose any implementation detail; the "defrag" tunable is supposed to limit defragmentation efforts in the VM if the hugepages aren't immediately available and simply fall back to using small pages. Given that definition, it would make sense to allow for synchronous defragmentation (i.e. sync_compaction) on the second iteration of the page allocator slowpath if set. So where's the disconnect between this proposed behavior and the definition of the tunable in Documentation/vm/transhuge.txt? ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-15 21:07 ` David Rientjes @ 2011-11-15 23:48 ` Mel Gorman 2011-11-16 0:07 ` David Rientjes 0 siblings, 1 reply; 34+ messages in thread From: Mel Gorman @ 2011-11-15 23:48 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm On Tue, Nov 15, 2011 at 01:07:51PM -0800, David Rientjes wrote: > On Tue, 15 Nov 2011, Mel Gorman wrote: > > > Fine control is limited. If it is really needed, I would not oppose > > a patch that allows the use of sync compaction via a new setting in > > /sys/kernel/mm/transparent_hugepage/defrag. However, I think it is > > a slippery slope to expose implementation details like this and I'm > > not currently planning to implement such a patch. > > > > This doesn't expose any implementation detail; the "defrag" tunable is > supposed to limit defragmentation efforts in the VM if the hugepages > aren't immediately available and simply fall back to using small pages. The current settings are "always", "madvise" and "never" which match the settings for /sys/kernel/mm/transparent_hugepage/enabled and are fairly straightforward. Adding sync here could obviously be implemented although it may require both always-sync and madvise-sync. Alternatively, something like an options file could be added to create a bitmap similar to what ftrace does. Whatever the mechanism, it exposes the fact that "sync compaction" is used. If that turns out to be not enough, then you may want to add other steps like aggressively reclaiming memory which also potentially may need to be controlled via the sysfs file and this is the slippery slope. > Given that definition, it would make sense to allow for synchronous > defragmentation (i.e. sync_compaction) on the second iteration of the page > allocator slowpath if set. So where's the disconnect between this > proposed behavior and the definition of the tunable in > Documentation/vm/transhuge.txt? The transhuge.txt file does not describe how defrag works or whether it uses sync compaction internally. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-15 23:48 ` Mel Gorman @ 2011-11-16 0:07 ` David Rientjes 2011-11-16 4:13 ` Andrea Arcangeli 0 siblings, 1 reply; 34+ messages in thread From: David Rientjes @ 2011-11-16 0:07 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm On Tue, 15 Nov 2011, Mel Gorman wrote: > Adding sync here could obviously be implemented although it may > require both always-sync and madvise-sync. Alternatively, something > like an options file could be added to create a bitmap similar to > what ftrace does. Whatever the mechanism, it exposes the fact that > "sync compaction" is used. If that turns out to be not enough, then > you may want to add other steps like aggressively reclaiming memory > which also potentially may need to be controlled via the sysfs file > and this is the slippery slope. > So what's being proposed here in this patch is the fifth time this line has been changed and it's always been switched between true and !(gfp_mask & __GFP_NO_KSWAPD). Instead of changing it every few months, I'd suggest that we tie the semantics of the tunable directly to sync_compaction since we're primarily targeting thp hugepages with this change anyway for the "always" case. Comments? diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt --- a/Documentation/vm/transhuge.txt +++ b/Documentation/vm/transhuge.txt @@ -116,6 +116,13 @@ echo always >/sys/kernel/mm/transparent_hugepage/defrag echo madvise >/sys/kernel/mm/transparent_hugepage/defrag echo never >/sys/kernel/mm/transparent_hugepage/defrag +If defrag is set to "always", then all hugepage allocations also attempt +synchronous memory compaction which makes the allocation as aggressive +as possible. The overhead of attempting to allocate the hugepage is +considered acceptable because of the longterm benefits of the hugepage +itself at runtime. If the VM should fallback to using regular pages +instead, then you should use "madvise" or "never". + khugepaged will be automatically started when transparent_hugepage/enabled is set to "always" or "madvise, and it'll be automatically shutdown if it's set to "never". diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2168,7 +2168,17 @@ rebalance: sync_migration); if (page) goto got_pg; - sync_migration = true; + + /* + * Do not use synchronous migration for transparent hugepages unless + * defragmentation is always attempted for such allocations since it + * can stall in writeback, which is far worse than simply failing to + * promote a page. Otherwise, we really do want a hugepage and are as + * aggressive as possible to allocate it. + */ + sync_migration = !(gfp_mask & __GFP_NO_KSWAPD) || + (transparent_hugepage_flags & + (1 << TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)); /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-16 0:07 ` David Rientjes @ 2011-11-16 4:13 ` Andrea Arcangeli 2011-11-16 13:30 ` Andrea Arcangeli 2011-11-16 14:14 ` Mel Gorman 0 siblings, 2 replies; 34+ messages in thread From: Andrea Arcangeli @ 2011-11-16 4:13 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm On Tue, Nov 15, 2011 at 04:07:08PM -0800, David Rientjes wrote: > On Tue, 15 Nov 2011, Mel Gorman wrote: > > > Adding sync here could obviously be implemented although it may > > require both always-sync and madvise-sync. Alternatively, something > > like an options file could be added to create a bitmap similar to > > what ftrace does. Whatever the mechanism, it exposes the fact that > > "sync compaction" is used. If that turns out to be not enough, then > > you may want to add other steps like aggressively reclaiming memory > > which also potentially may need to be controlled via the sysfs file > > and this is the slippery slope. > > > > So what's being proposed here in this patch is the fifth time this line > has been changed and it's always been switched between true and !(gfp_mask > & __GFP_NO_KSWAPD). Instead of changing it every few months, I'd suggest > that we tie the semantics of the tunable directly to sync_compaction since > we're primarily targeting thp hugepages with this change anyway for the > "always" case. Comments? I don't think it's ok. defrag=madvise means __GFP_WAIT not set for regular THP alloc (without madvise(MADV_HUGEPAGE)) and __GFP_WAIT set for madvise(MADV_HUGEPAGE). If __GFP_WAIT isn't set, compaction-async won't be invoked. After checking my current thp vmstat I think Andrew was right and we backed out for a good reason before. I'm getting a significantly worse success rate, not sure why; it was no small reduction in success rate, but hey I cannot exclude I may have broken something with some other patch. I've been running it together with a couple more changes. If it's this change that reduced the success rate, I'm afraid going always async is not ok. So before focusing so much on this sync/async flag, I'd like to understand better why sync stalls so bad. I mean it's not like the VM with 4k pages won't be doing some throttling too. I suspect we may be too heavyweight on migrate looping 10 times. Especially with O_DIRECT the pages may return pinned immediately after they're unpinned (if the buffer for the I/O is the same like with dd) so there's no point to wait on pinned pages 10 times around amounting to wait 10*2 = 20MB = several seconds with usb stick for each 2M allocation. We can reduce it to less than one second easily. Maybe somebody has time to test if the below patch helps or not. I understand in some circumstance it may not help and it'll lead to the same but I think this is good idea anyway and maybe it helps. Currently we wait on locked pages, on writeback pages, we retry on pinned pages again and again on locked pages etc... That's a bit too much I guess, and we don't have to go complete async to improve the situation I hope. Could you try if you see any improvement with the below patch? I'm running with a dd writing in a loop over 1g sd card and I don't see stalls and success rate seems better than before but I haven't been noticing the stalls before so I can't tell.
(to test you need to back out the other patches first, this is for 3.2rc2) If this isn't enough I'd still prefer to find a way to tackle the problem in a write-throttling way, or at least I'd need to re-verify why the success rate was so bad with the patch applied (after 4 days of uptime of normal load). I tend to check the success rate and with upstream it's about perfect and a little degradation is ok but I was getting less than 50%. Note with the usb writing in a loop the success rate may degrade a bit, page will be locked, we won't wait on page lock and then on writeback and all that slowdown anymore but it'll still write throttle a bit and it will stop only working on movable pages and isolating only clean pages like it would have done with async forced. === ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-16 4:13 ` Andrea Arcangeli @ 2011-11-16 13:30 ` Andrea Arcangeli 2011-11-16 15:07 ` Mel Gorman 2011-11-18 17:59 ` Andrea Arcangeli 1 sibling, 2 replies; 34+ messages in thread From: Andrea Arcangeli @ 2011-11-16 13:30 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm On Wed, Nov 16, 2011 at 05:13:50AM +0100, Andrea Arcangeli wrote: > After checking my current thp vmstat I think Andrew was right and we > backed out for a good reason before. I'm getting a significantly worse > success rate, not sure why; it was no small reduction in success rate, > but hey I cannot exclude I may have broken something with some other > patch. I've been running it together with a couple more changes. If > it's this change that reduced the success rate, I'm afraid going > always async is not ok. I wonder if the high failure rate when shutting off "sync compaction" and forcing only "async compaction" for THP (your patch queued in -mm) is also because of ISOLATE_CLEAN being set in compaction from commit 39deaf8. ISOLATE_CLEAN skipping PageDirty means all tmpfs/anon pages added to swapcache (or removed from swapcache which sets the dirty bit on the page because the pte may be mapped clean) are skipped entirely by async compaction for no good reason. That can't possibly be ok, because those don't actually require any I/O or blocking to be migrated. PageDirty is a "blocking/IO" operation only for filebacked pages. So I think we must revert 39deaf8, instead of cleaning it up with my cleanup posted in Message-Id 20111115020831.GF4414@redhat.com . ISOLATE_CLEAN still looks right for may_writepage: for reclaim, the dirty bit set on the page is an I/O event; for migrate it's not, if it's tmpfs/anon. Did you run your compaction tests with some swap activity? Reducing the async compaction effectiveness while there's some swap activity then also leads to running sync compaction and page reclaim more frequently than needed. I'm hopeful however that by running just 2 passes of migrate_pages main loop with the "avoid overwork in migrate sync mode" patch, we can fix the excessive hanging. If that works, the number of passes could actually be a tunable, and setting it to 1 (instead of 2) would then provide 100% "async compaction" behavior again. And if somebody prefers to stick to 10 he can... so then he can do trylock pass 0, lock_page pass1, wait_writeback pass2, wait pin pass3, finally migrate pass4. (something 2 passes alone won't allow). So making the migrate passes/force-threshold tunable (maybe only for the new sync=2 migration mode) could be a good idea. Or we could just return to sync true/false and have the migration tunable affect everything but that would alter the reliability of sys_move_pages and other numa things too, where I guess 10 passes are ok. ^ permalink raw reply [flat|nested] 34+ messages in thread
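Andrea's complaint reduces to a predicate: dirtiness only implies IO for file-backed pages whose migration path may have to write them out. A small editorial user-space model of that distinction follows; the struct and names are invented for illustration and none of this is kernel code.

#include <stdbool.h>
#include <stdio.h>

/* Which dirty pages actually need IO (or may block) to migrate?
 * Anon, tmpfs and swapcache pages go through migrate_page(), which
 * moves the page without any writeback, so PageDirty is irrelevant
 * for them. Only a dirty file-backed page without a known
 * non-blocking ->migratepage needs writeout. */
struct page_model {
	bool dirty;
	bool file_backed;		/* false: anon/tmpfs/swapcache */
	bool nonblocking_migratepage;	/* aops callback known not to block */
};

static bool migration_may_require_io(const struct page_model *page)
{
	if (!page->file_backed)
		return false;	/* migrate_page(): no IO even if dirty */
	if (!page->dirty)
		return false;
	return !page->nonblocking_migratepage;
}

int main(void)
{
	struct page_model dirty_anon = { true, false, false };
	struct page_model dirty_file = { true, true, false };

	/* commit 39deaf8's ISOLATE_CLEAN skips both of these during
	 * async compaction, yet only the second one needs IO */
	printf("dirty anon needs IO: %d\n", migration_may_require_io(&dirty_anon));
	printf("dirty file needs IO: %d\n", migration_may_require_io(&dirty_file));
	return 0;
}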
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-16 13:30 ` Andrea Arcangeli @ 2011-11-16 15:07 ` Mel Gorman 0 siblings, 0 replies; 34+ messages in thread From: Mel Gorman @ 2011-11-16 15:07 UTC (permalink / raw) To: Andrea Arcangeli Cc: David Rientjes, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm On Wed, Nov 16, 2011 at 02:30:56PM +0100, Andrea Arcangeli wrote: > On Wed, Nov 16, 2011 at 05:13:50AM +0100, Andrea Arcangeli wrote: > > After checking my current thp vmstat I think Andrew was right and we > > backed out for a good reason before. I'm getting a significantly worse > > success rate, not sure why; it was no small reduction in success rate, > > but hey I cannot exclude I may have broken something with some other > > patch. I've been running it together with a couple more changes. If > > it's this change that reduced the success rate, I'm afraid going > > always async is not ok. > > I wonder if the high failure rate when shutting off "sync compaction" > and forcing only "async compaction" for THP (your patch queued in -mm) > is also because of ISOLATE_CLEAN being set in compaction from commit > 39deaf8. ISOLATE_CLEAN skipping PageDirty means all tmpfs/anon pages > added to swapcache (or removed from swapcache which sets the dirty bit > on the page because the pte may be mapped clean) are skipped entirely > by async compaction for no good reason. Good point! Even though these pages can be migrated without IO or incurring a sync, we are skipping over them. I'm still looking at passing sync down to ->migratepages and as part of that, ISOLATE_CLEAN will need new smarts. > That can't possibly be ok, > because those don't actually require any I/O or blocking to be > migrated. PageDirty is a "blocking/IO" operation only for filebacked > pages. So I think we must revert 39deaf8, instead of cleaning it up > with my cleanup posted in Message-Id 20111115020831.GF4414@redhat.com . > It would be preferable if the pages that would block during migration could be identified in advance but that may be unrealistic. What may be a better compromise is to only isolate PageDirty pages with a ->migratepage callback. > ISOLATE_CLEAN still looks right for may_writepage: for reclaim, the dirty > bit set on the page is an I/O event; for migrate it's not, if it's > tmpfs/anon. > > Did you run your compaction tests with some swap activity? > Some, but not intensive. > Reducing the async compaction effectiveness while there's some swap > activity then also leads to running sync compaction and page reclaim > more frequently than needed. > > I'm hopeful however that by running just 2 passes of migrate_pages > main loop with the "avoid overwork in migrate sync mode" patch, we can > fix the excessive hanging. If that works, the number of passes could > actually be a tunable, and setting it to 1 (instead of 2) would then > provide 100% "async compaction" behavior again. And if somebody > prefers to stick to 10 he can... so then he can do trylock pass 0, > lock_page pass1, wait_writeback pass2, wait pin pass3, finally migrate > pass4. (something 2 passes alone won't allow). So making the migrate > passes/force-threshold tunable (maybe only for the new sync=2 > migration mode) could be a good idea. Or we could just return to sync > true/false and have the migration tunable affect everything but that > would alter the reliability of sys_move_pages and other numa things > too, where I guess 10 passes are ok.
> This is why I added a sync=2 mode for migrate. I am vaguely concerned that this will just make the stalling harder to reproduce and diagnose. While you investigate this route, I'm going to keep investigating using only async migration for THP and having async compaction move pages it can migrate without blocking. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-16 13:30 ` Andrea Arcangeli 2011-11-16 15:07 ` Mel Gorman @ 2011-11-18 17:59 ` Andrea Arcangeli 1 sibling, 0 replies; 34+ messages in thread From: Andrea Arcangeli @ 2011-11-18 17:59 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm On Wed, Nov 16, 2011 at 02:30:56PM +0100, Andrea Arcangeli wrote: > On Wed, Nov 16, 2011 at 05:13:50AM +0100, Andrea Arcangeli wrote: > > After checking my current thp vmstat I think Andrew was right and we > > backed out for a good reason before. I'm getting a significantly worse > > success rate, not sure why; it was no small reduction in success rate, > > but hey I cannot exclude I may have broken something with some other > > patch. I've been running it together with a couple more changes. If > > it's this change that reduced the success rate, I'm afraid going > > always async is not ok. > > I wonder if the high failure rate when shutting off "sync compaction" > and forcing only "async compaction" for THP (your patch queued in -mm) > is also because of ISOLATE_CLEAN being set in compaction from commit > 39deaf8. ISOLATE_CLEAN skipping PageDirty means all tmpfs/anon pages I think I tracked down the source of the thp allocation regression. They're commits e0887c19b2daa140f20ca8104bdc5740f39dbb86 and e0c23279c9f800c403f37511484d9014ac83adec. They're also wrong because compaction_suitable doesn't check that there is enough free memory in the number of "movable" pageblocks. I'm going to test this; not sure if it helps. But the more free memory, the more likely compaction succeeds, so there's still a risk we're reducing the compaction success. With the two commits above reverted my compaction success rate returns near 100%, with the two commits above applied it goes to <50%... Now we'll see what happens with the below patch. === From: Andrea Arcangeli <aarcange@redhat.com> Subject: [PATCH] compaction: correct reverse check for compaction_deferred Otherwise when compaction is deferred, reclaim stops too, leading to a high failure rate of high order allocations. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/vmscan.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index f7f7677..ce745f0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2144,8 +2144,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, * allocations. */ if (sc->order > PAGE_ALLOC_COSTLY_ORDER && - (compaction_suitable(zone, sc->order) || - compaction_deferred(zone))) { + (compaction_suitable(zone, sc->order) && + !compaction_deferred(zone))) { should_abort_reclaim = true; continue; } ^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-16 4:13 ` Andrea Arcangeli 2011-11-16 13:30 ` Andrea Arcangeli @ 2011-11-16 14:14 ` Mel Gorman 1 sibling, 0 replies; 34+ messages in thread From: Mel Gorman @ 2011-11-16 14:14 UTC (permalink / raw) To: Andrea Arcangeli Cc: David Rientjes, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm On Wed, Nov 16, 2011 at 05:13:50AM +0100, Andrea Arcangeli wrote: > On Tue, Nov 15, 2011 at 04:07:08PM -0800, David Rientjes wrote: > > On Tue, 15 Nov 2011, Mel Gorman wrote: > > > > > Adding sync here could obviously be implemented although it may > > > require both always-sync and madvise-sync. Alternatively, something > > > like an options file could be added to create a bitmap similar to > > > what ftrace does. Whatever the mechanism, it exposes the fact that > > > "sync compaction" is used. If that turns out to be not enough, then > > > you may want to add other steps like aggressively reclaiming memory > > > which also potentially may need to be controlled via the sysfs file > > > and this is the slippery slope. > > > > > > > So what's being proposed here in this patch is the fifth time this line > > has been changed and it's always been switched between true and !(gfp_mask > > & __GFP_NO_KSWAPD). Instead of changing it every few months, I'd suggest > > that we tie the semantics of the tunable directly to sync_compaction since > > we're primarily targeting thp hugepages with this change anyway for the > > "always" case. Comments? > > I don't think it's ok. defrag=madvise means __GFP_WAIT not set for > regular THP alloc (without madvise(MADV_HUGEPAGE)) and __GFP_WAIT set > for madvise(MADV_HUGEPAGE). > > If __GFP_WAIT isn't set, compaction-async won't be invoked. > And if defrag is enabled with this patch, sync compaction can still stall due to writing to USB as detailed in the changelog for the original patch in this thread. > After checking my current thp vmstat I think Andrew was right and we > backed out for a good reason before. I'm getting a significantly worse > success rate, not sure why; it was no small reduction in success rate, > but hey I cannot exclude I may have broken something with some other > patch. I've been running it together with a couple more changes. If > it's this change that reduced the success rate, I'm afraid going > always async is not ok. > Can you narrow it down? I expect it really is the disabling of sync compaction that is reducing success rates during heavy copy operations, as 20% of memory is probably dirty pages that need writeback. I'll explain why. You didn't say the size of your machine but assuming 2G, as it is easier to reproduce stalls on machines of that size, we get 2G RAM = 524288 pages = 1024 pageblocks assuming a 2M hugepage. On a desktop system, most of those blocks will be movable (I'm basing this on my own laptop, mileage varies considerably): 85% of pageblocks are movable, so 870 movable pageblocks. If a pageblock has even one dirty page then async compaction cannot convert that into a huge page. dirty_ratio is 20% so that gives us 104857 dirty pages. If those dirty pages are perfectly randomly distributed, async compaction will give 0 huge pages. The actual distribution of dirty pages is different. Early in the lifetime of the system, they'll be packed into the fewest possible pageblocks assuming that pages are cleaned in the order they are dirtied because of how the buddy allocator gives out pages.
That gives a best-case scenario of 68% of memory as huge pages. You'd think the underlying filesystem matters because if the underlying filesystem supports migratepage, we don't have to write out, but that's not the case because: if (PageDirty(page) && !sync && mapping->a_ops->migratepage != migrate_page) rc = -EBUSY; else if (mapping->a_ops->migratepage) /* * Most pages have a mapping and most filesystems * should provide a migration function. Anonymous * pages are part of swap space which also has * its own migration function. This is the most * common path for page migration. */ rc = mapping->a_ops->migratepage(mapping, newpage, page); else rc = fallback_migrate_page(mapping, newpage, page); Even if the underlying filesystem is able to migrate a dirty page without IO, we don't call it because the interface itself does not guarantee that ->migratepage will not block. For example, buffer_migrate_page can still block on lock_buffer(). I'm looking into what is necessary to allow async migration to still use ->migratepage and update each implementation to obey "sync". This should allow us to avoid using sync compaction without severely impacting allocation success rates. It'll also need to somehow take into account that async migration is no longer isolating dirty pages. As this will be distracting me, this review will be a bit lighter than it should be. > So before focusing so much on this sync/async flag, I'd like to > understand better why sync stalls so bad. I mean it's not like the VM > with 4k pages won't be doing some throttling too. True, but the core VM avoids writing back dirty pages as much as possible and even with dirty throttling the caller has the option of using a different page. Compaction smacks itself into a wall with dirty pages because of the linear scan and then beats itself harder because of large numbers of small IO requests. > I suspect we may be > too heavyweight on migrate looping 10 times. Especially with O_DIRECT > the pages may return pinned immediately after they're unpinned (if the > buffer for the I/O is the same like with dd) so there's no point to > wait on pinned pages 10 times around amounting to wait 10*2 = 20MB = > several seconds with usb stick for each 2M allocation. We can reduce > it to less than one second easily. It sounds reasonable to distinguish between pages that we cannot migrate because they are pinned for IO and pages that we cannot migrate because they were locked for a short period of time. > Maybe somebody has time to test if > the below patch helps or not. The machine that I normally use to test this sort of patch is currently tied up so it could take a day before I can queue it. > I understand in some circumstance it may > not help and it'll lead to the same but I think this is good idea > anyway and maybe it helps. It might be a good idea anyway. > Currently we wait on locked pages, on > writeback pages, we retry on pinned pages again and again on locked > pages etc... That's a bit too much I guess, and we don't have to go > complete async to improve the situation I hope. > > Could you try if you see any improvement with the below patch? I'm > running with a dd writing in a loop over 1g sd card and I don't see > stalls and success rate seems better than before but I haven't been > noticing the stalls before so I can't tell.
> (to test you need to back out the other patches first, this is for 3.2rc2) As it could take me a while to test, grab http://www.csn.ul.ie/~mel/postings/compaction-20111116/mmtests-add-monitor-for-tracing-processes-stuck-in-D-state.patch . It's a patch to MM Tests but you don't need the suite to use the scripts. Extract the monitors/watch-dstate.pl and stap-dstate-frequency scripts from the patch, set up systemtap (I know, it had to work for kernels without ftrace) and run it as monitors/watch-dstate.pl -o stall.log , run your test, interrupt watch-dstate.pl, then cat stall.log | stap-dstate-frequency . It'll summarise what the top stall points were with an estimate of how long you were stalled in there, even if it's not a user-visible stall. Watch for long stalls in compaction. > If this isn't enough I'd still prefer to find a way to tackle the > problem in a write-throttling way, or at least I'd need to re-verify > why the success rate was so bad with the patch applied (after 4 days > of uptime of normal load). I tend to check the success rate and with > upstream it's about perfect and a little degradation is ok but I was > getting less than 50%. I bet you a shiny penny if you write a script to read /proc/kpageflags that you'll find there is at least one dirty page in 50% of movable pageblocks while your test runs. > Note with the usb writing in a loop the success > rate may degrade a bit, page will be locked, we won't wait on page > lock and then on writeback and all that slowdown anymore but it'll > still write throttle a bit and it will stop only working on movable > pages and isolating only clean pages like it would have done with > async forced. > > === > From 3a379b180aa544f876d3c42b47ae20060ae6279b Mon Sep 17 00:00:00 2001 > From: Andrea Arcangeli <aarcange@redhat.com> > Date: Wed, 16 Nov 2011 02:52:36 +0100 > Subject: [PATCH] compaction: avoid overwork in migrate sync mode > > Add a migration sync=2 mode that avoids overwork so more suitable to > be used by compaction to provide lower latency but still write > throttling. > sync=2 is obscure and will be prone to error. It needs to be an enum with a proper name for each field. > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > --- > include/linux/migrate.h | 8 ++-- > mm/compaction.c | 2 +- > mm/memory-failure.c | 2 +- > mm/memory_hotplug.c | 2 +- > mm/mempolicy.c | 4 +- > mm/migrate.c | 77 ++++++++++++++++++++++++++++++----------------- > 6 files changed, 58 insertions(+), 37 deletions(-) > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h > index e39aeec..f26fc0e 100644 > --- a/include/linux/migrate.h > +++ b/include/linux/migrate.h > @@ -14,10 +14,10 @@ extern int migrate_page(struct address_space *, > struct page *, struct page *); > extern int migrate_pages(struct list_head *l, new_page_t x, > unsigned long private, bool offlining, > - bool sync); > + int sync); > extern int migrate_huge_pages(struct list_head *l, new_page_t x, > unsigned long private, bool offlining, > - bool sync); > + int sync); these would then be an enum of some sort with a better name than "sync".
> extern int fail_migrate_page(struct address_space *, > struct page *, struct page *); > @@ -36,10 +36,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping, > static inline void putback_lru_pages(struct list_head *l) {} > static inline int migrate_pages(struct list_head *l, new_page_t x, > unsigned long private, bool offlining, > - bool sync) { return -ENOSYS; } > + int sync) { return -ENOSYS; } > static inline int migrate_huge_pages(struct list_head *l, new_page_t x, > unsigned long private, bool offlining, > - bool sync) { return -ENOSYS; } > + int sync) { return -ENOSYS; } > > static inline int migrate_prep(void) { return -ENOSYS; } > static inline int migrate_prep_local(void) { return -ENOSYS; } > diff --git a/mm/compaction.c b/mm/compaction.c > index be0be1d..cbf2784 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -555,7 +555,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc) > nr_migrate = cc->nr_migratepages; > err = migrate_pages(&cc->migratepages, compaction_alloc, > (unsigned long)cc, false, > - cc->sync); > + cc->sync ? 2 : 0); > update_nr_listpages(cc); > nr_remaining = cc->nr_migratepages; > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 06d3479..d8a41d3 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1557,7 +1557,7 @@ int soft_offline_page(struct page *page, int flags) > page_is_file_cache(page)); > list_add(&page->lru, &pagelist); > ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, > - 0, true); > + false, 1); > if (ret) { > putback_lru_pages(&pagelist); > pr_info("soft offline: %#lx: migration failed %d, type %lx\n", > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 2168489..e1d6176 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -809,7 +809,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) > } > /* this function returns # of failed pages */ > ret = migrate_pages(&source, hotremove_migrate_alloc, 0, > - true, true); > + true, 1); > if (ret) > putback_lru_pages(&source); > } > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index adc3954..0bf88ed 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -933,7 +933,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, > > if (!list_empty(&pagelist)) { > err = migrate_pages(&pagelist, new_node_page, dest, > - false, true); > + false, 1); > if (err) > putback_lru_pages(&pagelist); > } > @@ -1154,7 +1154,7 @@ static long do_mbind(unsigned long start, unsigned long len, > if (!list_empty(&pagelist)) { > nr_failed = migrate_pages(&pagelist, new_vma_page, > (unsigned long)vma, > - false, true); > + false, 1); > if (nr_failed) > putback_lru_pages(&pagelist); > } > diff --git a/mm/migrate.c b/mm/migrate.c > index 578e291..175c3bc 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -564,7 +564,7 @@ static int fallback_migrate_page(struct address_space *mapping, > * == 0 - success > */ > static int move_to_new_page(struct page *newpage, struct page *page, > - int remap_swapcache, bool sync) > + int remap_swapcache, bool force) > { > struct address_space *mapping; > int rc; > @@ -588,11 +588,11 @@ static int move_to_new_page(struct page *newpage, struct page *page, > rc = migrate_page(mapping, newpage, page); > else { > /* > - * Do not writeback pages if !sync and migratepage is > + * Do not writeback pages if !force and migratepage is > * not pointing to migrate_page() which is nonblocking > * (swapcache/tmpfs uses migratepage = migrate_page). 
> */ > - if (PageDirty(page) && !force && > mapping->a_ops->migratepage != migrate_page) > rc = -EBUSY; > else if (mapping->a_ops->migratepage) To be clear, this is the section I'll be looking at - getting rid of this check and pushing the sync parameter down into ->migratepage so async migration can handle more cases. We'll collide a bit but should be able to reconcile the differences. > @@ -622,7 +622,7 @@ static int move_to_new_page(struct page *newpage, struct page *page, > } > > static int __unmap_and_move(struct page *page, struct page *newpage, > - int force, bool offlining, bool sync) > + bool force, bool offlining) > { > int rc = -EAGAIN; > int remap_swapcache = 1; > @@ -631,7 +631,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > struct anon_vma *anon_vma = NULL; > > if (!trylock_page(page)) { > - if (!force || !sync) > + if (!force) > goto out; > > /* Superficially this looks like a mistake because we can now call lock_page but as direct compaction is all that really cares about async compaction and we check PF_MEMALLOC, it probably doesn't matter. > @@ -676,14 +676,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > BUG_ON(charge); > > if (PageWriteback(page)) { > - /* > - * For !sync, there is no point retrying as the retry loop > - * is expected to be too short for PageWriteback to be cleared > - */ > - if (!sync) { > - rc = -EBUSY; > - goto uncharge; > - } > if (!force) > goto uncharge; > wait_on_page_writeback(page); Switching from sync to force introduces an important behaviour change here I think. After this patch, async compaction becomes sync compaction on pass > 2 and can call wait_on_page_writeback() which may not be what you intended. It will probably increase allocation success rates in some tests but stalls may be worse. > @@ -751,7 +743,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, > > skip_unmap: > if (!page_mapped(page)) > - rc = move_to_new_page(newpage, page, remap_swapcache, sync); > + rc = move_to_new_page(newpage, page, remap_swapcache, force); > > if (rc && remap_swapcache) > remove_migration_ptes(page, page); > @@ -774,7 +766,7 @@ out: > * to the newly allocated page in newpage.
> */ > static int unmap_and_move(new_page_t get_new_page, unsigned long private, > - struct page *page, int force, bool offlining, bool sync) > + struct page *page, bool force, bool offlining) > { > int rc = 0; > int *result = NULL; > @@ -792,7 +784,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private, > if (unlikely(split_huge_page(page))) > goto out; > > - rc = __unmap_and_move(page, newpage, force, offlining, sync); > + rc = __unmap_and_move(page, newpage, force, offlining); > out: > if (rc != -EAGAIN) { > /* > @@ -840,7 +832,7 @@ out: > */ > static int unmap_and_move_huge_page(new_page_t get_new_page, > unsigned long private, struct page *hpage, > - int force, bool offlining, bool sync) > + bool force, bool offlining) > { > int rc = 0; > int *result = NULL; > @@ -853,7 +845,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, > rc = -EAGAIN; > > if (!trylock_page(hpage)) { > - if (!force || !sync) > + if (!force) > goto out; > lock_page(hpage); > } > @@ -864,7 +856,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, > try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); > > if (!page_mapped(hpage)) > - rc = move_to_new_page(new_hpage, hpage, 1, sync); > + rc = move_to_new_page(new_hpage, hpage, 1, force); > > if (rc) > remove_migration_ptes(hpage, hpage); > @@ -907,7 +899,7 @@ out: > */ > int migrate_pages(struct list_head *from, > new_page_t get_new_page, unsigned long private, bool offlining, > - bool sync) > + int sync) > { > int retry = 1; > int nr_failed = 0; > @@ -920,15 +912,30 @@ int migrate_pages(struct list_head *from, > if (!swapwrite) > current->flags |= PF_SWAPWRITE; > > - for(pass = 0; pass < 10 && retry; pass++) { > + for(pass = 0; pass < (sync == 1 ? 10 : 2) && retry; pass++) { > retry = 0; > > list_for_each_entry_safe(page, page2, from, lru) { > + bool force; > + > cond_resched(); > > + switch (sync) { > + case 0: > + force = false; > + break; > + case 1: > + force = pass > 2; > + break; > + case 2: > + force = pass > 0; > + break; > + default: > + BUG(); > + } > rc = unmap_and_move(get_new_page, private, > - page, pass > 2, offlining, > - sync); > + page, > + force, offlining); > > switch(rc) { > case -ENOMEM: > @@ -958,7 +965,7 @@ out: > > int migrate_huge_pages(struct list_head *from, > new_page_t get_new_page, unsigned long private, bool offlining, > - bool sync) > + int sync) > { > int retry = 1; > int nr_failed = 0; > @@ -967,15 +974,29 @@ int migrate_huge_pages(struct list_head *from, > retry = 0; > > list_for_each_entry_safe(page, page2, from, lru) { > + bool force; > cond_resched(); > > + switch (sync) { > + case 0: > + force = false; > + break; > + case 1: > + force = pass > 2; > + break; > + case 2: > + force = pass > 0; > + break; > + default: > + BUG(); > + } > rc = unmap_and_move_huge_page(get_new_page, > - private, page, pass > 2, offlining, > - sync); > + private, page, > + force, offlining); > > switch(rc) { > case -ENOMEM: > @@ -1104,7 +1125,7 @@ set_status: > err = 0; > if (!list_empty(&pagelist)) { > err = migrate_pages(&pagelist, new_page_node, > - (unsigned long)pm, 0, true); > + (unsigned long)pm, false, 1); > if (err) > putback_lru_pages(&pagelist); > } -- Mel Gorman SUSE Labs
^ permalink raw reply [flat|nested] 34+ messages in thread
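To put numbers on Mel's back-of-envelope estimate a few messages up: with N = 524288 pages, D = 104857 dirty pages scattered uniformly at random, and 512 pages per 2M pageblock, the probability that a given pageblock contains no dirty page is approximately (treating pages as independent, which is close enough at this scale):

P(clean pageblock) ~ (1 - D/N)^{512} = 0.8^{512} = e^{512 ln 0.8} ~ 2 x 10^{-50}

so the expected number of clean pageblocks among the 870 movable ones is 870 x 2 x 10^{-50}, i.e. zero for all practical purposes. That is the arithmetic behind "if those dirty pages are perfectly randomly distributed, async compaction will give 0 huge pages".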
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-10 23:12 ` Andrew Morton 2011-11-10 23:37 ` David Rientjes @ 2011-11-11 10:01 ` Mel Gorman 2011-11-15 0:03 ` Andrew Morton 1 sibling, 1 reply; 34+ messages in thread From: Mel Gorman @ 2011-11-11 10:01 UTC (permalink / raw) To: Andrew Morton Cc: Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm On Thu, Nov 10, 2011 at 03:12:11PM -0800, Andrew Morton wrote: > On Thu, 10 Nov 2011 16:13:31 +0000 > Mel Gorman <mgorman@suse.de> wrote: > > > This patch once again prevents sync migration for transparent > > hugepage allocations as it is preferable to fail a THP allocation > > than stall. > > Who said? ;) Everyone who ever complained about stalls due to writing to USB :) In the last few releases, we have squashed a number of stalls due to writing pages from the end of the LRU but sync compaction is a relatively new root cause of stalls. > Presumably some people would prefer to get lots of > huge pages for their 1000-hour compute job, and waiting a bit to get > those pages is acceptable. > A 1000-hour compute job will have its pages collapsed into hugepages by khugepaged so they might not have the huge pages at the very beginning but they get them. With khugepaged in place, there should be no need for an additional tuneable. > Do we have the accounting in place for us to be able to determine how > many huge page allocation attempts failed due to this change? > thp_fault_fallback is the big one. It is incremented if we fail to allocate a hugepage during fault in either do_huge_pmd_anonymous_page or do_huge_pmd_wp_page_fallback. thp_collapse_alloc_failed is also very interesting. It is incremented if khugepaged tried to collapse pages into a hugepage and failed the allocation. The user has the option of monitoring their compute jobs' hugepage usage by reading /proc/PID/smaps and looking at the AnonHugePages count for the large mappings of interest. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 34+ messages in thread
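For operators who want a quick view of those counters, a trivial watcher is enough; an editorial sketch follows (it just filters /proc/vmstat, whose thp_* fields exist on THP-enabled kernels of this era):

#include <stdio.h>
#include <string.h>

/* Print the transparent hugepage counters from /proc/vmstat so the
 * fault fallback and khugepaged collapse failure rates can be watched
 * over time, e.g. by diffing successive samples from a cron job. */
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "thp_", 4))
			fputs(line, stdout);
	fclose(f);
	return 0;
}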
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-11 10:01 ` Mel Gorman @ 2011-11-15 0:03 ` Andrew Morton 2011-11-15 2:00 ` Andrea Arcangeli 2011-11-15 13:07 ` Mel Gorman 0 siblings, 2 replies; 34+ messages in thread From: Andrew Morton @ 2011-11-15 0:03 UTC (permalink / raw) To: Mel Gorman Cc: Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm On Fri, 11 Nov 2011 10:01:56 +0000 Mel Gorman <mgorman@suse.de> wrote: > On Thu, Nov 10, 2011 at 03:12:11PM -0800, Andrew Morton wrote: > > On Thu, 10 Nov 2011 16:13:31 +0000 > > Mel Gorman <mgorman@suse.de> wrote: > > > > > This patch once again prevents sync migration for transparent > > > hugepage allocations as it is preferable to fail a THP allocation > > > than stall. > > > > Who said? ;) > > Everyone who ever complained about stalls due to writing to USB :) oh, here it is. Who broke message threading? > In the last few releases, we have squashed a number of stalls due > to writing pages from the end of the LRU but sync compaction is a > relatively new root cause of stalls. > > > Presumably some people would prefer to get lots of > > huge pages for their 1000-hour compute job, and waiting a bit to get > > those pages is acceptable. > > > > A 1000-hour compute job will have its pages collapsed into hugepages by > khugepaged so they might not have the huge pages at the very beginning > but they get them. With khugepaged in place, there should be no need for > an additional tuneable. OK... > > Do we have the accounting in place for us to be able to determine how > > many huge page allocation attempts failed due to this change? > > > > thp_fault_fallback is the big one. It is incremented if we fail to > allocate a hugepage during fault in either > do_huge_pmd_anonymous_page or do_huge_pmd_wp_page_fallback > > thp_collapse_alloc_failed is also very interesting. It is incremented > if khugepaged tried to collapse pages into a hugepage and > failed the allocation > > The user has the option of monitoring their compute jobs' hugepage > usage by reading /proc/PID/smaps and looking at the AnonHugePages > count for the large mappings of interest. Fair enough. One slight problem though: akpm:/usr/src/25> grep -r thp_collapse_alloc_failed Documentation akpm:/usr/src/25> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations 2011-11-15 0:03 ` Andrew Morton @ 2011-11-15 2:00 ` Andrea Arcangeli 2011-11-15 2:08 ` Andrea Arcangeli 2011-11-15 15:00 ` Mel Gorman 1 sibling, 2 replies; 34+ messages in thread From: Andrea Arcangeli @ 2011-11-15 2:00 UTC (permalink / raw) To: Andrew Morton Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm On Mon, Nov 14, 2011 at 04:03:45PM -0800, Andrew Morton wrote: > On Fri, 11 Nov 2011 10:01:56 +0000 > Mel Gorman <mgorman@suse.de> wrote: > > A 1000-hour compute job will have its pages collapsed into hugepages by > > khugepaged so they might not have the huge pages at the very beginning > > but they get them. With khugepaged in place, there should be no need for > > an additional tuneable. > > OK... It's a good idea to keep it monitored. But I guess the reduced rate will only materialize at temporary VM stress times. > Fair enough. One slight problem though: > > akpm:/usr/src/25> grep -r thp_collapse_alloc_failed Documentation > akpm:/usr/src/25> I didn't fill that gap but I was reading the code again and I don't see why we keep retrying for -EAGAIN in the !sync case. Maybe the below is good (untested). I doubt it's good to spend cpu to retry the trylock or to retry the migrate on a pinned page by O_DIRECT. In fact as far as THP success rate is concerned maybe we should "goto out" instead of "goto fail" but I didn't change to that as compaction even if it fails a subpage may still be successful at creating order 1/2/3/4...8 pages. I only avoid 9 loops to retry a trylock or a page under O_DIRECT. Maybe that will save a bit of CPU, I doubt it can decrease the success rate in any significant way. I'll test it at the next build... ==== From: Andrea Arcangeli <aarcange@redhat.com> Subject: [PATCH] migrate: !sync don't retry For !sync it's not worth retrying because we won't lock_page even after the second pass. So make -EAGAIN behave like -EBUSY in the !sync case, which should be faster. The difference between -EAGAIN and -EBUSY remains as usual for the sync case, where -EAGAIN will retry, while -EBUSY will not. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/migrate.c | 16 +++++++++------- 1 files changed, 9 insertions(+), 7 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 578e291..7d97a14 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -680,11 +680,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, * For !sync, there is no point retrying as the retry loop * is expected to be too short for PageWriteback to be cleared */ - if (!sync) { - rc = -EBUSY; - goto uncharge; - } - if (!force) + if (!force || !sync) goto uncharge; wait_on_page_writeback(page); } @@ -794,7 +790,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private, rc = __unmap_and_move(page, newpage, force, offlining, sync); out: - if (rc != -EAGAIN) { + if (rc != -EAGAIN || !sync) { /* * A page that has been migrated has all references * removed and will be freed.
A page that has not been @@ -874,7 +870,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, out: unlock_page(hpage); - if (rc != -EAGAIN) { + if (rc != -EAGAIN || !sync) { list_del(&hpage->lru); put_page(hpage); } @@ -934,11 +930,14 @@ int migrate_pages(struct list_head *from, case -ENOMEM: goto out; case -EAGAIN: + if (!sync) + goto fail; retry++; break; case 0: break; default: + fail: /* Permanent failure */ nr_failed++; break; @@ -981,11 +980,14 @@ int migrate_huge_pages(struct list_head *from, case -ENOMEM: goto out; case -EAGAIN: + if (!sync) + goto fail; retry++; break; case 0: break; default: + fail: /* Permanent failure */ nr_failed++; break; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 34+ messages in thread
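[To make the behavioural change easier to follow, here is a hypothetical
userspace model of the migrate_pages() retry loop; the names, the stand-in
for __unmap_and_move() and the simplified single-page loop are illustrative
assumptions, not the kernel's code. Before the patch, -EAGAIN always earned
another pass; after it, -EAGAIN only does so for sync migration.]

/*
 * Illustrative model (not kernel code) of the retry-semantics change
 * proposed above: with the patch, -EAGAIN only earns another pass when
 * sync migration was requested; for !sync it is a permanent failure.
 */
#include <errno.h>
#include <stdio.h>

#define NR_PASSES 10	/* migrate_pages() makes up to 10 passes */

/* Stand-in for __unmap_and_move(): pretend the page lock frees up
 * on the third attempt. */
static int try_migrate_one(int pass)
{
	return pass < 2 ? -EAGAIN : 0;
}

static int migrate_one(int sync)
{
	int pass, rc = -EAGAIN;

	for (pass = 0; pass < NR_PASSES && rc == -EAGAIN; pass++) {
		rc = try_migrate_one(pass);
		if (rc == -EAGAIN && !sync)
			break;	/* the patch: give up instead of retrying */
	}
	return rc;
}

int main(void)
{
	printf("async: rc=%d after giving up early\n", migrate_one(0));
	printf("sync:  rc=%d after retrying\n", migrate_one(1));
	return 0;
}

The model shows the trade-off debated below: async migration saves the
extra passes but fails on any page whose lock would have been released a
moment later.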
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  2011-11-15  2:00                 ` Andrea Arcangeli
@ 2011-11-15  2:08                   ` Andrea Arcangeli
  2011-11-15 15:01                     ` Mel Gorman
  2011-11-15 15:00                   ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-15  2:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel, linux-mm

On Tue, Nov 15, 2011 at 03:00:09AM +0100, Andrea Arcangeli wrote:
> I didn't fill that gap, but I was reading the code again and I don't
> see why we keep retrying for -EAGAIN in the !sync case. Maybe the below
> is good (untested). I doubt it's good to spend CPU retrying the trylock,
> or retrying the migrate of a page pinned by O_DIRECT. In fact, as far
> as the THP success rate is concerned, maybe we should "goto out"
> instead of "goto fail", but I didn't change to that because even if
> compaction fails on one subpage it may still succeed at creating order
> 1/2/3/4...8 pages. I only avoid 9 loops retrying a trylock or a page
> under O_DIRECT. Maybe that will save a bit of CPU; I doubt it can
> decrease the success rate in any significant way. I'll test it at the
> next build...

At the same time I also noticed another minor cleanup (also untested;
will test at the next build together with some other stuff).

===
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] compaction: move ISOLATE_CLEAN setting out of the
 isolate_migratepages loop

cc->sync and mode cannot change within the loop so move it out.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 899d956..be0be1d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -291,6 +291,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		return ISOLATE_ABORT;
 	}
+	if (!cc->sync)
+		mode |= ISOLATE_CLEAN;
+
 	/* Time to isolate some pages for migration */
 	cond_resched();
 	spin_lock_irq(&zone->lru_lock);
@@ -349,9 +352,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
-		if (!cc->sync)
-			mode |= ISOLATE_CLEAN;
-
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  2011-11-15  2:08                   ` Andrea Arcangeli
@ 2011-11-15 15:01                     ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-15 15:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel, linux-mm

On Tue, Nov 15, 2011 at 03:08:31AM +0100, Andrea Arcangeli wrote:
> On Tue, Nov 15, 2011 at 03:00:09AM +0100, Andrea Arcangeli wrote:
> > I didn't fill that gap, but I was reading the code again and I don't
> > see why we keep retrying for -EAGAIN in the !sync case. Maybe the
> > below is good (untested). I doubt it's good to spend CPU retrying
> > the trylock, or retrying the migrate of a page pinned by O_DIRECT.
> > In fact, as far as the THP success rate is concerned, maybe we
> > should "goto out" instead of "goto fail", but I didn't change to
> > that because even if compaction fails on one subpage it may still
> > succeed at creating order 1/2/3/4...8 pages. I only avoid 9 loops
> > retrying a trylock or a page under O_DIRECT. Maybe that will save a
> > bit of CPU; I doubt it can decrease the success rate in any
> > significant way. I'll test it at the next build...
>
> At the same time I also noticed another minor cleanup (also untested;
> will test at the next build together with some other stuff).
>
> ===
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: [PATCH] compaction: move ISOLATE_CLEAN setting out of the
>  isolate_migratepages loop
>
> cc->sync and mode cannot change within the loop so move it out.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  2011-11-15  2:00                 ` Andrea Arcangeli
  2011-11-15  2:08                   ` Andrea Arcangeli
@ 2011-11-15 15:00                   ` Mel Gorman
  1 sibling, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-15 15:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel, linux-mm

On Tue, Nov 15, 2011 at 03:00:09AM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 14, 2011 at 04:03:45PM -0800, Andrew Morton wrote:
> > On Fri, 11 Nov 2011 10:01:56 +0000
> > Mel Gorman <mgorman@suse.de> wrote:
> > > A 1000-hour compute job will have its pages collapsed into hugepages by
> > > khugepaged so they might not have the huge pages at the very beginning
> > > but they get them. With khugepaged in place, there should be no need for
> > > an additional tuneable.
> >
> > OK...
>
> It's a good idea to keep it monitored, but I guess the reduced rate
> will only materialize at times of temporary VM stress.
>
> > Fair enough. One slight problem though:
> >
> > akpm:/usr/src/25> grep -r thp_collapse_alloc_failed Documentation
> > akpm:/usr/src/25>
>
> I didn't fill that gap, but I was reading the code again and I don't
> see why we keep retrying for -EAGAIN in the !sync case. Maybe the
> below is good (untested). I doubt it's good to spend CPU retrying the
> trylock, or retrying the migrate of a page pinned by O_DIRECT.

The retry happens on the expectation that the page is not locked for
very long. Pinning for O_DIRECT would be an exception. Readahead pages
would be another. I don't have data on how often we encounter a page
that is locked for a very short period of time versus being locked for
IO and, even if I did, it would depend on the workload and the amount
of RAM.

> In fact, as far as the THP success rate is concerned, maybe we should
> "goto out" instead of "goto fail", but I didn't change to that because
> even if compaction fails on one subpage it may still succeed at
> creating order 1/2/3/4...8 pages. I only avoid 9 loops retrying a
> trylock or a page under O_DIRECT. Maybe that will save a bit of CPU;
> I doubt it can decrease the success rate in any significant way. I'll
> test it at the next build...
>
> ====
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: [PATCH] migrate: !sync don't retry
>
> For !sync it's not worth retrying because we won't lock_page even
> after the second pass.

Except in the cases where the page was locked for a very short period
of time. I think this will have an impact on success rates for minimal
latency savings, but I cannot predict how much of an impact.

--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  2011-11-15  0:03               ` Andrew Morton
  2011-11-15  2:00                 ` Andrea Arcangeli
@ 2011-11-15 13:07                 ` Mel Gorman
  2011-11-15 15:47                   ` Andrea Arcangeli
  1 sibling, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2011-11-15 13:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
	Andrea Arcangeli, linux-kernel, linux-mm

On Mon, Nov 14, 2011 at 04:03:45PM -0800, Andrew Morton wrote:
> > <SNIP>
> > A 1000-hour compute job will have its pages collapsed into hugepages by
> > khugepaged so they might not have the huge pages at the very beginning
> > but they get them. With khugepaged in place, there should be no need for
> > an additional tuneable.
>
> OK...

David Rientjes did point out that in certain cases it is preferred that
khugepaged be disabled for jobs that are CPU-bound, high priority and
do not want interference. If that really is the case, it should not be
the default behaviour and should instead be added as a new option to
/sys/kernel/mm/transparent_hugepage/defrag in a separate patch.

> > > Do we have the accounting in place for us to be able to determine how
> > > many huge page allocation attempts failed due to this change?
> >
> > thp_fault_fallback is the big one. It is incremented if we fail to
> > allocate a hugepage during fault in either
> > do_huge_pmd_anonymous_page or do_huge_pmd_wp_page_fallback.
> >
> > thp_collapse_alloc_failed is also very interesting. It is incremented
> > if khugepaged tried to collapse pages into a hugepage and
> > failed the allocation.
> >
> > The user has the option of monitoring their compute jobs' hugepage
> > usage by reading /proc/PID/smaps and looking at the AnonHugePages
> > count for the large mappings of interest.
>
> Fair enough. One slight problem though:
>
> akpm:/usr/src/25> grep -r thp_collapse_alloc_failed Documentation
> akpm:/usr/src/25>

mel@machina:~/git-public/linux-2.6/Documentation > git grep vmstat
trace/postprocess/trace-vmscan-postprocess.pl: # To closer match vmstat scanning statistics, only count isolate_both
mel@machina:~/git-public/linux-2.6/Documentation >

Given such an abundance and wealth of information on vmstat, how about
this?

==== CUT HERE ====
mm: Document the meminfo and vmstat fields of relevance to transparent hugepages

This patch updates Documentation/vm/transhuge.txt and
Documentation/filesystems/proc.txt with some information on monitoring
transparent huge page usage and the associated overhead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/filesystems/proc.txt |    2 +
 Documentation/vm/transhuge.txt     |   62 ++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0ec91f0..fb6ca6d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -710,6 +710,7 @@ Committed_AS: 100056 kB
 VmallocTotal: 112216 kB
 VmallocUsed:     428 kB
 VmallocChunk: 111088 kB
+AnonHugePages:  49152 kB

     MemTotal: Total usable ram (i.e. physical ram minus a few reserved
               bits and the kernel binary code)
@@ -743,6 +744,7 @@ VmallocChunk: 111088 kB
        Dirty: Memory which is waiting to get written back to the disk
    Writeback: Memory which is actively being written back to the disk
    AnonPages: Non-file backed pages mapped into userspace page tables
+AnonHugePages: Non-file backed huge pages mapped into userspace page tables
       Mapped: files which have been mmaped, such as libraries
         Slab: in-kernel data structures cache
 SReclaimable: Part of Slab, that might be reclaimed, such as caches
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 29bdf62..f734bb2 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -166,6 +166,68 @@ behavior. So to make them effective you need to restart any
 application that could have been using hugepages. This also applies to
 the regions registered in khugepaged.

+== Monitoring usage ==
+
+The number of transparent huge pages currently used by the system is
+available by reading the AnonHugePages field in /proc/meminfo. To
+identify what applications are using transparent huge pages, it is
+necessary to read /proc/PID/smaps and count the AnonHugePages fields
+for each mapping. Note that reading the smaps file is expensive and
+reading it frequently will incur overhead.
+
+There are a number of counters in /proc/vmstat that may be used to
+monitor how successfully the system is providing huge pages for use.
+
+thp_fault_alloc is incremented every time a huge page is successfully
+	allocated to handle a page fault. This applies to both the
+	first time a page is faulted and for COW faults.
+
+thp_collapse_alloc is incremented by khugepaged when it has found
+	a range of pages to collapse into one huge page and has
+	successfully allocated a new huge page to store the data.
+
+thp_fault_fallback is incremented if a page fault fails to allocate
+	a huge page and instead falls back to using small pages.
+
+thp_collapse_alloc_failed is incremented if khugepaged found a range
+	of pages that should be collapsed into one huge page but failed
+	the allocation.
+
+thp_split is incremented every time a huge page is split into base
+	pages. This can happen for a variety of reasons but a common
+	reason is that a huge page is old and is being reclaimed.
+
+As the system ages, allocating huge pages may be expensive as the
+system uses memory compaction to copy data around memory to free a
+huge page for use. There are some counters in /proc/vmstat to help
+monitor this overhead.
+
+compact_stall is incremented every time a process stalls to run
+	memory compaction so that a huge page is free for use.
+
+compact_success is incremented if the system compacted memory and
+	freed a huge page for use.
+
+compact_fail is incremented if the system tries to compact memory
+	but failed.
+
+compact_pages_moved is incremented each time a page is moved. If
+	this value is increasing rapidly, it implies that the system
+	is copying a lot of data to satisfy the huge page allocation.
+	It is possible that the cost of copying exceeds any savings
+	from reduced TLB misses.
+
+compact_pagemigrate_failed is incremented when the underlying mechanism
+	for moving a page failed.
+
+compact_blocks_moved is incremented each time memory compaction examines
+	a huge page aligned range of pages.
+
+It is possible to establish how long the stalls were using the function
+tracer to record how long was spent in __alloc_pages_nodemask and
+using the mm_page_alloc tracepoint to identify which allocations were
+for huge pages.
+
 == get_user_pages and follow_page ==

 get_user_pages and follow_page if run on a hugepage, will return the
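[The stall-measurement approach mentioned at the end of the patch can be
scripted against the tracing filesystem. The following is a hedged sketch,
not part of the patch: it assumes debugfs is mounted at /sys/kernel/debug,
a kernel built with the function_graph tracer and the kmem trace events,
and root privileges.]

/*
 * Hypothetical setup helper: point the function_graph tracer at
 * __alloc_pages_nodemask and enable the mm_page_alloc tracepoint so
 * per-call allocation latencies can be matched to huge page
 * allocations in the trace output. Run as root.
 */
#include <stdio.h>
#include <stdlib.h>

#define TRACING "/sys/kernel/debug/tracing/"

/* Write a value to a tracefs control file, aborting on failure. */
static void echo_to(const char *file, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), TRACING "%s", file);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	echo_to("set_graph_function", "__alloc_pages_nodemask");
	echo_to("current_tracer", "function_graph");
	echo_to("events/kmem/mm_page_alloc/enable", "1");
	echo_to("tracing_on", "1");
	puts("Tracing; read " TRACING "trace_pipe for per-call latencies.");
	return 0;
}

Reading trace_pipe afterwards shows the time spent inside each
__alloc_pages_nodemask call, and the interleaved mm_page_alloc events
identify which of those calls were order-9 allocations.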
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
  2011-11-15 13:07                 ` Mel Gorman
@ 2011-11-15 15:47                   ` Andrea Arcangeli
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-15 15:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel, linux-mm

On Tue, Nov 15, 2011 at 01:07:48PM +0000, Mel Gorman wrote:
> Given such an abundance and wealth of information on vmstat, how about
> this?

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
end of thread, other threads:[~2011-11-18 17:59 UTC | newest]

Thread overview: 34+ messages
2011-11-10 10:06 [PATCH] mm: Do not stall in synchronous compaction for THP allocations Mel Gorman
2011-11-10 10:38 ` Johannes Weiner
2011-11-10 10:51 ` Alan Cox
2011-11-10 12:06 ` Johannes Weiner
2011-11-10 14:00 ` Andrea Arcangeli
2011-11-10 14:22 ` Mel Gorman
2011-11-10 15:12 ` Minchan Kim
2011-11-10 16:13 ` Mel Gorman
2011-11-10 16:30 ` Minchan Kim
2011-11-10 16:48 ` Mel Gorman
2011-11-10 23:12 ` Andrew Morton
2011-11-10 23:37 ` David Rientjes
2011-11-11 10:14 ` Mel Gorman
2011-11-11 10:39 ` David Rientjes
2011-11-11 11:17 ` Mel Gorman
2011-11-11 14:21 ` Andrea Arcangeli
2011-11-14 23:44 ` Andrew Morton
2011-11-15 13:25 ` Mel Gorman
2011-11-15 21:07 ` David Rientjes
2011-11-15 23:48 ` Mel Gorman
2011-11-16  0:07 ` David Rientjes
2011-11-16  4:13 ` Andrea Arcangeli
2011-11-16 13:30 ` Andrea Arcangeli
2011-11-16 15:07 ` Mel Gorman
2011-11-18 17:59 ` Andrea Arcangeli
2011-11-16 14:14 ` Mel Gorman
2011-11-11 10:01 ` Mel Gorman
2011-11-15  0:03 ` Andrew Morton
2011-11-15  2:00 ` Andrea Arcangeli
2011-11-15  2:08 ` Andrea Arcangeli
2011-11-15 15:01 ` Mel Gorman
2011-11-15 15:00 ` Mel Gorman
2011-11-15 13:07 ` Mel Gorman
2011-11-15 15:47 ` Andrea Arcangeli