* [PATCH] mm: Do not stall in synchronous compaction for THP allocations
@ 2011-11-10 10:06 Mel Gorman
2011-11-10 10:38 ` Johannes Weiner
` (3 more replies)
0 siblings, 4 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-10 10:06 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli,
linux-kernel, linux-mm
Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled. Reports on this
have been intermittent and not reliably reproducible locally, but:
Andy Isaacson reported a problem copying to VFAT on SD Card
https://lkml.org/lkml/2011/11/7/2
In this case, it was stuck in munmap for between 20 and 60
seconds in compaction. It is also possible that khugepaged
was holding mmap_sem on this process if CONFIG_NUMA was set.
Johannes Weiner reported stalls on USB
https://lkml.org/lkml/2011/7/25/378
In this case, there is no stack trace but it looks like the
same problem. The USB stick may have been using NTFS as a
filesystem based on other work done related to writing back
to USB around the same time.
Internally in SUSE, I received a bug report related to stalls in firefox
when using Java and Flash heavily while copying from NFS
to VFAT on USB. It has not been confirmed to be the same problem
but if it looks like a duck and quacks like a duck.....
In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] ensured that sync compaction
was never used for THP allocations. This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync
migraton with __GFP_NO_KSWAPD] on the grounds that it was uncertain
whether it was beneficial.
While user-visible stalls do not happen for me when writing to USB,
I set up a test running postmark while short-lived processes created
anonymous mappings. The objective was to exercise the paths that
allocate transparent huge pages. I then logged when processes were
stalled for more than 1 second, recorded a stack trace and did some
analysis to aggregate unique "stall events", which revealed:
Time stalled in this event: 47369 ms
Event count: 20
usemem sleep_on_page 3690 ms
usemem sleep_on_page 2148 ms
usemem sleep_on_page 1534 ms
usemem sleep_on_page 1518 ms
usemem sleep_on_page 1225 ms
usemem sleep_on_page 2205 ms
usemem sleep_on_page 2399 ms
usemem sleep_on_page 2398 ms
usemem sleep_on_page 3760 ms
usemem sleep_on_page 1861 ms
usemem sleep_on_page 2948 ms
usemem sleep_on_page 1515 ms
usemem sleep_on_page 1386 ms
usemem sleep_on_page 1882 ms
usemem sleep_on_page 1850 ms
usemem sleep_on_page 3715 ms
usemem sleep_on_page 3716 ms
usemem sleep_on_page 4846 ms
usemem sleep_on_page 1306 ms
usemem sleep_on_page 1467 ms
[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30
The stall times are approximate at best, but the estimates represent 25%
of the worst stalls, and even if the estimates are off by a factor of
10, the stalls are severe.
This patch once again prevents sync migration for transparent
hugepage allocations as it is preferable to fail a THP allocation
than to stall. It was suggested that __GFP_NORETRY be used instead of
__GFP_NO_KSWAPD. This would look less like a special case but would
still cause compaction to run at least once with sync compaction.
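For context on why __GFP_NO_KSWAPD identifies these allocations: the THP
fault path builds its gfp mask from the GFP_TRANSHUGE set of flags, which
carries __GFP_NO_KSWAPD (alongside __GFP_NORETRY). A rough sketch of that
construction, paraphrased from mm/huge_memory.c of this era rather than
quoted verbatim:

static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
{
	/*
	 * GFP_TRANSHUGE includes __GFP_NO_KSWAPD, which is what lets the
	 * page allocator recognise a THP allocation and, with this patch,
	 * skip sync compaction for it. __GFP_WAIT is dropped when the
	 * defrag sysfs setting disables defragmentation at fault time.
	 */
	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
}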
If accepted, this is a -stable candidate.
Reported-by: Andy Isaacson <adi@hexapodia.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/page_alloc.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 963c5de..cddc2d0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2213,7 +2213,13 @@ rebalance:
sync_migration);
if (page)
goto got_pg;
- sync_migration = true;
+
+ /*
+ * Do not use sync migration for transparent hugepage allocations as
+ * it could stall writing back pages which is far worse than simply
+ * failing to promote a page.
+ */
+ sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 10:06 [PATCH] mm: Do not stall in synchronous compaction for THP allocations Mel Gorman
@ 2011-11-10 10:38 ` Johannes Weiner
2011-11-10 10:51 ` Alan Cox
` (2 subsequent siblings)
3 siblings, 0 replies; 34+ messages in thread
From: Johannes Weiner @ 2011-11-10 10:38 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Jan Kara, Andy Isaacson, Andrea Arcangeli,
linux-kernel, linux-mm
On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
> Occasionally during large file copies to slow storage, there are still
> reports of user-visible stalls when THP is enabled. Reports on this
> have been intermittent and not reliably reproducible locally, but:
>
> Andy Isaacson reported a problem copying to VFAT on SD Card
> https://lkml.org/lkml/2011/11/7/2
>
> In this case, it was stuck in munmap for between 20 and 60
> seconds in compaction. It is also possible that khugepaged
> was holding mmap_sem on this process if CONFIG_NUMA was set.
>
> Johannes Weiner reported stalls on USB
> https://lkml.org/lkml/2011/7/25/378
>
> In this case, there is no stack trace but it looks like the
> same problem. The USB stick may have been using NTFS as a
> filesystem based on other work done related to writing back
> to USB around the same time.
Sorry, here is a trace from when I first recorded the problem:
[171252.437688] firefox D 000000010a31e6b7 0 8502 1110 0x00000000
[171252.437691] ffff88001e91f8a8 0000000000000082 ffff880000000000 ffff88001e91ffd8
[171252.437693] ffff88001e91ffd8 0000000000004000 ffffffff8180b020 ffff8800456e4680
[171252.437696] ffff88001e91f7e8 ffffffff810cd4ed ffffea00028d4500 000000000000000e
[171252.437699] Call Trace:
[171252.437701] [<ffffffff810cd4ed>] ? __pagevec_free+0x2d/0x40
[171252.437703] [<ffffffff810d044c>] ? release_pages+0x24c/0x280
[171252.437705] [<ffffffff81099448>] ? ktime_get_ts+0xa8/0xe0
[171252.437707] [<ffffffff810c5040>] ? file_read_actor+0x170/0x170
[171252.437709] [<ffffffff8156ac8a>] io_schedule+0x8a/0xd0
[171252.437711] [<ffffffff810c5049>] sleep_on_page+0x9/0x10
[171252.437713] [<ffffffff8156b0f7>] __wait_on_bit+0x57/0x80
[171252.437715] [<ffffffff810c5630>] wait_on_page_bit+0x70/0x80
[171252.437718] [<ffffffff810916c0>] ? autoremove_wake_function+0x40/0x40
[171252.437720] [<ffffffff810f98a5>] migrate_pages+0x2a5/0x490
[171252.437722] [<ffffffff810f5940>] ? suitable_migration_target+0x50/0x50
[171252.437724] [<ffffffff810f6234>] compact_zone+0x4e4/0x770
[171252.437727] [<ffffffff810f65e0>] compact_zone_order+0x80/0xb0
[171252.437729] [<ffffffff810f66cd>] try_to_compact_pages+0xbd/0xf0
[171252.437731] [<ffffffff810cc608>] __alloc_pages_direct_compact+0xa8/0x180
[171252.437734] [<ffffffff810ccbdb>] __alloc_pages_nodemask+0x4fb/0x720
[171252.437736] [<ffffffff810fbfe3>] do_huge_pmd_wp_page+0x4c3/0x740
[171252.437738] [<ffffffff81094fff>] ? hrtimer_start_range_ns+0xf/0x20
[171252.437740] [<ffffffff810e3f9c>] handle_mm_fault+0x1bc/0x2f0
[171252.437743] [<ffffffff8102e6c6>] ? __switch_to+0x1e6/0x2c0
[171252.437745] [<ffffffff810511b2>] do_page_fault+0x132/0x430
[171252.437747] [<ffffffff810a4995>] ? sys_futex+0x105/0x1a0
[171252.437749] [<ffffffff8156cf9f>] page_fault+0x1f/0x30
It could have been vfat, I don't remember for sure. But I would think
the problem depends only on the duration of PageWriteback being set.
> Internally in SUSE, I received a bug report related to stalls in firefox
> when using Java and Flash heavily while copying from NFS
> to VFAT on USB. It has not been confirmed to be the same problem
> but if it looks like a duck and quacks like a duck.....
>
> In the past, commit [11bc82d6: mm: compaction: Use async migration for
> __GFP_NO_KSWAPD and enforce no writeback] ensured that sync compaction
> was never used for THP allocations. This was reverted in commit
> [c6a140bf: mm/compaction: reverse the change that forbade sync
> migraton with __GFP_NO_KSWAPD] on the grounds that it was uncertain
> whether it was beneficial.
>
> While user-visible stalls do not happen for me when writing to USB,
> I set up a test running postmark while short-lived processes created
> anonymous mappings. The objective was to exercise the paths that
> allocate transparent huge pages. I then logged when processes were
> stalled for more than 1 second, recorded a stack trace and did some
> analysis to aggregate unique "stall events", which revealed:
>
> Time stalled in this event: 47369 ms
> Event count: 20
> usemem sleep_on_page 3690 ms
> usemem sleep_on_page 2148 ms
> usemem sleep_on_page 1534 ms
> usemem sleep_on_page 1518 ms
> usemem sleep_on_page 1225 ms
> usemem sleep_on_page 2205 ms
> usemem sleep_on_page 2399 ms
> usemem sleep_on_page 2398 ms
> usemem sleep_on_page 3760 ms
> usemem sleep_on_page 1861 ms
> usemem sleep_on_page 2948 ms
> usemem sleep_on_page 1515 ms
> usemem sleep_on_page 1386 ms
> usemem sleep_on_page 1882 ms
> usemem sleep_on_page 1850 ms
> usemem sleep_on_page 3715 ms
> usemem sleep_on_page 3716 ms
> usemem sleep_on_page 4846 ms
> usemem sleep_on_page 1306 ms
> usemem sleep_on_page 1467 ms
> [<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
> [<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
> [<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
> [<ffffffff81134273>] compact_zone+0x1f3/0x2f0
> [<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
> [<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
> [<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
> [<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
> [<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
> [<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
> [<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
> [<ffffffff814410a7>] do_page_fault+0x207/0x4c0
> [<ffffffff8143dde5>] page_fault+0x25/0x30
>
> The stall times are approximate at best, but the estimates represent 25%
> of the worst stalls, and even if the estimates are off by a factor of
> 10, the stalls are severe.
>
> This patch once again prevents sync migration for transparent
> hugepage allocations as it is preferable to fail a THP allocation
> than to stall. It was suggested that __GFP_NORETRY be used instead of
> __GFP_NO_KSWAPD. This would look less like a special case but would
> still cause compaction to run at least once with sync compaction.
>
> If accepted, this is a -stable candidate.
>
> Reported-by: Andy Isaacson <adi@hexapodia.org>
> Reported-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Johannes Weiner <jweiner@redhat.com>
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 10:06 [PATCH] mm: Do not stall in synchronous compaction for THP allocations Mel Gorman
2011-11-10 10:38 ` Johannes Weiner
@ 2011-11-10 10:51 ` Alan Cox
2011-11-10 12:06 ` Johannes Weiner
2011-11-10 14:00 ` Andrea Arcangeli
2011-11-10 14:22 ` Mel Gorman
3 siblings, 1 reply; 34+ messages in thread
From: Alan Cox @ 2011-11-10 10:51 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
On Thu, 10 Nov 2011 10:06:16 +0000
Mel Gorman <mgorman@suse.de> wrote:
> Occasionally during large file copies to slow storage, there are still
> reports of user-visible stalls when THP is enabled. Reports on this
> have been intermittent and not reliably reproducible locally, but:
If you want to cause a massive stall take a cheap 32GB USB flash drive
plug it into an 8GB box and rsync a lot of small files to it. 400,000
emails in maildir format does the trick and can easily be simulated. The
drive drops to about 1-2 IOPS with all the small mucking around and the
backlog becomes massive.
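A many-small-files workload like that can be approximated without a mail
corpus by generating the files directly. The sketch below is illustrative
only (the file count and 4K size are assumptions, not Alan's actual
setup); it would be run with the current directory on the USB/VFAT mount:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	static const char buf[4096];	/* one small zero-filled "email" */
	char name[64];
	long i, nfiles = argc > 1 ? atol(argv[1]) : 400000;

	for (i = 0; i < nfiles; i++) {
		int fd;

		snprintf(name, sizeof(name), "msg-%ld", i);
		fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (write(fd, buf, sizeof(buf)) < 0)
			perror("write");
		close(fd);
	}
	return 0;
}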
> Internally in SUSE, I received a bug report related to stalls in firefox
> when using Java and Flash heavily while copying from NFS
> to VFAT on USB. It has not been confirmed to be the same problem
> but if it looks like a duck and quacks like a duck.....
With the 32GB USB flash rsync I see firefox block for up to 45 minutes
while operating entirely on an unrelated filesystem. I suspect the
problem becomes visible because an fsync is getting jammed up in
the mess.
Alan
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 10:51 ` Alan Cox
@ 2011-11-10 12:06 ` Johannes Weiner
0 siblings, 0 replies; 34+ messages in thread
From: Johannes Weiner @ 2011-11-10 12:06 UTC (permalink / raw)
To: Alan Cox
Cc: Mel Gorman, Andrew Morton, Jan Kara, Andy Isaacson,
Andrea Arcangeli, linux-kernel, linux-mm
On Thu, Nov 10, 2011 at 10:51:00AM +0000, Alan Cox wrote:
> On Thu, 10 Nov 2011 10:06:16 +0000
> Mel Gorman <mgorman@suse.de> wrote:
>
> > Occasionally during large file copies to slow storage, there are still
> > reports of user-visible stalls when THP is enabled. Reports on this
> > have been intermittent and not reliably reproducible locally, but:
>
> If you want to cause a massive stall take a cheap 32GB USB flash drive
> plug it into an 8GB box and rsync a lot of small files to it. 400,000
> emails in maildir format does the trick and can easily be simulated. The
> drive drops to about 1-2 IOPS with all the small mucking around and the
> backlog becomes massive.
>
> > Internally in SUSE, I received a bug report related to stalls in firefox
> > when using Java and Flash heavily while copying from NFS
> > to VFAT on USB. It has not been confirmed to be the same problem
> > but if it looks like a duck and quacks like a duck.....
>
> With the 32GB USB flash rsync I see firefox block for up to 45 minutes
> while operating entirely on an unrelated filesystem. I suspect the
> problem becomes visible because an fsync is getting jammed up in
> the mess.
Compaction walks PFN ranges, oblivious to inode dirtying order, and so
transparent huge page allocations can get stuck repeatedly on pages
under writeback that are behind whatever the bdi's queue allows to be
inflight.
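To make the blocking point concrete: in sync mode, migration waits for
writeback to finish on each page it wants to move. A rough sketch of the
check in unmap_and_move(), paraphrased from mm/migrate.c of this era
rather than quoted verbatim (error label and locals elided):

	if (PageWriteback(page)) {
		/*
		 * Async compaction gives up on a page that is still under
		 * writeback; sync compaction blocks here until the
		 * (possibly very slow) device finishes writing the page
		 * out, which is the stall being discussed.
		 */
		if (!force || !sync)
			goto out;	/* fail this page, try the next */
		wait_on_page_writeback(page);
	}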
On all hangs I observed while writing to my 16GB USB thumb drive, it
was tasks getting stuck in migration when allocating a THP.
Can you capture /proc/`pidof firefox`/stack while it hangs to see if
what you see is, in fact, the same problem?
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 10:06 [PATCH] mm: Do not stall in synchronous compaction for THP allocations Mel Gorman
2011-11-10 10:38 ` Johannes Weiner
2011-11-10 10:51 ` Alan Cox
@ 2011-11-10 14:00 ` Andrea Arcangeli
2011-11-10 14:22 ` Mel Gorman
3 siblings, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-10 14:00 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner,
linux-kernel, linux-mm
On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 963c5de..cddc2d0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2213,7 +2213,13 @@ rebalance:
> sync_migration);
> if (page)
> goto got_pg;
> - sync_migration = true;
> +
> + /*
> + * Do not use sync migration for transparent hugepage allocations as
> + * it could stall writing back pages which is far worse than simply
> + * failing to promote a page.
> + */
> + sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>
> /* Try direct reclaim and then allocating */
> page = __alloc_pages_direct_reclaim(gfp_mask, order,
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 10:06 [PATCH] mm: Do not stall in synchronous compaction for THP allocations Mel Gorman
` (2 preceding siblings ...)
2011-11-10 14:00 ` Andrea Arcangeli
@ 2011-11-10 14:22 ` Mel Gorman
2011-11-10 15:12 ` Minchan Kim
3 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2011-11-10 14:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Andy Isaacson, Johannes Weiner, Andrea Arcangeli,
linux-kernel, linux-mm
On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
> than to stall. It was suggested that __GFP_NORETRY be used instead of
> __GFP_NO_KSWAPD. This would look less like a special case but would
> still cause compaction to run at least once with sync compaction.
>
This comment is bogus - __GFP_NORETRY would have caught THP allocations
and would not call sync compaction. The issue was that it would also
have caught any hypothetical high-order GFP_THISNODE allocations that
end up calling compaction here
/*
* High-order allocations do not necessarily loop after
* direct reclaim and reclaim/compaction depends on
* compaction being called after reclaim so call directly if
* necessary
*/
page = __alloc_pages_direct_compact(gfp_mask, order,
zonelist, high_zoneidx,
nodemask,
alloc_flags, preferred_zone,
migratetype, &did_some_progress,
sync_migration);
__GFP_NORETRY is used in a bunch of places and while most
of them are not high-order, some of them potentially are, like in
sound/core/memalloc.c. Using __GFP_NO_KSWAPD as the flag allows
these callers to continue using sync compaction. It could be argued
that they would prefer __GFP_NORETRY but the potential side-effects
should be taken into account and the comment updated
if that happens.
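To make that control flow concrete, here is a small self-contained model
of the slow path's choice between the two compaction attempts. It is a
deliberate simplification, not kernel code; the flag values and output
are invented for illustration:

#include <stdbool.h>
#include <stdio.h>

#define __GFP_NORETRY	0x1u	/* illustrative values, not the real ones */
#define __GFP_NO_KSWAPD	0x2u

static void slowpath(const char *who, unsigned int gfp_mask)
{
	bool sync_migration = false;

	printf("%s: first compaction attempt, sync=%d\n", who, sync_migration);

	/* The assignment this patch changes: */
	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);

	if (gfp_mask & __GFP_NORETRY) {
		/* No retry loop: exactly one more compaction attempt */
		printf("%s: final compaction attempt, sync=%d\n",
		       who, sync_migration);
		return;
	}
	printf("%s: loops, compacting again with sync=%d\n",
	       who, sync_migration);
}

int main(void)
{
	slowpath("high-order __GFP_NORETRY user", __GFP_NORETRY);
	slowpath("THP fault", __GFP_NORETRY | __GFP_NO_KSWAPD);
	return 0;
}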
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 14:22 ` Mel Gorman
@ 2011-11-10 15:12 ` Minchan Kim
2011-11-10 16:13 ` Mel Gorman
0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2011-11-10 15:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
Hi Mel,
You should have Cc'ed me because the __GFP_NORETRY suggestion came from me.
On Thu, Nov 10, 2011 at 11:22 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
>> than to stall. It was suggested that __GFP_NORETRY be used instead of
>> __GFP_NO_KSWAPD. This would look less like a special case but would
>> still cause compaction to run at least once with sync compaction.
>>
>
> This comment is bogus - __GFP_NORETRY would have caught THP allocations
> and would not call sync compaction. The issue was that it would also
> have caught any hypothetical high-order GFP_THISNODE allocations that
> end up calling compaction here
In fact, I support the patch concept so I would like to give
Acked-by: Minchan Kim <minchan.kim@gmail.com>
But I still have a doubt about the code.
__GFP_NORETRY: The VM implementation must not retry indefinitely
What could people think if they look at the above comment?
At least, I can imagine two readings.
First, it is related to *latency*.
Second, "I can handle it if the VM fails the allocation".
I am biased toward the latter.
Then, is __GFP_NO_KSWAPD okay? Does it mean "let's avoid sync compaction
or long latency"?
It's a rather awkward name. Someone has already started to use
__GFP_NO_KSWAPD for such a purpose.
See mtd_kmalloc_up_to(). Its author mentions in the function's comment:
* the system page size. This attempts to make sure it does not adversely
* impact system performance, so when allocating more than one page, we
* ask the memory allocator to avoid re-trying, swapping, writing back
* or performing I/O.
That is what concerned me.
In future, new users of __GFP_NO_KSWAPD will keep coming and we can't
keep them all under our sight.
So I hope we can change the flag name, or comment
__GFP_NO_KSWAPD like this:
/*
* __GFP_NO_KSWAPD is a very VM-internal flag, so please don't use it
* without the approval of the mm guys.
*/
#define __GFP_NO_KSWAPD xxxx
>
> /*
> * High-order allocations do not necessarily loop after
> * direct reclaim and reclaim/compaction depends on
> * compaction being called after reclaim so call directly if
> * necessary
> */
> page = __alloc_pages_direct_compact(gfp_mask, order,
> zonelist, high_zoneidx,
> nodemask,
> alloc_flags, preferred_zone,
> migratetype, &did_some_progress,
> sync_migration);
>
> __GFP_NORETRY is used in a bunch of places and while most
> of them are not high-order, some of them potentially are, like in
> sound/core/memalloc.c. Using __GFP_NO_KSWAPD as the flag allows
> these callers to continue using sync compaction. It could be argued
Okay. If I were biased toward the first reading, I would have opposed
this comment because people might think __GFP_NORETRY is very latency
sensitive.
So they would want the allocation to be very fast, without any writeback
or retrying. From that viewpoint, __GFP_NORETRY isn't bad, I think.
Having said that, I am biased toward the latter, as I said earlier.
> that they would prefer __GFP_NORETRY but the potential side-effects
> should be taken into account and the comment updated
Considering the side-effects, your patch is okay.
But I can't understand the "the comment updated if that
happens" sentence you mentioned. :(
> if that happens.
>
> --
> Mel Gorman
> SUSE Labs
>
--
Kind regards,
Minchan Kim
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 15:12 ` Minchan Kim
@ 2011-11-10 16:13 ` Mel Gorman
2011-11-10 16:30 ` Minchan Kim
2011-11-10 23:12 ` Andrew Morton
0 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-10 16:13 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
On Fri, Nov 11, 2011 at 12:12:01AM +0900, Minchan Kim wrote:
> Hi Mel,
>
> You should have Cc'ed me because the __GFP_NORETRY suggestion came from me.
>
I don't understand. __GFP_NORETRY has been around for a long time
and has loads of users. Do you have particular concerns about the
behaviour of __GFP_NORETRY?
> On Thu, Nov 10, 2011 at 11:22 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
> >> than to stall. It was suggested that __GFP_NORETRY be used instead of
> >> __GFP_NO_KSWAPD. This would look less like a special case but would
> >> still cause compaction to run at least once with sync compaction.
> >>
> >
> > This comment is bogus - __GFP_NORETRY would have caught THP allocations
> > and would not call sync compaction. The issue was that it would also
> > have caught any hypothetical high-order GFP_THISNODE allocations that
> > end up calling compaction here
>
> In fact, I support the patch concept so I would like to give
>
> Acked-by: Minchan Kim <minchan.kim@gmail.com>
> But I still have a doubt about the code.
>
> __GFP_NORETRY: The VM implementation must not retry indefinitely
>
> What could people think if they look at the above comment?
> At least, I can imagine two readings.
>
> First, it is related to *latency*.
I never read it to mean that. I always read it to mean "it must
not retry indefinitely". I expected it was still fine to try direct
reclaim which is not a latency-sensitive operation.
> Second, "I can handle it if the VM fails the allocation"
>
Fully agreed.
> I am biased toward the latter.
> Then, is __GFP_NO_KSWAPD okay? Does it mean "let's avoid sync compaction
> or long latency"?
and do not wake kswapd to swap additional pages.
> It's a rather awkward name. Someone has already started to use
> __GFP_NO_KSWAPD for such a purpose.
> See mtd_kmalloc_up_to(). Its author mentions in the function's comment:
>
> * the system page size. This attempts to make sure it does not adversely
> * impact system performance, so when allocating more than one page, we
> * ask the memory allocator to avoid re-trying, swapping, writing back
> * or performing I/O.
>
I was not aware of this user although their reason for using it
seems fine. We are not breaking their expectations with this patch but
it is a slippery slope.
> That is what concerned me.
> In future, new users of __GFP_NO_KSWAPD will keep coming and we can't
> keep them all under our sight.
> So I hope we can change the flag name, or comment
> __GFP_NO_KSWAPD like this:
>
I can't think of a better name but we can at least improve the comment.
> /*
> * __GFP_NO_KSWAPD is a very VM-internal flag, so please don't use it
> * without the approval of the mm guys.
> */
> #define __GFP_NO_KSWAPD xxxx
>
> >
> > /*
> > * High-order allocations do not necessarily loop after
> > * direct reclaim and reclaim/compaction depends on
> > * compaction being called after reclaim so call directly if
> > * necessary
> > */
> > page = __alloc_pages_direct_compact(gfp_mask, order,
> > zonelist, high_zoneidx,
> > nodemask,
> > alloc_flags, preferred_zone,
> > migratetype, &did_some_progress,
> > sync_migration);
> >
> > __GFP_NORETRY is used in a bunch of places and while most
> > of them are not high-order, some of them potentially are, like in
> > sound/core/memalloc.c. Using __GFP_NO_KSWAPD as the flag allows
> > these callers to continue using sync compaction. It could be argued
>
> Okay. If I were biased toward the first reading, I would have opposed this
> comment because people might think __GFP_NORETRY is very latency sensitive.
As __GFP_NORETRY uses direct reclaim, I do not think users expect it
to be latency sensitive. If they were latency sensitive, they would
mask out __GFP_WAIT.
> So they would want the allocation to be very fast, without any writeback
> or retrying. From that viewpoint, __GFP_NORETRY isn't bad, I think.
>
> Having said that, I am biased toward the latter, as I said earlier.
>
> > that they would prefer __GFP_NORETRY but the potential side-effects
> > should be taken into account and the comment updated
>
> Considering the side-effects, your patch is okay.
> But I can't understand the "the comment updated if that
> happens" sentence you mentioned. :(
>
I meant the comment above sync_migration = !(gfp_mask & __GFP_NO_KSWAPD) should
be updated if it changes to __GFP_NORETRY.
I updated the commentary a bit. Does this help?
==== CUT HERE ====
mm: Do not stall in synchronous compaction for THP allocations
Changelog since V1
o Update changelog with more detail
o Add comment on __GFP_NO_KSWAPD
Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled. Reports on this
have been intermittent and not reliably reproducible locally, but:
Andy Isaacson reported a problem copying to VFAT on SD Card
https://lkml.org/lkml/2011/11/7/2
In this case, it was stuck in munmap for between 20 and 60
seconds in compaction. It is also possible that khugepaged
was holding mmap_sem on this process if CONFIG_NUMA was set.
Johannes Weiner reported stalls on USB
https://lkml.org/lkml/2011/7/25/378
In this case, there is no stack trace but it looks like the
same problem. The USB stick may have been using NTFS as a
filesystem based on other work done related to writing back
to USB around the same time.
Internally in SUSE, I received a bug report related to stalls in firefox
when using Java and Flash heavily while copying from NFS
to VFAT on USB. It has not been confirmed to be the same problem
but if it looks like a duck and quacks like a duck.....
In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] ensured that sync compaction
was never used for THP allocations. This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync
migraton with __GFP_NO_KSWAPD] on the grounds that it was uncertain
whether it was beneficial.
While user-visible stalls do not happen for me when writing to USB,
I set up a test running postmark while short-lived processes created
anonymous mappings. The objective was to exercise the paths that
allocate transparent huge pages. I then logged when processes were
stalled for more than 1 second, recorded a stack trace and did some
analysis to aggregate unique "stall events", which revealed:
Time stalled in this event: 47369 ms
Event count: 20
usemem sleep_on_page 3690 ms
usemem sleep_on_page 2148 ms
usemem sleep_on_page 1534 ms
usemem sleep_on_page 1518 ms
usemem sleep_on_page 1225 ms
usemem sleep_on_page 2205 ms
usemem sleep_on_page 2399 ms
usemem sleep_on_page 2398 ms
usemem sleep_on_page 3760 ms
usemem sleep_on_page 1861 ms
usemem sleep_on_page 2948 ms
usemem sleep_on_page 1515 ms
usemem sleep_on_page 1386 ms
usemem sleep_on_page 1882 ms
usemem sleep_on_page 1850 ms
usemem sleep_on_page 3715 ms
usemem sleep_on_page 3716 ms
usemem sleep_on_page 4846 ms
usemem sleep_on_page 1306 ms
usemem sleep_on_page 1467 ms
[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30
The stall times are approximate at best, but the estimates represent 25%
of the worst stalls, and even if the estimates are off by a factor of
10, the stalls are severe.
This patch once again prevents sync migration for transparent
hugepage allocations as it is preferable to fail a THP allocation
than to stall.
It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD
to look less like a special case. This would prevent THP allocations
from using sync compaction but it would have other side-effects. There are
existing users of __GFP_NORETRY that are doing high-order allocations
and while they can handle allocation failure, it seems reasonable that
they continue to use sync compaction unless there is a deliberate
reason to change that. To help clarify this for the future, this
patch updates the comment for __GFP_NO_KSWAPD.
If accepted, this is a -stable candidate.
Reported-by: Andy Isaacson <adi@hexapodia.org>
Reported-and-tested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/gfp.h | 8 ++++++++
mm/page_alloc.c | 9 ++++++++-
2 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..4470a3c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -83,7 +83,15 @@ struct vm_area_struct;
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
#define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */
+/*
+ * __GFP_NO_KSWAPD indicates that the VM should favour failing the allocation
+ * over excessive disruption of the system. Currently this means
+ * 1. Do not wake kswapd (hence the flag name)
+ * 2. Do not stall in synchronous compaction for high-order allocations
+ * as this may cause the caller to stall writing out pages
+ */
#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
+
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..a3b5da6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2161,7 +2161,14 @@ rebalance:
sync_migration);
if (page)
goto got_pg;
- sync_migration = true;
+
+ /*
+ * Do not use sync migration if __GFP_NO_KSWAPD is used to indicate
+ * the system should not be heavily disrupted. In practice, this is
+ * to avoid THP callers being stalled in writeback during migration
+ * as it's preferable for the allocations to fail than to stall
+ */
+ sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 16:13 ` Mel Gorman
@ 2011-11-10 16:30 ` Minchan Kim
2011-11-10 16:48 ` Mel Gorman
2011-11-10 23:12 ` Andrew Morton
1 sibling, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2011-11-10 16:30 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
On Fri, Nov 11, 2011 at 1:13 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Fri, Nov 11, 2011 at 12:12:01AM +0900, Minchan Kim wrote:
>> Hi Mel,
>>
>> You should have Cc'ed me because the __GFP_NORETRY suggestion came from me.
>>
>
> I don't understand. __GFP_NORETRY has been around for a long time
> and has loads of users. Do you have particular concerns about the
> behaviour of __GFP_NORETRY?
It seems that I misunderstood your point.
I thought you were referring to this:
http://www.spinics.net/lists/linux-mm/msg16371.html
>
>> On Thu, Nov 10, 2011 at 11:22 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
>> >> than to stall. It was suggested that __GFP_NORETRY be used instead of
>> >> __GFP_NO_KSWAPD. This would look less like a special case but would
>> >> still cause compaction to run at least once with sync compaction.
>> >>
>> >
>> > This comment is bogus - __GFP_NORETRY would have caught THP allocations
>> > and would not call sync compaction. The issue was that it would also
>> > have caught any hypothetical high-order GFP_THISNODE allocations that
>> > end up calling compaction here
>>
>> In fact, I support the patch concept so I would like to give
>>
>> Acked-by: Minchan Kim <minchan.kim@gmail.com>
>> But I still have a doubt about the code.
>>
>> __GFP_NORETRY: The VM implementation must not retry indefinitely
>>
>> What could people think if they look at the above comment?
>> At least, I can imagine two readings.
>>
>> First, it is related to *latency*.
>
> I never read it to mean that. I always read it to mean "it must
> not retry indefinitely". I expected it was still fine to try direct
> reclaim which is not a latency-sensitive operation.
>
>> Second, "I can handle it if the VM fails the allocation"
>>
>
> Fully agreed.
>
>> I am biased toward the latter.
>> Then, is __GFP_NO_KSWAPD okay? Does it mean "let's avoid sync compaction
>> or long latency"?
>
> and do not wake kswapd to swap additional pages.
>
>> It's a rather awkward name. Someone has already started to use
>> __GFP_NO_KSWAPD for such a purpose.
>> See mtd_kmalloc_up_to(). Its author mentions in the function's comment:
>>
>> * the system page size. This attempts to make sure it does not adversely
>> * impact system performance, so when allocating more than one page, we
>> * ask the memory allocator to avoid re-trying, swapping, writing back
>> * or performing I/O.
>>
>
> I was not aware of this user although their reason for using it
> seems fine. We are not breaking their expectations with this patch but
> it is a slippery slope.
>
>> That is what concerned me.
>> In future, new users of __GFP_NO_KSWAPD will keep coming and we can't
>> keep them all under our sight.
>> So I hope we can change the flag name, or comment
>> __GFP_NO_KSWAPD like this:
>>
>
> I can't think of a better name but we can at least improve the comment.
>
>> /*
>> * __GFP_NO_KSWAPD is a very VM-internal flag, so please don't use it
>> * without the approval of the mm guys.
>> */
>> #define __GFP_NO_KSWAPD xxxx
>>
>> >
>> > /*
>> > * High-order allocations do not necessarily loop after
>> > * direct reclaim and reclaim/compaction depends on
>> > * compaction being called after reclaim so call directly if
>> > * necessary
>> > */
>> > page = __alloc_pages_direct_compact(gfp_mask, order,
>> > zonelist, high_zoneidx,
>> > nodemask,
>> > alloc_flags, preferred_zone,
>> > migratetype, &did_some_progress,
>> > sync_migration);
>> >
>> > __GFP_NORETRY is used in a bunch of places and while most
>> > of them are not high-order, some of them potentially are, like in
>> > sound/core/memalloc.c. Using __GFP_NO_KSWAPD as the flag allows
>> > these callers to continue using sync compaction. It could be argued
>>
>> Okay. If I were biased toward the first reading, I would have opposed this
>> comment because people might think __GFP_NORETRY is very latency sensitive.
>
> As __GFP_NORETRY uses direct reclaim, I do not think users expect it
> to be latency sensitive. If they were latency sensitive, they would
> mask out __GFP_WAIT.
Indeed.
>
>> So they would want the allocation to be very fast, without any writeback
>> or retrying. From that viewpoint, __GFP_NORETRY isn't bad, I think.
>>
>> Having said that, I am biased toward the latter, as I said earlier.
>>
>> > that they would prefer __GFP_NORETRY but the potential side-effects
>> > should be taken into account and the comment updated
>>
>> Considering the side-effects, your patch is okay.
>> But I can't understand the "the comment updated if that
>> happens" sentence you mentioned. :(
>>
>
> I meant the comment above sync_migration = !(gfp_mask & __GFP_NO_KSWAPD) should
> be updated if it changes to __GFP_NORETRY.
>
> I updated the commentary a bit. Does this help?
I am happy with your updated comment but there is a concern.
We are adding yet another flag that all page allocation callers have to
consider, and there are already too many flags in there.
I really hope this flag can be made VM-internal so that others do not
have to care about it.
Until now, users were happy without such a flag, and it is only a huge
page problem.
--
Kind regards,
Minchan Kim
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 16:30 ` Minchan Kim
@ 2011-11-10 16:48 ` Mel Gorman
0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-10 16:48 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
On Fri, Nov 11, 2011 at 01:30:24AM +0900, Minchan Kim wrote:
> On Fri, Nov 11, 2011 at 1:13 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Fri, Nov 11, 2011 at 12:12:01AM +0900, Minchan Kim wrote:
> >> Hi Mel,
> >>
> >> You should have Cc'ed me because the __GFP_NORETRY suggestion came from me.
> >>
> >
> > I don't understand. __GFP_NORETRY has been around for a long time
> > and has loads of users. Do you have particular concerns about the
> > behaviour of __GFP_NORETRY?
>
> It seems that I misunderstood your point.
> I thought you were referring to this:
>
> http://www.spinics.net/lists/linux-mm/msg16371.html
>
Ah, no I wasn't. I was referring to Andrea's mail suggesting the use of
__GFP_NORETRY and I did not look up whether it had already been discussed.
My bad.
> >> So they would want the allocation to be very fast, without any writeback
> >> or retrying. From that viewpoint, __GFP_NORETRY isn't bad, I think.
> >>
> >> Having said that, I am biased toward the latter, as I said earlier.
> >>
> >> > that they would prefer __GFP_NORETRY but the potential side-effects
> >> > should be taken into account and the comment updated
> >>
> >> Considering the side-effects, your patch is okay.
> >> But I can't understand the "the comment updated if that
> >> happens" sentence you mentioned. :(
> >>
> >
> > I meant the comment above sync_migration = !(gfp_mask & __GFP_NO_KSWAPD) should
> > be updated if it changes to __GFP_NORETRY.
> >
> > I updated the commentary a bit. Does this help?
>
> I am happy with your updated comment but there is a concern.
> We are adding yet another flag that all page allocation callers have to
> consider, and there are already too many flags in there.
>
> I really hope this flag can be made VM-internal so that others do not
> have to care about it.
> Until now, users were happy without such a flag, and it is only a huge
> page problem.
>
I'd rather avoid moving some GFP flags to mm/internal.h unless there is
blatant abuse of flags. I added a warning. How about this?
==== CUT HERE ====
mm: Do not stall in synchronous compaction for THP allocations
Changelog since V1
o Update changelog with more detail
o Add comment on __GFP_NO_KSWAPD
Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled. Reports on this
have been intermittent and not reliably reproducible locally, but:
Andy Isaacson reported a problem copying to VFAT on SD Card
https://lkml.org/lkml/2011/11/7/2
In this case, it was stuck in munmap for between 20 and 60
seconds in compaction. It is also possible that khugepaged
was holding mmap_sem on this process if CONFIG_NUMA was set.
Johannes Weiner reported stalls on USB
https://lkml.org/lkml/2011/7/25/378
In this case, there is no stack trace but it looks like the
same problem. The USB stick may have been using NTFS as a
filesystem based on other work done related to writing back
to USB around the same time.
Internally in SUSE, I received a bug report related to stalls in firefox
when using Java and Flash heavily while copying from NFS
to VFAT on USB. It has not been confirmed to be the same problem
but if it looks like a duck and quacks like a duck.....
In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] ensured that sync compaction
was never used for THP allocations. This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync
migraton with __GFP_NO_KSWAPD] on the grounds that it was uncertain
whether it was beneficial.
While user-visible stalls do not happen for me when writing to USB,
I set up a test running postmark while short-lived processes created
anonymous mappings. The objective was to exercise the paths that
allocate transparent huge pages. I then logged when processes were
stalled for more than 1 second, recorded a stack trace and did some
analysis to aggregate unique "stall events", which revealed:
Time stalled in this event: 47369 ms
Event count: 20
usemem sleep_on_page 3690 ms
usemem sleep_on_page 2148 ms
usemem sleep_on_page 1534 ms
usemem sleep_on_page 1518 ms
usemem sleep_on_page 1225 ms
usemem sleep_on_page 2205 ms
usemem sleep_on_page 2399 ms
usemem sleep_on_page 2398 ms
usemem sleep_on_page 3760 ms
usemem sleep_on_page 1861 ms
usemem sleep_on_page 2948 ms
usemem sleep_on_page 1515 ms
usemem sleep_on_page 1386 ms
usemem sleep_on_page 1882 ms
usemem sleep_on_page 1850 ms
usemem sleep_on_page 3715 ms
usemem sleep_on_page 3716 ms
usemem sleep_on_page 4846 ms
usemem sleep_on_page 1306 ms
usemem sleep_on_page 1467 ms
[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30
The stall times are approximate at best, but the estimates represent 25%
of the worst stalls, and even if the estimates are off by a factor of
10, the stalls are severe.
This patch once again prevents sync migration for transparent
hugepage allocations as it is preferable to fail a THP allocation
than to stall.
It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD
to look less like a special case. This would prevent THP allocations
from using sync compaction but it would have other side-effects. There are
existing users of __GFP_NORETRY that are doing high-order allocations
and while they can handle allocation failure, it seems reasonable that
they continue to use sync compaction unless there is a deliberate
reason to change that. To help clarify this for the future, this
patch updates the comment for __GFP_NO_KSWAPD.
If accepted, this is a -stable candidate.
Reported-by: Andy Isaacson <adi@hexapodia.org>
Reported-and-tested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/gfp.h | 11 +++++++++++
mm/page_alloc.c | 9 ++++++++-
2 files changed, 19 insertions(+), 1 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..ef1b1af 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -83,7 +83,18 @@ struct vm_area_struct;
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
#define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */
+/*
+ * __GFP_NO_KSWAPD indicates that the VM should favour failing the allocation
+ * over excessive disruption of the system. Currently this means
+ * 1. Do not wake kswapd (hence the flag name)
+ * 2. Do not stall in synchronous compaction for high-order allocations
+ * as this may cause the caller to stall writing out pages
+ *
+ * This flag is primarily intended for use with transparent hugepage support.
+ * If the flag is used outside the VM, linux-mm should be cc'd for review.
+ */
#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
+
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..a3b5da6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2161,7 +2161,14 @@ rebalance:
sync_migration);
if (page)
goto got_pg;
- sync_migration = true;
+
+ /*
+ * Do not use sync migration if __GFP_NO_KSWAPD is used to indicate
+ * the system should not be heavily disrupted. In practice, this is
+ * to avoid THP callers being stalled in writeback during migration
+ * as it's preferable for the allocations to fail than to stall
+ */
+ sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 16:13 ` Mel Gorman
2011-11-10 16:30 ` Minchan Kim
@ 2011-11-10 23:12 ` Andrew Morton
2011-11-10 23:37 ` David Rientjes
2011-11-11 10:01 ` Mel Gorman
1 sibling, 2 replies; 34+ messages in thread
From: Andrew Morton @ 2011-11-10 23:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
On Thu, 10 Nov 2011 16:13:31 +0000
Mel Gorman <mgorman@suse.de> wrote:
> This patch once again prevents sync migration for transparent
> hugepage allocations as it is preferable to fail a THP allocation
> than to stall.
Who said? ;) Presumably some people would prefer to get lots of
huge pages for their 1000-hour compute job, and waiting a bit to get
those pages is acceptable.
Do we have the accounting in place for us to be able to determine how
many huge page allocation attempts failed due to this change?
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 23:12 ` Andrew Morton
@ 2011-11-10 23:37 ` David Rientjes
2011-11-11 10:14 ` Mel Gorman
2011-11-11 10:01 ` Mel Gorman
1 sibling, 1 reply; 34+ messages in thread
From: David Rientjes @ 2011-11-10 23:37 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
On Thu, 10 Nov 2011, Andrew Morton wrote:
> > This patch once again prevents sync migration for transparent
> > hugepage allocations as it is preferable to fail a THP allocation
> > than to stall.
>
> Who said? ;) Presumably some people would prefer to get lots of
> huge pages for their 1000-hour compute job, and waiting a bit to get
> those pages is acceptable.
>
Indeed. It seems like the behavior would be better controlled with
/sys/kernel/mm/transparent_hugepage/defrag, which is set aside specifically
to control defragmentation for transparent hugepages, and for that
synchronous compaction should certainly apply.
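For reference, that knob is a plain sysfs file. A minimal sketch of
reading it from userspace (the path is the one named above; the example
value string in the comment is an assumption about kernels of this era):

#include <stdio.h>

int main(void)
{
	char buf[128];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/defrag", "r");

	if (!f) {
		perror("defrag");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("defrag: %s", buf);	/* e.g. "[always] madvise never" */
	fclose(f);
	return 0;
}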
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 23:12 ` Andrew Morton
2011-11-10 23:37 ` David Rientjes
@ 2011-11-11 10:01 ` Mel Gorman
2011-11-15 0:03 ` Andrew Morton
1 sibling, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2011-11-11 10:01 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
On Thu, Nov 10, 2011 at 03:12:11PM -0800, Andrew Morton wrote:
> On Thu, 10 Nov 2011 16:13:31 +0000
> Mel Gorman <mgorman@suse.de> wrote:
>
> > This patch once again prevents sync migration for transparent
> > hugepage allocations as it is preferable to fail a THP allocation
> > than to stall.
>
> Who said? ;)
Everyone who ever complained about stalls due to writing to USB :)
In the last few releases, we have squashed a number of stalls due
to writing pages from the end of the LRU, but sync compaction is a
relatively new root cause of stalls.
> Presumably some people would prefer to get lots of
> huge pages for their 1000-hour compute job, and waiting a bit to get
> those pages is acceptable.
>
A 1000-hour compute job will have its pages collapsed into hugepages by
khugepaged so they might not have the huge pages at the very beginning
but they get them. With khugepaged in place, there should be no need for
an additional tuneable.
> Do we have the accounting in place for us to be able to determine how
> many huge page allocation attempts failed due to this change?
>
thp_fault_fallback is the big one. It is incremented if we fail to
allocate a hugepage during fault in either
do_huge_pmd_anonymous_page or do_huge_pmd_wp_page_fallback
thp_collapse_alloc_failed is also very interesting. It is incremented
if khugepaged tried to collapse pages into a hugepage and
failed the allocation
The user has the option of monitoring their compute job's hugepage
usage by reading /proc/PID/smaps and looking at the AnonHugePages
count for the large mappings of interest.
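Both counters are exported through /proc/vmstat, so monitoring them needs
no special tooling. A minimal sketch (the counter names are as given
above; everything else is illustrative):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* e.g. thp_fault_fallback, thp_collapse_alloc_failed */
		if (!strncmp(line, "thp_", 4))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}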
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-10 23:37 ` David Rientjes
@ 2011-11-11 10:14 ` Mel Gorman
2011-11-11 10:39 ` David Rientjes
2011-11-14 23:44 ` Andrew Morton
0 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-11 10:14 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm
On Thu, Nov 10, 2011 at 03:37:32PM -0800, David Rientjes wrote:
> On Thu, 10 Nov 2011, Andrew Morton wrote:
>
> > > This patch once again prevents sync migration for transparent
> > > hugepage allocations as it is preferable to fail a THP allocation
> > > than to stall.
> >
> > Who said? ;) Presumably some people would prefer to get lots of
> > huge pages for their 1000-hour compute job, and waiting a bit to get
> > those pages is acceptable.
> >
>
> Indeed. It seems like the behavior would be better controlled with
> /sys/kernel/mm/transparent_hugepage/defrag, which is set aside specifically
> to control defragmentation for transparent hugepages, and for that
> synchronous compaction should certainly apply.
With khugepaged in place, it's adding a tunable that is unnecessary and
will not be used. Even if such a tunable were created, the default
behaviour should be "do not stall".
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-11 10:14 ` Mel Gorman
@ 2011-11-11 10:39 ` David Rientjes
2011-11-11 11:17 ` Mel Gorman
2011-11-11 14:21 ` Andrea Arcangeli
2011-11-14 23:44 ` Andrew Morton
1 sibling, 2 replies; 34+ messages in thread
From: David Rientjes @ 2011-11-11 10:39 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm
On Fri, 11 Nov 2011, Mel Gorman wrote:
> > Indeed. It seems like the behavior would better be controlled with
> > /sys/kernel/mm/transparent_hugepage/defrag which is set aside specifically
> > to control defragmentation for transparent hugepages and for that
> > synchronous compaction should certainly apply.
>
> With khugepaged in place, it's adding a tunable that is unnecessary and
> will not be used. Even if such a tuneable was created, the default
> behaviour should be "do not stall".
>
Not sure what you mean; the tunable already exists and defaults to always
if THP is turned on. I've been able to effectively control the behavior
of synchronous compaction with it in combination with extfrag_threshold,
i.e. always compact even if the fragmentation index is very small, for
workloads that really really really want hugepages at fault when such a
latency is permissible and then disable khugepaged entirely in the
background for cpu bound tasks.
The history of this boolean is somewhat disturbing: it's introduced in
77f1fe6b back on January 13 to be true after the first attempt at
compaction, then changed to be !(gfp_mask & __GFP_NO_KSWAPD) in 11bc82d6
on March 22, then changed to be true again in c6a140bf on May 24, then
proposed to be changed right back to !(gfp_mask & __GFP_NO_KSWAPD) in this
patch again. When are we going to understand that the admin needs to tell
the kernel when we'd really like to try to allocate a transparent hugepage
and when it's ok to fail?
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-11 10:39 ` David Rientjes
@ 2011-11-11 11:17 ` Mel Gorman
2011-11-11 14:21 ` Andrea Arcangeli
1 sibling, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-11 11:17 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm
On Fri, Nov 11, 2011 at 02:39:10AM -0800, David Rientjes wrote:
> On Fri, 11 Nov 2011, Mel Gorman wrote:
>
> > > Indeed. It seems like the behavior would better be controlled with
> > > /sys/kernel/mm/transparent_hugepage/defrag which is set aside specifically
> > > to control defragmentation for transparent hugepages and for that
> > > synchronous compaction should certainly apply.
> >
> > With khugepaged in place, it's adding a tunable that is unnecessary and
> > will not be used. Even if such a tuneable was created, the default
> > behaviour should be "do not stall".
> >
>
> Not sure what you mean; the tunable already exists and defaults to always
> if THP is turned on.
Yes, but it does not distinguish between "always including synchronous
stalls" and "always but only if you can do it quickly". This patch
does change the behaviour of "always" but it's still using compaction
to defrag memory. A sysfs file exists, but I wanted to avoid changing
the meaning of its values if at all possible.
If a new value were to be added that would allow the use of
synchronous compaction, what should it be called? always-sync
exposes implementation details which is not great. always-force
is misleading because it wouldn't actually force anything.
always-really-really-mean-it is a bit unwieldy. always-stress might
suit but what is the meaning exactly? Using sync compaction would be
part of it but it could also mean being more aggressive about reclaiming.
What level of control is required and how should it be expressed?
I don't object to the existence of this tunable as such but the
default should still be "no sync compaction for THP" because stalls due
to writing to a USB stick suck.
> I've been able to effectively control the behavior
> of synchronous compaction with it in combination with extfrag_threshold,
extfrag_threshold controls whether compaction runs or not; it does not
control whether synchronous compaction is used.
> i.e. always compact even if the fragmentation index is very small, for
> workloads that really really really want hugepages at fault when such a
> latency is permissible and then disable khugepaged entirely in the
> background for cpu bound tasks.
>
> The history of this boolean is somewhat disturbing: it's introduced in
> 77f1fe6b back on January 13 to be true after the first attempt at
It was first introduced because compaction was stalling. This was
unacceptable because the stalling cost more than hugepages saved and
was user visible. There were other patches related to reducing latency
due to compaction.
> compaction, then changed to be !(gfp_mask & __GFP_NO_KSWAPD) in 11bc82d6
Because this was a safe option.
> on March 22, then changed to be true again in c6a140bf on May 24, then
Because testing indicated that stalls due to sync migration were not
noticeable. For the most part, this is true but there should have
been better recognition that sync migration could also write pages
to slow storage which would be unacceptably slow.
> proposed to be changed right back to !(gfp_mask & __GFP_NO_KSWAPD) in this
> patch again. When are we going to understand that the admin needs to tell
> the kernel when we'd really like to try to allocate a transparent hugepage
> and when it's ok to fail?
I don't recall a point when this was about the administrator wanting to
control synchronous compaction. The objective was to maximise the number
of huge pages used while minimising user-visible stalls.
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-11 10:39 ` David Rientjes
2011-11-11 11:17 ` Mel Gorman
@ 2011-11-11 14:21 ` Andrea Arcangeli
1 sibling, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-11 14:21 UTC (permalink / raw)
To: David Rientjes
Cc: Mel Gorman, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, linux-kernel, linux-mm
On Fri, Nov 11, 2011 at 02:39:10AM -0800, David Rientjes wrote:
> The history of this boolean is somewhat disturbing: it's introduced in
> 77f1fe6b back on January 13 to be true after the first attempt at
> compaction, then changed to be !(gfp_mask & __GFP_NO_KSWAPD) in 11bc82d6
> on March 22, then changed to be true again in c6a140bf on May 24, then
> proposed to be changed right back to !(gfp_mask & __GFP_NO_KSWAPD) in this
> patch again. When are we going to understand that the admin needs to tell
> the kernel when we'd really like to try to allocate a transparent hugepage
> and when it's ok to fail?
Sorry for the confusion but it was reverted by mistake. Mel fixed a
compaction bug that caused stalls of several minutes for some
people: compaction was overflowing into the next zone by mistake
without adjusting its tracking parameters. So it wasn't clear anymore
whether sync compaction was really a problem or not, and I reverted it
to be sure. And well, now we're sure :). Without USB or pathologically
slow I/O it's apparently not a noticeable issue.
So having sync compaction off by default sounds good to me.
We can still add a tunable to force sync compaction in case anybody
needs it.
Another topic is whether slub and friends also want sync compaction off
by default when they allocate with a large order and so trigger compaction.
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-11 10:14 ` Mel Gorman
2011-11-11 10:39 ` David Rientjes
@ 2011-11-14 23:44 ` Andrew Morton
2011-11-15 13:25 ` Mel Gorman
1 sibling, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2011-11-14 23:44 UTC (permalink / raw)
To: Mel Gorman
Cc: David Rientjes, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm
On Fri, 11 Nov 2011 10:14:14 +0000
Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Nov 10, 2011 at 03:37:32PM -0800, David Rientjes wrote:
> > On Thu, 10 Nov 2011, Andrew Morton wrote:
> >
> > > > This patch once again prevents sync migration for transparent
> > > > hugepage allocations as it is preferable to fail a THP allocation
> > > > than stall.
> > >
> > > Who said? ;) Presumably some people would prefer to get lots of
> > > huge pages for their 1000-hour compute job, and waiting a bit to get
> > > those pages is acceptable.
> > >
> >
> > Indeed. It seems like the behavior would better be controlled with
> > /sys/kernel/mm/transparent_hugepage/defrag which is set aside specifically
> > to control defragmentation for transparent hugepages and for that
> > synchronous compaction should certainly apply.
>
> With khugepaged in place, it's adding a tunable that is unnecessary and
> will not be used. Even if such a tuneable was created, the default
> behaviour should be "do not stall".
(who said?)
Let me repeat my cruelly unanswered question: do we have sufficient
instrumentation in place so that operators can determine that this
change is causing them to get less huge pages than they'd like?
Because some people really really want those huge pages. If we go and
silently deprive them of those huge pages via changes like this, how do
they *know* it's happening?
And what are their options for making the kernel try harder to get
those pages?
And how do we communicate all of this to those operators?
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-11 10:01 ` Mel Gorman
@ 2011-11-15 0:03 ` Andrew Morton
2011-11-15 2:00 ` Andrea Arcangeli
2011-11-15 13:07 ` Mel Gorman
0 siblings, 2 replies; 34+ messages in thread
From: Andrew Morton @ 2011-11-15 0:03 UTC (permalink / raw)
To: Mel Gorman
Cc: Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
On Fri, 11 Nov 2011 10:01:56 +0000
Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Nov 10, 2011 at 03:12:11PM -0800, Andrew Morton wrote:
> > On Thu, 10 Nov 2011 16:13:31 +0000
> > Mel Gorman <mgorman@suse.de> wrote:
> >
> > > This patch once again prevents sync migration for transparent
> > > hugepage allocations as it is preferable to fail a THP allocation
> > > than stall.
> >
> > Who said? ;)
>
> Everyone who ever complained about stalls due to writing to USB :)
Oh, here it is. Who broke message threading?
> In the last few releases, we have squashed a number of stalls due
> to writing pages from the end of the LRU but sync compaction is a
> relatively new root cause of stalls.
>
> > Presumably some people would prefer to get lots of
> > huge pages for their 1000-hour compute job, and waiting a bit to get
> > those pages is acceptable.
> >
>
> A 1000-hour compute job will have its pages collapsed into hugepages by
> khugepaged so they might not have the huge pages at the very beginning
> but they get them. With khugepaged in place, there should be no need for
> an additional tuneable.
OK...
> > Do we have the accounting in place for us to be able to determine how
> > many huge page allocation attempts failed due to this change?
> >
>
> thp_fault_fallback is the big one. It is incremented if we fail to
> allocate a hugepage during fault in either
> do_huge_pmd_anonymous_page or do_huge_pmd_wp_page_fallback
>
> thp_collapse_alloc_failed is also very interesting. It is incremented
> if khugepaged tried to collapse pages into a hugepage and
> failed the allocation
>
> The user has the option of monitoring their compute jobs' hugepage
> usage by reading /proc/PID/smaps and looking at the AnonHugePages
> count for the large mappings of interest.
Fair enough. One slight problem though:
akpm:/usr/src/25> grep -r thp_collapse_alloc_failed Documentation
akpm:/usr/src/25>
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-15 0:03 ` Andrew Morton
@ 2011-11-15 2:00 ` Andrea Arcangeli
2011-11-15 2:08 ` Andrea Arcangeli
2011-11-15 15:00 ` Mel Gorman
2011-11-15 13:07 ` Mel Gorman
1 sibling, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-15 2:00 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
linux-kernel, linux-mm
On Mon, Nov 14, 2011 at 04:03:45PM -0800, Andrew Morton wrote:
> On Fri, 11 Nov 2011 10:01:56 +0000
> Mel Gorman <mgorman@suse.de> wrote:
> > A 1000-hour compute job will have its pages collapsed into hugepages by
> > khugepaged so they might not have the huge pages at the very beginning
> > but they get them. With khugepaged in place, there should be no need for
> > an additional tuneable.
>
> OK...
It's a good idea to keep it monitored. But I guess the reduced rate will
only materialize at times of temporary VM stress.
> Fair enough. One slight problem though:
>
> akpm:/usr/src/25> grep -r thp_collapse_alloc_failed Documentation
> akpm:/usr/src/25>
I didn't fill that gap but I was reading the code again and I don't
see why we keep retrying for -EAGAIN in the !sync case. Maybe the
below is good (untested). I doubt it's good to spend cpu to retry the
trylock or to retry the migrate on a pinned page by O_DIRECT. In fact
as far as THP success rate is concerned maybe we should "goto out"
instead of "goto fail" but I didn't change to that as compaction even
if it fails a subpage may still be successful at creating order
1/2/3/4...8 pages. I only avoid 9 loops to retry a trylock or a page
under O_DIRECT. Maybe that will save a bit of CPU, I doubt it can
decrease the success rate in any significant way. I'll test it at the
next build...
====
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] migrate: !sync don't retry
For !sync it's not worth retrying because we won't lock_page even
after the second pass. Making -EAGAIN behave like -EBUSY in the !sync
case should be faster. The difference between -EAGAIN and -EBUSY remains as
usual for the sync case, where -EAGAIN will retry, while -EBUSY will
not.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/migrate.c | 16 +++++++++-------
1 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 578e291..7d97a14 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -680,11 +680,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
* For !sync, there is no point retrying as the retry loop
* is expected to be too short for PageWriteback to be cleared
*/
- if (!sync) {
- rc = -EBUSY;
- goto uncharge;
- }
- if (!force)
+ if (!force || !sync)
goto uncharge;
wait_on_page_writeback(page);
}
@@ -794,7 +790,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
rc = __unmap_and_move(page, newpage, force, offlining, sync);
out:
- if (rc != -EAGAIN) {
+ if (rc != -EAGAIN || !sync) {
/*
* A page that has been migrated has all references
* removed and will be freed. A page that has not been
@@ -874,7 +870,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
out:
unlock_page(hpage);
- if (rc != -EAGAIN) {
+ if (rc != -EAGAIN || !sync) {
list_del(&hpage->lru);
put_page(hpage);
}
@@ -934,11 +930,14 @@ int migrate_pages(struct list_head *from,
case -ENOMEM:
goto out;
case -EAGAIN:
+ if (!sync)
+ goto fail;
retry++;
break;
case 0:
break;
default:
+ fail:
/* Permanent failure */
nr_failed++;
break;
@@ -981,11 +980,14 @@ int migrate_huge_pages(struct list_head *from,
case -ENOMEM:
goto out;
case -EAGAIN:
+ if (!sync)
+ goto fail;
retry++;
break;
case 0:
break;
default:
+ fail:
/* Permanent failure */
nr_failed++;
break;
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-15 2:00 ` Andrea Arcangeli
@ 2011-11-15 2:08 ` Andrea Arcangeli
2011-11-15 15:01 ` Mel Gorman
2011-11-15 15:00 ` Mel Gorman
1 sibling, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-15 2:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
linux-kernel, linux-mm
On Tue, Nov 15, 2011 at 03:00:09AM +0100, Andrea Arcangeli wrote:
> I didn't fill that gap but I was reading the code again and I don't
> see why we keep retrying for -EAGAIN in the !sync case. Maybe the
> below is good (untested). I doubt it's good to spend cpu to retry the
> trylock or to retry the migrate on a pinned page by O_DIRECT. In fact
> as far as THP success rate is concerned maybe we should "goto out"
> instead of "goto fail" but I didn't change to that as compaction even
> if it fails a subpage may still be successful at creating order
> 1/2/3/4...8 pages. I only avoid 9 loops to retry a trylock or a page
> under O_DIRECT. Maybe that will save a bit of CPU, I doubt it can
> decrease the success rate in any significant way. I'll test it at the
> next build...
At the same time I also noticed another minor cleanup (also untested,
will test at the next build together with some other stuff).
===
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] compaction: move ISOLATE_CLEAN setting out of
compaction_migratepages loop
cc->sync and mode cannot change within the loop so move it out.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/compaction.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 899d956..be0be1d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -291,6 +291,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_ABORT;
}
+ if (!cc->sync)
+ mode |= ISOLATE_CLEAN;
+
/* Time to isolate some pages for migration */
cond_resched();
spin_lock_irq(&zone->lru_lock);
@@ -349,9 +352,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
continue;
}
- if (!cc->sync)
- mode |= ISOLATE_CLEAN;
-
/* Try isolate the page */
if (__isolate_lru_page(page, mode, 0) != 0)
continue;
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-15 0:03 ` Andrew Morton
2011-11-15 2:00 ` Andrea Arcangeli
@ 2011-11-15 13:07 ` Mel Gorman
2011-11-15 15:47 ` Andrea Arcangeli
1 sibling, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2011-11-15 13:07 UTC (permalink / raw)
To: Andrew Morton
Cc: Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
Andrea Arcangeli, linux-kernel, linux-mm
On Mon, Nov 14, 2011 at 04:03:45PM -0800, Andrew Morton wrote:
> > <SNIP>
> > A 1000-hour compute job will have its pages collapsed into hugepages by
> > khugepaged so they might not have the huge pages at the very beginning
> > but they get them. With khugepaged in place, there should be no need for
> > an additional tuneable.
>
> OK...
>
David Rientjes did point out that it is preferred in certain
cases that khugepaged be disabled for jobs that are CPU bound, high
priority and do not want interference. If this really is the case,
it should not be the default behaviour but could be added as a new
option to /sys/kernel/mm/transparent_hugepage/defrag in a separate patch.
> > > Do we have the accounting in place for us to be able to determine how
> > > many huge page allocation attempts failed due to this change?
> > >
> >
> > thp_fault_fallback is the big one. It is incremented if we fail to
> > allocate a hugepage during fault in either
> > do_huge_pmd_anonymous_page or do_huge_pmd_wp_page_fallback
> >
> > thp_collapse_alloc_failed is also very interesting. It is incremented
> > if khugepaged tried to collapse pages into a hugepage and
> > failed the allocation
> >
> > The user has the option of monitoring their compute jobs' hugepage
> > usage by reading /proc/PID/smaps and looking at the AnonHugePages
> > count for the large mappings of interest.
>
> Fair enough. One slight problem though:
>
> akpm:/usr/src/25> grep -r thp_collapse_alloc_failed Documentation
> akpm:/usr/src/25>
>
mel@machina:~/git-public/linux-2.6/Documentation > git grep vmstat
trace/postprocess/trace-vmscan-postprocess.pl: # To closer match vmstat scanning statistics, only count isolate_both
mel@machina:~/git-public/linux-2.6/Documentation >
Given such an abundance and wealth of information on vmstat, how
about this?
==== CUT HERE ====
mm: Document the meminfo and vmstat fields of relevance to transparent hugepages
This patch updates Documentation/vm/transhuge.txt and
Documentation/filesystems/proc.txt with some information on monitoring
transparent huge page usage and the associated overhead.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
Documentation/filesystems/proc.txt | 2 +
Documentation/vm/transhuge.txt | 62 ++++++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+), 0 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0ec91f0..fb6ca6d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -710,6 +710,7 @@ Committed_AS: 100056 kB
VmallocTotal: 112216 kB
VmallocUsed: 428 kB
VmallocChunk: 111088 kB
+AnonHugePages: 49152 kB
MemTotal: Total usable ram (i.e. physical ram minus a few reserved
bits and the kernel binary code)
@@ -743,6 +744,7 @@ VmallocChunk: 111088 kB
Dirty: Memory which is waiting to get written back to the disk
Writeback: Memory which is actively being written back to the disk
AnonPages: Non-file backed pages mapped into userspace page tables
+AnonHugePages: Non-file backed huge pages mapped into userspace page tables
Mapped: files which have been mmaped, such as libraries
Slab: in-kernel data structures cache
SReclaimable: Part of Slab, that might be reclaimed, such as caches
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 29bdf62..f734bb2 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -166,6 +166,68 @@ behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to
the regions registered in khugepaged.
+== Monitoring usage ==
+
+The number of transparent huge pages currently used by the system is
+available by reading the AnonHugePages field in /proc/meminfo. To
+identify what applications are using transparent huge pages, it is
+necessary to read /proc/PID/smaps and count the AnonHugePages fields
+for each mapping. Note that reading the smaps file is expensive and
+reading it frequently will incur overhead.
+
+There are a number of counters in /proc/vmstat that may be used to
+monitor how successfully the system is providing huge pages for use.
+
+thp_fault_alloc is incremented every time a huge page is successfully
+ allocated to handle a page fault. This applies to both the
+ first time a page is faulted and for COW faults.
+
+thp_collapse_alloc is incremented by khugepaged when it has found
+ a range of pages to collapse into one huge page and has
+ successfully allocated a new huge page to store the data.
+
+thp_fault_fallback is incremented if a page fault fails to allocate
+ a huge page and instead falls back to using small pages.
+
+thp_collapse_alloc_failed is incremented if khugepaged found a range
+ of pages that should be collapsed into one huge page but failed
+ the allocation.
+
+thp_split is incremented every time a huge page is split into base
+ pages. This can happen for a variety of reasons but a common
+ reason is that a huge page is old and is being reclaimed.
+
+As the system ages, allocating huge pages may be expensive as the
+system uses memory compaction to copy data around memory to free a
+huge page for use. There are some counters in /proc/vmstat to help
+monitor this overhead.
+
+compact_stall is incremented every time a process stalls to run
+ memory compaction so that a huge page is free for use.
+
+compact_success is incremented if the system compacted memory and
+ freed a huge page for use.
+
+compact_fail is incremented if the system tries to compact memory
+ but failed.
+
+compact_pages_moved is incremented each time a page is moved. If
+ this value is increasing rapidly, it implies that the system
+ is copying a lot of data to satisfy the huge page allocation.
+ It is possible that the cost of copying exceeds any savings
+ from reduced TLB misses.
+
+compact_pagemigrate_failed is incremented when the underlying mechanism
+ for moving a page failed.
+
+compact_blocks_moved is incremented each time memory compaction examines
+ a huge page aligned range of pages.
+
+It is possible to establish how long the stalls were using the function
+tracer to record how long was spent in __alloc_pages_nodemask and
+using the mm_page_alloc tracepoint to identify which allocations were
+for huge pages.
+
== get_user_pages and follow_page ==
get_user_pages and follow_page if run on a hugepage, will return the
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-14 23:44 ` Andrew Morton
@ 2011-11-15 13:25 ` Mel Gorman
2011-11-15 21:07 ` David Rientjes
0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2011-11-15 13:25 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm
On Mon, Nov 14, 2011 at 03:44:08PM -0800, Andrew Morton wrote:
> On Fri, 11 Nov 2011 10:14:14 +0000
> Mel Gorman <mgorman@suse.de> wrote:
>
> > On Thu, Nov 10, 2011 at 03:37:32PM -0800, David Rientjes wrote:
> > > On Thu, 10 Nov 2011, Andrew Morton wrote:
> > >
> > > > > This patch once again prevents sync migration for transparent
> > > > > hugepage allocations as it is preferable to fail a THP allocation
> > > > > than stall.
> > > >
> > > > Who said? ;) Presumably some people would prefer to get lots of
> > > > huge pages for their 1000-hour compute job, and waiting a bit to get
> > > > those pages is acceptable.
> > > >
> > >
> > > Indeed. It seems like the behavior would better be controlled with
> > > /sys/kernel/mm/transparent_hugepage/defrag which is set aside specifically
> > > to control defragmentation for transparent hugepages and for that
> > > synchronous compaction should certainly apply.
> >
> > With khugepaged in place, it's adding a tunable that is unnecessary and
> > will not be used. Even if such a tuneable was created, the default
> > behaviour should be "do not stall".
>
> (who said?)
>
> Let me repeat my cruelly unanswered question: do we have sufficient
> instrumentation in place so that operators can determine that this
> change is causing them to get less huge pages than they'd like?
>
Unless we add a mel_did_it counter to vmstat, they won't be able to
identify that it was this patch in particular.
> Because some people really really want those huge pages. If we go and
> silently deprive them of those huge pages via changes like this, how do
> they *know* it's happening?
>
The counters in vmstat will give them a hint but they will not tell them
*why* they are not getting the huge pages they want. That would require
further analysis using a combination of ftrace, /proc/buddyinfo,
/proc/pagetypeinfo and maybe /proc/kpageflags depending on how
important the issue is.
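As one starting point for that analysis, a minimal sketch (untested,
illustrative only) that reads /proc/buddyinfo and reports how many
huge-page-sized blocks each zone has free; it assumes the usual 11
order columns and order-9 (2M with 4K pages) huge pages:

/* Report free order-9 blocks per zone from /proc/buddyinfo.
 * Illustrative sketch only; not a substitute for proper analysis.
 */
#include <stdio.h>

int main(void)
{
	int nid, order;
	long nr_free[11];
	char zone[16];
	FILE *f = fopen("/proc/buddyinfo", "r");

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	/* Each line: "Node N, zone NAME c0 c1 ... c10" */
	while (fscanf(f, " Node %d, zone %15s", &nid, zone) == 2) {
		for (order = 0; order <= 10; order++)
			if (fscanf(f, "%ld", &nr_free[order]) != 1)
				return 1;
		printf("node %d zone %-8s free order-9 blocks: %ld\n",
		       nid, zone, nr_free[9]);
	}
	fclose(f);
	return 0;
}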
> And what are their options for making the kernel try harder to get
> those pages?
>
Fine control is limited. If it is really needed, I would not oppose
a patch that allows the use of sync compaction via a new setting in
/sys/kernel/mm/transparent_hugepage/defrag. However, I think it is
a slippery slope to expose implementation details like this and I'm
not currently planning to implement such a patch.
If they have root access, they have the option of writing to
/proc/sys/vm/compact_memory to manually trigger compaction. If
that does not free enough huge pages, they could use
/proc/sys/vm/drop_caches followed by /proc/sys/vm/compact_memory and
then start the target application. If that was too heavy, they could
write a balloon application which forces some percentage of memory
to be reclaimed by allocating anonymous memory, calling mlock on it,
unmapping the memory and then writing to /proc/sys/vm/compact_memory.
It would be very heavy-handed but it could be a preparation step for
running a job that absolutely must get huge pages without khugepaged
running.
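A minimal sketch of such a balloon (untested, illustrative only; the
quarter-of-RAM figure is an arbitrary example, not a recommendation):

/* Lock a chunk of anonymous memory to force other pages out, release
 * it, then trigger compaction. Run as root, or with a large enough
 * RLIMIT_MEMLOCK, immediately before starting the target job.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long nr_pages = sysconf(_SC_PHYS_PAGES);
	size_t len = (size_t)(nr_pages / 4) * page_size; /* ~25% of RAM */
	FILE *f;
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* mlock faults everything in, pushing other memory to reclaim */
	if (mlock(p, len)) {
		perror("mlock (needs RLIMIT_MEMLOCK or root)");
		return 1;
	}
	munmap(p, len);			/* unmapping also unlocks the range */

	f = fopen("/proc/sys/vm/compact_memory", "w");
	if (!f) {
		perror("/proc/sys/vm/compact_memory");
		return 1;
	}
	fputs("1\n", f);		/* any write triggers compaction */
	fclose(f);
	return 0;
}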
> And how do we communicate all of this to those operators?
The documentation patch will help to some extent, but the more creative
manipulation of the system to increase the success rate of huge
page allocations, and how to analyse it, is not documented. This is
largely because the analysis is conducted on a case-by-case basis.
Mailing "help help" to linux-mm and hoping someone on the internet
can hear you scream may be the only option.
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-15 2:00 ` Andrea Arcangeli
2011-11-15 2:08 ` Andrea Arcangeli
@ 2011-11-15 15:00 ` Mel Gorman
1 sibling, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-15 15:00 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, linux-kernel, linux-mm
On Tue, Nov 15, 2011 at 03:00:09AM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 14, 2011 at 04:03:45PM -0800, Andrew Morton wrote:
> > On Fri, 11 Nov 2011 10:01:56 +0000
> > Mel Gorman <mgorman@suse.de> wrote:
> > > A 1000-hour compute job will have its pages collapsed into hugepages by
> > > khugepaged so they might not have the huge pages at the very beginning
> > > but they get them. With khugepaged in place, there should be no need for
> > > an additional tuneable.
> >
> > OK...
>
> It's a good idea to keep it monitored. But I guess the reduced rate will
> only materialize at times of temporary VM stress.
>
> > Fair enough. One slight problem though:
> >
> > akpm:/usr/src/25> grep -r thp_collapse_alloc_failed Documentation
> > akpm:/usr/src/25>
>
> I didn't fill that gap but I was reading the code again and I don't
> see why we keep retrying for -EAGAIN in the !sync case. Maybe the
> below is good (untested). I doubt it's good to spend cpu to retry the
> trylock or to retry the migrate on a pinned page by O_DIRECT.
The retry happens on the expectation that the page is not locked
for very long. Pinning for O_DIRECT would be an exception. Readahead
pages would be another. I don't have data on how often we encounter
a page that is locked for a very short period of time versus being
locked for IO, and even if I did, it would depend on the workload
and the amount of RAM.
> In fact
> as far as THP success rate is concerned maybe we should "goto out"
> instead of "goto fail" but I didn't change to that as compaction even
> if it fails a subpage may still be successful at creating order
> 1/2/3/4...8 pages. I only avoid 9 loops to retry a trylock or a page
> under O_DIRECT. Maybe that will save a bit of CPU, I doubt it can
> decrease the success rate in any significant way. I'll test it at the
> next build...
>
> ====
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: [PATCH] migrate: !sync don't retry
>
> For !sync it's not worth retrying because we won't lock_page even
> after the second pass.
Except in the cases where the page was locked for a very short
period of time. I think this will have an impact on success rates for
minimal latency savings but I cannot predict how much of an impact.
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-15 2:08 ` Andrea Arcangeli
@ 2011-11-15 15:01 ` Mel Gorman
0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-15 15:01 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, linux-kernel, linux-mm
On Tue, Nov 15, 2011 at 03:08:31AM +0100, Andrea Arcangeli wrote:
> On Tue, Nov 15, 2011 at 03:00:09AM +0100, Andrea Arcangeli wrote:
> > I didn't fill that gap but I was reading the code again and I don't
> > see why we keep retrying for -EAGAIN in the !sync case. Maybe the
> > below is good (untested). I doubt it's good to spend cpu to retry the
> > trylock or to retry the migrate on a pinned page by O_DIRECT. In fact
> > as far as THP success rate is concerned maybe we should "goto out"
> > instead of "goto fail" but I didn't change to that as compaction even
> > if it fails a subpage may still be successful at creating order
> > 1/2/3/4...8 pages. I only avoid 9 loops to retry a trylock or a page
> > under O_DIRECT. Maybe that will save a bit of CPU, I doubt it can
> > decrease the success rate in any significant way. I'll test it at the
> > next build...
>
> At the same time I also noticed another minor cleanup (also untested,
> will test at the next build together with some other stuff).
>
> ===
> From: Andrea Arcangeli <aarcange@redhat.com>
> Subject: [PATCH] compaction: move ISOLATE_CLEAN setting out of
> compaction_migratepages loop
>
> cc->sync and mode cannot change within the loop so move it out.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-15 13:07 ` Mel Gorman
@ 2011-11-15 15:47 ` Andrea Arcangeli
0 siblings, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-15 15:47 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, linux-kernel, linux-mm
On Tue, Nov 15, 2011 at 01:07:48PM +0000, Mel Gorman wrote:
> Given such an abundance and wealth of information on vmstat, how
> about this?
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-15 13:25 ` Mel Gorman
@ 2011-11-15 21:07 ` David Rientjes
2011-11-15 23:48 ` Mel Gorman
0 siblings, 1 reply; 34+ messages in thread
From: David Rientjes @ 2011-11-15 21:07 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm
On Tue, 15 Nov 2011, Mel Gorman wrote:
> Fine control is limited. If it is really needed, I would not oppose
> a patch that allows the use of sync compaction via a new setting in
> /sys/kernel/mm/transparent_hugepage/defrag. However, I think it is
> a slippery slope to expose implementation details like this and I'm
> not currently planning to implement such a patch.
>
This doesn't expose any implementation detail; the "defrag" tunable is
supposed to limit defragmentation efforts in the VM if the hugepages
aren't immediately available and simply fall back to using small pages.
Given that definition, it would make sense to allow for synchronous
defragmentation (i.e. sync_compaction) on the second iteration of the page
allocator slowpath if set. So where's the disconnect between this
proposed behavior and the definition of the tunable in
Documentation/vm/transhuge.txt?
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-15 21:07 ` David Rientjes
@ 2011-11-15 23:48 ` Mel Gorman
2011-11-16 0:07 ` David Rientjes
0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2011-11-15 23:48 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm
On Tue, Nov 15, 2011 at 01:07:51PM -0800, David Rientjes wrote:
> On Tue, 15 Nov 2011, Mel Gorman wrote:
>
> > Fine control is limited. If it is really needed, I would not oppose
> > a patch that allows the use of sync compaction via a new setting in
> > /sys/kernel/mm/transparent_hugepage/defrag. However, I think it is
> > a slippery slope to expose implementation details like this and I'm
> > not currently planning to implement such a patch.
> >
>
> This doesn't expose any implementation detail; the "defrag" tunable is
> supposed to limit defragmentation efforts in the VM if the hugepages
> aren't immediately available and simply fall back to using small pages.
The current settings are "always", "madvise" and "never", which match
the settings for /sys/kernel/mm/transparent_hugepage/enabled and are
fairly straightforward.
Adding sync here could obviously be implemented although it may
require both always-sync and madvise-sync. Alternatively, something
like an options file could be created to create a bitmap similar to
what ftrace does. Whatever the mechanism, it exposes the fact that
"sync compaction" is used. If that turns out to be not enough, then
you may want to add other steps like aggressively reclaiming memory
which also potentially may need to be controlled via the sysfs file
and this is the slippery slope.
> Given that definition, it would make sense to allow for synchronous
> defragmentation (i.e. sync_compaction) on the second iteration of the page
> allocator slowpath if set. So where's the disconnect between this
> proposed behavior and the definition of the tunable in
> Documentation/vm/transhuge.txt?
The transhuge.txt file does not describe how defrag works or whether it
uses sync compaction internally.
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-15 23:48 ` Mel Gorman
@ 2011-11-16 0:07 ` David Rientjes
2011-11-16 4:13 ` Andrea Arcangeli
0 siblings, 1 reply; 34+ messages in thread
From: David Rientjes @ 2011-11-16 0:07 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, Andrea Arcangeli, linux-kernel, linux-mm
On Tue, 15 Nov 2011, Mel Gorman wrote:
> Adding sync here could obviously be implemented although it may
> require both always-sync and madvise-sync. Alternatively, something
> like an options file could be created to create a bitmap similar to
> what ftrace does. Whatever the mechanism, it exposes the fact that
> "sync compaction" is used. If that turns out to be not enough, then
> you may want to add other steps like aggressively reclaiming memory
> which also potentially may need to be controlled via the sysfs file
> and this is the slippery slope.
>
So what's being proposed here in this patch is the fifth time this line
has been changed and it's always been switched between true and !(gfp_mask
& __GFP_NO_KSWAPD). Instead of changing it every few months, I'd suggest
that we tie the semantics of the tunable directly to sync_compaction since
we're primarily targeting thp hugepages with this change anyway for the
"always" case. Comments?
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -116,6 +116,13 @@ echo always >/sys/kernel/mm/transparent_hugepage/defrag
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
echo never >/sys/kernel/mm/transparent_hugepage/defrag
+If defrag is set to "always", then all hugepage allocations also attempt
+synchronous memory compaction which makes the allocation as aggressive
+as possible. The overhead of attempting to allocate the hugepage is
+considered acceptable because of the longterm benefits of the hugepage
+itself at runtime. If the VM should fallback to using regular pages
+instead, then you should use "madvise" or "never".
+
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise, and it'll
be automatically shutdown if it's set to "never".
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2168,7 +2168,17 @@ rebalance:
sync_migration);
if (page)
goto got_pg;
- sync_migration = true;
+
+ /*
+ * Do not use synchronous migration for transparent hugepages unless
+ * defragmentation is always attempted for such allocations since it
+ * can stall in writeback, which is far worse than simply failing to
+ * promote a page. Otherwise, we really do want a hugepage and are as
+ * aggressive as possible to allocate it.
+ */
+ sync_migration = !(gfp_mask & __GFP_NO_KSWAPD) ||
+ (transparent_hugepage_flags &
+ (1 << TRANSPARENT_HUGEPAGE_DEFRAG_FLAG));
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-16 0:07 ` David Rientjes
@ 2011-11-16 4:13 ` Andrea Arcangeli
2011-11-16 13:30 ` Andrea Arcangeli
2011-11-16 14:14 ` Mel Gorman
0 siblings, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-16 4:13 UTC (permalink / raw)
To: David Rientjes
Cc: Mel Gorman, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, linux-kernel, linux-mm
On Tue, Nov 15, 2011 at 04:07:08PM -0800, David Rientjes wrote:
> On Tue, 15 Nov 2011, Mel Gorman wrote:
>
> > Adding sync here could obviously be implemented although it may
> > require both always-sync and madvise-sync. Alternatively, something
> > like an options file could be created to create a bitmap similar to
> > what ftrace does. Whatever the mechanism, it exposes the fact that
> > "sync compaction" is used. If that turns out to be not enough, then
> > you may want to add other steps like aggressively reclaiming memory
> > which also potentially may need to be controlled via the sysfs file
> > and this is the slippery slope.
> >
>
> So what's being proposed here in this patch is the fifth time this line
> has been changed and it's always been switched between true and !(gfp_mask
> & __GFP_NO_KSWAPD). Instead of changing it every few months, I'd suggest
> that we tie the semantics of the tunable directly to sync_compaction since
> we're primarily targeting thp hugepages with this change anyway for the
> "always" case. Comments?
I don't think it's ok. defrag=madvise means __GFP_WAIT not set for
regular THP alloc (without madvise(MADV_HUGEPAGE)) and __GFP_WAIT set
for madvise(MADV_HUGEPAGE).
If __GFP_WAIT isn't set, compaction-async won't be invoked.
After checking my current thp vmstat I think Andrew was right and we
backed out for a good reason before. I'm getting significantly worse
success rate, not sure why it was a small reduction in success rate
but hey I cannot exclude I may have broke something with some other
patch. I've been running it together with a couple more changes. If
it's this change that reduced the success rate, I'm afraid going
always async is not ok.
So before focusing so much on this sync/async flag, I'd like to
understand better why sync stalls so badly. I mean it's not like the VM
with 4k pages won't be doing some throttling too. I suspect we may be
too heavyweight on migrate looping 10 times. Especially with O_DIRECT
the pages may return pinned immediately after they're unpinned (if the
buffer for the I/O is the same, as with dd) so there's no point
waiting on pinned pages 10 times around, amounting to 10*2MB = 20MB of
waiting, i.e. several seconds on a USB stick for each 2M allocation. We can reduce
it to less than one second easily. Maybe somebody has time to test if
the below patch helps or not. I understand in some circumstances it may
not help and it'll lead to the same result, but I think this is a good idea
anyway and maybe it helps. Currently we wait on locked pages, on
writeback pages, we retry on pinned pages again and again on locked
pages etc... That's a bit too much I guess, and we don't have to go
completely async to improve the situation I hope.
Could you check whether you see any improvement with the below patch? I'm
running with dd writing in a loop over a 1G SD card and I don't see
stalls, and the success rate seems better than before, but I hadn't
noticed the stalls before so I can't tell. (To test you need to
back out the other patches first; this is for 3.2-rc2.)
If this isn't enough I'd still prefer to find a way to tackle the
problem in a write-throttling way, or at least I'd need to re-verify
why the success rate was so bad with the patch applied (after 4 days
of uptime under normal load). I tend to check the success rate, and with
upstream it's about perfect; a little degradation is ok but I was
getting less than 50%. Note that with the USB writing in a loop the
success rate may still degrade a bit: a page will be locked, and while
we won't wait on the page lock and then on writeback and suffer all that
slowdown anymore, it'll still write-throttle a bit, and it will restrict
itself to working on movable pages and isolating only clean pages, like
it would have done with async forced.
===
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-16 4:13 ` Andrea Arcangeli
@ 2011-11-16 13:30 ` Andrea Arcangeli
2011-11-16 15:07 ` Mel Gorman
2011-11-18 17:59 ` Andrea Arcangeli
2011-11-16 14:14 ` Mel Gorman
1 sibling, 2 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-16 13:30 UTC (permalink / raw)
To: David Rientjes
Cc: Mel Gorman, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, linux-kernel, linux-mm
On Wed, Nov 16, 2011 at 05:13:50AM +0100, Andrea Arcangeli wrote:
> After checking my current thp vmstat I think Andrew was right and we
> backed out for a good reason before. I'm getting a significantly worse
> success rate; I'm not sure why, as before it was only a small reduction
> in success rate, but hey, I cannot exclude that I may have broken
> something with some other patch. I've been running it together with a
> couple more changes. If
> it's this change that reduced the success rate, I'm afraid going
> always async is not ok.
I wonder if the high failure rate when shutting off "sync compaction"
and forcing only "async compaction" for THP (your patch queued in -mm)
is also because of ISOLATE_CLEAN being set in compaction from commit
39deaf8. ISOLATE_CLEAN skipping PageDirty means all tmpfs/anon pages
added to swapcache (or removed from swapcache which sets the dirty bit
on the page because the pte may be mapped clean) are skipped entirely
by async compaction for no good reason. That can't possibly be ok,
because those don't actually require any I/O or blocking to be
migrated. PageDirty is a "blocking/IO" operation only for filebacked
pages. So I think we must revert 39deaf8, instead of cleaning it up
with my cleanup posted in Message-Id 20111115020831.GF4414@redhat.com .
ISOLATE_CLEAN still looks right for may_writepage: for reclaim, the dirty
bit set on a page is an I/O event; for migrate it is not if the page is
tmpfs/anon.
Did you run your compaction tests with some swap activity?
Reducing the async compaction effectiveness while there's some swap
activity then also leads to running sync compaction and page reclaim
more frequently than needed.
I'm hopeful however that by running just 2 passes of the migrate_pages
main loop with the "avoid overwork in migrate sync mode" patch, we can
fix the excessive hanging. If that works, the number of passes could
actually be a tunable, and setting it to 1 (instead of 2) would then
provide 100% "async compaction" behavior again. And if somebody
prefers to stick to 10 he can... he could then do trylock in pass 0,
lock_page in pass 1, wait_writeback in pass 2, wait on the pin in pass 3,
and finally migrate in pass 4 (something 2 passes alone won't allow).
So making the migrate passes/force-threshold tunable (maybe only for
the new sync=2 migration mode) could be a good idea. Or we could just
return to sync true/false and have the migration tunable affect
everything, but that would alter the reliability of sys_move_pages and
other NUMA things too, where I guess 10 passes are ok. This is why I
added a sync=2 mode for migrate.
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-16 4:13 ` Andrea Arcangeli
2011-11-16 13:30 ` Andrea Arcangeli
@ 2011-11-16 14:14 ` Mel Gorman
1 sibling, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-16 14:14 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: David Rientjes, Andrew Morton, Minchan Kim, Jan Kara,
Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm
On Wed, Nov 16, 2011 at 05:13:50AM +0100, Andrea Arcangeli wrote:
> On Tue, Nov 15, 2011 at 04:07:08PM -0800, David Rientjes wrote:
> > On Tue, 15 Nov 2011, Mel Gorman wrote:
> >
> > > Adding sync here could obviously be implemented although it may
> > > require both always-sync and madvise-sync. Alternatively, something
> > > like an options file could be created to create a bitmap similar to
> > > what ftrace does. Whatever the mechanism, it exposes the fact that
> > > "sync compaction" is used. If that turns out to be not enough, then
> > > you may want to add other steps like aggressively reclaiming memory
> > > which also potentially may need to be controlled via the sysfs file
> > > and this is the slippery slope.
> > >
> >
> > So what's being proposed here in this patch is the fifth time this line
> > has been changed and it's always been switched between true and !(gfp_mask
> > & __GFP_NO_KSWAPD). Instead of changing it every few months, I'd suggest
> > that we tie the semantics of the tunable directly to sync_compaction since
> > we're primarily targeting thp hugepages with this change anyway for the
> > "always" case. Comments?
>
> I don't think it's ok. defrag=madvise means __GFP_WAIT not set for
> regular THP alloc (without madvise(MADV_HUGEPAGE)) and __GFP_WAIT set
> for madvise(MADV_HUGEPAGE).
>
> If __GFP_WAIT isn't set, compaction-async won't be invoked.
>
And if defrag is enabled with this patch, sync compaction can still
stall due to writing to USB as detailed in the changelog for the
original patch in this thread.
> After checking my current thp vmstat I think Andrew was right and we
> backed out for a good reason before. I'm getting a significantly worse
> success rate; I'm not sure why, as before it was only a small reduction
> in success rate, but hey, I cannot exclude that I may have broken
> something with some other patch. I've been running it together with a
> couple more changes. If
> it's this change that reduced the success rate, I'm afraid going
> always async is not ok.
>
Can you narrow it down? I expect it really is the disabling of sync
compaction that is reducing success rates during heavy copy operations,
as 20% of memory is probably dirty pages that need writeback. I'll
explain why.
You didn't say the size of your machine but assuming 2G as it is
easier to reproduce stalls on machines of that size, we get
2G RAM = 524288 pages = 1024 pageblocks assuming 2M hugepage
On a desktop system, most of those blocks will be movable (I'm basing
this on my own laptop; mileage varies considerably), we get
85% of pageblocks are movable so 870 movable page blocks
If a pageblock has even one dirty page then async compaction cannot
convert that into a huge page. dirty_ratio is 20% so that gives
us 104857 dirty pages. If those dirty pages are perfectly randomly
distributed, async compaction will give 0 huge pages.
The actual distribution of dirty pages is different. Early in the
lifetime of the system, they'll be packed into the fewest possible
pageblocks assuming that pages are cleaned in the order they are
dirtied because of how the buddy allocator gives out pages. That
gives a best-case scenario of 68% of memory as huge pages.
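Those numbers can be sanity-checked with a trivial standalone program
(illustrative only; compile with -lm):

/* Check the arithmetic above: 2G of 4K pages, 2M pageblocks, a
 * dirty_ratio of 20%, and the chance a pageblock escapes dirty pages
 * if they were spread uniformly at random.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	long long pages = 2LL * 1024 * 1024 * 1024 / 4096;	/* 524288 */
	long long per_block = 2LL * 1024 * 1024 / 4096;		/* 512 */
	long long blocks = pages / per_block;			/* 1024 */
	long long dirty = pages * 20 / 100;			/* 104857 */
	/* P(a given pageblock has no dirty page) ~= (1 - 0.2)^512 */
	double p_clean = pow(1.0 - 0.2, per_block);

	printf("pages=%lld pageblocks=%lld dirty=%lld\n",
	       pages, blocks, dirty);
	printf("P(pageblock has no dirty page) = %.3g\n", p_clean);
	return 0;
}

The probability comes out around 1e-50, which is why a perfectly random
distribution of dirty pages leaves async compaction with nothing to work
with.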
You'd think the underlying filesystem matters because if the underlying
filesystem supports migratepage, we don't have to write out, but that's
not the case because:
	if (PageDirty(page) && !sync &&
	    mapping->a_ops->migratepage != migrate_page)
		rc = -EBUSY;
	else if (mapping->a_ops->migratepage)
		/*
		 * Most pages have a mapping and most filesystems
		 * should provide a migration function. Anonymous
		 * pages are part of swap space which also has
		 * its own migration function. This is the most
		 * common path for page migration.
		 */
		rc = mapping->a_ops->migratepage(mapping,
						newpage, page);
	else
		rc = fallback_migrate_page(mapping, newpage, page);
Even if the underlying filesystem is able to migrate a dirty page
without IO, we don't call it because the interface itself does not
guarantee that ->migratepage will not block. For example,
buffer_migrate_page can still block on lock_buffer().
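The blocking part of buffer_migrate_page is the unconditional lock of
every buffer on the page, roughly (from mm/migrate.c):

	bh = head;
	do {
		get_bh(bh);
		lock_buffer(bh);	/* sleeps if the buffer is under IO */
		bh = bh->b_this_page;
	} while (bh != head);

so "supports ->migratepage" does not imply "does not block".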
I'm looking into what is necessary to allow async migration to still
use ->migratepage and update each implementation to obey "sync". This
should allow us to avoid using sync compaction without severely
impacting allocation success rates. It'll also need to somehow take
into account that async migration is no longer isolating dirty pages.
As this will be distracting me, this review will be a bit lighter
than it should be.
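To make the direction concrete, the rough shape I have in mind is
pushing the mode down into the callback itself (a sketch only; the mode
type and the exact semantics are still to be decided):

	/* sketch: address_space_operations callback grows a mode argument */
	int (*migratepage)(struct address_space *mapping,
			struct page *newpage, struct page *page,
			enum migrate_mode mode);

with implementations like buffer_migrate_page trylocking the buffers
and returning -EAGAIN in async mode instead of sleeping in
lock_buffer().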
> So before focusing so much on this sync/async flag, I'd like to
> understand better why sync stalls so badly. I mean it's not like the VM
> with 4k pages won't be doing some throttling too.
True, but the core VM avoids writing back dirty pages as much as
possible and even with dirty throttling the caller has the option of
using a different page. Compaction smacks itself into a wall with
dirty pages because of the linear scan and then beats itself harder
because of the large numbers of small IO requests.
> I suspect we may be
> too heavyweight on migrate looping 10 times. Especially with O_DIRECT
> the pages may return pinned immediately after they're unpinned (if the
> buffer for the I/O is the same, like with dd) so there's no point
> waiting on pinned pages 10 times around, amounting to waiting on
> 10*2M = 20MB of writeback = several seconds on a USB stick for each
> 2M allocation. We can reduce it to less than one second easily.
It sounds reasonable to distinguish between pages that we cannot migrate
because they are pinned for IO and pages that we cannot migrate because
they were locked for a short period of time.
> Maybe somebody has time to test if
> the below patch helps or not.
The machine that I normally use to test this sort of patch is currently
tied up so it could take a day before I can queue it.
> I understand in some circumstance it may
> not help and it'll lead to the same but I think this is good idea
> anyway and maybe it helps.
It might be a good idea anyway.
> Currently we wait on locked pages, on
> writeback pages, we retry on pinned pages again and again on locked
> pages etc... That's a bit too much I guess, and we don't have to go
> completely async to improve the situation, I hope.
>
> Could you check whether you see any improvement with the below patch?
> I'm running with dd writing in a loop over a 1G SD card and I don't
> see stalls, and the success rate seems better than before, but I
> hadn't been noticing the stalls before either so I can't tell. (To
> test you need to back out the other patches first; this is for 3.2-rc2.)
>
As it could take me a while to test, grab
http://www.csn.ul.ie/~mel/postings/compaction-20111116/mmtests-add-monitor-for-tracing-processes-stuck-in-D-state.patch
It's a patch to MM Tests but you don't need the suite to use the
scripts. Extract the monitors/watch-dstate.pl and stap-dstate-frequency
scripts from the patch, set up systemtap (I know, it had to work for
kernels without ftrace) and run it as
monitors/watch-dstate.pl -o stall.log
run your test, interrupt watch-dstate.pl then
cat stall.log | stap-dstate-frequency
It'll summarise what the top stall points were with an estimate of
how long you were stalled in there, even if it's not a user-visible
stall. Watch for long stalls in compaction.
> If this isn't enough I'd still prefer to find a way to tackle the
> problem in a write-throttling way, or at least I'd need to re-verify
> why the success rate was so bad with the patch applied (after 4 days
> of uptime under normal load). I tend to check the success rate and with
> upstream it's about perfect and a little degradation is ok but I was
> getting less than 50%.
I bet you a shiny penny if you write a script to read /proc/kpageflags
that you'll find there is at least one dirty page in 50% of movable
pageblocks while your test runs.
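Something like this quick hack would settle the bet (run it as root;
the per-pageblock migratetype is not exported to userspace so it counts
all pageblocks, which makes it an upper bound):

	#include <stdio.h>
	#include <stdint.h>

	#define KPF_DIRTY	4	/* bit index, see Documentation/vm/pagemap.txt */
	#define PAGES_PER_BLOCK	512	/* 4K pages per 2M pageblock */

	int main(void)
	{
		FILE *f = fopen("/proc/kpageflags", "r");
		uint64_t flags;
		long page = 0, blocks = 0, dirty_blocks = 0;
		int block_dirty = 0;

		if (!f) {
			perror("kpageflags");
			return 1;
		}
		while (fread(&flags, sizeof(flags), 1, f) == 1) {
			if (flags & (1ULL << KPF_DIRTY))
				block_dirty = 1;
			if (++page % PAGES_PER_BLOCK == 0) {
				blocks++;
				dirty_blocks += block_dirty;
				block_dirty = 0;
			}
		}
		fclose(f);
		printf("%ld of %ld pageblocks contain at least one dirty page\n",
				dirty_blocks, blocks);
		return 0;
	}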
> Note with the usb stick writing in a loop the success rate may
> degrade a bit: pages will be locked, and we won't wait on the page
> lock and then on writeback with all that slowdown anymore, but it'll
> still write throttle a bit and it will stop working only on movable
> pages and isolating only clean pages like it would have done with
> async forced.
>
> ===
> From 3a379b180aa544f876d3c42b47ae20060ae6279b Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@redhat.com>
> Date: Wed, 16 Nov 2011 02:52:36 +0100
> Subject: [PATCH] compaction: avoid overwork in migrate sync mode
>
> Add a migration sync=2 mode that avoids overwork, so it is more
> suitable for use by compaction: lower latency, but still some write
> throttling.
>
sync=2 is obscure and will be prone to error. It needs to be an enum
with a proper name for each field.
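To illustrate the shape (the names here are invented, not a proposal
set in stone), mapping the three behaviours of the patch:

	enum migrate_mode {
		MIGRATE_ASYNC,		/* was sync == 0: never force, 2 passes */
		MIGRATE_SYNC_LIGHT,	/* was sync == 2: force after pass 0, 2 passes */
		MIGRATE_SYNC,		/* was sync == 1: force after pass 2, 10 passes */
	};

Call sites would then read migrate_pages(..., MIGRATE_SYNC_LIGHT)
instead of a bare 2.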
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> include/linux/migrate.h | 8 ++--
> mm/compaction.c | 2 +-
> mm/memory-failure.c | 2 +-
> mm/memory_hotplug.c | 2 +-
> mm/mempolicy.c | 4 +-
> mm/migrate.c | 77 ++++++++++++++++++++++++++++++-----------------
> 6 files changed, 58 insertions(+), 37 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index e39aeec..f26fc0e 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -14,10 +14,10 @@ extern int migrate_page(struct address_space *,
> struct page *, struct page *);
> extern int migrate_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> - bool sync);
> + int sync);
> extern int migrate_huge_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> - bool sync);
> + int sync);
>
These would then be an enum of some sort (as sketched above) with a better name than "sync".
> extern int fail_migrate_page(struct address_space *,
> struct page *, struct page *);
> @@ -36,10 +36,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
> static inline void putback_lru_pages(struct list_head *l) {}
> static inline int migrate_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> - bool sync) { return -ENOSYS; }
> + int sync) { return -ENOSYS; }
> static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
> unsigned long private, bool offlining,
> - bool sync) { return -ENOSYS; }
> + int sync) { return -ENOSYS; }
>
> static inline int migrate_prep(void) { return -ENOSYS; }
> static inline int migrate_prep_local(void) { return -ENOSYS; }
> diff --git a/mm/compaction.c b/mm/compaction.c
> index be0be1d..cbf2784 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -555,7 +555,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> nr_migrate = cc->nr_migratepages;
> err = migrate_pages(&cc->migratepages, compaction_alloc,
> (unsigned long)cc, false,
> - cc->sync);
> + cc->sync ? 2 : 0);
> update_nr_listpages(cc);
> nr_remaining = cc->nr_migratepages;
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 06d3479..d8a41d3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1557,7 +1557,7 @@ int soft_offline_page(struct page *page, int flags)
> page_is_file_cache(page));
> list_add(&page->lru, &pagelist);
> ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
> - 0, true);
> + false, 1);
> if (ret) {
> putback_lru_pages(&pagelist);
> pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2168489..e1d6176 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -809,7 +809,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
> }
> /* this function returns # of failed pages */
> ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
> - true, true);
> + true, 1);
> if (ret)
> putback_lru_pages(&source);
> }
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index adc3954..0bf88ed 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -933,7 +933,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
>
> if (!list_empty(&pagelist)) {
> err = migrate_pages(&pagelist, new_node_page, dest,
> - false, true);
> + false, 1);
> if (err)
> putback_lru_pages(&pagelist);
> }
> @@ -1154,7 +1154,7 @@ static long do_mbind(unsigned long start, unsigned long len,
> if (!list_empty(&pagelist)) {
> nr_failed = migrate_pages(&pagelist, new_vma_page,
> (unsigned long)vma,
> - false, true);
> + false, 1);
> if (nr_failed)
> putback_lru_pages(&pagelist);
> }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 578e291..175c3bc 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -564,7 +564,7 @@ static int fallback_migrate_page(struct address_space *mapping,
> * == 0 - success
> */
> static int move_to_new_page(struct page *newpage, struct page *page,
> - int remap_swapcache, bool sync)
> + int remap_swapcache, bool force)
> {
> struct address_space *mapping;
> int rc;
> @@ -588,11 +588,11 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> rc = migrate_page(mapping, newpage, page);
> else {
> /*
> - * Do not writeback pages if !sync and migratepage is
> + * Do not writeback pages if !force and migratepage is
> * not pointing to migrate_page() which is nonblocking
> * (swapcache/tmpfs uses migratepage = migrate_page).
> */
> - if (PageDirty(page) && !sync &&
> + if (PageDirty(page) && !force &&
> mapping->a_ops->migratepage != migrate_page)
> rc = -EBUSY;
> else if (mapping->a_ops->migratepage)
To be clear, this is the section I'll be looking at - getting rid of
this check and pushing the sync parameter down into ->migratepage so
async migration can handle more cases.
We'll collide a bit but should be able to reconcile the differences.
> @@ -622,7 +622,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> }
>
> static int __unmap_and_move(struct page *page, struct page *newpage,
> - int force, bool offlining, bool sync)
> + bool force, bool offlining)
> {
> int rc = -EAGAIN;
> int remap_swapcache = 1;
> @@ -631,7 +631,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> struct anon_vma *anon_vma = NULL;
>
> if (!trylock_page(page)) {
> - if (!force || !sync)
> + if (!force)
> goto out;
>
> /*
Superficially this looks like a mistake because we can now call
lock_page but as direct compaction is all that really cares about async
compaction and we check PF_MEMALLOC, it probably doesn't matter.
> @@ -676,14 +676,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> BUG_ON(charge);
>
> if (PageWriteback(page)) {
> - /*
> - * For !sync, there is no point retrying as the retry loop
> - * is expected to be too short for PageWriteback to be cleared
> - */
> - if (!sync) {
> - rc = -EBUSY;
> - goto uncharge;
> - }
> if (!force)
> goto uncharge;
> wait_on_page_writeback(page);
Switching from sync to force introduces an important behaviour
change here I think. After this patch, async compaction becomes sync
compaction on pass > 2 and can call wait_on_page_writeback() which
may not be what you intended. It will probably increase allocation
success rates in some tests but stalls may be worse.
> @@ -751,7 +743,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>
> skip_unmap:
> if (!page_mapped(page))
> - rc = move_to_new_page(newpage, page, remap_swapcache, sync);
> + rc = move_to_new_page(newpage, page, remap_swapcache, force);
>
> if (rc && remap_swapcache)
> remove_migration_ptes(page, page);
> @@ -774,7 +766,7 @@ out:
> * to the newly allocated page in newpage.
> */
> static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> - struct page *page, int force, bool offlining, bool sync)
> + struct page *page, bool force, bool offlining)
> {
> int rc = 0;
> int *result = NULL;
> @@ -792,7 +784,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> if (unlikely(split_huge_page(page)))
> goto out;
>
> - rc = __unmap_and_move(page, newpage, force, offlining, sync);
> + rc = __unmap_and_move(page, newpage, force, offlining);
> out:
> if (rc != -EAGAIN) {
> /*
> @@ -840,7 +832,7 @@ out:
> */
> static int unmap_and_move_huge_page(new_page_t get_new_page,
> unsigned long private, struct page *hpage,
> - int force, bool offlining, bool sync)
> + bool force, bool offlining)
> {
> int rc = 0;
> int *result = NULL;
> @@ -853,7 +845,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> rc = -EAGAIN;
>
> if (!trylock_page(hpage)) {
> - if (!force || !sync)
> + if (!force)
> goto out;
> lock_page(hpage);
> }
> @@ -864,7 +856,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
> try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
>
> if (!page_mapped(hpage))
> - rc = move_to_new_page(new_hpage, hpage, 1, sync);
> + rc = move_to_new_page(new_hpage, hpage, 1, force);
>
> if (rc)
> remove_migration_ptes(hpage, hpage);
> @@ -907,7 +899,7 @@ out:
> */
> int migrate_pages(struct list_head *from,
> new_page_t get_new_page, unsigned long private, bool offlining,
> - bool sync)
> + int sync)
> {
> int retry = 1;
> int nr_failed = 0;
> @@ -920,15 +912,30 @@ int migrate_pages(struct list_head *from,
> if (!swapwrite)
> current->flags |= PF_SWAPWRITE;
>
> - for(pass = 0; pass < 10 && retry; pass++) {
> + for(pass = 0; pass < (sync == 1 ? 10 : 2) && retry; pass++) {
> retry = 0;
>
> list_for_each_entry_safe(page, page2, from, lru) {
> + bool force;
> +
> cond_resched();
>
> + switch (sync) {
> + case 0:
> + force = false;
> + break;
> + case 1:
> + force = pass > 2;
> + break;
> + case 2:
> + force = pass > 0;
> + break;
> + default:
> + BUG();
> + }
> rc = unmap_and_move(get_new_page, private,
> - page, pass > 2, offlining,
> - sync);
> + page,
> + force, offlining);
>
> switch(rc) {
> case -ENOMEM:
> @@ -958,7 +965,7 @@ out:
>
> int migrate_huge_pages(struct list_head *from,
> new_page_t get_new_page, unsigned long private, bool offlining,
> - bool sync)
> + int sync)
> {
> int retry = 1;
> int nr_failed = 0;
> @@ -967,15 +974,29 @@ int migrate_huge_pages(struct list_head *from,
> struct page *page2;
> int rc;
>
> - for (pass = 0; pass < 10 && retry; pass++) {
> + for (pass = 0; pass < (sync == 1 ? 10 : 2) && retry; pass++) {
> retry = 0;
>
> list_for_each_entry_safe(page, page2, from, lru) {
> + bool force;
> cond_resched();
>
> + switch (sync) {
> + case 0:
> + force = false;
> + break;
> + case 1:
> + force = pass > 2;
> + break;
> + case 2:
> + force = pass > 0;
> + break;
> + default:
> + BUG();
> + }
> rc = unmap_and_move_huge_page(get_new_page,
> - private, page, pass > 2, offlining,
> - sync);
> + private, page,
> + force, offlining);
>
> switch(rc) {
> case -ENOMEM:
> @@ -1104,7 +1125,7 @@ set_status:
> err = 0;
> if (!list_empty(&pagelist)) {
> err = migrate_pages(&pagelist, new_page_node,
> - (unsigned long)pm, 0, true);
> + (unsigned long)pm, false, 1);
> if (err)
> putback_lru_pages(&pagelist);
> }
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-16 13:30 ` Andrea Arcangeli
@ 2011-11-16 15:07 ` Mel Gorman
2011-11-18 17:59 ` Andrea Arcangeli
1 sibling, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2011-11-16 15:07 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: David Rientjes, Andrew Morton, Minchan Kim, Jan Kara,
Andy Isaacson, Johannes Weiner, linux-kernel, linux-mm
On Wed, Nov 16, 2011 at 02:30:56PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 16, 2011 at 05:13:50AM +0100, Andrea Arcangeli wrote:
> > After checking my current thp vmstat I think Andrew was right and we
> > backed out for a good reason before. I'm getting significantly worse
> > success rate, not sure why it was a small reduction in success rate
> > but hey, I cannot exclude that I may have broken something with some other
> > patch. I've been running it together with a couple more changes. If
> > it's this change that reduced the success rate, I'm afraid going
> > always async is not ok.
>
> I wonder if the high failure rate when shutting off "sync compaction"
> and forcing only "async compaction" for THP (your patch queued in -mm)
> is also because of ISOLATE_CLEAN being set in compaction from commit
> 39deaf8. ISOLATE_CLEAN skipping PageDirty means all tmpfs/anon pages
> added to swapcache (or removed from swapcache which sets the dirty bit
> on the page because the pte may be mapped clean) are skipped entirely
> by async compaction for no good reason.
Good point! Even though these pages can be migrated without IO or
incurring a sync, we are skipping over them. I'm still
looking at passing sync down to ->migratepage and as part of that,
ISOLATE_CLEAN will need new smarts.
> That can't possibly be ok,
> because those don't actually require any I/O or blocking to be
> migrated. PageDirty is a "blocking/IO" operation only for filebacked
> pages. So I think we must revert 39deaf8, instead of cleaning it up
> with my cleanup posted in Message-Id 20111115020831.GF4414@redhat.com .
>
It would be preferable if the pages that would block during migration
could be identified in advance but that may be unrealistic. What may be
a better compromise is to only isolate PageDirty pages whose mapping
has a ->migratepage callback, as sketched below.
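i.e. something along these lines in __isolate_lru_page (an untested
sketch; ISOLATE_ASYNC_MIGRATE is a made-up flag name and ret is the
function's default failure return):

	if ((mode & ISOLATE_ASYNC_MIGRATE) && PageDirty(page)) {
		struct address_space *mapping = page_mapping(page);

		/* anon/swapcache pages can be migrated without IO */
		if (mapping && !mapping->a_ops->migratepage)
			return ret;	/* migrating would require writeout */
	}

instead of refusing all PageDirty pages outright via ISOLATE_CLEAN.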
> ISOLATE_CLEAN still looks right for may_writepage: for reclaim, the
> dirty bit set on the page is an I/O event; for migrate it's not if the
> page is tmpfs/anon.
>
> Did you run your compaction tests with some swap activity?
>
Some, but not intensive.
> Reducing the async compaction effectiveness while there's some swap
> activity then also leads to running sync compaction and page reclaim
> more frequently than needed.
>
> I'm hopeful however that by running just 2 passes of the migrate_pages
> main loop with the "avoid overwork in migrate sync mode" patch, we can
> fix the excessive hanging. If that works, the number of passes could
> actually be a tunable, and setting it to 1 (instead of 2) would then
> provide 100% "async compaction" behavior again. And if somebody
> prefers to stick to 10 he can... he could then do trylock on pass 0,
> lock_page on pass 1, wait_writeback on pass 2, wait on the pin on
> pass 3 and finally migrate on pass 4 (something 2 passes alone won't
> allow). So making the migrate passes/force-threshold tunable (maybe
> only for the new sync=2 migration mode) could be a good idea. Or we
> could just return to sync true/false and have the migration tunable
> affect everything, but that would alter the reliability of
> sys_move_pages and other NUMA things too, where I guess 10 passes are
> ok. This is why I added a sync=2 mode for migrate.
I am vaguely concerned that this will just make the stalling harder to
reproduce and diagnose. While you investigate this route, I'm going to
keep investigating using only async migration for THP and having async
compaction move pages it can migrate without blocking.
--
Mel Gorman
SUSE Labs
* Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations
2011-11-16 13:30 ` Andrea Arcangeli
2011-11-16 15:07 ` Mel Gorman
@ 2011-11-18 17:59 ` Andrea Arcangeli
1 sibling, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2011-11-18 17:59 UTC (permalink / raw)
To: David Rientjes
Cc: Mel Gorman, Andrew Morton, Minchan Kim, Jan Kara, Andy Isaacson,
Johannes Weiner, linux-kernel, linux-mm
On Wed, Nov 16, 2011 at 02:30:56PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 16, 2011 at 05:13:50AM +0100, Andrea Arcangeli wrote:
> > After checking my current thp vmstat I think Andrew was right and we
> > backed out for a good reason before. I'm getting significantly worse
> > success rate, not sure why it was a small reduction in success rate
> > but hey, I cannot exclude that I may have broken something with some other
> > patch. I've been running it together with a couple more changes. If
> > it's this change that reduced the success rate, I'm afraid going
> > always async is not ok.
>
> I wonder if the high failure rate when shutting off "sync compaction"
> and forcing only "async compaction" for THP (your patch queued in -mm)
> is also because of ISOLATE_CLEAN being set in compaction from commit
> 39deaf8. ISOLATE_CLEAN skipping PageDirty means all tmpfs/anon pages
I think I tracked down the source of the THP allocation
regression: commits e0887c19b2daa140f20ca8104bdc5740f39dbb86
and e0c23279c9f800c403f37511484d9014ac83adec. They're also wrong
because compaction_suitable doesn't check that there is enough free
memory in the number of "movable" pageblocks.
I'm going to test this; I'm not sure if it helps. But the more free
memory the more likely compaction succeeds, so there's still a risk
we're reducing the compaction success rate.
With the two commits above reverted my compaction success rate returns
to near 100%; with the two commits above applied it goes below 50%. Now
we'll see what happens with the below patch.
===
From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] compaction: correct reverse check for compaction_deferred
Otherwise when compaction is deferred, reclaim stops too, leading to a
high failure rate for high-order allocations.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/vmscan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7f7677..ce745f0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2144,8 +2144,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
* allocations.
*/
if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
- (compaction_suitable(zone, sc->order) ||
- compaction_deferred(zone))) {
+ (compaction_suitable(zone, sc->order) &&
+ !compaction_deferred(zone))) {
should_abort_reclaim = true;
continue;
}