linux-mm.kvack.org archive mirror
* [RFC PATCH 0/3] Change how we determine when to hand out THPs
@ 2013-12-12 18:00 Alex Thorlton
  2013-12-12 20:33 ` Alex Thorlton
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Alex Thorlton @ 2013-12-12 18:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel

This patch changes the way we decide whether or not to give out THPs to
processes when they fault in pages.  The way things are right now,
touching one byte in a 2M chunk where no pages have been faulted in
results in a process being handed a 2M hugepage, which, in some cases,
is undesirable.  The most common issue seems to arise when a process
uses many cores to work on small portions of an allocated chunk of
memory.

Here are some results from a test that I wrote, which allocates memory
in a way that doesn't benefit from the use of THPs:

# echo always > /sys/kernel/mm/transparent_hugepage/enabled
# perf stat -a -r 5 ./thp_pthread -C 0 -m 0 -c 64 -b 128g

 Performance counter stats for './thp_pthread -C 0 -m 0 -c 64 -b 128g' (5 runs):

   61971685.470621 task-clock                #  662.557 CPUs utilized            ( +-  0.68% ) [100.00%]
           200,365 context-switches          #    0.000 M/sec                    ( +-  0.64% ) [100.00%]
                94 CPU-migrations            #    0.000 M/sec                    ( +-  3.76% ) [100.00%]
            61,644 page-faults               #    0.000 M/sec                    ( +-  0.00% )
11,771,748,145,744 cycles                    #    0.190 GHz                      ( +-  0.78% ) [100.00%]
17,958,073,323,609 stalled-cycles-frontend   #  152.55% frontend cycles idle     ( +-  0.97% ) [100.00%]
     <not counted> stalled-cycles-backend
10,691,478,094,935 instructions              #    0.91  insns per cycle
                                             #    1.68  stalled cycles per insn  ( +-  0.66% ) [100.00%]
 1,593,798,555,131 branches                  #   25.718 M/sec                    ( +-  0.62% ) [100.00%]
       102,473,582 branch-misses             #    0.01% of all branches          ( +-  0.43% )

      93.534078104 seconds time elapsed                                          ( +-  0.68% )

# echo never > /sys/kernel/mm/transparent_hugepage/enabled
# perf stat -a -r 5 ./thp_pthread -C 0 -m 0 -c 64 -b 128g

 Performance counter stats for './thp_pthread -C 0 -m 0 -c 64 -b 128g' (5 runs):

   50703784.027438 task-clock                #  663.073 CPUs utilized            ( +-  0.18% ) [100.00%]
           162,324 context-switches          #    0.000 M/sec                    ( +-  0.22% ) [100.00%]
                91 CPU-migrations            #    0.000 M/sec                    ( +-  9.22% ) [100.00%]
        31,250,840 page-faults               #    0.001 M/sec                    ( +-  0.00% )
 7,962,585,261,769 cycles                    #    0.157 GHz                      ( +-  0.21% ) [100.00%]
 9,230,610,615,208 stalled-cycles-frontend   #  115.92% frontend cycles idle     ( +-  0.23% ) [100.00%]
     <not counted> stalled-cycles-backend
16,899,387,283,411 instructions              #    2.12  insns per cycle
                                             #    0.55  stalled cycles per insn  ( +-  0.16% ) [100.00%]
 2,422,269,260,013 branches                  #   47.773 M/sec                    ( +-  0.16% ) [100.00%]
        99,419,683 branch-misses             #    0.00% of all branches          ( +-  0.22% )

      76.467835263 seconds time elapsed                                          ( +-  0.18% )

As you can see there's a significant performance increase when running
this test with THP off.  Here's a pointer to the test, for those who are
interested:

http://oss.sgi.com/projects/memtests/thp_pthread.tar.gz
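
For readers who don't want to pull the tarball, here is a rough sketch of the
access pattern involved (illustrative only, not the actual thp_pthread source;
thread count, sizes, and mmap flags are assumptions): many threads, each
touching one byte per 2M chunk of a large anonymous mapping.

/*
 * Illustrative sketch only -- not the real thp_pthread.  Each of 64
 * threads touches one byte in every 2M chunk of its slice of a 128G
 * anonymous mapping, so with THP set to "always" each touch
 * instantiates a full hugepage.  Build with: gcc -O2 -pthread sketch.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

#define NTHREADS        64UL
#define CHUNK           (2UL * 1024 * 1024)     /* one THP-sized chunk */
#define CHUNKS_PER_THR  1024UL
#define REGION          (NTHREADS * CHUNKS_PER_THR * CHUNK)    /* 128G */

static char *region;

static void *worker(void *arg)
{
    unsigned long id = (unsigned long)arg, i;

    /* Touch one byte per 2M chunk in this thread's slice. */
    for (i = 0; i < CHUNKS_PER_THR; i++)
        region[(id * CHUNKS_PER_THR + i) * CHUNK] = 1;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    unsigned long i;

    region = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}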

My proposed solution to the problem is to allow users to set a
threshold at which THPs will be handed out.  The idea here is that, when
a user faults in a page in an area where they would usually be handed a
THP, we pull 512 pages off the free list, as we would with a regular
THP, but we only fault in single pages from that chunk, until the user
has faulted in enough pages to pass the threshold we've set.  Once they
pass the threshold, we do the necessary work to turn our 512 page chunk
into a proper THP.  As it stands now, if the user tries to fault in
pages from different nodes, we completely give up on ever turning a
particular chunk into a THP, and just fault in the 4K pages as they're
requested.  We may want to make this tunable in the future (i.e. allow
them to fault in from only 2 different nodes).
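
To make the bookkeeping concrete, here is a small userspace model of the
promotion decision described above.  It is illustrative only: the structure,
field names, and the threshold value are not the patch's actual identifiers.

/* Userspace model of the proposed per-chunk bookkeeping; illustrative
 * only, not code from the patch.  The reserved chunk is 512 4K pages. */
#include <stdbool.h>
#include <stdio.h>

struct temp_chunk {
    int faulted;        /* 4K pages faulted in so far          */
    int first_node;     /* node of the first fault, -1 if none */
    bool disqualified;  /* faults came from more than one node */
    bool is_thp;        /* chunk has been promoted to a THP    */
};

/* Called for each 4K fault that lands inside the reserved 512-page chunk. */
static void chunk_fault(struct temp_chunk *c, int node, int threshold)
{
    if (c->first_node == -1)
        c->first_node = node;
    else if (node != c->first_node)
        c->disqualified = true;     /* give up, stay with 4K pages */

    c->faulted++;

    if (!c->disqualified && !c->is_thp && c->faulted >= threshold)
        c->is_thp = true;           /* promote the chunk to a real THP */
}

int main(void)
{
    struct temp_chunk c = { .faulted = 0, .first_node = -1 };
    int i;

    for (i = 0; i < 64; i++)        /* 64 same-node faults, threshold 64 */
        chunk_fault(&c, 0, 64);
    printf("promoted: %s\n", c.is_thp ? "yes" : "no");
    return 0;
}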

This patch is still a work in progress, and it has a few known issues
that I've yet to sort out:

- Bad page state bug resulting from pages being added to the pagevecs
  improperly
    + This bug doesn't seem to hit when allocating small amounts of
      memory on 32 or fewer cores, but it becomes an issue on larger test
      runs.
    + I believe the best way to avoid this is to make sure we don't
      lru_cache_add any of the pages in our chunk until we decide
      whether or not we'll turn the chunk into a THP.  Haven't quite
      gotten this working yet.
- A few small accounting issues with some of the mm counters
- Some spots are still pretty hacky, need to be cleaned up a bit

Just to let people know, I've been doing most of my testing with the
memscale test:

http://oss.sgi.com/projects/memtests/thp_memscale.tar.gz

The pthread test hits the first bug I mentioned here much more often,
but the patch seems to be more stable when tested with memscale.  I
typically run something like this to test:

# ./thp_memscale -C 0 -m 0 -c 32 -b 16m

As you increase the amount of memory/number of cores, you become more
likely to run into issues.

Although there's still work to be done here, I wanted to get an early
version of the patch out so that everyone could give their
opinions/suggestions.  The patch should apply cleanly to the 3.12
kernel.  I'll rebase it as soon as some of the remaining issues have
been sorted out; this will also mean changing over to the split PTL
where appropriate.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Rientjes <rientjes@google.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org

Alex Thorlton (3):
  Add flags for temporary compound pages
  Add tunable to control THP behavior
  Change THP behavior

 include/linux/gfp.h      |   5 +
 include/linux/huge_mm.h  |   8 ++
 include/linux/mm_types.h |  14 +++
 kernel/fork.c            |   1 +
 mm/huge_memory.c         | 313 +++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h            |   1 +
 mm/memory.c              |  29 ++++-
 mm/page_alloc.c          |  66 +++++++++-
 8 files changed, 430 insertions(+), 7 deletions(-)

-- 
1.7.12.4


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-12 18:00 [RFC PATCH 0/3] Change how we determine when to hand out THPs Alex Thorlton
@ 2013-12-12 20:33 ` Alex Thorlton
  2013-12-14  5:44 ` Andrew Morton
  2013-12-19 14:55 ` Mel Gorman
  2 siblings, 0 replies; 17+ messages in thread
From: Alex Thorlton @ 2013-12-12 20:33 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel

Ugggh.  Looks like mutt clobbered the message ID on the cover letter,
screwing up the reply chain.  Sorry about that :/

- Alex


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-12 18:00 [RFC PATCH 0/3] Change how we determine when to hand out THPs Alex Thorlton
  2013-12-12 20:33 ` Alex Thorlton
@ 2013-12-14  5:44 ` Andrew Morton
  2013-12-16 17:12   ` Alex Thorlton
  2013-12-19 14:55 ` Mel Gorman
  2 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2013-12-14  5:44 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-mm, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel, Andrea Arcangeli

On Thu, 12 Dec 2013 12:00:37 -0600 Alex Thorlton <athorlton@sgi.com> wrote:

> This patch changes the way we decide whether or not to give out THPs to
> processes when they fault in pages.

Please cc Andrea on this.

>  The way things are right now,
> touching one byte in a 2M chunk where no pages have been faulted in
> results in a process being handed a 2M hugepage, which, in some cases,
> is undesirable.  The most common issue seems to arise when a process
> uses many cores to work on small portions of an allocated chunk of
> memory.
> 
> Here are some results from a test that I wrote, which allocates memory
> in a way that doesn't benefit from the use of THPs:
> 
> # echo always > /sys/kernel/mm/transparent_hugepage/enabled
> # perf stat -a -r 5 ./thp_pthread -C 0 -m 0 -c 64 -b 128g
> 
>  Performance counter stats for './thp_pthread -C 0 -m 0 -c 64 -b 128g' (5 runs):
> 
>       93.534078104 seconds time elapsed
> ...
>
> 
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> # perf stat -a -r 5 ./thp_pthread -C 0 -m 0 -c 64 -b 128g
> 
>  Performance counter stats for './thp_pthread -C 0 -m 0 -c 64 -b 128g' (5 runs):
>
> ...
>       76.467835263 seconds time elapsed
> ...
> 
> As you can see there's a significant performance increase when running
> this test with THP off.

yup.

> My proposed solution to the problem is to allow users to set a
> threshold at which THPs will be handed out.  The idea here is that, when
> a user faults in a page in an area where they would usually be handed a
> THP, we pull 512 pages off the free list, as we would with a regular
> THP, but we only fault in single pages from that chunk, until the user
> has faulted in enough pages to pass the threshold we've set.  Once they
> pass the threshold, we do the necessary work to turn our 512 page chunk
> into a proper THP.  As it stands now, if the user tries to fault in
> pages from different nodes, we completely give up on ever turning a
> particular chunk into a THP, and just fault in the 4K pages as they're
> requested.  We may want to make this tunable in the future (i.e. allow
> them to fault in from only 2 different nodes).

OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
disabled those 512 pages would have resided closer to the CPUs which
instantiated them.  So the expected result will be somewhere in between
the 93 secs and the 76 secs?

That being said, I don't see a downside to the idea, apart from some
additional setup cost in kernel code.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-14  5:44 ` Andrew Morton
@ 2013-12-16 17:12   ` Alex Thorlton
  2013-12-16 17:51     ` Andrea Arcangeli
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Alex Thorlton @ 2013-12-16 17:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Kirill A. Shutemov, Benjamin Herrenschmidt,
	Rik van Riel, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel, Andrea Arcangeli

> Please cc Andrea on this.

I'm going to clean up a few small things for a v2 pretty soon, I'll be
sure to cc Andrea there.

> > My proposed solution to the problem is to allow users to set a
> > threshold at which THPs will be handed out.  The idea here is that, when
> > a user faults in a page in an area where they would usually be handed a
> > THP, we pull 512 pages off the free list, as we would with a regular
> > THP, but we only fault in single pages from that chunk, until the user
> > has faulted in enough pages to pass the threshold we've set.  Once they
> > pass the threshold, we do the necessary work to turn our 512 page chunk
> > into a proper THP.  As it stands now, if the user tries to fault in
> > pages from different nodes, we completely give up on ever turning a
> > particular chunk into a THP, and just fault in the 4K pages as they're
> > requested.  We may want to make this tunable in the future (i.e. allow
> > them to fault in from only 2 different nodes).
> 
> OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
> disabled those 512 pages would have resided closer to the CPUs which
> instantiated them.  

As it stands right now, yes, since we're pulling a 512 page contiguous
chunk off the free list, everything from that chunk will reside on the
same node, but as I (stupidly) forgot to mention in my original e-mail,
one piece I have yet to add is the functionality to put the remaining
unfaulted pages from our chunk *back* on the free list after we give up
on handing out a THP.  Once this is in there, things will behave more
like they do when THP is turned completely off, i.e. pages will get
faulted in closer to the CPU that first referenced them once we give up
on handing out the THP.

> So the expected result will be somewhere in between
> the 93 secs and the 76 secs?

Yes.  Due to the time it takes to search for the temporary THP, I'm sure
we won't get down to 76 secs, but hopefully we'll get close.  I'm also
considering switching the linked list that stores the temporary THPs
over to an rbtree to make that search faster, just fyi.

> That being said, I don't see a downside to the idea, apart from some
> additional setup cost in kernel code.

Good to hear.  I still need to address some of the issues that others
have raised, and finish up the few pieces that aren't fully
working/finished.  I'll get things polished up and get some more
informative test results out soon.

Thanks for looking at the patch!

- Alex


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-16 17:12   ` Alex Thorlton
@ 2013-12-16 17:51     ` Andrea Arcangeli
  2013-12-17 16:20       ` Alex Thorlton
  2013-12-17  1:43     ` Andy Lutomirski
  2013-12-19 15:29     ` Mel Gorman
  2 siblings, 1 reply; 17+ messages in thread
From: Andrea Arcangeli @ 2013-12-16 17:51 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Andrew Morton, linux-mm, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes,
	Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko,
	Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

On Mon, Dec 16, 2013 at 11:12:15AM -0600, Alex Thorlton wrote:
> As it stands right now, yes, since we're pulling a 512 page contiguous
> chunk off the free list, everything from that chunk will reside on the
> same node, but as I (stupidly) forgot to mention in my original e-mail,
> one piece I have yet to add is the functionality to put the remaining
> unfaulted pages from our chunk *back* on the free list after we give up
> on handing out a THP.  Once this is in there, things will behave more
> like they do when THP is turned completely off, i.e. pages will get
> faulted in closer to the CPU that first referenced them once we give up
> on handing out the THP.

The only problem is the additional complexity and the slowdown to the
common cases that benefit from THP immediately.

> Yes.  Due to the time it takes to search for the temporary THP, I'm sure
> we won't get down to 76 secs, but hopefully we'll get close.  I'm also

Did you consider using MADV_NOHUGEPAGE?

Clearly this will disable it not just on NUMA, but the problem is
pretty much the same NUMA vs non-NUMA. You may want to verify whether
your app runs faster on non-NUMA too, with MADV_NOHUGEPAGE.

If every thread only touches 1 subpage of every hugepage mapped, the
number of TLB misses will be lower with many 4k d-TLB entries than with
fewer 2M d-TLB entries. The only benefit that remains from THP in such a case
is that the TLB miss is faster with THP and that's not always enough
to offset the cost of the increased number of TLB misses.

But MADV_NOHUGEPAGE is made specifically to tune for those uncommon
cases and if you know what the app is doing and you know the faster
TLB miss is a win despite the fewer 2M TLB entries on non-NUMA, you
could do:

      if (numnodes() > 1)
      	 madvise(MADV_NOHUGEPAGE, ...);
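
Fleshed out, that pseudocode might look like the sketch below in an
application.  numnodes() above is pseudocode, so libnuma's
numa_num_configured_nodes() stands in for it here; the function and region
size are otherwise just illustrative.

/* Expanded sketch of the madvise-based opt-out on multi-node systems.
 * Build with: gcc example.c -lnuma */
#define _GNU_SOURCE
#include <numa.h>
#include <stdio.h>
#include <sys/mman.h>

static void *alloc_region(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* On multi-node machines, ask the kernel not to back this region
     * with hugepages; 4K pages then fault in on the node of the CPU
     * that first touches them. */
    if (numa_available() != -1 && numa_num_configured_nodes() > 1)
        madvise(p, len, MADV_NOHUGEPAGE);
    return p;
}

int main(void)
{
    return alloc_region(1UL << 30) ? 0 : 1;     /* 1G example region */
}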

> considering switching the linked list that stores the temporary THPs
> over to an rbtree to make that search faster, just fyi.

Problem is that it'll be still slower than no change.

I'm certainly not against trying to make the kernel smarter to
optimize for uncommon workloads, but if the default behavior shall
not change and this is a tweak the admin should tune manually, I'd
like an explanation of why the unprivileged MADV_NOHUGEPAGE madvise
is a worse solution for this than a privileged tweak in sysfs that the
root user may also forget if not careful.

The problem in all this is that this is a tradeoff, and depending on
the app anything in between the settings "never" and "always" could be
optimal.

The idea was just to map THP whenever possible by default to keep the
kernel simpler and to gain the maximum performance from the faster TLB
miss immediately (and hopefully offset those cases where the number of
TLB misses doesn't decrease with THP enabled, like probably your app,
and at the same time avoiding the need later for a THP collapsing
event that requires TLB flushes and will add additional costs). For
the extreme case that just wants THP off, I thought MADV_NOHUGEPAGE
would be a fine solution.

I doubt we can change the default behavior; at the very least it would
require lots of benchmarking, and the only benchmarking I've seen here
is for the corner-case app, which may actually run faster with
MADV_NOHUGEPAGE than with an intermediate threshold.

If instead we need an intermediate threshold less aggressive than
MADV_NOHUGEPAGE, I think the tweaks should be per-process and not
privileged, like MADV_NOHUGEPAGE, because by definition of the
tradeoff every app could have its own preferred threshold. And while
your app wants THP off, the majority still wants it on without
intermediate steps. So with a system-wide setting you can't make
everyone happy, unless you're like #osv in a VM running a single app.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-16 17:12   ` Alex Thorlton
  2013-12-16 17:51     ` Andrea Arcangeli
@ 2013-12-17  1:43     ` Andy Lutomirski
  2013-12-17 16:04       ` Alex Thorlton
  2013-12-19 15:29     ` Mel Gorman
  2 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2013-12-17  1:43 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Andrew Morton, linux-mm@kvack.org, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel@vger.kernel.org, Andrea Arcangeli

On Mon, Dec 16, 2013 at 9:12 AM, Alex Thorlton <athorlton@sgi.com> wrote:
>> Please cc Andrea on this.
>
> I'm going to clean up a few small things for a v2 pretty soon, I'll be
> sure to cc Andrea there.
>
>> > My proposed solution to the problem is to allow users to set a
>> > threshold at which THPs will be handed out.  The idea here is that, when
>> > a user faults in a page in an area where they would usually be handed a
>> > THP, we pull 512 pages off the free list, as we would with a regular
>> > THP, but we only fault in single pages from that chunk, until the user
>> > has faulted in enough pages to pass the threshold we've set.  Once they
>> > pass the threshold, we do the necessary work to turn our 512 page chunk
>> > into a proper THP.  As it stands now, if the user tries to fault in
>> > pages from different nodes, we completely give up on ever turning a
>> > particular chunk into a THP, and just fault in the 4K pages as they're
>> > requested.  We may want to make this tunable in the future (i.e. allow
>> > them to fault in from only 2 different nodes).
>>
>> OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
>> disabled those 512 pages would have resided closer to the CPUs which
>> instantiated them.
>
> As it stands right now, yes, since we're pulling a 512 page contiguous
> chunk off the free list, everything from that chunk will reside on the
> same node, but as I (stupidly) forgot to mention in my original e-mail,
> one piece I have yet to add is the functionality to put the remaining
> unfaulted pages from our chunk *back* on the free list after we give up
> on handing out a THP.  Once this is in there, things will behave more
> like they do when THP is turned completely off, i.e. pages will get
> faulted in closer to the CPU that first referenced them once we give up
> on handing out the THP.

This sounds like it's almost the worst possible behavior wrt avoiding
memory fragmentation.  If userspace mmaps a very large region and then
starts accessing it randomly, it will allocate a bunch of contiguous
512-page regions, claim one page from each, and return the other 511
pages to the free list.  Memory is now maximally fragmented from the
point of view of future THP allocations.

--Andy


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-17  1:43     ` Andy Lutomirski
@ 2013-12-17 16:04       ` Alex Thorlton
  2013-12-17 16:54         ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Thorlton @ 2013-12-17 16:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm@kvack.org, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel@vger.kernel.org, Andrea Arcangeli

On Mon, Dec 16, 2013 at 05:43:40PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 16, 2013 at 9:12 AM, Alex Thorlton <athorlton@sgi.com> wrote:
> >> Please cc Andrea on this.
> >
> > I'm going to clean up a few small things for a v2 pretty soon, I'll be
> > sure to cc Andrea there.
> >
> >> > My proposed solution to the problem is to allow users to set a
> >> > threshold at which THPs will be handed out.  The idea here is that, when
> >> > a user faults in a page in an area where they would usually be handed a
> >> > THP, we pull 512 pages off the free list, as we would with a regular
> >> > THP, but we only fault in single pages from that chunk, until the user
> >> > has faulted in enough pages to pass the threshold we've set.  Once they
> >> > pass the threshold, we do the necessary work to turn our 512 page chunk
> >> > into a proper THP.  As it stands now, if the user tries to fault in
> >> > pages from different nodes, we completely give up on ever turning a
> >> > particular chunk into a THP, and just fault in the 4K pages as they're
> >> > requested.  We may want to make this tunable in the future (i.e. allow
> >> > them to fault in from only 2 different nodes).
> >>
> >> OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
> >> disabled those 512 pages would have resided closer to the CPUs which
> >> instantiated them.
> >
> > As it stands right now, yes, since we're pulling a 512 page contiguous
> > chunk off the free list, everything from that chunk will reside on the
> > same node, but as I (stupidly) forgot to mention in my original e-mail,
> > one piece I have yet to add is the functionality to put the remaining
> > unfaulted pages from our chunk *back* on the free list after we give up
> > on handing out a THP.  Once this is in there, things will behave more
> > like they do when THP is turned completely off, i.e. pages will get
> > faulted in closer to the CPU that first referenced them once we give up
> > on handing out the THP.
> 
> This sounds like it's almost the worst possible behavior wrt avoiding
> memory fragmentation.  If userspace mmaps a very large region and then
> starts accessing it randomly, it will allocate a bunch of contiguous
> 512-page regions, claim one page from each, and return the other 511
> pages to the free list.  Memory is now maximally fragmented from the
> point of view of future THP allocations.

Maybe I'm missing the point here to some degree, but the way I think
about this is that if we trigger the behavior to return the pages to the
free list, we don't *want* future THP allocations in that range of
memory for the current process anyways.  So, having the memory be
fragmented from the point of view of future THP allocations isn't an
issue.  

- Alex


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-16 17:51     ` Andrea Arcangeli
@ 2013-12-17 16:20       ` Alex Thorlton
  2013-12-17 17:55         ` Andrea Arcangeli
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Thorlton @ 2013-12-17 16:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes,
	Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko,
	Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

On Mon, Dec 16, 2013 at 06:51:11PM +0100, Andrea Arcangeli wrote:
> On Mon, Dec 16, 2013 at 11:12:15AM -0600, Alex Thorlton wrote:
> > As it stands right now, yes, since we're pulling a 512 page contiguous
> > chunk off the free list, everything from that chunk will reside on the
> > same node, but as I (stupidly) forgot to mention in my original e-mail,
> > one piece I have yet to add is the functionality to put the remaining
> > unfaulted pages from our chunk *back* on the free list after we give up
> > on handing out a THP.  Once this is in there, things will behave more
> > like they do when THP is turned completely off, i.e. pages will get
> > faulted in closer to the CPU that first referenced them once we give up
> > on handing out the THP.
> 
> The only problem is the additional complexity and the slowdown to the
> common cases that benefit from THP immediately.

Agreed.  I understand that jobs which perform well with the current
behavior will be adversely affected by the change.  My thinking here was
that we could at least reach a middle ground between jobs that don't
like THP, and jobs that perform well with it, allowing them both to run
without users having to use a malloc hook or something similar to
achieve that goal, especially considering some cases where malloc hooks
for madvise aren't an option.

> > Yes.  Due to the time it takes to search for the temporary THP, I'm sure
> > we won't get down to 76 secs, but hopefully we'll get close.  I'm also
> 
> Did you consider using MADV_NOHUGEPAGE?
> 
> Clearly this will disable it not just on NUMA, but NUMA vs non-NUMA
> the problem is pretty much the same. You may want to verify if your
> runs faster on non-NUMA too, with MADV_NOHUGEPAGE.
> 
> If every thread only touches 1 subpage of every hugepage mapped, the
> number of TLB misses will be lower with many 4k d-TLB than with fewer
> 2M d-TLB entries. The only benefit that remains from THP in such case
> is that the TLB miss is faster with THP and that's not always enough
> to offset the cost of the increased number of TLB misses.
> 
> But MADV_NOHUGEPAGE is made specifically to tune for those non common
> cases and if you know what the app is doing and you know the faster
> TLB miss is a win despite the fewer 2M TLB entries on non-NUMA, you
> could do:
> 
>       if (numnodes() > 1)
>       	 madvise(MADV_NOHUGEPAGE, ...);

Please see this thread where I discussed some of the problems we ran
into with madvise, also this touches on the per-process fix that you
mention below:

https://lkml.org/lkml/2013/8/2/671

This message in particular:

https://lkml.org/lkml/2013/8/2/697

> 
> > considering switching the linked list that stores the temporary THPs
> > over to an rbtree to make that search faster, just fyi.
> 
> Problem is that it'll be still slower than no change.
> 
> I'm certainly not against trying to make the kernel smarter to
> optimize for non common workloads, but if the default behavior shall
> not change and this is a tweak the admin should tune manually, I'd
> like an explanation of why the non privileged MADV_NOHUGEPAGE madvise
> is worse solution for this than a privileged tweak in sysfs that the
> root user may also forget if not careful.
> 
> The problem in all this, is that this is a tradeoff and depending on
> the app anything in between the settings "never" and "always" could be
> optimal.
> 
> The idea was just to map THP whenever possible by default to keep the
> kernel simpler and to gain the maximum performance from the faster TLB
> miss immediately (and hopefully offset those cases were the number of
> TLB misses doesn't decrease with THP enabled, like probably your app,
> and at the same time avoiding the need later for a THP collapsing
> event that requires TLB flushes and will add additional costs). For
> what is extreme and just wants THP off, I thought MADV_NOHUGEPAGE
> would be fine solution.
>
> I doubt we can change the default behavior, at the very least it would
> require lots of benchmarking and the only benchmarking I've seen here
> is for the corner case app which may actually run the fastest with
> MADV_NOHUGEPAGE than with a intermediate threshold.
> 
> If we instead we need an intermediate threshold less aggressive than
> MADV_NOHUGEPAGE, I think the tweaks should be per-process and not
> privileged, like MADV_NOHUGEPAGE. Because by definition of the
> tradeoff, every app could have its own preferred threshold. And like
> your app wants THP off, the majority still wants it on without
> intermediate steps. So with a system wide setting you can't make
> everyone happy unless you're like #osv in a VM running a single app.

The thread I mention above originally proposed a per-process switch to
disable THP without the use of madvise, but it was not very well 
received.  I'm more than willing to revisit that idea, and possibly
meld the two (a per-process threshold, instead of a big-hammer on-off
switch).  Let me know if that seems preferable to this idea and we can
discuss.

Thanks for the input!

- Alex


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-17 16:04       ` Alex Thorlton
@ 2013-12-17 16:54         ` Andy Lutomirski
  2013-12-17 17:47           ` Alex Thorlton
  0 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2013-12-17 16:54 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Andrew Morton, linux-mm@kvack.org, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel@vger.kernel.org, Andrea Arcangeli

On Tue, Dec 17, 2013 at 8:04 AM, Alex Thorlton <athorlton@sgi.com> wrote:
> On Mon, Dec 16, 2013 at 05:43:40PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 16, 2013 at 9:12 AM, Alex Thorlton <athorlton@sgi.com> wrote:
>> >> Please cc Andrea on this.
>> >
>> > I'm going to clean up a few small things for a v2 pretty soon, I'll be
>> > sure to cc Andrea there.
>> >
>> >> > My proposed solution to the problem is to allow users to set a
>> >> > threshold at which THPs will be handed out.  The idea here is that, when
>> >> > a user faults in a page in an area where they would usually be handed a
>> >> > THP, we pull 512 pages off the free list, as we would with a regular
>> >> > THP, but we only fault in single pages from that chunk, until the user
>> >> > has faulted in enough pages to pass the threshold we've set.  Once they
>> >> > pass the threshold, we do the necessary work to turn our 512 page chunk
>> >> > into a proper THP.  As it stands now, if the user tries to fault in
>> >> > pages from different nodes, we completely give up on ever turning a
>> >> > particular chunk into a THP, and just fault in the 4K pages as they're
>> >> > requested.  We may want to make this tunable in the future (i.e. allow
>> >> > them to fault in from only 2 different nodes).
>> >>
>> >> OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
>> >> disabled those 512 pages would have resided closer to the CPUs which
>> >> instantiated them.
>> >
>> > As it stands right now, yes, since we're pulling a 512 page contiguous
>> > chunk off the free list, everything from that chunk will reside on the
>> > same node, but as I (stupidly) forgot to mention in my original e-mail,
>> > one piece I have yet to add is the functionality to put the remaining
>> > unfaulted pages from our chunk *back* on the free list after we give up
>> > on handing out a THP.  Once this is in there, things will behave more
>> > like they do when THP is turned completely off, i.e. pages will get
>> > faulted in closer to the CPU that first referenced them once we give up
>> > on handing out the THP.
>>
>> This sounds like it's almost the worst possible behavior wrt avoiding
>> memory fragmentation.  If userspace mmaps a very large region and then
>> starts accessing it randomly, it will allocate a bunch of contiguous
>> 512-page regions, claim one page from each, and return the other 511
>> pages to the free list.  Memory is now maximally fragmented from the
>> point of view of future THP allocations.
>
> Maybe I'm missing the point here to some degree, but the way I think
> about this is that if we trigger the behavior to return the pages to the
> free list, we don't *want* future THP allocations in that range of
> memory for the current process anyways.  So, having the memory be
> fragmented from the point of view of future THP allocations isn't an
> issue.
>

Except that you're causing a problem for the whole system because one
process is triggering the "hugepages aren't helpful" heuristic.

--Andy

> - Alex



-- 
Andy Lutomirski
AMA Capital Management, LLC


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-17 16:54         ` Andy Lutomirski
@ 2013-12-17 17:47           ` Alex Thorlton
  2013-12-17 22:25             ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Thorlton @ 2013-12-17 17:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, linux-mm@kvack.org, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel@vger.kernel.org, Andrea Arcangeli

On Tue, Dec 17, 2013 at 08:54:10AM -0800, Andy Lutomirski wrote:
> On Tue, Dec 17, 2013 at 8:04 AM, Alex Thorlton <athorlton@sgi.com> wrote:
> > On Mon, Dec 16, 2013 at 05:43:40PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 16, 2013 at 9:12 AM, Alex Thorlton <athorlton@sgi.com> wrote:
> >> >> Please cc Andrea on this.
> >> >
> >> > I'm going to clean up a few small things for a v2 pretty soon, I'll be
> >> > sure to cc Andrea there.
> >> >
> >> >> > My proposed solution to the problem is to allow users to set a
> >> >> > threshold at which THPs will be handed out.  The idea here is that, when
> >> >> > a user faults in a page in an area where they would usually be handed a
> >> >> > THP, we pull 512 pages off the free list, as we would with a regular
> >> >> > THP, but we only fault in single pages from that chunk, until the user
> >> >> > has faulted in enough pages to pass the threshold we've set.  Once they
> >> >> > pass the threshold, we do the necessary work to turn our 512 page chunk
> >> >> > into a proper THP.  As it stands now, if the user tries to fault in
> >> >> > pages from different nodes, we completely give up on ever turning a
> >> >> > particular chunk into a THP, and just fault in the 4K pages as they're
> >> >> > requested.  We may want to make this tunable in the future (i.e. allow
> >> >> > them to fault in from only 2 different nodes).
> >> >>
> >> >> OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
> >> >> disabled those 512 pages would have resided closer to the CPUs which
> >> >> instantiated them.
> >> >
> >> > As it stands right now, yes, since we're pulling a 512 page contiguous
> >> > chunk off the free list, everything from that chunk will reside on the
> >> > same node, but as I (stupidly) forgot to mention in my original e-mail,
> >> > one piece I have yet to add is the functionality to put the remaining
> >> > unfaulted pages from our chunk *back* on the free list after we give up
> >> > on handing out a THP.  Once this is in there, things will behave more
> >> > like they do when THP is turned completely off, i.e. pages will get
> >> > faulted in closer to the CPU that first referenced them once we give up
> >> > on handing out the THP.
> >>
> >> This sounds like it's almost the worst possible behavior wrt avoiding
> >> memory fragmentation.  If userspace mmaps a very large region and then
> >> starts accessing it randomly, it will allocate a bunch of contiguous
> >> 512-page regions, claim one page from each, and return the other 511
> >> pages to the free list.  Memory is now maximally fragmented from the
> >> point of view of future THP allocations.
> >
> > Maybe I'm missing the point here to some degree, but the way I think
> > about this is that if we trigger the behavior to return the pages to the
> > free list, we don't *want* future THP allocations in that range of
> > memory for the current process anyways.  So, having the memory be
> > fragmented from the point of view of future THP allocations isn't an
> > issue.
> >
> 
> Except that you're causing a problem for the whole system because one
> process is triggering the "hugepages aren't helpful" heuristic.

I do see where you're coming from here.  Do you have any good tests
that can cause this type of memory fragmentation that I might be able to
take a look at, to see how we might combat that issue in this case?
It seems like something that could occur anyways, but my patch would
create a situation where it could become a problem much more quickly.

Also, just a side note, I see this being more of a problem on a smaller
system, where swap is enabled.  However, on larger systems where swap is
turned off, I think that this scenario might be a bit tougher to hit.  I
understand that we don't want to hurt the average small system in favor
of large ones, but that's why we leave it as a tunable and leave it up
to the system administrator to decide whether or not this is appropriate
to enable.

- Alex


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-17 16:20       ` Alex Thorlton
@ 2013-12-17 17:55         ` Andrea Arcangeli
  2013-12-18 17:15           ` Rik van Riel
  2013-12-25 19:07           ` Alex Thorlton
  0 siblings, 2 replies; 17+ messages in thread
From: Andrea Arcangeli @ 2013-12-17 17:55 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Andrew Morton, linux-mm, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes,
	Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko,
	Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

On Tue, Dec 17, 2013 at 10:20:07AM -0600, Alex Thorlton wrote:
> This message in particular:
> 
> https://lkml.org/lkml/2013/8/2/697

I think adding a prctl (or similar) inherited by child to turn off THP
would be a fine addition to the current madvise. So you can then run
any static app under a wrapper like "THP_disable ./whatever"

The idea is, if the software is maintained, madvise allows for
fine-grained optimization; if the software is legacy, proprietary, or
statically linked (or if it already uses LD_PRELOAD for other things),
prctl takes care of that in a coarser way (but still per-app).
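
A wrapper along those lines could be as small as the sketch below.  It assumes
the kind of per-process prctl being discussed here; the PR_SET_THP_DISABLE flag
that was eventually merged (Linux 3.15) behaves this way, with the setting
inherited across fork() and execve().

/* Sketch of a "THP_disable ./whatever" wrapper. */
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41   /* older headers may not define it */
#endif

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_THP_DISABLE)");
        return 1;
    }
    execvp(argv[1], &argv[1]);  /* THP stays disabled in the new image */
    perror("execvp");
    return 1;
}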

> The thread I mention above originally proposed a per-process switch to
> disable THP without the use of madvise, but it was not very well 
> received.  I'm more than willing to revisit that idea, and possibly

I think you provided enough explanation of why it is needed (static
binaries, proprietary apps, the annoyance of an LD_PRELOAD that may
collide with other LD_PRELOADs in proprietary apps, and so on), so I
think a prctl is a reasonable addition to the madvise.

We also have a madvise to turn on THP selectively on embedded systems
that may boot with enabled=madvise to be sure not to waste any memory
because of THP. But the prctl to selectively enable doesn't make too
much sense, as one has to selectively enable it in a fine-grained way to
be sure not to cause any memory waste. So I think a NOHUGEPAGE prctl
would be enough.

> meld the two (a per-process threshold, instead of a big-hammer on-off
> swtich).  Let me know if that seems preferable to this idea and we can
> discuss.

The per-process threshold would be a much bigger patch; I think starting
with the big-hammer on-off is preferable as it is much simpler and it
should be more than enough to take care of the rare corner cases,
while leaving the other workloads unaffected (modulo the cacheline to
check the task or mm flags) running at max speed.

To evaluate the threshold solution, a variety of benchmarks of a
multitude of apps would be necessary first, to see the effect it has
on the non-corner cases. Adding the big-hammer on-off prctl instead is
a black and white design solution that won't require black magic
settings.

Ideally, if we add a threshold later it won't require any more
cacheline accesses, as the threshold would also need to be per-task or
per-mm; the runtime cost of the prctl would then be zero, and it
could become a benchmarking tweak even if we add the per-app
threshold later.

About creating heuristics to automatically detect the ideal value of
the big-hammer per-app on/off switch (or even harder the ideal value
of the per-app threshold), I think it's not going to happen because
there are too few corner cases and it wouldn't be worth the cost of it
(the cost would be significant no matter how implemented).

Every time we try to make THP smarter at auto-disabling itself for the
corner cases, we're slowing it down for everyone that gets a benefit
from it, and there's no way around it. This is why I think the
big-hammer prctl for the few corner cases is the best way to go.

Thanks!
Andrea


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-17 17:47           ` Alex Thorlton
@ 2013-12-17 22:25             ` Andy Lutomirski
  0 siblings, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2013-12-17 22:25 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Andrew Morton, linux-mm@kvack.org, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel@vger.kernel.org, Andrea Arcangeli

On Tue, Dec 17, 2013 at 9:47 AM, Alex Thorlton <athorlton@sgi.com> wrote:
> On Tue, Dec 17, 2013 at 08:54:10AM -0800, Andy Lutomirski wrote:
>> On Tue, Dec 17, 2013 at 8:04 AM, Alex Thorlton <athorlton@sgi.com> wrote:
>> > On Mon, Dec 16, 2013 at 05:43:40PM -0800, Andy Lutomirski wrote:
>> >> On Mon, Dec 16, 2013 at 9:12 AM, Alex Thorlton <athorlton@sgi.com> wrote:
>> >> >> Please cc Andrea on this.
>> >> >
>> >> > I'm going to clean up a few small things for a v2 pretty soon, I'll be
>> >> > sure to cc Andrea there.
>> >> >
>> >> >> > My proposed solution to the problem is to allow users to set a
>> >> >> > threshold at which THPs will be handed out.  The idea here is that, when
>> >> >> > a user faults in a page in an area where they would usually be handed a
>> >> >> > THP, we pull 512 pages off the free list, as we would with a regular
>> >> >> > THP, but we only fault in single pages from that chunk, until the user
>> >> >> > has faulted in enough pages to pass the threshold we've set.  Once they
>> >> >> > pass the threshold, we do the necessary work to turn our 512 page chunk
>> >> >> > into a proper THP.  As it stands now, if the user tries to fault in
>> >> >> > pages from different nodes, we completely give up on ever turning a
>> >> >> > particular chunk into a THP, and just fault in the 4K pages as they're
>> >> >> > requested.  We may want to make this tunable in the future (i.e. allow
>> >> >> > them to fault in from only 2 different nodes).
>> >> >>
>> >> >> OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
>> >> >> disabled those 512 pages would have resided closer to the CPUs which
>> >> >> instantiated them.
>> >> >
>> >> > As it stands right now, yes, since we're pulling a 512 page contiguous
>> >> > chunk off the free list, everything from that chunk will reside on the
>> >> > same node, but as I (stupidly) forgot to mention in my original e-mail,
>> >> > one piece I have yet to add is the functionality to put the remaining
>> >> > unfaulted pages from our chunk *back* on the free list after we give up
>> >> > on handing out a THP.  Once this is in there, things will behave more
>> >> > like they do when THP is turned completely off, i.e. pages will get
>> >> > faulted in closer to the CPU that first referenced them once we give up
>> >> > on handing out the THP.
>> >>
>> >> This sounds like it's almost the worst possible behavior wrt avoiding
>> >> memory fragmentation.  If userspace mmaps a very large region and then
>> >> starts accessing it randomly, it will allocate a bunch of contiguous
>> >> 512-page regions, claim one page from each, and return the other 511
>> >> pages to the free list.  Memory is now maximally fragmented from the
>> >> point of view of future THP allocations.
>> >
>> > Maybe I'm missing the point here to some degree, but the way I think
>> > about this is that if we trigger the behavior to return the pages to the
>> > free list, we don't *want* future THP allocations in that range of
>> > memory for the current process anyways.  So, having the memory be
>> > fragmented from the point of view of future THP allocations isn't an
>> > issue.
>> >
>>
>> Except that you're causing a problem for the whole system because one
>> process is triggering the "hugepages aren't helpful" heuristic.
>
> I do see where you're coming from here.  Do you have any good tests
> that can cause this type of memory fragmentation that I might be able to
> take a look at, to see how we might combat that issue in this case?
> It seems like something that could occur anyways, but my patch would
> create a situation where it could become a problem much more quickly.

mmap lots of space (comparable to total system memory).  Touch every
512th page.  (This will consume ~0.2% of memory with your patches.)

Now run any workload that benefits from THP (without unmapping the
first thing).  Make sure it still works well.
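
A minimal sketch of that scenario (the size below is only an example; scale it
to the machine under test):

/* Map a large region and touch one 4K page per 2M chunk (every 512th
 * page), then keep the sparse mapping alive while a THP-friendly
 * workload runs alongside. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 32UL << 30;    /* example size: 32G */
    size_t step = 512 * 4096;   /* one touch per 512 pages (2M) */
    size_t off;
    char *p;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    for (off = 0; off < len; off += step)
        p[off] = 1;

    pause();    /* keep the sparse mapping around */
    return 0;
}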

--Andy


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-17 17:55         ` Andrea Arcangeli
@ 2013-12-18 17:15           ` Rik van Riel
  2013-12-25 19:07           ` Alex Thorlton
  1 sibling, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2013-12-18 17:15 UTC (permalink / raw)
  To: Andrea Arcangeli, Alex Thorlton
  Cc: Andrew Morton, linux-mm, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Wanpeng Li, Mel Gorman, Michel Lespinasse,
	Benjamin LaHaise, Oleg Nesterov, Eric W. Biederman,
	Andy Lutomirski, Al Viro, David Rientjes, Zhang Yanfei,
	Peter Zijlstra, Johannes Weiner, Michal Hocko, Jiang Liu,
	Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki, Naoya Horiguchi,
	linux-kernel

On 12/17/2013 12:55 PM, Andrea Arcangeli wrote:

> About creating heuristics to automatically detect the ideal value of
> the big-hammer per-app on/off switch (or even harder the ideal value
> of the per-app threshold), I think it's not going to happen because
> there are too few corner cases and it wouldn't be worth the cost of it
> (the cost would be significant no matter how implemented).
>
> Every time we try to make THP smarter at auto-disabling itself for the
> corner cases, we're slowing it down for everyone that gets a benefit
> from it, and there's no way around it. This is why I think the
> big-hammer prctl for the few corner cases is the best way to go.

There is one thing we could do in a slow path that
would result in automatic disabling of THP under the
corner case of there not being enough memory in the
system.

We can teach the swapout code to discard zero-filled
pages, instead of swapping them out to disk.

That way we will "deflate" some of the excess memory
consumed by THP, and reduce the extra swap IO that
could be caused by THP using more memory.

Not sure who is interested in this particular corner
case, but it may be an interesting one to solve :)
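
For illustration, the per-page check such a swapout change would need is just
"is this page all zeroes?"; in the kernel, memchr_inv() already does that job.
A plain-C sketch of the idea:

/* A page that passes this check could be discarded at swapout time and
 * refaulted as a zero page later, instead of being written to disk. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

static bool page_is_zero_filled(const unsigned char *page)
{
    size_t i;

    for (i = 0; i < PAGE_SIZE; i++)
        if (page[i] != 0)
            return false;
    return true;
}

int main(void)
{
    unsigned char *page = calloc(1, PAGE_SIZE);

    printf("zero-filled: %d\n", page_is_zero_filled(page));    /* 1 */
    page[123] = 0xff;
    printf("zero-filled: %d\n", page_is_zero_filled(page));    /* 0 */
    free(page);
    return 0;
}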


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-12 18:00 [RFC PATCH 0/3] Change how we determine when to hand out THPs Alex Thorlton
  2013-12-12 20:33 ` Alex Thorlton
  2013-12-14  5:44 ` Andrew Morton
@ 2013-12-19 14:55 ` Mel Gorman
  2 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2013-12-19 14:55 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: linux-mm, Andrew Morton, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes,
	Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko,
	Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

On Thu, Dec 12, 2013 at 12:00:37PM -0600, Alex Thorlton wrote:
> This patch changes the way we decide whether or not to give out THPs to
> processes when they fault in pages.  The way things are right now,
> touching one byte in a 2M chunk where no pages have been faulted in
> results in a process being handed a 2M hugepage, which, in some cases,
> is undesirable.  The most common issue seems to arise when a process
> uses many cores to work on small portions of an allocated chunk of
> memory.
> 
> <SNIP>
> 
> As you can see there's a significant performance increase when running
> this test with THP off.  Here's a pointer to the test, for those who are
> interested:
> 
> http://oss.sgi.com/projects/memtests/thp_pthread.tar.gz
> 
> My proposed solution to the problem is to allow users to set a
> threshold at which THPs will be handed out.  The idea here is that, when
> a user faults in a page in an area where they would usually be handed a
> THP, we pull 512 pages off the free list, as we would with a regular
> THP, but we only fault in single pages from that chunk, until the user
> has faulted in enough pages to pass the threshold we've set. 

I have not read this thread yet, so this is just my initial reaction to
this part.

First, you say that the proposed solution is to allow users to set a
threshold at which THPs will be handed out, but you actually allocate all
the pages up front, so it's not just that. There are a few things in play:

1. Deferred zeroing cost
2. Deferred THP set cost
3. Different TLB pressure
4. Alignment issues and NUMA

All are important. It is common for there to be fewer large TLB entries
than small ones. Workloads that sparsely reference data may suffer badly
when using large pages as the TLB gets thrashed. Your workload could be
specifically testing for TLB pressure (optimising point 3 above), in
which case the processor used for benchmarking is a major factor and it's
not a universal win.

For example, your workload may optimise for 3, but other workloads may
suffer because more faults are incurred until the threshold is reached,
the page tables must be walked to initialise the remaining pages, and
then the THP must be set up and the TLB flushed.

Keep these details in mind when measuring your patches if at all possible.

Otherwise, on the face of it this is actually a similar proposal to the "page
reservation" described in one of the more important large page papers, written
by Talluri (http://dl.acm.org/citation.cfm?id=195531). Right now you could
consider Linux to be reserving pages with a promotion threshold of 1, and
you're aiming to alter that threshold. It seems like a reasonable idea that
will eventually work out, even though I have not seen the implementation yet.
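
To make the "promotion threshold" framing concrete, here is a toy
user-space model (nothing to do with the actual kernel data structures;
the names and the threshold value are made up): a reservation tracks how
many 4K subpages of an aligned 2M region have been touched and promotes
once the count crosses the threshold; today's behaviour corresponds to a
threshold of 1.

#include <stdbool.h>
#include <stdio.h>

#define SUBPAGES 512                    /* 4K base pages per 2M region */

struct reservation {
	bool faulted[SUBPAGES];         /* which subpages have been touched */
	int  nr_faulted;
	int  threshold;                 /* promote to a huge page at this count */
	bool promoted;
};

/* Record a fault on one subpage; returns true once the region promotes. */
static bool fault_subpage(struct reservation *r, int idx)
{
	if (r->promoted || r->faulted[idx])
		return r->promoted;
	r->faulted[idx] = true;
	if (++r->nr_faulted >= r->threshold)
		r->promoted = true;     /* collapse the base pages into a THP */
	return r->promoted;
}

int main(void)
{
	struct reservation r = { .threshold = 64 };
	int i;

	for (i = 0; i < 100; i++)
		fault_subpage(&r, i);
	printf("subpages faulted: %d, promoted: %s\n",
	       r.nr_faulted, r.promoted ? "yes" : "no");
	return 0;
}

With threshold = 1 every region promotes on the first touch (the current
behaviour); raising it trades extra base-page faults for not handing out
THPs to sparsely touched regions.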

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-16 17:12   ` Alex Thorlton
  2013-12-16 17:51     ` Andrea Arcangeli
  2013-12-17  1:43     ` Andy Lutomirski
@ 2013-12-19 15:29     ` Mel Gorman
  2013-12-25 16:38       ` Alex Thorlton
  2 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2013-12-19 15:29 UTC (permalink / raw)
  To: Alex Thorlton
  Cc: Andrew Morton, linux-mm, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes,
	Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko,
	Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel, Andrea Arcangeli

On Mon, Dec 16, 2013 at 11:12:15AM -0600, Alex Thorlton wrote:
> > Please cc Andrea on this.
> 
> I'm going to clean up a few small things for a v2 pretty soon, I'll be
> sure to cc Andrea there.
> 
> > > My proposed solution to the problem is to allow users to set a
> > > threshold at which THPs will be handed out.  The idea here is that, when
> > > a user faults in a page in an area where they would usually be handed a
> > > THP, we pull 512 pages off the free list, as we would with a regular
> > > THP, but we only fault in single pages from that chunk, until the user
> > > has faulted in enough pages to pass the threshold we've set.  Once they
> > > pass the threshold, we do the necessary work to turn our 512 page chunk
> > > into a proper THP.  As it stands now, if the user tries to fault in
> > > pages from different nodes, we completely give up on ever turning a
> > > particular chunk into a THP, and just fault in the 4K pages as they're
> > > requested.  We may want to make this tunable in the future (i.e. allow
> > > them to fault in from only 2 different nodes).
> > 
> > OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
> > disabled those 512 pages would have resided closer to the CPUs which
> > instantiated them.  
> 
> As it stands right now, yes, since we're pulling a 512 page contiguous
> chunk off the free list, everything from that chunk will reside on the
> same node, but as I (stupidly) forgot to mention in my original e-mail,
> one piece I have yet to add is the functionality to put the remaining
> unfaulted pages from our chunk *back* on the free list after we give up
> on handing out a THP. 

You don't necessarily have to take it off the free list in the
first place either. A heavy-handed approach is to create
MIGRATE_MOVABLE_THP_RESERVATION_BECAUSE_WHO_NEEDS_SNAPPY_NAMES and put it
at the bottom of the fallback lists in the page allocator. Allocate one
base page, move the other 511 to that list. On the second fault, use the
correctly aligned page if it's still on the buddy lists and local to the
current NUMA node, otherwise fall back to a normal allocation. On promotion,
you're checking first whether all the faulted pages are on the same node and
second whether the correctly aligned pages are still on the free lists.

The addition of a migrate type would be very heavy-handed, but you could
just create a special-cased linked list of potentially reserved pages
that is drained before the page allocator wakes kswapd.

Order the pages such that the oldest one on the new free list is the
first allocated. That way you do not have to worry about scanning tasks
for pages to put back on the free list.
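
As a toy model of the fault-time decision described above (user-space C,
purely illustrative; the real logic would live in the fault and allocator
paths): a later fault is served from the reserved, correctly aligned block
only if that block is still free and local to the faulting node, otherwise
it falls back to a normal allocation.

#include <stdbool.h>
#include <stdio.h>

/* Simplified state needed for the second-fault decision. */
struct thp_reservation {
	bool aligned_block_free;   /* 2M-aligned block still on the free lists? */
	int  node;                 /* node the aligned block lives on */
};

/*
 * Serve the next 4K fault from the reserved block (keeping a later
 * promotion possible) only if it is still free and node-local;
 * otherwise a normal allocation on the faulting node is used.
 */
static bool use_reserved_block(const struct thp_reservation *r,
			       int faulting_node)
{
	return r->aligned_block_free && r->node == faulting_node;
}

int main(void)
{
	struct thp_reservation r = { .aligned_block_free = true, .node = 0 };

	printf("fault on node 0 -> use reserved block? %d\n",
	       use_reserved_block(&r, 0));
	printf("fault on node 1 -> use reserved block? %d\n",
	       use_reserved_block(&r, 1));
	return 0;
}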

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-19 15:29     ` Mel Gorman
@ 2013-12-25 16:38       ` Alex Thorlton
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Thorlton @ 2013-12-25 16:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes,
	Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko,
	Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel, Andrea Arcangeli

On Thu, Dec 19, 2013 at 03:29:06PM +0000, Mel Gorman wrote:
> On Mon, Dec 16, 2013 at 11:12:15AM -0600, Alex Thorlton wrote:
> > > Please cc Andrea on this.
> > 
> > I'm going to clean up a few small things for a v2 pretty soon, I'll be
> > sure to cc Andrea there.
> > 
> > > > My proposed solution to the problem is to allow users to set a
> > > > threshold at which THPs will be handed out.  The idea here is that, when
> > > > a user faults in a page in an area where they would usually be handed a
> > > > THP, we pull 512 pages off the free list, as we would with a regular
> > > > THP, but we only fault in single pages from that chunk, until the user
> > > > has faulted in enough pages to pass the threshold we've set.  Once they
> > > > pass the threshold, we do the necessary work to turn our 512 page chunk
> > > > into a proper THP.  As it stands now, if the user tries to fault in
> > > > pages from different nodes, we completely give up on ever turning a
> > > > particular chunk into a THP, and just fault in the 4K pages as they're
> > > > requested.  We may want to make this tunable in the future (i.e. allow
> > > > them to fault in from only 2 different nodes).
> > > 
> > > OK.  But all 512 pages reside on the same node, yes?  Whereas with thp
> > > disabled those 512 pages would have resided closer to the CPUs which
> > > instantiated them.  
> > 
> > As it stands right now, yes, since we're pulling a 512 page contiguous
> > chunk off the free list, everything from that chunk will reside on the
> > same node, but as I (stupidly) forgot to mention in my original e-mail,
> > one piece I have yet to add is the functionality to put the remaining
> > unfaulted pages from our chunk *back* on the free list after we give up
> > on handing out a THP. 
> 
> You don't necessarily have to take it off the free list in the
> first place either. A heavy-handed approach is to create
> MIGRATE_MOVABLE_THP_RESERVATION_BECAUSE_WHO_NEEDS_SNAPPY_NAMES and put it
> at the bottom of the fallback lists in the page allocator. Allocate one
> base page, move the other 511 to that list. On the second fault, use the
> correctly aligned page if it's still on the buddy lists and local to the
> current NUMA node, otherwise fall back to a normal allocation. On promotion,
> you're checking first whether all the faulted pages are on the same node and
> second whether the correctly aligned pages are still on the free lists.
> 
> The addition of a migrate type would be very heavy-handed, but you could
> just create a special-cased linked list of potentially reserved pages
> that is drained before the page allocator wakes kswapd.
> 
> Order the pages such that the oldest one on the new free list is the
> first allocated. That way you do not have to worry about scanning tasks
> for pages to put back on the free list.

Thanks for the input, Mel.  While I agree that the addition of a migrate
type might be a bit heavy-handed, I think it would also get rid of the
problem that Kirill pointed out with forking processes, i.e. the current
behavior tracks temporary huge pages in a per-mm freelist, which falls
apart for forked processes (it's only useful in the threaded case).  I'll
look into this soon.

- Alex


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
  2013-12-17 17:55         ` Andrea Arcangeli
  2013-12-18 17:15           ` Rik van Riel
@ 2013-12-25 19:07           ` Alex Thorlton
  1 sibling, 0 replies; 17+ messages in thread
From: Alex Thorlton @ 2013-12-25 19:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, Kirill A. Shutemov,
	Benjamin Herrenschmidt, Rik van Riel, Wanpeng Li, Mel Gorman,
	Michel Lespinasse, Benjamin LaHaise, Oleg Nesterov,
	Eric W. Biederman, Andy Lutomirski, Al Viro, David Rientjes,
	Zhang Yanfei, Peter Zijlstra, Johannes Weiner, Michal Hocko,
	Jiang Liu, Cody P Schafer, Glauber Costa, Kamezawa Hiroyuki,
	Naoya Horiguchi, linux-kernel

On Tue, Dec 17, 2013 at 06:55:00PM +0100, Andrea Arcangeli wrote:
> On Tue, Dec 17, 2013 at 10:20:07AM -0600, Alex Thorlton wrote:
> > This message in particular:
> > 
> > https://lkml.org/lkml/2013/8/2/697
> 
> I think adding a prctl (or similar) inherited by children to turn off THP
> would be a fine addition to the current madvise. You could then run
> any static app under a wrapper like "THP_disable ./whatever".
> 
> The idea is that, if the software is maintained, madvise allows for
> fine-grained optimization; if the software is legacy, proprietary, and
> statically linked (or if it already uses LD_PRELOAD for other things),
> prctl takes care of it in a coarser way (but still per-app).

That sounds fine.  I'll dig up the old patches that I wrote a while back
to enable this, and get them cleaned up and rebased to the latest kernel
version for people to review.
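
For reference, a minimal wrapper along the lines of Andrea's
"THP_disable ./whatever" could look roughly like this (a sketch only:
the prctl name and number below are placeholders for the proposed flag,
which does not exist in mainline yet):

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41   /* placeholder value for the proposed prctl */
#endif

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	/* Disable THP for this process (and, per the proposal, its children). */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		perror("prctl(PR_SET_THP_DISABLE)"); /* kernel without support */

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}

Running "THP_disable ./thp_pthread ..." would then cover static or
proprietary binaries without touching their source or relying on
LD_PRELOAD.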

> > The thread I mention above originally proposed a per-process switch to
> > disable THP without the use of madvise, but it was not very well 
> > received.  I'm more than willing to revisit that idea, and possibly
> 
> I think you provided enough explanation of why it is needed (static
> binaries, proprietary apps, the annoyance of LD_PRELOAD that may collide
> with other LD_PRELOADs in proprietary apps, and so on), so I think a
> prctl is a reasonable addition to the madvise.
> 
> We also have a madvise to turn on THP selectively on embedded systems
> that may boot with enabled=madvise to be sure not to waste any memory
> because of THP. But a prctl to selectively enable doesn't make too
> much sense, as one would have to enable it in a fine-grained way to
> be sure not to cause any memory waste. So I think a NOHUGEPAGE prctl
> would be enough.
> 
> > meld the two (a per-process threshold, instead of a big-hammer on-off
> > switch).  Let me know if that seems preferable to this idea and we can
> > discuss.
> 
> The per-process threshold would be a much bigger patch; I think starting
> with the big-hammer on-off is preferable, as it is much simpler and
> should be more than enough to take care of the rare corner cases,
> while leaving the other workloads unaffected (modulo the cacheline to
> check the task or mm flags) and running at max speed.

Agreed.  While I still would like to explore the threshold idea further,
I'm all for putting in a simpler fix to our current problem that will
leave default behavior unaffected.
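
As an aside on the madvise route Andrea mentions above (the fine-grained
alternative for maintained software), the hint already exists; a minimal
sketch, assuming a kernel and libc recent enough to expose
MADV_NOHUGEPAGE:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;  /* 1G of anonymous memory */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Ask the kernel not to back this range with transparent hugepages. */
	if (madvise(buf, len, MADV_NOHUGEPAGE))
		perror("madvise(MADV_NOHUGEPAGE)");

	/* ... sparse, per-thread accesses to buf would go here ... */
	return 0;
}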
 
> To evaluate the threshold solution, a variety of benchmarks of a
> multitude of apps would be necessary first, to see the effect it has
> on the non-corner cases. Adding the big-hammer on-off prctl instead is
> a black and white design solution that won't require black magic
> settings.
> 
> Ideally, if we add a threshold later it won't require any more
> cacheline accesses, as the threshold would also need to be per-task or
> per-mm, so the runtime cost of the prctl would remain zero and it
> could become a benchmarking tweak even if we add the per-app
> threshold later.
>
> About creating heuristics to automatically detect the ideal value of
> the big-hammer per-app on/off switch (or even harder the ideal value
> of the per-app threshold), I think it's not going to happen because
> there are too few corner cases and it wouldn't be worth the cost of it
> (the cost would be significant no matter how implemented).

I see where you're coming from here.  If we do decide to move further
with implementing a threshold solution in the future, I think the best
idea is to have it default to 1, which would maintain current behavior
and leave the non-corner cases unaffected.

Thanks for your suggestions!

- Alex


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2013-12-25 19:07 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-12 18:00 [RFC PATCH 0/3] Change how we determine when to hand out THPs Alex Thorlton
2013-12-12 20:33 ` Alex Thorlton
2013-12-14  5:44 ` Andrew Morton
2013-12-16 17:12   ` Alex Thorlton
2013-12-16 17:51     ` Andrea Arcangeli
2013-12-17 16:20       ` Alex Thorlton
2013-12-17 17:55         ` Andrea Arcangeli
2013-12-18 17:15           ` Rik van Riel
2013-12-25 19:07           ` Alex Thorlton
2013-12-17  1:43     ` Andy Lutomirski
2013-12-17 16:04       ` Alex Thorlton
2013-12-17 16:54         ` Andy Lutomirski
2013-12-17 17:47           ` Alex Thorlton
2013-12-17 22:25             ` Andy Lutomirski
2013-12-19 15:29     ` Mel Gorman
2013-12-25 16:38       ` Alex Thorlton
2013-12-19 14:55 ` Mel Gorman

This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for NNTP
newsgroup(s).