Can we disable transparent hugepages for lack of a legitimate use case please?

linux-mm.kvack.org archive mirror

* Can we disable transparent hugepages for lack of a legitimate use case please?
@ 2015-08-24 20:12 James Hartshorn
  2015-08-24 20:20 ` Bridgman, John
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: James Hartshorn @ 2015-08-24 20:12 UTC (permalink / raw)
  To: linux-mm@kvack.org

Hi,

I've been struggling with transparent hugepage performance issues, and can't seem to find anyone who actually uses it intentionally.  Virtually every database that runs on Linux, however, recommends disabling it or setting it to madvise.  I'm referring to:

/sys/kernel/mm/transparent_hugepage/enabled
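
For reference, a minimal sketch of reading that file. The kernel lists the available modes and marks the active one with square brackets (for example "always [madvise] never"). The helper name thp_active_mode is made up for illustration, and the optional path argument exists only so the function can be tried against a sample file.

```shell
#!/bin/sh
# Print the active value of a THP sysfs control file.
# The active choice is the word inside square brackets.
# thp_active_mode is a hypothetical helper name, not an existing tool.
thp_active_mode() {
    file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file (THP not built in?)" >&2
        return 1
    fi
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}

# Show the system-wide mode, if the file exists on this machine.
thp_active_mode || true
```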

I asked on the internet http://unix.stackexchange.com/questions/201906/does-anyone-actually-use-and-benefit-from-transparent-huge-pages and got no responses there.

Independently I noticed

"sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled for lack of a legitimate use case.  If you have one, please send an email to linux-mm@kvack.org."

And I thought: wow, that's exactly what should be done to transparent hugepages.

Thoughts?


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-08-24 20:12 Can we disable transparent hugepages for lack of a legitimate use case please? James Hartshorn
@ 2015-08-24 20:20 ` Bridgman, John
  2015-08-24 20:46   ` James Hartshorn
  2015-08-25  9:25 ` Konstantin Khlebnikov
  2015-09-03 19:33 ` Andi Kleen
  2 siblings, 1 reply; 14+ messages in thread
From: Bridgman, John @ 2015-08-24 20:20 UTC (permalink / raw)
  To: James Hartshorn, linux-mm@kvack.org

We find it useful for GPU compute applications (APU I suppose, with GPU access via IOMMUv2) working on large datasets.

I wouldn't have expected THP to find much use for databases -- those seem to be more like graphics stacks where you have enough hints about future usage to justify explicit management of pages. I thought of THP as "the solution for everything else".


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-08-24 20:20 ` Bridgman, John
@ 2015-08-24 20:46   ` James Hartshorn
  2015-08-24 23:20     ` Theodore Ts'o
  2015-09-10 16:45     ` Andrea Arcangeli
  0 siblings, 2 replies; 14+ messages in thread
From: James Hartshorn @ 2015-08-24 20:46 UTC (permalink / raw)
  To: Bridgman, John, linux-mm@kvack.org

As a general-purpose sysadmin I've mostly struggled with its default being always. If it were never (or possibly madvise?) then I think all the very real performance problems would go away, and those who know they need it could turn it on.  I have begun looking into asking the distros to change this (is it a distro choice?) but am not getting very far.  Just to be clear, the default of always causes noticeable pauses of operation on almost all databases, analogous to having a stop-the-world GC.

As for THP in APU-type applications, have you run into any jemalloc defrag performance issues?  My research into THP issues indicates this is part of the performance problem that manifests for databases.

Some more links to discussion about THP:

PostgreSQL https://lwn.net/Articles/591723/

PostgreSQL http://www.postgresql.org/message-id/20120821131254.1415a545@jekyl.davidgould.org

MySQL (TokuDB) https://dzone.com/articles/why-tokudb-hates-transparent

Redis http://redis.io/topics/latency http://antirez.com/news/84

Oracle https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge

MongoDB http://docs.mongodb.org/master/tutorial/transparent-huge-pages/

Couchbase http://blog.couchbase.com/often-overlooked-linux-os-tweaks

Riak http://underthehood.meltwater.com/blog/2015/04/14/riak-elasticsearch-and-numad-walk-into-a-red-hat/


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-08-24 20:46   ` James Hartshorn
@ 2015-08-24 23:20     ` Theodore Ts'o
  2015-09-10 16:45     ` Andrea Arcangeli
  1 sibling, 0 replies; 14+ messages in thread
From: Theodore Ts'o @ 2015-08-24 23:20 UTC (permalink / raw)
  To: James Hartshorn; +Cc: Bridgman, John, linux-mm@kvack.org

Part of the problem with asking "Does anyone use THP" is that a lot of
people may be using THP without realizing it.  That is, after all, the
whole point.

Some selected bits from running the command:

sudo grep -e AnonHugePages  /proc/*/smaps | awk  '{ if($2>4) print $0} ' |  awk -F "/"  '{print $0; system("ps -fp " $3)} '

/proc/17297/smaps:AnonHugePages:    290816 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso    17297 17290  4 19:10 pts/6    00:00:05 qemu-system-x86_64 -enable-kvm -boo

/proc/2467/smaps:AnonHugePages:     92160 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso     2467  2347  0 09:49 ?        00:00:10 xfdesktop --display :0.0 --sm-clien

/proc/13446/smaps:AnonHugePages:     81920 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso    13446  2591  0 12:25 pts/0    00:00:11 mutt -f /home/tytso/imap/shared.mit

/proc/2603/smaps:AnonHugePages:     43008 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso     2603  2347  0 09:49 ?        00:00:01 /usr/bin/perl /usr/bin/parcimonie

/proc/9853/smaps:AnonHugePages:     20480 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso     9853  2461  1 09:56 ?        00:07:01 /opt/google/chrome-beta/chrome --us

/proc/1622/smaps:AnonHugePages:     14336 kB
UID        PID  PPID  C STIME TTY          TIME CMD
root      1622  1567  0 09:49 tty7     00:03:09 /usr/bin/X :0 -seat seat0 -auth /va
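
A variant of the command above, as a rough sketch: it totals AnonHugePages per process rather than printing each mapping. sum_anon_huge_kb and report_thp_users are illustrative names. Processes whose smaps cannot be read (for example when not running as root) are silently skipped.

```shell
#!/bin/sh
# Sum the AnonHugePages values (in kB) from smaps-format text on stdin.
sum_anon_huge_kb() {
    awk '/^AnonHugePages:/ { total += $2 } END { printf "%d\n", total }'
}

# Print "PID<TAB>kB<TAB>command line" for every readable process that
# currently has at least one anonymous huge page mapped.
report_thp_users() {
    for smaps in /proc/[0-9]*/smaps; do
        pid=${smaps#/proc/}
        pid=${pid%/smaps}
        # Reading can fail (permissions, process exit); skip those.
        kb=$( { sum_anon_huge_kb < "$smaps"; } 2>/dev/null ) || continue
        [ "${kb:-0}" -gt 0 ] || continue
        cmd=$( { tr '\0' ' ' < "/proc/$pid/cmdline"; } 2>/dev/null )
        printf '%s\t%s kB\t%s\n' "$pid" "$kb" "$cmd"
    done
}

report_thp_users
```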

Cheers,

					- Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-08-24 20:46   ` James Hartshorn
  2015-08-24 23:20     ` Theodore Ts'o
@ 2015-09-10 16:45     ` Andrea Arcangeli
  2015-09-10 17:02       ` Andres Freund
  2015-09-14 12:37       ` Vlastimil Babka
  1 sibling, 2 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2015-09-10 16:45 UTC (permalink / raw)
  To: James Hartshorn; +Cc: Bridgman, John, linux-mm@kvack.org

On Mon, Aug 24, 2015 at 08:46:11PM +0000, James Hartshorn wrote:
> As a general purpose sysadmin I've mostly struggled with its default
> being always, if it were never (or possibly madvise?) then I think
> all the very real performance problems would go away.  Those who
> know they need it could turn it on.  I have begun looking into
> asking the distros to change this (is it a distro choice?) but am

My suggestion would be to: 1) identify exactly if it's a THP issue or
a compaction issue, 2) if it's really a THP issue report it to the
application developers to use the MADV_NOHUGEPAGE or the prctl to
disable THP only for the app or the library. If it's a compaction
issue disabling THP sounds wrong to me and it should be simply
reported here as a bug.

> not getting that far.  Just to be clear the default of always causes
> noticeable pauses of operation on almost all databases, analogous to
> having a stop the world gc.  As for THP in APU type applications
> have you run into any JEMalloc defrag performance issues?  My
> research into THP issues indicates this is part of the performance
> problem that manifests for databases.  Some more links to discussion
> about THP: Postgresql https://lwn.net/Articles/591723/ Postgresql
> http://www.postgresql.org/message-id/20120821131254.1415a545@jekyl.davidgould.org

"and my interpretation was that it was trying to create hugepages from
scattered fragments"

This is a very old email, but I'm taking it as an example because
this has to be a compaction issue. If you run into very visible hangs
that go away when you disable THP, THP itself is not to blame. THP can
increase the latency jitter during page faults (a real-time sensitive
application could notice a 2MB clear_page vs a 4KB clear_page), but
not in a way that hangs a system and becomes visible to the user.

The very early compaction code was simply too aggressive, and it has
been fixed in the meantime.

Worst of all is that disabling THP can't solve compaction issues
because compaction still runs even after you disable THP (drivers and
slab can still use high order pages), so it'll just hide the problem.

To disable compaction in THP just run:

echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

If you got a compaction problem, this will make it go away, but you'd
still have THP on.
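
To help tell a THP problem from a compaction problem, one option (a sketch, not something proposed in this thread) is to sample the compaction and THP event counters in /proc/vmstat before and after a stall. The counter names below are the ones exported by kernels built with compaction and THP support; thp_vmstat is a hypothetical helper name.

```shell
#!/bin/sh
# Print the compaction and THP event counters from a vmstat-format file.
# thp_vmstat is an illustrative name; the path argument defaults to
# /proc/vmstat and can be overridden for testing.
thp_vmstat() {
    file="${1:-/proc/vmstat}"
    [ -r "$file" ] || return 1
    awk '$1 ~ /^(compact_stall|compact_fail|compact_success|thp_fault_alloc|thp_fault_fallback|thp_collapse_alloc)$/ { print $1, $2 }' "$file"
}

# Take one sample now; run it again after a pause and compare.
thp_vmstat || true
```

A compact_stall count that climbs during the pauses would point at direct compaction in the page fault path rather than at THP itself.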

Considering the amount of work that went into compaction (primarily to
make it less aggressive) and how old the email is, I doubt that the
problem reported in the email could still happen with current kernels.

There's current work on linux-mm (primarily from Vlastimil and David)
to make compaction asynchronous. I'm not too fond of the initial
proposal of offloading compaction purely to khugepaged, disconnected
from the page faults. But it would be possible to make the page fault
wake up a kernel daemon that compacts hugepages in parallel with the
page fault requests. The page fault latency would then become
identical to when the defrag sysfs control is set to "madvise". I
think apps that use MADV_HUGEPAGE (like qemu) should still run
compaction synchronously though. For qemu, losing several hugepages
because of the async behavior of compaction would be a major loss. It's
perfectly fine if it's slower at starting up as long as it gets as
many hugepages as it can. I've seen other proposals floating around;
there's definitely work in this area to optimize compaction further.

Compaction is already much better now than in the very first version
that landed upstream so again those emails are not relevant anymore.

> Mysql (tokudb)
> https://dzone.com/articles/why-tokudb-hates-transparent

This seems to be a THP issue: unless the alternate malloc allocator
starts using MADV_NOHUGEPAGE, its memory loss would become extreme with
the pending split_huge_page changes from Kirill. There's little the
kernel can do about this; in fact Kirill's latest changes go in the very
opposite direction of what's needed to reduce the memory footprint for
this MADV_DONTNEED 4KB case.

With current code however the best you can do is:

echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

That will guarantee that khugepaged never increases the memory
footprint after a MADV_DONTNEED done by the alternate malloc
allocator. However, it will definitely stop helping once the pending
split_huge_page changes land. You could consider testing it, but if
the pending split_huge_page changes are merged, this tuning will
disappear.
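
Before changing max_ptes_none it may be worth recording the current khugepaged settings so they can be restored later. A small sketch; dump_tunables is an illustrative name and the directory argument is only there so it can be pointed at a test directory.

```shell
#!/bin/sh
# Print "name = value" for every readable file in the khugepaged
# sysfs directory. dump_tunables is a hypothetical helper name.
dump_tunables() {
    dir="${1:-/sys/kernel/mm/transparent_hugepage/khugepaged}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    for f in "$dir"/*; do
        # Skip subdirectories and anything we are not allowed to read.
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s = %s\n' "${f##*/}" "$(cat "$f")"
    done
}

# Record the current values, if THP is available on this machine.
dump_tunables || true
```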

> Redis
> http://redis.io/topics/latency http://antirez.com/news/84 Oracle

I already covered redis in detail in a previous email in this
thread. This is a legitimate THP issue and for now MADV_NOHUGEPAGE
will take care of it.

If redis could in the future stop using fork() and use
clone()+userfaultfd for the snapshotting, then leaving THP enabled
should be fine, as redis could then control the size of the wrprotect
faults in userland.

> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge

At least this document doesn't have the random reboots and
instability allegations that their earlier documents talked about
(which I have never seen here and never had a report about... which made
me wonder why they were getting those reboots or instabilities and
which kernel they were actually using).

The latest data (including a document on oracle.com) shows a worst-case
5-10% performance regression and, like postgresql, they should also
consider trying again with:

echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

To see if that 5-10% worst case performance regression magically
disappears while keeping THP enabled.

It'd at least help to know if this is a THP issue or a compaction
issue.
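
One way to run that comparison is to switch the defrag setting only for the duration of a benchmark and then put the old value back. The sketch below assumes root privileges; with_thp_defrag, the THP_DEFRAG_FILE override and run_benchmark.sh are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Run a command with the THP defrag control temporarily set to MODE,
# then restore the previous value.
# Usage: with_thp_defrag MODE command [args...]
with_thp_defrag() {
    mode=$1
    shift
    # THP_DEFRAG_FILE is a made-up override so the function can be
    # tried against an ordinary file instead of sysfs.
    ctl="${THP_DEFRAG_FILE:-/sys/kernel/mm/transparent_hugepage/defrag}"
    # The active value is the bracketed word, e.g. "[always] madvise never".
    old=$(sed -n 's/.*\[\([^]]*\)].*/\1/p' "$ctl") || return 1
    [ -n "$old" ] || return 1
    echo "$mode" > "$ctl" || return 1
    "$@"
    rc=$?
    echo "$old" > "$ctl"
    return "$rc"
}

# Example (needs root): with_thp_defrag madvise ./run_benchmark.sh
```

If the regression disappears with defrag set to madvise while THP stays enabled, that would suggest compaction rather than THP as the cause.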

On a side note worth mentioning: Oracle has been very helpful in fixing
a performance regression in O_DIRECT that materialized after THP was
merged, and that has been fixed upstream for a while. Kirill's pending
split_huge_page changes will give O_DIRECT a further boost. What's left
to optimize is only barely measurable now with 2 fusion-IO and massive
I/O bandwidth and orasim, not the real Oracle database. We actually
couldn't measure any difference even from that optimization in a real
Oracle load that isn't 100% I/O bound, despite using the hardware setup
with the massive I/O bandwidth required to reproduce it.

Note also that O_DIRECT currently performs identically with THP on or
off. Only Kirill's pending split_huge_page changes can give a further
small boost; disabling THP can't improve O_DIRECT performance.

I believe before disabling THP it should be identified where the
problem comes from... so if it's not a design issue like redis, we can
optimize it, like we did for the O_DIRECT case with Oracle's
helpful and appreciated contribution.

> MongoDB
> http://docs.mongodb.org/master/tutorial/transparent-huge-pages/

There's not much explanation here.

> Couchbase http://blog.couchbase.com/often-overlooked-linux-os-tweaks

"Couchbase Server can be negatively impacted by severe page allocation
delays when THP is enabled"

As mentioned above, severe delays in page faults can only be
explained by compaction issues; trying with defrag = madvise is best.

> Riak
> http://underthehood.meltwater.com/blog/2015/04/14/riak-elasticsearch-and-numad-walk-into-a-red-hat/

This does not have enough data to tell what the problem could be.

Thanks,
Andrea


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-09-10 16:45     ` Andrea Arcangeli
@ 2015-09-10 17:02       ` Andres Freund
  2015-09-14 12:37       ` Vlastimil Babka
  1 sibling, 0 replies; 14+ messages in thread
From: Andres Freund @ 2015-09-10 17:02 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: James Hartshorn, Bridgman, John, linux-mm@kvack.org

On 2015-09-10 18:45:06 +0200, Andrea Arcangeli wrote:
> On Mon, Aug 24, 2015 at 08:46:11PM +0000, James Hartshorn wrote:
> > Some more links to discussion
> > about THP: Postgresql https://lwn.net/Articles/591723/ Postgresql
> > http://www.postgresql.org/message-id/20120821131254.1415a545@jekyl.davidgould.org
> 
> "and my interpretation was that it was trying to create hugepages from
> scattered fragments"
> 
> This is a very old email, but I'm just taking it as an example because
> this has to be a compaction issue. If you run into very visible hangs
> that goes away by disabling THP, it can't be THP to blame. THP can
> increase the latency jitter during page faults (real time sensitive
> application could notice a 2MB clear_page vs a 4KB clear_page), but
> not in a way that hangs a system and becomes visible to the user.
> 
> It's just very early compaction code was too aggressive and it got
> fixed in the meanwhile.

There's still some slowdown (as of 4.0) in extreme postgres workloads
with THP and/or compaction enabled, but I've indeed not been able to
reproduce bad stalls or large (10%+) slowdowns with recent kernels.

Greetings,

Andres Freund


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-09-10 16:45     ` Andrea Arcangeli
  2015-09-10 17:02       ` Andres Freund
@ 2015-09-14 12:37       ` Vlastimil Babka
  1 sibling, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2015-09-14 12:37 UTC (permalink / raw)
  To: Andrea Arcangeli, James Hartshorn
  Cc: Bridgman, John, linux-mm@kvack.org, Kirill A. Shutemov

On 09/10/2015 06:45 PM, Andrea Arcangeli wrote:
>> >Mysql (tokudb)
>> >https://dzone.com/articles/why-tokudb-hates-transparent
> This seems a THP issue: unless the alternate malloc allocator starts
> using MADV_NOHUGEPAGE, its memory loss would become extreme with the
> split_huge_page pending changes from Kirill. There's little the kernel
> can do about this, in fact Kirill's latest changes goes in the very
> opposite direction of what's needed to reduce the memory footprint for
> this MADV_DONTNEED 4kb case.
>
> With current code however the best you can do is:
>
> echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>
> That will guarantee that khugepaged never increases the memory
> footprint after a MADV_DONTNEED done by the alternate malloc
> allocator. Just that will definitely stop to help with the
> split_huge_page pending changes. You could consider testing that but
> if the split_huge_page pending changes are merged, this tuning shall
> disappear.

I don't think it's that pessimistic after Kirill's patchset? 
MADV_DONTNEED should still result in unmaps, which results in 
split_huge_pmd. Then the THP is put in a shrinker list and will be fully 
split in response to memory pressure, see:

  [PATCHv10 34/36] thp: introduce deferred_split_huge_page()



* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-08-24 20:12 Can we disable transparent hugepages for lack of a legitimate use case please? James Hartshorn
  2015-08-24 20:20 ` Bridgman, John
@ 2015-08-25  9:25 ` Konstantin Khlebnikov
  2015-08-25  9:56   ` Vlastimil Babka
  2015-09-03 19:33 ` Andi Kleen
  2 siblings, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2015-08-25  9:25 UTC (permalink / raw)
  To: James Hartshorn; +Cc: linux-mm@kvack.org, Kirill A. Shutemov

On Mon, Aug 24, 2015 at 11:12 PM, James Hartshorn
<jhartshorn@connexity.com> wrote:
> Hi,
>
>
> I've been struggling with transparent hugepage performance issues, and can't
> seem to find anyone who actually uses it intentionally.  Virtually every
> database that runs on linux however recommends disabling it or setting it to
> madvise.  I'm referring to:
>
>
> /sys/kernel/mm/transparent_hugepage/enabled
>
>
> I asked on the internet
> http://unix.stackexchange.com/questions/201906/does-anyone-actually-use-and-benefit-from-transparent-huge-pages
> and got no responses there.
>
>
>
> Independently I noticed
>
>
> "sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled
> for lack of a legitimate use case.  If you have one, please send an email to
> linux-mm@kvack.org."
>
>
> And thought wow that's exactly what should be done to transparent hugepages.
>
>
> Thoughts?

THP works very well when the system has a lot of free memory.
Probably the default should be weakened to "only if we have tons of free memory".
For example, allocate THP pages atomically, only if the buddy allocator already
has huge pages. They could also be pre-zeroed in the background.
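
How many huge-page-sized blocks the buddy allocator currently has free can be estimated from /proc/buddyinfo. The sketch below assumes x86-64 with 4KB base pages, where a 2MB THP is an order-9 block and the file lists orders 0 to 10 per zone; free_thp_blocks is an illustrative name.

```shell
#!/bin/sh
# Count free 2MB-sized blocks across all zones of a buddyinfo-format file.
# Assumption: 4KB base pages and 11 order columns (orders 0..10).
free_thp_blocks() {
    file="${1:-/proc/buddyinfo}"
    [ -r "$file" ] || return 1
    # Fields 1-4 are "Node N, zone NAME"; field 5 is order 0, so
    # order 9 is field 14 and order 10 (two 2MB blocks each) is field 15.
    awk '$1 == "Node" { n += $14 + 2 * $15 } END { printf "%d\n", n }' "$file"
}

free_thp_blocks || true
```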


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-08-25  9:25 ` Konstantin Khlebnikov
@ 2015-08-25  9:56   ` Vlastimil Babka
  2015-09-01 22:26     ` David Rientjes
  0 siblings, 1 reply; 14+ messages in thread
From: Vlastimil Babka @ 2015-08-25  9:56 UTC (permalink / raw)
  To: Konstantin Khlebnikov, James Hartshorn
  Cc: linux-mm@kvack.org, Kirill A. Shutemov, Andrea Arcangeli,
	David Rientjes, Mel Gorman

On 08/25/2015 11:25 AM, Konstantin Khlebnikov wrote:
> On Mon, Aug 24, 2015 at 11:12 PM, James Hartshorn
> <jhartshorn@connexity.com> wrote:
>> Hi,
>>
>>
>> I've been struggling with transparent hugepage performance issues, and can't
>> seem to find anyone who actually uses it intentionally.  Virtually every
>> database that runs on linux however recommends disabling it or setting it to
>> madvise.  I'm referring to:
>>
>>
>> /sys/kernel/mm/transparent_hugepage/enabled
>>
>>
>> I asked on the internet
>> http://unix.stackexchange.com/questions/201906/does-anyone-actually-use-and-benefit-from-transparent-huge-pages
>> and got no responses there.
>>
>>
>>
>> Independently I noticed
>>
>>
>> "sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled
>> for lack of a legitimate use case.  If you have one, please send an email to
>> linux-mm@kvack.org."
>>
>>
>> And thought wow that's exactly what should be done to transparent hugepages.
>>
>>
>> Thoughts?

[+ Cc's]

> THP works very well when system has a lot of free memory.
> Probably default should be weakened to "only if we have tons of free memory".
> For example allocate THP pages atomically, only if buddy allocator already
> has huge pages. Also them could be pre-zeroed in background.

I've been proposing series that try to move more THP allocation activity
from the page faults into khugepaged, but with no success yet.

Maybe we should just start by changing the default of
/sys/kernel/mm/transparent_hugepage/defrag to "madvise". This would
remove the reclaim and compaction for page faults and quickly fall back
to order-0 pages. The compaction is already crippled enough there by
the GFP_TRANSHUGE-specific decisions in __alloc_pages_slowpath(). I
noticed it failing miserably in transhuge-stress recently, so it
seems it's not worth trying at all. With the changed default we can
kill those GFP_TRANSHUGE checks and assume that whoever uses madvise
does actually want to try harder.

Of course that does nothing about zeroing. I don't know how big an
issue that one is?


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-08-25  9:56   ` Vlastimil Babka
@ 2015-09-01 22:26     ` David Rientjes
  2015-09-02  8:55       ` Konstantin Khlebnikov
  2015-09-09 22:05       ` Andrea Arcangeli
  0 siblings, 2 replies; 14+ messages in thread
From: David Rientjes @ 2015-09-01 22:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Konstantin Khlebnikov, James Hartshorn, linux-mm@kvack.org,
	Kirill A. Shutemov, Andrea Arcangeli, Mel Gorman

On Tue, 25 Aug 2015, Vlastimil Babka wrote:

> > THP works very well when system has a lot of free memory.
> > Probably default should be weakened to "only if we have tons of free
> > memory".
> > For example allocate THP pages atomically, only if buddy allocator already
> > has huge pages. Also them could be pre-zeroed in background.
> 
> I've been proposing series that try to move more THP allocation activity from
> the page faults into khugepaged, but no success yet.
> 
> Maybe we should just start with changing the default of
> /sys/kernel/mm/transparent_hugepage/defrag to "madvise".

I would need to revert this internally to avoid performance degradation;
I believe others would report the same.

> This would remove the
> reclaim and compaction for page faults and quickly fallback to order-0 pages.
> The compaction is already crippled enough there with the GFP_TRANSHUGE
> specific decisions in __alloc_pages_slowpath(). I've noticed it failing
> miserably in the transhuge-stress recently, so it seems it's not worth to try
> at all. With changing the default we can kill those GFP_TRANSHUGE checks and
> assume that whoever uses the madvise does actually want to try harder.
> 

I think the work that is being done on moving compaction to khugepaged as 
well as periodic synchronous compaction of all memory is the way to go to 
avoid lengthy stalls during fault.

> Of course that does nothing about zeroing. I don't know how huge issue is that
> one?
> 

I don't believe it is an issue that cannot be worked around in userspace 
either with MADV_NOHUGEPAGE or PR_SET_THP_DISABLE.


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-09-01 22:26     ` David Rientjes
@ 2015-09-02  8:55       ` Konstantin Khlebnikov
  2015-09-02  9:06         ` Vlastimil Babka
  2015-09-09 22:05       ` Andrea Arcangeli
  1 sibling, 1 reply; 14+ messages in thread
From: Konstantin Khlebnikov @ 2015-09-02  8:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vlastimil Babka, James Hartshorn, linux-mm@kvack.org,
	Kirill A. Shutemov, Andrea Arcangeli, Mel Gorman

On Wed, Sep 2, 2015 at 1:26 AM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 25 Aug 2015, Vlastimil Babka wrote:
>
>> > THP works very well when system has a lot of free memory.
>> > Probably default should be weakened to "only if we have tons of free
>> > memory".
>> > For example allocate THP pages atomically, only if buddy allocator already
>> > has huge pages. Also them could be pre-zeroed in background.
>>
>> I've been proposing series that try to move more THP allocation activity from
>> the page faults into khugepaged, but no success yet.
>>
>> Maybe we should just start with changing the default of
>> /sys/kernel/mm/transparent_hugepage/defrag to "madvise".
>
> I would need to revert this internally to avoid performance degradation, I
> believe others would report the same.

What about adding a new mode, "guess" -- something between always and madvise?

In this mode the kernel tries to avoid a performance impact for non-madvised vmas
and allocates order-0 pages if hugepages are not available right now
(for example, by doing allocations with GFP_NOWAIT).
I think we'll get all the benefits without losing performance.

>
>> This would remove the
>> reclaim and compaction for page faults and quickly fallback to order-0 pages.
>> The compaction is already crippled enough there with the GFP_TRANSHUGE
>> specific decisions in __alloc_pages_slowpath(). I've noticed it failing
>> miserably in the transhuge-stress recently, so it seems it's not worth to try
>> at all. With changing the default we can kill those GFP_TRANSHUGE checks and
>> assume that whoever uses the madvise does actually want to try harder.
>>
>
> I think the work that is being done on moving compaction to khugepaged as
> well as periodic synchronous compaction of all memory is the way to go to
> avoid lengthy stalls during fault.
>
>> Of course that does nothing about zeroing. I don't know how huge issue is that
>> one?
>>
>
> I don't believe it is an issue that cannot be worked around in userspace
> either with MADV_NOHUGEPAGE or PR_SET_THP_DISABLE.


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-09-02  8:55       ` Konstantin Khlebnikov
@ 2015-09-02  9:06         ` Vlastimil Babka
  0 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2015-09-02  9:06 UTC (permalink / raw)
  To: Konstantin Khlebnikov, David Rientjes
  Cc: James Hartshorn, linux-mm@kvack.org, Kirill A. Shutemov,
	Andrea Arcangeli, Mel Gorman

On 2.9.2015 10:55, Konstantin Khlebnikov wrote:
> On Wed, Sep 2, 2015 at 1:26 AM, David Rientjes <rientjes@google.com> wrote:
>> On Tue, 25 Aug 2015, Vlastimil Babka wrote:
>>
>>>> THP works very well when system has a lot of free memory.
>>>> Probably default should be weakened to "only if we have tons of free
>>>> memory".
>>>> For example allocate THP pages atomically, only if buddy allocator already
>>>> has huge pages. Also them could be pre-zeroed in background.
>>>
>>> I've been proposing series that try to move more THP allocation activity from
>>> the page faults into khugepaged, but no success yet.
>>>
>>> Maybe we should just start with changing the default of
>>> /sys/kernel/mm/transparent_hugepage/defrag to "madvise".
>>
>> I would need to revert this internally to avoid performance degradation, I
>> believe others would report the same.
> 
> What about adding new mode "guess" -- something between always and madvise?
> 
> In this mode kernel tries to avoid performance impact for non-madvised vmas and
> allocates 0-order pages if hugepages are not available right now.
> (for example do allocations with GFP_NOWAIT)

That's exactly what happens when
/sys/kernel/mm/transparent_hugepage/defrag is set to "madvise".
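
To check which mode a given box is actually running with, the knob can be read back from sysfs. The kernel prints all choices on one line and brackets the active one (for example "always [madvise] never"). The following is only a minimal sketch; the helper names thp_active_choice() and thp_read_choice() are made up for illustration and are not part of any kernel or libc API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the bracketed (active) choice out of a sysfs
 * line such as "always [madvise] never". Returns 0 on success, -1 if no
 * bracketed token is found or it does not fit in the output buffer. */
int thp_active_choice(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;
    len = (size_t)(close - open - 1);
    if (len + 1 > outlen)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Hypothetical helper: read one THP sysfs knob (for example
 * /sys/kernel/mm/transparent_hugepage/defrag) and report its active
 * choice. Returns -1 if the file cannot be read or parsed. */
int thp_read_choice(const char *path, char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen(path, "r");
    int rc;

    if (!f)
        return -1;
    rc = fgets(line, sizeof(line), f)
             ? thp_active_choice(line, out, outlen)
             : -1;
    fclose(f);
    return rc;
}
```

Calling thp_read_choice() on the defrag path above fills the buffer with "madvise" when that mode is selected.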

> I think we'll get all benefits without losing performance.


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-09-01 22:26     ` David Rientjes
  2015-09-02  8:55       ` Konstantin Khlebnikov
@ 2015-09-09 22:05       ` Andrea Arcangeli
  1 sibling, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2015-09-09 22:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vlastimil Babka, Konstantin Khlebnikov, James Hartshorn,
	linux-mm@kvack.org, Kirill A. Shutemov, Mel Gorman

On Tue, Sep 01, 2015 at 03:26:34PM -0700, David Rientjes wrote:
> I don't believe it is an issue that cannot be worked around in userspace 
> either with MADV_NOHUGEPAGE or PR_SET_THP_DISABLE.

Agreed, for the legit cases where THP can hurt, the bugreport should
be sent to the databases so they can use one of the two above
features.

Whether it hurts really depends on the database; in fact the
majority of databases benefit from THP (others can provide the exact
details). So on average it's still a net gain even for databases.

I'm aware of a single db case where THP hurts and it makes perfect
sense why it hurts (and it's not Oracle): redis and only during
snapshotting, see the end of the email.

Setting the global THP tweak to "madvise" was designed for embedded
systems where losing even 4k of RAM matters; "madvise" should be more
about the memory footprint than the performance.

qemu-kvm uses MADV_HUGEPAGE so THP is enabled even when the global
setting is "madvise", exactly because with qemu the memory footprint
won't change regardless of whether THP is enabled or disabled. If
you're very low on memory "madvise" makes sense just in case.
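
The opt-in side, which is what matters when the global setting is "madvise", can be sketched like this. It is a minimal illustration and not qemu's actual code: the helper names are hypothetical, and the 2 MiB huge page size is an assumption that holds for x86-64 but not for every architecture. The mapping is aligned to the huge page size so that the whole range is eligible, and the madvise() result is ignored because the hint is advisory.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024) /* assumed THP size (x86-64) */

/* Round len up to a multiple of the assumed huge page size. */
size_t hpage_round_up(size_t len)
{
    return (len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
}

/* Hypothetical helper: map an anonymous region aligned to HPAGE_SIZE and
 * hint that it should be backed by THP. Stores the usable length in
 * *mapped_len and returns the region, or NULL if mmap() fails. */
void *alloc_thp_hinted(size_t len, size_t *mapped_len)
{
    size_t want = hpage_round_up(len);
    size_t total = want + HPAGE_SIZE; /* slack so the start can be aligned */
    uintptr_t aligned;
    size_t head, tail;
    char *raw;

    raw = mmap(NULL, total, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    aligned = ((uintptr_t)raw + HPAGE_SIZE - 1) & ~(uintptr_t)(HPAGE_SIZE - 1);
    head = aligned - (uintptr_t)raw;
    tail = total - head - want;

    /* Trim the unaligned head and the unused tail of the mapping. */
    if (head)
        munmap(raw, head);
    if (tail)
        munmap((char *)aligned + want, tail);

    /* Advisory only: a failure here just means no THP for this range. */
    (void)madvise((void *)aligned, want, MADV_HUGEPAGE);

    *mapped_len = want;
    return (void *)aligned;
}
```

The region is released with munmap() using the length returned in *mapped_len.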

Note also that even Oracle, if run in KVM guests (and I'd recommend
always running it in KVM guests), performs almost _half_ as fast if THP
is not enabled in the _host_.

About Oracle I think it's more a case that THP cannot help Oracle
because Oracle already uses hugetlbfs, which is guaranteed to be equal
to or faster than THP: it has the memory preallocated to match the SGA,
it has 1GByte page support, it doesn't need compaction, and Oracle is
not constrained by the restriction of preallocating the memory at
boot. Still I've no idea how THP could hurt Oracle, unless they got a
buggy implementation in their version of the kernels... or unless they
entirely missed the feature in some of their kernels.

I can't recall any outstanding THP related bugreport from Oracle; feel
free to search the kernel lists and point me to an open bugreport from
Oracle about THP hurting Oracle performance so I can have a look at
some data. I'm only aware of the generic allegations on their
website.

My guess was that, THP being a tradeoff, from a purely risk-off
perspective (they can't benefit from the winning side of the trade
anyway as they already rightfully optimized everything with hugetlbfs)
it's fair enough for Oracle to recommend disabling THP for Oracle
(including when it's run in KVM guests). Even then I think they
should simply use the prctl if that's the reason for their
recommendation, so other processes like java and other apps can still
run much faster with THP (especially in guests, and that applies to all
hypervisors including proprietary ones; it's a hardware issue with
EPT/NPT, and software can do nothing but use THP on both guest and host
to optimize).

The alternate malloc allocator also should consider disabling THP with
MADV_NOHUGEPAGE if it's totally relying on MADV_DONTNEED in order to
free up memory in a 4k fragmented way and the user needs a low memory
footprint. That's what MADV_NOHUGEPAGE is for. If Kirill's
split_huge_page change goes in, such a MADV_DONTNEED will
generate an even more extreme memory loss in the alternate malloc
allocator, because currently khugepaged won't collapse the hugepage if
the ptes of the surrounding 4k pages within the 2m hugepage are not
young (young as in pte_young), i.e. if there's some memory pressure,
the 4k hole will remain a hole and khugepaged will skip it and the
memory can potentially remain free forever. After the proposed
split_huge_page change, there will be no way MADV_DONTNEED can free up
any memory at all within a 2MB hugepage, no matter the memory pressure.
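
The allocator pattern described above, an arena that opts out of THP and then frees memory one 4k page at a time, can be sketched as follows. The helper names are hypothetical and this is not taken from any particular malloc implementation. It relies only on the documented behaviour that, for a private anonymous mapping, a page released with MADV_DONTNEED reads back as zeroes on the next access.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map an allocator arena and ask the kernel not to
 * back it with THP, so that freeing single base pages keeps working at
 * base-page granularity. Returns NULL if mmap() fails. */
char *arena_create_nohuge(size_t len)
{
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    /* Fails with EINVAL on kernels without THP; nothing to disable then. */
    (void)madvise(p, len, MADV_NOHUGEPAGE);
    return p;
}

/* Hypothetical helper: hand one base page of the arena back to the
 * kernel. Returns 0 on success, -1 on error or out-of-range index. */
int release_base_page(char *region, size_t region_len, size_t page_index)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    if (page_index >= region_len / page)
        return -1;
    return madvise(region + page_index * page, page, MADV_DONTNEED);
}
```

Without the MADV_NOHUGEPAGE hint the same release_base_page() call would land inside a 2MB hugepage, which is the case the paragraph above is concerned with.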

Now changing topic to a technical issue with redis. redis uses
fork() to create a readonly snapshot, then in the child it writes the
readonly data in memory to the disk. What happens is the parent still
writes to the memory while the child is snapshotting to the disk. So
during this snapshotting time, with THP each write redis does in the
parent results in a 2MByte allocation and 4MByte of memory accessed
by the CPU, instead of a 4kbyte allocation and 8kbyte of memory
accessed by the CPU. The writes are randomly scattered across the whole
address space. In short, during the snapshotting each write gets 512
times higher latency, more L1/L2 cache is destroyed and the amount of
memory usage increases almost 512 times. There's no way the faster TLB
miss benefits and the larger TLB can offset that cost in this special
load, and we're not even accounting for the compaction cost.
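
The numbers above follow from simple arithmetic: a copy-on-write fault allocates one page and touches two (it reads the source and writes the copy), and a 2MByte page is 512 times a 4kbyte page. A small sketch of that calculation, with hypothetical names and with the 4 KiB / 2 MiB sizes taken from the text:

```c
#include <stddef.h>

#define BASE_PAGE (4UL * 1024)        /* 4kbyte base page, as in the text */
#define HUGE_PAGE (2UL * 1024 * 1024) /* 2MByte hugepage, as in the text */

/* Cost of one copy-on-write fault for a given page size. */
struct cow_cost {
    size_t allocated; /* bytes newly allocated for the private copy */
    size_t touched;   /* bytes accessed: source read plus copy written */
};

struct cow_cost cow_fault_cost(size_t page_size)
{
    struct cow_cost c;

    c.allocated = page_size;
    c.touched = 2 * page_size;
    return c;
}

/* How many times more data a hugepage COW fault handles than a base
 * page COW fault. */
size_t cow_amplification(void)
{
    return HUGE_PAGE / BASE_PAGE;
}
```

cow_fault_cost(BASE_PAGE) gives 4096 bytes allocated and 8192 touched, cow_fault_cost(HUGE_PAGE) gives 2097152 allocated and 4194304 touched, and cow_amplification() is 512.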

What redis I think really should do is to use userfaultfd write
protection tracking mode as soon as I finish writing it.

I doubt redis likes it if the amount of memory usage doubles during
snapshotting, but that can currently happen with fork() regardless of
THP.
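
The fork()-based snapshot model discussed here can be shown in a few lines. This is a toy illustration, not redis code, and the function name is hypothetical. The parent writes to the page after fork(), which triggers a copy-on-write fault and leaves the child with the old contents; a pipe makes sure the child only looks at the page after the parent's write has happened.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the byte the snapshotting child observed
 * after the parent modified the page, or -1 on error. */
int fork_snapshot_demo(void)
{
    int go[2];
    int status;
    int result = -1;
    pid_t pid;
    char *data = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (data == MAP_FAILED)
        return -1;
    data[0] = 'A'; /* state at snapshot time */

    if (pipe(go) < 0)
        goto out;

    pid = fork();
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        goto out;
    }
    if (pid == 0) {
        char token;

        /* Child: wait until the parent has written, then read the page. */
        close(go[1]);
        if (read(go[0], &token, 1) != 1)
            _exit(255);
        _exit((unsigned char)data[0]);
    }

    /* Parent: this write takes a copy-on-write fault. */
    close(go[0]);
    data[0] = 'B';
    if (write(go[1], "x", 1) != 1)
        result = -1;
    close(go[1]);

    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status) &&
        WEXITSTATUS(status) != 255)
        result = WEXITSTATUS(status);
out:
    munmap(data, 4096);
    return result;
}
```

The child reports 'A' even though the parent already holds 'B', and every page the parent dirties during the snapshot costs one such private copy, which is where the extra memory comes from.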

userfaultfd will make the maximal RAM utilization during snapshotting
configurable: once the limit is hit, the wrprotect faults will throttle
on the snapshot disk I/O gracefully. It can still take twice the amount
of RAM if it wants to, and in such a case it never risks having to
throttle on I/O, but it's not forced to, like it is now with fork().

Furthermore with userfaultfd redis won't need fork(), it will use
clone() instead. It won't have to duplicate all pagetables. The
wrprotect faults will talk directly to the userfaultfd thread, which
will copy the memory off to a private location and then unblock the
fault, which will just return to userland without having to do any
copy_page inside the kernel (the other thread will do the copy in
userland, potentially on another CPU, which can be guaranteed with CPU
pinning if needed), and the L1/L2 cache of the master redis process
that is trying to write to the memory will be totally unaffected (not
even the current 8k will be used).

Then it's up to redis if it wants to do userfaults with 4k or 2MByte
size; it's userland handling the page fault after all, and the
userfaultfd kernel code has no control over the size of the page fault.
If the readonly THP page was mapped by a trans_huge_pmd, when the UFFDIO
ioctl marks only 4k of it read-write (or any region that is not a
multiple of 2MBytes or not aligned to 2MBytes), the UFFDIO wrprotection
ioctl will take care of splitting the trans_huge_pmd. If the cost of
splitting a THP (with the proposed split_huge_page change it'll only
actually split the trans_huge_pmd) while marking a 4k region read-write
is still too much, redis can still use MADV_NOHUGEPAGE with userfaultfd
too.

My guess is that THP + userfaultfd write tracking doing 4k faults in
userland will work optimally for redis snapshotting (both with the
current split_huge_page and with the proposed change).

qemu is going to use the same model for KVM postcopy live
snapshotting, for use in COLO fault tolerance or other features.

Now until userfaultfd is capable of write protect tracking, we could
introduce a new MADV_....HUGEPAGE to tell the kernel that copy on
write faults must be done by splitting the hugepage and using 4k
pages. That will also fix it. I'm just not sure if it's worth it.

For now, redis should simply use MADV_NOHUGEPAGE (perhaps it already
does, I haven't checked).

Thanks,
Andrea


* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
  2015-08-24 20:12 Can we disable transparent hugepages for lack of a legitimate use case please? James Hartshorn
  2015-08-24 20:20 ` Bridgman, John
  2015-08-25  9:25 ` Konstantin Khlebnikov
@ 2015-09-03 19:33 ` Andi Kleen
  2 siblings, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2015-09-03 19:33 UTC (permalink / raw)
  To: James Hartshorn; +Cc: linux-mm@kvack.org

James Hartshorn <jhartshorn@connexity.com> writes:

Your report seems to completely lack any detail, like
kernel version, description of the problem, etc.

> I've been struggling with transparent hugepage performance issues, and
> can't seem to find anyone who actually uses it intentionally.
> Virtually every database that runs on linux however recommends
> disabling it or setting it to madvise. I'm referring to:

Please see if you can reproduce your problem on a recent mainline
kernel (there were a lot of compaction improvements recently,
which can be a source of issues with THP)

If yes then please submit a test case and it can be investigated.

If no then update.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


Thread overview: 14+ messages
2015-08-24 20:12 Can we disable transparent hugepages for lack of a legitimate use case please? James Hartshorn
2015-08-24 20:20 ` Bridgman, John
2015-08-24 20:46   ` James Hartshorn
2015-08-24 23:20     ` Theodore Ts'o
2015-09-10 16:45     ` Andrea Arcangeli
2015-09-10 17:02       ` Andres Freund
2015-09-14 12:37       ` Vlastimil Babka
2015-08-25  9:25 ` Konstantin Khlebnikov
2015-08-25  9:56   ` Vlastimil Babka
2015-09-01 22:26     ` David Rientjes
2015-09-02  8:55       ` Konstantin Khlebnikov
2015-09-02  9:06         ` Vlastimil Babka
2015-09-09 22:05       ` Andrea Arcangeli
2015-09-03 19:33 ` Andi Kleen
