* Can we disable transparent hugepages for lack of a legitimate use case please?
@ 2015-08-24 20:12 James Hartshorn
2015-08-24 20:20 ` Bridgman, John
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: James Hartshorn @ 2015-08-24 20:12 UTC (permalink / raw)
To: linux-mm@kvack.org
Hi,
I've been struggling with transparent hugepage performance issues, and can't seem to find anyone who actually uses it intentionally. Virtually every database that runs on linux however recommends disabling it or setting it to madvise. I'm referring to:
/sys/kernel/mm/transparent_hugepage/enabled
I asked on the internet http://unix.stackexchange.com/questions/201906/does-anyone-actually-use-and-benefit-from-transparent-huge-pages and got no responses there.
Independently I noticed
"sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled for lack of a legitimate use case. If you have one, please send an email to linux-mm@kvack.org."
And thought wow that's exactly what should be done to transparent hugepages.
Thoughts?
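A minimal sketch of how the current setting of that file can be read, assuming a Linux system with THP compiled in. The kernel lists every choice and puts the active one in square brackets (for example "always [madvise] never"); the helper name thp_active_value is made up for this illustration.

```shell
#!/bin/sh
# Print the active choice of a THP sysfs knob.
# $1: raw contents of the knob, e.g. "always [madvise] never"
# (thp_active_value is an illustrative name, not a kernel or distro tool)
thp_active_value() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

knob=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$knob" ]; then
    echo "THP enabled mode: $(thp_active_value "$(cat "$knob")")"
else
    echo "no $knob here (kernel without THP, or not Linux)"
fi
```

The same helper works on the defrag file in that directory, since it uses the same bracket convention.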

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: Bridgman, John @ 2015-08-24 20:20 UTC (permalink / raw)
To: James Hartshorn, linux-mm@kvack.org

We find it useful for GPU compute applications (APU I suppose, with GPU
access via IOMMUv2) working on large datasets.

I wouldn't have expected THP to find much use for databases -- those seem
to be more like graphics stacks where you have enough hints about future
usage to justify explicit management of pages. I thought of THP as "the
solution for everything else".

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: James Hartshorn @ 2015-08-24 20:46 UTC (permalink / raw)
To: Bridgman, John, linux-mm@kvack.org

As a general purpose sysadmin I've mostly struggled with its default being
always, if it were never (or possibly madvise?) then I think all the very
real performance problems would go away. Those who know they need it could
turn it on. I have begun looking into asking the distros to change this (is
it a distro choice?) but am not getting that far. Just to be clear the
default of always causes noticeable pauses of operation on almost all
databases, analogous to having a stop the world gc.

As for THP in APU type applications have you run into any JEMalloc defrag
performance issues? My research into THP issues indicates this is part of
the performance problem that manifests for databases.

Some more links to discussion about THP:

Postgresql
https://lwn.net/Articles/591723/

Postgresql
http://www.postgresql.org/message-id/20120821131254.1415a545@jekyl.davidgould.org

Mysql (tokudb)
https://dzone.com/articles/why-tokudb-hates-transparent

Redis
http://redis.io/topics/latency
http://antirez.com/news/84

Oracle
https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge

MongoDB
http://docs.mongodb.org/master/tutorial/transparent-huge-pages/

Couchbase
http://blog.couchbase.com/often-overlooked-linux-os-tweaks

Riak
http://underthehood.meltwater.com/blog/2015/04/14/riak-elasticsearch-and-numad-walk-into-a-red-hat/

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: Theodore Ts'o @ 2015-08-24 23:20 UTC (permalink / raw)
To: James Hartshorn; +Cc: Bridgman, John, linux-mm@kvack.org

Part of the problem with asking "Does anyone use THP" is that a lot of
people may be using THP without realizing it. That is, after all, the
whole point.

Some selected bits from running the command:

    sudo grep -e AnonHugePages /proc/*/smaps | awk '{ if($2>4) print $0} ' | awk -F "/" '{print $0; system("ps -fp " $3)} '

/proc/17297/smaps:AnonHugePages:    290816 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso    17297 17290  4 19:10 pts/6    00:00:05 qemu-system-x86_64 -enable-kvm -boo

/proc/2467/smaps:AnonHugePages:     92160 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso     2467  2347  0 09:49 ?        00:00:10 xfdesktop --display :0.0 --sm-clien

/proc/13446/smaps:AnonHugePages:     81920 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso    13446  2591  0 12:25 pts/0    00:00:11 mutt -f /home/tytso/imap/shared.mit

/proc/2603/smaps:AnonHugePages:     43008 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso     2603  2347  0 09:49 ?        00:00:01 /usr/bin/perl /usr/bin/parcimonie

/proc/9853/smaps:AnonHugePages:     20480 kB
UID        PID  PPID  C STIME TTY          TIME CMD
tytso     9853  2461  1 09:56 ?        00:07:01 /opt/google/chrome-beta/chrome --us

/proc/1622/smaps:AnonHugePages:     14336 kB
UID        PID  PPID  C STIME TTY          TIME CMD
root      1622  1567  0 09:49 tty7     00:03:09 /usr/bin/X :0 -seat seat0 -auth /va

Cheers,

					- Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
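A variation on the command above, as a rough sketch: instead of printing every mapping, it totals the AnonHugePages lines per process and prints one line per process that has any. It assumes a Linux /proc, and without root it only sees processes whose smaps file is readable. The function name sum_anon_huge_kb is invented for this example.

```shell
#!/bin/sh
# Sum the AnonHugePages lines of one smaps file (read from stdin), in kB.
# (sum_anon_huge_kb is an illustrative name.)
sum_anon_huge_kb() {
    awk '$1 == "AnonHugePages:" { total += $2 } END { print total + 0 }'
}

# One line per readable process that currently has THP-backed memory.
for smaps in /proc/[0-9]*/smaps; do
    [ -r "$smaps" ] || continue              # also skips an unmatched glob
    kb=$(sum_anon_huge_kb < "$smaps" 2>/dev/null) || continue
    [ "${kb:-0}" -gt 0 ] || continue
    pid=${smaps#/proc/}
    pid=${pid%/smaps}
    printf '%10s kB  pid %s  %s\n' "$kb" "$pid" \
        "$(tr '\0' ' ' < "/proc/$pid/cmdline" 2>/dev/null)"
done
```

Note that this reports any non-zero total, whereas the command in the mail filters individual mappings with a threshold of 4 kB.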

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: Andrea Arcangeli @ 2015-09-10 16:45 UTC (permalink / raw)
To: James Hartshorn; +Cc: Bridgman, John, linux-mm@kvack.org

On Mon, Aug 24, 2015 at 08:46:11PM +0000, James Hartshorn wrote:
> As a general purpose sysadmin I've mostly struggled with its default
> being always, if it were never (or possibly madvise?) then I think
> all the very real performance problems would go away. Those who
> know they need it could turn it on. I have begun looking into
> asking the distros to change this (is it a distro choice?) but am

My suggestion would be to:

1) identify exactly if it's a THP issue or a compaction issue,

2) if it's really a THP issue report it to the application developers
   to use the MADV_NOHUGEPAGE or the prctl to disable THP only for the
   app or the library.

If it's a compaction issue disabling THP sounds wrong to me and it
should be simply reported here as a bug.

> not getting that far. Just to be clear the default of always causes
> noticeable pauses of operation on almost all databases, analogous to
> having a stop the world gc. As for THP in APU type applications
> have you run into any JEMalloc defrag performance issues? My
> research into THP issues indicates this is part of the performance
> problem that manifests for databases. Some more links to discussion
> about THP: Postgresql https://lwn.net/Articles/591723/ Postgresql
> http://www.postgresql.org/message-id/20120821131254.1415a545@jekyl.davidgould.org

"and my interpretation was that it was trying to create hugepages from
scattered fragments"

This is a very old email, but I'm just taking it as an example because
this has to be a compaction issue. If you run into very visible hangs
that goes away by disabling THP, it can't be THP to blame. THP can
increase the latency jitter during page faults (real time sensitive
application could notice a 2MB clear_page vs a 4KB clear_page), but
not in a way that hangs a system and becomes visible to the user.

It's just very early compaction code was too aggressive and it got
fixed in the meanwhile.

Worst of all is that disabling THP can't solve compaction issues
because compaction still runs even after you disable THP (drivers and
slab can still use high order pages), so it'll just hide the problem.

To disable compaction in THP just run:

    echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

If you got a compaction problem, this will make it go away, but you'd
still have THP on.

Considering the amount of work that went in compaction (primarily to
make it less aggressive) and how old the email is, I doubt that
problem reported in the email could still happen with current kernels.

There's current work on linux-mm (primarily from Vlastimil and David)
to make compaction asynchronous. I don't like too much the initial
proposal of offloading compaction purely to khugepaged and
disconnected to the page faults. But it would be possible to make the
page fault wakeup a kernel daemon that compact hugepages in parallel
to the page fault requests. So then the pagefault latency would become
identical to when the defrag sysfs control is set to "madvise".

I think apps that use MADV_HUGEPAGE (like qemu) should still run
compaction synchronously though. For qemu losing several hugepages
because of async behavior of compaction, would be a major loss. It's
perfectly fine if it's slower at starting up as long as it gets as
many hugepages as it can.

I've seen other proposal floating around, there's definitely work in
this area to optimize compaction further.
Compaction is already much better now than in the very first version
that landed upstream so again those emails are not relevant anymore.

> Mysql (tokudb)
> https://dzone.com/articles/why-tokudb-hates-transparent

This seems a THP issue: unless the alternate malloc allocator starts
using MADV_NOHUGEPAGE, its memory loss would become extreme with the
split_huge_page pending changes from Kirill. There's little the kernel
can do about this, in fact Kirill's latest changes goes in the very
opposite direction of what's needed to reduce the memory footprint for
this MADV_DONTNEED 4kb case.

With current code however the best you can do is:

    echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

That will guarantee that khugepaged never increases the memory
footprint after a MADV_DONTNEED done by the alternate malloc
allocator. Just that will definitely stop to help with the
split_huge_page pending changes. You could consider testing that but
if the split_huge_page pending changes are merged, this tuning shall
disappear.

> Redis
> http://redis.io/topics/latency http://antirez.com/news/84 Oracle

I already covered redis in detail in previous email in this thread.
This is a legitimate THP issue and for now MADV_NOHUGEPAGE will take
care of that. If redis in the future could stop using fork() and use
clone()+userfaultfd for the snapshotting, then THP should be fine
enabled as it can control in userland the size of the wrprotect
faults.

> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge

At least this one document doesn't have the random reboots and
instability allegations that earlier of their documents talked about
(that I never seen here and I never had a report about... which made
me wonder why they were getting those reboots or instabilities and
which kernel they were actually using).

The very latest recent data (including a document on oracle.com) shows
a worst case 5-10% performance regression and like postgresql, they
should also consider trying again with:

    echo madvise >/sys/kernel/mm/transparent_hugepage/defrag

To see if that 5-10% worst case performance regression magically
disappears while keeping THP enabled. It'd at least help to know if
this is a THP issue or a compaction issue.

On a side note worth mentioning: Oracle has been very helpful to fix a
performance regression in O_DIRECT that materialized after THP was
merged, but that's fixed upstream for a while. Kirill's
split_huge_page pending changes will give O_DIRECT a further boost.
What's left to optimize is only barely measurable now with 2 fusion-IO
and massive I/O bandwidth and orasim, not the real Oracle database. We
actually couldn't measure any difference even from that optimization
in a real Oracle load that isn't 100% I/O bound, despite using the
hardware setup with massive I/O bandwidth required to reproduce it.

Note also that O_DIRECT currently performs identical with THP on or
off. Only Kirill's split_huge_page pending changes can give a further
small boost, disabling THP can't improve O_DIRECT performance.

I believe before disabling THP it should be identified where the
problem comes from... so if it's not a design issue like redis, we can
optimize it, like we did for the O_DIRECT case with Oracle's helpful
and appreciated contribution.

> MongoDB
> http://docs.mongodb.org/master/tutorial/transparent-huge-pages/

There's not much explanation here.

> Couchbase http://blog.couchbase.com/often-overlooked-linux-os-tweaks

"Couchbase Server can be negatively impacted by severe page allocation
delays when THP is enabled"

Like mentioned above, severe delays in page faults can only be
explained by compaction issues, trying with defrag = madvise is best.

> Riak
> http://underthehood.meltwater.com/blog/2015/04/14/riak-elasticsearch-and-numad-walk-into-a-red-hat/

This has not enough data to tell what the problem could be.

Thanks,
Andrea

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: Andres Freund @ 2015-09-10 17:02 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: James Hartshorn, Bridgman, John, linux-mm@kvack.org

On 2015-09-10 18:45:06 +0200, Andrea Arcangeli wrote:
> On Mon, Aug 24, 2015 at 08:46:11PM +0000, James Hartshorn wrote:
> > Some more links to discussion
> > about THP: Postgresql https://lwn.net/Articles/591723/ Postgresql
> > http://www.postgresql.org/message-id/20120821131254.1415a545@jekyl.davidgould.org
>
> "and my interpretation was that it was trying to create hugepages from
> scattered fragments"
>
> This is a very old email, but I'm just taking it as an example because
> this has to be a compaction issue. If you run into very visible hangs
> that goes away by disabling THP, it can't be THP to blame. THP can
> increase the latency jitter during page faults (real time sensitive
> application could notice a 2MB clear_page vs a 4KB clear_page), but
> not in a way that hangs a system and becomes visible to the user.
>
> It's just very early compaction code was too aggressive and it got
> fixed in the meanwhile.

There's still some slowdown (as of 4.0) in extreme postgres workloads
with THP and/or compaction enabled, but I've indeed not been able to
reproduce bad stalls or large (10%+) slowdowns with recent kernels.

Greetings,

Andres Freund
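One way to approach the "THP issue or compaction issue" question raised earlier in the thread is to watch the kernel's own counters while the workload stalls. The sketch below samples /proc/vmstat twice and prints the deltas. It assumes the kernel exposes thp_fault_alloc, thp_fault_fallback, compact_stall and compact_fail (counters that are missing are reported as 0); the INTERVAL variable and the vmstat_field helper are names made up for this example.

```shell
#!/bin/sh
# Extract one counter from /proc/vmstat-formatted text on stdin; 0 if absent.
# (vmstat_field is an illustrative name.)
vmstat_field() {
    awk -v name="$1" '$1 == name { print $2; found = 1 }
                      END { if (!found) print 0 }'
}

counters="thp_fault_alloc thp_fault_fallback compact_stall compact_fail"
interval=${INTERVAL:-1}    # seconds between samples; use longer for real runs

if [ -r /proc/vmstat ]; then
    before=$(cat /proc/vmstat)
    sleep "$interval"
    after=$(cat /proc/vmstat)
    for c in $counters; do
        b=$(printf '%s\n' "$before" | vmstat_field "$c")
        a=$(printf '%s\n' "$after" | vmstat_field "$c")
        echo "$c: +$((a - b)) in ${interval}s"
    done
else
    echo "no /proc/vmstat on this system"
fi
```

If compact_stall climbs while the application pauses, direct compaction is the more likely suspect; if only thp_fault_alloc moves, huge pages are being handed out without stalling for compaction.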

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: Vlastimil Babka @ 2015-09-14 12:37 UTC (permalink / raw)
To: Andrea Arcangeli, James Hartshorn
Cc: Bridgman, John, linux-mm@kvack.org, Kirill A. Shutemov

On 09/10/2015 06:45 PM, Andrea Arcangeli wrote:
>> >Mysql (tokudb)
>> >https://dzone.com/articles/why-tokudb-hates-transparent
> This seems a THP issue: unless the alternate malloc allocator starts
> using MADV_NOHUGEPAGE, its memory loss would become extreme with the
> split_huge_page pending changes from Kirill. There's little the kernel
> can do about this, in fact Kirill's latest changes goes in the very
> opposite direction of what's needed to reduce the memory footprint for
> this MADV_DONTNEED 4kb case.
>
> With current code however the best you can do is:
>
> echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>
> That will guarantee that khugepaged never increases the memory
> footprint after a MADV_DONTNEED done by the alternate malloc
> allocator. Just that will definitely stop to help with the
> split_huge_page pending changes. You could consider testing that but
> if the split_huge_page pending changes are merged, this tuning shall
> disappear.

I don't think it's that pessimistic after Kirill's patchset?
MADV_DONTNEED should still result in unmaps, which results in
split_huge_pmd. Then the THP is put in a shrinker list and will be
fully split in response to memory pressure, see:

[PATCHv10 34/36] thp: introduce deferred_split_huge_page()
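For a sense of what the max_ptes_none setting discussed above controls: it caps how many not-yet-mapped small pages khugepaged may fill in when it collapses a range into one huge page, so the worst-case extra memory per collapse is that value times the base page size. A small sketch of the arithmetic, assuming 4 kB base pages (as on x86-64); the helper name collapse_overhead_kb is invented for this example and the script only reads the knob.

```shell
#!/bin/sh
knob=/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

# Worst-case kB of previously unmapped memory one collapse may add.
# $1: max_ptes_none value, $2: base page size in kB (assumed 4 here)
# (collapse_overhead_kb is an illustrative name.)
collapse_overhead_kb() {
    echo $(( $1 * $2 ))
}

if [ -r "$knob" ]; then
    cur=$(cat "$knob")
    echo "max_ptes_none=$cur: up to $(collapse_overhead_kb "$cur" 4) kB extra per collapsed huge page"
    # Applying the value from the mail needs root:
    #   echo 0 > "$knob"
else
    echo "no $knob on this system"
fi
```

With a value of 511 and 4 kB pages this comes to 2044 kB, that is, a 2 MB huge page backed almost entirely by memory the allocator had already given back; with 0 the overhead is nil.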

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: Konstantin Khlebnikov @ 2015-08-25 9:25 UTC (permalink / raw)
To: James Hartshorn; +Cc: linux-mm@kvack.org, Kirill A. Shutemov

On Mon, Aug 24, 2015 at 11:12 PM, James Hartshorn
<jhartshorn@connexity.com> wrote:
> [...]
> And thought wow that's exactly what should be done to transparent hugepages.
>
> Thoughts?

THP works very well when system has a lot of free memory.
Probably default should be weakened to "only if we have tons of free memory".
For example allocate THP pages atomically, only if buddy allocator already
has huge pages. Also them could be pre-zeroed in background.

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: Vlastimil Babka @ 2015-08-25 9:56 UTC (permalink / raw)
To: Konstantin Khlebnikov, James Hartshorn
Cc: linux-mm@kvack.org, Kirill A. Shutemov, Andrea Arcangeli, David Rientjes, Mel Gorman

On 08/25/2015 11:25 AM, Konstantin Khlebnikov wrote:
> On Mon, Aug 24, 2015 at 11:12 PM, James Hartshorn
> <jhartshorn@connexity.com> wrote:
>> [...]
>> Thoughts?

[+ Cc's]

> THP works very well when system has a lot of free memory.
> Probably default should be weakened to "only if we have tons of free memory".
> For example allocate THP pages atomically, only if buddy allocator already
> has huge pages. Also them could be pre-zeroed in background.

I've been proposing series that try to move more THP allocation activity from
the page faults into khugepaged, but no success yet.

Maybe we should just start with changing the default of
/sys/kernel/mm/transparent_hugepage/defrag to "madvise". This would remove the
reclaim and compaction for page faults and quickly fallback to order-0 pages.
The compaction is already crippled enough there with the GFP_TRANSHUGE
specific decisions in __alloc_pages_slowpath(). I've noticed it failing
miserably in the transhuge-stress recently, so it seems it's not worth to try
at all. With changing the default we can kill those GFP_TRANSHUGE checks and
assume that whoever uses the madvise does actually want to try harder.

Of course that does nothing about zeroing. I don't know how huge issue is that
one?
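Until the default changes, the same setting can be tried by hand. A cautious sketch, assuming the usual sysfs layout: it first checks that "madvise" is among the choices the running kernel lists for defrag, and only writes when explicitly asked to. The APPLY variable and the knob_offers helper are names made up for this illustration; writing the file needs root and the change does not survive a reboot.

```shell
#!/bin/sh
defrag=/sys/kernel/mm/transparent_hugepage/defrag

# Succeed if choice $2 appears in knob contents $1,
# e.g. knob_offers '[always] madvise never' madvise
# (knob_offers is an illustrative name.)
knob_offers() {
    for choice in $(printf '%s\n' "$1" | tr -d '[]'); do
        [ "$choice" = "$2" ] && return 0
    done
    return 1
}

if [ ! -r "$defrag" ]; then
    echo "no $defrag on this system"
elif ! knob_offers "$(cat "$defrag")" madvise; then
    echo "this kernel does not list madvise for defrag: $(cat "$defrag")"
elif [ "${APPLY:-0}" = 1 ] && [ -w "$defrag" ]; then
    echo "before: $(cat "$defrag")"
    echo madvise > "$defrag"
    echo "after:  $(cat "$defrag")"
else
    echo "current: $(cat "$defrag") (run as root with APPLY=1 to switch to madvise)"
fi
```

Run it once without APPLY to see the current state, then with APPLY=1 as root while the problematic workload is active to see whether the stalls disappear with THP still enabled.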

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: David Rientjes @ 2015-09-01 22:26 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Konstantin Khlebnikov, James Hartshorn, linux-mm@kvack.org, Kirill A. Shutemov, Andrea Arcangeli, Mel Gorman

On Tue, 25 Aug 2015, Vlastimil Babka wrote:

> > THP works very well when system has a lot of free memory.
> > Probably default should be weakened to "only if we have tons of free
> > memory".
> > For example allocate THP pages atomically, only if buddy allocator already
> > has huge pages. Also them could be pre-zeroed in background.
>
> I've been proposing series that try to move more THP allocation activity from
> the page faults into khugepaged, but no success yet.
>
> Maybe we should just start with changing the default of
> /sys/kernel/mm/transparent_hugepage/defrag to "madvise".

I would need to revert this internally to avoid performance degradation, I
believe others would report the same.

> This would remove the
> reclaim and compaction for page faults and quickly fallback to order-0 pages.
> The compaction is already crippled enough there with the GFP_TRANSHUGE
> specific decisions in __alloc_pages_slowpath(). I've noticed it failing
> miserably in the transhuge-stress recently, so it seems it's not worth to try
> at all. With changing the default we can kill those GFP_TRANSHUGE checks and
> assume that whoever uses the madvise does actually want to try harder.
>

I think the work that is being done on moving compaction to khugepaged as
well as periodic synchronous compaction of all memory is the way to go to
avoid lengthy stalls during fault.

> Of course that does nothing about zeroing. I don't know how huge issue is that
> one?
>

I don't believe it is an issue that cannot be worked around in userspace
either with MADV_NOHUGEPAGE or PR_SET_THP_DISABLE.

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: Konstantin Khlebnikov @ 2015-09-02 8:55 UTC (permalink / raw)
To: David Rientjes
Cc: Vlastimil Babka, James Hartshorn, linux-mm@kvack.org, Kirill A. Shutemov, Andrea Arcangeli, Mel Gorman

On Wed, Sep 2, 2015 at 1:26 AM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 25 Aug 2015, Vlastimil Babka wrote:
>
>> > THP works very well when system has a lot of free memory.
>> > Probably default should be weakened to "only if we have tons of free
>> > memory".
>> > For example allocate THP pages atomically, only if buddy allocator already
>> > has huge pages. Also them could be pre-zeroed in background.
>>
>> I've been proposing series that try to move more THP allocation activity from
>> the page faults into khugepaged, but no success yet.
>>
>> Maybe we should just start with changing the default of
>> /sys/kernel/mm/transparent_hugepage/defrag to "madvise".
>
> I would need to revert this internally to avoid performance degradation, I
> believe others would report the same.

What about adding new mode "guess" -- something between always and madvise?

In this mode kernel tries to avoid performance impact for non-madvised vmas and
allocates 0-order pages if hugepages are not available right now.
(for example do allocations with GFP_NOWAIT)

I think we'll get all benefits without losing performance.

* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
From: Vlastimil Babka @ 2015-09-02 9:06 UTC (permalink / raw)
To: Konstantin Khlebnikov, David Rientjes
Cc: James Hartshorn, linux-mm@kvack.org, Kirill A. Shutemov, Andrea Arcangeli, Mel Gorman

On 2.9.2015 10:55, Konstantin Khlebnikov wrote:
> On Wed, Sep 2, 2015 at 1:26 AM, David Rientjes <rientjes@google.com> wrote:
>> On Tue, 25 Aug 2015, Vlastimil Babka wrote:
>>> Maybe we should just start with changing the default of
>>> /sys/kernel/mm/transparent_hugepage/defrag to "madvise".
>>
>> I would need to revert this internally to avoid performance degradation, I
>> believe others would report the same.
>
> What about adding new mode "guess" -- something between always and madvise?
>
> In this mode kernel tries to avoid performance impact for non-madvised vmas and
> allocates 0-order pages if hugepages are not available right now.
> (for example do allocations with GFP_NOWAIT)

That's exactly what happens when /sys/kernel/mm/transparent_hugepage/defrag is
set to "madvise".

> I think we'll get all benefits without losing performance.
* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
2015-09-01 22:26 ` David Rientjes
2015-09-02 8:55 ` Konstantin Khlebnikov
@ 2015-09-09 22:05 ` Andrea Arcangeli
1 sibling, 0 replies; 14+ messages in thread
From: Andrea Arcangeli @ 2015-09-09 22:05 UTC (permalink / raw)
To: David Rientjes
Cc: Vlastimil Babka, Konstantin Khlebnikov, James Hartshorn, linux-mm@kvack.org, Kirill A. Shutemov, Mel Gorman

On Tue, Sep 01, 2015 at 03:26:34PM -0700, David Rientjes wrote:
> I don't believe it is an issue that cannot be worked around in userspace
> either with MADV_NOHUGEPAGE or PR_SET_THP_DISABLE.

Agreed, for the legit cases where THP can hurt, the bug report should be sent to the databases so they can use one of the two features above.

It really depends on the database whether it hurts or not; in fact the majority of databases benefit from THP (others can provide the exact details). So on average it's still a net gain even for databases. I'm aware of a single db case where THP hurts, and it makes perfect sense why it hurts (and it's not Oracle): redis, and only during snapshotting, see the end of the email.

Setting the THP global tweak to "madvise" was designed for embedded systems where losing even 4k of RAM matters; "madvise" should be more about the memory footprint than the performance. qemu-kvm uses MADV_HUGEPAGE so it is enabled even when the global setting is "madvise", exactly because with qemu the memory footprint won't change regardless of whether THP is enabled or disabled. If you're very low on memory, "madvise" makes sense just in case.

Note also that even Oracle, if run in KVM guests (and I'd recommend always running it in KVM guests), runs at almost _half_ the speed if THP is not enabled in the _host_.
About Oracle, I think it's more a case that THP cannot help Oracle, because Oracle already uses hugetlbfs, which is guaranteed to be equal to or faster than THP: it has the memory preallocated matching the SGA, it has 1GByte page support, it doesn't need compaction, and it isn't constrained by the restriction of having to preallocate the memory at boot. Still, I've no idea how THP could hurt Oracle, unless they got a buggy implementation in their version of the kernels... or unless they entirely missed the feature in some of their kernels. I can't recall any outstanding THP-related bug report from Oracle; feel free to search the kernel lists and point me to an open bug report from Oracle about THP hurting Oracle performance, so I can have a look at some data. I'm only aware of the generic allegations on their website.

My guess was that, THP being a tradeoff, from a purely risk-off perspective (they can't benefit from the winning side of the trade anyway, as they already rightfully optimized everything with hugetlbfs) it's fair enough for Oracle to recommend disabling THP for Oracle (including when it's run in KVM guests). Even then I think they should simply use the prctl if that's the reason for their recommendation, so other processes like java and other apps can still run much faster with THP (especially in a guest, and that applies to all hypervisors including proprietary ones: it's a hardware issue with EPT/NPT, and software can do nothing but use THP on both guest and host to optimize).

The alternate malloc allocator should also consider disabling THP with MADV_NOHUGEPAGE if it's totally relying on MADV_DONTNEED in order to free up memory in a 4k fragmented way and the user needs a low memory footprint. That's what MADV_NOHUGEPAGE is for.
If Kirill's split_huge_page change goes in, such a MADV_DONTNEED will generate an even more extreme memory loss in the alternate malloc allocator. Currently khugepaged won't collapse the hugepage if the ptes of the surrounding 4k pages within the 2MB hugepage are not young (young as in pte_young), i.e. if there's some memory pressure, the 4k hole will remain a hole, khugepaged will skip it, and the memory can potentially remain free forever. After the proposed split_huge_page change, there will be no way MADV_DONTNEED can free up any memory at all within a 2MB hugepage, no matter the memory pressure.

Now changing topic to a technical issue with redis. redis uses fork() to create a readonly snapshot, then in the child it writes the readonly data in memory to the disk. What happens is that the parent still writes to the memory while the child is snapshotting to the disk. So during this snapshotting time, with THP each write redis does in the parent results in a 2MByte allocation and 4MByte of memory accessed by the CPU, instead of a 4kbyte allocation and 8kbyte of memory accessed by the CPU. The writes are randomly scattered across the whole address space. In short, during the snapshotting each write gets 512 times higher latency, more L1/L2 cache is destroyed, and the amount of memory used increases almost 512 times. There's no way the faster TLB miss benefits and the larger TLB can offset that cost in this special load, and we're not even accounting for the compaction cost.

What I think redis really should do is use the userfaultfd write protection tracking mode as soon as I finish writing it. I doubt redis likes it if the amount of memory used doubles during snapshotting, but that can currently happen with fork() regardless of THP. userfaultfd will make the maximal RAM utilization during snapshotting configurable: once the limit hits, the wrprotect faults will throttle on the snapshot disk I/O gracefully.
It can still take twice the amount of RAM if it wants to, and in that case it never risks having to throttle on I/O, but it's not forced to, like it is now with fork().

Furthermore, with userfaultfd redis won't have to fork(); it will use clone() instead. It won't have to duplicate all the pagetables. The wrprotect faults will talk directly to the userfaultfd thread, which will copy the memory off to a private location and then unblock the fault, which will just return to userland without having to do any copy_page inside the kernel (the other thread will do the copy in userland, potentially on another CPU, which can be guaranteed with CPU pinning if needed). The L1/L2 cache of the master redis process that is trying to write to the memory will be totally unaffected (not even the current 8k will be used).

Then it's up to redis whether it wants to do userfaults with 4k or 2MByte size; it's userland handling the page fault after all, and the userfaultfd kernel code has no control over the size of the page fault. If the readonly THP page was mapped by a trans_huge_pmd, when the UFFDIO ioctl marks only 4k of it read-write (or any region that is not a multiple of 2MBytes or not aligned to 2MBytes), the UFFDIO wrprotection ioctl will take care of splitting the trans_huge_pmd. If the cost of splitting a THP while marking a 4k region read-write is still too much (with the proposed split_huge_page change it'll only actually split the trans_huge_pmd), redis can still use MADV_NOHUGEPAGE with userfaultfd too. My guess is that THP + userfaultfd write tracking doing 4k faults in userland will work optimally for redis snapshotting (both with the current split_huge_page and with the proposed change).

qemu is going to use the same model for KVM postcopy live snapshotting, for use in COLO fault tolerance or other features.
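The redis snapshot figures given earlier in this message (2MByte allocated and 4MByte touched per write with THP, against 4kbyte and 8kbyte without) can be checked with a few lines of arithmetic. This is only a sketch of the reasoning, assuming 4 KiB base pages and 2 MiB hugepages as on x86-64, and assuming a copy-on-write fault reads the whole source page and writes the whole destination page. The function names are hypothetical.

```python
from typing import Tuple

BASE_PAGE = 4 * 1024           # 4 KiB base page (assumed, x86-64)
HUGE_PAGE = 2 * 1024 * 1024    # 2 MiB transparent hugepage (assumed, x86-64)


def cow_fault_cost(page_size: int) -> Tuple[int, int]:
    """Bytes allocated and bytes touched by one copy-on-write fault."""
    allocated = page_size        # one new private page for the writer
    touched = 2 * page_size      # read the shared page, write the private copy
    return allocated, touched


def amplification(huge: int = HUGE_PAGE, base: int = BASE_PAGE) -> int:
    """How many times more memory a hugepage fault copies than a base-page fault."""
    return huge // base


if __name__ == "__main__":
    print("4k fault :", cow_fault_cost(BASE_PAGE))
    print("THP fault:", cow_fault_cost(HUGE_PAGE))
    print("ratio    :", amplification())
```

The ratio of 2 MiB to 4 KiB is where the factor of 512 in the message comes from. It is an upper bound on the extra memory and copy work per scattered write, not a measured latency.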
Now, until userfaultfd is capable of write protect tracking, we could introduce a new MADV_....HUGEPAGE to tell the kernel that copy on write faults must be done by splitting the hugepage and using 4k pages. That will also fix it. I'm just not sure if it's worth it. For now, redis should simply use MADV_NOHUGEPAGE (perhaps it already does, I haven't checked).

Thanks,
Andrea
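The two userspace workarounds named in this subthread, MADV_NOHUGEPAGE for a single mapping and PR_SET_THP_DISABLE for a whole process, can be sketched as follows. This is an illustration under stated assumptions, not what redis or any database actually ships: it assumes Linux, Python 3.8 or later (for `mmap.madvise`), and the constant values from the Linux uapi headers (`PR_SET_THP_DISABLE` = 41, `MADV_NOHUGEPAGE` = 15). The helper names are hypothetical.

```python
import ctypes
import mmap

PR_SET_THP_DISABLE = 41                                   # <linux/prctl.h>
MADV_NOHUGEPAGE = getattr(mmap, "MADV_NOHUGEPAGE", 15)    # <asm-generic/mman-common.h>


def disable_thp_for_region(region: mmap.mmap) -> bool:
    """Ask the kernel not to back this mapping with hugepages; True on success."""
    try:
        region.madvise(MADV_NOHUGEPAGE)
    except OSError:
        # For example EINVAL on a kernel built without THP support.
        return False
    return True


def disable_thp_for_process() -> bool:
    """Set the per-process 'THP disable' flag via prctl(2); True on success."""
    libc = ctypes.CDLL(None, use_errno=True)
    # prctl() is variadic, so pass every argument as an explicit unsigned long.
    rc = libc.prctl(
        PR_SET_THP_DISABLE,
        ctypes.c_ulong(1),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
        ctypes.c_ulong(0),
    )
    return rc == 0


if __name__ == "__main__":
    heap = mmap.mmap(-1, 4 * 1024 * 1024)   # anonymous 4 MiB mapping
    print("region opt-out :", disable_thp_for_region(heap))
    print("process opt-out:", disable_thp_for_process())
    heap.close()
```

The per-region call fits an allocator that frees memory in 4k pieces with MADV_DONTNEED, while the process-wide flag fits an application that wants THP off for itself without touching the global sysfs setting.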
* Re: Can we disable transparent hugepages for lack of a legitimate use case please?
2015-08-24 20:12 Can we disable transparent hugepages for lack of a legitimate use case please? James Hartshorn
2015-08-24 20:20 ` Bridgman, John
2015-08-25 9:25 ` Konstantin Khlebnikov
@ 2015-09-03 19:33 ` Andi Kleen
2 siblings, 0 replies; 14+ messages in thread
From: Andi Kleen @ 2015-09-03 19:33 UTC (permalink / raw)
To: James Hartshorn; +Cc: linux-mm@kvack.org

James Hartshorn <jhartshorn@connexity.com> writes:

Your report seems to completely lack any detail, like kernel version, description of the problem, etc.

> I've been struggling with transparent hugepage performance issues, and
> can't seem to find anyone who actually uses it intentionally.
> Virtually every database that runs on linux however recommends
> disabling it or setting it to madvise. I'm referring to:

Please see if you can reproduce your problem on a recent mainline kernel (there were a lot of compaction improvements recently, which can be a source of issues with THP).

If yes then please submit a test case and it can be investigated. If no then update.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
end of thread, other threads:[~2015-09-14 12:37 UTC | newest]

Thread overview: 14+ messages
2015-08-24 20:12 Can we disable transparent hugepages for lack of a legitimate use case please? James Hartshorn
2015-08-24 20:20 ` Bridgman, John
2015-08-24 20:46 ` James Hartshorn
2015-08-24 23:20 ` Theodore Ts'o
2015-09-10 16:45 ` Andrea Arcangeli
2015-09-10 17:02 ` Andres Freund
2015-09-14 12:37 ` Vlastimil Babka
2015-08-25 9:25 ` Konstantin Khlebnikov
2015-08-25 9:56 ` Vlastimil Babka
2015-09-01 22:26 ` David Rientjes
2015-09-02 8:55 ` Konstantin Khlebnikov
2015-09-02 9:06 ` Vlastimil Babka
2015-09-09 22:05 ` Andrea Arcangeli
2015-09-03 19:33 ` Andi Kleen