* [LSF/MM TOPIC] Support for 1GB THP @ 2016-03-01 7:09 Matthew Wilcox 2016-03-01 10:25 ` [Lsf-pc] " Jan Kara 2016-03-01 12:20 ` Kirill A. Shutemov 0 siblings, 2 replies; 14+ messages in thread From: Matthew Wilcox @ 2016-03-01 7:09 UTC (permalink / raw) To: lsf-pc; +Cc: linux-fsdevel, linux-mm There are a few issues around 1GB THP support that I've come up against while working on DAX support that I think may be interesting to discuss in person. - Do we want to add support for 1GB THP for anonymous pages? DAX support is driving the initial 1GB THP support, but would anonymous VMAs also benefit from 1GB support? I'm not volunteering to do this work, but it might make an interesting conversation if we can identify some users who think performance would be better if they had 1GB THP support. - Latency of a major page fault. According to various public reviews, main memory bandwidth is about 30GB/s on a Core i7-5960X with 4 DDR4 channels. I think people are probably fairly unhappy about doing only 30 page faults per second. So maybe we need a more complex scheme to handle major faults where we insert a temporary 2MB mapping, prepare the other 2MB pages in the background, then merge them into a 1GB mapping when they're completed. - Cache pressure from 1GB page support. If we're using NT stores, they bypass the cache, and all should be good. But if there are architectures that support THP and not NT stores, zeroing a page is just going to obliterate their caches. Other topics that might interest people from a VM/FS point of view: - Uses for (or replacement of) the radix tree. We're currently looking at using the radix tree with DAX in order to reduce the number of calls into the filesystem. That's leading to various enhancements to the radix tree, such as support for a lock bit for exceptional entries (Neil Brown), and support for multi-order entries (me). 
Is the (enhanced) radix tree the right data structure to be using for this brave new world of huge pages in the page cache, or should we be looking at some other data structure like an RB-tree? - Can we get rid of PAGE_CACHE_SIZE now? Finally? Pretty please? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 14+ messages in thread
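The 30-faults-per-second figure above follows directly from the quoted bandwidth. A quick back-of-the-envelope sketch (the 30 GB/s number is taken from the message; a real major fault would add allocation and page-table overhead on top of the streaming cost):

```python
# Rough upper bound on major-fault throughput when each fault must zero
# (or fill) the whole mapping at a given memory bandwidth.
GB = 1 << 30
MB = 1 << 20

def faults_per_second(bandwidth_bytes, page_bytes):
    """Best-case faults/sec if every fault streams page_bytes of data."""
    return bandwidth_bytes / page_bytes

bw = 30 * GB  # bandwidth quoted above for a Core i7-5960X with 4 DDR4 channels
print(faults_per_second(bw, 1 * GB))   # 1GB pages -> 30 faults/sec
print(faults_per_second(bw, 2 * MB))   # 2MB pages -> 15,360 faults/sec
print(faults_per_second(bw, 4096))     # 4KB pages -> ~7.9M faults/sec
```

This is why the message suggests a temporary 2MB mapping: the 2MB fault rate is three orders of magnitude less painful than the 1GB one.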
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 7:09 [LSF/MM TOPIC] Support for 1GB THP Matthew Wilcox @ 2016-03-01 10:25 ` Jan Kara 2016-03-01 11:00 ` Mel Gorman 2016-03-01 21:44 ` Matthew Wilcox 2016-03-01 12:20 ` Kirill A. Shutemov 1 sibling, 2 replies; 14+ messages in thread From: Jan Kara @ 2016-03-01 10:25 UTC (permalink / raw) To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-mm Hi, On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > There are a few issues around 1GB THP support that I've come up against > while working on DAX support that I think may be interesting to discuss > in person. > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > is driving the initial 1GB THP support, but would anonymous VMAs also > benefit from 1GB support? I'm not volunteering to do this work, but > it might make an interesting conversation if we can identify some users > who think performance would be better if they had 1GB THP support. Some time ago I was thinking about 1GB THP and I was wondering: What is the motivation for 1GB pages for persistent memory? Is it the savings in memory used for page tables? Or is it about the cost of fault? If it is mainly about the fault cost, won't some fault-around logic (i.e. filling more PMD entries in one PMD fault) go a long way towards reducing fault cost without some complications? > - Latency of a major page fault. According to various public reviews, > main memory bandwidth is about 30GB/s on a Core i7-5960X with 4 > DDR4 channels. I think people are probably fairly unhappy about > doing only 30 page faults per second. So maybe we need a more complex > scheme to handle major faults where we insert a temporary 2MB mapping, > prepare the other 2MB pages in the background, then merge them into > a 1GB mapping when they're completed. Yeah, here is one of the complications I have mentioned above ;) > - Cache pressure from 1GB page support. 
If we're using NT stores, they > bypass the cache, and all should be good. But if there are > architectures that support THP and not NT stores, zeroing a page is > just going to obliterate their caches. Even doing fsync() - and thus flushing all cache lines associated with a 1GB page - is likely going to take a noticeable chunk of time. The granularity of cache flushing in the kernel is another thing that makes me somewhat cautious about 1GB pages. > Other topics that might interest people from a VM/FS point of view: > > - Uses for (or replacement of) the radix tree. We're currently > looking at using the radix tree with DAX in order to reduce the number > of calls into the filesystem. That's leading to various enhancements > to the radix tree, such as support for a lock bit for exceptional > entries (Neil Brown), and support for multi-order entries (me). > Is the (enhanced) radix tree the right data structure to be using > for this brave new world of huge pages in the page cache, or should > we be looking at some other data structure like an RB-tree? I was also thinking whether we wouldn't be better off with some other data structure than the radix tree for DAX. And I didn't really find anything that I'd be satisfied with. The main advantages of the radix tree I see are: it is of constant depth, it supports lockless lookups, it is relatively simple (although with the additions we'd need, this advantage slowly vanishes), and it is pretty space efficient for common cases. For your multi-order entries I was wondering whether we shouldn't relax the requirement that all nodes have the same number of slots - e.g. we could have the number of slots vary with node depth so that PMD and eventually PUD multi-order slots end up being a single entry at the appropriate radix tree level. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
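Jan's concern about flush granularity can be sized with simple arithmetic. A sketch (the 64-byte cache line is typical for x86; the per-line cost below is an invented placeholder for illustration, not a measured number):

```python
# How many cache lines must be written back to fsync one fully dirtied
# 1GB DAX page when the kernel cannot track which lines were updated.
CACHE_LINE = 64        # bytes, typical x86 cache line
GB = 1 << 30

lines_per_1g = GB // CACHE_LINE
print(lines_per_1g)    # 16,777,216 lines per 1GB page

# Even at an assumed (illustrative) 10ns per flushed line, one full
# 1GB writeback costs on the order of:
assumed_ns_per_line = 10
print(lines_per_1g * assumed_ns_per_line / 1e9, "seconds")  # ~0.17s
```

Whatever the real per-line cost turns out to be, the line count alone shows why flushing at 1GB granularity is worrying.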
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 10:25 ` [Lsf-pc] " Jan Kara @ 2016-03-01 11:00 ` Mel Gorman 2016-03-01 11:51 ` Mel Gorman 2016-03-01 21:44 ` Matthew Wilcox 1 sibling, 1 reply; 14+ messages in thread From: Mel Gorman @ 2016-03-01 11:00 UTC (permalink / raw) To: Jan Kara; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, lsf-pc On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > Hi, > > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > There are a few issues around 1GB THP support that I've come up against > > while working on DAX support that I think may be interesting to discuss > > in person. > > > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > > is driving the initial 1GB THP support, but would anonymous VMAs also > > benefit from 1GB support? I'm not volunteering to do this work, but > > it might make an interesting conversation if we can identify some users > > who think performance would be better if they had 1GB THP support. > > Some time ago I was thinking about 1GB THP and I was wondering: What is the > motivation for 1GB pages for persistent memory? Is it the savings in memory > used for page tables? Or is it about the cost of fault? > If anything, the cost of the fault is going to suck as a 1G allocation and zeroing is required even if the application only needs 4K. It's by no means a universal win. The savings are in page table usage and TLB miss cost reduction and TLB footprint. For anonymous memory, it's not considered to be worth it because the cost of allocating the page is so high even if it works. There is no guarantee it'll work as fragmentation avoidance only works on the 2M boundary. It's worse when files are involved because there is a write-multiplication effect when huge pages are used.
I'm highly skeptical that THP for persistent memory is even worthwhile once the write multiplication factors and allocation costs are taken into consideration. I was surprised overall that it was even attempted before basic features of persistent memory were completed. I felt that it should have been avoided until the 4K case was as fast as possible and hitting problems where TLB was the limiting factor. Given that I recently threw in the towel over the cost of 2M allocations, let alone 1G translations, I'm highly skeptical that 1G anonymous pages are worth the cost. > If it is mainly about the fault cost, won't some fault-around logic (i.e. > filling more PMD entries in one PMD fault) go a long way towards reducing > fault cost without some complications? > I think this would be a pre-requisite. Basically, the idea is that a 2M page is reserved, but not allocated in response to a 4K page fault. The pages are then inserted at the properly aligned locations. If there are faults around it, they use other properly aligned pages, and when the 2M chunk is fully allocated it is promoted at that point. Early research considered whether a fill factor other than 1 should trigger a hugepage promotion, but it would have to be re-evaluated on modern hardware. I'm not aware of anyone actually working on such an implementation though because it'd be a lot of legwork. I wrote a TODO item about this at some far point in the past that never got to the top of the list: Title: In-place huge page collapsing Description: When collapsing a huge page, the kernel allocates a huge page and then copies from the base page. This is expensive. Investigate in-place reservation whereby a base page is faulted in but the properly placed pages are reserved for that process unless the alternative is to fail the allocation. Care would be needed to ensure that the kernel does not reclaim because pages are reserved or increase contention on zone->lock.
If it works correctly we would be able to collapse huge pages without copying and it would also perform extremely well when the workload uses sparse address spaces. > > - Cache pressure from 1GB page support. If we're using NT stores, they > > bypass the cache, and all should be good. But if there are > > architectures that support THP and not NT stores, zeroing a page is > > just going to obliterate their caches. > > Even doing fsync() - and thus flush all cache lines associated with 1GB > page - is likely going to take noticeable chunk of time. The granularity of > cache flushing in kernel is another thing that makes me somewhat cautious > about 1GB pages. > Problems like this were highlighted in early hugepage-related papers in the 90's. Even if persistent memory is extremely fast, there are going to be large costs. In-place promotion would avoid some of the worst of the costs. If it was me, I would focus on getting all the basic features of persistent memory working first, finding if there are workloads that are limited by TLB pressure and then, and only then, start worrying about 1G pages. If that is not done then persistent memory could fall into the same trap that the VM did whereby huge pages were being used to work around bottlenecks within the VM or crappy hardware. -- Mel Gorman SUSE Labs
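The reservation-and-promote idea in the TODO above can be illustrated with a toy model (all names and the fill-factor threshold here are invented for illustration; the real work would live in the kernel fault path and collapse machinery, not in a class like this):

```python
# Toy model of in-place huge page collapsing: 4K faults land at properly
# aligned offsets inside a reserved 2M region; once enough base pages are
# populated (the fill factor), the region is "promoted" without copying.
HUGE_ORDER = 512          # 4K base pages per 2M huge page
FILL_FACTOR = 1.0         # promote only when fully populated (tunable)

class Region:
    def __init__(self):
        self.populated = set()   # offsets of faulted-in 4K pages
        self.is_huge = False

    def fault(self, page_index):
        if self.is_huge:
            return               # already mapped by a single PMD entry
        self.populated.add(page_index % HUGE_ORDER)
        if len(self.populated) >= FILL_FACTOR * HUGE_ORDER:
            # All base pages are already in the right physical slots,
            # so promotion is a page-table update, not a copy.
            self.is_huge = True

r = Region()
for i in range(HUGE_ORDER):
    r.fault(i)
print(r.is_huge)  # True: collapsed in place once the region filled up
```

A fill factor below 1.0 would trade some internal fragmentation for earlier promotion, which is exactly the re-evaluation question raised above.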
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 11:00 ` Mel Gorman @ 2016-03-01 11:51 ` Mel Gorman 2016-03-01 12:09 ` Kirill A. Shutemov 0 siblings, 1 reply; 14+ messages in thread From: Mel Gorman @ 2016-03-01 11:51 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, Jan Kara, lsf-pc, linux-mm On Tue, Mar 01, 2016 at 11:00:55AM +0000, Mel Gorman wrote: > On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > > Hi, > > > > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > > There are a few issues around 1GB THP support that I've come up against > > > while working on DAX support that I think may be interesting to discuss > > > in person. > > > > > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > > > is driving the initial 1GB THP support, but would anonymous VMAs also > > > benefit from 1GB support? I'm not volunteering to do this work, but > > > it might make an interesting conversation if we can identify some users > > > who think performance would be better if they had 1GB THP support. > > > > Some time ago I was thinking about 1GB THP and I was wondering: What is the > > motivation for 1GB pages for persistent memory? Is it the savings in memory > > used for page tables? Or is it about the cost of fault? > > > > If anything, the cost of the fault is going to suck as a 1G allocation > and zeroing is required even if the application only needs 4K. It's by > no means a universal win. The savings are in page table usage and TLB > miss cost reduction and TLB footprint. For anonymous memory, it's not > considered to be worth it because the cost of allocating the page is so > high even if it works. There is no guarantee it'll work as fragementation > avoidance only works on the 2M boundary. > > It's worse when files are involved because there is a > write-multiplication effect when huge pages are used. 
Specifically, a > fault incurs 1G of IO even if only 4K is required and then dirty > information is only tracked on a huge page granularity. This increased > IO can offset any TLB-related benefit. > It was pointed out to me privately that the IO amplification cost is not the same for persistent memory as it is for traditional storage and this is true. For example, the 1G of data does not have to be read on fault every time. The write problems are mitigated but remain if the 1G block has to be zeroed for example. Even for normal writeback the cache lines have to be flushed as the kernel does not know what lines were updated. I know there is a proposal to defer that tracking to userspace but that breaks if an unaware process accesses the page and is overall very risky. There are other issues such as having to reserve a 1G block in case a file is truncated in the future or else there is an extremely large amount of wastage. Maybe it can be worked around but a workload that uses persistent memory with many small files may have a bad day. While I know some of these points can be countered and discussed further, at the end of the day, the benefits to huge page usage are reduced memory usage on page tables, a reduction of TLB pressure and reduced TLB fill costs. Until such time as it's known that there are realistic workloads that cannot fit in memory due to the page table usage and workloads that are limited by TLB pressure, the complexity of huge pages is unjustified and the focus should be on the basic features working correctly. If the fault overhead of a 4K page is a major concern then fault-around should be used on the 2M boundary at least. I expect there are relatively few real workloads that are limited by the cost of major faults. Applications may have a higher startup cost than desirable but in itself that does not justify using huge pages to work around problems with fault speeds in the kernel.
-- Mel Gorman SUSE Labs
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 11:51 ` Mel Gorman @ 2016-03-01 12:09 ` Kirill A. Shutemov 2016-03-01 12:52 ` Mel Gorman 0 siblings, 1 reply; 14+ messages in thread From: Kirill A. Shutemov @ 2016-03-01 12:09 UTC (permalink / raw) To: Mel Gorman; +Cc: Matthew Wilcox, linux-fsdevel, Jan Kara, lsf-pc, linux-mm On Tue, Mar 01, 2016 at 11:51:36AM +0000, Mel Gorman wrote: > While I know some of these points can be countered and discussed further, > at the end of the day, the benefits to huge page usage are reduced memory > usage on page tables, a reduction of TLB pressure and reduced TLB fill > costs. Until such time as it's known that there are realistic workloads > that cannot fit in memory due to the page table usage and workloads that > are limited by TLB pressure, the complexity of huge pages is unjustified > and the focus should be on the basic features working correctly. The size of page tables can be a limiting factor now for workloads that try to migrate from 2M hugetlb with shared page tables to DAX. 1G pages are a way to lower the overhead. Note that the reduced memory usage on page tables is not there for anon THP, as we have to deposit these page tables to be able to split the huge pmd (or pud) at any point. -- Kirill A. Shutemov
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 12:09 ` Kirill A. Shutemov @ 2016-03-01 12:52 ` Mel Gorman 0 siblings, 0 replies; 14+ messages in thread From: Mel Gorman @ 2016-03-01 12:52 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Matthew Wilcox, linux-fsdevel, Jan Kara, lsf-pc, linux-mm On Tue, Mar 01, 2016 at 03:09:19PM +0300, Kirill A. Shutemov wrote: > On Tue, Mar 01, 2016 at 11:51:36AM +0000, Mel Gorman wrote: > > While I know some of these points can be countered and discussed further, > > at the end of the day, the benefits to huge page usage are reduced memory > > usage on page tables, a reduction of TLB pressure and reduced TLB fill > > costs. Until such time as it's known that there are realistic workloads > > that cannot fit in memory due to the page table usage and workloads that > > are limited by TLB pressure, the complexity of huge pages is unjustified > > and the focus should be on the basic features working correctly. > > Size of page table can be limiting factor now for workloads that tries to > migrate from 2M hugetlb with shared page tables to DAX. 1G pages is a way > to lower the overhead. > That is only a limitation for users of hugetlbfs replacing hugetlbfs pages with DAX and even then only in the case where the workload is precisely sized to available memory. It's a potential limitation in a specialised configuration which may or may not be a problem in practice. Even the benefits of reduced memory usage and TLB pressure are not guaranteed to outweigh problems such as flushing the cache lines of the entire huge page during writeback or the necessity of allocating huge blocks on disk for a file that may or may not need them. Huge pages fix some problems but cause others. It may be better in practice for a workload to shrink the size of the shared region that was previously using hugetlbfs for example.
Granted, I've not been following the development of persistent memory closely but from what I've seen, I think it's more important to get persistent memory, DAX and related features working correctly first and then worry about page table memory usage and TLB pressure *if* it's a problem in practice. If there are problems with fault scalability then it would be better to fix that instead of working around it with huge pages. -- Mel Gorman SUSE Labs
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 10:25 ` [Lsf-pc] " Jan Kara 2016-03-01 11:00 ` Mel Gorman @ 2016-03-01 21:44 ` Matthew Wilcox 2016-03-01 22:15 ` Mike Kravetz ` (2 more replies) 1 sibling, 3 replies; 14+ messages in thread From: Matthew Wilcox @ 2016-03-01 21:44 UTC (permalink / raw) To: Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > There are a few issues around 1GB THP support that I've come up against > > while working on DAX support that I think may be interesting to discuss > > in person. > > > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > > is driving the initial 1GB THP support, but would anonymous VMAs also > > benefit from 1GB support? I'm not volunteering to do this work, but > > it might make an interesting conversation if we can identify some users > > who think performance would be better if they had 1GB THP support. > > Some time ago I was thinking about 1GB THP and I was wondering: What is the > motivation for 1GB pages for persistent memory? Is it the savings in memory > used for page tables? Or is it about the cost of fault? I think it's both. I heard from one customer who calculated that with a 6TB server, mapping every page into a process would take ~24MB of page tables. Multiply that by the 50,000 processes they expect to run on a server of that size consumes 1.2TB of DRAM. Using 1GB pages reduces that by a factor of 512, down to 2GB. Another topic to consider then would be generalising the page table sharing code that is currently specific to hugetlbfs. I didn't bring it up as I haven't researched it in any detail, and don't know how hard it would be. > For your multi-order entries I was wondering whether we shouldn't relax the > requirement that all nodes have the same number of slots - e.g. 
we could > have number of slots variable with node depth so that PMD and eventually PUD > multi-order slots end up being a single entry at appropriate radix tree > level. I'm not a big fan of the sibling entries either :-) One thing I do wonder is whether anyone has done performance analysis recently of whether 2^6 is the right size for radix tree nodes? If it used 2^9, this would be a perfect match to x86 page tables ;-) Variable size is a bit painful because we've got two variable size arrays in the node; the array of node pointers and the tag bitmasks. And then we lose the benefit of the slab allocator if the node size is variable.
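The customer arithmetic quoted in this message checks out if the ~24MB per-process figure assumes 2MB mappings of the 6TB (a pure 4K mapping would need hundreds of times more PTE memory). A sketch that reproduces the numbers, counting only the lowest-level entries at 8 bytes each on x86-64:

```python
# Page-table footprint for mapping 6TB into each of 50,000 processes.
TB, GB, MB = 1 << 40, 1 << 30, 1 << 20
ENTRY = 8            # bytes per page-table entry on x86-64
span = 6 * TB
procs = 50_000

def lowest_level_bytes(page_size):
    """Bytes of lowest-level page-table entries to map `span` once."""
    return span // page_size * ENTRY

print(lowest_level_bytes(2 * MB) / MB)           # 24.0 MB/process with 2M pages
print(lowest_level_bytes(2 * MB) * procs / TB)   # ~1.14 binary TB across 50k processes
print(lowest_level_bytes(1 * GB) * procs / GB)   # ~2.3 GB total with 1G pages
```

The 512x reduction quoted in the message is exactly the ratio of the 1GB and 2MB page sizes.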
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 21:44 ` Matthew Wilcox @ 2016-03-01 22:15 ` Mike Kravetz 2016-03-01 22:33 ` Rik van Riel 2016-03-01 22:36 ` James Bottomley 2 siblings, 0 replies; 14+ messages in thread From: Mike Kravetz @ 2016-03-01 22:15 UTC (permalink / raw) To: Matthew Wilcox, Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm On 03/01/2016 01:44 PM, Matthew Wilcox wrote: > On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: >> On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: >>> There are a few issues around 1GB THP support that I've come up against >>> while working on DAX support that I think may be interesting to discuss >>> in person. >>> >>> - Do we want to add support for 1GB THP for anonymous pages? DAX support >>> is driving the initial 1GB THP support, but would anonymous VMAs also >>> benefit from 1GB support? I'm not volunteering to do this work, but >>> it might make an interesting conversation if we can identify some users >>> who think performance would be better if they had 1GB THP support. >> >> Some time ago I was thinking about 1GB THP and I was wondering: What is the >> motivation for 1GB pages for persistent memory? Is it the savings in memory >> used for page tables? Or is it about the cost of fault? > > I think it's both. I heard from one customer who calculated that with > a 6TB server, mapping every page into a process would take ~24MB of > page tables. Multiply that by the 50,000 processes they expect to run > on a server of that size consumes 1.2TB of DRAM. Using 1GB pages reduces > that by a factor of 512, down to 2GB. > > Another topic to consider then would be generalising the page table > sharing code that is currently specific to hugetlbfs. I didn't bring > it up as I haven't researched it in any detail, and don't know how hard > it would be. Well, I have started down that path and have it working for some very simple cases with some very hacked up code. Too early/ugly to share. 
I'm struggling a bit with the fact that you can have both regular and huge page mappings of the same regions. The hugetlb code only has to deal with huge pages. -- Mike Kravetz
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 21:44 ` Matthew Wilcox 2016-03-01 22:15 ` Mike Kravetz @ 2016-03-01 22:33 ` Rik van Riel 2016-03-01 22:36 ` James Bottomley 2 siblings, 0 replies; 14+ messages in thread From: Rik van Riel @ 2016-03-01 22:33 UTC (permalink / raw) To: Matthew Wilcox, Jan Kara; +Cc: linux-fsdevel, linux-mm, lsf-pc [-- Attachment #1: Type: text/plain, Size: 1701 bytes --] On Tue, 2016-03-01 at 16:44 -0500, Matthew Wilcox wrote: > On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > > There are a few issues around 1GB THP support that I've come up > > > against > > > while working on DAX support that I think may be interesting to > > > discuss > > > in person. > > > > > > - Do we want to add support for 1GB THP for anonymous > > > pages? DAX support > > > is driving the initial 1GB THP support, but would anonymous > > > VMAs also > > > benefit from 1GB support? I'm not volunteering to do this > > > work, but > > > it might make an interesting conversation if we can identify > > > some users > > > who think performance would be better if they had 1GB THP > > > support. > > > > Some time ago I was thinking about 1GB THP and I was wondering: > > What is the > > motivation for 1GB pages for persistent memory? Is it the savings > > in memory > > used for page tables? Or is it about the cost of fault? > > I think it's both. I heard from one customer who calculated that > with > a 6TB server, mapping every page into a process would take ~24MB of > page tables. Multiply that by the 50,000 processes they expect to > run > on a server of that size consumes 1.2TB of DRAM. Using 1GB pages > reduces > that by a factor of 512, down to 2GB. 
Given the amounts of memory in systems, and the fact that 1GB (or even 2MB) page sizes will not always be possible, even with DAX on persistent memory, I suspect it may be time to implement the reclaiming of page tables that only map file pages. -- All Rights Reversed.
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 21:44 ` Matthew Wilcox 2016-03-01 22:15 ` Mike Kravetz 2016-03-01 22:33 ` Rik van Riel @ 2016-03-01 22:36 ` James Bottomley 2016-03-02 14:14 ` Matthew Wilcox 2 siblings, 1 reply; 14+ messages in thread From: James Bottomley @ 2016-03-01 22:36 UTC (permalink / raw) To: Matthew Wilcox, Jan Kara; +Cc: linux-fsdevel, linux-mm, lsf-pc On Tue, 2016-03-01 at 16:44 -0500, Matthew Wilcox wrote: > On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > > There are a few issues around 1GB THP support that I've come up > > > against > > > while working on DAX support that I think may be interesting to > > > discuss > > > in person. > > > > > > - Do we want to add support for 1GB THP for anonymous pages? > > > DAX support > > > is driving the initial 1GB THP support, but would anonymous > > > VMAs also > > > benefit from 1GB support? I'm not volunteering to do this > > > work, but > > > it might make an interesting conversation if we can identify > > > some users > > > who think performance would be better if they had 1GB THP > > > support. > > > > Some time ago I was thinking about 1GB THP and I was wondering: > > What is the motivation for 1GB pages for persistent memory? Is it > > the savings in memory used for page tables? Or is it about the cost > > of fault? > > I think it's both. I heard from one customer who calculated that > with a 6TB server, mapping every page into a process would take ~24MB > of page tables. Multiply that by the 50,000 processes they expect to > run on a server of that size consumes 1.2TB of DRAM. Using 1GB pages > reduces that by a factor of 512, down to 2GB. This sounds a bit implausible: for the machine not to be thrashing to death, all the 6TB would have to be in shared memory used by all the 50k processes. 
The much more likely scenario is that it's mostly private memory mixed with a bit of shared, in which case sum(private working set) + shared must be under 6TB for the machine not to thrash and you likely only need mappings for the working set. Realistically that means you only need about 50MB or so of page tables, even with our current page size, assuming it's mostly file backed. There might be some optimisation done for the anonymous memory swap case, which is the pte profligate one, but probably we shouldn't do anything until we understand the workload profile. James
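James's ~50MB estimate corresponds to PTEs for a working set in the tens of gigabytes per process. A sketch (the 25 GiB working-set size is an assumed value, chosen here only to show where an estimate of that magnitude could come from):

```python
# PTE footprint as a function of per-process working set, with 4K pages
# and 8-byte entries, i.e. mapping only what a process actually touches.
GB, MB = 1 << 30, 1 << 20
PAGE, ENTRY = 4096, 8

def pte_bytes(working_set):
    """Bytes of PTEs needed to map `working_set` bytes at 4K granularity."""
    return working_set // PAGE * ENTRY

print(pte_bytes(25 * GB) / MB)  # 50.0 MB of PTEs for an assumed 25 GiB working set
```

The contrast with the full-mapping arithmetic earlier in the thread is the crux of the disagreement: mapping the working set is cheap, mapping all 6TB into every process is not.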
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 22:36 ` James Bottomley @ 2016-03-02 14:14 ` Matthew Wilcox 0 siblings, 0 replies; 14+ messages in thread From: Matthew Wilcox @ 2016-03-02 14:14 UTC (permalink / raw) To: James Bottomley; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc On Tue, Mar 01, 2016 at 02:36:04PM -0800, James Bottomley wrote: > On Tue, 2016-03-01 at 16:44 -0500, Matthew Wilcox wrote: > > I think it's both. I heard from one customer who calculated that > > with a 6TB server, mapping every page into a process would take ~24MB > > of page tables. Multiply that by the 50,000 processes they expect to > > run on a server of that size consumes 1.2TB of DRAM. Using 1GB pages > > reduces that by a factor of 512, down to 2GB. > > This sounds a bit implausible: Well, that's the customer workload. They have terabytes of data, and they want to map all of it into all 50k processes. I know it's not how I use my machine, but that's customers for you ...
* Re: [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 7:09 [LSF/MM TOPIC] Support for 1GB THP Matthew Wilcox 2016-03-01 10:25 ` [Lsf-pc] " Jan Kara @ 2016-03-01 12:20 ` Kirill A. Shutemov 2016-03-01 16:32 ` Christoph Lameter 1 sibling, 1 reply; 14+ messages in thread From: Kirill A. Shutemov @ 2016-03-01 12:20 UTC (permalink / raw) To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-mm On Tue, Mar 01, 2016 at 02:09:11AM -0500, Matthew Wilcox wrote: > > There are a few issues around 1GB THP support that I've come up against > while working on DAX support that I think may be interesting to discuss > in person. > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > is driving the initial 1GB THP support, but would anonymous VMAs also > benefit from 1GB support? I'm not volunteering to do this work, but > it might make an interesting conversation if we can identify some users > who think performance would be better if they had 1GB THP support. At this point I don't think it would have many users. Too much hassle with non-obvious benefits. > - Latency of a major page fault. According to various public reviews, > main memory bandwidth is about 30GB/s on a Core i7-5960X with 4 > DDR4 channels. I think people are probably fairly unhappy about > doing only 30 page faults per second. So maybe we need a more complex > scheme to handle major faults where we insert a temporary 2MB mapping, > prepare the other 2MB pages in the background, then merge them into > a 1GB mapping when they're completed. > > - Cache pressure from 1GB page support. If we're using NT stores, they > bypass the cache, and all should be good. But if there are > architectures that support THP and not NT stores, zeroing a page is > just going to obliterate their caches. At some point I've tested NT stores for clearing 2M THP and it didn't show much benefit. I guess that could depend on microarchitecture and we should probably re-test this with new CPU generations.
> Other topics that might interest people from a VM/FS point of view:
>
> - Uses for (or replacement of) the radix tree.  We're currently
>   looking at using the radix tree with DAX in order to reduce the number
>   of calls into the filesystem.  That's leading to various enhancements
>   to the radix tree, such as support for a lock bit for exceptional
>   entries (Neil Brown), and support for multi-order entries (me).
>   Is the (enhanced) radix tree the right data structure to be using
>   for this brave new world of huge pages in the page cache, or should
>   we be looking at some other data structure like an RB-tree?

I'm interested in multi-order entries for THP page cache. It's not
required for huge tmpfs, but would be nice to have.

>
> - Can we get rid of PAGE_CACHE_SIZE now?  Finally?  Pretty please?

+1 :)

--
 Kirill A. Shutemov
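The idea behind the multi-order entries discussed above can be illustrated with a toy model: one entry of order N covers 2**N consecutive indices, so a lookup anywhere inside a huge page's range finds the single covering entry. This is only a conceptual sketch with invented names (`MultiOrderIndex`, `insert`, `lookup`), not the kernel's radix-tree implementation or API.

```python
# Toy model of multi-order entries: an order-N entry covers the
# 2**N-aligned range of 2**N indices containing its head index.
class MultiOrderIndex:
    def __init__(self):
        self.entries = {}   # head index -> (order, value)
        self.orders = set() # orders currently in use

    def insert(self, index, order, value):
        head = index & ~((1 << order) - 1)  # align down to the entry's range
        self.entries[head] = (order, value)
        self.orders.add(order)

    def lookup(self, index):
        # Try smaller orders first so small pages shadow a stale large entry.
        for order in sorted(self.orders):
            head = index & ~((1 << order) - 1)
            entry = self.entries.get(head)
            if entry is not None and entry[0] == order:
                return entry[1]
        return None

idx = MultiOrderIndex()
idx.insert(0, 9, "2MB page A")    # order-9 entry: covers indices 0..511
idx.insert(512, 0, "4KB page B")  # order-0 entry at index 512
assert idx.lookup(17) == "2MB page A"
assert idx.lookup(512) == "4KB page B"
```

The real data-structure question raised above is exactly what this toy glosses over: doing such lookups in one descent of the tree rather than one probe per order.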
* Re: [LSF/MM TOPIC] Support for 1GB THP
  2016-03-01 12:20 ` Kirill A. Shutemov
@ 2016-03-01 16:32   ` Christoph Lameter
  2016-03-01 21:47     ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2016-03-01 16:32 UTC (permalink / raw)
To: Kirill A. Shutemov; +Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm

On Tue, 1 Mar 2016, Kirill A. Shutemov wrote:

> On Tue, Mar 01, 2016 at 02:09:11AM -0500, Matthew Wilcox wrote:
> >
> > There are a few issues around 1GB THP support that I've come up against
> > while working on DAX support that I think may be interesting to discuss
> > in person.
> >
> > - Do we want to add support for 1GB THP for anonymous pages?  DAX support
> >   is driving the initial 1GB THP support, but would anonymous VMAs also
> >   benefit from 1GB support?  I'm not volunteering to do this work, but
> >   it might make an interesting conversation if we can identify some users
> >   who think performance would be better if they had 1GB THP support.
>
> At this point I don't think it would have many users. Too much hassle
> for non-obvious benefits.

In our business we preallocate everything and then the processing
proceeds without faults. 1GB support has obvious benefits for us since
we would be able to access larger areas of memory for lookups and
various bits of computation that we cannot do today without incurring
TLB misses that cause variances in our processing time. Having more
mainstream support for 1GB pages would make it easier to operate using
these pages.

The long processing times for 1GB pages will make it even more important
to ensure all faults are done before hitting critical sections. But this
is already being done for most of our apps.

For the large NVDIMMs on the horizon using gazillions of terabytes we
really would want 1GB support. Otherwise TLB thrashing becomes quite
easy if one walks pointer chains through memory.

> > - Latency of a major page fault.
> >   According to various public reviews, main memory bandwidth is about
> >   30GB/s on a Core i7-5960X with 4 DDR4 channels.  I think people are
> >   probably fairly unhappy about doing only 30 page faults per second.
> >   So maybe we need a more complex scheme to handle major faults where
> >   we insert a temporary 2MB mapping, prepare the other 2MB pages in
> >   the background, then merge them into a 1GB mapping when they're
> >   completed.
> >
> > - Cache pressure from 1GB page support.  If we're using NT stores, they
> >   bypass the cache, and all should be good.  But if there are
> >   architectures that support THP and not NT stores, zeroing a page is
> >   just going to obliterate their caches.
>
> At some point I tested NT stores for clearing 2M THP and it didn't show
> much benefit. I guess that could depend on microarchitecture, and we
> probably should re-test this with new CPU generations.

Zeroing a page should not occur during usual processing but just during
the time that a process starts up.

> > - Can we get rid of PAGE_CACHE_SIZE now?  Finally?  Pretty please?
>
> +1 :)

We have had grandiose visions of being free of that particular set of
chains for more than 10 years now. Sadly nothing really was that
appealing, and the current state of THP support is not that encouraging
either. We'd rather go with static huge page support to have more
control over how memory is laid out for a process.
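The latency and cache-pressure concerns quoted above reduce to simple arithmetic: at ~30GB/s, populating a 1GB page takes ~33ms, so one thread can service only ~30 such major faults per second, and without NT stores that 1GB of zeroing traffic dwarfs any cache level. A sketch; the bandwidth figure is the one quoted in the thread, and the cache size is an illustrative ballpark rather than any specific CPU's.

```python
# Numbers behind the major-fault latency and cache-pressure concerns.
MB, GB = 2 ** 20, 2 ** 30
bandwidth = 30 * GB                      # bytes/second, as quoted above

assert bandwidth // GB == 30             # ~30 major faults/s for 1GB pages
assert bandwidth // (2 * MB) == 15_360   # 2MB faults are 512x cheaper

llc = 20 * MB                            # a generous last-level cache
assert GB // llc > 50                    # one 1GB zeroing pass is >50x the LLC
```

This is the motivation for the temporary-2MB-mapping scheme proposed above: the fault returns after ~65 microseconds of work instead of ~33ms, with the remaining 511 pieces prepared in the background.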
* Re: [LSF/MM TOPIC] Support for 1GB THP
  2016-03-01 16:32 ` Christoph Lameter
@ 2016-03-01 21:47   ` Matthew Wilcox
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox @ 2016-03-01 21:47 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Kirill A. Shutemov, lsf-pc, linux-fsdevel, linux-mm

On Tue, Mar 01, 2016 at 10:32:52AM -0600, Christoph Lameter wrote:
> > > - Can we get rid of PAGE_CACHE_SIZE now?  Finally?  Pretty please?
> >
> > +1 :)
>
> We have had grandiose visions of being free of that particular set of
> chains for more than 10 years now. Sadly nothing really was that
> appealing, and the current state of THP support is not that encouraging
> either. We'd rather go with static huge page support to have more
> control over how memory is laid out for a process.

With Kirill's fault-around code in place, I think it delivers all or
most of the benefits promised by increasing PAGE_CACHE_SIZE.
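The fault-around point above is also just arithmetic: mapping N nearby pages per minor fault divides the number of faults by N, which is most of what a larger PAGE_CACHE_SIZE would have bought. A sketch; the 16-page window corresponds to the kernel's 64KB default fault_around_bytes, and the 64MB mapping size is an arbitrary example.

```python
# Why fault-around captures much of the benefit of larger page-cache units.
KB, MB = 2 ** 10, 2 ** 20

pages = 64 * MB // (4 * KB)      # touching a 64MB file-backed mapping
assert pages == 16_384           # naively: one minor fault per 4KB page
assert pages // 16 == 1_024      # with a 16-page fault-around window
```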
end of thread, other threads: [~2016-03-02 14:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2016-03-01  7:09 [LSF/MM TOPIC] Support for 1GB THP Matthew Wilcox
2016-03-01 10:25 ` [Lsf-pc] " Jan Kara
2016-03-01 11:00   ` Mel Gorman
2016-03-01 11:51     ` Mel Gorman
2016-03-01 12:09       ` Kirill A. Shutemov
2016-03-01 12:52         ` Mel Gorman
2016-03-01 21:44           ` Matthew Wilcox
2016-03-01 22:15             ` Mike Kravetz
2016-03-01 22:33             ` Rik van Riel
2016-03-01 22:36             ` James Bottomley
2016-03-02 14:14               ` Matthew Wilcox
2016-03-01 12:20 ` Kirill A. Shutemov
2016-03-01 16:32   ` Christoph Lameter
2016-03-01 21:47     ` Matthew Wilcox