* [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability

From: Kirill A. Shutemov @ 2014-01-03 12:25 UTC
To: lsf-pc; +Cc: linux-mm, linux-fsdevel

Hi,

I would like to attend the LSF/MM summit. I'm interested in discussions about huge pages, scalability of the memory management subsystem, and persistent memory.

Last year I did some work to fix THP-related regressions and improve scalability. I also work on THP for file-backed pages.

Depending on project status, I will probably want to bring up transparent huge pagecache as a topic.

-- 
 Kirill A. Shutemov
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability

From: Mel Gorman @ 2014-01-08 15:13 UTC
To: Kirill A. Shutemov; +Cc: lsf-pc, linux-fsdevel, linux-mm

On Fri, Jan 03, 2014 at 02:25:09PM +0200, Kirill A. Shutemov wrote:
> Hi,
>
> I would like to attend the LSF/MM summit. I'm interested in discussions about huge pages, scalability of the memory management subsystem, and persistent memory.
>
> Last year I did some work to fix THP-related regressions and improve scalability. I also work on THP for file-backed pages.
>
> Depending on project status, I will probably want to bring up transparent huge pagecache as a topic.

I think transparent huge pagecache is likely to crop up for more than one reason. There is the TLB issue and the motivation that i-TLB pressure is a problem in some specialised cases. Whatever the merits of that case, transparent hugepage cache has been raised as a potential solution for some VM scalability problems. I recognise that dealing with large numbers of struct pages is now a problem on larger machines (although I have not seen quantified data on the problem nor do I have access to a machine large enough to measure it myself) but I'm wary of transparent hugepage cache being treated as a primary solution for VM scalability problems. Lacking performance data I have no suggestions on what these alternative solutions might look like.

-- 
Mel Gorman
SUSE Labs
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability

From: Kirill A. Shutemov @ 2014-01-10 17:42 UTC
To: Mel Gorman; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox

On Wed, Jan 08, 2014 at 03:13:21PM +0000, Mel Gorman wrote:
> On Fri, Jan 03, 2014 at 02:25:09PM +0200, Kirill A. Shutemov wrote:
> > Hi,
> >
> > I would like to attend the LSF/MM summit. I'm interested in discussions about huge pages, scalability of the memory management subsystem, and persistent memory.
> >
> > Last year I did some work to fix THP-related regressions and improve scalability. I also work on THP for file-backed pages.
> >
> > Depending on project status, I will probably want to bring up transparent huge pagecache as a topic.
>
> I think transparent huge pagecache is likely to crop up for more than one reason. There is the TLB issue and the motivation that i-TLB pressure is a problem in some specialised cases. Whatever the merits of that case, transparent hugepage cache has been raised as a potential solution for some VM scalability problems. I recognise that dealing with large numbers of struct pages is now a problem on larger machines (although I have not seen quantified data on the problem nor do I have access to a machine large enough to measure it myself) but I'm wary of transparent hugepage cache being treated as a primary solution for VM scalability problems. Lacking performance data I have no suggestions on what these alternative solutions might look like.

Yes, performance data is critical. I'll try to bring some.

The only alternative I see is some kind of THP implemented at the filesystem level. It can work reasonably well for tmpfs/shm, but it looks ad hoc, and in the long term transparent huge pagecache is the way to go, I believe.

Sibling topic is THP for XIP (see Matthew's patchset). Guys want to manage persistent memory in 2M chunks where it's possible. And THP (but without struct page in this case) is the obvious solution.

-- 
 Kirill A. Shutemov
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability

From: Matthew Wilcox @ 2014-01-10 22:51 UTC
To: Kirill A. Shutemov; +Cc: Mel Gorman, lsf-pc, linux-fsdevel, linux-mm

On Fri, Jan 10, 2014 at 07:42:04PM +0200, Kirill A. Shutemov wrote:
> On Wed, Jan 08, 2014 at 03:13:21PM +0000, Mel Gorman wrote:
> > I think transparent huge pagecache is likely to crop up for more than one reason. There is the TLB issue and the motivation that i-TLB pressure is a problem in some specialised cases. Whatever the merits of that case, transparent hugepage cache has been raised as a potential solution for some VM scalability problems. I recognise that dealing with large numbers of struct pages is now a problem on larger machines (although I have not seen quantified data on the problem nor do I have access to a machine large enough to measure it myself) but I'm wary of transparent hugepage cache being treated as a primary solution for VM scalability problems. Lacking performance data I have no suggestions on what these alternative solutions might look like.

Something I'd like to see discussed (but don't have the MM chops to lead a discussion on myself) is the PAGE_CACHE_SIZE vs PAGE_SIZE split. This needs to be either fixed or removed, IMO. It's been in the tree since before git history began (ie before 2005), it imposes a reasonably large cognitive burden on programmers ("what kind of page size do I want here?"), it's not intuitively obvious (to a non-mm person) which page size is which, and it's never actually bought us anything because it's always been the same!

Also, it bitrots. Look at this:

        pgoff_t pgoff = (((address & PAGE_MASK)
                        - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

        vmf.pgoff = pgoff;

        pgoff_t offset = vmf->pgoff;

        size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
        if (offset >= size)
                return VM_FAULT_SIGBUS;

That's spread over three functions, but that goes to illustrate my point; getting this stuff right is Hard; core mm developers get it wrong, we don't have the right types to document whether a variable is in PAGE_SIZE or PAGE_CACHE_SIZE units, and we're not getting any benefit from it today.

> Sibling topic is THP for XIP (see Matthew's patchset). Guys want to manage persistent memory in 2M chunks where it's possible. And THP (but without struct page in this case) is the obvious solution.

Not just 2MB, we also want 1GB pages for some special cases. It looks doable (XFS can allocate aligned 1GB blocks). I've written some supporting code that will at least get us to the point where we can insert a 1GB page. I haven't been able to test anything yet.
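[The unit-mixing above can be made concrete with a minimal sketch: keep the fault offset and the file size in the same units before comparing them. This assumes PAGE_CACHE_SIZE is a power-of-two multiple of PAGE_SIZE; the helper name is invented for illustration and this is not the actual kernel fault path.]

        #include <linux/mm.h>
        #include <linux/fs.h>
        #include <linux/pagemap.h>

        /*
         * Sketch only: convert the PAGE_SIZE-based fault offset into
         * page-cache units before comparing it against i_size rounded
         * up to whole page-cache pages.
         */
        static int check_fault_in_file(struct vm_fault *vmf, struct inode *inode)
        {
                /* vmf->pgoff is in PAGE_SIZE units, as set up by the fault path */
                pgoff_t pgoff = vmf->pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);

                /* file size in PAGE_CACHE_SIZE units, rounded up */
                pgoff_t size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1)
                                >> PAGE_CACHE_SHIFT;

                if (pgoff >= size)
                        return VM_FAULT_SIGBUS;
                return 0;
        }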
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability

From: Kirill A. Shutemov @ 2014-01-10 22:59 UTC
To: Matthew Wilcox; +Cc: Mel Gorman, lsf-pc, linux-fsdevel, linux-mm

On Fri, Jan 10, 2014 at 05:51:16PM -0500, Matthew Wilcox wrote:
> On Fri, Jan 10, 2014 at 07:42:04PM +0200, Kirill A. Shutemov wrote:
> > On Wed, Jan 08, 2014 at 03:13:21PM +0000, Mel Gorman wrote:
> > > I think transparent huge pagecache is likely to crop up for more than one reason. There is the TLB issue and the motivation that i-TLB pressure is a problem in some specialised cases. Whatever the merits of that case, transparent hugepage cache has been raised as a potential solution for some VM scalability problems. I recognise that dealing with large numbers of struct pages is now a problem on larger machines (although I have not seen quantified data on the problem nor do I have access to a machine large enough to measure it myself) but I'm wary of transparent hugepage cache being treated as a primary solution for VM scalability problems. Lacking performance data I have no suggestions on what these alternative solutions might look like.
>
> Something I'd like to see discussed (but don't have the MM chops to lead a discussion on myself) is the PAGE_CACHE_SIZE vs PAGE_SIZE split. This needs to be either fixed or removed, IMO. It's been in the tree since before git history began (ie before 2005), it imposes a reasonably large cognitive burden on programmers ("what kind of page size do I want here?"), it's not intuitively obvious (to a non-mm person) which page size is which, and it's never actually bought us anything because it's always been the same!
>
> Also, it bitrots. Look at this:
>
>         pgoff_t pgoff = (((address & PAGE_MASK)
>                         - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>
>         vmf.pgoff = pgoff;
>
>         pgoff_t offset = vmf->pgoff;
>
>         size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>         if (offset >= size)
>                 return VM_FAULT_SIGBUS;
>
> That's spread over three functions, but that goes to illustrate my point; getting this stuff right is Hard; core mm developers get it wrong, we don't have the right types to document whether a variable is in PAGE_SIZE or PAGE_CACHE_SIZE units, and we're not getting any benefit from it today.

I also want to drop PAGE_CACHE_*. It's been on my todo list for almost a year now ;)

> > Sibling topic is THP for XIP (see Matthew's patchset). Guys want to manage persistent memory in 2M chunks where it's possible. And THP (but without struct page in this case) is the obvious solution.
>
> Not just 2MB, we also want 1GB pages for some special cases. It looks doable (XFS can allocate aligned 1GB blocks). I've written some supporting code that will at least get us to the point where we can insert a 1GB page. I haven't been able to test anything yet.

It's probably doable from the fs point of view, but adding a PUD-level THP page is not trivial at all. I think it's more productive to concentrate on 2M for now.

-- 
 Kirill A. Shutemov
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability

From: Matthew Wilcox @ 2014-01-11 1:49 UTC
To: Kirill A. Shutemov; +Cc: Mel Gorman, lsf-pc, linux-fsdevel, linux-mm

On Sat, Jan 11, 2014 at 12:59:34AM +0200, Kirill A. Shutemov wrote:
> On Fri, Jan 10, 2014 at 05:51:16PM -0500, Matthew Wilcox wrote:
> > On Fri, Jan 10, 2014 at 07:42:04PM +0200, Kirill A. Shutemov wrote:
> > > On Wed, Jan 08, 2014 at 03:13:21PM +0000, Mel Gorman wrote:
> > > > I think transparent huge pagecache is likely to crop up for more than one reason. There is the TLB issue and the motivation that i-TLB pressure is a problem in some specialised cases. Whatever the merits of that case, transparent hugepage cache has been raised as a potential solution for some VM scalability problems. I recognise that dealing with large numbers of struct pages is now a problem on larger machines (although I have not seen quantified data on the problem nor do I have access to a machine large enough to measure it myself) but I'm wary of transparent hugepage cache being treated as a primary solution for VM scalability problems. Lacking performance data I have no suggestions on what these alternative solutions might look like.
> >
> > Something I'd like to see discussed (but don't have the MM chops to lead a discussion on myself) is the PAGE_CACHE_SIZE vs PAGE_SIZE split. This needs to be either fixed or removed, IMO. It's been in the tree since before git history began (ie before 2005), it imposes a reasonably large cognitive burden on programmers ("what kind of page size do I want here?"), it's not intuitively obvious (to a non-mm person) which page size is which, and it's never actually bought us anything because it's always been the same!
>
> I also want to drop PAGE_CACHE_*. It's been on my todo list for almost a year now ;)

I don't necessarily want to drop the concept of having 'the size of memory referenced by struct page' != 'the size of memory pointed at by a single PTE'. I just want to see it *implemented* for at least one architecture if we're going to have the distinction. It's one way of solving the problem that Mel mentioned (dealing with a large number of struct pages).

> > > Sibling topic is THP for XIP (see Matthew's patchset). Guys want to manage persistent memory in 2M chunks where it's possible. And THP (but without struct page in this case) is the obvious solution.
> >
> > Not just 2MB, we also want 1GB pages for some special cases. It looks doable (XFS can allocate aligned 1GB blocks). I've written some supporting code that will at least get us to the point where we can insert a 1GB page. I haven't been able to test anything yet.
>
> It's probably doable from the fs point of view, but adding a PUD-level THP page is not trivial at all. I think it's more productive to concentrate on 2M for now.

It's clearly Hard to get to a point where we're inserting PUD entries for anonymous pages. While I don't think it's trivial to get to PUD entries for PFNMAP, I think it is doable.

Last time we discussed this, your concern was around splitting a PUD entry down into PTEs and having to preallocate all the memory required to do that.
We can't possibly need to call split_huge_page() for the PFNMAP case because we don't have a struct page, so none of those code paths can be run. I think that leaves split_huge_page_pmd() as the only place where we can try to split a huge PFNMAP PMD.

That's called from:

        mem_cgroup_count_precharge_pte_range()
        mem_cgroup_move_charge_pte_range()

These two look like they need to be converted to work on unsplit PMDs anyway, for efficiency reasons. Perhaps someone who's hacked on this file as recently as 2009 would care to do that work? :-)

zap_pmd_range() does this:

        if (pmd_trans_huge(*pmd)) {
                if (next - addr != HPAGE_PMD_SIZE) {
                        VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                        split_huge_page_pmd(vma->vm_mm, pmd);
                } else if (zap_huge_pmd(tlb, vma, pmd))
                        continue;
                /* fall through */
        }

I don't understand why it bothers to split rather than just zapping the PMD and allowing refaults to populate the PTEs later.

follow_page() calls it, but I think we can give up way earlier in this function, since we know there's no struct page to return. We can put in something like:

        if (IS_XIP(file_inode(vma->vm_file)))
                return ERR_PTR(-Ewhatever);

check_pmd_range() calls it, but this is NUMA policy for the page cache. We should be skipping this code for XIP files too, if we aren't already.

change_pmd_range() calls split_huge_page_pmd() if an mprotect call lands in the middle of a PMD range. Again, I'd be *fine* with just dropping the PMD entry here and allowing faults to repopulate the PTEs.

Looks like the mremap code may need some work. I'm not sure what that work is right now.

That leaves us with walk_page_range() ... which also looks like it's going to need some work in the callers.

So yeah, not trivial at all, but doable with a few weeks of work, I think. Unless there's some other major concern that I've missed (which is possible since I'm not a MM hacker).
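[To make the "just zap it" idea concrete, here is a rough sketch of a zap-instead-of-split path for a huge PFNMAP PMD. It assumes there is no struct page and no rmap behind the mapping, so there is nothing to put or unmap beyond the entry itself; vma_is_huge_pfnmap() is a hypothetical helper, and real code would batch the TLB flush through the caller's mmu_gather rather than flushing directly.]

        #include <linux/mm.h>
        #include <linux/huge_mm.h>

        /*
         * Sketch, not kernel code: instead of splitting a huge PFNMAP PMD
         * into PTEs on a partial operation, clear the whole PMD and let
         * later faults repopulate the range from the backing (XIP) file.
         */
        static void zap_huge_pfnmap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
                                        unsigned long addr)
        {
                unsigned long haddr = addr & HPAGE_PMD_MASK;

                if (!vma_is_huge_pfnmap(vma))   /* hypothetical helper */
                        return;

                if (pmd_none(*pmd))
                        return;

                /* No struct page to put and no rmap to tear down for PFNMAP;
                 * clearing the PMD and flushing the TLB is enough. */
                pmdp_clear_flush(vma, haddr, pmd);
        }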
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability

From: Kirill A. Shutemov @ 2014-01-11 2:55 UTC
To: Matthew Wilcox; +Cc: Mel Gorman, lsf-pc, linux-fsdevel, linux-mm

On Fri, Jan 10, 2014 at 08:49:24PM -0500, Matthew Wilcox wrote:
> > I also want to drop PAGE_CACHE_*. It's been on my todo list for almost a year now ;)
>
> I don't necessarily want to drop the concept of having 'the size of memory referenced by struct page' != 'the size of memory pointed at by a single PTE'. I just want to see it *implemented* for at least one architecture if we're going to have the distinction. It's one way of solving the problem that Mel mentioned (dealing with a large number of struct pages).

Okay. But I don't think it's going to happen.

> > > > Sibling topic is THP for XIP (see Matthew's patchset). Guys want to manage persistent memory in 2M chunks where it's possible. And THP (but without struct page in this case) is the obvious solution.
> > >
> > > Not just 2MB, we also want 1GB pages for some special cases. It looks doable (XFS can allocate aligned 1GB blocks). I've written some supporting code that will at least get us to the point where we can insert a 1GB page. I haven't been able to test anything yet.
> >
> > It's probably doable from the fs point of view, but adding a PUD-level THP page is not trivial at all. I think it's more productive to concentrate on 2M for now.
>
> It's clearly Hard to get to a point where we're inserting PUD entries for anonymous pages. While I don't think it's trivial to get to PUD entries for PFNMAP, I think it is doable.
>
> Last time we discussed this, your concern was around splitting a PUD entry down into PTEs and having to preallocate all the memory required to do that.

The other thing is dealing with PMD vs PTE races (due to splitting, MADV_DONTNEED or something else). Adding PUD to the picture doesn't make it easier.

> We can't possibly need to call split_huge_page() for the PFNMAP case because we don't have a struct page, so none of those code paths can be run. I think that leaves split_huge_page_pmd() as the only place where we can try to split a huge PFNMAP PMD.
>
> That's called from:
>
>         mem_cgroup_count_precharge_pte_range()
>         mem_cgroup_move_charge_pte_range()
>
> These two look like they need to be converted to work on unsplit PMDs anyway, for efficiency reasons. Perhaps someone who's hacked on this file as recently as 2009 would care to do that work? :-)

:) Maybe. I don't really remember anything there.

> zap_pmd_range() does this:
>
>         if (pmd_trans_huge(*pmd)) {
>                 if (next - addr != HPAGE_PMD_SIZE) {
>                         VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
>                         split_huge_page_pmd(vma->vm_mm, pmd);
>                 } else if (zap_huge_pmd(tlb, vma, pmd))
>                         continue;
>                 /* fall through */
>         }
>
> I don't understand why it bothers to split rather than just zapping the PMD and allowing refaults to populate the PTEs later.

Because with anon pages you don't have backing storage to repopulate from: it would free the memory and you would end up with clear pages after the next page fault. Yeah, it's not relevant for file pages; you can just unmap the pmd.

> follow_page() calls it, but I think we can give up way earlier in this function, since we know there's no struct page to return.
> We can put in something like:
>
>         if (IS_XIP(file_inode(vma->vm_file)))
>                 return ERR_PTR(-Ewhatever);

Are you sure you will not need a temporary struct page here to show to the caller, or something like this?

> check_pmd_range() calls it, but this is NUMA policy for the page cache. We should be skipping this code for XIP files too, if we aren't already.

Okay.

> change_pmd_range() calls split_huge_page_pmd() if an mprotect call lands in the middle of a PMD range. Again, I'd be *fine* with just dropping the PMD entry here and allowing faults to repopulate the PTEs.

Do you have a way to store the info that the area should be repopulated with PTEs, not a PMD?

> Looks like the mremap code may need some work. I'm not sure what that work is right now.

You can probably unmap there too and handle it as !old_pmd.

> That leaves us with walk_page_range() ... which also looks like it's going to need some work in the callers.
>
> So yeah, not trivial at all, but doable with a few weeks of work, I think. Unless there's some other major concern that I've missed (which is possible since I'm not a MM hacker).

Okay, doable, I guess. The general approach is to replace split with unmap. unmap_mapping_range() takes ->i_mmap_mutex and we will probably hit lock ordering issues.

But I would suggest taking a more conservative approach first: leave 1G pages aside, use 2M pages with page table pre-allocation, and implement proper splitting in split_huge_page_pmd() for this case. It should be much easier than fixing all split_huge_page_pmd() callers. Once this works we can look at how to eliminate the memory overhead of the preallocated page tables and bring in 1G pages. By that time you will probably have some performance data showing that you don't really need 1G pages that much. ;)

-- 
 Kirill A. Shutemov
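[A sketch of the conservative 2M path described above, assuming a page table was pre-allocated and deposited with pgtable_trans_huge_deposit() when the huge PMD was installed, the way anonymous THP already does, so the split itself never allocates memory. Locking, dirty/write-protection handling and error paths are omitted; treat it as an outline, not the implementation.]

        #include <linux/mm.h>
        #include <linux/huge_mm.h>
        #include <asm/pgalloc.h>

        /*
         * Sketch, not kernel code: split a file-backed (pfn-only) huge PMD
         * by withdrawing the deposited page table and filling it with PTEs
         * that cover the same 2M range.
         */
        static void split_file_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
                                        unsigned long haddr)
        {
                struct mm_struct *mm = vma->vm_mm;
                pgtable_t pgtable;
                pmd_t old_pmd, _pmd;
                unsigned long pfn;
                int i;

                old_pmd = pmdp_clear_flush(vma, haddr, pmd);
                pfn = pmd_pfn(old_pmd);

                pgtable = pgtable_trans_huge_withdraw(mm, pmd);
                pmd_populate(mm, &_pmd, pgtable);

                for (i = 0; i < PTRS_PER_PTE; i++, pfn++) {
                        unsigned long addr = haddr + i * PAGE_SIZE;
                        pte_t entry, *pte;

                        entry = pfn_pte(pfn, vma->vm_page_prot);
                        if (pmd_write(old_pmd))
                                entry = pte_mkwrite(entry);
                        pte = pte_offset_map(&_pmd, addr);
                        set_pte_at(mm, addr, pte, entry);
                        pte_unmap(pte);
                }

                /* Make the PTEs visible before the page table is plugged in. */
                smp_wmb();
                pmd_populate(mm, pmd, pgtable);
        }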
Thread overview: 7 messages

2014-01-03 12:25 [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability  Kirill A. Shutemov
2014-01-08 15:13 ` [Lsf-pc] " Mel Gorman
2014-01-10 17:42   ` Kirill A. Shutemov
2014-01-10 22:51     ` Matthew Wilcox
2014-01-10 22:59       ` Kirill A. Shutemov
2014-01-11  1:49         ` Matthew Wilcox
2014-01-11  2:55           ` Kirill A. Shutemov