* [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability
From: Kirill A. Shutemov @ 2014-01-03 12:25 UTC (permalink / raw)
To: lsf-pc; +Cc: linux-mm, linux-fsdevel
Hi,
I would like to attend the LSF/MM summit. I'm interested in discussions
about huge pages, scalability of the memory management subsystem, and
persistent memory.

Last year I did some work to fix THP-related regressions and improve
scalability. I'm also working on THP for file-backed pages.

Depending on the project's status, I would probably like to bring up
transparent huge pagecache as a topic.
--
Kirill A. Shutemov
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability
From: Mel Gorman @ 2014-01-08 15:13 UTC (permalink / raw)
To: Kirill A. Shutemov; +Cc: lsf-pc, linux-fsdevel, linux-mm
On Fri, Jan 03, 2014 at 02:25:09PM +0200, Kirill A. Shutemov wrote:
> Hi,
>
> I would like to attend the LSF/MM summit. I'm interested in discussions
> about huge pages, scalability of the memory management subsystem, and
> persistent memory.
>
> Last year I did some work to fix THP-related regressions and improve
> scalability. I'm also working on THP for file-backed pages.
>
> Depending on the project's status, I would probably like to bring up
> transparent huge pagecache as a topic.
>
I think transparent huge pagecache is likely to crop up for more than one
reason. There is the TLB issue and the motivation that i-TLB pressure is
a problem in some specialised cases. Whatever the merits of that case,
transparent hugepage cache has been raised as a potential solution for
some VM scalability problems. I recognise that dealing with large numbers
of struct pages is now a problem on larger machines (although I have not
seen quantified data on the problem nor do I have access to a machine large
enough to measure it myself) but I'm wary of transparent hugepage cache
being treated as a primary solution for VM scalability problems. Lacking
performance data I have no suggestions on what these alternative solutions
might look like.
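
For scale only (this is not the measured data that is really needed), a
back-of-the-envelope sketch, assuming 4K base pages and roughly 64 bytes
per struct page (both assumptions; the exact size depends on the config):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long ram = 1ULL << 40;      /* 1 TiB of RAM */
                unsigned long long pages = ram / 4096;    /* base pages to describe */
                unsigned long long metadata = pages * 64; /* bytes of struct page */

                printf("%llu struct pages, ~%llu GiB of metadata\n",
                       pages, metadata >> 30);            /* 268435456, ~16 GiB */
                return 0;
        }

Even touching that much metadata once, at boot or in a reclaim scan, has a
cost of its own, independent of any TLB effects.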
--
Mel Gorman
SUSE Labs
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability
From: Kirill A. Shutemov @ 2014-01-10 17:42 UTC (permalink / raw)
To: Mel Gorman; +Cc: lsf-pc, linux-fsdevel, linux-mm, Matthew Wilcox
On Wed, Jan 08, 2014 at 03:13:21PM +0000, Mel Gorman wrote:
> On Fri, Jan 03, 2014 at 02:25:09PM +0200, Kirill A. Shutemov wrote:
> > Hi,
> >
> > I would like to attend the LSF/MM summit. I'm interested in discussions
> > about huge pages, scalability of the memory management subsystem, and
> > persistent memory.
> >
> > Last year I did some work to fix THP-related regressions and improve
> > scalability. I'm also working on THP for file-backed pages.
> >
> > Depending on the project's status, I would probably like to bring up
> > transparent huge pagecache as a topic.
> >
>
> I think transparent huge pagecache is likely to crop up for more than one
> reason. There is the TLB issue and the motivation that i-TLB pressure is
> a problem in some specialised cases. Whatever the merits of that case,
> transparent hugepage cache has been raised as a potential solution for
> some VM scalability problems. I recognise that dealing with large numbers
> of struct pages is now a problem on larger machines (although I have not
> seen quantified data on the problem nor do I have access to a machine large
> enough to measure it myself) but I'm wary of transparent hugepage cache
> being treated as a primary solution for VM scalability problems. Lacking
> performance data I have no suggestions on what these alternative solutions
> might look like.
Yes, performance data is critical. I'll try to bring some.

The only alternative I see is some kind of THP implemented at the
filesystem level. It could work reasonably well for tmpfs/shm, but it
looks ad hoc, and in the long term transparent huge pagecache is the way
to go, I believe.

A sibling topic is THP for XIP (see Matthew's patchset). People want to
manage persistent memory in 2M chunks where possible, and THP (but without
struct page in this case) is the obvious solution.
--
Kirill A. Shutemov
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability
From: Matthew Wilcox @ 2014-01-10 22:51 UTC (permalink / raw)
To: Kirill A. Shutemov; +Cc: Mel Gorman, lsf-pc, linux-fsdevel, linux-mm
On Fri, Jan 10, 2014 at 07:42:04PM +0200, Kirill A. Shutemov wrote:
> On Wed, Jan 08, 2014 at 03:13:21PM +0000, Mel Gorman wrote:
> > I think transparent huge pagecache is likely to crop up for more than one
> > reason. There is the TLB issue and the motivation that i-TLB pressure is
> > a problem in some specialised cases. Whatever the merits of that case,
> > transparent hugepage cache has been raised as a potential solution for
> > some VM scalability problems. I recognise that dealing with large numbers
> > of struct pages is now a problem on larger machines (although I have not
> > seen quantified data on the problem nor do I have access to a machine large
> > enough to measure it myself) but I'm wary of transparent hugepage cache
> > being treated as a primary solution for VM scalability problems. Lacking
> > performance data I have no suggestions on what these alternative solutions
> > might look like.
Something I'd like to see discussed (but don't have the MM chops to
lead a discussion on myself) is the PAGE_CACHE_SIZE vs PAGE_SIZE split.
This needs to be either fixed or removed, IMO. It's been in the tree
since before git history began (ie before 2005), it imposes a reasonably
large cognitive burden on programmers ("what kind of page size do I want
here?"), it's not intuitively obvious (to a non-mm person) which page
size is which, and it's never actually bought us anything because it's
always been the same!
Also, it bitrots. Look at this:
        pgoff_t pgoff = (((address & PAGE_MASK)
                        - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
        vmf.pgoff = pgoff;

        pgoff_t offset = vmf->pgoff;

        size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
        if (offset >= size)
                return VM_FAULT_SIGBUS;
That's spread over three functions, but that goes to illustrate my point;
getting this stuff right is Hard; core mm developers get it wrong, we
don't have the right types to document whether a variable is in PAGE_SIZE
or PAGE_CACHE_SIZE units, and we're not getting any benefit from it today.
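
To make the hazard concrete, here is a minimal standalone sketch (not
kernel code) that pretends PAGE_CACHE_SHIFT really were bigger than
PAGE_SHIFT. The bounds check from the snippet above then rejects a
perfectly valid fault, because pgoff is in PAGE_SIZE units while size is
in PAGE_CACHE_SIZE units:

        #include <stdio.h>

        #define PAGE_SHIFT       12
        #define PAGE_CACHE_SHIFT 14     /* hypothetical 16K page cache units */
        #define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)

        int main(void)
        {
                unsigned long i_size = 5UL << PAGE_SHIFT; /* file is 5 small pages long */
                unsigned long pgoff  = 4;                 /* fault in small page 4: valid */
                unsigned long size   = (i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

                /* Mixed units: prints "SIGBUS" even though the offset is inside the file. */
                printf("%s\n", pgoff >= size ? "SIGBUS" : "ok");
                return 0;
        }

Today the two shifts are equal, so mistakes like this are invisible; the
moment they diverge, checks like this break silently.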
> A sibling topic is THP for XIP (see Matthew's patchset). People want to
> manage persistent memory in 2M chunks where possible, and THP (but without
> struct page in this case) is the obvious solution.
Not just 2MB, we also want 1GB pages for some special cases. It looks
doable (XFS can allocate aligned 1GB blocks). I've written some
supporting code that will at least get us to the point where we can
insert a 1GB page. I haven't been able to test anything yet.
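
For reference, the basic constraint such code has to check before
inserting a PMD- or PUD-sized entry (just the arithmetic, not anyone's
actual patch): the virtual address must be aligned to the mapping size,
and the backing extent must be physically contiguous and start on the
same alignment, which is why naturally aligned 1GB allocations from the
filesystem matter.

        #include <stdbool.h>

        /* size is 2M for a PMD entry or 1G for a PUD entry.  The caller is
         * assumed to have verified that the extent is physically contiguous
         * for 'size' bytes. */
        static bool can_map_huge(unsigned long vaddr, unsigned long long phys,
                                 unsigned long long size)
        {
                return (vaddr & (size - 1)) == 0 && (phys & (size - 1)) == 0;
        }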
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability
From: Kirill A. Shutemov @ 2014-01-10 22:59 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Mel Gorman, lsf-pc, linux-fsdevel, linux-mm
On Fri, Jan 10, 2014 at 05:51:16PM -0500, Matthew Wilcox wrote:
> On Fri, Jan 10, 2014 at 07:42:04PM +0200, Kirill A. Shutemov wrote:
> > On Wed, Jan 08, 2014 at 03:13:21PM +0000, Mel Gorman wrote:
> > > I think transparent huge pagecache is likely to crop up for more than one
> > > reason. There is the TLB issue and the motivation that i-TLB pressure is
> > > a problem in some specialised cases. Whatever the merits of that case,
> > > transparent hugepage cache has been raised as a potential solution for
> > > some VM scalability problems. I recognise that dealing with large numbers
> > > of struct pages is now a problem on larger machines (although I have not
> > > seen quantified data on the problem nor do I have access to a machine large
> > > enough to measure it myself) but I'm wary of transparent hugepage cache
> > > being treated as a primary solution for VM scalability problems. Lacking
> > > performance data I have no suggestions on what these alternative solutions
> > > might look like.
>
> Something I'd like to see discussed (but don't have the MM chops to
> lead a discussion on myself) is the PAGE_CACHE_SIZE vs PAGE_SIZE split.
> This needs to be either fixed or removed, IMO. It's been in the tree
> since before git history began (ie before 2005), it imposes a reasonably
> large cognitive burden on programmers ("what kind of page size do I want
> here?"), it's not intuitively obvious (to a non-mm person) which page
> size is which, and it's never actually bought us anything because it's
> always been the same!
>
> Also, it bitrots. Look at this:
>
>         pgoff_t pgoff = (((address & PAGE_MASK)
>                         - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>         vmf.pgoff = pgoff;
>
>         pgoff_t offset = vmf->pgoff;
>
>         size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>         if (offset >= size)
>                 return VM_FAULT_SIGBUS;
>
> That's spread over three functions, but that goes to illustrate my point;
> getting this stuff right is Hard; core mm developers get it wrong, we
> don't have the right types to document whether a variable is in PAGE_SIZE
> or PAGE_CACHE_SIZE units, and we're not getting any benefit from it today.
I also want to drop PAGE_CACHE_*. It's been on my todo list for almost a year now ;)
> > A sibling topic is THP for XIP (see Matthew's patchset). People want to
> > manage persistent memory in 2M chunks where possible, and THP (but without
> > struct page in this case) is the obvious solution.
>
> Not just 2MB, we also want 1GB pages for some special cases. It looks
> doable (XFS can allocate aligned 1GB blocks). I've written some
> supporting code that will at least get us to the point where we can
> insert a 1GB page. I haven't been able to test anything yet.
It's probably doable from the fs point of view, but adding PUD-level THP
pages is not trivial at all. I think it's more productive to concentrate
on 2M for now.
--
Kirill A. Shutemov
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability
From: Matthew Wilcox @ 2014-01-11 1:49 UTC (permalink / raw)
To: Kirill A. Shutemov; +Cc: Mel Gorman, lsf-pc, linux-fsdevel, linux-mm
On Sat, Jan 11, 2014 at 12:59:34AM +0200, Kirill A. Shutemov wrote:
> On Fri, Jan 10, 2014 at 05:51:16PM -0500, Matthew Wilcox wrote:
> > On Fri, Jan 10, 2014 at 07:42:04PM +0200, Kirill A. Shutemov wrote:
> > > On Wed, Jan 08, 2014 at 03:13:21PM +0000, Mel Gorman wrote:
> > > > I think transparent huge pagecache is likely to crop up for more than one
> > > > reason. There is the TLB issue and the motivation that i-TLB pressure is
> > > > a problem in some specialised cases. Whatever the merits of that case,
> > > > transparent hugepage cache has been raised as a potential solution for
> > > > some VM scalability problems. I recognise that dealing with large numbers
> > > > of struct pages is now a problem on larger machines (although I have not
> > > > seen quantified data on the problem nor do I have access to a machine large
> > > > enough to measure it myself) but I'm wary of transparent hugepage cache
> > > > being treated as a primary solution for VM scalability problems. Lacking
> > > > performance data I have no suggestions on what these alternative solutions
> > > > might look like.
> >
> > Something I'd like to see discussed (but don't have the MM chops to
> > lead a discussion on myself) is the PAGE_CACHE_SIZE vs PAGE_SIZE split.
> > This needs to be either fixed or removed, IMO. It's been in the tree
> > since before git history began (ie before 2005), it imposes a reasonably
> > large cognitive burden on programmers ("what kind of page size do I want
> > here?"), it's not intuitively obvious (to a non-mm person) which page
> > size is which, and it's never actually bought us anything because it's
> > always been the same!
>
> I also want to drop PAGE_CACHE_*. It's been on my todo list for almost a year now ;)
I don't necessarily want to drop the concept of having 'the size of
memory referenced by struct page' != 'the size of memory pointed at
by a single PTE'. I just want to see it *implemented* for at least one
architecture if we're going to have the distinction. It's one way of
solving the problem that Mel mentioned (dealing with a large number of
struct pages).
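
To illustrate what actually implementing the distinction would mean, a
sketch with made-up names, assuming hypothetical 64K page cache units on
4K PTEs: every boundary crossing needs an explicit unit conversion, and in
exchange the page cache has 16x fewer struct pages to manage.

        #define PAGE_SHIFT       12
        #define PAGE_CACHE_SHIFT 16     /* hypothetical: 64K page cache units */

        /* PTE-sized file offset (what the fault path computes) -> page cache index. */
        static inline unsigned long pgoff_to_cache_index(unsigned long pgoff)
        {
                return pgoff >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
        }

        /* Byte offset within the cache page for a PTE-sized file offset. */
        static inline unsigned long pgoff_cache_offset(unsigned long pgoff)
        {
                return (pgoff << PAGE_SHIFT) & ((1UL << PAGE_CACHE_SHIFT) - 1);
        }

Distinct index types for the two unit systems, rather than pgoff_t for
both, would let the compiler catch the kind of mixing shown earlier.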
> > > A sibling topic is THP for XIP (see Matthew's patchset). People want to
> > > manage persistent memory in 2M chunks where possible, and THP (but without
> > > struct page in this case) is the obvious solution.
> >
> > Not just 2MB, we also want 1GB pages for some special cases. It looks
> > doable (XFS can allocate aligned 1GB blocks). I've written some
> > supporting code that will at least get us to the point where we can
> > insert a 1GB page. I haven't been able to test anything yet.
>
> It's probably doable from the fs point of view, but adding PUD-level THP
> pages is not trivial at all. I think it's more productive to concentrate
> on 2M for now.
It's clearly Hard to get to a point where we're inserting PUD entries
for anonymous pages. While I don't think it's trivial to get to PUD entries
for PFNMAP, I think it is doable.
Last time we discussed this, your concern was around splitting a PUD entry
down into PTEs and having to preallocate all the memory required to do that.
We can't possibly need to call split_huge_page() for the PFNMAP case
because we don't have a struct page, so none of those code paths can
be run. I think that leaves split_huge_page_pmd() as the only place
where we can try to split a huge PFNMAP PMD. That's called from:
mem_cgroup_count_precharge_pte_range()
mem_cgroup_move_charge_pte_range()
These two look like they need to be converted to work on unsplit
PMDs anyway, for efficiency reasons. Perhaps someone who's hacked
on this file as recently as 2009 would care to do that work? :-)
zap_pmd_range() does this:
        if (pmd_trans_huge(*pmd)) {
                if (next-addr != HPAGE_PMD_SIZE) {
                        VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                        split_huge_page_pmd(vma->vm_mm, pmd);
                } else if (zap_huge_pmd(tlb, vma, pmd))
                        continue;
                /* fall through */
        }
I don't understand why it bothers to split rather than just zapping the
PMD and allowing refaults to populate the PTEs later.
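
Purely to illustrate the alternative (hypothetical, not from any posted
patch, and assuming the PMD maps an XIP/PFNMAP region with no struct page
behind it), the partial-range case could simply drop the whole entry;
vma_is_xip() is a made-up helper here, and zap_huge_pmd() would need a
page-less-aware variant:

        if (pmd_trans_huge(*pmd)) {
                if (next-addr != HPAGE_PMD_SIZE && vma_is_xip(vma)) {
                        /* No anonymous data to lose: drop the entry and let
                         * later faults repopulate it from the backing store. */
                        zap_huge_pmd(tlb, vma, pmd);
                        continue;
                } else if (next-addr != HPAGE_PMD_SIZE) {
                        VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                        split_huge_page_pmd(vma->vm_mm, pmd);
                } else if (zap_huge_pmd(tlb, vma, pmd))
                        continue;
                /* fall through */
        }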
follow_page() calls it, but I think we can give up way earlier in this
function, since we know there's no struct page to return. We can put
in something like:
        if (IS_XIP(file_inode(vma->vm_file)))
                return ERR_PTR(-Ewhatever);
check_pmd_range() calls it, but this is NUMA policy for the page cache.
We should be skipping this code for XIP files too, if we aren't already.
change_pmd_range() calls split_huge_page_pmd() if an mprotect call lands
in the middle of a PMD range. Again, I'd be *fine* with just dropping the
PMD entry here and allowing faults to repopulate the PTEs.
Looks like the mremap code may need some work. I'm not sure what that
work is right now.
That leaves us with walk_page_range() ... which also looks like it's
going to need some work in the callers.
So yeah, not trivial at all, but doable with a few weeks of work,
I think. Unless there's some other major concern that I've missed
(which is possible since I'm not a MM hacker).
* Re: [Lsf-pc] [LSF/MM ATTEND] Memory management -- THP, hugetlb, scalability
From: Kirill A. Shutemov @ 2014-01-11 2:55 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Mel Gorman, lsf-pc, linux-fsdevel, linux-mm
On Fri, Jan 10, 2014 at 08:49:24PM -0500, Matthew Wilcox wrote:
> > I also want to drop PAGE_CACHE_*. It's been on my todo list for almost a year now ;)
>
> I don't necessarily want to drop the concept of having 'the size of
> memory referenced by struct page' != 'the size of memory pointed at
> by a single PTE'. I just want to see it *implemented* for at least one
> architecture if we're going to have the distinction. It's one way of
> solving the problem that Mel mentioned (dealing with a large number of
> struct pages).
Okay. But I don't think it's going to happen.
> > > > A sibling topic is THP for XIP (see Matthew's patchset). People want to
> > > > manage persistent memory in 2M chunks where possible, and THP (but without
> > > > struct page in this case) is the obvious solution.
> > >
> > > Not just 2MB, we also want 1GB pages for some special cases. It looks
> > > doable (XFS can allocate aligned 1GB blocks). I've written some
> > > supporting code that will at least get us to the point where we can
> > > insert a 1GB page. I haven't been able to test anything yet.
> >
> > It's probably doable from the fs point of view, but adding PUD-level THP
> > pages is not trivial at all. I think it's more productive to concentrate
> > on 2M for now.
>
> It's clearly Hard to get to a point where we're inserting PUD entries
> for anonymous pages. While I don't think it's trivial to get to PUD entries
> for PFNMAP, I think it is doable.
>
> Last time we discussed this, your concern was around splitting a PUD entry
> down into PTEs and having to preallocate all the memory required to do that.
Another thing is dealing with PMD vs PTE races (due to splitting,
MADV_DONTNEED or something else). Adding PUD to the picture doesn't make
it easier.
> We can't possibly need to call split_huge_page() for the PFNMAP case
> because we don't have a struct page, so none of those code paths can
> be run. I think that leaves split_huge_page_pmd() as the only place
> where we can try to split a huge PFNMAP PMD.
>
> That's called from:
>
> mem_cgroup_count_precharge_pte_range()
> mem_cgroup_move_charge_pte_range()
> These two look like they need to be converted to work on unsplit
> PMDs anyway, for efficiency reasons. Perhaps someone who's hacked
> on this file as recently as 2009 would care to do that work? :-)
:) Maybe. I don't really remember anything there.
> zap_pmd_range() does this:
>
>         if (pmd_trans_huge(*pmd)) {
>                 if (next-addr != HPAGE_PMD_SIZE) {
>                         VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
>                         split_huge_page_pmd(vma->vm_mm, pmd);
>                 } else if (zap_huge_pmd(tlb, vma, pmd))
>                         continue;
>                 /* fall through */
>         }
>
> I don't understand why it bothers to split rather than just zapping the
> PMD and allowing refaults to populate the PTEs later.
Because with anon pages you don't have backing storage to repopulate
from: zapping would free the memory and you would end up with clear pages
after the next page fault.

Yeah, it's not relevant for file pages; there you can just unmap the PMD.
> follow_page() calls it, but I think we can give up way earlier in this
> function, since we know there's no struct page to return. We can put
> in something like:
>
>         if (IS_XIP(file_inode(vma->vm_file)))
>                 return ERR_PTR(-Ewhatever);
Are you sure you won't need a temporary struct page here to show to the
caller, or something like that?
> check_pmd_range() calls it, but this is NUMA policy for the page cache.
> We should be skipping this code for XIP files too, if we aren't already.
Okay.
> change_pmd_range() calls split_huge_page_pmd() if an mprotect call lands
> in the middle of a PMD range. Again, I'd be *fine* with just dropping the
> PMD entry here and allowing faults to repopulate the PTEs.
Do you have a way to record that the area should be repopulated with
PTEs rather than a PMD?
> Looks like the mremap code may need some work. I'm not sure what that
> work is right now.
You can probably unmap there too and handle it as !old_pmd.
> That leaves us with walk_page_range() ... which also looks like it's
> going to need some work in the callers.
>
> So yeah, not trivial at all, but doable with a few weeks of work,
> I think. Unless there's some other major concern that I've missed
> (which is possible since I'm not a MM hacker).
Okay, doable. I guess.
The general approach is to replace split with unmap. unmap_mapping_range()
takes ->i_mmap_mutex, and we will probably hit lock ordering issues.
But I would suggest taking a more conservative approach first: leave 1G
pages aside, use 2M pages with page table pre-allocation, and implement
proper splitting in split_huge_page_pmd() for this case. It should be
much easier than fixing all the split_huge_page_pmd() callers.
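
For what it's worth, the anon THP code already structures splitting this
way: a page table page is preallocated and deposited when the huge PMD is
installed, then withdrawn at split time, so the split path never has to
allocate memory. A rough sketch of the same idea for the file case
(simplified signatures, error handling omitted):

        /* At fault time, when installing the 2M entry: */
        pgtable = pte_alloc_one(mm, haddr);     /* may fail: fall back to 4K */
        if (unlikely(!pgtable))
                return VM_FAULT_FALLBACK;
        set_pmd_at(mm, haddr, pmd, entry);
        pgtable_trans_huge_deposit(mm, pmd, pgtable);

        /* At split time: no allocation, just consume the deposit... */
        pgtable = pgtable_trans_huge_withdraw(mm, pmd);
        /* ...and fill its PTEs to cover the range the old PMD mapped. */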
After getting this to work, we can look at how to eliminate the memory
overhead of the preallocated page tables and bring in 1G pages.
By then you will probably have some performance data showing that you
don't really need 1G pages that much. ;)
--
Kirill A. Shutemov