* [LSF/MM TOPIC] Support for 1GB THP @ 2016-03-01 7:09 Matthew Wilcox 2016-03-01 10:25 ` [Lsf-pc] " Jan Kara 2016-03-01 12:20 ` Kirill A. Shutemov 0 siblings, 2 replies; 14+ messages in thread From: Matthew Wilcox @ 2016-03-01 7:09 UTC (permalink / raw) To: lsf-pc; +Cc: linux-fsdevel, linux-mm There are a few issues around 1GB THP support that I've come up against while working on DAX support that I think may be interesting to discuss in person. - Do we want to add support for 1GB THP for anonymous pages? DAX support is driving the initial 1GB THP support, but would anonymous VMAs also benefit from 1GB support? I'm not volunteering to do this work, but it might make an interesting conversation if we can identify some users who think performance would be better if they had 1GB THP support. - Latency of a major page fault. According to various public reviews, main memory bandwidth is about 30GB/s on a Core i7-5960X with 4 DDR4 channels. I think people are probably fairly unhappy about doing only 30 page faults per second. So maybe we need a more complex scheme to handle major faults where we insert a temporary 2MB mapping, prepare the other 2MB pages in the background, then merge them into a 1GB mapping when they're completed. - Cache pressure from 1GB page support. If we're using NT stores, they bypass the cache, and all should be good. But if there are architectures that support THP and not NT stores, zeroing a page is just going to obliterate their caches. Other topics that might interest people from a VM/FS point of view: - Uses for (or replacement of) the radix tree. We're currently looking at using the radix tree with DAX in order to reduce the number of calls into the filesystem. That's leading to various enhancements to the radix tree, such as support for a lock bit for exceptional entries (Neil Brown), and support for multi-order entries (me). 
Is the (enhanced) radix tree the right data structure to be using for this brave new world of huge pages in the page cache, or should we be looking at some other data structure like an RB-tree? - Can we get rid of PAGE_CACHE_SIZE now? Finally? Pretty please? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org ^ permalink raw reply [flat|nested] 14+ messages in thread
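The 30-faults-per-second figure above follows directly from the quoted bandwidth. A quick back-of-the-envelope sketch (the 30 GB/s number is taken from the message; a real major fault would add allocation and page-table overhead on top of the streaming cost):

```python
# Rough upper bound on major-fault throughput when each fault must zero
# (or fill) the whole mapping at a given memory bandwidth.
GB = 1 << 30
MB = 1 << 20

def faults_per_second(bandwidth_bytes, page_bytes):
    """Best-case faults/sec if every fault streams page_bytes of data."""
    return bandwidth_bytes / page_bytes

bw = 30 * GB  # bandwidth quoted above for a Core i7-5960X with 4 DDR4 channels
print(faults_per_second(bw, 1 * GB))   # 1GB pages -> 30 faults/sec
print(faults_per_second(bw, 2 * MB))   # 2MB pages -> 15,360 faults/sec
print(faults_per_second(bw, 4096))     # 4KB pages -> ~7.9M faults/sec
```

This is why the message suggests a temporary 2MB mapping: the 2MB fault rate is three orders of magnitude less painful than the 1GB one.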
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 7:09 [LSF/MM TOPIC] Support for 1GB THP Matthew Wilcox @ 2016-03-01 10:25 ` Jan Kara 2016-03-01 11:00 ` Mel Gorman 2016-03-01 21:44 ` Matthew Wilcox 2016-03-01 12:20 ` Kirill A. Shutemov 1 sibling, 2 replies; 14+ messages in thread From: Jan Kara @ 2016-03-01 10:25 UTC (permalink / raw) To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-mm Hi, On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > There are a few issues around 1GB THP support that I've come up against > while working on DAX support that I think may be interesting to discuss > in person. > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > is driving the initial 1GB THP support, but would anonymous VMAs also > benefit from 1GB support? I'm not volunteering to do this work, but > it might make an interesting conversation if we can identify some users > who think performance would be better if they had 1GB THP support. Some time ago I was thinking about 1GB THP and I was wondering: What is the motivation for 1GB pages for persistent memory? Is it the savings in memory used for page tables? Or is it about the cost of fault? If it is mainly about the fault cost, won't some fault-around logic (i.e. filling more PMD entries in one PMD fault) go a long way towards reducing fault cost without some complications? > - Latency of a major page fault. According to various public reviews, > main memory bandwidth is about 30GB/s on a Core i7-5960X with 4 > DDR4 channels. I think people are probably fairly unhappy about > doing only 30 page faults per second. So maybe we need a more complex > scheme to handle major faults where we insert a temporary 2MB mapping, > prepare the other 2MB pages in the background, then merge them into > a 1GB mapping when they're completed. Yeah, here is one of the complications I have mentioned above ;) > - Cache pressure from 1GB page support. 
If we're using NT stores, they > bypass the cache, and all should be good. But if there are > architectures that support THP and not NT stores, zeroing a page is > just going to obliterate their caches. Even doing fsync() - and thus flushing all cache lines associated with a 1GB page - is likely going to take a noticeable chunk of time. The granularity of cache flushing in the kernel is another thing that makes me somewhat cautious about 1GB pages. > Other topics that might interest people from a VM/FS point of view: > > - Uses for (or replacement of) the radix tree. We're currently > looking at using the radix tree with DAX in order to reduce the number > of calls into the filesystem. That's leading to various enhancements > to the radix tree, such as support for a lock bit for exceptional > entries (Neil Brown), and support for multi-order entries (me). > Is the (enhanced) radix tree the right data structure to be using > for this brave new world of huge pages in the page cache, or should > we be looking at some other data structure like an RB-tree? I was also thinking whether we wouldn't be better off with some other data structure than the radix tree for DAX. And I didn't really find anything that I'd be satisfied with. The main advantages of the radix tree I see are: it is of constant depth, it supports lockless lookups, it is relatively simple (although with the additions we'd need, this advantage slowly vanishes), and it is pretty space efficient for common cases. For your multi-order entries I was wondering whether we shouldn't relax the requirement that all nodes have the same number of slots - e.g. we could have the number of slots vary with node depth so that PMD and eventually PUD multi-order slots end up being a single entry at the appropriate radix tree level. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
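Jan's concern about flush granularity can be sized with simple arithmetic. A sketch (the 64-byte cache line is typical for x86; the per-line cost below is an invented placeholder for illustration, not a measured number):

```python
# How many cache lines must be written back to fsync one fully dirtied
# 1GB DAX page when the kernel cannot track which lines were updated.
CACHE_LINE = 64        # bytes, typical x86 cache line
GB = 1 << 30

lines_per_1g = GB // CACHE_LINE
print(lines_per_1g)    # 16,777,216 lines per 1GB page

# Even at an assumed (illustrative) 10ns per flushed line, one full
# 1GB writeback costs on the order of:
assumed_ns_per_line = 10
print(lines_per_1g * assumed_ns_per_line / 1e9, "seconds")  # ~0.17s
```

Whatever the real per-line cost turns out to be, the line count alone shows why flushing at 1GB granularity is worrying.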
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 10:25 ` [Lsf-pc] " Jan Kara @ 2016-03-01 11:00 ` Mel Gorman 2016-03-01 11:51 ` Mel Gorman 2016-03-01 21:44 ` Matthew Wilcox 1 sibling, 1 reply; 14+ messages in thread From: Mel Gorman @ 2016-03-01 11:00 UTC (permalink / raw) To: Jan Kara; +Cc: Matthew Wilcox, linux-fsdevel, linux-mm, lsf-pc On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > Hi, > > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > There are a few issues around 1GB THP support that I've come up against > > while working on DAX support that I think may be interesting to discuss > > in person. > > > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > > is driving the initial 1GB THP support, but would anonymous VMAs also > > benefit from 1GB support? I'm not volunteering to do this work, but > > it might make an interesting conversation if we can identify some users > > who think performance would be better if they had 1GB THP support. > > Some time ago I was thinking about 1GB THP and I was wondering: What is the > motivation for 1GB pages for persistent memory? Is it the savings in memory > used for page tables? Or is it about the cost of fault? > If anything, the cost of the fault is going to suck as a 1G allocation and zeroing is required even if the application only needs 4K. It's by no means a universal win. The savings are in page table usage and TLB miss cost reduction and TLB footprint. For anonymous memory, it's not considered to be worth it because the cost of allocating the page is so high even if it works. There is no guarantee it'll work as fragmentation avoidance only works on the 2M boundary. It's worse when files are involved because there is a write-multiplication effect when huge pages are used.
I'm highly skeptical that THP for persistent memory is even worthwhile once the write multiplication factors and allocation costs are taken into consideration. I was surprised overall that it was even attempted before basic features of persistent memory were completed. I felt that it should have been avoided until the 4K case was as fast as possible and hitting problems where TLB was the limiting factor. Given that I recently threw in the towel over the cost of 2M allocations, let alone 1G translations, I'm highly skeptical that 1G anonymous pages are worth the cost. > If it is mainly about the fault cost, won't some fault-around logic (i.e. > filling more PMD entries in one PMD fault) go a long way towards reducing > fault cost without some complications? > I think this would be a pre-requisite. Basically, the idea is that a 2M page is reserved, but not allocated in response to a 4K page fault. The pages are then inserted at the properly aligned locations. If there are faults around it, they use other properly aligned pages, and when the 2M chunk is fully allocated it is promoted at that point. Early research considered whether a fill factor other than 1 should trigger a hugepage promotion, but it would have to be re-evaluated on modern hardware. I'm not aware of anyone actually working on such an implementation though because it'd be a lot of legwork. I wrote a TODO item about this at some far point in the past that never got to the top of the list: Title: In-place huge page collapsing Description: When collapsing a huge page, the kernel allocates a huge page and then copies from the base page. This is expensive. Investigate in-place reservation whereby a base page is faulted in but the properly placed pages are reserved for that process unless the alternative is to fail the allocation. Care would be needed to ensure that the kernel does not reclaim because pages are reserved or increase contention on zone->lock.
If it works correctly we would be able to collapse huge pages without copying and it would also perform extremely well when the workload uses sparse address spaces. > > - Cache pressure from 1GB page support. If we're using NT stores, they > > bypass the cache, and all should be good. But if there are > > architectures that support THP and not NT stores, zeroing a page is > > just going to obliterate their caches. > > Even doing fsync() - and thus flush all cache lines associated with 1GB > page - is likely going to take noticeable chunk of time. The granularity of > cache flushing in kernel is another thing that makes me somewhat cautious > about 1GB pages. > Problems like this were highlighted in early hugepage-related papers in the 90's. Even if persistent memory is extremely fast, there are going to be large costs. In-place promotion would avoid some of the worst of the costs. If it was me, I would focus on getting all the basic features of persistent memory working first, finding if there are workloads that are limited by TLB pressure and then, and only then, start worrying about 1G pages. If that is not done then persistent memory could fall into the same trap that the VM did whereby huge pages were being used to work around bottlenecks within the VM or crappy hardware. -- Mel Gorman SUSE Labs
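The reservation-and-promote idea in the TODO above can be illustrated with a toy model (all names and the fill-factor threshold here are invented for illustration; the real work would live in the kernel fault path and collapse machinery, not in a class like this):

```python
# Toy model of in-place huge page collapsing: 4K faults land at properly
# aligned offsets inside a reserved 2M region; once enough base pages are
# populated (the fill factor), the region is "promoted" without copying.
HUGE_ORDER = 512          # 4K base pages per 2M huge page
FILL_FACTOR = 1.0         # promote only when fully populated (tunable)

class Region:
    def __init__(self):
        self.populated = set()   # offsets of faulted-in 4K pages
        self.is_huge = False

    def fault(self, page_index):
        if self.is_huge:
            return               # already mapped by a single PMD entry
        self.populated.add(page_index % HUGE_ORDER)
        if len(self.populated) >= FILL_FACTOR * HUGE_ORDER:
            # All base pages are already in the right physical slots,
            # so promotion is a page-table update, not a copy.
            self.is_huge = True

r = Region()
for i in range(HUGE_ORDER):
    r.fault(i)
print(r.is_huge)  # True: collapsed in place once the region filled up
```

A fill factor below 1.0 would trade some internal fragmentation for earlier promotion, which is exactly the re-evaluation question raised above.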
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 11:00 ` Mel Gorman @ 2016-03-01 11:51 ` Mel Gorman 2016-03-01 12:09 ` Kirill A. Shutemov 0 siblings, 1 reply; 14+ messages in thread From: Mel Gorman @ 2016-03-01 11:51 UTC (permalink / raw) To: Matthew Wilcox; +Cc: linux-fsdevel, Jan Kara, lsf-pc, linux-mm On Tue, Mar 01, 2016 at 11:00:55AM +0000, Mel Gorman wrote: > On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > > Hi, > > > > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > > There are a few issues around 1GB THP support that I've come up against > > > while working on DAX support that I think may be interesting to discuss > > > in person. > > > > > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > > > is driving the initial 1GB THP support, but would anonymous VMAs also > > > benefit from 1GB support? I'm not volunteering to do this work, but > > > it might make an interesting conversation if we can identify some users > > > who think performance would be better if they had 1GB THP support. > > > > Some time ago I was thinking about 1GB THP and I was wondering: What is the > > motivation for 1GB pages for persistent memory? Is it the savings in memory > > used for page tables? Or is it about the cost of fault? > > > > If anything, the cost of the fault is going to suck as a 1G allocation > and zeroing is required even if the application only needs 4K. It's by > no means a universal win. The savings are in page table usage and TLB > miss cost reduction and TLB footprint. For anonymous memory, it's not > considered to be worth it because the cost of allocating the page is so > high even if it works. There is no guarantee it'll work as fragementation > avoidance only works on the 2M boundary. > > It's worse when files are involved because there is a > write-multiplication effect when huge pages are used. 
Specifically, a > fault incurs 1G of IO even if only 4K is required and then dirty > information is only tracked on a huge page granularity. This increased > IO can offset any TLB-related benefit. > It was pointed out to me privately that the IO amplification cost is not the same for persistent memory as it is for traditional storage and this is true. For example, the 1G of data does not have to be read on fault every time. The write problems are mitigated but remain if the 1G block has to be zeroed for example. Even for normal writeback the cache lines have to be flushed as the kernel does not know what lines were updated. I know there is a proposal to defer that tracking to userspace but that breaks if an unaware process accesses the page and is overall very risky. There are other issues such as having to reserve a 1G block in case a file is truncated in the future or else there is an extremely large amount of wastage. Maybe it can be worked around but a workload that uses persistent memory with many small files may have a bad day. While I know some of these points can be countered and discussed further, at the end of the day, the benefits to huge page usage are reduced memory usage on page tables, a reduction of TLB pressure and reduced TLB fill costs. Until such time as it's known that there are realistic workloads that cannot fit in memory due to the page table usage and workloads that are limited by TLB pressure, the complexity of huge pages is unjustified and the focus should be on the basic features working correctly. If the fault overhead of a 4K page is a major concern then fault-around should be used on the 2M boundary at least. I expect there are relatively few real workloads that are limited by the cost of major faults. Applications may have a higher startup cost than desirable but in itself that does not justify using huge pages to work around problems with fault speeds in the kernel.
-- Mel Gorman SUSE Labs
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 11:51 ` Mel Gorman @ 2016-03-01 12:09 ` Kirill A. Shutemov 2016-03-01 12:52 ` Mel Gorman 0 siblings, 1 reply; 14+ messages in thread From: Kirill A. Shutemov @ 2016-03-01 12:09 UTC (permalink / raw) To: Mel Gorman; +Cc: Matthew Wilcox, linux-fsdevel, Jan Kara, lsf-pc, linux-mm On Tue, Mar 01, 2016 at 11:51:36AM +0000, Mel Gorman wrote: > While I know some of these points can be countered and discussed further, > at the end of the day, the benefits to huge page usage are reduced memory > usage on page tables, a reduction of TLB pressure and reduced TLB fill > costs. Until such time as it's known that there are realistic workloads > that cannot fit in memory due to the page table usage and workloads that > are limited by TLB pressure, the complexity of huge pages is unjustified > and the focus should be on the basic features working correctly. The size of page tables can be a limiting factor now for workloads that try to migrate from 2M hugetlb with shared page tables to DAX. 1G pages are a way to lower the overhead. Note that the reduced memory usage on page tables is not there for anon THP, as we have to deposit these page tables to be able to split the huge pmd (or pud) at any point. -- Kirill A. Shutemov
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 12:09 ` Kirill A. Shutemov @ 2016-03-01 12:52 ` Mel Gorman 0 siblings, 0 replies; 14+ messages in thread From: Mel Gorman @ 2016-03-01 12:52 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Matthew Wilcox, linux-fsdevel, Jan Kara, lsf-pc, linux-mm On Tue, Mar 01, 2016 at 03:09:19PM +0300, Kirill A. Shutemov wrote: > On Tue, Mar 01, 2016 at 11:51:36AM +0000, Mel Gorman wrote: > > While I know some of these points can be countered and discussed further, > > at the end of the day, the benefits to huge page usage are reduced memory > > usage on page tables, a reduction of TLB pressure and reduced TLB fill > > costs. Until such time as it's known that there are realistic workloads > > that cannot fit in memory due to the page table usage and workloads that > > are limited by TLB pressure, the complexity of huge pages is unjustified > > and the focus should be on the basic features working correctly. > > Size of page table can be limiting factor now for workloads that tries to > migrate from 2M hugetlb with shared page tables to DAX. 1G pages is a way > to lower the overhead. > That is only a limitation for users of hugetlbfs replacing hugetlbfs pages with DAX and even then only in the case where the workload is precisely sized to available memory. It's a potential limitation in a specialised configuration which may or may not be a problem in practice. Even the benefits of reduced memory usage and TLB pressure are not guaranteed to outweigh problems such as flushing the cache lines of the entire huge page during writeback or the necessity of allocating huge blocks on disk for a file that may or may not need them. Huge pages fix some problems but cause others. It may be better in practice for a workload to shrink the size of the shared region that was previously using hugetlbfs for example.
Granted, I've not been following the development of persistent memory closely but from what I've seen, I think it's more important to get persistent memory, DAX and related features working correctly first and then worry about page table memory usage and TLB pressure *if* it's a problem in practice. If there are problems with fault scalability then it would be better to fix that instead of working around it with huge pages. -- Mel Gorman SUSE Labs
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 10:25 ` [Lsf-pc] " Jan Kara 2016-03-01 11:00 ` Mel Gorman @ 2016-03-01 21:44 ` Matthew Wilcox 2016-03-01 22:15 ` Mike Kravetz ` (2 more replies) 1 sibling, 3 replies; 14+ messages in thread From: Matthew Wilcox @ 2016-03-01 21:44 UTC (permalink / raw) To: Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > There are a few issues around 1GB THP support that I've come up against > > while working on DAX support that I think may be interesting to discuss > > in person. > > > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > > is driving the initial 1GB THP support, but would anonymous VMAs also > > benefit from 1GB support? I'm not volunteering to do this work, but > > it might make an interesting conversation if we can identify some users > > who think performance would be better if they had 1GB THP support. > > Some time ago I was thinking about 1GB THP and I was wondering: What is the > motivation for 1GB pages for persistent memory? Is it the savings in memory > used for page tables? Or is it about the cost of fault? I think it's both. I heard from one customer who calculated that with a 6TB server, mapping every page into a process would take ~24MB of page tables. Multiply that by the 50,000 processes they expect to run on a server of that size consumes 1.2TB of DRAM. Using 1GB pages reduces that by a factor of 512, down to 2GB. Another topic to consider then would be generalising the page table sharing code that is currently specific to hugetlbfs. I didn't bring it up as I haven't researched it in any detail, and don't know how hard it would be. > For your multi-order entries I was wondering whether we shouldn't relax the > requirement that all nodes have the same number of slots - e.g. 
we could > have number of slots variable with node depth so that PMD and eventually PUD > multi-order slots end up being a single entry at appropriate radix tree > level. I'm not a big fan of the sibling entries either :-) One thing I do wonder is whether anyone has done performance analysis recently of whether 2^6 is the right size for radix tree nodes? If it used 2^9, this would be a perfect match to x86 page tables ;-) Variable size is a bit painful because we've got two variable size arrays in the node; the array of node pointers and the tag bitmasks. And then we lose the benefit of the slab allocator if the node size is variable.
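The customer arithmetic quoted in this message checks out if the ~24MB per-process figure assumes 2MB mappings of the 6TB (a pure 4K mapping would need hundreds of times more PTE memory). A sketch that reproduces the numbers, counting only the lowest-level entries at 8 bytes each on x86-64:

```python
# Page-table footprint for mapping 6TB into each of 50,000 processes.
TB, GB, MB = 1 << 40, 1 << 30, 1 << 20
ENTRY = 8            # bytes per page-table entry on x86-64
span = 6 * TB
procs = 50_000

def lowest_level_bytes(page_size):
    """Bytes of lowest-level page-table entries to map `span` once."""
    return span // page_size * ENTRY

print(lowest_level_bytes(2 * MB) / MB)           # 24.0 MB/process with 2M pages
print(lowest_level_bytes(2 * MB) * procs / TB)   # ~1.14 binary TB across 50k processes
print(lowest_level_bytes(1 * GB) * procs / GB)   # ~2.3 GB total with 1G pages
```

The 512x reduction quoted in the message is exactly the ratio of the 1GB and 2MB page sizes.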
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 21:44 ` Matthew Wilcox @ 2016-03-01 22:15 ` Mike Kravetz 2016-03-01 22:33 ` Rik van Riel 2016-03-01 22:36 ` James Bottomley 2 siblings, 0 replies; 14+ messages in thread From: Mike Kravetz @ 2016-03-01 22:15 UTC (permalink / raw) To: Matthew Wilcox, Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm On 03/01/2016 01:44 PM, Matthew Wilcox wrote: > On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: >> On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: >>> There are a few issues around 1GB THP support that I've come up against >>> while working on DAX support that I think may be interesting to discuss >>> in person. >>> >>> - Do we want to add support for 1GB THP for anonymous pages? DAX support >>> is driving the initial 1GB THP support, but would anonymous VMAs also >>> benefit from 1GB support? I'm not volunteering to do this work, but >>> it might make an interesting conversation if we can identify some users >>> who think performance would be better if they had 1GB THP support. >> >> Some time ago I was thinking about 1GB THP and I was wondering: What is the >> motivation for 1GB pages for persistent memory? Is it the savings in memory >> used for page tables? Or is it about the cost of fault? > > I think it's both. I heard from one customer who calculated that with > a 6TB server, mapping every page into a process would take ~24MB of > page tables. Multiply that by the 50,000 processes they expect to run > on a server of that size consumes 1.2TB of DRAM. Using 1GB pages reduces > that by a factor of 512, down to 2GB. > > Another topic to consider then would be generalising the page table > sharing code that is currently specific to hugetlbfs. I didn't bring > it up as I haven't researched it in any detail, and don't know how hard > it would be. Well, I have started down that path and have it working for some very simple cases with some very hacked up code. Too early/ugly to share. 
I'm struggling a bit with the fact that you can have both regular and huge page mappings of the same regions. The hugetlb code only has to deal with huge pages. -- Mike Kravetz
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 21:44 ` Matthew Wilcox 2016-03-01 22:15 ` Mike Kravetz @ 2016-03-01 22:33 ` Rik van Riel 2016-03-01 22:36 ` James Bottomley 2 siblings, 0 replies; 14+ messages in thread From: Rik van Riel @ 2016-03-01 22:33 UTC (permalink / raw) To: Matthew Wilcox, Jan Kara; +Cc: linux-fsdevel, linux-mm, lsf-pc [-- Attachment #1: Type: text/plain, Size: 1701 bytes --] On Tue, 2016-03-01 at 16:44 -0500, Matthew Wilcox wrote: > On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > > There are a few issues around 1GB THP support that I've come up > > > against > > > while working on DAX support that I think may be interesting to > > > discuss > > > in person. > > > > > > - Do we want to add support for 1GB THP for anonymous > > > pages? DAX support > > > is driving the initial 1GB THP support, but would anonymous > > > VMAs also > > > benefit from 1GB support? I'm not volunteering to do this > > > work, but > > > it might make an interesting conversation if we can identify > > > some users > > > who think performance would be better if they had 1GB THP > > > support. > > > > Some time ago I was thinking about 1GB THP and I was wondering: > > What is the > > motivation for 1GB pages for persistent memory? Is it the savings > > in memory > > used for page tables? Or is it about the cost of fault? > > I think it's both. I heard from one customer who calculated that > with > a 6TB server, mapping every page into a process would take ~24MB of > page tables. Multiply that by the 50,000 processes they expect to > run > on a server of that size consumes 1.2TB of DRAM. Using 1GB pages > reduces > that by a factor of 512, down to 2GB. 
Given the amounts of memory in systems, and the fact that 1GB (or even 2MB) page sizes will not always be possible, even with DAX on persistent memory, I suspect it may be time to implement the reclaiming of page tables that only map file pages. -- All Rights Reversed.
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 21:44 ` Matthew Wilcox 2016-03-01 22:15 ` Mike Kravetz 2016-03-01 22:33 ` Rik van Riel @ 2016-03-01 22:36 ` James Bottomley 2016-03-02 14:14 ` Matthew Wilcox 2 siblings, 1 reply; 14+ messages in thread From: James Bottomley @ 2016-03-01 22:36 UTC (permalink / raw) To: Matthew Wilcox, Jan Kara; +Cc: linux-fsdevel, linux-mm, lsf-pc On Tue, 2016-03-01 at 16:44 -0500, Matthew Wilcox wrote: > On Tue, Mar 01, 2016 at 11:25:41AM +0100, Jan Kara wrote: > > On Tue 01-03-16 02:09:11, Matthew Wilcox wrote: > > > There are a few issues around 1GB THP support that I've come up > > > against > > > while working on DAX support that I think may be interesting to > > > discuss > > > in person. > > > > > > - Do we want to add support for 1GB THP for anonymous pages? > > > DAX support > > > is driving the initial 1GB THP support, but would anonymous > > > VMAs also > > > benefit from 1GB support? I'm not volunteering to do this > > > work, but > > > it might make an interesting conversation if we can identify > > > some users > > > who think performance would be better if they had 1GB THP > > > support. > > > > Some time ago I was thinking about 1GB THP and I was wondering: > > What is the motivation for 1GB pages for persistent memory? Is it > > the savings in memory used for page tables? Or is it about the cost > > of fault? > > I think it's both. I heard from one customer who calculated that > with a 6TB server, mapping every page into a process would take ~24MB > of page tables. Multiply that by the 50,000 processes they expect to > run on a server of that size consumes 1.2TB of DRAM. Using 1GB pages > reduces that by a factor of 512, down to 2GB. This sounds a bit implausible: for the machine not to be thrashing to death, all the 6TB would have to be in shared memory used by all the 50k processes. 
The much more likely scenario is that it's mostly private memory mixed with a bit of shared, in which case sum(private working set) + shared must be under 6TB for the machine not to thrash and you likely only need mappings for the working set. Realistically that means you only need about 50MB or so of page tables, even with our current page size, assuming it's mostly file backed. There might be some optimisation done for the anonymous memory swap case, which is the pte profligate one, but probably we shouldn't do anything until we understand the workload profile. James
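James's ~50MB estimate corresponds to PTEs for a working set in the tens of gigabytes per process. A sketch (the 25 GiB working-set size is an assumed value, chosen here only to show where an estimate of that magnitude could come from):

```python
# PTE footprint as a function of per-process working set, with 4K pages
# and 8-byte entries, i.e. mapping only what a process actually touches.
GB, MB = 1 << 30, 1 << 20
PAGE, ENTRY = 4096, 8

def pte_bytes(working_set):
    """Bytes of PTEs needed to map `working_set` bytes at 4K granularity."""
    return working_set // PAGE * ENTRY

print(pte_bytes(25 * GB) / MB)  # 50.0 MB of PTEs for an assumed 25 GiB working set
```

The contrast with the full-mapping arithmetic earlier in the thread is the crux of the disagreement: mapping the working set is cheap, mapping all 6TB into every process is not.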
* Re: [Lsf-pc] [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 22:36 ` James Bottomley @ 2016-03-02 14:14 ` Matthew Wilcox 0 siblings, 0 replies; 14+ messages in thread From: Matthew Wilcox @ 2016-03-02 14:14 UTC (permalink / raw) To: James Bottomley; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc On Tue, Mar 01, 2016 at 02:36:04PM -0800, James Bottomley wrote: > On Tue, 2016-03-01 at 16:44 -0500, Matthew Wilcox wrote: > > I think it's both. I heard from one customer who calculated that > > with a 6TB server, mapping every page into a process would take ~24MB > > of page tables. Multiply that by the 50,000 processes they expect to > > run on a server of that size consumes 1.2TB of DRAM. Using 1GB pages > > reduces that by a factor of 512, down to 2GB. > > This sounds a bit implausible: Well, that's the customer workload. They have terabytes of data, and they want to map all of it into all 50k processes. I know it's not how I use my machine, but that's customers for you ...
* Re: [LSF/MM TOPIC] Support for 1GB THP 2016-03-01 7:09 [LSF/MM TOPIC] Support for 1GB THP Matthew Wilcox 2016-03-01 10:25 ` [Lsf-pc] " Jan Kara @ 2016-03-01 12:20 ` Kirill A. Shutemov 2016-03-01 16:32 ` Christoph Lameter 1 sibling, 1 reply; 14+ messages in thread From: Kirill A. Shutemov @ 2016-03-01 12:20 UTC (permalink / raw) To: Matthew Wilcox; +Cc: lsf-pc, linux-fsdevel, linux-mm On Tue, Mar 01, 2016 at 02:09:11AM -0500, Matthew Wilcox wrote: > > There are a few issues around 1GB THP support that I've come up against > while working on DAX support that I think may be interesting to discuss > in person. > > - Do we want to add support for 1GB THP for anonymous pages? DAX support > is driving the initial 1GB THP support, but would anonymous VMAs also > benefit from 1GB support? I'm not volunteering to do this work, but > it might make an interesting conversation if we can identify some users > who think performance would be better if they had 1GB THP support. At this point I don't think it would have many users. Too much hassle with non-obvious benefits. > - Latency of a major page fault. According to various public reviews, > main memory bandwidth is about 30GB/s on a Core i7-5960X with 4 > DDR4 channels. I think people are probably fairly unhappy about > doing only 30 page faults per second. So maybe we need a more complex > scheme to handle major faults where we insert a temporary 2MB mapping, > prepare the other 2MB pages in the background, then merge them into > a 1GB mapping when they're completed. > > - Cache pressure from 1GB page support. If we're using NT stores, they > bypass the cache, and all should be good. But if there are > architectures that support THP and not NT stores, zeroing a page is > just going to obliterate their caches. At some point I've tested NT stores for clearing 2M THP and it didn't show much benefit. I guess that could depend on microarchitecture and we should probably re-test this with new CPU generations.
> Other topics that might interest people from a VM/FS point of view:
>
> - Uses for (or replacement of) the radix tree.  We're currently
>   looking at using the radix tree with DAX in order to reduce the number
>   of calls into the filesystem.  That's leading to various enhancements
>   to the radix tree, such as support for a lock bit for exceptional
>   entries (Neil Brown), and support for multi-order entries (me).
>   Is the (enhanced) radix tree the right data structure to be using
>   for this brave new world of huge pages in the page cache, or should
>   we be looking at some other data structure like an RB-tree?

I'm interested in multi-order entries for THP page cache. It's not
required for huge tmpfs, but would be nice to have.

>
> - Can we get rid of PAGE_CACHE_SIZE now?  Finally?  Pretty please?

+1 :)

--
 Kirill A. Shutemov
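The idea behind the multi-order entries discussed above can be illustrated with a toy model: one entry of order N covers 2**N consecutive indices, so a lookup anywhere inside a huge page's range finds the single covering entry. This is only a conceptual sketch with invented names (`MultiOrderIndex`, `insert`, `lookup`), not the kernel's radix-tree implementation or API.

```python
# Toy model of multi-order entries: an order-N entry covers the
# 2**N-aligned range of 2**N indices containing its head index.
class MultiOrderIndex:
    def __init__(self):
        self.entries = {}   # head index -> (order, value)
        self.orders = set() # orders currently in use

    def insert(self, index, order, value):
        head = index & ~((1 << order) - 1)  # align down to the entry's range
        self.entries[head] = (order, value)
        self.orders.add(order)

    def lookup(self, index):
        # Try smaller orders first so small pages shadow a stale large entry.
        for order in sorted(self.orders):
            head = index & ~((1 << order) - 1)
            entry = self.entries.get(head)
            if entry is not None and entry[0] == order:
                return entry[1]
        return None

idx = MultiOrderIndex()
idx.insert(0, 9, "2MB page A")    # order-9 entry: covers indices 0..511
idx.insert(512, 0, "4KB page B")  # order-0 entry at index 512
assert idx.lookup(17) == "2MB page A"
assert idx.lookup(512) == "4KB page B"
```

The real data-structure question raised above is exactly what this toy glosses over: doing such lookups in one descent of the tree rather than one probe per order.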
* Re: [LSF/MM TOPIC] Support for 1GB THP
  2016-03-01 12:20 ` Kirill A. Shutemov
@ 2016-03-01 16:32   ` Christoph Lameter
  2016-03-01 21:47     ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2016-03-01 16:32 UTC (permalink / raw)
To: Kirill A. Shutemov; +Cc: Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm

On Tue, 1 Mar 2016, Kirill A. Shutemov wrote:

> On Tue, Mar 01, 2016 at 02:09:11AM -0500, Matthew Wilcox wrote:
> >
> > There are a few issues around 1GB THP support that I've come up against
> > while working on DAX support that I think may be interesting to discuss
> > in person.
> >
> > - Do we want to add support for 1GB THP for anonymous pages?  DAX support
> >   is driving the initial 1GB THP support, but would anonymous VMAs also
> >   benefit from 1GB support?  I'm not volunteering to do this work, but
> >   it might make an interesting conversation if we can identify some users
> >   who think performance would be better if they had 1GB THP support.
>
> At this point I don't think it would have many users. Too much hassle
> for non-obvious benefits.

In our business we preallocate everything and then the processing
proceeds without faults. 1GB support has obvious benefits for us since
we would be able to access larger areas of memory for lookups and
various bits of computation that we cannot do today without incurring
TLB misses that cause variances in our processing time. Having more
mainstream support for 1GB pages would make it easier to operate using
these pages.

The long processing times for 1GB pages will make it even more important
to ensure all faults are done before hitting critical sections. But this
is already being done for most of our apps.

For the large NVDIMMs on the horizon using gazillions of terabytes we
really would want 1GB support. Otherwise TLB thrashing becomes quite
easy if one walks pointer chains through memory.

> > - Latency of a major page fault.
> >   According to various public reviews, main memory bandwidth is about
> >   30GB/s on a Core i7-5960X with 4 DDR4 channels.  I think people are
> >   probably fairly unhappy about doing only 30 page faults per second.
> >   So maybe we need a more complex scheme to handle major faults where
> >   we insert a temporary 2MB mapping, prepare the other 2MB pages in
> >   the background, then merge them into a 1GB mapping when they're
> >   completed.
> >
> > - Cache pressure from 1GB page support.  If we're using NT stores, they
> >   bypass the cache, and all should be good.  But if there are
> >   architectures that support THP and not NT stores, zeroing a page is
> >   just going to obliterate their caches.
>
> At some point I tested NT stores for clearing 2M THP and it didn't show
> much benefit. I guess that could depend on microarchitecture, and we
> probably should re-test this with new CPU generations.

Zeroing a page should not occur during usual processing but just during
the time that a process starts up.

> > - Can we get rid of PAGE_CACHE_SIZE now?  Finally?  Pretty please?
>
> +1 :)

We have had grandiose visions of being free of that particular set of
chains for more than 10 years now. Sadly nothing really was that
appealing, and the current state of THP support is not that encouraging
either. We'd rather go with static huge page support to have more
control over how memory is laid out for a process.
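The latency and cache-pressure concerns quoted above reduce to simple arithmetic: at ~30GB/s, populating a 1GB page takes ~33ms, so one thread can service only ~30 such major faults per second, and without NT stores that 1GB of zeroing traffic dwarfs any cache level. A sketch; the bandwidth figure is the one quoted in the thread, and the cache size is an illustrative ballpark rather than any specific CPU's.

```python
# Numbers behind the major-fault latency and cache-pressure concerns.
MB, GB = 2 ** 20, 2 ** 30
bandwidth = 30 * GB                      # bytes/second, as quoted above

assert bandwidth // GB == 30             # ~30 major faults/s for 1GB pages
assert bandwidth // (2 * MB) == 15_360   # 2MB faults are 512x cheaper

llc = 20 * MB                            # a generous last-level cache
assert GB // llc > 50                    # one 1GB zeroing pass is >50x the LLC
```

This is the motivation for the temporary-2MB-mapping scheme proposed above: the fault returns after ~65 microseconds of work instead of ~33ms, with the remaining 511 pieces prepared in the background.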
* Re: [LSF/MM TOPIC] Support for 1GB THP
  2016-03-01 16:32 ` Christoph Lameter
@ 2016-03-01 21:47   ` Matthew Wilcox
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox @ 2016-03-01 21:47 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Kirill A. Shutemov, lsf-pc, linux-fsdevel, linux-mm

On Tue, Mar 01, 2016 at 10:32:52AM -0600, Christoph Lameter wrote:
> > > - Can we get rid of PAGE_CACHE_SIZE now?  Finally?  Pretty please?
> >
> > +1 :)
>
> We have had grandiose visions of being free of that particular set of
> chains for more than 10 years now. Sadly nothing really was that
> appealing, and the current state of THP support is not that encouraging
> either. We'd rather go with static huge page support to have more
> control over how memory is laid out for a process.

With Kirill's fault-around code in place, I think it delivers all or
most of the benefits promised by increasing PAGE_CACHE_SIZE.
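The fault-around point above is also just arithmetic: mapping N nearby pages per minor fault divides the number of faults by N, which is most of what a larger PAGE_CACHE_SIZE would have bought. A sketch; the 16-page window corresponds to the kernel's 64KB default fault_around_bytes, and the 64MB mapping size is an arbitrary example.

```python
# Why fault-around captures much of the benefit of larger page-cache units.
KB, MB = 2 ** 10, 2 ** 20

pages = 64 * MB // (4 * KB)      # touching a 64MB file-backed mapping
assert pages == 16_384           # naively: one minor fault per 4KB page
assert pages // 16 == 1_024      # with a 16-page fault-around window
```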
end of thread, other threads: [~2016-03-02 14:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2016-03-01  7:09 [LSF/MM TOPIC] Support for 1GB THP Matthew Wilcox
2016-03-01 10:25 ` [Lsf-pc] " Jan Kara
2016-03-01 11:00   ` Mel Gorman
2016-03-01 11:51     ` Mel Gorman
2016-03-01 12:09       ` Kirill A. Shutemov
2016-03-01 12:52         ` Mel Gorman
2016-03-01 21:44           ` Matthew Wilcox
2016-03-01 22:15             ` Mike Kravetz
2016-03-01 22:33             ` Rik van Riel
2016-03-01 22:36             ` James Bottomley
2016-03-02 14:14               ` Matthew Wilcox
2016-03-01 12:20 ` Kirill A. Shutemov
2016-03-01 16:32   ` Christoph Lameter
2016-03-01 21:47     ` Matthew Wilcox