linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	lsf-pc@lists.linux-foundation.org
Subject: Re: [Lsf-pc] [LSF/MM ATTEND] Huge Page Futures
Date: Thu, 28 Jan 2016 10:24:55 -0800	[thread overview]
Message-ID: <56AA5CF7.1040108@oracle.com> (raw)
In-Reply-To: <20160128092122.GH3104@suse.de>

On 01/28/2016 01:21 AM, Mel Gorman wrote:
> On Wed, Jan 27, 2016 at 09:49:57AM -0800, Mike Kravetz wrote:
>> On 01/25/2016 05:50 AM, Mike Kravetz wrote:
>>>> Do you have any thoughts how it's going to be implemented? It would be
>>>> nice to have some design overview or better proof-of-concept patch before
>>>> the summit to be able analyze implications for the kernel.
>>>>
>>>
>>> Good to know the hugetlbfs implementation is considered a hack.  I just
>>> started looking at this, and was going to use hugetlbfs as a starting
>>> point.  I'll reconsider that decision.
>>
>> Kirill, can you (or others) explain your reasons for saying the hugetlbfs
>> implementation is an ugly hack?  I do not have enough history/experience
>> with this to say what is most offensive.  I would be happy to start by
>> cleaning up issues with the current implementation.
>>
> 
> Historically, it was considered a hack because it had special handling in
> a number of paths in the VM. Of course THP also has similar handling now
> so it's less of a concern but there are differences that cause base pages,
> transparent hugepages and hugetlbfs pages to all be special cases. That
> does not sit comfortably with everyone.
> 
> For a long time, it was considered ugly because a fault on private child
> mappings was so unreliable and a fork could cause a parent to unexpectedly
> fail a fault and die. These days it's different as only the child can die
> so while it's less of a concern, hugetlbfs pages allow a child to be killed
> if enough huge pages are not available.
> 
> It was also considered ugly because application-awareness was required in
> so many cases. Granted, libhugetlbfs can hide some of that ugliness but
> even that was considered hacky.
> 
> The fact that hugetlbfs pages cannot be swapped even without mlock is
> another fact that makes them different to the rest of the VM. It has its
> own reservation scheme that is different to everything else.
> 
> One that crippled it to some extent with the label was the fact that fixing
> swap on it was effectively impossible because of power. Once huge pages
> had been installed on that architecture for a lont time, it was impossible
> to remap them at a different size. The limitation has been relaxed to some
> extent but those around long enough remember it.
> 
> So it is a bit of a hack that behaves differently to other page types.
> It's fairly complex and while the semantics used to be a lot uglier than
> it is now, the "ugly hack" label has stuck.

Thanks Mel.  I understand most of the issues you mention above.  However,
some DB providers make extensive use of hugetlbfs for maximum performance.
My question had more to do with shared page tables (below) than hugetlbfs
in general.

> 
>> If we do shared page tables for DAX, it makes sense that it and hugetlbfs
>> should be similar (or common) if possible.
>>
> 
> It's been a long time since I looked at shared page tables so I can't
> remember why but it was a difficult area. A few years were spent on it so
> if shared page tables are being considered, I would make damn sure first
> that they actually help on modern hardware before jumping into that hole.

IIUC, the only sharing today is in hugetlbfs at the PMD level.  Sharing
at this level for 2M huge pages still provides significant memory savings
for some large DB implementations.  Think of systems with TBs of memory,
and using large shared mappings.  There can be 10,000 or more processes
sharing these mappings.  You can expect that pmem will provide more
opportunities for sharing of large mappings.  My intention was to explore
the possibility of providing this type of sharing (at a minimum) to huge
pages for DAX mappings.  I was only looking at this from the space saving
angle.

-- 
Mike Kravetz

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2016-01-28 18:24 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-25  1:57 [LSF/MM ATTEND] Huge Page Futures Mike Kravetz
2016-01-25 11:01 ` Kirill A. Shutemov
2016-01-25 13:50   ` Mike Kravetz
2016-01-27 17:49     ` Mike Kravetz
2016-01-28  8:49       ` Hugh Dickins
2016-01-28 19:06         ` Mike Kravetz
2016-01-28  9:21       ` [Lsf-pc] " Mel Gorman
2016-01-28 18:24         ` Mike Kravetz [this message]
2016-01-28 15:05 ` Aneesh Kumar K.V
2016-01-28 19:28   ` Mike Kravetz
2016-01-29 10:01     ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56AA5CF7.1040108@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=kirill@shutemov.name \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=mgorman@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).