Re: Hugepages demand paging V2 [0/8]: Discussion and overview

From: Ray Bryant <raybry@sgi.com>
To: Robin Holt <holt@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>,
	Jesse Barnes <jbarnes@sgi.com>,
	William Lee Irwin III <wli@holomorphy.com>,
	"Chen, Kenneth W" <kenneth.w.chen@intel.com>,
	linux-kernel@vger.kernel.org
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview
Date: Thu, 28 Oct 2004 11:34:04 -0500
Message-ID: <41811F7C.40003@sgi.com>
In-Reply-To: <20041028115144.GA21926@lnx-holt.americas.sgi.com>

Robin Holt wrote:
> On Wed, Oct 27, 2004 at 06:01:12PM -0500, Ray Bryant wrote:
> 
>>Christoph Lameter wrote:
>>
>>>On Tue, 26 Oct 2004, Robin Holt wrote:
>>>
>>>
>>>
>>>>Sorry for being a stickler here, but the BTE is really part of the
>>>>I/O Interface portion of the shub.  That portion has a separate clock
>>>>frequency from the memory controller (unfortunately slower).  The BTE
>>>>can zero at a slightly slower speed than the processor.  It does, as
>>>>you pointed out, not trash the CPU cache.
>>>>
>>>>One other feature of the BTE is it can operate asynchronously from
>>>>the cpu.  This could be used to, during a clock interrupt, schedule
>>>>additional huge page zero filling on multiple nodes at the same time.
>>>>This could result in a huge speed boost on machines that have multiple
>>>>memory only nodes.  That has not been tested thoroughly.  We have done
>>>>considerable testing of the page zero functionality as well as the
>>>>error handling.
>>>
>>>
>>>If the huge patch would support some way of redirecting the clearing of a
>>>huge page then we could:
>>>
>>>1. set the huge pte to not present so that we get a fault on access
>>>2. run the bte clearer.
>>>3. On receiving a huge fault we could check for the bte being finished.
>>>
>>>This would parallelize the clearing of huge pages. But is that really more
>>>efficient? There may be complexity involved in allowing the clearing of
>>>multiple pages and tracking of the clear in progress is additional
>>>overhead.
>>>
>>>
>>
>>I'm personally of the opinion that using the BTE to "speculatively" clear
>>hugetlb pages in advance of when the hugetlb pages are requested is not a
>>good thing [tm].  One never knows if those pages will ever be requested.  And in
>>the meantime, tasks that need the BTE will be delayed by speculative use.
>>But that is a personal bias  :-), with no data to back it up.
> 
> 
> I was thinking the bte would be best used in an async mode where the pages
> would be pre-zeroed and available for use if the application needs them.
> If the pre-zeroed list is empty, then use the cpu to zero the page.
> 
> 
>>AFAIK, it is faster to clear the page with the processor anyway, since the
> 
> 
> The processor is slightly faster.  I believe the FSB is 200 MHz and the
> II is 100 MHz (150 MHz with no attached IX brick).  Future versions of the
> BTE will possibly have faster access to on node memory than the processor.
> 
> 
>>processor has a faster clock cycle.  Yes, it destroys the processor cache,
>>but the application has clearly indicated that it wants the page NOW, please,
>>(because it has faulted on it), and delivering the page to the application
>>as quickly as possible sounds like a good thing.  I'm not sure reloading
> 
> 
> I am not either.  I just would like to see any design take into consideration
> the possible uses and not design them out.  Nothing more.
> 
> 
>>the processor cache at this point is a cost we care about, given that the
>>application is likely just starting up anyway.  I figure hugetlb pages are
>>allocated once, stay around a long long time, so I'm not sure optimizing to
>>minimize cache damage is the correct way to go here.
>>
>>The only obvious win is for memory only nodes, that have a BTE and no CPU.
>>It is probably faster to use the local BTE than a remote CPU to clear the page.
> 
> 
> Plus, a single CPU could schedule the clearing of pages on multiple
> nodes at the same time.  Imagine a system that has 256 compute nodes
> and 756 memory nodes.  That configuration is theoretically possible with
> today's hardware, but we have never built or sold one.  Looking at that
> configuration gives you one possible indication of how a pre-zeroing
> mechanism might improve things.
> 
> I am not saying that the BTE is the best option, or even a good one.  It
> just looks interesting.  It does bring up some interesting problems with
> repeatability.  Consider the application startup following termination
> of another which used all the huge pages.  The pre-zeroed list will
> be nearly if not completely empty.  The first fault will find the list
> empty and have to zero the page itself.  Hopefully, the second fault will
> find one on the zeroed list and return immediately.  This would cause
> application startup time to feel like it doubled from the previous run.
> Ouch.  That would be very upsetting for our typical customers.
>

Yep.

> The more memory nodes you have per cpu, the better this number will
> appear.
> 
> Sorry for being spineless, but I don't feel very strongly that it will
> be beneficial enough to be desirable.  I am just not sure.  I would
> just hope that it is taken into consideration during the design and,
> as long as it has no negative impact on the design, be left as a
> possibility.
> 
> Thanks,
> Robin Holt
> 

As always, Robin, you are being very reasonable.  I think the option
should be kept open as you suggest, since it may help and I agree it
is an interesting approach that might yield big speedups.

-- 
Best Regards,
Ray
-----------------------------------------------
                   Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
            so I installed Linux.
-----------------------------------------------
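
For illustration only, the pre-zeroed list with CPU fallback that Robin
describes above might be modelled roughly as follows.  This is a toy
user-space sketch in Python, not code from the patch set or from any
kernel: the names HugePagePool, background_zero and fault are all
hypothetical, HUGE_PAGE_SIZE is a small stand-in value, and
background_zero merely stands in for whatever asynchronous engine
(such as the BTE) would do the clearing.

```python
from collections import deque

# Tiny stand-in so the sketch runs quickly; real huge pages are far larger.
HUGE_PAGE_SIZE = 4096


class HugePagePool:
    """Toy model of a free huge page pool with a pre-zeroed list."""

    def __init__(self, nr_pages):
        # Pages start dirty, as if a previous application had just
        # released every huge page in the pool.
        self.dirty = deque(
            bytearray(b"\xff" * HUGE_PAGE_SIZE) for _ in range(nr_pages)
        )
        self.zeroed = deque()
        self.cpu_zeroed = 0  # faults that had to clear the page themselves
        self.pre_zeroed = 0  # faults served from the pre-zeroed list

    def background_zero(self, max_pages=1):
        """Stand-in for asynchronous clearing; returns pages cleared."""
        done = 0
        while self.dirty and done < max_pages:
            page = self.dirty.popleft()
            page[:] = bytes(HUGE_PAGE_SIZE)
            self.zeroed.append(page)
            done += 1
        return done

    def fault(self):
        """Hand out a zeroed page, preferring the pre-zeroed list."""
        if self.zeroed:
            self.pre_zeroed += 1
            return self.zeroed.popleft()
        if not self.dirty:
            raise MemoryError("no free huge pages")
        # Pre-zeroed list is empty: fall back to clearing on the CPU.
        page = self.dirty.popleft()
        page[:] = bytes(HUGE_PAGE_SIZE)
        self.cpu_zeroed += 1
        return page
```

In this model the first fault after a full release always takes the
CPU path, because nothing has been pre-zeroed yet; later faults take
the fast path only if background_zero has run in between.  That is
the run-to-run variability in startup time raised in the quoted text.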

Thread overview: 30+ messages
     [not found] <B05667366EE6204181EABE9C1B1C0EB504BFA47C@scsmsx401.amr.corp.intel.com>
2004-10-26  1:26 ` Hugepages demand paging V2 [0/8]: Discussion and overview Christoph Lameter
2004-10-26  1:27   ` Hugepages demand paging V2 [1/8]: hugetlb fault handler Christoph Lameter
2005-01-18 12:21     ` Hirokazu Takahashi
2005-01-18 16:33       ` Christoph Lameter
2004-10-26  1:28   ` Hugepages demand paging V2 [2/8]: allocation control Christoph Lameter
2004-10-26  1:28   ` Hugepages demand paging V2 [3/8]: simple numa compatible allocator Christoph Lameter
2005-02-02 12:21     ` Hirokazu Takahashi
2004-10-26  1:29   ` Hugepages demand paging V2 [4/8]: ia64 arch modifications Christoph Lameter
2004-10-26  1:29   ` Hugepages demand paging V2 [5/8]: i386 " Christoph Lameter
2004-10-26  1:30   ` Hugepages demand paging V2 [6/8]: sparc64 " Christoph Lameter
2004-10-26  1:31   ` Hugepages demand paging V2 [7/8]: sh64 " Christoph Lameter
2004-10-26  1:31   ` Hugepages demand paging V2 [8/8]: sh arch specific modifications Christoph Lameter
2004-10-26  2:23   ` Hugepages demand paging V2 [0/8]: Discussion and overview William Lee Irwin III
2004-10-26  2:40     ` Jesse Barnes
2004-10-26  2:43       ` William Lee Irwin III
2004-10-26 14:35       ` Robin Holt
2004-10-26 16:44         ` Jesse Barnes
2004-10-26 17:40           ` William Lee Irwin III
2004-10-26 17:45             ` Christoph Lameter
2004-10-26 17:47               ` William Lee Irwin III
2004-10-27 18:06         ` Christoph Lameter
2004-10-27 23:01           ` Ray Bryant
2004-10-28 11:51             ` Robin Holt
2004-10-28 16:34               ` Ray Bryant [this message]
2004-10-27 23:08           ` Ray Bryant
2004-10-27  5:23   ` David Gibson
2004-10-27 16:25     ` Christoph Lameter
2004-10-27  6:48   ` William Lee Irwin III
2004-10-27 14:21     ` Ray Bryant
2004-10-27 16:30     ` Christoph Lameter
