From: Ray Bryant <raybry@sgi.com>
To: Robin Holt <holt@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>,
Jesse Barnes <jbarnes@sgi.com>,
William Lee Irwin III <wli@holomorphy.com>,
"Chen, Kenneth W" <kenneth.w.chen@intel.com>,
linux-kernel@vger.kernel.org
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview
Date: Thu, 28 Oct 2004 11:34:04 -0500
Message-ID: <41811F7C.40003@sgi.com>
In-Reply-To: <20041028115144.GA21926@lnx-holt.americas.sgi.com>
Robin Holt wrote:
> On Wed, Oct 27, 2004 at 06:01:12PM -0500, Ray Bryant wrote:
>
>>Christoph Lameter wrote:
>>
>>>On Tue, 26 Oct 2004, Robin Holt wrote:
>>>
>>>
>>>
>>>>Sorry for being a stickler here, but the BTE is really part of the
>>>>I/O Interface portion of the shub. That portion has a separate clock
>>>>frequency from the memory controller (unfortunately slower). The BTE
>>>>can zero at a slightly slower speed than the processor. It does, as
>>>>you pointed out, not trash the CPU cache.
>>>>
>>>>One other feature of the BTE is it can operate asynchronously from
>>>>the cpu. This could be used to, during a clock interrupt, schedule
>>>>additional huge page zero filling on multiple nodes at the same time.
>>>>This could result in a huge speed boost on machines that have multiple
>>>>memory only nodes. That has not been tested thoroughly. We have done
>>>>considerable testing of the page zero functionality as well as the
>>>>error handling.
>>>
>>>
>>>If the huge patch would support some way of redirecting the clearing of a
>>>huge page then we could:
>>>
>>>1. set the huge pte to not present so that we get a fault on access
>>>2. run the bte clearer.
>>>3. On receiving a huge fault we could check for the bte being finished.
>>>
>>>This would parallelize the clearing of huge pages. But is that really more
>>>efficient? There may be complexity involved in allowing the clearing of
>>>multiple pages and tracking of the clear in progress is additional
>>>overhead.
>>>
>>>
>>
>>I'm personally of the opinion that using the BTE to "speculatively" clear
>>hugetlb pages in advance of when the hugetlb pages are requested is not a
>>good thing [tm]. One never knows if those pages will ever be requested. And
>>in the meantime, tasks that need the BTE will be delayed by speculative use.
>>But that is a personal bias :-), with no data to back it up.
>
>
> I was thinking the bte would be best used in an async mode where the pages
> would be pre-zeroed and available for use if the application needs them.
> If the pre-zeroed list is empty, then use the cpu to zero the page.
>
>
>>AFAIK, it is faster to clear the page with the processor anyway, since the
>
>
> The processor is slightly faster. I believe the FSB is 200 MHz and the
> II is 100 MHz (150 MHz with no attached IX brick). Future versions of the
> BTE will possibly have faster access to on-node memory than the processor.
>
>
>>processor has a faster clock cycle. Yes, it destroys the processor cache,
>>but the application has clearly indicated that it wants the page NOW,
>>please (because it has faulted on it), and delivering the page to the
>>application as quickly as possible sounds like a good thing. I'm not sure
>>reloading
>
>
> I am not either. I just would like to see any design take into consideration
> the possible uses and not design them out. Nothing more.
>
>
>>the processor cache at this point is a cost we care about, given that the
>>application is likely just starting up anyway. I figure hugetlb pages are
>>allocated once, stay around a long long time, so I'm not sure optimizing to
>>minimize cache damage is the correct way to go here.
>>
>>The only obvious win is for memory-only nodes that have a BTE and no CPU.
>>It is probably faster to use the local BTE than a remote CPU to clear the
>>page.
>
>
> Plus, a single CPU could schedule the clearing of pages on multiple
> nodes at the same time. Imagine a system that has 256 compute nodes
> and 756 memory nodes. That configuration is theoretically possible with
> today's hardware, but we have never built or sold one. Looking at that
> configuration gives you one possible indication of how a pre-zeroing
> mechanism might improve things.
>
> I am not saying that the BTE is the best option, or even a good one. It
> just looks interesting. It does bring up some interesting problems with
> repeatability. Consider the application startup following termination
> of another which used all the huge pages. The pre-zeroed list will
> be nearly if not completely empty. The first fault will find the list
> empty and have to zero the page itself. Hopefully, the second fault will
> find one on the zeroed list and return immediately. This would cause
> application startup time to feel like it doubled from the previous run.
> Ouch. That would be very upsetting for our typical customers.
>
Yep.
> The more memory nodes you have per cpu, the better this number will
> appear.
>
> Sorry for being spineless, but I don't feel very strongly that it will
> be beneficial enough to be desirable. I am just not sure. I would
> just hope that it is taken into consideration during the design and,
> as long as it has no negative impact on the design, be left as a
> possibility.
>
> Thanks,
> Robin Holt
>
As always, Robin, you are being very reasonable. I think the option
should be kept open as you suggest, since it may help and I agree it
is an interesting approach that might yield big speedups.
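
To make the option concrete, here is a rough user-space sketch of the
pre-zeroed list with CPU fallback that Robin describes above. It is only
an illustration under stated assumptions: every name in it (hpool,
hpool_fault, hpool_prezero_one, HPAGE_BYTES) is hypothetical, none of it
is kernel code or part of Christoph's patch set, and a plain memset()
stands in for the asynchronous BTE transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in size; real huge pages are far larger. */
#define HPAGE_BYTES (64 * 1024)

struct hpage {
    unsigned char *mem;
    struct hpage *next;
};

struct hpool {
    struct hpage *free_dirty;   /* free pages with stale contents */
    struct hpage *free_zeroed;  /* free pages already cleared in the background */
    unsigned long cpu_zeroed;   /* faults that had to clear the page themselves */
    unsigned long prezero_hits; /* faults served from the pre-zeroed list */
};

static struct hpage *pop(struct hpage **list)
{
    struct hpage *pg = *list;
    if (pg)
        *list = pg->next;
    return pg;
}

static void push(struct hpage **list, struct hpage *pg)
{
    pg->next = *list;
    *list = pg;
}

/* Allocate npages dirty pages. Returns 0 on success, -1 on allocation failure. */
int hpool_init(struct hpool *p, size_t npages)
{
    memset(p, 0, sizeof(*p));
    for (size_t i = 0; i < npages; i++) {
        struct hpage *pg = malloc(sizeof(*pg));
        if (!pg)
            return -1;
        pg->mem = malloc(HPAGE_BYTES);
        if (!pg->mem) {
            free(pg);
            return -1;
        }
        memset(pg->mem, 0xAA, HPAGE_BYTES); /* simulate stale data */
        push(&p->free_dirty, pg);
    }
    return 0;
}

/*
 * Background step (what a clock interrupt might schedule on the BTE):
 * clear one dirty page and move it to the pre-zeroed list.
 * Returns 1 if a page was cleared, 0 if there was nothing to do.
 */
int hpool_prezero_one(struct hpool *p)
{
    struct hpage *pg = pop(&p->free_dirty);
    if (!pg)
        return 0;
    memset(pg->mem, 0, HPAGE_BYTES); /* assumption: BTE modeled as memset */
    push(&p->free_zeroed, pg);
    return 1;
}

/*
 * Fault path: prefer a pre-zeroed page; if the list is empty, fall back
 * to clearing a dirty page with the CPU. Returns NULL when no page is free.
 */
struct hpage *hpool_fault(struct hpool *p)
{
    struct hpage *pg = pop(&p->free_zeroed);
    if (pg) {
        p->prezero_hits++;
        return pg;
    }
    pg = pop(&p->free_dirty);
    if (!pg)
        return NULL;
    memset(pg->mem, 0, HPAGE_BYTES);
    p->cpu_zeroed++;
    return pg;
}

/* A page returned by an exiting application goes back on the dirty list. */
void hpool_release(struct hpool *p, struct hpage *pg)
{
    assert(pg != NULL);
    push(&p->free_dirty, pg);
}
```

The sketch also shows the repeatability problem: right after another
application has released all of its pages, every page sits on the dirty
list, so the first hpool_fault() pays the full CPU clearing cost and only
later faults can hit the pre-zeroed list.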
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------