From: Ray Bryant <raybry@sgi.com>
To: Robin Holt <holt@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>,
Jesse Barnes <jbarnes@sgi.com>,
William Lee Irwin III <wli@holomorphy.com>,
"Chen, Kenneth W" <kenneth.w.chen@intel.com>,
linux-kernel@vger.kernel.org
Subject: Re: Hugepages demand paging V2 [0/8]: Discussion and overview
Date: Thu, 28 Oct 2004 11:34:04 -0500
Message-ID: <41811F7C.40003@sgi.com>
In-Reply-To: <20041028115144.GA21926@lnx-holt.americas.sgi.com>
Robin Holt wrote:
> On Wed, Oct 27, 2004 at 06:01:12PM -0500, Ray Bryant wrote:
>
>>Christoph Lameter wrote:
>>
>>>On Tue, 26 Oct 2004, Robin Holt wrote:
>>>
>>>
>>>
>>>>Sorry for being a stickler here, but the BTE is really part of the
>>>>I/O Interface portion of the shub. That portion has a separate clock
>>>>frequency from the memory controller (unfortunately slower). The BTE
>>>>can zero at a slightly slower speed than the processor. It does, as
>>>>you pointed out, not trash the CPU cache.
>>>>
>>>>One other feature of the BTE is it can operate asynchronously from
>>>>the cpu. This could be used to, during a clock interrupt, schedule
>>>>additional huge page zero filling on multiple nodes at the same time.
>>>>This could result in a huge speed boost on machines that have multiple
>>>>memory only nodes. That has not been tested thoroughly. We have done
>>>>considerable testing of the page zero functionality as well as the
>>>>error handling.
>>>
>>>
>>>If the huge patch would support some way of redirecting the clearing of a
>>>huge page then we could:
>>>
>>>1. set the huge pte to not present so that we get a fault on access
>>>2. run the bte clearer.
>>>3. On receiving a huge fault we could check for the bte being finished.
>>>
>>>This would parallelize the clearing of huge pages. But is that really more
>>>efficient? There may be complexity involved in allowing the clearing of
>>>multiple pages and tracking of the clear in progress is additional
>>>overhead.
>>>
>>>
>>>
>>
>>I'm personally of the opinion that using the BTE to "speculatively" clear
>>hugetlb pages in advance of when the hugetlb pages are requested is not a
>>good thing [tm]. One never knows if those pages will ever be requested.
>>And in the meantime, tasks that need the BTE will be delayed by
>>speculative use. But that is a personal bias :-), with no data to back
>>it up.
>
>
> I was thinking the bte would be best used in an async mode where the pages
> would be pre-zeroed and available for use if the application needs them.
> If the pre-zeroed list is empty, then use the cpu to zero the page.
>
>
>>AFAIK, it is faster to clear the page with the processor anyway, since the
>
>
> The processor is slightly faster. I believe the FSB is 200 MHz and the
> II is 100 MHz (150 MHz with no attached IX brick). Future versions of the
> BTE will possibly have faster access to on-node memory than the processor.
>
>
>>processor has a faster clock cycle. Yes, it destroys the processor cache,
>>but the application has clearly indicated that it wants the page NOW, please,
>>(because it has faulted on it), and delivering the page to the application
>>as quickly as possible sounds like a good thing. I'm not sure reloading
>
>
> I am not either. I just would like to see any design take into consideration
> the possible uses and not design them out. Nothing more.
>
>
>>the processor cache at this point is a cost we care about, given that the
>>application is likely just starting up anyway. I figure hugetlb pages are
>>allocated once, stay around a long long time, so I'm not sure optimizing to
>>minimize cache damage is the correct way to go here.
>>
>>The only obvious win is for memory-only nodes that have a BTE and no CPU.
>>It is probably faster to use the local BTE than a remote CPU to clear the
>>page.
>
>
> Plus, a single CPU could schedule the clearing of pages on multiple
> nodes at the same time. Imagine a system that has 256 compute nodes
> and 756 memory nodes. That configuration is theoretically possible with
> today's hardware, but we have never built or sold one. Looking at that
> configuration gives you one possible indication of how a pre-zeroing
> mechanism might improve things.
>
> I am not saying that the BTE is the best option, or even a good one. It
> just looks interesting. It does bring up some interesting problems with
> repeatability. Consider the application startup following termination
> of another which used all the huge pages. The pre-zeroed list will
> be nearly if not completely empty. The first fault will find the list
> empty and have to zero the page itself. Hopefully, the second fault will
> find one on the zeroed list and return immediately. This would cause
> application startup time to feel like it doubled from the previous run.
> Ouch. That would be very upsetting for our typical customers.
>
Yep.
> The more memory nodes you have per cpu, the better this number will
> appear.
>
> Sorry for being spineless, but I don't feel very strongly that it will
> be beneficial enough to be desirable. I am just not sure. I would
> just hope that it is taken into consideration during the design and,
> as long as it has no negative impact on the design, be left as a
> possibility.
>
> Thanks,
> Robin Holt
>
As always, Robin, you are being very reasonable. I think the option
should be kept open as you suggest, since it may help and I agree it
is an interesting approach that might yield big speedups.
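
For concreteness, here is a minimal user-space sketch of the "pre-zeroed
list with CPU fallback" idea discussed above. It is not kernel code and not
taken from any posted patch: the names (zero_pool, pool_prezero_one,
pool_fault), the pool depth, and the 2 MiB page size are all hypothetical,
and calloc() merely stands in for whatever asynchronous engine (such as the
BTE) would zero pages in the background.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define HPAGE_BYTES (2u * 1024 * 1024) /* illustrative huge page size */
#define POOL_SLOTS  4                  /* illustrative pool depth */

/* Hypothetical bookkeeping for a list of already-zeroed huge pages. */
struct zero_pool {
	void *page[POOL_SLOTS];
	size_t count;
	unsigned long hits;   /* faults served from the pre-zeroed list */
	unsigned long misses; /* faults that had to zero on the CPU */
};

/*
 * Stand-in for the asynchronous zeroing engine: obtain a zeroed page
 * and park it on the list. Returns 1 if a page was added, else 0.
 */
static int pool_prezero_one(struct zero_pool *p)
{
	void *pg;

	assert(p != NULL);
	if (p->count == POOL_SLOTS)
		return 0;
	pg = calloc(1, HPAGE_BYTES);
	if (pg == NULL)
		return 0;
	p->page[p->count++] = pg;
	return 1;
}

/*
 * Fault path: take a pre-zeroed page if one is available, otherwise
 * fall back to zeroing a fresh page with the CPU.
 */
static void *pool_fault(struct zero_pool *p)
{
	void *pg;

	assert(p != NULL);
	if (p->count > 0) {
		p->hits++;
		return p->page[--p->count];
	}
	pg = malloc(HPAGE_BYTES);
	if (pg == NULL)
		return NULL;
	memset(pg, 0, HPAGE_BYTES); /* the slow path the faulting task pays for */
	p->misses++;
	return pg;
}
```

The hits and misses counters make Robin's repeatability concern visible: a
run that starts with an empty list takes the memset() path on its first
faults, while a run that starts after the background zeroing has caught up
does not, so startup time can differ noticeably from one run to the next.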
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------