From: Ray Bryant <raybry@sgi.com>
To: "Martin J. Bligh" <mbligh@aracnet.com>
Cc: Jesse Barnes <jbarnes@engr.sgi.com>,
akpm@osdl.org, linux-kernel@vger.kernel.org, steiner@sgi.com
Subject: Re: [PATCH] allocate page caches pages in round robin fasion
Date: Fri, 13 Aug 2004 12:31:18 -0500
Message-ID: <411CFAE6.1010109@sgi.com>
In-Reply-To: <fa.cg3cafa.ngi9og@ifi.uio.no>
Hi Martin,
Martin J. Bligh wrote:
<snip>
>
> Does that actually happen though? Looking at the current code makes me think
> it'll keep some pages free on all nodes at all times, and if kswapd does
> its job, we'll never fall back across nodes. Now ... I think that's broken,
> but I think that's what currently happens - that was what we discussed at
> KS ... I might be misreading it though, I should test it.
>
> Even if that's not true, allocating all your most recent stuff off-node is
> still crap (so either way, I'd agree the current situation is broken), but
> I don't think the solution is to push ALL your accesses (with n-1/n probability)
> off-node ... we need to be more careful than that ...
>
I think you're missing the typical workload where we run into this problem.
Just to make things a bit more specific, let's assume we are on a 128 node
(256 P) system, with 4 GB per node. Let's assume that we have a 100 GB data
file that we access periodically during the run. Accesses to that data file
are random and come from every node.
The program starts out by reading in the data file, then forks off 256 copies
of itself, and allocates 1 GB per CPU of local storage via MPOL_DEFAULT. All
of those pages had better be local, or the computation will be unbalanced
and run as slowly as the slowest node.
As I read the __alloc_pages() code, those 100 GB of data pages will be
allocated on the node that did the file read; when that node fills up, we will
spill the allocation to adjacent nodes (this is the first loop of
__alloc_pages(); kswapd doesn't get invoked until that first loop fails).
(That is, kswapd doesn't get invoked until all of the zones in the zonelist
are full. All of memory is in that zonelist, unless we have cpusets enabled.
So the priority is to spill off-node first and to swap second.)
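
To make the allocation order described above concrete, here is a toy Python
model of it. This is only a sketch of the behavior as this message reads it,
not the kernel code: the names (Node, zonelist_for, alloc_page) are
hypothetical, node distance is assumed to be the difference in node ids, and
free-page watermarks are ignored entirely.

```python
# Toy model of the allocation order described above; not kernel code.
# All names here are hypothetical and exist only for illustration.


class Node:
    def __init__(self, node_id, capacity_pages):
        self.node_id = node_id
        self.free = capacity_pages


def zonelist_for(local_id, nodes):
    """Local node first, then the others by increasing distance.

    Distance is assumed to be |id difference|, purely for illustration.
    """
    return sorted(nodes, key=lambda n: abs(n.node_id - local_id))


def alloc_page(local_id, nodes):
    """Return (node_id, woke_kswapd) for a single page allocation."""
    # First pass: take a page from the first node that still has one free.
    for node in zonelist_for(local_id, nodes):
        if node.free > 0:
            node.free -= 1
            return node.node_id, False
    # Only when every node in the zonelist is full does reclaim start.
    return None, True


nodes = [Node(i, capacity_pages=2) for i in range(3)]
placements = [alloc_page(0, nodes) for _ in range(7)]
print(placements)
```

With three two-page nodes, the first six requests from node 0 are satisfied
by spilling across nodes 0, 1 and 2 in turn, and only the seventh request
reaches the reclaim path.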
Now the application starts allocating its local storage (2 GB per node, at
1 GB per CPU), and the processes on the nodes that hold the page cache all
get non-local pages. (Once again, this happens in the first loop of
__alloc_pages().) The ratio of accesses to
local data pages versus access to remote page cache pages is unfavorable for
local page cache allocation, since the page cache pages are accessed at a tiny
fraction of the rate of the data pages.
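
The arithmetic for this example can be sketched in a few lines of Python.
This is a rough illustration under stated assumptions, not a measurement:
the file is assumed to be read on node 0 and to spill to nodes 1, 2, ... in
order, each node is assumed to need 2 GB of local storage (2 CPUs at 1 GB
each), and the cascade effect of spilled local pages landing on neighboring
nodes is ignored. The round-robin placement is included only for comparison;
all function names are hypothetical.

```python
NODES = 128
NODE_GB = 4.0
FILE_GB = 100.0
LOCAL_GB_PER_NODE = 2.0  # assumed: 2 CPUs per node x 1 GB per CPU


def page_cache_first_touch(nodes, node_gb, file_gb):
    """Fill node 0, then spill to nodes 1, 2, ... as each one fills up."""
    cache = []
    remaining = file_gb
    for _ in range(nodes):
        used = min(node_gb, remaining)
        cache.append(used)
        remaining -= used
    return cache


def page_cache_round_robin(nodes, file_gb):
    """Spread the file evenly over all nodes."""
    return [file_gb / nodes] * nodes


def off_node_gb(cache, node_gb, local_gb):
    """Total local storage that does not fit on its own node."""
    return sum(max(0.0, local_gb - (node_gb - used)) for used in cache)


first_touch = page_cache_first_touch(NODES, NODE_GB, FILE_GB)
round_robin = page_cache_round_robin(NODES, FILE_GB)

full_nodes = sum(1 for used in first_touch if used == NODE_GB)
print("nodes filled by page cache:", full_nodes)
print("off-node local GB, first touch:",
      off_node_gb(first_touch, NODE_GB, LOCAL_GB_PER_NODE))
print("off-node local GB, round robin:",
      off_node_gb(round_robin, NODE_GB, LOCAL_GB_PER_NODE))
```

Under these assumptions the file fills 25 nodes completely, so the 50 CPUs on
those nodes get all of their local storage off-node, while an even spread
leaves every node with room for its own 2 GB.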
Now I suppose you could argue that the application should fork first and then
read in 1/256th of the data on each cpu. The problems with this, in general,
are twofold:
(1) It could have been a simple "cp" in a startup script that did the read.
    We can't fix all of those things as well.
(2) The application may be an ISV's program that is not NUMA aware. We can
    fix most of that by wrapping the program with control scripts, but
    requiring the ISV to build a specific NUMA aware version of the binary
    for Altix is oftentimes not feasible. (And because the allocation
    policy is MPOL_DEFAULT, the application doesn't have to have NUMA API
    calls embedded in the program.)
>
>>>If we round-robin it ... surely 7/8 of your data (on your 8 node machine)
>>>will ALWAYS be off-node ? I thought we discussed this at KS/OLS - what is
>>>needed is to punt old pages back off onto another node, rather than
>>>swapping them out. That way all your pages are going to be local.
>>
Surely you can't be suggesting that I migrate a page cache page to a local
node just to read it? If file accesses are random and global, won't you end
up just bouncing page cache pages hither and yon? Surely it is better just to
copy the data remotely to the current node and leave it where it is?
(YMMV -- all of these tradeoffs are clearly workload dependent.)
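
The bouncing concern can be illustrated with a small simulation. This is a
sketch under simplifying assumptions, not a model of any real kernel policy:
a single page cache page, readers chosen uniformly at random from all nodes,
and a hypothetical "migrate on remote read" policy compared against leaving
the page where it is. The function name and parameters are made up for the
example.

```python
import random


def simulate(num_nodes, accesses, migrate, seed=0):
    """Return (fraction of remote accesses, number of page migrations)."""
    rng = random.Random(seed)
    page_node = 0
    remote = 0
    migrations = 0
    for _ in range(accesses):
        reader = rng.randrange(num_nodes)
        if reader != page_node:
            remote += 1
            if migrate:
                # Hypothetical policy: pull the page to the reading node.
                page_node = reader
                migrations += 1
    return remote / accesses, migrations


stay_frac, stay_moves = simulate(8, 100000, migrate=False)
move_frac, move_moves = simulate(8, 100000, migrate=True)
print("leave in place: remote fraction", stay_frac, "migrations", stay_moves)
print("migrate on read: remote fraction", move_frac, "migrations", move_moves)
```

With random global access on n nodes, roughly (n-1)/n of the accesses find
the page on another node under either policy, so in this model migration
does not reduce the number of remote accesses and only adds a page move to
each of them.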
>>That gets complicated pretty quickly I think. We don't want to constantly
>>shuffle pages between nodes with kswapd, and there's also the problem of
>>deciding when to do it.