Re: [PATCH] allocate page caches pages in round robin fasion


From: Ray Bryant <raybry@sgi.com>
To: "Martin J. Bligh" <mbligh@aracnet.com>
Cc: Jesse Barnes <jbarnes@engr.sgi.com>,
	akpm@osdl.org, linux-kernel@vger.kernel.org, steiner@sgi.com
Subject: Re: [PATCH] allocate page caches pages in round robin fasion
Date: Fri, 13 Aug 2004 12:31:18 -0500
Message-ID: <411CFAE6.1010109@sgi.com>
In-Reply-To: <fa.cg3cafa.ngi9og@ifi.uio.no>

Hi Martin,

Martin J. Bligh wrote:
<snip>

> 
> Does that actually happen though? Looking at the current code makes me think
> it'll keep some pages free on all nodes at all times, and if kswapd does
> it's job, we'll never fall back across nodes. Now ... I think that's broken,
> but I think that's what currently happens - that was what we discussed at
> KS ... I might be misreading it though, I should test it.
> 
> Even if that's not true, allocating all your most recent stuff off-node is
> still crap (so either way, I'd agree the current situation is broken), but
> I don't think the solution is to push ALL your accesses (with n-1/n probability)
> off-node ... we need to be more careful than that ...
>

I think you're missing the typical workload where we run into this problem.
To make things a bit more specific, let's assume we are on a 128 node (256 P)
system, with 4 GB per node.  Let's also assume that we have a 100 GB data file
that we access periodically during the run, and that accesses to that data
file are random and come from every node.

The program starts out by reading in the data file, then forks off 256 copies
of itself, and allocates 1 GB per CPU of local storage via MPOL_DEFAULT.  All
of those pages had better be local or the computation will be unbalanced
and run as slowly as the slowest node.

As I read the __alloc_pages() code, those 100 GB of data pages will be
allocated on the node that did the file read; when that node fills up, we will
spill the allocation to adjacent nodes.  (This is the first loop of
__alloc_pages(); kswapd doesn't get invoked until that first loop fails.)

(kswapd doesn't get invoked until all of the zones in the zonelist are full.
All of memory is in that zonelist, unless we have cpusets enabled.  So the
priority is to spill off-node first and to swap second.)

Now the application starts allocating its local storage (2 GB per node, at
1 GB per CPU), and the processes on the nodes that hold the page cache all get
non-local pages.  (Once again, this happens in the first loop of
__alloc_pages().)  That is the wrong tradeoff: the page cache pages are
accessed at a tiny fraction of the rate of the data pages, so it is the data
pages, not the page cache pages, that should be local.
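To put numbers on that, here is a small sketch of the same scenario.  The
assumptions are the ones in the example (128 nodes, 4 GB per node, 100 GB
file read from one node) plus a few simplifications of mine: 2 CPUs per
node (from 256 P on 128 nodes), the reading node is node 0, spill goes to
increasing node numbers, and kernel memory and watermarks are ignored.

```python
NUM_NODES = 128
GB_PER_NODE = 4
LOCAL_GB_PER_NODE = 2   # assumed: 2 CPUs per node x 1 GB per CPU
PAGE_CACHE_GB = 100

# State after the file read: page cache fills node 0, then spills onward.
free_gb = [GB_PER_NODE] * NUM_NODES
remaining = PAGE_CACHE_GB
for node in range(NUM_NODES):
    take = min(free_gb[node], remaining)
    free_gb[node] -= take
    remaining -= take

# Each node now tries to satisfy its 2 GB of "local" storage locally.
# Whatever does not fit has to come from some other node.
remote_gb = {}
for node in range(NUM_NODES):
    local = min(free_gb[node], LOCAL_GB_PER_NODE)
    free_gb[node] -= local
    shortfall = LOCAL_GB_PER_NODE - local
    if shortfall > 0:
        remote_gb[node] = shortfall

print("nodes whose local storage ends up remote:", len(remote_gb))
print("total GB of 'local' storage placed off-node:", sum(remote_gb.values()))
```

Every node that holds page cache in this model has no room left, so all of
the local storage for the CPUs on those nodes lands somewhere else, and the
whole job runs at the speed of those nodes.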

Now I suppose you could argue that the application should fork first and then 
read in 1/256th of the data on each cpu.  The problems with this, in general, 
are twofold:

(1)  It could have been a simple "cp" in a startup script that did the read.
      We can't fix all of those cases either.
(2)  The application may be an ISV's program that is not NUMA aware.  We can
      fix most of that by wrapping the program with control scripts, but
      requiring the ISV to build a specific NUMA aware version of the binary
      for Altix is oftentimes not feasible.  (And because the allocation
      policy is MPOL_DEFAULT, the application doesn't have to have NUMA API
      calls embedded in the program.)

> 
>>>If we round-robin it ... surely 7/8 of your data (on your 8 node machine)
>>>will ALWAYS be off-node ? I thought we discussed this at KS/OLS - what is
>>>needed is to punt old pages back off onto another node, rather than
>>>swapping them out. That way all your pages are going to be local.
>>

Surely you can't be suggesting that I migrate a page cache page to a local 
node just to read it?  If file accesses are random and global, won't you end 
up just bouncing page cache pages hither and yon?  Surely it is better just to 
copy the data remotely to the current node and leave it where it is?

(YMMV -- all of these tradeoffs are clearly workload dependent.)
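A toy illustration of the bouncing concern, for a single page cache page.
The assumptions are mine and deliberately crude: reads arrive from a
uniformly random node, and the hypothetical "migrate on read" policy moves
the page to the reader whenever the reader is not the page's current home.
Real access patterns and any real migration policy would differ.

```python
import random


def migrate_on_read_fraction(num_nodes, num_reads, seed=0):
    """Fraction of reads that force a migration of one page when every
    read comes from a uniformly random node (an assumed access pattern)."""
    rng = random.Random(seed)
    page_home = 0
    migrations = 0
    for _ in range(num_reads):
        reader = rng.randrange(num_nodes)
        if reader != page_home:
            # Page is not local to the reader: migrate it there.
            migrations += 1
            page_home = reader
    return migrations / num_reads


frac = migrate_on_read_fraction(num_nodes=128, num_reads=100_000)
print(f"reads that triggered a migration: {frac:.3f}")
print(f"(n-1)/n for comparison:           {127 / 128:.3f}")
```

Under these assumptions nearly every read pays for a page migration, where
leaving the page in place would have cost one remote copy of the data.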

>>That gets complicated pretty quickly I think.  We don't want to constantly 
>>shuffle pages between nodes with kswapd, and there's also the problem of 
>>deciding when to do it.

