public inbox for linux-kernel@vger.kernel.org
From: Paul Jackson <pj@sgi.com>
To: Paul Jackson <pj@sgi.com>
Cc: ak@suse.de, lse-tech@lists.sourceforge.net, linux-kernel@vger.kernel.org
Subject: Re: [Lse-tech] Re: [PATCH] subset zonelists and big numa friendly mempolicy MPOL_MBIND
Date: Tue, 3 Aug 2004 00:58:24 -0700	[thread overview]
Message-ID: <20040803005824.77358caf.pj@sgi.com> (raw)
In-Reply-To: <20040802191407.24e301e0.pj@sgi.com>

Earlier, I (pj) wrote:
> It has poor cache performance on big iron.  For a modest job on a big
> system, the allocator has to walk down an average of 128 out of 256 zone
> pointers in the list, dereferencing each one into the zone struct, then
> into the struct pglist_data, before it finds one that matches an allowed
> node id.  That's a nasty memory footprint for a hot code path.

This paragraph is B.S.  Most tasks are running on CPUs that are on nodes
whose memory they are allowed to use.  That node is at the front of the
local zonelist, so they get their memory from the first zone they look at.

Damn ... hate it when that happens ;).

Still, either MPOL_BIND needs a more numa friendly set of zonelists,
with a differently sorted list for each node in the set, or, if you
care about memory performance, its usefulness for binding to more than
one or a few very close nodes falls off quickly as the number of nodes
increases.  As you well know, any such numa-friendly set of sorted
zonelists will require space on the Order of N**2, for N the node count,
given the NULL-terminated linear list form in which they must be handed
to __alloc_pages().

I suspect that the English phrase you are searching for now to tell me
is "if it hurts, don't use it ;)."  That is, you are clearly advising me
not to use MPOL_BIND if I need a fancy zonelist sort.

The place I ran into the most complexity doing this in the 2.4 kernel
was in the per-memory-region binding.  You're dealing with this in the
2.6 kernels, and when you get to stuff like shared memory and huge
pages, it's not easy.  At least the vma splitting code is better in
2.6 than it was in 2.4.  Whatever I do for cpusets must _not_ duplicate
your virtual address range specific work (mbind).  Too much detail to be
done twice.

Andi wrote:
> My first reaction is that if you really want to do that, just pass
> the policy node bitmap to alloc_pages and try_to_free_pages
> and use the normal per node zone list with the bitmap as filter.

Pass in, or add to task_struct?  I can imagine adding a:

	nodemask_t mems_allowed;

to task_struct, and ending up with a CONFIG_CPUSET enabled macro called
in a few places in __alloc_pages() and try_to_free_pages() that amounts
to:

	if (!in_interrupt())
		if (!node_isset(z->zone_pgdat->node_id, current->mems_allowed))
			continue;

In any event, cpusets provides the larger "container" on bigger numa
systems, and mbind/mempolicy provides the more detailed, and vma
specific, placement within the container (or within the entire system
if cpusets is not configured).

I'll try coding this up and see how it looks.

I welcome your further comments.

Thank-you.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

Thread overview: 5+ messages
2004-08-02 23:35 [PATCH] subset zonelists and big numa friendly mempolicy MPOL_MBIND Paul Jackson
2004-08-03  0:08 ` Andi Kleen
2004-08-03  2:14   ` Paul Jackson
2004-08-03  7:58     ` Paul Jackson [this message]
2004-08-04 22:33   ` Paul Jackson
