All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ray Bryant <raybry@sgi.com>
To: Robin Holt <holt@sgi.com>
Cc: Paul Jackson <pj@sgi.com>,
	linux-mm@kvack.org, ak@muc.de, haveblue@us.ibm.com,
	marcello@cyclades.com, stevel@mwwireless.net,
	peterc@gelato.unsw.edu.au
Subject: Re: manual page migration -- issue list
Date: Wed, 16 Feb 2005 17:05:23 -0600	[thread overview]
Message-ID: <4213D1B3.4050607@sgi.com> (raw)
In-Reply-To: <20050216092011.GA6616@lnx-holt.americas.sgi.com>

Robin Holt wrote:
> On Tue, Feb 15, 2005 at 08:22:14PM -0800, Paul Jackson wrote:
> 
>>Robin wrote:
>>
>>>If you do that for each job with the shared mapping and have overlapping
>>>node lists, you end up combining two nodes and not being able to seperate
>>>them.
>>
>>I don't see the problem.  Just don't move a task onto a node
>>until you moved the one that was already there, if any, off.
>>
>>Say, for example, you want to move a job from nodes 4, 5 and 6 to nodes
>>5, 6 and 7, respectively.  First move 6 to 7, then 5 to 6, then 4 to 5.
>>Or save some migration, and just move what's on 4 to 7, leaving 5 and
>>6 as is.
>

The customers I have talked to about this tell me that they never
imagine having a set of old and new nodes overlap.  I agree it is more
general to allow this, but resistance to the original system call I
proposed appears to be somewhat stiff.

> 
> Moving 4 to 7 will likely change the node to node distance for the
> processes within that job.  You will probably need to do the 6-7, 5-6, 4-5
> to keep relative distances the same.  Again, the batch scheduler will tell
> us whether a simple 4-7 move is possible or whether we need to shift each.
> 
> I should correct my earlier add.  As long as you have a seperate node
> in the new list that is not in the old, you could accomplish it with a
> one-at-a-time fashion.  What that would result in is a syscall for each
> non-overlapping vma per node.  Multiple that by the number of nodes with
> each system call going over that same shared vma.
> 
> For the sake of discussion, lets assume this is a 256p job using 128 nodes
> and a shared message block of 2GB per task.  You will have a 512GB shared
> mapping which will have some holes punched in it (no single task will
> have the entire mapping unscathed).  Again, for the sake of discussion,
> let's assume that 96% of the shared buffer is intact for the process we
> choose to do the initial migration on.  Compare the single node method
> to the array method.

Would we really ever migrate something that big?  I had the same concerns
about large address spaces and the like, but it just seems to me that if
something is that big, we'd leave it alone.  :-)

> 
> Array method:
> 1) Call system call with pid, va_start, va_end, 128, [2,3,4,5...], [32,33,34,...].
>    This will scan the page tables _ONCE_ and migrate the pages to their
>    new destination.
> 2) Call system call on second pid to cover 1/2 of the remaining
>    4% of address space.  Again single scan over that portion of
>    address space.
> 3) Call system call on third pid to cover last portion of address
>    space.
> 
> With this, we have made 3 system calls and scanned the entire address
> range 1 time.
> 
> Single parameter method:
> 1) For a single pid, cal system call 128 times with pid, va_start, va_end, from, to
>    which scans the 96% chunk 128 times.
> 2) Repeat 128 times with second pid.
> 3) Repeat 128 times with third pid.
> 
> We have now made the system call 384 times, scanned the entire address
> range 128 times.
> 
> Do you see why I called this insane.  This is all because you don't like
> to pass in a complex array of integers.  That seems like a very small
> thing to ask to save 127 scans of a 512GB address space.
> 

I agree, it sounds like a lot of work.  Perhaps we should try this with
my prototype code and see how long it takes.  But, I really think this is
a contrived example.  I don't think anyone would migrate a job that big.
To my way of thinking, the largest job we would ever migrate would be on
the order of 1/8th to 1/4 of the machine.  Not 1/2.  If it is 1/2 of
the machine, lets just leave the darn thing where it is.  :-)  (I always
try to let large sleepling dogs lie...)

> I believe that is what I called insane earlier.  I reserve the right to
> be wrong.
> 
> 
>>At any point, either there is at least one new node not currently
>>occupied by some not yet migrated task, or else you're just reshuffling
>>a set of tasks on the same set of nodes, which I presume would be
>>without purpose and so we don't need to support.  If we did need to
>>support shuffling a job on its current node set, I'd have to plead
>>insanity, and reintroduce the temporary node hack.
>>
>>
>>
>>>Unfortunately it does happen often for stuff like shared file mappings
>>>that a different job is using in conjuction with this job.
>>
>>This might be the essential detail I'm missing.  I'm not sure what you
>>mean here (see P.S., at end), but it seems that you are telling me you
>>must have the ability to avoid moving parts of a job.  That for a given
>>task, pinned to a given cpu, with various physical pages on the node
>>local to that cpu, some of those pages must not move, because they are
>>used in conjunction with some other job, that is not being migrated at
>>this time.
> 
> 
> For the simple case assume a sysV shared memory segment that was created
> by a previous job being used by this one.  The memory placement for
> the segment will depend entirely on whether the previous job touched a
> particular page and where that job ran.  It may get migrated depending
> upon if any other jobs anywhere else are on the system and are using it
> and any of the pages are on the jobs old node list.
> 
> These types of mappings have always given us issues (Irix as well as
> Linux) and are difficult to handle.  The one additional nice feature to
> having an external migration facility is we might be able to use this
> type of thing from a command line to move the shared memory segment
> over to nodes that the job is using.  This has just been off the cuff
> thinking lately and hasn't been fully thought through.
> 
> 
>>P.S. - or perhaps what you're telling me with the bit about shared file
>>mappings is not that you must not move any such shared file pages as
>>well, but that you'd rather not, as there are perhaps many such pages,
>>and the time spent moving them would be wasted.  Are you saying that you
>>want to move some subset of a jobs pages, as an optimization, because
>>for a large chunk of pages, such as for some files and libraries shared
>>with other jobs, the expense of migrating them would not be paid back?
> 
> 
> I believe Ray's proposed userland piece would migrate shared libraries
> used exclusively by this job.  Was that right Ray?
>

Yes, that was the intent.

> Here is my real question.  How much opposition is there to the array
> of integers?  This does not seem like a risky interface to me.  If there
> is not a lot of opposition to the arrays, can we discuss the rest of
> the proposal and accept the arrays for the time being?  The array can
> be addressed once we know that the syscall for migrating idea is sound.
> 
> 
> Thanks,
> Robin
> 


-- 
-----------------------------------------------
Ray Bryant
512-453-9679 (work)         512-507-7807 (cell)
raybry@sgi.com             raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
	 so I installed Linux.
-----------------------------------------------
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>

  parent reply	other threads:[~2005-02-16 23:05 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-02-15 23:52 manual page migration -- issue list Ray Bryant
2005-02-16  0:09 ` Paul Jackson
2005-02-16  0:28   ` Ray Bryant
2005-02-16  0:51 ` Paul Jackson
2005-02-16  1:17   ` Paul Jackson
2005-02-16  2:01     ` Robin Holt
2005-02-16  4:04       ` Ray Bryant
2005-02-16  4:28         ` Paul Jackson
2005-02-16  4:24       ` Paul Jackson
2005-02-16  3:55     ` Ray Bryant
2005-02-16  1:56   ` Robin Holt
2005-02-16  4:22     ` Paul Jackson
2005-02-16  9:20       ` Robin Holt
2005-02-16 10:20         ` Paul Jackson
2005-02-16 11:30           ` Robin Holt
2005-02-16 15:45             ` Paul Jackson
2005-02-16 16:08               ` Robin Holt
2005-02-16 19:23                 ` Paul Jackson
2005-02-16 19:56                   ` Robin Holt
2005-02-16 23:08           ` Ray Bryant
2005-02-16 23:05         ` Ray Bryant [this message]
2005-02-17  0:28           ` Paul Jackson
2005-02-16  1:41 ` Paul Jackson
2005-02-16  3:56   ` Ray Bryant

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4213D1B3.4050607@sgi.com \
    --to=raybry@sgi.com \
    --cc=ak@muc.de \
    --cc=haveblue@us.ibm.com \
    --cc=holt@sgi.com \
    --cc=linux-mm@kvack.org \
    --cc=marcello@cyclades.com \
    --cc=peterc@gelato.unsw.edu.au \
    --cc=pj@sgi.com \
    --cc=stevel@mwwireless.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.