From: Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil>
To: nleroy@cs.wisc.edu,
	Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil>,
	pashley@storm.ca,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: McVoy's Clusters (was Re: latest linus-2.5 BK broken)
Date: Thu, 20 Jun 2002 13:32:20 -0500 (CDT)
Message-ID: <200206201832.NAA87254@tomcat.admin.navo.hpc.mil>
In-Reply-To: <200206201743.g5KHhPu31957@schroeder.cs.wisc.edu>

Nick LeRoy <nleroy@cs.wisc.edu>:
> 
> On Thursday 20 June 2002 12:23 pm, Jesse Pollard wrote:
> <snip>
> > You don't use compute servers much? The problems we are currently running
> > require the cluster (IBM SP) to have 100% uptime for a single job. That
> > job may run for several days. If a detected problem is reported (not yet
> > catastrophic) it is desired/demanded to checkpoint the user's process.
> >
> > Currently, we can't - but should be able to by this fall.
> >
> > Having the user's job checkpoint midway in its computations will allow us
> > to remove a node from active service, substitute a different node, and
> > resume the user's process without losing many hours of computation (we have
> > a maximum of 300 nodes for computation, another 30 for I/O and front end).
> 
> Have you tried Condor?  Condor is a "high throughput computing" package,
> specifically targeted at such applications, with the ability to checkpoint &
> migrate jobs, etc.  Condor is free as in beer, but currently not as in speech 
> (sorry), and is developed by the University of Wisconsin.  
> http://www.condorproject.org is the URL to learn more.  Version 6.4.0 is in 
> the process of being released and should be available within the next couple 
> of days.
> 
> Condor runs on Linux (x86 & Alpha), Solaris, IRIX, HPUX, Digital Unix, and 
> NT, although the NT usually lags the Unix releases.

Condor is designed for a relatively low performance network (10-100Mbit) and
not for things like an IBM SP switch, which can carry Gbit data. It would
also need to be available on SP-3/4 and Cray SV systems (not that we have
problems with checkpoint there). Also note these Condor checkpoint limits:

	Job cannot use IPC (pipes, shared memory), which also leaves out PVM/MPI
	Job cannot use threads
	Job cannot use fork

In many of our cases, the jobs are split across many nodes, then spread
across multiple processors in a single node (SP-3 has 4 CPUs per node,
SP-4 will have 8-32). The current scientific library uses PVM/MPI to determine
whether it is using shared memory or node/node RPC.

Tightly integrated models wouldn't work well with Condor (disclaimer:
based on a fast look by me, and I don't work on the current jobs).
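
For illustration only, the checkpoint/resume idea from earlier in the thread
(save a job's state midway so it can resume elsewhere without losing hours of
computation) can be sketched at the application level roughly as follows.
This is a toy in Python, not how Condor or the SP checkpoint facility works;
the file format, the state layout, and the names save_checkpoint,
load_checkpoint and run are all hypothetical, and stop_after merely simulates
a node being pulled from service.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a reader never sees a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh starting state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Sum 0..n_steps-1, saving a checkpoint every checkpoint_every steps.

    stop_after (hypothetical) simulates the node being removed mid-run;
    work done since the last checkpoint is lost in that case.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return None  # interrupted before completion
    save_checkpoint(path, state)
    return state["total"]
```

Calling run() again with the same path, on the same or a different machine
that can see the file, picks up from the last saved step. A single process
with no threads, forks, or IPC is the easy case; the jobs described above
span many nodes and would need a coordinated checkpoint across all of them.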

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: pollard@navo.hpc.mil

Any opinions expressed are solely my own.
