Re: holy grail - Eric W. Biederman

public inbox for linux-kernel@vger.kernel.org

From: ebiederm@xmission.com (Eric W. Biederman)
To: Anomalous Force <anomalous_force@yahoo.com>
Cc: david.lang@digitalinsight.com, jdike@karaya.com,
	wa@almesberger.net, alan@lxorguk.ukuu.org.uk,
	riel@conectiva.com.br, ebiederm@xmission.com,
	linux-kernel@vger.kernel.org
Subject: Re: holy grail
Date: 30 Dec 2002 10:57:12 -0700
Message-ID: <m1adins83r.fsf@frodo.biederman.org>
In-Reply-To: <20021230043908.77703.qmail@web13206.mail.yahoo.com>

Anomalous Force <anomalous_force@yahoo.com> writes:

> --- David Lang <david.lang@digitalinsight.com> wrote:
> > 
> > I think people are at the point of working on this becouse it
> > sounds like
> > a worthwhile feature, not becouse it's actually anything that would
> > be
> > used.
> 
> UML sounds like a worthwhile feature, turns out its actually pretty
> useful too. kexec() is supported in its current incarnation. why
> not simply extend it the one step further?

kexec() still has not quite made it into the kernel yet...
Can we at least finish one piece before starting on the next?

> 
> > 
> > what possible application needs to be able to do a seamless kernel
> > upgrade
> > that wouldn't be useing a network?
> 
> "programs will never use more than 640K of memory." - bill gates
> 
> lets talk clusters... the teragrid system being built out of 2024
> redhat 7.2 installs (ncsa alone, not counting the 3 other cluster
> sites). imagine a simple system on the network to push a copy of the
> new kernel and then telling each node to hot-swap. 0 downtime.
> __super__ easy to maintain. how easy would that become???

In this case you stagger the reboots, then if you have failover you
get 0 downtime.
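
A minimal sketch of what staggering could look like, assuming a
management host that can trigger a reboot and check node health. The
names `staggered_batches`, `rolling_reboot`, `reboot`, and `is_healthy`
are hypothetical and stand in for whatever the cluster tooling provides:

```python
from typing import Callable, Iterator, List


def staggered_batches(nodes: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive groups of nodes so only one group is down at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]


def rolling_reboot(
    nodes: List[str],
    batch_size: int,
    reboot: Callable[[str], None],      # hypothetical: asks one node to reboot
    is_healthy: Callable[[str], bool],  # hypothetical: True once the node is back
) -> List[str]:
    """Reboot batch by batch; stop as soon as a batch fails to come back."""
    done: List[str] = []
    for batch in staggered_batches(nodes, batch_size):
        for node in batch:
            reboot(node)
        failed = [n for n in batch if not is_healthy(n)]
        if failed:
            # Leave the remaining nodes untouched so capacity is preserved.
            raise RuntimeError(f"nodes failed to return: {failed}")
        done.extend(batch)
    return done
```

The batch size is the knob: it bounds how much capacity is missing at
any moment, which is what failover needs in order to hide the reboots.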

>  how about
> this... an nfs mount point in the grid for /boot such that each node
> then gets the kernel from a central point and hot swaps when a flag
> is set, or a change is detected in the /boot directory. no push even
> needed then. the cost savings from that alone would be worth the
> effort to them.

KABOOM... you just saturated the network with NFS traffic.

> 
> > 
> > if it's a batch processing task, it can checkpoint itself and
> > restart
> > after a reboot.
> > 
> 
> 2024 nodes rebooting, how much time needed while the system is in a
> degraded state?

On MCR (960 nodes at the time) I have rebooted the entire cluster,
including downloading the kernel over the network, in a minute.  And
a complete reinstall of all compute nodes in the cluster took about 5
minutes.  With a little care most of the extra management complexity
of a large cluster is due to hardware problems.

There are two very different problems being considered here: high
availability clustering and high performance clustering.  In high
availability clustering you throw hardware at the problem so that your
application continues to run.  For high performance computing you
throw even more hardware at the problem so your program runs fast.

At some point the high performance clustering needs the high
reliability techniques because with enough hardware the failure
rate becomes noticeable.  Mean time between failures becomes something
you experience and can easily measure, instead of an abstract figure.
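
As a rough worked example of why this happens, assume node failures
are independent and occur at a constant rate. Then the expected time
to the first failure anywhere in the cluster is the per-node figure
divided by the node count. The 50,000 hour per-node MTBF below is an
illustrative assumption, not a measured value:

```python
def cluster_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Expected hours between failures somewhere in the cluster.

    Assumes independent nodes with a constant failure rate, so the
    failure rates simply add up across nodes.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    assumed_node_mtbf = 50_000.0  # hours; hypothetical per-node figure
    for count in (1, 960, 2024):
        hours = cluster_mtbf_hours(assumed_node_mtbf, count)
        print(f"{count:5d} nodes: about {hours:.1f} hours between failures")
```

Under these assumptions a failure interval measured in years for one
machine shrinks to a day or two for a cluster of a few thousand.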

Once the hardware has been made as redundant and as reliable as
possible, job check-pointing becomes the only way to run longer
jobs on the system.  Given that one MPI job may span the entire
cluster this is a very interesting problem.  Long term, at least, that
is something that needs to be completed.
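
For a single process the idea can be sketched as below. This is only
the easy case: it assumes the job state fits in a small JSON file, and
it does not attempt the coordination across ranks that an MPI job
spanning the cluster would need. The function names and the
`stop_after` parameter (used to simulate a node going down) are
hypothetical:

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic, so a crash leaves either the old or new file.
    os.replace(tmp_path, path)


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or the default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run_job(path: str, total_steps: int, stop_after=None) -> dict:
    """Sum 0..total_steps-1, resuming from the last checkpoint if present."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            break  # simulate the node going down mid-job
        state["total"] += state["step"]
        state["step"] += 1
        executed += 1
        save_checkpoint(path, state)
    return state
```

Calling `run_job` again with the same path after an interruption picks
up at the saved step instead of starting over.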

> hence a queue to catch pending irqs while the system swaps over.

And back to the heart of the kexec territory.  Here you simply drop
irqs until the system comes back up.  And then drivers should poll
their hardware to see what state it is in when they come back up.
Additionally it is part of the kexec design to place hardware in a
quiescent state while the kernels are being swapped.

So there should be no special kernel state that needs to be saved
across kernels.  Just enough state to recreate the user space
abstractions.
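
A toy model of that flow, purely for illustration. It is not kexec or
driver code: `FakeNic` and `Driver` are hypothetical, and the only
point is that events arriving during the swap raise no interrupt, yet
nothing needs to be queued because the new driver reads the device
state when it comes up:

```python
class FakeNic:
    """Hypothetical device that counts received packets in a register."""

    def __init__(self) -> None:
        self.rx_pending = 0
        self.irq_enabled = True

    def receive(self, irq_handler=None) -> None:
        self.rx_pending += 1
        # While quiesced no interrupt is delivered; the event is only
        # visible in the device's own state.
        if self.irq_enabled and irq_handler is not None:
            irq_handler()


class Driver:
    """Hypothetical driver; a fresh instance stands in for the new kernel."""

    def __init__(self, dev: FakeNic) -> None:
        self.dev = dev
        self.processed = 0

    def on_irq(self) -> None:
        self.drain()

    def drain(self) -> None:
        self.processed += self.dev.rx_pending
        self.dev.rx_pending = 0

    def quiesce(self) -> None:
        """Old kernel: silence the device before the swap."""
        self.dev.irq_enabled = False

    def resume(self) -> None:
        """New kernel: poll the hardware state, then re-enable interrupts."""
        self.drain()
        self.dev.irq_enabled = True


if __name__ == "__main__":
    nic = FakeNic()
    old = Driver(nic)
    nic.receive(old.on_irq)   # normal operation, handled by interrupt
    old.quiesce()
    nic.receive(old.on_irq)   # arrives during the swap, no interrupt
    new = Driver(nic)         # driver state starts from scratch
    new.resume()              # finds the pending packet by polling
    print(old.processed, new.processed)
```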

Additionally we need a scalable filesystem for the clusters.  Lustre
shows some promise, but it is not done yet.  Things like GFS are
o.k., but I believe they rely on all of the disks being on a single
storage area network, which is a bit of a scalability and reliability
problem.

Eric
