public inbox for linux-kernel@vger.kernel.org
* Failover Kernel
@ 2009-02-26  8:58 Tarkan Erimer
  2009-02-26 16:03 ` Willy Tarreau
  2009-02-26 17:02 ` Diego Calleja
  0 siblings, 2 replies; 11+ messages in thread
From: Tarkan Erimer @ 2009-02-26  8:58 UTC (permalink / raw)
  To: linux-kernel

Hi all,

I'm thinking about a kernel feature called "Failover Kernel". The basic 
idea is to keep 2 kernels in memory (one is the running "Primary Kernel" 
and the other is the "Backup Kernel") for disaster recovery when the 
kernel panics or crashes.

This feature's working scheme could be like this:

- The "Backup Kernel" could be specified and loaded into memory via a 
boot line option like: "failover_kernel=/boot/vmlinuz-2.6.26"
- The primary running kernel will send keepalives to the "Backup Kernel" 
to state that it's alive.
- The primary running kernel can write a journal (like journaled 
filesystems do) with the information the backup kernel needs to recover.
- When the primary kernel has crashed and can't send keepalives anymore, 
the backup kernel will recover from this journal, proceed from where the 
primary kernel left off, and become primary.
- When the "Backup Kernel" becomes "Primary", it will load the previous 
one as "Backup Kernel" again, or maybe this could be left manual: the 
user could decide after the disaster recovery which kernel will be 
loaded as backup via a utility like "kexec".
- At kernel compile time, the user can choose the timeout for the 
failover kernel. For example, "Recover after 10 ms of inactivity (not 
receiving keepalives)."
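
To make the keepalive/timeout part of the scheme concrete, here is a 
minimal user-space sketch in Python. It is only a toy model of the idea, 
not kernel code: the class name, its methods and the millisecond clock 
values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class KeepaliveMonitor:
    """Hypothetical backup-side monitor (illustration only, not a kernel API)."""
    timeout_ms: int = 10   # e.g. the "recover after 10 ms of inactivity" example
    last_seen_ms: int = 0  # time at which the most recent keepalive arrived

    def keepalive(self, now_ms: int) -> None:
        # Called whenever the primary signals that it is still alive.
        self.last_seen_ms = now_ms

    def primary_is_dead(self, now_ms: int) -> bool:
        # The primary counts as crashed once it has been silent for
        # strictly longer than the configured timeout.
        return now_ms - self.last_seen_ms > self.timeout_ms


monitor = KeepaliveMonitor(timeout_ms=10)
monitor.keepalive(now_ms=100)
alive_check = monitor.primary_is_dead(now_ms=105)  # 5 ms of silence
dead_check = monitor.primary_is_dead(now_ms=111)   # 11 ms of silence
print(alive_check, dead_check)  # False True
```

The sketch only covers detection. It says nothing about how the backup 
would actually get CPU time to run the check, or how it would take over.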


The usage scenarios of this feature could be:

- For people whose datacenter is remote, it's a big problem when you 
compile a new kernel and reboot into a crashing/non-booting new 
kernel. You are left with a completely crashed and non-functioning 
system. A hard reset and manual action are required. If there were a 
"Failover Kernel" feature, the system would simply switch back to the 
"Backup Kernel" (this backup kernel would be the known stable kernel of 
the system) and the system would proceed to work without any manual 
action required.

- Your system has run fine for the last several months and one day you 
hit a bug and the kernel crashes/panics. With "Failover Kernel", the 
system would switch to the "Backup Kernel" quickly (maybe in some 
milliseconds or a few seconds) to recover, and the system could proceed 
to work normally.

So, I'm not a coder and I don't know whether it is technically possible 
or not. You kernel hackers, what's your opinion about it? Could it 
really be possible? If so, how could we implement it?

Many thanks for reading this long (and maybe stupid) post! :-)

Tarkan ERIMER




* Re: Failover Kernel
  2009-02-26  8:58 Failover Kernel Tarkan Erimer
@ 2009-02-26 16:03 ` Willy Tarreau
  2009-02-27 15:25   ` Tarkan Erimer
  2009-02-26 17:02 ` Diego Calleja
  1 sibling, 1 reply; 11+ messages in thread
From: Willy Tarreau @ 2009-02-26 16:03 UTC (permalink / raw)
  To: Tarkan Erimer; +Cc: linux-kernel

On Thu, Feb 26, 2009 at 10:58:56AM +0200, Tarkan Erimer wrote:
> Hi all,
> 
> I'm thinking about a kernel feature called "Failover Kernel". The basic 
> idea is to put 2 kernels (One is running "Primary Kernel" and the next 
> one is "Backup Kernel") into the memory for disaster recovery of kernel 
> panic'ing/crashing.
> 
> This feature's working schema could be like this :
> 
> - "Backup Kernel" could be stated and loaded into the memory via a boot 
> line option like : "failover_kernel=/boot/vmlinuz-2.6.26"
> - Primary running kernel will send keepalives to the "Backup Kernel" to 
> state that it's alive.
> - Primary running kernel can write a journal (like the journaled 
> filesystems.) about needed infos for the backup kernel to recover.
> - When the primary kernel crashed and couldn't send anymore keepalives, 
> the backup kernel will recover from this journal to proceed to where the 
> primary kernel left and will become primary.
> - When "Backup Kernel" became "Primary" it will load the previous one as 
> "Backup Kernel" again or maybe it could be left to manual. User could 
> decide after the disaster recovery which kernel will be load as backup 
> via a utility like "kexec".
> - At kernel compile time, user can choose the the timing for failover 
> kernel. For example, "Recover After 10 MS. of inactivity (not receiving 
> keepalives). "
> 
> 
> The usage scenarios of this feature could be :
> 
> - For people whose Datacenter is remote, it's a big problem when you 
> compiled a new kernel and rebooting into a crashing/non-booting new 
> kernel. You left with a completely crashed and non-functioning system. 
> Hard reset and manual action is required. If there could be "Failover 
> Kernel feature, the system will simply switch back to the "Backup 
> Kernel" (This backup kernel will be the known stable kernel of the 
> system.) and the system will proceed to work without any manual action 
> required.
> 
> - Your system runs fine for the last several months and one day you hit 
> a bug and kernel crashed/panic'ed . With "Failover Kernel", the system 
> will switch to the "Backup Kernel" quickly (maybe some milliseconds or 
> few seconds.) to recover and the system could proceed to work normally.
> 
> So,I'm not a coder and I don't know it is really possible as technically 
> or not. You the kernel hackers, what's your opinion about it ? Could it 
> be really possible ? If so, how we really can implement it ?
> 
> Many thanks for reading this long (and maybe stupid) post! :-)

You forgot the most important thing : these two kernels will run on
the same machine. I'm not even considering how you intend to schedule
them. However, when a kernel crashes, it's often because of a hard
error : bug in a driver, memory corruption, etc...  You cannot sanely
recover from that. If the driver which crashed started to initiate a
multi-word command to the device, in a lot of situations you'll need
a reset to restore it in a known state. Memory corruption is even
worse, as you cannot even trust the backup kernel.

I'm currently using a backup kernel in our products, and do it with
the boot loader. Some BIOSes allow you to start a watchdog timer on
boot. Grub tries to load the first image, otherwise the second one.
If either image crashes during boot, the hardware watchdog triggers
and the machine reboots to the other image. That's extremely reliable,
and relatively simple.
 
And using this method, you don't have any compatibility problems between
your primary and secondary kernels.
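
The fallback logic described above can be modelled with a small Python 
simulation. This is only a toy sketch of the behaviour, not a Grub or 
BIOS configuration: the function name, the image names and the reboot 
limit are assumptions made up for the example.

```python
def boot_with_fallback(images, boots_ok, max_reboots=4):
    """Simulate a boot loader alternating between images after watchdog resets.

    images:   image names in boot order (hypothetical names)
    boots_ok: maps image name -> True if that image boots cleanly
    Returns (image that came up or None, list of images attempted).
    """
    attempts = []
    current = 0
    for _ in range(max_reboots):
        image = images[current]
        attempts.append(image)
        if boots_ok[image]:
            # A healthy image keeps servicing the watchdog, so no reset occurs.
            return image, attempts
        # The boot hung or crashed: the hardware watchdog expires, the
        # machine resets, and the loader moves on to the other image.
        current = (current + 1) % len(images)
    return None, attempts


result = boot_with_fallback(
    ["vmlinuz-new", "vmlinuz-stable"],
    {"vmlinuz-new": False, "vmlinuz-stable": True},
)
print(result)  # ('vmlinuz-stable', ['vmlinuz-new', 'vmlinuz-stable'])
```

The point of the model is that every fallback goes through a full 
hardware reset, so the second image never has to trust any state left 
behind by the first.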

Regards,
Willy



* Re: Failover Kernel
  2009-02-26  8:58 Failover Kernel Tarkan Erimer
  2009-02-26 16:03 ` Willy Tarreau
@ 2009-02-26 17:02 ` Diego Calleja
  2009-02-27 15:32   ` Tarkan Erimer
  1 sibling, 1 reply; 11+ messages in thread
From: Diego Calleja @ 2009-02-26 17:02 UTC (permalink / raw)
  To: Tarkan Erimer; +Cc: linux-kernel

On Thursday 26 February 2009 09:58:56 Tarkan Erimer wrote:
> Hi all,
> 
> I'm thinking about a kernel feature called "Failover Kernel". The basic 
> idea is to put 2 kernels (One is running "Primary Kernel" and the next 
> one is "Backup Kernel") into the memory for disaster recovery of kernel 
> panic'ing/crashing.

Isn't this what kdump does right now? http://lwn.net/Articles/108595/


* Re: Failover Kernel
  2009-02-26 16:03 ` Willy Tarreau
@ 2009-02-27 15:25   ` Tarkan Erimer
  0 siblings, 0 replies; 11+ messages in thread
From: Tarkan Erimer @ 2009-02-27 15:25 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: linux-kernel

Willy Tarreau wrote:
> You forgot the most important thing : these two kernels will run on
> the same machine. I'm not even considering how you intend to schedule
> them. However, when a kernel crashes, it's often because of a hard
>   
In a similar way to how "kdump" does it. Just put a backup kernel into 
memory and have it receive keepalives from the primary kernel. Under 
normal conditions, the backup kernel will just sit in its place, monitor 
the status of the primary kernel (alive or crashed) and do nothing more. 
So, no scheduling is required.
> error : bug in a driver, memory corruption, etc...  You cannot sanely
> recover from that. If the driver which crashed started to initiate a
> multi-word command to the device, in a lot of situations you'll need
> a reset to restore it in a known state. Memory corruption is even
> worse, as you cannot even trust the backup kernel.
>
>   
Hardware related issues are exceptions. If there were a journal, maybe 
it could be possible to recover sanely from where the primary left off. 
Of course, it's clear that this system will not work for all scenarios 
(like bad hardware etc.).
> I'm currently using a backup kernel in our products, and do it with
> the boot loader. Some BIOSes allow you to start a watchdog timer on
> boot. Grub tries to load the first image, otherwise the second one.
> If either image crashes during boot, the hardware watchdog triggers
> and the machine reboots to the other image. That's extremely reliable,
> and relatively simple.
>  
> And using this method, you don't have any compatibility problems between
> your primary and secondary kernels.
>   
Yep, it's a very simple way. But the problem is that, as you mentioned, 
a watchdog is not supported on all hardware. If it is possible to 
implement, the failover kernel would be a platform/hardware independent 
system.




* Re: Failover Kernel
  2009-02-26 17:02 ` Diego Calleja
@ 2009-02-27 15:32   ` Tarkan Erimer
  2009-02-27 15:50     ` Lubomir Rintel
  0 siblings, 1 reply; 11+ messages in thread
From: Tarkan Erimer @ 2009-02-27 15:32 UTC (permalink / raw)
  To: Diego Calleja; +Cc: linux-kernel

Diego Calleja wrote:
> Isn't this what kdump does right now? http://lwn.net/Articles/108595/
>   

Yep, very similar to "kdump". But kdump is too simple. Its backup kernel 
is minimal and only there to handle the crash dump. With "Failover 
kernel", I'm talking about a fully functional backup kernel like the 
primary one, not just a minimally functioning crash dump kernel.


* Re: Failover Kernel
  2009-02-27 15:32   ` Tarkan Erimer
@ 2009-02-27 15:50     ` Lubomir Rintel
  2009-03-02 16:21       ` Tarkan Erimer
  0 siblings, 1 reply; 11+ messages in thread
From: Lubomir Rintel @ 2009-02-27 15:50 UTC (permalink / raw)
  To: Tarkan Erimer; +Cc: Diego Calleja, linux-kernel


On Fri, 2009-02-27 at 17:32 +0200, Tarkan Erimer wrote:
> Diego Calleja wrote:
> > Isn't this what kdump does right now? http://lwn.net/Articles/108595/
> >   
> 
> Yep, very similar to "kdump". But, kdump is too simple. Backup kernel is 
> just so minimal and only to handle crash dump thing. With "Failover 
> kernel", I'm talking about fully functional backup kernel like primary 
> one. Not just too minimally functioning crash dump kernel.

How is the backup kernel minimal? It is usually the very same kernel as
the "primary" one. You can use the same initrd as well and do a full
multiuser boot.

-- 
Lubomir Rintel <lkundrak@v3.sk>



* Re: Failover Kernel
  2009-02-27 15:50     ` Lubomir Rintel
@ 2009-03-02 16:21       ` Tarkan Erimer
  2009-03-03  3:29         ` David Newall
  0 siblings, 1 reply; 11+ messages in thread
From: Tarkan Erimer @ 2009-03-02 16:21 UTC (permalink / raw)
  To: Lubomir Rintel; +Cc: Diego Calleja, linux-kernel

Lubomir Rintel wrote:
>
> How is the backup kernel minimal? It is usually the very same kernel as
> the "primary" one. You can use the same initrd as well and do a full
> multiuser boot.
>
>   
Kdump's backup kernel (in kdump terms, the "Capture Kernel") is built 
with a minimal set of features and modules (scsi drivers, network 
drivers etc.) to have a small memory footprint and just enough resources 
to handle crash dump related things. So, it's not a full replacement of 
the primary kernel. Also, the point is not to boot when a crash occurs. 
The idea is for the backup kernel to take control when a crash occurs, 
without any need to reboot.


* Re: Failover Kernel
  2009-03-02 16:21       ` Tarkan Erimer
@ 2009-03-03  3:29         ` David Newall
  2009-03-04  8:29           ` Tarkan Erimer
  0 siblings, 1 reply; 11+ messages in thread
From: David Newall @ 2009-03-03  3:29 UTC (permalink / raw)
  To: linux-kernel

Tarkan Erimer wrote:
> the point is not to make boot when crash occured. The idea is to take
> control when a crash occured by backup kernel without any need to reboot.

It sounds like you want everything to just continue running.  I don't
see how that can be done.  All of those in-kernel tables and structures
would need to be migrated, and it follows, because there was a crash,
that any of them might have been corrupted.  Worse, you want this to
save you when you try running a new kernel which crashes, and being a
new kernel, it follows that any of those structures could be different;
it might not be possible to create equivalent structures for different
kernel versions.

If you're at all concerned at keeping the computer running, and I think
that's your goal, then I think the best you can do is reset the
hardware, boot an alternate kernel and restart applications as appropriate.


* Re: Failover Kernel
  2009-03-03  3:29         ` David Newall
@ 2009-03-04  8:29           ` Tarkan Erimer
  2009-03-06  1:10             ` david
  0 siblings, 1 reply; 11+ messages in thread
From: Tarkan Erimer @ 2009-03-04  8:29 UTC (permalink / raw)
  To: David Newall; +Cc: linux-kernel

On 03/03/2009 05:29 AM, David Newall wrote:
> It sounds like you want everything to just continue running.  I don't
>    
Yes, exactly. The backup kernel will take control when a crash occurs, 
without needing a reboot or halt.
> see how that can be done.  All of those in-kernel tables and structures
> would need to be migrated, and it follows, because there was a crash,
> that any of them might have been corrupted.  Worse, you want this to
> save you when you try running a new kernel which crashes, and being a
> new kernel, it follows that any of those structures could be different;
> it might not be possible to create equivalent structures for different
> kernel versions.
>
>    
Yes, that's right and it's the first thing that needs to be overcome. 
Maybe it could be implemented like this:

- Primary kernel could be 2.6.x or 2.6.x.y (2.6.28 or 2.6.28.1)
- Backup kernel could be one of the .y fix releases of the same version 
only, like 2.6.28.5

So, when they're from the same version, it will prevent kernel API and 
structure changes.
For resuming by the backup kernel: the primary kernel could write a 
journal of the things the backup needs to resume, like process IDs, 
memory and process state etc. The same way journalled file systems do 
(they write a journal of what they did, to recover/resume at 
crash/disaster time).
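
As a rough illustration of what such a journal could look like, here is 
a small Python sketch with checksummed records. Everything in it is an 
assumption made up for the example (the record format, the field names 
"pid" and "state", the use of CRC32 and JSON); it is not how any kernel 
or filesystem actually does it.

```python
import json
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of the payload


def append_record(journal: bytearray, state: dict) -> None:
    # Serialize one state snapshot and prefix it with length + checksum.
    payload = json.dumps(state, sort_keys=True).encode()
    journal.extend(HEADER.pack(len(payload), zlib.crc32(payload)))
    journal.extend(payload)


def replay(journal: bytes) -> list:
    """Return all records up to the first truncated or corrupted entry."""
    records, offset = [], 0
    while offset + HEADER.size <= len(journal):
        length, crc = HEADER.unpack_from(journal, offset)
        start = offset + HEADER.size
        payload = journal[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # torn or damaged record: stop replaying here
        records.append(json.loads(payload))
        offset = start + length
    return records


journal = bytearray()
append_record(journal, {"pid": 1, "state": "running"})
append_record(journal, {"pid": 42, "state": "sleeping"})
good = replay(bytes(journal))

damaged = bytearray(journal)
damaged[-1] ^= 0xFF  # flip bits in the last record's payload
recovered = replay(bytes(damaged))
print(len(good), len(recovered))  # 2 1
```

Note that a checksum like this only catches damage to the journal bytes 
themselves. It cannot tell whether the state that was written into the 
journal was already wrong at the time it was recorded.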



* Re: Failover Kernel
  2009-03-04  8:29           ` Tarkan Erimer
@ 2009-03-06  1:10             ` david
  2009-03-09 12:35               ` Tarkan Erimer
  0 siblings, 1 reply; 11+ messages in thread
From: david @ 2009-03-06  1:10 UTC (permalink / raw)
  To: Tarkan Erimer; +Cc: David Newall, linux-kernel

On Wed, 4 Mar 2009, Tarkan Erimer wrote:

> On 03/03/2009 05:29 AM, David Newall wrote:
>> It sounds like you want everything to just continue running.  I don't
>> 
> Yes, exactly. Backup kernel will take control when a crush occured without 
> need a reboot or halt.
>> see how that can be done.  All of those in-kernel tables and structures
>> would need to be migrated, and it follows, because there was a crash,
>> that any of them might have been corrupted.  Worse, you want this to
>> save you when you try running a new kernel which crashes, and being a
>> new kernel, it follows that any of those structures could be different;
>> it might not be possible to create equivalent structures for different
>> kernel versions.
>>
>> 
> Yes, that's right and it's the first thing needed to overcome. Maybe, it 
> could be implemented like this :
>
> - Primary kernel could be 2.6.x or 2.6.x.y (2.6.28 or 2.6.28.1)
> - Backup kernel could be one of these .y fix releases only: Like 2.6.28.5
>
> So; when they're from the same version, it will prevent kernel API and 
> structure changes.
> For resuming by backup kernel: The primary kernel could write a journal about 
> the needed things for backup to resume. Like process IDs, memory and process 
> situations etc. The same manner as the Journalled File Systems did (they 
> write a journal what they did to recover/resume at crash/disaster time).

wrong, kernel structures can change in any patch. they can even change 
with different configuration options.

but even if they are the same version and configuration options, that 
doesn't address the fact that you can't trust the in-kernel structures 
because they may have been damaged by whatever caused the crash.
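
a toy illustration of the layout problem, using Python's ctypes instead 
of real kernel code. the structure, its field names and the "debug" 
option are all made up; the only point is that one optional field is 
enough to shift everything after it.

```python
import ctypes


def make_task_struct(with_debug_field: bool):
    """Build a made-up 'task' record; one field exists only with an option set."""
    fields = [("pid", ctypes.c_uint32)]
    if with_debug_field:
        # Hypothetical field that is only present with some config option.
        fields.append(("debug_cookie", ctypes.c_uint32))
    fields.append(("priority", ctypes.c_uint32))

    class Task(ctypes.Structure):
        _fields_ = fields

    return Task


TaskA = make_task_struct(with_debug_field=True)   # "primary" build
TaskB = make_task_struct(with_debug_field=False)  # "backup" build

# The primary writes a record; the backup reads the same bytes with its layout.
raw = bytes(TaskA(pid=7, debug_cookie=0xDEAD, priority=3))
misread = TaskB.from_buffer_copy(raw)

print(TaskA.priority.offset, TaskB.priority.offset)  # 8 4
print(misread.pid, hex(misread.priority))            # 7 0xdead
```

the backup build reads the debug cookie as the priority and never 
notices, because the bytes are perfectly valid for its own layout.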

David Lang


* Re: Failover Kernel
  2009-03-06  1:10             ` david
@ 2009-03-09 12:35               ` Tarkan Erimer
  0 siblings, 0 replies; 11+ messages in thread
From: Tarkan Erimer @ 2009-03-09 12:35 UTC (permalink / raw)
  To: david; +Cc: David Newall, linux-kernel

On 03/06/2009 03:10 AM, david@lang.hm wrote:
> wrong, kernel structures can change in any patch. they can even change 
> with different configuration options.
>
> but even if they are the same version and configuration options, that 
> doesn't address the fact that you can't trust the in-kernel structures 
> because they may have been damaged by whatever caused the crash.
>
> David Lang

Sorry for the late reply. I was away for a while. Hmmm... I understand. 
It seems it's not really possible. Thanks to everyone who replied to 
this thread.
