public inbox for linux-kernel@vger.kernel.org
* subsystem crashes reboot system?
@ 2003-04-02 17:49 Russell Miller
  2003-04-02 18:06 ` Mitch Adair
  2003-04-02 21:51 ` Andrew Morton
  0 siblings, 2 replies; 10+ messages in thread
From: Russell Miller @ 2003-04-02 17:49 UTC (permalink / raw)
  To: linux-kernel

Hi,

I have a feature request.  I'm willing to hack away at it myself, but I want
to know whether there is already a way to do what I want, or whether there is
a good technical reason why it would be impossible.

As I mentioned earlier, we had an ext3 subsystem crash, and a helpful person
was nice enough to tell me that upgrading the kernel would fix it.  All well
and good.  But this crash left the system in a semi-functional state.  The
networking stack was up and the kernel was still running, but the filesystem
was not functional, and because of this the machine was nearly unusable.
Because the system was pingable, most TCP-level monitors would not have been
able to tell that something serious was wrong.  The machine (our main
production machine, which serves millions of hits a week) was down for three
hours.
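
To illustrate the point about ping-level monitoring: a check that actually
exercises the filesystem would have noticed this failure.  What follows is a
minimal Python sketch, not something from the original report.  The function
names, the probe file prefix, and the 5-second timeout are all assumptions
chosen for illustration.  The probe runs in a forked child because a write to
a wedged filesystem may block indefinitely.

```python
import os
import signal
import tempfile
import time


def _write_probe(directory):
    """Create, write, fsync, and remove a small file inside `directory`."""
    fd, path = tempfile.mkstemp(dir=directory, prefix=".fs_probe_")
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # push the data through the filesystem to the device
    finally:
        os.close(fd)
        os.unlink(path)


def filesystem_healthy(directory, timeout_s=5.0, poll_s=0.05):
    """Return True if a write probe in `directory` succeeds within timeout_s."""
    pid = os.fork()
    if pid == 0:
        # Child: run the probe and report the result through the exit status.
        status = 1
        try:
            _write_probe(directory)
            status = 0
        except Exception:
            pass
        finally:
            os._exit(status)

    # Parent: poll for the child without ever blocking on it.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        done, wait_status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
        time.sleep(poll_s)

    # Timed out.  A child blocked inside the kernel may ignore the signal,
    # so we send it and report failure without waiting for the child to exit.
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return False
```

A monitor could call filesystem_healthy("/var/www") (the path is only an
example) alongside its ping check and alert when it returns False.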

Since this was an assertion that failed, one would think that bringing the
system down automatically in an orderly fashion (and then, if that fails, a
disorderly one) would be possible.  In particular, I would like it to behave
much as it does with the panic sysctl: if a subsystem crashes, reboot the
machine, because the system is essentially worthless in that state.  I
realize that not everyone needs this behavior, so a sysctl
(panic_on_subsys_crash maybe) would be sufficient.
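
As a rough sketch of how the proposed knob could sit next to the existing
one: kernel.panic (/proc/sys/kernel/panic) holds the number of seconds to
wait before rebooting after a panic, with 0 meaning no automatic reboot.  The
Python below reads that value and combines it with panic_on_subsys_crash,
which is purely hypothetical here (it is the name suggested above, placed
under kernel. as an assumption).  The helper names and the policy itself are
also assumptions for illustration.

```python
import os


def read_sysctl(name, proc_root="/proc/sys"):
    """Return the integer value of a dotted sysctl name, or None if unreadable."""
    path = os.path.join(proc_root, *name.split("."))
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def would_reboot_on_subsys_crash(proc_root="/proc/sys"):
    """Hypothetical policy: treat a subsystem crash like a panic.

    True only when the (hypothetical) panic_on_subsys_crash flag is nonzero
    and kernel.panic is a positive reboot delay.  Negative kernel.panic
    values are not handled by this sketch.
    """
    flag = read_sysctl("kernel.panic_on_subsys_crash", proc_root)  # hypothetical
    delay = read_sysctl("kernel.panic", proc_root)  # existing sysctl
    return bool(flag) and delay is not None and delay > 0
```

On a current system the hypothetical file does not exist, so
would_reboot_on_subsys_crash() returns False.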

Since the machine was in a semi-usable state, one might ask why we didn't
simply have an automated process in place.  Two reasons: a subsystem crash
happens rarely enough that I didn't see any reason to put the effort into it
until now, and when the system is in a state like that it is impossible to
tell what will work and what will not.  For example, when we did the
three-finger salute (Ctrl-Alt-Del), the system would not go down all the way
because one of the user-space programs made an I/O call to the crashed
filesystem.
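
The "orderly first, then disorderly" idea can be sketched from user space as
well.  The Python below is an illustration under stated assumptions, not a
tested recovery tool: reboot_with_fallback, its parameters, and the 120-second
grace period are made up for this example.  The forced path writes "b" to
/proc/sysrq-trigger, which on kernels that provide it (and with magic SysRq
enabled) reboots immediately without syncing or unmounting, and requires root.

```python
import subprocess
import time


def sysrq_reboot():
    """Immediate reboot via magic SysRq 'b': no sync, no unmount.  Needs root."""
    with open("/proc/sysrq-trigger", "w") as trigger:
        trigger.write("b")


def reboot_with_fallback(orderly_cmd, forced_action=sysrq_reboot, grace_s=120.0):
    """Start an orderly reboot, then force one if we are still running.

    If the orderly shutdown works, this process is killed before the grace
    period ends and forced_action is never reached.  If shutdown stalls (for
    example on a process blocked in I/O), the sleep returns and we escalate.
    """
    try:
        # Do not wait on the command: the shutdown itself may hang.
        subprocess.Popen(orderly_cmd)
    except OSError:
        grace_s = 0.0  # could not even start the orderly path
    time.sleep(grace_s)
    forced_action()
```

A real invocation might look like reboot_with_fallback(["shutdown", "-r",
"now"]), run from a process that does not itself touch the suspect filesystem
after startup.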

In order of helpfulness, please tell me (only one of the following is more 
than enough):
- whether I can do this using the existing sysctl mechanism
- whether there is a patch available (or forthcoming) to do this
- whether there is a technical reason for me not to do this
- what would be a good place in the code to begin applying a patch

Please CC me with any replies as I am not on the list.

Thanks.

--Russell



Thread overview: 10+ messages
2003-04-02 17:49 subsystem crashes reboot system? Russell Miller
2003-04-02 18:06 ` Mitch Adair
2003-04-02 18:44   ` Michael Buesch
2003-04-02 18:46     ` Mitch Adair
2003-04-02 19:07     ` Philippe Troin
2003-04-02 19:32       ` Michael Buesch
2003-04-02 21:51 ` Andrew Morton
2003-04-02 21:51   ` Russell Miller
2003-04-02 22:13     ` Andrew Morton
2003-04-02 22:11       ` Russell Miller
