subsystem crashes reboot system?

public inbox for linux-kernel@vger.kernel.org

* subsystem crashes reboot system?
@ 2003-04-02 17:49 Russell Miller
  2003-04-02 18:06 ` Mitch Adair
  2003-04-02 21:51 ` Andrew Morton
  0 siblings, 2 replies; 10+ messages in thread
From: Russell Miller @ 2003-04-02 17:49 UTC (permalink / raw)
  To: linux-kernel

Hi,

I have a feature request.  I'm willing to hack away at it myself, but I want to
know whether there is already a way to do what I want, or whether there's a
good technical reason why it would be impossible.

As I mentioned earlier, we had an ext3 subsystem crash; a helpful person was
nice enough to tell me that upgrading the kernel would fix it.  All well and
good.  But this crash left the system in a semi-functional state.  The 
networking stack was up and running, the kernel was running, but the 
filesystem was not functional and because of this the kernel was in a nearly 
unusable state.  Because the system was pingable, most tcp-stack level 
detectors would not have been able to tell that something serious was wrong.  
The machine (our main production machine that serves millions of hits a week) 
was down for three hours.

Since this was an assertion that failed, one would think that bringing the 
system down automatically in an orderly - then, if that fails, disorderly - 
fashion would be possible.  In particular, I would like it to behave
similarly to the panic sysctl.  If a subsystem crashes, reboot the
machine, because the system is essentially worthless in that state.  I 
realize that this behavior isn't required for everyone, so a sysctl 
(panic_on_subsys_crash maybe) would be sufficient.

Since the machine was in a semi-usable state, one might ask why we just didn't 
have an automated process in place.  Two reasons:  a subsystem crashing 
happens rarely enough that I didn't see any reason to put the effort into it 
until now, and when the system is in a state like that it is impossible to 
tell what will work and what will not.  For example, when we did the three 
finger salute, the system would not go down all the way because one of the 
user space programs made an io call to the crashed filesystem.

In order of helpfulness, please tell me (only one of the following is more 
than enough):
- whether I can do this using the existing sysctl mechanism
- whether there is a patch available (or coming available) to do this
- whether there is a technical reason for me not to do this
- what would be a good place in the code to begin applying a patch.

Please CC me with any replies as I am not on the list.

Thanks.

--Russell
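
[Editor's note] The "panic sysctl" referred to above is exposed as
/proc/sys/kernel/panic: a positive value is the number of seconds the kernel
waits after a panic before rebooting, and 0 means it stays halted.  A minimal
sketch in C of reading that value follows.  The function name
read_int_sysctl is hypothetical and only for illustration; the proposed
panic_on_subsys_crash knob does not exist and is not read here.

```c
#include <stdio.h>

/*
 * Read a single integer from a /proc/sys-style file.
 * Returns 0 and stores the value in *out on success, -1 on any error.
 * (read_int_sysctl is an illustrative name, not an existing API.)
 */
int read_int_sysctl(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling read_int_sysctl("/proc/sys/kernel/panic", &timeout) on a Linux system
would report the current setting; a monitoring script could use that to check
that an automatic reboot after panic is actually enabled.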

* Re: subsystem crashes reboot system?
  2003-04-02 17:49 subsystem crashes reboot system? Russell Miller
@ 2003-04-02 18:06 ` Mitch Adair
  2003-04-02 18:44   ` Michael Buesch
  2003-04-02 21:51 ` Andrew Morton
  1 sibling, 1 reply; 10+ messages in thread
From: Mitch Adair @ 2003-04-02 18:06 UTC (permalink / raw)
  To: Russell Miller; +Cc: linux-kernel

> good.  But this crash left the system in a semi-functional state.  The 
> networking stack was up and running, the kernel was running, but the 
> filesystem was not functional and because of this the kernel was in a nearly 
> unusable state.  Because the system was pingable, most tcp-stack level 
> detectors would not have been able to tell that something serious was wrong.  
> The machine (our main production machine that serves millions of hits a week) 
> was down for three hours.

Isn't this what watchdog is for?  I think even the software watchdog would
catch this, then you can panic and reboot.

	M

* Re: subsystem crashes reboot system?
  2003-04-02 18:06 ` Mitch Adair
@ 2003-04-02 18:44   ` Michael Buesch
  2003-04-02 18:46     ` Mitch Adair
  2003-04-02 19:07     ` Philippe Troin
  0 siblings, 2 replies; 10+ messages in thread
From: Michael Buesch @ 2003-04-02 18:44 UTC (permalink / raw)
  To: Mitch Adair; +Cc: Russell Miller, linux-kernel

On Wednesday 02 April 2003 20:06, Mitch Adair wrote:

> Isn't this what watchdog is for?  I think even the software watchdog would
> catch this, then you can panic and reboot.

Hm, I don't think that watchdog will catch this, because the userspace watchdog
daemon will still be running properly in a crash case
(or did I misunderstand something?)

Regards Michael Buesch.

-- 
-------------
My homepage: http://www.8ung.at/tuxsoft
fighting for peace is like fu**ing for virginity


* Re: subsystem crashes reboot system?
  2003-04-02 18:44   ` Michael Buesch
@ 2003-04-02 18:46     ` Mitch Adair
  2003-04-02 19:07     ` Philippe Troin
  1 sibling, 0 replies; 10+ messages in thread
From: Mitch Adair @ 2003-04-02 18:46 UTC (permalink / raw)
  To: Michael Buesch; +Cc: Russell Miller, linux-kernel

> > Isn't this what watchdog is for?  I think even the software watchdog would
> > catch this, then you can panic and reboot.
> 
> hm, I don't think, that watchdog will catch this, because the userspace-watchdog
> daemon will still be running properly in a crash case
> (or did I understand something wrong?)

But it wouldn't be able to write to the filesystem, so it would trigger,
I believe.

	M

* Re: subsystem crashes reboot system?
  2003-04-02 18:44   ` Michael Buesch
  2003-04-02 18:46     ` Mitch Adair
@ 2003-04-02 19:07     ` Philippe Troin
  2003-04-02 19:32       ` Michael Buesch
  1 sibling, 1 reply; 10+ messages in thread
From: Philippe Troin @ 2003-04-02 19:07 UTC (permalink / raw)
  To: Michael Buesch; +Cc: Mitch Adair, Russell Miller, linux-kernel

Michael Buesch <freesoftwaredeveloper@web.de> writes:

> On Wednesday 02 April 2003 20:06, Mitch Adair wrote:
> 
> > Isn't this what watchdog is for?  I think even the software
> > watchdog would catch this, then you can panic and reboot.
> 
> hm, I don't think, that watchdog will catch this, because the
> userspace-watchdog daemon will still be running properly in a crash
> case (or did I understand something wrong?)

Unless you configure it to stat your filesystems, like in:

  watchdog-device         = /dev/misc/watchdog
  realtime                = yes
  priority                = 99
  admin                   =
  file                    = /
  file                    = /var
  file                    = /usr
  ...

Phil.
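
[Editor's note] As described above, the "file" lines ask the watchdog daemon
to stat each listed path on every pass and to treat a failure as a reason to
stop feeding the watchdog device.  The following is a minimal sketch of that
kind of check, not the daemon's actual code; count_unreachable is a
hypothetical name.  One caveat that is an assumption rather than something
stated in the thread: on a wedged filesystem stat() may block instead of
failing, in which case the daemon would presumably also stop writing to the
watchdog device and the timer would still expire.

```c
#include <stddef.h>
#include <sys/stat.h>

/*
 * Return how many of the n given paths stat() could not examine.
 * A caller modelled on the config above would pass "/", "/var", "/usr"
 * and stop feeding the watchdog device when the result is non-zero.
 * (count_unreachable is an illustrative name, not part of any daemon.)
 */
size_t count_unreachable(const char *const paths[], size_t n)
{
    size_t failures = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        struct stat st;

        if (stat(paths[i], &st) != 0)
            failures++;
    }
    return failures;
}
```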

* Re: subsystem crashes reboot system?
  2003-04-02 19:07     ` Philippe Troin
@ 2003-04-02 19:32       ` Michael Buesch
  0 siblings, 0 replies; 10+ messages in thread
From: Michael Buesch @ 2003-04-02 19:32 UTC (permalink / raw)
  To: Philippe Troin; +Cc: Mitch Adair, Russell Miller, linux-kernel

On Wednesday 02 April 2003 21:07, you wrote:

> > hm, I don't think, that watchdog will catch this, because the
> > userspace-watchdog daemon will still be running properly in a crash
> > case (or did I understand something wrong?)
>
> Unless you configure it to stat your filesystems, like in:
>
>   watchdog-device         = /dev/misc/watchdog
>   realtime                = yes
>   priority                = 99
>   admin                   =
>   file                    = /
>   file                    = /var
>   file                    = /usr
>   ...

Yes, that's true. I didn't remember this option.
With this, watchdog would be a solution to Russell's problem,
without writing any kernel error handling for it.

Regards Michael Buesch.

-- 
-------------
My homepage: http://www.8ung.at/tuxsoft
fighting for peace is like fu**ing for virginity


* Re: subsystem crashes reboot system?
  2003-04-02 17:49 subsystem crashes reboot system? Russell Miller
  2003-04-02 18:06 ` Mitch Adair
@ 2003-04-02 21:51 ` Andrew Morton
  2003-04-02 21:51   ` Russell Miller
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2003-04-02 21:51 UTC (permalink / raw)
  To: Russell Miller; +Cc: linux-kernel

Russell Miller <rmiller@duskglow.com> wrote:
>
> Since this was an assertion that failed, one would think that bringing the 
> system down automatically in an orderly - then, if that fails, disorderly - 
> fashion would be possible.

The way to handle this is to make arch/i386/kernel/traps.c:die() optionally
call panic() rather than do_exit().

It makes sense.  It does mean that we now have zero chance of the diagnostic
info making it to the system logs.


* Re: subsystem crashes reboot system?
  2003-04-02 21:51 ` Andrew Morton
@ 2003-04-02 21:51   ` Russell Miller
  2003-04-02 22:13     ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Russell Miller @ 2003-04-02 21:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Any chance of making the dying thread sleep just long enough for syslogd to
write it out to the file, then panic?  Since it's an assertion, we have a
little more leeway than in a page fault OOPS, for example.

--Russell

On Wed April 2 2003 3:51 pm, Andrew Morton wrote:
> Russell Miller <rmiller@duskglow.com> wrote:
> > Since this was an assertion that failed, one would think that bringing
> > the system down automatically in an orderly - then, if that fails,
> > disorderly - fashion would be possible.
>
> The way to handle this is to make arch/i386/kernel/traps.c:die() optionally
> call panic() rather than do_exit().
>
> It makes sense.  It does mean that we now have zero chance of the
> diagnostic info making it to the system logs.


* Re: subsystem crashes reboot system?
  2003-04-02 22:13     ` Andrew Morton
@ 2003-04-02 22:11       ` Russell Miller
  0 siblings, 0 replies; 10+ messages in thread
From: Russell Miller @ 2003-04-02 22:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

On Wed April 2 2003 4:13 pm, Andrew Morton wrote:

> Yes, that would probably be OK.  It won't make anything worse than it
> already is.
>
> hm, the kernel used to panic if schedule() was called from in_interrupt(),
> but that seems to have been taken out.   It's easy enough (and free) to
> put back in.

I'll see about writing a patch if you would like.  Sounds like a good thing to 
get my feet wet with.

--Russell

* Re: subsystem crashes reboot system?
  2003-04-02 21:51   ` Russell Miller
@ 2003-04-02 22:13     ` Andrew Morton
  2003-04-02 22:11       ` Russell Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2003-04-02 22:13 UTC (permalink / raw)
  To: Russell Miller; +Cc: linux-kernel

Russell Miller <rmiller@duskglow.com> wrote:
>
> Any chance of making the dying thread sleep just long enough for syslogd to 
> write it out to the file, then panic?  Since it's an assertion, we have a 
> little more leeway than in a page fault OOPS, for example.
> 

Yes, that would probably be OK.  It won't make anything worse than it
already is.

hm, the kernel used to panic if schedule() was called from in_interrupt(),
but that seems to have been taken out.   It's easy enough (and free) to
put back in.
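
[Editor's note] The behaviour discussed in this sub-thread can be summarised
as a small decision function.  What follows is a userspace model in C, not
kernel code and not a patch: die(), panic() and do_exit() are the kernel
functions named above, panic_on_subsys_crash is the sysctl name Russell
proposed, and everything else (the enum, choose_die_action, the in_interrupt
parameter) is hypothetical.  The rule that an interrupt-context failure
panics immediately is an assumption drawn from the remark about schedule():
sleeping to let syslogd flush means scheduling, which is not allowed from
interrupt context.

```c
#include <stdbool.h>

/* Possible outcomes when die() is reached (illustrative names only). */
enum die_action {
    DIE_KILL_TASK,        /* existing behaviour: do_exit() the offending task */
    DIE_PANIC_NOW,        /* panic() at once; diagnostics may not reach the logs */
    DIE_DELAY_THEN_PANIC  /* sleep briefly so syslogd can write, then panic() */
};

/*
 * Model of the policy sketched in the thread.
 * panic_on_subsys_crash: the proposed (hypothetical) sysctl setting.
 * in_interrupt: whether the failure happened in interrupt context,
 *               where sleeping is not possible.
 */
enum die_action choose_die_action(bool panic_on_subsys_crash, bool in_interrupt)
{
    if (!panic_on_subsys_crash)
        return DIE_KILL_TASK;
    if (in_interrupt)
        return DIE_PANIC_NOW;
    return DIE_DELAY_THEN_PANIC;
}
```

With the sysctl off the model keeps today's behaviour, so the change would be
opt-in, which matches the request that it not be forced on everyone.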
