* subsystem crashes reboot system?
From: Russell Miller @ 2003-04-02 17:49 UTC
To: linux-kernel
Hi,
I have a feature request. I'm willing to hack away at it myself, but I want to
know if there's any way of doing what I want, or if there's a good
technical reason why it would be impossible.
As I mentioned earlier, we had an ext3 subsystem crash, and a helpful person
was nice enough to tell me that upgrading the kernel would fix it. All well and
good. But this crash left the system in a semi-functional state. The
networking stack was up and running, the kernel was running, but the
filesystem was not functional and because of this the kernel was in a nearly
unusable state. Because the system was pingable, most tcp-stack level
detectors would not have been able to tell that something serious was wrong.
The machine (our main production machine that serves millions of hits a week)
was down for three hours.
Since this was an assertion that failed, one would think that bringing the
system down automatically in an orderly - then, if that fails, disorderly -
fashion would be possible. In particular, I would like it to behave
similarly to the panic sysctl. If a subsystem crashes, reboot the
machine, because the system is essentially worthless in that state. I
realize that this behavior isn't required for everyone, so a sysctl
(panic_on_subsys_crash maybe) would be sufficient.
Since the machine was in a semi-usable state, one might ask why we just didn't
have an automated process in place. Two reasons: a subsystem crashing
happens rarely enough that I didn't see any reason to put the effort into it
until now, and when the system is in a state like that it is impossible to
tell what will work and what will not. For example, when we did the three
finger salute, the system would not go down all the way because one of the
user space programs made an I/O call to the crashed filesystem.
In order of helpfulness, please tell me (only one of the following is more
than enough):
- whether I can do this using the existing sysctl mechanism
- whether there is a patch available (or coming available) to do this
- whether there is a technical reason for me not to do this
- what would be a good place in the code to begin applying a patch.
Please CC me with any replies as I am not on the list.
Thanks.
--Russell
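For reference, a minimal sketch of how the existing panic sysctl is driven
from user space. It assumes the usual procfs path /proc/sys/kernel/panic,
whose value is the number of seconds the kernel waits before rebooting after
a panic (0 means it does not reboot). The helper name write_sysctl is made up
for this illustration. Note that this only controls what happens after a
panic; it does not turn a subsystem failure into one, which is what the
request above is about.

```c
#include <stdio.h>

/* Write a decimal value to a sysctl file such as /proc/sys/kernel/panic.
 * Hypothetical helper for illustration only.
 * Returns 0 on success, -1 on failure (not root, file missing, ...). */
int write_sysctl(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%d\n", value) > 0;

    /* fclose() flushes the buffer, so a failed write can surface here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling write_sysctl("/proc/sys/kernel/panic", 10) as root would ask for a
reboot ten seconds after any panic.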
* Re: subsystem crashes reboot system?
From: Mitch Adair @ 2003-04-02 18:06 UTC
To: Russell Miller; +Cc: linux-kernel
> good. But this crash left the system in a semi-functional state. The
> networking stack was up and running, the kernel was running, but the
> filesystem was not functional and because of this the kernel was in a nearly
> unusable state. Because the system was pingable, most tcp-stack level
> detectors would not have been able to tell that something serious was wrong.
> The machine (our main production machine that serves millions of hits a week)
> was down for three hours.
Isn't this what watchdog is for? I think even the software watchdog would
catch this, then you can panic and reboot.
M
* Re: subsystem crashes reboot system?
From: Michael Buesch @ 2003-04-02 18:44 UTC
To: Mitch Adair; +Cc: Russell Miller, linux-kernel
On Wednesday 02 April 2003 20:06, Mitch Adair wrote:
> Isn't this what watchdog is for? I think even the software watchdog would
> catch this, then you can panic and reboot.
Hm, I don't think that watchdog will catch this, because the userspace-watchdog
daemon will still be running properly in a crash case
(or did I understand something wrong?)
Regards Michael Buesch.
--
-------------
My homepage: http://www.8ung.at/tuxsoft
fighting for peace is like fu**ing for virginity
* Re: subsystem crashes reboot system?
From: Mitch Adair @ 2003-04-02 18:46 UTC
To: Michael Buesch; +Cc: Russell Miller, linux-kernel
> > Isn't this what watchdog is for? I think even the software watchdog would
> > catch this, then you can panic and reboot.
>
> Hm, I don't think that watchdog will catch this, because the userspace-watchdog
> daemon will still be running properly in a crash case
> (or did I understand something wrong?)
But it wouldn't be able to write to the filesystem, so it would trigger,
I believe.
M
* Re: subsystem crashes reboot system?
From: Philippe Troin @ 2003-04-02 19:07 UTC
To: Michael Buesch; +Cc: Mitch Adair, Russell Miller, linux-kernel
Michael Buesch <freesoftwaredeveloper@web.de> writes:
> On Wednesday 02 April 2003 20:06, Mitch Adair wrote:
>
> > Isn't this what watchdog is for? I think even the software
> > watchdog would catch this, then you can panic and reboot.
>
> Hm, I don't think that watchdog will catch this, because the
> userspace-watchdog daemon will still be running properly in a crash
> case (or did I understand something wrong?)
Unless you configure it to stat your filesystems, like in:
watchdog-device = /dev/misc/watchdog
realtime = yes
priority = 99
admin =
file = /
file = /var
file = /usr
...
Phil.
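A rough sketch of what the file = lines ask the daemon to do: stat each
listed path, and only keep feeding the watchdog device while every check
succeeds. This is not the watchdog daemon's actual code; the function names
all_paths_ok and pet_if_healthy are invented here, and the sketch relies only
on the documented behaviour that a write to the watchdog device counts as a
keepalive. On a wedged filesystem stat() may block rather than fail; in that
case the daemon simply stops feeding the device and the timer expires anyway.

```c
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if every path can be stat()ed, 0 as soon as one fails.
 * Hypothetical helper; mirrors the "file = ..." checks in the config. */
int all_paths_ok(const char *paths[], size_t n)
{
    struct stat st;

    for (size_t i = 0; i < n; i++) {
        if (stat(paths[i], &st) != 0)
            return 0;
    }
    return 1;
}

/* Feed the watchdog only while the checks pass. wd_fd is assumed to be an
 * already-open descriptor for the watchdog device.
 * Returns 1 if a keepalive was written, 0 otherwise. */
int pet_if_healthy(int wd_fd, const char *paths[], size_t n)
{
    if (!all_paths_ok(paths, n))
        return 0;   /* skip the keepalive; the hardware timer keeps running */

    return write(wd_fd, "\0", 1) == 1;
}
```

A daemon built this way would call pet_if_healthy() in a loop with a sleep
shorter than the watchdog timeout; once a check fails or hangs, the keepalives
stop and the watchdog resets the machine.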
* Re: subsystem crashes reboot system?
From: Michael Buesch @ 2003-04-02 19:32 UTC
To: Philippe Troin; +Cc: Mitch Adair, Russell Miller, linux-kernel
On Wednesday 02 April 2003 21:07, you wrote:
> > Hm, I don't think that watchdog will catch this, because the
> > userspace-watchdog daemon will still be running properly in a crash
> > case (or did I understand something wrong?)
>
> Unless you configure it to stat your filesystems, like in:
>
> watchdog-device = /dev/misc/watchdog
> realtime = yes
> priority = 99
> admin =
> file = /
> file = /var
> file = /usr
> ...
Yes that's true. I didn't remember this option.
With this, watchdog would be a solution to Russell's problem,
without writing any kernel error handling for it.
Regards Michael Buesch.
--
-------------
My homepage: http://www.8ung.at/tuxsoft
fighting for peace is like fu**ing for virginity
* Re: subsystem crashes reboot system?
From: Andrew Morton @ 2003-04-02 21:51 UTC
To: Russell Miller; +Cc: linux-kernel
Russell Miller <rmiller@duskglow.com> wrote:
>
> Since this was an assertion that failed, one would think that bringing the
> system down automatically in an orderly - then, if that fails, disorderly -
> fashion would be possible.
The way to handle this is to make arch/i386/kernel/traps.c:die() optionally
call panic() rather than do_exit().
It makes sense. It does mean that we now have zero chance of the diagnostic
info making it to the system logs.
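A toy user-space model of that suggestion, to make the control flow concrete.
It is not kernel code: the real die() lives in arch/i386/kernel/traps.c and
is not reproduced here, and the names panic_on_die, die_action and
choose_die_action are hypothetical stand-ins for whatever tunable would be
added.

```c
/* User-space model only; nothing here is taken from the kernel source. */
enum die_action { DIE_KILL_TASK, DIE_PANIC };

/* Hypothetical tunable standing in for the proposed sysctl.
 * 0 = kill only the faulting task (current behaviour), 1 = panic. */
static int panic_on_die = 0;

/* Decide what the end of die() would do under this model. */
enum die_action choose_die_action(void)
{
    return panic_on_die ? DIE_PANIC : DIE_KILL_TASK;
}

/* Name of the kernel function each action corresponds to in the thread. */
const char *die_action_name(enum die_action a)
{
    return a == DIE_PANIC ? "panic()" : "do_exit()";
}
```

With the flag at its default the model picks do_exit(), matching today's
behaviour; setting it to 1 picks panic(), after which the panic sysctl timeout
would decide whether and when the machine reboots.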
* Re: subsystem crashes reboot system?
From: Russell Miller @ 2003-04-02 21:51 UTC
To: Andrew Morton; +Cc: linux-kernel
Any chance of making the dying thread sleep just long enough for syslogd to
write it out to the file, then panic? Since it's an assertion, we have a
little more leeway than in a page fault OOPS, for example.
--Russell
On Wed April 2 2003 3:51 pm, Andrew Morton wrote:
> Russell Miller <rmiller@duskglow.com> wrote:
> > Since this was an assertion that failed, one would think that bringing
> > the system down automatically in an orderly - then, if that fails,
> > disorderly - fashion would be possible.
>
> The way to handle this is to make arch/i386/kernel/traps.c:die() optionally
> call panic() rather than do_exit().
>
> It makes sense. It does mean that we now have zero chance of the
> diagnostic info making it to the system logs.
* Re: subsystem crashes reboot system?
From: Russell Miller @ 2003-04-02 22:11 UTC
To: Andrew Morton; +Cc: linux-kernel
On Wed April 2 2003 4:13 pm, Andrew Morton wrote:
> Yes, that would probably be OK. It won't make anything worse than it
> already is.
>
> hm, the kernel used to panic if schedule() was called from in_interrupt(),
> but that seems to have been taken out. It's easy enough (and free) to
> put back in.
I'll see about writing a patch if you would like. Sounds like a good thing to
get my feet wet with.
--Russell
* Re: subsystem crashes reboot system?
From: Andrew Morton @ 2003-04-02 22:13 UTC
To: Russell Miller; +Cc: linux-kernel
Russell Miller <rmiller@duskglow.com> wrote:
>
> Any chance of making the dying thread sleep just long enough for syslogd to
> write it out to the file, then panic? Since it's an assertion, we have a
> little more leeway than in a page fault OOPS, for example.
>
Yes, that would probably be OK. It won't make anything worse than it
already is.
hm, the kernel used to panic if schedule() was called from in_interrupt(),
but that seems to have been taken out. It's easy enough (and free) to
put back in.
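One way to read the two points above together, sketched as a user-space
model: sleeping needs schedule(), which is not allowed from interrupt context,
so a grace period for syslogd is only an option when the failure happens in
process context. The struct, enum and function names below are hypothetical,
and this is an interpretation of the discussion rather than anything taken
from a patch.

```c
/* User-space model only; all names are invented for this illustration. */
enum oops_plan { PLAN_PANIC_NOW, PLAN_SLEEP_THEN_PANIC };

struct oops_ctx {
    int in_interrupt;       /* nonzero if the failure hit in interrupt context */
    unsigned int flush_ms;  /* grace period for the log daemon, in milliseconds */
};

/* Pick a plan: sleep first only when sleeping is safe and a delay is asked for. */
enum oops_plan plan_oops(const struct oops_ctx *ctx)
{
    if (ctx->in_interrupt || ctx->flush_ms == 0)
        return PLAN_PANIC_NOW;

    return PLAN_SLEEP_THEN_PANIC;
}
```

Under this model a failed assertion in process context with a nonzero
flush_ms gets the delay, while anything that fires in interrupt context goes
straight to panic.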