* Hang on "echo b > /proc/sysrq-trigger"
@ 2012-02-17 22:54 Keith Chew
2012-02-29 18:07 ` Eric W. Biederman
0 siblings, 1 reply; 12+ messages in thread
From: Keith Chew @ 2012-02-17 22:54 UTC (permalink / raw)
To: linux-kernel
Hi
To test the reliability of some hardware, I have a script which reboots
the machine 15 minutes after each boot up. This machine has dual video
outputs, VGA and DVI-D, both driven by an Intel GM45 chipset (I am
using the in-kernel Intel drivers from kernel 2.6.39.24).
Some interesting results (which can be reproduced consistently):
"echo b > /proc/sysrq-trigger" via DVI-D - after 2-3 days, it hangs
(freezes) before reboot (dmesg only shows "Resetting...", nothing
after that, no panic, stack trace, etc)
"echo b > /proc/sysrq-trigger" via VGA - runs > 1 week
"reboot -fn" via VGA or DVI-D - runs > 1 week
"reboot" via VGA or DVI-D - runs > 1 week
I suspect that the intel graphics driver is not happy with the "echo b
> /proc/sysrq-trigger" when it is still running.
I would like to make the "echo b" successfully reboot the machine, but
this would appear to be a hardware bug? Is there anything that can be
done in the kernel to make the "echo b" successfully work 100%?
Regards
Keith
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-02-17 22:54 Hang on "echo b > /proc/sysrq-trigger" Keith Chew
@ 2012-02-29 18:07 ` Eric W. Biederman
2012-02-29 18:28 ` Keith Chew
0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2012-02-29 18:07 UTC (permalink / raw)
To: Keith Chew; +Cc: linux-kernel
Keith Chew <keith.chew@gmail.com> writes:
> Hi
>
> To test the reliability of a hardware, I have a script which reboots a
> machine every 15 minutes after boot up. This machine has a dual video
> output, VGA and DVI-D, both driven via an intel GM45 chipset (I am
> using kernel 2.6.39.24 kernel intel drivers).
>
> Some interesting results (which can be reproduced consistently):
> "echo b > /proc/sysrq-trigger" via DVI-D - after 2-3 days, it hangs
> (freezes) before reboot (dmesg only shows "Resetting...", nothing
My blind guess would be that it is the BIOS on the machine that is hung.
> after that, no panic, stack trace, etc)
> "echo b > /proc/sysrq-trigger" via VGA - runs > 1 week
> "reboot -fn" via VGA or DVI-D - runs > 1 week
> "reboot" via VGA or DVI-D - runs > 1 week
>
> I suspect that the intel graphics driver is not happy with the "echo b
>> /proc/sysrq-trigger" when it is still running.
>
> I would like to make the "echo b" successfully reboot the machine, but
> this would appear to be a hardware bug? Is there anything that can be
> done in the kernel to make the "echo b" successfully work 100%?
echo b > /proc/sysrq-trigger triggers the emergency_restart path, which
deliberately skips some steps so that it has a reasonable chance of
working when the kernel is wedged. It looks like some of the steps it
skips are needed on your hardware.
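
[Editor's sketch] The difference described above can be modelled in
userspace. This is illustrative only: the step names are a simplified
reading of an orderly restart versus the emergency one, the assumption
that reboot(2) takes the orderly path is mine, and step_log,
model_kernel_restart and the other helpers are hypothetical names, not
kernel code.

```c
#include <string.h>

#define MAX_STEPS 8

/* Records the names of the steps a restart path would run (model only). */
struct step_log {
    const char *steps[MAX_STEPS];
    int count;
};

void log_step(struct step_log *log, const char *name)
{
    if (log->count < MAX_STEPS)
        log->steps[log->count++] = name;
}

/* Orderly path (assumed to be what reboot(2) takes): quiesce, then reset. */
void model_kernel_restart(struct step_log *log)
{
    log_step(log, "reboot notifiers");  /* subsystems are told a restart is coming */
    log_step(log, "device_shutdown");   /* devices, e.g. graphics, are quiesced */
    log_step(log, "machine_restart");   /* architecture-level reset */
}

/* Emergency path, as triggered by SysRq 'b': go straight to the reset. */
void model_emergency_restart(struct step_log *log)
{
    log_step(log, "machine_emergency_restart");
}

/* Returns 1 if the log contains the named step, 0 otherwise. */
int log_contains(const struct step_log *log, const char *name)
{
    int i;

    for (i = 0; i < log->count; i++)
        if (strcmp(log->steps[i], name) == 0)
            return 1;
    return 0;
}

/* Returns 1 if the modelled orderly path includes the named step. */
int orderly_path_runs(const char *name)
{
    struct step_log log = { {0}, 0 };

    model_kernel_restart(&log);
    return log_contains(&log, name);
}

/* Returns 1 if the modelled emergency path includes the named step. */
int emergency_path_runs(const char *name)
{
    struct step_log log = { {0}, 0 };

    model_emergency_restart(&log);
    return log_contains(&log, name);
}
```

In this model, orderly_path_runs("device_shutdown") is 1 while
emergency_path_runs("device_shutdown") is 0, which is the kind of
skipped step that could leave a display engine running when the reset
is issued.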
Eric
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-02-29 18:07 ` Eric W. Biederman
@ 2012-02-29 18:28 ` Keith Chew
2012-02-29 20:49 ` Eric W. Biederman
0 siblings, 1 reply; 12+ messages in thread
From: Keith Chew @ 2012-02-29 18:28 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: linux-kernel
Hi Eric
<snip>
>> Some interesting results (which can be reproduced consistently):
>> "echo b > /proc/sysrq-trigger" via DVI-D - after 2-3 days, it hangs
>> (freezes) before reboot (dmesg only shows "Resetting...", nothing
>
> My blind guess would be that it is the BIOS on the machine that is hung.
>
We have contacted the manufacturer, and they do not believe this is
the case as the BIOS does not really do much during the reboot.
Unfortunately, we do not have enough knowledge on the inner workings
of the BIOS to help or diagnose further. Any pointers here will be
helpful.
<snip>
>>
>> I would like to make the "echo b" successfully reboot the machine, but
>> this would appear to be a hardware bug? Is there anything that can be
>> done in the kernel to make the "echo b" successfully work 100%?
>
> echo b > /proc/sysrq-trigger triggers the emergency_restart path which
> tries but skips some steps so that it has a reasonable chance of working
> when the kernel is wedged, it looks like some of those steps it skips
> are needed on your hardware.
>
Yes, I have looked into the kernel code and it does not do much,
except to tell the hardware to reboot (either via BIOS, keyboard,
ACPI, etc). I have also tried the reboot=b, reboot=k and reboot=a
options, and all of them can cause a hang, with reboot=b lasting the
longest.
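
[Editor's sketch] For reference, the letters used with the x86 reboot=
option select a reset method by their first character. The mapping
below is a small illustration under that assumption; the enum and the
function name are hypothetical, and only the b, k and a letters are
mentioned in this message (the others are listed from the x86
documentation of the option as I understand it).

```c
/* Hypothetical names for this sketch; the kernel uses its own identifiers. */
enum reboot_method {
    METHOD_UNKNOWN,
    METHOD_BIOS,      /* reboot=b: jump through the BIOS */
    METHOD_KBD,       /* reboot=k: keyboard controller reset */
    METHOD_ACPI,      /* reboot=a: ACPI reset register */
    METHOD_TRIPLE,    /* reboot=t: force a triple fault */
    METHOD_EFI,       /* reboot=e: EFI runtime reset service */
    METHOD_PCI_CF9    /* reboot=p: PCI reset register, port 0xCF9 */
};

/* Maps the first letter of a reboot= value to the method it selects. */
enum reboot_method reboot_method_from_letter(char letter)
{
    switch (letter) {
    case 'b': return METHOD_BIOS;
    case 'k': return METHOD_KBD;
    case 'a': return METHOD_ACPI;
    case 't': return METHOD_TRIPLE;
    case 'e': return METHOD_EFI;
    case 'p': return METHOD_PCI_CF9;
    default:  return METHOD_UNKNOWN;  /* not a method selector in this sketch */
    }
}
```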
We have extended our testing time, and have some more worrying
results. The command "reboot -fn" which runs > 1 week, got a hang
after 2 weeks of running. We are now testing with just "reboot" to see
how long that lasts.
Regards
Keith
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-02-29 18:28 ` Keith Chew
@ 2012-02-29 20:49 ` Eric W. Biederman
2012-02-29 22:06 ` Keith Chew
0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2012-02-29 20:49 UTC (permalink / raw)
To: Keith Chew; +Cc: linux-kernel
Keith Chew <keith.chew@gmail.com> writes:
> Hi Eric
>
> <snip>
>
>>> Some interesting results (which can be reproduced consistently):
>>> "echo b > /proc/sysrq-trigger" via DVI-D - after 2-3 days, it hangs
>>> (freezes) before reboot (dmesg only shows "Resetting...", nothing
>>
>> My blind guess would be that it is the BIOS on the machine that is hung.
>>
>
> We have contacted the manufacturer, and they do not believe this is
> the case as the BIOS does not really do much during the reboot.
> Unfortunately, we do not have enough knowledge on the inner workings
> of the BIOS to help or diagnose further. Any pointers here will be
> helpful.
Historically a lot of issues have had to do with which cpu you are
entering the bios from. So you might try pinning your process
to different cpus and see if you can make the failure more deterministic.
>>> I would like to make the "echo b" successfully reboot the machine, but
>>> this would appear to be a hardware bug? Is there anything that can be
>>> done in the kernel to make the "echo b" successfully work 100%?
>>
>> echo b > /proc/sysrq-trigger triggers the emergency_restart path which
>> tries but skips some steps so that it has a reasonable chance of working
>> when the kernel is wedged, it looks like some of those steps it skips
>> are needed on your hardware.
>>
>
> Yes, I have looked into the kernel code and it does not do much,
> except to tell the hardware to reboot (either via BIOS, keyboard,
> ACPI, etc). I have also tried the reboot=b, reboot=k and reboot=a
> options, and all of them can cause a hang, with reboot=b lasting the
> longest.
>
> We have extended our testing time, and have some more worrying
> results. The command "reboot -fn" which runs > 1 week, got a hang
> after 2 weeks of running. We are now testing with just "reboot" to see
> how long that last.
Ugh. The other possibility is that there is an intermittent failure in
the hardware, that prevents the boot/reboot. Wrong values on pull-up
resistors have been known to cause that kind of thing.
Eric
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-02-29 20:49 ` Eric W. Biederman
@ 2012-02-29 22:06 ` Keith Chew
2012-02-29 23:34 ` Eric W. Biederman
0 siblings, 1 reply; 12+ messages in thread
From: Keith Chew @ 2012-02-29 22:06 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: linux-kernel
Hi Eric
>
> Historically a lot of issues have had to do with which cpu you are
> entering the bios from. So you might try pinning your process
> to differen cpus and see if you can make the failure more deterministic.
>
We are using a Celeron 575 uniprocessor, so we do not have the option
to pin on another cpu. I have tried compiling the kernel in both UP
and SMP configurations, but sadly both cause the hang.
>
> Ugh. The other possibility is that there is an intermittent failure in
> the hardware, that prevents the boot/reboot. Wrong values on pull-up
> resistors have been known to cause that kind of thing.
>
Thank you very much for this pointer, will feed that back to the
manufacturer and see if it will give them some clues. The original
purpose for this reboot exercise was to ensure the software will
handle a power failure without any OS/data corruptions. With this new
discovery of unreliable reboot, the next worry is "If reboot is not
reliable, is the boot process also susceptible to the same issue?". I
have not rigged up any hardware to simulate a periodic full shutdown
and boot up process, but will be planning to set this up next.
Thanks again, if you have any other suggestions for us to try, I am all ears!
Regards
Keith
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-02-29 22:06 ` Keith Chew
@ 2012-02-29 23:34 ` Eric W. Biederman
2012-03-01 0:12 ` Keith Chew
0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2012-02-29 23:34 UTC (permalink / raw)
To: Keith Chew; +Cc: linux-kernel
Keith Chew <keith.chew@gmail.com> writes:
> Hi Eric
>
>>
>> Historically a lot of issues have had to do with which cpu you are
>> entering the bios from. So you might try pinning your process
>> to differen cpus and see if you can make the failure more deterministic.
>>
>
> We are using a Celeron 575 uniprocessor, so we do not have the option
> to pin on another cpu. I have tried compiling the kernel in both UP
> and SMP configuration, but sadly both causes the hang.
Ok. That rules out a bunch of things, and emergency_restart may not
be much different in practice.
>> Ugh. The other possibility is that there is an intermittent failure in
>> the hardware, that prevents the boot/reboot. Wrong values on pull-up
>> resistors have been known to cause that kind of thing.
>>
>
> Thank you very much for this pointer, will feed that back to the
> manufacturer and see if it will give them some clues. The original
> purpose for this reboot exercise was to ensure the software will
> handle a power failure without any OS/data corruptions. With this new
> discovery of unreliable reboot, the next worry is "If reboot is not
> reliable, is the boot process also susceptible to the same issue?". I
> have not rigged up any hardware to simulate a periodic full shutdown
> and boot up process, but will be planning to set this up next.
>
> Thanks again, if you have any other suggestions for us to try, I am
> all ears!
I would check with your BIOS folks and perhaps play with the kernel
option. The most reliable way to perform a reset is to trigger a board
reset by writing to 0xcf9 or a similar register. I expect your BIOS
does that and you can probably get the kernel to do that. I would
definitely test to see if you can write to the mostly standard
0xcf9 register directly from the kernel and trigger a reset directly.
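
[Editor's sketch] A reset through 0xcf9 is commonly done as a two-step
write: first request a hard reset, then set the bit that triggers it.
The code below is a hedged illustration of that sequence, not the
kernel's implementation. The port_io hooks and all function names are
hypothetical; they stand in for inb(), outb() and udelay() so the
sequence can be read and exercised without touching hardware, and the
bit meanings are my reading of the common chipset convention.

```c
#include <stdint.h>

#define CF9_PORT 0x0cf9

/* Hypothetical port-I/O hooks, standing in for inb()/outb()/udelay(). */
struct port_io {
    uint8_t (*inb)(uint16_t port);
    void (*outb)(uint8_t value, uint16_t port);
    void (*udelay)(unsigned int usecs);
};

/* Value for the first write: keep unrelated bits, select a hard reset. */
uint8_t cf9_request_value(uint8_t current)
{
    return (uint8_t)((current & ~0x06u) | 0x02u);
}

/* Value for the second write: same, plus the bit that triggers the reset. */
uint8_t cf9_trigger_value(uint8_t current)
{
    return (uint8_t)((current & ~0x06u) | 0x06u);
}

/* Two-step reset through 0xCF9 using the supplied port hooks. */
void cf9_reset(const struct port_io *io)
{
    uint8_t current = io->inb(CF9_PORT);

    io->outb(cf9_request_value(current), CF9_PORT);  /* request hard reset */
    io->udelay(50);
    io->outb(cf9_trigger_value(current), CF9_PORT);  /* actually do the reset */
    io->udelay(50);
}
```

With a register that reads 0x00, the two writes would be 0x02 and then
0x06; other bits in the register are preserved.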
Once past a reset and with a single cpu all of the failures will be
happening in the boot path. So the only possible points of failure
are in devices that are different between a soft reset and a power on
reset.
I would check to see if your board perhaps supports post codes or any
other debugging that will let you see where you are hanging.
It sounds like there is some very rare failure, that is going to be
a challenge to track down. I would definitely test more than one
motherboard to ensure that you can reproduce the problem on more
than one piece of hardware. Sometimes hardware is just broken.
Eric
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-02-29 23:34 ` Eric W. Biederman
@ 2012-03-01 0:12 ` Keith Chew
2012-03-10 23:45 ` Keith Chew
0 siblings, 1 reply; 12+ messages in thread
From: Keith Chew @ 2012-03-01 0:12 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: linux-kernel
Hi Eric
> I would check with your BIOS folks and perhaps play with the kernel
> option. The most reliable way to peform a reset is to trigger a board
> reset by writing to 0xcf9 or a similar register. I expect your BIOS
> does that and you can probably get the kernel to do that. I would
> definitely test to see if you can write to the mostly standard
> 0xcf9 register directly from the kernel and trigger a reset directly.
>
> Once past a reset and with a single cpu all of the failures will be
> happening in the boot path. So the only possible points of failure
> are in devices that are different between a soft reset and a power on
> reset.
>
> I would check to see if your board perhaps supports post codes or any
> other debugging that will let you see where you are hanging.
>
> It sounds like there is some very rare failure, that is going to be
> a challenge to track down. I would definitely test more than one
> motherboard to ensure that you can reproduce the problem on more
> than one piece of hardware. Sometimes hardware is just broken.
>
These are really helpful suggestions, I will try to get to the bottom
of it. Yes, I have tried 3 different boards with different RAM, HDD and
CPU. The hang can be reproduced consistently (just not
deterministically at this stage).
Thank you very much again, will update the progress in due course.
Regards
Keith
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-03-01 0:12 ` Keith Chew
@ 2012-03-10 23:45 ` Keith Chew
2012-03-19 6:34 ` Jon Masters
0 siblings, 1 reply; 12+ messages in thread
From: Keith Chew @ 2012-03-10 23:45 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: linux-kernel
Hi Eric
Just a quick update...
>> I would check with your BIOS folks and perhaps play with the kernel
>> option. The most reliable way to peform a reset is to trigger a board
>> reset by writing to 0xcf9 or a similar register. I expect your BIOS
>> does that and you can probably get the kernel to do that. I would
>> definitely test to see if you can write to the mostly standard
>> 0xcf9 register directly from the kernel and trigger a reset directly.
>>
Thank you very much for this. We have tried with reboot=p, which is
writing to the 0xCF9 register directly, and the test has been running
good for the past 9 days. Will keep monitoring it.
I have also added a delay in the KBD reboot, which appears to
have made the reboot reliable (running good for past 8 days, before it
would hang after 2 days):
-----------------------
kb_wait();
udelay(150);       /* increased from 50 */
outb(0xfe, 0x64);  /* pulse reset low */
udelay(50);
-----------------------
Looks like the code may be issuing the reboot a bit too early after
returning from kb_wait().
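
[Editor's sketch] To make the timing concrete, here is a userspace dry
run of one keyboard-controller reset attempt with a configurable settle
delay. It is a model under stated assumptions, not the kernel code: the
kbd_io hooks, the fake controller and every function name are
hypothetical, and the wait loop is only my approximation of what
kb_wait() does (poll the status port until the input buffer is empty).

```c
#include <stdint.h>

#define KBD_PORT        0x64  /* status on read, command on write */
#define KBD_INPUT_FULL  0x02  /* controller has not consumed the last byte */
#define KBD_PULSE_RESET 0xfe  /* command: pulse the reset line low */

/* Hypothetical port-I/O hooks, standing in for inb()/outb()/udelay(). */
struct kbd_io {
    uint8_t (*inb)(uint16_t port);
    void (*outb)(uint8_t value, uint16_t port);
    void (*udelay)(unsigned int usecs);
};

/* Poll until the input buffer is empty, or give up after 0x10000 polls. */
void kbd_wait_sketch(const struct kbd_io *io)
{
    int i;

    for (i = 0; i < 0x10000; i++) {
        if ((io->inb(KBD_PORT) & KBD_INPUT_FULL) == 0)
            break;
        io->udelay(2);
    }
}

/* One reset attempt; settle_us is the delay that was raised from 50 to 150. */
void kbd_reset_attempt(const struct kbd_io *io, unsigned int settle_us)
{
    kbd_wait_sketch(io);
    io->udelay(settle_us);
    io->outb(KBD_PULSE_RESET, KBD_PORT);
    io->udelay(50);
}

/* Fake controller: reports "busy" for the first few status reads. */
static int fake_busy_reads;
static unsigned int fake_elapsed_us;

static uint8_t fake_inb(uint16_t port)
{
    (void)port;
    return fake_busy_reads-- > 0 ? KBD_INPUT_FULL : 0;
}

static void fake_outb(uint8_t value, uint16_t port)
{
    (void)value;
    (void)port;  /* a real controller would reset the machine here */
}

static void fake_udelay(unsigned int usecs)
{
    fake_elapsed_us += usecs;  /* accumulate instead of sleeping */
}

/* Dry run: total microseconds of delay requested during one attempt. */
unsigned int kbd_dry_run_us(int busy_reads, unsigned int settle_us)
{
    const struct kbd_io io = { fake_inb, fake_outb, fake_udelay };

    fake_busy_reads = busy_reads;
    fake_elapsed_us = 0;
    kbd_reset_attempt(&io, settle_us);
    return fake_elapsed_us;
}
```

In this model an idle controller sees 100 us of total delay with the
original setting and 200 us with the increased one. The model says
nothing about why the extra time would help on real hardware.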
Regards
Keith
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-03-10 23:45 ` Keith Chew
@ 2012-03-19 6:34 ` Jon Masters
2012-03-19 6:45 ` Keith Chew
0 siblings, 1 reply; 12+ messages in thread
From: Jon Masters @ 2012-03-19 6:34 UTC (permalink / raw)
To: Keith Chew; +Cc: Eric W. Biederman, linux-kernel
Hi Keith,
I've just been reviewing LKML for the past month…was there ever any followup?
Perhaps it's worth discussing a patch to change the below delay?
Jon.
On Mar 10, 2012, at 6:45 PM, Keith Chew wrote:
> Hi Eric
>
> Keep a quick update...
>
>>> I would check with your BIOS folks and perhaps play with the kernel
>>> option. The most reliable way to peform a reset is to trigger a board
>>> reset by writing to 0xcf9 or a similar register. I expect your BIOS
>>> does that and you can probably get the kernel to do that. I would
>>> definitely test to see if you can write to the mostly standard
>>> 0xcf9 register directly from the kernel and trigger a reset directly.
>>>
>
> Thank you very much for this. We have tried with reboot=p, which is
> writing to the 0xCF9 register directly, and the test has been running
> good for the past 9 days. Will keep monitoring it.
>
> Also, I have also added a delay in the KBD reboot, which appears to
> have made the reboot reliable (running good for past 8 days, before it
> would hang after 2 days):
> -----------------------
> kb_wait();
> udelay(150); <------ Increased from 50
> outb(0xfe, 0x64); /* pulse reset low */
> udelay(50);
> -----------------------
>
> Looks like the code may be issuing the reboot a bit too early after
> returning from kb_wait().
>
> Regards
> Keith
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-03-19 6:34 ` Jon Masters
@ 2012-03-19 6:45 ` Keith Chew
2012-03-24 1:11 ` Ray Lee
0 siblings, 1 reply; 12+ messages in thread
From: Keith Chew @ 2012-03-19 6:45 UTC (permalink / raw)
To: Jon Masters; +Cc: Eric W. Biederman, linux-kernel
Hi Jon
> I've just been reviewing LKML for the past month…was there ever any followup?
> Perhaps it's worth discussing a patch to change the below delay?
>
I was about to prepare a patch for this change last Friday, but the
system hung on Saturday, after a few weeks of smooth reboots. The
other unit with reboot=p (as Eric suggested to be the most reliable
way of rebooting) is still going without any issues, so maybe that is
the best workaround for now.
I have now put the first unit in reboot=p, and will keep monitoring
them. I am open to any other suggestions if anyone has any tweaks they
think will help with the KBD reboot.
Regards
Keith
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-03-19 6:45 ` Keith Chew
@ 2012-03-24 1:11 ` Ray Lee
2012-03-28 20:25 ` Keith Chew
0 siblings, 1 reply; 12+ messages in thread
From: Ray Lee @ 2012-03-24 1:11 UTC (permalink / raw)
To: Keith Chew; +Cc: Jon Masters, Eric W. Biederman, linux-kernel
On Sun, Mar 18, 2012 at 11:45 PM, Keith Chew <keith.chew@gmail.com> wrote:
>> I've just been reviewing LKML for the past month…was there ever any followup?
>> Perhaps it's worth discussing a patch to change the below delay?
>>
>
> I was about to prepare a patch for this change last Friday, but the
> system hanged on Saturday, after a few weeks of smooth reboots. The
> other unit with reboot=p (as Eric suggested to be the most reliable
> way of rebooting) is still going without any issues, so maybe that is
> the best workaround for now.
Well, someone else may be hitting this and not know. If the situation
measurably improved by increasing the delay, then it seems like a good
idea to at least upstream that, regardless of whether it's perfect?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Hang on "echo b > /proc/sysrq-trigger"
2012-03-24 1:11 ` Ray Lee
@ 2012-03-28 20:25 ` Keith Chew
0 siblings, 0 replies; 12+ messages in thread
From: Keith Chew @ 2012-03-28 20:25 UTC (permalink / raw)
To: Ray Lee; +Cc: Jon Masters, Eric W. Biederman, linux-kernel
Hi
> Well, someone else may be hitting this and not know. If the situation
> measurably improved by increasing the delay, then it seems like a good
> idea to at least upstream that, regardless if it's perfect?
Please let me carry on with this investigation, before submitting
anything. The reboot=p seems to be really solid, have not had any
issues for over a month. I will be moving back to KBD reboot testing,
and try some more test cases. Will report back soon.
Regards
Keith
^ permalink raw reply [flat|nested] 12+ messages in thread