2.6.19-rc1, timebomb?

public inbox for linux-kernel@vger.kernel.org

* 2.6.19-rc1, timebomb?
@ 2006-10-20  5:30 Gene Heskett
  2006-10-21  4:22 ` Chris Largret
  0 siblings, 1 reply; 9+ messages in thread
From: Gene Heskett @ 2006-10-20  5:30 UTC (permalink / raw)
  To: Linux Kernel

Greetings;

I just arrived home a few hours ago, and my wife said the outside lights 
hadn't worked for the last 2 days.

I came in to check, and this machine, which runs some heyu scripts to do
this, was powered down.  So I powered it back up and it had to e2fsck
everything.  I have a UPS with a fresh battery which passes the tests just
fine.

The only thing in the logs is a single line about eth0 being down:
Oct 17 05:31:11 coyote kernel: eth0: link down.
Oct 19 20:37:49 coyote syslogd 1.4.1: restart.

Uptime when this occurred was about 9 days.  Was this a known problem?

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.
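
The two log lines quoted above bracket the outage, so the length of the silence can be worked out directly. A minimal Python sketch, assuming classic syslog timestamps (which omit the year, so it has to be supplied); the function names are made up for illustration:

```python
from datetime import datetime

def parse_syslog_time(line, year):
    """Parse the 'Mon DD HH:MM:SS' prefix of a classic syslog line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def largest_gap(lines, year):
    """Return (gap, line_before, line_after) for the longest silence."""
    times = [(parse_syslog_time(l, year), l) for l in lines if l.strip()]
    best = None
    for (t0, l0), (t1, l1) in zip(times, times[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

# The two lines from the report above.
log = [
    "Oct 17 05:31:11 coyote kernel: eth0: link down.",
    "Oct 19 20:37:49 coyote syslogd 1.4.1: restart.",
]
gap, before, after = largest_gap(log, 2006)
print(gap)  # 2 days, 15:06:38
```

Run over a whole messages file, the same function would point at the last thing logged before each unexplained powerdown.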

* Re: 2.6.19-rc1, timebomb?
  2006-10-20  5:30 2.6.19-rc1, timebomb? Gene Heskett
@ 2006-10-21  4:22 ` Chris Largret
  2006-10-21  4:37   ` Gene Heskett
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Largret @ 2006-10-21  4:22 UTC (permalink / raw)
  To: Gene Heskett; +Cc: Linux Kernel

On Fri, 20 Oct 2006 01:30:44 -0400
Gene Heskett <gene.heskett@verizon.net> wrote:

> Greetings;
> 
> I just arrived home a few hours ago, and my wife said the outside lights 
> hadn't worked for the last 2 days.
> 
> I came in to check, and this machine, which runs some heyu scripts to do
> this, was powered down.  So I powered it back up and it had to e2fsck
> everything.  I have a UPS with a fresh battery which passes the tests just
> fine.
> 
> The only thing in the logs is a single line about eth0 being down:
> Oct 17 05:31:11 coyote kernel: eth0: link down.
> Oct 19 20:37:49 coyote syslogd 1.4.1: restart.
> 
> Uptime when this occurred was about 9 days.  Was this a known problem?

Out of curiosity, did you check the UPS logs? The low- (and mid- ?)
range ones I've played with have logs as well as the ability to tell
the computer when there is a power problem. I'd check those logs and
also look in the system BIOS for a way to power the computer back on
when power returns. If it was powered off, I don't believe it would be
kernel-related.

I could always be wrong, but from my own experiences kernel problems
result in a system that is on but not operational.

-- 
Chris Largret <http://www.largret.com>

* Re: 2.6.19-rc1, timebomb?
  2006-10-21  4:22 ` Chris Largret
@ 2006-10-21  4:37   ` Gene Heskett
  2006-10-21  5:03     ` Chris Wedgwood
  2006-10-21 17:25     ` Andi Kleen
  0 siblings, 2 replies; 9+ messages in thread
From: Gene Heskett @ 2006-10-21  4:37 UTC (permalink / raw)
  To: linux-kernel; +Cc: Chris Largret

On Saturday 21 October 2006 00:22, Chris Largret wrote:
>On Fri, 20 Oct 2006 01:30:44 -0400
>Gene Heskett <gene.heskett@verizon.net> wrote:
>> Greetings;
>>
>> I just arrived home a few hours ago, and my wife said the outside
>> lights hadn't worked for the last 2 days.
>>
>> I came in to check, and this machine, which runs some heyu scripts to
>> do this, was powered down.  So I powered it back up and it had to e2fsck
>> everything.  I have a UPS with a fresh battery which passes the tests
>> just fine.
>>
>> The only thing in the logs is a single line about eth0 being down:
>> Oct 17 05:31:11 coyote kernel: eth0: link down.
>> Oct 19 20:37:49 coyote syslogd 1.4.1: restart.
>>
>> Uptime when this occurred was about 9 days.  Was this a known problem?
>
>Out of curiosity, did you check the UPS logs? The low- (and mid- ?)
>range ones I've played with have logs as well as the ability to tell
>the computer when there is a power problem. I'd check those logs and
>also look in the system BIOS for a way to power the computer back on
>when power returns. If it was powered off, I don't believe it would be
>kernel-related.
>
Yes, they were clean.  It's a 1500VA Belkin, not exactly a small UPS.

>I could always be wrong, but from my own experiences kernel problems
>result in a system that is on but not operational.

ISTR that was the second time an un-logged powerdown has been done since
that kernel became the default.  For all practical purposes, it's the
equivalent of tapping the hard reset button and, before it can start to
reboot, the 4 second powerdown expires and things get real quiet.

I guess I'm 'waiting for the other shoe to drop'.  Until that time,
everything seems normal.  But I did just note that 'fam' is using up to
99.3% of the cpu, which is unusual considering that amanda is also
running, and it's usually gtar that's the hog.  This is according to htop.

That doesn't seem to be what I'd expect to see, that's for sure.  Even
weirder, I just used htop to send it a SIGHUP and it's now gone.  WTF??  Me
wanders off for some sleep while the real brains ponder that one.

-- 
Cheers, Gene

* Re: 2.6.19-rc1, timebomb?
  2006-10-21  4:37   ` Gene Heskett
@ 2006-10-21  5:03     ` Chris Wedgwood
  2006-10-21  6:08       ` Gene Heskett
  2006-10-21 17:25     ` Andi Kleen
  1 sibling, 1 reply; 9+ messages in thread
From: Chris Wedgwood @ 2006-10-21  5:03 UTC (permalink / raw)
  To: Gene Heskett; +Cc: linux-kernel, Chris Largret

On Sat, Oct 21, 2006 at 12:37:56AM -0400, Gene Heskett wrote:

> I guess I'm 'waiting for the other shoe to drop'.  Until that time,
> everything seems normal.  But I did just note that 'fam' is using up
> to 99.3% of the cpu, which is unusual considering that amanda is
> also running, and it's usually gtar that's the hog.  This is according
> to htop.

I've had a few spontaneous restarts (which actually might have been
shutdowns; any key press will wake the machine up, so a power down when
working would probably look like a restart).

I've assumed these were heat related, mostly because they also
occurred when the CPU was working hard and the weather has been pretty
warm lately.

* Re: 2.6.19-rc1, timebomb?
  2006-10-21  5:03     ` Chris Wedgwood
@ 2006-10-21  6:08       ` Gene Heskett
  2006-10-21 15:10         ` Gene Heskett
  0 siblings, 1 reply; 9+ messages in thread
From: Gene Heskett @ 2006-10-21  6:08 UTC (permalink / raw)
  To: linux-kernel

On Saturday 21 October 2006 01:03, Chris Wedgwood wrote:
>On Sat, Oct 21, 2006 at 12:37:56AM -0400, Gene Heskett wrote:
>> I guess I'm 'waiting for the other shoe to drop'.  Until that time,
>> everything seems normal.  But I did just note that 'fam' is using up
>> to 99.3% of the cpu, which is unusual considering that amanda is
>> also running, and it's usually gtar that's the hog.  This is according
>> to htop.
>
>I've had a few spontaneous restarts (which actually might have been
>shutdowns; any key press will wake the machine up, so a power down when
>working would probably look like a restart).
>
>I've assumed these were heat related, mostly because they also
>occurred when the CPU was working hard and the weather has been pretty
>warm lately.

These may be related.  But I'm not convinced weather has anything to do
with it.  The cpu is running about 120F, and is busier by quite a few
processes than it was when the last failure occurred.

The 'fam' that was using 99.3% of the cpu, and which disappeared when I
sent it a SIGHUP, has not returned, and amanda has completed her nightly
chores without any hiccups.  It was not started as a service and is
unknown when I ask for a status report on it.  So I'm wondering just where
it fits in the grand scheme of things?

-- 
Cheers, Gene

* Re: 2.6.19-rc1, timebomb?
  2006-10-21  6:08       ` Gene Heskett
@ 2006-10-21 15:10         ` Gene Heskett
  0 siblings, 0 replies; 9+ messages in thread
From: Gene Heskett @ 2006-10-21 15:10 UTC (permalink / raw)
  To: linux-kernel

On Saturday 21 October 2006 02:08, Gene Heskett wrote:
>On Saturday 21 October 2006 01:03, Chris Wedgwood wrote:
>>On Sat, Oct 21, 2006 at 12:37:56AM -0400, Gene Heskett wrote:
>>> I guess I'm 'waiting for the other shoe to drop'.  Until that time,
>>> everything seems normal.  But I did just note that 'fam' is using up
>>> to 99.3% of the cpu, which is unusual considering that amanda is
>>> also running, and it's usually gtar that's the hog.  This is according
>>> to htop.
>>
>>I've had a few spontaneous restarts (which actually might have been
>>shutdowns; any key press will wake the machine up, so a power down when
>>working would probably look like a restart).
>>
>>I've assumed these were heat related, mostly because they also
>>occurred when the CPU was working hard and the weather has been pretty
>>warm lately.
>
>These may be related.  But I'm not convinced weather has anything to do
>with it.  The cpu is running about 120F, and is busier by quite a few
>processes than it was when the last failure occurred.
>
>The 'fam' that was using 99.3% of the cpu, and which disappeared when I
>sent it a SIGHUP, has not returned, and amanda has completed her nightly
>chores without any hiccups.  It was not started as a service and is
>unknown when I ask for a status report on it.  So I'm wondering just where
>it fits in the grand scheme of things?
>
Further addendum:  Another shutdown this morning, and the only line in the 
log is the 3rd one here:

Oct 21 07:42:18 coyote kernel: usb 3-2.1: reset low speed USB device using ohci_hcd and address 3
Oct 21 07:51:39 coyote kernel: usb 3-2.1: reset low speed USB device using ohci_hcd and address 3
Oct 21 08:01:01 coyote kernel: eth0: link down. <<------<<<
Oct 21 10:53:12 coyote syslogd 1.4.1: restart.

That's a Microsoft wireless mouse it keeps resetting, and I hadn't been in
the room in 2+ hours.  My logs are littered with that message, but the
mouse itself works fine.  I'd estimate that the mouse reset is over 90% of
the content of my messages logs, and has been since back around 2.6.16
days.  The batteries are fine.

I'm back to 2.6.18, but will build 2.6.19-rc2 shortly.
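
The "over 90%" figure is an estimate; it could be checked by counting how many log lines contain the reset message. A rough Python sketch, where `reset_share` and the default search string are illustrative choices and the sample is just the four lines shown above (unwrapped), not a real messages file:

```python
def reset_share(lines, needle="reset low speed USB device"):
    """Fraction of non-empty log lines that contain the needle."""
    lines = [l for l in lines if l.strip()]
    if not lines:
        return 0.0
    hits = sum(needle in l for l in lines)
    return hits / len(lines)

sample = [
    "Oct 21 07:42:18 coyote kernel: usb 3-2.1: reset low speed USB device using ohci_hcd and address 3",
    "Oct 21 07:51:39 coyote kernel: usb 3-2.1: reset low speed USB device using ohci_hcd and address 3",
    "Oct 21 08:01:01 coyote kernel: eth0: link down.",
    "Oct 21 10:53:12 coyote syslogd 1.4.1: restart.",
]
share = reset_share(sample)
print(f"{share:.0%} of lines are mouse resets")  # 50% of lines are mouse resets
```

On the real log the input would be the lines of the messages file, read with something like `open(path).read().splitlines()`.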

-- 
Cheers, Gene

* Re: 2.6.19-rc1, timebomb?
  2006-10-21  4:37   ` Gene Heskett
  2006-10-21  5:03     ` Chris Wedgwood
@ 2006-10-21 17:25     ` Andi Kleen
  2006-10-22  0:11       ` Gene Heskett
  1 sibling, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2006-10-21 17:25 UTC (permalink / raw)
  To: Gene Heskett; +Cc: Chris Largret, linux-kernel

Gene Heskett <gene.heskett@verizon.net> writes:
> 
> ISTR that was the second time an un-logged powerdown has been done since 
> that kernel became the default.  

It might be overheating. During a critical overheat condition the
ACPI code will just power off. It should still get console messages
out (but nothing on disk), so if you configure serial or net console
you would see a message.

And check your fans are ok.

-Andi
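
A net console needs something on a second machine to catch the messages. netconsole sends kernel messages as plain UDP datagrams, so a small listener is enough. A minimal Python sketch, assuming the sending box is configured to target this host on UDP port 6666; the log file name, the function names, and the `max_packets` parameter are all hypothetical:

```python
import socket
from datetime import datetime

def format_record(payload, sender):
    """Turn one UDP datagram into a timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    stamp = datetime.now().isoformat(timespec="seconds")
    return f"{stamp} {sender} {text}"

def listen(host="0.0.0.0", port=6666, logfile="netconsole.log", max_packets=None):
    """Append every received datagram to logfile, flushing after each one."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(logfile, "a") as out:
        sock.bind((host, port))
        seen = 0
        # max_packets=None means run until interrupted.
        while max_packets is None or seen < max_packets:
            payload, (addr, _port) = sock.recvfrom(65535)
            out.write(format_record(payload, addr) + "\n")
            out.flush()
            seen += 1

# On the receiving machine, call listen() and leave it running.
```

Since the listener runs on a machine that stays up, the last lines it records before the sender goes quiet are the ones that never reached the sender's disk.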

* Re: 2.6.19-rc1, timebomb?
  2006-10-21 17:25     ` Andi Kleen
@ 2006-10-22  0:11       ` Gene Heskett
  2006-10-22 11:20         ` WAS Re: 2.6.19-rc1, timebomb?, now -rc2 progress Gene Heskett
  0 siblings, 1 reply; 9+ messages in thread
From: Gene Heskett @ 2006-10-22  0:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, Chris Largret

On Saturday 21 October 2006 13:25, Andi Kleen wrote:
>Gene Heskett <gene.heskett@verizon.net> writes:
>> ISTR that was the second time an un-logged powerdown has been done
>> since that kernel became the default.
>
>It might be overheating. During a critical overheat condition the
>ACPI code will just power off. It should still get console messages
>out (but nothing on disk), so if you configure serial or net console
>you would see a message.
>
>And check your fans are ok.
>
>-Andi

Thanks Andi, but heating isn't a problem that I'm aware of.  I'm no longer
running a seti client since they moved it all to BOINC & refused to set
priorities to reasonable values.  Cpu temps are pretty steady at 120F.

I tried to build and boot to 2.6.19-rc2 twice today, but each time it fails
at the initrd read phase, saying no (mutter) or cpio magic.  And this is
with exactly the same command line as always generating the initrd and
then copying it to the /boot partition.  This works well for 2.6.18, which
I just rebuilt after having discovered I'd lost the himem magic somehow.

In fact, that's the 2.6.18 I'm running on right now.  If I get a decent
uptime here, then I'll be pretty well convinced it's something in
2.6.19-rc1 that's doing it.

I haven't tried to set up a serial console because both serial ports on this
box are already busy with other things.  I could free a serial port if
someone could tell me how to make the Bulldog UPS monitoring software from
Belkin use a USB port instead.  Anyone have a clue to share on that
subject?
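
On the initrd failure: one quick sanity check is whether the image file itself starts with a signature the kernel could recognize. A hedged Python sketch that looks only for the gzip magic bytes and the ASCII "newc" cpio magic; the function names are illustrative, and other image formats are deliberately not covered:

```python
GZIP_MAGIC = b"\x1f\x8b"
CPIO_NEWC_MAGIC = (b"070701", b"070702")  # newc, and newc with checksum

def classify_image(head):
    """Classify an image by its first few bytes."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head[:6] in CPIO_NEWC_MAGIC:
        return "cpio (newc)"
    return "unrecognized"

def classify_file(path):
    """Read just enough of the file to classify it."""
    with open(path, "rb") as f:
        return classify_image(f.read(6))

print(classify_image(b"\x1f\x8b\x08\x00\x00\x00"))  # gzip
print(classify_image(b"070701"))                    # cpio (newc)
```

This only tests the file. If the image passes and the boot still fails at the same point, the kernel configuration is the next place to look.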

-- 
Cheers, Gene

* WAS Re: 2.6.19-rc1, timebomb?, now -rc2 progress
  2006-10-22  0:11       ` Gene Heskett
@ 2006-10-22 11:20         ` Gene Heskett
  0 siblings, 0 replies; 9+ messages in thread
From: Gene Heskett @ 2006-10-22 11:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, Chris Largret

On Saturday 21 October 2006 20:11, Gene Heskett wrote:
>On Saturday 21 October 2006 13:25, Andi Kleen wrote:
>>Gene Heskett <gene.heskett@verizon.net> writes:
>>> ISTR that was the second time an un-logged powerdown has been done
>>> since that kernel became the default.
>>
>>It might be overheating. During a critical overheat condition the
>>ACPI code will just power off. It should still get console messages
>>out (but nothing on disk), so if you configure serial or net console
>>you would see a message.
>>
>>And check your fans are ok.
>>
>>-Andi
>
>Thanks Andi, but heating isn't a problem that I'm aware of.  I'm no longer
>running a seti client since they moved it all to BOINC & refused to set
>priorities to reasonable values.  Cpu temps are pretty steady at 120F.
>
>I tried to build and boot to 2.6.19-rc2 twice today, but each time it
> fails at the initrd read phase, saying no (mutter) or cpio magic.  And
> this is with exactly the same command line as always generating the
> initrd and then copying it to the /boot partition.  This works well for
> 2.6.18, which I just rebuilt after having discovered I'd lost the himem
> magic somehow.

Someplace along the line, either a make oldconfig screwed up, or the .config
chain of succession my scripts use got totally fubared when I was trying
to build 19-rc2.

After 3 more rebuilds to add stuff like emu10k1 & the RAMFS bits, -rc2 has 
now booted.  So now we wait for the other shoe to drop & see if the auto 
powerdowns persist.
[...]
>I haven't tried to set up a serial console because both serial ports on
> this box are already busy with other things.  I could free a serial port
> if someone could tell me how to make the Bulldog UPS monitoring software
> from Belkin use a USB port instead.  Anyone have a clue to share on that
> subject?

-- 
Cheers, Gene
