netdev.vger.kernel.org archive mirror
* Re: Hardware bug or kernel bug?
From: David Johnson @ 2006-10-13 16:24 UTC
  To: Jarek Poplawski; +Cc: Linux Kernel, netdev

On Friday 13 October 2006 14:06, Jarek Poplawski wrote:
>
> Probably - but only with networking. So I'd try with this debugging
> like in my first reply plus maybe 2.6.19-rc1 (e1000 - btw. I hope
> this other tested card was different model - and locking improved)
> and resend conclusions to netdev@vger.kernel.org.
>

OK, I built a 2.6.19-rc1 kernel with a minimal config as you described, and I
cannot reproduce the reboots with this kernel. My .config:
http://www.david-web.co.uk/download/config

The other NIC I tried was a D-Link DL10050-based card which I think uses the 
dl2k module.

I tried to reproduce the problem under Windows (2k), which didn't reboot but,
I believe, still suffers from it. Randomly during an scp transfer (using
the PuTTY scp client) Windows locks up for about 30 seconds, makes an
entry in the event log indicating a time-out talking to the
IDE controller, then continues. Could the same thing be happening in Linux?
If Linux can't talk to the IDE controller when trying to write to disk, how
does it handle that?

Regards,
David.


* Re: Hardware bug or kernel bug?
From: Alan Cox @ 2006-10-13 17:11 UTC
  To: David Johnson; +Cc: Jarek Poplawski, Linux Kernel, netdev

On Fri, 2006-10-13 at 17:24 +0100, David Johnson wrote:
> IDE controller, then continuing. Could the same thing be happening in Linux? 
> If Linux can't talk to the IDE controller when trying to write to disk, how 
> does it handle that?

It will time out and then retry the command. It's not an ideal
situation to end up in, but I'd expect to see a DMA timeout and a retry
or two in the log, not a crash.



* Re: Hardware bug or kernel bug?
From: Jarek Poplawski @ 2006-10-16 10:25 UTC
  To: David Johnson; +Cc: Linux Kernel, netdev

On Fri, Oct 13, 2006 at 05:24:39PM +0100, David Johnson wrote:
> On Friday 13 October 2006 14:06, Jarek Poplawski wrote:
> >
> > Probably - but only with networking. So I'd try with this debugging
> > like in my first reply plus maybe 2.6.19-rc1 (e1000 - btw. I hope
> > this other tested card was different model - and locking improved)
> > and resend conclusions to netdev@vger.kernel.org.
> >
> 
> OK I built a 2.6.19-rc1 kernel with a minimal config as you describe and I 
> cannot reproduce the reboots with this kernel. My .config:
> http://www.david-web.co.uk/download/config

I've seen more minimal "minimal" configs, but if it works
that's 50% of the success.

> The other NIC I tried was a D-Link DL10050-based card which I think uses the 
> dl2k module.
> 
> I tried to reproduce the problem under Windows (2k), which didn't reboot but 
> did still suffer from it I believe. Randomly during an scp transfer (using 
> the PuTTY scp client) Windows will lock-up for about 30 seconds, making an 
> entry in the event log indicating that there was a time-out talking to the 
> IDE controller, then continuing. Could the same thing be happening in Linux? 
> If Linux can't talk to the IDE controller when trying to write to disk, how 
> does it handle that?

Was this lock-up effect visible during the above 2.6.19-rc1 tests?
If not, I'd try to continue Linux debugging:
- is 2.6.19-rc1 working with a "normal" config (use make oldconfig
to "upgrade" the .config),
- is 2.6.17 working with the "minimal" config (use make oldconfig),
- changing one or two options at a time, try to find which one makes
the effect return (acpi, smp...).
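
The option-by-option search can be narrowed by first listing which
options differ between the working and the failing config. A minimal
Python sketch, assuming standard .config syntax ("CONFIG_FOO=y" and
"# CONFIG_FOO is not set"); the function names are hypothetical and
not part of any kernel tooling:

```python
def parse_config(text):
    """Map option name -> value from the text of a kernel .config file."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            # Disabled options are recorded as comments; treat them as "n".
            name = line[2:-len(" is not set")]
            options[name] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def config_diff(minimal_text, normal_text):
    """Return {option: (minimal_value, normal_value)} for options that differ.

    An option missing from one file is treated as "n" (an assumption made
    here so that absent and disabled options compare equal).
    """
    minimal = parse_config(minimal_text)
    normal = parse_config(normal_text)
    diff = {}
    for name in sorted(set(minimal) | set(normal)):
        a = minimal.get(name, "n")
        b = normal.get(name, "n")
        if a != b:
            diff[name] = (a, b)
    return diff
```

Feeding it the contents of the two .config files (for example via
open(path).read()) gives the candidate options to toggle one or two
at a time.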

Regards,
Jarek P.

PS: Sorry for the late reply - I was offline.


* Re: Hardware bug or kernel bug?
From: David Johnson @ 2006-10-16 14:32 UTC
  To: Jarek Poplawski; +Cc: Linux Kernel, netdev

On Monday 16 October 2006 11:25, Jarek Poplawski wrote:
>
> Was this lock-up effect visible during above 2.6.19-rc1 tests?

No, I've not seen anything in Linux other than the reboots, which are instant 
without any preceding lock-up.

> If not I'd try to continue linux debugging:
> - is 2.6.19-rc1 working with "normal" config (use make oldconfig
> to "upgrade" .config),

With 2.6.19-rc1 and a normal config, I get the reboots as usual.

> - is 2.6.17 working with "minimal" config (use make oldconfig),

Yes.

> - changing one or two options at a time try to find which one makes
> the effect return (acpi, smp...).

I've found the culprit - CPU Frequency Scaling.
With it enabled I get the reboots, with it disabled I don't. That's the same
with every kernel version I've tried (2.6.19-rc1+rc2, 2.6.17.13 & Centos'
2.6.9). The system was using the p4-clockmod driver and the ondemand governor.

I'm still not sure exactly what the problem is - the reboots only happen in 
the circumstances I've mentioned and are not triggered by changes in clock 
speed alone - but disabling cpufreq seems to make it go away...
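
One way to confirm which cpufreq driver and governor are active on a
given machine is to read them from sysfs. A minimal Python sketch,
assuming the usual /sys/devices/system/cpu/cpuN/cpufreq layout; the
function name and the sysfs_root parameter are hypothetical, the latter
added only so the code can be pointed at a test directory:

```python
from pathlib import Path


def read_cpufreq_policy(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return the cpufreq driver and governor for one CPU.

    Returns None when the cpufreq directory is absent, which is what one
    would expect with CPU frequency scaling disabled in the kernel.
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    if not base.is_dir():
        return None
    policy = {}
    for attr in ("scaling_driver", "scaling_governor"):
        path = base / attr
        # A missing attribute file is reported as None rather than raising.
        policy[attr] = path.read_text().strip() if path.exists() else None
    return policy
```

On the setup described here the result would presumably name
p4-clockmod and ondemand; with cpufreq disabled it should be None.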

Thanks for your help,
David.


* Re: Hardware bug or kernel bug?
From: Jarek Poplawski @ 2006-10-17  7:10 UTC
  To: David Johnson; +Cc: Linux Kernel, netdev

On Mon, Oct 16, 2006 at 03:32:38PM +0100, David Johnson wrote:
...
> I've found the culprit - CPU Frequency Scaling.
> With it enabled I get the reboots, with it disabled I don't. That's the same 
> with every kernel version I've tried (2.6.19-rc1+rc2, 2.6.17.13 & Centos'
> 2.6.9). The system was using the p4-clockmod driver and the ondemand governor.
> 
> I'm still not sure exactly what the problem is - the reboots only happen in 
> the circumstances I've mentioned and are not triggered by changes in clock 
> speed alone - but disabling cpufreq seems to make it go away...

I see you devoted a lot of work and time to this testing,
and it will surely help people who read this to diagnose
similar problems. I think it could be even more valuable
if you'd try (after some rest!) enabling "Enable CPUfreq
debugging" and adding cpufreq.debug=<value> to the kernel
command line (according to the help screen), to see whether
that returns any error messages that could be sent to
bugzilla and/or the cpufreq maintainer.
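
As a rough illustration of how <value> is composed: as far as I recall
the help text for this option in 2.6-era kernels, it is a sum of flags
(1 for the cpufreq core, 2 for drivers, 4 for governors). Treat those
numbers as an assumption and check them against the help screen of
your own tree. The helper below is hypothetical, a Python sketch only:

```python
# Flag values as recalled from the CPU_FREQ_DEBUG help text (assumption;
# verify against the help screen of the kernel tree actually in use).
CORE = 1
DRIVERS = 2
GOVERNORS = 4


def cpufreq_debug_arg(core=False, drivers=False, governors=False):
    """Build the cpufreq.debug=<value> kernel command line argument."""
    value = 0
    if core:
        value |= CORE
    if drivers:
        value |= DRIVERS
    if governors:
        value |= GOVERNORS
    return f"cpufreq.debug={value}"
```

For a p4-clockmod plus ondemand setup, drivers and governors would be
the natural pair to enable first.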

Best regards,

Jarek P.

