netdev.vger.kernel.org archive mirror
* v3.0-rc* intermittent network failure: how to debug?
@ 2011-07-21 13:49 Richard Kennedy
  2011-07-21 14:32 ` Francois Romieu
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Kennedy @ 2011-07-21 13:49 UTC (permalink / raw)
  To: netdev

I keep seeing a total network failure on v3.0.0-rc*. It is highly
intermittent, taking anything from 1 hour to 12+ hours to appear, and I
don't have a reliable test case.
When it fails I lose all network comms, but there are no errors in the
system log, no hung tasks reported, nothing. After it fails, though, the
machine hangs during shutdown; it just never turns off. So I guess
something is getting stuck, but I can't find it.

Can you suggest how to find out what's going on?
Or how to collect more information about what's failing?


I'm going to add a serial console and see if that helps.

This is on x86_64 with via_velocity, currently running the latest 3.0.0-rc7.

all suggestions gratefully received

regards
Richard



* Re: v3.0-rc* intermittent network failure: how to debug?
  2011-07-21 13:49 v3.0-rc* intermittent network failure: how to debug? Richard Kennedy
@ 2011-07-21 14:32 ` Francois Romieu
  2011-07-21 15:18   ` Richard Kennedy
  0 siblings, 1 reply; 4+ messages in thread
From: Francois Romieu @ 2011-07-21 14:32 UTC (permalink / raw)
  To: Richard Kennedy; +Cc: netdev

Richard Kennedy <richard@rsk.demon.co.uk> :
> I keep seeing a total network failure on v3.0.0-rc* , it is highly
> intermittent, anything from 1 hour to 12+, and I don't have a reliable
> test case.
> When it fails I lose all network comms, but there are no errors in the
> system log, no hung tasks reported, nothing. But after it fails the
> machine hangs during shutdown, it just never turns off. So I guess
> something is getting stuck but I can't find it.

Assuming the kernel hangs late enough, you can try the "reboot=" kernel
parameter and see if a value in arch/x86/include/asm/emergency-restart.h
makes a difference.

> Can you suggest how to find out what going on? 

Switch into text mode before starting the reboot sequence then send a
magic sysrq T or W ?
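
A minimal sketch of sending the same two SysRq commands from a local shell
instead of the keyboard, by writing the command letter to
/proc/sysrq-trigger. This assumes a kernel built with CONFIG_MAGIC_SYSRQ and
a root shell that still responds after the network has failed. The function
name sysrq_send and its optional second argument are hypothetical, added
only so the helper can be tried against a scratch file.

```shell
#!/bin/sh
# Send one SysRq command letter by writing it to the trigger file.
# The second argument is illustrative only: it defaults to the real
# /proc/sysrq-trigger (root required) but may point at a scratch file.
sysrq_send() {
    key=$1
    trigger=${2:-/proc/sysrq-trigger}
    case "$key" in
        t|w) printf '%s\n' "$key" > "$trigger" ;;
        *)   echo "sysrq_send: unsupported key '$key'" >&2; return 1 ;;
    esac
}

# Typical use on the affected machine, as root:
#   sysrq_send w    # dump tasks in blocked (uninterruptible) state
#   sysrq_send t    # dump all tasks
#   dmesg           # the output lands in the kernel log / serial console
```

The output goes to the kernel log, so a serial console is the easiest way to
capture it in full.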

> I'm going to add a serial console and see if that helps.

It will help, especially with the kilometer long output of sysrq.

> this is on a x86_64, via_velocity currently running 3.0.0-rc7 latest.
> 
> all suggestions gratefully received

The last via-velocity change in mainline dates back to May 25 (see
d10358de8d70aaeb965a974d56e9b72f6c6dbb3a). Were you previously fine
with a recent enough kernel to rule it out ?

-- 
Ueimor


* Re: v3.0-rc* intermittent network failure: how to debug?
  2011-07-21 14:32 ` Francois Romieu
@ 2011-07-21 15:18   ` Richard Kennedy
  2011-07-25 12:01     ` v3.0-rc* intermittent network failure: Test case found! Richard Kennedy
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Kennedy @ 2011-07-21 15:18 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev

On Thu, 2011-07-21 at 16:32 +0200, Francois Romieu wrote:
> Richard Kennedy <richard@rsk.demon.co.uk> :
> > I keep seeing a total network failure on v3.0.0-rc* , it is highly
> > intermittent, anything from 1 hour to 12+, and I don't have a reliable
> > test case.
> > When it fails I lose all network comms, but there are no errors in the
> > system log, no hung tasks reported, nothing. But after it fails the
> > machine hangs during shutdown, it just never turns off. So I guess
> > something is getting stuck but I can't find it.
> 
> Assuming the kernel hangs late enough, you can try the "reboot=" kernel
> parameter and see if a value in arch/x86/include/asm/emergency-restart.h
> makes a difference.
> 
> > Can you suggest how to find out what going on? 
> 
> Switch into text mode before starting the reboot sequence then send a
> magic sysrq T or W ?
> 
> > I'm going to add a serial console and see if that helps.
> 
> It will help, especially with the kilometer long output of sysrq.
> 
> > this is on a x86_64, via_velocity currently running 3.0.0-rc7 latest.
> > 
> > all suggestions gratefully received
> 
> Last via-velocity change in mainline dates back to may 25 (see
> d10358de8d70aaeb965a974d56e9b72f6c6dbb3a). Were you previously fine
> with a recent enough kernel to rule it out ?
> 

Thanks Francois,
I'll try the reboot= tomorrow.

I don't really know when my last known good kernel was. It could be that
via-velocity change, but the problem is so intermittent that it's difficult
to be sure. I've been trying to stress the network to make the problem
happen sooner, but I've had no luck yet.

regards
Richard  



* Re: v3.0-rc* intermittent network failure: Test case found!
  2011-07-21 15:18   ` Richard Kennedy
@ 2011-07-25 12:01     ` Richard Kennedy
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Kennedy @ 2011-07-25 12:01 UTC (permalink / raw)
  To: netdev; +Cc: Francois Romieu

On 21/07/11 16:18, Richard Kennedy wrote:
>> Richard Kennedy<richard@rsk.demon.co.uk>  :
>>> I keep seeing a total network failure on v3.0.0-rc* , it is highly
>>> intermittent, anything from 1 hour to 12+, and I don't have a reliable
>>> test case.
>>> When it fails I lose all network comms, but there are no errors in the
>>> system log, no hung tasks reported, nothing. But after it fails the
>>> machine hangs during shutdown, it just never turns off. So I guess
>>> something is getting stuck but I can't find it.
>>

I have found a reliable test case: I can instantly trigger my problem by
starting 2 instances of rsync at the same time. [this is on x86_64 AMDX2]

e.g.
rsync -a linux-2.6 server:t1 & rsync -a linux-2.6 server:t2 &
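
A minimal sketch of the same test case as a reusable helper: it starts each
command in the background, then waits for all of them and reports whether
any failed. The function name run_concurrent is hypothetical; the rsync
arguments in the usage comment are the ones from the test case above.

```shell
#!/bin/sh
# Run every argument as a separate shell command, all at the same time,
# and return non-zero if any of them exits with a failure status.
run_concurrent() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &          # start in the background
        pids="$pids $!"         # remember its PID for the wait below
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return $status
}

# Usage with the two transfers from the test case:
#   run_concurrent 'rsync -a linux-2.6 server:t1' 'rsync -a linux-2.6 server:t2'
```

Waiting on the PIDs makes it easy to see when both transfers have stopped,
which a bare pair of background jobs does not show.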


If I have a ping running when I trigger the problem, it pauses then 
errors with :-

	ping: sendmsg: No buffer space available

But if I start a ping after, it fails with

...	Destination Host Unreachable
.

I have a serial console attached but don't really understand what it's
telling me.
AFAICT I have no blocked tasks. sysrq w shows :-


SysRq : Show Blocked State
   task                        PC stack   pid father
Sched Debug Version: v0.10, 3.0.0 #46
ktime                                   : 7129717.783042
sched_clk                               : 7126380.221722
cpu_clk                                 : 7129711.544071
jiffies                                 : 4301797008
sched_clock_stable                      : 0
.....[lots more schedule & cpu info]

But now that I've got a reliable test case I can find a last known good
kernel and have a stab at bisecting this, unless anyone has got any better
suggestions?
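
A minimal sketch of how each bisection step could be judged, assuming the
usual manual git bisect workflow (git bisect start, git bisect bad on the
failing kernel, git bisect good on whichever older kernel turns out to
work). After booting a candidate kernel and running the two-rsync test, a
probe command decides the verdict. The function name bisect_verdict and the
ping options in the usage comment are hypothetical.

```shell
#!/bin/sh
# Print "good" if the probe command succeeds and "bad" otherwise.
# The probe is whatever shows the network is still alive after the test,
# for example a short ping to the rsync server.
bisect_verdict() {
    if "$@" >/dev/null 2>&1; then
        echo good
    else
        echo bad
    fi
}

# Usage after the two-rsync test, from the kernel source tree:
#   git bisect "$(bisect_verdict ping -c 3 server)"
```

Because every step needs a reboot into the newly built kernel, the verdict
is recorded by hand rather than through git bisect run.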

regards
Richard



