* v3.0-rc* intermittent network failure: how to debug?
@ 2011-07-21 13:49 Richard Kennedy
2011-07-21 14:32 ` Francois Romieu
0 siblings, 1 reply; 4+ messages in thread
From: Richard Kennedy @ 2011-07-21 13:49 UTC
To: netdev
I keep seeing a total network failure on v3.0.0-rc* , it is highly
intermittent, anything from 1 hour to 12+, and I don't have a reliable
test case.
When it fails I lose all network comms, but there are no errors in the
system log, no hung tasks reported, nothing. But after it fails the
machine hangs during shutdown, it just never turns off. So I guess
something is getting stuck but I can't find it.
Can you suggest how to find out what's going on?
Or how to collect more information as to what's failing?
I'm going to add a serial console and see if that helps.
this is on a x86_64, via_velocity currently running 3.0.0-rc7 latest.
all suggestions gratefully received
regards
Richard
* Re: v3.0-rc* intermittent network failure: how to debug?
2011-07-21 13:49 v3.0-rc* intermittent network failure: how to debug? Richard Kennedy
@ 2011-07-21 14:32 ` Francois Romieu
2011-07-21 15:18 ` Richard Kennedy
0 siblings, 1 reply; 4+ messages in thread
From: Francois Romieu @ 2011-07-21 14:32 UTC
To: Richard Kennedy; +Cc: netdev
Richard Kennedy <richard@rsk.demon.co.uk> :
> I keep seeing a total network failure on v3.0.0-rc* , it is highly
> intermittent, anything from 1 hour to 12+, and I don't have a reliable
> test case.
> When it fails I lose all network comms, but there are no errors in the
> system log, no hung tasks reported, nothing. But after it fails the
> machine hangs during shutdown, it just never turns off. So I guess
> something is getting stuck but I can't find it.
Assuming the kernel hangs late enough, you can try the "reboot=" kernel
parameter and see if a value in arch/x86/include/asm/emergency-restart.h
makes a difference.
> Can you suggest how to find out what's going on?
Switch into text mode before starting the reboot sequence then send a
magic sysrq T or W ?
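If no keyboard is usable at that point, the same dumps can be requested by writing the key letter to /proc/sysrq-trigger. A minimal sketch, assuming root and a kernel built with CONFIG_MAGIC_SYSRQ; the helper name sysrq_dump and the SYSRQ_TRIGGER override are made up for illustration and are not from this thread:

```shell
#!/bin/sh
# Hypothetical helper: request a SysRq dump without a keyboard.
# SYSRQ_TRIGGER is an illustrative override so the function can be
# dry-run against an ordinary file; the real target is /proc/sysrq-trigger.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

sysrq_dump() {
    case "$1" in
        # t: dump all tasks, w: dump blocked (uninterruptible) tasks
        t|w) printf '%s\n' "$1" > "$SYSRQ_TRIGGER" ;;
        *)   echo "usage: sysrq_dump t|w" >&2; return 1 ;;
    esac
}
```

Calling "sysrq_dump w" as root should send the blocked-task report to the kernel log, and to the serial console if one is configured.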
> I'm going to add a serial console and see if that helps.
It will help, especially with the kilometer long output of sysrq.
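For the serial console itself, the usual approach is to add console= entries to the kernel command line. The sketch below assumes the first serial port at 115200 baud, which is a guess since the thread does not name the port or speed; has_serial_console is a hypothetical helper for checking a command line:

```shell
#!/bin/sh
# Kernel command line additions (port and speed are assumptions):
#   console=tty0 console=ttyS0,115200n8
# The last console= entry listed is the one used for /dev/console.

# Hypothetical helper: succeed if a kernel command line names a serial console.
has_serial_console() {
    case " $1 " in
        *" console=ttyS"*) return 0 ;;
        *)                 return 1 ;;
    esac
}
```

On the running machine this could be used as: has_serial_console "$(cat /proc/cmdline)" || echo "no serial console configured"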
> this is on a x86_64, via_velocity currently running 3.0.0-rc7 latest.
>
> all suggestions gratefully received
Last via-velocity change in mainline dates back to May 25 (see
d10358de8d70aaeb965a974d56e9b72f6c6dbb3a). Were you previously fine
with a recent enough kernel to rule it out ?
--
Ueimor
* Re: v3.0-rc* intermittent network failure: how to debug?
2011-07-21 14:32 ` Francois Romieu
@ 2011-07-21 15:18 ` Richard Kennedy
2011-07-25 12:01 ` v3.0-rc* intermittent network failure: Test case found! Richard Kennedy
0 siblings, 1 reply; 4+ messages in thread
From: Richard Kennedy @ 2011-07-21 15:18 UTC
To: Francois Romieu; +Cc: netdev
On Thu, 2011-07-21 at 16:32 +0200, Francois Romieu wrote:
> Richard Kennedy <richard@rsk.demon.co.uk> :
> > I keep seeing a total network failure on v3.0.0-rc* , it is highly
> > intermittent, anything from 1 hour to 12+, and I don't have a reliable
> > test case.
> > When it fails I lose all network comms, but there are no errors in the
> > system log, no hung tasks reported, nothing. But after it fails the
> > machine hangs during shutdown, it just never turns off. So I guess
> > something is getting stuck but I can't find it.
>
> Assuming the kernel hangs late enough, you can try the "reboot=" kernel
> parameter and see if a value in arch/x86/include/asm/emergency-restart.h
> makes a difference.
>
> > Can you suggest how to find out what's going on?
>
> Switch into text mode before starting the reboot sequence then send a
> magic sysrq T or W ?
>
> > I'm going to add a serial console and see if that helps.
>
> It will help, especially with the kilometer long output of sysrq.
>
> > this is on a x86_64, via_velocity currently running 3.0.0-rc7 latest.
> >
> > all suggestions gratefully received
>
> Last via-velocity change in mainline dates back to May 25 (see
> d10358de8d70aaeb965a974d56e9b72f6c6dbb3a). Were you previously fine
> with a recent enough kernel to rule it out ?
>
Thanks Francois,
I'll try the reboot= tomorrow.
I don't really know what my last known good kernel was; it could be that
via-velocity change, but the problem is so intermittent it's difficult
to be sure. I've been trying to stress the network to make the problem
happen sooner but I've had no luck yet.
regards
Richard
* Re: v3.0-rc* intermittent network failure: Test case found!
2011-07-21 15:18 ` Richard Kennedy
@ 2011-07-25 12:01 ` Richard Kennedy
0 siblings, 0 replies; 4+ messages in thread
From: Richard Kennedy @ 2011-07-25 12:01 UTC
To: netdev; +Cc: Francois Romieu
On 21/07/11 16:18, Richard Kennedy wrote:
>> Richard Kennedy<richard@rsk.demon.co.uk> :
>>> I keep seeing a total network failure on v3.0.0-rc* , it is highly
>>> intermittent, anything from 1 hour to 12+, and I don't have a reliable
>>> test case.
>>> When it fails I lose all network comms, but there are no errors in the
>>> system log, no hung tasks reported, nothing. But after it fails the
>>> machine hangs during shutdown, it just never turns off. So I guess
>>> something is getting stuck but I can't find it.
>>
I have found a reliable test case: I can instantly trigger my problem by
starting 2 instances of rsync at the same time. [this is on x86_64 AMDX2]
e.g.
rsync -a linux-2.6 server:t1 & rsync -a linux-2.6 server:t2 &
If I have a ping running when I trigger the problem, it pauses then
errors with :-
ping: sendmsg: No buffer space available
But if I start a ping afterwards, it fails with
... Destination Host Unreachable
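The reproduction can be wrapped in a small script so each run is identical and the ping output is kept for later. This is only a sketch: "server", "linux-2.6", t1 and t2 come from the command above, while the function name repro, the log file name and the RSYNC/PING overrides (for dry runs) are illustrative assumptions:

```shell
#!/bin/sh
# Sketch: log pings while two rsync transfers are started together.
# RSYNC and PING are hypothetical overrides used only for dry runs.
repro() {
    host="${1:-server}"
    log="${2:-ping.log}"

    # Background ping, so its errors are captured when the network stalls.
    "${PING:-ping}" "$host" > "$log" 2>&1 &
    ping_pid=$!

    # Start both transfers at (nearly) the same time.
    "${RSYNC:-rsync}" -a linux-2.6 "$host:t1" &
    r1=$!
    "${RSYNC:-rsync}" -a linux-2.6 "$host:t2" &
    r2=$!

    # May block for a long time once the network has failed.
    wait "$r1" "$r2" || true

    kill "$ping_pid" 2>/dev/null || true
    wait "$ping_pid" 2>/dev/null || true
    return 0
}
```

After "repro server ping.log", the log can be searched for "No buffer space available" to confirm that the failure was hit.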
I have a serial console attached but don't really understand what it's
telling me.
AFAICT -- I have no blocked tasks - sysrq w shows :-
SysRq : Show Blocked State
task PC stack pid father
Sched Debug Version: v0.10, 3.0.0 #46
ktime : 7129717.783042
sched_clk : 7126380.221722
cpu_clk : 7129711.544071
jiffies : 4301797008
sched_clock_stable : 0
.....[lots more schedule & cpu info]
But now that I've got a reliable test case, I can find a last known good kernel
and have a stab at bisecting this, unless anyone has got any better
suggestions?
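A manual bisect along those lines might start as sketched below. The good revision is a placeholder only: v2.6.39 is assumed for illustration, since the last known good kernel has not been established yet, and v3.0-rc7 stands in for the failing tree. The function name start_bisect and the GIT override (for dry runs) are likewise hypothetical:

```shell
#!/bin/sh
# Sketch: begin a manual bisect between an assumed good tag and the bad rc.
# Both default revisions are placeholders, not results from this thread.
start_bisect() {
    good="${1:-v2.6.39}"
    bad="${2:-v3.0-rc7}"
    "${GIT:-git}" bisect start &&
    "${GIT:-git}" bisect bad "$bad" &&
    "${GIT:-git}" bisect good "$good"
}

# After building and booting each kernel git checks out, run the
# two-rsync test and report the result before the next step:
#   git bisect bad     (network died)
#   git bisect good    (both transfers completed)
```

Because every step needs a reboot into the new kernel, this stays a manual loop rather than something "git bisect run" can drive unattended.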
regards
Richard