sch_generic warn_on (timed out)

* sch_generic warn_on (timed out)
@ 2011-07-11 20:48 Dave Jones
  2011-07-11 21:17 ` David Miller
  2011-07-20 16:38 ` Brandeburg, Jesse
  0 siblings, 2 replies; 6+ messages in thread
From: Dave Jones @ 2011-07-11 20:48 UTC (permalink / raw)
  To: netdev

We've received quite a few bug reports in Fedora recently concerning this warning in
sch_generic:

            WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
                      dev->name, netdev_drivername(dev, drivername, 64), i);

https://bugzilla.redhat.com/show_bug.cgi?id=702723 is our 'master bug' that we're
duping all others against. It seems to be showing up on a variety of different
hardware (r8169, atl1c, ipheth, e1000e, 8139too). Do all these drivers need
fixing, or is it just 'crap hardware'?

Note that I've only been looking through Fedora 15 bugs so far (which is still on 2.6.38),
but looking at the commit log for sch_generic, it doesn't seem that there's anything
obvious that needs backporting.

thanks,

	Dave

* Re: sch_generic warn_on (timed out)
  2011-07-11 20:48 sch_generic warn_on (timed out) Dave Jones
@ 2011-07-11 21:17 ` David Miller
  2011-07-11 21:20   ` Eric Dumazet
  2011-07-20 21:53   ` Francois Romieu
  2011-07-20 16:38 ` Brandeburg, Jesse
  1 sibling, 2 replies; 6+ messages in thread
From: David Miller @ 2011-07-11 21:17 UTC (permalink / raw)
  To: davej; +Cc: netdev

From: Dave Jones <davej@redhat.com>
Date: Mon, 11 Jul 2011 16:48:34 -0400

> We've received quite a few bug reports in Fedora recently concerning this warning in
> sch_generic..
> 
>             WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
>                       dev->name, netdev_drivername(dev, drivername, 64), i);
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=702723 is our 'master bug' that we're
> duping all others against. It seems to be showing up on a variety of different
> hardware (r8169, atl1c, ipheth, e1000e, 8139too). Do all these drivers need
> fixing ? or is it just 'crap hardware' ?
> 
> note that I've only been looking through fedora 15 bugs so far (which is still on 2.6.38),
> but looking at the commit log for sch_generic, it doesn't seem that there's anything
> obvious that needs backporting.

It means the transmitter stopped sending packets for several seconds.

I would track this on a per-device basis if I were you, instead of
combining them all into one super-bug.
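
As a rough illustration of what "timed out" means here: the watchdog fires when a queue is stopped and no transmit has started on it within the device's timeout. The following is a minimal userspace sketch of that check, not the kernel code itself. `txq_model`, `tick_after` and `first_timed_out_queue` are hypothetical names, and the plain tick counter stands in for jiffies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of one transmit queue
 * (not the kernel's struct netdev_queue). */
struct txq_model {
    bool stopped;              /* driver stopped the queue, e.g. TX ring full */
    unsigned long trans_start; /* tick at which the last transmit was started */
};

/* Wrap-safe "a is later than b", in the spirit of the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Return the index of the first queue that is stopped and has not started
 * a transmit within `timeo` ticks, or -1 if no queue has timed out.
 */
static int first_timed_out_queue(const struct txq_model *q, size_t nq,
                                 unsigned long now, unsigned long timeo)
{
    assert(q != NULL || nq == 0);
    for (size_t i = 0; i < nq; i++) {
        if (q[i].stopped && tick_after(now, q[i].trans_start + timeo))
            return (int)i;
    }
    return -1;
}
```

With a 500-tick timeout, a stopped queue whose last transmit started at tick 100 is reported from tick 601 onward. A queue that is not stopped is never reported, however old its last transmit is.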

* Re: sch_generic warn_on (timed out)
  2011-07-11 21:17 ` David Miller
@ 2011-07-11 21:20   ` Eric Dumazet
  2011-07-20 21:53   ` Francois Romieu
  1 sibling, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2011-07-11 21:20 UTC (permalink / raw)
  To: David Miller; +Cc: davej, netdev

On Monday, 11 July 2011 at 14:17 -0700, David Miller wrote:
> From: Dave Jones <davej@redhat.com>
> Date: Mon, 11 Jul 2011 16:48:34 -0400
> 
> > We've received quite a few bug reports in Fedora recently concerning this warning in
> > sch_generic..
> > 
> >             WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
> >                       dev->name, netdev_drivername(dev, drivername, 64), i);
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=702723 is our 'master bug' that we're
> > duping all others against. It seems to be showing up on a variety of different
> > hardware (r8169, atl1c, ipheth, e1000e, 8139too). Do all these drivers need
> > fixing ? or is it just 'crap hardware' ?
> > 
> > note that I've only been looking through fedora 15 bugs so far (which is still on 2.6.38),
> > but looking at the commit log for sch_generic, it doesn't seem that there's anything
> > obvious that needs backporting.
> 
> It means the transmitter stopped sending packets for several seconds.
> 
> I would track this on a per-device basis if I were you, instead of
> combining them all into one super-bug.

Last time I took a look (on one r8169 NIC), it wasn't clear whether this could
be a PAUSE problem.

* Re: sch_generic warn_on (timed out)
  2011-07-11 21:17 ` David Miller
  2011-07-11 21:20   ` Eric Dumazet
@ 2011-07-20 21:53   ` Francois Romieu
  1 sibling, 0 replies; 6+ messages in thread
From: Francois Romieu @ 2011-07-20 21:53 UTC (permalink / raw)
  To: davej; +Cc: David Miller, netdev

David Miller <davem@davemloft.net> :
[...]
> > duping all others against. It seems to be showing up on a variety of different
> > hardware (r8169, atl1c, ipheth, e1000e, 8139too). Do all these drivers need
> > fixing ? or is it just 'crap hardware' ?

The r8169 driver has seen driver bugs, poorly supported hardware,
"interesting" hardware, evolving hardware, false alarm reports, crappy
maintainership and almost anything one can think of (as long as you do
not think about public datasheets or errata).

[...]
> I would track this on a per-device basis if I were you, instead of
> combining them all into one super-bug.

+1 +1 +1

One could almost split the r8169 problems further by XID, then eventually
by platform. 

-- 
Ueimor

* Re: sch_generic warn_on (timed out)
  2011-07-11 20:48 sch_generic warn_on (timed out) Dave Jones
  2011-07-11 21:17 ` David Miller
@ 2011-07-20 16:38 ` Brandeburg, Jesse
  2011-07-21 18:56   ` Dave Jones
  1 sibling, 1 reply; 6+ messages in thread
From: Brandeburg, Jesse @ 2011-07-20 16:38 UTC (permalink / raw)
  To: Dave Jones; +Cc: netdev@vger.kernel.org

On Mon, 11 Jul 2011, Dave Jones wrote:
>             WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
>                       dev->name, netdev_drivername(dev, drivername, 64), i);
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=702723 is our 'master bug' that we're
> duping all others against. It seems to be showing up on a variety of different
> hardware (r8169, atl1c, ipheth, e1000e, 8139too). Do all these drivers need
> fixing ? or is it just 'crap hardware' ?

neither, probably

> note that I've only been looking through fedora 15 bugs so far (which is still on 2.6.38),
> but looking at the commit log for sch_generic, it doesn't seem that there's anything
> obvious that needs backporting.

It used to be just a KERN_ERR printk; it was then changed to a WARN_ONCE
in order to trigger kerneloops reports, so that we would know how many
people were getting tx hangs from their hardware.

The bad news is there is never anything useful in the backtrace, besides
which driver it is.  Users have been trained to send backtraces for panic
messages, and in this case the backtrace doesn't help much in identifying
the problem.

If the reports within each driver were able to be traced back to a 
specific *model* of hardware then that might be useful (particularly for 
Intel hardware).  Maybe the WARN_ONCE should print vendor/device pair so 
we would at least know the hardware from the panic trace.
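
To make the vendor/device idea concrete, here is a minimal userspace sketch of what such a message could look like. The `nic_id` struct, the `format_watchdog_msg` helper and the exact placement of the `vendor:device` pair are all assumptions for illustration; this is not a proposed patch and does not use kernel APIs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the few fields the message needs
 * (not a kernel structure). */
struct nic_id {
    const char *ifname;        /* e.g. "eth0" */
    const char *driver;        /* e.g. "r8169" */
    unsigned short pci_vendor; /* PCI vendor ID */
    unsigned short pci_device; /* PCI device ID */
};

/*
 * Format the watchdog message with the PCI vendor:device pair added next
 * to the driver name. Returns what snprintf() returns: the length the
 * full message needs, which is >= len if the output was truncated.
 */
static int format_watchdog_msg(char *buf, size_t len,
                               const struct nic_id *nic, unsigned int queue)
{
    assert(buf && nic);
    return snprintf(buf, len,
                    "NETDEV WATCHDOG: %s (%s %04x:%04x): transmit queue %u timed out",
                    nic->ifname, nic->driver,
                    (unsigned int)nic->pci_vendor,
                    (unsigned int)nic->pci_device, queue);
}
```

For an interface `eth0` driven by `r8169` with the illustrative IDs 0x10ec and 0x8168, queue 0 would produce "NETDEV WATCHDOG: eth0 (r8169 10ec:8168): transmit queue 0 timed out", which is enough to tell hardware models apart from the report alone.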

If this is happening more frequently on F15 than F14 across multiple 
pieces of hardware, it may indicate that a kernel/stack change is starting 
to (ab)use a changed working model that is causing an issue, or that there 
is an actual kernel issue with locks or interrupts or tx completions that 
is causing an excessive delay in completion of transmits.

Dave, can you query for F14 reports and/or isolate which kernel this started
with?

* Re: sch_generic warn_on (timed out)
  2011-07-20 16:38 ` Brandeburg, Jesse
@ 2011-07-21 18:56   ` Dave Jones
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Jones @ 2011-07-21 18:56 UTC (permalink / raw)
  To: Brandeburg, Jesse; +Cc: netdev@vger.kernel.org

On Wed, Jul 20, 2011 at 09:38:32AM -0700, Brandeburg, Jesse wrote:

 > It used to be just a KERN_ERR printk; it was then changed to a WARN_ONCE
 > in order to trigger kerneloops reports, so that we would know how many
 > people were getting tx hangs from their hardware.

That explains the uptick in reports we've had, as most of them
seem to be coming in through abrt (Fedora's kerneloops-like thing).

 > If this is happening more frequently on F15 than F14 across multiple 
 > pieces of hardware, it may indicate that a kernel/stack change is starting 
 > to (ab)use a changed working model that is causing an issue, or that there 
 > is an actual kernel issue with locks or interrupts or tx completions that 
 > is causing an excessive delay in completion of transmits.
 > 
 > Dave can you query for F14 reports and/or isolate what kernel this started 
 > with?

Unfortunately, F14 lagged behind on rebasing for unrelated reasons and is still stuck
on 2.6.35, so we've got quite a delta between that and F15 right now.

A quick search of the F14 bugs didn't turn up any similar reports, but given the 4
versions we jumped, that's probably not too helpful, as the warning is probably still
just a printk in that old codebase.

We have plans in progress to rebase both releases to something more current,
so I'll keep an eye on this to see if it gets worse/better.

thanks,

        Dave
