2.6.38 dev_watchdog WARNING

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* 2.6.38 dev_watchdog WARNING
@ 2011-04-19 17:40 Tim Gardner
  2011-04-19 18:20 ` Ben Greear
  2011-04-19 18:49 ` Ben Hutchings
  0 siblings, 2 replies; 5+ messages in thread
From: Tim Gardner @ 2011-04-19 17:40 UTC (permalink / raw)
  To: netdev

I'm seeing a lot of these kinds of bugs: WARNING: at 
/build/buildd/linux-2.6.38/net/sched/sch_generic.c:256 
dev_watchdog+0x213/0x220()

The kernel is 2.6.38.2 plus Ubuntu cruft.

A spot check of the 200+ hits on this string indicates they are 
primarily due to these drivers:

ipheth
atl1c
sis900
r8169

As far as I can tell the warning happens when link is down on the media 
(and has never been link UP) and are sent a transmit packet which never 
completes. Is there a net/core or net/sched requirement to which these 
drivers do not conform ? Are they not correctly indicating link status?

rtg
-- 
Tim Gardner tim.gardner@canonical.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: 2.6.38 dev_watchdog WARNING
  2011-04-19 17:40 2.6.38 dev_watchdog WARNING Tim Gardner
@ 2011-04-19 18:20 ` Ben Greear
  2011-04-19 18:49 ` Ben Hutchings
  1 sibling, 0 replies; 5+ messages in thread
From: Ben Greear @ 2011-04-19 18:20 UTC (permalink / raw)
  To: tim.gardner; +Cc: netdev

On 04/19/2011 10:40 AM, Tim Gardner wrote:
> I'm seeing a lot of these kinds of bugs: WARNING: at
> /build/buildd/linux-2.6.38/net/sched/sch_generic.c:256
> dev_watchdog+0x213/0x220()
>
> The kernel is 2.6.38.2 plus Ubuntu cruft.
>
> A spot check of the 200+ hits on this string indicates they are
> primarily due to these drivers:
>
> ipheth
> atl1c
> sis900
> r8169
>
> As far as I can tell the warning happens when link is down on the media
> (and has never been link UP) and are sent a transmit packet which never
> completes. Is there a net/core or net/sched requirement to which these
> drivers do not conform ? Are they not correctly indicating link status?

r8169 doesn't report link right for me...

No idea about the rest.

Ben

>
> rtg


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: 2.6.38 dev_watchdog WARNING
  2011-04-19 17:40 2.6.38 dev_watchdog WARNING Tim Gardner
  2011-04-19 18:20 ` Ben Greear
@ 2011-04-19 18:49 ` Ben Hutchings
  2011-04-20 17:59   ` Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING) Tim Gardner
  1 sibling, 1 reply; 5+ messages in thread
From: Ben Hutchings @ 2011-04-19 18:49 UTC (permalink / raw)
  To: tim.gardner; +Cc: netdev

On Tue, 2011-04-19 at 11:40 -0600, Tim Gardner wrote:
> I'm seeing a lot of these kinds of bugs: WARNING: at 
> /build/buildd/linux-2.6.38/net/sched/sch_generic.c:256 
> dev_watchdog+0x213/0x220()
> 
> The kernel is 2.6.38.2 plus Ubuntu cruft.
> 
> A spot check of the 200+ hits on this string indicates they are 
> primarily due to these drivers:
> 
> ipheth
> atl1c
> sis900
> r8169
> 
> As far as I can tell the warning happens when link is down on the media 
> (and has never been link UP) and are sent a transmit packet which never 
> completes. Is there a net/core or net/sched requirement to which these 
> drivers do not conform ? Are they not correctly indicating link status?

The watchdog fires when the software queue has been stopped *and* the
link has been reported as up for over dev->watchdog_timeo ticks.

The software queue should be stopped iff the hardware queue is full or
nearly full.  If the software queue remains stopped and the link is
still reported up, then one of these things is happening:

1. The link went down but the driver didn't notice
2. TX completions are not being indicated or handled correctly
3. The hardware TX path has locked up
4. The link is stalled by excessive pause frames or collisions
5. Timeout is too low and/or low watermark is too high
(there may be other explanations)

I think the watchdog is primarily meant to deal with case 3, though all
of cases 1-3 may be worked around by resetting the hardware.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING)
  2011-04-19 18:49 ` Ben Hutchings
@ 2011-04-20 17:59   ` Tim Gardner
  2011-04-20 18:13     ` Ben Hutchings
  0 siblings, 1 reply; 5+ messages in thread
From: Tim Gardner @ 2011-04-20 17:59 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 2234 bytes --]

On 04/19/2011 12:49 PM, Ben Hutchings wrote:
> On Tue, 2011-04-19 at 11:40 -0600, Tim Gardner wrote:
>> I'm seeing a lot of these kinds of bugs: WARNING: at
>> /build/buildd/linux-2.6.38/net/sched/sch_generic.c:256
>> dev_watchdog+0x213/0x220()
>>
>> The kernel is 2.6.38.2 plus Ubuntu cruft.
>>
>> A spot check of the 200+ hits on this string indicates they are
>> primarily due to these drivers:
>>
>> ipheth
>> atl1c
>> sis900
>> r8169
>>
>> As far as I can tell the warning happens when link is down on the media
>> (and has never been link UP) and are sent a transmit packet which never
>> completes. Is there a net/core or net/sched requirement to which these
>> drivers do not conform ? Are they not correctly indicating link status?
>
> The watchdog fires when the software queue has been stopped *and* the
> link has been reported as up for over dev->watchdog_timeo ticks.
>
> The software queue should be stopped iff the hardware queue is full or
> nearly full.  If the software queue remains stopped and the link is
> still reported up, then one of these things is happening:
>
> 1. The link went down but the driver didn't notice
> 2. TX completions are not being indicated or handled correctly
> 3. The hardware TX path has locked up
> 4. The link is stalled by excessive pause frames or collisions
> 5. Timeout is too low and/or low watermark is too high
> (there may be other explanations)
>
> I think the watchdog is primarily meant to deal with case 3, though all
> of cases 1-3 may be worked around by resetting the hardware.
>
> Ben.
>

I've been focusing on atl1c while trying to understand why link status 
flapping could cause these watchdog timeouts. I've a couple of log files 
with link state change information:

http://bugs.launchpad.net/bugs/766273
https://launchpadlibrarian.net/69926580/BootDmesg.txt
https://launchpadlibrarian.net/69926583/CurrentDmesg.txt

One thing of note is that there are 2 link UP messages in a row, 
something that should only be able to happen if there has been an 
intervening device reset (which is not evident in the logs). I've 
noticed that the work event scheduling is kind of racy, so perhaps this 
will help. See attached.

rtg
-- 
Tim Gardner tim.gardner@canonical.com

[-- Attachment #2: 0001-atl1c-Fix-work-event-interrupt-task-races.patch --]
[-- Type: text/x-patch, Size: 2790 bytes --]

>From 6e92d53121dae5fa13c95c19b6076009df686de8 Mon Sep 17 00:00:00 2001
From: Tim Gardner <tim.gardner@canonical.com>
Date: Wed, 20 Apr 2011 11:31:09 -0600
Subject: [PATCH] atl1c: Fix work event interrupt/task races

The mechanism used to initiate work events from the interrupt
handler has a classic read/modify/write race between the interrupt
handler that sets the condition, and the worker task that reads and
clears the condition. Close these races by using atomic
bit fields.

Cc: stable@kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---
 drivers/net/atl1c/atl1c.h      |    6 +++---
 drivers/net/atl1c/atl1c_main.c |   14 +++++---------
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/net/atl1c/atl1c.h b/drivers/net/atl1c/atl1c.h
index 9ab5809..dec8110 100644
--- a/drivers/net/atl1c/atl1c.h
+++ b/drivers/net/atl1c/atl1c.h
@@ -566,9 +566,9 @@ struct atl1c_adapter {
 #define __AT_TESTING        0x0001
 #define __AT_RESETTING      0x0002
 #define __AT_DOWN           0x0003
-	u8 work_event;
-#define ATL1C_WORK_EVENT_RESET 		0x01
-#define ATL1C_WORK_EVENT_LINK_CHANGE	0x02
+	unsigned long work_event;
+#define	ATL1C_WORK_EVENT_RESET		0
+#define	ATL1C_WORK_EVENT_LINK_CHANGE	1
 	u32 msg_enable;
 
 	bool have_msi;
diff --git a/drivers/net/atl1c/atl1c_main.c b/drivers/net/atl1c/atl1c_main.c
index 3824382..dffc7f7 100644
--- a/drivers/net/atl1c/atl1c_main.c
+++ b/drivers/net/atl1c/atl1c_main.c
@@ -325,7 +325,7 @@ static void atl1c_link_chg_event(struct atl1c_adapter *adapter)
 		}
 	}
 
-	adapter->work_event |= ATL1C_WORK_EVENT_LINK_CHANGE;
+	set_bit(ATL1C_WORK_EVENT_LINK_CHANGE, &adapter->work_event);
 	schedule_work(&adapter->common_task);
 }
 
@@ -337,20 +337,16 @@ static void atl1c_common_task(struct work_struct *work)
 	adapter = container_of(work, struct atl1c_adapter, common_task);
 	netdev = adapter->netdev;
 
-	if (adapter->work_event & ATL1C_WORK_EVENT_RESET) {
-		adapter->work_event &= ~ATL1C_WORK_EVENT_RESET;
+	if (test_and_clear_bit(ATL1C_WORK_EVENT_RESET, &adapter->work_event)) {
 		netif_device_detach(netdev);
 		atl1c_down(adapter);
 		atl1c_up(adapter);
 		netif_device_attach(netdev);
-		return;
 	}
 
-	if (adapter->work_event & ATL1C_WORK_EVENT_LINK_CHANGE) {
-		adapter->work_event &= ~ATL1C_WORK_EVENT_LINK_CHANGE;
+	if (test_and_clear_bit(ATL1C_WORK_EVENT_LINK_CHANGE,
+		&adapter->work_event))
 		atl1c_check_link_status(adapter);
-	}
-	return;
 }
 
 
@@ -369,7 +365,7 @@ static void atl1c_tx_timeout(struct net_device *netdev)
 	struct atl1c_adapter *adapter = netdev_priv(netdev);
 
 	/* Do the reset outside of interrupt context */
-	adapter->work_event |= ATL1C_WORK_EVENT_RESET;
+	set_bit(ATL1C_WORK_EVENT_RESET, &adapter->work_event);
 	schedule_work(&adapter->common_task);
 }
 
-- 
1.7.0.4


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING)
  2011-04-20 17:59   ` Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING) Tim Gardner
@ 2011-04-20 18:13     ` Ben Hutchings
  0 siblings, 0 replies; 5+ messages in thread
From: Ben Hutchings @ 2011-04-20 18:13 UTC (permalink / raw)
  To: tim.gardner; +Cc: netdev

On Wed, 2011-04-20 at 11:59 -0600, Tim Gardner wrote:
[...]
> I've been focusing on atl1c while trying to understand why link status 
> flapping could cause these watchdog timeouts. I've a couple of log files 
> with link state change information:
> 
> http://bugs.launchpad.net/bugs/766273
> https://launchpadlibrarian.net/69926580/BootDmesg.txt
> https://launchpadlibrarian.net/69926583/CurrentDmesg.txt
> 
> One thing of note is that there are 2 link UP messages in a row, 
> something that should only be able to happen if there has been an 
> intervening device reset (which is not evident in the logs). I've 
> noticed that the work event scheduling is kind of racy, so perhaps this 
> will help. See attached.

I'm not going to spend any significant time looking at atl1c, as that's
really not my job!  The atlx maintainers may be covering atl1c as well,
so try cc'ing them.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-04-20 18:13 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-19 17:40 2.6.38 dev_watchdog WARNING Tim Gardner
2011-04-19 18:20 ` Ben Greear
2011-04-19 18:49 ` Ben Hutchings
2011-04-20 17:59   ` Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING) Tim Gardner
2011-04-20 18:13     ` Ben Hutchings

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).