* 2.6.38 dev_watchdog WARNING @ 2011-04-19 17:40 Tim Gardner 2011-04-19 18:20 ` Ben Greear 2011-04-19 18:49 ` Ben Hutchings 0 siblings, 2 replies; 5+ messages in thread From: Tim Gardner @ 2011-04-19 17:40 UTC (permalink / raw) To: netdev I'm seeing a lot of these kinds of bugs: WARNING: at /build/buildd/linux-2.6.38/net/sched/sch_generic.c:256 dev_watchdog+0x213/0x220() The kernel is 2.6.38.2 plus Ubuntu cruft. A spot check of the 200+ hits on this string indicates they are primarily due to these drivers: ipheth atl1c sis900 r8169 As far as I can tell the warning happens when link is down on the media (and has never been link UP) and are sent a transmit packet which never completes. Is there a net/core or net/sched requirement to which these drivers do not conform ? Are they not correctly indicating link status? rtg -- Tim Gardner tim.gardner@canonical.com ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: 2.6.38 dev_watchdog WARNING 2011-04-19 17:40 2.6.38 dev_watchdog WARNING Tim Gardner @ 2011-04-19 18:20 ` Ben Greear 2011-04-19 18:49 ` Ben Hutchings 1 sibling, 0 replies; 5+ messages in thread From: Ben Greear @ 2011-04-19 18:20 UTC (permalink / raw) To: tim.gardner; +Cc: netdev On 04/19/2011 10:40 AM, Tim Gardner wrote: > I'm seeing a lot of these kinds of bugs: WARNING: at > /build/buildd/linux-2.6.38/net/sched/sch_generic.c:256 > dev_watchdog+0x213/0x220() > > The kernel is 2.6.38.2 plus Ubuntu cruft. > > A spot check of the 200+ hits on this string indicates they are > primarily due to these drivers: > > ipheth > atl1c > sis900 > r8169 > > As far as I can tell the warning happens when link is down on the media > (and has never been link UP) and are sent a transmit packet which never > completes. Is there a net/core or net/sched requirement to which these > drivers do not conform ? Are they not correctly indicating link status? r8169 doesn't report link right for me... No idea about the rest. Ben > > rtg -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: 2.6.38 dev_watchdog WARNING 2011-04-19 17:40 2.6.38 dev_watchdog WARNING Tim Gardner 2011-04-19 18:20 ` Ben Greear @ 2011-04-19 18:49 ` Ben Hutchings 2011-04-20 17:59 ` Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING) Tim Gardner 1 sibling, 1 reply; 5+ messages in thread From: Ben Hutchings @ 2011-04-19 18:49 UTC (permalink / raw) To: tim.gardner; +Cc: netdev On Tue, 2011-04-19 at 11:40 -0600, Tim Gardner wrote: > I'm seeing a lot of these kinds of bugs: WARNING: at > /build/buildd/linux-2.6.38/net/sched/sch_generic.c:256 > dev_watchdog+0x213/0x220() > > The kernel is 2.6.38.2 plus Ubuntu cruft. > > A spot check of the 200+ hits on this string indicates they are > primarily due to these drivers: > > ipheth > atl1c > sis900 > r8169 > > As far as I can tell the warning happens when link is down on the media > (and has never been link UP) and are sent a transmit packet which never > completes. Is there a net/core or net/sched requirement to which these > drivers do not conform ? Are they not correctly indicating link status? The watchdog fires when the software queue has been stopped *and* the link has been reported as up for over dev->watchdog_timeo ticks. The software queue should be stopped iff the hardware queue is full or nearly full. If the software queue remains stopped and the link is still reported up, then one of these things is happening: 1. The link went down but the driver didn't notice 2. TX completions are not being indicated or handled correctly 3. The hardware TX path has locked up 4. The link is stalled by excessive pause frames or collisions 5. Timeout is too low and/or low watermark is too high (there may be other explanations) I think the watchdog is primarily meant to deal with case 3, though all of cases 1-3 may be worked around by resetting the hardware. Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING) 2011-04-19 18:49 ` Ben Hutchings @ 2011-04-20 17:59 ` Tim Gardner 2011-04-20 18:13 ` Ben Hutchings 0 siblings, 1 reply; 5+ messages in thread From: Tim Gardner @ 2011-04-20 17:59 UTC (permalink / raw) To: Ben Hutchings; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 2234 bytes --] On 04/19/2011 12:49 PM, Ben Hutchings wrote: > On Tue, 2011-04-19 at 11:40 -0600, Tim Gardner wrote: >> I'm seeing a lot of these kinds of bugs: WARNING: at >> /build/buildd/linux-2.6.38/net/sched/sch_generic.c:256 >> dev_watchdog+0x213/0x220() >> >> The kernel is 2.6.38.2 plus Ubuntu cruft. >> >> A spot check of the 200+ hits on this string indicates they are >> primarily due to these drivers: >> >> ipheth >> atl1c >> sis900 >> r8169 >> >> As far as I can tell the warning happens when link is down on the media >> (and has never been link UP) and are sent a transmit packet which never >> completes. Is there a net/core or net/sched requirement to which these >> drivers do not conform ? Are they not correctly indicating link status? > > The watchdog fires when the software queue has been stopped *and* the > link has been reported as up for over dev->watchdog_timeo ticks. > > The software queue should be stopped iff the hardware queue is full or > nearly full. If the software queue remains stopped and the link is > still reported up, then one of these things is happening: > > 1. The link went down but the driver didn't notice > 2. TX completions are not being indicated or handled correctly > 3. The hardware TX path has locked up > 4. The link is stalled by excessive pause frames or collisions > 5. Timeout is too low and/or low watermark is too high > (there may be other explanations) > > I think the watchdog is primarily meant to deal with case 3, though all > of cases 1-3 may be worked around by resetting the hardware. > > Ben. > I've been focusing on atl1c while trying to understand why link status flapping could cause these watchdog timeouts. I've a couple of log files with link state change information: http://bugs.launchpad.net/bugs/766273 https://launchpadlibrarian.net/69926580/BootDmesg.txt https://launchpadlibrarian.net/69926583/CurrentDmesg.txt One thing of note is that there are 2 link UP messages in a row, something that should only be able to happen if there has been an intervening device reset (which is not evident in the logs). I've noticed that the work event scheduling is kind of racy, so perhaps this will help. See attached. rtg -- Tim Gardner tim.gardner@canonical.com [-- Attachment #2: 0001-atl1c-Fix-work-event-interrupt-task-races.patch --] [-- Type: text/x-patch, Size: 2790 bytes --] >From 6e92d53121dae5fa13c95c19b6076009df686de8 Mon Sep 17 00:00:00 2001 From: Tim Gardner <tim.gardner@canonical.com> Date: Wed, 20 Apr 2011 11:31:09 -0600 Subject: [PATCH] atl1c: Fix work event interrupt/task races The mechanism used to initiate work events from the interrupt handler has a classic read/modify/write race between the interrupt handler that sets the condition, and the worker task that reads and clears the condition. Close these races by using atomic bit fields. Cc: stable@kernel.org Signed-off-by: Tim Gardner <tim.gardner@canonical.com> --- drivers/net/atl1c/atl1c.h | 6 +++--- drivers/net/atl1c/atl1c_main.c | 14 +++++--------- 2 files changed, 8 insertions(+), 12 deletions(-) diff --git a/drivers/net/atl1c/atl1c.h b/drivers/net/atl1c/atl1c.h index 9ab5809..dec8110 100644 --- a/drivers/net/atl1c/atl1c.h +++ b/drivers/net/atl1c/atl1c.h @@ -566,9 +566,9 @@ struct atl1c_adapter { #define __AT_TESTING 0x0001 #define __AT_RESETTING 0x0002 #define __AT_DOWN 0x0003 - u8 work_event; -#define ATL1C_WORK_EVENT_RESET 0x01 -#define ATL1C_WORK_EVENT_LINK_CHANGE 0x02 + unsigned long work_event; +#define ATL1C_WORK_EVENT_RESET 0 +#define ATL1C_WORK_EVENT_LINK_CHANGE 1 u32 msg_enable; bool have_msi; diff --git a/drivers/net/atl1c/atl1c_main.c b/drivers/net/atl1c/atl1c_main.c index 3824382..dffc7f7 100644 --- a/drivers/net/atl1c/atl1c_main.c +++ b/drivers/net/atl1c/atl1c_main.c @@ -325,7 +325,7 @@ static void atl1c_link_chg_event(struct atl1c_adapter *adapter) } } - adapter->work_event |= ATL1C_WORK_EVENT_LINK_CHANGE; + set_bit(ATL1C_WORK_EVENT_LINK_CHANGE, &adapter->work_event); schedule_work(&adapter->common_task); } @@ -337,20 +337,16 @@ static void atl1c_common_task(struct work_struct *work) adapter = container_of(work, struct atl1c_adapter, common_task); netdev = adapter->netdev; - if (adapter->work_event & ATL1C_WORK_EVENT_RESET) { - adapter->work_event &= ~ATL1C_WORK_EVENT_RESET; + if (test_and_clear_bit(ATL1C_WORK_EVENT_RESET, &adapter->work_event)) { netif_device_detach(netdev); atl1c_down(adapter); atl1c_up(adapter); netif_device_attach(netdev); - return; } - if (adapter->work_event & ATL1C_WORK_EVENT_LINK_CHANGE) { - adapter->work_event &= ~ATL1C_WORK_EVENT_LINK_CHANGE; + if (test_and_clear_bit(ATL1C_WORK_EVENT_LINK_CHANGE, + &adapter->work_event)) atl1c_check_link_status(adapter); - } - return; } @@ -369,7 +365,7 @@ static void atl1c_tx_timeout(struct net_device *netdev) struct atl1c_adapter *adapter = netdev_priv(netdev); /* Do the reset outside of interrupt context */ - adapter->work_event |= ATL1C_WORK_EVENT_RESET; + set_bit(ATL1C_WORK_EVENT_RESET, &adapter->work_event); schedule_work(&adapter->common_task); } -- 1.7.0.4 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING) 2011-04-20 17:59 ` Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING) Tim Gardner @ 2011-04-20 18:13 ` Ben Hutchings 0 siblings, 0 replies; 5+ messages in thread From: Ben Hutchings @ 2011-04-20 18:13 UTC (permalink / raw) To: tim.gardner; +Cc: netdev On Wed, 2011-04-20 at 11:59 -0600, Tim Gardner wrote: [...] > I've been focusing on atl1c while trying to understand why link status > flapping could cause these watchdog timeouts. I've a couple of log files > with link state change information: > > http://bugs.launchpad.net/bugs/766273 > https://launchpadlibrarian.net/69926580/BootDmesg.txt > https://launchpadlibrarian.net/69926583/CurrentDmesg.txt > > One thing of note is that there are 2 link UP messages in a row, > something that should only be able to happen if there has been an > intervening device reset (which is not evident in the logs). I've > noticed that the work event scheduling is kind of racy, so perhaps this > will help. See attached. I'm not going to spend any significant time looking at atl1c, as that's really not my job! The atlx maintainers may be covering atl1c as well, so try cc'ing them. Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2011-04-20 18:13 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-04-19 17:40 2.6.38 dev_watchdog WARNING Tim Gardner 2011-04-19 18:20 ` Ben Greear 2011-04-19 18:49 ` Ben Hutchings 2011-04-20 17:59 ` Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING) Tim Gardner 2011-04-20 18:13 ` Ben Hutchings
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).