All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tim Gardner <tim.gardner@canonical.com>
To: Ben Hutchings <bhutchings@solarflare.com>
Cc: netdev <netdev@vger.kernel.org>
Subject: Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING)
Date: Wed, 20 Apr 2011 11:59:57 -0600	[thread overview]
Message-ID: <4DAF1F1D.1020108@canonical.com> (raw)
In-Reply-To: <1303238959.2988.30.camel@bwh-desktop>

[-- Attachment #1: Type: text/plain, Size: 2234 bytes --]

On 04/19/2011 12:49 PM, Ben Hutchings wrote:
> On Tue, 2011-04-19 at 11:40 -0600, Tim Gardner wrote:
>> I'm seeing a lot of these kinds of bugs: WARNING: at
>> /build/buildd/linux-2.6.38/net/sched/sch_generic.c:256
>> dev_watchdog+0x213/0x220()
>>
>> The kernel is 2.6.38.2 plus Ubuntu cruft.
>>
>> A spot check of the 200+ hits on this string indicates they are
>> primarily due to these drivers:
>>
>> ipheth
>> atl1c
>> sis900
>> r8169
>>
>> As far as I can tell the warning happens when link is down on the media
>> (and has never been link UP) and are sent a transmit packet which never
>> completes. Is there a net/core or net/sched requirement to which these
>> drivers do not conform ? Are they not correctly indicating link status?
>
> The watchdog fires when the software queue has been stopped *and* the
> link has been reported as up for over dev->watchdog_timeo ticks.
>
> The software queue should be stopped iff the hardware queue is full or
> nearly full.  If the software queue remains stopped and the link is
> still reported up, then one of these things is happening:
>
> 1. The link went down but the driver didn't notice
> 2. TX completions are not being indicated or handled correctly
> 3. The hardware TX path has locked up
> 4. The link is stalled by excessive pause frames or collisions
> 5. Timeout is too low and/or low watermark is too high
> (there may be other explanations)
>
> I think the watchdog is primarily meant to deal with case 3, though all
> of cases 1-3 may be worked around by resetting the hardware.
>
> Ben.
>

I've been focusing on atl1c while trying to understand why link status 
flapping could cause these watchdog timeouts. I've a couple of log files 
with link state change information:

http://bugs.launchpad.net/bugs/766273
https://launchpadlibrarian.net/69926580/BootDmesg.txt
https://launchpadlibrarian.net/69926583/CurrentDmesg.txt

One thing of note is that there are 2 link UP messages in a row, 
something that should only be able to happen if there has been an 
intervening device reset (which is not evident in the logs). I've 
noticed that the work event scheduling is kind of racy, so perhaps this 
will help. See attached.

rtg
-- 
Tim Gardner tim.gardner@canonical.com

[-- Attachment #2: 0001-atl1c-Fix-work-event-interrupt-task-races.patch --]
[-- Type: text/x-patch, Size: 2790 bytes --]

>From 6e92d53121dae5fa13c95c19b6076009df686de8 Mon Sep 17 00:00:00 2001
From: Tim Gardner <tim.gardner@canonical.com>
Date: Wed, 20 Apr 2011 11:31:09 -0600
Subject: [PATCH] atl1c: Fix work event interrupt/task races

The mechanism used to initiate work events from the interrupt
handler has a classic read/modify/write race between the interrupt
handler that sets the condition, and the worker task that reads and
clears the condition. Close these races by using atomic
bit fields.

Cc: stable@kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
---
 drivers/net/atl1c/atl1c.h      |    6 +++---
 drivers/net/atl1c/atl1c_main.c |   14 +++++---------
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/net/atl1c/atl1c.h b/drivers/net/atl1c/atl1c.h
index 9ab5809..dec8110 100644
--- a/drivers/net/atl1c/atl1c.h
+++ b/drivers/net/atl1c/atl1c.h
@@ -566,9 +566,9 @@ struct atl1c_adapter {
 #define __AT_TESTING        0x0001
 #define __AT_RESETTING      0x0002
 #define __AT_DOWN           0x0003
-	u8 work_event;
-#define ATL1C_WORK_EVENT_RESET 		0x01
-#define ATL1C_WORK_EVENT_LINK_CHANGE	0x02
+	unsigned long work_event;
+#define	ATL1C_WORK_EVENT_RESET		0
+#define	ATL1C_WORK_EVENT_LINK_CHANGE	1
 	u32 msg_enable;
 
 	bool have_msi;
diff --git a/drivers/net/atl1c/atl1c_main.c b/drivers/net/atl1c/atl1c_main.c
index 3824382..dffc7f7 100644
--- a/drivers/net/atl1c/atl1c_main.c
+++ b/drivers/net/atl1c/atl1c_main.c
@@ -325,7 +325,7 @@ static void atl1c_link_chg_event(struct atl1c_adapter *adapter)
 		}
 	}
 
-	adapter->work_event |= ATL1C_WORK_EVENT_LINK_CHANGE;
+	set_bit(ATL1C_WORK_EVENT_LINK_CHANGE, &adapter->work_event);
 	schedule_work(&adapter->common_task);
 }
 
@@ -337,20 +337,16 @@ static void atl1c_common_task(struct work_struct *work)
 	adapter = container_of(work, struct atl1c_adapter, common_task);
 	netdev = adapter->netdev;
 
-	if (adapter->work_event & ATL1C_WORK_EVENT_RESET) {
-		adapter->work_event &= ~ATL1C_WORK_EVENT_RESET;
+	if (test_and_clear_bit(ATL1C_WORK_EVENT_RESET, &adapter->work_event)) {
 		netif_device_detach(netdev);
 		atl1c_down(adapter);
 		atl1c_up(adapter);
 		netif_device_attach(netdev);
-		return;
 	}
 
-	if (adapter->work_event & ATL1C_WORK_EVENT_LINK_CHANGE) {
-		adapter->work_event &= ~ATL1C_WORK_EVENT_LINK_CHANGE;
+	if (test_and_clear_bit(ATL1C_WORK_EVENT_LINK_CHANGE,
+		&adapter->work_event))
 		atl1c_check_link_status(adapter);
-	}
-	return;
 }
 
 
@@ -369,7 +365,7 @@ static void atl1c_tx_timeout(struct net_device *netdev)
 	struct atl1c_adapter *adapter = netdev_priv(netdev);
 
 	/* Do the reset outside of interrupt context */
-	adapter->work_event |= ATL1C_WORK_EVENT_RESET;
+	set_bit(ATL1C_WORK_EVENT_RESET, &adapter->work_event);
 	schedule_work(&adapter->common_task);
 }
 
-- 
1.7.0.4


  reply	other threads:[~2011-04-20 18:00 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-04-19 17:40 2.6.38 dev_watchdog WARNING Tim Gardner
2011-04-19 18:20 ` Ben Greear
2011-04-19 18:49 ` Ben Hutchings
2011-04-20 17:59   ` Tim Gardner [this message]
2011-04-20 18:13     ` Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING) Ben Hutchings

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4DAF1F1D.1020108@canonical.com \
    --to=tim.gardner@canonical.com \
    --cc=bhutchings@solarflare.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.