Netdev List
 help / color / mirror / Atom feed
* dgrs: sparse warnings
From: Stephen Hemminger @ 2007-09-06 12:55 UTC (permalink / raw)
  To: Al Viro, Jeff Garzik; +Cc: netdev

The dgrs driver is crap and assumes direct access to PCI iomem.

  CHECK   drivers/net/dgrs.c
drivers/net/dgrs.c:314:20: warning: cast removes address space of expression
drivers/net/dgrs.c:330:12: warning: incorrect type in argument 1 (different address spaces)
drivers/net/dgrs.c:330:12:    expected void volatile [noderef] <asn:2>*addr
drivers/net/dgrs.c:330:12:    got void *<noident>
drivers/net/dgrs.c:1003:14: warning: incorrect type in assignment (different address spaces)
drivers/net/dgrs.c:1003:14:    expected char *[usertype] vmem
drivers/net/dgrs.c:1003:14:    got void [noderef] <asn:2>*
drivers/net/dgrs.c:1023:17: warning: incorrect type in argument 1 (different address spaces)
drivers/net/dgrs.c:1023:17:    expected void volatile [noderef] <asn:2>*addr
drivers/net/dgrs.c:1023:17:    got char *[usertype] vmem
drivers/net/dgrs.c:1052:16: warning: incorrect type in argument 1 (different address spaces)
drivers/net/dgrs.c:1052:16:    expected void volatile [noderef] <asn:2>*addr
drivers/net/dgrs.c:1052:16:    got char *[usertype] vmem
drivers/net/dgrs.c:1106:16: warning: incorrect type in argument 1 (different address spaces)
drivers/net/dgrs.c:1106:16:    expected void volatile [noderef] <asn:2>*addr
drivers/net/dgrs.c:1106:16:    got char *[usertype] vmem
drivers/net/dgrs.c:1369:15: warning: incorrect type in argument 1 (different address spaces)
drivers/net/dgrs.c:1369:15:    expected void volatile [noderef] <asn:2>*addr
drivers/net/dgrs.c:1369:15:    got char *[usertype] vmem
drivers/net/dgrs.c:1371:12: warning: incorrect type in argument 1 (different address spaces)
drivers/net/dgrs.c:1371:12:    expected void volatile [noderef] <asn:2>*addr
drivers/net/dgrs.c:1371:12:    got unsigned char [usertype] *<noident>

Perhaps we should just mark it as BROKEN in Kconfig

^ permalink raw reply

* Re: af_packet: don't enable global timestamps
From: David Miller @ 2007-09-06 12:56 UTC (permalink / raw)
  To: dada1; +Cc: shemminger, ak, netdev
In-Reply-To: <20070904095958.53c86337.dada1@cosmosbay.com>

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 4 Sep 2007 09:59:58 +0200

> This patch seems the correct fix for this longstanding problem.
> 
> Please note that if wireshark/tcpdump processes really want precise timestamps, 
> they still can ask this by using setsockopt(SO_TIMESTAMP or SO_TIMESTAMPNS) on their private socket.
> 
> (We already know that running a sniffer has a cost anyway)

Agreed, let's give this a try.

I've applied Stephen's patch to net-2.6.24

^ permalink raw reply

* Re: [TOMOYO 15/15] LSM expansion for TOMOYO Linux.
From: Tetsuo Handa @ 2007-09-06 13:04 UTC (permalink / raw)
  To: paul.moore; +Cc: linux-kernel, linux-security-module, chrisw, netdev
In-Reply-To: <200709051006.28429.paul.moore@hp.com>

Hello.

Thank you very much for your time, Paul.
Yes, you understood what I wanted to do.

TOMOYO Linux's approach:

(1) It uses userspace intervention to allow/reject
    connections and/or packets based on the application's domain.
    Since existent hooks can't be used for this purpose,
    I inserted a new hook post_recv_datagram() at skb_recv_datagram()
    and I modified socket_post_accept() to return error so that
    I can drop/disconnect based on the application's domain.

    I think skb_recv_datagram() is the only place that can remove
    a message picked up with MSG_PEEK flags from the receive queue.
    To remove a message picked up with MSG_PEEK flags, I noticed that
    I have to do skb_kill_datagram()-like operation so that
    "the head message that must not be delivered to the caller" won't prevent
    picking up of "the non-head message that should be delivered to the caller"
    when the caller repeats only recv(MSG_PEEK) requests.
    Since skb_recv_datagram() can be called from interrupt context,
    I have to use spin_lock_irqsave() instead for spin_lock_bh(), am I right?

/* from net/core/datagram.c */
@@ -178,6 +179,27 @@ struct sk_buff *skb_recv_datagram(struct
 		} else
 			skb = skb_dequeue(&sk->sk_receive_queue);
 
+		error = security_post_recv_datagram(sk, skb, flags);
+		if (error) {
+			unsigned long cpu_flags;
+
+			if (!(flags & MSG_PEEK))
+				goto no_peek;
+
+			spin_lock_irqsave(&sk->sk_receive_queue.lock,
+					  cpu_flags);
+			if (skb == skb_peek(&sk->sk_receive_queue)) {
+				__skb_unlink(skb,
+					     &sk->sk_receive_queue);
+				atomic_dec(&skb->users);
+			}
+			spin_unlock_irqrestore(&sk->sk_receive_queue.lock,
+					       cpu_flags);
+no_peek:
+			skb_free_datagram(sk, skb);
+			goto no_packet;
+		}
+
 		if (skb)
 			return skb;



    By the way, why can't socket_post_accept() fail?
    Someone may wish to do memory allocation at socket_post_accept().
    socket_accept() is too early for memory allocation because
    there is no chance to free allocated memory
    when sock->ops->accept() failed.
    I think socket_post_accept() should be able to fail.

(2) It allows the administrator judge interactively
    using a userspace agent.
    Thus, the new hook has to be inserted at blockable location,
    Since skb_recv_datagram() can be called from interrupt context,
    I do nothing in post_recv_datagram() if called from interrupt context.

+static int tmy_post_recv_datagram(struct sock *sk,
+				  struct sk_buff *skb,
+				  unsigned int flags)
+{
+	int error = 0;
+	const unsigned int type = sk->sk_type;
+
+	/* skb_recv_datagram() didn't dequeue. */
+	if (!skb)
+		return 0;
+
+	/* skb_recv_datagram() can be called from interrupt context. */
+	if (in_interrupt())
+		return 0;
+	/* I don't check if called by kernel process. */
+	if (segment_eq(get_fs(), KERNEL_DS))
+		return 0;
+
+	if (type != SOCK_DGRAM && type != SOCK_RAW)
+		return 0;
...(sniped)...
+}



Regards.

^ permalink raw reply

* Re: [PATCH] [sis900] convert to NAPI, WAS Re: pktgen terminating condition
From: jamal @ 2007-09-06 13:22 UTC (permalink / raw)
  To: Mandeep Singh Baines
  Cc: James Chapman, Daniele Venzano, davem, rick.jones2, msb, netdev,
	grundler, robert.olsson, jeff, nhorman
In-Reply-To: <20070906050629.GA4262@ludhiana>

On Wed, 2007-05-09 at 22:06 -0700, Mandeep Singh Baines wrote:
> jamal (hadi@cyberus.ca) wrote:

> > If you read the paper: There are no issues with high throughput - NAPI
> > kicks in.
> > The challenge to be overcome is at low traffic, if you have a real fast
> > processor your cpu-cycles-used/bits-processed ratio is high....
> 
> I'm not sure cpu-cycles-used/bits-processed is the correct metric to use.
> An alternative would be to look at cpu-cycles-used/unit-time (i.e.
> CPU utilization or load) for a given bit-rate or packet-rate. 

I wasnt explicit but we are saying the same thing. The paper is more
specific: if you make the packet count used constant - this translates
to a unit of time given low traffic rate. If you fix the time,
cpu-cycles are easier to extrapolate from.

> This would make an interesting graph.

If you are to plot a cpu-util vs packet rate, on most modern hardware you will 
see a spike before you hit the MLFRR and then utilization will go down and slowly 
start going up.

> At low packet-rate, CPU utilization is low so doing extra work per packet
> is probably OK. 

Thats the arguement weve used in the past ;-> Unfortunately, with the
relative cost of IO going up you cant keep making that arguement (at
least thats the position presented in the paper). 
Your mileage may vary. If you are running a machine dishing out bulk
transfers probably not a big deal. If you are doing database
transactions, you loose in benchmarks etc.

> Utilizing 2% of CPU vesus 1% is negligible. But at higher
> rate, when there is more CPU utilization, using 40% of CPU versus 60% is
> significant. I think the absolute different in CPU utilization is more
> important than the relative difference. iow, 2% versus 1%, even though 
> a 2x difference in cpu-cycles/packet, is negligible compared to 40% 
> versus 60%.

Refer to above.

> Using a timer might also behave better in a tick-less (CONFIG_NO_HZ) 
> configuration.

good point.

cheers,
jamal


^ permalink raw reply

* Re: dgrs: sparse warnings
From: maximilian attems @ 2007-09-06 13:30 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Al Viro, Jeff Garzik, netdev
In-Reply-To: <20070906135547.13bb3326@oldman>

On Thu, Sep 06, 2007 at 01:55:47PM +0100, Stephen Hemminger wrote:
> The dgrs driver is crap and assumes direct access to PCI iomem.
> 

dgrs wants to be removed, it has no users.
(no bug report in bugzilla nor in debian nor in google)
the firmware is not distrutable.

i'll resend my removal patch bzipped and attached.

-- 
maks

^ permalink raw reply

* Re: BUG: scheduling while atomic: ifconfig/0x00000002/4170
From: Johannes Berg @ 2007-09-06 13:36 UTC (permalink / raw)
  To: Herbert Xu
  Cc: satyam-wEGCiKHe2LqWVfeAwA7xHQ, flo-BCn6idZOOBwdnm+yROfE0A,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	michal.k.k.piotrowski-Re5JQEeQqe8AvxtiuMwx3w,
	ipw3945-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	yi.zhu-ral2JQCrhuEAvxtiuMwx3w, flamingice-R9e9/4HEdknk1uMJSBkQmQ
In-Reply-To: <E1ITGb2-0006Be-00-XQvu0L+U/CjiRBuR/1fSEKKkPtS2pBon@public.gmane.org>

On Thu, 2007-09-06 at 20:36 +0800, Herbert Xu wrote:

> Yeah I think they're all under RTNL too.  So you don't need to
> take the lock here at all since you should already have the RTNL.

Ok, this patch gets rid of the lock. I'd appreciate if you could give it
a quick look for obvious RCU abuse as I haven't tested it. It also
doesn't apply against net-2.6.24 because I based it after the patches I
have queued with John for net-2.6.24.

johannes

--- netdev-2.6.orig/net/mac80211/ieee80211.c	2007-09-06 15:35:23.324447686 +0200
+++ netdev-2.6/net/mac80211/ieee80211.c	2007-09-06 15:35:23.394447686 +0200
@@ -90,8 +90,9 @@ static struct dev_mc_list *ieee80211_get
 
 	/* start of iteration, both unassigned */
 	if (!mcd->cur && !mcd->sdata) {
-		mcd->sdata = list_entry(local->sub_if_list.next,
-					struct ieee80211_sub_if_data, list);
+		mcd->sdata = list_entry(
+				rcu_dereference(local->interfaces.next),
+				struct ieee80211_sub_if_data, list);
 		mcd->cur = mcd->sdata->dev->mc_list;
 	}
 
@@ -100,10 +101,11 @@ static struct dev_mc_list *ieee80211_get
 
 	while (!mcd->cur) {
 		/* reached end of interface list? */
-		if (mcd->sdata->list.next == &local->sub_if_list)
+		if (rcu_dereference(mcd->sdata->list.next) ==
+		    &local->interfaces)
 			break;
 		/* otherwise try next interface */
-		mcd->sdata = list_entry(mcd->sdata->list.next,
+		mcd->sdata = list_entry(rcu_dereference(mcd->sdata->list.next),
 					struct ieee80211_sub_if_data, list);
 		mcd->cur = mcd->sdata->dev->mc_list;
 	}
@@ -145,9 +147,10 @@ static void ieee80211_configure_filter(s
 
 	/*
 	 * We can iterate through the device list for the multicast
-	 * address list so need to lock it.
+	 * address list so need to be in a RCU read-side section,
+	 * the RTNL isn't held in this function.
 	 */
-	read_lock(&local->sub_if_lock);
+	rcu_read_lock();
 
 	/* be a bit nasty */
 	new_flags |= (1<<31);
@@ -163,7 +166,7 @@ static void ieee80211_configure_filter(s
 	WARN_ON(mcd.cur);
 
 	local->filter_flags = new_flags & ~(1<<31);
-	read_unlock(&local->sub_if_lock);
+	rcu_read_unlock();
 
 	netif_tx_unlock(local->mdev);
 }
@@ -176,14 +179,13 @@ static int ieee80211_master_open(struct 
 	struct ieee80211_sub_if_data *sdata;
 	int res = -EOPNOTSUPP;
 
-	read_lock(&local->sub_if_lock);
-	list_for_each_entry(sdata, &local->sub_if_list, list) {
+	/* we hold the RTNL here so can safely walk the list */
+	list_for_each_entry(sdata, &local->interfaces, list) {
 		if (sdata->dev != dev && netif_running(sdata->dev)) {
 			res = 0;
 			break;
 		}
 	}
-	read_unlock(&local->sub_if_lock);
 	return res;
 }
 
@@ -192,11 +194,10 @@ static int ieee80211_master_stop(struct 
 	struct ieee80211_local *local = wdev_priv(dev->ieee80211_ptr);
 	struct ieee80211_sub_if_data *sdata;
 
-	read_lock(&local->sub_if_lock);
-	list_for_each_entry(sdata, &local->sub_if_list, list)
+	/* we hold the RTNL here so can safely walk the list */
+	list_for_each_entry(sdata, &local->interfaces, list)
 		if (sdata->dev != dev && netif_running(sdata->dev))
 			dev_close(sdata->dev);
-	read_unlock(&local->sub_if_lock);
 
 	return 0;
 }
@@ -430,8 +431,8 @@ static int ieee80211_open(struct net_dev
 
 	sdata = IEEE80211_DEV_TO_SUB_IF(dev);
 
-	read_lock(&local->sub_if_lock);
-	list_for_each_entry(nsdata, &local->sub_if_list, list) {
+	/* we hold the RTNL here so can safely walk the list */
+	list_for_each_entry(nsdata, &local->interfaces, list) {
 		struct net_device *ndev = nsdata->dev;
 
 		if (ndev != dev && ndev != local->mdev && netif_running(ndev) &&
@@ -440,10 +441,8 @@ static int ieee80211_open(struct net_dev
 			 * check whether it may have the same address
 			 */
 			if (!identical_mac_addr_allowed(sdata->type,
-							nsdata->type)) {
-				read_unlock(&local->sub_if_lock);
+							nsdata->type))
 				return -ENOTUNIQ;
-			}
 
 			/*
 			 * can only add VLANs to enabled APs
@@ -454,7 +453,6 @@ static int ieee80211_open(struct net_dev
 				sdata->u.vlan.ap = nsdata;
 		}
 	}
-	read_unlock(&local->sub_if_lock);
 
 	switch (sdata->type) {
 	case IEEE80211_IF_TYPE_WDS:
@@ -575,14 +573,13 @@ static int ieee80211_stop(struct net_dev
 		sdata->u.sta.state = IEEE80211_DISABLED;
 		del_timer_sync(&sdata->u.sta.timer);
 		/*
-		 * Holding the sub_if_lock for writing here blocks
-		 * out the receive path and makes sure it's not
-		 * currently processing a packet that may get
-		 * added to the queue.
+		 * When we get here, the interface is marked down.
+		 * Call synchronize_rcu() to wait for the RX path
+		 * should it be using the interface and enqueuing
+		 * frames at this very time on another CPU.
 		 */
-		write_lock_bh(&local->sub_if_lock);
+		synchronize_rcu();
 		skb_queue_purge(&sdata->u.sta.skb_queue);
-		write_unlock_bh(&local->sub_if_lock);
 
 		if (!local->ops->hw_scan &&
 		    local->scan_dev == sdata->dev) {
@@ -1134,9 +1131,9 @@ void ieee80211_tx_status(struct ieee8021
 
 	rthdr->data_retries = status->retry_count;
 
-	read_lock(&local->sub_if_lock);
+	rcu_read_lock();
 	monitors = local->monitors;
-	list_for_each_entry(sdata, &local->sub_if_list, list) {
+	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
 		/*
 		 * Using the monitors counter is possibly racy, but
 		 * if the value is wrong we simply either clone the skb
@@ -1152,7 +1149,7 @@ void ieee80211_tx_status(struct ieee8021
 				continue;
 			monitors--;
 			if (monitors)
-				skb2 = skb_clone(skb, GFP_KERNEL);
+				skb2 = skb_clone(skb, GFP_ATOMIC);
 			else
 				skb2 = NULL;
 			skb->dev = sdata->dev;
@@ -1167,7 +1164,7 @@ void ieee80211_tx_status(struct ieee8021
 		}
 	}
  out:
-	read_unlock(&local->sub_if_lock);
+	rcu_read_unlock();
 	if (skb)
 		dev_kfree_skb(skb);
 }
@@ -1255,8 +1252,7 @@ struct ieee80211_hw *ieee80211_alloc_hw(
 
 	INIT_LIST_HEAD(&local->modes_list);
 
-	rwlock_init(&local->sub_if_lock);
-	INIT_LIST_HEAD(&local->sub_if_list);
+	INIT_LIST_HEAD(&local->interfaces);
 
 	INIT_DELAYED_WORK(&local->scan_work, ieee80211_sta_scan_work);
 	ieee80211_rx_bss_list_init(mdev);
@@ -1275,7 +1271,8 @@ struct ieee80211_hw *ieee80211_alloc_hw(
 	sdata->u.ap.force_unicast_rateidx = -1;
 	sdata->u.ap.max_ratectrl_rateidx = -1;
 	ieee80211_if_sdata_init(sdata);
-	list_add_tail(&sdata->list, &local->sub_if_list);
+	/* no RCU needed since we're still during init phase */
+	list_add_tail(&sdata->list, &local->interfaces);
 
 	tasklet_init(&local->tx_pending_tasklet, ieee80211_tx_pending,
 		     (unsigned long)local);
@@ -1434,7 +1431,6 @@ void ieee80211_unregister_hw(struct ieee
 {
 	struct ieee80211_local *local = hw_to_local(hw);
 	struct ieee80211_sub_if_data *sdata, *tmp;
-	struct list_head tmp_list;
 	int i;
 
 	tasklet_kill(&local->tx_pending_tasklet);
@@ -1448,11 +1444,12 @@ void ieee80211_unregister_hw(struct ieee
 	if (local->apdev)
 		ieee80211_if_del_mgmt(local);
 
-	write_lock_bh(&local->sub_if_lock);
-	list_replace_init(&local->sub_if_list, &tmp_list);
-	write_unlock_bh(&local->sub_if_lock);
-
-	list_for_each_entry_safe(sdata, tmp, &tmp_list, list)
+	/*
+	 * At this point, interface list manipulations are fine
+	 * because the driver cannot be handing us frames any
+	 * more and the tasklet is killed.
+	 */
+	list_for_each_entry_safe(sdata, tmp, &local->interfaces, list)
 		__ieee80211_if_del(local, sdata);
 
 	rtnl_unlock();
--- netdev-2.6.orig/net/mac80211/ieee80211_i.h	2007-09-06 15:35:23.334447686 +0200
+++ netdev-2.6/net/mac80211/ieee80211_i.h	2007-09-06 15:35:23.404447686 +0200
@@ -481,9 +481,8 @@ struct ieee80211_local {
 	ieee80211_rx_handler *rx_handlers;
 	ieee80211_tx_handler *tx_handlers;
 
-	rwlock_t sub_if_lock; /* Protects sub_if_list. Cannot be taken under
-			       * sta_bss_lock or sta_lock. */
-	struct list_head sub_if_list;
+	struct list_head interfaces;
+
 	int sta_scanning;
 	int scan_channel_idx;
 	enum { SCAN_SET_CHANNEL, SCAN_SEND_PROBE } scan_state;
--- netdev-2.6.orig/net/mac80211/ieee80211_iface.c	2007-09-06 15:35:23.334447686 +0200
+++ netdev-2.6/net/mac80211/ieee80211_iface.c	2007-09-06 15:35:23.404447686 +0200
@@ -79,16 +79,15 @@ int ieee80211_if_add(struct net_device *
 	ieee80211_debugfs_add_netdev(sdata);
 	ieee80211_if_set_type(ndev, type);
 
-	write_lock_bh(&local->sub_if_lock);
+	/* we're under RTNL so all this is fine */
 	if (unlikely(local->reg_state == IEEE80211_DEV_UNREGISTERED)) {
-		write_unlock_bh(&local->sub_if_lock);
 		__ieee80211_if_del(local, sdata);
 		return -ENODEV;
 	}
-	list_add(&sdata->list, &local->sub_if_list);
+	list_add_tail_rcu(&sdata->list, &local->interfaces);
+
 	if (new_dev)
 		*new_dev = ndev;
-	write_unlock_bh(&local->sub_if_lock);
 
 	return 0;
 
@@ -226,22 +225,22 @@ void ieee80211_if_reinit(struct net_devi
 		/* Remove all virtual interfaces that use this BSS
 		 * as their sdata->bss */
 		struct ieee80211_sub_if_data *tsdata, *n;
-		LIST_HEAD(tmp_list);
 
-		write_lock_bh(&local->sub_if_lock);
-		list_for_each_entry_safe(tsdata, n, &local->sub_if_list, list) {
+		list_for_each_entry_safe(tsdata, n, &local->interfaces, list) {
 			if (tsdata != sdata && tsdata->bss == &sdata->u.ap) {
 				printk(KERN_DEBUG "%s: removing virtual "
 				       "interface %s because its BSS interface"
 				       " is being removed\n",
 				       sdata->dev->name, tsdata->dev->name);
-				list_move_tail(&tsdata->list, &tmp_list);
+				list_del_rcu(&tsdata->list);
+				/*
+				 * We have lots of time and can afford
+				 * to sync for each interface
+				 */
+				synchronize_rcu();
+				__ieee80211_if_del(local, tsdata);
 			}
 		}
-		write_unlock_bh(&local->sub_if_lock);
-
-		list_for_each_entry_safe(tsdata, n, &tmp_list, list)
-			__ieee80211_if_del(local, tsdata);
 
 		kfree(sdata->u.ap.beacon_head);
 		kfree(sdata->u.ap.beacon_tail);
@@ -318,18 +317,16 @@ int ieee80211_if_remove(struct net_devic
 
 	ASSERT_RTNL();
 
-	write_lock_bh(&local->sub_if_lock);
-	list_for_each_entry_safe(sdata, n, &local->sub_if_list, list) {
+	list_for_each_entry_safe(sdata, n, &local->interfaces, list) {
 		if ((sdata->type == id || id == -1) &&
 		    strcmp(name, sdata->dev->name) == 0 &&
 		    sdata->dev != local->mdev) {
-			list_del(&sdata->list);
-			write_unlock_bh(&local->sub_if_lock);
+			list_del_rcu(&sdata->list);
+			synchronize_rcu();
 			__ieee80211_if_del(local, sdata);
 			return 0;
 		}
 	}
-	write_unlock_bh(&local->sub_if_lock);
 	return -ENODEV;
 }
 
--- netdev-2.6.orig/net/mac80211/ieee80211_sta.c	2007-09-06 15:35:23.224447686 +0200
+++ netdev-2.6/net/mac80211/ieee80211_sta.c	2007-09-06 15:35:23.414447686 +0200
@@ -2659,8 +2659,8 @@ void ieee80211_scan_completed(struct iee
 	memset(&wrqu, 0, sizeof(wrqu));
 	wireless_send_event(dev, SIOCGIWSCAN, &wrqu, NULL);
 
-	read_lock(&local->sub_if_lock);
-	list_for_each_entry(sdata, &local->sub_if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
 
 		/* No need to wake the master device. */
 		if (sdata->dev == local->mdev)
@@ -2674,7 +2674,7 @@ void ieee80211_scan_completed(struct iee
 
 		netif_wake_queue(sdata->dev);
 	}
-	read_unlock(&local->sub_if_lock);
+	rcu_read_unlock();
 
 	sdata = IEEE80211_DEV_TO_SUB_IF(dev);
 	if (sdata->type == IEEE80211_IF_TYPE_IBSS) {
@@ -2811,8 +2811,8 @@ static int ieee80211_sta_start_scan(stru
 
 	local->sta_scanning = 1;
 
-	read_lock(&local->sub_if_lock);
-	list_for_each_entry(sdata, &local->sub_if_list, list) {
+	rcu_read_lock();
+	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
 
 		/* Don't stop the master interface, otherwise we can't transmit
 		 * probes! */
@@ -2824,7 +2824,7 @@ static int ieee80211_sta_start_scan(stru
 		    (sdata->u.sta.flags & IEEE80211_STA_ASSOCIATED))
 			ieee80211_send_nullfunc(local, sdata, 1);
 	}
-	read_unlock(&local->sub_if_lock);
+	rcu_read_unlock();
 
 	if (ssid) {
 		local->scan_ssid_len = ssid_len;
--- netdev-2.6.orig/net/mac80211/rx.c	2007-09-06 15:35:23.214447686 +0200
+++ netdev-2.6/net/mac80211/rx.c	2007-09-06 15:35:23.414447686 +0200
@@ -1385,8 +1385,9 @@ void __ieee80211_rx(struct ieee80211_hw 
 	}
 
 	/*
-	 * key references are protected using RCU and this requires that
-	 * we are in a read-site RCU section during receive processing
+	 * key references and virtual interfaces are protected using RCU
+	 * and this requires that we are in a read-side RCU section during
+	 * receive processing
 	 */
 	rcu_read_lock();
 
@@ -1441,8 +1442,7 @@ void __ieee80211_rx(struct ieee80211_hw 
 
 	bssid = ieee80211_get_bssid(hdr, skb->len - radiotap_len);
 
-	read_lock(&local->sub_if_lock);
-	list_for_each_entry(sdata, &local->sub_if_list, list) {
+	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
 		rx.flags |= IEEE80211_TXRXD_RXRA_MATCH;
 
 		if (!netif_running(sdata->dev))
@@ -1494,7 +1494,6 @@ void __ieee80211_rx(struct ieee80211_hw 
 					     &rx, sta);
 	} else
 		dev_kfree_skb(skb);
-	read_unlock(&local->sub_if_lock);
 
  end:
 	rcu_read_unlock();
--- netdev-2.6.orig/net/mac80211/tx.c	2007-09-06 15:35:23.074447686 +0200
+++ netdev-2.6/net/mac80211/tx.c	2007-09-06 15:35:23.424447686 +0200
@@ -291,8 +291,12 @@ static void purge_old_ps_buffers(struct 
 	struct ieee80211_sub_if_data *sdata;
 	struct sta_info *sta;
 
-	read_lock(&local->sub_if_lock);
-	list_for_each_entry(sdata, &local->sub_if_list, list) {
+	/*
+	 * virtual interfaces are protected by RCU
+	 */
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
 		struct ieee80211_if_ap *ap;
 		if (sdata->dev == local->mdev ||
 		    sdata->type != IEEE80211_IF_TYPE_AP)
@@ -305,7 +309,7 @@ static void purge_old_ps_buffers(struct 
 		}
 		total += skb_queue_len(&ap->ps_bc_buf);
 	}
-	read_unlock(&local->sub_if_lock);
+	rcu_read_unlock();
 
 	read_lock_bh(&local->sta_lock);
 	list_for_each_entry(sta, &local->sta_list, list) {

^ permalink raw reply

* Re: [TG3]: Workaround MSI bug on 5714/5780.
From: Andy Gospodarek @ 2007-09-06 13:45 UTC (permalink / raw)
  To: David Miller, mchan; +Cc: netdev, andy
In-Reply-To: <20070906.040235.71121480.davem@davemloft.net>

On Thu, Sep 06, 2007 at 04:02:35AM -0700, David Miller wrote:
> From: "Michael Chan" <mchan@broadcom.com>
> Date: Wed, 05 Sep 2007 13:07:08 -0700
> 
> > [TG3]: Workaround MSI bug on 5714/5780.
> > 
> > A hardware bug was revealed after a recent PCI MSI patch was made to
> > always disable legacy INTX when enabling MSI.  The 5714/5780 chips
> > will not generate MSI when INTX is disabled, causing MSI failure
> > messages to be reported, and another patch was made to workaround the
> > problem by disabling MSI on ServerWorks HT1000 bridge chips commonly
> > found with the 5714.
> > 
> > We workaround this chip bug by enabling INTX after we enable MSI and
> > after we resume from suspend.
> > 
> > Update version to 3.81.
> > 
> > This problem was discovered by David Miller.
> > 
> > Signed-off-by: Michael Chan <mchan@broadcom.com>
> 
> Thanks a lot Michael, I'll apply this and work on reverting that
> incorrect MSI chipset quirk too.

Michael,

Is is really necessary to get rid of the HT1000 patch?

commit e3008dedff4bdc96a5f67224cd3d8d12237082a0
Author: Andy Gospodarek <andy@greyhouse.net>
Date:   Thu May 10 22:58:57 2007 -0700

    PCI: disable MSI by default on systems with Serverworks HT1000 chips

This patch was designed to address tg3 and bnx2 messages printed on
systems with HT1000 chips IIRC.  Are we going back to those messages on
bnx2 or is there a workaround in that driver needed too?

-andy


^ permalink raw reply

* [patch] dgrs removal
From: maximilian attems @ 2007-09-06 13:50 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Al Viro, Jeff Garzik, netdev, Andres Salomon, Alan
In-Reply-To: <20070906133038.GI16317@baikonur.stro.at>

[-- Attachment #1: Type: text/plain, Size: 1396 bytes --]

nobody objected to adrian's rfc dropping dgrs on netdev back in dec. 2006.
-> http://marc.info/?l=linux-netdev&m=116507999803538&w=2
as bonus points this will get rid of a proprietary licensed firmware.

"The dgrs driver is crap and assumes direct access to PCI iomem"
- Stephen Hemminger <shemminger@linux-foundation.org>

dgrs has no users and it's hardware was never shipped.

Signed-off-by: maximilian attems <max@stro.at>
Acked-by: Andres Salomon <dilinger@debian.org>
Acked-by: Alan <alan@lxorguk.ukuu.org.uk>

find the patch attached, due to netdev size constraints
with the following diffstat
 Documentation/networking/dgrs.txt    |   52 
 b/Documentation/networking/00-INDEX  |    2 
 b/MAINTAINERS                        |    6 
 b/drivers/net/Kconfig                |   15 
 b/drivers/net/Makefile               |    1 
 b/drivers/net/wireless/rtl8187_dev.c |    1 
 drivers/net/dgrs.c                   | 1615 -----
 drivers/net/dgrs.h                   |   38 
 drivers/net/dgrs_asstruct.h          |   37 
 drivers/net/dgrs_bcomm.h             |  148 
 drivers/net/dgrs_es4h.h              |  183 
 drivers/net/dgrs_ether.h             |  135 
 drivers/net/dgrs_firmware.c          | 9966 -----------------------------------
 drivers/net/dgrs_i82596.h            |  473 -
 drivers/net/dgrs_plx9060.h           |  175 
 15 files changed, 1 insertion(+), 12846 deletions(-)

-- 
maks

[-- Attachment #2: dgrs-removal.bz2 --]
[-- Type: application/octet-stream, Size: 79614 bytes --]

^ permalink raw reply

* Re: [patch] dgrs removal
From: maximilian attems @ 2007-09-06 14:16 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Al Viro, Jeff Garzik, netdev, Andres Salomon, Alan
In-Reply-To: <20070906135023.GA16304@stro.at>

On Thu, Sep 06, 2007 at 03:50:23PM +0200, maximilian attems wrote:
>  b/drivers/net/wireless/rtl8187_dev.c |    1 
>  15 files changed, 1 insertion(+), 12846 deletions(-)

urrgs that was not meant to be there,
resending..

-- 
maks

^ permalink raw reply

* RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
From: James Chapman @ 2007-09-06 14:16 UTC (permalink / raw)
  To: netdev; +Cc: hadi, davem, jeff, mandeep.baines, ossthema

This RFC suggests some possible improvements to NAPI in the area of minimizing interrupt rates. A possible scheme to reduce interrupt rate for the low packet rate / fast CPU case is described. 

First, do we need to encourage consistency in NAPI poll drivers? A survey of current NAPI drivers shows different strategies being used in their poll(). Some such as r8169 do the napi_complete() if poll() does less work than their allowed budget. Others such as e100 and tg3 do napi_complete() only if they do no work at all. And some drivers use NAPI only for receive handling, perhaps setting txdone interrupts for 1 in N transmitted packets, while others do all "interrupt" processing in their poll(). Should we encourage more consistency? Should we encourage more NAPI driver maintainers to minimize interrupts by doing all rx _and_ tx processing in the poll(), and do napi_complete() only when the poll does _no_ work?

One well known issue with NAPI is that it is possible with certain traffic patterns for NAPI drivers to schedule in and out of polled mode very quickly. Worst case, a NAPI driver might get 1 interrupt per packet. With fast CPUs and interfaces, this can happen at high rates, causing high CPU loads and poor packet processing performance. Some drivers avoid this by using hardware interrupt mitigation features of the network device in tandem with NAPI to throttle the max interrupt rate per device. But this adds latency. Jamal's paper http://kernel.org/pub/linux/kernel/people/hadi/docs/UKUUG2005.pdf discusses this problem in some detail.

By making some small changes to the NAPI core, I think it is possible to prevent high interrupt rates with NAPI, regardless of traffic patterns and without using per-device hardware interrupt mitigation. The basic idea is that instead of immediately exiting polled mode when it finds no work to do, the driver's poll() keeps itself in active polled mode for 1-2 jiffies and only does napi_complete() when it does no work in that time period. When it does no work in its poll(), the driver can return 0 while leaving itself in the NAPI poll list. This means it is possible for the softirq processing to spin around its active device list, doing no work, since no quota is consumed. A change is therefore also needed in the NAPI core to detect the case when the only devices that are being actively pol
 led in softirq processing are doing no work on each poll and to exit the softirq loop rather than wasting CPU cycles.

The code changes are shown in the patch below. The patch is against the latest NAPI rework posted by DaveM http://marc.info/?l=linux-netdev&m=118829721407289&w=2. I used e100 and tg3 drivers to test. Since a driver that returns 0 from its poll() while leaving itself in polled mode would now used by the NAPI core as a condition for exiting the softirq poll loop, all existing NAPI drivers would need to conform to this new invariant. Some drivers, e.g. e100, can return 0 even if they do tx work in their poll().

Clearly, keeping a device in polled mode for 1-2 jiffies after it would otherwise have gone idle means that it might be called many times by the NAPI softirq while it has no work to do. This wastes CPU cycles. It would be important therefore to implement the driver's poll() to make this case as efficient as possible, perhaps testing for it early.

When a device is in polled mode while idle, there are 2 scheduling cases to consider:-

1. One or more other netdevs is not idle and is consuming quota on each poll. The net_rx softirq will loop until the next jiffy tick or when quota is exceeded, calling each device in its polled list. Since the idle device is still in the poll list, it will be polled very rapidly.

2. No other active device is in the poll list. The net_rx softirq will poll the idle device twice and then exit the softirq processing loop as if quota is exceeded. See the net_rx_action() changes in the patch which force the loop to exit if no work is being done by any device in the poll list.

In both cases described above, the scheduler will continue NAPI processing from ksoftirqd. This might be very soon, especially if the system is otherwise idle. But if the system is idle, do we really care that idle network devices will be polled for 1-2 jiffies? If the system is otherwise busy, ksoftirqd will share the CPU with other threads/processes which will reduce the poll rate anyway.

In testing, I see significant reduction in interrupt rate for typical traffic patterns. A flood ping, for example, keeps the device in polled mode, generating no interrupts. In a test, 8510 packets are sent/received versus 6200 previously; CPU load is 100% versus 62% previously; and 1 netdev interrupt occurs versus 12400 previously. Performance and CPU load under extreme network load (using pktgen) is unchanged, as expected. Most importantly though, it is no longer possible to find a combination of CPU performance and traffic pattern that induce high interrupt rates. And because hardware interrupt mitigation isn't used, packet latency is minimized.

The increase in CPU load isn't surprising for a flood ping test since the CPU is working to bounce packets as fast as it can. The increase in packet rate is a good indicator of how much the interrupt and NAPI scheduling overhead is. The CPU load shows 100% because ksoftirqd is always wanting the CPU for the duration of the flood ping. The beauty of NAPI is that the scheduler gets to decide which thread gets the CPU, not hardware CPU interrupt priorities. On my desktop system, I perceive _better_ system response (smoother X cursor movement etc) during the flood ping test, despite the CPU load being increased. For a system whose main job is processing network traffic quickly, like an embedded router or a network server, this approach might be very beneficial. For a desktop, I'm less sure, al
 though as I said above, I've noticed no performance issues in my setups to date.


Is this worth pursuing further? I'm considering doing more work to measure the effects at various relatively low packet rates. I also want to investigate using High Res Timers rather than jiffy sampling to reduce the idle poll time. Perhaps it is also worth trying HRT in the net_rx softirq too. I thought it would be worth throwing the ideas out there first to get early feedback.

Here's the patch.

Index: linux-2.6/drivers/net/e100.c
===================================================================
--- linux-2.6.orig/drivers/net/e100.c
+++ linux-2.6/drivers/net/e100.c
@@ -544,6 +544,7 @@ struct nic {
 	struct cb *cb_to_use;
 	struct cb *cb_to_send;
 	struct cb *cb_to_clean;
+	unsigned long exit_poll_time;
 	u16 tx_command;
 	/* End: frequently used values: keep adjacent for cache effect */
 
@@ -1993,12 +1994,35 @@ static int e100_poll(struct napi_struct 
 	e100_rx_clean(nic, &work_done, budget);
 	tx_cleaned = e100_tx_clean(nic);
 
-	/* If no Rx and Tx cleanup work was done, exit polling mode. */
-	if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
+	if (!netif_running(netdev)) {
 		netif_rx_complete(netdev, napi);
 		e100_enable_irq(nic);
+		return 0;
 	}
 
+	/* Stay in polled mode if we do any tx cleanup */
+	if (tx_cleaned)
+		work_done++;
+
+	/* If no Rx and Tx cleanup work was done, exit polling mode if
+	 * we've seen no work for 1-2 jiffies.
+	 */
+	if (work_done == 0) {
+		if (nic->exit_poll_time) {
+			if (time_after(jiffies, nic->exit_poll_time)) {
+				nic->exit_poll_time = 0;
+				netif_rx_complete(netdev, napi);
+				e100_enable_irq(nic);
+			}
+		} else {
+			nic->exit_poll_time = jiffies + 2;
+		}
+		return 0;
+	}
+
+	/* Otherwise, reset poll exit time and stay in poll list */
+	nic->exit_poll_time = 0;
+
 	return work_done;
 }
 
Index: linux-2.6/drivers/net/tg3.c
===================================================================
--- linux-2.6.orig/drivers/net/tg3.c
+++ linux-2.6/drivers/net/tg3.c
@@ -3473,6 +3473,24 @@ static int tg3_poll(struct napi_struct *
 	struct tg3_hw_status *sblk = tp->hw_status;
 	int work_done = 0;
 
+	/* fastpath having no work while we're holding ourself in
+	 * polled mode
+	 */
+	if ((tp->exit_poll_time) && (!tg3_has_work(tp))) {
+		if (time_after(jiffies, tp->exit_poll_time)) {
+			tp->exit_poll_time = 0;
+			/* tell net stack and NIC we're done */
+			netif_rx_complete(netdev, napi);
+			tg3_restart_ints(tp);
+		}
+		return 0;
+	}
+
+	/* if we get here, there might be work to do so disable the
+	 * poll hold fastpath above
+	 */
+	tp->exit_poll_time = 0;
+
 	/* handle link change and other phy events */
 	if (!(tp->tg3_flags &
 	      (TG3_FLAG_USE_LINKCHG_REG |
@@ -3511,11 +3529,11 @@ static int tg3_poll(struct napi_struct *
 	} else
 		sblk->status &= ~SD_STATUS_UPDATED;
 
-	/* if no more work, tell net stack and NIC we're done */
-	if (!tg3_has_work(tp)) {
-		netif_rx_complete(netdev, napi);
-		tg3_restart_ints(tp);
-	}
+	/* if no more work, set the time in jiffies when we should
+	 * exit polled mode
+	 */
+	if (!tg3_has_work(tp))
+		tp->exit_poll_time = jiffies + 2;
 
 	return work_done;
 }
Index: linux-2.6/drivers/net/tg3.h
===================================================================
--- linux-2.6.orig/drivers/net/tg3.h
+++ linux-2.6/drivers/net/tg3.h
@@ -2163,6 +2163,7 @@ struct tg3 {
 	u32				last_tag;
 
 	u32				msg_enable;
+	unsigned long			exit_poll_time;
 
 	/* begin "tx thread" cacheline section */
 	void				(*write32_tx_mbox) (struct tg3 *, u32,
Index: linux-2.6/net/core/dev.c
===================================================================
--- linux-2.6.orig/net/core/dev.c
+++ linux-2.6/net/core/dev.c
@@ -2073,6 +2073,8 @@ static void net_rx_action(struct softirq
 	unsigned long start_time = jiffies;
 	int budget = netdev_budget;
 	void *have;
+	struct napi_struct *last_hold = NULL;
+	int done = 0;
 
 	local_irq_disable();
 	list_replace_init(&__get_cpu_var(softnet_data).poll_list, &list);
@@ -2082,7 +2084,7 @@ static void net_rx_action(struct softirq
 		struct napi_struct *n;
 
 		/* if softirq window is exhuasted then punt */
-		if (unlikely(budget <= 0 || jiffies != start_time)) {
+		if (unlikely(budget <= 0 || jiffies != start_time || done)) {
 			local_irq_disable();
 			list_splice(&list, &__get_cpu_var(softnet_data).poll_list);
 			__raise_softirq_irqoff(NET_RX_SOFTIRQ);
@@ -2096,12 +2098,28 @@ static void net_rx_action(struct softirq
 
 		list_del(&n->poll_list);
 
-		/* if quota not exhausted process work */
+		/* if quota not exhausted process work. We special
+		 * case on n->poll() returning 0 here when the driver
+		 * doesn't do a napi_complete(), which indicates that
+		 * the device wants to stay on the poll list although
+		 * it did no work. We remember the first device to go
+		 * into this state in order to terminate this loop if
+		 * we see the same device again without doing any
+		 * other work.
+		 */
 		if (likely(n->quota > 0)) {
 			int work = n->poll(n, min(budget, n->quota));
 
-			budget -= work;
-			n->quota -= work;
+			if (likely(work)) {
+				budget -= work;
+				n->quota -= work;
+				last_hold = NULL;
+			} else if (test_bit(NAPI_STATE_SCHED, &n->state)) {
+				if (unlikely(n == last_hold))
+					done = 1;
+				if (likely(!last_hold))
+					last_hold = n;
+			}
 		}
 
 		/* if napi_complete not called, reschedule */

^ permalink raw reply

* Re: [patch] dgrs removal
From: maximilian attems @ 2007-09-06 14:24 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Al Viro, Jeff Garzik, netdev, Andres Salomon, Alan
In-Reply-To: <20070906141629.GK16317@baikonur.stro.at>

[-- Attachment #1: Type: text/plain, Size: 1257 bytes --]

nobody objected to adrian's rfc dropping dgrs on netdev back in dec. 2006.
-> http://marc.info/?l=linux-netdev&m=116507999803538&w=2
as bonus points this will get rid of a proprietary licensed firmware.

"The dgrs driver is crap and assumes direct access to PCI iomem"
- Stephen Hemminger <shemminger@linux-foundation.org>

dgrs has no users and it's hardware was never shipped.

Signed-off-by: maximilian attems <max@stro.at>
Acked-by: Andres Salomon <dilinger@debian.org>
Acked-by: Alan <alan@lxorguk.ukuu.org.uk>

---
sorry for the resend, now with corrected diffstat
 Documentation/networking/00-INDEX |    2 
 Documentation/networking/dgrs.txt |   52 
 MAINTAINERS                       |    6 
 drivers/net/Kconfig               |   15 
 drivers/net/Makefile              |    1 
 drivers/net/dgrs.c                | 1615 ----
 drivers/net/dgrs.h                |   38 
 drivers/net/dgrs_asstruct.h       |   37 
 drivers/net/dgrs_bcomm.h          |  148 
 drivers/net/dgrs_es4h.h           |  183 
 drivers/net/dgrs_ether.h          |  135 
 drivers/net/dgrs_firmware.c       | 9966 ------------------------------
 drivers/net/dgrs_i82596.h         |  473 -
 drivers/net/dgrs_plx9060.h        |  175 
 14 files changed, 12846 deletions(-)


-- 
maks

[-- Attachment #2: dgrs-removal.bz2 --]
[-- Type: application/octet-stream, Size: 79582 bytes --]

^ permalink raw reply

* Re: [TG3]: Workaround MSI bug on 5714/5780.
From: David Miller @ 2007-09-06 14:34 UTC (permalink / raw)
  To: andy; +Cc: mchan, netdev
In-Reply-To: <20070906134508.GA10317@gospo.rdu.redhat.com>

From: Andy Gospodarek <andy@greyhouse.net>
Date: Thu, 6 Sep 2007 09:45:16 -0400

> Is is really necessary to get rid of the HT1000 patch?

Yes, absolutely and without a single doubt.

> commit e3008dedff4bdc96a5f67224cd3d8d12237082a0
> Author: Andy Gospodarek <andy@greyhouse.net>
> Date:   Thu May 10 22:58:57 2007 -0700
> 
>     PCI: disable MSI by default on systems with Serverworks HT1000 chips
> 
> This patch was designed to address tg3 and bnx2 messages printed on
> systems with HT1000 chips IIRC.  Are we going back to those messages on
> bnx2 or is there a workaround in that driver needed too?

Every report of that problem is with tg3 chips afflicted by this
INTX bug which is now worked around properly.

Even my Niagara T1000 box was hit by this absolutely stupid and
wrong MSI quirk, which is how I found this problem to begin with.

I'm reverting it.


^ permalink raw reply

* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
From: Stephen Hemminger @ 2007-09-06 14:37 UTC (permalink / raw)
  To: James Chapman; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
In-Reply-To: <200709061416.l86EG0Vb017675@quickie.katalix.com>

On Thu, 6 Sep 2007 15:16:00 +0100
James Chapman <jchapman@katalix.com> wrote:

> This RFC suggests some possible improvements to NAPI in the area of minimizing interrupt rates. A possible scheme to reduce interrupt rate for the low packet rate / fast CPU case is described. 
> 
> First, do we need to encourage consistency in NAPI poll drivers? A survey of current NAPI drivers shows different strategies being used in their poll(). Some such as r8169 do the napi_complete() if poll() does less work than their allowed budget. Others such as e100 and tg3 do napi_complete() only if they do no work at all. And some drivers use NAPI only for receive handling, perhaps setting txdone interrupts for 1 in N transmitted packets, while others do all "interrupt" processing in their poll(). Should we encourage more consistency? Should we encourage more NAPI driver maintainers to minimize interrupts by doing all rx _and_ tx processing in the poll(), and do napi_complete() only when the poll does _no_ work?
> 
> One well known issue with NAPI is that it is possible with certain traffic patterns for NAPI drivers to schedule in and out of polled mode very quickly. Worst case, a NAPI driver might get 1 interrupt per packet. With fast CPUs and interfaces, this can happen at high rates, causing high CPU loads and poor packet processing performance. Some drivers avoid this by using hardware interrupt mitigation features of the network device in tandem with NAPI to throttle the max interrupt rate per device. But this adds latency. Jamal's paper http://kernel.org/pub/linux/kernel/people/hadi/docs/UKUUG2005.pdf discusses this problem in some detail.
> 
> By making some small changes to the NAPI core, I think it is possible to prevent high interrupt rates with NAPI, regardless of traffic patterns and without using per-device hardware interrupt mitigation. The basic idea is that instead of immediately exiting polled mode when it finds no work to do, the driver's poll() keeps itself in active polled mode for 1-2 jiffies and only does napi_complete() when it does no work in that time period. When it does no work in its poll(), the driver can return 0 while leaving itself in the NAPI poll list. This means it is possible for the softirq processing to spin around its active device list, doing no work, since no quota is consumed. A change is therefore also needed in the NAPI core to detect the case when the only devices that are being actively p
 olled in softirq processing are doing no work on each poll and to exit the softirq loop rather than wasting CPU cycles.
> 
> The code changes are shown in the patch below. The patch is against the latest NAPI rework posted by DaveM http://marc.info/?l=linux-netdev&m=118829721407289&w=2. I used e100 and tg3 drivers to test. Since a driver that returns 0 from its poll() while leaving itself in polled mode would now used by the NAPI core as a condition for exiting the softirq poll loop, all existing NAPI drivers would need to conform to this new invariant. Some drivers, e.g. e100, can return 0 even if they do tx work in their poll().
> 
> Clearly, keeping a device in polled mode for 1-2 jiffies after it would otherwise have gone idle means that it might be called many times by the NAPI softirq while it has no work to do. This wastes CPU cycles. It would be important therefore to implement the driver's poll() to make this case as efficient as possible, perhaps testing for it early.
> 
> When a device is in polled mode while idle, there are 2 scheduling cases to consider:-
> 
> 1. One or more other netdevs is not idle and is consuming quota on each poll. The net_rx softirq will loop until the next jiffy tick or when quota is exceeded, calling each device in its polled list. Since the idle device is still in the poll list, it will be polled very rapidly.
> 
> 2. No other active device is in the poll list. The net_rx softirq will poll the idle device twice and then exit the softirq processing loop as if quota is exceeded. See the net_rx_action() changes in the patch which force the loop to exit if no work is being done by any device in the poll list.
> 
> In both cases described above, the scheduler will continue NAPI processing from ksoftirqd. This might be very soon, especially if the system is otherwise idle. But if the system is idle, do we really care that idle network devices will be polled for 1-2 jiffies? If the system is otherwise busy, ksoftirqd will share the CPU with other threads/processes which will reduce the poll rate anyway.
> 
> In testing, I see significant reduction in interrupt rate for typical traffic patterns. A flood ping, for example, keeps the device in polled mode, generating no interrupts. In a test, 8510 packets are sent/received versus 6200 previously; CPU load is 100% versus 62% previously; and 1 netdev interrupt occurs versus 12400 previously. Performance and CPU load under extreme network load (using pktgen) is unchanged, as expected. Most importantly though, it is no longer possible to find a combination of CPU performance and traffic pattern that induce high interrupt rates. And because hardware interrupt mitigation isn't used, packet latency is minimized.
> 
> The increase in CPU load isn't surprising for a flood ping test since the CPU is working to bounce packets as fast as it can. The increase in packet rate is a good indicator of how much the interrupt and NAPI scheduling overhead is. The CPU load shows 100% because ksoftirqd is always wanting the CPU for the duration of the flood ping. The beauty of NAPI is that the scheduler gets to decide which thread gets the CPU, not hardware CPU interrupt priorities. On my desktop system, I perceive _better_ system response (smoother X cursor movement etc) during the flood ping test, despite the CPU load being increased. For a system whose main job is processing network traffic quickly, like an embedded router or a network server, this approach might be very beneficial. For a desktop, I'm less sure, 
 although as I said above, I've noticed no performance issues in my setups to date.
> 
> 
> Is this worth pursuing further? I'm considering doing more work to measure the effects at various relatively low packet rates. I also want to investigate using High Res Timers rather than jiffy sampling to reduce the idle poll time. Perhaps it is also worth trying HRT in the net_rx softirq too. I thought it would be worth throwing the ideas out there first to get early feedback.
> 


What about the latency that NAPI imposes? Right now there are certain applications that
don't like NAPI because it add several more microseconds, and this may make it worse.
Maybe a per-device flag or tuning parameters (like weight sysfs value)? or some other
way to set low-latency values.

^ permalink raw reply

* Re: [TG3]: Workaround MSI bug on 5714/5780.
From: Andy Gospodarek @ 2007-09-06 14:40 UTC (permalink / raw)
  To: David Miller; +Cc: mchan, netdev
In-Reply-To: <20070906.073423.74747465.davem@davemloft.net>

On 9/6/07, David Miller <davem@davemloft.net> wrote:
> From: Andy Gospodarek <andy@greyhouse.net>
> Date: Thu, 6 Sep 2007 09:45:16 -0400
>
> > Is is really necessary to get rid of the HT1000 patch?
>
> Yes, absolutely and without a single doubt.
>
> > commit e3008dedff4bdc96a5f67224cd3d8d12237082a0
> > Author: Andy Gospodarek <andy@greyhouse.net>
> > Date:   Thu May 10 22:58:57 2007 -0700
> >
> >     PCI: disable MSI by default on systems with Serverworks HT1000 chips
> >
> > This patch was designed to address tg3 and bnx2 messages printed on
> > systems with HT1000 chips IIRC.  Are we going back to those messages on
> > bnx2 or is there a workaround in that driver needed too?
>
> Every report of that problem is with tg3 chips afflicted by this
> INTX bug which is now worked around properly.
>
> Even my Niagara T1000 box was hit by this absolutely stupid and
> wrong MSI quirk, which is how I found this problem to begin with.
>
> I'm reverting it.
>

Sounds great.  I never liked it that much to begin with but no one
NACKED it, so it slipped in through Greg KH's tree.

^ permalink raw reply

* [PATCH] devinet: show all addresses assigned to interface
From: Stephen Hemminger @ 2007-09-06 15:10 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

Bug: http://bugzilla.kernel.org/show_bug.cgi?id=8876

Not all ips are shown by "ip addr show" command when IPs number assigned to an
interface is more than 60-80 (in fact it depends on broadcast/label etc
presence on each address).

Steps to reproduce:
It's terribly simple to reproduce:

# for i in $(seq 1 100); do ip ad add 10.0.$i.1/24 dev eth10 ; done
# ip addr show

this will _not_ show all IPs.
Looks like the problem is in netlink/ipv4 message processing.

This is fix from bug submitter, it looks correct.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>

--- a/net/ipv4/devinet.c	2007-08-15 12:56:38.000000000 +0100
+++ b/net/ipv4/devinet.c	2007-09-06 16:02:59.000000000 +0100
@@ -1193,7 +1193,7 @@ static int inet_dump_ifaddr(struct sk_bu
 		for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
 		     ifa = ifa->ifa_next, ip_idx++) {
 			if (ip_idx < s_ip_idx)
-				goto cont;
+				continue;
 			if (inet_fill_ifaddr(skb, ifa, NETLINK_CB(cb->skb).pid,
 					     cb->nlh->nlmsg_seq,
 					     RTM_NEWADDR, NLM_F_MULTI) <= 0)

^ permalink raw reply

* Re: [TG3]: Workaround MSI bug on 5714/5780.
From: Michael Chan @ 2007-09-06 15:11 UTC (permalink / raw)
  To: Andy Gospodarek, David Miller; +Cc: netdev
In-Reply-To: <20070906134508.GA10317@gospo.rdu.redhat.com>

Andy Gospodarek wrote:

> This patch was designed to address tg3 and bnx2 messages printed on
> systems with HT1000 chips IIRC.  Are we going back to those 
> messages on
> bnx2 or is there a workaround in that driver needed too?
> 

Andy, I've checked all the Broadcom chips including bnx2 chips, only
the 5714 and 5780 have this problem.


^ permalink raw reply

* Re: BUG: scheduling while atomic: ifconfig/0x00000002/4170
From: Johannes Berg @ 2007-09-06 15:19 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Satyam Sharma, Florian Lohoff, Linux Kernel Mailing List, Netdev,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA, Michal Piotrowski,
	ipw3945-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	yi.zhu-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <20070906082301.GB21929-lOAM2aK0SrRLBo1qDEOMRrpzq4S04n8Q@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 676 bytes --]

On Thu, 2007-09-06 at 16:23 +0800, Herbert Xu wrote:
> On Thu, Sep 06, 2007 at 10:32:33AM +0530, Satyam Sharma wrote:
> > 
> > > > [  382.529041]  [<c02c8abc>] dev_close+0x24/0x67
> > > > [  382.529052]  [<e01f402b>] ieee80211_master_stop+0x4a/0x6d [mac80211]
> 
> This is where the bug is.  You cannot call dev_close from an
> atomic context as i33380211_master_stop does it within spin
> locks.

Oh btw. Can we stick a might_sleep() into dev_close() *before* the test
whether the device is up? That way, we'd have seen the bug, but
apparently nobody before Florian ever did a 'ip link set wmaster0 down'
while the other interfaces were still open.

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: [TOMOYO 15/15] LSM expansion for TOMOYO Linux.
From: Paul Moore @ 2007-09-06 15:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-kernel, linux-security-module, chrisw, netdev
In-Reply-To: <200709062204.DFH97819.JLOHVOOSFMtFQF@I-love.SAKURA.ne.jp>

On Thursday, September 6 2007 9:04:01 am Tetsuo Handa wrote:
> (1) It uses userspace intervention to allow/reject
>     connections and/or packets based on the application's domain.
>     Since existent hooks can't be used for this purpose,
>     I inserted a new hook post_recv_datagram() at skb_recv_datagram()
>     and I modified socket_post_accept() to return error so that
>     I can drop/disconnect based on the application's domain.
>
>     I think skb_recv_datagram() is the only place that can remove
>     a message picked up with MSG_PEEK flags from the receive queue.
>     To remove a message picked up with MSG_PEEK flags, I noticed that
>     I have to do skb_kill_datagram()-like operation so that
>     "the head message that must not be delivered to the caller" won't
> prevent picking up of "the non-head message that should be delivered to the
> caller" when the caller repeats only recv(MSG_PEEK) requests.
>     Since skb_recv_datagram() can be called from interrupt context,
>     I have to use spin_lock_irqsave() instead for spin_lock_bh(), am I
> right?

There are almost certainly better people to answer locking questions, but here 
is my take on it ... If you are accessing data both in a bottom half and 
elsewhere you need to make sure you disable bottom halfs from running before 
you access the data outside the bottom half (spin_lock_bh()).  If you are 
accessing data both in an interrupt handler and elsewhere you need to make 
sure you disable interrupts when accessing data outside the irq handler 
(spin_lock_irqsave()).

>     By the way, why can't socket_post_accept() fail?
>     Someone may wish to do memory allocation at socket_post_accept().
>     socket_accept() is too early for memory allocation because
>     there is no chance to free allocated memory
>     when sock->ops->accept() failed.
>     I think socket_post_accept() should be able to fail.

>From my experience the community disapproves of approaches which go through 
the entire TCP handshake and then terminate the connection, which is what 
allowing security_socket_post_accept() to fail would do.

-- 
paul moore
linux security @ hp

^ permalink raw reply

* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
From: James Chapman @ 2007-09-06 15:30 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
In-Reply-To: <20070906153700.57a0c448@oldman>

Stephen Hemminger wrote:

> What about the latency that NAPI imposes? Right now there are certain applications that
> don't like NAPI because it add several more microseconds, and this may make it worse.

Latency is something that I think this approach will actually improve, 
at the expense of additional polling. Or is it the ksoftirqd scheduling 
latency that you are referring to?

> Maybe a per-device flag or tuning parameters (like weight sysfs value)? or some other
> way to set low-latency values.

Yes. I'd like to think good defaults could be derived though, perhaps 
based on settings like CONFIG_PREEMPT, CONFIG_HIGH_RES_TIMER, CONFIG_HZ 
and maybe even bogomips / nr_cpus.

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply

* [PATCH][MIPS][7/7] AR7: ethernet
From: Matteo Croce @ 2007-09-06 15:34 UTC (permalink / raw)
  To: linux-mips
  Cc: Eugene Konev, akpm, jgarzik, netdev, davem, kuznet, pekkas,
	jmorris, yoshfuji, kaber
In-Reply-To: <200708201704.11529.technoboy85@gmail.com>

Driver for the cpmac 100M ethernet driver.
It works fine disabling napi support, enabling it gives a kernel panic
when the first IPv6 packet has to be forwarded.
Other than that works fine.

Signed-off-by: Matteo Croce <technoboy85@gmail.com>
Signed-off-by: Eugene Konev <ejka@imfi.kspu.ru>

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index d9b7d9c..6f38a84 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -1822,6 +1822,15 @@ config SC92031
 	  To compile this driver as a module, choose M here: the module
 	  will be called sc92031.  This is recommended.
 
+config CPMAC
+	tristate "TI AR7 CPMAC Ethernet support (EXPERIMENTAL)"
+	depends on NET_ETHERNET && EXPERIMENTAL && AR7
+	select PHYLIB
+	select FIXED_PHY
+	select FIXED_MII_100_FDX
+	help
+	  TI AR7 CPMAC Ethernet support
+
 config NET_POCKET
 	bool "Pocket and portable adapters"
 	depends on PARPORT
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 535d2a0..bb22df9 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -156,6 +156,7 @@ obj-$(CONFIG_8139CP) += 8139cp.o
 obj-$(CONFIG_8139TOO) += 8139too.o
 obj-$(CONFIG_ZNET) += znet.o
 obj-$(CONFIG_LAN_SAA9730) += saa9730.o
+obj-$(CONFIG_CPMAC) += cpmac.o
 obj-$(CONFIG_DEPCA) += depca.o
 obj-$(CONFIG_EWRK3) += ewrk3.o
 obj-$(CONFIG_ATP) += atp.o
diff --git a/drivers/net/cpmac.c b/drivers/net/cpmac.c
new file mode 100644
index 0000000..a20a5b6
--- /dev/null
+++ b/drivers/net/cpmac.c
@@ -0,0 +1,1217 @@
+/*
+ * Copyright (C) 2006, 2007 Eugene Konev
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/delay.h>
+#include <linux/version.h>
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/skbuff.h>
+#include <linux/mii.h>
+#include <linux/phy.h>
+#include <linux/platform_device.h>
+#include <asm/ar7/ar7.h>
+#include <gpio.h>
+
+MODULE_AUTHOR("Eugene Konev");
+MODULE_DESCRIPTION("TI AR7 ethernet driver (CPMAC)");
+MODULE_LICENSE("GPL");
+
+static int rx_ring_size = 64;
+static int disable_napi;
+module_param(rx_ring_size, int, 64);
+module_param(disable_napi, int, 0);
+MODULE_PARM_DESC(rx_ring_size, "Size of rx ring (in skbs)");
+MODULE_PARM_DESC(disable_napi, "Disable NAPI polling");
+
+/* Register definitions */
+struct cpmac_control_regs {
+	volatile u32 revision;
+	volatile u32 control;
+	volatile u32 teardown;
+	volatile u32 unused;
+} __attribute__ ((packed));
+
+struct cpmac_int_regs {
+	volatile u32 stat_raw;
+	volatile u32 stat_masked;
+	volatile u32 enable;
+	volatile u32 clear;
+} __attribute__ ((packed));
+
+struct cpmac_stats {
+	volatile u32 good;
+	volatile u32 bcast;
+	volatile u32 mcast;
+	volatile u32 pause;
+	volatile u32 crc_error;
+	volatile u32 align_error;
+	volatile u32 oversized;
+	volatile u32 jabber;
+	volatile u32 undersized;
+	volatile u32 fragment;
+	volatile u32 filtered;
+	volatile u32 qos_filtered;
+	volatile u32 octets;
+} __attribute__ ((packed));
+
+struct cpmac_regs {
+	struct cpmac_control_regs tx_ctrl;
+	struct cpmac_control_regs rx_ctrl;
+	volatile u32 unused1[56];
+	volatile u32 mbp;
+/* MBP bits */
+#define MBP_RXPASSCRC         0x40000000
+#define MBP_RXQOS             0x20000000
+#define MBP_RXNOCHAIN         0x10000000
+#define MBP_RXCMF             0x01000000
+#define MBP_RXSHORT           0x00800000
+#define MBP_RXCEF             0x00400000
+#define MBP_RXPROMISC         0x00200000
+#define MBP_PROMISCCHAN(chan) (((chan) & 0x7) << 16)
+#define MBP_RXBCAST           0x00002000
+#define MBP_BCASTCHAN(chan)   (((chan) & 0x7) << 8)
+#define MBP_RXMCAST           0x00000020
+#define MBP_MCASTCHAN(chan)   ((chan) & 0x7)
+	volatile u32 unicast_enable;
+	volatile u32 unicast_clear;
+	volatile u32 max_len;
+	volatile u32 buffer_offset;
+	volatile u32 filter_flow_threshold;
+	volatile u32 unused2[2];
+	volatile u32 flow_thre[8];
+	volatile u32 free_buffer[8];
+	volatile u32 mac_control;
+#define MAC_TXPTYPE  0x00000200
+#define MAC_TXPACE   0x00000040
+#define MAC_MII      0x00000020
+#define MAC_TXFLOW   0x00000010
+#define MAC_RXFLOW   0x00000008
+#define MAC_MTEST    0x00000004
+#define MAC_LOOPBACK 0x00000002
+#define MAC_FDX      0x00000001
+	volatile u32 mac_status;
+#define MACST_QOS    0x4
+#define MACST_RXFLOW 0x2
+#define MACST_TXFLOW 0x1
+	volatile u32 emc_control;
+	volatile u32 unused3;
+	struct cpmac_int_regs tx_int;
+	volatile u32 mac_int_vector;
+/* Int Status bits */
+#define INTST_STATUS 0x80000
+#define INTST_HOST   0x40000
+#define INTST_RX     0x20000
+#define INTST_TX     0x10000
+	volatile u32 mac_eoi_vector;
+	volatile u32 unused4[2];
+	struct cpmac_int_regs rx_int;
+	volatile u32 mac_int_stat_raw;
+	volatile u32 mac_int_stat_masked;
+	volatile u32 mac_int_enable;
+	volatile u32 mac_int_clear;
+	volatile u32 mac_addr_low[8];
+	volatile u32 mac_addr_mid;
+	volatile u32 mac_addr_high;
+	volatile u32 mac_hash_low;
+	volatile u32 mac_hash_high;
+	volatile u32 boff_test;
+	volatile u32 pac_test;
+	volatile u32 rx_pause;
+	volatile u32 tx_pause;
+	volatile u32 unused5[2];
+	struct cpmac_stats rx_stats;
+	struct cpmac_stats tx_stats;
+	volatile u32 unused6[232];
+	volatile u32 tx_ptr[8];
+	volatile u32 rx_ptr[8];
+	volatile u32 tx_ack[8];
+	volatile u32 rx_ack[8];
+
+} __attribute__ ((packed));
+
+struct cpmac_mdio_regs {
+	volatile u32 version;
+	volatile u32 control;
+#define MDIOC_IDLE        0x80000000
+#define MDIOC_ENABLE      0x40000000
+#define MDIOC_PREAMBLE    0x00100000
+#define MDIOC_FAULT       0x00080000
+#define MDIOC_FAULTDETECT 0x00040000
+#define MDIOC_INTTEST     0x00020000
+#define MDIOC_CLKDIV(div) ((div) & 0xff)
+	volatile u32 alive;
+	volatile u32 link;
+	struct cpmac_int_regs link_int;
+	struct cpmac_int_regs user_int;
+	u32 unused[20];
+	volatile u32 access;
+#define MDIO_BUSY       0x80000000
+#define MDIO_WRITE      0x40000000
+#define MDIO_REG(reg)   (((reg) & 0x1f) << 21)
+#define MDIO_PHY(phy)   (((phy) & 0x1f) << 16)
+#define MDIO_DATA(data) ((data) & 0xffff)
+	volatile u32 physel;
+} __attribute__ ((packed));
+
+/* Descriptor */
+struct cpmac_desc {
+	u32 hw_next;
+	u32 hw_data;
+	u16 buflen;
+	u16 bufflags;
+	u16 datalen;
+	u16 dataflags;
+/* Flags bits */
+#define CPMAC_SOP 0x8000
+#define CPMAC_EOP 0x4000
+#define CPMAC_OWN 0x2000
+#define CPMAC_EOQ 0x1000
+	struct sk_buff *skb;
+	struct cpmac_desc *next;
+} __attribute__ ((packed));
+
+struct cpmac_priv {
+	struct net_device_stats stats;
+	spinlock_t lock; /* irq{save,restore} */
+	struct sk_buff *skb_pool;
+	int free_skbs;
+	struct cpmac_desc *rx_head;
+	int tx_head, tx_tail;
+	struct cpmac_desc *desc_ring;
+	struct cpmac_regs *regs;
+	struct mii_bus *mii_bus;
+	struct phy_device *phy;
+	char phy_name[BUS_ID_SIZE];
+	struct plat_cpmac_data *config;
+	int oldlink, oldspeed, oldduplex;
+	u32 msg_enable;
+	struct net_device *dev;
+	struct work_struct alloc_work;
+};
+
+static irqreturn_t cpmac_irq(int, void *);
+static void cpmac_reset(struct net_device *dev);
+static void cpmac_hw_init(struct net_device *dev);
+static int cpmac_stop(struct net_device *dev);
+static int cpmac_open(struct net_device *dev);
+
+#undef CPMAC_DEBUG
+#define CPMAC_LOW_THRESH 32
+#define CPMAC_ALLOC_SIZE 64
+#define CPMAC_SKB_SIZE 1518
+#define CPMAC_TX_RING_SIZE 8
+
+#ifdef CPMAC_DEBUG
+static void cpmac_dump_regs(u32 *base, int count)
+{
+	int i;
+	for (i = 0; i < (count + 3) / 4; i++) {
+		if (i % 4 == 0) printk("\nCPMAC[0x%04x]:", i * 4);
+		printk(" 0x%08x", *(base + i));
+	}
+	printk("\n");
+}
+
+static const char *cpmac_dump_buf(const uint8_t *buf, unsigned size)
+{
+	static char buffer[3 * 25 + 1];
+	char *p = &buffer[0];
+	if (size > 20)
+		size = 20;
+	while (size-- > 0) {
+		p += sprintf(p, " %02x", *buf++);
+	}
+	return buffer;
+}
+#endif
+
+static int cpmac_mdio_read(struct mii_bus *bus, int phy_id, int regnum)
+{
+	struct cpmac_mdio_regs *regs = (struct cpmac_mdio_regs *)bus->priv;
+	volatile u32 val;
+
+	while ((val = regs->access) & MDIO_BUSY);
+	regs->access = MDIO_BUSY | MDIO_REG(regnum & 0x1f) |
+		MDIO_PHY(phy_id & 0x1f);
+	while ((val = regs->access) & MDIO_BUSY);
+
+	return val & 0xffff;
+}
+
+static int cpmac_mdio_write(struct mii_bus *bus, int phy_id, int regnum, u16 val)
+{
+	struct cpmac_mdio_regs *regs = (struct cpmac_mdio_regs *)bus->priv;
+	volatile u32 tmp;
+
+	while ((tmp = regs->access) & MDIO_BUSY);
+	regs->access = MDIO_BUSY | MDIO_WRITE |
+		MDIO_REG(regnum & 0x1f) | MDIO_PHY(phy_id & 0x1f) |
+		val;
+
+	return 0;
+}
+
+static int cpmac_mdio_reset(struct mii_bus *bus)
+{
+	struct cpmac_mdio_regs *regs = (struct cpmac_mdio_regs *)bus->priv;
+
+	ar7_device_reset(AR7_RESET_BIT_MDIO);
+	regs->control = MDIOC_ENABLE |
+		MDIOC_CLKDIV(ar7_cpmac_freq() / 2200000 - 1);
+
+	return 0;
+}
+
+static int mii_irqs[PHY_MAX_ADDR] = { PHY_POLL, };
+
+static struct mii_bus cpmac_mii = {
+	.name = "cpmac-mii",
+	.read = cpmac_mdio_read,
+	.write = cpmac_mdio_write,
+	.reset = cpmac_mdio_reset,
+	.irq = mii_irqs,
+};
+
+static int cpmac_config(struct net_device *dev, struct ifmap *map)
+{
+	if (dev->flags & IFF_UP)
+		return -EBUSY;
+
+	/* Don't allow changing the I/O address */
+	if (map->base_addr != dev->base_addr)
+		return -EOPNOTSUPP;
+
+	/* ignore other fields */
+	return 0;
+}
+
+static int cpmac_set_mac_address(struct net_device *dev, void *addr)
+{
+	struct sockaddr *sa = addr;
+
+	if (dev->flags & IFF_UP)
+		return -EBUSY;
+
+	memcpy(dev->dev_addr, sa->sa_data, dev->addr_len);
+
+	return 0;
+}
+
+static void cpmac_set_multicast_list(struct net_device *dev)
+{
+	struct dev_mc_list *iter;
+	int i;
+	int hash, tmp;
+	int hashlo = 0, hashhi = 0;
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	if (dev->flags & IFF_PROMISC) {
+		priv->regs->mbp &= ~MBP_PROMISCCHAN(0); /* promisc channel 0 */
+		priv->regs->mbp |= MBP_RXPROMISC;
+	} else {
+		priv->regs->mbp &= ~MBP_RXPROMISC;
+		if (dev->flags & IFF_ALLMULTI) {
+			/* enable all multicast mode */
+			priv->regs->mac_hash_low = 0xffffffff;
+			priv->regs->mac_hash_high = 0xffffffff;
+		} else {
+			for (i = 0, iter = dev->mc_list; i < dev->mc_count;
+			    i++, iter = iter->next) {
+				hash = 0;
+				tmp = iter->dmi_addr[0];
+				hash  ^= (tmp >> 2) ^ (tmp << 4);
+				tmp = iter->dmi_addr[1];
+				hash  ^= (tmp >> 4) ^ (tmp << 2);
+				tmp = iter->dmi_addr[2];
+				hash  ^= (tmp >> 6) ^ tmp;
+				tmp = iter->dmi_addr[4];
+				hash  ^= (tmp >> 2) ^ (tmp << 4);
+				tmp = iter->dmi_addr[5];
+				hash  ^= (tmp >> 4) ^ (tmp << 2);
+				tmp = iter->dmi_addr[6];
+				hash  ^= (tmp >> 6) ^ tmp;
+				hash &= 0x3f;
+				if (hash < 32) {
+					hashlo |= 1<<hash;
+				} else {
+					hashhi |= 1<<(hash - 32);
+				}
+			}
+
+			priv->regs->mac_hash_low = hashlo;
+			priv->regs->mac_hash_high = hashhi;
+		}
+	}
+}
+
+static struct sk_buff *cpmac_get_skb(struct net_device *dev)
+{
+	struct sk_buff *skb;
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	skb = priv->skb_pool;
+	if (likely(skb)) {
+		priv->skb_pool = skb->next;
+	} else {
+		skb = dev_alloc_skb(CPMAC_SKB_SIZE + 2);
+		if (skb) {
+			skb->next = NULL;
+			skb_reserve(skb, 2);
+			skb->dev = priv->dev;
+		}
+	}
+
+	if (likely(priv->free_skbs))
+		priv->free_skbs--;
+
+	if (priv->free_skbs < CPMAC_LOW_THRESH)
+		schedule_work(&priv->alloc_work);
+
+	return skb;
+}
+
+static inline struct sk_buff *cpmac_rx_one(struct net_device *dev,
+					   struct cpmac_priv *priv,
+					   struct cpmac_desc *desc)
+{
+	unsigned long flags;
+	char *data;
+	struct sk_buff *skb, *result = NULL;
+
+	priv->regs->rx_ack[0] = virt_to_phys(desc);
+	if (unlikely(!desc->datalen)) {
+		if (printk_ratelimit())
+			printk(KERN_WARNING "%s: rx: spurious interrupt\n",
+			       dev->name);
+		priv->stats.rx_errors++;
+		return NULL;
+	}
+
+	spin_lock_irqsave(&priv->lock, flags);
+	skb = cpmac_get_skb(dev);
+	if (likely(skb)) {
+		data = (char *)phys_to_virt(desc->hw_data);
+		dma_cache_inv((u32)data, desc->datalen);
+		skb_put(desc->skb, desc->datalen);
+		desc->skb->protocol = eth_type_trans(desc->skb, dev);
+		desc->skb->ip_summed = CHECKSUM_NONE;
+		priv->stats.rx_packets++;
+		priv->stats.rx_bytes += desc->datalen;
+		result = desc->skb;
+		desc->skb = skb;
+	} else {
+#ifdef CPMAC_DEBUG
+		if (printk_ratelimit())
+			printk("%s: low on skbs, dropping packet\n",
+			       dev->name);
+#endif
+		priv->stats.rx_dropped++;
+	}
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	desc->hw_data = virt_to_phys(desc->skb->data);
+	desc->buflen = CPMAC_SKB_SIZE;
+	desc->dataflags = CPMAC_OWN;
+	dma_cache_wback((u32)desc, 16);
+
+	return result;
+}
+
+static void cpmac_rx(struct net_device *dev)
+{
+	struct sk_buff *skb;
+	struct cpmac_desc *desc;
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	spin_lock(&priv->lock);
+	if (unlikely(!priv->rx_head)) {
+		spin_unlock(&priv->lock);
+		return;
+	}
+
+	desc = priv->rx_head;
+	dma_cache_inv((u32)desc, 16);
+#ifdef CPMAC_DEBUG
+	printk(KERN_DEBUG "%s: len=%d, %s\n", __func__, pkt->datalen,
+		cpmac_dump_buf(data, pkt->datalen));
+#endif
+
+	while ((desc->dataflags & CPMAC_OWN) == 0) {
+		skb = cpmac_rx_one(dev, priv, desc);
+		if (likely(skb)) {
+			netif_rx(skb);
+		}
+		desc = desc->next;
+		dma_cache_inv((u32)desc, 16);
+	}
+
+	priv->rx_head = desc;
+	priv->regs->rx_ptr[0] = virt_to_phys(desc);
+	spin_unlock(&priv->lock);
+}
+
+static int cpmac_poll(struct net_device *dev, int *budget)
+{
+	struct sk_buff *skb;
+	struct cpmac_desc *desc;
+	int received = 0, quota = min(dev->quota, *budget);
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	if (unlikely(!priv->rx_head)) {
+		if (printk_ratelimit())
+			printk(KERN_WARNING "%s: rx: polling, but no queue\n",
+			       dev->name);
+		netif_rx_complete(dev);
+		return 0;
+	}
+
+	desc = priv->rx_head;
+	dma_cache_inv((u32)desc, 16);
+
+	while ((received < quota) && ((desc->dataflags & CPMAC_OWN) == 0)) {
+		skb = cpmac_rx_one(dev, priv, desc);
+		if (likely(skb)) {
+			netif_receive_skb(skb);
+			received++;
+		}
+		desc = desc->next;
+		priv->rx_head = desc;
+		dma_cache_inv((u32)desc, 16);
+	}
+
+	*budget -= received;
+	dev->quota -= received;
+#ifdef CPMAC_DEBUG
+	printk("%s: processed %d packets\n", dev->name, received);
+#endif
+	if (desc->dataflags & CPMAC_OWN) {
+		priv->regs->rx_ptr[0] = virt_to_phys(desc);
+		netif_rx_complete(dev);
+		priv->regs->rx_int.enable = 0x1;
+		priv->regs->rx_int.clear = 0xfe;
+		return 0;
+	}
+
+	return 1;
+}
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 20)
+static void
+cpmac_alloc_skbs(struct work_struct *work)
+{
+	struct cpmac_priv *priv = container_of(work, struct cpmac_priv,
+			alloc_work);
+#else
+static void
+cpmac_alloc_skbs(void *data)
+{
+	struct net_device *dev = (struct net_device *)data;
+	struct cpmac_priv *priv = netdev_priv(dev);
+#endif
+	unsigned long flags;
+	int i, num_skbs = 0;
+	struct sk_buff *skb, *skbs = NULL;
+
+	for (i = 0; i < CPMAC_ALLOC_SIZE; i++) {
+		skb = alloc_skb(CPMAC_SKB_SIZE + 2, GFP_KERNEL);
+		if (!skb)
+			break;
+		skb->next = skbs;
+		skb_reserve(skb, 2);
+		skb->dev = priv->dev;
+		num_skbs++;
+		skbs = skb;
+	}
+
+	if (skbs) {
+		spin_lock_irqsave(&priv->lock, flags);
+		for (skb = priv->skb_pool; skb && skb->next; skb = skb->next);
+		if (!skb) {
+			priv->skb_pool = skbs;
+		} else {
+			skb->next = skbs;
+		}
+		priv->free_skbs += num_skbs;
+		spin_unlock_irqrestore(&priv->lock, flags);
+#ifdef CPMAC_DEBUG
+		printk("%s: allocated %d skbs\n", priv->dev->name, num_skbs);
+#endif
+	}
+}
+
+static int cpmac_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	unsigned long flags;
+	int len, chan;
+	struct cpmac_desc *desc;
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	len = skb->len;
+#ifdef CPMAC_DEBUG
+	printk(KERN_DEBUG "%s: len=%d\n", __func__, len); /* cpmac_dump_buf(const uint8_t * buf, unsigned size) */
+#endif
+	if (unlikely(len < ETH_ZLEN)) {
+		if (unlikely(skb_padto(skb, ETH_ZLEN))) {
+			if (printk_ratelimit())
+				printk(KERN_NOTICE "%s: padding failed, dropping\n",
+				       dev->name);
+			spin_lock_irqsave(&priv->lock, flags);
+			priv->stats.tx_dropped++;
+			spin_unlock_irqrestore(&priv->lock, flags);
+			return -ENOMEM;
+		}
+		len = ETH_ZLEN;
+	}
+	spin_lock_irqsave(&priv->lock, flags);
+	chan = priv->tx_tail++;
+	priv->tx_tail %= 8;
+	if (priv->tx_tail == priv->tx_head)
+		netif_stop_queue(dev);
+
+	desc = &priv->desc_ring[chan];
+	dma_cache_inv((u32)desc, 16);
+	if (desc->dataflags & CPMAC_OWN) {
+		printk(KERN_NOTICE "%s: tx dma ring full, dropping\n", dev->name);
+		priv->stats.tx_dropped++;
+		spin_unlock_irqrestore(&priv->lock, flags);
+		return -ENOMEM;
+	}
+
+	dev->trans_start = jiffies;
+	desc->dataflags = CPMAC_SOP | CPMAC_EOP | CPMAC_OWN;
+	desc->skb = skb;
+	desc->hw_data = virt_to_phys(skb->data);
+	dma_cache_wback((u32)skb->data, len);
+	desc->buflen = len;
+	desc->datalen = len;
+	desc->hw_next = 0;
+	dma_cache_wback((u32)desc, 16);
+	priv->regs->tx_ptr[chan] = virt_to_phys(desc);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	return 0;
+}
+
+static void cpmac_end_xmit(struct net_device *dev, int channel)
+{
+	struct cpmac_desc *desc;
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	spin_lock(&priv->lock);
+	desc = &priv->desc_ring[channel];
+	priv->regs->tx_ack[channel] = virt_to_phys(desc);
+	if (likely(desc->skb)) {
+		priv->stats.tx_packets++;
+		priv->stats.tx_bytes += desc->skb->len;
+		dev_kfree_skb_irq(desc->skb);
+		if (netif_queue_stopped(dev))
+			netif_wake_queue(dev);
+	} else {
+		if (printk_ratelimit())
+			printk(KERN_NOTICE "%s: end_xmit: spurious interrupt\n",
+			       dev->name);
+	}
+	spin_unlock(&priv->lock);
+}
+
+static void cpmac_reset(struct net_device *dev)
+{
+	int i;
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	ar7_device_reset(priv->config->reset_bit);
+	priv->regs->rx_ctrl.control &= ~1;
+	priv->regs->tx_ctrl.control &= ~1;
+	for (i = 0; i < 8; i++) {
+		priv->regs->tx_ptr[i] = 0;
+		priv->regs->rx_ptr[i] = 0;
+	}
+	priv->regs->mac_control &= ~MAC_MII; /* disable mii */
+}
+
+static inline void cpmac_free_rx_ring(struct net_device *dev)
+{
+	struct cpmac_desc *desc;
+	int i;
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	if (unlikely(!priv->rx_head))
+		return;
+
+	desc = priv->rx_head;
+	dma_cache_inv((u32)desc, 16);
+
+	for (i = 0; i < rx_ring_size; i++) {
+		desc->buflen = CPMAC_SKB_SIZE;
+		if ((desc->dataflags & CPMAC_OWN) == 0) {
+			desc->dataflags = CPMAC_OWN;
+			priv->stats.rx_dropped++;
+		}
+		dma_cache_wback((u32)desc, 16);
+		desc = desc->next;
+		dma_cache_inv((u32)desc, 16);
+	}
+}
+
+static irqreturn_t cpmac_irq(int irq, void *dev_id)
+{
+	struct net_device *dev = (struct net_device *)dev_id;
+	struct cpmac_priv *priv = netdev_priv(dev);
+	u32 status;
+
+	if (!dev)
+		return IRQ_NONE;
+
+	status = priv->regs->mac_int_vector;
+
+	if (status & INTST_TX) {
+		cpmac_end_xmit(dev, (status & 7));
+	}
+
+	if (status & INTST_RX) {
+		if (disable_napi) {
+			cpmac_rx(dev);
+		} else {
+			priv->regs->rx_int.enable = 0;
+			priv->regs->rx_int.clear = 0xff;
+			netif_rx_schedule(dev);
+		}
+	}
+
+	priv->regs->mac_eoi_vector = 0;
+
+	if (unlikely(status & (INTST_HOST | INTST_STATUS))) {
+		if (printk_ratelimit()) {
+			printk(KERN_ERR "%s: hw error, resetting...\n", dev->name);
+		}
+		spin_lock(&priv->lock);
+		phy_stop(priv->phy);
+		cpmac_reset(dev);
+		cpmac_free_rx_ring(dev);
+		cpmac_hw_init(dev);
+		spin_unlock(&priv->lock);
+	}
+
+	return IRQ_HANDLED;
+}
+
+static void cpmac_tx_timeout(struct net_device *dev)
+{
+	struct cpmac_priv *priv = netdev_priv(dev);
+	struct cpmac_desc *desc;
+
+	priv->stats.tx_errors++;
+	desc = &priv->desc_ring[priv->tx_head++];
+	priv->tx_head %= 8;
+	printk("%s: transmit timeout\n", dev->name);
+	if (desc->skb)
+		dev_kfree_skb(desc->skb);
+	netif_wake_queue(dev);
+}
+
+static int cpmac_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
+{
+	struct cpmac_priv *priv = netdev_priv(dev);
+	if (!(netif_running(dev)))
+		return -EINVAL;
+	if (!priv->phy)
+		return -EINVAL;
+	if ((cmd == SIOCGMIIPHY) || (cmd == SIOCGMIIREG) ||
+	    (cmd == SIOCSMIIREG))
+		return phy_mii_ioctl(priv->phy, if_mii(ifr), cmd);
+
+	return -EINVAL;
+}
+
+static int cpmac_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	if (priv->phy)
+		return phy_ethtool_gset(priv->phy, cmd);
+
+	return -EINVAL;
+}
+
+static int cpmac_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (priv->phy)
+		return phy_ethtool_sset(priv->phy, cmd);
+
+	return -EINVAL;
+}
+
+static void cpmac_get_drvinfo(struct net_device *dev,
+			      struct ethtool_drvinfo *info)
+{
+	strcpy(info->driver, "cpmac");
+	strcpy(info->version, "0.0.3");
+	info->fw_version[0] = '\0';
+	sprintf(info->bus_info, "%s", "cpmac");
+	info->regdump_len = 0;
+}
+
+static const struct ethtool_ops cpmac_ethtool_ops = {
+	.get_settings = cpmac_get_settings,
+	.set_settings = cpmac_set_settings,
+	.get_drvinfo = cpmac_get_drvinfo,
+	.get_link = ethtool_op_get_link,
+};
+
+static struct net_device_stats *cpmac_stats(struct net_device *dev)
+{
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	if (netif_device_present(dev))
+		return &priv->stats;
+
+	return NULL;
+}
+
+static int cpmac_change_mtu(struct net_device *dev, int mtu)
+{
+	unsigned long flags;
+	struct cpmac_priv *priv = netdev_priv(dev);
+	spinlock_t *lock = &priv->lock;
+
+	if ((mtu < 68) || (mtu > 1500))
+		return -EINVAL;
+
+	spin_lock_irqsave(lock, flags);
+	dev->mtu = mtu;
+	spin_unlock_irqrestore(lock, flags);
+
+	return 0;
+}
+
+static void cpmac_adjust_link(struct net_device *dev)
+{
+	struct cpmac_priv *priv = netdev_priv(dev);
+	unsigned long flags;
+	int new_state = 0;
+
+	spin_lock_irqsave(&priv->lock, flags);
+	if (priv->phy->link) {
+		if (priv->phy->duplex != priv->oldduplex) {
+			new_state = 1;
+			priv->oldduplex = priv->phy->duplex;
+		}
+
+		if (priv->phy->speed != priv->oldspeed) {
+			new_state = 1;
+			priv->oldspeed = priv->phy->speed;
+		}
+
+		if (!priv->oldlink) {
+			new_state = 1;
+			priv->oldlink = 1;
+			netif_schedule(dev);
+		}
+	} else if (priv->oldlink) {
+		new_state = 1;
+		priv->oldlink = 0;
+		priv->oldspeed = 0;
+		priv->oldduplex = -1;
+	}
+
+	if (new_state)
+		phy_print_status(priv->phy);
+
+	spin_unlock_irqrestore(&priv->lock, flags);
+}
+
+static void cpmac_hw_init(struct net_device *dev)
+{
+	int i;
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	for (i = 0; i < 8; i++)
+		priv->regs->tx_ptr[i] = 0;
+	priv->regs->rx_ptr[0] = virt_to_phys(priv->rx_head);
+
+	priv->regs->mbp = MBP_RXSHORT | MBP_RXBCAST | MBP_RXMCAST;
+	priv->regs->unicast_enable = 0x1;
+	priv->regs->unicast_clear = 0xfe;
+	priv->regs->buffer_offset = 0;
+	for (i = 0; i < 8; i++)
+		priv->regs->mac_addr_low[i] = dev->dev_addr[5];
+	priv->regs->mac_addr_mid = dev->dev_addr[4];
+	priv->regs->mac_addr_high = dev->dev_addr[0] | (dev->dev_addr[1] << 8)
+		| (dev->dev_addr[2] << 16) | (dev->dev_addr[3] << 24);
+	priv->regs->max_len = CPMAC_SKB_SIZE;
+	priv->regs->rx_int.enable = 0x1;
+	priv->regs->rx_int.clear = 0xfe;
+	priv->regs->tx_int.enable = 0xff;
+	priv->regs->tx_int.clear = 0;
+	priv->regs->mac_int_enable = 3;
+	priv->regs->mac_int_clear = 0xfc;
+
+	priv->regs->rx_ctrl.control |= 1;
+	priv->regs->tx_ctrl.control |= 1;
+	priv->regs->mac_control |= MAC_MII | MAC_FDX;
+
+	priv->phy->state = PHY_CHANGELINK;
+	phy_start(priv->phy);
+}
+
+static int cpmac_open(struct net_device *dev)
+{
+	int i, size, res;
+	struct cpmac_priv *priv = netdev_priv(dev);
+	struct cpmac_desc *desc;
+	struct sk_buff *skb;
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 20)
+	priv->phy = phy_connect(dev, priv->phy_name, &cpmac_adjust_link,
+				0, PHY_INTERFACE_MODE_MII);
+#else
+	priv->phy = phy_connect(dev, priv->phy_name, &cpmac_adjust_link, 0);
+#endif
+	if (IS_ERR(priv->phy)) {
+		printk(KERN_ERR "%s: Could not attach to PHY\n", dev->name);
+		return PTR_ERR(priv->phy);
+	}
+
+	if (!request_mem_region(dev->mem_start, dev->mem_end -
+				dev->mem_start, dev->name)) {
+		printk("%s: failed to request registers\n",
+		       dev->name);
+		res = -ENXIO;
+		goto fail_reserve;
+	}
+
+	priv->regs = ioremap_nocache(dev->mem_start, dev->mem_end -
+				     dev->mem_start);
+	if (!priv->regs) {
+		printk("%s: failed to remap registers\n", dev->name);
+		res = -ENXIO;
+		goto fail_remap;
+	}
+
+	priv->rx_head = NULL;
+	size = sizeof(struct cpmac_desc) * (rx_ring_size +
+					    CPMAC_TX_RING_SIZE);
+	priv->desc_ring = (struct cpmac_desc *)kmalloc(size, GFP_KERNEL);
+	if (!priv->desc_ring) {
+		res = -ENOMEM;
+		goto fail_alloc;
+	}
+
+	memset((char *)priv->desc_ring, 0, size);
+
+	priv->skb_pool = NULL;
+	priv->free_skbs = 0;
+	priv->rx_head = &priv->desc_ring[CPMAC_TX_RING_SIZE];
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 20)
+	INIT_WORK(&priv->alloc_work, cpmac_alloc_skbs);
+#else
+	INIT_WORK(&priv->alloc_work, cpmac_alloc_skbs, dev);
+#endif
+	schedule_work(&priv->alloc_work);
+	flush_scheduled_work();
+
+	for (i = 0; i < rx_ring_size; i++) {
+		desc = &priv->rx_head[i];
+		skb = cpmac_get_skb(dev);
+		if (!skb) {
+			res = -ENOMEM;
+			goto fail_desc;
+		}
+		desc->skb = skb;
+		desc->hw_data = virt_to_phys(skb->data);
+		desc->buflen = CPMAC_SKB_SIZE;
+		desc->dataflags = CPMAC_OWN;
+		desc->next = &priv->rx_head[(i + 1) % rx_ring_size];
+		desc->hw_next = virt_to_phys(desc->next);
+		dma_cache_wback((u32)desc, 16);
+	}
+
+	if ((res = request_irq(dev->irq, cpmac_irq, SA_INTERRUPT,
+			      dev->name, dev))) {
+		printk("%s: failed to obtain irq\n", dev->name);
+		goto fail_irq;
+	}
+
+	cpmac_reset(dev);
+	cpmac_hw_init(dev);
+
+	netif_start_queue(dev);
+	return 0;
+
+fail_irq:
+fail_desc:
+	for (i = 0; i < rx_ring_size; i++)
+		if (priv->rx_head[i].skb)
+			kfree_skb(priv->rx_head[i].skb);
+fail_alloc:
+	kfree(priv->desc_ring);
+
+	for (skb = priv->skb_pool; skb; skb = priv->skb_pool) {
+		priv->skb_pool = skb->next;
+		kfree_skb(skb);
+	}
+
+	iounmap(priv->regs);
+
+fail_remap:
+	release_mem_region(dev->mem_start, dev->mem_end -
+			   dev->mem_start);
+
+fail_reserve:
+	phy_disconnect(priv->phy);
+
+	return res;
+}
+
+static int cpmac_stop(struct net_device *dev)
+{
+	int i;
+	struct sk_buff *skb;
+	struct cpmac_priv *priv = netdev_priv(dev);
+
+	netif_stop_queue(dev);
+
+	phy_stop(priv->phy);
+	phy_disconnect(priv->phy);
+	priv->phy = NULL;
+
+	cpmac_reset(dev);
+
+	for (i = 0; i < 8; i++) {
+		priv->regs->rx_ptr[i] = 0;
+		priv->regs->tx_ptr[i] = 0;
+		priv->regs->mbp = 0;
+	}
+
+	free_irq(dev->irq, dev);
+	release_mem_region(dev->mem_start, dev->mem_end -
+			   dev->mem_start);
+
+	cancel_delayed_work(&priv->alloc_work);
+	flush_scheduled_work();
+
+	priv->rx_head = &priv->desc_ring[CPMAC_TX_RING_SIZE];
+	for (i = 0; i < rx_ring_size; i++)
+		if (priv->rx_head[i].skb)
+			kfree_skb(priv->rx_head[i].skb);
+
+	kfree(priv->desc_ring);
+
+	for (skb = priv->skb_pool; skb; skb = priv->skb_pool) {
+		priv->skb_pool = skb->next;
+		kfree_skb(skb);
+	}
+
+	return 0;
+}
+
+static int external_switch;
+
+static int __devinit cpmac_probe(struct platform_device *pdev)
+{
+	int i, rc, phy_id;
+	struct resource *res;
+	struct cpmac_priv *priv;
+	struct net_device *dev;
+	struct plat_cpmac_data *pdata;
+
+	if (strcmp(pdev->name, "cpmac") != 0)
+		return -ENODEV;
+
+	pdata = pdev->dev.platform_data;
+
+	for (phy_id = 0; phy_id < PHY_MAX_ADDR; phy_id++) {
+		if (!(pdata->phy_mask & (1 << phy_id)))
+			continue;
+		if (!cpmac_mii.phy_map[phy_id])
+			continue;
+		break;
+	}
+
+	if (phy_id == PHY_MAX_ADDR) {
+		if (external_switch) {
+			phy_id = 0;
+		} else {
+			printk("cpmac: no PHY present\n");
+			return -ENODEV;
+		}
+	}
+
+	dev = alloc_etherdev(sizeof(struct cpmac_priv));
+
+	if (!dev) {
+		printk(KERN_ERR "cpmac: Unable to allocate net_device structure!\n");
+		return -ENOMEM;
+	}
+
+	SET_MODULE_OWNER(dev);
+	platform_set_drvdata(pdev, dev);
+	priv = netdev_priv(dev);
+
+	res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs");
+	if (!res) {
+		rc = -ENODEV;
+		goto fail;
+	}
+
+	dev->mem_start = res->start;
+	dev->mem_end = res->end;
+	dev->irq = platform_get_irq_byname(pdev, "irq");
+
+	dev->mtu                = 1500;
+	dev->open               = cpmac_open;
+	dev->stop               = cpmac_stop;
+	dev->set_config         = cpmac_config;
+	dev->hard_start_xmit    = cpmac_start_xmit;
+	dev->do_ioctl           = cpmac_ioctl;
+	dev->get_stats          = cpmac_stats;
+	dev->change_mtu         = cpmac_change_mtu;
+	dev->set_mac_address    = cpmac_set_mac_address;
+	dev->set_multicast_list = cpmac_set_multicast_list;
+	dev->tx_timeout         = cpmac_tx_timeout;
+	dev->ethtool_ops        = &cpmac_ethtool_ops;
+	if (!disable_napi) {
+		dev->poll = cpmac_poll;
+		dev->weight = min(rx_ring_size, 64);
+	}
+
+	memset(priv, 0, sizeof(struct cpmac_priv));
+	spin_lock_init(&priv->lock);
+	priv->msg_enable = netif_msg_init(NETIF_MSG_WOL, 0x3fff);
+	priv->config = pdata;
+	priv->dev = dev;
+	memcpy(dev->dev_addr, priv->config->dev_addr, sizeof(dev->dev_addr));
+	if (phy_id == 31) {
+		snprintf(priv->phy_name, BUS_ID_SIZE, PHY_ID_FMT,
+			 cpmac_mii.id, phy_id);
+	} else {
+		snprintf(priv->phy_name, BUS_ID_SIZE, "fixed@%d:%d", 100, 1);
+	}
+
+	if ((rc = register_netdev(dev))) {
+		printk("cpmac: error %i registering device %s\n",
+		       rc, dev->name);
+		goto fail;
+	}
+
+	printk("cpmac: device %s (regs: %p, irq: %d, phy: %s, mac: ",
+	       dev->name, (u32 *)dev->mem_start, dev->irq,
+	       priv->phy_name);
+	for (i = 0; i < 6; i++)
+		printk("%02x%s", dev->dev_addr[i], i < 5 ? ":" : ")\n");
+
+	return 0;
+
+fail:
+	free_netdev(dev);
+	return rc;
+}
+
+static int __devexit cpmac_remove(struct platform_device *pdev)
+{
+	struct net_device *dev = platform_get_drvdata(pdev);
+	unregister_netdev(dev);
+	free_netdev(dev);
+	return 0;
+}
+
+static struct platform_driver cpmac_driver = {
+	.driver.name = "cpmac",
+	.probe = cpmac_probe,
+	.remove = cpmac_remove,
+};
+
+int __devinit cpmac_init(void)
+{
+	volatile u32 mask;
+	int i, res;
+	cpmac_mii.priv = (struct cpmac_mdio_regs *)
+		ioremap_nocache(AR7_REGS_MDIO, sizeof(struct cpmac_mdio_regs));
+
+	if (!cpmac_mii.priv) {
+		printk("Can't ioremap mdio registers\n");
+		return -ENXIO;
+	}
+
+#warning FIXME: unhardcode gpio&reset bits
+	ar7_gpio_disable(26);
+	ar7_gpio_disable(27);
+	ar7_device_reset(AR7_RESET_BIT_CPMAC_LO);
+	ar7_device_reset(AR7_RESET_BIT_CPMAC_HI);
+	ar7_device_reset(AR7_RESET_BIT_EPHY);
+
+	cpmac_mii.reset(&cpmac_mii);
+
+	for (i = 0; i < 300000; i++) {
+		mask = ((struct cpmac_mdio_regs *)cpmac_mii.priv)->alive;
+		if (mask)
+			break;
+	}
+
+	mask &= 0x7fffffff;
+	if (mask & (mask - 1)) {
+		external_switch = 1;
+		mask = 0;
+	}
+
+	cpmac_mii.phy_mask = ~(mask | 0x80000000);
+
+	res = mdiobus_register(&cpmac_mii);
+	if (res)
+		goto fail_mii;
+
+	res = platform_driver_register(&cpmac_driver);
+	if (res)
+		goto fail_cpmac;
+
+	return 0;
+
+fail_cpmac:
+	mdiobus_unregister(&cpmac_mii);
+
+fail_mii:
+	iounmap(cpmac_mii.priv);
+
+	return res;
+}
+
+void __devexit cpmac_exit(void)
+{
+	platform_driver_unregister(&cpmac_driver);
+	mdiobus_unregister(&cpmac_mii);
+}
+
+module_init(cpmac_init);
+module_exit(cpmac_exit);

^ permalink raw reply related

* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
From: Stephen Hemminger @ 2007-09-06 15:37 UTC (permalink / raw)
  To: James Chapman; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
In-Reply-To: <46E01D16.5000609@katalix.com>

On Thu, 06 Sep 2007 16:30:30 +0100
James Chapman <jchapman@katalix.com> wrote:

> Stephen Hemminger wrote:
> 
> > What about the latency that NAPI imposes? Right now there are certain applications that
> > don't like NAPI because it add several more microseconds, and this may make it worse.
> 
> Latency is something that I think this approach will actually improve, 
> at the expense of additional polling. Or is it the ksoftirqd scheduling 
> latency that you are referring to?

The problem is that you leave interrupts disabled, right. Also you are busy during
idle which kills powersaving and no hz clock.

> > Maybe a per-device flag or tuning parameters (like weight sysfs value)? or some other
> > way to set low-latency values.
> 
> Yes. I'd like to think good defaults could be derived though, perhaps 
> based on settings like CONFIG_PREEMPT, CONFIG_HIGH_RES_TIMER, CONFIG_HZ 
> and maybe even bogomips / nr_cpus.
> 
> -- 
> James Chapman
> Katalix Systems Ltd
> http://www.katalix.com
> Catalysts for your Embedded Linux software development
> 

^ permalink raw reply

* Re: [TG3]: Workaround MSI bug on 5714/5780.
From: Andy Gospodarek @ 2007-09-06 15:41 UTC (permalink / raw)
  To: David Miller; +Cc: andy, mchan, netdev
In-Reply-To: <20070906.073423.74747465.davem@davemloft.net>

On Thu, Sep 06, 2007 at 07:34:23AM -0700, David Miller wrote:
> From: Andy Gospodarek <andy@greyhouse.net>
> Date: Thu, 6 Sep 2007 09:45:16 -0400
> 
> > Is is really necessary to get rid of the HT1000 patch?
> 
> Yes, absolutely and without a single doubt.
> 
> > commit e3008dedff4bdc96a5f67224cd3d8d12237082a0
> > Author: Andy Gospodarek <andy@greyhouse.net>
> > Date:   Thu May 10 22:58:57 2007 -0700
> > 
> >     PCI: disable MSI by default on systems with Serverworks HT1000 chips
> > 
> > This patch was designed to address tg3 and bnx2 messages printed on
> > systems with HT1000 chips IIRC.  Are we going back to those messages on
> > bnx2 or is there a workaround in that driver needed too?
> 
> Every report of that problem is with tg3 chips afflicted by this
> INTX bug which is now worked around properly.
> 

Not according to this:

https://bugzilla.redhat.com/show_bug.cgi?id=227657

Which was precisely why I chose to look at it as a chipset issue.

> Even my Niagara T1000 box was hit by this absolutely stupid and
> wrong MSI quirk, which is how I found this problem to begin with.
> 
> I'm reverting it.
> 

If that's the case, do we need to consider that some of the bnx2
hardware (like in this case the 5706) needs a similar workaround?

> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: BUG: scheduling while atomic: ifconfig/0x00000002/4170
From: Paul E. McKenney @ 2007-09-06 15:46 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Herbert Xu, satyam, flo, linux-kernel, netdev, linux-wireless,
	michal.k.k.piotrowski, ipw3945-devel, yi.zhu, flamingice
In-Reply-To: <1189085815.28781.78.camel@johannes.berg>

On Thu, Sep 06, 2007 at 03:36:55PM +0200, Johannes Berg wrote:
> On Thu, 2007-09-06 at 20:36 +0800, Herbert Xu wrote:
> 
> > Yeah I think they're all under RTNL too.  So you don't need to
> > take the lock here at all since you should already have the RTNL.
> 
> Ok, this patch gets rid of the lock. I'd appreciate if you could give it
> a quick look for obvious RCU abuse as I haven't tested it. It also
> doesn't apply against net-2.6.24 because I based it after the patches I
> have queued with John for net-2.6.24.

Looks good to me from an RCU viewpoint.  I cannot claim familiarity with
this code.  I therefore especially like the indications of where RTNL
is held and not!!!

Some questions below based on a quick scan.  And a global question:
should the comments about RTNL being held be replaced by ASSERT_RTNL()?

						Thanx, Paul

> johannes
> 
> --- netdev-2.6.orig/net/mac80211/ieee80211.c	2007-09-06 15:35:23.324447686 +0200
> +++ netdev-2.6/net/mac80211/ieee80211.c	2007-09-06 15:35:23.394447686 +0200
> @@ -90,8 +90,9 @@ static struct dev_mc_list *ieee80211_get
> 
>  	/* start of iteration, both unassigned */
>  	if (!mcd->cur && !mcd->sdata) {
> -		mcd->sdata = list_entry(local->sub_if_list.next,
> -					struct ieee80211_sub_if_data, list);
> +		mcd->sdata = list_entry(
> +				rcu_dereference(local->interfaces.next),
> +				struct ieee80211_sub_if_data, list);
>  		mcd->cur = mcd->sdata->dev->mc_list;
>  	}
> 
> @@ -100,10 +101,11 @@ static struct dev_mc_list *ieee80211_get
> 
>  	while (!mcd->cur) {
>  		/* reached end of interface list? */
> -		if (mcd->sdata->list.next == &local->sub_if_list)
> +		if (rcu_dereference(mcd->sdata->list.next) ==
> +		    &local->interfaces)
>  			break;
>  		/* otherwise try next interface */
> -		mcd->sdata = list_entry(mcd->sdata->list.next,
> +		mcd->sdata = list_entry(rcu_dereference(mcd->sdata->list.next),
>  					struct ieee80211_sub_if_data, list);
>  		mcd->cur = mcd->sdata->dev->mc_list;
>  	}
> @@ -145,9 +147,10 @@ static void ieee80211_configure_filter(s
> 
>  	/*
>  	 * We can iterate through the device list for the multicast
> -	 * address list so need to lock it.
> +	 * address list so need to be in a RCU read-side section,
> +	 * the RTNL isn't held in this function.
>  	 */
> -	read_lock(&local->sub_if_lock);
> +	rcu_read_lock();
> 
>  	/* be a bit nasty */
>  	new_flags |= (1<<31);
> @@ -163,7 +166,7 @@ static void ieee80211_configure_filter(s
>  	WARN_ON(mcd.cur);
> 
>  	local->filter_flags = new_flags & ~(1<<31);
> -	read_unlock(&local->sub_if_lock);
> +	rcu_read_unlock();
> 
>  	netif_tx_unlock(local->mdev);
>  }
> @@ -176,14 +179,13 @@ static int ieee80211_master_open(struct 
>  	struct ieee80211_sub_if_data *sdata;
>  	int res = -EOPNOTSUPP;
> 
> -	read_lock(&local->sub_if_lock);
> -	list_for_each_entry(sdata, &local->sub_if_list, list) {
> +	/* we hold the RTNL here so can safely walk the list */
> +	list_for_each_entry(sdata, &local->interfaces, list) {
>  		if (sdata->dev != dev && netif_running(sdata->dev)) {
>  			res = 0;
>  			break;
>  		}
>  	}
> -	read_unlock(&local->sub_if_lock);
>  	return res;
>  }
> 
> @@ -192,11 +194,10 @@ static int ieee80211_master_stop(struct 
>  	struct ieee80211_local *local = wdev_priv(dev->ieee80211_ptr);
>  	struct ieee80211_sub_if_data *sdata;
> 
> -	read_lock(&local->sub_if_lock);
> -	list_for_each_entry(sdata, &local->sub_if_list, list)
> +	/* we hold the RTNL here so can safely walk the list */
> +	list_for_each_entry(sdata, &local->interfaces, list)
>  		if (sdata->dev != dev && netif_running(sdata->dev))
>  			dev_close(sdata->dev);
> -	read_unlock(&local->sub_if_lock);
> 
>  	return 0;
>  }
> @@ -430,8 +431,8 @@ static int ieee80211_open(struct net_dev
> 
>  	sdata = IEEE80211_DEV_TO_SUB_IF(dev);
> 
> -	read_lock(&local->sub_if_lock);
> -	list_for_each_entry(nsdata, &local->sub_if_list, list) {
> +	/* we hold the RTNL here so can safely walk the list */
> +	list_for_each_entry(nsdata, &local->interfaces, list) {
>  		struct net_device *ndev = nsdata->dev;
> 
>  		if (ndev != dev && ndev != local->mdev && netif_running(ndev) &&
> @@ -440,10 +441,8 @@ static int ieee80211_open(struct net_dev
>  			 * check whether it may have the same address
>  			 */
>  			if (!identical_mac_addr_allowed(sdata->type,
> -							nsdata->type)) {
> -				read_unlock(&local->sub_if_lock);
> +							nsdata->type))
>  				return -ENOTUNIQ;
> -			}
> 
>  			/*
>  			 * can only add VLANs to enabled APs
> @@ -454,7 +453,6 @@ static int ieee80211_open(struct net_dev
>  				sdata->u.vlan.ap = nsdata;
>  		}
>  	}
> -	read_unlock(&local->sub_if_lock);
> 
>  	switch (sdata->type) {
>  	case IEEE80211_IF_TYPE_WDS:
> @@ -575,14 +573,13 @@ static int ieee80211_stop(struct net_dev
>  		sdata->u.sta.state = IEEE80211_DISABLED;
>  		del_timer_sync(&sdata->u.sta.timer);
>  		/*
> -		 * Holding the sub_if_lock for writing here blocks
> -		 * out the receive path and makes sure it's not
> -		 * currently processing a packet that may get
> -		 * added to the queue.
> +		 * When we get here, the interface is marked down.
> +		 * Call synchronize_rcu() to wait for the RX path
> +		 * should it be using the interface and enqueuing
> +		 * frames at this very time on another CPU.
>  		 */
> -		write_lock_bh(&local->sub_if_lock);
> +		synchronize_rcu();
>  		skb_queue_purge(&sdata->u.sta.skb_queue);
> -		write_unlock_bh(&local->sub_if_lock);
> 
>  		if (!local->ops->hw_scan &&
>  		    local->scan_dev == sdata->dev) {
> @@ -1134,9 +1131,9 @@ void ieee80211_tx_status(struct ieee8021
> 
>  	rthdr->data_retries = status->retry_count;
> 
> -	read_lock(&local->sub_if_lock);
> +	rcu_read_lock();
>  	monitors = local->monitors;
> -	list_for_each_entry(sdata, &local->sub_if_list, list) {
> +	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
>  		/*
>  		 * Using the monitors counter is possibly racy, but
>  		 * if the value is wrong we simply either clone the skb
> @@ -1152,7 +1149,7 @@ void ieee80211_tx_status(struct ieee8021
>  				continue;
>  			monitors--;
>  			if (monitors)
> -				skb2 = skb_clone(skb, GFP_KERNEL);
> +				skb2 = skb_clone(skb, GFP_ATOMIC);
>  			else
>  				skb2 = NULL;
>  			skb->dev = sdata->dev;
> @@ -1167,7 +1164,7 @@ void ieee80211_tx_status(struct ieee8021
>  		}
>  	}
>   out:
> -	read_unlock(&local->sub_if_lock);
> +	rcu_read_unlock();
>  	if (skb)
>  		dev_kfree_skb(skb);
>  }
> @@ -1255,8 +1252,7 @@ struct ieee80211_hw *ieee80211_alloc_hw(
> 
>  	INIT_LIST_HEAD(&local->modes_list);
> 
> -	rwlock_init(&local->sub_if_lock);
> -	INIT_LIST_HEAD(&local->sub_if_list);
> +	INIT_LIST_HEAD(&local->interfaces);
> 
>  	INIT_DELAYED_WORK(&local->scan_work, ieee80211_sta_scan_work);
>  	ieee80211_rx_bss_list_init(mdev);
> @@ -1275,7 +1271,8 @@ struct ieee80211_hw *ieee80211_alloc_hw(
>  	sdata->u.ap.force_unicast_rateidx = -1;
>  	sdata->u.ap.max_ratectrl_rateidx = -1;
>  	ieee80211_if_sdata_init(sdata);
> -	list_add_tail(&sdata->list, &local->sub_if_list);
> +	/* no RCU needed since we're still during init phase */
> +	list_add_tail(&sdata->list, &local->interfaces);
> 
>  	tasklet_init(&local->tx_pending_tasklet, ieee80211_tx_pending,
>  		     (unsigned long)local);
> @@ -1434,7 +1431,6 @@ void ieee80211_unregister_hw(struct ieee
>  {
>  	struct ieee80211_local *local = hw_to_local(hw);
>  	struct ieee80211_sub_if_data *sdata, *tmp;
> -	struct list_head tmp_list;
>  	int i;
> 
>  	tasklet_kill(&local->tx_pending_tasklet);
> @@ -1448,11 +1444,12 @@ void ieee80211_unregister_hw(struct ieee
>  	if (local->apdev)
>  		ieee80211_if_del_mgmt(local);
> 
> -	write_lock_bh(&local->sub_if_lock);
> -	list_replace_init(&local->sub_if_list, &tmp_list);
> -	write_unlock_bh(&local->sub_if_lock);
> -
> -	list_for_each_entry_safe(sdata, tmp, &tmp_list, list)
> +	/*
> +	 * At this point, interface list manipulations are fine
> +	 * because the driver cannot be handing us frames any
> +	 * more and the tasklet is killed.
> +	 */
> +	list_for_each_entry_safe(sdata, tmp, &local->interfaces, list)
>  		__ieee80211_if_del(local, sdata);
> 
>  	rtnl_unlock();
> --- netdev-2.6.orig/net/mac80211/ieee80211_i.h	2007-09-06 15:35:23.334447686 +0200
> +++ netdev-2.6/net/mac80211/ieee80211_i.h	2007-09-06 15:35:23.404447686 +0200
> @@ -481,9 +481,8 @@ struct ieee80211_local {
>  	ieee80211_rx_handler *rx_handlers;
>  	ieee80211_tx_handler *tx_handlers;
> 
> -	rwlock_t sub_if_lock; /* Protects sub_if_list. Cannot be taken under
> -			       * sta_bss_lock or sta_lock. */
> -	struct list_head sub_if_list;
> +	struct list_head interfaces;
> +
>  	int sta_scanning;
>  	int scan_channel_idx;
>  	enum { SCAN_SET_CHANNEL, SCAN_SEND_PROBE } scan_state;
> --- netdev-2.6.orig/net/mac80211/ieee80211_iface.c	2007-09-06 15:35:23.334447686 +0200
> +++ netdev-2.6/net/mac80211/ieee80211_iface.c	2007-09-06 15:35:23.404447686 +0200
> @@ -79,16 +79,15 @@ int ieee80211_if_add(struct net_device *
>  	ieee80211_debugfs_add_netdev(sdata);
>  	ieee80211_if_set_type(ndev, type);
> 
> -	write_lock_bh(&local->sub_if_lock);
> +	/* we're under RTNL so all this is fine */
>  	if (unlikely(local->reg_state == IEEE80211_DEV_UNREGISTERED)) {
> -		write_unlock_bh(&local->sub_if_lock);
>  		__ieee80211_if_del(local, sdata);
>  		return -ENODEV;
>  	}
> -	list_add(&sdata->list, &local->sub_if_list);
> +	list_add_tail_rcu(&sdata->list, &local->interfaces);

The _rcu is required because this list isn't protected by RTNL?

> +
>  	if (new_dev)
>  		*new_dev = ndev;
> -	write_unlock_bh(&local->sub_if_lock);
> 
>  	return 0;
> 
> @@ -226,22 +225,22 @@ void ieee80211_if_reinit(struct net_devi
>  		/* Remove all virtual interfaces that use this BSS
>  		 * as their sdata->bss */
>  		struct ieee80211_sub_if_data *tsdata, *n;
> -		LIST_HEAD(tmp_list);
> 
> -		write_lock_bh(&local->sub_if_lock);

This code is also protected by RTNL?

> -		list_for_each_entry_safe(tsdata, n, &local->sub_if_list, list) {
> +		list_for_each_entry_safe(tsdata, n, &local->interfaces, list) {
>  			if (tsdata != sdata && tsdata->bss == &sdata->u.ap) {
>  				printk(KERN_DEBUG "%s: removing virtual "
>  				       "interface %s because its BSS interface"
>  				       " is being removed\n",
>  				       sdata->dev->name, tsdata->dev->name);
> -				list_move_tail(&tsdata->list, &tmp_list);
> +				list_del_rcu(&tsdata->list);
> +				/*
> +				 * We have lots of time and can afford
> +				 * to sync for each interface
> +				 */
> +				synchronize_rcu();
> +				__ieee80211_if_del(local, tsdata);
>  			}
>  		}
> -		write_unlock_bh(&local->sub_if_lock);
> -
> -		list_for_each_entry_safe(tsdata, n, &tmp_list, list)
> -			__ieee80211_if_del(local, tsdata);
> 
>  		kfree(sdata->u.ap.beacon_head);
>  		kfree(sdata->u.ap.beacon_tail);
> @@ -318,18 +317,16 @@ int ieee80211_if_remove(struct net_devic
> 
>  	ASSERT_RTNL();

I -like- this!!!  ;-)

> -	write_lock_bh(&local->sub_if_lock);
> -	list_for_each_entry_safe(sdata, n, &local->sub_if_list, list) {
> +	list_for_each_entry_safe(sdata, n, &local->interfaces, list) {
>  		if ((sdata->type == id || id == -1) &&
>  		    strcmp(name, sdata->dev->name) == 0 &&
>  		    sdata->dev != local->mdev) {
> -			list_del(&sdata->list);
> -			write_unlock_bh(&local->sub_if_lock);
> +			list_del_rcu(&sdata->list);
> +			synchronize_rcu();
>  			__ieee80211_if_del(local, sdata);
>  			return 0;
>  		}
>  	}
> -	write_unlock_bh(&local->sub_if_lock);
>  	return -ENODEV;
>  }
> 
> --- netdev-2.6.orig/net/mac80211/ieee80211_sta.c	2007-09-06 15:35:23.224447686 +0200
> +++ netdev-2.6/net/mac80211/ieee80211_sta.c	2007-09-06 15:35:23.414447686 +0200
> @@ -2659,8 +2659,8 @@ void ieee80211_scan_completed(struct iee
>  	memset(&wrqu, 0, sizeof(wrqu));
>  	wireless_send_event(dev, SIOCGIWSCAN, &wrqu, NULL);
> 
> -	read_lock(&local->sub_if_lock);
> -	list_for_each_entry(sdata, &local->sub_if_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
> 
>  		/* No need to wake the master device. */
>  		if (sdata->dev == local->mdev)
> @@ -2674,7 +2674,7 @@ void ieee80211_scan_completed(struct iee
> 
>  		netif_wake_queue(sdata->dev);
>  	}
> -	read_unlock(&local->sub_if_lock);
> +	rcu_read_unlock();
> 
>  	sdata = IEEE80211_DEV_TO_SUB_IF(dev);
>  	if (sdata->type == IEEE80211_IF_TYPE_IBSS) {
> @@ -2811,8 +2811,8 @@ static int ieee80211_sta_start_scan(stru
> 
>  	local->sta_scanning = 1;
> 
> -	read_lock(&local->sub_if_lock);
> -	list_for_each_entry(sdata, &local->sub_if_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
> 
>  		/* Don't stop the master interface, otherwise we can't transmit
>  		 * probes! */
> @@ -2824,7 +2824,7 @@ static int ieee80211_sta_start_scan(stru
>  		    (sdata->u.sta.flags & IEEE80211_STA_ASSOCIATED))
>  			ieee80211_send_nullfunc(local, sdata, 1);
>  	}
> -	read_unlock(&local->sub_if_lock);
> +	rcu_read_unlock();
> 
>  	if (ssid) {
>  		local->scan_ssid_len = ssid_len;
> --- netdev-2.6.orig/net/mac80211/rx.c	2007-09-06 15:35:23.214447686 +0200
> +++ netdev-2.6/net/mac80211/rx.c	2007-09-06 15:35:23.414447686 +0200
> @@ -1385,8 +1385,9 @@ void __ieee80211_rx(struct ieee80211_hw 
>  	}
> 
>  	/*
> -	 * key references are protected using RCU and this requires that
> -	 * we are in a read-site RCU section during receive processing
> +	 * key references and virtual interfaces are protected using RCU
> +	 * and this requires that we are in a read-side RCU section during
> +	 * receive processing
>  	 */
>  	rcu_read_lock();
> 
> @@ -1441,8 +1442,7 @@ void __ieee80211_rx(struct ieee80211_hw 
> 
>  	bssid = ieee80211_get_bssid(hdr, skb->len - radiotap_len);
> 
> -	read_lock(&local->sub_if_lock);
> -	list_for_each_entry(sdata, &local->sub_if_list, list) {
> +	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
>  		rx.flags |= IEEE80211_TXRXD_RXRA_MATCH;
> 
>  		if (!netif_running(sdata->dev))
> @@ -1494,7 +1494,6 @@ void __ieee80211_rx(struct ieee80211_hw 
>  					     &rx, sta);
>  	} else
>  		dev_kfree_skb(skb);
> -	read_unlock(&local->sub_if_lock);
> 
>   end:
>  	rcu_read_unlock();
> --- netdev-2.6.orig/net/mac80211/tx.c	2007-09-06 15:35:23.074447686 +0200
> +++ netdev-2.6/net/mac80211/tx.c	2007-09-06 15:35:23.424447686 +0200
> @@ -291,8 +291,12 @@ static void purge_old_ps_buffers(struct 
>  	struct ieee80211_sub_if_data *sdata;
>  	struct sta_info *sta;
> 
> -	read_lock(&local->sub_if_lock);
> -	list_for_each_entry(sdata, &local->sub_if_list, list) {
> +	/*
> +	 * virtual interfaces are protected by RCU
> +	 */
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(sdata, &local->interfaces, list) {
>  		struct ieee80211_if_ap *ap;
>  		if (sdata->dev == local->mdev ||
>  		    sdata->type != IEEE80211_IF_TYPE_AP)
> @@ -305,7 +309,7 @@ static void purge_old_ps_buffers(struct 
>  		}
>  		total += skb_queue_len(&ap->ps_bc_buf);
>  	}
> -	read_unlock(&local->sub_if_lock);
> +	rcu_read_unlock();
> 
>  	read_lock_bh(&local->sta_lock);
>  	list_for_each_entry(sta, &local->sta_list, list) {
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
From: James Chapman @ 2007-09-06 16:07 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
In-Reply-To: <20070906163703.4fc12d32@oldman>

Stephen Hemminger wrote:
> On Thu, 06 Sep 2007 16:30:30 +0100
> James Chapman <jchapman@katalix.com> wrote:
> 
>> Stephen Hemminger wrote:
>>
>>> What about the latency that NAPI imposes? Right now there are certain applications that
>>> don't like NAPI because it add several more microseconds, and this may make it worse.
 >>
>> Latency is something that I think this approach will actually improve, 
>> at the expense of additional polling. Or is it the ksoftirqd scheduling 
>> latency that you are referring to?
> 
> The problem is that you leave interrupts disabled, right.

Are you saying NAPI drivers should avoid keeping interrupts disabled?

> Also you are busy during idle which kills powersaving and no hz clock.

But perhaps some environments don't care about powersave because they 
are always busy? Embedded routers or network servers, for example.

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply

* [PATCH net-2.6.23-rc5] ipsec interfamily route handling fix
From: Joakim Koskela @ 2007-09-06 16:00 UTC (permalink / raw)
  To: netdev

Hi,

This patch addresses a couple of issues related to interfamily ipsec
modes. The problem is that the structure of the routing info changes
with the family during the __xfrmX_bundle_create, which hasn't been
taken properly into account. Seems that by coincidence it hasn't
caused problems on 32bit platforms, but crashes for example on x86_64
in 6-4 around line 209 of xfrm6_policy.c as rt doesn't point to a
rt6_info anymore, but actually a struct rtable. With 64bit pointers,
the rt->rt6i_node pointer seems to hit something usually not null in
the rtable that rt now points to, making it go for the path_cookie
assignment and subsequently crashing.

Tested on both 32/64bit with all four (44/46/64/66) combinations of
transformation. I'm still a bit worried about how for example nested
transformations work with all of this and would appreciate if someone
more familiar with the details of these structs could comment.

Signed-off-by: Joakim Koskela <jookos@gmail.com>
---

diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 4ff8ed3..7410c0d 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -72,6 +72,7 @@ __xfrm4_bundle_create(struct xfrm_policy *policy, struct 
xfrm_state **xfrm, int
 	struct dst_entry *dst, *dst_prev;
 	struct rtable *rt0 = (struct rtable*)(*dst_p);
 	struct rtable *rt = rt0;
+	unsigned short encap_family = AF_INET;
 	struct flowi fl_tunnel = {
 		.nl_u = {
 			.ip4_u = {
@@ -118,7 +119,7 @@ __xfrm4_bundle_create(struct xfrm_policy *policy, struct 
xfrm_state **xfrm, int
 		trailer_len += xfrm[i]->props.trailer_len;
 
 		if (xfrm[i]->props.mode == XFRM_MODE_TUNNEL) {
-			unsigned short encap_family = xfrm[i]->props.family;
+			encap_family = xfrm[i]->props.family;
 			switch (encap_family) {
 			case AF_INET:
 				fl_tunnel.fl4_dst = xfrm[i]->id.daddr.a4;
@@ -180,16 +181,19 @@ __xfrm4_bundle_create(struct xfrm_policy *policy, struct 
xfrm_state **xfrm, int
 		}
 		dst_prev->output = afinfo->output;
 		xfrm_state_put_afinfo(afinfo);
-		if (dst_prev->xfrm->props.family == AF_INET && rt->peer)
-			atomic_inc(&rt->peer->refcnt);
-		x->u.rt.peer = rt->peer;
+
+		if (encap_family == AF_INET) {
+			if (dst_prev->xfrm->props.family == AF_INET && rt->peer)
+				atomic_inc(&rt->peer->refcnt);
+			x->u.rt.peer = rt->peer;
+			x->u.rt.rt_type = rt->rt_type;
+			x->u.rt.rt_gateway = rt->rt_gateway;
+		}
 		/* Sheit... I remember I did this right. Apparently,
 		 * it was magically lost, so this code needs audit */
 		x->u.rt.rt_flags = rt0->rt_flags&(RTCF_BROADCAST|RTCF_MULTICAST|
RTCF_LOCAL);
-		x->u.rt.rt_type = rt->rt_type;
 		x->u.rt.rt_src = rt0->rt_src;
 		x->u.rt.rt_dst = rt0->rt_dst;
-		x->u.rt.rt_gateway = rt->rt_gateway;
 		x->u.rt.rt_spec_dst = rt0->rt_spec_dst;
 		x->u.rt.idev = rt0->idev;
 		in_dev_hold(rt0->idev);
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 3ec0c47..9733f39 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -131,6 +131,7 @@ __xfrm6_bundle_create(struct xfrm_policy *policy, struct 
xfrm_state **xfrm, int
 	struct dst_entry *dst, *dst_prev;
 	struct rt6_info *rt0 = (struct rt6_info*)(*dst_p);
 	struct rt6_info *rt  = rt0;
+	unsigned short encap_family = AF_INET6;
 	struct flowi fl_tunnel = {
 		.nl_u = {
 			.ip6_u = {
@@ -180,11 +181,13 @@ __xfrm6_bundle_create(struct xfrm_policy *policy, struct 
xfrm_state **xfrm, int
 
 		if (xfrm[i]->props.mode == XFRM_MODE_TUNNEL ||
 		    xfrm[i]->props.mode == XFRM_MODE_ROUTEOPTIMIZATION) {
-			unsigned short encap_family = xfrm[i]->props.family;
+			encap_family = xfrm[i]->props.family;
 			switch(encap_family) {
 			case AF_INET:
 				fl_tunnel.fl4_dst = xfrm[i]->id.daddr.a4;
 				fl_tunnel.fl4_src = xfrm[i]->props.saddr.a4;
+				fl_tunnel.fl4_tos = 0;
+				fl_tunnel.fl4_scope = 0;
 				break;
 			case AF_INET6:
 				ipv6_addr_copy(&fl_tunnel.fl6_dst, __xfrm6_bundle_addr_remote(xfrm[i], 
&fl->fl6_dst));
@@ -205,7 +208,7 @@ __xfrm6_bundle_create(struct xfrm_policy *policy, struct 
xfrm_state **xfrm, int
 
 	dst_prev->child = &rt->u.dst;
 	dst->path = &rt->u.dst;
-	if (rt->rt6i_node)
+	if (encap_family == AF_INET6 && rt->rt6i_node)
 		((struct xfrm_dst *)dst)->path_cookie = rt->rt6i_node->fn_sernum;
 
 	*dst_p = dst;

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox