Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 6/7]:
From: David Miller @ 2008-01-08  5:40 UTC (permalink / raw)
  To: netdev


[NET]: Stop polling when napi_disable() is pending.

This finally adds the code in net_rx_action() to break out of the
->poll()'ing loop when a napi_disable() is found to be pending.

Now, even if a device is being flooded with packets it can be cleanly
brought down.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/core/dev.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index be9d301..0879f52 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2207,8 +2207,12 @@ static void net_rx_action(struct softirq_action *h)
 		 * still "owns" the NAPI instance and therefore can
 		 * move the instance around on the list at-will.
 		 */
-		if (unlikely(work == weight))
-			list_move_tail(&n->poll_list, list);
+		if (unlikely(work == weight)) {
+			if (unlikely(napi_disable_pending(n)))
+				__napi_complete(n);
+			else
+				list_move_tail(&n->poll_list, list);
+		}
 
 		netpoll_poll_unlock(have);
 	}
-- 
1.5.4.rc2.38.gd6da3


^ permalink raw reply related

* [PATCH 5/7]: [NET]: Fix drivers to handle napi_disable() disabling interrupts.
From: David Miller @ 2008-01-08  5:40 UTC (permalink / raw)
  To: netdev


[NET]: Fix drivers to handle napi_disable() disabling interrupts.

When we add the generic napi_disable_pending() breakout
logic to net_rx_action() it means that napi_disable()
can cause NAPI poll interrupt events to be disabled.

And this is exactly what we want.  If a napi_disable()
is pending, and we are looping in the ->poll(), we want
->poll() event interrupts to stay disabled and we want
to complete the NAPI poll ASAP.

When ->poll() break out during device down was being handled on a
per-driver basis, often these drivers would turn interrupts back on
when '!netif_running()' was detected.

And this would just cause a reschedule of the NAPI ->poll() in the
interrupt handler before the napi_disable() could get in there and
grab the NAPI_STATE_SCHED bit.

The vast majority of drivers don't care if napi_disable() might have
the side effect of disabling NAPI ->poll() event interrupts.  In all
such cases, when a napi_disable() is performed, the driver just
disabled interrupts or is about to.

However there were three exceptions to this in PCNET32, R8169, and
SKY2.  To fix those cases, at the subsequent napi_enable() points, I
added code to ensure that the ->poll() interrupt events are enabled in
the hardware.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/pcnet32.c |    5 +++++
 drivers/net/r8169.c   |    2 ++
 drivers/net/sky2.c    |    3 +++
 3 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/drivers/net/pcnet32.c b/drivers/net/pcnet32.c
index ff92aca..90498ff 100644
--- a/drivers/net/pcnet32.c
+++ b/drivers/net/pcnet32.c
@@ -455,9 +455,14 @@ static void pcnet32_netif_start(struct net_device *dev)
 {
 #ifdef CONFIG_PCNET32_NAPI
 	struct pcnet32_private *lp = netdev_priv(dev);
+	ulong ioaddr = dev->base_addr;
+	u16 val;
 #endif
 	netif_wake_queue(dev);
 #ifdef CONFIG_PCNET32_NAPI
+	val = lp->a.read_csr(ioaddr, CSR3);
+	val &= 0x00ff;
+	lp->a.write_csr(ioaddr, CSR3, val);
 	napi_enable(&lp->napi);
 #endif
 }
diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 5863190..af80309 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -2398,6 +2398,8 @@ static void rtl8169_wait_for_quiescence(struct net_device *dev)
 	rtl8169_irq_mask_and_ack(ioaddr);
 
 #ifdef CONFIG_R8169_NAPI
+	tp->intr_mask = 0xffff;
+	RTL_W16(IntrMask, tp->intr_event);
 	napi_enable(&tp->napi);
 #endif
 }
diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index a74fc11..52ec89b 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -1168,6 +1168,7 @@ static void sky2_vlan_rx_register(struct net_device *dev, struct vlan_group *grp
 			     TX_VLAN_TAG_OFF);
 	}
 
+	sky2_read32(hw, B0_Y2_SP_LISR);
 	napi_enable(&hw->napi);
 	netif_tx_unlock_bh(dev);
 }
@@ -2043,6 +2044,7 @@ static int sky2_change_mtu(struct net_device *dev, int new_mtu)
 	err = sky2_rx_start(sky2);
 	sky2_write32(hw, B0_IMSK, imask);
 
+	sky2_read32(hw, B0_Y2_SP_LISR);
 	napi_enable(&hw->napi);
 
 	if (err)
@@ -3861,6 +3863,7 @@ static int sky2_debug_show(struct seq_file *seq, void *v)
 		   last = sky2_read16(hw, Y2_QADDR(rxqaddr[port], PREF_UNIT_PUT_IDX)),
 		   sky2_read16(hw, Y2_QADDR(rxqaddr[port], PREF_UNIT_LAST_IDX)));
 
+	sky2_read32(hw, B0_Y2_SP_LISR);
 	napi_enable(&hw->napi);
 	return 0;
 }
-- 
1.5.4.rc2.38.gd6da3


^ permalink raw reply related

* [PATCH 4/7]: [NETXEN]: Fix ->poll() done logic.
From: David Miller @ 2008-01-08  5:39 UTC (permalink / raw)
  To: netdev


[NETXEN]: Fix ->poll() done logic.

If work_done >= budget we should always elide the NAPI
completion.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/netxen/netxen_nic_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/netxen/netxen_nic_main.c b/drivers/net/netxen/netxen_nic_main.c
index a80f0cd..454226f 100644
--- a/drivers/net/netxen/netxen_nic_main.c
+++ b/drivers/net/netxen/netxen_nic_main.c
@@ -1321,7 +1321,7 @@ static int netxen_nic_poll(struct napi_struct *napi, int budget)
 						     budget / MAX_RCV_CTX);
 	}
 
-	if (work_done >= budget && netxen_nic_rx_has_work(adapter) != 0)
+	if (work_done >= budget)
 		done = 0;
 
 	if (netxen_process_cmd_ring((unsigned long)adapter) == 0)
-- 
1.5.4.rc2.38.gd6da3


^ permalink raw reply related

* [PATCH 3/7]: [NET]: Do not check netif_running() and carrier state in ->poll()
From: David Miller @ 2008-01-08  5:39 UTC (permalink / raw)
  To: netdev


[NET]: Do not check netif_running() and carrier state in ->poll()

Drivers do this to try to break out of the ->poll()'ing loop
when the device is being brought administratively down.

Now that we have a napi_disable() "pending" state we are going
to solve that problem generically.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 drivers/net/e100.c                 |    2 +-
 drivers/net/e1000/e1000_main.c     |    8 +-------
 drivers/net/e1000e/netdev.c        |    8 +-------
 drivers/net/epic100.c              |    2 +-
 drivers/net/fec_8xx/fec_main.c     |    5 -----
 drivers/net/fs_enet/fs_enet-main.c |    3 ---
 drivers/net/ixgb/ixgb_main.c       |    2 +-
 drivers/net/ixgbe/ixgbe_main.c     |    8 +-------
 drivers/net/ixp2000/ixpdev.c       |    2 --
 drivers/net/myri10ge/myri10ge.c    |    2 +-
 drivers/net/natsemi.c              |    2 +-
 drivers/net/qla3xxx.c              |    7 +------
 drivers/net/s2io.c                 |    3 ---
 drivers/net/tulip/interrupt.c      |    5 -----
 drivers/net/xen-netfront.c         |    5 -----
 15 files changed, 9 insertions(+), 55 deletions(-)

diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 2b06e4b..68316f1 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -1997,7 +1997,7 @@ static int e100_poll(struct napi_struct *napi, int budget)
 	tx_cleaned = e100_tx_clean(nic);
 
 	/* If no Rx and Tx cleanup work was done, exit polling mode. */
-	if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
+	if((!tx_cleaned && (work_done == 0))) {
 		netif_rx_complete(netdev, napi);
 		e100_enable_irq(nic);
 	}
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 4f37506..9de7144 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3924,10 +3924,6 @@ e1000_clean(struct napi_struct *napi, int budget)
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
 
-	/* Keep link state information with original netdev */
-	if (!netif_carrier_ok(poll_dev))
-		goto quit_polling;
-
 	/* e1000_clean is called per-cpu.  This lock protects
 	 * tx_ring[0] from being cleaned by multiple cpus
 	 * simultaneously.  A failure obtaining the lock means
@@ -3942,9 +3938,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 	                  &work_done, budget);
 
 	/* If no Tx and not enough Rx work done, exit the polling mode */
-	if ((!tx_cleaned && (work_done == 0)) ||
-	   !netif_running(poll_dev)) {
-quit_polling:
+	if ((!tx_cleaned && (work_done == 0))) {
 		if (likely(adapter->itr_setting & 3))
 			e1000_set_itr(adapter);
 		netif_rx_complete(poll_dev, napi);
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4fd2e23..dd9698c 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1389,10 +1389,6 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
 
-	/* Keep link state information with original netdev */
-	if (!netif_carrier_ok(poll_dev))
-		goto quit_polling;
-
 	/* e1000_clean is called per-cpu.  This lock protects
 	 * tx_ring from being cleaned by multiple cpus
 	 * simultaneously.  A failure obtaining the lock means
@@ -1405,9 +1401,7 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 	adapter->clean_rx(adapter, &work_done, budget);
 
 	/* If no Tx and not enough Rx work done, exit the polling mode */
-	if ((!tx_cleaned && (work_done < budget)) ||
-	   !netif_running(poll_dev)) {
-quit_polling:
+	if ((!tx_cleaned && (work_done < budget))) {
 		if (adapter->itr_setting & 3)
 			e1000_set_itr(adapter);
 		netif_rx_complete(poll_dev, napi);
diff --git a/drivers/net/epic100.c b/drivers/net/epic100.c
index ecdd3fc..0b365b8 100644
--- a/drivers/net/epic100.c
+++ b/drivers/net/epic100.c
@@ -1273,7 +1273,7 @@ rx_action:
 
 	epic_rx_err(dev, ep);
 
-	if (netif_running(dev) && (work_done < budget)) {
+	if (work_done < budget) {
 		unsigned long flags;
 		int more;
 
diff --git a/drivers/net/fec_8xx/fec_main.c b/drivers/net/fec_8xx/fec_main.c
index 8d2904f..ab9637a 100644
--- a/drivers/net/fec_8xx/fec_main.c
+++ b/drivers/net/fec_8xx/fec_main.c
@@ -476,11 +476,6 @@ static int fec_enet_rx_common(struct fec_enet_private *ep,
 	__u16 pkt_len, sc;
 	int curidx;
 
-	if (fpi->use_napi) {
-		if (!netif_running(dev))
-			return 0;
-	}
-
 	/*
 	 * First, grab all of the stats for the incoming packet.
 	 * These get messed up if we get called due to a busy condition.
diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index f2a4d39..3e1a57a 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -96,9 +96,6 @@ static int fs_enet_rx_napi(struct napi_struct *napi, int budget)
 	u16 pkt_len, sc;
 	int curidx;
 
-	if (!netif_running(dev))
-		return 0;
-
 	/*
 	 * First, grab all of the stats for the incoming packet.
 	 * These get messed up if we get called due to a busy condition.
diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index bf9085f..a8bef52 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1794,7 +1794,7 @@ ixgb_clean(struct napi_struct *napi, int budget)
 	ixgb_clean_rx_irq(adapter, &work_done, budget);
 
 	/* if no Tx and not enough Rx work done, exit the polling mode */
-	if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
+	if((!tx_cleaned && (work_done == 0))) {
 		netif_rx_complete(netdev, napi);
 		ixgb_irq_enable(adapter);
 	}
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 00bc525..7c31930 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1470,19 +1470,13 @@ static int ixgbe_clean(struct napi_struct *napi, int budget)
 	struct net_device *netdev = adapter->netdev;
 	int tx_cleaned = 0, work_done = 0;
 
-	/* Keep link state information with original netdev */
-	if (!netif_carrier_ok(adapter->netdev))
-		goto quit_polling;
-
 	/* In non-MSIX case, there is no multi-Tx/Rx queue */
 	tx_cleaned = ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
 	ixgbe_clean_rx_irq(adapter, &adapter->rx_ring[0], &work_done,
 			   budget);
 
 	/* If no Tx and not enough Rx work done, exit the polling mode */
-	if ((!tx_cleaned && (work_done < budget)) ||
-	    !netif_running(adapter->netdev)) {
-quit_polling:
+	if ((!tx_cleaned && (work_done < budget))) {
 		netif_rx_complete(netdev, napi);
 		ixgbe_irq_enable(adapter);
 	}
diff --git a/drivers/net/ixp2000/ixpdev.c b/drivers/net/ixp2000/ixpdev.c
index 6c0dd49..484cb2b 100644
--- a/drivers/net/ixp2000/ixpdev.c
+++ b/drivers/net/ixp2000/ixpdev.c
@@ -135,8 +135,6 @@ static int ixpdev_poll(struct napi_struct *napi, int budget)
 	struct net_device *dev = ip->dev;
 	int rx;
 
-	/* @@@ Have to stop polling when nds[0] is administratively
-	 * downed while we are polling.  */
 	rx = 0;
 	do {
 		ixp2000_reg_write(IXP2000_IRQ_THD_RAW_STATUS_A_0, 0x00ff);
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 8def865..c90958f 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -1239,7 +1239,7 @@ static int myri10ge_poll(struct napi_struct *napi, int budget)
 	/* process as many rx events as NAPI will allow */
 	work_done = myri10ge_clean_rx_done(mgp, budget);
 
-	if (work_done < budget || !netif_running(netdev)) {
+	if (work_done < budget) {
 		netif_rx_complete(netdev, napi);
 		put_be32(htonl(3), mgp->irq_claim);
 	}
diff --git a/drivers/net/natsemi.c b/drivers/net/natsemi.c
index 87cde06..c329a4f 100644
--- a/drivers/net/natsemi.c
+++ b/drivers/net/natsemi.c
@@ -2266,7 +2266,7 @@ static int natsemi_poll(struct napi_struct *napi, int budget)
 	/* Reenable interrupts providing nothing is trying to shut
 	 * the chip down. */
 	spin_lock(&np->lock);
-	if (!np->hands_off && netif_running(dev))
+	if (!np->hands_off)
 		natsemi_irq_enable(dev);
 	spin_unlock(&np->lock);
 
diff --git a/drivers/net/qla3xxx.c b/drivers/net/qla3xxx.c
index a579111..cf0774d 100644
--- a/drivers/net/qla3xxx.c
+++ b/drivers/net/qla3xxx.c
@@ -2320,14 +2320,9 @@ static int ql_poll(struct napi_struct *napi, int budget)
 	unsigned long hw_flags;
 	struct ql3xxx_port_registers __iomem *port_regs = qdev->mem_map_registers;
 
-	if (!netif_carrier_ok(ndev))
-		goto quit_polling;
-
 	ql_tx_rx_clean(qdev, &tx_cleaned, &rx_cleaned, budget);
 
-	if (tx_cleaned + rx_cleaned != budget ||
-	    !netif_running(ndev)) {
-quit_polling:
+	if (tx_cleaned + rx_cleaned != budget) {
 		spin_lock_irqsave(&qdev->hw_lock, hw_flags);
 		__netif_rx_complete(ndev, napi);
 		ql_update_small_bufq_prod_index(qdev);
diff --git a/drivers/net/s2io.c b/drivers/net/s2io.c
index 9d80f1c..fa57c49 100644
--- a/drivers/net/s2io.c
+++ b/drivers/net/s2io.c
@@ -2704,9 +2704,6 @@ static int s2io_poll(struct napi_struct *napi, int budget)
 	struct XENA_dev_config __iomem *bar0 = nic->bar0;
 	int i;
 
-	if (!is_s2io_card_up(nic))
-		return 0;
-
 	mac_control = &nic->mac_control;
 	config = &nic->config;
 
diff --git a/drivers/net/tulip/interrupt.c b/drivers/net/tulip/interrupt.c
index 0461956..6284afd 100644
--- a/drivers/net/tulip/interrupt.c
+++ b/drivers/net/tulip/interrupt.c
@@ -117,9 +117,6 @@ int tulip_poll(struct napi_struct *napi, int budget)
 	int received = 0;
 #endif
 
-	if (!netif_running(dev))
-		goto done;
-
 #ifdef CONFIG_TULIP_NAPI_HW_MITIGATION
 
 /* that one buffer is needed for mit activation; or might be a
@@ -261,8 +258,6 @@ int tulip_poll(struct napi_struct *napi, int budget)
                 * finally: amount of IO did not increase at all. */
        } while ((ioread32(tp->base_addr + CSR5) & RxIntr));
 
-done:
-
  #ifdef CONFIG_TULIP_NAPI_HW_MITIGATION
 
           /* We use this simplistic scheme for IM. It's proven by
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 2a8fc43..bca37bf 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -852,11 +852,6 @@ static int xennet_poll(struct napi_struct *napi, int budget)
 
 	spin_lock(&np->rx_lock);
 
-	if (unlikely(!netif_carrier_ok(dev))) {
-		spin_unlock(&np->rx_lock);
-		return 0;
-	}
-
 	skb_queue_head_init(&rxq);
 	skb_queue_head_init(&errq);
 	skb_queue_head_init(&tmpq);
-- 
1.5.4.rc2.38.gd6da3


^ permalink raw reply related

* [PATCH 2/7]: [NET]: Add NAPI_STATE_DISABLE.
From: David Miller @ 2008-01-08  5:38 UTC (permalink / raw)
  To: netdev

[NET]: Add NAPI_STATE_DISABLE.

Create a bit to signal that a napi_disable() is in progress.

This sets up infrastructure such that net_rx_action() can generically
break out of the ->poll() loop on a NAPI context that has a pending
napi_disable() yet is being bombed with packets (and thus would
otherwise poll endlessly and not allow the napi_disable() to finish).

Now, what napi_disable() does is first set the NAPI_STATE_DISABLE bit
(to indicate that a disable is pending), then it polls for the
NAPI_STATE_SCHED bit, and once the NAPI_STATE_SCHED bit is acquired
the NAPI_STATE_DISABLE bit is cleared.  Here, the test_and_set_bit()
provides the necessary memory barrier between the various bitops.

napi_schedule_prep() now tests for a pending disable as it's first
action and won't try to obtain the NAPI_STATE_SCHED bit if a disable
is pending.

As a result, we can remove the netif_running() check in
netif_rx_schedule_prep() because the NAPI disable pending state serves
this purpose.  And, it does so in a NAPI centric manner which is what
we really want.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/netdevice.h |   16 +++++++++++++---
 1 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e393995..b0813c3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -319,21 +319,29 @@ struct napi_struct {
 enum
 {
 	NAPI_STATE_SCHED,	/* Poll is scheduled */
+	NAPI_STATE_DISABLE,	/* Disable pending */
 };

 extern void FASTCALL(__napi_schedule(struct napi_struct *n));

+static inline int napi_disable_pending(struct napi_struct *n)
+{
+	return test_bit(NAPI_STATE_DISABLE, &n->state);
+}
+
 /**
  *	napi_schedule_prep - check if napi can be scheduled
  *	@n: napi context
  *
  * Test if NAPI routine is already running, and if not mark
  * it as running.  This is used as a condition variable
- * insure only one NAPI poll instance runs
+ * insure only one NAPI poll instance runs.  We also make
+ * sure there is no pending NAPI disable.
  */
 static inline int napi_schedule_prep(struct napi_struct *n)
 {
-	return !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
+	return !napi_disable_pending(n) &&
+		!test_and_set_bit(NAPI_STATE_SCHED, &n->state);
 }

 /**
@@ -389,8 +397,10 @@ static inline void napi_complete(struct napi_struct *n)
  */
 static inline void napi_disable(struct napi_struct *n)
 {
+	set_bit(NAPI_STATE_DISABLE, &n->state);
 	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
 		msleep(1);
+	clear_bit(NAPI_STATE_DISABLE, &n->state);
 }

 /**
@@ -1268,7 +1278,7 @@ static inline u32 netif_msg_init(int debug_value, int default_msg_enable_bits)
 static inline int netif_rx_schedule_prep(struct net_device *dev,
 					 struct napi_struct *napi)
 {
-	return netif_running(dev) && napi_schedule_prep(napi);
+	return napi_schedule_prep(napi);
 }

 /* Add interface to tail of rx poll list. This assumes that _prep has
-- 
1.5.4.rc2.38.gd6da3

^ permalink raw reply related

* [PATCH 1/7]: [NET]: Do not grab device reference when scheduling a NAPI poll.
From: David Miller @ 2008-01-08  5:38 UTC (permalink / raw)
  To: netdev

[NET]: Do not grab device reference when scheduling a NAPI poll.

It is pointless, because everything that can make a device go away
will do a napi_disable() first.

The main impetus behind this is that now we can legally do a NAPI
completion in generic code like net_rx_action() which a following
changeset needs to do.  net_rx_action() can only perform actions
in NAPI centric ways, because there may be a one to many mapping
between NAPI contexts and network devices (SKY2 is one example).

We also want to get rid of this because it's an extra atomic in the
NAPI paths, and also because it is one of the last instances where the
NAPI interfaces care about net devices.

The one remaining netdev detail the NAPI stuff cares about is the
netif_running() check which will be killed off in a subsequent
changeset.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/netdevice.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1e6af4f..e393995 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1277,7 +1277,6 @@ static inline int netif_rx_schedule_prep(struct net_device *dev,
 static inline void __netif_rx_schedule(struct net_device *dev,
 				       struct napi_struct *napi)
 {
-	dev_hold(dev);
 	__napi_schedule(napi);
 }

@@ -1308,7 +1307,6 @@ static inline void __netif_rx_complete(struct net_device *dev,
 				       struct napi_struct *napi)
 {
 	__napi_complete(napi);
-	dev_put(dev);
 }

 /* Remove interface from poll list: it must be in the poll list
-- 
1.5.4.rc2.38.gd6da3

^ permalink raw reply related

* [PATCH 0/7]: Fix napi_disable() wedge during packet flood.
From: David Miller @ 2008-01-08  5:37 UTC (permalink / raw)
  To: netdev

After going back and forth several times on how to fix this bug I
finally think I have a clean solution.

It is possible to clean up a ton of stuff, such as getting rid of
netif_rx_schedule() et al. (none of these interfaces care about the
netdev arg any long) but we should do that post 2.6.24 and just do
what is necessary to fix this bug first.

To the Intel driver maintainers, I know the last patch is contentious.
But at this time it is more important to have all of these drivers
behaving consistently.  If you want to have the TX work factor into
the ->poll() logic that is fine but please discuss it with the list
and apply that change consistently over all of the 5 drivers.  The
current situation was a bit of a mess.

^ permalink raw reply

* Re: [PATCH] via-velocity big-endian support
From: linux @ 2008-01-08  5:34 UTC (permalink / raw)
  To: romieu, viro; +Cc: jgarzik, linux, netdev
In-Reply-To: <20080107231859.GA3450@electric-eye.fr.zoreil.com>

It doesn't look like you need a test report, but here's one anyway...
I grabbed the patch series from git and am running it successfully
right now.

^ permalink raw reply

* Re: Please pull 'fixes-davem' branch of wireless-2.6
From: David Miller @ 2008-01-08  5:21 UTC (permalink / raw)
  To: linville-2XuSBdqkA4R54TAoqtyWWQ
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20080108051425.GA3125-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

From: "John W. Linville" <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
Date: Tue, 8 Jan 2008 00:14:25 -0500

> Two more fixes for 2.6.24.  I think they are self-explanatory --
> let me know if I'm too optimistic! :-)

Thanks John, I'll pull these in shortly.

^ permalink raw reply

* Please pull 'fixes-davem' branch of wireless-2.6
From: John W. Linville @ 2008-01-08  5:14 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA

Dave,

Two more fixes for 2.6.24.  I think they are self-explanatory --
let me know if I'm too optimistic! :-)

Thanks,

John

---

Individual patches are available here:

	http://www.kernel.org/pub/linux/kernel/people/linville/wireless-2.6/fixes-davem

---

The following changes since commit 3ce54450461bad18bbe1f9f5aa3ecd2f8e8d1235:
  Linus Torvalds (1):
        Linux 2.6.24-rc7

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git fixes-davem

Andrew Lutomirski (1):
      mac80211: return an error when SIWRATE doesn't match any rate

Michael Buesch (1):
      ssb: Fix probing of PCI cores if PCI and PCIE core is available

 drivers/ssb/scan.c             |   11 +++++++++++
 net/mac80211/ieee80211_ioctl.c |    6 +++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/ssb/scan.c b/drivers/ssb/scan.c
index 96258c6..63ee5cf 100644
--- a/drivers/ssb/scan.c
+++ b/drivers/ssb/scan.c
@@ -388,6 +388,17 @@ int ssb_bus_scan(struct ssb_bus *bus,
 		case SSB_DEV_PCI:
 		case SSB_DEV_PCIE:
 #ifdef CONFIG_SSB_DRIVER_PCICORE
+			if (bus->bustype == SSB_BUSTYPE_PCI) {
+				/* Ignore PCI cores on PCI-E cards.
+				 * Ignore PCI-E cores on PCI cards. */
+				if (dev->id.coreid == SSB_DEV_PCI) {
+					if (bus->host_pci->is_pcie)
+						continue;
+				} else {
+					if (!bus->host_pci->is_pcie)
+						continue;
+				}
+			}
 			if (bus->pcicore.dev) {
 				ssb_printk(KERN_WARNING PFX
 					   "WARNING: Multiple PCI(E) cores found\n");
diff --git a/net/mac80211/ieee80211_ioctl.c b/net/mac80211/ieee80211_ioctl.c
index 7027eed..308bbe4 100644
--- a/net/mac80211/ieee80211_ioctl.c
+++ b/net/mac80211/ieee80211_ioctl.c
@@ -591,7 +591,7 @@ static int ieee80211_ioctl_siwrate(struct net_device *dev,
 	sdata->bss->force_unicast_rateidx = -1;
 	if (rate->value < 0)
 		return 0;
-	for (i=0; i< mode->num_rates; i++) {
+	for (i=0; i < mode->num_rates; i++) {
 		struct ieee80211_rate *rates = &mode->rates[i];
 		int this_rate = rates->rate;
 
@@ -599,10 +599,10 @@ static int ieee80211_ioctl_siwrate(struct net_device *dev,
 			sdata->bss->max_ratectrl_rateidx = i;
 			if (rate->fixed)
 				sdata->bss->force_unicast_rateidx = i;
-			break;
+			return 0;
 		}
 	}
-	return 0;
+	return -EINVAL;
 }
 
 static int ieee80211_ioctl_giwrate(struct net_device *dev,
-- 
John W. Linville
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org

^ permalink raw reply related

* Re: [PATCH 3/4] [XFRM]: Kill some bloat
From: David Miller @ 2008-01-08  5:10 UTC (permalink / raw)
  To: andi; +Cc: herbert, ilpo.jarvinen, netdev, acme, paul.moore, latten
In-Reply-To: <20080108050007.GA25338@one.firstfloor.org>

From: Andi Kleen <andi@firstfloor.org>
Date: Tue, 8 Jan 2008 06:00:07 +0100

> On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote:
> > The vast majority of them are one, two, and three liners.
> 
> % awk '  { line++ } ; /^{/ { total++; start = line } ; /^}/ { len=line-start-3; if (len > 4) l++; if (len >= 10) k++; } ; END { print total, l, l/total, k, k/total }' < include/net/tcp.h
> 68 28 0.411765 20 0.294118
> 
> 41% are over 4 lines, 29% are >= 10 lines.

Take out the comments and whitespace lines, your script is
too simplistic.

^ permalink raw reply

* Re: [PATCH 3/4] [XFRM]: Kill some bloat
From: Andi Kleen @ 2008-01-08  5:00 UTC (permalink / raw)
  To: David Miller
  Cc: andi, herbert, ilpo.jarvinen, netdev, acme, paul.moore, latten
In-Reply-To: <20080107.193700.159646842.davem@davemloft.net>

On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Tue, 8 Jan 2008 03:05:29 +0100
> 
> > On Mon, Jan 07, 2008 at 05:54:58PM -0800, David Miller wrote:
> > > I explicitly left them out.
> > > 
> > > Most of them are abstractions of common 2 or 3 instruction
> > > calculations, and thus should stay inline.
> > 
> > Definitely not in tcp.h. It has quite a lot of very long functions, of
> > which very few really need to be inline: (AFAIK the only one where 
> > it makes really sense is tcp_set_state due to constant evaluation; 
> > although I never quite understood why the callers just didn't 
> > call explicit functions to do these actions) 
> > 
> > % awk '  { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += line-start-2; } ; END { print r/n }' < include/net/tcp.h 
> > 9.48889
> > 
> > The average function length is 9 lines.
> 
> The vast majority of them are one, two, and three liners.

% awk '  { line++ } ; /^{/ { total++; start = line } ; /^}/ { len=line-start-3; if (len > 4) l++; if (len >= 10) k++; } ; END { print total, l, l/total, k, k/total }' < include/net/tcp.h
68 28 0.411765 20 0.294118

41% are over 4 lines, 29% are >= 10 lines.

-Andi


^ permalink raw reply

* Re: [PATCH][XFRM] Statistics: Add outbound-dropping error.
From: Herbert Xu @ 2008-01-08  4:33 UTC (permalink / raw)
  To: Masahide NAKAMURA; +Cc: davem, netdev
In-Reply-To: <11997665683729-git-send-email-nakam@linux-ipv6.org>

On Tue, Jan 08, 2008 at 01:29:28PM +0900, Masahide NAKAMURA wrote:
>
> P.S.
> I don't touch XFRM_LOOKUP_ICMP related error at __xfrm_lookup()
> since it may not drop the packet.
> Correct me if it is wrong or comments are welcomed.

Right, whether the packet is dropped would be decided by the caller.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH][XFRM] Statistics: Add outbound-dropping error.
From: Masahide NAKAMURA @ 2008-01-08  4:29 UTC (permalink / raw)
  To: davem; +Cc: herbert, netdev, Masahide NAKAMURA

Hello,

I found two more points where they should be incremented
as XFRM packet dropping counter. Please apply it.

P.S.
I don't touch XFRM_LOOKUP_ICMP related error at __xfrm_lookup()
since it may not drop the packet.
Correct me if it is wrong or comments are welcomed.

[PATCH][XFRM] Statistics: Add outbound-dropping error.

o Increment PolError counter when flow_cache_lookup() returns
  errored pointer.

o Increment NoStates counter at larval-drop.

Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
---
 net/xfrm/xfrm_policy.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 280f8de..d83227b 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1510,8 +1510,10 @@ restart:
 		policy = flow_cache_lookup(fl, dst_orig->ops->family,
 					   dir, xfrm_policy_lookup);
 		err = PTR_ERR(policy);
-		if (IS_ERR(policy))
+		if (IS_ERR(policy)) {
+			XFRM_INC_STATS(LINUX_MIB_XFRMOUTPOLERROR);
 			goto dropdst;
+		}
 	}

 	if (!policy)
@@ -1603,6 +1605,7 @@ restart:
 				/* EREMOTE tells the caller to generate
 				 * a one-shot blackhole route.
 				 */
+				XFRM_INC_STATS(LINUX_MIB_XFRMOUTNOSTATES);
 				xfrm_pol_put(policy);
 				return -EREMOTE;
 			}
-- 
1.4.4.2

^ permalink raw reply related

* Re: [PATCH 3/4] [XFRM]: Kill some bloat
From: David Miller @ 2008-01-08  3:37 UTC (permalink / raw)
  To: andi; +Cc: herbert, ilpo.jarvinen, netdev, acme, paul.moore, latten
In-Reply-To: <20080108020529.GC16156@one.firstfloor.org>

From: Andi Kleen <andi@firstfloor.org>
Date: Tue, 8 Jan 2008 03:05:29 +0100

> On Mon, Jan 07, 2008 at 05:54:58PM -0800, David Miller wrote:
> > I explicitly left them out.
> > 
> > Most of them are abstractions of common 2 or 3 instruction
> > calculations, and thus should stay inline.
> 
> Definitely not in tcp.h. It has quite a lot of very long functions, of
> which very few really need to be inline: (AFAIK the only one where 
> it makes really sense is tcp_set_state due to constant evaluation; 
> although I never quite understood why the callers just didn't 
> call explicit functions to do these actions) 
> 
> % awk '  { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += line-start-2; } ; END { print r/n }' < include/net/tcp.h 
> 9.48889
> 
> The average function length is 9 lines.

The vast majority of them are one, two, and three liners.

There are about 4 or 5 inlines in there are in fact large and perhaps
should be removed, and these puff up your average.

^ permalink raw reply

* Re: Top 10 kernel oopses for the week ending January 5th, 2008
From: Linus Torvalds @ 2008-01-08  3:26 UTC (permalink / raw)
  To: Kevin Winchester
  Cc: J. Bruce Fields, Al Viro, Arjan van de Ven,
	Linux Kernel Mailing List, Andrew Morton, NetDev
In-Reply-To: <4782CF9C.6000508@gmail.com>

On Mon, 7 Jan 2008, Kevin Winchester wrote:

> J. Bruce Fields wrote:
> > 
> > Is there any good basic documentation on this to point people at?
> 
> I would second this question.  I see people "decode" oops on lkml often 
> enough, but I've never been entirely sure how its done.  Is it somewhere 
> in Documentation?

It's actually not necessarily at all that trivial, unless you have a deep 
understanding of the code generated for the architecture in question (and 
even then, some oopses take more time to figure out than others, thanks 
to inlining and tailcalls etc).

If the oops happened with a kernel you generated yourself, it's usually 
rather easy. Especially if you said "y" to the "generate debugging info" 
question at configuration time. Because, in that case, you really just do 
a simple

	gdb vmlinux

and then you can do (for example) something like setting a breakpoint at 
the EIP that was reported for the oops, and it will tell you what line it 
came from.

However, if you don't have the exact binary - which is the common case for 
random oopses reported on lkml - you will generally have to disassemble 
the hex sequence given in the oops (the "Code:" line), and try to match it 
up against the source code to try to figure out what is going on.

Even just the disassembly is not entirely trivial, since the oops will 
give you the eip that it happened at, but you often want to also 
disassemble *backwards* in order to get more of a context (the "Code:" 
line will mark the particular EIP that starts the oopsing instruction by 
enclosing it in <xx>, but with non-constant instruction lengths, you need 
to use a bit of trial-and-error to figure it out.

I usually just compile a small program like

	const char array[]="\xnn\xnn\xnn...";

	int main(int argc, char **argv)
	{
		printf("%p\n", array);
		*(int *)0=0;
	}

and run it under gdb, and then when it gets the SIGSEGV (due to the 
obvious NULL pointer dereference), I can just ask gdb to disassemble 
around the array that contains the code[] stuff. Try a few offsets, to see 
when the disassembly makes sense (and gives the reported EIP as the 
beginning of one of the disassembled instructions).

(You can do it other and smarter ways too, I'm not claiming that's a 
particularly good way to do it, and the old "ksymoops" program used to do 
a pretty good job of this, but I'm used to that particular idiotic way 
myself, since it's how I've basically always done it)

After that, you still need to try to match up the assembly code with the 
source code and figure out what variables the register contents actually 
are all about. You can often try to do a

	make the/affected/file.s

to generate the asm file in your own tree - the register allocation can be 
totally different due to different compilers and different options (and 
things like the fact that maybe the source tree you do this on doesn't 
match the oops report exactly), but it's usually a good starting point to 
compare the disassembly from gdb with the *.s file output from the 
compiler.

Quite often, it's all very obvious (you see some constant or other simple 
pattern). But if you're not used to the assembly format, you'll spend a 
lot of brainpower just trying to figure that part out even for the obvious 
stuff, which is why it's a good thing if you are very comfortable indeed 
with the assembly language of that particular platform.

It's not really all that hard. But the first few times you see those 
oopses, it all looks mostly like just line noise. So it definitely takes 
some practice to do it well.

Anyway, let's take an example, from

	http://lkml.org/lkml/2008/1/1/189

where the most obviously relevant parts are:

	BUG: unable to handle kernel paging request at virtual address 00100100
	EIP:    0060:[<f8819668>] 
	EIP is at evdev_disconnect+0x65/0x9e

	eax: 00000000   ebx: 000ffcf0   ecx: c1926760   edx: 00000033
	esi: f7415600   edi: f741564c   ebp: f7415654   esp: c1967e68
	Call Trace:
		[<c03454b2>] input_unregister_device+0x6f/0xff
		[<c03c6eb6>] klist_release+0x27/0x30
		[<c029178a>] kref_put+0x5f/0x6c
	..
	Code: 5e 4c 81 eb 10 04 00 00 eb 21 8d 83 08 04 00 00 b9 06 00 02 
	      00 ba 1d 00 00 00 e8 6a 93 95 c7 8b 9b 10 04 00 00 81 eb 10 
	      04 00 00 <8b> 83 10 04 00 00 0f 18 00 90 8d 83 10 04 00 00 
	      39 f8 75 cb 8d

so here let's do the above silly C program:

	const char array[]="\x5e\x4c\x81\xeb\x10\x04\x00\x00\xeb\x21..

and running it under gdb gives:

	0x8048500

	Program received signal SIGSEGV, Segmentation fault.
	0x080483f7 in main () at test.c:14
	14              *(int*)0=0;

and now I can just try

	x/20i 0x8048500

and it turns out that already gives a reasonable disassembly. The first 
few instructions are bogus: they're really part of the previous 
instruction, but it looks pretty sane around the actual problem spot, 
which is "array+43" (there are 42 bytes of code before the EIP one, and 20 
bytes after):

	0x8048500 <array>:      pop    %esi
	0x8048501 <array+1>:    dec    %esp
	0x8048502 <array+2>:    sub    $0x410,%ebx
	0x8048508 <array+8>:    jmp    0x804852b <array+43>
	0x804850a <array+10>:   lea    0x408(%ebx),%eax
	0x8048510 <array+16>:   mov    $0x20006,%ecx
	0x8048515 <array+21>:   mov    $0x1d,%edx
	0x804851a <array+26>:   call   0xcf9a1889
	0x804851f <array+31>:   mov    0x410(%ebx),%ebx
	0x8048525 <array+37>:   sub    $0x410,%ebx
	0x804852b <array+43>:   mov    0x410(%ebx),%eax
	0x8048531 <array+49>:   prefetchnta (%eax)
	0x8048534 <array+52>:   nop
	0x8048535 <array+53>:   lea    0x410(%ebx),%eax
	0x804853b <array+59>:   cmp    %edi,%eax
	0x804853d <array+61>:   jne    0x804850a <array+10>
	0x804853f <array+63>:   lea    (%eax),%eax
	.. 

so now we know that the faulting instruction was that

	mov    0x410(%ebx),%eax

and we can also see that this also matches the address that caused the 
oops (ebx=000ffcf0, so 0x410(%ebx) is 00100100, which matches the "unable 
to handle kernel paging request" message).

(Now, people used to kernel oopses will also recognize 00100100 as the 
LIST_POISON1, so this is all about dereferencing the ->next pointer of a 
list entry that has been removed from the list, but that's a whole 
separate level of kernel knowledge).

Anyway, you can now do

	make drivers/input/evdev.s

and see if you can find that kind of code sequence in there. You can use 
the "EIP: evdev_disconnect+0x65/0x9e" thing as a hint: if your compiler 
setup isn't too different, it's likely to be roughly two thirds into that 
evdev_disconnect function (but inlining really can mean that it's 
somewhere else entirely in the source tree!)

The rest left as an exercise for the reader.

		Linus

^ permalink raw reply

* Re: [PATCH 3/4] [XFRM]: Kill some bloat II
From: Andi Kleen @ 2008-01-08  2:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, herbert, ilpo.jarvinen, netdev, acme, paul.moore,
	latten
In-Reply-To: <20080108020529.GC16156@one.firstfloor.org>

> % awk '  { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += line-start-2; } ; END { print r/n }' < include/net/tcp.h 
> 9.48889
> 
> The average function length is 9 lines.

Actually 8 -- the awk hack had a off by one. Still too long.

-Andi

^ permalink raw reply

* Re: [PATCH 3/4] [XFRM]: Kill some bloat
From: Andi Kleen @ 2008-01-08  2:05 UTC (permalink / raw)
  To: David Miller
  Cc: andi, herbert, ilpo.jarvinen, netdev, acme, paul.moore, latten
In-Reply-To: <20080107.175458.127194310.davem@davemloft.net>

On Mon, Jan 07, 2008 at 05:54:58PM -0800, David Miller wrote:
> From: Andi Kleen <andi@firstfloor.org>
> Date: Tue, 08 Jan 2008 01:23:11 +0100
> 
> > David Miller <davem@davemloft.net> writes:
> > 
> > > Similarly I question just about any inline usage at all in *.c files
> > 
> > Don't forget the .h files. Especially a lot of stuff in tcp.h should
> > be probably in some .c file and not be inline.
> 
> I explicitly left them out.
> 
> Most of them are abstractions of common 2 or 3 instruction
> calculations, and thus should stay inline.

Definitely not in tcp.h. It has quite a lot of very long functions, of
which very few really need to be inline: (AFAIK the only one where 
it makes really sense is tcp_set_state due to constant evaluation; 
although I never quite understood why the callers just didn't 
call explicit functions to do these actions) 

% awk '  { line++ } ; /^{/ { start = line } ; /^}/ { n++; r += line-start-2; } ; END { print r/n }' < include/net/tcp.h 
9.48889

The average function length is 9 lines.

-Andi


^ permalink raw reply

* Re: TCP cache performance
From: David Miller @ 2008-01-08  1:57 UTC (permalink / raw)
  To: virtualphtn; +Cc: netdev, lachlan.andrew
In-Reply-To: <4782D06B.30706@gmail.com>

From: Tom Quetchenbach <virtualphtn@gmail.com>
Date: Mon, 07 Jan 2008 17:22:51 -0800

> This suggests that efforts to improve TCP performance should focus
> on cache usage rather than just processing time.

Thanks for reporting your data, but we very well know what the exact
problem is.

When we recover from loss, we touch thousands of packets freeing up
basically an entire window's worth.

^ permalink raw reply

* [PATCH 3/3] bonding: fix locking during alb failover and slave removal
From: Jay Vosburgh @ 2008-01-08  1:57 UTC (permalink / raw)
  To: netdev
  Cc: Jeff Garzik, David Miller, Andy Gospodarek, Krzysztof Oledzki,
	Jay Vosburgh
In-Reply-To: <1199757423867-git-send-email-fubar@us.ibm.com>

	alb_fasten_mac_swap (actually rlb_teach_disabled_mac_on_primary)
requries RTNL and no other locks.  This could cause dev_set_promiscuity
and/or dev_set_mac_address to be called with improper locking.

	Changed callers to hold only RTNL during calls to alb_fasten_mac_swap
or functions calling it.  Updated header comments in affected functions to
reflect proper reality of locking requirements.

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
---
 drivers/net/bonding/bond_alb.c  |   18 ++++++++++++------
 drivers/net/bonding/bond_main.c |   14 ++++++++------
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 9b55a12..b57bc94 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -979,7 +979,7 @@ static void alb_swap_mac_addr(struct bonding *bond, struct slave *slave1, struct
 /*
  * Send learning packets after MAC address swap.
  *
- * Called with RTNL and bond->lock held for read.
+ * Called with RTNL and no other locks
  */
 static void alb_fasten_mac_swap(struct bonding *bond, struct slave *slave1,
 				struct slave *slave2)
@@ -987,6 +987,8 @@ static void alb_fasten_mac_swap(struct bonding *bond, struct slave *slave1,
 	int slaves_state_differ = (SLAVE_IS_OK(slave1) != SLAVE_IS_OK(slave2));
 	struct slave *disabled_slave = NULL;
 
+	ASSERT_RTNL();
+
 	/* fasten the change in the switch */
 	if (SLAVE_IS_OK(slave1)) {
 		alb_send_learning_packets(slave1, slave1->dev->dev_addr);
@@ -1031,7 +1033,7 @@ static void alb_fasten_mac_swap(struct bonding *bond, struct slave *slave1,
  * a slave that has @slave's permanet address as its current address.
  * We'll make sure that that slave no longer uses @slave's permanent address.
  *
- * Caller must hold bond lock
+ * Caller must hold RTNL and no other locks
  */
 static void alb_change_hw_addr_on_detach(struct bonding *bond, struct slave *slave)
 {
@@ -1542,7 +1544,12 @@ int bond_alb_init_slave(struct bonding *bond, struct slave *slave)
 	return 0;
 }
 
-/* Caller must hold bond lock for write */
+/*
+ * Remove slave from tlb and rlb hash tables, and fix up MAC addresses
+ * if necessary.
+ *
+ * Caller must hold RTNL and no other locks
+ */
 void bond_alb_deinit_slave(struct bonding *bond, struct slave *slave)
 {
 	if (bond->slave_cnt > 1) {
@@ -1658,12 +1665,11 @@ void bond_alb_handle_active_change(struct bonding *bond, struct slave *new_slave
 				       bond->alb_info.rlb_enabled);
 	}
 
-	read_lock(&bond->lock);
-
 	if (swap_slave) {
 		alb_fasten_mac_swap(bond, swap_slave, new_slave);
+		read_lock(&bond->lock);
 	} else {
-		/* fasten bond mac on new current slave */
+		read_lock(&bond->lock);
 		alb_send_learning_packets(new_slave, bond->dev->dev_addr);
 	}
 
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b0b2603..77d004d 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1746,7 +1746,9 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 		 * has been cleared (if our_slave == old_current),
 		 * but before a new active slave is selected.
 		 */
+		write_unlock_bh(&bond->lock);
 		bond_alb_deinit_slave(bond, slave);
+		write_lock_bh(&bond->lock);
 	}
 
 	if (oldcurrent == slave) {
@@ -1905,6 +1907,12 @@ static int bond_release_all(struct net_device *bond_dev)
 		slave_dev = slave->dev;
 		bond_detach_slave(bond, slave);
 
+		/* now that the slave is detached, unlock and perform
+		 * all the undo steps that should not be called from
+		 * within a lock.
+		 */
+		write_unlock_bh(&bond->lock);
+
 		if ((bond->params.mode == BOND_MODE_TLB) ||
 		    (bond->params.mode == BOND_MODE_ALB)) {
 			/* must be called only after the slave
@@ -1915,12 +1923,6 @@ static int bond_release_all(struct net_device *bond_dev)
 
 		bond_compute_features(bond);
 
-		/* now that the slave is detached, unlock and perform
-		 * all the undo steps that should not be called from
-		 * within a lock.
-		 */
-		write_unlock_bh(&bond->lock);
-
 		bond_destroy_slave_symlinks(bond_dev, slave_dev);
 		bond_del_vlans_from_slave(bond, slave_dev);
 
-- 
1.5.3.4.206.g58ba4-dirty


^ permalink raw reply related

* [PATCH 2/3] bonding: fix ASSERT_RTNL that produces spurious warnings
From: Jay Vosburgh @ 2008-01-08  1:56 UTC (permalink / raw)
  To: netdev
  Cc: Jeff Garzik, David Miller, Andy Gospodarek, Krzysztof Oledzki,
	Jay Vosburgh
In-Reply-To: <11997574222492-git-send-email-fubar@us.ibm.com>

	Move an ASSERT_RTNL down to where we should hold only RTNL;
the existing check produces spurious warnings because we hold additional
locks at _bh, tripping a debug warning in spin_lock_mutex().

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
---
 drivers/net/bonding/bond_alb.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 25b8dbf..9b55a12 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -1601,9 +1601,6 @@ void bond_alb_handle_active_change(struct bonding *bond, struct slave *new_slave
 	struct slave *swap_slave;
 	int i;
 
-	if (new_slave)
-		ASSERT_RTNL();
-
 	if (bond->curr_active_slave == new_slave) {
 		return;
 	}
@@ -1649,6 +1646,8 @@ void bond_alb_handle_active_change(struct bonding *bond, struct slave *new_slave
 	write_unlock_bh(&bond->curr_slave_lock);
 	read_unlock(&bond->lock);
 
+	ASSERT_RTNL();
+
 	/* curr_active_slave must be set before calling alb_swap_mac_addr */
 	if (swap_slave) {
 		/* swap mac address */
-- 
1.5.3.4.206.g58ba4-dirty


^ permalink raw reply related

* [PATCH 1/3] bonding: fix locking in sysfs primary/active selection
From: Jay Vosburgh @ 2008-01-08  1:56 UTC (permalink / raw)
  To: netdev
  Cc: Jeff Garzik, David Miller, Andy Gospodarek, Krzysztof Oledzki,
	Jay Vosburgh
In-Reply-To: <11997574203125-git-send-email-fubar@us.ibm.com>

	Fix the functions that store the primary and active slave
options via sysfs to hold the correct locks in the correct order.

	The bond_change_active_slave and bond_select_active_slave
functions both require rtnl, bond->lock for read and curr_slave_lock for
write_bh, and no other locks.  This is so that the lower level
mode-specific functions (notably for balance-alb mode) can release locks
down to just rtnl in order to call, e.g., dev_set_mac_address with the
locks it expects (rtnl only).

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
---
 drivers/net/bonding/bond_sysfs.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 11b76b3..28a2d80 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1075,7 +1075,10 @@ static ssize_t bonding_store_primary(struct device *d,
 	struct slave *slave;
 	struct bonding *bond = to_bond(d);
 
-	write_lock_bh(&bond->lock);
+	rtnl_lock();
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
 		       ": %s: Unable to set primary slave; %s is in mode %d\n",
@@ -1109,8 +1112,8 @@ static ssize_t bonding_store_primary(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
-
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;
@@ -1190,7 +1193,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 	struct bonding *bond = to_bond(d);
 
 	rtnl_lock();
-	write_lock_bh(&bond->lock);
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
 
 	if (!USES_PRIMARY(bond->params.mode)) {
 		printk(KERN_INFO DRV_NAME
@@ -1247,7 +1251,8 @@ static ssize_t bonding_store_active_slave(struct device *d,
 		}
 	}
 out:
-	write_unlock_bh(&bond->lock);
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
 	rtnl_unlock();
 
 	return count;
-- 
1.5.3.4.206.g58ba4-dirty


^ permalink raw reply related

* [PATCH 0/3] bonding: 3 fixes for 2.6.24
From: Jay Vosburgh @ 2008-01-08  1:56 UTC (permalink / raw)
  To: netdev; +Cc: Jeff Garzik, David Miller, Andy Gospodarek, Krzysztof Oledzki

	Following are three fixes to fix locking problems and
silence locking-related warnings in the current 2.6.24-rc.

	patch 1: fix locking in sysfs primary/active selection

	Call core network functions with expected locks to
eliminate potential deadlock and silence warnings.

	patch 2: fix ASSERT_RTNL that produces spurious warnings

	Relocate ASSERT_RTNL to remove a false warning; after patch,
ASSERT is located in code that holds only RTNL (additional locks were
causing the ASSERT to trip)

	patch 3: fix locking during alb failover and slave removal

	Fix all call paths into alb_fasten_mac_swap to hold only RTNL.
Eliminates deadlock and silences warnings.

	Patches are against the current netdev-2.6#upstream branch.

	Please apply for 2.6.24.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [PATCH 3/4] [XFRM]: Kill some bloat
From: David Miller @ 2008-01-08  1:54 UTC (permalink / raw)
  To: andi; +Cc: herbert, ilpo.jarvinen, netdev, acme, paul.moore, latten
In-Reply-To: <p73sl19qi6o.fsf@bingen.suse.de>

From: Andi Kleen <andi@firstfloor.org>
Date: Tue, 08 Jan 2008 01:23:11 +0100

> David Miller <davem@davemloft.net> writes:
> 
> > Similarly I question just about any inline usage at all in *.c files
> 
> Don't forget the .h files. Especially a lot of stuff in tcp.h should
> be probably in some .c file and not be inline.

I explicitly left them out.

Most of them are abstractions of common 2 or 3 instruction
calculations, and thus should stay inline.

^ permalink raw reply

* TCP cache performance
From: Tom Quetchenbach @ 2008-01-08  1:22 UTC (permalink / raw)
  To: netdev; +Cc: Lachlan Andrew

I've been continuing off and on to investigate TCP performance issues.
As has been noted before on this list, loss and subsequent processing
can lead to spikes in the measured RTT which confuse delay-based
congestion control algorithms.

I've done some experiments that indicate that cache size is a
significant limiting factor here. My desktop machine with a 2.4 GHz Core
Duo and 4 MB cache quite noticeably outperforms our experiment servers,
which have two dual-core Xeons at 2.66 GHz but only 512 KB cache. At 400
Mbps with 40ms round-trip delay and 1024-packet buffer the desktop
behaves fairly normally, although there is still a large RTT spike at
the start of the flow due to slow-start. The servers show large RTT
spikes at each loss event, as well as some timeouts.

This suggests that efforts to improve TCP performance should focus on
cache usage rather than just processing time.

Plots of cwnd, RTT, and CPU load are available here:

512K cache:
http://wil-ns.cs.caltech.edu/benchmark.tmp/265/2flow--ALG=illinois-BUF=1024-BUF_tgt=1333,1.0-BW=400M-GAP=150-LEN=600-RTT=40--1/

4M cache:
http://wil-ns.cs.caltech.edu/benchmark.tmp/266/2flow--ALG=illinois-BUF=1024-BUF_tgt=1333,1.0-BW=400M-GAP=150-LEN=600-RTT=40--1/

Tests were done with net-2.6 (2.6.23.1 gives similar results though)
using tcp_probe to capture data.

-Tom

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox