Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] virtio-net: pass gfp to add_buf
From: Rusty Russell @ 2010-06-01  7:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, mst
In-Reply-To: <20100601.002246.149820849.davem@davemloft.net>

On Tue, 1 Jun 2010 04:52:46 pm David Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Mon, 31 May 2010 20:40:01 +0930
> 
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > 
> > (Dave: the virtqueue_add_buf_gfp is in Linus' tree, so please queue
> >  this trivial use now.  Thanks!)
> > 
> > virtio-net bounces buffer allocations off to
> > a thread if it can't allocate buffers from the atomic
> > pool. However, if posting buffers still requires atomic
> > buffers, this is unlikely to succeed.
> > Fix by passing in the proper gfp_t parameter.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> 
> Your code sucks, and you're wrong.
> 
> Please give me positive feedback and tell me how right I am to pump up
> my ego or else I will verbally destroy you and make you feel small.
> 
> Applied :-)))))))))))))))

Thanks Dave, that made my day :)

Cheers,
Rusty.

^ permalink raw reply

* Re: [patch 1/2] caif: remove unneeded variable from caif_net_open()
From: Dan Carpenter @ 2010-06-01  7:45 UTC (permalink / raw)
  To: walter harms; +Cc: Sjur Braendeland, netdev, David S. Miller, kernel-janitors
In-Reply-To: <4C04B646.9080506@bfs.de>

On Tue, Jun 01, 2010 at 09:27:02AM +0200, walter harms wrote:
> >  static int caif_net_open(struct net_device *dev)
> >  {
> > -	struct ser_device *ser;
> > -	ser = netdev_priv(dev);
> >  	netif_wake_queue(dev);
> >  	return 0;
> >  }
> 
> 
> what makes caif_net_open() obsolet ?
> 

No.  It has to return an int to match this:
        int                     (*ndo_open)(struct net_device *dev);

regards,
dan carpenter

^ permalink raw reply

* Re: [PATCH 2/2] drivers/isdn/hardware/mISDN: Use GFP_ATOMIC when a lock is held
From: Karsten Keil @ 2010-06-01  8:05 UTC (permalink / raw)
  To: Julia Lawall; +Cc: netdev, linux-kernel, kernel-janitors
In-Reply-To: <Pine.LNX.4.64.1005301549230.15938@ask.diku.dk>

Hello Julia,

thanks for that report, the issue is valid, but I think the fix in that case
should be to move the allocation to an other place, to avoid wasting of
GFP_ATOMIC allocation. I will come up with an other fix.

Karsten


On Sun, May 30, 2010 at 03:49:40PM +0200, Julia Lawall wrote:
> From: Julia Lawall <julia@diku.dk>
> 
> The function inittiger is only called from nj_init_card, where a lock is held.
> 
> The semantic patch that makes this change is as follows:
> (http://coccinelle.lip6.fr/)
> 
> // <smpl>
> @gfp exists@
> identifier fn;
> position p;
> @@
> 
> fn(...) {
>  ... when != spin_unlock_irqrestore
>      when any
>  GFP_KERNEL@p
>  ... when any
> }
> 
> @locked@
> identifier gfp.fn;
> @@
> 
> spin_lock_irqsave(...)
> ...  when != spin_unlock_irqrestore
> fn(...)
> 
> @depends on locked@
> position gfp.p;
> @@
> 
> - GFP_KERNEL@p
> + GFP_ATOMIC
> // </smpl>
> 
> Signed-off-by: Julia Lawall <julia@diku.dk>
> 
> ---
>  drivers/isdn/hardware/mISDN/netjet.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff -u -p a/drivers/isdn/hardware/mISDN/netjet.c b/drivers/isdn/hardware/mISDN/netjet.c
> --- a/drivers/isdn/hardware/mISDN/netjet.c
> +++ b/drivers/isdn/hardware/mISDN/netjet.c
> @@ -320,12 +320,12 @@ inittiger(struct tiger_hw *card)
>  		return -ENOMEM;
>  	}
>  	for (i = 0; i < 2; i++) {
> -		card->bc[i].hsbuf = kmalloc(NJ_DMA_TXSIZE, GFP_KERNEL);
> +		card->bc[i].hsbuf = kmalloc(NJ_DMA_TXSIZE, GFP_ATOMIC);
>  		if (!card->bc[i].hsbuf) {
>  			pr_info("%s: no B%d send buffer\n", card->name, i + 1);
>  			return -ENOMEM;
>  		}
> -		card->bc[i].hrbuf = kmalloc(NJ_DMA_RXSIZE, GFP_KERNEL);
> +		card->bc[i].hrbuf = kmalloc(NJ_DMA_RXSIZE, GFP_ATOMIC);
>  		if (!card->bc[i].hrbuf) {
>  			pr_info("%s: no B%d recv buffer\n", card->name, i + 1);
>  			return -ENOMEM;
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2/2] drivers/isdn/hardware/mISDN: Use GFP_ATOMIC when a lock is held
From: Julia Lawall @ 2010-06-01  8:14 UTC (permalink / raw)
  To: Karsten Keil; +Cc: netdev, linux-kernel, kernel-janitors
In-Reply-To: <20100601080535.GA19288@gw.linux-pingi.de>

On Tue, 1 Jun 2010, Karsten Keil wrote:

> Hello Julia,
> 
> thanks for that report, the issue is valid, but I think the fix in that case
> should be to move the allocation to an other place, to avoid wasting of
> GFP_ATOMIC allocation. I will come up with an other fix.

OK, thanks.  It's not so easy to see what to do in the interprocedural 
case.

julia

> Karsten
> 
> 
> On Sun, May 30, 2010 at 03:49:40PM +0200, Julia Lawall wrote:
> > From: Julia Lawall <julia@diku.dk>
> > 
> > The function inittiger is only called from nj_init_card, where a lock is held.
> > 
> > The semantic patch that makes this change is as follows:
> > (http://coccinelle.lip6.fr/)
> > 
> > // <smpl>
> > @gfp exists@
> > identifier fn;
> > position p;
> > @@
> > 
> > fn(...) {
> >  ... when != spin_unlock_irqrestore
> >      when any
> >  GFP_KERNEL@p
> >  ... when any
> > }
> > 
> > @locked@
> > identifier gfp.fn;
> > @@
> > 
> > spin_lock_irqsave(...)
> > ...  when != spin_unlock_irqrestore
> > fn(...)
> > 
> > @depends on locked@
> > position gfp.p;
> > @@
> > 
> > - GFP_KERNEL@p
> > + GFP_ATOMIC
> > // </smpl>
> > 
> > Signed-off-by: Julia Lawall <julia@diku.dk>
> > 
> > ---
> >  drivers/isdn/hardware/mISDN/netjet.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff -u -p a/drivers/isdn/hardware/mISDN/netjet.c b/drivers/isdn/hardware/mISDN/netjet.c
> > --- a/drivers/isdn/hardware/mISDN/netjet.c
> > +++ b/drivers/isdn/hardware/mISDN/netjet.c
> > @@ -320,12 +320,12 @@ inittiger(struct tiger_hw *card)
> >  		return -ENOMEM;
> >  	}
> >  	for (i = 0; i < 2; i++) {
> > -		card->bc[i].hsbuf = kmalloc(NJ_DMA_TXSIZE, GFP_KERNEL);
> > +		card->bc[i].hsbuf = kmalloc(NJ_DMA_TXSIZE, GFP_ATOMIC);
> >  		if (!card->bc[i].hsbuf) {
> >  			pr_info("%s: no B%d send buffer\n", card->name, i + 1);
> >  			return -ENOMEM;
> >  		}
> > -		card->bc[i].hrbuf = kmalloc(NJ_DMA_RXSIZE, GFP_KERNEL);
> > +		card->bc[i].hrbuf = kmalloc(NJ_DMA_RXSIZE, GFP_ATOMIC);
> >  		if (!card->bc[i].hrbuf) {
> >  			pr_info("%s: no B%d recv buffer\n", card->name, i + 1);
> >  			return -ENOMEM;
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kernel-janitors" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH 2/2] drivers/isdn/hardware/mISDN: Use GFP_ATOMIC when a lock is held
From: David Miller @ 2010-06-01  8:28 UTC (permalink / raw)
  To: isdn; +Cc: julia, netdev, linux-kernel, kernel-janitors
In-Reply-To: <20100601080535.GA19288@gw.linux-pingi.de>

From: Karsten Keil <isdn@linux-pingi.de>
Date: Tue, 1 Jun 2010 10:05:35 +0200

> thanks for that report, the issue is valid, but I think the fix in that case
> should be to move the allocation to an other place, to avoid wasting of
> GFP_ATOMIC allocation. I will come up with an other fix.

You'll have to write your patch relative to Julia'a s ince it's already
in my net-2.6 tree, since it's a bug fix.

^ permalink raw reply

* [GIT] Networking
From: David Miller @ 2010-06-01  8:35 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) OF device conversions broke greth driver build.

2) mpc5xxx_can.c driver build fix from Anatolij Gustschin.

3) Mobile IPV6 broken by output interface routing change, fix
   from Brian Haley.

4) ar9170usb use after free fix from Christian Lamparter.

5) Missing unlock on error in be2net from Dan Carpenter.

6) hdr.timeout accesses in be2net need to be little-endian, fix
   from Sathya Perla.

7) Busted connected socket list locking in Phonet, fix from
   Rémi Denis-Courmont.

8) Two netfilter fixes via Patrick McHardy:
   a) Accidental double-allocation in xt_register_table fixes
      by Xiaotian Feng
   b) xtables stackptr needs to be percpu, from Eric Dumazet

9) TCP compile fix with debugging options enabled, from Joe Perches.

10) Missing unlock and GFP_KERNEL with lock held fix in ISDN
    from Julia Lawall.

11) Two changes from Eric Dumazet that attempt to cure sock_queue_err_skb()'s
    erroneous modifications ->sk_forward_alloc without locking.

12) ksz884x driver returns wrong type from ->ndo_start_xmit() and
    is missing validate_addr in netdev_ops.  From Denis Kirjanov.

13) Wireless bug fixes from John Linville and co.

Please pull, thanks a lot!

The following changes since commit 67a3e12b05e055c0415c556a315a3d3eb637e29e:
  Linus Torvalds (1):
        Linux 2.6.35-rc1

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Anatolij Gustschin (1):
      can: mpc5xxx_can.c: Fix build failure

Brian Haley (1):
      IPv6: fix Mobile IPv6 regression

Changli Gao (1):
      skb: make skb_recycle_check() return a bool value

Christian Lamparter (1):
      ar9170usb: fix read from freed driver context

Christoph Fritz (1):
      ssb: fix NULL ptr deref when pcihost_wrapper is used

Dan Carpenter (3):
      be2net: add unlock on error path
      be2net: remove superfluous externs
      caif: unlock on error path in cfserl_receive()

David S. Miller (4):
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-2.6
      Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
      greth: Fix build after OF device conversions.
      Merge branch 'master' of git://git.kernel.org/.../kaber/nf-2.6

Denis Kirjanov (2):
      ksz884x: convert to netdev_tx_t
      ksz884x: Add missing validate_addr hook

Eric Dumazet (3):
      net: fix sk_forward_alloc corruptions
      netfilter: xtables: stackptr should be percpu
      net: sock_queue_err_skb() dont mess with sk_forward_alloc

Joe Perches (1):
      net/ipv4/tcp_input.c: fix compilation breakage when FASTRETRANS_DEBUG > 1

Johannes Berg (1):
      mac80211: make a function static

John W. Linville (1):
      Revert "rt2x00: Fix rt2800usb TX descriptor writing."

Julia Lawall (3):
      drivers/isdn/hardware/mISDN: Add missing spin_unlock
      net/rds: Add missing mutex_unlock
      drivers/isdn/hardware/mISDN: Use GFP_ATOMIC when a lock is held

Justin P. Mattock (1):
      ath9k: Fix ath_print in xmit for hardware reset.

Mark Ware (1):
      fs_enet: Adjust BDs after tx error

Michael S. Tsirkin (1):
      virtio-net: pass gfp to add_buf

Prarit Bhargava (1):
      libertas: fix uninitialized variable warning

Rémi Denis-Courmont (1):
      Phonet: listening socket lock protects the connected socket list

Sathya Perla (1):
      be2net: convert hdr.timeout in be_cmd_loopback_test() to le32

Vasanthakumar Thiagarajan (1):
      ath9k: Fix bug in the way "bf_tx_aborted" of struct ath_buf is used

Xiaotian Feng (1):
      netfilter: don't xt_jumpstack_alloc twice in xt_register_table

 drivers/isdn/hardware/mISDN/hfcsusb.c   |    4 ++-
 drivers/isdn/hardware/mISDN/netjet.c    |    4 +-
 drivers/net/benet/be_cmds.c             |   13 +++++---
 drivers/net/can/mscan/mpc5xxx_can.c     |   10 +++---
 drivers/net/fs_enet/mac-fcc.c           |   49 +++++++++++++++++++++++++++----
 drivers/net/greth.c                     |   11 +++----
 drivers/net/ksz884x.c                   |    3 +-
 drivers/net/virtio_net.c                |    8 ++--
 drivers/net/wireless/ath/ar9170/usb.c   |   14 +++++++-
 drivers/net/wireless/ath/ath9k/xmit.c   |    6 ++-
 drivers/net/wireless/libertas/rx.c      |    5 +--
 drivers/net/wireless/rt2x00/rt2800usb.c |    2 +-
 drivers/ssb/pci.c                       |    9 ++++--
 drivers/ssb/sprom.c                     |    1 +
 include/linux/netfilter/x_tables.h      |    2 +-
 include/linux/skbuff.h                  |    2 +-
 include/net/sock.h                      |   15 +---------
 net/caif/cfserl.c                       |    6 ++-
 net/core/skbuff.c                       |   42 ++++++++++++++++++++++----
 net/ipv4/netfilter/ip_tables.c          |    2 +-
 net/ipv4/tcp_input.c                    |    4 +-
 net/ipv4/udp.c                          |    4 +-
 net/ipv6/netfilter/ip6_tables.c         |    2 +-
 net/ipv6/route.c                        |    2 +-
 net/mac80211/chan.c                     |    2 +-
 net/netfilter/x_tables.c                |   17 ++---------
 net/phonet/pep.c                        |    6 ++--
 net/rds/ib_cm.c                         |    1 +
 net/rds/iw_cm.c                         |    1 +
 29 files changed, 157 insertions(+), 90 deletions(-)

^ permalink raw reply

* [PATCH] netdev:bfin_mac: reclaim and free tx skb as soon as possible after transfer
From: sonic zhang @ 2010-06-01  9:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, uclinux-dist-devel

SKBs hold onto resources that can't be held indefinitely, such as TCP
socket references and netfilter conntrack state.  So if a packet is left
in TX ring for a long time, there might be a TCP socket that cannot be
closed and freed up.

Current blackfin EMAC driver always reclaim and free used tx skbs in future
transfers. The problem is that future transfer may not come as soon as
possible. This patch start a timer after transfer to reclaim and free skb.
There is nearly no performance drop with this patch.

TX interrupt is not enabled for 2 reasons:

1) If Blackfin EMAC TX transfer control is turned on, endless TX 
interrupts are triggered no matter if TX DMA is enabled. Since DMA walks
down the ring automatically, TX transfer control can't be turned off in the
middle. The only way is to disable TX interrupt completely.

2) skb can not be freed from interrupt context. A work queue or tasklet
has to be created, which introduce more overhead than timer only solution.



Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
---
 drivers/net/bfin_mac.c |  111 +++++++++++++++++++++++++++++------------------
 drivers/net/bfin_mac.h |    5 ++
 2 files changed, 73 insertions(+), 43 deletions(-)

diff --git a/drivers/net/bfin_mac.c b/drivers/net/bfin_mac.c
index 368f333..f2eebed 100644
--- a/drivers/net/bfin_mac.c
+++ b/drivers/net/bfin_mac.c
@@ -922,61 +922,68 @@ static void bfin_mac_hwtstamp_init(struct net_device *netdev)
 # define bfin_tx_hwtstamp(dev, skb)
 #endif
 
-static void adjust_tx_list(void)
+static inline void _tx_reclaim_skb(void)
+{
+	do {
+		tx_list_head->desc_a.config &= ~DMAEN;
+		tx_list_head->status.status_word = 0;
+		if (tx_list_head->skb) {
+			dev_kfree_skb(tx_list_head->skb);
+			tx_list_head->skb = NULL;
+		}
+		tx_list_head = tx_list_head->next;
+
+	} while (tx_list_head->status.status_word != 0);
+}
+
+static void tx_reclaim_skb(struct bfin_mac_local *lp)
 {
 	int timeout_cnt = MAX_TIMEOUT_CNT;
 
-	if (tx_list_head->status.status_word != 0 &&
-	    current_tx_ptr != tx_list_head) {
-		goto adjust_head;	/* released something, just return; */
-	}
+	if (tx_list_head->status.status_word != 0)
+		_tx_reclaim_skb();
 
-	/*
-	 * if nothing released, check wait condition
-	 * current's next can not be the head,
-	 * otherwise the dma will not stop as we want
-	 */
-	if (current_tx_ptr->next->next == tx_list_head) {
+	if (current_tx_ptr->next == tx_list_head) {
 		while (tx_list_head->status.status_word == 0) {
+			/* slow down polling to avoid too many queue stop. */
 			udelay(10);
-			if (tx_list_head->status.status_word != 0 ||
-			    !(bfin_read_DMA2_IRQ_STATUS() & DMA_RUN)) {
-				goto adjust_head;
-			}
-			if (timeout_cnt-- < 0) {
-				printk(KERN_ERR DRV_NAME
-				": wait for adjust tx list head timeout\n");
+			/* reclaim skb if DMA is not running. */
+			if (!(bfin_read_DMA2_IRQ_STATUS() & DMA_RUN))
+				break;
+			if (timeout_cnt-- < 0)
 				break;
-			}
-		}
-		if (tx_list_head->status.status_word != 0) {
-			goto adjust_head;
 		}
+
+		if (timeout_cnt >= 0)
+			_tx_reclaim_skb();
+		else
+			netif_stop_queue(lp->ndev);
 	}
 
-	return;
+	if (current_tx_ptr->next != tx_list_head &&
+		netif_queue_stopped(lp->ndev))
+		netif_wake_queue(lp->ndev);
 
-adjust_head:
-	do {
-		tx_list_head->desc_a.config &= ~DMAEN;
-		tx_list_head->status.status_word = 0;
-		if (tx_list_head->skb) {
-			dev_kfree_skb(tx_list_head->skb);
-			tx_list_head->skb = NULL;
-		} else {
-			printk(KERN_ERR DRV_NAME
-			       ": no sk_buff in a transmitted frame!\n");
-		}
-		tx_list_head = tx_list_head->next;
-	} while (tx_list_head->status.status_word != 0 &&
-		 current_tx_ptr != tx_list_head);
-	return;
+	if (tx_list_head != current_tx_ptr) {
+		/* shorten the timer interval if tx queue is stopped */
+		if (netif_queue_stopped(lp->ndev))
+			lp->tx_reclaim_timer.expires =
+				jiffies + (TX_RECLAIM_JIFFIES >> 4);
+		else
+			lp->tx_reclaim_timer.expires =
+				jiffies + TX_RECLAIM_JIFFIES;
+
+		mod_timer(&lp->tx_reclaim_timer,
+			lp->tx_reclaim_timer.expires);
+	}
 
+	return;
 }
 
 static int bfin_mac_hard_start_xmit(struct sk_buff *skb,
 				struct net_device *dev)
 {
+	struct bfin_mac_local *lp = netdev_priv(dev);
 	u16 *data;
 	u32 data_align = (unsigned long)(skb->data) & 0x3;
 	union skb_shared_tx *shtx = skb_tx(skb);
@@ -1009,8 +1016,6 @@ static int bfin_mac_hard_start_xmit(struct sk_buff *skb,
 			skb->len);
 		current_tx_ptr->desc_a.start_addr =
 			(u32)current_tx_ptr->packet;
-		if (current_tx_ptr->status.status_word != 0)
-			current_tx_ptr->status.status_word = 0;
 		blackfin_dcache_flush_range(
 			(u32)current_tx_ptr->packet,
 			(u32)(current_tx_ptr->packet + skb->len + 2));
@@ -1022,6 +1027,9 @@ static int bfin_mac_hard_start_xmit(struct sk_buff *skb,
 	 */
 	SSYNC();
 
+	/* always clear status buffer before start tx dma */
+	current_tx_ptr->status.status_word = 0;
+
 	/* enable this packet's dma */
 	current_tx_ptr->desc_a.config |= DMAEN;
 
@@ -1037,13 +1045,14 @@ static int bfin_mac_hard_start_xmit(struct sk_buff *skb,
 	bfin_write_EMAC_OPMODE(bfin_read_EMAC_OPMODE() | TE);
 
 out:
-	adjust_tx_list();
-
 	bfin_tx_hwtstamp(dev, skb);
 
 	current_tx_ptr = current_tx_ptr->next;
 	dev->stats.tx_packets++;
 	dev->stats.tx_bytes += (skb->len);
+
+	tx_reclaim_skb(lp);
+
 	return NETDEV_TX_OK;
 }
 
@@ -1167,8 +1176,11 @@ real_rx:
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void bfin_mac_poll(struct net_device *dev)
 {
+	struct bfin_mac_local *lp = netdev_priv(dev);
+
 	disable_irq(IRQ_MAC_RX);
 	bfin_mac_interrupt(IRQ_MAC_RX, dev);
+	tx_reclaim_skb(lp);
 	enable_irq(IRQ_MAC_RX);
 }
 #endif				/* CONFIG_NET_POLL_CONTROLLER */
@@ -1232,12 +1244,20 @@ static int bfin_mac_enable(void)
 /* Our watchdog timed out. Called by the networking layer */
 static void bfin_mac_timeout(struct net_device *dev)
 {
+	struct bfin_mac_local *lp = netdev_priv(dev);
+
 	pr_debug("%s: %s\n", dev->name, __func__);
 
 	bfin_mac_disable();
 
+	if (timer_pending(&lp->tx_reclaim_timer))
+		del_timer(&(lp->tx_reclaim_timer));
+
 	/* reset tx queue */
-	tx_list_tail = tx_list_head->next;
+	current_tx_ptr = tx_list_head;
+
+	if (netif_queue_stopped(lp->ndev))
+		netif_wake_queue(lp->ndev);
 
 	bfin_mac_enable();
 
@@ -1430,6 +1450,7 @@ static int __devinit bfin_mac_probe(struct platform_device *pdev)
 	SET_NETDEV_DEV(ndev, &pdev->dev);
 	platform_set_drvdata(pdev, ndev);
 	lp = netdev_priv(ndev);
+	lp->ndev = ndev;
 
 	/* Grab the MAC address in the MAC */
 	*(__le32 *) (&(ndev->dev_addr[0])) = cpu_to_le32(bfin_read_EMAC_ADDRLO());
@@ -1485,6 +1506,10 @@ static int __devinit bfin_mac_probe(struct platform_device *pdev)
 	ndev->netdev_ops = &bfin_mac_netdev_ops;
 	ndev->ethtool_ops = &bfin_mac_ethtool_ops;
 
+	init_timer(&lp->tx_reclaim_timer);
+	lp->tx_reclaim_timer.data = (unsigned long)(lp);
+	lp->tx_reclaim_timer.function = (void *)tx_reclaim_skb;
+
 	spin_lock_init(&lp->lock);
 
 	/* now, enable interrupts */
diff --git a/drivers/net/bfin_mac.h b/drivers/net/bfin_mac.h
index 1ae7b82..04e4050 100644
--- a/drivers/net/bfin_mac.h
+++ b/drivers/net/bfin_mac.h
@@ -13,9 +13,12 @@
 #include <linux/net_tstamp.h>
 #include <linux/clocksource.h>
 #include <linux/timecompare.h>
+#include <linux/timer.h>
 
 #define BFIN_MAC_CSUM_OFFLOAD
 
+#define TX_RECLAIM_JIFFIES (HZ / 5)
+
 struct dma_descriptor {
 	struct dma_descriptor *next_dma_desc;
 	unsigned long start_addr;
@@ -68,6 +71,8 @@ struct bfin_mac_local {
 
 	int wol;		/* Wake On Lan */
 	int irq_wake_requested;
+	struct timer_list tx_reclaim_timer;
+	struct net_device *ndev;
 
 	/* MII and PHY stuffs */
 	int old_link;          /* used by bf537_adjust_link */
-- 
1.6.0




^ permalink raw reply related

* [PATCH 1/3] vhost: replace vhost_workqueue with per-vhost kthread
From: Tejun Heo @ 2010-06-01  9:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100531160020.GC3067@redhat.com>

Replace vhost_workqueue with per-vhost kthread.  Other than callback
argument change from struct work_struct * to struct vhost_work *,
there's no visible change to vhost_poll_*() interface.

This conversion is to make each vhost use a dedicated kthread so that
resource control via cgroup can be applied.

Partially based on Sridhar Samudrala's patch.

* Updated to use sub structure vhost_work instead of directly using
  vhost_poll at Michael's suggestion.

* Added flusher wake_up() optimization at Michael's suggestion.

NOT_SIGNED_OFF_YET
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
---
Okay, here's the updated version.  I'm still in the process of setting
up the testing environment.  I'll be traveling from this afternoon
till tomorrow so I don't think I'll be able to test it before that.
If you can give it a shot, it would be great.

Thanks.

 drivers/vhost/net.c   |   56 ++++++++++---------------
 drivers/vhost/vhost.c |  110 ++++++++++++++++++++++++++++++++++++++------------
 drivers/vhost/vhost.h |   38 +++++++++++------
 3 files changed, 133 insertions(+), 71 deletions(-)

Index: work/drivers/vhost/net.c
===================================================================
--- work.orig/drivers/vhost/net.c
+++ work/drivers/vhost/net.c
@@ -294,54 +294,58 @@ static void handle_rx(struct vhost_net *
 	unuse_mm(net->dev.mm);
 }

-static void handle_tx_kick(struct work_struct *work)
+static void handle_tx_kick(struct vhost_work *work)
 {
-	struct vhost_virtqueue *vq;
-	struct vhost_net *net;
-	vq = container_of(work, struct vhost_virtqueue, poll.work);
-	net = container_of(vq->dev, struct vhost_net, dev);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  poll.work);
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
 	handle_tx(net);
 }

-static void handle_rx_kick(struct work_struct *work)
+static void handle_rx_kick(struct vhost_work *work)
 {
-	struct vhost_virtqueue *vq;
-	struct vhost_net *net;
-	vq = container_of(work, struct vhost_virtqueue, poll.work);
-	net = container_of(vq->dev, struct vhost_net, dev);
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  poll.work);
+	struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev);
+
 	handle_rx(net);
 }

-static void handle_tx_net(struct work_struct *work)
+static void handle_tx_net(struct vhost_work *work)
 {
-	struct vhost_net *net;
-	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_TX].work);
+	struct vhost_net *net = container_of(work, struct vhost_net,
+					     poll[VHOST_NET_VQ_TX].work);
 	handle_tx(net);
 }

-static void handle_rx_net(struct work_struct *work)
+static void handle_rx_net(struct vhost_work *work)
 {
-	struct vhost_net *net;
-	net = container_of(work, struct vhost_net, poll[VHOST_NET_VQ_RX].work);
+	struct vhost_net *net = container_of(work, struct vhost_net,
+					     poll[VHOST_NET_VQ_RX].work);
 	handle_rx(net);
 }

 static int vhost_net_open(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct vhost_dev *dev;
 	int r;
+
 	if (!n)
 		return -ENOMEM;
+
+	dev = &n->dev;
 	n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
 	n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-	r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
+	r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
 	}

-	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
+	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;

 	f->private_data = n;
@@ -644,25 +648,13 @@ static struct miscdevice vhost_net_misc

 static int vhost_net_init(void)
 {
-	int r = vhost_init();
-	if (r)
-		goto err_init;
-	r = misc_register(&vhost_net_misc);
-	if (r)
-		goto err_reg;
-	return 0;
-err_reg:
-	vhost_cleanup();
-err_init:
-	return r;
-
+	return misc_register(&vhost_net_misc);
 }
 module_init(vhost_net_init);

 static void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
-	vhost_cleanup();
 }
 module_exit(vhost_net_exit);

Index: work/drivers/vhost/vhost.c
===================================================================
--- work.orig/drivers/vhost/vhost.c
+++ work/drivers/vhost/vhost.c
@@ -17,12 +17,12 @@
 #include <linux/mm.h>
 #include <linux/miscdevice.h>
 #include <linux/mutex.h>
-#include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/poll.h>
 #include <linux/file.h>
 #include <linux/highmem.h>
 #include <linux/slab.h>
+#include <linux/kthread.h>

 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -37,8 +37,6 @@ enum {
 	VHOST_MEMORY_F_LOG = 0x1,
 };

-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
 			    poll_table *pt)
 {
@@ -52,23 +50,31 @@ static void vhost_poll_func(struct file
 static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 			     void *key)
 {
-	struct vhost_poll *poll;
-	poll = container_of(wait, struct vhost_poll, wait);
+	struct vhost_poll *poll = container_of(wait, struct vhost_poll, wait);
+
 	if (!((unsigned long)key & poll->mask))
 		return 0;

-	queue_work(vhost_workqueue, &poll->work);
+	vhost_poll_queue(poll);
 	return 0;
 }

 /* Init poll structure */
-void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask)
+void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
+		     unsigned long mask, struct vhost_dev *dev)
 {
-	INIT_WORK(&poll->work, func);
+	struct vhost_work *work = &poll->work;
+
 	init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
 	init_poll_funcptr(&poll->table, vhost_poll_func);
 	poll->mask = mask;
+	poll->dev = dev;
+
+	INIT_LIST_HEAD(&work->node);
+	work->fn = fn;
+	init_waitqueue_head(&work->done);
+	atomic_set(&work->flushing, 0);
+	work->queue_seq = work->done_seq = 0;
 }

 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -92,12 +98,28 @@ void vhost_poll_stop(struct vhost_poll *
  * locks that are also used by the callback. */
 void vhost_poll_flush(struct vhost_poll *poll)
 {
-	flush_work(&poll->work);
+	struct vhost_work *work = &poll->work;
+	int seq = work->queue_seq;
+
+	atomic_inc(&work->flushing);
+	smp_mb__after_atomic_inc();	/* mb flush-b0 paired with worker-b1 */
+	wait_event(work->done, seq - work->done_seq <= 0);
+	atomic_dec(&work->flushing);
+	smp_mb__after_atomic_dec();	/* rmb flush-b1 paired with worker-b0 */
 }

 void vhost_poll_queue(struct vhost_poll *poll)
 {
-	queue_work(vhost_workqueue, &poll->work);
+	struct vhost_dev *dev = poll->dev;
+	struct vhost_work *work = &poll->work;
+
+	spin_lock(&dev->work_lock);
+	if (list_empty(&work->node)) {
+		list_add_tail(&work->node, &dev->work_list);
+		work->queue_seq++;
+		wake_up_process(dev->worker);
+	}
+	spin_unlock(&dev->work_lock);
 }

 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -125,10 +147,52 @@ static void vhost_vq_reset(struct vhost_
 	vq->log_ctx = NULL;
 }

+static int vhost_worker(void *data)
+{
+	struct vhost_dev *dev = data;
+	struct vhost_work *work;
+
+repeat:
+	set_current_state(TASK_INTERRUPTIBLE);	/* mb paired w/ kthread_stop */
+
+	if (kthread_should_stop()) {
+		__set_current_state(TASK_RUNNING);
+		return 0;
+	}
+
+	work = NULL;
+	spin_lock(&dev->work_lock);
+	if (!list_empty(&dev->work_list)) {
+		work = list_first_entry(&dev->work_list,
+					struct vhost_work, node);
+		list_del_init(&work->node);
+	}
+	spin_unlock(&dev->work_lock);
+
+	if (work) {
+		__set_current_state(TASK_RUNNING);
+		work->fn(work);
+		smp_wmb();	/* wmb worker-b0 paired with flush-b1 */
+		work->done_seq = work->queue_seq;
+		smp_mb();	/* mb worker-b1 paired with flush-b0 */
+		if (atomic_read(&work->flushing))
+			wake_up_all(&work->done);
+	} else
+		schedule();
+
+	goto repeat;
+}
+
 long vhost_dev_init(struct vhost_dev *dev,
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
+	struct task_struct *worker;
 	int i;
+
+	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
+	if (IS_ERR(worker))
+		return PTR_ERR(worker);
+
 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
 	mutex_init(&dev->mutex);
@@ -136,6 +200,9 @@ long vhost_dev_init(struct vhost_dev *de
 	dev->log_file = NULL;
 	dev->memory = NULL;
 	dev->mm = NULL;
+	spin_lock_init(&dev->work_lock);
+	INIT_LIST_HEAD(&dev->work_list);
+	dev->worker = worker;

 	for (i = 0; i < dev->nvqs; ++i) {
 		dev->vqs[i].dev = dev;
@@ -143,9 +210,10 @@ long vhost_dev_init(struct vhost_dev *de
 		vhost_vq_reset(dev, dev->vqs + i);
 		if (dev->vqs[i].handle_kick)
 			vhost_poll_init(&dev->vqs[i].poll,
-					dev->vqs[i].handle_kick,
-					POLLIN);
+					dev->vqs[i].handle_kick, POLLIN, dev);
 	}
+
+	wake_up_process(worker);	/* avoid contributing to loadavg */
 	return 0;
 }

@@ -217,6 +285,9 @@ void vhost_dev_cleanup(struct vhost_dev
 	if (dev->mm)
 		mmput(dev->mm);
 	dev->mm = NULL;
+
+	WARN_ON(!list_empty(&dev->work_list));
+	kthread_stop(dev->worker);
 }

 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1113,16 +1184,3 @@ void vhost_disable_notify(struct vhost_v
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-	vhost_workqueue = create_singlethread_workqueue("vhost");
-	if (!vhost_workqueue)
-		return -ENOMEM;
-	return 0;
-}
-
-void vhost_cleanup(void)
-{
-	destroy_workqueue(vhost_workqueue);
-}
Index: work/drivers/vhost/vhost.h
===================================================================
--- work.orig/drivers/vhost/vhost.h
+++ work/drivers/vhost/vhost.h
@@ -5,13 +5,13 @@
 #include <linux/vhost.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
-#include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <linux/file.h>
 #include <linux/skbuff.h>
 #include <linux/uio.h>
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
+#include <asm/atomic.h>

 struct vhost_device;

@@ -20,19 +20,31 @@ enum {
 	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
 };

+struct vhost_work;
+typedef void (*vhost_work_fn_t)(struct vhost_work *work);
+
+struct vhost_work {
+	struct list_head	  node;
+	vhost_work_fn_t		  fn;
+	wait_queue_head_t	  done;
+	atomic_t		  flushing;
+	int			  queue_seq;
+	int			  done_seq;
+};
+
 /* Poll a file (eventfd or socket) */
 /* Note: there's nothing vhost specific about this structure. */
 struct vhost_poll {
 	poll_table                table;
 	wait_queue_head_t        *wqh;
 	wait_queue_t              wait;
-	/* struct which will handle all actual work. */
-	struct work_struct        work;
+	struct vhost_work	  work;
 	unsigned long		  mask;
+	struct vhost_dev	 *dev;
 };

-void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-		     unsigned long mask);
+void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
+		     unsigned long mask, struct vhost_dev *dev);
 void vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_flush(struct vhost_poll *poll);
@@ -63,7 +75,7 @@ struct vhost_virtqueue {
 	struct vhost_poll poll;

 	/* The routine to call when the Guest pings us, or timeout. */
-	work_func_t handle_kick;
+	vhost_work_fn_t handle_kick;

 	/* Last available index we saw. */
 	u16 last_avail_idx;
@@ -86,11 +98,11 @@ struct vhost_virtqueue {
 	struct iovec hdr[VHOST_NET_MAX_SG];
 	size_t hdr_size;
 	/* We use a kind of RCU to access private pointer.
-	 * All readers access it from workqueue, which makes it possible to
-	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
+	 * All readers access it from worker, which makes it possible to
+	 * flush the vhost_work instead of synchronize_rcu. Therefore readers do
 	 * not need to call rcu_read_lock/rcu_read_unlock: the beginning of
-	 * work item execution acts instead of rcu_read_lock() and the end of
-	 * work item execution acts instead of rcu_read_lock().
+	 * vhost_work execution acts instead of rcu_read_lock() and the end of
+	 * vhost_work execution acts instead of rcu_read_lock().
 	 * Writers use virtqueue mutex. */
 	void *private_data;
 	/* Log write descriptors */
@@ -110,6 +122,9 @@ struct vhost_dev {
 	int nvqs;
 	struct file *log_file;
 	struct eventfd_ctx *log_ctx;
+	spinlock_t work_lock;
+	struct list_head work_list;
+	struct task_struct *worker;
 };

 long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
@@ -136,9 +151,6 @@ bool vhost_enable_notify(struct vhost_vi
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);

-int vhost_init(void);
-void vhost_cleanup(void);
-
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
 		if ((vq)->error_ctx)                               \

^ permalink raw reply

* [PATCH 2/3] cgroups: Add an API to attach a task to current task's cgroup
From: Tejun Heo @ 2010-06-01  9:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100531160020.GC3067@redhat.com>

From: Sridhar Samudrala <samudrala.sridhar@gmail.com>

Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Reviewed-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
---
 include/linux/cgroup.h |    1 +
 kernel/cgroup.c        |   23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)

Index: work/include/linux/cgroup.h
===================================================================
--- work.orig/include/linux/cgroup.h
+++ work/include/linux/cgroup.h
@@ -570,6 +570,7 @@ struct task_struct *cgroup_iter_next(str
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
+int cgroup_attach_task_current_cg(struct task_struct *);

 /*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
Index: work/kernel/cgroup.c
===================================================================
--- work.orig/kernel/cgroup.c
+++ work/kernel/cgroup.c
@@ -1788,6 +1788,29 @@ out:
 	return retval;
 }

+/**
+ * cgroup_attach_task_current_cg - attach task 'tsk' to current task's cgroup
+ * @tsk: the task to be attached
+ */
+int cgroup_attach_task_current_cg(struct task_struct *tsk)
+{
+	struct cgroupfs_root *root;
+	struct cgroup *cur_cg;
+	int retval = 0;
+
+	cgroup_lock();
+	for_each_active_root(root) {
+		cur_cg = task_cgroup_from_root(current, root);
+		retval = cgroup_attach_task(cur_cg, tsk);
+		if (retval)
+			break;
+	}
+	cgroup_unlock();
+
+	return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_attach_task_current_cg);
+
 /*
  * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
  * held. May take task_lock of task

^ permalink raw reply

* [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers
From: Tejun Heo @ 2010-06-01  9:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100531160020.GC3067@redhat.com>

Apply the cpumask and cgroup of the initializing task to the created
vhost worker.

Based on Sridhar Samudrala's patch.  Li Zefan spotted a bug in error
path (twice), fixed (twice).

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
---
 drivers/vhost/vhost.c |   34 ++++++++++++++++++++++++++++++----
 1 file changed, 30 insertions(+), 4 deletions(-)

Index: work/drivers/vhost/vhost.c
===================================================================
--- work.orig/drivers/vhost/vhost.c
+++ work/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/cgroup.h>

 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -187,11 +188,29 @@ long vhost_dev_init(struct vhost_dev *de
 		    struct vhost_virtqueue *vqs, int nvqs)
 {
 	struct task_struct *worker;
-	int i;
+	cpumask_var_t mask;
+	int i, ret = -ENOMEM;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		goto out_free_mask;

 	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
-	if (IS_ERR(worker))
-		return PTR_ERR(worker);
+	if (IS_ERR(worker)) {
+		ret = PTR_ERR(worker);
+		goto out_free_mask;
+	}
+
+	ret = sched_getaffinity(current->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = sched_setaffinity(worker->pid, mask);
+	if (ret)
+		goto out_stop_worker;
+
+	ret = cgroup_attach_task_current_cg(worker);
+	if (ret)
+		goto out_stop_worker;

 	dev->vqs = vqs;
 	dev->nvqs = nvqs;
@@ -214,7 +233,14 @@ long vhost_dev_init(struct vhost_dev *de
 	}

 	wake_up_process(worker);	/* avoid contributing to loadavg */
-	return 0;
+	ret = 0;
+	goto out_free_mask;
+
+out_stop_worker:
+	kthread_stop(worker);
+out_free_mask:
+	free_cpumask_var(mask);
+	return ret;
 }

 /* Caller should have device mutex */

^ permalink raw reply

* Re: [PATCH net-next-2.6] br_netfilter: use skb_set_noref()
From: Patrick McHardy @ 2010-06-01  9:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, David Miller, bridge, netdev
In-Reply-To: <1275340993.2478.27.camel@edumazet-laptop>

Eric Dumazet wrote:
> Avoid dirtying bridge_parent_rtable refcount, using new dst noref
> infrastructure.

Applied, thanks Eric.

^ permalink raw reply

* Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Cong Wang @ 2010-06-01  9:57 UTC (permalink / raw)
  To: Flavio Leitner
  Cc: linux-kernel, Matt Mackall, netdev, bridge, Andy Gospodarek,
	Neil Horman, Jeff Moyer, Stephen Hemminger, bonding-devel,
	Jay Vosburgh, David Miller
In-Reply-To: <20100531190820.GA24569@sysclose.org>

[-- Attachment #1: Type: text/plain, Size: 754 bytes --]

On 06/01/10 03:08, Flavio Leitner wrote:
> On Mon, May 31, 2010 at 01:56:52PM +0800, Cong Wang wrote:
>> Hi, Flavio,
>>
>> Please use the attached patch instead, try to see if it solves
>> all your problems.
>
> I tried and it hangs. No backtraces this time.
> The bond_change_active_slave() prints before NETDEV_BONDING_FAILOVER
> notification, so I think it won't work.

Ah, I thought the same.

>
> Please, correct if I'm wrong, but when a failover happens with your
> patch applied, the netconsole would be disabled forever even with
> another healthy slave, right?
>

Yes, this is an easy solution, because bonding has several modes,
it is complex to make netpoll work in different modes.

Would you like to test the following patch?

Thanks much!


[-- Attachment #2: drivers-net-bonding-fix-activebackup-deadlock.diff --]
[-- Type: text/x-patch, Size: 591 bytes --]

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5e12462..59ade92 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1109,6 +1109,14 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
 	if (old_active == new_active)
 		return;
 
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
+
+	netdev_bonding_change(bond->dev, NETDEV_BONDING_DESLAVE);
+
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+
 	if (new_active) {
 		new_active->jiffies = jiffies;
 

^ permalink raw reply related

* Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers
From: Michael S. Tsirkin @ 2010-06-01 10:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <4C04D453.9040208@kernel.org>

On Tue, Jun 01, 2010 at 11:35:15AM +0200, Tejun Heo wrote:
> Apply the cpumask and cgroup of the initializing task to the created
> vhost worker.
> 
> Based on Sridhar Samudrala's patch.  Li Zefan spotted a bug in error
> path (twice), fixed (twice).
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
> Cc: Li Zefan <lizf@cn.fujitsu.com>

Something that I wanted to figure out - what happens if the
CPU mask limits us to a certain CPU that subsequently goes offline?
Will e.g. flush block forever or until that CPU comes back?
Also, does singlethreaded workqueue behave in the same way?

> ---
>  drivers/vhost/vhost.c |   34 ++++++++++++++++++++++++++++++----
>  1 file changed, 30 insertions(+), 4 deletions(-)
> 
> Index: work/drivers/vhost/vhost.c
> ===================================================================
> --- work.orig/drivers/vhost/vhost.c
> +++ work/drivers/vhost/vhost.c
> @@ -23,6 +23,7 @@
>  #include <linux/highmem.h>
>  #include <linux/slab.h>
>  #include <linux/kthread.h>
> +#include <linux/cgroup.h>
> 
>  #include <linux/net.h>
>  #include <linux/if_packet.h>
> @@ -187,11 +188,29 @@ long vhost_dev_init(struct vhost_dev *de
>  		    struct vhost_virtqueue *vqs, int nvqs)
>  {
>  	struct task_struct *worker;
> -	int i;
> +	cpumask_var_t mask;
> +	int i, ret = -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
> +		goto out_free_mask;
> 
>  	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
> -	if (IS_ERR(worker))
> -		return PTR_ERR(worker);
> +	if (IS_ERR(worker)) {
> +		ret = PTR_ERR(worker);
> +		goto out_free_mask;
> +	}
> +
> +	ret = sched_getaffinity(current->pid, mask);
> +	if (ret)
> +		goto out_stop_worker;
> +
> +	ret = sched_setaffinity(worker->pid, mask);
> +	if (ret)
> +		goto out_stop_worker;
> +
> +	ret = cgroup_attach_task_current_cg(worker);
> +	if (ret)
> +		goto out_stop_worker;
> 
>  	dev->vqs = vqs;
>  	dev->nvqs = nvqs;
> @@ -214,7 +233,14 @@ long vhost_dev_init(struct vhost_dev *de
>  	}
> 
>  	wake_up_process(worker);	/* avoid contributing to loadavg */
> -	return 0;
> +	ret = 0;
> +	goto out_free_mask;
> +
> +out_stop_worker:
> +	kthread_stop(worker);
> +out_free_mask:
> +	free_cpumask_var(mask);
> +	return ret;
>  }
> 
>  /* Caller should have device mutex */

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-06-01 10:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, hawk, Jesper Dangaard Brouer, paulmck,
	Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <1275368732.2478.88.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 01 juin 2010 à 08:28 +0800, Changli Gao a écrit :
>> On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> I had a look at current conntrack and found the 'unconfirmed' list was
>>> maybe a candidate for a potential blackhole.
>>>
>>> That is, if a reader happens to hit an entry that is moved from regular
>>> hash table slot 'hash' to unconfirmed list,
>> Sorry, but I can't find where we do this things. unconfirmed list is
>> used to track the unconfirmed cts, whose corresponding skbs are still
>> in path from the first to the last netfilter hooks. As soon as the
>> skbs end their travel in netfilter, the corresponding cts will be
>> confirmed(moving ct from unconfirmed list to regular hash table).
>>
> 
> So netfilter is a monolithic thing.
> 
> When a packet begins its travel into netfilter, you guarantee that no
> other packet can also begin its travel and find an unconfirmed
> conntrack ?

Correct, the unconfirmed list exists only for cleanup.

> I wonder why we use atomic ops then to track the confirmed bit :)

Good question, that looks unnecessary :)

>> unconfirmed list should be small, as networking receiving is in BH.
> 
> So according to you, netfilter/ct runs only in input path ?
> 
> So I assume a packet is handled by CPU X, creates a new conntrack
> (possibly early droping an old entry that was previously in a standard
> hash chain), inserted in unconfirmed list. _You_ guarantee another CPU
> Y, handling another packet, possibly sent by a hacker reading your
> netdev mails, cannot find the conntrack that was early dropped ?
> 
>> How about implementing unconfirmed list as a per cpu variable?
> 
> I first implemented such a patch to reduce cache line contention, then I
> asked to myself : What is exactly an unconfirmed conntrack ? Can their
> number be unbounded ? If yes, we have a problem, even on a two cpus
> machine. Using two lists instead of one wont solve the fundamental
> problem.

If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
added to the unconfirmed list and moved to the hash as soon as the
packet passes POST_ROUTING. This means the number of unconfirmed entries
created by the network is bound by the number of CPUs due to BH
processing. The number created by locally generated packets is unbound
in case of preemptible kernels however.

> The real question is, why do we need this unconfirmed 'list' in the
> first place. Is it really a private per cpu thing ? Can you prove this,
> in respect of lockless lookups, and things like NFQUEUE ? 

Its used for cleaning up conntracks not in the hash table yet on
module unload (or manual flush). It is supposed to be write-only
during regular operation.

> Each conntrack object has two list anchors. One for IP_CT_DIR_ORIGINAL,
> one for IP_CT_DIR_REPLY.
> 
> Unconfirmed list use the first anchor. This means another cpu can
> definitely find an unconfirmed item in a regular hash chain, since we
> dont respect an RCU grace period before re-using an object.
> 
> If memory was not a problem, we probably would use a third anchor to
> avoid this, or regular RCU instead of SLAB_DESTROY_BY_RCU variant.

So I guess we should check the CONFIRMED bit when searching
in the hash table.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [Patch] fix packet loss and massive ping spikes with PPP multi-link
From: Richard Hartmann @ 2010-06-01 10:20 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ben McKeegan, netdev, linux-ppp, Alan Cox, Alexander E. Patrakov,
	linux-kernel
In-Reply-To: <20100529021624.GA2538@brick.ozlabs.ibm.com>

On Sat, May 29, 2010 at 04:16, Paul Mackerras <paulus@samba.org> wrote:


> I'll
> ack it and hopefully DaveM will pick it up.

As 2.6.35-rc1 is out, does this mean that we are looking at 2.6.36 at the
earliest? Or could this make it in as it is

a) a bug fix
b) tested in the field
c) a small change


Thanks,
Richard

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-06-01 10:31 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Jesper Dangaard Brouer, paulmck,
	Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <4C04DE73.6050605@trash.net>

Le mardi 01 juin 2010 à 12:18 +0200, Patrick McHardy a écrit :

> If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
> added to the unconfirmed list and moved to the hash as soon as the
> packet passes POST_ROUTING. This means the number of unconfirmed entries
> created by the network is bound by the number of CPUs due to BH
> processing. The number created by locally generated packets is unbound
> in case of preemptible kernels however.
> 

OK, we should have a percpu list then.

BTW, I notice nf_conntrack_untracked is incorrectly annotated
'__read_mostly'.

It can be written very often :(

Should'nt we special case it and let be really const ?




^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-06-01 10:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, hawk, Jesper Dangaard Brouer, paulmck,
	Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <1275388310.2738.2.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 01 juin 2010 à 12:18 +0200, Patrick McHardy a écrit :
> 
>> If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
>> added to the unconfirmed list and moved to the hash as soon as the
>> packet passes POST_ROUTING. This means the number of unconfirmed entries
>> created by the network is bound by the number of CPUs due to BH
>> processing. The number created by locally generated packets is unbound
>> in case of preemptible kernels however.
>>
> 
> OK, we should have a percpu list then.

Yes, that makes sense.

> BTW, I notice nf_conntrack_untracked is incorrectly annotated
> '__read_mostly'.
> 
> It can be written very often :(
> 
> Should'nt we special case it and let be really const ?

That would need quite a bit of special-casing to avoid touching
the reference counts. So far this is completely hidden, so I'd
say it just shouldn't be marked __read_mostly. Alternatively we
can make "untracked" a nfctinfo state.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers
From: Tejun Heo @ 2010-06-01 10:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100601101703.GB9178@redhat.com>

Hello,

On 06/01/2010 12:17 PM, Michael S. Tsirkin wrote:
> Something that I wanted to figure out - what happens if the
> CPU mask limits us to a certain CPU that subsequently goes offline?

The thread gets unbound during the last steps of cpu offlining.

> Will e.g. flush block forever or until that CPU comes back?
> Also, does singlethreaded workqueue behave in the same way?

So, things will proceed as usual although the thread will lose its
affinity.  Singlethread wqs don't bind their workers (and they
shouldn't! :-).  MT ones explicitly manage workers according to cpu
up/down events.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [Patch] fix packet loss and massive ping spikes with PPP  multi-link
From: Ben McKeegan @ 2010-06-01 11:18 UTC (permalink / raw)
  To: Richard Hartmann
  Cc: Paul Mackerras, netdev, linux-ppp, Alan Cox,
	Alexander E. Patrakov, linux-kernel
In-Reply-To: <AANLkTil7fk_ikrSPiZnqcJSLGCrzcCrCwbeuUMY5Yu5v@mail.gmail.com>

Richard Hartmann wrote:

> As 2.6.35-rc1 is out, does this mean that we are looking at 2.6.36 at the
> earliest? Or could this make it in as it is
> 
> a) a bug fix

This isn't really a bug fix.  Its a behavioural change to work around 
poor quality/mismatched underlying PPP channels.

Regards,
Ben.

^ permalink raw reply

* Re: [Patch] fix packet loss and massive ping spikes with PPP multi-link
From: Richard Hartmann @ 2010-06-01 11:28 UTC (permalink / raw)
  To: Ben McKeegan
  Cc: Paul Mackerras, netdev, linux-ppp, Alan Cox,
	Alexander E. Patrakov, linux-kernel
In-Reply-To: <4C04ECA3.1080905@netservers.co.uk>

On Tue, Jun 1, 2010 at 13:18, Ben McKeegan <ben@netservers.co.uk> wrote:

> This isn't really a bug fix.  Its a behavioural change to work around poor
> quality/mismatched underlying PPP channels.

Maybe not a bug in the Linux kernel itself, but certainly in the real world
that exists around Linux. Similar to how a change to a device driver that
is needed to work around broken hardware is a bug fix, imo.

RIchard

^ permalink raw reply

* [PATCH 1/2] mac8390: propagate error code from request_irq
From: Finn Thain @ 2010-06-01 12:18 UTC (permalink / raw)
  To: David Miller; +Cc: geert, joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100531.060225.68146363.davem@davemloft.net>


Use the request_irq() error code as the return value for mac8390_open(). 
EAGAIN doesn't make sense for Nubus slot IRQs. Only this driver can claim 
this IRQ (until the NIC is removed, which means everything is powered 
down).

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

Index: linux-2.6.34/drivers/net/mac8390.c
===================================================================
--- linux-2.6.34.orig/drivers/net/mac8390.c	2010-06-01 20:20:29.000000000 +1000
+++ linux-2.6.34/drivers/net/mac8390.c	2010-06-01 20:22:23.000000000 +1000
@@ -643,12 +643,13 @@ static int __init mac8390_initdev(struct
 
 static int mac8390_open(struct net_device *dev)
 {
+	int err;
+
 	__ei_open(dev);
-	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
-		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
-		return -EAGAIN;
-	}
-	return 0;
+	err = request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev);
+	if (err)
+		pr_info("%s: unable to get IRQ %d\n", dev->name, dev->irq);
+	return err;
 }
 
 static int mac8390_close(struct net_device *dev)

^ permalink raw reply

* [PATCH 2/2] mac8390: raise error logging priority
From: Finn Thain @ 2010-06-01 12:18 UTC (permalink / raw)
  To: David Miller; +Cc: geert, joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100531.060225.68146363.davem@davemloft.net>


Log error conditions using KERN_ERR priority.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

---

This patch doesn't address the KERN_INFO messages conditional on ei_debug. 
I don't know how those can be improved to the satisfaction of all 
interested parties.

These two patches should be applied in sequence.

Index: linux-2.6.34/drivers/net/mac8390.c
===================================================================
--- linux-2.6.34.orig/drivers/net/mac8390.c	2010-06-01 20:22:23.000000000 +1000
+++ linux-2.6.34/drivers/net/mac8390.c	2010-06-01 20:22:25.000000000 +1000
@@ -556,7 +556,7 @@ static int __init mac8390_initdev(struct
 	case MAC8390_APPLE:
 		switch (mac8390_testio(dev->mem_start)) {
 		case ACCESS_UNKNOWN:
-			pr_info("Don't know how to access card memory!\n");
+			pr_err("Don't know how to access card memory!\n");
 			return -ENODEV;
 			break;
 
@@ -648,7 +648,7 @@ static int mac8390_open(struct net_devic
 	__ei_open(dev);
 	err = request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev);
 	if (err)
-		pr_info("%s: unable to get IRQ %d\n", dev->name, dev->irq);
+		pr_err("%s: unable to get IRQ %d\n", dev->name, dev->irq);
 	return err;
 }
 

^ permalink raw reply

* Re: [PATCH] cls_u32: use skb_copy_bits() to dereference data safely
From: jamal @ 2010-06-01 12:34 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <1275272665-19047-1-git-send-email-xiaosuo@gmail.com>

Hi Changli,

On Mon, 2010-05-31 at 10:24 +0800, Changli Gao wrote:
> use skb_copy_bits() to dereference data safely
> 
> the original skb->data dereference isn't safe, as there isn't any skb->len or
> skb_is_nonlinear() check. 

I dont see any safety issue in current code in this respect. Do you have
a specific scenario where this would be unsafe? We inspect in 32 bit
chunks walking the packet data and stop when there is no more packet
data.

> skb_copy_bits() is used instead in this patch. And
> when the skb isn't long enough, we terminate the function u32_classify()
> immediately with -1.
> 

That could be a very interesting optimization - but i see it as _very
hard_ to do with current u32 given it has branching and different
branches would have different lengths etc. You'd have to essentially do
some math in user space as to what min length would suffice given
a specified filter and pass that to the kernel. 

cheers,
jamal

^ permalink raw reply

* Re: [Regression] [2.6.35-rc1] ssb_is_sprom_available
From: John W. Linville @ 2010-06-01 13:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Maciej Rutecki, linux-kernel@vger.kernel.org, linux-pci,
	linux-usb, Rafael J. Wysocki, netdev, Michael Buesch
In-Reply-To: <20100531205300.GB32708@parisc-linux.org>

On Mon, May 31, 2010 at 02:53:00PM -0600, Matthew Wilcox wrote:
> On Mon, May 31, 2010 at 09:55:20PM +0200, Maciej Rutecki wrote:
> > Last known good: 2.6.34
> > Failing kernel: 2.6.35-rc1
> > 
> > subsystem: PCI, USB(?)
> > 
> > Kernel dies during booting on message "ssb_is_sprom_available", see picture:
> > http://www.unixy.pl/maciek/download/kernel/2.6.35-rc1/gumis/DSC_0011.JPG
> 
> Um, looks like it's something to do with the Sonics Silicon Backplane,
> not PCI, nor USB.

Fix is on its way...

commit da1fdb02d9200ff28b6f3a380d21930335fe5429
Author: Christoph Fritz <chf.fritz@googlemail.com>
Date:   Fri May 28 10:45:59 2010 +0200

    ssb: fix NULL ptr deref when pcihost_wrapper is used
    
    Ethernet driver b44 does register ssb by it's pcihost_wrapper
    and doesn't set ssb_chipcommon. A check on this value
    introduced with commit d53cdbb94a52a920d5420ed64d986c3523a56743
    and ea2db495f92ad2cf3301623e60cb95b4062bc484 triggers:
    
    BUG: unable to handle kernel NULL pointer dereference at 00000010
    IP: [<c1266c36>] ssb_is_sprom_available+0x16/0x30
    
    Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: Kernel panic with the b44 driver in 2.6.35-rc1
From: John W. Linville @ 2010-06-01 13:33 UTC (permalink / raw)
  To: news.gmane.org; +Cc: netdev
In-Reply-To: <4C0414F3.8030409@tvcablenet.be>

On Mon, May 31, 2010 at 09:58:43PM +0200, news.gmane.org wrote:

> I have a kernel panic at startup with the b44 driver with the 2.6.35-rc1
> kernel. I have submitted a bug report (see
> https://bugzilla.kernel.org/show_bug.cgi?id=16074) but until know, I
> din't get any answer yet. Does anybody knows what's happening. I have
> tried several git-bisect run but I didn't find any conclusive result.
> The first bad-commit was different each time and if I revert it, the
> problem still occurs anyway.

This patch is already on its way to Linus -- it should fix the problem
you are seeing:

commit da1fdb02d9200ff28b6f3a380d21930335fe5429
Author: Christoph Fritz <chf.fritz@googlemail.com>
Date:   Fri May 28 10:45:59 2010 +0200

    ssb: fix NULL ptr deref when pcihost_wrapper is used
    
    Ethernet driver b44 does register ssb by it's pcihost_wrapper
    and doesn't set ssb_chipcommon. A check on this value
    introduced with commit d53cdbb94a52a920d5420ed64d986c3523a56743
    and ea2db495f92ad2cf3301623e60cb95b4062bc484 triggers:
    
    BUG: unable to handle kernel NULL pointer dereference at 00000010
    IP: [<c1266c36>] ssb_is_sprom_available+0x16/0x30
    
    Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox