Netdev List

* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: David Miller @ 2012-06-27  8:22 UTC
  To: hans.schillstrom
  Cc: eric.dumazet, subramanian.vijay, dave.taht, netdev, ncardwell,
	therbert, brouer
In-Reply-To: <201206270723.11615.hans.schillstrom@ericsson.com>

From: Hans Schillstrom <hans.schillstrom@ericsson.com>
Date: Wed, 27 Jun 2012 07:23:03 +0200

> On Tuesday 26 June 2012 19:02:36 Eric Dumazet wrote:
>> With David's patch using jhash instead of SHA, I reach ~315,000 SYN per
>> second.
> 
> I have similar results from ~170k to ~199k synack/sec.

Eric and Hans, I'm going to add Tested-by: tags for both of you
when I commit this patch if you don't mind. :-)

* Re: [PATCH net-next 1/4 v2] net: sh_eth: remove unnecessary function
From: David Miller @ 2012-06-27  8:24 UTC
  To: yoshihiro.shimoda.uh; +Cc: netdev, linux-sh
In-Reply-To: <4FEAA157.9060403@renesas.com>

From: "Shimoda, Yoshihiro" <yoshihiro.shimoda.uh@renesas.com>
Date: Wed, 27 Jun 2012 14:59:51 +0900

> The sh_eth_timer() called mod_timer() for itself. So, this patch
> removes the function.
> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>

Applied.

* Re: [PATCH net-next 2/4 v2] net: sh_eth: remove unnecessary members/definitions
From: David Miller @ 2012-06-27  8:24 UTC
  To: yoshihiro.shimoda.uh; +Cc: netdev, linux-sh
In-Reply-To: <4FEAA15E.2060402@renesas.com>

From: "Shimoda, Yoshihiro" <yoshihiro.shimoda.uh@renesas.com>
Date: Wed, 27 Jun 2012 14:59:58 +0900

> This patch removes unnecessary members in sh_eth_private.
> This patch also removes unnecessary definitions in sh_eth.h
> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>

Applied.

* Re: [PATCH net-next 3/4 v2] net: sh_eth: fix up the buffer pointers
From: David Miller @ 2012-06-27  8:24 UTC
  To: yoshihiro.shimoda.uh; +Cc: netdev, linux-sh
In-Reply-To: <4FEAA161.4030001@renesas.com>

From: "Shimoda, Yoshihiro" <yoshihiro.shimoda.uh@renesas.com>
Date: Wed, 27 Jun 2012 15:00:01 +0900

> After freeing the buffer, the driver should change the value of
> the pointer to NULL.
> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>

Applied.

* Re: [PATCH net-next 4/4 v2] net: sh_eth: add support for set_ringparam/get_ringparam
From: David Miller @ 2012-06-27  8:24 UTC
  To: yoshihiro.shimoda.uh; +Cc: netdev, linux-sh
In-Reply-To: <4FEAA163.2000305@renesas.com>

From: "Shimoda, Yoshihiro" <yoshihiro.shimoda.uh@renesas.com>
Date: Wed, 27 Jun 2012 15:00:03 +0900

> This patch supports the ethtool's set_ringparam() and get_ringparam().
> 
> Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>

Applied.

* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: Jesper Dangaard Brouer @ 2012-06-27  8:25 UTC
  To: David Miller
  Cc: hans.schillstrom, eric.dumazet, subramanian.vijay, dave.taht,
	netdev, ncardwell, therbert
In-Reply-To: <20120627.012204.1603944185909940692.davem@davemloft.net>

On Wed, 2012-06-27 at 01:22 -0700, David Miller wrote:
> From: Hans Schillstrom <hans.schillstrom@ericsson.com>
> Date: Wed, 27 Jun 2012 07:23:03 +0200
> 
> > On Tuesday 26 June 2012 19:02:36 Eric Dumazet wrote:
> >> With David's patch using jhash instead of SHA, I reach ~315,000 SYN per
> >> second.
> > 
> > I have similar results from ~170k to ~199k synack/sec.
> 
> Eric and Hans, I'm going to add Tested-by: tags for both of you
> when I commit this patch if you don't mind. :-)

You can also add my

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

(I agree with your patch, and it does not have an attack vector...)

* Re: [PATCH 04/16] mm: allow PF_MEMALLOC from softirq context
From: Mel Gorman @ 2012-06-27  8:26 UTC
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Eric Dumazet
In-Reply-To: <20120626165513.GD6509@breakpoint.cc>

On Tue, Jun 26, 2012 at 06:55:13PM +0200, Sebastian Andrzej Siewior wrote:
> On Fri, Jun 22, 2012 at 03:30:31PM +0100, Mel Gorman wrote:
> > This is needed to allow network softirq packet processing to make
> > use of PF_MEMALLOC.
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b6c0727..5c6d9c6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2265,7 +2265,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> >  		if (gfp_mask & __GFP_MEMALLOC)
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> > -		else if (likely(!(gfp_mask & __GFP_NOMEMALLOC)) && !in_interrupt())
> > +		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> > +			alloc_flags |= ALLOC_NO_WATERMARKS;
> > +		else if (!in_interrupt() &&
> > +				((current->flags & PF_MEMALLOC) ||
> > +				 unlikely(test_thread_flag(TIF_MEMDIE))))
> >  			alloc_flags |= ALLOC_NO_WATERMARKS;
> >  	}
> 
> You allocate in RX path with __GFP_MEMALLOC and your sk->sk_allocation has
> also __GFP_MEMALLOC set. That means you should get ALLOC_NO_WATERMARKS in
> alloc_flags.

In the cases where they are annotated correctly, yes. It is recorded if
the page gets allocated from the PFMEMALLOC reserves. If the received
packet is not SOCK_MEMALLOC and the page was allocated from PFMEMALLOC
reserves it is then discarded and the packet must be retransmitted.

> Is this done to avoid GFP annotations in skb_share_check() and
> friends on your __netif_receive_skb() path?
> 

I don't get your question as the annotations are not being avoided. If they
are set, they are used. In the __netif_receive_skb path, PF_MEMALLOC is
set for PFMEMALLOC skbs to avoid having to annotate every single allocation
call site.
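
As a rough illustration of the decision order in the hunk quoted above,
here is a minimal user-space sketch in C. The flag constants,
struct alloc_ctx and may_ignore_watermarks() are hypothetical stand-ins
for this example only; they are not the kernel's definitions, and the
real gfp_to_alloc_flags() computes other alloc_flags as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for a user-space model; the kernel's differ. */
#define GFP_MEMALLOC_BIT   0x1u  /* stands in for __GFP_MEMALLOC */
#define GFP_NOMEMALLOC_BIT 0x2u  /* stands in for __GFP_NOMEMALLOC */
#define TASK_PF_MEMALLOC   0x4u  /* stands in for PF_MEMALLOC */

/* Hypothetical snapshot of the context an allocation runs in. */
struct alloc_ctx {
	unsigned int task_flags;   /* models current->flags */
	bool in_serving_softirq;   /* models in_serving_softirq() */
	bool in_interrupt;         /* models in_interrupt() */
	bool memdie;               /* models test_thread_flag(TIF_MEMDIE) */
};

/* Returns true where the patched code would set ALLOC_NO_WATERMARKS. */
static bool may_ignore_watermarks(unsigned int gfp_mask,
				  const struct alloc_ctx *ctx)
{
	assert(ctx);

	/* __GFP_NOMEMALLOC always wins: never touch the reserves. */
	if (gfp_mask & GFP_NOMEMALLOC_BIT)
		return false;

	/* An explicitly annotated call site. */
	if (gfp_mask & GFP_MEMALLOC_BIT)
		return true;

	/* New case: softirq processing with PF_MEMALLOC set on the task. */
	if (ctx->in_serving_softirq && (ctx->task_flags & TASK_PF_MEMALLOC))
		return true;

	/* Process context with PF_MEMALLOC, or a task being OOM-killed. */
	if (!ctx->in_interrupt &&
	    ((ctx->task_flags & TASK_PF_MEMALLOC) || ctx->memdie))
		return true;

	return false;
}
```

In this model, an unannotated allocation (gfp_mask of 0) made while
serving a softirq with PF_MEMALLOC set still reaches the reserves, which
is why the individual call sites need no annotation.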

-- 
Mel Gorman
SUSE Labs

* Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support
From: Michael S. Tsirkin @ 2012-06-27  8:26 UTC
  To: Jason Wang
  Cc: habanero, netdev, linux-kernel, krkumar2, tahm, akong, davem,
	shemminger, mashirle, Eric Dumazet
In-Reply-To: <4FEAA149.5010306@redhat.com>

On Wed, Jun 27, 2012 at 01:59:37PM +0800, Jason Wang wrote:
> On 06/26/2012 07:54 PM, Michael S. Tsirkin wrote:
> >On Tue, Jun 26, 2012 at 01:52:57PM +0800, Jason Wang wrote:
> >>On 06/25/2012 04:25 PM, Michael S. Tsirkin wrote:
> >>>On Mon, Jun 25, 2012 at 02:10:18PM +0800, Jason Wang wrote:
> >>>>This patch adds multiqueue support for tap device. This is done by abstracting
> >>>>each queue as a file/socket and allowing multiple sockets to be attached to the
> >>>>tuntap device (an array of tun_file were stored in the tun_struct). Userspace
> >>>>could write and read from those files to do the parallel packet
> >>>>sending/receiving.
> >>>>
> >>>>Unlike the previous single queue implementation, the socket and device were
> >>>>loosely coupled, each of them were allowed to go away first. In order to make the
> >>>>tx path lockless, netif_tx_lock_bh() is replaced by RCU/NETIF_F_LLTX to
> >>>>synchronize between data path and system call.
> >>>Don't use LLTX/RCU. It's not worth it.
> >>>Use something like netif_set_real_num_tx_queues.
> >>>
> >>For LLTX, maybe it's better to convert it to alloc_netdev_mq() to
> >>let the kernel see all queues and make the queue stopping and
> >>per-queue stats easier.
> >>RCU is used to handle the attaching/detaching when tun/tap is
> >>sending and receiving packets which looks reasonable to me.
> >Yes but do we have to allow this? How about we always ask
> >userspace to attach to all active queues?
> 
> Attaching/detaching is a method to activate/deactivate a queue; if all
> queues were kept attached, then we would need another method or flag to mark
> the queue as activated/deactivated and would still need to synchronize with
> the data path.

This is what I am trying to say: use an interface flag for
multiqueue. When it is set activate all queues attached.
When unset deactivate all queues except the default one.


> >>Not
> >>sure netif_set_real_num_tx_queues() can help in this situation.
> >Check it out.
> >
> >>>>The tx queue selection is first based on the recorded rxq index of an skb; if
> >>>>there's none, the queue is chosen based on rx hashing (skb_get_rxhash()).
> >>>>
> >>>>Signed-off-by: Jason Wang <jasowang@redhat.com>
> >>>Interestingly macvtap switched to hashing first:
> >>>ef0002b577b52941fb147128f30bd1ecfdd3ff6d
> >>>(the commit log is corrupted but see what it
> >>>does in the patch).
> >>>Any idea why?
> >>>
> >>>>---
> >>>>  drivers/net/tun.c |  371 +++++++++++++++++++++++++++++++++--------------------
> >>>>  1 files changed, 232 insertions(+), 139 deletions(-)
> >>>>
> >>>>diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>index 8233b0a..5c26757 100644
> >>>>--- a/drivers/net/tun.c
> >>>>+++ b/drivers/net/tun.c
> >>>>@@ -107,6 +107,8 @@ struct tap_filter {
> >>>>  	unsigned char	addr[FLT_EXACT_COUNT][ETH_ALEN];
> >>>>  };
> >>>>
> >>>>+#define MAX_TAP_QUEUES (NR_CPUS < 16 ? NR_CPUS : 16)
> >>>Why the limit? I am guessing you copied this from macvtap?
> >>>This is problematic for a number of reasons:
> >>>	- will not play well with migration
> >>>	- will not work well for a large guest
> >>>
> >>>Yes, macvtap needs to be fixed too.
> >>>
> >>>I am guessing what it is trying to prevent is queueing
> >>>up a huge number of packets?
> >>>So just divide the default tx queue limit by the # of queues.
> >>Not sure,
> >>other reasons I can guess:
> >>- to prevent storing a large array of pointers in tun_struct or macvlan_dev.
> >OK so with the limit of e.g. 1024 we'd allocate at most
> >2 pages of memory. This doesn't look too bad. 1024 is probably a
> >high enough limit: modern hypervisors seem to support on the order
> >of 100-200 CPUs so this leaves us some breathing space
> >if we want to match a queue per guest CPU.
> >Of course we need to limit the packets per queue
> >in such a setup more aggressively. 1000 packets * 1000 queues
> >* 64K per packet is too much.
> >
> >>- it may not be suitable to allow the number of virtqueues greater
> >>than the number of physical queues in the card
> >Maybe for macvtap, here we have no idea which card we
> >are working with and how many queues it has.
> >
> >>>And by the way, for MQ applications maybe we can finally
> >>>ignore tx queue altogether and limit the total number
> >>>of bytes queued?
> >>>To avoid regressions we can make it large like 64M/# queues.
> >>>Could be a separate patch I think, and for a single queue
> >>>might need a compatible mode though I am not sure.
> >>Could you explain more about this?
> >>Did you mean to have a total
> >>sndbuf for all sockets that attached to tun/tap?
> >Consider that we currently limit the # of
> >packets queued at tun for xmit to userspace.
> >Some limit is needed but # of packets sounds
> >very silly - limiting the total memory
> >might be more reasonable.
> >
> >In case of multiqueue, we really care about
> >total # of packets or total memory, but a simple
> >approximation could be to divide the allocation
> >between active queues equally.
> 
> A possible method is to divide TUN_READQ_SIZE by #queues, but
> make it at least equal to the vring size (256).

I would not enforce any limit actually.
Simply divide by # of queues, and
fail if userspace tries to attach > queue size packets.

With 1000 queues this is 64 Mbyte worst case as is.
If someone wants to allow userspace to drink
256 times as much, that is 16 Gbyte per
single device, let the user tweak tx queue len.
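
The arithmetic above can be sketched in C as follows. This is a minimal
user-space model under stated assumptions: per_queue_limit() and
worst_case_bytes() are hypothetical helpers, the 64K maximum packet size
is taken from the figures in this thread, and rejecting an attach when
there are more queues than packets of budget is only one reading of the
suggestion.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Split the device-wide packet budget (tx queue len) evenly across
 * queues. Returns 0 when there are more queues than packets of budget,
 * which a caller could treat as "refuse to attach this queue".
 */
static unsigned int per_queue_limit(unsigned int tx_queue_len,
				    unsigned int numqueues)
{
	assert(numqueues > 0);

	if (numqueues > tx_queue_len)
		return 0;
	return tx_queue_len / numqueues;
}

/* Worst-case memory pinned if every queue is full of max-size packets. */
static uint64_t worst_case_bytes(unsigned int tx_queue_len,
				 unsigned int numqueues,
				 uint64_t max_pkt_bytes)
{
	uint64_t pkts = (uint64_t)per_queue_limit(tx_queue_len, numqueues) *
			numqueues;

	return pkts * max_pkt_bytes;
}
```

With a budget of 1000 packets, 1000 queues and 64K packets,
worst_case_bytes(1000, 1000, 65536) gives 65,536,000 bytes, i.e. about
64 Mbyte, and the total does not grow as more queues are attached.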



> >
> >qdisc also queues some packets, that logic is
> >using # of packets anyway. So either make that
> >1000/# queues, or even set to 0 as Eric once
> >suggested.
> >
> >>>>+
> >>>>  struct tun_file {
> >>>>  	struct sock sk;
> >>>>  	struct socket socket;
> >>>>@@ -114,16 +116,18 @@ struct tun_file {
> >>>>  	int vnet_hdr_sz;
> >>>>  	struct tap_filter txflt;
> >>>>  	atomic_t count;
> >>>>-	struct tun_struct *tun;
> >>>>+	struct tun_struct __rcu *tun;
> >>>>  	struct net *net;
> >>>>  	struct fasync_struct *fasync;
> >>>>  	unsigned int flags;
> >>>>+	u16 queue_index;
> >>>>  };
> >>>>
> >>>>  struct tun_sock;
> >>>>
> >>>>  struct tun_struct {
> >>>>-	struct tun_file		*tfile;
> >>>>+	struct tun_file		*tfiles[MAX_TAP_QUEUES];
> >>>>+	unsigned int            numqueues;
> >>>>  	unsigned int 		flags;
> >>>>  	uid_t			owner;
> >>>>  	gid_t			group;
> >>>>@@ -138,80 +142,159 @@ struct tun_struct {
> >>>>  #endif
> >>>>  };
> >>>>
> >>>>-static int tun_attach(struct tun_struct *tun, struct file *file)
> >>>>+static DEFINE_SPINLOCK(tun_lock);
> >>>>+
> >>>>+/*
> >>>>+ * tun_get_queue(): calculate the queue index
> >>>>+ *     - if skbs comes from mq nics, we can just borrow
> >>>>+ *     - if not, calculate from the hash
> >>>>+ */
> >>>>+static struct tun_file *tun_get_queue(struct net_device *dev,
> >>>>+				      struct sk_buff *skb)
> >>>>  {
> >>>>-	struct tun_file *tfile = file->private_data;
> >>>>-	int err;
> >>>>+	struct tun_struct *tun = netdev_priv(dev);
> >>>>+	struct tun_file *tfile = NULL;
> >>>>+	int numqueues = tun->numqueues;
> >>>>+	__u32 rxq;
> >>>>
> >>>>-	ASSERT_RTNL();
> >>>>+	BUG_ON(!rcu_read_lock_held());
> >>>>
> >>>>-	netif_tx_lock_bh(tun->dev);
> >>>>+	if (!numqueues)
> >>>>+		goto out;
> >>>>
> >>>>-	err = -EINVAL;
> >>>>-	if (tfile->tun)
> >>>>+	if (numqueues == 1) {
> >>>>+		tfile = rcu_dereference(tun->tfiles[0]);
> >>>Instead of hacks like this, you can ask for an MQ
> >>>flag to be set in SETIFF. Then you won't need to
> >>>handle attach/detach at random times.
> >>Consider a user switching from an sq guest to an mq guest: qemu would
> >>attach or detach the fd, which could not be expected in the kernel.
> >Can't userspace keep it attached always, just deactivate MQ?
> >
> >>>And most of the scary num_queues checks can go away.
> >>Even if we have an MQ flag, userspace could still attach just one queue
> >>to the device.
> >I think we allow too much flexibility if we let
> >userspace detach a random queue.
> 
> The point is to let tun/tap have the same flexibility as macvtap.
> Macvtap allows adding/deleting queues at any time and it's very easy to
> add detach/attach to macvtap. So we can easily use almost the same
> ioctls to activate/deactivate a queue at any time for both tap and
> macvtap.

Yes but userspace does not do this in practice:
it decides how many queues and just activates them all.

> >Maybe only allow attaching/detaching with MQ off?
> >If userspace wants to attach/detach, clear MQ first?
> 
> Maybe I didn't understand the point here, but I don't see any advantage,
> just more ioctl() calls.

Way simpler to implement.

> >Alternatively, attach/detach all queues in one ioctl?
> 
> Yes, it can be the same one.
> >
> >>>You can then also ask userspace about the max # of queues
> >>>to expect if you want to save some memory.
> >>>
> >>Yes, good suggestion.
> >>>>  		goto out;
> >>>>+	}
> >>>>
> >>>>-	err = -EBUSY;
> >>>>-	if (tun->tfile)
> >>>>+	if (likely(skb_rx_queue_recorded(skb))) {
> >>>>+		rxq = skb_get_rx_queue(skb);
> >>>>+
> >>>>+		while (unlikely(rxq >= numqueues))
> >>>>+			rxq -= numqueues;
> >>>>+
> >>>>+		tfile = rcu_dereference(tun->tfiles[rxq]);
> >>>>  		goto out;
> >>>>+	}
> >>>>
> >>>>-	err = 0;
> >>>>-	tfile->tun = tun;
> >>>>-	tun->tfile = tfile;
> >>>>-	netif_carrier_on(tun->dev);
> >>>>-	dev_hold(tun->dev);
> >>>>-	sock_hold(&tfile->sk);
> >>>>-	atomic_inc(&tfile->count);
> >>>>+	/* Check if we can use flow to select a queue */
> >>>>+	rxq = skb_get_rxhash(skb);
> >>>>+	if (rxq) {
> >>>>+		u32 idx = ((u64)rxq * numqueues) >> 32;
> >>>This completely confuses me. What's the logic here?
> >>>How do we even know it's in range?
> >>>
> >>rxq is a u32, so the result should be less than numqueues.
> >Aha. So the point is to use multiply+shift instead of %?
> >Please add a comment.
> >
> 
> Yes sure.

Not just about this trick, but generally explaining why we use
rxhash for transmit.
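
For reference, the multiply+shift mapping can be sketched in user-space
C. hash_to_queue() is a hypothetical name for this example; the patch
computes the same expression inline in tun_get_queue(). Because the hash
is below 2^32, the product hash * numqueues is below numqueues * 2^32,
so its top 32 bits always fall in [0, numqueues).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a 32-bit hash onto [0, numqueues) without a division or modulo:
 * widen to 64 bits, multiply by the queue count, keep the high 32 bits.
 */
static uint32_t hash_to_queue(uint32_t hash, uint32_t numqueues)
{
	assert(numqueues > 0);

	return (uint32_t)(((uint64_t)hash * numqueues) >> 32);
}
```

Unlike hash % numqueues, this scales the hash linearly over the queue
range, so the queue is chosen by the hash's high bits rather than its
low bits.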

> >>>>+		tfile = rcu_dereference(tun->tfiles[idx]);
> >>>>+		goto out;
> >>>>+	}
> >>>>
> >>>>+	tfile = rcu_dereference(tun->tfiles[0]);
> >>>>  out:
> >>>>-	netif_tx_unlock_bh(tun->dev);
> >>>>-	return err;
> >>>>+	return tfile;
> >>>>  }
> >>>>
> >>>>-static void __tun_detach(struct tun_struct *tun)
> >>>>+static int tun_detach(struct tun_file *tfile, bool clean)
> >>>>  {
> >>>>-	struct tun_file *tfile = tun->tfile;
> >>>>-	/* Detach from net device */
> >>>>-	netif_tx_lock_bh(tun->dev);
> >>>>-	netif_carrier_off(tun->dev);
> >>>>-	tun->tfile = NULL;
> >>>>-	netif_tx_unlock_bh(tun->dev);
> >>>>-
> >>>>-	/* Drop read queue */
> >>>>-	skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
> >>>>-
> >>>>-	/* Drop the extra count on the net device */
> >>>>-	dev_put(tun->dev);
> >>>>-}
> >>>>+	struct tun_struct *tun;
> >>>>+	struct net_device *dev = NULL;
> >>>>+	bool destroy = false;
> >>>>
> >>>>-static void tun_detach(struct tun_struct *tun)
> >>>>-{
> >>>>-	rtnl_lock();
> >>>>-	__tun_detach(tun);
> >>>>-	rtnl_unlock();
> >>>>-}
> >>>>+	spin_lock(&tun_lock);
> >>>>
> >>>>-static struct tun_struct *__tun_get(struct tun_file *tfile)
> >>>>-{
> >>>>-	struct tun_struct *tun = NULL;
> >>>>+	tun = rcu_dereference_protected(tfile->tun,
> >>>>+					lockdep_is_held(&tun_lock));
> >>>>+	if (tun) {
> >>>>+		u16 index = tfile->queue_index;
> >>>>+		BUG_ON(index >= tun->numqueues);
> >>>>+		dev = tun->dev;
> >>>>+
> >>>>+		rcu_assign_pointer(tun->tfiles[index],
> >>>>+				   tun->tfiles[tun->numqueues - 1]);
> >>>>+		tun->tfiles[index]->queue_index = index;
> >>>>+		rcu_assign_pointer(tfile->tun, NULL);
> >>>>+		--tun->numqueues;
> >>>>+		sock_put(&tfile->sk);
> >>>>
> >>>>-	if (atomic_inc_not_zero(&tfile->count))
> >>>>-		tun = tfile->tun;
> >>>>+		if (tun->numqueues == 0 && !(tun->flags & TUN_PERSIST))
> >>>>+			destroy = true;
> >>>Please don't use flags like that. Use dedicated labels and goto there on error.
> >>ok.
> >>>>+	}
> >>>>
> >>>>-	return tun;
> >>>>+	spin_unlock(&tun_lock);
> >>>>+
> >>>>+	synchronize_rcu();
> >>>>+	if (clean)
> >>>>+		sock_put(&tfile->sk);
> >>>>+
> >>>>+	if (destroy) {
> >>>>+		rtnl_lock();
> >>>>+		if (dev->reg_state == NETREG_REGISTERED)
> >>>>+			unregister_netdevice(dev);
> >>>>+		rtnl_unlock();
> >>>>+	}
> >>>>+
> >>>>+	return 0;
> >>>>  }
> >>>>
> >>>>-static struct tun_struct *tun_get(struct file *file)
> >>>>+static void tun_detach_all(struct net_device *dev)
> >>>>  {
> >>>>-	return __tun_get(file->private_data);
> >>>>+	struct tun_struct *tun = netdev_priv(dev);
> >>>>+	struct tun_file *tfile, *tfile_list[MAX_TAP_QUEUES];
> >>>>+	int i, j = 0;
> >>>>+
> >>>>+	spin_lock(&tun_lock);
> >>>>+
> >>>>+	for (i = 0; i < MAX_TAP_QUEUES && tun->numqueues; i++) {
> >>>>+		tfile = rcu_dereference_protected(tun->tfiles[i],
> >>>>+						lockdep_is_held(&tun_lock));
> >>>>+		BUG_ON(!tfile);
> >>>>+		wake_up_all(&tfile->wq.wait);
> >>>>+		tfile_list[j++] = tfile;
> >>>>+		rcu_assign_pointer(tfile->tun, NULL);
> >>>>+		--tun->numqueues;
> >>>>+	}
> >>>>+	BUG_ON(tun->numqueues != 0);
> >>>>+	/* guarantee that any future tun_attach will fail */
> >>>>+	tun->numqueues = MAX_TAP_QUEUES;
> >>>>+	spin_unlock(&tun_lock);
> >>>>+
> >>>>+	synchronize_rcu();
> >>>>+	for (--j; j >= 0; j--)
> >>>>+		sock_put(&tfile_list[j]->sk);
> >>>>  }
> >>>>
> >>>>-static void tun_put(struct tun_struct *tun)
> >>>>+static int tun_attach(struct tun_struct *tun, struct file *file)
> >>>>  {
> >>>>-	struct tun_file *tfile = tun->tfile;
> >>>>+	struct tun_file *tfile = file->private_data;
> >>>>+	int err;
> >>>>+
> >>>>+	ASSERT_RTNL();
> >>>>+
> >>>>+	spin_lock(&tun_lock);
> >>>>
> >>>>-	if (atomic_dec_and_test(&tfile->count))
> >>>>-		tun_detach(tfile->tun);
> >>>>+	err = -EINVAL;
> >>>>+	if (rcu_dereference_protected(tfile->tun, lockdep_is_held(&tun_lock)))
> >>>>+		goto out;
> >>>>+
> >>>>+	err = -EBUSY;
> >>>>+	if (!(tun->flags & TUN_TAP_MQ) && tun->numqueues == 1)
> >>>>+		goto out;
> >>>>+
> >>>>+	if (tun->numqueues == MAX_TAP_QUEUES)
> >>>>+		goto out;
> >>>>+
> >>>>+	err = 0;
> >>>>+	tfile->queue_index = tun->numqueues;
> >>>>+	rcu_assign_pointer(tfile->tun, tun);
> >>>>+	rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
> >>>>+	sock_hold(&tfile->sk);
> >>>>+	tun->numqueues++;
> >>>>+
> >>>>+	if (tun->numqueues == 1)
> >>>>+		netif_carrier_on(tun->dev);
> >>>>+
> >>>>+	/* device is allowed to go away first, so no need to hold extra
> >>>>+	 * refcnt. */
> >>>>+
> >>>>+out:
> >>>>+	spin_unlock(&tun_lock);
> >>>>+	return err;
> >>>>  }
> >>>>
> >>>>  /* TAP filtering */
> >>>>@@ -331,16 +414,7 @@ static const struct ethtool_ops tun_ethtool_ops;
> >>>>  /* Net device detach from fd. */
> >>>>  static void tun_net_uninit(struct net_device *dev)
> >>>>  {
> >>>>-	struct tun_struct *tun = netdev_priv(dev);
> >>>>-	struct tun_file *tfile = tun->tfile;
> >>>>-
> >>>>-	/* Inform the methods they need to stop using the dev.
> >>>>-	 */
> >>>>-	if (tfile) {
> >>>>-		wake_up_all(&tfile->wq.wait);
> >>>>-		if (atomic_dec_and_test(&tfile->count))
> >>>>-			__tun_detach(tun);
> >>>>-	}
> >>>>+	tun_detach_all(dev);
> >>>>  }
> >>>>
> >>>>  /* Net device open. */
> >>>>@@ -360,10 +434,10 @@ static int tun_net_close(struct net_device *dev)
> >>>>  /* Net device start xmit */
> >>>>  static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>  {
> >>>>-	struct tun_struct *tun = netdev_priv(dev);
> >>>>-	struct tun_file *tfile = tun->tfile;
> >>>>+	struct tun_file *tfile = NULL;
> >>>>
> >>>>-	tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
> >>>>+	rcu_read_lock();
> >>>>+	tfile = tun_get_queue(dev, skb);
> >>>>
> >>>>  	/* Drop packet if interface is not attached */
> >>>>  	if (!tfile)
> >>>>@@ -381,7 +455,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>
> >>>>  	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
> >>>>  	>= dev->tx_queue_len) {
> >>>>-		if (!(tun->flags & TUN_ONE_QUEUE)) {
> >>>>+		if (!(tfile->flags & TUN_ONE_QUEUE) &&
> >>>Which patch moved flags from tun to tfile?
> >>Patch 1 caches tun->flags in tfile, but it seems this may let the
> >>flags get out of sync. So we'd better use the one in tun_struct.
> >>>>+		    !(tfile->flags & TUN_TAP_MQ)) {
> >>>>  			/* Normal queueing mode. */
> >>>>  			/* Packet scheduler handles dropping of further packets. */
> >>>>  			netif_stop_queue(dev);
> >>>>@@ -390,7 +465,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>  			 * error is more appropriate. */
> >>>>  			dev->stats.tx_fifo_errors++;
> >>>>  		} else {
> >>>>-			/* Single queue mode.
> >>>>+			/* Single queue mode or multi queue mode.
> >>>>  			 * Driver handles dropping of all packets itself. */
> >>>Please don't do this. Stop the queue on overrun as appropriate.
> >>>ONE_QUEUE is a legacy hack.
> >>>
> >>>BTW we really should stop queue before we start dropping packets,
> >>>but that can be a separate patch.
> >>The problem here is the use of NETIF_F_LLTX. The kernel can only see
> >>one queue even for a multiqueue tun/tap. If we use
> >>netif_stop_queue(), all other queues would be stopped too.
> >Another reason not to use LLTX?
> 
> Yes.
> >>>>  			goto drop;
> >>>>  		}
> >>>>@@ -408,9 +483,11 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>  		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
> >>>>  	wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
> >>>>  				   POLLRDNORM | POLLRDBAND);
> >>>>+	rcu_read_unlock();
> >>>>  	return NETDEV_TX_OK;
> >>>>
> >>>>  drop:
> >>>>+	rcu_read_unlock();
> >>>>  	dev->stats.tx_dropped++;
> >>>>  	kfree_skb(skb);
> >>>>  	return NETDEV_TX_OK;
> >>>>@@ -527,16 +604,22 @@ static void tun_net_init(struct net_device *dev)
> >>>>  static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
> >>>>  {
> >>>>  	struct tun_file *tfile = file->private_data;
> >>>>-	struct tun_struct *tun = __tun_get(tfile);
> >>>>+	struct tun_struct *tun = NULL;
> >>>>  	struct sock *sk;
> >>>>  	unsigned int mask = 0;
> >>>>
> >>>>-	if (!tun)
> >>>>+	if (!tfile)
> >>>>  		return POLLERR;
> >>>>
> >>>>-	sk = tfile->socket.sk;
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun) {
> >>>>+		rcu_read_unlock();
> >>>>+		return POLLERR;
> >>>>+	}
> >>>>+	rcu_read_unlock();
> >>>>
> >>>>-	tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
> >>>>+	sk = &tfile->sk;
> >>>>
> >>>>  	poll_wait(file, &tfile->wq.wait, wait);
> >>>>
> >>>>@@ -548,10 +631,12 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
> >>>>  	     sock_writeable(sk)))
> >>>>  		mask |= POLLOUT | POLLWRNORM;
> >>>>
> >>>>-	if (tun->dev->reg_state != NETREG_REGISTERED)
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun || tun->dev->reg_state != NETREG_REGISTERED)
> >>>>  		mask = POLLERR;
> >>>>+	rcu_read_unlock();
> >>>>
> >>>>-	tun_put(tun);
> >>>>  	return mask;
> >>>>  }
> >>>>
> >>>>@@ -708,9 +793,12 @@ static ssize_t tun_get_user(struct tun_file *tfile,
> >>>>  		skb_shinfo(skb)->gso_segs = 0;
> >>>>  	}
> >>>>
> >>>>-	tun = __tun_get(tfile);
> >>>>-	if (!tun)
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun) {
> >>>>+		rcu_read_unlock();
> >>>>  		return -EBADFD;
> >>>>+	}
> >>>>
> >>>>  	switch (tfile->flags & TUN_TYPE_MASK) {
> >>>>  	case TUN_TUN_DEV:
> >>>>@@ -720,26 +808,30 @@ static ssize_t tun_get_user(struct tun_file *tfile,
> >>>>  		skb->protocol = eth_type_trans(skb, tun->dev);
> >>>>  		break;
> >>>>  	}
> >>>>-
> >>>>-	netif_rx_ni(skb);
> >>>>  	tun->dev->stats.rx_packets++;
> >>>>  	tun->dev->stats.rx_bytes += len;
> >>>>-	tun_put(tun);
> >>>>+	rcu_read_unlock();
> >>>>+
> >>>>+	netif_rx_ni(skb);
> >>>>+
> >>>>  	return count;
> >>>>
> >>>>  err_free:
> >>>>  	count = -EINVAL;
> >>>>  	kfree_skb(skb);
> >>>>  err:
> >>>>-	tun = __tun_get(tfile);
> >>>>-	if (!tun)
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun) {
> >>>>+		rcu_read_unlock();
> >>>>  		return -EBADFD;
> >>>>+	}
> >>>>
> >>>>  	if (drop)
> >>>>  		tun->dev->stats.rx_dropped++;
> >>>>  	if (error)
> >>>>  		tun->dev->stats.rx_frame_errors++;
> >>>>-	tun_put(tun);
> >>>>+	rcu_read_unlock();
> >>>>  	return count;
> >>>>  }
> >>>>
> >>>>@@ -833,12 +925,13 @@ static ssize_t tun_put_user(struct tun_file *tfile,
> >>>>  	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
> >>>>  	total += skb->len;
> >>>>
> >>>>-	tun = __tun_get(tfile);
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>  	if (tun) {
> >>>>  		tun->dev->stats.tx_packets++;
> >>>>  		tun->dev->stats.tx_bytes += len;
> >>>>-		tun_put(tun);
> >>>>  	}
> >>>>+	rcu_read_unlock();
> >>>>
> >>>>  	return total;
> >>>>  }
> >>>>@@ -869,28 +962,31 @@ static ssize_t tun_do_read(struct tun_file *tfile,
> >>>>  				break;
> >>>>  			}
> >>>>
> >>>>-			tun = __tun_get(tfile);
> >>>>+			rcu_read_lock();
> >>>>+			tun = rcu_dereference(tfile->tun);
> >>>>  			if (!tun) {
> >>>>-				ret = -EIO;
> >>>>+				ret = -EBADFD;
> >>>BADFD is for when you get passed something like -1 fd.
> >>>Here fd is OK, it's just in a bad state so you can not do IO.
> >>>
> >>Sure.
> >>>>+				rcu_read_unlock();
> >>>>  				break;
> >>>>  			}
> >>>>  			if (tun->dev->reg_state != NETREG_REGISTERED) {
> >>>>  				ret = -EIO;
> >>>>-				tun_put(tun);
> >>>>+				rcu_read_unlock();
> >>>>  				break;
> >>>>  			}
> >>>>-			tun_put(tun);
> >>>>+			rcu_read_unlock();
> >>>>
> >>>>  			/* Nothing to read, let's sleep */
> >>>>  			schedule();
> >>>>  			continue;
> >>>>  		}
> >>>>
> >>>>-		tun = __tun_get(tfile);
> >>>>+		rcu_read_lock();
> >>>>+		tun = rcu_dereference(tfile->tun);
> >>>>  		if (tun) {
> >>>>  			netif_wake_queue(tun->dev);
> >>>>-			tun_put(tun);
> >>>>  		}
> >>>>+		rcu_read_unlock();
> >>>>
> >>>>  		ret = tun_put_user(tfile, skb, iv, len);
> >>>>  		kfree_skb(skb);
> >>>>@@ -1038,6 +1134,9 @@ static int tun_flags(struct tun_struct *tun)
> >>>>  	if (tun->flags & TUN_VNET_HDR)
> >>>>  		flags |= IFF_VNET_HDR;
> >>>>
> >>>>+	if (tun->flags & TUN_TAP_MQ)
> >>>>+		flags |= IFF_MULTI_QUEUE;
> >>>>+
> >>>>  	return flags;
> >>>>  }
> >>>>
> >>>>@@ -1097,8 +1196,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>  		err = tun_attach(tun, file);
> >>>>  		if (err < 0)
> >>>>  			return err;
> >>>>-	}
> >>>>-	else {
> >>>>+	} else {
> >>>>  		char *name;
> >>>>  		unsigned long flags = 0;
> >>>>
> >>>>@@ -1142,6 +1240,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>  		dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
> >>>>  			TUN_USER_FEATURES;
> >>>>  		dev->features = dev->hw_features;
> >>>>+		if (ifr->ifr_flags & IFF_MULTI_QUEUE)
> >>>>+			dev->features |= NETIF_F_LLTX;
> >>>>
> >>>>  		err = register_netdevice(tun->dev);
> >>>>  		if (err < 0)
> >>>>@@ -1154,7 +1254,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>
> >>>>  		err = tun_attach(tun, file);
> >>>>  		if (err < 0)
> >>>>-			goto failed;
> >>>>+			goto err_free_dev;
> >>>>  	}
> >>>>
> >>>>  	tun_debug(KERN_INFO, tun, "tun_set_iff\n");
> >>>>@@ -1174,6 +1274,11 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>  	else
> >>>>  		tun->flags &= ~TUN_VNET_HDR;
> >>>>
> >>>>+	if (ifr->ifr_flags & IFF_MULTI_QUEUE)
> >>>>+		tun->flags |= TUN_TAP_MQ;
> >>>>+	else
> >>>>+		tun->flags &= ~TUN_TAP_MQ;
> >>>>+
> >>>>  	/* Cache flags from tun device */
> >>>>  	tfile->flags = tun->flags;
> >>>>  	/* Make sure persistent devices do not get stuck in
> >>>>@@ -1187,7 +1292,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>
> >>>>  err_free_dev:
> >>>>  	free_netdev(dev);
> >>>>-failed:
> >>>>  	return err;
> >>>>  }
> >>>>
> >>>>@@ -1264,38 +1368,40 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
> >>>>  				(unsigned int __user*)argp);
> >>>>  	}
> >>>>
> >>>>-	rtnl_lock();
> >>>>-
> >>>>-	tun = __tun_get(tfile);
> >>>>-	if (cmd == TUNSETIFF && !tun) {
> >>>>+	ret = 0;
> >>>>+	if (cmd == TUNSETIFF) {
> >>>>+		rtnl_lock();
> >>>>  		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> >>>>-
> >>>>  		ret = tun_set_iff(tfile->net, file, &ifr);
> >>>>-
> >>>>+		rtnl_unlock();
> >>>>  		if (ret)
> >>>>-			goto unlock;
> >>>>-
> >>>>+			return ret;
> >>>>  		if (copy_to_user(argp, &ifr, ifreq_len))
> >>>>-			ret = -EFAULT;
> >>>>-		goto unlock;
> >>>>+			return -EFAULT;
> >>>>+		return ret;
> >>>>  	}
> >>>>
> >>>>+	rtnl_lock();
> >>>>+
> >>>>+	rcu_read_lock();
> >>>>+
> >>>>  	ret = -EBADFD;
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>  	if (!tun)
> >>>>  		goto unlock;
> >>>>+	else
> >>>>+		ret = 0;
> >>>>
> >>>>-	tun_debug(KERN_INFO, tun, "tun_chr_ioctl cmd %d\n", cmd);
> >>>>-
> >>>>-	ret = 0;
> >>>>  	switch (cmd) {
> >>>>  	case TUNGETIFF:
> >>>>  		ret = tun_get_iff(current->nsproxy->net_ns, tun, &ifr);
> >>>>+		rcu_read_unlock();
> >>>>  		if (ret)
> >>>>-			break;
> >>>>+			goto out;
> >>>>
> >>>>  		if (copy_to_user(argp, &ifr, ifreq_len))
> >>>>  			ret = -EFAULT;
> >>>>-		break;
> >>>>+		goto out;
> >>>>
> >>>>  	case TUNSETNOCSUM:
> >>>>  		/* Disable/Enable checksum */
> >>>>@@ -1357,9 +1463,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
> >>>>  		/* Get hw address */
> >>>>  		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
> >>>>  		ifr.ifr_hwaddr.sa_family = tun->dev->type;
> >>>>+		rcu_read_unlock();
> >>>>  		if (copy_to_user(argp, &ifr, ifreq_len))
> >>>>  			ret = -EFAULT;
> >>>>-		break;
> >>>>+		goto out;
> >>>>
> >>>>  	case SIOCSIFHWADDR:
> >>>>  		/* Set hw address */
> >>>>@@ -1375,9 +1482,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
> >>>>  	}
> >>>>
> >>>>  unlock:
> >>>>+	rcu_read_unlock();
> >>>>+out:
> >>>>  	rtnl_unlock();
> >>>>-	if (tun)
> >>>>-		tun_put(tun);
> >>>>  	return ret;
> >>>>  }
> >>>>
> >>>>@@ -1517,6 +1624,11 @@ out:
> >>>>  	return ret;
> >>>>  }
> >>>>
> >>>>+static void tun_sock_destruct(struct sock *sk)
> >>>>+{
> >>>>+	skb_queue_purge(&sk->sk_receive_queue);
> >>>>+}
> >>>>+
> >>>>  static int tun_chr_open(struct inode *inode, struct file * file)
> >>>>  {
> >>>>  	struct net *net = current->nsproxy->net_ns;
> >>>>@@ -1540,6 +1652,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
> >>>>  	sock_init_data(&tfile->socket, &tfile->sk);
> >>>>
> >>>>  	tfile->sk.sk_write_space = tun_sock_write_space;
> >>>>+	tfile->sk.sk_destruct = tun_sock_destruct;
> >>>>  	tfile->sk.sk_sndbuf = INT_MAX;
> >>>>  	file->private_data = tfile;
> >>>>
> >>>>@@ -1549,31 +1662,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
> >>>>  static int tun_chr_close(struct inode *inode, struct file *file)
> >>>>  {
> >>>>  	struct tun_file *tfile = file->private_data;
> >>>>-	struct tun_struct *tun;
> >>>>-
> >>>>-	tun = __tun_get(tfile);
> >>>>-	if (tun) {
> >>>>-		struct net_device *dev = tun->dev;
> >>>>-
> >>>>-		tun_debug(KERN_INFO, tun, "tun_chr_close\n");
> >>>>-
> >>>>-		__tun_detach(tun);
> >>>>-
> >>>>-		/* If desirable, unregister the netdevice. */
> >>>>-		if (!(tun->flags & TUN_PERSIST)) {
> >>>>-			rtnl_lock();
> >>>>-			if (dev->reg_state == NETREG_REGISTERED)
> >>>>-				unregister_netdevice(dev);
> >>>>-			rtnl_unlock();
> >>>>-		}
> >>>>
> >>>>-		/* drop the reference that netdevice holds */
> >>>>-		sock_put(&tfile->sk);
> >>>>-
> >>>>-	}
> >>>>-
> >>>>-	/* drop the reference that file holds */
> >>>>-	sock_put(&tfile->sk);
> >>>>+	tun_detach(tfile, true);
> >>>>
> >>>>  	return 0;
> >>>>  }
> >>>>@@ -1700,14 +1790,17 @@ static void tun_cleanup(void)
> >>>>   * holding a reference to the file for as long as the socket is in use. */
> >>>>  struct socket *tun_get_socket(struct file *file)
> >>>>  {
> >>>>-	struct tun_struct *tun;
> >>>>+	struct tun_struct *tun = NULL;
> >>>>  	struct tun_file *tfile = file->private_data;
> >>>>  	if (file->f_op != &tun_fops)
> >>>>  		return ERR_PTR(-EINVAL);
> >>>>-	tun = tun_get(file);
> >>>>-	if (!tun)
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun) {
> >>>>+		rcu_read_unlock();
> >>>>  		return ERR_PTR(-EBADFD);
> >>>>-	tun_put(tun);
> >>>>+	}
> >>>>+	rcu_read_unlock();
> >>>>  	return &tfile->socket;
> >>>>  }
> >>>>  EXPORT_SYMBOL_GPL(tun_get_socket);

^ permalink raw reply

* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: Hans Schillstrom @ 2012-06-27  8:30 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet@gmail.com, subramanian.vijay@gmail.com,
	dave.taht@gmail.com, netdev@vger.kernel.org, ncardwell@google.com,
	therbert@google.com, brouer@redhat.com
In-Reply-To: <20120627.012204.1603944185909940692.davem@davemloft.net>

On Wednesday 27 June 2012 10:22:04 David Miller wrote:
> From: Hans Schillstrom <hans.schillstrom@ericsson.com>
> Date: Wed, 27 Jun 2012 07:23:03 +0200
> 
> > On Tuesday 26 June 2012 19:02:36 Eric Dumazet wrote:
> >> With David patch using jhash instead of SHA, I reach ~315.000 SYN per
> >> second.
> > 
> > I have similar results from ~170k to ~199k synack/sec.
> 
> Eric and Hans, I'm going to add Tested-by: tags for both of you
> when I commit this patch if you don't mind. :-)
> 

No problems, go ahead

^ permalink raw reply

* Re: [PATCH 09/16] netvm: Allow skb allocation to use PFMEMALLOC reserves
From: Mel Gorman @ 2012-06-27  8:32 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Eric Dumazet
In-Reply-To: <20120626152734.GA6509@breakpoint.cc>

On Tue, Jun 26, 2012 at 05:27:34PM +0200, Sebastian Andrzej Siewior wrote:
> On Fri, Jun 22, 2012 at 03:30:36PM +0100, Mel Gorman wrote:
> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 5c9ca2b..159dccc 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -271,6 +271,9 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
> >  int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
> >  EXPORT_SYMBOL(sysctl_optmem_max);
> >  
> > +struct static_key memalloc_socks = STATIC_KEY_INIT_FALSE;
> > +EXPORT_SYMBOL_GPL(memalloc_socks);
> > +
> 
> This is used via sk_memalloc_socks() by SLAB.
> 
> From 3da9ab9972845974da114c5a6624335e6371b2d5 Mon Sep 17 00:00:00 2001
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Date: Tue, 26 Jun 2012 17:18:20 +0200
> Subject: [PATCH] export sk_memalloc_socks() only with CONFIG_NET
> 
> |mm/built-in.o: In function `atomic_read':
> |include/asm/atomic.h:25: undefined reference to `memalloc_socks'
> |include/asm/atomic.h:25: undefined reference to `memalloc_socks'
> |include/asm/atomic.h:25: undefined reference to `memalloc_socks'
> |include/asm/atomic.h:25: undefined reference to `memalloc_socks'
> |include/asm/atomic.h:25: undefined reference to `memalloc_socks'
> |mm/built-in.o:include/asm/atomic.h:25: more undefined references to `memalloc_socks' follow
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Well caught. I had not tested a build with !CONFIG_NET. I've folded in
this patch and updated the credits accordingly. Thanks.
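
For anyone reading along, the usual shape of this kind of fix is a
config-guarded inline helper, so that code outside net/ never references the
symbol when networking is compiled out. A minimal userspace sketch of that
pattern follows; it is not the folded-in patch, CONFIG_NET is treated as an
ordinary preprocessor symbol, and every name in it (memalloc_socks_enabled,
sk_memalloc_socks_sketch, pages_may_use_reserves) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch only: CONFIG_NET here is a plain preprocessor symbol that the
 * build may or may not define; all names below are hypothetical. */
#ifdef CONFIG_NET
/* With networking built in, the flag lives with the networking code. */
static bool memalloc_socks_enabled;

static inline bool sk_memalloc_socks_sketch(void)
{
	return memalloc_socks_enabled;
}
#else
/* Without networking there is no flag to reference, so callers in other
 * subsystems get a constant and never emit a reference to the symbol. */
static inline bool sk_memalloc_socks_sketch(void)
{
	return false;
}
#endif

/* A caller that has to link whether or not CONFIG_NET is set. */
static int pages_may_use_reserves(void)
{
	return sk_memalloc_socks_sketch() ? 1 : 0;
}
```

Built without -DCONFIG_NET, the caller links against the constant stub; built
with it, the caller reads the flag, which starts out false.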

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH net-next v2] be2net: Fix to trim skb for padded vlan packets to workaround an ASIC Bug
From: Somnath Kotur @ 2012-06-27  8:32 UTC (permalink / raw)
  To: netdev; +Cc: davem, Somnath Kotur

Fixed spelling error in a comment as pointed out by DaveM.
Also refactored existing code a bit to provide placeholders for another ASIC
Bug workaround that will be checked-in soon after this.

Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
---
 drivers/net/ethernet/emulex/benet/be.h      |    5 ++
 drivers/net/ethernet/emulex/benet/be_main.c |   56 ++++++++++++++++++++-------
 2 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be.h b/drivers/net/ethernet/emulex/benet/be.h
index 7b5cc2b..7a71fb6 100644
--- a/drivers/net/ethernet/emulex/benet/be.h
+++ b/drivers/net/ethernet/emulex/benet/be.h
@@ -573,6 +573,11 @@ static inline u8 is_udp_pkt(struct sk_buff *skb)
 	return val;
 }
 
+static inline bool is_ipv4_pkt(struct sk_buff *skb)
+{
+	return skb->protocol == htons(ETH_P_IP) && ip_hdr(skb)->version == 4;
+}
+
 static inline void be_vf_eth_addr_generate(struct be_adapter *adapter, u8 *mac)
 {
 	u32 addr;
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index a28896d..edce7af 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -577,6 +577,11 @@ static inline u16 be_get_tx_vlan_tag(struct be_adapter *adapter,
 	return vlan_tag;
 }
 
+static int be_vlan_tag_chk(struct be_adapter *adapter, struct sk_buff *skb)
+{
+	return vlan_tx_tag_present(skb) || adapter->pvid;
+}
+
 static void wrb_fill_hdr(struct be_adapter *adapter, struct be_eth_hdr_wrb *hdr,
 		struct sk_buff *skb, u32 wrb_cnt, u32 len)
 {
@@ -704,33 +709,56 @@ dma_err:
 	return 0;
 }
 
+static struct sk_buff *be_insert_vlan_in_pkt(struct be_adapter *adapter,
+					     struct sk_buff *skb)
+{
+	u16 vlan_tag = 0;
+
+	skb = skb_share_check(skb, GFP_ATOMIC);
+	if (unlikely(!skb))
+		return skb;
+
+	if (vlan_tx_tag_present(skb)) {
+		vlan_tag = be_get_tx_vlan_tag(adapter, skb);
+		__vlan_put_tag(skb, vlan_tag);
+		skb->vlan_tci = 0;
+	}
+
+	return skb;
+}
+
 static netdev_tx_t be_xmit(struct sk_buff *skb,
 			struct net_device *netdev)
 {
 	struct be_adapter *adapter = netdev_priv(netdev);
 	struct be_tx_obj *txo = &adapter->tx_obj[skb_get_queue_mapping(skb)];
 	struct be_queue_info *txq = &txo->q;
+	struct iphdr *ip = NULL;
 	u32 wrb_cnt = 0, copied = 0;
-	u32 start = txq->head;
+	u32 start = txq->head, eth_hdr_len;
 	bool dummy_wrb, stopped = false;
 
-	/* For vlan tagged pkts, BE
-	 * 1) calculates checksum even when CSO is not requested
-	 * 2) calculates checksum wrongly for padded pkt less than
-	 * 60 bytes long.
-	 * As a workaround disable TX vlan offloading in such cases.
+	eth_hdr_len = ntohs(skb->protocol) == ETH_P_8021Q ?
+		VLAN_ETH_HLEN : ETH_HLEN;
+
+	/* HW has a bug which considers padding bytes as legal
+	 * and modifies the IPv4 hdr's 'tot_len' field
 	 */
-	if (vlan_tx_tag_present(skb) &&
-	    (skb->ip_summed != CHECKSUM_PARTIAL || skb->len <= 60)) {
-		skb = skb_share_check(skb, GFP_ATOMIC);
-		if (unlikely(!skb))
-			goto tx_drop;
+	if (skb->len <= 60 && be_vlan_tag_chk(adapter, skb) &&
+			is_ipv4_pkt(skb)) {
+		ip = (struct iphdr *)ip_hdr(skb);
+		pskb_trim(skb, eth_hdr_len + ntohs(ip->tot_len));
+	}
 
-		skb = __vlan_put_tag(skb, be_get_tx_vlan_tag(adapter, skb));
+	/* HW has a bug wherein it will calculate CSUM for VLAN
+	 * pkts even though it is disabled.
+	 * Manually insert VLAN in pkt.
+	 */
+	if (skb->ip_summed != CHECKSUM_PARTIAL &&
+			be_vlan_tag_chk(adapter, skb)) {
+		skb = be_insert_vlan_in_pkt(adapter, skb);
 		if (unlikely(!skb))
 			goto tx_drop;
-
-		skb->vlan_tci = 0;
 	}
 
 	wrb_cnt = wrb_cnt_for_skb(adapter, skb, &dummy_wrb);
-- 
1.5.6.1
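
As a reading aid for the trim step in be_xmit(): the target length is the
link-layer header length plus the IPv4 total length, so any padding behind
the IP datagram is cut off. A minimal userspace sketch of just that
arithmetic, assuming the standard 14-byte and 18-byte header sizes; the
function and macro names are hypothetical and this is not driver code.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

#define ETH_HLEN_SKETCH       14u  /* untagged Ethernet header */
#define VLAN_ETH_HLEN_SKETCH  18u  /* Ethernet header plus one 802.1Q tag */

/* Length the frame should have once padding is removed: link-layer header
 * plus the IPv4 total length. tot_len_be is the tot_len field exactly as
 * it appears on the wire (big-endian). */
static unsigned int trimmed_frame_len(int vlan_in_frame, uint16_t tot_len_be)
{
	unsigned int hdr_len = vlan_in_frame ? VLAN_ETH_HLEN_SKETCH
					     : ETH_HLEN_SKETCH;

	return hdr_len + ntohs(tot_len_be);
}
```

For a 28-byte IP datagram padded out to the 60-byte minimum, this gives 42
bytes untagged and 46 bytes with the tag already in the frame.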

^ permalink raw reply related

* Re: [PATCH net-next v2] be2net: Fix to trim skb for padded vlan packets to workaround an ASIC Bug
From: David Miller @ 2012-06-27  8:36 UTC (permalink / raw)
  To: somnath.kotur; +Cc: netdev
In-Reply-To: <879c023f-1730-4874-a943-58b6725f80f4@exht1.ad.emulex.com>

From: Somnath Kotur <somnath.kotur@emulex.com>
Date: Wed, 27 Jun 2012 14:02:10 +0530

> Fixed spelling error in a comment as pointed out by DaveM.
> Also refactored existing code a bit to provide placeholders for another ASIC
> Bug workaround that will be checked-in soon after this.
> 
> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>

Applied, thanks.

^ permalink raw reply

* [PATCH] iwlegacy: print how long queue was actually stuck
From: Paul Bolle @ 2012-06-27  8:36 UTC (permalink / raw)
  To: Stanislaw Gruszka, John W. Linville; +Cc: linux-wireless, netdev, linux-kernel

Every now and then, after resuming from suspend, the iwlegacy driver
prints
    iwl4965 0000:03:00.0: Queue 2 stuck for 2000 ms.
    iwl4965 0000:03:00.0: On demand firmware reload

I have no idea what causes these errors. But the code currently uses
wd_timeout in the first error. wd_timeout will generally be set at
IL_DEF_WD_TIMEOUT (ie, 2000). Perhaps printing for how long the queue
was actually stuck can clarify the cause of these errors.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
---
0) Compile tested on v3.5-rc4. Tested on Fedora's current v3.4.2 based
kernel (ie, on F16). That required an edit to this commit because of
trivial context changes.

1) Please note that testing here involved waiting until I again
triggered this error (which now of course printed how long the queue was
actually stuck).

 drivers/net/wireless/iwlegacy/common.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/iwlegacy/common.c b/drivers/net/wireless/iwlegacy/common.c
index cbf2dc1..763c752 100644
--- a/drivers/net/wireless/iwlegacy/common.c
+++ b/drivers/net/wireless/iwlegacy/common.c
@@ -4717,10 +4717,11 @@ il_check_stuck_queue(struct il_priv *il, int cnt)
 	struct il_tx_queue *txq = &il->txq[cnt];
 	struct il_queue *q = &txq->q;
 	unsigned long timeout;
+	unsigned long now = jiffies;
 	int ret;
 
 	if (q->read_ptr == q->write_ptr) {
-		txq->time_stamp = jiffies;
+		txq->time_stamp = now;
 		return 0;
 	}
 
@@ -4728,9 +4729,9 @@ il_check_stuck_queue(struct il_priv *il, int cnt)
 	    txq->time_stamp +
 	    msecs_to_jiffies(il->cfg->wd_timeout);
 
-	if (time_after(jiffies, timeout)) {
+	if (time_after(now, timeout)) {
 		IL_ERR("Queue %d stuck for %u ms.\n", q->id,
-		       il->cfg->wd_timeout);
+		       jiffies_to_msecs(now - txq->time_stamp));
 		ret = il_force_reset(il, false);
 		return (ret == -EAGAIN) ? 0 : 1;
 	}
-- 
1.7.7.6
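
The elapsed-time arithmetic in the patch relies on unsigned tick counters
wrapping cleanly. A minimal userspace sketch of the two operations involved,
assuming a free-running unsigned long counter; time_after_sketch() mirrors
the idea behind the kernel's time_after() but is not the kernel macro, and
both names are hypothetical.

```c
#include <assert.h>
#include <limits.h>	/* ULONG_MAX, handy when exercising the wrap case */

/* Same idea as time_after(a, b): true if a is later than b, computed on
 * the signed difference so it stays correct when the counter wraps. */
static int time_after_sketch(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Ticks elapsed since stamp. Unsigned subtraction is modulo 2^N, so the
 * result is right even if the counter overflowed between stamp and now. */
static unsigned long ticks_since(unsigned long now, unsigned long stamp)
{
	return now - stamp;
}
```

With a stamp taken 11 ticks before the counter wraps and a reading of 20
afterwards, ticks_since() reports 31 ticks rather than a huge bogus value.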

^ permalink raw reply related

* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: Eric Dumazet @ 2012-06-27  8:40 UTC (permalink / raw)
  To: David Miller
  Cc: hans.schillstrom, subramanian.vijay, dave.taht, netdev, ncardwell,
	therbert, brouer
In-Reply-To: <20120627.012204.1603944185909940692.davem@davemloft.net>

On Wed, 2012-06-27 at 01:22 -0700, David Miller wrote:
> From: Hans Schillstrom <hans.schillstrom@ericsson.com>
> Date: Wed, 27 Jun 2012 07:23:03 +0200
> 
> > On Tuesday 26 June 2012 19:02:36 Eric Dumazet wrote:
> >> With David patch using jhash instead of SHA, I reach ~315.000 SYN per
> >> second.
> > 
> > I have similar results from ~170k to ~199k synack/sec.
> 
> Eric and Hans, I'm going to add Tested-by: tags for both of you
> when I commit this patch if you don't mind. :-)

Well, please send your complete patch first (with the IPv6 part)?

^ permalink raw reply

* Re: [PATCH 11/16] netvm: Propagate page->pfmemalloc from skb_alloc_page to skb
From: Mel Gorman @ 2012-06-27  8:43 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Andrew Morton, Linux-MM, Linux-Netdev, LKML, David Miller,
	Neil Brown, Peter Zijlstra, Mike Christie, Eric B Munson,
	Eric Dumazet
In-Reply-To: <20120626201328.GI6509@breakpoint.cc>

On Tue, Jun 26, 2012 at 10:13:28PM +0200, Sebastian Andrzej Siewior wrote:
> On Fri, Jun 22, 2012 at 03:30:38PM +0100, Mel Gorman wrote:
> >  drivers/net/ethernet/chelsio/cxgb4/sge.c          |    2 +-
> >  drivers/net/ethernet/chelsio/cxgb4vf/sge.c        |    2 +-
> >  drivers/net/ethernet/intel/igb/igb_main.c         |    2 +-
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    4 +-
> >  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    3 +-
> >  drivers/net/usb/cdc-phonet.c                      |    2 +-
> >  drivers/usb/gadget/f_phonet.c                     |    2 +-
> 
> You did not touch all drivers which use alloc_page(s)() like e1000(e). Was
> this on purpose?
> 

Yes. The ones I changed were the semi-obvious ones and carried over from
when the patches were completely out of tree.  As the changelog notes
it is not critical that these annotations happen and they can be fixed on a
per-driver basis if there are complaints about network swapping being slow.

In the e1000 case, alloc_page is called from e1000_alloc_jumbo_rx_buffers
and I would not have paid quite as close attention to jumbo configurations
even though e1000 does not depend on high-order allocations like some
other drivers do. I can update e1000 if you like but it's not critical
to do so and in fact getting a bug report saying that network swap
was slow on e1000 would be useful to me in its own way :)

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support
From: Michael S. Tsirkin @ 2012-06-27  8:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: habanero, netdev, linux-kernel, krkumar2, tahm, akong, davem,
	shemminger, mashirle
In-Reply-To: <4FEA972E.9010104@redhat.com>

On Wed, Jun 27, 2012 at 01:16:30PM +0800, Jason Wang wrote:
> On 06/26/2012 06:42 PM, Michael S. Tsirkin wrote:
> >On Tue, Jun 26, 2012 at 11:42:17AM +0800, Jason Wang wrote:
> >>On 06/25/2012 04:25 PM, Michael S. Tsirkin wrote:
> >>>On Mon, Jun 25, 2012 at 02:10:18PM +0800, Jason Wang wrote:
> >>>>This patch adds multiqueue support for tap device. This is done by abstracting
> >>>>each queue as a file/socket and allowing multiple sockets to be attached to the
> >>>>tuntap device (an array of tun_file were stored in the tun_struct). Userspace
> >>>>could write and read from those files to do the parallel packet
> >>>>sending/receiving.
> >>>>
> >>>>Unlike the previous single queue implementation, the socket and device were
> >>>>loosely coupled, each of them were allowed to go away first. In order to let the
> >>>>tx path lockless, netif_tx_lock_bh() is replaced by RCU/NETIF_F_LLTX to
> >>>>synchronize between data path and system call.
> >>>Don't use LLTX/RCU. It's not worth it.
> >>>Use something like netif_set_real_num_tx_queues.
> >>>
> >>>>Tx queue selection is first based on the recorded rxq index of an skb; if
> >>>>there is none, the queue is chosen by rx hashing (skb_get_rxhash()).
> >>>>
> >>>>Signed-off-by: Jason Wang <jasowang@redhat.com>
> >>>Interestingly macvtap switched to hashing first:
> >>>ef0002b577b52941fb147128f30bd1ecfdd3ff6d
> >>>(the commit log is corrupted but see what it
> >>>does in the patch).
> >>>Any idea why?
> >>Yes, so tap should be changed to behave the same as macvtap. I remember
> >>the reason we do that is to make sure the packets of a single flow are
> >>queued to a fixed socket/virtqueue. 10g cards like ixgbe
> >>choose the rx queue for a flow based on the last tx queue that the
> >>packets of that flow came from. So if we use the recorded rx queue in
> >>macvtap, the queue index of a flow would change as the vhost thread
> >>moves among processors.
> >Hmm. OTOH if you override this, if TX is sent from VCPU0, RX might land
> >on VCPU1 in the guest, which is not good, right?
> 
> Yes, but better than making rx move between vcpus when we use the
> recorded rx queue.

Why isn't this a problem with native TCP?
I think what happens is one of the following:
- moving between CPUs is more expensive with tun
  because it can queue so much data on xmit
- scheduler makes very bad decisions about VCPUs
  bouncing them around all the time

Could we isolate which it is? Does the problem
still happen if you pin VCPUs to host cpus?
If not it's the queue depth.

> Flow steering is needed to make sure tx and
> rx are on the same vcpu.

That involves IPI between processes, so it might be
very expensive for kvm.

> >>But while testing tun/tap, one interesting thing I found is that even
> >>though ixgbe has recorded the queue index during rx, it seems to be lost
> >>when tap tries to transmit skbs to userspace.
> >dev_pick_tx does this I think but ndo_select_queue
> >should be able to get it without trouble.
> >
> >
> >>>>---
> >>>>  drivers/net/tun.c |  371 +++++++++++++++++++++++++++++++++--------------------
> >>>>  1 files changed, 232 insertions(+), 139 deletions(-)
> >>>>
> >>>>diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>>index 8233b0a..5c26757 100644
> >>>>--- a/drivers/net/tun.c
> >>>>+++ b/drivers/net/tun.c
> >>>>@@ -107,6 +107,8 @@ struct tap_filter {
> >>>>  	unsigned char	addr[FLT_EXACT_COUNT][ETH_ALEN];
> >>>>  };
> >>>>
> >>>>+#define MAX_TAP_QUEUES (NR_CPUS < 16 ? NR_CPUS : 16)
> >>>Why the limit? I am guessing you copied this from macvtap?
> >>>This is problematic for a number of reasons:
> >>>	- will not play well with migration
> >>>	- will not work well for a large guest
> >>>
> >>>Yes, macvtap needs to be fixed too.
> >>>
> >>>I am guessing what it is trying to prevent is queueing
> >>>up a huge number of packets?
> >>>So just divide the default tx queue limit by the # of queues.
> >>>
> >>>And by the way, for MQ applications maybe we can finally
> >>>ignore tx queue altogether and limit the total number
> >>>of bytes queued?
> >>>To avoid regressions we can make it large like 64M/# queues.
> >>>Could be a separate patch I think, and for a single queue
> >>>might need a compatible mode though I am not sure.
> >>>
> >>>>+
> >>>>  struct tun_file {
> >>>>  	struct sock sk;
> >>>>  	struct socket socket;
> >>>>@@ -114,16 +116,18 @@ struct tun_file {
> >>>>  	int vnet_hdr_sz;
> >>>>  	struct tap_filter txflt;
> >>>>  	atomic_t count;
> >>>>-	struct tun_struct *tun;
> >>>>+	struct tun_struct __rcu *tun;
> >>>>  	struct net *net;
> >>>>  	struct fasync_struct *fasync;
> >>>>  	unsigned int flags;
> >>>>+	u16 queue_index;
> >>>>  };
> >>>>
> >>>>  struct tun_sock;
> >>>>
> >>>>  struct tun_struct {
> >>>>-	struct tun_file		*tfile;
> >>>>+	struct tun_file		*tfiles[MAX_TAP_QUEUES];
> >>>>+	unsigned int            numqueues;
> >>>>  	unsigned int 		flags;
> >>>>  	uid_t			owner;
> >>>>  	gid_t			group;
> >>>>@@ -138,80 +142,159 @@ struct tun_struct {
> >>>>  #endif
> >>>>  };
> >>>>
> >>>>-static int tun_attach(struct tun_struct *tun, struct file *file)
> >>>>+static DEFINE_SPINLOCK(tun_lock);
> >>>>+
> >>>>+/*
> >>>>+ * tun_get_queue(): calculate the queue index
> >>>>+ *     - if skbs comes from mq nics, we can just borrow
> >>>>+ *     - if not, calculate from the hash
> >>>>+ */
> >>>>+static struct tun_file *tun_get_queue(struct net_device *dev,
> >>>>+				      struct sk_buff *skb)
> >>>>  {
> >>>>-	struct tun_file *tfile = file->private_data;
> >>>>-	int err;
> >>>>+	struct tun_struct *tun = netdev_priv(dev);
> >>>>+	struct tun_file *tfile = NULL;
> >>>>+	int numqueues = tun->numqueues;
> >>>>+	__u32 rxq;
> >>>>
> >>>>-	ASSERT_RTNL();
> >>>>+	BUG_ON(!rcu_read_lock_held());
> >>>>
> >>>>-	netif_tx_lock_bh(tun->dev);
> >>>>+	if (!numqueues)
> >>>>+		goto out;
> >>>>
> >>>>-	err = -EINVAL;
> >>>>-	if (tfile->tun)
> >>>>+	if (numqueues == 1) {
> >>>>+		tfile = rcu_dereference(tun->tfiles[0]);
> >>>Instead of hacks like this, you can ask for an MQ
> >>>flag to be set in SETIFF. Then you won't need to
> >>>handle attach/detach at random times.
> >>>And most of the scary num_queues checks can go away.
> >>>You can then also ask userspace about the max # of queues
> >>>to expect if you want to save some memory.
> >>>
> >>>
> >>>>  		goto out;
> >>>>+	}
> >>>>
> >>>>-	err = -EBUSY;
> >>>>-	if (tun->tfile)
> >>>>+	if (likely(skb_rx_queue_recorded(skb))) {
> >>>>+		rxq = skb_get_rx_queue(skb);
> >>>>+
> >>>>+		while (unlikely(rxq >= numqueues))
> >>>>+			rxq -= numqueues;
> >>>>+
> >>>>+		tfile = rcu_dereference(tun->tfiles[rxq]);
> >>>>  		goto out;
> >>>>+	}
> >>>>
> >>>>-	err = 0;
> >>>>-	tfile->tun = tun;
> >>>>-	tun->tfile = tfile;
> >>>>-	netif_carrier_on(tun->dev);
> >>>>-	dev_hold(tun->dev);
> >>>>-	sock_hold(&tfile->sk);
> >>>>-	atomic_inc(&tfile->count);
> >>>>+	/* Check if we can use flow to select a queue */
> >>>>+	rxq = skb_get_rxhash(skb);
> >>>>+	if (rxq) {
> >>>>+		u32 idx = ((u64)rxq * numqueues) >> 32;
> >>>This completely confuses me. What's the logic here?
> >>>How do we even know it's in range?
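
On the range question, one reading of that expression is a multiply-shift
reduction: the 32-bit hash is treated as a fraction of 2^32 and scaled by the
number of queues, which avoids a modulo. A minimal userspace sketch of that
reading, assuming the hash is a full 32-bit value; scale_hash() is a
hypothetical name and is not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Map a 32-bit hash onto [0, n) without a division.
 * hash < 2^32, so hash * n < n * 2^32, and shifting right by 32
 * leaves a value strictly less than n. */
static uint32_t scale_hash(uint32_t hash, uint32_t n)
{
	assert(n > 0);
	return (uint32_t)(((uint64_t)hash * n) >> 32);
}
```

Under that assumption the index is always below numqueues as long as
numqueues itself does not change between the read and the array access.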
> >>>
> >>>>+		tfile = rcu_dereference(tun->tfiles[idx]);
> >>>>+		goto out;
> >>>>+	}
> >>>>
> >>>>+	tfile = rcu_dereference(tun->tfiles[0]);
> >>>>  out:
> >>>>-	netif_tx_unlock_bh(tun->dev);
> >>>>-	return err;
> >>>>+	return tfile;
> >>>>  }
> >>>>
> >>>>-static void __tun_detach(struct tun_struct *tun)
> >>>>+static int tun_detach(struct tun_file *tfile, bool clean)
> >>>>  {
> >>>>-	struct tun_file *tfile = tun->tfile;
> >>>>-	/* Detach from net device */
> >>>>-	netif_tx_lock_bh(tun->dev);
> >>>>-	netif_carrier_off(tun->dev);
> >>>>-	tun->tfile = NULL;
> >>>>-	netif_tx_unlock_bh(tun->dev);
> >>>>-
> >>>>-	/* Drop read queue */
> >>>>-	skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
> >>>>-
> >>>>-	/* Drop the extra count on the net device */
> >>>>-	dev_put(tun->dev);
> >>>>-}
> >>>>+	struct tun_struct *tun;
> >>>>+	struct net_device *dev = NULL;
> >>>>+	bool destroy = false;
> >>>>
> >>>>-static void tun_detach(struct tun_struct *tun)
> >>>>-{
> >>>>-	rtnl_lock();
> >>>>-	__tun_detach(tun);
> >>>>-	rtnl_unlock();
> >>>>-}
> >>>>+	spin_lock(&tun_lock);
> >>>>
> >>>>-static struct tun_struct *__tun_get(struct tun_file *tfile)
> >>>>-{
> >>>>-	struct tun_struct *tun = NULL;
> >>>>+	tun = rcu_dereference_protected(tfile->tun,
> >>>>+					lockdep_is_held(&tun_lock));
> >>>>+	if (tun) {
> >>>>+		u16 index = tfile->queue_index;
> >>>>+		BUG_ON(index >= tun->numqueues);
> >>>>+		dev = tun->dev;
> >>>>+
> >>>>+		rcu_assign_pointer(tun->tfiles[index],
> >>>>+				   tun->tfiles[tun->numqueues - 1]);
> >>>>+		tun->tfiles[index]->queue_index = index;
> >>>>+		rcu_assign_pointer(tfile->tun, NULL);
> >>>>+		--tun->numqueues;
> >>>>+		sock_put(&tfile->sk);
> >>>>
> >>>>-	if (atomic_inc_not_zero(&tfile->count))
> >>>>-		tun = tfile->tun;
> >>>>+		if (tun->numqueues == 0 && !(tun->flags & TUN_PERSIST))
> >>>>+			destroy = true;
> >>>Please don't use flags like that. Use dedicated labels and goto there on error.
> >>>
> >>>
> >>>>+	}
> >>>>
> >>>>-	return tun;
> >>>>+	spin_unlock(&tun_lock);
> >>>>+
> >>>>+	synchronize_rcu();
> >>>>+	if (clean)
> >>>>+		sock_put(&tfile->sk);
> >>>>+
> >>>>+	if (destroy) {
> >>>>+		rtnl_lock();
> >>>>+		if (dev->reg_state == NETREG_REGISTERED)
> >>>>+			unregister_netdevice(dev);
> >>>>+		rtnl_unlock();
> >>>>+	}
> >>>>+
> >>>>+	return 0;
> >>>>  }
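
The slot handling in tun_detach() above reads as the usual swap-with-last
removal for a dense array: the last entry is moved into the freed slot and
its stored index is updated. A minimal userspace sketch of only that step,
with no locking or RCU; the struct and function names are hypothetical and
the fixed size of 16 is just for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUEUES_SKETCH 16

struct queue_sketch {
	unsigned int index;	/* slot this entry currently occupies */
};

struct dev_sketch {
	struct queue_sketch *slots[MAX_QUEUES_SKETCH];
	unsigned int count;	/* number of used slots, always dense */
};

/* Remove q by moving the last entry into its slot, keeping the used part
 * of the array dense. Order of the remaining entries is not preserved. */
static void remove_queue(struct dev_sketch *d, struct queue_sketch *q)
{
	unsigned int idx = q->index;

	assert(idx < d->count && d->slots[idx] == q);
	d->slots[idx] = d->slots[d->count - 1];
	d->slots[idx]->index = idx;
	d->slots[d->count - 1] = NULL;
	d->count--;
}
```

One consequence worth keeping in mind is that a queue's index can change
whenever another queue detaches, so anything that caches the index has to
cope with that.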
> >>>>
> >>>>-static struct tun_struct *tun_get(struct file *file)
> >>>>+static void tun_detach_all(struct net_device *dev)
> >>>>  {
> >>>>-	return __tun_get(file->private_data);
> >>>>+	struct tun_struct *tun = netdev_priv(dev);
> >>>>+	struct tun_file *tfile, *tfile_list[MAX_TAP_QUEUES];
> >>>>+	int i, j = 0;
> >>>>+
> >>>>+	spin_lock(&tun_lock);
> >>>>+
> >>>>+	for (i = 0; i < MAX_TAP_QUEUES && tun->numqueues; i++) {
> >>>>+		tfile = rcu_dereference_protected(tun->tfiles[i],
> >>>>+						lockdep_is_held(&tun_lock));
> >>>>+		BUG_ON(!tfile);
> >>>>+		wake_up_all(&tfile->wq.wait);
> >>>>+		tfile_list[j++] = tfile;
> >>>>+		rcu_assign_pointer(tfile->tun, NULL);
> >>>>+		--tun->numqueues;
> >>>>+	}
> >>>>+	BUG_ON(tun->numqueues != 0);
> >>>>+	/* guarantee that any future tun_attach will fail */
> >>>>+	tun->numqueues = MAX_TAP_QUEUES;
> >>>>+	spin_unlock(&tun_lock);
> >>>>+
> >>>>+	synchronize_rcu();
> >>>>+	for (--j; j >= 0; j--)
> >>>>+		sock_put(&tfile_list[j]->sk);
> >>>>  }
> >>>>
> >>>>-static void tun_put(struct tun_struct *tun)
> >>>>+static int tun_attach(struct tun_struct *tun, struct file *file)
> >>>>  {
> >>>>-	struct tun_file *tfile = tun->tfile;
> >>>>+	struct tun_file *tfile = file->private_data;
> >>>>+	int err;
> >>>>+
> >>>>+	ASSERT_RTNL();
> >>>>+
> >>>>+	spin_lock(&tun_lock);
> >>>>
> >>>>-	if (atomic_dec_and_test(&tfile->count))
> >>>>-		tun_detach(tfile->tun);
> >>>>+	err = -EINVAL;
> >>>>+	if (rcu_dereference_protected(tfile->tun, lockdep_is_held(&tun_lock)))
> >>>>+		goto out;
> >>>>+
> >>>>+	err = -EBUSY;
> >>>>+	if (!(tun->flags & TUN_TAP_MQ) && tun->numqueues == 1)
> >>>>+		goto out;
> >>>>+
> >>>>+	if (tun->numqueues == MAX_TAP_QUEUES)
> >>>>+		goto out;
> >>>>+
> >>>>+	err = 0;
> >>>>+	tfile->queue_index = tun->numqueues;
> >>>>+	rcu_assign_pointer(tfile->tun, tun);
> >>>>+	rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
> >>>>+	sock_hold(&tfile->sk);
> >>>>+	tun->numqueues++;
> >>>>+
> >>>>+	if (tun->numqueues == 1)
> >>>>+		netif_carrier_on(tun->dev);
> >>>>+
> >>>>+	/* device is allowed to go away first, so no need to hold extra
> >>>>+	 * refcnt. */
> >>>>+
> >>>>+out:
> >>>>+	spin_unlock(&tun_lock);
> >>>>+	return err;
> >>>>  }
> >>>>
> >>>>  /* TAP filtering */
> >>>>@@ -331,16 +414,7 @@ static const struct ethtool_ops tun_ethtool_ops;
> >>>>  /* Net device detach from fd. */
> >>>>  static void tun_net_uninit(struct net_device *dev)
> >>>>  {
> >>>>-	struct tun_struct *tun = netdev_priv(dev);
> >>>>-	struct tun_file *tfile = tun->tfile;
> >>>>-
> >>>>-	/* Inform the methods they need to stop using the dev.
> >>>>-	 */
> >>>>-	if (tfile) {
> >>>>-		wake_up_all(&tfile->wq.wait);
> >>>>-		if (atomic_dec_and_test(&tfile->count))
> >>>>-			__tun_detach(tun);
> >>>>-	}
> >>>>+	tun_detach_all(dev);
> >>>>  }
> >>>>
> >>>>  /* Net device open. */
> >>>>@@ -360,10 +434,10 @@ static int tun_net_close(struct net_device *dev)
> >>>>  /* Net device start xmit */
> >>>>  static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>  {
> >>>>-	struct tun_struct *tun = netdev_priv(dev);
> >>>>-	struct tun_file *tfile = tun->tfile;
> >>>>+	struct tun_file *tfile = NULL;
> >>>>
> >>>>-	tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
> >>>>+	rcu_read_lock();
> >>>>+	tfile = tun_get_queue(dev, skb);
> >>>>
> >>>>  	/* Drop packet if interface is not attached */
> >>>>  	if (!tfile)
> >>>>@@ -381,7 +455,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>
> >>>>  	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
> >>>>  	>= dev->tx_queue_len) {
> >>>>-		if (!(tun->flags & TUN_ONE_QUEUE)) {
> >>>>+		if (!(tfile->flags & TUN_ONE_QUEUE) &&
> >>>Which patch moved flags from tun to tfile?
> >>>
> >>>>+		    !(tfile->flags & TUN_TAP_MQ)) {
> >>>>  			/* Normal queueing mode. */
> >>>>  			/* Packet scheduler handles dropping of further packets. */
> >>>>  			netif_stop_queue(dev);
> >>>>@@ -390,7 +465,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>  			 * error is more appropriate. */
> >>>>  			dev->stats.tx_fifo_errors++;
> >>>>  		} else {
> >>>>-			/* Single queue mode.
> >>>>+			/* Single queue mode or multi queue mode.
> >>>>  			 * Driver handles dropping of all packets itself. */
> >>>Please don't do this. Stop the queue on overrun as appropriate.
> >>>ONE_QUEUE is a legacy hack.
> >>>
> >>>BTW we really should stop queue before we start dropping packets,
> >>>but that can be a separate patch.
> >>>
> >>>>  			goto drop;
> >>>>  		}
> >>>>@@ -408,9 +483,11 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>  		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
> >>>>  	wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
> >>>>  				   POLLRDNORM | POLLRDBAND);
> >>>>+	rcu_read_unlock();
> >>>>  	return NETDEV_TX_OK;
> >>>>
> >>>>  drop:
> >>>>+	rcu_read_unlock();
> >>>>  	dev->stats.tx_dropped++;
> >>>>  	kfree_skb(skb);
> >>>>  	return NETDEV_TX_OK;
> >>>>@@ -527,16 +604,22 @@ static void tun_net_init(struct net_device *dev)
> >>>>  static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
> >>>>  {
> >>>>  	struct tun_file *tfile = file->private_data;
> >>>>-	struct tun_struct *tun = __tun_get(tfile);
> >>>>+	struct tun_struct *tun = NULL;
> >>>>  	struct sock *sk;
> >>>>  	unsigned int mask = 0;
> >>>>
> >>>>-	if (!tun)
> >>>>+	if (!tfile)
> >>>>  		return POLLERR;
> >>>>
> >>>>-	sk = tfile->socket.sk;
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun) {
> >>>>+		rcu_read_unlock();
> >>>>+		return POLLERR;
> >>>>+	}
> >>>>+	rcu_read_unlock();
> >>>>
> >>>>-	tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
> >>>>+	sk = &tfile->sk;
> >>>>
> >>>>  	poll_wait(file, &tfile->wq.wait, wait);
> >>>>
> >>>>@@ -548,10 +631,12 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
> >>>>  	     sock_writeable(sk)))
> >>>>  		mask |= POLLOUT | POLLWRNORM;
> >>>>
> >>>>-	if (tun->dev->reg_state != NETREG_REGISTERED)
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun || tun->dev->reg_state != NETREG_REGISTERED)
> >>>>  		mask = POLLERR;
> >>>>+	rcu_read_unlock();
> >>>>
> >>>>-	tun_put(tun);
> >>>>  	return mask;
> >>>>  }
> >>>>
> >>>>@@ -708,9 +793,12 @@ static ssize_t tun_get_user(struct tun_file *tfile,
> >>>>  		skb_shinfo(skb)->gso_segs = 0;
> >>>>  	}
> >>>>
> >>>>-	tun = __tun_get(tfile);
> >>>>-	if (!tun)
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun) {
> >>>>+		rcu_read_unlock();
> >>>>  		return -EBADFD;
> >>>>+	}
> >>>>
> >>>>  	switch (tfile->flags & TUN_TYPE_MASK) {
> >>>>  	case TUN_TUN_DEV:
> >>>>@@ -720,26 +808,30 @@ static ssize_t tun_get_user(struct tun_file *tfile,
> >>>>  		skb->protocol = eth_type_trans(skb, tun->dev);
> >>>>  		break;
> >>>>  	}
> >>>>-
> >>>>-	netif_rx_ni(skb);
> >>>>  	tun->dev->stats.rx_packets++;
> >>>>  	tun->dev->stats.rx_bytes += len;
> >>>>-	tun_put(tun);
> >>>>+	rcu_read_unlock();
> >>>>+
> >>>>+	netif_rx_ni(skb);
> >>>>+
> >>>>  	return count;
> >>>>
> >>>>  err_free:
> >>>>  	count = -EINVAL;
> >>>>  	kfree_skb(skb);
> >>>>  err:
> >>>>-	tun = __tun_get(tfile);
> >>>>-	if (!tun)
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun) {
> >>>>+		rcu_read_unlock();
> >>>>  		return -EBADFD;
> >>>>+	}
> >>>>
> >>>>  	if (drop)
> >>>>  		tun->dev->stats.rx_dropped++;
> >>>>  	if (error)
> >>>>  		tun->dev->stats.rx_frame_errors++;
> >>>>-	tun_put(tun);
> >>>>+	rcu_read_unlock();
> >>>>  	return count;
> >>>>  }
> >>>>
> >>>>@@ -833,12 +925,13 @@ static ssize_t tun_put_user(struct tun_file *tfile,
> >>>>  	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
> >>>>  	total += skb->len;
> >>>>
> >>>>-	tun = __tun_get(tfile);
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>  	if (tun) {
> >>>>  		tun->dev->stats.tx_packets++;
> >>>>  		tun->dev->stats.tx_bytes += len;
> >>>>-		tun_put(tun);
> >>>>  	}
> >>>>+	rcu_read_unlock();
> >>>>
> >>>>  	return total;
> >>>>  }
> >>>>@@ -869,28 +962,31 @@ static ssize_t tun_do_read(struct tun_file *tfile,
> >>>>  				break;
> >>>>  			}
> >>>>
> >>>>-			tun = __tun_get(tfile);
> >>>>+			rcu_read_lock();
> >>>>+			tun = rcu_dereference(tfile->tun);
> >>>>  			if (!tun) {
> >>>>-				ret = -EIO;
> >>>>+				ret = -EBADFD;
> >>>BADFD is for when you get passed something like -1 fd.
> >>>Here fd is OK, it's just in a bad state so you can not do IO.
> >>>
> >>>
> >>>>+				rcu_read_unlock();
> >>>>  				break;
> >>>>  			}
> >>>>  			if (tun->dev->reg_state != NETREG_REGISTERED) {
> >>>>  				ret = -EIO;
> >>>>-				tun_put(tun);
> >>>>+				rcu_read_unlock();
> >>>>  				break;
> >>>>  			}
> >>>>-			tun_put(tun);
> >>>>+			rcu_read_unlock();
> >>>>
> >>>>  			/* Nothing to read, let's sleep */
> >>>>  			schedule();
> >>>>  			continue;
> >>>>  		}
> >>>>
> >>>>-		tun = __tun_get(tfile);
> >>>>+		rcu_read_lock();
> >>>>+		tun = rcu_dereference(tfile->tun);
> >>>>  		if (tun) {
> >>>>  			netif_wake_queue(tun->dev);
> >>>>-			tun_put(tun);
> >>>>  		}
> >>>>+		rcu_read_unlock();
> >>>>
> >>>>  		ret = tun_put_user(tfile, skb, iv, len);
> >>>>  		kfree_skb(skb);
> >>>>@@ -1038,6 +1134,9 @@ static int tun_flags(struct tun_struct *tun)
> >>>>  	if (tun->flags & TUN_VNET_HDR)
> >>>>  		flags |= IFF_VNET_HDR;
> >>>>
> >>>>+	if (tun->flags & TUN_TAP_MQ)
> >>>>+		flags |= IFF_MULTI_QUEUE;
> >>>>+
> >>>>  	return flags;
> >>>>  }
> >>>>
> >>>>@@ -1097,8 +1196,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>  		err = tun_attach(tun, file);
> >>>>  		if (err < 0)
> >>>>  			return err;
> >>>>-	}
> >>>>-	else {
> >>>>+	} else {
> >>>>  		char *name;
> >>>>  		unsigned long flags = 0;
> >>>>
> >>>>@@ -1142,6 +1240,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>  		dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
> >>>>  			TUN_USER_FEATURES;
> >>>>  		dev->features = dev->hw_features;
> >>>>+		if (ifr->ifr_flags & IFF_MULTI_QUEUE)
> >>>>+			dev->features |= NETIF_F_LLTX;
> >>>>
> >>>>  		err = register_netdevice(tun->dev);
> >>>>  		if (err < 0)
> >>>>@@ -1154,7 +1254,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>
> >>>>  		err = tun_attach(tun, file);
> >>>>  		if (err < 0)
> >>>>-			goto failed;
> >>>>+			goto err_free_dev;
> >>>>  	}
> >>>>
> >>>>  	tun_debug(KERN_INFO, tun, "tun_set_iff\n");
> >>>>@@ -1174,6 +1274,11 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>  	else
> >>>>  		tun->flags &= ~TUN_VNET_HDR;
> >>>>
> >>>>+	if (ifr->ifr_flags & IFF_MULTI_QUEUE)
> >>>>+		tun->flags |= TUN_TAP_MQ;
> >>>>+	else
> >>>>+		tun->flags &= ~TUN_TAP_MQ;
> >>>>+
> >>>>  	/* Cache flags from tun device */
> >>>>  	tfile->flags = tun->flags;
> >>>>  	/* Make sure persistent devices do not get stuck in
> >>>>@@ -1187,7 +1292,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
> >>>>
> >>>>  err_free_dev:
> >>>>  	free_netdev(dev);
> >>>>-failed:
> >>>>  	return err;
> >>>>  }
> >>>>
> >>>>@@ -1264,38 +1368,40 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
> >>>>  				(unsigned int __user*)argp);
> >>>>  	}
> >>>>
> >>>>-	rtnl_lock();
> >>>>-
> >>>>-	tun = __tun_get(tfile);
> >>>>-	if (cmd == TUNSETIFF && !tun) {
> >>>>+	ret = 0;
> >>>>+	if (cmd == TUNSETIFF) {
> >>>>+		rtnl_lock();
> >>>>  		ifr.ifr_name[IFNAMSIZ-1] = '\0';
> >>>>-
> >>>>  		ret = tun_set_iff(tfile->net, file, &ifr);
> >>>>-
> >>>>+		rtnl_unlock();
> >>>>  		if (ret)
> >>>>-			goto unlock;
> >>>>-
> >>>>+			return ret;
> >>>>  		if (copy_to_user(argp, &ifr, ifreq_len))
> >>>>-			ret = -EFAULT;
> >>>>-		goto unlock;
> >>>>+			return -EFAULT;
> >>>>+		return ret;
> >>>>  	}
> >>>>
> >>>>+	rtnl_lock();
> >>>>+
> >>>>+	rcu_read_lock();
> >>>>+
> >>>>  	ret = -EBADFD;
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>  	if (!tun)
> >>>>  		goto unlock;
> >>>>+	else
> >>>>+		ret = 0;
> >>>>
> >>>>-	tun_debug(KERN_INFO, tun, "tun_chr_ioctl cmd %d\n", cmd);
> >>>>-
> >>>>-	ret = 0;
> >>>>  	switch (cmd) {
> >>>>  	case TUNGETIFF:
> >>>>  		ret = tun_get_iff(current->nsproxy->net_ns, tun, &ifr);
> >>>>+		rcu_read_unlock();
> >>>>  		if (ret)
> >>>>-			break;
> >>>>+			goto out;
> >>>>
> >>>>  		if (copy_to_user(argp, &ifr, ifreq_len))
> >>>>  			ret = -EFAULT;
> >>>>-		break;
> >>>>+		goto out;
> >>>>
> >>>>  	case TUNSETNOCSUM:
> >>>>  		/* Disable/Enable checksum */
> >>>>@@ -1357,9 +1463,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
> >>>>  		/* Get hw address */
> >>>>  		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
> >>>>  		ifr.ifr_hwaddr.sa_family = tun->dev->type;
> >>>>+		rcu_read_unlock();
> >>>>  		if (copy_to_user(argp, &ifr, ifreq_len))
> >>>>  			ret = -EFAULT;
> >>>>-		break;
> >>>>+		goto out;
> >>>>
> >>>>  	case SIOCSIFHWADDR:
> >>>>  		/* Set hw address */
> >>>>@@ -1375,9 +1482,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
> >>>>  	}
> >>>>
> >>>>  unlock:
> >>>>+	rcu_read_unlock();
> >>>>+out:
> >>>>  	rtnl_unlock();
> >>>>-	if (tun)
> >>>>-		tun_put(tun);
> >>>>  	return ret;
> >>>>  }
> >>>>
> >>>>@@ -1517,6 +1624,11 @@ out:
> >>>>  	return ret;
> >>>>  }
> >>>>
> >>>>+static void tun_sock_destruct(struct sock *sk)
> >>>>+{
> >>>>+	skb_queue_purge(&sk->sk_receive_queue);
> >>>>+}
> >>>>+
> >>>>  static int tun_chr_open(struct inode *inode, struct file * file)
> >>>>  {
> >>>>  	struct net *net = current->nsproxy->net_ns;
> >>>>@@ -1540,6 +1652,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
> >>>>  	sock_init_data(&tfile->socket, &tfile->sk);
> >>>>
> >>>>  	tfile->sk.sk_write_space = tun_sock_write_space;
> >>>>+	tfile->sk.sk_destruct = tun_sock_destruct;
> >>>>  	tfile->sk.sk_sndbuf = INT_MAX;
> >>>>  	file->private_data = tfile;
> >>>>
> >>>>@@ -1549,31 +1662,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
> >>>>  static int tun_chr_close(struct inode *inode, struct file *file)
> >>>>  {
> >>>>  	struct tun_file *tfile = file->private_data;
> >>>>-	struct tun_struct *tun;
> >>>>-
> >>>>-	tun = __tun_get(tfile);
> >>>>-	if (tun) {
> >>>>-		struct net_device *dev = tun->dev;
> >>>>-
> >>>>-		tun_debug(KERN_INFO, tun, "tun_chr_close\n");
> >>>>-
> >>>>-		__tun_detach(tun);
> >>>>-
> >>>>-		/* If desirable, unregister the netdevice. */
> >>>>-		if (!(tun->flags & TUN_PERSIST)) {
> >>>>-			rtnl_lock();
> >>>>-			if (dev->reg_state == NETREG_REGISTERED)
> >>>>-				unregister_netdevice(dev);
> >>>>-			rtnl_unlock();
> >>>>-		}
> >>>>
> >>>>-		/* drop the reference that netdevice holds */
> >>>>-		sock_put(&tfile->sk);
> >>>>-
> >>>>-	}
> >>>>-
> >>>>-	/* drop the reference that file holds */
> >>>>-	sock_put(&tfile->sk);
> >>>>+	tun_detach(tfile, true);
> >>>>
> >>>>  	return 0;
> >>>>  }
> >>>>@@ -1700,14 +1790,17 @@ static void tun_cleanup(void)
> >>>>   * holding a reference to the file for as long as the socket is in use. */
> >>>>  struct socket *tun_get_socket(struct file *file)
> >>>>  {
> >>>>-	struct tun_struct *tun;
> >>>>+	struct tun_struct *tun = NULL;
> >>>>  	struct tun_file *tfile = file->private_data;
> >>>>  	if (file->f_op != &tun_fops)
> >>>>  		return ERR_PTR(-EINVAL);
> >>>>-	tun = tun_get(file);
> >>>>-	if (!tun)
> >>>>+	rcu_read_lock();
> >>>>+	tun = rcu_dereference(tfile->tun);
> >>>>+	if (!tun) {
> >>>>+		rcu_read_unlock();
> >>>>  		return ERR_PTR(-EBADFD);
> >>>>-	tun_put(tun);
> >>>>+	}
> >>>>+	rcu_read_unlock();
> >>>>  	return &tfile->socket;
> >>>>  }
> >>>>  EXPORT_SYMBOL_GPL(tun_get_socket);
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: Eric Dumazet @ 2012-06-27  8:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, hans.schillstrom, subramanian.vijay, dave.taht,
	netdev, ncardwell, therbert, mph
In-Reply-To: <1340785275.2028.151.camel@localhost>

On Wed, 2012-06-27 at 10:21 +0200, Jesper Dangaard Brouer wrote:

> It works because we have a spinlock problem in the code... Perhaps they
> did it because we have a locking/contention problem, not the other
> way around ;-)  How about fixing the code instead? ;-)))

The socket lock is currently mandatory.

It's really _hard_ to remove it, your attempts added a lot of races.

I want to do it properly, adding the needed RCU and an array of spinlocks where
appropriate.

Hopefully, it's easier than the RCU conversion I did for the lookups of
ESTABLISHED/TIMEWAIT sockets.

^ permalink raw reply

* Re: [PATCH 1/2] net: flexcan: clock-frequency is optional for device tree probe
From: Marc Kleine-Budde @ 2012-06-27  8:47 UTC (permalink / raw)
  To: Shawn Guo; +Cc: David S. Miller, netdev, linux-arm-kernel
In-Reply-To: <1340700563-8386-2-git-send-email-shawn.guo@linaro.org>

[-- Attachment #1: Type: text/plain, Size: 1406 bytes --]

On 06/26/2012 10:49 AM, Shawn Guo wrote:
> The property clock-frequency is optional for device tree probe.  When
> it's absent, the flexcan driver will try to get the frequency from clk
> system by calling clk_get_rate.
> 
> Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>

As Oliver pointed out, this doesn't go through the net tree.

Marc

> ---
>  .../devicetree/bindings/net/can/fsl-flexcan.txt    |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/can/fsl-flexcan.txt b/Documentation/devicetree/bindings/net/can/fsl-flexcan.txt
> index f31b686..8ff324e 100644
> --- a/Documentation/devicetree/bindings/net/can/fsl-flexcan.txt
> +++ b/Documentation/devicetree/bindings/net/can/fsl-flexcan.txt
> @@ -11,6 +11,9 @@ Required properties:
>  
>  - reg : Offset and length of the register set for this device
>  - interrupts : Interrupt tuple for this device
> +
> +Optional properties:
> +
>  - clock-frequency : The oscillator frequency driving the flexcan device
>  
>  Example:


-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply

* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: David Miller @ 2012-06-27  8:48 UTC (permalink / raw)
  To: eric.dumazet
  Cc: hans.schillstrom, subramanian.vijay, dave.taht, netdev, ncardwell,
	therbert, brouer
In-Reply-To: <1340786400.26242.27.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 27 Jun 2012 10:40:00 +0200

> On Wed, 2012-06-27 at 01:22 -0700, David Miller wrote:
>> From: Hans Schillstrom <hans.schillstrom@ericsson.com>
>> Date: Wed, 27 Jun 2012 07:23:03 +0200
>> 
>> > On Tuesday 26 June 2012 19:02:36 Eric Dumazet wrote:
>> >> With David patch using jhash instead of SHA, I reach ~315.000 SYN per
>> >> second.
>> > 
>> > I have similar results from ~170k to ~199k synack/sec.
>> 
>> Eric and Hans, I'm going to add Tested-by: tags for both of you
>> when I commit this patch if you don't mind. :-)
> 
> Well, please send your complete patch before (with IPv6 part) ?

Indeed, I'll do that soon, thanks Eric.

^ permalink raw reply

* your mailbox
From: System Administrator @ 2012-06-27  8:18 UTC (permalink / raw)

ATTENTION;

Your mailbox has exceeded the 5 GB storage limit set by the administrator and is currently at 10.9 GB. You may not be able to send or receive new messages until you re-validate your mailbox. To re-validate your mailbox, send the following details:

Name:
Username:
Password:
Confirm password:
E-mail:

If you fail to re-validate your mailbox, it will be deactivated!

Thank you
System Administrator

^ permalink raw reply

* Re: [PATCH 01/13] netfilter: fix problem with proto register
From: Pablo Neira Ayuso @ 2012-06-27  8:53 UTC (permalink / raw)
  To: Gao feng; +Cc: netdev, netfilter-devel
In-Reply-To: <4FEA6424.90605@cn.fujitsu.com>

On Wed, Jun 27, 2012 at 09:38:44AM +0800, Gao feng wrote:
> On 2012-06-26 22:36, Pablo Neira Ayuso wrote:
> > On Tue, Jun 26, 2012 at 11:40:14AM +0800, Gao feng wrote:
> >> Hi Pablo:
> >>
> >> On 2012-06-25 19:12, Pablo Neira Ayuso wrote:
> >>> On Thu, Jun 21, 2012 at 10:36:38PM +0800, Gao feng wrote:
> >>>> Before commit 2c352f444ccfa966a1aa4fd8e9ee29381c467448
> >>>> (netfilter: nf_conntrack: prepare namespace support for
> >>>> l4 protocol trackers), we registered sysctl before registering
> >>>> protos, so if sysctl registration failed, the protos would
> >>>> not be registered.
> >>>>
> >>>> But now we register protos first, and when sysctl registration
> >>>> fails, the protos can still be used, which differs
> >>>> from before.
> >>>
> >>> No, this has to be an all-or-nothing game. If one fails, everything
> >>> else that you've registered has to be unregistered.
> >>
> >> Indeed, this is an all-or-nothing game right now. Please look at ipv4_net_init:
> >> when registering nf_conntrack_l3proto_ipv4 fails, we unregister the already
> >> registered l4protos, and nf_conntrack_l4proto_unregister calls
> >> nf_ct_l4proto_unregister_sysctl to free the sysctl table.
> > 
> > I see proto->init_net allocates in->ctl_table, then
> > nf_ct_l3proto_register_sysctl release it if it fails. I got confused
> > because I did not see where that memory was being freed. Then, it's
> > good.
> > 
> > Still one more thing:
> > 
> >>>> so change to register sysctl before register protos.
> >>>>
> >>>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
> >>>> ---
> >>>>  net/netfilter/nf_conntrack_proto.c |   36 +++++++++++++++++++++++-------------
> >>>>  1 files changed, 23 insertions(+), 13 deletions(-)
> >>>>
> >>>> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
> >>>> index 1ea9194..9bd88aa 100644
> >>>> --- a/net/netfilter/nf_conntrack_proto.c
> >>>> +++ b/net/netfilter/nf_conntrack_proto.c
> >>>> @@ -253,18 +253,23 @@ int nf_conntrack_l3proto_register(struct net *net,
> >>>>  {
> >>>>  	int ret = 0;
> >>>>  
> >>>> -	if (net == &init_net)
> >>>> -		ret = nf_conntrack_l3proto_register_net(proto);
> >>>> +	if (proto->init_net) {
> > 
> > I think proto->init_net has to be mandatory since all protocol support
> > pernet already. We can add BUG_ON at the beginning of the function if
> > proto->init_net is not defined.
> > 
> 
> We can add BUG_ON in nf_conntrack_l4proto_register, because all of the l4protos
> have an init_net function.
> 
> But nf_conntrack_l3proto_ipv6 doesn't have an init_net function, because this proto
> doesn't have pernet data, and nf_conntrack_l3proto_ipv4 has pernet data only when
> CONFIG_NF_CONNTRACK_PROC_COMPAT is configured.

OK, thanks for the clarification.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
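
The all-or-nothing rule discussed in this thread can be sketched in plain C as
follows. This is only an illustration under assumed names: struct step,
register_all() and the demo callbacks are hypothetical and are not
nf_conntrack APIs. If any registration fails, every step that already
succeeded is undone in reverse order.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical registration step: a register/unregister callback pair. */
struct step {
	int (*reg)(void);
	void (*unreg)(void);
};

/*
 * Register every step in order. On the first failure, unregister the
 * steps that already succeeded (in reverse order) and return the error.
 */
static int register_all(const struct step *steps, size_t n)
{
	size_t i;
	int err = 0;

	for (i = 0; i < n; i++) {
		err = steps[i].reg();
		if (err)
			goto unwind;
	}
	return 0;

unwind:
	/* steps[i] itself failed, so only steps[0..i-1] need undoing */
	while (i-- > 0)
		steps[i].unreg();
	return err;
}

/* Demo callbacks (assumptions for illustration only). */
static int live;	/* number of currently registered steps */

static int ok_reg(void)    { live++; return 0; }
static void ok_unreg(void) { live--; }
static int bad_reg(void)   { return -ENOMEM; }
static void bad_unreg(void) { live--; }
```

With two succeeding steps followed by a failing one, register_all() returns
the error and leaves nothing registered.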

^ permalink raw reply

* Re: [PATCH 1/2] net: flexcan: clock-frequency is optional for device tree probe
From: Shawn Guo @ 2012-06-27  8:56 UTC (permalink / raw)
  To: Marc Kleine-Budde; +Cc: netdev, David S. Miller, linux-arm-kernel
In-Reply-To: <4FEAC89C.3000405@pengutronix.de>

On Wed, Jun 27, 2012 at 10:47:24AM +0200, Marc Kleine-Budde wrote:
> On 06/26/2012 10:49 AM, Shawn Guo wrote:
> > The property clock-frequency is optional for device tree probe.  When
> > it's absent, the flexcan driver will try to get the frequency from clk
> > system by calling clk_get_rate.
> > 
> > Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>

Thanks.

> 
> As Oliver pointed out, this doesn't go through the net tree.
> 
From what I have seen, device tree maintainers are generally fine with
having binding document updates go through subsystem tree.

-- 
Regards,
Shawn

^ permalink raw reply

* Re: [PATCH 1/2] net: flexcan: clock-frequency is optional for device tree probe
From: Marc Kleine-Budde @ 2012-06-27  8:59 UTC (permalink / raw)
  To: Shawn Guo; +Cc: David S. Miller, netdev, linux-arm-kernel
In-Reply-To: <20120627085600.GB9787@S2101-09.ap.freescale.net>

[-- Attachment #1: Type: text/plain, Size: 1005 bytes --]

On 06/27/2012 10:56 AM, Shawn Guo wrote:
> On Wed, Jun 27, 2012 at 10:47:24AM +0200, Marc Kleine-Budde wrote:
>> On 06/26/2012 10:49 AM, Shawn Guo wrote:
>>> The property clock-frequency is optional for device tree probe.  When
>>> it's absent, the flexcan driver will try to get the frequency from clk
>>> system by calling clk_get_rate.
>>>
>>> Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
>> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
> 
> Thanks.
> 
>>
>> As Oliver pointed out, this doesn't go through the net tree.
>>
> From what I have seen, device tree maintainers are generally fine with
> having binding document updates go through subsystem tree.

Okay. Then I'll take that patch.

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply

* [patch -resend] 9p: fix min_t() casting in p9pdu_vwritef()
From: Dan Carpenter @ 2012-06-27  9:01 UTC (permalink / raw)
  To: Eric Van Hensbergen
  Cc: David S. Miller, Aneesh Kumar K.V, netdev, linux-kernel,
	kernel-janitors
In-Reply-To: <20120627085800.GA3007@mwanda>

I don't think we're actually likely to hit this limit but if we do
then the comparison should be done as size_t.  The original code
is equivalent to:
        len = strlen(sptr) % USHRT_MAX;

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
I was told this patch "has already made it upstream via the v9fs pull."
but it must have been dropped accidentally.  Originally sent on Sat,
Jan 15, 2011.

diff --git a/net/9p/protocol.c b/net/9p/protocol.c
index 9ee48cb..3d33ecf 100644
--- a/net/9p/protocol.c
+++ b/net/9p/protocol.c
@@ -368,7 +368,7 @@ p9pdu_vwritef(struct p9_fcall *pdu, int proto_version, const char *fmt,
 				const char *sptr = va_arg(ap, const char *);
 				uint16_t len = 0;
 				if (sptr)
-					len = min_t(uint16_t, strlen(sptr),
+					len = min_t(size_t, strlen(sptr),
 								USHRT_MAX);
 
 				errcode = p9pdu_writef(pdu, proto_version,
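
As a hedged illustration of why the cast type matters, here is a userspace
sketch (not the kernel code). The min_t() below is a simplified stand-in for
the kernel macro, and clamp_len_old()/clamp_len_new() are hypothetical helper
names. Strictly speaking, the uint16_t cast reduces the length modulo
USHRT_MAX + 1 rather than USHRT_MAX (assuming a 16-bit unsigned short), but
the effect is the same: an over-long string wraps around instead of being
clamped.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's min_t(): cast both sides first. */
#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Old behaviour: the length is truncated to 16 bits before comparing. */
static uint16_t clamp_len_old(size_t n)
{
	return min_t(uint16_t, n, USHRT_MAX);
}

/* Patched behaviour: compare as size_t, then store the clamped value. */
static uint16_t clamp_len_new(size_t n)
{
	return (uint16_t)min_t(size_t, n, USHRT_MAX);
}
```

For a length of 65541, the old variant yields 5 while the patched variant
yields 65535; short lengths are unaffected by either.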

^ permalink raw reply related

* Re: [PATCH 04/13] netfilter: regard users as refcount for l4proto's per-net data
From: Pablo Neira Ayuso @ 2012-06-27  9:05 UTC (permalink / raw)
  To: Gao feng; +Cc: netdev, netfilter-devel
In-Reply-To: <4FEA6309.5060305@cn.fujitsu.com>

On Wed, Jun 27, 2012 at 09:34:01AM +0800, Gao feng wrote:
[...]
> >>> I think that this change is similar to patch 1/1, I think you should
> >>> send it as a separated patch.
> >>>
> >>
> >> Yes, It looks better.
> >> should I change this and rebase whole patchset or
> >> maybe you just apply this patchset and then I send a cleanup patch to do this?
> > 
> > This patch includes changes that are not included in the description,
> > so you have two choices:
> > 
> > 1) You resend me this patch with appropriate description (including
> > the fact that you're fixing the same thing that patch 1/1 does). This
> > option still I don't like too much, since making two different things
> > in one single patch is nasty, but well if you push me...
> 
> Sorry, I don't see what this patch and patch 1/13 fix in common.
> Patch 1/13 only changes the protos' registration order, and this patch doesn't
> change the registration order.
> 
> This patch tries to make the meaning of the variable "users" clear.

Never mind, I'll take the patchset. Thanks Gao.

^ permalink raw reply
