Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 5/6] tuntap: per queue 64 bit stats
From: Eric Dumazet @ 2012-06-26  6:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, akong, habanero, tahm, haixiao, jwhan, ernesto.martin,
	mashirle, davem, netdev, linux-kernel, krkumar2, shemminger,
	edumazet
In-Reply-To: <4FE95015.7000707@redhat.com>

On Tue, 2012-06-26 at 14:00 +0800, Jason Wang wrote:

> Yes, looks like it's hard to use NETIF_F_LLTX without breaking the u64 
> statistics, may worth to use tx lock and alloc_netdev_mq().

Yes, this probably needs percpu storage (if you really want to use 
include/linux/u64_stats_sync.h).

But percpu storage seems a bit overkill with a raising number of cpus
on typical machines.

For loopback device, its fine because we only have one lo device per
network namespace, and some workloads really hit hard this device.

But for tuntap, I am not sure ?

^ permalink raw reply

* Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: Jason Wang @ 2012-06-26  6:03 UTC (permalink / raw)
  To: Shirley Ma
  Cc: krkumar2, habanero, kvm, mst, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <1340647278.23823.0.camel@oc3660625478.ibm.com>

On 06/26/2012 02:01 AM, Shirley Ma wrote:
> Hello Jason,
>
> Good work. Do you have local guest to guest results?
>
> Thanks
> Shirley
Hi Shirley:

I would run tests to measure the performance and post here.

Thanks

^ permalink raw reply

* Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net
From: Jason Wang @ 2012-06-26  6:02 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: krkumar2, habanero, kvm, mst, netdev, mashirle, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <4FE8A4B6.1080200@us.ibm.com>

On 06/26/2012 01:49 AM, Sridhar Samudrala wrote:
> On 6/25/2012 2:16 AM, Jason Wang wrote:
>> Hello All:
>>
>> This series is an update version of multiqueue virtio-net driver 
>> based on
>> Krishna Kumar's work to let virtio-net use multiple rx/tx queues to 
>> do the
>> packets reception and transmission. Please review and comments.
>>
>> Test Environment:
>> - Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores 2 numa nodes
>> - Two directed connected 82599
>>
>> Test Summary:
>>
>> - Highlights: huge improvements on TCP_RR test
>> - Lowlights: regression on small packet transmission, higher cpu 
>> utilization
>>               than single queue, need further optimization
> Does this also scale with increased number of VMs?
>

Hi Sridhar:

Good suggestions, I didn't measure them. I would run test and post them.

Thanks

> Thanks
> Sridhar
>>
>> Analysis of the performance result:
>>
>> - I count the number of packets sending/receiving during the test, and
>>    multiqueue show much more ability in terms of packets per second.
>>
>> - For the tx regression, multiqueue send about 1-2 times of more packets
>>    compared to single queue, and the packets size were much smaller 
>> than single
>>    queue does. I suspect tcp does less batching in multiqueue, so I 
>> hack the
>>    tcp_write_xmit() to forece more batching, multiqueue works as well as
>>    singlequeue for both small transmission and throughput
>>
>> - I didn't pack the accelerate RFS with virtio-net in this sereis as 
>> it still
>>    need further shaping, for the one that interested in this please see:
>>    http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html
>>
>>
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 5/6] tuntap: per queue 64 bit stats
From: Jason Wang @ 2012-06-26  6:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: mst, akong, habanero, tahm, haixiao, jwhan, ernesto.martin,
	mashirle, davem, netdev, linux-kernel, krkumar2, shemminger,
	edumazet
In-Reply-To: <1340628765.10893.46.camel@edumazet-glaptop>

On 06/25/2012 08:52 PM, Eric Dumazet wrote:
> On Mon, 2012-06-25 at 19:59 +0800, Jason Wang wrote:
>> As we've added multiqueue support for tun/tap, this patch convert the statistics
>> to use per-queue 64 bit statistics.
> LLTX means you can have several cpus calling TX path in parallel.
>
> So tx stats are wrong (even before this patch), and racy after this
> patch (if several cpu access same queue, it seems to be possible)
>
>         u64_stats_update_begin(&tfile->stats.tx_syncp);
>         tfile->stats.tx_packets++;
>         tfile->stats.tx_bytes += total;
>         u64_stats_update_end(&tfile->stats.tx_syncp);
>
> This can break horribly if several cpus run this code using same 'tfile'
> pointer.

Yes, looks like it's hard to use NETIF_F_LLTX without breaking the u64 
statistics, may worth to use tx lock and alloc_netdev_mq().

> I suggest this patch comes before 'tuntap: multiqueue support' in the
> serie.

Sure, thanks.
>
>
>

^ permalink raw reply

* Re: [net-next RFC V4 PATCH 3/4] virtio: introduce a method to get the irq of a specific virtqueue
From: Jason Wang @ 2012-06-26  5:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, habanero, kvm, qemu-devel, netdev, mashirle,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem
In-Reply-To: <20120625101439.GC19169@redhat.com>

On 06/25/2012 06:14 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 25, 2012 at 05:41:17PM +0800, Jason Wang wrote:
>> Device specific irq optimizations such as irq affinity may be used by virtio
>> drivers. So this patch introduce a new method to get the irq of a specific
>> virtqueue.
>>
>> After this patch, virtio device drivers could query the irq and do device
>> specific optimizations. First user would be virtio-net.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/lguest/lguest_device.c |    8 ++++++++
>>   drivers/s390/kvm/kvm_virtio.c  |    6 ++++++
>>   drivers/virtio/virtio_mmio.c   |    8 ++++++++
>>   drivers/virtio/virtio_pci.c    |   12 ++++++++++++
>>   include/linux/virtio_config.h  |    4 ++++
>>   5 files changed, 38 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
>> index 9e8388e..bcd080f 100644
>> --- a/drivers/lguest/lguest_device.c
>> +++ b/drivers/lguest/lguest_device.c
>> @@ -392,6 +392,13 @@ static const char *lg_bus_name(struct virtio_device *vdev)
>>   	return "";
>>   }
>>
>> +static int lg_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
>> +{
>> +	struct lguest_vq_info *lvq = vq->priv;
>> +
>> +	return lvq->config.irq;
>> +}
>> +
>>   /* The ops structure which hooks everything together. */
>>   static struct virtio_config_ops lguest_config_ops = {
>>   	.get_features = lg_get_features,
>> @@ -404,6 +411,7 @@ static struct virtio_config_ops lguest_config_ops = {
>>   	.find_vqs = lg_find_vqs,
>>   	.del_vqs = lg_del_vqs,
>>   	.bus_name = lg_bus_name,
>> +	.get_vq_irq = lg_get_vq_irq,
>>   };
>>
>>   /*
>> diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
>> index d74e9ae..a897de2 100644
>> --- a/drivers/s390/kvm/kvm_virtio.c
>> +++ b/drivers/s390/kvm/kvm_virtio.c
>> @@ -268,6 +268,11 @@ static const char *kvm_bus_name(struct virtio_device *vdev)
>>   	return "";
>>   }
>>
>> +static int kvm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
>> +{
>> +	return 0x2603;
>> +}
>> +
>>   /*
>>    * The config ops structure as defined by virtio config
>>    */
>> @@ -282,6 +287,7 @@ static struct virtio_config_ops kvm_vq_configspace_ops = {
>>   	.find_vqs = kvm_find_vqs,
>>   	.del_vqs = kvm_del_vqs,
>>   	.bus_name = kvm_bus_name,
>> +	.get_vq_irq = kvm_get_vq_irq,
>>   };
>>
>>   /*
>> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
>> index f5432b6..2ba37ed 100644
>> --- a/drivers/virtio/virtio_mmio.c
>> +++ b/drivers/virtio/virtio_mmio.c
>> @@ -411,6 +411,13 @@ static const char *vm_bus_name(struct virtio_device *vdev)
>>   	return vm_dev->pdev->name;
>>   }
>>
>> +static int vm_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
>> +{
>> +	struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
>> +
>> +	return platform_get_irq(vm_dev->pdev, 0);
>> +}
>> +
>>   static struct virtio_config_ops virtio_mmio_config_ops = {
>>   	.get		= vm_get,
>>   	.set		= vm_set,
>> @@ -422,6 +429,7 @@ static struct virtio_config_ops virtio_mmio_config_ops = {
>>   	.get_features	= vm_get_features,
>>   	.finalize_features = vm_finalize_features,
>>   	.bus_name	= vm_bus_name,
>> +	.get_vq_irq     = vm_get_vq_irq,
>>   };
>>
>>
>> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
>> index adb24f2..c062227 100644
>> --- a/drivers/virtio/virtio_pci.c
>> +++ b/drivers/virtio/virtio_pci.c
>> @@ -607,6 +607,17 @@ static const char *vp_bus_name(struct virtio_device *vdev)
>>   	return pci_name(vp_dev->pci_dev);
>>   }
>>
>> +static int vp_get_vq_irq(struct virtio_device *vdev, struct virtqueue *vq)
>> +{
>> +	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>> +	struct virtio_pci_vq_info *info = vq->priv;
>> +
>> +	if (vp_dev->intx_enabled)
>> +		return vp_dev->pci_dev->irq;
>> +	else
>> +		return vp_dev->msix_entries[info->msix_vector].vector;
>> +}
>> +
>>   static struct virtio_config_ops virtio_pci_config_ops = {
>>   	.get		= vp_get,
>>   	.set		= vp_set,
>> @@ -618,6 +629,7 @@ static struct virtio_config_ops virtio_pci_config_ops = {
>>   	.get_features	= vp_get_features,
>>   	.finalize_features = vp_finalize_features,
>>   	.bus_name	= vp_bus_name,
>> +	.get_vq_irq     = vp_get_vq_irq,
>>   };
>>
>>   static void virtio_pci_release_dev(struct device *_d)
>> diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
>> index fc457f4..acd6930 100644
>> --- a/include/linux/virtio_config.h
>> +++ b/include/linux/virtio_config.h
>> @@ -98,6 +98,9 @@
>>    *	vdev: the virtio_device
>>    *      This returns a pointer to the bus name a la pci_name from which
>>    *      the caller can then copy.
>> + * @get_vq_irq: get the irq numer of the specific virt queue.
>> + *      vdev: the virtio_device
>> + *      vq: the virtqueue
> What if the vq does not have an IRQ? E.g. control vqs don't.
> What if the IRQ is shared between VQs? Between devices?
> The need to cleanup affinity on destroy is also nasty.
> How about we expose a set_affinity API instead?

Or exposed the irq information such as sharing, level/edge. But I think 
exposing the set_affinity API is enough for multiqueue virtio-net.
> Then:
> 	- non PCI can ignore for now
> 	- with a per vq vector we can force it
> 	- with a shared MSI we make it an OR over all affinities
> 	- with a level interrupt we can ignore it
> 	- on cleanup we can do it in core

Looks good, thanks.
>
>>    */
>>   typedef void vq_callback_t(struct virtqueue *);
>>   struct virtio_config_ops {
>> @@ -116,6 +119,7 @@ struct virtio_config_ops {
>>   	u32 (*get_features)(struct virtio_device *vdev);
>>   	void (*finalize_features)(struct virtio_device *vdev);
>>   	const char *(*bus_name)(struct virtio_device *vdev);
>> +	int (*get_vq_irq)(struct virtio_device *vdev, struct virtqueue *vq);
>>   };
>>
>>   /* If driver didn't advertise the feature, it will never appear. */
>> -- 
>> 1.7.1

^ permalink raw reply

* Re: [net-next RFC V3 PATCH 1/6] tuntap: move socket to tun_file
From: Jason Wang @ 2012-06-26  5:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, krkumar2, tahm, akong, davem, shemminger,
	mashirle
In-Reply-To: <20120625082742.GA18794@redhat.com>

On 06/25/2012 04:27 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 25, 2012 at 02:09:45PM +0800, Jason Wang wrote:
>> This patch moves socket structure from tun_device and to tun_file in order to
>> let it possbile for multiple sockets to be attached to tun/tap device. The
>> reference between tap device and socket was setup during TUNSETIFF as
>> usual.
>>
>> After this patch, we can go further towards multiqueue tun/tap support by
>> storing an array of pointers of tun_file in tun_device.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> I think this changes visible userspace
> behaviour for persistent devices.
>
> Specifically, with this patch, TUNSETSNDBUF and TUNATTACHFILTER won't
> be effective if you close and reopen the device, right?

Yes, good catch.
> It's possible that no application uses either of these
> ioctls on persistent tun devices at the moment,
> but seems safer to avoid changing such behaviour.

Agree, I would modify the socket filer and sndbuf to be per-device 
instead of per-socket.
>
>> ---
>>   drivers/net/tun.c |  352 +++++++++++++++++++++++++++--------------------------
>>   1 files changed, 181 insertions(+), 171 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 987aeef..1f27789 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -108,9 +108,16 @@ struct tap_filter {
>>   };
>>
>>   struct tun_file {
>> +	struct sock sk;
>> +	struct socket socket;
>> +	struct socket_wq wq;
>> +	int vnet_hdr_sz;
>> +	struct tap_filter txflt;
>>   	atomic_t count;
>>   	struct tun_struct *tun;
>>   	struct net *net;
>> +	struct fasync_struct *fasync;
>> +	unsigned int flags;
>>   };
>>
>>   struct tun_sock;
>> @@ -125,29 +132,12 @@ struct tun_struct {
>>   	netdev_features_t	set_features;
>>   #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
>>   			  NETIF_F_TSO6|NETIF_F_UFO)
>> -	struct fasync_struct	*fasync;
>> -
>> -	struct tap_filter       txflt;
>> -	struct socket		socket;
>> -	struct socket_wq	wq;
>> -
>> -	int			vnet_hdr_sz;
>>
>>   #ifdef TUN_DEBUG
>>   	int debug;
>>   #endif
>>   };
>>
>> -struct tun_sock {
>> -	struct sock		sk;
>> -	struct tun_struct	*tun;
>> -};
>> -
>> -static inline struct tun_sock *tun_sk(struct sock *sk)
>> -{
>> -	return container_of(sk, struct tun_sock, sk);
>> -}
>> -
>>   static int tun_attach(struct tun_struct *tun, struct file *file)
>>   {
>>   	struct tun_file *tfile = file->private_data;
>> @@ -168,10 +158,9 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
>>   	err = 0;
>>   	tfile->tun = tun;
>>   	tun->tfile = tfile;
>> -	tun->socket.file = file;
>>   	netif_carrier_on(tun->dev);
>>   	dev_hold(tun->dev);
>> -	sock_hold(tun->socket.sk);
>> +	sock_hold(&tfile->sk);
>>   	atomic_inc(&tfile->count);
>>
>>   out:
>> @@ -181,15 +170,15 @@ out:
>>
>>   static void __tun_detach(struct tun_struct *tun)
>>   {
>> +	struct tun_file *tfile = tun->tfile;
>>   	/* Detach from net device */
>>   	netif_tx_lock_bh(tun->dev);
>>   	netif_carrier_off(tun->dev);
>>   	tun->tfile = NULL;
>> -	tun->socket.file = NULL;
>>   	netif_tx_unlock_bh(tun->dev);
>>
>>   	/* Drop read queue */
>> -	skb_queue_purge(&tun->socket.sk->sk_receive_queue);
>> +	skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
>>
>>   	/* Drop the extra count on the net device */
>>   	dev_put(tun->dev);
>> @@ -348,19 +337,12 @@ static void tun_net_uninit(struct net_device *dev)
>>   	/* Inform the methods they need to stop using the dev.
>>   	 */
>>   	if (tfile) {
>> -		wake_up_all(&tun->wq.wait);
>> +		wake_up_all(&tfile->wq.wait);
>>   		if (atomic_dec_and_test(&tfile->count))
>>   			__tun_detach(tun);
>>   	}
>>   }
>>
>> -static void tun_free_netdev(struct net_device *dev)
>> -{
>> -	struct tun_struct *tun = netdev_priv(dev);
>> -
>> -	sk_release_kernel(tun->socket.sk);
>> -}
>> -
>>   /* Net device open. */
>>   static int tun_net_open(struct net_device *dev)
>>   {
>> @@ -379,24 +361,26 @@ static int tun_net_close(struct net_device *dev)
>>   static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>>   	struct tun_struct *tun = netdev_priv(dev);
>> +	struct tun_file *tfile = tun->tfile;
>>
>>   	tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
>>
>>   	/* Drop packet if interface is not attached */
>> -	if (!tun->tfile)
>> +	if (!tfile)
>>   		goto drop;
>>
>>   	/* Drop if the filter does not like it.
>>   	 * This is a noop if the filter is disabled.
>>   	 * Filter can be enabled only for the TAP devices. */
>> -	if (!check_filter(&tun->txflt, skb))
>> +	if (!check_filter(&tfile->txflt, skb))
>>   		goto drop;
>>
>> -	if (tun->socket.sk->sk_filter&&
>> -	    sk_filter(tun->socket.sk, skb))
>> +	if (tfile->socket.sk->sk_filter&&
>> +	    sk_filter(tfile->socket.sk, skb))
>>   		goto drop;
>>
>> -	if (skb_queue_len(&tun->socket.sk->sk_receive_queue)>= dev->tx_queue_len) {
>> +	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
>> +	>= dev->tx_queue_len) {
>>   		if (!(tun->flags&  TUN_ONE_QUEUE)) {
>>   			/* Normal queueing mode. */
>>   			/* Packet scheduler handles dropping of further packets. */
>> @@ -417,12 +401,12 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>   	skb_orphan(skb);
>>
>>   	/* Enqueue packet */
>> -	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
>> +	skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
>>
>>   	/* Notify and wake up reader process */
>> -	if (tun->flags&  TUN_FASYNC)
>> -		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
>> -	wake_up_interruptible_poll(&tun->wq.wait, POLLIN |
>> +	if (tfile->flags&  TUN_FASYNC)
>> +		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>> +	wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
>>   				   POLLRDNORM | POLLRDBAND);
>>   	return NETDEV_TX_OK;
>>
>> @@ -550,11 +534,11 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>   	if (!tun)
>>   		return POLLERR;
>>
>> -	sk = tun->socket.sk;
>> +	sk = tfile->socket.sk;
>>
>>   	tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
>>
>> -	poll_wait(file,&tun->wq.wait, wait);
>> +	poll_wait(file,&tfile->wq.wait, wait);
>>
>>   	if (!skb_queue_empty(&sk->sk_receive_queue))
>>   		mask |= POLLIN | POLLRDNORM;
>> @@ -573,11 +557,11 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>
>>   /* prepad is the amount to reserve at front.  len is length after that.
>>    * linear is a hint as to how much to copy (usually headers). */
>> -static struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
>> +static struct sk_buff *tun_alloc_skb(struct tun_file *tfile,
>>   				     size_t prepad, size_t len,
>>   				     size_t linear, int noblock)
>>   {
>> -	struct sock *sk = tun->socket.sk;
>> +	struct sock *sk = tfile->socket.sk;
>>   	struct sk_buff *skb;
>>   	int err;
>>
>> @@ -601,7 +585,7 @@ static struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
>>   }
>>
>>   /* Get packet from user space buffer */
>> -static ssize_t tun_get_user(struct tun_struct *tun,
>> +static ssize_t tun_get_user(struct tun_file *tfile,
>>   			    const struct iovec *iv, size_t count,
>>   			    int noblock)
>>   {
>> @@ -610,8 +594,10 @@ static ssize_t tun_get_user(struct tun_struct *tun,
>>   	size_t len = count, align = NET_SKB_PAD;
>>   	struct virtio_net_hdr gso = { 0 };
>>   	int offset = 0;
>> +	struct tun_struct *tun = NULL;
>> +	bool drop = false, error = false;
>>
>> -	if (!(tun->flags&  TUN_NO_PI)) {
>> +	if (!(tfile->flags&  TUN_NO_PI)) {
>>   		if ((len -= sizeof(pi))>  count)
>>   			return -EINVAL;
>>
>> @@ -620,8 +606,9 @@ static ssize_t tun_get_user(struct tun_struct *tun,
>>   		offset += sizeof(pi);
>>   	}
>>
>> -	if (tun->flags&  TUN_VNET_HDR) {
>> -		if ((len -= tun->vnet_hdr_sz)>  count)
>> +	if (tfile->flags&  TUN_VNET_HDR) {
>> +		len -= tfile->vnet_hdr_sz;
>> +		if (len>  count)
>>   			return -EINVAL;
>>
>>   		if (memcpy_fromiovecend((void *)&gso, iv, offset, sizeof(gso)))
>> @@ -633,41 +620,43 @@ static ssize_t tun_get_user(struct tun_struct *tun,
>>
>>   		if (gso.hdr_len>  len)
>>   			return -EINVAL;
>> -		offset += tun->vnet_hdr_sz;
>> +		offset += tfile->vnet_hdr_sz;
>>   	}
>>
>> -	if ((tun->flags&  TUN_TYPE_MASK) == TUN_TAP_DEV) {
>> +	if ((tfile->flags&  TUN_TYPE_MASK) == TUN_TAP_DEV) {
>>   		align += NET_IP_ALIGN;
>>   		if (unlikely(len<  ETH_HLEN ||
>>   			     (gso.hdr_len&&  gso.hdr_len<  ETH_HLEN)))
>>   			return -EINVAL;
>>   	}
>>
>> -	skb = tun_alloc_skb(tun, align, len, gso.hdr_len, noblock);
>> +	skb = tun_alloc_skb(tfile, align, len, gso.hdr_len, noblock);
>> +
>>   	if (IS_ERR(skb)) {
>>   		if (PTR_ERR(skb) != -EAGAIN)
>> -			tun->dev->stats.rx_dropped++;
>> -		return PTR_ERR(skb);
>> +			drop = true;
>> +		count = PTR_ERR(skb);
>> +		goto err;
>>   	}
>>
>>   	if (skb_copy_datagram_from_iovec(skb, 0, iv, offset, len)) {
>> -		tun->dev->stats.rx_dropped++;
>> +		drop = true;
>>   		kfree_skb(skb);
>> -		return -EFAULT;
>> +		count = -EFAULT;
>> +		goto err;
>>   	}
>>
>>   	if (gso.flags&  VIRTIO_NET_HDR_F_NEEDS_CSUM) {
>>   		if (!skb_partial_csum_set(skb, gso.csum_start,
>>   					  gso.csum_offset)) {
>> -			tun->dev->stats.rx_frame_errors++;
>> -			kfree_skb(skb);
>> -			return -EINVAL;
>> +			error = true;
>> +			goto err_free;
>>   		}
>>   	}
>>
>> -	switch (tun->flags&  TUN_TYPE_MASK) {
>> +	switch (tfile->flags&  TUN_TYPE_MASK) {
>>   	case TUN_TUN_DEV:
>> -		if (tun->flags&  TUN_NO_PI) {
>> +		if (tfile->flags&  TUN_NO_PI) {
>>   			switch (skb->data[0]&  0xf0) {
>>   			case 0x40:
>>   				pi.proto = htons(ETH_P_IP);
>> @@ -676,18 +665,15 @@ static ssize_t tun_get_user(struct tun_struct *tun,
>>   				pi.proto = htons(ETH_P_IPV6);
>>   				break;
>>   			default:
>> -				tun->dev->stats.rx_dropped++;
>> -				kfree_skb(skb);
>> -				return -EINVAL;
>> +				drop = true;
>> +				goto err_free;
>>   			}
>>   		}
>>
>>   		skb_reset_mac_header(skb);
>>   		skb->protocol = pi.proto;
>> -		skb->dev = tun->dev;
>>   		break;
>>   	case TUN_TAP_DEV:
>> -		skb->protocol = eth_type_trans(skb, tun->dev);
>>   		break;
>>   	}
>>
>> @@ -704,9 +690,8 @@ static ssize_t tun_get_user(struct tun_struct *tun,
>>   			skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
>>   			break;
>>   		default:
>> -			tun->dev->stats.rx_frame_errors++;
>> -			kfree_skb(skb);
>> -			return -EINVAL;
>> +			error = true;
>> +			goto err_free;
>>   		}
>>
>>   		if (gso.gso_type&  VIRTIO_NET_HDR_GSO_ECN)
>> @@ -714,9 +699,8 @@ static ssize_t tun_get_user(struct tun_struct *tun,
>>
>>   		skb_shinfo(skb)->gso_size = gso.gso_size;
>>   		if (skb_shinfo(skb)->gso_size == 0) {
>> -			tun->dev->stats.rx_frame_errors++;
>> -			kfree_skb(skb);
>> -			return -EINVAL;
>> +			error = true;
>> +			goto err_free;
>>   		}
>>
>>   		/* Header must be checked, and gso_segs computed. */
>> @@ -724,11 +708,38 @@ static ssize_t tun_get_user(struct tun_struct *tun,
>>   		skb_shinfo(skb)->gso_segs = 0;
>>   	}
>>
>> -	netif_rx_ni(skb);
>> +	tun = __tun_get(tfile);
>> +	if (!tun)
>> +		return -EBADFD;
>>
>> +	switch (tfile->flags&  TUN_TYPE_MASK) {
>> +	case TUN_TUN_DEV:
>> +		skb->dev = tun->dev;
>> +		break;
>> +	case TUN_TAP_DEV:
>> +		skb->protocol = eth_type_trans(skb, tun->dev);
>> +		break;
>> +	}
>> +
>> +	netif_rx_ni(skb);
>>   	tun->dev->stats.rx_packets++;
>>   	tun->dev->stats.rx_bytes += len;
>> +	tun_put(tun);
>> +	return count;
>> +
>> +err_free:
>> +	count = -EINVAL;
>> +	kfree_skb(skb);
>> +err:
>> +	tun = __tun_get(tfile);
>> +	if (!tun)
>> +		return -EBADFD;
>>
>> +	if (drop)
>> +		tun->dev->stats.rx_dropped++;
>> +	if (error)
>> +		tun->dev->stats.rx_frame_errors++;
>> +	tun_put(tun);
>>   	return count;
>>   }
>>
>> @@ -736,30 +747,25 @@ static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
>>   			      unsigned long count, loff_t pos)
>>   {
>>   	struct file *file = iocb->ki_filp;
>> -	struct tun_struct *tun = tun_get(file);
>> +	struct tun_file *tfile = file->private_data;
>>   	ssize_t result;
>>
>> -	if (!tun)
>> -		return -EBADFD;
>> -
>> -	tun_debug(KERN_INFO, tun, "tun_chr_write %ld\n", count);
>> -
>> -	result = tun_get_user(tun, iv, iov_length(iv, count),
>> +	result = tun_get_user(tfile, iv, iov_length(iv, count),
>>   			      file->f_flags&  O_NONBLOCK);
>>
>> -	tun_put(tun);
>>   	return result;
>>   }
>>
>>   /* Put packet to the user space buffer */
>> -static ssize_t tun_put_user(struct tun_struct *tun,
>> +static ssize_t tun_put_user(struct tun_file *tfile,
>>   			    struct sk_buff *skb,
>>   			    const struct iovec *iv, int len)
>>   {
>> +	struct tun_struct *tun = NULL;
>>   	struct tun_pi pi = { 0, skb->protocol };
>>   	ssize_t total = 0;
>>
>> -	if (!(tun->flags&  TUN_NO_PI)) {
>> +	if (!(tfile->flags&  TUN_NO_PI)) {
>>   		if ((len -= sizeof(pi))<  0)
>>   			return -EINVAL;
>>
>> @@ -773,9 +779,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>   		total += sizeof(pi);
>>   	}
>>
>> -	if (tun->flags&  TUN_VNET_HDR) {
>> +	if (tfile->flags&  TUN_VNET_HDR) {
>>   		struct virtio_net_hdr gso = { 0 }; /* no info leak */
>> -		if ((len -= tun->vnet_hdr_sz)<  0)
>> +		len -= tfile->vnet_hdr_sz;
>> +		if (len<  0)
>>   			return -EINVAL;
>>
>>   		if (skb_is_gso(skb)) {
>> @@ -818,7 +825,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>   		if (unlikely(memcpy_toiovecend(iv, (void *)&gso, total,
>>   					       sizeof(gso))))
>>   			return -EFAULT;
>> -		total += tun->vnet_hdr_sz;
>> +		total += tfile->vnet_hdr_sz;
>>   	}
>>
>>   	len = min_t(int, skb->len, len);
>> @@ -826,29 +833,33 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>   	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
>>   	total += skb->len;
>>
>> -	tun->dev->stats.tx_packets++;
>> -	tun->dev->stats.tx_bytes += len;
>> +	tun = __tun_get(tfile);
>> +	if (tun) {
>> +		tun->dev->stats.tx_packets++;
>> +		tun->dev->stats.tx_bytes += len;
>> +		tun_put(tun);
>> +	}
>>
>>   	return total;
>>   }
>>
>> -static ssize_t tun_do_read(struct tun_struct *tun,
>> +static ssize_t tun_do_read(struct tun_file *tfile,
>>   			   struct kiocb *iocb, const struct iovec *iv,
>>   			   ssize_t len, int noblock)
>>   {
>>   	DECLARE_WAITQUEUE(wait, current);
>>   	struct sk_buff *skb;
>>   	ssize_t ret = 0;
>> -
>> -	tun_debug(KERN_INFO, tun, "tun_chr_read\n");
>> +	struct tun_struct *tun = NULL;
>>
>>   	if (unlikely(!noblock))
>> -		add_wait_queue(&tun->wq.wait,&wait);
>> +		add_wait_queue(&tfile->wq.wait,&wait);
>>   	while (len) {
>>   		current->state = TASK_INTERRUPTIBLE;
>>
>> +		skb = skb_dequeue(&tfile->socket.sk->sk_receive_queue);
>>   		/* Read frames from the queue */
>> -		if (!(skb=skb_dequeue(&tun->socket.sk->sk_receive_queue))) {
>> +		if (!skb) {
>>   			if (noblock) {
>>   				ret = -EAGAIN;
>>   				break;
>> @@ -857,25 +868,38 @@ static ssize_t tun_do_read(struct tun_struct *tun,
>>   				ret = -ERESTARTSYS;
>>   				break;
>>   			}
>> +
>> +			tun = __tun_get(tfile);
>> +			if (!tun) {
>> +				ret = -EIO;
>> +				break;
>> +			}
>>   			if (tun->dev->reg_state != NETREG_REGISTERED) {
>>   				ret = -EIO;
>> +				tun_put(tun);
>>   				break;
>>   			}
>> +			tun_put(tun);
>>
>>   			/* Nothing to read, let's sleep */
>>   			schedule();
>>   			continue;
>>   		}
>> -		netif_wake_queue(tun->dev);
>>
>> -		ret = tun_put_user(tun, skb, iv, len);
>> +		tun = __tun_get(tfile);
>> +		if (tun) {
>> +			netif_wake_queue(tun->dev);
>> +			tun_put(tun);
>> +		}
>> +
>> +		ret = tun_put_user(tfile, skb, iv, len);
>>   		kfree_skb(skb);
>>   		break;
>>   	}
>>
>>   	current->state = TASK_RUNNING;
>>   	if (unlikely(!noblock))
>> -		remove_wait_queue(&tun->wq.wait,&wait);
>> +		remove_wait_queue(&tfile->wq.wait,&wait);
>>
>>   	return ret;
>>   }
>> @@ -885,21 +909,17 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
>>   {
>>   	struct file *file = iocb->ki_filp;
>>   	struct tun_file *tfile = file->private_data;
>> -	struct tun_struct *tun = __tun_get(tfile);
>>   	ssize_t len, ret;
>>
>> -	if (!tun)
>> -		return -EBADFD;
>>   	len = iov_length(iv, count);
>>   	if (len<  0) {
>>   		ret = -EINVAL;
>>   		goto out;
>>   	}
>>
>> -	ret = tun_do_read(tun, iocb, iv, len, file->f_flags&  O_NONBLOCK);
>> +	ret = tun_do_read(tfile, iocb, iv, len, file->f_flags&  O_NONBLOCK);
>>   	ret = min_t(ssize_t, ret, len);
>>   out:
>> -	tun_put(tun);
>>   	return ret;
>>   }
>>
>> @@ -911,7 +931,7 @@ static void tun_setup(struct net_device *dev)
>>   	tun->group = -1;
>>
>>   	dev->ethtool_ops =&tun_ethtool_ops;
>> -	dev->destructor = tun_free_netdev;
>> +	dev->destructor = free_netdev;
>>   }
>>
>>   /* Trivial set of netlink ops to allow deleting tun or tap
>> @@ -931,7 +951,7 @@ static struct rtnl_link_ops tun_link_ops __read_mostly = {
>>
>>   static void tun_sock_write_space(struct sock *sk)
>>   {
>> -	struct tun_struct *tun;
>> +	struct tun_file *tfile = NULL;
>>   	wait_queue_head_t *wqueue;
>>
>>   	if (!sock_writeable(sk))
>> @@ -945,37 +965,38 @@ static void tun_sock_write_space(struct sock *sk)
>>   		wake_up_interruptible_sync_poll(wqueue, POLLOUT |
>>   						POLLWRNORM | POLLWRBAND);
>>
>> -	tun = tun_sk(sk)->tun;
>> -	kill_fasync(&tun->fasync, SIGIO, POLL_OUT);
>> -}
>> -
>> -static void tun_sock_destruct(struct sock *sk)
>> -{
>> -	free_netdev(tun_sk(sk)->tun->dev);
>> +	tfile = container_of(sk, struct tun_file, sk);
>> +	kill_fasync(&tfile->fasync, SIGIO, POLL_OUT);
>>   }
>>
>>   static int tun_sendmsg(struct kiocb *iocb, struct socket *sock,
>>   		       struct msghdr *m, size_t total_len)
>>   {
>> -	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
>> -	return tun_get_user(tun, m->msg_iov, total_len,
>> -			    m->msg_flags&  MSG_DONTWAIT);
>> +	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
>> +	ssize_t result;
>> +
>> +	result = tun_get_user(tfile, m->msg_iov, total_len,
>> +			      m->msg_flags&  MSG_DONTWAIT);
>> +	return result;
>>   }
>>
>>   static int tun_recvmsg(struct kiocb *iocb, struct socket *sock,
>>   		       struct msghdr *m, size_t total_len,
>>   		       int flags)
>>   {
>> -	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
>> +	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
>>   	int ret;
>> +
>>   	if (flags&  ~(MSG_DONTWAIT|MSG_TRUNC))
>>   		return -EINVAL;
>> -	ret = tun_do_read(tun, iocb, m->msg_iov, total_len,
>> +
>> +	ret = tun_do_read(tfile, iocb, m->msg_iov, total_len,
>>   			  flags&  MSG_DONTWAIT);
>>   	if (ret>  total_len) {
>>   		m->msg_flags |= MSG_TRUNC;
>>   		ret = flags&  MSG_TRUNC ? ret : total_len;
>>   	}
>> +
>>   	return ret;
>>   }
>>
>> @@ -996,7 +1017,7 @@ static const struct proto_ops tun_socket_ops = {
>>   static struct proto tun_proto = {
>>   	.name		= "tun",
>>   	.owner		= THIS_MODULE,
>> -	.obj_size	= sizeof(struct tun_sock),
>> +	.obj_size	= sizeof(struct tun_file),
>>   };
>>
>>   static int tun_flags(struct tun_struct *tun)
>> @@ -1047,8 +1068,8 @@ static DEVICE_ATTR(group, 0444, tun_show_group, NULL);
>>
>>   static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   {
>> -	struct sock *sk;
>>   	struct tun_struct *tun;
>> +	struct tun_file *tfile = file->private_data;
>>   	struct net_device *dev;
>>   	int err;
>>
>> @@ -1069,7 +1090,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   		     (tun->group != -1&&  !in_egroup_p(tun->group)))&&
>>   		!capable(CAP_NET_ADMIN))
>>   			return -EPERM;
>> -		err = security_tun_dev_attach(tun->socket.sk);
>> +		err = security_tun_dev_attach(tfile->socket.sk);
>>   		if (err<  0)
>>   			return err;
>>
>> @@ -1113,25 +1134,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   		tun = netdev_priv(dev);
>>   		tun->dev = dev;
>>   		tun->flags = flags;
>> -		tun->txflt.count = 0;
>> -		tun->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
>>
>> -		err = -ENOMEM;
>> -		sk = sk_alloc(&init_net, AF_UNSPEC, GFP_KERNEL,&tun_proto);
>> -		if (!sk)
>> -			goto err_free_dev;
>> -
>> -		sk_change_net(sk, net);
>> -		tun->socket.wq =&tun->wq;
>> -		init_waitqueue_head(&tun->wq.wait);
>> -		tun->socket.ops =&tun_socket_ops;
>> -		sock_init_data(&tun->socket, sk);
>> -		sk->sk_write_space = tun_sock_write_space;
>> -		sk->sk_sndbuf = INT_MAX;
>> -
>> -		tun_sk(sk)->tun = tun;
>> -
>> -		security_tun_dev_post_create(sk);
>> +		security_tun_dev_post_create(&tfile->sk);
>>
>>   		tun_net_init(dev);
>>
>> @@ -1141,15 +1145,13 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>
>>   		err = register_netdevice(tun->dev);
>>   		if (err<  0)
>> -			goto err_free_sk;
>> +			goto err_free_dev;
>>
>>   		if (device_create_file(&tun->dev->dev,&dev_attr_tun_flags) ||
>>   		    device_create_file(&tun->dev->dev,&dev_attr_owner) ||
>>   		    device_create_file(&tun->dev->dev,&dev_attr_group))
>>   			pr_err("Failed to create tun sysfs files\n");
>>
>> -		sk->sk_destruct = tun_sock_destruct;
>> -
>>   		err = tun_attach(tun, file);
>>   		if (err<  0)
>>   			goto failed;
>> @@ -1172,6 +1174,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   	else
>>   		tun->flags&= ~TUN_VNET_HDR;
>>
>> +	/* Cache flags from tun device */
>> +	tfile->flags = tun->flags;
>>   	/* Make sure persistent devices do not get stuck in
>>   	 * xoff state.
>>   	 */
>> @@ -1181,11 +1185,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   	strcpy(ifr->ifr_name, tun->dev->name);
>>   	return 0;
>>
>> - err_free_sk:
>> -	tun_free_netdev(dev);
>> - err_free_dev:
>> +err_free_dev:
>>   	free_netdev(dev);
>> - failed:
>> +failed:
>>   	return err;
>>   }
>>
>> @@ -1357,9 +1359,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   	case TUNSETTXFILTER:
>>   		/* Can be set only for TAPs */
>>   		ret = -EINVAL;
>> -		if ((tun->flags&  TUN_TYPE_MASK) != TUN_TAP_DEV)
>> +		if ((tfile->flags&  TUN_TYPE_MASK) != TUN_TAP_DEV)
>>   			break;
>> -		ret = update_filter(&tun->txflt, (void __user *)arg);
>> +		ret = update_filter(&tfile->txflt, (void __user *)arg);
>>   		break;
>>
>>   	case SIOCGIFHWADDR:
>> @@ -1379,7 +1381,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   		break;
>>
>>   	case TUNGETSNDBUF:
>> -		sndbuf = tun->socket.sk->sk_sndbuf;
>> +		sndbuf = tfile->socket.sk->sk_sndbuf;
>>   		if (copy_to_user(argp,&sndbuf, sizeof(sndbuf)))
>>   			ret = -EFAULT;
>>   		break;
>> @@ -1390,11 +1392,11 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   			break;
>>   		}
>>
>> -		tun->socket.sk->sk_sndbuf = sndbuf;
>> +		tfile->socket.sk->sk_sndbuf = sndbuf;
>>   		break;
>>
>>   	case TUNGETVNETHDRSZ:
>> -		vnet_hdr_sz = tun->vnet_hdr_sz;
>> +		vnet_hdr_sz = tfile->vnet_hdr_sz;
>>   		if (copy_to_user(argp,&vnet_hdr_sz, sizeof(vnet_hdr_sz)))
>>   			ret = -EFAULT;
>>   		break;
>> @@ -1409,27 +1411,27 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   			break;
>>   		}
>>
>> -		tun->vnet_hdr_sz = vnet_hdr_sz;
>> +		tfile->vnet_hdr_sz = vnet_hdr_sz;
>>   		break;
>>
>>   	case TUNATTACHFILTER:
>>   		/* Can be set only for TAPs */
>>   		ret = -EINVAL;
>> -		if ((tun->flags&  TUN_TYPE_MASK) != TUN_TAP_DEV)
>> +		if ((tfile->flags&  TUN_TYPE_MASK) != TUN_TAP_DEV)
>>   			break;
>>   		ret = -EFAULT;
>>   		if (copy_from_user(&fprog, argp, sizeof(fprog)))
>>   			break;
>>
>> -		ret = sk_attach_filter(&fprog, tun->socket.sk);
>> +		ret = sk_attach_filter(&fprog, tfile->socket.sk);
>>   		break;
>>
>>   	case TUNDETACHFILTER:
>>   		/* Can be set only for TAPs */
>>   		ret = -EINVAL;
>> -		if ((tun->flags&  TUN_TYPE_MASK) != TUN_TAP_DEV)
>> +		if ((tfile->flags&  TUN_TYPE_MASK) != TUN_TAP_DEV)
>>   			break;
>> -		ret = sk_detach_filter(tun->socket.sk);
>> +		ret = sk_detach_filter(tfile->socket.sk);
>>   		break;
>>
>>   	default:
>> @@ -1481,43 +1483,50 @@ static long tun_chr_compat_ioctl(struct file *file,
>>
>>   static int tun_chr_fasync(int fd, struct file *file, int on)
>>   {
>> -	struct tun_struct *tun = tun_get(file);
>> -	int ret;
>> -
>> -	if (!tun)
>> -		return -EBADFD;
>> -
>> -	tun_debug(KERN_INFO, tun, "tun_chr_fasync %d\n", on);
>> +	struct tun_file *tfile = file->private_data;
>> +	int ret = fasync_helper(fd, file, on,&tfile->fasync);
>>
>> -	if ((ret = fasync_helper(fd, file, on,&tun->fasync))<  0)
>> +	if (ret<  0)
>>   		goto out;
>>
>>   	if (on) {
>>   		ret = __f_setown(file, task_pid(current), PIDTYPE_PID, 0);
>>   		if (ret)
>>   			goto out;
>> -		tun->flags |= TUN_FASYNC;
>> +		tfile->flags |= TUN_FASYNC;
>>   	} else
>> -		tun->flags&= ~TUN_FASYNC;
>> +		tfile->flags&= ~TUN_FASYNC;
>>   	ret = 0;
>>   out:
>> -	tun_put(tun);
>>   	return ret;
>>   }
>>
>>   static int tun_chr_open(struct inode *inode, struct file * file)
>>   {
>> +	struct net *net = current->nsproxy->net_ns;
>>   	struct tun_file *tfile;
>>
>>   	DBG1(KERN_INFO, "tunX: tun_chr_open\n");
>>
>> -	tfile = kmalloc(sizeof(*tfile), GFP_KERNEL);
>> +	tfile = (struct tun_file *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
>> +					&tun_proto);
>>   	if (!tfile)
>>   		return -ENOMEM;
>> -	atomic_set(&tfile->count, 0);
>> +
>>   	tfile->tun = NULL;
>> -	tfile->net = get_net(current->nsproxy->net_ns);
>> +	tfile->net = net;
>> +	tfile->txflt.count = 0;
>> +	tfile->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
>> +	tfile->socket.wq =&tfile->wq;
>> +	init_waitqueue_head(&tfile->wq.wait);
>> +	tfile->socket.file = file;
>> +	tfile->socket.ops =&tun_socket_ops;
>> +	sock_init_data(&tfile->socket,&tfile->sk);
>> +
>> +	tfile->sk.sk_write_space = tun_sock_write_space;
>> +	tfile->sk.sk_sndbuf = INT_MAX;
>>   	file->private_data = tfile;
>> +
>>   	return 0;
>>   }
>>
>> @@ -1541,14 +1550,14 @@ static int tun_chr_close(struct inode *inode, struct file *file)
>>   				unregister_netdevice(dev);
>>   			rtnl_unlock();
>>   		}
>> -	}
>>
>> -	tun = tfile->tun;
>> -	if (tun)
>> -		sock_put(tun->socket.sk);
>> +		/* drop the reference that netdevice holds */
>> +		sock_put(&tfile->sk);
>>
>> -	put_net(tfile->net);
>> -	kfree(tfile);
>> +	}
>> +
>> +	/* drop the reference that file holds */
>> +	sock_put(&tfile->sk);
>>
>>   	return 0;
>>   }
>> @@ -1676,13 +1685,14 @@ static void tun_cleanup(void)
>>   struct socket *tun_get_socket(struct file *file)
>>   {
>>   	struct tun_struct *tun;
>> +	struct tun_file *tfile = file->private_data;
>>   	if (file->f_op !=&tun_fops)
>>   		return ERR_PTR(-EINVAL);
>>   	tun = tun_get(file);
>>   	if (!tun)
>>   		return ERR_PTR(-EBADFD);
>>   	tun_put(tun);
>> -	return&tun->socket;
>> +	return&tfile->socket;
>>   }
>>   EXPORT_SYMBOL_GPL(tun_get_socket);
>>

^ permalink raw reply

* Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support
From: Jason Wang @ 2012-06-26  5:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: habanero, netdev, linux-kernel, krkumar2, tahm, akong, davem,
	shemminger, mashirle, Eric Dumazet
In-Reply-To: <20120625082553.GC18229@redhat.com>

On 06/25/2012 04:25 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 25, 2012 at 02:10:18PM +0800, Jason Wang wrote:
>> This patch adds multiqueue support for tap device. This is done by abstracting
>> each queue as a file/socket and allowing multiple sockets to be attached to the
>> tuntap device (an array of tun_file were stored in the tun_struct). Userspace
>> could write and read from those files to do the parallel packet
>> sending/receiving.
>>
>> Unlike the previous single queue implementation, the socket and device were
>> loosely coupled, each of them were allowed to go away first. In order to let the
>> tx path lockless, netif_tx_loch_bh() is replaced by RCU/NETIF_F_LLTX to
>> synchronize between data path and system call.
> Don't use LLTX/RCU. It's not worth it.
> Use something like netif_set_real_num_tx_queues.
>

For LLTX, maybe it's better to convert it to alloc_netdev_mq() to let 
the kernel see all queues and make the queue stopping and per-queue 
stats eaiser.
RCU is used to handle the attaching/detaching when tun/tap is sending 
and receiving packets which looks reasonalbe for me. Not sure 
netif_set_real_num_tx_queues() can help in this situation.

>> The tx queue selecting is first based on the recorded rxq index of an skb, it
>> there's no such one, then choosing based on rx hashing (skb_get_rxhash()).
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Interestingly macvtap switched to hashing first:
> ef0002b577b52941fb147128f30bd1ecfdd3ff6d
> (the commit log is corrupted but see what it
> does in the patch).
> Any idea why?
>
>> ---
>>   drivers/net/tun.c |  371 +++++++++++++++++++++++++++++++++--------------------
>>   1 files changed, 232 insertions(+), 139 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 8233b0a..5c26757 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -107,6 +107,8 @@ struct tap_filter {
>>   	unsigned char	addr[FLT_EXACT_COUNT][ETH_ALEN];
>>   };
>>
>> +#define MAX_TAP_QUEUES (NR_CPUS<  16 ? NR_CPUS : 16)
> Why the limit? I am guessing you copied this from macvtap?
> This is problematic for a number of reasons:
> 	- will not play well with migration
> 	- will not work well for a large guest
>
> Yes, macvtap needs to be fixed too.
>
> I am guessing what it is trying to prevent is queueing
> up a huge number of packets?
> So just divide the default tx queue limit by the # of queues.

Not sure, another reasons I can guess:
- to prevent storing a large array of pointers in tun_struct or macvlan_dev.
- it may not be suitable to allow the number of virtqueues greater than 
the number of physical queues in the card

>
> And by the way, for MQ applications maybe we can finally
> ignore tx queue altogether and limit the total number
> of bytes queued?
> To avoid regressions we can make it large like 64M/# queues.
> Could be a separate patch I think, and for a single queue
> might need a compatible mode though I am not sure.

Could you explain more about this? Did you mean to have a total sndbuf 
for all sockets that attached to tun/tap?
>> +
>>   struct tun_file {
>>   	struct sock sk;
>>   	struct socket socket;
>> @@ -114,16 +116,18 @@ struct tun_file {
>>   	int vnet_hdr_sz;
>>   	struct tap_filter txflt;
>>   	atomic_t count;
>> -	struct tun_struct *tun;
>> +	struct tun_struct __rcu *tun;
>>   	struct net *net;
>>   	struct fasync_struct *fasync;
>>   	unsigned int flags;
>> +	u16 queue_index;
>>   };
>>
>>   struct tun_sock;
>>
>>   struct tun_struct {
>> -	struct tun_file		*tfile;
>> +	struct tun_file		*tfiles[MAX_TAP_QUEUES];
>> +	unsigned int            numqueues;
>>   	unsigned int 		flags;
>>   	uid_t			owner;
>>   	gid_t			group;
>> @@ -138,80 +142,159 @@ struct tun_struct {
>>   #endif
>>   };
>>
>> -static int tun_attach(struct tun_struct *tun, struct file *file)
>> +static DEFINE_SPINLOCK(tun_lock);
>> +
>> +/*
>> + * tun_get_queue(): calculate the queue index
>> + *     - if skbs comes from mq nics, we can just borrow
>> + *     - if not, calculate from the hash
>> + */
>> +static struct tun_file *tun_get_queue(struct net_device *dev,
>> +				      struct sk_buff *skb)
>>   {
>> -	struct tun_file *tfile = file->private_data;
>> -	int err;
>> +	struct tun_struct *tun = netdev_priv(dev);
>> +	struct tun_file *tfile = NULL;
>> +	int numqueues = tun->numqueues;
>> +	__u32 rxq;
>>
>> -	ASSERT_RTNL();
>> +	BUG_ON(!rcu_read_lock_held());
>>
>> -	netif_tx_lock_bh(tun->dev);
>> +	if (!numqueues)
>> +		goto out;
>>
>> -	err = -EINVAL;
>> -	if (tfile->tun)
>> +	if (numqueues == 1) {
>> +		tfile = rcu_dereference(tun->tfiles[0]);
> Instead of hacks like this, you can ask for an MQ
> flag to be set in SETIFF. Then you won't need to
> handle attach/detach at random times.

Consier user switch between a sq guest to mq guest, qemu would attach or 
detach the fd which could not be expceted in kernel.
> And most of the scary num_queues checks can go away.

Even we has a MQ flag, userspace could still just attach one queue to 
the device.
> You can then also ask userspace about the max # of queues
> to expect if you want to save some memory.
>

Yes, good suggestion.
>>   		goto out;
>> +	}
>>
>> -	err = -EBUSY;
>> -	if (tun->tfile)
>> +	if (likely(skb_rx_queue_recorded(skb))) {
>> +		rxq = skb_get_rx_queue(skb);
>> +
>> +		while (unlikely(rxq>= numqueues))
>> +			rxq -= numqueues;
>> +
>> +		tfile = rcu_dereference(tun->tfiles[rxq]);
>>   		goto out;
>> +	}
>>
>> -	err = 0;
>> -	tfile->tun = tun;
>> -	tun->tfile = tfile;
>> -	netif_carrier_on(tun->dev);
>> -	dev_hold(tun->dev);
>> -	sock_hold(&tfile->sk);
>> -	atomic_inc(&tfile->count);
>> +	/* Check if we can use flow to select a queue */
>> +	rxq = skb_get_rxhash(skb);
>> +	if (rxq) {
>> +		u32 idx = ((u64)rxq * numqueues)>>  32;
> This completely confuses me. What's the logic here?
> How do we even know it's in range?
>

rxq is a u32, so the result should be less than numqueues.
>> +		tfile = rcu_dereference(tun->tfiles[idx]);
>> +		goto out;
>> +	}
>>
>> +	tfile = rcu_dereference(tun->tfiles[0]);
>>   out:
>> -	netif_tx_unlock_bh(tun->dev);
>> -	return err;
>> +	return tfile;
>>   }
>>
>> -static void __tun_detach(struct tun_struct *tun)
>> +static int tun_detach(struct tun_file *tfile, bool clean)
>>   {
>> -	struct tun_file *tfile = tun->tfile;
>> -	/* Detach from net device */
>> -	netif_tx_lock_bh(tun->dev);
>> -	netif_carrier_off(tun->dev);
>> -	tun->tfile = NULL;
>> -	netif_tx_unlock_bh(tun->dev);
>> -
>> -	/* Drop read queue */
>> -	skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
>> -
>> -	/* Drop the extra count on the net device */
>> -	dev_put(tun->dev);
>> -}
>> +	struct tun_struct *tun;
>> +	struct net_device *dev = NULL;
>> +	bool destroy = false;
>>
>> -static void tun_detach(struct tun_struct *tun)
>> -{
>> -	rtnl_lock();
>> -	__tun_detach(tun);
>> -	rtnl_unlock();
>> -}
>> +	spin_lock(&tun_lock);
>>
>> -static struct tun_struct *__tun_get(struct tun_file *tfile)
>> -{
>> -	struct tun_struct *tun = NULL;
>> +	tun = rcu_dereference_protected(tfile->tun,
>> +					lockdep_is_held(&tun_lock));
>> +	if (tun) {
>> +		u16 index = tfile->queue_index;
>> +		BUG_ON(index>= tun->numqueues);
>> +		dev = tun->dev;
>> +
>> +		rcu_assign_pointer(tun->tfiles[index],
>> +				   tun->tfiles[tun->numqueues - 1]);
>> +		tun->tfiles[index]->queue_index = index;
>> +		rcu_assign_pointer(tfile->tun, NULL);
>> +		--tun->numqueues;
>> +		sock_put(&tfile->sk);
>>
>> -	if (atomic_inc_not_zero(&tfile->count))
>> -		tun = tfile->tun;
>> +		if (tun->numqueues == 0&&  !(tun->flags&  TUN_PERSIST))
>> +			destroy = true;
> Please don't use flags like that. Use dedicated labels and goto there on error.

ok.
>
>> +	}
>>
>> -	return tun;
>> +	spin_unlock(&tun_lock);
>> +
>> +	synchronize_rcu();
>> +	if (clean)
>> +		sock_put(&tfile->sk);
>> +
>> +	if (destroy) {
>> +		rtnl_lock();
>> +		if (dev->reg_state == NETREG_REGISTERED)
>> +			unregister_netdevice(dev);
>> +		rtnl_unlock();
>> +	}
>> +
>> +	return 0;
>>   }
>>
>> -static struct tun_struct *tun_get(struct file *file)
>> +static void tun_detach_all(struct net_device *dev)
>>   {
>> -	return __tun_get(file->private_data);
>> +	struct tun_struct *tun = netdev_priv(dev);
>> +	struct tun_file *tfile, *tfile_list[MAX_TAP_QUEUES];
>> +	int i, j = 0;
>> +
>> +	spin_lock(&tun_lock);
>> +
>> +	for (i = 0; i<  MAX_TAP_QUEUES&&  tun->numqueues; i++) {
>> +		tfile = rcu_dereference_protected(tun->tfiles[i],
>> +						lockdep_is_held(&tun_lock));
>> +		BUG_ON(!tfile);
>> +		wake_up_all(&tfile->wq.wait);
>> +		tfile_list[j++] = tfile;
>> +		rcu_assign_pointer(tfile->tun, NULL);
>> +		--tun->numqueues;
>> +	}
>> +	BUG_ON(tun->numqueues != 0);
>> +	/* guarantee that any future tun_attach will fail */
>> +	tun->numqueues = MAX_TAP_QUEUES;
>> +	spin_unlock(&tun_lock);
>> +
>> +	synchronize_rcu();
>> +	for (--j; j>= 0; j--)
>> +		sock_put(&tfile_list[j]->sk);
>>   }
>>
>> -static void tun_put(struct tun_struct *tun)
>> +static int tun_attach(struct tun_struct *tun, struct file *file)
>>   {
>> -	struct tun_file *tfile = tun->tfile;
>> +	struct tun_file *tfile = file->private_data;
>> +	int err;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	spin_lock(&tun_lock);
>>
>> -	if (atomic_dec_and_test(&tfile->count))
>> -		tun_detach(tfile->tun);
>> +	err = -EINVAL;
>> +	if (rcu_dereference_protected(tfile->tun, lockdep_is_held(&tun_lock)))
>> +		goto out;
>> +
>> +	err = -EBUSY;
>> +	if (!(tun->flags&  TUN_TAP_MQ)&&  tun->numqueues == 1)
>> +		goto out;
>> +
>> +	if (tun->numqueues == MAX_TAP_QUEUES)
>> +		goto out;
>> +
>> +	err = 0;
>> +	tfile->queue_index = tun->numqueues;
>> +	rcu_assign_pointer(tfile->tun, tun);
>> +	rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
>> +	sock_hold(&tfile->sk);
>> +	tun->numqueues++;
>> +
>> +	if (tun->numqueues == 1)
>> +		netif_carrier_on(tun->dev);
>> +
>> +	/* device is allowed to go away first, so no need to hold extra
>> +	 * refcnt. */
>> +
>> +out:
>> +	spin_unlock(&tun_lock);
>> +	return err;
>>   }
>>
>>   /* TAP filtering */
>> @@ -331,16 +414,7 @@ static const struct ethtool_ops tun_ethtool_ops;
>>   /* Net device detach from fd. */
>>   static void tun_net_uninit(struct net_device *dev)
>>   {
>> -	struct tun_struct *tun = netdev_priv(dev);
>> -	struct tun_file *tfile = tun->tfile;
>> -
>> -	/* Inform the methods they need to stop using the dev.
>> -	 */
>> -	if (tfile) {
>> -		wake_up_all(&tfile->wq.wait);
>> -		if (atomic_dec_and_test(&tfile->count))
>> -			__tun_detach(tun);
>> -	}
>> +	tun_detach_all(dev);
>>   }
>>
>>   /* Net device open. */
>> @@ -360,10 +434,10 @@ static int tun_net_close(struct net_device *dev)
>>   /* Net device start xmit */
>>   static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>> -	struct tun_struct *tun = netdev_priv(dev);
>> -	struct tun_file *tfile = tun->tfile;
>> +	struct tun_file *tfile = NULL;
>>
>> -	tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
>> +	rcu_read_lock();
>> +	tfile = tun_get_queue(dev, skb);
>>
>>   	/* Drop packet if interface is not attached */
>>   	if (!tfile)
>> @@ -381,7 +455,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>>   	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
>>   	>= dev->tx_queue_len) {
>> -		if (!(tun->flags&  TUN_ONE_QUEUE)) {
>> +		if (!(tfile->flags&  TUN_ONE_QUEUE)&&
> Which patch moved flags from tun to tfile?

Patch 1 cache the tun->flags in tfile, but it seems this may let the 
flags out of sync. So we'd better to use the one in tun_struct.
>
>> +		    !(tfile->flags&  TUN_TAP_MQ)) {
>>   			/* Normal queueing mode. */
>>   			/* Packet scheduler handles dropping of further packets. */
>>   			netif_stop_queue(dev);
>> @@ -390,7 +465,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>   			 * error is more appropriate. */
>>   			dev->stats.tx_fifo_errors++;
>>   		} else {
>> -			/* Single queue mode.
>> +			/* Single queue mode or multi queue mode.
>>   			 * Driver handles dropping of all packets itself. */
> Please don't do this. Stop the queue on overrun as appropriate.
> ONE_QUEUE is a legacy hack.
>
> BTW we really should stop queue before we start dropping packets,
> but that can be a separate patch.

The problem here is the using of NETIF_F_LLTX. Kernel could only see one 
queue even for a multiqueue tun/tap. If we use netif_stop_queue(), all 
other queues would be stopped also.
>>   			goto drop;
>>   		}
>> @@ -408,9 +483,11 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>   		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>>   	wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
>>   				   POLLRDNORM | POLLRDBAND);
>> +	rcu_read_unlock();
>>   	return NETDEV_TX_OK;
>>
>>   drop:
>> +	rcu_read_unlock();
>>   	dev->stats.tx_dropped++;
>>   	kfree_skb(skb);
>>   	return NETDEV_TX_OK;
>> @@ -527,16 +604,22 @@ static void tun_net_init(struct net_device *dev)
>>   static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>   {
>>   	struct tun_file *tfile = file->private_data;
>> -	struct tun_struct *tun = __tun_get(tfile);
>> +	struct tun_struct *tun = NULL;
>>   	struct sock *sk;
>>   	unsigned int mask = 0;
>>
>> -	if (!tun)
>> +	if (!tfile)
>>   		return POLLERR;
>>
>> -	sk = tfile->socket.sk;
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun) {
>> +		rcu_read_unlock();
>> +		return POLLERR;
>> +	}
>> +	rcu_read_unlock();
>>
>> -	tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
>> +	sk =&tfile->sk;
>>
>>   	poll_wait(file,&tfile->wq.wait, wait);
>>
>> @@ -548,10 +631,12 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>   	     sock_writeable(sk)))
>>   		mask |= POLLOUT | POLLWRNORM;
>>
>> -	if (tun->dev->reg_state != NETREG_REGISTERED)
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun || tun->dev->reg_state != NETREG_REGISTERED)
>>   		mask = POLLERR;
>> +	rcu_read_unlock();
>>
>> -	tun_put(tun);
>>   	return mask;
>>   }
>>
>> @@ -708,9 +793,12 @@ static ssize_t tun_get_user(struct tun_file *tfile,
>>   		skb_shinfo(skb)->gso_segs = 0;
>>   	}
>>
>> -	tun = __tun_get(tfile);
>> -	if (!tun)
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun) {
>> +		rcu_read_unlock();
>>   		return -EBADFD;
>> +	}
>>
>>   	switch (tfile->flags&  TUN_TYPE_MASK) {
>>   	case TUN_TUN_DEV:
>> @@ -720,26 +808,30 @@ static ssize_t tun_get_user(struct tun_file *tfile,
>>   		skb->protocol = eth_type_trans(skb, tun->dev);
>>   		break;
>>   	}
>> -
>> -	netif_rx_ni(skb);
>>   	tun->dev->stats.rx_packets++;
>>   	tun->dev->stats.rx_bytes += len;
>> -	tun_put(tun);
>> +	rcu_read_unlock();
>> +
>> +	netif_rx_ni(skb);
>> +
>>   	return count;
>>
>>   err_free:
>>   	count = -EINVAL;
>>   	kfree_skb(skb);
>>   err:
>> -	tun = __tun_get(tfile);
>> -	if (!tun)
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun) {
>> +		rcu_read_unlock();
>>   		return -EBADFD;
>> +	}
>>
>>   	if (drop)
>>   		tun->dev->stats.rx_dropped++;
>>   	if (error)
>>   		tun->dev->stats.rx_frame_errors++;
>> -	tun_put(tun);
>> +	rcu_read_unlock();
>>   	return count;
>>   }
>>
>> @@ -833,12 +925,13 @@ static ssize_t tun_put_user(struct tun_file *tfile,
>>   	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
>>   	total += skb->len;
>>
>> -	tun = __tun_get(tfile);
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>>   	if (tun) {
>>   		tun->dev->stats.tx_packets++;
>>   		tun->dev->stats.tx_bytes += len;
>> -		tun_put(tun);
>>   	}
>> +	rcu_read_unlock();
>>
>>   	return total;
>>   }
>> @@ -869,28 +962,31 @@ static ssize_t tun_do_read(struct tun_file *tfile,
>>   				break;
>>   			}
>>
>> -			tun = __tun_get(tfile);
>> +			rcu_read_lock();
>> +			tun = rcu_dereference(tfile->tun);
>>   			if (!tun) {
>> -				ret = -EIO;
>> +				ret = -EBADFD;
> BADFD is for when you get passed something like -1 fd.
> Here fd is OK, it's just in a bad state so you can not do IO.
>

Sure.
>> +				rcu_read_unlock();
>>   				break;
>>   			}
>>   			if (tun->dev->reg_state != NETREG_REGISTERED) {
>>   				ret = -EIO;
>> -				tun_put(tun);
>> +				rcu_read_unlock();
>>   				break;
>>   			}
>> -			tun_put(tun);
>> +			rcu_read_unlock();
>>
>>   			/* Nothing to read, let's sleep */
>>   			schedule();
>>   			continue;
>>   		}
>>
>> -		tun = __tun_get(tfile);
>> +		rcu_read_lock();
>> +		tun = rcu_dereference(tfile->tun);
>>   		if (tun) {
>>   			netif_wake_queue(tun->dev);
>> -			tun_put(tun);
>>   		}
>> +		rcu_read_unlock();
>>
>>   		ret = tun_put_user(tfile, skb, iv, len);
>>   		kfree_skb(skb);
>> @@ -1038,6 +1134,9 @@ static int tun_flags(struct tun_struct *tun)
>>   	if (tun->flags&  TUN_VNET_HDR)
>>   		flags |= IFF_VNET_HDR;
>>
>> +	if (tun->flags&  TUN_TAP_MQ)
>> +		flags |= IFF_MULTI_QUEUE;
>> +
>>   	return flags;
>>   }
>>
>> @@ -1097,8 +1196,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   		err = tun_attach(tun, file);
>>   		if (err<  0)
>>   			return err;
>> -	}
>> -	else {
>> +	} else {
>>   		char *name;
>>   		unsigned long flags = 0;
>>
>> @@ -1142,6 +1240,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   		dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
>>   			TUN_USER_FEATURES;
>>   		dev->features = dev->hw_features;
>> +		if (ifr->ifr_flags&  IFF_MULTI_QUEUE)
>> +			dev->features |= NETIF_F_LLTX;
>>
>>   		err = register_netdevice(tun->dev);
>>   		if (err<  0)
>> @@ -1154,7 +1254,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>
>>   		err = tun_attach(tun, file);
>>   		if (err<  0)
>> -			goto failed;
>> +			goto err_free_dev;
>>   	}
>>
>>   	tun_debug(KERN_INFO, tun, "tun_set_iff\n");
>> @@ -1174,6 +1274,11 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   	else
>>   		tun->flags&= ~TUN_VNET_HDR;
>>
>> +	if (ifr->ifr_flags&  IFF_MULTI_QUEUE)
>> +		tun->flags |= TUN_TAP_MQ;
>> +	else
>> +		tun->flags&= ~TUN_TAP_MQ;
>> +
>>   	/* Cache flags from tun device */
>>   	tfile->flags = tun->flags;
>>   	/* Make sure persistent devices do not get stuck in
>> @@ -1187,7 +1292,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>
>>   err_free_dev:
>>   	free_netdev(dev);
>> -failed:
>>   	return err;
>>   }
>>
>> @@ -1264,38 +1368,40 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   				(unsigned int __user*)argp);
>>   	}
>>
>> -	rtnl_lock();
>> -
>> -	tun = __tun_get(tfile);
>> -	if (cmd == TUNSETIFF&&  !tun) {
>> +	ret = 0;
>> +	if (cmd == TUNSETIFF) {
>> +		rtnl_lock();
>>   		ifr.ifr_name[IFNAMSIZ-1] = '\0';
>> -
>>   		ret = tun_set_iff(tfile->net, file,&ifr);
>> -
>> +		rtnl_unlock();
>>   		if (ret)
>> -			goto unlock;
>> -
>> +			return ret;
>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>> -			ret = -EFAULT;
>> -		goto unlock;
>> +			return -EFAULT;
>> +		return ret;
>>   	}
>>
>> +	rtnl_lock();
>> +
>> +	rcu_read_lock();
>> +
>>   	ret = -EBADFD;
>> +	tun = rcu_dereference(tfile->tun);
>>   	if (!tun)
>>   		goto unlock;
>> +	else
>> +		ret = 0;
>>
>> -	tun_debug(KERN_INFO, tun, "tun_chr_ioctl cmd %d\n", cmd);
>> -
>> -	ret = 0;
>>   	switch (cmd) {
>>   	case TUNGETIFF:
>>   		ret = tun_get_iff(current->nsproxy->net_ns, tun,&ifr);
>> +		rcu_read_unlock();
>>   		if (ret)
>> -			break;
>> +			goto out;
>>
>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>   			ret = -EFAULT;
>> -		break;
>> +		goto out;
>>
>>   	case TUNSETNOCSUM:
>>   		/* Disable/Enable checksum */
>> @@ -1357,9 +1463,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   		/* Get hw address */
>>   		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
>>   		ifr.ifr_hwaddr.sa_family = tun->dev->type;
>> +		rcu_read_unlock();
>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>   			ret = -EFAULT;
>> -		break;
>> +		goto out;
>>
>>   	case SIOCSIFHWADDR:
>>   		/* Set hw address */
>> @@ -1375,9 +1482,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   	}
>>
>>   unlock:
>> +	rcu_read_unlock();
>> +out:
>>   	rtnl_unlock();
>> -	if (tun)
>> -		tun_put(tun);
>>   	return ret;
>>   }
>>
>> @@ -1517,6 +1624,11 @@ out:
>>   	return ret;
>>   }
>>
>> +static void tun_sock_destruct(struct sock *sk)
>> +{
>> +	skb_queue_purge(&sk->sk_receive_queue);
>> +}
>> +
>>   static int tun_chr_open(struct inode *inode, struct file * file)
>>   {
>>   	struct net *net = current->nsproxy->net_ns;
>> @@ -1540,6 +1652,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>   	sock_init_data(&tfile->socket,&tfile->sk);
>>
>>   	tfile->sk.sk_write_space = tun_sock_write_space;
>> +	tfile->sk.sk_destruct = tun_sock_destruct;
>>   	tfile->sk.sk_sndbuf = INT_MAX;
>>   	file->private_data = tfile;
>>
>> @@ -1549,31 +1662,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>   static int tun_chr_close(struct inode *inode, struct file *file)
>>   {
>>   	struct tun_file *tfile = file->private_data;
>> -	struct tun_struct *tun;
>> -
>> -	tun = __tun_get(tfile);
>> -	if (tun) {
>> -		struct net_device *dev = tun->dev;
>> -
>> -		tun_debug(KERN_INFO, tun, "tun_chr_close\n");
>> -
>> -		__tun_detach(tun);
>> -
>> -		/* If desirable, unregister the netdevice. */
>> -		if (!(tun->flags&  TUN_PERSIST)) {
>> -			rtnl_lock();
>> -			if (dev->reg_state == NETREG_REGISTERED)
>> -				unregister_netdevice(dev);
>> -			rtnl_unlock();
>> -		}
>>
>> -		/* drop the reference that netdevice holds */
>> -		sock_put(&tfile->sk);
>> -
>> -	}
>> -
>> -	/* drop the reference that file holds */
>> -	sock_put(&tfile->sk);
>> +	tun_detach(tfile, true);
>>
>>   	return 0;
>>   }
>> @@ -1700,14 +1790,17 @@ static void tun_cleanup(void)
>>    * holding a reference to the file for as long as the socket is in use. */
>>   struct socket *tun_get_socket(struct file *file)
>>   {
>> -	struct tun_struct *tun;
>> +	struct tun_struct *tun = NULL;
>>   	struct tun_file *tfile = file->private_data;
>>   	if (file->f_op !=&tun_fops)
>>   		return ERR_PTR(-EINVAL);
>> -	tun = tun_get(file);
>> -	if (!tun)
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun) {
>> +		rcu_read_unlock();
>>   		return ERR_PTR(-EBADFD);
>> -	tun_put(tun);
>> +	}
>> +	rcu_read_unlock();
>>   	return&tfile->socket;
>>   }
>>   EXPORT_SYMBOL_GPL(tun_get_socket);

^ permalink raw reply

* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: Hans Schillstrom @ 2012-06-26  5:34 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet@gmail.com, subramanian.vijay@gmail.com,
	dave.taht@gmail.com, netdev@vger.kernel.org, ncardwell@google.com,
	therbert@google.com, brouer@redhat.com
In-Reply-To: <20120625.215537.169465424900682764.davem@davemloft.net>

On Tuesday 26 June 2012 06:55:37 David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 26 Jun 2012 06:51:36 +0200
> 
> > On Mon, 2012-06-25 at 15:43 -0700, David Miller wrote:
> > 
> >> I don't agree with this change.
> >> 
> >> What is the point in having real classification configuration if
> >> arbitrary places in the network stack are going to override SKB
> >> priority with a fixed priority setting?
> >> 
> >> I bet the person who set listening socket priority really meant it and
> >> does not expect you to override it.
> > 
> > 
> > If I add a test on listener_sk->sk_priority being 0, would you accept
> > the patch ? If classification is done after tcp stack, it wont be hurt
> > by initial skb priority ?
> 
> It's better than your original patch, but it suffers from the same
> fundamental problem.
> 
> No user is going to expect that TCP on it's own has choosen a
> non-default priority and only for some packet types.  It's completely
> unexpected behavior.
> 
> A SYN flood consumes so much more RX work than the TX for the SYNACK's
> ever can.
> 
> So whilst I understand your desire to handle all elements of this kind
> of attack, this one is reaching too far.
> 

This patch didn't give much in gain actually.
The big cycle consumer during a syn attack is SHA sum right now, 
so from that perspective it's better to add aes crypto (by using AES-NI) 
to the syn cookies instead of SHA sum. Even if only newer x86_64 can use it.

^ permalink raw reply

* RE: [PATCH] netxen : Error return off by one for XG port.
From: Rajesh Borundia @ 2012-06-26  5:06 UTC (permalink / raw)
  To: David Miller
  Cc: santoshprasadnayak@gmail.com, Sony Chacko, netdev,
	kernel-janitors@vger.kernel.org
In-Reply-To: <20120625.152748.779914323352702017.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, June 26, 2012 3:58 AM
> To: Rajesh Borundia
> Cc: santoshprasadnayak@gmail.com; Sony Chacko; netdev; kernel-
> janitors@vger.kernel.org
> Subject: Re: [PATCH] netxen : Error return off by one for XG port.
> 
> From: Rajesh Borundia <rajesh.borundia@qlogic.com>
> Date: Wed, 20 Jun 2012 08:06:04 -0500
> 
> > ______________________________________
> > From: santosh nayak [santoshprasadnayak@gmail.com]
> > Sent: Wednesday, June 20, 2012 4:22 PM
> > To: Sony Chacko; Rajesh Borundia
> > Cc: netdev; kernel-janitors@vger.kernel.org; Santosh Nayak
> > Subject: [PATCH] netxen : Error return off by one for XG port.
> 
> This is not the correct way to submit patches written by other
> people.

I just ack-ed the patch. I did it in incorrect way (formatting in a mail).
Sorry for the trouble.

Rajesh

^ permalink raw reply

* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: David Miller @ 2012-06-26  4:55 UTC (permalink / raw)
  To: eric.dumazet
  Cc: subramanian.vijay, dave.taht, hans.schillstrom, netdev, ncardwell,
	therbert, brouer
In-Reply-To: <1340686296.10893.115.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 26 Jun 2012 06:51:36 +0200

> On Mon, 2012-06-25 at 15:43 -0700, David Miller wrote:
> 
>> I don't agree with this change.
>> 
>> What is the point in having real classification configuration if
>> arbitrary places in the network stack are going to override SKB
>> priority with a fixed priority setting?
>> 
>> I bet the person who set listening socket priority really meant it and
>> does not expect you to override it.
> 
> 
> If I add a test on listener_sk->sk_priority being 0, would you accept
> the patch ? If classification is done after tcp stack, it wont be hurt
> by initial skb priority ?

It's better than your original patch, but it suffers from the same
fundamental problem.

No user is going to expect that TCP on it's own has choosen a
non-default priority and only for some packet types.  It's completely
unexpected behavior.

A SYN flood consumes so much more RX work than the TX for the SYNACK's
ever can.

So whilst I understand your desire to handle all elements of this kind
of attack, this one is reaching too far.

^ permalink raw reply

* Re: [PATCH v2 net-next] tcp: avoid tx starvation by SYNACK packets
From: Eric Dumazet @ 2012-06-26  4:51 UTC (permalink / raw)
  To: David Miller
  Cc: subramanian.vijay, dave.taht, hans.schillstrom, netdev, ncardwell,
	therbert, brouer
In-Reply-To: <20120625.154340.158890441165257041.davem@davemloft.net>

On Mon, 2012-06-25 at 15:43 -0700, David Miller wrote:

> I don't agree with this change.
> 
> What is the point in having real classification configuration if
> arbitrary places in the network stack are going to override SKB
> priority with a fixed priority setting?
> 
> I bet the person who set listening socket priority really meant it and
> does not expect you to override it.


If I add a test on listener_sk->sk_priority being 0, would you accept
the patch ? If classification is done after tcp stack, it wont be hurt
by initial skb priority ?

instead of :

	/* SYNACK sent in SYNCOOKIE mode have low priority */
	skb->priority = nocache ? TC_PRIO_FILLER : sk->sk_priority;

Having :

	/* SYNACK sent in SYNCOOKIE mode have low priority */
	skb->priority = (nocache && !sk->sk_priority) ?
			TC_PRIO_FILLER : sk->sk_priority;

^ permalink raw reply

* Re: [PATCH 04/13] netfilter: regard users as refcount for l4proto's per-net data
From: Gao feng @ 2012-06-26  3:58 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, netfilter-devel
In-Reply-To: <20120625112029.GB4607@1984>

Hi Pablo:
于 2012年06月25日 19:20, Pablo Neira Ayuso 写道:
> On Thu, Jun 21, 2012 at 10:36:41PM +0800, Gao feng wrote:
>> Now, nf_proto_net's users is confusing.
>> we should regard it as the refcount for l4proto's per-net data,
>> because maybe there are two l4protos use the same per-net data.
>>
>> so increment pn->users when nf_conntrack_l4proto_register
>> success, and decrement it for nf_conntrack_l4_unregister case.
>>
>> because nf_conntrack_l3proto_ipv[4|6] don't use the same per-net
>> data,so we don't need to add a refcnt for their per-net data.
>>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>> ---
>>  net/netfilter/nf_conntrack_proto.c |   76 ++++++++++++++++++++++--------------
>>  1 files changed, 46 insertions(+), 30 deletions(-)
>>
>> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
>> index 9d6b6ab..63612e6 100644
>> --- a/net/netfilter/nf_conntrack_proto.c
>> +++ b/net/netfilter/nf_conntrack_proto.c
> [...]
>> @@ -458,23 +446,32 @@ int nf_conntrack_l4proto_register(struct net *net,
>>  				  struct nf_conntrack_l4proto *l4proto)
>>  {
>>  	int ret = 0;
>> +	struct nf_proto_net *pn = NULL;
>>  
>>  	if (l4proto->init_net) {
>>  		ret = l4proto->init_net(net, l4proto->l3proto);
>>  		if (ret < 0)
>> -			return ret;
>> +			goto out;
>>  	}
>>  
>> -	ret = nf_ct_l4proto_register_sysctl(net, l4proto);
>> +	pn = nf_ct_l4proto_net(net, l4proto);
>> +	if (pn == NULL)
>> +		goto out;
> 
> Same thing here, we're leaking memory allocated by l4proto->init_net.

if pn is NULL,init_net can't allocate memory for pn->ctl_table.
So I think it's not memory leak here.

> 
>> +	ret = nf_ct_l4proto_register_sysctl(net, pn, l4proto);
>>  	if (ret < 0)
>> -		return ret;
>> +		goto out;
>>  
>>  	if (net == &init_net) {
>>  		ret = nf_conntrack_l4proto_register_net(l4proto);
>> -		if (ret < 0)
>> -			nf_ct_l4proto_unregister_sysctl(net, l4proto);
>> +		if (ret < 0) {
>> +			nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);
>> +			goto out;
> 
> Better replace the two lines above by:
> 
> goto out_register_net;
> 
> and then...
> 
>> +		}
>>  	}
>>  
>> +	pn->users++;
> 
> out_register_net:
>         nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);
> 
>> +out:
>>  	return ret;
> 
> I think that this change is similar to patch 1/1, I think you should
> send it as a separated patch.
> 

Yes, It looks better.
should I change this and rebase whole patchset or
maybe you just apply this patchset and then I send a cleanup patch to do this?

>>  }
>>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_register);
>> @@ -499,10 +496,18 @@ nf_conntrack_l4proto_unregister_net(struct nf_conntrack_l4proto *l4proto)
>>  void nf_conntrack_l4proto_unregister(struct net *net,
>>  				     struct nf_conntrack_l4proto *l4proto)
>>  {
>> +	struct nf_proto_net *pn = NULL;
>> +
>>  	if (net == &init_net)
>>  		nf_conntrack_l4proto_unregister_net(l4proto);
>>  
>> -	nf_ct_l4proto_unregister_sysctl(net, l4proto);
>> +	pn = nf_ct_l4proto_net(net, l4proto);
>> +	if (pn == NULL)
>> +		return;
>> +
>> +	pn->users--;
>> +	nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);
>> +
>>  	/* Remove all contrack entries for this protocol */
>>  	rtnl_lock();
>>  	nf_ct_iterate_cleanup(net, kill_l4proto, l4proto);
>> @@ -514,11 +519,15 @@ int nf_conntrack_proto_init(struct net *net)
>>  {
>>  	unsigned int i;
>>  	int err;
>> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net,
>> +					&nf_conntrack_l4proto_generic);
>> +
>>  	err = nf_conntrack_l4proto_generic.init_net(net,
>>  					nf_conntrack_l4proto_generic.l3proto);
>>  	if (err < 0)
>>  		return err;
>>  	err = nf_ct_l4proto_register_sysctl(net,
>> +					    pn,
>>  					    &nf_conntrack_l4proto_generic);
>>  	if (err < 0)
>>  		return err;
>> @@ -528,13 +537,20 @@ int nf_conntrack_proto_init(struct net *net)
>>  			rcu_assign_pointer(nf_ct_l3protos[i],
>>  					   &nf_conntrack_l3proto_generic);
>>  	}
>> +
>> +	pn->users++;
>>  	return 0;
>>  }
>>  
>>  void nf_conntrack_proto_fini(struct net *net)
>>  {
>>  	unsigned int i;
>> +	struct nf_proto_net *pn = nf_ct_l4proto_net(net,
>> +					&nf_conntrack_l4proto_generic);
>> +
>> +	pn->users--;
>>  	nf_ct_l4proto_unregister_sysctl(net,
>> +					pn,
>>  					&nf_conntrack_l4proto_generic);
>>  	if (net == &init_net) {
>>  		/* free l3proto protocol tables */
>> -- 
>> 1.7.7.6
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 01/13] netfilter: fix problem with proto register
From: Gao feng @ 2012-06-26  3:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netdev, netfilter-devel
In-Reply-To: <20120625111253.GA4607@1984>

Hi Pablo:

于 2012年06月25日 19:12, Pablo Neira Ayuso 写道:
> On Thu, Jun 21, 2012 at 10:36:38PM +0800, Gao feng wrote:
>> before commit 2c352f444ccfa966a1aa4fd8e9ee29381c467448
>> (netfilter: nf_conntrack: prepare namespace support for
>> l4 protocol trackers), we register sysctl before register
>> protos, so if sysctl is registered faild, the protos will
>> not be registered.
>>
>> but now, we register protos first, and when register
>> sysctl failed, we can use protos too, it's different
>> from before.
> 
> No, this has to be an all-or-nothing game. If one fails, everything
> else that you've registered has to be unregistered.

indeed,this is an all-or-nothing game right now,please look at the ipv4_net_init,
when we register nf_conntrack_l3proto_ipv4 failed,we will unregister the already
registered l4protoes, and in nf_conntrack_l4proto_unregister,we will call
nf_ct_l4proto_unregister_sysctl to free the sysctl table.

> 
>> so change to register sysctl before register protos.
>>
>> Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
>> ---
>>  net/netfilter/nf_conntrack_proto.c |   36 +++++++++++++++++++++++-------------
>>  1 files changed, 23 insertions(+), 13 deletions(-)
>>
>> diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
>> index 1ea9194..9bd88aa 100644
>> --- a/net/netfilter/nf_conntrack_proto.c
>> +++ b/net/netfilter/nf_conntrack_proto.c
>> @@ -253,18 +253,23 @@ int nf_conntrack_l3proto_register(struct net *net,
>>  {
>>  	int ret = 0;
>>  
>> -	if (net == &init_net)
>> -		ret = nf_conntrack_l3proto_register_net(proto);
>> +	if (proto->init_net) {
>> +		ret = proto->init_net(net);
>> +		if (ret < 0)
>> +			return ret;
>> +	}
>>  
>> +	ret = nf_ct_l3proto_register_sysctl(net, proto);
>>  	if (ret < 0)
>>  		return ret;
> 
> This is still wrong.
> 
> If nf_ct_l3proto_register_sysctl fails, we'll leak the memory that has
> been reserved by proto->init_net.
> 

we have freed the memory in nf_ct_l[3,4]proto_register_sysctl when we
call nf_ct_register_sysctl failed, so there is no need to free this memory
in nf_conntrack_l[3,4]proto_register.

>> -	if (proto->init_net) {
>> -		ret = proto->init_net(net);
>> +	if (net == &init_net) {
>> +		ret = nf_conntrack_l3proto_register_net(proto);
>>  		if (ret < 0)
>> -			return ret;
>> +			nf_ct_l3proto_unregister_sysctl(net, proto);
>>  	}
>> -	return nf_ct_l3proto_register_sysctl(net, proto);
>> +
>> +	return ret;
>>  }
>>  EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_register);
>>  
>> @@ -454,19 +459,24 @@ int nf_conntrack_l4proto_register(struct net *net,
>>  				  struct nf_conntrack_l4proto *l4proto)
>>  {
>>  	int ret = 0;
>> -	if (net == &init_net)
>> -		ret = nf_conntrack_l4proto_register_net(l4proto);
>>  
>> -	if (ret < 0)
>> -		return ret;
>> -
>> -	if (l4proto->init_net)
>> +	if (l4proto->init_net) {
>>  		ret = l4proto->init_net(net);
>> +		if (ret < 0)
>> +			return ret;
>> +	}
>>  
>> +	ret = nf_ct_l4proto_register_sysctl(net, l4proto);
>>  	if (ret < 0)
>>  		return ret;
>>  
>> -	return nf_ct_l4proto_register_sysctl(net, l4proto);
>> +	if (net == &init_net) {
>> +		ret = nf_conntrack_l4proto_register_net(l4proto);
>> +		if (ret < 0)
>> +			nf_ct_l4proto_unregister_sysctl(net, l4proto);
>> +	}
>> +
>> +	return ret;
>>  }
>>  EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_register);
>>  
>> -- 
>> 1.7.7.6
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support
From: Jason Wang @ 2012-06-26  3:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: habanero, netdev, linux-kernel, krkumar2, tahm, akong, davem,
	shemminger, mashirle
In-Reply-To: <20120625082553.GC18229@redhat.com>

On 06/25/2012 04:25 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 25, 2012 at 02:10:18PM +0800, Jason Wang wrote:
>> This patch adds multiqueue support for tap device. This is done by abstracting
>> each queue as a file/socket and allowing multiple sockets to be attached to the
>> tuntap device (an array of tun_file were stored in the tun_struct). Userspace
>> could write and read from those files to do the parallel packet
>> sending/receiving.
>>
>> Unlike the previous single queue implementation, the socket and device were
>> loosely coupled, each of them were allowed to go away first. In order to let the
>> tx path lockless, netif_tx_loch_bh() is replaced by RCU/NETIF_F_LLTX to
>> synchronize between data path and system call.
> Don't use LLTX/RCU. It's not worth it.
> Use something like netif_set_real_num_tx_queues.
>
>> The tx queue selecting is first based on the recorded rxq index of an skb, it
>> there's no such one, then choosing based on rx hashing (skb_get_rxhash()).
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Interestingly macvtap switched to hashing first:
> ef0002b577b52941fb147128f30bd1ecfdd3ff6d
> (the commit log is corrupted but see what it
> does in the patch).
> Any idea why?

Yes, so tap should be changed to behave same as macvtap. I remember the 
reason we do that is to make sure the packet of a single flow to be 
queued to a fixed socket/virtqueues. As 10g cards like ixgbe choose the 
rx queue for a flow based on the last tx queue where the packets of that 
flow comes. So if we are using recored rx queue in macvtap, the queue 
index of a flow would change as vhost thread moves amongs processors.

But during test tun/tap, one interesting thing I find is that even ixgbe 
has recorded the queue index during rx, it seems be lost when tap tries 
to transmit skbs to userspace.

>> ---
>>   drivers/net/tun.c |  371 +++++++++++++++++++++++++++++++++--------------------
>>   1 files changed, 232 insertions(+), 139 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 8233b0a..5c26757 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -107,6 +107,8 @@ struct tap_filter {
>>   	unsigned char	addr[FLT_EXACT_COUNT][ETH_ALEN];
>>   };
>>
>> +#define MAX_TAP_QUEUES (NR_CPUS<  16 ? NR_CPUS : 16)
> Why the limit? I am guessing you copied this from macvtap?
> This is problematic for a number of reasons:
> 	- will not play well with migration
> 	- will not work well for a large guest
>
> Yes, macvtap needs to be fixed too.
>
> I am guessing what it is trying to prevent is queueing
> up a huge number of packets?
> So just divide the default tx queue limit by the # of queues.
>
> And by the way, for MQ applications maybe we can finally
> ignore tx queue altogether and limit the total number
> of bytes queued?
> To avoid regressions we can make it large like 64M/# queues.
> Could be a separate patch I think, and for a single queue
> might need a compatible mode though I am not sure.
>
>> +
>>   struct tun_file {
>>   	struct sock sk;
>>   	struct socket socket;
>> @@ -114,16 +116,18 @@ struct tun_file {
>>   	int vnet_hdr_sz;
>>   	struct tap_filter txflt;
>>   	atomic_t count;
>> -	struct tun_struct *tun;
>> +	struct tun_struct __rcu *tun;
>>   	struct net *net;
>>   	struct fasync_struct *fasync;
>>   	unsigned int flags;
>> +	u16 queue_index;
>>   };
>>
>>   struct tun_sock;
>>
>>   struct tun_struct {
>> -	struct tun_file		*tfile;
>> +	struct tun_file		*tfiles[MAX_TAP_QUEUES];
>> +	unsigned int            numqueues;
>>   	unsigned int 		flags;
>>   	uid_t			owner;
>>   	gid_t			group;
>> @@ -138,80 +142,159 @@ struct tun_struct {
>>   #endif
>>   };
>>
>> -static int tun_attach(struct tun_struct *tun, struct file *file)
>> +static DEFINE_SPINLOCK(tun_lock);
>> +
>> +/*
>> + * tun_get_queue(): calculate the queue index
>> + *     - if skbs comes from mq nics, we can just borrow
>> + *     - if not, calculate from the hash
>> + */
>> +static struct tun_file *tun_get_queue(struct net_device *dev,
>> +				      struct sk_buff *skb)
>>   {
>> -	struct tun_file *tfile = file->private_data;
>> -	int err;
>> +	struct tun_struct *tun = netdev_priv(dev);
>> +	struct tun_file *tfile = NULL;
>> +	int numqueues = tun->numqueues;
>> +	__u32 rxq;
>>
>> -	ASSERT_RTNL();
>> +	BUG_ON(!rcu_read_lock_held());
>>
>> -	netif_tx_lock_bh(tun->dev);
>> +	if (!numqueues)
>> +		goto out;
>>
>> -	err = -EINVAL;
>> -	if (tfile->tun)
>> +	if (numqueues == 1) {
>> +		tfile = rcu_dereference(tun->tfiles[0]);
> Instead of hacks like this, you can ask for an MQ
> flag to be set in SETIFF. Then you won't need to
> handle attach/detach at random times.
> And most of the scary num_queues checks can go away.
> You can then also ask userspace about the max # of queues
> to expect if you want to save some memory.
>
>
>>   		goto out;
>> +	}
>>
>> -	err = -EBUSY;
>> -	if (tun->tfile)
>> +	if (likely(skb_rx_queue_recorded(skb))) {
>> +		rxq = skb_get_rx_queue(skb);
>> +
>> +		while (unlikely(rxq>= numqueues))
>> +			rxq -= numqueues;
>> +
>> +		tfile = rcu_dereference(tun->tfiles[rxq]);
>>   		goto out;
>> +	}
>>
>> -	err = 0;
>> -	tfile->tun = tun;
>> -	tun->tfile = tfile;
>> -	netif_carrier_on(tun->dev);
>> -	dev_hold(tun->dev);
>> -	sock_hold(&tfile->sk);
>> -	atomic_inc(&tfile->count);
>> +	/* Check if we can use flow to select a queue */
>> +	rxq = skb_get_rxhash(skb);
>> +	if (rxq) {
>> +		u32 idx = ((u64)rxq * numqueues)>>  32;
> This completely confuses me. What's the logic here?
> How do we even know it's in range?
>
>> +		tfile = rcu_dereference(tun->tfiles[idx]);
>> +		goto out;
>> +	}
>>
>> +	tfile = rcu_dereference(tun->tfiles[0]);
>>   out:
>> -	netif_tx_unlock_bh(tun->dev);
>> -	return err;
>> +	return tfile;
>>   }
>>
>> -static void __tun_detach(struct tun_struct *tun)
>> +static int tun_detach(struct tun_file *tfile, bool clean)
>>   {
>> -	struct tun_file *tfile = tun->tfile;
>> -	/* Detach from net device */
>> -	netif_tx_lock_bh(tun->dev);
>> -	netif_carrier_off(tun->dev);
>> -	tun->tfile = NULL;
>> -	netif_tx_unlock_bh(tun->dev);
>> -
>> -	/* Drop read queue */
>> -	skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
>> -
>> -	/* Drop the extra count on the net device */
>> -	dev_put(tun->dev);
>> -}
>> +	struct tun_struct *tun;
>> +	struct net_device *dev = NULL;
>> +	bool destroy = false;
>>
>> -static void tun_detach(struct tun_struct *tun)
>> -{
>> -	rtnl_lock();
>> -	__tun_detach(tun);
>> -	rtnl_unlock();
>> -}
>> +	spin_lock(&tun_lock);
>>
>> -static struct tun_struct *__tun_get(struct tun_file *tfile)
>> -{
>> -	struct tun_struct *tun = NULL;
>> +	tun = rcu_dereference_protected(tfile->tun,
>> +					lockdep_is_held(&tun_lock));
>> +	if (tun) {
>> +		u16 index = tfile->queue_index;
>> +		BUG_ON(index>= tun->numqueues);
>> +		dev = tun->dev;
>> +
>> +		rcu_assign_pointer(tun->tfiles[index],
>> +				   tun->tfiles[tun->numqueues - 1]);
>> +		tun->tfiles[index]->queue_index = index;
>> +		rcu_assign_pointer(tfile->tun, NULL);
>> +		--tun->numqueues;
>> +		sock_put(&tfile->sk);
>>
>> -	if (atomic_inc_not_zero(&tfile->count))
>> -		tun = tfile->tun;
>> +		if (tun->numqueues == 0&&  !(tun->flags&  TUN_PERSIST))
>> +			destroy = true;
> Please don't use flags like that. Use dedicated labels and goto there on error.
>
>
>> +	}
>>
>> -	return tun;
>> +	spin_unlock(&tun_lock);
>> +
>> +	synchronize_rcu();
>> +	if (clean)
>> +		sock_put(&tfile->sk);
>> +
>> +	if (destroy) {
>> +		rtnl_lock();
>> +		if (dev->reg_state == NETREG_REGISTERED)
>> +			unregister_netdevice(dev);
>> +		rtnl_unlock();
>> +	}
>> +
>> +	return 0;
>>   }
>>
>> -static struct tun_struct *tun_get(struct file *file)
>> +static void tun_detach_all(struct net_device *dev)
>>   {
>> -	return __tun_get(file->private_data);
>> +	struct tun_struct *tun = netdev_priv(dev);
>> +	struct tun_file *tfile, *tfile_list[MAX_TAP_QUEUES];
>> +	int i, j = 0;
>> +
>> +	spin_lock(&tun_lock);
>> +
>> +	for (i = 0; i<  MAX_TAP_QUEUES&&  tun->numqueues; i++) {
>> +		tfile = rcu_dereference_protected(tun->tfiles[i],
>> +						lockdep_is_held(&tun_lock));
>> +		BUG_ON(!tfile);
>> +		wake_up_all(&tfile->wq.wait);
>> +		tfile_list[j++] = tfile;
>> +		rcu_assign_pointer(tfile->tun, NULL);
>> +		--tun->numqueues;
>> +	}
>> +	BUG_ON(tun->numqueues != 0);
>> +	/* guarantee that any future tun_attach will fail */
>> +	tun->numqueues = MAX_TAP_QUEUES;
>> +	spin_unlock(&tun_lock);
>> +
>> +	synchronize_rcu();
>> +	for (--j; j>= 0; j--)
>> +		sock_put(&tfile_list[j]->sk);
>>   }
>>
>> -static void tun_put(struct tun_struct *tun)
>> +static int tun_attach(struct tun_struct *tun, struct file *file)
>>   {
>> -	struct tun_file *tfile = tun->tfile;
>> +	struct tun_file *tfile = file->private_data;
>> +	int err;
>> +
>> +	ASSERT_RTNL();
>> +
>> +	spin_lock(&tun_lock);
>>
>> -	if (atomic_dec_and_test(&tfile->count))
>> -		tun_detach(tfile->tun);
>> +	err = -EINVAL;
>> +	if (rcu_dereference_protected(tfile->tun, lockdep_is_held(&tun_lock)))
>> +		goto out;
>> +
>> +	err = -EBUSY;
>> +	if (!(tun->flags&  TUN_TAP_MQ)&&  tun->numqueues == 1)
>> +		goto out;
>> +
>> +	if (tun->numqueues == MAX_TAP_QUEUES)
>> +		goto out;
>> +
>> +	err = 0;
>> +	tfile->queue_index = tun->numqueues;
>> +	rcu_assign_pointer(tfile->tun, tun);
>> +	rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
>> +	sock_hold(&tfile->sk);
>> +	tun->numqueues++;
>> +
>> +	if (tun->numqueues == 1)
>> +		netif_carrier_on(tun->dev);
>> +
>> +	/* device is allowed to go away first, so no need to hold extra
>> +	 * refcnt. */
>> +
>> +out:
>> +	spin_unlock(&tun_lock);
>> +	return err;
>>   }
>>
>>   /* TAP filtering */
>> @@ -331,16 +414,7 @@ static const struct ethtool_ops tun_ethtool_ops;
>>   /* Net device detach from fd. */
>>   static void tun_net_uninit(struct net_device *dev)
>>   {
>> -	struct tun_struct *tun = netdev_priv(dev);
>> -	struct tun_file *tfile = tun->tfile;
>> -
>> -	/* Inform the methods they need to stop using the dev.
>> -	 */
>> -	if (tfile) {
>> -		wake_up_all(&tfile->wq.wait);
>> -		if (atomic_dec_and_test(&tfile->count))
>> -			__tun_detach(tun);
>> -	}
>> +	tun_detach_all(dev);
>>   }
>>
>>   /* Net device open. */
>> @@ -360,10 +434,10 @@ static int tun_net_close(struct net_device *dev)
>>   /* Net device start xmit */
>>   static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>   {
>> -	struct tun_struct *tun = netdev_priv(dev);
>> -	struct tun_file *tfile = tun->tfile;
>> +	struct tun_file *tfile = NULL;
>>
>> -	tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
>> +	rcu_read_lock();
>> +	tfile = tun_get_queue(dev, skb);
>>
>>   	/* Drop packet if interface is not attached */
>>   	if (!tfile)
>> @@ -381,7 +455,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>>   	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
>>   	>= dev->tx_queue_len) {
>> -		if (!(tun->flags&  TUN_ONE_QUEUE)) {
>> +		if (!(tfile->flags&  TUN_ONE_QUEUE)&&
> Which patch moved flags from tun to tfile?
>
>> +		    !(tfile->flags&  TUN_TAP_MQ)) {
>>   			/* Normal queueing mode. */
>>   			/* Packet scheduler handles dropping of further packets. */
>>   			netif_stop_queue(dev);
>> @@ -390,7 +465,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>   			 * error is more appropriate. */
>>   			dev->stats.tx_fifo_errors++;
>>   		} else {
>> -			/* Single queue mode.
>> +			/* Single queue mode or multi queue mode.
>>   			 * Driver handles dropping of all packets itself. */
> Please don't do this. Stop the queue on overrun as appropriate.
> ONE_QUEUE is a legacy hack.
>
> BTW we really should stop queue before we start dropping packets,
> but that can be a separate patch.
>
>>   			goto drop;
>>   		}
>> @@ -408,9 +483,11 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>   		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>>   	wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
>>   				   POLLRDNORM | POLLRDBAND);
>> +	rcu_read_unlock();
>>   	return NETDEV_TX_OK;
>>
>>   drop:
>> +	rcu_read_unlock();
>>   	dev->stats.tx_dropped++;
>>   	kfree_skb(skb);
>>   	return NETDEV_TX_OK;
>> @@ -527,16 +604,22 @@ static void tun_net_init(struct net_device *dev)
>>   static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>   {
>>   	struct tun_file *tfile = file->private_data;
>> -	struct tun_struct *tun = __tun_get(tfile);
>> +	struct tun_struct *tun = NULL;
>>   	struct sock *sk;
>>   	unsigned int mask = 0;
>>
>> -	if (!tun)
>> +	if (!tfile)
>>   		return POLLERR;
>>
>> -	sk = tfile->socket.sk;
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun) {
>> +		rcu_read_unlock();
>> +		return POLLERR;
>> +	}
>> +	rcu_read_unlock();
>>
>> -	tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
>> +	sk =&tfile->sk;
>>
>>   	poll_wait(file,&tfile->wq.wait, wait);
>>
>> @@ -548,10 +631,12 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>   	     sock_writeable(sk)))
>>   		mask |= POLLOUT | POLLWRNORM;
>>
>> -	if (tun->dev->reg_state != NETREG_REGISTERED)
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun || tun->dev->reg_state != NETREG_REGISTERED)
>>   		mask = POLLERR;
>> +	rcu_read_unlock();
>>
>> -	tun_put(tun);
>>   	return mask;
>>   }
>>
>> @@ -708,9 +793,12 @@ static ssize_t tun_get_user(struct tun_file *tfile,
>>   		skb_shinfo(skb)->gso_segs = 0;
>>   	}
>>
>> -	tun = __tun_get(tfile);
>> -	if (!tun)
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun) {
>> +		rcu_read_unlock();
>>   		return -EBADFD;
>> +	}
>>
>>   	switch (tfile->flags&  TUN_TYPE_MASK) {
>>   	case TUN_TUN_DEV:
>> @@ -720,26 +808,30 @@ static ssize_t tun_get_user(struct tun_file *tfile,
>>   		skb->protocol = eth_type_trans(skb, tun->dev);
>>   		break;
>>   	}
>> -
>> -	netif_rx_ni(skb);
>>   	tun->dev->stats.rx_packets++;
>>   	tun->dev->stats.rx_bytes += len;
>> -	tun_put(tun);
>> +	rcu_read_unlock();
>> +
>> +	netif_rx_ni(skb);
>> +
>>   	return count;
>>
>>   err_free:
>>   	count = -EINVAL;
>>   	kfree_skb(skb);
>>   err:
>> -	tun = __tun_get(tfile);
>> -	if (!tun)
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun) {
>> +		rcu_read_unlock();
>>   		return -EBADFD;
>> +	}
>>
>>   	if (drop)
>>   		tun->dev->stats.rx_dropped++;
>>   	if (error)
>>   		tun->dev->stats.rx_frame_errors++;
>> -	tun_put(tun);
>> +	rcu_read_unlock();
>>   	return count;
>>   }
>>
>> @@ -833,12 +925,13 @@ static ssize_t tun_put_user(struct tun_file *tfile,
>>   	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
>>   	total += skb->len;
>>
>> -	tun = __tun_get(tfile);
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>>   	if (tun) {
>>   		tun->dev->stats.tx_packets++;
>>   		tun->dev->stats.tx_bytes += len;
>> -		tun_put(tun);
>>   	}
>> +	rcu_read_unlock();
>>
>>   	return total;
>>   }
>> @@ -869,28 +962,31 @@ static ssize_t tun_do_read(struct tun_file *tfile,
>>   				break;
>>   			}
>>
>> -			tun = __tun_get(tfile);
>> +			rcu_read_lock();
>> +			tun = rcu_dereference(tfile->tun);
>>   			if (!tun) {
>> -				ret = -EIO;
>> +				ret = -EBADFD;
> BADFD is for when you get passed something like -1 fd.
> Here fd is OK, it's just in a bad state so you can not do IO.
>
>
>> +				rcu_read_unlock();
>>   				break;
>>   			}
>>   			if (tun->dev->reg_state != NETREG_REGISTERED) {
>>   				ret = -EIO;
>> -				tun_put(tun);
>> +				rcu_read_unlock();
>>   				break;
>>   			}
>> -			tun_put(tun);
>> +			rcu_read_unlock();
>>
>>   			/* Nothing to read, let's sleep */
>>   			schedule();
>>   			continue;
>>   		}
>>
>> -		tun = __tun_get(tfile);
>> +		rcu_read_lock();
>> +		tun = rcu_dereference(tfile->tun);
>>   		if (tun) {
>>   			netif_wake_queue(tun->dev);
>> -			tun_put(tun);
>>   		}
>> +		rcu_read_unlock();
>>
>>   		ret = tun_put_user(tfile, skb, iv, len);
>>   		kfree_skb(skb);
>> @@ -1038,6 +1134,9 @@ static int tun_flags(struct tun_struct *tun)
>>   	if (tun->flags&  TUN_VNET_HDR)
>>   		flags |= IFF_VNET_HDR;
>>
>> +	if (tun->flags&  TUN_TAP_MQ)
>> +		flags |= IFF_MULTI_QUEUE;
>> +
>>   	return flags;
>>   }
>>
>> @@ -1097,8 +1196,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   		err = tun_attach(tun, file);
>>   		if (err<  0)
>>   			return err;
>> -	}
>> -	else {
>> +	} else {
>>   		char *name;
>>   		unsigned long flags = 0;
>>
>> @@ -1142,6 +1240,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   		dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
>>   			TUN_USER_FEATURES;
>>   		dev->features = dev->hw_features;
>> +		if (ifr->ifr_flags&  IFF_MULTI_QUEUE)
>> +			dev->features |= NETIF_F_LLTX;
>>
>>   		err = register_netdevice(tun->dev);
>>   		if (err<  0)
>> @@ -1154,7 +1254,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>
>>   		err = tun_attach(tun, file);
>>   		if (err<  0)
>> -			goto failed;
>> +			goto err_free_dev;
>>   	}
>>
>>   	tun_debug(KERN_INFO, tun, "tun_set_iff\n");
>> @@ -1174,6 +1274,11 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>   	else
>>   		tun->flags&= ~TUN_VNET_HDR;
>>
>> +	if (ifr->ifr_flags&  IFF_MULTI_QUEUE)
>> +		tun->flags |= TUN_TAP_MQ;
>> +	else
>> +		tun->flags&= ~TUN_TAP_MQ;
>> +
>>   	/* Cache flags from tun device */
>>   	tfile->flags = tun->flags;
>>   	/* Make sure persistent devices do not get stuck in
>> @@ -1187,7 +1292,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>
>>   err_free_dev:
>>   	free_netdev(dev);
>> -failed:
>>   	return err;
>>   }
>>
>> @@ -1264,38 +1368,40 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   				(unsigned int __user*)argp);
>>   	}
>>
>> -	rtnl_lock();
>> -
>> -	tun = __tun_get(tfile);
>> -	if (cmd == TUNSETIFF&&  !tun) {
>> +	ret = 0;
>> +	if (cmd == TUNSETIFF) {
>> +		rtnl_lock();
>>   		ifr.ifr_name[IFNAMSIZ-1] = '\0';
>> -
>>   		ret = tun_set_iff(tfile->net, file,&ifr);
>> -
>> +		rtnl_unlock();
>>   		if (ret)
>> -			goto unlock;
>> -
>> +			return ret;
>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>> -			ret = -EFAULT;
>> -		goto unlock;
>> +			return -EFAULT;
>> +		return ret;
>>   	}
>>
>> +	rtnl_lock();
>> +
>> +	rcu_read_lock();
>> +
>>   	ret = -EBADFD;
>> +	tun = rcu_dereference(tfile->tun);
>>   	if (!tun)
>>   		goto unlock;
>> +	else
>> +		ret = 0;
>>
>> -	tun_debug(KERN_INFO, tun, "tun_chr_ioctl cmd %d\n", cmd);
>> -
>> -	ret = 0;
>>   	switch (cmd) {
>>   	case TUNGETIFF:
>>   		ret = tun_get_iff(current->nsproxy->net_ns, tun,&ifr);
>> +		rcu_read_unlock();
>>   		if (ret)
>> -			break;
>> +			goto out;
>>
>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>   			ret = -EFAULT;
>> -		break;
>> +		goto out;
>>
>>   	case TUNSETNOCSUM:
>>   		/* Disable/Enable checksum */
>> @@ -1357,9 +1463,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   		/* Get hw address */
>>   		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
>>   		ifr.ifr_hwaddr.sa_family = tun->dev->type;
>> +		rcu_read_unlock();
>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>   			ret = -EFAULT;
>> -		break;
>> +		goto out;
>>
>>   	case SIOCSIFHWADDR:
>>   		/* Set hw address */
>> @@ -1375,9 +1482,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>   	}
>>
>>   unlock:
>> +	rcu_read_unlock();
>> +out:
>>   	rtnl_unlock();
>> -	if (tun)
>> -		tun_put(tun);
>>   	return ret;
>>   }
>>
>> @@ -1517,6 +1624,11 @@ out:
>>   	return ret;
>>   }
>>
>> +static void tun_sock_destruct(struct sock *sk)
>> +{
>> +	skb_queue_purge(&sk->sk_receive_queue);
>> +}
>> +
>>   static int tun_chr_open(struct inode *inode, struct file * file)
>>   {
>>   	struct net *net = current->nsproxy->net_ns;
>> @@ -1540,6 +1652,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>   	sock_init_data(&tfile->socket,&tfile->sk);
>>
>>   	tfile->sk.sk_write_space = tun_sock_write_space;
>> +	tfile->sk.sk_destruct = tun_sock_destruct;
>>   	tfile->sk.sk_sndbuf = INT_MAX;
>>   	file->private_data = tfile;
>>
>> @@ -1549,31 +1662,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>   static int tun_chr_close(struct inode *inode, struct file *file)
>>   {
>>   	struct tun_file *tfile = file->private_data;
>> -	struct tun_struct *tun;
>> -
>> -	tun = __tun_get(tfile);
>> -	if (tun) {
>> -		struct net_device *dev = tun->dev;
>> -
>> -		tun_debug(KERN_INFO, tun, "tun_chr_close\n");
>> -
>> -		__tun_detach(tun);
>> -
>> -		/* If desirable, unregister the netdevice. */
>> -		if (!(tun->flags&  TUN_PERSIST)) {
>> -			rtnl_lock();
>> -			if (dev->reg_state == NETREG_REGISTERED)
>> -				unregister_netdevice(dev);
>> -			rtnl_unlock();
>> -		}
>>
>> -		/* drop the reference that netdevice holds */
>> -		sock_put(&tfile->sk);
>> -
>> -	}
>> -
>> -	/* drop the reference that file holds */
>> -	sock_put(&tfile->sk);
>> +	tun_detach(tfile, true);
>>
>>   	return 0;
>>   }
>> @@ -1700,14 +1790,17 @@ static void tun_cleanup(void)
>>    * holding a reference to the file for as long as the socket is in use. */
>>   struct socket *tun_get_socket(struct file *file)
>>   {
>> -	struct tun_struct *tun;
>> +	struct tun_struct *tun = NULL;
>>   	struct tun_file *tfile = file->private_data;
>>   	if (file->f_op !=&tun_fops)
>>   		return ERR_PTR(-EINVAL);
>> -	tun = tun_get(file);
>> -	if (!tun)
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (!tun) {
>> +		rcu_read_unlock();
>>   		return ERR_PTR(-EBADFD);
>> -	tun_put(tun);
>> +	}
>> +	rcu_read_unlock();
>>   	return&tfile->socket;
>>   }
>>   EXPORT_SYMBOL_GPL(tun_get_socket);

^ permalink raw reply

* Re: [PATCH] Allow receiving packets on the fallback tunnel if they pass sanity checks
From: Phil Dibowitz @ 2012-06-26  3:37 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, phild
In-Reply-To: <20120620.210405.2231549940491911080.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 2435 bytes --]

On 06/20/2012 09:04 PM, David Miller wrote:
> From: Phil Dibowitz <phil@ipom.com>
> Date: Tue, 5 Jun 2012 08:40:58 -0700
> 
>> From b413062771afbba064ae9bc49b5daed7abb1243d Mon Sep 17 00:00:00 2001
>> From: Ville Nuorvala <ville.nuorvala@gmail.com>
>> Subject: [PATCH] Allow receiving packets on the fallback tunnel if they pass sanity checks
>>
>> The same IPv6 address checks are performed as with any normal tunnel,
>> but as the fallback tunnel endpoint addresses are unspecified, the checks
>> must be performed on a per-packet basis, rather than at tunnel
>> configuration time.
>>
>> Signed-off-by: Ville Nuorvala <ville.nuorvala@gmail.com>
>> Tested-by: Phil Dibowitz <phil@ipom.com>
> 
> I've reviewed this change but I still have no idea why it's
> necessary.
> 
> You need to compose a more lengthy and detailed commit log message
> explaining everything before I'm going to consider applying this
> patch.
> 
> You can't just say "we have some problem at Facebook, this patch fixes
> it", and then merely describe word by word the content of the patch
> without explaining the "why".  That simply doesn't cut it.

Sure. Sorry, I just kept Ville's patch description.

We do Layer-3 DSR via IP-in-IP tunneling. Our load balancers wrap an extra IP
header on incoming packets so they can be routed to the backend. In the v4
tunnel driver, when these packets fall on the default tunl0 device, the
behavior is to decapsulate them and drop them back on the stack. So our setup
is that tunl0 has the VIP and eth0 has (obviously) the backend's real address.

In IPv6 we do the same thing, but the v6 tunnel driver didn't have this same
behavior - if you didn't have an explicit tunnel setup, it would drop the packet.

This patch brings that v4 feature to the v6 driver.

I think that's the level of detail you're looking for, but I'm happy to expand
on anything in particular. I also break this down in tons of detail here:

https://www.facebook.com/notes/facebook-engineering/under-the-hood-network-implementation-for-world-ipv6-launch/10150873176303920

-- 
Phil Dibowitz                             phil@ipom.com
Open Source software and tech docs        Insanity Palace of Metallica
http://www.phildev.net/                   http://www.ipom.com/

"Be who you are and say what you feel, because those who mind don't matter
 and those who matter don't mind."
 - Dr. Seuss



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

^ permalink raw reply

* Re: [PATCH v2] net/sh-eth: Add support selecting MII function for SH7734 and R8A7740
From: Nobuhiro Iwamatsu @ 2012-06-26  3:35 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev
In-Reply-To: <2360907.koU5zCUYMQ@flexo>

Hi,

Thank you for your review.

Florian Fainelli さんは書きました:
> On Tuesday 12 June 2012 16:29:02 Nobuhiro Iwamatsu wrote:
>> Ethernet IP of SH7734 and R8A7740 has selecting MII register.
>> The user needs to change a value according to MII to be used.
>> This adds the function to change the value of this register.
>>
>> Signed-off-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
>> ---
>>  V2: Fix the check by select_mii.
>>  drivers/net/ethernet/renesas/sh_eth.c |  106 
> ++++++++++++++++++++-------------
>>  drivers/net/ethernet/renesas/sh_eth.h |    1 +
>>  2 files changed, 65 insertions(+), 42 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
> b/drivers/net/ethernet/renesas/sh_eth.c
>> index be3c221..5358804 100644
>> --- a/drivers/net/ethernet/renesas/sh_eth.c
>> +++ b/drivers/net/ethernet/renesas/sh_eth.c
>> @@ -49,6 +49,33 @@
>>  		NETIF_MSG_RX_ERR| \
>>  		NETIF_MSG_TX_ERR)
>>  
>> +#if defined(CONFIG_CPU_SUBTYPE_SH7734) || defined(CONFIG_CPU_SUBTYPE_SH7763) 
> || \
>> +	defined(CONFIG_ARCH_R8A7740)
>> +static void sh_eth_select_mii(struct net_device *ndev)
>> +{
>> +	u32 value = 0x0;
>> +	struct sh_eth_private *mdp = netdev_priv(ndev);
>> +
>> +	switch (mdp->phy_interface) {
>> +	case PHY_INTERFACE_MODE_GMII:
>> +		value = 0x2;
>> +		break;
>> +	case PHY_INTERFACE_MODE_MII:
>> +		value = 0x1;
>> +		break;
>> +	case PHY_INTERFACE_MODE_RMII:
>> +		value = 0x0;
>> +		break;
>> +	default:
>> +		pr_warn("PHY interface mode was not setup. Set to MII.\n");
>> +		value = 0x1;
>> +		break;
>> +	}
>> +
>> +	sh_eth_write(ndev, value, RMII_MII);
>> +}
>> +#endif
>> +
>>  /* There is CPU dependent code */
>>  #if defined(CONFIG_CPU_SUBTYPE_SH7724)
>>  #define SH_ETH_RESET_DEFAULT	1
>> @@ -283,6 +310,7 @@ static struct sh_eth_cpu_data 
> *sh_eth_get_cpu_data(struct sh_eth_private *mdp)
>>  #elif defined(CONFIG_CPU_SUBTYPE_SH7734) || 
> defined(CONFIG_CPU_SUBTYPE_SH7763)
>>  #define SH_ETH_HAS_TSU	1
>>  static void sh_eth_reset_hw_crc(struct net_device *ndev);
>> +
>>  static void sh_eth_chip_reset(struct net_device *ndev)
>>  {
>>  	struct sh_eth_private *mdp = netdev_priv(ndev);
>> @@ -292,35 +320,6 @@ static void sh_eth_chip_reset(struct net_device *ndev)
>>  	mdelay(1);
>>  }
>>  
>> -static void sh_eth_reset(struct net_device *ndev)
>> -{
>> -	int cnt = 100;
>> -
>> -	sh_eth_write(ndev, EDSR_ENALL, EDSR);
>> -	sh_eth_write(ndev, sh_eth_read(ndev, EDMR) | EDMR_SRST_GETHER, EDMR);
>> -	while (cnt > 0) {
>> -		if (!(sh_eth_read(ndev, EDMR) & 0x3))
>> -			break;
>> -		mdelay(1);
>> -		cnt--;
>> -	}
>> -	if (cnt == 0)
>> -		printk(KERN_ERR "Device reset fail\n");
>> -
>> -	/* Table Init */
>> -	sh_eth_write(ndev, 0x0, TDLAR);
>> -	sh_eth_write(ndev, 0x0, TDFAR);
>> -	sh_eth_write(ndev, 0x0, TDFXR);
>> -	sh_eth_write(ndev, 0x0, TDFFR);
>> -	sh_eth_write(ndev, 0x0, RDLAR);
>> -	sh_eth_write(ndev, 0x0, RDFAR);
>> -	sh_eth_write(ndev, 0x0, RDFXR);
>> -	sh_eth_write(ndev, 0x0, RDFFR);
>> -
>> -	/* Reset HW CRC register */
>> -	sh_eth_reset_hw_crc(ndev);
>> -}
>> -
>>  static void sh_eth_set_duplex(struct net_device *ndev)
>>  {
>>  	struct sh_eth_private *mdp = netdev_priv(ndev);
>> @@ -377,9 +376,43 @@ static struct sh_eth_cpu_data sh_eth_my_cpu_data = {
>>  	.tsu		= 1,
>>  #if defined(CONFIG_CPU_SUBTYPE_SH7734)
>>  	.hw_crc     = 1,
>> +	.select_mii = 1,
>>  #endif
>>  };
>>  
>> +static void sh_eth_reset(struct net_device *ndev)
>> +{
>> +	int cnt = 100;
>> +
>> +	sh_eth_write(ndev, EDSR_ENALL, EDSR);
>> +	sh_eth_write(ndev, sh_eth_read(ndev, EDMR) | EDMR_SRST_GETHER, EDMR);
>> +	while (cnt > 0) {
>> +		if (!(sh_eth_read(ndev, EDMR) & 0x3))
>> +			break;
>> +		mdelay(1);
>> +		cnt--;
>> +	}
>> +	if (cnt == 0)
>> +		printk(KERN_ERR "Device reset fail\n");
> 
> It looks like this would need a subsequent fix. Failing to reset the adapter 
> and just erroring out and not returning an error looks obviously wrong. Since 
> sh_eth_reset() is called in sh_eth_dev_init() which does return an int, 
> propagate the error back to the caller.

Yes, you are right. I will fix your point and send a patch.

Best regards,
   Nobuhiro

^ permalink raw reply

* [PATCH] net/sh-eth: Check return value of sh_eth_reset when chip reset fail
From: Nobuhiro Iwamatsu @ 2012-06-26  3:35 UTC (permalink / raw)
  To: netdev; +Cc: florian, Nobuhiro Iwamatsu

The sh_eth_reset function resets chip, but this performs nothing when failed.
This changes sh_eth_reset return an error, when this failed in reset.

Signed-off-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
---
 drivers/net/ethernet/renesas/sh_eth.c |   88 +++++++++++++++++++++------------
 1 file changed, 56 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
index 8d696e0..326cb91 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -130,6 +130,8 @@ static struct sh_eth_cpu_data sh_eth_my_cpu_data = {
 #elif defined(CONFIG_CPU_SUBTYPE_SH7757)
 #define SH_ETH_HAS_BOTH_MODULES	1
 #define SH_ETH_HAS_TSU	1
+static int sh_eth_check_reset(struct net_device *ndev);
+
 static void sh_eth_set_duplex(struct net_device *ndev)
 {
 	struct sh_eth_private *mdp = netdev_priv(ndev);
@@ -204,23 +206,19 @@ static void sh_eth_chip_reset_giga(struct net_device *ndev)
 }
 
 static int sh_eth_is_gether(struct sh_eth_private *mdp);
-static void sh_eth_reset(struct net_device *ndev)
+static int sh_eth_reset(struct net_device *ndev)
 {
 	struct sh_eth_private *mdp = netdev_priv(ndev);
-	int cnt = 100;
+	int ret = 0;
 
 	if (sh_eth_is_gether(mdp)) {
 		sh_eth_write(ndev, 0x03, EDSR);
 		sh_eth_write(ndev, sh_eth_read(ndev, EDMR) | EDMR_SRST_GETHER,
 				EDMR);
-		while (cnt > 0) {
-			if (!(sh_eth_read(ndev, EDMR) & 0x3))
-				break;
-			mdelay(1);
-			cnt--;
-		}
-		if (cnt < 0)
-			printk(KERN_ERR "Device reset fail\n");
+
+		ret = sh_eth_check_reset(ndev);
+		if (ret)
+			goto out;
 
 		/* Table Init */
 		sh_eth_write(ndev, 0x0, TDLAR);
@@ -238,6 +236,9 @@ static void sh_eth_reset(struct net_device *ndev)
 		sh_eth_write(ndev, sh_eth_read(ndev, EDMR) & ~EDMR_SRST_ETHER,
 				EDMR);
 	}
+
+out:
+	return ret;
 }
 
 static void sh_eth_set_duplex_giga(struct net_device *ndev)
@@ -310,6 +311,7 @@ static struct sh_eth_cpu_data *sh_eth_get_cpu_data(struct sh_eth_private *mdp)
 
 #elif defined(CONFIG_CPU_SUBTYPE_SH7734) || defined(CONFIG_CPU_SUBTYPE_SH7763)
 #define SH_ETH_HAS_TSU	1
+static int sh_eth_check_reset(struct net_device *ndev);
 static void sh_eth_reset_hw_crc(struct net_device *ndev);
 
 static void sh_eth_chip_reset(struct net_device *ndev)
@@ -381,20 +383,16 @@ static struct sh_eth_cpu_data sh_eth_my_cpu_data = {
 #endif
 };
 
-static void sh_eth_reset(struct net_device *ndev)
+static int sh_eth_reset(struct net_device *ndev)
 {
-	int cnt = 100;
+	int ret = 0;
 
 	sh_eth_write(ndev, EDSR_ENALL, EDSR);
 	sh_eth_write(ndev, sh_eth_read(ndev, EDMR) | EDMR_SRST_GETHER, EDMR);
-	while (cnt > 0) {
-		if (!(sh_eth_read(ndev, EDMR) & 0x3))
-			break;
-		mdelay(1);
-		cnt--;
-	}
-	if (cnt == 0)
-		printk(KERN_ERR "Device reset fail\n");
+
+	ret = sh_eth_check_reset(ndev);
+	if (ret)
+		goto out;
 
 	/* Table Init */
 	sh_eth_write(ndev, 0x0, TDLAR);
@@ -412,6 +410,8 @@ static void sh_eth_reset(struct net_device *ndev)
 	/* Select MII mode */
 	if (sh_eth_my_cpu_data.select_mii)
 		sh_eth_select_mii(ndev);
+out:
+	return ret;
 }
 
 static void sh_eth_reset_hw_crc(struct net_device *ndev)
@@ -422,6 +422,8 @@ static void sh_eth_reset_hw_crc(struct net_device *ndev)
 
 #elif defined(CONFIG_ARCH_R8A7740)
 #define SH_ETH_HAS_TSU	1
+static int sh_eth_check_reset(struct net_device *ndev);
+
 static void sh_eth_chip_reset(struct net_device *ndev)
 {
 	struct sh_eth_private *mdp = netdev_priv(ndev);
@@ -433,20 +435,16 @@ static void sh_eth_chip_reset(struct net_device *ndev)
 	sh_eth_select_mii(ndev);
 }
 
-static void sh_eth_reset(struct net_device *ndev)
+static int sh_eth_reset(struct net_device *ndev)
 {
-	int cnt = 100;
+	int ret = 0;
 
 	sh_eth_write(ndev, EDSR_ENALL, EDSR);
 	sh_eth_write(ndev, sh_eth_read(ndev, EDMR) | EDMR_SRST_GETHER, EDMR);
-	while (cnt > 0) {
-		if (!(sh_eth_read(ndev, EDMR) & 0x3))
-			break;
-		mdelay(1);
-		cnt--;
-	}
-	if (cnt == 0)
-		printk(KERN_ERR "Device reset fail\n");
+
+	ret = sh_eth_check_reset(ndev);
+	if (ret)
+		goto out;
 
 	/* Table Init */
 	sh_eth_write(ndev, 0x0, TDLAR);
@@ -457,6 +455,9 @@ static void sh_eth_reset(struct net_device *ndev)
 	sh_eth_write(ndev, 0x0, RDFAR);
 	sh_eth_write(ndev, 0x0, RDFXR);
 	sh_eth_write(ndev, 0x0, RDFFR);
+
+out:
+	return ret;
 }
 
 static void sh_eth_set_duplex(struct net_device *ndev)
@@ -565,11 +566,31 @@ static void sh_eth_set_default_cpu_data(struct sh_eth_cpu_data *cd)
 
 #if defined(SH_ETH_RESET_DEFAULT)
 /* Chip Reset */
-static void sh_eth_reset(struct net_device *ndev)
+static int  sh_eth_reset(struct net_device *ndev)
 {
 	sh_eth_write(ndev, sh_eth_read(ndev, EDMR) | EDMR_SRST_ETHER, EDMR);
 	mdelay(3);
 	sh_eth_write(ndev, sh_eth_read(ndev, EDMR) & ~EDMR_SRST_ETHER, EDMR);
+
+	return 0;
+}
+#else
+static int sh_eth_check_reset(struct net_device *ndev)
+{
+	int ret = 0;
+	int cnt = 100;
+
+	while (cnt > 0) {
+		if (!(sh_eth_read(ndev, EDMR) & 0x3))
+			break;
+		mdelay(1);
+		cnt--;
+	}
+	if (cnt < 0) {
+		printk(KERN_ERR "Device reset fail\n");
+		ret = -ETIMEDOUT;
+	}
+	return ret;
 }
 #endif
 
@@ -924,7 +945,9 @@ static int sh_eth_dev_init(struct net_device *ndev)
 	u32 val;
 
 	/* Soft Reset */
-	sh_eth_reset(ndev);
+	ret = sh_eth_reset(ndev);
+	if (ret)
+		goto out;
 
 	/* Descriptor format */
 	sh_eth_ring_format(ndev);
@@ -998,6 +1021,7 @@ static int sh_eth_dev_init(struct net_device *ndev)
 
 	netif_start_queue(ndev);
 
+out:
 	return ret;
 }
 
-- 
1.7.10

^ permalink raw reply related

* [PATCH v3] net/sh-eth: Add support selecting MII function for SH7734 and R8A7740
From: Nobuhiro Iwamatsu @ 2012-06-26  3:34 UTC (permalink / raw)
  To: netdev; +Cc: Nobuhiro Iwamatsu

Ethernet IP of SH7734 and R8A7740 has selecting MII register.
The user needs to change a value according to MII to be used.
This adds the function to change the value of this register.

Signed-off-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
---
 V2: Fix indent.
 V2: Fix the check by select_mii.
 drivers/net/ethernet/renesas/sh_eth.c |  108 ++++++++++++++++++++-------------
 drivers/net/ethernet/renesas/sh_eth.h |    1 +
 2 files changed, 66 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
index 79bf09b..8d696e0 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -49,6 +49,34 @@
 		NETIF_MSG_RX_ERR| \
 		NETIF_MSG_TX_ERR)
 
+#if defined(CONFIG_CPU_SUBTYPE_SH7734) || \
+	defined(CONFIG_CPU_SUBTYPE_SH7763) || \
+	defined(CONFIG_ARCH_R8A7740)
+static void sh_eth_select_mii(struct net_device *ndev)
+{
+	u32 value = 0x0;
+	struct sh_eth_private *mdp = netdev_priv(ndev);
+
+	switch (mdp->phy_interface) {
+	case PHY_INTERFACE_MODE_GMII:
+		value = 0x2;
+		break;
+	case PHY_INTERFACE_MODE_MII:
+		value = 0x1;
+		break;
+	case PHY_INTERFACE_MODE_RMII:
+		value = 0x0;
+		break;
+	default:
+		pr_warn("PHY interface mode was not setup. Set to MII.\n");
+		value = 0x1;
+		break;
+	}
+
+	sh_eth_write(ndev, value, RMII_MII);
+}
+#endif
+
 /* There is CPU dependent code */
 #if defined(CONFIG_CPU_SUBTYPE_SH7724)
 #define SH_ETH_RESET_DEFAULT	1
@@ -283,6 +311,7 @@ static struct sh_eth_cpu_data *sh_eth_get_cpu_data(struct sh_eth_private *mdp)
 #elif defined(CONFIG_CPU_SUBTYPE_SH7734) || defined(CONFIG_CPU_SUBTYPE_SH7763)
 #define SH_ETH_HAS_TSU	1
 static void sh_eth_reset_hw_crc(struct net_device *ndev);
+
 static void sh_eth_chip_reset(struct net_device *ndev)
 {
 	struct sh_eth_private *mdp = netdev_priv(ndev);
@@ -292,35 +321,6 @@ static void sh_eth_chip_reset(struct net_device *ndev)
 	mdelay(1);
 }
 
-static void sh_eth_reset(struct net_device *ndev)
-{
-	int cnt = 100;
-
-	sh_eth_write(ndev, EDSR_ENALL, EDSR);
-	sh_eth_write(ndev, sh_eth_read(ndev, EDMR) | EDMR_SRST_GETHER, EDMR);
-	while (cnt > 0) {
-		if (!(sh_eth_read(ndev, EDMR) & 0x3))
-			break;
-		mdelay(1);
-		cnt--;
-	}
-	if (cnt == 0)
-		printk(KERN_ERR "Device reset fail\n");
-
-	/* Table Init */
-	sh_eth_write(ndev, 0x0, TDLAR);
-	sh_eth_write(ndev, 0x0, TDFAR);
-	sh_eth_write(ndev, 0x0, TDFXR);
-	sh_eth_write(ndev, 0x0, TDFFR);
-	sh_eth_write(ndev, 0x0, RDLAR);
-	sh_eth_write(ndev, 0x0, RDFAR);
-	sh_eth_write(ndev, 0x0, RDFXR);
-	sh_eth_write(ndev, 0x0, RDFFR);
-
-	/* Reset HW CRC register */
-	sh_eth_reset_hw_crc(ndev);
-}
-
 static void sh_eth_set_duplex(struct net_device *ndev)
 {
 	struct sh_eth_private *mdp = netdev_priv(ndev);
@@ -377,9 +377,43 @@ static struct sh_eth_cpu_data sh_eth_my_cpu_data = {
 	.tsu		= 1,
 #if defined(CONFIG_CPU_SUBTYPE_SH7734)
 	.hw_crc     = 1,
+	.select_mii = 1,
 #endif
 };
 
+static void sh_eth_reset(struct net_device *ndev)
+{
+	int cnt = 100;
+
+	sh_eth_write(ndev, EDSR_ENALL, EDSR);
+	sh_eth_write(ndev, sh_eth_read(ndev, EDMR) | EDMR_SRST_GETHER, EDMR);
+	while (cnt > 0) {
+		if (!(sh_eth_read(ndev, EDMR) & 0x3))
+			break;
+		mdelay(1);
+		cnt--;
+	}
+	if (cnt == 0)
+		printk(KERN_ERR "Device reset fail\n");
+
+	/* Table Init */
+	sh_eth_write(ndev, 0x0, TDLAR);
+	sh_eth_write(ndev, 0x0, TDFAR);
+	sh_eth_write(ndev, 0x0, TDFXR);
+	sh_eth_write(ndev, 0x0, TDFFR);
+	sh_eth_write(ndev, 0x0, RDLAR);
+	sh_eth_write(ndev, 0x0, RDFAR);
+	sh_eth_write(ndev, 0x0, RDFXR);
+	sh_eth_write(ndev, 0x0, RDFFR);
+
+	/* Reset HW CRC register */
+	sh_eth_reset_hw_crc(ndev);
+
+	/* Select MII mode */
+	if (sh_eth_my_cpu_data.select_mii)
+		sh_eth_select_mii(ndev);
+}
+
 static void sh_eth_reset_hw_crc(struct net_device *ndev)
 {
 	if (sh_eth_my_cpu_data.hw_crc)
@@ -391,25 +425,12 @@ static void sh_eth_reset_hw_crc(struct net_device *ndev)
 static void sh_eth_chip_reset(struct net_device *ndev)
 {
 	struct sh_eth_private *mdp = netdev_priv(ndev);
-	unsigned long mii;
 
 	/* reset device */
 	sh_eth_tsu_write(mdp, ARSTR_ARSTR, ARSTR);
 	mdelay(1);
 
-	switch (mdp->phy_interface) {
-	case PHY_INTERFACE_MODE_GMII:
-		mii = 2;
-		break;
-	case PHY_INTERFACE_MODE_MII:
-		mii = 1;
-		break;
-	case PHY_INTERFACE_MODE_RMII:
-	default:
-		mii = 0;
-		break;
-	}
-	sh_eth_write(ndev, mii, RMII_MII);
+	sh_eth_select_mii(ndev);
 }
 
 static void sh_eth_reset(struct net_device *ndev)
@@ -492,6 +513,7 @@ static struct sh_eth_cpu_data sh_eth_my_cpu_data = {
 	.no_trimd	= 1,
 	.no_ade		= 1,
 	.tsu		= 1,
+	.select_mii	= 1,
 };
 
 #elif defined(CONFIG_CPU_SUBTYPE_SH7619)
diff --git a/drivers/net/ethernet/renesas/sh_eth.h b/drivers/net/ethernet/renesas/sh_eth.h
index 57b8e1f..d6763b1392 100644
--- a/drivers/net/ethernet/renesas/sh_eth.h
+++ b/drivers/net/ethernet/renesas/sh_eth.h
@@ -757,6 +757,7 @@ struct sh_eth_cpu_data {
 	unsigned no_trimd:1;		/* E-DMAC DO NOT have TRIMD */
 	unsigned no_ade:1;	/* E-DMAC DO NOT have ADE bit in EESR */
 	unsigned hw_crc:1;	/* E-DMAC have CSMR */
+	unsigned select_mii:1;	/* EtherC have RMII_MII (MII select register) */
 };
 
 struct sh_eth_private {
-- 
1.7.10

^ permalink raw reply related

* Re: [PATCH v2] net/sh-eth: Add support selecting MII function for SH7734 and R8A7740
From: Nobuhiro Iwamatsu @ 2012-06-26  3:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120612.002954.1384327664115800772.davem@davemloft.net>

David Miller さんは書きました:
> From: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
> Date: Tue, 12 Jun 2012 16:29:02 +0900
> 
>> @@ -492,6 +513,7 @@ static struct sh_eth_cpu_data sh_eth_my_cpu_data = {
>>  	.no_trimd	= 1,
>>  	.no_ade		= 1,
>>  	.tsu		= 1,
>> +	.select_mii = 1,
>>  };
>>  
> 
> Indent this new line consistently with those around it.
> 
Thank you. I will fix and resend.

Best regards,
   Nobuhiro

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the net tree
From: Stephen Rothwell @ 2012-06-26  3:15 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Per Ellefsen,
	"Sjur Brændeland", Kim Lilliestierna

[-- Attachment #1: Type: text/plain, Size: 988 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
drivers/net/caif/caif_hsi.c between commits 3935600a7f34 ("caif-hsi:
Bugfix - Piggyback'ed embedded CAIF frame lost") and 1fdc7630b2cb
("caif-hsi: Add missing return in error path") from the net tree and
commits 4e7bb59d49fb ("caif-hsi: Removed dead code") and c41254006377
("caif-hsi: Add rtnl support") from the net-next tree.

I fixed them up (see below) and can carry the fix as necessary.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc drivers/net/caif/caif_hsi.c
index 4a27adb,1c2bd01..0000000
--- a/drivers/net/caif/caif_hsi.c
+++ b/drivers/net/caif/caif_hsi.c
@@@ -693,8 -678,8 +678,6 @@@ static void cfhsi_rx_done(struct cfhsi 
  			 */
  			memcpy(rx_buf, (u8 *)piggy_desc,
  					CFHSI_DESC_SHORT_SZ);
- 			if (desc_pld_len == -EPROTO)
- 				goto out_of_sync;
 -			/* Mark no embedded frame here */
 -			piggy_desc->offset = 0;
  		}
  	}
  

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* [PATCH v2] net/ipv6/route.c: packets originating on device match lo
From: David McCullough @ 2012-06-26  1:42 UTC (permalink / raw)
  To: netdev; +Cc: David Miller


Hi all,

Fix to allow IPv6 packets originating locally to match rules with the "iff"
set to "lo".  This allows IPv6 rule matching work the same as it does for
IPv4.  From the iproute2 man page:

   iif NAME
		  select  the incoming device to match.  If the interface is loop‐
		  back, the rule only matches packets originating from this  host.
		  This  means that you may create separate routing tables for for‐
		  warded and local packets and, hence, completely segregate them.

Cheers,
Davidm

Signed-off-by: David McCullough <david_mccullough@mcafee.com>

diff -u -p -r1.1.1.59 route.c
--- linux-3.4/net/ipv6/route.c	21 May 2012 23:15:01 -0000	1.1.1.59
+++ linux-3.4/net/ipv6/route.c	26 Jun 2012 01:41:15 -0000
@@ -931,6 +931,8 @@ struct dst_entry * ip6_route_output(stru
 {
 	int flags = 0;
 
+	fl6->flowi6_iif = net->loopback_dev->ifindex;
+
 	if ((sk && sk->sk_bound_dev_if) || rt6_need_strict(&fl6->daddr))
 		flags |= RT6_LOOKUP_F_IFACE;
 
 
-- 
David McCullough,      david_mccullough@mcafee.com,  Ph:+61 734352815
McAfee - SnapGear      http://www.mcafee.com         http://www.uCdot.org

^ permalink raw reply

* Re: [PATCH] net/ipv6/route.c: packets originating on device match lo
From: David McCullough @ 2012-06-26  1:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120625.162832.804773711822642700.davem@davemloft.net>


Jivin David Miller lays it down ...
> From: David McCullough <david_mccullough@mcafee.com>
> Date: Mon, 25 Jun 2012 16:50:30 +1000
> 
> > @@ -931,6 +931,9 @@ struct dst_entry * ip6_route_output(stru
> >  {
> >  	int flags = 0;
> >  
> > +	if (fl6->flowi6_iif == 0)
> > +		fl6->flowi6_iif = net->loopback_dev->ifindex;
> > +		
> 
> Like ipv4, you should make this assignment unconditionally.

No problems, will resend,  I wasn't so sure that I could do that safely
with the flowi getting passed into IPv6 rather than created as it is in
IPv4.  Did some extra testing,  looks good so patch to follow,

Thanks,
Davidm

-- 
David McCullough,      david_mccullough@mcafee.com,  Ph:+61 734352815
McAfee - SnapGear      http://www.mcafee.com         http://www.uCdot.org

^ permalink raw reply

* Re: [PATCH 1/4] netdev/phy: Handle IEEE802.3 clause 45 Ethernet PHYs
From: David Miller @ 2012-06-25 23:50 UTC (permalink / raw)
  To: ddaney.cavm
  Cc: grant.likely, rob.herring, devicetree-discuss, netdev,
	linux-kernel, linux-mips, afleming, david.daney
In-Reply-To: <4FE8F8D8.1050009@gmail.com>

From: David Daney <ddaney.cavm@gmail.com>
Date: Mon, 25 Jun 2012 16:48:40 -0700

> Therefore, I am going to propose that we add a 'flags' parameter to
> get_phy_device() and change the (two) callers.

Since there is only one flag, make it simply a bool for now.

^ permalink raw reply

* Re: [PATCH 1/4] netdev/phy: Handle IEEE802.3 clause 45 Ethernet PHYs
From: David Daney @ 2012-06-25 23:48 UTC (permalink / raw)
  To: David Miller
  Cc: ddaney.cavm, grant.likely, rob.herring, devicetree-discuss,
	netdev, linux-kernel, linux-mips, afleming, david.daney
In-Reply-To: <20120625.163355.2058784474741116830.davem@davemloft.net>

On 06/25/2012 04:33 PM, David Miller wrote:
> From: David Daney<ddaney.cavm@gmail.com>
> Date: Mon, 25 Jun 2012 16:11:23 -0700
>
>> Do you realize that at the time get_phy_device() is called, there is
>> no PHY device?  So there can be no attribute, nor are we passing a
>> register address.  Neither of these suggestions apply to this
>> situation.
>>
>> We need to know a priori if it is c22 or c45.  So we need to
>> communicate the type somehow to get_phy_device().  I chose an unused
>> bit in the addr parameter to do this, another option would be to add a
>> separate parameter to get_phy_device() specifying the type.
>
> Then pass it in to the get() routine and store the attribute there
> in the device we end up with.

OK.

addr has only 5 significant bits, and the patch *does* pass the 
information (c22 vs. c45) in one of the high order bits.  So it is 
essentially as you say, but you don't like the idea of multiplexing the 
arguments into a single int.

Therefore, I am going to propose that we add a 'flags' parameter to 
get_phy_device() and change the (two) callers.

Does that seem better (or at least acceptable)?

Or do you really want to pass the address of a (one bit) structure instead?

David Daney

>
> There are many parameters that go into a PHY register access, so
> we'll probably some day end up with a descriptor struct that the
> caller prepares on-stack to pass into the actual read/write ops
> via reference.
>

^ permalink raw reply

* Re: [PATCH 0/3 v3 resend] 6lowpan: minor fixes
From: David Miller @ 2012-06-25 23:45 UTC (permalink / raw)
  To: alex.bluesman.smirnov; +Cc: netdev, dbaryshkov
In-Reply-To: <1340632143-27794-1-git-send-email-alex.bluesman.smirnov@gmail.com>

From: Alexander Smirnov <alex.bluesman.smirnov@gmail.com>
Date: Mon, 25 Jun 2012 17:49:00 +0400

> I sent this set several weeks ago and seems that it was missed
> (or please let me know if there was a reason).
> 
> this is the 3-rd version of the patches.
> Changes since v2:
>  - Removed WARN_ON() and BUG() from skb fetch methods. That wasn't a good idea
>    to crash kernel by such an unsignificant issue.
>  - Added new patch (I've decided to add it here just to keep all the 6lowpan
>    related code together)

All applied, thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox