Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] b44: add 64 bit stats
From: Ben Hutchings @ 2012-07-20 14:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kevin Groeneveld, netdev
In-Reply-To: <1342761865.2626.5572.camel@edumazet-glaptop>

On Fri, 2012-07-20 at 07:24 +0200, Eric Dumazet wrote:
> On Fri, 2012-07-20 at 06:53 +0200, Eric Dumazet wrote:
> > On Thu, 2012-07-19 at 21:56 -0400, Kevin Groeneveld wrote:
> > 
> > > I am still trying to make sure I understand this fully.  I want to
> > > update some other drivers with 64 bit stats as well.  What you said
> > > seems to make sense, but...
> > > 
> > > I was looking at the virtio_net.c driver.  One spot in this driver
> > > which updates the stats is the receive_buf function.  recive_buf is
> > > called from virtnet_poll which is registered as a napi poll function.
> > > According to Documentation/networking/netdevices.txt the poll function
> > > is called in a softirq context.  However, the function which reads the
> > > stats uses u64_stats_fetch_begin/u64_stats_fetch_retry.  Shouldn't
> > > this be u64_stats_fetch_begin_bh/u64_stats_fetch_retry_bh for the
> > > exact reasons you described for my b44 patch?
> > 
> > Absolutely. You can argue that probably nobody use this driver on a
> > 32bit UP machine, but technically speaking the current implementation is
> > racy.
> > 
> 
> In fact all network drivers should use the _bh version.
> 
> Could you send a patch for all of them, based on net-next tree ?
> 
> Thanks !

Don't we need an _irq variant for drivers that support netpoll?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: resurrecting tcphealth
From: Yuchung Cheng @ 2012-07-20 14:06 UTC (permalink / raw)
  To: Piotr Sawuk; +Cc: netdev, linux-kernel
In-Reply-To: <cc2022d86834a8798518f9cccc605d45.squirrel@webmail.univie.ac.at>

On Mon, Jul 16, 2012 at 6:03 AM, Piotr Sawuk <a9702387@unet.univie.ac.at> wrote:
> On Mo, 16.07.2012, 13:46, Eric Dumazet wrote:
>> On Mon, 2012-07-16 at 13:33 +0200, Piotr Sawuk wrote:
>>> On Sa, 14.07.2012, 01:55, Stephen Hemminger wrote:
>>> > I am not sure if the is really necessary since the most
>>> > of the stats are available elsewhere.
>>>
>>> if by "most" you mean address and port then you're right.
>>> but even the rtt reported by "ss -i" seems to differ from tcphealth.
>>
>> Thats because tcphealth is wrong, it assumes HZ=1000 ?
>>
>> tp->srtt unit is jiffies, not ms.
>
> thanks. any conversion-functions in the kernel for that?
>>
>> tcphealth is a gross hack.
>
> what would you do if you tried making it less gross?
>
> I've not found any similar functionality, in the kernel.
> I want to know an estimate for the percentage of data lost in tcp.
> and I want to know that without actually sending much packets.
> afterall I'm on the receiving end most of the time.
> percentage of duplicate packets received is nice too.
> you have any suggestions?

counting dupack may not be as reliable as you'd like.
say the remote sends you 100 packets and only the first one is lost,
you'll see 99 dupacks. Morover any small degree reordering (<3)
will generate substantial dupacks but the network is perfectly fine
(see Craig Patridge's "reordering is not pathological" paper).
unfortunately receiver can't and does not have to distinguish loss
 or reordering. you can infer that but it should not be kernel's job.
there are public tools that inspect tcpdump traces to do that

exposing duplicate packets received can be done via getsockopt(TCP_INFO)
although I don't know what that gives you. the remote can be too
aggressive in retransmission (not just because of a bad RTO) or
the network RTT fluctuates.

I don't what if tracking last_ack_sent (the latest RCV.NXT) without
knowing the ISN is useful.

btw the term project paper cited concludes SACK is not useful is simply
wrong. This makes me suspicious about how rigorous and thoughtful of
its design.

>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
From: Michael S. Tsirkin @ 2012-07-20 13:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341484194-8108-5-git-send-email-jasowang@redhat.com>

On Thu, Jul 05, 2012 at 06:29:53PM +0800, Jason Wang wrote:
> This patch converts virtio_net to a multi queue device. After negotiated
> VIRTIO_NET_F_MULTIQUEUE feature, the virtio device has many tx/rx queue pairs,
> and driver could read the number from config space.
> 
> The driver expects the number of rx/tx queue paris is equal to the number of
> vcpus. To maximize the performance under this per-cpu rx/tx queue pairs, some
> optimization were introduced:
> 
> - Txq selection is based on the processor id in order to avoid contending a lock
>   whose owner may exits to host.
> - Since the txq/txq were per-cpu, affinity hint were set to the cpu that owns
>   the queue pairs.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Overall fine. I think it is best to smash the following patch
into this one, so that default behavior does not
jump to mq then back. some comments below: mostly nits, and a minor bug.

If you are worried the patch is too big, it can be split
differently
	- rework to use send_queue/receive_queue structures, no
	  functional changes.
	- add multiqueue

but this is not a must.

> ---
>  drivers/net/virtio_net.c   |  645 ++++++++++++++++++++++++++++++-------------
>  include/linux/virtio_net.h |    2 +
>  2 files changed, 452 insertions(+), 195 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 1db445b..7410187 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,7 @@
>  #include <linux/scatterlist.h>
>  #include <linux/if_vlan.h>
>  #include <linux/slab.h>
> +#include <linux/interrupt.h>
>  
>  static int napi_weight = 128;
>  module_param(napi_weight, int, 0444);
> @@ -41,6 +42,8 @@ module_param(gso, bool, 0444);
>  #define VIRTNET_SEND_COMMAND_SG_MAX    2
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
> +#define MAX_QUEUES 256
> +
>  struct virtnet_stats {
>  	struct u64_stats_sync tx_syncp;
>  	struct u64_stats_sync rx_syncp;

Would be a bit better not to have artificial limits like that.
Maybe allocate arrays at probe time, then we can
take whatever the device gives us?

> @@ -51,43 +54,69 @@ struct virtnet_stats {
>  	u64 rx_packets;
>  };
>  
> -struct virtnet_info {
> -	struct virtio_device *vdev;
> -	struct virtqueue *rvq, *svq, *cvq;
> -	struct net_device *dev;
> +/* Internal representation of a send virtqueue */
> +struct send_queue {
> +	/* Virtqueue associated with this send _queue */
> +	struct virtqueue *vq;
> +
> +	/* TX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +/* Internal representation of a receive virtqueue */
> +struct receive_queue {
> +	/* Virtqueue associated with this receive_queue */
> +	struct virtqueue *vq;
> +
> +	/* Back pointer to the virtnet_info */
> +	struct virtnet_info *vi;
> +
>  	struct napi_struct napi;
> -	unsigned int status;
>  
>  	/* Number of input buffers, and max we've ever had. */
>  	unsigned int num, max;
>  
> +	/* Work struct for refilling if we run low on memory. */
> +	struct delayed_work refill;
> +
> +	/* Chain pages by the private ptr. */
> +	struct page *pages;
> +
> +	/* RX: fragments + linear part + virtio header */
> +	struct scatterlist sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +struct virtnet_info {
> +	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
> +
> +	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
> +	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;

The assumption is a tx/rx pair is handled on the same cpu, yes?
If yes maybe make it a single array to improve cache locality
a bit?
	struct queue_pair {
		struct send_queue sq;
		struct receive_queue rq;
	};

> +	struct virtqueue *cvq;
> +
> +	struct virtio_device *vdev;
> +	struct net_device *dev;
> +	unsigned int status;
> +
>  	/* I like... big packets and I cannot lie! */
>  	bool big_packets;
>  
>  	/* Host will merge rx buffers for big packets (shake it! shake it!) */
>  	bool mergeable_rx_bufs;
>  
> +	/* Has control virtqueue */
> +	bool has_cvq;
> +

won't checking (cvq != NULL) be enough?

>  	/* enable config space updates */
>  	bool config_enable;
>  
>  	/* Active statistics */
>  	struct virtnet_stats __percpu *stats;
>  
> -	/* Work struct for refilling if we run low on memory. */
> -	struct delayed_work refill;
> -
>  	/* Work struct for config space updates */
>  	struct work_struct config_work;
>  
>  	/* Lock for config space updates */
>  	struct mutex config_lock;
> -
> -	/* Chain pages by the private ptr. */
> -	struct page *pages;
> -
> -	/* fragments + linear part + virtio header */
> -	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
> -	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>  };
>  
>  struct skb_vnet_hdr {
> @@ -108,6 +137,22 @@ struct padded_vnet_hdr {
>  	char padding[6];
>  };
>  
> +static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
> +{
> +	int ret = virtqueue_get_queue_index(vq);
> +
> +	/* skip ctrl vq */
> +	if (vi->has_cvq)
> +		return (ret - 1) / 2;
> +	else
> +		return ret / 2;
> +}
> +
> +static inline int rxq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
> +{
> +	return virtqueue_get_queue_index(vq) / 2;
> +}
> +
>  static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>  {
>  	return (struct skb_vnet_hdr *)skb->cb;
> @@ -117,22 +162,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>   * private is used to chain pages for big packets, put the whole
>   * most recent used list in the beginning for reuse
>   */
> -static void give_pages(struct virtnet_info *vi, struct page *page)
> +static void give_pages(struct receive_queue *rq, struct page *page)
>  {
>  	struct page *end;
>  
>  	/* Find end of list, sew whole thing into vi->pages. */
>  	for (end = page; end->private; end = (struct page *)end->private);
> -	end->private = (unsigned long)vi->pages;
> -	vi->pages = page;
> +	end->private = (unsigned long)rq->pages;
> +	rq->pages = page;
>  }
>  
> -static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
> +static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  {
> -	struct page *p = vi->pages;
> +	struct page *p = rq->pages;
>  
>  	if (p) {
> -		vi->pages = (struct page *)p->private;
> +		rq->pages = (struct page *)p->private;
>  		/* clear private here, it is used to chain pages */
>  		p->private = 0;
>  	} else
> @@ -140,15 +185,15 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>  	return p;
>  }
>  
> -static void skb_xmit_done(struct virtqueue *svq)
> +static void skb_xmit_done(struct virtqueue *vq)
>  {
> -	struct virtnet_info *vi = svq->vdev->priv;
> +	struct virtnet_info *vi = vq->vdev->priv;
>  
>  	/* Suppress further interrupts. */
> -	virtqueue_disable_cb(svq);
> +	virtqueue_disable_cb(vq);
>  
>  	/* We were probably waiting for more output buffers. */
> -	netif_wake_queue(vi->dev);
> +	netif_wake_subqueue(vi->dev, txq_get_qnum(vi, vq));
>  }
>  
>  static void set_skb_frag(struct sk_buff *skb, struct page *page,
> @@ -167,9 +212,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
>  }
>  
>  /* Called from bottom half context */
> -static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> +static struct sk_buff *page_to_skb(struct receive_queue *rq,
>  				   struct page *page, unsigned int len)
>  {
> +	struct virtnet_info *vi = rq->vi;
>  	struct sk_buff *skb;
>  	struct skb_vnet_hdr *hdr;
>  	unsigned int copy, hdr_len, offset;
> @@ -225,12 +271,12 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
>  	}
>  
>  	if (page)
> -		give_pages(vi, page);
> +		give_pages(rq, page);
>  
>  	return skb;
>  }
>  
> -static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
> +static int receive_mergeable(struct receive_queue *rq, struct sk_buff *skb)
>  {
>  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>  	struct page *page;
> @@ -244,7 +290,7 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>  			skb->dev->stats.rx_length_errors++;
>  			return -EINVAL;
>  		}
> -		page = virtqueue_get_buf(vi->rvq, &len);
> +		page = virtqueue_get_buf(rq->vq, &len);
>  		if (!page) {
>  			pr_debug("%s: rx error: %d buffers missing\n",
>  				 skb->dev->name, hdr->mhdr.num_buffers);
> @@ -257,13 +303,14 @@ static int receive_mergeable(struct virtnet_info *vi, struct sk_buff *skb)
>  
>  		set_skb_frag(skb, page, 0, &len);
>  
> -		--vi->num;
> +		--rq->num;
>  	}
>  	return 0;
>  }
>  
> -static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
> +static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  {
> +	struct net_device *dev = rq->vi->dev;
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>  	struct sk_buff *skb;
> @@ -274,7 +321,7 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>  		pr_debug("%s: short packet %i\n", dev->name, len);
>  		dev->stats.rx_length_errors++;
>  		if (vi->mergeable_rx_bufs || vi->big_packets)
> -			give_pages(vi, buf);
> +			give_pages(rq, buf);
>  		else
>  			dev_kfree_skb(buf);
>  		return;
> @@ -286,14 +333,14 @@ static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
>  		skb_trim(skb, len);
>  	} else {
>  		page = buf;
> -		skb = page_to_skb(vi, page, len);
> +		skb = page_to_skb(rq, page, len);
>  		if (unlikely(!skb)) {
>  			dev->stats.rx_dropped++;
> -			give_pages(vi, page);
> +			give_pages(rq, page);
>  			return;
>  		}
>  		if (vi->mergeable_rx_bufs)
> -			if (receive_mergeable(vi, skb)) {
> +			if (receive_mergeable(rq, skb)) {
>  				dev_kfree_skb(skb);
>  				return;
>  			}
> @@ -363,90 +410,91 @@ frame_err:
>  	dev_kfree_skb(skb);
>  }
>  
> -static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_small(struct receive_queue *rq, gfp_t gfp)
>  {
>  	struct sk_buff *skb;
>  	struct skb_vnet_hdr *hdr;
>  	int err;
>  
> -	skb = __netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN, gfp);
> +	skb = __netdev_alloc_skb_ip_align(rq->vi->dev, MAX_PACKET_LEN, gfp);
>  	if (unlikely(!skb))
>  		return -ENOMEM;
>  
>  	skb_put(skb, MAX_PACKET_LEN);
>  
>  	hdr = skb_vnet_hdr(skb);
> -	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
> +	sg_set_buf(rq->sg, &hdr->hdr, sizeof hdr->hdr);
> +
> +	skb_to_sgvec(skb, rq->sg + 1, 0, skb->len);
>  
> -	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 2, skb, gfp);
>  
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
>  	if (err < 0)
>  		dev_kfree_skb(skb);
>  
>  	return err;
>  }
>  
> -static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  {
>  	struct page *first, *list = NULL;
>  	char *p;
>  	int i, err, offset;
>  
> -	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
> +	/* page in rq->sg[MAX_SKB_FRAGS + 1] is list tail */
>  	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
> -		first = get_a_page(vi, gfp);
> +		first = get_a_page(rq, gfp);
>  		if (!first) {
>  			if (list)
> -				give_pages(vi, list);
> +				give_pages(rq, list);
>  			return -ENOMEM;
>  		}
> -		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
> +		sg_set_buf(&rq->sg[i], page_address(first), PAGE_SIZE);
>  
>  		/* chain new page in list head to match sg */
>  		first->private = (unsigned long)list;
>  		list = first;
>  	}
>  
> -	first = get_a_page(vi, gfp);
> +	first = get_a_page(rq, gfp);
>  	if (!first) {
> -		give_pages(vi, list);
> +		give_pages(rq, list);
>  		return -ENOMEM;
>  	}
>  	p = page_address(first);
>  
> -	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
> -	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
> -	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
> +	/* rq->sg[0], rq->sg[1] share the same page */
> +	/* a separated rq->sg[0] for virtio_net_hdr only due to QEMU bug */
> +	sg_set_buf(&rq->sg[0], p, sizeof(struct virtio_net_hdr));
>  
> -	/* vi->rx_sg[1] for data packet, from offset */
> +	/* rq->sg[1] for data packet, from offset */
>  	offset = sizeof(struct padded_vnet_hdr);
> -	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
> +	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
>  
>  	/* chain first in list head */
>  	first->private = (unsigned long)list;
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, MAX_SKB_FRAGS + 2,
>  				first, gfp);
>  	if (err < 0)
> -		give_pages(vi, first);
> +		give_pages(rq, first);
>  
>  	return err;
>  }
>  
> -static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
> +static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
>  	struct page *page;
>  	int err;
>  
> -	page = get_a_page(vi, gfp);
> +	page = get_a_page(rq, gfp);
>  	if (!page)
>  		return -ENOMEM;
>  
> -	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
> +	sg_init_one(rq->sg, page_address(page), PAGE_SIZE);
>  
> -	err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
> +	err = virtqueue_add_buf(rq->vq, rq->sg, 0, 1, page, gfp);
>  	if (err < 0)
> -		give_pages(vi, page);
> +		give_pages(rq, page);
>  
>  	return err;
>  }
> @@ -458,97 +506,104 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
>   * before we're receiving packets, or from refill_work which is
>   * careful to disable receiving (using napi_disable).
>   */
> -static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
> +static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>  {
> +	struct virtnet_info *vi = rq->vi;
>  	int err;
>  	bool oom;
>  
>  	do {
>  		if (vi->mergeable_rx_bufs)
> -			err = add_recvbuf_mergeable(vi, gfp);
> +			err = add_recvbuf_mergeable(rq, gfp);
>  		else if (vi->big_packets)
> -			err = add_recvbuf_big(vi, gfp);
> +			err = add_recvbuf_big(rq, gfp);
>  		else
> -			err = add_recvbuf_small(vi, gfp);
> +			err = add_recvbuf_small(rq, gfp);
>  
>  		oom = err == -ENOMEM;
>  		if (err < 0)
>  			break;
> -		++vi->num;
> +		++rq->num;
>  	} while (err > 0);
> -	if (unlikely(vi->num > vi->max))
> -		vi->max = vi->num;
> -	virtqueue_kick(vi->rvq);
> +	if (unlikely(rq->num > rq->max))
> +		rq->max = rq->num;
> +	virtqueue_kick(rq->vq);
>  	return !oom;
>  }
>  
> -static void skb_recv_done(struct virtqueue *rvq)
> +static void skb_recv_done(struct virtqueue *vq)
>  {
> -	struct virtnet_info *vi = rvq->vdev->priv;
> +	struct virtnet_info *vi = vq->vdev->priv;
> +	struct napi_struct *napi = &vi->rq[rxq_get_qnum(vi, vq)]->napi;
> +
>  	/* Schedule NAPI, Suppress further interrupts if successful. */
> -	if (napi_schedule_prep(&vi->napi)) {
> -		virtqueue_disable_cb(rvq);
> -		__napi_schedule(&vi->napi);
> +	if (napi_schedule_prep(napi)) {
> +		virtqueue_disable_cb(vq);
> +		__napi_schedule(napi);
>  	}
>  }
>  
> -static void virtnet_napi_enable(struct virtnet_info *vi)
> +static void virtnet_napi_enable(struct receive_queue *rq)
>  {
> -	napi_enable(&vi->napi);
> +	napi_enable(&rq->napi);
>  
>  	/* If all buffers were filled by other side before we napi_enabled, we
>  	 * won't get another interrupt, so process any outstanding packets
>  	 * now.  virtnet_poll wants re-enable the queue, so we disable here.
>  	 * We synchronize against interrupts via NAPI_STATE_SCHED */
> -	if (napi_schedule_prep(&vi->napi)) {
> -		virtqueue_disable_cb(vi->rvq);
> +	if (napi_schedule_prep(&rq->napi)) {
> +		virtqueue_disable_cb(rq->vq);
>  		local_bh_disable();
> -		__napi_schedule(&vi->napi);
> +		__napi_schedule(&rq->napi);
>  		local_bh_enable();
>  	}
>  }
>  
>  static void refill_work(struct work_struct *work)
>  {
> -	struct virtnet_info *vi;
> +	struct napi_struct *napi;
> +	struct receive_queue *rq;
>  	bool still_empty;
>  
> -	vi = container_of(work, struct virtnet_info, refill.work);
> -	napi_disable(&vi->napi);
> -	still_empty = !try_fill_recv(vi, GFP_KERNEL);
> -	virtnet_napi_enable(vi);
> +	rq = container_of(work, struct receive_queue, refill.work);
> +	napi = &rq->napi;
> +
> +	napi_disable(napi);
> +	still_empty = !try_fill_recv(rq, GFP_KERNEL);
> +	virtnet_napi_enable(rq);
>  
>  	/* In theory, this can happen: if we don't get any buffers in
>  	 * we will *never* try to fill again. */
>  	if (still_empty)
> -		queue_delayed_work(system_nrt_wq, &vi->refill, HZ/2);
> +		queue_delayed_work(system_nrt_wq, &rq->refill, HZ/2);
>  }
>  
>  static int virtnet_poll(struct napi_struct *napi, int budget)
>  {
> -	struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
> +	struct receive_queue *rq = container_of(napi, struct receive_queue,
> +						napi);
>  	void *buf;
>  	unsigned int len, received = 0;
>  
>  again:
>  	while (received < budget &&
> -	       (buf = virtqueue_get_buf(vi->rvq, &len)) != NULL) {
> -		receive_buf(vi->dev, buf, len);
> -		--vi->num;
> +	       (buf = virtqueue_get_buf(rq->vq, &len)) != NULL) {
> +		receive_buf(rq, buf, len);
> +		--rq->num;
>  		received++;
>  	}
>  
> -	if (vi->num < vi->max / 2) {
> -		if (!try_fill_recv(vi, GFP_ATOMIC))
> -			queue_delayed_work(system_nrt_wq, &vi->refill, 0);
> +	if (rq->num < rq->max / 2) {
> +		if (!try_fill_recv(rq, GFP_ATOMIC))
> +			queue_delayed_work(system_nrt_wq, &rq->refill, 0);
>  	}
>  
>  	/* Out of packets? */
>  	if (received < budget) {
>  		napi_complete(napi);
> -		if (unlikely(!virtqueue_enable_cb(vi->rvq)) &&
> +		if (unlikely(!virtqueue_enable_cb(rq->vq)) &&
>  		    napi_schedule_prep(napi)) {
> -			virtqueue_disable_cb(vi->rvq);
> +			virtqueue_disable_cb(rq->vq);
>  			__napi_schedule(napi);
>  			goto again;
>  		}
> @@ -557,13 +612,14 @@ again:
>  	return received;
>  }
>  
> -static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
> +static unsigned int free_old_xmit_skbs(struct virtnet_info *vi,
> +				       struct virtqueue *vq)
>  {
>  	struct sk_buff *skb;
>  	unsigned int len, tot_sgs = 0;
>  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>  
> -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +	while ((skb = virtqueue_get_buf(vq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
>  
>  		u64_stats_update_begin(&stats->tx_syncp);
> @@ -577,7 +633,8 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>  	return tot_sgs;
>  }
>  
> -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
> +static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb,
> +		    struct virtqueue *vq, struct scatterlist *sg)
>  {
>  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
>  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
> @@ -615,44 +672,47 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
>  
>  	/* Encode metadata header at front. */
>  	if (vi->mergeable_rx_bufs)
> -		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
> +		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
>  	else
> -		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
> +		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
>  
> -	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
> -	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
> +	hdr->num_sg = skb_to_sgvec(skb, sg + 1, 0, skb->len) + 1;
> +	return virtqueue_add_buf(vq, sg, hdr->num_sg,
>  				 0, skb, GFP_ATOMIC);
>  }
>  
>  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int qnum = skb_get_queue_mapping(skb);
> +	struct virtqueue *vq = vi->sq[qnum]->vq;
>  	int capacity;
>  
>  	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> +	free_old_xmit_skbs(vi, vq);
>  
>  	/* Try to transmit */
> -	capacity = xmit_skb(vi, skb);
> +	capacity = xmit_skb(vi, skb, vq, vi->sq[qnum]->sg);
>  
>  	/* This can happen with OOM and indirect buffers. */
>  	if (unlikely(capacity < 0)) {
>  		if (likely(capacity == -ENOMEM)) {
>  			if (net_ratelimit())
>  				dev_warn(&dev->dev,
> -					 "TX queue failure: out of memory\n");
> +					"TXQ (%d) failure: out of memory\n",
> +					qnum);
>  		} else {
>  			dev->stats.tx_fifo_errors++;
>  			if (net_ratelimit())
>  				dev_warn(&dev->dev,
> -					 "Unexpected TX queue failure: %d\n",
> -					 capacity);
> +					"Unexpected TXQ (%d) failure: %d\n",
> +					qnum, capacity);
>  		}
>  		dev->stats.tx_dropped++;
>  		kfree_skb(skb);
>  		return NETDEV_TX_OK;
>  	}
> -	virtqueue_kick(vi->svq);
> +	virtqueue_kick(vq);
>  
>  	/* Don't wait up for transmitted skbs to be freed. */
>  	skb_orphan(skb);
> @@ -661,13 +721,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (capacity < 2+MAX_SKB_FRAGS) {
> -		netif_stop_queue(dev);
> -		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
> +		netif_stop_subqueue(dev, qnum);
> +		if (unlikely(!virtqueue_enable_cb_delayed(vq))) {
>  			/* More just got used, free them then recheck. */
> -			capacity += free_old_xmit_skbs(vi);
> +			capacity += free_old_xmit_skbs(vi, vq);
>  			if (capacity >= 2+MAX_SKB_FRAGS) {
> -				netif_start_queue(dev);
> -				virtqueue_disable_cb(vi->svq);
> +				netif_start_subqueue(dev, qnum);
> +				virtqueue_disable_cb(vq);
>  			}
>  		}
>  	}
> @@ -700,7 +760,8 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>  	unsigned int start;
>  
>  	for_each_possible_cpu(cpu) {
> -		struct virtnet_stats *stats = per_cpu_ptr(vi->stats, cpu);
> +		struct virtnet_stats __percpu *stats
> +			= per_cpu_ptr(vi->stats, cpu);
>  		u64 tpackets, tbytes, rpackets, rbytes;
>  
>  		do {
> @@ -734,20 +795,26 @@ static struct rtnl_link_stats64 *virtnet_stats(struct net_device *dev,
>  static void virtnet_netpoll(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
>  
> -	napi_schedule(&vi->napi);
> +	for (i = 0; i < vi->num_queue_pairs; i++)
> +		napi_schedule(&vi->rq[i]->napi);
>  }
>  #endif
>  
>  static int virtnet_open(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
>  
> -	/* Make sure we have some buffers: if oom use wq. */
> -	if (!try_fill_recv(vi, GFP_KERNEL))
> -		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		/* Make sure we have some buffers: if oom use wq. */
> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> +			queue_delayed_work(system_nrt_wq,
> +					   &vi->rq[i]->refill, 0);
> +		virtnet_napi_enable(vi->rq[i]);
> +	}
>  
> -	virtnet_napi_enable(vi);
>  	return 0;
>  }
>  
> @@ -809,10 +876,13 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>  static int virtnet_close(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
>  
>  	/* Make sure refill_work doesn't re-enable napi! */
> -	cancel_delayed_work_sync(&vi->refill);
> -	napi_disable(&vi->napi);
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
> +		napi_disable(&vi->rq[i]->napi);
> +	}
>  
>  	return 0;
>  }
> @@ -924,11 +994,10 @@ static void virtnet_get_ringparam(struct net_device *dev,
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  
> -	ring->rx_max_pending = virtqueue_get_vring_size(vi->rvq);
> -	ring->tx_max_pending = virtqueue_get_vring_size(vi->svq);
> +	ring->rx_max_pending = virtqueue_get_vring_size(vi->rq[0]->vq);
> +	ring->tx_max_pending = virtqueue_get_vring_size(vi->sq[0]->vq);
>  	ring->rx_pending = ring->rx_max_pending;
>  	ring->tx_pending = ring->tx_max_pending;
> -
>  }
>  
>  
> @@ -961,6 +1030,19 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>  	return 0;
>  }
>  
> +/* To avoid contending a lock hold by a vcpu who would exit to host, select the
> + * txq based on the processor id.
> + */
> +static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
> +{
> +	int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
> +		  smp_processor_id();
> +
> +	while (unlikely(txq >= dev->real_num_tx_queues))
> +		txq -= dev->real_num_tx_queues;
> +	return txq;
> +}
> +
>  static const struct net_device_ops virtnet_netdev = {
>  	.ndo_open            = virtnet_open,
>  	.ndo_stop   	     = virtnet_close,
> @@ -972,6 +1054,7 @@ static const struct net_device_ops virtnet_netdev = {
>  	.ndo_get_stats64     = virtnet_stats,
>  	.ndo_vlan_rx_add_vid = virtnet_vlan_rx_add_vid,
>  	.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
> +	.ndo_select_queue     = virtnet_select_queue,
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>  	.ndo_poll_controller = virtnet_netpoll,
>  #endif
> @@ -1007,10 +1090,10 @@ static void virtnet_config_changed_work(struct work_struct *work)
>  
>  	if (vi->status & VIRTIO_NET_S_LINK_UP) {
>  		netif_carrier_on(vi->dev);
> -		netif_wake_queue(vi->dev);
> +		netif_tx_wake_all_queues(vi->dev);
>  	} else {
>  		netif_carrier_off(vi->dev);
> -		netif_stop_queue(vi->dev);
> +		netif_tx_stop_all_queues(vi->dev);
>  	}
>  done:
>  	mutex_unlock(&vi->config_lock);
> @@ -1023,41 +1106,217 @@ static void virtnet_config_changed(struct virtio_device *vdev)
>  	queue_work(system_nrt_wq, &vi->config_work);
>  }
>  
> -static int init_vqs(struct virtnet_info *vi)
> +static void free_receive_bufs(struct virtnet_info *vi)
> +{
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		while (vi->rq[i]->pages)
> +			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
> +	}
> +}
> +
> +/* Free memory allocated for send and receive queues */
> +static void virtnet_free_queues(struct virtnet_info *vi)
>  {
> -	struct virtqueue *vqs[3];
> -	vq_callback_t *callbacks[] = { skb_recv_done, skb_xmit_done, NULL};
> -	const char *names[] = { "input", "output", "control" };
> -	int nvqs, err;
> +	int i;
>  
> -	/* We expect two virtqueues, receive then send,
> -	 * and optionally control. */
> -	nvqs = virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ? 3 : 2;
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		kfree(vi->rq[i]);
> +		vi->rq[i] = NULL;
> +		kfree(vi->sq[i]);
> +		vi->sq[i] = NULL;
> +	}
> +}
>  
> -	err = vi->vdev->config->find_vqs(vi->vdev, nvqs, vqs, callbacks, names);
> -	if (err)
> -		return err;
> +static void free_unused_bufs(struct virtnet_info *vi)
> +{
> +	void *buf;
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		struct virtqueue *vq = vi->sq[i]->vq;
> +
> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL)
> +			dev_kfree_skb(buf);
> +	}
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		struct virtqueue *vq = vi->rq[i]->vq;
> +
> +		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> +			if (vi->mergeable_rx_bufs || vi->big_packets)
> +				give_pages(vi->rq[i], buf);
> +			else
> +				dev_kfree_skb(buf);
> +			--vi->rq[i]->num;
> +		}
> +		BUG_ON(vi->rq[i]->num != 0);
> +	}
> +}
> +
> +static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
> +{
> +	int i;
> +
> +	if (vi->num_queue_pairs == 1)
> +		return;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		int cpu = set ? i : -1;
> +		virtqueue_set_affinity(vi->rq[i]->vq, cpu);
> +		virtqueue_set_affinity(vi->sq[i]->vq, cpu);
> +	}
> +	return;
> +}
> +
> +static void virtnet_del_vqs(struct virtnet_info *vi)
> +{
> +	struct virtio_device *vdev = vi->vdev;
> +
> +	virtnet_set_affinity(vi, false);
> +
> +	vdev->config->del_vqs(vdev);
> +
> +	virtnet_free_queues(vi);
> +}
> +
> +static int virtnet_find_vqs(struct virtnet_info *vi)
> +{
> +	vq_callback_t **callbacks;
> +	struct virtqueue **vqs;
> +	int ret = -ENOMEM;
> +	int i, total_vqs;
> +	char **names;
>  
> -	vi->rvq = vqs[0];
> -	vi->svq = vqs[1];
> +	/*
> +	 * We expect 1 RX virtqueue followed by 1 TX virtqueue, followd by
> +	 * possible control virtqueue and followed by the same
> +	 * 'vi->num_queue_pairs-1' more times
> +	 */
> +	total_vqs = vi->num_queue_pairs * 2 +
> +		    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ);
> +
> +	/* Allocate space for find_vqs parameters */
> +	vqs = kmalloc(total_vqs * sizeof(*vqs), GFP_KERNEL);
> +	callbacks = kmalloc(total_vqs * sizeof(*callbacks), GFP_KERNEL);
> +	names = kmalloc(total_vqs * sizeof(*names), GFP_KERNEL);

so this needs to be kzalloc otherwise on an error cleanup will
get uninitialized data and crash?

> +	if (!vqs || !callbacks || !names)
> +		goto err;
> +
> +	/* Parameters for control virtqueue, if any */
> +	if (vi->has_cvq) {
> +		callbacks[2] = NULL;
> +		names[2] = "control";
> +	}
> +
> +	/* Allocate/initialize parameters for send/receive virtqueues */
> +	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
> +		int j = (i == 0 ? i : i + vi->has_cvq);
> +		callbacks[j] = skb_recv_done;
> +		callbacks[j + 1] = skb_xmit_done;
> +		names[j] = kasprintf(GFP_KERNEL, "input.%d", i / 2);
> +		names[j + 1] = kasprintf(GFP_KERNEL, "output.%d", i / 2);

This needs wrappers. E.g. virtnet_rx_vq(int queue_pair), virtnet_tx_vq(int queue_pair);
Then you would just scan 0 to num_queue_pairs, and i is queue pair
number.

> +	}
>  
> -	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> +	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
> +					 (const char **)names);
> +	if (ret)
> +		goto err;
> +
> +	if (vi->has_cvq)
>  		vi->cvq = vqs[2];
>  
> -		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> -			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
> +	for (i = 0; i < vi->num_queue_pairs * 2; i += 2) {
> +		int j = i == 0 ? i : i + vi->has_cvq;
> +		vi->rq[i / 2]->vq = vqs[j];
> +		vi->sq[i / 2]->vq = vqs[j + 1];

Same here.

>  	}
> -	return 0;
> +
> +err:
> +	if (ret && names)

If we are here ret != 0. For names, just add another label, don't
complicate cleanup.

> +		for (i = 0; i < vi->num_queue_pairs * 2; i++)
> +			kfree(names[i]);
> +
> +	kfree(names);
> +	kfree(callbacks);
> +	kfree(vqs);
> +
> +	return ret;
> +}
> +
> +static int virtnet_alloc_queues(struct virtnet_info *vi)
> +{
> +	int ret = -ENOMEM;
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		vi->rq[i] = kzalloc(sizeof(*vi->rq[i]), GFP_KERNEL);
> +		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
> +		if (!vi->rq[i] || !vi->sq[i])
> +			goto err;
> +	}
> +
> +	ret = 0;
> +
> +	/* setup initial receive and send queue parameters */
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		vi->rq[i]->vi = vi;
> +		vi->rq[i]->pages = NULL;
> +		INIT_DELAYED_WORK(&vi->rq[i]->refill, refill_work);
> +		netif_napi_add(vi->dev, &vi->rq[i]->napi, virtnet_poll,
> +			       napi_weight);
> +
> +		sg_init_table(vi->rq[i]->sg, ARRAY_SIZE(vi->rq[i]->sg));
> +		sg_init_table(vi->sq[i]->sg, ARRAY_SIZE(vi->sq[i]->sg));
> +	}
> +

Add return 0 here, then ret = 0 will not be needed
above and if (ret) below.


> +err:
> +	if (ret)
> +		virtnet_free_queues(vi);
> +
> +	return ret;
> +}
> +
> +static int virtnet_setup_vqs(struct virtnet_info *vi)
> +{
> +	int ret;
> +
> +	/* Allocate send & receive queues */
> +	ret = virtnet_alloc_queues(vi);
> +	if (!ret) {
> +		ret = virtnet_find_vqs(vi);
> +		if (ret)
> +			virtnet_free_queues(vi);
> +		else
> +			virtnet_set_affinity(vi, true);
> +	}
> +
> +	return ret;

Add some labels for error handling, this if nesting is messy.

>  }
>  
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
> -	int err;
> +	int i, err;
>  	struct net_device *dev;
>  	struct virtnet_info *vi;
> +	u16 num_queues, num_queue_pairs;
> +
> +	/* Find if host supports multiqueue virtio_net device */
> +	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
> +				offsetof(struct virtio_net_config,
> +				num_queues), &num_queues);
> +
> +	/* We need atleast 2 queue's */

typo

> +	if (err || num_queues < 2)
> +		num_queues = 2;
> +	if (num_queues > MAX_QUEUES * 2)
> +		num_queues = MAX_QUEUES;
> +
> +	num_queue_pairs = num_queues / 2;
>  
>  	/* Allocate ourselves a network device with room for our info */
> -	dev = alloc_etherdev(sizeof(struct virtnet_info));
> +	dev = alloc_etherdev_mq(sizeof(struct virtnet_info), num_queue_pairs);
>  	if (!dev)
>  		return -ENOMEM;
>  
> @@ -1103,22 +1362,18 @@ static int virtnet_probe(struct virtio_device *vdev)
>  
>  	/* Set up our device-specific information */
>  	vi = netdev_priv(dev);
> -	netif_napi_add(dev, &vi->napi, virtnet_poll, napi_weight);
>  	vi->dev = dev;
>  	vi->vdev = vdev;
>  	vdev->priv = vi;
> -	vi->pages = NULL;
>  	vi->stats = alloc_percpu(struct virtnet_stats);
>  	err = -ENOMEM;
>  	if (vi->stats == NULL)
> -		goto free;
> +		goto free_netdev;
>  
> -	INIT_DELAYED_WORK(&vi->refill, refill_work);
>  	mutex_init(&vi->config_lock);
>  	vi->config_enable = true;
>  	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
> -	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
> -	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
> +	vi->num_queue_pairs = num_queue_pairs;
>  
>  	/* If we can receive ANY GSO packets, we must allocate large ones. */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> @@ -1129,9 +1384,17 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
>  		vi->mergeable_rx_bufs = true;
>  
> -	err = init_vqs(vi);
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
> +		vi->has_cvq = true;
> +

How about we disable multiqueue if there's no cvq?
Will make logic a bit simpler, won't it?

> +	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
> +	err = virtnet_setup_vqs(vi);
>  	if (err)
> -		goto free_stats;
> +		goto free_netdev;
> +
> +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) &&
> +	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> +		dev->features |= NETIF_F_HW_VLAN_FILTER;
>  
>  	err = register_netdev(dev);
>  	if (err) {
> @@ -1140,12 +1403,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	}
>  
>  	/* Last of all, set up some receive buffers. */
> -	try_fill_recv(vi, GFP_KERNEL);
> -
> -	/* If we didn't even get one input buffer, we're useless. */
> -	if (vi->num == 0) {
> -		err = -ENOMEM;
> -		goto unregister;
> +	for (i = 0; i < num_queue_pairs; i++) {
> +		try_fill_recv(vi->rq[i], GFP_KERNEL);
> +
> +		/* If we didn't even get one input buffer, we're useless. */
> +		if (vi->rq[i]->num == 0) {
> +			free_unused_bufs(vi);
> +			err = -ENOMEM;
> +			goto free_recv_bufs;
> +		}
>  	}
>  
>  	/* Assume link up if device can't report link status,
> @@ -1158,42 +1424,25 @@ static int virtnet_probe(struct virtio_device *vdev)
>  		netif_carrier_on(dev);
>  	}
>  
> -	pr_debug("virtnet: registered device %s\n", dev->name);
> +	pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
> +		 dev->name, num_queue_pairs);
> +
>  	return 0;
>  
> -unregister:
> +free_recv_bufs:
> +	free_receive_bufs(vi);
>  	unregister_netdev(dev);
> +
>  free_vqs:
> -	vdev->config->del_vqs(vdev);
> -free_stats:
> -	free_percpu(vi->stats);
> -free:
> +	for (i = 0; i < num_queue_pairs; i++)
> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
> +	virtnet_del_vqs(vi);
> +
> +free_netdev:
>  	free_netdev(dev);
>  	return err;
>  }
>  
> -static void free_unused_bufs(struct virtnet_info *vi)
> -{
> -	void *buf;
> -	while (1) {
> -		buf = virtqueue_detach_unused_buf(vi->svq);
> -		if (!buf)
> -			break;
> -		dev_kfree_skb(buf);
> -	}
> -	while (1) {
> -		buf = virtqueue_detach_unused_buf(vi->rvq);
> -		if (!buf)
> -			break;
> -		if (vi->mergeable_rx_bufs || vi->big_packets)
> -			give_pages(vi, buf);
> -		else
> -			dev_kfree_skb(buf);
> -		--vi->num;
> -	}
> -	BUG_ON(vi->num != 0);
> -}
> -
>  static void remove_vq_common(struct virtnet_info *vi)
>  {
>  	vi->vdev->config->reset(vi->vdev);
> @@ -1201,10 +1450,9 @@ static void remove_vq_common(struct virtnet_info *vi)
>  	/* Free unused buffers in both send and recv, if any. */
>  	free_unused_bufs(vi);
>  
> -	vi->vdev->config->del_vqs(vi->vdev);
> +	free_receive_bufs(vi);
>  
> -	while (vi->pages)
> -		__free_pages(get_a_page(vi, GFP_KERNEL), 0);
> +	virtnet_del_vqs(vi);
>  }
>  
>  static void __devexit virtnet_remove(struct virtio_device *vdev)
> @@ -1230,6 +1478,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>  static int virtnet_freeze(struct virtio_device *vdev)
>  {
>  	struct virtnet_info *vi = vdev->priv;
> +	int i;
>  
>  	/* Prevent config work handler from accessing the device */
>  	mutex_lock(&vi->config_lock);
> @@ -1237,10 +1486,13 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  	mutex_unlock(&vi->config_lock);
>  
>  	netif_device_detach(vi->dev);
> -	cancel_delayed_work_sync(&vi->refill);
> +	for (i = 0; i < vi->num_queue_pairs; i++)
> +		cancel_delayed_work_sync(&vi->rq[i]->refill);
>  
>  	if (netif_running(vi->dev))
> -		napi_disable(&vi->napi);
> +		for (i = 0; i < vi->num_queue_pairs; i++)
> +			napi_disable(&vi->rq[i]->napi);
> +
>  
>  	remove_vq_common(vi);
>  
> @@ -1252,19 +1504,22 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  static int virtnet_restore(struct virtio_device *vdev)
>  {
>  	struct virtnet_info *vi = vdev->priv;
> -	int err;
> +	int err, i;
>  
> -	err = init_vqs(vi);
> +	err = virtnet_setup_vqs(vi);
>  	if (err)
>  		return err;
>  
>  	if (netif_running(vi->dev))
> -		virtnet_napi_enable(vi);
> +		for (i = 0; i < vi->num_queue_pairs; i++)
> +			virtnet_napi_enable(vi->rq[i]);
>  
>  	netif_device_attach(vi->dev);
>  
> -	if (!try_fill_recv(vi, GFP_KERNEL))
> -		queue_delayed_work(system_nrt_wq, &vi->refill, 0);
> +	for (i = 0; i < vi->num_queue_pairs; i++)
> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> +			queue_delayed_work(system_nrt_wq,
> +					   &vi->rq[i]->refill, 0);
>  
>  	mutex_lock(&vi->config_lock);
>  	vi->config_enable = true;
> @@ -1287,7 +1542,7 @@ static unsigned int features[] = {
>  	VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
>  	VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
>  	VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
> -	VIRTIO_NET_F_GUEST_ANNOUNCE,
> +	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MULTIQUEUE,
>  };
>  
>  static struct virtio_driver virtio_net_driver = {
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 1bc7e30..60f09ff 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -61,6 +61,8 @@ struct virtio_net_config {
>  	__u8 mac[6];
>  	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
>  	__u16 status;
> +	/* Total number of RX/TX queues */
> +	__u16 num_queues;
>  } __attribute__((packed));
>  
>  /* This is the first element of the scatter-gather list.  If you don't
> -- 
> 1.7.1

^ permalink raw reply

* Re: [PATCH] net: ethernet: davinci_emac: add pm_runtime support
From: Sekhar Nori @ 2012-07-20 13:22 UTC (permalink / raw)
  To: Mark A. Greer
  Cc: netdev, linux-omap, linux-arm-kernel, Kevin Hilman, David Miller,
	davinci-linux-open-source@linux.davincidsp.com
In-Reply-To: <1342736577-30477-1-git-send-email-mgreer@animalcreek.com>

+ Dave Miller and DaVinci list

Hi Mark,

On 7/20/2012 3:52 AM, Mark A. Greer wrote:
> From: "Mark A. Greer" <mgreer@animalcreek.com>
> 
> Add pm_runtime support to the TI Davinci EMAC driver.
> 
> CC: Sekhar Nori <nsekhar@ti.com>
> CC: Kevin Hilman <khilman@ti.com>
> Signed-off-by: Mark A. Greer <mgreer@animalcreek.com>

Tested on AM18x EVM using NFS root and suspend-to-RAM.

Acked-by: Sekhar Nori <nsekhar@ti.com>

> ---
> 
> a) This patch depends on a patch by Kevin Hilman that has been
>    accepted for 3.6 and is waiting in arm-soc/for-next:
>    ce9dcb8784611c50974d1c6b600c71f5c0a29308
>    (ARM: davinci: add runtime PM support for clock management)

Since the patch does not have any dependency as far as applying it is
concerned, I guess it can be queued through the network tree.

> 
> b) Applies on top of current k.o. master:
>    3e4b9459fb0e149c6b74c9e89399a8fc39a92b44
>    (Merge tag 'md-3.5-fixes' of git://neil.brown.name/md)

The patch will have to apply to net-next if it goes through the net tree.

Thanks,
Sekhar

^ permalink raw reply

* [PATCH] gianfar: add support for wake-on-packet
From: Zhao Chenhui @ 2012-07-20 12:52 UTC (permalink / raw)
  To: netdev; +Cc: scottwood, linux-kernel, leoli

On certain chip like MPC8536 and P1022, system can be waked up from
sleep by user-defined packet and Magic Patcket.(The eTSEC cannot
supports both types of wake-up event simultaneously.) This patch
implements wake-up on user-defined patcket including ARP request
packet and unicast patcket to this station.

When entering suspend state, the gianfar driver sets receive queue
filer table to filter all of packets except ARP request packet and
unicast patcket to this station. The driver temporarily uses the last
receive queue to receive the user defined packet.

In suspend state, the receive part of eTSEC keeps working. When
receiving a user defined packet, it generates an interrupt to
wake up the system.

The rule of the filer table is as below.
  if (arp request to local ip address)
      accept it to the last queue
  elif (unicast packet to local mac address)
      accept it to the last queue
  else
      reject it
  endif
Note: The local ip/mac address is the ethernet primary IP/MAC address of
the station. Do not support multiple IP/MAC addresses.

Here is an example of enabling and testing wake up on user defined packet.
  ifconfig eth0 10.193.20.169
  ethtool -s eth0 wol a
  echo standby > /sys/power/state or echo mem > /sys/power/state

Ping from PC host to wake up the station:
  ping 10.193.20.169

Signed-off-by: Dave Liu <daveliu@freescale.com>
Signed-off-by: Jin Qing <b24347@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Zhao Chenhui <chenhui.zhao@freescale.com>
Acked-by: Andy Fleming <afleming@freescale.com>
---
 drivers/net/ethernet/freescale/gianfar.c         |  371 +++++++++++++++++----
 drivers/net/ethernet/freescale/gianfar.h         |   28 ++-
 drivers/net/ethernet/freescale/gianfar_ethtool.c |   55 +++-
 3 files changed, 360 insertions(+), 94 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index f2db8fc..2e51803 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -85,6 +85,8 @@
 #include <linux/tcp.h>
 #include <linux/udp.h>
 #include <linux/in.h>
+#include <linux/inetdevice.h>
+#include <sysdev/fsl_soc.h>
 #include <linux/net_tstamp.h>
 
 #include <asm/io.h>
@@ -144,6 +146,17 @@ static void gfar_clear_exact_match(struct net_device *dev);
 static void gfar_set_mac_for_addr(struct net_device *dev, int num,
 				  const u8 *addr);
 static int gfar_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
+static void gfar_schedule_cleanup(struct gfar_priv_grp *gfargrp);
+
+#ifdef CONFIG_PM
+static void gfar_halt_rx(struct net_device *dev);
+static void gfar_rx_start(struct net_device *dev);
+static void gfar_enable_filer(struct net_device *dev);
+static void gfar_disable_filer(struct net_device *dev);
+static void gfar_config_filer_table(struct net_device *dev);
+static void gfar_restore_filer_table(struct net_device *dev);
+static int gfar_get_ip(struct net_device *dev);
+#endif
 
 MODULE_AUTHOR("Freescale Semiconductor, Inc");
 MODULE_DESCRIPTION("Gianfar Ethernet Driver");
@@ -760,8 +773,17 @@ static int gfar_of_init(struct platform_device *ofdev, struct net_device **pdev)
 	else
 		priv->interface = PHY_INTERFACE_MODE_MII;
 
-	if (of_get_property(np, "fsl,magic-packet", NULL))
+	priv->wol_opts = 0;
+	priv->wol_supported = 0;
+	if (of_get_property(np, "fsl,magic-packet", NULL)) {
 		priv->device_flags |= FSL_GIANFAR_DEV_HAS_MAGIC_PACKET;
+		priv->wol_supported |= GIANFAR_WOL_MAGIC;
+	}
+
+	if (of_get_property(np, "fsl,wake-on-filer", NULL)) {
+		priv->wol_supported |= GIANFAR_WOL_ARP;
+		priv->wol_supported |= GIANFAR_WOL_UCAST;
+	}
 
 	priv->phy_node = of_parse_phandle(np, "phy-handle", 0);
 
@@ -1164,8 +1186,9 @@ static int gfar_probe(struct platform_device *ofdev)
 		goto register_fail;
 	}
 
-	device_init_wakeup(&dev->dev,
-		priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET);
+	if (priv->wol_supported) {
+		device_set_wakeup_capable(&ofdev->dev, true);
+	}
 
 	/* fill out IRQ number and name fields */
 	for (i = 0; i < priv->num_grps; i++) {
@@ -1232,100 +1255,215 @@ static int gfar_remove(struct platform_device *ofdev)
 }
 
 #ifdef CONFIG_PM
-
 static int gfar_suspend(struct device *dev)
 {
 	struct gfar_private *priv = dev_get_drvdata(dev);
 	struct net_device *ndev = priv->ndev;
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
-	unsigned long flags;
+	int retval;
 	u32 tempval;
 
-	int magic_packet = priv->wol_en &&
-		(priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET);
+	if (!priv->wol_opts) {
+		/* Wake-On-Lan is disabled */
+		if (!netif_running(ndev))
+			return 0;
+		gfar_close(ndev);
+		return 0;
+	}
+
+	retval = mpc85xx_pmc_set_wake(dev, true);
+	if (retval)
+		return retval;
 
 	netif_device_detach(ndev);
+	disable_napi(priv);
+	gfar_halt(ndev);
 
-	if (netif_running(ndev)) {
+	if (priv->wol_opts & GIANFAR_WOL_MAGIC) {
+		/* Enable Magic Packet mode */
+		tempval = gfar_read(&regs->maccfg2);
+		tempval |= MACCFG2_MPEN;
+		gfar_write(&regs->maccfg2, tempval);
+	}
 
-		local_irq_save(flags);
-		lock_tx_qs(priv);
-		lock_rx_qs(priv);
+	if (priv->wol_opts & (GIANFAR_WOL_ARP | GIANFAR_WOL_UCAST)) {
+		mpc85xx_pmc_set_lossless_ethernet(1);
+		gfar_disable_filer(ndev);
+		gfar_config_filer_table(ndev);
+		gfar_enable_filer(ndev);
+	}
 
-		gfar_halt_nodisable(ndev);
+	gfar_rx_start(ndev);
+	return 0;
+}
 
-		/* Disable Tx, and Rx if wake-on-LAN is disabled. */
-		tempval = gfar_read(&regs->maccfg1);
+static int gfar_resume(struct device *dev)
+{
+	struct gfar_private *priv = dev_get_drvdata(dev);
+	struct net_device *ndev = priv->ndev;
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 tempval, i;
 
-		tempval &= ~MACCFG1_TX_EN;
+	if (!priv->wol_opts) {
+		/* Wake-On-Lan is disabled */
+		if (!netif_running(ndev))
+			return 0;
+		gfar_enet_open(ndev);
+		return 0;
+	}
 
-		if (!magic_packet)
-			tempval &= ~MACCFG1_RX_EN;
+	mpc85xx_pmc_set_wake(dev, false);
 
-		gfar_write(&regs->maccfg1, tempval);
+	gfar_halt_rx(ndev);
 
-		unlock_rx_qs(priv);
-		unlock_tx_qs(priv);
-		local_irq_restore(flags);
+	if (priv->wol_opts & GIANFAR_WOL_MAGIC) {
+		/* Disable Magic Packet mode */
+		tempval = gfar_read(&regs->maccfg2);
+		tempval &= ~MACCFG2_MPEN;
+		gfar_write(&regs->maccfg2, tempval);
+	}
 
-		disable_napi(priv);
+	if (priv->wol_opts & (GIANFAR_WOL_ARP | GIANFAR_WOL_UCAST)) {
+		mpc85xx_pmc_set_lossless_ethernet(0);
+		gfar_disable_filer(ndev);
+		gfar_restore_filer_table(ndev);
+	}
 
-		if (magic_packet) {
-			/* Enable interrupt on Magic Packet */
-			gfar_write(&regs->imask, IMASK_MAG);
+	gfar_start(ndev);
+	enable_napi(priv);
+	netif_device_attach(ndev);
 
-			/* Enable Magic Packet mode */
-			tempval = gfar_read(&regs->maccfg2);
-			tempval |= MACCFG2_MPEN;
-			gfar_write(&regs->maccfg2, tempval);
-		} else {
-			phy_stop(priv->phydev);
-		}
+	if (priv->wol_opts & (GIANFAR_WOL_ARP | GIANFAR_WOL_UCAST)) {
+		/* send requests to process the received packets */
+		for (i = 0; i < priv->num_grps; i++)
+			gfar_schedule_cleanup(&priv->gfargrp[i]);
 	}
-
 	return 0;
 }
 
-static int gfar_resume(struct device *dev)
+static void gfar_enable_filer(struct net_device *dev)
 {
-	struct gfar_private *priv = dev_get_drvdata(dev);
-	struct net_device *ndev = priv->ndev;
+	struct gfar_private *priv = netdev_priv(dev);
 	struct gfar __iomem *regs = priv->gfargrp[0].regs;
-	unsigned long flags;
-	u32 tempval;
-	int magic_packet = priv->wol_en &&
-		(priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET);
+	u32 temp;
 
-	if (!netif_running(ndev)) {
-		netif_device_attach(ndev);
-		return 0;
-	}
+	temp = gfar_read(&regs->rctrl);
+	temp &= ~(RCTRL_FSQEN | RCTRL_PRSDEP_MASK);
+	temp |= RCTRL_FILREN | RCTRL_PRSDEP_L2L3;
+	gfar_write(&regs->rctrl, temp);
+}
 
-	if (!magic_packet && priv->phydev)
-		phy_start(priv->phydev);
+static void gfar_disable_filer(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 temp;
 
-	/* Disable Magic Packet mode, in case something
-	 * else woke us up.
-	 */
-	local_irq_save(flags);
-	lock_tx_qs(priv);
-	lock_rx_qs(priv);
+	temp = gfar_read(&regs->rctrl);
+	temp &= ~RCTRL_FILREN;
+	gfar_write(&regs->rctrl, temp);
+}
 
-	tempval = gfar_read(&regs->maccfg2);
-	tempval &= ~MACCFG2_MPEN;
-	gfar_write(&regs->maccfg2, tempval);
+static int gfar_get_ip(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	struct in_device *in_dev;
+	int ret = -ENOENT;
 
-	gfar_start(ndev);
+	in_dev = in_dev_get(dev);
+	if (!in_dev)
+		return ret;
 
-	unlock_rx_qs(priv);
-	unlock_tx_qs(priv);
-	local_irq_restore(flags);
+	/* Get the primary IP address */
+	for_primary_ifa(in_dev) {
+		priv->ip_addr = ifa->ifa_address;
+		ret = 0;
+		break;
+	} endfor_ifa(in_dev);
 
-	netif_device_attach(ndev);
+	in_dev_put(in_dev);
+	return ret;
+}
 
-	enable_napi(priv);
+static void gfar_restore_filer_table(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	u32 rqfcr, rqfpr;
+	int i;
 
-	return 0;
+	for (i = 0; i <= MAX_FILER_IDX; i++) {
+		rqfcr = priv->ftp_rqfcr[i];
+		rqfpr = priv->ftp_rqfpr[i];
+		gfar_write_filer(priv, i, rqfcr, rqfpr);
+	}
+}
+
+static void gfar_config_filer_table(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	u32 dest_mac_addr;
+	u32 rqfcr, rqfpr;
+	unsigned int index;
+	u8 rqfcr_queue;
+
+	if (gfar_get_ip(dev)) {
+		netif_err(priv, wol, dev, "WOL: get the ip address error\n");
+		return;
+	}
+
+	/* init filer table */
+	rqfcr = RQFCR_RJE | RQFCR_CMP_MATCH;
+	rqfpr = 0x0;
+	for (index = 0; index <= MAX_FILER_IDX; index++)
+		gfar_write_filer(priv, index, rqfcr, rqfpr);
+
+	index = 0;
+	/* select a rx queue in group 0 */
+	rqfcr_queue = (u8)find_first_bit(&priv->gfargrp[0].rx_bit_map,
+							priv->num_rx_queues);
+	if (priv->wol_opts & GIANFAR_WOL_ARP) {
+		/* ARP request filer, filling the packet to the last queue */
+		rqfcr = (rqfcr_queue << 10) | RQFCR_AND |
+					RQFCR_CMP_EXACT | RQFCR_PID_MASK;
+		rqfpr = RQFPR_ARQ;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+
+		rqfcr = (rqfcr_queue << 10) | RQFCR_AND |
+					RQFCR_CMP_EXACT | RQFCR_PID_PARSE;
+		rqfpr = RQFPR_ARQ;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+
+		/*
+		 * DEST_IP address in ARP packet,
+		 * filling it to the last queue.
+		 */
+		rqfcr = (rqfcr_queue << 10) | RQFCR_AND |
+					RQFCR_CMP_EXACT | RQFCR_PID_MASK;
+		rqfpr = FPR_FILER_MASK;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+
+		rqfcr = (rqfcr_queue << 10) | RQFCR_GPI |
+					RQFCR_CMP_EXACT | RQFCR_PID_DIA;
+		rqfpr = priv->ip_addr;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+	}
+
+	if (priv->wol_opts & GIANFAR_WOL_UCAST) {
+		/* Unicast packet, filling it to the last queue */
+		dest_mac_addr = (dev->dev_addr[0] << 16) |
+				(dev->dev_addr[1] << 8) | dev->dev_addr[2];
+		rqfcr = (rqfcr_queue << 10) | RQFCR_AND |
+					RQFCR_CMP_EXACT | RQFCR_PID_DAH;
+		rqfpr = dest_mac_addr;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+
+		dest_mac_addr = (dev->dev_addr[3] << 16) |
+				(dev->dev_addr[4] << 8) | dev->dev_addr[5];
+		rqfcr = (rqfcr_queue << 10) | RQFCR_GPI |
+					RQFCR_CMP_EXACT | RQFCR_PID_DAL;
+		rqfpr = dest_mac_addr;
+		gfar_write_filer(priv, index++, rqfcr, rqfpr);
+	}
 }
 
 static int gfar_restore(struct device *dev)
@@ -1574,6 +1712,48 @@ static int __gfar_is_rx_idle(struct gfar_private *priv)
 	return 0;
 }
 
+#ifdef CONFIG_PM
+/* Halt the receive queues */
+static void gfar_halt_rx(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 tempval;
+	int i = 0;
+
+	for (i = 0; i < priv->num_grps; i++) {
+		regs = priv->gfargrp[i].regs;
+		/* Mask all interrupts */
+		gfar_write(&regs->imask, IMASK_INIT_CLEAR);
+
+		/* Clear all interrupts */
+		gfar_write(&regs->ievent, IEVENT_INIT_CLEAR);
+	}
+
+	regs = priv->gfargrp[0].regs;
+	/* Stop the DMA, and wait for it to stop */
+	tempval = gfar_read(&regs->dmactrl);
+	if ((tempval & DMACTRL_GRS) != DMACTRL_GRS) {
+		int ret;
+
+		tempval |= DMACTRL_GRS;
+		gfar_write(&regs->dmactrl, tempval);
+
+		do {
+			ret = spin_event_timeout(((gfar_read(&regs->ievent) &
+				IEVENT_GRSC) == IEVENT_GRSC), 1000000, 0);
+			if (!ret && !(gfar_read(&regs->ievent) & IEVENT_GRSC))
+				ret = __gfar_is_rx_idle(priv);
+		} while (!ret);
+	}
+
+	/* Disable Rx in MACCFG1  */
+	tempval = gfar_read(&regs->maccfg1);
+	tempval &= ~MACCFG1_RX_EN;
+	gfar_write(&regs->maccfg1, tempval);
+}
+#endif
+
 /* Halt the receive and transmit queues */
 static void gfar_halt_nodisable(struct net_device *dev)
 {
@@ -1783,6 +1963,40 @@ void gfar_start(struct net_device *dev)
 	dev->trans_start = jiffies; /* prevent tx timeout */
 }
 
+#ifdef CONFIG_PM
+void gfar_rx_start(struct net_device *dev)
+{
+	struct gfar_private *priv = netdev_priv(dev);
+	struct gfar __iomem *regs = priv->gfargrp[0].regs;
+	u32 tempval;
+	int i = 0;
+
+	/* Enable Rx in MACCFG1 */
+	tempval = gfar_read(&regs->maccfg1);
+	tempval |= MACCFG1_RX_EN;
+	gfar_write(&regs->maccfg1, tempval);
+
+	/* Initialize DMACTRL to have WWR and WOP */
+	tempval = gfar_read(&regs->dmactrl);
+	tempval |= DMACTRL_INIT_SETTINGS;
+	gfar_write(&regs->dmactrl, tempval);
+
+	/* Make sure we aren't stopped */
+	tempval = gfar_read(&regs->dmactrl);
+	tempval &= ~DMACTRL_GRS;
+	gfar_write(&regs->dmactrl, tempval);
+
+	for (i = 0; i < priv->num_grps; i++) {
+		regs = priv->gfargrp[i].regs;
+		/* Clear RHLT, so that the DMA starts polling now */
+		gfar_write(&regs->rstat, priv->gfargrp[i].rstat);
+
+		/* Unmask the interrupts we look for */
+		gfar_write(&regs->imask, IMASK_FGPI | IMASK_MAG);
+	}
+}
+#endif
+
 void gfar_configure_coalescing(struct gfar_private *priv,
 	unsigned long tx_mask, unsigned long rx_mask)
 {
@@ -1829,30 +2043,34 @@ static int register_grp_irqs(struct gfar_priv_grp *grp)
 	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_MULTI_INTR) {
 		/* Install our interrupt handlers for Error,
 		 * Transmit, and Receive */
-		if ((err = request_irq(grp->interruptError, gfar_error, 0,
-				grp->int_name_er,grp)) < 0) {
+		err = request_irq(grp->interruptError, gfar_error,
+			IRQF_NO_SUSPEND, grp->int_name_er, grp);
+		if (err) {
 			netif_err(priv, intr, dev, "Can't get IRQ %d\n",
 				  grp->interruptError);
 
 			goto err_irq_fail;
 		}
 
-		if ((err = request_irq(grp->interruptTransmit, gfar_transmit,
-				0, grp->int_name_tx, grp)) < 0) {
+		err = request_irq(grp->interruptTransmit, gfar_transmit,
+			IRQF_NO_SUSPEND, grp->int_name_tx, grp);
+		if (err) {
 			netif_err(priv, intr, dev, "Can't get IRQ %d\n",
 				  grp->interruptTransmit);
 			goto tx_irq_fail;
 		}
 
-		if ((err = request_irq(grp->interruptReceive, gfar_receive, 0,
-				grp->int_name_rx, grp)) < 0) {
+		err = request_irq(grp->interruptReceive, gfar_receive,
+			IRQF_NO_SUSPEND, grp->int_name_rx, grp);
+		if (err) {
 			netif_err(priv, intr, dev, "Can't get IRQ %d\n",
 				  grp->interruptReceive);
 			goto rx_irq_fail;
 		}
 	} else {
-		if ((err = request_irq(grp->interruptTransmit, gfar_interrupt, 0,
-				grp->int_name_tx, grp)) < 0) {
+		err = request_irq(grp->interruptTransmit, gfar_interrupt,
+			IRQF_NO_SUSPEND, grp->int_name_tx, grp);
+		if (err) {
 			netif_err(priv, intr, dev, "Can't get IRQ %d\n",
 				  grp->interruptTransmit);
 			goto err_irq_fail;
@@ -1943,7 +2161,7 @@ static int gfar_enet_open(struct net_device *dev)
 
 	netif_tx_start_all_queues(dev);
 
-	device_set_wakeup_enable(&dev->dev, priv->wol_en);
+	device_set_wakeup_enable(&priv->ofdev->dev, (bool)!!priv->wol_opts);
 
 	return err;
 }
@@ -2654,6 +2872,17 @@ static inline void count_errors(unsigned short status, struct net_device *dev)
 
 irqreturn_t gfar_receive(int irq, void *grp_id)
 {
+	struct gfar_priv_grp *gfargrp = grp_id;
+	struct gfar __iomem *regs = gfargrp->regs;
+	u32 ievent;
+
+	ievent = gfar_read(&regs->ievent);
+
+	if ((ievent & IEVENT_FGPI) == IEVENT_FGPI) {
+		gfar_write(&regs->ievent, ievent & IEVENT_RX_MASK);
+		return IRQ_HANDLED;
+	}
+
 	gfar_schedule_cleanup((struct gfar_priv_grp *)grp_id);
 	return IRQ_HANDLED;
 }
diff --git a/drivers/net/ethernet/freescale/gianfar.h b/drivers/net/ethernet/freescale/gianfar.h
index 2136c7f..9d45f51 100644
--- a/drivers/net/ethernet/freescale/gianfar.h
+++ b/drivers/net/ethernet/freescale/gianfar.h
@@ -229,6 +229,13 @@ extern const char gfar_driver_version[];
 #define RQUEUE_EN7		0x00000001
 #define RQUEUE_EN_ALL		0x000000FF
 
+/* Wake-On-Lan options */
+#define GIANFAR_WOL_UCAST	0x00000001	/* Unicast wakeup */
+#define GIANFAR_WOL_MCAST	0x00000002	/* Multicast wakeup */
+#define GIANFAR_WOL_BCAST	0x00000004	/* Broadcase wakeup */
+#define GIANFAR_WOL_ARP		0x00000008	/* ARP request wakeup */
+#define GIANFAR_WOL_MAGIC	0x00000010	/* Magic packet wakeup */
+
 /* Init to do tx snooping for buffers and descriptors */
 #define DMACTRL_INIT_SETTINGS   0x000000c3
 #define DMACTRL_GRS             0x00000010
@@ -274,11 +281,15 @@ extern const char gfar_driver_version[];
 #define RCTRL_PAL_MASK		0x001f0000
 #define RCTRL_VLEX		0x00002000
 #define RCTRL_FILREN		0x00001000
+#define RCTRL_FSQEN		0x00000800
 #define RCTRL_GHTX		0x00000400
 #define RCTRL_IPCSEN		0x00000200
 #define RCTRL_TUCSEN		0x00000100
 #define RCTRL_PRSDEP_MASK	0x000000c0
 #define RCTRL_PRSDEP_INIT	0x000000c0
+#define RCTRL_PRSDEP_L2		0x00000040
+#define RCTRL_PRSDEP_L2L3	0x00000080
+#define RCTRL_PRSDEP_L2L3L4	0x000000c0
 #define RCTRL_PRSFM		0x00000020
 #define RCTRL_PROM		0x00000008
 #define RCTRL_EMEN		0x00000002
@@ -324,18 +335,20 @@ extern const char gfar_driver_version[];
 #define IEVENT_MAG		0x00000800
 #define IEVENT_GRSC		0x00000100
 #define IEVENT_RXF0		0x00000080
+#define IEVENT_FGPI		0x00000010
 #define IEVENT_FIR		0x00000008
 #define IEVENT_FIQ		0x00000004
 #define IEVENT_DPE		0x00000002
 #define IEVENT_PERR		0x00000001
-#define IEVENT_RX_MASK          (IEVENT_RXB0 | IEVENT_RXF0 | IEVENT_BSY)
+#define IEVENT_RX_MASK          (IEVENT_RXB0 | IEVENT_RXF0 | IEVENT_BSY | \
+				 IEVENT_FGPI)
 #define IEVENT_TX_MASK          (IEVENT_TXB | IEVENT_TXF)
 #define IEVENT_RTX_MASK         (IEVENT_RX_MASK | IEVENT_TX_MASK)
 #define IEVENT_ERR_MASK         \
-(IEVENT_RXC | IEVENT_BSY | IEVENT_EBERR | IEVENT_MSRO | \
- IEVENT_BABT | IEVENT_TXC | IEVENT_TXE | IEVENT_LC \
- | IEVENT_CRL | IEVENT_XFUN | IEVENT_DPE | IEVENT_PERR \
- | IEVENT_MAG | IEVENT_BABR)
+	(IEVENT_RXC | IEVENT_BSY | IEVENT_EBERR | IEVENT_MSRO | \
+	 IEVENT_BABT | IEVENT_TXC | IEVENT_TXE | IEVENT_LC | \
+	 IEVENT_CRL | IEVENT_XFUN | IEVENT_DPE | IEVENT_PERR | \
+	 IEVENT_MAG | IEVENT_BABR | IEVENT_FIR | IEVENT_FIQ)
 
 #define IMASK_INIT_CLEAR	0x00000000
 #define IMASK_BABR              0x80000000
@@ -356,6 +369,7 @@ extern const char gfar_driver_version[];
 #define IMASK_MAG		0x00000800
 #define IMASK_GRSC              0x00000100
 #define IMASK_RXFEN0		0x00000080
+#define IMASK_FGPI		0x00000010
 #define IMASK_FIR		0x00000008
 #define IMASK_FIQ		0x00000004
 #define IMASK_DPE		0x00000002
@@ -1112,6 +1126,10 @@ struct gfar_private {
 
 	struct work_struct reset_task;
 
+	u32 ip_addr;       /* the primary IP address of the device */
+	u32 wol_opts;      /* enabled Wake-on-Lan modes */
+	u32 wol_supported; /* supported Wake-on-Lan modes */
+
 	/* Network Statistics */
 	struct gfar_extra_stats extra_stats;
 
diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 8a02557..33f806b 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -30,6 +30,7 @@
 #include <linux/skbuff.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
+#include <linux/platform_device.h>
 
 #include <asm/io.h>
 #include <asm/irq.h>
@@ -578,32 +579,50 @@ static void gfar_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
 {
 	struct gfar_private *priv = netdev_priv(dev);
 
-	if (priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET) {
-		wol->supported = WAKE_MAGIC;
-		wol->wolopts = priv->wol_en ? WAKE_MAGIC : 0;
-	} else {
-		wol->supported = wol->wolopts = 0;
-	}
+	wol->supported = 0;
+	wol->wolopts = 0;
+
+	if (!priv->wol_supported || !device_can_wakeup(&priv->ofdev->dev))
+		return;
+
+	if (priv->wol_supported & GIANFAR_WOL_MAGIC)
+		wol->supported |= WAKE_MAGIC;
+
+	if (priv->wol_supported & GIANFAR_WOL_ARP)
+		wol->supported |= WAKE_ARP;
+
+	if (priv->wol_supported & GIANFAR_WOL_UCAST)
+		wol->supported |= WAKE_UCAST;
+
+	if (priv->wol_opts & GIANFAR_WOL_MAGIC)
+		wol->wolopts |= WAKE_MAGIC;
+
+	if (priv->wol_opts & GIANFAR_WOL_ARP)
+		wol->wolopts |= WAKE_ARP;
+
+	if (priv->wol_opts & GIANFAR_WOL_UCAST)
+		wol->wolopts |= WAKE_UCAST;
 }
 
 static int gfar_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
 {
 	struct gfar_private *priv = netdev_priv(dev);
-	unsigned long flags;
-
-	if (!(priv->device_flags & FSL_GIANFAR_DEV_HAS_MAGIC_PACKET) &&
-	    wol->wolopts != 0)
-		return -EINVAL;
-
-	if (wol->wolopts & ~WAKE_MAGIC)
-		return -EINVAL;
 
-	device_set_wakeup_enable(&dev->dev, wol->wolopts & WAKE_MAGIC);
+	if (!priv->wol_supported || !device_can_wakeup(&priv->ofdev->dev) ||
+		(wol->wolopts & ~(WAKE_MAGIC | WAKE_ARP | WAKE_UCAST)))
+		return -EOPNOTSUPP;
 
-	spin_lock_irqsave(&priv->bflock, flags);
-	priv->wol_en =  !!device_may_wakeup(&dev->dev);
-	spin_unlock_irqrestore(&priv->bflock, flags);
+	priv->wol_opts = 0;
 
+	if (wol->wolopts & WAKE_MAGIC) {
+		priv->wol_opts |= GIANFAR_WOL_MAGIC;
+	} else {
+		if (wol->wolopts & WAKE_ARP)
+			priv->wol_opts |= GIANFAR_WOL_ARP;
+		if (wol->wolopts & WAKE_UCAST)
+			priv->wol_opts |= GIANFAR_WOL_UCAST;
+	}
+	device_set_wakeup_enable(&priv->ofdev->dev, (bool)!!priv->wol_opts);
 	return 0;
 }
 #endif
-- 
1.6.4.1

^ permalink raw reply related

* Re: [net-next RFC V5 5/5] virtio_net: support negotiating the number of queues through ctrl vq
From: Michael S. Tsirkin @ 2012-07-20 12:33 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, netdev, linux-kernel,
	virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <1341484194-8108-6-git-send-email-jasowang@redhat.com>

On Thu, Jul 05, 2012 at 06:29:54PM +0800, Jason Wang wrote:
> This patch let the virtio_net driver can negotiate the number of queues it
> wishes to use through control virtqueue and export an ethtool interface to let
> use tweak it.
> 
> As current multiqueue virtio-net implementation has optimizations on per-cpu
> virtuqueues, so only two modes were support:
> 
> - single queue pair mode
> - multiple queue paris mode, the number of queues matches the number of vcpus
> 
> The single queue mode were used by default currently due to regression of
> multiqueue mode in some test (especially in stream test).
> 
> Since virtio core does not support paritially deleting virtqueues, so during
> mode switching the whole virtqueue were deleted and the driver would re-create
> the virtqueues it would used.
> 
> btw. The queue number negotiating were defered to .ndo_open(), this is because
> only after feature negotitaion could we send the command to control virtqueue
> (as it may also use event index).
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c   |  171 ++++++++++++++++++++++++++++++++++---------
>  include/linux/virtio_net.h |    7 ++
>  2 files changed, 142 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7410187..3339eeb 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -88,6 +88,7 @@ struct receive_queue {
>  
>  struct virtnet_info {
>  	u16 num_queue_pairs;		/* # of RX/TX vq pairs */
> +	u16 total_queue_pairs;
>  
>  	struct send_queue *sq[MAX_QUEUES] ____cacheline_aligned_in_smp;
>  	struct receive_queue *rq[MAX_QUEUES] ____cacheline_aligned_in_smp;
> @@ -137,6 +138,8 @@ struct padded_vnet_hdr {
>  	char padding[6];
>  };
>  
> +static const struct ethtool_ops virtnet_ethtool_ops;
> +
>  static inline int txq_get_qnum(struct virtnet_info *vi, struct virtqueue *vq)
>  {
>  	int ret = virtqueue_get_queue_index(vq);
> @@ -802,22 +805,6 @@ static void virtnet_netpoll(struct net_device *dev)
>  }
>  #endif
>  
> -static int virtnet_open(struct net_device *dev)
> -{
> -	struct virtnet_info *vi = netdev_priv(dev);
> -	int i;
> -
> -	for (i = 0; i < vi->num_queue_pairs; i++) {
> -		/* Make sure we have some buffers: if oom use wq. */
> -		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> -			queue_delayed_work(system_nrt_wq,
> -					   &vi->rq[i]->refill, 0);
> -		virtnet_napi_enable(vi->rq[i]);
> -	}
> -
> -	return 0;
> -}
> -
>  /*
>   * Send command via the control virtqueue and check status.  Commands
>   * supported by the hypervisor, as indicated by feature bits, should
> @@ -873,6 +860,43 @@ static void virtnet_ack_link_announce(struct virtnet_info *vi)
>  	rtnl_unlock();
>  }
>  
> +static int virtnet_set_queues(struct virtnet_info *vi)
> +{
> +	struct scatterlist sg;
> +	struct net_device *dev = vi->dev;
> +	sg_init_one(&sg, &vi->num_queue_pairs, sizeof(vi->num_queue_pairs));
> +
> +	if (!vi->has_cvq)
> +		return -EINVAL;
> +
> +	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MULTIQUEUE,
> +				  VIRTIO_NET_CTRL_MULTIQUEUE_QNUM, &sg, 1, 0)){
> +		dev_warn(&dev->dev, "Fail to set the number of queue pairs to"
> +			 " %d\n", vi->num_queue_pairs);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int virtnet_open(struct net_device *dev)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	int i;
> +
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
> +		/* Make sure we have some buffers: if oom use wq. */
> +		if (!try_fill_recv(vi->rq[i], GFP_KERNEL))
> +			queue_delayed_work(system_nrt_wq,
> +					   &vi->rq[i]->refill, 0);
> +		virtnet_napi_enable(vi->rq[i]);
> +	}
> +
> +	virtnet_set_queues(vi);
> +
> +	return 0;
> +}
> +
>  static int virtnet_close(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> @@ -1013,12 +1037,6 @@ static void virtnet_get_drvinfo(struct net_device *dev,
>  
>  }
>  
> -static const struct ethtool_ops virtnet_ethtool_ops = {
> -	.get_drvinfo = virtnet_get_drvinfo,
> -	.get_link = ethtool_op_get_link,
> -	.get_ringparam = virtnet_get_ringparam,
> -};
> -
>  #define MIN_MTU 68
>  #define MAX_MTU 65535
>  
> @@ -1235,7 +1253,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
>  
>  err:
>  	if (ret && names)
> -		for (i = 0; i < vi->num_queue_pairs * 2; i++)
> +		for (i = 0; i < total_vqs * 2; i++)
>  			kfree(names[i]);
>  
>  	kfree(names);
> @@ -1373,7 +1391,6 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	mutex_init(&vi->config_lock);
>  	vi->config_enable = true;
>  	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
> -	vi->num_queue_pairs = num_queue_pairs;
>  
>  	/* If we can receive ANY GSO packets, we must allocate large ones. */
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> @@ -1387,6 +1404,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>  		vi->has_cvq = true;
>  
> +	/* Use single tx/rx queue pair as default */
> +	vi->num_queue_pairs = 1;
> +	vi->total_queue_pairs = num_queue_pairs;
> +
>  	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>  	err = virtnet_setup_vqs(vi);
>  	if (err)
> @@ -1396,6 +1417,9 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	    virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>  		dev->features |= NETIF_F_HW_VLAN_FILTER;
>  
> +	netif_set_real_num_tx_queues(dev, 1);
> +	netif_set_real_num_rx_queues(dev, 1);
> +
>  	err = register_netdev(dev);
>  	if (err) {
>  		pr_debug("virtio_net: registering device failed\n");
> @@ -1403,7 +1427,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	}
>  
>  	/* Last of all, set up some receive buffers. */
> -	for (i = 0; i < num_queue_pairs; i++) {
> +	for (i = 0; i < vi->num_queue_pairs; i++) {
>  		try_fill_recv(vi->rq[i], GFP_KERNEL);
>  
>  		/* If we didn't even get one input buffer, we're useless. */
> @@ -1474,10 +1498,8 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
>  	free_netdev(vi->dev);
>  }
>  
> -#ifdef CONFIG_PM
> -static int virtnet_freeze(struct virtio_device *vdev)
> +static void virtnet_stop(struct virtnet_info *vi)
>  {
> -	struct virtnet_info *vi = vdev->priv;
>  	int i;
>  
>  	/* Prevent config work handler from accessing the device */
> @@ -1493,17 +1515,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
>  		for (i = 0; i < vi->num_queue_pairs; i++)
>  			napi_disable(&vi->rq[i]->napi);
>  
> -
> -	remove_vq_common(vi);
> -
> -	flush_work(&vi->config_work);
> -
> -	return 0;
>  }
>  
> -static int virtnet_restore(struct virtio_device *vdev)
> +static int virtnet_start(struct virtnet_info *vi)
>  {
> -	struct virtnet_info *vi = vdev->priv;
>  	int err, i;
>  
>  	err = virtnet_setup_vqs(vi);
> @@ -1527,6 +1542,29 @@ static int virtnet_restore(struct virtio_device *vdev)
>  
>  	return 0;
>  }
> +
> +#ifdef CONFIG_PM
> +static int virtnet_freeze(struct virtio_device *vdev)
> +{
> +	struct virtnet_info *vi = vdev->priv;
> +
> +	virtnet_stop(vi);
> +
> +	remove_vq_common(vi);
> +
> +	flush_work(&vi->config_work);
> +
> +	return 0;
> +}
> +
> +static int virtnet_restore(struct virtio_device *vdev)
> +{
> +	struct virtnet_info *vi = vdev->priv;
> +
> +	virtnet_start(vi);
> +
> +	return 0;
> +}
>  #endif
>  
>  static struct virtio_device_id id_table[] = {
> @@ -1560,6 +1598,67 @@ static struct virtio_driver virtio_net_driver = {
>  #endif
>  };
>  
> +static int virtnet_set_channels(struct net_device *dev,
> +				struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	u16 queues = channels->rx_count;
> +	unsigned status = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
> +
> +	if (channels->rx_count != channels->tx_count)
> +		return -EINVAL;
> +	/* Only two modes were support currently */

s/were/are/ ?

> +	if (queues != vi->total_queue_pairs && queues != 1)
> +		return -EINVAL;

So userspace has to get queue number right. How does it know
what the valid value is?

> +	if (!vi->has_cvq)
> +		return -EINVAL;
> +
> +	virtnet_stop(vi);
> +
> +	netif_set_real_num_tx_queues(dev, queues);
> +	netif_set_real_num_rx_queues(dev, queues);
> +
> +	remove_vq_common(vi);
> +	flush_work(&vi->config_work);
> +
> +	vi->num_queue_pairs = queues;
> +	virtnet_start(vi);
> +
> +	vi->vdev->config->finalize_features(vi->vdev);
> +
> +	if (virtnet_set_queues(vi))
> +		status |= VIRTIO_CONFIG_S_FAILED;
> +	else
> +		status |= VIRTIO_CONFIG_S_DRIVER_OK;
> +
> +	vi->vdev->config->set_status(vi->vdev, status);
> +

Why do we need to tweak status like that?
Can we maybe just roll changes back on error?

> +	return 0;
> +}
> +
> +static void virtnet_get_channels(struct net_device *dev,
> +				 struct ethtool_channels *channels)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	channels->max_rx = vi->total_queue_pairs;
> +	channels->max_tx = vi->total_queue_pairs;
> +	channels->max_other = 0;
> +	channels->max_combined = 0;
> +	channels->rx_count = vi->num_queue_pairs;
> +	channels->tx_count = vi->num_queue_pairs;
> +	channels->other_count = 0;
> +	channels->combined_count = 0;
> +}
> +
> +static const struct ethtool_ops virtnet_ethtool_ops = {
> +	.get_drvinfo = virtnet_get_drvinfo,
> +	.get_link = ethtool_op_get_link,
> +	.get_ringparam = virtnet_get_ringparam,
> +	.set_channels = virtnet_set_channels,
> +	.get_channels = virtnet_get_channels,
> +};
> +
>  static int __init init(void)
>  {
>  	return register_virtio_driver(&virtio_net_driver);
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 60f09ff..0d21e08 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -169,4 +169,11 @@ struct virtio_net_ctrl_mac {
>  #define VIRTIO_NET_CTRL_ANNOUNCE       3
>   #define VIRTIO_NET_CTRL_ANNOUNCE_ACK         0
>  
> +/*
> + * Control multiqueue
> + *
> + */
> +#define VIRTIO_NET_CTRL_MULTIQUEUE       4
> + #define VIRTIO_NET_CTRL_MULTIQUEUE_QNUM         0
> +
>  #endif /* _LINUX_VIRTIO_NET_H */
> -- 
> 1.7.1

^ permalink raw reply

* [patch iproute2] iplink: add support for num[tr]xqueues
From: Jiri Pirko @ 2012-07-20 12:29 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/if_link.h |    2 ++
 ip/iplink.c             |   20 ++++++++++++++++++++
 man/man8/ip-link.8.in   |   13 +++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 00e5868..46f03db 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -140,6 +140,8 @@ enum {
 	IFLA_EXT_MASK,		/* Extended info mask, VFs, etc */
 	IFLA_PROMISCUITY,	/* Promiscuity count: > 0 means acts PROMISC */
 #define IFLA_PROMISCUITY IFLA_PROMISCUITY
+	IFLA_NUM_TX_QUEUES,
+	IFLA_NUM_RX_QUEUES,
 	__IFLA_MAX
 };
 
diff --git a/ip/iplink.c b/ip/iplink.c
index 679091e..0baa128 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -48,6 +48,8 @@ void iplink_usage(void)
 		fprintf(stderr, "                   [ address LLADDR ]\n");
 		fprintf(stderr, "                   [ broadcast LLADDR ]\n");
 		fprintf(stderr, "                   [ mtu MTU ]\n");
+		fprintf(stderr, "                   [ numtxqueues QUEUE_COUNT ]\n");
+		fprintf(stderr, "                   [ numrxqueues QUEUE_COUNT ]\n");
 		fprintf(stderr, "                   type TYPE [ ARGS ]\n");
 		fprintf(stderr, "       ip link delete DEV type TYPE [ ARGS ]\n");
 		fprintf(stderr, "\n");
@@ -279,6 +281,8 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 	int mtu = -1;
 	int netns = -1;
 	int vf = -1;
+	int numtxqueues = -1;
+	int numrxqueues = -1;
 
 	*group = -1;
 	ret = argc;
@@ -445,6 +449,22 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
 				invarg("Invalid operstate\n", *argv);
 
 			addattr8(&req->n, sizeof(*req), IFLA_OPERSTATE, state);
+		} else if (strcmp(*argv, "numtxqueues") == 0) {
+			NEXT_ARG();
+			if (numtxqueues != -1)
+				duparg("numtxqueues", *argv);
+			if (get_integer(&numtxqueues, *argv, 0))
+				invarg("Invalid \"numtxqueues\" value\n", *argv);
+			addattr_l(&req->n, sizeof(*req), IFLA_NUM_TX_QUEUES,
+				  &numtxqueues, 4);
+		} else if (strcmp(*argv, "numrxqueues") == 0) {
+			NEXT_ARG();
+			if (numrxqueues != -1)
+				duparg("numrxqueues", *argv);
+			if (get_integer(&numrxqueues, *argv, 0))
+				invarg("Invalid \"numrxqueues\" value\n", *argv);
+			addattr_l(&req->n, sizeof(*req), IFLA_NUM_RX_QUEUES,
+				  &numrxqueues, 4);
 		} else {
 			if (strcmp(*argv, "dev") == 0) {
 				NEXT_ARG();
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 9386cc6..8a24e51 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -40,6 +40,11 @@ ip-link \- network device configuration
 .RB "[ " mtu
 .IR MTU " ]"
 .br
+.RB "[ " numtxqueues
+.IR QUEUE_COUNT " ]"
+.RB "[ " numrxqueues
+.IR QUEUE_COUNT " ]"
+.br
 .BR type " TYPE"
 .RI "[ " ARGS " ]"
 
@@ -156,6 +161,14 @@ Link types:
 - Ethernet Bridge device
 .in -8
 
+.TP
+.BI numtxqueues " QUEUE_COUNT "
+specifies the number of transmit queues for new device.
+
+.TP
+.BI numrxqueues " QUEUE_COUNT "
+specifies the number of receive queues for new device.
+
 .SS ip link delete - delete virtual link
 .I DEVICE
 specifies the virtual  device to act operate on.
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 6/6] team: add multiqueue support
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

Largely copied from bonding code.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/team/team.c |   65 +++++++++++++++++++++++++++++++++++++++++++----
 include/linux/if_team.h |    8 ++++++
 2 files changed, 68 insertions(+), 5 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 813e131..b104c05 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -27,6 +27,7 @@
 #include <net/rtnetlink.h>
 #include <net/genetlink.h>
 #include <net/netlink.h>
+#include <net/sch_generic.h>
 #include <linux/if_team.h>
 
 #define DRV_NAME "team"
@@ -1121,6 +1122,22 @@ static const struct team_option team_options[] = {
 	},
 };
 
+static struct lock_class_key team_netdev_xmit_lock_key;
+static struct lock_class_key team_netdev_addr_lock_key;
+
+static void team_set_lockdep_class_one(struct net_device *dev,
+				       struct netdev_queue *txq,
+				       void *unused)
+{
+	lockdep_set_class(&txq->_xmit_lock, &team_netdev_xmit_lock_key);
+}
+
+static void team_set_lockdep_class(struct net_device *dev)
+{
+	lockdep_set_class(&dev->addr_list_lock, &team_netdev_addr_lock_key);
+	netdev_for_each_tx_queue(dev, team_set_lockdep_class_one, NULL);
+}
+
 static int team_init(struct net_device *dev)
 {
 	struct team *team = netdev_priv(dev);
@@ -1148,6 +1165,8 @@ static int team_init(struct net_device *dev)
 		goto err_options_register;
 	netif_carrier_off(dev);
 
+	team_set_lockdep_class(dev);
+
 	return 0;
 
 err_options_register:
@@ -1216,6 +1235,29 @@ static netdev_tx_t team_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static u16 team_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	/*
+	 * This helper function exists to help dev_pick_tx get the correct
+	 * destination queue.  Using a helper function skips a call to
+	 * skb_tx_hash and will put the skbs in the queue we expect on their
+	 * way down to the team driver.
+	 */
+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/*
+	 * Save the original txq to restore before passing to the driver
+	 */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+	return txq;
+}
+
 static void team_change_rx_flags(struct net_device *dev, int change)
 {
 	struct team *team = netdev_priv(dev);
@@ -1469,6 +1511,7 @@ static const struct net_device_ops team_netdev_ops = {
 	.ndo_open		= team_open,
 	.ndo_stop		= team_close,
 	.ndo_start_xmit		= team_xmit,
+	.ndo_select_queue	= team_select_queue,
 	.ndo_change_rx_flags	= team_change_rx_flags,
 	.ndo_set_rx_mode	= team_set_rx_mode,
 	.ndo_set_mac_address	= team_set_mac_address,
@@ -1543,12 +1586,24 @@ static int team_validate(struct nlattr *tb[], struct nlattr *data[])
 	return 0;
 }
 
+static unsigned int team_get_num_tx_queues(void)
+{
+	return TEAM_DEFAULT_NUM_TX_QUEUES;
+}
+
+static unsigned int team_get_num_rx_queues(void)
+{
+	return TEAM_DEFAULT_NUM_RX_QUEUES;
+}
+
 static struct rtnl_link_ops team_link_ops __read_mostly = {
-	.kind		= DRV_NAME,
-	.priv_size	= sizeof(struct team),
-	.setup		= team_setup,
-	.newlink	= team_newlink,
-	.validate	= team_validate,
+	.kind			= DRV_NAME,
+	.priv_size		= sizeof(struct team),
+	.setup			= team_setup,
+	.newlink		= team_newlink,
+	.validate		= team_validate,
+	.get_num_tx_queues	= team_get_num_tx_queues,
+	.get_num_rx_queues	= team_get_num_rx_queues,
 };
 
 
diff --git a/include/linux/if_team.h b/include/linux/if_team.h
index 7fd0cde..6960fc1 100644
--- a/include/linux/if_team.h
+++ b/include/linux/if_team.h
@@ -14,6 +14,7 @@
 #ifdef __KERNEL__
 
 #include <linux/netpoll.h>
+#include <net/sch_generic.h>
 
 struct team_pcpu_stats {
 	u64			rx_packets;
@@ -98,6 +99,10 @@ static inline void team_netpoll_send_skb(struct team_port *port,
 static inline int team_dev_queue_xmit(struct team *team, struct team_port *port,
 				      struct sk_buff *skb)
 {
+	BUILD_BUG_ON(sizeof(skb->queue_mapping) !=
+		     sizeof(qdisc_skb_cb(skb)->slave_dev_queue_mapping));
+	skb_set_queue_mapping(skb, qdisc_skb_cb(skb)->slave_dev_queue_mapping);
+
 	skb->dev = port->dev;
 	if (unlikely(netpoll_tx_running(port->dev))) {
 		team_netpoll_send_skb(port, skb);
@@ -236,6 +241,9 @@ extern void team_options_unregister(struct team *team,
 extern int team_mode_register(const struct team_mode *mode);
 extern void team_mode_unregister(const struct team_mode *mode);
 
+#define TEAM_DEFAULT_NUM_TX_QUEUES 16
+#define TEAM_DEFAULT_NUM_RX_QUEUES 16
+
 #endif /* __KERNEL__ */
 
 #define TEAM_STRING_MAX_LEN 32
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 5/6] bond_sysfs: use ream_num_tx_queues rather than params.tx_queue
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

Since now number of tx queues can be specified during bond instance
creation and therefore it may differ from params.tx_queues, use rather
real_num_tx_queues for boundary check.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/bonding/bond_sysfs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 485bedb..dc15d24 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1495,7 +1495,7 @@ static ssize_t bonding_store_queue_id(struct device *d,
 	/* Check buffer length, valid ifname and queue id */
 	if (strlen(buffer) > IFNAMSIZ ||
 	    !dev_valid_name(buffer) ||
-	    qid > bond->params.tx_queues)
+	    qid > bond->dev->real_num_tx_queues)
 		goto err_no_cmd;
 
 	/* Get the pointer to that interface if it exists */
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 4/6] net: rename bond_queue_mapping to slave_dev_queue_mapping
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

As this is going to be used not only by bonding.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/bonding/bond_main.c |    6 +++---
 include/net/sch_generic.h       |    2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index f41ddc2..6fae5f3 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -395,8 +395,8 @@ int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb,
 	skb->dev = slave_dev;
 
 	BUILD_BUG_ON(sizeof(skb->queue_mapping) !=
-		     sizeof(qdisc_skb_cb(skb)->bond_queue_mapping));
-	skb->queue_mapping = qdisc_skb_cb(skb)->bond_queue_mapping;
+		     sizeof(qdisc_skb_cb(skb)->slave_dev_queue_mapping));
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
 
 	if (unlikely(netpoll_tx_running(slave_dev)))
 		bond_netpoll_send_skb(bond_get_slave_by_dev(bond, slave_dev), skb);
@@ -4184,7 +4184,7 @@ static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb)
 	/*
 	 * Save the original txq to restore before passing to the driver
 	 */
-	qdisc_skb_cb(skb)->bond_queue_mapping = skb->queue_mapping;
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
 
 	if (unlikely(txq >= dev->real_num_tx_queues)) {
 		do {
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9d7d54a..d9611e0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -220,7 +220,7 @@ struct tcf_proto {
 
 struct qdisc_skb_cb {
 	unsigned int		pkt_len;
-	u16			bond_queue_mapping;
+	u16			slave_dev_queue_mapping;
 	u16			_pad;
 	unsigned char		data[20];
 };
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 3/6] rtnl: allow to specify number of rx and tx queues on device creation
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

This patch introduces IFLA_NUM_TX_QUEUES and IFLA_NUM_RX_QUEUES by
which userspace can set number of rx and/or tx queues to be allocated
for newly created netdevice.
This overrides ops->get_num_[tr]x_queues()

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/if_link.h |    2 ++
 net/core/rtnetlink.c    |   15 +++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index f715750..ac173bd 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -140,6 +140,8 @@ enum {
 	IFLA_EXT_MASK,		/* Extended info mask, VFs, etc */
 	IFLA_PROMISCUITY,	/* Promiscuity count: > 0 means acts PROMISC */
 #define IFLA_PROMISCUITY IFLA_PROMISCUITY
+	IFLA_NUM_TX_QUEUES,
+	IFLA_NUM_RX_QUEUES,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index db5a8ad..5bb1ebc 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -771,6 +771,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_LINK */
 	       + nla_total_size(4) /* IFLA_MASTER */
 	       + nla_total_size(4) /* IFLA_PROMISCUITY */
+	       + nla_total_size(4) /* IFLA_NUM_TX_QUEUES */
+	       + nla_total_size(4) /* IFLA_NUM_RX_QUEUES */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(ext_filter_mask
@@ -889,6 +891,8 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	    nla_put_u32(skb, IFLA_MTU, dev->mtu) ||
 	    nla_put_u32(skb, IFLA_GROUP, dev->group) ||
 	    nla_put_u32(skb, IFLA_PROMISCUITY, dev->promiscuity) ||
+	    nla_put_u32(skb, IFLA_NUM_TX_QUEUES, dev->num_tx_queues) ||
+	    nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues) ||
 	    (dev->ifindex != dev->iflink &&
 	     nla_put_u32(skb, IFLA_LINK, dev->iflink)) ||
 	    (dev->master &&
@@ -1106,6 +1110,8 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_AF_SPEC]		= { .type = NLA_NESTED },
 	[IFLA_EXT_MASK]		= { .type = NLA_U32 },
 	[IFLA_PROMISCUITY]	= { .type = NLA_U32 },
+	[IFLA_NUM_TX_QUEUES]	= { .type = NLA_U32 },
+	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
 };
 EXPORT_SYMBOL(ifla_policy);
 
@@ -1627,9 +1633,14 @@ struct net_device *rtnl_create_link(struct net *src_net, struct net *net,
 	unsigned int num_tx_queues = 1;
 	unsigned int num_rx_queues = 1;
 
-	if (ops->get_num_tx_queues)
+	if (tb[IFLA_NUM_TX_QUEUES])
+		num_tx_queues = nla_get_u32(tb[IFLA_NUM_TX_QUEUES]);
+	else if (ops->get_num_tx_queues)
 		num_tx_queues = ops->get_num_tx_queues();
-	if (ops->get_num_rx_queues)
+
+	if (tb[IFLA_NUM_RX_QUEUES])
+		num_rx_queues = nla_get_u32(tb[IFLA_NUM_RX_QUEUES]);
+	else if (ops->get_num_rx_queues)
 		num_rx_queues = ops->get_num_rx_queues();
 
 	err = -ENOMEM;
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 2/6] rtnl: allow to specify different num for rx and tx queue count
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

Also cut out unused function parameters and possible err in return
value.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/bonding/bond_main.c |   14 ++++++++------
 include/net/rtnetlink.h         |   10 ++++++----
 net/core/rtnetlink.c            |   16 ++++++++--------
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3960b1b..f41ddc2 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4845,17 +4845,19 @@ static int bond_validate(struct nlattr *tb[], struct nlattr *data[])
 	return 0;
 }
 
-static int bond_get_tx_queues(struct net *net, struct nlattr *tb[])
+static unsigned int bond_get_num_tx_queues(void)
 {
 	return tx_queues;
 }
 
 static struct rtnl_link_ops bond_link_ops __read_mostly = {
-	.kind		= "bond",
-	.priv_size	= sizeof(struct bonding),
-	.setup		= bond_setup,
-	.validate	= bond_validate,
-	.get_tx_queues	= bond_get_tx_queues,
+	.kind			= "bond",
+	.priv_size		= sizeof(struct bonding),
+	.setup			= bond_setup,
+	.validate		= bond_validate,
+	.get_num_tx_queues	= bond_get_num_tx_queues,
+	.get_num_rx_queues	= bond_get_num_tx_queues, /* Use the same number
+							     as for TX queues */
 };
 
 /* Create a new bond based on the specified name and bonding parameters.
diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index bbcfd09..6b00c4f 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -44,8 +44,10 @@ static inline int rtnl_msg_family(const struct nlmsghdr *nlh)
  *	@get_xstats_size: Function to calculate required room for dumping device
  *			  specific statistics
  *	@fill_xstats: Function to dump device specific statistics
- *	@get_tx_queues: Function to determine number of transmit queues to create when
- *		        creating a new device.
+ *	@get_num_tx_queues: Function to determine number of transmit queues
+ *			    to create when creating a new device.
+ *	@get_num_rx_queues: Function to determine number of receive queues
+ *			    to create when creating a new device.
  */
 struct rtnl_link_ops {
 	struct list_head	list;
@@ -77,8 +79,8 @@ struct rtnl_link_ops {
 	size_t			(*get_xstats_size)(const struct net_device *dev);
 	int			(*fill_xstats)(struct sk_buff *skb,
 					       const struct net_device *dev);
-	int			(*get_tx_queues)(struct net *net,
-						 struct nlattr *tb[]);
+	unsigned int		(*get_num_tx_queues)(void);
+	unsigned int		(*get_num_rx_queues)(void);
 };
 
 extern int	__rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 045db8a..db5a8ad 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1624,17 +1624,17 @@ struct net_device *rtnl_create_link(struct net *src_net, struct net *net,
 {
 	int err;
 	struct net_device *dev;
-	unsigned int num_queues = 1;
+	unsigned int num_tx_queues = 1;
+	unsigned int num_rx_queues = 1;
 
-	if (ops->get_tx_queues) {
-		err = ops->get_tx_queues(src_net, tb);
-		if (err < 0)
-			goto err;
-		num_queues = err;
-	}
+	if (ops->get_num_tx_queues)
+		num_tx_queues = ops->get_num_tx_queues();
+	if (ops->get_num_rx_queues)
+		num_rx_queues = ops->get_num_rx_queues();
 
 	err = -ENOMEM;
-	dev = alloc_netdev_mq(ops->priv_size, ifname, ops->setup, num_queues);
+	dev = alloc_netdev_mqs(ops->priv_size, ifname, ops->setup,
+			       num_tx_queues, num_rx_queues);
 	if (!dev)
 		goto err;
 
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 1/6] net: honour netif_set_real_num_tx_queues() retval
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy
In-Reply-To: <1342787331-1866-1-git-send-email-jiri@resnulli.us>

In netif_copy_real_num_queues() the return value of
netif_set_real_num_tx_queues() should be checked.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/netdevice.h |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ab0251d..eb06e58 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2110,7 +2110,12 @@ static inline int netif_set_real_num_rx_queues(struct net_device *dev,
 static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 					     const struct net_device *from_dev)
 {
-	netif_set_real_num_tx_queues(to_dev, from_dev->real_num_tx_queues);
+	int err;
+
+	err = netif_set_real_num_tx_queues(to_dev,
+					   from_dev->real_num_tx_queues);
+	if (err)
+		return err;
 #ifdef CONFIG_RPS
 	return netif_set_real_num_rx_queues(to_dev,
 					    from_dev->real_num_rx_queues);
-- 
1.7.10.4

^ permalink raw reply related

* [patch net-next 0/6] net: add team multiqueue support and do comple of thing on the way
From: Jiri Pirko @ 2012-07-20 12:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, shemminger, fubar, andy

This patchset represents the way I walked when I was adding multiqueue support for
team driver.

Jiri Pirko (6):
  net: honour netif_set_real_num_tx_queues() retval
  rtnl: allow to specify different num for rx and tx queue count
  rtnl: allow to specify number of rx and tx queues on device creation
  net: rename bond_queue_mapping to slave_dev_queue_mapping
  bond_sysfs: use ream_num_tx_queues rather than params.tx_queue
  team: add multiqueue support

 drivers/net/bonding/bond_main.c  |   20 ++++++------
 drivers/net/bonding/bond_sysfs.c |    2 +-
 drivers/net/team/team.c          |   65 +++++++++++++++++++++++++++++++++++---
 include/linux/if_link.h          |    2 ++
 include/linux/if_team.h          |    8 +++++
 include/linux/netdevice.h        |    7 +++-
 include/net/rtnetlink.h          |   10 +++---
 include/net/sch_generic.h        |    2 +-
 net/core/rtnetlink.c             |   27 +++++++++++-----
 9 files changed, 114 insertions(+), 29 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* WESTERN UNION COMPENSATION PAYMENT
From: South Africa western union @ 2012-07-20 11:58 UTC (permalink / raw)


[-- Attachment #1: Type: text/plain, Size: 0 bytes --]



[-- Attachment #2: WUMT.docx --]
[-- Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document, Size: 27936 bytes --]

^ permalink raw reply

* Re: [PATCH] net, cgroup: Fix boot failure due to iteration of uninitialized list
From: Srivatsa S. Bhat @ 2012-07-20 11:18 UTC (permalink / raw)
  To: Neil Horman
  Cc: gaofeng, eric.dumazet, davem, linux-kernel, netdev, mark.d.rustad,
	john.r.fastabend, lizefan
In-Reply-To: <20120720110016.GA22367@hmsreliant.think-freely.org>

On 07/20/2012 04:30 PM, Neil Horman wrote:
> On Fri, Jul 20, 2012 at 03:34:47PM +0530, Srivatsa S. Bhat wrote:
>> On 07/19/2012 10:14 PM, Neil Horman wrote:
>>> On Thu, Jul 19, 2012 at 09:57:37PM +0530, Srivatsa S. Bhat wrote:
>>>> After commit ef209f15 (net: cgroup: fix access the unallocated memory in
>>>> netprio cgroup), boot fails with the following NULL pointer dereference:
>>>>
>>>> Initializing cgroup subsys devices
>>>> Initializing cgroup subsys freezer
>>>> Initializing cgroup subsys net_cls
>>>> Initializing cgroup subsys blkio
>>>> Initializing cgroup subsys perf_event
>>>> Initializing cgroup subsys net_prio
>>>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000698
>>>> IP: [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>>>> PGD 0
>>>> Oops: 0000 [#1] SMP
>>>> CPU 0
>>>> Modules linked in:
>>>>
>>>> Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc7-mandeep #1 IBM IBM System x -[7870C4Q]-/68Y8033
>>>> RIP: 0010:[<ffffffff8145e8d6>]  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>>>> RSP: 0000:ffffffff81a01ea8  EFLAGS: 00010213
>>>> RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: 0000000000000000
>>>> RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff81aa70a0
>>>> RBP: ffffffff81a01ed8 R08: 0000000000000000 R09: 0000000000000000
>>>> R10: ffff8808ff8641c0 R11: 6e697a696c616974 R12: 0000000000000001
>>>> R13: ffff8808ff8641c0 R14: 0000000000000000 R15: 0000000000093970
>>>> FS:  0000000000000000(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>> CR2: 0000000000000698 CR3: 0000000001a0b000 CR4: 00000000000006b0
>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>>> Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
>>>> Stack:
>>>>  ffffffff81a01eb8 ffffffff818060ff ffffffff81d75ec8 ffffffff81aa8960
>>>>  ffffffff81aa8960 ffffffff81b4c2c0 ffffffff81a01ef8 ffffffff81b1cb78
>>>>  0000000000000018 0000000000000048 ffffffff81a01f18 ffffffff81b1ce13
>>>> Call Trace:
>>>>  [<ffffffff81b1cb78>] cgroup_init_subsys+0x83/0x169
>>>>  [<ffffffff81b1ce13>] cgroup_init+0x36/0x119
>>>>  [<ffffffff81affef7>] start_kernel+0x3ba/0x3ef
>>>>  [<ffffffff81aff95b>] ? kernel_init+0x27b/0x27b
>>>>  [<ffffffff81aff356>] x86_64_start_reservations+0x131/0x136
>>>>  [<ffffffff81aff45e>] x86_64_start_kernel+0x103/0x112
>>>> Code: 01 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 75 1b eb 73 0f 1f 00 48 8b 83 f0 00 00 00 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 74 5a <48> 8b 83 88 07 00 00 48 85 c0 74 de 44 3b 60 10 76 d8 44 89 e6
>>>> RIP  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>>>>  RSP <ffffffff81a01ea8>
>>>> CR2: 0000000000000698
>>>> ---[ end trace a7919e7f17c0a725 ]---
>>>> Kernel panic - not syncing: Attempted to kill the idle task!
>>>>
>>>> The code corresponds to:
>>>>
>>>> update_netdev_tables():
>>>>         for_each_netdev(&init_net, dev) {
>>>>                 map = rtnl_dereference(dev->priomap);  <---- HERE
>>>>
>>>>
>>>> The list head is initialized in netdev_init(), which is called much
>>>> later than cgrp_create(). So the problem is that we are calling
>>>> update_netdev_tables() way too early (in cgrp_create()), which will
>>>> end up traversing the not-yet-circular linked list. So at some point,
>>>> the dev pointer will become NULL and hence dev->priomap becomes an
>>>> invalid access.
>>>>
>>>> To fix this, just remove the update_netdev_tables() function entirely,
>>>> since it appears that write_update_netdev_table() will handle things
>>>> just fine.
>>>>
>>>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>>>> ---
>>>>
>>>> Requesting a thorough review of this patch, since I am not sure whether
>>>> removing update_netdev_tables() is perfectly OK and whether that is the
>>>> right thing to do.
>>>>
>>> We could do this I suppose, but this has already been fixed by
>>> 734b65417b24d6eea3e3d7457e1f11493890ee1d
>>
>> Oh good! But don't you think that my patch looks cleaner than that fix?
>> (Of course, provided that my patch is correct!)
>>
>> Anyway, I'm happy to see that the boot failure is fixed. But if anyone feels
>> that the approach of removing the update_netdev_tables() function is correct
>> and better, I'll be happy to provide a patch on top of the boot-fix that
>> went upstream.
>>
> We're almost at the end of a release.  The fix that went in has been tesetd and
> fixes the specific problem that was reported, with almost zero likelyhood of
> causing other regressions. While this fix looks like it might be preferable,
> this isn't a time to go doing something like this without alot more testing, as
> it may cause unforseen problems.
> 

Oh definitely! I didn't mean to suggest doing these changes right away.
It can surely wait.. :)

> Theres also a larger issue of initalization order that I'll be looking at in the
> next few weeks.  Based on the outcome of that I may roll this change in.
> 

Sure, thanks!

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH] net, cgroup: Fix boot failure due to iteration of uninitialized list
From: Neil Horman @ 2012-07-20 11:00 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: gaofeng, eric.dumazet, davem, linux-kernel, netdev, mark.d.rustad,
	john.r.fastabend, lizefan
In-Reply-To: <50092D3F.5020108@linux.vnet.ibm.com>

On Fri, Jul 20, 2012 at 03:34:47PM +0530, Srivatsa S. Bhat wrote:
> On 07/19/2012 10:14 PM, Neil Horman wrote:
> > On Thu, Jul 19, 2012 at 09:57:37PM +0530, Srivatsa S. Bhat wrote:
> >> After commit ef209f15 (net: cgroup: fix access the unallocated memory in
> >> netprio cgroup), boot fails with the following NULL pointer dereference:
> >>
> >> Initializing cgroup subsys devices
> >> Initializing cgroup subsys freezer
> >> Initializing cgroup subsys net_cls
> >> Initializing cgroup subsys blkio
> >> Initializing cgroup subsys perf_event
> >> Initializing cgroup subsys net_prio
> >> BUG: unable to handle kernel NULL pointer dereference at 0000000000000698
> >> IP: [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> >> PGD 0
> >> Oops: 0000 [#1] SMP
> >> CPU 0
> >> Modules linked in:
> >>
> >> Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc7-mandeep #1 IBM IBM System x -[7870C4Q]-/68Y8033
> >> RIP: 0010:[<ffffffff8145e8d6>]  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> >> RSP: 0000:ffffffff81a01ea8  EFLAGS: 00010213
> >> RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: 0000000000000000
> >> RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff81aa70a0
> >> RBP: ffffffff81a01ed8 R08: 0000000000000000 R09: 0000000000000000
> >> R10: ffff8808ff8641c0 R11: 6e697a696c616974 R12: 0000000000000001
> >> R13: ffff8808ff8641c0 R14: 0000000000000000 R15: 0000000000093970
> >> FS:  0000000000000000(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> CR2: 0000000000000698 CR3: 0000000001a0b000 CR4: 00000000000006b0
> >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
> >> Stack:
> >>  ffffffff81a01eb8 ffffffff818060ff ffffffff81d75ec8 ffffffff81aa8960
> >>  ffffffff81aa8960 ffffffff81b4c2c0 ffffffff81a01ef8 ffffffff81b1cb78
> >>  0000000000000018 0000000000000048 ffffffff81a01f18 ffffffff81b1ce13
> >> Call Trace:
> >>  [<ffffffff81b1cb78>] cgroup_init_subsys+0x83/0x169
> >>  [<ffffffff81b1ce13>] cgroup_init+0x36/0x119
> >>  [<ffffffff81affef7>] start_kernel+0x3ba/0x3ef
> >>  [<ffffffff81aff95b>] ? kernel_init+0x27b/0x27b
> >>  [<ffffffff81aff356>] x86_64_start_reservations+0x131/0x136
> >>  [<ffffffff81aff45e>] x86_64_start_kernel+0x103/0x112
> >> Code: 01 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 75 1b eb 73 0f 1f 00 48 8b 83 f0 00 00 00 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 74 5a <48> 8b 83 88 07 00 00 48 85 c0 74 de 44 3b 60 10 76 d8 44 89 e6
> >> RIP  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
> >>  RSP <ffffffff81a01ea8>
> >> CR2: 0000000000000698
> >> ---[ end trace a7919e7f17c0a725 ]---
> >> Kernel panic - not syncing: Attempted to kill the idle task!
> >>
> >> The code corresponds to:
> >>
> >> update_netdev_tables():
> >>         for_each_netdev(&init_net, dev) {
> >>                 map = rtnl_dereference(dev->priomap);  <---- HERE
> >>
> >>
> >> The list head is initialized in netdev_init(), which is called much
> >> later than cgrp_create(). So the problem is that we are calling
> >> update_netdev_tables() way too early (in cgrp_create()), which will
> >> end up traversing the not-yet-circular linked list. So at some point,
> >> the dev pointer will become NULL and hence dev->priomap becomes an
> >> invalid access.
> >>
> >> To fix this, just remove the update_netdev_tables() function entirely,
> >> since it appears that write_update_netdev_table() will handle things
> >> just fine.
> >>
> >> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> >> ---
> >>
> >> Requesting a thorough review of this patch, since I am not sure whether
> >> removing update_netdev_tables() is perfectly OK and whether that is the
> >> right thing to do.
> >>
> > We could do this I suppose, but this has already been fixed by
> > 734b65417b24d6eea3e3d7457e1f11493890ee1d
> 
> Oh good! But don't you think that my patch looks cleaner than that fix?
> (Of course, provided that my patch is correct!)
> 
> Anyway, I'm happy to see that the boot failure is fixed. But if anyone feels
> that the approach of removing the update_netdev_tables() function is correct
> and better, I'll be happy to provide a patch on top of the boot-fix that
> went upstream.
> 
We're almost at the end of a release.  The fix that went in has been tesetd and
fixes the specific problem that was reported, with almost zero likelyhood of
causing other regressions. While this fix looks like it might be preferable,
this isn't a time to go doing something like this without alot more testing, as
it may cause unforseen problems.

Theres also a larger issue of initalization order that I'll be looking at in the
next few weeks.  Based on the outcome of that I may roll this change in.

Thanks!
Neil

> Regards,
> Srivatsa S. Bhat
> 
> 

^ permalink raw reply

* RE: [PATCH V2 resend] ipv6: fix incorrect route 'expires' value passed to userspace
From: David Laight @ 2012-07-20 10:32 UTC (permalink / raw)
  To: Li Wei, David Miller; +Cc: netdev, shemminger
In-Reply-To: <5008B794.7010904@cn.fujitsu.com>

> -	else if (rt->dst.expires - jiffies < INT_MAX)
> -		expires = rt->dst.expires - jiffies;
> +	else if ((long)rt->dst.expires - (long)jiffies > INT_MIN
> +			&& (long)rt->dst.expires - (long)jiffies <
INT_MAX)
> +		expires = (long)rt->dst.expires - (long)jiffies;
>  	else
> -		expires = INT_MAX;
> +		expires = time_is_after_jiffies(rt->dst.expires) ?
INT_MAX : INT_MIN;

I can't help feeling there is a better way to do this.
Maybe:
	long expires = rt->dst.expires - jiffies;
	if (expires != (int)expires)
		expires = expires > 0 ? INT_MAX : INT_MIN;
Although maybe -INT_MAX instead of INT_MIN.

	David

^ permalink raw reply

* Re: [RFC] r8169 : why SG / TX checksum are default disabled
From: Francois Romieu @ 2012-07-20 10:08 UTC (permalink / raw)
  To: hayeswang; +Cc: 'David Miller', eric.dumazet, netdev
In-Reply-To: <EF06A7B3C1E6432BB660CDEF8E1D2A1B@realtek.com.tw>

hayeswang <hayeswang@realtek.com> :
[...]
> I find that the total length field of IP header would be modified if the hw
> checksum is enabled. Therefore, skb_padto + hw checksum wouldn't work.

Ok, my patch completely ignored the fact that skb_padto does not change the
length.

However skb_padto + length adjustement + hw checksum should work (at least in
theory if not in the patch below) ?

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index be4e00f..8d0cc09 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5740,7 +5740,7 @@ err_out:
 	return -EIO;
 }
 
-static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
+static inline bool rtl8169_tso_csum(struct rtl8169_private *tp,
 				    struct sk_buff *skb, u32 *opts)
 {
 	const struct rtl_tx_desc_info *info = tx_desc_info + tp->txd_version;
@@ -5753,6 +5753,13 @@ static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
 	} else if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		const struct iphdr *ip = ip_hdr(skb);
 
+		if (unlikely(skb->len < ETH_ZLEN &&
+		    (tp->mac_version == RTL_GIGA_MAC_VER_34))) {
+			if (skb_padto(skb, ETH_ZLEN))
+				return false;
+			skb_put(skb, ETH_ZLEN - skb->len);
+		}
+
 		if (ip->protocol == IPPROTO_TCP)
 			opts[offset] |= info->checksum.tcp;
 		else if (ip->protocol == IPPROTO_UDP)
@@ -5760,6 +5767,7 @@ static inline void rtl8169_tso_csum(struct rtl8169_private *tp,
 		else
 			WARN_ON_ONCE(1);
 	}
+	return true;
 }
 
 static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
@@ -5783,25 +5791,26 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 	if (unlikely(le32_to_cpu(txd->opts1) & DescOwn))
 		goto err_stop_0;
 
+	opts[1] = cpu_to_le32(rtl8169_tx_vlan_tag(tp, skb));
+	opts[0] = DescOwn;
+
+	if (!rtl8169_tso_csum(tp, skb, opts))
+		goto err_update_stats_0;
+
 	len = skb_headlen(skb);
 	mapping = dma_map_single(d, skb->data, len, DMA_TO_DEVICE);
 	if (unlikely(dma_mapping_error(d, mapping))) {
 		if (net_ratelimit())
 			netif_err(tp, drv, dev, "Failed to map TX DMA!\n");
-		goto err_dma_0;
+		goto err_free_skb_1;
 	}
 
 	tp->tx_skb[entry].len = len;
 	txd->addr = cpu_to_le64(mapping);
 
-	opts[1] = cpu_to_le32(rtl8169_tx_vlan_tag(tp, skb));
-	opts[0] = DescOwn;
-
-	rtl8169_tso_csum(tp, skb, opts);
-
 	frags = rtl8169_xmit_frags(tp, skb, opts);
 	if (frags < 0)
-		goto err_dma_1;
+		goto err_unmap_2;
 	else if (frags)
 		opts[0] |= FirstFrag;
 	else {
@@ -5849,10 +5858,11 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	return NETDEV_TX_OK;
 
-err_dma_1:
+err_unmap_2:
 	rtl8169_unmap_tx_skb(d, tp->tx_skb + entry, txd);
-err_dma_0:
+err_free_skb_1:
 	dev_kfree_skb(skb);
+err_update_stats_0:
 	dev->stats.tx_dropped++;
 	return NETDEV_TX_OK;
 

^ permalink raw reply related

* Re: [PATCH] net, cgroup: Fix boot failure due to iteration of uninitialized list
From: Srivatsa S. Bhat @ 2012-07-20 10:04 UTC (permalink / raw)
  To: Neil Horman
  Cc: gaofeng, eric.dumazet, davem, linux-kernel, netdev, mark.d.rustad,
	john.r.fastabend, lizefan
In-Reply-To: <20120719164407.GA2963@neilslaptop.think-freely.org>

On 07/19/2012 10:14 PM, Neil Horman wrote:
> On Thu, Jul 19, 2012 at 09:57:37PM +0530, Srivatsa S. Bhat wrote:
>> After commit ef209f15 (net: cgroup: fix access the unallocated memory in
>> netprio cgroup), boot fails with the following NULL pointer dereference:
>>
>> Initializing cgroup subsys devices
>> Initializing cgroup subsys freezer
>> Initializing cgroup subsys net_cls
>> Initializing cgroup subsys blkio
>> Initializing cgroup subsys perf_event
>> Initializing cgroup subsys net_prio
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000698
>> IP: [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>> PGD 0
>> Oops: 0000 [#1] SMP
>> CPU 0
>> Modules linked in:
>>
>> Pid: 0, comm: swapper/0 Not tainted 3.5.0-rc7-mandeep #1 IBM IBM System x -[7870C4Q]-/68Y8033
>> RIP: 0010:[<ffffffff8145e8d6>]  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>> RSP: 0000:ffffffff81a01ea8  EFLAGS: 00010213
>> RAX: 0000000000000000 RBX: ffffffffffffff10 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffff81aa70a0
>> RBP: ffffffff81a01ed8 R08: 0000000000000000 R09: 0000000000000000
>> R10: ffff8808ff8641c0 R11: 6e697a696c616974 R12: 0000000000000001
>> R13: ffff8808ff8641c0 R14: 0000000000000000 R15: 0000000000093970
>> FS:  0000000000000000(0000) GS:ffff8808ffc00000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: 0000000000000698 CR3: 0000000001a0b000 CR4: 00000000000006b0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process swapper/0 (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a13420)
>> Stack:
>>  ffffffff81a01eb8 ffffffff818060ff ffffffff81d75ec8 ffffffff81aa8960
>>  ffffffff81aa8960 ffffffff81b4c2c0 ffffffff81a01ef8 ffffffff81b1cb78
>>  0000000000000018 0000000000000048 ffffffff81a01f18 ffffffff81b1ce13
>> Call Trace:
>>  [<ffffffff81b1cb78>] cgroup_init_subsys+0x83/0x169
>>  [<ffffffff81b1ce13>] cgroup_init+0x36/0x119
>>  [<ffffffff81affef7>] start_kernel+0x3ba/0x3ef
>>  [<ffffffff81aff95b>] ? kernel_init+0x27b/0x27b
>>  [<ffffffff81aff356>] x86_64_start_reservations+0x131/0x136
>>  [<ffffffff81aff45e>] x86_64_start_kernel+0x103/0x112
>> Code: 01 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 75 1b eb 73 0f 1f 00 48 8b 83 f0 00 00 00 48 3d f8 e1 ec 81 48 8d 98 10 ff ff ff 74 5a <48> 8b 83 88 07 00 00 48 85 c0 74 de 44 3b 60 10 76 d8 44 89 e6
>> RIP  [<ffffffff8145e8d6>] cgrp_create+0xf6/0x190
>>  RSP <ffffffff81a01ea8>
>> CR2: 0000000000000698
>> ---[ end trace a7919e7f17c0a725 ]---
>> Kernel panic - not syncing: Attempted to kill the idle task!
>>
>> The code corresponds to:
>>
>> update_netdev_tables():
>>         for_each_netdev(&init_net, dev) {
>>                 map = rtnl_dereference(dev->priomap);  <---- HERE
>>
>>
>> The list head is initialized in netdev_init(), which is called much
>> later than cgrp_create(). So the problem is that we are calling
>> update_netdev_tables() way too early (in cgrp_create()), which will
>> end up traversing the not-yet-circular linked list. So at some point,
>> the dev pointer will become NULL and hence dev->priomap becomes an
>> invalid access.
>>
>> To fix this, just remove the update_netdev_tables() function entirely,
>> since it appears that write_update_netdev_table() will handle things
>> just fine.
>>
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>> ---
>>
>> Requesting a thorough review of this patch, since I am not sure whether
>> removing update_netdev_tables() is perfectly OK and whether that is the
>> right thing to do.
>>
> We could do this I suppose, but this has already been fixed by
> 734b65417b24d6eea3e3d7457e1f11493890ee1d

Oh good! But don't you think that my patch looks cleaner than that fix?
(Of course, provided that my patch is correct!)

Anyway, I'm happy to see that the boot failure is fixed. But if anyone feels
that the approach of removing the update_netdev_tables() function is correct
and better, I'll be happy to provide a patch on top of the boot-fix that
went upstream.

Regards,
Srivatsa S. Bhat

^ permalink raw reply

* Re: [PATCH net-next 4/7] sfc: Add support for IEEE-1588 PTP
From: Stuart Hodgson @ 2012-07-20  9:15 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Ben Hutchings, David Miller, netdev, linux-net-drivers,
	Andrew Jackson
In-Reply-To: <20120720063102.GB2330@netboy.at.omicron.at>

>>> This code looks like it is trying to find the offset between two
>>> clocks. Is there some reason why you cannot use <linux/timecompare.h>
>>> to accomplish this?
>>
>> This is what the code is doing. <linux/timecompare.h> states
>>
>> "the assumption is that reading the source
>> time is slow and involves equal time for sending the request and
>> receiving the reply"
>>
>> While in our case event though it is slow we cannot guarantee the second
>> assumption. The code above takes into account some of the particulars of the sfc
>> hardware and gives us good results. 
> 
> Fair enough, but then maybe a comment mentioning how timecompare is
> unsuitable would be nice to have.
> 

This can be added

>>> I am trying to purge the whole SYS thing (only blackfin is left)
>>> because there is a much better way to go about this, namely
>>> synchronizing the system time to the PHC time via an internal PPS
>>> signal.
>>

Do you mean using the PPS kernel consumer to govern the system time?

> I don't understand what the issue is here. Can't you just call
> ptp_clock_event, like you already have...
> 
>>>> +static void efx_ptp_pps_worker(struct work_struct *work)
>>>> +{
>>>> +	struct efx_ptp_data *ptp =
>>>> +		container_of(work, struct efx_ptp_data, pps_work);
>>>> +	struct efx_nic *efx = ptp->channel->efx;
>>>> +	struct timespec event_gen_time;
>>>> +	struct ptp_clock_event ptp_pps_evt;
>>>> +	ktime_t gen_time_host;
>>>> +
>>>> +	if (efx_ptp_synchronize(efx, PTP_SYNC_ATTEMPTS))
>>>> +		return;
>>>> +
>>>> +	gen_time_host = ktime_sub(ptp->mc_base_time,
>>>> +				  ptp->host_base_time);
>>>> +	event_gen_time = ktime_to_timespec(gen_time_host);
>>>> +
>>>> +	ptp_pps_evt.type = PTP_CLOCK_EXTTS;
>>>> +	ptp_pps_evt.timestamp = ktime_to_ns(gen_time_host);
>>>> +	ptp_clock_event(ptp->phc_clock, &ptp_pps_evt);
>>>> +}
> 
> ... here?
> 

In order for a PPS to arrive at the kernel consumer ptp_clock_event
needs to be called with PTP_CLOCK_PPS. This then calls pps_get_ts
and stamps the event with the current system time, not the time
that was put into the event.

Using PTP_CLOCK_EXTTS the PPS is visible to userspace via a read
on the phc device and can then be used in our modified ptpd2.

>>>> +static int efx_phc_adjtime(struct ptp_clock_info *ptp, s64 delta)
>>>> +{
> 
> You can set the time here somehow by doing, T' = T + offset, and so...
> 
>>>> +}
> 
>>>> +static int efx_phc_settime(struct ptp_clock_info *ptp,
>>>> +			   const struct timespec *e_ts)
>>>> +{
>>>> +	/* We must provide this function, but we cannot actually set the time */
>>>
>>> Huh? You can adjtime, so must be able to settime, too, right?
>>>
>>> If you have enough range in the RAW timestamp in the MC firmware (like
>>> 64 bits of nanoseconds), and you allow settime, then you can spare the
>>> system time synchronization code altogether.
>>>
>>
>> You will have to elaborate further on this point.
> 
> ... why can't you also just set the time?

Our hardware can only have an offset applied to the clock. In order to set time
we need to know the time now, then work out and offset to get to the target time.
At the point that we apply this offset the clock will have moved on and not be
set to the target time. We can apply some measured average times to the offset
to get closer but with this hardware settime will not leave the NIC clock at the
desired time. 

> 
> Thanks,
> Richard

^ permalink raw reply

* [PATCH net-next] tcp: use hash_32() in tcp_metrics
From: Eric Dumazet @ 2012-07-20  9:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Fix a missing roundup_pow_of_two(), since tcpmhash_entries is not
guaranteed to be a power of two.

Uses hash_32() instead of custom hash.

tcpmhash_entries should be an unsigned int.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/netns/ipv4.h |    2 +-
 net/ipv4/tcp_metrics.c   |   25 ++++++++++---------------
 2 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d909c7f..0ffb8e3 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -40,7 +40,7 @@ struct netns_ipv4 {
 	struct sock		**icmp_sk;
 	struct inet_peer_base	*peers;
 	struct tcpm_hash_bucket	*tcp_metrics_hash;
-	unsigned int		tcp_metrics_hash_mask;
+	unsigned int		tcp_metrics_hash_log;
 	struct netns_frags	frags;
 #ifdef CONFIG_NETFILTER
 	struct xt_table		*iptable_filter;
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 99779ae..992f1bf 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -7,6 +7,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/tcp.h>
+#include <linux/hash.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/net_namespace.h>
@@ -228,10 +229,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_req(struct request_sock *req,
 		return NULL;
 	}
 
-	hash ^= (hash >> 24) ^ (hash >> 16) ^ (hash >> 8);
-
 	net = dev_net(dst->dev);
-	hash &= net->ipv4.tcp_metrics_hash_mask;
+	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
@@ -265,10 +264,8 @@ static struct tcp_metrics_block *__tcp_get_metrics_tw(struct inet_timewait_sock
 		return NULL;
 	}
 
-	hash ^= (hash >> 24) ^ (hash >> 16) ^ (hash >> 8);
-
 	net = twsk_net(tw);
-	hash &= net->ipv4.tcp_metrics_hash_mask;
+	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	for (tm = rcu_dereference(net->ipv4.tcp_metrics_hash[hash].chain); tm;
 	     tm = rcu_dereference(tm->tcpm_next)) {
@@ -302,10 +299,8 @@ static struct tcp_metrics_block *tcp_get_metrics(struct sock *sk,
 		return NULL;
 	}
 
-	hash ^= (hash >> 24) ^ (hash >> 16) ^ (hash >> 8);
-
 	net = dev_net(dst->dev);
-	hash &= net->ipv4.tcp_metrics_hash_mask;
+	hash = hash_32(hash, net->ipv4.tcp_metrics_hash_log);
 
 	tm = __tcp_get_metrics(&addr, net, hash);
 	reclaim = false;
@@ -694,7 +689,7 @@ void tcp_fastopen_cache_set(struct sock *sk, u16 mss,
 	rcu_read_unlock();
 }
 
-static unsigned long tcpmhash_entries;
+static unsigned int tcpmhash_entries;
 static int __init set_tcpmhash_entries(char *str)
 {
 	ssize_t ret;
@@ -702,7 +697,7 @@ static int __init set_tcpmhash_entries(char *str)
 	if (!str)
 		return 0;
 
-	ret = kstrtoul(str, 0, &tcpmhash_entries);
+	ret = kstrtouint(str, 0, &tcpmhash_entries);
 	if (ret)
 		return 0;
 
@@ -712,7 +707,8 @@ __setup("tcpmhash_entries=", set_tcpmhash_entries);
 
 static int __net_init tcp_net_metrics_init(struct net *net)
 {
-	int slots, size;
+	size_t size;
+	unsigned int slots;
 
 	slots = tcpmhash_entries;
 	if (!slots) {
@@ -722,14 +718,13 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 			slots = 8 * 1024;
 	}
 
-	size = slots * sizeof(struct tcpm_hash_bucket);
+	net->ipv4.tcp_metrics_hash_log = order_base_2(slots);
+	size = sizeof(struct tcpm_hash_bucket) << net->ipv4.tcp_metrics_hash_log;
 
 	net->ipv4.tcp_metrics_hash = kzalloc(size, GFP_KERNEL);
 	if (!net->ipv4.tcp_metrics_hash)
 		return -ENOMEM;
 
-	net->ipv4.tcp_metrics_hash_mask = (slots - 1);
-
 	return 0;
 }
 

^ permalink raw reply related

* [PATCH] ipv4: show pmtu in route list
From: Julian Anastasov @ 2012-07-20  9:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

	Override the metrics with rt_pmtu

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---

	Is this patch still useful if routing cache is removed?

 net/ipv4/route.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 9f7ffbe..d547f6f 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2909,6 +2909,7 @@ static int rt_fill_info(struct net *net,
 	struct nlmsghdr *nlh;
 	unsigned long expires = 0;
 	u32 error;
+	u32 metrics[RTAX_MAX];
 
 	nlh = nlmsg_put(skb, pid, seq, event, sizeof(*r), flags);
 	if (nlh == NULL)
@@ -2953,7 +2954,10 @@ static int rt_fill_info(struct net *net,
 	    nla_put_be32(skb, RTA_GATEWAY, rt->rt_gateway))
 		goto nla_put_failure;
 
-	if (rtnetlink_put_metrics(skb, dst_metrics_ptr(&rt->dst)) < 0)
+	memcpy(metrics, dst_metrics_ptr(&rt->dst), sizeof(metrics));
+	if (rt->rt_pmtu)
+		metrics[RTAX_MTU - 1] = rt->rt_pmtu;
+	if (rtnetlink_put_metrics(skb, metrics) < 0)
 		goto nla_put_failure;
 
 	if (rt->rt_mark &&
-- 
1.7.3.4

^ permalink raw reply related

* (unknown)
From: Standard Credit International Finance @ 2012-07-20  8:12 UTC (permalink / raw)



Do you need business loan or personal loan if yes contact us today?

^ permalink raw reply

* Re: [PATCH net-next] tcp: Return bool instead of int where appropriate
From: Eric Dumazet @ 2012-07-20  8:02 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: netdev, davem
In-Reply-To: <1342769538-4039-1-git-send-email-subramanian.vijay@gmail.com>

On Fri, 2012-07-20 at 00:32 -0700, Vijay Subramanian wrote:
> Applied to a set of static inline functions in tcp_input.c
> 
> Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> ---
>  net/ipv4/tcp_input.c |   16 ++++++++--------
>  1 files changed, 8 insertions(+), 8 deletions(-)

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox