Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] macvlan: lockless tx path
From: Eric Dumazet @ 2010-11-10 23:24 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <4CDB226A.8080903@candelatech.com>

Le mercredi 10 novembre 2010 à 14:53 -0800, Ben Greear a écrit :

> Then it's busted.  If it claims to return stats64, but instead is returning
> something different, it is wrong.  Netlink API still defines a stats32, so
> the kernel should use that if it can't reliably deal with 64-bit counters.
> 

It claims nothing like that. You obviously assumed wrong things.

It provides a framework, not a mandatory or exclusive one.

Really this endless discussion is going nowhere.

I suggest you send patches if you want, I am very pleased by current
stats handling.

^ permalink raw reply

* Re: [PATCH 0/9] treewide: convert vprintk uses to %pV
From: Joe Perches @ 2010-11-10 23:01 UTC (permalink / raw)
  To: Luis R. Rodriguez
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	cluster-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nilfs-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <AANLkTinhcbdm8YQOrFVdONODo6K6PcxHYtx5vqnap_3T-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Wed, 2010-11-10 at 14:48 -0800, Luis R. Rodriguez wrote:
> When was this added upstream BTW? I ask for backport considerations.

commit 7db6f5fb65a82af03229eef104dc9899c5eecf33
Author: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>
Date:   Sun Jun 27 01:02:33 2010 +0000

    vsprintf: Recursive vsnprintf: Add "%pV", struct va_format
    
    Add the ability to print a format and va_list from a structure pointer
    
    Allows __dev_printk to be implemented as a single printk while
    minimizing string space duplication.
    
    %pV should not be used without some mechanism to verify the
    format and argument use ala __attribute__(format (printf(...))).
    
    Signed-off-by: Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>
    Acked-by: Greg Kroah-Hartman <gregkh-l3A5Bk7waGM@public.gmane.org>
    Signed-off-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>


--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Ben Greear @ 2010-11-10 22:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1289427705.17691.52.camel@edumazet-laptop>

On 11/10/2010 02:21 PM, Eric Dumazet wrote:
> Le mercredi 10 novembre 2010 à 13:35 -0800, Ben Greear a écrit :

NOTE:  I trimmed the CC list..this has nothing in particular to
do with mac-vlans now.

>> So an application that must deal with wraps must poll at the minimal
>> time interval for wrapping 32-bit counters at whatever speed, or it
>> must pay attention to the driver to somehow know that this magic driver
>> can *really* do 64-bit stats properly?
>>
>
> Are you aware that you speak of something that is not specified at all
> in linux ?
>
> Frequency of polling is not part of any RFC. This usually is tunable in
> the _application_. Some people sample stats every 5 minutes, some sample
> every second, and hit the "xxx driver updates its stats every two
> seconds, this sucks"

These '2 sec' granularity bugs are a pain and should be (and have been)
fixed.

> I wrote SNMP apps based on /proc/net/dev and all just work, with any
> versions, any driver. Of course, some of them broke 6 years ago because
> they were 32bit legacy application, running on a 64bit kernel. I never
> asked David to change /proc/net/dev to cap counters to 32bit.

I did similar, and then wrote extra code to detect a 64-bit kernel and if
so assume that the counters wrap at 64 bits so I didn't have to poll so
often to make sure I didn't miss a wrap for a 10G NIC.  If instead one wraps at 33
bits and the other at 36, there is no way for me to deal with the wrap
properly w/out explicitly knowing about that 33 and 36.

If the old 32-bit counters in /proc/net/dev instead had a driver that
managed to wrap them at 28 bits, I can't see how your application could
have worked properly, so you must have been assuming that the kernel would
always return a full 32-bit counter.

Now, I'm trying to use netlink api since I'm hoping that is more efficient
and controllable than just reading /proc/net/dev over and over.  Netlink
explicitly can return a set of 32-bit counters or 64-bit.

All I want is to ensure than they are actually 32 or 64 bit, not 36 bit crammed in a
64-bit counter.  In other words, make the driver and/or kernel do it's
job and abstract access to hardware and return a consistent interface.

>> Please note that just because a counter is less than the previous read,
>> that doesn't by itself tell us if it wrapped once or twice.  And, if we
>> don't know at which number of bits it wraps, then we don't know how many
>> to add even if we are certain it wrapped only once.
>>
>
> I repeat : Nothing in /proc/net/dev can tell you when a counter will
> wrap (the counter width).

My primary concern is netlink API now, and even for proc/net/dev, there is no
good reason to show 32-bit counters mixed with 64-bit counters on 64-bit systems.
The kernel can deal with this easily enough, and it should.

>> If netlink reports stats64, then those should only wrap at 64 bits,
>> and if it reports stats32, then wrap at 32-bits.
>>
>
> I believe you are mistaken. We provide stats64 for all drivers, even
> 32bit legacy ones. rtnetlink has no way to report counter widths,
> because nobody cared.

Then it's busted.  If it claims to return stats64, but instead is returning
something different, it is wrong.  Netlink API still defines a stats32, so
the kernel should use that if it can't reliably deal with 64-bit counters.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH 0/9] treewide: convert vprintk uses to %pV
From: Luis R. Rodriguez @ 2010-11-10 22:48 UTC (permalink / raw)
  To: Joe Perches
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	cluster-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nilfs-u79uwXL29TY76Z2rM5mHXA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <cover.1289348757.git.joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org>

On Tue, Nov 9, 2010 at 4:35 PM, Joe Perches <joe-6d6DIl74uiNBDgjK7y7TUQ@public.gmane.org> wrote:
> Multiple secessive calls to printk can be interleaved.
> Avoid this possible interleaving by using %pV
>
> Joe Perches (9):
>  drivers/gpu/drm/drm_stub.c: Use printf extension %pV
>  drivers/isdn/mISDN: Use printf extension %pV
>  drivers/net/wireless/ath/debug.c: Use printf extension %pV
>  drivers/net/wireless/b43/main.c: Use printf extension %pV
>  drivers/net/wireless/b43legacy/main.c: Use printf extension %pV
>  fs/gfs2/glock.c: Use printf extension %pV
>  fs/nilfs2/super.c: Use printf extension %pV
>  fs/quota/dquot.c: Use printf extension %pV
>  net/sunrpc/svc.c: Use printf extension %pV
>
>  drivers/gpu/drm/drm_stub.c            |   14 +++++++--
>  drivers/isdn/mISDN/layer1.c           |   10 +++++--
>  drivers/isdn/mISDN/layer2.c           |   12 ++++++--
>  drivers/isdn/mISDN/tei.c              |   23 +++++++++++----
>  drivers/net/wireless/ath/debug.c      |    9 +++++-
>  drivers/net/wireless/b43/main.c       |   48 ++++++++++++++++++++++++--------
>  drivers/net/wireless/b43legacy/main.c |   47 ++++++++++++++++++++++++--------
>  fs/gfs2/glock.c                       |    9 +++++-
>  fs/nilfs2/super.c                     |   23 +++++++++++-----
>  fs/quota/dquot.c                      |   12 +++++---
>  net/sunrpc/svc.c                      |   12 +++++---
>  11 files changed, 161 insertions(+), 58 deletions(-)

When was this added upstream BTW? I ask for backport considerations.

  Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2.6.36-rc8] net-next: Add multiqueue support to vmxnet3 v2driver
From: Shreyas Bhatewara @ 2010-11-10 22:37 UTC (permalink / raw)
  To: David Miller
  Cc: bhutchings@solarflare.com, shemminger@vyatta.com,
	netdev@vger.kernel.org, pv-drivers@vmware.com,
	linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LRH.2.00.1011011522020.12306@sbhatewara-dev1.eng.vmware.com>



On Mon, 1 Nov 2010, Shreyas Bhatewara wrote:

David/Stephen,

Any word about this patch ? To list out the changes made to the patch 
since last time :

- Added ethtool handlers for configuring RSS table and getting # rx queues
- Removed module parameters which were not strictly required. Eg. Those required
 for the above configuration
- Introduced module parameter to enable/disable multiqueue capability of 
the driver

Thanks.
Shreyas

Reviewed-by: Bhavesh Davda <bhavesh@vmware.com>

> 
> 
> Add multiqueue support to vmxnet3 driver
> 
> This change adds Multiqueue and thus receive side scaling support  
> to vmxnet3 device driver. Number of rx queues is limited to 1 in cases 
> where
> - MSI is not configured or
> - One MSIx vector is not available per rx queue
> 
> By default multiqueue capability is turned off and hence only 1 tx and 1 rx
> queue will be initialized. enable_mq module param should be set to 
> configure number of tx and rx queues equal to number of online CPUs. A 
> maximum of 8 tx/rx queues are allowed for any adapter.
> 
> Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
> 
> ---
> 
> 2nd revision of the patch.
> 
> In this revision, module params which are not strictly required have been
> removed and ethtool callback handlers have been implemented instead. 
> Handlers to provide # rx queues and to get/set RSS indirection table are added.
> Information like Number of queues and how they share irqs is required at 
> driver attach time. Adding ethtool interfaces cannot help in this regards.
> Hence two module params have been introduced : enable_mq (to configure if
> multiple queues should be used) and irq_share_mode to configure the way in
> which irqs will be shared among queues. 
> 
> 
> diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
> index 3f60e0e..3ed4be6 100644
> --- a/drivers/net/vmxnet3/vmxnet3_drv.c
> +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
> @@ -44,6 +44,26 @@ MODULE_DEVICE_TABLE(pci, vmxnet3_pciid_table);
>  
>  static atomic_t devices_found;
>  
> +#define VMXNET3_MAX_DEVICES 10
> +static int enable_mq[VMXNET3_MAX_DEVICES + 1] = {
> +	[0 ... VMXNET3_MAX_DEVICES] = 0 };
> +static int irq_share_mode[VMXNET3_MAX_DEVICES + 1] = {
> +	[0 ... VMXNET3_MAX_DEVICES] = VMXNET3_INTR_BUDDYSHARE };
> +
> +static unsigned int num_adapters;
> +module_param_array(irq_share_mode, int, &num_adapters, 0400);
> +MODULE_PARM_DESC(irq_share_mode, "Comma separated list of ints, configuring "
> +		 "mode in which irqs should be shared by tx and rx queues. When"
> +		 " set to 0, no irqs are shared, each tx and rx queue allocate"
> +		 " and use a separate irq. Set to 1, all tx queues share an irq"
> +		 ". Set to 2, corresponding tx and rx queues share an irq."
> +		 " Default is 2.");
> +module_param_array(enable_mq, int, &num_adapters, 0400);
> +MODULE_PARM_DESC(enable_mq, "Comma separated list of integers, one for each "
> +		 "adapter. When set to a non-zero value, multiqueue will be "
> +		 "enabled and number of tx and rx queues will be same as number"
> +		 " of CPUs online. number of queues will be 1 otherwise. "
> +		 "Default is 0 - multiqueue disabled.");
>  
>  /*
>   *    Enable/Disable the given intr
> @@ -107,7 +127,7 @@ static void
>  vmxnet3_tq_start(struct vmxnet3_tx_queue *tq, struct vmxnet3_adapter *adapter)
>  {
>  	tq->stopped = false;
> -	netif_start_queue(adapter->netdev);
> +	netif_start_subqueue(adapter->netdev, tq - adapter->tx_queue);
>  }
>  
>  
> @@ -115,7 +135,7 @@ static void
>  vmxnet3_tq_wake(struct vmxnet3_tx_queue *tq, struct vmxnet3_adapter *adapter)
>  {
>  	tq->stopped = false;
> -	netif_wake_queue(adapter->netdev);
> +	netif_wake_subqueue(adapter->netdev, (tq - adapter->tx_queue));
>  }
>  
>  
> @@ -124,7 +144,7 @@ vmxnet3_tq_stop(struct vmxnet3_tx_queue *tq, struct vmxnet3_adapter *adapter)
>  {
>  	tq->stopped = true;
>  	tq->num_stop++;
> -	netif_stop_queue(adapter->netdev);
> +	netif_stop_subqueue(adapter->netdev, (tq - adapter->tx_queue));
>  }
>  
>  
> @@ -135,6 +155,7 @@ static void
>  vmxnet3_check_link(struct vmxnet3_adapter *adapter, bool affectTxQueue)
>  {
>  	u32 ret;
> +	int i;
>  
>  	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD, VMXNET3_CMD_GET_LINK);
>  	ret = VMXNET3_READ_BAR1_REG(adapter, VMXNET3_REG_CMD);
> @@ -145,22 +166,28 @@ vmxnet3_check_link(struct vmxnet3_adapter *adapter, bool affectTxQueue)
>  		if (!netif_carrier_ok(adapter->netdev))
>  			netif_carrier_on(adapter->netdev);
>  
> -		if (affectTxQueue)
> -			vmxnet3_tq_start(&adapter->tx_queue, adapter);
> +		if (affectTxQueue) {
> +			for (i = 0; i < adapter->num_tx_queues; i++)
> +				vmxnet3_tq_start(&adapter->tx_queue[i],
> +						 adapter);
> +		}
>  	} else {
>  		printk(KERN_INFO "%s: NIC Link is Down\n",
>  		       adapter->netdev->name);
>  		if (netif_carrier_ok(adapter->netdev))
>  			netif_carrier_off(adapter->netdev);
>  
> -		if (affectTxQueue)
> -			vmxnet3_tq_stop(&adapter->tx_queue, adapter);
> +		if (affectTxQueue) {
> +			for (i = 0; i < adapter->num_tx_queues; i++)
> +				vmxnet3_tq_stop(&adapter->tx_queue[i], adapter);
> +		}
>  	}
>  }
>  
>  static void
>  vmxnet3_process_events(struct vmxnet3_adapter *adapter)
>  {
> +	int i;
>  	u32 events = le32_to_cpu(adapter->shared->ecr);
>  	if (!events)
>  		return;
> @@ -176,16 +203,18 @@ vmxnet3_process_events(struct vmxnet3_adapter *adapter)
>  		VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD,
>  				       VMXNET3_CMD_GET_QUEUE_STATUS);
>  
> -		if (adapter->tqd_start->status.stopped) {
> -			printk(KERN_ERR "%s: tq error 0x%x\n",
> -			       adapter->netdev->name,
> -			       le32_to_cpu(adapter->tqd_start->status.error));
> -		}
> -		if (adapter->rqd_start->status.stopped) {
> -			printk(KERN_ERR "%s: rq error 0x%x\n",
> -			       adapter->netdev->name,
> -			       adapter->rqd_start->status.error);
> -		}
> +		for (i = 0; i < adapter->num_tx_queues; i++)
> +			if (adapter->tqd_start[i].status.stopped)
> +				dev_dbg(&adapter->netdev->dev,
> +					"%s: tq[%d] error 0x%x\n",
> +					adapter->netdev->name, i, le32_to_cpu(
> +					adapter->tqd_start[i].status.error));
> +		for (i = 0; i < adapter->num_rx_queues; i++)
> +			if (adapter->rqd_start[i].status.stopped)
> +				dev_dbg(&adapter->netdev->dev,
> +					"%s: rq[%d] error 0x%x\n",
> +					adapter->netdev->name, i,
> +					adapter->rqd_start[i].status.error);
>  
>  		schedule_work(&adapter->work);
>  	}
> @@ -410,7 +439,7 @@ vmxnet3_tq_cleanup(struct vmxnet3_tx_queue *tq,
>  }
>  
>  
> -void
> +static void
>  vmxnet3_tq_destroy(struct vmxnet3_tx_queue *tq,
>  		   struct vmxnet3_adapter *adapter)
>  {
> @@ -437,6 +466,17 @@ vmxnet3_tq_destroy(struct vmxnet3_tx_queue *tq,
>  }
>  
>  
> +/* Destroy all tx queues */
> +void
> +vmxnet3_tq_destroy_all(struct vmxnet3_adapter *adapter)
> +{
> +	int i;
> +
> +	for (i = 0; i < adapter->num_tx_queues; i++)
> +		vmxnet3_tq_destroy(&adapter->tx_queue[i], adapter);
> +}
> +
> +
>  static void
>  vmxnet3_tq_init(struct vmxnet3_tx_queue *tq,
>  		struct vmxnet3_adapter *adapter)
> @@ -518,6 +558,14 @@ err:
>  	return -ENOMEM;
>  }
>  
> +static void
> +vmxnet3_tq_cleanup_all(struct vmxnet3_adapter *adapter)
> +{
> +	int i;
> +
> +	for (i = 0; i < adapter->num_tx_queues; i++)
> +		vmxnet3_tq_cleanup(&adapter->tx_queue[i], adapter);
> +}
>  
>  /*
>   *    starting from ring->next2fill, allocate rx buffers for the given ring
> @@ -732,6 +780,17 @@ vmxnet3_map_pkt(struct sk_buff *skb, struct vmxnet3_tx_ctx *ctx,
>  }
>  
>  
> +/* Init all tx queues */
> +static void
> +vmxnet3_tq_init_all(struct vmxnet3_adapter *adapter)
> +{
> +	int i;
> +
> +	for (i = 0; i < adapter->num_tx_queues; i++)
> +		vmxnet3_tq_init(&adapter->tx_queue[i], adapter);
> +}
> +
> +
>  /*
>   *    parse and copy relevant protocol headers:
>   *      For a tso pkt, relevant headers are L2/3/4 including options
> @@ -1000,8 +1059,8 @@ vmxnet3_tq_xmit(struct sk_buff *skb, struct vmxnet3_tx_queue *tq,
>  	if (le32_to_cpu(tq->shared->txNumDeferred) >=
>  					le32_to_cpu(tq->shared->txThreshold)) {
>  		tq->shared->txNumDeferred = 0;
> -		VMXNET3_WRITE_BAR0_REG(adapter, VMXNET3_REG_TXPROD,
> -				       tq->tx_ring.next2fill);
> +		VMXNET3_WRITE_BAR0_REG(adapter, (VMXNET3_REG_TXPROD +
> +				       tq->qid * 8), tq->tx_ring.next2fill);
>  	}
>  
>  	return NETDEV_TX_OK;
> @@ -1020,7 +1079,10 @@ vmxnet3_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
>  {
>  	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
>  
> -	return vmxnet3_tq_xmit(skb, &adapter->tx_queue, adapter, netdev);
> +		BUG_ON(skb->queue_mapping > adapter->num_tx_queues);
> +		return vmxnet3_tq_xmit(skb,
> +				       &adapter->tx_queue[skb->queue_mapping],
> +				       adapter, netdev);
>  }
>  
>  
> @@ -1106,9 +1168,9 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
>  			break;
>  		}
>  		num_rxd++;
> -
> +		BUG_ON(rcd->rqID != rq->qid && rcd->rqID != rq->qid2);
>  		idx = rcd->rxdIdx;
> -		ring_idx = rcd->rqID == rq->qid ? 0 : 1;
> +		ring_idx = rcd->rqID < adapter->num_rx_queues ? 0 : 1;
>  		vmxnet3_getRxDesc(rxd, &rq->rx_ring[ring_idx].base[idx].rxd,
>  				  &rxCmdDesc);
>  		rbi = rq->buf_info[ring_idx] + idx;
> @@ -1260,6 +1322,16 @@ vmxnet3_rq_cleanup(struct vmxnet3_rx_queue *rq,
>  }
>  
>  
> +static void
> +vmxnet3_rq_cleanup_all(struct vmxnet3_adapter *adapter)
> +{
> +	int i;
> +
> +	for (i = 0; i < adapter->num_rx_queues; i++)
> +		vmxnet3_rq_cleanup(&adapter->rx_queue[i], adapter);
> +}
> +
> +
>  void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
>  			struct vmxnet3_adapter *adapter)
>  {
> @@ -1351,6 +1423,25 @@ vmxnet3_rq_init(struct vmxnet3_rx_queue *rq,
>  
>  
>  static int
> +vmxnet3_rq_init_all(struct vmxnet3_adapter *adapter)
> +{
> +	int i, err = 0;
> +
> +	for (i = 0; i < adapter->num_rx_queues; i++) {
> +		err = vmxnet3_rq_init(&adapter->rx_queue[i], adapter);
> +		if (unlikely(err)) {
> +			dev_err(&adapter->netdev->dev, "%s: failed to "
> +				"initialize rx queue%i\n",
> +				adapter->netdev->name, i);
> +			break;
> +		}
> +	}
> +	return err;
> +
> +}
> +
> +
> +static int
>  vmxnet3_rq_create(struct vmxnet3_rx_queue *rq, struct vmxnet3_adapter *adapter)
>  {
>  	int i;
> @@ -1398,32 +1489,176 @@ err:
>  
>  
>  static int
> +vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter)
> +{
> +	int i, err = 0;
> +
> +	for (i = 0; i < adapter->num_rx_queues; i++) {
> +		err = vmxnet3_rq_create(&adapter->rx_queue[i], adapter);
> +		if (unlikely(err)) {
> +			dev_err(&adapter->netdev->dev,
> +				"%s: failed to create rx queue%i\n",
> +				adapter->netdev->name, i);
> +			goto err_out;
> +		}
> +	}
> +	return err;
> +err_out:
> +	vmxnet3_rq_destroy_all(adapter);
> +	return err;
> +
> +}
> +
> +/* Multiple queue aware polling function for tx and rx */
> +
> +static int
>  vmxnet3_do_poll(struct vmxnet3_adapter *adapter, int budget)
>  {
> +	int rcd_done = 0, i;
>  	if (unlikely(adapter->shared->ecr))
>  		vmxnet3_process_events(adapter);
> +	for (i = 0; i < adapter->num_tx_queues; i++)
> +		vmxnet3_tq_tx_complete(&adapter->tx_queue[i], adapter);
>  
> -	vmxnet3_tq_tx_complete(&adapter->tx_queue, adapter);
> -	return vmxnet3_rq_rx_complete(&adapter->rx_queue, adapter, budget);
> +	for (i = 0; i < adapter->num_rx_queues; i++)
> +		rcd_done += vmxnet3_rq_rx_complete(&adapter->rx_queue[i],
> +						   adapter, budget);
> +	return rcd_done;
>  }
>  
>  
>  static int
>  vmxnet3_poll(struct napi_struct *napi, int budget)
>  {
> -	struct vmxnet3_adapter *adapter = container_of(napi,
> -					  struct vmxnet3_adapter, napi);
> +	struct vmxnet3_rx_queue *rx_queue = container_of(napi,
> +					  struct vmxnet3_rx_queue, napi);
>  	int rxd_done;
>  
> -	rxd_done = vmxnet3_do_poll(adapter, budget);
> +	rxd_done = vmxnet3_do_poll(rx_queue->adapter, budget);
>  
>  	if (rxd_done < budget) {
>  		napi_complete(napi);
> -		vmxnet3_enable_intr(adapter, 0);
> +		vmxnet3_enable_all_intrs(rx_queue->adapter);
>  	}
>  	return rxd_done;
>  }
>  
> +/*
> + * NAPI polling function for MSI-X mode with multiple Rx queues
> + * Returns the # of the NAPI credit consumed (# of rx descriptors processed)
> + */
> +
> +static int
> +vmxnet3_poll_rx_only(struct napi_struct *napi, int budget)
> +{
> +	struct vmxnet3_rx_queue *rq = container_of(napi,
> +						struct vmxnet3_rx_queue, napi);
> +	struct vmxnet3_adapter *adapter = rq->adapter;
> +	int rxd_done;
> +
> +	/* When sharing interrupt with corresponding tx queue, process
> +	 * tx completions in that queue as well
> +	 */
> +	if (adapter->share_intr == VMXNET3_INTR_BUDDYSHARE) {
> +		struct vmxnet3_tx_queue *tq =
> +				&adapter->tx_queue[rq - adapter->rx_queue];
> +		vmxnet3_tq_tx_complete(tq, adapter);
> +	}
> +
> +	rxd_done = vmxnet3_rq_rx_complete(rq, adapter, budget);
> +
> +	if (rxd_done < budget) {
> +		napi_complete(napi);
> +		vmxnet3_enable_intr(adapter, rq->comp_ring.intr_idx);
> +	}
> +	return rxd_done;
> +}
> +
> +
> +#ifdef CONFIG_PCI_MSI
> +
> +/*
> + * Handle completion interrupts on tx queues
> + * Returns whether or not the intr is handled
> + */
> +
> +static irqreturn_t
> +vmxnet3_msix_tx(int irq, void *data)
> +{
> +	struct vmxnet3_tx_queue *tq = data;
> +	struct vmxnet3_adapter *adapter = tq->adapter;
> +
> +	if (adapter->intr.mask_mode == VMXNET3_IMM_ACTIVE)
> +		vmxnet3_disable_intr(adapter, tq->comp_ring.intr_idx);
> +
> +	/* Handle the case where only one irq is allocate for all tx queues */
> +	if (adapter->share_intr == VMXNET3_INTR_TXSHARE) {
> +		int i;
> +		for (i = 0; i < adapter->num_tx_queues; i++) {
> +			struct vmxnet3_tx_queue *txq = &adapter->tx_queue[i];
> +			vmxnet3_tq_tx_complete(txq, adapter);
> +		}
> +	} else {
> +		vmxnet3_tq_tx_complete(tq, adapter);
> +	}
> +	vmxnet3_enable_intr(adapter, tq->comp_ring.intr_idx);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +
> +/*
> + * Handle completion interrupts on rx queues. Returns whether or not the
> + * intr is handled
> + */
> +
> +static irqreturn_t
> +vmxnet3_msix_rx(int irq, void *data)
> +{
> +	struct vmxnet3_rx_queue *rq = data;
> +	struct vmxnet3_adapter *adapter = rq->adapter;
> +
> +	/* disable intr if needed */
> +	if (adapter->intr.mask_mode == VMXNET3_IMM_ACTIVE)
> +		vmxnet3_disable_intr(adapter, rq->comp_ring.intr_idx);
> +	napi_schedule(&rq->napi);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +/*
> + *----------------------------------------------------------------------------
> + *
> + * vmxnet3_msix_event --
> + *
> + *    vmxnet3 msix event intr handler
> + *
> + * Result:
> + *    whether or not the intr is handled
> + *
> + *----------------------------------------------------------------------------
> + */
> +
> +static irqreturn_t
> +vmxnet3_msix_event(int irq, void *data)
> +{
> +	struct net_device *dev = data;
> +	struct vmxnet3_adapter *adapter = netdev_priv(dev);
> +
> +	/* disable intr if needed */
> +	if (adapter->intr.mask_mode == VMXNET3_IMM_ACTIVE)
> +		vmxnet3_disable_intr(adapter, adapter->intr.event_intr_idx);
> +
> +	if (adapter->shared->ecr)
> +		vmxnet3_process_events(adapter);
> +
> +	vmxnet3_enable_intr(adapter, adapter->intr.event_intr_idx);
> +
> +	return IRQ_HANDLED;
> +}
> +
> +#endif /* CONFIG_PCI_MSI  */
> +
>  
>  /* Interrupt handler for vmxnet3  */
>  static irqreturn_t
> @@ -1432,7 +1667,7 @@ vmxnet3_intr(int irq, void *dev_id)
>  	struct net_device *dev = dev_id;
>  	struct vmxnet3_adapter *adapter = netdev_priv(dev);
>  
> -	if (unlikely(adapter->intr.type == VMXNET3_IT_INTX)) {
> +	if (adapter->intr.type == VMXNET3_IT_INTX) {
>  		u32 icr = VMXNET3_READ_BAR1_REG(adapter, VMXNET3_REG_ICR);
>  		if (unlikely(icr == 0))
>  			/* not ours */
> @@ -1442,77 +1677,136 @@ vmxnet3_intr(int irq, void *dev_id)
>  
>  	/* disable intr if needed */
>  	if (adapter->intr.mask_mode == VMXNET3_IMM_ACTIVE)
> -		vmxnet3_disable_intr(adapter, 0);
> +		vmxnet3_disable_all_intrs(adapter);
>  
> -	napi_schedule(&adapter->napi);
> +	napi_schedule(&adapter->rx_queue[0].napi);
>  
>  	return IRQ_HANDLED;
>  }
>  
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>  
> -
>  /* netpoll callback. */
>  static void
>  vmxnet3_netpoll(struct net_device *netdev)
>  {
>  	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> -	int irq;
>  
> -#ifdef CONFIG_PCI_MSI
> -	if (adapter->intr.type == VMXNET3_IT_MSIX)
> -		irq = adapter->intr.msix_entries[0].vector;
> -	else
> -#endif
> -		irq = adapter->pdev->irq;
> +	if (adapter->intr.mask_mode == VMXNET3_IMM_ACTIVE)
> +		vmxnet3_disable_all_intrs(adapter);
> +
> +	vmxnet3_do_poll(adapter, adapter->rx_queue[0].rx_ring[0].size);
> +	vmxnet3_enable_all_intrs(adapter);
>  
> -	disable_irq(irq);
> -	vmxnet3_intr(irq, netdev);
> -	enable_irq(irq);
>  }
> -#endif
> +#endif	/* CONFIG_NET_POLL_CONTROLLER */
>  
>  static int
>  vmxnet3_request_irqs(struct vmxnet3_adapter *adapter)
>  {
> -	int err;
> +	struct vmxnet3_intr *intr = &adapter->intr;
> +	int err = 0, i;
> +	int vector = 0;
>  
>  #ifdef CONFIG_PCI_MSI
>  	if (adapter->intr.type == VMXNET3_IT_MSIX) {
> -		/* we only use 1 MSI-X vector */
> -		err = request_irq(adapter->intr.msix_entries[0].vector,
> -				  vmxnet3_intr, 0, adapter->netdev->name,
> -				  adapter->netdev);
> -	} else if (adapter->intr.type == VMXNET3_IT_MSI) {
> +		for (i = 0; i < adapter->num_tx_queues; i++) {
> +			sprintf(adapter->tx_queue[i].name, "%s:v%d-%s",
> +				adapter->netdev->name, vector, "Tx");
> +			if (adapter->share_intr != VMXNET3_INTR_BUDDYSHARE)
> +				err = request_irq(
> +					      intr->msix_entries[vector].vector,
> +					      vmxnet3_msix_tx, 0,
> +					      adapter->tx_queue[i].name,
> +					      &adapter->tx_queue[i]);
> +			if (err) {
> +				dev_err(&adapter->netdev->dev,
> +					"Failed to request irq for MSIX, %s, "
> +					"error %d\n",
> +					adapter->tx_queue[i].name, err);
> +				return err;
> +			}
> +
> +			/* Handle the case where only 1 MSIx was allocated for
> +			 * all tx queues */
> +			if (adapter->share_intr == VMXNET3_INTR_TXSHARE) {
> +				for (; i < adapter->num_tx_queues; i++)
> +					adapter->tx_queue[i].comp_ring.intr_idx
> +								= vector;
> +				vector++;
> +				break;
> +			} else {
> +				adapter->tx_queue[i].comp_ring.intr_idx
> +								= vector++;
> +			}
> +		}
> +		if (adapter->share_intr == VMXNET3_INTR_BUDDYSHARE)
> +			vector = 0;
> +
> +		for (i = 0; i < adapter->num_rx_queues; i++) {
> +			sprintf(adapter->rx_queue[i].name, "%s:v%d-%s",
> +				adapter->netdev->name, vector, "Rx");
> +			err = request_irq(intr->msix_entries[vector].vector,
> +					  vmxnet3_msix_rx, 0,
> +					  adapter->rx_queue[i].name,
> +					  &(adapter->rx_queue[i]));
> +			if (err) {
> +				printk(KERN_ERR "Failed to request irq for MSIX"
> +				       ", %s, error %d\n",
> +				       adapter->rx_queue[i].name, err);
> +				return err;
> +			}
> +
> +			adapter->rx_queue[i].comp_ring.intr_idx = vector++;
> +		}
> +
> +		sprintf(intr->event_msi_vector_name, "%s:v%d-event",
> +			adapter->netdev->name, vector);
> +		err = request_irq(intr->msix_entries[vector].vector,
> +				  vmxnet3_msix_event, 0,
> +				  intr->event_msi_vector_name, adapter->netdev);
> +		intr->event_intr_idx = vector;
> +
> +	} else if (intr->type == VMXNET3_IT_MSI) {
> +		adapter->num_rx_queues = 1;
>  		err = request_irq(adapter->pdev->irq, vmxnet3_intr, 0,
>  				  adapter->netdev->name, adapter->netdev);
> -	} else
> +	} else {
>  #endif
> -	{
> +		adapter->num_rx_queues = 1;
>  		err = request_irq(adapter->pdev->irq, vmxnet3_intr,
>  				  IRQF_SHARED, adapter->netdev->name,
>  				  adapter->netdev);
> +#ifdef CONFIG_PCI_MSI
>  	}
> -
> -	if (err)
> +#endif
> +	intr->num_intrs = vector + 1;
> +	if (err) {
>  		printk(KERN_ERR "Failed to request irq %s (intr type:%d), error"
> -		       ":%d\n", adapter->netdev->name, adapter->intr.type, err);
> +		       ":%d\n", adapter->netdev->name, intr->type, err);
> +	} else {
> +		/* Number of rx queues will not change after this */
> +		for (i = 0; i < adapter->num_rx_queues; i++) {
> +			struct vmxnet3_rx_queue *rq = &adapter->rx_queue[i];
> +			rq->qid = i;
> +			rq->qid2 = i + adapter->num_rx_queues;
> +		}
>  
>  
> -	if (!err) {
> -		int i;
> -		/* init our intr settings */
> -		for (i = 0; i < adapter->intr.num_intrs; i++)
> -			adapter->intr.mod_levels[i] = UPT1_IML_ADAPTIVE;
>  
> -		/* next setup intr index for all intr sources */
> -		adapter->tx_queue.comp_ring.intr_idx = 0;
> -		adapter->rx_queue.comp_ring.intr_idx = 0;
> -		adapter->intr.event_intr_idx = 0;
> +		/* init our intr settings */
> +		for (i = 0; i < intr->num_intrs; i++)
> +			intr->mod_levels[i] = UPT1_IML_ADAPTIVE;
> +		if (adapter->intr.type != VMXNET3_IT_MSIX) {
> +			adapter->intr.event_intr_idx = 0;
> +			for (i = 0; i < adapter->num_tx_queues; i++)
> +				adapter->tx_queue[i].comp_ring.intr_idx = 0;
> +			adapter->rx_queue[0].comp_ring.intr_idx = 0;
> +		}
>  
>  		printk(KERN_INFO "%s: intr type %u, mode %u, %u vectors "
> -		       "allocated\n", adapter->netdev->name, adapter->intr.type,
> -		       adapter->intr.mask_mode, adapter->intr.num_intrs);
> +		       "allocated\n", adapter->netdev->name, intr->type,
> +		       intr->mask_mode, intr->num_intrs);
>  	}
>  
>  	return err;
> @@ -1522,18 +1816,32 @@ vmxnet3_request_irqs(struct vmxnet3_adapter *adapter)
>  static void
>  vmxnet3_free_irqs(struct vmxnet3_adapter *adapter)
>  {
> -	BUG_ON(adapter->intr.type == VMXNET3_IT_AUTO ||
> -	       adapter->intr.num_intrs <= 0);
> +	struct vmxnet3_intr *intr = &adapter->intr;
> +	BUG_ON(intr->type == VMXNET3_IT_AUTO || intr->num_intrs <= 0);
>  
> -	switch (adapter->intr.type) {
> +	switch (intr->type) {
>  #ifdef CONFIG_PCI_MSI
>  	case VMXNET3_IT_MSIX:
>  	{
> -		int i;
> +		int i, vector = 0;
> +
> +		if (adapter->share_intr != VMXNET3_INTR_BUDDYSHARE) {
> +			for (i = 0; i < adapter->num_tx_queues; i++) {
> +				free_irq(intr->msix_entries[vector++].vector,
> +					 &(adapter->tx_queue[i]));
> +				if (adapter->share_intr == VMXNET3_INTR_TXSHARE)
> +					break;
> +			}
> +		}
> +
> +		for (i = 0; i < adapter->num_rx_queues; i++) {
> +			free_irq(intr->msix_entries[vector++].vector,
> +				 &(adapter->rx_queue[i]));
> +		}
>  
> -		for (i = 0; i < adapter->intr.num_intrs; i++)
> -			free_irq(adapter->intr.msix_entries[i].vector,
> -				 adapter->netdev);
> +		free_irq(intr->msix_entries[vector].vector,
> +			 adapter->netdev);
> +		BUG_ON(vector >= intr->num_intrs);
>  		break;
>  	}
>  #endif
> @@ -1729,6 +2037,15 @@ vmxnet3_set_mc(struct net_device *netdev)
>  	kfree(new_table);
>  }
>  
> +void
> +vmxnet3_rq_destroy_all(struct vmxnet3_adapter *adapter)
> +{
> +	int i;
> +
> +	for (i = 0; i < adapter->num_rx_queues; i++)
> +		vmxnet3_rq_destroy(&adapter->rx_queue[i], adapter);
> +}
> +
>  
>  /*
>   *   Set up driver_shared based on settings in adapter.
> @@ -1776,40 +2093,72 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
>  	devRead->misc.mtu = cpu_to_le32(adapter->netdev->mtu);
>  	devRead->misc.queueDescPA = cpu_to_le64(adapter->queue_desc_pa);
>  	devRead->misc.queueDescLen = cpu_to_le32(
> -				     sizeof(struct Vmxnet3_TxQueueDesc) +
> -				     sizeof(struct Vmxnet3_RxQueueDesc));
> +		adapter->num_tx_queues * sizeof(struct Vmxnet3_TxQueueDesc) +
> +		adapter->num_rx_queues * sizeof(struct Vmxnet3_RxQueueDesc));
>  
>  	/* tx queue settings */
> -	BUG_ON(adapter->tx_queue.tx_ring.base == NULL);
> -
> -	devRead->misc.numTxQueues = 1;
> -	tqc = &adapter->tqd_start->conf;
> -	tqc->txRingBasePA   = cpu_to_le64(adapter->tx_queue.tx_ring.basePA);
> -	tqc->dataRingBasePA = cpu_to_le64(adapter->tx_queue.data_ring.basePA);
> -	tqc->compRingBasePA = cpu_to_le64(adapter->tx_queue.comp_ring.basePA);
> -	tqc->ddPA           = cpu_to_le64(virt_to_phys(
> -						adapter->tx_queue.buf_info));
> -	tqc->txRingSize     = cpu_to_le32(adapter->tx_queue.tx_ring.size);
> -	tqc->dataRingSize   = cpu_to_le32(adapter->tx_queue.data_ring.size);
> -	tqc->compRingSize   = cpu_to_le32(adapter->tx_queue.comp_ring.size);
> -	tqc->ddLen          = cpu_to_le32(sizeof(struct vmxnet3_tx_buf_info) *
> -			      tqc->txRingSize);
> -	tqc->intrIdx        = adapter->tx_queue.comp_ring.intr_idx;
> +	devRead->misc.numTxQueues =  adapter->num_tx_queues;
> +	for (i = 0; i < adapter->num_tx_queues; i++) {
> +		struct vmxnet3_tx_queue	*tq = &adapter->tx_queue[i];
> +		BUG_ON(adapter->tx_queue[i].tx_ring.base == NULL);
> +		tqc = &adapter->tqd_start[i].conf;
> +		tqc->txRingBasePA   = cpu_to_le64(tq->tx_ring.basePA);
> +		tqc->dataRingBasePA = cpu_to_le64(tq->data_ring.basePA);
> +		tqc->compRingBasePA = cpu_to_le64(tq->comp_ring.basePA);
> +		tqc->ddPA           = cpu_to_le64(virt_to_phys(tq->buf_info));
> +		tqc->txRingSize     = cpu_to_le32(tq->tx_ring.size);
> +		tqc->dataRingSize   = cpu_to_le32(tq->data_ring.size);
> +		tqc->compRingSize   = cpu_to_le32(tq->comp_ring.size);
> +		tqc->ddLen          = cpu_to_le32(
> +					sizeof(struct vmxnet3_tx_buf_info) *
> +					tqc->txRingSize);
> +		tqc->intrIdx        = tq->comp_ring.intr_idx;
> +	}
>  
>  	/* rx queue settings */
> -	devRead->misc.numRxQueues = 1;
> -	rqc = &adapter->rqd_start->conf;
> -	rqc->rxRingBasePA[0] = cpu_to_le64(adapter->rx_queue.rx_ring[0].basePA);
> -	rqc->rxRingBasePA[1] = cpu_to_le64(adapter->rx_queue.rx_ring[1].basePA);
> -	rqc->compRingBasePA  = cpu_to_le64(adapter->rx_queue.comp_ring.basePA);
> -	rqc->ddPA            = cpu_to_le64(virt_to_phys(
> -						adapter->rx_queue.buf_info));
> -	rqc->rxRingSize[0]   = cpu_to_le32(adapter->rx_queue.rx_ring[0].size);
> -	rqc->rxRingSize[1]   = cpu_to_le32(adapter->rx_queue.rx_ring[1].size);
> -	rqc->compRingSize    = cpu_to_le32(adapter->rx_queue.comp_ring.size);
> -	rqc->ddLen           = cpu_to_le32(sizeof(struct vmxnet3_rx_buf_info) *
> -			       (rqc->rxRingSize[0] + rqc->rxRingSize[1]));
> -	rqc->intrIdx         = adapter->rx_queue.comp_ring.intr_idx;
> +	devRead->misc.numRxQueues = adapter->num_rx_queues;
> +	for (i = 0; i < adapter->num_rx_queues; i++) {
> +		struct vmxnet3_rx_queue	*rq = &adapter->rx_queue[i];
> +		rqc = &adapter->rqd_start[i].conf;
> +		rqc->rxRingBasePA[0] = cpu_to_le64(rq->rx_ring[0].basePA);
> +		rqc->rxRingBasePA[1] = cpu_to_le64(rq->rx_ring[1].basePA);
> +		rqc->compRingBasePA  = cpu_to_le64(rq->comp_ring.basePA);
> +		rqc->ddPA            = cpu_to_le64(virt_to_phys(
> +							rq->buf_info));
> +		rqc->rxRingSize[0]   = cpu_to_le32(rq->rx_ring[0].size);
> +		rqc->rxRingSize[1]   = cpu_to_le32(rq->rx_ring[1].size);
> +		rqc->compRingSize    = cpu_to_le32(rq->comp_ring.size);
> +		rqc->ddLen           = cpu_to_le32(
> +					sizeof(struct vmxnet3_rx_buf_info) *
> +					(rqc->rxRingSize[0] +
> +					 rqc->rxRingSize[1]));
> +		rqc->intrIdx         = rq->comp_ring.intr_idx;
> +	}
> +
> +#ifdef VMXNET3_RSS
> +	memset(adapter->rss_conf, 0, sizeof(*adapter->rss_conf));
> +
> +	if (adapter->rss) {
> +		struct UPT1_RSSConf *rssConf = adapter->rss_conf;
> +		devRead->misc.uptFeatures |= UPT1_F_RSS;
> +		devRead->misc.numRxQueues = adapter->num_rx_queues;
> +		rssConf->hashType = UPT1_RSS_HASH_TYPE_TCP_IPV4 |
> +				    UPT1_RSS_HASH_TYPE_IPV4 |
> +				    UPT1_RSS_HASH_TYPE_TCP_IPV6 |
> +				    UPT1_RSS_HASH_TYPE_IPV6;
> +		rssConf->hashFunc = UPT1_RSS_HASH_FUNC_TOEPLITZ;
> +		rssConf->hashKeySize = UPT1_RSS_MAX_KEY_SIZE;
> +		rssConf->indTableSize = VMXNET3_RSS_IND_TABLE_SIZE;
> +		get_random_bytes(&rssConf->hashKey[0], rssConf->hashKeySize);
> +		for (i = 0; i < rssConf->indTableSize; i++)
> +			rssConf->indTable[i] = i % adapter->num_rx_queues;
> +
> +		devRead->rssConfDesc.confVer = 1;
> +		devRead->rssConfDesc.confLen = sizeof(*rssConf);
> +		devRead->rssConfDesc.confPA  = virt_to_phys(rssConf);
> +	}
> +
> +#endif /* VMXNET3_RSS */
>  
>  	/* intr settings */
>  	devRead->intrConf.autoMask = adapter->intr.mask_mode ==
> @@ -1831,18 +2180,18 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
>  int
>  vmxnet3_activate_dev(struct vmxnet3_adapter *adapter)
>  {
> -	int err;
> +	int err, i;
>  	u32 ret;
>  
> -	dev_dbg(&adapter->netdev->dev,
> -		"%s: skb_buf_size %d, rx_buf_per_pkt %d, ring sizes"
> -		" %u %u %u\n", adapter->netdev->name, adapter->skb_buf_size,
> -		adapter->rx_buf_per_pkt, adapter->tx_queue.tx_ring.size,
> -		adapter->rx_queue.rx_ring[0].size,
> -		adapter->rx_queue.rx_ring[1].size);
> -
> -	vmxnet3_tq_init(&adapter->tx_queue, adapter);
> -	err = vmxnet3_rq_init(&adapter->rx_queue, adapter);
> +	dev_dbg(&adapter->netdev->dev, "%s: skb_buf_size %d, rx_buf_per_pkt %d,"
> +		" ring sizes %u %u %u\n", adapter->netdev->name,
> +		adapter->skb_buf_size, adapter->rx_buf_per_pkt,
> +		adapter->tx_queue[0].tx_ring.size,
> +		adapter->rx_queue[0].rx_ring[0].size,
> +		adapter->rx_queue[0].rx_ring[1].size);
> +
> +	vmxnet3_tq_init_all(adapter);
> +	err = vmxnet3_rq_init_all(adapter);
>  	if (err) {
>  		printk(KERN_ERR "Failed to init rx queue for %s: error %d\n",
>  		       adapter->netdev->name, err);
> @@ -1872,10 +2221,15 @@ vmxnet3_activate_dev(struct vmxnet3_adapter *adapter)
>  		err = -EINVAL;
>  		goto activate_err;
>  	}
> -	VMXNET3_WRITE_BAR0_REG(adapter, VMXNET3_REG_RXPROD,
> -			       adapter->rx_queue.rx_ring[0].next2fill);
> -	VMXNET3_WRITE_BAR0_REG(adapter, VMXNET3_REG_RXPROD2,
> -			       adapter->rx_queue.rx_ring[1].next2fill);
> +
> +	for (i = 0; i < adapter->num_rx_queues; i++) {
> +		VMXNET3_WRITE_BAR0_REG(adapter, (VMXNET3_REG_RXPROD +
> +				(i * VMXNET3_REG_ALIGN)),
> +				adapter->rx_queue[i].rx_ring[0].next2fill);
> +		VMXNET3_WRITE_BAR0_REG(adapter, (VMXNET3_REG_RXPROD2 +
> +				(i * VMXNET3_REG_ALIGN)),
> +				adapter->rx_queue[i].rx_ring[1].next2fill);
> +	}
>  
>  	/* Apply the rx filter settins last. */
>  	vmxnet3_set_mc(adapter->netdev);
> @@ -1885,8 +2239,8 @@ vmxnet3_activate_dev(struct vmxnet3_adapter *adapter)
>  	 * tx queue if the link is up.
>  	 */
>  	vmxnet3_check_link(adapter, true);
> -
> -	napi_enable(&adapter->napi);
> +	for (i = 0; i < adapter->num_rx_queues; i++)
> +		napi_enable(&adapter->rx_queue[i].napi);
>  	vmxnet3_enable_all_intrs(adapter);
>  	clear_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state);
>  	return 0;
> @@ -1898,7 +2252,7 @@ activate_err:
>  irq_err:
>  rq_err:
>  	/* free up buffers we allocated */
> -	vmxnet3_rq_cleanup(&adapter->rx_queue, adapter);
> +	vmxnet3_rq_cleanup_all(adapter);
>  	return err;
>  }
>  
> @@ -1913,6 +2267,7 @@ vmxnet3_reset_dev(struct vmxnet3_adapter *adapter)
>  int
>  vmxnet3_quiesce_dev(struct vmxnet3_adapter *adapter)
>  {
> +	int i;
>  	if (test_and_set_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state))
>  		return 0;
>  
> @@ -1921,13 +2276,14 @@ vmxnet3_quiesce_dev(struct vmxnet3_adapter *adapter)
>  			       VMXNET3_CMD_QUIESCE_DEV);
>  	vmxnet3_disable_all_intrs(adapter);
>  
> -	napi_disable(&adapter->napi);
> +	for (i = 0; i < adapter->num_rx_queues; i++)
> +		napi_disable(&adapter->rx_queue[i].napi);
>  	netif_tx_disable(adapter->netdev);
>  	adapter->link_speed = 0;
>  	netif_carrier_off(adapter->netdev);
>  
> -	vmxnet3_tq_cleanup(&adapter->tx_queue, adapter);
> -	vmxnet3_rq_cleanup(&adapter->rx_queue, adapter);
> +	vmxnet3_tq_cleanup_all(adapter);
> +	vmxnet3_rq_cleanup_all(adapter);
>  	vmxnet3_free_irqs(adapter);
>  	return 0;
>  }
> @@ -2049,7 +2405,9 @@ vmxnet3_free_pci_resources(struct vmxnet3_adapter *adapter)
>  static void
>  vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
>  {
> -	size_t sz;
> +	size_t sz, i, ring0_size, ring1_size, comp_size;
> +	struct vmxnet3_rx_queue	*rq = &adapter->rx_queue[0];
> +
>  
>  	if (adapter->netdev->mtu <= VMXNET3_MAX_SKB_BUF_SIZE -
>  				    VMXNET3_MAX_ETH_HDR_SIZE) {
> @@ -2071,11 +2429,19 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
>  	 * rx_buf_per_pkt * VMXNET3_RING_SIZE_ALIGN
>  	 */
>  	sz = adapter->rx_buf_per_pkt * VMXNET3_RING_SIZE_ALIGN;
> -	adapter->rx_queue.rx_ring[0].size = (adapter->rx_queue.rx_ring[0].size +
> -					     sz - 1) / sz * sz;
> -	adapter->rx_queue.rx_ring[0].size = min_t(u32,
> -					    adapter->rx_queue.rx_ring[0].size,
> -					    VMXNET3_RX_RING_MAX_SIZE / sz * sz);
> +	ring0_size = adapter->rx_queue[0].rx_ring[0].size;
> +	ring0_size = (ring0_size + sz - 1) / sz * sz;
> +	ring0_size = min_t(u32, rq->rx_ring[0].size, VMXNET3_RX_RING_MAX_SIZE /
> +			   sz * sz);
> +	ring1_size = adapter->rx_queue[0].rx_ring[1].size;
> +	comp_size = ring0_size + ring1_size;
> +
> +	for (i = 0; i < adapter->num_rx_queues; i++) {
> +		rq = &adapter->rx_queue[i];
> +		rq->rx_ring[0].size = ring0_size;
> +		rq->rx_ring[1].size = ring1_size;
> +		rq->comp_ring.size = comp_size;
> +	}
>  }
>  
>  
> @@ -2083,29 +2449,53 @@ int
>  vmxnet3_create_queues(struct vmxnet3_adapter *adapter, u32 tx_ring_size,
>  		      u32 rx_ring_size, u32 rx_ring2_size)
>  {
> -	int err;
> -
> -	adapter->tx_queue.tx_ring.size   = tx_ring_size;
> -	adapter->tx_queue.data_ring.size = tx_ring_size;
> -	adapter->tx_queue.comp_ring.size = tx_ring_size;
> -	adapter->tx_queue.shared = &adapter->tqd_start->ctrl;
> -	adapter->tx_queue.stopped = true;
> -	err = vmxnet3_tq_create(&adapter->tx_queue, adapter);
> -	if (err)
> -		return err;
> +	int err = 0, i;
> +
> +	for (i = 0; i < adapter->num_tx_queues; i++) {
> +		struct vmxnet3_tx_queue	*tq = &adapter->tx_queue[i];
> +		tq->tx_ring.size   = tx_ring_size;
> +		tq->data_ring.size = tx_ring_size;
> +		tq->comp_ring.size = tx_ring_size;
> +		tq->shared = &adapter->tqd_start[i].ctrl;
> +		tq->stopped = true;
> +		tq->adapter = adapter;
> +		tq->qid = i;
> +		err = vmxnet3_tq_create(tq, adapter);
> +		/*
> +		 * Too late to change num_tx_queues. We cannot do away with
> +		 * lesser number of queues than what we asked for
> +		 */
> +		if (err)
> +			goto queue_err;
> +	}
>  
> -	adapter->rx_queue.rx_ring[0].size = rx_ring_size;
> -	adapter->rx_queue.rx_ring[1].size = rx_ring2_size;
> +	adapter->rx_queue[0].rx_ring[0].size = rx_ring_size;
> +	adapter->rx_queue[0].rx_ring[1].size = rx_ring2_size;
>  	vmxnet3_adjust_rx_ring_size(adapter);
> -	adapter->rx_queue.comp_ring.size  = adapter->rx_queue.rx_ring[0].size +
> -					    adapter->rx_queue.rx_ring[1].size;
> -	adapter->rx_queue.qid  = 0;
> -	adapter->rx_queue.qid2 = 1;
> -	adapter->rx_queue.shared = &adapter->rqd_start->ctrl;
> -	err = vmxnet3_rq_create(&adapter->rx_queue, adapter);
> -	if (err)
> -		vmxnet3_tq_destroy(&adapter->tx_queue, adapter);
> -
> +	for (i = 0; i < adapter->num_rx_queues; i++) {
> +		struct vmxnet3_rx_queue *rq = &adapter->rx_queue[i];
> +		/* qid and qid2 for rx queues will be assigned later when num
> +		 * of rx queues is finalized after allocating intrs */
> +		rq->shared = &adapter->rqd_start[i].ctrl;
> +		rq->adapter = adapter;
> +		err = vmxnet3_rq_create(rq, adapter);
> +		if (err) {
> +			if (i == 0) {
> +				printk(KERN_ERR "Could not allocate any rx"
> +				       "queues. Aborting.\n");
> +				goto queue_err;
> +			} else {
> +				printk(KERN_INFO "Number of rx queues changed "
> +				       "to : %d.\n", i);
> +				adapter->num_rx_queues = i;
> +				err = 0;
> +				break;
> +			}
> +		}
> +	}
> +	return err;
> +queue_err:
> +	vmxnet3_tq_destroy_all(adapter);
>  	return err;
>  }
>  
> @@ -2113,11 +2503,12 @@ static int
>  vmxnet3_open(struct net_device *netdev)
>  {
>  	struct vmxnet3_adapter *adapter;
> -	int err;
> +	int err, i;
>  
>  	adapter = netdev_priv(netdev);
>  
> -	spin_lock_init(&adapter->tx_queue.tx_lock);
> +	for (i = 0; i < adapter->num_tx_queues; i++)
> +		spin_lock_init(&adapter->tx_queue[i].tx_lock);
>  
>  	err = vmxnet3_create_queues(adapter, VMXNET3_DEF_TX_RING_SIZE,
>  				    VMXNET3_DEF_RX_RING_SIZE,
> @@ -2132,8 +2523,8 @@ vmxnet3_open(struct net_device *netdev)
>  	return 0;
>  
>  activate_err:
> -	vmxnet3_rq_destroy(&adapter->rx_queue, adapter);
> -	vmxnet3_tq_destroy(&adapter->tx_queue, adapter);
> +	vmxnet3_rq_destroy_all(adapter);
> +	vmxnet3_tq_destroy_all(adapter);
>  queue_err:
>  	return err;
>  }
> @@ -2153,8 +2544,8 @@ vmxnet3_close(struct net_device *netdev)
>  
>  	vmxnet3_quiesce_dev(adapter);
>  
> -	vmxnet3_rq_destroy(&adapter->rx_queue, adapter);
> -	vmxnet3_tq_destroy(&adapter->tx_queue, adapter);
> +	vmxnet3_rq_destroy_all(adapter);
> +	vmxnet3_tq_destroy_all(adapter);
>  
>  	clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
>  
> @@ -2166,6 +2557,8 @@ vmxnet3_close(struct net_device *netdev)
>  void
>  vmxnet3_force_close(struct vmxnet3_adapter *adapter)
>  {
> +	int i;
> +
>  	/*
>  	 * we must clear VMXNET3_STATE_BIT_RESETTING, otherwise
>  	 * vmxnet3_close() will deadlock.
> @@ -2173,7 +2566,8 @@ vmxnet3_force_close(struct vmxnet3_adapter *adapter)
>  	BUG_ON(test_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state));
>  
>  	/* we need to enable NAPI, otherwise dev_close will deadlock */
> -	napi_enable(&adapter->napi);
> +	for (i = 0; i < adapter->num_rx_queues; i++)
> +		napi_enable(&adapter->rx_queue[i].napi);
>  	dev_close(adapter->netdev);
>  }
>  
> @@ -2204,14 +2598,11 @@ vmxnet3_change_mtu(struct net_device *netdev, int new_mtu)
>  		vmxnet3_reset_dev(adapter);
>  
>  		/* we need to re-create the rx queue based on the new mtu */
> -		vmxnet3_rq_destroy(&adapter->rx_queue, adapter);
> +		vmxnet3_rq_destroy_all(adapter);
>  		vmxnet3_adjust_rx_ring_size(adapter);
> -		adapter->rx_queue.comp_ring.size  =
> -					adapter->rx_queue.rx_ring[0].size +
> -					adapter->rx_queue.rx_ring[1].size;
> -		err = vmxnet3_rq_create(&adapter->rx_queue, adapter);
> +		err = vmxnet3_rq_create_all(adapter);
>  		if (err) {
> -			printk(KERN_ERR "%s: failed to re-create rx queue,"
> +			printk(KERN_ERR "%s: failed to re-create rx queues,"
>  				" error %d. Closing it.\n", netdev->name, err);
>  			goto out;
>  		}
> @@ -2276,6 +2667,55 @@ vmxnet3_read_mac_addr(struct vmxnet3_adapter *adapter, u8 *mac)
>  	mac[5] = (tmp >> 8) & 0xff;
>  }
>  
> +#ifdef CONFIG_PCI_MSI
> +
> +/*
> + * Enable MSIx vectors.
> + * Returns :
> + *	0 on successful enabling of required vectors,
> + *	VMXNET3_LINUX_MIN_MSIX_VECT when only minumum number of vectors required
> + *	 could be enabled.
> + *	number of vectors which can be enabled otherwise (this number is smaller
> + *	 than VMXNET3_LINUX_MIN_MSIX_VECT)
> + */
> +
> +static int
> +vmxnet3_acquire_msix_vectors(struct vmxnet3_adapter *adapter,
> +			     int vectors)
> +{
> +	int err = 0, vector_threshold;
> +	vector_threshold = VMXNET3_LINUX_MIN_MSIX_VECT;
> +
> +	while (vectors >= vector_threshold) {
> +		err = pci_enable_msix(adapter->pdev, adapter->intr.msix_entries,
> +				      vectors);
> +		if (!err) {
> +			adapter->intr.num_intrs = vectors;
> +			return 0;
> +		} else if (err < 0) {
> +			printk(KERN_ERR "Failed to enable MSI-X for %s, error"
> +			       " %d\n",	adapter->netdev->name, err);
> +			vectors = 0;
> +		} else if (err < vector_threshold) {
> +			break;
> +		} else {
> +			/* If fails to enable required number of MSI-x vectors
> +			 * try enabling 3 of them. One each for rx, tx and event
> +			 */
> +			vectors = vector_threshold;
> +			printk(KERN_ERR "Failed to enable %d MSI-X for %s, try"
> +			       " %d instead\n", vectors, adapter->netdev->name,
> +			       vector_threshold);
> +		}
> +	}
> +
> +	printk(KERN_INFO "Number of MSI-X interrupts which can be allocatedi"
> +	       " are lower than min threshold required.\n");
> +	return err;
> +}
> +
> +
> +#endif /* CONFIG_PCI_MSI */
>  
>  static void
>  vmxnet3_alloc_intr_resources(struct vmxnet3_adapter *adapter)
> @@ -2295,16 +2735,47 @@ vmxnet3_alloc_intr_resources(struct vmxnet3_adapter *adapter)
>  
>  #ifdef CONFIG_PCI_MSI
>  	if (adapter->intr.type == VMXNET3_IT_MSIX) {
> -		int err;
> -
> -		adapter->intr.msix_entries[0].entry = 0;
> -		err = pci_enable_msix(adapter->pdev, adapter->intr.msix_entries,
> -				      VMXNET3_LINUX_MAX_MSIX_VECT);
> -		if (!err) {
> -			adapter->intr.num_intrs = 1;
> -			adapter->intr.type = VMXNET3_IT_MSIX;
> +		int vector, err = 0;
> +
> +		adapter->intr.num_intrs = (adapter->share_intr ==
> +					   VMXNET3_INTR_TXSHARE) ? 1 :
> +					   adapter->num_tx_queues;
> +		adapter->intr.num_intrs += (adapter->share_intr ==
> +					   VMXNET3_INTR_BUDDYSHARE) ? 0 :
> +					   adapter->num_rx_queues;
> +		adapter->intr.num_intrs += 1;		/* for link event */
> +
> +		adapter->intr.num_intrs = (adapter->intr.num_intrs >
> +					   VMXNET3_LINUX_MIN_MSIX_VECT
> +					   ? adapter->intr.num_intrs :
> +					   VMXNET3_LINUX_MIN_MSIX_VECT);
> +
> +		for (vector = 0; vector < adapter->intr.num_intrs; vector++)
> +			adapter->intr.msix_entries[vector].entry = vector;
> +
> +		err = vmxnet3_acquire_msix_vectors(adapter,
> +						   adapter->intr.num_intrs);
> +		/* If we cannot allocate one MSIx vector per queue
> +		 * then limit the number of rx queues to 1
> +		 */
> +		if (err == VMXNET3_LINUX_MIN_MSIX_VECT) {
> +			if (adapter->share_intr != VMXNET3_INTR_BUDDYSHARE
> +			    || adapter->num_rx_queues != 2) {
> +				adapter->share_intr = VMXNET3_INTR_TXSHARE;
> +				printk(KERN_ERR "Number of rx queues : 1\n");
> +				adapter->num_rx_queues = 1;
> +				adapter->intr.num_intrs =
> +						VMXNET3_LINUX_MIN_MSIX_VECT;
> +			}
>  			return;
>  		}
> +		if (!err)
> +			return;
> +
> +		/* If we cannot allocate MSIx vectors use only one rx queue */
> +		printk(KERN_INFO "Failed to enable MSI-X for %s, error %d."
> +		       "#rx queues : 1, try MSI\n", adapter->netdev->name, err);
> +
>  		adapter->intr.type = VMXNET3_IT_MSI;
>  	}
>  
> @@ -2312,12 +2783,15 @@ vmxnet3_alloc_intr_resources(struct vmxnet3_adapter *adapter)
>  		int err;
>  		err = pci_enable_msi(adapter->pdev);
>  		if (!err) {
> +			adapter->num_rx_queues = 1;
>  			adapter->intr.num_intrs = 1;
>  			return;
>  		}
>  	}
>  #endif /* CONFIG_PCI_MSI */
>  
> +	adapter->num_rx_queues = 1;
> +	printk(KERN_INFO "Using INTx interrupt, #Rx queues: 1.\n");
>  	adapter->intr.type = VMXNET3_IT_INTX;
>  
>  	/* INT-X related setting */
> @@ -2345,6 +2819,7 @@ vmxnet3_tx_timeout(struct net_device *netdev)
>  
>  	printk(KERN_ERR "%s: tx hang\n", adapter->netdev->name);
>  	schedule_work(&adapter->work);
> +	netif_wake_queue(adapter->netdev);
>  }
>  
>  
> @@ -2401,8 +2876,32 @@ vmxnet3_probe_device(struct pci_dev *pdev,
>  	struct net_device *netdev;
>  	struct vmxnet3_adapter *adapter;
>  	u8 mac[ETH_ALEN];
> +	int size;
> +	int num_tx_queues = enable_mq[atomic_read(&devices_found)] == 0 ? 1 : 0;
> +	int num_rx_queues = enable_mq[atomic_read(&devices_found)] == 0 ? 1 : 0;
> +
> +#ifdef VMXNET3_RSS
> +	if (num_rx_queues == 0)
> +		num_rx_queues = min(VMXNET3_DEVICE_MAX_RX_QUEUES,
> +				    (int)num_online_cpus());
> +	else
> +		num_rx_queues = min(VMXNET3_DEVICE_MAX_RX_QUEUES,
> +				    num_rx_queues);
> +#else
> +	num_rx_queues = 1;
> +#endif
> +
> +	if (num_tx_queues <= 0)
> +		num_tx_queues = min(VMXNET3_DEVICE_MAX_TX_QUEUES,
> +				    (int)num_online_cpus());
> +	else
> +		num_tx_queues = min(VMXNET3_DEVICE_MAX_TX_QUEUES,
> +				    num_tx_queues);
> +	netdev = alloc_etherdev_mq(sizeof(struct vmxnet3_adapter),
> +				   num_tx_queues);
> +	printk(KERN_INFO "# of Tx queues : %d, # of Rx queues : %d\n",
> +	       num_tx_queues, num_rx_queues);
>  
> -	netdev = alloc_etherdev(sizeof(struct vmxnet3_adapter));
>  	if (!netdev) {
>  		printk(KERN_ERR "Failed to alloc ethernet device for adapter "
>  			"%s\n",	pci_name(pdev));
> @@ -2424,9 +2923,12 @@ vmxnet3_probe_device(struct pci_dev *pdev,
>  		goto err_alloc_shared;
>  	}
>  
> -	adapter->tqd_start = pci_alloc_consistent(adapter->pdev,
> -			     sizeof(struct Vmxnet3_TxQueueDesc) +
> -			     sizeof(struct Vmxnet3_RxQueueDesc),
> +	adapter->num_rx_queues = num_rx_queues;
> +	adapter->num_tx_queues = num_tx_queues;
> +
> +	size = sizeof(struct Vmxnet3_TxQueueDesc) * adapter->num_tx_queues;
> +	size += sizeof(struct Vmxnet3_RxQueueDesc) * adapter->num_rx_queues;
> +	adapter->tqd_start = pci_alloc_consistent(adapter->pdev, size,
>  			     &adapter->queue_desc_pa);
>  
>  	if (!adapter->tqd_start) {
> @@ -2435,8 +2937,8 @@ vmxnet3_probe_device(struct pci_dev *pdev,
>  		err = -ENOMEM;
>  		goto err_alloc_queue_desc;
>  	}
> -	adapter->rqd_start = (struct Vmxnet3_RxQueueDesc *)(adapter->tqd_start
> -							    + 1);
> +	adapter->rqd_start = (struct Vmxnet3_RxQueueDesc *)(adapter->tqd_start +
> +							adapter->num_tx_queues);
>  
>  	adapter->pm_conf = kmalloc(sizeof(struct Vmxnet3_PMConf), GFP_KERNEL);
>  	if (adapter->pm_conf == NULL) {
> @@ -2446,6 +2948,17 @@ vmxnet3_probe_device(struct pci_dev *pdev,
>  		goto err_alloc_pm;
>  	}
>  
> +#ifdef VMXNET3_RSS
> +
> +	adapter->rss_conf = kmalloc(sizeof(struct UPT1_RSSConf), GFP_KERNEL);
> +	if (adapter->rss_conf == NULL) {
> +		printk(KERN_ERR "Failed to allocate memory for %s\n",
> +		       pci_name(pdev));
> +		err = -ENOMEM;
> +		goto err_alloc_rss;
> +	}
> +#endif /* VMXNET3_RSS */
> +
>  	err = vmxnet3_alloc_pci_resources(adapter, &dma64);
>  	if (err < 0)
>  		goto err_alloc_pci;
> @@ -2473,8 +2986,28 @@ vmxnet3_probe_device(struct pci_dev *pdev,
>  	vmxnet3_declare_features(adapter, dma64);
>  
>  	adapter->dev_number = atomic_read(&devices_found);
> +
> +	/*
> +	 * Sharing intr between corresponding tx and rx queues gets priority
> +	 * over all tx queues sharing an intr. Also, to use buddy interrupts
> +	 * number of tx queues should be same as number of rx queues.
> +	 */
> +	if (irq_share_mode[adapter->dev_number] == VMXNET3_INTR_BUDDYSHARE &&
> +	    adapter->num_tx_queues != adapter->num_rx_queues)
> +		adapter->share_intr = VMXNET3_INTR_DONTSHARE;
> +
>  	vmxnet3_alloc_intr_resources(adapter);
>  
> +#ifdef VMXNET3_RSS
> +	if (adapter->num_rx_queues > 1 &&
> +	    adapter->intr.type == VMXNET3_IT_MSIX) {
> +		adapter->rss = true;
> +		printk(KERN_INFO "RSS is enabled.\n");
> +	} else {
> +		adapter->rss = false;
> +	}
> +#endif
> +
>  	vmxnet3_read_mac_addr(adapter, mac);
>  	memcpy(netdev->dev_addr,  mac, netdev->addr_len);
>  
> @@ -2484,7 +3017,18 @@ vmxnet3_probe_device(struct pci_dev *pdev,
>  
>  	INIT_WORK(&adapter->work, vmxnet3_reset_work);
>  
> -	netif_napi_add(netdev, &adapter->napi, vmxnet3_poll, 64);
> +	if (adapter->intr.type == VMXNET3_IT_MSIX) {
> +		int i;
> +		for (i = 0; i < adapter->num_rx_queues; i++) {
> +			netif_napi_add(adapter->netdev,
> +				       &adapter->rx_queue[i].napi,
> +				       vmxnet3_poll_rx_only, 64);
> +		}
> +	} else {
> +		netif_napi_add(adapter->netdev, &adapter->rx_queue[0].napi,
> +			       vmxnet3_poll, 64);
> +	}
> +
>  	SET_NETDEV_DEV(netdev, &pdev->dev);
>  	err = register_netdev(netdev);
>  
> @@ -2504,11 +3048,14 @@ err_register:
>  err_ver:
>  	vmxnet3_free_pci_resources(adapter);
>  err_alloc_pci:
> +#ifdef VMXNET3_RSS
> +	kfree(adapter->rss_conf);
> +err_alloc_rss:
> +#endif
>  	kfree(adapter->pm_conf);
>  err_alloc_pm:
> -	pci_free_consistent(adapter->pdev, sizeof(struct Vmxnet3_TxQueueDesc) +
> -			    sizeof(struct Vmxnet3_RxQueueDesc),
> -			    adapter->tqd_start, adapter->queue_desc_pa);
> +	pci_free_consistent(adapter->pdev, size, adapter->tqd_start,
> +			    adapter->queue_desc_pa);
>  err_alloc_queue_desc:
>  	pci_free_consistent(adapter->pdev, sizeof(struct Vmxnet3_DriverShared),
>  			    adapter->shared, adapter->shared_pa);
> @@ -2524,6 +3071,19 @@ vmxnet3_remove_device(struct pci_dev *pdev)
>  {
>  	struct net_device *netdev = pci_get_drvdata(pdev);
>  	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> +	int size = 0;
> +	int num_rx_queues = enable_mq[adapter->dev_number] == 0 ? 1 : 0;
> +
> +#ifdef VMXNET3_RSS
> +	if (num_rx_queues <= 0)
> +		num_rx_queues = min(VMXNET3_DEVICE_MAX_RX_QUEUES,
> +				    (int)num_online_cpus());
> +	else
> +		num_rx_queues = min(VMXNET3_DEVICE_MAX_RX_QUEUES,
> +				    num_rx_queues);
> +#else
> +	num_rx_queues = 1;
> +#endif
>  
>  	flush_scheduled_work();
>  
> @@ -2531,10 +3091,15 @@ vmxnet3_remove_device(struct pci_dev *pdev)
>  
>  	vmxnet3_free_intr_resources(adapter);
>  	vmxnet3_free_pci_resources(adapter);
> +#ifdef VMXNET3_RSS
> +	kfree(adapter->rss_conf);
> +#endif
>  	kfree(adapter->pm_conf);
> -	pci_free_consistent(adapter->pdev, sizeof(struct Vmxnet3_TxQueueDesc) +
> -			    sizeof(struct Vmxnet3_RxQueueDesc),
> -			    adapter->tqd_start, adapter->queue_desc_pa);
> +
> +	size = sizeof(struct Vmxnet3_TxQueueDesc) * adapter->num_tx_queues;
> +	size += sizeof(struct Vmxnet3_RxQueueDesc) * num_rx_queues;
> +	pci_free_consistent(adapter->pdev, size, adapter->tqd_start,
> +			    adapter->queue_desc_pa);
>  	pci_free_consistent(adapter->pdev, sizeof(struct Vmxnet3_DriverShared),
>  			    adapter->shared, adapter->shared_pa);
>  	free_netdev(netdev);
> @@ -2565,7 +3130,7 @@ vmxnet3_suspend(struct device *device)
>  	vmxnet3_free_intr_resources(adapter);
>  
>  	netif_device_detach(netdev);
> -	netif_stop_queue(netdev);
> +	netif_tx_stop_all_queues(netdev);
>  
>  	/* Create wake-up filters. */
>  	pmConf = adapter->pm_conf;
> @@ -2710,6 +3275,7 @@ vmxnet3_init_module(void)
>  {
>  	printk(KERN_INFO "%s - version %s\n", VMXNET3_DRIVER_DESC,
>  		VMXNET3_DRIVER_VERSION_REPORT);
> +	atomic_set(&devices_found, 0);
>  	return pci_register_driver(&vmxnet3_driver);
>  }
>  
> @@ -2728,3 +3294,5 @@ MODULE_AUTHOR("VMware, Inc.");
>  MODULE_DESCRIPTION(VMXNET3_DRIVER_DESC);
>  MODULE_LICENSE("GPL v2");
>  MODULE_VERSION(VMXNET3_DRIVER_VERSION_STRING);
> +
> +
> diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> index 7e4b5a8..73c2bf9 100644
> --- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
> +++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> @@ -153,44 +153,42 @@ vmxnet3_get_stats(struct net_device *netdev)
>  	struct UPT1_TxStats *devTxStats;
>  	struct UPT1_RxStats *devRxStats;
>  	struct net_device_stats *net_stats = &netdev->stats;
> +	int i;
>  
>  	adapter = netdev_priv(netdev);
>  
>  	/* Collect the dev stats into the shared area */
>  	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD, VMXNET3_CMD_GET_STATS);
>  
> -	/* Assuming that we have a single queue device */
> -	devTxStats = &adapter->tqd_start->stats;
> -	devRxStats = &adapter->rqd_start->stats;
> -
> -	/* Get access to the driver stats per queue */
> -	drvTxStats = &adapter->tx_queue.stats;
> -	drvRxStats = &adapter->rx_queue.stats;
> -
>  	memset(net_stats, 0, sizeof(*net_stats));
> +	for (i = 0; i < adapter->num_tx_queues; i++) {
> +		devTxStats = &adapter->tqd_start[i].stats;
> +		drvTxStats = &adapter->tx_queue[i].stats;
> +		net_stats->tx_packets += devTxStats->ucastPktsTxOK +
> +					devTxStats->mcastPktsTxOK +
> +					devTxStats->bcastPktsTxOK;
> +		net_stats->tx_bytes += devTxStats->ucastBytesTxOK +
> +				      devTxStats->mcastBytesTxOK +
> +				      devTxStats->bcastBytesTxOK;
> +		net_stats->tx_errors += devTxStats->pktsTxError;
> +		net_stats->tx_dropped += drvTxStats->drop_total;
> +	}
>  
> -	net_stats->rx_packets = devRxStats->ucastPktsRxOK +
> -				devRxStats->mcastPktsRxOK +
> -				devRxStats->bcastPktsRxOK;
> -
> -	net_stats->tx_packets = devTxStats->ucastPktsTxOK +
> -				devTxStats->mcastPktsTxOK +
> -				devTxStats->bcastPktsTxOK;
> -
> -	net_stats->rx_bytes = devRxStats->ucastBytesRxOK +
> -			      devRxStats->mcastBytesRxOK +
> -			      devRxStats->bcastBytesRxOK;
> -
> -	net_stats->tx_bytes = devTxStats->ucastBytesTxOK +
> -			      devTxStats->mcastBytesTxOK +
> -			      devTxStats->bcastBytesTxOK;
> +	for (i = 0; i < adapter->num_rx_queues; i++) {
> +		devRxStats = &adapter->rqd_start[i].stats;
> +		drvRxStats = &adapter->rx_queue[i].stats;
> +		net_stats->rx_packets += devRxStats->ucastPktsRxOK +
> +					devRxStats->mcastPktsRxOK +
> +					devRxStats->bcastPktsRxOK;
>  
> -	net_stats->rx_errors = devRxStats->pktsRxError;
> -	net_stats->tx_errors = devTxStats->pktsTxError;
> -	net_stats->rx_dropped = drvRxStats->drop_total;
> -	net_stats->tx_dropped = drvTxStats->drop_total;
> -	net_stats->multicast =  devRxStats->mcastPktsRxOK;
> +		net_stats->rx_bytes += devRxStats->ucastBytesRxOK +
> +				      devRxStats->mcastBytesRxOK +
> +				      devRxStats->bcastBytesRxOK;
>  
> +		net_stats->rx_errors += devRxStats->pktsRxError;
> +		net_stats->rx_dropped += drvRxStats->drop_total;
> +		net_stats->multicast +=  devRxStats->mcastPktsRxOK;
> +	}
>  	return net_stats;
>  }
>  
> @@ -309,24 +307,26 @@ vmxnet3_get_ethtool_stats(struct net_device *netdev,
>  	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
>  	u8 *base;
>  	int i;
> +	int j = 0;
>  
>  	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD, VMXNET3_CMD_GET_STATS);
>  
>  	/* this does assume each counter is 64-bit wide */
> +/* TODO change this for multiple queues */
>  
> -	base = (u8 *)&adapter->tqd_start->stats;
> +	base = (u8 *)&adapter->tqd_start[j].stats;
>  	for (i = 0; i < ARRAY_SIZE(vmxnet3_tq_dev_stats); i++)
>  		*buf++ = *(u64 *)(base + vmxnet3_tq_dev_stats[i].offset);
>  
> -	base = (u8 *)&adapter->tx_queue.stats;
> +	base = (u8 *)&adapter->tx_queue[j].stats;
>  	for (i = 0; i < ARRAY_SIZE(vmxnet3_tq_driver_stats); i++)
>  		*buf++ = *(u64 *)(base + vmxnet3_tq_driver_stats[i].offset);
>  
> -	base = (u8 *)&adapter->rqd_start->stats;
> +	base = (u8 *)&adapter->rqd_start[j].stats;
>  	for (i = 0; i < ARRAY_SIZE(vmxnet3_rq_dev_stats); i++)
>  		*buf++ = *(u64 *)(base + vmxnet3_rq_dev_stats[i].offset);
>  
> -	base = (u8 *)&adapter->rx_queue.stats;
> +	base = (u8 *)&adapter->rx_queue[j].stats;
>  	for (i = 0; i < ARRAY_SIZE(vmxnet3_rq_driver_stats); i++)
>  		*buf++ = *(u64 *)(base + vmxnet3_rq_driver_stats[i].offset);
>  
> @@ -341,6 +341,7 @@ vmxnet3_get_regs(struct net_device *netdev, struct ethtool_regs *regs, void *p)
>  {
>  	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
>  	u32 *buf = p;
> +	int i = 0;
>  
>  	memset(p, 0, vmxnet3_get_regs_len(netdev));
>  
> @@ -349,28 +350,29 @@ vmxnet3_get_regs(struct net_device *netdev, struct ethtool_regs *regs, void *p)
>  	/* Update vmxnet3_get_regs_len if we want to dump more registers */
>  
>  	/* make each ring use multiple of 16 bytes */
> -	buf[0] = adapter->tx_queue.tx_ring.next2fill;
> -	buf[1] = adapter->tx_queue.tx_ring.next2comp;
> -	buf[2] = adapter->tx_queue.tx_ring.gen;
> +/* TODO change this for multiple queues */
> +	buf[0] = adapter->tx_queue[i].tx_ring.next2fill;
> +	buf[1] = adapter->tx_queue[i].tx_ring.next2comp;
> +	buf[2] = adapter->tx_queue[i].tx_ring.gen;
>  	buf[3] = 0;
>  
> -	buf[4] = adapter->tx_queue.comp_ring.next2proc;
> -	buf[5] = adapter->tx_queue.comp_ring.gen;
> -	buf[6] = adapter->tx_queue.stopped;
> +	buf[4] = adapter->tx_queue[i].comp_ring.next2proc;
> +	buf[5] = adapter->tx_queue[i].comp_ring.gen;
> +	buf[6] = adapter->tx_queue[i].stopped;
>  	buf[7] = 0;
>  
> -	buf[8] = adapter->rx_queue.rx_ring[0].next2fill;
> -	buf[9] = adapter->rx_queue.rx_ring[0].next2comp;
> -	buf[10] = adapter->rx_queue.rx_ring[0].gen;
> +	buf[8] = adapter->rx_queue[i].rx_ring[0].next2fill;
> +	buf[9] = adapter->rx_queue[i].rx_ring[0].next2comp;
> +	buf[10] = adapter->rx_queue[i].rx_ring[0].gen;
>  	buf[11] = 0;
>  
> -	buf[12] = adapter->rx_queue.rx_ring[1].next2fill;
> -	buf[13] = adapter->rx_queue.rx_ring[1].next2comp;
> -	buf[14] = adapter->rx_queue.rx_ring[1].gen;
> +	buf[12] = adapter->rx_queue[i].rx_ring[1].next2fill;
> +	buf[13] = adapter->rx_queue[i].rx_ring[1].next2comp;
> +	buf[14] = adapter->rx_queue[i].rx_ring[1].gen;
>  	buf[15] = 0;
>  
> -	buf[16] = adapter->rx_queue.comp_ring.next2proc;
> -	buf[17] = adapter->rx_queue.comp_ring.gen;
> +	buf[16] = adapter->rx_queue[i].comp_ring.next2proc;
> +	buf[17] = adapter->rx_queue[i].comp_ring.gen;
>  	buf[18] = 0;
>  	buf[19] = 0;
>  }
> @@ -437,8 +439,10 @@ vmxnet3_get_ringparam(struct net_device *netdev,
>  	param->rx_mini_max_pending = 0;
>  	param->rx_jumbo_max_pending = 0;
>  
> -	param->rx_pending = adapter->rx_queue.rx_ring[0].size;
> -	param->tx_pending = adapter->tx_queue.tx_ring.size;
> +	param->rx_pending = adapter->rx_queue[0].rx_ring[0].size *
> +			    adapter->num_rx_queues;
> +	param->tx_pending = adapter->tx_queue[0].tx_ring.size *
> +			    adapter->num_tx_queues;
>  	param->rx_mini_pending = 0;
>  	param->rx_jumbo_pending = 0;
>  }
> @@ -482,8 +486,8 @@ vmxnet3_set_ringparam(struct net_device *netdev,
>  							   sz) != 0)
>  		return -EINVAL;
>  
> -	if (new_tx_ring_size == adapter->tx_queue.tx_ring.size &&
> -			new_rx_ring_size == adapter->rx_queue.rx_ring[0].size) {
> +	if (new_tx_ring_size == adapter->tx_queue[0].tx_ring.size &&
> +	    new_rx_ring_size == adapter->rx_queue[0].rx_ring[0].size) {
>  		return 0;
>  	}
>  
> @@ -500,11 +504,12 @@ vmxnet3_set_ringparam(struct net_device *netdev,
>  
>  		/* recreate the rx queue and the tx queue based on the
>  		 * new sizes */
> -		vmxnet3_tq_destroy(&adapter->tx_queue, adapter);
> -		vmxnet3_rq_destroy(&adapter->rx_queue, adapter);
> +		vmxnet3_tq_destroy_all(adapter);
> +		vmxnet3_rq_destroy_all(adapter);
>  
>  		err = vmxnet3_create_queues(adapter, new_tx_ring_size,
>  			new_rx_ring_size, VMXNET3_DEF_RX_RING_SIZE);
> +
>  		if (err) {
>  			/* failed, most likely because of OOM, try default
>  			 * size */
> @@ -537,6 +542,59 @@ out:
>  }
>  
>  
> +static int
> +vmxnet3_get_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *info,
> +		  void *rules)
> +{
> +	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> +	switch (info->cmd) {
> +	case ETHTOOL_GRXRINGS:
> +		info->data = adapter->num_rx_queues;
> +		return 0;
> +	}
> +	return -EOPNOTSUPP;
> +}
> +
> +
> +static int
> +vmxnet3_get_rss_indir(struct net_device *netdev,
> +		      struct ethtool_rxfh_indir *p)
> +{
> +	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> +	struct UPT1_RSSConf *rssConf = adapter->rss_conf;
> +	unsigned int n = min_t(unsigned int, p->size, rssConf->indTableSize);
> +
> +	p->size = rssConf->indTableSize;
> +	while (n--)
> +		p->ring_index[n] = rssConf->indTable[n];
> +	return 0;
> +
> +}
> +
> +static int
> +vmxnet3_set_rss_indir(struct net_device *netdev,
> +		      const struct ethtool_rxfh_indir *p)
> +{
> +	unsigned int i;
> +	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> +	struct UPT1_RSSConf *rssConf = adapter->rss_conf;
> +
> +	if (p->size != rssConf->indTableSize)
> +		return -EINVAL;
> +	for (i = 0; i < rssConf->indTableSize; i++) {
> +		if (p->ring_index[i] >= 0 && p->ring_index[i] <
> +		    adapter->num_rx_queues)
> +			rssConf->indTable[i] = p->ring_index[i];
> +		else
> +			rssConf->indTable[i] = i % adapter->num_rx_queues;
> +	}
> +	VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD,
> +			       VMXNET3_CMD_UPDATE_RSSIDT);
> +
> +	return 0;
> +
> +}
> +
>  static struct ethtool_ops vmxnet3_ethtool_ops = {
>  	.get_settings      = vmxnet3_get_settings,
>  	.get_drvinfo       = vmxnet3_get_drvinfo,
> @@ -560,6 +618,9 @@ static struct ethtool_ops vmxnet3_ethtool_ops = {
>  	.get_ethtool_stats = vmxnet3_get_ethtool_stats,
>  	.get_ringparam     = vmxnet3_get_ringparam,
>  	.set_ringparam     = vmxnet3_set_ringparam,
> +	.get_rxnfc         = vmxnet3_get_rxnfc,
> +	.get_rxfh_indir    = vmxnet3_get_rss_indir,
> +	.set_rxfh_indir    = vmxnet3_set_rss_indir,
>  };
>  
>  void vmxnet3_set_ethtool_ops(struct net_device *netdev)
> diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
> index c88ea5c..2332b1f 100644
> --- a/drivers/net/vmxnet3/vmxnet3_int.h
> +++ b/drivers/net/vmxnet3/vmxnet3_int.h
> @@ -68,11 +68,15 @@
>  /*
>   * Version numbers
>   */
> -#define VMXNET3_DRIVER_VERSION_STRING   "1.0.14.0-k"
> +#define VMXNET3_DRIVER_VERSION_STRING   "1.0.16.0-k"
>  
>  /* a 32-bit int, each byte encode a verion number in VMXNET3_DRIVER_VERSION */
> -#define VMXNET3_DRIVER_VERSION_NUM      0x01000E00
> +#define VMXNET3_DRIVER_VERSION_NUM      0x01001000
>  
> +#if defined(CONFIG_PCI_MSI)
> +	/* RSS only makes sense if MSI-X is supported. */
> +	#define VMXNET3_RSS
> +#endif
>  
>  /*
>   * Capabilities
> @@ -218,16 +222,19 @@ struct vmxnet3_tx_ctx {
>  };
>  
>  struct vmxnet3_tx_queue {
> +	char			name[IFNAMSIZ+8]; /* To identify interrupt */
> +	struct vmxnet3_adapter		*adapter;
>  	spinlock_t                      tx_lock;
>  	struct vmxnet3_cmd_ring         tx_ring;
> -	struct vmxnet3_tx_buf_info     *buf_info;
> +	struct vmxnet3_tx_buf_info      *buf_info;
>  	struct vmxnet3_tx_data_ring     data_ring;
>  	struct vmxnet3_comp_ring        comp_ring;
> -	struct Vmxnet3_TxQueueCtrl            *shared;
> +	struct Vmxnet3_TxQueueCtrl      *shared;
>  	struct vmxnet3_tq_driver_stats  stats;
>  	bool                            stopped;
>  	int                             num_stop;  /* # of times the queue is
>  						    * stopped */
> +	int				qid;
>  } __attribute__((__aligned__(SMP_CACHE_BYTES)));
>  
>  enum vmxnet3_rx_buf_type {
> @@ -259,6 +266,9 @@ struct vmxnet3_rq_driver_stats {
>  };
>  
>  struct vmxnet3_rx_queue {
> +	char			name[IFNAMSIZ + 8]; /* To identify interrupt */
> +	struct vmxnet3_adapter	  *adapter;
> +	struct napi_struct        napi;
>  	struct vmxnet3_cmd_ring   rx_ring[2];
>  	struct vmxnet3_comp_ring  comp_ring;
>  	struct vmxnet3_rx_ctx     rx_ctx;
> @@ -271,7 +281,16 @@ struct vmxnet3_rx_queue {
>  	struct vmxnet3_rq_driver_stats  stats;
>  } __attribute__((__aligned__(SMP_CACHE_BYTES)));
>  
> -#define VMXNET3_LINUX_MAX_MSIX_VECT     1
> +#define VMXNET3_DEVICE_MAX_TX_QUEUES 8
> +#define VMXNET3_DEVICE_MAX_RX_QUEUES 8   /* Keep this value as a power of 2 */
> +
> +/* Should be less than UPT1_RSS_MAX_IND_TABLE_SIZE */
> +#define VMXNET3_RSS_IND_TABLE_SIZE (VMXNET3_DEVICE_MAX_RX_QUEUES * 4)
> +
> +#define VMXNET3_LINUX_MAX_MSIX_VECT     (VMXNET3_DEVICE_MAX_TX_QUEUES + \
> +					 VMXNET3_DEVICE_MAX_RX_QUEUES + 1)
> +#define VMXNET3_LINUX_MIN_MSIX_VECT     3    /* 1 for each : tx, rx and event */
> +
>  
>  struct vmxnet3_intr {
>  	enum vmxnet3_intr_mask_mode  mask_mode;
> @@ -279,28 +298,32 @@ struct vmxnet3_intr {
>  	u8  num_intrs;			/* # of intr vectors */
>  	u8  event_intr_idx;		/* idx of the intr vector for event */
>  	u8  mod_levels[VMXNET3_LINUX_MAX_MSIX_VECT]; /* moderation level */
> +	char	event_msi_vector_name[IFNAMSIZ+11];
>  #ifdef CONFIG_PCI_MSI
>  	struct msix_entry msix_entries[VMXNET3_LINUX_MAX_MSIX_VECT];
>  #endif
>  };
>  
> +/* Interrupt sharing schemes, share_intr */
> +#define VMXNET3_INTR_DONTSHARE 0     /* each queue has its own irq */
> +#define VMXNET3_INTR_TXSHARE 1	     /* All tx queues share one irq */
> +#define VMXNET3_INTR_BUDDYSHARE 2    /* Corresponding tx,rx queues share irq */
> +
>  #define VMXNET3_STATE_BIT_RESETTING   0
>  #define VMXNET3_STATE_BIT_QUIESCED    1
> -struct vmxnet3_adapter {
> -	struct vmxnet3_tx_queue         tx_queue;
> -	struct vmxnet3_rx_queue         rx_queue;
> -	struct napi_struct              napi;
> -	struct vlan_group              *vlan_grp;
> -
> -	struct vmxnet3_intr             intr;
> -
> -	struct Vmxnet3_DriverShared    *shared;
> -	struct Vmxnet3_PMConf          *pm_conf;
> -	struct Vmxnet3_TxQueueDesc     *tqd_start;     /* first tx queue desc */
> -	struct Vmxnet3_RxQueueDesc     *rqd_start;     /* first rx queue desc */
> -	struct net_device              *netdev;
> -	struct pci_dev                 *pdev;
>  
> +struct vmxnet3_adapter {
> +	struct vmxnet3_tx_queue		tx_queue[VMXNET3_DEVICE_MAX_TX_QUEUES];
> +	struct vmxnet3_rx_queue		rx_queue[VMXNET3_DEVICE_MAX_RX_QUEUES];
> +	struct vlan_group		*vlan_grp;
> +	struct vmxnet3_intr		intr;
> +	struct Vmxnet3_DriverShared	*shared;
> +	struct Vmxnet3_PMConf		*pm_conf;
> +	struct Vmxnet3_TxQueueDesc	*tqd_start;     /* all tx queue desc */
> +	struct Vmxnet3_RxQueueDesc	*rqd_start;	/* all rx queue desc */
> +	struct net_device		*netdev;
> +	struct net_device_stats		net_stats;
> +	struct pci_dev			*pdev;
>  	u8				*hw_addr0; /* for BAR 0 */
>  	u8				*hw_addr1; /* for BAR 1 */
>  
> @@ -308,6 +331,12 @@ struct vmxnet3_adapter {
>  	bool				rxcsum;
>  	bool				lro;
>  	bool				jumbo_frame;
> +#ifdef VMXNET3_RSS
> +	struct UPT1_RSSConf		*rss_conf;
> +	bool				rss;
> +#endif
> +	u32				num_rx_queues;
> +	u32				num_tx_queues;
>  
>  	/* rx buffer related */
>  	unsigned			skb_buf_size;
> @@ -327,6 +356,7 @@ struct vmxnet3_adapter {
>  	unsigned long  state;    /* VMXNET3_STATE_BIT_xxx */
>  
>  	int dev_number;
> +	int share_intr;
>  };
>  
>  #define VMXNET3_WRITE_BAR0_REG(adapter, reg, val)  \
> @@ -381,12 +411,10 @@ void
>  vmxnet3_reset_dev(struct vmxnet3_adapter *adapter);
>  
>  void
> -vmxnet3_tq_destroy(struct vmxnet3_tx_queue *tq,
> -		   struct vmxnet3_adapter *adapter);
> +vmxnet3_tq_destroy_all(struct vmxnet3_adapter *adapter);
>  
>  void
> -vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
> -		   struct vmxnet3_adapter *adapter);
> +vmxnet3_rq_destroy_all(struct vmxnet3_adapter *adapter);
>  
>  int
>  vmxnet3_create_queues(struct vmxnet3_adapter *adapter,
> 

^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Eric Dumazet @ 2010-11-10 22:21 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <4CDB1021.507@candelatech.com>

Le mercredi 10 novembre 2010 à 13:35 -0800, Ben Greear a écrit :

> So an application that must deal with wraps must poll at the minimal
> time interval for wrapping 32-bit counters at whatever speed, or it
> must pay attention to the driver to somehow know that this magic driver
> can *really* do 64-bit stats properly?
> 

Are you aware that you speak of something that is not specified at all
in linux ?

Frequency of polling is not part of any RFC. This usually is tunable in
the _application_. Some people sample stats every 5 minutes, some sample
every second, and hit the "xxx driver updates its stats every two
seconds, this sucks"

I wrote SNMP apps based on /proc/net/dev and all just work, with any
versions, any driver. Of course, some of them broke 6 years ago because
they were 32bit legacy application, running on a 64bit kernel. I never
asked David to change /proc/net/dev to cap counters to 32bit.

When 128bit cpu come, some userland changes are needed to parse 128bit
numbers.

In anycase, apps dont have to know a particular driver provides 64bit or
32bit counter. Only choice for them is to automatically detect the
wraparound, because they fetch a STRING, not a Counter32 or Counter64

This works for all drivers, legacy, new, Intel or whatever. If a driver
changes from 32 to 64, nothing special happens in /proc/net/dev.

RRD for example handles this just fine.

> Please note that just because a counter is less than the previous read,
> that doesn't by itself tell us if it wrapped once or twice.  And, if we
> don't know at which number of bits it wraps, then we don't know how many
> to add even if we are certain it wrapped only once.
> 

I repeat : Nothing in /proc/net/dev can tell you when a counter will
wrap (the counter width).

You also need to use the correct polling frequency, depending on max
speed. It was already the case with 32bit counters, 64bit ones only gave
some extra range.

> In general, I want to treat eth0 the same as eth5, and not worry that one
> is 10/100 realtek and the other a 10G Intel.
> 

> If netlink reports stats64, then those should only wrap at 64 bits,
> and if it reports stats32, then wrap at 32-bits.
> 

I believe you are mistaken. We provide stats64 for all drivers, even
32bit legacy ones. rtnetlink has no way to report counter widths,
because nobody cared.

^ permalink raw reply

* can-bcm: fix minor heap overflow
From: Oliver Hartkopp @ 2010-11-10 22:10 UTC (permalink / raw)
  To: David Miller
  Cc: Linux Netdev List, Dan Rosenberg, Linus Torvalds, Urs Thuermann,
	security

On 64-bit platforms the ASCII representation of a pointer may be up to 17
bytes long. This patch increases the length of the buffer accordingly.

http://marc.info/?l=linux-netdev&m=128872251418192&w=2

Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
CC: Linus Torvalds <torvalds@linux-foundation.org>

---

diff --git a/net/can/bcm.c b/net/can/bcm.c
index 08ffe9e..6faa825 100644
--- a/net/can/bcm.c
+++ b/net/can/bcm.c
@@ -125,7 +125,7 @@ struct bcm_sock {
 	struct list_head tx_ops;
 	unsigned long dropped_usr_msgs;
 	struct proc_dir_entry *bcm_proc_read;
-	char procname [9]; /* pointer printed in ASCII with \0 */
+	char procname [20]; /* pointer printed in ASCII with \0 */
 };

 static inline struct bcm_sock *bcm_sk(const struct sock *sk)


^ permalink raw reply related

* Re: [PATCH] Fix CAN info leak/minor heap overflow
From: Oliver Hartkopp @ 2010-11-10 22:10 UTC (permalink / raw)
  To: David Miller; +Cc: urs, netdev, drosenberg, security, torvalds
In-Reply-To: <20101110.095141.226780406.davem@davemloft.net>

On 10.11.2010 18:51, David Miller wrote:
> From: Oliver Hartkopp <socketcan@hartkopp.net>
> Date: Wed, 10 Nov 2010 07:52:27 +0100
> 
>> IMHO the patch improves the historic situation and fixes the useless leakage
>> of kernel addresses. Please consider to apply that procfs changes.
> 
> I'm only fine with fixing the kernel pointer fields in some way.
> 
> But moving forward any other change to the procfs file is simply
> a waste of time.
> 
> You should create sysfs files and add logic to your tools to look
> for them and use them if they exist.
> 
> Your forward path _SHOULD NOT_ be continuing this procfs versioning
> madness.  Use something sane and do the work to make userland start
> to be ready for this transition.

Hm, summarizing the given restrictions and taking into account that just
setting the pointer fields to '0' is said to be annoying, the only thing that
can be fixed is the minor heap overflow caused by the char array. I'll send a
patch for that.

As you don't want to change the layout even if there's no tool relying on the
entries i wanted to modify, i'll just stop my attempts to improve it.

Regards,
Oliver



^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Ben Hutchings @ 2010-11-10 21:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Patrick McHardy, netdev
In-Reply-To: <1289423555.17691.8.camel@edumazet-laptop>

On Wed, 2010-11-10 at 22:12 +0100, Eric Dumazet wrote:
> Le mercredi 10 novembre 2010 à 21:04 +0000, Ben Hutchings a écrit :
> 
> > Drivers should calculate differences and accumulate them in a 64-bit
> > counter.  (A lot of hardware has read-to-clear counters anyway, in which
> > case the driver *has* to accumulate the values it reads.)
> 
> You are mistaken. These are _hardware_ counters. If they were software,
> of course they would be 32 or 64 bit.

Of course I understood that.

> And doing the thing you describe in software is racy.
> I tried to remove many races, not to add new ones.

If you do it in ndo_get_stats{,64} and don't use your own lock, yes.

> Yes, some drivers read one hardware counter using two instructions, and
> this is racy.

Most hardware which supports MMIO to multi-word counters has some kind
of latching scheme where you can read the words/registers in order and
get consistent values (within a single counter; consistency between
counters is another matter).  Obviously you have to use a lock to
serialise stats updates in this case - whether or not you maintain wider
software counters.

Oh, another problem with using register values directly is that
statistics are likely to be reset whenever the device is reconfigured in
the way that requires a hardware reset.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Ben Greear @ 2010-11-10 21:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <1289421187.2469.127.camel@edumazet-laptop>

On 11/10/2010 12:33 PM, Eric Dumazet wrote:
> Le mercredi 10 novembre 2010 à 10:40 -0800, Ben Greear a écrit :
>
>> In my opinion, the kernel and/or driver should deal with this such that
>> at worst the user has to deal with 32 v/s 64 bits based on whether the
>> kernel is compiled for 32 or 64 bit CPUs.  (Let the driver sample at
>> intervals needed to never wrap it's counters more than once and update
>> software stats of well-defined bit-width, and present those software
>> counters to users.
>>
>
> How so ? Are you willing to provide patches for all network drivers ?

I'm willing to attempt to fix something that I use and can test.

Either way, I think it's legitimate to document at least the desired
behaviour so that driver writers know what to aim for.

>> In practice, this seems to be the case, at least for the NICs I've used
>> (mostly Intel).  But, please don't propagate the idea that any width of
>> counters is OK to present to user-space:  It is completely unfair to
>> make app writers have to know the network driver and/or hardware quirks to
>> know how often it must sample stats.
>>
>
> I am sorry Ben, but /proc/net/dev doesnt publish each counter effective
> width. Its unfair, but its like that.
>
> An appplication must be able to cope for wrap arounds, running on a 32
> or 64bit kernel. Our duty is to provide 64bit counters for high speed
> interfaces where possible.
> For a 10Mb adapter, there is no need, since a 32bit counter doesnt wrap
> in less than one hour (RFC1902 suggestion)

So an application that must deal with wraps must poll at the minimal
time interval for wrapping 32-bit counters at whatever speed, or it
must pay attention to the driver to somehow know that this magic driver
can *really* do 64-bit stats properly?

Please note that just because a counter is less than the previous read,
that doesn't by itself tell us if it wrapped once or twice.  And, if we
don't know at which number of bits it wraps, then we don't know how many
to add even if we are certain it wrapped only once.

In general, I want to treat eth0 the same as eth5, and not worry that one
is 10/100 realtek and the other a 10G Intel.

If netlink reports stats64, then those should only wrap at 64 bits,
and if it reports stats32, then wrap at 32-bits.

> As I said, many drivers counters are not 32bit or 64bit. I did many
> driver get_stats() checks lately...
>
> Why should we cap them to 32bit if they really are 36 or 40 bits ?
>
>
>> Well, maybe using u32 would have positive benefits on 64-bit kernels then?
>>
>
> But we want to handle 40/100Gbps devices, and keep SNMP apps happy.
>
> We really need 64bit for them, and MACVLAN might be used on top of such
> devices.
>
> Or are you suggesting using u32 instead of "unsigned long" for
> rx_errors/tx_dropped ?
>
> This would indeed save 8 bytes per cpu per macvlan.

Yes, that was what I was trying to suggest.  I'm all for 64-bit numbers
in anything that can wrap anytime soon, and anywhere you think 32-bits
is enough, just use u32 so we don't have to worry about the number of
bits in 'unsigned long' on different platforms.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH] SUNRPC: Simplify rpc_alloc_iostats by removing pointless local variable
From: Jesper Juhl @ 2010-11-10 21:32 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	J. Bruce Fields, Neil Brown, Trond Myklebust, David S. Miller,
	Andrew Morton
In-Reply-To: <alpine.LNX.2.00.1011072205370.26247-h2p7t3/P30RzeRGmFJ5qR7ZzlVVXadcDXqFh9Ls21Oc@public.gmane.org>

On Sun, 7 Nov 2010, Jesper Juhl wrote:

> Hi,
> 
> We can simplify net/sunrpc/stats.c::rpc_alloc_iostats() a bit by getting 
> rid of the unneeded local variable 'new'.
> 
> 
> Please CC me on replies.
> 
> 
> Signed-off-by: Jesper Juhl <jj-IYz4IdjRLj0sV2N9l4h3zg@public.gmane.org>
> ---
>  stats.c |    4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git 
> a/net/sunrpc/stats.c b/net/sunrpc/stats.c
> index f71a731..80df89d 100644
> --- a/net/sunrpc/stats.c
> +++ b/net/sunrpc/stats.c
> @@ -115,9 +115,7 @@ EXPORT_SYMBOL_GPL(svc_seq_show);
>   */
>  struct rpc_iostats *rpc_alloc_iostats(struct rpc_clnt *clnt)
>  {
> -	struct rpc_iostats *new;
> -	new = kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
> -	return new;
> +	return kcalloc(clnt->cl_maxproc, sizeof(struct rpc_iostats), GFP_KERNEL);
>  }
>  EXPORT_SYMBOL_GPL(rpc_alloc_iostats);
>  
> 
> 

Ok, no response to this for a couple of days.
Is there some problem or did it just get missed?
Could someone merge this and push it up-stream, please, if there are no 
problems with it...
 

-- 
Jesper Juhl <jj-IYz4IdjRLj0sV2N9l4h3zg@public.gmane.org>             http://www.chaosbits.net/
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] Prevent reading uninitialized memory with socketfilters
From: Ben Hutchings @ 2010-11-10 21:25 UTC (permalink / raw)
  To: David Miller; +Cc: penguin-kernel, eric.dumazet, netdev
In-Reply-To: <20101110.125929.245406622.davem@davemloft.net>

On Wed, 2010-11-10 at 12:59 -0800, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Wed, 10 Nov 2010 20:57:44 +0000
> 
> > On Wed, 2010-11-10 at 10:39 -0800, David Miller wrote:
> > [...]
> >> In this patch, I use a bitmap (a single long var) so that only filters
> >> using mem[] loads/stores pay the price of added security checks.
> >> 
> >> For other filters, additional cost is a single instruction.
> >> 
> >> [ Since we access fentry->k a lot now, cache it in a local variable
> >>   and mark filter entry pointer as const. -DaveM ]
> > [...]
> > 
> > I don't see the justification for combining these changes.  One patch,
> > one fix, right?
> 
> I'm minimizing the performance impact of the new bitmap checks.

This seems like an entirely separate optimisation, since fentry->k was
*already* being used all over the place.  (And a smart compiler should
optimise that anyway... though I realise gcc is often not that smart.)

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Eric Dumazet @ 2010-11-10 21:12 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Ben Greear, David Miller, Patrick McHardy, netdev
In-Reply-To: <1289423056.2249.7.camel@achroite.uk.solarflarecom.com>

Le mercredi 10 novembre 2010 à 21:04 +0000, Ben Hutchings a écrit :

> Drivers should calculate differences and accumulate them in a 64-bit
> counter.  (A lot of hardware has read-to-clear counters anyway, in which
> case the driver *has* to accumulate the values it reads.)

You are mistaken. These are _hardware_ counters. If they were software,
of course they would be 32 or 64 bit.

And doing the thing you describe in software is racy.
I tried to remove many races, not to add new ones.

Yes, some drivers read one hardware counter using two instructions, and
this is racy.

^ permalink raw reply

* Re: [PATCH] macvlan: lockless tx path
From: Ben Hutchings @ 2010-11-10 21:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Greear, David Miller, Patrick McHardy, netdev
In-Reply-To: <1289421187.2469.127.camel@edumazet-laptop>

On Wed, 2010-11-10 at 21:33 +0100, Eric Dumazet wrote:
> Le mercredi 10 novembre 2010 à 10:40 -0800, Ben Greear a écrit :
> 
> > In my opinion, the kernel and/or driver should deal with this such that
> > at worst the user has to deal with 32 v/s 64 bits based on whether the
> > kernel is compiled for 32 or 64 bit CPUs.  (Let the driver sample at
> > intervals needed to never wrap it's counters more than once and update
> > software stats of well-defined bit-width, and present those software
> > counters to users.
> > 
> 
> How so ? Are you willing to provide patches for all network drivers ?
> 
> > In practice, this seems to be the case, at least for the NICs I've used
> > (mostly Intel).  But, please don't propagate the idea that any width of
> > counters is OK to present to user-space:  It is completely unfair to
> > make app writers have to know the network driver and/or hardware quirks to
> > know how often it must sample stats.
> > 
> 
> I am sorry Ben, but /proc/net/dev doesnt publish each counter effective
> width. Its unfair, but its like that.
> 
> An appplication must be able to cope for wrap arounds, running on a 32
> or 64bit kernel. Our duty is to provide 64bit counters for high speed
> interfaces where possible.
> For a 10Mb adapter, there is no need, since a 32bit counter doesnt wrap
> in less than one hour (RFC1902 suggestion)
> 
> As I said, many drivers counters are not 32bit or 64bit. I did many
> driver get_stats() checks lately...
> 
> Why should we cap them to 32bit if they really are 36 or 40 bits ? 
[...]

Drivers should calculate differences and accumulate them in a 64-bit
counter.  (A lot of hardware has read-to-clear counters anyway, in which
case the driver *has* to accumulate the values it reads.)

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH 0/2] net: Changes in queue allocation and freeing
From: David Miller @ 2010-11-10 21:00 UTC (permalink / raw)
  To: therbert; +Cc: eric.dumazet, netdev
In-Reply-To: <AANLkTi=5rxQ6mmnhR7=_dh5u08raaoLiYo=6Gzj9CbL5@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Wed, 10 Nov 2010 08:27:54 -0800

> Also I noticed that the comment about RX queues refcnts is no longer
> valid.  I can respin patch if necessary.

Not necessary, when I apply your patch I'll integrate this comment
removal.

Thanks.

^ permalink raw reply

* Re: [PATCH] Prevent reading uninitialized memory with socketfilters
From: David Miller @ 2010-11-10 20:59 UTC (permalink / raw)
  To: bhutchings; +Cc: penguin-kernel, eric.dumazet, netdev
In-Reply-To: <1289422664.2249.1.camel@achroite.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Wed, 10 Nov 2010 20:57:44 +0000

> On Wed, 2010-11-10 at 10:39 -0800, David Miller wrote:
> [...]
>> In this patch, I use a bitmap (a single long var) so that only filters
>> using mem[] loads/stores pay the price of added security checks.
>> 
>> For other filters, additional cost is a single instruction.
>> 
>> [ Since we access fentry->k a lot now, cache it in a local variable
>>   and mark filter entry pointer as const. -DaveM ]
> [...]
> 
> I don't see the justification for combining these changes.  One patch,
> one fix, right?

I'm minimizing the performance impact of the new bitmap checks.

^ permalink raw reply

* Re: [PATCH] Prevent reading uninitialized memory with socketfilters
From: Ben Hutchings @ 2010-11-10 20:57 UTC (permalink / raw)
  To: David Miller; +Cc: penguin-kernel, eric.dumazet, netdev
In-Reply-To: <20101110.103923.59670339.davem@davemloft.net>

On Wed, 2010-11-10 at 10:39 -0800, David Miller wrote:
[...]
> In this patch, I use a bitmap (a single long var) so that only filters
> using mem[] loads/stores pay the price of added security checks.
> 
> For other filters, additional cost is a single instruction.
> 
> [ Since we access fentry->k a lot now, cache it in a local variable
>   and mark filter entry pointer as const. -DaveM ]
[...]

I don't see the justification for combining these changes.  One patch,
one fix, right?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] Fix header size check for GSO case in recvmsg (af_packet)
From: David Miller @ 2010-11-10 20:53 UTC (permalink / raw)
  To: mk; +Cc: sri, netdev, linux-kernel
In-Reply-To: <1289253525-7020-1-git-send-email-mk@lab.zgora.pl>

From: Mariusz Kozlowski <mk@lab.zgora.pl>
Date: Mon,  8 Nov 2010 22:58:45 +0100

> Parameter 'len' is size_t type so it will never get negative.
> 
> Signed-off-by: Mariusz Kozlowski <mk@lab.zgora.pl>

Applied, thank you!

^ permalink raw reply

* [PATCH net-next-2.6] net: net_families __rcu annotations
From: Eric Dumazet @ 2010-11-10 20:50 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Use modern RCU API / annotations for net_families array.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/socket.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 3ca2fd9..c898df7 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -156,7 +156,7 @@ static const struct file_operations socket_file_ops = {
  */
 
 static DEFINE_SPINLOCK(net_family_lock);
-static const struct net_proto_family *net_families[NPROTO] __read_mostly;
+static const struct net_proto_family __rcu *net_families[NPROTO] __read_mostly;
 
 /*
  *	Statistics counters of the socket lists
@@ -1200,7 +1200,7 @@ int __sock_create(struct net *net, int family, int type, int protocol,
 	 * requested real, full-featured networking support upon configuration.
 	 * Otherwise module support will break!
 	 */
-	if (net_families[family] == NULL)
+	if (rcu_access_pointer(net_families[family]) == NULL)
 		request_module("net-pf-%d", family);
 #endif
 
@@ -2332,10 +2332,11 @@ int sock_register(const struct net_proto_family *ops)
 	}
 
 	spin_lock(&net_family_lock);
-	if (net_families[ops->family])
+	if (rcu_dereference_protected(net_families[ops->family],
+				      lockdep_is_held(&net_family_lock)))
 		err = -EEXIST;
 	else {
-		net_families[ops->family] = ops;
+		rcu_assign_pointer(net_families[ops->family], ops);
 		err = 0;
 	}
 	spin_unlock(&net_family_lock);
@@ -2363,7 +2364,7 @@ void sock_unregister(int family)
 	BUG_ON(family < 0 || family >= NPROTO);
 
 	spin_lock(&net_family_lock);
-	net_families[family] = NULL;
+	rcu_assign_pointer(net_families[family], NULL);
 	spin_unlock(&net_family_lock);
 
 	synchronize_rcu();



^ permalink raw reply related

* Re: possible kernel oops from user MSS
From: David Miller @ 2010-11-10 20:41 UTC (permalink / raw)
  To: schen; +Cc: netdev
In-Reply-To: <AANLkTimMDMb74-3E9vhFxZ5Dgeuk3HMzPZVjwCj+yFEJ@mail.gmail.com>

From: Steve Chen <schen@mvista.com>
Date: Wed, 10 Nov 2010 07:24:51 -0600

> With commit f5fff5dc8a7a3f395b0525c02ba92c95d42b7390, a user program
> can pass in TCP_MAXSEG of 12 (or TCPOLEN_TSTAMP_ALIGNED), and cause
> kernel oops with division by 0
>  in tcp_select_initial_window.  One way to prevent it is to change the
> minimum value for TCP_MAXSEG in do_tcp_setsockopt from 8 to some value
> over 12.  Two questions.
> 
> 1.  Is this the right solution?
> 2.  If it is, what is a good minimum value?

Thanks Steve, I'll fix this like so:

--------------------
tcp: Increase TCP_MAXSEG socket option minimum.

As noted by Steve Chen, since commit
f5fff5dc8a7a3f395b0525c02ba92c95d42b7390 ("tcp: advertise MSS
requested by user") we can end up with a situation where
tcp_select_initial_window() does a divide by a zero (or
even negative) mss value.

The problem is that sometimes we subtract TCPOLEN_TSTAMP_ALIGNED
from the mss.

Fix this by increasing the minimum from 8 to 8 plus the value
of TCPOLEN_TSTATMP_ALIGNED.

Reported-by: Steve Chen <schen@mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 245603c..6b0eb4d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2246,7 +2246,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		/* Values greater than interface MTU won't take effect. However
 		 * at the point when this call is done we typically don't yet
 		 * know which interface is going to be used */
-		if (val < 8 || val > MAX_TCP_WINDOW) {
+		if (val < TCPOLEN_TSTAMP_ALIGNED + 8 || val > MAX_TCP_WINDOW) {
 			err = -EINVAL;
 			break;
 		}
-- 
1.7.3.2


^ permalink raw reply related

* Re: [PATCH] macvlan: lockless tx path
From: Eric Dumazet @ 2010-11-10 20:33 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, Patrick McHardy, netdev
In-Reply-To: <4CDAE713.7020309@candelatech.com>

Le mercredi 10 novembre 2010 à 10:40 -0800, Ben Greear a écrit :

> In my opinion, the kernel and/or driver should deal with this such that
> at worst the user has to deal with 32 v/s 64 bits based on whether the
> kernel is compiled for 32 or 64 bit CPUs.  (Let the driver sample at
> intervals needed to never wrap it's counters more than once and update
> software stats of well-defined bit-width, and present those software
> counters to users.
> 

How so ? Are you willing to provide patches for all network drivers ?

> In practice, this seems to be the case, at least for the NICs I've used
> (mostly Intel).  But, please don't propagate the idea that any width of
> counters is OK to present to user-space:  It is completely unfair to
> make app writers have to know the network driver and/or hardware quirks to
> know how often it must sample stats.
> 

I am sorry Ben, but /proc/net/dev doesnt publish each counter effective
width. Its unfair, but its like that.

An appplication must be able to cope for wrap arounds, running on a 32
or 64bit kernel. Our duty is to provide 64bit counters for high speed
interfaces where possible.
For a 10Mb adapter, there is no need, since a 32bit counter doesnt wrap
in less than one hour (RFC1902 suggestion)

As I said, many drivers counters are not 32bit or 64bit. I did many
driver get_stats() checks lately...

Why should we cap them to 32bit if they really are 36 or 40 bits ? 

> Well, maybe using u32 would have positive benefits on 64-bit kernels then?
> 

But we want to handle 40/100Gbps devices, and keep SNMP apps happy.

We really need 64bit for them, and MACVLAN might be used on top of such
devices.

Or are you suggesting using u32 instead of "unsigned long" for
rx_errors/tx_dropped ?

This would indeed save 8 bytes per cpu per macvlan.

^ permalink raw reply

* Re: [PATCH] net: avoid limits overflow
From: David Miller @ 2010-11-10 20:12 UTC (permalink / raw)
  To: eric.dumazet
  Cc: holt, akpm, w, linux-kernel, netdev, kuznet, pekkas, jmorris,
	yoshfuji, kaber
In-Reply-To: <1289381066.2860.109.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 10 Nov 2010 10:24:26 +0100

> [PATCH] net: avoid limits overflow
> 
> Robin Holt tried to boot a 16TB machine and found some limits were
> reached : sysctl_tcp_mem[2], sysctl_udp_mem[2]
> 
> We can switch infrastructure to use long "instead" of "int", now
> atomic_long_t primitives are available for free.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Reported-by: Robin Holt <holt@sgi.com>
> Reviewed-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Applied.

^ permalink raw reply

* Re: [PATCH 2/3 RESEND] net: packet: fix information leak to userland
From: David Miller @ 2010-11-10 20:09 UTC (permalink / raw)
  To: segooon; +Cc: kernel-janitors, jpirko, eric.dumazet, netdev, linux-kernel
In-Reply-To: <1289413760-12510-1-git-send-email-segooon@gmail.com>

From: Vasiliy Kulikov <segooon@gmail.com>
Date: Wed, 10 Nov 2010 21:29:18 +0300

> packet_getname_spkt() doesn't initialize all members of sa_data field of
> sockaddr struct if strlen(dev->name) < 13.  This structure is then copied
> to userland.  It leads to leaking of contents of kernel stack memory.
> We have to fully fill sa_data with strncpy() instead of strlcpy().
> 
> The same with packet_getname(): it doesn't initialize sll_pkttype field of
> sockaddr_ll.  Set it to zero.
> 
> Signed-off-by: Vasiliy Kulikov <segooon@gmail.com>

Applied.

^ permalink raw reply

* Re: [net-next PATCH 2/2] qlge: Version change to v1.00.00.27
From: David Miller @ 2010-11-10 20:07 UTC (permalink / raw)
  To: ron.mercer; +Cc: netdev, jitendra.kalsaria, ying.lok
In-Reply-To: <1289417386-28384-2-git-send-email-ron.mercer@qlogic.com>

From: Ron Mercer <ron.mercer@qlogic.com>
Date: Wed, 10 Nov 2010 11:29:46 -0800

> Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
> Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>

Applied.

^ permalink raw reply

* Re: [net-next PATCH 1/2] qlge: Add firmware info to ethtool get regs.
From: David Miller @ 2010-11-10 20:07 UTC (permalink / raw)
  To: ron.mercer; +Cc: netdev, jitendra.kalsaria, ying.lok
In-Reply-To: <1289417386-28384-1-git-send-email-ron.mercer@qlogic.com>

From: Ron Mercer <ron.mercer@qlogic.com>
Date: Wed, 10 Nov 2010 11:29:45 -0800

> By default we add firmware information to ethtool get regs.
> Optionally firmware info can instead be sent to log.
> 
> Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com>
> Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>

Applied.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox