Netdev List
* [PATCH 2/4] netfilter: synproxy_core: fix warning in __nf_ct_ext_add_length()
From: Pablo Neira Ayuso @ 2013-09-04 13:00 UTC
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1378299625-4638-1-git-send-email-pablo@netfilter.org>

From: Patrick McHardy <kaber@trash.net>

With CONFIG_NETFILTER_DEBUG we get the following warning during SYNPROXY init:

[   80.558906] WARNING: CPU: 1 PID: 4833 at net/netfilter/nf_conntrack_extend.c:80 __nf_ct_ext_add_length+0x217/0x220 [nf_conntrack]()

The reason is that the conntrack template is set to confirmed before adding
the extension and it is invalid to add extensions to already confirmed
conntracks. Fix by adding the extensions before setting the conntrack to
confirmed.
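
To illustrate the ordering problem described above, here is a minimal
userspace sketch. It is a toy model, not the kernel code: struct toy_ct,
toy_ext_add() and the TOY_* flags are hypothetical stand-ins, and the only
behaviour modelled is "adding an extension to a confirmed entry triggers a
warning".

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the IPS_*_BIT status flags. */
enum { TOY_CONFIRMED = 1u << 0, TOY_TEMPLATE = 1u << 1 };

struct toy_ct {
	unsigned int status;	/* bit flags, playing the role of ct->status */
	unsigned int n_ext;	/* number of extensions attached so far */
	unsigned int warnings;	/* how often the debug check would have fired */
};

/* Attach one extension; count a warning if the entry is already confirmed. */
static void toy_ext_add(struct toy_ct *ct)
{
	if (ct->status & TOY_CONFIRMED)
		ct->warnings++;
	ct->n_ext++;
}

/* Old order: mark the template confirmed first, then add two extensions. */
static unsigned int toy_init_old_order(void)
{
	struct toy_ct ct = { 0, 0, 0 };

	ct.status |= TOY_TEMPLATE | TOY_CONFIRMED;
	toy_ext_add(&ct);
	toy_ext_add(&ct);
	return ct.warnings;
}

/* New order: add both extensions first, then mark the template confirmed. */
static unsigned int toy_init_new_order(void)
{
	struct toy_ct ct = { 0, 0, 0 };

	toy_ext_add(&ct);
	toy_ext_add(&ct);
	ct.status |= TOY_TEMPLATE | TOY_CONFIRMED;
	return ct.warnings;
}
```

In this model the old order produces one warning per extension and the new
order produces none, which is the effect the patch is after.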

Reported-by: Jesper Dangaard Brouer <jesper.brouer@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_synproxy_core.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_synproxy_core.c b/net/netfilter/nf_synproxy_core.c
index d23dc79..6fd967c 100644
--- a/net/netfilter/nf_synproxy_core.c
+++ b/net/netfilter/nf_synproxy_core.c
@@ -356,12 +356,12 @@ static int __net_init synproxy_net_init(struct net *net)
 		goto err1;
 	}
 
-	__set_bit(IPS_TEMPLATE_BIT, &ct->status);
-	__set_bit(IPS_CONFIRMED_BIT, &ct->status);
 	if (!nfct_seqadj_ext_add(ct))
 		goto err2;
 	if (!nfct_synproxy_ext_add(ct))
 		goto err2;
+	__set_bit(IPS_TEMPLATE_BIT, &ct->status);
+	__set_bit(IPS_CONFIRMED_BIT, &ct->status);
 
 	snet->tmpl = ct;
 
-- 
1.7.10.4

* [PATCH 4/4] netfilter: xt_TCPMSS: correct return value in tcpmss_mangle_packet
From: Pablo Neira Ayuso @ 2013-09-04 13:00 UTC
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1378299625-4638-1-git-send-email-pablo@netfilter.org>

From: Phil Oester <kernel@linuxace.com>

In commit b396966c4 (netfilter: xt_TCPMSS: Fix missing fragmentation handling),
I attempted to add safe fragment handling to xt_TCPMSS.  However, Andy Padavan
of Project N56U correctly points out that returning XT_CONTINUE in this
function does not work.  The callers (tcpmss_tg[46]) expect to receive a value
of 0 in order to return XT_CONTINUE.
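
The return-value contract described above can be sketched in userspace as
follows. This is a toy model under stated assumptions, not the xt_TCPMSS
code: toy_mangle() and toy_target() are hypothetical, TOY_XT_CONTINUE is
assumed to be an all-ones unsigned value, and the caller is assumed to treat
a negative result as "drop" (the helper already returns -1 on failure).

```c
#include <stdbool.h>

#define TOY_XT_CONTINUE	0xFFFFFFFFu	/* assumed verdict value: keep traversing rules */
#define TOY_NF_DROP	0u		/* assumed verdict value: drop the packet */

/*
 * Helper returning a signed int: negative means error, 0 means nothing
 * changed, positive means the number of bytes added to the packet.
 */
static int toy_mangle(bool is_fragment, bool old_behaviour)
{
	if (is_fragment) {
		/*
		 * Old code returned the verdict constant from the helper.
		 * Converted to int it becomes -1 on common platforms.
		 */
		return old_behaviour ? (int)TOY_XT_CONTINUE : 0;
	}
	return 4;	/* pretend an MSS option of 4 bytes was added */
}

/* Caller: maps the helper's int result to a verdict. */
static unsigned int toy_target(bool is_fragment, bool old_behaviour)
{
	int ret = toy_mangle(is_fragment, old_behaviour);

	if (ret < 0)
		return TOY_NF_DROP;
	/* ret > 0 is where a real caller would fix up the length field */
	return TOY_XT_CONTINUE;
}
```

In this model a fragment is dropped with the old helper behaviour and
continues with the new one, because only the value 0 reaches the final
return of the caller.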

Signed-off-by: Phil Oester <kernel@linuxace.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/xt_TCPMSS.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/xt_TCPMSS.c b/net/netfilter/xt_TCPMSS.c
index 6113cc7..cd24290 100644
--- a/net/netfilter/xt_TCPMSS.c
+++ b/net/netfilter/xt_TCPMSS.c
@@ -60,7 +60,7 @@ tcpmss_mangle_packet(struct sk_buff *skb,
 
 	/* This is a fragment, no TCP header is available */
 	if (par->fragoff != 0)
-		return XT_CONTINUE;
+		return 0;
 
 	if (!skb_make_writable(skb, skb->len))
 		return -1;
-- 
1.7.10.4


* [PATCH 1/4] netfilter: more strict TCP flag matching in SYNPROXY
From: Pablo Neira Ayuso @ 2013-09-04 13:00 UTC
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1378299625-4638-1-git-send-email-pablo@netfilter.org>

From: Jesper Dangaard Brouer <brouer@redhat.com>

It seems Patrick missed incorporating some of the changes I requested
during review of v2 of the SYNPROXY netfilter module.

These were meant to keep SYN+ACK packets from entering the path intended
for the ACK packet from the client (from the 3WHS).

Further, there was a bug in ip6t_SYNPROXY.c: the match for SYN packets
didn't exclude the ACK flag.

Go a step further with SYN packet/flag matching by excluding the flags
ACK+FIN+RST, in both the IPv4 and IPv6 modules.

The intended usage of SYNPROXY is as follows:
(gracefully describing usage in commit)

 iptables -t raw -A PREROUTING -i eth0 -p tcp --dport 80 --syn -j NOTRACK
 iptables -A INPUT -i eth0 -p tcp --dport 80 -m state --state UNTRACKED,INVALID \
         -j SYNPROXY --sack-perm --timestamp --mss 1480 --wscale 7 --ecn

 echo 0 > /proc/sys/net/netfilter/nf_conntrack_tcp_loose

This does filter SYN flags early, for packets in the UNTRACKED state,
but packets in the INVALID state with other TCP flags could still
reach the module, thus this stricter flag matching is still needed.
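
The stricter matching can be sketched as a small userspace classifier. The
two conditions are taken from the hunks below; the function and enum names
are hypothetical and only serve the illustration.

```c
#include <stdbool.h>

/* Which branch of the target a packet would take in this toy model. */
enum toy_path {
	TOY_PATH_NONE,		/* neither branch matches */
	TOY_PATH_CLIENT_SYN,	/* "Initial SYN from client" branch */
	TOY_PATH_CLIENT_ACK	/* "ACK from client" branch */
};

static enum toy_path toy_classify(bool syn, bool ack, bool fin, bool rst)
{
	/* A pure SYN: no ACK, FIN or RST alongside it. */
	if (syn && !(ack || fin || rst))
		return TOY_PATH_CLIENT_SYN;
	/* A pure ACK: no FIN, RST or SYN alongside it. */
	if (ack && !(fin || rst || syn))
		return TOY_PATH_CLIENT_ACK;
	return TOY_PATH_NONE;
}
```

With these conditions a SYN+ACK packet matches neither branch, whereas the
old IPv6 check (th->syn alone) would have sent it down the SYN branch.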

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/ipt_SYNPROXY.c  |    4 ++--
 net/ipv6/netfilter/ip6t_SYNPROXY.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
index 94371db..90e489e 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -269,7 +269,7 @@ synproxy_tg4(struct sk_buff *skb, const struct xt_action_param *par)
 
 	synproxy_parse_options(skb, par->thoff, th, &opts);
 
-	if (th->syn && !th->ack) {
+	if (th->syn && !(th->ack || th->fin || th->rst)) {
 		/* Initial SYN from client */
 		this_cpu_inc(snet->stats->syn_received);
 
@@ -285,7 +285,7 @@ synproxy_tg4(struct sk_buff *skb, const struct xt_action_param *par)
 					  XT_SYNPROXY_OPT_ECN);
 
 		synproxy_send_client_synack(skb, th, &opts);
-	} else if (th->ack && !(th->fin || th->rst))
+	} else if (th->ack && !(th->fin || th->rst || th->syn))
 		/* ACK from client */
 		synproxy_recv_client_ack(snet, skb, th, &opts, ntohl(th->seq));
 
diff --git a/net/ipv6/netfilter/ip6t_SYNPROXY.c b/net/ipv6/netfilter/ip6t_SYNPROXY.c
index 4270a9b..a5af0bf 100644
--- a/net/ipv6/netfilter/ip6t_SYNPROXY.c
+++ b/net/ipv6/netfilter/ip6t_SYNPROXY.c
@@ -284,7 +284,7 @@ synproxy_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 
 	synproxy_parse_options(skb, par->thoff, th, &opts);
 
-	if (th->syn) {
+	if (th->syn && !(th->ack || th->fin || th->rst)) {
 		/* Initial SYN from client */
 		this_cpu_inc(snet->stats->syn_received);
 
@@ -300,7 +300,7 @@ synproxy_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 					  XT_SYNPROXY_OPT_ECN);
 
 		synproxy_send_client_synack(skb, th, &opts);
-	} else if (th->ack && !(th->fin || th->rst))
+	} else if (th->ack && !(th->fin || th->rst || th->syn))
 		/* ACK from client */
 		synproxy_recv_client_ack(snet, skb, th, &opts, ntohl(th->seq));
 
-- 
1.7.10.4


* [PATCH 0/4] netfilter updates for net-next
From: Pablo Neira Ayuso @ 2013-09-04 13:00 UTC
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

The following batch contains:

* Three fixes for the new synproxy target available in your
  net-next tree, from Jesper D. Brouer and Patrick McHardy.

* One fix for TCPMSS to correctly handle the fragmentation
  case, from Phil Oester. I'll pass this one to -stable.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git master

Thanks!

----------------------------------------------------------------

The following changes since commit 5a17a390de7bdbcfff9b8f344273a886ca4cf8bf:

  net: make snmp_mib_free static inline (2013-09-02 21:00:50 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git master

for you to fetch changes up to 1205e1fa615805c9efa97303b552cf445965752a:

  netfilter: xt_TCPMSS: correct return value in tcpmss_mangle_packet (2013-09-04 14:20:03 +0200)

----------------------------------------------------------------
Jesper Dangaard Brouer (2):
      netfilter: more strict TCP flag matching in SYNPROXY
      netfilter: SYNPROXY: let unrelated packets continue

Patrick McHardy (1):
      netfilter: synproxy_core: fix warning in __nf_ct_ext_add_length()

Phil Oester (1):
      netfilter: xt_TCPMSS: correct return value in tcpmss_mangle_packet

 net/ipv4/netfilter/ipt_SYNPROXY.c  |   10 +++++++---
 net/ipv6/netfilter/ip6t_SYNPROXY.c |   10 +++++++---
 net/netfilter/nf_synproxy_core.c   |    4 ++--
 net/netfilter/xt_TCPMSS.c          |    2 +-
 4 files changed, 17 insertions(+), 9 deletions(-)

* Re: [PATCH net-next v2 2/2] net: migrate direct users to prandom_u32_max
From: Daniel Borkmann @ 2013-09-04 12:58 UTC
  To: Eric Dumazet; +Cc: davem, netdev, linux-kernel
In-Reply-To: <1378299135.7360.109.camel@edumazet-glaptop>

On 09/04/2013 02:52 PM, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 14:37 +0200, Daniel Borkmann wrote:
>> Users that directly use or reimplement what we have in prandom_u32_max()
>> can be migrated for now to use it directly, so that we can reduce code size
>> and avoid reimplementations. That's obvious, follow-up patches could inspect
>> modulo use cases for possible migration as well.
>>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>> ---
>>   drivers/net/team/team_mode_random.c | 8 +-------
>>   include/net/red.h                   | 2 +-
>>   net/802/garp.c                      | 3 ++-
>>   net/802/mrp.c                       | 3 ++-
>>   net/packet/af_packet.c              | 2 +-
>>   net/sched/sch_choke.c               | 8 +-------
>>   6 files changed, 8 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/net/team/team_mode_random.c b/drivers/net/team/team_mode_random.c
>> index 7f032e2..0dbd1eb 100644
>> --- a/drivers/net/team/team_mode_random.c
>> +++ b/drivers/net/team/team_mode_random.c
>> @@ -13,20 +13,14 @@
>>   #include <linux/module.h>
>>   #include <linux/init.h>
>>   #include <linux/skbuff.h>
>> -#include <linux/reciprocal_div.h>
>>   #include <linux/if_team.h>
>>
>> -static u32 random_N(unsigned int N)
>> -{
>> -	return reciprocal_divide(prandom_u32(), N);
>> -}
>> -
>>   static bool rnd_transmit(struct team *team, struct sk_buff *skb)
>>   {
>>   	struct team_port *port;
>>   	int port_index;
>>
>> -	port_index = random_N(team->en_port_count);
>> +	port_index = prandom_u32_max(team->en_port_count - 1);
>
>
> Note the random_N(0) gave 0, while prandom_u32_max(0 - 1) can return any
> number in [0 ... ~0U]

Very true, that was stupid. Thanks for catching!

* Re: [nf-next PATCH] netfilter: SYNPROXY let unrelated packets continue
From: Pablo Neira Ayuso @ 2013-09-04 12:56 UTC
  To: Jesper Dangaard Brouer; +Cc: Patrick McHardy, netfilter-devel, netdev, mph, as
In-Reply-To: <20130829101625.14346.41071.stgit@dragon>

On Thu, Aug 29, 2013 at 12:18:46PM +0200, Jesper Dangaard Brouer wrote:
> Packets reaching SYNPROXY were dropped by default, as they were most
> likely invalid (given the recommended state matching).  This
> patch changes the SYNPROXY target to let packets that are not consumed
> continue being processed by the stack.
> 
> This will be more in line with other target modules, as it will allow
> more flexible configurations for handling, logging or matching on
> packets in INVALID states.

Applied, thanks.

* Re: [PATCH net-next v2 2/2] net: migrate direct users to prandom_u32_max
From: Eric Dumazet @ 2013-09-04 12:52 UTC
  To: Daniel Borkmann; +Cc: davem, netdev, linux-kernel
In-Reply-To: <1378298247-29364-3-git-send-email-dborkman@redhat.com>

On Wed, 2013-09-04 at 14:37 +0200, Daniel Borkmann wrote:
> Users that directly use or reimplement what we have in prandom_u32_max()
> can be migrated for now to use it directly, so that we can reduce code size
> and avoid reimplementations. That's obvious, follow-up patches could inspect
> modulo use cases for possible migration as well.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
>  drivers/net/team/team_mode_random.c | 8 +-------
>  include/net/red.h                   | 2 +-
>  net/802/garp.c                      | 3 ++-
>  net/802/mrp.c                       | 3 ++-
>  net/packet/af_packet.c              | 2 +-
>  net/sched/sch_choke.c               | 8 +-------
>  6 files changed, 8 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/team/team_mode_random.c b/drivers/net/team/team_mode_random.c
> index 7f032e2..0dbd1eb 100644
> --- a/drivers/net/team/team_mode_random.c
> +++ b/drivers/net/team/team_mode_random.c
> @@ -13,20 +13,14 @@
>  #include <linux/module.h>
>  #include <linux/init.h>
>  #include <linux/skbuff.h>
> -#include <linux/reciprocal_div.h>
>  #include <linux/if_team.h>
>  
> -static u32 random_N(unsigned int N)
> -{
> -	return reciprocal_divide(prandom_u32(), N);
> -}
> -
>  static bool rnd_transmit(struct team *team, struct sk_buff *skb)
>  {
>  	struct team_port *port;
>  	int port_index;
>  
> -	port_index = random_N(team->en_port_count);
> +	port_index = prandom_u32_max(team->en_port_count - 1);


Note the random_N(0) gave 0, while prandom_u32_max(0 - 1) can return any
number in [0 ... ~0U]

* Re: [PATCH] xen-netback: count number required slots for an skb more carefully
From: Ian Campbell @ 2013-09-04 12:41 UTC
  To: David Vrabel
  Cc: Wei Liu, xen-devel, Konrad Rzeszutek Wilk, Boris Ostrovsky,
	netdev, msw, annie.li
In-Reply-To: <52271DFF.3070008@citrix.com>

On Wed, 2013-09-04 at 12:48 +0100, David Vrabel wrote:
> On 03/09/13 22:53, Wei Liu wrote:
> > On Tue, Sep 03, 2013 at 06:29:50PM +0100, David Vrabel wrote:
> >> From: David Vrabel <david.vrabel@citrix.com>
> >>
> >> When a VM is providing an iSCSI target and the LUN is used by the
> >> backend domain, the generated skbs for direct I/O writes to the disk
> >> have large, multi-page skb->data but no frags.
> >>
> >> With some lengths and starting offsets, xen_netbk_count_skb_slots()
> >> would be one short because the simple calculation of
> >> DIV_ROUND_UP(skb_headlen(), PAGE_SIZE) was not accounting for the
> >> decisions made by start_new_rx_buffer() which does not guarantee
> >> responses are fully packed.
> >>
> >> For example, a skb with length < 2 pages but which spans 3 pages would
> >> be counted as requiring 2 slots but would actually use 3 slots.
> >>
> >> skb->data:
> >>
> >>     |        1111|222222222222|3333        |
> >>
> >> Fully packed, this would need 2 slots:
> >>
> >>     |111122222222|22223333    |
> >>
> >> But because the 2nd page wholly fits into a slot it is not split across
> >> slots and goes into a slot of its own:
> >>
> >>     |1111        |222222222222|3333        |
> >>
> >> Miscounting the number of slots means netback may push more responses
> >> than the number of available requests.  This will cause the frontend
> >> to get very confused and report "Too many frags/slots".  The frontend
> >> never recovers and will eventually BUG.
> >>
> >> Fix this by counting the number of required slots more carefully.  In
> >> xen_netbk_count_skb_slots(), more closely follow the algorithm used by
> >> xen_netbk_gop_skb() by introducing xen_netbk_count_frag_slots() which
> >> is the dry-run equivalent of netbk_gop_frag_copy().
> >>
> > 
> > Phew! So this is a backend miscounting bug. I thought it was a frontend
> > bug so it didn't ring a bell when we had our face-to-face discussion,
> > sorry. :-(
> > 
> > This bug was discussed back in July among Annie, Matt, Ian and me. We
> > finally agreed to take Matt's solution. Matt agreed to post the final
> > version within a week but obviously he's too busy to do so. I was away
> > so I didn't follow closely. Eventually it fell through the cracks. :-(
> 
> I think I prefer fixing the counting for backporting to stable kernels.

That's a good argument. I think we should take this patch, or something
very like it, now and then rebase the more complex thing on top.

>  Xi's approach of packing the ring differently is a change in frontend
> visible behaviour and seems more risky. e.g., possible performance
> impact so I would like to see some performance analysis of that approach.

Yes.
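
For readers following along, the miscount described in the quoted commit
message can be reproduced with a small userspace model. This is a
simplified sketch, not the netback code: the function names are
hypothetical, the slot size is assumed to equal the page size, and the only
rule modelled is "a per-page chunk that does not fit into the rest of the
current slot starts a new slot instead of being split".

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size, also used as slot size */

/* Naive count: assumes the data is fully packed into slots. */
static unsigned int toy_slots_packed(size_t len)
{
	return (unsigned int)((len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE);
}

/*
 * Dry-run count: walk the data page by page, starting at offset_in_page
 * within the first page, and never split a chunk across two slots.
 */
static unsigned int toy_slots_per_chunk(size_t offset_in_page, size_t len)
{
	unsigned int slots = 0;
	size_t fill = 0;	/* bytes already placed in the current slot */

	while (len > 0) {
		size_t chunk = TOY_PAGE_SIZE - offset_in_page;

		if (chunk > len)
			chunk = len;
		/* Open a slot for the first chunk or when the chunk does not fit. */
		if (slots == 0 || fill + chunk > TOY_PAGE_SIZE) {
			slots++;
			fill = 0;
		}
		fill += chunk;
		len -= chunk;
		offset_in_page = 0;	/* later pages start at their beginning */
	}
	return slots;
}
```

With 6192 bytes starting at offset 3000 (chunks of 1096, 4096 and 1000
bytes, so less than two pages of data spread over three pages), the naive
count is 2 and the dry-run count is 3, matching the diagrams above.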

* [PATCH net-next v2 1/2] random: add prandom_u32_range and prandom_u32_max helpers
From: Daniel Borkmann @ 2013-09-04 12:37 UTC
  To: davem; +Cc: netdev, linux-kernel, Theodore Ts'o, Joe Perches
In-Reply-To: <1378298247-29364-1-git-send-email-dborkman@redhat.com>

We have implemented the same function over and over, so introduce
generic helpers that unify these implementations in order to migrate
such code to use them. Make the API similar to randomize_range()
for consistency. prandom_u32_range() generates numbers in [start, end]
interval and prandom_u32_max() generates numbers in [0, end] interval.
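
As an illustration of the multiply-and-shift mapping these helpers rely
on, here is a userspace sketch that takes the 32-bit random value as a
parameter so the result can be checked by hand. The toy_* names are
hypothetical. One deliberate difference from the patch: the span is
computed in 64 bits here so that the full interval [0, ~0U] does not wrap,
whereas the patch as posted computes `end + 1 - start` in 32 bits.

```c
#include <stdint.h>

/*
 * Map a 32-bit value rnd onto [start, end] by scaling instead of using
 * a modulo. Callers must ensure start <= end.
 */
static uint32_t toy_scale_range(uint32_t rnd, uint32_t start, uint32_t end)
{
	/* Number of values in the interval, between 1 and 2^32. */
	uint64_t span = (uint64_t)end - start + 1;

	/* (rnd * span) / 2^32 lies in [0, span - 1]; then shift by start. */
	return (uint32_t)(((uint64_t)rnd * span) >> 32) + start;
}

/* Same mapping for the interval [0, end]. */
static uint32_t toy_scale_max(uint32_t rnd, uint32_t end)
{
	return toy_scale_range(rnd, 0, end);
}
```

For example, toy_scale_max(0x80000000u, 9) yields 5: half of the 32-bit
input space maps to the middle of a ten-value interval.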

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Joe Perches <joe@perches.com>
Cc: linux-kernel@vger.kernel.org
---
 include/linux/random.h | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/linux/random.h b/include/linux/random.h
index 3b9377d..17c91c2 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -8,7 +8,6 @@
 
 #include <uapi/linux/random.h>
 
-
 extern void add_device_randomness(const void *, unsigned int);
 extern void add_input_randomness(unsigned int type, unsigned int code,
 				 unsigned int value);
@@ -32,6 +31,36 @@ void prandom_seed(u32 seed);
 u32 prandom_u32_state(struct rnd_state *);
 void prandom_bytes_state(struct rnd_state *state, void *buf, int nbytes);
 
+/**
+ * prandom_u32_range - return a random number in interval [start, end]
+ * @start: lower interval endpoint
+ * @end: higher interval endpoint
+ *
+ * Returns a number that is in the given interval:
+ *
+ *     [...... <range> .....]
+ *   start                  end
+ *
+ * Callers need to make sure that start <= end. Note that the result
+ * depends on PRNG being well distributed in [0, ~0U] space. Here we
+ * use maximally equidistributed combined Tausworthe generator.
+ */
+static inline u32 prandom_u32_range(u32 start, u32 end)
+{
+	return (u32)(((u64) prandom_u32() * (end + 1 - start)) >> 32) + start;
+}
+
+/**
+ * prandom_u32_max - return a random number in interval [0, end]
+ * @end: higher interval endpoint
+ *
+ * Returns a number that is in interval [0, end].
+ */
+static inline u32 prandom_u32_max(u32 end)
+{
+	return prandom_u32_range(0, end);
+}
+
 /*
  * Handle minimum values for seeds
  */
-- 
1.7.11.7

* [PATCH net-next v2 2/2] net: migrate direct users to prandom_u32_max
From: Daniel Borkmann @ 2013-09-04 12:37 UTC
  To: davem; +Cc: netdev, linux-kernel
In-Reply-To: <1378298247-29364-1-git-send-email-dborkman@redhat.com>

Users that directly use or reimplement what we have in prandom_u32_max()
can be migrated for now to use it directly, so that we can reduce code size
and avoid reimplementations. That's obvious, follow-up patches could inspect
modulo use cases for possible migration as well.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 drivers/net/team/team_mode_random.c | 8 +-------
 include/net/red.h                   | 2 +-
 net/802/garp.c                      | 3 ++-
 net/802/mrp.c                       | 3 ++-
 net/packet/af_packet.c              | 2 +-
 net/sched/sch_choke.c               | 8 +-------
 6 files changed, 8 insertions(+), 18 deletions(-)

diff --git a/drivers/net/team/team_mode_random.c b/drivers/net/team/team_mode_random.c
index 7f032e2..0dbd1eb 100644
--- a/drivers/net/team/team_mode_random.c
+++ b/drivers/net/team/team_mode_random.c
@@ -13,20 +13,14 @@
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/skbuff.h>
-#include <linux/reciprocal_div.h>
 #include <linux/if_team.h>
 
-static u32 random_N(unsigned int N)
-{
-	return reciprocal_divide(prandom_u32(), N);
-}
-
 static bool rnd_transmit(struct team *team, struct sk_buff *skb)
 {
 	struct team_port *port;
 	int port_index;
 
-	port_index = random_N(team->en_port_count);
+	port_index = prandom_u32_max(team->en_port_count - 1);
 	port = team_get_port_by_index_rcu(team, port_index);
 	if (unlikely(!port))
 		goto drop;
diff --git a/include/net/red.h b/include/net/red.h
index ef46058..56f3c0c 100644
--- a/include/net/red.h
+++ b/include/net/red.h
@@ -303,7 +303,7 @@ static inline unsigned long red_calc_qavg(const struct red_parms *p,
 
 static inline u32 red_random(const struct red_parms *p)
 {
-	return reciprocal_divide(net_random(), p->max_P_reciprocal);
+	return prandom_u32_max(p->max_P_reciprocal - 1);
 }
 
 static inline int red_mark_probability(const struct red_parms *p,
diff --git a/net/802/garp.c b/net/802/garp.c
index 5d9630a..b4be421 100644
--- a/net/802/garp.c
+++ b/net/802/garp.c
@@ -397,7 +397,8 @@ static void garp_join_timer_arm(struct garp_applicant *app)
 {
 	unsigned long delay;
 
-	delay = (u64)msecs_to_jiffies(garp_join_time) * net_random() >> 32;
+	delay = prandom_u32_max(msecs_to_jiffies(garp_join_time) - 1);
+
 	mod_timer(&app->join_timer, jiffies + delay);
 }
 
diff --git a/net/802/mrp.c b/net/802/mrp.c
index 1eb05d8..1a08ae7 100644
--- a/net/802/mrp.c
+++ b/net/802/mrp.c
@@ -578,7 +578,8 @@ static void mrp_join_timer_arm(struct mrp_applicant *app)
 {
 	unsigned long delay;
 
-	delay = (u64)msecs_to_jiffies(mrp_join_time) * net_random() >> 32;
+	delay = prandom_u32_max(msecs_to_jiffies(mrp_join_time) - 1);
+
 	mod_timer(&app->join_timer, jiffies + delay);
 }
 
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2e8286b..1c1ccf9 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1162,7 +1162,7 @@ static unsigned int fanout_demux_rnd(struct packet_fanout *f,
 				     struct sk_buff *skb,
 				     unsigned int num)
 {
-	return reciprocal_divide(prandom_u32(), num);
+	return prandom_u32_max(num - 1);
 }
 
 static unsigned int fanout_demux_rollover(struct packet_fanout *f,
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index ef53ab8..7a73fbf 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -77,12 +77,6 @@ struct choke_sched_data {
 	struct sk_buff **tab;
 };
 
-/* deliver a random number between 0 and N - 1 */
-static u32 random_N(unsigned int N)
-{
-	return reciprocal_divide(prandom_u32(), N);
-}
-
 /* number of elements in queue including holes */
 static unsigned int choke_len(const struct choke_sched_data *q)
 {
@@ -233,7 +227,7 @@ static struct sk_buff *choke_peek_random(const struct choke_sched_data *q,
 	int retrys = 3;
 
 	do {
-		*pidx = (q->head + random_N(choke_len(q))) & q->tab_mask;
+		*pidx = (q->head + prandom_u32_max(choke_len(q) - 1)) & q->tab_mask;
 		skb = q->tab[*pidx];
 		if (skb)
 			return skb;
-- 
1.7.11.7

* [PATCH net-next v2 0/2] prandom_u32_range, prandom_u32_max helpers
From: Daniel Borkmann @ 2013-09-04 12:37 UTC
  To: davem; +Cc: netdev, linux-kernel

v1->v2:
  - migrated api to random.h, ccing lkml
  - dropped second patch for now

Daniel Borkmann (2):
  random: add prandom_u32_range and prandom_u32_max helpers
  net: migrate direct users to prandom_u32_max

 drivers/net/team/team_mode_random.c |  8 +-------
 include/linux/random.h              | 31 ++++++++++++++++++++++++++++++-
 include/net/red.h                   |  2 +-
 net/802/garp.c                      |  3 ++-
 net/802/mrp.c                       |  3 ++-
 net/packet/af_packet.c              |  2 +-
 net/sched/sch_choke.c               |  8 +-------
 7 files changed, 38 insertions(+), 19 deletions(-)

-- 
1.7.11.7

* Re: Kernel 3.7+ tcp_metric cache system
From: Eric Dumazet @ 2013-09-04 12:34 UTC
  To: Simon Jouet; +Cc: netdev
In-Reply-To: <CAJWKWvDrjnLpnRR2GmmRLv6EXQdPP9Gb4+-EhWXYhLQzoVrJag@mail.gmail.com>

On Wed, 2013-09-04 at 13:09 +0100, Simon Jouet wrote:
> Hi,
> 
> First of all, apologies if this mailing list isn't meant for this kind
> of discussion; if so, could you please redirect me to a more suitable
> one?
> 
> For my current research I need to be able to specify which cwnd and
> rto to use for specific hosts. After some investigation I came across
> the modifications made in kernel 3.7 that introduced the tcp_metric
> cache and the get/del netlink commands.
> 
> I added a new command "tcp_metrics_nl_cmd_add" to be able to add
> entries to the cache (the code is available here
> http://pastebin.com/gSvhyjWU, this is very much a work in progress).
> This works well enough, and calling "ip tcpm show" afterwards to list
> the entries shows the correct information.
> 
> The issue is that these values are never used, or at least not as far
> as I can see. Once an entry is added, the function
> tcp_init_metrics(struct sock *sk) attempts to read it, but it's read
> only if it's locked (tcp_metric_locked). I'm not sure what the lock
> flag is used for; if anybody has any pointers on that ...
> 
> Anyway, the connection will go to the "reset" goto label and the cwnd
> will be reinitialised by tcp_init_cwnd (defined in tcp_input.c). In
> what I've tested, "__u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) :
> 0);" always returns 0 and the cwnd defaults to 10.
> 
> I'm probably doing something wrong ... but under which conditions is
> the cached cwnd used? I tried to find some documentation on tcp_metric
> but I couldn't find much.
> 

ip ro add 192.168.7.7 via 10.1.10.1 initcwnd 30 rto_min 12

* Query about TX BD Reclaim in Napi poll path (was Re: [PATCH v3] ethernet/arc/arc_emac - Add new driver)
From: Vineet Gupta @ 2013-09-04 12:14 UTC
  To: Francois Romieu
  Cc: Alexey Brodkin, netdev, Andy Shevchenko, David S. Miller,
	linux-kernel
In-Reply-To: <20130613221946.GA16632@electric-eye.fr.zoreil.com>

Hi Francois,

Resurrecting an old thread.

On 06/14/2013 03:49 AM, Francois Romieu wrote:
>> +static irqreturn_t arc_emac_intr(int irq, void *dev_instance)
>> > +{
>> > +	struct net_device *ndev = dev_instance;
>> > +	struct arc_emac_priv *priv = netdev_priv(ndev);
>> > +	struct net_device_stats *stats = &priv->stats;
>> > +	unsigned int status;
>> > +
>> > +	status = arc_reg_get(priv, R_STATUS);
>> > +	status &= ~MDIO_MASK;
>> > +
>> > +	/* Reset all flags except "MDIO complete"*/
>> > +	arc_reg_set(priv, R_STATUS, status);
>> > +
>> > +	if (status & RXINT_MASK) {
>> > +		if (likely(napi_schedule_prep(&priv->napi))) {
>> > +			arc_reg_clr(priv, R_ENABLE, RXINT_MASK);
>> > +			__napi_schedule(&priv->napi);
>> > +		}
>> > +	}
>> > +
>> > +	if (status & TXINT_MASK) {
> You may consider moving everything into the napi poll handler.

I had to revisit this now-mainlined driver recently to fix a bug. Per your
suggestion above, the TX BD reclaim was moved from interrupt context to NAPI
context. I was wondering whether that is the right thing to do (I'm not a
networking expert but worked on this driver heavily before it was mainlined
by Alexey).

In case of large burst transfers by the networking stack (say a large file
copy over NFS), will it not delay the TX BD reclaim, possibly dropping more
packets? Of course, reclaiming in the interrupt path requires enabling Tx
interrupts, which adds to the overall cost from a system perspective, but
assuming the controller can coalesce the Tx interrupts, will it not be better?

I did a quick hack to move the TX reclaim into the intr path and it seems to
be doing slightly better than the current code. The advantages are not sky
high, but I want to understand the implications nevertheless.

TIA,
-Vineet

* Kernel 3.7+ tcp_metric cache system
From: Simon Jouet @ 2013-09-04 12:09 UTC
  To: netdev

Hi,

First of all, apologies if this mailing list isn't meant for this kind
of discussion; if so, could you please redirect me to a more suitable
one?

For my current research I need to be able to specify which cwnd and
rto to use for specific hosts. After some investigation I came across
the modifications made in kernel 3.7 that introduced the tcp_metric
cache and the get/del netlink commands.

I added a new command "tcp_metrics_nl_cmd_add" to be able to add
entries to the cache (the code is available here
http://pastebin.com/gSvhyjWU, this is very much a work in progress).
This works well enough, and calling "ip tcpm show" afterwards to list
the entries shows the correct information.

The issue is that these values are never used, or at least not as far
as I can see. Once an entry is added, the function
tcp_init_metrics(struct sock *sk) attempts to read it, but it's read
only if it's locked (tcp_metric_locked). I'm not sure what the lock
flag is used for; if anybody has any pointers on that ...

Anyway, the connection will go to the "reset" goto label and the cwnd
will be reinitialised by tcp_init_cwnd (defined in tcp_input.c). In
what I've tested, "__u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) :
0);" always returns 0 and the cwnd defaults to 10.

I'm probably doing something wrong ... but under which conditions is
the cached cwnd used? I tried to find some documentation on tcp_metric
but I couldn't find much.

Best regards,
Simon

P.S : If anything is unclear please let me know, so I can rephrase or
provide more details

* Re: [PATCH v4] ipv6:introduce function to find route for redirect
From: Duan Jiong @ 2013-09-04 12:06 UTC
  To: hannes; +Cc: davem, netdev
In-Reply-To: <20130903191729.GA28889@order.stressinduktion.org>

On 2013-09-04 03:17, Hannes Frederic Sowa wrote:
> On Tue, Sep 03, 2013 at 01:37:19PM +0800, Duan Jiong wrote:
>>> Btw. I still think it should be possible to eliminate
>>> ip6_redirect_no_header:
>>>
>>> We could always use ip6_redirect_no_header and use the data of the redirected
>>> header option just for finding the socket to be notified. We can do the whole
>>> verification and route updating in ndisc layer and then just call into icmpv6
>>> layer if upper protocols need a notification of the redirect. But that should
>>> go into another patch. ;)
>>>
>>
>> I think this is good, but I have a question:
>>
>>   If the socket type is connection-based, the dst information is stored in the
>> related sock struct, so there is no need to look up the route for the redirect
>> in ip6_redirect or ip6_redirect_no_header. In this case, doing the verification
>> and route updating in the upper protocols' err_handler is better.
>>
>> What do you think of this?
> 
> This should not be a problem, because every cached dst should be validated
> with ip6_dst_check before it is used. It uses the fib6_node serial number
> which is incremented for all fib6_nodes on the path to the new installed
> node by fib6_add_1. So we are safe here.
> 
> Btw. this is the same logic redirects get currently picked up, too.
> 

Thanks for your answer, but I still have some questions about dealing with
redirects in ip4ip6_err() and ipip6_err(), and I need some time to learn more
about them. So I am only sending one patch to fix the bug.

Please forgive me, I'm a newbie. :)

Thanks,
  Duan

* Re: [PATCH v2 net-next] pkt_sched: fq: Fair Queue packet scheduler
From: Daniel Borkmann @ 2013-09-04 11:59 UTC
  To: Eric Dumazet
  Cc: Jason Wang, David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael S. Tsirkin
In-Reply-To: <1378294029.7360.92.camel@edumazet-glaptop>

On 09/04/2013 01:27 PM, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 03:30 -0700, Eric Dumazet wrote:
>> On Wed, 2013-09-04 at 14:30 +0800, Jason Wang wrote:
>>
>>>> And tcpdump would certainly help ;)
>>>
>>> See attachment.
>>>
>>
>> Nothing obvious on tcpdump (only that lot of frames are missing)
>>
>> 1) Are you capturing part of the payload only (like tcpdump -s 128)
>>
>> 2) What is the setup.
>>
>> 3) tc -s -d qdisc
>
> If you use FQ in the guest, then it could be that high resolution timers
> have high latency ?

Probably they internally switch to a lower resolution clock event source if
there's no hardware support available:

   The [source event] management layer provides interfaces for hrtimers to
   implement high resolution timers [...] [and it] supports these more advanced
   functions only when appropriate clock event sources have been registered,
   otherwise the traditional periodic tick based behaviour is retained. [1]

[1] https://www.kernel.org/doc/ols/2006/ols2006v1-pages-333-346.pdf

> So FQ arms short timers, but effective duration could be much longer.
>
> Here I get a smooth latency of up to ~3 us
>
> lpq83:~# ./netperf -H lpq84 ; ./tc -s -d qd ; dmesg | tail -n1
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
> Recv   Send    Send
> Socket Socket  Message  Elapsed
> Size   Size    Size     Time     Throughput
> bytes  bytes   bytes    secs.    10^6bits/sec
>
>   87380  16384  16384    10.00    9410.82
> qdisc fq 8005: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140
>   Sent 50545633991 bytes 33385894 pkt (dropped 0, overlimits 0 requeues 19)
>   rate 9258Mbit 764335pps backlog 0b 0p requeues 19
>    117 flow, 115 inactive, 0 throttled
>    0 gc, 0 highprio, 0 retrans, 96861 throttled, 0 flows_plimit
> [  572.551664] latency = 3035 ns
>
>
> What do you get with this debugging patch ?
>
> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> index 32ad015..c1312a0 100644
> --- a/net/sched/sch_fq.c
> +++ b/net/sched/sch_fq.c
> @@ -103,6 +103,7 @@ struct fq_sched_data {
>   	u64		stat_internal_packets;
>   	u64		stat_tcp_retrans;
>   	u64		stat_throttled;
> +	s64		slatency;
>   	u64		stat_flows_plimit;
>   	u64		stat_pkts_too_long;
>   	u64		stat_allocation_errors;
> @@ -393,6 +394,7 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
>   static void fq_check_throttled(struct fq_sched_data *q, u64 now)
>   {
>   	struct rb_node *p;
> +	bool first = true;
>
>   	if (q->time_next_delayed_flow > now)
>   		return;
> @@ -405,6 +407,13 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 now)
>   			q->time_next_delayed_flow = f->time_next_packet;
>   			break;
>   		}
> +		if (first) {
> +			s64 delay = now - f->time_next_packet;
> +
> +			first = false;
> +			delay -= q->slatency >> 3;
> +			q->slatency += delay;
> +		}
>   		rb_erase(p, &q->delayed);
>   		q->throttled_flows--;
>   		fq_flow_add_tail(&q->old_flows, f);
> @@ -711,6 +720,7 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
>   	if (opts == NULL)
>   		goto nla_put_failure;
>
> +	pr_err("latency = %lld ns\n", q->slatency >> 3);
>   	if (nla_put_u32(skb, TCA_FQ_PLIMIT, sch->limit) ||
>   	    nla_put_u32(skb, TCA_FQ_FLOW_PLIMIT, q->flow_plimit) ||
>   	    nla_put_u32(skb, TCA_FQ_QUANTUM, q->quantum) ||
>

^ permalink raw reply

* Re: [PATCH V3 4/6] vhost_net: determine whether or not to use zerocopy at one time
From: Michael S. Tsirkin @ 2013-09-04 11:59 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <1378111261-14826-5-git-send-email-jasowang@redhat.com>

On Mon, Sep 02, 2013 at 04:40:59PM +0800, Jason Wang wrote:
> Currently, even if the packet length is smaller than VHOST_GOODCOPY_LEN, if
> upend_idx != done_idx we still set zcopy_used to true and rollback this choice
> later. This could be avoided by determining zerocopy once by checking all
> conditions at one time before.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/net.c |   47 ++++++++++++++++++++---------------------------
>  1 files changed, 20 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 8a6dd0d..3f89dea 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -404,43 +404,36 @@ static void handle_tx(struct vhost_net *net)
>  			       iov_length(nvq->hdr, s), hdr_size);
>  			break;
>  		}
> -		zcopy_used = zcopy && (len >= VHOST_GOODCOPY_LEN ||
> -				       nvq->upend_idx != nvq->done_idx);
> +
> +		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
> +				   && (nvq->upend_idx + 1) % UIO_MAXIOV !=
> +				      nvq->done_idx

Thinking about this, this looks strange.
The original idea was that once we start doing zcopy, we keep
using the heads ring even for short packets until no zcopy is outstanding.

What's the logic behind (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
here?



> +				   && vhost_net_tx_select_zcopy(net);
>  
>  		/* use msg_control to pass vhost zerocopy ubuf info to skb */
>  		if (zcopy_used) {
> +			struct ubuf_info *ubuf;
> +			ubuf = nvq->ubuf_info + nvq->upend_idx;
> +
>  			vq->heads[nvq->upend_idx].id = head;
> -			if (!vhost_net_tx_select_zcopy(net) ||
> -			    len < VHOST_GOODCOPY_LEN) {
> -				/* copy don't need to wait for DMA done */
> -				vq->heads[nvq->upend_idx].len =
> -							VHOST_DMA_DONE_LEN;
> -				msg.msg_control = NULL;
> -				msg.msg_controllen = 0;
> -				ubufs = NULL;
> -			} else {
> -				struct ubuf_info *ubuf;
> -				ubuf = nvq->ubuf_info + nvq->upend_idx;
> -
> -				vq->heads[nvq->upend_idx].len =
> -					VHOST_DMA_IN_PROGRESS;
> -				ubuf->callback = vhost_zerocopy_callback;
> -				ubuf->ctx = nvq->ubufs;
> -				ubuf->desc = nvq->upend_idx;
> -				msg.msg_control = ubuf;
> -				msg.msg_controllen = sizeof(ubuf);
> -				ubufs = nvq->ubufs;
> -				kref_get(&ubufs->kref);
> -			}
> +			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
> +			ubuf->callback = vhost_zerocopy_callback;
> +			ubuf->ctx = nvq->ubufs;
> +			ubuf->desc = nvq->upend_idx;
> +			msg.msg_control = ubuf;
> +			msg.msg_controllen = sizeof(ubuf);
> +			ubufs = nvq->ubufs;
> +			kref_get(&ubufs->kref);
>  			nvq->upend_idx = (nvq->upend_idx + 1) % UIO_MAXIOV;
> -		} else
> +		} else {
>  			msg.msg_control = NULL;
> +			ubufs = NULL;
> +		}
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>  		err = sock->ops->sendmsg(NULL, sock, &msg, len);
>  		if (unlikely(err < 0)) {
>  			if (zcopy_used) {
> -				if (ubufs)
> -					vhost_net_ubuf_put(ubufs);
> +				vhost_net_ubuf_put(ubufs);
>  				nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
>  					% UIO_MAXIOV;
>  			}
> -- 
> 1.7.1

^ permalink raw reply

* Re: GSO/GRO and UDP performance
From: Eric Dumazet @ 2013-09-04 11:53 UTC (permalink / raw)
  To: James Yonan; +Cc: netdev
In-Reply-To: <52270659.1090208@openvpn.net>

On Wed, 2013-09-04 at 04:07 -0600, James Yonan wrote:

> The bundle of UDP packets would traverse the stack as a unit until it 
> reaches the socket layer, where recvmmsg could pass the whole bundle up 
> to userspace in a single transaction (or recvmsg could disaggregate the 
> bundle and pass each datagram individually).

That would require a lot of work, say in netfilter, but also in core
network stack in forwarding, and all UDP users (L2TP, vxlan).

Very unlikely to happen IMHO.

I suspect the performance gain comes from aggregation done in user space,
with the result then re-injected into the kernel?

You could use a kernel module, using udp_encap_enable() and friends.

Check vxlan_socket_create() for an example.

^ permalink raw reply

* Re: [PATCH] xen-netback: count number required slots for an skb more carefully
From: David Vrabel @ 2013-09-04 11:48 UTC (permalink / raw)
  To: Wei Liu
  Cc: xen-devel, Konrad Rzeszutek Wilk, Boris Ostrovsky, Ian Campbell,
	netdev, msw, annie.li
In-Reply-To: <20130903215328.GA13465@zion.uk.xensource.com>

On 03/09/13 22:53, Wei Liu wrote:
> On Tue, Sep 03, 2013 at 06:29:50PM +0100, David Vrabel wrote:
>> From: David Vrabel <david.vrabel@citrix.com>
>>
>> When a VM is providing an iSCSI target and the LUN is used by the
>> backend domain, the generated skbs for direct I/O writes to the disk
>> have large, multi-page skb->data but no frags.
>>
>> With some lengths and starting offsets, xen_netbk_count_skb_slots()
>> would be one short because the simple calculation of
>> DIV_ROUND_UP(skb_headlen(), PAGE_SIZE) was not accounting for the
>> decisions made by start_new_rx_buffer() which does not guarantee
>> responses are fully packed.
>>
>> For example, a skb with length < 2 pages but which spans 3 pages would
>> be counted as requiring 2 slots but would actually use 3 slots.
>>
>> skb->data:
>>
>>     |        1111|222222222222|3333        |
>>
>> Fully packed, this would need 2 slots:
>>
>>     |111122222222|22223333    |
>>
>> But because the 2nd page wholly fits into a slot it is not split across
>> slots and goes into a slot of its own:
>>
>>     |1111        |222222222222|3333        |
>>
>> Miscounting the number of slots means netback may push more responses
>> than the number of available requests.  This will cause the frontend
>> to get very confused and report "Too many frags/slots".  The frontend
>> never recovers and will eventually BUG.
>>
>> Fix this by counting the number of required slots more carefully.  In
>> xen_netbk_count_skb_slots(), more closely follow the algorithm used by
>> xen_netbk_gop_skb() by introducing xen_netbk_count_frag_slots() which
>> is the dry-run equivalent of netbk_gop_frag_copy().
>>
> 
> Phew! So this is backend miscounting bug. I thought it was a frontend
> bug so it didn't ring a bell when we had our face-to-face discussion,
> sorry. :-(
> 
> This bug was discussed back in July among Annie, Matt, Ian and I. We
> finally agreed to take Matt's solution. Matt agreed to post final
> version within a week but obviously he's too busy to do so. I was away
> so I didn't follow closely. Eventually it fell through the crack. :-(

I think I prefer fixing the counting for backporting to stable kernels.
Xi's approach of packing the ring differently is a change in
frontend-visible behaviour and seems riskier. For example, it may have a
performance impact, so I would like to see some performance analysis of
that approach.
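
The miscount in the quoted patch description can be reproduced with a small
standalone model. This is only a sketch under stated assumptions: the names
naive_slots() and packed_slots() are hypothetical, the page and slot size are
both assumed to be 4096 bytes, and the packing rule is a simplification of
what start_new_rx_buffer() is described as doing (a per-page chunk that does
not fit in the space left in the current slot starts a new slot instead of
being split). It is not the netback code itself.

```c
#include <assert.h>

#define PAGE_SZ 4096u	/* assumed page size; slots are assumed equal in size */

/* The simple estimate: DIV_ROUND_UP(len, PAGE_SIZE). */
static unsigned int naive_slots(unsigned int len)
{
	return (len + PAGE_SZ - 1) / PAGE_SZ;
}

/*
 * Simplified model of the real packing: the linear data is walked one
 * source page at a time, and a chunk that would overflow a partly
 * filled slot goes into a fresh slot rather than being split.
 */
static unsigned int packed_slots(unsigned int offset, unsigned int len)
{
	unsigned int slots = 1;		/* the first slot is always needed */
	unsigned int copy_off = 0;	/* bytes already placed in the current slot */

	assert(offset < PAGE_SZ && len > 0);

	while (len > 0) {
		unsigned int bytes = PAGE_SZ - offset;	/* rest of this source page */

		if (bytes > len)
			bytes = len;

		if (copy_off != 0 && copy_off + bytes > PAGE_SZ) {
			slots++;	/* chunk does not fit: open a new slot */
			copy_off = 0;
		}

		copy_off += bytes;
		offset = (offset + bytes) % PAGE_SZ;
		len -= bytes;
	}
	return slots;
}
```

With data starting 3072 bytes into a page and 6144 bytes long (less than two
pages, but spanning three), the simple estimate gives 2 while the model
gives 3, which matches the diagram in the patch description.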

David

^ permalink raw reply

* [PATCH] ethernet/arc/arc_emac: Fix huge delays in large file copies
From: Vineet Gupta @ 2013-09-04 11:47 UTC (permalink / raw)
  To: netdev
  Cc: Vineet Gupta, Alexey Brodkin, David S. Miller, linux-kernel,
	arc-linux-dev

Copying large files to an NFS-mounted host was taking an absurdly long
time.

It turns out that TX BD reclaim had a subtle bug.

The loop starts from the @txbd_dirty cursor and stops when it hits a BD
still in use by the controller. When it stops, it needs to keep the
cursor at that very BD so that scanning resumes there in the next
iteration. Instead it was erroneously incrementing the cursor, causing
the next scan(s) to fail too, unless the BD chain was completely drained.
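
The intended cursor behaviour can be shown with a small standalone model.
This is a sketch, not the driver code: struct bd, RING_SIZE and tx_clean()
are hypothetical stand-ins, with owned_by_hw playing the role of the
FOR_EMAC bit and has_data the role of a non-zero txbd->data.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 4	/* hypothetical ring size, stands in for TX_BD_NUM */

struct bd {
	bool owned_by_hw;	/* descriptor still in use by the controller */
	bool has_data;		/* descriptor holds a buffer to reclaim */
};

/*
 * Reclaim completed descriptors starting at *dirty and return how many
 * were freed. The cursor is advanced only after a descriptor has really
 * been reclaimed, so a scan that stops early resumes at the same BD.
 */
static unsigned int tx_clean(struct bd *ring, unsigned int *dirty)
{
	unsigned int freed = 0;

	assert(*dirty < RING_SIZE);

	for (unsigned int i = 0; i < RING_SIZE; i++) {
		struct bd *d = &ring[*dirty];

		if (d->owned_by_hw || !d->has_data)
			break;		/* cursor stays on this BD */

		d->has_data = false;	/* release the buffer */
		freed++;
		*dirty = (*dirty + 1) % RING_SIZE;
	}
	return freed;
}
```

If the increment were placed before the break, as in the old code, the
cursor would step past a BD that is still in flight and that BD would not
be examined again until the cursor wrapped around.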

[ARCLinux]$ ls -l -sh /disk/log.txt
 17976 -rw-r--r--    1 root     root       17.5M Sep  /disk/log.txt

========== Before =====================
[ARCLinux]$ time cp /disk/log.txt /mnt/.
real    31m 7.95s
user    0m 0.00s
sys     0m 0.10s

========== After =====================
[ARCLinux]$ time cp /disk/log.txt /mnt/.
real    0m 24.33s
user    0m 0.00s
sys     0m 0.19s

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Cc: Alexey Brodkin <abrodkin@synopsys.com> (commit_signer:3/4=75%)
Cc: "David S. Miller" <davem@davemloft.net> (commit_signer:3/4=75%)
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: arc-linux-dev@synopsys.com
---
 drivers/net/ethernet/arc/emac_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
index 55d79cb..9e16014 100644
--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -149,8 +149,6 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 		struct sk_buff *skb = tx_buff->skb;
 		unsigned int info = le32_to_cpu(txbd->info);
 
-		*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
-
 		if ((info & FOR_EMAC) || !txbd->data)
 			break;
 
@@ -180,6 +178,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 		txbd->data = 0;
 		txbd->info = 0;
 
+		*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
+
 		if (netif_queue_stopped(ndev))
 			netif_wake_queue(ndev);
 	}
-- 
1.8.1.2

^ permalink raw reply related

* [PATCH v5] ipv6:introduce function to find route for redirect
From: Duan Jiong @ 2013-09-04 11:44 UTC (permalink / raw)
  To: davem; +Cc: duanj.fnst, netdev, hannes

From: Duan Jiong <duanj.fnst@cn.fujitsu.com>

RFC 4861 says that the IP source address of the Redirect is the
same as the current first-hop router for the specified ICMP
Destination Address, so the gateway should be taken into
consideration when we find the route for redirect.
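
As a rough illustration of the check, the per-route test amounts to the
predicate below. This is a simplified sketch: struct addr6, struct route and
redirect_matches_route() are hypothetical names, and route expiry and error
routes, which the patch also handles, are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct addr6 {
	unsigned char b[16];	/* IPv6 address, network byte order */
};

struct route {
	bool is_gateway;	/* stands in for the RTF_GATEWAY flag */
	int ifindex;		/* outgoing interface of the route */
	struct addr6 gateway;	/* next hop of the route */
};

/*
 * A redirect is accepted for a route only if the route goes through a
 * gateway, uses the interface the redirect arrived on, and that gateway
 * is the sender of the redirect.
 */
static bool redirect_matches_route(const struct route *rt, int in_ifindex,
				   const struct addr6 *redirect_src)
{
	assert(rt && redirect_src);

	if (!rt->is_gateway)
		return false;
	if (rt->ifindex != in_ifindex)
		return false;
	return memcmp(&rt->gateway, redirect_src, sizeof(*redirect_src)) == 0;
}
```

A redirect sent by some other link-local host on the same segment fails the
last comparison, which is the case the patch is meant to reject.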

There was once a check in commit
a6279458c534d01ccc39498aba61c93083ee0372 ("NDISC: Search over
all possible rules on receipt of redirect.") and the check
went away in commit b94f1c0904da9b8bf031667afc48080ba7c3e8c9
("ipv6: Use icmpv6_notify() to propagate redirect, instead of
rt6_redirect()").

The bug is only "exploitable" on layer-2 because the source
address of the redirect is checked to be a valid link-local
address but it makes spoofing a lot easier in the same L2
domain nonetheless.

Thanks very much to Hannes for his help.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
---
 Changes for v5:
 1. Check dst.error when looking up the route for redirect.

 net/ipv6/ah6.c     |  2 +-
 net/ipv6/esp6.c    |  2 +-
 net/ipv6/icmp.c    |  2 +-
 net/ipv6/ipcomp6.c |  2 +-
 net/ipv6/ndisc.c   |  3 +-
 net/ipv6/route.c   | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 6 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index bb02e17..73784c3 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -628,7 +628,7 @@ static void ah6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		return;
 
 	if (type == NDISC_REDIRECT)
-		ip6_redirect(skb, net, 0, 0);
+		ip6_redirect(skb, net, skb->dev->ifindex, 0);
 	else
 		ip6_update_pmtu(skb, net, info, 0, 0);
 	xfrm_state_put(x);
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index aeac0dc..d3618a7 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -447,7 +447,7 @@ static void esp6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		return;
 
 	if (type == NDISC_REDIRECT)
-		ip6_redirect(skb, net, 0, 0);
+		ip6_redirect(skb, net, skb->dev->ifindex, 0);
 	else
 		ip6_update_pmtu(skb, net, info, 0, 0);
 	xfrm_state_put(x);
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 7cfc8d2..73681c2 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -92,7 +92,7 @@ static void icmpv6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	if (type == ICMPV6_PKT_TOOBIG)
 		ip6_update_pmtu(skb, net, info, 0, 0);
 	else if (type == NDISC_REDIRECT)
-		ip6_redirect(skb, net, 0, 0);
+		ip6_redirect(skb, net, skb->dev->ifindex, 0);
 
 	if (!(type & ICMPV6_INFOMSG_MASK))
 		if (icmp6->icmp6_type == ICMPV6_ECHO_REQUEST)
diff --git a/net/ipv6/ipcomp6.c b/net/ipv6/ipcomp6.c
index 7af5aee..5636a91 100644
--- a/net/ipv6/ipcomp6.c
+++ b/net/ipv6/ipcomp6.c
@@ -76,7 +76,7 @@ static void ipcomp6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		return;
 
 	if (type == NDISC_REDIRECT)
-		ip6_redirect(skb, net, 0, 0);
+		ip6_redirect(skb, net, skb->dev->ifindex, 0);
 	else
 		ip6_update_pmtu(skb, net, info, 0, 0);
 	xfrm_state_put(x);
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 04d31c2..70abece 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1370,7 +1370,8 @@ static void ndisc_redirect_rcv(struct sk_buff *skb)
 		return;
 
 	if (!ndopts.nd_opts_rh) {
-		ip6_redirect_no_header(skb, dev_net(skb->dev), 0, 0);
+		ip6_redirect_no_header(skb, dev_net(skb->dev),
+					skb->dev->ifindex, 0);
 		return;
 	}
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8d9a93e..36ddf3e 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1157,6 +1157,77 @@ void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
 }
 EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);
 
+/* Handle redirects */
+struct ip6rd_flowi {
+	struct flowi6 fl6;
+	struct in6_addr gateway;
+};
+
+static struct rt6_info *__ip6_route_redirect(struct net *net,
+					     struct fib6_table *table,
+					     struct flowi6 *fl6,
+					     int flags)
+{
+	struct ip6rd_flowi *rdfl = (struct ip6rd_flowi *)fl6;
+	struct rt6_info *rt;
+	struct fib6_node *fn;
+
+	/* Get the "current" route for this destination and
+	 * check if the redirect has come from appropriate router.
+	 *
+	 * RFC 4861 specifies that redirects should only be
+	 * accepted if they come from the nexthop to the target.
+	 * Due to the way the routes are chosen, this notion
+	 * is a bit fuzzy and one might need to check all possible
+	 * routes.
+	 */
+
+	read_lock_bh(&table->tb6_lock);
+	fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
+restart:
+	for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) {
+		if (rt6_check_expired(rt))
+			continue;
+		if (rt->dst.error)
+			break;
+		if (!(rt->rt6i_flags & RTF_GATEWAY))
+			continue;
+		if (fl6->flowi6_oif != rt->dst.dev->ifindex)
+			continue;
+		if (!ipv6_addr_equal(&rdfl->gateway, &rt->rt6i_gateway))
+			continue;
+		break;
+	}
+
+	if (!rt)
+		rt = net->ipv6.ip6_null_entry;
+	else if (rt->dst.error) {
+		rt = net->ipv6.ip6_null_entry;
+		goto out;
+	}
+	BACKTRACK(net, &fl6->saddr);
+out:
+	dst_hold(&rt->dst);
+
+	read_unlock_bh(&table->tb6_lock);
+
+	return rt;
+}
+
+static struct dst_entry *ip6_route_redirect(struct net *net,
+					const struct flowi6 *fl6,
+					const struct in6_addr *gateway)
+{
+	int flags = RT6_LOOKUP_F_HAS_SADDR;
+	struct ip6rd_flowi rdfl;
+
+	rdfl.fl6 = *fl6;
+	rdfl.gateway = *gateway;
+
+	return fib6_rule_lookup(net, &rdfl.fl6,
+				flags, __ip6_route_redirect);
+}
+
 void ip6_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark)
 {
 	const struct ipv6hdr *iph = (struct ipv6hdr *) skb->data;
@@ -1171,9 +1242,8 @@ void ip6_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark)
 	fl6.saddr = iph->saddr;
 	fl6.flowlabel = ip6_flowinfo(iph);
 
-	dst = ip6_route_output(net, NULL, &fl6);
-	if (!dst->error)
-		rt6_do_redirect(dst, NULL, skb);
+	dst = ip6_route_redirect(net, &fl6, &ipv6_hdr(skb)->saddr);
+	rt6_do_redirect(dst, NULL, skb);
 	dst_release(dst);
 }
 EXPORT_SYMBOL_GPL(ip6_redirect);
@@ -1193,9 +1263,8 @@ void ip6_redirect_no_header(struct sk_buff *skb, struct net *net, int oif,
 	fl6.daddr = msg->dest;
 	fl6.saddr = iph->daddr;
 
-	dst = ip6_route_output(net, NULL, &fl6);
-	if (!dst->error)
-		rt6_do_redirect(dst, NULL, skb);
+	dst = ip6_route_redirect(net, &fl6, &iph->saddr);
+	rt6_do_redirect(dst, NULL, skb);
 	dst_release(dst);
 }
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH v2 net-next] pkt_sched: fq: Fair Queue packet scheduler
From: Eric Dumazet @ 2013-09-04 11:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael S. Tsirkin
In-Reply-To: <1378290638.7360.85.camel@edumazet-glaptop>

On Wed, 2013-09-04 at 03:30 -0700, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 14:30 +0800, Jason Wang wrote:
> 
> > > And tcpdump would certainly help ;)
> > 
> > See attachment.
> > 
> 
> Nothing obvious on tcpdump (only that lot of frames are missing)
> 
> 1) Are you capturing part of the payload only (like tcpdump -s 128)
> 
> 2) What is the setup.
> 
> 3) tc -s -d qdisc

If you use FQ in the guest, then it could be that high resolution timers
have high latency?

So FQ arms short timers, but the effective duration could be much longer.

Here I get a smooth latency of up to ~3 us

lpq83:~# ./netperf -H lpq84 ; ./tc -s -d qd ; dmesg | tail -n1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 87380  16384  16384    10.00    9410.82   
qdisc fq 8005: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
 Sent 50545633991 bytes 33385894 pkt (dropped 0, overlimits 0 requeues 19) 
 rate 9258Mbit 764335pps backlog 0b 0p requeues 19 
  117 flow, 115 inactive, 0 throttled
  0 gc, 0 highprio, 0 retrans, 96861 throttled, 0 flows_plimit
[  572.551664] latency = 3035 ns


What do you get with this debugging patch?

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 32ad015..c1312a0 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -103,6 +103,7 @@ struct fq_sched_data {
 	u64		stat_internal_packets;
 	u64		stat_tcp_retrans;
 	u64		stat_throttled;
+	s64		slatency;
 	u64		stat_flows_plimit;
 	u64		stat_pkts_too_long;
 	u64		stat_allocation_errors;
@@ -393,6 +394,7 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 {
 	struct rb_node *p;
+	bool first = true;
 
 	if (q->time_next_delayed_flow > now)
 		return;
@@ -405,6 +407,13 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 			q->time_next_delayed_flow = f->time_next_packet;
 			break;
 		}
+		if (first) {
+			s64 delay = now - f->time_next_packet;
+
+			first = false;
+			delay -= q->slatency >> 3;
+			q->slatency += delay;
+		}
 		rb_erase(p, &q->delayed);
 		q->throttled_flows--;
 		fq_flow_add_tail(&q->old_flows, f);
@@ -711,6 +720,7 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (opts == NULL)
 		goto nla_put_failure;
 
+	pr_err("latency = %lld ns\n", q->slatency >> 3);
 	if (nla_put_u32(skb, TCA_FQ_PLIMIT, sch->limit) ||
 	    nla_put_u32(skb, TCA_FQ_FLOW_PLIMIT, q->flow_plimit) ||
 	    nla_put_u32(skb, TCA_FQ_QUANTUM, q->quantum) ||
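
For readers unfamiliar with the idiom, the three lines that update
q->slatency keep an exponentially weighted moving average with weight 1/8,
stored scaled by 8 so that no division is needed. The standalone sketch
below shows the same arithmetic; struct ewma8, ewma8_add() and ewma8_read()
are hypothetical names, not part of sch_fq. Note that it relies on an
arithmetic right shift for negative values, which is implementation-defined
in ISO C but is what the kernel's supported compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* Smoothed value, stored multiplied by 8 as in the debugging patch. */
struct ewma8 {
	int64_t scaled;
};

/* Fold one sample in: avg += (sample - avg) / 8, done on the scaled value. */
static void ewma8_add(struct ewma8 *e, int64_t sample)
{
	assert(e);

	int64_t delta = sample - (e->scaled >> 3);

	e->scaled += delta;
}

/* Read back the smoothed value in the same unit as the samples. */
static int64_t ewma8_read(const struct ewma8 *e)
{
	assert(e);
	return e->scaled >> 3;
}
```

Starting from zero, a first sample of 3000 ns reads back as 375 ns, and a
steady stream of 3000 ns samples converges to 3000 ns, so the printed value
needs a number of dequeues before it settles.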

^ permalink raw reply related

* [PATCH net-next 1/2] bnx2x: VF RSS support - PF side
From: Ariel Elior @ 2013-09-04 11:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eilon Greenstein, Ariel Elior
In-Reply-To: <1378292962-5537-1-git-send-email-ariele@broadcom.com>

This patch adds support for Receive Side Scaling for queues of
Virtual Functions on the PF side. This includes support for the
requests for multiple queues from VF drivers, configuration of the
HW for multiple queues per VF, and support for RSS configuration
of said queues.

Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h       |   32 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |    5 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c  |    2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_reg.h   |    1 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c    |   10 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.h    |    2 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c |  386 +++++++++++++++------
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h |   32 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c  |  146 ++++++++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.h  |   41 ++-
 10 files changed, 513 insertions(+), 144 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 3e77a1b..0c33802 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -825,15 +825,13 @@ static inline bool bnx2x_fp_ll_polling(struct bnx2x_fastpath *fp)
 #define BD_UNMAP_LEN(bd)		(le16_to_cpu((bd)->nbytes))
 
 #define BNX2X_DB_MIN_SHIFT		3	/* 8 bytes */
-#define BNX2X_DB_SHIFT			7	/* 128 bytes*/
+#define BNX2X_DB_SHIFT			3	/* 8 bytes*/
 #if (BNX2X_DB_SHIFT < BNX2X_DB_MIN_SHIFT)
 #error "Min DB doorbell stride is 8"
 #endif
-#define DPM_TRIGER_TYPE			0x40
 #define DOORBELL(bp, cid, val) \
 	do { \
-		writel((u32)(val), bp->doorbells + (bp->db_size * (cid)) + \
-		       DPM_TRIGER_TYPE); \
+		writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \
 	} while (0)
 
 /* TX CSUM helpers */
@@ -1100,13 +1098,27 @@ struct bnx2x_port {
 extern struct workqueue_struct *bnx2x_wq;
 
 #define BNX2X_MAX_NUM_OF_VFS	64
-#define BNX2X_VF_CID_WND	0
+#define BNX2X_VF_CID_WND	4 /* log num of queues per VF. HW config. */
 #define BNX2X_CIDS_PER_VF	(1 << BNX2X_VF_CID_WND)
-#define BNX2X_CLIENTS_PER_VF	1
-#define BNX2X_FIRST_VF_CID	256
+
+/* We need to reserve doorbell addresses for all VF and queue combinations */
 #define BNX2X_VF_CIDS		(BNX2X_MAX_NUM_OF_VFS * BNX2X_CIDS_PER_VF)
+
+/* The doorbell is configured to have the same number of CIDs for PFs and for
+ * VFs. For this reason the PF CID zone is as large as the VF zone.
+ */
+#define BNX2X_FIRST_VF_CID	BNX2X_VF_CIDS
+#define BNX2X_MAX_NUM_VF_QUEUES	64
 #define BNX2X_VF_ID_INVALID	0xFF
 
+/* the number of VF CIDS multiplied by the amount of bytes reserved for each
+ * cid must not exceed the size of the VF doorbell
+ */
+#define BNX2X_VF_BAR_SIZE	512
+#if (BNX2X_VF_BAR_SIZE < BNX2X_CIDS_PER_VF * (1 << BNX2X_DB_SHIFT))
+#error "VF doorbell bar size is 512"
+#endif
+
 /*
  * The total number of L2 queues, MSIX vectors and HW contexts (CIDs) is
  * control by the number of fast-path status blocks supported by the
@@ -1650,10 +1662,10 @@ struct bnx2x {
 	dma_addr_t			fw_stats_data_mapping;
 	int				fw_stats_data_sz;
 
-	/* For max 196 cids (64*3 + non-eth), 32KB ILT page size and 1KB
+	/* For max 1024 cids (VF RSS), 32KB ILT page size and 1KB
 	 * context size we need 8 ILT entries.
 	 */
-#define ILT_MAX_L2_LINES	8
+#define ILT_MAX_L2_LINES	32
 	struct hw_context	context[ILT_MAX_L2_LINES];
 
 	struct bnx2x_ilt	*ilt;
@@ -1869,7 +1881,7 @@ extern int num_queues;
 #define FUNC_FLG_TPA		0x0008
 #define FUNC_FLG_SPQ		0x0010
 #define FUNC_FLG_LEADING	0x0020	/* PF only */
-
+#define FUNC_FLG_LEADING_STATS	0x0040
 struct bnx2x_func_init_params {
 	/* dma */
 	dma_addr_t	fw_stat_map;	/* valid iff FUNC_FLG_STATS */
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 2e90868..e7400d9 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4784,6 +4784,11 @@ int bnx2x_resume(struct pci_dev *pdev)
 void bnx2x_set_ctx_validation(struct bnx2x *bp, struct eth_context *cxt,
 			      u32 cid)
 {
+	if (!cxt) {
+		BNX2X_ERR("bad context pointer %p\n", cxt);
+		return;
+	}
+
 	/* ustorm cxt validation */
 	cxt->ustorm_ag_context.cdu_usage =
 		CDU_RSRVD_VALUE_TYPE_A(HW_CID(bp, cid),
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 17f117c..5729aa7 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -6893,7 +6893,7 @@ static int bnx2x_init_hw_common(struct bnx2x *bp)
 		bnx2x_init_block(bp, BLOCK_TM, PHASE_COMMON);
 
 	bnx2x_init_block(bp, BLOCK_DORQ, PHASE_COMMON);
-	REG_WR(bp, DORQ_REG_DPM_CID_OFST, BNX2X_DB_SHIFT);
+
 	if (!CHIP_REV_IS_SLOW(bp))
 		/* enable hw interrupt from doorbell Q */
 		REG_WR(bp, DORQ_REG_DORQ_INT_MASK, 0);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_reg.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_reg.h
index 8e627b8..5ecf267 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_reg.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_reg.h
@@ -6335,6 +6335,7 @@
 #define PCI_ID_VAL2					0x438
 #define PCI_ID_VAL3					0x43c
 
+#define GRC_CONFIG_REG_VF_MSIX_CONTROL		    0x61C
 #define GRC_CONFIG_REG_PF_INIT_VF		0x624
 #define GRC_CR_PF_INIT_VF_PF_FIRST_VF_NUM_MASK	0xf
 /* First VF_NUM for PF is encoded in this register.
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index 1d46b68..9fbeee5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -4416,6 +4416,16 @@ void bnx2x_init_rss_config_obj(struct bnx2x *bp,
 	rss_obj->config_rss = bnx2x_setup_rss;
 }
 
+int validate_vlan_mac(struct bnx2x *bp,
+		      struct bnx2x_vlan_mac_obj *vlan_mac)
+{
+	if (!vlan_mac->get_n_elements) {
+		BNX2X_ERR("vlan mac object was not initialized\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
 /********************** Queue state object ***********************************/
 
 /**
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.h
index 533a3ab..658f4e3 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.h
@@ -1407,4 +1407,6 @@ int bnx2x_config_rss(struct bnx2x *bp,
 void bnx2x_get_rss_ind_table(struct bnx2x_rss_config_obj *rss_obj,
 			     u8 *ind_table);
 
+int validate_vlan_mac(struct bnx2x *bp,
+		      struct bnx2x_vlan_mac_obj *vlan_mac);
 #endif /* BNX2X_SP_VERBS */
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
index fbc026c..73731eb 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -170,6 +170,11 @@ enum bnx2x_vfop_qteardown_state {
 	   BNX2X_VFOP_QTEARDOWN_DONE
 };
 
+enum bnx2x_vfop_rss_state {
+	   BNX2X_VFOP_RSS_CONFIG,
+	   BNX2X_VFOP_RSS_DONE
+};
+
 #define bnx2x_vfop_reset_wq(vf)	atomic_set(&vf->op_in_progress, 0)
 
 void bnx2x_vfop_qctor_dump_tx(struct bnx2x *bp, struct bnx2x_virtf *vf,
@@ -265,11 +270,6 @@ void bnx2x_vfop_qctor_prep(struct bnx2x *bp,
 	__set_bit(BNX2X_Q_FLG_TX_SEC, &setup_p->flags);
 	__set_bit(BNX2X_Q_FLG_ANTI_SPOOF, &setup_p->flags);
 
-	if (vfq_is_leading(q)) {
-		__set_bit(BNX2X_Q_FLG_LEADING_RSS, &setup_p->flags);
-		__set_bit(BNX2X_Q_FLG_MCAST, &setup_p->flags);
-	}
-
 	/* Setup-op rx parameters */
 	if (test_bit(BNX2X_Q_TYPE_HAS_RX, &q_type)) {
 		struct bnx2x_rxq_setup_params *rxq_p = &setup_p->rxq_params;
@@ -398,7 +398,11 @@ static void bnx2x_vfop_qdtor(struct bnx2x *bp, struct bnx2x_virtf *vf)
 		    BNX2X_Q_LOGICAL_STATE_STOPPED) {
 			DP(BNX2X_MSG_IOV,
 			   "Entered qdtor but queue was already stopped. Aborting gracefully\n");
-			goto op_done;
+
+			/* next state */
+			vfop->state = BNX2X_VFOP_QDTOR_DONE;
+
+			bnx2x_vfop_finalize(vf, vfop->rc, VFOP_CONT);
 		}
 
 		/* next state */
@@ -432,8 +436,10 @@ op_err:
 op_done:
 	case BNX2X_VFOP_QDTOR_DONE:
 		/* invalidate the context */
-		qdtor->cxt->ustorm_ag_context.cdu_usage = 0;
-		qdtor->cxt->xstorm_ag_context.cdu_reserved = 0;
+		if (qdtor->cxt) {
+			qdtor->cxt->ustorm_ag_context.cdu_usage = 0;
+			qdtor->cxt->xstorm_ag_context.cdu_reserved = 0;
+		}
 		bnx2x_vfop_end(bp, vf, vfop);
 		return;
 	default:
@@ -465,7 +471,8 @@ static int bnx2x_vfop_qdtor_cmd(struct bnx2x *bp,
 		return bnx2x_vfop_transition(bp, vf, bnx2x_vfop_qdtor,
 					     cmd->block);
 	}
-	DP(BNX2X_MSG_IOV, "VF[%d] failed to add a vfop.\n", vf->abs_vfid);
+	DP(BNX2X_MSG_IOV, "VF[%d] failed to add a vfop. rc %d\n",
+	   vf->abs_vfid, vfop->rc);
 	return -ENOMEM;
 }
 
@@ -474,10 +481,18 @@ bnx2x_vf_set_igu_info(struct bnx2x *bp, u8 igu_sb_id, u8 abs_vfid)
 {
 	struct bnx2x_virtf *vf = bnx2x_vf_by_abs_fid(bp, abs_vfid);
 	if (vf) {
+		/* the first igu entry belonging to VFs of this PF */
+		if (!BP_VFDB(bp)->first_vf_igu_entry)
+			BP_VFDB(bp)->first_vf_igu_entry = igu_sb_id;
+
+		/* the first igu entry belonging to this VF */
 		if (!vf_sb_count(vf))
 			vf->igu_base_id = igu_sb_id;
+
 		++vf_sb_count(vf);
+		++vf->sb_count;
 	}
+	BP_VFDB(bp)->vf_sbs_pool++;
 }
 
 /* VFOP MAC/VLAN helpers */
@@ -733,6 +748,7 @@ static int bnx2x_vfop_mac_delall_cmd(struct bnx2x *bp,
 				     int qid, bool drv_only)
 {
 	struct bnx2x_vfop *vfop = bnx2x_vfop_add(bp, vf);
+	int rc;
 
 	if (vfop) {
 		struct bnx2x_vfop_args_filters filters = {
@@ -752,6 +768,9 @@ static int bnx2x_vfop_mac_delall_cmd(struct bnx2x *bp,
 		bnx2x_vfop_mac_prep_ramrod(ramrod, &flags);
 
 		/* set object */
+		rc = validate_vlan_mac(bp, &bnx2x_vfq(vf, qid, mac_obj));
+		if (rc)
+			return rc;
 		ramrod->vlan_mac_obj = &bnx2x_vfq(vf, qid, mac_obj);
 
 		/* set extra args */
@@ -772,6 +791,7 @@ int bnx2x_vfop_mac_list_cmd(struct bnx2x *bp,
 			    int qid, bool drv_only)
 {
 	struct bnx2x_vfop *vfop = bnx2x_vfop_add(bp, vf);
+	int rc;
 
 	if (vfop) {
 		struct bnx2x_vfop_args_filters filters = {
@@ -794,6 +814,9 @@ int bnx2x_vfop_mac_list_cmd(struct bnx2x *bp,
 		bnx2x_vfop_mac_prep_ramrod(ramrod, &flags);
 
 		/* set object */
+		rc = validate_vlan_mac(bp, &bnx2x_vfq(vf, qid, mac_obj));
+		if (rc)
+			return rc;
 		ramrod->vlan_mac_obj = &bnx2x_vfq(vf, qid, mac_obj);
 
 		/* set extra args */
@@ -814,6 +837,7 @@ int bnx2x_vfop_vlan_set_cmd(struct bnx2x *bp,
 			    int qid, u16 vid, bool add)
 {
 	struct bnx2x_vfop *vfop = bnx2x_vfop_add(bp, vf);
+	int rc;
 
 	if (vfop) {
 		struct bnx2x_vfop_args_filters filters = {
@@ -834,6 +858,9 @@ int bnx2x_vfop_vlan_set_cmd(struct bnx2x *bp,
 		ramrod->user_req.u.vlan.vlan = vid;
 
 		/* set object */
+		rc = validate_vlan_mac(bp, &bnx2x_vfq(vf, qid, vlan_obj));
+		if (rc)
+			return rc;
 		ramrod->vlan_mac_obj = &bnx2x_vfq(vf, qid, vlan_obj);
 
 		/* set extra args */
@@ -853,6 +880,7 @@ static int bnx2x_vfop_vlan_delall_cmd(struct bnx2x *bp,
 			       int qid, bool drv_only)
 {
 	struct bnx2x_vfop *vfop = bnx2x_vfop_add(bp, vf);
+	int rc;
 
 	if (vfop) {
 		struct bnx2x_vfop_args_filters filters = {
@@ -872,6 +900,9 @@ static int bnx2x_vfop_vlan_delall_cmd(struct bnx2x *bp,
 		bnx2x_vfop_vlan_mac_prep_ramrod(ramrod, &flags);
 
 		/* set object */
+		rc = validate_vlan_mac(bp, &bnx2x_vfq(vf, qid, vlan_obj));
+		if (rc)
+			return rc;
 		ramrod->vlan_mac_obj = &bnx2x_vfq(vf, qid, vlan_obj);
 
 		/* set extra args */
@@ -892,6 +923,7 @@ int bnx2x_vfop_vlan_list_cmd(struct bnx2x *bp,
 			     int qid, bool drv_only)
 {
 	struct bnx2x_vfop *vfop = bnx2x_vfop_add(bp, vf);
+	int rc;
 
 	if (vfop) {
 		struct bnx2x_vfop_args_filters filters = {
@@ -911,6 +943,9 @@ int bnx2x_vfop_vlan_list_cmd(struct bnx2x *bp,
 		bnx2x_vfop_vlan_mac_prep_ramrod(ramrod, &flags);
 
 		/* set object */
+		rc = validate_vlan_mac(bp, &bnx2x_vfq(vf, qid, vlan_obj));
+		if (rc)
+			return rc;
 		ramrod->vlan_mac_obj = &bnx2x_vfq(vf, qid, vlan_obj);
 
 		/* set extra args */
@@ -1021,21 +1056,25 @@ static void bnx2x_vfop_qflr(struct bnx2x *bp, struct bnx2x_virtf *vf)
 	case BNX2X_VFOP_QFLR_CLR_VLAN:
 		/* vlan-clear-all: driver-only, don't consume credit */
 		vfop->state = BNX2X_VFOP_QFLR_CLR_MAC;
-		vfop->rc = bnx2x_vfop_vlan_delall_cmd(bp, vf, &cmd, qid, true);
+		if (!validate_vlan_mac(bp, &bnx2x_vfq(vf, qid, vlan_obj)))
+			vfop->rc = bnx2x_vfop_vlan_delall_cmd(bp, vf, &cmd, qid,
+							      true);
 		if (vfop->rc)
 			goto op_err;
-		return;
+		bnx2x_vfop_finalize(vf, vfop->rc, VFOP_CONT);
 
 	case BNX2X_VFOP_QFLR_CLR_MAC:
 		/* mac-clear-all: driver only consume credit */
 		vfop->state = BNX2X_VFOP_QFLR_TERMINATE;
-		vfop->rc = bnx2x_vfop_mac_delall_cmd(bp, vf, &cmd, qid, true);
+		if (!validate_vlan_mac(bp, &bnx2x_vfq(vf, qid, mac_obj)))
+			vfop->rc = bnx2x_vfop_mac_delall_cmd(bp, vf, &cmd, qid,
+							     true);
 		DP(BNX2X_MSG_IOV,
 		   "VF[%d] vfop->rc after bnx2x_vfop_mac_delall_cmd was %d",
 		   vf->abs_vfid, vfop->rc);
 		if (vfop->rc)
 			goto op_err;
-		return;
+		bnx2x_vfop_finalize(vf, vfop->rc, VFOP_CONT);
 
 	case BNX2X_VFOP_QFLR_TERMINATE:
 		qstate = &vfop->op_p->qctor.qstate;
@@ -1332,10 +1371,13 @@ int bnx2x_vfop_qdown_cmd(struct bnx2x *bp,
 {
 	struct bnx2x_vfop *vfop = bnx2x_vfop_add(bp, vf);
 
+	/* for non-leading queues, skip directly to qdown state */
 	if (vfop) {
 		vfop->args.qx.qid = qid;
-		bnx2x_vfop_opset(BNX2X_VFOP_QTEARDOWN_RXMODE,
-				 bnx2x_vfop_qdown, cmd->done);
+		bnx2x_vfop_opset(qid == LEADING_IDX ?
+				 BNX2X_VFOP_QTEARDOWN_RXMODE :
+				 BNX2X_VFOP_QTEARDOWN_QDTOR, bnx2x_vfop_qdown,
+				 cmd->done);
 		return bnx2x_vfop_transition(bp, vf, bnx2x_vfop_qdown,
 					     cmd->block);
 	}
@@ -1488,15 +1530,16 @@ int bnx2x_vf_flr_clnup_epilog(struct bnx2x *bp, u8 abs_vfid)
  * both known
  */
 static void
-bnx2x_iov_static_resc(struct bnx2x *bp, struct vf_pf_resc_request *resc)
+bnx2x_iov_static_resc(struct bnx2x *bp, struct bnx2x_virtf *vf)
 {
+	struct vf_pf_resc_request *resc = &vf->alloc_resc;
 	u16 vlan_count = 0;
 
 	/* will be set only during VF-ACQUIRE */
 	resc->num_rxqs = 0;
 	resc->num_txqs = 0;
 
-	/* no credit calculcis for macs (just yet) */
+	/* no credit calculations for macs (just yet) */
 	resc->num_mac_filters = 1;
 
 	/* divvy up vlan rules */
@@ -1508,13 +1551,14 @@ bnx2x_iov_static_resc(struct bnx2x *bp, struct vf_pf_resc_request *resc)
 	resc->num_mc_filters = 0;
 
 	/* num_sbs already set */
+	resc->num_sbs = vf->sb_count;
 }
 
 /* FLR routines: */
 static void bnx2x_vf_free_resc(struct bnx2x *bp, struct bnx2x_virtf *vf)
 {
 	/* reset the state variables */
-	bnx2x_iov_static_resc(bp, &vf->alloc_resc);
+	bnx2x_iov_static_resc(bp, vf);
 	vf->state = VF_FREE;
 }
 
@@ -1734,8 +1778,7 @@ void bnx2x_iov_init_dq(struct bnx2x *bp)
 	/* The VF doorbell size  0 - *B, 4 - 128B. We set it here to match
 	 * the Pf doorbell size although the 2 are independent.
 	 */
-	REG_WR(bp, DORQ_REG_VF_NORM_CID_OFST,
-	       BNX2X_DB_SHIFT - BNX2X_DB_MIN_SHIFT);
+	REG_WR(bp, DORQ_REG_VF_NORM_CID_OFST, 3);
 
 	/* No security checks for now -
 	 * configure single rule (out of 16) mask = 0x1, value = 0x0,
@@ -1802,7 +1845,7 @@ bnx2x_get_vf_igu_cam_info(struct bnx2x *bp)
 {
 	int sb_id;
 	u32 val;
-	u8 fid;
+	u8 fid, current_pf = 0;
 
 	/* IGU in normal mode - read CAM */
 	for (sb_id = 0; sb_id < IGU_REG_MAPPING_MEMORY_SIZE; sb_id++) {
@@ -1810,16 +1853,18 @@ bnx2x_get_vf_igu_cam_info(struct bnx2x *bp)
 		if (!(val & IGU_REG_MAPPING_MEMORY_VALID))
 			continue;
 		fid = GET_FIELD((val), IGU_REG_MAPPING_MEMORY_FID);
-		if (!(fid & IGU_FID_ENCODE_IS_PF))
+		if (fid & IGU_FID_ENCODE_IS_PF)
+			current_pf = fid & IGU_FID_PF_NUM_MASK;
+		else if (current_pf == BP_ABS_FUNC(bp))
 			bnx2x_vf_set_igu_info(bp, sb_id,
 					      (fid & IGU_FID_VF_NUM_MASK));
-
 		DP(BNX2X_MSG_IOV, "%s[%d], igu_sb_id=%d, msix=%d\n",
 		   ((fid & IGU_FID_ENCODE_IS_PF) ? "PF" : "VF"),
 		   ((fid & IGU_FID_ENCODE_IS_PF) ? (fid & IGU_FID_PF_NUM_MASK) :
 		   (fid & IGU_FID_VF_NUM_MASK)), sb_id,
 		   GET_FIELD((val), IGU_REG_MAPPING_MEMORY_VECTOR));
 	}
+	DP(BNX2X_MSG_IOV, "vf_sbs_pool is %d\n", BP_VFDB(bp)->vf_sbs_pool);
 }
 
 static void __bnx2x_iov_free_vfdb(struct bnx2x *bp)
@@ -1885,23 +1930,11 @@ static int bnx2x_sriov_info(struct bnx2x *bp, struct bnx2x_sriov *iov)
 	return 0;
 }
 
-static u8 bnx2x_iov_get_max_queue_count(struct bnx2x *bp)
-{
-	int i;
-	u8 queue_count = 0;
-
-	if (IS_SRIOV(bp))
-		for_each_vf(bp, i)
-			queue_count += bnx2x_vf(bp, i, alloc_resc.num_sbs);
-
-	return queue_count;
-}
-
 /* must be called after PF bars are mapped */
 int bnx2x_iov_init_one(struct bnx2x *bp, int int_mode_param,
-			int num_vfs_param)
+		       int num_vfs_param)
 {
-	int err, i, qcount;
+	int err, i;
 	struct bnx2x_sriov *iov;
 	struct pci_dev *dev = bp->pdev;
 
@@ -1999,12 +2032,13 @@ int bnx2x_iov_init_one(struct bnx2x *bp, int int_mode_param,
 	/* re-read the IGU CAM for VFs - index and abs_vfid must be set */
 	bnx2x_get_vf_igu_cam_info(bp);
 
-	/* get the total queue count and allocate the global queue arrays */
-	qcount = bnx2x_iov_get_max_queue_count(bp);
-
 	/* allocate the queue arrays for all VFs */
-	bp->vfdb->vfqs = kzalloc(qcount * sizeof(struct bnx2x_vf_queue),
-				 GFP_KERNEL);
+	bp->vfdb->vfqs = kzalloc(
+		BNX2X_MAX_NUM_VF_QUEUES * sizeof(struct bnx2x_vf_queue),
+		GFP_KERNEL);
+
+	DP(BNX2X_MSG_IOV, "bp->vfdb->vfqs was %p\n", bp->vfdb->vfqs);
+
 	if (!bp->vfdb->vfqs) {
 		BNX2X_ERR("failed to allocate vf queue array\n");
 		err = -ENOMEM;
@@ -2125,49 +2159,14 @@ static void bnx2x_vfq_init(struct bnx2x *bp, struct bnx2x_virtf *vf,
 			     q_type);
 
 	DP(BNX2X_MSG_IOV,
-	   "initialized vf %d's queue object. func id set to %d\n",
-	   vf->abs_vfid, q->sp_obj.func_id);
-
-	/* mac/vlan objects are per queue, but only those
-	 * that belong to the leading queue are initialized
-	 */
-	if (vfq_is_leading(q)) {
-		/* mac */
-		bnx2x_init_mac_obj(bp, &q->mac_obj,
-				   cl_id, q->cid, func_id,
-				   bnx2x_vf_sp(bp, vf, mac_rdata),
-				   bnx2x_vf_sp_map(bp, vf, mac_rdata),
-				   BNX2X_FILTER_MAC_PENDING,
-				   &vf->filter_state,
-				   BNX2X_OBJ_TYPE_RX_TX,
-				   &bp->macs_pool);
-		/* vlan */
-		bnx2x_init_vlan_obj(bp, &q->vlan_obj,
-				    cl_id, q->cid, func_id,
-				    bnx2x_vf_sp(bp, vf, vlan_rdata),
-				    bnx2x_vf_sp_map(bp, vf, vlan_rdata),
-				    BNX2X_FILTER_VLAN_PENDING,
-				    &vf->filter_state,
-				    BNX2X_OBJ_TYPE_RX_TX,
-				    &bp->vlans_pool);
-
-		/* mcast */
-		bnx2x_init_mcast_obj(bp, &vf->mcast_obj, cl_id,
-				     q->cid, func_id, func_id,
-				     bnx2x_vf_sp(bp, vf, mcast_rdata),
-				     bnx2x_vf_sp_map(bp, vf, mcast_rdata),
-				     BNX2X_FILTER_MCAST_PENDING,
-				     &vf->filter_state,
-				     BNX2X_OBJ_TYPE_RX_TX);
-
-		vf->leading_rss = cl_id;
-	}
+	   "initialized vf %d's queue object. func id set to %d. cid set to 0x%x\n",
+	   vf->abs_vfid, q->sp_obj.func_id, q->cid);
 }
 
 /* called by bnx2x_nic_load */
 int bnx2x_iov_nic_init(struct bnx2x *bp)
 {
-	int vfid, qcount, i;
+	int vfid;
 
 	if (!IS_SRIOV(bp)) {
 		DP(BNX2X_MSG_IOV, "vfdb was not allocated\n");
@@ -2196,7 +2195,7 @@ int bnx2x_iov_nic_init(struct bnx2x *bp)
 		   BNX2X_FIRST_VF_CID + base_vf_cid, base_cxt);
 
 		/* init statically provisioned resources */
-		bnx2x_iov_static_resc(bp, &vf->alloc_resc);
+		bnx2x_iov_static_resc(bp, vf);
 
 		/* queues are initialized during VF-ACQUIRE */
 
@@ -2232,13 +2231,12 @@ int bnx2x_iov_nic_init(struct bnx2x *bp)
 	}
 
 	/* Final VF init */
-	qcount = 0;
-	for_each_vf(bp, i) {
-		struct bnx2x_virtf *vf = BP_VF(bp, i);
+	for_each_vf(bp, vfid) {
+		struct bnx2x_virtf *vf = BP_VF(bp, vfid);
 
 		/* fill in the BDF and bars */
-		vf->bus = bnx2x_vf_bus(bp, i);
-		vf->devfn = bnx2x_vf_devfn(bp, i);
+		vf->bus = bnx2x_vf_bus(bp, vfid);
+		vf->devfn = bnx2x_vf_devfn(bp, vfid);
 		bnx2x_vf_set_bars(bp, vf);
 
 		DP(BNX2X_MSG_IOV,
@@ -2247,10 +2245,6 @@ int bnx2x_iov_nic_init(struct bnx2x *bp)
 		   (unsigned)vf->bars[0].bar, vf->bars[0].size,
 		   (unsigned)vf->bars[1].bar, vf->bars[1].size,
 		   (unsigned)vf->bars[2].bar, vf->bars[2].size);
-
-		/* set local queue arrays */
-		vf->vfqs = &bp->vfdb->vfqs[qcount];
-		qcount += bnx2x_vf(bp, i, alloc_resc.num_sbs);
 	}
 
 	return 0;
@@ -2556,6 +2550,9 @@ void bnx2x_iov_adjust_stats_req(struct bnx2x *bp)
 		for_each_vfq(vf, j) {
 			struct bnx2x_vf_queue *rxq = vfq_get(vf, j);
 
+			dma_addr_t q_stats_addr =
+				vf->fw_stat_map + j * vf->stats_stride;
+
 			/* collect stats fro active queues only */
 			if (bnx2x_get_q_logical_state(bp, &rxq->sp_obj) ==
 			    BNX2X_Q_LOGICAL_STATE_STOPPED)
@@ -2563,13 +2560,13 @@ void bnx2x_iov_adjust_stats_req(struct bnx2x *bp)
 
 			/* create stats query entry for this queue */
 			cur_query_entry->kind = STATS_TYPE_QUEUE;
-			cur_query_entry->index = vfq_cl_id(vf, rxq);
+			cur_query_entry->index = vfq_stat_id(vf, rxq);
 			cur_query_entry->funcID =
 				cpu_to_le16(FW_VF_HANDLE(vf->abs_vfid));
 			cur_query_entry->address.hi =
-				cpu_to_le32(U64_HI(vf->fw_stat_map));
+				cpu_to_le32(U64_HI(q_stats_addr));
 			cur_query_entry->address.lo =
-				cpu_to_le32(U64_LO(vf->fw_stat_map));
+				cpu_to_le32(U64_LO(q_stats_addr));
 			DP(BNX2X_MSG_IOV,
 			   "added address %x %x for vf %d queue %d client %d\n",
 			   cur_query_entry->address.hi,
@@ -2578,6 +2575,10 @@ void bnx2x_iov_adjust_stats_req(struct bnx2x *bp)
 			cur_query_entry++;
 			cur_data_offset += sizeof(struct per_queue_stats);
 			stats_count++;
+
+			/* all stats are coalesced to the leading queue */
+			if (vf->cfg_flags & VF_CFG_STATS_COALESCE)
+				break;
 		}
 	}
 	bp->fw_stats_req->hdr.cmd_num = bp->fw_stats_num + stats_count;
@@ -2596,6 +2597,11 @@ void bnx2x_iov_sp_task(struct bnx2x *bp)
 	for_each_vf(bp, i) {
 		struct bnx2x_virtf *vf = BP_VF(bp, i);
 
+		if (!vf) {
+			BNX2X_ERR("VF was null! skipping...\n");
+			continue;
+		}
+
 		if (!list_empty(&vf->op_list_head) &&
 		    atomic_read(&vf->op_in_progress)) {
 			DP(BNX2X_MSG_IOV, "running pending op for vf %d\n", i);
@@ -2743,7 +2749,7 @@ int bnx2x_vf_acquire(struct bnx2x *bp, struct bnx2x_virtf *vf,
 		struct bnx2x_vf_queue *q = vfq_get(vf, i);
 
 		if (!q) {
-			DP(BNX2X_MSG_IOV, "q number %d was not allocated\n", i);
+			BNX2X_ERR("q number %d was not allocated\n", i);
 			return -EINVAL;
 		}
 
@@ -2947,6 +2953,43 @@ op_done:
 	bnx2x_vfop_end(bp, vf, vfop);
 }
 
+static void bnx2x_vfop_rss(struct bnx2x *bp, struct bnx2x_virtf *vf)
+{
+	struct bnx2x_vfop *vfop = bnx2x_vfop_cur(bp, vf);
+	enum bnx2x_vfop_rss_state state;
+
+	if (!vfop) {
+		BNX2X_ERR("vfop was null\n");
+		return;
+	}
+
+	state = vfop->state;
+	bnx2x_vfop_reset_wq(vf);
+
+	if (vfop->rc < 0)
+		goto op_err;
+
+	DP(BNX2X_MSG_IOV, "vf[%d] STATE: %d\n", vf->abs_vfid, state);
+
+	switch (state) {
+	case BNX2X_VFOP_RSS_CONFIG:
+		/* next state */
+		vfop->state = BNX2X_VFOP_RSS_DONE;
+		bnx2x_config_rss(bp, &vfop->op_p->rss);
+		bnx2x_vfop_finalize(vf, vfop->rc, VFOP_DONE);
+op_err:
+		BNX2X_ERR("RSS error: rc %d\n", vfop->rc);
+op_done:
+	case BNX2X_VFOP_RSS_DONE:
+		bnx2x_vfop_end(bp, vf, vfop);
+		return;
+	default:
+		bnx2x_vfop_default(state);
+	}
+op_pending:
+	return;
+}
+
 int bnx2x_vfop_release_cmd(struct bnx2x *bp,
 			   struct bnx2x_virtf *vf,
 			   struct bnx2x_vfop_cmd *cmd)
@@ -2961,6 +3004,21 @@ int bnx2x_vfop_release_cmd(struct bnx2x *bp,
 	return -ENOMEM;
 }
 
+int bnx2x_vfop_rss_cmd(struct bnx2x *bp,
+		       struct bnx2x_virtf *vf,
+		       struct bnx2x_vfop_cmd *cmd)
+{
+	struct bnx2x_vfop *vfop = bnx2x_vfop_add(bp, vf);
+
+	if (vfop) {
+		bnx2x_vfop_opset(BNX2X_VFOP_RSS_CONFIG, bnx2x_vfop_rss,
+				 cmd->done);
+		return bnx2x_vfop_transition(bp, vf, bnx2x_vfop_rss,
+					     cmd->block);
+	}
+	return -ENOMEM;
+}
+
 /* VF release ~ VF close + VF release-resources
  * Release is the ultimate SW shutdown and is called whenever an
  * irrecoverable error is encountered.
@@ -2972,6 +3030,8 @@ void bnx2x_vf_release(struct bnx2x *bp, struct bnx2x_virtf *vf, bool block)
 		.block = block,
 	};
 	int rc;
+
+	DP(BNX2X_MSG_IOV, "PF releasing vf %d\n", vf->abs_vfid);
 	bnx2x_lock_vf_pf_channel(bp, vf, CHANNEL_TLV_PF_RELEASE_VF);
 
 	rc = bnx2x_vfop_release_cmd(bp, vf, &cmd);
@@ -3000,6 +3060,12 @@ static inline void bnx2x_vf_get_bars(struct bnx2x *bp, struct bnx2x_virtf *vf,
 void bnx2x_lock_vf_pf_channel(struct bnx2x *bp, struct bnx2x_virtf *vf,
 			      enum channel_tlvs tlv)
 {
+	/* we don't lock the channel for unsupported tlvs */
+	if (!bnx2x_tlv_supported(tlv)) {
+		BNX2X_ERR("attempting to lock with unsupported tlv. Aborting\n");
+		return;
+	}
+
 	/* lock the channel */
 	mutex_lock(&vf->op_mutex);
 
@@ -3014,19 +3080,32 @@ void bnx2x_lock_vf_pf_channel(struct bnx2x *bp, struct bnx2x_virtf *vf,
 void bnx2x_unlock_vf_pf_channel(struct bnx2x *bp, struct bnx2x_virtf *vf,
 				enum channel_tlvs expected_tlv)
 {
+	enum channel_tlvs current_tlv;
+
+	if (!vf) {
+		BNX2X_ERR("VF was %p\n", vf);
+		return;
+	}
+
+	current_tlv = vf->op_current;
+
+	/* we don't unlock the channel for unsupported tlvs */
+	if (!bnx2x_tlv_supported(expected_tlv))
+		return;
+
 	WARN(expected_tlv != vf->op_current,
 	     "lock mismatch: expected %d found %d", expected_tlv,
 	     vf->op_current);
 
+	/* record the locking op */
+	vf->op_current = CHANNEL_TLV_NONE;
+
 	/* lock the channel */
 	mutex_unlock(&vf->op_mutex);
 
 	/* log the unlock */
 	DP(BNX2X_MSG_IOV, "VF[%d]: vf pf channel unlocked by %d\n",
 	   vf->abs_vfid, vf->op_current);
-
-	/* record the locking op */
-	vf->op_current = CHANNEL_TLV_NONE;
 }
 
 int bnx2x_sriov_configure(struct pci_dev *dev, int num_vfs_param)
@@ -3057,11 +3136,77 @@ int bnx2x_sriov_configure(struct pci_dev *dev, int num_vfs_param)
 		return bnx2x_enable_sriov(bp);
 	}
 }
+#define IGU_ENTRY_SIZE 4
 
 int bnx2x_enable_sriov(struct bnx2x *bp)
 {
 	int rc = 0, req_vfs = bp->requested_nr_virtfn;
+	int vf_idx, sb_idx, vfq_idx, qcount, first_vf;
+	u32 igu_entry, address;
+	u16 num_vf_queues;
+
+	if (req_vfs == 0)
+		return 0;
+
+	first_vf = bp->vfdb->sriov.first_vf_in_pf;
+
+	/* statically distribute vf sb pool between VFs */
+	num_vf_queues = min_t(u16, BNX2X_VF_MAX_QUEUES,
+			      BP_VFDB(bp)->vf_sbs_pool / req_vfs);
+
+	/* zero previous values learned from igu cam */
+	for (vf_idx = 0; vf_idx < req_vfs; vf_idx++) {
+		struct bnx2x_virtf *vf = BP_VF(bp, vf_idx);
+
+		vf->sb_count = 0;
+		vf_sb_count(BP_VF(bp, vf_idx)) = 0;
+	}
+	bp->vfdb->vf_sbs_pool = 0;
+
+	/* prepare IGU cam */
+	sb_idx = BP_VFDB(bp)->first_vf_igu_entry;
+	address = IGU_REG_MAPPING_MEMORY + sb_idx * IGU_ENTRY_SIZE;
+	for (vf_idx = first_vf; vf_idx < first_vf + req_vfs; vf_idx++) {
+		for (vfq_idx = 0; vfq_idx < num_vf_queues; vfq_idx++) {
+			igu_entry = vf_idx << IGU_REG_MAPPING_MEMORY_FID_SHIFT |
+				vfq_idx << IGU_REG_MAPPING_MEMORY_VECTOR_SHIFT |
+				IGU_REG_MAPPING_MEMORY_VALID;
+			DP(BNX2X_MSG_IOV, "assigning sb %d to vf %d\n",
+			   sb_idx, vf_idx);
+			REG_WR(bp, address, igu_entry);
+			sb_idx++;
+			address += IGU_ENTRY_SIZE;
+		}
+	}
+
+	/* Reinitialize vf database according to igu cam */
+	bnx2x_get_vf_igu_cam_info(bp);
+
+	DP(BNX2X_MSG_IOV, "vf_sbs_pool %d, num_vf_queues %d\n",
+	   BP_VFDB(bp)->vf_sbs_pool, num_vf_queues);
+
+	qcount = 0;
+	for_each_vf(bp, vf_idx) {
+		struct bnx2x_virtf *vf = BP_VF(bp, vf_idx);
 
+		/* set local queue arrays */
+		vf->vfqs = &bp->vfdb->vfqs[qcount];
+		qcount += vf_sb_count(vf);
+	}
+
+	/* prepare msix vectors in VF configuration space */
+	for (vf_idx = first_vf; vf_idx < first_vf + req_vfs; vf_idx++) {
+		bnx2x_pretend_func(bp, HW_VF_HANDLE(bp, vf_idx));
+		REG_WR(bp, PCICFG_OFFSET + GRC_CONFIG_REG_VF_MSIX_CONTROL,
+		       num_vf_queues);
+	}
+	bnx2x_pretend_func(bp, BP_ABS_FUNC(bp));
+
+	/* enable sriov. This will probe all the VFs, and consequently cause
+	 * the "acquire" messages to appear on the VF PF channel.
+	 */
+	DP(BNX2X_MSG_IOV, "about to call enable sriov\n");
+	pci_disable_sriov(bp->pdev);
 	rc = pci_enable_sriov(bp->pdev, req_vfs);
 	if (rc) {
 		BNX2X_ERR("pci_enable_sriov failed with %d\n", rc);
@@ -3089,9 +3234,8 @@ void bnx2x_disable_sriov(struct bnx2x *bp)
 	pci_disable_sriov(bp->pdev);
 }
 
-static int bnx2x_vf_ndo_prep(struct bnx2x *bp, int vfidx,
-			     struct bnx2x_virtf **vf,
-			     struct pf_vf_bulletin_content **bulletin)
+int bnx2x_vf_ndo_prep(struct bnx2x *bp, int vfidx, struct bnx2x_virtf **vf,
+			struct pf_vf_bulletin_content **bulletin)
 {
 	if (bp->state != BNX2X_STATE_OPEN) {
 		BNX2X_ERR("vf ndo called though PF is down\n");
@@ -3114,7 +3258,13 @@ static int bnx2x_vf_ndo_prep(struct bnx2x *bp, int vfidx,
 	*bulletin = BP_VF_BULLETIN(bp, vfidx);
 
 	if (!*vf) {
-		BNX2X_ERR("vf ndo called but vf was null. vfidx was %d\n",
+		BNX2X_ERR("vf ndo called but vf struct is null. vfidx was %d\n",
+			  vfidx);
+		return -EINVAL;
+	}
+
+	if (!(*vf)->vfqs) {
+		BNX2X_ERR("vf ndo called but vfqs struct is null. Was ndo invoked before dynamically enabling SR-IOV? vfidx was %d\n",
 			  vfidx);
 		return -EINVAL;
 	}
@@ -3142,8 +3292,8 @@ int bnx2x_get_vf_config(struct net_device *dev, int vfidx,
 	rc = bnx2x_vf_ndo_prep(bp, vfidx, &vf, &bulletin);
 	if (rc)
 		return rc;
-	mac_obj = &bnx2x_vfq(vf, 0, mac_obj);
-	vlan_obj = &bnx2x_vfq(vf, 0, vlan_obj);
+	mac_obj = &bnx2x_leading_vfq(vf, mac_obj);
+	vlan_obj = &bnx2x_leading_vfq(vf, vlan_obj);
 	if (!mac_obj || !vlan_obj) {
 		BNX2X_ERR("VF partially initialized\n");
 		return -EINVAL;
@@ -3155,10 +3305,13 @@ int bnx2x_get_vf_config(struct net_device *dev, int vfidx,
 	ivi->spoofchk = 1; /*always enabled */
 	if (vf->state == VF_ENABLED) {
 		/* mac and vlan are in vlan_mac objects */
-		mac_obj->get_n_elements(bp, mac_obj, 1, (u8 *)&ivi->mac,
-					0, ETH_ALEN);
-		vlan_obj->get_n_elements(bp, vlan_obj, 1, (u8 *)&ivi->vlan,
-					 0, VLAN_HLEN);
+		if (validate_vlan_mac(bp, &bnx2x_leading_vfq(vf, mac_obj)))
+			mac_obj->get_n_elements(bp, mac_obj, 1, (u8 *)&ivi->mac,
+						0, ETH_ALEN);
+		if (validate_vlan_mac(bp, &bnx2x_leading_vfq(vf, vlan_obj)))
+			vlan_obj->get_n_elements(bp, vlan_obj, 1,
+						 (u8 *)&ivi->vlan, 0,
+						 VLAN_HLEN);
 	} else {
 		/* mac */
 		if (bulletin->valid_bitmap & (1 << MAC_ADDR_VALID))
@@ -3226,14 +3379,18 @@ int bnx2x_set_vf_mac(struct net_device *dev, int vfidx, u8 *mac)
 		return rc;
 	}
 
-	/* is vf initialized and queue set up? */
 	q_logical_state =
-		bnx2x_get_q_logical_state(bp, &bnx2x_vfq(vf, 0, sp_obj));
+		bnx2x_get_q_logical_state(bp, &bnx2x_leading_vfq(vf, sp_obj));
 	if (vf->state == VF_ENABLED &&
 	    q_logical_state == BNX2X_Q_LOGICAL_STATE_ACTIVE) {
 		/* configure the mac in device on this vf's queue */
 		unsigned long ramrod_flags = 0;
-		struct bnx2x_vlan_mac_obj *mac_obj = &bnx2x_vfq(vf, 0, mac_obj);
+		struct bnx2x_vlan_mac_obj *mac_obj =
+			&bnx2x_leading_vfq(vf, mac_obj);
+
+		rc = validate_vlan_mac(bp, &bnx2x_leading_vfq(vf, mac_obj));
+		if (rc)
+			return rc;
 
 		/* must lock vfpf channel to protect against vf flows */
 		bnx2x_lock_vf_pf_channel(bp, vf, CHANNEL_TLV_PF_SET_MAC);
@@ -3293,18 +3450,21 @@ int bnx2x_set_vf_vlan(struct net_device *dev, int vfidx, u16 vlan, u8 qos)
 
 	/* is vf initialized and queue set up? */
 	q_logical_state =
-		bnx2x_get_q_logical_state(bp, &bnx2x_vfq(vf, 0, sp_obj));
+		bnx2x_get_q_logical_state(bp, &bnx2x_leading_vfq(vf, sp_obj));
 	if (vf->state == VF_ENABLED &&
 	    q_logical_state == BNX2X_Q_LOGICAL_STATE_ACTIVE) {
 		/* configure the vlan in device on this vf's queue */
 		unsigned long ramrod_flags = 0;
 		unsigned long vlan_mac_flags = 0;
 		struct bnx2x_vlan_mac_obj *vlan_obj =
-			&bnx2x_vfq(vf, 0, vlan_obj);
+			&bnx2x_leading_vfq(vf, vlan_obj);
 		struct bnx2x_vlan_mac_ramrod_params ramrod_param;
 		struct bnx2x_queue_state_params q_params = {NULL};
 		struct bnx2x_queue_update_params *update_params;
 
+		rc = validate_vlan_mac(bp, &bnx2x_leading_vfq(vf, mac_obj));
+		if (rc)
+			return rc;
 		memset(&ramrod_param, 0, sizeof(ramrod_param));
 
 		/* must lock vfpf channel to protect against vf flows */
@@ -3324,7 +3484,7 @@ int bnx2x_set_vf_vlan(struct net_device *dev, int vfidx, u16 vlan, u8 qos)
 		 */
 		__set_bit(RAMROD_COMP_WAIT, &q_params.ramrod_flags);
 		q_params.cmd = BNX2X_Q_CMD_UPDATE;
-		q_params.q_obj = &bnx2x_vfq(vf, 0, sp_obj);
+		q_params.q_obj = &bnx2x_leading_vfq(vf, sp_obj);
 		update_params = &q_params.params.update;
 		__set_bit(BNX2X_Q_UPDATE_DEF_VLAN_EN_CHNG,
 			  &update_params->update_flags);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
index d143a7c..8e9847f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
@@ -81,6 +81,7 @@ struct bnx2x_vf_queue {
 	u32 cid;
 	u16 index;
 	u16 sb_idx;
+	bool is_leading;
 };
 
 /* struct bnx2x_vfop_qctor_params - prepare queue construction parameters:
@@ -194,6 +195,7 @@ struct bnx2x_virtf {
 #define VF_CFG_INT_SIMD		0x0008
 #define VF_CACHE_LINE		0x0010
 #define VF_CFG_VLAN		0x0020
+#define VF_CFG_STATS_COALESCE	0x0040
 
 	u8 state;
 #define VF_FREE		0	/* VF ready to be acquired holds no resc */
@@ -213,6 +215,7 @@ struct bnx2x_virtf {
 
 	/* dma */
 	dma_addr_t fw_stat_map;		/* valid iff VF_CFG_STATS */
+	u16 stats_stride;
 	dma_addr_t spq_map;
 	dma_addr_t bulletin_map;
 
@@ -239,7 +242,10 @@ struct bnx2x_virtf {
 	u8 igu_base_id;	/* base igu status block id */
 
 	struct bnx2x_vf_queue	*vfqs;
-#define bnx2x_vfq(vf, nr, var)	((vf)->vfqs[(nr)].var)
+#define LEADING_IDX			0
+#define bnx2x_vfq_is_leading(vfq)	((vfq)->index == LEADING_IDX)
+#define bnx2x_vfq(vf, nr, var)		((vf)->vfqs[(nr)].var)
+#define bnx2x_leading_vfq(vf, var)	((vf)->vfqs[LEADING_IDX].var)
 
 	u8 index;	/* index in the vf array */
 	u8 abs_vfid;
@@ -358,6 +364,10 @@ struct bnx2x_vf_sp {
 		struct client_init_ramrod_data  init_data;
 		struct client_update_ramrod_data update_data;
 	} q_data;
+
+	union {
+		struct eth_rss_update_ramrod_data e2;
+	} rss_rdata;
 };
 
 struct hw_dma {
@@ -403,6 +413,10 @@ struct bnx2x_vfdb {
 
 #define FLRD_VFS_DWORDS (BNX2X_MAX_NUM_OF_VFS / 32)
 	u32 flrd_vfs[FLRD_VFS_DWORDS];
+
+	/* the number of msix vectors belonging to this PF designated for VFs */
+	u16 vf_sbs_pool;
+	u16 first_vf_igu_entry;
 };
 
 /* queue access */
@@ -411,11 +425,6 @@ static inline struct bnx2x_vf_queue *vfq_get(struct bnx2x_virtf *vf, u8 index)
 	return &(vf->vfqs[index]);
 }
 
-static inline bool vfq_is_leading(struct bnx2x_vf_queue *vfq)
-{
-	return (vfq->index == 0);
-}
-
 /* FW ids */
 static inline u8 vf_igu_sb(struct bnx2x_virtf *vf, u16 sb_idx)
 {
@@ -434,7 +443,10 @@ static u8 vfq_cl_id(struct bnx2x_virtf *vf, struct bnx2x_vf_queue *q)
 
 static inline u8 vfq_stat_id(struct bnx2x_virtf *vf, struct bnx2x_vf_queue *q)
 {
-	return vfq_cl_id(vf, q);
+	if (vf->cfg_flags & VF_CFG_STATS_COALESCE)
+		return vf->leading_rss;
+	else
+		return vfq_cl_id(vf, q);
 }
 
 static inline u8 vfq_qzone_id(struct bnx2x_virtf *vf, struct bnx2x_vf_queue *q)
@@ -691,6 +703,10 @@ int bnx2x_vfop_release_cmd(struct bnx2x *bp,
 			   struct bnx2x_virtf *vf,
 			   struct bnx2x_vfop_cmd *cmd);
 
+int bnx2x_vfop_rss_cmd(struct bnx2x *bp,
+		       struct bnx2x_virtf *vf,
+		       struct bnx2x_vfop_cmd *cmd);
+
 /* VF release ~ VF close + VF release-resources
  *
  * Release is the ultimate SW shutdown and is called whenever an
@@ -758,7 +774,7 @@ int bnx2x_enable_sriov(struct bnx2x *bp);
 void bnx2x_disable_sriov(struct bnx2x *bp);
 static inline int bnx2x_vf_headroom(struct bnx2x *bp)
 {
-	return bp->vfdb->sriov.nr_virtfn * BNX2X_CLIENTS_PER_VF;
+	return bp->vfdb->sriov.nr_virtfn * BNX2X_CIDS_PER_VF;
 }
 void bnx2x_pf_set_vfs_vlan(struct bnx2x *bp);
 int bnx2x_sriov_configure(struct pci_dev *dev, int num_vfs);
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index 2088063..a7e88a4 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -257,17 +257,23 @@ int bnx2x_vfpf_acquire(struct bnx2x *bp, u8 tx_count, u8 rx_count)
 
 			/* humble our request */
 			req->resc_request.num_txqs =
-				bp->acquire_resp.resc.num_txqs;
+				min(req->resc_request.num_txqs,
+				    bp->acquire_resp.resc.num_txqs);
 			req->resc_request.num_rxqs =
-				bp->acquire_resp.resc.num_rxqs;
+				min(req->resc_request.num_rxqs,
+				    bp->acquire_resp.resc.num_rxqs);
 			req->resc_request.num_sbs =
-				bp->acquire_resp.resc.num_sbs;
+				min(req->resc_request.num_sbs,
+				    bp->acquire_resp.resc.num_sbs);
 			req->resc_request.num_mac_filters =
-				bp->acquire_resp.resc.num_mac_filters;
+				min(req->resc_request.num_mac_filters,
+				    bp->acquire_resp.resc.num_mac_filters);
 			req->resc_request.num_vlan_filters =
-				bp->acquire_resp.resc.num_vlan_filters;
+				min(req->resc_request.num_vlan_filters,
+				    bp->acquire_resp.resc.num_vlan_filters);
 			req->resc_request.num_mc_filters =
-				bp->acquire_resp.resc.num_mc_filters;
+				min(req->resc_request.num_mc_filters,
+				    bp->acquire_resp.resc.num_mc_filters);
 
 			/* Clear response buffer */
 			memset(&bp->vf2pf_mbox->resp, 0,
@@ -293,7 +299,7 @@ int bnx2x_vfpf_acquire(struct bnx2x *bp, u8 tx_count, u8 rx_count)
 	bp->common.flash_size = 0;
 	bp->flags |=
 		NO_WOL_FLAG | NO_ISCSI_OOO_FLAG | NO_ISCSI_FLAG | NO_FCOE_FLAG;
-	bp->igu_sb_cnt = 1;
+	bp->igu_sb_cnt = bp->acquire_resp.resc.num_sbs;
 	bp->igu_base_sb = bp->acquire_resp.resc.hw_sbs[0].hw_sb_id;
 	strlcpy(bp->fw_ver, bp->acquire_resp.pfdev_info.fw_ver,
 		sizeof(bp->fw_ver));
@@ -452,6 +458,53 @@ free_irq:
 	bnx2x_free_irq(bp);
 }
 
+static void bnx2x_leading_vfq_init(struct bnx2x *bp, struct bnx2x_virtf *vf,
+				   struct bnx2x_vf_queue *q)
+{
+	u8 cl_id = vfq_cl_id(vf, q);
+	u8 func_id = FW_VF_HANDLE(vf->abs_vfid);
+
+	/* mac */
+	bnx2x_init_mac_obj(bp, &q->mac_obj,
+			   cl_id, q->cid, func_id,
+			   bnx2x_vf_sp(bp, vf, mac_rdata),
+			   bnx2x_vf_sp_map(bp, vf, mac_rdata),
+			   BNX2X_FILTER_MAC_PENDING,
+			   &vf->filter_state,
+			   BNX2X_OBJ_TYPE_RX_TX,
+			   &bp->macs_pool);
+	/* vlan */
+	bnx2x_init_vlan_obj(bp, &q->vlan_obj,
+			    cl_id, q->cid, func_id,
+			    bnx2x_vf_sp(bp, vf, vlan_rdata),
+			    bnx2x_vf_sp_map(bp, vf, vlan_rdata),
+			    BNX2X_FILTER_VLAN_PENDING,
+			    &vf->filter_state,
+			    BNX2X_OBJ_TYPE_RX_TX,
+			    &bp->vlans_pool);
+
+	/* mcast */
+	bnx2x_init_mcast_obj(bp, &vf->mcast_obj, cl_id,
+			     q->cid, func_id, func_id,
+			     bnx2x_vf_sp(bp, vf, mcast_rdata),
+			     bnx2x_vf_sp_map(bp, vf, mcast_rdata),
+			     BNX2X_FILTER_MCAST_PENDING,
+			     &vf->filter_state,
+			     BNX2X_OBJ_TYPE_RX_TX);
+
+	/* rss */
+	bnx2x_init_rss_config_obj(bp, &vf->rss_conf_obj, cl_id, q->cid,
+				  func_id, func_id,
+				  bnx2x_vf_sp(bp, vf, rss_rdata),
+				  bnx2x_vf_sp_map(bp, vf, rss_rdata),
+				  BNX2X_FILTER_RSS_CONF_PENDING,
+				  &vf->filter_state,
+				  BNX2X_OBJ_TYPE_RX_TX);
+
+	vf->leading_rss = cl_id;
+	q->is_leading = true;
+}
+
 /* ask the pf to open a queue for the vf */
 int bnx2x_vfpf_setup_q(struct bnx2x *bp, int fp_idx)
 {
@@ -948,7 +1001,7 @@ static void bnx2x_vf_mbx_acquire_resp(struct bnx2x *bp, struct bnx2x_virtf *vf,
 
 	/* fill in pfdev info */
 	resp->pfdev_info.chip_num = bp->common.chip_id;
-	resp->pfdev_info.db_size = (1 << BNX2X_DB_SHIFT);
+	resp->pfdev_info.db_size = bp->db_size;
 	resp->pfdev_info.indices_per_sb = HC_SB_MAX_INDICES_E2;
 	resp->pfdev_info.pf_cap = (PFVF_CAP_RSS |
 				   /* PFVF_CAP_DHC |*/ PFVF_CAP_TPA);
@@ -1054,8 +1107,13 @@ static void bnx2x_vf_mbx_init_vf(struct bnx2x *bp, struct bnx2x_virtf *vf,
 	/* record ghost addresses from vf message */
 	vf->spq_map = init->spq_addr;
 	vf->fw_stat_map = init->stats_addr;
+	vf->stats_stride = init->stats_stride;
 	vf->op_rc = bnx2x_vf_init(bp, vf, (dma_addr_t *)init->sb_addr);
 
+	/* set VF multiqueue statistics collection mode */
+	if (init->flags & VFPF_INIT_FLG_STATS_COALESCE)
+		vf->cfg_flags |= VF_CFG_STATS_COALESCE;
+
 	/* response */
 	bnx2x_vf_mbx_resp(bp, vf);
 }
@@ -1080,6 +1138,8 @@ static void bnx2x_vf_mbx_set_q_flags(struct bnx2x *bp, u32 mbx_q_flags,
 		__set_bit(BNX2X_Q_FLG_HC, sp_q_flags);
 	if (mbx_q_flags & VFPF_QUEUE_FLG_DHC)
 		__set_bit(BNX2X_Q_FLG_DHC, sp_q_flags);
+	if (mbx_q_flags & VFPF_QUEUE_FLG_LEADING_RSS)
+		__set_bit(BNX2X_Q_FLG_LEADING_RSS, sp_q_flags);
 
 	/* outer vlan removal is set according to PF's multi function mode */
 	if (IS_MF_SD(bp))
@@ -1113,6 +1173,9 @@ static void bnx2x_vf_mbx_setup_q(struct bnx2x *bp, struct bnx2x_virtf *vf,
 		struct bnx2x_queue_init_params *init_p;
 		struct bnx2x_queue_setup_params *setup_p;
 
+		if (bnx2x_vfq_is_leading(q))
+			bnx2x_leading_vfq_init(bp, vf, q);
+
 		/* re-init the VF operation context */
 		memset(&vf->op_params.qctor, 0 , sizeof(vf->op_params.qctor));
 		setup_p = &vf->op_params.qctor.prep_qsetup;
@@ -1552,6 +1615,68 @@ static void bnx2x_vf_mbx_release_vf(struct bnx2x *bp, struct bnx2x_virtf *vf,
 		bnx2x_vf_mbx_resp(bp, vf);
 }
 
+static void bnx2x_vf_mbx_update_rss(struct bnx2x *bp, struct bnx2x_virtf *vf,
+				    struct bnx2x_vf_mbx *mbx)
+{
+	struct bnx2x_vfop_cmd cmd = {
+		.done = bnx2x_vf_mbx_resp,
+		.block = false,
+	};
+	struct bnx2x_config_rss_params *vf_op_params = &vf->op_params.rss;
+	struct vfpf_rss_tlv *rss_tlv = &mbx->msg->req.update_rss;
+
+	if (rss_tlv->ind_table_size != T_ETH_INDIRECTION_TABLE_SIZE ||
+	    rss_tlv->rss_key_size != T_ETH_RSS_KEY) {
+		BNX2X_ERR("failing rss configuration of vf %d due to size mismatch\n",
+			  vf->index);
+		vf->op_rc = -EINVAL;
+		goto mbx_resp;
+	}
+
+	/* set vfop params according to rss tlv */
+	memcpy(vf_op_params->ind_table, rss_tlv->ind_table,
+	       T_ETH_INDIRECTION_TABLE_SIZE);
+	memcpy(vf_op_params->rss_key, rss_tlv->rss_key,
+	       sizeof(rss_tlv->rss_key));
+	vf_op_params->rss_obj = &vf->rss_conf_obj;
+	vf_op_params->rss_result_mask = rss_tlv->rss_result_mask;
+
+	/* flags handled individually for backward/forward compatibility */
+	if (rss_tlv->rss_flags & VFPF_RSS_MODE_DISABLED)
+		__set_bit(BNX2X_RSS_MODE_DISABLED, &vf_op_params->rss_flags);
+	if (rss_tlv->rss_flags & VFPF_RSS_MODE_REGULAR)
+		__set_bit(BNX2X_RSS_MODE_REGULAR, &vf_op_params->rss_flags);
+	if (rss_tlv->rss_flags & VFPF_RSS_SET_SRCH)
+		__set_bit(BNX2X_RSS_SET_SRCH, &vf_op_params->rss_flags);
+	if (rss_tlv->rss_flags & VFPF_RSS_IPV4)
+		__set_bit(BNX2X_RSS_IPV4, &vf_op_params->rss_flags);
+	if (rss_tlv->rss_flags & VFPF_RSS_IPV4_TCP)
+		__set_bit(BNX2X_RSS_IPV4_TCP, &vf_op_params->rss_flags);
+	if (rss_tlv->rss_flags & VFPF_RSS_IPV4_UDP)
+		__set_bit(BNX2X_RSS_IPV4_UDP, &vf_op_params->rss_flags);
+	if (rss_tlv->rss_flags & VFPF_RSS_IPV6)
+		__set_bit(BNX2X_RSS_IPV6, &vf_op_params->rss_flags);
+	if (rss_tlv->rss_flags & VFPF_RSS_IPV6_TCP)
+		__set_bit(BNX2X_RSS_IPV6_TCP, &vf_op_params->rss_flags);
+	if (rss_tlv->rss_flags & VFPF_RSS_IPV6_UDP)
+		__set_bit(BNX2X_RSS_IPV6_UDP, &vf_op_params->rss_flags);
+
+	if ((!(rss_tlv->rss_flags & VFPF_RSS_IPV4_TCP) &&
+	     rss_tlv->rss_flags & VFPF_RSS_IPV4_UDP) ||
+	    (!(rss_tlv->rss_flags & VFPF_RSS_IPV6_TCP) &&
+	     rss_tlv->rss_flags & VFPF_RSS_IPV6_UDP)) {
+		BNX2X_ERR("about to hit a FW assert. aborting...\n");
+		vf->op_rc = -EINVAL;
+		goto mbx_resp;
+	}
+
+	vf->op_rc = bnx2x_vfop_rss_cmd(bp, vf, &cmd);
+
+mbx_resp:
+	if (vf->op_rc)
+		bnx2x_vf_mbx_resp(bp, vf);
+}
+
 /* dispatch request */
 static void bnx2x_vf_mbx_request(struct bnx2x *bp, struct bnx2x_virtf *vf,
 				  struct bnx2x_vf_mbx *mbx)
@@ -1588,6 +1713,9 @@ static void bnx2x_vf_mbx_request(struct bnx2x *bp, struct bnx2x_virtf *vf,
 		case CHANNEL_TLV_RELEASE:
 			bnx2x_vf_mbx_release_vf(bp, vf, mbx);
 			break;
+		case CHANNEL_TLV_UPDATE_RSS:
+			bnx2x_vf_mbx_update_rss(bp, vf, mbx);
+			break;
 		}
 
 	} else {
@@ -1607,7 +1735,7 @@ static void bnx2x_vf_mbx_request(struct bnx2x *bp, struct bnx2x_virtf *vf,
 		/* test whether we can respond to the VF (do we have an address
 		 * for it?)
 		 */
-		if (vf->state == VF_ACQUIRED) {
+		if (vf->state == VF_ACQUIRED || vf->state == VF_ENABLED) {
 			/* mbx_resp uses the op_rc of the VF */
 			vf->op_rc = PFVF_STATUS_NOT_SUPPORTED;
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.h
index f3ad174..1179fe0 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.h
@@ -51,6 +51,7 @@ struct hw_sb_info {
 #define VFPF_QUEUE_FLG_COS		0x0080
 #define VFPF_QUEUE_FLG_HC		0x0100
 #define VFPF_QUEUE_FLG_DHC		0x0200
+#define VFPF_QUEUE_FLG_LEADING_RSS	0x0400
 
 #define VFPF_QUEUE_DROP_IP_CS_ERR	(1 << 0)
 #define VFPF_QUEUE_DROP_TCP_CS_ERR	(1 << 1)
@@ -131,6 +132,27 @@ struct vfpf_q_op_tlv {
 	u8 padding[3];
 };
 
+/* receive side scaling tlv */
+struct vfpf_rss_tlv {
+	struct vfpf_first_tlv	first_tlv;
+	u32			rss_flags;
+#define VFPF_RSS_MODE_DISABLED	(1 << 0)
+#define VFPF_RSS_MODE_REGULAR	(1 << 1)
+#define VFPF_RSS_SET_SRCH	(1 << 2)
+#define VFPF_RSS_IPV4		(1 << 3)
+#define VFPF_RSS_IPV4_TCP	(1 << 4)
+#define VFPF_RSS_IPV4_UDP	(1 << 5)
+#define VFPF_RSS_IPV6		(1 << 6)
+#define VFPF_RSS_IPV6_TCP	(1 << 7)
+#define VFPF_RSS_IPV6_UDP	(1 << 8)
+	u8			rss_result_mask;
+	u8			ind_table_size;
+	u8			rss_key_size;
+	u8			padding;
+	u8			ind_table[T_ETH_INDIRECTION_TABLE_SIZE];
+	u32			rss_key[T_ETH_RSS_KEY];	/* hash values */
+};
+
 /* acquire response tlv - carries the allocated resources */
 struct pfvf_acquire_resp_tlv {
 	struct pfvf_tlv hdr;
@@ -166,12 +188,20 @@ struct pfvf_acquire_resp_tlv {
 	} resc;
 };
 
+#define VFPF_INIT_FLG_STATS_COALESCE	(1 << 0) /* when set the VFs queues
+						  * stats will be coalesced on
+						  * the leading RSS queue
+						  */
+
 /* Init VF */
 struct vfpf_init_tlv {
 	struct vfpf_first_tlv first_tlv;
 	aligned_u64 sb_addr[PFVF_MAX_SBS_PER_VF]; /* vf_sb based */
 	aligned_u64 spq_addr;
 	aligned_u64 stats_addr;
+	u16 stats_stride;
+	u32 flags;
+	u32 padding[2];
 };
 
 /* Setup Queue */
@@ -293,13 +323,14 @@ union vfpf_tlvs {
 	struct vfpf_q_op_tlv		q_op;
 	struct vfpf_setup_q_tlv		setup_q;
 	struct vfpf_set_q_filters_tlv	set_q_filters;
-	struct vfpf_release_tlv         release;
-	struct channel_list_end_tlv     list_end;
+	struct vfpf_release_tlv		release;
+	struct vfpf_rss_tlv		update_rss;
+	struct channel_list_end_tlv	list_end;
 	struct tlv_buffer_size		tlv_buf_size;
 };
 
 union pfvf_tlvs {
-	struct pfvf_general_resp_tlv    general_resp;
+	struct pfvf_general_resp_tlv	general_resp;
 	struct pfvf_acquire_resp_tlv	acquire_resp;
 	struct channel_list_end_tlv	list_end;
 	struct tlv_buffer_size		tlv_buf_size;
@@ -355,14 +386,18 @@ enum channel_tlvs {
 	CHANNEL_TLV_INIT,
 	CHANNEL_TLV_SETUP_Q,
 	CHANNEL_TLV_SET_Q_FILTERS,
+	CHANNEL_TLV_ACTIVATE_Q,
+	CHANNEL_TLV_DEACTIVATE_Q,
 	CHANNEL_TLV_TEARDOWN_Q,
 	CHANNEL_TLV_CLOSE,
 	CHANNEL_TLV_RELEASE,
+	CHANNEL_TLV_UPDATE_RSS_DEPRECATED,
 	CHANNEL_TLV_PF_RELEASE_VF,
 	CHANNEL_TLV_LIST_END,
 	CHANNEL_TLV_FLR,
 	CHANNEL_TLV_PF_SET_MAC,
 	CHANNEL_TLV_PF_SET_VLAN,
+	CHANNEL_TLV_UPDATE_RSS,
 	CHANNEL_TLV_MAX
 };
 
-- 
1.7.1


* [PATCH net-next 2/2] bnx2x: VF RSS support - VF side
From: Ariel Elior @ 2013-09-04 11:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eilon Greenstein, Ariel Elior
In-Reply-To: <1378292962-5537-1-git-send-email-ariele@broadcom.com>

This patch adds to the VF driver the ability to request multiple queues
over the VF-PF channel, along with the logic for requesting RSS
configuration for those queues.

Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c    |   85 ++++++++++----------
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h    |    7 +-
 .../net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c    |    4 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c   |   27 ++++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h  |    7 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c   |   75 +++++++++++++++++-
 6 files changed, 145 insertions(+), 60 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index e7400d9..8d726f6 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1942,7 +1942,7 @@ static void bnx2x_set_rx_buf_size(struct bnx2x *bp)
 	}
 }
 
-static int bnx2x_init_rss_pf(struct bnx2x *bp)
+static int bnx2x_init_rss(struct bnx2x *bp)
 {
 	int i;
 	u8 num_eth_queues = BNX2X_NUM_ETH_QUEUES(bp);
@@ -1966,8 +1966,8 @@ static int bnx2x_init_rss_pf(struct bnx2x *bp)
 	return bnx2x_config_rss_eth(bp, bp->port.pmf || !CHIP_IS_E1x(bp));
 }
 
-int bnx2x_config_rss_pf(struct bnx2x *bp, struct bnx2x_rss_config_obj *rss_obj,
-			bool config_hash)
+int bnx2x_rss(struct bnx2x *bp, struct bnx2x_rss_config_obj *rss_obj,
+	      bool config_hash, bool enable)
 {
 	struct bnx2x_config_rss_params params = {NULL};
 
@@ -1982,17 +1982,21 @@ int bnx2x_config_rss_pf(struct bnx2x *bp, struct bnx2x_rss_config_obj *rss_obj,
 
 	__set_bit(RAMROD_COMP_WAIT, &params.ramrod_flags);
 
-	__set_bit(BNX2X_RSS_MODE_REGULAR, &params.rss_flags);
-
-	/* RSS configuration */
-	__set_bit(BNX2X_RSS_IPV4, &params.rss_flags);
-	__set_bit(BNX2X_RSS_IPV4_TCP, &params.rss_flags);
-	__set_bit(BNX2X_RSS_IPV6, &params.rss_flags);
-	__set_bit(BNX2X_RSS_IPV6_TCP, &params.rss_flags);
-	if (rss_obj->udp_rss_v4)
-		__set_bit(BNX2X_RSS_IPV4_UDP, &params.rss_flags);
-	if (rss_obj->udp_rss_v6)
-		__set_bit(BNX2X_RSS_IPV6_UDP, &params.rss_flags);
+	if (enable) {
+		__set_bit(BNX2X_RSS_MODE_REGULAR, &params.rss_flags);
+
+		/* RSS configuration */
+		__set_bit(BNX2X_RSS_IPV4, &params.rss_flags);
+		__set_bit(BNX2X_RSS_IPV4_TCP, &params.rss_flags);
+		__set_bit(BNX2X_RSS_IPV6, &params.rss_flags);
+		__set_bit(BNX2X_RSS_IPV6_TCP, &params.rss_flags);
+		if (rss_obj->udp_rss_v4)
+			__set_bit(BNX2X_RSS_IPV4_UDP, &params.rss_flags);
+		if (rss_obj->udp_rss_v6)
+			__set_bit(BNX2X_RSS_IPV6_UDP, &params.rss_flags);
+	} else {
+		__set_bit(BNX2X_RSS_MODE_DISABLED, &params.rss_flags);
+	}
 
 	/* Hash bits */
 	params.rss_result_mask = MULTI_MASK;
@@ -2001,11 +2005,14 @@ int bnx2x_config_rss_pf(struct bnx2x *bp, struct bnx2x_rss_config_obj *rss_obj,
 
 	if (config_hash) {
 		/* RSS keys */
-		prandom_bytes(params.rss_key, sizeof(params.rss_key));
+		prandom_bytes(params.rss_key, T_ETH_RSS_KEY * 4);
 		__set_bit(BNX2X_RSS_SET_SRCH, &params.rss_flags);
 	}
 
-	return bnx2x_config_rss(bp, &params);
+	if (IS_PF(bp))
+		return bnx2x_config_rss(bp, &params);
+	else
+		return bnx2x_vfpf_config_rss(bp, &params);
 }
 
 static int bnx2x_init_hw(struct bnx2x *bp, u32 load_code)
@@ -2645,38 +2652,32 @@ int bnx2x_nic_load(struct bnx2x *bp, int load_mode)
 
 		/* initialize FW coalescing state machines in RAM */
 		bnx2x_update_coalesce(bp);
+	}
 
-		/* setup the leading queue */
-		rc = bnx2x_setup_leading(bp);
-		if (rc) {
-			BNX2X_ERR("Setup leading failed!\n");
-			LOAD_ERROR_EXIT(bp, load_error3);
-		}
-
-		/* set up the rest of the queues */
-		for_each_nondefault_eth_queue(bp, i) {
-			rc = bnx2x_setup_queue(bp, &bp->fp[i], 0);
-			if (rc) {
-				BNX2X_ERR("Queue setup failed\n");
-				LOAD_ERROR_EXIT(bp, load_error3);
-			}
-		}
+	/* setup the leading queue */
+	rc = bnx2x_setup_leading(bp);
+	if (rc) {
+		BNX2X_ERR("Setup leading failed!\n");
+		LOAD_ERROR_EXIT(bp, load_error3);
+	}
 
-		/* setup rss */
-		rc = bnx2x_init_rss_pf(bp);
+	/* set up the rest of the queues */
+	for_each_nondefault_eth_queue(bp, i) {
+		if (IS_PF(bp))
+			rc = bnx2x_setup_queue(bp, &bp->fp[i], false);
+		else /* VF */
+			rc = bnx2x_vfpf_setup_q(bp, &bp->fp[i], false);
 		if (rc) {
-			BNX2X_ERR("PF RSS init failed\n");
+			BNX2X_ERR("Queue %d setup failed\n", i);
 			LOAD_ERROR_EXIT(bp, load_error3);
 		}
+	}
 
-	} else { /* vf */
-		for_each_eth_queue(bp, i) {
-			rc = bnx2x_vfpf_setup_q(bp, i);
-			if (rc) {
-				BNX2X_ERR("Queue setup failed\n");
-				LOAD_ERROR_EXIT(bp, load_error3);
-			}
-		}
+	/* setup rss */
+	rc = bnx2x_init_rss(bp);
+	if (rc) {
+		BNX2X_ERR("RSS init failed\n");
+		LOAD_ERROR_EXIT(bp, load_error3);
 	}
 
 	/* Now when Clients are configured we are ready to work */
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index affb764..da8fcaa 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -105,9 +105,10 @@ void bnx2x_send_unload_done(struct bnx2x *bp, bool keep_link);
  * @rss_obj:		RSS object to use
  * @ind_table:		indirection table to configure
  * @config_hash:	re-configure RSS hash keys configuration
+ * @enable:		enabled or disabled configuration
  */
-int bnx2x_config_rss_pf(struct bnx2x *bp, struct bnx2x_rss_config_obj *rss_obj,
-			bool config_hash);
+int bnx2x_rss(struct bnx2x *bp, struct bnx2x_rss_config_obj *rss_obj,
+	      bool config_hash, bool enable);
 
 /**
  * bnx2x__init_func_obj - init function object
@@ -980,7 +981,7 @@ static inline int func_by_vn(struct bnx2x *bp, int vn)
 
 static inline int bnx2x_config_rss_eth(struct bnx2x *bp, bool config_hash)
 {
-	return bnx2x_config_rss_pf(bp, &bp->rss_conf_obj, config_hash);
+	return bnx2x_rss(bp, &bp->rss_conf_obj, config_hash, true);
 }
 
 /**
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index c5f2251..2612e3c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -3281,14 +3281,14 @@ static int bnx2x_set_rss_flags(struct bnx2x *bp, struct ethtool_rxnfc *info)
 			DP(BNX2X_MSG_ETHTOOL,
 			   "rss re-configured, UDP 4-tupple %s\n",
 			   udp_rss_requested ? "enabled" : "disabled");
-			return bnx2x_config_rss_pf(bp, &bp->rss_conf_obj, 0);
+			return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
 		} else if ((info->flow_type == UDP_V6_FLOW) &&
 			   (bp->rss_conf_obj.udp_rss_v6 != udp_rss_requested)) {
 			bp->rss_conf_obj.udp_rss_v6 = udp_rss_requested;
 			DP(BNX2X_MSG_ETHTOOL,
 			   "rss re-configured, UDP 4-tupple %s\n",
 			   udp_rss_requested ? "enabled" : "disabled");
-			return bnx2x_config_rss_pf(bp, &bp->rss_conf_obj, 0);
+			return bnx2x_rss(bp, &bp->rss_conf_obj, false, true);
 		}
 		return 0;
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 5729aa7..c69990d 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -8060,7 +8060,10 @@ int bnx2x_set_eth_mac(struct bnx2x *bp, bool set)
 
 int bnx2x_setup_leading(struct bnx2x *bp)
 {
-	return bnx2x_setup_queue(bp, &bp->fp[0], 1);
+	if (IS_PF(bp))
+		return bnx2x_setup_queue(bp, &bp->fp[0], true);
+	else /* VF */
+		return bnx2x_vfpf_setup_q(bp, &bp->fp[0], true);
 }
 
 /**
@@ -8074,8 +8077,10 @@ int bnx2x_set_int_mode(struct bnx2x *bp)
 {
 	int rc = 0;
 
-	if (IS_VF(bp) && int_mode != BNX2X_INT_MODE_MSIX)
+	if (IS_VF(bp) && int_mode != BNX2X_INT_MODE_MSIX) {
+		BNX2X_ERR("VF not loaded since interrupt mode not msix\n");
 		return -EINVAL;
+	}
 
 	switch (int_mode) {
 	case BNX2X_INT_MODE_MSIX:
@@ -11658,9 +11663,11 @@ static int bnx2x_init_bp(struct bnx2x *bp)
 	 * second status block for the L2 queue, and a third status block for
 	 * CNIC if supported.
 	 */
-	if (CNIC_SUPPORT(bp))
+	if (IS_VF(bp))
+		bp->min_msix_vec_cnt = 1;
+	else if (CNIC_SUPPORT(bp))
 		bp->min_msix_vec_cnt = 3;
-	else
+	else /* PF w/o cnic */
 		bp->min_msix_vec_cnt = 2;
 	BNX2X_DEV_INFO("bp->min_msix_vec_cnt %d", bp->min_msix_vec_cnt);
 
@@ -12571,8 +12578,7 @@ static int bnx2x_set_qm_cid_count(struct bnx2x *bp)
  * @dev:	pci device
  *
  */
-static int bnx2x_get_num_non_def_sbs(struct pci_dev *pdev,
-				     int cnic_cnt, bool is_vf)
+static int bnx2x_get_num_non_def_sbs(struct pci_dev *pdev, int cnic_cnt)
 {
 	int index;
 	u16 control = 0;
@@ -12598,7 +12604,7 @@ static int bnx2x_get_num_non_def_sbs(struct pci_dev *pdev,
 
 	index = control & PCI_MSIX_FLAGS_QSIZE;
 
-	return is_vf ? index + 1 : index;
+	return index;
 }
 
 static int set_max_cos_est(int chip_id)
@@ -12678,10 +12684,13 @@ static int bnx2x_init_one(struct pci_dev *pdev,
 	is_vf = set_is_vf(ent->driver_data);
 	cnic_cnt = is_vf ? 0 : 1;
 
-	max_non_def_sbs = bnx2x_get_num_non_def_sbs(pdev, cnic_cnt, is_vf);
+	max_non_def_sbs = bnx2x_get_num_non_def_sbs(pdev, cnic_cnt);
+
+	/* add another SB for VF as it has no default SB */
+	max_non_def_sbs += is_vf ? 1 : 0;
 
 	/* Maximum number of RSS queues: one IGU SB goes to CNIC */
-	rss_count = is_vf ? 1 : max_non_def_sbs - cnic_cnt;
+	rss_count = max_non_def_sbs - cnic_cnt;
 
 	if (rss_count < 1)
 		return -EINVAL;
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
index 8e9847f..2a8c1dc 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
@@ -746,9 +746,12 @@ int bnx2x_vfpf_release(struct bnx2x *bp);
 int bnx2x_vfpf_release(struct bnx2x *bp);
 int bnx2x_vfpf_init(struct bnx2x *bp);
 void bnx2x_vfpf_close_vf(struct bnx2x *bp);
-int bnx2x_vfpf_setup_q(struct bnx2x *bp, int fp_idx);
+int bnx2x_vfpf_setup_q(struct bnx2x *bp, struct bnx2x_fastpath *fp,
+		       bool is_leading);
 int bnx2x_vfpf_teardown_queue(struct bnx2x *bp, int qidx);
 int bnx2x_vfpf_config_mac(struct bnx2x *bp, u8 *addr, u8 vf_qid, bool set);
+int bnx2x_vfpf_config_rss(struct bnx2x *bp,
+			  struct bnx2x_config_rss_params *params);
 int bnx2x_vfpf_set_mcast(struct net_device *dev);
 int bnx2x_vfpf_storm_rx_mode(struct bnx2x *bp);
 
@@ -809,7 +812,7 @@ static inline int bnx2x_vfpf_acquire(struct bnx2x *bp,
 static inline int bnx2x_vfpf_release(struct bnx2x *bp) {return 0; }
 static inline int bnx2x_vfpf_init(struct bnx2x *bp) {return 0; }
 static inline void bnx2x_vfpf_close_vf(struct bnx2x *bp) {}
-static inline int bnx2x_vfpf_setup_q(struct bnx2x *bp, int fp_idx) {return 0; }
+static inline int bnx2x_vfpf_setup_q(struct bnx2x *bp, struct bnx2x_fastpath *fp, bool is_leading) {return 0; }
 static inline int bnx2x_vfpf_teardown_queue(struct bnx2x *bp, int qidx) {return 0; }
 static inline int bnx2x_vfpf_config_mac(struct bnx2x *bp, u8 *addr,
 					u8 vf_qid, bool set) {return 0; }
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index a7e88a4..6cfb887 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -379,6 +379,8 @@ int bnx2x_vfpf_init(struct bnx2x *bp)
 	req->stats_addr = bp->fw_stats_data_mapping +
 			  offsetof(struct bnx2x_fw_stats_data, queue_stats);
 
+	req->stats_stride = sizeof(struct per_queue_stats);
+
 	/* add list termination tlv */
 	bnx2x_add_tlv(bp, req, req->first_tlv.tl.length, CHANNEL_TLV_LIST_END,
 		      sizeof(struct channel_list_end_tlv));
@@ -506,11 +508,12 @@ static void bnx2x_leading_vfq_init(struct bnx2x *bp, struct bnx2x_virtf *vf,
 }
 
 /* ask the pf to open a queue for the vf */
-int bnx2x_vfpf_setup_q(struct bnx2x *bp, int fp_idx)
+int bnx2x_vfpf_setup_q(struct bnx2x *bp, struct bnx2x_fastpath *fp,
+		       bool is_leading)
 {
 	struct vfpf_setup_q_tlv *req = &bp->vf2pf_mbox->req.setup_q;
 	struct pfvf_general_resp_tlv *resp = &bp->vf2pf_mbox->resp.general_resp;
-	struct bnx2x_fastpath *fp = &bp->fp[fp_idx];
+	u8 fp_idx = fp->index;
 	u16 tpa_agg_size = 0, flags = 0;
 	int rc;
 
@@ -526,6 +529,9 @@ int bnx2x_vfpf_setup_q(struct bnx2x *bp, int fp_idx)
 		tpa_agg_size = TPA_AGG_SIZE;
 	}
 
+	if (is_leading)
+		flags |= VFPF_QUEUE_FLG_LEADING_RSS;
+
 	/* calculate queue flags */
 	flags |= VFPF_QUEUE_FLG_STATS;
 	flags |= VFPF_QUEUE_FLG_CACHE_ALIGN;
@@ -699,6 +705,71 @@ out:
 	return 0;
 }
 
+/* request pf to config rss table for vf queues*/
+int bnx2x_vfpf_config_rss(struct bnx2x *bp,
+			  struct bnx2x_config_rss_params *params)
+{
+	struct pfvf_general_resp_tlv *resp = &bp->vf2pf_mbox->resp.general_resp;
+	struct vfpf_rss_tlv *req = &bp->vf2pf_mbox->req.update_rss;
+	int rc = 0;
+
+	/* clear mailbox and prep first tlv */
+	bnx2x_vfpf_prep(bp, &req->first_tlv, CHANNEL_TLV_UPDATE_RSS,
+			sizeof(*req));
+
+	/* add list termination tlv */
+	bnx2x_add_tlv(bp, req, req->first_tlv.tl.length, CHANNEL_TLV_LIST_END,
+		      sizeof(struct channel_list_end_tlv));
+
+	memcpy(req->ind_table, params->ind_table, T_ETH_INDIRECTION_TABLE_SIZE);
+	memcpy(req->rss_key, params->rss_key, sizeof(params->rss_key));
+	req->ind_table_size = T_ETH_INDIRECTION_TABLE_SIZE;
+	req->rss_key_size = T_ETH_RSS_KEY;
+	req->rss_result_mask = params->rss_result_mask;
+
+	/* flags handled individually for backward/forward compatibility */
+	if (params->rss_flags & (1 << BNX2X_RSS_MODE_DISABLED))
+		req->rss_flags |= VFPF_RSS_MODE_DISABLED;
+	if (params->rss_flags & (1 << BNX2X_RSS_MODE_REGULAR))
+		req->rss_flags |= VFPF_RSS_MODE_REGULAR;
+	if (params->rss_flags & (1 << BNX2X_RSS_SET_SRCH))
+		req->rss_flags |= VFPF_RSS_SET_SRCH;
+	if (params->rss_flags & (1 << BNX2X_RSS_IPV4))
+		req->rss_flags |= VFPF_RSS_IPV4;
+	if (params->rss_flags & (1 << BNX2X_RSS_IPV4_TCP))
+		req->rss_flags |= VFPF_RSS_IPV4_TCP;
+	if (params->rss_flags & (1 << BNX2X_RSS_IPV4_UDP))
+		req->rss_flags |= VFPF_RSS_IPV4_UDP;
+	if (params->rss_flags & (1 << BNX2X_RSS_IPV6))
+		req->rss_flags |= VFPF_RSS_IPV6;
+	if (params->rss_flags & (1 << BNX2X_RSS_IPV6_TCP))
+		req->rss_flags |= VFPF_RSS_IPV6_TCP;
+	if (params->rss_flags & (1 << BNX2X_RSS_IPV6_UDP))
+		req->rss_flags |= VFPF_RSS_IPV6_UDP;
+
+	DP(BNX2X_MSG_IOV, "rss flags %x\n", req->rss_flags);
+
+	/* output tlvs list */
+	bnx2x_dp_tlv_list(bp, req);
+
+	/* send message to pf */
+	rc = bnx2x_send_msg2pf(bp, &resp->hdr.status, bp->vf2pf_mbox_mapping);
+	if (rc) {
+		BNX2X_ERR("failed to send message to pf. rc was %d\n", rc);
+		goto out;
+	}
+
+	if (resp->hdr.status != PFVF_STATUS_SUCCESS) {
+		BNX2X_ERR("failed to send rss message to PF over Vf PF channel %d\n",
+			  resp->hdr.status);
+		rc = -EINVAL;
+	}
+out:
+	bnx2x_vfpf_finalize(bp, &req->first_tlv);
+
+	return rc;
+}
+
 int bnx2x_vfpf_set_mcast(struct net_device *dev)
 {
 	struct bnx2x *bp = netdev_priv(dev);
-- 
1.7.1


* bnx2x: VF RSS support
From: Ariel Elior @ 2013-09-04 11:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eilon Greenstein, Ariel Elior

Hi Dave,

This patch series adds the capability for VFs to use multiple queues and
receive/transmit side scaling.

Patch #1 extends the PF-side database to allow multiple queues per VF,
configures the HW accordingly, and adds the PF side of the VF-PF channel
message for configuring RSS.
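
As an illustration of the PF-side handling, here is a minimal standalone
sketch of the sanity check that patch #1 applies to an incoming RSS request:
UDP hashing is rejected unless TCP hashing is requested for the same IP
version. The flag values are copied from struct vfpf_rss_tlv in the patch;
the helper name rss_flags_udp_without_tcp() is hypothetical and does not
appear in the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wire-format flag values, copied from struct vfpf_rss_tlv in patch #1. */
#define VFPF_RSS_IPV4_TCP	(1u << 4)
#define VFPF_RSS_IPV4_UDP	(1u << 5)
#define VFPF_RSS_IPV6_TCP	(1u << 7)
#define VFPF_RSS_IPV6_UDP	(1u << 8)

/* Hypothetical helper, not part of the driver: returns true when the
 * request asks for UDP hashing without TCP hashing for the same IP
 * version. The patch fails such a request with -EINVAL.
 */
static bool rss_flags_udp_without_tcp(uint32_t flags)
{
	bool v4_bad = !(flags & VFPF_RSS_IPV4_TCP) &&
		      (flags & VFPF_RSS_IPV4_UDP);
	bool v6_bad = !(flags & VFPF_RSS_IPV6_TCP) &&
		      (flags & VFPF_RSS_IPV6_UDP);

	return v4_bad || v6_bad;
}
```

In the patch itself this condition is checked inline in
bnx2x_vf_mbx_update_rss() before the request is handed to the RSS vfop.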

Patch #2 adds to the VF side the ability to request multiple queues and, if
they are obtained, to configure RSS for them over the VF-PF channel.
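
The VF side translates the driver's internal RSS flags (bit indices) into
the wire-format mask flags one flag at a time, so that the two layouts can
change independently. A minimal standalone sketch of that translation
follows. The VFPF_RSS_* values are copied from patch #1; the drv_rss_bit
enum and its numbering are assumptions made for this example and are not
the driver's real BNX2X_RSS_* values, and the table-driven form is a
substitute for the patch's sequence of if statements.

```c
#include <stddef.h>
#include <stdint.h>

/* Wire-format flag values, copied from struct vfpf_rss_tlv in patch #1. */
#define VFPF_RSS_MODE_DISABLED	(1u << 0)
#define VFPF_RSS_MODE_REGULAR	(1u << 1)
#define VFPF_RSS_SET_SRCH	(1u << 2)
#define VFPF_RSS_IPV4		(1u << 3)
#define VFPF_RSS_IPV4_TCP	(1u << 4)
#define VFPF_RSS_IPV4_UDP	(1u << 5)
#define VFPF_RSS_IPV6		(1u << 6)
#define VFPF_RSS_IPV6_TCP	(1u << 7)
#define VFPF_RSS_IPV6_UDP	(1u << 8)

/* Stand-in for the driver's internal bit indices. The numbering is an
 * assumption for this sketch and deliberately differs from the wire
 * layout to show that the mapping does not depend on it.
 */
enum drv_rss_bit {
	DRV_RSS_IPV4,
	DRV_RSS_IPV4_TCP,
	DRV_RSS_IPV4_UDP,
	DRV_RSS_IPV6,
	DRV_RSS_IPV6_TCP,
	DRV_RSS_IPV6_UDP,
	DRV_RSS_SET_SRCH,
	DRV_RSS_MODE_DISABLED,
	DRV_RSS_MODE_REGULAR,
};

/* One entry per flag: internal bit index and the matching wire flag. */
static const struct {
	unsigned int drv_bit;
	uint32_t wire_flag;
} rss_flag_map[] = {
	{ DRV_RSS_MODE_DISABLED, VFPF_RSS_MODE_DISABLED },
	{ DRV_RSS_MODE_REGULAR,  VFPF_RSS_MODE_REGULAR },
	{ DRV_RSS_SET_SRCH,      VFPF_RSS_SET_SRCH },
	{ DRV_RSS_IPV4,          VFPF_RSS_IPV4 },
	{ DRV_RSS_IPV4_TCP,      VFPF_RSS_IPV4_TCP },
	{ DRV_RSS_IPV4_UDP,      VFPF_RSS_IPV4_UDP },
	{ DRV_RSS_IPV6,          VFPF_RSS_IPV6 },
	{ DRV_RSS_IPV6_TCP,      VFPF_RSS_IPV6_TCP },
	{ DRV_RSS_IPV6_UDP,      VFPF_RSS_IPV6_UDP },
};

/* Hypothetical helper: build the wire mask from the internal flag word.
 * Internal bits with no table entry are ignored.
 */
static uint32_t rss_drv_to_wire(unsigned long drv_flags)
{
	uint32_t wire = 0;
	size_t i;

	for (i = 0; i < sizeof(rss_flag_map) / sizeof(rss_flag_map[0]); i++)
		if (drv_flags & (1UL << rss_flag_map[i].drv_bit))
			wire |= rss_flag_map[i].wire_flag;

	return wire;
}
```

Mapping each flag explicitly, rather than copying the flag word verbatim,
is what lets a PF and a VF built from different driver versions agree on
the meaning of every bit.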

Thanks,
Ariel Elior
