Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] ip6_gre: fix flowi6_proto value in xmit path
From: David Miller @ 2014-10-05  0:09 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev
In-Reply-To: <1412267209-893-1-git-send-email-nicolas.dichtel@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu,  2 Oct 2014 18:26:49 +0200

> In xmit path, we build a flowi6 which will be used for the output route lookup.
> We are sending a GRE packet, neither IPv4 nor IPv6 encapsulated packet, thus the
> protocol should be IPPROTO_GRE.
> 
> Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
> Reported-by: Matthieu Ternisien d'Ouville <matthieu.tdo@6wind.com>
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH v7 net-next 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
From: Mahesh Bandewar @ 2014-10-05  0:22 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller,
	netdev, Eric Dumazet, Maciej Zenczykowski, Cong Wang
In-Reply-To: <542FA3C1.9080405@redhat.com>

On Sat, Oct 4, 2014 at 1:07 PM, Nikolay Aleksandrov <nikolay@redhat.com> wrote:
> On 10/04/2014 02:48 AM, Mahesh Bandewar wrote:
>> Earlier change to use usable slave array for TLB mode had an additional
>> performance advantage. So extending the same logic to all other modes
>> that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
>> Also consolidating this with the earlier TLB change.
>>
>> The main idea is to build the usable slaves array in the control path
>> and use that array for slave selection during xmit operation.
>>
>> Measured performance in a setup with a bond of 4x1G NICs with 200
>> instances of netperf for the modes involved (3ad, xor, tlb)
>> cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5
>>
>> Mode        TPS-Before   TPS-After
>>
>> 802.3ad   : 468,694      493,101
>> TLB (lb=0): 392,583      392,965
>> XOR       : 475,696      484,517
>>
>> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
>> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
>> ---
>> v1:
>>   (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
>>       the slave that need to be removed.
>>   (b) Freeing of array will assign NULL (to handle bond->down to bond->up
>>       transition gracefully.
>>   (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
>>       failure.
>>   (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
>>       will populate the array even if these parameters are not used.
>>   (e) 3AD: Should handle the ad_agg_selection_logic correctly.
>> v2:
>>   (a) Removed rcu_read_{un}lock() calls from array manipulation code.
>>   (b) Slave link-events now refresh array for all these modes.
>>   (c) Moved free-array call from bond_close() to bond_uninit().
>> v3:
>>   (a) Fixed null pointer dereference.
>>   (b) Removed bond->lock lockdep dependency.
>> v4:
>>   (a) Made to changes to comply with Nikolay's locking changes
>>   (b) Added a work-queue to refresh slave-array when RTNL is not held
>>   (c) Array refresh happens ONLY with RTNL now.
>>   (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
>> v5:
>>   (a) Consolidated all delayed slave-array updates at one place in
>>       3ad_state_machine_handler()
>> v6:
>>   (a) Free slave array when there is no active aggregator
>> v7:
>>   (a) Couple of trivial changes.
>>
>>  drivers/net/bonding/bond_3ad.c  | 140 +++++++++++------------------
>>  drivers/net/bonding/bond_alb.c  |  51 ++---------
>>  drivers/net/bonding/bond_alb.h  |   8 --
>>  drivers/net/bonding/bond_main.c | 192 +++++++++++++++++++++++++++++++++++++---
>>  drivers/net/bonding/bonding.h   |  10 +++
>>  5 files changed, 249 insertions(+), 152 deletions(-)
>>
> <<<snip>>>
>> +/* Build the usable slaves array in control path for modes that use xmit-hash
>> + * to determine the slave interface -
>> + * (a) BOND_MODE_8023AD
>> + * (b) BOND_MODE_XOR
>> + * (c) BOND_MODE_TLB && tlb_dynamic_lb == 0
>> + *
>> + * The caller is expected to hold RTNL only and NO other lock!
>> + */
>> +int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave)
>> +{
>> +     struct slave *slave;
>> +     struct list_head *iter;
>> +     struct bond_up_slave *new_arr, *old_arr;
>> +     int slaves_in_agg;
>> +     int agg_id = 0;
>> +     int ret = 0;
>> +
>> +#ifdef CONFIG_LOCKDEP
>> +     lockdep_assert_held(&bond->mode_lock);
>> +#endif
> ^^^^^^^^^
> This is wrong now, the logic is inverted.
> It will WARN every time mode_lock is _not_ held:
>
> #define lockdep_assert_held(l)  do {                            \
>                 WARN_ON(debug_locks && !lockdep_is_held(l));    \
>         } while (0)
>
> The previous version was correct which did a WARN when mode_lock was
> actually held as that is the wrong condition, not when it's not held.
> I've missed that comment earlier.
>
Thanks Nik, I missed that. I'll revert it!

> (also switched Veaceslav's email address with the correct one in the CC list)
>
>> +
>> +     new_arr = kzalloc(offsetof(struct bond_up_slave, arr[bond->slave_cnt]),
>> +                       GFP_KERNEL);
>> +     if (!new_arr) {
>> +             ret = -ENOMEM;
>> +             pr_err("Failed to build slave-array.\n");
>> +             goto out;
>> +     }
>> +     if (BOND_MODE(bond) == BOND_MODE_8023AD) {
>> +             struct ad_info ad_info;
>> +
>> +             if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
>> +                     pr_debug("bond_3ad_get_active_agg_info failed\n");
>> +                     kfree_rcu(new_arr, rcu);
>> +                     /* No active aggragator means it's not safe to use
>> +                      * the previous array.
>> +                      */
>> +                     old_arr = rtnl_dereference(bond->slave_arr);
>> +                     if (old_arr) {
>> +                             RCU_INIT_POINTER(bond->slave_arr, NULL);
>> +                             kfree_rcu(old_arr, rcu);
>> +                     }
>> +                     goto out;
>> +             }
>> +             slaves_in_agg = ad_info.ports;
>> +             agg_id = ad_info.aggregator_id;
>> +     }
>> +     bond_for_each_slave(bond, slave, iter) {
>> +             if (BOND_MODE(bond) == BOND_MODE_8023AD) {
>> +                     struct aggregator *agg;
>> +
>> +                     agg = SLAVE_AD_INFO(slave)->port.aggregator;
>> +                     if (!agg || agg->aggregator_identifier != agg_id)
>> +                             continue;
>> +             }
>> +             if (!bond_slave_can_tx(slave))
>> +                     continue;
>> +             if (skipslave == slave)
>> +                     continue;
>> +             new_arr->arr[new_arr->count++] = slave;
>> +     }
>> +
>> +     old_arr = rtnl_dereference(bond->slave_arr);
>> +     rcu_assign_pointer(bond->slave_arr, new_arr);
>> +     if (old_arr)
>> +             kfree_rcu(old_arr, rcu);
>> +out:
>> +     if (ret != 0 && skipslave) {
>> +             int idx;
>> +
>> +             /* Rare situation where caller has asked to skip a specific
>> +              * slave but allocation failed (most likely!). BTW this is
>> +              * only possible when the call is initiated from
>> +              * __bond_release_one(). In this situation; overwrite the
>> +              * skipslave entry in the array with the last entry from the
>> +              * array to avoid a situation where the xmit path may choose
>> +              * this to-be-skipped slave to send a packet out.
>> +              */
>> +             old_arr = rtnl_dereference(bond->slave_arr);
>> +             for (idx = 0; idx < old_arr->count; idx++) {
>> +                     if (skipslave == old_arr->arr[idx]) {
>> +                             old_arr->arr[idx] =
>> +                                 old_arr->arr[old_arr->count-1];
>> +                             old_arr->count--;
>> +                             break;
>> +                     }
>> +             }
>> +     }
>> +     return ret;
>> +}
>> +
> <<<snip>>>
>

^ permalink raw reply

* Re: [PATCH net-next] net: better IFF_XMIT_DST_RELEASE support
From: David Miller @ 2014-10-05  0:29 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1412267647.22242.3.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 02 Oct 2014 09:34:07 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Testing xmit_more support with netperf and connected UDP sockets,
> I found strange dst refcount false sharing.
> 
> Current handling of IFF_XMIT_DST_RELEASE is not optimal.
> 
> dropping dst in validate_xmit_skb() is certainly too late in case
> packet was queued by cpu X but dequeued by cpu Y
> 
> The logical point to take care of drop/force is in __dev_queue_xmit()
> before even taking qdisc lock.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I assume you are going to rework this to use a counter indication
in order to deal with the packet scheduler issues Julian brought
up.

^ permalink raw reply

* Re: [PATCH] net: systemport: fix bcm_sysport_insert_tsb()
From: David Miller @ 2014-10-05  0:33 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev
In-Reply-To: <1412268196-16086-1-git-send-email-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu,  2 Oct 2014 09:43:16 -0700

> Similar to commit bc23333ba11fb7f959b7e87e121122f5a0fbbca8 ("net:
> bcmgenet: fix bcmgenet_put_tx_csum()"), we need to return the skb
> pointer in case we had to reallocate the SKB headroom.
> 
> Fixes: 80105befdb4b8 ("net: systemport: add Broadcom SYSTEMPORT Ethernet MAC driver")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Applied and queued up for -stable, thanks Florian.

^ permalink raw reply

* Re: [PATCH V2 net-next] net: Cleanup skb cloning by adding SKB_FCLONE_FREE
From: David Miller @ 2014-10-05  0:34 UTC (permalink / raw)
  To: subramanian.vijay; +Cc: netdev, edumazet
In-Reply-To: <1412269243-5583-1-git-send-email-subramanian.vijay@gmail.com>

From: Vijay Subramanian <subramanian.vijay@gmail.com>
Date: Thu,  2 Oct 2014 10:00:43 -0700

> SKB_FCLONE_UNAVAILABLE has overloaded meaning depending on type of skb.
> 1: If skb is allocated from head_cache, it indicates fclone is not available.
> 2: If skb is a companion fclone skb (allocated from fclone_cache), it indicates
> it is available to be used.
> 
> To avoid confusion for case 2 above, this patch  replaces
> SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate. For fclone
> companion skbs, this indicates it is free for use.
> 
> SKB_FCLONE_UNAVAILABLE will now simply indicate skb is from head_cache and
> cannot / will not have a companion fclone.
> 
> Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
> ---
> V1-->V2: Comment all states

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net-next] Removed unused inet6 address state
From: David Miller @ 2014-10-05  0:37 UTC (permalink / raw)
  To: sebastien.barre; +Cc: netdev, christoph.paasch, herbert
In-Reply-To: <1412277322-27823-1-git-send-email-sebastien.barre@uclouvain.be>

From: Sébastien Barré <sebastien.barre@uclouvain.be>
Date: Thu, 2 Oct 2014 21:15:22 +0200

> the inet6 state INET6_IFADDR_STATE_UP only appeared in its definition.
> 
> Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>

Looks good, applied.

^ permalink raw reply

* Re: macvlan: optimizing the receive path?
From: David Miller @ 2014-10-05  0:42 UTC (permalink / raw)
  To: jbaron; +Cc: netdev, kaber
In-Reply-To: <542DB55D.3090601@akamai.com>

From: Jason Baron <jbaron@akamai.com>
Date: Thu, 02 Oct 2014 16:28:13 -0400

> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -321,8 +321,8 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
>         skb->dev = dev;
>         skb->pkt_type = PACKET_HOST;
>  
> -       ret = netif_rx(skb);
> -
> +      macvlan_count_rx(vlan, len, true, 0);
> +      return RX_HANDLER_ANOTHER;
>  out:
>         macvlan_count_rx(vlan, len, ret == NET_RX_SUCCESS, 0);
>         return RX_HANDLER_CONSUMED;

That last argument to macvlan_count_rx() is a bool and thus should be
specified as "false".  Yes I know other areas of this file get it
wrong too.

Also, what about GRO?  Won't we get GRO processing if we do this via
netif_rx() but not via the RX_HANDLER_ANOTHER route?  Just curious...

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] if_link: add client name to port profile
From: David Miller @ 2014-10-05  0:43 UTC (permalink / raw)
  To: _govind; +Cc: netdev, ssujith, benve
In-Reply-To: <1412289683-8278-2-git-send-email-_govind@gmx.com>

From: Govindarajulu Varadarajan <_govind@gmx.com>
Date: Fri,  3 Oct 2014 04:11:22 +0530

> This patch adds client name to port profile.
> 
> This is used by netlink client to send the client name in port profile.
> 
> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>

I really want to see what other developers think of this thing
because it seems extremely ad-hoc to me.

^ permalink raw reply

* [PATCH v8 net-next 1/2] bonding: display xmit_hash_policy for non-dynamic-tlb mode
From: Mahesh Bandewar @ 2014-10-05  0:44 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, Veaceslav Falico,
	Nikolay Aleksandrov, David Miller
  Cc: netdev, Mahesh Bandewar, Eric Dumazet, Maciej Zenczykowski

It's a trivial fix to display xmit_hash_policy for this new TLB mode
since it uses transmit-hash-poilicy as part of bonding-master info
(/proc/net/bonding/<bonding-interface).

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
v1
 Rebase
v2
 Added bond_mode_uses_xmit_hash() inline function
v3-v8
 Rebase

 drivers/net/bonding/bond_procfs.c | 3 +--
 drivers/net/bonding/bonding.h     | 7 +++++++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c
index bb09d0442aa8..a3948f8d1e53 100644
--- a/drivers/net/bonding/bond_procfs.c
+++ b/drivers/net/bonding/bond_procfs.c
@@ -73,8 +73,7 @@ static void bond_info_show_master(struct seq_file *seq)
 
 	seq_printf(seq, "\n");
 
-	if (BOND_MODE(bond) == BOND_MODE_XOR ||
-		BOND_MODE(bond) == BOND_MODE_8023AD) {
+	if (bond_mode_uses_xmit_hash(bond)) {
 		optval = bond_opt_get_val(BOND_OPT_XMIT_HASH,
 					  bond->params.xmit_policy);
 		seq_printf(seq, "Transmit Hash Policy: %s (%d)\n",
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 57917e63b4e6..5b022da9cad2 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -274,6 +274,13 @@ static inline bool bond_is_nondyn_tlb(const struct bonding *bond)
 	       (bond->params.tlb_dynamic_lb == 0);
 }
 
+static inline bool bond_mode_uses_xmit_hash(const struct bonding *bond)
+{
+	return (BOND_MODE(bond) == BOND_MODE_8023AD ||
+		BOND_MODE(bond) == BOND_MODE_XOR ||
+		bond_is_nondyn_tlb(bond));
+}
+
 static inline bool bond_mode_uses_arp(int mode)
 {
 	return mode != BOND_MODE_8023AD && mode != BOND_MODE_TLB &&
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* [PATCH v8 net-next 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
From: Mahesh Bandewar @ 2014-10-05  0:45 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, Veaceslav Falico,
	Nikolay Aleksandrov, David Miller
  Cc: netdev, Mahesh Bandewar, Eric Dumazet, Maciej Zenczykowski

Earlier change to use usable slave array for TLB mode had an additional
performance advantage. So extending the same logic to all other modes
that use xmit-hash for slave selection (viz 802.3AD, and XOR modes).
Also consolidating this with the earlier TLB change.

The main idea is to build the usable slaves array in the control path
and use that array for slave selection during xmit operation.

Measured performance in a setup with a bond of 4x1G NICs with 200
instances of netperf for the modes involved (3ad, xor, tlb)
cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5

Mode        TPS-Before   TPS-After

802.3ad   : 468,694      493,101
TLB (lb=0): 392,583      392,965
XOR       : 475,696      484,517

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
v1:
  (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
      the slave that need to be removed.
  (b) Freeing of array will assign NULL (to handle bond->down to bond->up
      transition gracefully.
  (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
      failure.
  (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
      will populate the array even if these parameters are not used.
  (e) 3AD: Should handle the ad_agg_selection_logic correctly.
v2:
  (a) Removed rcu_read_{un}lock() calls from array manipulation code.
  (b) Slave link-events now refresh array for all these modes.
  (c) Moved free-array call from bond_close() to bond_uninit().
v3:
  (a) Fixed null pointer dereference.
  (b) Removed bond->lock lockdep dependency.
v4:
  (a) Made to changes to comply with Nikolay's locking changes
  (b) Added a work-queue to refresh slave-array when RTNL is not held
  (c) Array refresh happens ONLY with RTNL now.
  (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
v5:
  (a) Consolidated all delayed slave-array updates at one place in
      3ad_state_machine_handler()
v6:
  (a) Free slave array when there is no active aggregator
v7:
  (a) Couple of trivial changes.
v8:
  (a) Reverted erronus WARN_ON() update

 drivers/net/bonding/bond_3ad.c  | 140 +++++++++++------------------
 drivers/net/bonding/bond_alb.c  |  51 ++---------
 drivers/net/bonding/bond_alb.h  |   8 --
 drivers/net/bonding/bond_main.c | 192 +++++++++++++++++++++++++++++++++++++---
 drivers/net/bonding/bonding.h   |  10 +++
 5 files changed, 249 insertions(+), 152 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index 7e9e522fd476..2110215f3528 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -102,17 +102,20 @@ static const u8 lacpdu_mcast_addr[ETH_ALEN] = MULTICAST_LACPDU_ADDR;
 /* ================= main 802.3ad protocol functions ================== */
 static int ad_lacpdu_send(struct port *port);
 static int ad_marker_send(struct port *port, struct bond_marker *marker);
-static void ad_mux_machine(struct port *port);
+static void ad_mux_machine(struct port *port, bool *update_slave_arr);
 static void ad_rx_machine(struct lacpdu *lacpdu, struct port *port);
 static void ad_tx_machine(struct port *port);
 static void ad_periodic_machine(struct port *port);
-static void ad_port_selection_logic(struct port *port);
-static void ad_agg_selection_logic(struct aggregator *aggregator);
+static void ad_port_selection_logic(struct port *port, bool *update_slave_arr);
+static void ad_agg_selection_logic(struct aggregator *aggregator,
+				   bool *update_slave_arr);
 static void ad_clear_agg(struct aggregator *aggregator);
 static void ad_initialize_agg(struct aggregator *aggregator);
 static void ad_initialize_port(struct port *port, int lacp_fast);
-static void ad_enable_collecting_distributing(struct port *port);
-static void ad_disable_collecting_distributing(struct port *port);
+static void ad_enable_collecting_distributing(struct port *port,
+					      bool *update_slave_arr);
+static void ad_disable_collecting_distributing(struct port *port,
+					       bool *update_slave_arr);
 static void ad_marker_info_received(struct bond_marker *marker_info,
 				    struct port *port);
 static void ad_marker_response_received(struct bond_marker *marker,
@@ -796,8 +799,9 @@ static int ad_marker_send(struct port *port, struct bond_marker *marker)
 /**
  * ad_mux_machine - handle a port's mux state machine
  * @port: the port we're looking at
+ * @update_slave_arr: Does slave array need update?
  */
-static void ad_mux_machine(struct port *port)
+static void ad_mux_machine(struct port *port, bool *update_slave_arr)
 {
 	mux_states_t last_state;
 
@@ -901,7 +905,8 @@ static void ad_mux_machine(struct port *port)
 		switch (port->sm_mux_state) {
 		case AD_MUX_DETACHED:
 			port->actor_oper_port_state &= ~AD_STATE_SYNCHRONIZATION;
-			ad_disable_collecting_distributing(port);
+			ad_disable_collecting_distributing(port,
+							   update_slave_arr);
 			port->actor_oper_port_state &= ~AD_STATE_COLLECTING;
 			port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING;
 			port->ntt = true;
@@ -913,13 +918,15 @@ static void ad_mux_machine(struct port *port)
 			port->actor_oper_port_state |= AD_STATE_SYNCHRONIZATION;
 			port->actor_oper_port_state &= ~AD_STATE_COLLECTING;
 			port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING;
-			ad_disable_collecting_distributing(port);
+			ad_disable_collecting_distributing(port,
+							   update_slave_arr);
 			port->ntt = true;
 			break;
 		case AD_MUX_COLLECTING_DISTRIBUTING:
 			port->actor_oper_port_state |= AD_STATE_COLLECTING;
 			port->actor_oper_port_state |= AD_STATE_DISTRIBUTING;
-			ad_enable_collecting_distributing(port);
+			ad_enable_collecting_distributing(port,
+							  update_slave_arr);
 			port->ntt = true;
 			break;
 		default:
@@ -1187,12 +1194,13 @@ static void ad_periodic_machine(struct port *port)
 /**
  * ad_port_selection_logic - select aggregation groups
  * @port: the port we're looking at
+ * @update_slave_arr: Does slave array need update?
  *
  * Select aggregation groups, and assign each port for it's aggregetor. The
  * selection logic is called in the inititalization (after all the handshkes),
  * and after every lacpdu receive (if selected is off).
  */
-static void ad_port_selection_logic(struct port *port)
+static void ad_port_selection_logic(struct port *port, bool *update_slave_arr)
 {
 	struct aggregator *aggregator, *free_aggregator = NULL, *temp_aggregator;
 	struct port *last_port = NULL, *curr_port;
@@ -1347,7 +1355,7 @@ static void ad_port_selection_logic(struct port *port)
 			      __agg_ports_are_ready(port->aggregator));
 
 	aggregator = __get_first_agg(port);
-	ad_agg_selection_logic(aggregator);
+	ad_agg_selection_logic(aggregator, update_slave_arr);
 }
 
 /* Decide if "agg" is a better choice for the new active aggregator that
@@ -1435,6 +1443,7 @@ static int agg_device_up(const struct aggregator *agg)
 /**
  * ad_agg_selection_logic - select an aggregation group for a team
  * @aggregator: the aggregator we're looking at
+ * @update_slave_arr: Does slave array need update?
  *
  * It is assumed that only one aggregator may be selected for a team.
  *
@@ -1457,7 +1466,8 @@ static int agg_device_up(const struct aggregator *agg)
  * __get_active_agg() won't work correctly. This function should be better
  * called with the bond itself, and retrieve the first agg from it.
  */
-static void ad_agg_selection_logic(struct aggregator *agg)
+static void ad_agg_selection_logic(struct aggregator *agg,
+				   bool *update_slave_arr)
 {
 	struct aggregator *best, *active, *origin;
 	struct bonding *bond = agg->slave->bond;
@@ -1550,6 +1560,8 @@ static void ad_agg_selection_logic(struct aggregator *agg)
 				__disable_port(port);
 			}
 		}
+		/* Slave array needs update. */
+		*update_slave_arr = true;
 	}
 
 	/* if the selected aggregator is of join individuals
@@ -1678,24 +1690,30 @@ static void ad_initialize_port(struct port *port, int lacp_fast)
 /**
  * ad_enable_collecting_distributing - enable a port's transmit/receive
  * @port: the port we're looking at
+ * @update_slave_arr: Does slave array need update?
  *
  * Enable @port if it's in an active aggregator
  */
-static void ad_enable_collecting_distributing(struct port *port)
+static void ad_enable_collecting_distributing(struct port *port,
+					      bool *update_slave_arr)
 {
 	if (port->aggregator->is_active) {
 		pr_debug("Enabling port %d(LAG %d)\n",
 			 port->actor_port_number,
 			 port->aggregator->aggregator_identifier);
 		__enable_port(port);
+		/* Slave array needs update */
+		*update_slave_arr = true;
 	}
 }
 
 /**
  * ad_disable_collecting_distributing - disable a port's transmit/receive
  * @port: the port we're looking at
+ * @update_slave_arr: Does slave array need update?
  */
-static void ad_disable_collecting_distributing(struct port *port)
+static void ad_disable_collecting_distributing(struct port *port,
+					       bool *update_slave_arr)
 {
 	if (port->aggregator &&
 	    !MAC_ADDRESS_EQUAL(&(port->aggregator->partner_system),
@@ -1704,6 +1722,8 @@ static void ad_disable_collecting_distributing(struct port *port)
 			 port->actor_port_number,
 			 port->aggregator->aggregator_identifier);
 		__disable_port(port);
+		/* Slave array needs an update */
+		*update_slave_arr = true;
 	}
 }
 
@@ -1868,6 +1888,7 @@ void bond_3ad_unbind_slave(struct slave *slave)
 	struct bonding *bond = slave->bond;
 	struct slave *slave_iter;
 	struct list_head *iter;
+	bool dummy_slave_update; /* Ignore this value as caller updates array */
 
 	/* Sync against bond_3ad_state_machine_handler() */
 	spin_lock_bh(&bond->mode_lock);
@@ -1951,7 +1972,8 @@ void bond_3ad_unbind_slave(struct slave *slave)
 				ad_clear_agg(aggregator);
 
 				if (select_new_active_agg)
-					ad_agg_selection_logic(__get_first_agg(port));
+					ad_agg_selection_logic(__get_first_agg(port),
+							       &dummy_slave_update);
 			} else {
 				netdev_warn(bond->dev, "unbinding aggregator, and could not find a new aggregator for its ports\n");
 			}
@@ -1966,7 +1988,8 @@ void bond_3ad_unbind_slave(struct slave *slave)
 				/* select new active aggregator */
 				temp_aggregator = __get_first_agg(port);
 				if (temp_aggregator)
-					ad_agg_selection_logic(temp_aggregator);
+					ad_agg_selection_logic(temp_aggregator,
+							       &dummy_slave_update);
 			}
 		}
 	}
@@ -1996,7 +2019,8 @@ void bond_3ad_unbind_slave(struct slave *slave)
 					if (select_new_active_agg) {
 						netdev_info(bond->dev, "Removing an active aggregator\n");
 						/* select new active aggregator */
-						ad_agg_selection_logic(__get_first_agg(port));
+						ad_agg_selection_logic(__get_first_agg(port),
+							               &dummy_slave_update);
 					}
 				}
 				break;
@@ -2031,6 +2055,7 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
 	struct slave *slave;
 	struct port *port;
 	bool should_notify_rtnl = BOND_SLAVE_NOTIFY_LATER;
+	bool update_slave_arr = false;
 
 	/* Lock to protect data accessed by all (e.g., port->sm_vars) and
 	 * against running with bond_3ad_unbind_slave. ad_rx_machine may run
@@ -2058,7 +2083,7 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
 			}
 
 			aggregator = __get_first_agg(port);
-			ad_agg_selection_logic(aggregator);
+			ad_agg_selection_logic(aggregator, &update_slave_arr);
 		}
 		bond_3ad_set_carrier(bond);
 	}
@@ -2074,8 +2099,8 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
 
 		ad_rx_machine(NULL, port);
 		ad_periodic_machine(port);
-		ad_port_selection_logic(port);
-		ad_mux_machine(port);
+		ad_port_selection_logic(port, &update_slave_arr);
+		ad_mux_machine(port, &update_slave_arr);
 		ad_tx_machine(port);
 
 		/* turn off the BEGIN bit, since we already handled it */
@@ -2093,6 +2118,9 @@ re_arm:
 	rcu_read_unlock();
 	spin_unlock_bh(&bond->mode_lock);
 
+	if (update_slave_arr)
+		bond_slave_arr_work_rearm(bond, 0);
+
 	if (should_notify_rtnl && rtnl_trylock()) {
 		bond_slave_state_notify(bond);
 		rtnl_unlock();
@@ -2283,6 +2311,11 @@ void bond_3ad_handle_link_change(struct slave *slave, char link)
 	port->sm_vars |= AD_PORT_BEGIN;
 
 	spin_unlock_bh(&slave->bond->mode_lock);
+
+	/* RTNL is held and mode_lock is released so it's safe
+	 * to update slave_array here.
+	 */
+	bond_update_slave_arr(slave->bond, NULL);
 }
 
 /**
@@ -2377,73 +2410,6 @@ int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info)
 	return ret;
 }
 
-int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
-{
-	struct bonding *bond = netdev_priv(dev);
-	struct slave *slave, *first_ok_slave;
-	struct aggregator *agg;
-	struct ad_info ad_info;
-	struct list_head *iter;
-	int slaves_in_agg;
-	int slave_agg_no;
-	int agg_id;
-
-	if (__bond_3ad_get_active_agg_info(bond, &ad_info)) {
-		netdev_dbg(dev, "__bond_3ad_get_active_agg_info failed\n");
-		goto err_free;
-	}
-
-	slaves_in_agg = ad_info.ports;
-	agg_id = ad_info.aggregator_id;
-
-	if (slaves_in_agg == 0) {
-		netdev_dbg(dev, "active aggregator is empty\n");
-		goto err_free;
-	}
-
-	slave_agg_no = bond_xmit_hash(bond, skb) % slaves_in_agg;
-	first_ok_slave = NULL;
-
-	bond_for_each_slave_rcu(bond, slave, iter) {
-		agg = SLAVE_AD_INFO(slave)->port.aggregator;
-		if (!agg || agg->aggregator_identifier != agg_id)
-			continue;
-
-		if (slave_agg_no >= 0) {
-			if (!first_ok_slave && bond_slave_can_tx(slave))
-				first_ok_slave = slave;
-			slave_agg_no--;
-			continue;
-		}
-
-		if (bond_slave_can_tx(slave)) {
-			bond_dev_queue_xmit(bond, skb, slave->dev);
-			goto out;
-		}
-	}
-
-	if (slave_agg_no >= 0) {
-		netdev_err(dev, "Couldn't find a slave to tx on for aggregator ID %d\n",
-			   agg_id);
-		goto err_free;
-	}
-
-	/* we couldn't find any suitable slave after the agg_no, so use the
-	 * first suitable found, if found.
-	 */
-	if (first_ok_slave)
-		bond_dev_queue_xmit(bond, skb, first_ok_slave->dev);
-	else
-		goto err_free;
-
-out:
-	return NETDEV_TX_OK;
-err_free:
-	/* no suitable interface, frame not sent */
-	dev_kfree_skb_any(skb);
-	goto out;
-}
-
 int bond_3ad_lacpdu_recv(const struct sk_buff *skb, struct bonding *bond,
 			 struct slave *slave)
 {
diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 615f3bebd019..d2eadab787c5 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -177,7 +177,6 @@ static int tlb_initialize(struct bonding *bond)
 static void tlb_deinitialize(struct bonding *bond)
 {
 	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct tlb_up_slave *arr;
 
 	spin_lock_bh(&bond->mode_lock);
 
@@ -185,10 +184,6 @@ static void tlb_deinitialize(struct bonding *bond)
 	bond_info->tx_hashtbl = NULL;
 
 	spin_unlock_bh(&bond->mode_lock);
-
-	arr = rtnl_dereference(bond_info->slave_arr);
-	if (arr)
-		kfree_rcu(arr, rcu);
 }
 
 static long long compute_gap(struct slave *slave)
@@ -1336,39 +1331,9 @@ out:
 	return NETDEV_TX_OK;
 }
 
-static int bond_tlb_update_slave_arr(struct bonding *bond,
-				     struct slave *skipslave)
-{
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct slave *tx_slave;
-	struct list_head *iter;
-	struct tlb_up_slave *new_arr, *old_arr;
-
-	new_arr = kzalloc(offsetof(struct tlb_up_slave, arr[bond->slave_cnt]),
-			  GFP_ATOMIC);
-	if (!new_arr)
-		return -ENOMEM;
-
-	bond_for_each_slave(bond, tx_slave, iter) {
-		if (!bond_slave_can_tx(tx_slave))
-			continue;
-		if (skipslave == tx_slave)
-			continue;
-		new_arr->arr[new_arr->count++] = tx_slave;
-	}
-
-	old_arr = rtnl_dereference(bond_info->slave_arr);
-	rcu_assign_pointer(bond_info->slave_arr, new_arr);
-	if (old_arr)
-		kfree_rcu(old_arr, rcu);
-
-	return 0;
-}
-
 int bond_tlb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
 	struct ethhdr *eth_data;
 	struct slave *tx_slave = NULL;
 	u32 hash_index;
@@ -1389,12 +1354,14 @@ int bond_tlb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 							      hash_index & 0xFF,
 							      skb->len);
 			} else {
-				struct tlb_up_slave *slaves;
+				struct bond_up_slave *slaves;
+				unsigned int count;
 
-				slaves = rcu_dereference(bond_info->slave_arr);
-				if (slaves && slaves->count)
+				slaves = rcu_dereference(bond->slave_arr);
+				count = slaves ? ACCESS_ONCE(slaves->count) : 0;
+				if (likely(count))
 					tx_slave = slaves->arr[hash_index %
-							       slaves->count];
+							       count];
 			}
 			break;
 		}
@@ -1641,10 +1608,6 @@ void bond_alb_deinit_slave(struct bonding *bond, struct slave *slave)
 		rlb_clear_slave(bond, slave);
 	}
 
-	if (bond_is_nondyn_tlb(bond))
-		if (bond_tlb_update_slave_arr(bond, slave))
-			pr_err("Failed to build slave-array for TLB mode.\n");
-
 }
 
 void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char link)
@@ -1669,7 +1632,7 @@ void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char
 	}
 
 	if (bond_is_nondyn_tlb(bond)) {
-		if (bond_tlb_update_slave_arr(bond, NULL))
+		if (bond_update_slave_arr(bond, NULL))
 			pr_err("Failed to build slave-array for TLB mode.\n");
 	}
 }
diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
index 3c6a7ff974d7..1ad473b4ade5 100644
--- a/drivers/net/bonding/bond_alb.h
+++ b/drivers/net/bonding/bond_alb.h
@@ -139,19 +139,11 @@ struct tlb_slave_info {
 			 */
 };
 
-struct tlb_up_slave {
-	unsigned int	count;
-	struct rcu_head rcu;
-	struct slave	*arr[0];
-};
-
 struct alb_bond_info {
 	struct tlb_client_info	*tx_hashtbl; /* Dynamically allocated */
 	u32			unbalanced_load;
 	int			tx_rebalance_counter;
 	int			lp_counter;
-	/* -------- non-dynamic tlb mode only ---------*/
-	struct tlb_up_slave __rcu *slave_arr;	  /* Up slaves */
 	/* -------- rlb parameters -------- */
 	int rlb_enabled;
 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c2adc2755ff6..3ad5413d4f57 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -210,6 +210,7 @@ static int bond_init(struct net_device *bond_dev);
 static void bond_uninit(struct net_device *bond_dev);
 static struct rtnl_link_stats64 *bond_get_stats(struct net_device *bond_dev,
 						struct rtnl_link_stats64 *stats);
+static void bond_slave_arr_handler(struct work_struct *work);
 
 /*---------------------------- General routines -----------------------------*/
 
@@ -1551,6 +1552,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		unblock_netpoll_tx();
 	}
 
+	if (bond_mode_uses_xmit_hash(bond))
+		bond_update_slave_arr(bond, NULL);
+
 	netdev_info(bond_dev, "Enslaving %s as %s interface with %s link\n",
 		    slave_dev->name,
 		    bond_is_active_slave(new_slave) ? "an active" : "a backup",
@@ -1668,6 +1672,9 @@ static int __bond_release_one(struct net_device *bond_dev,
 	if (BOND_MODE(bond) == BOND_MODE_8023AD)
 		bond_3ad_unbind_slave(slave);
 
+	if (bond_mode_uses_xmit_hash(bond))
+		bond_update_slave_arr(bond, slave);
+
 	netdev_info(bond_dev, "Releasing %s interface %s\n",
 		    bond_is_active_slave(slave) ? "active" : "backup",
 		    slave_dev->name);
@@ -1970,6 +1977,9 @@ static void bond_miimon_commit(struct bonding *bond)
 				bond_alb_handle_link_change(bond, slave,
 							    BOND_LINK_UP);
 
+			if (BOND_MODE(bond) == BOND_MODE_XOR)
+				bond_update_slave_arr(bond, NULL);
+
 			if (!bond->curr_active_slave || slave == primary)
 				goto do_failover;
 
@@ -1997,6 +2007,9 @@ static void bond_miimon_commit(struct bonding *bond)
 				bond_alb_handle_link_change(bond, slave,
 							    BOND_LINK_DOWN);
 
+			if (BOND_MODE(bond) == BOND_MODE_XOR)
+				bond_update_slave_arr(bond, NULL);
+
 			if (slave == rcu_access_pointer(bond->curr_active_slave))
 				goto do_failover;
 
@@ -2453,6 +2466,8 @@ static void bond_loadbalance_arp_mon(struct work_struct *work)
 
 		if (slave_state_changed) {
 			bond_slave_state_change(bond);
+			if (BOND_MODE(bond) == BOND_MODE_XOR)
+				bond_update_slave_arr(bond, NULL);
 		} else if (do_failover) {
 			block_netpoll_tx();
 			bond_select_active_slave(bond);
@@ -2829,8 +2844,20 @@ static int bond_slave_netdev_event(unsigned long event,
 			if (old_duplex != slave->duplex)
 				bond_3ad_adapter_duplex_changed(slave);
 		}
+		/* Refresh slave-array if applicable!
+		 * If the setup does not use miimon or arpmon (mode-specific!),
+		 * then these events will not cause the slave-array to be
+		 * refreshed. This will cause xmit to use a slave that is not
+		 * usable. Avoid such situation by refeshing the array at these
+		 * events. If these (miimon/arpmon) parameters are configured
+		 * then array gets refreshed twice and that should be fine!
+		 */
+		if (bond_mode_uses_xmit_hash(bond))
+			bond_update_slave_arr(bond, NULL);
 		break;
 	case NETDEV_DOWN:
+		if (bond_mode_uses_xmit_hash(bond))
+			bond_update_slave_arr(bond, NULL);
 		break;
 	case NETDEV_CHANGEMTU:
 		/* TODO: Should slaves be allowed to
@@ -3010,6 +3037,7 @@ static void bond_work_init_all(struct bonding *bond)
 	else
 		INIT_DELAYED_WORK(&bond->arp_work, bond_loadbalance_arp_mon);
 	INIT_DELAYED_WORK(&bond->ad_work, bond_3ad_state_machine_handler);
+	INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
 }
 
 static void bond_work_cancel_all(struct bonding *bond)
@@ -3019,6 +3047,7 @@ static void bond_work_cancel_all(struct bonding *bond)
 	cancel_delayed_work_sync(&bond->alb_work);
 	cancel_delayed_work_sync(&bond->ad_work);
 	cancel_delayed_work_sync(&bond->mcast_work);
+	cancel_delayed_work_sync(&bond->slave_arr_work);
 }
 
 static int bond_open(struct net_device *bond_dev)
@@ -3068,6 +3097,9 @@ static int bond_open(struct net_device *bond_dev)
 		bond_3ad_initiate_agg_selection(bond, 1);
 	}
 
+	if (bond_mode_uses_xmit_hash(bond))
+		bond_update_slave_arr(bond, NULL);
+
 	return 0;
 }
 
@@ -3573,20 +3605,148 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
 	return NETDEV_TX_OK;
 }
 
-/* In bond_xmit_xor() , we determine the output device by using a pre-
- * determined xmit_hash_policy(), If the selected device is not enabled,
- * find the next active slave.
+/* Use this to update slave_array when (a) it's not appropriate to update
+ * slave_array right away (note that update_slave_array() may sleep)
+ * and / or (b) RTNL is not held.
  */
-static int bond_xmit_xor(struct sk_buff *skb, struct net_device *bond_dev)
+void bond_slave_arr_work_rearm(struct bonding *bond, unsigned long delay)
 {
-	struct bonding *bond = netdev_priv(bond_dev);
-	int slave_cnt = ACCESS_ONCE(bond->slave_cnt);
+	queue_delayed_work(bond->wq, &bond->slave_arr_work, delay);
+}
 
-	if (likely(slave_cnt))
-		bond_xmit_slave_id(bond, skb,
-				   bond_xmit_hash(bond, skb) % slave_cnt);
-	else
+/* Slave array work handler. Holds only RTNL */
+static void bond_slave_arr_handler(struct work_struct *work)
+{
+	struct bonding *bond = container_of(work, struct bonding,
+					    slave_arr_work.work);
+	int ret;
+
+	if (!rtnl_trylock())
+		goto err;
+
+	ret = bond_update_slave_arr(bond, NULL);
+	rtnl_unlock();
+	if (ret) {
+		pr_warn_ratelimited("Failed to update slave array from WT\n");
+		goto err;
+	}
+	return;
+
+err:
+	bond_slave_arr_work_rearm(bond, 1);
+}
+
+/* Build the usable slaves array in control path for modes that use xmit-hash
+ * to determine the slave interface -
+ * (a) BOND_MODE_8023AD
+ * (b) BOND_MODE_XOR
+ * (c) BOND_MODE_TLB && tlb_dynamic_lb == 0
+ *
+ * The caller is expected to hold RTNL only and NO other lock!
+ */
+int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave)
+{
+	struct slave *slave;
+	struct list_head *iter;
+	struct bond_up_slave *new_arr, *old_arr;
+	int slaves_in_agg;
+	int agg_id = 0;
+	int ret = 0;
+
+#ifdef CONFIG_LOCKDEP
+	WARN_ON(lockdep_is_held(&bond->mode_lock));
+#endif
+
+	new_arr = kzalloc(offsetof(struct bond_up_slave, arr[bond->slave_cnt]),
+			  GFP_KERNEL);
+	if (!new_arr) {
+		ret = -ENOMEM;
+		pr_err("Failed to build slave-array.\n");
+		goto out;
+	}
+	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
+		struct ad_info ad_info;
+
+		if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
+			pr_debug("bond_3ad_get_active_agg_info failed\n");
+			kfree_rcu(new_arr, rcu);
+			/* No active aggragator means it's not safe to use
+			 * the previous array.
+			 */
+			old_arr = rtnl_dereference(bond->slave_arr);
+			if (old_arr) {
+				RCU_INIT_POINTER(bond->slave_arr, NULL);
+				kfree_rcu(old_arr, rcu);
+			}
+			goto out;
+		}
+		slaves_in_agg = ad_info.ports;
+		agg_id = ad_info.aggregator_id;
+	}
+	bond_for_each_slave(bond, slave, iter) {
+		if (BOND_MODE(bond) == BOND_MODE_8023AD) {
+			struct aggregator *agg;
+
+			agg = SLAVE_AD_INFO(slave)->port.aggregator;
+			if (!agg || agg->aggregator_identifier != agg_id)
+				continue;
+		}
+		if (!bond_slave_can_tx(slave))
+			continue;
+		if (skipslave == slave)
+			continue;
+		new_arr->arr[new_arr->count++] = slave;
+	}
+
+	old_arr = rtnl_dereference(bond->slave_arr);
+	rcu_assign_pointer(bond->slave_arr, new_arr);
+	if (old_arr)
+		kfree_rcu(old_arr, rcu);
+out:
+	if (ret != 0 && skipslave) {
+		int idx;
+
+		/* Rare situation where caller has asked to skip a specific
+		 * slave but allocation failed (most likely!). BTW this is
+		 * only possible when the call is initiated from
+		 * __bond_release_one(). In this situation; overwrite the
+		 * skipslave entry in the array with the last entry from the
+		 * array to avoid a situation where the xmit path may choose
+		 * this to-be-skipped slave to send a packet out.
+		 */
+		old_arr = rtnl_dereference(bond->slave_arr);
+		for (idx = 0; idx < old_arr->count; idx++) {
+			if (skipslave == old_arr->arr[idx]) {
+				old_arr->arr[idx] =
+				    old_arr->arr[old_arr->count-1];
+				old_arr->count--;
+				break;
+			}
+		}
+	}
+	return ret;
+}
+
+/* Use this Xmit function for 3AD as well as XOR modes. The current
+ * usable slave array is formed in the control path. The xmit function
+ * just calculates hash and sends the packet out.
+ */
+int bond_3ad_xor_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct bonding *bond = netdev_priv(dev);
+	struct slave *slave;
+	struct bond_up_slave *slaves;
+	unsigned int count;
+
+	slaves = rcu_dereference(bond->slave_arr);
+	count = slaves ? ACCESS_ONCE(slaves->count) : 0;
+	if (likely(count)) {
+		slave = slaves->arr[bond_xmit_hash(bond, skb) % count];
+		bond_dev_queue_xmit(bond, skb, slave->dev);
+	} else {
 		dev_kfree_skb_any(skb);
+		atomic_long_inc(&dev->tx_dropped);
+	}
 
 	return NETDEV_TX_OK;
 }
@@ -3682,12 +3842,11 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
 		return bond_xmit_roundrobin(skb, dev);
 	case BOND_MODE_ACTIVEBACKUP:
 		return bond_xmit_activebackup(skb, dev);
+	case BOND_MODE_8023AD:
 	case BOND_MODE_XOR:
-		return bond_xmit_xor(skb, dev);
+		return bond_3ad_xor_xmit(skb, dev);
 	case BOND_MODE_BROADCAST:
 		return bond_xmit_broadcast(skb, dev);
-	case BOND_MODE_8023AD:
-		return bond_3ad_xmit_xor(skb, dev);
 	case BOND_MODE_ALB:
 		return bond_alb_xmit(skb, dev);
 	case BOND_MODE_TLB:
@@ -3861,6 +4020,7 @@ static void bond_uninit(struct net_device *bond_dev)
 	struct bonding *bond = netdev_priv(bond_dev);
 	struct list_head *iter;
 	struct slave *slave;
+	struct bond_up_slave *arr;
 
 	bond_netpoll_cleanup(bond_dev);
 
@@ -3869,6 +4029,12 @@ static void bond_uninit(struct net_device *bond_dev)
 		__bond_release_one(bond_dev, slave->dev, true);
 	netdev_info(bond_dev, "Released all slaves\n");
 
+	arr = rtnl_dereference(bond->slave_arr);
+	if (arr) {
+		RCU_INIT_POINTER(bond->slave_arr, NULL);
+		kfree_rcu(arr, rcu);
+	}
+
 	list_del(&bond->bond_list);
 
 	bond_debug_unregister(bond);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 5b022da9cad2..10920f0686e2 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -179,6 +179,12 @@ struct slave {
 	struct rtnl_link_stats64 slave_stats;
 };
 
+struct bond_up_slave {
+	unsigned int	count;
+	struct rcu_head rcu;
+	struct slave	*arr[0];
+};
+
 /*
  * Link pseudo-state only used internally by monitors
  */
@@ -193,6 +199,7 @@ struct bonding {
 	struct   slave __rcu *curr_active_slave;
 	struct   slave __rcu *current_arp_slave;
 	struct   slave __rcu *primary_slave;
+	struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
 	bool     force_primary;
 	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
 	int     (*recv_probe)(const struct sk_buff *, struct bonding *,
@@ -222,6 +229,7 @@ struct bonding {
 	struct   delayed_work alb_work;
 	struct   delayed_work ad_work;
 	struct   delayed_work mcast_work;
+	struct   delayed_work slave_arr_work;
 #ifdef CONFIG_DEBUG_FS
 	/* debugging support via debugfs */
 	struct	 dentry *debug_dir;
@@ -534,6 +542,8 @@ const char *bond_slave_link_status(s8 link);
 struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
 					      struct net_device *end_dev,
 					      int level);
+int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
+void bond_slave_arr_work_rearm(struct bonding *bond, unsigned long delay);
 
 #ifdef CONFIG_PROC_FS
 void bond_create_proc_entry(struct bonding *bond);
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* Re: [PATCH net-next] net: dsa: do not call phy_start_aneg
From: David Miller @ 2014-10-05  0:45 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev
In-Reply-To: <1412301363-8478-2-git-send-email-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu,  2 Oct 2014 18:56:03 -0700

> Commit f7f1de51edbd ("net: dsa: start and stop the PHY state machine")
> add calls to phy_start() in dsa_slave_open() respectively phy_stop() in
> dsa_slave_close().
> 
> We also call phy_start_aneg() in dsa_slave_create(), and this call is
> messing up with the PHY state machine, since we basically start the
> auto-negotiation, and later on restart it when calling phy_start().
> phy_start() does not currently handle the PHY_FORCING or PHY_AN states
> properly, but such a fix would be too invasive for this window.
> 
> Fixes: f7f1de51edbd ("net: dsa: start and stop the PHY state machine")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] net: sched: suspicious RCU usage in qdisc_watchdog
From: David Miller @ 2014-10-05  0:46 UTC (permalink / raw)
  To: john.fastabend; +Cc: xiyou.wangcong, eric.dumazet, netdev
In-Reply-To: <20141003054306.20821.48420.stgit@nitbit.x32>

From: John Fastabend <john.fastabend@gmail.com>
Date: Thu, 02 Oct 2014 22:43:09 -0700

> Suspicious RCU usage in qdisc_watchdog call needs to be done inside
> rcu_read_lock/rcu_read_unlock. And then Qdisc destroy operations
> need to ensure timer is cancelled before removing qdisc structure.
 ...
> Fixes: b26b0d1e8b1 ("net: qdisc: use rcu prefix and silence sparse warnings")
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH net v2 1/1] ematch: Fix early ending of inverted containers.
From: David Miller @ 2014-10-05  0:50 UTC (permalink / raw)
  To: ignacy.gawedzki; +Cc: netdev
In-Reply-To: <20141003134448.GB26399@zenon.in.qult.net>

From: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Date: Fri, 3 Oct 2014 15:44:48 +0200

> The result of a negated container has to be inverted before checking for
> early ending.
> 
> This fixes my previous attempt (17c9c8232663a47f074b7452b9b034efda868ca7) to
> make inverted containers work correctly.
> 
> Signed-off-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>

Applied.

^ permalink raw reply

* Re: [PATCH] team: avoid race condition in scheduling delayed work
From: David Miller @ 2014-10-05  0:51 UTC (permalink / raw)
  To: joe.lawrence; +Cc: netdev, jiri
In-Reply-To: <1412344714-29938-1-git-send-email-joe.lawrence@stratus.com>

From: Joe Lawrence <joe.lawrence@stratus.com>
Date: Fri, 3 Oct 2014 09:58:34 -0400

> When team_notify_peers and team_mcast_rejoin are called, they both reset
> their respective .count_pending atomic variable. Then when the actual
> worker function is executed, the variable is atomically decremented.
> This pattern introduces a potential race condition where the
> .count_pending rolls over and the worker function keeps rescheduling
> until .count_pending decrements to zero again:
 ...
> Instead of assigning a new value to .count_pending, use atomic_add to
> tack-on the additional desired worker function invocations.
> 
> Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com>
> Acked-by: Jiri Pirko <jiri@resnulli.us>
> Fixes: fc423ff00df3a19554414ee ("team: add peer notification")
> Fixes: 492b200efdd20b8fcfdac87 ("team: add support for sending multicast rejoins")

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock
From: Tejun Heo @ 2014-10-05  2:13 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Joe Lawrence, netdev, Jiri Pirko
In-Reply-To: <20141004083732.GG5015@linux.vnet.ibm.com>

Hello,

On Sat, Oct 04, 2014 at 01:37:32AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 03, 2014 at 03:37:01PM -0400, Joe Lawrence wrote:
> > I gave this a spin, probably inserting the call in the wrong place:
> > 
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 5dbe22a..77f128e 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -2045,7 +2045,8 @@ __acquires(&pool->lock)
> >          * indefinitely requeue itself while all other CPUs are trapped in
> >          * stop_machine.
> >          */
> > -       cond_resched();
> > +       if (!cond_resched())
> > +               rcu_note_context_switch(raw_smp_processor_id());
> > 
> >         spin_lock_irq(&pool->lock);
> 
> If the cond_resched() is in the right place, then you should be good.

Yeah, it looks good to me.

> FWIW, there is a cond_resched_rcu_qs() that should be going into the next
> merge window that could be used in place of the above two lines.  This is
> commit bde6c3aa9930 in -tip.

That sounds even better.

Joe, can you please send a patch with proper SOB and description?

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] if_link: add client name to port profile
From: Stephen Hemminger @ 2014-10-05  3:05 UTC (permalink / raw)
  To: Govindarajulu Varadarajan; +Cc: davem, netdev, ssujith, benve
In-Reply-To: <1412289683-8278-2-git-send-email-_govind@gmx.com>

On Fri,  3 Oct 2014 04:11:22 +0530
Govindarajulu Varadarajan <_govind@gmx.com> wrote:

> This patch adds client name to port profile.
> 
> This is used by netlink client to send the client name in port profile.
> 
> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
> ---
>  include/uapi/linux/if_link.h | 1 +
>  net/core/rtnetlink.c         | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
> index 0bdb77e..6ae0b0b 100644
> --- a/include/uapi/linux/if_link.h
> +++ b/include/uapi/linux/if_link.h
> @@ -512,6 +512,7 @@ enum {
>  	IFLA_PORT_HOST_UUID,		/* binary UUID */
>  	IFLA_PORT_REQUEST,		/* __u8 */
>  	IFLA_PORT_RESPONSE,		/* __u16, output only */
> +	IFLA_PORT_VMNAME,		/* vm-name used by port profile */
>  	__IFLA_PORT_MAX,
>  };
>  
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index a688268..116d647 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1238,6 +1238,8 @@ static const struct nla_policy ifla_port_policy[IFLA_PORT_MAX+1] = {
>  				    .len = PORT_UUID_MAX },
>  	[IFLA_PORT_REQUEST]	= { .type = NLA_U8, },
>  	[IFLA_PORT_RESPONSE]	= { .type = NLA_U16, },
> +	[IFLA_PORT_VMNAME]	= { .type = NLA_STRING,
> +				    .len = PORT_PROFILE_MAX },
>  };
>  
>  static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)

Maybe you could use the existing IFLA_IFALIAS?
It is already supported by iproute tools and sysfs, and can be used by net-snmp as well.
You would just be setting the default string, users could change it.

^ permalink raw reply

* bridge: Respect call-iptables sysctls everywhere
From: Herbert Xu @ 2014-10-05  3:53 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, bsd, stephen, netdev, eric.dumazet, davidn,
	David S. Miller
In-Reply-To: <20141004180647.GB1241@breakpoint.cc>

On Sat, Oct 04, 2014 at 08:06:47PM +0200, Florian Westphal wrote:
>
> Fair enough.  We lose frag_max_size information from ipv4 defrag,

Sigh.  Why are people still doing IP netfilter through the bridge?
It's a huge security hole because all bridge devices share the same
defrag zone so each bridge port can inject packets into any bridge
device on the system through conntrack.  It used to be an even bigger
hole when all defrag were in the same zone which meant that you could
inject packets into the IP stack itself.  At least that hole is
closed now.

So in this case what we have is a bridge packet that temporarily
enters the IP stack for filtering, then reenters the bridge for
processing, and then gets reinserted into the IP stack for filtering.

What we should do therefore is to save any necessary information
such as frag_max_size into the bridge CB area when reentering the
bridge and then copy it back upon the next reentry into the IP stack.

But really we should be printing a big warning to tell people that
this feature (specifically IP netfilter through the bridge, netfilter
through the bridge itself is fine) is insecure and shouldn't be used
until such a time that it is redesigned properly.

> plus netfilter hooks are called without validating ip options.

This was the status quo before the patch in question.  Patches are
welcome.
 
> So I am fine with it, provided we rename br_parse_ip_options() --
> thats not what it does after this patch (br_validate_iphdr(), for
> example?)

I thought about renaming it but if we ever do add option parsing
then we'll be renaming it back.  So let's just stick with the name
plus my comment in the function.

While reviewing this code it occured to me that we have a serious
bug in that call-iptables sysctls aren't even respected in FORWARD
and POST_ROUTING.  Here is a patch that fixes this.

bridge: Respect call-iptables sysctls everywhere

>From the very beginning the call-iptables sysctl only prevented
the PRE_ROUTING hook from entering the IP stack.  This is very
wrong.  The sysctl is used because entering the IP stack from the
bridge has serious security ramifications so when the admin says
that we shouldn't do it, it really means no.

This patch fixes this by also checking the sysctl in FORWARD and
POST_ROUTING.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index c0fdb4d..389d1c6 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -171,20 +171,28 @@ void br_netfilter_rtable_init(struct net_bridge *br)
 	rt->dst.ops = &fake_dst_ops;
 }
 
-static inline struct rtable *bridge_parent_rtable(const struct net_device *dev)
+static inline struct net_bridge *bridge_parent(const struct net_device *dev)
 {
 	struct net_bridge_port *port;
 
 	port = br_port_get_rcu(dev);
-	return port ? &port->br->fake_rtable : NULL;
+	return port ? port->br : NULL;
 }
 
-static inline struct net_device *bridge_parent(const struct net_device *dev)
+static inline struct rtable *bridge_parent_rtable(const struct net_device *dev)
 {
-	struct net_bridge_port *port;
+	struct net_bridge *br;
 
-	port = br_port_get_rcu(dev);
-	return port ? port->br->dev : NULL;
+	br = bridge_parent(dev);
+	return br ? &br->fake_rtable : NULL;
+}
+
+static inline struct net_device *bridge_parent_dev(const struct net_device *dev)
+{
+	struct net_bridge *br;
+
+	br = bridge_parent(dev);
+	return br ? br->dev : NULL;
 }
 
 static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
@@ -367,7 +375,7 @@ static int br_nf_pre_routing_finish_bridge(struct sk_buff *skb)
 	struct neighbour *neigh;
 	struct dst_entry *dst;
 
-	skb->dev = bridge_parent(skb->dev);
+	skb->dev = bridge_parent_dev(skb->dev);
 	if (!skb->dev)
 		goto free_skb;
 	dst = skb_dst(skb);
@@ -517,7 +525,7 @@ static struct net_device *brnf_get_logical_dev(struct sk_buff *skb, const struct
 {
 	struct net_device *vlan, *br;
 
-	br = bridge_parent(dev);
+	br = bridge_parent_dev(dev);
 	if (brnf_pass_vlan_indev == 0 || !vlan_tx_tag_present(skb))
 		return br;
 
@@ -763,6 +771,7 @@ static unsigned int br_nf_forward_ip(const struct nf_hook_ops *ops,
 {
 	struct nf_bridge_info *nf_bridge;
 	struct net_device *parent;
+	struct net_bridge *br;
 	u_int8_t pf;
 
 	if (!skb->nf_bridge)
@@ -773,15 +782,21 @@ static unsigned int br_nf_forward_ip(const struct nf_hook_ops *ops,
 	if (!nf_bridge_unshare(skb))
 		return NF_DROP;
 
-	parent = bridge_parent(out);
-	if (!parent)
+	br = bridge_parent(out);
+	if (!br)
 		return NF_DROP;
 
-	if (IS_IP(skb) || IS_VLAN_IP(skb) || IS_PPPOE_IP(skb))
+	parent = br->dev;
+
+	if (IS_IP(skb) || IS_VLAN_IP(skb) || IS_PPPOE_IP(skb)) {
 		pf = NFPROTO_IPV4;
-	else if (IS_IPV6(skb) || IS_VLAN_IPV6(skb) || IS_PPPOE_IPV6(skb))
+		if (!brnf_call_iptables && !br->nf_call_iptables)
+			return NF_ACCEPT;
+	} else if (IS_IPV6(skb) || IS_VLAN_IPV6(skb) || IS_PPPOE_IPV6(skb)) {
 		pf = NFPROTO_IPV6;
-	else
+		if (!brnf_call_ip6tables && !br->nf_call_ip6tables)
+			return NF_ACCEPT;
+	} else
 		return NF_ACCEPT;
 
 	nf_bridge_pull_encap_header(skb);
@@ -877,20 +892,27 @@ static unsigned int br_nf_post_routing(const struct nf_hook_ops *ops,
 				       int (*okfn)(struct sk_buff *))
 {
 	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
-	struct net_device *realoutdev = bridge_parent(skb->dev);
+	struct net_bridge *br = bridge_parent(skb->dev);
+	struct net_device *realoutdev;
 	u_int8_t pf;
 
 	if (!nf_bridge || !(nf_bridge->mask & BRNF_BRIDGED))
 		return NF_ACCEPT;
 
-	if (!realoutdev)
+	if (!br)
 		return NF_DROP;
 
-	if (IS_IP(skb) || IS_VLAN_IP(skb) || IS_PPPOE_IP(skb))
+	realoutdev = br->dev;
+
+	if (IS_IP(skb) || IS_VLAN_IP(skb) || IS_PPPOE_IP(skb)) {
 		pf = NFPROTO_IPV4;
-	else if (IS_IPV6(skb) || IS_VLAN_IPV6(skb) || IS_PPPOE_IPV6(skb))
+		if (!brnf_call_iptables && !br->nf_call_iptables)
+			return NF_ACCEPT;
+	} else if (IS_IPV6(skb) || IS_VLAN_IPV6(skb) || IS_PPPOE_IPV6(skb)) {
 		pf = NFPROTO_IPV6;
-	else
+		if (!brnf_call_ip6tables && !br->nf_call_ip6tables)
+			return NF_ACCEPT;
+	} else
 		return NF_ACCEPT;
 
 	/* We assume any code from br_dev_queue_push_xmit onwards doesn't care

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
From: Herbert Xu @ 2014-10-05  4:00 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, bsd, stephen, netdev, eric.dumazet, davidn,
	David S. Miller
In-Reply-To: <20141005035343.GA13696@gondor.apana.org.au>

On Sun, Oct 05, 2014 at 11:53:43AM +0800, Herbert Xu wrote:
> 
> What we should do therefore is to save any necessary information
> such as frag_max_size into the bridge CB area when reentering the
> bridge and then copy it back upon the next reentry into the IP stack.

Here's a patch that does just that:

bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING

As we may defragment the packet in IPv4 PRE_ROUTING and refragment
it after POST_ROUTING we should save the value of frag_max_size.

This is still very wrong as the bridge is supposed to leave the
packets intact, meaning that the right thing to do is to use the
original frag_list for fragmentation.

Unfortunately we don't currently guarantee that the frag_list is
left untouched throughout netfilter so until this changes this is
the best we can do.

There is also a spot in FORWARD where it appears that we can
forward a packet without going through fragmentation, mark it
so that we can fix it later.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 389d1c6..47fe079 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -398,6 +398,7 @@ static int br_nf_pre_routing_finish_bridge(struct sk_buff *skb)
 							 ETH_HLEN-ETH_ALEN);
 			/* tell br_dev_xmit to continue with forwarding */
 			nf_bridge->mask |= BRNF_BRIDGED_DNAT;
+			/* FIXME Need to refragment */
 			ret = neigh->output(neigh, skb);
 		}
 		neigh_release(neigh);
@@ -453,6 +454,10 @@ static int br_nf_pre_routing_finish(struct sk_buff *skb)
 	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
 	struct rtable *rt;
 	int err;
+	int frag_max_size;
+
+	frag_max_size = IPCB(skb)->frag_max_size;
+	BR_INPUT_SKB_CB(skb)->frag_max_size = frag_max_size;
 
 	if (nf_bridge->mask & BRNF_PKT_TYPE) {
 		skb->pkt_type = PACKET_OTHERHOST;
@@ -864,13 +869,19 @@ static unsigned int br_nf_forward_arp(const struct nf_hook_ops *ops,
 static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 {
 	int ret;
+	int frag_max_size;
 
+	/* This is wrong! We should preserve the original fragment
+	 * boundaries by preserving frag_list rather than refragmenting.
+	 */
 	if (skb->protocol == htons(ETH_P_IP) &&
 	    skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
 	    !skb_is_gso(skb)) {
+		frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
 		if (br_parse_ip_options(skb))
 			/* Drop invalid packet */
 			return NF_DROP;
+		IPCB(skb)->frag_max_size = frag_max_size;
 		ret = ip_fragment(skb, br_dev_queue_push_xmit);
 	} else
 		ret = br_dev_queue_push_xmit(skb);
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index b6c04cb..2398369 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -305,10 +305,14 @@ struct net_bridge
 
 struct br_input_skb_cb {
 	struct net_device *brdev;
+
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 	int igmp;
 	int mrouters_only;
 #endif
+
+	u16 frag_max_size;
+
 #ifdef CONFIG_BRIDGE_VLAN_FILTERING
 	bool vlan_filtered;
 #endif

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related

* Re: [PATCH ethtool 2/3] ethtool: Add rx_copybreak support
From: Amir Vadai @ 2014-10-05  7:54 UTC (permalink / raw)
  To: Govindarajulu Varadarajan, ben
  Cc: netdev, ssujith, Or Gerlitz, Yevgeny Petrilin, Eric Dumazet
In-Reply-To: <1410602207-9084-3-git-send-email-_govind@gmx.com>

On 9/13/2014 12:56 PM, Govindarajulu Varadarajan wrote:
> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com>
> ---

[...]

>  	  "		[ rx-jumbo N ]\n"
>  	  "		[ tx N ]\n" },
> +	{ "-b|--show-rx_copybreak", 1, do_grx_copybreak, "Show rx_copybreak value" },
> +	{ "-B|--set-rx_copybreak", 1, do_srx_copybreak, "Set rx_copybreak value",
> +	  "		N\n" },
>  	{ "-k|--show-features|--show-offload", 1, do_gfeatures,

Hi,

Since a get/set for TX copybreak will be also is needed, please change
the user command to something like "-b rx NNN", and when I will add tx
copybreak support, I will add "-b tx NNN".

Thanks,
amir

^ permalink raw reply

* [PATCH V1 net-next 2/2] net/mlx4_core: Disable BF when write combining is not available
From: Or Gerlitz @ 2014-10-05  8:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Amir Vadai, Jack Morgenstein, Moshe Lazer, Tal Alon,
	Yevgeny Petrilin, Or Gerlitz
In-Reply-To: <1412497342-12451-1-git-send-email-ogerlitz@mellanox.com>

From: Moshe Lazer <moshel@mellanox.com>

In mlx4 for better latency, we write send descriptors to a write-combining
(WC) mapped buffer instead of ringing a doorbell and having the HW fetch
the descriptor from system memory.

However, if write-combining is not supported on the host, then we
obtain better latency by using the doorbell-ring/HW fetch mechanism.

The mechanism that uses WC is called Blue-Flame (BF). BF is beneficial
only when the system supports write combining. When the BF buffer is
mapped as a write-combine buffer, the HCA receives data in multi-word
bursts. However, if the BF buffer is mapped only as non-cached, the
HCA receives data in individual dword chunks, which harms performance.

Therefore, disable blueflame when write combining is not available.

Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/fw.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 2e88a23..f7bb548 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -671,7 +671,7 @@ int mlx4_QUERY_DEV_CAP(struct mlx4_dev *dev, struct mlx4_dev_cap *dev_cap)
 	dev_cap->min_page_sz = 1 << field;
 
 	MLX4_GET(field, outbox, QUERY_DEV_CAP_BF_OFFSET);
-	if (field & 0x80) {
+	if ((field & 0x80) && writecombine_available()) {
 		MLX4_GET(field, outbox, QUERY_DEV_CAP_LOG_BF_REG_SZ_OFFSET);
 		dev_cap->bf_reg_size = 1 << (field & 0x1f);
 		MLX4_GET(field, outbox, QUERY_DEV_CAP_LOG_MAX_BF_REGS_PER_PAGE_OFFSET);
-- 
1.7.1

^ permalink raw reply related

* [PATCH V1 net-next 0/2] Add pgtable API to query if write combining is available
From: Or Gerlitz @ 2014-10-05  8:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Amir Vadai, Jack Morgenstein, Moshe Lazer, Tal Alon,
	Yevgeny Petrilin, Or Gerlitz

Currently the kernel write-combining interface provides a best effort
mechanism in which the caller simply invokes pgprot_writecombine().

If write combining is available, the region is mapped for it, otherwise
the region is (silently) mapped as non-cached. In some cases, however, 
the calling driver must know if write combining is available, so a silent 
best effort mechanism is not sufficient. Add writecombine_available(), which 
returns 1 if the system supports write combining and 0 if it doesn't.

In mlx4 for better latency, we write send descriptors to a write-combining
(WC) mapped buffer instead of ringing a doorbell and having the HW fetch
the descriptor from system memory.

However, if write-combining is not supported on the host, then we
obtain better latency by using the doorbell-ring/HW fetch mechanism.

This series from Moshe and Jack adds the API and uses in in mlx4.

We are sending through netdev to get feedback from the networking 
community and extend the reviewer audience if required.

Per the reviewers request, here are some results from these 
three different configurations:

[1] bf=on with wc
[2] bf=on without wc
[3] bf=off and doorbell 

The 1st set of results was obtained from running latency test 
with the HCA being passthrough-ed into VM running over KVM 
host -- so WC isn't available.

The problematic range is 32-128B, for example with 128 bytes 
message, using BF has latency of 1.47us and no usage of BF 
only 1us. When WC isn't really available every write of 64B
would actually translate into 8 writes of 8 bytes which obviously
hurts the latency.

# /usr/bin/taskset -c 0 ib_write_lat -d mlx4_0 -i 1  -F -a -n 1000000

[2] BF on without WC 
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
 2       1000000          0.74           186.16       0.79
 4       1000000          0.70           103.62       0.78
 8       1000000          0.74           77.02        0.78
 16      1000000          0.65           640.75       0.86
 32      1000000          0.90           134.63       0.96
 64      1000000          1.05           808.52       1.11
 128     1000000          1.05           405.58       1.47
 
[3] BF off and using doorbell
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
 2       1000000          0.85           107.29       0.89
 4       1000000          0.84           705.90       0.89
 8       1000000          0.85           457.72       0.89
 16      1000000          0.85           1041.43      0.90
 32      1000000          0.88           773.67       0.92
 64      1000000          0.90           82.70        0.93
 128     1000000          0.96           78.20        1.00

The 2nd set of results was obtained from running latency test 
over bare-metal host where WC is available. Clearly we gain
better latency when BF is used vs. the doorbell base.

# /usr/bin/taskset -c 0 ib_write_lat -d mlx4_0 -i 1  -F -a -n 1000000

[1] BF on, WC available
#bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
 2       1000000          0.74           131.62       0.79
 4       1000000          0.74           134.51       0.79
 8       1000000          0.74           154.30       0.79
 16      1000000          0.74           1437.57      0.79
 32      1000000          0.79           138.23       0.83
 64      1000000          0.82           135.86       0.85
 128     1000000          0.94           131.11       0.98

[3] BF off and using doorbell
#bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
 2       1000000          1.05           137.55       1.10
 4       1000000          1.04           422.50       1.10
 8       1000000          1.05           141.26       1.10
 16      1000000          1.06           1261.99      1.11
 32      1000000          1.09           141.47       1.14
 64      1000000          1.11           435.44       1.16
 128     1000000          1.22           212.19       1.27

Moshe and Or.

changes from V0:
  - changed the WC helper to return bool value


Moshe Lazer (2):
  pgtable: Add API to query if write combining is available
  net/mlx4_core: Disable BF when write combining is not available

 arch/arm/include/asm/pgtable.h          |    6 ++++++
 arch/arm64/include/asm/pgtable.h        |    5 +++++
 arch/ia64/include/asm/pgtable.h         |    6 ++++++
 arch/powerpc/include/asm/pgtable.h      |    6 ++++++
 arch/x86/include/asm/pgtable_types.h    |    2 ++
 arch/x86/mm/pat.c                       |    9 +++++++++
 drivers/net/ethernet/mellanox/mlx4/fw.c |    2 +-
 include/asm-generic/pgtable.h           |    8 ++++++++
 8 files changed, 43 insertions(+), 1 deletions(-)

^ permalink raw reply

* [PATCH V1 net-next 1/2] pgtable: Add API to query if write combining is available
From: Or Gerlitz @ 2014-10-05  8:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Amir Vadai, Jack Morgenstein, Moshe Lazer, Tal Alon,
	Yevgeny Petrilin, Or Gerlitz
In-Reply-To: <1412497342-12451-1-git-send-email-ogerlitz@mellanox.com>

From: Moshe Lazer <moshel@mellanox.com>

Currently the kernel write-combining interface provides a best effort
mechanism in which the caller simply invokes pgprot_writecombine().

If write combining is available, the region is mapped for it, otherwise
the region is (silently) mapped as non-cached.

In some cases, however, the calling driver must know if write combining
is available, so a silent best effort mechanism is not sufficient.

Add writecombine_available(), which returns true if the system
supports write combining and false if it doesn't.

Signed-off-by: Moshe Lazer <moshel@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 arch/arm/include/asm/pgtable.h       |    6 ++++++
 arch/arm64/include/asm/pgtable.h     |    5 +++++
 arch/ia64/include/asm/pgtable.h      |    6 ++++++
 arch/powerpc/include/asm/pgtable.h   |    6 ++++++
 arch/x86/include/asm/pgtable_types.h |    2 ++
 arch/x86/mm/pat.c                    |    9 +++++++++
 include/asm-generic/pgtable.h        |    8 ++++++++
 7 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index 01baef0..ce06b64 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -119,6 +119,12 @@ extern pgprot_t		pgprot_s2_device;
 #define pgprot_writecombine(prot) \
 	__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_BUFFERABLE)
 
+#define writecombine_available writecombine_available
+static inline bool writecombine_available(void)
+{
+	return true;
+}
+
 #define pgprot_stronglyordered(prot) \
 	__pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_UNCACHED)
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ffe1ba0..6ab0630 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -296,6 +296,11 @@ static inline int has_transparent_hugepage(void)
 	__pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_DEVICE_nGnRnE) | PTE_PXN | PTE_UXN)
 #define pgprot_writecombine(prot) \
 	__pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)
+#define writecombine_available writecombine_available
+static inline bool writecombine_available(void)
+{
+	return true;
+}
 #define __HAVE_PHYS_MEM_ACCESS_PROT
 struct file;
 extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 7935115..2e44501 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -356,6 +356,12 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
 #define pgprot_noncached(prot)		__pgprot((pgprot_val(prot) & ~_PAGE_MA_MASK) | _PAGE_MA_UC)
 #define pgprot_writecombine(prot)	__pgprot((pgprot_val(prot) & ~_PAGE_MA_MASK) | _PAGE_MA_WC)
 
+#define writecombine_available writecombine_available
+static inline bool writecombine_available(void)
+{
+	return true;
+}
+
 struct file;
 extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 				     unsigned long size, pgprot_t vma_prot);
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index d98c1ec..3232d98 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -267,6 +267,12 @@ extern int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long addre
 
 #define pgprot_writecombine pgprot_noncached_wc
 
+#define writecombine_available writecombine_available
+static inline bool writecombine_available(void)
+{
+	return true;
+}
+
 struct file;
 extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 				     unsigned long size, pgprot_t vma_prot);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f216963..7d3dc79 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -337,6 +337,8 @@ extern int nx_enabled;
 
 #define pgprot_writecombine	pgprot_writecombine
 extern pgprot_t pgprot_writecombine(pgprot_t prot);
+#define writecombine_available  writecombine_available
+bool writecombine_available(void);
 
 /* Indicate that x86 has its own track and untrack pfn vma functions */
 #define __HAVE_PFNMAP_TRACKING
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 6574388..851ee51 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -797,6 +797,15 @@ pgprot_t pgprot_writecombine(pgprot_t prot)
 }
 EXPORT_SYMBOL_GPL(pgprot_writecombine);
 
+bool writecombine_available(void)
+{
+	if (pat_enabled)
+		return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(writecombine_available);
+
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_X86_PAT)
 
 static struct memtype *memtype_get_idx(loff_t pos)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 53b2acc..2cb40d9 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -249,6 +249,14 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #define pgprot_writecombine pgprot_noncached
 #endif
 
+#ifndef writecombine_available
+#define writecombine_available writecombine_available
+static inline bool writecombine_available(void)
+{
+	return false;
+}
+#endif
+
 /*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
-- 
1.7.1

^ permalink raw reply related

* Re: bridge: Respect call-iptables sysctls everywhere
From: Florian Westphal @ 2014-10-05  9:13 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Florian Westphal, netfilter-devel, bsd, stephen, netdev,
	eric.dumazet, davidn, David S. Miller
In-Reply-To: <20141005035343.GA13696@gondor.apana.org.au>

Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Sat, Oct 04, 2014 at 08:06:47PM +0200, Florian Westphal wrote:
> >
> > Fair enough.  We lose frag_max_size information from ipv4 defrag,
> 
> While reviewing this code it occured to me that we have a serious
> bug in that call-iptables sysctls aren't even respected in FORWARD
> and POST_ROUTING.  Here is a patch that fixes this.

Upcalls to iptables in FORWARD/POSTROUTING depend on skb->nf_bridge
being set up, which only happens when call-iptables=1.

^ permalink raw reply

* [PATCH net-next 00/14] net/mlx4_en: Optimizations to TX flow
From: Amir Vadai @ 2014-10-05  9:35 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai

Hi,

This patchset contains optimizations to TX flow in mlx4_en driver. It also introduce
setting/getting tx copybreak, to enable controlling inline threshold dynamically.

TX flow optimizations was authored and posted to the mailing list by Eric
Dumazet [1] as a single patch. I splitted this patch to smaller patches,
Reviewed it and tested.
Changed from original patch:
- s/iowrite32be/iowrite32/, since ring->doorbell_qpn is stored as be32

The tx copybreak patch was also suggested by Eric Dumazet, and was edited and
reviewed by me. User space patch will be sent after kernel code is ready.

I am sending this patchset now since the merge window is near and don't want to
miss it.

More work need to do:
- Disable BF when xmit_more is in use
- Make TSO use xmit_more too. Maybe by splitting small TSO packets in the
  driver itself, to avoid extra cpu/memory costs of GSO before the driver
- Fix mlx4_en_xmit buggy handling of queue full in the middle of a burst
  partially posted to send queue using xmit_more

Eric, I edited the patches to have you as the Author and the first
signed-off-by. I hope it is ok with you (I wasn't sure if it is ok to sign by
you), anyway all the credit to those changes should go to you.

Patchset was tested and applied over commit 1e203c1 "(net: sched:
suspicious RCU usage in qdisc_watchdog")

[1] - https://patchwork.ozlabs.org/patch/394256/

Thanks,
Amir

Amir Vadai (13):
  net/mlx4_en: Code cleanups in tx path
  net/mlx4_en: Align tx path structures to cache lines
  net/mlx4_en: Avoid calling bswap in tx fast path
  net/mlx4_en: tx_info allocated with kmalloc() instead of vmalloc()
  net/mlx4_en: Avoid a cache line miss in TX completion for single frag
    skb's
  net/mlx4_en: Use prefetch in tx path
  net/mlx4_en: Avoid false sharing in mlx4_en_en_process_tx_cq()
  net/mlx4_en: mlx4_en_xmit() reads ring->cons once, and ahead of time
    to avoid stalls
  net/mlx4_en: Use local var in tx flow for skb_shinfo(skb)
  net/mlx4_en: Use local var for skb_headlen(skb)
  net/mlx4_en: tx_info->ts_requested was not cleared
  net/mlx4_en: Enable the compiler to make is_inline() inlined
  ethtool: Ethtool parameter to dynamically change tx_copybreak

Eric Dumazet (1):
  net/mlx4_en: Use the new tx_copybreak to set inline threshold

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  44 ++++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      | 330 ++++++++++++++----------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  90 ++++---
 include/linux/mlx4/device.h                     |   2 +-
 include/uapi/linux/ethtool.h                    |   1 +
 net/core/ethtool.c                              |   1 +
 6 files changed, 290 insertions(+), 178 deletions(-)

-- 
1.8.3.4

^ permalink raw reply

* [PATCH net-next 01/14] net/mlx4_en: Code cleanups in tx path
From: Amir Vadai @ 2014-10-05  9:35 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay, Amir Vadai
In-Reply-To: <1412501722-25092-1-git-send-email-amirv@mellanox.com>

From: Eric Dumazet <edumazet@google.com>

- Remove unused variable ring->poll_cnt
- No need to set some fields if using blueflame
- Add missing const's
- Use unlikely
- Remove unneeded new line
- Make some comments more precise
- struct mlx4_bf @offset field reduced to unsigned int to save space

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 49 +++++++++++++++-------------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  1 -
 include/linux/mlx4/device.h                  |  2 +-
 3 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 0c50125..eaf23eb 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -191,7 +191,6 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	ring->prod = 0;
 	ring->cons = 0xffffffff;
 	ring->last_nr_txbb = 1;
-	ring->poll_cnt = 0;
 	memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
 	memset(ring->buf, 0, ring->buf_size);
 
@@ -512,7 +511,8 @@ static struct mlx4_en_tx_desc *mlx4_en_bounce_to_desc(struct mlx4_en_priv *priv,
 	return ring->buf + index * TXBB_SIZE;
 }
 
-static int is_inline(int inline_thold, struct sk_buff *skb, void **pfrag)
+static bool is_inline(int inline_thold, const struct sk_buff *skb,
+		      void **pfrag)
 {
 	void *ptr;
 
@@ -535,7 +535,7 @@ static int is_inline(int inline_thold, struct sk_buff *skb, void **pfrag)
 	return 0;
 }
 
-static int inline_size(struct sk_buff *skb)
+static int inline_size(const struct sk_buff *skb)
 {
 	if (skb->len + CTRL_SIZE + sizeof(struct mlx4_wqe_inline_seg)
 	    <= MLX4_INLINE_ALIGN)
@@ -546,7 +546,8 @@ static int inline_size(struct sk_buff *skb)
 			     sizeof(struct mlx4_wqe_inline_seg), 16);
 }
 
-static int get_real_size(struct sk_buff *skb, struct net_device *dev,
+static int get_real_size(const struct sk_buff *skb,
+			 struct net_device *dev,
 			 int *lso_header_size)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -581,8 +582,10 @@ static int get_real_size(struct sk_buff *skb, struct net_device *dev,
 	return real_size;
 }
 
-static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *skb,
-			     int real_size, u16 *vlan_tag, int tx_ind, void *fragptr)
+static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
+			     const struct sk_buff *skb,
+			     int real_size, u16 *vlan_tag,
+			     int tx_ind, void *fragptr)
 {
 	struct mlx4_wqe_inline_seg *inl = &tx_desc->inl;
 	int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
@@ -642,7 +645,8 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 	return fallback(dev, skb) % rings_p_up + up * rings_p_up;
 }
 
-static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
+static void mlx4_bf_copy(void __iomem *dst, const void *src,
+			 unsigned int bytecnt)
 {
 	__iowrite64_copy(dst, src, bytecnt / 8);
 }
@@ -736,11 +740,10 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	tx_info->skb = skb;
 	tx_info->nr_txbb = nr_txbb;
 
+	data = &tx_desc->data;
 	if (lso_header_size)
 		data = ((void *)&tx_desc->lso + ALIGN(lso_header_size + 4,
 						      DS_SIZE));
-	else
-		data = &tx_desc->data;
 
 	/* valid only for none inline segments */
 	tx_info->data_offset = (void *)data - (void *)tx_desc;
@@ -753,9 +756,9 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (is_inline(ring->inline_thold, skb, &fragptr)) {
 		tx_info->inl = 1;
 	} else {
-		/* Map fragments */
+		/* Map fragments if any */
 		for (i = skb_shinfo(skb)->nr_frags - 1; i >= 0; i--) {
-			struct skb_frag_struct *frag;
+			const struct skb_frag_struct *frag;
 			dma_addr_t dma;
 
 			frag = &skb_shinfo(skb)->frags[i];
@@ -772,7 +775,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			--data;
 		}
 
-		/* Map linear part */
+		/* Map linear part if needed */
 		if (tx_info->linear) {
 			u32 byte_count = skb_headlen(skb) - lso_header_size;
 			dma_addr_t dma;
@@ -795,18 +798,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * For timestamping add flag to skb_shinfo and
 	 * set flag for further reference
 	 */
-	if (ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
-	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
-		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+	if (unlikely(ring->hwtstamp_tx_type == HWTSTAMP_TX_ON &&
+		     shinfo->tx_flags & SKBTX_HW_TSTAMP)) {
+		shinfo->tx_flags |= SKBTX_IN_PROGRESS;
 		tx_info->ts_requested = 1;
 	}
 
 	/* Prepare ctrl segement apart opcode+ownership, which depends on
 	 * whether LSO is used */
-	tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
-	tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN *
-		!!vlan_tx_tag_present(skb);
-	tx_desc->ctrl.fence_size = (real_size / 16) & 0x3f;
 	tx_desc->ctrl.srcrb_flags = priv->ctrl_flags;
 	if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
 		tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
@@ -852,7 +851,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 			 cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
 		tx_info->nr_bytes = max_t(unsigned int, skb->len, ETH_ZLEN);
 		ring->packets++;
-
 	}
 	ring->bytes += tx_info->nr_bytes;
 	netdev_tx_sent_queue(ring->tx_queue, tx_info->nr_bytes);
@@ -874,7 +872,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	ring->prod += nr_txbb;
 
 	/* If we used a bounce buffer then copy descriptor back into place */
-	if (bounce)
+	if (unlikely(bounce))
 		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
 
 	skb_tx_timestamp(skb);
@@ -894,13 +892,18 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		wmb();
 
-		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, (unsigned long *) &tx_desc->ctrl,
-		     desc_size);
+		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
+			     desc_size);
 
 		wmb();
 
 		ring->bf.offset ^= ring->bf.buf_size;
 	} else {
+		tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
+		tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_VLAN *
+			!!vlan_tx_tag_present(skb);
+		tx_desc->ctrl.fence_size = real_size;
+
 		/* Ensure new descriptor hits memory
 		 * before setting ownership of this descriptor to HW
 		 */
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 84c9d5d..e54b653 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -263,7 +263,6 @@ struct mlx4_en_tx_ring {
 	u32 buf_size;
 	u32 doorbell_qpn;
 	void *buf;
-	u16 poll_cnt;
 	struct mlx4_en_tx_info *tx_info;
 	u8 *bounce_buf;
 	u8 queue_index;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index b2f8ab9..37e4404 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -583,7 +583,7 @@ struct mlx4_uar {
 };
 
 struct mlx4_bf {
-	unsigned long		offset;
+	unsigned int		offset;
 	int			buf_size;
 	struct mlx4_uar	       *uar;
 	void __iomem	       *reg;
-- 
1.8.3.4

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox