Netdev List
 help / color / mirror / Atom feed
* [PATCH 1/3] bonding: allow previous slave to be used when re-balancing traffic on tlb/alb interfaces
From: Jay Vosburgh @ 2009-09-30  0:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Andy Gospodarek
In-Reply-To: <1254269731-7341-1-git-send-email-fubar@us.ibm.com>

From: Andy Gospodarek <andy@greyhouse.net>

When using tlb (mode 5) or alb (mode 6) bonding, a task runs every 10s
and re-balances the output devices based on load.  I was trying to
diagnose some connectivity issues and realized that a high-traffic host
would often switch output interfaces every 10s.  I discovered this
happened because the 'least loaded interface' was chosen as the next
output interface for any given stream and quite often some lower load
traffic would slip in an take the interface previously used by our
stream.  This meant the 'least loaded interface' was no longer the one
we used during the last interval.

The switching of streams to another interface was not extremely helpful
as it would force the destination host or router to update its ARP
tables and produce some additional ARP traffic as the destination host
verified that is was using the MAC address it expected.  Having the
destination MAC for a given IP change every 10s seems undesirable.

The decision was made to use the same slave during this interval if the
current load on that interface was < 10.  A load of < 10 indicates that
during the last 10s sample, roughly 100bytes were sent by all streams
currently assigned to that interface.  This essentially means the
interface is unloaded, but allows for a few frames that will probably
have minimal impact to slip into the same interface we were using in the
past.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

---
 drivers/net/bonding/bond_alb.c |   21 ++++++++++++++++++++-
 drivers/net/bonding/bond_alb.h |    4 ++++
 2 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 9b5936f..cf2842e 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -150,6 +150,7 @@ static inline void tlb_init_table_entry(struct tlb_client_info *entry, int save_
 		entry->load_history = 1 + entry->tx_bytes /
 				      BOND_TLB_REBALANCE_INTERVAL;
 		entry->tx_bytes = 0;
+		entry->last_slave = entry->tx_slave;
 	}
 
 	entry->tx_slave = NULL;
@@ -270,6 +271,24 @@ static struct slave *tlb_get_least_loaded_slave(struct bonding *bond)
 	return least_loaded;
 }
 
+/* Caller must hold bond lock for read and hashtbl lock */
+static struct slave *tlb_get_best_slave(struct bonding *bond, u32 hash_index)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	struct tlb_client_info *tx_hash_table = bond_info->tx_hashtbl;
+	struct slave *last_slave = tx_hash_table[hash_index].last_slave;
+	struct slave *next_slave = NULL;
+
+	if (last_slave && SLAVE_IS_OK(last_slave)) {
+		/* Use the last slave listed in the tx hashtbl if:
+		   the last slave currently is essentially unloaded. */
+		if (SLAVE_TLB_INFO(last_slave).load < 10)
+			next_slave = last_slave;
+	}
+
+	return next_slave ? next_slave : tlb_get_least_loaded_slave(bond);
+}
+
 /* Caller must hold bond lock for read */
 static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u32 skb_len)
 {
@@ -282,7 +301,7 @@ static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u3
 	hash_table = bond_info->tx_hashtbl;
 	assigned_slave = hash_table[hash_index].tx_slave;
 	if (!assigned_slave) {
-		assigned_slave = tlb_get_least_loaded_slave(bond);
+		assigned_slave = tlb_get_best_slave(bond, hash_index);
 
 		if (assigned_slave) {
 			struct tlb_slave_info *slave_info =
diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
index 50968f8..b65fd29 100644
--- a/drivers/net/bonding/bond_alb.h
+++ b/drivers/net/bonding/bond_alb.h
@@ -36,6 +36,10 @@ struct tlb_client_info {
 				 * packets to a Client that the Hash function
 				 * gave this entry index.
 				 */
+	struct slave *last_slave; /* Pointer to last slave used for transmiting
+				 * packets to a Client that the Hash function
+				 * gave this entry index.
+				 */
 	u32 tx_bytes;		/* Each Client acumulates the BytesTx that
 				 * were tranmitted to it, and after each
 				 * CallBack the LoadHistory is devided
-- 
1.6.0.2


^ permalink raw reply related

* [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: Jay Vosburgh @ 2009-09-30  0:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Andy Gospodarek
In-Reply-To: <1254269731-7341-3-git-send-email-fubar@us.ibm.com>

From: Andy Gospodarek <andy@greyhouse.net>

This patch sends ARP request on the correct destination output interface
rather than always sending them on the primary interface.  I've also
added some bits to make sure that the source and destination address in
the ARP header are correct since simply changing the source MAC and
output interface will not be that helpful.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

---
 drivers/net/bonding/bond_alb.c |   45 ++++++++++++++++++---------------------
 1 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 5cd0400..9148c75 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -726,35 +726,32 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
 	struct arp_pkt *arp = arp_pkt(skb);
 	struct slave *tx_slave = NULL;
 
-	if (arp->op_code == htons(ARPOP_REPLY)) {
-		/* the arp must be sent on the selected
-		* rx channel
-		*/
-		tx_slave = rlb_choose_channel(skb, bond);
-		if (tx_slave) {
-			memcpy(arp->mac_src,tx_slave->dev->dev_addr, ETH_ALEN);
-		}
-		pr_debug("Server sent ARP Reply packet\n");
-	} else if (arp->op_code == htons(ARPOP_REQUEST)) {
-		/* Create an entry in the rx_hashtbl for this client as a
-		 * place holder.
-		 * When the arp reply is received the entry will be updated
-		 * with the correct unicast address of the client.
-		 */
-		rlb_choose_channel(skb, bond);
+	/* Choose an output channel for the ARP frame */
+	tx_slave = rlb_choose_channel(skb, bond);
 
-		/* The ARP relpy packets must be delayed so that
-		 * they can cancel out the influence of the ARP request.
-		 */
+	/* If a valid interface is returned, make sure the sender and target MAC
+	 * addresses are correct based on the interface that will be transmitting
+	 * the frame. */
+	if (tx_slave) {
+		/* If sender mac is the bond's address, rewrite */
+		if (!compare_ether_addr_64bits(arp->mac_src,bond->dev->dev_addr))
+			memcpy(arp->mac_src,tx_slave->dev->dev_addr,bond->dev->addr_len);
+
+		/* If target mac is the bond's address, rewrite */
+		if (!compare_ether_addr_64bits(arp->mac_dst,bond->dev->dev_addr))
+			memcpy(arp->mac_dst,tx_slave->dev->dev_addr,bond->dev->addr_len);
+
+	} else if (arp->op_code == htons(ARPOP_REQUEST)) {
+		/* if tx_slave is NULL, the periodic ARP replies must
+		 * be delayed so they can cancel out the influence of
+		 * the ARP request. */
 		bond->alb_info.rlb_update_delay_counter = RLB_UPDATE_DELAY;
 
-		/* arp requests are broadcast and are sent on the primary
-		 * the arp request will collapse all clients on the subnet to
+		/* ARP requests are broadcast and are sent on the primary
+		 * the ARP request will collapse all clients on the subnet to
 		 * the primary slave. We must register these clients to be
-		 * updated with their assigned mac.
-		 */
+		 * updated with their assigned MAC. */
 		rlb_req_update_subnet_clients(bond, arp->ip_src);
-		pr_debug("Server sent ARP Request packet\n");
 	}
 
 	return tx_slave;
-- 
1.6.0.2


^ permalink raw reply related

* [PATCH 2/3] bonding: make sure tx and rx hash tables stay in sync when using alb mode
From: Jay Vosburgh @ 2009-09-30  0:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Andy Gospodarek
In-Reply-To: <1254269731-7341-2-git-send-email-fubar@us.ibm.com>

From: Andy Gospodarek <andy@greyhouse.net>

I noticed that it was easy for alb (mode 6) bonding to get into a state
where the tx hash-table and rx hash-table are out of sync (there is
really nothing to keep them synchronized), and we will transmit traffic
destined for a host on one slave and send ARP frames to the same slave
from another interface using a different source MAC.

There is no compelling reason to do this, so this patch makes sure the
rx hash-table changes whenever the tx hash-table is updated based on
device load.  This patch also drops the code that does rlb re-balancing
since the balancing will now be controlled by the tx hash-table based on
transmit load.  In order to address an issue found with the initial
patch, I have also combined the rx and tx hash table lock into a single
lock.  This will facilitate moving these into a single table at some
point.

Patch modified by Jay Vosburgh to fix a typo and remove some leftover
rlb rebalance code.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
---
 drivers/net/bonding/bond_alb.c |  215 ++++++++++++++--------------------------
 drivers/net/bonding/bond_alb.h |    7 +-
 2 files changed, 75 insertions(+), 147 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index cf2842e..5cd0400 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -118,6 +118,7 @@ static inline struct arp_pkt *arp_pkt(const struct sk_buff *skb)
 
 /* Forward declaration */
 static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[]);
+static struct slave *alb_get_best_slave(struct bonding *bond, u32 hash_index);
 
 static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
 {
@@ -131,18 +132,18 @@ static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
 	return hash;
 }
 
-/*********************** tlb specific functions ***************************/
-
-static inline void _lock_tx_hashtbl(struct bonding *bond)
+/********************* hash table lock functions *************************/
+static inline void _lock_hashtbl(struct bonding *bond)
 {
-	spin_lock_bh(&(BOND_ALB_INFO(bond).tx_hashtbl_lock));
+	spin_lock_bh(&(BOND_ALB_INFO(bond).hashtbl_lock));
 }
 
-static inline void _unlock_tx_hashtbl(struct bonding *bond)
+static inline void _unlock_hashtbl(struct bonding *bond)
 {
-	spin_unlock_bh(&(BOND_ALB_INFO(bond).tx_hashtbl_lock));
+	spin_unlock_bh(&(BOND_ALB_INFO(bond).hashtbl_lock));
 }
 
+/*********************** tlb specific functions ***************************/
 /* Caller must hold tx_hashtbl lock */
 static inline void tlb_init_table_entry(struct tlb_client_info *entry, int save_load)
 {
@@ -170,7 +171,7 @@ static void tlb_clear_slave(struct bonding *bond, struct slave *slave, int save_
 	struct tlb_client_info *tx_hash_table;
 	u32 index;
 
-	_lock_tx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	/* clear slave from tx_hashtbl */
 	tx_hash_table = BOND_ALB_INFO(bond).tx_hashtbl;
@@ -187,7 +188,7 @@ static void tlb_clear_slave(struct bonding *bond, struct slave *slave, int save_
 
 	tlb_init_slave(slave);
 
-	_unlock_tx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* Must be called before starting the monitor timer */
@@ -198,7 +199,7 @@ static int tlb_initialize(struct bonding *bond)
 	struct tlb_client_info *new_hashtbl;
 	int i;
 
-	spin_lock_init(&(bond_info->tx_hashtbl_lock));
+	spin_lock_init(&(bond_info->hashtbl_lock));
 
 	new_hashtbl = kzalloc(size, GFP_KERNEL);
 	if (!new_hashtbl) {
@@ -207,7 +208,7 @@ static int tlb_initialize(struct bonding *bond)
 		       bond->dev->name);
 		return -1;
 	}
-	_lock_tx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	bond_info->tx_hashtbl = new_hashtbl;
 
@@ -215,7 +216,7 @@ static int tlb_initialize(struct bonding *bond)
 		tlb_init_table_entry(&bond_info->tx_hashtbl[i], 1);
 	}
 
-	_unlock_tx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	return 0;
 }
@@ -225,12 +226,12 @@ static void tlb_deinitialize(struct bonding *bond)
 {
 	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
 
-	_lock_tx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	kfree(bond_info->tx_hashtbl);
 	bond_info->tx_hashtbl = NULL;
 
-	_unlock_tx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* Caller must hold bond lock for read */
@@ -271,24 +272,6 @@ static struct slave *tlb_get_least_loaded_slave(struct bonding *bond)
 	return least_loaded;
 }
 
-/* Caller must hold bond lock for read and hashtbl lock */
-static struct slave *tlb_get_best_slave(struct bonding *bond, u32 hash_index)
-{
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct tlb_client_info *tx_hash_table = bond_info->tx_hashtbl;
-	struct slave *last_slave = tx_hash_table[hash_index].last_slave;
-	struct slave *next_slave = NULL;
-
-	if (last_slave && SLAVE_IS_OK(last_slave)) {
-		/* Use the last slave listed in the tx hashtbl if:
-		   the last slave currently is essentially unloaded. */
-		if (SLAVE_TLB_INFO(last_slave).load < 10)
-			next_slave = last_slave;
-	}
-
-	return next_slave ? next_slave : tlb_get_least_loaded_slave(bond);
-}
-
 /* Caller must hold bond lock for read */
 static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u32 skb_len)
 {
@@ -296,13 +279,12 @@ static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u3
 	struct tlb_client_info *hash_table;
 	struct slave *assigned_slave;
 
-	_lock_tx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_table = bond_info->tx_hashtbl;
 	assigned_slave = hash_table[hash_index].tx_slave;
 	if (!assigned_slave) {
-		assigned_slave = tlb_get_best_slave(bond, hash_index);
-
+		assigned_slave = alb_get_best_slave(bond, hash_index);
 		if (assigned_slave) {
 			struct tlb_slave_info *slave_info =
 				&(SLAVE_TLB_INFO(assigned_slave));
@@ -326,20 +308,52 @@ static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u3
 		hash_table[hash_index].tx_bytes += skb_len;
 	}
 
-	_unlock_tx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	return assigned_slave;
 }
 
 /*********************** rlb specific functions ***************************/
-static inline void _lock_rx_hashtbl(struct bonding *bond)
+
+/* Caller must hold bond lock for read and hashtbl lock */
+static struct slave *rlb_update_rx_table(struct bonding *bond, struct slave *next_slave, u32 hash_index)
 {
-	spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+
+	/* check rlb table and correct it if wrong */
+	if (bond_info->rlb_enabled) {
+		struct rlb_client_info *rx_client_info = &(bond_info->rx_hashtbl[hash_index]);
+
+		/* if the new slave computed by tlb checks doesn't match rlb, stop rlb from using it */
+		if (next_slave && (next_slave != rx_client_info->slave))
+			rx_client_info->slave = next_slave;
+	}
+	return next_slave;
 }
 
-static inline void _unlock_rx_hashtbl(struct bonding *bond)
+/* Caller must hold bond lock for read and hashtbl lock */
+static struct slave *alb_get_best_slave(struct bonding *bond, u32 hash_index)
 {
-	spin_unlock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	struct tlb_client_info *tx_hash_table = bond_info->tx_hashtbl;
+	struct slave *last_slave = tx_hash_table[hash_index].last_slave;
+	struct slave *next_slave = NULL;
+
+	/* presume the next slave will be the least loaded one */
+	next_slave = tlb_get_least_loaded_slave(bond);
+
+	if (last_slave && SLAVE_IS_OK(last_slave)) {
+		/* Use the last slave listed in the tx hashtbl if:
+		   the last slave currently is essentially unloaded. */
+		if (SLAVE_TLB_INFO(last_slave).load < 10)
+			next_slave = last_slave;
+	}
+
+	/* update the rlb hashtbl if there was a previous entry */
+	if (bond_info->rlb_enabled)
+		rlb_update_rx_table(bond, next_slave, hash_index);
+
+	return next_slave;
 }
 
 /* when an ARP REPLY is received from a client update its info
@@ -351,7 +365,7 @@ static void rlb_update_entry_from_arp(struct bonding *bond, struct arp_pkt *arp)
 	struct rlb_client_info *client_info;
 	u32 hash_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src));
 	client_info = &(bond_info->rx_hashtbl[hash_index]);
@@ -365,7 +379,7 @@ static void rlb_update_entry_from_arp(struct bonding *bond, struct arp_pkt *arp)
 		bond_info->rx_ntt = 1;
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 static int rlb_arp_recv(struct sk_buff *skb, struct net_device *bond_dev, struct packet_type *ptype, struct net_device *orig_dev)
@@ -409,38 +423,6 @@ out:
 	return res;
 }
 
-/* Caller must hold bond lock for read */
-static struct slave *rlb_next_rx_slave(struct bonding *bond)
-{
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct slave *rx_slave, *slave, *start_at;
-	int i = 0;
-
-	if (bond_info->next_rx_slave) {
-		start_at = bond_info->next_rx_slave;
-	} else {
-		start_at = bond->first_slave;
-	}
-
-	rx_slave = NULL;
-
-	bond_for_each_slave_from(bond, slave, i, start_at) {
-		if (SLAVE_IS_OK(slave)) {
-			if (!rx_slave) {
-				rx_slave = slave;
-			} else if (slave->speed > rx_slave->speed) {
-				rx_slave = slave;
-			}
-		}
-	}
-
-	if (rx_slave) {
-		bond_info->next_rx_slave = rx_slave->next;
-	}
-
-	return rx_slave;
-}
-
 /* teach the switch the mac of a disabled slave
  * on the primary for fault tolerance
  *
@@ -475,14 +457,14 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
 	u32 index, next_index;
 
 	/* clear slave from rx_hashtbl */
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	rx_hash_table = bond_info->rx_hashtbl;
 	index = bond_info->rx_hashtbl_head;
 	for (; index != RLB_NULL_INDEX; index = next_index) {
 		next_index = rx_hash_table[index].next;
 		if (rx_hash_table[index].slave == slave) {
-			struct slave *assigned_slave = rlb_next_rx_slave(bond);
+			struct slave *assigned_slave = alb_get_best_slave(bond, index);
 
 			if (assigned_slave) {
 				rx_hash_table[index].slave = assigned_slave;
@@ -506,7 +488,7 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
 		}
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	write_lock_bh(&bond->curr_slave_lock);
 
@@ -565,7 +547,7 @@ static void rlb_update_rx_clients(struct bonding *bond)
 	struct rlb_client_info *client_info;
 	u32 hash_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = bond_info->rx_hashtbl_head;
 	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
@@ -583,7 +565,7 @@ static void rlb_update_rx_clients(struct bonding *bond)
 	 */
 	bond_info->rlb_update_delay_counter = RLB_UPDATE_DELAY;
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* The slave was assigned a new mac address - update the clients */
@@ -594,7 +576,7 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
 	int ntt = 0;
 	u32 hash_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = bond_info->rx_hashtbl_head;
 	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
@@ -614,7 +596,7 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
 		bond_info->rlb_update_retry_counter = RLB_UPDATE_RETRY;
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* mark all clients using src_ip to be updated */
@@ -624,7 +606,7 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
 	struct rlb_client_info *client_info;
 	u32 hash_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = bond_info->rx_hashtbl_head;
 	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
@@ -650,7 +632,7 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
 		}
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* Caller must hold both bond and ptr locks for read */
@@ -662,7 +644,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 	struct rlb_client_info *client_info;
 	u32 hash_index = 0;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = _simple_hash((u8 *)&arp->ip_dst, sizeof(arp->ip_src));
 	client_info = &(bond_info->rx_hashtbl[hash_index]);
@@ -678,7 +660,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 
 			assigned_slave = client_info->slave;
 			if (assigned_slave) {
-				_unlock_rx_hashtbl(bond);
+				_unlock_hashtbl(bond);
 				return assigned_slave;
 			}
 		} else {
@@ -694,7 +676,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		}
 	}
 	/* assign a new slave */
-	assigned_slave = rlb_next_rx_slave(bond);
+	assigned_slave = alb_get_best_slave(bond, hash_index);
 
 	if (assigned_slave) {
 		client_info->ip_src = arp->ip_src;
@@ -730,7 +712,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		}
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	return assigned_slave;
 }
@@ -778,36 +760,6 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
 	return tx_slave;
 }
 
-/* Caller must hold bond lock for read */
-static void rlb_rebalance(struct bonding *bond)
-{
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct slave *assigned_slave;
-	struct rlb_client_info *client_info;
-	int ntt;
-	u32 hash_index;
-
-	_lock_rx_hashtbl(bond);
-
-	ntt = 0;
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
-		client_info = &(bond_info->rx_hashtbl[hash_index]);
-		assigned_slave = rlb_next_rx_slave(bond);
-		if (assigned_slave && (client_info->slave != assigned_slave)) {
-			client_info->slave = assigned_slave;
-			client_info->ntt = 1;
-			ntt = 1;
-		}
-	}
-
-	/* update the team's flag only after the whole iteration */
-	if (ntt) {
-		bond_info->rx_ntt = 1;
-	}
-	_unlock_rx_hashtbl(bond);
-}
-
 /* Caller must hold rx_hashtbl lock */
 static void rlb_init_table_entry(struct rlb_client_info *entry)
 {
@@ -824,8 +776,6 @@ static int rlb_initialize(struct bonding *bond)
 	int size = RLB_HASH_TABLE_SIZE * sizeof(struct rlb_client_info);
 	int i;
 
-	spin_lock_init(&(bond_info->rx_hashtbl_lock));
-
 	new_hashtbl = kmalloc(size, GFP_KERNEL);
 	if (!new_hashtbl) {
 		pr_err(DRV_NAME
@@ -833,7 +783,7 @@ static int rlb_initialize(struct bonding *bond)
 		       bond->dev->name);
 		return -1;
 	}
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	bond_info->rx_hashtbl = new_hashtbl;
 
@@ -843,7 +793,7 @@ static int rlb_initialize(struct bonding *bond)
 		rlb_init_table_entry(bond_info->rx_hashtbl + i);
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	/*initialize packet type*/
 	pk_type->type = cpu_to_be16(ETH_P_ARP);
@@ -862,13 +812,13 @@ static void rlb_deinitialize(struct bonding *bond)
 
 	dev_remove_pack(&(bond_info->rlb_pkt_type));
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	kfree(bond_info->rx_hashtbl);
 	bond_info->rx_hashtbl = NULL;
 	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
@@ -876,7 +826,7 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
 	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
 	u32 curr_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	curr_index = bond_info->rx_hashtbl_head;
 	while (curr_index != RLB_NULL_INDEX) {
@@ -901,7 +851,7 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
 		curr_index = next_index;
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /*********************** tlb/rlb shared functions *********************/
@@ -1525,11 +1475,6 @@ void bond_alb_monitor(struct work_struct *work)
 			read_lock(&bond->lock);
 		}
 
-		if (bond_info->rlb_rebalance) {
-			bond_info->rlb_rebalance = 0;
-			rlb_rebalance(bond);
-		}
-
 		/* check if clients need updating */
 		if (bond_info->rx_ntt) {
 			if (bond_info->rlb_update_delay_counter) {
@@ -1582,10 +1527,6 @@ int bond_alb_init_slave(struct bonding *bond, struct slave *slave)
 	/* order a rebalance ASAP */
 	bond->alb_info.tx_rebalance_counter = BOND_TLB_REBALANCE_TICKS;
 
-	if (bond->alb_info.rlb_enabled) {
-		bond->alb_info.rlb_rebalance = 1;
-	}
-
 	return 0;
 }
 
@@ -1622,14 +1563,6 @@ void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char
 	} else if (link == BOND_LINK_UP) {
 		/* order a rebalance ASAP */
 		bond_info->tx_rebalance_counter = BOND_TLB_REBALANCE_TICKS;
-		if (bond->alb_info.rlb_enabled) {
-			bond->alb_info.rlb_rebalance = 1;
-			/* If the updelay module parameter is smaller than the
-			 * forwarding delay of the switch the rebalance will
-			 * not work because the rebalance arp replies will
-			 * not be forwarded to the clients..
-			 */
-		}
 	}
 }
 
diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
index b65fd29..24bf35a 100644
--- a/drivers/net/bonding/bond_alb.h
+++ b/drivers/net/bonding/bond_alb.h
@@ -90,7 +90,7 @@ struct tlb_slave_info {
 struct alb_bond_info {
 	struct timer_list	alb_timer;
 	struct tlb_client_info	*tx_hashtbl; /* Dynamically allocated */
-	spinlock_t		tx_hashtbl_lock;
+	spinlock_t		hashtbl_lock; /* lock for both tables */
 	u32			unbalanced_load;
 	int			tx_rebalance_counter;
 	int			lp_counter;
@@ -98,7 +98,6 @@ struct alb_bond_info {
 	int rlb_enabled;
 	struct packet_type	rlb_pkt_type;
 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
-	spinlock_t		rx_hashtbl_lock;
 	u32			rx_hashtbl_head;
 	u8			rx_ntt;	/* flag - need to transmit
 					 * to all rx clients
@@ -115,10 +114,6 @@ struct alb_bond_info {
 	u32			rlb_update_retry_counter;/* counter of retries
 							  * of client update
 							  */
-	u8			rlb_rebalance;	/* flag - indicates that the
-						 * rx traffic should be
-						 * rebalanced
-						 */
 	struct vlan_entry	*current_alb_vlan;
 };
 
-- 
1.6.0.2


^ permalink raw reply related

* Splice on blocking TCP sockets again..
From: Jason Gunthorpe @ 2009-09-30  0:48 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: David S. Miller, Volker Lendecke

Eric,

I saw your patch from January regarding splicing on blocking sockets,
and I wondered what ever happened to it?

http://lkml.org/lkml/2009/1/13/507

It doesn't look like it has been applied.. I see the patch thread died
at davem's comments?

I have run into exactly the same problem as Samba, where I'd like the
TCP socket to be blocking, and the pipe to be non blocking ...

As it stands, 
  splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
causes a random endless block and
  splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.

FWIW, it looks like samba has a splice code now, but doesn't enable it
due to this issue?

http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD

Thanks,
Jason

^ permalink raw reply

* Re: Splice on blocking TCP sockets again..
From: Eric Dumazet @ 2009-09-30  4:54 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: netdev, David S. Miller, Volker Lendecke
In-Reply-To: <20090930004820.GC19540@obsidianresearch.com>

Jason Gunthorpe a écrit :
> Eric,
> 
> I saw your patch from January regarding splicing on blocking sockets,
> and I wondered what ever happened to it?
> 
> http://lkml.org/lkml/2009/1/13/507
> 
> It doesn't look like it has been applied.. I see the patch thread died
> at davem's comments?
> 
> I have run into exactly the same problem as Samba, where I'd like the
> TCP socket to be blocking, and the pipe to be non blocking ...
> 
> As it stands, 
>   splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
> causes a random endless block and
>   splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
> will return 0 immediately if the TCP buffer is empty.
> 
> FWIW, it looks like samba has a splice code now, but doesn't enable it
> due to this issue?
> 
> http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD
> 
> Thanks,
> Jason

Hi Jason, thanks for this reminding

Hmm, most probably I did not replied correctly do David objection which was :

Date	Wed, 14 Jan 2009 20:58:39 -0800 (PST)
Subject	Re: maximum buffer size for splice(2) tcp->pipe?
From	David Miller <>

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 14 Jan 2009 00:38:32 +0100
> [PATCH] net: splice() from tcp to socket should take into account O_NONBLOCK
> 
> Instead of using SPLICE_F_NONBLOCK to select a non blocking mode both on
> source tcp socket and pipe destination, we use the underlying file flag (O_NONBLOCK)
> for selecting a non blocking socket.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

This needs at least some more thought.

It seems, for one thing, that this change will interfere with the
intentions of the code in splice_dirt_to_actor which goes:

	/*
	 * Don't block on output, we have to drain the direct pipe.
	 */
	sd->flags &= ~SPLICE_F_NONBLOCK;

------------------------------------------------------------------------------

But splice_dist_to_actor() handles a REG/BLK file as input and a pipe as output,
so I believe my patch wont change splice_dist_to_actor() behavior.

My patch title was wrong :

net: splice() from tcp to socket should take into account O_NONBLOCK


So maybe David was mistaken by the title :)


[PATCH] net: splice() from tcp to pipe should take into account O_NONBLOCK

Before this patch :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
causes a random endless block (if pipe is full) and
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.

User application has no way to instruct splice() that socket should be in blocking mode
but pipe in nonblock more.
 
http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD

One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.

Users will then call :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK ); 

to block on data coming from socket (if file is in blocking mode),
and not block on pipe output (to avoid deadlock)

Reported-by: Volker Lendecke <vl@samba.org>
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 21387eb..8cdfab6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -580,7 +580,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 
 	lock_sock(sk);
 
-	timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
+	timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
 	while (tss.len) {
 		ret = __tcp_splice_read(sk, &tss);
 		if (ret < 0)

^ permalink raw reply related

* Re: [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: Eric Dumazet @ 2009-09-30  5:20 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, David Miller, Andy Gospodarek
In-Reply-To: <1254269731-7341-4-git-send-email-fubar@us.ibm.com>

Jay Vosburgh a écrit :
> From: Andy Gospodarek <andy@greyhouse.net>
> 
> This patch sends ARP request on the correct destination output interface
> rather than always sending them on the primary interface.  I've also
> added some bits to make sure that the source and destination address in
> the ARP header are correct since simply changing the source MAC and
> output interface will not be that helpful.
> 
> Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
> 
> ---
>  drivers/net/bonding/bond_alb.c |   45 ++++++++++++++++++---------------------
>  1 files changed, 21 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
> index 5cd0400..9148c75 100644
> --- a/drivers/net/bonding/bond_alb.c
> +++ b/drivers/net/bonding/bond_alb.c
> @@ -726,35 +726,32 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
>  	struct arp_pkt *arp = arp_pkt(skb);
>  	struct slave *tx_slave = NULL;
>  
> -	if (arp->op_code == htons(ARPOP_REPLY)) {
> -		/* the arp must be sent on the selected
> -		* rx channel
> -		*/
> -		tx_slave = rlb_choose_channel(skb, bond);
> -		if (tx_slave) {
> -			memcpy(arp->mac_src,tx_slave->dev->dev_addr, ETH_ALEN);
> -		}
> -		pr_debug("Server sent ARP Reply packet\n");
> -	} else if (arp->op_code == htons(ARPOP_REQUEST)) {
> -		/* Create an entry in the rx_hashtbl for this client as a
> -		 * place holder.
> -		 * When the arp reply is received the entry will be updated
> -		 * with the correct unicast address of the client.
> -		 */
> -		rlb_choose_channel(skb, bond);
> +	/* Choose an output channel for the ARP frame */
> +	tx_slave = rlb_choose_channel(skb, bond);
>  
> -		/* The ARP relpy packets must be delayed so that
> -		 * they can cancel out the influence of the ARP request.
> -		 */

COmments should have the following form :
/*
 * This is a fine comment
 */

> +	/* If a valid interface is returned, make sure the sender and target MAC
> +	 * addresses are correct based on the interface that will be transmitting
> +	 * the frame. */
> +	if (tx_slave) {
> +		/* If sender mac is the bond's address, rewrite */
> +		if (!compare_ether_addr_64bits(arp->mac_src,bond->dev->dev_addr))
> +			memcpy(arp->mac_src,tx_slave->dev->dev_addr,bond->dev->addr_len);

Hmm, you use compare_ether_addr_64bits(), implying a fix ETH_ALEN (6 bytes) address, then
you memcpy( ..., ..., bond->dev->addr_len);
Why not use ETH_ALEN to help compiler ?
Also add a space after a comma

> +
> +		/* If target mac is the bond's address, rewrite */
> +		if (!compare_ether_addr_64bits(arp->mac_dst,bond->dev->dev_addr))
> +			memcpy(arp->mac_dst,tx_slave->dev->dev_addr,bond->dev->addr_len);
> +
> +	} else if (arp->op_code == htons(ARPOP_REQUEST)) {
> +		/* if tx_slave is NULL, the periodic ARP replies must
> +		 * be delayed so they can cancel out the influence of
> +		 * the ARP request. */
>  		bond->alb_info.rlb_update_delay_counter = RLB_UPDATE_DELAY;
>  
> -		/* arp requests are broadcast and are sent on the primary
> -		 * the arp request will collapse all clients on the subnet to
> +		/* ARP requests are broadcast and are sent on the primary
> +		 * the ARP request will collapse all clients on the subnet to
>  		 * the primary slave. We must register these clients to be

Could you reformulate this comment, it is not readable as is :

 /*
  * ARP requests are broadcast and are sent on the primary.
  * The ARP request will collapse all clients on the subnet to
  * the primary slave. We must register these clients to be
  * updated with their assigned MAC.
  */


> -		 * updated with their assigned mac.
> -		 */
> +		 * updated with their assigned MAC. */
>  		rlb_req_update_subnet_clients(bond, arp->ip_src);
> -		pr_debug("Server sent ARP Request packet\n");
>  	}
>  
>  	return tx_slave;


^ permalink raw reply

* Re: [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: Jarek Poplawski @ 2009-09-30  5:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jay Vosburgh, netdev, David Miller, Andy Gospodarek
In-Reply-To: <4AC2EA97.5010809@gmail.com>

On 30-09-2009 07:20, Eric Dumazet wrote:
...
>> +	/* Choose an output channel for the ARP frame */
>> +	tx_slave = rlb_choose_channel(skb, bond);
>>  
>> -		/* The ARP relpy packets must be delayed so that
>> -		 * they can cancel out the influence of the ARP request.
>> -		 */
> 
> COmments should have the following form :
> /*
>  * This is a fine comment
>  */

It's "The preferred style for long (multi-line) comments [...]".

Jarek P.

^ permalink raw reply

* Re: Splice on blocking TCP sockets again..
From: Jason Gunthorpe @ 2009-09-30  5:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Volker Lendecke
In-Reply-To: <4AC2E481.5060509@gmail.com>

> One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
> flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.

Thanks Eric, this seems reasonable from my userspace perspective.

I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
un-unixy to have a syscall completely ignore the NONBLOCK flag of the
fd it is called on. Ie setting NONBLOCK on the pipe itself does
nothing when using splice..

Jason

^ permalink raw reply

* Re: [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: Eric Dumazet @ 2009-09-30  5:49 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Jay Vosburgh, netdev, David Miller, Andy Gospodarek
In-Reply-To: <20090930054015.GA5231@ff.dom.local>

Jarek Poplawski a écrit :
> On 30-09-2009 07:20, Eric Dumazet wrote:
> ...
>>> +	/* Choose an output channel for the ARP frame */
>>> +	tx_slave = rlb_choose_channel(skb, bond);
>>>  
>>> -		/* The ARP relpy packets must be delayed so that
>>> -		 * they can cancel out the influence of the ARP request.
>>> -		 */
>> COmments should have the following form :
>> /*
>>  * This is a fine comment
>>  */
> 
> It's "The preferred style for long (multi-line) comments [...]".

Yes I agree.

We keep old comments as they are, but for new code submission, we
try to follow common practices.
I know it seems odd, but in the end it helps code readability.

Of course, I would have not point this without another more interesting 
stuff in the review process :)



^ permalink raw reply

* Re: Splice on blocking TCP sockets again..
From: Eric Dumazet @ 2009-09-30  5:51 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: netdev, David S. Miller, Volker Lendecke
In-Reply-To: <20090930054031.GY22310@obsidianresearch.com>

Jason Gunthorpe a écrit :
>> One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
>> flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.

arg, this was tcp_splice_read() of course

> 
> Thanks Eric, this seems reasonable from my userspace perspective.
> 
> I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
> un-unixy to have a syscall completely ignore the NONBLOCK flag of the
> fd it is called on. Ie setting NONBLOCK on the pipe itself does
> nothing when using splice..
> 

Hmm, good question, I dont have the answer but I'll digg one.

^ permalink raw reply

* Re: Splice on blocking TCP sockets again..
From: Eric Dumazet @ 2009-09-30  6:00 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: netdev, David S. Miller, Volker Lendecke
In-Reply-To: <4AC2F1D9.1010801@gmail.com>

Eric Dumazet a écrit :
> Jason Gunthorpe a écrit :
>>> One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
>>> flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.
> 
> arg, this was tcp_splice_read() of course
> 
>> Thanks Eric, this seems reasonable from my userspace perspective.
>>
>> I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
>> un-unixy to have a syscall completely ignore the NONBLOCK flag of the
>> fd it is called on. Ie setting NONBLOCK on the pipe itself does
>> nothing when using splice..
>>
> 
> Hmm, good question, I dont have the answer but I'll digg one.
> 

commit	29e350944fdc2dfca102500790d8ad6d6ff4f69d
splice: add SPLICE_F_NONBLOCK flag

It doesn't make the splice itself necessarily nonblocking (because the
actual file descriptors that are spliced from/to may block unless they
have the O_NONBLOCK flag set), but it makes the splice pipe operations
nonblocking.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>


See Linus intention was pretty clear : O_NONBLOCK should be taken into account
by 'actual file that are spliced from/to', regardless of SPLICE_F_NONBLOCK flag


^ permalink raw reply

* Re: Splice on blocking TCP sockets again..
From: Eric Dumazet @ 2009-09-30  6:19 UTC (permalink / raw)
  Cc: Jason Gunthorpe, netdev, David S. Miller, Volker Lendecke,
	Octavian Purdila
In-Reply-To: <4AC2F3E4.5000904@gmail.com>

Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> Jason Gunthorpe a écrit :
>>>> One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
>>>> flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.
>> arg, this was tcp_splice_read() of course
>>
>>> Thanks Eric, this seems reasonable from my userspace perspective.
>>>
>>> I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
>>> un-unixy to have a syscall completely ignore the NONBLOCK flag of the
>>> fd it is called on. Ie setting NONBLOCK on the pipe itself does
>>> nothing when using splice..
>>>
>> Hmm, good question, I dont have the answer but I'll digg one.
>>
> 
> commit	29e350944fdc2dfca102500790d8ad6d6ff4f69d
> splice: add SPLICE_F_NONBLOCK flag
> 
> It doesn't make the splice itself necessarily nonblocking (because the
> actual file descriptors that are spliced from/to may block unless they
> have the O_NONBLOCK flag set), but it makes the splice pipe operations
> nonblocking.
> 
> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> 
> See Linus intention was pretty clear : O_NONBLOCK should be taken into account
> by 'actual file that are spliced from/to', regardless of SPLICE_F_NONBLOCK flag
> 

I also found first submission of the patch from Octavian Purdila,
so credit should be given to Octavian as well.

http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html

We could add Linus into the discussion if it can help to make progress on this point.

I personally stopped to use splice(tcp -> pipe) in my projects because it was not usable 
in a reliable way.

Thanks


[PATCH] net: splice() from tcp to pipe should take into account O_NONBLOCK

tcp_splice_read() doesnt take into account socket's O_NONBLOCK flag

Before this patch :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
causes a random endless block (if pipe is full) and
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.

User application has no way to instruct splice() that socket should be in blocking mode
but pipe in nonblock more.

Many projects cannot use splice(tcp -> pipe) because of this flaw.
 
http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD
http://lkml.indiana.edu/hypermail/linux/kernel/0807.2/0687.html

Linus introduced  SPLICE_F_NONBLOCK in commit 29e350944fdc2dfca102500790d8ad6d6ff4f69d
(splice: add SPLICE_F_NONBLOCK flag )

  It doesn't make the splice itself necessarily nonblocking (because the
  actual file descriptors that are spliced from/to may block unless they
  have the O_NONBLOCK flag set), but it makes the splice pipe operations
  nonblocking.


Linus intention was clear : let SPLICE_F_NONBLOCK control the splice pipe mode only


This patch instruct tcp_splice_read() to use the underlying file O_NONBLOCK
flag, as other socket operations do.


Users will then call :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK ); 

to block on data coming from socket (if file is in blocking mode),
and not block on pipe output (to avoid deadlock)

First version of this patch was submitted by Octavian Purdila

Reported-by: Volker Lendecke <vl@samba.org>
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
---

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 21387eb..8cdfab6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -580,7 +580,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 
 	lock_sock(sk);
 
-	timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
+	timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
 	while (tss.len) {
 		ret = __tcp_splice_read(sk, &tss);
 		if (ret < 0)

^ permalink raw reply related

* Re: [PATCH] [RFC] IPv4 TCP fails to send window scale option when window scale is zero
From: Gilad Ben-Yossef @ 2009-09-30  6:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Ori Finkalman
In-Reply-To: <4AC241BA.8040608@gmail.com>

Hi,


[ Resending reply due to Android Gmail client sorry state. My apologies 
if you got it twice. ]


Eric Dumazet wrote:

> Gilad Ben-Yossef a écrit :
>   
>> From: Ori Finkalman <ori@comsleep.com>
>>
>>
>> Acknowledge TCP window scale support by inserting the proper option in
>> SYN/ACK header
>> even if our window scale is zero.
>>
>>
>> This fixes the following observed behavior:
>>
>>
>> 1. Client sends a SYN with TCP window scaling option and non zero window
>> scale value to a Linux box.
>>
>> 2. Linux box notes large receive window from client.
>>
>> 3. Linux decides on a zero value of window scale for its part.
>>
>> 4. Due to compare against requested window scale size option, Linux does
>> not to send windows scale
>>
>> TCP option header on SYN/ACK at all.
>>
>>
>> Result:
>>
>>
>> Client box thinks TCP window scaling is not supported, since SYN/ACK had
>> no TCP window scale option,
>> while Linux thinks that TCP window scaling is supported (and scale might
>> be non zero), since SYN had
>>
>> TCP window scale option and we have a mismatched idea between the client
>> and server regarding window sizes.
>>
>>
>> Please comment and/or apply.
>> ...
>>
>>
>> Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
>> Signed-off-by: Ori Finkelman <ori@comsleep.com>
>>
>>
>> Index: net/ipv4/tcp_output.c
>> ===================================================================
>> --- net/ipv4/tcp_output.c    (revision 46)
>> +++ net/ipv4/tcp_output.c    (revision 210)
>> @@ -353,6 +353,7 @@ static void tcp_init_nondata_skb(struct
>> #define OPTION_SACK_ADVERTISE    (1 << 0)
>> #define OPTION_TS        (1 << 1)
>> #define OPTION_MD5        (1 << 2)
>> +#define OPTION_WSCALE        (1 << 3)
>>
>> struct tcp_out_options {
>>     u8 options;        /* bit field of OPTION_* */
>> @@ -417,7 +418,7 @@ static void tcp_options_write(__be32 *pt
>>                    TCPOLEN_SACK_PERM);
>>     }
>>
>> -    if (unlikely(opts->ws)) {
>> +    if (unlikely(OPTION_WSCALE & opts->options)) {
>>         *ptr++ = htonl((TCPOPT_NOP << 24) |
>>                    (TCPOPT_WINDOW << 16) |
>>                    (TCPOLEN_WINDOW << 8) |
>> @@ -530,8 +531,8 @@ static unsigned tcp_synack_options(struc
>>
>>     if (likely(ireq->wscale_ok)) {
>>         opts->ws = ireq->rcv_wscale;
>> -        if(likely(opts->ws))
>> -            size += TCPOLEN_WSCALE_ALIGNED;
>> +        opts->options |= OPTION_WSCALE;
>> +        size += TCPOLEN_WSCALE_ALIGNED;
>>     }
>>     if (likely(doing_ts)) {
>>         opts->options |= OPTION_TS;
>>
>>
>>
>>     
>
> Seems not the more logical places to put this logic...
>
> How about this instead ?
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 5200aab..b78c084 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -216,6 +216,11 @@ void tcp_select_initial_window(int __space, __u32 mss,
>  			space >>= 1;
>  			(*rcv_wscale)++;
>  		}
> +		/*
> +		 * Set a minimum wscale of 1
> +		 */
> +		if (*rcv_wscale == 0)
> +			*rcv_wscale = 1;
>         }
>
>         /* Set initial window to value enough for senders,
>
>   

Thank you for the patch review. The suggested replacement patch 
certainly is shorter, code wise, which is an advantage.

I cant help but feel though, that it is less readable - a window scale 
of zero is a perfectly legit value. Adding special logic to rule it out 
just because we chose to overload this setting for something else 
(whether window scaling is supported or not) seems like an invitation 
for someone to get it wrong again down the line, in my opinion.

Also note that the suggested fix is in line with how other TCP options 
are handled, e.g. TCP timestamp.

Anyone else wants to chime in on that?

PS. I also managed to to get the patch author name spelling wrong. It is 
Ori Finkelman and not as written.

Thanks!
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker & CTO
Codefidence Ltd.

Web:   http://codefidence.com
Cell:  +972-52-8260388
Skype: gilad_codefidence
Tel:   +972-8-9316883 ext. 201
Fax:   +972-8-9316884
Email: gilad@codefidence.com

Check out our Open Source technology and training blog - http://tuxology.net

	"Now the world has gone to bed
	 Darkness won't engulf my head
	 I can see by infra-red
	 How I hate the night."


^ permalink raw reply

* [PATCH] connector: Fix regression introduced by sid connector
From: Christian Borntraeger @ 2009-09-30  6:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oleg Nesterov, Scott James Remnant, Evgeniy Polyakov,
	Linux Kernel, Matt Helsley, David S. Miller, netdev
In-Reply-To: <20090929162855.GA15319@redhat.com>

Hi Andrew,

since commit 02b51df1b07b4e9ca823c89284e704cadb323cd1 (proc connector: add event 
for process becoming session leader) we have the following warning:
Badness at kernel/softirq.c:143
[...]
Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0)
[...]
Call Trace:
([<000000013fe04100>] 0x13fe04100)
 [<000000000048a946>] sk_filter+0x9a/0xd0
 [<000000000049d938>] netlink_broadcast+0x2c0/0x53c
 [<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0
 [<00000000003baef0>] proc_sid_connector+0xc4/0xd4
 [<0000000000142604>] __set_special_pids+0x58/0x90
 [<0000000000159938>] sys_setsid+0xb4/0xd8
 [<00000000001187fe>] sysc_noemu+0x10/0x16
 [<00000041616cb266>] 0x41616cb266

The warning is
--->    WARN_ON_ONCE(in_irq() || irqs_disabled());

The network code must not be called with disabled interrupts but
sys_setsid holds the tasklist_lock with spinlock_irq while calling
the connector. 
After a discussion we agreed that we can move proc_sid_connector
from __set_special_pids to sys_setsid.
We also agreed that it is sufficient to change the check from
task_session(curr) != pid into err > 0, since if we don't change the
session, this means we were already the leader and return -EPERM.

One last thing:
There is also daemonize(), and some people might want to get a
notification in that case. Since daemonize() is only needed if a user
space does kernel_thread this does not look important (and there seems
to be no consensus if this connector should be called in daemonize). If
we really want this, we can add proc_sid_connector to daemonize() in an
additional patch (Scott?)

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
CCed: Scott James Remnant <scott@ubuntu.com>
CCed: Matt Helsley <matthltc@us.ibm.com>
CCed: David S. Miller <davem@davemloft.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
---
 kernel/exit.c |    4 +---
 kernel/sys.c  |    2 ++
 2 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -359,10 +359,8 @@ void __set_special_pids(struct pid *pid)
 {
 	struct task_struct *curr = current->group_leader;
 
-	if (task_session(curr) != pid) {
+	if (task_session(curr) != pid)
 		change_pid(curr, PIDTYPE_SID, pid);
-		proc_sid_connector(curr);
-	}
 
 	if (task_pgrp(curr) != pid)
 		change_pid(curr, PIDTYPE_PGID, pid);
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -1110,6 +1110,8 @@ SYSCALL_DEFINE0(setsid)
 	err = session;
 out:
 	write_unlock_irq(&tasklist_lock);
+	if (err > 0)
+		proc_sid_connector(sid);
 	return err;
 }
 

^ permalink raw reply

* Re: Splice on blocking TCP sockets again..
From: Volker Lendecke @ 2009-09-30  6:37 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Eric Dumazet, netdev, David S. Miller, Volker Lendecke
In-Reply-To: <20090930004820.GC19540@obsidianresearch.com>

[-- Attachment #1: Type: text/plain, Size: 542 bytes --]

On Tue, Sep 29, 2009 at 06:48:20PM -0600, Jason Gunthorpe wrote:
> FWIW, it looks like samba has a splice code now, but doesn't enable it
> due to this issue?

Right. What I've learned from the comments is that splice is
only usable in multi-threaded programs. One thread is
reading, one is writing from the other end. I deferred using
splice until we have the proper architecture to do sync
syscalls in helper threads to make them virtually async.  We
have some code for that now, but it's not a high priority
for me at this moment.

Volker

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* Re: [PATCH] [RFC] IPv4 TCP fails to send window scale option when window scale is zero
From: Eric Dumazet @ 2009-09-30  7:16 UTC (permalink / raw)
  To: Gilad Ben-Yossef; +Cc: netdev, Ori Finkalman
In-Reply-To: <4AC2FA7C.6030901@codefidence.com>

Gilad Ben-Yossef a écrit :
> Hi,
> 
> 
> [ Resending reply due to Android Gmail client sorry state. My apologies
> if you got it twice. ]
> 
> 
> Eric Dumazet wrote:
> 
>> Gilad Ben-Yossef a écrit :
>>  
>>> From: Ori Finkalman <ori@comsleep.com>
>>>
>>>
>>> Acknowledge TCP window scale support by inserting the proper option in
>>> SYN/ACK header
>>> even if our window scale is zero.
>>>
>>>
>>> This fixes the following observed behavior:
>>>
>>>
>>> 1. Client sends a SYN with TCP window scaling option and non zero window
>>> scale value to a Linux box.
>>>
>>> 2. Linux box notes large receive window from client.
>>>
>>> 3. Linux decides on a zero value of window scale for its part.
>>>
>>> 4. Due to compare against requested window scale size option, Linux does
>>> not to send windows scale
>>>
>>> TCP option header on SYN/ACK at all.
>>>
>>>
>>> Result:
>>>
>>>
>>> Client box thinks TCP window scaling is not supported, since SYN/ACK had
>>> no TCP window scale option,
>>> while Linux thinks that TCP window scaling is supported (and scale might
>>> be non zero), since SYN had
>>>
>>> TCP window scale option and we have a mismatched idea between the client
>>> and server regarding window sizes.
>>>
>>>
>>> Please comment and/or apply.
>>> ...
>>>
>>>
>>> Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
>>> Signed-off-by: Ori Finkelman <ori@comsleep.com>
>>>
>>>
>>> Index: net/ipv4/tcp_output.c
>>> ===================================================================
>>> --- net/ipv4/tcp_output.c    (revision 46)
>>> +++ net/ipv4/tcp_output.c    (revision 210)
>>> @@ -353,6 +353,7 @@ static void tcp_init_nondata_skb(struct
>>> #define OPTION_SACK_ADVERTISE    (1 << 0)
>>> #define OPTION_TS        (1 << 1)
>>> #define OPTION_MD5        (1 << 2)
>>> +#define OPTION_WSCALE        (1 << 3)
>>>
>>> struct tcp_out_options {
>>>     u8 options;        /* bit field of OPTION_* */
>>> @@ -417,7 +418,7 @@ static void tcp_options_write(__be32 *pt
>>>                    TCPOLEN_SACK_PERM);
>>>     }
>>>
>>> -    if (unlikely(opts->ws)) {
>>> +    if (unlikely(OPTION_WSCALE & opts->options)) {
>>>         *ptr++ = htonl((TCPOPT_NOP << 24) |
>>>                    (TCPOPT_WINDOW << 16) |
>>>                    (TCPOLEN_WINDOW << 8) |
>>> @@ -530,8 +531,8 @@ static unsigned tcp_synack_options(struc
>>>
>>>     if (likely(ireq->wscale_ok)) {
>>>         opts->ws = ireq->rcv_wscale;
>>> -        if(likely(opts->ws))
>>> -            size += TCPOLEN_WSCALE_ALIGNED;
>>> +        opts->options |= OPTION_WSCALE;
>>> +        size += TCPOLEN_WSCALE_ALIGNED;
>>>     }
>>>     if (likely(doing_ts)) {
>>>         opts->options |= OPTION_TS;
>>>
>>>
>>>
>>>     
>>
>> Seems not the more logical places to put this logic...
>>
>> How about this instead ?
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index 5200aab..b78c084 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -216,6 +216,11 @@ void tcp_select_initial_window(int __space, __u32
>> mss,
>>              space >>= 1;
>>              (*rcv_wscale)++;
>>          }
>> +        /*
>> +         * Set a minimum wscale of 1
>> +         */
>> +        if (*rcv_wscale == 0)
>> +            *rcv_wscale = 1;
>>         }
>>
>>         /* Set initial window to value enough for senders,
>>
>>   
> 
> Thank you for the patch review. The suggested replacement patch
> certainly is shorter, code wise, which is an advantage.
> 
> I cant help but feel though, that it is less readable - a window scale
> of zero is a perfectly legit value. Adding special logic to rule it out
> just because we chose to overload this setting for something else
> (whether window scaling is supported or not) seems like an invitation
> for someone to get it wrong again down the line, in my opinion.

As a matter of fact I didnot test your patch.

My reaction was driven by :

Your version slows down the tcp_options_write() function, once per tx packet.

tcp_options_write() should not change socket state, while
tcp_select_initial_window() is the exact place where we are supposed to
compute wscale. 

Also how is managed tcp_syn_options() case (for outgoing connections ?)

        if (likely(sysctl_tcp_window_scaling)) {
                opts->ws = tp->rx_opt.rcv_wscale;
                if (likely(opts->ws))
                        size += TCPOLEN_WSCALE_ALIGNED;
        }

Dont you need to patch it as well ?

> 
> Also note that the suggested fix is in line with how other TCP options
> are handled, e.g. TCP timestamp.
> 
> Anyone else wants to chime in on that?
> 
> PS. I also managed to to get the patch author name spelling wrong. It is
> Ori Finkelman and not as written.
> 
> Thanks!
> Gilad
> 
> 


^ permalink raw reply

* Re: [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: David Miller @ 2009-09-30  7:27 UTC (permalink / raw)
  To: eric.dumazet; +Cc: fubar, netdev, andy
In-Reply-To: <4AC2EA97.5010809@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 30 Sep 2009 07:20:23 +0200

> Jay Vosburgh a écrit :
>> -		/* The ARP relpy packets must be delayed so that
>> -		 * they can cancel out the influence of the ARP request.
>> -		 */
> 
> COmments should have the following form :
> /*
>  * This is a fine comment
>  */

I definitely prefer what is used in the patch, whereas your suggestion
wastes a precious line on the screen and doesn't look markedly better
at all.

Look at any TCP protocol source file and you'll see the consistent
application of:

	/* This form of
	 * comment.
	 */

there.

^ permalink raw reply

* Re: [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: Eric Dumazet @ 2009-09-30  7:49 UTC (permalink / raw)
  To: David Miller; +Cc: fubar, netdev, andy
In-Reply-To: <20090930.002721.160750704.davem@davemloft.net>

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 30 Sep 2009 07:20:23 +0200
> 
>> Jay Vosburgh a écrit :
>>> -		/* The ARP relpy packets must be delayed so that
>>> -		 * they can cancel out the influence of the ARP request.
>>> -		 */
>> COmments should have the following form :
>> /*
>>  * This is a fine comment
>>  */
> 
> I definitely prefer what is used in the patch, whereas your suggestion
> wastes a precious line on the screen and doesn't look markedly better
> at all.
> 
> Look at any TCP protocol source file and you'll see the consistent
> application of:
> 
> 	/* This form of
> 	 * comment.
> 	 */
> 
> there.

No problem David, I dont want to waste time to discuss this kind of stuff,
but you know my own opinion, and I know yours as well !

As a non native, some things that are painful details for you definitly help
my brain to decode English sentences. Uppercase on first letter, final point,
spelling, ...


^ permalink raw reply

* [PATCH net-next-2.6] be2net: Workaround to fix a bug in Rx Completion processing.
From: Ajit Khaparde @ 2009-09-30  9:07 UTC (permalink / raw)
  To: davem, netdev

vtp bit in RX completion descriptor could be wrongly set in
some skews of BladEngine.  Ignore this  bit if vtm is not set.
This patch is against the net-next-2.6 tree.

Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
---
 drivers/net/benet/be.h      |    1 +
 drivers/net/benet/be_cmds.c |    3 ++-
 drivers/net/benet/be_cmds.h |    3 ++-
 drivers/net/benet/be_main.c |   23 +++++++++++++++++++----
 4 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/net/benet/be.h b/drivers/net/benet/be.h
index 13b72ce..7503b92 100644
--- a/drivers/net/benet/be.h
+++ b/drivers/net/benet/be.h
@@ -258,6 +258,7 @@ struct be_adapter {
 	bool link_up;
 	u32 port_num;
 	bool promiscuous;
+	u32 cap;
 };
 
 extern const struct ethtool_ops be_ethtool_ops;
diff --git a/drivers/net/benet/be_cmds.c b/drivers/net/benet/be_cmds.c
index 1db0924..513e47b 100644
--- a/drivers/net/benet/be_cmds.c
+++ b/drivers/net/benet/be_cmds.c
@@ -995,7 +995,7 @@ int be_cmd_get_flow_control(struct be_adapter *adapter, u32 *tx_fc, u32 *rx_fc)
 	return status;
 }
 
-int be_cmd_query_fw_cfg(struct be_adapter *adapter, u32 *port_num)
+int be_cmd_query_fw_cfg(struct be_adapter *adapter, u32 *port_num, u32 *cap)
 {
 	struct be_mcc_wrb *wrb = wrb_from_mbox(&adapter->mbox_mem);
 	struct be_cmd_req_query_fw_cfg *req = embedded_payload(wrb);
@@ -1014,6 +1014,7 @@ int be_cmd_query_fw_cfg(struct be_adapter *adapter, u32 *port_num)
 	if (!status) {
 		struct be_cmd_resp_query_fw_cfg *resp = embedded_payload(wrb);
 		*port_num = le32_to_cpu(resp->phys_port);
+		*cap = le32_to_cpu(resp->function_cap);
 	}
 
 	spin_unlock(&adapter->mbox_lock);
diff --git a/drivers/net/benet/be_cmds.h b/drivers/net/benet/be_cmds.h
index fd7028e..f4fd74e 100644
--- a/drivers/net/benet/be_cmds.h
+++ b/drivers/net/benet/be_cmds.h
@@ -759,7 +759,8 @@ extern int be_cmd_set_flow_control(struct be_adapter *adapter,
 			u32 tx_fc, u32 rx_fc);
 extern int be_cmd_get_flow_control(struct be_adapter *adapter,
 			u32 *tx_fc, u32 *rx_fc);
-extern int be_cmd_query_fw_cfg(struct be_adapter *adapter, u32 *port_num);
+extern int be_cmd_query_fw_cfg(struct be_adapter *adapter,
+			u32 *port_num, u32 *cap);
 extern int be_cmd_reset_function(struct be_adapter *adapter);
 extern void be_process_mcc(struct be_adapter *adapter);
 extern int be_cmd_write_flashrom(struct be_adapter *adapter,
diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index ce11bba..44d2e6a 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -747,9 +747,16 @@ static void be_rx_compl_process(struct be_adapter *adapter,
 			struct be_eth_rx_compl *rxcp)
 {
 	struct sk_buff *skb;
-	u32 vtp, vid;
+	u32 vlanf, vid;
+	u8 vtm;
 
-	vtp = AMAP_GET_BITS(struct amap_eth_rx_compl, vtp, rxcp);
+	vlanf = AMAP_GET_BITS(struct amap_eth_rx_compl, vtp, rxcp);
+	vtm = AMAP_GET_BITS(struct amap_eth_rx_compl, vtm, rxcp);
+
+	/* vlanf could be wrongly set in some cards.
+	 * ignore if vtm is not set */
+	if ((adapter->cap == 0x400) && !vtm)
+		vlanf = 0;
 
 	skb = netdev_alloc_skb(adapter->netdev, BE_HDR_LEN + NET_IP_ALIGN);
 	if (!skb) {
@@ -772,7 +779,7 @@ static void be_rx_compl_process(struct be_adapter *adapter,
 	skb->protocol = eth_type_trans(skb, adapter->netdev);
 	skb->dev = adapter->netdev;
 
-	if (vtp) {
+	if (vlanf) {
 		if (!adapter->vlan_grp || adapter->num_vlans == 0) {
 			kfree_skb(skb);
 			return;
@@ -797,11 +804,18 @@ static void be_rx_compl_process_gro(struct be_adapter *adapter,
 	struct be_eq_obj *eq_obj =  &adapter->rx_eq;
 	u32 num_rcvd, pkt_size, remaining, vlanf, curr_frag_len;
 	u16 i, rxq_idx = 0, vid, j;
+	u8 vtm;
 
 	num_rcvd = AMAP_GET_BITS(struct amap_eth_rx_compl, numfrags, rxcp);
 	pkt_size = AMAP_GET_BITS(struct amap_eth_rx_compl, pktsize, rxcp);
 	vlanf = AMAP_GET_BITS(struct amap_eth_rx_compl, vtp, rxcp);
 	rxq_idx = AMAP_GET_BITS(struct amap_eth_rx_compl, fragndx, rxcp);
+	vtm = AMAP_GET_BITS(struct amap_eth_rx_compl, vtm, rxcp);
+
+	/* vlanf could be wrongly set in some cards.
+	 * ignore if vtm is not set */
+	if ((adapter->cap == 0x400) && !vtm)
+		vlanf = 0;
 
 	skb = napi_get_frags(&eq_obj->napi);
 	if (!skb) {
@@ -2045,7 +2059,8 @@ static int be_hw_up(struct be_adapter *adapter)
 	if (status)
 		return status;
 
-	status = be_cmd_query_fw_cfg(adapter, &adapter->port_num);
+	status = be_cmd_query_fw_cfg(adapter,
+				&adapter->port_num, &adapter->cap);
 	return status;
 }
 
-- 
1.6.0.4


^ permalink raw reply related

* [PATCH] ipvs: Add boundary check on ioctl arguments
From: Arjan van de Ven @ 2009-09-30 11:11 UTC (permalink / raw)
  To: Wensong Zhang; +Cc: netdev, linux-kernel


>From 761a182f96b3707e1fee44e1079ba227e48745d1 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 30 Sep 2009 13:05:51 +0200
Subject: [PATCH] ipvs: Add boundary check on ioctl arguments

The ipvs code has a nifty system for doing the size of ioctl command copies;
it defines an array with values into which it indexes the cmd to find the
right length.

Unfortunately, the ipvs code forgot to check if the cmd was in the range
that the array provides, allowing for an index outside of the array,
which then gives a "garbage" result into the length, which then gets
used for copying into a stack buffer.

Fix this by adding sanity checks on these as well as the copy size.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 net/netfilter/ipvs/ip_vs_ctl.c |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index ac624e5..3c52796 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -2077,6 +2077,10 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
+	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX + 1)
+		return -EINVAL;
+	if (len < 0 || len >  MAX_ARG_LEN)
+		return -EINVAL;
 	if (len != set_arglen[SET_CMDID(cmd)]) {
 		pr_err("set_ctl: len %u != %u\n",
 		       len, set_arglen[SET_CMDID(cmd)]);
@@ -2353,17 +2357,25 @@ do_ip_vs_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 {
 	unsigned char arg[128];
 	int ret = 0;
+	unsigned int copylen;
 
 	if (!capable(CAP_NET_ADMIN))
 		return -EPERM;
 
+	if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_GET_MAX + 1)
+		return -EINVAL;
+
 	if (*len < get_arglen[GET_CMDID(cmd)]) {
 		pr_err("get_ctl: len %u < %u\n",
 		       *len, get_arglen[GET_CMDID(cmd)]);
 		return -EINVAL;
 	}
 
-	if (copy_from_user(arg, user, get_arglen[GET_CMDID(cmd)]) != 0)
+	copylen = get_arglen[GET_CMDID(cmd)];
+	if (copylen > 128)
+		return -EINVAL;
+
+	if (copy_from_user(arg, user, copylen) != 0)
 		return -EFAULT;
 
 	if (mutex_lock_interruptible(&__ip_vs_mutex))
-- 
1.6.2.5


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply related

* Re: [PATCH] connector: Allow permission checking in the receiver callbacks
From: Evgeniy Polyakov @ 2009-09-30 11:20 UTC (permalink / raw)
  To: Philipp Reisner; +Cc: linux-kernel, netdev, Lars Ellenberg, Andrew Morton
In-Reply-To: <1254235692-1631-1-git-send-email-philipp.reisner@linbit.com>

Hi Philipp.

On Tue, Sep 29, 2009 at 04:48:07PM +0200, Philipp Reisner (philipp.reisner@linbit.com) wrote:
> Various users of the connector should actually check if the
> sender's capabilities of a netlink/connector packet are
> actually sufficient for the operation they trigger. Up to
> now the connector framework did not allow the kernel side
> receiver to do so.
> 
> This patch set does the groundwork.
> 
> Philipp Reisner (4):
>   connector: Keep the skb in cn_callback_data
>   connector: Provide the sender's credentials to the callback
>   connector/dm: Fixed a compilation warning
>   connector: Removed the destruct_data callback since it is always
>     kfree_skb()

Patches look good to me.
Andrew please apply to the appropriate tree. I do not know whether it is
acceptible now, since it is not a bugfix, but merely a simple cleanup.
Feel free to add my signed off or ack, thank you.

>  Documentation/connector/cn_test.c      |    2 +-
>  Documentation/connector/connector.txt  |    8 ++++----
>  drivers/connector/cn_queue.c           |   12 +++++++-----
>  drivers/connector/connector.c          |   22 ++++++++--------------
>  drivers/md/dm-log-userspace-transfer.c |    3 +--
>  drivers/staging/dst/dcore.c            |    2 +-
>  drivers/staging/pohmelfs/config.c      |    2 +-
>  drivers/video/uvesafb.c                |    2 +-
>  drivers/w1/w1_netlink.c                |    2 +-
>  include/linux/connector.h              |   11 ++++-------
>  10 files changed, 29 insertions(+), 37 deletions(-)

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: WARNING: at net/ipv4/af_inet.c:154 inet_sock_destruct
From: Francis Moreau @ 2009-09-30 11:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linux Kernel Mailing List, Linux Netdev List, David S. Miller
In-Reply-To: <4AC1D0F5.4050709@gmail.com>

On Tue, Sep 29, 2009 at 11:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Francis Moreau a écrit :
>> Hello,
>>
>> I got this kernel warning when stopping nfsd:
>>
>> [260104.553720] WARNING: at net/ipv4/af_inet.c:154
>> inet_sock_destruct+0x164/0x182()
>> [260104.553722] Hardware name: P5K-VM
>> [260104.553724] Modules linked in: jfs loop nfsd lockd nfs_acl
>> auth_rpcgss exportfs sunrpc [last unloaded: microcode]
>> [260104.553736] Pid: 858, comm: nfsd Tainted: G   M       2.6.31 #13
>> [260104.553738] Call Trace:
>> [260104.553743]  [<ffffffff813ed53a>] ? inet_sock_destruct+0x164/0x182
>> [260104.553748]  [<ffffffff81044471>] warn_slowpath_common+0x7c/0xa9
>> [260104.553751]  [<ffffffff810444b2>] warn_slowpath_null+0x14/0x16
>> [260104.553754]  [<ffffffff813ed53a>] inet_sock_destruct+0x164/0x182
>> [260104.553759]  [<ffffffff8138e1c0>] __sk_free+0x23/0xe7
>> [260104.553762]  [<ffffffff8138e2fd>] sk_free+0x1f/0x21
>> [260104.553765]  [<ffffffff8138e3c7>] sk_common_release+0xc8/0xcd
>> [260104.553769]  [<ffffffff813e4459>] udp_lib_close+0xe/0x10
>> [260104.553772]  [<ffffffff813ecfe2>] inet_release+0x55/0x5c
>> [260104.553775]  [<ffffffff8138b746>] sock_release+0x1f/0x71
>> [260104.553778]  [<ffffffff8138b7bf>] sock_close+0x27/0x2b
>> [260104.553782]  [<ffffffff810d0641>] __fput+0xfb/0x1c0
>> [260104.553787]  [<ffffffff8104a197>] ? local_bh_disable+0x12/0x14
>> [260104.553790]  [<ffffffff810d0723>] fput+0x1d/0x1f
>> [260104.553810]  [<ffffffffa0014035>] svc_sock_free+0x40/0x56 [sunrpc]
>> [260104.553827]  [<ffffffffa001dea0>] svc_xprt_free+0x43/0x53 [sunrpc]
>> [260104.553843]  [<ffffffffa001de5d>] ? svc_xprt_free+0x0/0x53 [sunrpc]
>> [260104.553847]  [<ffffffff811b4641>] kref_put+0x43/0x4f
>> [260104.553863]  [<ffffffffa001d224>] svc_close_xprt+0x55/0x5e [sunrpc]
>> [260104.553879]  [<ffffffffa001d27d>] svc_close_all+0x50/0x69 [sunrpc]
>> [260104.553894]  [<ffffffffa0012922>] svc_destroy+0x9e/0x142 [sunrpc]
>> [260104.553910]  [<ffffffffa0012a7f>] svc_exit_thread+0xb9/0xc2 [sunrpc]
>> [260104.553922]  [<ffffffffa00707b1>] ? nfsd+0x0/0x151 [nfsd]
>> [260104.553932]  [<ffffffffa00708e8>] nfsd+0x137/0x151 [nfsd]
>> [260104.553936]  [<ffffffff8105ad28>] kthread+0x94/0x9c
>> [260104.553941]  [<ffffffff8100c1fa>] child_rip+0xa/0x20
>> [260104.553944]  [<ffffffff81047b00>] ? do_exit+0x5d7/0x691
>> [260104.553948]  [<ffffffff81039cf8>] ? finish_task_switch+0x6a/0xc7
>> [260104.553953]  [<ffffffff8100bb6d>] ? restore_args+0x0/0x30
>> [260104.553956]  [<ffffffff8105ac94>] ? kthread+0x0/0x9c
>> [260104.553959]  [<ffffffff8100c1f0>] ? child_rip+0x0/0x20
>>
>> It happens on 2.6.31 and older kernels as well though I don't remember
>> when it really started.
>
> Could you please try following patch ?
>

No trace of this bug has been seen so far.

thanks
-- 
Francis

^ permalink raw reply

* Re: [PATCH] [RFC] IPv4 TCP fails to send window scale option when window scale is zero
From: Ilpo Järvinen @ 2009-09-30 11:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Gilad Ben-Yossef, Netdev, Ori Finkalman
In-Reply-To: <4AC305BF.6080306@gmail.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5468 bytes --]

On Wed, 30 Sep 2009, Eric Dumazet wrote:

> Gilad Ben-Yossef a écrit :
> >
> > Eric Dumazet wrote:
> > 
> >> Gilad Ben-Yossef a écrit :
> >>  
> >>> From: Ori Finkalman <ori@comsleep.com>
> >>>
> >>>
> >>> Acknowledge TCP window scale support by inserting the proper option in
> >>> SYN/ACK header
> >>> even if our window scale is zero.
> >>>
> >>>
> >>> This fixes the following observed behavior:
> >>>
> >>>
> >>> 1. Client sends a SYN with TCP window scaling option and non zero window
> >>> scale value to a Linux box.
> >>>
> >>> 2. Linux box notes large receive window from client.
> >>>
> >>> 3. Linux decides on a zero value of window scale for its part.
> >>>
> >>> 4. Due to compare against requested window scale size option, Linux does
> >>> not to send windows scale
> >>>
> >>> TCP option header on SYN/ACK at all.
> >>>
> >>>
> >>> Result:
> >>>
> >>>
> >>> Client box thinks TCP window scaling is not supported, since SYN/ACK had
> >>> no TCP window scale option,
> >>> while Linux thinks that TCP window scaling is supported (and scale might
> >>> be non zero), since SYN had
> >>>
> >>> TCP window scale option and we have a mismatched idea between the client
> >>> and server regarding window sizes.
> >>>
> >>>
> >>> Please comment and/or apply.
> >>> ...
> >>>
> >>>
> >>> Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
> >>> Signed-off-by: Ori Finkelman <ori@comsleep.com>
> >>>
> >>>
> >>> Index: net/ipv4/tcp_output.c
> >>> ===================================================================
> >>> --- net/ipv4/tcp_output.c    (revision 46)
> >>> +++ net/ipv4/tcp_output.c    (revision 210)
> >>> @@ -353,6 +353,7 @@ static void tcp_init_nondata_skb(struct
> >>> #define OPTION_SACK_ADVERTISE    (1 << 0)
> >>> #define OPTION_TS        (1 << 1)
> >>> #define OPTION_MD5        (1 << 2)
> >>> +#define OPTION_WSCALE        (1 << 3)
> >>>
> >>> struct tcp_out_options {
> >>>     u8 options;        /* bit field of OPTION_* */
> >>> @@ -417,7 +418,7 @@ static void tcp_options_write(__be32 *pt
> >>>                    TCPOLEN_SACK_PERM);
> >>>     }
> >>>
> >>> -    if (unlikely(opts->ws)) {
> >>> +    if (unlikely(OPTION_WSCALE & opts->options)) {
> >>>         *ptr++ = htonl((TCPOPT_NOP << 24) |
> >>>                    (TCPOPT_WINDOW << 16) |
> >>>                    (TCPOLEN_WINDOW << 8) |
> >>> @@ -530,8 +531,8 @@ static unsigned tcp_synack_options(struc
> >>>
> >>>     if (likely(ireq->wscale_ok)) {
> >>>         opts->ws = ireq->rcv_wscale;
> >>> -        if(likely(opts->ws))
> >>> -            size += TCPOLEN_WSCALE_ALIGNED;
> >>> +        opts->options |= OPTION_WSCALE;
> >>> +        size += TCPOLEN_WSCALE_ALIGNED;
> >>>     }
> >>>     if (likely(doing_ts)) {
> >>>         opts->options |= OPTION_TS;
> >>>
> >>>
> >>>
> >>>     
> >>
> >> Seems not the more logical places to put this logic...
> >>
> >> How about this instead ?
> >>
> >> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> >> index 5200aab..b78c084 100644
> >> --- a/net/ipv4/tcp_output.c
> >> +++ b/net/ipv4/tcp_output.c
> >> @@ -216,6 +216,11 @@ void tcp_select_initial_window(int __space, __u32
> >> mss,
> >>              space >>= 1;
> >>              (*rcv_wscale)++;
> >>          }
> >> +        /*
> >> +         * Set a minimum wscale of 1
> >> +         */
> >> +        if (*rcv_wscale == 0)
> >> +            *rcv_wscale = 1;
> >>         }
> >>
> >>         /* Set initial window to value enough for senders,
> >>
> >>   
> > 
> > Thank you for the patch review. The suggested replacement patch
> > certainly is shorter, code wise, which is an advantage.
> > 
> > I cant help but feel though, that it is less readable - a window scale
> > of zero is a perfectly legit value. Adding special logic to rule it out
> > just because we chose to overload this setting for something else
> > (whether window scaling is supported or not) seems like an invitation
> > for someone to get it wrong again down the line, in my opinion.
> 
> As a matter of fact I didnot test your patch.
> 
> My reaction was driven by :
> 
> Your version slows down the tcp_options_write() function, once per tx packet.

Are you serious that anding would cost that much? :-/

> tcp_options_write() should not change socket state,

I fail to see how his patch was changing socket state in anyway in 
anywhere?

> while
> tcp_select_initial_window() is the exact place where we are supposed to
> compute wscale. 

And it calculated yielding to result of 0, which is perfectly valid. The 
problem is that tcp_write_options thinks that 0 is indication of no window 
scaling, instead of the correct interpretation of zero window scaling 
which makes the huge difference for the opposite direction traffic as 
these guys have noted. Not that I find your approach that bad either as 
we only lose 1-bit accuracy for the window which is rather insignificant 
as 1-byte window increments do not really make that much sense anyway 
(and we have to specifically code against doing them anyway so the 
effective granularity is much higher).

> Also how is managed tcp_syn_options() case (for outgoing connections ?)
> 
>         if (likely(sysctl_tcp_window_scaling)) {
>                 opts->ws = tp->rx_opt.rcv_wscale;
>                 if (likely(opts->ws))
>                         size += TCPOLEN_WSCALE_ALIGNED;
>         }
> 
> Dont you need to patch it as well ?

One certainly should change that too if that patch is the way to go 
forward.

-- 
 i.

^ permalink raw reply

* Re: mac80211: NOHZ: local_softirq_pending 08
From: Oliver Hartkopp @ 2009-09-30 11:56 UTC (permalink / raw)
  To: John W. Linville
  Cc: Michael Buesch, Kalle Valo, linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Johannes Berg
In-Reply-To: <20090929192928.GF2678-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>

John W. Linville wrote:
> On Sat, Sep 12, 2009 at 06:41:12PM +0200, Oliver Hartkopp wrote:
> 
>> i cooked a patch that introduces netif_rx_ti() and fixes up the problems in
>> mac80211 and the CAN subsystem.
> 
> Oliver,
> 
> Are you going to send this patch to Dave?  If you want me to carry
> it instead, please resend it with a proper changelog including a
> Signed-off-by line.  For that matter, Dave will most certainly want
> that as well...

Hello John,

as i wrote here

http://marc.info/?l=linux-netdev&m=125277885910179&w=2

there are currently only three occurrences of checks that use netif_rx() and
netif_rx_ni() depending on in_interrupt().

And regarding the suggested fix from Michael, that checked every(!) netif_rx()
whether it is in interrupt or not, i was unsure if a netif_tx_ti() would make
sense for only three cases?!?

If you think it makes sense, i can post a patch for that ... but:

Indeed it costs some additional investigation to prove whether netif_rx() or
netif_rx_ni() should be used in each case. But IMHO this has to be done before
providing a pump-gun function that solves the problem without thinking if we
are in irq-context or not. I want to avoid that people are using netif_rx_ti()
as some kind of default ...

I don't know how expensive in_interrupt() is, but it IMO should be avoided
when the context for a code section can be determined in another way.

Regards,
Oliver

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Arnd Bergmann @ 2009-09-29 21:23 UTC (permalink / raw)
  To: David Miller
  Cc: bhavesh, chrisw, pv-drivers, netdev, linux-kernel, virtualization,
	greg, anthony, jgarzik, akpm, shemminger
In-Reply-To: <20090929.125530.31894576.davem@davemloft.net>

On Tuesday 29 September 2009, David Miller wrote:
> > 
> > These header files are indeed shared with the host implementation,
> > as you've guessed. If it's not a big deal, we would like to keep
> > the names the same, just for our own sanity's sake?
> 
> No.  This isn't your source tree, it's everyone's.  So you should
> adhere to basic naming conventions and coding standards of the
> tree regardless of what you happen to use or need to use internally.

Well, there is nothing wrong with making the identifiers the same
everywhere, as long as they all follow the Linux coding style ;-).

I heard that a number of cross-OS device drivers do that nowadays.

	Arnd <><

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox