Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Splice on blocking TCP sockets again..
From: Eric Dumazet @ 2009-09-30  5:51 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: netdev, David S. Miller, Volker Lendecke
In-Reply-To: <20090930054031.GY22310@obsidianresearch.com>

Jason Gunthorpe a écrit :
>> One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
>> flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.

arg, this was tcp_splice_read() of course

> 
> Thanks Eric, this seems reasonable from my userspace perspective.
> 
> I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
> un-unixy to have a syscall completely ignore the NONBLOCK flag of the
> fd it is called on. Ie setting NONBLOCK on the pipe itself does
> nothing when using splice..
> 

Hmm, good question, I dont have the answer but I'll digg one.

^ permalink raw reply

* Re: [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: Eric Dumazet @ 2009-09-30  5:49 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Jay Vosburgh, netdev, David Miller, Andy Gospodarek
In-Reply-To: <20090930054015.GA5231@ff.dom.local>

Jarek Poplawski a écrit :
> On 30-09-2009 07:20, Eric Dumazet wrote:
> ...
>>> +	/* Choose an output channel for the ARP frame */
>>> +	tx_slave = rlb_choose_channel(skb, bond);
>>>  
>>> -		/* The ARP relpy packets must be delayed so that
>>> -		 * they can cancel out the influence of the ARP request.
>>> -		 */
>> COmments should have the following form :
>> /*
>>  * This is a fine comment
>>  */
> 
> It's "The preferred style for long (multi-line) comments [...]".

Yes I agree.

We keep old comments as they are, but for new code submission, we
try to follow common practices.
I know it seems odd, but in the end it helps code readability.

Of course, I would have not point this without another more interesting 
stuff in the review process :)



^ permalink raw reply

* Re: Splice on blocking TCP sockets again..
From: Jason Gunthorpe @ 2009-09-30  5:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Volker Lendecke
In-Reply-To: <4AC2E481.5060509@gmail.com>

> One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
> flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.

Thanks Eric, this seems reasonable from my userspace perspective.

I admit I don't understand why SPLICE_F_NONBLOCK exists, it seems very
un-unixy to have a syscall completely ignore the NONBLOCK flag of the
fd it is called on. Ie setting NONBLOCK on the pipe itself does
nothing when using splice..

Jason

^ permalink raw reply

* Re: [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: Jarek Poplawski @ 2009-09-30  5:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jay Vosburgh, netdev, David Miller, Andy Gospodarek
In-Reply-To: <4AC2EA97.5010809@gmail.com>

On 30-09-2009 07:20, Eric Dumazet wrote:
...
>> +	/* Choose an output channel for the ARP frame */
>> +	tx_slave = rlb_choose_channel(skb, bond);
>>  
>> -		/* The ARP relpy packets must be delayed so that
>> -		 * they can cancel out the influence of the ARP request.
>> -		 */
> 
> COmments should have the following form :
> /*
>  * This is a fine comment
>  */

It's "The preferred style for long (multi-line) comments [...]".

Jarek P.

^ permalink raw reply

* Re: [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: Eric Dumazet @ 2009-09-30  5:20 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, David Miller, Andy Gospodarek
In-Reply-To: <1254269731-7341-4-git-send-email-fubar@us.ibm.com>

Jay Vosburgh a écrit :
> From: Andy Gospodarek <andy@greyhouse.net>
> 
> This patch sends ARP request on the correct destination output interface
> rather than always sending them on the primary interface.  I've also
> added some bits to make sure that the source and destination address in
> the ARP header are correct since simply changing the source MAC and
> output interface will not be that helpful.
> 
> Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
> 
> ---
>  drivers/net/bonding/bond_alb.c |   45 ++++++++++++++++++---------------------
>  1 files changed, 21 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
> index 5cd0400..9148c75 100644
> --- a/drivers/net/bonding/bond_alb.c
> +++ b/drivers/net/bonding/bond_alb.c
> @@ -726,35 +726,32 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
>  	struct arp_pkt *arp = arp_pkt(skb);
>  	struct slave *tx_slave = NULL;
>  
> -	if (arp->op_code == htons(ARPOP_REPLY)) {
> -		/* the arp must be sent on the selected
> -		* rx channel
> -		*/
> -		tx_slave = rlb_choose_channel(skb, bond);
> -		if (tx_slave) {
> -			memcpy(arp->mac_src,tx_slave->dev->dev_addr, ETH_ALEN);
> -		}
> -		pr_debug("Server sent ARP Reply packet\n");
> -	} else if (arp->op_code == htons(ARPOP_REQUEST)) {
> -		/* Create an entry in the rx_hashtbl for this client as a
> -		 * place holder.
> -		 * When the arp reply is received the entry will be updated
> -		 * with the correct unicast address of the client.
> -		 */
> -		rlb_choose_channel(skb, bond);
> +	/* Choose an output channel for the ARP frame */
> +	tx_slave = rlb_choose_channel(skb, bond);
>  
> -		/* The ARP relpy packets must be delayed so that
> -		 * they can cancel out the influence of the ARP request.
> -		 */

COmments should have the following form :
/*
 * This is a fine comment
 */

> +	/* If a valid interface is returned, make sure the sender and target MAC
> +	 * addresses are correct based on the interface that will be transmitting
> +	 * the frame. */
> +	if (tx_slave) {
> +		/* If sender mac is the bond's address, rewrite */
> +		if (!compare_ether_addr_64bits(arp->mac_src,bond->dev->dev_addr))
> +			memcpy(arp->mac_src,tx_slave->dev->dev_addr,bond->dev->addr_len);

Hmm, you use compare_ether_addr_64bits(), implying a fix ETH_ALEN (6 bytes) address, then
you memcpy( ..., ..., bond->dev->addr_len);
Why not use ETH_ALEN to help compiler ?
Also add a space after a comma

> +
> +		/* If target mac is the bond's address, rewrite */
> +		if (!compare_ether_addr_64bits(arp->mac_dst,bond->dev->dev_addr))
> +			memcpy(arp->mac_dst,tx_slave->dev->dev_addr,bond->dev->addr_len);
> +
> +	} else if (arp->op_code == htons(ARPOP_REQUEST)) {
> +		/* if tx_slave is NULL, the periodic ARP replies must
> +		 * be delayed so they can cancel out the influence of
> +		 * the ARP request. */
>  		bond->alb_info.rlb_update_delay_counter = RLB_UPDATE_DELAY;
>  
> -		/* arp requests are broadcast and are sent on the primary
> -		 * the arp request will collapse all clients on the subnet to
> +		/* ARP requests are broadcast and are sent on the primary
> +		 * the ARP request will collapse all clients on the subnet to
>  		 * the primary slave. We must register these clients to be

Could you reformulate this comment, it is not readable as is :

 /*
  * ARP requests are broadcast and are sent on the primary.
  * The ARP request will collapse all clients on the subnet to
  * the primary slave. We must register these clients to be
  * updated with their assigned MAC.
  */


> -		 * updated with their assigned mac.
> -		 */
> +		 * updated with their assigned MAC. */
>  		rlb_req_update_subnet_clients(bond, arp->ip_src);
> -		pr_debug("Server sent ARP Request packet\n");
>  	}
>  
>  	return tx_slave;


^ permalink raw reply

* Re: Splice on blocking TCP sockets again..
From: Eric Dumazet @ 2009-09-30  4:54 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: netdev, David S. Miller, Volker Lendecke
In-Reply-To: <20090930004820.GC19540@obsidianresearch.com>

Jason Gunthorpe a écrit :
> Eric,
> 
> I saw your patch from January regarding splicing on blocking sockets,
> and I wondered what ever happened to it?
> 
> http://lkml.org/lkml/2009/1/13/507
> 
> It doesn't look like it has been applied.. I see the patch thread died
> at davem's comments?
> 
> I have run into exactly the same problem as Samba, where I'd like the
> TCP socket to be blocking, and the pipe to be non blocking ...
> 
> As it stands, 
>   splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
> causes a random endless block and
>   splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
> will return 0 immediately if the TCP buffer is empty.
> 
> FWIW, it looks like samba has a splice code now, but doesn't enable it
> due to this issue?
> 
> http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD
> 
> Thanks,
> Jason

Hi Jason, thanks for this reminding

Hmm, most probably I did not replied correctly do David objection which was :

Date	Wed, 14 Jan 2009 20:58:39 -0800 (PST)
Subject	Re: maximum buffer size for splice(2) tcp->pipe?
From	David Miller <>

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 14 Jan 2009 00:38:32 +0100
> [PATCH] net: splice() from tcp to socket should take into account O_NONBLOCK
> 
> Instead of using SPLICE_F_NONBLOCK to select a non blocking mode both on
> source tcp socket and pipe destination, we use the underlying file flag (O_NONBLOCK)
> for selecting a non blocking socket.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

This needs at least some more thought.

It seems, for one thing, that this change will interfere with the
intentions of the code in splice_dirt_to_actor which goes:

	/*
	 * Don't block on output, we have to drain the direct pipe.
	 */
	sd->flags &= ~SPLICE_F_NONBLOCK;

------------------------------------------------------------------------------

But splice_dist_to_actor() handles a REG/BLK file as input and a pipe as output,
so I believe my patch wont change splice_dist_to_actor() behavior.

My patch title was wrong :

net: splice() from tcp to socket should take into account O_NONBLOCK

So maybe David was mistaken by the title :)

[PATCH] net: splice() from tcp to pipe should take into account O_NONBLOCK

Before this patch :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
causes a random endless block (if pipe is full) and
splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.

User application has no way to instruct splice() that socket should be in blocking mode
but pipe in nonblock more.

http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD

One way to handle this is to switch tcp_read() to use the underlying file O_NONBLOCK
flag, as other socket operations do. And let SPLICE_F_NONBLOCK control the pipe output only.

Users will then call :

splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK ); 

to block on data coming from socket (if file is in blocking mode),
and not block on pipe output (to avoid deadlock)

Reported-by: Volker Lendecke <vl@samba.org>
Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 21387eb..8cdfab6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -580,7 +580,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,

 	lock_sock(sk);

-	timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
+	timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK);
 	while (tss.len) {
 		ret = __tcp_splice_read(sk, &tss);
 		if (ret < 0)

^ permalink raw reply related

* Splice on blocking TCP sockets again..
From: Jason Gunthorpe @ 2009-09-30  0:48 UTC (permalink / raw)
  To: Eric Dumazet, netdev; +Cc: David S. Miller, Volker Lendecke

Eric,

I saw your patch from January regarding splicing on blocking sockets,
and I wondered what ever happened to it?

http://lkml.org/lkml/2009/1/13/507

It doesn't look like it has been applied.. I see the patch thread died
at davem's comments?

I have run into exactly the same problem as Samba, where I'd like the
TCP socket to be blocking, and the pipe to be non blocking ...

As it stands, 
  splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE); 
causes a random endless block and
  splice(socket,0,pipe,0,128*1024,SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
will return 0 immediately if the TCP buffer is empty.

FWIW, it looks like samba has a splice code now, but doesn't enable it
due to this issue?

http://git.samba.org/?p=samba.git;a=history;f=source3/lib/recvfile.c;h=ea0159642137390a0f7e57a123684e6e63e47581;hb=HEAD

Thanks,
Jason

^ permalink raw reply

* [PATCH 2/3] bonding: make sure tx and rx hash tables stay in sync when using alb mode
From: Jay Vosburgh @ 2009-09-30  0:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Andy Gospodarek
In-Reply-To: <1254269731-7341-2-git-send-email-fubar@us.ibm.com>

From: Andy Gospodarek <andy@greyhouse.net>

I noticed that it was easy for alb (mode 6) bonding to get into a state
where the tx hash-table and rx hash-table are out of sync (there is
really nothing to keep them synchronized), and we will transmit traffic
destined for a host on one slave and send ARP frames to the same slave
from another interface using a different source MAC.

There is no compelling reason to do this, so this patch makes sure the
rx hash-table changes whenever the tx hash-table is updated based on
device load.  This patch also drops the code that does rlb re-balancing
since the balancing will now be controlled by the tx hash-table based on
transmit load.  In order to address an issue found with the initial
patch, I have also combined the rx and tx hash table lock into a single
lock.  This will facilitate moving these into a single table at some
point.

Patch modified by Jay Vosburgh to fix a typo and remove some leftover
rlb rebalance code.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
---
 drivers/net/bonding/bond_alb.c |  215 ++++++++++++++--------------------------
 drivers/net/bonding/bond_alb.h |    7 +-
 2 files changed, 75 insertions(+), 147 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index cf2842e..5cd0400 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -118,6 +118,7 @@ static inline struct arp_pkt *arp_pkt(const struct sk_buff *skb)
 
 /* Forward declaration */
 static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[]);
+static struct slave *alb_get_best_slave(struct bonding *bond, u32 hash_index);
 
 static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
 {
@@ -131,18 +132,18 @@ static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
 	return hash;
 }
 
-/*********************** tlb specific functions ***************************/
-
-static inline void _lock_tx_hashtbl(struct bonding *bond)
+/********************* hash table lock functions *************************/
+static inline void _lock_hashtbl(struct bonding *bond)
 {
-	spin_lock_bh(&(BOND_ALB_INFO(bond).tx_hashtbl_lock));
+	spin_lock_bh(&(BOND_ALB_INFO(bond).hashtbl_lock));
 }
 
-static inline void _unlock_tx_hashtbl(struct bonding *bond)
+static inline void _unlock_hashtbl(struct bonding *bond)
 {
-	spin_unlock_bh(&(BOND_ALB_INFO(bond).tx_hashtbl_lock));
+	spin_unlock_bh(&(BOND_ALB_INFO(bond).hashtbl_lock));
 }
 
+/*********************** tlb specific functions ***************************/
 /* Caller must hold tx_hashtbl lock */
 static inline void tlb_init_table_entry(struct tlb_client_info *entry, int save_load)
 {
@@ -170,7 +171,7 @@ static void tlb_clear_slave(struct bonding *bond, struct slave *slave, int save_
 	struct tlb_client_info *tx_hash_table;
 	u32 index;
 
-	_lock_tx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	/* clear slave from tx_hashtbl */
 	tx_hash_table = BOND_ALB_INFO(bond).tx_hashtbl;
@@ -187,7 +188,7 @@ static void tlb_clear_slave(struct bonding *bond, struct slave *slave, int save_
 
 	tlb_init_slave(slave);
 
-	_unlock_tx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* Must be called before starting the monitor timer */
@@ -198,7 +199,7 @@ static int tlb_initialize(struct bonding *bond)
 	struct tlb_client_info *new_hashtbl;
 	int i;
 
-	spin_lock_init(&(bond_info->tx_hashtbl_lock));
+	spin_lock_init(&(bond_info->hashtbl_lock));
 
 	new_hashtbl = kzalloc(size, GFP_KERNEL);
 	if (!new_hashtbl) {
@@ -207,7 +208,7 @@ static int tlb_initialize(struct bonding *bond)
 		       bond->dev->name);
 		return -1;
 	}
-	_lock_tx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	bond_info->tx_hashtbl = new_hashtbl;
 
@@ -215,7 +216,7 @@ static int tlb_initialize(struct bonding *bond)
 		tlb_init_table_entry(&bond_info->tx_hashtbl[i], 1);
 	}
 
-	_unlock_tx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	return 0;
 }
@@ -225,12 +226,12 @@ static void tlb_deinitialize(struct bonding *bond)
 {
 	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
 
-	_lock_tx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	kfree(bond_info->tx_hashtbl);
 	bond_info->tx_hashtbl = NULL;
 
-	_unlock_tx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* Caller must hold bond lock for read */
@@ -271,24 +272,6 @@ static struct slave *tlb_get_least_loaded_slave(struct bonding *bond)
 	return least_loaded;
 }
 
-/* Caller must hold bond lock for read and hashtbl lock */
-static struct slave *tlb_get_best_slave(struct bonding *bond, u32 hash_index)
-{
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct tlb_client_info *tx_hash_table = bond_info->tx_hashtbl;
-	struct slave *last_slave = tx_hash_table[hash_index].last_slave;
-	struct slave *next_slave = NULL;
-
-	if (last_slave && SLAVE_IS_OK(last_slave)) {
-		/* Use the last slave listed in the tx hashtbl if:
-		   the last slave currently is essentially unloaded. */
-		if (SLAVE_TLB_INFO(last_slave).load < 10)
-			next_slave = last_slave;
-	}
-
-	return next_slave ? next_slave : tlb_get_least_loaded_slave(bond);
-}
-
 /* Caller must hold bond lock for read */
 static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u32 skb_len)
 {
@@ -296,13 +279,12 @@ static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u3
 	struct tlb_client_info *hash_table;
 	struct slave *assigned_slave;
 
-	_lock_tx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_table = bond_info->tx_hashtbl;
 	assigned_slave = hash_table[hash_index].tx_slave;
 	if (!assigned_slave) {
-		assigned_slave = tlb_get_best_slave(bond, hash_index);
-
+		assigned_slave = alb_get_best_slave(bond, hash_index);
 		if (assigned_slave) {
 			struct tlb_slave_info *slave_info =
 				&(SLAVE_TLB_INFO(assigned_slave));
@@ -326,20 +308,52 @@ static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u3
 		hash_table[hash_index].tx_bytes += skb_len;
 	}
 
-	_unlock_tx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	return assigned_slave;
 }
 
 /*********************** rlb specific functions ***************************/
-static inline void _lock_rx_hashtbl(struct bonding *bond)
+
+/* Caller must hold bond lock for read and hashtbl lock */
+static struct slave *rlb_update_rx_table(struct bonding *bond, struct slave *next_slave, u32 hash_index)
 {
-	spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+
+	/* check rlb table and correct it if wrong */
+	if (bond_info->rlb_enabled) {
+		struct rlb_client_info *rx_client_info = &(bond_info->rx_hashtbl[hash_index]);
+
+		/* if the new slave computed by tlb checks doesn't match rlb, stop rlb from using it */
+		if (next_slave && (next_slave != rx_client_info->slave))
+			rx_client_info->slave = next_slave;
+	}
+	return next_slave;
 }
 
-static inline void _unlock_rx_hashtbl(struct bonding *bond)
+/* Caller must hold bond lock for read and hashtbl lock */
+static struct slave *alb_get_best_slave(struct bonding *bond, u32 hash_index)
 {
-	spin_unlock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	struct tlb_client_info *tx_hash_table = bond_info->tx_hashtbl;
+	struct slave *last_slave = tx_hash_table[hash_index].last_slave;
+	struct slave *next_slave = NULL;
+
+	/* presume the next slave will be the least loaded one */
+	next_slave = tlb_get_least_loaded_slave(bond);
+
+	if (last_slave && SLAVE_IS_OK(last_slave)) {
+		/* Use the last slave listed in the tx hashtbl if:
+		   the last slave currently is essentially unloaded. */
+		if (SLAVE_TLB_INFO(last_slave).load < 10)
+			next_slave = last_slave;
+	}
+
+	/* update the rlb hashtbl if there was a previous entry */
+	if (bond_info->rlb_enabled)
+		rlb_update_rx_table(bond, next_slave, hash_index);
+
+	return next_slave;
 }
 
 /* when an ARP REPLY is received from a client update its info
@@ -351,7 +365,7 @@ static void rlb_update_entry_from_arp(struct bonding *bond, struct arp_pkt *arp)
 	struct rlb_client_info *client_info;
 	u32 hash_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src));
 	client_info = &(bond_info->rx_hashtbl[hash_index]);
@@ -365,7 +379,7 @@ static void rlb_update_entry_from_arp(struct bonding *bond, struct arp_pkt *arp)
 		bond_info->rx_ntt = 1;
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 static int rlb_arp_recv(struct sk_buff *skb, struct net_device *bond_dev, struct packet_type *ptype, struct net_device *orig_dev)
@@ -409,38 +423,6 @@ out:
 	return res;
 }
 
-/* Caller must hold bond lock for read */
-static struct slave *rlb_next_rx_slave(struct bonding *bond)
-{
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct slave *rx_slave, *slave, *start_at;
-	int i = 0;
-
-	if (bond_info->next_rx_slave) {
-		start_at = bond_info->next_rx_slave;
-	} else {
-		start_at = bond->first_slave;
-	}
-
-	rx_slave = NULL;
-
-	bond_for_each_slave_from(bond, slave, i, start_at) {
-		if (SLAVE_IS_OK(slave)) {
-			if (!rx_slave) {
-				rx_slave = slave;
-			} else if (slave->speed > rx_slave->speed) {
-				rx_slave = slave;
-			}
-		}
-	}
-
-	if (rx_slave) {
-		bond_info->next_rx_slave = rx_slave->next;
-	}
-
-	return rx_slave;
-}
-
 /* teach the switch the mac of a disabled slave
  * on the primary for fault tolerance
  *
@@ -475,14 +457,14 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
 	u32 index, next_index;
 
 	/* clear slave from rx_hashtbl */
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	rx_hash_table = bond_info->rx_hashtbl;
 	index = bond_info->rx_hashtbl_head;
 	for (; index != RLB_NULL_INDEX; index = next_index) {
 		next_index = rx_hash_table[index].next;
 		if (rx_hash_table[index].slave == slave) {
-			struct slave *assigned_slave = rlb_next_rx_slave(bond);
+			struct slave *assigned_slave = alb_get_best_slave(bond, index);
 
 			if (assigned_slave) {
 				rx_hash_table[index].slave = assigned_slave;
@@ -506,7 +488,7 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
 		}
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	write_lock_bh(&bond->curr_slave_lock);
 
@@ -565,7 +547,7 @@ static void rlb_update_rx_clients(struct bonding *bond)
 	struct rlb_client_info *client_info;
 	u32 hash_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = bond_info->rx_hashtbl_head;
 	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
@@ -583,7 +565,7 @@ static void rlb_update_rx_clients(struct bonding *bond)
 	 */
 	bond_info->rlb_update_delay_counter = RLB_UPDATE_DELAY;
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* The slave was assigned a new mac address - update the clients */
@@ -594,7 +576,7 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
 	int ntt = 0;
 	u32 hash_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = bond_info->rx_hashtbl_head;
 	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
@@ -614,7 +596,7 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
 		bond_info->rlb_update_retry_counter = RLB_UPDATE_RETRY;
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* mark all clients using src_ip to be updated */
@@ -624,7 +606,7 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
 	struct rlb_client_info *client_info;
 	u32 hash_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = bond_info->rx_hashtbl_head;
 	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
@@ -650,7 +632,7 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
 		}
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /* Caller must hold both bond and ptr locks for read */
@@ -662,7 +644,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 	struct rlb_client_info *client_info;
 	u32 hash_index = 0;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	hash_index = _simple_hash((u8 *)&arp->ip_dst, sizeof(arp->ip_src));
 	client_info = &(bond_info->rx_hashtbl[hash_index]);
@@ -678,7 +660,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 
 			assigned_slave = client_info->slave;
 			if (assigned_slave) {
-				_unlock_rx_hashtbl(bond);
+				_unlock_hashtbl(bond);
 				return assigned_slave;
 			}
 		} else {
@@ -694,7 +676,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		}
 	}
 	/* assign a new slave */
-	assigned_slave = rlb_next_rx_slave(bond);
+	assigned_slave = alb_get_best_slave(bond, hash_index);
 
 	if (assigned_slave) {
 		client_info->ip_src = arp->ip_src;
@@ -730,7 +712,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 		}
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	return assigned_slave;
 }
@@ -778,36 +760,6 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
 	return tx_slave;
 }
 
-/* Caller must hold bond lock for read */
-static void rlb_rebalance(struct bonding *bond)
-{
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct slave *assigned_slave;
-	struct rlb_client_info *client_info;
-	int ntt;
-	u32 hash_index;
-
-	_lock_rx_hashtbl(bond);
-
-	ntt = 0;
-	hash_index = bond_info->rx_hashtbl_head;
-	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
-		client_info = &(bond_info->rx_hashtbl[hash_index]);
-		assigned_slave = rlb_next_rx_slave(bond);
-		if (assigned_slave && (client_info->slave != assigned_slave)) {
-			client_info->slave = assigned_slave;
-			client_info->ntt = 1;
-			ntt = 1;
-		}
-	}
-
-	/* update the team's flag only after the whole iteration */
-	if (ntt) {
-		bond_info->rx_ntt = 1;
-	}
-	_unlock_rx_hashtbl(bond);
-}
-
 /* Caller must hold rx_hashtbl lock */
 static void rlb_init_table_entry(struct rlb_client_info *entry)
 {
@@ -824,8 +776,6 @@ static int rlb_initialize(struct bonding *bond)
 	int size = RLB_HASH_TABLE_SIZE * sizeof(struct rlb_client_info);
 	int i;
 
-	spin_lock_init(&(bond_info->rx_hashtbl_lock));
-
 	new_hashtbl = kmalloc(size, GFP_KERNEL);
 	if (!new_hashtbl) {
 		pr_err(DRV_NAME
@@ -833,7 +783,7 @@ static int rlb_initialize(struct bonding *bond)
 		       bond->dev->name);
 		return -1;
 	}
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	bond_info->rx_hashtbl = new_hashtbl;
 
@@ -843,7 +793,7 @@ static int rlb_initialize(struct bonding *bond)
 		rlb_init_table_entry(bond_info->rx_hashtbl + i);
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 
 	/*initialize packet type*/
 	pk_type->type = cpu_to_be16(ETH_P_ARP);
@@ -862,13 +812,13 @@ static void rlb_deinitialize(struct bonding *bond)
 
 	dev_remove_pack(&(bond_info->rlb_pkt_type));
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	kfree(bond_info->rx_hashtbl);
 	bond_info->rx_hashtbl = NULL;
 	bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
@@ -876,7 +826,7 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
 	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
 	u32 curr_index;
 
-	_lock_rx_hashtbl(bond);
+	_lock_hashtbl(bond);
 
 	curr_index = bond_info->rx_hashtbl_head;
 	while (curr_index != RLB_NULL_INDEX) {
@@ -901,7 +851,7 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
 		curr_index = next_index;
 	}
 
-	_unlock_rx_hashtbl(bond);
+	_unlock_hashtbl(bond);
 }
 
 /*********************** tlb/rlb shared functions *********************/
@@ -1525,11 +1475,6 @@ void bond_alb_monitor(struct work_struct *work)
 			read_lock(&bond->lock);
 		}
 
-		if (bond_info->rlb_rebalance) {
-			bond_info->rlb_rebalance = 0;
-			rlb_rebalance(bond);
-		}
-
 		/* check if clients need updating */
 		if (bond_info->rx_ntt) {
 			if (bond_info->rlb_update_delay_counter) {
@@ -1582,10 +1527,6 @@ int bond_alb_init_slave(struct bonding *bond, struct slave *slave)
 	/* order a rebalance ASAP */
 	bond->alb_info.tx_rebalance_counter = BOND_TLB_REBALANCE_TICKS;
 
-	if (bond->alb_info.rlb_enabled) {
-		bond->alb_info.rlb_rebalance = 1;
-	}
-
 	return 0;
 }
 
@@ -1622,14 +1563,6 @@ void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char
 	} else if (link == BOND_LINK_UP) {
 		/* order a rebalance ASAP */
 		bond_info->tx_rebalance_counter = BOND_TLB_REBALANCE_TICKS;
-		if (bond->alb_info.rlb_enabled) {
-			bond->alb_info.rlb_rebalance = 1;
-			/* If the updelay module parameter is smaller than the
-			 * forwarding delay of the switch the rebalance will
-			 * not work because the rebalance arp replies will
-			 * not be forwarded to the clients..
-			 */
-		}
 	}
 }
 
diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
index b65fd29..24bf35a 100644
--- a/drivers/net/bonding/bond_alb.h
+++ b/drivers/net/bonding/bond_alb.h
@@ -90,7 +90,7 @@ struct tlb_slave_info {
 struct alb_bond_info {
 	struct timer_list	alb_timer;
 	struct tlb_client_info	*tx_hashtbl; /* Dynamically allocated */
-	spinlock_t		tx_hashtbl_lock;
+	spinlock_t		hashtbl_lock; /* lock for both tables */
 	u32			unbalanced_load;
 	int			tx_rebalance_counter;
 	int			lp_counter;
@@ -98,7 +98,6 @@ struct alb_bond_info {
 	int rlb_enabled;
 	struct packet_type	rlb_pkt_type;
 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
-	spinlock_t		rx_hashtbl_lock;
 	u32			rx_hashtbl_head;
 	u8			rx_ntt;	/* flag - need to transmit
 					 * to all rx clients
@@ -115,10 +114,6 @@ struct alb_bond_info {
 	u32			rlb_update_retry_counter;/* counter of retries
 							  * of client update
 							  */
-	u8			rlb_rebalance;	/* flag - indicates that the
-						 * rx traffic should be
-						 * rebalanced
-						 */
 	struct vlan_entry	*current_alb_vlan;
 };
 
-- 
1.6.0.2


^ permalink raw reply related

* [PATCH 3/3] bonding: send ARP requests on interfaces other than the primary for tlb/alb
From: Jay Vosburgh @ 2009-09-30  0:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Andy Gospodarek
In-Reply-To: <1254269731-7341-3-git-send-email-fubar@us.ibm.com>

From: Andy Gospodarek <andy@greyhouse.net>

This patch sends ARP request on the correct destination output interface
rather than always sending them on the primary interface.  I've also
added some bits to make sure that the source and destination address in
the ARP header are correct since simply changing the source MAC and
output interface will not be that helpful.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

---
 drivers/net/bonding/bond_alb.c |   45 ++++++++++++++++++---------------------
 1 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 5cd0400..9148c75 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -726,35 +726,32 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
 	struct arp_pkt *arp = arp_pkt(skb);
 	struct slave *tx_slave = NULL;
 
-	if (arp->op_code == htons(ARPOP_REPLY)) {
-		/* the arp must be sent on the selected
-		* rx channel
-		*/
-		tx_slave = rlb_choose_channel(skb, bond);
-		if (tx_slave) {
-			memcpy(arp->mac_src,tx_slave->dev->dev_addr, ETH_ALEN);
-		}
-		pr_debug("Server sent ARP Reply packet\n");
-	} else if (arp->op_code == htons(ARPOP_REQUEST)) {
-		/* Create an entry in the rx_hashtbl for this client as a
-		 * place holder.
-		 * When the arp reply is received the entry will be updated
-		 * with the correct unicast address of the client.
-		 */
-		rlb_choose_channel(skb, bond);
+	/* Choose an output channel for the ARP frame */
+	tx_slave = rlb_choose_channel(skb, bond);
 
-		/* The ARP relpy packets must be delayed so that
-		 * they can cancel out the influence of the ARP request.
-		 */
+	/* If a valid interface is returned, make sure the sender and target MAC
+	 * addresses are correct based on the interface that will be transmitting
+	 * the frame. */
+	if (tx_slave) {
+		/* If sender mac is the bond's address, rewrite */
+		if (!compare_ether_addr_64bits(arp->mac_src,bond->dev->dev_addr))
+			memcpy(arp->mac_src,tx_slave->dev->dev_addr,bond->dev->addr_len);
+
+		/* If target mac is the bond's address, rewrite */
+		if (!compare_ether_addr_64bits(arp->mac_dst,bond->dev->dev_addr))
+			memcpy(arp->mac_dst,tx_slave->dev->dev_addr,bond->dev->addr_len);
+
+	} else if (arp->op_code == htons(ARPOP_REQUEST)) {
+		/* if tx_slave is NULL, the periodic ARP replies must
+		 * be delayed so they can cancel out the influence of
+		 * the ARP request. */
 		bond->alb_info.rlb_update_delay_counter = RLB_UPDATE_DELAY;
 
-		/* arp requests are broadcast and are sent on the primary
-		 * the arp request will collapse all clients on the subnet to
+		/* ARP requests are broadcast and are sent on the primary
+		 * the ARP request will collapse all clients on the subnet to
 		 * the primary slave. We must register these clients to be
-		 * updated with their assigned mac.
-		 */
+		 * updated with their assigned MAC. */
 		rlb_req_update_subnet_clients(bond, arp->ip_src);
-		pr_debug("Server sent ARP Request packet\n");
 	}
 
 	return tx_slave;
-- 
1.6.0.2


^ permalink raw reply related

* [PATCH 1/3] bonding: allow previous slave to be used when re-balancing traffic on tlb/alb interfaces
From: Jay Vosburgh @ 2009-09-30  0:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Andy Gospodarek
In-Reply-To: <1254269731-7341-1-git-send-email-fubar@us.ibm.com>

From: Andy Gospodarek <andy@greyhouse.net>

When using tlb (mode 5) or alb (mode 6) bonding, a task runs every 10s
and re-balances the output devices based on load.  I was trying to
diagnose some connectivity issues and realized that a high-traffic host
would often switch output interfaces every 10s.  I discovered this
happened because the 'least loaded interface' was chosen as the next
output interface for any given stream and quite often some lower load
traffic would slip in an take the interface previously used by our
stream.  This meant the 'least loaded interface' was no longer the one
we used during the last interval.

The switching of streams to another interface was not extremely helpful
as it would force the destination host or router to update its ARP
tables and produce some additional ARP traffic as the destination host
verified that is was using the MAC address it expected.  Having the
destination MAC for a given IP change every 10s seems undesirable.

The decision was made to use the same slave during this interval if the
current load on that interface was < 10.  A load of < 10 indicates that
during the last 10s sample, roughly 100bytes were sent by all streams
currently assigned to that interface.  This essentially means the
interface is unloaded, but allows for a few frames that will probably
have minimal impact to slip into the same interface we were using in the
past.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>

---
 drivers/net/bonding/bond_alb.c |   21 ++++++++++++++++++++-
 drivers/net/bonding/bond_alb.h |    4 ++++
 2 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 9b5936f..cf2842e 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -150,6 +150,7 @@ static inline void tlb_init_table_entry(struct tlb_client_info *entry, int save_
 		entry->load_history = 1 + entry->tx_bytes /
 				      BOND_TLB_REBALANCE_INTERVAL;
 		entry->tx_bytes = 0;
+		entry->last_slave = entry->tx_slave;
 	}

 	entry->tx_slave = NULL;
@@ -270,6 +271,24 @@ static struct slave *tlb_get_least_loaded_slave(struct bonding *bond)
 	return least_loaded;
 }

+/* Caller must hold bond lock for read and hashtbl lock */
+static struct slave *tlb_get_best_slave(struct bonding *bond, u32 hash_index)
+{
+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
+	struct tlb_client_info *tx_hash_table = bond_info->tx_hashtbl;
+	struct slave *last_slave = tx_hash_table[hash_index].last_slave;
+	struct slave *next_slave = NULL;
+
+	if (last_slave && SLAVE_IS_OK(last_slave)) {
+		/* Use the last slave listed in the tx hashtbl if:
+		   the last slave currently is essentially unloaded. */
+		if (SLAVE_TLB_INFO(last_slave).load < 10)
+			next_slave = last_slave;
+	}
+
+	return next_slave ? next_slave : tlb_get_least_loaded_slave(bond);
+}
+
 /* Caller must hold bond lock for read */
 static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u32 skb_len)
 {
@@ -282,7 +301,7 @@ static struct slave *tlb_choose_channel(struct bonding *bond, u32 hash_index, u3
 	hash_table = bond_info->tx_hashtbl;
 	assigned_slave = hash_table[hash_index].tx_slave;
 	if (!assigned_slave) {
-		assigned_slave = tlb_get_least_loaded_slave(bond);
+		assigned_slave = tlb_get_best_slave(bond, hash_index);

 		if (assigned_slave) {
 			struct tlb_slave_info *slave_info =
diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
index 50968f8..b65fd29 100644
--- a/drivers/net/bonding/bond_alb.h
+++ b/drivers/net/bonding/bond_alb.h
@@ -36,6 +36,10 @@ struct tlb_client_info {
 				 * packets to a Client that the Hash function
 				 * gave this entry index.
 				 */
+	struct slave *last_slave; /* Pointer to last slave used for transmiting
+				 * packets to a Client that the Hash function
+				 * gave this entry index.
+				 */
 	u32 tx_bytes;		/* Each Client acumulates the BytesTx that
 				 * were tranmitted to it, and after each
 				 * CallBack the LoadHistory is devided
-- 
1.6.0.2

^ permalink raw reply related

* [PATCH net-next-2.6 0/3] bonding: TLB / ALB changes
From: Jay Vosburgh @ 2009-09-30  0:15 UTC (permalink / raw)
  To: netdev; +Cc: David Miller

	Note that this patchset is against net-next-2.6 with these
patches from Jiri Pirko already applied:

http://patchwork.ozlabs.org/patch/32684/
http://patchwork.ozlabs.org/patch/34272/

	Three patches for bonding:

	Patch 1 changes the tlb/alb algorithm to try and keep hosts
assigned to the same slave during a rebalance.

	Patch 2 synchronizes the rx and tx hash tables, and unifies
their locking.  Periodic rlb rebalance no longer occurs, as both hash
tables are controlled by the transmit side.

	Patch 3 arranges for ARP packets to be sent on a slave that is
appropriate for the destination, instead of always using the same slave
for all ARP traffic.

	A fourth patch has been discussed on the list (to dump the alb
and tlb hash tables); that patch is not being submitted at this time,
and may return in a different form at some point in the future.

	Please apply for net-next-2.6.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [PATCH v2 0/4] IPVS full NAT support + netfilter 'ipvs' match support
From: Simon Horman @ 2009-09-29 23:18 UTC (permalink / raw)
  To: Hannes Eder
  Cc: lvs-devel, Wensong Zhang, Julius Volz, lvs-users, Laurent Grawet,
	Jean-Luc Fortemaison, linux-kernel, Jan Engelhardt,
	Julian Anastasov, netfilter-devel, netdev, Fabien Duchêne,
	Joseph Mack NA3T, Patrick McHardy
In-Reply-To: <b5ddba180909290807j4d1d1d2dl48af9453542612ff@mail.gmail.com>

On Tue, Sep 29, 2009 at 05:07:24PM +0200, Hannes Eder wrote:
> On Tue, Sep 29, 2009 at 16:51, Simon Horman <horms@verge.net.au> wrote:
> > On Tue, Sep 29, 2009 at 02:35:15PM +0200, Hannes Eder wrote:
> >> The following series implements full NAT support for IPVS.  The
> >> approach is via a minimal change to IPVS (make friends with
> >> nf_conntrack) and adding a netfilter matcher, kernel- and user-space
> >> part, i.e. xt_ipvs and libxt_ipvs.
> >
> > Its a bit late in the day for me to review the code, but I have a few
> > quick comments.
> >
> >>
> >> Example usage:
> >>
> >> % ipvsadm -A -t 192.168.100.30:80 -s rr
> >> % ipvsadm -a -t 192.168.100.30:80 -r 192.168.10.20:80 -m
> >> # ...
> >>
> >> # Source NAT for VIP 192.168.100.30:80
> >> % iptables -t nat -A POSTROUTING -m ipvs --vaddr 192.168.100.30/32 \
> >> > --vport 80 -j SNAT --to-source 192.168.10.10
> >>
> >> or SNAT-ing only a specific real server:
> >>
> >> % iptables -t nat -A POSTROUTING --dst 192.168.11.20 \
> >> > -m ipvs --vaddr 192.168.100.30/32 -j SNAT --to-source 192.168.10.10
> >
> > If the iptables rule is not in place does LVS just use
> > its old NAT behaviour?
> 
> Yes, without iptables rules LVS NAT does DNAT.

Great.

> >> First of all, thanks for all the feedback.  This is the changelog for v2:
> >>
> >> - Make ip_vs_ftp work again.  Setup nf_conntrack expectations for
> >>   related data connections (based on Julian's patch see
> >>   http://www.ssi.bg/~ja/nfct/) and let nf_conntrack/nf_nat do the
> >>   packet mangling and the TCP sequence adjusting.
> >>
> >>   This change rises the question how to deal with ip_vs_sync?  Does it
> >>   work together with conntrackd?  Wild idea: what about getting rid of
> >>   ip_vs_sync and piggy packing all on nf_conntrack and use conntrackd?
> >>
> >>   Any comments on this?
> >
> >    That sounds like a reasonable suggestion.
> >
> >    I think that ip_vs_sync came along before conntrackd
> >    and no one has given much thought to merging the functionality.
> 
> Okay, I'll dig further in this direction.

Assuming the technical side is clean, I suspect the major problem will be
how to migrate users away from ip_vs_sync.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Chris Wright @ 2009-09-29 21:54 UTC (permalink / raw)
  To: Bhavesh Davda
  Cc: Chris Wright, Shreyas Bhatewara, pv-drivers@vmware.com,
	netdev@vger.kernel.org, Stephen Hemminger,
	linux-kernel@vger.kernel.org, virtualization, Anthony Liguori,
	Greg Kroah-Hartman, Andrew Morton, Jeff Garzik, David S. Miller
In-Reply-To: <8B1F619C9F5F454E81D90D3C161698D7017DB765E8@EXCH-MBX-3.vmware.com>

* Bhavesh Davda (bhavesh@vmware.com) wrote:
> > > Thanks a bunch for your really thorough review! I'll answer some of
> > your questions here. Shreyas can respond to your comments about some of
> > the coding style/comments/etc. in a separate mail.
> > 
> > The style is less important at this stage, but certainly eases review
> > to make it more consistent w/ Linux code.  The StudlyCaps, extra macros
> > (screaming caps) and inconistent space/tabs are visual distractions,
> > that's all.
> 
> Agreed, but we'll definitely address all the style issues in our subsequent patch posts. Actually Shreyas showed me his raw patch and it had tabs and not spaces, so we're trying to figure out if either Outlook (corporate blessed) or our Exchange server is converting those tabs to spaces or something.

Ah, that's always fun.  You can check by mailing to yourself and looking
at the outcome.

> > > We do have an internal prototype of a Linux vmxnet3 driver with 4 Tx
> > queues and 4 Rx queues, using 9 MSI-X vectors, but it needs some work
> > before calling it production ready.
> > 
> > I'd expect once you switch to alloc_etherdev_mq(), make napi work per
> > rx queue, and fix MSI-X allocation (all needed for 4/4), you should
> > have enough to support the max of 16/8 (IOW, 4/4 still sounds like an
> > aritificial limitation).
> 
> Absolutely: 4/4 was simply a prototype to see if it helps with performance any with certain benchmarks. So far it looks like there's a small performance gain with microbenchmarks like netperf, but we're hoping having multiple queues with multiple vectors might have some promise with macro benchmarks like SPECjbb. If it pans out, we'll most likely make it a module_param with some reasonable defaults, possibly just 1/1 by default.

Most physical devices that do MSI-X will do queue per cpu (or some
grouping if large number of cpus compared to queues).  Probably
reasonable default here too.

> > > > How about GRO conversion?
> > >
> > > Looks attractive, and we'll work on that in a subsequent patch.
> > Again, when we first wrote the driver, the NETIF_F_GRO stuff didn't
> > exist in Linux.
> > 
> > OK, shouldn't be too much work.
> > 
> > Another thing I forgot to mention is that net_device now has
> > net_device_stats in it.  So you shouldn't need net_device_stats in
> > vmxnet3_adapter.
> 
> Cool. Will do.
> 
> > > > > +#define UPT1_MAX_TX_QUEUES  64
> > > > > +#define UPT1_MAX_RX_QUEUES  64
> > > >
> > > > This is different than the 16/8 described above (and seemingly all
> > moot
> > > > since it becomes a single queue device).
> > >
> > > Nice catch! Those are not even used and are from the earliest days of
> > our driver development. We'll nuke those.
> > 
> > Could you describe the UPT layer a bit?  There were a number of
> > constants that didn't appear to be used.
> 
> UPT stands for Uniform Pass Thru, a spec/framework VMware developed with its IHV partners to implement the fast path (Tx/Rx) features of vmxnet3 in silicon. Some of these #defines that appear not to be used are based on this initial spec that VMware shared with its IHV partners.
> 
> We divided the emulated vmxnet3 PCIe device's registers into two sets on two separate BARs: BAR 0 for the UPT registers we asked IHV partners to implement that we emulate in our hypervisor if no physical device compliant with the UPT spec is available to pass thru from a virtual machine, and BAR 1 for registers we always emulate for slow path/control operations like setting the MAC address, or activating/quiescing/resetting the device, etc.

Interesting.  Sounds like part of NetQueue and also something that virtio
has looked into to support, e.g. VMDq.  Do you have a more complete
spec?

thanks,
-chris

^ permalink raw reply

* [PATCH] sit: fix off-by-one in ipip6_tunnel_get_prl
From: Sascha Hlusiak @ 2009-09-29 21:27 UTC (permalink / raw)
  To: netdev; +Cc: fred.l.templin, Sascha Hlusiak

When requesting all prl entries (kprl.addr == INADDR_ANY) and there are
more prl entries than there is space passed from userspace, the existing
code would always copy cmax+1 entries, which is more than can be handled.

This patch makes the kernel copy only exactly cmax entries.

Signed-off-by: Sascha Hlusiak <contact@saschahlusiak.de>
Acked-By: Fred L. Templin <Fred.L.Templin@boeing.com>
---
 net/ipv6/sit.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index d65e0c4..dbd19a7 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -274,7 +274,7 @@ static int ipip6_tunnel_get_prl(struct ip_tunnel *t,
 
 	c = 0;
 	for (prl = t->prl; prl; prl = prl->next) {
-		if (c > cmax)
+		if (c >= cmax)
 			break;
 		if (kprl.addr != htonl(INADDR_ANY) && prl->addr != kprl.addr)
 			continue;
-- 
1.6.5.rc1


^ permalink raw reply related

* RE: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Bhavesh Davda @ 2009-09-29 21:00 UTC (permalink / raw)
  To: Chris Wright
  Cc: Shreyas Bhatewara, pv-drivers@vmware.com, netdev@vger.kernel.org,
	Stephen Hemminger, linux-kernel@vger.kernel.org, virtualization,
	Anthony Liguori, Greg Kroah-Hartman, Andrew Morton, Jeff Garzik,
	David S. Miller
In-Reply-To: <20090929203018.GG3958@sequoia.sous-sol.org>

> > Thanks a bunch for your really thorough review! I'll answer some of
> your questions here. Shreyas can respond to your comments about some of
> the coding style/comments/etc. in a separate mail.
> 
> The style is less important at this stage, but certainly eases review
> to make it more consistent w/ Linux code.  The StudlyCaps, extra macros
> (screaming caps) and inconistent space/tabs are visual distractions,
> that's all.

Agreed, but we'll definitely address all the style issues in our subsequent patch posts. Actually Shreyas showed me his raw patch and it had tabs and not spaces, so we're trying to figure out if either Outlook (corporate blessed) or our Exchange server is converting those tabs to spaces or something.

> > We do have an internal prototype of a Linux vmxnet3 driver with 4 Tx
> queues and 4 Rx queues, using 9 MSI-X vectors, but it needs some work
> before calling it production ready.
> 
> I'd expect once you switch to alloc_etherdev_mq(), make napi work per
> rx queue, and fix MSI-X allocation (all needed for 4/4), you should
> have enough to support the max of 16/8 (IOW, 4/4 still sounds like an
> aritificial limitation).

Absolutely: 4/4 was simply a prototype to see if it helps with performance any with certain benchmarks. So far it looks like there's a small performance gain with microbenchmarks like netperf, but we're hoping having multiple queues with multiple vectors might have some promise with macro benchmarks like SPECjbb. If it pans out, we'll most likely make it a module_param with some reasonable defaults, possibly just 1/1 by default.

> > > How about GRO conversion?
> >
> > Looks attractive, and we'll work on that in a subsequent patch.
> Again, when we first wrote the driver, the NETIF_F_GRO stuff didn't
> exist in Linux.
> 
> OK, shouldn't be too much work.
> 
> Another thing I forgot to mention is that net_device now has
> net_device_stats in it.  So you shouldn't need net_device_stats in
> vmxnet3_adapter.

Cool. Will do.

> > > > +#define UPT1_MAX_TX_QUEUES  64
> > > > +#define UPT1_MAX_RX_QUEUES  64
> > >
> > > This is different than the 16/8 described above (and seemingly all
> moot
> > > since it becomes a single queue device).
> >
> > Nice catch! Those are not even used and are from the earliest days of
> our driver development. We'll nuke those.
> 
> Could you describe the UPT layer a bit?  There were a number of
> constants that didn't appear to be used.

UPT stands for Uniform Pass Thru, a spec/framework VMware developed with its IHV partners to implement the fast path (Tx/Rx) features of vmxnet3 in silicon. Some of these #defines that appear not to be used are based on this initial spec that VMware shared with its IHV partners.

We divided the emulated vmxnet3 PCIe device's registers into two sets on two separate BARs: BAR 0 for the UPT registers we asked IHV partners to implement that we emulate in our hypervisor if no physical device compliant with the UPT spec is available to pass thru from a virtual machine, and BAR 1 for registers we always emulate for slow path/control operations like setting the MAC address, or activating/quiescing/resetting the device, etc.

> > > > +static void
> > > > +vmxnet3_process_events(struct vmxnet3_adapter *adapter)
> > >
> > > Should be trivial to break out to it's own MSI-X vector, basically
> set
> > > up to do that already.
> >
> > Yes, and the device is configurable to use any vector for any
> "events", but didn't see any compelling reason to do so. "ECR" events
> are extremely rare and we've got a shadow copy of the ECR register that
> avoids an expensive round trip to the device, stored in adapter-
> >shared->ecr. So we can cheaply handle events on the hot Tx/Rx path
> with minimal overhead. But if you really see a compelling reason to
> allocate a separate MSI-X vector for events, we can certainly do that.
> 
> Nah, just thinking outloud while trying to understand the driver.  I
> figured it'd be the + 1 vector (16 + 8 + 1).

Great. In that case we'll stay with not allocating a separate vector for events for now.

Thanks!

- Bhavesh

> 
> thanks,
> -chris

^ permalink raw reply

* RE: [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Shreyas Bhatewara @ 2009-09-29 20:56 UTC (permalink / raw)
  To: Chris Wright
  Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	Stephen Hemminger, David S. Miller, Jeff Garzik, Anthony Liguori,
	Greg Kroah-Hartman, Andrew Morton, virtualization,
	pv-drivers@vmware.com
In-Reply-To: <20090929085333.GC3958@sequoia.sous-sol.org>

Chris,

Thanks for the review.

Bhavesh responded to some queries. I will attempt the answer the rest.
I am working on rebasing the code to v2.6.32-rc1. Will send out a patch
with the changes you suggested after that.


->Shreyas

> -----Original Message-----
> From: Chris Wright [mailto:chrisw@sous-sol.org]
> Sent: Tuesday, September 29, 2009 1:54 AM
> To: Shreyas Bhatewara
> Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org; Stephen
> Hemminger; David S. Miller; Jeff Garzik; Anthony Liguori; Chris Wright;
> Greg Kroah-Hartman; Andrew Morton; virtualization; pv-
> drivers@vmware.com
> Subject: Re: [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC
> driver: vmxnet3
>
>
> >         Wake-on-LAN, PCI Power Management D0-D3 states
> >         PXE-ROM for boot support
> >
>
> Whole thing appears to be space indented, and is fairly noisy w/
> printk.

The code written is tab indented. Is there a specific line / function
which you find space indented ? If not, may be my email client did not
preserve the tabs while sending. I will take care while posting next patch.

Some of the printks are debug only. Others have been marked as INFO or ERR
appropriately. I will remove a few.


> Also, heavy use of BUG_ON() (counted 51 of them), are you sure that
> none
> of them can be triggered by guest or remote (esp. the ones that happen
> in interrupt context)?  Some initial thoughts below.
>

Like you said below, I will remove the user input dependent ones.


> > + * the file called "COPYING".
> > + *
> > + * Maintained by: Shreyas Bhatewara <pv-drivers@vmware.com>
> > + *
> > + */
> > +
> > +/* upt1_defs.h
> > + *
> > + *      Definitions for Uniform Pass Through.
> > + */
>
> Most of the source files have this format (some include -- after file
> name).  Could just keep it all w/in the same comment block.  Since you
> went to the trouble of saying what the file does, something a tad more
> descriptive would be welcome.
>

Yes, I will merge the two blocks and elaborate on the description.

> > +
> > +#ifndef _UPT1_DEFS_H
> > +#define _UPT1_DEFS_H
> > +
> > +#define UPT1_MAX_TX_QUEUES  64
> > +#define UPT1_MAX_RX_QUEUES  64
>
> This is different than the 16/8 described above (and seemingly all moot
> since it becomes a single queue device).
>
> > +
> > +/* interrupt moderation level */
> > +#define UPT1_IML_NONE     0 /* no interrupt moderation */
> > +#define UPT1_IML_HIGHEST  7 /* least intr generated */
> > +#define UPT1_IML_ADAPTIVE 8 /* adpative intr moderation */
>
> enum?  also only appears to support adaptive mode?
> > +
> > +/*
> > + * QueueDescPA must be 128 bytes aligned. It points to an array of
> > + * Vmxnet3_TxQueueDesc followed by an array of Vmxnet3_RxQueueDesc.
> > + * The number of Vmxnet3_TxQueueDesc/Vmxnet3_RxQueueDesc are
> specified by
> > + * Vmxnet3_MiscConf.numTxQueues/numRxQueues, respectively.
> > + */
> > +#define VMXNET3_QUEUE_DESC_ALIGN  128
>
> Lot of inconsistent spacing between types and names in the structure
> def'ns

Okay, I will try to make it uniform.

>
> > +struct Vmxnet3_MiscConf {
> > +       struct Vmxnet3_DriverInfo driverInfo;
> > +       uint64_t             uptFeatures;
> > +       uint64_t             ddPA;         /* driver data PA */
> > +       uint64_t             queueDescPA;  /* queue descriptor table
> PA */
> > +       uint32_t             ddLen;        /* driver data len */
> > +       uint32_t             queueDescLen; /* queue desc. table len
> in bytes */
> > +       uint32_t             mtu;
> > +       uint16_t             maxNumRxSG;
> > +       uint8_t              numTxQueues;
> > +       uint8_t              numRxQueues;
> > +       uint32_t             reserved[4];
> > +};
>
> should this be packed (or others that are shared w/ device)?  i assume
> you've already done 32 vs 64 here
>
> > +struct Vmxnet3_TxQueueConf {
> > +       uint64_t    txRingBasePA;
> > +       uint64_t    dataRingBasePA;
> > +       uint64_t    compRingBasePA;
> > +       uint64_t    ddPA;         /* driver data */
> > +       uint64_t    reserved;
> > +       uint32_t    txRingSize;   /* # of tx desc */
> > +       uint32_t    dataRingSize; /* # of data desc */
> > +       uint32_t    compRingSize; /* # of comp desc */
> > +       uint32_t    ddLen;        /* size of driver data */
> > +       uint8_t     intrIdx;
> > +       uint8_t     _pad[7];
> > +};
> > +
> > +
> > +struct Vmxnet3_RxQueueConf {
> > +       uint64_t    rxRingBasePA[2];
> > +       uint64_t    compRingBasePA;
> > +       uint64_t    ddPA;            /* driver data */
> > +       uint64_t    reserved;
> > +       uint32_t    rxRingSize[2];   /* # of rx desc */
> > +       uint32_t    compRingSize;    /* # of rx comp desc */
> > +       uint32_t    ddLen;           /* size of driver data */
> > +       uint8_t     intrIdx;
> > +       uint8_t     _pad[7];
> > +};
> > +
> > +enum vmxnet3_intr_mask_mode {
> > +       VMXNET3_IMM_AUTO   = 0,
> > +       VMXNET3_IMM_ACTIVE = 1,
> > +       VMXNET3_IMM_LAZY   = 2
> > +};
> > +
> > +enum vmxnet3_intr_type {
> > +       VMXNET3_IT_AUTO = 0,
> > +       VMXNET3_IT_INTX = 1,
> > +       VMXNET3_IT_MSI  = 2,
> > +       VMXNET3_IT_MSIX = 3
> > +};
> > +
> > +#define VMXNET3_MAX_TX_QUEUES  8
> > +#define VMXNET3_MAX_RX_QUEUES  16
>
> different to UPT, I must've missed some layering here

There are the right ones, I will remove the other definitions.

>
> > +/* addition 1 for events */
> > +#define VMXNET3_MAX_INTRS      25
> > +
> > +
> <snip>
>
> > --- /dev/null
> > +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
> > @@ -0,0 +1,2608 @@
> > +/*
> > + * Linux driver for VMware's vmxnet3 ethernet NIC.
> <snip>
> > +/*
> > + * vmxnet3_drv.c --
> > + *
> > + *      Linux driver for VMware's vmxnet3 NIC
> > + */
>
> Not useful
>
> > +static void
> > +vmxnet3_enable_intr(struct vmxnet3_adapter *adapter, unsigned
> intr_idx)
> > +{
> > +       VMXNET3_WRITE_BAR0_REG(adapter, VMXNET3_REG_IMR + intr_idx *
> 8, 0);
>
>       writel(0, adapter->hw_addr0 + VMXNET3_REG_IMR + intr_idx * 8)
>
> seems just as clear to me.

The intention is to differentiate bar0 and bar1 writes. hw_addr0/1
doesn't seem to convey that instantly.

>
> > +vmxnet3_enable_all_intrs(struct vmxnet3_adapter *adapter)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < adapter->intr.num_intrs; i++)
> > +               vmxnet3_enable_intr(adapter, i);
> > +}
> > +
> > +static void
> > +vmxnet3_disable_all_intrs(struct vmxnet3_adapter *adapter)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < adapter->intr.num_intrs; i++)
> > +               vmxnet3_disable_intr(adapter, i);
> > +}
>
> only ever num_intrs=1, so there's some plan to bump this up and make
> these wrappers useful?
>
> > +static void
> > +vmxnet3_ack_events(struct vmxnet3_adapter *adapter, u32 events)
> > +{
> > +       VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_ECR, events);
> > +}
> > +
> > +
> > +static bool
> > +vmxnet3_tq_stopped(struct vmxnet3_tx_queue *tq, struct
> vmxnet3_adapter *adapter)
> > +{
> > +       return netif_queue_stopped(adapter->netdev);
> > +}
> > +
> > +
> > +static void
> > +vmxnet3_tq_start(struct vmxnet3_tx_queue *tq, struct vmxnet3_adapter
> *adapter)
> > +{
> > +       tq->stopped = false;
>
> is tq->stopped used besides just toggling back and forth?

It is used in ethtool ops.

>
> > +       netif_start_queue(adapter->netdev);
> > +}
>
> > +static void
> > +vmxnet3_process_events(struct vmxnet3_adapter *adapter)
>
> Should be trivial to break out to it's own MSI-X vector, basically set
> up to do that already.
>
> > +{
> > +       u32 events = adapter->shared->ecr;
> > +       if (!events)
> > +               return;
> > +
> > +       vmxnet3_ack_events(adapter, events);
> > +
> > +       /* Check if link state has changed */
> > +       if (events & VMXNET3_ECR_LINK)
> > +               vmxnet3_check_link(adapter);
> > +
> > +       /* Check if there is an error on xmit/recv queues */
> > +       if (events & (VMXNET3_ECR_TQERR | VMXNET3_ECR_RQERR)) {
> > +               VMXNET3_WRITE_BAR1_REG(adapter, VMXNET3_REG_CMD,
> > +                                      VMXNET3_CMD_GET_QUEUE_STATUS);
> > +
> > +               if (adapter->tqd_start->status.stopped) {
> > +                       printk(KERN_ERR "%s: tq error 0x%x\n",
> > +                              adapter->netdev->name,
> > +                              adapter->tqd_start->status.error);
> > +               }
> > +               if (adapter->rqd_start->status.stopped) {
> > +                       printk(KERN_ERR "%s: rq error 0x%x\n",
> > +                              adapter->netdev->name,
> > +                              adapter->rqd_start->status.error);
> > +               }
> > +
> > +               schedule_work(&adapter->work);
> > +       }
> > +}
> <snip>
>
> > +
> > +       tq->buf_info = kcalloc(sizeof(tq->buf_info[0]), tq-
> >tx_ring.size,
> > +                              GFP_KERNEL);
>
> kcalloc args look backwards
>
> <snip>
> > +static int
> > +vmxnet3_alloc_pci_resources(struct vmxnet3_adapter *adapter, bool
> *dma64)
> > +{
> > +       int err;
> > +       unsigned long mmio_start, mmio_len;
> > +       struct pci_dev *pdev = adapter->pdev;
> > +
> > +       err = pci_enable_device(pdev);
>
> looks ioport free, can be pci_enable_device_mem()...

Yes, will do that.

>
> > +       if (err) {
> > +               printk(KERN_ERR "Failed to enable adapter %s: error
> %d\n",
> > +                      pci_name(pdev), err);
> > +               return err;
> > +       }
> > +
> > +       if (pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) == 0) {
> > +               if (pci_set_consistent_dma_mask(pdev,
> DMA_BIT_MASK(64)) != 0) {
> > +                       printk(KERN_ERR "pci_set_consistent_dma_mask
> failed "
> > +                              "for adapter %s\n", pci_name(pdev));
> > +                       err = -EIO;
> > +                       goto err_set_mask;
> > +               }
> > +               *dma64 = true;
> > +       } else {
> > +               if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)) != 0) {
> > +                       printk(KERN_ERR "pci_set_dma_mask failed for
> adapter "
> > +                              "%s\n",  pci_name(pdev));
> > +                       err = -EIO;
> > +                       goto err_set_mask;
> > +               }
> > +               *dma64 = false;
> > +       }
> > +
> > +       err = pci_request_regions(pdev, vmxnet3_driver_name);
>
> ...pci_request_selected_regions()

Okay.

>
> > +       if (err) {
> > +               printk(KERN_ERR "Failed to request region for adapter
> %s: "
> > +                      "error %d\n", pci_name(pdev), err);
> > +               goto err_set_mask;
> > +       }
> > +
> > +       pci_set_master(pdev);
> > +
> > +       mmio_start = pci_resource_start(pdev, 0);
> > +       mmio_len = pci_resource_len(pdev, 0);
> > +       adapter->hw_addr0 = ioremap(mmio_start, mmio_len);
> > +       if (!adapter->hw_addr0) {
> > +               printk(KERN_ERR "Failed to map bar0 for adapter
> %s\n",
> > +                      pci_name(pdev));
> > +               err = -EIO;
> > +               goto err_ioremap;
> > +       }
> > +
> > +       mmio_start = pci_resource_start(pdev, 1);
> > +       mmio_len = pci_resource_len(pdev, 1);
> > +       adapter->hw_addr1 = ioremap(mmio_start, mmio_len);
> > +       if (!adapter->hw_addr1) {
> > +               printk(KERN_ERR "Failed to map bar1 for adapter
> %s\n",
> > +                      pci_name(pdev));
> > +               err = -EIO;
> > +               goto err_bar1;
> > +       }
> > +       return 0;
> > +
> > +err_bar1:
> > +       iounmap(adapter->hw_addr0);
> > +err_ioremap:
> > +       pci_release_regions(pdev);
>
> ...and pci_release_selected_regions()
>
> > +err_set_mask:
> > +       pci_disable_device(pdev);
> > +       return err;
> > +}
> > +
>
> <snip>
> > +vmxnet3_declare_features(struct vmxnet3_adapter *adapter, bool
> dma64)
> > +{
> > +       struct net_device *netdev = adapter->netdev;
> > +
> > +       netdev->features = NETIF_F_SG |
> > +               NETIF_F_HW_CSUM |
> > +               NETIF_F_HW_VLAN_TX |
> > +               NETIF_F_HW_VLAN_RX |
> > +               NETIF_F_HW_VLAN_FILTER |
> > +               NETIF_F_TSO |
> > +               NETIF_F_TSO6;
> > +
> > +       printk(KERN_INFO "features: sg csum vlan jf tso tsoIPv6");
> > +
> > +       adapter->rxcsum = true;
> > +       adapter->jumbo_frame = true;
> > +
> > +       if (!disable_lro) {
> > +               adapter->lro = true;
> > +               printk(" lro");
> > +       }
>
> Plan to switch to GRO?
>
> > +       if (dma64) {
> > +               netdev->features |= NETIF_F_HIGHDMA;
> > +               printk(" highDMA");
> > +       }
> > +
> > +       netdev->vlan_features = netdev->features;
> > +       printk("\n");
> > +}
> > +
> > +static int __devinit
> > +vmxnet3_probe_device(struct pci_dev *pdev,
> > +                    const struct pci_device_id *id)
> > +{
> > +       static const struct net_device_ops vmxnet3_netdev_ops = {
> > +               .ndo_open  = vmxnet3_open,
> > +               .ndo_stop  = vmxnet3_close,
> > +               .ndo_start_xmit = vmxnet3_xmit_frame,
> > +               .ndo_set_mac_address = vmxnet3_set_mac_addr,
> > +               .ndo_change_mtu = vmxnet3_change_mtu,
> > +               .ndo_get_stats = vmxnet3_get_stats,
> > +               .ndo_tx_timeout = vmxnet3_tx_timeout,
> > +               .ndo_set_multicast_list = vmxnet3_set_mc,
> > +               .ndo_vlan_rx_register = vmxnet3_vlan_rx_register,
> > +               .ndo_vlan_rx_add_vid = vmxnet3_vlan_rx_add_vid,
> > +               .ndo_vlan_rx_kill_vid = vmxnet3_vlan_rx_kill_vid,
> > +#   ifdef CONFIG_NET_POLL_CONTROLLER
> > +               .ndo_poll_controller = vmxnet3_netpoll,
> > +#   endif
>
> #ifdef
> #endif
>
> is more typical style here
>
> > +       };
> > +       int err;
> > +       bool dma64 = false; /* stupid gcc */
> > +       u32 ver;
> > +       struct net_device *netdev;
> > +       struct vmxnet3_adapter *adapter;
> > +       u8  mac[ETH_ALEN];
>
> extra space between type and name
>
> > +
> > +       netdev = alloc_etherdev(sizeof(struct vmxnet3_adapter));
> > +       if (!netdev) {
> > +               printk(KERN_ERR "Failed to alloc ethernet device for
> adapter "
> > +                       "%s\n", pci_name(pdev));
> > +               return -ENOMEM;
> > +       }
> > +
> > +       pci_set_drvdata(pdev, netdev);
> > +       adapter = netdev_priv(netdev);
> > +       adapter->netdev = netdev;
> > +       adapter->pdev = pdev;
> > +
> > +       adapter->shared = pci_alloc_consistent(adapter->pdev,
> > +                         sizeof(struct Vmxnet3_DriverShared),
> > +                         &adapter->shared_pa);
> > +       if (!adapter->shared) {
> > +               printk(KERN_ERR "Failed to allocate memory for %s\n",
> > +                       pci_name(pdev));
> > +               err = -ENOMEM;
> > +               goto err_alloc_shared;
> > +       }
> > +
> > +       adapter->tqd_start  = pci_alloc_consistent(adapter->pdev,
>
> extra space before =
>
> > diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c
> b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> > new file mode 100644
> > index 0000000..490577f
> > --- /dev/null
> > +++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> > +#include "vmxnet3_int.h"
> > +
> > +struct vmxnet3_stat_desc {
> > +       char desc[ETH_GSTRING_LEN];
> > +       int  offset;
> > +};
> > +
> > +
> > +static u32
> > +vmxnet3_get_rx_csum(struct net_device *netdev)
> > +{
> > +       struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> > +       return adapter->rxcsum;
> > +}
> > +
> > +
> > +static int
> > +vmxnet3_set_rx_csum(struct net_device *netdev, u32 val)
> > +{
> > +       struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> > +
> > +       if (adapter->rxcsum != val) {
> > +               adapter->rxcsum = val;
> > +               if (netif_running(netdev)) {
> > +                       if (val)
> > +                               adapter->shared-
> >devRead.misc.uptFeatures |=
> > +
> UPT1_F_RXCSUM;
> > +                       else
> > +                               adapter->shared-
> >devRead.misc.uptFeatures &=
> > +
> ~UPT1_F_RXCSUM;
> > +
> > +                       VMXNET3_WRITE_BAR1_REG(adapter,
> VMXNET3_REG_CMD,
> > +
> VMXNET3_CMD_UPDATE_FEATURE);
> > +               }
> > +       }
> > +       return 0;
> > +}
> > +
> > +
> > +static u32
> > +vmxnet3_get_tx_csum(struct net_device *netdev)
> > +{
> > +       return (netdev->features & NETIF_F_HW_CSUM) != 0;
> > +}
>
> Not needed
>
> > +static int
> > +vmxnet3_set_tx_csum(struct net_device *netdev, u32 val)
> > +{
> > +       if (val)
> > +               netdev->features |= NETIF_F_HW_CSUM;
> > +       else
> > +               netdev->features &= ~NETIF_F_HW_CSUM;
> > +
> > +       return 0;
> > +}
>
> This is just ethtool_op_set_tx_hw_csum()
>
> > +static int
> > +vmxnet3_set_sg(struct net_device *netdev, u32 val)
> > +{
> > +       ethtool_op_set_sg(netdev, val);
> > +       return 0;
> > +}
>
> Useless wrapper
>
> > +static int
> > +vmxnet3_set_tso(struct net_device *netdev, u32 val)
> > +{
> > +       ethtool_op_set_tso(netdev, val);
> > +       return 0;
> > +}
>
> Useless wrapper
>

I will remove the unwanted wrappers functions, spaces and get back with a patch.


^ permalink raw reply

* Re: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Chris Wright @ 2009-09-29 20:30 UTC (permalink / raw)
  To: Bhavesh Davda
  Cc: Chris Wright, Shreyas Bhatewara, pv-drivers@vmware.com,
	netdev@vger.kernel.org, Stephen Hemminger,
	linux-kernel@vger.kernel.org, virtualization, Anthony Liguori,
	Greg Kroah-Hartman, Andrew Morton, Jeff Garzik, David S. Miller
In-Reply-To: <8B1F619C9F5F454E81D90D3C161698D7017DB76576@EXCH-MBX-3.vmware.com>

* Bhavesh Davda (bhavesh@vmware.com) wrote:
> Hi Chris,
> 
> Thanks a bunch for your really thorough review! I'll answer some of your questions here. Shreyas can respond to your comments about some of the coding style/comments/etc. in a separate mail.

The style is less important at this stage, but certainly eases review
to make it more consistent w/ Linux code.  The StudlyCaps, extra macros
(screaming caps) and inconistent space/tabs are visual distractions,
that's all.

> > >         INTx, MSI, MSI-X (25 vectors) interrupts
> > >         16 Rx queues, 8 Tx queues
> > 
> > Driver doesn't appear to actually support more than a single MSI-X
> > interrupt.
> > What is your plan for doing real multiqueue?
> 
> When we first wrote the driver a couple of years ago, Linux lacked proper multiqueue support, hence we chose to use only a single queue though the emulated device does support 16 Rx and 8 Tx queues, and 25 MSI-X vectors: 16 for Rx, 8 for Tx and 1 for other asynchronous event notifications, by design. Actually a driver can repurpose any of the 25 vectors for any notifications; just explaining the rationale for desiging the device with 25 MSI-X vectors.

I see, thanks.

> We do have an internal prototype of a Linux vmxnet3 driver with 4 Tx queues and 4 Rx queues, using 9 MSI-X vectors, but it needs some work before calling it production ready.

I'd expect once you switch to alloc_etherdev_mq(), make napi work per
rx queue, and fix MSI-X allocation (all needed for 4/4), you should
have enough to support the max of 16/8 (IOW, 4/4 still sounds like an
aritificial limitation).

> > How about GRO conversion?
> 
> Looks attractive, and we'll work on that in a subsequent patch. Again, when we first wrote the driver, the NETIF_F_GRO stuff didn't exist in Linux.

OK, shouldn't be too much work.

Another thing I forgot to mention is that net_device now has
net_device_stats in it.  So you shouldn't need net_device_stats in
vmxnet3_adapter.

> > Also, heavy use of BUG_ON() (counted 51 of them), are you sure that
> > none
> > of them can be triggered by guest or remote (esp. the ones that happen
> > in interrupt context)?  Some initial thoughts below.
> 
> We'll definitely audit all the BUG_ONs again to make sure they can't be exploited.
> 
> > > --- /dev/null
> > > +++ b/drivers/net/vmxnet3/upt1_defs.h
> > > +#define UPT1_MAX_TX_QUEUES  64
> > > +#define UPT1_MAX_RX_QUEUES  64
> > 
> > This is different than the 16/8 described above (and seemingly all moot
> > since it becomes a single queue device).
> 
> Nice catch! Those are not even used and are from the earliest days of our driver development. We'll nuke those.

Could you describe the UPT layer a bit?  There were a number of
constants that didn't appear to be used.

> > > +/* interrupt moderation level */
> > > +#define UPT1_IML_NONE     0 /* no interrupt moderation */
> > > +#define UPT1_IML_HIGHEST  7 /* least intr generated */
> > > +#define UPT1_IML_ADAPTIVE 8 /* adpative intr moderation */
> > 
> > enum?  also only appears to support adaptive mode?
> 
> Yes, the Linux driver currently only asks for adaptive mode, but the device supports 8 interrupt moderation levels.
> 
> > > --- /dev/null
> > > +++ b/drivers/net/vmxnet3/vmxnet3_defs.h
> > > +struct Vmxnet3_MiscConf {
> > > +       struct Vmxnet3_DriverInfo driverInfo;
> > > +       uint64_t             uptFeatures;
> > > +       uint64_t             ddPA;         /* driver data PA */
> > > +       uint64_t             queueDescPA;  /* queue descriptor table
> > PA */
> > > +       uint32_t             ddLen;        /* driver data len */
> > > +       uint32_t             queueDescLen; /* queue desc. table len
> > in bytes */
> > > +       uint32_t             mtu;
> > > +       uint16_t             maxNumRxSG;
> > > +       uint8_t              numTxQueues;
> > > +       uint8_t              numRxQueues;
> > > +       uint32_t             reserved[4];
> > > +};
> > 
> > should this be packed (or others that are shared w/ device)?  i assume
> > you've already done 32 vs 64 here
> > 
> 
> No need for packing since the fields are naturally 64-bit aligned. True for all structures shared between the driver and device.

I had quickly looked and thought I saw a hole that would lead to
inconsistent layout for 32-bit vs 64-bit.  I figured I'd be wrong
there ;-)

> > > +#define VMXNET3_MAX_TX_QUEUES  8
> > > +#define VMXNET3_MAX_RX_QUEUES  16
> > 
> > different to UPT, I must've missed some layering here
> 
> These are the authoritiative #defines. Ignore the UPT ones.
> 
> > > --- /dev/null
> > > +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
> > > +       VMXNET3_WRITE_BAR0_REG(adapter, VMXNET3_REG_IMR + intr_idx *
> > 8, 0);
> > 
> > 	writel(0, adapter->hw_addr0 + VMXNET3_REG_IMR + intr_idx * 8)
> > seems just as clear to me.
> 
> Fair enough. We were just trying to clearly show which register accesses go to BAR 0 versus BAR 1.
> 
> > only ever num_intrs=1, so there's some plan to bump this up and make
> > these wrappers useful?
> 
> Yes.
> 
> > > +static void
> > > +vmxnet3_process_events(struct vmxnet3_adapter *adapter)
> > 
> > Should be trivial to break out to it's own MSI-X vector, basically set
> > up to do that already.
> 
> Yes, and the device is configurable to use any vector for any "events", but didn't see any compelling reason to do so. "ECR" events are extremely rare and we've got a shadow copy of the ECR register that avoids an expensive round trip to the device, stored in adapter->shared->ecr. So we can cheaply handle events on the hot Tx/Rx path with minimal overhead. But if you really see a compelling reason to allocate a separate MSI-X vector for events, we can certainly do that.

Nah, just thinking outloud while trying to understand the driver.  I
figured it'd be the + 1 vector (16 + 8 + 1).

thanks,
-chris

^ permalink raw reply

* Re: Very strange issues with ethernet wake on lan
From: Rafael J. Wysocki @ 2009-09-29 20:28 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: netdev@vger.kernel.org, linux-pm@lists.linux-foundation.org
In-Reply-To: <1250394174.32268.13.camel@localhost.localdomain>

On Sunday 16 August 2009, Maxim Levitsky wrote:
> Hi,
> 
> I have recently put back the davicom dm9009 ethernet card into my
> computer.
> 
> Some long time ago, I have written its suspend/resume routines.
> Now I see that few things have changed, like I need to enable wake in
> sysfs or better patch the code to do so, some nice helpers like
> pci_prepare_to_sleep have arrived, etc.
> 
> 
> I narrowed the strange issue down to following situation:
> 
> I reload dmfe.ko (and networkmanager is disabled)
> I don't ifup the device, thus pretty much no hardware initialization
> takes place (but this appears not to matter anyway)
> 
> I then suspend the system, and WOL doesn't work (I have patched the
> driver to enable WOL automaticly)
> 
> I then, suspend again. WOL works, and continues to work as long as I
> don't reload the driver. If I do, same situation repeats.
> 
> Also, after a boot, WOL works, so a reload cycle triggers that issue.
> 
> And most importantly, if I don't do a
> 
> pci_set_power_state(pci_dev, pci_choose_state (pci_dev, state));
> 
> in .suspend, then WOL always works.
> 
> and I have even tried to set state manually to PCI_D3hot or PCI_D3cold, 
> 
> I also tried to use pci_save_state
> 
> 
> I also have 2 copies of this card, and both have this issue.
> I also tried 2 pci slots.
> 
> Kernel is vanilla 2.6.31-rc5

Please check if this still happens with 2.6.32-rc1.

Best,
Rafael

^ permalink raw reply

* Re: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: David Miller @ 2009-09-29 19:55 UTC (permalink / raw)
  To: bhavesh
  Cc: arnd, chrisw, pv-drivers, netdev, linux-kernel, virtualization,
	greg, anthony, jgarzik, akpm, shemminger
In-Reply-To: <8B1F619C9F5F454E81D90D3C161698D7017DB76582@EXCH-MBX-3.vmware.com>

From: Bhavesh Davda <bhavesh@vmware.com>
Date: Tue, 29 Sep 2009 12:52:05 -0700

>> One thing that should possibly be fixed is the naming of identifiers,
>> e.g.
>> 's/Vmxnet3_MiscConf/vmxnet3_misc_conf/g', unless these header files are
>> shared with the host implementation.
> 
> These header files are indeed shared with the host implementation, as you've guessed. If it's not a big deal, we would like to keep the names the same, just for our own sanity's sake?

No.  This isn't your source tree, it's everyone's.  So you should
adhere to basic naming conventions and coding standards of the
tree regardless of what you happen to use or need to use internally.

^ permalink raw reply

* RE: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Bhavesh Davda @ 2009-09-29 19:52 UTC (permalink / raw)
  To: Arnd Bergmann, Chris Wright
  Cc: pv-drivers@vmware.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, virtualization, Greg Kroah-Hartman,
	Anthony Liguori, Jeff Garzik, Andrew Morton, Stephen Hemminger,
	David S. Miller
In-Reply-To: <200909291505.50961.arnd@arndb.de>

Hi Arnd,

> On Tuesday 29 September 2009, Chris Wright wrote:
> > > +struct Vmxnet3_MiscConf {
> > > +       struct Vmxnet3_DriverInfo driverInfo;
> > > +       uint64_t             uptFeatures;
> > > +       uint64_t             ddPA;         /* driver data PA */
> > > +       uint64_t             queueDescPA;  /* queue descriptor
> table PA */
> > > +       uint32_t             ddLen;        /* driver data len */
> > > +       uint32_t             queueDescLen; /* queue desc. table len
> in bytes */
> > > +       uint32_t             mtu;
> > > +       uint16_t             maxNumRxSG;
> > > +       uint8_t              numTxQueues;
> > > +       uint8_t              numRxQueues;
> > > +       uint32_t             reserved[4];
> > > +};
> >
> > should this be packed (or others that are shared w/ device)?  i
> assume
> > you've already done 32 vs 64 here
> 
> I would not mark it packed, because it already is well-defined on all
> systems. You should add __packed only to the fields where you screwed
> up, but not to structures that already work fine.

You're exactly right; I reiterated as much in my response to Chris.

> One thing that should possibly be fixed is the naming of identifiers,
> e.g.
> 's/Vmxnet3_MiscConf/vmxnet3_misc_conf/g', unless these header files are
> shared with the host implementation.

These header files are indeed shared with the host implementation, as you've guessed. If it's not a big deal, we would like to keep the names the same, just for our own sanity's sake?

Thanks!

- Bhavesh

> 
> 	Arnd <><

^ permalink raw reply

* RE: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Bhavesh Davda @ 2009-09-29 19:45 UTC (permalink / raw)
  To: Chris Wright, Shreyas Bhatewara
  Cc: pv-drivers@vmware.com, netdev@vger.kernel.org, Stephen Hemminger,
	linux-kernel@vger.kernel.org, virtualization, Anthony Liguori,
	Greg Kroah-Hartman, Andrew Morton, Jeff Garzik, David S. Miller
In-Reply-To: <20090929085333.GC3958@sequoia.sous-sol.org>

Hi Chris,

Thanks a bunch for your really thorough review! I'll answer some of your questions here. Shreyas can respond to your comments about some of the coding style/comments/etc. in a separate mail.

> >         INTx, MSI, MSI-X (25 vectors) interrupts
> >         16 Rx queues, 8 Tx queues
> 
> Driver doesn't appear to actually support more than a single MSI-X
> interrupt.
> What is your plan for doing real multiqueue?

When we first wrote the driver a couple of years ago, Linux lacked proper multiqueue support, hence we chose to use only a single queue though the emulated device does support 16 Rx and 8 Tx queues, and 25 MSI-X vectors: 16 for Rx, 8 for Tx and 1 for other asynchronous event notifications, by design. Actually a driver can repurpose any of the 25 vectors for any notifications; just explaining the rationale for desiging the device with 25 MSI-X vectors.

We do have an internal prototype of a Linux vmxnet3 driver with 4 Tx queues and 4 Rx queues, using 9 MSI-X vectors, but it needs some work before calling it production ready.

> How about GRO conversion?

Looks attractive, and we'll work on that in a subsequent patch. Again, when we first wrote the driver, the NETIF_F_GRO stuff didn't exist in Linux.

> Also, heavy use of BUG_ON() (counted 51 of them), are you sure that
> none
> of them can be triggered by guest or remote (esp. the ones that happen
> in interrupt context)?  Some initial thoughts below.

We'll definitely audit all the BUG_ONs again to make sure they can't be exploited.

> > --- /dev/null
> > +++ b/drivers/net/vmxnet3/upt1_defs.h
> > +#define UPT1_MAX_TX_QUEUES  64
> > +#define UPT1_MAX_RX_QUEUES  64
> 
> This is different than the 16/8 described above (and seemingly all moot
> since it becomes a single queue device).

Nice catch! Those are not even used and are from the earliest days of our driver development. We'll nuke those.

> > +/* interrupt moderation level */
> > +#define UPT1_IML_NONE     0 /* no interrupt moderation */
> > +#define UPT1_IML_HIGHEST  7 /* least intr generated */
> > +#define UPT1_IML_ADAPTIVE 8 /* adpative intr moderation */
> 
> enum?  also only appears to support adaptive mode?

Yes, the Linux driver currently only asks for adaptive mode, but the device supports 8 interrupt moderation levels.

> > --- /dev/null
> > +++ b/drivers/net/vmxnet3/vmxnet3_defs.h
> > +struct Vmxnet3_MiscConf {
> > +       struct Vmxnet3_DriverInfo driverInfo;
> > +       uint64_t             uptFeatures;
> > +       uint64_t             ddPA;         /* driver data PA */
> > +       uint64_t             queueDescPA;  /* queue descriptor table
> PA */
> > +       uint32_t             ddLen;        /* driver data len */
> > +       uint32_t             queueDescLen; /* queue desc. table len
> in bytes */
> > +       uint32_t             mtu;
> > +       uint16_t             maxNumRxSG;
> > +       uint8_t              numTxQueues;
> > +       uint8_t              numRxQueues;
> > +       uint32_t             reserved[4];
> > +};
> 
> should this be packed (or others that are shared w/ device)?  i assume
> you've already done 32 vs 64 here
> 

No need for packing since the fields are naturally 64-bit aligned. True for all structures shared between the driver and device.

> > +#define VMXNET3_MAX_TX_QUEUES  8
> > +#define VMXNET3_MAX_RX_QUEUES  16
> 
> different to UPT, I must've missed some layering here

These are the authoritiative #defines. Ignore the UPT ones.

> > --- /dev/null
> > +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
> > +       VMXNET3_WRITE_BAR0_REG(adapter, VMXNET3_REG_IMR + intr_idx *
> 8, 0);
> 
> 	writel(0, adapter->hw_addr0 + VMXNET3_REG_IMR + intr_idx * 8)
> seems just as clear to me.

Fair enough. We were just trying to clearly show which register accesses go to BAR 0 versus BAR 1.

> only ever num_intrs=1, so there's some plan to bump this up and make
> these wrappers useful?

Yes.

> > +static void
> > +vmxnet3_process_events(struct vmxnet3_adapter *adapter)
> 
> Should be trivial to break out to it's own MSI-X vector, basically set
> up to do that already.

Yes, and the device is configurable to use any vector for any "events", but didn't see any compelling reason to do so. "ECR" events are extremely rare and we've got a shadow copy of the ECR register that avoids an expensive round trip to the device, stored in adapter->shared->ecr. So we can cheaply handle events on the hot Tx/Rx path with minimal overhead. But if you really see a compelling reason to allocate a separate MSI-X vector for events, we can certainly do that.

> 
> Plan to switch to GRO?

Already answered.

Thanks

- Bhavesh

^ permalink raw reply

* Re: mac80211: NOHZ: local_softirq_pending 08
From: John W. Linville @ 2009-09-29 19:29 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: Michael Buesch, Kalle Valo, linux-wireless, netdev, Johannes Berg
In-Reply-To: <4AABCF28.6090505@hartkopp.net>

On Sat, Sep 12, 2009 at 06:41:12PM +0200, Oliver Hartkopp wrote:

> i cooked a patch that introduces netif_rx_ti() and fixes up the problems in
> mac80211 and the CAN subsystem.

Oliver,

Are you going to send this patch to Dave?  If you want me to carry
it instead, please resend it with a proper changelog including a
Signed-off-by line.  For that matter, Dave will most certainly want
that as well...

Thanks!

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: [Fwd: Re: Bug#538372: header failure including netlink.h (or uio.h)]
From: Jarek Poplawski @ 2009-09-29 19:21 UTC (permalink / raw)
  Cc: Manuel Prinz, netdev
In-Reply-To: <20090929092727.GA7025@ff.dom.local>

Jarek Poplawski wrote, On 09/29/2009 11:27 AM:

> On 28-09-2009 13:24, Manuel Prinz wrote:
>> Hi everyone,
>>
>> I'm forwarding this bug in Debian (http://bugs.debian.org/538372) as
>> requested by the Debian kernel team. A patch is available. Applying just
>> the first hunk fixes the issue for me. I've not enough kernel knowledge
>> to judge if this fix is a proper solution, though.
>>
>> It would be really great if someone could have a look at it. Thanks in
>> advance! (And please CC me in replies. Thanks!)
> 
> I've tried it with current include/linux and it works OK. Replacing
> uio.h on Debian really was not enough, but it looks like missing
> compiler.h entries could be the reason. Otherwise, please send your
> compile error log.
 

Hmm... Probably you noticed it yourself, so a little update for the
record. My method wasn't the proper way, but repeating it with
sanitized headers from the current kernel works even better: replacing
Debian's uio.h only seems to be enough yet.

Jarek P.

^ permalink raw reply

* [net-2.6 PATCH 5/5] qlge: Fix error exit for probe call.
From: Ron Mercer @ 2009-09-29 18:39 UTC (permalink / raw)
  To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <1254249565-16381-1-git-send-email-ron.mercer@qlogic.com>


Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/qlge/qlge_main.c |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index fbef305..c8a9efe 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -3832,11 +3832,14 @@ static int __devinit ql_init_device(struct pci_dev *pdev,
 		return err;
 	}
 
+	qdev->ndev = ndev;
+	qdev->pdev = pdev;
+	pci_set_drvdata(pdev, ndev);
 	pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
 	if (pos <= 0) {
 		dev_err(&pdev->dev, PFX "Cannot find PCI Express capability, "
 			"aborting.\n");
-		goto err_out;
+		return pos;
 	} else {
 		pci_read_config_word(pdev, pos + PCI_EXP_DEVCTL, &val16);
 		val16 &= ~PCI_EXP_DEVCTL_NOSNOOP_EN;
@@ -3849,7 +3852,7 @@ static int __devinit ql_init_device(struct pci_dev *pdev,
 	err = pci_request_regions(pdev, DRV_NAME);
 	if (err) {
 		dev_err(&pdev->dev, "PCI region request failed.\n");
-		goto err_out;
+		return err;
 	}
 
 	pci_set_master(pdev);
@@ -3867,7 +3870,6 @@ static int __devinit ql_init_device(struct pci_dev *pdev,
 		goto err_out;
 	}
 
-	pci_set_drvdata(pdev, ndev);
 	qdev->reg_base =
 	    ioremap_nocache(pci_resource_start(pdev, 1),
 			    pci_resource_len(pdev, 1));
@@ -3887,8 +3889,6 @@ static int __devinit ql_init_device(struct pci_dev *pdev,
 		goto err_out;
 	}
 
-	qdev->ndev = ndev;
-	qdev->pdev = pdev;
 	err = ql_get_board_info(qdev);
 	if (err) {
 		dev_err(&pdev->dev, "Register access failed.\n");
-- 
1.6.0.2


^ permalink raw reply related

* [net-2.6 PATCH 4/5] qlge: Protect reset recovery with rtnl_lock().
From: Ron Mercer @ 2009-09-29 18:39 UTC (permalink / raw)
  To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <1254249565-16381-1-git-send-email-ron.mercer@qlogic.com>

Move the call to rtnl_lock() to before the internal call to
ql_adapter_down()/ql_adapter_up().  This prevents collisions that can
happen when recovering from an asic error.

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/qlge/qlge_main.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index b05300d..fbef305 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -3703,7 +3703,7 @@ static void ql_asic_reset_work(struct work_struct *work)
 	struct ql_adapter *qdev =
 	    container_of(work, struct ql_adapter, asic_reset_work.work);
 	int status;
-
+	rtnl_lock();
 	status = ql_adapter_down(qdev);
 	if (status)
 		goto error;
@@ -3711,12 +3711,12 @@ static void ql_asic_reset_work(struct work_struct *work)
 	status = ql_adapter_up(qdev);
 	if (status)
 		goto error;
-
+	rtnl_unlock();
 	return;
 error:
 	QPRINTK(qdev, IFUP, ALERT,
 		"Driver up/down cycle failed, closing device\n");
-	rtnl_lock();
+
 	set_bit(QL_ADAPTER_UP, &qdev->flags);
 	dev_close(qdev->ndev);
 	rtnl_unlock();
-- 
1.6.0.2


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox