Netdev List
 help / color / mirror / Atom feed
* Re: [Xen-devel] [PATCH] net/core: Order-3 frag allocator causes SWIOTLB bouncing under Xen
From: Konrad Rzeszutek Wilk @ 2013-09-06 13:27 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Eric Dumazet, Eliezer Tamir, Neil Horman, netdev, linux-kernel,
	xen-devel, Eric Dumazet, Li Zefan, david.vrabel, Zoltan Kiss,
	malcolm.crossley, David S. Miller
In-Reply-To: <1378366746.6935.35.camel@dagon.hellion.org.uk>

On Thu, Sep 05, 2013 at 08:39:06AM +0100, Ian Campbell wrote:
> On Wed, 2013-09-04 at 17:11 -0400, Konrad Rzeszutek Wilk wrote:
> > On Wed, Sep 04, 2013 at 02:00:40PM -0700, Eric Dumazet wrote:
> 
> > > Maybe you could add proper infrastructure to deal with Xen limitations.
> > 
> > I think Ian posted at some point an sysctl patch for that (more for
> > debugging that anything else). And it kind
> > of stalled: http://lists.xen.org/archives/html/xen-devel/2012-10/msg01832.html
> 
> I think I though you were looking into it from the swiotlb angle?

Yes. I didn't find anything immediately obvious - but I can still
reproduce with an skge DMA issues when booting baremetal with 'swiotlb=force'.

But only under 32-bit. I think there is some physical address
truncation with the new compound size skb's. Obviously needs
further investigation.

> 
> In any case I don't have time to look into this further.
> 
> > Is that what you mean by proper infrastructure ?
> > 
> > Oh wait, did you mean via dev and not the whole system wide sysctl?
> 
> The system wide sysctl was not acceptable AFAIR, which seems reasonable.

<nods>
> 
> Per-dev is hard because it affects the native drivers for each physical
> NIC when running under Xen, not the Xen PV NIC which we fixed by
> splitting compound frags into separate slots on the PV ring.

Right. Thank you for pointing that obvious issue.

> 
> Ian.
> 

^ permalink raw reply

* Re: TSQ accounting skb->truesize degrades throughput for large packets
From: Wei Liu @ 2013-09-06 13:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel
In-Reply-To: <1378472268.31445.15.camel@edumazet-glaptop>

On Fri, Sep 06, 2013 at 05:57:48AM -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> > Hi Eric
> > 
> > I have some questions regarding TSQ and I hope you can shed some light
> > on this.
> > 
> > Our observation is that with the default TSQ limit (128K), throughput
> > for Xen network driver for large packets degrades. That's because we now
> > only have 1 packet in queue.
> > 
> > I double-checked that skb->len is indeed <64K. Then I discovered that
> > TSQ actually accounts for skb->truesize and the packets generated had
> > skb->truesize > 64K which effectively prevented us from putting 2
> > packets in queue.
> > 
> > There seems to be no way to limit skb->truesize inside driver -- the skb
> > is already constructed when it comes to xen-netfront.
> > 
> 
> What is the skb->truesize value then ? It must be huge, and its clearly
> a problem, because the tcp _receiver_ will also grow its window slower,
> if packet is looped back.
> 

It's ~66KB.

> > My questions are:
> >   1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
> >      accounts skb truesize, including skb overhead. But thats OK", I
> >      don't quite understand why it is OK.
> >   2) presumably other drivers will suffer from this as well, is it
> >      possible to account for skb->len instead of skb->truesize?
> 
> Well, I have no problem to get line rate on 20Gb with a single flow, so
> other drivers have no problem.
> 

OK, good to know this.

> >   3) if accounting skb->truesize is on purpose, does that mean we only
> >      need to tune that value instead of trying to fix our driver (if
> >      there is a way to)?
> 
> The check in TCP allows for two packets at least, unless a single skb
> truesize is 128K ?
> 
> 
> if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
>     set_bit(TSQ_THROTTLED, &tp->tsq_flags);
>     break;
> }
> 
> So if a skb->truesize is 100K, this condition allows two packets, before
> throttling the third packet.
> 

OK. I need to check why we're getting only 1 then.

Thanks for your reply.

Wei.

> Its actually hard to account for skb->len, because sk_wmem_alloc
> accounts for skb->truesize : I do not want to add another
> sk->sk_wbytes_alloc new atomic field.
> 
> 

^ permalink raw reply

* Re: [PATCH net] bnx2x: Restore a call to config_init
From: Eric Dumazet @ 2013-09-06 13:08 UTC (permalink / raw)
  To: eilong; +Cc: David Miller, netdev, Dave Jones
In-Reply-To: <1378461302.15758.2.camel@lb-tlvb-eilong.il.broadcom.com>

On Fri, 2013-09-06 at 12:55 +0300, Eilon Greenstein wrote:
> Commit c0a77ec74f295013d7ba3204dd3ed25fccf83cb4 'bnx2x: Add missing braces in
> bnx2x:bnx2x_link_initialize' identified indentation problem, but resolved it
> by adding braces instead of fixing the indentation. The braces now prevents a
> config_init call in some cases, though it should be called regardless of that
> condition. This patch removes the braces and fix the confusing indentation
> that caused this mess.
> 
> Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
> CC: Dave Jones <davej@redhat.com>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
> index 6645684..d60a2ea 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
> @@ -6501,13 +6501,10 @@ static int bnx2x_link_initialize(struct link_params *params,
>  		struct bnx2x_phy *phy = &params->phy[INT_PHY];
>  		if (vars->line_speed == SPEED_AUTO_NEG &&
>  		    (CHIP_IS_E1x(bp) ||
> -		     CHIP_IS_E2(bp))) {
> +		     CHIP_IS_E2(bp)))
>  			bnx2x_set_parallel_detection(phy, params);
> -			if (params->phy[INT_PHY].config_init)
> -				params->phy[INT_PHY].config_init(phy,
> -								 params,
> -								 vars);
> -		}
> +		if (params->phy[INT_PHY].config_init)
> +			params->phy[INT_PHY].config_init(phy, params, vars);
>  	}
>  
>  	/* Init external phy*/

Thanks 

I wish people, including Dave Jones, clearly states that a patch was not
actually tested on real hardware.

Tested-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Re: GSO/GRO and UDP performance
From: Eric Dumazet @ 2013-09-06 13:07 UTC (permalink / raw)
  To: James Yonan; +Cc: netdev
In-Reply-To: <52299EDD.1030208@openvpn.net>

On Fri, 2013-09-06 at 03:22 -0600, James Yonan wrote:

> So I think that playing well with GSO/GRO is essential to get speedup in 
> UDP apps because of this 43x multiplier.
> 

Thats not true. GRO cannot aggregate more than 16+1 packets.

I think we cannot aggregate UDP packets, because UDP lacks sequence
numbers, so reorders would be a problem.

You really need something that is not UDP generic.

^ permalink raw reply

* Re: [PATCH net-next] net: add documentation for BQL helpers
From: Eric Dumazet @ 2013-09-06 13:03 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, davem
In-Reply-To: <1378465119-1894-1-git-send-email-f.fainelli@gmail.com>

On Fri, 2013-09-06 at 11:58 +0100, Florian Fainelli wrote:
> Provide a kernel-doc comment documentation for the BQL helpers:
> - netdev_sent_queue
> - netdev_completed_queue
> - netdev_reset_queue
> 
> Similarly to how it is done for the other functions, the documentation
> only covers the function operating on struct net_device and not struct
> netdev_queue.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
>  include/linux/netdevice.h | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 8ed4ae9..ac36629 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2101,6 +2101,16 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue,
>  #endif
>  }
>  
> +/**
> + * 	netdev_sent_queue - report the number of bytes queued to hardware
> + * 	@dev: network device
> + * 	@bytes: number of bytes queued to the hardware device queue
> + *
> + * 	Report the number of bytes queued for sending/completion to the network
> + * 	device hardware queue. @bytes should specify the number of bytes which
> + * 	will be sent over the physical medium (without prepended/appended
> + * 	control blocks, FCS...)

There is no such requirement.

@bytes should be a good approximation, and should match
netdev_completed_queue() @bytes


If you think of TSO, we know that skb->len does not exactly matches
number of bytes on physical medium ( check qdisc_pkt_len_init() to see
how Qdisc layer tries to get better estimation )


> + */
>  static inline void netdev_sent_queue(struct net_device *dev, unsigned int bytes)
>  {
>  	netdev_tx_sent_queue(netdev_get_tx_queue(dev, 0), bytes);
> @@ -2130,6 +2140,16 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue,
>  #endif
>  }
>  
> +/**
> + * 	netdev_completed_queue - report bytes and packets completed by device
> + * 	@dev: network device
> + * 	@pkts: actual number of packets sent over the medium
> + * 	@bytes: actual number of bytes sent over the medium
> + *
> + * 	Report the number of bytes and packets transmitted by the network device
> + * 	hardware queue over the physical medium (without prepended/appended
> + * 	control blocks, FCS...)
> + */

^ permalink raw reply

* Re: TSQ accounting skb->truesize degrades throughput for large packets
From: Eric Dumazet @ 2013-09-06 12:57 UTC (permalink / raw)
  To: Wei Liu; +Cc: Jonathan Davies, Ian Campbell, netdev, xen-devel
In-Reply-To: <20130906101635.GI14104@zion.uk.xensource.com>

On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> Hi Eric
> 
> I have some questions regarding TSQ and I hope you can shed some light
> on this.
> 
> Our observation is that with the default TSQ limit (128K), throughput
> for Xen network driver for large packets degrades. That's because we now
> only have 1 packet in queue.
> 
> I double-checked that skb->len is indeed <64K. Then I discovered that
> TSQ actually accounts for skb->truesize and the packets generated had
> skb->truesize > 64K which effectively prevented us from putting 2
> packets in queue.
> 
> There seems to be no way to limit skb->truesize inside driver -- the skb
> is already constructed when it comes to xen-netfront.
> 

What is the skb->truesize value then ? It must be huge, and its clearly
a problem, because the tcp _receiver_ will also grow its window slower,
if packet is looped back.

> My questions are:
>   1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
>      accounts skb truesize, including skb overhead. But thats OK", I
>      don't quite understand why it is OK.
>   2) presumably other drivers will suffer from this as well, is it
>      possible to account for skb->len instead of skb->truesize?

Well, I have no problem to get line rate on 20Gb with a single flow, so
other drivers have no problem.

>   3) if accounting skb->truesize is on purpose, does that mean we only
>      need to tune that value instead of trying to fix our driver (if
>      there is a way to)?

The check in TCP allows for two packets at least, unless a single skb
truesize is 128K ?


if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
    set_bit(TSQ_THROTTLED, &tp->tsq_flags);
    break;
}

So if a skb->truesize is 100K, this condition allows two packets, before
throttling the third packet.

Its actually hard to account for skb->len, because sk_wmem_alloc
accounts for skb->truesize : I do not want to add another
sk->sk_wbytes_alloc new atomic field.

^ permalink raw reply

* Re: [PATCH net-next v4 1/6] bonding: simplify and use RCU protection for 3ad xmit path
From: Nikolay Aleksandrov @ 2013-09-06 11:59 UTC (permalink / raw)
  To: Ding Tianhong
  Cc: Jay Vosburgh, Andy Gospodarek, David S. Miller, Veaceslav Falico,
	Netdev
In-Reply-To: <52298407.9040103@huawei.com>

On 09/06/2013 09:28 AM, Ding Tianhong wrote:
> The commit 278b20837511776dc9d5f6ee1c7fabd5479838bb
> (bonding: initial RCU conversion) has convert the roundrobin, active-backup,
> broadcast and xor xmit path to rcu protection, the performance will be better
> for these mode, so this time, convert xmit path for 3ad mode.
> 
> Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
> Suggested-by: Veaceslav Falico <vfalico@redhat.com>
> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
> Signed-off-by: Wang Yufen <wangyufen@huawei.com>
> Cc: Nikolay Aleksandrov <nikolay@redhat.com>
> Cc: Veaceslav Falico <vfalico@redhat.com>
> ---
>  drivers/net/bonding/bond_3ad.c | 32 ++++++++++++++------------------
>  drivers/net/bonding/bonding.h  | 33 ++++++++++++++++++++++++++++++++-
>  2 files changed, 46 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
> index 0d8f427..13f1deb 100644
> --- a/drivers/net/bonding/bond_3ad.c
> +++ b/drivers/net/bonding/bond_3ad.c
> @@ -143,7 +143,7 @@ static inline struct bonding *__get_bond_by_port(struct port *port)
>   */
>  static inline struct port *__get_first_port(struct bonding *bond)
>  {
> -	struct slave *first_slave = bond_first_slave(bond);
> +	struct slave *first_slave = bond_first_slave_rcu(bond);
>  
>  	return first_slave ? &(SLAVE_AD_INFO(first_slave).port) : NULL;
>  }
> @@ -163,7 +163,7 @@ static inline struct port *__get_next_port(struct port *port)
>  	// If there's no bond for this port, or this is the last slave
>  	if (bond == NULL)
>  		return NULL;
> -	slave_next = bond_next_slave(bond, slave);
> +	slave_next = bond_next_slave_rcu(bond, slave);
>  	if (!slave_next || bond_is_first_slave(bond, slave_next))
>  		return NULL;
>  
> @@ -2417,16 +2417,14 @@ int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info)
>  
>  int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
>  {
> -	struct slave *slave, *start_at;
>  	struct bonding *bond = netdev_priv(dev);
> +	struct slave *slave;
>  	int slave_agg_no;
>  	int slaves_in_agg;
>  	int agg_id;
> -	int i;
>  	struct ad_info ad_info;
>  	int res = 1;
>  
> -	read_lock(&bond->lock);
>  	if (__bond_3ad_get_active_agg_info(bond, &ad_info)) {
>  		pr_debug("%s: Error: __bond_3ad_get_active_agg_info failed\n",
>  			 dev->name);
> @@ -2444,13 +2442,17 @@ int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
>  
>  	slave_agg_no = bond->xmit_hash_policy(skb, slaves_in_agg);
>  
> -	bond_for_each_slave(bond, slave) {
> +	bond_for_each_slave_rcu(bond, slave) {
>  		struct aggregator *agg = SLAVE_AD_INFO(slave).port.aggregator;
>  
>  		if (agg && (agg->aggregator_identifier == agg_id)) {
> -			slave_agg_no--;
> -			if (slave_agg_no < 0)
> -				break;
> +			if (--slave_agg_no < 0) {
> +				if (SLAVE_IS_OK(slave)) {
> +					res = bond_dev_queue_xmit(bond,
> +						skb, slave->dev);
> +					goto out;
> +				}
> +			}
>  		}
>  	}
>  
> @@ -2460,23 +2462,17 @@ int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
>  		goto out;
>  	}
>  
> -	start_at = slave;
> -
> -	bond_for_each_slave_from(bond, slave, i, start_at) {
> -		int slave_agg_id = 0;
> +	bond_for_each_slave_rcu(bond, slave) {
>  		struct aggregator *agg = SLAVE_AD_INFO(slave).port.aggregator;
>  
> -		if (agg)
> -			slave_agg_id = agg->aggregator_identifier;
> -
> -		if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
> +		if (SLAVE_IS_OK(slave) && agg &&
> +			agg->aggregator_identifier == agg_id) {
>  			res = bond_dev_queue_xmit(bond, skb, slave->dev);
>  			break;
>  		}
>  	}
>  
>  out:
> -	read_unlock(&bond->lock);
>  	if (res) {
>  		/* no suitable interface, frame not sent */
>  		kfree_skb(skb);
> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> index f7ab161..419161d 100644
> --- a/drivers/net/bonding/bonding.h
> +++ b/drivers/net/bonding/bonding.h
> @@ -74,13 +74,35 @@
>  /* slave list primitives */
>  #define bond_to_slave(ptr) list_entry(ptr, struct slave, list)
>  
> +/* slave list primitives, Caller must hold rcu_read_lock */
> +#define bond_to_slave_rcu(ptr) list_entry_rcu(ptr, struct slave, list)
> +
> +/* bond_is_empty return NULL if slave list is empty*/
> +#define bond_is_empty(bond) \
> +	(list_empty(&(bond)->slave_list))
> +
> +/* bond_is_empty_rcu return NULL if slave list is empty*/
> +#define bond_is_empty_rcu(bond) \
> +	(!list_first_or_null_rcu(&(bond)->slave_list, struct slave, list))
> +
>  /* IMPORTANT: bond_first/last_slave can return NULL in case of an empty list */
>  #define bond_first_slave(bond) \
>  	list_first_entry_or_null(&(bond)->slave_list, struct slave, list)
>  #define bond_last_slave(bond) \
> -	(list_empty(&(bond)->slave_list) ? NULL : \
> +	(bond_is_empty(bond) ? NULL : \
>  					   bond_to_slave((bond)->slave_list.prev))
>  
> +/**
> + * IMPORTANT: bond_first/last_slave_rcu can return NULL in case of an empty list
> + * Caller must hold rcu_read_lock
> + */
> +#define bond_first_slave_rcu(bond) \
> +	(bond_is_empty_rcu(bond) ? NULL : \
> +					bond_to_slave_rcu((bond)->slave_list.next))
> +#define bond_last_slave_rcu(bond) \
> +	(bond_is_empty_rcu(bond) ? NULL : \
> +					bond_to_slave_rcu((bond)->slave_list.prev))
> +
This still has the bug that you and Veaceslav were discussing earlier.
To be honest, I'm getting tired of these fast re-posts without any actual
changes, it really is really starting to get on my nerves. Please before posting
the next version take some time (more time) re-think it, go over it more times,
don't go with the first thing that comes to mind without thinking it through well.

Now to be specific here for the Nth time:
 You _can't_ do this sequence:
  if (list_first_or_null_rcu()) -> rcu_dereference(first/last element) because
between the check which actually dereferences it and the second dereference the
first/last element might be long gone. You should use the result of
list_first_or_null_rcu.

Cheers,
 Nik

>  #define bond_is_first_slave(bond, pos) ((pos)->list.prev == &(bond)->slave_list)
>  #define bond_is_last_slave(bond, pos) ((pos)->list.next == &(bond)->slave_list)
>  
> @@ -93,6 +115,15 @@
>  	(bond_is_first_slave(bond, pos) ? bond_last_slave(bond) : \
>  					  bond_to_slave((pos)->list.prev))
>  
> +/* Since bond_first/last_slave_rcu can return NULL, these can return NULL too */
> +#define bond_next_slave_rcu(bond, pos) \
> +	(bond_is_last_slave(bond, pos) ? bond_first_slave_rcu(bond) : \
> +					 bond_to_slave_rcu((pos)->list.next))
> +
P.S. Either I'm getting paranoid or I think there's the same bug here, namely:
say list.next != slave_list, but at the time of dereferencing list.next
bond_to_slave_rcu() can get slave_list if it was changed.

> +#define bond_prev_slave_rcu(bond, pos) \
> +	(bond_is_first_slave(bond, pos) ? bond_last_slave_rcu(bond) : \
> +					  bond_to_slave_rcu((pos)->list.prev))
> +
>  /**
>   * bond_for_each_slave_from - iterate the slaves list from a starting point
>   * @bond:	the bond holding this list.
> 

^ permalink raw reply

* Re: [PATCH net-next v4 5/6] bonding: restructure and add rcu for bond_for_each_slave_next()
From: Nikolay Aleksandrov @ 2013-09-06 11:59 UTC (permalink / raw)
  To: Ding Tianhong
  Cc: Jay Vosburgh, Andy Gospodarek, David S. Miller, Veaceslav Falico,
	Netdev
In-Reply-To: <5229841E.4020008@huawei.com>

On 09/06/2013 09:28 AM, Ding Tianhong wrote:
> Restructure the bond_for_each_slave(), remove the checking for bond->slave_cnt,
> make the new loop be more simple and racy.
> 
> Add bond_for_each_slave_next_rcu() for next patch use.
> 
> Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
> Suggested-by: Veaceslav Falico <vfalico@redhat.com>
> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
> Cc: Nikolay Aleksandrov <nikolay@redhat.com>
> Cc: Veaceslav Falico <vfalico@redhat.com>
> ---
>  drivers/net/bonding/bond_alb.c  |  3 +--
>  drivers/net/bonding/bond_main.c |  6 ++----
>  drivers/net/bonding/bonding.h   | 21 +++++++++++++++++----
>  3 files changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
> index 91f179d..c75d383 100644
> --- a/drivers/net/bonding/bond_alb.c
> +++ b/drivers/net/bonding/bond_alb.c
> @@ -383,7 +383,6 @@ static struct slave *rlb_next_rx_slave(struct bonding *bond)
>  {
>  	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>  	struct slave *rx_slave, *slave, *start_at;
> -	int i = 0;
>  
>  	if (bond_info->next_rx_slave)
>  		start_at = bond_info->next_rx_slave;
> @@ -392,7 +391,7 @@ static struct slave *rlb_next_rx_slave(struct bonding *bond)
>  
>  	rx_slave = NULL;
>  
> -	bond_for_each_slave_from(bond, slave, i, start_at) {
> +	bond_for_each_slave_from(bond, slave, start_at) {
>  		if (SLAVE_IS_OK(slave)) {
>  			if (!rx_slave) {
>  				rx_slave = slave;
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 39e5b1c..4190389 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -782,7 +782,6 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
>  	struct slave *new_active, *old_active;
>  	struct slave *bestslave = NULL;
>  	int mintime = bond->params.updelay;
> -	int i;
>  
>  	new_active = bond->curr_active_slave;
>  
> @@ -801,7 +800,7 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
>  	/* remember where to stop iterating over the slaves */
>  	old_active = new_active;
>  
> -	bond_for_each_slave_from(bond, new_active, i, old_active) {
> +	bond_for_each_slave_from(bond, new_active, old_active) {
>  		if (new_active->link == BOND_LINK_UP) {
>  			return new_active;
>  		} else if (new_active->link == BOND_LINK_BACK &&
> @@ -2756,7 +2755,6 @@ do_failover:
>  static void bond_ab_arp_probe(struct bonding *bond)
>  {
>  	struct slave *slave, *next_slave;
> -	int i;
>  
>  	read_lock(&bond->curr_slave_lock);
>  
> @@ -2788,7 +2786,7 @@ static void bond_ab_arp_probe(struct bonding *bond)
>  
>  	/* search for next candidate */
>  	next_slave = bond_next_slave(bond, bond->current_arp_slave);
> -	bond_for_each_slave_from(bond, slave, i, next_slave) {
> +	bond_for_each_slave_from(bond, slave, next_slave) {
>  		if (IS_UP(slave->dev)) {
>  			slave->link = BOND_LINK_BACK;
>  			bond_set_slave_active_flags(slave);
> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> index 419161d..d80e146 100644
> --- a/drivers/net/bonding/bonding.h
> +++ b/drivers/net/bonding/bonding.h
> @@ -124,18 +124,31 @@
>  	(bond_is_first_slave(bond, pos) ? bond_last_slave_rcu(bond) : \
>  					  bond_to_slave_rcu((pos)->list.prev))
>  
> +/* Check whether the slave is the only one in bond */
> +
>  /**
>   * bond_for_each_slave_from - iterate the slaves list from a starting point
>   * @bond:	the bond holding this list.
>   * @pos:	current slave.
> - * @cnt:	counter for max number of moves
>   * @start:	starting point.
>   *
>   * Caller must hold bond->lock
>   */
> -#define bond_for_each_slave_from(bond, pos, cnt, start) \
> -	for (cnt = 0, pos = start; pos && cnt < (bond)->slave_cnt; \
> -	     cnt++, pos = bond_next_slave(bond, pos))
> +#define bond_for_each_slave_from(bond, pos, start) \
> +	for (pos = start; pos; (pos = bond_next_slave(bond, pos)) != start ? \
> +		(pos) : (pos = NULL))
> +
> +/**
> + * bond_for_each_slave_from_rcu - iterate the slaves list from a starting point
> + * @bond:	the bond holding this list.
> + * @pos:	current slave.
> + * @start:	starting point.
> + *
> + * Caller must hold rcu_read_lock
> + */
> +#define bond_for_each_slave_from_rcu(bond, pos, start) \
> +	for (pos = start; pos; (pos = bond_next_slave_rcu(bond, pos)) != start ? \
> +		(pos) : (pos = NULL))
>  
The same bug here as before... What happens if start disappears ? - Infinite
loop, pos will never be equal to start again.

>  /**
>   * bond_for_each_slave - iterate over all slaves
> 

^ permalink raw reply

* [PATCH] mlx5: remove unused MLX5_DEBUG param in Kconfig
From: Michael Opdenacker @ 2013-09-06 11:49 UTC (permalink / raw)
  To: eli-VPRAkNaXOzVWk0Htik3J/w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michael Opdenacker

This patch proposes to remove the MLX5_DEBUG kernel configuration
parameter defined in drivers/net/ethernet/mellanox/mlx5/core/Kconfig,
but used nowhere in the makefiles and source code.

This could also be fixed by using this parameter,
but this may be a leftover from driver development...

Signed-off-by: Michael Opdenacker <michael.opdenacker-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8@public.gmane.org>
---
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
index 2196282..157fe8d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
@@ -6,13 +6,3 @@ config MLX5_CORE
 	tristate
 	depends on PCI && X86
 	default n
-
-config MLX5_DEBUG
-	bool "Verbose debugging output" if (MLX5_CORE && EXPERT)
-	depends on MLX5_CORE
-	default y
-	---help---
-	  This option causes debugging code to be compiled into the
-	  mlx5_core driver.  The output can be turned on via the
-	  debug_mask module parameter (which can also be set after
-	  the driver is loaded through sysfs).
-- 
1.8.1.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH net-next] net: add documentation for BQL helpers
From: Florian Fainelli @ 2013-09-06 10:58 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, davem, Florian Fainelli

Provide a kernel-doc comment documentation for the BQL helpers:
- netdev_sent_queue
- netdev_completed_queue
- netdev_reset_queue

Similarly to how it is done for the other functions, the documentation
only covers the function operating on struct net_device and not struct
netdev_queue.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 include/linux/netdevice.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8ed4ae9..ac36629 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2101,6 +2101,16 @@ static inline void netdev_tx_sent_queue(struct netdev_queue *dev_queue,
 #endif
 }
 
+/**
+ * 	netdev_sent_queue - report the number of bytes queued to hardware
+ * 	@dev: network device
+ * 	@bytes: number of bytes queued to the hardware device queue
+ *
+ * 	Report the number of bytes queued for sending/completion to the network
+ * 	device hardware queue. @bytes should specify the number of bytes which
+ * 	will be sent over the physical medium (without prepended/appended
+ * 	control blocks, FCS...)
+ */
 static inline void netdev_sent_queue(struct net_device *dev, unsigned int bytes)
 {
 	netdev_tx_sent_queue(netdev_get_tx_queue(dev, 0), bytes);
@@ -2130,6 +2140,16 @@ static inline void netdev_tx_completed_queue(struct netdev_queue *dev_queue,
 #endif
 }
 
+/**
+ * 	netdev_completed_queue - report bytes and packets completed by device
+ * 	@dev: network device
+ * 	@pkts: actual number of packets sent over the medium
+ * 	@bytes: actual number of bytes sent over the medium
+ *
+ * 	Report the number of bytes and packets transmitted by the network device
+ * 	hardware queue over the physical medium (without prepended/appended
+ * 	control blocks, FCS...)
+ */
 static inline void netdev_completed_queue(struct net_device *dev,
 					  unsigned int pkts, unsigned int bytes)
 {
@@ -2144,6 +2164,13 @@ static inline void netdev_tx_reset_queue(struct netdev_queue *q)
 #endif
 }
 
+/**
+ * 	netdev_reset_queue - reset the packets and bytes count of a network device
+ * 	@dev_queue: network device
+ *
+ * 	Reset the bytes and packet count of a network device and clear the
+ * 	software flow control OFF bit for this network device
+ */
 static inline void netdev_reset_queue(struct net_device *dev_queue)
 {
 	netdev_tx_reset_queue(netdev_get_tx_queue(dev_queue, 0));
-- 
1.8.1.2

^ permalink raw reply related

* RE: [PATCH] openvswitch: Fix alignment of struct sw_flow_key.
From: David Laight @ 2013-09-06 10:18 UTC (permalink / raw)
  To: Jesse Gross, David Miller
  Cc: netdev, dev, Andy Zhou, Fengguang Wu, Geert Uytterhoeven
In-Reply-To: <1378402887-17331-1-git-send-email-jesse@nicira.com>

> -} __aligned(__alignof__(long));
> +} __aligned(8); /* 8 byte alignment ensures this can be accessed as a long */

Don't you really want __aligned(MAX(__alignof__(__be64), sizeof (long))) ?

	David

^ permalink raw reply

* TSQ accounting skb->truesize degrades throughput for large packets
From: Wei Liu @ 2013-09-06 10:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: wei.liu2, Jonathan Davies, Ian Campbell, netdev, xen-devel

Hi Eric

I have some questions regarding TSQ and I hope you can shed some light
on this.

Our observation is that with the default TSQ limit (128K), throughput
for Xen network driver for large packets degrades. That's because we now
only have 1 packet in queue.

I double-checked that skb->len is indeed <64K. Then I discovered that
TSQ actually accounts for skb->truesize and the packets generated had
skb->truesize > 64K which effectively prevented us from putting 2
packets in queue.

There seems to be no way to limit skb->truesize inside driver -- the skb
is already constructed when it comes to xen-netfront.

My questions are:
  1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
     accounts skb truesize, including skb overhead. But thats OK", I
     don't quite understand why it is OK.
  2) presumably other drivers will suffer from this as well, is it
     possible to account for skb->len instead of skb->truesize?
  3) if accounting skb->truesize is on purpose, does that mean we only
     need to tune that value instead of trying to fix our driver (if
     there is a way to)?

Thanks
Wei.

^ permalink raw reply

* [PATCH net] bnx2x: Restore a call to config_init
From: Eilon Greenstein @ 2013-09-06  9:55 UTC (permalink / raw)
  To: David Miller; +Cc: Eilon Greenstein, netdev, Dave Jones

Commit c0a77ec74f295013d7ba3204dd3ed25fccf83cb4 'bnx2x: Add missing braces in
bnx2x:bnx2x_link_initialize' identified indentation problem, but resolved it
by adding braces instead of fixing the indentation. The braces now prevents a
config_init call in some cases, though it should be called regardless of that
condition. This patch removes the braces and fix the confusing indentation
that caused this mess.

Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
CC: Dave Jones <davej@redhat.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
index 6645684..d60a2ea 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
@@ -6501,13 +6501,10 @@ static int bnx2x_link_initialize(struct link_params *params,
 		struct bnx2x_phy *phy = &params->phy[INT_PHY];
 		if (vars->line_speed == SPEED_AUTO_NEG &&
 		    (CHIP_IS_E1x(bp) ||
-		     CHIP_IS_E2(bp))) {
+		     CHIP_IS_E2(bp)))
 			bnx2x_set_parallel_detection(phy, params);
-			if (params->phy[INT_PHY].config_init)
-				params->phy[INT_PHY].config_init(phy,
-								 params,
-								 vars);
-		}
+		if (params->phy[INT_PHY].config_init)
+			params->phy[INT_PHY].config_init(phy, params, vars);
 	}
 
 	/* Init external phy*/
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] Add missing braces to ath10k_pci_tx_pipe_cleanup
From: Kalle Valo @ 2013-09-06  9:32 UTC (permalink / raw)
  To: Dave Jones
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	linville-2XuSBdqkA4R54TAoqtyWWQ,
	ath10k-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <20130905035128.GA15824-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Dave Jones <davej-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:

> The indentation here implies this was meant to be a multi-statement
> if, but it lacks the braces.
>
> Signed-off-by: Dave Jones <davej-rxtnV0ftBwyoClj4AeEUq9i2O/JbrIOy@public.gmane.org>

Thanks, applied.

I added "ath10k:" to the patch title.

-- 
Kalle Valo
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: GSO/GRO and UDP performance
From: James Yonan @ 2013-09-06  9:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1378295631.7360.98.camel@edumazet-glaptop>

On 04/09/2013 05:53, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 04:07 -0600, James Yonan wrote:
>
>> The bundle of UDP packets would traverse the stack as a unit until it
>> reaches the socket layer, where recvmmsg could pass the whole bundle up
>> to userspace in a single transaction (or recvmsg could disaggregate the
>> bundle and pass each datagram individually).
>
> That would require a lot of work, say in netfilter, but also in core
> network stack in forwarding, and all UDP users (L2TP, vxlan).
>
> Very unlikely to happen IMHO.

I agree that aggregating packets by chaining multiple packets into a 
single skb would be too disruptive.

However I believe GSO/GRO provides a potential solution here that would 
be transparent to the core network stack and existing in-kernel UDP users.

GSO/GRO already allows any L4 protocol or lower to define their own 
segmentation and aggregation algorithms, as long as the algorithms are 
lossless.

There's no reason why GSO/GRO couldn't operate on L5 or higher protocols 
if segmentation and aggregation algorithms are provided by a kernel 
module that understands the specific app protocol.

It looks like this could be done with minimal changes to the GSO/GRO 
core.  There would need to be a hook where a kernel module could 
register itself as a GSO/GRO provider for UDP.  It could then perform 
segmentation/aggregation on UDP packets that belong to it.  The dispatch 
to the UDP GSO/GRO providers would be done by the existing offload code 
for UDP, so there would be zero added overhead for non-UDP protocols.

>
> I suspect the performance is coming from aggregation done in user space,
> then re-injected into the kernel ?
>
> You could use a kernel module, using udp_encap_enable() and friends.
>
> Check vxlan_socket_create() for an example

I actually put together a test kernel module using udp_encap_enable to 
see if I could accelerate UDP performance that way.  But even with the 
boost of running in kernel space, the packet processing overhead of 
dealing with 1500 byte packets negates most of the gain, while TCP gets 
a 43x performance boost by being able to aggregate up to 64KB per 
superpacket with GSO/GRO.

So I think that playing well with GSO/GRO is essential to get speedup in 
UDP apps because of this 43x multiplier.

James

^ permalink raw reply

* Re: [PATCH] can: pcan_usb_core: fix memory leak on failure paths in peak_usb_start()
From: Marc Kleine-Budde @ 2013-09-06  9:03 UTC (permalink / raw)
  To: Alexey Khoroshilov
  Cc: Wolfgang Grandegger, linux-can, netdev, linux-kernel, ldv-project,
	s.grosjean@peak-system.com
In-Reply-To: <52285006.3080303@pengutronix.de>

[-- Attachment #1: Type: text/plain, Size: 868 bytes --]

On 09/05/2013 11:33 AM, Marc Kleine-Budde wrote:
> Added Stephane to Cc.
> 
> On 09/04/2013 11:46 PM, Alexey Khoroshilov wrote:
>> Tx and rx urbs are not deallocated if something goes wrong in peak_usb_start().
>> The patch fixes error handling to deallocate all the resources.
>>
>> Found by Linux Driver Verification project (linuxtesting.org).
>>
>> Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
> 
> Stephane, can you have a look at the patch and give your Acked-by.
> 
> Looks good to me.

Applied with Stephane's Acked-by to can/fixes-for-3.12.

Tnx,
Marc
-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]

^ permalink raw reply

* [ANNOUNCE] ipvsadm release v1.27
From: Jesper Dangaard Brouer @ 2013-09-06  8:58 UTC (permalink / raw)
  To: lvs-devel, lvs-users
  Cc: netdev, linux-kernel, Julian Anastasov, Hans Schillstrom,
	Simon Horman, Wensong Zhang, Jesper Dangaard Brouer,
	Ryan O'Hara, Daniel Borkmann, Alexander Frolkin

[-- Attachment #1: Type: text/plain, Size: 3029 bytes --]


We are happy to announce the release of ipvsadm v1.27.

 ipvsadm is a utility to administer the kernels IPVS/LVS load-balancer service

It have been quite a while since the last release of ipvsadm, as v1.26
were released on February 8, 2011.  Since then, the source code
repository have been converted into git.

This is the first released based on the kernel.org git tree:
  https://git.kernel.org/cgit/utils/kernel/ipvsadm/ipvsadm.git/

You can download the tarballs from:
 https://kernel.org/pub/linux/utils/kernel/ipvsadm/

The older releases are also available via:
 http://www.linuxvirtualserver.org/software/ipvs.html#kernel-2.6
 http://www.linuxvirtualserver.org/software/kernel-2.6/

Maintainers:
 Simon Horman <horms@verge.net.au>
 Jesper Dangaard Brouer <brouer@redhat.com>
 Wensong Zhang <wensong@linux-vs.org>

The development have slowed down over the past years, as the tool and
kernel in this area have stabilized.  As can be seen by the shortlog
below, this release mostly contains bug fixes for existing features.

With one exception commit 3141694c (ipvsadm: support for scheduler flags)
by Alexander Frolkin, adds support for a new feature in kernel v3.11
introduced in kernel commit eba3b5a787 (ipvs: SH fallback and L4 hashing).

Shortlog:
---------
Alexander Frolkin (1):
      ipvsadm: support for scheduler flags

Alexander Holler (1):
      libipvs: Fix initialization of netlink (needed for IPv6) when the module ip_vs wasn't loaded.

Daniel Borkmann (1):
      libipvs: libnl3: fix compilation error

Henrique Mecking (1):
      libipvs: Fix memory leak

Jesper Dangaard Brouer (6):
      ipvsadm: fix compiling tool on distros with only libnl-1
      ipvsadm: detect LIBS and CFLAGS for libnl versions
      Make README more generic and reference new kernel.org location
      Add MAINTAINERS style file
      Maintainer script for releasing tarballs
      Release: Version 1.27

Julian Anastasov (4):
      Fix the --pe option checks
      Fallback to libpopt on shared object
      ipvsadm: Fix wrong format of --pe option in FMT_RULE listing
      ipvsadm: init svc in ipvs_get_service

Krzysztof Gajdemski (2):
      ipvsadm: Fix wrong format of -o option in FMT_RULE listing
      ipvsadm: Show 'ops' flag regardless of service persistence

Ryan O'Hara (4):
      ipvsadm: fix list_daemon to handle master/backup status in either position
      ipvsadm: Fix svc->pe_name conditional
      libipvs: Remove redundant CHECK_PE
      libipvs: CHECK_IPV4 and CHECK_PE macros should not call return

Simon Horman (1):
      Linking against libpopt is always requited

Tim Serong (1):
      ipvsadm: Fix buffer overrun in ipvs_dests_parse_cb()

wensong (1):
      exclude the .svn directory and the TAGS file while making distribution

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 230 bytes --]

^ permalink raw reply

* [PATCH net] bnx2x: fix broken compilation with CONFIG_BNX2X_SRIOV is not set
From: Dmitry Kravkov @ 2013-09-06  7:35 UTC (permalink / raw)
  To: netdev, davem; +Cc: eric.dumazet, Dmitry Kravkov, Ariel Elior

Since commit 60cad4e67bd6ff400e7ea61fe762b3042b12ae9d
"bnx2x: VF RSS support - VF side" fails to compile w/o
CONFIG_BNX2X_SRIOV option.
 
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
---
Done with same formatting used in the file. 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
index 2a8c1dc..059f0d4 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
@@ -816,6 +816,8 @@ static inline int bnx2x_vfpf_setup_q(struct bnx2x *bp, struct bnx2x_fastpath *fp
 static inline int bnx2x_vfpf_teardown_queue(struct bnx2x *bp, int qidx) {return 0; }
 static inline int bnx2x_vfpf_config_mac(struct bnx2x *bp, u8 *addr,
 					u8 vf_qid, bool set) {return 0; }
+static inline int bnx2x_vfpf_config_rss(struct bnx2x *bp,
+					struct bnx2x_config_rss_params *params) {return 0; }
 static inline int bnx2x_vfpf_set_mcast(struct net_device *dev) {return 0; }
 static inline int bnx2x_vfpf_storm_rx_mode(struct bnx2x *bp) {return 0; }
 static inline int bnx2x_iov_nic_init(struct bnx2x *bp) {return 0; }
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH net-next v4 6/6] bonding: use RCU protection for alb xmit path
From: Ding Tianhong @ 2013-09-06  7:28 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Nikolay Aleksandrov, Veaceslav Falico, Netdev, Hideaki YOSHIFUJI

The commit 278b20837511776dc9d5f6ee1c7fabd5479838bb
(bonding: initial RCU conversion) has convert the roundrobin, active-backup,
broadcast and xor xmit path to rcu protection, the performance will be better
for these mode, so this time, convert xmit path for alb mode.

Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
---
 drivers/net/bonding/bond_alb.c | 20 +++++++++-----------
 drivers/net/bonding/bonding.h  |  2 +-
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index c75d383..c0bfd56 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -229,7 +229,7 @@ static struct slave *tlb_get_least_loaded_slave(struct bonding *bond)
 	max_gap = LLONG_MIN;
 
 	/* Find the slave with the largest gap */
-	bond_for_each_slave(bond, slave) {
+	bond_for_each_slave_rcu(bond, slave) {
 		if (SLAVE_IS_OK(slave)) {
 			long long gap = compute_gap(slave);
 
@@ -625,12 +625,14 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 {
 	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
 	struct arp_pkt *arp = arp_pkt(skb);
-	struct slave *assigned_slave;
+	struct slave *assigned_slave, *curr_active_slave;
 	struct rlb_client_info *client_info;
 	u32 hash_index = 0;
 
 	_lock_rx_hashtbl(bond);
 
+	curr_active_slave = rcu_dereference(bond->curr_active_slave);
+
 	hash_index = _simple_hash((u8 *)&arp->ip_dst, sizeof(arp->ip_dst));
 	client_info = &(bond_info->rx_hashtbl[hash_index]);
 
@@ -654,9 +656,9 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
 			 * move the old client to primary (curr_active_slave) so
 			 * that the new client can be assigned to this entry.
 			 */
-			if (bond->curr_active_slave &&
-			    client_info->slave != bond->curr_active_slave) {
-				client_info->slave = bond->curr_active_slave;
+			if (curr_active_slave &&
+			    client_info->slave != curr_active_slave) {
+				client_info->slave = curr_active_slave;
 				rlb_update_client(client_info);
 			}
 		}
@@ -1338,8 +1340,6 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 
 	/* make sure that the curr_active_slave do not change during tx
 	 */
-	read_lock(&bond->lock);
-	read_lock(&bond->curr_slave_lock);
 
 	switch (ntohs(skb->protocol)) {
 	case ETH_P_IP: {
@@ -1422,12 +1422,12 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 
 	if (!tx_slave) {
 		/* unbalanced or unassigned, send through primary */
-		tx_slave = bond->curr_active_slave;
+		tx_slave = rcu_dereference(bond->curr_active_slave);
 		bond_info->unbalanced_load += skb->len;
 	}
 
 	if (tx_slave && SLAVE_IS_OK(tx_slave)) {
-		if (tx_slave != bond->curr_active_slave) {
+		if (tx_slave != rcu_dereference(bond->curr_active_slave)) {
 			memcpy(eth_data->h_source,
 			       tx_slave->dev->dev_addr,
 			       ETH_ALEN);
@@ -1442,8 +1442,6 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 		}
 	}
 
-	read_unlock(&bond->curr_slave_lock);
-	read_unlock(&bond->lock);
 	if (res) {
 		/* no suitable interface, frame not sent */
 		kfree_skb(skb);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 80b170a..4282e65 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -536,7 +536,7 @@ static inline struct slave *bond_slave_has_mac(struct bonding *bond,
 {
 	struct slave *tmp;
 
-	bond_for_each_slave(bond, tmp)
+	bond_for_each_slave_rcu(bond, tmp)
 		if (ether_addr_equal_64bits(mac, tmp->dev->dev_addr))
 			return tmp;
 
-- 
1.8.2.1

^ permalink raw reply related

* [PATCH net-next v4 5/6] bonding: restructure and add rcu for bond_for_each_slave_next()
From: Ding Tianhong @ 2013-09-06  7:28 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Nikolay Aleksandrov, Veaceslav Falico, Netdev

Restructure the bond_for_each_slave(), remove the checking for bond->slave_cnt,
make the new loop be more simple and racy.

Add bond_for_each_slave_next_rcu() for next patch use.

Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
---
 drivers/net/bonding/bond_alb.c  |  3 +--
 drivers/net/bonding/bond_main.c |  6 ++----
 drivers/net/bonding/bonding.h   | 21 +++++++++++++++++----
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 91f179d..c75d383 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -383,7 +383,6 @@ static struct slave *rlb_next_rx_slave(struct bonding *bond)
 {
 	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
 	struct slave *rx_slave, *slave, *start_at;
-	int i = 0;
 
 	if (bond_info->next_rx_slave)
 		start_at = bond_info->next_rx_slave;
@@ -392,7 +391,7 @@ static struct slave *rlb_next_rx_slave(struct bonding *bond)
 
 	rx_slave = NULL;
 
-	bond_for_each_slave_from(bond, slave, i, start_at) {
+	bond_for_each_slave_from(bond, slave, start_at) {
 		if (SLAVE_IS_OK(slave)) {
 			if (!rx_slave) {
 				rx_slave = slave;
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 39e5b1c..4190389 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -782,7 +782,6 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
 	struct slave *new_active, *old_active;
 	struct slave *bestslave = NULL;
 	int mintime = bond->params.updelay;
-	int i;
 
 	new_active = bond->curr_active_slave;
 
@@ -801,7 +800,7 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
 	/* remember where to stop iterating over the slaves */
 	old_active = new_active;
 
-	bond_for_each_slave_from(bond, new_active, i, old_active) {
+	bond_for_each_slave_from(bond, new_active, old_active) {
 		if (new_active->link == BOND_LINK_UP) {
 			return new_active;
 		} else if (new_active->link == BOND_LINK_BACK &&
@@ -2756,7 +2755,6 @@ do_failover:
 static void bond_ab_arp_probe(struct bonding *bond)
 {
 	struct slave *slave, *next_slave;
-	int i;
 
 	read_lock(&bond->curr_slave_lock);
 
@@ -2788,7 +2786,7 @@ static void bond_ab_arp_probe(struct bonding *bond)
 
 	/* search for next candidate */
 	next_slave = bond_next_slave(bond, bond->current_arp_slave);
-	bond_for_each_slave_from(bond, slave, i, next_slave) {
+	bond_for_each_slave_from(bond, slave, next_slave) {
 		if (IS_UP(slave->dev)) {
 			slave->link = BOND_LINK_BACK;
 			bond_set_slave_active_flags(slave);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 419161d..d80e146 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -124,18 +124,31 @@
 	(bond_is_first_slave(bond, pos) ? bond_last_slave_rcu(bond) : \
 					  bond_to_slave_rcu((pos)->list.prev))
 
+/* Check whether the slave is the only one in bond */
+
 /**
  * bond_for_each_slave_from - iterate the slaves list from a starting point
  * @bond:	the bond holding this list.
  * @pos:	current slave.
- * @cnt:	counter for max number of moves
  * @start:	starting point.
  *
  * Caller must hold bond->lock
  */
-#define bond_for_each_slave_from(bond, pos, cnt, start) \
-	for (cnt = 0, pos = start; pos && cnt < (bond)->slave_cnt; \
-	     cnt++, pos = bond_next_slave(bond, pos))
+#define bond_for_each_slave_from(bond, pos, start) \
+	for (pos = start; pos; (pos = bond_next_slave(bond, pos)) != start ? \
+		(pos) : (pos = NULL))
+
+/**
+ * bond_for_each_slave_from_rcu - iterate the slaves list from a starting point
+ * @bond:	the bond holding this list.
+ * @pos:	current slave.
+ * @start:	starting point.
+ *
+ * Caller must hold rcu_read_lock
+ */
+#define bond_for_each_slave_from_rcu(bond, pos, start) \
+	for (pos = start; pos; (pos = bond_next_slave_rcu(bond, pos)) != start ? \
+		(pos) : (pos = NULL))
 
 /**
  * bond_for_each_slave - iterate over all slaves
-- 
1.8.2.1

^ permalink raw reply related

* [PATCH net-next v4 4/6] bonding: add rtnl lock for bonding_store_xmit_hash
From: Ding Tianhong @ 2013-09-06  7:28 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Nikolay Aleksandrov, Veaceslav Falico, Netdev

the bonding_store_xmit_hash() could update bond's xmit_policy, and
the xmit_policy is used in xmit path for xor mode, maybe it is hard
to occur any problem, but just follow the logic "don't change anything
slave-related without rtnl", so I think the rntl lock is fit here. :)

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
---
 drivers/net/bonding/bond_sysfs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 0f539de..deb1d8e 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -382,6 +382,9 @@ static ssize_t bonding_store_xmit_hash(struct device *d,
 	int new_value, ret = count;
 	struct bonding *bond = to_bond(d);
 
+	if (!rtnl_trylock())
+		return restart_syscall();
+
 	new_value = bond_parse_parm(buf, xmit_hashtype_tbl);
 	if (new_value < 0)  {
 		pr_err("%s: Ignoring invalid xmit hash policy value %.*s.\n",
@@ -396,6 +399,7 @@ static ssize_t bonding_store_xmit_hash(struct device *d,
 			xmit_hashtype_tbl[new_value].modename, new_value);
 	}
 
+	rtnl_unlock();
 	return ret;
 }
 static DEVICE_ATTR(xmit_hash_policy, S_IRUGO | S_IWUSR,
-- 
1.8.2.1

^ permalink raw reply related

* [PATCH net-next v4 3/6] bonding: replace read_lock to rcu_read_lock for bond_3ad_get_active_agg_info()
From: Ding Tianhong @ 2013-09-06  7:28 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Nikolay Aleksandrov, Veaceslav Falico, Netdev

the bond slave list will protected by rcu, so the read lock should replace by
rcu read lock.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
---
 drivers/net/bonding/bond_3ad.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index c134f43..b678502 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2408,9 +2408,9 @@ int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info)
 {
 	int ret;
 
-	read_lock(&bond->lock);
+	rcu_read_lock();
 	ret = __bond_3ad_get_active_agg_info(bond, ad_info);
-	read_unlock(&bond->lock);
+	rcu_read_unlock();
 
 	return ret;
 }
-- 
1.8.2.1

^ permalink raw reply related

* [PATCH net-next v4 2/6] bonding: remove the no effect lock for bond_3ad_lacpdu_recv()
From: Ding Tianhong @ 2013-09-06  7:28 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Nikolay Aleksandrov, Veaceslav Falico, Netdev

There is no pointer needed read lock protection, remove the unnecessary lock
and improve performance for the 3ad recv path.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
---
 drivers/net/bonding/bond_3ad.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index 7a3860f..c134f43 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -2494,9 +2494,7 @@ int bond_3ad_lacpdu_recv(const struct sk_buff *skb, struct bonding *bond,
 	if (!lacpdu)
 		return ret;
 
-	read_lock(&bond->lock);
 	ret = bond_3ad_rx_indication(lacpdu, slave, skb->len);
-	read_unlock(&bond->lock);
 	return ret;
 }
 
-- 
1.8.2.1

^ permalink raw reply related

* [PATCH net-next v4 1/6] bonding: simplify and use RCU protection for 3ad xmit path
From: Ding Tianhong @ 2013-09-06  7:28 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Nikolay Aleksandrov, Veaceslav Falico, Netdev

The commit 278b20837511776dc9d5f6ee1c7fabd5479838bb
(bonding: initial RCU conversion) has convert the roundrobin, active-backup,
broadcast and xor xmit path to rcu protection, the performance will be better
for these mode, so this time, convert xmit path for 3ad mode.

Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
Suggested-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
Cc: Veaceslav Falico <vfalico@redhat.com>
---
 drivers/net/bonding/bond_3ad.c | 32 ++++++++++++++------------------
 drivers/net/bonding/bonding.h  | 33 ++++++++++++++++++++++++++++++++-
 2 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index 0d8f427..13f1deb 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -143,7 +143,7 @@ static inline struct bonding *__get_bond_by_port(struct port *port)
  */
 static inline struct port *__get_first_port(struct bonding *bond)
 {
-	struct slave *first_slave = bond_first_slave(bond);
+	struct slave *first_slave = bond_first_slave_rcu(bond);
 
 	return first_slave ? &(SLAVE_AD_INFO(first_slave).port) : NULL;
 }
@@ -163,7 +163,7 @@ static inline struct port *__get_next_port(struct port *port)
 	// If there's no bond for this port, or this is the last slave
 	if (bond == NULL)
 		return NULL;
-	slave_next = bond_next_slave(bond, slave);
+	slave_next = bond_next_slave_rcu(bond, slave);
 	if (!slave_next || bond_is_first_slave(bond, slave_next))
 		return NULL;
 
@@ -2417,16 +2417,14 @@ int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info)
 
 int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
 {
-	struct slave *slave, *start_at;
 	struct bonding *bond = netdev_priv(dev);
+	struct slave *slave;
 	int slave_agg_no;
 	int slaves_in_agg;
 	int agg_id;
-	int i;
 	struct ad_info ad_info;
 	int res = 1;
 
-	read_lock(&bond->lock);
 	if (__bond_3ad_get_active_agg_info(bond, &ad_info)) {
 		pr_debug("%s: Error: __bond_3ad_get_active_agg_info failed\n",
 			 dev->name);
@@ -2444,13 +2442,17 @@ int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
 
 	slave_agg_no = bond->xmit_hash_policy(skb, slaves_in_agg);
 
-	bond_for_each_slave(bond, slave) {
+	bond_for_each_slave_rcu(bond, slave) {
 		struct aggregator *agg = SLAVE_AD_INFO(slave).port.aggregator;
 
 		if (agg && (agg->aggregator_identifier == agg_id)) {
-			slave_agg_no--;
-			if (slave_agg_no < 0)
-				break;
+			if (--slave_agg_no < 0) {
+				if (SLAVE_IS_OK(slave)) {
+					res = bond_dev_queue_xmit(bond,
+						skb, slave->dev);
+					goto out;
+				}
+			}
 		}
 	}
 
@@ -2460,23 +2462,17 @@ int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
 		goto out;
 	}
 
-	start_at = slave;
-
-	bond_for_each_slave_from(bond, slave, i, start_at) {
-		int slave_agg_id = 0;
+	bond_for_each_slave_rcu(bond, slave) {
 		struct aggregator *agg = SLAVE_AD_INFO(slave).port.aggregator;
 
-		if (agg)
-			slave_agg_id = agg->aggregator_identifier;
-
-		if (SLAVE_IS_OK(slave) && agg && (slave_agg_id == agg_id)) {
+		if (SLAVE_IS_OK(slave) && agg &&
+			agg->aggregator_identifier == agg_id) {
 			res = bond_dev_queue_xmit(bond, skb, slave->dev);
 			break;
 		}
 	}
 
 out:
-	read_unlock(&bond->lock);
 	if (res) {
 		/* no suitable interface, frame not sent */
 		kfree_skb(skb);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index f7ab161..419161d 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -74,13 +74,35 @@
 /* slave list primitives */
 #define bond_to_slave(ptr) list_entry(ptr, struct slave, list)
 
+/* slave list primitives, Caller must hold rcu_read_lock */
+#define bond_to_slave_rcu(ptr) list_entry_rcu(ptr, struct slave, list)
+
+/* bond_is_empty return NULL if slave list is empty*/
+#define bond_is_empty(bond) \
+	(list_empty(&(bond)->slave_list))
+
+/* bond_is_empty_rcu return NULL if slave list is empty*/
+#define bond_is_empty_rcu(bond) \
+	(!list_first_or_null_rcu(&(bond)->slave_list, struct slave, list))
+
 /* IMPORTANT: bond_first/last_slave can return NULL in case of an empty list */
 #define bond_first_slave(bond) \
 	list_first_entry_or_null(&(bond)->slave_list, struct slave, list)
 #define bond_last_slave(bond) \
-	(list_empty(&(bond)->slave_list) ? NULL : \
+	(bond_is_empty(bond) ? NULL : \
 					   bond_to_slave((bond)->slave_list.prev))
 
+/**
+ * IMPORTANT: bond_first/last_slave_rcu can return NULL in case of an empty list
+ * Caller must hold rcu_read_lock
+ */
+#define bond_first_slave_rcu(bond) \
+	(bond_is_empty_rcu(bond) ? NULL : \
+					bond_to_slave_rcu((bond)->slave_list.next))
+#define bond_last_slave_rcu(bond) \
+	(bond_is_empty_rcu(bond) ? NULL : \
+					bond_to_slave_rcu((bond)->slave_list.prev))
+
 #define bond_is_first_slave(bond, pos) ((pos)->list.prev == &(bond)->slave_list)
 #define bond_is_last_slave(bond, pos) ((pos)->list.next == &(bond)->slave_list)
 
@@ -93,6 +115,15 @@
 	(bond_is_first_slave(bond, pos) ? bond_last_slave(bond) : \
 					  bond_to_slave((pos)->list.prev))
 
+/* Since bond_first/last_slave_rcu can return NULL, these can return NULL too */
+#define bond_next_slave_rcu(bond, pos) \
+	(bond_is_last_slave(bond, pos) ? bond_first_slave_rcu(bond) : \
+					 bond_to_slave_rcu((pos)->list.next))
+
+#define bond_prev_slave_rcu(bond, pos) \
+	(bond_is_first_slave(bond, pos) ? bond_last_slave_rcu(bond) : \
+					  bond_to_slave_rcu((pos)->list.prev))
+
 /**
  * bond_for_each_slave_from - iterate the slaves list from a starting point
  * @bond:	the bond holding this list.
-- 
1.8.2.1

^ permalink raw reply related

* [PATCH net-next v4 0/6] bonding: Patchset for rcu use in bonding
From: Ding Tianhong @ 2013-09-06  7:27 UTC (permalink / raw)
  To: Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Nikolay Aleksandrov, Veaceslav Falico, Netdev

Hi:

The Patch Set convert the xmit of 3ad and alb mode to use rcu lock.
restructure and add more rcu list function.
add rtnl lock to protect bonding_store_xmit_hash().
remove read lock in bond_3ad_lacpdu_recv().

I test the patch well and no problems found till now.

v1: add 1 patch named remove the no effect lock for bond_3ad_lacpdu_recv().
    get the advice from Nikolay Aleksandrov, and modify some mistake in the code.

v2: accept the Nikolay Aleksandrov's advise, modify the patch 05/06.
    the new bond_for_each_slave_from will not use bond->slave_cnt and int cnt any more,
    it test well.
    move the rcu_dereference call of curr_active_slave after the lock avoid 
    a race condition with the slave removal. 

v3: according to the advise of Veaceslav Falico and David S. Miller, modify the patch 01/06 and patch 05/06.
    add rcu protect for read bond->slave_list.
    simplify the bond_for_each_slave_from() and  bond_for_each_slave_from_rcu()

v4: thanks for Veaceslav Falico's patient guidance, fix the problem of patch 01/06 and
patch 05/06.
    restructure the bond_first_slave_rcu().
    restructure the bond_for_each_slave() and bond_for_each_slave_rcu().

Ding Tianhong (4):
Wang Yufen (1):
Yang Yingliang (1):
  root (6):
  bonding: simplify and use RCU protection for 3ad xmit path
  bonding: remove the no effect lock for bond_3ad_lacpdu_recv()
  bonding: replace read_lock to rcu_read_lock for
  bond_3ad_get_active_agg_info()
  bonding: add rtnl lock for bonding_store_xmit_hash
  bonding: restructure and add rcu for bond_for_each_slave_next()
  bonding: use RCU protection for alb xmit path

 drivers/net/bonding/bond_3ad.c   | 38 ++++++++++++---------------
 drivers/net/bonding/bond_alb.c   | 23 +++++++----------
 drivers/net/bonding/bond_main.c  |  6 ++---
 drivers/net/bonding/bond_sysfs.c |  4 +++
 drivers/net/bonding/bonding.h    | 56 +++++++++++++++++++++++++++++++++++-----
 5 files changed, 82 insertions(+), 45 deletions(-)

-- 
1.8.2.1

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox