Netdev List


* Re: [PATCH net-next] team: add ethtool support
From: Flavio Leitner @ 2012-12-30  1:44 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, Jiri Pirko
In-Reply-To: <20121229172945.25a09fc8@nehalam.linuxnetplumber.net>

On Sat, Dec 29, 2012 at 05:29:45PM -0800, Stephen Hemminger wrote:
> On Sat, 29 Dec 2012 23:19:26 -0200
> Flavio Leitner <fbl@redhat.com> wrote:
> 
> > This patch adds few ethtool operations to team driver.
> > 
> > Signed-off-by: Flavio Leitner <fbl@redhat.com>
> 
> What is the motivation for this? Is there an application that depends
> on ethtool (versus netlink, or /proc)?

Speaking as a support engineer, it's a lot easier to grab ethtool -S and
see everything than grab two or more outputs.

> Sorry, I see no point in providing ethtool statistics for generic data that is already
> reported by existing netlink and other infrastructure. The purpose of ethtool
> statistics is to report device specific that is not available through the normal
> generic statistics.

Right, but those statistics can be device specific as well.  The tg3 and bnx2, for
instance, do the same reporting [rx|tx]_bytes|octets. 

I see no harm, and it is helpful. 

-- 
fbl

^ permalink raw reply

* Re: [PATCH net-next] team: add ethtool support
From: David Miller @ 2012-12-30  1:35 UTC (permalink / raw)
  To: shemminger; +Cc: fbl, netdev, jiri
In-Reply-To: <20121229172945.25a09fc8@nehalam.linuxnetplumber.net>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Sat, 29 Dec 2012 17:29:45 -0800

> On Sat, 29 Dec 2012 23:19:26 -0200
> Flavio Leitner <fbl@redhat.com> wrote:
> 
>> This patch adds few ethtool operations to team driver.
>> 
>> Signed-off-by: Flavio Leitner <fbl@redhat.com>
> 
> What is the motivation for this? Is there an application that depends
> on ethtool (versus netlink, or /proc)?
> 
> Sorry, I see no point in providing ethtool statistics for generic data that is already
> reported by existing netlink and other infrastructure. The purpose of ethtool
> statistics is to report device specific that is not available through the normal
> generic statistics.

Agreed, ethtool stats should _ONLY_ report device specific
statistics.

^ permalink raw reply

* [PATCH net-next] team: implement carrier change
From: Flavio Leitner @ 2012-12-30  1:31 UTC (permalink / raw)
  To: netdev; +Cc: Jiri Pirko, Flavio Leitner

The user space teamd daemon may need to control the
master's carrier state depending on the selected mode.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
---
 drivers/net/team/team.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f711039..14cb843 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1708,6 +1708,15 @@ static netdev_features_t team_fix_features(struct net_device *dev,
 	return features;
 }
 
+static int team_change_carrier(struct net_device *dev, bool new_carrier)
+{
+	if (new_carrier)
+		netif_carrier_on(dev);
+	else
+		netif_carrier_off(dev);
+	return 0;
+}
+
 static const struct net_device_ops team_netdev_ops = {
 	.ndo_init		= team_init,
 	.ndo_uninit		= team_uninit,
@@ -1730,6 +1739,7 @@ static const struct net_device_ops team_netdev_ops = {
 	.ndo_add_slave		= team_add_slave,
 	.ndo_del_slave		= team_del_slave,
 	.ndo_fix_features	= team_fix_features,
+	.ndo_change_carrier     = team_change_carrier,
 };
 
 /***********************
-- 
1.8.0.1

^ permalink raw reply related

* Re: [PATCH net-next] team: add ethtool support
From: Stephen Hemminger @ 2012-12-30  1:29 UTC (permalink / raw)
  To: Flavio Leitner; +Cc: netdev, Jiri Pirko
In-Reply-To: <1356830366-991-1-git-send-email-fbl@redhat.com>

On Sat, 29 Dec 2012 23:19:26 -0200
Flavio Leitner <fbl@redhat.com> wrote:

> This patch adds few ethtool operations to team driver.
> 
> Signed-off-by: Flavio Leitner <fbl@redhat.com>

What is the motivation for this? Is there an application that depends
on ethtool (versus netlink, or /proc)?

Sorry, I see no point in providing ethtool statistics for generic data that is already
reported by existing netlink and other infrastructure. The purpose of ethtool
statistics is to report device-specific data that is not available through the normal
generic statistics.

^ permalink raw reply

* [PATCH net-next] team: add ethtool support
From: Flavio Leitner @ 2012-12-30  1:19 UTC (permalink / raw)
  To: netdev; +Cc: Jiri Pirko, Flavio Leitner

This patch adds a few ethtool operations to the team driver.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
---
 drivers/net/team/team.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index ad86660..f711039 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -28,6 +28,7 @@
 #include <net/genetlink.h>
 #include <net/netlink.h>
 #include <net/sch_generic.h>
+#include <generated/utsrelease.h>
 #include <linux/if_team.h>
 
 #define DRV_NAME "team"
@@ -1731,6 +1732,75 @@ static const struct net_device_ops team_netdev_ops = {
 	.ndo_fix_features	= team_fix_features,
 };
 
+/***********************
+ * ethtool interface
+ ***********************/
+
+static const char ethtool_stats_keys[][ETH_GSTRING_LEN] = {
+	"rx_packets",
+	"rx_bytes",
+	"rx_dropped",
+	"tx_packets",
+	"tx_bytes",
+	"tx_dropped",
+	"multicast",
+};
+
+#define TEAM_NUM_STATS   ARRAY_SIZE(ethtool_stats_keys)
+
+static int team_get_sset_count(struct net_device *netdev, int sset)
+{
+	switch (sset) {
+	case ETH_SS_STATS:
+		return TEAM_NUM_STATS;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static void team_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
+{
+	switch (stringset) {
+	case ETH_SS_STATS:
+		memcpy(data, *ethtool_stats_keys, sizeof(ethtool_stats_keys));
+		break;
+	}
+}
+
+static void team_get_ethtool_stats(struct net_device *netdev,
+				   struct ethtool_stats *stats,
+				   u64 *data)
+{
+	struct rtnl_link_stats64 net_stats;
+	int i;
+
+	memset(&net_stats, 0, sizeof(struct rtnl_link_stats64));
+	team_get_stats64(netdev, &net_stats);
+	i = 0;
+	/* ordering based on ethtool_stats_keys */
+	data[i++] = net_stats.rx_packets;
+	data[i++] = net_stats.rx_bytes;
+	data[i++] = net_stats.rx_dropped;
+	data[i++] = net_stats.tx_packets;
+	data[i++] = net_stats.tx_bytes;
+	data[i++] = net_stats.tx_dropped;
+	data[i++] = net_stats.multicast;
+}
+
+static void team_ethtool_get_drvinfo(struct net_device *dev,
+				     struct ethtool_drvinfo *drvinfo)
+{
+	strncpy(drvinfo->driver, DRV_NAME, 32);
+	strncpy(drvinfo->version, UTS_RELEASE, 32);
+}
+
+static const struct ethtool_ops team_ethtool_ops = {
+	.get_drvinfo		= team_ethtool_get_drvinfo,
+	.get_link		= ethtool_op_get_link,
+	.get_strings		= team_get_strings,
+	.get_ethtool_stats	= team_get_ethtool_stats,
+	.get_sset_count		= team_get_sset_count,
+};
 
 /***********************
  * rt netlink interface
@@ -1780,6 +1850,7 @@ static void team_setup(struct net_device *dev)
 	ether_setup(dev);
 
 	dev->netdev_ops = &team_netdev_ops;
+	dev->ethtool_ops = &team_ethtool_ops;
 	dev->destructor	= team_destructor;
 	dev->tx_queue_len = 0;
 	dev->flags |= IFF_MULTICAST;
-- 
1.8.0.1

^ permalink raw reply related

* Re: [PATCH v2 1/1 net-next] NET: FEC: dynamtic check DMA desc buff type
From: Frank Li @ 2012-12-30  1:01 UTC (permalink / raw)
  To: Lothar Waßmann
  Cc: Frank Li, netdev, s.hauer, shawn.guo, davem, linux-arm-kernel
In-Reply-To: <20701.45196.361432.80622@ipc1.ka-ro>

2012/12/28 Lothar Waßmann <LW@karo-electronics.de>:
> Frank Li writes:
>> MX6 and mx28 support enhanced DMA descript buff to support 1588
>> ptp. But MX25, MX3x, MX5x can't support enhanced DMA descript buff.
>> Check fec type and choose correct DAM descript buff type.
>                                     ^^^
> s/DAM/DMA/
> s/descript/descriptor/g
>
> [...]
>> diff --git a/drivers/net/ethernet/freescale/fec.c b/drivers/net/ethernet/freescale/fec.c
>> index 0704bca..290f91c 100644
>> --- a/drivers/net/ethernet/freescale/fec.c
>> +++ b/drivers/net/ethernet/freescale/fec.c
>> @@ -76,6 +76,8 @@
>>  #define FEC_QUIRK_USE_GASKET         (1 << 2)
>>  /* Controller has GBIT support */
>>  #define FEC_QUIRK_HAS_GBIT           (1 << 3)
>> +/* Controller has extend desc buffer */
>> +#define FEC_QUICK_HAS_BUFDESC_EX     (1 << 4)
>                ^^^^^
> As Sascha has already pointed out, this should be 'QUIRK' rather than
> 'QUICK' (like in the preceeding lines)!
>
>>  static struct platform_device_id fec_devtype[] = {
>>       {
>> @@ -93,7 +95,8 @@ static struct platform_device_id fec_devtype[] = {
>>               .driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_SWAP_FRAME,
>>       }, {
>>               .name = "imx6q-fec",
>> -             .driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT,
>> +             .driver_data = FEC_QUIRK_ENET_MAC | FEC_QUIRK_HAS_GBIT |
>> +                             FEC_QUICK_HAS_BUFDESC_EX,
>                                     ^^^^^
> same as above.
>
> [...]
>> @@ -1574,6 +1617,8 @@ fec_probe(struct platform_device *pdev)
>>       fep->pdev = pdev;
>>       fep->dev_id = dev_id++;
>>
>> +     fep->bufdesc_ex = 0;
>> +
>>       if (!fep->hwp) {
>>               ret = -ENOMEM;
>>               goto failed_ioremap;
>> @@ -1628,19 +1673,19 @@ fec_probe(struct platform_device *pdev)
>>               goto failed_clk;
>>       }
>>
>> -#ifdef CONFIG_FEC_PTP
>>       fep->clk_ptp = devm_clk_get(&pdev->dev, "ptp");
>> +     fep->bufdesc_ex =
>> +             pdev->id_entry->driver_data & FEC_QUICK_HAS_BUFDESC_EX;
>                                                   ^^^^^
> same as above.

Okay, I will fix it after the New Year holiday.

>
>
> Lothar Waßmann
> --
> ___________________________________________________________
>
> Ka-Ro electronics GmbH | Pascalstraße 22 | D - 52076 Aachen
> Phone: +49 2408 1402-0 | Fax: +49 2408 1402-10
> Geschäftsführer: Matthias Kaussen
> Handelsregistereintrag: Amtsgericht Aachen, HRB 4996
>
> www.karo-electronics.de | info@karo-electronics.de
> ___________________________________________________________

^ permalink raw reply

* Re: [patch net-next 01/15] net: introduce upper device lists
From: David Miller @ 2012-12-29 23:31 UTC (permalink / raw)
  To: jiri
  Cc: netdev, edumazet, bhutchings, faisal.latif, shemminger, fbl,
	roland, sean.hefty, hal.rosenstock, fubar, andy, divy,
	jitendra.kalsaria, sony.chacko, linux-driver, kaber, ursula.braun,
	blaschka, schwidefsky, heiko.carstens, ebiederm, joe, amwang,
	nhorman, john.r.fastabend, pablo
In-Reply-To: <1356777522-19652-2-git-send-email-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Sat, 29 Dec 2012 11:38:28 +0100

> +	/*
> +	 * To prevent loops, check if dev is not upper device to upper_dev.
> +	 */

Please use:

	/* To prevent loops, check if dev is not upper device to upper_dev.  */

> +/**
> + * netdev_upper_free_rcu - Frees a upper device list item via the RCU pointer
> + * @entry: the entry's RCU field
> + *
> + * This function is designed to be used as a callback to the call_rcu()
> + * function so that the memory allocated to the netdev upper device list item
> + * can be released safely.
> + */
> +static void netdev_upper_free_rcu(struct rcu_head *entry)
> +{
> +	struct netdev_upper *upper;
> +
> +	upper = container_of(entry, struct netdev_upper, rcu);
> +	kfree(upper);
> +}

Please use kfree_rcu().

Also, since __netdev_has_upper_dev() modifies &search_list inside
of the list traversal loop, I think you really need to use
list_for_each_entry_safe() even though you always append to the
tail of &search_list.

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Andrew Vagin @ 2012-12-29 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <1356815546.21409.5277.camel@edumazet-glaptop>

On Sat, Dec 29, 2012 at 01:12:26PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> 
> > Is it right, that a received window will be less, if packets are not sorted?
> > Looks like a bug.
> 
> Not really a bug.
> 
> TCP is very sensitive to packet reorders. I wont elaborate here as
> its a bit off topic.
> 
> Try to reorders credits/debits on your bank account, I am pretty sure
> you'll lose some money or even get serious troubles.
> 
> Of course, enabling GRO on eth0 would definitely help a bit...
> 
> (once/iff veth driver features are fixed to allow GSO packets being
> forwarded without being segmented again)
> 

Eric, thank you for the help.
I need some time to think. I will ask if new questions come up.

> 
> 

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Andrew Vagin @ 2012-12-29 21:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <1356807516.4102.4.camel@edumazet-laptop>

On Sat, Dec 29, 2012 at 07:58:36PM +0100, Eric Dumazet wrote:
> On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
> 
> > 
> > Please post your new tcpdump then ;)
> > 
> > also post "netstat -s" from root and test ns after your wgets
> 
> Also try following bnx2 patch.
> 
> It should help GRO / TCP coalesce
> 
> bnx2 should be the last driver not using skb head_frag
> 

This patch breaks nothing. I don't know what kind of benefit I should
expect from it :).

FYI:
I forgot to say that I disabled GRO before collecting the tcpdump,
because that makes the tcpdumps from veth and from eth0 easier to compare.

> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
> index a1adfaf..08a2d40 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.c
> +++ b/drivers/net/ethernet/broadcom/bnx2.c
> @@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
>  	rx_pg->page = NULL;
>  }
>  
> +static void bnx2_frag_free(const struct bnx2 *bp, void *data)
> +{
> +	if (bp->rx_frag_size)
> +		put_page(virt_to_head_page(data));
> +	else
> +		kfree(data);
> +}
> +
>  static inline int
>  bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
>  {
> @@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
>  	struct bnx2_rx_bd *rxbd =
>  		&rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)];
>  
> -	data = kmalloc(bp->rx_buf_size, gfp);
> +	if (bp->rx_frag_size)
> +		data = netdev_alloc_frag(bp->rx_frag_size);
> +	else
> +		data = kmalloc(bp->rx_buf_size, gfp);
>  	if (!data)
>  		return -ENOMEM;
>  
> @@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
>  				 bp->rx_buf_use_size,
>  				 PCI_DMA_FROMDEVICE);
>  	if (dma_mapping_error(&bp->pdev->dev, mapping)) {
> -		kfree(data);
> +		bnx2_frag_free(bp, data);
>  		return -EIO;
>  	}
>  
> @@ -3014,9 +3025,9 @@ error:
>  
>  	dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
>  			 PCI_DMA_FROMDEVICE);
> -	skb = build_skb(data, 0);
> +	skb = build_skb(data, bp->rx_frag_size);
>  	if (!skb) {
> -		kfree(data);
> +		bnx2_frag_free(bp, data);
>  		goto error;
>  	}
>  	skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET);
> @@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
>  	/* hw alignment + build_skb() overhead*/
>  	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
>  		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	if (bp->rx_buf_size <= PAGE_SIZE)
> +		bp->rx_frag_size = bp->rx_buf_size;
> +	else
> +		bp->rx_frag_size = 0;
>  	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
>  	bp->rx_ring_size = size;
>  	bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS);
> @@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp)
>  
>  			rx_buf->data = NULL;
>  
> -			kfree(data);
> +			bnx2_frag_free(bp, data);
>  		}
>  		for (j = 0; j < bp->rx_max_pg_ring_idx; j++)
>  			bnx2_free_rx_page(bp, rxr, j);
> diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h
> index 172efbe..11f5dee 100644
> --- a/drivers/net/ethernet/broadcom/bnx2.h
> +++ b/drivers/net/ethernet/broadcom/bnx2.h
> @@ -6804,6 +6804,7 @@ struct bnx2 {
>  
>  	u32			rx_buf_use_size;	/* useable size */
>  	u32			rx_buf_size;		/* with alignment */
> +	u32			rx_frag_size; /* 0 if kmalloced(), or rx_buf_size */
>  	u32			rx_copy_thresh;
>  	u32			rx_jumbo_thresh;
>  	u32			rx_max_ring_idx;
> 
> 

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Eric Dumazet @ 2012-12-29 21:12 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <20121229200848.GA3389@paralelels.com>

On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:

> Is it right, that a received window will be less, if packets are not sorted?
> Looks like a bug.

Not really a bug.

TCP is very sensitive to packet reordering. I won't elaborate here as
it's a bit off topic.

Try reordering credits/debits on your bank account; I am pretty sure
you'll lose some money or even get into serious trouble.

Of course, enabling GRO on eth0 would definitely help a bit...

(once/iff veth driver features are fixed to allow GSO packets being
forwarded without being segmented again)

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Andrew Vagin @ 2012-12-29 21:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <1356812407.21409.5116.camel@edumazet-glaptop>

On Sat, Dec 29, 2012 at 12:20:07PM -0800, Eric Dumazet wrote:
> On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> > On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> > > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > > > On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
> > > > 
> > > > > 
> > > > > Please post your new tcpdump then ;)
> > > > > 
> > > > > also post "netstat -s" from root and test ns after your wgets
> > > > 
> > > > Also try following bnx2 patch.
> > > > 
> > > > It should help GRO / TCP coalesce
> > > > 
> > > > bnx2 should be the last driver not using skb head_frag
> > 
> > I don't have access to the host. I'm going to test your patch tomorrow.
> > Thanks.
> > 
> > > 
> > > And of course, you should make sure all your bnx2 interrupts are handled
> > > by the same cpu.
> > All bnx interrupts are handled on all cpus. They are handled on the same
> > cpu, if a kernel is booted with msi_disable=1.
> > 
> > Is it right, that a received window will be less, if packets are not sorted?
> > Looks like a bug.
> > 
> > I want to say, that probably it works correctly, if packets are sorted.
> > But I think if packets are not sorted, it should work with the same
> > speed, cpu load and memory consumption may be a bit more.
> 
> Without veth, it doesnt really matter that IRQ are spread on multiple
> cpus, because packets are handled in NAPI, and only one cpu runs the
> eth0 NAPI handler at one time.
> 
> But as soon as packets are queued (by netif_rx()) for 'later'
> processing, you can have dramatic performance decrease.
> 
> Thats why you really should make sure IRQ on your eth0 device
> are handled by a single cpu.
> 
> It will help to get better performance in most cases.

I understand this, but such a big difference looks strange to me.

Default configuration (with the bug):
# cat /proc/interrupts  | grep eth0
  68:      10187      10188      10187      10023      10190      10185      10187      10019   PCI-MSI-edge      eth0

> 
> echo 1 >/proc/irq/*/eth0/../smp_affinity

This doesn't help.

I tried echo 0 > /proc/irq/68/smp_affinity_list. This doesn't help either.

> 
> If it doesnt work, you might try instead :
> 
> echo 1 >/proc/irq/default_smp_affinity
> <you might need to reload bnx2 module, or ifdown/ifup eth0 >

This helps, and the bug is not reproduced in this case.

# cat /proc/interrupts  | grep eth0
  68:      60777          0          0          0          0          0          0          0   PCI-MSI-edge      eth0

Thanks.

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Eric Dumazet @ 2012-12-29 20:20 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <20121229200848.GA3389@paralelels.com>

On Sun, 2012-12-30 at 00:08 +0400, Andrew Vagin wrote:
> On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> > On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > > On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
> > > 
> > > > 
> > > > Please post your new tcpdump then ;)
> > > > 
> > > > also post "netstat -s" from root and test ns after your wgets
> > > 
> > > Also try following bnx2 patch.
> > > 
> > > It should help GRO / TCP coalesce
> > > 
> > > bnx2 should be the last driver not using skb head_frag
> 
> I don't have access to the host. I'm going to test your patch tomorrow.
> Thanks.
> 
> > 
> > And of course, you should make sure all your bnx2 interrupts are handled
> > by the same cpu.
> All bnx interrupts are handled on all cpus. They are handled on the same
> cpu, if a kernel is booted with msi_disable=1.
> 
> Is it right, that a received window will be less, if packets are not sorted?
> Looks like a bug.
> 
> I want to say, that probably it works correctly, if packets are sorted.
> But I think if packets are not sorted, it should work with the same
> speed, cpu load and memory consumption may be a bit more.

Without veth, it doesn't really matter that IRQs are spread over multiple
cpus, because packets are handled in NAPI, and only one cpu runs the
eth0 NAPI handler at a time.

But as soon as packets are queued (by netif_rx()) for 'later'
processing, you can see a dramatic performance decrease.

That's why you really should make sure IRQs on your eth0 device
are handled by a single cpu.

It will help to get better performance in most cases.

echo 1 >/proc/irq/*/eth0/../smp_affinity

If it doesn't work, you might try instead:

echo 1 >/proc/irq/default_smp_affinity
<you might need to reload bnx2 module, or ifdown/ifup eth0 >
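
The value written to smp_affinity above is a hexadecimal CPU bitmask, so
"1" selects CPU 0 only. As a minimal sketch of how such a mask is built
(the helper name cpus_to_affinity_mask is hypothetical and does not come
from this thread or from the kernel):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: build an affinity bitmask from a list of CPU
 * numbers. Bit N set means CPU N is allowed to service the interrupt. */
unsigned long cpus_to_affinity_mask(const unsigned *cpus, size_t n)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < n; i++) {
		/* A CPU number must fit in the mask word. */
		assert(cpus[i] < 8 * sizeof(mask));
		mask |= 1UL << cpus[i];
	}
	return mask;
}
```

Printed with "%lx", the mask for CPU 0 alone is 1, and the mask for
CPUs 0 through 7 is ff.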

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Andrew Vagin @ 2012-12-29 20:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <1356810062.21409.4991.camel@edumazet-glaptop>

On Sat, Dec 29, 2012 at 11:41:02AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> > On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
> > 
> > > 
> > > Please post your new tcpdump then ;)
> > > 
> > > also post "netstat -s" from root and test ns after your wgets
> > 
> > Also try following bnx2 patch.
> > 
> > It should help GRO / TCP coalesce
> > 
> > bnx2 should be the last driver not using skb head_frag

I don't have access to the host. I'm going to test your patch tomorrow.
Thanks.

> 
> And of course, you should make sure all your bnx2 interrupts are handled
> by the same cpu.
All bnx interrupts are handled on all cpus. They are handled on the same
cpu if the kernel is booted with msi_disable=1.

Is it right that the receive window gets smaller if packets arrive out of
order? Looks like a bug.

I mean that it probably works correctly if packets arrive in order.
But I think that if they do not, it should still run at the same
speed; only cpu load and memory consumption may be a bit higher.

> 
> Or else, packets might be reordered because the way dev_forward_skb()
> works.
> 
> (CPU X gets a bunch of packets from eth0, forward them via netif_rx() in
> the local CPU X queue, NAPI is ended on eth0)
> 
> CPU Y gets a bunch of packets from eth0, forward them via netif_rx() in
> the local CPU Y queue.
> 
> CPU X and Y process their local queue in // -> packets are delivered Out
> of order to TCP stack
> 
> Alternative is to setup RPS on your veth1 device, to force packets being
> delivered/handled by a given cpu
> 
> 
> 
> 
> 

^ permalink raw reply

* Re: Is keepalive behaving as expected in 3.7.0+/net-next?
From: Jamie Gloudon @ 2012-12-29 19:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: rick.jones2, netdev
In-Reply-To: <1356645265.30414.1542.camel@edumazet-glaptop>

On Thu, Dec 27, 2012 at 01:54:25PM -0800, Eric Dumazet wrote:
> On Fri, 2012-12-21 at 14:05 -0800, Rick Jones wrote:
> > I was looking to do a bit more documentation clean-up and thought I 
> > would work on the descriptions of the "keepalive" sysctls, but first I 
> > wanted to see if they behaved as the existing descriptions suggested:
> > 
> > > tcp_keepalive_time - INTEGER
> > >         How often TCP sends out keepalive messages when keepalive is enabled.
> > >         Default: 2hours.
> > >
> > > tcp_keepalive_probes - INTEGER
> > >         How many keepalive probes TCP sends out, until it decides that the
> > >         connection is broken. Default value: 9.
> > >
> > > tcp_keepalive_intvl - INTEGER
> > >         How frequently the probes are send out. Multiplied by
> > >         tcp_keepalive_probes it is time to kill not responding connection,
> > >         after probes started. Default value: 75sec i.e. connection
> > >         will be aborted after ~11 minutes of retries.
> > 
> > I interpreted all that that as:  When a connection is idle, TCP will 
> > send a keepalive probe every tcp_keepalive_time seconds.  If a response 
> > to a keepalive probe is not received, TCP will resend (retransmit) it 
> > every tcp_keepalive_intvl seconds.
> > 
> > However, what I see is that on a connection where the remote is indeed 
> > still there, only the first keepalive probe is sent after 
> > tcp_keepalive_time, and thereafter it is sent every tcp_keepalive_intvl 
> > seconds.
> > 
> > Now, some of this may relate to my being impatient - rather than wait 
> > two hours for the first probe, I set tcp_keepalive_time to 3 seconds, 
> > and tcp_keepalive_intvl to 7 seconds.  I then kicked-off a ./configure 
> > --intervals-enable netperf TCP_RR test with a burst of one and a wait 
> > time of 90 seconds and got the following (trimmed) trace:
> > 
> > 13:43:46.879133 IP netnextraj.43054 > netnextraj2.srvr: Flags [S], seq 
> > 807869796, win 14600, options [mss 1460,sackOK,TS val 133470 ecr 
> > 0,nop,wscale 7], length 0
> > 13:43:46.880091 IP netnextraj2.srvr > netnextraj.43054: Flags [S.], seq 
> > 1522345902, ack 807869797, win 14480, options [mss 1460,sackOK,TS val 
> > 136186 ecr 133470,nop,wscale 4], length 0
> > 13:43:46.880114 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack 
> > 1, win 115, options [nop,nop,TS val 133470 ecr 136186], length 0
> > 13:43:46.880306 IP netnextraj.43054 > netnextraj2.srvr: Flags [P.], seq 
> > 1:11, ack 1, win 115, options [nop,nop,TS val 133470 ecr 136186], length 10
> > 13:43:46.880948 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack 
> > 11, win 905, options [nop,nop,TS val 136187 ecr 133470], length 0
> > 13:43:46.880964 IP netnextraj2.srvr > netnextraj.43054: Flags [P.], seq 
> > 1:11, ack 11, win 905, options [nop,nop,TS val 136187 ecr 133470], length 10
> > 13:43:46.881161 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack 
> > 11, win 115, options [nop,nop,TS val 133470 ecr 136187], length 0
> > 
> > The first probe above comes after 3 seconds - tcp_keepalive_time - at 
> > 13:43:49
> > 
> > 13:43:49.886752 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack 
> > 11, win 115, options [nop,nop,TS val 134222 ecr 136187], length 0
> > 
> > And it does seem to elicit a response:
> > 
> > 13:43:49.887530 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack 
> > 11, win 905, options [nop,nop,TS val 136938 ecr 133470], length 0
> > 
> 
> 
> > Now it starts sending probes every 7 seconds (tcp_keepalive_intvl):
> > 
> > 13:43:56.903576 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack 
> > 11, win 115, options [nop,nop,TS val 135976 ecr 136938], length 0
> > 13:43:56.904480 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack 
> > 11, win 905, options [nop,nop,TS val 138693 ecr 133470], length 0
> > 13:44:03.910744 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack 
> > 11, win 115, options [nop,nop,TS val 137728 ecr 138693], length 0
> > 13:44:03.911623 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack 
> > 11, win 905, options [nop,nop,TS val 140444 ecr 133470], length 0
> > 
> > I;ve deleted the next 9 or so probes...  It continues, and doesn't 
> > terminate the connection, so I assume it was happy with the responses to 
> > the probes.
> > 
> > 13:45:13.990746 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack 
> > 11, win 115, options [nop,nop,TS val 155248 ecr 156213], length 0
> > 13:45:13.991578 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack 
> > 11, win 905, options [nop,nop,TS val 157965 ecr 133470], length 0
> > 
> > Now the next netperf transaction happens:
> > 
> > 13:45:16.879222 IP netnextraj.43054 > netnextraj2.srvr: Flags [P.], seq 
> > 11:21, ack 11, win 115, options [nop,nop,TS val 155970 ecr 157965], 
> > length 10
> > 13:45:16.880033 IP netnextraj2.srvr > netnextraj.43054: Flags [P.], seq 
> > 11:21, ack 21, win 905, options [nop,nop,TS val 158687 ecr 155970], 
> > length 10
> > 13:45:16.880220 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack 
> > 21, win 115, options [nop,nop,TS val 155970 ecr 158687], length 0
> > 
> > But the next keepalive probe is tcp_keepalive_intvl seconds after the 
> > last one, rather than that many, or tcp_keepalive_time seconds after the 
> > connection was last "active."
> > 
> > 13:45:20.998739 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack 
> > 21, win 115, options [nop,nop,TS val 157000 ecr 158687], length 0
> > 13:45:20.999754 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack 
> > 21, win 905, options [nop,nop,TS val 159717 ecr 155970], length 0
> > 13:45:28.006747 IP netnextraj.43054 > netnextraj2.srvr: Flags [.], ack 
> > 21, win 115, options [nop,nop,TS val 158752 ecr 159717], length 0
> > 13:45:28.007624 IP netnextraj2.srvr > netnextraj.43054: Flags [.], ack 
> > 21, win 905, options [nop,nop,TS val 161469 ecr 155970], length 0
> > 
> > Is this the expected behaviour?  If I reverse the values - make 
> > tcp_keepalive_time 7 and tcp_keepalive_intvl 3, it seems that all the 
> > probes are after 7 seconds.
> > 
> > rick jones
> 
> Not sure if it makes sense to have 
> tcp_keepalive_intvl > tcp_keepalive_time
> 
> time should be an order of magnitude bigger than intvl.
> 
> keepalive timer is not reset each time we receive a valid frame, it
> would be very expensive.
> 
> Its a long period timer.
> 
> First interval is tcp_keepalive_time, and subsequent interval are
> tcp_keepalive_intvl
> 
> Each time timer is fired (once every 7200 seconds), we re-arm it with
> the observed elapsed time (keepalive_time_elapsed)
> 
> Fixing this would require to add a timestamp in inet socket, to remember
> time of next/last probe, and firing the timer using
> min(keepalive_time_when(tp), keepalive_intvl_when(tp))
> 
> Probably not worth it.
> 
>

Makes a lot of sense. However, I got the impression from Rick that having tcp_keepalive_intvl > tcp_keepalive_time behaved correctly in older versions of the kernel.

Regards,
Jamie Gloudon
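
The timer behaviour Eric describes can be modelled in a few lines. This
is a simplified sketch, not the kernel source: the function name and
signature are hypothetical, it assumes the connection carries no data,
and it assumes the peer answers every probe immediately, so the time of
last activity equals the time of the last probe.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of a keepalive timer (assumption, not kernel code).
 * Records the times at which probes are sent, up to `horizon` seconds,
 * and returns how many were recorded in `out`. */
size_t keepalive_probe_times(unsigned ka_time, unsigned ka_intvl,
			     unsigned horizon, unsigned *out, size_t max_out)
{
	unsigned last_rx = 0;		/* last time we heard from the peer */
	unsigned fire = ka_time;	/* first expiry: tcp_keepalive_time */
	size_t n = 0;

	assert(ka_time > 0 && ka_intvl > 0);

	while (fire <= horizon && n < max_out) {
		unsigned idle = fire - last_rx;

		if (idle >= ka_time) {
			out[n++] = fire;	/* send a probe */
			last_rx = fire;		/* peer ACKs it right away */
			fire += ka_intvl;	/* re-arm with the interval */
		} else {
			/* Not idle long enough yet: wait for the remainder. */
			fire += ka_time - idle;
		}
	}
	return n;
}
```

Under these assumptions, time=3 and intvl=7 give probes at 3, 10, 17
and 24 seconds, and time=7 and intvl=3 give probes at 7, 14, 21 and 28
seconds. Both match the 7-second spacing Rick reports for the two
configurations.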

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Eric Dumazet @ 2012-12-29 19:41 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <1356807516.4102.4.camel@edumazet-laptop>

On Sat, 2012-12-29 at 19:58 +0100, Eric Dumazet wrote:
> On Saturday, 29 December 2012 at 09:40 -0800, Eric Dumazet wrote:
> 
> > 
> > Please post your new tcpdump then ;)
> > 
> > also post "netstat -s" from root and test ns after your wgets
> 
> Also try following bnx2 patch.
> 
> It should help GRO / TCP coalesce
> 
> bnx2 should be the last driver not using skb head_frag

And of course, you should make sure all your bnx2 interrupts are handled
by the same cpu.

Otherwise, packets might be reordered because of the way
dev_forward_skb() works:

CPU X gets a bunch of packets from eth0 and forwards them via netif_rx()
to the local CPU X queue; NAPI is ended on eth0.

CPU Y gets a bunch of packets from eth0 and forwards them via netif_rx()
to the local CPU Y queue.

CPU X and Y process their local queues in parallel -> packets are
delivered out of order to the TCP stack.

An alternative is to set up RPS on your veth1 device, to force packets
to be delivered/handled by a given cpu.
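
The reordering scenario can be illustrated with a small userspace model. This is only a sketch under simplifying assumptions that are not from the thread: batches of packets land alternately on two per-CPU backlog queues, and "parallel" processing is modelled as draining the queues round-robin, one packet per CPU per turn. The names simulate_delivery() and count_out_of_order() are invented for the example; nothing here is kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PKTS 64

/* Count packets delivered after a higher-numbered packet was already seen. */
static size_t count_out_of_order(const int *seq, size_t n)
{
    size_t ooo = 0;
    int highest = -1;

    for (size_t i = 0; i < n; i++) {
        if (seq[i] < highest)
            ooo++;
        else
            highest = seq[i];
    }
    return ooo;
}

/*
 * Model: packets 0..n-1 arrive in order. Each batch of `batch` packets is
 * appended to the backlog of the next CPU in turn (ncpus is 1 or 2).
 * The backlogs are then drained round-robin, one packet per CPU per turn.
 * The delivery order is written to out[]; returns the number delivered.
 */
static size_t simulate_delivery(size_t n, size_t batch, size_t ncpus, int *out)
{
    int queue[2][MAX_PKTS];
    size_t len[2] = {0, 0}, head[2] = {0, 0}, delivered = 0;

    assert(n <= MAX_PKTS && batch > 0 && (ncpus == 1 || ncpus == 2));

    /* Receive side: consecutive batches alternate between the CPUs. */
    for (size_t i = 0; i < n; i++) {
        size_t cpu = (i / batch) % ncpus;
        queue[cpu][len[cpu]++] = (int)i;
    }

    /* Backlog processing: both CPUs make progress at the same time. */
    while (delivered < n) {
        for (size_t cpu = 0; cpu < ncpus; cpu++)
            if (head[cpu] < len[cpu])
                out[delivered++] = queue[cpu][head[cpu]++];
    }
    return delivered;
}
```

With 8 packets in batches of 4 on two CPUs, the model delivers 0, 4, 1, 5, 2, 6, 3, 7, so three packets arrive behind a later one. With a single CPU the order is preserved, which is consistent with the report that the problem disappears with maxcpus=1.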

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Eric Dumazet @ 2012-12-29 18:58 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <1356802828.21409.4623.camel@edumazet-glaptop>

Le samedi 29 décembre 2012 à 09:40 -0800, Eric Dumazet a écrit :

> 
> Please post your new tcpdump then ;)
> 
> also post "netstat -s" from root and test ns after your wgets

Also try following bnx2 patch.

It should help GRO / TCP coalesce

bnx2 should be the last driver not using skb head_frag


diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index a1adfaf..08a2d40 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -2726,6 +2726,14 @@ bnx2_free_rx_page(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index)
 	rx_pg->page = NULL;
 }
 
+static void bnx2_frag_free(const struct bnx2 *bp, void *data)
+{
+	if (bp->rx_frag_size)
+		put_page(virt_to_head_page(data));
+	else
+		kfree(data);
+}
+
 static inline int
 bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gfp_t gfp)
 {
@@ -2735,7 +2743,10 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
 	struct bnx2_rx_bd *rxbd =
 		&rxr->rx_desc_ring[BNX2_RX_RING(index)][BNX2_RX_IDX(index)];
 
-	data = kmalloc(bp->rx_buf_size, gfp);
+	if (bp->rx_frag_size)
+		data = netdev_alloc_frag(bp->rx_frag_size);
+	else
+		data = kmalloc(bp->rx_buf_size, gfp);
 	if (!data)
 		return -ENOMEM;
 
@@ -2744,7 +2755,7 @@ bnx2_alloc_rx_data(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr, u16 index, gf
 				 bp->rx_buf_use_size,
 				 PCI_DMA_FROMDEVICE);
 	if (dma_mapping_error(&bp->pdev->dev, mapping)) {
-		kfree(data);
+		bnx2_frag_free(bp, data);
 		return -EIO;
 	}
 
@@ -3014,9 +3025,9 @@ error:
 
 	dma_unmap_single(&bp->pdev->dev, dma_addr, bp->rx_buf_use_size,
 			 PCI_DMA_FROMDEVICE);
-	skb = build_skb(data, 0);
+	skb = build_skb(data, bp->rx_frag_size);
 	if (!skb) {
-		kfree(data);
+		bnx2_frag_free(bp, data);
 		goto error;
 	}
 	skb_reserve(skb, ((u8 *)get_l2_fhdr(data) - data) + BNX2_RX_OFFSET);
@@ -5358,6 +5369,10 @@ bnx2_set_rx_ring_size(struct bnx2 *bp, u32 size)
 	/* hw alignment + build_skb() overhead*/
 	bp->rx_buf_size = SKB_DATA_ALIGN(bp->rx_buf_use_size + BNX2_RX_ALIGN) +
 		NET_SKB_PAD + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	if (bp->rx_buf_size <= PAGE_SIZE)
+		bp->rx_frag_size = bp->rx_buf_size;
+	else
+		bp->rx_frag_size = 0;
 	bp->rx_jumbo_thresh = rx_size - BNX2_RX_OFFSET;
 	bp->rx_ring_size = size;
 	bp->rx_max_ring = bnx2_find_max_ring(size, BNX2_MAX_RX_RINGS);
@@ -5436,7 +5451,7 @@ bnx2_free_rx_skbs(struct bnx2 *bp)
 
 			rx_buf->data = NULL;
 
-			kfree(data);
+			bnx2_frag_free(bp, data);
 		}
 		for (j = 0; j < bp->rx_max_pg_ring_idx; j++)
 			bnx2_free_rx_page(bp, rxr, j);
diff --git a/drivers/net/ethernet/broadcom/bnx2.h b/drivers/net/ethernet/broadcom/bnx2.h
index 172efbe..11f5dee 100644
--- a/drivers/net/ethernet/broadcom/bnx2.h
+++ b/drivers/net/ethernet/broadcom/bnx2.h
@@ -6804,6 +6804,7 @@ struct bnx2 {
 
 	u32			rx_buf_use_size;	/* useable size */
 	u32			rx_buf_size;		/* with alignment */
+	u32			rx_frag_size; /* 0 if kmalloced(), or rx_buf_size */
 	u32			rx_copy_thresh;
 	u32			rx_jumbo_thresh;
 	u32			rx_max_ring_idx;

^ permalink raw reply related

* Re: Slow speed of tcp connections in a network namespace
From: Andrew Vagin @ 2012-12-29 18:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <1356802828.21409.4623.camel@edumazet-glaptop>

[-- Attachment #1: Type: text/plain, Size: 1014 bytes --]

On Sat, Dec 29, 2012 at 09:40:28AM -0800, Eric Dumazet wrote:
> On Sat, 2012-12-29 at 18:50 +0400, Andrew Vagin wrote:
> > On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > > > 3.8-rc1 is used for experiments.
> > > > 
> > > > Do you have any ideas where is a problem?
> > > 
> > > veth has absolutely no offload features
> > > 
> > > It needs some care...
> > > 
> > > At the very minimum, let TCP coalesce do its job by allowing SG
> > > 
> > > CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
> > > 
> > > Please try following patch :
> > 
> > Hello Eric,
> > 
> > Thanks for your feedback.
> > 
> > With this patch the result is a bit better (~4MB/s), but it's much less
> > than in the root netns.
> 
> Please post your new tcpdump then ;)

I have rebooted the host and the speed in the netns is again about 1.7MB/s.
I don't know why it was 4MB/s the previous time.

The new tcpdump and netstat outputs are attached.

> 
> also post "netstat -s" from root and test ns after your wgets
> 
> 
> 

[-- Attachment #2: tcpdump.host.gz --]
[-- Type: application/x-gzip, Size: 165716 bytes --]

[-- Attachment #3: tcpdump.netns.host.gz --]
[-- Type: application/x-gzip, Size: 180703 bytes --]

[-- Attachment #4: tcpdump.netns.veth.gz --]
[-- Type: application/x-gzip, Size: 181615 bytes --]

[-- Attachment #5: netstat.host --]
[-- Type: text/plain, Size: 1821 bytes --]

Ip:
    277536 total packets received
    20 forwarded
    0 incoming packets discarded
    202326 incoming packets delivered
    108228 requests sent out
    30 dropped because of missing route
Icmp:
    10 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 5
        echo requests: 5
    6 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 1
        echo replies: 5
IcmpMsg:
        InType3: 5
        InType8: 5
        OutType0: 5
        OutType3: 1
Tcp:
    1491 active connections openings
    12 passive connection openings
    14 failed connection attempts
    72 connection resets received
    2 connections established
    201920 segments received
    107815 segments send out
    0 segments retransmited
    0 bad segments received.
    1338 resets sent
Udp:
    387 packets received
    0 packets to unknown port received.
    0 packet receive errors
    389 packets sent
UdpLite:
TcpExt:
    3 invalid SYN cookies received
    4 TCP sockets finished time wait in fast timer
    63 delayed acks sent
    9 delayed acks further delayed because of locked socket
    236 packets directly queued to recvmsg prequeue.
    38600456 packets directly received from backlog
    298101 packets directly received from prequeue
    163501 packets header predicted
    27103 packets header predicted and directly queued to user
    2578 acknowledgments not containing data received
    1018 predicted acknowledgments
    15 connections reset due to unexpected data
    72 connections reset due to early user close
    TCPRcvCoalesce: 123
    TCPOFOQueue: 1187
IpExt:
    InBcastPkts: 9
    OutBcastPkts: 1
    InOctets: 296395504
    OutOctets: 6965311
    InBcastOctets: 2789
    OutBcastOctets: 165

[-- Attachment #6: netstat.netns --]
[-- Type: text/plain, Size: 1463 bytes --]

Ip:
    25483 total packets received
    0 forwarded
    0 incoming packets discarded
    25483 incoming packets delivered
    14572 requests sent out
Icmp:
    4 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
        echo requests: 2
        echo replies: 2
    4 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        echo request: 2
        echo replies: 2
IcmpMsg:
        InType0: 2
        InType8: 2
        OutType0: 2
        OutType8: 2
Tcp:
    1 active connections openings
    0 passive connection openings
    0 failed connection attempts
    0 connection resets received
    0 connections established
    25473 segments received
    14562 segments send out
    0 segments retransmited
    0 bad segments received.
    77 resets sent
Udp:
    6 packets received
    0 packets to unknown port received.
    0 packet receive errors
    6 packets sent
UdpLite:
TcpExt:
    38 delayed acks sent
    Quick ack mode was activated 2 times
    52 packets directly queued to recvmsg prequeue.
    4916752 packets directly received from backlog
    11584 packets directly received from prequeue
    12538 packets header predicted
    2649 packets header predicted and directly queued to user
    1 acknowledgments not containing data received
    2 DSACKs sent for old packets
    1 connections reset due to unexpected data
    TCPOFOQueue: 1580
IpExt:
    InOctets: 38201843
    OutOctets: 829966

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Eric Dumazet @ 2012-12-29 17:40 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <20121229145030.GA7959@paralelels.com>

On Sat, 2012-12-29 at 18:50 +0400, Andrew Vagin wrote:
> On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > > 3.8-rc1 is used for experiments.
> > > 
> > > Do you have any ideas where is a problem?
> > 
> > veth has absolutely no offload features
> > 
> > It needs some care...
> > 
> > At the very minimum, let TCP coalesce do its job by allowing SG
> > 
> > CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
> > 
> > Please try following patch :
> 
> Hello Eric,
> 
> Thanks for your feedback.
> 
> With this patch the result is a bit better (~4MB/s), but it's much less
> than in the root netns.

Please post your new tcpdump then ;)

also post "netstat -s" from root and test ns after your wgets

^ permalink raw reply

* Re: [Patch RFC] ndisc: Fix skb allocation size for link layer options.
From: Stephan Gatzka @ 2012-12-29 16:37 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: netdev
In-Reply-To: <50DF0C4A.5020401@linux-ipv6.org>

> 
> Disagree.  NDISC_OPT_SPACE() takes care of the size of the nd option header.
Agree. I misunderstood this, my fault.

Maybe it's better to write:

#define NDISC_OPT_SPACE(len) (((len)+sizeof(struct nd_opt_hdr)+7)&~7)

Regards,

Stephan

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Michał Mirosław @ 2012-12-29 16:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andrew Vagin, netdev, vvs, Michał Mirosław
In-Reply-To: <1356789203.21409.3923.camel@edumazet-glaptop>

2012/12/29 Eric Dumazet <eric.dumazet@gmail.com>:
> On Sat, 2012-12-29 at 13:24 +0400, Andrew Vagin wrote:
>> We found a few nodes, where network works slow in containers.
>>
>> For testing speed of TCP connections we use wget, which downloads iso
>> images from the internet.
>>
>> wget in the new netns reports only 1.5 MB/s, but wget in the root netns
>> reports 33MB/s.
>>
>> A few facts:
>>  * Experiments shows that window size for CT traffic does not increases
>>    up to ~900, however for host traffic window size increases up to ~14000
>>  * packets are shuffled in the netns sometimes.
>>  * tso/gro/gso changes on interfaces does not help
>>  * issue was _NOT_ reproduced if kernel booted with maxcpus=1 or bnx2.disable_msi=1
>>
>> I reduced steps to reproduce:
>> * Create a new network namespace "test" and a veth pair.
>>   # ip netns add test
>>   # ip link add name veth0 type veth peer name veth1
>>
>> * Move veth1 into the netns test
>>   # ip link set veth1 netns test
>>
>> * Set ip address on veth1 and proper routing rules are added for this ip
>>   in the root netns.
>>   # ip link set up dev veth0;  ip link set up dev veth0
>>   # ip netns exec test ip a add REMOTE dev veth1
>>   # ip netns exec test ip r a default via veth1
>>   # ip r a REMOTE/32 via dev veth0
>>
>> Tcpdump for both cases are attached to this message.
>> tcpdump.host - wget in the root netns
>> tcpdump.netns.host - tcpdump for the host device, wget in the new netns
>> tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns
>>
>> 3.8-rc1 is used for experiments.
>>
>> Do you have any ideas where is a problem?
>
> veth has absolutely no offload features
>
> It needs some care...
>
> At the very minimum, let TCP coalesce do its job by allowing SG
>
> CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.

veth is just like a tunnel device. In terms of offloads, it can do anything
we have software fallbacks for (in case packets get forwarded to real hardware).

> Please try following patch :
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 95814d9..9fefeb3 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
>         .ndo_set_mac_address = eth_mac_addr,
>  };
>
> +#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO |   \
> +                      NETIF_F_HW_CSUM | NETIF_F_HIGHDMA |              \
> +                      NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
> +
>  static void veth_setup(struct net_device *dev)
>  {
>         ether_setup(dev);
> @@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
>         dev->netdev_ops = &veth_netdev_ops;
>         dev->ethtool_ops = &veth_ethtool_ops;
>         dev->features |= NETIF_F_LLTX;
> +       dev->features |= VETH_FEATURES;
>         dev->destructor = veth_dev_free;
>
> -       dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
> +       dev->hw_features = VETH_FEATURES;
>  }

You missed NETIF_F_RXCSUM in VETH_FEATURES. We might support
NETIF_F_ALL_TSO, not just the IPv4 version.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [Patch RFC] ndisc: Fix skb allocation size for link layer options.
From: YOSHIFUJI Hideaki @ 2012-12-29 15:32 UTC (permalink / raw)
  To: Stephan Gatzka; +Cc: YOSHIFUJI Hideaki, netdev
In-Reply-To: <50DF0C4A.5020401@linux-ipv6.org>

YOSHIFUJI Hideaki wrote:
> Stephan Gatzka wrote:
>> Signed-off-by: Stephan Gatzka <stephan.gatzka@gmail.com>
>> ---
>>  net/ipv6/ndisc.c |    5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
>> index 6574175..b12b94c 100644
>> --- a/net/ipv6/ndisc.c
>> +++ b/net/ipv6/ndisc.c
>> @@ -392,7 +392,7 @@ static struct sk_buff *ndisc_build_skb(struct net_device *dev,
>>  
>>  	len = sizeof(struct icmp6hdr) + (target ? sizeof(*target) : 0);
>>  	if (llinfo)
>> -		len += ndisc_opt_addr_space(dev);
>> +		len += sizeof(struct nd_opt_hdr) + ndisc_opt_addr_space(dev);
>>  
>>  	skb = sock_alloc_send_skb(sk,
>>  				  (MAX_HEADER + sizeof(struct ipv6hdr) +
>> @@ -1424,7 +1424,8 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
>>  			memcpy(ha_buf, neigh->ha, dev->addr_len);
>>  			read_unlock_bh(&neigh->lock);
>>  			ha = ha_buf;
>> -			len += ndisc_opt_addr_space(dev);
>> +			len += sizeof(struct nd_opt_hdr) +
>> +				ndisc_opt_addr_space(dev);
>>  		} else
>>  			read_unlock_bh(&neigh->lock);
>>  
>>
> 
> Disagree.  NDISC_OPT_SPACE() takes care of the size of the nd option header.

Please note:

static inline int ndisc_opt_addr_space(struct net_device *dev)
{
        return NDISC_OPT_SPACE(dev->addr_len + ndisc_addr_option_pad(dev->type));
}

--yoshfuji
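
The point about the option header can be checked with a small userspace sketch. It assumes, as the reply states, that the space macro already adds the 2-byte option header (type and length) before rounding up to a multiple of 8 octets, the unit in which neighbour discovery option lengths are expressed. The DEMO_/demo_ names are stand-ins invented for this example, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nd_opt_hdr: one type byte, one length byte. */
struct demo_nd_opt_hdr {
    unsigned char type;
    unsigned char len;   /* option length in units of 8 octets */
};

/* Space for an option carrying `len` payload bytes: add header, round up to 8. */
#define DEMO_OPT_SPACE(len) \
    (((len) + sizeof(struct demo_nd_opt_hdr) + 7) & ~(size_t)7)

/* Userspace analogue of ndisc_opt_addr_space(): address bytes plus padding. */
static size_t demo_opt_addr_space(size_t addr_len, size_t pad)
{
    size_t space = DEMO_OPT_SPACE(addr_len + pad);

    /* The result is a whole number of 8-octet units and covers the header. */
    assert(space % 8 == 0);
    assert(space >= addr_len + pad + sizeof(struct demo_nd_opt_hdr));
    return space;
}
```

For a 6-byte Ethernet address with no padding this gives 8 bytes (6 + 2 header bytes, already a multiple of 8), so adding sizeof(struct nd_opt_hdr) again at the call site would count the header twice. A 20-byte address with 2 bytes of padding gives 24.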

^ permalink raw reply

* Re: [Patch RFC] ndisc: Fix skb allocation size for link layer options.
From: YOSHIFUJI Hideaki @ 2012-12-29 15:29 UTC (permalink / raw)
  To: Stephan Gatzka; +Cc: netdev, YOSHIFUJI Hideaki
In-Reply-To: <1356702410-32293-1-git-send-email-stephan.gatzka@gmail.com>

Stephan Gatzka wrote:
> Signed-off-by: Stephan Gatzka <stephan.gatzka@gmail.com>
> ---
>  net/ipv6/ndisc.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index 6574175..b12b94c 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -392,7 +392,7 @@ static struct sk_buff *ndisc_build_skb(struct net_device *dev,
>  
>  	len = sizeof(struct icmp6hdr) + (target ? sizeof(*target) : 0);
>  	if (llinfo)
> -		len += ndisc_opt_addr_space(dev);
> +		len += sizeof(struct nd_opt_hdr) + ndisc_opt_addr_space(dev);
>  
>  	skb = sock_alloc_send_skb(sk,
>  				  (MAX_HEADER + sizeof(struct ipv6hdr) +
> @@ -1424,7 +1424,8 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
>  			memcpy(ha_buf, neigh->ha, dev->addr_len);
>  			read_unlock_bh(&neigh->lock);
>  			ha = ha_buf;
> -			len += ndisc_opt_addr_space(dev);
> +			len += sizeof(struct nd_opt_hdr) +
> +				ndisc_opt_addr_space(dev);
>  		} else
>  			read_unlock_bh(&neigh->lock);
>  
> 

Disagree.  NDISC_OPT_SPACE() takes care of the size of the nd option header.

--yoshfuji

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Andrew Vagin @ 2012-12-29 14:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <1356789203.21409.3923.camel@edumazet-glaptop>

On Sat, Dec 29, 2012 at 05:53:23AM -0800, Eric Dumazet wrote:
> > 3.8-rc1 is used for experiments.
> > 
> > Do you have any ideas where is a problem?
> 
> veth has absolutely no offload features
> 
> It needs some care...
> 
> At the very minimum, let TCP coalesce do its job by allowing SG
> 
> CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.
> 
> Please try following patch :

Hello Eric,

Thanks for your feedback.

With this patch the result is a bit better (~4MB/s), but it's much less
than in the root netns.

> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 95814d9..9fefeb3 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_set_mac_address = eth_mac_addr,
>  };
>  
> +#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO |	\
> +		       NETIF_F_HW_CSUM | NETIF_F_HIGHDMA |		\
> +		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
> +
>  static void veth_setup(struct net_device *dev)
>  {
>  	ether_setup(dev);
> @@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
>  	dev->netdev_ops = &veth_netdev_ops;
>  	dev->ethtool_ops = &veth_ethtool_ops;
>  	dev->features |= NETIF_F_LLTX;
> +	dev->features |= VETH_FEATURES;
>  	dev->destructor = veth_dev_free;
>  
> -	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
> +	dev->hw_features = VETH_FEATURES;
>  }
>  
>  /*
> 
> 

^ permalink raw reply

* Re: tc filter ouput, no filter hits?
From: Eric Dumazet @ 2012-12-29 13:55 UTC (permalink / raw)
  To: julius; +Cc: netdev@vger.kernel.org
In-Reply-To: <1356788777.97050.YahooMailNeo@web165006.mail.bf1.yahoo.com>

On Sat, 2012-12-29 at 05:46 -0800, julius wrote:
> hi,
> 
> I've seen some examples where the output from:
> tc -s filter show dev ifb0 parent 1: 
> shows a filter hit count, like the output shown below:
> 
> filter protocol ip pref 1 u32
> filter protocol ip pref 1 u32 fh 800: ht divisor 1
> filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 
> 1:1  (rule hit 2 success 0)
> 
> 
> 
> However, on my OpenWrt 12.09 and on my Ubuntu machine I don't see the part in parentheses. Why is that?
> 
> ubuntu:
> iproute (20120521-3ubuntu1)
> 3.2.0-23-generic #36-Ubuntu
> 

Because it's optional; the kernel needs to be compiled with:

CONFIG_CLS_U32_PERF=y

^ permalink raw reply

* Re: Slow speed of tcp connections in a network namespace
From: Eric Dumazet @ 2012-12-29 13:53 UTC (permalink / raw)
  To: Andrew Vagin; +Cc: netdev, vvs, Michał Mirosław
In-Reply-To: <20121229092417.GA4038@paralelels.com>

On Sat, 2012-12-29 at 13:24 +0400, Andrew Vagin wrote:
> We found a few nodes, where network works slow in containers.
> 
> For testing speed of TCP connections we use wget, which downloads iso
> images from the internet.
> 
> wget in the new netns reports only 1.5 MB/s, but wget in the root netns
> reports 33MB/s.
> 
> A few facts:
>  * Experiments shows that window size for CT traffic does not increases
>    up to ~900, however for host traffic window size increases up to ~14000
>  * packets are shuffled in the netns sometimes.
>  * tso/gro/gso changes on interfaces does not help
>  * issue was _NOT_ reproduced if kernel booted with maxcpus=1 or bnx2.disable_msi=1
> 
> I reduced steps to reproduce:
> * Create a new network namespace "test" and a veth pair.
>   # ip netns add test
>   # ip link add name veth0 type veth peer name veth1
> 
> * Move veth1 into the netns test
>   # ip link set veth1 netns test
> 
> * Set ip address on veth1 and proper routing rules are added for this ip
>   in the root netns.
>   # ip link set up dev veth0;  ip link set up dev veth0
>   # ip netns exec test ip a add REMOTE dev veth1
>   # ip netns exec test ip r a default via veth1
>   # ip r a REMOTE/32 via dev veth0
> 
> Tcpdump for both cases are attached to this message.
> tcpdump.host - wget in the root netns
> tcpdump.netns.host - tcpdump for the host device, wget in the new netns
> tcpdump.netns.veth - tcpdump for the veth1 device, wget in the new netns
> 
> 3.8-rc1 is used for experiments.
> 
> Do you have any ideas where is a problem?

veth has absolutely no offload features

It needs some care...

At the very minimum, let TCP coalesce do its job by allowing SG

CC Michał Mirosław <mirq-linux@rere.qmqm.pl> for insights.

Please try following patch :

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 95814d9..9fefeb3 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -259,6 +259,10 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_set_mac_address = eth_mac_addr,
 };
 
+#define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO |	\
+		       NETIF_F_HW_CSUM | NETIF_F_HIGHDMA |		\
+		       NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX)
+
 static void veth_setup(struct net_device *dev)
 {
 	ether_setup(dev);
@@ -269,9 +273,10 @@ static void veth_setup(struct net_device *dev)
 	dev->netdev_ops = &veth_netdev_ops;
 	dev->ethtool_ops = &veth_ethtool_ops;
 	dev->features |= NETIF_F_LLTX;
+	dev->features |= VETH_FEATURES;
 	dev->destructor = veth_dev_free;
 
-	dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | NETIF_F_RXCSUM;
+	dev->hw_features = VETH_FEATURES;
 }
 
 /*

^ permalink raw reply related
