Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Konrad Rzeszutek Wilk @ 2012-12-18 14:10 UTC (permalink / raw)
  To: Ian Campbell; +Cc: netdev, Sander Eikelenboom, annie li, xen-devel
In-Reply-To: <1355838711-5473-1-git-send-email-ian.campbell@citrix.com>

On Tue, Dec 18, 2012 at 01:51:51PM +0000, Ian Campbell wrote:
> Using RX_COPY_THRESHOLD is incorrect if the SKB is actually smaller
> than that. We have already accounted for this in
> NETFRONT_SKB_CB(skb)->pull_to so use that instead.
> 
> Fixes WARN_ON from skb_try_coalesce.

This should probably be also on the stable tree for 3.7 at least?

> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: Sander Eikelenboom <linux@eikelenboom.it>
  ^^ - Reported-by: 

> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  ^^ - Acked-by:

> Cc: annie li <annie.li@oracle.com>
> Cc: xen-devel@lists.xensource.com
> Cc: netdev@vger.kernel.org
> Cc: stable@kernel.org # 3.7.x only
> ---
>  drivers/net/xen-netfront.c |   15 +++++----------
>  1 files changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index caa0110..b06ef81 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -971,17 +971,12 @@ err:
>  		 * overheads. Here, we add the size of the data pulled
>  		 * in xennet_fill_frags().
>  		 *
> -		 * We also adjust for any unused space in the main
> -		 * data area by subtracting (RX_COPY_THRESHOLD -
> -		 * len). This is especially important with drivers
> -		 * which split incoming packets into header and data,
> -		 * using only 66 bytes of the main data area (see the
> -		 * e1000 driver for example.)  On such systems,
> -		 * without this last adjustement, our achievable
> -		 * receive throughout using the standard receive
> -		 * buffer size was cut by 25%(!!!).
> +		 * We also adjust for the __pskb_pull_tail done in
> +		 * handle_incoming_queue which pulls data from the
> +		 * frags into the head area, which is already
> +		 * accounted in RX_COPY_THRESHOLD.
>  		 */
> -		skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
> +		skb->truesize += skb->data_len - NETFRONT_SKB_CB(skb)->pull_to;
>  		skb->len += skb->data_len;
>  
>  		if (rx->flags & XEN_NETRXF_csum_blank)
> -- 
> 1.7.2.5
> 

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Ian Campbell @ 2012-12-18 14:15 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, annie li,
	xen-devel@lists.xensource.com
In-Reply-To: <20121218141048.GC4518@phenom.dumpdata.com>

On Tue, 2012-12-18 at 14:10 +0000, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 18, 2012 at 01:51:51PM +0000, Ian Campbell wrote:
> > Using RX_COPY_THRESHOLD is incorrect if the SKB is actually smaller
> > than that. We have already accounted for this in
> > NETFRONT_SKB_CB(skb)->pull_to so use that instead.
> > 
> > Fixes WARN_ON from skb_try_coalesce.
> 
> This should probably be also on the stable tree for 3.7 at least?

Yes, hence "Cc: stable@kernel.org # 3.7.x only" below ;-)

> > 
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Sander Eikelenboom <linux@eikelenboom.it>
>   ^^ - Reported-by: 
> 
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>   ^^ - Acked-by:
> 
> > Cc: annie li <annie.li@oracle.com>
> > Cc: xen-devel@lists.xensource.com
> > Cc: netdev@vger.kernel.org
> > Cc: stable@kernel.org # 3.7.x only
> > ---
> >  drivers/net/xen-netfront.c |   15 +++++----------
> >  1 files changed, 5 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> > index caa0110..b06ef81 100644
> > --- a/drivers/net/xen-netfront.c
> > +++ b/drivers/net/xen-netfront.c
> > @@ -971,17 +971,12 @@ err:
> >  		 * overheads. Here, we add the size of the data pulled
> >  		 * in xennet_fill_frags().
> >  		 *
> > -		 * We also adjust for any unused space in the main
> > -		 * data area by subtracting (RX_COPY_THRESHOLD -
> > -		 * len). This is especially important with drivers
> > -		 * which split incoming packets into header and data,
> > -		 * using only 66 bytes of the main data area (see the
> > -		 * e1000 driver for example.)  On such systems,
> > -		 * without this last adjustement, our achievable
> > -		 * receive throughout using the standard receive
> > -		 * buffer size was cut by 25%(!!!).
> > +		 * We also adjust for the __pskb_pull_tail done in
> > +		 * handle_incoming_queue which pulls data from the
> > +		 * frags into the head area, which is already
> > +		 * accounted in RX_COPY_THRESHOLD.
> >  		 */
> > -		skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
> > +		skb->truesize += skb->data_len - NETFRONT_SKB_CB(skb)->pull_to;
> >  		skb->len += skb->data_len;
> >  
> >  		if (rx->flags & XEN_NETRXF_csum_blank)
> > -- 
> > 1.7.2.5
> > 

^ permalink raw reply

* Re: skb_warn_bad_offload with kernel 3.5 (maybe gso/bridge related ?)
From: Yann Dupont @ 2012-12-18 14:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org, Ben Hutchings, Herbert Xu
In-Reply-To: <1343983887.9299.817.camel@edumazet-glaptop>

Le 03/08/2012 10:51, Eric Dumazet a écrit :
>
> As the problem seems more or less gso related, I've deactivated gso two
> days ago. This cure the symptom, running ok since.
>
> Anyone here  seeing this problem ?
>
> Cheers,
>
> I dont know, maybe its more a GRO issue ?
>
> When a NIC delivers skbs with ip_summed set to CHECKSUM_UNNECESSARY,
> should resulting GRO packet have ip_summed set to CHECKSUM_PARTIAL ?
>
> CC Ben and Herbert
>
>


Hello. I'm still seeing this issue with 3.7.0
example  :

[335685.629630] ------------[ cut here ]------------
[335685.629661] WARNING: at net/core/dev.c:1941 
skb_warn_bad_offload+0xb6/0xc1()
[335685.629691] Hardware name: PowerEdge M610
[335685.629720] : caps=(0x0000000000005000, 0x0000000000000000) 
len=12808 data_len=11308 gso_size=1448 gso_type=1 ip_summed=1
[335685.629769] Modules linked in: nfnetlink_log nfnetlink ip6table_raw 
iptable_raw openvswitch veth ebtable_nat ebtables dlm sctp configfs nfsd 
auth_rpcgss nfs_acl nfs lockd fscache sunrpc xt_physdev xt_multiport 
ip6table_filter ip6_tables xt_LOG xt_limit xt_tcpudp xt_state 
iptable_filter ip_tables x_tables nf_conntrack_tftp nf_conntrack_ftp 
nf_conntrack_ipv4 nf_defrag_ipv4 8021q bridge stp llc ext2 mbcache 
dm_round_robin nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack 
dm_multipath scsi_dh ipv6 coretemp kvm iTCO_wdt snd_pcm ioatdma lpc_ich 
snd_page_alloc i7core_edac mfd_core snd_timer edac_core crc32c_intel snd 
soundcore pcspkr dca dcdbas microcode joydev evdev processor hed button 
thermal_sys xfs exportfs btrfs zlib_deflate dm_mod sg sd_mod hid_generic 
usbhid hid ata_generic uhci_hcd ata_piix libata ide_pci_generic lpfc 
ide_core mptsas bnx2x scsi_transport_sas mptscsih mdio mptbase ehci_hcd 
scsi_transport_fc scsi_tgt crc32c scsi_mod libcrc32c bnx2
[335685.630305] Pid: 0, comm: swapper/4 Tainted: G        W 
3.7.0-dsiun-121008 #2
[335685.630348] Call Trace:
[335685.630368]  <IRQ>  [<ffffffff813d1400>] ? 
skb_warn_bad_offload+0x74/0xc1
[335685.630403]  [<ffffffff8103e839>] ? warn_slowpath_common+0x79/0xc0
[335685.630430]  [<ffffffff8103e935>] ? warn_slowpath_fmt+0x45/0x50
[335685.630458]  [<ffffffff813d1442>] ? skb_warn_bad_offload+0xb6/0xc1
[335685.630486]  [<ffffffff81321af6>] ? skb_gso_segment+0x206/0x280
[335685.630513]  [<ffffffff81324ada>] ? dev_hard_start_xmit+0x9a/0x4a0
[335685.630542]  [<ffffffffa0087cde>] ? ipv4_confirm+0xae/0x110 
[nf_conntrack_ipv4]
[335685.630590]  [<ffffffffa13ceeb0>] ? br_parse_ip_options+0x220/0x220 
[bridge]
[335685.630620]  [<ffffffff813403dd>] ? sch_direct_xmit+0xfd/0x1d0
[335685.630647]  [<ffffffff8132529e>] ? dev_queue_xmit+0x16e/0x410
[335685.630679]  [<ffffffffa13c8c62>] ? br_dev_queue_push_xmit+0x72/0xc0 
[bridge]
[335685.630723]  [<ffffffffa13cfb33>] ? br_nf_post_routing+0x223/0x340 
[bridge]
[335685.630754]  [<ffffffffa13c8bf0>] ? deliver_clone+0x60/0x60 [bridge]
[335685.630785]  [<ffffffff8134d50d>] ? nf_iterate+0x8d/0xc0
[335685.630813]  [<ffffffffa13cef30>] ? br_nf_dev_queue_xmit+0x80/0x80 
[bridge]
[335685.630843]  [<ffffffffa13c8bf0>] ? deliver_clone+0x60/0x60 [bridge]
[335685.630871]  [<ffffffff8134d5ae>] ? nf_hook_slow+0x6e/0x130
[335685.630898]  [<ffffffffa13c8bf0>] ? deliver_clone+0x60/0x60 [bridge]
[335685.630927]  [<ffffffffa13c8f20>] ? br_multicast_flood+0x170/0x170 
[bridge]
[335685.630958]  [<ffffffffa13c8f62>] ? br_forward_finish+0x42/0x50 [bridge]
[335685.630988]  [<ffffffffa13cefe9>] ? br_nf_forward_finish+0xb9/0x180 
[bridge]
[335685.631018]  [<ffffffffa13cf7d3>] ? br_nf_forward_ip+0x293/0x3d0 
[bridge]
[335685.631051]  [<ffffffffa13c8f20>] ? br_multicast_flood+0x170/0x170 
[bridge]
[335685.631081]  [<ffffffff8134d50d>] ? nf_iterate+0x8d/0xc0
[335685.631111]  [<ffffffffa13c8f20>] ? br_multicast_flood+0x170/0x170 
[bridge]
[335685.631140]  [<ffffffff8134d5ae>] ? nf_hook_slow+0x6e/0x130
[335685.631168]  [<ffffffffa13c8f20>] ? br_multicast_flood+0x170/0x170 
[bridge]
[335685.631198]  [<ffffffffa13c9000>] ? __br_forward+0x90/0xb0 [bridge]
[335685.631227]  [<ffffffffa13c9e44>] ? 
br_handle_frame_finish+0x214/0x2b0 [bridge]
[335685.631272]  [<ffffffffa13cf31f>] ? 
br_nf_pre_routing_finish+0x14f/0x370 [bridge]
[335685.631317]  [<ffffffffa13d01e2>] ? br_nf_pre_routing+0x3a2/0x650 
[bridge]
[335685.631348]  [<ffffffffa13c9c30>] ? br_handle_local_finish+0x50/0x50 
[bridge]
[335685.631391]  [<ffffffff8134d50d>] ? nf_iterate+0x8d/0xc0
[335685.631419]  [<ffffffffa13c9c30>] ? br_handle_local_finish+0x50/0x50 
[bridge]
[335685.631462]  [<ffffffff8134d5ae>] ? nf_hook_slow+0x6e/0x130
[335685.631514]  [<ffffffffa13c9c30>] ? br_handle_local_finish+0x50/0x50 
[bridge]
[335685.631562]  [<ffffffffa13ca0c0>] ? br_handle_frame+0x1e0/0x280 [bridge]
[335685.631591]  [<ffffffff81323135>] ? __netif_receive_skb+0x215/0x860
[335685.631619]  [<ffffffff81125417>] ? alloc_pages_current+0xb7/0x130
[335685.631648]  [<ffffffff8100a3f5>] ? read_tsc+0x5/0x20
[335685.631677]  [<ffffffff8132390a>] ? netif_receive_skb+0x1a/0x80
[335685.631704]  [<ffffffff81323a60>] ? napi_skb_finish+0x50/0x70
[335685.631735]  [<ffffffffa02456e6>] ? bnx2x_rx_int+0x6a6/0x1500 [bnx2x]
[335685.631765]  [<ffffffffa13c9c30>] ? br_handle_local_finish+0x50/0x50 
[bridge]
[335685.631810]  [<ffffffffa13ca0c0>] ? br_handle_frame+0x1e0/0x280 [bridge]
[335685.632982]  [<ffffffffa02465d3>] ? bnx2x_poll+0x93/0x2b0 [bnx2x]
[335685.633010]  [<ffffffff81323135>] ? __netif_receive_skb+0x215/0x860
[335685.633038]  [<ffffffff813242e8>] ? net_rx_action+0x138/0x240
[335685.633065]  [<ffffffff810469ae>] ? __do_softirq+0xbe/0x1f0
[335685.633092]  [<ffffffff813d5cdc>] ? call_softirq+0x1c/0x30
[335685.633118]  [<ffffffff81004cb5>] ? do_softirq+0x75/0xb0
[335685.633144]  [<ffffffff81046c45>] ? irq_exit+0xa5/0xb0
[335685.633170]  [<ffffffff8100492b>] ? do_IRQ+0x5b/0xd0
[335685.633196]  [<ffffffff813d416d>] ? common_interrupt+0x6d/0x6d
[335685.633222]  <EOI>  [<ffffffff8125204c>] ? intel_idle+0xec/0x160
[335685.633257]  [<ffffffff8125202a>] ? intel_idle+0xca/0x160
[335685.633286]  [<ffffffff812f71bd>] ? cpuidle_idle_call+0x9d/0x240
[335685.633315]  [<ffffffff8100c335>] ? cpu_idle+0x65/0xd0
[335685.633340] ---[ end trace 2142bc9cd23c0d87 ]---


Only seeing this with bridge activated, and with bnx2x
ethtool -K eth2 gso cure the problem.

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr

^ permalink raw reply

* [PATCH 1/3] usbnet: handle PM failure gracefully
From: Oliver Neukum @ 2012-12-18 14:45 UTC (permalink / raw)
  To: netdev, davem; +Cc: Oliver Neukum, Oliver Neukum

If a device fails to do remote wakeup, this is no reason
to abort an open totally. This patch just continues without
runtime PM.

Signed-off-by: Oliver Neukum <oneukum@suse.de>
---
 drivers/net/usb/usbnet.c   |   15 ++++++++-------
 include/linux/usb/usbnet.h |    1 +
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index c04110b..50ed7ab 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -719,7 +719,8 @@ int usbnet_stop (struct net_device *net)
 	dev->flags = 0;
 	del_timer_sync (&dev->delay);
 	tasklet_kill (&dev->bh);
-	if (info->manage_power)
+	if (info->manage_power &&
+	    !test_and_clear_bit(EVENT_NO_RUNTIME_PM, &dev->flags))
 		info->manage_power(dev, 0);
 	else
 		usb_autopm_put_interface(dev->intf);
@@ -794,14 +795,14 @@ int usbnet_open (struct net_device *net)
 	tasklet_schedule (&dev->bh);
 	if (info->manage_power) {
 		retval = info->manage_power(dev, 1);
-		if (retval < 0)
-			goto done_manage_power_error;
-		usb_autopm_put_interface(dev->intf);
+		if (retval < 0) {
+			retval = 0;
+			set_bit(EVENT_NO_RUNTIME_PM, &dev->flags);
+		} else {
+			usb_autopm_put_interface(dev->intf);
+		}
 	}
 	return retval;
-
-done_manage_power_error:
-	clear_bit(EVENT_DEV_OPEN, &dev->flags);
 done:
 	usb_autopm_put_interface(dev->intf);
 done_nopm:
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index 9bbeabf..288b32a 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -69,6 +69,7 @@ struct usbnet {
 #		define EVENT_DEV_ASLEEP 6
 #		define EVENT_DEV_OPEN	7
 #		define EVENT_DEVICE_REPORT_IDLE	8
+#		define EVENT_NO_RUNTIME_PM	9
 };
 
 static inline struct usb_driver *driver_of(struct usb_interface *intf)
-- 
1.7.7

^ permalink raw reply related

* [PATCH 2/3] usbnet: generic manage_power()
From: Oliver Neukum @ 2012-12-18 14:45 UTC (permalink / raw)
  To: netdev, davem; +Cc: Oliver Neukum, Oliver Neukum

Centralise common code for manage_power() in usbnet
by making a generic simple implementation

Signed-off-by: Oliver Neukum <oneukum@suse.de>
---
 drivers/net/usb/usbnet.c   |   10 ++++++++++
 include/linux/usb/usbnet.h |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 50ed7ab..3d4bf01 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -1616,6 +1616,16 @@ void usbnet_device_suggests_idle(struct usbnet *dev)
 }
 EXPORT_SYMBOL(usbnet_device_suggests_idle);
 
+/*
+ * For devices that can do without special commands
+ */
+int usbnet_manage_power(struct usbnet *dev, int on)
+{
+	dev->intf->needs_remote_wakeup = on;
+	return 0;
+}
+EXPORT_SYMBOL(usbnet_manage_power);
+
 /*-------------------------------------------------------------------------*/
 static int __usbnet_read_cmd(struct usbnet *dev, u8 cmd, u8 reqtype,
 			     u16 value, u16 index, void *data, u16 size)
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index 288b32a..bd45eb7 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -241,4 +241,6 @@ extern void usbnet_set_msglevel(struct net_device *, u32);
 extern void usbnet_get_drvinfo(struct net_device *, struct ethtool_drvinfo *);
 extern int usbnet_nway_reset(struct net_device *net);
 
+extern int usbnet_manage_power(struct usbnet *, int);
+
 #endif /* __LINUX_USB_USBNET_H */
-- 
1.7.7

^ permalink raw reply related

* [PATCH 3/3] use generic usbnet_manage_power()
From: Oliver Neukum @ 2012-12-18 14:46 UTC (permalink / raw)
  To: netdev, davem; +Cc: Oliver Neukum, Oliver Neukum

This covers the drivers that can use a primitive
implementation.

Signed-off-by: Oliver Neukum <oneukum@suse.de>
---
 drivers/net/usb/cdc_ether.c |   10 ++--------
 drivers/net/usb/cdc_ncm.c   |   10 ++--------
 2 files changed, 4 insertions(+), 16 deletions(-)

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index d012982..421b7b8 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -457,12 +457,6 @@ int usbnet_cdc_bind(struct usbnet *dev, struct usb_interface *intf)
 }
 EXPORT_SYMBOL_GPL(usbnet_cdc_bind);
 
-static int cdc_manage_power(struct usbnet *dev, int on)
-{
-	dev->intf->needs_remote_wakeup = on;
-	return 0;
-}
-
 static const struct driver_info	cdc_info = {
 	.description =	"CDC Ethernet Device",
 	.flags =	FLAG_ETHER | FLAG_POINTTOPOINT,
@@ -470,7 +464,7 @@ static const struct driver_info	cdc_info = {
 	.bind =		usbnet_cdc_bind,
 	.unbind =	usbnet_cdc_unbind,
 	.status =	usbnet_cdc_status,
-	.manage_power =	cdc_manage_power,
+	.manage_power =	usbnet_manage_power,
 };
 
 static const struct driver_info wwan_info = {
@@ -479,7 +473,7 @@ static const struct driver_info wwan_info = {
 	.bind =		usbnet_cdc_bind,
 	.unbind =	usbnet_cdc_unbind,
 	.status =	usbnet_cdc_status,
-	.manage_power =	cdc_manage_power,
+	.manage_power =	usbnet_manage_power,
 };
 
 /*-------------------------------------------------------------------------*/
diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
index d38bc20..71b6e92 100644
--- a/drivers/net/usb/cdc_ncm.c
+++ b/drivers/net/usb/cdc_ncm.c
@@ -1129,19 +1129,13 @@ static void cdc_ncm_disconnect(struct usb_interface *intf)
 	usbnet_disconnect(intf);
 }
 
-static int cdc_ncm_manage_power(struct usbnet *dev, int status)
-{
-	dev->intf->needs_remote_wakeup = status;
-	return 0;
-}
-
 static const struct driver_info cdc_ncm_info = {
 	.description = "CDC NCM",
 	.flags = FLAG_POINTTOPOINT | FLAG_NO_SETINT | FLAG_MULTI_PACKET,
 	.bind = cdc_ncm_bind,
 	.unbind = cdc_ncm_unbind,
 	.check_connect = cdc_ncm_check_connect,
-	.manage_power = cdc_ncm_manage_power,
+	.manage_power = usbnet_manage_power,
 	.status = cdc_ncm_status,
 	.rx_fixup = cdc_ncm_rx_fixup,
 	.tx_fixup = cdc_ncm_tx_fixup,
@@ -1155,7 +1149,7 @@ static const struct driver_info wwan_info = {
 	.bind = cdc_ncm_bind,
 	.unbind = cdc_ncm_unbind,
 	.check_connect = cdc_ncm_check_connect,
-	.manage_power = cdc_ncm_manage_power,
+	.manage_power = usbnet_manage_power,
 	.status = cdc_ncm_status,
 	.rx_fixup = cdc_ncm_rx_fixup,
 	.tx_fixup = cdc_ncm_tx_fixup,
-- 
1.7.7

^ permalink raw reply related

* TCP delayed ACK heuristic
From: Cong Wang @ 2012-12-18 15:11 UTC (permalink / raw)
  To: netdev
  Cc: Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger,
	Rick Jones, Thomas Graf
In-Reply-To: <270756364.27707018.1355842632348.JavaMail.root@redhat.com>

Hello, TCP experts!

Some time ago, Ben sent a patch [1] to add some knobs for
tuning TCP delayed ACK, but rejected by David.

David's point is that we can do some heuristics for TCP
delayed ACK, so the question is that what kind of heuristics
can we use?

RFC1122 explicitly mentions:

            A TCP SHOULD implement a delayed ACK, but an ACK should not
            be excessively delayed; in particular, the delay MUST be
            less than 0.5 seconds, and in a stream of full-sized
            segments there SHOULD be an ACK for at least every second
            segment.

so this prevents us from using any heuristic for the number
of coalesced delayed ACK.

For the timeout of a delayed ACK, my idea is guessing how many
packet we suppose to receive is the TCP stream is fully utilized,
something like below:

+static inline u32 tcp_expect_packets(struct sock *sk)
+{
+       struct tcp_sock *tp = tcp_sk(sk);
+       int rtt = tp->srtt >> 3;
+       u32 idle = tcp_time_stamp - inet_csk(sk)->icsk_ack.lrcvtime;
+
+       return idle * 2 / rtt;
+}
...
+       ato -= tcp_expect_packets(sk) * delta;

The more we expect, the less we should delay. However this is
not accurate due to congestion control.

Meanwhile, we can also check how many packets are pending in TCP
sending queue, the more we pend, the more we can piggyback with
a single ACK, but not beyond how much we are able to send at
that time.

Comments? Ideas?

Thanks.

1. http://thread.gmane.org/gmane.linux.network/233859

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Eric Dumazet @ 2012-12-18 15:12 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev, Sander Eikelenboom, Konrad Rzeszutek Wilk, annie li,
	xen-devel
In-Reply-To: <1355838711-5473-1-git-send-email-ian.campbell@citrix.com>

On Tue, 2012-12-18 at 13:51 +0000, Ian Campbell wrote:
> Using RX_COPY_THRESHOLD is incorrect if the SKB is actually smaller
> than that. We have already accounted for this in
> NETFRONT_SKB_CB(skb)->pull_to so use that instead.
> 
> Fixes WARN_ON from skb_try_coalesce.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> Cc: Sander Eikelenboom <linux@eikelenboom.it>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: annie li <annie.li@oracle.com>
> Cc: xen-devel@lists.xensource.com
> Cc: netdev@vger.kernel.org
> Cc: stable@kernel.org # 3.7.x only
> ---
>  drivers/net/xen-netfront.c |   15 +++++----------
>  1 files changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index caa0110..b06ef81 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -971,17 +971,12 @@ err:
>  		 * overheads. Here, we add the size of the data pulled
>  		 * in xennet_fill_frags().
>  		 *
> -		 * We also adjust for any unused space in the main
> -		 * data area by subtracting (RX_COPY_THRESHOLD -
> -		 * len). This is especially important with drivers
> -		 * which split incoming packets into header and data,
> -		 * using only 66 bytes of the main data area (see the
> -		 * e1000 driver for example.)  On such systems,
> -		 * without this last adjustement, our achievable
> -		 * receive throughout using the standard receive
> -		 * buffer size was cut by 25%(!!!).
> +		 * We also adjust for the __pskb_pull_tail done in
> +		 * handle_incoming_queue which pulls data from the
> +		 * frags into the head area, which is already
> +		 * accounted in RX_COPY_THRESHOLD.
>  		 */
> -		skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
> +		skb->truesize += skb->data_len - NETFRONT_SKB_CB(skb)->pull_to;
>  		skb->len += skb->data_len;
>  
>  		if (rx->flags & XEN_NETRXF_csum_blank)


But skb truesize is not what you think.

You must account the exact memory used by this skb, not only the used
part of it.

At the very minimum, it should be

skb->truesize += skb->data_len;

But it really should be the allocated size of the fragment.

If its a page, then its a page, even if you use one single byte in it.

^ permalink raw reply

* network namespace and DNS lookups
From: Ravi Aysola @ 2012-12-18 15:09 UTC (permalink / raw)
  To: netdev

Has there been any work in any of the recent kernels to limit the DNS lookup
to a particular network namespace?  Do we have any facility to specify the
DNS resolvers on network namespace basis (such as /etc/ns/resolv.conf)?

thank you
ravi/

^ permalink raw reply

* [PATCH v2 1/2] net: asix: init ASIX AX88772B MAC from EEPROM
From: Lucas Stach @ 2012-12-18 15:21 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: Oliver Neukum, linux-usb-u79uwXL29TY76Z2rM5mHXA

The device comes up with a MAC address of all zeros. We need to read the
initial device MAC from EEPROM so it can be set properly later.

Signed-off-by: Lucas Stach <dev-8ppwABl0HbeELgA04lAiVw@public.gmane.org>
---
A similar fix was added into U-Boot:
http://patchwork.ozlabs.org/patch/179409/

v2: pass flag in the data field instead of clobbering the global flags
    space
---
 drivers/net/usb/asix.h         |  3 +++
 drivers/net/usb/asix_devices.c | 30 +++++++++++++++++++++++++++---
 2 Dateien geändert, 30 Zeilen hinzugefügt(+), 3 Zeilen entfernt(-)

diff --git a/drivers/net/usb/asix.h b/drivers/net/usb/asix.h
index e889631..7afe8ac 100644
--- a/drivers/net/usb/asix.h
+++ b/drivers/net/usb/asix.h
@@ -167,6 +167,9 @@ struct asix_data {
 	u8 res;
 };
 
+/* ASIX specific flags */
+#define FLAG_EEPROM_MAC		(1UL << 0)  /* init device MAC from eeprom */
+
 int asix_read_cmd(struct usbnet *dev, u8 cmd, u16 value, u16 index,
 		  u16 size, void *data);
 
diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 7a6e758..0ecc3bc 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -422,14 +422,25 @@ static const struct net_device_ops ax88772_netdev_ops = {
 
 static int ax88772_bind(struct usbnet *dev, struct usb_interface *intf)
 {
-	int ret, embd_phy;
+	int ret, embd_phy, i;
 	u8 buf[ETH_ALEN];
 	u32 phyid;
 
 	usbnet_get_endpoints(dev,intf);
 
 	/* Get the MAC address */
-	ret = asix_read_cmd(dev, AX_CMD_READ_NODE_ID, 0, 0, ETH_ALEN, buf);
+	if (dev->driver_info->data & FLAG_EEPROM_MAC) {
+		for (i = 0; i < (ETH_ALEN >> 1); i++) {
+			ret = asix_read_cmd(dev, AX_CMD_READ_EEPROM, 0x04 + i,
+					0, 2, buf + i * 2);
+			if (ret < 0)
+				break;
+		}
+	} else {
+		ret = asix_read_cmd(dev, AX_CMD_READ_NODE_ID,
+				0, 0, ETH_ALEN, buf);
+	}
+
 	if (ret < 0) {
 		netdev_dbg(dev->net, "Failed to read MAC address: %d\n", ret);
 		return ret;
@@ -872,6 +883,19 @@ static const struct driver_info ax88772_info = {
 	.tx_fixup = asix_tx_fixup,
 };
 
+static const struct driver_info ax88772b_info = {
+	.description = "ASIX AX88772B USB 2.0 Ethernet",
+	.bind = ax88772_bind,
+	.status = asix_status,
+	.link_reset = ax88772_link_reset,
+	.reset = ax88772_reset,
+	.flags = FLAG_ETHER | FLAG_FRAMING_AX | FLAG_LINK_INTR |
+	         FLAG_MULTI_PACKET,
+	.rx_fixup = asix_rx_fixup,
+	.tx_fixup = asix_tx_fixup,
+	.data = FLAG_EEPROM_MAC,
+};
+
 static const struct driver_info ax88178_info = {
 	.description = "ASIX AX88178 USB 2.0 Ethernet",
 	.bind = ax88178_bind,
@@ -953,7 +977,7 @@ static const struct usb_device_id	products [] = {
 }, {
 	// ASIX AX88772B 10/100
 	USB_DEVICE (0x0b95, 0x772b),
-	.driver_info = (unsigned long) &ax88772_info,
+	.driver_info = (unsigned long) &ax88772b_info,
 }, {
 	// ASIX AX88772 10/100
 	USB_DEVICE (0x0b95, 0x7720),
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2 2/2] net: asix: handle packets crossing URB boundaries
From: Lucas Stach @ 2012-12-18 15:21 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: Oliver Neukum, linux-usb-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1355844083-14285-1-git-send-email-dev-8ppwABl0HbeELgA04lAiVw@public.gmane.org>

ASIX AX88772B started to pack data even more tightly. Packets and the ASIX packet
header may now cross URB boundaries. To handle this we have to introduce
some state between individual calls to asix_rx_fixup().

Signed-off-by: Lucas Stach <dev-8ppwABl0HbeELgA04lAiVw@public.gmane.org>
---
I've running this patch for some weeks already now and it gets rid of all
the commonly seen rx failures with AX88772B.

v2: don't forget to free driver_private
---
 drivers/net/usb/asix.h         | 11 ++++++
 drivers/net/usb/asix_common.c  | 81 +++++++++++++++++++++++++++++-------------
 drivers/net/usb/asix_devices.c | 17 +++++++++
 3 Dateien geändert, 85 Zeilen hinzugefügt(+), 24 Zeilen entfernt(-)

diff --git a/drivers/net/usb/asix.h b/drivers/net/usb/asix.h
index 7afe8ac..4dfdbf6 100644
--- a/drivers/net/usb/asix.h
+++ b/drivers/net/usb/asix.h
@@ -167,6 +167,17 @@ struct asix_data {
 	u8 res;
 };
 
+struct asix_rx_fixup_info {
+	struct sk_buff *ax_skb;
+	u32 header;
+	u16 size;
+	bool split_head;
+};
+
+struct asix_private {
+	struct asix_rx_fixup_info rx_fixup_info;
+};
+
 /* ASIX specific flags */
 #define FLAG_EEPROM_MAC		(1UL << 0)  /* init device MAC from eeprom */
 
diff --git a/drivers/net/usb/asix_common.c b/drivers/net/usb/asix_common.c
index 50d1673..17f9801 100644
--- a/drivers/net/usb/asix_common.c
+++ b/drivers/net/usb/asix_common.c
@@ -53,44 +53,77 @@ void asix_write_cmd_async(struct usbnet *dev, u8 cmd, u16 value, u16 index,
 
 int asix_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
 {
+	struct asix_private *dp = dev->driver_priv;
+	struct asix_rx_fixup_info *rx = &dp->rx_fixup_info;
 	int offset = 0;
 
-	while (offset + sizeof(u32) < skb->len) {
-		struct sk_buff *ax_skb;
-		u16 size;
-		u32 header = get_unaligned_le32(skb->data + offset);
-
-		offset += sizeof(u32);
-
-		/* get the packet length */
-		size = (u16) (header & 0x7ff);
-		if (size != ((~header >> 16) & 0x07ff)) {
-			netdev_err(dev->net, "asix_rx_fixup() Bad Header Length\n");
-			return 0;
+	while (offset + sizeof(u16) <= skb->len) {
+		u16 remaining = 0;
+		unsigned char *data;
+
+		if (!rx->size) {
+			if ((skb->len - offset == sizeof(u16)) ||
+			    rx->split_head) {
+				if(!rx->split_head) {
+					rx->header = get_unaligned_le16(
+							skb->data + offset);
+					rx->split_head = true;
+					offset += sizeof(u16);
+					break;
+				} else {
+					rx->header |= (get_unaligned_le16(
+							skb->data + offset)
+							<< 16);
+					rx->split_head = false;
+					offset += sizeof(u16);
+				}
+			} else {
+				rx->header = get_unaligned_le32(skb->data +
+								offset);
+				offset += sizeof(u32);
+			}
+
+			/* get the packet length */
+			rx->size = (u16) (rx->header & 0x7ff);
+			if (rx->size != ((~rx->header >> 16) & 0x7ff)) {
+				netdev_err(dev->net, "asix_rx_fixup() Bad Header Length 0x%x, offset %d\n",
+					   rx->header, offset);
+				rx->size = 0;
+				return 0;
+			}
+			rx->ax_skb = netdev_alloc_skb_ip_align(dev->net,
+							       rx->size);
+			if (!rx->ax_skb)
+				return 0;
 		}
 
-		if ((size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) ||
-		    (size + offset > skb->len)) {
+		if (rx->size > dev->net->mtu + ETH_HLEN + VLAN_HLEN) {
 			netdev_err(dev->net, "asix_rx_fixup() Bad RX Length %d\n",
-				   size);
+				   rx->size);
+			kfree_skb(rx->ax_skb);
 			return 0;
 		}
-		ax_skb = netdev_alloc_skb_ip_align(dev->net, size);
-		if (!ax_skb)
-			return 0;
 
-		skb_put(ax_skb, size);
-		memcpy(ax_skb->data, skb->data + offset, size);
-		usbnet_skb_return(dev, ax_skb);
+		if (rx->size > skb->len - offset) {
+			remaining = rx->size - (skb->len - offset);
+			rx->size = skb->len - offset;
+		}
+
+		data = skb_put(rx->ax_skb, rx->size);
+		memcpy(data, skb->data + offset, rx->size);
+		if (!remaining)
+			usbnet_skb_return(dev, rx->ax_skb);
 
-		offset += (size + 1) & 0xfffe;
+		offset += (rx->size + 1) & 0xfffe;
+		rx->size = remaining;
 	}
 
 	if (skb->len != offset) {
-		netdev_err(dev->net, "asix_rx_fixup() Bad SKB Length %d\n",
-			   skb->len);
+		netdev_err(dev->net, "asix_rx_fixup() Bad SKB Length %d, %d\n",
+			   skb->len, offset);
 		return 0;
 	}
+
 	return 1;
 }
 
diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 0ecc3bc..c651243 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -495,9 +495,19 @@ static int ax88772_bind(struct usbnet *dev, struct usb_interface *intf)
 		dev->rx_urb_size = 2048;
 	}
 
+	dev->driver_priv = kzalloc(sizeof(struct asix_private), GFP_KERNEL);
+	if (!dev->driver_priv)
+		return -ENOMEM;
+
 	return 0;
 }
 
+void ax88772_unbind(struct usbnet *dev, struct usb_interface *intf)
+{
+	if (dev->driver_priv)
+		kfree(dev->driver_priv);
+}
+
 static const struct ethtool_ops ax88178_ethtool_ops = {
 	.get_drvinfo		= asix_get_drvinfo,
 	.get_link		= asix_get_link,
@@ -829,6 +839,10 @@ static int ax88178_bind(struct usbnet *dev, struct usb_interface *intf)
 		dev->rx_urb_size = 2048;
 	}
 
+	dev->driver_priv = kzalloc(sizeof(struct asix_private), GFP_KERNEL);
+	if (!dev->driver_priv)
+			return -ENOMEM;
+
 	return 0;
 }
 
@@ -875,6 +889,7 @@ static const struct driver_info hawking_uf200_info = {
 static const struct driver_info ax88772_info = {
 	.description = "ASIX AX88772 USB 2.0 Ethernet",
 	.bind = ax88772_bind,
+	.unbind = ax88772_unbind,
 	.status = asix_status,
 	.link_reset = ax88772_link_reset,
 	.reset = ax88772_reset,
@@ -886,6 +901,7 @@ static const struct driver_info ax88772_info = {
 static const struct driver_info ax88772b_info = {
 	.description = "ASIX AX88772B USB 2.0 Ethernet",
 	.bind = ax88772_bind,
+	.unbind = ax88772_unbind,
 	.status = asix_status,
 	.link_reset = ax88772_link_reset,
 	.reset = ax88772_reset,
@@ -899,6 +915,7 @@ static const struct driver_info ax88772b_info = {
 static const struct driver_info ax88178_info = {
 	.description = "ASIX AX88178 USB 2.0 Ethernet",
 	.bind = ax88178_bind,
+	.unbind = ax88772_unbind,
 	.status = asix_status,
 	.link_reset = ax88178_link_reset,
 	.reset = ax88178_reset,
-- 
1.7.11.7

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Ian Campbell @ 2012-12-18 15:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1355843525.9380.18.camel@edumazet-glaptop>

On Tue, 2012-12-18 at 15:12 +0000, Eric Dumazet wrote:
> On Tue, 2012-12-18 at 13:51 +0000, Ian Campbell wrote:
> > Using RX_COPY_THRESHOLD is incorrect if the SKB is actually smaller
> > than that. We have already accounted for this in
> > NETFRONT_SKB_CB(skb)->pull_to so use that instead.
> > 
> > Fixes WARN_ON from skb_try_coalesce.
> > 
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Sander Eikelenboom <linux@eikelenboom.it>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > Cc: annie li <annie.li@oracle.com>
> > Cc: xen-devel@lists.xensource.com
> > Cc: netdev@vger.kernel.org
> > Cc: stable@kernel.org # 3.7.x only
> > ---
> >  drivers/net/xen-netfront.c |   15 +++++----------
> >  1 files changed, 5 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> > index caa0110..b06ef81 100644
> > --- a/drivers/net/xen-netfront.c
> > +++ b/drivers/net/xen-netfront.c
> > @@ -971,17 +971,12 @@ err:
> >  		 * overheads. Here, we add the size of the data pulled
> >  		 * in xennet_fill_frags().
> >  		 *
> > -		 * We also adjust for any unused space in the main
> > -		 * data area by subtracting (RX_COPY_THRESHOLD -
> > -		 * len). This is especially important with drivers
> > -		 * which split incoming packets into header and data,
> > -		 * using only 66 bytes of the main data area (see the
> > -		 * e1000 driver for example.)  On such systems,
> > -		 * without this last adjustement, our achievable
> > -		 * receive throughout using the standard receive
> > -		 * buffer size was cut by 25%(!!!).
> > +		 * We also adjust for the __pskb_pull_tail done in
> > +		 * handle_incoming_queue which pulls data from the
> > +		 * frags into the head area, which is already
> > +		 * accounted in RX_COPY_THRESHOLD.
> >  		 */
> > -		skb->truesize += skb->data_len - RX_COPY_THRESHOLD;
> > +		skb->truesize += skb->data_len - NETFRONT_SKB_CB(skb)->pull_to;
> >  		skb->len += skb->data_len;
> >  
> >  		if (rx->flags & XEN_NETRXF_csum_blank)
> 
> 
> But skb truesize is not what you think.

Indeed, it seems I was completely backwards about what it means!

> You must account the exact memory used by this skb, not only the used
> part of it.
> 
> At the very minimum, it should be
> 
> skb->truesize += skb->data_len;
> 
> But it really should be the allocated size of the fragment.
> 
> If its a page, then its a page, even if you use one single byte in it.

So actually we want += PAGE_SIZE * skb_shinfo(skb)->nr_frags ?

Sander, can you try that change?

Ian.

^ permalink raw reply

* inaccurate packet scheduling
From: Jiri Pirko @ 2012-12-18 15:04 UTC (permalink / raw)
  To: netdev; +Cc: jhs, davem, edumazet, tgraf, shemminger

Hi all.

Run one of the following 2 scripts on machine A:

#!/bin/bash
tc qdisc del dev eth0 root
sleep 1
tc -batch << EOF
qdisc add dev eth0 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qdisc add dev eth0 parent 1:1 handle 10: pfifo limit 50
qdisc add dev eth0 parent 1:2 handle 20 tbf latency 100ms rate 4mbit burst 2m
filter add dev eth0 parent 1: protocol ip u32 match ip dst $machineB_ip flowid 1:2
EOF

#!/bin/bash
tc qdisc del dev eth0 root
sleep 1
tc -batch << EOF
qdisc add dev eth0 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
qdisc add dev eth0 parent 1:1 handle 10: pfifo limit 20
qdisc add dev eth0 parent 1:2 handle 20: pfifo limit 20
filter add dev eth0 parent 1: protocol ip pref 10 \
u32 match ip dst $machineB_ip \
flowid 1:2 \
police rate 4Mbit burst 2m conform-exceed drop
EOF

And run:
[machineB ~]# iperf -s
[machineA ~]# iperf -c machineB_ip -t 60

Expected results are: ~3.8-4.2 Mbits/s
But actual results are: ~130-170 Kbits/s with tbf, ~70-300 Kbits/s with policy rate

[machineA ~]# tc -s qdisc list dev eth0
qdisc prio 1: root refcnt 9 bands 2 priomap  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 Sent 1512384 bytes 1032 pkt (dropped 729, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc pfifo 10: parent 1:1 limit 50p
 Sent 4560 bytes 32 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc tbf 20: parent 1:2 rate 4000Kbit burst 2Mb lat 100.0ms 
 Sent 1507824 bytes 1000 pkt (dropped 729, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 


Tested with kernel pulled from linus's git today. This happens with older
kernels as well (I tried 2.6.32-based rhel6 kernels).

This happens to me on following machines:
HP DL360G8 (x86_64) http://people.redhat.com/jpirko/aThoo2Ei/dl380g8/
HP DL360G3 (i686) 
IBM JS22 (ppc64) http://people.redhat.com/jpirko/aThoo2Ei/ibmjs22/

On following machines, I do not observe this issue:
qemu kvm (x86_64)
IBM Zseries (s390x) http://people.redhat.com/jpirko/aThoo2Ei/ibmz/

Please ask in case you need me to provide any other details.

Thanks.

Jiri

^ permalink raw reply

* Re: network namespace and DNS lookups
From: Ravi Aysola @ 2012-12-18 15:49 UTC (permalink / raw)
  To: netdev
In-Reply-To: <CAFaHj6GVgDf2QXUWyaKYnvOqLJRjG+kA32eEk5psPkmc-RPY1Q@mail.gmail.com>

I think I sent my earlier email a bit prematurely.  I do have
/etc/netns/<namespace-name>/resolv.conf
files under each of my namespaces. Now the question is, how does a
user space process
(say bind)  look at a namespace specific resolv.conf instead of
default one?  Have any of
these standard applications been modified to work with namespace
specific config files?

thanks again
ravi/

On Tue, Dec 18, 2012 at 10:09 AM, Ravi Aysola <ravi.mlists@gmail.com> wrote:
> Has there been any work in any of the recent kernels to limit the DNS lookup
> to a particular network namespace?  Do we have any facility to specify the
> DNS resolvers on network namespace basis (such as /etc/ns/resolv.conf)?
>
> thank you
> ravi/

^ permalink raw reply

* Re: [PATCH V4 0/5] Add LE hash collision bug fix for active and passive offloaded connections
From: Vipul Pandya @ 2012-12-18 16:07 UTC (permalink / raw)
  To: roland@purestorage.com, davem@davemloft.net
  Cc: Vipul Pandya, linux-rdma@vger.kernel.org, netdev@vger.kernel.org,
	Divy Le Ray, Dimitrios Michailidis, Kumar A S, Steve Wise,
	Abhishek Agrawal
In-Reply-To: <1355131856-29142-1-git-send-email-vipul@chelsio.com>

Hi Roland and David,

Any feedback on this?

Thanks,
Vipul

On 10-12-2012 15:00, Vipul Pandya wrote:
> This patch series fixes the LE hash collision issue in cxgb4 and RDMA/cxgb4
> drivers in kernel.org.
> 
> If the hash functionality is enabled in T4 then tuple information of active and
> passive offloaded connections are stored in DDR3 memory. LE (Lookup Engine)
> implements the interface to search this tuple entries using hash algorithm. To
> avoid LE hash collision, firmware provides new mechanisms to setup active and
> passive open connections.
> 
> In case of active open connection, firmware detects LE hash collision situation
> and notifies driver. Driver uses fw_ofld_connection work request to offload
> that connection. Its tuple information will be stored in TCAM memory array. 
> 
> In case of passive open connection, server filter region is created in TCAM.
> This region stores the filter which will redirect the incoming SYN packet to
> offload queues. After this driver tries to establish the connection using
> firmware work request.
> 
> This patch series also adds framework for managing filters and to use T4's
> filter capabilities.
> 
> Following testing has been carried out successfully on this patch series.
> 1. 1000 active open connections created and saw that tcam_full counter is
> getting incremented through debugfs file.
> 2. Multiple passive open connections were created and ran the same test
> repetatively to verify server filter is getting created and deleted properly.
> 
> The patch-series is based on Roland's infiniband tree (for-next branch),
> and involves changes on cxgb4 and RDMA/cxgb4 drivers. Both linux-rdma and netdev
> are included in this post for review.
> 
> We have some more bug fixes in RDMA/cxgb4 after this patch series. So, we would
> like to request this series to get mereged through Roland's infiniband for-next
> tree.
> 
> V2: Changed commenting style from kernel-doc format('/**') to normal
> V3: Resending the whole patch series again with the missing patches in V2
> V4: Changed comments for networking from '/*' to '/* Comment' format.
> 
> Thanks,
> Vipul Pandya
> 
> Vipul Pandya (5):
>   cxgb4: Add T4 filter support
>   cxgb4: Add LE hash collision bug fix path in LLD driver
>   RDMA/cxgb4: Fix LE hash collision bug for active open connection
>   RDMA/cxgb4: Fix LE hash collision bug for passive open connection
>   RDMA/cxgb4: Fix bug for active and passive LE hash collision path
> 
>  drivers/infiniband/hw/cxgb4/cm.c                |  789 +++++++++++++++++++---
>  drivers/infiniband/hw/cxgb4/device.c            |  210 ++++++-
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   33 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  146 +++++
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  473 +++++++++++++-
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |   23 +-
>  drivers/net/ethernet/chelsio/cxgb4/l2t.c        |   34 +
>  drivers/net/ethernet/chelsio/cxgb4/l2t.h        |    3 +
>  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |   23 +-
>  drivers/net/ethernet/chelsio/cxgb4/t4_msg.h     |   66 ++
>  drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   37 ++
>  drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |  388 +++++++++++
>  12 files changed, 2100 insertions(+), 125 deletions(-)
> 

^ permalink raw reply

* Re: [PATCH V4 0/5] Add LE hash collision bug fix for active and passive offloaded connections
From: Steve Wise @ 2012-12-18 16:11 UTC (permalink / raw)
  To: Vipul Pandya
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	roland-BHEL68pLQRGGvPXPguhicg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	divy-ut6Up61K2wZBDgjK7y7TUQ, dm-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ, abhishek-ut6Up61K2wZBDgjK7y7TUQ
In-Reply-To: <1355131856-29142-1-git-send-email-vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

Reviewed-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>

On 12/10/2012 3:30 AM, Vipul Pandya wrote:
> This patch series fixes the LE hash collision issue in cxgb4 and RDMA/cxgb4
> drivers in kernel.org.
>
> If the hash functionality is enabled in T4 then tuple information of active and
> passive offloaded connections are stored in DDR3 memory. LE (Lookup Engine)
> implements the interface to search this tuple entries using hash algorithm. To
> avoid LE hash collision, firmware provides new mechanisms to setup active and
> passive open connections.
>
> In case of active open connection, firmware detects LE hash collision situation
> and notifies driver. Driver uses fw_ofld_connection work request to offload
> that connection. Its tuple information will be stored in TCAM memory array.
>
> In case of passive open connection, server filter region is created in TCAM.
> This region stores the filter which will redirect the incoming SYN packet to
> offload queues. After this driver tries to establish the connection using
> firmware work request.
>
> This patch series also adds framework for managing filters and to use T4's
> filter capabilities.
>
> Following testing has been carried out successfully on this patch series.
> 1. 1000 active open connections created and saw that tcam_full counter is
> getting incremented through debugfs file.
> 2. Multiple passive open connections were created and ran the same test
> repetatively to verify server filter is getting created and deleted properly.
>
> The patch-series is based on Roland's infiniband tree (for-next branch),
> and involves changes on cxgb4 and RDMA/cxgb4 drivers. Both linux-rdma and netdev
> are included in this post for review.
>
> We have some more bug fixes in RDMA/cxgb4 after this patch series. So, we would
> like to request this series to get mereged through Roland's infiniband for-next
> tree.
>
> V2: Changed commenting style from kernel-doc format('/**') to normal
> V3: Resending the whole patch series again with the missing patches in V2
> V4: Changed comments for networking from '/*' to '/* Comment' format.
>
> Thanks,
> Vipul Pandya
>
> Vipul Pandya (5):
>    cxgb4: Add T4 filter support
>    cxgb4: Add LE hash collision bug fix path in LLD driver
>    RDMA/cxgb4: Fix LE hash collision bug for active open connection
>    RDMA/cxgb4: Fix LE hash collision bug for passive open connection
>    RDMA/cxgb4: Fix bug for active and passive LE hash collision path
>
>   drivers/infiniband/hw/cxgb4/cm.c                |  789 +++++++++++++++++++---
>   drivers/infiniband/hw/cxgb4/device.c            |  210 ++++++-
>   drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |   33 +
>   drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  146 +++++
>   drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  473 +++++++++++++-
>   drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |   23 +-
>   drivers/net/ethernet/chelsio/cxgb4/l2t.c        |   34 +
>   drivers/net/ethernet/chelsio/cxgb4/l2t.h        |    3 +
>   drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |   23 +-
>   drivers/net/ethernet/chelsio/cxgb4/t4_msg.h     |   66 ++
>   drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   37 ++
>   drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |  388 +++++++++++
>   12 files changed, 2100 insertions(+), 125 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Eric Dumazet @ 2012-12-18 16:13 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1355844398.14620.254.camel@zakaz.uk.xensource.com>

On Tue, 2012-12-18 at 15:26 +0000, Ian Campbell wrote:

> So actually we want += PAGE_SIZE * skb_shinfo(skb)->nr_frags ?
> 

I dont know what are the real frag sizes in your case.

Some drivers allocate a full page for an ethernet frame, others use half
of a page, it really depends.

As the frag ABI doesnt contain real size, its ok in this case to account
the actual frag size.

(skb->data_len in your driver)

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Ian Campbell @ 2012-12-18 16:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1355847180.9380.21.camel@edumazet-glaptop>

On Tue, 2012-12-18 at 16:13 +0000, Eric Dumazet wrote:
> On Tue, 2012-12-18 at 15:26 +0000, Ian Campbell wrote:
> 
> > So actually we want += PAGE_SIZE * skb_shinfo(skb)->nr_frags ?
> > 
> 
> I dont know what are the real frag sizes in your case.

I think it's a page, see xennet_alloc_rx_buffers and the alloc_page
therein.

> Some drivers allocate a full page for an ethernet frame, others use half
> of a page, it really depends.
> 
> As the frag ABI doesnt contain real size, its ok in this case to account
> the actual frag size.
> 
> (skb->data_len in your driver)

I guess I'm a bit confused by what truesize means again then ;-),
because in that case the original patch is correct although it would
have been less confusing to do:
	skb->truesize += skb->data_len; 
in xennet_poll() and then do the subtraction of
NETFRONT_SKB_CB(skb)->pull_to in handle_incoming_queue() where we
actually do the pull up.

Unless __pskb_pull_tail does that adjustment for us, but if it does I
can't see where.

Ian.

^ permalink raw reply

* Re: inaccurate packet scheduling
From: Stephen Hemminger @ 2012-12-18 16:26 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: jhs, davem, edumazet, tgraf, netdev
In-Reply-To: <20121218150407.GB1533@minipsycho.orion>



----- Original Message -----
> Hi all.
> 
> Run one of the following 2 scripts on machine A:
> 
> #!/bin/bash
> tc qdisc del dev eth0 root
> sleep 1
> tc -batch << EOF
> qdisc add dev eth0 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 0
> qdisc add dev eth0 parent 1:1 handle 10: pfifo limit 50
> qdisc add dev eth0 parent 1:2 handle 20 tbf latency 100ms rate 4mbit
> burst 2m
> filter add dev eth0 parent 1: protocol ip u32 match ip dst
> $machineB_ip flowid 1:2
> EOF
> 
> #!/bin/bash
> tc qdisc del dev eth0 root
> sleep 1
> tc -batch << EOF
> qdisc add dev eth0 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0 0 0
> qdisc add dev eth0 parent 1:1 handle 10: pfifo limit 20
> qdisc add dev eth0 parent 1:2 handle 20: pfifo limit 20
> filter add dev eth0 parent 1: protocol ip pref 10 \
> u32 match ip dst $machineB_ip \
> flowid 1:2 \
> police rate 4Mbit burst 2m conform-exceed drop
> EOF
> 
> And run:
> [machineB ~]# iperf -s
> [machineA ~]# iperf -c machineB_ip -t 60
> 
> Expected results are: ~3.8-4.2 Mbits/s
> But actual results are: ~130-170 Kbits/s with tbf, ~70-300 Kbits/s
> with policy rate
> 
> [machineA ~]# tc -s qdisc list dev eth0
> qdisc prio 1: root refcnt 9 bands 2 priomap  0 0 0 0 0 0 0 0 0 0 0 0
> 0 0 0 0
>  Sent 1512384 bytes 1032 pkt (dropped 729, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc pfifo 10: parent 1:1 limit 50p
>  Sent 4560 bytes 32 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc tbf 20: parent 1:2 rate 4000Kbit burst 2Mb lat 100.0ms
>  Sent 1507824 bytes 1000 pkt (dropped 729, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> 
> 
> Tested with kernel pulled from linus's git today. This happens with
> older
> kernels as well (I tried 2.6.32-based rhel6 kernels).
> 
> This happens to me on following machines:
> HP DL360G8 (x86_64) http://people.redhat.com/jpirko/aThoo2Ei/dl380g8/
> HP DL360G3 (i686)
> IBM JS22 (ppc64) http://people.redhat.com/jpirko/aThoo2Ei/ibmjs22/
> 
> On following machines, I do not observe this issue:
> qemu kvm (x86_64)
> IBM Zseries (s390x) http://people.redhat.com/jpirko/aThoo2Ei/ibmz/
> 
> Please ask in case you need me to provide any other details.
> 
> Thanks.

Check kernel log for messages about clock. It could be that on the
machines with issues TSC is not usable for kernel clock.
Also turn off TSO since it screws up any form of rate control.

^ permalink raw reply

* Re: [PATCH v2] netlink: align attributes on 64-bits
From: Nicolas Dichtel @ 2012-12-18 16:23 UTC (permalink / raw)
  To: Thomas Graf; +Cc: bhutchings, netdev, davem, David.Laight
In-Reply-To: <20121218125735.GG27746@casper.infradead.org>

Le 18/12/2012 13:57, Thomas Graf a écrit :
> On 12/17/12 at 05:49pm, Nicolas Dichtel wrote:
>> @@ -492,6 +492,15 @@ static inline struct nlmsghdr *nlmsg_put_answer(struct sk_buff *skb,
>>    */
>>   static inline struct sk_buff *nlmsg_new(size_t payload, gfp_t flags)
>>   {
>> +	/* Because attributes may be aligned on 64-bits boundary with fake
>> +	 * attribute (type 0, size 4 (attributes are 32-bits align by default)),
>> +	 * an exact payload size cannot be calculated. Hence, we need to reserve
>> +	 * more space for these attributes.
>> +	 * 128 is arbitrary: it allows to align up to 32 attributes.
>> +	 */
>> +	if (payload < NLMSG_DEFAULT_SIZE)
>> +		payload = min(payload + 128, (size_t)NLMSG_DEFAULT_SIZE);
>
> This is doomed to fail eventually. A netlink message may carry
> hundreds of attributes eventually. See my suggestion below.
>
>> diff --git a/lib/nlattr.c b/lib/nlattr.c
>> index 18eca78..7440a80 100644
>> --- a/lib/nlattr.c
>> +++ b/lib/nlattr.c
>> @@ -450,9 +450,18 @@ EXPORT_SYMBOL(__nla_put_nohdr);
>>    */
>>   int nla_put(struct sk_buff *skb, int attrtype, int attrlen, const void *data)
>>   {
>> -	if (unlikely(skb_tailroom(skb) < nla_total_size(attrlen)))
>> +	int align = IS_ALIGNED((unsigned long)skb_tail_pointer(skb), 8) ? 0 : 4;
>
> This does not look right. In order for the attribute data to be
> aligned properly you would need to check skb_tail_pointer(skb) +
> NLA_HDRLEN for proper alignment or you end up aligning the
> attribute header.
Good point.

>
> How about we change nla_total_size() to return the size with
> needed padding taken into account. That should fix the message
> size caluclation problem and we only need to reserve room for
> the initial padding to align the very first attribute.
>
> Below is an untested patch that does this. What do you think?
I still have some doubts about the size calculation (see bellow).
For the rest of the patch, it seems ok (except some minor point). I will test it.

>
> diff --git a/include/net/netlink.h b/include/net/netlink.h
> index 9690b0f..7ce8e76 100644
> --- a/include/net/netlink.h
> +++ b/include/net/netlink.h
> @@ -114,7 +114,6 @@
>    * Attribute Length Calculations:
>    *   nla_attr_size(payload)		length of attribute w/o padding
>    *   nla_total_size(payload)		length of attribute w/ padding
> - *   nla_padlen(payload)		length of padding
>    *
>    * Attribute Payload Access:
>    *   nla_data(nla)			head of attribute payload
> @@ -492,6 +491,14 @@ static inline struct nlmsghdr *nlmsg_put_answer(struct sk_buff *skb,
>    */
>   static inline struct sk_buff *nlmsg_new(size_t payload, gfp_t flags)
>   {
> +	/* If an exact size if specified, reserve some additional space to
> +	 * align the first attribute, all subsequent attributes should have
> +	 * padding accounted for.
> +	 */
> +	if (payload != NLMSG_DEFAULT_SIZE)
> +		payload = min_t(size_t, payload + NLA_ATTR_ALIGN,
> +				NLMSG_DEFAULT_SIZE);
> +
>   	return alloc_skb(nlmsg_total_size(payload), flags);
>   }
>
> @@ -653,16 +660,12 @@ static inline int nla_attr_size(int payload)
>    */
>   static inline int nla_total_size(int payload)
>   {
> -	return NLA_ALIGN(nla_attr_size(payload));
> -}
> +	size_t len = NLA_ALIGN(nla_attr_size(payload));
>
> -/**
> - * nla_padlen - length of padding at the tail of attribute
> - * @payload: length of payload
> - */
> -static inline int nla_padlen(int payload)
> -{
> -	return nla_total_size(payload) - nla_attr_size(payload);
> +	if (!IS_ALIGNED(len, NLA_ATTR_ALIGN))
> +		len = ALIGN(len + NLA_HDRLEN, NLA_ATTR_ALIGN);
Two comments:
1/ should it be ALIGN(len, NLA_ATTR_ALIGN)? If we want to add a __u64:
    => nla_attr_size(sizeof(__u64)) = 12
    => NLA_ALIGN(nla_attr_size(sizeof(__u64))) => 12 (= len)
    => ALIGN(len + NLA_HDRLEN, NLA_ATTR_ALIGN) = 0 but it should be 4

2/ Suppose that the attribute is:

   struct foo {
   	__u64 bar1;
   	__u32 bar2;
   }
   => sizeof(struct foo) = 12 (= payload)
   => nla_attr_size(payload) = 16
   => NLA_ALIGN(nla_attr_size(payload)) = 16 (= len)
   => IS_ALIGNED(len, NLA_ATTR_ALIGN) = true
   => extra room is not reserved
   But it's not guaranteed that bar1 is aligned on 8 bytes, only on 4 bytes.

> +
> +	return len;
>   }
>
>   /**
> diff --git a/include/uapi/linux/netlink.h b/include/uapi/linux/netlink.h
> index 78d5b8a..1856729 100644
> --- a/include/uapi/linux/netlink.h
> +++ b/include/uapi/linux/netlink.h
> @@ -149,5 +149,10 @@ struct nlattr {
>   #define NLA_ALIGN(len)		(((len) + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1))
>   #define NLA_HDRLEN		((int) NLA_ALIGN(sizeof(struct nlattr)))
>
> +/* Padding attribute type, added automatically to align attributes,
> + * must be ignored by readers. */
> +#define NLA_PADDING		0
> +#define NLA_ATTR_ALIGN		8
> +
>
>   #endif /* _UAPI__LINUX_NETLINK_H */
> diff --git a/lib/nlattr.c b/lib/nlattr.c
> index 18eca78..b09473c 100644
> --- a/lib/nlattr.c
> +++ b/lib/nlattr.c
> @@ -322,18 +322,36 @@ int nla_strcmp(const struct nlattr *nla, const char *str)
>    * Adds a netlink attribute header to a socket buffer and reserves
>    * room for the payload but does not copy it.
>    *
> + * May add a padding attribute of type NLA_PADDING before the
> + * real attribute to ensure proper alignment.
> + *
>    * The caller is responsible to ensure that the skb provides enough
>    * tailroom for the attribute header and payload.
>    */
>   struct nlattr *__nla_reserve(struct sk_buff *skb, int attrtype, int attrlen)
>   {
>   	struct nlattr *nla;
> +	size_t offset;
> +
> +	offset = (size_t) skb_tail_pointer(skb);
> +	if (!IS_ALIGNED(offset + NLA_HDRLEN, NLA_ATTR_ALIGN)) {
With the previous struct foo, this test may be true even if we don't have 
reserved extra room. This test depends on previous attribute.
I think the exact size of the netlink message depends on the order of 
attributes, not only on the attribute itself.
What about taking the assumption that the start will never be aligned and always 
allocating extra room: ALIGN(NLA_ALIGNTO, NLA_ATTR_ALIGN) (= 4)?

> +		struct nlattr *pad;
> +		size_t padlen;
> +
> +		padlen = nla_total_size(offset) - offset - NLA_HDRLEN;
> +		pad = (struct nlattr *) skb_put(skb, nla_attr_size(padlen));
> +		pad->nla_type = 0;
> +		pad->nla_len = nla_attr_size(padlen);
> +
> +		memset((unsigned char *) pad + NLA_HDRLEN, 0, padlen);
> +	}
>
> -	nla = (struct nlattr *) skb_put(skb, nla_total_size(attrlen));
> +	nla = (struct nlattr *) skb_put(skb, NLA_ALIGN(nla_attr_size(attrlen)));
>   	nla->nla_type = attrtype;
>   	nla->nla_len = nla_attr_size(attrlen);
>
> -	memset((unsigned char *) nla + nla->nla_len, 0, nla_padlen(attrlen));
> +	memset((unsigned char *) nla + nla->nla_len, 0,
> +	       NLA_ALIGN(nla->nla_len) - nla->nla_len);
>
>   	return nla;
>   }
> @@ -360,6 +378,23 @@ void *__nla_reserve_nohdr(struct sk_buff *skb, int attrlen)
>   }
>   EXPORT_SYMBOL(__nla_reserve_nohdr);
>
> +static size_t nla_pre_padlen(struct sk_buff *skb)
> +{
> +	size_t offset = (size_t) skb_tail_pointer(skb);
> +
> +	if (!IS_ALIGNED(offset + NLA_HDRLEN, NLA_ATTR_ALIGN))
> +		return nla_total_size(offset) - offset - NLA_HDRLEN;
> +
> +	return 0;
> +}
> +
> +static bool nla_insufficient_space(struct sk_buff *skb, int attrlen)
> +{
> +	size_t needed = nla_pre_padlen(skb) + nla_total_size(attrlen);
If nla_total_size() was right, nla_pre_padlen(skb) should already be included. 
Am I wrong?

> +
> +	return (skb_tailroom(skb) < needed);
> +}
> +
>   /**
>    * nla_reserve - reserve room for attribute on the skb
>    * @skb: socket buffer to reserve room on
> @@ -374,7 +409,7 @@ EXPORT_SYMBOL(__nla_reserve_nohdr);
>    */
>   struct nlattr *nla_reserve(struct sk_buff *skb, int attrtype, int attrlen)
>   {
> -	if (unlikely(skb_tailroom(skb) < nla_total_size(attrlen)))
> +	if (unlikely(nla_insufficient_space(skb, attrlen)))
>   		return NULL;
>
>   	return __nla_reserve(skb, attrtype, attrlen);
> @@ -450,7 +485,7 @@ EXPORT_SYMBOL(__nla_put_nohdr);
>    */
>   int nla_put(struct sk_buff *skb, int attrtype, int attrlen, const void *data)
>   {
> -	if (unlikely(skb_tailroom(skb) < nla_total_size(attrlen)))
> +	if (unlikely(nla_insufficient_space(skb, attrlen)))
>   		return -EMSGSIZE;
>
>   	__nla_put(skb, attrtype, attrlen, data);
>

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Eric Dumazet @ 2012-12-18 16:35 UTC (permalink / raw)
  To: Ian Campbell
  Cc: netdev@vger.kernel.org, Sander Eikelenboom, Konrad Rzeszutek Wilk,
	annie li, xen-devel@lists.xensource.com
In-Reply-To: <1355847758.14620.265.camel@zakaz.uk.xensource.com>

On Tue, 2012-12-18 at 16:22 +0000, Ian Campbell wrote:
> On Tue, 2012-12-18 at 16:13 +0000, Eric Dumazet wrote:
> > On Tue, 2012-12-18 at 15:26 +0000, Ian Campbell wrote:
> > 
> > > So actually we want += PAGE_SIZE * skb_shinfo(skb)->nr_frags ?
> > > 
> > 
> > I dont know what are the real frag sizes in your case.
> 
> I think it's a page, see xennet_alloc_rx_buffers and the alloc_page
> therein.
> 

If they are order-0 pages, then PAGE_SIZE * nr_frags  is OK.


> > Some drivers allocate a full page for an ethernet frame, others use half
> > of a page, it really depends.
> > 

> > As the frag ABI doesnt contain real size, its ok in this case to account
> > the actual frag size.
> > 
> > (skb->data_len in your driver)
> 
> I guess I'm a bit confused by what truesize means again then ;-),
> because in that case the original patch is correct although it would
> have been less confusing to do:
> 	skb->truesize += skb->data_len; 
> in xennet_poll() and then do the subtraction of
> NETFRONT_SKB_CB(skb)->pull_to in handle_incoming_queue() where we
> actually do the pull up.
> 
> Unless __pskb_pull_tail does that adjustment for us, but if it does I
> can't see where.

Thats because skb frags only contain :

a page pointer.
An offset
A size. (Exact number or used bytes in this frag)

And not the 'originally allocated size. It could be 256, 768, 2048,
4096, 65536 bytes, nobody but the driver really knows.

So when we pull X bytes from a fragment to skb->head, there is no way to
remember what was the original size of the fragment.

Only the driver allocating the frag knows its truesize.

Once skb is given to the stack, we lose this information, and rely on
skb->truesize being an accurate estimation.

^ permalink raw reply

* Re: TCP delayed ACK heuristic
From: Eric Dumazet @ 2012-12-18 16:30 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, Ben Greear, David Miller, Stephen Hemminger, Rick Jones,
	Thomas Graf
In-Reply-To: <2088500005.27728019.1355843484620.JavaMail.root@redhat.com>

On Tue, 2012-12-18 at 10:11 -0500, Cong Wang wrote:
> Hello, TCP experts!
> 
> Some time ago, Ben sent a patch [1] to add some knobs for
> tuning TCP delayed ACK, but rejected by David.
> 
> David's point is that we can do some heuristics for TCP
> delayed ACK, so the question is that what kind of heuristics
> can we use?
> 
> RFC1122 explicitly mentions:
> 
>             A TCP SHOULD implement a delayed ACK, but an ACK should not
>             be excessively delayed; in particular, the delay MUST be
>             less than 0.5 seconds, and in a stream of full-sized
>             segments there SHOULD be an ACK for at least every second
>             segment.
> 
> so this prevents us from using any heuristic for the number
> of coalesced delayed ACK.
> 
> For the timeout of a delayed ACK, my idea is guessing how many
> packet we suppose to receive is the TCP stream is fully utilized,
> something like below:
> 
> +static inline u32 tcp_expect_packets(struct sock *sk)
> +{
> +       struct tcp_sock *tp = tcp_sk(sk);
> +       int rtt = tp->srtt >> 3;
> +       u32 idle = tcp_time_stamp - inet_csk(sk)->icsk_ack.lrcvtime;
> +
> +       return idle * 2 / rtt;
> +}
> ...
> +       ato -= tcp_expect_packets(sk) * delta;
> 
> 
> The more we expect, the less we should delay. However this is
> not accurate due to congestion control.
> 
> Meanwhile, we can also check how many packets are pending in TCP
> sending queue, the more we pend, the more we can piggyback with
> a single ACK, but not beyond how much we are able to send at
> that time.
> 
> Comments? Ideas?
> 

ACKS might also be delayed because of bidirectional traffic, and is more
controlled by the application response time. TCP stack can not easily
estimate it.

If you focus on bulk receive, LRO/GRO should already lower number of
ACKS to an acceptable level and without major disruption.

Stretch acks are not only the receiver concern, there are issues for the
sender that you cannot always control/change.

I recommend reading RFC2525 2.13

^ permalink raw reply

* RE: TCP delayed ACK heuristic
From: David Laight @ 2012-12-18 16:39 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger,
	Rick Jones, Thomas Graf
In-Reply-To: <2088500005.27728019.1355843484620.JavaMail.root@redhat.com>

> David's point is that we can do some heuristics for TCP
> delayed ACK, so the question is that what kind of heuristics
> can we use?
> 
> RFC1122 explicitly mentions:
> 
>             A TCP SHOULD implement a delayed ACK, but an ACK should not
>             be excessively delayed; in particular, the delay MUST be
>             less than 0.5 seconds, and in a stream of full-sized
>             segments there SHOULD be an ACK for at least every second
>             segment.
> 
> so this prevents us from using any heuristic for the number
> of coalesced delayed ACK.

There are problems with only implementing the acks
specified by RFC1122.

I've seen problems when the sending side is doing (I think)
'slow start' with Nagle disabled.
The sender would only send 4 segments before waiting for an
ACK - even when it had more than a full sized segment waiting.
Sender was Linux 2.6.something (probably low 20s).
I changed the application flow to send data in the reverse
direction to avoid the problem.
That was on a ~0 delay local connection - which means that
there is almost never outstanding data, and the 'slow start'
happened almost all the time.
Nagle is completely the wrong algorithm for the data flow.

	David

^ permalink raw reply

* Re: inaccurate packet scheduling
From: Eric Dumazet @ 2012-12-18 16:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Jiri Pirko, jhs, davem, edumazet, tgraf, netdev
In-Reply-To: <e2f71dc7-bee5-41ca-bd64-f4569fa953da@tahiti.vyatta.com>

On Tue, 2012-12-18 at 08:26 -0800, Stephen Hemminger wrote:

> Check kernel log for messages about clock. It could be that on the
> machines with issues TSC is not usable for kernel clock.
> Also turn off TSO since it screws up any form of rate control.

Yes, but we should fix it eventually. I'll take a look.

^ permalink raw reply

* RE: [PATCH v2] netlink: align attributes on 64-bits
From: David Laight @ 2012-12-18 16:50 UTC (permalink / raw)
  To: nicolas.dichtel, Thomas Graf; +Cc: bhutchings, netdev, davem
In-Reply-To: <50D09897.8030508@6wind.com>

> 2/ Suppose that the attribute is:
> 
>    struct foo {
>    	__u64 bar1;
>    	__u32 bar2;
>    }
>    => sizeof(struct foo) = 12 (= payload)

That is only true if the host architecture aligns 64bit items
on 32 it boundaries (as i386 does).
Otherwise there are 4 bytes of padding at the end and the
size is 16.

Actually it is worse than that.
Consider the structure:
	struct bar {
		__u32 foo1;
		__u64 foo2;
	}
On i386 it will have size 12 and foo2 will be at offset 4.
On sparc32 (and most 64bit) it will have size 16 with foo2
at offset 8 (and 4 bytes of pad after foo1).

Do these messages move between systems?
If they do then any 64bit items need an explicit alignment
eg tag with __attribute__((aligned(8))) (or aligned(4)).

	David

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox