Netdev List

* Re: [PATCH v3 6/6] net: sh_eth: use NAPI
From: Shimoda, Yoshihiro @ 2012-05-15  9:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-sh
In-Reply-To: <20120515.010753.2012331320750491448.davem@davemloft.net>

2012/05/15 14:07, David Miller wrote:
> From: "Shimoda, Yoshihiro" <yoshihiro.shimoda.uh@renesas.com>
> Date: Tue, 15 May 2012 13:47:44 +0900
> 
>> 2012/05/15 7:50, David Miller wrote:
>>> You need strict synchronization between your TX queueing and TX
>>> liberation flows.  So that queue stop and wake are only performed
>>> at the correct moment.
>>
>> I will add netif_queue_stopped() in the sh_eth_poll().
> 
> That doesn't fix the bug.  What if someone transmits a packet and
> fills the TX queue between the netif_queue_stopped() test and the
> call to netif_wake_queue()?
> 
> Adding another test doesn't create the necessary synchronization.
> 

Thank you for the reply again.
I will modify the code as the following. Is it correct?

	if (txfree_num) {
		netif_tx_lock(ndev);
		if (netif_queue_stopped(ndev))
			netif_wake_queue(ndev);
		netif_tx_unlock(ndev);
	}

Best regards,
Yoshihiro Shimoda

* Re: [PATCH 01/12] netvm: Prevent a stream-specific deadlock
From: Peter Zijlstra @ 2012-05-15  9:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Miller, akpm, linux-mm, netdev, linux-nfs, linux-kernel,
	Trond.Myklebust, neilb, hch, michaelc, emunson
In-Reply-To: <20120515091402.GG29102@suse.de>

On Tue, 2012-05-15 at 10:14 +0100, Mel Gorman wrote:
> @@ -289,6 +289,18 @@ void sk_clear_memalloc(struct sock *sk)
>         sock_reset_flag(sk, SOCK_MEMALLOC);
>         sk->sk_allocation &= ~__GFP_MEMALLOC;
>         static_key_slow_dec(&memalloc_socks);
> +
> +       /*
> +        * SOCK_MEMALLOC is allowed to ignore rmem limits to ensure forward
> +        * progress of swapping. However, if SOCK_MEMALLOC is cleared while
> +        * it has rmem allocations there is a risk that the user of the
> +        * socket cannot make forward progress due to exceeding the rmem
> +        * limits. By rights, sk_clear_memalloc() should only be called
> +        * on sockets being torn down but warn and reset the accounting if
> +        * that assumption breaks.
> +        */
> +       if (WARN_ON(sk->sk_forward_alloc))

WARN_ON_ONCE() perhaps?

> +               sk_mem_reclaim(sk);
>  } 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: [PATCH 01/12] netvm: Prevent a stream-specific deadlock
From: Mel Gorman @ 2012-05-15 10:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Miller, akpm, linux-mm, netdev, linux-nfs, linux-kernel,
	Trond.Myklebust, neilb, hch, michaelc, emunson
In-Reply-To: <1337075234.27694.9.camel@twins>

On Tue, May 15, 2012 at 11:47:14AM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-15 at 10:14 +0100, Mel Gorman wrote:
> > @@ -289,6 +289,18 @@ void sk_clear_memalloc(struct sock *sk)
> >         sock_reset_flag(sk, SOCK_MEMALLOC);
> >         sk->sk_allocation &= ~__GFP_MEMALLOC;
> >         static_key_slow_dec(&memalloc_socks);
> > +
> > +       /*
> > +        * SOCK_MEMALLOC is allowed to ignore rmem limits to ensure forward
> > +        * progress of swapping. However, if SOCK_MEMALLOC is cleared while
> > +        * it has rmem allocations there is a risk that the user of the
> > +        * socket cannot make forward progress due to exceeding the rmem
> > +        * limits. By rights, sk_clear_memalloc() should only be called
> > +        * on sockets being torn down but warn and reset the accounting if
> > +        * that assumption breaks.
> > +        */
> > +       if (WARN_ON(sk->sk_forward_alloc))
> 
> WARN_ON_ONCE() perhaps?
> 

I do not expect SOCK_MEMALLOC to be cleared frequently at all with the
possible exception of swapon/swapoff stress tests. If the flag is being
cleared regularly with rmem tokens then that is interesting in itself
but a WARN_ON_ONCE would miss it.
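
For illustration only, the difference can be sketched in user space. The
helpers below are hypothetical stand-ins, not the kernel's WARN_ON() and
WARN_ON_ONCE() macros: one reports every violation, the other reports only
the first, so repeated clearing of SOCK_MEMALLOC with rmem still charged
would show up as repeated reports only with the former.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space stand-ins; these are not the kernel macros. */
static unsigned int warnings_printed;

/* Reports on every call where cond is true, similar in spirit to WARN_ON(). */
static bool sketch_warn_on(bool cond, const char *what)
{
	if (cond) {
		warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/*
 * Reports only the first time cond is true for the given flag, similar in
 * spirit to WARN_ON_ONCE(). The caller supplies the per-site flag.
 */
static bool sketch_warn_on_once(bool cond, bool *already_warned,
				const char *what)
{
	assert(already_warned != NULL);
	if (cond && !*already_warned) {
		*already_warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING (once): %s\n", what);
	}
	return cond;
}

/* Counts how many reports n repeated violations produce under each policy. */
static unsigned int sketch_count_reports(unsigned int n, bool once)
{
	bool warned = false;
	unsigned int before = warnings_printed;

	for (unsigned int i = 0; i < n; i++) {
		if (once)
			sketch_warn_on_once(true, &warned, "sk_forward_alloc != 0");
		else
			sketch_warn_on(true, "sk_forward_alloc != 0");
	}
	return warnings_printed - before;
}
```

Under these assumptions, sketch_count_reports(3, false) produces three
reports while sketch_count_reports(3, true) produces one.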

> > +               sk_mem_reclaim(sk);
> >  } 

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH] xfrm: make xfrm_algo.c a module
From: Jan Beulich @ 2012-05-15 11:28 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, netdev
In-Reply-To: <20120514.183721.1989442389245940694.davem@davemloft.net>

>>> On 15.05.12 at 00:37, David Miller <davem@davemloft.net> wrote:
> From: "Jan Beulich" <JBeulich@suse.com>
> Date: Wed, 09 May 2012 08:53:51 +0100
> 
>> By making this a standalone config option (selected as needed),
>> selecting CRYPTO from here rather than from XFRM (which is boolean)
>> allows the core crypto code to become a module again even when XFRM=y.
>> 
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>  ...
>> @@ -15,9 +15,6 @@
>>  #include <linux/crypto.h>
>>  #include <linux/scatterlist.h>
>>  #include <net/xfrm.h>
>> -#if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE)
>> -#include <net/ah.h>
>> -#endif
> 
> This is completely unrelated to the change you are trying to make in
> this patch.
> 
> It belongs in a separate change.

I apologize for that, I meant to split this out and then forgot. I'll
re-submit the two things separately.

Jan

* [PATCH NEXT 1/2] linux/ethtool: Added macro ETH_FW_DUMP_DISABLE
From: Manish Chopra @ 2012-05-15 11:13 UTC (permalink / raw)
  To: bhutchings
  Cc: davem, netdev, Dept_NX_Linux_NIC_Driver, anirban.chakraborty,
	Manish chopra
In-Reply-To: <1337080419-31786-1-git-send-email-manish.chopra@qlogic.com>

From: Manish chopra <manish.chopra@qlogic.com>

o The flag field of the ethtool_dump structure must be set to this macro's
value, which is zero, when the firmware dump is disabled.
This lets us get the firmware dump capability [enable/disable] via ethtool.

Signed-off-by: Manish chopra <manish.chopra@qlogic.com>
---
 include/linux/ethtool.h |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 89d68d8..fea2ac0 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -661,12 +661,17 @@ struct ethtool_flash {
  * 	%ETHTOOL_SET_DUMP
  * @version: FW version of the dump, filled in by driver
  * @flag: driver dependent flag for dump setting, filled in by driver during
- * 	  get and filled in by ethtool for set operation
+ *        get and filled in by ethtool for set operation.
+ *        flag must be initialized by macro ETH_FW_DUMP_DISABLE value when
+ *        firmware dump is disabled.
  * @len: length of dump data, used as the length of the user buffer on entry to
  * 	 %ETHTOOL_GET_DUMP_DATA and this is returned as dump length by driver
  * 	 for %ETHTOOL_GET_DUMP_FLAG command
  * @data: data collected for get dump data operation
  */
+
+#define ETH_FW_DUMP_DISABLE 0
+
 struct ethtool_dump {
 	__u32	cmd;
 	__u32	version;
-- 
1.7.1

* [PATCH NEXT 2/2] qlcnic-ethtool: set the ethtool_dump flag by ETH_FW_DUMP_DISABLE value that is zero, if firmware dump is disabled.
From: Manish Chopra @ 2012-05-15 11:13 UTC (permalink / raw)
  To: bhutchings
  Cc: davem, netdev, Dept_NX_Linux_NIC_Driver, anirban.chakraborty,
	Manish chopra
In-Reply-To: <1337080419-31786-1-git-send-email-manish.chopra@qlogic.com>

From: Manish chopra <manish.chopra@qlogic.com>

Signed-off-by: Manish chopra <manish.chopra@qlogic.com>
---
 .../net/ethernet/qlogic/qlcnic/qlcnic_ethtool.c    |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_ethtool.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_ethtool.c
index 735423f..9e9e78a 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_ethtool.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_ethtool.c
@@ -1232,7 +1232,12 @@ qlcnic_get_dump_flag(struct net_device *netdev, struct ethtool_dump *dump)
 		dump->len = fw_dump->tmpl_hdr->size + fw_dump->size;
 	else
 		dump->len = 0;
-	dump->flag = fw_dump->tmpl_hdr->drv_cap_mask;
+
+	if (!fw_dump->enable)
+		dump->flag = ETH_FW_DUMP_DISABLE;
+	else
+		dump->flag = fw_dump->tmpl_hdr->drv_cap_mask;
+
 	dump->version = adapter->fw_version;
 	return 0;
 }
-- 
1.7.1

* [PATCH net-next 0/2] ethtool changes
From: Manish Chopra @ 2012-05-15 11:13 UTC (permalink / raw)
  To: bhutchings
  Cc: davem, netdev, Dept_NX_Linux_NIC_Driver, anirban.chakraborty,
	Manish Chopra

From: Manish Chopra <manish.chopra@qlogic.com> 

Please apply it to net-next.

Thanks,
Manish

* Re: [PATCH] ipv6: fix incorrect ipsec transport mode fragment
From: Steffen Klassert @ 2012-05-15 11:48 UTC (permalink / raw)
  To: Gao feng; +Cc: netdev, davem, lw
In-Reply-To: <4FB1D11A.4040206@cn.fujitsu.com>

On Tue, May 15, 2012 at 11:44:26AM +0800, Gao feng wrote:
> 
> How about adding a function pointer append_data to struct rt6_info?
> Then we can just call rt->append_data in ip6_append_data without
> considering which mode it is.
> 

If you want to use a function pointer, it should go to struct xfrm_mode.
That's where the IPsec mode dependent functions reside.

As a side note, I'll be off for three weeks starting from tomorrow.
I'll have no e-mail access most of the time, so I'll probably not
respond for the next three weeks.

* [PATCH, v2] xfrm: make xfrm_algo.c a module
From: Jan Beulich @ 2012-05-15 11:57 UTC (permalink / raw)
  To: davem, netdev; +Cc: linux-kernel

By making this a standalone config option (auto-selected as needed),
selecting CRYPTO from here rather than from XFRM (which is boolean)
allows the core crypto code to become a module again even when XFRM=y.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

---
v2: Drop an unrelated change.
---
 net/ipv4/Kconfig     |    4 ++--
 net/ipv6/Kconfig     |    4 ++--
 net/xfrm/Kconfig     |   13 +++++++++----
 net/xfrm/Makefile    |    3 ++-
 net/xfrm/xfrm_algo.c |    2 ++
 5 files changed, 17 insertions(+), 9 deletions(-)

--- 3.4-rc7/net/ipv4/Kconfig
+++ 3.4-rc7-xfrm-algo-module/net/ipv4/Kconfig
@@ -312,7 +312,7 @@ config SYN_COOKIES
 
 config INET_AH
 	tristate "IP: AH transformation"
-	select XFRM
+	select XFRM_ALGO
 	select CRYPTO
 	select CRYPTO_HMAC
 	select CRYPTO_MD5
@@ -324,7 +324,7 @@ config INET_AH
 
 config INET_ESP
 	tristate "IP: ESP transformation"
-	select XFRM
+	select XFRM_ALGO
 	select CRYPTO
 	select CRYPTO_AUTHENC
 	select CRYPTO_HMAC
--- 3.4-rc7/net/ipv6/Kconfig
+++ 3.4-rc7-xfrm-algo-module/net/ipv6/Kconfig
@@ -69,7 +69,7 @@ config IPV6_OPTIMISTIC_DAD
 
 config INET6_AH
 	tristate "IPv6: AH transformation"
-	select XFRM
+	select XFRM_ALGO
 	select CRYPTO
 	select CRYPTO_HMAC
 	select CRYPTO_MD5
@@ -81,7 +81,7 @@ config INET6_AH
 
 config INET6_ESP
 	tristate "IPv6: ESP transformation"
-	select XFRM
+	select XFRM_ALGO
 	select CRYPTO
 	select CRYPTO_AUTHENC
 	select CRYPTO_HMAC
--- 3.4-rc7/net/xfrm/Kconfig
+++ 3.4-rc7-xfrm-algo-module/net/xfrm/Kconfig
@@ -3,12 +3,17 @@
 #
 config XFRM
        bool
-       select CRYPTO
        depends on NET
 
+config XFRM_ALGO
+	tristate
+	select XFRM
+	select CRYPTO
+
 config XFRM_USER
 	tristate "Transformation user configuration interface"
-	depends on INET && XFRM
+	depends on INET
+	select XFRM_ALGO
 	---help---
 	  Support for Transformation(XFRM) user configuration interface
 	  like IPsec used by native Linux tools.
@@ -48,13 +53,13 @@ config XFRM_STATISTICS
 
 config XFRM_IPCOMP
 	tristate
-	select XFRM
+	select XFRM_ALGO
 	select CRYPTO
 	select CRYPTO_DEFLATE
 
 config NET_KEY
 	tristate "PF_KEY sockets"
-	select XFRM
+	select XFRM_ALGO
 	---help---
 	  PF_KEYv2 socket family, compatible to KAME ones.
 	  They are required if you are going to use IPsec tools ported
--- 3.4-rc7/net/xfrm/Makefile
+++ 3.4-rc7-xfrm-algo-module/net/xfrm/Makefile
@@ -3,8 +3,9 @@
 #
 
 obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_hash.o \
-		      xfrm_input.o xfrm_output.o xfrm_algo.o \
+		      xfrm_input.o xfrm_output.o \
 		      xfrm_sysctl.o xfrm_replay.o
 obj-$(CONFIG_XFRM_STATISTICS) += xfrm_proc.o
+obj-$(CONFIG_XFRM_ALGO) += xfrm_algo.o
 obj-$(CONFIG_XFRM_USER) += xfrm_user.o
 obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
--- 3.4-rc7/net/xfrm/xfrm_algo.c
+++ 3.4-rc7-xfrm-algo-module/net/xfrm/xfrm_algo.c
@@ -752,3 +752,5 @@ void *pskb_put(struct sk_buff *skb, stru
 }
 EXPORT_SYMBOL_GPL(pskb_put);
 #endif
+
+MODULE_LICENSE("GPL");

* [PATCH] xfrm_algo: drop an unnecessary inclusion
From: Jan Beulich @ 2012-05-15 12:00 UTC (permalink / raw)
  To: davem, netdev; +Cc: linux-kernel

For several releases, this has not been needed anymore, as no helper
functions declared in net/ah.h get implemented by xfrm_algo.c anymore.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

---
 net/xfrm/xfrm_algo.c |    3 ---
 1 file changed, 3 deletions(-)

--- 3.4-rc7/net/xfrm/xfrm_algo.c
+++ 3.4-rc7-xfrm-algo-module/net/xfrm/xfrm_algo.c
@@ -15,9 +15,6 @@
 #include <linux/crypto.h>
 #include <linux/scatterlist.h>
 #include <net/xfrm.h>
-#if defined(CONFIG_INET_AH) || defined(CONFIG_INET_AH_MODULE) || defined(CONFIG_INET6_AH) || defined(CONFIG_INET6_AH_MODULE)
-#include <net/ah.h>
-#endif
 #if defined(CONFIG_INET_ESP) || defined(CONFIG_INET_ESP_MODULE) || defined(CONFIG_INET6_ESP) || defined(CONFIG_INET6_ESP_MODULE)
 #include <net/esp.h>
 #endif

* [PATCH v3 1/6] netfilter: sanity checks on NFPROTO_NUMPROTO
From: Alban Crequy @ 2012-05-15 12:32 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Patrick McHardy
  Cc: Alban Crequy, Javier Martinez Canillas, Vincent Sanders,
	netfilter-devel, netdev
In-Reply-To: <20120514190416.GD14897@1984>

With the NFPROTO_* constants introduced by commit 7e9c6e ("netfilter: Introduce
NFPROTO_* constants"), it is too easy to confuse PF_* and NFPROTO_* constants
in new protocols.

Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
---
v2:
 - use WARN
 - return -EINVAL
v3:
 - two checks

 net/netfilter/core.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index e1b7e05..448b531 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -67,6 +67,16 @@ int nf_register_hook(struct nf_hook_ops *reg)
 	struct nf_hook_ops *elem;
 	int err;
 
+	if (reg->pf >= NFPROTO_NUMPROTO) {
+		WARN(1, "netfilter: Invalid nfproto %d\n", reg->pf);
+		return -EINVAL;
+	}
+
+	if (reg->hooknum >= NF_MAX_HOOKS) {
+		WARN(1, "netfilter: Invalid hooknum %d\n", reg->hooknum);
+		return -EINVAL;
+	}
+
 	err = mutex_lock_interruptible(&nf_hook_mutex);
 	if (err < 0)
 		return err;
-- 
1.7.2.5


* Re: Question about be2net error field, rx_drops_no_pbuf
From: Marcelo Leitner @ 2012-05-15 12:45 UTC (permalink / raw)
  To: Sathya.Perla; +Cc: netdev
In-Reply-To: <3367B80B08154D42A3B2BC708B5D41F64580B0D945@EXMAIL.ad.emulex.com>

On 05/15/2012 01:28 AM, Sathya.Perla@Emulex.Com wrote:
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>> Behalf Of Marcelo Leitner
>
>> What does 'rx_drops_no_pbuf' mean at be2net driver? I can see it is a
>> hardware counter for some type of error, which I would like to know
>> about. What causes it?
>>
>> All documentation I could find about it is a comment referring firmware
>> specification
>
> Brief descriptions of the counters are in be_ethtool.c:
> /* Received packets dropped due to lack of available HW packet buffers
>    * used to temporarily hold the received packets.
>    */
> {DRVSTAT_INFO(rx_drops_no_pbuf)}
>
> pbufs are HW buffers for parking incoming pkts before they are transferred to the host.
> You can see this counter go up when the transfer speed of the slot is not fast enough.
> lspci -vv?

Oh! My bad, sorry.

NIC is using MTU 9000 bytes. We were seeing a lot of rx_drops_no_frags, 
then we boosted rx_frag_size to 8192 and it solved all problems for this 
counter. Then rx_drops_no_pbuf started happening, but in much lower 
scale: ~14 counter hits for 22TiB RX traffic. It's the only error 
counter that is different from 0.

Only one port is up.

 From be_ethtool.c:
     /* Received packets dropped due to lack of available fetched buffers
      * posted by the driver.
      */
     {DRVSTAT_RX_INFO(rx_drops_no_frags)}

So both counters are related, but handling different error points, 
right? I was thinking about lowering rx-usecs-high (from 96usec to ~80) 
to make refilling more frequent. What do you think?
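
To put the "~14 counter hits for 22TiB" figure in proportion, here is a
rough back-of-the-envelope sketch. It assumes every received frame is a
full 9000 bytes, which is not stated above and undercounts the real number
of frames, so the resulting ratio is an upper bound. The function names
are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of full-size frames needed to carry total_bytes (rounded down). */
static uint64_t sketch_frames(uint64_t total_bytes, uint32_t frame_bytes)
{
	assert(frame_bytes > 0);
	return total_bytes / frame_bytes;
}

/* Drops expressed per billion frames. */
static double sketch_drops_per_billion(uint64_t drops, uint64_t frames)
{
	assert(frames > 0);
	return (double)drops * 1e9 / (double)frames;
}

/* 22 TiB of RX traffic, as quoted above; 1 TiB = 2^40 bytes. */
static uint64_t sketch_rx_bytes(void)
{
	return 22ULL << 40;
}
```

With these assumptions, sketch_frames(sketch_rx_bytes(), 9000) is about
2.69 billion frames, and 14 drops works out to roughly 5 drops per billion
frames.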

This is a RHEL 5.8 host, btw. lspci -nvv for the port in use:

Thanks!
Marcelo.

0b:00.1 Ethernet controller: Emulex Corporation OneConnect 10Gb NIC (rev 02)
0b:00.1 0200: 19a2:0700 (rev 02)
         Subsystem: 103c:1747
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
         Latency: 0, Cache Line Size: 64 bytes
         Interrupt: pin B routed to IRQ 59
         Region 1: Memory at d0308000 (32-bit, non-prefetchable) [size=16K]
         Region 2: Memory at d03a0000 (64-bit, non-prefetchable) [size=128K]
         Region 4: Memory at d03e0000 (64-bit, non-prefetchable) [size=128K]
         [virtual] Expansion ROM at e0a80000 [disabled] [size=512K]
         Capabilities: [40] Power Management version 3
                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [48] MSI-X: Enable+ Count=32 Masked-
                 Vector table: BAR=1 offset=00002000
                 PBA: BAR=1 offset=00003000
         Capabilities: [c0] Express (v2) Endpoint, MSI 00
                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <16us
                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                 DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                         MaxPayload 256 bytes, MaxReadReq 4096 bytes
                 DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr+ TransPend-
                 LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 <1us, L1 <16us
                         ClockPM- Surprise- LLActRep- BwNot-
                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                 LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                 DevCap2: Completion Timeout: Not Supported, TimeoutDis-
                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                 LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                          Compliance De-emphasis: -6dB
                 LnkSta2: Current De-emphasis Level: -6dB
         Capabilities: [100 v1] Advanced Error Reporting
                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                 UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                 CEMsk:  RxErr- BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
         Capabilities: [194 v1] Device Serial Number XXXXXXXX
         Kernel driver in use: be2net
         Kernel modules: be2net

* Re: [PATCH 05/17] mm: allow PF_MEMALLOC from softirq context
From: Mel Gorman @ 2012-05-15 13:07 UTC (permalink / raw)
  To: David Miller
  Cc: akpm, linux-mm, netdev, linux-kernel, neilb, a.p.zijlstra,
	michaelc, emunson
In-Reply-To: <20120514100229.GA29102@suse.de>

On Mon, May 14, 2012 at 11:02:29AM +0100, Mel Gorman wrote:
> Softirqs can run on multiple CPUs sure but the same task should not be
> 	executing the same softirq code. Interrupts are disabled and the
> 	executing process cannot sleep in softirq context so the task flags
> 	cannot "leak" nor can they be concurrently modified.
> 

This comment about hardirq is obviously wrong as __do_softirq() enables
interrupts and can be preempted by a hardirq. I've updated the changelog
now to include the following;

Softirqs can run on multiple CPUs sure but the same task should not be
        executing the same softirq code. Neither should the softirq
        handler be preempted by any other softirq handler so the flags
        should not leak to an unrelated softirq.

Softirqs re-enable hardware interrupts in __do_softirq() so can be
        preempted by hardware interrupts so PF_MEMALLOC is inherited
        by the hard IRQ. However, this is similar to a process in
        reclaim being preempted by a hardirq. While PF_MEMALLOC is
        set, gfp_to_alloc_flags() distinguishes between hard and
        soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
        flag.

If the softirq is deferred to ksoftirqd then its flags may be used
        instead of a normal task's but as the softirq cannot be preempted,
        the PF_MEMALLOC flag does not leak to other code by accident.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH v3 6/6] net: sh_eth: use NAPI
From: Francois Romieu @ 2012-05-15 13:43 UTC (permalink / raw)
  To: Shimoda, Yoshihiro; +Cc: David Miller, netdev, linux-sh
In-Reply-To: <4FB225F1.20407@renesas.com>

Shimoda, Yoshihiro <yoshihiro.shimoda.uh@renesas.com> :
[...]
> I will modify the code as the following. Is it correct?

No.

Please take a look at tg3, especially the ...mb() calls and rework the
queue disabling part in the xmit handler. Btw sh_eth_txfree has no
business being called from the xmit and poll handlers at the same time.

There is no memory barrier in the xmit handler. It is not clear if the
poll thread always sees tx ring status and cur_tx updates in the same
order or not. If not, the driver may unmap before the device actually
accesses skb buffer.
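
As a rough illustration of the stop, barrier, re-check idea, here is a
user-space sketch using C11 atomics. It is not tg3's code and not a
drop-in for sh_eth: struct sketch_txq and the sketch_* functions are
hypothetical, atomic_thread_fence() stands in for the kernel's smp_mb(),
and the stopped flag stands in for netif_stop_queue() and
netif_wake_queue(). C11 atomic loads and stores are already sequentially
consistent by default, so the fences here only mark where the kernel
barriers would go.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_RING_SIZE 4u

struct sketch_txq {
	atomic_uint prod;    /* descriptors queued by the xmit path */
	atomic_uint cons;    /* descriptors reclaimed by the poll path */
	atomic_bool stopped; /* stands in for the queue-stopped state */
};

static void sketch_txq_init(struct sketch_txq *q)
{
	atomic_init(&q->prod, 0u);
	atomic_init(&q->cons, 0u);
	atomic_init(&q->stopped, false);
}

/* Free descriptors; unsigned subtraction tolerates counter wrap. */
static unsigned int sketch_tx_avail(struct sketch_txq *q)
{
	return SKETCH_RING_SIZE - (atomic_load(&q->prod) - atomic_load(&q->cons));
}

/* xmit side: returns false if the ring was already full. */
static bool sketch_xmit(struct sketch_txq *q)
{
	if (sketch_tx_avail(q) == 0) {
		atomic_store(&q->stopped, true);
		return false;
	}
	atomic_fetch_add(&q->prod, 1u);
	if (sketch_tx_avail(q) == 0) {
		atomic_store(&q->stopped, true);
		/* Barrier, then re-check: the poll side may have freed
		 * entries between the test above and the stop. */
		atomic_thread_fence(memory_order_seq_cst);
		if (sketch_tx_avail(q) > 0)
			atomic_store(&q->stopped, false);
	}
	return true;
}

/* poll side: reclaim n completed descriptors, then wake if needed. */
static void sketch_tx_free(struct sketch_txq *q, unsigned int n)
{
	assert(n <= atomic_load(&q->prod) - atomic_load(&q->cons));
	atomic_fetch_add(&q->cons, n);
	/* Make the new cons value visible before testing the stopped flag. */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load(&q->stopped) && sketch_tx_avail(q) > 0)
		atomic_store(&q->stopped, false);
}
```

The point of the sketch is that each side publishes its own update before
it reads the other side's state, so a stop cannot be left in place with
free descriptors available and nobody around to wake the queue.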

-- 
Ueimor

* PROBLEM: Fragmentation issue with 1521 bytes ip packets
From: Omar Alhassane @ 2012-05-15 14:00 UTC (permalink / raw)
  To: netdev

Hello Folks,

I think I may have found a problem with the Linux networking stack.
Below is a description of the problem.

[1.] One line summary of the problem:
No response to pings of certain sizes.

[2.] Full description of the problem/report:
Using hping3, when I ping a Linux machine with 1521-byte IP packets I
get only one response.
But when I use 1482 bytes, everything works fine. I've tried this with
both TCP and UDP. The MTU of my interface is 1500.
[3.] Keywords (i.e., modules, networking, kernel):
ip, udp, tcp, networking, fragmentation
[4.] Kernel version (from /proc/version):
3.3.1
[5.] Output of Oops.. message (if applicable) with symbolic information
[6.] A small shell script or example program which triggers the
problem (if possible)
The following commands work only if the target has TCP port 22 open

hping3 -d 1481 -S -P 22 10.0.30.225 (only one response)
hping3 -d 1482 -S -P 22 10.0.30.225 (works fine)
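
For reference, the fragment sizes implied by these commands can be worked
out with a small sketch. It assumes a 20-byte IPv4 header and a 20-byte
TCP header without options, so that -d 1481 gives a 1521-byte IP packet
and -d 1482 gives 1522 bytes; the helper names are made up for this
illustration and the code does not explain the missing replies.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_IP_HDR 20u /* IPv4 header without options (assumption) */

/* Payload bytes a non-final fragment can carry: a multiple of 8. */
static uint32_t sketch_frag_payload(uint32_t mtu)
{
	assert(mtu >= SKETCH_IP_HDR + 8u);
	return ((mtu - SKETCH_IP_HDR) / 8u) * 8u;
}

/* Number of fragments needed for an IPv4 packet of total_len bytes. */
static uint32_t sketch_frag_count(uint32_t total_len, uint32_t mtu)
{
	uint32_t payload, per_frag;

	assert(total_len > SKETCH_IP_HDR);
	if (total_len <= mtu)
		return 1;
	payload = total_len - SKETCH_IP_HDR;
	per_frag = sketch_frag_payload(mtu);
	return (payload + per_frag - 1u) / per_frag;
}

/* Total length, header included, of the last fragment. */
static uint32_t sketch_last_frag_len(uint32_t total_len, uint32_t mtu)
{
	uint32_t payload, per_frag, rem;

	assert(total_len > SKETCH_IP_HDR);
	if (total_len <= mtu)
		return total_len;
	payload = total_len - SKETCH_IP_HDR;
	per_frag = sketch_frag_payload(mtu);
	rem = payload % per_frag;
	if (rem == 0)
		rem = per_frag;
	return rem + SKETCH_IP_HDR;
}
```

With an MTU of 1500, both sizes split into two fragments: a full
1500-byte first fragment carrying 1480 payload bytes, followed by a
41-byte fragment for the 1521-byte packet or a 42-byte fragment for the
1522-byte one.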

Can somebody confirm if this is a problem?

Thanks

* Re: [PATCH NEXT 1/2] linux/ethtool: Added macro ETH_FW_DUMP_DISABLE
From: Ben Hutchings @ 2012-05-15 14:01 UTC (permalink / raw)
  To: Manish Chopra
  Cc: davem, netdev, Dept_NX_Linux_NIC_Driver, anirban.chakraborty
In-Reply-To: <1337080419-31786-2-git-send-email-manish.chopra@qlogic.com>

On Tue, 2012-05-15 at 07:13 -0400, Manish Chopra wrote:
> From: Manish chopra <manish.chopra@qlogic.com>
> 
> o The flag field of the ethtool_dump structure must be set to this macro's
> value, which is zero, when the firmware dump is disabled.
> This lets us get the firmware dump capability [enable/disable] via ethtool.
> 
> Signed-off-by: Manish chopra <manish.chopra@qlogic.com>
Reviewed-by: Ben Hutchings <bhutchings@solarflare.com>

> ---
>  include/linux/ethtool.h |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index 89d68d8..fea2ac0 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -661,12 +661,17 @@ struct ethtool_flash {
>   * 	%ETHTOOL_SET_DUMP
>   * @version: FW version of the dump, filled in by driver
>   * @flag: driver dependent flag for dump setting, filled in by driver during
> - * 	  get and filled in by ethtool for set operation
> + *        get and filled in by ethtool for set operation.
> + *        flag must be initialized by macro ETH_FW_DUMP_DISABLE value when
> + *        firmware dump is disabled.
>   * @len: length of dump data, used as the length of the user buffer on entry to
>   * 	 %ETHTOOL_GET_DUMP_DATA and this is returned as dump length by driver
>   * 	 for %ETHTOOL_GET_DUMP_FLAG command
>   * @data: data collected for get dump data operation
>   */
> +
> +#define ETH_FW_DUMP_DISABLE 0
> +
>  struct ethtool_dump {
>  	__u32	cmd;
>  	__u32	version;

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

* (unknown), 
From: Omar Alhassane @ 2012-05-15 14:07 UTC (permalink / raw)
  To: netdev

subscribe netdev

* Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Denys Fedoryshchenko @ 2012-05-15 14:15 UTC (permalink / raw)
  To: netdev, e1000-devel, jeffrey.t.kirsher, jesse.brandeburg

Hi

I have two identical servers, Sun Fire X4150, running different flavors
of Linux, x86_64 and i386.
04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
The interface I am using now:
#ethtool -i eth0
driver: e1000e
version: 1.9.5-k
firmware-version: 2.1-11
bus-info: 0000:04:00.0
There are 2 CPUs, Intel(R) Xeon(R) CPU E5440 @ 2.83GHz.

The i386 machine was acting as NAT and shaper, and as soon as I removed
the shaper from it, I started to experience strange lockups: traffic is
normal for 5-30 seconds, then there is a short lockup of 500-3000 ms
(usually around 1000 ms) with the dropped packets counter increasing. I
suspected it was due to load, but it seems I was wrong.
Recently, on the other server, the x86_64 one I use for development, I
upgraded the kernel (it was old, from the 2.6 series) and on a completely
idle machine started to experience the same latency spikes: while I am
just running mc and, for example, typing in a text editor, I notice
"stalls". After investigating a little more, I also noticed a small
number of drops on the interface. No tcpdump was running. This machine is
idle, and the only traffic there is some small broadcasts from the
network, my ssh, and ping.

Dropped packets in ifconfig
           RX packets:3752868 errors:0 dropped:5350 overruns:0 frame:0
The counter sometimes increases when a stall happens.

ethtool -S is clean; it shows no dropped packets.

I tried checking the load (mpstat and perf) and found nothing
suspicious; latencytop doesn't show anything suspicious either.
dropwatch reports a lot of drops, but mostly because of broadcasts and
the like. tcpdump at the moment of such drops doesn't show anything
suspicious.
Changed the qdisc from the default pfifo_fast to bfifo, without any result.
Tried:  ethtool -K eth0 tso off gso off gro off sg off, no result
The problem occurred on 3.3.6 through 3.4.0-rc7, most probably on 3.3.0
too, but I don't remember for sure. I think on some kernels like 3.1 it
probably doesn't occur; I will check soon, because the problem is not
always reliably reproducible. All tests were done on 3.4.0-rc7.

I also ran tcpdump in the background, plus iptables with timestamps, and
at the time a stall occurred it seems I was still receiving packets
properly; with iperf UDP (from some host to this SunFire) no packets are
missing at these moments either. But I am sure the RX interface errors
are increasing.
If I run iperf from the SunFire to a test host, there is packet loss at
the moments when a stall occurs.

I suspect that for some reason the network card stops transmitting, but I
am unable to pinpoint the issue. All other hosts in this network are fine
and don't have such problems.
Can you help me with that, please? I can provide more debug information,
compile with patches, etc. I will also try falling back to 3.1 and 3.0
kernels.

Here is how it occurs and how I reproduce it:
I just open a file and start to scroll it in mc, then in another console
I run ping
[1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162 ttl=64 time=0.485 ms
[1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163 ttl=64 time=0.470 ms
[1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164 ttl=64 time=424 ms
[1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165 ttl=64 time=324 ms
[1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166 ttl=64 time=214 ms
[1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167 ttl=64 time=104 ms
[1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168 ttl=64 time=1.15 ms
[1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169 ttl=64 time=0.477 ms
[1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170 ttl=64 time=0.667 ms
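
A small sketch of the arithmetic behind this ping output: subtracting each
reported round-trip time from its receive timestamp gives an implied send
time. The values are copied from the output above (icmp_req 163 to 168),
expressed in microseconds relative to 1337089061.000000; the struct and
function names are made up for this illustration, and the three-digit
millisecond RTTs limit the precision to about half a millisecond.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One ping reply: receive timestamp and reported RTT, in microseconds.
 * Timestamps are relative to 1337089061.000000 to keep numbers small. */
struct sketch_reply {
	int seq;
	int64_t recv_us;
	int64_t rtt_us;
};

/* Values copied from the ping output (icmp_req 163 to 168). */
static const struct sketch_reply sketch_replies[] = {
	{ 163,  944138,    470 },
	{ 164, 1467759, 424000 },
	{ 165, 1467899, 324000 },
	{ 166, 1468058, 214000 },
	{ 167, 1468161, 104000 },
	{ 168, 1468958,   1150 },
};

#define SKETCH_NREPLIES (sizeof(sketch_replies) / sizeof(sketch_replies[0]))

/* Implied send time: receive timestamp minus round-trip time. */
static int64_t sketch_send_us(const struct sketch_reply *r)
{
	assert(r != NULL);
	return r->recv_us - r->rtt_us;
}

/* Largest gap between consecutive receive timestamps. */
static int64_t sketch_max_recv_gap_us(const struct sketch_reply *r, size_t n)
{
	int64_t max_gap = 0;

	assert(r != NULL);
	for (size_t i = 1; i < n; i++) {
		int64_t gap = r[i].recv_us - r[i - 1].recv_us;

		if (gap > max_gap)
			max_gap = gap;
	}
	return max_gap;
}
```

By this calculation the requests 164 to 167 were issued roughly 100 ms
apart, while their replies were all received within about half a
millisecond of each other after a gap of roughly 524 ms.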

Remote host tcpdump:
1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 163, length 1480
1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 164, length 1480
1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 164, length 1480
1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 165, length 1480
1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 165, length 1480
1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 166, length 1480
1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 166, length 1480
1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 167, length 1480
1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 167, length 1480
1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 168, length 1480
1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 168, length 1480

Local host (SunFire) tcpdump:
1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 162, length 1480
1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 163, length 1480
1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 163, length 1480
1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 164, length 1480
1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 165, length 1480
1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 166, length 1480
1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 167, length 1480
1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 164, length 1480
1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 168, length 1480
1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 165, length 1480
1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 166, length 1480
1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 167, length 1480
1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 168, length 1480
1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo request, id 3486, seq 169, length 1480
1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo reply, id 3486, seq 169, length 1480

lspci -t
centaur src # lspci -t
-[0000:00]-+-00.0
            +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]--
            |               |               \-02.0-[04]--+-00.0
            |               |                            \-00.1
            |               \-00.3-[05]--
            +-03.0-[06]--
            +-04.0-[07]----00.0
            +-05.0-[08]--
            +-06.0-[09]--
            +-07.0-[0a]--
            +-08.0
            +-10.0
            +-10.1
            +-10.2
            +-11.0
            +-13.0
            +-15.0
            +-16.0
            +-1c.0-[0b]--+-00.0
            |            \-00.1
            +-1d.0
            +-1d.1
            +-1d.2
            +-1d.3
            +-1d.7
            +-1e.0-[0c]----05.0
            +-1f.0
            +-1f.1
            +-1f.2
            \-1f.3
lspci
00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub (rev b1)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev b1)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev b1)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 4-5 (rev b1)
00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev b1)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 6-7 (rev b1)
00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev b1)
00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA Engine (rev b1)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 (rev 09)
00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE Controller (rev 09)
00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller (rev 09)
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E3 (rev 01)
04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01)
07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family


dmesg:
[    4.936885] e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
[    4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation.
[    4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k
[    4.936967] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
[    4.938529] e1000e 0000:04:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
[    4.992246] e1000e 0000:04:00.0: eth0: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:f8
[    4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 Network Connection
[    4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No: FFFFFF-0FF
[    4.994745] e1000e 0000:04:00.1: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X
[    5.050901] e1000e 0000:04:00.1: eth1: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:f9
[    5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 Network Connection
[    5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No: FFFFFF-0FF
[    5.051857] e1000e 0000:0b:00.0: Disabling ASPM  L1
[    5.052168] e1000e 0000:0b:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
[    5.223454] e1000e 0000:0b:00.0: eth2: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:fa
[    5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 Network Connection
[    5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No: C83246-002
[    5.224412] e1000e 0000:0b:00.1: Disabling ASPM  L1
[    5.224709] e1000e 0000:0b:00.1: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
[    5.397603] e1000e 0000:0b:00.1: eth3: (PCI Express:2.5GT/s:Width x4) 00:1e:68:04:99:fb
[    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network Connection
[    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No: C83246-002
[   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
[   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
[   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

* TCPBacklogDrops during aggressive bursts of traffic
From: Kieran Mansley @ 2012-05-15 14:38 UTC (permalink / raw)
  To: netdev

I've been investigating an issue with TCPBacklogDrops being reported
(and relatively poor performance as a result).  The problem is most
easily observed on slightly older kernels (e.g. 3.0.13) but is still
present in 3.3.6, although harder to reproduce.  I've also seen it in
2.6 series kernels, so it's not a recent issue.

The problem occurs at the receiver when a TCP sender with a large
congestion window is sending at a high rate and the receiving
application has blocked in a recv() or similar call.  During the stream,
ACKs are being returned to the sender, keeping the receive window open
and so allowing it to carry on sending.  The local socket receive buffer
gets dynamically increased, and the advertised receive window increases
similarly.

[As an aside, it appears as though the total bytes that the receiver
commits to receiving - i.e. the point at which it stops advertising new
sequence space - is around double the receive socket buffer.  I'm
guessing it is committing to receiving the current socket buffer
(perhaps as there is a pending recv() it knows it will be able to
immediately empty this) and the next one, but I've not looked into this
in detail]

As the socket buffer is approaching full the kernel decides to satisfy
the recv() call and wake the application.  It will have to copy the data
to application address space etc.  At this point there is a switch in
tcp_v4_rcv():

http://lxr.linux.no/#linux+v3.3.6/net/ipv4/tcp_ipv4.c#L1726

Before this point, the "if (!sock_owned_by_user(sk)) " will evaluate to
true, but once it has decided to wake the application I think it will
evaluate to false and it will drop through to:

1739        else if (unlikely(sk_add_backlog(sk, skb))) {
1740                bh_unlock_sock(sk);
1741                NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
1742                goto discard_and_relse;
1743        }

In sk_add_backlog() there is a test to see if the socket's receive
buffer is full, and if it is, the kernel drops the packets, reporting
them through netstat as TCPBacklogDrop.  This is despite there being
potentially megabytes of unused advertised receive window space at this
point.
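
As a rough illustration of that test, here is a simplified user-space
model of what I understand the check to compare.  The struct and
function names are hypothetical stand-ins, not the kernel's own
definitions, and the field layout is an assumption made for the sketch:

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the fields the check reads. */
struct sock_model {
	unsigned int rmem_alloc;   /* bytes charged to the receive queue */
	unsigned int backlog_len;  /* bytes already sitting on the backlog */
	unsigned int rcvbuf;       /* socket receive buffer limit */
};

/*
 * Returns true if a packet of the given truesize would be refused,
 * i.e. receive queue + backlog + this packet would exceed rcvbuf.
 * Note that the advertised receive window plays no part here.
 */
static bool backlog_would_drop(const struct sock_model *sk,
			       unsigned int truesize)
{
	unsigned int qsize = sk->backlog_len + sk->rmem_alloc;

	return qsize + truesize > sk->rcvbuf;
}
```

With rcvbuf at 262144 and 260000 bytes already queued, a 2048-byte
packet is still accepted in this model but a 2304-byte one is dropped,
however much window has been advertised.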

Very shortly afterwards the socket buffer will be empty again (as its
contents will have been transferred to the user) so this is essentially
a race and depends on a fast sender to demonstrate it.  It shows up as
an acute period of drops that are quickly retransmitted and then
accepted.

There are two ways of thinking about this problem: either the receiver
should be more conservative about the receive window it advertises
(limiting it to the available receive socket buffer size); or the
receiver should be more generous with what it will accept on to the
backlog (matching it to the advertised receive window).  It is the
discrepancy between advertised receive window and what can be put on the
backlog that is the root of the problem.  I would be tempted by the
latter and say that as the backlog is likely to soon make it into the
receive buffer, it should be allowed to contain a full receive buffer of
bytes on top of what is currently being removed from the receive buffer
into the application.

It is harder to reproduce on recent kernels because the pending recv()
call gets satisfied very close to the start of a burst, and at this time
the receive buffer will be mostly empty and so it is less likely that
any packets in flight will overflow the backlog.  On earlier kernels it
is easier to reproduce because the pending recv() call didn't return
until the socket's receive buffer was nearly full, and so it would only
take a few extra packets to overflow the backlog.

I have a packet capture to illustrate the problem (taken on 3.0.13) if
that would be of help.  As I can easily reproduce it I'm also happy to
make changes and test to see if they improve matters.

Thanks

Kieran

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-15 14:56 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: netdev
In-Reply-To: <1337092718.1689.45.camel@kjm-desktop.uk.level5networks.com>

On Tue, 2012-05-15 at 15:38 +0100, Kieran Mansley wrote:
> [...]


Please try the latest kernels; this is probably 'fixed'.

What network driver are you using?

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-15 15:00 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: netdev
In-Reply-To: <1337093776.8512.1089.camel@edumazet-glaptop>

On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:

> Please try latest kernels, this is probably 'fixed'
> 
> What network driver are you using ?
> 
> 

commit b49960a05e32121d29316cfdf653894b88ac9190
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed May 2 02:28:41 2012 +0000

    tcp: change tcp_adv_win_scale and tcp_rmem[2]
    
    tcp_adv_win_scale default value is 2, meaning we expect a good citizen
    skb to have skb->len / skb->truesize ratio of 75% (3/4)
    
    In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
    1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
    So these skbs were considered as not bloated.
    
    With recent truesize fixes, a typical MSS=1460 frame truesize is now the
    more precise :
    2048 + 256 = 2304. But 2304 * 3/4 = 1728.
    So these skb are not good citizen anymore, because 1460 < 1728
    
    (GRO can escape this problem because it build skbs with a too low
    truesize.)
    
    This also means tcp advertises a too optimistic window for a given
    allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
    sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
    especially when application is slow to drain its receive queue or in
    case of losses (netperf is fast, scp is slow). This is a major latency
    source.
    
    We should adjust the len/truesize ratio to 50% instead of 75%
    
    This patch :
    
    1) changes tcp_adv_win_scale default to 1 instead of 2
    
    2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account
    better truesize tracking and to allow autotuning tcp receive window to
    reach same value than before. Note that same amount of kernel memory is
    consumed compared to 2.6 kernels.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
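
The arithmetic in the commit message can be checked with a small
user-space sketch.  The helper names below are hypothetical; the
formula (scale 2 keeps 3/4 of the space, scale 1 keeps 1/2) is the one
the commit message describes, not a copy of the kernel function:

```c
/*
 * Window share of 'space' bytes for a given tcp_adv_win_scale:
 * a positive scale n keeps space - space / 2^n, so scale 2 keeps
 * 3/4 of the space and scale 1 keeps 1/2.
 */
static int win_from_space(int space, int adv_win_scale)
{
	if (adv_win_scale <= 0)
		return space >> (-adv_win_scale);
	return space - (space >> adv_win_scale);
}

/* An skb is a "good citizen" if its payload fills its window share. */
static int is_good_citizen(int payload_len, int truesize, int adv_win_scale)
{
	return payload_len >= win_from_space(truesize, adv_win_scale);
}
```

For an MSS=1460 frame this gives 1392 for the old 1856 estimate at
scale 2 (good citizen), 1728 for truesize 2304 at scale 2 (not a good
citizen), and 1152 for truesize 2304 at scale 1 (good citizen again).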

* Re: PROBLEM: Fragmentation issue with 1521 bytes ip packets
From: Eric Dumazet @ 2012-05-15 15:04 UTC (permalink / raw)
  To: Omar Alhassane; +Cc: netdev
In-Reply-To: <CAPFXtPexGpVyGGuKdjpuH+cxOaxa7KeBfmFrs39=NW7LsZ6-bw@mail.gmail.com>

On Tue, 2012-05-15 at 10:00 -0400, Omar Alhassane wrote:
> Hello Folks,
> 
> I think i may have found a problem with the linux networking stack.
> Below is a description of the problem.
> 
> [1.] One line summary of the problem:
> No response to pings of certain sizes.
> 
> [2.] Full description of the problem/report:
> Using hping3, when i ping a linux machine with 1521 bytes ip packets i
> get only one response.
> But when i use 1482 bytes, everything works fine. I've tried this with
> both tcp and udp. The MTU of my interface is 1500.
> [3.] Keywords (i.e., modules, networking, kernel):
> ip, udp, tcp, networking, fragmentation
> [4.] Kernel version (from /proc/version):
> 3.3.1
> [5.] Output of Oops.. message (if applicable) with symbolic information
> [6.] A small shell script or example program which triggers the
> problem (if possible)
> The following commands works only if the target has tcp port 22 open
> 
> hping3 -d 1481 -S -P 22 10.0.30.225 (only one response)
> hping3 -d 1482 -S -P 22 10.0.30.225 (works fine)
> 
> Can somebody confirm if this is a problem?

hping3 bug: all the fragments it sends have the same ID field.

The first two fragments are reassembled by the remote host, which then
sends a SYNACK.

The following fragments are 'ignored' because they have the same ID as
the previous packet.
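
To illustrate why a reused ID matters: per RFC 791, a receiver tells
which datagram a fragment belongs to from the source, destination,
protocol and identification fields.  The struct and helper below are a
hypothetical sketch of that comparison, not kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key identifying the datagram a fragment belongs to. */
struct frag_key {
	uint32_t saddr;     /* source address */
	uint32_t daddr;     /* destination address */
	uint16_t id;        /* IP identification field */
	uint8_t  protocol;  /* upper-layer protocol number */
};

/* Two fragments are treated as parts of one datagram if all fields match. */
static bool same_datagram(const struct frag_key *a, const struct frag_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->id == b->id && a->protocol == b->protocol;
}
```

If a sender keeps the same ID for every packet it sends to the same
host, fragments of the second packet compare equal to those of the
first under this key, so the receiver cannot tell them apart.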

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Kieran Mansley @ 2012-05-15 16:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1337093776.8512.1089.camel@edumazet-glaptop>

On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> 
> Please try latest kernels, this is probably 'fixed'

I've just tried with 3.4.0-rc7 and the problem is still reproducible.
It's perhaps harder to reproduce than on 3.3.6 but still there.

> What network driver are you using ? 

The receiver is using the sfc driver that is included in the kernel
build, together with an SFC 9020 NIC. 

Kieran

* Re: TCPBacklogDrops during aggressive bursts of traffic
From: Eric Dumazet @ 2012-05-15 16:34 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: netdev
In-Reply-To: <1337099368.1689.47.camel@kjm-desktop.uk.level5networks.com>

On Tue, 2012-05-15 at 17:29 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> > 
> > Please try latest kernels, this is probably 'fixed'
> 
> I've just tried with 3.4.0-rc7 and the problem is still reproducible.
> It's perhaps harder to reproduce than on 3.3.6 but still there.
> 
> > What network driver are you using ? 
> 
> The receiver is using the sfc driver that is included in the kernel
> build, together with an SFC 9020 NIC. 
> 
> Kieran
> 

What is the MTU?

What is the typical skb->truesize of the skbs given to the stack in
the RX path?

If the driver uses PAGE_SIZE fragments, you are more likely to hit the
limit.
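
A back-of-the-envelope sketch of why truesize matters, since the limit
is charged in truesize rather than payload bytes.  The helper name and
the example figures are assumptions for illustration only (2304 is the
truesize quoted in the commit message earlier in this thread; 4096 +
256 assumes a 4 KB page plus the same per-skb overhead):

```c
/*
 * Number of whole skbs of a given truesize that fit under a receive
 * buffer limit.  Returns 0 for a zero truesize to avoid dividing by 0.
 */
static unsigned int skbs_until_full(unsigned int rcvbuf,
				    unsigned int truesize)
{
	return truesize ? rcvbuf / truesize : 0;
}
```

With a 262144-byte limit this gives 113 frames at truesize 2304 but
only 60 at truesize 4352, for the same payload per frame.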

* Re: [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC
From: John Fastabend @ 2012-05-15 16:44 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, Oren Duer, Liran Liss, Jamal Hadi Salim,
	Diego Crupnicoff, Or Gerlitz
In-Reply-To: <4FB15BEC.8040000@mellanox.com>

On 5/14/2012 12:24 PM, Amir Vadai wrote:
>>>> On 5/6/2012 12:05 AM, Amir Vadai wrote:
>>>>> This series comes to revive the discussion initiated on the thread "net:
>>>>> support tx_ring per UP in HW based QoS mechanism" (see
>>>>> http://marc.info/?t=133165957200004&r=1&w=2) with the major issue to be address
>>>>> is - how should sk_prio<=>    TC be done, for both, tagged and untagged traffic.
>>>>> Following is a staged description addressing the background, problem
>>>>> description, current situation, suggestion for the change and implementation of
>>>>> it.

[...]

> John Hi,
> 
> After some internal discussions, it was agreed to line up with your
> approach, to leave mqprio an abstract skb->priority <=> queue set
> mapping and to ignore egress_map if mqprio is enabled.
> 

OK sounds good.
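
For anyone following along, the abstract mapping can be sketched in
user space roughly as below.  All names are hypothetical stand-ins
loosely modelled on the priority map and per-class (count, offset)
queue ranges; the modulo is used only to keep the sketch simple and is
not how the kernel scales the hash:

```c
#include <stdint.h>

#define MODEL_MAX_PRIO 16  /* priorities 0..15 */
#define MODEL_MAX_TC   16  /* queue sets ("tc" in the current naming) */

/* A queue set is a contiguous range of TX queues. */
struct queue_set_model {
	uint16_t count;   /* number of queues in the set */
	uint16_t offset;  /* index of the first queue in the set */
};

struct dev_model {
	uint8_t prio_to_set[MODEL_MAX_PRIO];
	struct queue_set_model sets[MODEL_MAX_TC];
};

/* Map priority -> queue set, then spread flows inside the set by hash. */
static uint16_t pick_tx_queue(const struct dev_model *dev,
			      uint32_t priority, uint32_t hash)
{
	uint8_t idx = dev->prio_to_set[priority & (MODEL_MAX_PRIO - 1)];
	const struct queue_set_model *set = &dev->sets[idx];

	if (set->count == 0)
		return set->offset;
	return set->offset + (uint16_t)(hash % set->count);
}
```

With set 0 covering queues 0-3 and set 1 covering queues 4-5, a packet
whose priority maps to set 1 can only ever land on queue 4 or 5.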

> It would be very nice, if the term 'tc' in kernel code would be
> replaced to queue set, since it is very misleading.
> 

Go ahead and write up a patch. Just be careful not to break the existing
user-visible API. I agree it is confusing.

> There still might be some small issues with skb_tx_hash for tagged
> traffic, which I will work on tomorrow, and hopefully will send a new
> patch set with the solution.
> 

What are the issues? Let's see a patch.

Thanks,
John
