Netdev List
* Re: [PATCH RFT RESEND] net: Fix Neptune ethernet driver to check dma mapping error
From: David Miller @ 2012-07-23  6:34 UTC (permalink / raw)
  To: shuah.khan
  Cc: mcarlson, bhutchings, eric.dumazet, mchan, netdev, linux-kernel,
	shuahkhan, stable
In-Reply-To: <1342821035.5434.60.camel@lorien2>

From: Shuah Khan <shuah.khan@hp.com>
Date: Fri, 20 Jul 2012 15:50:35 -0600

> Fix Neptune ethernet driver to check dma mapping error after map_page()
> interface returns.
> 
> Signed-off-by: Shuah Khan <shuah.khan@hp.com>

Applied.


* Re: [PATCH RFT] net: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
From: David Miller @ 2012-07-23  6:35 UTC (permalink / raw)
  To: shuah.khan
  Cc: mcarlson, bhutchings, eric.dumazet, mchan, netdev, linux-kernel,
	shuahkhan
In-Reply-To: <1342827272.5434.71.camel@lorien2>

From: Shuah Khan <shuah.khan@hp.com>
Date: Fri, 20 Jul 2012 17:34:32 -0600

> Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return
> value to be consistent with the rest of the checks after niu_rbr_add_page()
> calls in this file.
> 
> Signed-off-by: Shuah Khan <shuah.khan@hp.com>

Applied.


* Re: [PATCH 1/2] ipvs: ip_vs_ftp depends on nf_conntrack_ftp helper
From: Simon Horman @ 2012-07-23  6:48 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: Pablo Neira Ayuso, lvs-devel, netdev, netfilter-devel,
	Wensong Zhang, Hans Schillstrom, Jesper Dangaard Brouer
In-Reply-To: <alpine.LFD.2.00.1207122238220.1831@ja.ssi.bg>

On Thu, Jul 12, 2012 at 10:43:22PM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 12 Jul 2012, Pablo Neira Ayuso wrote:
> 
> > On Wed, Jul 11, 2012 at 09:25:26AM +0900, Simon Horman wrote:
> > > From: Julian Anastasov <ja@ssi.bg>
> > > 
> > > 	The FTP application indirectly depends on the
> > > nf_conntrack_ftp helper for proper NAT support. If the
> > > module is not loaded, IPVS can resize the packets for the
> > > command connection, eg. PASV response but the SEQ adjustment
> > > logic in ipv4_confirm is not called without helper.
> > > 
> > > Signed-off-by: Julian Anastasov <ja@ssi.bg>
> > > Signed-off-by: Simon Horman <horms@verge.net.au>
> > > ---
> > >  net/netfilter/ipvs/Kconfig | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> > > index f987138..8b2cffd 100644
> > > --- a/net/netfilter/ipvs/Kconfig
> > > +++ b/net/netfilter/ipvs/Kconfig
> > > @@ -250,7 +250,8 @@ comment 'IPVS application helper'
> > >  
> > >  config	IP_VS_FTP
> > >    	tristate "FTP protocol helper"
> > > -        depends on IP_VS_PROTO_TCP && NF_CONNTRACK && NF_NAT
> > > +	depends on IP_VS_PROTO_TCP && NF_CONNTRACK && NF_NAT && \
> > > +		NF_CONNTRACK_FTP
> > 
> > If you require FTP NAT support, then this depends on NF_NAT_FTP
> > instead of NF_CONNTRACK_FTP.
> 
> 	No, I just checked again, it works without nf_nat_ftp,
> only nf_nat, nf_conntrack_ftp and iptable_nat are needed.
> We use packet mangling part from nf_nat (nf_nat_mangle_tcp_packet).

Is there a consensus on this?


* [RFC PATCH 0/1] sched: Add a new API to find the prefer idlest cpu
From: Shirley Ma @ 2012-07-23  6:57 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Michael S. Tsirkin, vivek, sri

Introduce a new API that chooses the CPU for a per-cpu thread from the
cgroup-controlled cpuset (allowed) and a preferred cpuset (the local NUMA
node).

The receiving CPUs of a networking device are not under cgroup control.
When such a device uses a per-cpu thread model, the CPU chosen to process
the packets might not be part of the cgroup cpuset without this API. On a
NUMA system, the preferred cpuset helps to reduce expensive cross-node
memory accesses.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---

 include/linux/sched.h |    2 ++
 kernel/sched/fair.c   |   30 ++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 0 deletions(-)

Thanks
Shirley


* [RFC PATCH 1/1] sched: Add a new API to find the prefer idlest cpu
From: Shirley Ma @ 2012-07-23  6:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, Michael S. Tsirkin, vivek, sri
In-Reply-To: <1343026634.13461.15.camel@oc3660625478.ibm.com>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64d9df5..46cc4a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2806,4 +2806,6 @@ static inline unsigned long rlimit_max(unsigned int limit)
 
 #endif /* __KERNEL__ */
 
+extern int find_idlest_prefer_cpu(struct cpumask *prefer,
+				 struct cpumask *allowed, int prev_cpu);
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c099cc6..7240868 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/profile.h>
 #include <linux/interrupt.h>
+#include <linux/export.h>
 
 #include <trace/events/sched.h>
 
@@ -2809,6 +2810,35 @@ unlock:
 
 	return new_cpu;
 }
+
+/*
+ * This API is used to find the most idle cpu from both the preferred and
+ * the allowed cpuset (such as a cgroup-controlled cpuset). It helps a per-cpu
+ * thread model pick an allowed local cpu to be scheduled on.
+ * If the two cpusets intersect, the cpu is chosen from the intersection;
+ * if they do not intersect, the cpu is chosen from the allowed cpuset.
+ * prev_cpu wins when its load ties the minimum, which helps cache locality.
+ */
+int find_idlest_prefer_cpu(struct cpumask *prefer, struct cpumask *allowed,
+			  int prev_cpu)
+{
+	unsigned long load, min_load = ULONG_MAX;
+	int check, i, idlest = -1;
+
+	check = cpumask_intersects(prefer, allowed);
+	/* Traverse only the allowed CPUs */
+	if (check == 0)
+		prefer = allowed;
+	for_each_cpu_and(i, prefer, allowed) {
+		load = weighted_cpuload(i);
+		if (load < min_load || (load == min_load && i == prev_cpu)) {
+			min_load = load;
+			idlest = i;
+		}
+	}
+	return idlest;
+}
+EXPORT_SYMBOL(find_idlest_prefer_cpu);
 #endif /* CONFIG_SMP */
 
 static unsigned long

Shirley
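
The selection rule in the patch above can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel code: it assumes a cpumask can be modelled as a 64-bit bitmap and that per-cpu load comes from a caller-supplied array rather than weighted_cpuload(). The name pick_idlest_prefer_cpu() and its extra parameters are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Userspace sketch of the selection logic, not kernel code.
 * Assumptions: a cpumask is a 64-bit bitmap (bit i = cpu i) and the
 * load of each cpu is read from load[] instead of weighted_cpuload().
 */
int pick_idlest_prefer_cpu(uint64_t prefer, uint64_t allowed,
			   const unsigned long *load, int nr_cpus,
			   int prev_cpu)
{
	unsigned long min_load = ULONG_MAX;
	uint64_t candidates = prefer & allowed;
	int i, idlest = -1;

	/* No overlap: fall back to every allowed cpu. */
	if (candidates == 0)
		candidates = allowed;

	for (i = 0; i < nr_cpus && i < 64; i++) {
		if (!(candidates & (UINT64_C(1) << i)))
			continue;
		/* A strictly lower load wins; prev_cpu wins a tie. */
		if (load[i] < min_load ||
		    (load[i] == min_load && i == prev_cpu)) {
			min_load = load[i];
			idlest = i;
		}
	}
	return idlest;	/* -1 if no candidate cpu exists */
}
```

With loads {5, 3, 3, 9}, prefer = cpus 1-2 and allowed = cpus 0-3, the sketch returns cpu 2 when prev_cpu is 2 and cpu 1 otherwise, because the tie is broken only in favour of prev_cpu.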


* [3.5 regression / bridge] constantly toggeling between disabled and forwarding
From: Michael Leun @ 2012-07-23  7:15 UTC (permalink / raw)
  To: bridge, netdev; +Cc: linux-kernel, shemminger

Hi,

when I use my usb ethernet adapter

> # lsusb
[...]
Bus 002 Device 009: ID 9710:7830 MosChip Semiconductor MCS7830 10/100 Mbps Ethernet adapter
[...]

as a port of a bridge

> # brctl addbr br0
> # brctl addif br0 eth0
> # brctl addif br0 ue5
> # ifconfig ue5 up
> # ifconfig br0 up

(Also does happen when eth0 is not part of the bridge, but the logs I
had available were from that situation...)

I constantly get messages showing the interface toggling between
disabled and forwarding state:

Jul 23 07:40:50 elektra kernel: [ 1539.497337] br0: port 2(ue5) entered disabled state
Jul 23 07:40:50 elektra kernel: [ 1539.554992] br0: port 2(ue5) entered forwarding state
Jul 23 07:40:50 elektra kernel: [ 1539.555005] br0: port 2(ue5) entered forwarding state
Jul 23 07:40:51 elektra kernel: [ 1540.496242] br0: port 2(ue5) entered disabled state
Jul 23 07:40:51 elektra kernel: [ 1540.552534] br0: port 2(ue5) entered forwarding state
Jul 23 07:40:51 elektra kernel: [ 1540.552548] br0: port 2(ue5) entered forwarding state
Jul 23 07:40:52 elektra kernel: [ 1541.550413] br0: port 2(ue5) entered forwarding state
Jul 23 07:40:53 elektra kernel: [ 1542.529672] br0: port 2(ue5) entered disabled state
Jul 23 07:40:53 elektra kernel: [ 1542.587162] br0: port 2(ue5) entered forwarding state
Jul 23 07:40:53 elektra kernel: [ 1542.587175] br0: port 2(ue5) entered forwarding state
Jul 23 07:40:54 elektra kernel: [ 1543.585309] br0: port 2(ue5) entered forwarding state
Jul 23 07:41:00 elektra kernel: [ 1549.360600] br0: port 2(ue5) entered disabled state
Jul 23 07:41:00 elektra kernel: [ 1549.442998] br0: port 2(ue5) entered forwarding state
Jul 23 07:41:00 elektra kernel: [ 1549.443011] br0: port 2(ue5) entered forwarding state
Jul 23 07:41:01 elektra kernel: [ 1550.357686] br0: port 2(ue5) entered disabled state
Jul 23 07:41:01 elektra kernel: [ 1550.408208] br0: port 2(ue5) entered forwarding state
Jul 23 07:41:01 elektra kernel: [ 1550.408222] br0: port 2(ue5) entered forwarding state
Jul 23 07:41:02 elektra kernel: [ 1551.407656] br0: port 2(ue5) entered forwarding state
Jul 23 07:41:03 elektra kernel: [ 1552.401578] br0: port 2(ue5) entered disabled state
Jul 23 07:41:03 elektra kernel: [ 1552.474773] br0: port 2(ue5) entered forwarding state
Jul 23 07:41:03 elektra kernel: [ 1552.474786] br0: port 2(ue5) entered forwarding state
Jul 23 07:41:04 elektra kernel: [ 1553.472487] br0: port 2(ue5) entered forwarding state
Jul 23 07:41:05 elektra kernel: [ 1554.356138] br0: port 2(ue5) entered disabled state
[...]

This does not happen with 3.4.5 in the same situation (nothing but the
kernel changed).

Does anybody have an idea what the issue might be or do I need to bisect?

-- 
MfG,

Michael Leun


* Re: [PATCH 00/16] Remove the ipv4 routing cache
From: Eric Dumazet @ 2012-07-23  7:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120722.173951.1794347789063177131.davem@davemloft.net>

On Sun, 2012-07-22 at 17:39 -0700, David Miller wrote:
> Just FYI, I'm pushing this work out to net-next now.
> --

Excellent !

Thanks a lot David


* [net-next PATCH 1/1] bnx2x: Add new 57840 device IDs
From: Yuval Mintz @ 2012-07-23  7:25 UTC (permalink / raw)
  To: davem, netdev; +Cc: Yuval Mintz, Eilon Greenstein

The 57840 boards come in two flavours: 2 x 20G and 4 x 10G.
To better differentiate between the two flavours, a separate device ID
was assigned to each.
The silicon default value is still the currently supported 57840 device ID
(0x168d), and since a user can damage the nvram (e.g., 'ethtool -E')
the driver will still support this device ID to allow the user to amend the
nvram back into a supported configuration.

Notice this patch contains lines longer than 80 characters (strings).

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
---
Hi Dave,

Please consider applying this patch to 'net-next'.

Thanks,
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h      |   15 ++++++++---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c |    2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |   31 ++++++++++++++++++---
 3 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index dbe9791..77bcd4c 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -819,8 +819,11 @@ struct bnx2x_common {
 #define CHIP_NUM_57810_MF		0x16ae
 #define CHIP_NUM_57811			0x163d
 #define CHIP_NUM_57811_MF		0x163e
-#define CHIP_NUM_57840			0x168d
-#define CHIP_NUM_57840_MF		0x16ab
+#define CHIP_NUM_57840_OBSOLETE	0x168d
+#define CHIP_NUM_57840_MF_OBSOLETE	0x16ab
+#define CHIP_NUM_57840_4_10		0x16a1
+#define CHIP_NUM_57840_2_20		0x16a2
+#define CHIP_NUM_57840_MF		0x16a4
 #define CHIP_IS_E1(bp)			(CHIP_NUM(bp) == CHIP_NUM_57710)
 #define CHIP_IS_57711(bp)		(CHIP_NUM(bp) == CHIP_NUM_57711)
 #define CHIP_IS_57711E(bp)		(CHIP_NUM(bp) == CHIP_NUM_57711E)
@@ -832,8 +835,12 @@ struct bnx2x_common {
 #define CHIP_IS_57810_MF(bp)		(CHIP_NUM(bp) == CHIP_NUM_57810_MF)
 #define CHIP_IS_57811(bp)		(CHIP_NUM(bp) == CHIP_NUM_57811)
 #define CHIP_IS_57811_MF(bp)		(CHIP_NUM(bp) == CHIP_NUM_57811_MF)
-#define CHIP_IS_57840(bp)		(CHIP_NUM(bp) == CHIP_NUM_57840)
-#define CHIP_IS_57840_MF(bp)		(CHIP_NUM(bp) == CHIP_NUM_57840_MF)
+#define CHIP_IS_57840(bp)		\
+		((CHIP_NUM(bp) == CHIP_NUM_57840_4_10) || \
+		 (CHIP_NUM(bp) == CHIP_NUM_57840_2_20) || \
+		 (CHIP_NUM(bp) == CHIP_NUM_57840_OBSOLETE))
+#define CHIP_IS_57840_MF(bp)	((CHIP_NUM(bp) == CHIP_NUM_57840_MF) || \
+				 (CHIP_NUM(bp) == CHIP_NUM_57840_MF_OBSOLETE))
 #define CHIP_IS_E1H(bp)			(CHIP_IS_57711(bp) || \
 					 CHIP_IS_57711E(bp))
 #define CHIP_IS_E2(bp)			(CHIP_IS_57712(bp) || \
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
index e04b282..f4beb46 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
@@ -1718,7 +1718,7 @@ static void bnx2x_xmac_init(struct link_params *params, u32 max_speed)
 	 * ports of the path
 	 */
 
-	if ((CHIP_NUM(bp) == CHIP_NUM_57840) &&
+	if ((CHIP_NUM(bp) == CHIP_NUM_57840_4_10) &&
 	    (REG_RD(bp, MISC_REG_RESET_REG_2) &
 	     MISC_REGISTERS_RESET_REG_2_XMAC)) {
 		DP(NETIF_MSG_LINK,
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 08eca3f..9aaf863 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -137,7 +137,10 @@ enum bnx2x_board_type {
 	BCM57800_MF,
 	BCM57810,
 	BCM57810_MF,
-	BCM57840,
+	BCM57840_O,
+	BCM57840_4_10,
+	BCM57840_2_20,
+	BCM57840_MFO,
 	BCM57840_MF,
 	BCM57811,
 	BCM57811_MF
@@ -157,6 +160,9 @@ static struct {
 	{ "Broadcom NetXtreme II BCM57810 10 Gigabit Ethernet" },
 	{ "Broadcom NetXtreme II BCM57810 10 Gigabit Ethernet Multi Function" },
 	{ "Broadcom NetXtreme II BCM57840 10/20 Gigabit Ethernet" },
+	{ "Broadcom NetXtreme II BCM57840 10 Gigabit Ethernet" },
+	{ "Broadcom NetXtreme II BCM57840 20 Gigabit Ethernet" },
+	{ "Broadcom NetXtreme II BCM57840 10/20 Gigabit Ethernet Multi Function"},
 	{ "Broadcom NetXtreme II BCM57840 10/20 Gigabit Ethernet Multi Function"},
 	{ "Broadcom NetXtreme II BCM57811 10 Gigabit Ethernet"},
 	{ "Broadcom NetXtreme II BCM57811 10 Gigabit Ethernet Multi Function"},
@@ -189,8 +195,17 @@ static struct {
 #ifndef PCI_DEVICE_ID_NX2_57810_MF
 #define PCI_DEVICE_ID_NX2_57810_MF	CHIP_NUM_57810_MF
 #endif
-#ifndef PCI_DEVICE_ID_NX2_57840
-#define PCI_DEVICE_ID_NX2_57840		CHIP_NUM_57840
+#ifndef PCI_DEVICE_ID_NX2_57840_O
+#define PCI_DEVICE_ID_NX2_57840_O	CHIP_NUM_57840_OBSOLETE
+#endif
+#ifndef PCI_DEVICE_ID_NX2_57840_4_10
+#define PCI_DEVICE_ID_NX2_57840_4_10	CHIP_NUM_57840_4_10
+#endif
+#ifndef PCI_DEVICE_ID_NX2_57840_2_20
+#define PCI_DEVICE_ID_NX2_57840_2_20	CHIP_NUM_57840_2_20
+#endif
+#ifndef PCI_DEVICE_ID_NX2_57840_MFO
+#define PCI_DEVICE_ID_NX2_57840_MFO	CHIP_NUM_57840_MF_OBSOLETE
 #endif
 #ifndef PCI_DEVICE_ID_NX2_57840_MF
 #define PCI_DEVICE_ID_NX2_57840_MF	CHIP_NUM_57840_MF
@@ -211,7 +226,10 @@ static DEFINE_PCI_DEVICE_TABLE(bnx2x_pci_tbl) = {
 	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57800_MF), BCM57800_MF },
 	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57810), BCM57810 },
 	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57810_MF), BCM57810_MF },
-	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57840), BCM57840 },
+	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57840_O), BCM57840_O },
+	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57840_4_10), BCM57840_4_10 },
+	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57840_2_20), BCM57840_2_20 },
+	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57840_MFO), BCM57840_MFO },
 	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57840_MF), BCM57840_MF },
 	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57811), BCM57811 },
 	{ PCI_VDEVICE(BROADCOM, PCI_DEVICE_ID_NX2_57811_MF), BCM57811_MF },
@@ -11801,7 +11819,10 @@ static int __devinit bnx2x_init_one(struct pci_dev *pdev,
 	case BCM57800_MF:
 	case BCM57810:
 	case BCM57810_MF:
-	case BCM57840:
+	case BCM57840_O:
+	case BCM57840_4_10:
+	case BCM57840_2_20:
+	case BCM57840_MFO:
 	case BCM57840_MF:
 	case BCM57811:
 	case BCM57811_MF:
-- 
1.7.9.rc2
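
The effect of the new CHIP_IS_57840() and CHIP_IS_57840_MF() macros can be checked in isolation. This is a small sketch under stated assumptions: the device IDs are copied from the patch above, while the plain-function form and the names chip_is_57840() and chip_is_57840_mf() are hypothetical and not part of the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device IDs as defined in the patch above. */
#define CHIP_NUM_57840_OBSOLETE		0x168d
#define CHIP_NUM_57840_MF_OBSOLETE	0x16ab
#define CHIP_NUM_57840_4_10		0x16a1
#define CHIP_NUM_57840_2_20		0x16a2
#define CHIP_NUM_57840_MF		0x16a4

/* Hypothetical helper: function form of CHIP_IS_57840(bp). */
bool chip_is_57840(uint16_t chip_num)
{
	return chip_num == CHIP_NUM_57840_4_10 ||
	       chip_num == CHIP_NUM_57840_2_20 ||
	       chip_num == CHIP_NUM_57840_OBSOLETE;
}

/* Hypothetical helper: function form of CHIP_IS_57840_MF(bp). */
bool chip_is_57840_mf(uint16_t chip_num)
{
	return chip_num == CHIP_NUM_57840_MF ||
	       chip_num == CHIP_NUM_57840_MF_OBSOLETE;
}
```

The old silicon default 0x168d still matches the single-function test, which is what lets a board with damaged nvram keep probing so that the nvram can be repaired.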


* flush cache according to 'preferred life time'
From: BALAKUMARAN KANNAN @ 2012-07-23  7:36 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hello all,
    I am running the IPv6 conformance test cases on a linux-3.0.26 kernel and am facing a problem with router advertisements. One test case sets the 'preferred life time' to 20 seconds for a particular destination and then continuously sends ICMP requests. The ICMP replies are expected to stop after 20 seconds. But because the default gc_interval is 30 seconds, the route stays in the routing cache even after its timer has expired, so the NUT (node under test) keeps sending ICMP replies after the 20 seconds. If I change gc_interval to 1, the test passes.

    But if I change gc_interval to 1 second, another test case in the PMTU section fails. It expects the NUT to retain the PMTU (path MTU) information it has learned, but when I flush the cache the PMTU value reverts to the default.

    So I made the kernel alter its gc_interval value according to the 'preferred life time' of the path. Here is my patch. Kindly tell me whether my idea is correct. Am I missing something?

--------------------------------------------------------------------------------------
--- ../linux-3.0.y-BRANCH_SS-RT.git.fresh/net/ipv6/ndisc.c.bak  2012-07-23 12:50:46.000000000 +0530
+++ ../linux-3.0.y-BRANCH_SS-RT.git.fresh/net/ipv6/ndisc.c      2012-07-23 12:54:17.000000000 +0530
@@ -1160,6 +1160,9 @@
 
        __u8 * opt = (__u8 *)(ra_msg + 1);
 
+        struct net *net = dev_net(skb->dev);
+        fib6_run_gc(1, net);
+
        optlen = (skb->tail - skb->transport_header) - sizeof(struct ra_msg);
 
        if (!(ipv6_addr_type(&ipv6_hdr(skb)->saddr) & IPV6_ADDR_LINKLOCAL)) {
@@ -1200,6 +1203,22 @@
                return;
        }
 
+        if (*opt == 3) {
+                printk("<8> IN OPT 3\n");
+                struct net *net = dev_net(skb->dev);
+                fib6_run_gc(1, net);
+                int pref_life_time =  ntohl(*((int *) (((char *) (opt)) + 8)));
+                if ((pref_life_time != 0) && (pref_life_time < 50)) {
+                        printk("<8> gc_interval CHANGED\n");
+                        //init_net.ipv6.sysctl.flush_delay = 1;
+                        init_net.ipv6.sysctl.ip6_rt_gc_interval = 1 * HZ;
+                }
+                else {
+                        init_net.ipv6.sysctl.ip6_rt_gc_interval = 30 * HZ;
+                }
+
+        }
+
        if (!accept_ra(in6_dev))
                goto skip_linkparms;
--------------------------------------------------------------------------------------

Note: This is not well structured. I created it only as a temporary solution. Please just tell me whether this idea is right.

And please let me know why the PMTU value is not stored in the routing table but only in the cache.
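
The patch above reads the preferred lifetime with a pointer cast at byte offset 8 of the option and stores it in a signed int. Below is a minimal userspace sketch of a bounds-checked read of the same field, assuming the Prefix Information option layout of RFC 4861 (type 3, length 4 in units of 8 octets, preferred lifetime at offset 8). The macro names and prefix_info_preferred_lft() are hypothetical, not kernel API.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local names for this sketch; values follow RFC 4861, section 4.6.2. */
#define OPT_PREFIX_INFO		3	/* option type */
#define PREFIX_INFO_BYTES	32	/* option size: length field 4 * 8 */
#define PREFERRED_LFT_OFFSET	8	/* type, len, prefix len, flags, valid lft */

/*
 * Hypothetical helper: store the preferred lifetime (host byte order,
 * seconds) in *lft and return 0, or return -1 if opt is not a complete
 * Prefix Information option.
 */
int prefix_info_preferred_lft(const uint8_t *opt, size_t avail, uint32_t *lft)
{
	uint32_t raw;

	if (avail < PREFIX_INFO_BYTES)
		return -1;
	if (opt[0] != OPT_PREFIX_INFO || opt[1] != PREFIX_INFO_BYTES / 8)
		return -1;

	/* memcpy avoids an unaligned 32-bit load through a cast. */
	memcpy(&raw, opt + PREFERRED_LFT_OFFSET, sizeof(raw));
	*lft = ntohl(raw);
	return 0;
}
```

Keeping the value unsigned matters because RFC 4861 uses 0xffffffff for an infinite lifetime; in a signed int that value would compare as less than 50.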


* [PATCH] tcp: avoid oops in tcp_metrics and reset tcpm_stamp
From: Julian Anastasov @ 2012-07-23  7:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

	In tcp_tw_remember_stamp we incorrectly checked tw
instead of tm, it can lead to oops if the cached entry is
not found.

	tcpm_stamp was not updated in tcpm_check_stamp when
tcpm_suck_dst was called, move the update into tcpm_suck_dst,
so that we do not call it infinitely on every next cache hit
after TCP_METRICS_TIMEOUT.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/tcp_metrics.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 992f1bf..2288a63 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -107,6 +107,8 @@ static void tcpm_suck_dst(struct tcp_metrics_block *tm, struct dst_entry *dst)
 {
 	u32 val;
 
+	tm->tcpm_stamp = jiffies;
+
 	val = 0;
 	if (dst_metric_locked(dst, RTAX_RTT))
 		val |= 1 << TCP_METRIC_RTT;
@@ -158,7 +160,6 @@ static struct tcp_metrics_block *tcpm_new(struct dst_entry *dst,
 			goto out_unlock;
 	}
 	tm->tcpm_addr = *addr;
-	tm->tcpm_stamp = jiffies;
 
 	tcpm_suck_dst(tm, dst);
 
@@ -621,7 +622,7 @@ bool tcp_tw_remember_stamp(struct inet_timewait_sock *tw)
 
 	rcu_read_lock();
 	tm = __tcp_get_metrics_tw(tw);
-	if (tw) {
+	if (tm) {
 		const struct tcp_timewait_sock *tcptw;
 		struct sock *sk = (struct sock *) tw;
 
-- 
1.7.3.4
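
The second part of the fix can be illustrated with a small model: if the timestamp is reset inside the refresh routine, every caller restarts the timeout, so a timed-out entry is refreshed once rather than on every later lookup. This is a userspace sketch only; struct metrics_entry, METRICS_TIMEOUT and both function names are hypothetical stand-ins for the kernel structures.

```c
#include <stddef.h>
#include <stdint.h>

#define METRICS_TIMEOUT	3600u	/* hypothetical timeout, in ticks */

struct metrics_entry {
	uint32_t stamp;		/* tick of the last refresh */
	unsigned int refreshes;	/* how many times the refresh ran */
};

/* Stand-in for tcpm_suck_dst(): resets the stamp itself. */
void entry_refresh(struct metrics_entry *e, uint32_t now)
{
	e->stamp = now;
	e->refreshes++;
}

/* Stand-in for tcpm_check_stamp(): refresh only after the timeout. */
void entry_check_stamp(struct metrics_entry *e, uint32_t now)
{
	/* Wrap-safe test for "now is after stamp + timeout". */
	if (e && (int32_t)(e->stamp + METRICS_TIMEOUT - now) < 0)
		entry_refresh(e, now);
}
```

If entry_refresh() did not update stamp, the condition in entry_check_stamp() would stay true after the first timeout and the refresh would run on every subsequent call.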


* [patch] openvswitch: potential NULL deref in sample()
From: Dan Carpenter @ 2012-07-23  7:46 UTC (permalink / raw)
  To: Jesse Gross
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA, David S. Miller

If there is no OVS_SAMPLE_ATTR_ACTIONS set then "acts_list" is NULL and
it leads to a NULL dereference when we call nla_len(acts_list).  This
is a static checker fix, not something I have seen in testing.

Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
This applies to Linus's tree.

diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 48badff..c2351d6 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -325,6 +325,9 @@ static int sample(struct datapath *dp, struct sk_buff *skb,
 		}
 	}
 
+	if (!acts_list)
+		return 0;
+
 	return do_execute_actions(dp, skb, nla_data(acts_list),
 						 nla_len(acts_list), true);
 }
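
The pattern being fixed is an optional attribute whose pointer stays NULL when the attribute is absent. A minimal sketch follows, assuming a simplified attribute record in place of struct nlattr; struct attr, the attribute ids and sample_actions_len() are hypothetical and only mirror the shape of sample().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute record standing in for struct nlattr. */
struct attr {
	uint16_t type;
	uint16_t len;		/* payload length in bytes */
	const void *data;
};

enum { ATTR_PROBABILITY = 1, ATTR_ACTIONS = 2 };	/* hypothetical ids */

/*
 * Return the number of action bytes that would be executed, or 0 when
 * no actions attribute was supplied.
 */
size_t sample_actions_len(const struct attr *attrs, size_t n)
{
	const struct attr *acts_list = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (attrs[i].type == ATTR_ACTIONS)
			acts_list = &attrs[i];
	}

	/* Without this check the dereference below would hit NULL. */
	if (!acts_list)
		return 0;

	return acts_list->len;
}
```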


* [PATCH net-next] tcp: dont drop MTU reduction indications
From: Eric Dumazet @ 2012-07-23  7:48 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Nandita Dukkipati, Neal Cardwell,
	Maciej Żenczykowski, Tore Anderson, Tom Herbert

From: Eric Dumazet <edumazet@google.com>

ICMP messages generated in output path if frame length is bigger than
mtu are actually lost because socket is owned by user (doing the xmit)

One example is the ipgre_tunnel_xmit() calling 
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));

We had a similar case fixed in commit a34a101e1e6 (ipv6: disable GSO on
sockets hitting dst_allfrag).

Problem of such fix is that it relied on retransmit timers, so short tcp
sessions paid a too big latency increase price.

This patch uses the tcp_release_cb() infrastructure so that MTU
reduction messages (ICMP messages) are not lost, and no extra delay
is added in TCP transmits.

Reported-by: Maciej Żenczykowski <maze@google.com>
Diagnosed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Tore Anderson <tore@fud.no>
---
 include/linux/tcp.h   |    6 ++++++
 include/net/sock.h    |    1 +
 net/ipv4/tcp_ipv4.c   |   19 +++++++++++++++----
 net/ipv4/tcp_output.c |    6 +++++-
 net/ipv6/tcp_ipv6.c   |   40 ++++++++++++++++++++++++----------------
 5 files changed, 51 insertions(+), 21 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 2761856..eb125a4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -493,6 +493,9 @@ struct tcp_sock {
 		u32		  probe_seq_start;
 		u32		  probe_seq_end;
 	} mtu_probe;
+	u32	mtu_info; /* We received an ICMP_FRAG_NEEDED / ICMPV6_PKT_TOOBIG
+			   * while socket was owned by user.
+			   */
 
 #ifdef CONFIG_TCP_MD5SIG
 /* TCP AF-Specific parts; only used by MD5 Signature support so far */
@@ -518,6 +521,9 @@ enum tsq_flags {
 	TCP_TSQ_DEFERRED,	   /* tcp_tasklet_func() found socket was owned */
 	TCP_WRITE_TIMER_DEFERRED,  /* tcp_write_timer() found socket was owned */
 	TCP_DELACK_TIMER_DEFERRED, /* tcp_delack_timer() found socket was owned */
+	TCP_MTU_REDUCED_DEFERRED,  /* tcp_v{4|6}_err() could not call
+				    * tcp_v{4|6}_mtu_reduced()
+				    */
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
diff --git a/include/net/sock.h b/include/net/sock.h
index 88de092..e067f8c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -859,6 +859,7 @@ struct proto {
 						struct sk_buff *skb);
 
 	void		(*release_cb)(struct sock *sk);
+	void		(*mtu_reduced)(struct sock *sk);
 
 	/* Keeping track of sk's, looking them up, and port selection methods. */
 	void			(*hash)(struct sock *sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 59110ca..bc5432e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -275,12 +275,15 @@ failure:
 EXPORT_SYMBOL(tcp_v4_connect);
 
 /*
- * This routine does path mtu discovery as defined in RFC1191.
+ * This routine reacts to ICMP_FRAG_NEEDED mtu indications as defined in RFC1191.
+ * It can be called through tcp_release_cb() if socket was owned by user
+ * at the time tcp_v4_err() was called to handle ICMP message.
  */
-static void do_pmtu_discovery(struct sock *sk, const struct iphdr *iph, u32 mtu)
+static void tcp_v4_mtu_reduced(struct sock *sk)
 {
 	struct dst_entry *dst;
 	struct inet_sock *inet = inet_sk(sk);
+	u32 mtu = tcp_sk(sk)->mtu_info;
 
 	/* We are not interested in TCP_LISTEN and open_requests (SYN-ACKs
 	 * send out by Linux are always <576bytes so they should go through
@@ -373,8 +376,12 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 	bh_lock_sock(sk);
 	/* If too many ICMPs get dropped on busy
 	 * servers this needs to be solved differently.
+	 * We do take care of PMTU discovery (RFC1191) special case :
+	 * we can receive locally generated ICMP messages while socket is held.
 	 */
-	if (sock_owned_by_user(sk))
+	if (sock_owned_by_user(sk) &&
+	    !(type == ICMP_DEST_UNREACH &&
+	      code == ICMP_FRAG_NEEDED))
 		NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
 
 	if (sk->sk_state == TCP_CLOSE)
@@ -409,8 +416,11 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 			goto out;
 
 		if (code == ICMP_FRAG_NEEDED) { /* PMTU discovery (RFC1191) */
+			tp->mtu_info = info;
 			if (!sock_owned_by_user(sk))
-				do_pmtu_discovery(sk, iph, info);
+				tcp_v4_mtu_reduced(sk);
+			else
+				set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags);
 			goto out;
 		}
 
@@ -2596,6 +2606,7 @@ struct proto tcp_prot = {
 	.sendpage		= tcp_sendpage,
 	.backlog_rcv		= tcp_v4_do_rcv,
 	.release_cb		= tcp_release_cb,
+	.mtu_reduced		= tcp_v4_mtu_reduced,
 	.hash			= inet_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 950aebf..33cd065 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -885,7 +885,8 @@ static void tcp_tasklet_func(unsigned long data)
 
 #define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) |		\
 			  (1UL << TCP_WRITE_TIMER_DEFERRED) |	\
-			  (1UL << TCP_DELACK_TIMER_DEFERRED))
+			  (1UL << TCP_DELACK_TIMER_DEFERRED) |	\
+			  (1UL << TCP_MTU_REDUCED_DEFERRED))
 /**
  * tcp_release_cb - tcp release_sock() callback
  * @sk: socket
@@ -914,6 +915,9 @@ void tcp_release_cb(struct sock *sk)
 
 	if (flags & (1UL << TCP_DELACK_TIMER_DEFERRED))
 		tcp_delack_timer_handler(sk);
+
+	if (flags & (1UL << TCP_MTU_REDUCED_DEFERRED))
+		sk->sk_prot->mtu_reduced(sk);
 }
 EXPORT_SYMBOL(tcp_release_cb);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0302ec3..f49476e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -315,6 +315,23 @@ failure:
 	return err;
 }
 
+static void tcp_v6_mtu_reduced(struct sock *sk)
+{
+	struct dst_entry *dst;
+
+	if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
+		return;
+
+	dst = inet6_csk_update_pmtu(sk, tcp_sk(sk)->mtu_info);
+	if (!dst)
+		return;
+
+	if (inet_csk(sk)->icsk_pmtu_cookie > dst_mtu(dst)) {
+		tcp_sync_mss(sk, dst_mtu(dst));
+		tcp_simple_retransmit(sk);
+	}
+}
+
 static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 		u8 type, u8 code, int offset, __be32 info)
 {
@@ -342,7 +359,7 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	}
 
 	bh_lock_sock(sk);
-	if (sock_owned_by_user(sk))
+	if (sock_owned_by_user(sk) && type != ICMPV6_PKT_TOOBIG)
 		NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
 
 	if (sk->sk_state == TCP_CLOSE)
@@ -371,21 +388,11 @@ static void tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	}
 
 	if (type == ICMPV6_PKT_TOOBIG) {
-		struct dst_entry *dst;
-
-		if (sock_owned_by_user(sk))
-			goto out;
-		if ((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))
-			goto out;
-
-		dst = inet6_csk_update_pmtu(sk, ntohl(info));
-		if (!dst)
-			goto out;
-
-		if (inet_csk(sk)->icsk_pmtu_cookie > dst_mtu(dst)) {
-			tcp_sync_mss(sk, dst_mtu(dst));
-			tcp_simple_retransmit(sk);
-		}
+		tp->mtu_info = ntohl(info);
+		if (!sock_owned_by_user(sk))
+			tcp_v6_mtu_reduced(sk);
+		else
+			set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags);
 		goto out;
 	}
 
@@ -1949,6 +1956,7 @@ struct proto tcpv6_prot = {
 	.sendpage		= tcp_sendpage,
 	.backlog_rcv		= tcp_v6_do_rcv,
 	.release_cb		= tcp_release_cb,
+	.mtu_reduced		= tcp_v6_mtu_reduced,
 	.hash			= tcp_v6_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
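
The deferral scheme used by this patch can be modelled in a few lines: the ICMP path records the MTU, acts on it at once if the socket is free, and otherwise sets a flag that the release path consumes. This is a single-threaded userspace sketch under stated assumptions; struct fake_sock and all function names are hypothetical, and the kernel's atomic bit operations and bh_lock_sock() locking are replaced by plain loads and stores.

```c
#include <stdbool.h>
#include <stdint.h>

enum { MTU_REDUCED_DEFERRED = 0 };	/* hypothetical flag bit number */

struct fake_sock {
	bool owned_by_user;	/* models sock_owned_by_user() */
	unsigned long flags;	/* models tp->tsq_flags */
	uint32_t mtu_info;	/* MTU reported by the last ICMP error */
	uint32_t cur_mtu;	/* MTU currently in use */
};

/* Models the mtu_reduced() handler: shrink the MTU if needed. */
void fake_mtu_reduced(struct fake_sock *sk)
{
	if (sk->mtu_info < sk->cur_mtu)
		sk->cur_mtu = sk->mtu_info;
}

/* Models the ICMP error path for a "fragmentation needed" message. */
void fake_icmp_frag_needed(struct fake_sock *sk, uint32_t mtu)
{
	sk->mtu_info = mtu;
	if (!sk->owned_by_user)
		fake_mtu_reduced(sk);
	else
		sk->flags |= 1UL << MTU_REDUCED_DEFERRED;
}

/* Models release_sock() invoking the release callback. */
void fake_release_sock(struct fake_sock *sk)
{
	unsigned long flags = sk->flags;

	sk->flags = 0;		/* the kernel clears these bits atomically */
	sk->owned_by_user = false;
	if (flags & (1UL << MTU_REDUCED_DEFERRED))
		fake_mtu_reduced(sk);
}
```

In this model an indication that arrives while the socket is owned is applied as soon as the owner releases it, with no retransmit timer involved.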


* Re: [PATCH] tcp: avoid oops in tcp_metrics and reset tcpm_stamp
From: David Miller @ 2012-07-23  7:58 UTC (permalink / raw)
  To: ja; +Cc: netdev
In-Reply-To: <1343029598-4975-1-git-send-email-ja@ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Mon, 23 Jul 2012 10:46:38 +0300

> 	In tcp_tw_remember_stamp we incorrectly checked tw
> instead of tm, it can lead to oops if the cached entry is
> not found.
> 
> 	tcpm_stamp was not updated in tcpm_check_stamp when
> tcpm_suck_dst was called, move the update into tcpm_suck_dst,
> so that we do not call it infinitely on every next cache hit
> after TCP_METRICS_TIMEOUT.
> 
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

Applied, thanks Julian.


* Re: [net-next PATCH 1/1] bnx2x: Add new 57840 device IDs
From: David Miller @ 2012-07-23  7:58 UTC (permalink / raw)
  To: yuvalmin; +Cc: netdev, eilong
In-Reply-To: <1343028343-15351-1-git-send-email-yuvalmin@broadcom.com>

From: "Yuval Mintz" <yuvalmin@broadcom.com>
Date: Mon, 23 Jul 2012 10:25:43 +0300

> The 57840 boards come in two flavours: 2 x 20G and 4 x 10G.
> To better differentiate between the two flavours, a separate device ID
> was assigned to each.
> The silicon default value is still the currently supported 57840 device ID
> (0x168d), and since a user can damage the nvram (e.g., 'ethtool -E')
> the driver will still support this device ID to allow the user to amend the
> nvram back into a supported configuration.
> 
> Notice this patch contains lines longer than 80 characters (strings).
> 
> Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
> Signed-off-by: Eilon Greenstein <eilong@broadcom.com>

Applied.


* Re: [PATCH net-next] tcp: dont drop MTU reduction indications
From: David Miller @ 2012-07-23  7:59 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, nanditad, ncardwell, maze, tore, therbert
In-Reply-To: <1343029732.2626.10234.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 23 Jul 2012 09:48:52 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> ICMP messages generated in output path if frame length is bigger than
> mtu are actually lost because socket is owned by user (doing the xmit)
> 
> One example is the ipgre_tunnel_xmit() calling 
> icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
> 
> We had a similar case fixed in commit a34a101e1e6 (ipv6: disable GSO on
> sockets hitting dst_allfrag).
> 
> Problem of such fix is that it relied on retransmit timers, so short tcp
> sessions paid a too big latency increase price.
> 
> This patch uses the tcp_release_cb() infrastructure so that MTU
> reduction messages (ICMP messages) are not lost, and no extra delay
> is added in TCP transmits.
> 
> Reported-by: Maciej Żenczykowski <maze@google.com>
> Diagnosed-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.


* Re: [patch] openvswitch: potential NULL deref in sample()
From: David Miller @ 2012-07-23  8:00 UTC (permalink / raw)
  To: dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20120723074628.GA30892-mgFCXtclrQlZLf2FXnZxJA@public.gmane.org>

From: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Date: Mon, 23 Jul 2012 10:46:28 +0300

> If there is no OVS_SAMPLE_ATTR_ACTIONS set then "acts_list" is NULL and
> it leads to a NULL dereference when we call nla_len(acts_list).  This
> is a static checker fix, not something I have seen in testing.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

Applied, thanks Dan.
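
The kind of guard described in the quoted changelog can be sketched in userspace C. Everything here is an assumption made for illustration: struct toy_attr, toy_attr_len() and toy_sample() are hypothetical stand-ins, not the netlink or openvswitch definitions, and the return value is invented for the example. The point is only that the optional attribute is checked for NULL before its length is read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical attribute header; NOT the kernel's struct nlattr. */
struct toy_attr {
    unsigned short len;  /* total length, header included */
    unsigned short type;
};

#define TOY_HDRLEN ((int)sizeof(struct toy_attr))

/* Payload length of an attribute; the caller must pass a valid pointer. */
static int toy_attr_len(const struct toy_attr *a)
{
    assert(a != NULL);
    return a->len - TOY_HDRLEN;
}

/*
 * Returns the number of payload bytes that would be processed.
 * A missing action list is treated as "nothing to do" instead of
 * being dereferenced.
 */
static int toy_sample(const struct toy_attr *acts_list)
{
    if (!acts_list)
        return 0;
    return toy_attr_len(acts_list);
}
```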

^ permalink raw reply

* Re: [PATCH net] rds: set correct msg_namelen
From: David Miller @ 2012-07-23  8:02 UTC (permalink / raw)
  To: wpan; +Cc: netdev, linux-kernel
In-Reply-To: <5181687def9991f9878460d932bd31c64f9ad0cb.1343010976.git.wpan@redhat.com>

From: Weiping Pan <wpan@redhat.com>
Date: Mon, 23 Jul 2012 10:37:48 +0800

> Jay Fenlason (fenlason@redhat.com) found a bug:
> recvfrom() on an RDS socket can return the contents of random kernel
> memory to userspace if it is called with an address length larger than
> sizeof(struct sockaddr_in).
> rds_recvmsg() also fails to set the addr_len parameter properly before
> returning, but that's just a bug.
> There are also a number of cases where recvfrom() can return an entirely bogus
> address. Anything in rds_recvmsg() that returns a non-negative value but does
> not go through the "sin = (struct sockaddr_in *)msg->msg_name;" code path
> at the end of the while(1) loop will return up to 128 bytes of kernel memory
> to userspace.
> 
> I wrote two test programs to reproduce this bug. You will see that in
> rds_server, fromAddr is overwritten and the following sock_fd is
> destroyed.
> Yes, it is the programmer's fault to set msg_namelen incorrectly, but it is
> better to make the kernel copy the real length of the address to user space in
> such a case.
> 
> How to run the test programs?
> I tested them on a 32-bit x86 system, 3.5.0-rc7.
 ...
> Signed-off-by: Weiping Pan <wpan@redhat.com>

Applied, thanks.
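
The failure mode in the quoted changelog can be modelled in userspace C. This is a sketch under stated assumptions, not the RDS code: struct toy_msg, struct toy_addr and toy_recvmsg() are hypothetical, and the 128-byte buffer size is taken from the changelog's "up to 128 bytes". The model resets the name length on entry, so no return path can leave the caller's oversized length in place, and sets it to the real address size only when an address is actually written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_KBUF_LEN 128 /* size of the kernel-side address buffer (assumed) */

/* Hypothetical peer address, laid out loosely like sockaddr_in. */
struct toy_addr {
    unsigned short family;
    unsigned short port;
    unsigned int addr;
    unsigned char pad[8];
};

/* Hypothetical message header; NOT the kernel's struct msghdr. */
struct toy_msg {
    unsigned char name[TOY_KBUF_LEN]; /* may hold stale bytes */
    size_t namelen; /* in: caller's length, out: valid bytes */
};

/*
 * Returns how many bytes of msg->name may be copied back to the caller.
 * peer == NULL models a return path that never fills in an address.
 */
static size_t toy_recvmsg(struct toy_msg *msg, const struct toy_addr *peer)
{
    assert(msg != NULL);

    /* Never trust the incoming length: start from "no address". */
    msg->namelen = 0;

    if (peer) {
        memcpy(msg->name, peer, sizeof(*peer));
        msg->namelen = sizeof(*peer);
    }
    return msg->namelen;
}
```

With the reset in place, a caller that passed a length of 128 gets back either sizeof(struct toy_addr) valid bytes or none, never stale buffer contents.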

^ permalink raw reply

* Re: flush cache according to 'preferred life time'
From: Gao feng @ 2012-07-23  8:05 UTC (permalink / raw)
  To: BALAKUMARAN KANNAN; +Cc: netdev@vger.kernel.org
In-Reply-To: <4A71D24947E78D43BC584A7CD4391A41017DE48F@SIXPRD0410MB359.apcprd04.prod.outlook.com>

On 2012-07-23 15:36, BALAKUMARAN KANNAN wrote:
> Hello all,
>     I am running IPv6 conformance test cases on a linux-3.0.26 kernel, and I am facing a problem with router advertisements. One test case sets the 'preferred life time' to 20 seconds for a particular destination and then continuously sends ICMP REQUEST. The ICMP_REPLY is expected to stop after 20 seconds. But because the default gc_interval is 30 seconds, the route stays in the routing cache even after its timer has expired, so the nut (node under test) still sends ICMP_REPLY after 20 seconds. If I change gc_interval to 1, the test passes.
> 
>     But if I change gc_interval to 1 second, another test case in the pmtu section fails. It expects the nut to hold the pmtu (path MTU) information for a different value. If I flush the cache, the pmtu value goes back to the default.
> 
>     So I made the kernel alter its gc_interval value according to the 'preferred life time' of the path. Here is my patch. Kindly tell me whether my idea is correct. Am I missing something?
> 

Is this commit 1716a96101c49186bb0b8491922fd3e69030235f what you need?

> --------------------------------------------------------------------------------------
> --- ../linux-3.0.y-BRANCH_SS-RT.git.fresh/net/ipv6/ndisc.c.bak  2012-07-23 12:50:46.000000000 +0530
> +++ ../linux-3.0.y-BRANCH_SS-RT.git.fresh/net/ipv6/ndisc.c      2012-07-23 12:54:17.000000000 +0530
> @@ -1160,6 +1160,9 @@
>  
>         __u8 * opt = (__u8 *)(ra_msg + 1);
>  
> +        struct net *net = dev_net(skb->dev);
> +        fib6_run_gc(1, net);
> +
>         optlen = (skb->tail - skb->transport_header) - sizeof(struct ra_msg);
>  
>         if (!(ipv6_addr_type(&ipv6_hdr(skb)->saddr) & IPV6_ADDR_LINKLOCAL)) {
> @@ -1200,6 +1203,22 @@
>                 return;
>         }
>  
> +        if (*opt == 3) {
> +                printk("<8> IN OPT 3\n");
> +                struct net *net = dev_net(skb->dev);
> +                fib6_run_gc(1, net);
> +                int pref_life_time =  ntohl(*((int *) (((char *) (opt)) + 8)));
> +                if ((pref_life_time != 0) && (pref_life_time < 50)) {
> +                        printk("<8> gc_interval CHANGED\n");
> +                        //init_net.ipv6.sysctl.flush_delay = 1;
> +                        init_net.ipv6.sysctl.ip6_rt_gc_interval = 1 * HZ;
> +                }
> +                else {
> +                        init_net.ipv6.sysctl.ip6_rt_gc_interval = 30 * HZ;
> +                }
> +
> +        }
> +
>         if (!accept_ra(in6_dev))
>                 goto skip_linkparms;
> --------------------------------------------------------------------------------------
> 
> Note: This is not well structured. I just created it as a temporary solution. Please just tell me whether this idea is right.
> 
> And please let me know why the pmtu value is not stored in the routing table but only in the cache.

I think the pmtu should belong to the destination. Different destinations may have
different pmtus, even if they use the same route entry.
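
As a side note on the quoted patch: it reads the preferred lifetime by casting the option pointer at offset 8. A more defensive way to read that field can be sketched in userspace C. The offsets follow the Prefix Information option layout in RFC 4861 (type 3, length 4 in units of 8 octets, valid lifetime at offset 4, preferred lifetime at offset 8); the function name and the TOY_* constants are assumptions made for this example and are not kernel identifiers.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ND_OPT_PREFIX_INFO 3  /* option type */
#define TOY_PREFIX_INFO_UNITS  4  /* length field, in units of 8 octets */
#define TOY_PREFIX_INFO_BYTES  32 /* total option size in bytes */
#define TOY_PREFERRED_OFFSET   8  /* preferred lifetime, network order */

/*
 * Extract the preferred lifetime (in seconds) from a Prefix Information
 * option. Returns 0 on success, -1 if the buffer is too short or is not
 * a well-formed Prefix Information option.
 */
static int toy_parse_preferred_lifetime(const uint8_t *opt, size_t optlen,
                                        uint32_t *out)
{
    uint32_t be;

    assert(out != NULL);

    if (!opt || optlen < TOY_PREFIX_INFO_BYTES)
        return -1;
    if (opt[0] != TOY_ND_OPT_PREFIX_INFO || opt[1] != TOY_PREFIX_INFO_UNITS)
        return -1;

    /* memcpy avoids an unaligned or type-punned 32-bit load. */
    memcpy(&be, opt + TOY_PREFERRED_OFFSET, sizeof(be));
    *out = ntohl(be);
    return 0;
}
```

The length check matters because the option comes straight from the network; the quoted patch dereferences offset 8 without checking how much option data is left.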

^ permalink raw reply

* RE: flush cache according to 'preferred life time'
From: BALAKUMARAN KANNAN @ 2012-07-23  8:12 UTC (permalink / raw)
  To: Gao feng; +Cc: netdev@vger.kernel.org
In-Reply-To: <500D05D6.7090108@cn.fujitsu.com>


________________________________________
From: netdev-owner@vger.kernel.org [netdev-owner@vger.kernel.org] on behalf of Gao feng [gaofeng@cn.fujitsu.com]
Sent: Monday, July 23, 2012 1:35 PM
To: BALAKUMARAN KANNAN
Cc: netdev@vger.kernel.org
Subject: Re: flush cache according to 'preferred life time'

On 2012-07-23 15:36, BALAKUMARAN KANNAN wrote:
> Hello all,
>     I am running IPv6 conformance test cases on a linux-3.0.26 kernel, and I am facing a problem with router advertisements. One test case sets the 'preferred life time' to 20 seconds for a particular destination and then continuously sends ICMP REQUEST. The ICMP_REPLY is expected to stop after 20 seconds. But because the default gc_interval is 30 seconds, the route stays in the routing cache even after its timer has expired, so the nut (node under test) still sends ICMP_REPLY after 20 seconds. If I change gc_interval to 1, the test passes.
>
>     But if I change gc_interval to 1 second, another test case in the pmtu section fails. It expects the nut to hold the pmtu (path MTU) information for a different value. If I flush the cache, the pmtu value goes back to the default.
>
>     So I made the kernel alter its gc_interval value according to the 'preferred life time' of the path. Here is my patch. Kindly tell me whether my idea is correct. Am I missing something?
>

Is this commit 1716a96101c49186bb0b8491922fd3e69030235f what you need?

> --------------------------------------------------------------------------------------
> --- ../linux-3.0.y-BRANCH_SS-RT.git.fresh/net/ipv6/ndisc.c.bak  2012-07-23 12:50:46.000000000 +0530
> +++ ../linux-3.0.y-BRANCH_SS-RT.git.fresh/net/ipv6/ndisc.c      2012-07-23 12:54:17.000000000 +0530
> @@ -1160,6 +1160,9 @@
>
>         __u8 * opt = (__u8 *)(ra_msg + 1);
>
> +        struct net *net = dev_net(skb->dev);
> +        fib6_run_gc(1, net);
> +
>         optlen = (skb->tail - skb->transport_header) - sizeof(struct ra_msg);
>
>         if (!(ipv6_addr_type(&ipv6_hdr(skb)->saddr) & IPV6_ADDR_LINKLOCAL)) {
> @@ -1200,6 +1203,22 @@
>                 return;
>         }
>
> +        if (*opt == 3) {
> +                printk("<8> IN OPT 3\n");
> +                struct net *net = dev_net(skb->dev);
> +                fib6_run_gc(1, net);
> +                int pref_life_time =  ntohl(*((int *) (((char *) (opt)) + 8)));
> +                if ((pref_life_time != 0) && (pref_life_time < 50)) {
> +                        printk("<8> gc_interval CHANGED\n");
> +                        //init_net.ipv6.sysctl.flush_delay = 1;
> +                        init_net.ipv6.sysctl.ip6_rt_gc_interval = 1 * HZ;
> +                }
> +                else {
> +                        init_net.ipv6.sysctl.ip6_rt_gc_interval = 30 * HZ;
> +                }
> +
> +        }
> +
>         if (!accept_ra(in6_dev))
>                 goto skip_linkparms;
> --------------------------------------------------------------------------------------
>
> Note: This is not well structured. I just created it as a temporary solution. Please just tell me whether this idea is right.
>
> And please let me know why the pmtu value is not stored in the routing table but only in the cache.

I think the pmtu should belong to the destination. Different destinations may have
different pmtus, even if they use the same route entry.

So what you are saying is that there will not be an entry in the routing table for each destination, but only in the routing cache, so the pmtu value is updated in the cache and not in the routing table. Thank you. I also have one more question. Is there a way to delete a particular entry from the cache once the corresponding entry in the routing table has expired? In my case, the routing table entry expires in 20 seconds, but the cache entry remains available until the next cache flush. I want to delete the entry from the cache as soon as the entry in the routing table expires. I hope this is clear.

Thank you Gao feng-san for your interest.

--Regards,
K.Balakumaran

^ permalink raw reply

* Re: [PATCH net-next 2/4] e1000: advertise transmit time stamping
From: Jeff Kirsher @ 2012-07-23  8:18 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev, David Miller, Willem de Bruijn, e1000-devel
In-Reply-To: <33885adb6b0e97e6b21fd1903fb6ede3ed7b50ab.1342976654.git.richardcochran@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 572 bytes --]

On Sun, 2012-07-22 at 19:15 +0200, Richard Cochran wrote:
> This driver now offers software transmit time stamping, so it should
> advertise that fact via ethtool. Compile tested only.
> 
> Signed-off-by: Richard Cochran <richardcochran@gmail.com>
> 
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: e1000-devel@lists.sourceforge.net
> ---
>  drivers/net/ethernet/intel/e1000/e1000_ethtool.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-) 

Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 3/4] e1000e: advertise transmit time stamping
From: Jeff Kirsher @ 2012-07-23  8:19 UTC (permalink / raw)
  To: Richard Cochran; +Cc: Willem de Bruijn, netdev, David Miller, e1000-devel
In-Reply-To: <e0e31f0229b5bfda1684122498b4d6fb3195cf26.1342976654.git.richardcochran@gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 567 bytes --]

On Sun, 2012-07-22 at 19:15 +0200, Richard Cochran wrote:
> This driver now offers software transmit time stamping, so it should
> advertise that fact via ethtool. Compile tested only.
> 
> Signed-off-by: Richard Cochran <richardcochran@gmail.com>
> 
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Cc: e1000-devel@lists.sourceforge.net
> ---
>  drivers/net/ethernet/intel/e1000e/ethtool.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-) 

Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: flush cache according to 'preferred life time'
From: Gao feng @ 2012-07-23  8:24 UTC (permalink / raw)
  To: BALAKUMARAN KANNAN; +Cc: netdev@vger.kernel.org
In-Reply-To: <4A71D24947E78D43BC584A7CD4391A41017DE4C1@SIXPRD0410MB359.apcprd04.prod.outlook.com>

On 2012-07-23 16:12, BALAKUMARAN KANNAN wrote:
> 
> ________________________________________
> From: netdev-owner@vger.kernel.org [netdev-owner@vger.kernel.org] on behalf of Gao feng [gaofeng@cn.fujitsu.com]
> Sent: Monday, July 23, 2012 1:35 PM
> To: BALAKUMARAN KANNAN
> Cc: netdev@vger.kernel.org
> Subject: Re: flush cache according to 'preferred life time'
> 
> On 2012-07-23 15:36, BALAKUMARAN KANNAN wrote:
>> Hello all,
>>     I am running IPv6 conformance test cases on a linux-3.0.26 kernel, and I am facing a problem with router advertisements. One test case sets the 'preferred life time' to 20 seconds for a particular destination and then continuously sends ICMP REQUEST. The ICMP_REPLY is expected to stop after 20 seconds. But because the default gc_interval is 30 seconds, the route stays in the routing cache even after its timer has expired, so the nut (node under test) still sends ICMP_REPLY after 20 seconds. If I change gc_interval to 1, the test passes.
>>
>>     But if I change gc_interval to 1 second, another test case in the pmtu section fails. It expects the nut to hold the pmtu (path MTU) information for a different value. If I flush the cache, the pmtu value goes back to the default.
>>
>>     So I made the kernel alter its gc_interval value according to the 'preferred life time' of the path. Here is my patch. Kindly tell me whether my idea is correct. Am I missing something?
>>
> 
> Is this commit 1716a96101c49186bb0b8491922fd3e69030235f what you need?
> 
>> --------------------------------------------------------------------------------------
>> --- ../linux-3.0.y-BRANCH_SS-RT.git.fresh/net/ipv6/ndisc.c.bak  2012-07-23 12:50:46.000000000 +0530
>> +++ ../linux-3.0.y-BRANCH_SS-RT.git.fresh/net/ipv6/ndisc.c      2012-07-23 12:54:17.000000000 +0530
>> @@ -1160,6 +1160,9 @@
>>
>>         __u8 * opt = (__u8 *)(ra_msg + 1);
>>
>> +        struct net *net = dev_net(skb->dev);
>> +        fib6_run_gc(1, net);
>> +
>>         optlen = (skb->tail - skb->transport_header) - sizeof(struct ra_msg);
>>
>>         if (!(ipv6_addr_type(&ipv6_hdr(skb)->saddr) & IPV6_ADDR_LINKLOCAL)) {
>> @@ -1200,6 +1203,22 @@
>>                 return;
>>         }
>>
>> +        if (*opt == 3) {
>> +                printk("<8> IN OPT 3\n");
>> +                struct net *net = dev_net(skb->dev);
>> +                fib6_run_gc(1, net);
>> +                int pref_life_time =  ntohl(*((int *) (((char *) (opt)) + 8)));
>> +                if ((pref_life_time != 0) && (pref_life_time < 50)) {
>> +                        printk("<8> gc_interval CHANGED\n");
>> +                        //init_net.ipv6.sysctl.flush_delay = 1;
>> +                        init_net.ipv6.sysctl.ip6_rt_gc_interval = 1 * HZ;
>> +                }
>> +                else {
>> +                        init_net.ipv6.sysctl.ip6_rt_gc_interval = 30 * HZ;
>> +                }
>> +
>> +        }
>> +
>>         if (!accept_ra(in6_dev))
>>                 goto skip_linkparms;
>> --------------------------------------------------------------------------------------
>>
>> Note: This is not well structured. I just created it as a temporary solution. Please just tell me whether this idea is right.
>>
>> And please let me know why the pmtu value is not stored in the routing table but only in the cache.
> 
> I think the pmtu should belong to the destination. Different destinations may have
> different pmtus, even if they use the same route entry.
> 
> So what you are saying is that there will not be an entry in the routing table for each destination, but only in the routing cache, so the pmtu value is updated in the cache and not in the routing table. Thank you. I also have one more question. Is there a way to delete a particular entry from the cache once the corresponding entry in the routing table has expired? In my case, the routing table entry expires in 20 seconds, but the cache entry remains available until the next cache flush. I want to delete the entry from the cache as soon as the entry in the routing table expires. I hope this is clear.
> 

Yes, I think so too.

> Thank you Gao feng-san for your interest.

I don't know if this commit is what you are looking for:
1716a96101c49186bb0b8491922fd3e69030235f

You can test again with this commit.

Thanks

^ permalink raw reply

* RE: flush cache according to 'preferred life time'
From: BALAKUMARAN KANNAN @ 2012-07-23  8:36 UTC (permalink / raw)
  To: Gao feng; +Cc: netdev@vger.kernel.org
In-Reply-To: <500D0A24.6060200@cn.fujitsu.com>

Thank you Gao-san. Your patch is helpful. I will try that. I am also facing another problem. TAHI test case 145 in the nd section fails if gc_interval is 30. The test case is as follows:
 * The tester node (tn) sends an RA with curhoplimit 64.
 * tn sends an ICMP_REQUEST and checks that the ICMP_REPLY from the nut has hoplimit 64. (This is fine in my case.)
 * Then tn sends an RA with curhoplimit 0. (It should be ignored.)
 * Then tn sends an ICMP_REQUEST again and checks whether the hoplimit in the ICMP_REPLY from the nut remains 64. (But it changes to 255 if the cache entry is present. Once I changed gc_interval to 1, this test case passed.)
Can you please explain the reason?
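
The behaviour the test expects in the steps above can be written down as a tiny userspace model. This is only a sketch of the expected rule (RFC 4861 says a received Cur Hop Limit of zero means "unspecified" and the host should only update its value when the field is non-zero). It does not explain why a cached route reports 255 on this kernel. struct toy_if and toy_ra_update_hop_limit() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-interface state; NOT a kernel structure. */
struct toy_if {
    uint8_t cur_hop_limit; /* hop limit used for outgoing packets */
};

/*
 * Process the Cur Hop Limit field of a received Router Advertisement.
 * Zero means "unspecified by this router", so the stored value is kept.
 */
static void toy_ra_update_hop_limit(struct toy_if *ifp, uint8_t ra_cur_hop_limit)
{
    assert(ifp != NULL);

    if (ra_cur_hop_limit != 0)
        ifp->cur_hop_limit = ra_cur_hop_limit;
}
```

Under this rule the second RA in the test (curhoplimit 0) leaves the value at 64, which is what the tester checks for.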

Thank you

--Regards,
K.Balakumaran

^ permalink raw reply

* Re: [PATCH 11/16] ipv4: Cache input routes in fib_info nexthops.
From: Julian Anastasov @ 2012-07-23  9:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20120720.142622.1447419081262029885.davem@davemloft.net>


	Hello,

On Fri, 20 Jul 2012, David Miller wrote:

> 
> Caching input routes is slightly simpler than output routes, since we
> don't need to be concerned with nexthop exceptions.  (locally
> destined, and routed packets, never trigger PMTU events or redirects
> that will be processed by us).
> 
> However, we have to elide caching for the DIRECTSRC and non-zero itag
> cases.

	I see only one user for RTCF_DIRECTSRC:
icmp_address_reply. Maybe we can do some magic there and
avoid using this flag. That way we can cache in
nh_rth_input without depending on it.

	The problem with rt_iif is worse. Maybe we
can cache only the first iif; other packets will see
a different iif in nh_rth_input and will get a non-cached
result. For boxes with 2 or more interfaces only one
can use the cache. One setup can have large traffic
from the LAN, another can be a server for remote clients.

	For forwarding such ambiguity should be lower
and also rt_iif is mostly used for local targets.

>  local_input:
> +	do_cache = false;
> +	if (res.fi) {
> +		if (!(flags & RTCF_DIRECTSRC) && !itag) {
> +			rth = FIB_RES_NH(res).nh_rth_input;

			rt_iif here should be same!!!

> +			if (rt_cache_valid(rth)) {
> +				dst_use(&rth->dst, jiffies);
> +				goto set_and_out;
> +			}
> +			do_cache = true;
> +		}
> +	}
> +
>  	rth = rt_dst_alloc(net->loopback_dev,
> -			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, false);
> +			   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
>  	if (!rth)
>  		goto e_nobufs;
>  
> @@ -1622,6 +1651,9 @@ local_input:
>  		rth->dst.error= -err;
>  		rth->rt_flags 	&= ~RTCF_LOCAL;
>  	}
> +	if (do_cache)
> +		rt_cache_route(&FIB_RES_NH(res), rth);
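
The "cache only the first iif" idea raised above can be sketched as a userspace toy model. The names struct toy_route, struct toy_nexthop and the toy_* helpers are hypothetical and are not the structures from the patch; the sketch only shows one cached input route per nexthop, reused when the incoming interface matches and bypassed otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached route; NOT the kernel's struct rtable. */
struct toy_route {
    int iif;    /* incoming interface the route was built for */
    bool valid; /* false once the entry has been invalidated */
};

/* Hypothetical nexthop holding a single cached input route. */
struct toy_nexthop {
    struct toy_route *rth_input;
};

/* Return the cached route only if it was created for the same iif. */
static struct toy_route *toy_cached_input(const struct toy_nexthop *nh, int iif)
{
    struct toy_route *rth;

    assert(nh != NULL);
    rth = nh->rth_input;
    if (rth && rth->valid && rth->iif == iif)
        return rth;
    return NULL; /* caller falls back to a non-cached route */
}

/* Cache a route unless a valid one is already present (first iif wins). */
static bool toy_cache_input(struct toy_nexthop *nh, struct toy_route *rth)
{
    assert(nh != NULL && rth != NULL);
    if (nh->rth_input && nh->rth_input->valid)
        return false;
    nh->rth_input = rth;
    return true;
}
```

The cost of this policy is the one described above: on a box with two or more interfaces, only traffic from the first interface benefits from the cache.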

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [net-next RFC V5 4/5] virtio_net: multiqueue support
From: Sasha Levin @ 2012-07-23  9:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: krkumar2, habanero, mashirle, kvm, Michael S. Tsirkin, netdev,
	linux-kernel, virtualization, edumazet, tahm, jwhan, davem, sri
In-Reply-To: <500CE72B.2040101@redhat.com>

On 07/23/2012 07:54 AM, Jason Wang wrote:
> On 07/21/2012 08:02 PM, Sasha Levin wrote:
>> On 07/20/2012 03:40 PM, Michael S. Tsirkin wrote:
>>>> -    err = init_vqs(vi);
>>>>> +    if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ))
>>>>> +        vi->has_cvq = true;
>>>>> +
>>> How about we disable multiqueue if there's no cvq?
>>> Will make logic a bit simpler, won't it?
>> multiqueues don't really depend on cvq. Does this added complexity really justify adding an artificial limit?
>>
> 
> Yes, it does not depend on cvq. Cvq is just used to negotiate the number of queues a guest wishes to use, which is really useful (at least for now). Since multiqueue cannot outperform single queue in every kind of workload or benchmark, we want the guest driver to use a single queue by default even when multiqueue is enabled by the management software, and let the user enable it through ethtool. That way the user does not see a regression when switching to a multiqueue-capable driver and backend.

Why would you limit it to a single vq if the user has specified a different number of vqs (>1) in the virtio-net device config?

^ permalink raw reply

