Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next] openvswitch: Create right mask with disabled megaflows
From: David Miller @ 2014-10-17 20:20 UTC (permalink / raw)
  To: pshelar; +Cc: netdev, ddiproietto, azhou
In-Reply-To: <1413521745-1492-1-git-send-email-pshelar@nicira.com>

From: Pravin B Shelar <pshelar@nicira.com>
Date: Thu, 16 Oct 2014 21:55:45 -0700

> If megaflows are disabled, the userspace does not send the netlink attribute
> OVS_FLOW_ATTR_MASK, and the kernel must create an exact match mask.
> 
> sw_flow_mask_set() sets every bytes (in 'range') of the mask to 0xff, even the
> bytes that represent padding for struct sw_flow, or the bytes that represent
> fields that may not be set during ovs_flow_extract().
> This is a problem, because when we extract a flow from a packet,
> we do not memset() anymore the struct sw_flow to 0.
> 
> This commit gets rid of sw_flow_mask_set() and introduces mask_set_nlattr(),
> which operates on the netlink attributes rather than on the mask key. Using
> this approach we are sure that only the bytes that the user provided in the
> flow are matched.
> 
> Also, if the parse_flow_mask_nlattrs() for the mask ENCAP attribute fails, we
> now return with an error.
> 
> This bug is introduced by commit 0714812134d7dcadeb7ecfbfeb18788aa7e1eaac
> ("openvswitch: Eliminate memset() from flow_extract").
> 
> Reported-by: Alex Wang <alexw@nicira.com>
> Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
> Signed-off-by: Andy Zhou <azhou@nicira.com>
> Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

Why are you targetting a bug fix like this to net-next?

That makes no sense at all.

^ permalink raw reply

* Re: [PATCH net] net: dsa: remove phy.h and phy_fixed.h inclusions
From: David Miller @ 2014-10-17 20:19 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev
In-Reply-To: <1413507355-28837-1-git-send-email-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Thu, 16 Oct 2014 17:55:55 -0700

> There is no need to include phy.h nor phy_fixed.h when we can use
> forward declarations instead to keep the include chain smaller.
> 
> Doing this unveiled that we were implicitely getting the definitions for
> struct ethtool_eee and struct ethtool_wolinfo, and that net/dsa/slave.c
> was missing an include of phy_fixed.h.
> 
> Fixes: ec9436baedb6 ("net: dsa: allow drivers to do link adjustment")
> Fixes: ce31b31c68e7 ("net: dsa: allow updating fixed PHY link information")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Please, don't do this.

Get the definitions from the header file that provides them.

The only situation where forward declarations done by hand like this
make sense is when there is absolutely no way to avoid looping header
includes.

For example in A.h you want to provide a function that takes a pointer
to foo, but foo is defined B.h which includes A.h first.

Thanks.

^ permalink raw reply

* [PATCH 1/1 net-next] netrom: use linux/uaccess.h
From: Fabian Frederick @ 2014-10-17 20:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fabian Frederick, Ralf Baechle, David S. Miller, linux-hams,
	netdev

replace asm/uaccess.h by linux/uaccess.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
---
 net/netrom/af_netrom.c | 2 +-
 net/netrom/nr_dev.c    | 2 +-
 net/netrom/nr_in.c     | 2 +-
 net/netrom/nr_out.c    | 2 +-
 net/netrom/nr_route.c  | 2 +-
 net/netrom/nr_subr.c   | 2 +-
 net/netrom/nr_timer.c  | 2 +-
 7 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index 71cf1bf..1b06a1f 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -30,7 +30,7 @@
 #include <linux/skbuff.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 #include <linux/fcntl.h>
 #include <linux/termios.h>	/* For TIOCINQ/OUTQ */
 #include <linux/mm.h>
diff --git a/net/netrom/nr_dev.c b/net/netrom/nr_dev.c
index 743262b..6ae063c 100644
--- a/net/netrom/nr_dev.c
+++ b/net/netrom/nr_dev.c
@@ -20,8 +20,8 @@
 #include <linux/in.h>
 #include <linux/if_ether.h>	/* For the statistics structure. */
 #include <linux/slab.h>
+#include <linux/uaccess.h>
 
-#include <asm/uaccess.h>
 #include <asm/io.h>
 
 #include <linux/inet.h>
diff --git a/net/netrom/nr_in.c b/net/netrom/nr_in.c
index c3073a2..80dbd0b 100644
--- a/net/netrom/nr_in.c
+++ b/net/netrom/nr_in.c
@@ -23,7 +23,7 @@
 #include <linux/skbuff.h>
 #include <net/sock.h>
 #include <net/tcp_states.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 #include <linux/fcntl.h>
 #include <linux/mm.h>
 #include <linux/interrupt.h>
diff --git a/net/netrom/nr_out.c b/net/netrom/nr_out.c
index 0b4bcb2..00fbf14 100644
--- a/net/netrom/nr_out.c
+++ b/net/netrom/nr_out.c
@@ -22,7 +22,7 @@
 #include <linux/netdevice.h>
 #include <linux/skbuff.h>
 #include <net/sock.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 #include <linux/fcntl.h>
 #include <linux/mm.h>
 #include <linux/interrupt.h>
diff --git a/net/netrom/nr_route.c b/net/netrom/nr_route.c
index b976d5e..96b64d2 100644
--- a/net/netrom/nr_route.c
+++ b/net/netrom/nr_route.c
@@ -25,7 +25,7 @@
 #include <linux/if_arp.h>
 #include <linux/skbuff.h>
 #include <net/sock.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 #include <linux/fcntl.h>
 #include <linux/termios.h>	/* For TIOCINQ/OUTQ */
 #include <linux/mm.h>
diff --git a/net/netrom/nr_subr.c b/net/netrom/nr_subr.c
index ca40e22..029c8bb 100644
--- a/net/netrom/nr_subr.c
+++ b/net/netrom/nr_subr.c
@@ -22,7 +22,7 @@
 #include <linux/skbuff.h>
 #include <net/sock.h>
 #include <net/tcp_states.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 #include <linux/fcntl.h>
 #include <linux/mm.h>
 #include <linux/interrupt.h>
diff --git a/net/netrom/nr_timer.c b/net/netrom/nr_timer.c
index ff2c1b1..94d0580 100644
--- a/net/netrom/nr_timer.c
+++ b/net/netrom/nr_timer.c
@@ -23,7 +23,7 @@
 #include <linux/skbuff.h>
 #include <net/sock.h>
 #include <net/tcp_states.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
 #include <linux/fcntl.h>
 #include <linux/mm.h>
 #include <linux/interrupt.h>
-- 
1.9.3


^ permalink raw reply related

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-17 19:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Laight, Jiafei.Pan@freescale.com,
	David Miller, jkosina@suse.cz, netdev@vger.kernel.org,
	LeoLi@freescale.com, linux-doc@vger.kernel.org
In-Reply-To: <54417044.5090602@gmail.com>

On Fri, 2014-10-17 at 12:38 -0700, Alexander Duyck wrote:

> That's not what I am saying, but there is a trade-off we always have to
> take into account.  Cutting memory overhead will likely have an impact
> on performance.  I would like to make the best informed trade-off in
> that regard rather than just assuming worst case always for the driver.

It seems you misunderstood me. You believe I suggested doing another
allocation strategy in the drivers.

This was not the case.

This allocation strategy is wonderful. I repeat : This is wonderful.

We only have to make sure we do not fool memory management layers, when
they do not understand where the memory is.

Apparently you think it is hard, while it really is not.

^ permalink raw reply

* Re: something is wrong in commit 971f10eca1 - tcp: better TCP_SKB_CB layout to reduce cache line misses
From: Eric Dumazet @ 2014-10-17 19:48 UTC (permalink / raw)
  To: David Miller; +Cc: cwang, kkolasa, edumazet, netdev
In-Reply-To: <20141017.152923.678515861731659347.davem@davemloft.net>

On Fri, 2014-10-17 at 15:29 -0400, David Miller wrote:
> So I applied Cong's patches, but if I am to understand the testing feedback
> so far only the 64-bit case of the reporter's problems is fixed, and that
> the 32-bit graphics corruption still happens.
> 
> Can I get some clarification of where we stand after Cong's fixes
> which are already merged into 'net'?

I spent some time on the (awful) memmove() implementation on x86_32, and
could not find a bug in it.

It would be worth trying doing 2 memcpy() instead just in case...

^ permalink raw reply

* [PATCH net] bna: fix skb->truesize underestimation
From: Eric Dumazet @ 2014-10-17 19:45 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Rasesh Mody

From: Eric Dumazet <edumazet@google.com>

skb->truesize is not meant to be tracking amount of used bytes
in an skb, but amount of reserved/consumed bytes in memory.

For instance, if we use a single byte in last page fragment,
we have to account the full size of the fragment.

skb->truesize can be very different from skb->len, that has
a very specific safety purpose.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Rasesh Mody <rasesh.mody@qlogic.com>
---
 drivers/net/ethernet/brocade/bna/bnad.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/brocade/bna/bnad.c b/drivers/net/ethernet/brocade/bna/bnad.c
index 153cafac323c..c3861de9dc81 100644
--- a/drivers/net/ethernet/brocade/bna/bnad.c
+++ b/drivers/net/ethernet/brocade/bna/bnad.c
@@ -552,6 +552,7 @@ bnad_cq_setup_skb_frags(struct bna_rcb *rcb, struct sk_buff *skb,
 
 		len = (vec == nvecs) ?
 			last_fraglen : unmap->vector.len;
+		skb->truesize += unmap->vector.len;
 		totlen += len;
 
 		skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
@@ -563,7 +564,6 @@ bnad_cq_setup_skb_frags(struct bna_rcb *rcb, struct sk_buff *skb,
 
 	skb->len += totlen;
 	skb->data_len += totlen;
-	skb->truesize += totlen;
 }
 
 static inline void

^ permalink raw reply related

* Re: [PATCH] staging: octeon: Replace memcpy with ETH_ALEN
From: Joe Perches @ 2014-10-17 19:40 UTC (permalink / raw)
  To: Balavasu; +Cc: netdev, linux-kernel
In-Reply-To: <20141017184705.GA17944@vasu-Inspiron-3542>

On Sat, 2014-10-18 at 00:17 +0530, Balavasu wrote:
> This patch fixes the checkpatch.pl issue
> WARNING: memcpy(dev->dev_addr, mac, ETH_ALEN);

That is not the actual warning.

This is the warning:

Prefer ether_addr_copy() over memcpy() if the Ethernet addresses are __aligned(2)

How have you made sure that mac can not have odd alignment?

Make sure your commit message shows how you know
that mac is appropriately aligned.

> diff --git a/drivers/staging/octeon/ethernet.c b/drivers/staging/octeon/ethernet.c
[]
> @@ -452,7 +452,7 @@ int cvm_oct_common_init(struct net_device *dev)
>  		mac = of_get_mac_address(priv->of_node);
>  
>  	if (mac)
> -		memcpy(dev->dev_addr, mac, ETH_ALEN);
> +		ether_addr_copy(dev->dev_addr, mac);
>  	else
>  		eth_hw_addr_random(dev);
>  

^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-17 19:38 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Duyck
  Cc: David Laight, Jiafei.Pan@freescale.com, David Miller,
	jkosina@suse.cz, netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <1413572539.25949.17.camel@edumazet-glaptop2.roam.corp.google.com>

On 10/17/2014 12:02 PM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 11:28 -0700, Alexander Duyck wrote:
>
>> So long as it doesn't impact performance significantly I am fine with 
>> it.  My concern is that you are bringing up issues that none of the 
>> customers were brining up when I was at Intel, and the fixes you are 
>> proposing are likely to result in customers seeing things they will 
>> report as issues.
> We regularly hit these issues at Google.
>
> We have memory containers, and we detect quite easily if some layer is
> lying, because we cant afford having 20% of headroom on our servers.

Data is useful here.  If you can give enough data about the setup to
reproduce it then we can actually start looking at fixing it.  Otherwise
it is just your anecdotal data versus mine.

> I am not claiming IGB is the only offender.
>
> I am sorry if you believe it was an attack on IGB, or any network
> driver.

My concern is that igb is guilty of being off by at most a factor of 2. 
What about the drivers and implementations that are off by possibly much
larger values?  I'd be much more interested in this if there was data to
back up your position.  Right now it is mostly just conjecture and my
concern is that this type of change may cause more harm than good.

> truesize should really be the thing that protects us from OOM,
> and apparently driver authors hitting TCP performance problems
> thinks it is better to simply let TCP do no garbage collection.

One key point to keep in mind is that the igb performance should take a
pretty hard hit if pages aren't being freed.  The overhead difference
between the page reuse path vs non-page reuse is pretty significant.  If
this is a scenerio you are actually seeing this would be of interest.

> It seems that nobody cares or even understand what I am saying,
> so I should probably not care and suggest Google to buy PetaBytes of
> memory, right ?

That's not what I am saying, but there is a trade-off we always have to
take into account.  Cutting memory overhead will likely have an impact
on performance.  I would like to make the best informed trade-off in
that regard rather than just assuming worst case always for the driver.

Thanks,

Alex

^ permalink raw reply

* [PATCH] dsa: Fix conversion from host device to mii bus
From: Guenter Roeck @ 2014-10-17 19:30 UTC (permalink / raw)
  To: David S. Miller
  Cc: Florian Fainelli, Alexander Duyck, netdev, linux-kernel,
	Guenter Roeck

Commit b4d2394d01bc ("dsa: Replace mii_bus with a generic host device")
replaces mii_bus with a generic host_dev, and introduces
dsa_host_dev_to_mii_bus() to support conversion from host_dev to mii_bus.
However, in some cases it uses to_mii_bus to perform that conversion.
Since host_dev is not the phy bus device but typically a platform device,
this fails and results in a crash with the affected drivers.

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff81781d35>] __mutex_lock_slowpath+0x75/0x100
PGD 406783067 PUD 406784067 PMD 0
Oops: 0002 [#1] SMP
...
Call Trace:
[<ffffffff810a538b>] ? pick_next_task_fair+0x61b/0x880
[<ffffffff81781de3>] mutex_lock+0x23/0x37
[<ffffffff81533244>] mdiobus_read+0x34/0x60
[<ffffffff8153b95a>] __mv88e6xxx_reg_read+0x8a/0xa0
[<ffffffff8153b9bc>] mv88e6xxx_reg_read+0x4c/0xa0

Fixes: b4d2394d01bc ("dsa: Replace mii_bus with a generic host device")
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
 drivers/net/dsa/mv88e6060.c | 16 ++++++++++++----
 drivers/net/dsa/mv88e6xxx.c | 14 ++++++++++----
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/mv88e6060.c b/drivers/net/dsa/mv88e6060.c
index 776e965..05b0ca3 100644
--- a/drivers/net/dsa/mv88e6060.c
+++ b/drivers/net/dsa/mv88e6060.c
@@ -21,8 +21,12 @@
 
 static int reg_read(struct dsa_switch *ds, int addr, int reg)
 {
-	return mdiobus_read(to_mii_bus(ds->master_dev),
-			    ds->pd->sw_addr + addr, reg);
+	struct mii_bus *bus = dsa_host_dev_to_mii_bus(ds->master_dev);
+
+	if (bus == NULL)
+		return -EINVAL;
+
+	return mdiobus_read(bus, ds->pd->sw_addr + addr, reg);
 }
 
 #define REG_READ(addr, reg)					\
@@ -38,8 +42,12 @@ static int reg_read(struct dsa_switch *ds, int addr, int reg)
 
 static int reg_write(struct dsa_switch *ds, int addr, int reg, u16 val)
 {
-	return mdiobus_write(to_mii_bus(ds->master_dev),
-			     ds->pd->sw_addr + addr, reg, val);
+	struct mii_bus *bus = dsa_host_dev_to_mii_bus(ds->master_dev);
+
+	if (bus == NULL)
+		return -EINVAL;
+
+	return mdiobus_write(bus, ds->pd->sw_addr + addr, reg, val);
 }
 
 #define REG_WRITE(addr, reg, val)				\
diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index d6f6428..a6c90cf 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -75,11 +75,14 @@ int __mv88e6xxx_reg_read(struct mii_bus *bus, int sw_addr, int addr, int reg)
 int mv88e6xxx_reg_read(struct dsa_switch *ds, int addr, int reg)
 {
 	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+	struct mii_bus *bus = dsa_host_dev_to_mii_bus(ds->master_dev);
 	int ret;
 
+	if (bus == NULL)
+		return -EINVAL;
+
 	mutex_lock(&ps->smi_mutex);
-	ret = __mv88e6xxx_reg_read(to_mii_bus(ds->master_dev),
-				   ds->pd->sw_addr, addr, reg);
+	ret = __mv88e6xxx_reg_read(bus, ds->pd->sw_addr, addr, reg);
 	mutex_unlock(&ps->smi_mutex);
 
 	return ret;
@@ -119,11 +122,14 @@ int __mv88e6xxx_reg_write(struct mii_bus *bus, int sw_addr, int addr,
 int mv88e6xxx_reg_write(struct dsa_switch *ds, int addr, int reg, u16 val)
 {
 	struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
+	struct mii_bus *bus = dsa_host_dev_to_mii_bus(ds->master_dev);
 	int ret;
 
+	if (bus == NULL)
+		return -EINVAL;
+
 	mutex_lock(&ps->smi_mutex);
-	ret = __mv88e6xxx_reg_write(to_mii_bus(ds->master_dev),
-				    ds->pd->sw_addr, addr, reg, val);
+	ret = __mv88e6xxx_reg_write(bus, ds->pd->sw_addr, addr, reg, val);
 	mutex_unlock(&ps->smi_mutex);
 
 	return ret;
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH] ipv4: dst_entry leak in ip_send_unicast_reply()
From: David Miller @ 2014-10-17 19:30 UTC (permalink / raw)
  To: vvs; +Cc: eric.dumazet, netdev, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <543E6762.9090807@parallels.com>

From: Vasily Averin <vvs@parallels.com>
Date: Wed, 15 Oct 2014 16:24:02 +0400

> Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
> 
> ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork
> and in case errors in __ip_append_data() nobody frees stolen dst entry
> 
> Signed-off-by: Vasily Averin <vvs@parallels.com>

Applied and queued up for -stable, thanks everyone!

^ permalink raw reply

* Re: something is wrong in commit 971f10eca1 - tcp: better TCP_SKB_CB layout to reduce cache line misses
From: David Miller @ 2014-10-17 19:29 UTC (permalink / raw)
  To: eric.dumazet; +Cc: cwang, kkolasa, edumazet, netdev
In-Reply-To: <1413409354.17186.2.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 15 Oct 2014 14:42:34 -0700

> On Wed, 2014-10-15 at 14:39 -0700, Eric Dumazet wrote:
>> On Wed, 2014-10-15 at 13:36 -0700, Cong Wang wrote:
>> > On Wed, Oct 15, 2014 at 3:35 AM, Krzysztof Kolasa <kkolasa@winsoft.pl> wrote:
>> > >
>> > > one-line patch not resolve problem, fix created by Cong Wang resolves
>> > > problem !!!
>> > >
>> > > tested on 64bit system and working OK
>> > >
>> > 
>> > So my patch fixes your problem on 64bit but not on 32bit,
>> > is this correct?
>> > 
>> > I will send out the patch. And as Eric said, the problem on 32bit
>> > might be a different issue, we can definitely fix it separately.
>> > 
>> > Thanks for testing!
>> 
>> I think more fixes are needed.
>> 
>> This is most probably an IPv6 issue.
>> 
> 
> Yes, ipv6 uses inet6_iif() a lot. We need to use a different accessor.

So I applied Cong's patches, but if I am to understand the testing feedback
so far only the 64-bit case of the reporter's problems is fixed, and that
the 32-bit graphics corruption still happens.

Can I get some clarification of where we stand after Cong's fixes
which are already merged into 'net'?

Thanks.

^ permalink raw reply

* [PATCH net 1/1] tipc: fix bug in bundled buffer reception
From: Jon Maloy @ 2014-10-17 19:25 UTC (permalink / raw)
  To: davem
  Cc: netdev, Paul Gortmaker, erik.hugne, ying.xue, maloy,
	tipc-discussion, Jon Maloy

In commit ec8a2e5621db2da24badb3969eda7fd359e1869f ("tipc: same receive
code path for connection protocol and data messages") we omitted the
the possiblilty that an arriving message extracted from a bundle buffer
may be a multicast message. Such messages need to be to be delivered to
the socket via a separate function, tipc_sk_mcast_rcv(). As a result,
small multicast messages arriving as members of a bundle buffer will be
silently dropped.

This commit corrects the error by considering this case in the function
tipc_link_bundle_rcv().

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
---
 net/tipc/link.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 65410e1..1db162a 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -1924,7 +1924,12 @@ void tipc_link_bundle_rcv(struct sk_buff *buf)
 		}
 		omsg = buf_msg(obuf);
 		pos += align(msg_size(omsg));
-		if (msg_isdata(omsg) || (msg_user(omsg) == CONN_MANAGER)) {
+		if (msg_isdata(omsg)) {
+			if (unlikely(msg_type(omsg) == TIPC_MCAST_MSG))
+				tipc_sk_mcast_rcv(obuf);
+			else
+				tipc_sk_rcv(obuf);
+		} else if (msg_user(omsg) == CONN_MANAGER) {
 			tipc_sk_rcv(obuf);
 		} else if (msg_user(omsg) == NAME_DISTRIBUTOR) {
 			tipc_named_rcv(obuf);
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-17 19:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, Jiafei.Pan@freescale.com, David Miller,
	jkosina@suse.cz, netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <54415FEA.5000705@redhat.com>

On Fri, 2014-10-17 at 11:28 -0700, Alexander Duyck wrote:

> So long as it doesn't impact performance significantly I am fine with 
> it.  My concern is that you are bringing up issues that none of the 
> customers were brining up when I was at Intel, and the fixes you are 
> proposing are likely to result in customers seeing things they will 
> report as issues.

We regularly hit these issues at Google.

We have memory containers, and we detect quite easily if some layer is
lying, because we cant afford having 20% of headroom on our servers.

I am not claiming IGB is the only offender.

I am sorry if you believe it was an attack on IGB, or any network
driver.

truesize should really be the thing that protects us from OOM,
and apparently driver authors hitting TCP performance problems
thinks it is better to simply let TCP do no garbage collection.

It seems that nobody cares or even understand what I am saying,
so I should probably not care and suggest Google to buy PetaBytes of
memory, right ?




^ permalink raw reply

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-17 18:53 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, Jiafei.Pan@freescale.com, David Miller,
	jkosina@suse.cz, netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <54415FEA.5000705@redhat.com>

On Fri, 2014-10-17 at 11:28 -0700, Alexander Duyck wrote:

> Yes, I am well aware of this bit.  That is my concern.  You increase the 
> size for truesize, it will cut the amount of queueing in half, and then 
> igb will start seeing drops when it has to deal with bursty traffic and 
> people will start to complain about a performance regression.  That is 
> the bit I want to avoid.

Then, what about addressing the issue for good, instead of working
around it ?

Don't worry, I will work on it.

We also are working on a direct placement in memory for TCP receive
path. (equivalent of sendfile() but for receiver), to avoid the need to 
hold payload in kernel memory (out of order queues)




^ permalink raw reply

* [PATCH] staging: octeon: Replace memcpy with ETH_ALEN
From: Balavasu @ 2014-10-17 18:47 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

This patch fixes the checkpatch.pl issue
WARNING: memcpy(dev->dev_addr, mac, ETH_ALEN);

Signed-off-by: Balavasu <kp.balavasu@gmail.com>
---
 drivers/staging/octeon/ethernet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/octeon/ethernet.c b/drivers/staging/octeon/ethernet.c
index 8f9e3fb..47d4277 100644
--- a/drivers/staging/octeon/ethernet.c
+++ b/drivers/staging/octeon/ethernet.c
@@ -452,7 +452,7 @@ int cvm_oct_common_init(struct net_device *dev)
 		mac = of_get_mac_address(priv->of_node);
 
 	if (mac)
-		memcpy(dev->dev_addr, mac, ETH_ALEN);
+		ether_addr_copy(dev->dev_addr, mac);
 	else
 		eth_hw_addr_random(dev);
 
-- 
1.9.1

^ permalink raw reply related

* [PATCH] Move rlb table dump to procfs
From: Eric B Munson @ 2014-10-17 18:41 UTC (permalink / raw)
  To: netdev
  Cc: Andy Gospodarek, Veaceslav Falico, Jay Vosburgh, linux-kernel,
	Eric B Munson

Because debugfs is not net namespace aware, it is currently not
possible to dump the RLB hash table when network namespaces are enabled.
This patch creates a second proc file called IFNAME_rlb_table for
bonding interfaces that use adaptive load balancing which will expose
this table to userspace.  It also removes all the debugfs code from the
bonding module as the RLB hash was the only entry in debugfs.

Signed-off-by: Eric B Munson <emunson@akamai.com>
---
 drivers/net/bonding/Makefile       |    2 +-
 drivers/net/bonding/bond_debugfs.c |  141 ------------------------------------
 drivers/net/bonding/bond_main.c    |   11 ---
 drivers/net/bonding/bond_procfs.c  |   75 ++++++++++++++++++-
 drivers/net/bonding/bonding.h      |   11 +--
 5 files changed, 74 insertions(+), 166 deletions(-)
 delete mode 100644 drivers/net/bonding/bond_debugfs.c

diff --git a/drivers/net/bonding/Makefile b/drivers/net/bonding/Makefile
index 6f4e808..409c5e3 100644
--- a/drivers/net/bonding/Makefile
+++ b/drivers/net/bonding/Makefile
@@ -4,7 +4,7 @@
 
 obj-$(CONFIG_BONDING) += bonding.o
 
-bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_debugfs.o bond_netlink.o bond_options.o
+bonding-objs := bond_main.o bond_3ad.o bond_alb.o bond_sysfs.o bond_sysfs_slave.o bond_netlink.o bond_options.o
 
 proc-$(CONFIG_PROC_FS) += bond_procfs.o
 bonding-objs += $(proc-y)
diff --git a/drivers/net/bonding/bond_debugfs.c b/drivers/net/bonding/bond_debugfs.c
deleted file mode 100644
index 8f99082..0000000
--- a/drivers/net/bonding/bond_debugfs.c
+++ /dev/null
@@ -1,141 +0,0 @@
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/device.h>
-#include <linux/netdevice.h>
-
-#include "bonding.h"
-#include "bond_alb.h"
-
-#if defined(CONFIG_DEBUG_FS) && !defined(CONFIG_NET_NS)
-
-#include <linux/debugfs.h>
-#include <linux/seq_file.h>
-
-static struct dentry *bonding_debug_root;
-
-/* Show RLB hash table */
-static int bond_debug_rlb_hash_show(struct seq_file *m, void *v)
-{
-	struct bonding *bond = m->private;
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct rlb_client_info *client_info;
-	u32 hash_index;
-
-	if (BOND_MODE(bond) != BOND_MODE_ALB)
-		return 0;
-
-	seq_printf(m, "SourceIP        DestinationIP   "
-			"Destination MAC   DEV\n");
-
-	spin_lock_bh(&bond->mode_lock);
-
-	hash_index = bond_info->rx_hashtbl_used_head;
-	for (; hash_index != RLB_NULL_INDEX;
-	     hash_index = client_info->used_next) {
-		client_info = &(bond_info->rx_hashtbl[hash_index]);
-		seq_printf(m, "%-15pI4 %-15pI4 %-17pM %s\n",
-			&client_info->ip_src,
-			&client_info->ip_dst,
-			&client_info->mac_dst,
-			client_info->slave->dev->name);
-	}
-
-	spin_unlock_bh(&bond->mode_lock);
-
-	return 0;
-}
-
-static int bond_debug_rlb_hash_open(struct inode *inode, struct file *file)
-{
-	return single_open(file, bond_debug_rlb_hash_show, inode->i_private);
-}
-
-static const struct file_operations bond_debug_rlb_hash_fops = {
-	.owner		= THIS_MODULE,
-	.open		= bond_debug_rlb_hash_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= single_release,
-};
-
-void bond_debug_register(struct bonding *bond)
-{
-	if (!bonding_debug_root)
-		return;
-
-	bond->debug_dir =
-		debugfs_create_dir(bond->dev->name, bonding_debug_root);
-
-	if (!bond->debug_dir) {
-		netdev_warn(bond->dev, "failed to register to debugfs\n");
-		return;
-	}
-
-	debugfs_create_file("rlb_hash_table", 0400, bond->debug_dir,
-				bond, &bond_debug_rlb_hash_fops);
-}
-
-void bond_debug_unregister(struct bonding *bond)
-{
-	if (!bonding_debug_root)
-		return;
-
-	debugfs_remove_recursive(bond->debug_dir);
-}
-
-void bond_debug_reregister(struct bonding *bond)
-{
-	struct dentry *d;
-
-	if (!bonding_debug_root)
-		return;
-
-	d = debugfs_rename(bonding_debug_root, bond->debug_dir,
-			   bonding_debug_root, bond->dev->name);
-	if (d) {
-		bond->debug_dir = d;
-	} else {
-		netdev_warn(bond->dev, "failed to reregister, so just unregister old one\n");
-		bond_debug_unregister(bond);
-	}
-}
-
-void bond_create_debugfs(void)
-{
-	bonding_debug_root = debugfs_create_dir("bonding", NULL);
-
-	if (!bonding_debug_root) {
-		pr_warn("Warning: Cannot create bonding directory in debugfs\n");
-	}
-}
-
-void bond_destroy_debugfs(void)
-{
-	debugfs_remove_recursive(bonding_debug_root);
-	bonding_debug_root = NULL;
-}
-
-
-#else /* !CONFIG_DEBUG_FS */
-
-void bond_debug_register(struct bonding *bond)
-{
-}
-
-void bond_debug_unregister(struct bonding *bond)
-{
-}
-
-void bond_debug_reregister(struct bonding *bond)
-{
-}
-
-void bond_create_debugfs(void)
-{
-}
-
-void bond_destroy_debugfs(void)
-{
-}
-
-#endif /* CONFIG_DEBUG_FS */
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c9ac06c..d6aea43 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2778,8 +2778,6 @@ static int bond_event_changename(struct bonding *bond)
 	bond_remove_proc_entry(bond);
 	bond_create_proc_entry(bond);
 
-	bond_debug_reregister(bond);
-
 	return NOTIFY_DONE;
 }
 
@@ -4039,8 +4037,6 @@ static void bond_uninit(struct net_device *bond_dev)
 	}
 
 	list_del(&bond->bond_list);
-
-	bond_debug_unregister(bond);
 }
 
 /*------------------------- Module initialization ---------------------------*/
@@ -4432,8 +4428,6 @@ static int bond_init(struct net_device *bond_dev)
 
 	bond_prepare_sysfs_group(bond);
 
-	bond_debug_register(bond);
-
 	/* Ensure valid dev_addr */
 	if (is_zero_ether_addr(bond_dev->dev_addr) &&
 	    bond_dev->addr_assign_type == NET_ADDR_PERM)
@@ -4538,8 +4532,6 @@ static int __init bonding_init(void)
 	if (res)
 		goto err_link;
 
-	bond_create_debugfs();
-
 	for (i = 0; i < max_bonds; i++) {
 		res = bond_create(&init_net, NULL);
 		if (res)
@@ -4550,7 +4542,6 @@ static int __init bonding_init(void)
 out:
 	return res;
 err:
-	bond_destroy_debugfs();
 	bond_netlink_fini();
 err_link:
 	unregister_pernet_subsys(&bond_net_ops);
@@ -4562,8 +4553,6 @@ static void __exit bonding_exit(void)
 {
 	unregister_netdevice_notifier(&bond_netdev_notifier);
 
-	bond_destroy_debugfs();
-
 	bond_netlink_fini();
 	unregister_pernet_subsys(&bond_net_ops);
 
diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c
index a3948f8..30693bc 100644
--- a/drivers/net/bonding/bond_procfs.c
+++ b/drivers/net/bonding/bond_procfs.c
@@ -231,10 +231,55 @@ static const struct file_operations bond_info_fops = {
 	.release = seq_release,
 };
 
+static int bond_table_seq_show(struct seq_file *seq, void *v)
+{
+	struct bonding *bond = seq->private;
+	struct alb_bond_info *bond_info;
+	struct rlb_client_info *client_info;
+	u32 hash_index;
+
+	/* make sure the bond won't be taken away */
+	rcu_read_lock();
+	bond_info = &(BOND_ALB_INFO(bond));
+
+	seq_printf(seq, "%-15s %-17s %-15s %-17s %s\n", "SourceIP",
+		   "Source MAC", "DestinationIP", "Destination MAC", "DEV");
+
+	spin_lock_bh(&bond->mode_lock);
+
+	hash_index = bond_info->rx_hashtbl_used_head;
+	for (; hash_index != RLB_NULL_INDEX;
+	     hash_index = client_info->used_next) {
+		client_info = &bond_info->rx_hashtbl[hash_index];
+		seq_printf(seq, "%-15pI4 %-17pM %-15pI4 %-17pM %s\n",
+			   &client_info->ip_src, client_info->mac_src,
+			   &client_info->ip_dst, client_info->mac_dst,
+			   client_info->slave->dev->name);
+	}
+
+	spin_unlock_bh(&bond->mode_lock);
+	rcu_read_unlock();
+	return 0;
+}
+
+static int bond_table_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, bond_table_seq_show, PDE_DATA(inode));
+}
+
+static const struct file_operations bond_table_fops = {
+	.owner   = THIS_MODULE,
+	.open    = bond_table_open,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.release = seq_release,
+};
+
 void bond_create_proc_entry(struct bonding *bond)
 {
 	struct net_device *bond_dev = bond->dev;
 	struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
+	char table_name[IFNAMSIZ + 10];
 
 	if (bn->proc_dir) {
 		bond->proc_entry = proc_create_data(bond_dev->name,
@@ -245,6 +290,21 @@ void bond_create_proc_entry(struct bonding *bond)
 				    DRV_NAME, bond_dev->name);
 		else
 			memcpy(bond->proc_file_name, bond_dev->name, IFNAMSIZ);
+
+		if (bond->params.mode == BOND_MODE_ALB) {
+			snprintf(table_name, IFNAMSIZ + 10, "%s_rlb_table",
+				 bond_dev->name);
+			bond->proc_table = proc_create_data(table_name, S_IRUGO,
+							    bn->proc_dir,
+							    &bond_table_fops,
+							    bond);
+			if (bond->proc_table == NULL)
+				pr_warn("Warning: Cannot create /proc/net/%s/%s\n",
+					DRV_NAME, table_name);
+			else
+				memcpy(bond->proc_table_name, table_name,
+				       IFNAMSIZ + 10);
+		}
 	}
 }
 
@@ -253,10 +313,17 @@ void bond_remove_proc_entry(struct bonding *bond)
 	struct net_device *bond_dev = bond->dev;
 	struct bond_net *bn = net_generic(dev_net(bond_dev), bond_net_id);
 
-	if (bn->proc_dir && bond->proc_entry) {
-		remove_proc_entry(bond->proc_file_name, bn->proc_dir);
-		memset(bond->proc_file_name, 0, IFNAMSIZ);
-		bond->proc_entry = NULL;
+	if (bn->proc_dir) {
+		if (bond->proc_entry) {
+			remove_proc_entry(bond->proc_file_name, bn->proc_dir);
+			memset(bond->proc_file_name, 0, IFNAMSIZ);
+			bond->proc_entry = NULL;
+		}
+		if (bond->proc_table) {
+			remove_proc_entry(bond->proc_table_name, bn->proc_dir);
+			memset(bond->proc_table_name, 0, IFNAMSIZ + 10);
+			bond->proc_table = NULL;
+		}
 	}
 }
 
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 10920f0..714222a 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -217,6 +217,8 @@ struct bonding {
 #ifdef CONFIG_PROC_FS
 	struct   proc_dir_entry *proc_entry;
 	char     proc_file_name[IFNAMSIZ];
+	struct   proc_dir_entry *proc_table;
+	char     proc_table_name[IFNAMSIZ + 10];
 #endif /* CONFIG_PROC_FS */
 	struct   list_head bond_list;
 	u32      rr_tx_counter;
@@ -230,10 +232,6 @@ struct bonding {
 	struct   delayed_work ad_work;
 	struct   delayed_work mcast_work;
 	struct   delayed_work slave_arr_work;
-#ifdef CONFIG_DEBUG_FS
-	/* debugging support via debugfs */
-	struct	 dentry *debug_dir;
-#endif /* CONFIG_DEBUG_FS */
 	struct rtnl_link_stats64 bond_stats;
 };
 
@@ -527,11 +525,6 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev);
 u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb);
 void bond_select_active_slave(struct bonding *bond);
 void bond_change_active_slave(struct bonding *bond, struct slave *new_active);
-void bond_create_debugfs(void);
-void bond_destroy_debugfs(void);
-void bond_debug_register(struct bonding *bond);
-void bond_debug_unregister(struct bonding *bond);
-void bond_debug_reregister(struct bonding *bond);
 const char *bond_mode_name(int mode);
 void bond_setup(struct net_device *bond_dev);
 unsigned int bond_get_num_tx_queues(void);
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-17 18:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Laight, Jiafei.Pan@freescale.com, David Miller,
	jkosina@suse.cz, netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <1413564913.24709.7.camel@edumazet-glaptop2.roam.corp.google.com>


On 10/17/2014 09:55 AM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 07:40 -0700, Alexander Duyck wrote:
>> On 10/17/2014 02:11 AM, David Laight wrote:
>>> From: Alexander Duyck
>>> ...
>>>> Actually the likelihood of anything holding onto the 4K page for very
>>>> long doesn't seem to occur, at least from the drivers perspective.  It
>>>> is one of the reasons why I went for the page reuse approach rather than
>>>> just partitioning a single large page.  It allows us to avoid having to
>>>> call IOMMU map/unmap for the pages since the entire page is usually back
>>>> in the driver ownership before we need to reuse the portion given to the
>>>> stack.
>>> That is almost certainly true for most benchmarks, benchmark processes
>>> consume receive data.
>>> But what about real life situations?
>>>
>>> There must be some 'normal' workloads where receive data doesn't get consumed.
>>>
>>> 	David
>>>
>> Yes, but for workloads where receive data doesn't get consumed it is
>> very unlikely that much receive data is generated.
> This is very optimistic.
>
> Any kind of flood can generate 5 or 6 Million packets per second.

That is fine.  The first 256 (default descriptor ring size) might be 4K 
while reporting truesize of 2K, after that each page is guaranteed to be 
split in half so we get at least 2 uses per page.

> So in stress condition, we possibly consume twice more ram than alloted
> in tcp_mem.  (About 3GBytes per second, think about it)

I see what you are trying to get at, but I don't see how my scenerio is 
worse then the setups that use a large page and partition it.

> This is fine, if admins are aware of that and can adjust tcp_mem
> accordingly to this.

I can say I have never had a single report of of us feeding too much 
memory to the sockets, if anything the complaints I have seen have 
always been that the socket is being starved due to too much memory 
being used to move small packets.  That is one of the reasons I decided 
we had to have a copy-break built in for packets 256B and smaller.  It 
doesn't make much sense to allocate 2K + ~1K (skb + skb->head) for 256B 
or less of payload data.

> Apparently none of your customers suffered from this, maybe they had
> enough headroom to absorb the over commit or they trusted us and could not
> find culprit if they had issues.

Correct.  I've never received complaints about memory overcommit. Like I 
have said in most cases we are always getting the page back anyway so we 
usually get a good ROI on the page recycling.

> Open 50,000 tcp sockets, do not read data on 50% of them (pretend you
> are busy on disk access or doing cpu intensive work). As traffic is interleaved
> (between consumed data and non consumed data), you'll have the side
> effect of consuming more ram than advertised.

Yes, but in the case we are talking about it is only off by a factor of 
2.  How do you account for the setups such as the code for allocating an 
skb that is allocating a 32K page over multiple frames. In your setup I 
would suspect it wouldn't be uncommon for the socket to end up with 
multiple spots where only a few sockets are holding the entire 32K page 
for some period of time.  So does that mean we should hit anybody that 
uses netdev_alloc_skb with the overhead for 32K since there are 
scenarios where that can happen?

> Compare /proc/net/protocols (grep TCP /proc/net/protocols) and output of
> 'free', and you'll see that we are not good citizens.

I'm assuming there is some sort of test I should be running while I do 
this?  Otherwise the current dump of those is not too interesting 
currently because my system is idle.

> I will work on TCP stack, to go beyond what I did in commit
> b49960a05e3212 ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
>
> So that TCP should not care if a driver chose to potentially use 4K per
> MSS.

So long as it doesn't impact performance significantly I am fine with 
it.  My concern is that you are bringing up issues that none of the 
customers were brining up when I was at Intel, and the fixes you are 
proposing are likely to result in customers seeing things they will 
report as issues.

> Right now, it seems we can drop few packets, and get a slight reduction in
> throughput (TCP is very sensitive to losses, even if we drop 0.1 % of packets)

Yes, I am well aware of this bit.  That is my concern.  You increase the 
size for truesize, it will cut the amount of queueing in half, and then 
igb will start seeing drops when it has to deal with bursty traffic and 
people will start to complain about a performance regression.  That is 
the bit I want to avoid.

Thanks,

Alex


^ permalink raw reply

* Re: [Patch net] ipv4: clear all TCP_SKB_CB before passing to network layer
From: Cong Wang @ 2014-10-17 18:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Cong Wang, netdev, David Miller, Krzysztof Kolasa, Eric Dumazet
In-Reply-To: <1413569928.25949.5.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, Oct 17, 2014 at 11:18 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2014-10-17 at 10:49 -0700, Cong Wang wrote:
>> From: Cong Wang <cwang@twopensource.com>
>>
>> Probably not a big deal, but IP is not the only network protocol,
>> don't clear skb->cb just for IP.
>>
>> Also, IPv6 header is not always defined in struct tcp_skb_cb.
>>
>> Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Cong Wang <cwang@twopensource.com>
>> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
>> ---
>>  net/ipv4/tcp_output.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index e13d778..ee356e5 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -1005,9 +1005,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>>       /* Our usage of tstamp should remain private */
>>       skb->tstamp.tv64 = 0;
>>
>> -     /* Cleanup our debris for IP stacks */
>> -     memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
>> -                            sizeof(struct inet6_skb_parm)));
>> +     memset(TCP_SKB_CB(skb), 0, sizeof(*TCP_SKB_CB(skb)));
>>
>>       err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
>>
>
>
> Usually, each layer is responsible for clearing skb->cb[] at its entry
> point. Or more exactly it does not care of previous garbage.
>
> There is no evidence your patch is needed.

This is why I said "probably not a big deal" in changelog, I never
say it fixes anything, just a cleanup.

>
> I was maybe too defensive when I added this, because I wanted to make
> only TCP changes.
>
> We should instead remove the memset() in TCP and fix IP/IPv6 if
> necessary.

That works too and sounds better.

>
> But this should wait net-next being open.
>
>

Yeah, agreed, I will update and resend when net-next is reopen.

Thanks.

^ permalink raw reply

* Re: [PATCH net] ipv6: introduce tcp_v6_iif()
From: Eric Dumazet @ 2014-10-17 18:19 UTC (permalink / raw)
  To: Cong Wang; +Cc: David Miller, netdev
In-Reply-To: <CAHA+R7Od13Bnn3wciLNrA-RsYuYBAjvwAzc20dgGtYT_=EX88w@mail.gmail.com>

On Fri, 2014-10-17 at 09:58 -0700, Cong Wang wrote:

> I doubt we still need to store iif in IP6CB() since we have skb->skb_iif,
> we can probably just make inet6_iif() be like inet_iif() so that IP6CB()->iif
> can be just removed? Does this make any sense to you?
> 

This makes sense and would gain 4 bytes in skb->cb[], so definitely
worth it.

> (I have a patch locally since I thought it should be target for net-next.)
> 

^ permalink raw reply

* Re: [Patch net] ipv4: clear all TCP_SKB_CB before passing to network layer
From: Eric Dumazet @ 2014-10-17 18:18 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, davem, Krzysztof Kolasa, Eric Dumazet, Cong Wang
In-Reply-To: <1413568169-4123-1-git-send-email-xiyou.wangcong@gmail.com>

On Fri, 2014-10-17 at 10:49 -0700, Cong Wang wrote:
> From: Cong Wang <cwang@twopensource.com>
> 
> Probably not a big deal, but IP is not the only network protocol,
> don't clear skb->cb just for IP.
> 
> Also, IPv6 header is not always defined in struct tcp_skb_cb.
> 
> Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Cong Wang <cwang@twopensource.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/ipv4/tcp_output.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e13d778..ee356e5 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1005,9 +1005,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>  	/* Our usage of tstamp should remain private */
>  	skb->tstamp.tv64 = 0;
>  
> -	/* Cleanup our debris for IP stacks */
> -	memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
> -			       sizeof(struct inet6_skb_parm)));
> +	memset(TCP_SKB_CB(skb), 0, sizeof(*TCP_SKB_CB(skb)));
>  
>  	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
>  


Usually, each layer is responsible for clearing skb->cb[] at its entry
point. Or more exactly it does not care of previous garbage.

There is no evidence your patch is needed.

I was maybe too defensive when I added this, because I wanted to make
only TCP changes.

We should instead remove the memset() in TCP and fix IP/IPv6 if
necessary.

But this should wait net-next being open.

^ permalink raw reply

* [Patch net] ipv4: clear all TCP_SKB_CB before passing to network layer
From: Cong Wang @ 2014-10-17 17:49 UTC (permalink / raw)
  To: netdev; +Cc: davem, Cong Wang, Krzysztof Kolasa, Eric Dumazet, Cong Wang

From: Cong Wang <cwang@twopensource.com>

Probably not a big deal, but IP is not the only network protocol,
don't clear skb->cb just for IP.

Also, IPv6 header is not always defined in struct tcp_skb_cb.

Cc: Krzysztof Kolasa <kkolasa@winsoft.pl>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/ipv4/tcp_output.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e13d778..ee356e5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1005,9 +1005,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	/* Our usage of tstamp should remain private */
 	skb->tstamp.tv64 = 0;
 
-	/* Cleanup our debris for IP stacks */
-	memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
-			       sizeof(struct inet6_skb_parm)));
+	memset(TCP_SKB_CB(skb), 0, sizeof(*TCP_SKB_CB(skb)));
 
 	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
 
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH v2] vxlan: remove the dead codes
From: Cong Wang @ 2014-10-17 17:33 UTC (permalink / raw)
  To: Li RongQing; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1413525844-2406-1-git-send-email-roy.qing.li@gmail.com>

On Thu, Oct 16, 2014 at 11:04 PM,  <roy.qing.li@gmail.com> wrote:
> From: Li RongQing <roy.qing.li@gmail.com>
>
> remove the dead codes, no user uses vxlan_salt
>
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
> ---
>  drivers/net/vxlan.c |    4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 77ab844..855a81d 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -152,8 +152,6 @@ struct vxlan_dev {
>         struct hlist_head fdb_head[FDB_HASH_SIZE];
>  };
>
> -/* salt for hash table */
> -static u32 vxlan_salt __read_mostly;

Hmm, it has never been used since it was born. Maybe we should
really use for vxlan hash table?

^ permalink raw reply

* Re: [PATCH] ipv4: fix a potential use after free
From: Pravin Shelar @ 2014-10-17 17:07 UTC (permalink / raw)
  To: Li RongQing; +Cc: netdev
In-Reply-To: <1413536003-15646-1-git-send-email-roy.qing.li@gmail.com>

On Fri, Oct 17, 2014 at 1:53 AM,  <roy.qing.li@gmail.com> wrote:
> From: Li RongQing <roy.qing.li@gmail.com>
>
> pskb_may_pull() maybe change skb->data and make eth pointer oboslete,
> so set eth after pskb_may_pull()
>
> Fixes:3d7b46cd("ip_tunnel: push generic protocol handling to ip_tunnel module")
> Cc: Pravin B Shelar <pshelar@nicira.com>
> Signed-off-by: Li RongQing <roy.qing.li@gmail.com>

Looks good.
Acked-by: Pravin B Shelar <pshelar@nicira.com>

> ---
>  net/ipv4/ip_tunnel_core.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
> index f4c987b..88c386c 100644
> --- a/net/ipv4/ip_tunnel_core.c
> +++ b/net/ipv4/ip_tunnel_core.c
> @@ -91,11 +91,12 @@ int iptunnel_pull_header(struct sk_buff *skb, int hdr_len, __be16 inner_proto)
>         skb_pull_rcsum(skb, hdr_len);
>
>         if (inner_proto == htons(ETH_P_TEB)) {
> -               struct ethhdr *eh = (struct ethhdr *)skb->data;
> +               struct ethhdr *eh;
>
>                 if (unlikely(!pskb_may_pull(skb, ETH_HLEN)))
>                         return -ENOMEM;
>
> +               eh = (struct ethhdr *)skb->data;
>                 if (likely(ntohs(eh->h_proto) >= ETH_P_802_3_MIN))
>                         skb->protocol = eh->h_proto;
>                 else
> --
> 1.7.10.4
>

^ permalink raw reply

* Re: [PATCH net] ipv6: introduce tcp_v6_iif()
From: Cong Wang @ 2014-10-17 16:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1413562640.24953.26.camel@edumazet-glaptop2.roam.corp.google.com>

On Fri, Oct 17, 2014 at 9:17 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line
> misses") added a regression for SO_BINDTODEVICE on IPv6.
>
> This is because we still use inet6_iif() which expects that IP6 control
> block is still at the beginning of skb->cb[]
>
> This patch adds tcp_v6_iif() helper and uses it where necessary.
>
> Because __inet6_lookup_skb() is used by TCP and DCCP, we add an iif
> parameter to it.
>

I doubt we still need to store iif in IP6CB() since we have skb->skb_iif,
we can probably just make inet6_iif() be like inet_iif() so that IP6CB()->iif
can be just removed? Does this make any sense to you?

(I have a patch locally since I thought it should be target for net-next.)

Thanks.

^ permalink raw reply

* Re: [PATCH] ipv4: dst_entry leak in ip_send_unicast_reply()
From: Eric Dumazet @ 2014-10-17 16:56 UTC (permalink / raw)
  To: David Miller; +Cc: vvs, netdev, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20141017.115932.740288542547693461.davem@davemloft.net>

On Fri, 2014-10-17 at 11:59 -0400, David Miller wrote:
> From: Vasily Averin <vvs@parallels.com>
> Date: Wed, 15 Oct 2014 16:24:02 +0400
> 
> > Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
> > 
> > ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork
> > and in case errors in __ip_append_data() nobody frees stolen dst entry
> > 
> > Signed-off-by: Vasily Averin <vvs@parallels.com>
> 
> Please, in the future, put the Fixes: tag first in the list of signoffs.
> 
> Eric, please review this change, it looks good to me.

It is good indeed, thanks !

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox