Netdev List

Netdev List
 help / color / mirror / Atom feed

* WARNING: drivers/net/ethernet/dec/tulip/tulip.o(.text+0x6f9b): Section mismatch in reference from the function tulip_init_one() to the variable .devinit.rodata:early_486_chipsets
From: kbuild test robot @ 2012-12-05  0:10 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: netdev, Greg Kroah-Hartman

tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   193c1e478cc496844fcbef402a10976c95a634ff
commit: 779c1a85813a4622cc3bb3d25ce10b523bd055ba tulip: remove __dev* attributes
date:   29 hours ago
config: make ARCH=x86_64 allmodconfig

All warnings:

WARNING: drivers/net/ethernet/dec/tulip/tulip.o(.text+0x6f9b): Section mismatch in reference from the function tulip_init_one() to the variable .devinit.rodata:early_486_chipsets
The function tulip_init_one() references
the variable __devinitconst early_486_chipsets.
This is often because tulip_init_one lacks a __devinitconst
annotation or the annotation of early_486_chipsets is wrong.

---
0-DAY kernel build testing backend         Open Source Technology Center
Fengguang Wu, Yuanhan Liu                              Intel Corporation

^ permalink raw reply

* NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
From: Dave Jones @ 2012-12-05  1:19 UTC (permalink / raw)
  To: netdev

We continue to see warnings like this reported against the Fedora kernel
for a number of different NICs. I just hit this one myself for the first time
on that hardware iirc.

Anything else I can provide ?

	Dave

[22347.960097] ------------[ cut here ]------------
[22347.965388] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x258/0x270()
[22347.973764] Hardware name:         
[22347.977748] NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
[22347.985212] Modules linked in: ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_LOG xt_limit ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 ip6table_filter xt_conntrack ip6_tables nf_conntrack microcode pcspkr r8169 mii nfsd auth_rpcgss nfs_acl lockd sunrpc i915 video backlight i2c_algo_bit drm_kms_helper drm
[22348.020948] Pid: 0, comm: swapper/3 Not tainted 3.7.0-rc8+ #5
[22348.027483] Call Trace:
[22348.030420]  <IRQ>  [<ffffffff8104928f>] warn_slowpath_common+0x7f/0xc0
[22348.038029]  [<ffffffff81049386>] warn_slowpath_fmt+0x46/0x50
[22348.044710]  [<ffffffff814d35a8>] dev_watchdog+0x258/0x270
[22348.051096]  [<ffffffff814d3350>] ? dev_deactivate_queue.constprop.30+0x80/0x80
[22348.059408]  [<ffffffff8105afea>] call_timer_fn+0x8a/0x330
[22348.065791]  [<ffffffff8105af65>] ? call_timer_fn+0x5/0x330
[22348.072276]  [<ffffffff814d3350>] ? dev_deactivate_queue.constprop.30+0x80/0x80
[22348.080721]  [<ffffffff814d3350>] ? dev_deactivate_queue.constprop.30+0x80/0x80
[22348.089032]  [<ffffffff8105b4d4>] run_timer_softirq+0x244/0x380
[22348.095913]  [<ffffffff81052a00>] __do_softirq+0xe0/0x3c0
[22348.102204]  [<ffffffff810a71d4>] ? tick_program_event+0x24/0x30
[22348.109045]  [<ffffffff815e36cc>] call_softirq+0x1c/0x26
[22348.115236]  [<ffffffff810043cd>] do_softirq+0x8d/0xc0
[22348.121226]  [<ffffffff81052eb5>] irq_exit+0xd5/0xe0
[22348.126882]  [<ffffffff815e381b>] smp_apic_timer_interrupt+0x6b/0x98
[22348.134253]  [<ffffffff815e302f>] apic_timer_interrupt+0x6f/0x80
[22348.141293]  <EOI>  [<ffffffff8108bad5>] ? sched_clock_local+0x25/0xa0
[22348.148819]  [<ffffffff81343d3d>] ? intel_idle+0xfd/0x160
[22348.155110]  [<ffffffff81343d36>] ? intel_idle+0xf6/0x160
[22348.161399]  [<ffffffff81474dc9>] cpuidle_enter+0x19/0x20
[22348.167562]  [<ffffffff81475462>] cpuidle_idle_call+0xa2/0x5f0
[22348.174346]  [<ffffffff8100c59a>] cpu_idle+0xca/0x160
[22348.180235]  [<ffffffff81ceead4>] start_secondary+0x251/0x258
[22348.186776] ---[ end trace 62a349a839e1b608 ]---
[22348.261360] r8169 0000:05:00.0 eth1: link up

^ permalink raw reply

* WARNING: drivers/net/ethernet/broadcom/tg3.o(.text+0x1fe93): Section mismatch in reference from the function tg3_test_dma() to the variable .devinit.rodata:tg3_dma_wait_state_chipsets
From: kbuild test robot @ 2012-12-05  1:21 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: netdev, Greg Kroah-Hartman

tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   193c1e478cc496844fcbef402a10976c95a634ff
commit: 229b1ad1cb1a7159658f466c736bc3898dabb876 tg3: remove __dev* attributes
date:   30 hours ago
config: make ARCH=i386 allmodconfig

All warnings:

WARNING: drivers/net/ethernet/broadcom/tg3.o(.text+0x1fe93): Section mismatch in reference from the function tg3_test_dma() to the variable .devinit.rodata:tg3_dma_wait_state_chipsets
The function tg3_test_dma() references
the variable __devinitconst tg3_dma_wait_state_chipsets.
This is often because tg3_test_dma lacks a __devinitconst
annotation or the annotation of tg3_dma_wait_state_chipsets is wrong.

---
0-DAY kernel build testing backend         Open Source Technology Center
Fengguang Wu, Yuanhan Liu                              Intel Corporation

^ permalink raw reply

* Re: [Suggestion] net/atm : for sprintf, need check the total write length whether larger than a page.
From: Chen Gang @ 2012-12-05  1:26 UTC (permalink / raw)
  To: Chas Williams (CONTRACTOR); +Cc: David Miller, netdev
In-Reply-To: <201212040346.qB43kVSQ018028@thirdoffive.cmf.nrl.navy.mil>


  firstly, sorry for reply late (yesterday I have an annual leave for
some personal things).

  and thank you for your reply, too.


于 2012年12月04日 11:46, Chas Williams (CONTRACTOR) 写道:
>  static ssize_t show_address(struct device *cdev,
>  			    struct device_attribute *attr, char *buf)
>  {
> -	char *pos = buf;
>  	struct atm_dev *adev = to_atm_dev(cdev);
> -	int i;
> -
> -	for (i = 0; i < (ESI_LEN - 1); i++)
> -		pos += sprintf(pos, "%02x:", adev->esi[i]);
> -	pos += sprintf(pos, "%02x\n", adev->esi[i]);
>  
> -	return pos - buf;
> +	return scnprintf(buf, PAGE_SIZE, "%pM\n", adev->esi);
>  }
>  

  "%p" seems print a pointer, not contents of pointer (is it correct ?)

  will it change the original display format to outside ?


>  static ssize_t show_atmaddress(struct device *cdev,
>  			       struct device_attribute *attr, char *buf)
>  {
>  	unsigned long flags;
> -	char *pos = buf;
>  	struct atm_dev *adev = to_atm_dev(cdev);
>  	struct atm_dev_addr *aaddr;
>  	int bin[] = { 1, 2, 10, 6, 1 }, *fmt = bin;
> -	int i, j;
> +	int i, j, count = 0;
>  
>  	spin_lock_irqsave(&adev->lock, flags);
>  	list_for_each_entry(aaddr, &adev->local, entry) {
>  		for (i = 0, j = 0; i < ATM_ESA_LEN; ++i, ++j) {
>  			if (j == *fmt) {
> -				pos += sprintf(pos, ".");
> +				count += scnprintf(buf + count,
> +						   PAGE_SIZE - count, ".");
>  				++fmt;
>  				j = 0;
>  			}
> -			pos += sprintf(pos, "%02x",
> -				       aaddr->addr.sas_addr.prv[i]);
> +			count += scnprintf(buf + count,
> +					   PAGE_SIZE - count, "%02x",
> +					   aaddr->addr.sas_addr.prv[i]);
>  		}
> -		pos += sprintf(pos, "\n");
> +		count += scnprintf(buf + count, PAGE_SIZE - count, "\n");
>  	}
>  	spin_unlock_irqrestore(&adev->lock, flags);
>  
> -	return pos - buf;
> +	return count;
>  }
>  

  need we judge whether count >= PAGE_SIZE ?


  these are my suggestions, maybe not correct.

  :-)

  thanks.

-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [Suggestion] net/atm :  for sprintf, need check the total write length whether larger than a page.
From: Chen Gang @ 2012-12-05  1:28 UTC (permalink / raw)
  To: chas williams - CONTRACTOR
  Cc: David.Woodhouse, David Miller, krzysiek, Joe Perches, edumazet,
	netdev
In-Reply-To: <20121203104825.78123ecc@thirdoffive.cmf.nrl.navy.mil>

于 2012年12月03日 23:48, chas williams - CONTRACTOR 写道:
> yes this seems like it should be done.  maybe this week i will try to
> put something together unless you already have a patch somewhere.
> 

  thank you very much.


  and also sorry for replying late (yesterday, I have an annual leave
for some personal things)

^ permalink raw reply

* Re: [PATCH net-next v2] bridge: export multicast database via netlink
From: Cong Wang @ 2012-12-05  2:07 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, bridge, herbert, shemminger, tgraf, brouer
In-Reply-To: <20121204.132505.1229048595483168009.davem@davemloft.net>

On Tue, 2012-12-04 at 13:25 -0500, David Miller wrote:
> From: Cong Wang <amwang@redhat.com>
> Date: Mon,  3 Dec 2012 21:03:43 +0800
> 
> > TODO: remove debugging printk's
> 
> Can you please take care of this so we can consider this patch
> seriously for inclusion?
> 

Sure. I planned to send V3 when I wrote this TODO. ;) And it seems
Thomas and Stephen are ok with it now, I will remove these printk's.

Thanks.

^ permalink raw reply

* Re: [PATCH net-next v2] bridge: export multicast database via netlink
From: Cong Wang @ 2012-12-05  2:08 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, bridge, Herbert Xu, Jesper Dangaard Brouer, Thomas Graf,
	David S. Miller
In-Reply-To: <20121204085320.3e08986a@nehalam.linuxnetplumber.net>

On Tue, 2012-12-04 at 08:53 -0800, Stephen Hemminger wrote:
> On Mon,  3 Dec 2012 21:03:43 +0800
> Cong Wang <amwang@redhat.com> wrote:
> 
> > V2: drop patch 1/2, export ifindex directly
> >     Redesign netlink attributes
> >     Improve netlink seq check
> >     Handle IPv6 addr as well
> > 
> > TODO: remove debugging printk's
> > 
> > This patch exports bridge multicast database via netlink
> > message type RTM_GETMDB. Similar to fdb, but currently bridge-specific.
> > We may need to support modify multicast database too (RTM_{ADD,DEL}MDB).
> > 
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: Stephen Hemminger <shemminger@vyatta.com>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Cc: Thomas Graf <tgraf@suug.ch>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Cong Wang <amwang@redhat.com>
> >     
> 
> Minor nit reported by checkpatch was the messages should be using the api
> which provides the most info in the log to identify.
> 
> WARNING: Prefer netdev_info(netdev, ... then dev_info(dev, ... then pr_info(...  to printk(KERN_INFO ...
> #190: FILE: net/bridge/br_mdb.c:28:
> +		printk(KERN_INFO "no router on bridge\n")
> 
> There is a set of macro's already for use in bridging code:
>                 br_info(br, "no router on bridge\n");
> 

Never mind, removing these printk's is in my TODO list. :)

^ permalink raw reply

* [PATCH net-next v3] bridge: export multicast database via netlink
From: Cong Wang @ 2012-12-05  2:50 UTC (permalink / raw)
  To: netdev
  Cc: Cong Wang, bridge, Herbert Xu, Jesper Dangaard Brouer,
	Thomas Graf, Stephen Hemminger, David S. Miller

V3: drop debugging printk's
    update selinux perm table as well

V2: drop patch 1/2, export ifindex directly
    Redesign netlink attributes
    Improve netlink seq check
    Handle IPv6 addr as well

This patch exports bridge multicast database via netlink
message type RTM_GETMDB. Similar to fdb, but currently bridge-specific.
We may need to support modify multicast database too (RTM_{ADD,DEL}MDB).

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
    
---
 include/uapi/linux/if_bridge.h |   55 ++++++++++++++
 include/uapi/linux/rtnetlink.h |    3 +
 net/bridge/Makefile            |    2 +-
 net/bridge/br_mdb.c            |  161 ++++++++++++++++++++++++++++++++++++++++
 net/bridge/br_multicast.c      |    2 +
 net/bridge/br_private.h        |    2 +
 security/selinux/nlmsgtab.c    |    1 +
 7 files changed, 225 insertions(+), 1 deletions(-)

diff --git a/include/uapi/linux/if_bridge.h b/include/uapi/linux/if_bridge.h
index b388579..9a0f6ff 100644
--- a/include/uapi/linux/if_bridge.h
+++ b/include/uapi/linux/if_bridge.h
@@ -116,4 +116,59 @@ enum {
 	__IFLA_BRIDGE_MAX,
 };
 #define IFLA_BRIDGE_MAX (__IFLA_BRIDGE_MAX - 1)
+
+/* Bridge multicast database attributes
+ * [MDBA_MDB] = {
+ *     [MDBA_MDB_ENTRY] = {
+ *         [MDBA_MDB_ENTRY_INFO]
+ *     }
+ * }
+ * [MDBA_ROUTER] = {
+ *    [MDBA_ROUTER_PORT]
+ * }
+ */
+enum {
+	MDBA_UNSPEC,
+	MDBA_MDB,
+	MDBA_ROUTER,
+	__MDBA_MAX,
+};
+#define MDBA_MAX (__MDBA_MAX - 1)
+
+enum {
+	MDBA_MDB_UNSPEC,
+	MDBA_MDB_ENTRY,
+	__MDBA_MDB_MAX,
+};
+#define MDBA_MDB_MAX (__MDBA_MDB_MAX - 1)
+
+enum {
+	MDBA_MDB_ENTRY_UNSPEC,
+	MDBA_MDB_ENTRY_INFO,
+	__MDBA_MDB_ENTRY_MAX,
+};
+#define MDBA_MDB_ENTRY_MAX (__MDBA_MDB_ENTRY_MAX - 1)
+
+enum {
+	MDBA_ROUTER_UNSPEC,
+	MDBA_ROUTER_PORT,
+	__MDBA_ROUTER_MAX,
+};
+#define MDBA_ROUTER_MAX (__MDBA_ROUTER_MAX - 1)
+
+struct br_port_msg {
+	__u32 ifindex;
+};
+
+struct br_mdb_entry {
+	__u32 ifindex;
+	struct {
+		union {
+			__be32	ip4;
+			struct in6_addr ip6;
+		} u;
+		__be16		proto;
+	} addr;
+};
+
 #endif /* _UAPI_LINUX_IF_BRIDGE_H */
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 33d29ce..354a1e7 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -125,6 +125,9 @@ enum {
 	RTM_GETNETCONF = 82,
 #define RTM_GETNETCONF RTM_GETNETCONF
 
+	RTM_GETMDB = 86,
+#define RTM_GETMDB RTM_GETMDB
+
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/bridge/Makefile b/net/bridge/Makefile
index d0359ea..e859098 100644
--- a/net/bridge/Makefile
+++ b/net/bridge/Makefile
@@ -12,6 +12,6 @@ bridge-$(CONFIG_SYSFS) += br_sysfs_if.o br_sysfs_br.o
 
 bridge-$(CONFIG_BRIDGE_NETFILTER) += br_netfilter.o
 
-bridge-$(CONFIG_BRIDGE_IGMP_SNOOPING) += br_multicast.o
+bridge-$(CONFIG_BRIDGE_IGMP_SNOOPING) += br_multicast.o br_mdb.o
 
 obj-$(CONFIG_BRIDGE_NF_EBTABLES) += netfilter/
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
new file mode 100644
index 0000000..f0f4dcf
--- /dev/null
+++ b/net/bridge/br_mdb.c
@@ -0,0 +1,161 @@
+#include <linux/err.h>
+#include <linux/if_ether.h>
+#include <linux/igmp.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/rculist.h>
+#include <linux/skbuff.h>
+#include <linux/slab.h>
+#include <net/ip.h>
+#if IS_ENABLED(CONFIG_IPV6)
+#include <net/ipv6.h>
+#include <net/mld.h>
+#include <net/addrconf.h>
+#include <net/ip6_checksum.h>
+#endif
+
+#include "br_private.h"
+
+static int br_rports_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
+			       struct net_device *dev)
+{
+	struct net_bridge *br = netdev_priv(dev);
+	struct net_bridge_port *p;
+	struct hlist_node *n;
+	struct nlattr *nest;
+
+	if (!br->multicast_router || hlist_empty(&br->router_list))
+		return 0;
+
+	nest = nla_nest_start(skb, MDBA_ROUTER);
+	if (nest == NULL)
+		return -EMSGSIZE;
+
+	hlist_for_each_entry_rcu(p, n, &br->router_list, rlist) {
+		if (p && nla_put_u32(skb, MDBA_ROUTER_PORT, p->dev->ifindex))
+			goto fail;
+	}
+
+	nla_nest_end(skb, nest);
+	return 0;
+fail:
+	nla_nest_cancel(skb, nest);
+	return -EMSGSIZE;
+}
+
+static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
+			    struct net_device *dev)
+{
+	struct net_bridge *br = netdev_priv(dev);
+	struct net_bridge_mdb_htable *mdb;
+	struct nlattr *nest, *nest2;
+	int i, err = 0;
+	int idx = 0, s_idx = cb->args[1];
+
+	if (br->multicast_disabled)
+		return 0;
+
+	mdb = rcu_dereference(br->mdb);
+	if (!mdb)
+		return 0;
+
+	nest = nla_nest_start(skb, MDBA_MDB);
+	if (nest == NULL)
+		return -EMSGSIZE;
+
+	for (i = 0; i < mdb->max; i++) {
+		struct hlist_node *h;
+		struct net_bridge_mdb_entry *mp;
+		struct net_bridge_port_group *p, **pp;
+		struct net_bridge_port *port;
+
+		hlist_for_each_entry_rcu(mp, h, &mdb->mhash[i], hlist[mdb->ver]) {
+			if (idx < s_idx)
+				goto skip;
+
+			nest2 = nla_nest_start(skb, MDBA_MDB_ENTRY);
+			if (nest2 == NULL) {
+				err = -EMSGSIZE;
+				goto out;
+			}
+
+			for (pp = &mp->ports;
+			     (p = rcu_dereference(*pp)) != NULL;
+			      pp = &p->next) {
+				port = p->port;
+				if (port) {
+					struct br_mdb_entry e;
+					e.ifindex = port->dev->ifindex;
+					e.addr.u.ip4 = p->addr.u.ip4;
+#if IS_ENABLED(CONFIG_IPV6)
+					e.addr.u.ip6 = p->addr.u.ip6;
+#endif
+					e.addr.proto = p->addr.proto;
+					if (nla_put(skb, MDBA_MDB_ENTRY_INFO, sizeof(e), &e)) {
+						nla_nest_cancel(skb, nest2);
+						err = -EMSGSIZE;
+						goto out;
+					}
+				}
+			}
+			nla_nest_end(skb, nest2);
+		skip:
+			idx++;
+		}
+	}
+
+out:
+	cb->args[1] = idx;
+	cb->args[2] = mdb->seq;
+	nla_nest_end(skb, nest);
+	return err;
+}
+
+static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net_device *dev;
+	struct net *net = sock_net(skb->sk);
+	struct nlmsghdr *nlh;
+	int idx = 0, s_idx;
+
+	s_idx = cb->args[0];
+
+	rcu_read_lock();
+
+	for_each_netdev_rcu(net, dev) {
+		if (dev->priv_flags & IFF_EBRIDGE) {
+			struct br_port_msg *bpm;
+
+			if (idx < s_idx)
+				goto cont;
+
+			nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
+					cb->nlh->nlmsg_seq, RTM_GETMDB,
+					sizeof(*bpm), NLM_F_MULTI);
+			if (nlh == NULL)
+				break;
+
+			bpm = nlmsg_data(nlh);
+			bpm->ifindex = dev->ifindex;
+			if (br_mdb_fill_info(skb, cb, dev) < 0)
+				goto out;
+			if (br_rports_fill_info(skb, cb, dev) < 0)
+				goto out;
+
+			nlmsg_end(skb, nlh);
+		cont:
+			idx++;
+		}
+	}
+
+out:
+	cb->seq = cb->args[2];
+	rcu_read_unlock();
+	cb->args[0] = idx;
+	return skb->len;
+}
+
+void br_mdb_init(void)
+{
+	rtnl_register(PF_BRIDGE, RTM_GETMDB, NULL, br_mdb_dump, NULL);
+}
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 2417434..d53e4f4 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -322,6 +322,7 @@ static int br_mdb_rehash(struct net_bridge_mdb_htable __rcu **mdbp, int max,
 
 	mdb->size = old ? old->size : 0;
 	mdb->ver = old ? old->ver ^ 1 : 0;
+	mdb->seq = old ? (old->seq + 1): 0;
 
 	if (!old || elasticity)
 		get_random_bytes(&mdb->secret, sizeof(mdb->secret));
@@ -1584,6 +1585,7 @@ void br_multicast_init(struct net_bridge *br)
 		    br_multicast_querier_expired, (unsigned long)br);
 	setup_timer(&br->multicast_query_timer, br_multicast_query_expired,
 		    (unsigned long)br);
+	br_mdb_init();
 }
 
 void br_multicast_open(struct net_bridge *br)
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index eb9cd42..6484069 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -105,6 +105,7 @@ struct net_bridge_mdb_htable
 	u32				max;
 	u32				secret;
 	u32				ver;
+	u32				seq;
 };
 
 struct net_bridge_port
@@ -432,6 +433,7 @@ extern int br_multicast_set_port_router(struct net_bridge_port *p,
 extern int br_multicast_toggle(struct net_bridge *br, unsigned long val);
 extern int br_multicast_set_querier(struct net_bridge *br, unsigned long val);
 extern int br_multicast_set_hash_max(struct net_bridge *br, unsigned long val);
+extern void br_mdb_init(void);
 
 static inline bool br_multicast_is_router(struct net_bridge *br)
 {
diff --git a/security/selinux/nlmsgtab.c b/security/selinux/nlmsgtab.c
index d309e7f..163aaa7 100644
--- a/security/selinux/nlmsgtab.c
+++ b/security/selinux/nlmsgtab.c
@@ -67,6 +67,7 @@ static struct nlmsg_perm nlmsg_route_perms[] =
 	{ RTM_GETADDRLABEL,	NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 	{ RTM_GETDCB,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 	{ RTM_SETDCB,		NETLINK_ROUTE_SOCKET__NLMSG_WRITE },
+	{ RTM_GETMDB,		NETLINK_ROUTE_SOCKET__NLMSG_READ  },
 };
 
 static struct nlmsg_perm nlmsg_tcpdiag_perms[] =

^ permalink raw reply related

* [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections
From: Weiping Pan @ 2012-12-05  2:54 UTC (permalink / raw)
  To: netdev; +Cc: brutus, Weiping Pan
In-Reply-To: <CAEkNxbGTwGEBMCpSdib_paaxs0ekc52HWNo2Vai0nNSrZ1Zkng@mail.gmail.com>

1 patch overview
[PATCH 1/3] is the original V3 patch from  Bruce(brutus@google.com),
I just rebase it on top of net-next
commit 03f52a0a5542(ip6mr: Add sizeof verification to MRT6_ASSERT and
MT6_PIM).
http://patchwork.ozlabs.org/patch/184523/

[PATCH 2/3] is to fix the bug in tcp_close() that triggered by [PATCH 1/3],
since for tcp friends data skb, it has no tcp header, and its transport_header
is NULL,
so it will panic if we deference tcp_hdr(skb) in tcp_close().

[PATCH 3/3] is to fix the problem raised by Eric(eric.dumazet@gmail.com)
http://www.spinics.net/lists/netdev/msg210750.html

The sock pointed by request_sock->friend may be freed since it does not have a
lock to protect it.
I just delete request_sock->friend since I think it is useless.

For sk_buff->friend, it has the same problem, and I use
"atomic_add(skb->truesize, &sk->sk_wmem_alloc)" to guarantee that the sock can
not be freed before the skb is freed.

Then for 3-way handshake with tcp friends enabled,
SYN->friend is NULL, SYN/ACK->friend is set in tcp_make_synack(),
and ACK->friend is set in tcp_send_ack().

For normal data and FIN skbs, their friend pointer is NULL.

2 performance analysis
In short, TCP_RR increases by 5 or 6 times, TCP_CRR keeps the same,
TCP_SENDFILE and TCP_MAERTS are not stable, sometimes they increase while
sometimes decrease, so we can regard them as no increase.
For TCP_STREAM, it depends on the message size, if it is bigger than 8192, it
increases else decreases.

Intel(R) Xeon(R) E5506, 2 sockets, 8 cores, 2.13GHz
Memory 4GB
--------------------------------------------------------------------------
TCP friends performance results start

BASE means normal tcp with friends DISABLED.
AF_UNIX means sockets for local interprocess communication, for reference.
FRIENDS means tcp with friends ENABLED.
I set -s 51882 -m 16384 -M 87380 for all the three kinds of sockets by default.
The first percentage number is FRIENDS/BASE.
The second percentage number is FRIENDS/AF_UNIX.
We set -i 10,2 -I 95,20 to stabilize the statistics.

      BASE    AF_UNIX    FRIENDS               TCP_STREAM
  21741.94   30653.90   17115.66   78%   55%

      BASE    AF_UNIX    FRIENDS               TCP_MAERTS
  17464.98          -   17134.63   98%    -%

      BASE    AF_UNIX    FRIENDS             TCP_SENDFILE
     25707          -      30828  119%    -%

TCP_SENDFILE can not work with -i 10,2 -I 95,20 (strange), so I use average.

        MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
         1      15.64       5.90       5.12   32%   86%
         2      30.93       9.81      10.48   33%  106%
         4      58.22      19.70      21.29   36%  108%
         8     117.00      39.00      42.74   36%  109%
        16     231.08      84.59      83.90   36%   99%
        32     439.39     159.93     163.03   37%  101%
        64     879.13     323.31     322.78   36%   99%
       128    1617.55     632.50     646.34   39%  102%
       256    3091.72    1316.36    1206.93   39%   91%
       512    5077.18    2359.51    2342.00   46%   99%
      1024    7403.20    6302.20    3335.23   45%   52%
      2048   10194.40   13922.19    5751.23   56%   41%
      4096   13338.08   22566.45    9447.29   70%   41%
      8192   14467.93   28122.20   13758.43   95%   48%
     16384   22463.15   37522.42   26804.36  119%   71%
     32768   14743.58   30591.61   17040.15  115%   55%
     65536   24743.77   33855.93   40418.15  163%  119%
    131072   13925.14   31762.52   48292.60  346%  152%
    262144   16126.15   32912.89   25610.47  158%   77%
    524288   12080.51   35059.27   30608.31  253%   87%
   1048576   10539.06   28200.14   16953.69  160%   60%
MS means Message Size in bytes, that is -m -M for netperf

        RR       BASE    AF_UNIX    FRIENDS                TCP_RR_RR
         1   13064.17   95593.46   72982.11  558%   76%
         2   12000.95   95477.38   65203.37  543%   68%
         4   12560.45   90758.17   69983.71  557%   77%
         8   17991.62   96794.53   77293.14  429%   79%
        16   13015.98   89384.69   83125.91  638%   92%
        32   13863.00   89870.17   88986.21  641%   99%
        64   10632.42   88906.59   83055.69  781%   93%
       128   13673.29   85629.27   92984.32  680%  108%
       256   12965.59   88117.74   86155.43  664%   97%
       512   17158.55   90866.08   85498.26  498%   94%
      1024   16951.15   82982.26   82286.84  485%   99%
      2048   11814.75   76684.40   83154.99  703%  108%
      4096   10393.91   63204.65   68558.71  659%  108%
      8192    7757.81   50318.63   50270.39  647%   99%
     16384    8147.26   37392.42   38619.89  474%  103%
     32768    8846.85   24847.64   28412.23  321%  114%
     65536    4974.59   16717.47   17327.65  348%  103%
    131072    4148.19    9053.56    9402.89  226%  103%
    262144    3029.66    5575.51    6119.65  201%  109%
    524288     923.40    3271.52    3649.37  395%  111%
   1048576     385.47    1173.18    1017.43  263%   86%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf

        RR       BASE    AF_UNIX    FRIENDS               TCP_CRR_RR
         1    3424.40          -    3608.92  105%    -%
         2    3355.94          -    3523.77  105%    -%
         4    3437.05          -    3538.48  102%    -%
         8    3465.41          -    3630.49  104%    -%
        16    3495.40          -    3516.93  100%    -%
        32    3425.78          -    3524.90  102%    -%
        64    3432.01          -    3628.25  105%    -%
       128    3434.69          -    3573.88  104%    -%
       256    3413.94          -    3616.94  105%    -%
       512    3457.32          -    3675.38  106%    -%
      1024    3476.01          -    3634.25  104%    -%
      2048    3484.38          -    3539.96  101%    -%
      4096    3304.86          -    3564.57  107%    -%
      8192    3420.40          -    3599.02  105%    -%
     16384    3358.47          -    3571.60  106%    -%
     32768    3299.75          -    3469.19  105%    -%
     65536    2635.22          -    3292.74  124%    -%
    131072     119.97          -    3008.15 2507%    -%
    262144     933.66          -    2189.83  234%    -%
    524288     175.82          -     607.32  345%    -%
   1048576      41.70          -     296.22  710%    -%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf -H 127.0.0.1

TCP friends performance results end
--------------------------------------------------------------------------

In short, I think the performance of tcp friends is not overwhelming than
loopback.

Friends VS AF__UNIX
Their call path are almost the same, but AF_UNIX uses its own send/recv codes
with proper locks,
so AF_UNIX's performance is much better than Friends.

Friends VS normal tcp
Friends directly adds skb into peer's sk_receive_queue if it gets the lock.
So the sender and receiver have serious lock contention.

Normal tcp sends skb into sk_write_queue, then sends it in net_tx_action() and
receives it in net_rx_action(), then adds it into peer's sk_receive_queue.
So the sender just needs to lock the write queue while the receiver just needs
to lock the receive queue, so they have little lock contention.

3 TODO
1 try to confirm that the root cause of regression in some cases is the lock
contention.

2 find a better way to fix the regression.

Any hints ?

thanks

Weiping Pan (3):
  Bruce's orignal tcp friend V3
  fix panic in tcp_close()
  delete request_sock->friend

 Documentation/networking/ip-sysctl.txt |    8 +
 include/linux/skbuff.h                 |    2 +
 include/net/inet_connection_sock.h     |    4 +
 include/net/sock.h                     |   32 ++-
 include/net/tcp.h                      |   13 +-
 net/core/skbuff.c                      |    1 +
 net/core/sock.c                        |    1 +
 net/core/stream.c                      |   36 ++
 net/ipv4/inet_connection_sock.c        |   38 ++
 net/ipv4/sysctl_net_ipv4.c             |    7 +
 net/ipv4/tcp.c                         |  610 +++++++++++++++++++++++++++-----
 net/ipv4/tcp_input.c                   |   12 +-
 net/ipv4/tcp_ipv4.c                    |    5 +
 net/ipv4/tcp_minisocks.c               |   11 +-
 net/ipv4/tcp_output.c                  |   19 +-
 15 files changed, 707 insertions(+), 92 deletions(-)

-- 
1.7.4.4

^ permalink raw reply

* [PATCH 1/3] Bruce's orignal tcp friend V3
From: Weiping Pan @ 2012-12-05  2:54 UTC (permalink / raw)
  To: netdev; +Cc: brutus, Weiping Pan
In-Reply-To: <cover.1354674151.git.wpan@redhat.com>

http://patchwork.ozlabs.org/patch/184523/

Rebase on top of commit 03f52a0a5542(ip6mr: Add sizeof verification to
MRT6_ASSERT and MT6_PIM).

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 Documentation/networking/ip-sysctl.txt |    8 +
 include/linux/skbuff.h                 |    2 +
 include/net/request_sock.h             |    1 +
 include/net/sock.h                     |   32 ++-
 include/net/tcp.h                      |   13 +-
 net/core/skbuff.c                      |    1 +
 net/core/sock.c                        |    1 +
 net/core/stream.c                      |   36 ++
 net/ipv4/inet_connection_sock.c        |   20 +
 net/ipv4/sysctl_net_ipv4.c             |    7 +
 net/ipv4/tcp.c                         |  604 +++++++++++++++++++++++++++-----
 net/ipv4/tcp_input.c                   |   22 +-
 net/ipv4/tcp_ipv4.c                    |    2 +
 net/ipv4/tcp_minisocks.c               |    4 +
 net/ipv4/tcp_output.c                  |   16 +-
 net/ipv6/tcp_ipv6.c                    |    1 +
 16 files changed, 679 insertions(+), 91 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 98ac0d7..152f488 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -214,6 +214,14 @@ tcp_fack - BOOLEAN
 	Enable FACK congestion avoidance and fast retransmission.
 	The value is not used, if tcp_sack is not enabled.
 
+tcp_friends - BOOLEAN
+	If set, TCP loopback socket pair stack bypass is enabled such
+	that all data sent will be directly queued to the receiver's
+	socket for receive. Note, normal connection establishment and
+	finish is used to make friends so any loopback interpose, e.g.
+	tcpdump, will see these TCP segements but no data segments.
+	Default: 1
+
 tcp_fin_timeout - INTEGER
 	Time to hold socket in state FIN-WAIT-2, if it was closed
 	by our side. Peer can be broken and never close its side,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2af494..c890f65 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -334,6 +334,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@cb: Control buffer. Free for use by every layer. Put private vars here
  *	@_skb_refdst: destination entry (with norefcount bit)
  *	@sp: the security path, used for xfrm
+ *	@friend: loopback friend socket
  *	@len: Length of actual data
  *	@data_len: Data length
  *	@mac_len: Length of link layer header
@@ -409,6 +410,7 @@ struct sk_buff {
 #ifdef CONFIG_XFRM
 	struct	sec_path	*sp;
 #endif
+	struct sock		*friend;
 	unsigned int		len,
 				data_len;
 	__u16			mac_len,
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index a51dbd1..c6dfa26 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -66,6 +66,7 @@ struct request_sock {
 	unsigned long			expires;
 	const struct request_sock_ops	*rsk_ops;
 	struct sock			*sk;
+	struct sock			*friend;
 	u32				secid;
 	u32				peer_secid;
 };
diff --git a/include/net/sock.h b/include/net/sock.h
index c945fba..778d8dd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -197,6 +197,7 @@ struct cg_proto;
   *	@sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
   *	@sk_lock:	synchronizer
   *	@sk_rcvbuf: size of receive buffer in bytes
+  *	@sk_friend: loopback friend socket
   *	@sk_wq: sock wait queue and async head
   *	@sk_rx_dst: receive input route used by early tcp demux
   *	@sk_dst_cache: destination cache
@@ -286,6 +287,14 @@ struct sock {
 	socket_lock_t		sk_lock;
 	struct sk_buff_head	sk_receive_queue;
 	/*
+	 * If socket has a friend (sk_friend != NULL) then a send skb is
+	 * enqueued directly to the friend's sk_receive_queue such that:
+	 *
+	 *        sk_sndbuf -> sk_sndbuf + sk_friend->sk_rcvbuf
+	 *   sk_wmem_queued -> sk_friend->sk_rmem_alloc
+	 */
+	struct sock		*sk_friend;
+	/*
 	 * The backlog queue is special, it is always used with
 	 * the per-socket spinlock held and requires low latency
 	 * access. Therefore we special case it's implementation.
@@ -703,24 +712,40 @@ static inline bool sk_acceptq_is_full(const struct sock *sk)
 	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
 }
 
+static inline int sk_wmem_queued_get(const struct sock *sk)
+{
+	if (sk->sk_friend)
+		return atomic_read(&sk->sk_friend->sk_rmem_alloc);
+	else
+		return sk->sk_wmem_queued;
+}
+
+static inline int sk_sndbuf_get(const struct sock *sk)
+{
+	if (sk->sk_friend)
+		return sk->sk_sndbuf + sk->sk_friend->sk_rcvbuf;
+	else
+		return sk->sk_sndbuf;
+}
+
 /*
  * Compute minimal free write space needed to queue new packets.
  */
 static inline int sk_stream_min_wspace(const struct sock *sk)
 {
-	return sk->sk_wmem_queued >> 1;
+	return sk_wmem_queued_get(sk) >> 1;
 }
 
 static inline int sk_stream_wspace(const struct sock *sk)
 {
-	return sk->sk_sndbuf - sk->sk_wmem_queued;
+	return sk_sndbuf_get(sk) - sk_wmem_queued_get(sk);
 }
 
 extern void sk_stream_write_space(struct sock *sk);
 
 static inline bool sk_stream_memory_free(const struct sock *sk)
 {
-	return sk->sk_wmem_queued < sk->sk_sndbuf;
+	return sk_wmem_queued_get(sk) < sk_sndbuf_get(sk);
 }
 
 /* OOB backlog add */
@@ -829,6 +854,7 @@ static inline void sock_rps_reset_rxhash(struct sock *sk)
 	})
 
 extern int sk_stream_wait_connect(struct sock *sk, long *timeo_p);
+extern int sk_stream_wait_friend(struct sock *sk, long *timeo_p);
 extern int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
 extern void sk_stream_wait_close(struct sock *sk, long timeo_p);
 extern int sk_stream_error(struct sock *sk, int flags, int err);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3202bde..5f82770 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -292,6 +292,7 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern int sysctl_tcp_friends;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -687,6 +688,15 @@ void tcp_send_window_probe(struct sock *sk);
 #define TCPHDR_ECE 0x40
 #define TCPHDR_CWR 0x80
 
+/* If skb_get_friend() != NULL, TCP friends per packet state.
+ */
+struct friend_skb_parm {
+	bool	tail_inuse;		/* In use by skb_get_friend() send while */
+					/* on sk_receive_queue for tail put */
+};
+
+#define TCP_FRIEND_CB(tcb) (&(tcb)->header.hf)
+
 /* This is what the send packet queuing engine uses to pass
  * TCP per-packet control information to the transmission code.
  * We also store the host-order sequence numbers in here too.
@@ -699,6 +709,7 @@ struct tcp_skb_cb {
 #if IS_ENABLED(CONFIG_IPV6)
 		struct inet6_skb_parm	h6;
 #endif
+		struct friend_skb_parm	hf;
 	} header;	/* For incoming frames		*/
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
@@ -1041,7 +1052,7 @@ static inline bool tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (sysctl_tcp_low_latency || !tp->ucopy.task)
+	if (sysctl_tcp_low_latency || !tp->ucopy.task || sk->sk_friend)
 		return false;
 
 	__skb_queue_tail(&tp->ucopy.prequeue, skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 880722e2..665826a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -690,6 +690,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
+	new->friend		= old->friend;
 	memcpy(new->cb, old->cb, sizeof(old->cb));
 	new->csum		= old->csum;
 	new->local_df		= old->local_df;
diff --git a/net/core/sock.c b/net/core/sock.c
index a692ef4..a8f59a9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2225,6 +2225,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 #ifdef CONFIG_NET_DMA
 	skb_queue_head_init(&sk->sk_async_wait_queue);
 #endif
+	sk->sk_friend		=	NULL;
 
 	sk->sk_send_head	=	NULL;
 
diff --git a/net/core/stream.c b/net/core/stream.c
index f5df85d..85e5b03 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -83,6 +83,42 @@ int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
 EXPORT_SYMBOL(sk_stream_wait_connect);
 
 /**
+ * sk_stream_wait_friend - Wait for a socket to make friends
+ * @sk: sock to wait on
+ * @timeo_p: for how long to wait
+ *
+ * Must be called with the socket locked.
+ */
+int sk_stream_wait_friend(struct sock *sk, long *timeo_p)
+{
+	struct task_struct *tsk = current;
+	DEFINE_WAIT(wait);
+	int done;
+
+	do {
+		int err = sock_error(sk);
+		if (err)
+			return err;
+		if (!sk->sk_friend)
+			return -EBADFD;
+		if (!*timeo_p)
+			return -EAGAIN;
+		if (signal_pending(tsk))
+			return sock_intr_errno(*timeo_p);
+
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+		sk->sk_write_pending++;
+		done = sk_wait_event(sk, timeo_p,
+				     !sk->sk_err &&
+				     sk->sk_friend->sk_friend);
+		finish_wait(sk_sleep(sk), &wait);
+		sk->sk_write_pending--;
+	} while (!done);
+	return 0;
+}
+EXPORT_SYMBOL(sk_stream_wait_friend);
+
+/**
  * sk_stream_closing - Return 1 if we still have things to send in our buffers.
  * @sk: socket to verify
  */
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 2026542..ce4b79b 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -659,6 +659,26 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 	if (newsk != NULL) {
 		struct inet_connection_sock *newicsk = inet_csk(newsk);
 
+		if (req->friend) {
+			/*
+			 * Make friends with the requestor but the ACK of
+			 * the request is already in-flight so the race is
+			 * on to make friends before the ACK is processed.
+			 * If the requestor's sk_friend value is != NULL
+			 * then the requestor has already processed the
+			 * ACK so indicate state change to wake'm up.
+			 */
+			struct sock *was;
+
+			sock_hold(req->friend);
+			newsk->sk_friend = req->friend;
+			sock_hold(newsk);
+			was = xchg(&req->friend->sk_friend, newsk);
+			/* If requester already connect()ed, maybe sleeping */
+			if (was && !sock_flag(req->friend, SOCK_DEAD))
+				sk->sk_state_change(req->friend);
+		}
+
 		newsk->sk_state = TCP_SYN_RECV;
 		newicsk->icsk_bind_hash = NULL;
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d84400b..4ca53db 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -796,6 +796,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero
 	},
+	{
+		.procname	= "tcp_friends",
+		.data		= &sysctl_tcp_friends,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e6eace1..4327deb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -310,6 +310,56 @@ struct tcp_splice_state {
 };
 
 /*
+ * Validate friendp, if not a friend return 0, else if friend is also a
+ * friend return 1, else friendp points to a listen()er so wait for our
+ * friend to be ready then update friendp with pointer to the real friend
+ * and return 1, else an error has occurred so return a -errno.
+ */
+static inline int tcp_friend_validate(struct sock *sk, struct sock **friendp,
+			      long *timeo)
+{
+	struct sock *friend = *friendp;
+
+	if (!friend)
+		return 0;
+	if (unlikely(!friend->sk_friend)) {
+		/* Friendship not complete, wait? */
+		int err;
+
+		if (!timeo)
+			return -EAGAIN;
+		err = sk_stream_wait_friend(sk, timeo);
+		if (err < 0)
+			return err;
+		*friendp = sk->sk_friend;
+	}
+	return 1;
+}
+
+static inline int tcp_friend_send_lock(struct sock *friend)
+{
+	int err = 0;
+
+	spin_lock_bh(&friend->sk_lock.slock);
+	if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN)) {
+		spin_unlock_bh(&friend->sk_lock.slock);
+		err = -ECONNRESET;
+	}
+
+	return err;
+}
+
+static inline void tcp_friend_recv_lock(struct sock *friend)
+{
+	spin_lock_bh(&friend->sk_lock.slock);
+}
+
+static void tcp_friend_unlock(struct sock *friend)
+{
+	spin_unlock_bh(&friend->sk_lock.slock);
+}
+
+/*
  * Pressure flag: try to collapse.
  * Technical note: it is used by multiple contexts non atomically.
  * All the __sk_mem_schedule() is of this nature: accounting
@@ -589,6 +639,76 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(tcp_ioctl);
 
+/*
+ * Friend receive_queue tail skb space? If true, set tail_inuse.
+ * Else if RCV_SHUTDOWN, return *copy = -ECONNRESET.
+ */
+static inline struct sk_buff *tcp_friend_tail(struct sock *friend, int *copy)
+{
+	struct sk_buff	*skb = NULL;
+	int		sz = 0;
+
+	if (skb_peek_tail(&friend->sk_receive_queue)) {
+		sz = tcp_friend_send_lock(friend);
+		if (!sz) {
+			skb = skb_peek_tail(&friend->sk_receive_queue);
+			if (skb && skb->friend) {
+				if (!*copy)
+					sz = skb_tailroom(skb);
+				else {
+					sz = *copy - skb->len;
+					if (sz < 0)
+						sz = 0;
+				}
+				if (sz > 0)
+					TCP_FRIEND_CB(TCP_SKB_CB(skb))->
+							tail_inuse = true;
+			}
+			tcp_friend_unlock(friend);
+		}
+	}
+
+	*copy = sz;
+	return skb;
+}
+
+static inline void tcp_friend_seq(struct sock *sk, int copy, int charge)
+{
+	struct sock	*friend = sk->sk_friend;
+	struct tcp_sock *tp = tcp_sk(friend);
+
+	if (charge) {
+		sk_mem_charge(friend, charge);
+		atomic_add(charge, &friend->sk_rmem_alloc);
+	}
+	tp->rcv_nxt += copy;
+	tp->rcv_wup += copy;
+	tcp_friend_unlock(friend);
+
+	tp = tcp_sk(sk);
+	tp->snd_nxt += copy;
+	tp->pushed_seq += copy;
+	tp->snd_una += copy;
+	tp->snd_up += copy;
+}
+
+static inline bool tcp_friend_push(struct sock *sk, struct sk_buff *skb)
+{
+	struct sock	*friend = sk->sk_friend;
+	int		wait = false;
+
+	skb_set_owner_r(skb, friend);
+	__skb_queue_tail(&friend->sk_receive_queue, skb);
+	if (!sk_rmem_schedule(friend, skb, skb->truesize))
+		wait = true;
+
+	tcp_friend_seq(sk, skb->len, 0);
+	if (skb == skb_peek(&friend->sk_receive_queue))
+		friend->sk_data_ready(friend, 0);
+
+	return wait;
+}
+
 static inline void tcp_mark_push(struct tcp_sock *tp, struct sk_buff *skb)
 {
 	TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_PSH;
@@ -605,8 +725,13 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
-	skb->csum    = 0;
 	tcb->seq     = tcb->end_seq = tp->write_seq;
+	if (sk->sk_friend) {
+		skb->friend = sk;
+		TCP_FRIEND_CB(tcb)->tail_inuse = false;
+		return;
+	}
+	skb->csum    = 0;
 	tcb->tcp_flags = TCPHDR_ACK;
 	tcb->sacked  = 0;
 	skb_header_release(skb);
@@ -626,7 +751,10 @@ static inline void tcp_mark_urg(struct tcp_sock *tp, int flags)
 static inline void tcp_push(struct sock *sk, int flags, int mss_now,
 			    int nonagle)
 {
-	if (tcp_send_head(sk)) {
+	if (sk->sk_friend) {
+		if (skb_peek(&sk->sk_friend->sk_receive_queue))
+			sk->sk_friend->sk_data_ready(sk->sk_friend, 0);
+	} else if (tcp_send_head(sk)) {
 		struct tcp_sock *tp = tcp_sk(sk);
 
 		if (!(flags & MSG_MORE) || forced_push(tp))
@@ -758,6 +886,21 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 }
 EXPORT_SYMBOL(tcp_splice_read);
 
+static inline struct sk_buff *tcp_friend_alloc_skb(struct sock *sk, int size)
+{
+	struct sk_buff *skb;
+
+	skb = alloc_skb(size, sk->sk_allocation);
+	if (skb)
+		skb->avail_size = skb_tailroom(skb);
+	else {
+		sk->sk_prot->enter_memory_pressure(sk);
+		sk_stream_moderate_sndbuf(sk);
+	}
+
+	return skb;
+}
+
 struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp)
 {
 	struct sk_buff *skb;
@@ -821,12 +964,53 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	return max(xmit_size_goal, mss_now);
 }
 
+static unsigned int tcp_friend_xmit_size_goal(struct sock *sk, int size_goal)
+{
+	u32 size = SKB_DATA_ALIGN(size_goal);
+	u32 overhead = sizeof(struct skb_shared_info) + sizeof(struct sk_buff);
+
+	/*
+	 * If alloc >= largest skb use largest order, else check
+	 * for optimal tail fill size, else use largest order.
+	 */
+	if (size >= SKB_MAX_ORDER(0, 4))
+		size = SKB_MAX_ORDER(0, 4);
+	else if (size <= (SKB_MAX_ORDER(0, 0) >> 3))
+		size = SKB_MAX_ORDER(0, 0);
+	else if (size <= (SKB_MAX_ORDER(0, 1) >> 3))
+		size = SKB_MAX_ORDER(0, 1);
+	else if (size <= (SKB_MAX_ORDER(0, 0) >> 1))
+		size = SKB_MAX_ORDER(0, 0);
+	else if (size <= (SKB_MAX_ORDER(0, 1) >> 1))
+		size = SKB_MAX_ORDER(0, 1);
+	else if (size <= (SKB_MAX_ORDER(0, 2) >> 1))
+		size = SKB_MAX_ORDER(0, 2);
+	else if (size <= (SKB_MAX_ORDER(0, 3) >> 1))
+		size = SKB_MAX_ORDER(0, 3);
+	else
+		size = SKB_MAX_ORDER(0, 4);
+
+	/* At least 2 true sized in sk_buf */
+	if (size + overhead > (sk_sndbuf_get(sk) >> 1))
+		size = (sk_sndbuf_get(sk) >> 1) - overhead;
+
+	return size;
+}
+
 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 {
 	int mss_now;
+	int tmp;
+
+	if (sk->sk_friend) {
+		mss_now = tcp_friend_xmit_size_goal(sk, *size_goal);
+		tmp = mss_now;
+	} else {
+		mss_now = tcp_current_mss(sk);
+		tmp = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
+	}
 
-	mss_now = tcp_current_mss(sk);
-	*size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));
+	*size_goal = tmp;
 
 	return mss_now;
 }
@@ -834,8 +1018,9 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
 			 size_t psize, int flags)
 {
+	struct sock *friend = sk->sk_friend;
 	struct tcp_sock *tp = tcp_sk(sk);
-	int mss_now, size_goal;
+	int mss_now, size_goal = psize;
 	int err;
 	ssize_t copied;
 	long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
@@ -850,6 +1035,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 			goto out_err;
 	}
 
+	err = tcp_friend_validate(sk, &friend, &timeo);
+	if (err < 0)
+		goto out_err;
+
 	clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 
 	mss_now = tcp_send_mss(sk, &size_goal, flags);
@@ -860,25 +1049,47 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 		goto out_err;
 
 	while (psize > 0) {
-		struct sk_buff *skb = tcp_write_queue_tail(sk);
+		struct sk_buff *skb;
+		struct tcp_skb_cb *tcb;
 		struct page *page = pages[poffset / PAGE_SIZE];
 		int copy, i;
 		int offset = poffset % PAGE_SIZE;
 		int size = min_t(size_t, psize, PAGE_SIZE - offset);
 		bool can_coalesce;
 
-		if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
+		if (friend) {
+			copy = size_goal;
+			skb = tcp_friend_tail(friend, &copy);
+			if (copy < 0) {
+				sk->sk_err = -copy;
+				err = -EPIPE;
+				goto out_err;
+			}
+		} else if (!tcp_send_head(sk)) {
+			skb = NULL;
+			copy = 0;
+		} else {
+			skb = tcp_write_queue_tail(sk);
+			copy = size_goal - skb->len;
+		}
+
+		if (copy <= 0) {
 new_segment:
 			if (!sk_stream_memory_free(sk))
 				goto wait_for_sndbuf;
 
-			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation);
+			if (friend)
+				skb = tcp_friend_alloc_skb(sk, 0);
+			else
+				skb = sk_stream_alloc_skb(sk, 0,
+							  sk->sk_allocation);
 			if (!skb)
 				goto wait_for_memory;
 
 			skb_entail(sk, skb);
 			copy = size_goal;
 		}
+		tcb = TCP_SKB_CB(skb);
 
 		if (copy > size)
 			copy = size;
@@ -886,10 +1097,14 @@ new_segment:
 		i = skb_shinfo(skb)->nr_frags;
 		can_coalesce = skb_can_coalesce(skb, i, page, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
-			tcp_mark_push(tp, skb);
+			if (friend) {
+				if (TCP_FRIEND_CB(tcb)->tail_inuse)
+					TCP_FRIEND_CB(tcb)->tail_inuse = false;
+			} else
+				tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
-		if (!sk_wmem_schedule(sk, copy))
+		if (!friend && !sk_wmem_schedule(sk, copy))
 			goto wait_for_memory;
 
 		if (can_coalesce) {
@@ -902,19 +1117,41 @@ new_segment:
 		skb->len += copy;
 		skb->data_len += copy;
 		skb->truesize += copy;
-		sk->sk_wmem_queued += copy;
-		sk_mem_charge(sk, copy);
-		skb->ip_summed = CHECKSUM_PARTIAL;
 		tp->write_seq += copy;
-		TCP_SKB_CB(skb)->end_seq += copy;
-		skb_shinfo(skb)->gso_segs = 0;
-
-		if (!copied)
-			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
 
 		copied += copy;
 		poffset += copy;
-		if (!(psize -= copy))
+		psize -= copy;
+
+		if (friend) {
+			err = tcp_friend_send_lock(friend);
+			if (err) {
+				sk->sk_err = -err;
+				err = -EPIPE;
+				goto out_err;
+			}
+			tcb->end_seq += copy;
+			if (TCP_FRIEND_CB(tcb)->tail_inuse) {
+				TCP_FRIEND_CB(tcb)->tail_inuse = false;
+				tcp_friend_seq(sk, copy, copy);
+			} else {
+				if (tcp_friend_push(sk, skb))
+					goto wait_for_sndbuf;
+			}
+			if (!psize)
+				goto out;
+			continue;
+		}
+
+		tcb->end_seq += copy;
+		skb_shinfo(skb)->gso_segs = 0;
+		sk->sk_wmem_queued += copy;
+		sk_mem_charge(sk, copy);
+		skb->ip_summed = CHECKSUM_PARTIAL;
+		if (copied == copy)
+			tcb->tcp_flags &= ~TCPHDR_PSH;
+
+		if (!psize)
 			goto out;
 
 		if (skb->len < size_goal || (flags & MSG_OOB))
@@ -935,7 +1172,8 @@ wait_for_memory:
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 			goto do_error;
 
-		mss_now = tcp_send_mss(sk, &size_goal, flags);
+		if (!friend)
+			mss_now = tcp_send_mss(sk, &size_goal, flags);
 	}
 
 out:
@@ -1026,10 +1264,12 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t size)
 {
 	struct iovec *iov;
+	struct sock *friend = sk->sk_friend;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
+	struct tcp_skb_cb *tcb;
 	int iovlen, flags, err, copied = 0;
-	int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
+	int mss_now = 0, size_goal = size, copied_syn = 0, offset = 0;
 	bool sg;
 	long timeo;
 
@@ -1057,6 +1297,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			goto do_error;
 	}
 
+	err = tcp_friend_validate(sk, &friend, &timeo);
+	if (err < 0)
+		goto out;
+
 	if (unlikely(tp->repair)) {
 		if (tp->repair_queue == TCP_RECV_QUEUE) {
 			copied = tcp_send_rcvq(sk, msg, size);
@@ -1105,24 +1349,38 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			int copy = 0;
 			int max = size_goal;
 
-			skb = tcp_write_queue_tail(sk);
-			if (tcp_send_head(sk)) {
-				if (skb->ip_summed == CHECKSUM_NONE)
-					max = mss_now;
-				copy = max - skb->len;
+			if (friend) {
+				skb = tcp_friend_tail(friend, &copy);
+				if (copy < 0) {
+					sk->sk_err = -copy;
+					err = -EPIPE;
+					goto out_err;
+				}
+			} else {
+				skb = tcp_write_queue_tail(sk);
+				if (tcp_send_head(sk)) {
+					if (skb->ip_summed == CHECKSUM_NONE)
+						max = mss_now;
+					copy = max - skb->len;
+				}
 			}
 
 			if (copy <= 0) {
 new_segment:
-				/* Allocate new segment. If the interface is SG,
-				 * allocate skb fitting to single page.
-				 */
 				if (!sk_stream_memory_free(sk))
 					goto wait_for_sndbuf;
 
-				skb = sk_stream_alloc_skb(sk,
-							  select_size(sk, sg),
-							  sk->sk_allocation);
+				if (friend)
+					skb = tcp_friend_alloc_skb(sk, max);
+				else {
+					/* Allocate new segment. If the
+					 * interface is SG, allocate skb
+					 * fitting to single page.
+					 */
+					skb = sk_stream_alloc_skb(sk,
+							select_size(sk, sg),
+							sk->sk_allocation);
+				}
 				if (!skb)
 					goto wait_for_memory;
 
@@ -1136,6 +1394,7 @@ new_segment:
 				copy = size_goal;
 				max = size_goal;
 			}
+			tcb = TCP_SKB_CB(skb);
 
 			/* Try to append data to the end of skb. */
 			if (copy > seglen)
@@ -1153,6 +1412,8 @@ new_segment:
 				int i = skb_shinfo(skb)->nr_frags;
 				struct page_frag *pfrag = sk_page_frag(sk);
 
+				BUG_ON(friend);
+
 				if (!sk_page_frag_refill(sk, pfrag))
 					goto wait_for_memory;
 
@@ -1188,16 +1449,37 @@ new_segment:
 				pfrag->offset += copy;
 			}
 
-			if (!copied)
-				TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
-
 			tp->write_seq += copy;
-			TCP_SKB_CB(skb)->end_seq += copy;
-			skb_shinfo(skb)->gso_segs = 0;
 
 			from += copy;
 			copied += copy;
-			if ((seglen -= copy) == 0 && iovlen == 0)
+			seglen -= copy;
+
+			if (friend) {
+				err = tcp_friend_send_lock(friend);
+				if (err) {
+					sk->sk_err = -err;
+					err = -EPIPE;
+					goto out_err;
+				}
+				tcb->end_seq += copy;
+				if (TCP_FRIEND_CB(tcb)->tail_inuse) {
+					TCP_FRIEND_CB(tcb)->tail_inuse = false;
+					tcp_friend_seq(sk, copy, 0);
+				} else {
+					if (tcp_friend_push(sk, skb))
+						goto wait_for_sndbuf;
+				}
+				continue;
+			}
+
+			tcb->end_seq += copy;
+			skb_shinfo(skb)->gso_segs = 0;
+
+			if (copied == copy)
+				tcb->tcp_flags &= ~TCPHDR_PSH;
+
+			if (seglen == 0 && iovlen == 0)
 				goto out;
 
 			if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
@@ -1219,7 +1501,8 @@ wait_for_memory:
 			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 				goto do_error;
 
-			mss_now = tcp_send_mss(sk, &size_goal, flags);
+			if (!friend)
+				mss_now = tcp_send_mss(sk, &size_goal, flags);
 		}
 	}
 
@@ -1230,7 +1513,12 @@ out:
 	return copied + copied_syn;
 
 do_fault:
-	if (!skb->len) {
+	if (skb->friend) {
+		if (TCP_FRIEND_CB(tcb)->tail_inuse)
+			TCP_FRIEND_CB(tcb)->tail_inuse = false;
+		else
+			__kfree_skb(skb);
+	} else if (!skb->len) {
 		tcp_unlink_write_queue(skb, sk);
 		/* It is the one place in all of TCP, except connection
 		 * reset, where we can be unlinking the send_head.
@@ -1249,6 +1537,13 @@ out_err:
 }
 EXPORT_SYMBOL(tcp_sendmsg);
 
+static inline void tcp_friend_write_space(struct sock *sk)
+{
+	/* Queued data below 1/4th of sndbuf? */
+	if ((sk_sndbuf_get(sk) >> 2) > sk_wmem_queued_get(sk))
+		sk->sk_friend->sk_write_space(sk->sk_friend);
+}
+
 /*
  *	Handle reading urgent data. BSD has very simple semantics for
  *	this, no blocking and very strange errors 8)
@@ -1327,7 +1622,12 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool time_to_ack = false;
 
-	struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);
+	struct sk_buff *skb;
+
+	if (sk->sk_friend)
+		return;
+
+	skb = skb_peek(&sk->sk_receive_queue);
 
 	WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
 	     "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
@@ -1431,17 +1731,27 @@ static void tcp_service_net_dma(struct sock *sk, bool wait)
 }
 #endif
 
-static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
+static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off,
+					   size_t *len)
 {
 	struct sk_buff *skb;
 	u32 offset;
+	size_t avail;
 
 	skb_queue_walk(&sk->sk_receive_queue, skb) {
-		offset = seq - TCP_SKB_CB(skb)->seq;
-		if (tcp_hdr(skb)->syn)
-			offset--;
-		if (offset < skb->len || tcp_hdr(skb)->fin) {
+		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+		offset = seq - tcb->seq;
+		if (skb->friend)
+			avail = (u32)(tcb->end_seq - seq);
+		else {
+			if (tcp_hdr(skb)->syn)
+				offset--;
+			avail = skb->len - offset;
+		}
+		if (avail > 0 || (!skb->friend && tcp_hdr(skb)->fin)) {
 			*off = offset;
+			*len = avail;
 			return skb;
 		}
 	}
@@ -1467,15 +1777,23 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	u32 seq = tp->copied_seq;
 	u32 offset;
 	int copied = 0;
+	size_t len;
+	int err;
+	struct sock *friend = sk->sk_friend;
+	long timeo = sock_rcvtimeo(sk, false);
 
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
-		if (offset < skb->len) {
-			int used;
-			size_t len;
+	err = tcp_friend_validate(sk, &friend, &timeo);
+	if (err < 0)
+		return err;
+	if (friend)
+		tcp_friend_recv_lock(sk);
 
-			len = skb->len - offset;
+	while ((skb = tcp_recv_skb(sk, seq, &offset, &len)) != NULL) {
+		if (len > 0) {
+			int used;
+	again:
 			/* Stop reading if we hit a patch of urgent data */
 			if (tp->urg_data) {
 				u32 urg_offset = tp->urg_seq - seq;
@@ -1484,6 +1802,10 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				if (!len)
 					break;
 			}
+
+			if (friend)
+				tcp_friend_unlock(sk);
+
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
@@ -1494,33 +1816,65 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				copied += used;
 				offset += used;
 			}
-			/*
-			 * If recv_actor drops the lock (e.g. TCP splice
-			 * receive) the skb pointer might be invalid when
-			 * getting here: tcp_collapse might have deleted it
-			 * while aggregating skbs from the socket queue.
-			 */
-			skb = tcp_recv_skb(sk, seq-1, &offset);
-			if (!skb || (offset+1 != skb->len))
-				break;
+
+			if (friend)
+				tcp_friend_recv_lock(sk);
+			if (skb->friend) {
+				len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
+				if (len > 0) {
+					/*
+					 * Friend did an skb_put() while we
+					 * were away so process the same skb.
+					 */
+					if (!desc->count)
+						break;
+					tp->copied_seq = seq;
+					goto again;
+				}
+			} else {
+				/*
+				 * If recv_actor drops the lock (e.g. TCP
+				 * splice receive) the skb pointer might be
+				 * invalid when getting here: tcp_collapse
+				 * might have deleted it while aggregating
+				 * skbs from the socket queue.
+				 */
+				skb = tcp_recv_skb(sk, seq-1, &offset, &len);
+				if (!skb || (offset+1 != skb->len))
+					break;
+			}
 		}
-		if (tcp_hdr(skb)->fin) {
+		if (!skb->friend && tcp_hdr(skb)->fin) {
 			sk_eat_skb(sk, skb, false);
 			++seq;
 			break;
 		}
-		sk_eat_skb(sk, skb, false);
+		if (skb->friend) {
+			if (!TCP_FRIEND_CB(TCP_SKB_CB(skb))->tail_inuse) {
+				__skb_unlink(skb, &sk->sk_receive_queue);
+				__kfree_skb(skb);
+				tcp_friend_write_space(sk);
+			}
+			tcp_friend_unlock(sk);
+			tcp_friend_recv_lock(sk);
+		} else
+			sk_eat_skb(sk, skb, 0);
 		if (!desc->count)
 			break;
 		tp->copied_seq = seq;
 	}
 	tp->copied_seq = seq;
 
-	tcp_rcv_space_adjust(sk);
+	if (friend) {
+		tcp_friend_unlock(sk);
+		tcp_friend_write_space(sk);
+	} else {
+		tcp_rcv_space_adjust(sk);
 
-	/* Clean up data we have read: This will do ACK frames. */
-	if (copied > 0)
-		tcp_cleanup_rbuf(sk, copied);
+		/* Clean up data we have read: This will do ACK frames. */
+		if (copied > 0)
+			tcp_cleanup_rbuf(sk, copied);
+	}
 	return copied;
 }
 EXPORT_SYMBOL(tcp_read_sock);
@@ -1536,6 +1890,7 @@ EXPORT_SYMBOL(tcp_read_sock);
 int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t len, int nonblock, int flags, int *addr_len)
 {
+	struct sock *friend = sk->sk_friend;
 	struct tcp_sock *tp = tcp_sk(sk);
 	int copied = 0;
 	u32 peek_seq;
@@ -1548,6 +1903,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	bool copied_early = false;
 	struct sk_buff *skb;
 	u32 urg_hole = 0;
+	bool locked = false;
 
 	lock_sock(sk);
 
@@ -1557,6 +1913,10 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 	timeo = sock_rcvtimeo(sk, nonblock);
 
+	err = tcp_friend_validate(sk, &friend, &timeo);
+	if (err < 0)
+		goto out;
+
 	/* Urgent data needs to be handled specially. */
 	if (flags & MSG_OOB)
 		goto recv_urg;
@@ -1595,7 +1955,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
 		if ((available < target) &&
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
-		    !sysctl_tcp_low_latency &&
+		    !sysctl_tcp_low_latency && !friend &&
 		    net_dma_find_channel()) {
 			preempt_enable_no_resched();
 			tp->ucopy.pinned_list =
@@ -1606,7 +1966,10 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	}
 #endif
 
+	err = 0;
+
 	do {
+		struct tcp_skb_cb *tcb;
 		u32 offset;
 
 		/* Are we at urgent data? Stop if we have read anything or have SIGURG pending. */
@@ -1614,37 +1977,77 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			if (copied)
 				break;
 			if (signal_pending(current)) {
-				copied = timeo ? sock_intr_errno(timeo) : -EAGAIN;
+				err = timeo ? sock_intr_errno(timeo) : -EAGAIN;
 				break;
 			}
 		}
 
-		/* Next get a buffer. */
+		/*
+		 * Next get a buffer. Note, for friends sendmsg() queues
+		 * data directly to our sk_receive_queue by holding our
+		 * slock and either tail queuing a new skb or adding new
+		 * data to the tail skb. In the later case tail_inuse is
+		 * set, slock dropped, copyin, skb->len updated, re-hold
+		 * slock, end_seq updated, so we can only use the bytes
+		 * from *seq to end_seq!
+		 */
+		if (friend && !locked) {
+			tcp_friend_recv_lock(sk);
+			locked = true;
+		}
 
 		skb_queue_walk(&sk->sk_receive_queue, skb) {
+			tcb = TCP_SKB_CB(skb);
+			offset = *seq - tcb->seq;
+			if (friend) {
+				if (skb->friend) {
+					used = (u32)(tcb->end_seq - *seq);
+					if (used > 0) {
+						tcp_friend_unlock(sk);
+						locked = false;
+						/* Can use it all */
+						goto found_ok_skb;
+					}
+					/* No data to copyout */
+					if (flags & MSG_PEEK)
+						continue;
+					if (!TCP_FRIEND_CB(tcb)->tail_inuse)
+						goto unlink;
+					break;
+				}
+				tcp_friend_unlock(sk);
+				locked = false;
+			}
+
 			/* Now that we have two receive queues this
 			 * shouldn't happen.
 			 */
-			if (WARN(before(*seq, TCP_SKB_CB(skb)->seq),
+			if (WARN(before(*seq, tcb->seq),
 				 "recvmsg bug: copied %X seq %X rcvnxt %X fl %X\n",
-				 *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt,
-				 flags))
+				 *seq, tcb->seq, tp->rcv_nxt, flags))
 				break;
 
-			offset = *seq - TCP_SKB_CB(skb)->seq;
 			if (tcp_hdr(skb)->syn)
 				offset--;
-			if (offset < skb->len)
+			if (offset < skb->len) {
+				/* Ok so how much can we use? */
+				used = skb->len - offset;
 				goto found_ok_skb;
+			}
 			if (tcp_hdr(skb)->fin)
 				goto found_fin_ok;
 			WARN(!(flags & MSG_PEEK),
 			     "recvmsg bug 2: copied %X seq %X rcvnxt %X fl %X\n",
-			     *seq, TCP_SKB_CB(skb)->seq, tp->rcv_nxt, flags);
+			     *seq, tcb->seq, tp->rcv_nxt, flags);
 		}
 
 		/* Well, if we have backlog, try to process it now yet. */
 
+		if (friend && locked) {
+			tcp_friend_unlock(sk);
+			locked = false;
+		}
+
 		if (copied >= target && !sk->sk_backlog.tail)
 			break;
 
@@ -1691,7 +2094,8 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 		tcp_cleanup_rbuf(sk, copied);
 
-		if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
+		if (!sysctl_tcp_low_latency && !friend &&
+		    tp->ucopy.task == user_recv) {
 			/* Install new reader */
 			if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
 				user_recv = current;
@@ -1791,8 +2195,6 @@ do_prequeue:
 		continue;
 
 	found_ok_skb:
-		/* Ok so how much can we use? */
-		used = skb->len - offset;
 		if (len < used)
 			used = len;
 
@@ -1849,7 +2251,7 @@ do_prequeue:
 				if (err) {
 					/* Exception. Bailout! */
 					if (!copied)
-						copied = -EFAULT;
+						copied = err;
 					break;
 				}
 			}
@@ -1858,6 +2260,7 @@ do_prequeue:
 		*seq += used;
 		copied += used;
 		len -= used;
+		offset += used;
 
 		tcp_rcv_space_adjust(sk);
 
@@ -1866,10 +2269,43 @@ skip_copy:
 			tp->urg_data = 0;
 			tcp_fast_path_check(sk);
 		}
-		if (used + offset < skb->len)
+
+		if (skb->friend) {
+			tcp_friend_recv_lock(sk);
+			locked = true;
+			used = (u32)(tcb->end_seq - *seq);
+			if (used) {
+				/*
+				 * Friend did an skb_put() while we were away
+				 * so if more to do process the same skb.
+				 */
+				if (len > 0) {
+					tcp_friend_unlock(sk);
+					locked = false;
+					goto found_ok_skb;
+				}
+				continue;
+			}
+			if (TCP_FRIEND_CB(tcb)->tail_inuse) {
+				/* Give sendmsg a chance */
+				tcp_friend_unlock(sk);
+				locked = false;
+				continue;
+			}
+			if (!(flags & MSG_PEEK)) {
+		unlink:
+				__skb_unlink(skb, &sk->sk_receive_queue);
+				__kfree_skb(skb);
+				tcp_friend_unlock(sk);
+				locked = false;
+				tcp_friend_write_space(sk);
+			}
 			continue;
+		}
 
-		if (tcp_hdr(skb)->fin)
+		if (offset < skb->len)
+			continue;
+		else if (tcp_hdr(skb)->fin)
 			goto found_fin_ok;
 		if (!(flags & MSG_PEEK)) {
 			sk_eat_skb(sk, skb, copied_early);
@@ -1887,6 +2323,9 @@ skip_copy:
 		break;
 	} while (len > 0);
 
+	if (friend && locked)
+		tcp_friend_unlock(sk);
+
 	if (user_recv) {
 		if (!skb_queue_empty(&tp->ucopy.prequeue)) {
 			int chunk;
@@ -2065,6 +2504,9 @@ void tcp_close(struct sock *sk, long timeout)
 		goto adjudge_to_death;
 	}
 
+	if (sk->sk_friend)
+		sock_put(sk->sk_friend);
+
 	/*  We need to flush the recv. buffs.  We do this only on the
 	 *  descriptor close, not protocol-sourced closes, because the
 	 *  reader process may not have drained the data yet!
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fc67831..9640a81 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -530,6 +530,9 @@ void tcp_rcv_space_adjust(struct sock *sk)
 	int time;
 	int space;
 
+	if (sk->sk_friend)
+		return;
+
 	if (tp->rcvq_space.time == 0)
 		goto new_measure;
 
@@ -4350,8 +4353,9 @@ static int tcp_prune_queue(struct sock *sk);
 static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
 				 unsigned int size)
 {
-	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-	    !sk_rmem_schedule(sk, skb, size)) {
+	if (!sk->sk_friend &&
+	    (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+	    !sk_rmem_schedule(sk, skb, size))) {
 
 		if (tcp_prune_queue(sk) < 0)
 			return -1;
@@ -5722,6 +5726,16 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 *    state to ESTABLISHED..."
 		 */
 
+		if (skb->friend) {
+			/*
+			 * If friends haven't been made yet, our sk_friend
+			 * still == NULL, then update with the ACK's friend
+			 * value (the listen()er's sock addr) which is used
+			 * as a place holder.
+			 */
+			cmpxchg(&sk->sk_friend, NULL, skb->friend);
+		}
+
 		TCP_ECN_rcv_synack(tp, th);
 
 		tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
@@ -5797,9 +5811,9 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		    tcp_rcv_fastopen_synack(sk, skb, &foc))
 			return -1;
 
-		if (sk->sk_write_pending ||
+		if (!skb->friend && (sk->sk_write_pending ||
 		    icsk->icsk_accept_queue.rskq_defer_accept ||
-		    icsk->icsk_ack.pingpong) {
+		    icsk->icsk_ack.pingpong)) {
 			/* Save one ACK. Data will be ready after
 			 * several ticks, if write_pending is set.
 			 *
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 1ed2307..f494914 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1512,6 +1512,8 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
 #endif
 
+	req->friend = skb->friend;
+
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f35f2df..36d832a 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -270,6 +270,9 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 	const struct tcp_sock *tp = tcp_sk(sk);
 	bool recycle_ok = false;
 
+	if (sk->sk_friend)
+		goto out;
+
 	if (tcp_death_row.sysctl_tw_recycle && tp->rx_opt.ts_recent_stamp)
 		recycle_ok = tcp_remember_stamp(sk);
 
@@ -349,6 +352,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 	}
 
 	tcp_update_metrics(sk);
+out:
 	tcp_done(sk);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8ac0855..509c5e3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -65,6 +65,9 @@ int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+/* By default, TCP loopback bypass */
+int sysctl_tcp_friends __read_mostly = 1;
+
 int sysctl_tcp_cookie_size __read_mostly = 0; /* TCP_COOKIE_MAX */
 EXPORT_SYMBOL_GPL(sysctl_tcp_cookie_size);
 
@@ -1025,9 +1028,13 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	tcb = TCP_SKB_CB(skb);
 	memset(&opts, 0, sizeof(opts));
 
-	if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
+	if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) {
+		/* Only try to make friends if enabled */
+		if (sysctl_tcp_friends)
+			skb->friend = sk;
+
 		tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
-	else
+	} else
 		tcp_options_size = tcp_established_options(sk, skb, &opts,
 							   &md5);
 	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
@@ -2725,6 +2732,11 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	}
 
 	memset(&opts, 0, sizeof(opts));
+
+	/* Only try to make friends if enabled */
+	if (sysctl_tcp_friends)
+		skb->friend = sk;
+
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(req->cookie_ts))
 		TCP_SKB_CB(skb)->when = cookie_init_timestamp(req);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6565cf5..828d5f7 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -969,6 +969,7 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
 #endif
 
+	req->friend = skb->friend;
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
-- 
1.7.4.4

^ permalink raw reply related

* [PATCH 2/3] fix panic in tcp_close()
From: Weiping Pan @ 2012-12-05  2:54 UTC (permalink / raw)
  To: netdev; +Cc: brutus, Weiping Pan
In-Reply-To: <cover.1354674151.git.wpan@redhat.com>

For tcp friends data skb, it has no tcp header,
and its transport_header is NULL, so it will panic if we deference tcp_hdr(skb)
in tcp_close().

So I add a check before we use tcp_hdr().

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 net/ipv4/tcp.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4327deb..e9d82e0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2512,8 +2512,12 @@ void tcp_close(struct sock *sk, long timeout)
 	 *  reader process may not have drained the data yet!
 	 */
 	while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
-		u32 len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
+		u32 len;
+		if (tcp_hdr(skb))
+			len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq -
 			  tcp_hdr(skb)->fin;
+		else
+			len = TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq;
 		data_was_unread += len;
 		__kfree_skb(skb);
 	}
-- 
1.7.4.4

^ permalink raw reply related

* [PATCH 3/3] delete request_sock->friend
From: Weiping Pan @ 2012-12-05  2:54 UTC (permalink / raw)
  To: netdev; +Cc: brutus, Weiping Pan
In-Reply-To: <cover.1354674151.git.wpan@redhat.com>

The sock pointed by request_sock->friend may be freed since it does not have a
lock to protect it.
I just delete request_sock->friend since I think it is useless.

For sk_buff->friend, it has the same problem, and we use
"atomic_add(skb->truesize, &sk->sk_wmem_alloc)" to guarantee that the sock can
not be freed before the skb is freed.

Then for 3-way handshake with tcp friends enabled,
SYN->friend is NULL, SYN/ACK->friend is set in tcp_make_synack(),
and ACK->friend is set in tcp_send_ack().

Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 include/net/inet_connection_sock.h |    4 ++
 include/net/request_sock.h         |    1 -
 net/ipv4/inet_connection_sock.c    |   58 +++++++++++++++++++++++------------
 net/ipv4/tcp_input.c               |   10 ------
 net/ipv4/tcp_ipv4.c                |    7 +++-
 net/ipv4/tcp_minisocks.c           |    7 ++++-
 net/ipv4/tcp_output.c              |   21 ++++++++-----
 net/ipv6/tcp_ipv6.c                |    1 -
 8 files changed, 66 insertions(+), 43 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index ba1d361..883e029 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -147,6 +147,10 @@ static inline void *inet_csk_ca(const struct sock *sk)
 extern struct sock *inet_csk_clone_lock(const struct sock *sk,
 					const struct request_sock *req,
 					const gfp_t priority);
+extern struct sock *inet_csk_friend_clone_lock(const struct sock *sk,
+					const struct request_sock *req,
+					const struct sk_buff *skb,
+					const gfp_t priority);
 
 enum inet_csk_ack_state_t {
 	ICSK_ACK_SCHED	= 1,
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index c6dfa26..a51dbd1 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -66,7 +66,6 @@ struct request_sock {
 	unsigned long			expires;
 	const struct request_sock_ops	*rsk_ops;
 	struct sock			*sk;
-	struct sock			*friend;
 	u32				secid;
 	u32				peer_secid;
 };
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index ce4b79b..7af92ed 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -659,26 +659,6 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 	if (newsk != NULL) {
 		struct inet_connection_sock *newicsk = inet_csk(newsk);
 
-		if (req->friend) {
-			/*
-			 * Make friends with the requestor but the ACK of
-			 * the request is already in-flight so the race is
-			 * on to make friends before the ACK is processed.
-			 * If the requestor's sk_friend value is != NULL
-			 * then the requestor has already processed the
-			 * ACK so indicate state change to wake'm up.
-			 */
-			struct sock *was;
-
-			sock_hold(req->friend);
-			newsk->sk_friend = req->friend;
-			sock_hold(newsk);
-			was = xchg(&req->friend->sk_friend, newsk);
-			/* If requester already connect()ed, maybe sleeping */
-			if (was && !sock_flag(req->friend, SOCK_DEAD))
-				sk->sk_state_change(req->friend);
-		}
-
 		newsk->sk_state = TCP_SYN_RECV;
 		newicsk->icsk_bind_hash = NULL;
 
@@ -700,6 +680,44 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 }
 EXPORT_SYMBOL_GPL(inet_csk_clone_lock);
 
+/**
+ *	inet_csk_friend_clone_lock - clone an inet socket, and lock its clone
+ *	@sk: the socket to clone
+ *	@req: request_sock
+ *	@skb: who sends the request
+ *	@priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
+ *
+ *	Caller must unlock socket even in error path (bh_unlock_sock(newsk))
+ */
+struct sock *inet_csk_friend_clone_lock(const struct sock *sk,
+				 const struct request_sock *req,
+				 const struct sk_buff *skb,
+				 const gfp_t priority)
+{
+	struct sock *newsk = inet_csk_clone_lock(sk, req, priority);
+
+	if (newsk) {
+		struct sock *friend = skb->friend;
+		if (friend) {
+			/*
+			 * Make friends.
+			 */
+			struct sock *was;
+
+			sock_hold(friend);
+			newsk->sk_friend = friend;
+			sock_hold(newsk);
+			was = xchg(&friend->sk_friend, newsk);
+			/* If requester already connect()ed, maybe sleeping */
+			if (was && !sock_flag(friend, SOCK_DEAD))
+				sk->sk_state_change(friend);
+		}
+	}
+
+	return newsk;
+}
+EXPORT_SYMBOL_GPL(inet_csk_friend_clone_lock);
+
 /*
  * At this point, there should be no process reference to this
  * socket, and thus no user references at all.  Therefore we
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9640a81..39db09d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5726,16 +5726,6 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		 *    state to ESTABLISHED..."
 		 */
 
-		if (skb->friend) {
-			/*
-			 * If friends haven't been made yet, our sk_friend
-			 * still == NULL, then update with the ACK's friend
-			 * value (the listen()er's sock addr) which is used
-			 * as a place holder.
-			 */
-			cmpxchg(&sk->sk_friend, NULL, skb->friend);
-		}
-
 		TCP_ECN_rcv_synack(tp, th);
 
 		tcp_init_wl(tp, TCP_SKB_CB(skb)->seq);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f494914..8d61e4c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1512,8 +1512,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv4_ops;
 #endif
 
-	req->friend = skb->friend;
-
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = TCP_MSS_DEFAULT;
 	tmp_opt.user_mss  = tp->rx_opt.user_mss;
@@ -1873,6 +1871,11 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 	if (skb->len < tcp_hdrlen(skb) || tcp_checksum_complete(skb))
 		goto csum_err;
 
+	if (sysctl_tcp_friends && skb->friend) {
+		skb->sk = skb->friend;
+		skb->destructor = sock_wfree;
+	}
+
 	if (sk->sk_state == TCP_LISTEN) {
 		struct sock *nsk = tcp_v4_hnd_req(sk, skb);
 		if (!nsk)
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 36d832a..753126e 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -383,7 +383,12 @@ static inline void TCP_ECN_openreq_child(struct tcp_sock *tp,
  */
 struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, struct sk_buff *skb)
 {
-	struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
+	struct sock *newsk = NULL;
+
+	if (sysctl_tcp_friends && skb->friend)
+		newsk = inet_csk_friend_clone_lock(sk, req, skb, GFP_ATOMIC);
+	else
+		newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
 
 	if (newsk != NULL) {
 		const struct inet_request_sock *ireq = inet_rsk(req);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 509c5e3..4d71549 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1028,13 +1028,9 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	tcb = TCP_SKB_CB(skb);
 	memset(&opts, 0, sizeof(opts));
 
-	if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) {
-		/* Only try to make friends if enabled */
-		if (sysctl_tcp_friends)
-			skb->friend = sk;
-
+	if (unlikely(tcb->tcp_flags & TCPHDR_SYN))
 		tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5);
-	} else
+	else
 		tcp_options_size = tcp_established_options(sk, skb, &opts,
 							   &md5);
 	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
@@ -1050,7 +1046,11 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 
 	skb_orphan(skb);
 	skb->sk = sk;
-	skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
+
+	if (skb->friend)
+		skb->destructor = NULL;
+	else
+		skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
 			  tcp_wfree : sock_wfree;
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 
@@ -2734,8 +2734,10 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	memset(&opts, 0, sizeof(opts));
 
 	/* Only try to make friends if enabled */
-	if (sysctl_tcp_friends)
+	if (sysctl_tcp_friends) {
 		skb->friend = sk;
+		atomic_add(skb->truesize, &sk->sk_wmem_alloc);
+	}
 
 #ifdef CONFIG_SYN_COOKIES
 	if (unlikely(req->cookie_ts))
@@ -3120,6 +3122,9 @@ void tcp_send_ack(struct sock *sk)
 
 	/* Send it off, this clears delayed acks for us. */
 	TCP_SKB_CB(buff)->when = tcp_time_stamp;
+
+	if (sysctl_tcp_friends)
+		buff->friend = sk;
 	tcp_transmit_skb(sk, buff, 0, sk_gfp_atomic(sk, GFP_ATOMIC));
 }
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 828d5f7..6565cf5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -969,7 +969,6 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	tcp_rsk(req)->af_specific = &tcp_request_sock_ipv6_ops;
 #endif
 
-	req->friend = skb->friend;
 	tcp_clear_options(&tmp_opt);
 	tmp_opt.mss_clamp = IPV6_MIN_MTU - sizeof(struct tcphdr) - sizeof(struct ipv6hdr);
 	tmp_opt.user_mss = tp->rx_opt.user_mss;
-- 
1.7.4.4

^ permalink raw reply related

* new portal
From: marknmel05 @ 2012-12-05  3:18 UTC (permalink / raw)
  To: ewokandiam

Dear Colleague,  http://elysianwhitetails.com/images/stories/nadomu.html

^ permalink raw reply

* Re: [Patch net-next] netlink: add missing netlink message types to selinux perm table
From: Cong Wang @ 2012-12-05  3:23 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller
In-Reply-To: <1354593208-21939-1-git-send-email-amwang@redhat.com>

On Tue, 2012-12-04 at 11:53 +0800, Cong Wang wrote:
> RTM_NEWNETCONF and RTM_GETNETCONF are missing in this table.

This patch has conflicts with "v3 bridge: export multicast database via
netlink", please drop this for now, I will resend it later.

Thanks.

^ permalink raw reply

* Re: [Suggestion] net/atm : for sprintf, need check the total write length whether larger than a page.
From: Chas Williams (CONTRACTOR) @ 2012-12-05  3:57 UTC (permalink / raw)
  To: Chen Gang; +Cc: David Miller, netdev
In-Reply-To: <50BEA2CB.9000800@asianux.com>

In message <50BEA2CB.9000800@asianux.com>,Chen Gang writes:
>> -	for (i = 0; i < (ESI_LEN - 1); i++)
>> -		pos += sprintf(pos, "%02x:", adev->esi[i]);
>> -	pos += sprintf(pos, "%02x\n", adev->esi[i]);
>>  
>> -	return pos - buf;
>> +	return scnprintf(buf, PAGE_SIZE, "%pM\n", adev->esi);
>>  }
>>  
>
>  "%p" seems print a pointer, not contents of pointer (is it correct ?)
>  will it change the original display format to outside ?

%pM means format this pointer as a mac address.  it didnt exist when the
atm stack was originally written but can be used now to save a bit of
messy code.

>> -		pos += sprintf(pos, "\n");
>> +		count += scnprintf(buf + count, PAGE_SIZE - count, "\n");
...
>  need we judge whether count >= PAGE_SIZE ?

count will eventually make PAGE_SIZE - count reach 0 at which point,
scnprintf() won't be able to write into the buffer.

^ permalink raw reply

* [PATCH net-next 1/2] net: neighbour: prohibit negative value for unres_qlen_bytes parameter
From: Shan Wei @ 2012-12-05  4:49 UTC (permalink / raw)
  To: David Miller, Eric Dumazet, NetDev; +Cc: Shan Wei

From: Shan Wei <davidshan@tencent.com>

unres_qlen_bytes and unres_qlen are int type.
But multiple relation(unres_qlen_bytes = unres_qlen * SKB_TRUESIZE(ETH_FRAME_LEN))
will cause type overflow when seting unres_qlen. e.g.

$ echo 1027506 > /proc/sys/net/ipv4/neigh/eth1/unres_qlen
$ cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen
1182657265
$ cat /proc/sys/net/ipv4/neigh/eth1/unres_qlen_bytes 
-2147479756

The gutted value is not that we setting。
But user/administrator don't know this is caused by int type overflow.

what's more, it is meaningless and even dangerous that unres_qlen_bytes is set
with negative number. Because, for unresolved neighbour address, kernel will cache packets
without limit in __neigh_event_send()(e.g. (u32)-1 = 2GB).


Signed-off-by: Shan Wei <davidshan@tencent.com>
---
 net/core/neighbour.c |   17 ++++++++++++-----
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index f1c0c2e..36fc692 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -62,6 +62,9 @@ static void __neigh_notify(struct neighbour *n, int type, int flags);
 static void neigh_update_notify(struct neighbour *neigh);
 static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
 
+static int zero;
+static int unres_qlen_max = INT_MAX / SKB_TRUESIZE(ETH_FRAME_LEN);
+
 static struct neigh_table *neigh_tables;
 #ifdef CONFIG_PROC_FS
 static const struct file_operations neigh_stat_seq_fops;
@@ -1787,8 +1790,7 @@ static int neightbl_fill_parms(struct sk_buff *skb, struct neigh_parms *parms)
 	    nla_put_u32(skb, NDTPA_QUEUE_LENBYTES, parms->queue_len_bytes) ||
 	    /* approximative value for deprecated QUEUE_LEN (in packets) */
 	    nla_put_u32(skb, NDTPA_QUEUE_LEN,
-			DIV_ROUND_UP(parms->queue_len_bytes,
-				     SKB_TRUESIZE(ETH_FRAME_LEN))) ||
+			parms->queue_len_bytes / SKB_TRUESIZE(ETH_FRAME_LEN)) ||
 	    nla_put_u32(skb, NDTPA_PROXY_QLEN, parms->proxy_qlen) ||
 	    nla_put_u32(skb, NDTPA_APP_PROBES, parms->app_probes) ||
 	    nla_put_u32(skb, NDTPA_UCAST_PROBES, parms->ucast_probes) ||
@@ -2777,9 +2779,13 @@ static int proc_unres_qlen(ctl_table *ctl, int write, void __user *buffer,
 	int size, ret;
 	ctl_table tmp = *ctl;
 
+	tmp.extra1 = &zero;
+	tmp.extra2 = &unres_qlen_max;
 	tmp.data = &size;
-	size = DIV_ROUND_UP(*(int *)ctl->data, SKB_TRUESIZE(ETH_FRAME_LEN));
-	ret = proc_dointvec(&tmp, write, buffer, lenp, ppos);
+
+	size = *(int *)ctl->data / SKB_TRUESIZE(ETH_FRAME_LEN);
+	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
+
 	if (write && !ret)
 		*(int *)ctl->data = size * SKB_TRUESIZE(ETH_FRAME_LEN);
 	return ret;
@@ -2865,7 +2871,8 @@ static struct neigh_sysctl_table {
 			.procname	= "unres_qlen_bytes",
 			.maxlen		= sizeof(int),
 			.mode		= 0644,
-			.proc_handler	= proc_dointvec,
+			.extra1		= &zero,
+			.proc_handler   = proc_dointvec_minmax,
 		},
 		[NEIGH_VAR_PROXY_QLEN] = {
 			.procname	= "proxy_qlen",
-- 
1.7.1

^ permalink raw reply related

* [PATCH net-next 2/2] net: doc: add default value for neighbour parameters
From: Shan Wei @ 2012-12-05  4:50 UTC (permalink / raw)
  To: David Miller, Eric Dumazet, NetDev, Shan Wei

From: Shan Wei <davidshan@tencent.com>

Signed-off-by: Shan Wei <davidshan@tencent.com>
---
 Documentation/networking/ip-sysctl.txt |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c6d5fee..0462a71 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -30,16 +30,24 @@ neigh/default/gc_thresh3 - INTEGER
 	Maximum number of neighbor entries allowed.  Increase this
 	when using large numbers of interfaces and when communicating
 	with large numbers of directly-connected peers.
+	Default: 1024
 
 neigh/default/unres_qlen_bytes - INTEGER
 	The maximum number of bytes which may be used by packets
 	queued for each	unresolved address by other network layers.
 	(added in linux 3.3)
+	Seting negative value is meaningless and will retrun error.
+	Default: 65536 Bytes(64KB)
 
 neigh/default/unres_qlen - INTEGER
 	The maximum number of packets which may be queued for each
 	unresolved address by other network layers.
 	(deprecated in linux 3.3) : use unres_qlen_bytes instead.
+	Prior to linux 3.3, the default value is 3 which may cause
+	secluded packet loss. The current default value is calculated
+	according to default value of unres_qlen_bytes and true size of
+	packet.
+	Default: 31
 
 mtu_expires - INTEGER
 	Time, in seconds, that cached PMTU information is kept.
-- 
1.7.1

^ permalink raw reply related

* Re: [Suggestion] net/atm : for sprintf, need check the total write length whether larger than a page.
From: Chen Gang @ 2012-12-05  4:56 UTC (permalink / raw)
  To: Chas Williams (CONTRACTOR); +Cc: David Miller, netdev
In-Reply-To: <201212050357.qB53vHvT022706@thirdoffive.cmf.nrl.navy.mil>

于 2012年12月05日 11:57, Chas Williams (CONTRACTOR) 写道:
> In message <50BEA2CB.9000800@asianux.com>,Chen Gang writes:
>>> -	for (i = 0; i < (ESI_LEN - 1); i++)
>>> -		pos += sprintf(pos, "%02x:", adev->esi[i]);
>>> -	pos += sprintf(pos, "%02x\n", adev->esi[i]);
>>>  
>>> -	return pos - buf;
>>> +	return scnprintf(buf, PAGE_SIZE, "%pM\n", adev->esi);
>>>  }
>>>  
>>
>>  "%p" seems print a pointer, not contents of pointer (is it correct ?)
>>  will it change the original display format to outside ?
> 
> %pM means format this pointer as a mac address.  it didnt exist when the
> atm stack was originally written but can be used now to save a bit of
> messy code.
> 

  it is my fault. thank you

  :-)

>>> -		pos += sprintf(pos, "\n");
>>> +		count += scnprintf(buf + count, PAGE_SIZE - count, "\n");
> ..
>>  need we judge whether count >= PAGE_SIZE ?
> 
> count will eventually make PAGE_SIZE - count reach 0 at which point,
> scnprintf() won't be able to write into the buffer.

  I also think so.

  I think, maybe it will be better to break the loop when we already
know that "count >= PAGE_SIZE" (it can save waste looping, although it
seems unlikly happen, for example, using unlikly(...) ).

By the way:
  will it be better that always let "\n" at the end ?
  (if count == PAGE_SIZE in a loop, we can not let "\n" at the end).

  I think what I said above are minor, if you think, for this patch, do
not need consider them, it is ok (at least for me, it is true).

  :-)

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [Suggestion] net/atm : for sprintf, need check the total write length whether larger than a page.
From: Chen Gang @ 2012-12-05  5:40 UTC (permalink / raw)
  To: Chas Williams (CONTRACTOR); +Cc: David Miller, netdev
In-Reply-To: <50BED40D.9080100@asianux.com>

于 2012年12月05日 12:56, Chen Gang 写道:
>>>> >>> -		pos += sprintf(pos, "\n");
>>>> >>> +		count += scnprintf(buf + count, PAGE_SIZE - count, "\n");
>> > ..
>>> >>  need we judge whether count >= PAGE_SIZE ?
>> > 
>> > count will eventually make PAGE_SIZE - count reach 0 at which point,
>> > scnprintf() won't be able to write into the buffer.
>   I also think so.
> 
>   I think, maybe it will be better to break the loop when we already
> know that "count >= PAGE_SIZE" (it can save waste looping, although it
> seems unlikly happen, for example, using unlikly(...) ).
> 
> By the way:
>   will it be better that always let "\n" at the end ?
>   (if count == PAGE_SIZE in a loop, we can not let "\n" at the end).

   oh, sorry ! count will never >= PAGE_SIZE.

   I think let "PAGE_SIZE - 2" instead of "PAGE_SIZE" in the loop, so we
can make the room for the end of "\n".

-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [Patch net-next] xfrm: add missing xfrm message types to selinux perm table
From: Steffen Klassert @ 2012-12-05  5:41 UTC (permalink / raw)
  To: David Miller; +Cc: amwang, netdev, herbert
In-Reply-To: <20121204.130257.1178992077066719400.davem@davemloft.net>

On Tue, Dec 04, 2012 at 01:02:57PM -0500, David Miller wrote:
> From: Cong Wang <amwang@redhat.com>
> Date: Tue,  4 Dec 2012 13:39:31 +0800
> 
> > Cc: Steffen Klassert <steffen.klassert@secunet.com>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: "David S. Miller" <davem@davemloft.net>
> > Signed-off-by: Cong Wang <amwang@redhat.com>
> 
> Steffen will you take this into your IPSEC tree?
> 

I would do so, but I have not seen this patch by now.
Was it really send to me? I have not even seen it on the
netdev list.

Anyway, I don't have it, please resend.

Thanks.

^ permalink raw reply

* Re: [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Jason Wang @ 2012-12-05  5:44 UTC (permalink / raw)
  To: Paul Moore; +Cc: netdev, linux-security-module, selinux, mst
In-Reply-To: <8577392.82G063LYx2@sifl>

On Tuesday, December 04, 2012 11:18:57 AM Paul Moore wrote:
> On Tuesday, December 04, 2012 09:24:43 PM Jason Wang wrote:
> > On Monday, December 03, 2012 11:22:29 AM Paul Moore wrote:
> > > It may be that I'm misunderstanding TUNSETQUEUE and/or TUNSETIFF.  Can
> > > you
> > > elaborate as to why they should be different?
> > 
> > If I understand correctly, before multiqueue patchset, TUNSETIFF is used
> > to:
> > 
> > 1) Create the tun/tap network device
> > 2) For persistent device, re-attach the fd to the network device / socket.
> > In this case, we call selinux_tun_dev_attch() to relabel the socket sid
> > (in
> > fact also the device's since the socket were persistent also) to the sid
> > of
> > process that calls TUNSETIFF.
> > 
> > So, after the changes of multiqueue, we need try to preserve those policy.
> > The interesting part is the introducing of TUNSETQUEUE, it's used to
> > attach
> > more file descriptors/sockets to a tun/tap device after at least one file
> > descriptor were attached to the tun/tap device through TUNSETIFF. So I
> > think maybe we need differ those two ioctls. This patch looks fine for
> > TUNSETQUEUE, but for TUNSETIFF, we need relabel the tunsec to the process
> > that calling TUNSETIFF for persistent device?
> 
> Okay, based on your explanation of TUNSETQUEUE, the steps below are what I
> believe we need to do ... if you disagree speak up quickly please.
> 
> A. TUNSETIFF (new, non-persistent device)
> 
> [Allocate and initialize the tun_struct LSM state based on the calling
> process, use this state to label the TUN socket.]
> 
> 1. Call security_tun_dev_create() which authorizes the action.
> 2. Call security_tun_dev_alloc_security() which allocates the tun_struct LSM
> blob and SELinux sets some internal blob state to record the label of the
> calling process.
> 3. Call security_tun_dev_attach() which sets the label of the TUN socket to
> match the label stored in the tun_struct LSM blob during A2.  No
> authorization is done at this point since the socket is new/unlabeled.
> 
> B. TUNSETIFF (existing, persistent device)
> 
> [Relabel the existing tun_struct LSM state based on the calling process, use
> this state to label the TUN socket.]
> 
> 1. Attempt to relabel/reset the tun_struct LSM blob from the currently
> stored value, set during A2, to the label of the current calling process.
> *** THIS IS NOT CURRENTLY DONE IN THE RFC PATCH ***
> 2. Call security_tun_dev_attach() which sets the label of the TUN socket to
> match the label stored in the tun_struct LSM blob during B1. No
> authorization is done at this point since the socket is new/unlabeled.
> 
> C. TUNSETQUEUE
> 
> [Use the existing tun_struct LSM state to label the new TUN socket.]
> 
> 1. Call security_tun_dev_attach() which sets the label of the TUN socket to
> match the label stored in the tun_struct LSM blob set during either A2 or
> B1. No authorization is done at this point since the socket is
> new/unlabeled.

This looks fine to me.
> > btw. Current code does allow calling TUNSETQUEUE to a persistent tun/tap
> > device with no file attached. It should be a bug and need to be fixed.
> 
> Since you wrote that code will you be submitting a patch to fix that
> problem?

Yes, I will fix it.
> > > One thing that I think we probably should change is the relabelto/from
> > > permissions in the function above (selinux_tun_dev_attach()); in the
> > > case
> > > where the socket does not yet have a label, e.g. 'sksec->sid == 0', we
> > > should probably skip the relabel permissions since we want to assign the
> > > TUN device label regardless in this case.
> > 
> > I'm not familiar with the selinux, have a quick glance of the code, looks
> > like the label has been initialized to SECINITSID_KERNEL in
> > selinux_socket_post_create().
> 
> Unless I've missed something in your changes, the multiqueue code never
> calls any socket code which ends up calling
> {security,selinux}_socket_post_create(); I believe you only call sk_alloc()
> which ends up calling
> {security,selinux}_sk_alloc() which sets SECINITSID_UNLABELED (I mistakenly
> wrote 0 instead in my earlier email which is techincally SECSID_NULL). 
> Either way, I still think the logic I originally described above is
> correct.

Yes, I was wrong. Thanks for the checking.



^ permalink raw reply

* Re: [Suggestion] net/atm : for sprintf, need check the total write length whether larger than a page.
From: Chen Gang @ 2012-12-05  5:59 UTC (permalink / raw)
  To: Chas Williams (CONTRACTOR); +Cc: David Miller, netdev
In-Reply-To: <50BEDE4E.8010408@asianux.com>

于 2012年12月05日 13:40, Chen Gang 写道:
> 于 2012年12月05日 12:56, Chen Gang 写道:
>>>>>>>> -		pos += sprintf(pos, "\n");
>>>>>>>> +		count += scnprintf(buf + count, PAGE_SIZE - count, "\n");
>>>> ..
>>>>>>  need we judge whether count >= PAGE_SIZE ?
>>>>
>>>> count will eventually make PAGE_SIZE - count reach 0 at which point,
>>>> scnprintf() won't be able to write into the buffer.
>>   I also think so.
>>
>>   I think, maybe it will be better to break the loop when we already
>> know that "count >= PAGE_SIZE" (it can save waste looping, although it
>> seems unlikly happen, for example, using unlikly(...) ).
>>
>> By the way:
>>   will it be better that always let "\n" at the end ?
>>   (if count == PAGE_SIZE in a loop, we can not let "\n" at the end).
> 
>    oh, sorry ! count will never >= PAGE_SIZE.
> 
>    I think let "PAGE_SIZE - 2" instead of "PAGE_SIZE" in the loop, so we
> can make the room for the end of "\n".
> 
> 
> 
   sorry, "PAGE_SIZE - 1" is enough, not need "PAGE_SIZE - 2".


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* Re: [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Jason Wang @ 2012-12-05  6:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Paul Moore, netdev, linux-security-module, selinux
In-Reply-To: <20121204152420.GJ7499@redhat.com>

On 12/04/2012 11:24 PM, Michael S. Tsirkin wrote:
> On Tue, Dec 04, 2012 at 09:24:43PM +0800, Jason Wang wrote:
>> On Monday, December 03, 2012 11:22:29 AM Paul Moore wrote:
>>> On Monday, December 03, 2012 06:15:42 PM Jason Wang wrote:
>>>> On 11/30/2012 06:06 AM, Paul Moore wrote:
>>>>> This patch corrects some problems with LSM/SELinux that were introduced
>>>>> with the multiqueue patchset.  The problem stems from the fact that the
>>>>> multiqueue work changed the relationship between the tun device and its
>>>>> associated socket; before the socket persisted for the life of the
>>>>> device, however after the multiqueue changes the socket only persisted
>>>>> for the life of the userspace connection (fd open).  For non-persistent
>>>>> devices this is not an issue, but for persistent devices this can cause
>>>>> the tun device to lose its SELinux label.
>>>>>
>>>>> We correct this problem by adding an opaque LSM security blob to the
>>>>> tun device struct which allows us to have the LSM security state, e.g.
>>>>> SELinux labeling information, persist for the lifetime of the tun
>>>>> device.
>>> ...
>>>
>>>>> -static int selinux_tun_dev_attach(struct sock *sk)
>>>>> +static int selinux_tun_dev_attach(struct sock *sk, void *security)
>>>>>
>>>>>  {
>>>>>
>>>>> +	struct tun_security_struct *tunsec = security;
>>>>>
>>>>>  	struct sk_security_struct *sksec = sk->sk_security;
>>>>>  	u32 sid = current_sid();
>>>>>  	int err;
>>>>>
>>>>> +	/* we don't currently perform any NetLabel based labeling here ...
>>>>>
>>>>>  	err = avc_has_perm(sid, sksec->sid, SECCLASS_TUN_SOCKET,
>>>>>  	
>>>>>  			   TUN_SOCKET__RELABELFROM, NULL);
>>>>>  	
>>>>>  	if (err)
>>>>>  	
>>>>>  		return err;
>>>>>
>>>>> -	err = avc_has_perm(sid, sid, SECCLASS_TUN_SOCKET,
>>>>> +	err = avc_has_perm(sid, tunsec->sid, SECCLASS_TUN_SOCKET,
>>>>>
>>>>>  			   TUN_SOCKET__RELABELTO, NULL);
>>>>>  	
>>>>>  	if (err)
>>>>>  	
>>>>>  		return err;
>>>>>
>>>>> -	sksec->sid = sid;
>>>>> +	sksec->sid = tunsec->sid;
>>>>> +	sksec->sclass = SECCLASS_TUN_SOCKET;
>>>> I'm not sure whether this is correct, looks like we need to differ between
>>>> TUNSETQUEUE and TUNSETIFF. When userspace call TUNSETIFF for persistent
>>>> device, looks like we need change the sid of tunsec like in the past.
>>> It may be that I'm misunderstanding TUNSETQUEUE and/or TUNSETIFF.  Can you
>>> elaborate as to why they should be different?
>> If I understand correctly, before multiqueue patchset, TUNSETIFF is used to:
>>
>> 1) Create the tun/tap network device
>> 2) For persistent device, re-attach the fd to the network device / socket. In 
>> this case, we call selinux_tun_dev_attch() to relabel the socket sid (in fact 
>> also the device's since the socket were persistent also) to the sid of process 
>> that calls TUNSETIFF.
>>
>> So, after the changes of multiqueue, we need try to preserve those policy. The 
>> interesting part is the introducing of TUNSETQUEUE, it's used to attach more 
>> file descriptors/sockets to a tun/tap device after at least one file descriptor 
>> were attached to the tun/tap device through TUNSETIFF. So I think maybe we 
>> need differ those two ioctls. This patch looks fine for TUNSETQUEUE, but for 
>> TUNSETIFF, we need relabel the tunsec to the process that calling TUNSETIFF 
>> for persistent device?
> Basically, it looks like currently once you get a tun fd,
> you can attach it to any device even if normally
> selinux would prevent you from accessing it.

Yes some checking during TUNSETQUEUE is missed.
> If we reuse selinux_tun_dev_attach, we won't need to
> change selinux policy, with a new capability we will need to change it
> to allow libvirt to do TUNSETQUEUE.
>

Also needed for qemu too since it may call TUNSETQUEUE when guest wants
to change the number of queues.
>> btw. Current code does allow calling TUNSETQUEUE to a persistent tun/tap 
>> device with no file attached. It should be a bug and need to be fixed.
> Is this a problem? You can always
> attach
> set queue
> detach
>
> and it would be hard to prevent this ...

Currently, the following steps is allowed:

1. fd1 = open("/dev/net/tun");
2. tunsetiff(fd1, "tap0");
3. tunsetpersistent("tap0");
4. close(fd1);
5. fd2 = open("/dev/net/tun");
6. tunsetqueue(fd2, "tap0);

Looks like step 6 should be forbidden since:

- no fd/sockets were attached to the device, we need use TUNSETIFF
instead to keep the API as we used do in single queue tun
- we need update the security information in tun_struct just like what
we discussed in this mail
- it may also miss checks in TUNSETIFF

>>> One thing that I think we probably should change is the relabelto/from
>>> permissions in the function above (selinux_tun_dev_attach()); in the case
>>> where the socket does not yet have a label, e.g. 'sksec->sid == 0', we
>>> should probably skip the relabel permissions since we want to assign the
>>> TUN device label regardless in this case.
>> I'm not familiar with the selinux, have a quick glance of the code, looks like 
>> the label has been initialized to SECINITSID_KERNEL in 
>> selinux_socket_post_create().
>>
>> Thanks

^ permalink raw reply

* Re: [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Jason Wang @ 2012-12-05  6:19 UTC (permalink / raw)
  To: Paul Moore; +Cc: Michael S. Tsirkin, netdev, linux-security-module, selinux
In-Reply-To: <338664993.8Hd16PoY2S@sifl>

On 12/05/2012 02:17 AM, Paul Moore wrote:
> On Tuesday, December 04, 2012 07:36:26 PM Michael S. Tsirkin wrote:
>> On Tue, Dec 04, 2012 at 11:18:57AM -0500, Paul Moore wrote:
>>> Okay, based on your explanation of TUNSETQUEUE, the steps below are what I
>>> believe we need to do ... if you disagree speak up quickly please.
>>>
>>> A. TUNSETIFF (new, non-persistent device)
>>>
>>> [Allocate and initialize the tun_struct LSM state based on the calling
>>> process, use this state to label the TUN socket.]
>>>
>>> 1. Call security_tun_dev_create() which authorizes the action.
>>> 2. Call security_tun_dev_alloc_security() which allocates the tun_struct
>>> LSM blob and SELinux sets some internal blob state to record the label of
>>> the calling process.
>>> 3. Call security_tun_dev_attach() which sets the label of the TUN socket
>>> to match the label stored in the tun_struct LSM blob during A2.  No
>>> authorization is done at this point since the socket is new/unlabeled.
>>>
>>> B. TUNSETIFF (existing, persistent device)
>>>
>>> [Relabel the existing tun_struct LSM state based on the calling process,
>>> use this state to label the TUN socket.]
>>>
>>> 1. Attempt to relabel/reset the tun_struct LSM blob from the currently
>>> stored value, set during A2, to the label of the current calling process.
>>> *** THIS IS NOT CURRENTLY DONE IN THE RFC PATCH ***
>>> 2. Call security_tun_dev_attach() which sets the label of the TUN socket
>>> to match the label stored in the tun_struct LSM blob during B1. No
>>> authorization is done at this point since the socket is new/unlabeled.
>>>
>>> C. TUNSETQUEUE
>>>
>>> [Use the existing tun_struct LSM state to label the new TUN socket.]
>>>
>>> 1. Call security_tun_dev_attach() which sets the label of the TUN socket
>>> to match the label stored in the tun_struct LSM blob set during either A2
>>> or B1. No authorization is done at this point since the socket is
>>> new/unlabeled.
>> Here's what bothers me. libvirt currently opens tun and passes
>> fd to qemu. What would prevent qemu from attaching fd using TUNSETQUEUE
>> to another device it does not own?
> True, assuming all the above is correct and that I'm understanding it 
> correctly (Jason?), we should probably add a new SELinux access control for 
> TUNSETQUEUE.

Yes, we need make sure qemu can call TUNSETQUEUE for the device it does
not own.
>
> The current DAC code exists in tun_not_capable().
>

^ permalink raw reply

* Re: [PATCH net-next 2/3] virtio_net: multiqueue support
From: Jason Wang @ 2012-12-05  6:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: krkumar2, kvm, netdev, linux-kernel, virtualization, bhutchings,
	jwhan, davem, shiyer
In-Reply-To: <20121204151108.GI7499@redhat.com>

On 12/04/2012 11:11 PM, Michael S. Tsirkin wrote:
> On Tue, Dec 04, 2012 at 10:45:33PM +0800, Jason Wang wrote:
>> On Tuesday, December 04, 2012 03:24:22 PM Michael S. Tsirkin wrote:
>>> I found some bugs, see below.
>>> Also some style nitpicking, this is not mandatory to address.
>> Thanks for the reviewing.
>>> On Tue, Dec 04, 2012 at 07:07:57PM +0800, Jason Wang wrote:
>>>
[...]
>>>> +		set = false;
>>> This will overwrite affinity if it was set by userspace.
>>> Just
>>> 	if (set)
>>> 		return;
>>> will not have this problem.
>> But we need handle the situtaiton when switch back to sq from mq mode. 
>> Otherwise we may still get the affinity hint used in mq.
>>  This kind of overwrite 
>> is unavoidable or is there some method to detect whether userspac write 
>> something new?
> If we didn't set the affinity originally we should not overwrite it.
> I think this means we need a flag that tells us that
> virtio set the affinity.

Ok.

[...]
>>>> +
>>>> +	/* Parameters for control virtqueue, if any */
>>>> +	if (vi->has_cvq) {
>>>> +		callbacks[total_vqs - 1] = NULL;
>>>> +		names[total_vqs - 1] = kasprintf(GFP_KERNEL, "control");
>>>> +	}
>>>>
>>>> +	/* Allocate/initialize parameters for send/receive virtqueues */
>>>> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>>>> +		callbacks[rxq2vq(i)] = skb_recv_done;
>>>> +		callbacks[txq2vq(i)] = skb_xmit_done;
>>>> +		names[rxq2vq(i)] = kasprintf(GFP_KERNEL, "input.%d", i);
>>>> +		names[txq2vq(i)] = kasprintf(GFP_KERNEL, "output.%d", i);
>>>> +	}
>>> We would need to check kasprintf return value.
>> Looks like a better method is to make the name as a memeber of receive_queue 
>> and send_queue, and use sprintf here.
>>> Also if you allocate names from slab we'll need to free them
>>> later.
>> Then it could be freed when the send_queue and receive_queue is freed.
>>> It's probably easier to just use fixed names for now -
>>> it's not like the index is really useful.
>> Looks useful for debugging e.g. check whether the irq distribution is as 
>> expected.
> Well it doesn't really matter which one goes where, right?
> As long as interrupts are distributed well.

Yes, anyway, we decide to store the name in the send/receive queue, so I
will keep the index.
>
>>>> +
>>>> +	ret = vi->vdev->config->find_vqs(vi->vdev, total_vqs, vqs, callbacks,
>>>> +					 (const char **)names);
>>> Please avoid casts, use a proper type for names.
>> I'm consider we need a minor change in this api, we need allocate the names 
>> dynamically which could not be a const char **.
> I don't see why. Any use that allocates on the fly as
> you did would leak memory. Any use like you suggest
> e.g. allocating as part of send/receive structure
> would be fine.

True
>>>> +	if (ret)
>>>> +		goto err_names;
>>>> +
>>>> +	if (vi->has_cvq) {
>>>> +		vi->cvq = vqs[total_vqs - 1];
>>>>
>>>>  		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
>>>>  		
>>>>  			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
>>>>  	
>>>>  	}
>>>>
>>>> +
>>>> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>>>> +		vi->rq[i].vq = vqs[rxq2vq(i)];
>>>> +		vi->sq[i].vq = vqs[txq2vq(i)];
>>>> +	}
>>>> +
>>>> +	kfree(callbacks);
>>>> +	kfree(vqs);
>>> Who frees names if there's no error?
>>>
>> The virtio core does not copy the name, so it need this and only used for 
>> debugging if I'm reading the code correctly.
> No, virtio core does not free either individual vq name or the names
> array passed in. So this leaks memory.

Yes, so when we use the names in receive/send queue, it can be freed
during queue destroying.

[...]
> @@ -1276,24 +1531,29 @@ static int virtnet_freeze(struct virtio_device
> *vdev)> 
>  static int virtnet_restore(struct virtio_device *vdev)
>  {
>  
>  	struct virtnet_info *vi = vdev->priv;
>
> -	int err;
> +	int err, i;
>
>  	err = init_vqs(vi);
>  	if (err)
>  	
>  		return err;
>  	
>  	if (netif_running(vi->dev))
>
> -		virtnet_napi_enable(&vi->rq);
> +		for (i = 0; i < vi->max_queue_pairs; i++)
> +			virtnet_napi_enable(&vi->rq[i]);
>
>  	netif_device_attach(vi->dev);
>
> -	if (!try_fill_recv(&vi->rq, GFP_KERNEL))
> -		schedule_delayed_work(&vi->refill, 0);
> +	for (i = 0; i < vi->max_queue_pairs; i++)
> +		if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
> +			schedule_delayed_work(&vi->refill, 0);
>
>  	mutex_lock(&vi->config_lock);
>  	vi->config_enable = true;
>  	mutex_unlock(&vi->config_lock);
>
> +	if (vi->has_cvq && virtio_has_feature(vi->vdev, VIRTIO_NET_F_RFS))
> +		virtnet_set_queues(vi);
> +
>>> I think it's easier to test
>>> if (curr_queue_pairs == max_queue_pairs)
>>> within virtnet_set_queues and make it
>>> a NOP if so.
>> Still need to send the command during restore since we reset the device during 
>> freezing.
>
> Then maybe check vi->has_cvq && virtio_has_feature(vi->vdev,
> VIRTIO_NET_F_RFS) in there?

Right.
>
>>>>  
[...]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox