Netdev List

* [patch net-next 2/4] net: allow to change carrier via sysfs
From: Jiri Pirko @ 2012-12-12 10:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl
In-Reply-To: <1355309887-1081-1-git-send-email-jiri@resnulli.us>

Make carrier writable

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 net/core/net-sysfs.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 334efd5..7eda40a 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -126,6 +126,19 @@ static ssize_t show_broadcast(struct device *dev,
 	return -EINVAL;
 }
 
+static int change_carrier(struct net_device *net, unsigned long new_carrier)
+{
+	if (!netif_running(net))
+		return -EINVAL;
+	return dev_change_carrier(net, (bool) new_carrier);
+}
+
+static ssize_t store_carrier(struct device *dev, struct device_attribute *attr,
+			 const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, change_carrier);
+}
+
 static ssize_t show_carrier(struct device *dev,
 			    struct device_attribute *attr, char *buf)
 {
@@ -331,7 +344,7 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(link_mode, S_IRUGO, show_link_mode, NULL),
 	__ATTR(address, S_IRUGO, show_address, NULL),
 	__ATTR(broadcast, S_IRUGO, show_broadcast, NULL),
-	__ATTR(carrier, S_IRUGO, show_carrier, NULL),
+	__ATTR(carrier, S_IRUGO | S_IWUSR, show_carrier, store_carrier),
 	__ATTR(speed, S_IRUGO, show_speed, NULL),
 	__ATTR(duplex, S_IRUGO, show_duplex, NULL),
 	__ATTR(dormant, S_IRUGO, show_dormant, NULL),
-- 
1.8.0
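
For illustration, userspace could drive the writable attribute from this patch roughly as follows. This is only a sketch: write_carrier() is a hypothetical helper name, and it assumes the attribute accepts "0" or "1" the way the other netdev_store()-based attributes do.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" or "0" to a sysfs-style carrier attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int write_carrier(const char *path, int carrier_on)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(carrier_on ? "1\n" : "0\n", f) >= 0;
	/* fclose() flushes the buffer, so a rejected write shows up here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would pass something like "/sys/class/net/dummy0/carrier" (the interface name is just an example). With the patch above, the write is expected to fail with EINVAL while the device is not running.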

* [patch net-next 3/4] rtnl: expose carrier value with possibility to set it
From: Jiri Pirko @ 2012-12-12 10:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl
In-Reply-To: <1355309887-1081-1-git-send-email-jiri@resnulli.us>

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 60f3b6b..c4edfe1 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -142,6 +142,7 @@ enum {
 #define IFLA_PROMISCUITY IFLA_PROMISCUITY
 	IFLA_NUM_TX_QUEUES,
 	IFLA_NUM_RX_QUEUES,
+	IFLA_CARRIER,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1868625..2ef7a56 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -780,6 +780,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_MTU */
 	       + nla_total_size(4) /* IFLA_LINK */
 	       + nla_total_size(4) /* IFLA_MASTER */
+	       + nla_total_size(1) /* IFLA_CARRIER */
 	       + nla_total_size(4) /* IFLA_PROMISCUITY */
 	       + nla_total_size(4) /* IFLA_NUM_TX_QUEUES */
 	       + nla_total_size(4) /* IFLA_NUM_RX_QUEUES */
@@ -909,6 +910,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	     nla_put_u32(skb, IFLA_LINK, dev->iflink)) ||
 	    (dev->master &&
 	     nla_put_u32(skb, IFLA_MASTER, dev->master->ifindex)) ||
+	    nla_put_u8(skb, IFLA_CARRIER, netif_carrier_ok(dev)) ||
 	    (dev->qdisc &&
 	     nla_put_string(skb, IFLA_QDISC, dev->qdisc->ops->id)) ||
 	    (dev->ifalias &&
@@ -1108,6 +1110,7 @@ const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_MTU]		= { .type = NLA_U32 },
 	[IFLA_LINK]		= { .type = NLA_U32 },
 	[IFLA_MASTER]		= { .type = NLA_U32 },
+	[IFLA_CARRIER]		= { .type = NLA_U8 },
 	[IFLA_TXQLEN]		= { .type = NLA_U32 },
 	[IFLA_WEIGHT]		= { .type = NLA_U32 },
 	[IFLA_OPERSTATE]	= { .type = NLA_U8 },
@@ -1438,6 +1441,13 @@ static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
 		modified = 1;
 	}
 
+	if (tb[IFLA_CARRIER]) {
+		err = dev_change_carrier(dev, nla_get_u8(tb[IFLA_CARRIER]));
+		if (err)
+			goto errout;
+		modified = 1;
+	}
+
 	if (tb[IFLA_TXQLEN])
 		dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
 
-- 
1.8.0
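
As a rough sketch of how userspace might use the new attribute, the code below builds an RTM_SETLINK request carrying a one-byte IFLA_CARRIER. The names carrier_req and build_carrier_req() are hypothetical, and the sketch assumes uapi headers that already define IFLA_CARRIER (that is, headers from a kernel with this patch applied). Opening the NETLINK_ROUTE socket and sending the message are left out.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_link.h>
#include <linux/rtnetlink.h>

/* Request buffer: netlink header, ifinfomsg, then room for one attribute. */
struct carrier_req {
	struct nlmsghdr nh;
	struct ifinfomsg ifi;
	char attrs[16];
};

/*
 * Fill *req with an RTM_SETLINK request that carries IFLA_CARRIER.
 * Returns the total message length to hand to send().
 */
static unsigned int build_carrier_req(struct carrier_req *req, int ifindex,
				      unsigned char carrier)
{
	struct rtattr *rta;

	assert(req != NULL);
	memset(req, 0, sizeof(*req));

	req->nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
	req->nh.nlmsg_type = RTM_SETLINK;
	req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	req->ifi.ifi_family = AF_UNSPEC;
	req->ifi.ifi_index = ifindex;

	/* The attribute starts right after the aligned fixed headers. */
	rta = (struct rtattr *)((char *)req + NLMSG_ALIGN(req->nh.nlmsg_len));
	rta->rta_type = IFLA_CARRIER;
	rta->rta_len = RTA_LENGTH(sizeof(carrier));
	memcpy(RTA_DATA(rta), &carrier, sizeof(carrier));

	req->nh.nlmsg_len = NLMSG_ALIGN(req->nh.nlmsg_len) +
			    RTA_ALIGN(rta->rta_len);
	return req->nh.nlmsg_len;
}
```

The one-byte payload is padded to a four-byte boundary, which matches the nla_total_size(1) accounting added to if_nlmsg_size() in the patch.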

* [patch net-next 4/4] dummy: implement carrier change
From: Jiri Pirko @ 2012-12-12 10:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, bhutchings, mirqus, shemminger, greearb, fbl
In-Reply-To: <1355309887-1081-1-git-send-email-jiri@resnulli.us>

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/dummy.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index c260af5..42aa54a 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -100,6 +100,15 @@ static void dummy_dev_uninit(struct net_device *dev)
 	free_percpu(dev->dstats);
 }
 
+static int dummy_change_carrier(struct net_device *dev, bool new_carrier)
+{
+	if (new_carrier)
+		netif_carrier_on(dev);
+	else
+		netif_carrier_off(dev);
+	return 0;
+}
+
 static const struct net_device_ops dummy_netdev_ops = {
 	.ndo_init		= dummy_dev_init,
 	.ndo_uninit		= dummy_dev_uninit,
@@ -108,6 +117,7 @@ static const struct net_device_ops dummy_netdev_ops = {
 	.ndo_set_rx_mode	= set_multicast_list,
 	.ndo_set_mac_address	= eth_mac_addr,
 	.ndo_get_stats64	= dummy_get_stats64,
+	.ndo_change_carrier	= dummy_change_carrier,
 };
 
 static void dummy_setup(struct net_device *dev)
-- 
1.8.0

* Re: [PATCH] ipv6: fix the bug when propagating Redirect Message
From: Duan Jiong @ 2012-12-12 11:09 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: davem, netdev
In-Reply-To: <20121211134514.GE18940@secunet.com>

On 2012/12/11 21:45, Steffen Klassert wrote:
> On Tue, Dec 11, 2012 at 08:58:20PM +0800, Duan Jiong wrote:
>>
>> As you suggested, I tried to use ndisc_parse_options() instead
>> of the loop, but I found that skb->data can't be changed inside
>> ndisc_parse_options() because it lacks the necessary arguments. So I
>> think it is better to keep using the loop. What do you think?
>>
> 
> You can change the data pointer after ndisc_parse_options().
> Something like the (untested) patch below should do it.
> 
>  include/net/ndisc.h |    7 +++++++
>  net/ipv6/ndisc.c    |   20 ++++++++++++++++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/include/net/ndisc.h b/include/net/ndisc.h
> index 980d263..c17bccd 100644
> --- a/include/net/ndisc.h
> +++ b/include/net/ndisc.h
> @@ -78,6 +78,13 @@ struct ra_msg {
>  	__be32			retrans_timer;
>  };
>  
> +struct rd_msg {
> +        struct icmp6hdr	icmph;
> +        struct in6_addr	target;
> +        struct in6_addr	dest;
> +	__u8		opt[0];
> +};
> +
>  struct nd_opt_hdr {
>  	__u8		nd_opt_type;
>  	__u8		nd_opt_len;
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index 2edce30..9afd23f 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -1333,6 +1333,12 @@ out:
>  
>  static void ndisc_redirect_rcv(struct sk_buff *skb)
>  {
> +	u8 *hdr;
> +	struct ndisc_options ndopts;
> +	struct rd_msg *msg = (struct rd_msg *) skb_transport_header(skb);
> +	u32 ndoptlen = skb->tail - (skb->transport_header +
> +				    offsetof(struct rd_msg, opt));
> +
>  #ifdef CONFIG_IPV6_NDISC_NODETYPE
>  	switch (skb->ndisc_nodetype) {
>  	case NDISC_NODETYPE_HOST:
> @@ -1349,6 +1355,20 @@ static void ndisc_redirect_rcv(struct sk_buff *skb)
>  		return;
>  	}
>  
> +	if (!ndisc_parse_options(msg->opt, ndoptlen, &ndopts)) {
> +		ND_PRINTK(2, warn, "Redirect: invalid ND options\n");
> +		return;
> +	}
> +
> +	if (!ndopts.nd_opts_rh)
> +		return;
> +
> +	hdr = (u8 *) ndopts.nd_opts_rh;
> +	hdr += 8;
> +
> +	if (!pskb_pull(skb, hdr - skb_transport_header(skb)))
> +		return;
> +
>  	icmpv6_notify(skb, NDISC_REDIRECT, 0, 0);
>  }
>  
> 
Thanks for your help. I will test it.
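
For reference, the option walk that ndisc_parse_options() performs before the data pointer is advanced can be sketched in plain C as below. This is a userspace illustration only, not the kernel code: find_nd_option() is a hypothetical helper, and the option type value 4 for the Redirected Header option is taken from RFC 4861.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ND_OPT_REDIRECT_HDR 4	/* Redirected Header option type (RFC 4861) */

/*
 * Walk a buffer of ND options (type, length in 8-octet units, data...)
 * and return a pointer to the first option of the given type, or NULL
 * if it is absent or the option list is malformed.
 */
static const uint8_t *find_nd_option(const uint8_t *opt, size_t len,
				     uint8_t type)
{
	assert(opt != NULL);
	while (len >= 2) {
		size_t optlen = (size_t)opt[1] * 8;

		if (optlen == 0 || optlen > len)
			return NULL;	/* zero-length or truncated option */
		if (opt[0] == type)
			return opt;
		opt += optlen;
		len -= optlen;
	}
	return NULL;
}
```

The Redirected Header option carries 8 octets of type, length and reserved fields before the embedded packet, which is why the patch above advances the pointer by 8 before calling pskb_pull().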

* [PATCH] iproute2: fix tc ematch manpage section
From: Andreas Henriksson @ 2012-12-12 11:23 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

The Debian package checking tool, lintian, spotted that the
tc ematch manpage seems to have an error in the specified section.

Signed-off-by: Andreas Henriksson <andreas@fatal.se>

diff --git a/man/man8/tc-ematch.8 b/man/man8/tc-ematch.8
index 2eafc29..957a22e 100644
--- a/man/man8/tc-ematch.8
+++ b/man/man8/tc-ematch.8
@@ -1,4 +1,4 @@
-.TH filter ematch "6 August 2012" iproute2 Linux
+.TH ematch 8 "6 August 2012" iproute2 Linux
 .
 .SH NAME
 ematch \- extended matches for use with "basic" or "flow" filters

* [PATCH V1 net-next 2/3] net/mlx4_en: Use generic etherdevice.h functions.
From: Amir Vadai @ 2012-12-12 12:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman
In-Reply-To: <1355314400-14909-1-git-send-email-amirv@mellanox.com>

From: Yan Burman <yanb@mellanox.com>

Get rid of full_mac, zero_mac in favour of
is_zero_ether_addr and is_broadcast_ether_addr.

Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 4aaa7c3..cc7bb25 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -613,8 +613,6 @@ static int mlx4_en_validate_flow(struct net_device *dev,
 	struct ethtool_usrip4_spec *l3_mask;
 	struct ethtool_tcpip4_spec *l4_mask;
 	struct ethhdr *eth_mask;
-	u64 full_mac = ~0ull;
-	u64 zero_mac = 0;
 
 	if (cmd->fs.location >= MAX_NUM_OF_FS_RULES)
 		return -EINVAL;
@@ -644,11 +642,11 @@ static int mlx4_en_validate_flow(struct net_device *dev,
 	case ETHER_FLOW:
 		eth_mask = &cmd->fs.m_u.ether_spec;
 		/* source mac mask must not be set */
-		if (memcmp(eth_mask->h_source, &zero_mac, ETH_ALEN))
+		if (!is_zero_ether_addr(eth_mask->h_source))
 			return -EINVAL;
 
 		/* dest mac mask must be ff:ff:ff:ff:ff:ff */
-		if (memcmp(eth_mask->h_dest, &full_mac, ETH_ALEN))
+		if (!is_broadcast_ether_addr(eth_mask->h_dest))
 			return -EINVAL;
 
 		if (!all_zeros_or_all_ones(eth_mask->h_proto))
-- 
1.7.11.3

* [PATCH V1 net-next 1/3] net: ethtool: Add destination MAC address to flow steering API
From: Amir Vadai @ 2012-12-12 12:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman
In-Reply-To: <1355314400-14909-1-git-send-email-amirv@mellanox.com>

From: Yan Burman <yanb@mellanox.com>

Add the ability to specify a destination MAC address in an L3/L4 flow spec,
so that actions can be specified for different VMs in a vSwitch
configuration. This change is transparent to older userspace.

Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 include/uapi/linux/ethtool.h | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index d3eaaaf..be8c41e 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -500,13 +500,15 @@ union ethtool_flow_union {
 	struct ethtool_ah_espip4_spec		esp_ip4_spec;
 	struct ethtool_usrip4_spec		usr_ip4_spec;
 	struct ethhdr				ether_spec;
-	__u8					hdata[60];
+	__u8					hdata[52];
 };
 
 struct ethtool_flow_ext {
-	__be16	vlan_etype;
-	__be16	vlan_tci;
-	__be32	data[2];
+	__u8		padding[2];
+	unsigned char	h_dest[ETH_ALEN];	/* destination eth addr	*/
+	__be16		vlan_etype;
+	__be16		vlan_tci;
+	__be32		data[2];
 };
 
 /**
@@ -1027,6 +1029,7 @@ enum ethtool_sfeatures_retval_bits {
 #define	ETHER_FLOW	0x12	/* spec only (ether_spec) */
 /* Flag to enable additional fields in struct ethtool_rx_flow_spec */
 #define	FLOW_EXT	0x80000000
+#define	FLOW_MAC_EXT	0x40000000
 
 /* L3-L4 network traffic flow hash options */
 #define	RXH_L2DA	(1 << 1)
-- 
1.7.11.3
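
The claim that the change is transparent to older userspace can be checked with a layout sketch like the one below. The two structs are simplified stand-ins (hypothetical names, fixed-width types instead of the kernel's __u8/__be16/__be32) that model only the flow union followed by the extension, as they sit next to each other in struct ethtool_rx_flow_spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN 6	/* same value as ETH_ALEN */

/* Union plus extension as laid out before the patch. */
struct old_layout {
	union { uint32_t align; uint8_t hdata[60]; } u;
	struct {
		uint16_t vlan_etype;
		uint16_t vlan_tci;
		uint32_t data[2];
	} ext;
};

/* Union plus extension as laid out after the patch. */
struct new_layout {
	union { uint32_t align; uint8_t hdata[52]; } u;
	struct {
		uint8_t  padding[2];
		uint8_t  h_dest[MAC_LEN];
		uint16_t vlan_etype;
		uint16_t vlan_tci;
		uint32_t data[2];
	} ext;
};

/* The total size and the offsets of the existing fields must not move. */
_Static_assert(sizeof(struct old_layout) == sizeof(struct new_layout),
	       "total size changed");
_Static_assert(offsetof(struct old_layout, ext.vlan_etype) ==
	       offsetof(struct new_layout, ext.vlan_etype),
	       "vlan_etype moved");
_Static_assert(offsetof(struct old_layout, ext.vlan_tci) ==
	       offsetof(struct new_layout, ext.vlan_tci),
	       "vlan_tci moved");
_Static_assert(offsetof(struct old_layout, ext.data) ==
	       offsetof(struct new_layout, ext.data),
	       "data moved");

/* Offset of the new h_dest field, which lands inside the old hdata area. */
static size_t h_dest_offset(void)
{
	return offsetof(struct new_layout, ext.h_dest);
}
```

In this model the new padding and h_dest fields occupy the last eight bytes of the old hdata[60] area, so the existing fields keep their offsets.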

* [PATCH ETHTOOL] Added dst-mac parameter for L3/L4 flow spec rules. This is useful in vSwitch configurations.
From: Amir Vadai @ 2012-12-12 12:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman
In-Reply-To: <1355314400-14909-1-git-send-email-amirv@mellanox.com>

From: Yan Burman <yanb@mellanox.com>

Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 ethtool-copy.h | 11 +++++++----
 ethtool.8.in   |  6 ++++++
 ethtool.c      |  5 +++++
 rxclass.c      | 62 ++++++++++++++++++++++++++++++++++++++++------------------
 4 files changed, 61 insertions(+), 23 deletions(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 4801eef..d352f20 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -500,13 +500,15 @@ union ethtool_flow_union {
 	struct ethtool_ah_espip4_spec		esp_ip4_spec;
 	struct ethtool_usrip4_spec		usr_ip4_spec;
 	struct ethhdr				ether_spec;
-	__u8					hdata[60];
+	__u8					hdata[52];
 };
 
 struct ethtool_flow_ext {
-	__be16	vlan_etype;
-	__be16	vlan_tci;
-	__be32	data[2];
+	__u8		padding[2];
+	unsigned char	h_dest[ETH_ALEN];	/* destination eth addr	*/
+	__be16		vlan_etype;
+	__be16		vlan_tci;
+	__be32		data[2];
 };
 
 /**
@@ -1027,6 +1029,7 @@ enum ethtool_sfeatures_retval_bits {
 #define	ETHER_FLOW	0x12	/* spec only (ether_spec) */
 /* Flag to enable additional fields in struct ethtool_rx_flow_spec */
 #define	FLOW_EXT	0x80000000
+#define	FLOW_MAC_EXT	0x40000000
 
 /* L3-L4 network traffic flow hash options */
 #define	RXH_L2DA	(1 << 1)
diff --git a/ethtool.8.in b/ethtool.8.in
index e701919..a52e484 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -268,6 +268,7 @@ ethtool \- query or control network driver and hardware settings
 .BM vlan\-etype
 .BM vlan
 .BM user\-def
+.RB [ dst-mac \ \*(MA\ [ m \ \*(MA]]
 .BN action
 .BN loc
 .RB |
@@ -739,6 +740,11 @@ Includes the VLAN tag and an optional mask.
 .BI user\-def \ N \\fR\ [\\fPm \ N \\fR]\\fP
 Includes 64-bits of user-specific data and an optional mask.
 .TP
+.BR dst-mac \ \*(MA\ [ m \ \*(MA]
+Includes the destination MAC address, specified as 6 bytes in hexadecimal
+separated by colons, along with an optional mask.
+Valid for all IPv4 based flow-types.
+.TP
 .BI action \ N
 Specifies the Rx queue to send packets to, or some other action.
 .TS
diff --git a/ethtool.c b/ethtool.c
index 345c21c..55bc082 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -3231,6 +3231,10 @@ static int flow_spec_to_ntuple(struct ethtool_rx_flow_spec *fsp,
 	if (fsp->location != RX_CLS_LOC_ANY)
 		return -1;
 
+	/* destination MAC address in L3/L4 rules is not supported by ntuple */
+	if (fsp->flow_type & FLOW_MAC_EXT)
+		return -1;
+
 	/* verify ring cookie can transfer to action */
 	if (fsp->ring_cookie > INT_MAX && fsp->ring_cookie < (u64)(-2))
 		return -1;
@@ -3814,6 +3818,7 @@ static const struct option {
 	  "			[ vlan-etype %x [m %x] ]\n"
 	  "			[ vlan %x [m %x] ]\n"
 	  "			[ user-def %x [m %x] ]\n"
+	  "			[ dst-mac %x:%x:%x:%x:%x:%x [m %x:%x:%x:%x:%x:%x] ]\n"
 	  "			[ action %d ]\n"
 	  "			[ loc %d]] |\n"
 	  "		delete %d\n" },
diff --git a/rxclass.c b/rxclass.c
index e1633a8..1564b62 100644
--- a/rxclass.c
+++ b/rxclass.c
@@ -41,26 +41,38 @@ static void rxclass_print_ipv4_rule(__be32 sip, __be32 sipm, __be32 dip,
 
 static void rxclass_print_nfc_spec_ext(struct ethtool_rx_flow_spec *fsp)
 {
-	u64 data, datam;
-	__u16 etype, etypem, tci, tcim;
+	if (fsp->flow_type & FLOW_EXT) {
+		u64 data, datam;
+		__u16 etype, etypem, tci, tcim;
+		etype = ntohs(fsp->h_ext.vlan_etype);
+		etypem = ntohs(~fsp->m_ext.vlan_etype);
+		tci = ntohs(fsp->h_ext.vlan_tci);
+		tcim = ntohs(~fsp->m_ext.vlan_tci);
+		data = (u64)ntohl(fsp->h_ext.data[0]) << 32;
+		data |= (u64)ntohl(fsp->h_ext.data[1]);
+		datam = (u64)ntohl(~fsp->m_ext.data[0]) << 32;
+		datam |= (u64)ntohl(~fsp->m_ext.data[1]);
 
-	if (!(fsp->flow_type & FLOW_EXT))
-		return;
+		fprintf(stdout,
+			"\tVLAN EtherType: 0x%x mask: 0x%x\n"
+			"\tVLAN: 0x%x mask: 0x%x\n"
+			"\tUser-defined: 0x%llx mask: 0x%llx\n",
+			etype, etypem, tci, tcim, data, datam);
+	}
 
-	etype = ntohs(fsp->h_ext.vlan_etype);
-	etypem = ntohs(~fsp->m_ext.vlan_etype);
-	tci = ntohs(fsp->h_ext.vlan_tci);
-	tcim = ntohs(~fsp->m_ext.vlan_tci);
-	data = (u64)ntohl(fsp->h_ext.data[0]) << 32;
-	data = (u64)ntohl(fsp->h_ext.data[1]);
-	datam = (u64)ntohl(~fsp->m_ext.data[0]) << 32;
-	datam |= (u64)ntohl(~fsp->m_ext.data[1]);
+	if (fsp->flow_type & FLOW_MAC_EXT) {
+		unsigned char *dmac, *dmacm;
 
-	fprintf(stdout,
-		"\tVLAN EtherType: 0x%x mask: 0x%x\n"
-		"\tVLAN: 0x%x mask: 0x%x\n"
-		"\tUser-defined: 0x%llx mask: 0x%llx\n",
-		etype, etypem, tci, tcim, data, datam);
+		dmac = fsp->h_ext.h_dest;
+		dmacm = fsp->m_ext.h_dest;
+
+		fprintf(stdout,
+			"\tDest MAC addr: %02X:%02X:%02X:%02X:%02X:%02X"
+			" mask: %02X:%02X:%02X:%02X:%02X:%02X\n",
+			dmac[0], dmac[1], dmac[2], dmac[3], dmac[4],
+			dmac[5], dmacm[0], dmacm[1], dmacm[2], dmacm[3],
+			dmacm[4], dmacm[5]);
+	}
 }
 
 static void rxclass_print_nfc_rule(struct ethtool_rx_flow_spec *fsp)
@@ -70,7 +82,7 @@ static void rxclass_print_nfc_rule(struct ethtool_rx_flow_spec *fsp)
 
 	fprintf(stdout,	"Filter: %d\n", fsp->location);
 
-	flow_type = fsp->flow_type & ~FLOW_EXT;
+	flow_type = fsp->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
 
 	invert_flow_mask(fsp);
 
@@ -172,7 +184,7 @@ static void rxclass_print_nfc_rule(struct ethtool_rx_flow_spec *fsp)
 static void rxclass_print_rule(struct ethtool_rx_flow_spec *fsp)
 {
 	/* print the rule in this location */
-	switch (fsp->flow_type & ~FLOW_EXT) {
+	switch (fsp->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
 	case TCP_V4_FLOW:
 	case UDP_V4_FLOW:
 	case SCTP_V4_FLOW:
@@ -533,6 +545,7 @@ typedef enum {
 #define NTUPLE_FLAG_VLAN	0x100
 #define NTUPLE_FLAG_UDEF	0x200
 #define NTUPLE_FLAG_VETH	0x400
+#define NFC_FLAG_MAC_ADDR	0x800
 
 struct rule_opts {
 	const char	*name;
@@ -571,6 +584,9 @@ static const struct rule_opts rule_nfc_tcp_ip4[] = {
 	{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
 	  offsetof(struct ethtool_rx_flow_spec, h_ext.data),
 	  offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+	{ "dst-mac", OPT_MAC, NFC_FLAG_MAC_ADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.h_dest),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.h_dest) },
 };
 
 static const struct rule_opts rule_nfc_esp_ip4[] = {
@@ -599,6 +615,9 @@ static const struct rule_opts rule_nfc_esp_ip4[] = {
 	{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
 	  offsetof(struct ethtool_rx_flow_spec, h_ext.data),
 	  offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+	{ "dst-mac", OPT_MAC, NFC_FLAG_MAC_ADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.h_dest),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.h_dest) },
 };
 
 static const struct rule_opts rule_nfc_usr_ip4[] = {
@@ -639,6 +658,9 @@ static const struct rule_opts rule_nfc_usr_ip4[] = {
 	{ "user-def", OPT_BE64, NTUPLE_FLAG_UDEF,
 	  offsetof(struct ethtool_rx_flow_spec, h_ext.data),
 	  offsetof(struct ethtool_rx_flow_spec, m_ext.data) },
+	{ "dst-mac", OPT_MAC, NFC_FLAG_MAC_ADDR,
+	  offsetof(struct ethtool_rx_flow_spec, h_ext.h_dest),
+	  offsetof(struct ethtool_rx_flow_spec, m_ext.h_dest) },
 };
 
 static const struct rule_opts rule_nfc_ether[] = {
@@ -1063,6 +1085,8 @@ int rxclass_parse_ruleopts(struct cmd_context *ctx,
 		fsp->h_u.usr_ip4_spec.ip_ver = ETH_RX_NFC_IP4;
 	if (flags & (NTUPLE_FLAG_VLAN | NTUPLE_FLAG_UDEF | NTUPLE_FLAG_VETH))
 		fsp->flow_type |= FLOW_EXT;
+	if (flags & NFC_FLAG_MAC_ADDR)
+		fsp->flow_type |= FLOW_MAC_EXT;
 
 	return 0;
 
-- 
1.7.11.3
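
The flag handling in this patch can be summarised with a short sketch. FLOW_EXT and FLOW_MAC_EXT use the values from the patch; TCP_V4_FLOW is 0x01 in linux/ethtool.h. The helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FLOW_EXT	0x80000000u	/* value from the patch */
#define FLOW_MAC_EXT	0x40000000u	/* value from the patch */
#define TCP_V4_FLOW	0x01u		/* value from linux/ethtool.h */

/* Strip both extension flags to recover the base flow type. */
static uint32_t base_flow_type(uint32_t flow_type)
{
	return flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
}

/* Mark a base flow type as carrying a destination MAC in h_ext/m_ext. */
static uint32_t with_dst_mac(uint32_t base)
{
	/* A base flow type must not already use the flag bits. */
	assert((base & (FLOW_EXT | FLOW_MAC_EXT)) == 0);
	return base | FLOW_MAC_EXT;
}

/* Nonzero if the rule carries a destination MAC address. */
static int has_dst_mac(uint32_t flow_type)
{
	return (flow_type & FLOW_MAC_EXT) != 0;
}
```

Because the two flags are independent bits, a rule can carry VLAN or user-defined data, a destination MAC, both, or neither, and every switch on the flow type has to mask both bits.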

* [PATCH V1 net-next 3/3] net/mlx4_en: Add support for destination MAC in steering rules
From: Amir Vadai @ 2012-12-12 12:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman
In-Reply-To: <1355314400-14909-1-git-send-email-amirv@mellanox.com>

From: Yan Burman <yanb@mellanox.com>

Implement destination MAC rule extension for L3/L4 rules in
flow steering. Useful for vSwitch/macvlan configurations.

Signed-off-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index cc7bb25..03447da 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -617,7 +617,13 @@ static int mlx4_en_validate_flow(struct net_device *dev,
 	if (cmd->fs.location >= MAX_NUM_OF_FS_RULES)
 		return -EINVAL;
 
-	switch (cmd->fs.flow_type & ~FLOW_EXT) {
+	if (cmd->fs.flow_type & FLOW_MAC_EXT) {
+		/* dest mac mask must be ff:ff:ff:ff:ff:ff */
+		if (!is_broadcast_ether_addr(cmd->fs.m_ext.h_dest))
+			return -EINVAL;
+	}
+
+	switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
 	case TCP_V4_FLOW:
 	case UDP_V4_FLOW:
 		if (cmd->fs.m_u.tcp_ip4_spec.tos)
@@ -745,7 +751,6 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 					     struct list_head *rule_list_h)
 {
 	int err;
-	u64 mac;
 	__be64 be_mac;
 	struct ethhdr *eth_spec;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -760,12 +765,16 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 	if (!spec_l2)
 		return -ENOMEM;
 
-	mac = priv->mac & MLX4_MAC_MASK;
-	be_mac = cpu_to_be64(mac << 16);
+	if (cmd->fs.flow_type & FLOW_MAC_EXT) {
+		memcpy(&be_mac, cmd->fs.h_ext.h_dest, ETH_ALEN);
+	} else {
+		u64 mac = priv->mac & MLX4_MAC_MASK;
+		be_mac = cpu_to_be64(mac << 16);
+	}
 
 	spec_l2->id = MLX4_NET_TRANS_RULE_ID_ETH;
 	memcpy(spec_l2->eth.dst_mac_msk, &mac_msk, ETH_ALEN);
-	if ((cmd->fs.flow_type & ~FLOW_EXT) != ETHER_FLOW)
+	if ((cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) != ETHER_FLOW)
 		memcpy(spec_l2->eth.dst_mac, &be_mac, ETH_ALEN);
 
 	if ((cmd->fs.flow_type & FLOW_EXT) && cmd->fs.m_ext.vlan_tci) {
@@ -775,7 +784,7 @@ static int mlx4_en_ethtool_to_net_trans_rule(struct net_device *dev,
 
 	list_add_tail(&spec_l2->list, rule_list_h);
 
-	switch (cmd->fs.flow_type & ~FLOW_EXT) {
+	switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
 	case ETHER_FLOW:
 		eth_spec = &cmd->fs.h_u.ether_spec;
 		memcpy(&spec_l2->eth.dst_mac, eth_spec->h_dest, ETH_ALEN);
-- 
1.7.11.3
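
The non-FLOW_MAC_EXT branch converts the driver's 48-bit MAC, held in a u64, into six bytes in wire order by shifting it left by 16 and storing it big-endian. A portable userspace sketch of that conversion follows. store_be64() is a stand-in for cpu_to_be64() plus a store, and MAC_MASK_48 is assumed to match MLX4_MAC_MASK; both names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_MASK_48 0xffffffffffffULL	/* assumed equal to MLX4_MAC_MASK */

/* Store v most significant byte first, on any host byte order. */
static void store_be64(uint64_t v, uint8_t out[8])
{
	for (int i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Convert a MAC held in the low 48 bits of a u64 to six wire-order bytes. */
static void mac_u64_to_bytes(uint64_t mac, uint8_t out[6])
{
	uint8_t be[8];

	assert(out != NULL);
	/* Shift the MAC into the top six bytes, then keep those six. */
	store_be64((mac & MAC_MASK_48) << 16, be);
	memcpy(out, be, 6);
}
```

In the FLOW_MAC_EXT branch no conversion is needed, since h_dest is already a byte array in wire order and can be copied as-is.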

* [PATCH V1 net-next 0/4] Add destination MAC address to ethtool flow steering
From: Amir Vadai @ 2012-12-12 12:13 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Or Gerlitz, Amir Vadai, Hadar Har-Zion, Yan Burman

From: Yan Burman <yanb@mellanox.com>

In a vSwitch configuration it is often beneficial to create flow steering
rules for L3/L4 traffic based on the VM port. This requires the destination
MAC address of that port to be present. Note that today the mlx4_en driver
adds its own MAC address to the flow spec, whereas with the new ethtool
flag proposed here it does not.

It may also be useful in macvlan devices.

These patches add kernel support for the new field (does not break old
userspace compatibility, so new ethtool will work on old kernels and
old ethtool will work with new kernels).

Also present here is the ethtool userspace patch.

See more details here: http://marc.info/?t=134977576500003

Changes from V0:
- Get rid of full_mac, zero_mac in favour of
    is_zero_ether_addr and is_broadcast_ether_addr

Yan Burman (3):
  net: ethtool: Add destination MAC address to flow steering API
  net/mlx4_en: Use generic etherdevice.h functions.
  net/mlx4_en: Add support for destination MAC in steering rules

 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c | 27 ++++++++++++++++---------
 include/uapi/linux/ethtool.h                    | 11 ++++++----
 2 files changed, 24 insertions(+), 14 deletions(-)

-- 
1.7.11.3

* Re: [PATCH] net: filter: return -EINVAL if BPF_S_ANC* operation is not supported
From: Eric Dumazet @ 2012-12-12 12:22 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: David Miller, netdev, Ani Sinha
In-Reply-To: <1355304701-22228-1-git-send-email-dborkman@redhat.com>

On Wed, 2012-12-12 at 10:31 +0100, Daniel Borkmann wrote:
> Currently, we return -EINVAL for malicious or wrong BPF filters.
> However, this is not done for BPF_S_ANC* operations, which makes it
> more difficult to detect if it's actually supported or not by the
> BPF machine. Therefore, we should also return -EINVAL if K is within
> the SKF_AD_OFF universe and the ancillary operation did not match.
> 
> Cc: Ani Sinha <ani@aristanetworks.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---
>  net/core/filter.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index c23543c..de9bed4 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -531,7 +531,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>  		[BPF_JMP|BPF_JSET|BPF_K] = BPF_S_JMP_JSET_K,
>  		[BPF_JMP|BPF_JSET|BPF_X] = BPF_S_JMP_JSET_X,
>  	};
> -	int pc;
> +	int pc, anc_found;
>  
>  	if (flen == 0 || flen > BPF_MAXINSNS)
>  		return -EINVAL;
> @@ -592,8 +592,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>  		case BPF_S_LD_W_ABS:
>  		case BPF_S_LD_H_ABS:
>  		case BPF_S_LD_B_ABS:
> +			anc_found = 0;
>  #define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE:	\
>  				code = BPF_S_ANC_##CODE;	\
> +				anc_found = 1;			\
>  				break
>  			switch (ftest->k) {
>  			ANCILLARY(PROTOCOL);
> @@ -610,6 +612,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>  			ANCILLARY(VLAN_TAG);
>  			ANCILLARY(VLAN_TAG_PRESENT);
>  			}
> +
> +			/* ancillary operation unknown or unsupported */
> +			if (anc_found == 0 && ftest->k >= SKF_AD_OFF)
> +				return -EINVAL;
>  		}
>  		ftest->code = code;
>  	}

Several points:

1) This might break a userland filter that was previously working by
returning 0 when load_pointer() returns NULL.

Specifying an offset bigger than skb->len is not _invalid_; it only
makes a filter return 0, because load_pointer() returns NULL.

2) This won't help applications running on old kernels where your patch
won't be applied, as already mentioned yesterday.

3) It is missing a "Reported-by" tag.

4) anc_found is a boolean.


To be truly portable, userland should not rely on the kernel doing a full
validation of ancillaries.

* Re: [PATCH v2 1/1] net: ethernet: davinci_cpdma: Add boundary for rx and tx descriptors
From: Mugunthan V N @ 2012-12-12 13:38 UTC (permalink / raw)
  To: David Miller; +Cc: erdnetdev, netdev, linux-arm-kernel, linux-omap, s.hauer
In-Reply-To: <20121211.135759.1010213285970148974.davem@davemloft.net>

On 12/12/2012 12:27 AM, David Miller wrote:
> From: Eric Dumazet <erdnetdev@gmail.com>
> Date: Tue, 11 Dec 2012 10:54:56 -0800
>
>> Suggested fix : add a TCQ_F_MQSLAVE flag to allow dequeue_skb() to test
>> the netif_xmit_frozen_or_stopped() status _before_ dequeuing a packet from
>> qdisc.
> This sounds fine to me.
I will submit the next version with the suggested change.

Regards
Mugunthan V N

* Re: [RFC PATCH net-next 0/3 V4] net-tcp: TCP/IP stack bypass for loopback connections
From: Weiping Pan @ 2012-12-12 14:13 UTC (permalink / raw)
  To: David Miller; +Cc: wpan, netdev, brutus
In-Reply-To: <20121210.160230.1883556145617090938.davem@davemloft.net>

On 12/11/2012 05:02 AM, David Miller wrote:
> From: Weiping Pan<wpan@redhat.com>
> Date: Wed,  5 Dec 2012 10:54:16 +0800
>
>> Friends vs. AF_UNIX
>> Their call paths are almost the same, but AF_UNIX uses its own send/recv code
>> with proper locks,
>> so AF_UNIX's performance is much better than Friends.
Sorry, this statement is not correct.
In the TCP_STREAM case, if the message size is 16384, then AF_UNIX is much
better than Friends.
If the message size is smaller, then Friends performs on par with AF_UNIX.
In TCP_RR, Friends performs on par with AF_UNIX, too.

> While I understand the other portions of your analysis, this one
> mystifies me.
>
> In both cases, the sender has to queue the SKB onto the receiver's
> queue.  And in both cases, the sender takes the lock on that queue.
>
> So the locking contention really ought to be similar if not identical.
>
> The only difference is that AF_UNIX takes the unix_sk()->lock of the
> remote socket around these operations.
>
> If that is enough of a synchronizer to "fix" the contention or reduce
> it, then this would be very easy to test by adding a friend lock to
> tcp_sk().

I am running some experiments to reduce the use of the lock;
performance results will follow.

thanks
Weiping Pan

* [RFC PATCH net-next 4/4 V4] try to fix performance regression
From: Weiping Pan @ 2012-12-12 14:29 UTC (permalink / raw)
  To: davem; +Cc: brutus, netdev, Weiping Pan
In-Reply-To: <117a10f9575d95d6a9ea4602ea7376e2b6d5ccd1.1355320533.git.wpan@redhat.com>

1. Do not share the tail skb between sender and receiver.
2. Reduce the use of sock->sk_lock.slock.

--------------------------------------------------------------------------
TCP friends performance results start


BASE means normal tcp with friends DISABLED.
AF_UNIX means sockets for local interprocess communication, for reference.
FRIENDS means tcp with friends ENABLED.
I set -s 51882 -m 16384 -M 87380 for all three kinds of sockets by default.
The first percentage number is FRIENDS/BASE.
The second percentage number is FRIENDS/AF_UNIX.
We set -i 10,2 -I 95,20 to stabilize the statistics.



      BASE    AF_UNIX    FRIENDS               TCP_STREAM
   7952.97   10864.86   13440.08  168%  123%



      BASE    AF_UNIX    FRIENDS               TCP_MAERTS
   6743.78          -   13809.97  204%    -%



      BASE    AF_UNIX    FRIENDS             TCP_SENDFILE
     11758          -      18483  157%    -%


TCP_SENDFILE does not work with -i 10,2 -I 95,20 (strange), so I report the average.



        MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
         1      10.70       5.40       4.02   37%   74%
         2      28.01       9.67       7.97   28%   82%
         4      55.53      19.78      16.48   29%   83%
         8     115.40      38.22      33.51   29%   87%
        16     227.31      81.06      67.70   29%   83%
        32     446.20     166.59     129.31   28%   77%
        64     849.04     336.77     259.43   30%   77%
       128    1440.50     661.88     530.43   36%   80%
       256    2404.70    1279.67    1029.15   42%   80%
       512    4331.53    2501.30    1942.21   44%   77%
      1024    6819.78    4622.37    4128.10   60%   89%
      2048   10544.60    6348.81    6349.59   60%  100%
      4096   12830.41    8324.43    7984.43   62%   95%
      8192   13462.65    8355.49   11079.37   82%  132%
     16384    9960.87   10840.13   13037.81  130%  120%
     32768    8749.31   11372.15   15087.08  172%  132%
     65536    7580.27   12150.23   14971.42  197%  123%
    131072    6727.74   11451.34   13604.78  202%  118%
    262144    7673.14   11613.10   11436.97  149%   98%
    524288    7366.17   11675.95   11559.43  156%   99%
   1048576    6608.57   11883.01   10103.20  152%   85%
MS means Message Size in bytes, that is -m -M for netperf



        RR       BASE    AF_UNIX    FRIENDS                TCP_RR_RR
         1   19716.88   34451.39   34574.12  175%  100%
         2   19836.74   34297.00   34671.29  174%  101%
         4   19874.71   34456.48   34552.13  173%  100%
         8   18882.93   34123.00   34661.48  183%  101%
        16   19179.09   34358.47   34599.16  180%  100%
        32   20140.08   34326.35   34616.30  171%  100%
        64   19473.39   34382.05   34583.10  177%  100%
       128   19699.62   34012.03   34566.14  175%  101%
       256   19740.44   34529.71   34624.07  175%  100%
       512   18929.46   33673.06   33932.83  179%  100%
      1024   18738.98   33724.78   33313.44  177%   98%
      2048   17315.61   32982.24   32361.39  186%   98%
      4096   16585.81   31345.85   31073.32  187%   99%
      8192   11933.16   27851.10   27166.94  227%   97%
     16384    9717.19   21746.12   22583.40  232%  103%
     32768    7044.35   12927.23   16253.26  230%  125%
     65536    5038.96    8945.74    7982.61  158%   89%
    131072    2860.64    4981.78    4417.16  154%   88%
    262144    1633.45    2765.27    2739.36  167%   99%
    524288     796.68    1429.79    1445.21  181%  101%
   1048576     379.78        per     730.05  192%     %
RR means Request Response Message Size in bytes, that is -r req,resp for netperf



        RR       BASE    AF_UNIX    FRIENDS               TCP_CRR_RR
         1    5531.49          -    5861.86  105%    -%
         2    5506.13          -    5845.53  106%    -%
         4    5523.27          -    5853.43  105%    -%
         8    5503.73          -    5836.44  106%    -%
        16    5516.23          -    5842.29  105%    -%
        32    5557.37          -    5858.29  105%    -%
        64    5517.51          -    5892.64  106%    -%
       128    5504.18          -    5841.44  106%    -%
       256    5512.82          -    5842.60  105%    -%
       512    5496.36          -    5837.72  106%    -%
      1024    5465.24          -    5827.99  106%    -%
      2048    5550.15          -    5812.88  104%    -%
      4096    5292.75          -    5824.45  110%    -%
      8192    4917.06          -    5705.12  116%    -%
     16384    4278.63          -    5318.39  124%    -%
     32768    3611.86          -    4930.30  136%    -%
     65536      77.35          -    3847.43 4974%    -%
    131072      47.65          -    2811.58 5900%    -%
    262144     805.13          -       4.88    0%    -%
    524288     583.08          -       4.78    0%    -%
   1048576     369.52          -       5.02    1%    -%
RR means Request Response Message Size in bytes, that is -r req,resp for netperf -H 127.0.0.1



TCP friends performance results end
--------------------------------------------------------------------------


Performance analysis:
1 Friends shows better performance than loopback in TCP_RR, TCP_MAERTS and
TCP_SENDFILE, same in TCP_CRR_RR.

2 In TCP_STREAM, Friends shows much worse performance (about 30% of loopback)
when the message size is small, and worse performance (about 80%) than AF_UNIX.

3 Compared with the last performance report, Friends shows worse performance in
TCP_RR.

Friends VS AF_UNIX
I think the lock usage is very similar this time.
Maybe lock contention is not the bottleneck?

Friends VS loopback
I have reduced the lock contention as much as possible,
but it still shows bad performance.
Maybe lock contention is not the bottleneck?


Signed-off-by: Weiping Pan <wpan@redhat.com>
---
 include/net/tcp.h |   10 --
 net/ipv4/tcp.c    |  327 ++++++++++++++++++++++-------------------------------
 2 files changed, 136 insertions(+), 201 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5f82770..80a8ec9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -688,15 +688,6 @@ void tcp_send_window_probe(struct sock *sk);
 #define TCPHDR_ECE 0x40
 #define TCPHDR_CWR 0x80
 
-/* If skb_get_friend() != NULL, TCP friends per packet state.
- */
-struct friend_skb_parm {
-	bool	tail_inuse;		/* In use by skb_get_friend() send while */
-					/* on sk_receive_queue for tail put */
-};
-
-#define TCP_FRIEND_CB(tcb) (&(tcb)->header.hf)
-
 /* This is what the send packet queuing engine uses to pass
  * TCP per-packet control information to the transmission code.
  * We also store the host-order sequence numbers in here too.
@@ -709,7 +700,6 @@ struct tcp_skb_cb {
 #if IS_ENABLED(CONFIG_IPV6)
 		struct inet6_skb_parm	h6;
 #endif
-		struct friend_skb_parm	hf;
 	} header;	/* For incoming frames		*/
 	__u32		seq;		/* Starting sequence number	*/
 	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e9d82e0..f008d60 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -336,25 +336,24 @@ static inline int tcp_friend_validate(struct sock *sk, struct sock **friendp,
 	return 1;
 }
 
-static inline int tcp_friend_send_lock(struct sock *friend)
+static inline int tcp_friend_get_state(struct sock *friend)
 {
 	int err = 0;
 
 	spin_lock_bh(&friend->sk_lock.slock);
-	if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN)) {
-		spin_unlock_bh(&friend->sk_lock.slock);
+	if (unlikely(friend->sk_shutdown & RCV_SHUTDOWN))
 		err = -ECONNRESET;
-	}
+	spin_unlock_bh(&friend->sk_lock.slock);
 
 	return err;
 }
 
-static inline void tcp_friend_recv_lock(struct sock *friend)
+static inline void tcp_friend_state_lock(struct sock *friend)
 {
 	spin_lock_bh(&friend->sk_lock.slock);
 }
 
-static void tcp_friend_unlock(struct sock *friend)
+static inline void tcp_friend_state_unlock(struct sock *friend)
 {
 	spin_unlock_bh(&friend->sk_lock.slock);
 }
@@ -639,71 +638,32 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 }
 EXPORT_SYMBOL(tcp_ioctl);
 
-/*
- * Friend receive_queue tail skb space? If true, set tail_inuse.
- * Else if RCV_SHUTDOWN, return *copy = -ECONNRESET.
- */
-static inline struct sk_buff *tcp_friend_tail(struct sock *friend, int *copy)
-{
-	struct sk_buff	*skb = NULL;
-	int		sz = 0;
-
-	if (skb_peek_tail(&friend->sk_receive_queue)) {
-		sz = tcp_friend_send_lock(friend);
-		if (!sz) {
-			skb = skb_peek_tail(&friend->sk_receive_queue);
-			if (skb && skb->friend) {
-				if (!*copy)
-					sz = skb_tailroom(skb);
-				else {
-					sz = *copy - skb->len;
-					if (sz < 0)
-						sz = 0;
-				}
-				if (sz > 0)
-					TCP_FRIEND_CB(TCP_SKB_CB(skb))->
-							tail_inuse = true;
-			}
-			tcp_friend_unlock(friend);
-		}
-	}
-
-	*copy = sz;
-	return skb;
-}
-
-static inline void tcp_friend_seq(struct sock *sk, int copy, int charge)
-{
-	struct sock	*friend = sk->sk_friend;
-	struct tcp_sock *tp = tcp_sk(friend);
-
-	if (charge) {
-		sk_mem_charge(friend, charge);
-		atomic_add(charge, &friend->sk_rmem_alloc);
-	}
-	tp->rcv_nxt += copy;
-	tp->rcv_wup += copy;
-	tcp_friend_unlock(friend);
-
-	tp = tcp_sk(sk);
-	tp->snd_nxt += copy;
-	tp->pushed_seq += copy;
-	tp->snd_una += copy;
-	tp->snd_up += copy;
-}
-
 static inline bool tcp_friend_push(struct sock *sk, struct sk_buff *skb)
 {
-	struct sock	*friend = sk->sk_friend;
-	int		wait = false;
+	struct sock *friend = sk->sk_friend;
+	struct tcp_sock *tp = NULL;
+	int wait = false;
+
+	tcp_friend_state_lock(friend);
 
 	skb_set_owner_r(skb, friend);
-	__skb_queue_tail(&friend->sk_receive_queue, skb);
 	if (!sk_rmem_schedule(friend, skb, skb->truesize))
 		wait = true;
+	__skb_queue_tail(&friend->sk_receive_queue, skb);
+
+	tcp_friend_state_unlock(friend);
 
-	tcp_friend_seq(sk, skb->len, 0);
-	if (skb == skb_peek(&friend->sk_receive_queue))
+	tp = tcp_sk(friend);
+	tp->rcv_nxt += skb->len;
+	tp->rcv_wup += skb->len;
+
+	tp = tcp_sk(sk);
+	tp->snd_nxt += skb->len;
+	tp->pushed_seq += skb->len;
+	tp->snd_una += skb->len;
+	tp->snd_up += skb->len;
+
+	if (skb_queue_len(&friend->sk_receive_queue) == 1)
 		friend->sk_data_ready(friend, 0);
 
 	return wait;
@@ -728,7 +688,6 @@ static inline void skb_entail(struct sock *sk, struct sk_buff *skb)
 	tcb->seq     = tcb->end_seq = tp->write_seq;
 	if (sk->sk_friend) {
 		skb->friend = sk;
-		TCP_FRIEND_CB(tcb)->tail_inuse = false;
 		return;
 	}
 	skb->csum    = 0;
@@ -1048,8 +1007,17 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 		goto out_err;
 
+	if (friend) {
+		err = tcp_friend_get_state(friend);
+		if (err) {
+			sk->sk_err = -err;
+			err = -EPIPE;
+			goto out_err;
+		}
+	}
+
 	while (psize > 0) {
-		struct sk_buff *skb;
+		struct sk_buff *skb = NULL;
 		struct tcp_skb_cb *tcb;
 		struct page *page = pages[poffset / PAGE_SIZE];
 		int copy, i;
@@ -1059,12 +1027,10 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffse
 
 		if (friend) {
 			copy = size_goal;
-			skb = tcp_friend_tail(friend, &copy);
-			if (copy < 0) {
-				sk->sk_err = -copy;
-				err = -EPIPE;
-				goto out_err;
-			}
+			if (skb)
+				copy = copy - skb->len;
+			else
+				copy = 0;
 		} else if (!tcp_send_head(sk)) {
 			skb = NULL;
 			copy = 0;
@@ -1078,9 +1044,17 @@ new_segment:
 			if (!sk_stream_memory_free(sk))
 				goto wait_for_sndbuf;
 
-			if (friend)
+			if (friend) {
+				if (skb) {
+					if (tcp_friend_push(sk, skb))
+						goto wait_for_sndbuf;
+				}
+
+				/*
+				 * new skb
+				 */
 				skb = tcp_friend_alloc_skb(sk, 0);
-			else
+			} else
 				skb = sk_stream_alloc_skb(sk, 0,
 							  sk->sk_allocation);
 			if (!skb)
@@ -1097,10 +1071,7 @@ new_segment:
 		i = skb_shinfo(skb)->nr_frags;
 		can_coalesce = skb_can_coalesce(skb, i, page, offset);
 		if (!can_coalesce && i >= MAX_SKB_FRAGS) {
-			if (friend) {
-				if (TCP_FRIEND_CB(tcb)->tail_inuse)
-					TCP_FRIEND_CB(tcb)->tail_inuse = false;
-			} else
+			if (!friend)
 				tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
@@ -1124,20 +1095,9 @@ new_segment:
 		psize -= copy;
 
 		if (friend) {
-			err = tcp_friend_send_lock(friend);
-			if (err) {
-				sk->sk_err = -err;
-				err = -EPIPE;
-				goto out_err;
-			}
 			tcb->end_seq += copy;
-			if (TCP_FRIEND_CB(tcb)->tail_inuse) {
-				TCP_FRIEND_CB(tcb)->tail_inuse = false;
-				tcp_friend_seq(sk, copy, copy);
-			} else {
-				if (tcp_friend_push(sk, skb))
-					goto wait_for_sndbuf;
-			}
+			if (tcp_friend_push(sk, skb))
+				goto wait_for_sndbuf;
 			if (!psize)
 				goto out;
 			continue;
@@ -1172,6 +1132,18 @@ wait_for_memory:
 		if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 			goto do_error;
 
+		if (friend) {
+			if (skb) {
+				tcp_friend_state_lock(friend);
+				if (!sk_rmem_schedule(friend, skb, skb->truesize)) {
+					tcp_friend_state_unlock(friend);
+					goto wait_for_sndbuf;
+				}
+				tcp_friend_state_unlock(friend);
+				skb = NULL;
+			}
+		}
+
 		if (!friend)
 			mss_now = tcp_send_mss(sk, &size_goal, flags);
 	}
@@ -1266,7 +1238,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	struct iovec *iov;
 	struct sock *friend = sk->sk_friend;
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
+	struct sk_buff *skb = NULL;
 	struct tcp_skb_cb *tcb;
 	int iovlen, flags, err, copied = 0;
 	int mss_now = 0, size_goal = size, copied_syn = 0, offset = 0;
@@ -1330,6 +1302,15 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 	sg = !!(sk->sk_route_caps & NETIF_F_SG);
 
+	if (friend) {
+		err = tcp_friend_get_state(friend);
+		if (err) {
+			sk->sk_err = -err;
+			err = -EPIPE;
+			goto out_err;
+		}
+	}
+
 	while (--iovlen >= 0) {
 		size_t seglen = iov->iov_len;
 		unsigned char __user *from = iov->iov_base;
@@ -1350,12 +1331,10 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 			int max = size_goal;
 
 			if (friend) {
-				skb = tcp_friend_tail(friend, &copy);
-				if (copy < 0) {
-					sk->sk_err = -copy;
-					err = -EPIPE;
-					goto out_err;
-				}
+				if (skb)
+					copy = skb_availroom(skb);
+				else
+					copy = 0;
 			} else {
 				skb = tcp_write_queue_tail(sk);
 				if (tcp_send_head(sk)) {
@@ -1370,9 +1349,21 @@ new_segment:
 				if (!sk_stream_memory_free(sk))
 					goto wait_for_sndbuf;
 
-				if (friend)
+				if (friend) {
+					if (skb) {
+						/*
+						 * Friend push old skb
+						 */
+
+						if (tcp_friend_push(sk, skb))
+							goto wait_for_sndbuf;
+					}
+
+					/*
+					 * new skb
+					 */
 					skb = tcp_friend_alloc_skb(sk, max);
-				else {
+				} else {
 					/* Allocate new segment. If the
 					 * interface is SG, allocate skb
 					 * fitting to single page.
@@ -1455,32 +1446,23 @@ new_segment:
 			copied += copy;
 			seglen -= copy;
 
-			if (friend) {
-				err = tcp_friend_send_lock(friend);
-				if (err) {
-					sk->sk_err = -err;
-					err = -EPIPE;
-					goto out_err;
-				}
-				tcb->end_seq += copy;
-				if (TCP_FRIEND_CB(tcb)->tail_inuse) {
-					TCP_FRIEND_CB(tcb)->tail_inuse = false;
-					tcp_friend_seq(sk, copy, 0);
-				} else {
-					if (tcp_friend_push(sk, skb))
-						goto wait_for_sndbuf;
-				}
-				continue;
-			}
-
 			tcb->end_seq += copy;
+
 			skb_shinfo(skb)->gso_segs = 0;
 
 			if (copied == copy)
 				tcb->tcp_flags &= ~TCPHDR_PSH;
 
-			if (seglen == 0 && iovlen == 0)
+			if (seglen == 0 && iovlen == 0) {
+				if (friend && skb) {
+					if (tcp_friend_push(sk, skb))
+						goto wait_for_sndbuf;
+				}
 				goto out;
+			}
+
+			if (friend)
+				continue;
 
 			if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
 				continue;
@@ -1501,6 +1483,17 @@ wait_for_memory:
 			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
 				goto do_error;
 
+			if (friend) {
+				if (skb) {
+					tcp_friend_state_lock(friend);
+					if (!sk_rmem_schedule(friend, skb, skb->truesize)) {
+						tcp_friend_state_unlock(friend);
+						goto wait_for_sndbuf;
+					}
+					tcp_friend_state_unlock(friend);
+					skb = NULL;
+				}
+			}
 			if (!friend)
 				mss_now = tcp_send_mss(sk, &size_goal, flags);
 		}
@@ -1514,10 +1507,7 @@ out:
 
 do_fault:
 	if (skb->friend) {
-		if (TCP_FRIEND_CB(tcb)->tail_inuse)
-			TCP_FRIEND_CB(tcb)->tail_inuse = false;
-		else
-			__kfree_skb(skb);
+		__kfree_skb(skb);
 	} else if (!skb->len) {
 		tcp_unlink_write_queue(skb, sk);
 		/* It is the one place in all of TCP, except connection
@@ -1787,8 +1777,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	err = tcp_friend_validate(sk, &friend, &timeo);
 	if (err < 0)
 		return err;
-	if (friend)
-		tcp_friend_recv_lock(sk);
 
 	while ((skb = tcp_recv_skb(sk, seq, &offset, &len)) != NULL) {
 		if (len > 0) {
@@ -1803,9 +1791,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 					break;
 			}
 
-			if (friend)
-				tcp_friend_unlock(sk);
-
 			used = recv_actor(desc, skb, offset, len);
 			if (used < 0) {
 				if (!copied)
@@ -1817,21 +1802,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 				offset += used;
 			}
 
-			if (friend)
-				tcp_friend_recv_lock(sk);
-			if (skb->friend) {
-				len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
-				if (len > 0) {
-					/*
-					 * Friend did an skb_put() while we
-					 * were away so process the same skb.
-					 */
-					if (!desc->count)
-						break;
-					tp->copied_seq = seq;
-					goto again;
-				}
-			} else {
+			if (!skb->friend) {
 				/*
 				 * If recv_actor drops the lock (e.g. TCP
 				 * splice receive) the skb pointer might be
@@ -1844,19 +1815,25 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 					break;
 			}
 		}
+
 		if (!skb->friend && tcp_hdr(skb)->fin) {
 			sk_eat_skb(sk, skb, false);
 			++seq;
 			break;
 		}
 		if (skb->friend) {
-			if (!TCP_FRIEND_CB(TCP_SKB_CB(skb))->tail_inuse) {
-				__skb_unlink(skb, &sk->sk_receive_queue);
-				__kfree_skb(skb);
-				tcp_friend_write_space(sk);
+			len = (u32)(TCP_SKB_CB(skb)->end_seq - seq);
+			if (len > 0) {
+				if (!desc->count)
+					break;
+				tp->copied_seq = seq;
+				goto again;
 			}
-			tcp_friend_unlock(sk);
-			tcp_friend_recv_lock(sk);
+			tcp_friend_state_lock(sk);
+			__skb_unlink(skb, &sk->sk_receive_queue);
+			__kfree_skb(skb);
+			tcp_friend_state_unlock(sk);
+			tcp_friend_write_space(sk);
 		} else
 			sk_eat_skb(sk, skb, 0);
 		if (!desc->count)
@@ -1866,7 +1843,6 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	tp->copied_seq = seq;
 
 	if (friend) {
-		tcp_friend_unlock(sk);
 		tcp_friend_write_space(sk);
 	} else {
 		tcp_rcv_space_adjust(sk);
@@ -1903,7 +1879,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	bool copied_early = false;
 	struct sk_buff *skb;
 	u32 urg_hole = 0;
-	bool locked = false;
 
 	lock_sock(sk);
 
@@ -1991,11 +1966,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		 * slock, end_seq updated, so we can only use the bytes
 		 * from *seq to end_seq!
 		 */
-		if (friend && !locked) {
-			tcp_friend_recv_lock(sk);
-			locked = true;
-		}
-
 		skb_queue_walk(&sk->sk_receive_queue, skb) {
 			tcb = TCP_SKB_CB(skb);
 			offset = *seq - tcb->seq;
@@ -2003,20 +1973,14 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 				if (skb->friend) {
 					used = (u32)(tcb->end_seq - *seq);
 					if (used > 0) {
-						tcp_friend_unlock(sk);
-						locked = false;
 						/* Can use it all */
 						goto found_ok_skb;
 					}
 					/* No data to copyout */
 					if (flags & MSG_PEEK)
 						continue;
-					if (!TCP_FRIEND_CB(tcb)->tail_inuse)
-						goto unlink;
-					break;
+					goto unlink;
 				}
-				tcp_friend_unlock(sk);
-				locked = false;
 			}
 
 			/* Now that we have two receive queues this
@@ -2043,11 +2007,6 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 
 		/* Well, if we have backlog, try to process it now yet. */
 
-		if (friend && locked) {
-			tcp_friend_unlock(sk);
-			locked = false;
-		}
-
 		if (copied >= target && !sk->sk_backlog.tail)
 			break;
 
@@ -2262,17 +2221,7 @@ do_prequeue:
 		len -= used;
 		offset += used;
 
-		tcp_rcv_space_adjust(sk);
-
-skip_copy:
-		if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
-			tp->urg_data = 0;
-			tcp_fast_path_check(sk);
-		}
-
 		if (skb->friend) {
-			tcp_friend_recv_lock(sk);
-			locked = true;
 			used = (u32)(tcb->end_seq - *seq);
 			if (used) {
 				/*
@@ -2280,29 +2229,28 @@ skip_copy:
 				 * so if more to do process the same skb.
 				 */
 				if (len > 0) {
-					tcp_friend_unlock(sk);
-					locked = false;
 					goto found_ok_skb;
 				}
 				continue;
 			}
-			if (TCP_FRIEND_CB(tcb)->tail_inuse) {
-				/* Give sendmsg a chance */
-				tcp_friend_unlock(sk);
-				locked = false;
-				continue;
-			}
 			if (!(flags & MSG_PEEK)) {
 		unlink:
+				tcp_friend_state_lock(sk);
 				__skb_unlink(skb, &sk->sk_receive_queue);
 				__kfree_skb(skb);
-				tcp_friend_unlock(sk);
-				locked = false;
+				tcp_friend_state_unlock(sk);
 				tcp_friend_write_space(sk);
 			}
 			continue;
 		}
 
+		tcp_rcv_space_adjust(sk);
+skip_copy:
+		if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
+			tp->urg_data = 0;
+			tcp_fast_path_check(sk);
+		}
+
 		if (offset < skb->len)
 			continue;
 		else if (tcp_hdr(skb)->fin)
@@ -2323,9 +2271,6 @@ skip_copy:
 		break;
 	} while (len > 0);
 
-	if (friend && locked)
-		tcp_friend_unlock(sk);
-
 	if (user_recv) {
 		if (!skb_queue_empty(&tp->ucopy.prequeue)) {
 			int chunk;
-- 
1.7.4.4

^ permalink raw reply related

* RE: [RFC PATCH net-next 4/4 V4] try to fix performance regression
From: David Laight @ 2012-12-12 14:57 UTC (permalink / raw)
  To: Weiping Pan, davem; +Cc: brutus, netdev
In-Reply-To: <5e333588f6cb48cc3464b2263dcaa734b952e4c1.1355320534.git.wpan@redhat.com>

>         MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
>          1      10.70       5.40       4.02   37%   74%
>          2      28.01       9.67       7.97   28%   82%
>          4      55.53      19.78      16.48   29%   83%
>          8     115.40      38.22      33.51   29%   87%
>         16     227.31      81.06      67.70   29%   83%
>         32     446.20     166.59     129.31   28%   77%
>         64     849.04     336.77     259.43   30%   77%
>        128    1440.50     661.88     530.43   36%   80%
>        256    2404.70    1279.67    1029.15   42%   80%
>        512    4331.53    2501.30    1942.21   44%   77%
>       1024    6819.78    4622.37    4128.10   60%   89%
>       2048   10544.60    6348.81    6349.59   60%  100%
>       4096   12830.41    8324.43    7984.43   62%   95%
>       8192   13462.65    8355.49   11079.37   82%  132%
>      16384    9960.87   10840.13   13037.81  130%  120%
>      32768    8749.31   11372.15   15087.08  172%  132%
>      65536    7580.27   12150.23   14971.42  197%  123%
>     131072    6727.74   11451.34   13604.78  202%  118%
>     262144    7673.14   11613.10   11436.97  149%   98%
>     524288    7366.17   11675.95   11559.43  156%   99%
>    1048576    6608.57   11883.01   10103.20  152%   85%
> MS means Message Size in bytes, that is -m -M for netperf

If I read that table correctly, it seems to imply that
something goes badly wrong for 'normal' TCP loopback
connections when the read/write size exceeds 8k.
Putting effort into fixing that would appear to be
more worthwhile than the 'friends' code.

	David

^ permalink raw reply

* [RFC] net : add tx timestamp to packet mmap.
From: Paul Chavent @ 2012-12-12 15:29 UTC (permalink / raw)
  To: davem, edumazet, daniel.borkmann, xemul, ebiederm, netdev; +Cc: Paul Chavent

This patch allows tx timestamps to be generated for packets sent through the packet mmap interface.

Currently, you can't get tx timestamps with the sample code below.

I wonder whether my current implementation is good. If not, how should I get the timestamps?

Wouldn't it be a good idea to put the timestamps in the ring buffer frame before giving it back to the user?

Thanks for your comments.

/* BEGIN OF SAMPLE CODE */
struct timespec ts = {0,0};
struct sockaddr from_addr;
static uint8_t tmp_data[256];
struct iovec msg_iov = {tmp_data, sizeof(tmp_data)};
static uint8_t cmsg_buff[256];
struct msghdr msghdr = {&from_addr, sizeof(from_addr),
                        &msg_iov, 1,
                        cmsg_buff, sizeof(cmsg_buff),
                        0};

ssize_t err = recvmsg(itf->sock_fd, &msghdr, MSG_ERRQUEUE);
if(err < 0)
  {
    perror("recvmsg failed");
    return -1;
  }

struct cmsghdr *cmsg;
for(cmsg = CMSG_FIRSTHDR(&msghdr); cmsg != NULL; cmsg = CMSG_NXTHDR(&msghdr, cmsg))
{
  if(cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_TIMESTAMPING)
    {
      ts = *(struct timespec *)CMSG_DATA(cmsg);
      fprintf(stderr, "SCM_TIMESTAMPING available\n");
    }
  else if (cmsg->cmsg_level == SOL_PACKET && cmsg->cmsg_type == PACKET_TX_TIMESTAMP)
      {
        ts = *(struct timespec *)CMSG_DATA(cmsg);
        fprintf(stderr, "PACKET_TX_TIMESTAMP available\n");
      }
} 
/* END OF SAMPLE CODE */
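
A note on the sample above: the payload of an SCM_TIMESTAMPING control
message is three struct timespec values (software, a legacy slot, raw
hardware), so it can be worth checking the length and copying all three
out. A minimal sketch of that step follows. It is an illustration, not
part of the patch. find_tx_stamps and struct tx_stamps are hypothetical
names, and the fallback value 37 for SCM_TIMESTAMPING is the asm-generic
SO_TIMESTAMPING value, assumed only for headers that do not define it.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING 37	/* assumption: asm-generic SO_TIMESTAMPING */
#endif

/* Assumed payload layout: software, legacy and raw hardware stamps. */
struct tx_stamps {
	struct timespec ts[3];
};

/*
 * Hypothetical helper: scan the control messages of a msghdr filled by
 * recvmsg(..., MSG_ERRQUEUE) and copy out the first SCM_TIMESTAMPING
 * payload.  Returns 0 on success, -1 if no usable message was found.
 */
int find_tx_stamps(struct msghdr *msg, struct tx_stamps *out)
{
	struct cmsghdr *cmsg;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level != SOL_SOCKET ||
		    cmsg->cmsg_type != SCM_TIMESTAMPING)
			continue;
		if (cmsg->cmsg_len < CMSG_LEN(sizeof(*out)))
			continue;	/* payload too short, skip it */
		/* memcpy avoids dereferencing a possibly unaligned pointer */
		memcpy(out, CMSG_DATA(cmsg), sizeof(*out));
		return 0;
	}
	return -1;
}
```

With the sample's variables this would be called as
find_tx_stamps(&msghdr, &stamps) after a successful recvmsg(), and
stamps.ts[0] would then hold the software timestamp if one was generated.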

Signed-off-by: Paul Chavent <paul.chavent@onera.fr>
---
 net/packet/af_packet.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e639645..948748b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1857,6 +1857,10 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	void *data;
 	int err;
 
+	err = sock_tx_timestamp(&po->sk, &skb_shinfo(skb)->tx_flags);
+	if (err < 0)
+		return err;
+
 	ph.raw = frame;
 
 	skb->protocol = proto;
-- 
1.7.12.1

^ permalink raw reply related

* Network namespace bugs in L2TP
From: Tom Parkin @ 2012-12-12 15:51 UTC (permalink / raw)
  To: ebiederm; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 3545 bytes --]

Hi Eric,

I'm following up on this thread from late October in which you
pointed out some network namespace bugs in L2TP:

http://www.spinics.net/lists/netdev/msg214776.html

I use L2TP, and I'd like to help fix these bugs.  But I'm not very
conversant with network namespaces, and so I'm struggling to fully
appreciate the issues you pointed out previously.  Could you give me a
hand getting to grips with this?

So far I've tested L2TP within network namespaces, using both iproute2
to create sessions between two namespaces on the same host, and an
L2TP daemon running in a namespace to create sessions between two
hosts.  In both cases I've done a bit of trivial ping and iperf
testing using Ethernet pseudowires.

To make this work I've had to add a couple of trivial patches (see
below).

There are two things I'm uncertain about:

 1. Why do we need to change the namespace of the socket created in
    l2tp_tunnel_sock_create?  So far as I can tell, sock_create
    defaults to the namespace of the calling process.  Is the issue
    here that this code may run from a work queue or similar?

 2. You mentioned the need to keep track of sockets allocated within a
    namespace in order to be able to clean them up when the namespace
    is deleted.  Should we be keeping a list of sockets we create and
    then destroying them in the namespace pernet_ops exit function?

Thanks,
Tom

From b9c095fdf32c895b79a5954020c4745fe5518141 Mon Sep 17 00:00:00 2001
From: Tom Parkin <tparkin@katalix.com>
Date: Tue, 11 Dec 2012 13:03:48 +0000
Subject: [PATCH 1/2] l2tp: set netnsok flag for netlink messages

The L2TP netlink code can run in namespaces.  Set the netnsok flag in
genl_family to true to reflect that fact.
---
 net/l2tp/l2tp_netlink.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index bbba3a1..c1bab22 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -37,6 +37,7 @@ static struct genl_family l2tp_nl_family = {
 	.version	= L2TP_GENL_VERSION,
 	.hdrsize	= 0,
 	.maxattr	= L2TP_ATTR_MAX,
+	.netnsok	= true,
 };
 
 /* Accessed under genl lock */
-- 
1.7.9.5

From 13e9b0ddc48a16b384ffbf5ff64e6413cfa612f5 Mon Sep 17 00:00:00 2001
From: Tom Parkin <tparkin@katalix.com>
Date: Wed, 12 Dec 2012 12:50:54 +0000
Subject: [PATCH 2/2] l2tp: prevent tunnel creation on netns mismatch

l2tp_tunnel_create is passed a pointer to the network namespace for the
tunnel, along with an optional file descriptor for the tunnel which may
be passed in from userspace via netlink.

In the case where the file descriptor is defined, ensure that the namespace
associated with that socket matches the namespace explicitly passed to
l2tp_tunnel_create.
---
 net/l2tp/l2tp_core.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 1a9f372..f8d200b 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1528,6 +1528,13 @@ int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32
 			       tunnel_id, fd, err);
 			goto err;
 		}
+
+		/* Reject namespace mismatches */
+		if (!net_eq(sock_net(sock->sk), net)) {
+			pr_err("tunl %hu: netns mismatch\n", tunnel_id);
+			err = -EBADF; /* TODO -- what value? */
+			goto err;
+		}
 	}
 
 	sk = sock->sk;
-- 
1.7.9.5
-- 
Tom Parkin
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply related

* [PATCHv2 iproute2] add DOVE extensions for iproute2
From: David L Stevens @ 2012-12-12 16:10 UTC (permalink / raw)
  To: David Miller, Stephen Hemminger; +Cc: netdev


	This patch adds a new flag to iproute2 for vxlan devices to enable
DOVE features. It also adds support for L2 and L3 switch lookup miss
netlink messages to "ip monitor".

Changes since v1:
	- split "dove" flag into separate feature flags:
		- "proxy" for ARP reduction
		- "rsc" for route short circuiting
		- "l2miss" for L2 switch miss notifications
		- "l3miss" for L3 switch miss notifications

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 012d95a..a163702 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -283,6 +283,10 @@ enum {
 	IFLA_VXLAN_AGEING,
 	IFLA_VXLAN_LIMIT,
 	IFLA_VXLAN_PORT_RANGE,
+	IFLA_VXLAN_PROXY,
+	IFLA_VXLAN_RSC,
+	IFLA_VXLAN_L2MISS,
+	IFLA_VXLAN_L3MISS,
 	__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index ba5c4ab..f2e6bef 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -26,6 +26,8 @@ static void explain(void)
 	fprintf(stderr, "Usage: ... vxlan id VNI [ group ADDR ] [ local ADDR ]\n");
 	fprintf(stderr, "                 [ ttl TTL ] [ tos TOS ] [ dev PHYS_DEV ]\n");
 	fprintf(stderr, "                 [ port MIN MAX ] [ [no]learning ]\n");
+	fprintf(stderr, "                 [ [no]proxy ] [ [no]rsc ]\n");
+	fprintf(stderr, "                 [ [no]l2miss ] [ [no]l3miss ]\n");
 	fprintf(stderr, "\n");
 	fprintf(stderr, "Where: VNI := 0-16777215\n");
 	fprintf(stderr, "       ADDR := { IP_ADDRESS | any }\n");
@@ -44,6 +46,10 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 	__u8 tos = 0;
 	__u8 ttl = 0;
 	__u8 learning = 1;
+	__u8 proxy = 0;
+	__u8 rsc = 0;
+	__u8 l2miss = 0;
+	__u8 l3miss = 0;
 	__u8 noage = 0;
 	__u32 age = 0;
 	__u32 maxaddr = 0;
@@ -123,6 +129,22 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			learning = 0;
 		} else if (!matches(*argv, "learning")) {
 			learning = 1;
+		} else if (!matches(*argv, "noproxy")) {
+			proxy = 0;
+		} else if (!matches(*argv, "proxy")) {
+			proxy = 1;
+		} else if (!matches(*argv, "norsc")) {
+			rsc = 0;
+		} else if (!matches(*argv, "rsc")) {
+			rsc = 1;
+		} else if (!matches(*argv, "nol2miss")) {
+			l2miss = 0;
+		} else if (!matches(*argv, "l2miss")) {
+			l2miss = 1;
+		} else if (!matches(*argv, "nol3miss")) {
+			l3miss = 0;
+		} else if (!matches(*argv, "l3miss")) {
+			l3miss = 1;
 		} else if (matches(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -148,6 +170,10 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 	addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
 	addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
 	addattr8(n, 1024, IFLA_VXLAN_LEARNING, learning);
+	addattr8(n, 1024, IFLA_VXLAN_PROXY, proxy);
+	addattr8(n, 1024, IFLA_VXLAN_RSC, rsc);
+	addattr8(n, 1024, IFLA_VXLAN_L2MISS, l2miss);
+	addattr8(n, 1024, IFLA_VXLAN_L3MISS, l3miss);
 	if (noage)
 		addattr32(n, 1024, IFLA_VXLAN_AGEING, 0);
 	else if (age)
@@ -213,6 +239,18 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 	if (tb[IFLA_VXLAN_LEARNING] &&
 	    !rta_getattr_u8(tb[IFLA_VXLAN_LEARNING]))
 		fputs("nolearning ", f);
+ 
+	if (tb[IFLA_VXLAN_PROXY] && rta_getattr_u8(tb[IFLA_VXLAN_PROXY]))
+		fputs("proxy ", f);
+ 
+	if (tb[IFLA_VXLAN_RSC] && rta_getattr_u8(tb[IFLA_VXLAN_RSC]))
+		fputs("rsc ", f);
+
+	if (tb[IFLA_VXLAN_L2MISS] && rta_getattr_u8(tb[IFLA_VXLAN_L2MISS]))
+		fputs("l2miss ", f);
+
+	if (tb[IFLA_VXLAN_L3MISS] && rta_getattr_u8(tb[IFLA_VXLAN_L3MISS]))
+		fputs("l3miss ", f);
 	
 	if (tb[IFLA_VXLAN_TOS] &&
 	    (tos = rta_getattr_u8(tb[IFLA_VXLAN_TOS]))) {
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 4b1d469..7a7cc88 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -67,7 +67,8 @@ int accept_msg(const struct sockaddr_nl *who,
 		print_addrlabel(who, n, arg);
 		return 0;
 	}
-	if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH) {
+	if (n->nlmsg_type == RTM_NEWNEIGH || n->nlmsg_type == RTM_DELNEIGH ||
+	    n->nlmsg_type == RTM_GETNEIGH) {
 		if (prefix_banner)
 			fprintf(fp, "[NEIGH]");
 		print_neigh(who, n, arg);
diff --git a/ip/ipneigh.c b/ip/ipneigh.c
index 56e56b2..1b7600b 100644
--- a/ip/ipneigh.c
+++ b/ip/ipneigh.c
@@ -189,7 +189,8 @@ int print_neigh(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 	struct rtattr * tb[NDA_MAX+1];
 	char abuf[256];
 
-	if (n->nlmsg_type != RTM_NEWNEIGH && n->nlmsg_type != RTM_DELNEIGH) {
+	if (n->nlmsg_type != RTM_NEWNEIGH && n->nlmsg_type != RTM_DELNEIGH &&
+	    n->nlmsg_type != RTM_GETNEIGH) {
 		fprintf(stderr, "Not RTM_NEWNEIGH: %08x %08x %08x\n",
 			n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
 
@@ -251,6 +252,8 @@ int print_neigh(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 
 	if (n->nlmsg_type == RTM_DELNEIGH)
 		fprintf(fp, "delete ");
+	else if (n->nlmsg_type == RTM_GETNEIGH)
+		fprintf(fp, "miss ");
 	if (tb[NDA_DST]) {
 		fprintf(fp, "%s ",
 			format_host(r->ndm_family,

^ permalink raw reply

* Re: [patch net-next 0/4] net: allow to change carrier from userspace
From: Stephen Hemminger @ 2012-12-12 16:15 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, edumazet, bhutchings, mirqus, greearb, fbl
In-Reply-To: <1355309887-1081-1-git-send-email-jiri@resnulli.us>

On Wed, 12 Dec 2012 11:58:03 +0100
Jiri Pirko <jiri@resnulli.us> wrote:

> This is basically a repost of my previous patchset:
> "[patch net-next-2.6 0/2] net: allow to change carrier via sysfs" from Aug 30
> 
> The way net-sysfs stores values changed and this patchset reflects it.
> Also, I exposed carrier via rtnetlink iface.
> 
> So far, only dummy driver uses carrier change ndo. In very near future
> team driver will use that as well.
> 
> Jiri Pirko (4):
>   net: add change_carrier netdev op
>   net: allow to change carrier via sysfs
>   rtnl: expose carrier value with possibility to set it
>   dummy: implement carrier change
> 
>  drivers/net/dummy.c          | 10 ++++++++++
>  include/linux/netdevice.h    |  7 +++++++
>  include/uapi/linux/if_link.h |  1 +
>  net/core/dev.c               | 19 +++++++++++++++++++
>  net/core/net-sysfs.c         | 15 ++++++++++++++-
>  net/core/rtnetlink.c         | 10 ++++++++++
>  6 files changed, 61 insertions(+), 1 deletion(-)
> 

I needed to do the same thing for a project we are working on and discovered
that there is already a working, documented interface for doing that via
operstate mode. I therefore can't agree that the additional complexity
of a new API for this is justified.


* Re: [PATCH] iproute2: fix tc ematch manpage section
From: Stephen Hemminger @ 2012-12-12 16:16 UTC (permalink / raw)
  To: Andreas Henriksson; +Cc: netdev
In-Reply-To: <20121212112348.GA6520@amd64.fatal.se>

On Wed, 12 Dec 2012 12:23:48 +0100
Andreas Henriksson <andreas@fatal.se> wrote:

> The debian package checking tool, lintian, spotted that the
> tc ematch manpage seems to have an error in the specified section.
> 
> Signed-off-by: Andreas Henriksson <andreas@fatal.se>
> 
> diff --git a/man/man8/tc-ematch.8 b/man/man8/tc-ematch.8
> index 2eafc29..957a22e 100644
> --- a/man/man8/tc-ematch.8
> +++ b/man/man8/tc-ematch.8
> @@ -1,4 +1,4 @@
> -.TH filter ematch "6 August 2012" iproute2 Linux
> +.TH ematch 8 "6 August 2012" iproute2 Linux
>  .
>  .SH NAME
>  ematch \- extended matches for use with "basic" or "flow" filters

Applied, thanks.


* Re: [RFC PATCH net-next 4/4 V4] try to fix performance regression
From: Eric Dumazet @ 2012-12-12 16:25 UTC (permalink / raw)
  To: Weiping Pan; +Cc: davem, brutus, netdev
In-Reply-To: <5e333588f6cb48cc3464b2263dcaa734b952e4c1.1355320534.git.wpan@redhat.com>

On Wed, 2012-12-12 at 22:29 +0800, Weiping Pan wrote:

> 
>         MS       BASE    AF_UNIX    FRIENDS            TCP_STREAM_MS
>          1      10.70       5.40       4.02   37%   74%
>          2      28.01       9.67       7.97   28%   82%
>          4      55.53      19.78      16.48   29%   83%
>          8     115.40      38.22      33.51   29%   87%
>         16     227.31      81.06      67.70   29%   83%
>         32     446.20     166.59     129.31   28%   77%
>         64     849.04     336.77     259.43   30%   77%
>        128    1440.50     661.88     530.43   36%   80%
>        256    2404.70    1279.67    1029.15   42%   80%
>        512    4331.53    2501.30    1942.21   44%   77%
>       1024    6819.78    4622.37    4128.10   60%   89%
>       2048   10544.60    6348.81    6349.59   60%  100%
>       4096   12830.41    8324.43    7984.43   62%   95%
>       8192   13462.65    8355.49   11079.37   82%  132%
>      16384    9960.87   10840.13   13037.81  130%  120%
>      32768    8749.31   11372.15   15087.08  172%  132%
>      65536    7580.27   12150.23   14971.42  197%  123%
>     131072    6727.74   11451.34   13604.78  202%  118%
>     262144    7673.14   11613.10   11436.97  149%   98%
>     524288    7366.17   11675.95   11559.43  156%   99%
>    1048576    6608.57   11883.01   10103.20  152%   85%
> MS means Message Size in bytes, that is -m -M for netperf

I can't reproduce your strange numbers here; they make no sense to me.

for s in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 \
         65536 131072 262144 524288 1048576
do
 ./netperf -- -m $s -M $s | tail -n1
done

Results:

 87380  16384        1    10.00       34.68
 87380  16384        2    10.00       68.07
 87380  16384        4    10.00      126.27
 87380  16384        8    10.00      284.50
 87380  16384       16    10.00      574.38
 87380  16384       32    10.00     1091.74
 87380  16384       64    10.00     2130.23
 87380  16384      128    10.00     4001.83
 87380  16384      256    10.00     7666.01
 87380  16384      512    10.00    13425.81
 87380  16384     1024    10.00    21146.43
 87380  16384     2048    10.00    28551.42
 87380  16384     4096    10.00    37878.95
 87380  16384     8192    10.00    42507.23
 87380  16384    16384    10.00    46782.53
 87380  16384    32768    10.00    42410.97
 87380  16384    65536    10.00    43053.09
 87380  16384   131072    10.00    44504.20
 87380  16384   262144    10.00    50211.74
 87380  16384   524288    10.00    54004.23
 87380  16384  1048576    10.00    53852.26


* Re: [PATCH] net: filter: return -EINVAL if BPF_S_ANC* operation is not supported
From: Daniel Borkmann @ 2012-12-12 16:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Ani Sinha
In-Reply-To: <1355314964.9139.173.camel@edumazet-glaptop>

On 12/12/2012 01:22 PM, Eric Dumazet wrote:
> On Wed, 2012-12-12 at 10:31 +0100, Daniel Borkmann wrote:
>> Currently, we return -EINVAL for malicious or wrong BPF filters.
>> However, this is not done for BPF_S_ANC* operations, which makes it
>> more difficult to detect if it's actually supported or not by the
>> BPF machine. Therefore, we should also return -EINVAL if K is within
>> the SKF_AD_OFF universe and the ancillary operation did not match.
>>
>> Cc: Ani Sinha <ani@aristanetworks.com>
>> Cc: Eric Dumazet <eric.dumazet@gmail.com>
>> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
>> ---
>>   net/core/filter.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index c23543c..de9bed4 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -531,7 +531,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>>   		[BPF_JMP|BPF_JSET|BPF_K] = BPF_S_JMP_JSET_K,
>>   		[BPF_JMP|BPF_JSET|BPF_X] = BPF_S_JMP_JSET_X,
>>   	};
>> -	int pc;
>> +	int pc, anc_found;
>>
>>   	if (flen == 0 || flen > BPF_MAXINSNS)
>>   		return -EINVAL;
>> @@ -592,8 +592,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>>   		case BPF_S_LD_W_ABS:
>>   		case BPF_S_LD_H_ABS:
>>   		case BPF_S_LD_B_ABS:
>> +			anc_found = 0;
>>   #define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE:	\
>>   				code = BPF_S_ANC_##CODE;	\
>> +				anc_found = 1;			\
>>   				break
>>   			switch (ftest->k) {
>>   			ANCILLARY(PROTOCOL);
>> @@ -610,6 +612,10 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>>   			ANCILLARY(VLAN_TAG);
>>   			ANCILLARY(VLAN_TAG_PRESENT);
>>   			}
>> +
>> +			/* ancillary operation unknown or unsupported */
>> +			if (anc_found == 0 && ftest->k >= SKF_AD_OFF)
>> +				return -EINVAL;
>>   		}
>>   		ftest->code = code;
>>   	}
>
> Several points :
>
> 1) This might break a userland filter that was previously working, by
> returning 0 when load_pointer() returns NULL.
>
> Specifying an offset bigger than skb->len is not _invalid_, it only
> makes a filter return 0, because load_pointer() returns NULL.

I think it will not break code that passed the sk_chk_filter() test and
calls load_pointer() in such a circumstance. However, it will "break"
code that calls ...

   { BPF_LD | BPF_(W|H|B) | BPF_ABS, 0, 0, <K> },

... where <K> is in [0xfffff000, 0xffffffff] _and_ <K> is not an ancillary.

But ...

Assuming some old code has such an instruction where <K> is within
[0xfffff000, 0xffffffff] and it doesn't know about ancillary operations, then
this will give unexpected/unwanted behavior as well (since the BPF machine
does not return 0, as was probably the case before anc. ops, but loads
something into the accumulator instead and continues with the next
instruction, for instance), right? Thus, following this argument, user
space code would already have been broken by the introduction of ancillary
operations into the BPF machine per se.

This is probably just an assumption, but code that does such a direct load,
e.g. "load word at packet offset 0xffffffff into accumulator" ("ld [0xffffffff]")
is quite broken, isn't it? Isn't the whole assumption of ancillary operations
that no one intentionally calls things like "ld [0xffffffff]" and expects this
word to be loaded from the packet offset?
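
To make the proposed rule concrete, here is a minimal userspace sketch of
the check being discussed. It is not the kernel code: the sketch_* names and
the list of supported offsets are made up for illustration, and the only
value taken from the discussion above is the start of the ancillary range,
0xfffff000. A constant <K> below that range is treated as a normal packet
offset, a known ancillary offset is accepted, and anything else in the range
is rejected with -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First K value of the ancillary range, i.e. (uint32_t)-0x1000. */
#define SKETCH_AD_OFF 0xfffff000u

/* Hypothetical set of ancillary offsets this model "kernel" supports. */
static const uint32_t sketch_supported_anc[] = { 0, 4, 8 };

/* Return true if the offset (relative to SKETCH_AD_OFF) is supported. */
static bool sketch_anc_supported(uint32_t ad)
{
	size_t n = sizeof(sketch_supported_anc) / sizeof(sketch_supported_anc[0]);
	size_t i;

	for (i = 0; i < n; i++)
		if (sketch_supported_anc[i] == ad)
			return true;
	return false;
}

/*
 * Model of the proposed check for an absolute load with constant K.
 * Returns 0 when the instruction is accepted, -EINVAL when K lies in
 * the ancillary range but names no supported ancillary operation.
 */
static int sketch_check_ld_abs(uint32_t k)
{
	if (k < SKETCH_AD_OFF)
		return 0;	/* plain packet offset; bounds are a run-time matter */
	if (sketch_anc_supported(k - SKETCH_AD_OFF))
		return 0;	/* would be rewritten to an ancillary opcode */
	return -EINVAL;
}

/* Usage: one accepted and one rejected load. */
static void sketch_self_check(void)
{
	assert(sketch_check_ld_abs(SKETCH_AD_OFF + 4) == 0);
	assert(sketch_check_ld_abs(0xffffffffu) == -EINVAL);
}
```

With such a check, user space could probe for an ancillary operation by
attaching a one-instruction filter and looking for -EINVAL, which is the
use case argued for here.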

> 2) This won't help applications running on old kernels where your patch
> won't be applied, as already mentioned yesterday.

Agreed, but leaving old kernels aside, it would be nice if newer kernels
could validate that, so at least from kernel <xyz> onwards it could be
checked _for sure_ if anc.op <abc> is present and can be used.

> 3) Misses a "Reported-by" tag
>
> 4) anc_found is a boolean

3 + 4 agreed, sorry for that. I could do a v2 of the patch with 3 + 4 fixed
and resubmit it, if there's interest ...

> To be truly portable, userland should not rely on kernel doing a full
> validation of ancillaries.


* Re: [PATCH] tun: allow setting ethernet addresss while running
From: Stephen Hemminger @ 2012-12-12 16:38 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: davem, netdev, jasowang
In-Reply-To: <alpine.LNX.2.01.1212120427370.16297@nerf07.vanv.qr>

On Wed, 12 Dec 2012 04:27:54 +0100 (CET)
Jan Engelhardt <jengelh@inai.de> wrote:

> On Tuesday 2012-12-11 02:16, Stephen Hemminger wrote:
> 
> >This is a pure software device, and ok with live address change.
> >--- a/drivers/net/tun.c
> >+++ b/drivers/net/tun.c
> >@@ -849,6 +849,7 @@ static void tun_net_init(struct net_device *dev)
> > 		/* Ethernet TAP Device */
> > 		ether_setup(dev);
> > 		dev->priv_flags &= ~IFF_TX_SKB_SHARING;
> >+		dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
> > 
> > 		eth_hw_addr_random(dev);
> 
> Would this possibly apply to L2TP devices as well?

L2TP does not allow changing the MAC address at all right now.
Only drivers that use eth_mac_addr can take advantage of the flag.
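
As a rough illustration of why the flag only matters for such drivers, the
following userspace model sketches the decision a shared set-address helper
would make. It assumes the helper refuses a change on a running device
unless the live-change flag is set; the struct and the sketch_* names are
hypothetical and are not kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN 6

/* Hypothetical stand-in for the few device fields the check needs. */
struct sketch_netdev {
	bool running;		/* device is up */
	bool live_addr_change;	/* models IFF_LIVE_ADDR_CHANGE being set */
	uint8_t dev_addr[SKETCH_ETH_ALEN];
};

/* A usable station address is non-zero and has the multicast bit clear. */
static bool sketch_valid_unicast(const uint8_t *mac)
{
	static const uint8_t zero[SKETCH_ETH_ALEN];

	return !(mac[0] & 1) && memcmp(mac, zero, SKETCH_ETH_ALEN) != 0;
}

/* Returns 0 on success, -EBUSY or -EADDRNOTAVAIL on refusal. */
static int sketch_set_mac(struct sketch_netdev *dev, const uint8_t *mac)
{
	if (dev->running && !dev->live_addr_change)
		return -EBUSY;		/* running device without the flag */
	if (!sketch_valid_unicast(mac))
		return -EADDRNOTAVAIL;
	memcpy(dev->dev_addr, mac, SKETCH_ETH_ALEN);
	return 0;
}

/* Usage: the same request fails without the flag and succeeds with it. */
static void sketch_mac_self_check(void)
{
	struct sketch_netdev dev = { .running = true };
	const uint8_t mac[SKETCH_ETH_ALEN] = { 0x02, 0, 0, 0, 0, 0x01 };

	assert(sketch_set_mac(&dev, mac) == -EBUSY);
	dev.live_addr_change = true;
	assert(sketch_set_mac(&dev, mac) == 0);
}
```

A driver with its own set-address routine never reaches a check like this,
so setting the flag there would have no effect.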


Looking around, here are the other places that could use it:
 vxlan, xen-netfront?, gre, gre6, virtio_net?, hyperv?

Also, the following look buggy:
  c2 allows changing the MAC address but never tells the hardware?
  isdn/hysdn_net.c allows setting the MAC address but then resets it
      to the card value in net_open
  xpnet allows setting the address but it looks like it is fixed by hardware
  ipddp allows setting an ethernet address but the protocol is not ethernet


* [PATCH net-next] uapi: add missing netconf.h to export list
From: Stephen Hemminger @ 2012-12-12 16:58 UTC (permalink / raw)
  To: David Miller; +Cc: Nicolas Dichtel, netdev
In-Reply-To: <1355305907-7102-1-git-send-email-nicolas.dichtel@6wind.com>

Add netconf.h for use by iproute2.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/include/uapi/linux/Kbuild	2012-10-25 09:11:15.499273810 -0700
+++ b/include/uapi/linux/Kbuild	2012-12-12 08:56:36.130263710 -0800
@@ -258,6 +258,7 @@ header-y += neighbour.h
 header-y += net.h
 header-y += net_dropmon.h
 header-y += net_tstamp.h
+header-y += netconf.h
 header-y += netdevice.h
 header-y += netfilter.h
 header-y += netfilter_arp.h


* Re: [PATCH iproute2 1/3] ip: add support of netconf messages
From: Stephen Hemminger @ 2012-12-12 16:59 UTC (permalink / raw)
  To: Nicolas Dichtel; +Cc: netdev
In-Reply-To: <1355305907-7102-1-git-send-email-nicolas.dichtel@6wind.com>

OK, but the headers for all of iproute2 are supposed to come from
sanitized kernel headers generated by "make headers_install".

You missed that piece in the original patch.
