Netdev List
* [PATCH net-next 1/8] net/mlx4_core: Prevent VF from changing port configuration
From: Or Gerlitz @ 2014-10-30 16:06 UTC
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli
In-Reply-To: <1414685216-28907-1-git-send-email-ogerlitz@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>

Add a wrapper for the ACCESS_REG command that handles guest access to HW
registers: write operations are rejected, while reads are still allowed.

This prevents SR-IOV guests from changing the port PTYS configuration,
such as speed and advertised link modes.

Fixes: adbc7ac5c15e ('net/mlx4_core: Introduce ACCESS_REG CMD [...]')
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c  |    2 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c   |   30 ++++++++++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h |    5 ++++
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 916459e..1312ccf 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -1345,7 +1345,7 @@ static struct mlx4_cmd_info cmd_info[] = {
 		.out_is_imm = false,
 		.encode_slave_id = false,
 		.verify = NULL,
-		.wrapper = NULL,
+		.wrapper = mlx4_ACCESS_REG_wrapper,
 	},
 	/* Native multicast commands are not available for guests */
 	{
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 72289ef..e7639e3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -2220,7 +2220,7 @@ static int mlx4_ACCESS_REG(struct mlx4_dev *dev, u16 reg_id,
 	memcpy(inbuf->reg_data, reg_data, reg_len);
 	err = mlx4_cmd_box(dev, inbox->dma, outbox->dma, 0, 0,
 			   MLX4_CMD_ACCESS_REG, MLX4_CMD_TIME_CLASS_C,
-			   MLX4_CMD_NATIVE);
+			   MLX4_CMD_WRAPPED);
 	if (err)
 		goto out;
 
@@ -2263,3 +2263,31 @@ int mlx4_ACCESS_PTYS_REG(struct mlx4_dev *dev,
 			       method, sizeof(*ptys_reg), ptys_reg);
 }
 EXPORT_SYMBOL_GPL(mlx4_ACCESS_PTYS_REG);
+
+int mlx4_ACCESS_REG_wrapper(struct mlx4_dev *dev, int slave,
+			    struct mlx4_vhcr *vhcr,
+			    struct mlx4_cmd_mailbox *inbox,
+			    struct mlx4_cmd_mailbox *outbox,
+			    struct mlx4_cmd_info *cmd)
+{
+	struct mlx4_access_reg *inbuf = inbox->buf;
+	u8 method = inbuf->method & MLX4_ACCESS_REG_METHOD_MASK;
+	u16 reg_id = be16_to_cpu(inbuf->reg_id);
+
+	if (slave != mlx4_master_func_num(dev) &&
+	    method == MLX4_ACCESS_REG_WRITE)
+		return -EPERM;
+
+	if (reg_id == MLX4_REG_ID_PTYS) {
+		struct mlx4_ptys_reg *ptys_reg =
+			(struct mlx4_ptys_reg *)inbuf->reg_data;
+
+		ptys_reg->local_port =
+			mlx4_slave_convert_port(dev, slave,
+						ptys_reg->local_port);
+	}
+
+	return mlx4_cmd_box(dev, inbox->dma, outbox->dma, vhcr->in_modifier,
+			    0, MLX4_CMD_ACCESS_REG, MLX4_CMD_TIME_CLASS_C,
+			    MLX4_CMD_NATIVE);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index de10dbb..254ec7b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -1273,6 +1273,11 @@ int mlx4_QP_FLOW_STEERING_DETACH_wrapper(struct mlx4_dev *dev, int slave,
 					 struct mlx4_cmd_mailbox *inbox,
 					 struct mlx4_cmd_mailbox *outbox,
 					 struct mlx4_cmd_info *cmd);
+int mlx4_ACCESS_REG_wrapper(struct mlx4_dev *dev, int slave,
+			    struct mlx4_vhcr *vhcr,
+			    struct mlx4_cmd_mailbox *inbox,
+			    struct mlx4_cmd_mailbox *outbox,
+			    struct mlx4_cmd_info *cmd);
 
 int mlx4_get_mgm_entry_size(struct mlx4_dev *dev);
 int mlx4_get_qp_per_mgm(struct mlx4_dev *dev);
-- 
1.7.1
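
Note: the access policy described in the commit message above can be
sketched in isolation as follows. This is a minimal userspace sketch, not
the driver code: the sketch_* names and the numeric method values are
assumptions made for illustration, and only the rule itself (writes are
rejected for any function other than the master, reads pass) is taken
from the patch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's constants; values are illustrative. */
enum sketch_method {
	SKETCH_METHOD_QUERY = 0x1,
	SKETCH_METHOD_WRITE = 0x2,
};
#define SKETCH_METHOD_MASK 0x7f

/*
 * Returns 0 if the request may be forwarded to firmware, -EPERM otherwise.
 * Only the master function may write; any function may read.
 */
static int sketch_access_reg_check(int slave, int master, uint8_t raw_method)
{
	/* Strip the bits that are not part of the method field. */
	uint8_t method = raw_method & SKETCH_METHOD_MASK;

	if (slave != master && method == SKETCH_METHOD_WRITE)
		return -EPERM;

	return 0;
}
```

Under these assumptions, a write issued by function 2 while function 0 is
the master is refused with -EPERM, whereas a query from the same function
is allowed through.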


* [PATCH net-next 0/8] Mellanox ethernet driver update Oct-30-2014
From: Or Gerlitz @ 2014-10-30 16:06 UTC
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Shani Michaeli,
	Or Gerlitz

Hi Dave,

The 1st patch from Saeed fixes a bug in the last net-next batch where
a VF could change the port configuration. The next patch from Amir
fixes a race in the port VPI logic. Next are two performance patches from Ido.

The last four patches from Shani, Matan and myself add support for CHECKSUM_COMPLETE
reporting on non-TCP/UDP packets such as GRE and ICMP. I'd like to deeply thank
Jerry Chu for his innovation and support in that effort.

Or.

Amir Vadai (1):
  net/mlx4_core: Protect port type setting by mutex

Ido Shamay (2):
  net/mlx4_en: Remove RX buffers alignment to IP_ALIGN
  net/mlx4_en: Add __GFP_COLD gfp flags in alloc_pages

Matan Barak (1):
  net/mlx4_core: Add retrieval of CONFIG_DEV parameters

Or Gerlitz (1):
  net/mlx4_en: Remove redundant code from RX/GRO path

Saeed Mahameed (1):
  net/mlx4_core: Prevent VF from changing port configuration

Shani Michaeli (2):
  net: Add calculation of non folded IPV6 pseudo header checksum
  net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

 drivers/net/ethernet/mellanox/mlx4/cmd.c           |    6 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c    |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c     |    5 +
 drivers/net/ethernet/mellanox/mlx4/en_port.c       |    2 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c         |  186 ++++++++++++--------
 drivers/net/ethernet/mellanox/mlx4/fw.c            |  118 ++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/main.c          |   18 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h          |   10 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h       |    6 +-
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |   17 ++
 include/linux/mlx4/cmd.h                           |   29 +++
 include/linux/mlx4/device.h                        |    4 +-
 include/net/ip6_checksum.h                         |   21 +++
 13 files changed, 339 insertions(+), 85 deletions(-)


* Re: [PATCH net] gre: Use inner mac length when computing tunnel length
From: Alexander Duyck @ 2014-10-30 15:52 UTC
  To: Tom Herbert, davem, alexander.duyck, netdev
In-Reply-To: <1414683656-26493-1-git-send-email-therbert@google.com>

On 10/30/2014 08:40 AM, Tom Herbert wrote:
> Currently, skb_inner_network_header is used but this does not account
> for Ethernet header for ETH_P_TEB. Use skb_inner_mac_header which
> handles TEB and also should work with IP encapsulation in which case
> inner mac and inner network headers are the same.
>
> Tested: Ran TCP_STREAM over GRE, worked as expected.
>
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
>   net/ipv4/gre_offload.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
> index f6e345c..bb5947b 100644
> --- a/net/ipv4/gre_offload.c
> +++ b/net/ipv4/gre_offload.c
> @@ -47,7 +47,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
>
>   	greh = (struct gre_base_hdr *)skb_transport_header(skb);
>
> -	ghl = skb_inner_network_header(skb) - skb_transport_header(skb);
> +	ghl = skb_inner_mac_header(skb) - skb_transport_header(skb);
>   	if (unlikely(ghl < sizeof(*greh)))
>   		goto out;
>
>

This works for me.  We probably need to queue this up for stable as well 
since this bug goes back as far as 3.14.

Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com>


* [PATCH iproute2] ss: Identify more netlink protocol names
From: Vadim Kochan @ 2014-10-30 15:33 UTC
  To: netdev; +Cc: Vadim Kochan

Only a few Netlink protocol names
were printed on the screen:

    rtnl, fw, tcpdiag

So add the ability to identify the Netlink protocol name
from /etc/iproute2/nl_protos or from a static table.

Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
---
 etc/iproute2/nl_protos | 23 ++++++++++++++
 include/rt_names.h     |  2 ++
 lib/rt_names.c         | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++
 misc/ss.c              | 17 ++++++-----
 4 files changed, 116 insertions(+), 8 deletions(-)
 create mode 100644 etc/iproute2/nl_protos

diff --git a/etc/iproute2/nl_protos b/etc/iproute2/nl_protos
new file mode 100644
index 0000000..43418f3
--- /dev/null
+++ b/etc/iproute2/nl_protos
@@ -0,0 +1,23 @@
+# Netlink protocol names mapping
+
+0   rtnl
+1   unused
+2   usersock
+3   fw
+4   tcpdiag
+5   nflog
+6   xfrm
+7   selinux
+8   iscsi
+9   audit
+10  fiblookup
+11  connector
+12  nft 
+13  ip6fw
+14  dec-rt
+15  uevent
+16  genl
+18  scsi-trans
+19  ecryptfs
+20  rdma
+21  crypto 
diff --git a/include/rt_names.h b/include/rt_names.h
index 56b649a..c0ea4f9 100644
--- a/include/rt_names.h
+++ b/include/rt_names.h
@@ -29,5 +29,7 @@ int ll_addr_a2n(char *lladdr, int len, const char *arg);
 const char * ll_proto_n2a(unsigned short id, char *buf, int len);
 int ll_proto_a2n(unsigned short *id, const char *buf);
 
+const char *nl_proto_n2a(int id, char *buf, int len);
+int nl_proto_a2n(__u32 *id, const char *arg);
 
 #endif
diff --git a/lib/rt_names.c b/lib/rt_names.c
index 911e4d2..184f590 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -525,3 +525,85 @@ const char *rtnl_group_n2a(int id, char *buf, int len)
 	snprintf(buf, len, "%d", id);
 	return buf;
 }
+
+static char *nl_proto_tab[256] = {
+	[NETLINK_ROUTE]          = "rtnl",
+	[NETLINK_UNUSED]         = "unused",
+	[NETLINK_USERSOCK]       = "usersock",
+	[NETLINK_FIREWALL]       = "fw",
+	[NETLINK_SOCK_DIAG]      = "tcpdiag",
+	[NETLINK_NFLOG]          = "nflog",
+	[NETLINK_XFRM]           = "xfrm",
+	[NETLINK_SELINUX]        = "selinux",
+	[NETLINK_ISCSI]          = "iscsi",
+	[NETLINK_AUDIT]          = "audit",
+	[NETLINK_FIB_LOOKUP]     = "fiblookup",
+	[NETLINK_CONNECTOR]      = "connector",
+	[NETLINK_NETFILTER]      = "nft",
+	[NETLINK_IP6_FW]         = "ip6fw",
+	[NETLINK_DNRTMSG]        = "dec-rt",
+	[NETLINK_KOBJECT_UEVENT] = "uevent",
+	[NETLINK_GENERIC]        = "genl",
+	[NETLINK_SCSITRANSPORT]  = "scsi-trans",
+	[NETLINK_ECRYPTFS]       = "ecryptfs",
+	[NETLINK_RDMA]           = "rdma",
+	[NETLINK_CRYPTO]         = "crypto",
+};
+
+static int nl_proto_init;
+
+static void nl_proto_initialize(void)
+{
+	nl_proto_init = 1;
+	rtnl_tab_initialize(CONFDIR "/nl_protos",
+			    nl_proto_tab, 256);
+}
+
+const char *nl_proto_n2a(int id, char *buf, int len)
+{
+	if (id < 0 || id >= 256) {
+		snprintf(buf, len, "%u", id);
+		return buf;
+	}
+
+	if (!nl_proto_init)
+		nl_proto_initialize();
+
+	if (nl_proto_tab[id])
+		return nl_proto_tab[id];
+
+	snprintf(buf, len, "%u", id);
+	return buf;
+}
+
+int nl_proto_a2n(__u32 *id, const char *arg)
+{
+	static char *cache = NULL;
+	static unsigned long res;
+	char *end;
+	int i;
+
+	if (cache && strcmp(cache, arg) == 0) {
+		*id = res;
+		return 0;
+	}
+
+	if (!nl_proto_init)
+		nl_proto_initialize();
+
+	for (i = 0; i < 256; i++) {
+		if (nl_proto_tab[i] &&
+		    strcmp(nl_proto_tab[i], arg) == 0) {
+			cache = nl_proto_tab[i];
+			res = i;
+			*id = res;
+			return 0;
+		}
+	}
+
+	res = strtoul(arg, &end, 0);
+	if (!end || end == arg || *end || res > 255)
+		return -1;
+	*id = res;
+	return 0;
+}
diff --git a/misc/ss.c b/misc/ss.c
index b7e0ef0..291d85f 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2979,6 +2979,8 @@ static void netlink_show_one(struct filter *f,
 				int rq, int wq,
 				unsigned long long sk, unsigned long long cb)
 {
+	SPRINT_BUF(prot_name);
+
 	if (f->f) {
 		struct tcpstat tst;
 		tst.local.family = AF_NETLINK;
@@ -2996,14 +2998,13 @@ static void netlink_show_one(struct filter *f,
 	if (state_width)
 		printf("%-*s ", state_width, "UNCONN");
 	printf("%-6d %-6d ", rq, wq);
-	if (resolve_services && prot == 0)
-		printf("%*s:", addr_width, "rtnl");
-	else if (resolve_services && prot == 3)
-		printf("%*s:", addr_width, "fw");
-	else if (resolve_services && prot == 4)
-		printf("%*s:", addr_width, "tcpdiag");
-	else
-		printf("%*d:", addr_width, prot);
+
+	if (resolve_services)
+	{
+		printf("%*s:", addr_width, nl_proto_n2a(prot, prot_name,
+					sizeof(prot_name)));
+	}
+
 	if (pid == -1) {
 		printf("%-*s ", serv_width, "*");
 	} else if (resolve_services) {
-- 
2.1.0
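
Note: the lookup strategy in this patch (a name table indexed by protocol
number, with a numeric fallback for unknown or out-of-range ids) can be
sketched on its own as below. This is a simplified userspace sketch: the
sketch_* names are hypothetical, the table holds only a few of the entries
from etc/iproute2/nl_protos, and the file-loading step done by
rtnl_tab_initialize() is left out. The fallback is formatted with %d here
because the id is a signed int.

```c
#include <stdio.h>

#define SKETCH_NL_PROTO_MAX 256

/* Illustrative subset of the table; indices follow etc/iproute2/nl_protos. */
static const char *sketch_nl_proto_tab[SKETCH_NL_PROTO_MAX] = {
	[0]  = "rtnl",
	[3]  = "fw",
	[4]  = "tcpdiag",
	[16] = "genl",
};

/*
 * Returns the symbolic name for a known protocol number. For unknown or
 * out-of-range numbers, formats the number into buf and returns buf.
 */
static const char *sketch_nl_proto_n2a(int id, char *buf, size_t len)
{
	if (id >= 0 && id < SKETCH_NL_PROTO_MAX && sketch_nl_proto_tab[id])
		return sketch_nl_proto_tab[id];

	snprintf(buf, len, "%d", id);
	return buf;
}
```

With this table, id 16 resolves to "genl", while id 17, which has no entry,
falls back to the string "17".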


* [PATCH net] gre: Use inner mac length when computing tunnel length
From: Tom Herbert @ 2014-10-30 15:40 UTC
  To: davem, alexander.duyck, netdev

Currently, skb_inner_network_header is used but this does not account
for Ethernet header for ETH_P_TEB. Use skb_inner_mac_header which
handles TEB and also should work with IP encapsulation in which case
inner mac and inner network headers are the same.

Tested: Ran TCP_STREAM over GRE, worked as expected.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/ipv4/gre_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index f6e345c..bb5947b 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -47,7 +47,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
 
 	greh = (struct gre_base_hdr *)skb_transport_header(skb);
 
-	ghl = skb_inner_network_header(skb) - skb_transport_header(skb);
+	ghl = skb_inner_mac_header(skb) - skb_transport_header(skb);
 	if (unlikely(ghl < sizeof(*greh)))
 		goto out;
 
-- 
2.1.0.rc2.206.gedb03e5
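
Note: the arithmetic behind this fix can be illustrated with plain byte
offsets. The sketch below is a userspace model, not kernel code: the
sketch_* names and the example offsets are assumptions, and the only facts
taken from the thread are that a gretap (ETH_P_TEB) packet carries an
Ethernet header between the GRE header and the inner IP header, and that
for IP encapsulation the inner mac and inner network headers coincide.

```c
#include <stddef.h>

#define SKETCH_ETH_HLEN 14 /* Ethernet header length in bytes */

/* Hypothetical header offsets, in bytes from the start of the packet. */
struct sketch_offsets {
	size_t transport;     /* start of the GRE header */
	size_t inner_mac;     /* start of the encapsulated frame */
	size_t inner_network; /* start of the encapsulated IP header */
};

/* Tunnel header length as computed before the patch. */
static size_t ghl_from_inner_network(const struct sketch_offsets *o)
{
	return o->inner_network - o->transport;
}

/* Tunnel header length as computed after the patch. */
static size_t ghl_from_inner_mac(const struct sketch_offsets *o)
{
	return o->inner_mac - o->transport;
}

/* gretap (ETH_P_TEB): an Ethernet header sits between GRE and inner IP. */
static struct sketch_offsets sketch_gretap(size_t transport, size_t gre_hlen)
{
	struct sketch_offsets o = {
		.transport = transport,
		.inner_mac = transport + gre_hlen,
		.inner_network = transport + gre_hlen + SKETCH_ETH_HLEN,
	};
	return o;
}

/* Plain IP in GRE: inner mac and inner network offsets are the same. */
static struct sketch_offsets sketch_ipgre(size_t transport, size_t gre_hlen)
{
	struct sketch_offsets o = {
		.transport = transport,
		.inner_mac = transport + gre_hlen,
		.inner_network = transport + gre_hlen,
	};
	return o;
}
```

For a gretap packet with a 4-byte GRE header, the inner-mac variant yields
4 while the inner-network variant yields 18, i.e. the Ethernet header is
wrongly counted as part of the tunnel header. For plain IP in GRE both
variants agree.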


* Re: [PATCH iproute2] ss: Identify a lot of netlink protocol names
From: vadim4j @ 2014-10-30 15:26 UTC
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20141029224932.0df2edba@urahara>

On Wed, Oct 29, 2014 at 10:49:32PM -0700, Stephen Hemminger wrote:
> On Thu, 16 Oct 2014 19:46:58 +0300
> Vadim Kochan <vadim4j@gmail.com> wrote:
> 
> > There were only few Netlink protocol names:
> > 
> >     rtnl, fw, tcpdiag
> > 
> > which were printed on output.
> > So added the other ones.
> > 
> > Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
> 
> Please make this driven off of a file in /etc/iproute2/ rather than
> hard coding a big switch in the code.
> 
Yes, good idea, will do.

Regards,


* Re: [PATCH net] gre: Fix regression in gretap TSO support
From: Alexander Duyck @ 2014-10-30 15:32 UTC
  To: Tom Herbert
  Cc: Neal Cardwell, Pravin Shelar, Alexander Duyck, netdev,
	David Miller, H.K. Jerry Chu, Eric Dumazet
In-Reply-To: <CA+mtBx_AAexNNktyZDFFZwqzgEh_FJbdekvtYVLn7cc2AGLFqA@mail.gmail.com>


On 10/30/2014 08:05 AM, Tom Herbert wrote:
> On Thu, Oct 30, 2014 at 7:30 AM, Alexander Duyck
> <alexander.h.duyck@redhat.com> wrote:
>> On 10/30/2014 06:51 AM, Neal Cardwell wrote:
>>> On Thu, Oct 30, 2014 at 1:14 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>> On Wed, Oct 29, 2014 at 8:26 PM,  <alexander.duyck@gmail.com> wrote:
>>>>> From: Alexander Duyck <alexander.h.duyck@redhat.com>
>>>>>
>>>>> On recent kernels I found that TSO on gretap interfaces didn't work.
>>>>> After
>>>>> bisecting it I found that commit b884b1a4 had introduced a regression in
>>>>> which the Ethernet header was being included in the GRE header length.
>>>>>
>>>>> This change corrects that by basing the GRE header length on the inner
>>>>> mac
>>>>> header in the case of GRE tunnels using transparent Ethernet bridging,
>>>>> and
>>>>> uses the network header for all other GRE tunnel types.
>>>>>
>>>>> Fixes: b884b1a4 ("gre_offload: simplify GRE header length calculation in
>>>>> gre_gso_segment()")
>>> Hmm. There may be other protocols, either now or in the future, where
>>> we want to be able to have a mac header inside the GRE header, rather
>>> than a network header. AFAICT it would be safer to revert b884b1a4,
>>> and go back to the previous code (from c50cd357), where we parse the
>>> GRE header to figure out its length.
>>>
>>> neal
>>
>> The change is consistent with how we handle this in other spots throughout
>> the kernel.  If nothing else you can just search for ETH_P_TEB and you will
>> find multiple spots in the kernel where IP tunnels differentiate between
>> transparent Ethernet bridging and regular IP in IP tunnels by checking for
>> the protocol ETH_P_TEB.
>>
> I'm not sure I understand this. We always use inner mac header in
> __skb_udp_tunnel_segment for computing tunnel length and don't
> distinguish between Ethernet or IP encapsulation. Presumably, in the
> case of IP encapsulation inner mac header is equal to inner network
> header. Why is this different for GRE?
>
> Thanks,
> Tom

I'll dig into that a bit more and see if I can simplify this.  I just 
wasn't sure if the inner mac header was being initialized or not in the 
case of IP in IP tunnels.

Thanks,

Alex


* Re: [PATCH net] gre: Fix regression in gretap TSO support
From: Tom Herbert @ 2014-10-30 15:32 UTC
  To: Alexander Duyck
  Cc: Neal Cardwell, Pravin Shelar, Alexander Duyck, netdev,
	David Miller, H.K. Jerry Chu, Eric Dumazet
In-Reply-To: <CA+mtBx_AAexNNktyZDFFZwqzgEh_FJbdekvtYVLn7cc2AGLFqA@mail.gmail.com>

> I'm not sure I understand this. We always use inner mac header in
> __skb_udp_tunnel_segment for computing tunnel length and don't
> distinguish between Ethernet or IP encapsulation. Presumably, in the
> case of IP encapsulation inner mac header is equal to inner network
> header. Why is this different for GRE?
>

Using skb_inner_mac_header seems to work okay for IP encapsulation.
I'll post the patch momentarily.

Tom


> Thanks,
> Tom
>
>> Thanks,
>>
>> Alex
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH net-next v4 1/4] netns: add genl cmd to add and get peer netns ids
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC
  To: netdev, containers, linux-kernel, linux-api
  Cc: davem, ebiederm, stephen, akpm, luto, cwang, Nicolas Dichtel
In-Reply-To: <1414682728-4532-1-git-send-email-nicolas.dichtel@6wind.com>

With this patch, a user can define an id for a peer netns by providing an FD or a
PID. These ids are local to a netns (i.e. valid only within that netns).

This will be useful for netlink messages when an x-netns interface is dumped.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
---
 MAINTAINERS                 |   1 +
 include/net/net_namespace.h |   5 ++
 include/uapi/linux/Kbuild   |   1 +
 include/uapi/linux/netns.h  |  38 +++++++++
 net/core/net_namespace.c    | 195 ++++++++++++++++++++++++++++++++++++++++++++
 net/netlink/genetlink.c     |   4 +
 6 files changed, 244 insertions(+)
 create mode 100644 include/uapi/linux/netns.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 43898b1a8a2d..de7e6fcbd5c2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6382,6 +6382,7 @@ F:	include/linux/netdevice.h
 F:	include/uapi/linux/in.h
 F:	include/uapi/linux/net.h
 F:	include/uapi/linux/netdevice.h
+F:	include/uapi/linux/netns.h
 F:	tools/net/
 F:	tools/testing/selftests/net/
 F:	lib/random32.c
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index e0d64667a4b3..0f1367a71b81 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -59,6 +59,7 @@ struct net {
 	struct list_head	exit_list;	/* Use only net_mutex */
 
 	struct user_namespace   *user_ns;	/* Owning user namespace */
+	struct idr		netns_ids;
 
 	unsigned int		proc_inum;
 
@@ -289,6 +290,10 @@ static inline struct net *read_pnet(struct net * const *pnet)
 #define __net_initconst	__initconst
 #endif
 
+int peernet2id(struct net *net, struct net *peer);
+struct net *get_net_ns_by_id(struct net *net, int id);
+int netns_genl_register(void);
+
 struct pernet_operations {
 	struct list_head list;
 	int (*init)(struct net *net);
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 6cad97485bad..d7f49c69585a 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -277,6 +277,7 @@ header-y += netfilter_decnet.h
 header-y += netfilter_ipv4.h
 header-y += netfilter_ipv6.h
 header-y += netlink.h
+header-y += netns.h
 header-y += netrom.h
 header-y += nfc.h
 header-y += nfs.h
diff --git a/include/uapi/linux/netns.h b/include/uapi/linux/netns.h
new file mode 100644
index 000000000000..2edf129377de
--- /dev/null
+++ b/include/uapi/linux/netns.h
@@ -0,0 +1,38 @@
+/* Copyright (c) 2014 6WIND S.A.
+ * Author: Nicolas Dichtel <nicolas.dichtel@6wind.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ */
+#ifndef _UAPI_LINUX_NETNS_H_
+#define _UAPI_LINUX_NETNS_H_
+
+/* Generic netlink messages */
+
+#define NETNS_GENL_NAME			"netns"
+#define NETNS_GENL_VERSION		0x1
+
+/* Commands */
+enum {
+	NETNS_CMD_UNSPEC,
+	NETNS_CMD_NEWID,
+	NETNS_CMD_GETID,
+	__NETNS_CMD_MAX,
+};
+
+#define NETNS_CMD_MAX		(__NETNS_CMD_MAX - 1)
+
+/* Attributes */
+enum {
+	NETNSA_NONE,
+#define NETNSA_NSINDEX_UNKNOWN	-1
+	NETNSA_NSID,
+	NETNSA_PID,
+	NETNSA_FD,
+	__NETNSA_MAX,
+};
+
+#define NETNSA_MAX		(__NETNSA_MAX - 1)
+
+#endif /* _UAPI_LINUX_NETNS_H_ */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7f155175bba8..4a5680ed42fb 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -15,6 +15,8 @@
 #include <linux/file.h>
 #include <linux/export.h>
 #include <linux/user_namespace.h>
+#include <linux/netns.h>
+#include <net/genetlink.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -144,6 +146,50 @@ static void ops_free_list(const struct pernet_operations *ops,
 	}
 }
 
+/* This function is used by idr_for_each(). If net is equal to peer, the
+ * function returns the id so that idr_for_each() stops. Because we cannot
+ * return the id 0 (idr_for_each() will not stop), we return the magic value
+ * -1 for it.
+ */
+static int net_eq_idr(int id, void *net, void *peer)
+{
+	if (net_eq(net, peer))
+		return id ? : -1;
+	return 0;
+}
+
+/* returns NETNSA_NSINDEX_UNKNOWN if not found */
+int peernet2id(struct net *net, struct net *peer)
+{
+	int id = idr_for_each(&net->netns_ids, net_eq_idr, peer);
+
+	ASSERT_RTNL();
+
+	/* Magic value for id 0. */
+	if (id == -1)
+		return 0;
+	if (id == 0)
+		return NETNSA_NSINDEX_UNKNOWN;
+
+	return id;
+}
+
+struct net *get_net_ns_by_id(struct net *net, int id)
+{
+	struct net *peer;
+
+	if (id < 0)
+		return NULL;
+
+	rcu_read_lock();
+	peer = idr_find(&net->netns_ids, id);
+	if (peer)
+		get_net(peer);
+	rcu_read_unlock();
+
+	return peer;
+}
+
 /*
  * setup_net runs the initializers for the network namespace object.
  */
@@ -158,6 +204,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	atomic_set(&net->passive, 1);
 	net->dev_base_seq = 1;
 	net->user_ns = user_ns;
+	idr_init(&net->netns_ids);
 
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_set(&net->use_count, 0);
@@ -288,6 +335,14 @@ static void cleanup_net(struct work_struct *work)
 	list_for_each_entry(net, &net_kill_list, cleanup_list) {
 		list_del_rcu(&net->list);
 		list_add_tail(&net->exit_list, &net_exit_list);
+		for_each_net(tmp) {
+			int id = peernet2id(tmp, net);
+
+			if (id >= 0)
+				idr_remove(&tmp->netns_ids, id);
+		}
+		idr_destroy(&net->netns_ids);
+
 	}
 	rtnl_unlock();
 
@@ -399,6 +454,146 @@ static struct pernet_operations __net_initdata net_ns_ops = {
 	.exit = net_ns_net_exit,
 };
 
+static struct genl_family netns_genl_family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= NETNS_GENL_NAME,
+	.version	= NETNS_GENL_VERSION,
+	.hdrsize	= 0,
+	.maxattr	= NETNSA_MAX,
+	.netnsok	= true,
+};
+
+static struct nla_policy netns_nl_policy[NETNSA_MAX + 1] = {
+	[NETNSA_NONE]		= { .type = NLA_UNSPEC },
+	[NETNSA_NSID]		= { .type = NLA_S32 },
+	[NETNSA_PID]		= { .type = NLA_U32 },
+	[NETNSA_FD]		= { .type = NLA_U32 },
+};
+
+static int netns_nl_cmd_newid(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net *net = genl_info_net(info);
+	struct net *peer;
+	int nsid, err;
+
+	if (!info->attrs[NETNSA_NSID])
+		return -EINVAL;
+	nsid = nla_get_s32(info->attrs[NETNSA_NSID]);
+	if (nsid < 0)
+		return -EINVAL;
+
+	if (info->attrs[NETNSA_PID])
+		peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+	else if (info->attrs[NETNSA_FD])
+		peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+	else
+		return -EINVAL;
+	if (IS_ERR(peer))
+		return PTR_ERR(peer);
+
+	rtnl_lock();
+	if (peernet2id(net, peer) >= 0) {
+		err = -EEXIST;
+		goto out;
+	}
+
+	err = idr_alloc(&net->netns_ids, peer, nsid, nsid + 1, GFP_KERNEL);
+	if (err >= 0)
+		err = 0;
+out:
+	rtnl_unlock();
+	put_net(peer);
+	return err;
+}
+
+static int netns_nl_get_size(void)
+{
+	return nla_total_size(sizeof(s32)) /* NETNSA_NSID */
+	       ;
+}
+
+static int netns_nl_fill(struct sk_buff *skb, u32 portid, u32 seq, int flags,
+			 int cmd, struct net *net, struct net *peer)
+{
+	void *hdr;
+	int id;
+
+	hdr = genlmsg_put(skb, portid, seq, &netns_genl_family, flags, cmd);
+	if (!hdr)
+		return -EMSGSIZE;
+
+	rtnl_lock();
+	id = peernet2id(net, peer);
+	rtnl_unlock();
+	if (nla_put_s32(skb, NETNSA_NSID, id))
+		goto nla_put_failure;
+
+	return genlmsg_end(skb, hdr);
+
+nla_put_failure:
+	genlmsg_cancel(skb, hdr);
+	return -EMSGSIZE;
+}
+
+static int netns_nl_cmd_getid(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net *net = genl_info_net(info);
+	struct sk_buff *msg;
+	int err = -ENOBUFS;
+	struct net *peer;
+
+	if (info->attrs[NETNSA_PID])
+		peer = get_net_ns_by_pid(nla_get_u32(info->attrs[NETNSA_PID]));
+	else if (info->attrs[NETNSA_FD])
+		peer = get_net_ns_by_fd(nla_get_u32(info->attrs[NETNSA_FD]));
+	else
+		return -EINVAL;
+
+	if (IS_ERR(peer))
+		return PTR_ERR(peer);
+
+	msg = genlmsg_new(netns_nl_get_size(), GFP_KERNEL);
+	if (!msg) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	err = netns_nl_fill(msg, info->snd_portid, info->snd_seq,
+			    NLM_F_ACK, NETNS_CMD_GETID, net, peer);
+	if (err < 0)
+		goto err_out;
+
+	err = genlmsg_unicast(net, msg, info->snd_portid);
+	goto out;
+
+err_out:
+	nlmsg_free(msg);
+out:
+	put_net(peer);
+	return err;
+}
+
+static struct genl_ops netns_genl_ops[] = {
+	{
+		.cmd = NETNS_CMD_NEWID,
+		.policy = netns_nl_policy,
+		.doit = netns_nl_cmd_newid,
+		.flags = GENL_ADMIN_PERM,
+	},
+	{
+		.cmd = NETNS_CMD_GETID,
+		.policy = netns_nl_policy,
+		.doit = netns_nl_cmd_getid,
+		.flags = GENL_ADMIN_PERM,
+	},
+};
+
+int netns_genl_register(void)
+{
+	return genl_register_family_with_ops(&netns_genl_family,
+					     netns_genl_ops);
+}
+
 static int __init net_ns_init(void)
 {
 	struct net_generic *ng;
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 76393f2f4b22..c6f39e40c9f3 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1029,6 +1029,10 @@ static int __init genl_init(void)
 	if (err)
 		goto problem;
 
+	err = netns_genl_register();
+	if (err < 0)
+		goto problem;
+
 	return 0;
 
 problem:
-- 
2.1.0
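
Note: the "magic value for id 0" trick used by net_eq_idr() and
peernet2id() in this patch can be shown with a small userspace model. The
sketch below is an illustration only: sketch_for_each() is a hypothetical
stand-in that mimics the one property of idr_for_each() the patch relies
on (iteration stops at the first nonzero callback return, and that value
is handed back), and the fixed-size slot array replaces the real idr.

```c
#include <stddef.h>

#define SKETCH_MAX_IDS 8
#define SKETCH_ID_UNKNOWN (-1)

typedef int (*sketch_iter_fn)(int id, void *entry, void *data);

/* Walks populated slots; stops at and returns the first nonzero result. */
static int sketch_for_each(void *slots[SKETCH_MAX_IDS], sketch_iter_fn fn,
			   void *data)
{
	for (int id = 0; id < SKETCH_MAX_IDS; id++) {
		int ret;

		if (!slots[id])
			continue;
		ret = fn(id, slots[id], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* A match on id 0 cannot be reported as 0, so it is encoded as -1. */
static int sketch_match(int id, void *entry, void *wanted)
{
	if (entry == wanted)
		return id ? id : -1;
	return 0;
}

/* Decodes the iterator result back into an id, or SKETCH_ID_UNKNOWN. */
static int sketch_entry2id(void *slots[SKETCH_MAX_IDS], void *wanted)
{
	int ret = sketch_for_each(slots, sketch_match, wanted);

	if (ret == -1)
		return 0;                 /* magic value for id 0 */
	if (ret == 0)
		return SKETCH_ID_UNKNOWN; /* iteration finished without a match */
	return ret;
}
```

The encoding is safe here because valid ids are never negative, so -1
cannot collide with a real id returned by the callback.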


* Re: [PATCH 0/6 3.18] Fixes for iwlwifi drivers
From: Larry Finger @ 2014-10-30 15:27 UTC
  To: Luca Coelho
  Cc: linville-2XuSBdqkA4R54TAoqtyWWQ,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Murilo Opsfelder Araujo
In-Reply-To: <1414667312.27833.22.camel-XPOmlcxoEMv1KXRcyAk9cg@public.gmane.org>

On 10/30/2014 06:08 AM, Luca Coelho wrote:
> The cover-letter subject is wrong. :) I guess you meant
> s/iwlwifi/rtlwifi/ ;)

Yes, the changes were for rtlwifi, not iwlwifi. Sorry. (:

My laptop has an Intel 7260 card built in, and it is working correctly on both 
2.4 and 5G bands under mainline 3.18-rc2.

Those types of errors are what I get for trying to "work" while on a family 
vacation. Unfortunately, I needed to submit those patches quickly to prevent a 
set of conflicting updates from being accepted, and I made a silly mistake.

Larry



* [PATCH net-next v4 4/4] rtnl: allow to create device with IFLA_LINK_NETNSID set
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, Nicolas Dichtel,
	luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1414682728-4532-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

This patch adds the ability to create a netdevice in a specified netns and
then move it into the final netns. In effect, it provides symmetry between
get and set rtnl messages.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 net/core/rtnetlink.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1b9329512496..57959a85ed2c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1211,6 +1211,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
 	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_PORT_ID_LEN },
 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
+	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1983,7 +1984,7 @@ replay:
 		struct nlattr *slave_attr[m_ops ? m_ops->slave_maxtype + 1 : 0];
 		struct nlattr **data = NULL;
 		struct nlattr **slave_data = NULL;
-		struct net *dest_net;
+		struct net *dest_net, *link_net = NULL;
 
 		if (ops) {
 			if (ops->maxtype && linkinfo[IFLA_INFO_DATA]) {
@@ -2089,7 +2090,18 @@ replay:
 		if (IS_ERR(dest_net))
 			return PTR_ERR(dest_net);
 
-		dev = rtnl_create_link(dest_net, ifname, name_assign_type, ops, tb);
+		if (tb[IFLA_LINK_NETNSID]) {
+			int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
+
+			link_net = get_net_ns_by_id(dest_net, id);
+			if (link_net == NULL) {
+				err =  -EINVAL;
+				goto out;
+			}
+		}
+
+		dev = rtnl_create_link(link_net ? : dest_net, ifname,
+				       name_assign_type, ops, tb);
 		if (IS_ERR(dev)) {
 			err = PTR_ERR(dev);
 			goto out;
@@ -2117,9 +2129,16 @@ replay:
 			}
 		}
 		err = rtnl_configure_link(dev, ifm);
-		if (err < 0)
+		if (err < 0) {
 			unregister_netdevice(dev);
+			goto out;
+		}
+
+		if (link_net)
+			err = dev_change_net_namespace(dev, dest_net, ifname);
 out:
+		if (link_net)
+			put_net(link_net);
 		put_net(dest_net);
 		return err;
 	}
-- 
2.1.0
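
Note: the create-then-move flow added by this patch can be modelled roughly
as below. This is a sketch under stated assumptions, not the rtnetlink
code: the sketch_* types and function are hypothetical, and the idea that
a device keeps a separate record of the netns it was created in after
being moved is an assumption made here to show why the two namespaces can
differ.

```c
#include <stddef.h>

struct sketch_net {
	int id;
};

struct sketch_dev {
	const struct sketch_net *dev_net;  /* netns the device lives in */
	const struct sketch_net *link_net; /* netns the device was created in */
};

/*
 * Creates the device in link_net when one is given, otherwise directly in
 * dest_net, and then moves it to dest_net if it was created elsewhere.
 */
static void sketch_create_link(struct sketch_dev *dev,
			       const struct sketch_net *dest_net,
			       const struct sketch_net *link_net)
{
	const struct sketch_net *create_net = link_net ? link_net : dest_net;

	/* Step 1: create in the chosen namespace. */
	dev->dev_net = create_net;
	dev->link_net = create_net;

	/* Step 2: move to the final namespace; the link netns is kept. */
	if (link_net)
		dev->dev_net = dest_net;
}
```

Without a link netns, both fields end up pointing at the destination
netns, which matches the behaviour before the patch.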


* [PATCH net-next v4 3/4] iptunnels: advertise link netns via netlink
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, Nicolas Dichtel,
	luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1414682728-4532-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
added to rtnetlink messages.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 include/net/ip6_tunnel.h | 1 +
 include/net/ip_tunnels.h | 1 +
 net/ipv4/ip_gre.c        | 2 ++
 net/ipv4/ip_tunnel.c     | 8 ++++++++
 net/ipv4/ip_vti.c        | 1 +
 net/ipv4/ipip.c          | 1 +
 net/ipv6/ip6_gre.c       | 1 +
 net/ipv6/ip6_tunnel.c    | 9 +++++++++
 net/ipv6/ip6_vti.c       | 1 +
 net/ipv6/sit.c           | 1 +
 10 files changed, 26 insertions(+)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index a5593dab6af7..8648519f4555 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -69,6 +69,7 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t);
 __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw);
 __u32 ip6_tnl_get_cap(struct ip6_tnl *t, const struct in6_addr *laddr,
 			     const struct in6_addr *raddr);
+struct net *ip6_tnl_get_link_net(const struct net_device *dev);
 
 static inline void ip6tunnel_xmit(struct sk_buff *skb, struct net_device *dev)
 {
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 5bc6edeb7143..ce4ff6161fab 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -122,6 +122,7 @@ struct ip_tunnel_net {
 int ip_tunnel_init(struct net_device *dev);
 void ip_tunnel_uninit(struct net_device *dev);
 void  ip_tunnel_dellink(struct net_device *dev, struct list_head *head);
+struct net *ip_tunnel_get_link_net(const struct net_device *dev);
 int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 		       struct rtnl_link_ops *ops, char *devname);
 
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 12055fdbe716..9e2e29a8c989 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -827,6 +827,7 @@ static struct rtnl_link_ops ipgre_link_ops __read_mostly = {
 	.dellink	= ip_tunnel_dellink,
 	.get_size	= ipgre_get_size,
 	.fill_info	= ipgre_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
@@ -841,6 +842,7 @@ static struct rtnl_link_ops ipgre_tap_ops __read_mostly = {
 	.dellink	= ip_tunnel_dellink,
 	.get_size	= ipgre_get_size,
 	.fill_info	= ipgre_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static int __net_init ipgre_tap_init_net(struct net *net)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 0bb8e141eacc..3e1edd544b27 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -972,6 +972,14 @@ void ip_tunnel_dellink(struct net_device *dev, struct list_head *head)
 }
 EXPORT_SYMBOL_GPL(ip_tunnel_dellink);
 
+struct net *ip_tunnel_get_link_net(const struct net_device *dev)
+{
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	return tunnel->net;
+}
+EXPORT_SYMBOL(ip_tunnel_get_link_net);
+
 int ip_tunnel_init_net(struct net *net, int ip_tnl_net_id,
 				  struct rtnl_link_ops *ops, char *devname)
 {
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 3e861011e4a3..f0fab26e4ddc 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -530,6 +530,7 @@ static struct rtnl_link_ops vti_link_ops __read_mostly = {
 	.changelink	= vti_changelink,
 	.get_size	= vti_get_size,
 	.fill_info	= vti_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static int __init vti_init(void)
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 37096d64730e..e7a183baba0a 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -498,6 +498,7 @@ static struct rtnl_link_ops ipip_link_ops __read_mostly = {
 	.dellink	= ip_tunnel_dellink,
 	.get_size	= ipip_get_size,
 	.fill_info	= ipip_fill_info,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static struct xfrm_tunnel ipip_handler __read_mostly = {
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 12c3c8ef3849..5165ac7fde22 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1661,6 +1661,7 @@ static struct rtnl_link_ops ip6gre_link_ops __read_mostly = {
 	.dellink	= ip6gre_dellink,
 	.get_size	= ip6gre_get_size,
 	.fill_info	= ip6gre_fill_info,
+	.get_link_net	= ip6_tnl_get_link_net,
 };
 
 static struct rtnl_link_ops ip6gre_tap_ops __read_mostly = {
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 9409887fb664..6b2534ea9c54 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1703,6 +1703,14 @@ nla_put_failure:
 	return -EMSGSIZE;
 }
 
+struct net *ip6_tnl_get_link_net(const struct net_device *dev)
+{
+	struct ip6_tnl *tunnel = netdev_priv(dev);
+
+	return tunnel->net;
+}
+EXPORT_SYMBOL(ip6_tnl_get_link_net);
+
 static const struct nla_policy ip6_tnl_policy[IFLA_IPTUN_MAX + 1] = {
 	[IFLA_IPTUN_LINK]		= { .type = NLA_U32 },
 	[IFLA_IPTUN_LOCAL]		= { .len = sizeof(struct in6_addr) },
@@ -1726,6 +1734,7 @@ static struct rtnl_link_ops ip6_link_ops __read_mostly = {
 	.dellink	= ip6_tnl_dellink,
 	.get_size	= ip6_tnl_get_size,
 	.fill_info	= ip6_tnl_fill_info,
+	.get_link_net	= ip6_tnl_get_link_net,
 };
 
 static struct xfrm6_tunnel ip4ip6_handler __read_mostly = {
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index d440bb585524..43966dcc9603 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -992,6 +992,7 @@ static struct rtnl_link_ops vti6_link_ops __read_mostly = {
 	.changelink	= vti6_changelink,
 	.get_size	= vti6_get_size,
 	.fill_info	= vti6_fill_info,
+	.get_link_net	= ip6_tnl_get_link_net,
 };
 
 static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n)
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 58e5b4710127..c858d0eb267a 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -1765,6 +1765,7 @@ static struct rtnl_link_ops sit_link_ops __read_mostly = {
 	.get_size	= ipip6_get_size,
 	.fill_info	= ipip6_fill_info,
 	.dellink	= ipip6_dellink,
+	.get_link_net	= ip_tunnel_get_link_net,
 };
 
 static struct xfrm_tunnel sit_handler __read_mostly = {
-- 
2.1.0

* [PATCH net-next v4 2/4] rtnl: add link netns id to interface messages
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: cwang-xCSkyg8dI+0RB7SZvlqPiA, Nicolas Dichtel,
	luto-kltTT9wpgjJwATOyAt5JVQ,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <1414682728-4532-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

This patch adds a new attribute (IFLA_LINK_NETNSID) which contains the 'link'
netns id when this netns is different from the netns where the interface
stands (for example for x-netns interfaces like ip tunnels). When there is no id,
we put NETNSA_NSINDEX_UNKNOWN into this attribute to indicate to userland that
the link netns is different from the interface netns. Hence, userland knows that
some information, such as IFLA_LINK, is not interpretable.
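
As a sketch of how userland might apply this rule (an illustration only, not
the iproute2 code): the parser below walks a buffer of attributes laid out like
struct rtattr and classifies the result. The attribute type number and the
"unknown" marker are passed in by the caller because their values are not shown
in this mail; `find_s32_attr()` and `classify_link_netns()` are hypothetical
names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct rtattr: 16-bit length, 16-bit type, then payload. */
struct attr_hdr {
	uint16_t len;	/* header + payload, padding excluded */
	uint16_t type;
};

#define ATTR_HDRLEN	sizeof(struct attr_hdr)
#define ATTR_ALIGN(len)	(((size_t)(len) + 3) & ~(size_t)3)

enum link_netns { LINK_NETNS_SAME, LINK_NETNS_KNOWN, LINK_NETNS_UNKNOWN };

/* Returns 1 and stores the s32 payload if 'type' is present, else 0. */
static int find_s32_attr(const unsigned char *buf, size_t buflen,
			 uint16_t type, int32_t *value)
{
	size_t off = 0;

	while (off <= buflen && buflen - off >= ATTR_HDRLEN) {
		struct attr_hdr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.len < ATTR_HDRLEN || hdr.len > buflen - off)
			return 0;	/* truncated or corrupt attribute */

		if (hdr.type == type &&
		    hdr.len == ATTR_HDRLEN + sizeof(int32_t)) {
			memcpy(value, buf + off + ATTR_HDRLEN, sizeof(*value));
			return 1;
		}
		off += ATTR_ALIGN(hdr.len);
	}
	return 0;
}

/*
 * Attribute absent: link netns is the interface netns.
 * Attribute equal to 'unknown_id': different netns, but no id assigned.
 * Anything else: different netns, '*id' is valid in the local netns.
 */
static enum link_netns classify_link_netns(const unsigned char *attrs,
					   size_t len, uint16_t netnsid_type,
					   int32_t unknown_id, int32_t *id)
{
	if (!find_s32_attr(attrs, len, netnsid_type, id))
		return LINK_NETNS_SAME;
	return *id == unknown_id ? LINK_NETNS_UNKNOWN : LINK_NETNS_KNOWN;
}
```

Only in the LINK_NETNS_SAME case would it be safe to resolve IFLA_LINK against
the ifindex space of the interface's own netns.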

Signed-off-by: Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>
---
 include/net/rtnetlink.h      |  2 ++
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 13 +++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index e21b9f9653c0..6c6d5393fc34 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -46,6 +46,7 @@ static inline int rtnl_msg_family(const struct nlmsghdr *nlh)
  *			    to create when creating a new device.
  *	@get_num_rx_queues: Function to determine number of receive queues
  *			    to create when creating a new device.
+ *	@get_link_net: Function to get the i/o netns of the device
  */
 struct rtnl_link_ops {
 	struct list_head	list;
@@ -93,6 +94,7 @@ struct rtnl_link_ops {
 	int			(*fill_slave_info)(struct sk_buff *skb,
 						   const struct net_device *dev,
 						   const struct net_device *slave_dev);
+	struct net		*(*get_link_net)(const struct net_device *dev);
 };
 
 int __rtnl_link_register(struct rtnl_link_ops *ops);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 7072d8325016..d2729f63cf01 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -145,6 +145,7 @@ enum {
 	IFLA_CARRIER,
 	IFLA_PHYS_PORT_ID,
 	IFLA_CARRIER_CHANGES,
+	IFLA_LINK_NETNSID,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a6882686ca3a..1b9329512496 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -862,6 +862,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
+	       + nla_total_size(4) /* IFLA_LINK_NETNSID */
 	       + nla_total_size(ext_filter_mask
 			        & RTEXT_FILTER_VF ? 4 : 0) /* IFLA_NUM_VF */
 	       + rtnl_vfinfo_size(dev, ext_filter_mask) /* IFLA_VFINFO_LIST */
@@ -1134,6 +1135,18 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			goto nla_put_failure;
 	}
 
+	if (dev->rtnl_link_ops &&
+	    dev->rtnl_link_ops->get_link_net) {
+		struct net *link_net = dev->rtnl_link_ops->get_link_net(dev);
+
+		if (!net_eq(dev_net(dev), link_net)) {
+			int id = peernet2id(dev_net(dev), link_net);
+
+			if (nla_put_s32(skb, IFLA_LINK_NETNSID, id))
+				goto nla_put_failure;
+		}
+	}
+
 	if (!(af_spec = nla_nest_start(skb, IFLA_AF_SPEC)))
 		goto nla_put_failure;
 
-- 
2.1.0

* [PATCH net-next v4 0/4] netns: allow to identify peer netns
From: Nicolas Dichtel @ 2014-10-30 15:25 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, ebiederm-aS9lmoZGLiVWk0Htik3J/w,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	luto-kltTT9wpgjJwATOyAt5JVQ, cwang-xCSkyg8dI+0RB7SZvlqPiA
In-Reply-To: <1412257690-31253-1-git-send-email-nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org>

The goal of this series is to be able to multicast netlink messages with an
attribute that identifies a peer netns.
Userland needs this to interpret some of the information contained in
netlink messages (like the IFLA_LINK value, but also some other attributes in the
case of an x-netns netdevice (see also
http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).

Ids of peer netns are set by userland via new genl messages. These ids are
stored per netns and are local (i.e. only valid in the netns where they are set).
To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
the id of a peer netns. Note that it will be possible to add a table (struct net
-> id) later to optimize this lookup if needed.
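
To make the trade-off concrete, here is a toy model of a per-netns id table. It
is only a sketch: a fixed array stands in for the idr, and `struct netns`,
`netns_set_id()`, `netns_by_id()` and `netns_id_of()` are made-up names. The
id -> peer direction is a direct lookup, while peer -> id is a scan, which is
the part a reverse table could optimize later.

```c
#include <stddef.h>

#define MAX_IDS		8	/* arbitrary capacity for the sketch */
#define ID_UNKNOWN	(-1)	/* placeholder for "no id assigned" */

struct netns {
	/* ids[i] is the peer this namespace knows as id i, or NULL. */
	struct netns *ids[MAX_IDS];
};

/* Userland-driven assignment: bind 'id' to 'peer' inside 'ns'. */
static int netns_set_id(struct netns *ns, struct netns *peer, int id)
{
	if (!peer || id < 0 || id >= MAX_IDS || ns->ids[id])
		return -1;
	ns->ids[id] = peer;
	return 0;
}

/* id -> peer: direct index into the table. */
static struct netns *netns_by_id(const struct netns *ns, int id)
{
	if (id < 0 || id >= MAX_IDS)
		return NULL;
	return ns->ids[id];
}

/* peer -> id: linear scan, analogous to walking the idr. */
static int netns_id_of(const struct netns *ns, const struct netns *peer)
{
	int i;

	if (!peer)
		return ID_UNKNOWN;
	for (i = 0; i < MAX_IDS; i++)
		if (ns->ids[i] == peer)
			return i;
	return ID_UNKNOWN;
}
```

Because each table belongs to one netns, an id assigned in one netns says
nothing about the ids visible from another.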

Patch 1/4 introduces the netlink API mechanism to set and get these ids.
Patches 2/4 and 3/4 implement an example of how to use these ids in rtnetlink
messages. Patch 4/4 shows that the netlink messages can be symmetric between
a GET and a SET.

iproute2 patches are available, I can send them on demand.

Here is a short terminal session showing how userland can use it.

First, setup netns and required ids:
$ ip netns add foo
$ ip netns del foo
$ ip netns
$ touch /var/run/netns/init_net
$ mount --bind /proc/1/ns/net /var/run/netns/init_net
$ ip netns add foo
$ ip netns exec foo ip netns set init_net 0
$ ip netns
foo
init_net
$ ip netns exec foo ip netns
foo
init_net (id: 0)

Now, add and display an ipip tunnel, with its link part in init_net (id 0 in
netns foo) and the netdevice in foo:
$ ip netns exec foo ip link add ipip1 link-netnsid 0 type ipip remote 10.16.0.121 local 10.16.0.249
$ ip netns exec foo ip l ls ipip1
6: ipip1@NONE: <POINTOPOINT,NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default 
    link/ipip 10.16.0.249 peer 10.16.0.121 link-netnsid 0

The parameter link-netnsid shows us where the interface sends and receives
packets (and thus we know where encapsulated addresses are set).

RFCv3 -> v4:
  rebase on net-next
  add copyright text in the new netns.h file

RFCv2 -> RFCv3:
  ids are now defined by userland (via netlink). Ids are stored in each netns
  (and they are local to this netns).
  add get_link_net support for ip6 tunnels
  netnsid is now a s32 instead of a u32

RFCv1 -> RFCv2:
  remove useless ()
  ids are now stored in the user ns. It's possible to get an id for a peer netns
  only if the current netns and the peer netns have the same user ns parent.

 MAINTAINERS                  |   1 +
 include/net/ip6_tunnel.h     |   1 +
 include/net/ip_tunnels.h     |   1 +
 include/net/net_namespace.h  |   5 ++
 include/net/rtnetlink.h      |   2 +
 include/uapi/linux/Kbuild    |   1 +
 include/uapi/linux/if_link.h |   1 +
 include/uapi/linux/netns.h   |  38 +++++++++
 net/core/net_namespace.c     | 195 +++++++++++++++++++++++++++++++++++++++++++
 net/core/rtnetlink.c         |  38 ++++++++-
 net/ipv4/ip_gre.c            |   2 +
 net/ipv4/ip_tunnel.c         |   8 ++
 net/ipv4/ip_vti.c            |   1 +
 net/ipv4/ipip.c              |   1 +
 net/ipv6/ip6_gre.c           |   1 +
 net/ipv6/ip6_tunnel.c        |   9 ++
 net/ipv6/ip6_vti.c           |   1 +
 net/ipv6/sit.c               |   1 +
 net/netlink/genetlink.c      |   4 +
 19 files changed, 308 insertions(+), 3 deletions(-)

Comments are welcome.

Regards,
Nicolas

* Re: [PATCH net] gre: Fix regression in gretap TSO support
From: Tom Herbert @ 2014-10-30 15:05 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Neal Cardwell, Pravin Shelar, Alexander Duyck, netdev,
	David Miller, H.K. Jerry Chu, Eric Dumazet
In-Reply-To: <54524B6A.3000503@redhat.com>

On Thu, Oct 30, 2014 at 7:30 AM, Alexander Duyck
<alexander.h.duyck@redhat.com> wrote:
>
> On 10/30/2014 06:51 AM, Neal Cardwell wrote:
>>
>> On Thu, Oct 30, 2014 at 1:14 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>>>
>>> On Wed, Oct 29, 2014 at 8:26 PM,  <alexander.duyck@gmail.com> wrote:
>>>>
>>>> From: Alexander Duyck <alexander.h.duyck@redhat.com>
>>>>
>>>> On recent kernels I found that TSO on gretap interfaces didn't work.
>>>> After
>>>> bisecting it I found that commit b884b1a4 had introduced a regression in
>>>> which the Ethernet header was being included in the GRE header length.
>>>>
>>>> This change corrects that by basing the GRE header length on the inner
>>>> mac
>>>> header in the case of GRE tunnels using transparent Ethernet bridging,
>>>> and
>>>> uses the network header for all other GRE tunnel types.
>>>>
>>>> Fixes: b884b1a4 ("gre_offload: simplify GRE header length calculation in
>>>> gre_gso_segment()")
>>
>> Hmm. There may be other protocols, either now or in the future, where
>> we want to be able to have a mac header inside the GRE header, rather
>> than a network header. AFAICT it would be safer to revert b884b1a4,
>> and go back to the previous code (from c50cd357), where we parse the
>> GRE header to figure out its length.
>>
>> neal
>
>
> The change is consistent with how we handle this in other spots throughout
> the kernel.  If nothing else you can just search for ETH_P_TEB and you will
> find multiple spots in the kernel where IP tunnels differentiate between
> transparent Ethernet bridging and regular IP in IP tunnels by checking for
> the protocol ETH_P_TEB.
>
I'm not sure I understand this. We always use inner mac header in
__skb_udp_tunnel_segment for computing tunnel length and don't
distinguish between Ethernet or IP encapsulation. Presumably, in the
case of IP encapsulation inner mac header is equal to inner network
header. Why is this different for GRE?

Thanks,
Tom

> Thanks,
>
> Alex

* Re: [PATCH net] gre: Fix regression in gretap TSO support
From: Eric Dumazet @ 2014-10-30 15:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Neal Cardwell, Pravin Shelar, alexander.duyck, netdev,
	David Miller, H.K. Jerry Chu, Eric Dumazet
In-Reply-To: <54524B6A.3000503@redhat.com>

On Thu, 2014-10-30 at 07:30 -0700, Alexander Duyck wrote:

> The change is consistent with how we handle this in other spots 
> throughout the kernel.  If nothing else you can just search for 
> ETH_P_TEB and you will find multiple spots in the kernel where IP 
> tunnels differentiate between transparent Ethernet bridging and regular 
> IP in IP tunnels by checking for the protocol ETH_P_TEB.

Agreed, I think that GUE might supersede GRE usage anyway ;)

Acked-by: Eric Dumazet <edumazet@google.com>

* Re: [PATCH net] gre: Fix regression in gretap TSO support
From: Alexander Duyck @ 2014-10-30 14:30 UTC (permalink / raw)
  To: Neal Cardwell, Pravin Shelar
  Cc: alexander.duyck, netdev, David Miller, H.K. Jerry Chu,
	Eric Dumazet
In-Reply-To: <CADVnQykSY_Q3HL0T1uNO4iUheWTpTsLLF7gCnM4SbJyGp3x6Wg@mail.gmail.com>


On 10/30/2014 06:51 AM, Neal Cardwell wrote:
> On Thu, Oct 30, 2014 at 1:14 AM, Pravin Shelar <pshelar@nicira.com> wrote:
>> On Wed, Oct 29, 2014 at 8:26 PM,  <alexander.duyck@gmail.com> wrote:
>>> From: Alexander Duyck <alexander.h.duyck@redhat.com>
>>>
>>> On recent kernels I found that TSO on gretap interfaces didn't work.  After
>>> bisecting it I found that commit b884b1a4 had introduced a regression in
>>> which the Ethernet header was being included in the GRE header length.
>>>
>>> This change corrects that by basing the GRE header length on the inner mac
>>> header in the case of GRE tunnels using transparent Ethernet bridging, and
>>> uses the network header for all other GRE tunnel types.
>>>
>>> Fixes: b884b1a4 ("gre_offload: simplify GRE header length calculation in gre_gso_segment()")
> Hmm. There may be other protocols, either now or in the future, where
> we want to be able to have a mac header inside the GRE header, rather
> than a network header. AFAICT it would be safer to revert b884b1a4,
> and go back to the previous code (from c50cd357), where we parse the
> GRE header to figure out its length.
>
> neal

The change is consistent with how we handle this in other spots 
throughout the kernel.  If nothing else you can just search for 
ETH_P_TEB and you will find multiple spots in the kernel where IP 
tunnels differentiate between transparent Ethernet bridging and regular 
IP in IP tunnels by checking for the protocol ETH_P_TEB.
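
To spell out the arithmetic being discussed, here is a small standalone model.
It is a sketch under assumptions, not the gre_offload code: `struct pkt_offsets`
is a made-up stand-in for the sk_buff header offsets, and the inner protocol is
kept in host byte order for simplicity.

```c
#include <stdint.h>

#define ETH_P_TEB 0x6558	/* Transparent Ethernet Bridging */

/* Byte offsets from the start of the packet; stand-in for sk_buff. */
struct pkt_offsets {
	uint16_t transport;	/* start of the GRE header */
	uint16_t inner_mac;	/* start of the inner Ethernet header */
	uint16_t inner_network;	/* start of the inner IP header */
	uint16_t inner_protocol;/* host byte order in this sketch */
};

/*
 * For TEB the GRE header ends where the inner Ethernet header begins;
 * for other GRE tunnel types it ends where the inner IP header begins.
 */
static int gre_header_len(const struct pkt_offsets *p)
{
	if (p->inner_protocol == ETH_P_TEB)
		return p->inner_mac - p->transport;
	return p->inner_network - p->transport;
}
```

With a 4-byte GRE header at offset 34 on a gretap device, the inner Ethernet
header starts at 38 and the inner IP header at 52. Measuring up to the inner
network header would give 18, that is, the GRE header plus the 14-byte Ethernet
header, which is the regression described above.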

Thanks,

Alex

* [PATCH net 2/2] mlx4: Avoid leaking steering rules on flow creation error flow
From: Or Gerlitz @ 2014-10-30 13:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Or Gerlitz
In-Reply-To: <1414677568-28409-1-git-send-email-ogerlitz@mellanox.com>

If mlx4_ib_create_flow() attempts to create more than one rule with the
firmware and one of these registrations fails, we leak the
already created flow rules.

One example of the leak is when the registration of the VXLAN ghost
steering rule fails: we don't unregister the original rule requested
by the user, a leak introduced in commit d2fce8a9060d "mlx4: Set
user-space raw Ethernet QPs to properly handle VXLAN traffic".

While here, dump the VXLAN portion of steering rules so that it is
visible when flow creation fails.
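
The unwinding pattern can be sketched in isolation as follows. This is a toy
model, not the driver code: `rule_register()` and `rule_unregister()` are
hypothetical stand-ins for the firmware calls, and `fail_at` only exists to
inject a failure.

```c
#define MAX_RULES 4

/* Toy "firmware" state: which rule slots are currently registered. */
static int registered[MAX_RULES];

static int rule_register(int rule, int fail_at)
{
	if (rule == fail_at)
		return -1;	/* injected failure */
	registered[rule] = 1;
	return 0;
}

static void rule_unregister(int rule)
{
	registered[rule] = 0;
}

static int rules_outstanding(void)
{
	int i, n = 0;

	for (i = 0; i < MAX_RULES; i++)
		n += registered[i];
	return n;
}

/* Register rules 0..n-1; on failure undo the ones already created. */
static int create_rules(int n, int fail_at)
{
	int i, err = 0;

	if (n > MAX_RULES)
		return -1;

	for (i = 0; i < n; i++) {
		err = rule_register(i, fail_at);
		if (err)
			goto err_unwind;
	}
	return 0;

err_unwind:
	/* i is the index that failed, so rules 0..i-1 exist. */
	while (i--)
		rule_unregister(i);
	return err;
}
```

The detail that matters is the index: the slot that failed was never
registered, so the unwind starts one below it.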

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/infiniband/hw/mlx4/main.c        |   10 ++++++++--
 drivers/net/ethernet/mellanox/mlx4/mcg.c |    4 ++++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index bda5994..8b72cf3 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1173,18 +1173,24 @@ static struct ib_flow *mlx4_ib_create_flow(struct ib_qp *qp,
 		err = __mlx4_ib_create_flow(qp, flow_attr, domain, type[i],
 					    &mflow->reg_id[i]);
 		if (err)
-			goto err_free;
+			goto err_create_flow;
 		i++;
 	}
 
 	if (i < ARRAY_SIZE(type) && flow_attr->type == IB_FLOW_ATTR_NORMAL) {
 		err = mlx4_ib_tunnel_steer_add(qp, flow_attr, &mflow->reg_id[i]);
 		if (err)
-			goto err_free;
+			goto err_create_flow;
+		i++;
 	}
 
 	return &mflow->ibflow;
 
+err_create_flow:
+	while (i) {
+		i--;
+		(void)__mlx4_ib_destroy_flow(to_mdev(qp->device)->dev, mflow->reg_id[i]);
+	}
 err_free:
 	kfree(mflow);
 	return ERR_PTR(err);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mcg.c b/drivers/net/ethernet/mellanox/mlx4/mcg.c
index ca0f98c..8728431 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mcg.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mcg.c
@@ -955,6 +955,10 @@ static void mlx4_err_rule(struct mlx4_dev *dev, char *str,
 					cur->ib.dst_gid_msk);
 			break;
 
+		case MLX4_NET_TRANS_RULE_ID_VXLAN:
+			len += snprintf(buf + len, BUF_SIZE - len,
+					"VNID = %d ", be32_to_cpu(cur->vxlan.vni));
+			break;
 		case MLX4_NET_TRANS_RULE_ID_IPV6:
 			break;
 
-- 
1.7.1

* [PATCH net 0/2] mlx4 driver encapsulation/steering fixes
From: Or Gerlitz @ 2014-10-30 13:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Or Gerlitz

Hi Dave,

The 1st patch fixes a bug in the TX path that supports offloading the
TX checksum of (VXLAN) encapsulated TCP packets. It turns out that the
bug is revealed only when the receiver runs in non-offloaded mode, which
is how we missed it so far. Please queue it for -stable >= 3.14.

The 2nd patch makes sure we do not leak steering entries on the error flow.
Please queue it for 3.17-stable.

thanks,

Or.

Or Gerlitz (2):
  net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN
  mlx4: Avoid leaking steering rules on flow creation error flow

 drivers/infiniband/hw/mlx4/main.c          |   10 ++++++++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |    7 +++++--
 drivers/net/ethernet/mellanox/mlx4/mcg.c   |    4 ++++
 3 files changed, 17 insertions(+), 4 deletions(-)

* [PATCH net 1/2] net/mlx4_en: Don't attempt to TX offload the outer UDP checksum for VXLAN
From: Or Gerlitz @ 2014-10-30 13:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Matan Barak, Amir Vadai, Saeed Mahameed, Or Gerlitz
In-Reply-To: <1414677568-28409-1-git-send-email-ogerlitz@mellanox.com>

For VXLAN/NVGRE encapsulation, the current HW doesn't support offloading
both the outer UDP TX checksum and the inner TCP/UDP TX checksum.

The driver doesn't advertise SKB_GSO_UDP_TUNNEL_CSUM, yet we wrongly
tell the HW to offload the outer UDP checksum for encapsulated packets.
Fix that.
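
Reduced to its decision logic, the change can be sketched like this. The bit
values below are placeholders (the real ones come from the mlx4 headers) and
`tx_csum_flags()` is a hypothetical helper, so treat it as an illustration of
the rule rather than the driver code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values; the real ones live in the mlx4 headers. */
#define EX_CTRL_IP_CSUM		(1u << 0)
#define EX_CTRL_TCP_UDP_CSUM	(1u << 1)

/*
 * Pick the checksum offload bits for one TX descriptor.
 * Encapsulated packets get the IP checksum bit only, so the HW is
 * not asked to compute the outer UDP checksum.
 */
static uint32_t tx_csum_flags(bool csum_partial, bool encapsulated)
{
	if (!csum_partial)
		return 0;
	if (encapsulated)
		return EX_CTRL_IP_CSUM;
	return EX_CTRL_IP_CSUM | EX_CTRL_TCP_UDP_CSUM;
}
```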

Fixes: 837052d0ccc5 ('net/mlx4_en: Add netdev support for TCP/IP
		     offloads of vxlan tunneling')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 34c1378..454d9fe 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -836,8 +836,11 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * whether LSO is used */
 	tx_desc->ctrl.srcrb_flags = priv->ctrl_flags;
 	if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
-		tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
-							 MLX4_WQE_CTRL_TCP_UDP_CSUM);
+		if (!skb->encapsulation)
+			tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
+								 MLX4_WQE_CTRL_TCP_UDP_CSUM);
+		else
+			tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM);
 		ring->tx_csum++;
 	}
 
-- 
1.7.1

* Re: [PATCH net] gre: Fix regression in gretap TSO support
From: Neal Cardwell @ 2014-10-30 13:51 UTC (permalink / raw)
  To: Pravin Shelar
  Cc: alexander.duyck, netdev, David Miller, H.K. Jerry Chu,
	Eric Dumazet, Alexander Duyck
In-Reply-To: <CALnjE+oK-O-PJH_u50HqQQLnvh+GyeCnv6tNQf5qzL0o1RiPQg@mail.gmail.com>

On Thu, Oct 30, 2014 at 1:14 AM, Pravin Shelar <pshelar@nicira.com> wrote:
> On Wed, Oct 29, 2014 at 8:26 PM,  <alexander.duyck@gmail.com> wrote:
>> From: Alexander Duyck <alexander.h.duyck@redhat.com>
>>
>> On recent kernels I found that TSO on gretap interfaces didn't work.  After
>> bisecting it I found that commit b884b1a4 had introduced a regression in
>> which the Ethernet header was being included in the GRE header length.
>>
>> This change corrects that by basing the GRE header length on the inner mac
>> header in the case of GRE tunnels using transparent Ethernet bridging, and
>> uses the network header for all other GRE tunnel types.
>>
>> Fixes: b884b1a4 ("gre_offload: simplify GRE header length calculation in gre_gso_segment()")

Hmm. There may be other protocols, either now or in the future, where
we want to be able to have a mac header inside the GRE header, rather
than a network header. AFAICT it would be safer to revert b884b1a4,
and go back to the previous code (from c50cd357), where we parse the
GRE header to figure out its length.

neal

* [net 3/4] ixgbe: need not repeat init skb with NULL
From: Jeff Kirsher @ 2014-10-30 12:33 UTC (permalink / raw)
  To: davem
  Cc: Junwei Zhang, netdev, nhorman, sassmann, jogreene, Martin Zhang,
	Jeff Kirsher
In-Reply-To: <1414672436-20616-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Junwei Zhang <linggao.zjw@alibaba-inc.com>

Signed-off-by: Martin Zhang <martinbj2008@gmail.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fec5212..d2df4e3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -4321,8 +4321,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
 				IXGBE_CB(skb)->page_released = false;
 			}
 			dev_kfree_skb(skb);
+			rx_buffer->skb = NULL;
 		}
-		rx_buffer->skb = NULL;
 		if (rx_buffer->dma)
 			dma_unmap_page(dev, rx_buffer->dma,
 				       ixgbe_rx_pg_size(rx_ring),
-- 
1.9.3

* [net 2/4] igb: don't reuse pages with pfmemalloc flag
From: Jeff Kirsher @ 2014-10-30 12:33 UTC (permalink / raw)
  To: davem; +Cc: Roman Gushchin, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1414672436-20616-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Roman Gushchin <klamm@yandex-team.ru>

An incoming packet is dropped silently by sk_filter() if the skb was
allocated from pfmemalloc reserves and the corresponding socket is
not marked with the SOCK_MEMALLOC flag.

The igb driver allocates pages for DMA with __skb_alloc_page(), which
calls alloc_pages_node() with the __GFP_MEMALLOC flag. So, under an
OOM condition, igb can get pages with the pfmemalloc flag set.

If an incoming packet hits a pfmemalloc page and is large enough
(small packets are copied into memory allocated with
netdev_alloc_skb_ip_align(), so they are not affected), it will be
dropped.

This behavior is ok under high memory pressure, but the problem is
that the igb driver reuses these mapped pages. So, packets are still
dropped even after all memory issues are gone and there is plenty
of free memory.

In my case, some TCP sessions hang on a small percentage (< 0.1%)
of machines days after OOMs.

Fix this by avoiding reuse of such pages.
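
The resulting reuse policy, stripped of driver details, looks roughly like
this. `struct rx_page` is a made-up stand-in for the page state the driver
inspects, and the ownership check is applied unconditionally here, which is a
simplification.

```c
#include <stdbool.h>

/* Stand-in for the page state consulted by the reuse check. */
struct rx_page {
	int  nid;		/* NUMA node the page belongs to */
	bool pfmemalloc;	/* page came from the emergency reserves */
	int  refcount;		/* current number of page users */
};

/* A page is recycled only if it is local, ordinary and solely ours. */
static bool can_reuse_rx_page(const struct rx_page *page, int local_nid)
{
	if (page->nid != local_nid)
		return false;

	/* Never keep reserve pages around once the pressure is gone. */
	if (page->pfmemalloc)
		return false;

	return page->refcount == 1;
}
```

Dropping a reserve page after one use costs a fresh allocation, but it bounds
how long the effects of an OOM episode can linger in the RX ring.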

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a21b144..a2d72a8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6537,6 +6537,9 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
 	if (unlikely(page_to_nid(page) != numa_node_id()))
 		return false;
 
+	if (unlikely(page->pfmemalloc))
+		return false;
+
 #if (PAGE_SIZE < 8192)
 	/* if we are only owner of page we can reuse it */
 	if (unlikely(page_count(page) != 1))
@@ -6603,7 +6606,8 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
 		memcpy(__skb_put(skb, size), va, ALIGN(size, sizeof(long)));
 
 		/* we can reuse buffer as-is, just make sure it is local */
-		if (likely(page_to_nid(page) == numa_node_id()))
+		if (likely((page_to_nid(page) == numa_node_id()) &&
+			   !page->pfmemalloc))
 			return true;
 
 		/* this page cannot be reused so discard it */
-- 
1.9.3

* [net 4/4] ixgbe: fix race when setting advertised speed
From: Jeff Kirsher @ 2014-10-30 12:33 UTC (permalink / raw)
  To: davem; +Cc: Emil Tantilov, netdev, nhorman, sassmann, jogreene, Jeff Kirsher
In-Reply-To: <1414672436-20616-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Emil Tantilov <emil.s.tantilov@intel.com>

The following commands:

modprobe ixgbe
ifconfig ethX up
ethtool -s ethX advertise 0x020

can lead to a "setup link failed with code -14" error due to the setup_link
call racing with the SFP detection routine in the watchdog.

This patch resolves the issue by protecting the setup_link call with a check
for __IXGBE_IN_SFP_INIT.
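
The guard is a plain bit lock. A userspace analogue using C11 atomics might
look like the sketch below: `STATE_IN_SFP_INIT`, `guarded_setup_link()` and the
two demo callbacks are hypothetical, and the wait function is supplied by the
caller where the kernel patch sleeps with usleep_range().

```c
#include <stdatomic.h>

#define STATE_IN_SFP_INIT (1u << 0)	/* hypothetical bit position */

static atomic_uint adapter_state;

/* Like test_and_set_bit(): set the bit, report whether it was set. */
static int state_test_and_set(unsigned int bit)
{
	return (atomic_fetch_or(&adapter_state, bit) & bit) != 0;
}

static void state_clear(unsigned int bit)
{
	atomic_fetch_and(&adapter_state, ~bit);
}

/* Run 'setup' with the bit held so no other holder can overlap it. */
static int guarded_setup_link(int (*setup)(void), void (*wait)(void))
{
	int err;

	while (state_test_and_set(STATE_IN_SFP_INIT))
		wait();		/* someone else owns the bit, back off */

	err = setup();
	state_clear(STATE_IN_SFP_INIT);
	return err;
}

/* Demo callbacks: succeed only if the bit is held while we run. */
static int demo_setup_link(void)
{
	return (atomic_load(&adapter_state) & STATE_IN_SFP_INIT) ? 0 : -1;
}

static void demo_no_wait(void)
{
}
```

Whoever else uses the same bit must also clear it when done, otherwise the
loop in the guard never terminates.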

Reported-by: Scott Harrison <scoharr2@cisco.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 3ce4a25..0ae038b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -342,12 +342,16 @@ static int ixgbe_set_settings(struct net_device *netdev,
 		if (old == advertised)
 			return err;
 		/* this sets the link speed and restarts auto-neg */
+		while (test_and_set_bit(__IXGBE_IN_SFP_INIT, &adapter->state))
+			usleep_range(1000, 2000);
+
 		hw->mac.autotry_restart = true;
 		err = hw->mac.ops.setup_link(hw, advertised, true);
 		if (err) {
 			e_info(probe, "setup link failed with code %d\n", err);
 			hw->mac.ops.setup_link(hw, old, true);
 		}
+		clear_bit(__IXGBE_IN_SFP_INIT, &adapter->state);
 	} else {
 		/* in this case we currently only support 10Gb/FULL */
 		u32 speed = ethtool_cmd_speed(ecmd);
-- 
1.9.3

* [net 1/4] e1000: unset IFF_UNICAST_FLT on VMware 82545EM
From: Jeff Kirsher @ 2014-10-30 12:33 UTC (permalink / raw)
  To: davem
  Cc: Francesco Ruggeri, netdev, nhorman, sassmann, jogreene,
	Francesco Ruggeri, Jeff Kirsher
In-Reply-To: <1414672436-20616-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Francesco Ruggeri <fruggeri@aristanetworks.com>

VMWare's e1000 implementation does not seem to support unicast filtering.
This can be observed by configuring a macvlan interface on eth0 in a VM in
VMWare Fusion 5.0.5, and trying to use that interface instead of eth0.
Tested on 3.16.
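
The quirk test itself is small enough to show standalone. In this sketch the
two id constants are the values I believe the kernel headers define for
E1000_DEV_ID_82545EM_COPPER and PCI_VENDOR_ID_VMWARE; they are an assumption
here and should be checked against the headers, and
`supports_unicast_filter()` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, to be verified against the kernel headers. */
#define EX_DEV_ID_82545EM_COPPER	0x100F
#define EX_VENDOR_ID_VMWARE		0x15AD

/*
 * Unicast filtering is advertised unless the device is the emulated
 * 82545EM, recognized by the combination of device id and subsystem
 * vendor id.
 */
static bool supports_unicast_filter(uint16_t device_id,
				    uint16_t subsystem_vendor_id)
{
	return device_id != EX_DEV_ID_82545EM_COPPER ||
	       subsystem_vendor_id != EX_VENDOR_ID_VMWARE;
}
```

Matching on both ids keeps the flag set for a physical 82545EM, which is
assumed to filter correctly.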

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 5f6aded..24f3986 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -1075,7 +1075,10 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 				  NETIF_F_HW_CSUM |
 				  NETIF_F_SG);
 
-	netdev->priv_flags |= IFF_UNICAST_FLT;
+	/* Do not set IFF_UNICAST_FLT for VMWare's 82545EM */
+	if (hw->device_id != E1000_DEV_ID_82545EM_COPPER ||
+	    hw->subsystem_vendor_id != PCI_VENDOR_ID_VMWARE)
+		netdev->priv_flags |= IFF_UNICAST_FLT;
 
 	adapter->en_mng_pt = e1000_enable_mng_pass_thru(hw);
 
-- 
1.9.3
