Netdev List
* [PATCH net, 1/3] hyperv: Fix the max_xfer_size in RNDIS initialization
From: Haiyang Zhang @ 2012-10-01 22:30 UTC
  To: davem, netdev; +Cc: olaf, jasowang, linux-kernel, devel, haiyangz

According to the RNDIS specification, Windows sets this size to
0x4000 (16 KiB). Use the same value here.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>

---
 drivers/net/hyperv/rndis_filter.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 1e88a10..3cb7486 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -678,8 +678,7 @@ static int rndis_filter_init_device(struct rndis_device *dev)
 	init = &request->request_msg.msg.init_req;
 	init->major_ver = RNDIS_MAJOR_VERSION;
 	init->minor_ver = RNDIS_MINOR_VERSION;
-	/* FIXME: Use 1536 - rounded ethernet frame size */
-	init->max_xfer_size = 2048;
+	init->max_xfer_size = 0x4000;
 
 	dev->state = RNDIS_DEV_INITIALIZING;
 
-- 
1.7.4.1

* [PATCH net, 2/3] hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
From: Haiyang Zhang @ 2012-10-01 22:30 UTC
  To: davem, netdev; +Cc: olaf, jasowang, linux-kernel, devel, haiyangz
In-Reply-To: <1349130657-7987-1-git-send-email-haiyangz@microsoft.com>

Return ETIMEDOUT when the reply message is not received in time.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>

---
 drivers/net/hyperv/rndis_filter.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 3cb7486..2909dd8 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -641,6 +641,7 @@ int rndis_filter_set_packet_filter(struct rndis_device *dev, u32 new_filter)
 	if (t == 0) {
 		netdev_err(ndev,
 			"timeout before we got a set response...\n");
+		ret = -ETIMEDOUT;
 		/*
 		 * We can't deallocate the request since we may still receive a
 		 * send completion for it.
-- 
1.7.4.1

* [PATCH net, 3/3] hyperv: Fix page buffer handling in rndis_filter_send_request()
From: Haiyang Zhang @ 2012-10-01 22:30 UTC
  To: davem, netdev; +Cc: olaf, jasowang, linux-kernel, devel, haiyangz
In-Reply-To: <1349130657-7987-1-git-send-email-haiyangz@microsoft.com>

Add another page buffer if the request message crosses a page boundary.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>

---
 drivers/net/hyperv/rndis_filter.c |   16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/drivers/net/hyperv/rndis_filter.c b/drivers/net/hyperv/rndis_filter.c
index 2909dd8..1cd8d45 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -45,7 +45,9 @@ struct rndis_request {
 
 	/* Simplify allocation by having a netvsc packet inline */
 	struct hv_netvsc_packet	pkt;
-	struct hv_page_buffer buf;
+	/* Set 2 pages for rndis requests crossing page boundary */
+	struct hv_page_buffer buf[2];
+
 	/* FIXME: We assumed a fixed size request here. */
 	struct rndis_message request_msg;
 	u8 ext[100];
@@ -221,6 +223,18 @@ static int rndis_filter_send_request(struct rndis_device *dev,
 	packet->page_buf[0].offset =
 		(unsigned long)&req->request_msg & (PAGE_SIZE - 1);
 
+	/* Add one page_buf when request_msg crosses a page boundary */
+	if (packet->page_buf[0].offset + packet->page_buf[0].len > PAGE_SIZE) {
+		packet->page_buf_cnt++;
+		packet->page_buf[0].len = PAGE_SIZE -
+			packet->page_buf[0].offset;
+		packet->page_buf[1].pfn = virt_to_phys((void *)&req->request_msg
+			+ packet->page_buf[0].len) >> PAGE_SHIFT;
+		packet->page_buf[1].offset = 0;
+		packet->page_buf[1].len = req->request_msg.msg_len -
+			packet->page_buf[0].len;
+	}
+
 	packet->completion.send.send_completion_ctx = req;/* packet; */
 	packet->completion.send.send_completion =
 		rndis_filter_send_request_completion;
-- 
1.7.4.1
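
For readers following the page-buffer logic in this patch, the split can
be sketched in user space as below. This is an illustration only, not the
driver code. It assumes a 4 KiB page and treats the address as flat; the
patch itself calls virt_to_phys() on the second half because physical
pages need not be contiguous. The names struct page_seg,
split_at_page_boundary and SKETCH_PAGE_* are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12			/* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for struct hv_page_buffer (same field names). */
struct page_seg {
	uint64_t pfn;
	uint32_t offset;
	uint32_t len;
};

/*
 * Describe the byte range [addr, addr + len) with one or two page
 * segments and return how many were used. len must not exceed one
 * page, so at most one page boundary can be crossed.
 */
static int split_at_page_boundary(uint64_t addr, uint32_t len,
				  struct page_seg seg[2])
{
	assert(len > 0 && len <= SKETCH_PAGE_SIZE);

	seg[0].pfn = addr >> SKETCH_PAGE_SHIFT;
	seg[0].offset = (uint32_t)(addr & (SKETCH_PAGE_SIZE - 1));
	seg[0].len = len;

	/* The whole range fits in the first page: nothing to split. */
	if (seg[0].offset + seg[0].len <= SKETCH_PAGE_SIZE)
		return 1;

	/* Trim the first segment to the end of its page ... */
	seg[0].len = (uint32_t)(SKETCH_PAGE_SIZE - seg[0].offset);

	/* ... and describe the remainder from the start of the next page. */
	seg[1].pfn = (addr + seg[0].len) >> SKETCH_PAGE_SHIFT;
	seg[1].offset = 0;
	seg[1].len = len - seg[0].len;
	return 2;
}
```

With these assumptions, a 0x30-byte message starting at offset 0xFF0 of
page 1 comes back as two segments: 0x10 bytes at the end of page 1 and
0x20 bytes at the start of page 2.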

* RE: [PATCH RFC net-next 1/1] ptp: add an ioctl to compare PHC time with system time
From: Keller, Jacob E @ 2012-10-01 22:33 UTC
  To: Richard Cochran, netdev@vger.kernel.org
  Cc: David Miller, John Stultz, Miroslav Lichvar
In-Reply-To: <f0c20e2d1a303b0247b1e0e0def19f131de162ff.1348768886.git.richardcochran@gmail.com>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Richard Cochran
> Sent: Thursday, September 27, 2012 11:12 AM
> To: netdev@vger.kernel.org
> Cc: David Miller; Keller, Jacob E; John Stultz; Miroslav Lichvar
> Subject: [PATCH RFC net-next 1/1] ptp: add an ioctl to compare PHC time
> with system time
> 
> This patch adds an ioctl for PTP Hardware Clock (PHC) devices that allows
> user space to measure the time offset between the PHC and the system
> clock. Rather than hard coding any kind of estimation algorithm into the
> kernel, this patch takes the more flexible approach of just delivering an
> array of raw clock readings. In that way, the user space clock servo may
> be adapted to new and different hardware clocks.
> 
> Signed-off-by: Richard Cochran <richardcochran@gmail.com>
> ---
>  drivers/ptp/ptp_chardev.c |   32 ++++++++++++++++++++++++++++++++
>  include/linux/ptp_clock.h |   14 ++++++++++++++
>  2 files changed, 46 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index e7f301da2..4f8ae80 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -33,9 +33,13 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
>  {
>  	struct ptp_clock_caps caps;
>  	struct ptp_clock_request req;
> +	struct ptp_sys_offset sysoff;
>  	struct ptp_clock *ptp = container_of(pc, struct ptp_clock, clock);
>  	struct ptp_clock_info *ops = ptp->info;
> +	struct ptp_clock_time *pct;
> +	struct timespec ts;
>  	int enable, err = 0;
> +	unsigned int i;
> 
>  	switch (cmd) {
> 
> @@ -88,6 +92,34 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
>  		err = ops->enable(ops, &req, enable);
>  		break;
> 
> +	case PTP_SYS_OFFSET:
> +		if (copy_from_user(&sysoff, (void __user *)arg,
> +				   sizeof(sysoff))) {
> +			err = -EFAULT;
> +			break;
> +		}
> +		if (sysoff.n_samples > PTP_MAX_SAMPLES) {
> +			err = -EINVAL;
> +			break;
> +		}
> +		pct = &sysoff.ts[0];
> +		for (i = 0; i < sysoff.n_samples; i++) {
> +			getnstimeofday(&ts);
> +			pct->sec = ts.tv_sec;
> +			pct->nsec = ts.tv_nsec;
> +			pct++;
> +			ptp->info->gettime(ptp->info, &ts);
> +			pct->sec = ts.tv_sec;
> +			pct->nsec = ts.tv_nsec;
> +			pct++;
> +		}
> +		getnstimeofday(&ts);
> +		pct->sec = ts.tv_sec;
> +		pct->nsec = ts.tv_nsec;
> +		if (copy_to_user((void __user *)arg, &sysoff, sizeof(sysoff)))
> +			err = -EFAULT;
> +		break;
> +
>  	default:
>  		err = -ENOTTY;
>  		break;
> diff --git a/include/linux/ptp_clock.h b/include/linux/ptp_clock.h
> index 94e981f..b65c834 100644
> --- a/include/linux/ptp_clock.h
> +++ b/include/linux/ptp_clock.h
> @@ -67,12 +67,26 @@ struct ptp_perout_request {
>  	unsigned int rsv[4];          /* Reserved for future use. */
>  };
> 
> +#define PTP_MAX_SAMPLES 25 /* Maximum allowed offset measurement samples. */
> +
> +struct ptp_sys_offset {
> +	unsigned int n_samples; /* Desired number of measurements. */
> +	unsigned int rsv[3];    /* Reserved for future use. */
> +	/*
> +	 * Array of interleaved system/phc time stamps. The kernel
> +	 * will provide 2*n_samples + 1 time stamps, with the last
> +	 * one as a system time stamp.
> +	 */
> +	struct ptp_clock_time ts[2 * PTP_MAX_SAMPLES + 1];
> +};
> +
>  #define PTP_CLK_MAGIC '='
> 
>  #define PTP_CLOCK_GETCAPS  _IOR(PTP_CLK_MAGIC, 1, struct ptp_clock_caps)
>  #define PTP_EXTTS_REQUEST  _IOW(PTP_CLK_MAGIC, 2, struct ptp_extts_request)
>  #define PTP_PEROUT_REQUEST _IOW(PTP_CLK_MAGIC, 3, struct ptp_perout_request)
>  #define PTP_ENABLE_PPS     _IOW(PTP_CLK_MAGIC, 4, int)
> +#define PTP_SYS_OFFSET     _IOW(PTP_CLK_MAGIC, 5, struct ptp_sys_offset)
> 
>  struct ptp_extts_event {
>  	struct ptp_clock_time t; /* Time event occured. */
> --
> 1.7.2.5
> 

This is much nicer than performing the same reads manually in user-space.

Acked-by: Jacob Keller <jacob.e.keller@intel.com>
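
One way user space might reduce the 2*n_samples + 1 interleaved readings
to a single offset is sketched below. This is an assumption about a
possible estimator, not something the patch prescribes: it picks the PHC
reading bracketed by the closest pair of system readings and compares it
with the midpoint of that pair. struct sample_time and
estimate_phc_offset are hypothetical names; only the sec/nsec fields and
the sys, phc, sys, ... ordering are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of one time stamp (sec/nsec as in the patch). */
struct sample_time {
	int64_t sec;
	uint32_t nsec;
};

static int64_t to_ns(const struct sample_time *t)
{
	return t->sec * 1000000000LL + (int64_t)t->nsec;
}

/*
 * ts holds 2*n + 1 readings ordered sys, phc, sys, phc, ..., sys.
 * Returns (PHC time - system time) in nanoseconds, taken from the
 * sample whose two surrounding system readings are closest together.
 * *delay_ns receives the width of that bracket.
 */
static int64_t estimate_phc_offset(const struct sample_time *ts,
				   unsigned int n, int64_t *delay_ns)
{
	int64_t best_offset = 0, best_delay = INT64_MAX;
	unsigned int i;

	assert(n > 0);

	for (i = 0; i < n; i++) {
		int64_t t1 = to_ns(&ts[2 * i]);		/* system, before */
		int64_t phc = to_ns(&ts[2 * i + 1]);	/* PHC reading */
		int64_t t2 = to_ns(&ts[2 * i + 2]);	/* system, after */
		int64_t delay = t2 - t1;

		/* A narrower bracket means less uncertainty about when the PHC was read. */
		if (delay < best_delay) {
			best_delay = delay;
			best_offset = phc - (t1 + delay / 2);
		}
	}

	*delay_ns = best_delay;
	return best_offset;
}
```

Because the kernel only hands back raw readings, a different estimator
(averaging, or a median over the brackets) could be swapped in without
any kernel change, which is the flexibility the patch description is
after.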

* Re: [PATCHv6 net-next] vxlan: virtual extensible lan
From: David Miller @ 2012-10-01 22:34 UTC
  To: shemminger; +Cc: jesse, chrisw, netdev
In-Reply-To: <20121001153014.07d25c4e@nehalam.linuxnetplumber.net>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon, 1 Oct 2012 15:30:14 -0700

> On Mon, 01 Oct 2012 18:07:12 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
>> From: Stephen Hemminger <shemminger@vyatta.com>
>> Date: Mon, 1 Oct 2012 13:57:19 -0700
>> 
>> > This is an implementation of Virtual eXtensible Local Area Network
>> > as described in draft RFC:
>> >   http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
>> > 
>> > The driver integrates a Virtual Tunnel Endpoint (VTEP) functionality
>> > that learns MAC to IP address mapping. 
>> > 
>> > This implementation has only been tested with the user-mode TAP
>> > based version for Linux, not against other vendors (yet).
>> > 
>> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>> 
>> It doesn't build.
>> 
>> And I'm not telling you what the build error is, you'll have to do an
>> allmodconfig build yourself to see it.
>> 
>> I want you to get into the habit of doing an allmodconfig build to
>> validate your changes because that's the very first thing I'm going to
>> do.
> 
> Dave, did you remember to include the two precursor patches?
> 
> Vxlan was originally submitted as a 3 part series and only the
> last one ever changed.
> 
>  [PATCH net-next 1/3] netlink: add attributes to fdb interface
>  [PATCH net-next 2/3] igmp: export symbol ip_mc_leave_group
> 
> Make allmodconfig works for me (x86-64).

You MUST always resubmit the entire series when you resubmit one
patch.

Furthermore you confused the situation even more by not including
the patch number information in your Subject line.

* [PATCH 2/3] igmp: export symbol ip_mc_leave_group
From: Stephen Hemminger @ 2012-10-01 22:32 UTC
  To: davem; +Cc: netdev
In-Reply-To: <20121001223232.566037595@vyatta.com>

[-- Attachment #1: igmp-leave.patch --]
[-- Type: text/plain, Size: 428 bytes --]

Needed for VXLAN.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


--- a/net/ipv4/igmp.c	2012-09-17 17:15:06.747860247 -0700
+++ b/net/ipv4/igmp.c	2012-09-17 17:16:33.554984978 -0700
@@ -1904,6 +1904,7 @@ int ip_mc_leave_group(struct sock *sk, s
 	rtnl_unlock();
 	return ret;
 }
+EXPORT_SYMBOL(ip_mc_leave_group);
 
 int ip_mc_source(int add, int omode, struct sock *sk, struct
 	ip_mreq_source *mreqs, int ifindex)

* [PATCH 1/3] netlink: add attributes to fdb interface
From: Stephen Hemminger @ 2012-10-01 22:32 UTC
  To: davem; +Cc: netdev
In-Reply-To: <20121001223232.566037595@vyatta.com>

[-- Attachment #1: fdb-attr.patch --]
[-- Type: text/plain, Size: 4187 bytes --]

Later changes need to be able to refer to neighbour attributes
when doing fdb_add.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    2 +-
 drivers/net/macvlan.c                         |    2 +-
 include/linux/netdevice.h                     |    4 +++-
 net/bridge/br_fdb.c                           |    3 ++-
 net/bridge/br_private.h                       |    2 +-
 net/core/rtnetlink.c                          |    6 ++++--
 6 files changed, 12 insertions(+), 7 deletions(-)

--- a/drivers/net/macvlan.c	2012-09-28 09:28:48.899660128 -0700
+++ b/drivers/net/macvlan.c	2012-09-28 09:28:56.191588306 -0700
@@ -546,7 +546,7 @@ static int macvlan_vlan_rx_kill_vid(stru
 	return 0;
 }
 
-static int macvlan_fdb_add(struct ndmsg *ndm,
+static int macvlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 			   struct net_device *dev,
 			   const unsigned char *addr,
 			   u16 flags)
--- a/include/linux/netdevice.h	2012-09-28 09:28:53.439615417 -0700
+++ b/include/linux/netdevice.h	2012-09-28 09:28:56.191588306 -0700
@@ -904,7 +904,8 @@ struct netdev_fcoe_hbainfo {
  *	feature set might be less than what was returned by ndo_fix_features()).
  *	Must return >0 or -errno if it changed dev->features itself.
  *
- * int (*ndo_fdb_add)(struct ndmsg *ndm, struct net_device *dev,
+ * int (*ndo_fdb_add)(struct ndmsg *ndm, struct nlattr *tb[],
+ *		      struct net_device *dev,
  *		      const unsigned char *addr, u16 flags)
  *	Adds an FDB entry to dev for addr.
  * int (*ndo_fdb_del)(struct ndmsg *ndm, struct net_device *dev,
@@ -1014,6 +1015,7 @@ struct net_device_ops {
 	void			(*ndo_neigh_destroy)(struct neighbour *n);
 
 	int			(*ndo_fdb_add)(struct ndmsg *ndm,
+					       struct nlattr *tb[],
 					       struct net_device *dev,
 					       const unsigned char *addr,
 					       u16 flags);
--- a/net/bridge/br_fdb.c	2012-09-28 09:28:48.899660128 -0700
+++ b/net/bridge/br_fdb.c	2012-09-28 09:28:56.191588306 -0700
@@ -608,7 +608,8 @@ static int fdb_add_entry(struct net_brid
 }
 
 /* Add new permanent fdb entry with RTM_NEWNEIGH */
-int br_fdb_add(struct ndmsg *ndm, struct net_device *dev,
+int br_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
+	       struct net_device *dev,
 	       const unsigned char *addr, u16 nlh_flags)
 {
 	struct net_bridge_port *p;
--- a/net/bridge/br_private.h	2012-09-28 09:28:48.899660128 -0700
+++ b/net/bridge/br_private.h	2012-09-28 09:28:56.191588306 -0700
@@ -364,7 +364,7 @@ extern void br_fdb_update(struct net_bri
 extern int br_fdb_delete(struct ndmsg *ndm,
 			 struct net_device *dev,
 			 const unsigned char *addr);
-extern int br_fdb_add(struct ndmsg *nlh,
+extern int br_fdb_add(struct ndmsg *nlh, struct nlattr *tb[],
 		      struct net_device *dev,
 		      const unsigned char *addr,
 		      u16 nlh_flags);
--- a/net/core/rtnetlink.c	2012-09-28 09:28:48.899660128 -0700
+++ b/net/core/rtnetlink.c	2012-09-28 09:28:56.191588306 -0700
@@ -2090,7 +2090,8 @@ static int rtnl_fdb_add(struct sk_buff *
 	if ((!ndm->ndm_flags || ndm->ndm_flags & NTF_MASTER) &&
 	    (dev->priv_flags & IFF_BRIDGE_PORT)) {
 		master = dev->master;
-		err = master->netdev_ops->ndo_fdb_add(ndm, dev, addr,
+		err = master->netdev_ops->ndo_fdb_add(ndm, tb,
+						      dev, addr,
 						      nlh->nlmsg_flags);
 		if (err)
 			goto out;
@@ -2100,7 +2101,8 @@ static int rtnl_fdb_add(struct sk_buff *
 
 	/* Embedded bridge, macvlan, and any other device support */
 	if ((ndm->ndm_flags & NTF_SELF) && dev->netdev_ops->ndo_fdb_add) {
-		err = dev->netdev_ops->ndo_fdb_add(ndm, dev, addr,
+		err = dev->netdev_ops->ndo_fdb_add(ndm, tb,
+						   dev, addr,
 						   nlh->nlmsg_flags);
 
 		if (!err) {
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c	2012-09-28 09:28:48.899660128 -0700
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c	2012-09-28 09:28:56.195588265 -0700
@@ -6889,7 +6889,7 @@ static int ixgbe_set_features(struct net
 	return 0;
 }
 
-static int ixgbe_ndo_fdb_add(struct ndmsg *ndm,
+static int ixgbe_ndo_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 			     struct net_device *dev,
 			     const unsigned char *addr,
 			     u16 flags)

* [PATCH 3/3] vxlan: virtual extensible lan
From: Stephen Hemminger @ 2012-10-01 22:32 UTC
  To: davem; +Cc: netdev
In-Reply-To: <20121001223232.566037595@vyatta.com>

[-- Attachment #1: vxlan.patch --]
[-- Type: text/plain, Size: 35704 bytes --]

This is an implementation of Virtual eXtensible Local Area Network
as described in draft RFC:
  http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02

The driver integrates a Virtual Tunnel Endpoint (VTEP) functionality
that learns MAC to IP address mapping. 
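
As a rough illustration of what learning a MAC to IP mapping means, the
idea can be sketched in user space as below. This is a simplified sketch
under stated assumptions, not the driver's data structure: it uses a
fixed open-addressed table instead of hash chains, has no ageing, no
locking and no netlink notification, and all names (fdb_table, fdb_learn,
fdb_lookup, SKETCH_SLOTS) are hypothetical. The table must start zeroed.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_SLOTS 256	/* illustrative size, not taken from the driver */

struct fdb_slot {
	uint8_t  mac[6];
	uint32_t remote_ip;	/* IPv4 address of the remote tunnel endpoint */
	int      in_use;
};

struct fdb_table {
	struct fdb_slot slot[SKETCH_SLOTS];
};

/* Find the slot holding mac, or the free slot where it would go; NULL if full. */
static struct fdb_slot *fdb_probe(struct fdb_table *t, const uint8_t mac[6])
{
	unsigned int start = (unsigned int)(mac[4] ^ mac[5]) % SKETCH_SLOTS;
	unsigned int i;

	for (i = 0; i < SKETCH_SLOTS; i++) {
		struct fdb_slot *s = &t->slot[(start + i) % SKETCH_SLOTS];

		if (!s->in_use || memcmp(s->mac, mac, 6) == 0)
			return s;
	}
	return NULL;
}

/*
 * Record that src_mac was seen behind tunnel endpoint src_ip.
 * Returns 1 if the table changed (new entry or migration),
 * 0 if the mapping was already known, -1 if the table is full.
 */
static int fdb_learn(struct fdb_table *t, const uint8_t src_mac[6],
		     uint32_t src_ip)
{
	struct fdb_slot *s = fdb_probe(t, src_mac);

	if (!s)
		return -1;
	if (s->in_use && s->remote_ip == src_ip)
		return 0;

	memcpy(s->mac, src_mac, 6);
	s->remote_ip = src_ip;
	s->in_use = 1;
	return 1;
}

/* Return the endpoint for dst_mac, or 0 when unknown (a caller would then flood). */
static uint32_t fdb_lookup(struct fdb_table *t, const uint8_t dst_mac[6])
{
	struct fdb_slot *s = fdb_probe(t, dst_mac);

	return (s && s->in_use) ? s->remote_ip : 0;
}
```

On receive, the inner source MAC and the outer source IP would be fed to
fdb_learn(); on transmit, fdb_lookup() on the inner destination MAC
yields the endpoint to encapsulate towards.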

This implementation has been tested only against the Linux
userspace implementation using TAP, not against other vendors'
equipment.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
v6 - simplify hash function
     allow configuring forwarding table limit and ageing timer
     account for VLAN in header
     adjust mtu based on lower device (if any)
     fix fields in fdb_show

v5 - drop MTU discovery since network is overlaid
     use common code to do ECN decapsulation
v4 - fix ecn and set state of fdb entries
v3 - fix ordering of change versus migration message
v2 - fix use of ip header after pskb_may_pull
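
To make the hash-function note above concrete, here is a minimal
user-space sketch of folding a 6-byte Ethernet address into an 8-bit
bucket index. It is not the driver's eth_hash(): it assembles the six
bytes explicitly, so it reads nothing past the address and behaves the
same on either byte order, and the multiplier is an assumed constant
rather than the kernel's hash_64(). mac_bucket and SKETCH_HASH_BITS are
hypothetical names.

```c
#include <stdint.h>

#define SKETCH_HASH_BITS 8	/* same width as FDB_HASH_BITS in the patch */

/* Fold a 6-byte Ethernet address into a SKETCH_HASH_BITS-bit bucket index. */
static uint32_t mac_bucket(const uint8_t mac[6])
{
	uint64_t value = 0;
	int i;

	/* Assemble exactly six bytes; no bytes beyond the address are touched. */
	for (i = 0; i < 6; i++)
		value = (value << 8) | mac[i];

	/* Multiplicative hash; the constant is an assumption, not hash_64(). */
	value *= 0x9E3779B97F4A7C15ULL;

	/* Keep the top bits, which mix in all six input bytes. */
	return (uint32_t)(value >> (64 - SKETCH_HASH_BITS));
}
```

The result always falls in the range 0 to 255, so it can index a
256-entry array of hash chains directly.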

 Documentation/networking/vxlan.txt |   47 +
 drivers/net/Kconfig                |   13 
 drivers/net/Makefile               |    1 
 drivers/net/vxlan.c                | 1217 +++++++++++++++++++++++++++++++++++++
 include/linux/if_link.h            |   16 
 5 files changed, 1294 insertions(+)
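
For orientation before reading the diff, the 8-byte VXLAN header layout
used by the receive path can be sketched in user space as below. The
decode side follows the checks in vxlan_udp_encap_recv() (flags must
equal 0x08000000, the low byte of the VNI word must be zero, and the VNI
sits in the upper 24 bits). The encode side is an assumption made
for symmetry. struct vxlan_wire_hdr, vxlan_hdr_encode and
vxlan_hdr_decode are hypothetical names.

```c
#include <arpa/inet.h>
#include <stdint.h>

#define SKETCH_VXLAN_FLAGS 0x08000000u	/* same value as VXLAN_FLAGS in the patch */

/* Wire layout mirrored from struct vxlanhdr; both words are big-endian. */
struct vxlan_wire_hdr {
	uint32_t vx_flags;
	uint32_t vx_vni;	/* VNI in the upper 24 bits, low 8 bits reserved */
};

/* Fill in a header for the given 24-bit VNI (extra high bits are masked off). */
static void vxlan_hdr_encode(struct vxlan_wire_hdr *h, uint32_t vni)
{
	h->vx_flags = htonl(SKETCH_VXLAN_FLAGS);
	h->vx_vni = htonl((vni & 0xffffffu) << 8);
}

/* Returns 0 and stores the VNI, or -1 if the flags or reserved bits are wrong. */
static int vxlan_hdr_decode(const struct vxlan_wire_hdr *h, uint32_t *vni)
{
	if (h->vx_flags != htonl(SKETCH_VXLAN_FLAGS) ||
	    (h->vx_vni & htonl(0xff)))
		return -1;

	*vni = ntohl(h->vx_vni) >> 8;
	return 0;
}
```

Encoding VNI 42 this way produces the bytes 08 00 00 00 00 00 2a 00 on
the wire, independent of host byte order.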

--- a/drivers/net/Kconfig	2012-10-01 15:08:35.640522830 -0700
+++ b/drivers/net/Kconfig	2012-10-01 15:08:38.024499080 -0700
@@ -149,6 +149,19 @@ config MACVTAP
 	  To compile this driver as a module, choose M here: the module
 	  will be called macvtap.
 
+config VXLAN
+       tristate "Virtual eXtensible Local Area Network (VXLAN)"
+       depends on EXPERIMENTAL
+       ---help---
+	  This allows one to create vxlan virtual interfaces that provide
+	  Layer 2 Networks over Layer 3 Networks. VXLAN is often used
+	  to tunnel virtual network infrastructure in virtualized environments.
+	  For more information see:
+	    http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called vxlan.
+
 config NETCONSOLE
 	tristate "Network console logging support"
 	---help---
--- a/drivers/net/Makefile	2012-10-01 15:08:35.640522830 -0700
+++ b/drivers/net/Makefile	2012-10-01 15:08:38.024499080 -0700
@@ -21,6 +21,7 @@ obj-$(CONFIG_NET_TEAM) += team/
 obj-$(CONFIG_TUN) += tun.o
 obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VXLAN) += vxlan.o
 
 #
 # Networking Drivers
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/drivers/net/vxlan.c	2012-10-01 15:08:38.024499080 -0700
@@ -0,0 +1,1217 @@
+/*
+ * VXLAN: Virtual eXtensible Local Area Network
+ *
+ * Copyright (c) 2012 Vyatta Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * TODO
+ *  - use IANA UDP port number (when defined)
+ *  - IPv6 (not in RFC)
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/rculist.h>
+#include <linux/netdevice.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/igmp.h>
+#include <linux/etherdevice.h>
+#include <linux/if_ether.h>
+#include <linux/version.h>
+#include <linux/hash.h>
+#include <net/ip.h>
+#include <net/icmp.h>
+#include <net/udp.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/dsfield.h>
+#include <net/inet_ecn.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#define VXLAN_VERSION	"0.1"
+
+#define VNI_HASH_BITS	10
+#define VNI_HASH_SIZE	(1<<VNI_HASH_BITS)
+#define FDB_HASH_BITS	8
+#define FDB_HASH_SIZE	(1<<FDB_HASH_BITS)
+#define FDB_AGE_DEFAULT 300 /* 5 min */
+#define FDB_AGE_INTERVAL (10 * HZ)	/* rescan interval */
+
+#define VXLAN_N_VID	(1u << 24)
+#define VXLAN_VID_MASK	(VXLAN_N_VID - 1)
+/* VLAN + IP header + UDP + VXLAN */
+#define VXLAN_HEADROOM (4 + 20 + 8 + 8)
+
+#define VXLAN_FLAGS 0x08000000	/* struct vxlanhdr.vx_flags required value. */
+
+/* VXLAN protocol header */
+struct vxlanhdr {
+	__be32 vx_flags;
+	__be32 vx_vni;
+};
+
+/* UDP port for VXLAN traffic. */
+static unsigned int vxlan_port __read_mostly = 8472;
+module_param_named(udp_port, vxlan_port, uint, 0444);
+MODULE_PARM_DESC(udp_port, "Destination UDP port");
+
+static bool log_ecn_error = true;
+module_param(log_ecn_error, bool, 0644);
+MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
+
+/* per-net private data for this module */
+static unsigned int vxlan_net_id;
+struct vxlan_net {
+	struct socket	  *sock;	/* UDP encap socket */
+	struct hlist_head vni_list[VNI_HASH_SIZE];
+};
+
+/* Forwarding table entry */
+struct vxlan_fdb {
+	struct hlist_node hlist;	/* linked list of entries */
+	struct rcu_head	  rcu;
+	unsigned long	  updated;	/* jiffies */
+	unsigned long	  used;
+	__be32		  remote_ip;
+	u16		  state;	/* see ndm_state */
+	u8		  eth_addr[ETH_ALEN];
+};
+
+/* Per-cpu network traffic stats */
+struct vxlan_stats {
+	u64			rx_packets;
+	u64			rx_bytes;
+	u64			tx_packets;
+	u64			tx_bytes;
+	struct u64_stats_sync	syncp;
+};
+
+/* Pseudo network device */
+struct vxlan_dev {
+	struct hlist_node hlist;
+	struct net_device *dev;
+	struct vxlan_stats __percpu *stats;
+	__u32		  vni;		/* virtual network id */
+	__be32	          gaddr;	/* multicast group */
+	__be32		  saddr;	/* source address */
+	unsigned int      link;		/* link to multicast over */
+	__u8		  tos;		/* TOS override */
+	__u8		  ttl;
+	bool		  learn;
+
+	unsigned long	  age_interval;
+	struct timer_list age_timer;
+	spinlock_t	  hash_lock;
+	unsigned int	  addrcnt;
+	unsigned int	  addrmax;
+	unsigned int	  addrexceeded;
+
+	struct hlist_head fdb_head[FDB_HASH_SIZE];
+};
+
+/* salt for hash table */
+static u32 vxlan_salt __read_mostly;
+
+static inline struct hlist_head *vni_head(struct net *net, u32 id)
+{
+	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
+
+	return &vn->vni_list[hash_32(id, VNI_HASH_BITS)];
+}
+
+/* Look up VNI in a per net namespace table */
+static struct vxlan_dev *vxlan_find_vni(struct net *net, u32 id)
+{
+	struct vxlan_dev *vxlan;
+	struct hlist_node *node;
+
+	hlist_for_each_entry_rcu(vxlan, node, vni_head(net, id), hlist) {
+		if (vxlan->vni == id)
+			return vxlan;
+	}
+
+	return NULL;
+}
+
+/* Fill in neighbour message in skbuff. */
+static int vxlan_fdb_info(struct sk_buff *skb, struct vxlan_dev *vxlan,
+			   const struct vxlan_fdb *fdb,
+			   u32 portid, u32 seq, int type, unsigned int flags)
+{
+	unsigned long now = jiffies;
+	struct nda_cacheinfo ci;
+	struct nlmsghdr *nlh;
+	struct ndmsg *ndm;
+
+	nlh = nlmsg_put(skb, portid, seq, type, sizeof(*ndm), flags);
+	if (nlh == NULL)
+		return -EMSGSIZE;
+
+	ndm = nlmsg_data(nlh);
+	memset(ndm, 0, sizeof(*ndm));
+	ndm->ndm_family	= AF_BRIDGE;
+	ndm->ndm_state = fdb->state;
+	ndm->ndm_ifindex = vxlan->dev->ifindex;
+	ndm->ndm_flags = NTF_SELF;
+	ndm->ndm_type = NDA_DST;
+
+	if (nla_put(skb, NDA_LLADDR, ETH_ALEN, &fdb->eth_addr))
+		goto nla_put_failure;
+
+	if (nla_put_be32(skb, NDA_DST, fdb->remote_ip))
+		goto nla_put_failure;
+
+	ci.ndm_used	 = jiffies_to_clock_t(now - fdb->used);
+	ci.ndm_confirmed = 0;
+	ci.ndm_updated	 = jiffies_to_clock_t(now - fdb->updated);
+	ci.ndm_refcnt	 = 0;
+
+	if (nla_put(skb, NDA_CACHEINFO, sizeof(ci), &ci))
+		goto nla_put_failure;
+
+	return nlmsg_end(skb, nlh);
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static inline size_t vxlan_nlmsg_size(void)
+{
+	return NLMSG_ALIGN(sizeof(struct ndmsg))
+		+ nla_total_size(ETH_ALEN) /* NDA_LLADDR */
+		+ nla_total_size(sizeof(__be32)) /* NDA_DST */
+		+ nla_total_size(sizeof(struct nda_cacheinfo));
+}
+
+static void vxlan_fdb_notify(struct vxlan_dev *vxlan,
+			     const struct vxlan_fdb *fdb, int type)
+{
+	struct net *net = dev_net(vxlan->dev);
+	struct sk_buff *skb;
+	int err = -ENOBUFS;
+
+	skb = nlmsg_new(vxlan_nlmsg_size(), GFP_ATOMIC);
+	if (skb == NULL)
+		goto errout;
+
+	err = vxlan_fdb_info(skb, vxlan, fdb, 0, 0, type, 0);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in vxlan_nlmsg_size() */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(skb);
+		goto errout;
+	}
+
+	rtnl_notify(skb, net, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
+	return;
+errout:
+	if (err < 0)
+		rtnl_set_sk_err(net, RTNLGRP_NEIGH, err);
+}
+
+/* Hash Ethernet address */
+static u32 eth_hash(const unsigned char *addr)
+{
+	u64 value = get_unaligned((u64 *)addr);
+
+	/* only want 6 bytes: drop the 2 bytes read past the address */
+#ifdef __BIG_ENDIAN
+	value >>= 16;
+#else
+	value <<= 16;
+#endif
+	return hash_64(value, FDB_HASH_BITS);
+}
+
+/* Hash chain to use given mac address */
+static inline struct hlist_head *vxlan_fdb_head(struct vxlan_dev *vxlan,
+						const u8 *mac)
+{
+	return &vxlan->fdb_head[eth_hash(mac)];
+}
+
+/* Look up Ethernet address in forwarding table */
+static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev *vxlan,
+					const u8 *mac)
+
+{
+	struct hlist_head *head = vxlan_fdb_head(vxlan, mac);
+	struct vxlan_fdb *f;
+	struct hlist_node *node;
+
+	hlist_for_each_entry_rcu(f, node, head, hlist) {
+		if (compare_ether_addr(mac, f->eth_addr) == 0)
+			return f;
+	}
+
+	return NULL;
+}
+
+/* Add new entry to forwarding table -- assumes lock held */
+static int vxlan_fdb_create(struct vxlan_dev *vxlan,
+			    const u8 *mac, __be32 ip,
+			    __u16 state, __u16 flags)
+{
+	struct vxlan_fdb *f;
+	int notify = 0;
+
+	f = vxlan_find_mac(vxlan, mac);
+	if (f) {
+		if (flags & NLM_F_EXCL) {
+			netdev_dbg(vxlan->dev,
+				   "lost race to create %pM\n", mac);
+			return -EEXIST;
+		}
+		if (f->state != state) {
+			f->state = state;
+			f->updated = jiffies;
+			notify = 1;
+		}
+	} else {
+		if (!(flags & NLM_F_CREATE))
+			return -ENOENT;
+
+		if (vxlan->addrmax && vxlan->addrcnt >= vxlan->addrmax)
+			return -ENOSPC;
+
+		netdev_dbg(vxlan->dev, "add %pM -> %pI4\n", mac, &ip);
+		f = kmalloc(sizeof(*f), GFP_ATOMIC);
+		if (!f)
+			return -ENOMEM;
+
+		notify = 1;
+		f->remote_ip = ip;
+		f->state = state;
+		f->updated = f->used = jiffies;
+		memcpy(f->eth_addr, mac, ETH_ALEN);
+
+		++vxlan->addrcnt;
+		hlist_add_head_rcu(&f->hlist,
+				   vxlan_fdb_head(vxlan, mac));
+	}
+
+	if (notify)
+		vxlan_fdb_notify(vxlan, f, RTM_NEWNEIGH);
+
+	return 0;
+}
+
+static void vxlan_fdb_destroy(struct vxlan_dev *vxlan, struct vxlan_fdb *f)
+{
+	netdev_dbg(vxlan->dev,
+		    "delete %pM\n", f->eth_addr);
+
+	--vxlan->addrcnt;
+	vxlan_fdb_notify(vxlan, f, RTM_DELNEIGH);
+
+	hlist_del_rcu(&f->hlist);
+	kfree_rcu(f, rcu);
+}
+
+/* Add static entry (via netlink) */
+static int vxlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
+			 struct net_device *dev,
+			 const unsigned char *addr, u16 flags)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	__be32 ip;
+	int err;
+
+	if (!(ndm->ndm_state & (NUD_PERMANENT|NUD_REACHABLE))) {
+		pr_info("RTM_NEWNEIGH with invalid state %#x\n",
+			ndm->ndm_state);
+		return -EINVAL;
+	}
+
+	if (tb[NDA_DST] == NULL)
+		return -EINVAL;
+
+	if (nla_len(tb[NDA_DST]) != sizeof(__be32))
+		return -EAFNOSUPPORT;
+
+	ip = nla_get_be32(tb[NDA_DST]);
+
+	spin_lock_bh(&vxlan->hash_lock);
+	err = vxlan_fdb_create(vxlan, addr, ip, ndm->ndm_state, flags);
+	spin_unlock_bh(&vxlan->hash_lock);
+
+	return err;
+}
+
+/* Delete entry (via netlink) */
+static int vxlan_fdb_delete(struct ndmsg *ndm, struct net_device *dev,
+			    const unsigned char *addr)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_fdb *f;
+	int err = -ENOENT;
+
+	spin_lock_bh(&vxlan->hash_lock);
+	f = vxlan_find_mac(vxlan, addr);
+	if (f) {
+		vxlan_fdb_destroy(vxlan, f);
+		err = 0;
+	}
+	spin_unlock_bh(&vxlan->hash_lock);
+
+	return err;
+}
+
+/* Dump forwarding table */
+static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
+			  struct net_device *dev, int idx)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	unsigned int h;
+
+	for (h = 0; h < FDB_HASH_SIZE; ++h) {
+		struct vxlan_fdb *f;
+		struct hlist_node *n;
+		int err;
+
+		hlist_for_each_entry_rcu(f, n, &vxlan->fdb_head[h], hlist) {
+			if (idx < cb->args[0])
+				goto skip;
+
+			err = vxlan_fdb_info(skb, vxlan, f,
+					     NETLINK_CB(cb->skb).portid,
+					     cb->nlh->nlmsg_seq,
+					     RTM_NEWNEIGH,
+					     NLM_F_MULTI);
+			if (err < 0)
+				break;
+skip:
+			++idx;
+		}
+	}
+
+	return idx;
+}
+
+/* Watch incoming packets to learn mapping between Ethernet address
+ * and Tunnel endpoint.
+ */
+static void vxlan_snoop(struct net_device *dev,
+			__be32 src_ip, const u8 *src_mac)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_fdb *f;
+	int err;
+
+	f = vxlan_find_mac(vxlan, src_mac);
+	if (likely(f)) {
+		f->used = jiffies;
+		if (likely(f->remote_ip == src_ip))
+			return;
+
+		if (net_ratelimit())
+			netdev_info(dev,
+				    "%pM migrated from %pI4 to %pI4\n",
+				    src_mac, &f->remote_ip, &src_ip);
+
+		f->remote_ip = src_ip;
+		f->updated = jiffies;
+	} else {
+		/* learned new entry */
+		spin_lock(&vxlan->hash_lock);
+		err = vxlan_fdb_create(vxlan, src_mac, src_ip,
+				       NUD_REACHABLE,
+				       NLM_F_EXCL|NLM_F_CREATE);
+		spin_unlock(&vxlan->hash_lock);
+	}
+}
+
+
+/* See if multicast group is already in use by other ID */
+static bool vxlan_group_used(struct vxlan_net *vn,
+			     const struct vxlan_dev *this)
+{
+	const struct vxlan_dev *vxlan;
+	struct hlist_node *node;
+	unsigned h;
+
+	for (h = 0; h < VNI_HASH_SIZE; ++h)
+		hlist_for_each_entry(vxlan, node, &vn->vni_list[h], hlist) {
+			if (vxlan == this)
+				continue;
+
+			if (!netif_running(vxlan->dev))
+				continue;
+
+			if (vxlan->gaddr == this->gaddr)
+				return true;
+		}
+
+	return false;
+}
+
+/* kernel equivalent to IP_ADD_MEMBERSHIP */
+static int vxlan_join_group(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
+	struct sock *sk = vn->sock->sk;
+	struct ip_mreqn mreq = {
+		.imr_multiaddr.s_addr = vxlan->gaddr,
+	};
+	int err;
+
+	/* Already a member of group */
+	if (vxlan_group_used(vn, vxlan))
+		return 0;
+
+	/* Need to drop RTNL to call multicast join */
+	rtnl_unlock();
+	lock_sock(sk);
+	err = ip_mc_join_group(sk, &mreq);
+	release_sock(sk);
+	rtnl_lock();
+
+	return err;
+}
+
+
+/* kernel equivalent to IP_DROP_MEMBERSHIP */
+static int vxlan_leave_group(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
+	int err = 0;
+	struct sock *sk = vn->sock->sk;
+	struct ip_mreqn mreq = {
+		.imr_multiaddr.s_addr = vxlan->gaddr,
+	};
+
+	/* Only leave group when last vxlan is done. */
+	if (vxlan_group_used(vn, vxlan))
+		return 0;
+
+	/* Need to drop RTNL to call multicast leave */
+	rtnl_unlock();
+	lock_sock(sk);
+	err = ip_mc_leave_group(sk, &mreq);
+	release_sock(sk);
+	rtnl_lock();
+
+	return err;
+}
+
+/* Callback from net/ipv4/udp.c to receive packets */
+static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
+{
+	struct iphdr *oip;
+	struct vxlanhdr *vxh;
+	struct vxlan_dev *vxlan;
+	struct vxlan_stats *stats;
+	__u32 vni;
+	int err;
+
+	/* pop off outer UDP header */
+	__skb_pull(skb, sizeof(struct udphdr));
+
+	/* Need Vxlan and inner Ethernet header to be present */
+	if (!pskb_may_pull(skb, sizeof(struct vxlanhdr)))
+		goto error;
+
+	/* Drop packets with reserved bits set */
+	vxh = (struct vxlanhdr *) skb->data;
+	if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
+	    (vxh->vx_vni & htonl(0xff))) {
+		netdev_dbg(skb->dev, "invalid vxlan flags=%#x vni=%#x\n",
+			   ntohl(vxh->vx_flags), ntohl(vxh->vx_vni));
+		goto error;
+	}
+
+	__skb_pull(skb, sizeof(struct vxlanhdr));
+	skb_postpull_rcsum(skb, eth_hdr(skb), sizeof(struct vxlanhdr));
+
+	/* Is this VNI defined? */
+	vni = ntohl(vxh->vx_vni) >> 8;
+	vxlan = vxlan_find_vni(sock_net(sk), vni);
+	if (!vxlan) {
+		netdev_dbg(skb->dev, "unknown vni %d\n", vni);
+		goto drop;
+	}
+
+	if (!pskb_may_pull(skb, ETH_HLEN)) {
+		vxlan->dev->stats.rx_length_errors++;
+		vxlan->dev->stats.rx_errors++;
+		goto drop;
+	}
+
+	/* Re-examine inner Ethernet packet */
+	oip = ip_hdr(skb);
+	skb->protocol = eth_type_trans(skb, vxlan->dev);
+	skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
+
+	/* Ignore packet loops (and multicast echo) */
+	if (compare_ether_addr(eth_hdr(skb)->h_source,
+			       vxlan->dev->dev_addr) == 0)
+		goto drop;
+
+	if (vxlan->learn)
+		vxlan_snoop(skb->dev, oip->saddr, eth_hdr(skb)->h_source);
+
+	__skb_tunnel_rx(skb, vxlan->dev);
+	skb_reset_network_header(skb);
+
+	err = IP_ECN_decapsulate(oip, skb);
+	if (unlikely(err)) {
+		if (log_ecn_error)
+			net_info_ratelimited("non-ECT from %pI4 with TOS=%#x\n",
+					     &oip->saddr, oip->tos);
+		if (err > 1) {
+			++vxlan->dev->stats.rx_frame_errors;
+			++vxlan->dev->stats.rx_errors;
+			goto drop;
+		}
+	}
+
+	stats = this_cpu_ptr(vxlan->stats);
+	u64_stats_update_begin(&stats->syncp);
+	stats->rx_packets++;
+	stats->rx_bytes += skb->len;
+	u64_stats_update_end(&stats->syncp);
+
+	netif_rx(skb);
+
+	return 0;
+error:
+	/* Put UDP header back */
+	__skb_push(skb, sizeof(struct udphdr));
+
+	return 1;
+drop:
+	/* Consume bad packet */
+	kfree_skb(skb);
+	return 0;
+}
+
+/* Extract dsfield from inner protocol */
+static inline u8 vxlan_get_dsfield(const struct iphdr *iph,
+				   const struct sk_buff *skb)
+{
+	if (skb->protocol == htons(ETH_P_IP))
+		return iph->tos;
+	else if (skb->protocol == htons(ETH_P_IPV6))
+		return ipv6_get_dsfield((const struct ipv6hdr *)iph);
+	else
+		return 0;
+}
+
+/* Propagate ECN bits out */
+static inline u8 vxlan_ecn_encap(u8 tos,
+				 const struct iphdr *iph,
+				 const struct sk_buff *skb)
+{
+	u8 inner = vxlan_get_dsfield(iph, skb);
+
+	return INET_ECN_encapsulate(tos, inner);
+}
+
+/* Transmit local packets over Vxlan
+ *
+ * Outer IP header inherits ECN and DF from inner header.
+ * Outer UDP destination is the VXLAN assigned port.
+ *           source port is based on hash of flow if available
+ *                       otherwise use a random value
+ */
+static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct rtable *rt;
+	const struct ethhdr *eth;
+	const struct iphdr *old_iph;
+	struct iphdr *iph;
+	struct vxlanhdr *vxh;
+	struct udphdr *uh;
+	struct flowi4 fl4;
+	struct vxlan_fdb *f;
+	unsigned int pkt_len = skb->len;
+	u32 hash;
+	__be32 dst;
+	__be16 df = 0;
+	__u8 tos, ttl;
+	int err;
+
+	/* Need space for new headers (invalidates iph ptr) */
+	if (skb_cow_head(skb, VXLAN_HEADROOM))
+		goto drop;
+
+	eth = (void *)skb->data;
+	old_iph = ip_hdr(skb);
+
+	if (!is_multicast_ether_addr(eth->h_dest) &&
+	    (f = vxlan_find_mac(vxlan, eth->h_dest)))
+		dst = f->remote_ip;
+	else if (vxlan->gaddr) {
+		dst = vxlan->gaddr;
+	} else
+		goto drop;
+
+	ttl = vxlan->ttl;
+	if (!ttl && IN_MULTICAST(ntohl(dst)))
+		ttl = 1;
+
+	tos = vxlan->tos;
+	if (tos == 1)
+		tos = vxlan_get_dsfield(old_iph, skb);
+
+	hash = skb_get_rxhash(skb);
+
+	rt = ip_route_output_gre(dev_net(dev), &fl4, dst,
+				 vxlan->saddr, vxlan->vni,
+				 RT_TOS(tos), vxlan->link);
+	if (IS_ERR(rt)) {
+		netdev_dbg(dev, "no route to %pI4\n", &dst);
+		dev->stats.tx_carrier_errors++;
+		goto tx_error;
+	}
+
+	if (rt->dst.dev == dev) {
+		netdev_dbg(dev, "circular route to %pI4\n", &dst);
+		ip_rt_put(rt);
+		dev->stats.collisions++;
+		goto tx_error;
+	}
+
+	memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
+	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED |
+			      IPSKB_REROUTED);
+	skb_dst_drop(skb);
+	skb_dst_set(skb, &rt->dst);
+
+	vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
+	vxh->vx_flags = htonl(VXLAN_FLAGS);
+	vxh->vx_vni = htonl(vxlan->vni << 8);
+
+	__skb_push(skb, sizeof(*uh));
+	skb_reset_transport_header(skb);
+	uh = udp_hdr(skb);
+
+	uh->dest = htons(vxlan_port);
+	uh->source = hash ? :random32();
+
+	uh->len = htons(skb->len);
+	uh->check = 0;
+
+	__skb_push(skb, sizeof(*iph));
+	skb_reset_network_header(skb);
+	iph		= ip_hdr(skb);
+	iph->version	= 4;
+	iph->ihl	= sizeof(struct iphdr) >> 2;
+	iph->frag_off	= df;
+	iph->protocol	= IPPROTO_UDP;
+	iph->tos	= vxlan_ecn_encap(tos, old_iph, skb);
+	iph->daddr	= fl4.daddr;
+	iph->saddr	= fl4.saddr;
+	iph->ttl	= ttl ? : ip4_dst_hoplimit(&rt->dst);
+
+	/* See __IPTUNNEL_XMIT */
+	skb->ip_summed = CHECKSUM_NONE;
+	ip_select_ident(iph, &rt->dst, NULL);
+
+	err = ip_local_out(skb);
+	if (likely(net_xmit_eval(err) == 0)) {
+		struct vxlan_stats *stats = this_cpu_ptr(vxlan->stats);
+
+		u64_stats_update_begin(&stats->syncp);
+		stats->tx_packets++;
+		stats->tx_bytes += pkt_len;
+		u64_stats_update_end(&stats->syncp);
+	} else {
+		dev->stats.tx_errors++;
+		dev->stats.tx_aborted_errors++;
+	}
+	return NETDEV_TX_OK;
+
+drop:
+	dev->stats.tx_dropped++;
+	goto tx_free;
+
+tx_error:
+	dev->stats.tx_errors++;
+tx_free:
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+/* Walk the forwarding table and purge stale entries */
+static void vxlan_cleanup(unsigned long arg)
+{
+	struct vxlan_dev *vxlan = (struct vxlan_dev *) arg;
+	unsigned long next_timer = jiffies + FDB_AGE_INTERVAL;
+	unsigned int h;
+
+	if (!netif_running(vxlan->dev))
+		return;
+
+	spin_lock_bh(&vxlan->hash_lock);
+	for (h = 0; h < FDB_HASH_SIZE; ++h) {
+		struct hlist_node *p, *n;
+		hlist_for_each_safe(p, n, &vxlan->fdb_head[h]) {
+			struct vxlan_fdb *f
+				= container_of(p, struct vxlan_fdb, hlist);
+			unsigned long timeout;
+
+			if (f->state == NUD_PERMANENT)
+				continue;
+
+			timeout = f->used + vxlan->age_interval * HZ;
+			if (time_before_eq(timeout, jiffies)) {
+				netdev_dbg(vxlan->dev,
+					   "garbage collect %pM\n",
+					   f->eth_addr);
+				f->state = NUD_STALE;
+				vxlan_fdb_destroy(vxlan, f);
+			} else if (time_before(timeout, next_timer))
+				next_timer = timeout;
+		}
+	}
+	spin_unlock_bh(&vxlan->hash_lock);
+
+	mod_timer(&vxlan->age_timer, next_timer);
+}
+
+/* Setup stats when device is created */
+static int vxlan_init(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	vxlan->stats = alloc_percpu(struct vxlan_stats);
+	if (!vxlan->stats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/* Start ageing timer and join group when device is brought up */
+static int vxlan_open(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	int err;
+
+	if (vxlan->gaddr) {
+		err = vxlan_join_group(dev);
+		if (err)
+			return err;
+	}
+
+	if (vxlan->age_interval)
+		mod_timer(&vxlan->age_timer, jiffies + FDB_AGE_INTERVAL);
+
+	return 0;
+}
+
+/* Purge the forwarding table */
+static void vxlan_flush(struct vxlan_dev *vxlan)
+{
+	unsigned h;
+
+	spin_lock_bh(&vxlan->hash_lock);
+	for (h = 0; h < FDB_HASH_SIZE; ++h) {
+		struct hlist_node *p, *n;
+		hlist_for_each_safe(p, n, &vxlan->fdb_head[h]) {
+			struct vxlan_fdb *f
+				= container_of(p, struct vxlan_fdb, hlist);
+			vxlan_fdb_destroy(vxlan, f);
+		}
+	}
+	spin_unlock_bh(&vxlan->hash_lock);
+}
+
+/* Cleanup timer and forwarding table on shutdown */
+static int vxlan_stop(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	if (vxlan->gaddr)
+		vxlan_leave_group(dev);
+
+	del_timer_sync(&vxlan->age_timer);
+
+	vxlan_flush(vxlan);
+
+	return 0;
+}
+
+/* Merge per-cpu statistics */
+static struct rtnl_link_stats64 *vxlan_stats64(struct net_device *dev,
+					       struct rtnl_link_stats64 *stats)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	struct vxlan_stats tmp, sum = { 0 };
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		unsigned int start;
+		const struct vxlan_stats *stats
+			= per_cpu_ptr(vxlan->stats, cpu);
+
+		do {
+			start = u64_stats_fetch_begin_bh(&stats->syncp);
+			memcpy(&tmp, stats, sizeof(tmp));
+		} while (u64_stats_fetch_retry_bh(&stats->syncp, start));
+
+		sum.tx_bytes   += tmp.tx_bytes;
+		sum.tx_packets += tmp.tx_packets;
+		sum.rx_bytes   += tmp.rx_bytes;
+		sum.rx_packets += tmp.rx_packets;
+	}
+
+	stats->tx_bytes   = sum.tx_bytes;
+	stats->tx_packets = sum.tx_packets;
+	stats->rx_bytes   = sum.rx_bytes;
+	stats->rx_packets = sum.rx_packets;
+
+	stats->multicast = dev->stats.multicast;
+	stats->rx_length_errors = dev->stats.rx_length_errors;
+	stats->rx_frame_errors = dev->stats.rx_frame_errors;
+	stats->rx_errors = dev->stats.rx_errors;
+
+	stats->tx_dropped = dev->stats.tx_dropped;
+	stats->tx_carrier_errors  = dev->stats.tx_carrier_errors;
+	stats->tx_aborted_errors  = dev->stats.tx_aborted_errors;
+	stats->collisions  = dev->stats.collisions;
+	stats->tx_errors = dev->stats.tx_errors;
+
+	return stats;
+}
+
+/* Stub, nothing needs to be done. */
+static void vxlan_set_multicast_list(struct net_device *dev)
+{
+}
+
+static const struct net_device_ops vxlan_netdev_ops = {
+	.ndo_init		= vxlan_init,
+	.ndo_open		= vxlan_open,
+	.ndo_stop		= vxlan_stop,
+	.ndo_start_xmit		= vxlan_xmit,
+	.ndo_get_stats64	= vxlan_stats64,
+	.ndo_set_rx_mode	= vxlan_set_multicast_list,
+	.ndo_change_mtu		= eth_change_mtu,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_set_mac_address	= eth_mac_addr,
+	.ndo_fdb_add		= vxlan_fdb_add,
+	.ndo_fdb_del		= vxlan_fdb_delete,
+	.ndo_fdb_dump		= vxlan_fdb_dump,
+};
+
+/* Info for udev, that this is a virtual tunnel endpoint */
+static struct device_type vxlan_type = {
+	.name = "vxlan",
+};
+
+static void vxlan_free(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	free_percpu(vxlan->stats);
+	free_netdev(dev);
+}
+
+/* Initialize the device structure. */
+static void vxlan_setup(struct net_device *dev)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	unsigned h;
+
+	eth_hw_addr_random(dev);
+	ether_setup(dev);
+
+	dev->netdev_ops = &vxlan_netdev_ops;
+	dev->destructor = vxlan_free;
+	SET_NETDEV_DEVTYPE(dev, &vxlan_type);
+
+	dev->tx_queue_len = 0;
+	dev->features	|= NETIF_F_LLTX;
+	dev->features	|= NETIF_F_NETNS_LOCAL;
+	dev->priv_flags	&= ~IFF_XMIT_DST_RELEASE;
+
+	spin_lock_init(&vxlan->hash_lock);
+
+	init_timer_deferrable(&vxlan->age_timer);
+	vxlan->age_timer.function = vxlan_cleanup;
+	vxlan->age_timer.data = (unsigned long) vxlan;
+
+	vxlan->dev = dev;
+
+	for (h = 0; h < FDB_HASH_SIZE; ++h)
+		INIT_HLIST_HEAD(&vxlan->fdb_head[h]);
+}
+
+static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
+	[IFLA_VXLAN_ID]		= { .type = NLA_U32 },
+	[IFLA_VXLAN_GROUP]	= { .len = FIELD_SIZEOF(struct iphdr, daddr) },
+	[IFLA_VXLAN_LINK]	= { .type = NLA_U32 },
+	[IFLA_VXLAN_LOCAL]	= { .len = FIELD_SIZEOF(struct iphdr, saddr) },
+	[IFLA_VXLAN_TOS]	= { .type = NLA_U8 },
+	[IFLA_VXLAN_TTL]	= { .type = NLA_U8 },
+	[IFLA_VXLAN_LEARNING]	= { .type = NLA_U8 },
+	[IFLA_VXLAN_AGEING]	= { .type = NLA_U32 },
+	[IFLA_VXLAN_LIMIT]	= { .type = NLA_U32 },
+};
+
+static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
+{
+	if (tb[IFLA_ADDRESS]) {
+		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN) {
+			pr_debug("invalid link address (not ethernet)\n");
+			return -EINVAL;
+		}
+
+		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS]))) {
+			pr_debug("invalid all zero ethernet address\n");
+			return -EADDRNOTAVAIL;
+		}
+	}
+
+	if (!data)
+		return -EINVAL;
+
+	if (data[IFLA_VXLAN_ID]) {
+		__u32 id = nla_get_u32(data[IFLA_VXLAN_ID]);
+		if (id >= VXLAN_VID_MASK)
+			return -ERANGE;
+	}
+
+	if (data[IFLA_VXLAN_GROUP]) {
+		__be32 gaddr = nla_get_be32(data[IFLA_VXLAN_GROUP]);
+		if (!IN_MULTICAST(ntohl(gaddr))) {
+			pr_debug("group address is not IPv4 multicast\n");
+			return -EADDRNOTAVAIL;
+		}
+	}
+	return 0;
+}
+
+static int vxlan_newlink(struct net *net, struct net_device *dev,
+			 struct nlattr *tb[], struct nlattr *data[])
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+	__u32 vni;
+	int err;
+
+	if (!data[IFLA_VXLAN_ID])
+		return -EINVAL;
+
+	vni = nla_get_u32(data[IFLA_VXLAN_ID]);
+	if (vxlan_find_vni(net, vni)) {
+		pr_info("duplicate VNI %u\n", vni);
+		return -EEXIST;
+	}
+	vxlan->vni = vni;
+
+	if (data[IFLA_VXLAN_GROUP])
+		vxlan->gaddr = nla_get_be32(data[IFLA_VXLAN_GROUP]);
+
+	if (data[IFLA_VXLAN_LOCAL])
+		vxlan->saddr = nla_get_be32(data[IFLA_VXLAN_LOCAL]);
+
+	if (data[IFLA_VXLAN_LINK]) {
+		vxlan->link = nla_get_u32(data[IFLA_VXLAN_LINK]);
+
+		if (!tb[IFLA_MTU]) {
+			struct net_device *lowerdev;
+			lowerdev = __dev_get_by_index(net, vxlan->link);
+			dev->mtu = lowerdev->mtu - VXLAN_HEADROOM;
+		}
+	}
+
+	if (data[IFLA_VXLAN_TOS])
+		vxlan->tos  = nla_get_u8(data[IFLA_VXLAN_TOS]);
+
+	if (!data[IFLA_VXLAN_LEARNING] || nla_get_u8(data[IFLA_VXLAN_LEARNING]))
+		vxlan->learn = true;
+
+	if (data[IFLA_VXLAN_AGEING])
+		vxlan->age_interval = nla_get_u32(data[IFLA_VXLAN_AGEING]);
+	else
+		vxlan->age_interval = FDB_AGE_DEFAULT;
+
+	if (data[IFLA_VXLAN_LIMIT])
+		vxlan->addrmax = nla_get_u32(data[IFLA_VXLAN_LIMIT]);
+
+	err = register_netdevice(dev);
+	if (!err)
+		hlist_add_head_rcu(&vxlan->hlist, vni_head(net, vxlan->vni));
+
+	return err;
+}
+
+static void vxlan_dellink(struct net_device *dev, struct list_head *head)
+{
+	struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	hlist_del_rcu(&vxlan->hlist);
+
+	unregister_netdevice_queue(dev, head);
+}
+
+static size_t vxlan_get_size(const struct net_device *dev)
+{
+
+	return nla_total_size(sizeof(__u32)) +	/* IFLA_VXLAN_ID */
+		nla_total_size(sizeof(__be32)) +/* IFLA_VXLAN_GROUP */
+		nla_total_size(sizeof(__u32)) +	/* IFLA_VXLAN_LINK */
+		nla_total_size(sizeof(__be32))+	/* IFLA_VXLAN_LOCAL */
+		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_TTL */
+		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_TOS */
+		nla_total_size(sizeof(__u8)) +	/* IFLA_VXLAN_LEARNING */
+		nla_total_size(sizeof(__u32)) +	/* IFLA_VXLAN_AGEING */
+		nla_total_size(sizeof(__u32)) +	/* IFLA_VXLAN_LIMIT */
+		0;
+}
+
+static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+	const struct vxlan_dev *vxlan = netdev_priv(dev);
+
+	if (nla_put_u32(skb, IFLA_VXLAN_ID, vxlan->vni))
+		goto nla_put_failure;
+
+	if (vxlan->gaddr && nla_put_be32(skb, IFLA_VXLAN_GROUP, vxlan->gaddr))
+		goto nla_put_failure;
+
+	if (vxlan->link && nla_put_u32(skb, IFLA_VXLAN_LINK, vxlan->link))
+		goto nla_put_failure;
+
+	if (vxlan->saddr && nla_put_be32(skb, IFLA_VXLAN_LOCAL, vxlan->saddr))
+		goto nla_put_failure;
+
+	if (nla_put_u8(skb, IFLA_VXLAN_TTL, vxlan->ttl) ||
+	    nla_put_u8(skb, IFLA_VXLAN_TOS, vxlan->tos) ||
+	    nla_put_u8(skb, IFLA_VXLAN_LEARNING, vxlan->learn) ||
+	    nla_put_u32(skb, IFLA_VXLAN_AGEING, vxlan->age_interval) ||
+	    nla_put_u32(skb, IFLA_VXLAN_LIMIT, vxlan->addrmax))
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	return -EMSGSIZE;
+}
+
+static struct rtnl_link_ops vxlan_link_ops __read_mostly = {
+	.kind		= "vxlan",
+	.maxtype	= IFLA_VXLAN_MAX,
+	.policy		= vxlan_policy,
+	.priv_size	= sizeof(struct vxlan_dev),
+	.setup		= vxlan_setup,
+	.validate	= vxlan_validate,
+	.newlink	= vxlan_newlink,
+	.dellink	= vxlan_dellink,
+	.get_size	= vxlan_get_size,
+	.fill_info	= vxlan_fill_info,
+};
+
+static __net_init int vxlan_init_net(struct net *net)
+{
+	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
+	struct sock *sk;
+	struct sockaddr_in vxlan_addr = {
+		.sin_family = AF_INET,
+		.sin_addr.s_addr = htonl(INADDR_ANY),
+	};
+	int rc;
+	unsigned h;
+
+	/* Create UDP socket for encapsulation receive. */
+	rc = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &vn->sock);
+	if (rc < 0) {
+		pr_debug("UDP socket create failed\n");
+		return rc;
+	}
+
+	vxlan_addr.sin_port = htons(vxlan_port);
+
+	rc = kernel_bind(vn->sock, (struct sockaddr *) &vxlan_addr,
+			 sizeof(vxlan_addr));
+	if (rc < 0) {
+		pr_debug("bind for UDP socket %pI4:%u (%d)\n",
+			 &vxlan_addr.sin_addr, ntohs(vxlan_addr.sin_port), rc);
+		sock_release(vn->sock);
+		vn->sock = NULL;
+		return rc;
+	}
+
+	/* Disable multicast loopback */
+	sk = vn->sock->sk;
+	inet_sk(sk)->mc_loop = 0;
+
+	/* Mark socket as an encapsulation socket. */
+	udp_sk(sk)->encap_type = 1;
+	udp_sk(sk)->encap_rcv = vxlan_udp_encap_recv;
+	udp_encap_enable();
+
+	for (h = 0; h < VNI_HASH_SIZE; ++h)
+		INIT_HLIST_HEAD(&vn->vni_list[h]);
+
+	return 0;
+}
+
+static __net_exit void vxlan_exit_net(struct net *net)
+{
+	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
+
+	if (vn->sock) {
+		sock_release(vn->sock);
+		vn->sock = NULL;
+	}
+}
+
+static struct pernet_operations vxlan_net_ops = {
+	.init = vxlan_init_net,
+	.exit = vxlan_exit_net,
+	.id   = &vxlan_net_id,
+	.size = sizeof(struct vxlan_net),
+};
+
+static int __init vxlan_init_module(void)
+{
+	int rc;
+
+	get_random_bytes(&vxlan_salt, sizeof(vxlan_salt));
+
+	rc = register_pernet_device(&vxlan_net_ops);
+	if (rc)
+		goto out1;
+
+	rc = rtnl_link_register(&vxlan_link_ops);
+	if (rc)
+		goto out2;
+
+	return 0;
+
+out2:
+	unregister_pernet_device(&vxlan_net_ops);
+out1:
+	return rc;
+}
+module_init(vxlan_init_module);
+
+static void __exit vxlan_cleanup_module(void)
+{
+	rtnl_link_unregister(&vxlan_link_ops);
+	unregister_pernet_device(&vxlan_net_ops);
+}
+module_exit(vxlan_cleanup_module);
+
+MODULE_LICENSE("GPL");
+MODULE_VERSION(VXLAN_VERSION);
+MODULE_AUTHOR("Stephen Hemminger <shemminger@vyatta.com>");
+MODULE_ALIAS_RTNL_LINK("vxlan");
--- a/include/linux/if_link.h	2012-10-01 15:08:35.640522830 -0700
+++ b/include/linux/if_link.h	2012-10-01 15:08:38.024499080 -0700
@@ -272,6 +272,22 @@ enum macvlan_mode {
 
 #define MACVLAN_FLAG_NOPROMISC	1
 
+/* VXLAN section */
+enum {
+	IFLA_VXLAN_UNSPEC,
+	IFLA_VXLAN_ID,
+	IFLA_VXLAN_GROUP,
+	IFLA_VXLAN_LINK,
+	IFLA_VXLAN_LOCAL,
+	IFLA_VXLAN_TTL,
+	IFLA_VXLAN_TOS,
+	IFLA_VXLAN_LEARNING,
+	IFLA_VXLAN_AGEING,
+	IFLA_VXLAN_LIMIT,
+	__IFLA_VXLAN_MAX
+};
+#define IFLA_VXLAN_MAX	(__IFLA_VXLAN_MAX - 1)
+
 /* SR-IOV virtual function management section */
 
 enum {
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/Documentation/networking/vxlan.txt	2012-10-01 15:08:38.024499080 -0700
@@ -0,0 +1,47 @@
+Virtual eXtensible Local Area Networking documentation
+======================================================
+
+The VXLAN protocol is a tunnelling protocol designed to solve the
+problem of the limited number of available VLANs (4096).
+With VXLAN the identifier is expanded to 24 bits.
+
+It is a draft RFC standard that is implemented by Cisco Nexus,
+VMware and Brocade. The protocol runs over UDP using a single
+destination port (still not standardized by IANA).
+This document describes the Linux kernel tunnel device;
+there is also an implementation of VXLAN for Open vSwitch.
+
+Unlike most tunnels, a VXLAN is a 1 to N network, not just point
+to point. A VXLAN device can either dynamically learn the IP address
+of the other end, in a manner similar to a learning bridge, or the
+forwarding entries can be configured statically.
+
+The management of vxlan is done in a similar fashion to its
+two closest neighbors, GRE and VLAN. Configuring VXLAN requires
+the version of iproute2 that matches the kernel release
+where VXLAN was first merged upstream.
+
+1. Create vxlan device
+  # ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1
+
+This creates a new device (vxlan0). The device uses the
+multicast group 239.1.1.1 over eth1 to handle packets where
+no entry is in the forwarding table.
+
+2. Delete vxlan device
+  # ip link delete vxlan0
+
+3. Show vxlan info
+  # ip -d link show vxlan0
+
+It is possible to create, destroy and display the vxlan
+forwarding table using the new bridge command.
+
+1. Create forwarding table entry
+  # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0
+
+2. Delete forwarding table entry
+  # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0
+
+3. Show forwarding table
+  # bridge fdb show dev vxlan0

^ permalink raw reply

* Re: [PATCH] tcp: sysctl for initial receive window
From: Yuchung Cheng @ 2012-10-01 22:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, brouer, netdev, nanditad
In-Reply-To: <1348660432.5093.353.camel@edumazet-glaptop>

On Wed, Sep 26, 2012 at 4:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-09-21 at 14:48 -0400, David Miller wrote:
>> From: Jesper Dangaard Brouer <brouer@redhat.com>
>> Date: Fri, 21 Sep 2012 20:32:06 +0200
>> > That would defeat the purpose of the patch.  Perhaps we could allow a
>> > sensible max... (but this max is already being controlled as described).
>>
>> Any new max which is truly sensible, could be the new default, and we
>> would apply the same amount of vetting for such a thing.
>
>
> We have in linux a very conservative and complex rwin control at the
> beginning of a TCP session, only for the very first packets,
> if applications are reasonably fast at draining their receive queue.
> (They mostly are)
>
> Last time I had to take a look (after truesize changes), I was kind of
> worried to not find a good reason why we were doing this.
>
> We now have :
>
> - rcvbuf autotuning, letting rwin grow up to 3MB or so
> - Better truesize tracking
> - global/cgroup tcp mem accounting/pressure
> - TCP coalescing to minimize the effect of bad citizen packets
>     (very low len/truesize ratio)
> - People tracking TCP stack inefficiencies and working on new CCs...
>    (An example is Joe Touch's I-D
> http://tools.ietf.org/html/draft-touch-tcpm-automatic-iw-03 that
> proposes increasing IW over a longer period of time, as opposed to
> revisiting constants every few years.)
> - ...
>
> TCP congestion control is controlled by the sender, driven by the ACK
> coming back from receiver, and initial rwin should not change CC at all,
> unless we deliberately constrain rwin to a too small value.
>
> We did the 3 -> 10 change only two years ago.
> And 3 was really too small even 5 years ago.
>
> Browsers had to open simultaneous sessions to the same server only to
> work around this limit, and they still do.
>
> I would just remove the 10 'hard constant', (but not so hard, since it
> was 3 only 2 years ago), and let tcp_rmem[1]/SO_RCVBUF decide the
> initial receive window.
I like this idea a lot. Got a patch for us to try?
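
To make the two options in the quoted proposal concrete, here is a minimal userspace sketch, not kernel code. It assumes the initial window is simply the usable receive space rounded down to a whole number of segments, and it ignores window scaling and the kernel's buffer overhead accounting. The function names and the `space` parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round the usable receive space down to a whole number of segments. */
static uint32_t round_down_to_mss(uint32_t space, uint32_t mss)
{
	assert(mss > 0);
	return space - space % mss;
}

/* Initial window with a fixed segment cap (the 'hard constant' case). */
static uint32_t init_rwin_capped(uint32_t space, uint32_t mss,
				 uint32_t max_segs)
{
	uint32_t win = round_down_to_mss(space, mss);
	uint32_t cap = mss * max_segs;

	return win < cap ? win : cap;
}

/* Initial window decided by the receive buffer alone (the proposal). */
static uint32_t init_rwin_from_rcvbuf(uint32_t space, uint32_t mss)
{
	return round_down_to_mss(space, mss);
}
```

With an illustrative space of 87380 bytes and an MSS of 1460, the capped variant with 10 segments yields 14600 bytes, while the buffer-driven variant yields 86140 bytes.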

>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net,1/3] hyperv: Fix the max_xfer_size in RNDIS initialization
From: David Miller @ 2012-10-01 22:39 UTC (permalink / raw)
  To: haiyangz; +Cc: netdev, kys, olaf, jasowang, linux-kernel, devel
In-Reply-To: <1349130657-7987-1-git-send-email-haiyangz@microsoft.com>


These patches do not apply cleanly to the current net-next tree
which is the only place where patches should be targeted right
now.

^ permalink raw reply

* Re: [PATCH 1/3] netlink: add attributes to fdb interface
From: David Miller @ 2012-10-01 22:40 UTC (permalink / raw)
  To: shemminger; +Cc: netdev
In-Reply-To: <20121001223254.175870711@vyatta.com>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon, 01 Oct 2012 15:32:33 -0700

> Later changes need to be able to refer to neighbour attributes
> when doing fdb_add.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.

^ permalink raw reply

* Re: [PATCH 2/3] igmp: export symbol ip_mc_leave_group
From: David Miller @ 2012-10-01 22:40 UTC (permalink / raw)
  To: shemminger; +Cc: netdev
In-Reply-To: <20121001223254.264749157@vyatta.com>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon, 01 Oct 2012 15:32:34 -0700

> Needed for VXLAN.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.

^ permalink raw reply

* Re: [PATCH 3/3] vxlan: virtual extensible lan
From: David Miller @ 2012-10-01 22:40 UTC (permalink / raw)
  To: shemminger; +Cc: netdev
In-Reply-To: <20121001223254.349753999@vyatta.com>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon, 01 Oct 2012 15:32:35 -0700

> This is an implementation of Virtual eXtensible Local Area Network
> as described in draft RFC:
>   http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
> 
> The driver integrates a Virtual Tunnel Endpoint (VTEP) functionality
> that learns MAC to IP address mapping. 
> 
> This implementation has only been tested against the Linux
> userspace implementation using TAP, not against other vendors'
> equipment.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied.
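
For readers who want the header layout used by the applied patch in isolation, the following is a userspace sketch of the VNI handling seen in vxlan_xmit() and vxlan_udp_encap_recv(): the 24-bit VNI sits in the upper bits of a 32-bit big-endian word and the low 8 bits are reserved. The value of VXLAN_FLAGS is an assumption here, since its definition is not shown in this part of the patch, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Assumed value: only the "VNI present" flag set. */
#define VXLAN_FLAGS 0x08000000u

/* Both fields are kept in network byte order, as in the patch. */
struct vxlanhdr {
	uint32_t vx_flags;
	uint32_t vx_vni;
};

/* Fill in a header for the given 24-bit VNI (transmit side). */
static void vxlan_hdr_build(struct vxlanhdr *vxh, uint32_t vni)
{
	assert(vni < (1u << 24));
	vxh->vx_flags = htonl(VXLAN_FLAGS);
	vxh->vx_vni = htonl(vni << 8);
}

/*
 * Receive side: reject headers with unexpected flags or with any of
 * the reserved low 8 bits set, otherwise extract the VNI.
 * Returns 0 on success, -1 if the header is rejected.
 */
static int vxlan_hdr_parse(const struct vxlanhdr *vxh, uint32_t *vni)
{
	if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
	    (vxh->vx_vni & htonl(0xff)))
		return -1;

	*vni = ntohl(vxh->vx_vni) >> 8;
	return 0;
}
```

Building a header for VNI 42 and parsing it again returns 42; setting any reserved bit makes the parse fail, which corresponds to the "invalid vxlan flags" debug path in the receive handler.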

^ permalink raw reply

* Re: network namespace and kernel bind issue
From: Eric W. Biederman @ 2012-10-01 22:40 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20121001145838.5eafef4c@nehalam.linuxnetplumber.net>

Stephen Hemminger <shemminger@vyatta.com> writes:

> On Mon, 1 Oct 2012 14:16:09 -0700
> Stephen Hemminger <shemminger@vyatta.com> wrote:
>
>> When testing VXLAN I noticed that the kernel bind seems to be a problem for
>> network tunnels. The init_net function is called repeatedly for the same
>> network namespace!

It definitely should not be.

>> 1. Create vxlan device:
>>  # ip li add vxlan0 type vxlan id 11 group 239.1.1.1 dev eth0
>>  # dmesg | tail
>> [11580.671016] vxlan: vxlan_init_net in net 1

Net 1?  What are you printing out?  It isn't the net_id by any chance?

>> 2. Start Chrome (or other application using namespaces)
>>  
>>   dmesg | tail
>> [11587.371195] vxlan: vxlan_init_net in net 1
>> [11587.371211] vxlan: bind for UDP socket 0.0.0.0:8472 (-98)
>> 
>> 
>> Isn't init_net supposed to be unique. The current semantics also break
>> L2TP.

The init method should be called exactly once per network namespace.

The timing of the init methods you report seems correct.

The vxlan code isn't in net-next or I would take a look.

I took a quick look at l2tp and the code is doing some weird things.
There are a bunch of references to &init_net that I would expect
to be references to either sk_net() or dev_net().

Adding support for multiple network namespaces and then reaching
out to the initial network namespace for things is definitely a recipe
for getting confused.

So my blind guess would be that someone half implemented network
namespace support for l2tp and vxlan copied the bugs.

Eric


>> This is with 3.6.0-rc7-net-next
>
> Here is back trace from where duplicate network namespace init gets done.

> [13532.579900] vxlan: bind for UDP socket 0.0.0.0:8472 (-98)
> [13532.579903] ------------[ cut here ]------------
> [13532.579906] WARNING: at drivers/net/vxlan.c:1148 vxlan_init_net+0xc9/0x126 [vxlan]()
> [13532.579907] Hardware name: System Product Name
> [13532.579908] Modules linked in: vxlan nfnetlink_log nfnetlink ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables tun bridge stp llc cpufreq_stats cpufreq_powersave cpufreq_conservative cpufreq_userspace binfmt_misc fuse loop snd_hda_codec_hdmi snd_hda_codec_realtek i915 hid_belkin hid_generic snd_hda_intel evdev snd_hda_codec drm_kms_helper snd_hwdep drm snd_pcm_oss snd_pcm psmouse microcode snd_page_alloc serio_raw pcspkr i2c_i801 snd_timer i2c_algo_bit i2c_core acpi_cpufreq mperf processor video button btrfs libcrc32c lzo_compress zlib_deflate crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 ablk_helper cryptd usbhid hid ixgbe r8169 mii mdio thermal [last unloaded: vxlan]
> [13532.579965] Pid: 7130, comm: chromium-sandbo Not tainted 3.6.0-rc7-net-next+ #10
> [13532.579966] Call Trace:
> [13532.579972]  [<ffffffff8106674e>] warn_slowpath_common+0x83/0x9c
> [13532.579974]  [<ffffffff81066781>] warn_slowpath_null+0x1a/0x1c
> [13532.579976]  [<ffffffffa03ea87d>] vxlan_init_net+0xc9/0x126 [vxlan]
> [13532.579980]  [<ffffffff8136b4dd>] ops_init+0xcd/0xfc
> [13532.579982]  [<ffffffff8136b824>] setup_net+0x51/0xd8
> [13532.579984]  [<ffffffff8136bd37>] copy_net_ns+0x6c/0xd7
> [13532.579987]  [<ffffffff810882bb>] create_new_namespaces+0xd8/0x14f
> [13532.579989]  [<ffffffff81088417>] copy_namespaces+0x69/0x9e
> [13532.579991]  [<ffffffff81065b10>] copy_process.part.27+0x12ae/0x12f5
> [13532.579994]  [<ffffffff8144170f>] ? do_page_fault+0x2fb/0x37c
> [13532.579997]  [<ffffffff8111d4d8>] ? might_fault+0x5c/0xac
> [13532.579998]  [<ffffffff81065cb2>] do_fork+0x120/0x2fc
> [13532.580001]  [<ffffffff810e7343>] ? time_hardirqs_off+0x15/0x2a
> [13532.580004]  [<ffffffff8143ea53>] ? error_sti+0x5/0x6
> [13532.580007]  [<ffffffff810a7204>] ? trace_hardirqs_off_caller+0x3f/0x9e
> [13532.580009]  [<ffffffff8143e646>] ? retint_swapgs+0xe/0x13
> [13532.580012]  [<ffffffff8103e541>] sys_clone+0x28/0x2a
> [13532.580014]  [<ffffffff814450e3>] stub_clone+0x13/0x20
> [13532.580016]  [<ffffffff81444d92>] ? system_call_fastpath+0x16/0x1b
> [13532.580018] ---[ end trace 2c2b222e23a4d880 ]---
> [13573.765721] vxlan: bind for UDP socket 0.0.0.0:8472 (-98)
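
The (-98) in the bind messages above is -EADDRINUSE on x86 Linux. The same error can be reproduced from userspace by binding two UDP sockets to the same wildcard address and port within one network namespace. The sketch below assumes plain sockets without SO_REUSEADDR or SO_REUSEPORT; the helper names are hypothetical and this is not the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Open a UDP socket bound to INADDR_ANY:port; returns fd or -errno. */
static int udp_bind_any(uint16_t port)
{
	struct sockaddr_in addr;
	int fd, err;

	fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (fd < 0)
		return -errno;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}

/* Local port of a bound socket in host byte order, 0 on error. */
static uint16_t udp_local_port(int fd)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);

	assert(fd >= 0);
	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		return 0;
	return ntohs(addr.sin_port);
}
```

Calling udp_bind_any(0) picks a free port; a second udp_bind_any() on the port reported by udp_local_port() then fails with -EADDRINUSE, as long as the first socket is still open in the same namespace.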

^ permalink raw reply

* Re: network namespace and kernel bind issue
From: Stephen Hemminger @ 2012-10-01 22:57 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <87fw5xeryf.fsf@xmission.com>

On Mon, 01 Oct 2012 15:40:56 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Stephen Hemminger <shemminger@vyatta.com> writes:
> 
> > On Mon, 1 Oct 2012 14:16:09 -0700
> > Stephen Hemminger <shemminger@vyatta.com> wrote:
> >
> >> When testing VXLAN I noticed that the kernel bind seems to be a problem for
> >> network tunnels. The init_net function is called repeatedly for the same
> >> network namespace!
> 
> It definitely should not be.
> 
> >> 1. Create vxlan device:
> >>  # ip li add vxlan0 type vxlan id 11 group 239.1.1.1 dev eth0
> >>  # dmesg | tail
> >> [11580.671016] vxlan: vxlan_init_net in net 1
> 
> Net 1?  What are you printing out?  It isn't the net_id by any chance?

Yes it is the net_id which is passed to net_generic() to find the
per-namespace data structure.

> 
> >> 2. Start Chrome (or other application using namespaces)
> >>  
> >>   dmesg | tail
> >> [11587.371195] vxlan: vxlan_init_net in net 1
> >> [11587.371211] vxlan: bind for UDP socket 0.0.0.0:8472 (-98)
> >> 
> >> 
> >> Isn't init_net supposed to be unique. The current semantics also break
> >> L2TP.
> 
> The init method should be called exactly once per network namespace.
> 
> The timing of the init methods you report seems correct.
> 
> The vxlan code isn't in net-next or I would take a look.
> 
> I took a quick look at l2tp and the code is doing some weird things.
> There are a bunch of references to &init_net that I would expect
> to be references to either sk_net() or dev_net().
> 
> Adding support for multiple network namespaces and then reaching
> out to the initial network namespace for things is definitely a recipe
> for getting confused.
> 
> So my blind guess would be that someone half implemented network
> namespace support for l2tp and vxlan copied the bugs.

The vxlan driver has one UDP socket per namespace.
There are no references to init_net in it.

I think the problem is the call chain
      copy_net_ns -> setup_net -> ops_init
There is nothing that increments the id after register_pernet_operations.

Shouldn't there be an increment so each new namespace gets a unique id?

--- a/net/core/net_namespace.c	2012-08-15 08:59:22.938704423 -0700
+++ b/net/core/net_namespace.c	2012-10-01 15:54:50.293088913 -0700
@@ -161,6 +161,7 @@ static __net_init int setup_net(struct n
 #endif
 
 	list_for_each_entry(ops, &pernet_list, list) {
+		++*ops->id;
 		error = ops_init(ops, net);
 		if (error < 0)
 			goto out_undo;


Or maybe you need to keep track of IDR map for each pernet_operations structure?

^ permalink raw reply

* Re: network namespace and kernel bind issue
From: Eric W. Biederman @ 2012-10-01 23:11 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20121001155702.5b5e2188@nehalam.linuxnetplumber.net>

Stephen Hemminger <shemminger@vyatta.com> writes:

> On Mon, 01 Oct 2012 15:40:56 -0700
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Stephen Hemminger <shemminger@vyatta.com> writes:
>> 
>> > On Mon, 1 Oct 2012 14:16:09 -0700
>> > Stephen Hemminger <shemminger@vyatta.com> wrote:
>> >
>> >> When testing VXLAN I noticed that the kernel bind seems to be a problem for
>> >> network tunnels. The init_net function is called repeatedly for the same
>> >> network namespace!
>> 
>> It definitely should not be.
>> 
>> >> 1. Create vxlan device:
>> >>  # ip li add vxlan0 type vxlan id 11 group 239.1.1.1 dev eth0
>> >>  # dmesg | tail
>> >> [11580.671016] vxlan: vxlan_init_net in net 1
>> 
>> Net 1?  What are you printing out?  It isn't the net_id by any chance?
>
> Yes it is the net_id which is passed to net_generic() to find the
> per-namespace data structure.

Yes.  net_id is just an index and is the same in all network namespaces.
net_id should only be different for different instances of per_net
operations.

>> >> 2. Start Chrome (or other application using namespaces)
>> >>  
>> >>   dmesg | tail
>> >> [11587.371195] vxlan: vxlan_init_net in net 1
>> >> [11587.371211] vxlan: bind for UDP socket 0.0.0.0:8472 (-98)
>> >> 
>> >> 
>> >> Isn't init_net supposed to be unique. The current semantics also break
>> >> L2TP.
>> 
>> The init method should be called exactly once per network namespace.
>> 
>> The timing of the init methods you report seems correct.
>> 
>> The vxlan code isn't in net-next or I would take a look.
>> 
>> I took a quick look at l2tp and the code is doing some weird things.
>> There are a bunch of references to &init_net that I would expect
>> to be references to either sk_net() or dev_net().
>> 
>> Adding support for multiple network namespaces and then reaching
>> out to the initial network namespace for things is definitely a recipe
>> for getting confused.
>> 
>> So my blind guess would be that someone half implemented network
>> namespace support for l2tp and vxlan copied the bugs.
>
> The vxlan driver has one UDP socket per namespace.
> There are no references to init_net in it.

Then my guess is that you have an ordering problem.  Attempting
to initialize a vxlan before ipv4 is initialized or some such.

> I think the problem is the call chain
>       copy_net_ns -> setup_net -> ops_init
> There is nothing that increments the id after register_pernet_operations.
>
> Shouldn't there be an increment so each new namespace gets a unique id?

No.

There are some extra pointers at the end of struct net and the id is
which of those pointers your subsystem gets to use.  net_generic returns
your pointer value.

I can see the confusion but the id is definitely not a namespace id.
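
To make the distinction concrete, here is a minimal userspace sketch, not kernel code. The names toy_net, toy_register, toy_net_assign and toy_net_generic are made up for illustration; the only thing taken from the explanation above is the idea that each registered set of per-net operations gets one id, and that the same id selects a different pointer in each namespace.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SLOTS 8

/* Toy stand-in for struct net: every namespace has its own slot array. */
struct toy_net {
	void *slots[TOY_MAX_SLOTS];
};

static int toy_next_id;	/* next free slot index, shared by all namespaces */

/* Model of registration: a subsystem is handed one slot index, once.
 * Returns -1 when no slot is left. */
static int toy_register(int *id)
{
	if (toy_next_id >= TOY_MAX_SLOTS)
		return -1;
	*id = toy_next_id++;
	return 0;
}

/* Store a subsystem's private data for one namespace. */
static void toy_net_assign(struct toy_net *net, int id, void *data)
{
	net->slots[id] = data;
}

/* Model of net_generic(): same id, different namespace, different pointer. */
static void *toy_net_generic(const struct toy_net *net, int id)
{
	return net->slots[id];
}
```

In this model two namespaces that both report "id 1" are still distinct: the id only says which slot to look in, and the namespace says whose slot array to use.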

Eric

^ permalink raw reply

* Re: network namespace and kernel bind issue
From: Stephen Hemminger @ 2012-10-01 23:32 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <87y5jpdbzo.fsf@xmission.com>

On Mon, 01 Oct 2012 16:11:07 -0700
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Then my guess is that you have an ordering problem.  Attempting
> to initialize a vxlan before ipv4 is initialized or some such.

Isn't there a guarantee that init operations are called in the order
they registered?

^ permalink raw reply

* Re: network namespace and kernel bind issue
From: Eric W. Biederman @ 2012-10-02  0:35 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20121001163226.3873ca58@nehalam.linuxnetplumber.net>

Stephen Hemminger <shemminger@vyatta.com> writes:

> On Mon, 01 Oct 2012 16:11:07 -0700
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Then my guess is that you have an ordering problem.  Attempting
>> to initialize a vxlan before ipv4 is initialized or some such.
>
> Isn't there a guarantee that init operations are called in the order
> they registered?

Yes.  With the caveat that all things registered with
register_pernet_subsys are called before register_pernet_device.

So if you are registering as a subsystem the loopback device won't
have been registered yet.

So if there is some requirement that I'm not seeing that the loopback
device needs to be registered or possibly even registered and brought up
before we can bind to a port you could easily be hitting that.
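
The ordering rule can be modelled in a few lines of userspace C. This is a rough sketch under stated assumptions, not the kernel's list handling: toy_register_subsys and toy_register_device are hypothetical names, and the array simply records the order in which init methods would run for a new namespace.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_OPS 16

static const char *toy_ops[TOY_MAX_OPS];	/* init order, front to back */
static int toy_nr_ops;
static int toy_first_device;	/* index of the first device entry */

/* Subsystems go just before the first device, keeping their own order. */
static int toy_register_subsys(const char *name)
{
	if (toy_nr_ops >= TOY_MAX_OPS)
		return -1;
	/* Shift all device entries up by one to open a gap. */
	memmove(&toy_ops[toy_first_device + 1], &toy_ops[toy_first_device],
		(toy_nr_ops - toy_first_device) * sizeof(toy_ops[0]));
	toy_ops[toy_first_device++] = name;
	toy_nr_ops++;
	return 0;
}

/* Devices always go at the very end. */
static int toy_register_device(const char *name)
{
	if (toy_nr_ops >= TOY_MAX_OPS)
		return -1;
	toy_ops[toy_nr_ops++] = name;
	return 0;
}
```

Registering a subsystem after a device still places it ahead of that device in toy_ops, which is the caveat described above. The strings passed in are arbitrary labels.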

[11587.371211] vxlan: bind for UDP socket 0.0.0.0:8472 (-98)

From this one clue it does look like the trace is:
inet_bind
   udp4_get_port
       udp_lib_get_port

And it does look like the only possible failure when a port number
is passed in is for the port to be genuinely in use.

Ok.  So I tracked down your patch so I could find the relevant code.

+static __net_init int vxlan_init_net(struct net *net)
+{
....
+	/* Create UDP socket for encapsulation receive. */
+	rc = sock_create_kern(AF_INET, SOCK_DGRAM, IPPROTO_UDP, &vn->sock);
+	if (rc < 0) {
+		pr_debug("UDP socket create failed\n");
+		return rc;
+	}

And this is where we have the issue.

sock_create_kern only creates sockets in the initial network namespace.
There is inet_ctl_sock_create which comes closer to what you want
but I expect you want your socket to be hashed.

Still we need to do something here to avoid having a socket in the
network namespace that has a reference count on the network namespace
and keeps the network namespace from exiting.

We very clearly don't have a good interface for handling this at
the moment.  I am drawing a blank at the moment on exactly what
such an interface should look like.

What we have is certainly error prone for use inside the kernel.
I have a suspicion the nfs server code that uses __sock_create
has the potential to forever pin a network namespace.

int sock_create_netns(struct net *net, int family, int type, int protocol,
		      struct socket **res)
{
	int err;

	/* Create in the initial namespace, then move to the target one. */
	err = __sock_create(&init_net, family, type, protocol, res, 1);
	if (err == 0)
		sk_change_net((*res)->sk, net);
	return err;
}

Although I am beginning to suspect we should do the silly refcount
avoidance for all in kernel sockets, and just pass the kern parameter
all of the way down to sk_alloc, so it can get the refcounting right
the first time.

However, for the bug fix for the merge window (since it appears Dave
merged this code), I suggest you just add the sk_change_net and change
the socket release to sk_release_kernel in release_net.  At least that
is localized, and doesn't require us to clean up the API for in kernel
sockets in a rush.

Eric


+	vxlan_addr.sin_port = htons(vxlan_port);
+
+	rc = kernel_bind(vn->sock, (struct sockaddr *) &vxlan_addr,
+			 sizeof(vxlan_addr));
+	if (rc < 0) {
+		pr_debug("bind for UDP socket %pI4:%u (%d)\n",
+			 &vxlan_addr.sin_addr, ntohs(vxlan_addr.sin_port), rc);
+		sock_release(vn->sock);
+		vn->sock = NULL;
+		return rc;
+	}


Eric

^ permalink raw reply

* Re: network namespace and kernel bind issue
From: Stephen Hemminger @ 2012-10-02  0:48 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <871uhhd82p.fsf@xmission.com>

The problem was vxlan wasn't doing sk_change_net on the created socket.

I'm testing that fix.

The long term fix is to change sock_create_kern() to take a 'struct net'
argument. This would avoid the trap of having to change the namespace.
Also several places using __sock_create() could use it.

L2TP still looks to have several namespace related issues.

^ permalink raw reply

* [PATCH net-next] vxlan: put UDP socket in correct namespace
From: Stephen Hemminger @ 2012-10-02  0:51 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev
In-Reply-To: <871uhhd82p.fsf@xmission.com>

Move vxlan UDP socket to correct network namespace

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/drivers/net/vxlan.c	2012-10-01 17:18:30.776513263 -0700
+++ b/drivers/net/vxlan.c	2012-10-01 17:42:28.340411631 -0700
@@ -1136,6 +1136,9 @@ static __net_init int vxlan_init_net(str
 		pr_debug("UDP socket create failed\n");
 		return rc;
 	}
+	/* Put in proper namespace */
+	sk = vn->sock->sk;
+	sk_change_net(sk, net);
 
 	vxlan_addr.sin_port = htons(vxlan_port);
 
@@ -1150,7 +1153,6 @@ static __net_init int vxlan_init_net(str
 	}
 
 	/* Disable multicast loopback */
-	sk = vn->sock->sk;
 	inet_sk(sk)->mc_loop = 0;
 
 	/* Mark socket as an encapsulation socket. */

^ permalink raw reply

* Re: [PATCH net-next] vxlan: put UDP socket in correct namespace
From: Eric W. Biederman @ 2012-10-02  0:58 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20121001175107.0ec2931c@nehalam.linuxnetplumber.net>

Stephen Hemminger <shemminger@vyatta.com> writes:

> Move vxlan UDP socket to correct network namespace

You also need to replace sock_release with
sk_release_kernel.

Otherwise you will decrement the network namespace count
below zero, when sock_release is called.
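
The imbalance is easier to see in a toy model. The following is a simplified userspace sketch, not the kernel implementation: all toy_* names are hypothetical, and only the counted namespace reference is modelled. It assumes, as described above, that sk_change_net leaves the socket without a counted reference on its new namespace.

```c
#include <assert.h>

/* Toy model: only the counted namespace reference is tracked. */
struct toy_net  { int count; };
struct toy_sock { struct toy_net *net; };

static struct toy_net toy_init_net = { .count = 1 };

/* Kernel-style create: the socket starts in the initial namespace and
 * holds a counted reference on it. */
static void toy_sock_create_kern(struct toy_sock *sk)
{
	sk->net = &toy_init_net;
	sk->net->count++;
}

/* Models sk_change_net(): drop the counted reference on the old namespace
 * and point at the new one WITHOUT counting, so the socket cannot pin it. */
static void toy_sk_change_net(struct toy_sock *sk, struct toy_net *net)
{
	sk->net->count--;
	sk->net = net;
}

/* Models plain sock_release(): assumes a counted reference and drops it. */
static void toy_sock_release(struct toy_sock *sk)
{
	sk->net->count--;
}

/* Models sk_release_kernel(): move back to the initial namespace with a
 * counted reference first, so the final drop is balanced. */
static void toy_sk_release_kernel(struct toy_sock *sk)
{
	sk->net = &toy_init_net;
	sk->net->count++;
	toy_sock_release(sk);
}
```

In this model, calling toy_sock_release on a socket that went through toy_sk_change_net drops a reference the socket never took, while toy_sk_release_kernel leaves both counts where they started.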

Eric

> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> --- a/drivers/net/vxlan.c	2012-10-01 17:18:30.776513263 -0700
> +++ b/drivers/net/vxlan.c	2012-10-01 17:42:28.340411631 -0700
> @@ -1136,6 +1136,9 @@ static __net_init int vxlan_init_net(str
>  		pr_debug("UDP socket create failed\n");
>  		return rc;
>  	}
> +	/* Put in proper namespace */
> +	sk = vn->sock->sk;
> +	sk_change_net(sk, net);
>  
>  	vxlan_addr.sin_port = htons(vxlan_port);
>  
> @@ -1150,7 +1153,6 @@ static __net_init int vxlan_init_net(str
>  	}
>  
>  	/* Disable multicast loopback */
> -	sk = vn->sock->sk;
>  	inet_sk(sk)->mc_loop = 0;
>  
>  	/* Mark socket as an encapsulation socket. */

^ permalink raw reply

* RE: drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:363:64: sparse: incorrect type in argument 3 (different base types)
From: Jay Hernandez @ 2012-10-02  0:48 UTC (permalink / raw)
  To: Fengguang Wu, Vipul Pandya; +Cc: kernel-janitors, netdev, Jay Hernandez
In-Reply-To: <20120928163653.GC5171@localhost>

Hi Fengguang,

Thanks for pointing this out; we have a fix which addresses these issues.
What's interesting is that I did not see these warnings when I tried to
reproduce the errors on my setup. Is there a special flag we can use to
reproduce the sparse warnings?

Thanks for your help,
Jay-

-----Original Message-----
From: Fengguang Wu [mailto:fengguang.wu@intel.com] 
Sent: Friday, September 28, 2012 9:37 AM
To: Vipul Pandya
Cc: kernel-janitors@vger.kernel.org; Jay Hernandez; netdev@vger.kernel.org
Subject: drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:363:64: sparse: incorrect type in argument 3 (different base types)

Hi Vipul,

FYI, there are new sparse warnings showing up in

commit: 5afc8b84eb7b29e4646d6e8ca7e6d7196031d6f7  cxgb4: Add functions to read memory via PCIE memory window

  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:361:33: sparse: incorrect type in assignment (different base types)
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:361:33:    expected restricted __be32 [usertype] <noident>
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:361:33:    got unsigned int
+ drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:363:64: sparse: incorrect type in argument 3 (different base types)
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:363:64:    expected unsigned int [unsigned] [usertype] val
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:363:64:    got restricted __be32 [usertype] <noident>
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:694:31: sparse: incorrect type in assignment (different base types)
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:694:31:    expected unsigned int [unsigned] [usertype] <noident>
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:694:31:    got restricted __be32 [usertype] <noident>
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:898:25: sparse: cast to restricted __be32
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:898:25: sparse: cast to restricted __be32
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:898:25: sparse: cast to restricted __be32
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:898:25: sparse: cast to restricted __be32
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:898:25: sparse: cast to restricted __be32
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:898:25: sparse: cast to restricted __be32
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2177:25: sparse: incorrect type in assignment (different base types)
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2177:25:    expected restricted __be32 [usertype] <noident>
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:2177:25:    got unsigned int
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c: In function 't4_memory_rw.constprop.6':
  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c:462:1: warning: the frame size of 2056 bytes is larger than 1024 bytes [-Wframe-larger-than=]

vim +363 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c

5afc8b84 (Vipul Pandya 2012-09-26  347) 
5afc8b84 (Vipul Pandya 2012-09-26  348) 	/*
5afc8b84 (Vipul Pandya 2012-09-26  349) 	 * Setup offset into PCIE memory window.  Address must be a
5afc8b84 (Vipul Pandya 2012-09-26  350) 	 * MEMWIN0_APERTURE-byte-aligned address.  (Read back MA register to
5afc8b84 (Vipul Pandya 2012-09-26  351) 	 * ensure that changes propagate before we attempt to use the new
5afc8b84 (Vipul Pandya 2012-09-26  352) 	 * values.)
5afc8b84 (Vipul Pandya 2012-09-26  353) 	 */
5afc8b84 (Vipul Pandya 2012-09-26  354) 	t4_write_reg(adap, PCIE_MEM_ACCESS_OFFSET,
5afc8b84 (Vipul Pandya 2012-09-26  355) 		     addr & ~(MEMWIN0_APERTURE - 1));
5afc8b84 (Vipul Pandya 2012-09-26  356) 	t4_read_reg(adap, PCIE_MEM_ACCESS_OFFSET);
5afc8b84 (Vipul Pandya 2012-09-26  357) 
5afc8b84 (Vipul Pandya 2012-09-26  358) 	/* Collecting data 4 bytes at a time upto MEMWIN0_APERTURE */
5afc8b84 (Vipul Pandya 2012-09-26  359) 	for (i = 0; i < MEMWIN0_APERTURE; i = i+0x4) {
5afc8b84 (Vipul Pandya 2012-09-26  360) 		if (dir)
5afc8b84 (Vipul Pandya 2012-09-26  361) 			*data++ = t4_read_reg(adap, (MEMWIN0_BASE + i));
5afc8b84 (Vipul Pandya 2012-09-26  362) 		else
5afc8b84 (Vipul Pandya 2012-09-26 @363) 			t4_write_reg(adap, (MEMWIN0_BASE + i), *data++);
5afc8b84 (Vipul Pandya 2012-09-26  364) 	}
5afc8b84 (Vipul Pandya 2012-09-26  365) 
5afc8b84 (Vipul Pandya 2012-09-26  366) 	return 0;
5afc8b84 (Vipul Pandya 2012-09-26  367) }
5afc8b84 (Vipul Pandya 2012-09-26  368) 
5afc8b84 (Vipul Pandya 2012-09-26  369) /**
5afc8b84 (Vipul Pandya 2012-09-26  370)  *	t4_memory_rw - read/write EDC 0, EDC 1 or MC via PCIE memory window
5afc8b84 (Vipul Pandya 2012-09-26  371)  *	@adap: the adapter

---
0-DAY kernel build testing backend         Open Source Technology Centre
Fengguang Wu, Yuanhan Liu                              Intel Corporation

^ permalink raw reply

* Pull request: sfc-next 2012-10-02
From: Ben Hutchings @ 2012-10-02  1:25 UTC (permalink / raw)
  To: David Miller; +Cc: linux-net-drivers, netdev

The following changes since commit abb17e6c0c7b27693201dc85f75dbb184279fd10:

  netlink: use <linux/export.h> instead of <linux/module.h> (2012-09-21 15:43:58 -0400)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc-next.git for-davem

(commit 6ac7ef1487a040483d89a95045efc5185a71268f)

Some bug fixes that should go into 3.7:

1. Fix oops when removing device with SR-IOV enabled.  (This regression
was introduced by the last set of changes, so the fix does not need to
be applied to any earlier kernel versions.)
2. Fix firmware structure field lookup bug that resulted in missing
sensor information.
3. Fix bug that makes self-test do very little in some configurations.
4. Fix the numbering of ethtool RX flow steering filters to reflect the
real hardware priorities.

Ben.

Ben Hutchings (6):
      sfc: Fix null function pointer in efx_sriov_channel_type
      sfc: Add parentheses around use of bitfield macro arguments
      sfc: Fix MCDI structure field lookup
      sfc: Fix loopback self-test with separate_tx_channels=1
      sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
      sfc: Fix the reported priorities of different filter types

 drivers/net/ethernet/sfc/bitfield.h    |   22 +++---
 drivers/net/ethernet/sfc/ethtool.c     |   11 +--
 drivers/net/ethernet/sfc/filter.c      |  108 ++++++++++++++++----------------
 drivers/net/ethernet/sfc/filter.h      |    7 +--
 drivers/net/ethernet/sfc/mcdi.h        |    6 +-
 drivers/net/ethernet/sfc/selftest.c    |    3 +-
 drivers/net/ethernet/sfc/siena_sriov.c |    1 +
 7 files changed, 76 insertions(+), 82 deletions(-)

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [PATCH net-next 1/6] sfc: Fix null function pointer in efx_sriov_channel_type
From: Ben Hutchings @ 2012-10-02  1:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-net-drivers
In-Reply-To: <1349141144.2577.73.camel@bwh-desktop.uk.solarflarecom.com>

Commit c31e5f9 ('sfc: Add channel specific receive_skb handler and
post_remove callback') added the function pointer field
efx_channel_type::post_remove and an unconditional call through it.

This field should have been initialised to efx_channel_dummy_op_void
in the existing instances of efx_channel_type, but this was only done
in efx_default_channel_type.  Consequently, if a device has SR-IOV
enabled then removing the driver or device will result in an oops.
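
As an illustration of the failure mode, here is a minimal userspace sketch. The toy_* names are hypothetical stand-ins, not the sfc structures; the point is only that a callback left out of a designated initializer is NULL, so an unconditional call through it is unsafe unless a no-op is supplied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ops table such as efx_channel_type. */
struct toy_channel_type {
	void (*pre_probe)(int *state);
	void (*post_remove)(int *state);
};

static void toy_probe(int *state)    { *state = 1; }
static void toy_dummy_op(int *state) { (void)state; }	/* deliberate no-op */

/* post_remove omitted: designated initializers zero it, so it is NULL. */
static const struct toy_channel_type toy_broken = {
	.pre_probe = toy_probe,
};

/* Every callback the core calls unconditionally is filled in. */
static const struct toy_channel_type toy_fixed = {
	.pre_probe   = toy_probe,
	.post_remove = toy_dummy_op,
};

/* Core code that calls through the pointer without a NULL check;
 * only safe for tables like toy_fixed. */
static void toy_remove(const struct toy_channel_type *type, int *state)
{
	type->post_remove(state);
}
```

Passing toy_broken to toy_remove would call through a NULL pointer, which is the userspace analogue of the oops described above; toy_fixed makes the same call harmless.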

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/ethernet/sfc/siena_sriov.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/sfc/siena_sriov.c b/drivers/net/ethernet/sfc/siena_sriov.c
index a8f48a4..d49b53d 100644
--- a/drivers/net/ethernet/sfc/siena_sriov.c
+++ b/drivers/net/ethernet/sfc/siena_sriov.c
@@ -1035,6 +1035,7 @@ efx_sriov_get_channel_name(struct efx_channel *channel, char *buf, size_t len)
 static const struct efx_channel_type efx_sriov_channel_type = {
 	.handle_no_channel	= efx_sriov_handle_no_channel,
 	.pre_probe		= efx_sriov_probe_channel,
+	.post_remove		= efx_channel_dummy_op_void,
 	.get_name		= efx_sriov_get_channel_name,
 	/* no copy operation; channel must not be reallocated */
 	.keep_eventq		= true,
-- 
1.7.7.6



-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related