Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next v11 4/5] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
From: Sridhar Samudrala @ 2018-05-22  2:06 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-1-git-send-email-sridhar.samudrala@intel.com>

This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.

VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/virtio_net.c        | 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f34794a76c4d..213fddc70fd0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
 	VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
 	VIRTIO_NET_F_CTRL_MAC_ADDR, \
 	VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-	VIRTIO_NET_F_SPEED_DUPLEX
+	VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
 
 static unsigned int features[] = {
 	VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..a3715a3224c1 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 					 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23	/* Set MAC address */
 
+#define VIRTIO_NET_F_STANDBY	  62	/* Act as standby for another device
+					 * with the same MAC.
+					 */
 #define VIRTIO_NET_F_SPEED_DUPLEX 63	/* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v11 3/5] net: Introduce net_failover driver
From: Sridhar Samudrala @ 2018-05-22  2:06 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-1-git-send-email-sridhar.samudrala@intel.com>

The net_failover driver provides an automated failover mechanism via APIs
to create and destroy a failover master netdev and mananges a primary and
standby slave netdevs that get registered via the generic failover
infrastructure.

The failover netdev acts a master device and controls 2 slave devices. The
original paravirtual interface gets registered as 'standby' slave netdev and
a passthru/vf device with the same MAC gets registered as 'primary' slave
netdev. Both 'standby' and 'failover' netdevs are associated with the same
'pci' device. The user accesses the network interface via 'failover' netdev.
The 'failover' netdev chooses 'primary' netdev as default for transmits when
it is available with link up and running.

This can be used by paravirtual drivers to enable an alternate low latency
datapath. It also enables hypervisor controlled live migration of a VM with
direct attached VF by failing over to the paravirtual datapath when the VF
is unplugged.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 Documentation/networking/net_failover.rst |  26 +
 MAINTAINERS                               |   8 +
 drivers/net/Kconfig                       |  12 +
 drivers/net/Makefile                      |   1 +
 drivers/net/net_failover.c                | 836 ++++++++++++++++++++++++++++++
 include/net/net_failover.h                |  40 ++
 6 files changed, 923 insertions(+)
 create mode 100644 Documentation/networking/net_failover.rst
 create mode 100644 drivers/net/net_failover.c
 create mode 100644 include/net/net_failover.h

diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
new file mode 100644
index 000000000000..6764c3b1eec8
--- /dev/null
+++ b/Documentation/networking/net_failover.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+NET_FAILOVER
+============
+
+Overview
+========
+
+The net_failover driver provides an automated failover mechanism via APIs
+to create and destroy a failover master netdev and mananges a primary and
+standby slave netdevs that get registered via the generic failover
+infrastructrure.
+
+The failover netdev acts a master device and controls 2 slave devices. The
+original paravirtual interface is registered as 'standby' slave netdev and
+a passthru/vf device with the same MAC gets registered as 'primary' slave
+netdev. Both 'standby' and 'failover' netdevs are associated with the same
+'pci' device. The user accesses the network interface via 'failover' netdev.
+The 'failover' netdev chooses 'primary' netdev as default for transmits when
+it is available with link up and running.
+
+This can be used by paravirtual drivers to enable an alternate low latency
+datapath. It also enables hypervisor controlled live migration of a VM with
+direct attached VF by failing over to the paravirtual datapath when the VF
+is unplugged.
diff --git a/MAINTAINERS b/MAINTAINERS
index aed9575ddc53..ad172b1a7d57 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9642,6 +9642,14 @@ S:	Maintained
 F:	Documentation/hwmon/nct6775
 F:	drivers/hwmon/nct6775.c
 
+NET_FAILOVER MODULE
+M:	Sridhar Samudrala <sridhar.samudrala@intel.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	driver/net/net_failover.c
+F:	include/net/net_failover.h
+F:	Documentation/networking/net_failover.rst
+
 NETEFFECT IWARP RNIC DRIVER (IW_NES)
 M:	Faisal Latif <faisal.latif@intel.com>
 L:	linux-rdma@vger.kernel.org
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index a029b27fd002..1cf29c5ed510 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -510,4 +510,16 @@ config NETDEVSIM
 	  To compile this driver as a module, choose M here: the module
 	  will be called netdevsim.
 
+config NET_FAILOVER
+	tristate "Failover driver"
+	select FAILOVER
+	help
+	  This provides an automated failover mechanism via APIs to create
+	  and destroy a failover master netdev and manages a primary and
+	  standby slave netdevs that get registered via the generic failover
+	  infrastructure. This can be used by paravirtual drivers to enable
+	  an alternate low latency datapath. It alsoenables live migration of
+	  a VM with direct attached VF by failing over to the paravirtual
+	  datapath when the VF is unplugged.
+
 endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 91e67e375dd4..21cde7e78621 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_FUJITSU_ES) += fjes/
 thunderbolt-net-y += thunderbolt.o
 obj-$(CONFIG_THUNDERBOLT_NET) += thunderbolt-net.o
 obj-$(CONFIG_NETDEVSIM) += netdevsim/
+obj-$(CONFIG_NET_FAILOVER) += net_failover.o
diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
new file mode 100644
index 000000000000..ae47bdb363c9
--- /dev/null
+++ b/drivers/net/net_failover.c
@@ -0,0 +1,836 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* This provides a net_failover interface for paravirtual drivers to
+ * provide an alternate datapath by exporting APIs to create and
+ * destroy a upper 'net_failover' netdev. The upper dev manages the
+ * original paravirtual interface as a 'standby' netdev and uses the
+ * generic failover infrastructure to register and manage a direct
+ * attached VF as a 'primary' netdev. This enables live migration of
+ * a VM with direct attached VF by failing over to the paravirtual
+ * datapath when the VF is unplugged.
+ *
+ * Some of the netdev management routines are based on bond/team driver as
+ * this driver provides active-backup functionality similar to those drivers.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <linux/pci.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/net_failover.h>
+
+static bool net_failover_xmit_ready(struct net_device *dev)
+{
+	return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static int net_failover_open(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int err;
+
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		err = dev_open(primary_dev);
+		if (err)
+			goto err_primary_open;
+	}
+
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		err = dev_open(standby_dev);
+		if (err)
+			goto err_standby_open;
+	}
+
+	if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
+	    (standby_dev && net_failover_xmit_ready(standby_dev))) {
+		netif_carrier_on(dev);
+		netif_tx_wake_all_queues(dev);
+	}
+
+	return 0;
+
+err_standby_open:
+	dev_close(primary_dev);
+err_primary_open:
+	netif_tx_disable(dev);
+	return err;
+}
+
+static int net_failover_close(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	netif_tx_disable(dev);
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	slave_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		dev_close(slave_dev);
+
+	return 0;
+}
+
+static netdev_tx_t net_failover_drop_xmit(struct sk_buff *skb,
+					  struct net_device *dev)
+{
+	atomic_long_inc(&dev->tx_dropped);
+	dev_kfree_skb_any(skb);
+	return NETDEV_TX_OK;
+}
+
+static netdev_tx_t net_failover_start_xmit(struct sk_buff *skb,
+					   struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *xmit_dev;
+
+	/* Try xmit via primary netdev followed by standby netdev */
+	xmit_dev = rcu_dereference_bh(nfo_info->primary_dev);
+	if (!xmit_dev || !net_failover_xmit_ready(xmit_dev)) {
+		xmit_dev = rcu_dereference_bh(nfo_info->standby_dev);
+		if (!xmit_dev || !net_failover_xmit_ready(xmit_dev))
+			return net_failover_drop_xmit(skb, dev);
+	}
+
+	skb->dev = xmit_dev;
+	skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+	return dev_queue_xmit(skb);
+}
+
+static u16 net_failover_select_queue(struct net_device *dev,
+				     struct sk_buff *skb, void *accel_priv,
+				     select_queue_fallback_t fallback)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev;
+	u16 txq;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		const struct net_device_ops *ops = primary_dev->netdev_ops;
+
+		if (ops->ndo_select_queue)
+			txq = ops->ndo_select_queue(primary_dev, skb,
+						    accel_priv, fallback);
+		else
+			txq = fallback(primary_dev, skb);
+
+		qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+		return txq;
+	}
+
+	txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	/* Save the original txq to restore before passing to the driver */
+	qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+	if (unlikely(txq >= dev->real_num_tx_queues)) {
+		do {
+			txq -= dev->real_num_tx_queues;
+		} while (txq >= dev->real_num_tx_queues);
+	}
+
+	return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void net_failover_fold_stats(struct rtnl_link_stats64 *_res,
+				    const struct rtnl_link_stats64 *_new,
+				    const struct rtnl_link_stats64 *_old)
+{
+	const u64 *new = (const u64 *)_new;
+	const u64 *old = (const u64 *)_old;
+	u64 *res = (u64 *)_res;
+	int i;
+
+	for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+		u64 nv = new[i];
+		u64 ov = old[i];
+		s64 delta = nv - ov;
+
+		/* detects if this particular field is 32bit only */
+		if (((nv | ov) >> 32) == 0)
+			delta = (s64)(s32)((u32)nv - (u32)ov);
+
+		/* filter anomalies, some drivers reset their stats
+		 * at down/up events.
+		 */
+		if (delta > 0)
+			res[i] += delta;
+	}
+}
+
+static void net_failover_get_stats(struct net_device *dev,
+				   struct rtnl_link_stats64 *stats)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	const struct rtnl_link_stats64 *new;
+	struct rtnl_link_stats64 temp;
+	struct net_device *slave_dev;
+
+	spin_lock(&nfo_info->stats_lock);
+	memcpy(stats, &nfo_info->failover_stats, sizeof(*stats));
+
+	rcu_read_lock();
+
+	slave_dev = rcu_dereference(nfo_info->primary_dev);
+	if (slave_dev) {
+		new = dev_get_stats(slave_dev, &temp);
+		net_failover_fold_stats(stats, new, &nfo_info->primary_stats);
+		memcpy(&nfo_info->primary_stats, new, sizeof(*new));
+	}
+
+	slave_dev = rcu_dereference(nfo_info->standby_dev);
+	if (slave_dev) {
+		new = dev_get_stats(slave_dev, &temp);
+		net_failover_fold_stats(stats, new, &nfo_info->standby_stats);
+		memcpy(&nfo_info->standby_stats, new, sizeof(*new));
+	}
+
+	rcu_read_unlock();
+
+	memcpy(&nfo_info->failover_stats, stats, sizeof(*stats));
+	spin_unlock(&nfo_info->stats_lock);
+}
+
+static int net_failover_change_mtu(struct net_device *dev, int new_mtu)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int ret = 0;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		ret = dev_set_mtu(primary_dev, new_mtu);
+		if (ret)
+			return ret;
+	}
+
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		ret = dev_set_mtu(standby_dev, new_mtu);
+		if (ret) {
+			if (primary_dev)
+				dev_set_mtu(primary_dev, dev->mtu);
+			return ret;
+		}
+	}
+
+	dev->mtu = new_mtu;
+
+	return 0;
+}
+
+static void net_failover_set_rx_mode(struct net_device *dev)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	rcu_read_lock();
+
+	slave_dev = rcu_dereference(nfo_info->primary_dev);
+	if (slave_dev) {
+		dev_uc_sync_multiple(slave_dev, dev);
+		dev_mc_sync_multiple(slave_dev, dev);
+	}
+
+	slave_dev = rcu_dereference(nfo_info->standby_dev);
+	if (slave_dev) {
+		dev_uc_sync_multiple(slave_dev, dev);
+		dev_mc_sync_multiple(slave_dev, dev);
+	}
+
+	rcu_read_unlock();
+}
+
+static int net_failover_vlan_rx_add_vid(struct net_device *dev, __be16 proto,
+					u16 vid)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+	int ret = 0;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		ret = vlan_vid_add(primary_dev, proto, vid);
+		if (ret)
+			return ret;
+	}
+
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		ret = vlan_vid_add(standby_dev, proto, vid);
+		if (ret)
+			if (primary_dev)
+				vlan_vid_del(primary_dev, proto, vid);
+	}
+
+	return ret;
+}
+
+static int net_failover_vlan_rx_kill_vid(struct net_device *dev, __be16 proto,
+					 u16 vid)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	slave_dev = rcu_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		vlan_vid_del(slave_dev, proto, vid);
+
+	slave_dev = rcu_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		vlan_vid_del(slave_dev, proto, vid);
+
+	return 0;
+}
+
+static const struct net_device_ops failover_dev_ops = {
+	.ndo_open		= net_failover_open,
+	.ndo_stop		= net_failover_close,
+	.ndo_start_xmit		= net_failover_start_xmit,
+	.ndo_select_queue	= net_failover_select_queue,
+	.ndo_get_stats64	= net_failover_get_stats,
+	.ndo_change_mtu		= net_failover_change_mtu,
+	.ndo_set_rx_mode	= net_failover_set_rx_mode,
+	.ndo_vlan_rx_add_vid	= net_failover_vlan_rx_add_vid,
+	.ndo_vlan_rx_kill_vid	= net_failover_vlan_rx_kill_vid,
+	.ndo_validate_addr	= eth_validate_addr,
+	.ndo_features_check	= passthru_features_check,
+};
+
+#define FAILOVER_NAME "net_failover"
+#define FAILOVER_VERSION "0.1"
+
+static void nfo_ethtool_get_drvinfo(struct net_device *dev,
+				    struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
+}
+
+static int nfo_ethtool_get_link_ksettings(struct net_device *dev,
+					  struct ethtool_link_ksettings *cmd)
+{
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *slave_dev;
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (!slave_dev || !net_failover_xmit_ready(slave_dev)) {
+		slave_dev = rtnl_dereference(nfo_info->standby_dev);
+		if (!slave_dev || !net_failover_xmit_ready(slave_dev)) {
+			cmd->base.duplex = DUPLEX_UNKNOWN;
+			cmd->base.port = PORT_OTHER;
+			cmd->base.speed = SPEED_UNKNOWN;
+
+			return 0;
+		}
+	}
+
+	return __ethtool_get_link_ksettings(slave_dev, cmd);
+}
+
+static const struct ethtool_ops failover_ethtool_ops = {
+	.get_drvinfo            = nfo_ethtool_get_drvinfo,
+	.get_link               = ethtool_op_get_link,
+	.get_link_ksettings     = nfo_ethtool_get_link_ksettings,
+};
+
+/* Called when slave dev is injecting data into network stack.
+ * Change the associated network device from lower dev to failover dev.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t net_failover_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+	struct net_device *dev = rcu_dereference(skb->dev->rx_handler_data);
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+
+	if (primary_dev && skb->dev == standby_dev)
+		return RX_HANDLER_EXACT;
+
+	skb->dev = dev;
+
+	return RX_HANDLER_ANOTHER;
+}
+
+static void net_failover_compute_features(struct net_device *dev)
+{
+	u32 vlan_features = FAILOVER_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;
+	netdev_features_t enc_features  = FAILOVER_ENC_FEATURES;
+	unsigned short max_hard_header_len = ETH_HLEN;
+	unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |
+					IFF_XMIT_DST_RELEASE_PERM;
+	struct net_failover_info *nfo_info = netdev_priv(dev);
+	struct net_device *primary_dev, *standby_dev;
+
+	primary_dev = rcu_dereference(nfo_info->primary_dev);
+	if (primary_dev) {
+		vlan_features =
+			netdev_increment_features(vlan_features,
+						  primary_dev->vlan_features,
+						  FAILOVER_VLAN_FEATURES);
+		enc_features =
+			netdev_increment_features(enc_features,
+						  primary_dev->hw_enc_features,
+						  FAILOVER_ENC_FEATURES);
+
+		dst_release_flag &= primary_dev->priv_flags;
+		if (primary_dev->hard_header_len > max_hard_header_len)
+			max_hard_header_len = primary_dev->hard_header_len;
+	}
+
+	standby_dev = rcu_dereference(nfo_info->standby_dev);
+	if (standby_dev) {
+		vlan_features =
+			netdev_increment_features(vlan_features,
+						  standby_dev->vlan_features,
+						  FAILOVER_VLAN_FEATURES);
+		enc_features =
+			netdev_increment_features(enc_features,
+						  standby_dev->hw_enc_features,
+						  FAILOVER_ENC_FEATURES);
+
+		dst_release_flag &= standby_dev->priv_flags;
+		if (standby_dev->hard_header_len > max_hard_header_len)
+			max_hard_header_len = standby_dev->hard_header_len;
+	}
+
+	dev->vlan_features = vlan_features;
+	dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
+	dev->hard_header_len = max_hard_header_len;
+
+	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+	if (dst_release_flag == (IFF_XMIT_DST_RELEASE |
+				 IFF_XMIT_DST_RELEASE_PERM))
+		dev->priv_flags |= IFF_XMIT_DST_RELEASE;
+
+	netdev_change_features(dev);
+}
+
+static void net_failover_lower_state_changed(struct net_device *slave_dev,
+					     struct net_device *primary_dev,
+					     struct net_device *standby_dev)
+{
+	struct netdev_lag_lower_state_info info;
+
+	if (netif_carrier_ok(slave_dev))
+		info.link_up = true;
+	else
+		info.link_up = false;
+
+	if (slave_dev == primary_dev) {
+		if (netif_running(primary_dev))
+			info.tx_enabled = true;
+		else
+			info.tx_enabled = false;
+	} else {
+		if ((primary_dev && netif_running(primary_dev)) ||
+		    (!netif_running(standby_dev)))
+			info.tx_enabled = false;
+		else
+			info.tx_enabled = true;
+	}
+
+	netdev_lower_state_changed(slave_dev, &info);
+}
+
+static int net_failover_slave_register(struct net_device *slave_dev,
+				       struct net_device *failover_dev)
+{
+	struct net_device *standby_dev, *primary_dev;
+	struct netdev_lag_upper_info lag_upper_info;
+	struct net_failover_info *nfo_info;
+	bool slave_is_standby;
+	u32 orig_mtu;
+	int err;
+
+	nfo_info = netdev_priv(failover_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
+	if (slave_is_standby ? standby_dev : primary_dev) {
+		netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
+			   slave_dev->name,
+			   slave_is_standby ? "standby" : "primary");
+		goto done;
+	}
+
+	/* We want to allow only a direct attached VF device as a primary
+	 * netdev. As there is no easy way to check for a VF device, restrict
+	 * this to a pci device.
+	 */
+	if (!slave_is_standby && (!slave_dev->dev.parent ||
+				  !dev_is_pci(slave_dev->dev.parent)))
+		goto done;
+
+	if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
+	    vlan_uses_dev(failover_dev)) {
+		netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
+			   failover_dev->name);
+		goto done;
+	}
+
+	/* Align MTU of slave with failover dev */
+	orig_mtu = slave_dev->mtu;
+	err = dev_set_mtu(slave_dev, failover_dev->mtu);
+	if (err) {
+		netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
+			   slave_dev->name, failover_dev->mtu);
+		goto done;
+	}
+
+	dev_hold(slave_dev);
+
+	if (netif_running(failover_dev)) {
+		err = dev_open(slave_dev);
+		if (err && (err != -EBUSY)) {
+			netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
+				   slave_dev->name, err);
+			goto err_dev_open;
+		}
+	}
+
+	netif_addr_lock_bh(failover_dev);
+	dev_uc_sync_multiple(slave_dev, failover_dev);
+	dev_uc_sync_multiple(slave_dev, failover_dev);
+	netif_addr_unlock_bh(failover_dev);
+
+	err = vlan_vids_add_by_dev(slave_dev, failover_dev);
+	if (err) {
+		netdev_err(failover_dev, "Failed to add vlan ids to device %s err:%d\n",
+			   slave_dev->name, err);
+		goto err_vlan_add;
+	}
+
+	err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
+					 failover_dev);
+	if (err) {
+		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
+			   err);
+		goto err_handler_register;
+	}
+
+	lag_upper_info.tx_type = NETDEV_LAG_TX_TYPE_ACTIVEBACKUP;
+	err = netdev_master_upper_dev_link(slave_dev, failover_dev, NULL,
+					   &lag_upper_info, NULL);
+	if (err) {
+		netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
+			   failover_dev->name, err);
+		goto err_upper_link;
+	}
+
+	slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
+
+	if (slave_is_standby) {
+		rcu_assign_pointer(nfo_info->standby_dev, slave_dev);
+		standby_dev = slave_dev;
+		dev_get_stats(standby_dev, &nfo_info->standby_stats);
+	} else {
+		rcu_assign_pointer(nfo_info->primary_dev, slave_dev);
+		primary_dev = slave_dev;
+		dev_get_stats(primary_dev, &nfo_info->primary_stats);
+		failover_dev->min_mtu = slave_dev->min_mtu;
+		failover_dev->max_mtu = slave_dev->max_mtu;
+	}
+
+	net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev);
+	net_failover_compute_features(failover_dev);
+
+	call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
+
+	netdev_info(failover_dev, "failover %s slave:%s registered\n",
+		    slave_is_standby ? "standby" : "primary", slave_dev->name);
+
+	goto done;
+
+err_upper_link:
+	netdev_rx_handler_unregister(slave_dev);
+err_handler_register:
+	vlan_vids_del_by_dev(slave_dev, failover_dev);
+err_vlan_add:
+	dev_uc_unsync(slave_dev, failover_dev);
+	dev_mc_unsync(slave_dev, failover_dev);
+	dev_close(slave_dev);
+err_dev_open:
+	dev_put(slave_dev);
+	dev_set_mtu(slave_dev, orig_mtu);
+done:
+	return NOTIFY_DONE;
+}
+
+static int net_failover_slave_unregister(struct net_device *slave_dev,
+					 struct net_device *failover_dev)
+{
+	struct net_device *standby_dev, *primary_dev;
+	struct net_failover_info *nfo_info;
+	bool slave_is_standby;
+
+	nfo_info = netdev_priv(failover_dev);
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		goto done;
+
+	slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
+
+	netdev_rx_handler_unregister(slave_dev);
+	netdev_upper_dev_unlink(slave_dev, failover_dev);
+	vlan_vids_del_by_dev(slave_dev, failover_dev);
+	dev_uc_unsync(slave_dev, failover_dev);
+	dev_mc_unsync(slave_dev, failover_dev);
+	dev_close(slave_dev);
+	slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
+
+	nfo_info = netdev_priv(failover_dev);
+	dev_get_stats(failover_dev, &nfo_info->failover_stats);
+
+	if (slave_is_standby) {
+		RCU_INIT_POINTER(nfo_info->standby_dev, NULL);
+	} else {
+		RCU_INIT_POINTER(nfo_info->primary_dev, NULL);
+		if (standby_dev) {
+			failover_dev->min_mtu = standby_dev->min_mtu;
+			failover_dev->max_mtu = standby_dev->max_mtu;
+		}
+	}
+
+	dev_put(slave_dev);
+
+	net_failover_compute_features(failover_dev);
+
+	netdev_info(failover_dev, "failover %s slave:%s unregistered\n",
+		    slave_is_standby ? "standby" : "primary", slave_dev->name);
+
+done:
+	return NOTIFY_DONE;
+}
+
+static int net_failover_slave_link_change(struct net_device *slave_dev,
+					  struct net_device *failover_dev)
+{
+	struct net_device *primary_dev, *standby_dev;
+	struct net_failover_info *nfo_info;
+
+	nfo_info = netdev_priv(failover_dev);
+
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		goto done;
+
+	if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
+	    (standby_dev && net_failover_xmit_ready(standby_dev))) {
+		netif_carrier_on(failover_dev);
+		netif_tx_wake_all_queues(failover_dev);
+	} else {
+		dev_get_stats(failover_dev, &nfo_info->failover_stats);
+		netif_carrier_off(failover_dev);
+		netif_tx_stop_all_queues(failover_dev);
+	}
+
+	net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev);
+
+done:
+	return NOTIFY_DONE;
+}
+
+static int net_failover_slave_name_change(struct net_device *slave_dev,
+					  struct net_device *failover_dev)
+{
+	struct net_device *primary_dev, *standby_dev;
+	struct net_failover_info *nfo_info;
+
+	nfo_info = netdev_priv(failover_dev);
+
+	primary_dev = rtnl_dereference(nfo_info->primary_dev);
+	standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+	if (slave_dev != primary_dev && slave_dev != standby_dev)
+		goto done;
+
+	/* We need to bring up the slave after the rename by udev in case
+	 * open failed with EBUSY when it was registered.
+	 */
+	dev_open(slave_dev);
+
+done:
+	return NOTIFY_DONE;
+}
+
+static struct failover_ops net_failover_ops = {
+	.slave_register		= net_failover_slave_register,
+	.slave_unregister	= net_failover_slave_unregister,
+	.slave_link_change	= net_failover_slave_link_change,
+	.slave_name_change	= net_failover_slave_name_change,
+};
+
+/**
+ * net_failover_create - Create and register a failover instance
+ *
+ * @dev: standby netdev
+ *
+ * Creates a failover netdev and registers a failover instance for a standby
+ * netdev. Used by paravirtual drivers that use 3-netdev model.
+ * The failover netdev acts as a master device and controls 2 slave devices -
+ * the original standby netdev and a VF netdev with the same MAC gets
+ * registered as primary netdev.
+ *
+ * Return: pointer to failover instance
+ */
+struct failover *net_failover_create(struct net_device *standby_dev)
+{
+	struct device *dev = standby_dev->dev.parent;
+	struct net_device *failover_dev;
+	struct failover *failover;
+	int err;
+
+	/* Alloc at least 2 queues, for now we are going with 16 assuming
+	 * that VF devices being enslaved won't have too many queues.
+	 */
+	failover_dev = alloc_etherdev_mq(sizeof(struct net_failover_info), 16);
+	if (!failover_dev) {
+		dev_err(dev, "Unable to allocate failover_netdev!\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	dev_net_set(failover_dev, dev_net(standby_dev));
+	SET_NETDEV_DEV(failover_dev, dev);
+
+	failover_dev->netdev_ops = &failover_dev_ops;
+	failover_dev->ethtool_ops = &failover_ethtool_ops;
+
+	/* Initialize the device options */
+	failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+	failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+				       IFF_TX_SKB_SHARING);
+
+	/* don't acquire failover netdev's netif_tx_lock when transmitting */
+	failover_dev->features |= NETIF_F_LLTX;
+
+	/* Don't allow failover devices to change network namespaces. */
+	failover_dev->features |= NETIF_F_NETNS_LOCAL;
+
+	failover_dev->hw_features = FAILOVER_VLAN_FEATURES |
+				    NETIF_F_HW_VLAN_CTAG_TX |
+				    NETIF_F_HW_VLAN_CTAG_RX |
+				    NETIF_F_HW_VLAN_CTAG_FILTER;
+
+	failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+	failover_dev->features |= failover_dev->hw_features;
+
+	memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
+	       failover_dev->addr_len);
+
+	failover_dev->min_mtu = standby_dev->min_mtu;
+	failover_dev->max_mtu = standby_dev->max_mtu;
+
+	err = register_netdev(failover_dev);
+	if (err) {
+		dev_err(dev, "Unable to register failover_dev!\n");
+		goto err_register_netdev;
+	}
+
+	netif_carrier_off(failover_dev);
+
+	failover = failover_register(failover_dev, &net_failover_ops);
+	if (IS_ERR(failover))
+		goto err_failover_register;
+
+	return failover;
+
+err_failover_register:
+	unregister_netdev(failover_dev);
+err_register_netdev:
+	free_netdev(failover_dev);
+
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(net_failover_create);
+
+/**
+ * net_failover_destroy - Destroy a failover instance
+ *
+ * @failover: pointer to failover instance
+ *
+ * Unregisters any slave netdevs associated with the failover instance by
+ * calling failover_slave_unregister().
+ * unregisters the failover instance itself and finally frees the failover
+ * netdev. Used by paravirtual drivers that use 3-netdev model.
+ *
+ */
+void net_failover_destroy(struct failover *failover)
+{
+	struct net_failover_info *nfo_info;
+	struct net_device *failover_dev;
+	struct net_device *slave_dev;
+
+	if (!failover)
+		return;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+	nfo_info = netdev_priv(failover_dev);
+
+	netif_device_detach(failover_dev);
+
+	rtnl_lock();
+
+	slave_dev = rtnl_dereference(nfo_info->primary_dev);
+	if (slave_dev)
+		failover_slave_unregister(slave_dev);
+
+	slave_dev = rtnl_dereference(nfo_info->standby_dev);
+	if (slave_dev)
+		failover_slave_unregister(slave_dev);
+
+	failover_unregister(failover);
+
+	unregister_netdevice(failover_dev);
+
+	rtnl_unlock();
+
+	free_netdev(failover_dev);
+}
+EXPORT_SYMBOL_GPL(net_failover_destroy);
+
+static __init int
+net_failover_init(void)
+{
+	return 0;
+}
+module_init(net_failover_init);
+
+static __exit
+void net_failover_exit(void)
+{
+}
+module_exit(net_failover_exit);
+
+MODULE_DESCRIPTION("Failover driver for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
diff --git a/include/net/net_failover.h b/include/net/net_failover.h
new file mode 100644
index 000000000000..b12a1c469d1c
--- /dev/null
+++ b/include/net/net_failover.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_FAILOVER_H
+#define _NET_FAILOVER_H
+
+#include <net/failover.h>
+
+/* failover state */
+struct net_failover_info {
+	/* primary netdev with same MAC */
+	struct net_device __rcu *primary_dev;
+
+	/* standby netdev */
+	struct net_device __rcu *standby_dev;
+
+	/* primary netdev stats */
+	struct rtnl_link_stats64 primary_stats;
+
+	/* standby netdev stats */
+	struct rtnl_link_stats64 standby_stats;
+
+	/* aggregated stats */
+	struct rtnl_link_stats64 failover_stats;
+
+	/* spinlock while updating stats */
+	spinlock_t stats_lock;
+};
+
+struct failover *net_failover_create(struct net_device *standby_dev);
+void net_failover_destroy(struct failover *failover);
+
+#define FAILOVER_VLAN_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+				 NETIF_F_HIGHDMA | NETIF_F_LRO)
+
+#define FAILOVER_ENC_FEATURES	(NETIF_F_HW_CSUM | NETIF_F_SG | \
+				 NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
+
+#endif /* _NET_FAILOVER_H */
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework
From: Sridhar Samudrala @ 2018-05-22  2:06 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-1-git-send-email-sridhar.samudrala@intel.com>

Use the registration/notification framework supported by the generic
failover infrastructure.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 133 +++++++---------------------------------
 3 files changed, 25 insertions(+), 111 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 0765d5f61714..23a2d145813a 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -2,5 +2,6 @@ config HYPERV_NET
 	tristate "Microsoft Hyper-V virtual network driver"
 	depends on HYPERV
 	select UCS2_STRING
+	select FAILOVER
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 1be34d2e3563..99d8e7398a5b 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -932,6 +932,8 @@ struct net_device_context {
 	u32 vf_alloc;
 	/* Serial number of the VF to team with */
 	u32 vf_serial;
+
+	struct failover *failover;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index da07ccdf84bf..6c77a81b7266 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
 #include <net/pkt_sched.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
+#include <net/failover.h>
 
 #include "hyperv_net.h"
 
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		if (ether_addr_equal(mac, dev->perm_addr))
-			return dev;
-	}
-
-	return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		struct net_device_context *net_device_ctx;
-
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		net_device_ctx = netdev_priv(dev);
-		if (!rtnl_dereference(net_device_ctx->nvdev))
-			continue;	/* device is removed */
-
-		if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-			return dev;	/* a match */
-	}
-
-	return NULL;
-}
-
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1915,24 +1876,15 @@ static void netvsc_vf_setup(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_register_vf(struct net_device *vf_netdev,
+			      struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
 
 	if (vf_netdev->addr_len != ETH_ALEN)
 		return NOTIFY_DONE;
 
-	/*
-	 * We will use the MAC address to locate the synthetic interface to
-	 * associate with the VF interface. If we don't find a matching
-	 * synthetic interface, move on.
-	 */
-	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
@@ -1949,17 +1901,13 @@ static int netvsc_register_vf(struct net_device *vf_netdev)
 }
 
 /* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
-	struct net_device *ndev;
 	bool vf_is_up = netif_running(vf_netdev);
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev)
@@ -1972,15 +1920,11 @@ static int netvsc_vf_changed(struct net_device *vf_netdev)
 	return NOTIFY_OK;
 }
 
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_unregister_vf(struct net_device *vf_netdev,
+				struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
 
@@ -1994,6 +1938,12 @@ static int netvsc_unregister_vf(struct net_device *vf_netdev)
 	return NOTIFY_OK;
 }
 
+static struct failover_ops netvsc_failover_ops = {
+	.slave_register		= netvsc_register_vf,
+	.slave_unregister	= netvsc_unregister_vf,
+	.slave_link_change	= netvsc_vf_changed,
+};
+
 static int netvsc_probe(struct hv_device *dev,
 			const struct hv_vmbus_device_id *dev_id)
 {
@@ -2083,8 +2033,14 @@ static int netvsc_probe(struct hv_device *dev,
 		goto register_failed;
 	}
 
+	net_device_ctx->failover = failover_register(net, &netvsc_failover_ops);
+	if (IS_ERR(net_device_ctx->failover))
+		goto err_failover;
+
 	return ret;
 
+err_failover:
+	unregister_netdev(net);
 register_failed:
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
@@ -2125,13 +2081,15 @@ static int netvsc_remove(struct hv_device *dev)
 	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
 	if (vf_netdev)
-		netvsc_unregister_vf(vf_netdev);
+		failover_slave_unregister(vf_netdev);
 
 	if (nvdev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
 
+	failover_unregister(ndev_ctx->failover);
+
 	rtnl_unlock();
 	rcu_read_unlock();
 
@@ -2158,54 +2116,8 @@ static struct  hv_driver netvsc_drv = {
 	.remove = netvsc_remove,
 };
 
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
-			       unsigned long event, void *ptr)
-{
-	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
-	/* Skip our own events */
-	if (event_dev->netdev_ops == &device_ops)
-		return NOTIFY_DONE;
-
-	/* Avoid non-Ethernet type devices */
-	if (event_dev->type != ARPHRD_ETHER)
-		return NOTIFY_DONE;
-
-	/* Avoid Vlan dev with same MAC registering as VF */
-	if (is_vlan_dev(event_dev))
-		return NOTIFY_DONE;
-
-	/* Avoid Bonding master dev with same MAC registering as VF */
-	if ((event_dev->priv_flags & IFF_BONDING) &&
-	    (event_dev->flags & IFF_MASTER))
-		return NOTIFY_DONE;
-
-	switch (event) {
-	case NETDEV_REGISTER:
-		return netvsc_register_vf(event_dev);
-	case NETDEV_UNREGISTER:
-		return netvsc_unregister_vf(event_dev);
-	case NETDEV_UP:
-	case NETDEV_DOWN:
-		return netvsc_vf_changed(event_dev);
-	default:
-		return NOTIFY_DONE;
-	}
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
-	.notifier_call = netvsc_netdev_event,
-};
-
 static void __exit netvsc_drv_exit(void)
 {
-	unregister_netdevice_notifier(&netvsc_netdev_notifier);
 	vmbus_driver_unregister(&netvsc_drv);
 }
 
@@ -2225,7 +2137,6 @@ static int __init netvsc_drv_init(void)
 	if (ret)
 		return ret;
 
-	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }
 
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v11 1/5] net: Introduce generic failover module
From: Sridhar Samudrala @ 2018-05-22  2:06 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-1-git-send-email-sridhar.samudrala@intel.com>

The failover module provides a generic interface for paravirtual drivers
to register a netdev and a set of ops with a failover instance. The ops
are used as event handlers that get called to handle netdev register/
unregister/link change/name change events on slave pci ethernet devices
with the same mac address as the failover netdev.

This enables paravirtual drivers to use a VF as an accelerated low latency
datapath. It also allows migration of VMs with direct attached VFs by
failing over to the paravirtual datapath when the VF is unplugged.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 Documentation/networking/failover.rst |  18 +++
 MAINTAINERS                           |   8 +
 include/linux/netdevice.h             |  16 ++
 include/net/failover.h                |  31 ++++
 net/Kconfig                           |  13 ++
 net/core/Makefile                     |   1 +
 net/core/failover.c                   | 273 ++++++++++++++++++++++++++++++++++
 7 files changed, 360 insertions(+)
 create mode 100644 Documentation/networking/failover.rst
 create mode 100644 include/net/failover.h
 create mode 100644 net/core/failover.c

diff --git a/Documentation/networking/failover.rst b/Documentation/networking/failover.rst
new file mode 100644
index 000000000000..f0c8483cdbf5
--- /dev/null
+++ b/Documentation/networking/failover.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+FAILOVER
+========
+
+Overview
+========
+
+The failover module provides a generic interface for paravirtual drivers
+to register a netdev and a set of ops with a failover instance. The ops
+are used as event handlers that get called to handle netdev register/
+unregister/link change/name change events on slave pci ethernet devices
+with the same mac address as the failover netdev.
+
+This enables paravirtual drivers to use a VF as an accelerated low latency
+datapath. It also allows live migration of VMs with direct attached VFs by
+failing over to the paravirtual datapath when the VF is unplugged.
diff --git a/MAINTAINERS b/MAINTAINERS
index 658880464b9d..aed9575ddc53 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5413,6 +5413,14 @@ S:	Maintained
 F:	Documentation/hwmon/f71805f
 F:	drivers/hwmon/f71805f.c
 
+FAILOVER MODULE
+M:	Sridhar Samudrala <sridhar.samudrala@intel.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	net/core/failover.c
+F:	include/net/failover.h
+F:	Documentation/networking/failover.rst
+
 FANOTIFY
 M:	Jan Kara <jack@suse.cz>
 R:	Amir Goldstein <amir73il@gmail.com>
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492c4e14..0f4ba52b641d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1421,6 +1421,8 @@ struct net_device_ops {
  *	entity (i.e. the master device for bridged veth)
  * @IFF_MACSEC: device is a MACsec device
  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
+ * @IFF_FAILOVER: device is a failover master device
+ * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
  */
 enum netdev_priv_flags {
 	IFF_802_1Q_VLAN			= 1<<0,
@@ -1450,6 +1452,8 @@ enum netdev_priv_flags {
 	IFF_PHONY_HEADROOM		= 1<<24,
 	IFF_MACSEC			= 1<<25,
 	IFF_NO_RX_HANDLER		= 1<<26,
+	IFF_FAILOVER			= 1<<27,
+	IFF_FAILOVER_SLAVE		= 1<<28,
 };
 
 #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN
@@ -1478,6 +1482,8 @@ enum netdev_priv_flags {
 #define IFF_RXFH_CONFIGURED		IFF_RXFH_CONFIGURED
 #define IFF_MACSEC			IFF_MACSEC
 #define IFF_NO_RX_HANDLER		IFF_NO_RX_HANDLER
+#define IFF_FAILOVER			IFF_FAILOVER
+#define IFF_FAILOVER_SLAVE		IFF_FAILOVER_SLAVE
 
 /**
  *	struct net_device - The DEVICE structure.
@@ -4321,6 +4327,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
 	return dev->priv_flags & IFF_RXFH_CONFIGURED;
 }
 
+static inline bool netif_is_failover(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_FAILOVER;
+}
+
+static inline bool netif_is_failover_slave(const struct net_device *dev)
+{
+	return dev->priv_flags & IFF_FAILOVER_SLAVE;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/failover.h b/include/net/failover.h
new file mode 100644
index 000000000000..fea9d0b978c8
--- /dev/null
+++ b/include/net/failover.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _FAILOVER_H
+#define _FAILOVER_H
+
+#include <linux/netdevice.h>
+
+struct failover_ops {
+	int (*slave_register)(struct net_device *slave_dev,
+			      struct net_device *failover_dev);
+	int (*slave_unregister)(struct net_device *slave_dev,
+				struct net_device *failover_dev);
+	int (*slave_link_change)(struct net_device *slave_dev,
+				 struct net_device *failover_dev);
+	int (*slave_name_change)(struct net_device *slave_dev,
+				 struct net_device *failover_dev);
+};
+
+struct failover {
+	struct list_head list;
+	struct net_device __rcu *failover_dev;
+	struct failover_ops __rcu *ops;
+};
+
+struct failover *failover_register(struct net_device *dev,
+				   struct failover_ops *ops);
+void failover_unregister(struct failover *failover);
+int failover_slave_unregister(struct net_device *slave_dev);
+
+#endif /* _FAILOVER_H */
diff --git a/net/Kconfig b/net/Kconfig
index df8d45ef47d8..d581c4f0f1c4 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -430,6 +430,19 @@ config MAY_USE_DEVLINK
 config PAGE_POOL
        bool
 
+config FAILOVER
+	tristate "Generic failover module"
+	help
+	  The failover module provides a generic interface for paravirtual
+	  drivers to register a netdev and a set of ops with a failover
+	  instance. The ops are used as event handlers that get called to
+	  handle netdev register/unregister/link change/name change events
+	  on slave pci ethernet devices with the same mac address as the
+	  failover netdev. This enables paravirtual drivers to use a
+	  VF as an accelerated low latency datapath. It also allows live
+	  migration of VMs with direct attached VFs by failing over to the
+	  paravirtual datapath when the VF is unplugged.
+
 endif   # if NET
 
 # Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 7080417f8bc8..80175e6a2eb8 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
 obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_FAILOVER) += failover.o
diff --git a/net/core/failover.c b/net/core/failover.c
new file mode 100644
index 000000000000..48ee713a1ef4
--- /dev/null
+++ b/net/core/failover.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* A common module to handle registrations and notifications for paravirtual
+ * drivers to enable accelerated datapath and support VF live migration.
+ *
+ * The notifier and event handling code is based on netvsc driver.
+ */
+
+#include <linux/module.h>
+#include <linux/etherdevice.h>
+#include <uapi/linux/if_arp.h>
+#include <linux/rtnetlink.h>
+#include <net/failover.h>
+
+static LIST_HEAD(failover_list);
+static DEFINE_SPINLOCK(failover_lock);
+
+static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
+{
+	struct net_device *failover_dev;
+	struct failover *failover;
+
+	spin_lock(&failover_lock);
+	list_for_each_entry(failover, &failover_list, list) {
+		failover_dev = rtnl_dereference(failover->failover_dev);
+		if (ether_addr_equal(failover_dev->perm_addr, mac)) {
+			*ops = rtnl_dereference(failover->ops);
+			spin_unlock(&failover_lock);
+			return failover_dev;
+		}
+	}
+	spin_unlock(&failover_lock);
+	return NULL;
+}
+
+/**
+ * failover_slave_register - Register a slave netdev
+ *
+ * @slave_dev: slave netdev that is being registered
+ *
+ * Registers a slave device to a failover instance. Only ethernet devices
+ * are supported.
+ */
+static int failover_slave_register(struct net_device *slave_dev)
+{
+	struct net_device *failover_dev;
+	struct failover_ops *fops;
+
+	if (slave_dev->type != ARPHRD_ETHER)
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
+	if (!failover_dev)
+		goto done;
+
+	if (fops && fops->slave_register)
+		return fops->slave_register(slave_dev, failover_dev);
+
+done:
+	return NOTIFY_DONE;
+}
+
+/**
+ * failover_slave_unregister - Unregister a slave netdev
+ *
+ * @slave_dev: slave netdev that is being unregistered
+ *
+ * Unregisters a slave device from a failover instance.
+ */
+int failover_slave_unregister(struct net_device *slave_dev)
+{
+	struct net_device *failover_dev;
+	struct failover_ops *fops;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
+	if (!failover_dev)
+		goto done;
+
+	if (fops && fops->slave_unregister)
+		return fops->slave_unregister(slave_dev, failover_dev);
+
+done:
+	return NOTIFY_DONE;
+}
+EXPORT_SYMBOL_GPL(failover_slave_unregister);
+
+static int failover_slave_link_change(struct net_device *slave_dev)
+{
+	struct net_device *failover_dev;
+	struct failover_ops *fops;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
+	if (!failover_dev)
+		goto done;
+
+	if (!netif_running(failover_dev))
+		goto done;
+
+	if (fops && fops->slave_link_change)
+		return fops->slave_link_change(slave_dev, failover_dev);
+
+done:
+	return NOTIFY_DONE;
+}
+
+static int failover_slave_name_change(struct net_device *slave_dev)
+{
+	struct net_device *failover_dev;
+	struct failover_ops *fops;
+
+	if (!netif_is_failover_slave(slave_dev))
+		goto done;
+
+	ASSERT_RTNL();
+
+	failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
+	if (!failover_dev)
+		goto done;
+
+	if (!netif_running(failover_dev))
+		goto done;
+
+	if (fops && fops->slave_name_change)
+		return fops->slave_name_change(slave_dev, failover_dev);
+
+done:
+	return NOTIFY_DONE;
+}
+
+static int
+failover_event(struct notifier_block *this, unsigned long event, void *ptr)
+{
+	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+	/* Skip parent events */
+	if (netif_is_failover(event_dev))
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		return failover_slave_register(event_dev);
+	case NETDEV_UNREGISTER:
+		return failover_slave_unregister(event_dev);
+	case NETDEV_UP:
+	case NETDEV_DOWN:
+	case NETDEV_CHANGE:
+		return failover_slave_link_change(event_dev);
+	case NETDEV_CHANGENAME:
+		return failover_slave_name_change(event_dev);
+	default:
+		return NOTIFY_DONE;
+	}
+}
+
+static struct notifier_block failover_notifier = {
+	.notifier_call = failover_event,
+};
+
+static void
+failover_existing_slave_register(struct net_device *failover_dev)
+{
+	struct net *net = dev_net(failover_dev);
+	struct net_device *dev;
+
+	rtnl_lock();
+	for_each_netdev(net, dev) {
+		if (netif_is_failover(dev))
+			continue;
+		if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
+			failover_slave_register(dev);
+	}
+	rtnl_unlock();
+}
+
+/**
+ * failover_register - Register a failover instance
+ *
+ * @dev: failover netdev
+ * @ops: failover ops
+ *
+ * Allocate and register a failover instance for a failover netdev. ops
+ * provides handlers for slave device register/unregister/link change/
+ * name change events.
+ *
+ * Return: pointer to failover instance
+ */
+struct failover *failover_register(struct net_device *dev,
+				   struct failover_ops *ops)
+{
+	struct failover *failover;
+
+	if (dev->type != ARPHRD_ETHER)
+		return ERR_PTR(-EINVAL);
+
+	failover = kzalloc(sizeof(*failover), GFP_KERNEL);
+	if (!failover)
+		return ERR_PTR(-ENOMEM);
+
+	rcu_assign_pointer(failover->ops, ops);
+	dev_hold(dev);
+	dev->priv_flags |= IFF_FAILOVER;
+	rcu_assign_pointer(failover->failover_dev, dev);
+
+	spin_lock(&failover_lock);
+	list_add_tail(&failover->list, &failover_list);
+	spin_unlock(&failover_lock);
+
+	netdev_info(dev, "failover master:%s registered\n", dev->name);
+
+	failover_existing_slave_register(dev);
+
+	return failover;
+}
+EXPORT_SYMBOL_GPL(failover_register);
+
+/**
+ * failover_unregister - Unregister a failover instance
+ *
+ * @failover: pointer to failover instance
+ *
+ * Unregisters and frees a failover instance.
+ */
+void failover_unregister(struct failover *failover)
+{
+	struct net_device *failover_dev;
+
+	failover_dev = rcu_dereference(failover->failover_dev);
+
+	netdev_info(failover_dev, "failover master:%s unregistered\n",
+		    failover_dev->name);
+
+	failover_dev->priv_flags &= ~IFF_FAILOVER;
+	dev_put(failover_dev);
+
+	spin_lock(&failover_lock);
+	list_del(&failover->list);
+	spin_unlock(&failover_lock);
+
+	kfree(failover);
+}
+EXPORT_SYMBOL_GPL(failover_unregister);
+
+static __init int
+failover_init(void)
+{
+	register_netdevice_notifier(&failover_notifier);
+
+	return 0;
+}
+module_init(failover_init);
+
+static __exit
+void failover_exit(void)
+{
+	unregister_netdevice_notifier(&failover_notifier);
+}
+module_exit(failover_exit);
+
+MODULE_DESCRIPTION("Generic failover infrastructure/interface");
+MODULE_LICENSE("GPL v2");
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v11 0/5] Enable virtio_net to act as a standby for a passthru device
From: Sridhar Samudrala @ 2018-05-22  2:06 UTC (permalink / raw)
  To: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
	jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai

The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a failover module that provides a generic interface for 
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. 

Patch 2 refactors netvsc to use the registration/notification framework
introduced by failover module.

Patch 3 introduces a net_failover driver that provides an automated
failover mechanism to paravirtual drivers via APIs to create and destroy
a failover master netdev and mananges a primary and standby slave netdevs
that get registered via the generic failover infrastructure.

Patch 4 introduces a new feature bit VIRTIO_NET_F_STANDBY to virtio-net
that can be used by hypervisor to indicate that virtio_net interface
should act as a standby for another device with the same MAC address.

Patch 5 extends virtio_net to use alternate datapath when available and
registered. When STANDBY feature is enabled, virtio_net driver uese the
net_failover API to create an additional 'failover' netdev that acts as
a master device and controls 2 slave devices.  The original virtio_net
netdev is registered as 'standby' netdev and a passthru/vf device with
the same MAC gets registered as 'primary' netdev. Both 'standby' and
'failover' netdevs are associated with the same 'pci' device.  The user
accesses the network interface via 'failover' netdev. The 'failover'
netdev chooses 'primary' netdev as default for transmits when it is
available with link up and running.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in failover netdev and replaying when the 'primary' netdev gets registered. 
 
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

v11:
- Tested live migration with virtio-net/AVF(i40evf) configured in failover
  mode while running iperf in background. Tried static ip and dhcp
  configurations using 'network' scripts and Network Manager.
- Build tested netvsc module.
Updates:
- Split net_failover module into 2 components.
  1. 'failover' module that provides generic failover infrastructure
  to register a failover instance and listen for slave events.
  2. 'net_failover' driver that provides APIs to create/destroy upper
  netdev and supports 3-netdev model used by virtio-net.
- Added documentation

v10:
- Tested live migration with virtio-net/AVF(i40evf) configured in failover
  mode while running iperf in background. Tried static ip and dhcp
  configurations using 'network' scripts and Network Manager.
  - To avoid dhcp requests to be sent over the lower 'standby' and 'primary'
    netdevs, ifcfg-xxx scripts for the lower devices can be created.
- Build tested netvsc module.
Updates:
- fix net_failover_open() to update failover CARRIER correctly based on
  standby and primary states. 
- fix net_failover_handle_frame() to handle frames received on standby
  when primary is present.
- replace netdev_upper_dev_link with netdev_master_upper_dev_link and
  handle lower dev state changes.
- fix net_failver_create() and net_failover_register() interfaces to
  use ERR_PTR and avoid arg **
- disable setting mac address when virtio-net in STANDBY mode
- document exported symbols
- added entry to MAINTAINERS file

v9:
Select NET_FAILOVER automatically when VIRTIO_NET/HYPERV_NET 
are enabled. (stephen)

v8:
- Made the failover managment routines more robust by updating the feature 
  bits/other fields in the failover netdev when slave netdevs are 
  registered/unregistered. (mst)
- added support for handling vlans.
- Limited the changes in netvsc to only use the notifier/event/lookups
  from the failover module. The slave register/unregister/link-change 
  handlers are only updated to use the getbymac routine to get the 
  upper netdev. There is no change in their functionality. (stephen)
- renamed structs/function/file names to use net_failover prefix. (mst)

v7
- Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
  (jiri, mst)
- re-arranged dev_open() and dev_set_mtu() calls in the register routines
  so that they don't get called for 2-netdev model. (stephen)
- fixed select_queue() routine to do queue selection based on VF if it is
  registered as primary. (stephen)
-  minor bugfixes

v6 RFC:
  Simplified virtio_net changes by moving all the ndo_ops of the 
  bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
  avoided 2 phase registration(driver + instances).
  introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags 
  replaced mutex with a spinlock

v5 RFC:
  Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
  Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
  east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu

Sridhar Samudrala (5):
  net: Introduce generic failover module
  netvsc: refactor notifier/event handling code to use the failover
    framework
  net: Introduce net_failover driver
  virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
  virtio_net: Extend virtio to use VF datapath when available

 Documentation/networking/failover.rst     |  18 +
 Documentation/networking/net_failover.rst | 116 +++++
 MAINTAINERS                               |  16 +
 drivers/net/Kconfig                       |  13 +
 drivers/net/Makefile                      |   1 +
 drivers/net/hyperv/Kconfig                |   1 +
 drivers/net/hyperv/hyperv_net.h           |   2 +
 drivers/net/hyperv/netvsc_drv.c           | 133 +----
 drivers/net/net_failover.c                | 836 ++++++++++++++++++++++++++++++
 drivers/net/virtio_net.c                  |  40 +-
 include/linux/netdevice.h                 |  16 +
 include/net/failover.h                    |  31 ++
 include/net/net_failover.h                |  40 ++
 include/uapi/linux/virtio_net.h           |   3 +
 net/Kconfig                               |  13 +
 net/core/Makefile                         |   1 +
 net/core/failover.c                       | 273 ++++++++++
 17 files changed, 1440 insertions(+), 113 deletions(-)
 create mode 100644 Documentation/networking/failover.rst
 create mode 100644 Documentation/networking/net_failover.rst
 create mode 100644 drivers/net/net_failover.c
 create mode 100644 include/net/failover.h
 create mode 100644 include/net/net_failover.h
 create mode 100644 net/core/failover.c

-- 
2.14.3

^ permalink raw reply

* Re: [PATCH 05/33] cxgb4: use match_string() helper
From: Yisheng Xie @ 2018-05-22  0:55 UTC (permalink / raw)
  To: Andy Shevchenko; +Cc: Linux Kernel Mailing List, Ganesh Goudar, netdev
In-Reply-To: <CAHp75Vcpfeey_womFqoqp_gyy0=jqMHgW+30WJs-BcVTM-vHQA@mail.gmail.com>

Hi Andy,

On 2018/5/22 5:39, Andy Shevchenko wrote:
> On Mon, May 21, 2018 at 2:57 PM, Yisheng Xie <xieyisheng1@huawei.com> wrote:
>> match_string() returns the index of an array for a matching string,
>> which can be used intead of open coded variant.
> 
>> -       for (i = 0; i < ARRAY_SIZE(cudbg_region); i++) {
>> -               if (!strcmp(cudbg_region[i], region_name)) {
>> -                       found = 1;
>> -                       idx = i;
>> -                       break;
>> -               }
>> -       }
>> -       if (!found)
>> -               return -EINVAL;
>> +       rc = match_string(cudbg_region, ARRAY_SIZE(cudbg_region), region_name);
>> +       if (rc < 0)
>> +               return rc;
>>
>> -       found = 0;
>> +       idx = rc;
> 
> Is found still in use after this?
> If so, is it initialized properly now?
it is initialized  when define 'found', so no need to be initialized once more.

Thanks
Yisheng

> 

^ permalink raw reply

* Re: INFO: rcu detected stall in is_bpf_text_address
From: Marcelo Ricardo Leitner @ 2018-05-22  0:43 UTC (permalink / raw)
  To: Xin Long
  Cc: Eric Dumazet, syzbot, ast, Daniel Borkmann, LKML, network dev,
	syzkaller-bugs, linux-sctp
In-Reply-To: <CADvbK_e7LXc7minSmaPw6iZvjWo215wG-pPAS0gr58+-VxUO7Q@mail.gmail.com>

On Sun, May 20, 2018 at 04:26:03PM +0800, Xin Long wrote:
> On Sat, May 19, 2018 at 11:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > SCTP experts, please take a look.
> >
> > On 05/19/2018 08:55 AM, syzbot wrote:
> >> Hello,
> >>
> >> syzbot found the following crash on:
> >>
> >> HEAD commit:    73fcb1a370c7 Merge branch 'akpm' (patches from Andrew)
> >> git tree:       upstream
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=1462ec0f800000
> >> kernel config:  https://syzkaller.appspot.com/x/.config?x=f3b4e30da84ec1ed
> >> dashboard link: https://syzkaller.appspot.com/bug?extid=3dcd59a1f907245f891f
> >> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> >> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1079cf8f800000
> Thank you.
> The Reproducer is more than helpful.
> 
> setsockopt$inet_sctp6_SCTP_RTOINFO(r0, 0x84, 0x0,
> &(0x7f0000000140)={0x0, 0x6, 0x7, 0x4}, 0x10)
> 
> It set rto_min=6 and rto_max=7, these are too small values.
> t3_rtx timer works fine with it. But hb_timer will get stuck there, as
> in its timer handler it starts this timer again with this value, then
> it goes to the timer handler again...

Nice, thanks Xin.

> 
> HB has to repeat this and the hb timer's expire may also have to use
> 'trans->rto >> 1 ...' stuff. But we can limit the RTO's min value, like
> HZ/20, which is 'Try again later.' number used when sock lock is
> owned by others in all timer handlers.

I think a good fix for this is to not allow the application to go
below net.sctp.rto_min, and neither above net.sctp.rto_max.

Then they can even be close to each other, won't be an issue, as long
as rto_min is something sensible. Which then brings it to the second
step of a fix: to restrict rto_min to be >= HZ/5 (copying from TCP
here).

^ permalink raw reply

* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino
From: Y Song @ 2018-05-22  0:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alban Crequy, netdev, linux-kernel, containers, cgroups,
	Alban Crequy, tj
In-Reply-To: <20180521162609.lpdrnozowmzdn57m@ast-mbp.dhcp.thefacebook.com>

On Mon, May 21, 2018 at 9:26 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sun, May 13, 2018 at 07:33:18PM +0200, Alban Crequy wrote:
>>
>> +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags)
>> +{
>> +     // TODO: pick the correct hierarchy instead of the mem controller
>> +     struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id);
>> +
>> +     if (unlikely(!cgrp))
>> +             return -EINVAL;
>> +     if (unlikely(hierarchy))
>> +             return -EINVAL;
>> +     if (unlikely(flags))
>> +             return -EINVAL;
>> +
>> +     return cgrp->kn->id.ino;
>
> ino only is not enough to identify cgroup. It needs generation number too.
> I don't quite see how hierarchy and flags can be used in the future.
> Also why limit it to memcg?
>
> How about something like this instead:
>
> BPF_CALL_2(bpf_get_current_cgroup_id)
> {
>         struct cgroup *cgrp = task_dfl_cgroup(current);
>
>         return cgrp->kn->id.id;
> }
> The user space can use fhandle api to get the same 64-bit id.

I think this should work. This will also be useful to bcc as user
space can encode desired id
in the bpf program and compared that id to the current cgroup id, so we can have
cgroup level tracing (esp. stat collection) support. To cope with
cgroup hierarchy, user can use
cgroup-array based approach or explicitly compare against multiple cgroup id's.

^ permalink raw reply

* [PATCH net-next v4 2/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-05-22  0:16 UTC (permalink / raw)
  To: netdev, pshelar; +Cc: Yi-Hung Wei
In-Reply-To: <1526948165-32443-1-git-send-email-yihung.wei@gmail.com>

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, ovs will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The following high leve APIs are provided to the userspace:
  - OVS_CT_LIMIT_CMD_SET:
    * set default connection limit for all zones
    * set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
    * remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
    * get the default connection limit for all zones
    * get the connection limit for a particular zone

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 net/openvswitch/Kconfig     |   3 +-
 net/openvswitch/conntrack.c | 541 +++++++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   3 +
 5 files changed, 557 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
 		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 				     (!NF_NAT || NF_NAT) && \
 				     (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-				     (!NF_NAT_IPV6 || NF_NAT_IPV6)))
+				     (!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+				     (!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
 	select LIBCRC32C
 	select MPLS
 	select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 02fc343feb66..e8bb91420ca9 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -16,8 +16,11 @@
 #include <linux/tcp.h>
 #include <linux/udp.h>
 #include <linux/sctp.h>
+#include <linux/static_key.h>
 #include <net/ip.h>
+#include <net/genetlink.h>
 #include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_count.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_labels.h>
 #include <net/netfilter/nf_conntrack_seqadj.h>
@@ -76,6 +79,31 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED	0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+static DEFINE_STATIC_KEY_FALSE(ovs_ct_limit_enabled);
+
+struct ovs_ct_limit {
+	/* Elements in ovs_ct_limit_info->limits hash table */
+	struct hlist_node hlist_node;
+	struct rcu_head rcu;
+	u16 zone;
+	u32 limit;
+};
+
+struct ovs_ct_limit_info {
+	u32 default_limit;
+	struct hlist_head *limits;
+	struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+	[OVS_CT_LIMIT_ATTR_ZONE_LIMIT] = { .type = NLA_NESTED, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1064,89 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels)
 	return false;
 }
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+	const struct ovs_ct_limit_info *info, u16 zone)
+{
+	return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)];
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_set(const struct ovs_ct_limit_info *info,
+			 struct ovs_ct_limit *new_ct_limit)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, new_ct_limit->zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == new_ct_limit->zone) {
+			hlist_replace_rcu(&ct_limit->hlist_node,
+					  &new_ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+
+	hlist_add_head_rcu(&new_ct_limit->hlist_node, head);
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_del(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+	struct hlist_node *n;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_safe(ct_limit, n, head, hlist_node) {
+		if (ct_limit->zone == zone) {
+			hlist_del_rcu(&ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+}
+
+/* Call with RCU read lock */
+static u32 ct_limit_get(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == zone)
+			return ct_limit->limit;
+	}
+
+	return info->default_limit;
+}
+
+static int ovs_ct_check_limit(struct net *net,
+			      const struct ovs_conntrack_info *info,
+			      const struct nf_conntrack_tuple *tuple)
+{
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	const struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	u32 per_zone_limit, connections;
+	u32 conncount_key[5];
+
+	conncount_key[0] = info->zone.id;
+
+	per_zone_limit = ct_limit_get(ct_limit_info, info->zone.id);
+	if (per_zone_limit == OVS_CT_LIMIT_UNLIMITED)
+		return 0;
+
+	connections = nf_conncount_count(net, ct_limit_info->data,
+					 conncount_key, tuple, &info->zone);
+	if (connections > per_zone_limit)
+		return -ENOMEM;
+
+	return 0;
+}
+#endif
+
 /* Lookup connection and confirm if unconfirmed. */
 static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 			 const struct ovs_conntrack_info *info,
@@ -1054,6 +1165,21 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 	if (!ct)
 		return 0;
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	if (static_branch_unlikely(&ovs_ct_limit_enabled)) {
+		if (!nf_ct_is_confirmed(ct)) {
+			err = ovs_ct_check_limit(net, info,
+				&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+			if (err) {
+				net_warn_ratelimited("openvswitch: zone: %u "
+					"execeeds conntrack limit\n",
+					info->zone.id);
+				return err;
+			}
+		}
+	}
+#endif
+
 	/* Set the conntrack event mask if given.  NEW and DELETE events have
 	 * their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener
 	 * typically would receive many kinds of updates.  Setting the event
@@ -1655,7 +1781,410 @@ static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info)
 		nf_ct_tmpl_free(ct_info->ct);
 }
 
-void ovs_ct_init(struct net *net)
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static int ovs_ct_limit_init(struct net *net, struct ovs_net *ovs_net)
+{
+	int i, err;
+
+	ovs_net->ct_limit_info = kmalloc(sizeof(*ovs_net->ct_limit_info),
+					 GFP_KERNEL);
+	if (!ovs_net->ct_limit_info)
+		return -ENOMEM;
+
+	ovs_net->ct_limit_info->default_limit = OVS_CT_LIMIT_DEFAULT;
+	ovs_net->ct_limit_info->limits =
+		kmalloc_array(CT_LIMIT_HASH_BUCKETS, sizeof(struct hlist_head),
+			      GFP_KERNEL);
+	if (!ovs_net->ct_limit_info->limits) {
+		kfree(ovs_net->ct_limit_info);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; i++)
+		INIT_HLIST_HEAD(&ovs_net->ct_limit_info->limits[i]);
+
+	ovs_net->ct_limit_info->data =
+		nf_conncount_init(net, NFPROTO_INET, sizeof(u32));
+
+	if (IS_ERR(ovs_net->ct_limit_info->data)) {
+		err = PTR_ERR(ovs_net->ct_limit_info->data);
+		kfree(ovs_net->ct_limit_info->limits);
+		kfree(ovs_net->ct_limit_info);
+		return err;
+	}
+	return 0;
+}
+
+static void ovs_ct_limit_exit(struct net *net, struct ovs_net *ovs_net)
+{
+	const struct ovs_ct_limit_info *info = ovs_net->ct_limit_info;
+	int i;
+
+	nf_conncount_destroy(net, NFPROTO_INET, info->data);
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+		struct hlist_head *head = &info->limits[i];
+		struct ovs_ct_limit *ct_limit;
+
+		hlist_for_each_entry_rcu(ct_limit, head, hlist_node)
+			kfree_rcu(ct_limit, rcu);
+	}
+	kfree(ovs_net->ct_limit_info->limits);
+	kfree(ovs_net->ct_limit_info);
+}
+
+static struct sk_buff *
+ovs_ct_limit_cmd_reply_start(struct genl_info *info, u8 cmd,
+			     struct ovs_header **ovs_reply_header)
+{
+	struct ovs_header *ovs_header = info->userhdr;
+	struct sk_buff *skb;
+
+	skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	*ovs_reply_header = genlmsg_put(skb, info->snd_portid,
+					info->snd_seq,
+					&dp_ct_limit_genl_family, 0, cmd);
+
+	if (!*ovs_reply_header) {
+		nlmsg_free(skb);
+		return ERR_PTR(-EMSGSIZE);
+	}
+	(*ovs_reply_header)->dp_ifindex = ovs_header->dp_ifindex;
+
+	return skb;
+}
+
+static bool check_zone_id(int zone_id, u16 *pzone)
+{
+	if (zone_id >= 0 && zone_id <= 65535) {
+		*pzone = (u16)zone_id;
+		return true;
+	}
+	return false;
+}
+
+static int ovs_ct_limit_set_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct ovs_zone_limit *zone_limit;
+	int rem;
+	u16 zone;
+
+	rem = NLA_ALIGN(nla_len(nla_zone_limit));
+	zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+	while (rem >= sizeof(*zone_limit)) {
+		if (unlikely(zone_limit->zone_id == -1)) {
+			ovs_lock();
+			info->default_limit = zone_limit->limit;
+			ovs_unlock();
+		} else if (unlikely(!check_zone_id(
+				zone_limit->zone_id, &zone))) {
+			OVS_NLERR(true, "zone id is out of range");
+		} else {
+			struct ovs_ct_limit *ct_limit;
+
+			ct_limit = kmalloc(sizeof(*ct_limit), GFP_KERNEL);
+			if (!ct_limit)
+				return -ENOMEM;
+
+			ct_limit->zone = zone;
+			ct_limit->limit = zone_limit->limit;
+
+			ovs_lock();
+			ct_limit_set(info, ct_limit);
+			ovs_unlock();
+		}
+		rem -= NLA_ALIGN(sizeof(*zone_limit));
+		zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+				NLA_ALIGN(sizeof(*zone_limit)));
+	}
+
+	if (rem)
+		OVS_NLERR(true, "set zone limit has %d unknown bytes", rem);
+
+	return 0;
+}
+
+static int ovs_ct_limit_del_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct ovs_zone_limit *zone_limit;
+	int rem;
+	u16 zone;
+
+	rem = NLA_ALIGN(nla_len(nla_zone_limit));
+	zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+	while (rem >= sizeof(*zone_limit)) {
+		if (unlikely(!check_zone_id(zone_limit->zone_id, &zone))) {
+			OVS_NLERR(true, "zone id is out of range");
+		} else {
+			ovs_lock();
+			ct_limit_del(info, zone);
+			ovs_unlock();
+		}
+		rem -= NLA_ALIGN(sizeof(*zone_limit));
+		zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+				NLA_ALIGN(sizeof(*zone_limit)));
+	}
+
+	if (rem)
+		OVS_NLERR(true, "del zone limit has %d unknown bytes", rem);
+
+	return 0;
+}
+
+static int ovs_ct_limit_get_default_limit(struct ovs_ct_limit_info *info,
+					  struct sk_buff *reply)
+{
+	struct ovs_zone_limit zone_limit;
+	int err;
+
+	zone_limit.zone_id = -1;
+	zone_limit.limit = info->default_limit;
+	err = nla_put_nohdr(reply, sizeof(zone_limit), &zone_limit);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+static int ovs_ct_limit_get_zone_limit(struct net *net,
+				       struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info,
+				       struct sk_buff *reply)
+{
+	struct nf_conntrack_zone ct_zone;
+	struct ovs_zone_limit *zone_limit;
+	int rem, err;
+	u32 conncount_key[5];
+	u16 zone;
+
+	rem = NLA_ALIGN(nla_len(nla_zone_limit));
+	zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+	while (rem >= sizeof(*zone_limit)) {
+		if (unlikely(zone_limit->zone_id == -1)) {
+			err = ovs_ct_limit_get_default_limit(info, reply);
+			if (err)
+				return err;
+		} else if (unlikely(!check_zone_id(zone_limit->zone_id,
+							&zone))) {
+			OVS_NLERR(true, "zone id is out of range");
+		} else {
+			rcu_read_lock();
+			zone_limit->limit = ct_limit_get(info, zone);
+			rcu_read_unlock();
+
+			nf_ct_zone_init(&ct_zone, zone, NF_CT_DEFAULT_ZONE_DIR,
+					0);
+			conncount_key[0] = zone;
+			zone_limit->count = nf_conncount_count(
+				net, info->data, conncount_key, NULL, &ct_zone);
+			err = nla_put_nohdr(reply, sizeof(*zone_limit),
+					    zone_limit);
+			if (err)
+				return err;
+		}
+		rem -= NLA_ALIGN(sizeof(*zone_limit));
+		zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+				NLA_ALIGN(sizeof(*zone_limit)));
+	}
+
+	if (rem)
+		OVS_NLERR(true, "get zone limit has %d unknown bytes", rem);
+
+	return 0;
+}
+
+static int ovs_ct_limit_get_all_zone_limit(struct net *net,
+					   struct ovs_ct_limit_info *info,
+					   struct sk_buff *reply)
+{
+	struct nf_conntrack_zone ct_zone;
+	struct ovs_zone_limit zone_limit;
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+	u32 conncount_key[5];
+	int i, err = 0;
+
+	err = ovs_ct_limit_get_default_limit(info, reply);
+	if (err)
+		return err;
+
+	rcu_read_lock();
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+		head = &info->limits[i];
+		hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+			zone_limit.zone_id = ct_limit->zone;
+			zone_limit.limit = ct_limit->limit;
+			nf_ct_zone_init(&ct_zone, ct_limit->zone,
+					NF_CT_DEFAULT_ZONE_DIR, 0);
+
+			conncount_key[0] = ct_limit->zone;
+			zone_limit.count = nf_conncount_count(net, info->data,
+					conncount_key, NULL, &ct_zone);
+			err = nla_put_nohdr(reply, sizeof(zone_limit),
+					&zone_limit);
+			if (err)
+				goto exit_err;
+		}
+	}
+
+exit_err:
+	rcu_read_unlock();
+	return err;
+}
+
+static int ovs_ct_limit_cmd_set(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_SET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	if (!a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+		err = -EINVAL;
+		goto exit_err;
+	}
+
+	err = ovs_ct_limit_set_zone_limit(a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	static_branch_enable(&ovs_ct_limit_enabled);
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_del(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_DEL,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	if (!a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+		err = -EINVAL;
+		goto exit_err;
+	}
+
+	err = ovs_ct_limit_del_zone_limit(a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_get(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct nlattr *nla_reply;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct net *net = sock_net(skb->sk);
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_GET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	nla_reply = nla_nest_start(reply, OVS_CT_LIMIT_ATTR_ZONE_LIMIT);
+
+	if (a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+		err = ovs_ct_limit_get_zone_limit(
+			net, a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT], ct_limit_info,
+			reply);
+		if (err)
+			goto exit_err;
+	} else {
+		err = ovs_ct_limit_get_all_zone_limit(net, ct_limit_info,
+						      reply);
+		if (err)
+			goto exit_err;
+	}
+
+	nla_nest_end(reply, nla_reply);
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static struct genl_ops ct_limit_genl_ops[] = {
+	{ .cmd = OVS_CT_LIMIT_CMD_SET,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_set,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_DEL,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_del,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_GET,
+		.flags = 0,		  /* OK for unprivileged users. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_get,
+	},
+};
+
+static const struct genl_multicast_group ovs_ct_limit_multicast_group = {
+	.name = OVS_CT_LIMIT_MCGROUP,
+};
+
+struct genl_family dp_ct_limit_genl_family __ro_after_init = {
+	.hdrsize = sizeof(struct ovs_header),
+	.name = OVS_CT_LIMIT_FAMILY,
+	.version = OVS_CT_LIMIT_VERSION,
+	.maxattr = OVS_CT_LIMIT_ATTR_MAX,
+	.netnsok = true,
+	.parallel_ops = true,
+	.ops = ct_limit_genl_ops,
+	.n_ops = ARRAY_SIZE(ct_limit_genl_ops),
+	.mcgrps = &ovs_ct_limit_multicast_group,
+	.n_mcgrps = 1,
+	.module = THIS_MODULE,
+};
+#endif
+
+int ovs_ct_init(struct net *net)
 {
 	unsigned int n_bits = sizeof(struct ovs_key_ct_labels) * BITS_PER_BYTE;
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
@@ -1666,12 +2195,22 @@ void ovs_ct_init(struct net *net)
 	} else {
 		ovs_net->xt_label = true;
 	}
+
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	return ovs_ct_limit_init(net, ovs_net);
+#else
+	return 0;
+#endif
 }
 
 void ovs_ct_exit(struct net *net)
 {
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	ovs_ct_limit_exit(net, ovs_net);
+#endif
+
 	if (ovs_net->xt_label)
 		nf_connlabels_put(net);
 }
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 399dfdd2c4f9..900dadd70974 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -17,10 +17,11 @@
 #include "flow.h"
 
 struct ovs_conntrack_info;
+struct ovs_ct_limit_info;
 enum ovs_key_attr;
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-void ovs_ct_init(struct net *);
+int ovs_ct_init(struct net *);
 void ovs_ct_exit(struct net *);
 bool ovs_ct_verify(struct net *, enum ovs_key_attr attr);
 int ovs_ct_copy_action(struct net *, const struct nlattr *,
@@ -44,7 +45,7 @@ void ovs_ct_free_action(const struct nlattr *a);
 #else
 #include <linux/errno.h>
 
-static inline void ovs_ct_init(struct net *net) { }
+static inline int ovs_ct_init(struct net *net) { return 0; }
 
 static inline void ovs_ct_exit(struct net *net) { }
 
@@ -104,4 +105,8 @@ static inline void ovs_ct_free_action(const struct nlattr *a) { }
 
 #define CT_SUPPORTED_MASK 0
 #endif /* CONFIG_NF_CONNTRACK */
+
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+extern struct genl_family dp_ct_limit_genl_family;
+#endif
 #endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 015e24e08909..a61818e94396 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2288,6 +2288,9 @@ static struct genl_family * const dp_genl_families[] = {
 	&dp_flow_genl_family,
 	&dp_packet_genl_family,
 	&dp_meter_genl_family,
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	&dp_ct_limit_genl_family,
+#endif
 };
 
 static void dp_unregister_genl(int n_families)
@@ -2323,8 +2326,7 @@ static int __net_init ovs_init_net(struct net *net)
 
 	INIT_LIST_HEAD(&ovs_net->dps);
 	INIT_WORK(&ovs_net->dp_notify_work, ovs_dp_notify_wq);
-	ovs_ct_init(net);
-	return 0;
+	return ovs_ct_init(net);
 }
 
 static void __net_exit list_vports_from_net(struct net *net, struct net *dnet,
@@ -2469,3 +2471,4 @@ MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_METER_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_CT_LIMIT_FAMILY);
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 523d65526766..c9eb267c6f7e 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -144,6 +144,9 @@ struct dp_upcall_info {
 struct ovs_net {
 	struct list_head dps;
 	struct work_struct dp_notify_work;
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	struct ovs_ct_limit_info *ct_limit_info;
+#endif
 
 	/* Module reference for configuring conntrack. */
 	bool xt_label;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v4 1/2] openvswitch: Add conntrack limit netlink definition
From: Yi-Hung Wei @ 2018-05-22  0:16 UTC (permalink / raw)
  To: netdev, pshelar; +Cc: Yi-Hung Wei
In-Reply-To: <1526948165-32443-1-git-send-email-yihung.wei@gmail.com>

Define netlink messages and attributes to support user kernel
communication that uses the conntrack limit feature.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 include/uapi/linux/openvswitch.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..d8da2b7591f5 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,30 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+	OVS_CT_LIMIT_CMD_UNSPEC,
+	OVS_CT_LIMIT_CMD_SET,		/* Add or modify ct limit. */
+	OVS_CT_LIMIT_CMD_DEL,		/* Delete ct limit. */
+	OVS_CT_LIMIT_CMD_GET		/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+	OVS_CT_LIMIT_ATTR_UNSPEC,
+	OVS_CT_LIMIT_ATTR_ZONE_LIMIT,	/* Nested struct ovs_zone_limit. */
+	__OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+struct ovs_zone_limit {
+	int zone_id;
+	__u32 limit;
+	__u32 count;
+};
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v4 0/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-05-22  0:16 UTC (permalink / raw)
  To: netdev, pshelar; +Cc: Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, OVS will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.

v3->v4:
  - Addresses comments from Parvin that include simplify netlink API,
    and remove unncessary RCU lockings.
  - Rebases to master.

v2->v3:
  - Addresses comments from Parvin that include using static keys to check
    if ovs_ct_limit features is used, only check ct_limit when a ct entry
    is unconfirmed, and reports rate limited warning messages when the ct
    limit is reached.
  - Rebases to master.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.


Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  26 ++
 net/openvswitch/Kconfig          |   3 +-
 net/openvswitch/conntrack.c      | 541 ++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h      |   9 +-
 net/openvswitch/datapath.c       |   7 +-
 net/openvswitch/datapath.h       |   3 +
 6 files changed, 583 insertions(+), 6 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [PATCH bpf-next 1/5] bpf: Hooks for sys_sendmsg
From: kbuild test robot @ 2018-05-22  0:08 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: kbuild-all, netdev, Andrey Ignatov, davem, ast, daniel,
	kernel-team
In-Reply-To: <654b59700a21ed2342b12f30a583ba9329ef850b.1526694154.git.rdna@fb.com>

[-- Attachment #1: Type: text/plain, Size: 10869 bytes --]

Hi Andrey,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Andrey-Ignatov/bpf-Hooks-for-sys_sendmsg/20180522-065614
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-s1-201820 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   net/ipv4/udp.c: In function 'udp_sendmsg':
   net/ipv4/udp.c:1014:44: error: macro "BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK" passed 3 arguments, but takes just 2
             (struct sockaddr *)usin, &ipc.addr);
                                               ^
>> net/ipv4/udp.c:1013:9: error: 'BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK' undeclared (first use in this function)
      err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   net/ipv4/udp.c:1013:9: note: each undeclared identifier is reported only once for each function it appears in
--
   net/ipv6/udp.c: In function 'udpv6_sendmsg':
   net/ipv6/udp.c:1320:44: error: macro "BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK" passed 3 arguments, but takes just 2
            (struct sockaddr *)sin6, &fl6.saddr);
                                               ^
>> net/ipv6/udp.c:1319:9: error: 'BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK' undeclared (first use in this function)
      err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   net/ipv6/udp.c:1319:9: note: each undeclared identifier is reported only once for each function it appears in

vim +/BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK +1013 net/ipv4/udp.c

   898	
   899	int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
   900	{
   901		struct inet_sock *inet = inet_sk(sk);
   902		struct udp_sock *up = udp_sk(sk);
   903		DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
   904		struct flowi4 fl4_stack;
   905		struct flowi4 *fl4;
   906		int ulen = len;
   907		struct ipcm_cookie ipc;
   908		struct rtable *rt = NULL;
   909		int free = 0;
   910		int connected = 0;
   911		__be32 daddr, faddr, saddr;
   912		__be16 dport;
   913		u8  tos;
   914		int err, is_udplite = IS_UDPLITE(sk);
   915		int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
   916		int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
   917		struct sk_buff *skb;
   918		struct ip_options_data opt_copy;
   919	
   920		if (len > 0xFFFF)
   921			return -EMSGSIZE;
   922	
   923		/*
   924		 *	Check the flags.
   925		 */
   926	
   927		if (msg->msg_flags & MSG_OOB) /* Mirror BSD error message compatibility */
   928			return -EOPNOTSUPP;
   929	
   930		ipc.opt = NULL;
   931		ipc.tx_flags = 0;
   932		ipc.ttl = 0;
   933		ipc.tos = -1;
   934	
   935		getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
   936	
   937		fl4 = &inet->cork.fl.u.ip4;
   938		if (up->pending) {
   939			/*
   940			 * There are pending frames.
   941			 * The socket lock must be held while it's corked.
   942			 */
   943			lock_sock(sk);
   944			if (likely(up->pending)) {
   945				if (unlikely(up->pending != AF_INET)) {
   946					release_sock(sk);
   947					return -EINVAL;
   948				}
   949				goto do_append_data;
   950			}
   951			release_sock(sk);
   952		}
   953		ulen += sizeof(struct udphdr);
   954	
   955		/*
   956		 *	Get and verify the address.
   957		 */
   958		if (usin) {
   959			if (msg->msg_namelen < sizeof(*usin))
   960				return -EINVAL;
   961			if (usin->sin_family != AF_INET) {
   962				if (usin->sin_family != AF_UNSPEC)
   963					return -EAFNOSUPPORT;
   964			}
   965	
   966			daddr = usin->sin_addr.s_addr;
   967			dport = usin->sin_port;
   968			if (dport == 0)
   969				return -EINVAL;
   970		} else {
   971			if (sk->sk_state != TCP_ESTABLISHED)
   972				return -EDESTADDRREQ;
   973			daddr = inet->inet_daddr;
   974			dport = inet->inet_dport;
   975			/* Open fast path for connected socket.
   976			   Route will not be used, if at least one option is set.
   977			 */
   978			connected = 1;
   979		}
   980	
   981		ipc.sockc.tsflags = sk->sk_tsflags;
   982		ipc.addr = inet->inet_saddr;
   983		ipc.oif = sk->sk_bound_dev_if;
   984		ipc.gso_size = up->gso_size;
   985	
   986		if (msg->msg_controllen) {
   987			err = udp_cmsg_send(sk, msg, &ipc.gso_size);
   988			if (err > 0)
   989				err = ip_cmsg_send(sk, msg, &ipc,
   990						   sk->sk_family == AF_INET6);
   991			if (unlikely(err < 0)) {
   992				kfree(ipc.opt);
   993				return err;
   994			}
   995			if (ipc.opt)
   996				free = 1;
   997			connected = 0;
   998		}
   999		if (!ipc.opt) {
  1000			struct ip_options_rcu *inet_opt;
  1001	
  1002			rcu_read_lock();
  1003			inet_opt = rcu_dereference(inet->inet_opt);
  1004			if (inet_opt) {
  1005				memcpy(&opt_copy, inet_opt,
  1006				       sizeof(*inet_opt) + inet_opt->opt.optlen);
  1007				ipc.opt = &opt_copy.opt;
  1008			}
  1009			rcu_read_unlock();
  1010		}
  1011	
  1012		if (!connected) {
> 1013			err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> 1014						    (struct sockaddr *)usin, &ipc.addr);
  1015			if (err)
  1016				goto out_free;
  1017			if (usin) {
  1018				if (usin->sin_port == 0) {
  1019					/* BPF program set invalid port. Reject it. */
  1020					err = -EINVAL;
  1021					goto out_free;
  1022				}
  1023				daddr = usin->sin_addr.s_addr;
  1024				dport = usin->sin_port;
  1025			}
  1026		}
  1027	
  1028		saddr = ipc.addr;
  1029		ipc.addr = faddr = daddr;
  1030	
  1031		sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
  1032	
  1033		if (ipc.opt && ipc.opt->opt.srr) {
  1034			if (!daddr) {
  1035				err = -EINVAL;
  1036				goto out_free;
  1037			}
  1038			faddr = ipc.opt->opt.faddr;
  1039			connected = 0;
  1040		}
  1041		tos = get_rttos(&ipc, inet);
  1042		if (sock_flag(sk, SOCK_LOCALROUTE) ||
  1043		    (msg->msg_flags & MSG_DONTROUTE) ||
  1044		    (ipc.opt && ipc.opt->opt.is_strictroute)) {
  1045			tos |= RTO_ONLINK;
  1046			connected = 0;
  1047		}
  1048	
  1049		if (ipv4_is_multicast(daddr)) {
  1050			if (!ipc.oif)
  1051				ipc.oif = inet->mc_index;
  1052			if (!saddr)
  1053				saddr = inet->mc_addr;
  1054			connected = 0;
  1055		} else if (!ipc.oif) {
  1056			ipc.oif = inet->uc_index;
  1057		} else if (ipv4_is_lbcast(daddr) && inet->uc_index) {
  1058			/* oif is set, packet is to local broadcast and
  1059			 * and uc_index is set. oif is most likely set
  1060			 * by sk_bound_dev_if. If uc_index != oif check if the
  1061			 * oif is an L3 master and uc_index is an L3 slave.
  1062			 * If so, we want to allow the send using the uc_index.
  1063			 */
  1064			if (ipc.oif != inet->uc_index &&
  1065			    ipc.oif == l3mdev_master_ifindex_by_index(sock_net(sk),
  1066								      inet->uc_index)) {
  1067				ipc.oif = inet->uc_index;
  1068			}
  1069		}
  1070	
  1071		if (connected)
  1072			rt = (struct rtable *)sk_dst_check(sk, 0);
  1073	
  1074		if (!rt) {
  1075			struct net *net = sock_net(sk);
  1076			__u8 flow_flags = inet_sk_flowi_flags(sk);
  1077	
  1078			fl4 = &fl4_stack;
  1079	
  1080			flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
  1081					   RT_SCOPE_UNIVERSE, sk->sk_protocol,
  1082					   flow_flags,
  1083					   faddr, saddr, dport, inet->inet_sport,
  1084					   sk->sk_uid);
  1085	
  1086			security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
  1087			rt = ip_route_output_flow(net, fl4, sk);
  1088			if (IS_ERR(rt)) {
  1089				err = PTR_ERR(rt);
  1090				rt = NULL;
  1091				if (err == -ENETUNREACH)
  1092					IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
  1093				goto out;
  1094			}
  1095	
  1096			err = -EACCES;
  1097			if ((rt->rt_flags & RTCF_BROADCAST) &&
  1098			    !sock_flag(sk, SOCK_BROADCAST))
  1099				goto out;
  1100			if (connected)
  1101				sk_dst_set(sk, dst_clone(&rt->dst));
  1102		}
  1103	
  1104		if (msg->msg_flags&MSG_CONFIRM)
  1105			goto do_confirm;
  1106	back_from_confirm:
  1107	
  1108		saddr = fl4->saddr;
  1109		if (!ipc.addr)
  1110			daddr = ipc.addr = fl4->daddr;
  1111	
  1112		/* Lockless fast path for the non-corking case. */
  1113		if (!corkreq) {
  1114			struct inet_cork cork;
  1115	
  1116			skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
  1117					  sizeof(struct udphdr), &ipc, &rt,
  1118					  &cork, msg->msg_flags);
  1119			err = PTR_ERR(skb);
  1120			if (!IS_ERR_OR_NULL(skb))
  1121				err = udp_send_skb(skb, fl4, &cork);
  1122			goto out;
  1123		}
  1124	
  1125		lock_sock(sk);
  1126		if (unlikely(up->pending)) {
  1127			/* The socket is already corked while preparing it. */
  1128			/* ... which is an evident application bug. --ANK */
  1129			release_sock(sk);
  1130	
  1131			net_dbg_ratelimited("socket already corked\n");
  1132			err = -EINVAL;
  1133			goto out;
  1134		}
  1135		/*
  1136		 *	Now cork the socket to pend data.
  1137		 */
  1138		fl4 = &inet->cork.fl.u.ip4;
  1139		fl4->daddr = daddr;
  1140		fl4->saddr = saddr;
  1141		fl4->fl4_dport = dport;
  1142		fl4->fl4_sport = inet->inet_sport;
  1143		up->pending = AF_INET;
  1144	
  1145	do_append_data:
  1146		up->len += ulen;
  1147		err = ip_append_data(sk, fl4, getfrag, msg, ulen,
  1148				     sizeof(struct udphdr), &ipc, &rt,
  1149				     corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
  1150		if (err)
  1151			udp_flush_pending_frames(sk);
  1152		else if (!corkreq)
  1153			err = udp_push_pending_frames(sk);
  1154		else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
  1155			up->pending = 0;
  1156		release_sock(sk);
  1157	
  1158	out:
  1159		ip_rt_put(rt);
  1160	out_free:
  1161		if (free)
  1162			kfree(ipc.opt);
  1163		if (!err)
  1164			return len;
  1165		/*
  1166		 * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.  Reporting
  1167		 * ENOBUFS might not be good (it's not tunable per se), but otherwise
  1168		 * we don't have a good statistic (IpOutDiscards but it can be too many
  1169		 * things).  We could add another new stat but at least for now that
  1170		 * seems like overkill.
  1171		 */
  1172		if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
  1173			UDP_INC_STATS(sock_net(sk),
  1174				      UDP_MIB_SNDBUFERRORS, is_udplite);
  1175		}
  1176		return err;
  1177	
  1178	do_confirm:
  1179		if (msg->msg_flags & MSG_PROBE)
  1180			dst_confirm_neigh(&rt->dst, &fl4->daddr);
  1181		if (!(msg->msg_flags&MSG_PROBE) || len)
  1182			goto back_from_confirm;
  1183		err = 0;
  1184		goto out;
  1185	}
  1186	EXPORT_SYMBOL(udp_sendmsg);
  1187	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24538 bytes --]

^ permalink raw reply

* Re: [Cake] [PATCH net-next v14 6/7] sch_cake: Add overhead compensation support to the rate shaper
From: Marcelo Ricardo Leitner @ 2018-05-21 23:45 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev, cake
In-Reply-To: <152693495874.32668.11244869690098290078.stgit@alrua-kau>

On Mon, May 21, 2018 at 10:35:58PM +0200, Toke Høiland-Jørgensen wrote:
> +static u32 cake_overhead(struct cake_sched_data *q, const struct sk_buff *skb)
> +{
> +	const struct skb_shared_info *shinfo = skb_shinfo(skb);
> +	unsigned int hdr_len, last_len = 0;
> +	u32 off = skb_network_offset(skb);
> +	u32 len = qdisc_pkt_len(skb);
> +	u16 segs = 1;
> +
> +	q->avg_netoff = cake_ewma(q->avg_netoff, off << 16, 8);
> +
> +	if (!shinfo->gso_size)
> +		return cake_calc_overhead(q, len, off);
> +
> +	/* borrowed from qdisc_pkt_len_init() */
> +	hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
> +
> +	/* + transport layer */
> +	if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 |
> +						SKB_GSO_TCPV6))) {
> +		const struct tcphdr *th;
> +		struct tcphdr _tcphdr;
> +
> +		th = skb_header_pointer(skb, skb_transport_offset(skb),
> +					sizeof(_tcphdr), &_tcphdr);
> +		if (likely(th))
> +			hdr_len += __tcp_hdrlen(th);
> +	} else {

I didn't see some code limiting GSO packets to just TCP or UDP. Is it
safe to assume that this packet is an UDP one, and not SCTP or ESP,
for example?

> +		struct udphdr _udphdr;
> +
> +		if (skb_header_pointer(skb, skb_transport_offset(skb),
> +				       sizeof(_udphdr), &_udphdr))
> +			hdr_len += sizeof(struct udphdr);
> +	}
> +
> +	if (unlikely(shinfo->gso_type & SKB_GSO_DODGY))
> +		segs = DIV_ROUND_UP(skb->len - hdr_len,
> +				    shinfo->gso_size);
> +	else
> +		segs = shinfo->gso_segs;
> +
> +	len = shinfo->gso_size + hdr_len;
> +	last_len = skb->len - shinfo->gso_size * (segs - 1);
> +
> +	return (cake_calc_overhead(q, len, off) * (segs - 1) +
> +		cake_calc_overhead(q, last_len, off));
> +}
> +

^ permalink raw reply

* [PATCH v2 net] netfilter: provide correct argument to nla_strlcpy()
From: Eric Dumazet @ 2018-05-21 23:35 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Florian Westphal,
	Pablo Neira Ayuso

Recent patch forgot to remove nla_data(), upsetting syzkaller a bit.

BUG: KASAN: slab-out-of-bounds in nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
Read of size 1 at addr ffff8801ad1f4fdd by task syz-executor189/4509

CPU: 1 PID: 4509 Comm: syz-executor189 Not tainted 4.17.0-rc6+ #62
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
 nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
 nfnl_acct_new+0x574/0xc50 net/netfilter/nfnetlink_acct.c:118
 nfnetlink_rcv_msg+0xdb5/0xff0 net/netfilter/nfnetlink.c:212
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
 nfnetlink_rcv+0x1fe/0x1ba0 net/netfilter/nfnetlink.c:513
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 sock_write_iter+0x35a/0x5a0 net/socket.c:908
 call_write_iter include/linux/fs.h:1784 [inline]
 new_sync_write fs/read_write.c:474 [inline]
 __vfs_write+0x64d/0x960 fs/read_write.c:487
 vfs_write+0x1f8/0x560 fs/read_write.c:549
 ksys_write+0xf9/0x250 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607

Fixes: 4e09fc873d92 ("netfilter: prefer nla_strlcpy for dealing with NLA_STRING attributes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/netfilter/nfnetlink_acct.c     | 2 +-
 net/netfilter/nfnetlink_cthelper.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/nfnetlink_acct.c b/net/netfilter/nfnetlink_acct.c
index 6ddf89183e7b47e6c029b28cf5b524c73a790498..a0e5adf0b3b6ddbcd01f956d25dda01c611a7663 100644
--- a/net/netfilter/nfnetlink_acct.c
+++ b/net/netfilter/nfnetlink_acct.c
@@ -115,7 +115,7 @@ static int nfnl_acct_new(struct net *net, struct sock *nfnl,
 		nfacct->flags = flags;
 	}
 
-	nla_strlcpy(nfacct->name, nla_data(tb[NFACCT_NAME]), NFACCT_NAME_MAX);
+	nla_strlcpy(nfacct->name, tb[NFACCT_NAME], NFACCT_NAME_MAX);
 
 	if (tb[NFACCT_BYTES]) {
 		atomic64_set(&nfacct->bytes,
diff --git a/net/netfilter/nfnetlink_cthelper.c b/net/netfilter/nfnetlink_cthelper.c
index fa026b269b3691d5186e28020eb2b08e93dc3679..cb5b5f2077774c29fb987b7f1eb2d75e9efc2fe1 100644
--- a/net/netfilter/nfnetlink_cthelper.c
+++ b/net/netfilter/nfnetlink_cthelper.c
@@ -150,7 +150,7 @@ nfnl_cthelper_expect_policy(struct nf_conntrack_expect_policy *expect_policy,
 		return -EINVAL;
 
 	nla_strlcpy(expect_policy->name,
-		    nla_data(tb[NFCTH_POLICY_NAME]), NF_CT_HELPER_NAME_LEN);
+		    tb[NFCTH_POLICY_NAME], NF_CT_HELPER_NAME_LEN);
 	expect_policy->max_expected =
 		ntohl(nla_get_be32(tb[NFCTH_POLICY_EXPECT_MAX]));
 	if (expect_policy->max_expected > NF_CT_EXPECT_MAX_CNT)
@@ -235,7 +235,7 @@ nfnl_cthelper_create(const struct nlattr * const tb[],
 		goto err1;
 
 	nla_strlcpy(helper->name,
-		    nla_data(tb[NFCTH_NAME]), NF_CT_HELPER_NAME_LEN);
+		    tb[NFCTH_NAME], NF_CT_HELPER_NAME_LEN);
 	size = ntohl(nla_get_be32(tb[NFCTH_PRIV_DATA_LEN]));
 	if (size > FIELD_SIZEOF(struct nf_conn_help, data)) {
 		ret = -ENOMEM;
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* Re: [PATCH net] netfilter: provide correct argument to nla_strlcpy()
From: Eric Dumazet @ 2018-05-21 23:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eric Dumazet, Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <20180521233320.66082-1-edumazet@google.com>

On Mon, May 21, 2018 at 4:33 PM Eric Dumazet <edumazet@google.com> wrote:

> Recent patch forgot to remove nla_data(), upsetting syzkaller a bit.


Humpff.

I forgot to add one file in the change.

Will send V2.

^ permalink raw reply

* Re: [Cake] [PATCH net-next v14 4/7] sch_cake: Add NAT awareness to packet classifier
From: Marcelo Ricardo Leitner @ 2018-05-21 23:34 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: netdev, cake, netfilter-devel
In-Reply-To: <152693495866.32668.5164616056948127124.stgit@alrua-kau>

[Cc'ing netfilter-devel@ for awareness]

On Mon, May 21, 2018 at 10:35:58PM +0200, Toke Høiland-Jørgensen wrote:
> When CAKE is deployed on a gateway that also performs NAT (which is a
> common deployment mode), the host fairness mechanism cannot distinguish
> internal hosts from each other, and so fails to work correctly.
> 
> To fix this, we add an optional NAT awareness mode, which will query the
> kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
> and use that in the flow and host hashing.
> 
> When the shaper is enabled and the host is already performing NAT, the cost
> of this lookup is negligible. However, in unlimited mode with no NAT being
> performed, there is a significant CPU cost at higher bandwidths. For this
> reason, the feature is turned off by default.
> 
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> ---
>  net/sched/sch_cake.c |   79 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 79 insertions(+)
> 
> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
> index 92623160d43e..04364993ce19 100644
> --- a/net/sched/sch_cake.c
> +++ b/net/sched/sch_cake.c
> @@ -71,6 +71,12 @@
>  #include <net/tcp.h>
>  #include <net/flow_dissector.h>
>  
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +#include <net/netfilter/nf_conntrack_core.h>
> +#include <net/netfilter/nf_conntrack_zones.h>
> +#include <net/netfilter/nf_conntrack.h>
> +#endif
> +
>  #define CAKE_SET_WAYS (8)
>  #define CAKE_MAX_TINS (8)
>  #define CAKE_QUEUES (1024)
> @@ -516,6 +522,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
>  	return drop;
>  }
>  
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +
> +static void cake_update_flowkeys(struct flow_keys *keys,
> +				 const struct sk_buff *skb)
> +{
> +	const struct nf_conntrack_tuple *tuple;
> +	enum ip_conntrack_info ctinfo;
> +	struct nf_conn *ct;
> +	bool rev = false;
> +
> +	if (tc_skb_protocol(skb) != htons(ETH_P_IP))
> +		return;
> +
> +	ct = nf_ct_get(skb, &ctinfo);
> +	if (ct) {
> +		tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
> +	} else {
> +		const struct nf_conntrack_tuple_hash *hash;
> +		struct nf_conntrack_tuple srctuple;
> +
> +		if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
> +				       NFPROTO_IPV4, dev_net(skb->dev),
> +				       &srctuple))
> +			return;
> +
> +		hash = nf_conntrack_find_get(dev_net(skb->dev),
> +					     &nf_ct_zone_dflt,
> +					     &srctuple);
> +		if (!hash)
> +			return;
> +
> +		rev = true;
> +		ct = nf_ct_tuplehash_to_ctrack(hash);
> +		tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
> +	}
> +
> +	keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
> +	keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
> +
> +	if (keys->ports.ports) {
> +		keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
> +		keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
> +	}
> +	if (rev)
> +		nf_ct_put(ct);
> +}
> +#else
> +static void cake_update_flowkeys(struct flow_keys *keys,
> +				 const struct sk_buff *skb)
> +{
> +	/* There is nothing we can do here without CONNTRACK */
> +}
> +#endif
> +
>  /* Cake has several subtle multiple bit settings. In these cases you
>   *  would be matching triple isolate mode as well.
>   */
> @@ -543,6 +603,9 @@ static u32 cake_hash(struct cake_tin_data *q, const struct sk_buff *skb,
>  	skb_flow_dissect_flow_keys(skb, &keys,
>  				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
>  
> +	if (flow_mode & CAKE_FLOW_NAT_FLAG)
> +		cake_update_flowkeys(&keys, skb);
> +
>  	/* flow_hash_from_keys() sorts the addresses by value, so we have
>  	 * to preserve their order in a separate data structure to treat
>  	 * src and dst host addresses as independently selectable.
> @@ -1894,6 +1957,18 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
>  	if (err < 0)
>  		return err;
>  
> +	if (tb[TCA_CAKE_NAT]) {
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +		q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
> +		q->flow_mode |= CAKE_FLOW_NAT_FLAG *
> +			!!nla_get_u32(tb[TCA_CAKE_NAT]);
> +#else
> +		NL_SET_ERR_MSG_ATTR(extack, "No conntrack support in kernel",
> +				    tb[TCA_CAKE_NAT]);
> +		return -EOPNOTSUPP;
> +#endif
> +	}
> +
>  	if (tb[TCA_CAKE_BASE_RATE64])
>  		q->rate_bps = nla_get_u64(tb[TCA_CAKE_BASE_RATE64]);
>  
> @@ -2066,6 +2141,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
>  	if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
>  		goto nla_put_failure;
>  
> +	if (nla_put_u32(skb, TCA_CAKE_NAT,
> +			!!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
> +		goto nla_put_failure;
> +
>  	return nla_nest_end(skb, opts);
>  
>  nla_put_failure:
> 
> _______________________________________________
> Cake mailing list
> Cake@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cake

^ permalink raw reply

* [PATCH net] netfilter: provide correct argument to nla_strlcpy()
From: Eric Dumazet @ 2018-05-21 23:33 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Eric Dumazet, Florian Westphal,
	Pablo Neira Ayuso

Recent patch forgot to remove nla_data(), upsetting syzkaller a bit.

BUG: KASAN: slab-out-of-bounds in nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
Read of size 1 at addr ffff8801ad1f4fdd by task syz-executor189/4509

CPU: 1 PID: 4509 Comm: syz-executor189 Not tainted 4.17.0-rc6+ #62
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
 nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
 nfnl_acct_new+0x574/0xc50 net/netfilter/nfnetlink_acct.c:118
 nfnetlink_rcv_msg+0xdb5/0xff0 net/netfilter/nfnetlink.c:212
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
 nfnetlink_rcv+0x1fe/0x1ba0 net/netfilter/nfnetlink.c:513
 netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
 netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
 netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:639
 sock_write_iter+0x35a/0x5a0 net/socket.c:908
 call_write_iter include/linux/fs.h:1784 [inline]
 new_sync_write fs/read_write.c:474 [inline]
 __vfs_write+0x64d/0x960 fs/read_write.c:487
 vfs_write+0x1f8/0x560 fs/read_write.c:549
 ksys_write+0xf9/0x250 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607

Fixes: 4e09fc873d92 ("netfilter: prefer nla_strlcpy for dealing with NLA_STRING attributes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
 net/netfilter/nfnetlink_acct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink_acct.c b/net/netfilter/nfnetlink_acct.c
index 6ddf89183e7b47e6c029b28cf5b524c73a790498..a0e5adf0b3b6ddbcd01f956d25dda01c611a7663 100644
--- a/net/netfilter/nfnetlink_acct.c
+++ b/net/netfilter/nfnetlink_acct.c
@@ -115,7 +115,7 @@ static int nfnl_acct_new(struct net *net, struct sock *nfnl,
 		nfacct->flags = flags;
 	}
 
-	nla_strlcpy(nfacct->name, nla_data(tb[NFACCT_NAME]), NFACCT_NAME_MAX);
+	nla_strlcpy(nfacct->name, tb[NFACCT_NAME], NFACCT_NAME_MAX);
 
 	if (tb[NFACCT_BYTES]) {
 		atomic64_set(&nfacct->bytes,
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* Re: [PATCH bpf-next 1/5] bpf: Hooks for sys_sendmsg
From: kbuild test robot @ 2018-05-21 23:33 UTC (permalink / raw)
  To: Andrey Ignatov
  Cc: kbuild-all, netdev, Andrey Ignatov, davem, ast, daniel,
	kernel-team
In-Reply-To: <654b59700a21ed2342b12f30a583ba9329ef850b.1526694154.git.rdna@fb.com>

[-- Attachment #1: Type: text/plain, Size: 11021 bytes --]

Hi Andrey,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Andrey-Ignatov/bpf-Hooks-for-sys_sendmsg/20180522-065614
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-randconfig-x003-201820 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   net/ipv4/udp.c: In function 'udp_sendmsg':
>> net/ipv4/udp.c:1014:44: error: macro "BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK" passed 3 arguments, but takes just 2
             (struct sockaddr *)usin, &ipc.addr);
                                               ^
>> net/ipv4/udp.c:1013:9: error: 'BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK' undeclared (first use in this function); did you mean 'BPF_CGROUP_UDP4_SENDMSG'?
      err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            BPF_CGROUP_UDP4_SENDMSG
   net/ipv4/udp.c:1013:9: note: each undeclared identifier is reported only once for each function it appears in
--
   net/ipv6/udp.c: In function 'udpv6_sendmsg':
>> net/ipv6/udp.c:1320:44: error: macro "BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK" passed 3 arguments, but takes just 2
            (struct sockaddr *)sin6, &fl6.saddr);
                                               ^
>> net/ipv6/udp.c:1319:9: error: 'BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK' undeclared (first use in this function); did you mean 'BPF_CGROUP_UDP6_SENDMSG'?
      err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            BPF_CGROUP_UDP6_SENDMSG
   net/ipv6/udp.c:1319:9: note: each undeclared identifier is reported only once for each function it appears in

vim +/BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK +1014 net/ipv4/udp.c

   898	
   899	int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
   900	{
   901		struct inet_sock *inet = inet_sk(sk);
   902		struct udp_sock *up = udp_sk(sk);
   903		DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
   904		struct flowi4 fl4_stack;
   905		struct flowi4 *fl4;
   906		int ulen = len;
   907		struct ipcm_cookie ipc;
   908		struct rtable *rt = NULL;
   909		int free = 0;
   910		int connected = 0;
   911		__be32 daddr, faddr, saddr;
   912		__be16 dport;
   913		u8  tos;
   914		int err, is_udplite = IS_UDPLITE(sk);
   915		int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
   916		int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
   917		struct sk_buff *skb;
   918		struct ip_options_data opt_copy;
   919	
   920		if (len > 0xFFFF)
   921			return -EMSGSIZE;
   922	
   923		/*
   924		 *	Check the flags.
   925		 */
   926	
   927		if (msg->msg_flags & MSG_OOB) /* Mirror BSD error message compatibility */
   928			return -EOPNOTSUPP;
   929	
   930		ipc.opt = NULL;
   931		ipc.tx_flags = 0;
   932		ipc.ttl = 0;
   933		ipc.tos = -1;
   934	
   935		getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
   936	
   937		fl4 = &inet->cork.fl.u.ip4;
   938		if (up->pending) {
   939			/*
   940			 * There are pending frames.
   941			 * The socket lock must be held while it's corked.
   942			 */
   943			lock_sock(sk);
   944			if (likely(up->pending)) {
   945				if (unlikely(up->pending != AF_INET)) {
   946					release_sock(sk);
   947					return -EINVAL;
   948				}
   949				goto do_append_data;
   950			}
   951			release_sock(sk);
   952		}
   953		ulen += sizeof(struct udphdr);
   954	
   955		/*
   956		 *	Get and verify the address.
   957		 */
   958		if (usin) {
   959			if (msg->msg_namelen < sizeof(*usin))
   960				return -EINVAL;
   961			if (usin->sin_family != AF_INET) {
   962				if (usin->sin_family != AF_UNSPEC)
   963					return -EAFNOSUPPORT;
   964			}
   965	
   966			daddr = usin->sin_addr.s_addr;
   967			dport = usin->sin_port;
   968			if (dport == 0)
   969				return -EINVAL;
   970		} else {
   971			if (sk->sk_state != TCP_ESTABLISHED)
   972				return -EDESTADDRREQ;
   973			daddr = inet->inet_daddr;
   974			dport = inet->inet_dport;
   975			/* Open fast path for connected socket.
   976			   Route will not be used, if at least one option is set.
   977			 */
   978			connected = 1;
   979		}
   980	
   981		ipc.sockc.tsflags = sk->sk_tsflags;
   982		ipc.addr = inet->inet_saddr;
   983		ipc.oif = sk->sk_bound_dev_if;
   984		ipc.gso_size = up->gso_size;
   985	
   986		if (msg->msg_controllen) {
   987			err = udp_cmsg_send(sk, msg, &ipc.gso_size);
   988			if (err > 0)
   989				err = ip_cmsg_send(sk, msg, &ipc,
   990						   sk->sk_family == AF_INET6);
   991			if (unlikely(err < 0)) {
   992				kfree(ipc.opt);
   993				return err;
   994			}
   995			if (ipc.opt)
   996				free = 1;
   997			connected = 0;
   998		}
   999		if (!ipc.opt) {
  1000			struct ip_options_rcu *inet_opt;
  1001	
  1002			rcu_read_lock();
  1003			inet_opt = rcu_dereference(inet->inet_opt);
  1004			if (inet_opt) {
  1005				memcpy(&opt_copy, inet_opt,
  1006				       sizeof(*inet_opt) + inet_opt->opt.optlen);
  1007				ipc.opt = &opt_copy.opt;
  1008			}
  1009			rcu_read_unlock();
  1010		}
  1011	
  1012		if (!connected) {
> 1013			err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> 1014						    (struct sockaddr *)usin, &ipc.addr);
  1015			if (err)
  1016				goto out_free;
  1017			if (usin) {
  1018				if (usin->sin_port == 0) {
  1019					/* BPF program set invalid port. Reject it. */
  1020					err = -EINVAL;
  1021					goto out_free;
  1022				}
  1023				daddr = usin->sin_addr.s_addr;
  1024				dport = usin->sin_port;
  1025			}
  1026		}
  1027	
  1028		saddr = ipc.addr;
  1029		ipc.addr = faddr = daddr;
  1030	
  1031		sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
  1032	
  1033		if (ipc.opt && ipc.opt->opt.srr) {
  1034			if (!daddr) {
  1035				err = -EINVAL;
  1036				goto out_free;
  1037			}
  1038			faddr = ipc.opt->opt.faddr;
  1039			connected = 0;
  1040		}
  1041		tos = get_rttos(&ipc, inet);
  1042		if (sock_flag(sk, SOCK_LOCALROUTE) ||
  1043		    (msg->msg_flags & MSG_DONTROUTE) ||
  1044		    (ipc.opt && ipc.opt->opt.is_strictroute)) {
  1045			tos |= RTO_ONLINK;
  1046			connected = 0;
  1047		}
  1048	
  1049		if (ipv4_is_multicast(daddr)) {
  1050			if (!ipc.oif)
  1051				ipc.oif = inet->mc_index;
  1052			if (!saddr)
  1053				saddr = inet->mc_addr;
  1054			connected = 0;
  1055		} else if (!ipc.oif) {
  1056			ipc.oif = inet->uc_index;
  1057		} else if (ipv4_is_lbcast(daddr) && inet->uc_index) {
  1058			/* oif is set, packet is to local broadcast and
  1059			 * and uc_index is set. oif is most likely set
  1060			 * by sk_bound_dev_if. If uc_index != oif check if the
  1061			 * oif is an L3 master and uc_index is an L3 slave.
  1062			 * If so, we want to allow the send using the uc_index.
  1063			 */
  1064			if (ipc.oif != inet->uc_index &&
  1065			    ipc.oif == l3mdev_master_ifindex_by_index(sock_net(sk),
  1066								      inet->uc_index)) {
  1067				ipc.oif = inet->uc_index;
  1068			}
  1069		}
  1070	
  1071		if (connected)
  1072			rt = (struct rtable *)sk_dst_check(sk, 0);
  1073	
  1074		if (!rt) {
  1075			struct net *net = sock_net(sk);
  1076			__u8 flow_flags = inet_sk_flowi_flags(sk);
  1077	
  1078			fl4 = &fl4_stack;
  1079	
  1080			flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
  1081					   RT_SCOPE_UNIVERSE, sk->sk_protocol,
  1082					   flow_flags,
  1083					   faddr, saddr, dport, inet->inet_sport,
  1084					   sk->sk_uid);
  1085	
  1086			security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
  1087			rt = ip_route_output_flow(net, fl4, sk);
  1088			if (IS_ERR(rt)) {
  1089				err = PTR_ERR(rt);
  1090				rt = NULL;
  1091				if (err == -ENETUNREACH)
  1092					IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
  1093				goto out;
  1094			}
  1095	
  1096			err = -EACCES;
  1097			if ((rt->rt_flags & RTCF_BROADCAST) &&
  1098			    !sock_flag(sk, SOCK_BROADCAST))
  1099				goto out;
  1100			if (connected)
  1101				sk_dst_set(sk, dst_clone(&rt->dst));
  1102		}
  1103	
  1104		if (msg->msg_flags&MSG_CONFIRM)
  1105			goto do_confirm;
  1106	back_from_confirm:
  1107	
  1108		saddr = fl4->saddr;
  1109		if (!ipc.addr)
  1110			daddr = ipc.addr = fl4->daddr;
  1111	
  1112		/* Lockless fast path for the non-corking case. */
  1113		if (!corkreq) {
  1114			struct inet_cork cork;
  1115	
  1116			skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
  1117					  sizeof(struct udphdr), &ipc, &rt,
  1118					  &cork, msg->msg_flags);
  1119			err = PTR_ERR(skb);
  1120			if (!IS_ERR_OR_NULL(skb))
  1121				err = udp_send_skb(skb, fl4, &cork);
  1122			goto out;
  1123		}
  1124	
  1125		lock_sock(sk);
  1126		if (unlikely(up->pending)) {
  1127			/* The socket is already corked while preparing it. */
  1128			/* ... which is an evident application bug. --ANK */
  1129			release_sock(sk);
  1130	
  1131			net_dbg_ratelimited("socket already corked\n");
  1132			err = -EINVAL;
  1133			goto out;
  1134		}
  1135		/*
  1136		 *	Now cork the socket to pend data.
  1137		 */
  1138		fl4 = &inet->cork.fl.u.ip4;
  1139		fl4->daddr = daddr;
  1140		fl4->saddr = saddr;
  1141		fl4->fl4_dport = dport;
  1142		fl4->fl4_sport = inet->inet_sport;
  1143		up->pending = AF_INET;
  1144	
  1145	do_append_data:
  1146		up->len += ulen;
  1147		err = ip_append_data(sk, fl4, getfrag, msg, ulen,
  1148				     sizeof(struct udphdr), &ipc, &rt,
  1149				     corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
  1150		if (err)
  1151			udp_flush_pending_frames(sk);
  1152		else if (!corkreq)
  1153			err = udp_push_pending_frames(sk);
  1154		else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
  1155			up->pending = 0;
  1156		release_sock(sk);
  1157	
  1158	out:
  1159		ip_rt_put(rt);
  1160	out_free:
  1161		if (free)
  1162			kfree(ipc.opt);
  1163		if (!err)
  1164			return len;
  1165		/*
  1166		 * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space.  Reporting
  1167		 * ENOBUFS might not be good (it's not tunable per se), but otherwise
  1168		 * we don't have a good statistic (IpOutDiscards but it can be too many
  1169		 * things).  We could add another new stat but at least for now that
  1170		 * seems like overkill.
  1171		 */
  1172		if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
  1173			UDP_INC_STATS(sock_net(sk),
  1174				      UDP_MIB_SNDBUFERRORS, is_udplite);
  1175		}
  1176		return err;
  1177	
  1178	do_confirm:
  1179		if (msg->msg_flags & MSG_PROBE)
  1180			dst_confirm_neigh(&rt->dst, &fl4->daddr);
  1181		if (!(msg->msg_flags&MSG_PROBE) || len)
  1182			goto back_from_confirm;
  1183		err = 0;
  1184		goto out;
  1185	}
  1186	EXPORT_SYMBOL(udp_sendmsg);
  1187	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27551 bytes --]

^ permalink raw reply

* Re: [PATCH bpf-next 1/5] bpf: Hooks for sys_sendmsg
From: Martin KaFai Lau @ 2018-05-21 23:16 UTC (permalink / raw)
  To: Andrey Ignatov; +Cc: netdev, davem, ast, daniel, kernel-team
In-Reply-To: <654b59700a21ed2342b12f30a583ba9329ef850b.1526694154.git.rdna@fb.com>

On Fri, May 18, 2018 at 07:21:09PM -0700, Andrey Ignatov wrote:
> In addition to already existing BPF hooks for sys_bind and sys_connect,
> the patch provides new hooks for sys_sendmsg.
> 
> It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
> that provides access to socket itlself (properties like family, type,
> protocol) and user-passed `struct sockaddr *` so that BPF program can
> override destination IP and port for system calls such as sendto(2) or
> sendmsg(2) and/or assign source IP to the socket.
> 
> The hooks are implemented as two new attach types:
> `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
> UDPv6 correspondingly.
> 
> UDPv4 and UDPv6 separate attach types for same reason as sys_bind and
> sys_connect hooks, i.e. to prevent reading from / writing to e.g.
> user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound.
> 
> The difference with already existing hooks is sys_sendmsg are
> implemented only for unconnected UDP.
> 
> For TCP it doesn't make sense to change user-provided `struct sockaddr *`
> at sendto(2)/sendmsg(2) time since socket either was already connected
> and has source/destination set or wasn't connected and call to
> sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
> 
> Connected UDP is already handled by sys_connect hooks that can override
> source/destination at connect time and use fast-path later, i.e. these
> hooks don't affect UDP fast-path.
> 
> Rewriting source IP is implemented differently than that in sys_connect
> hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
> just bind socket to desired local IP address since source IP can be set
> on per-packet basis by using ancillary data (cmsg(3)). So no matter if
> socket is bound or not, source IP has to be rewritten on every call to
> sys_sendmsg.
> 
> To do so two new fields are added to UAPI `struct bpf_sock_addr`;
> * `msg_src_ip4` to set source IPv4 for UDPv4;
> * `msg_src_ip6` to set source IPv6 for UDPv6.
> 
> Signed-off-by: Andrey Ignatov <rdna@fb.com>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/bpf-cgroup.h | 23 +++++++++++++++++------
>  include/linux/filter.h     |  1 +
>  include/uapi/linux/bpf.h   |  8 ++++++++
>  kernel/bpf/cgroup.c        | 11 ++++++++++-
>  kernel/bpf/syscall.c       |  8 ++++++++
>  net/core/filter.c          | 39 +++++++++++++++++++++++++++++++++++++++
>  net/ipv4/udp.c             | 20 ++++++++++++++++++--
>  net/ipv6/udp.c             | 17 +++++++++++++++++
>  8 files changed, 118 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 30d15e6..46f01ba 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -66,7 +66,8 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
>  
>  int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
>  				      struct sockaddr *uaddr,
> -				      enum bpf_attach_type type);
> +				      enum bpf_attach_type type,
> +				      void *t_ctx);
>  
>  int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
>  				     struct bpf_sock_ops_kern *sock_ops,
> @@ -120,16 +121,18 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
>  ({									       \
>  	int __ret = 0;							       \
>  	if (cgroup_bpf_enabled)						       \
> -		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type);    \
> +		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type,     \
> +							  NULL);	       \
>  	__ret;								       \
>  })
>  
> -#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type)			       \
> +#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type, t_ctx)		       \
>  ({									       \
>  	int __ret = 0;							       \
>  	if (cgroup_bpf_enabled)	{					       \
>  		lock_sock(sk);						       \
> -		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type);    \
> +		__ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type,     \
> +							  t_ctx);	       \
>  		release_sock(sk);					       \
>  	}								       \
>  	__ret;								       \
> @@ -151,10 +154,16 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
>  	BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
>  
>  #define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr)		       \
> -	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET4_CONNECT)
> +	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET4_CONNECT, NULL)
>  
>  #define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr)		       \
> -	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
> +	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET6_CONNECT, NULL)
> +
> +#define BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk, uaddr, t_ctx)		       \
> +	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP4_SENDMSG, t_ctx)
> +
> +#define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx)		       \
> +	BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_SENDMSG, t_ctx)
>  
>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
>  ({									       \
> @@ -197,6 +206,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
>  #define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk, uaddr) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
>  #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
>  
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d358d18..d90abda 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1010,6 +1010,7 @@ struct bpf_sock_addr_kern {
>  	 * only two (src and dst) are available at convert_ctx_access time
>  	 */
>  	u64 tmp_reg;
> +	void *t_ctx;	/* Attach type specific context. */
>  };
>  
>  struct bpf_sock_ops_kern {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 97446bb..b70ad2c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -158,6 +158,8 @@ enum bpf_attach_type {
>  	BPF_CGROUP_INET6_CONNECT,
>  	BPF_CGROUP_INET4_POST_BIND,
>  	BPF_CGROUP_INET6_POST_BIND,
> +	BPF_CGROUP_UDP4_SENDMSG,
> +	BPF_CGROUP_UDP6_SENDMSG,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> @@ -2247,6 +2249,12 @@ struct bpf_sock_addr {
>  	__u32 family;		/* Allows 4-byte read, but no write */
>  	__u32 type;		/* Allows 4-byte read, but no write */
>  	__u32 protocol;		/* Allows 4-byte read, but no write */
> +	__u32 msg_src_ip4;	/* Allows 1,2,4-byte read an 4-byte write.
> +				 * Stored in network byte order.
> +				 */
> +	__u32 msg_src_ip6[4];	/* Allows 1,2,4-byte read an 4-byte write.
> +				 * Stored in network byte order.
> +				 */
>  };
>  
>  /* User bpf_sock_ops struct to access socket values and specify request ops
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 43171a0..f7c00bd 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -500,6 +500,7 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
>   * @sk: sock struct that will use sockaddr
>   * @uaddr: sockaddr struct provided by user
>   * @type: The type of program to be exectuted
> + * @t_ctx: Pointer to attach type specific context
>   *
>   * socket is expected to be of type INET or INET6.
>   *
> @@ -508,12 +509,15 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
>   */
>  int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
>  				      struct sockaddr *uaddr,
> -				      enum bpf_attach_type type)
> +				      enum bpf_attach_type type,
> +				      void *t_ctx)
>  {
>  	struct bpf_sock_addr_kern ctx = {
>  		.sk = sk,
>  		.uaddr = uaddr,
> +		.t_ctx = t_ctx,
>  	};
> +	struct sockaddr_storage unspec;
>  	struct cgroup *cgrp;
>  	int ret;
>  
> @@ -523,6 +527,11 @@ int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
>  	if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
>  		return 0;
>  
> +	if (!ctx.uaddr) {
> +		memset(&unspec, 0, sizeof(unspec));
> +		ctx.uaddr = (struct sockaddr *)&unspec;
> +	}
> +
>  	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
>  	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index bfcde94..11a5a95 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1247,6 +1247,8 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
>  		case BPF_CGROUP_INET6_BIND:
>  		case BPF_CGROUP_INET4_CONNECT:
>  		case BPF_CGROUP_INET6_CONNECT:
> +		case BPF_CGROUP_UDP4_SENDMSG:
> +		case BPF_CGROUP_UDP6_SENDMSG:
>  			return 0;
>  		default:
>  			return -EINVAL;
> @@ -1563,6 +1565,8 @@ static int bpf_prog_attach(const union bpf_attr *attr)
>  	case BPF_CGROUP_INET6_BIND:
>  	case BPF_CGROUP_INET4_CONNECT:
>  	case BPF_CGROUP_INET6_CONNECT:
> +	case BPF_CGROUP_UDP4_SENDMSG:
> +	case BPF_CGROUP_UDP6_SENDMSG:
>  		ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
>  		break;
>  	case BPF_CGROUP_SOCK_OPS:
> @@ -1633,6 +1637,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
>  	case BPF_CGROUP_INET6_BIND:
>  	case BPF_CGROUP_INET4_CONNECT:
>  	case BPF_CGROUP_INET6_CONNECT:
> +	case BPF_CGROUP_UDP4_SENDMSG:
> +	case BPF_CGROUP_UDP6_SENDMSG:
>  		ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
>  		break;
>  	case BPF_CGROUP_SOCK_OPS:
> @@ -1690,6 +1696,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
>  	case BPF_CGROUP_INET6_POST_BIND:
>  	case BPF_CGROUP_INET4_CONNECT:
>  	case BPF_CGROUP_INET6_CONNECT:
> +	case BPF_CGROUP_UDP4_SENDMSG:
> +	case BPF_CGROUP_UDP6_SENDMSG:
>  	case BPF_CGROUP_SOCK_OPS:
>  	case BPF_CGROUP_DEVICE:
>  		break;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index aec5eba..f696dc9 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5010,6 +5010,7 @@ static bool sock_addr_is_valid_access(int off, int size,
>  		switch (prog->expected_attach_type) {
>  		case BPF_CGROUP_INET4_BIND:
>  		case BPF_CGROUP_INET4_CONNECT:
> +		case BPF_CGROUP_UDP4_SENDMSG:
>  			break;
>  		default:
>  			return false;
> @@ -5019,6 +5020,24 @@ static bool sock_addr_is_valid_access(int off, int size,
>  		switch (prog->expected_attach_type) {
>  		case BPF_CGROUP_INET6_BIND:
>  		case BPF_CGROUP_INET6_CONNECT:
> +		case BPF_CGROUP_UDP6_SENDMSG:
> +			break;
> +		default:
> +			return false;
> +		}
> +		break;
> +	case bpf_ctx_range(struct bpf_sock_addr, msg_src_ip4):
> +		switch (prog->expected_attach_type) {
> +		case BPF_CGROUP_UDP4_SENDMSG:
> +			break;
> +		default:
> +			return false;
> +		}
> +		break;
> +	case bpf_ctx_range_till(struct bpf_sock_addr, msg_src_ip6[0],
> +				msg_src_ip6[3]):
> +		switch (prog->expected_attach_type) {
> +		case BPF_CGROUP_UDP6_SENDMSG:
>  			break;
>  		default:
>  			return false;
> @@ -5029,6 +5048,9 @@ static bool sock_addr_is_valid_access(int off, int size,
>  	switch (off) {
>  	case bpf_ctx_range(struct bpf_sock_addr, user_ip4):
>  	case bpf_ctx_range_till(struct bpf_sock_addr, user_ip6[0], user_ip6[3]):
> +	case bpf_ctx_range(struct bpf_sock_addr, msg_src_ip4):
> +	case bpf_ctx_range_till(struct bpf_sock_addr, msg_src_ip6[0],
> +				msg_src_ip6[3]):
>  		/* Only narrow read access allowed for now. */
>  		if (type == BPF_READ) {
>  			bpf_ctx_record_field_size(info, size_default);
> @@ -5783,6 +5805,23 @@ static u32 sock_addr_convert_ctx_access(enum bpf_access_type type,
>  		*insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg,
>  					SK_FL_PROTO_SHIFT);
>  		break;
> +
> +	case offsetof(struct bpf_sock_addr, msg_src_ip4):
> +		/* Treat t_ctx as struct in_addr for msg_src_ip4. */
> +		SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(
> +			struct bpf_sock_addr_kern, struct in_addr, t_ctx,
> +			s_addr, BPF_SIZE(si->code), 0, tmp_reg);
> +		break;
> +
> +	case bpf_ctx_range_till(struct bpf_sock_addr, msg_src_ip6[0],
> +				msg_src_ip6[3]):
> +		off = si->off;
> +		off -= offsetof(struct bpf_sock_addr, msg_src_ip6[0]);
> +		/* Treat t_ctx as struct in6_addr for msg_src_ip6. */
> +		SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(
> +			struct bpf_sock_addr_kern, struct in6_addr, t_ctx,
> +			s6_addr32[0], BPF_SIZE(si->code), off, tmp_reg);
> +		break;
>  	}
>  
>  	return insn - insn_buf;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index ff4d4ba..a1f9ba2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -900,6 +900,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  {
>  	struct inet_sock *inet = inet_sk(sk);
>  	struct udp_sock *up = udp_sk(sk);
> +	DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
>  	struct flowi4 fl4_stack;
>  	struct flowi4 *fl4;
>  	int ulen = len;
> @@ -954,8 +955,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  	/*
>  	 *	Get and verify the address.
>  	 */
> -	if (msg->msg_name) {
> -		DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
> +	if (usin) {
>  		if (msg->msg_namelen < sizeof(*usin))
>  			return -EINVAL;
>  		if (usin->sin_family != AF_INET) {
> @@ -1009,6 +1009,22 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  		rcu_read_unlock();
>  	}
>  
> +	if (!connected) {
> +		err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> +					    (struct sockaddr *)usin, &ipc.addr);
> +		if (err)
> +			goto out_free;
> +		if (usin) {
> +			if (usin->sin_port == 0) {
> +				/* BPF program set invalid port. Reject it. */
> +				err = -EINVAL;
> +				goto out_free;
> +			}
> +			daddr = usin->sin_addr.s_addr;
> +			dport = usin->sin_port;
> +		}
> +	}
> +
>  	saddr = ipc.addr;
>  	ipc.addr = faddr = daddr;
>  
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 2839c1b..6f580ea 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1315,6 +1315,22 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  		fl6.saddr = np->saddr;
>  	fl6.fl6_sport = inet->inet_sport;
>  
> +	if (!connected) {
> +		err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
> +					   (struct sockaddr *)sin6, &fl6.saddr);
> +		if (err)
> +			goto out_no_dst;
> +		if (sin6) {
> +			if (sin6->sin6_port == 0) {
> +				/* BPF program set invalid port. Reject it. */
> +				err = -EINVAL;
> +				goto out_no_dst;
> +			}
> +			fl6.fl6_dport = sin6->sin6_port;
> +			fl6.daddr = sin6->sin6_addr;
Could the bpf_prog change sin6 to a v4mapped address?

> +		}
> +	}
> +
>  	final_p = fl6_update_dst(&fl6, opt, &final);
It seems fl6_update_dst() may update fl6->daddr.
Is it fine (or inline with the expectation after running
a bpf_prog that has changed user_ip6)?

>  	if (final_p)
>  		connected = false;
> @@ -1394,6 +1410,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  
>  out:
>  	dst_release(dst);
> +out_no_dst:
A nit.  If dst is init to NULL, can a new exit label
be avoided?


>  	fl6_sock_release(flowlabel);
>  	txopt_put(opt_to_free);
>  	if (!err)
> -- 
> 2.9.5
> 

^ permalink raw reply

* Re: [PATCH v2] ath10k: transmit queued frames after waking queues
From: Rajkumar Manoharan @ 2018-05-21 23:11 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Kalle Valo, David S. Miller, ath10k, linux-wireless, netdev,
	linux-kernel, linux-wireless-owner
In-Reply-To: <20180521204359.23884-1-niklas.cassel@linaro.org>

On 2018-05-21 13:43, Niklas Cassel wrote:
> The following problem was observed when running iperf:
[...]
> 
> In order to avoid trying to flush the queue every time we free a frame,
> only do this when there are 3 or less frames pending, and while we
> actually have frames in the queue. This logic was copied from
> mt76_txq_schedule (mt76), one of few other drivers that are actually
> using wake_tx_queue.
> 
> Suggested-by: Toke Høiland-Jørgensen <toke@toke.dk>
> Signed-off-by: Niklas Cassel <niklas.cassel@linaro.org>
> ---
> Changes since V1: use READ_ONCE() to disallow the compiler
> optimizing things in undesirable ways.
> 
>  drivers/net/wireless/ath/ath10k/txrx.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
> b/drivers/net/wireless/ath/ath10k/txrx.c
> index cda164f6e9f6..264cf0bd5c00 100644
> --- a/drivers/net/wireless/ath/ath10k/txrx.c
> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
>  		wake_up(&htt->empty_tx_wq);
>  	spin_unlock_bh(&htt->tx_lock);
> 
> +	if (READ_ONCE(htt->num_pending_tx) <= 3 && !list_empty(&ar->txqs))
> +		ath10k_mac_tx_push_pending(ar);
> +
Niklas,

Sorry for the late response. ath10k_mac_tx_push_pending is already 
called
at the end of NAPI handler. Isn't that enough to process pending frames?

Earlier we observed performance issues in calling push_pending from each
tx completion. IMHO this change may introduce the same problem again.

-Rajkumar

^ permalink raw reply

* Re: [PATCH net-next] r8169: perform reset synchronously in __rtl8169_resume
From: Heiner Kallweit @ 2018-05-21 22:38 UTC (permalink / raw)
  To: Francois Romieu
  Cc: David Miller, Realtek linux nic maintainers,
	netdev@vger.kernel.org
In-Reply-To: <20180521212422.GA25067@electric-eye.fr.zoreil.com>

Am 21.05.2018 um 23:24 schrieb Francois Romieu:
> Heiner Kallweit <hkallweit1@gmail.com> :
> [...]
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index 75dfac024..1eb4f625a 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -7327,9 +7327,9 @@ static void __rtl8169_resume(struct net_device *dev)
>>  	rtl_lock_work(tp);
>>  	napi_enable(&tp->napi);
>>  	set_bit(RTL_FLAG_TASK_ENABLED, tp->wk.flags);
>> +	if (netif_running(dev))
>> +		rtl_reset_work(tp);
>>  	rtl_unlock_work(tp);
>> -
>> -	rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>>  }
>>  
>>  static int rtl8169_resume(struct device *device)
> 
> netif_running() returns true before rtl_open(): issuing rtl_reset_work()
> synchronously against uninitialized rings right at the start of rtl_open()
> won't work as expected.
> 
rtl8169_runtime_resume() has a check for tp->TxDescArray at the beginning.
Therefore it will return immediately before even calling
__rtl8169_resume(), because tp->TxDescArray is set in rtl_open() only.
So I don't think the described issue exists.

However I have to admit that I wasn't aware that netif_running() is true
already when entering rtl_open(), so thanks for the hint.

> pm_runtime_get_sync() could be issued later in rtl_open() (but I would
> not mind something different with runtime_{resume/suspend} in open/close).
> 

^ permalink raw reply

* Re: [RFC PATCH net-next 10/12] vhost_net: build xdp buff
From: Michael S. Tsirkin @ 2018-05-21 22:21 UTC (permalink / raw)
  To: Jesse Brandeburg; +Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20180521095611.00005caa@intel.com>

On Mon, May 21, 2018 at 09:56:11AM -0700, Jesse Brandeburg wrote:
> On Mon, 21 May 2018 17:04:31 +0800 Jason wrote:
> > This patch implement build XDP buffers in vhost_net. The idea is do
> > userspace copy in vhost_net and build XDP buff based on the
> > page. Vhost_net can then submit one or an array of XDP buffs to
> > underlayer socket (e.g TUN). TUN can choose to do XDP or call
> > build_skb() to build skb. To support build skb, vnet header were also
> > stored into the header of the XDP buff.
> > 
> > This userspace copy and XDP buffs building is key to achieve XDP
> > batching in TUN, since TUN does not need to care about userspace copy
> > and then can disable premmption for several XDP buffs to achieve
> > batching from XDP.
> > 
> > TODO: reserve headroom based on the TUN XDP.
> > 
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
> >  drivers/vhost/net.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 74 insertions(+)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index f0639d7..1209e84 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -492,6 +492,80 @@ static bool vhost_has_more_pkts(struct vhost_net *net,
> >  	       likely(!vhost_exceeds_maxpend(net));
> >  }
> >  
> > +#define VHOST_NET_HEADROOM 256
> > +#define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
> > +
> > +static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> > +			       struct iov_iter *from,
> > +			       struct xdp_buff *xdp)
> > +{
> > +	struct vhost_virtqueue *vq = &nvq->vq;
> > +	struct page_frag *alloc_frag = &current->task_frag;
> > +	struct virtio_net_hdr *gso;
> > +	size_t len = iov_iter_count(from);
> > +	int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +	int pad = SKB_DATA_ALIGN(VHOST_NET_RX_PAD + VHOST_NET_HEADROOM
> > +				 + nvq->sock_hlen);
> > +	int sock_hlen = nvq->sock_hlen;
> > +	void *buf;
> > +	int copied;
> > +
> > +	if (len < nvq->sock_hlen)
> > +		return -EFAULT;
> > +
> > +	if (SKB_DATA_ALIGN(len + pad) +
> > +	    SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) > PAGE_SIZE)
> > +		return -ENOSPC;
> > +
> > +	buflen += SKB_DATA_ALIGN(len + pad);
> 
> maybe store the result of SKB_DATA_ALIGN in a local instead of doing
> the work twice?

I don't mind, but I guess gcc can always do it itself?

> > +	alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
> > +	if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
> > +		return -ENOMEM;
> > +
> > +	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> > +
> > +	/* We store two kinds of metadata in the header which will be
> > +	 * used for XDP_PASS to do build_skb():
> > +	 * offset 0: buflen
> > +	 * offset sizeof(int): vnet header
> > +	 */
> > +	copied = copy_page_from_iter(alloc_frag->page,
> > +				     alloc_frag->offset + sizeof(int), sock_hlen, from);
> > +	if (copied != sock_hlen)
> > +		return -EFAULT;
> > +
> > +	gso = (struct virtio_net_hdr *)(buf + sizeof(int));
> > +
> > +	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> > +	    vhost16_to_cpu(vq, gso->csum_start) +
> > +	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> > +	    vhost16_to_cpu(vq, gso->hdr_len)) {
> > +		gso->hdr_len = cpu_to_vhost16(vq,
> > +			       vhost16_to_cpu(vq, gso->csum_start) +
> > +			       vhost16_to_cpu(vq, gso->csum_offset) + 2);
> > +
> > +		if (vhost16_to_cpu(vq, gso->hdr_len) > len)
> > +			return -EINVAL;
> > +	}
> > +
> > +	len -= sock_hlen;
> > +	copied = copy_page_from_iter(alloc_frag->page,
> > +				     alloc_frag->offset + pad,
> > +				     len, from);
> > +	if (copied != len)
> > +		return -EFAULT;
> > +
> > +	xdp->data_hard_start = buf;
> > +	xdp->data = buf + pad;
> > +	xdp->data_end = xdp->data + len;
> > +	*(int *)(xdp->data_hard_start)= buflen;
> 
> space before =
> 
> > +
> > +	get_page(alloc_frag->page);
> > +	alloc_frag->offset += buflen;
> > +
> > +	return 0;
> > +}
> > +
> >  static void handle_tx_copy(struct vhost_net *net)
> >  {
> >  	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];

^ permalink raw reply

* Reply
From: conapreb @ 2018-05-21 15:29 UTC (permalink / raw)
  To: Recipients

Are you free for discussion?

^ permalink raw reply

* [PATCH net-next 2/2] tcp: do not aggressively quick ack after ECN events
From: Eric Dumazet @ 2018-05-21 22:08 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180521220857.229273-1-edumazet@google.com>

ECN signals currently forces TCP to enter quickack mode for
up to 16 (TCP_MAX_QUICKACKS) following incoming packets.

We believe this is not needed, and only sending one immediate ack
for the current packet should be enough.

This should reduce the extra load noticed in DCTCP environments,
after congestion events.

This is part 2 of our effort to reduce pure ACK packets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2e970e9f4e09d966b703af2d14d521a4328eba7e..1191cac72109f2f7e2b688ddbc1d404151d274d6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -263,7 +263,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
 		 * it is probably a retransmit.
 		 */
 		if (tp->ecn_flags & TCP_ECN_SEEN)
-			tcp_enter_quickack_mode((struct sock *)tp, TCP_MAX_QUICKACKS);
+			tcp_enter_quickack_mode((struct sock *)tp, 1);
 		break;
 	case INET_ECN_CE:
 		if (tcp_ca_needs_ecn((struct sock *)tp))
@@ -271,7 +271,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
 
 		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
 			/* Better not delay acks, sender can have a very low cwnd */
-			tcp_enter_quickack_mode((struct sock *)tp, TCP_MAX_QUICKACKS);
+			tcp_enter_quickack_mode((struct sock *)tp, 1);
 			tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 		}
 		tp->ecn_flags |= TCP_ECN_SEEN;
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* [PATCH net-next 1/2] tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
From: Eric Dumazet @ 2018-05-21 22:08 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
	Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180521220857.229273-1-edumazet@google.com>

We want to add finer control of the number of ACK packets sent after
ECN events.

This patch is not changing current behavior, it only enables following
change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index aebb29ab2fdf2ceaa182cd11928f145a886149ff..2e970e9f4e09d966b703af2d14d521a4328eba7e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -203,21 +203,23 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 	}
 }
 
-static void tcp_incr_quickack(struct sock *sk)
+static void tcp_incr_quickack(struct sock *sk, unsigned int max_quickacks)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	unsigned int quickacks = tcp_sk(sk)->rcv_wnd / (2 * icsk->icsk_ack.rcv_mss);
 
 	if (quickacks == 0)
 		quickacks = 2;
+	quickacks = min(quickacks, max_quickacks);
 	if (quickacks > icsk->icsk_ack.quick)
-		icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
+		icsk->icsk_ack.quick = quickacks;
 }
 
-static void tcp_enter_quickack_mode(struct sock *sk)
+static void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
-	tcp_incr_quickack(sk);
+
+	tcp_incr_quickack(sk, max_quickacks);
 	icsk->icsk_ack.pingpong = 0;
 	icsk->icsk_ack.ato = TCP_ATO_MIN;
 }
@@ -261,7 +263,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
 		 * it is probably a retransmit.
 		 */
 		if (tp->ecn_flags & TCP_ECN_SEEN)
-			tcp_enter_quickack_mode((struct sock *)tp);
+			tcp_enter_quickack_mode((struct sock *)tp, TCP_MAX_QUICKACKS);
 		break;
 	case INET_ECN_CE:
 		if (tcp_ca_needs_ecn((struct sock *)tp))
@@ -269,7 +271,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
 
 		if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
 			/* Better not delay acks, sender can have a very low cwnd */
-			tcp_enter_quickack_mode((struct sock *)tp);
+			tcp_enter_quickack_mode((struct sock *)tp, TCP_MAX_QUICKACKS);
 			tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 		}
 		tp->ecn_flags |= TCP_ECN_SEEN;
@@ -686,7 +688,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 		/* The _first_ data packet received, initialize
 		 * delayed ACK engine.
 		 */
-		tcp_incr_quickack(sk);
+		tcp_incr_quickack(sk, TCP_MAX_QUICKACKS);
 		icsk->icsk_ack.ato = TCP_ATO_MIN;
 	} else {
 		int m = now - icsk->icsk_ack.lrcvtime;
@@ -702,7 +704,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 			/* Too long gap. Apparently sender failed to
 			 * restart window, so that we send ACKs quickly.
 			 */
-			tcp_incr_quickack(sk);
+			tcp_incr_quickack(sk, TCP_MAX_QUICKACKS);
 			sk_mem_reclaim(sk);
 		}
 	}
@@ -4179,7 +4181,7 @@ static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
 	if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
 	    before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
-		tcp_enter_quickack_mode(sk);
+		tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
 
 		if (tcp_is_sack(tp) && sock_net(sk)->ipv4.sysctl_tcp_dsack) {
 			u32 end_seq = TCP_SKB_CB(skb)->end_seq;
@@ -4706,7 +4708,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
 
 out_of_window:
-		tcp_enter_quickack_mode(sk);
+		tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
 		inet_csk_schedule_ack(sk);
 drop:
 		tcp_drop(sk, skb);
@@ -5790,7 +5792,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 			 * to stand against the temptation 8)     --ANK
 			 */
 			inet_csk_schedule_ack(sk);
-			tcp_enter_quickack_mode(sk);
+			tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
 						  TCP_DELACK_MAX, TCP_RTO_MAX);
 
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox