* [PATCH net-next v11 4/5] virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
From: Sridhar Samudrala @ 2018-05-22 2:06 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-1-git-send-email-sridhar.samudrala@intel.com>
This feature bit can be used by hypervisor to indicate virtio_net device to
act as a standby for another device with the same MAC address.
VIRTIO_NET_F_STANDBY is defined as bit 62 as it is a device feature bit.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
drivers/net/virtio_net.c | 2 +-
include/uapi/linux/virtio_net.h | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f34794a76c4d..213fddc70fd0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2999,7 +2999,7 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
- VIRTIO_NET_F_SPEED_DUPLEX
+ VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_STANDBY
static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..a3715a3224c1 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
* Steering */
#define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
+#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
+ * with the same MAC.
+ */
#define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
#ifndef VIRTIO_NET_NO_LEGACY
--
2.14.3
^ permalink raw reply related
* [PATCH net-next v11 3/5] net: Introduce net_failover driver
From: Sridhar Samudrala @ 2018-05-22 2:06 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-1-git-send-email-sridhar.samudrala@intel.com>
The net_failover driver provides an automated failover mechanism via APIs
to create and destroy a failover master netdev and mananges a primary and
standby slave netdevs that get registered via the generic failover
infrastructure.
The failover netdev acts a master device and controls 2 slave devices. The
original paravirtual interface gets registered as 'standby' slave netdev and
a passthru/vf device with the same MAC gets registered as 'primary' slave
netdev. Both 'standby' and 'failover' netdevs are associated with the same
'pci' device. The user accesses the network interface via 'failover' netdev.
The 'failover' netdev chooses 'primary' netdev as default for transmits when
it is available with link up and running.
This can be used by paravirtual drivers to enable an alternate low latency
datapath. It also enables hypervisor controlled live migration of a VM with
direct attached VF by failing over to the paravirtual datapath when the VF
is unplugged.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
Documentation/networking/net_failover.rst | 26 +
MAINTAINERS | 8 +
drivers/net/Kconfig | 12 +
drivers/net/Makefile | 1 +
drivers/net/net_failover.c | 836 ++++++++++++++++++++++++++++++
include/net/net_failover.h | 40 ++
6 files changed, 923 insertions(+)
create mode 100644 Documentation/networking/net_failover.rst
create mode 100644 drivers/net/net_failover.c
create mode 100644 include/net/net_failover.h
diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
new file mode 100644
index 000000000000..6764c3b1eec8
--- /dev/null
+++ b/Documentation/networking/net_failover.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+NET_FAILOVER
+============
+
+Overview
+========
+
+The net_failover driver provides an automated failover mechanism via APIs
+to create and destroy a failover master netdev and mananges a primary and
+standby slave netdevs that get registered via the generic failover
+infrastructrure.
+
+The failover netdev acts a master device and controls 2 slave devices. The
+original paravirtual interface is registered as 'standby' slave netdev and
+a passthru/vf device with the same MAC gets registered as 'primary' slave
+netdev. Both 'standby' and 'failover' netdevs are associated with the same
+'pci' device. The user accesses the network interface via 'failover' netdev.
+The 'failover' netdev chooses 'primary' netdev as default for transmits when
+it is available with link up and running.
+
+This can be used by paravirtual drivers to enable an alternate low latency
+datapath. It also enables hypervisor controlled live migration of a VM with
+direct attached VF by failing over to the paravirtual datapath when the VF
+is unplugged.
diff --git a/MAINTAINERS b/MAINTAINERS
index aed9575ddc53..ad172b1a7d57 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9642,6 +9642,14 @@ S: Maintained
F: Documentation/hwmon/nct6775
F: drivers/hwmon/nct6775.c
+NET_FAILOVER MODULE
+M: Sridhar Samudrala <sridhar.samudrala@intel.com>
+L: netdev@vger.kernel.org
+S: Supported
+F: driver/net/net_failover.c
+F: include/net/net_failover.h
+F: Documentation/networking/net_failover.rst
+
NETEFFECT IWARP RNIC DRIVER (IW_NES)
M: Faisal Latif <faisal.latif@intel.com>
L: linux-rdma@vger.kernel.org
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index a029b27fd002..1cf29c5ed510 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -510,4 +510,16 @@ config NETDEVSIM
To compile this driver as a module, choose M here: the module
will be called netdevsim.
+config NET_FAILOVER
+ tristate "Failover driver"
+ select FAILOVER
+ help
+ This provides an automated failover mechanism via APIs to create
+ and destroy a failover master netdev and manages a primary and
+ standby slave netdevs that get registered via the generic failover
+ infrastructure. This can be used by paravirtual drivers to enable
+ an alternate low latency datapath. It alsoenables live migration of
+ a VM with direct attached VF by failing over to the paravirtual
+ datapath when the VF is unplugged.
+
endif # NETDEVICES
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 91e67e375dd4..21cde7e78621 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -78,3 +78,4 @@ obj-$(CONFIG_FUJITSU_ES) += fjes/
thunderbolt-net-y += thunderbolt.o
obj-$(CONFIG_THUNDERBOLT_NET) += thunderbolt-net.o
obj-$(CONFIG_NETDEVSIM) += netdevsim/
+obj-$(CONFIG_NET_FAILOVER) += net_failover.o
diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
new file mode 100644
index 000000000000..ae47bdb363c9
--- /dev/null
+++ b/drivers/net/net_failover.c
@@ -0,0 +1,836 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* This provides a net_failover interface for paravirtual drivers to
+ * provide an alternate datapath by exporting APIs to create and
+ * destroy a upper 'net_failover' netdev. The upper dev manages the
+ * original paravirtual interface as a 'standby' netdev and uses the
+ * generic failover infrastructure to register and manage a direct
+ * attached VF as a 'primary' netdev. This enables live migration of
+ * a VM with direct attached VF by failing over to the paravirtual
+ * datapath when the VF is unplugged.
+ *
+ * Some of the netdev management routines are based on bond/team driver as
+ * this driver provides active-backup functionality similar to those drivers.
+ */
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/ethtool.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/netdevice.h>
+#include <linux/netpoll.h>
+#include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
+#include <linux/pci.h>
+#include <net/sch_generic.h>
+#include <uapi/linux/if_arp.h>
+#include <net/net_failover.h>
+
+static bool net_failover_xmit_ready(struct net_device *dev)
+{
+ return netif_running(dev) && netif_carrier_ok(dev);
+}
+
+static int net_failover_open(struct net_device *dev)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *primary_dev, *standby_dev;
+ int err;
+
+ primary_dev = rtnl_dereference(nfo_info->primary_dev);
+ if (primary_dev) {
+ err = dev_open(primary_dev);
+ if (err)
+ goto err_primary_open;
+ }
+
+ standby_dev = rtnl_dereference(nfo_info->standby_dev);
+ if (standby_dev) {
+ err = dev_open(standby_dev);
+ if (err)
+ goto err_standby_open;
+ }
+
+ if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
+ (standby_dev && net_failover_xmit_ready(standby_dev))) {
+ netif_carrier_on(dev);
+ netif_tx_wake_all_queues(dev);
+ }
+
+ return 0;
+
+err_standby_open:
+ dev_close(primary_dev);
+err_primary_open:
+ netif_tx_disable(dev);
+ return err;
+}
+
+static int net_failover_close(struct net_device *dev)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *slave_dev;
+
+ netif_tx_disable(dev);
+
+ slave_dev = rtnl_dereference(nfo_info->primary_dev);
+ if (slave_dev)
+ dev_close(slave_dev);
+
+ slave_dev = rtnl_dereference(nfo_info->standby_dev);
+ if (slave_dev)
+ dev_close(slave_dev);
+
+ return 0;
+}
+
+static netdev_tx_t net_failover_drop_xmit(struct sk_buff *skb,
+ struct net_device *dev)
+{
+ atomic_long_inc(&dev->tx_dropped);
+ dev_kfree_skb_any(skb);
+ return NETDEV_TX_OK;
+}
+
+static netdev_tx_t net_failover_start_xmit(struct sk_buff *skb,
+ struct net_device *dev)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *xmit_dev;
+
+ /* Try xmit via primary netdev followed by standby netdev */
+ xmit_dev = rcu_dereference_bh(nfo_info->primary_dev);
+ if (!xmit_dev || !net_failover_xmit_ready(xmit_dev)) {
+ xmit_dev = rcu_dereference_bh(nfo_info->standby_dev);
+ if (!xmit_dev || !net_failover_xmit_ready(xmit_dev))
+ return net_failover_drop_xmit(skb, dev);
+ }
+
+ skb->dev = xmit_dev;
+ skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
+
+ return dev_queue_xmit(skb);
+}
+
+static u16 net_failover_select_queue(struct net_device *dev,
+ struct sk_buff *skb, void *accel_priv,
+ select_queue_fallback_t fallback)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *primary_dev;
+ u16 txq;
+
+ primary_dev = rcu_dereference(nfo_info->primary_dev);
+ if (primary_dev) {
+ const struct net_device_ops *ops = primary_dev->netdev_ops;
+
+ if (ops->ndo_select_queue)
+ txq = ops->ndo_select_queue(primary_dev, skb,
+ accel_priv, fallback);
+ else
+ txq = fallback(primary_dev, skb);
+
+ qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+ return txq;
+ }
+
+ txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+ /* Save the original txq to restore before passing to the driver */
+ qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping;
+
+ if (unlikely(txq >= dev->real_num_tx_queues)) {
+ do {
+ txq -= dev->real_num_tx_queues;
+ } while (txq >= dev->real_num_tx_queues);
+ }
+
+ return txq;
+}
+
+/* fold stats, assuming all rtnl_link_stats64 fields are u64, but
+ * that some drivers can provide 32bit values only.
+ */
+static void net_failover_fold_stats(struct rtnl_link_stats64 *_res,
+ const struct rtnl_link_stats64 *_new,
+ const struct rtnl_link_stats64 *_old)
+{
+ const u64 *new = (const u64 *)_new;
+ const u64 *old = (const u64 *)_old;
+ u64 *res = (u64 *)_res;
+ int i;
+
+ for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) {
+ u64 nv = new[i];
+ u64 ov = old[i];
+ s64 delta = nv - ov;
+
+ /* detects if this particular field is 32bit only */
+ if (((nv | ov) >> 32) == 0)
+ delta = (s64)(s32)((u32)nv - (u32)ov);
+
+ /* filter anomalies, some drivers reset their stats
+ * at down/up events.
+ */
+ if (delta > 0)
+ res[i] += delta;
+ }
+}
+
+static void net_failover_get_stats(struct net_device *dev,
+ struct rtnl_link_stats64 *stats)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ const struct rtnl_link_stats64 *new;
+ struct rtnl_link_stats64 temp;
+ struct net_device *slave_dev;
+
+ spin_lock(&nfo_info->stats_lock);
+ memcpy(stats, &nfo_info->failover_stats, sizeof(*stats));
+
+ rcu_read_lock();
+
+ slave_dev = rcu_dereference(nfo_info->primary_dev);
+ if (slave_dev) {
+ new = dev_get_stats(slave_dev, &temp);
+ net_failover_fold_stats(stats, new, &nfo_info->primary_stats);
+ memcpy(&nfo_info->primary_stats, new, sizeof(*new));
+ }
+
+ slave_dev = rcu_dereference(nfo_info->standby_dev);
+ if (slave_dev) {
+ new = dev_get_stats(slave_dev, &temp);
+ net_failover_fold_stats(stats, new, &nfo_info->standby_stats);
+ memcpy(&nfo_info->standby_stats, new, sizeof(*new));
+ }
+
+ rcu_read_unlock();
+
+ memcpy(&nfo_info->failover_stats, stats, sizeof(*stats));
+ spin_unlock(&nfo_info->stats_lock);
+}
+
+static int net_failover_change_mtu(struct net_device *dev, int new_mtu)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *primary_dev, *standby_dev;
+ int ret = 0;
+
+ primary_dev = rcu_dereference(nfo_info->primary_dev);
+ if (primary_dev) {
+ ret = dev_set_mtu(primary_dev, new_mtu);
+ if (ret)
+ return ret;
+ }
+
+ standby_dev = rcu_dereference(nfo_info->standby_dev);
+ if (standby_dev) {
+ ret = dev_set_mtu(standby_dev, new_mtu);
+ if (ret) {
+ if (primary_dev)
+ dev_set_mtu(primary_dev, dev->mtu);
+ return ret;
+ }
+ }
+
+ dev->mtu = new_mtu;
+
+ return 0;
+}
+
+static void net_failover_set_rx_mode(struct net_device *dev)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *slave_dev;
+
+ rcu_read_lock();
+
+ slave_dev = rcu_dereference(nfo_info->primary_dev);
+ if (slave_dev) {
+ dev_uc_sync_multiple(slave_dev, dev);
+ dev_mc_sync_multiple(slave_dev, dev);
+ }
+
+ slave_dev = rcu_dereference(nfo_info->standby_dev);
+ if (slave_dev) {
+ dev_uc_sync_multiple(slave_dev, dev);
+ dev_mc_sync_multiple(slave_dev, dev);
+ }
+
+ rcu_read_unlock();
+}
+
+static int net_failover_vlan_rx_add_vid(struct net_device *dev, __be16 proto,
+ u16 vid)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *primary_dev, *standby_dev;
+ int ret = 0;
+
+ primary_dev = rcu_dereference(nfo_info->primary_dev);
+ if (primary_dev) {
+ ret = vlan_vid_add(primary_dev, proto, vid);
+ if (ret)
+ return ret;
+ }
+
+ standby_dev = rcu_dereference(nfo_info->standby_dev);
+ if (standby_dev) {
+ ret = vlan_vid_add(standby_dev, proto, vid);
+ if (ret)
+ if (primary_dev)
+ vlan_vid_del(primary_dev, proto, vid);
+ }
+
+ return ret;
+}
+
+static int net_failover_vlan_rx_kill_vid(struct net_device *dev, __be16 proto,
+ u16 vid)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *slave_dev;
+
+ slave_dev = rcu_dereference(nfo_info->primary_dev);
+ if (slave_dev)
+ vlan_vid_del(slave_dev, proto, vid);
+
+ slave_dev = rcu_dereference(nfo_info->standby_dev);
+ if (slave_dev)
+ vlan_vid_del(slave_dev, proto, vid);
+
+ return 0;
+}
+
+static const struct net_device_ops failover_dev_ops = {
+ .ndo_open = net_failover_open,
+ .ndo_stop = net_failover_close,
+ .ndo_start_xmit = net_failover_start_xmit,
+ .ndo_select_queue = net_failover_select_queue,
+ .ndo_get_stats64 = net_failover_get_stats,
+ .ndo_change_mtu = net_failover_change_mtu,
+ .ndo_set_rx_mode = net_failover_set_rx_mode,
+ .ndo_vlan_rx_add_vid = net_failover_vlan_rx_add_vid,
+ .ndo_vlan_rx_kill_vid = net_failover_vlan_rx_kill_vid,
+ .ndo_validate_addr = eth_validate_addr,
+ .ndo_features_check = passthru_features_check,
+};
+
+#define FAILOVER_NAME "net_failover"
+#define FAILOVER_VERSION "0.1"
+
+static void nfo_ethtool_get_drvinfo(struct net_device *dev,
+ struct ethtool_drvinfo *drvinfo)
+{
+ strlcpy(drvinfo->driver, FAILOVER_NAME, sizeof(drvinfo->driver));
+ strlcpy(drvinfo->version, FAILOVER_VERSION, sizeof(drvinfo->version));
+}
+
+static int nfo_ethtool_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
+{
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *slave_dev;
+
+ slave_dev = rtnl_dereference(nfo_info->primary_dev);
+ if (!slave_dev || !net_failover_xmit_ready(slave_dev)) {
+ slave_dev = rtnl_dereference(nfo_info->standby_dev);
+ if (!slave_dev || !net_failover_xmit_ready(slave_dev)) {
+ cmd->base.duplex = DUPLEX_UNKNOWN;
+ cmd->base.port = PORT_OTHER;
+ cmd->base.speed = SPEED_UNKNOWN;
+
+ return 0;
+ }
+ }
+
+ return __ethtool_get_link_ksettings(slave_dev, cmd);
+}
+
+static const struct ethtool_ops failover_ethtool_ops = {
+ .get_drvinfo = nfo_ethtool_get_drvinfo,
+ .get_link = ethtool_op_get_link,
+ .get_link_ksettings = nfo_ethtool_get_link_ksettings,
+};
+
+/* Called when slave dev is injecting data into network stack.
+ * Change the associated network device from lower dev to failover dev.
+ * note: already called with rcu_read_lock
+ */
+static rx_handler_result_t net_failover_handle_frame(struct sk_buff **pskb)
+{
+ struct sk_buff *skb = *pskb;
+ struct net_device *dev = rcu_dereference(skb->dev->rx_handler_data);
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *primary_dev, *standby_dev;
+
+ primary_dev = rcu_dereference(nfo_info->primary_dev);
+ standby_dev = rcu_dereference(nfo_info->standby_dev);
+
+ if (primary_dev && skb->dev == standby_dev)
+ return RX_HANDLER_EXACT;
+
+ skb->dev = dev;
+
+ return RX_HANDLER_ANOTHER;
+}
+
+static void net_failover_compute_features(struct net_device *dev)
+{
+ u32 vlan_features = FAILOVER_VLAN_FEATURES & NETIF_F_ALL_FOR_ALL;
+ netdev_features_t enc_features = FAILOVER_ENC_FEATURES;
+ unsigned short max_hard_header_len = ETH_HLEN;
+ unsigned int dst_release_flag = IFF_XMIT_DST_RELEASE |
+ IFF_XMIT_DST_RELEASE_PERM;
+ struct net_failover_info *nfo_info = netdev_priv(dev);
+ struct net_device *primary_dev, *standby_dev;
+
+ primary_dev = rcu_dereference(nfo_info->primary_dev);
+ if (primary_dev) {
+ vlan_features =
+ netdev_increment_features(vlan_features,
+ primary_dev->vlan_features,
+ FAILOVER_VLAN_FEATURES);
+ enc_features =
+ netdev_increment_features(enc_features,
+ primary_dev->hw_enc_features,
+ FAILOVER_ENC_FEATURES);
+
+ dst_release_flag &= primary_dev->priv_flags;
+ if (primary_dev->hard_header_len > max_hard_header_len)
+ max_hard_header_len = primary_dev->hard_header_len;
+ }
+
+ standby_dev = rcu_dereference(nfo_info->standby_dev);
+ if (standby_dev) {
+ vlan_features =
+ netdev_increment_features(vlan_features,
+ standby_dev->vlan_features,
+ FAILOVER_VLAN_FEATURES);
+ enc_features =
+ netdev_increment_features(enc_features,
+ standby_dev->hw_enc_features,
+ FAILOVER_ENC_FEATURES);
+
+ dst_release_flag &= standby_dev->priv_flags;
+ if (standby_dev->hard_header_len > max_hard_header_len)
+ max_hard_header_len = standby_dev->hard_header_len;
+ }
+
+ dev->vlan_features = vlan_features;
+ dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
+ dev->hard_header_len = max_hard_header_len;
+
+ dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
+ if (dst_release_flag == (IFF_XMIT_DST_RELEASE |
+ IFF_XMIT_DST_RELEASE_PERM))
+ dev->priv_flags |= IFF_XMIT_DST_RELEASE;
+
+ netdev_change_features(dev);
+}
+
+static void net_failover_lower_state_changed(struct net_device *slave_dev,
+ struct net_device *primary_dev,
+ struct net_device *standby_dev)
+{
+ struct netdev_lag_lower_state_info info;
+
+ if (netif_carrier_ok(slave_dev))
+ info.link_up = true;
+ else
+ info.link_up = false;
+
+ if (slave_dev == primary_dev) {
+ if (netif_running(primary_dev))
+ info.tx_enabled = true;
+ else
+ info.tx_enabled = false;
+ } else {
+ if ((primary_dev && netif_running(primary_dev)) ||
+ (!netif_running(standby_dev)))
+ info.tx_enabled = false;
+ else
+ info.tx_enabled = true;
+ }
+
+ netdev_lower_state_changed(slave_dev, &info);
+}
+
+static int net_failover_slave_register(struct net_device *slave_dev,
+ struct net_device *failover_dev)
+{
+ struct net_device *standby_dev, *primary_dev;
+ struct netdev_lag_upper_info lag_upper_info;
+ struct net_failover_info *nfo_info;
+ bool slave_is_standby;
+ u32 orig_mtu;
+ int err;
+
+ nfo_info = netdev_priv(failover_dev);
+ standby_dev = rtnl_dereference(nfo_info->standby_dev);
+ primary_dev = rtnl_dereference(nfo_info->primary_dev);
+ slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
+ if (slave_is_standby ? standby_dev : primary_dev) {
+ netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
+ slave_dev->name,
+ slave_is_standby ? "standby" : "primary");
+ goto done;
+ }
+
+ /* We want to allow only a direct attached VF device as a primary
+ * netdev. As there is no easy way to check for a VF device, restrict
+ * this to a pci device.
+ */
+ if (!slave_is_standby && (!slave_dev->dev.parent ||
+ !dev_is_pci(slave_dev->dev.parent)))
+ goto done;
+
+ if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
+ vlan_uses_dev(failover_dev)) {
+ netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
+ failover_dev->name);
+ goto done;
+ }
+
+ /* Align MTU of slave with failover dev */
+ orig_mtu = slave_dev->mtu;
+ err = dev_set_mtu(slave_dev, failover_dev->mtu);
+ if (err) {
+ netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
+ slave_dev->name, failover_dev->mtu);
+ goto done;
+ }
+
+ dev_hold(slave_dev);
+
+ if (netif_running(failover_dev)) {
+ err = dev_open(slave_dev);
+ if (err && (err != -EBUSY)) {
+ netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
+ slave_dev->name, err);
+ goto err_dev_open;
+ }
+ }
+
+ netif_addr_lock_bh(failover_dev);
+ dev_uc_sync_multiple(slave_dev, failover_dev);
+ dev_uc_sync_multiple(slave_dev, failover_dev);
+ netif_addr_unlock_bh(failover_dev);
+
+ err = vlan_vids_add_by_dev(slave_dev, failover_dev);
+ if (err) {
+ netdev_err(failover_dev, "Failed to add vlan ids to device %s err:%d\n",
+ slave_dev->name, err);
+ goto err_vlan_add;
+ }
+
+ err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
+ failover_dev);
+ if (err) {
+ netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
+ err);
+ goto err_handler_register;
+ }
+
+ lag_upper_info.tx_type = NETDEV_LAG_TX_TYPE_ACTIVEBACKUP;
+ err = netdev_master_upper_dev_link(slave_dev, failover_dev, NULL,
+ &lag_upper_info, NULL);
+ if (err) {
+ netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
+ failover_dev->name, err);
+ goto err_upper_link;
+ }
+
+ slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
+
+ if (slave_is_standby) {
+ rcu_assign_pointer(nfo_info->standby_dev, slave_dev);
+ standby_dev = slave_dev;
+ dev_get_stats(standby_dev, &nfo_info->standby_stats);
+ } else {
+ rcu_assign_pointer(nfo_info->primary_dev, slave_dev);
+ primary_dev = slave_dev;
+ dev_get_stats(primary_dev, &nfo_info->primary_stats);
+ failover_dev->min_mtu = slave_dev->min_mtu;
+ failover_dev->max_mtu = slave_dev->max_mtu;
+ }
+
+ net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev);
+ net_failover_compute_features(failover_dev);
+
+ call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
+
+ netdev_info(failover_dev, "failover %s slave:%s registered\n",
+ slave_is_standby ? "standby" : "primary", slave_dev->name);
+
+ goto done;
+
+err_upper_link:
+ netdev_rx_handler_unregister(slave_dev);
+err_handler_register:
+ vlan_vids_del_by_dev(slave_dev, failover_dev);
+err_vlan_add:
+ dev_uc_unsync(slave_dev, failover_dev);
+ dev_mc_unsync(slave_dev, failover_dev);
+ dev_close(slave_dev);
+err_dev_open:
+ dev_put(slave_dev);
+ dev_set_mtu(slave_dev, orig_mtu);
+done:
+ return NOTIFY_DONE;
+}
+
+static int net_failover_slave_unregister(struct net_device *slave_dev,
+ struct net_device *failover_dev)
+{
+ struct net_device *standby_dev, *primary_dev;
+ struct net_failover_info *nfo_info;
+ bool slave_is_standby;
+
+ nfo_info = netdev_priv(failover_dev);
+ primary_dev = rtnl_dereference(nfo_info->primary_dev);
+ standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+ if (slave_dev != primary_dev && slave_dev != standby_dev)
+ goto done;
+
+ slave_is_standby = slave_dev->dev.parent == failover_dev->dev.parent;
+
+ netdev_rx_handler_unregister(slave_dev);
+ netdev_upper_dev_unlink(slave_dev, failover_dev);
+ vlan_vids_del_by_dev(slave_dev, failover_dev);
+ dev_uc_unsync(slave_dev, failover_dev);
+ dev_mc_unsync(slave_dev, failover_dev);
+ dev_close(slave_dev);
+ slave_dev->priv_flags &= ~IFF_FAILOVER_SLAVE;
+
+ nfo_info = netdev_priv(failover_dev);
+ dev_get_stats(failover_dev, &nfo_info->failover_stats);
+
+ if (slave_is_standby) {
+ RCU_INIT_POINTER(nfo_info->standby_dev, NULL);
+ } else {
+ RCU_INIT_POINTER(nfo_info->primary_dev, NULL);
+ if (standby_dev) {
+ failover_dev->min_mtu = standby_dev->min_mtu;
+ failover_dev->max_mtu = standby_dev->max_mtu;
+ }
+ }
+
+ dev_put(slave_dev);
+
+ net_failover_compute_features(failover_dev);
+
+ netdev_info(failover_dev, "failover %s slave:%s unregistered\n",
+ slave_is_standby ? "standby" : "primary", slave_dev->name);
+
+done:
+ return NOTIFY_DONE;
+}
+
+static int net_failover_slave_link_change(struct net_device *slave_dev,
+ struct net_device *failover_dev)
+{
+ struct net_device *primary_dev, *standby_dev;
+ struct net_failover_info *nfo_info;
+
+ nfo_info = netdev_priv(failover_dev);
+
+ primary_dev = rtnl_dereference(nfo_info->primary_dev);
+ standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+ if (slave_dev != primary_dev && slave_dev != standby_dev)
+ goto done;
+
+ if ((primary_dev && net_failover_xmit_ready(primary_dev)) ||
+ (standby_dev && net_failover_xmit_ready(standby_dev))) {
+ netif_carrier_on(failover_dev);
+ netif_tx_wake_all_queues(failover_dev);
+ } else {
+ dev_get_stats(failover_dev, &nfo_info->failover_stats);
+ netif_carrier_off(failover_dev);
+ netif_tx_stop_all_queues(failover_dev);
+ }
+
+ net_failover_lower_state_changed(slave_dev, primary_dev, standby_dev);
+
+done:
+ return NOTIFY_DONE;
+}
+
+static int net_failover_slave_name_change(struct net_device *slave_dev,
+ struct net_device *failover_dev)
+{
+ struct net_device *primary_dev, *standby_dev;
+ struct net_failover_info *nfo_info;
+
+ nfo_info = netdev_priv(failover_dev);
+
+ primary_dev = rtnl_dereference(nfo_info->primary_dev);
+ standby_dev = rtnl_dereference(nfo_info->standby_dev);
+
+ if (slave_dev != primary_dev && slave_dev != standby_dev)
+ goto done;
+
+ /* We need to bring up the slave after the rename by udev in case
+ * open failed with EBUSY when it was registered.
+ */
+ dev_open(slave_dev);
+
+done:
+ return NOTIFY_DONE;
+}
+
+static struct failover_ops net_failover_ops = {
+ .slave_register = net_failover_slave_register,
+ .slave_unregister = net_failover_slave_unregister,
+ .slave_link_change = net_failover_slave_link_change,
+ .slave_name_change = net_failover_slave_name_change,
+};
+
+/**
+ * net_failover_create - Create and register a failover instance
+ *
+ * @dev: standby netdev
+ *
+ * Creates a failover netdev and registers a failover instance for a standby
+ * netdev. Used by paravirtual drivers that use 3-netdev model.
+ * The failover netdev acts as a master device and controls 2 slave devices -
+ * the original standby netdev and a VF netdev with the same MAC gets
+ * registered as primary netdev.
+ *
+ * Return: pointer to failover instance
+ */
+struct failover *net_failover_create(struct net_device *standby_dev)
+{
+ struct device *dev = standby_dev->dev.parent;
+ struct net_device *failover_dev;
+ struct failover *failover;
+ int err;
+
+ /* Alloc at least 2 queues, for now we are going with 16 assuming
+ * that VF devices being enslaved won't have too many queues.
+ */
+ failover_dev = alloc_etherdev_mq(sizeof(struct net_failover_info), 16);
+ if (!failover_dev) {
+ dev_err(dev, "Unable to allocate failover_netdev!\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ dev_net_set(failover_dev, dev_net(standby_dev));
+ SET_NETDEV_DEV(failover_dev, dev);
+
+ failover_dev->netdev_ops = &failover_dev_ops;
+ failover_dev->ethtool_ops = &failover_ethtool_ops;
+
+ /* Initialize the device options */
+ failover_dev->priv_flags |= IFF_UNICAST_FLT | IFF_NO_QUEUE;
+ failover_dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE |
+ IFF_TX_SKB_SHARING);
+
+ /* don't acquire failover netdev's netif_tx_lock when transmitting */
+ failover_dev->features |= NETIF_F_LLTX;
+
+ /* Don't allow failover devices to change network namespaces. */
+ failover_dev->features |= NETIF_F_NETNS_LOCAL;
+
+ failover_dev->hw_features = FAILOVER_VLAN_FEATURES |
+ NETIF_F_HW_VLAN_CTAG_TX |
+ NETIF_F_HW_VLAN_CTAG_RX |
+ NETIF_F_HW_VLAN_CTAG_FILTER;
+
+ failover_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
+ failover_dev->features |= failover_dev->hw_features;
+
+ memcpy(failover_dev->dev_addr, standby_dev->dev_addr,
+ failover_dev->addr_len);
+
+ failover_dev->min_mtu = standby_dev->min_mtu;
+ failover_dev->max_mtu = standby_dev->max_mtu;
+
+ err = register_netdev(failover_dev);
+ if (err) {
+ dev_err(dev, "Unable to register failover_dev!\n");
+ goto err_register_netdev;
+ }
+
+ netif_carrier_off(failover_dev);
+
+ failover = failover_register(failover_dev, &net_failover_ops);
+ if (IS_ERR(failover))
+ goto err_failover_register;
+
+ return failover;
+
+err_failover_register:
+ unregister_netdev(failover_dev);
+err_register_netdev:
+ free_netdev(failover_dev);
+
+ return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(net_failover_create);
+
+/**
+ * net_failover_destroy - Destroy a failover instance
+ *
+ * @failover: pointer to failover instance
+ *
+ * Unregisters any slave netdevs associated with the failover instance by
+ * calling failover_slave_unregister().
+ * unregisters the failover instance itself and finally frees the failover
+ * netdev. Used by paravirtual drivers that use 3-netdev model.
+ *
+ */
+void net_failover_destroy(struct failover *failover)
+{
+ struct net_failover_info *nfo_info;
+ struct net_device *failover_dev;
+ struct net_device *slave_dev;
+
+ if (!failover)
+ return;
+
+ failover_dev = rcu_dereference(failover->failover_dev);
+ nfo_info = netdev_priv(failover_dev);
+
+ netif_device_detach(failover_dev);
+
+ rtnl_lock();
+
+ slave_dev = rtnl_dereference(nfo_info->primary_dev);
+ if (slave_dev)
+ failover_slave_unregister(slave_dev);
+
+ slave_dev = rtnl_dereference(nfo_info->standby_dev);
+ if (slave_dev)
+ failover_slave_unregister(slave_dev);
+
+ failover_unregister(failover);
+
+ unregister_netdevice(failover_dev);
+
+ rtnl_unlock();
+
+ free_netdev(failover_dev);
+}
+EXPORT_SYMBOL_GPL(net_failover_destroy);
+
+static __init int
+net_failover_init(void)
+{
+ return 0;
+}
+module_init(net_failover_init);
+
+static __exit
+void net_failover_exit(void)
+{
+}
+module_exit(net_failover_exit);
+
+MODULE_DESCRIPTION("Failover driver for Paravirtual drivers");
+MODULE_LICENSE("GPL v2");
diff --git a/include/net/net_failover.h b/include/net/net_failover.h
new file mode 100644
index 000000000000..b12a1c469d1c
--- /dev/null
+++ b/include/net/net_failover.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_FAILOVER_H
+#define _NET_FAILOVER_H
+
+#include <net/failover.h>
+
+/* failover state */
+struct net_failover_info {
+ /* primary netdev with same MAC */
+ struct net_device __rcu *primary_dev;
+
+ /* standby netdev */
+ struct net_device __rcu *standby_dev;
+
+ /* primary netdev stats */
+ struct rtnl_link_stats64 primary_stats;
+
+ /* standby netdev stats */
+ struct rtnl_link_stats64 standby_stats;
+
+ /* aggregated stats */
+ struct rtnl_link_stats64 failover_stats;
+
+ /* spinlock while updating stats */
+ spinlock_t stats_lock;
+};
+
+struct failover *net_failover_create(struct net_device *standby_dev);
+void net_failover_destroy(struct failover *failover);
+
+#define FAILOVER_VLAN_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | \
+ NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | \
+ NETIF_F_HIGHDMA | NETIF_F_LRO)
+
+#define FAILOVER_ENC_FEATURES (NETIF_F_HW_CSUM | NETIF_F_SG | \
+ NETIF_F_RXCSUM | NETIF_F_ALL_TSO)
+
+#endif /* _NET_FAILOVER_H */
--
2.14.3
^ permalink raw reply related
* [PATCH net-next v11 2/5] netvsc: refactor notifier/event handling code to use the failover framework
From: Sridhar Samudrala @ 2018-05-22 2:06 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-1-git-send-email-sridhar.samudrala@intel.com>
Use the registration/notification framework supported by the generic
failover infrastructure.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
drivers/net/hyperv/Kconfig | 1 +
drivers/net/hyperv/hyperv_net.h | 2 +
drivers/net/hyperv/netvsc_drv.c | 133 +++++++---------------------------------
3 files changed, 25 insertions(+), 111 deletions(-)
diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 0765d5f61714..23a2d145813a 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -2,5 +2,6 @@ config HYPERV_NET
tristate "Microsoft Hyper-V virtual network driver"
depends on HYPERV
select UCS2_STRING
+ select FAILOVER
help
Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 1be34d2e3563..99d8e7398a5b 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -932,6 +932,8 @@ struct net_device_context {
u32 vf_alloc;
/* Serial number of the VF to team with */
u32 vf_serial;
+
+ struct failover *failover;
};
/* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index da07ccdf84bf..6c77a81b7266 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
#include <net/pkt_sched.h>
#include <net/checksum.h>
#include <net/ip6_checksum.h>
+#include <net/failover.h>
#include "hyperv_net.h"
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
rtnl_unlock();
}
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
- struct net_device *dev;
-
- ASSERT_RTNL();
-
- for_each_netdev(&init_net, dev) {
- if (dev->netdev_ops != &device_ops)
- continue; /* not a netvsc device */
-
- if (ether_addr_equal(mac, dev->perm_addr))
- return dev;
- }
-
- return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
- struct net_device *dev;
-
- ASSERT_RTNL();
-
- for_each_netdev(&init_net, dev) {
- struct net_device_context *net_device_ctx;
-
- if (dev->netdev_ops != &device_ops)
- continue; /* not a netvsc device */
-
- net_device_ctx = netdev_priv(dev);
- if (!rtnl_dereference(net_device_ctx->nvdev))
- continue; /* device is removed */
-
- if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
- return dev; /* a match */
- }
-
- return NULL;
-}
-
/* Called when VF is injecting data into network stack.
* Change the associated network device from VF to netvsc.
* note: already called with rcu_read_lock
@@ -1915,24 +1876,15 @@ static void netvsc_vf_setup(struct work_struct *w)
rtnl_unlock();
}
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_register_vf(struct net_device *vf_netdev,
+ struct net_device *ndev)
{
- struct net_device *ndev;
struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
if (vf_netdev->addr_len != ETH_ALEN)
return NOTIFY_DONE;
- /*
- * We will use the MAC address to locate the synthetic interface to
- * associate with the VF interface. If we don't find a matching
- * synthetic interface, move on.
- */
- ndev = get_netvsc_bymac(vf_netdev->perm_addr);
- if (!ndev)
- return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
@@ -1949,17 +1901,13 @@ static int netvsc_register_vf(struct net_device *vf_netdev)
}
/* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *vf_netdev,
+ struct net_device *ndev)
{
struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
- struct net_device *ndev;
bool vf_is_up = netif_running(vf_netdev);
- ndev = get_netvsc_byref(vf_netdev);
- if (!ndev)
- return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
if (!netvsc_dev)
@@ -1972,15 +1920,11 @@ static int netvsc_vf_changed(struct net_device *vf_netdev)
return NOTIFY_OK;
}
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_unregister_vf(struct net_device *vf_netdev,
+ struct net_device *ndev)
{
- struct net_device *ndev;
struct net_device_context *net_device_ctx;
- ndev = get_netvsc_byref(vf_netdev);
- if (!ndev)
- return NOTIFY_DONE;
-
net_device_ctx = netdev_priv(ndev);
cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
@@ -1994,6 +1938,12 @@ static int netvsc_unregister_vf(struct net_device *vf_netdev)
return NOTIFY_OK;
}
+static struct failover_ops netvsc_failover_ops = {
+ .slave_register = netvsc_register_vf,
+ .slave_unregister = netvsc_unregister_vf,
+ .slave_link_change = netvsc_vf_changed,
+};
+
static int netvsc_probe(struct hv_device *dev,
const struct hv_vmbus_device_id *dev_id)
{
@@ -2083,8 +2033,14 @@ static int netvsc_probe(struct hv_device *dev,
goto register_failed;
}
+ net_device_ctx->failover = failover_register(net, &netvsc_failover_ops);
+ if (IS_ERR(net_device_ctx->failover))
+ goto err_failover;
+
return ret;
+err_failover:
+ unregister_netdev(net);
register_failed:
rndis_filter_device_remove(dev, nvdev);
rndis_failed:
@@ -2125,13 +2081,15 @@ static int netvsc_remove(struct hv_device *dev)
rtnl_lock();
vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
if (vf_netdev)
- netvsc_unregister_vf(vf_netdev);
+ failover_slave_unregister(vf_netdev);
if (nvdev)
rndis_filter_device_remove(dev, nvdev);
unregister_netdevice(net);
+ failover_unregister(ndev_ctx->failover);
+
rtnl_unlock();
rcu_read_unlock();
@@ -2158,54 +2116,8 @@ static struct hv_driver netvsc_drv = {
.remove = netvsc_remove,
};
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
- unsigned long event, void *ptr)
-{
- struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
- /* Skip our own events */
- if (event_dev->netdev_ops == &device_ops)
- return NOTIFY_DONE;
-
- /* Avoid non-Ethernet type devices */
- if (event_dev->type != ARPHRD_ETHER)
- return NOTIFY_DONE;
-
- /* Avoid Vlan dev with same MAC registering as VF */
- if (is_vlan_dev(event_dev))
- return NOTIFY_DONE;
-
- /* Avoid Bonding master dev with same MAC registering as VF */
- if ((event_dev->priv_flags & IFF_BONDING) &&
- (event_dev->flags & IFF_MASTER))
- return NOTIFY_DONE;
-
- switch (event) {
- case NETDEV_REGISTER:
- return netvsc_register_vf(event_dev);
- case NETDEV_UNREGISTER:
- return netvsc_unregister_vf(event_dev);
- case NETDEV_UP:
- case NETDEV_DOWN:
- return netvsc_vf_changed(event_dev);
- default:
- return NOTIFY_DONE;
- }
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
- .notifier_call = netvsc_netdev_event,
-};
-
static void __exit netvsc_drv_exit(void)
{
- unregister_netdevice_notifier(&netvsc_netdev_notifier);
vmbus_driver_unregister(&netvsc_drv);
}
@@ -2225,7 +2137,6 @@ static int __init netvsc_drv_init(void)
if (ret)
return ret;
- register_netdevice_notifier(&netvsc_netdev_notifier);
return 0;
}
--
2.14.3
^ permalink raw reply related
* [PATCH net-next v11 1/5] net: Introduce generic failover module
From: Sridhar Samudrala @ 2018-05-22 2:06 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai
In-Reply-To: <1526954781-35359-1-git-send-email-sridhar.samudrala@intel.com>
The failover module provides a generic interface for paravirtual drivers
to register a netdev and a set of ops with a failover instance. The ops
are used as event handlers that get called to handle netdev register/
unregister/link change/name change events on slave pci ethernet devices
with the same mac address as the failover netdev.
This enables paravirtual drivers to use a VF as an accelerated low latency
datapath. It also allows migration of VMs with direct attached VFs by
failing over to the paravirtual datapath when the VF is unplugged.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
Documentation/networking/failover.rst | 18 +++
MAINTAINERS | 8 +
include/linux/netdevice.h | 16 ++
include/net/failover.h | 31 ++++
net/Kconfig | 13 ++
net/core/Makefile | 1 +
net/core/failover.c | 273 ++++++++++++++++++++++++++++++++++
7 files changed, 360 insertions(+)
create mode 100644 Documentation/networking/failover.rst
create mode 100644 include/net/failover.h
create mode 100644 net/core/failover.c
diff --git a/Documentation/networking/failover.rst b/Documentation/networking/failover.rst
new file mode 100644
index 000000000000..f0c8483cdbf5
--- /dev/null
+++ b/Documentation/networking/failover.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+FAILOVER
+========
+
+Overview
+========
+
+The failover module provides a generic interface for paravirtual drivers
+to register a netdev and a set of ops with a failover instance. The ops
+are used as event handlers that get called to handle netdev register/
+unregister/link change/name change events on slave pci ethernet devices
+with the same mac address as the failover netdev.
+
+This enables paravirtual drivers to use a VF as an accelerated low latency
+datapath. It also allows live migration of VMs with direct attached VFs by
+failing over to the paravirtual datapath when the VF is unplugged.
diff --git a/MAINTAINERS b/MAINTAINERS
index 658880464b9d..aed9575ddc53 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5413,6 +5413,14 @@ S: Maintained
F: Documentation/hwmon/f71805f
F: drivers/hwmon/f71805f.c
+FAILOVER MODULE
+M: Sridhar Samudrala <sridhar.samudrala@intel.com>
+L: netdev@vger.kernel.org
+S: Supported
+F: net/core/failover.c
+F: include/net/failover.h
+F: Documentation/networking/failover.rst
+
FANOTIFY
M: Jan Kara <jack@suse.cz>
R: Amir Goldstein <amir73il@gmail.com>
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 03ed492c4e14..0f4ba52b641d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1421,6 +1421,8 @@ struct net_device_ops {
* entity (i.e. the master device for bridged veth)
* @IFF_MACSEC: device is a MACsec device
* @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
+ * @IFF_FAILOVER: device is a failover master device
+ * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
*/
enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1450,6 +1452,8 @@ enum netdev_priv_flags {
IFF_PHONY_HEADROOM = 1<<24,
IFF_MACSEC = 1<<25,
IFF_NO_RX_HANDLER = 1<<26,
+ IFF_FAILOVER = 1<<27,
+ IFF_FAILOVER_SLAVE = 1<<28,
};
#define IFF_802_1Q_VLAN IFF_802_1Q_VLAN
@@ -1478,6 +1482,8 @@ enum netdev_priv_flags {
#define IFF_RXFH_CONFIGURED IFF_RXFH_CONFIGURED
#define IFF_MACSEC IFF_MACSEC
#define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
+#define IFF_FAILOVER IFF_FAILOVER
+#define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE
/**
* struct net_device - The DEVICE structure.
@@ -4321,6 +4327,16 @@ static inline bool netif_is_rxfh_configured(const struct net_device *dev)
return dev->priv_flags & IFF_RXFH_CONFIGURED;
}
+static inline bool netif_is_failover(const struct net_device *dev)
+{
+ return dev->priv_flags & IFF_FAILOVER;
+}
+
+static inline bool netif_is_failover_slave(const struct net_device *dev)
+{
+ return dev->priv_flags & IFF_FAILOVER_SLAVE;
+}
+
/* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
static inline void netif_keep_dst(struct net_device *dev)
{
diff --git a/include/net/failover.h b/include/net/failover.h
new file mode 100644
index 000000000000..fea9d0b978c8
--- /dev/null
+++ b/include/net/failover.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _FAILOVER_H
+#define _FAILOVER_H
+
+#include <linux/netdevice.h>
+
+struct failover_ops {
+ int (*slave_register)(struct net_device *slave_dev,
+ struct net_device *failover_dev);
+ int (*slave_unregister)(struct net_device *slave_dev,
+ struct net_device *failover_dev);
+ int (*slave_link_change)(struct net_device *slave_dev,
+ struct net_device *failover_dev);
+ int (*slave_name_change)(struct net_device *slave_dev,
+ struct net_device *failover_dev);
+};
+
+struct failover {
+ struct list_head list;
+ struct net_device __rcu *failover_dev;
+ struct failover_ops __rcu *ops;
+};
+
+struct failover *failover_register(struct net_device *dev,
+ struct failover_ops *ops);
+void failover_unregister(struct failover *failover);
+int failover_slave_unregister(struct net_device *slave_dev);
+
+#endif /* _FAILOVER_H */
diff --git a/net/Kconfig b/net/Kconfig
index df8d45ef47d8..d581c4f0f1c4 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -430,6 +430,19 @@ config MAY_USE_DEVLINK
config PAGE_POOL
bool
+config FAILOVER
+ tristate "Generic failover module"
+ help
+ The failover module provides a generic interface for paravirtual
+ drivers to register a netdev and a set of ops with a failover
+ instance. The ops are used as event handlers that get called to
+ handle netdev register/unregister/link change/name change events
+ on slave pci ethernet devices with the same mac address as the
+ failover netdev. This enables paravirtual drivers to use a
+ VF as an accelerated low latency datapath. It also allows live
+ migration of VMs with direct attached VFs by failing over to the
+ paravirtual datapath when the VF is unplugged.
+
endif # if NET
# Used by archs to tell that they support BPF JIT compiler plus which flavour.
diff --git a/net/core/Makefile b/net/core/Makefile
index 7080417f8bc8..80175e6a2eb8 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -31,3 +31,4 @@ obj-$(CONFIG_DST_CACHE) += dst_cache.o
obj-$(CONFIG_HWBM) += hwbm.o
obj-$(CONFIG_NET_DEVLINK) += devlink.o
obj-$(CONFIG_GRO_CELLS) += gro_cells.o
+obj-$(CONFIG_FAILOVER) += failover.o
diff --git a/net/core/failover.c b/net/core/failover.c
new file mode 100644
index 000000000000..48ee713a1ef4
--- /dev/null
+++ b/net/core/failover.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+/* A common module to handle registrations and notifications for paravirtual
+ * drivers to enable accelerated datapath and support VF live migration.
+ *
+ * The notifier and event handling code is based on netvsc driver.
+ */
+
+#include <linux/module.h>
+#include <linux/etherdevice.h>
+#include <uapi/linux/if_arp.h>
+#include <linux/rtnetlink.h>
+#include <net/failover.h>
+
+static LIST_HEAD(failover_list);
+static DEFINE_SPINLOCK(failover_lock);
+
+static struct net_device *failover_get_bymac(u8 *mac, struct failover_ops **ops)
+{
+ struct net_device *failover_dev;
+ struct failover *failover;
+
+ spin_lock(&failover_lock);
+ list_for_each_entry(failover, &failover_list, list) {
+ failover_dev = rtnl_dereference(failover->failover_dev);
+ if (ether_addr_equal(failover_dev->perm_addr, mac)) {
+ *ops = rtnl_dereference(failover->ops);
+ spin_unlock(&failover_lock);
+ return failover_dev;
+ }
+ }
+ spin_unlock(&failover_lock);
+ return NULL;
+}
+
+/**
+ * failover_slave_register - Register a slave netdev
+ *
+ * @slave_dev: slave netdev that is being registered
+ *
+ * Registers a slave device to a failover instance. Only ethernet devices
+ * are supported.
+ */
+static int failover_slave_register(struct net_device *slave_dev)
+{
+ struct net_device *failover_dev;
+ struct failover_ops *fops;
+
+ if (slave_dev->type != ARPHRD_ETHER)
+ goto done;
+
+ ASSERT_RTNL();
+
+ failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
+ if (!failover_dev)
+ goto done;
+
+ if (fops && fops->slave_register)
+ return fops->slave_register(slave_dev, failover_dev);
+
+done:
+ return NOTIFY_DONE;
+}
+
+/**
+ * failover_slave_unregister - Unregister a slave netdev
+ *
+ * @slave_dev: slave netdev that is being unregistered
+ *
+ * Unregisters a slave device from a failover instance.
+ */
+int failover_slave_unregister(struct net_device *slave_dev)
+{
+ struct net_device *failover_dev;
+ struct failover_ops *fops;
+
+ if (!netif_is_failover_slave(slave_dev))
+ goto done;
+
+ ASSERT_RTNL();
+
+ failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
+ if (!failover_dev)
+ goto done;
+
+ if (fops && fops->slave_unregister)
+ return fops->slave_unregister(slave_dev, failover_dev);
+
+done:
+ return NOTIFY_DONE;
+}
+EXPORT_SYMBOL_GPL(failover_slave_unregister);
+
+static int failover_slave_link_change(struct net_device *slave_dev)
+{
+ struct net_device *failover_dev;
+ struct failover_ops *fops;
+
+ if (!netif_is_failover_slave(slave_dev))
+ goto done;
+
+ ASSERT_RTNL();
+
+ failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
+ if (!failover_dev)
+ goto done;
+
+ if (!netif_running(failover_dev))
+ goto done;
+
+ if (fops && fops->slave_link_change)
+ return fops->slave_link_change(slave_dev, failover_dev);
+
+done:
+ return NOTIFY_DONE;
+}
+
+static int failover_slave_name_change(struct net_device *slave_dev)
+{
+ struct net_device *failover_dev;
+ struct failover_ops *fops;
+
+ if (!netif_is_failover_slave(slave_dev))
+ goto done;
+
+ ASSERT_RTNL();
+
+ failover_dev = failover_get_bymac(slave_dev->perm_addr, &fops);
+ if (!failover_dev)
+ goto done;
+
+ if (!netif_running(failover_dev))
+ goto done;
+
+ if (fops && fops->slave_name_change)
+ return fops->slave_name_change(slave_dev, failover_dev);
+
+done:
+ return NOTIFY_DONE;
+}
+
+static int
+failover_event(struct notifier_block *this, unsigned long event, void *ptr)
+{
+ struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
+
+ /* Skip parent events */
+ if (netif_is_failover(event_dev))
+ return NOTIFY_DONE;
+
+ switch (event) {
+ case NETDEV_REGISTER:
+ return failover_slave_register(event_dev);
+ case NETDEV_UNREGISTER:
+ return failover_slave_unregister(event_dev);
+ case NETDEV_UP:
+ case NETDEV_DOWN:
+ case NETDEV_CHANGE:
+ return failover_slave_link_change(event_dev);
+ case NETDEV_CHANGENAME:
+ return failover_slave_name_change(event_dev);
+ default:
+ return NOTIFY_DONE;
+ }
+}
+
+static struct notifier_block failover_notifier = {
+ .notifier_call = failover_event,
+};
+
+static void
+failover_existing_slave_register(struct net_device *failover_dev)
+{
+ struct net *net = dev_net(failover_dev);
+ struct net_device *dev;
+
+ rtnl_lock();
+ for_each_netdev(net, dev) {
+ if (netif_is_failover(dev))
+ continue;
+ if (ether_addr_equal(failover_dev->perm_addr, dev->perm_addr))
+ failover_slave_register(dev);
+ }
+ rtnl_unlock();
+}
+
+/**
+ * failover_register - Register a failover instance
+ *
+ * @dev: failover netdev
+ * @ops: failover ops
+ *
+ * Allocate and register a failover instance for a failover netdev. ops
+ * provides handlers for slave device register/unregister/link change/
+ * name change events.
+ *
+ * Return: pointer to failover instance
+ */
+struct failover *failover_register(struct net_device *dev,
+ struct failover_ops *ops)
+{
+ struct failover *failover;
+
+ if (dev->type != ARPHRD_ETHER)
+ return ERR_PTR(-EINVAL);
+
+ failover = kzalloc(sizeof(*failover), GFP_KERNEL);
+ if (!failover)
+ return ERR_PTR(-ENOMEM);
+
+ rcu_assign_pointer(failover->ops, ops);
+ dev_hold(dev);
+ dev->priv_flags |= IFF_FAILOVER;
+ rcu_assign_pointer(failover->failover_dev, dev);
+
+ spin_lock(&failover_lock);
+ list_add_tail(&failover->list, &failover_list);
+ spin_unlock(&failover_lock);
+
+ netdev_info(dev, "failover master:%s registered\n", dev->name);
+
+ failover_existing_slave_register(dev);
+
+ return failover;
+}
+EXPORT_SYMBOL_GPL(failover_register);
+
+/**
+ * failover_unregister - Unregister a failover instance
+ *
+ * @failover: pointer to failover instance
+ *
+ * Unregisters and frees a failover instance.
+ */
+void failover_unregister(struct failover *failover)
+{
+ struct net_device *failover_dev;
+
+ failover_dev = rcu_dereference(failover->failover_dev);
+
+ netdev_info(failover_dev, "failover master:%s unregistered\n",
+ failover_dev->name);
+
+ failover_dev->priv_flags &= ~IFF_FAILOVER;
+ dev_put(failover_dev);
+
+ spin_lock(&failover_lock);
+ list_del(&failover->list);
+ spin_unlock(&failover_lock);
+
+ kfree(failover);
+}
+EXPORT_SYMBOL_GPL(failover_unregister);
+
+static __init int
+failover_init(void)
+{
+ register_netdevice_notifier(&failover_notifier);
+
+ return 0;
+}
+module_init(failover_init);
+
+static __exit
+void failover_exit(void)
+{
+ unregister_netdevice_notifier(&failover_notifier);
+}
+module_exit(failover_exit);
+
+MODULE_DESCRIPTION("Generic failover infrastructure/interface");
+MODULE_LICENSE("GPL v2");
--
2.14.3
^ permalink raw reply related
* [PATCH net-next v11 0/5] Enable virtio_net to act as a standby for a passthru device
From: Sridhar Samudrala @ 2018-05-22 2:06 UTC (permalink / raw)
To: mst, stephen, davem, netdev, virtualization, virtio-dev,
jesse.brandeburg, alexander.h.duyck, kubakici, sridhar.samudrala,
jasowang, loseweigh, jiri, aaron.f.brown, anjali.singhai
The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.
Patch 1 introduces a failover module that provides a generic interface for
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation.
Patch 2 refactors netvsc to use the registration/notification framework
introduced by failover module.
Patch 3 introduces a net_failover driver that provides an automated
failover mechanism to paravirtual drivers via APIs to create and destroy
a failover master netdev and mananges a primary and standby slave netdevs
that get registered via the generic failover infrastructure.
Patch 4 introduces a new feature bit VIRTIO_NET_F_STANDBY to virtio-net
that can be used by hypervisor to indicate that virtio_net interface
should act as a standby for another device with the same MAC address.
Patch 5 extends virtio_net to use alternate datapath when available and
registered. When STANDBY feature is enabled, virtio_net driver uese the
net_failover API to create an additional 'failover' netdev that acts as
a master device and controls 2 slave devices. The original virtio_net
netdev is registered as 'standby' netdev and a passthru/vf device with
the same MAC gets registered as 'primary' netdev. Both 'standby' and
'failover' netdevs are associated with the same 'pci' device. The user
accesses the network interface via 'failover' netdev. The 'failover'
netdev chooses 'primary' netdev as default for transmits when it is
available with link up and running.
As this patch series is initially focusing on usecases where hypervisor
fully controls the VM networking and the guest is not expected to directly
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in failover netdev and replaying when the 'primary' netdev gets registered.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.
This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
v11:
- Tested live migration with virtio-net/AVF(i40evf) configured in failover
mode while running iperf in background. Tried static ip and dhcp
configurations using 'network' scripts and Network Manager.
- Build tested netvsc module.
Updates:
- Split net_failover module into 2 components.
1. 'failover' module that provides generic failover infrastructure
to register a failover instance and listen for slave events.
2. 'net_failover' driver that provides APIs to create/destroy upper
netdev and supports 3-netdev model used by virtio-net.
- Added documentation
v10:
- Tested live migration with virtio-net/AVF(i40evf) configured in failover
mode while running iperf in background. Tried static ip and dhcp
configurations using 'network' scripts and Network Manager.
- To avoid dhcp requests to be sent over the lower 'standby' and 'primary'
netdevs, ifcfg-xxx scripts for the lower devices can be created.
- Build tested netvsc module.
Updates:
- fix net_failover_open() to update failover CARRIER correctly based on
standby and primary states.
- fix net_failover_handle_frame() to handle frames received on standby
when primary is present.
- replace netdev_upper_dev_link with netdev_master_upper_dev_link and
handle lower dev state changes.
- fix net_failver_create() and net_failover_register() interfaces to
use ERR_PTR and avoid arg **
- disable setting mac address when virtio-net in STANDBY mode
- document exported symbols
- added entry to MAINTAINERS file
v9:
Select NET_FAILOVER automatically when VIRTIO_NET/HYPERV_NET
are enabled. (stephen)
v8:
- Made the failover managment routines more robust by updating the feature
bits/other fields in the failover netdev when slave netdevs are
registered/unregistered. (mst)
- added support for handling vlans.
- Limited the changes in netvsc to only use the notifier/event/lookups
from the failover module. The slave register/unregister/link-change
handlers are only updated to use the getbymac routine to get the
upper netdev. There is no change in their functionality. (stephen)
- renamed structs/function/file names to use net_failover prefix. (mst)
v7
- Rename 'bypass/active/backup' terminology with 'failover/primary/standy'
(jiri, mst)
- re-arranged dev_open() and dev_set_mtu() calls in the register routines
so that they don't get called for 2-netdev model. (stephen)
- fixed select_queue() routine to do queue selection based on VF if it is
registered as primary. (stephen)
- minor bugfixes
v6 RFC:
Simplified virtio_net changes by moving all the ndo_ops of the
bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
avoided 2 phase registration(driver + instances).
introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags
replaced mutex with a spinlock
v5 RFC:
Based on Jiri's comments, moved the common functionality to a 'bypass'
module so that the same notifier and event handlers to handle child
register/unregister/link change events can be shared between virtio_net
and netvsc.
Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
Jakub's suggestion for the naming issue with 3 netdev solution,
proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu
Sridhar Samudrala (5):
net: Introduce generic failover module
netvsc: refactor notifier/event handling code to use the failover
framework
net: Introduce net_failover driver
virtio_net: Introduce VIRTIO_NET_F_STANDBY feature bit
virtio_net: Extend virtio to use VF datapath when available
Documentation/networking/failover.rst | 18 +
Documentation/networking/net_failover.rst | 116 +++++
MAINTAINERS | 16 +
drivers/net/Kconfig | 13 +
drivers/net/Makefile | 1 +
drivers/net/hyperv/Kconfig | 1 +
drivers/net/hyperv/hyperv_net.h | 2 +
drivers/net/hyperv/netvsc_drv.c | 133 +----
drivers/net/net_failover.c | 836 ++++++++++++++++++++++++++++++
drivers/net/virtio_net.c | 40 +-
include/linux/netdevice.h | 16 +
include/net/failover.h | 31 ++
include/net/net_failover.h | 40 ++
include/uapi/linux/virtio_net.h | 3 +
net/Kconfig | 13 +
net/core/Makefile | 1 +
net/core/failover.c | 273 ++++++++++
17 files changed, 1440 insertions(+), 113 deletions(-)
create mode 100644 Documentation/networking/failover.rst
create mode 100644 Documentation/networking/net_failover.rst
create mode 100644 drivers/net/net_failover.c
create mode 100644 include/net/failover.h
create mode 100644 include/net/net_failover.h
create mode 100644 net/core/failover.c
--
2.14.3
^ permalink raw reply
* Re: [PATCH 05/33] cxgb4: use match_string() helper
From: Yisheng Xie @ 2018-05-22 0:55 UTC (permalink / raw)
To: Andy Shevchenko; +Cc: Linux Kernel Mailing List, Ganesh Goudar, netdev
In-Reply-To: <CAHp75Vcpfeey_womFqoqp_gyy0=jqMHgW+30WJs-BcVTM-vHQA@mail.gmail.com>
Hi Andy,
On 2018/5/22 5:39, Andy Shevchenko wrote:
> On Mon, May 21, 2018 at 2:57 PM, Yisheng Xie <xieyisheng1@huawei.com> wrote:
>> match_string() returns the index of an array for a matching string,
>> which can be used intead of open coded variant.
>
>> - for (i = 0; i < ARRAY_SIZE(cudbg_region); i++) {
>> - if (!strcmp(cudbg_region[i], region_name)) {
>> - found = 1;
>> - idx = i;
>> - break;
>> - }
>> - }
>> - if (!found)
>> - return -EINVAL;
>> + rc = match_string(cudbg_region, ARRAY_SIZE(cudbg_region), region_name);
>> + if (rc < 0)
>> + return rc;
>>
>> - found = 0;
>> + idx = rc;
>
> Is found still in use after this?
> If so, is it initialized properly now?
it is initialized when define 'found', so no need to be initialized once more.
Thanks
Yisheng
>
^ permalink raw reply
* Re: INFO: rcu detected stall in is_bpf_text_address
From: Marcelo Ricardo Leitner @ 2018-05-22 0:43 UTC (permalink / raw)
To: Xin Long
Cc: Eric Dumazet, syzbot, ast, Daniel Borkmann, LKML, network dev,
syzkaller-bugs, linux-sctp
In-Reply-To: <CADvbK_e7LXc7minSmaPw6iZvjWo215wG-pPAS0gr58+-VxUO7Q@mail.gmail.com>
On Sun, May 20, 2018 at 04:26:03PM +0800, Xin Long wrote:
> On Sat, May 19, 2018 at 11:57 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > SCTP experts, please take a look.
> >
> > On 05/19/2018 08:55 AM, syzbot wrote:
> >> Hello,
> >>
> >> syzbot found the following crash on:
> >>
> >> HEAD commit: 73fcb1a370c7 Merge branch 'akpm' (patches from Andrew)
> >> git tree: upstream
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=1462ec0f800000
> >> kernel config: https://syzkaller.appspot.com/x/.config?x=f3b4e30da84ec1ed
> >> dashboard link: https://syzkaller.appspot.com/bug?extid=3dcd59a1f907245f891f
> >> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> >> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1079cf8f800000
> Thank you.
> The Reproducer is more than helpful.
>
> setsockopt$inet_sctp6_SCTP_RTOINFO(r0, 0x84, 0x0,
> &(0x7f0000000140)={0x0, 0x6, 0x7, 0x4}, 0x10)
>
> It set rto_min=6 and rto_max=7, these are too small values.
> t3_rtx timer works fine with it. But hb_timer will get stuck there, as
> in its timer handler it starts this timer again with this value, then
> it goes to the timer handler again...
Nice, thanks Xin.
>
> HB has to repeat this and the hb timer's expire may also have to use
> 'trans->rto >> 1 ...' stuff. But we can limit the RTO's min value, like
> HZ/20, which is 'Try again later.' number used when sock lock is
> owned by others in all timer handlers.
I think a good fix for this is to not allow the application to go
below net.sctp.rto_min, and neither above net.sctp.rto_max.
Then they can even be close to each other, won't be an issue, as long
as rto_min is something sensible. Which then brings it to the second
step of a fix: to restrict rto_min to be >= HZ/5 (copying from TCP
here).
^ permalink raw reply
* Re: [PATCH] [RFC] bpf: tracing: new helper bpf_get_current_cgroup_ino
From: Y Song @ 2018-05-22 0:24 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alban Crequy, netdev, linux-kernel, containers, cgroups,
Alban Crequy, tj
In-Reply-To: <20180521162609.lpdrnozowmzdn57m@ast-mbp.dhcp.thefacebook.com>
On Mon, May 21, 2018 at 9:26 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Sun, May 13, 2018 at 07:33:18PM +0200, Alban Crequy wrote:
>>
>> +BPF_CALL_2(bpf_get_current_cgroup_ino, u32, hierarchy, u64, flags)
>> +{
>> + // TODO: pick the correct hierarchy instead of the mem controller
>> + struct cgroup *cgrp = task_cgroup(current, memory_cgrp_id);
>> +
>> + if (unlikely(!cgrp))
>> + return -EINVAL;
>> + if (unlikely(hierarchy))
>> + return -EINVAL;
>> + if (unlikely(flags))
>> + return -EINVAL;
>> +
>> + return cgrp->kn->id.ino;
>
> ino only is not enough to identify cgroup. It needs generation number too.
> I don't quite see how hierarchy and flags can be used in the future.
> Also why limit it to memcg?
>
> How about something like this instead:
>
> BPF_CALL_2(bpf_get_current_cgroup_id)
> {
> struct cgroup *cgrp = task_dfl_cgroup(current);
>
> return cgrp->kn->id.id;
> }
> The user space can use fhandle api to get the same 64-bit id.
I think this should work. This will also be useful to bcc as user
space can encode desired id
in the bpf program and compared that id to the current cgroup id, so we can have
cgroup level tracing (esp. stat collection) support. To cope with
cgroup hierarchy, user can use
cgroup-array based approach or explicitly compare against multiple cgroup id's.
^ permalink raw reply
* [PATCH net-next v4 2/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-05-22 0:16 UTC (permalink / raw)
To: netdev, pshelar; +Cc: Yi-Hung Wei
In-Reply-To: <1526948165-32443-1-git-send-email-yihung.wei@gmail.com>
Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max. In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table. Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.
To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone. For example, we can designate different zone to different VM,
and set conntrack limit to each zone. By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs. Moreover, the users can
set various conntrack limit to different zone based on their preference.
The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone. If the number of
connection is above a configured limitation, ovs will return ENOMEM to the
userspace. If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.
The following high leve APIs are provided to the userspace:
- OVS_CT_LIMIT_CMD_SET:
* set default connection limit for all zones
* set the connection limit for a particular zone
- OVS_CT_LIMIT_CMD_DEL:
* remove the connection limit for a particular zone
- OVS_CT_LIMIT_CMD_GET:
* get the default connection limit for all zones
* get the connection limit for a particular zone
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
net/openvswitch/Kconfig | 3 +-
net/openvswitch/conntrack.c | 541 +++++++++++++++++++++++++++++++++++++++++++-
net/openvswitch/conntrack.h | 9 +-
net/openvswitch/datapath.c | 7 +-
net/openvswitch/datapath.h | 3 +
5 files changed, 557 insertions(+), 6 deletions(-)
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
(NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
(!NF_NAT || NF_NAT) && \
(!NF_NAT_IPV4 || NF_NAT_IPV4) && \
- (!NF_NAT_IPV6 || NF_NAT_IPV6)))
+ (!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+ (!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
select LIBCRC32C
select MPLS
select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 02fc343feb66..e8bb91420ca9 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -16,8 +16,11 @@
#include <linux/tcp.h>
#include <linux/udp.h>
#include <linux/sctp.h>
+#include <linux/static_key.h>
#include <net/ip.h>
+#include <net/genetlink.h>
#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_count.h>
#include <net/netfilter/nf_conntrack_helper.h>
#include <net/netfilter/nf_conntrack_labels.h>
#include <net/netfilter/nf_conntrack_seqadj.h>
@@ -76,6 +79,31 @@ struct ovs_conntrack_info {
#endif
};
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED 0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+static DEFINE_STATIC_KEY_FALSE(ovs_ct_limit_enabled);
+
+struct ovs_ct_limit {
+ /* Elements in ovs_ct_limit_info->limits hash table */
+ struct hlist_node hlist_node;
+ struct rcu_head rcu;
+ u16 zone;
+ u32 limit;
+};
+
+struct ovs_ct_limit_info {
+ u32 default_limit;
+ struct hlist_head *limits;
+ struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+ [OVS_CT_LIMIT_ATTR_ZONE_LIMIT] = { .type = NLA_NESTED, },
+};
+#endif
+
static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1064,89 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels)
return false;
}
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+ const struct ovs_ct_limit_info *info, u16 zone)
+{
+ return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)];
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_set(const struct ovs_ct_limit_info *info,
+ struct ovs_ct_limit *new_ct_limit)
+{
+ struct ovs_ct_limit *ct_limit;
+ struct hlist_head *head;
+
+ head = ct_limit_hash_bucket(info, new_ct_limit->zone);
+ hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+ if (ct_limit->zone == new_ct_limit->zone) {
+ hlist_replace_rcu(&ct_limit->hlist_node,
+ &new_ct_limit->hlist_node);
+ kfree_rcu(ct_limit, rcu);
+ return;
+ }
+ }
+
+ hlist_add_head_rcu(&new_ct_limit->hlist_node, head);
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_del(const struct ovs_ct_limit_info *info, u16 zone)
+{
+ struct ovs_ct_limit *ct_limit;
+ struct hlist_head *head;
+ struct hlist_node *n;
+
+ head = ct_limit_hash_bucket(info, zone);
+ hlist_for_each_entry_safe(ct_limit, n, head, hlist_node) {
+ if (ct_limit->zone == zone) {
+ hlist_del_rcu(&ct_limit->hlist_node);
+ kfree_rcu(ct_limit, rcu);
+ return;
+ }
+ }
+}
+
+/* Call with RCU read lock */
+static u32 ct_limit_get(const struct ovs_ct_limit_info *info, u16 zone)
+{
+ struct ovs_ct_limit *ct_limit;
+ struct hlist_head *head;
+
+ head = ct_limit_hash_bucket(info, zone);
+ hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+ if (ct_limit->zone == zone)
+ return ct_limit->limit;
+ }
+
+ return info->default_limit;
+}
+
+static int ovs_ct_check_limit(struct net *net,
+ const struct ovs_conntrack_info *info,
+ const struct nf_conntrack_tuple *tuple)
+{
+ struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+ const struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+ u32 per_zone_limit, connections;
+ u32 conncount_key[5];
+
+ conncount_key[0] = info->zone.id;
+
+ per_zone_limit = ct_limit_get(ct_limit_info, info->zone.id);
+ if (per_zone_limit == OVS_CT_LIMIT_UNLIMITED)
+ return 0;
+
+ connections = nf_conncount_count(net, ct_limit_info->data,
+ conncount_key, tuple, &info->zone);
+ if (connections > per_zone_limit)
+ return -ENOMEM;
+
+ return 0;
+}
+#endif
+
/* Lookup connection and confirm if unconfirmed. */
static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
const struct ovs_conntrack_info *info,
@@ -1054,6 +1165,21 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
if (!ct)
return 0;
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+ if (static_branch_unlikely(&ovs_ct_limit_enabled)) {
+ if (!nf_ct_is_confirmed(ct)) {
+ err = ovs_ct_check_limit(net, info,
+ &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+ if (err) {
+ net_warn_ratelimited("openvswitch: zone: %u "
+ "execeeds conntrack limit\n",
+ info->zone.id);
+ return err;
+ }
+ }
+ }
+#endif
+
/* Set the conntrack event mask if given. NEW and DELETE events have
* their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener
* typically would receive many kinds of updates. Setting the event
@@ -1655,7 +1781,410 @@ static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info)
nf_ct_tmpl_free(ct_info->ct);
}
-void ovs_ct_init(struct net *net)
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static int ovs_ct_limit_init(struct net *net, struct ovs_net *ovs_net)
+{
+ int i, err;
+
+ ovs_net->ct_limit_info = kmalloc(sizeof(*ovs_net->ct_limit_info),
+ GFP_KERNEL);
+ if (!ovs_net->ct_limit_info)
+ return -ENOMEM;
+
+ ovs_net->ct_limit_info->default_limit = OVS_CT_LIMIT_DEFAULT;
+ ovs_net->ct_limit_info->limits =
+ kmalloc_array(CT_LIMIT_HASH_BUCKETS, sizeof(struct hlist_head),
+ GFP_KERNEL);
+ if (!ovs_net->ct_limit_info->limits) {
+ kfree(ovs_net->ct_limit_info);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < CT_LIMIT_HASH_BUCKETS; i++)
+ INIT_HLIST_HEAD(&ovs_net->ct_limit_info->limits[i]);
+
+ ovs_net->ct_limit_info->data =
+ nf_conncount_init(net, NFPROTO_INET, sizeof(u32));
+
+ if (IS_ERR(ovs_net->ct_limit_info->data)) {
+ err = PTR_ERR(ovs_net->ct_limit_info->data);
+ kfree(ovs_net->ct_limit_info->limits);
+ kfree(ovs_net->ct_limit_info);
+ return err;
+ }
+ return 0;
+}
+
+static void ovs_ct_limit_exit(struct net *net, struct ovs_net *ovs_net)
+{
+ const struct ovs_ct_limit_info *info = ovs_net->ct_limit_info;
+ int i;
+
+ nf_conncount_destroy(net, NFPROTO_INET, info->data);
+ for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+ struct hlist_head *head = &info->limits[i];
+ struct ovs_ct_limit *ct_limit;
+
+ hlist_for_each_entry_rcu(ct_limit, head, hlist_node)
+ kfree_rcu(ct_limit, rcu);
+ }
+ kfree(ovs_net->ct_limit_info->limits);
+ kfree(ovs_net->ct_limit_info);
+}
+
+static struct sk_buff *
+ovs_ct_limit_cmd_reply_start(struct genl_info *info, u8 cmd,
+ struct ovs_header **ovs_reply_header)
+{
+ struct ovs_header *ovs_header = info->userhdr;
+ struct sk_buff *skb;
+
+ skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+ if (!skb)
+ return ERR_PTR(-ENOMEM);
+
+ *ovs_reply_header = genlmsg_put(skb, info->snd_portid,
+ info->snd_seq,
+ &dp_ct_limit_genl_family, 0, cmd);
+
+ if (!*ovs_reply_header) {
+ nlmsg_free(skb);
+ return ERR_PTR(-EMSGSIZE);
+ }
+ (*ovs_reply_header)->dp_ifindex = ovs_header->dp_ifindex;
+
+ return skb;
+}
+
+static bool check_zone_id(int zone_id, u16 *pzone)
+{
+ if (zone_id >= 0 && zone_id <= 65535) {
+ *pzone = (u16)zone_id;
+ return true;
+ }
+ return false;
+}
+
+static int ovs_ct_limit_set_zone_limit(struct nlattr *nla_zone_limit,
+ struct ovs_ct_limit_info *info)
+{
+ struct ovs_zone_limit *zone_limit;
+ int rem;
+ u16 zone;
+
+ rem = NLA_ALIGN(nla_len(nla_zone_limit));
+ zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+ while (rem >= sizeof(*zone_limit)) {
+ if (unlikely(zone_limit->zone_id == -1)) {
+ ovs_lock();
+ info->default_limit = zone_limit->limit;
+ ovs_unlock();
+ } else if (unlikely(!check_zone_id(
+ zone_limit->zone_id, &zone))) {
+ OVS_NLERR(true, "zone id is out of range");
+ } else {
+ struct ovs_ct_limit *ct_limit;
+
+ ct_limit = kmalloc(sizeof(*ct_limit), GFP_KERNEL);
+ if (!ct_limit)
+ return -ENOMEM;
+
+ ct_limit->zone = zone;
+ ct_limit->limit = zone_limit->limit;
+
+ ovs_lock();
+ ct_limit_set(info, ct_limit);
+ ovs_unlock();
+ }
+ rem -= NLA_ALIGN(sizeof(*zone_limit));
+ zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+ NLA_ALIGN(sizeof(*zone_limit)));
+ }
+
+ if (rem)
+ OVS_NLERR(true, "set zone limit has %d unknown bytes", rem);
+
+ return 0;
+}
+
+static int ovs_ct_limit_del_zone_limit(struct nlattr *nla_zone_limit,
+ struct ovs_ct_limit_info *info)
+{
+ struct ovs_zone_limit *zone_limit;
+ int rem;
+ u16 zone;
+
+ rem = NLA_ALIGN(nla_len(nla_zone_limit));
+ zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+ while (rem >= sizeof(*zone_limit)) {
+ if (unlikely(!check_zone_id(zone_limit->zone_id, &zone))) {
+ OVS_NLERR(true, "zone id is out of range");
+ } else {
+ ovs_lock();
+ ct_limit_del(info, zone);
+ ovs_unlock();
+ }
+ rem -= NLA_ALIGN(sizeof(*zone_limit));
+ zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+ NLA_ALIGN(sizeof(*zone_limit)));
+ }
+
+ if (rem)
+ OVS_NLERR(true, "del zone limit has %d unknown bytes", rem);
+
+ return 0;
+}
+
+static int ovs_ct_limit_get_default_limit(struct ovs_ct_limit_info *info,
+ struct sk_buff *reply)
+{
+ struct ovs_zone_limit zone_limit;
+ int err;
+
+ zone_limit.zone_id = -1;
+ zone_limit.limit = info->default_limit;
+ err = nla_put_nohdr(reply, sizeof(zone_limit), &zone_limit);
+ if (err)
+ return err;
+
+ return 0;
+}
+
+static int ovs_ct_limit_get_zone_limit(struct net *net,
+ struct nlattr *nla_zone_limit,
+ struct ovs_ct_limit_info *info,
+ struct sk_buff *reply)
+{
+ struct nf_conntrack_zone ct_zone;
+ struct ovs_zone_limit *zone_limit;
+ int rem, err;
+ u32 conncount_key[5];
+ u16 zone;
+
+ rem = NLA_ALIGN(nla_len(nla_zone_limit));
+ zone_limit = (struct ovs_zone_limit *)nla_data(nla_zone_limit);
+
+ while (rem >= sizeof(*zone_limit)) {
+ if (unlikely(zone_limit->zone_id == -1)) {
+ err = ovs_ct_limit_get_default_limit(info, reply);
+ if (err)
+ return err;
+ } else if (unlikely(!check_zone_id(zone_limit->zone_id,
+ &zone))) {
+ OVS_NLERR(true, "zone id is out of range");
+ } else {
+ rcu_read_lock();
+ zone_limit->limit = ct_limit_get(info, zone);
+ rcu_read_unlock();
+
+ nf_ct_zone_init(&ct_zone, zone, NF_CT_DEFAULT_ZONE_DIR,
+ 0);
+ conncount_key[0] = zone;
+ zone_limit->count = nf_conncount_count(
+ net, info->data, conncount_key, NULL, &ct_zone);
+ err = nla_put_nohdr(reply, sizeof(*zone_limit),
+ zone_limit);
+ if (err)
+ return err;
+ }
+ rem -= NLA_ALIGN(sizeof(*zone_limit));
+ zone_limit = (struct ovs_zone_limit *)((u8 *)zone_limit +
+ NLA_ALIGN(sizeof(*zone_limit)));
+ }
+
+ if (rem)
+ OVS_NLERR(true, "get zone limit has %d unknown bytes", rem);
+
+ return 0;
+}
+
+static int ovs_ct_limit_get_all_zone_limit(struct net *net,
+ struct ovs_ct_limit_info *info,
+ struct sk_buff *reply)
+{
+ struct nf_conntrack_zone ct_zone;
+ struct ovs_zone_limit zone_limit;
+ struct ovs_ct_limit *ct_limit;
+ struct hlist_head *head;
+ u32 conncount_key[5];
+ int i, err = 0;
+
+ err = ovs_ct_limit_get_default_limit(info, reply);
+ if (err)
+ return err;
+
+ rcu_read_lock();
+ for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+ head = &info->limits[i];
+ hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+ zone_limit.zone_id = ct_limit->zone;
+ zone_limit.limit = ct_limit->limit;
+ nf_ct_zone_init(&ct_zone, ct_limit->zone,
+ NF_CT_DEFAULT_ZONE_DIR, 0);
+
+ conncount_key[0] = ct_limit->zone;
+ zone_limit.count = nf_conncount_count(net, info->data,
+ conncount_key, NULL, &ct_zone);
+ err = nla_put_nohdr(reply, sizeof(zone_limit),
+ &zone_limit);
+ if (err)
+ goto exit_err;
+ }
+ }
+
+exit_err:
+ rcu_read_unlock();
+ return err;
+}
+
+static int ovs_ct_limit_cmd_set(struct sk_buff *skb, struct genl_info *info)
+{
+ struct nlattr **a = info->attrs;
+ struct sk_buff *reply;
+ struct ovs_header *ovs_reply_header;
+ struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+ struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+ int err;
+
+ reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_SET,
+ &ovs_reply_header);
+ if (IS_ERR(reply))
+ return PTR_ERR(reply);
+
+ if (!a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+ err = -EINVAL;
+ goto exit_err;
+ }
+
+ err = ovs_ct_limit_set_zone_limit(a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT],
+ ct_limit_info);
+ if (err)
+ goto exit_err;
+
+ static_branch_enable(&ovs_ct_limit_enabled);
+
+ genlmsg_end(reply, ovs_reply_header);
+ return genlmsg_reply(reply, info);
+
+exit_err:
+ nlmsg_free(reply);
+ return err;
+}
+
+static int ovs_ct_limit_cmd_del(struct sk_buff *skb, struct genl_info *info)
+{
+ struct nlattr **a = info->attrs;
+ struct sk_buff *reply;
+ struct ovs_header *ovs_reply_header;
+ struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+ struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+ int err;
+
+ reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_DEL,
+ &ovs_reply_header);
+ if (IS_ERR(reply))
+ return PTR_ERR(reply);
+
+ if (!a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+ err = -EINVAL;
+ goto exit_err;
+ }
+
+ err = ovs_ct_limit_del_zone_limit(a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT],
+ ct_limit_info);
+ if (err)
+ goto exit_err;
+
+ genlmsg_end(reply, ovs_reply_header);
+ return genlmsg_reply(reply, info);
+
+exit_err:
+ nlmsg_free(reply);
+ return err;
+}
+
+static int ovs_ct_limit_cmd_get(struct sk_buff *skb, struct genl_info *info)
+{
+ struct nlattr **a = info->attrs;
+ struct nlattr *nla_reply;
+ struct sk_buff *reply;
+ struct ovs_header *ovs_reply_header;
+ struct net *net = sock_net(skb->sk);
+ struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+ struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+ int err;
+
+ reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_GET,
+ &ovs_reply_header);
+ if (IS_ERR(reply))
+ return PTR_ERR(reply);
+
+ nla_reply = nla_nest_start(reply, OVS_CT_LIMIT_ATTR_ZONE_LIMIT);
+
+ if (a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT]) {
+ err = ovs_ct_limit_get_zone_limit(
+ net, a[OVS_CT_LIMIT_ATTR_ZONE_LIMIT], ct_limit_info,
+ reply);
+ if (err)
+ goto exit_err;
+ } else {
+ err = ovs_ct_limit_get_all_zone_limit(net, ct_limit_info,
+ reply);
+ if (err)
+ goto exit_err;
+ }
+
+ nla_nest_end(reply, nla_reply);
+ genlmsg_end(reply, ovs_reply_header);
+ return genlmsg_reply(reply, info);
+
+exit_err:
+ nlmsg_free(reply);
+ return err;
+}
+
+static struct genl_ops ct_limit_genl_ops[] = {
+ { .cmd = OVS_CT_LIMIT_CMD_SET,
+ .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+ * privilege. */
+ .policy = ct_limit_policy,
+ .doit = ovs_ct_limit_cmd_set,
+ },
+ { .cmd = OVS_CT_LIMIT_CMD_DEL,
+ .flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+ * privilege. */
+ .policy = ct_limit_policy,
+ .doit = ovs_ct_limit_cmd_del,
+ },
+ { .cmd = OVS_CT_LIMIT_CMD_GET,
+ .flags = 0, /* OK for unprivileged users. */
+ .policy = ct_limit_policy,
+ .doit = ovs_ct_limit_cmd_get,
+ },
+};
+
+static const struct genl_multicast_group ovs_ct_limit_multicast_group = {
+ .name = OVS_CT_LIMIT_MCGROUP,
+};
+
+struct genl_family dp_ct_limit_genl_family __ro_after_init = {
+ .hdrsize = sizeof(struct ovs_header),
+ .name = OVS_CT_LIMIT_FAMILY,
+ .version = OVS_CT_LIMIT_VERSION,
+ .maxattr = OVS_CT_LIMIT_ATTR_MAX,
+ .netnsok = true,
+ .parallel_ops = true,
+ .ops = ct_limit_genl_ops,
+ .n_ops = ARRAY_SIZE(ct_limit_genl_ops),
+ .mcgrps = &ovs_ct_limit_multicast_group,
+ .n_mcgrps = 1,
+ .module = THIS_MODULE,
+};
+#endif
+
+int ovs_ct_init(struct net *net)
{
unsigned int n_bits = sizeof(struct ovs_key_ct_labels) * BITS_PER_BYTE;
struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
@@ -1666,12 +2195,22 @@ void ovs_ct_init(struct net *net)
} else {
ovs_net->xt_label = true;
}
+
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+ return ovs_ct_limit_init(net, ovs_net);
+#else
+ return 0;
+#endif
}
void ovs_ct_exit(struct net *net)
{
struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+ ovs_ct_limit_exit(net, ovs_net);
+#endif
+
if (ovs_net->xt_label)
nf_connlabels_put(net);
}
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 399dfdd2c4f9..900dadd70974 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -17,10 +17,11 @@
#include "flow.h"
struct ovs_conntrack_info;
+struct ovs_ct_limit_info;
enum ovs_key_attr;
#if IS_ENABLED(CONFIG_NF_CONNTRACK)
-void ovs_ct_init(struct net *);
+int ovs_ct_init(struct net *);
void ovs_ct_exit(struct net *);
bool ovs_ct_verify(struct net *, enum ovs_key_attr attr);
int ovs_ct_copy_action(struct net *, const struct nlattr *,
@@ -44,7 +45,7 @@ void ovs_ct_free_action(const struct nlattr *a);
#else
#include <linux/errno.h>
-static inline void ovs_ct_init(struct net *net) { }
+static inline int ovs_ct_init(struct net *net) { return 0; }
static inline void ovs_ct_exit(struct net *net) { }
@@ -104,4 +105,8 @@ static inline void ovs_ct_free_action(const struct nlattr *a) { }
#define CT_SUPPORTED_MASK 0
#endif /* CONFIG_NF_CONNTRACK */
+
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+extern struct genl_family dp_ct_limit_genl_family;
+#endif
#endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 015e24e08909..a61818e94396 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2288,6 +2288,9 @@ static struct genl_family * const dp_genl_families[] = {
&dp_flow_genl_family,
&dp_packet_genl_family,
&dp_meter_genl_family,
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+ &dp_ct_limit_genl_family,
+#endif
};
static void dp_unregister_genl(int n_families)
@@ -2323,8 +2326,7 @@ static int __net_init ovs_init_net(struct net *net)
INIT_LIST_HEAD(&ovs_net->dps);
INIT_WORK(&ovs_net->dp_notify_work, ovs_dp_notify_wq);
- ovs_ct_init(net);
- return 0;
+ return ovs_ct_init(net);
}
static void __net_exit list_vports_from_net(struct net *net, struct net *dnet,
@@ -2469,3 +2471,4 @@ MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
MODULE_ALIAS_GENL_FAMILY(OVS_METER_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_CT_LIMIT_FAMILY);
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 523d65526766..c9eb267c6f7e 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -144,6 +144,9 @@ struct dp_upcall_info {
struct ovs_net {
struct list_head dps;
struct work_struct dp_notify_work;
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+ struct ovs_ct_limit_info *ct_limit_info;
+#endif
/* Module reference for configuring conntrack. */
bool xt_label;
--
2.7.4
^ permalink raw reply related
* [PATCH net-next v4 1/2] openvswitch: Add conntrack limit netlink definition
From: Yi-Hung Wei @ 2018-05-22 0:16 UTC (permalink / raw)
To: netdev, pshelar; +Cc: Yi-Hung Wei
In-Reply-To: <1526948165-32443-1-git-send-email-yihung.wei@gmail.com>
Define netlink messages and attributes to support user kernel
communication that uses the conntrack limit feature.
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
include/uapi/linux/openvswitch.h | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..d8da2b7591f5 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,30 @@ enum ovs_meter_band_type {
#define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+ OVS_CT_LIMIT_CMD_UNSPEC,
+ OVS_CT_LIMIT_CMD_SET, /* Add or modify ct limit. */
+ OVS_CT_LIMIT_CMD_DEL, /* Delete ct limit. */
+ OVS_CT_LIMIT_CMD_GET /* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+ OVS_CT_LIMIT_ATTR_UNSPEC,
+ OVS_CT_LIMIT_ATTR_ZONE_LIMIT, /* Nested struct ovs_zone_limit. */
+ __OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+struct ovs_zone_limit {
+ int zone_id;
+ __u32 limit;
+ __u32 count;
+};
+
#endif /* _LINUX_OPENVSWITCH_H */
--
2.7.4
^ permalink raw reply related
* [PATCH net-next v4 0/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-05-22 0:16 UTC (permalink / raw)
To: netdev, pshelar; +Cc: Yi-Hung Wei
Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max. In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table. Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.
To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone. For example, we can designate different zone to different VM,
and set conntrack limit to each zone. By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs. Moreover, the users can
set various conntrack limit to different zone based on their preference.
The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone. If the number of
connection is above a configured limitation, OVS will return ENOMEM to the
userspace. If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.
The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.
v3->v4:
- Addresses comments from Parvin that include simplify netlink API,
and remove unncessary RCU lockings.
- Rebases to master.
v2->v3:
- Addresses comments from Parvin that include using static keys to check
if ovs_ct_limit features is used, only check ct_limit when a ct entry
is unconfirmed, and reports rate limited warning messages when the ct
limit is reached.
- Rebases to master.
v1->v2:
- Fixes commit log typos suggested by Greg.
- Fixes memory free issue that Julia found.
Yi-Hung Wei (2):
openvswitch: Add conntrack limit netlink definition
openvswitch: Support conntrack zone limit
include/uapi/linux/openvswitch.h | 26 ++
net/openvswitch/Kconfig | 3 +-
net/openvswitch/conntrack.c | 541 ++++++++++++++++++++++++++++++++++++++-
net/openvswitch/conntrack.h | 9 +-
net/openvswitch/datapath.c | 7 +-
net/openvswitch/datapath.h | 3 +
6 files changed, 583 insertions(+), 6 deletions(-)
--
2.7.4
^ permalink raw reply
* Re: [PATCH bpf-next 1/5] bpf: Hooks for sys_sendmsg
From: kbuild test robot @ 2018-05-22 0:08 UTC (permalink / raw)
To: Andrey Ignatov
Cc: kbuild-all, netdev, Andrey Ignatov, davem, ast, daniel,
kernel-team
In-Reply-To: <654b59700a21ed2342b12f30a583ba9329ef850b.1526694154.git.rdna@fb.com>
[-- Attachment #1: Type: text/plain, Size: 10869 bytes --]
Hi Andrey,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on bpf-next/master]
url: https://github.com/0day-ci/linux/commits/Andrey-Ignatov/bpf-Hooks-for-sys_sendmsg/20180522-065614
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-s1-201820 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386
All errors (new ones prefixed by >>):
net/ipv4/udp.c: In function 'udp_sendmsg':
net/ipv4/udp.c:1014:44: error: macro "BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK" passed 3 arguments, but takes just 2
(struct sockaddr *)usin, &ipc.addr);
^
>> net/ipv4/udp.c:1013:9: error: 'BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK' undeclared (first use in this function)
err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/ipv4/udp.c:1013:9: note: each undeclared identifier is reported only once for each function it appears in
--
net/ipv6/udp.c: In function 'udpv6_sendmsg':
net/ipv6/udp.c:1320:44: error: macro "BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK" passed 3 arguments, but takes just 2
(struct sockaddr *)sin6, &fl6.saddr);
^
>> net/ipv6/udp.c:1319:9: error: 'BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK' undeclared (first use in this function)
err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/ipv6/udp.c:1319:9: note: each undeclared identifier is reported only once for each function it appears in
vim +/BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK +1013 net/ipv4/udp.c
898
899 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
900 {
901 struct inet_sock *inet = inet_sk(sk);
902 struct udp_sock *up = udp_sk(sk);
903 DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
904 struct flowi4 fl4_stack;
905 struct flowi4 *fl4;
906 int ulen = len;
907 struct ipcm_cookie ipc;
908 struct rtable *rt = NULL;
909 int free = 0;
910 int connected = 0;
911 __be32 daddr, faddr, saddr;
912 __be16 dport;
913 u8 tos;
914 int err, is_udplite = IS_UDPLITE(sk);
915 int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
916 int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
917 struct sk_buff *skb;
918 struct ip_options_data opt_copy;
919
920 if (len > 0xFFFF)
921 return -EMSGSIZE;
922
923 /*
924 * Check the flags.
925 */
926
927 if (msg->msg_flags & MSG_OOB) /* Mirror BSD error message compatibility */
928 return -EOPNOTSUPP;
929
930 ipc.opt = NULL;
931 ipc.tx_flags = 0;
932 ipc.ttl = 0;
933 ipc.tos = -1;
934
935 getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
936
937 fl4 = &inet->cork.fl.u.ip4;
938 if (up->pending) {
939 /*
940 * There are pending frames.
941 * The socket lock must be held while it's corked.
942 */
943 lock_sock(sk);
944 if (likely(up->pending)) {
945 if (unlikely(up->pending != AF_INET)) {
946 release_sock(sk);
947 return -EINVAL;
948 }
949 goto do_append_data;
950 }
951 release_sock(sk);
952 }
953 ulen += sizeof(struct udphdr);
954
955 /*
956 * Get and verify the address.
957 */
958 if (usin) {
959 if (msg->msg_namelen < sizeof(*usin))
960 return -EINVAL;
961 if (usin->sin_family != AF_INET) {
962 if (usin->sin_family != AF_UNSPEC)
963 return -EAFNOSUPPORT;
964 }
965
966 daddr = usin->sin_addr.s_addr;
967 dport = usin->sin_port;
968 if (dport == 0)
969 return -EINVAL;
970 } else {
971 if (sk->sk_state != TCP_ESTABLISHED)
972 return -EDESTADDRREQ;
973 daddr = inet->inet_daddr;
974 dport = inet->inet_dport;
975 /* Open fast path for connected socket.
976 Route will not be used, if at least one option is set.
977 */
978 connected = 1;
979 }
980
981 ipc.sockc.tsflags = sk->sk_tsflags;
982 ipc.addr = inet->inet_saddr;
983 ipc.oif = sk->sk_bound_dev_if;
984 ipc.gso_size = up->gso_size;
985
986 if (msg->msg_controllen) {
987 err = udp_cmsg_send(sk, msg, &ipc.gso_size);
988 if (err > 0)
989 err = ip_cmsg_send(sk, msg, &ipc,
990 sk->sk_family == AF_INET6);
991 if (unlikely(err < 0)) {
992 kfree(ipc.opt);
993 return err;
994 }
995 if (ipc.opt)
996 free = 1;
997 connected = 0;
998 }
999 if (!ipc.opt) {
1000 struct ip_options_rcu *inet_opt;
1001
1002 rcu_read_lock();
1003 inet_opt = rcu_dereference(inet->inet_opt);
1004 if (inet_opt) {
1005 memcpy(&opt_copy, inet_opt,
1006 sizeof(*inet_opt) + inet_opt->opt.optlen);
1007 ipc.opt = &opt_copy.opt;
1008 }
1009 rcu_read_unlock();
1010 }
1011
1012 if (!connected) {
> 1013 err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> 1014 (struct sockaddr *)usin, &ipc.addr);
1015 if (err)
1016 goto out_free;
1017 if (usin) {
1018 if (usin->sin_port == 0) {
1019 /* BPF program set invalid port. Reject it. */
1020 err = -EINVAL;
1021 goto out_free;
1022 }
1023 daddr = usin->sin_addr.s_addr;
1024 dport = usin->sin_port;
1025 }
1026 }
1027
1028 saddr = ipc.addr;
1029 ipc.addr = faddr = daddr;
1030
1031 sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
1032
1033 if (ipc.opt && ipc.opt->opt.srr) {
1034 if (!daddr) {
1035 err = -EINVAL;
1036 goto out_free;
1037 }
1038 faddr = ipc.opt->opt.faddr;
1039 connected = 0;
1040 }
1041 tos = get_rttos(&ipc, inet);
1042 if (sock_flag(sk, SOCK_LOCALROUTE) ||
1043 (msg->msg_flags & MSG_DONTROUTE) ||
1044 (ipc.opt && ipc.opt->opt.is_strictroute)) {
1045 tos |= RTO_ONLINK;
1046 connected = 0;
1047 }
1048
1049 if (ipv4_is_multicast(daddr)) {
1050 if (!ipc.oif)
1051 ipc.oif = inet->mc_index;
1052 if (!saddr)
1053 saddr = inet->mc_addr;
1054 connected = 0;
1055 } else if (!ipc.oif) {
1056 ipc.oif = inet->uc_index;
1057 } else if (ipv4_is_lbcast(daddr) && inet->uc_index) {
1058 /* oif is set, packet is to local broadcast and
1059 * and uc_index is set. oif is most likely set
1060 * by sk_bound_dev_if. If uc_index != oif check if the
1061 * oif is an L3 master and uc_index is an L3 slave.
1062 * If so, we want to allow the send using the uc_index.
1063 */
1064 if (ipc.oif != inet->uc_index &&
1065 ipc.oif == l3mdev_master_ifindex_by_index(sock_net(sk),
1066 inet->uc_index)) {
1067 ipc.oif = inet->uc_index;
1068 }
1069 }
1070
1071 if (connected)
1072 rt = (struct rtable *)sk_dst_check(sk, 0);
1073
1074 if (!rt) {
1075 struct net *net = sock_net(sk);
1076 __u8 flow_flags = inet_sk_flowi_flags(sk);
1077
1078 fl4 = &fl4_stack;
1079
1080 flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
1081 RT_SCOPE_UNIVERSE, sk->sk_protocol,
1082 flow_flags,
1083 faddr, saddr, dport, inet->inet_sport,
1084 sk->sk_uid);
1085
1086 security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
1087 rt = ip_route_output_flow(net, fl4, sk);
1088 if (IS_ERR(rt)) {
1089 err = PTR_ERR(rt);
1090 rt = NULL;
1091 if (err == -ENETUNREACH)
1092 IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
1093 goto out;
1094 }
1095
1096 err = -EACCES;
1097 if ((rt->rt_flags & RTCF_BROADCAST) &&
1098 !sock_flag(sk, SOCK_BROADCAST))
1099 goto out;
1100 if (connected)
1101 sk_dst_set(sk, dst_clone(&rt->dst));
1102 }
1103
1104 if (msg->msg_flags&MSG_CONFIRM)
1105 goto do_confirm;
1106 back_from_confirm:
1107
1108 saddr = fl4->saddr;
1109 if (!ipc.addr)
1110 daddr = ipc.addr = fl4->daddr;
1111
1112 /* Lockless fast path for the non-corking case. */
1113 if (!corkreq) {
1114 struct inet_cork cork;
1115
1116 skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
1117 sizeof(struct udphdr), &ipc, &rt,
1118 &cork, msg->msg_flags);
1119 err = PTR_ERR(skb);
1120 if (!IS_ERR_OR_NULL(skb))
1121 err = udp_send_skb(skb, fl4, &cork);
1122 goto out;
1123 }
1124
1125 lock_sock(sk);
1126 if (unlikely(up->pending)) {
1127 /* The socket is already corked while preparing it. */
1128 /* ... which is an evident application bug. --ANK */
1129 release_sock(sk);
1130
1131 net_dbg_ratelimited("socket already corked\n");
1132 err = -EINVAL;
1133 goto out;
1134 }
1135 /*
1136 * Now cork the socket to pend data.
1137 */
1138 fl4 = &inet->cork.fl.u.ip4;
1139 fl4->daddr = daddr;
1140 fl4->saddr = saddr;
1141 fl4->fl4_dport = dport;
1142 fl4->fl4_sport = inet->inet_sport;
1143 up->pending = AF_INET;
1144
1145 do_append_data:
1146 up->len += ulen;
1147 err = ip_append_data(sk, fl4, getfrag, msg, ulen,
1148 sizeof(struct udphdr), &ipc, &rt,
1149 corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
1150 if (err)
1151 udp_flush_pending_frames(sk);
1152 else if (!corkreq)
1153 err = udp_push_pending_frames(sk);
1154 else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
1155 up->pending = 0;
1156 release_sock(sk);
1157
1158 out:
1159 ip_rt_put(rt);
1160 out_free:
1161 if (free)
1162 kfree(ipc.opt);
1163 if (!err)
1164 return len;
1165 /*
1166 * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space. Reporting
1167 * ENOBUFS might not be good (it's not tunable per se), but otherwise
1168 * we don't have a good statistic (IpOutDiscards but it can be too many
1169 * things). We could add another new stat but at least for now that
1170 * seems like overkill.
1171 */
1172 if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
1173 UDP_INC_STATS(sock_net(sk),
1174 UDP_MIB_SNDBUFERRORS, is_udplite);
1175 }
1176 return err;
1177
1178 do_confirm:
1179 if (msg->msg_flags & MSG_PROBE)
1180 dst_confirm_neigh(&rt->dst, &fl4->daddr);
1181 if (!(msg->msg_flags&MSG_PROBE) || len)
1182 goto back_from_confirm;
1183 err = 0;
1184 goto out;
1185 }
1186 EXPORT_SYMBOL(udp_sendmsg);
1187
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24538 bytes --]
^ permalink raw reply
* Re: [Cake] [PATCH net-next v14 6/7] sch_cake: Add overhead compensation support to the rate shaper
From: Marcelo Ricardo Leitner @ 2018-05-21 23:45 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: netdev, cake
In-Reply-To: <152693495874.32668.11244869690098290078.stgit@alrua-kau>
On Mon, May 21, 2018 at 10:35:58PM +0200, Toke Høiland-Jørgensen wrote:
> +static u32 cake_overhead(struct cake_sched_data *q, const struct sk_buff *skb)
> +{
> + const struct skb_shared_info *shinfo = skb_shinfo(skb);
> + unsigned int hdr_len, last_len = 0;
> + u32 off = skb_network_offset(skb);
> + u32 len = qdisc_pkt_len(skb);
> + u16 segs = 1;
> +
> + q->avg_netoff = cake_ewma(q->avg_netoff, off << 16, 8);
> +
> + if (!shinfo->gso_size)
> + return cake_calc_overhead(q, len, off);
> +
> + /* borrowed from qdisc_pkt_len_init() */
> + hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
> +
> + /* + transport layer */
> + if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 |
> + SKB_GSO_TCPV6))) {
> + const struct tcphdr *th;
> + struct tcphdr _tcphdr;
> +
> + th = skb_header_pointer(skb, skb_transport_offset(skb),
> + sizeof(_tcphdr), &_tcphdr);
> + if (likely(th))
> + hdr_len += __tcp_hdrlen(th);
> + } else {
I didn't see some code limiting GSO packets to just TCP or UDP. Is it
safe to assume that this packet is an UDP one, and not SCTP or ESP,
for example?
> + struct udphdr _udphdr;
> +
> + if (skb_header_pointer(skb, skb_transport_offset(skb),
> + sizeof(_udphdr), &_udphdr))
> + hdr_len += sizeof(struct udphdr);
> + }
> +
> + if (unlikely(shinfo->gso_type & SKB_GSO_DODGY))
> + segs = DIV_ROUND_UP(skb->len - hdr_len,
> + shinfo->gso_size);
> + else
> + segs = shinfo->gso_segs;
> +
> + len = shinfo->gso_size + hdr_len;
> + last_len = skb->len - shinfo->gso_size * (segs - 1);
> +
> + return (cake_calc_overhead(q, len, off) * (segs - 1) +
> + cake_calc_overhead(q, last_len, off));
> +}
> +
^ permalink raw reply
* [PATCH v2 net] netfilter: provide correct argument to nla_strlcpy()
From: Eric Dumazet @ 2018-05-21 23:35 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Eric Dumazet, Eric Dumazet, Florian Westphal,
Pablo Neira Ayuso
Recent patch forgot to remove nla_data(), upsetting syzkaller a bit.
BUG: KASAN: slab-out-of-bounds in nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
Read of size 1 at addr ffff8801ad1f4fdd by task syz-executor189/4509
CPU: 1 PID: 4509 Comm: syz-executor189 Not tainted 4.17.0-rc6+ #62
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
print_address_description+0x6c/0x20b mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
__asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
nfnl_acct_new+0x574/0xc50 net/netfilter/nfnetlink_acct.c:118
nfnetlink_rcv_msg+0xdb5/0xff0 net/netfilter/nfnetlink.c:212
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
nfnetlink_rcv+0x1fe/0x1ba0 net/netfilter/nfnetlink.c:513
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg+0xd5/0x120 net/socket.c:639
sock_write_iter+0x35a/0x5a0 net/socket.c:908
call_write_iter include/linux/fs.h:1784 [inline]
new_sync_write fs/read_write.c:474 [inline]
__vfs_write+0x64d/0x960 fs/read_write.c:487
vfs_write+0x1f8/0x560 fs/read_write.c:549
ksys_write+0xf9/0x250 fs/read_write.c:598
__do_sys_write fs/read_write.c:610 [inline]
__se_sys_write fs/read_write.c:607 [inline]
__x64_sys_write+0x73/0xb0 fs/read_write.c:607
Fixes: 4e09fc873d92 ("netfilter: prefer nla_strlcpy for dealing with NLA_STRING attributes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
net/netfilter/nfnetlink_acct.c | 2 +-
net/netfilter/nfnetlink_cthelper.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/nfnetlink_acct.c b/net/netfilter/nfnetlink_acct.c
index 6ddf89183e7b47e6c029b28cf5b524c73a790498..a0e5adf0b3b6ddbcd01f956d25dda01c611a7663 100644
--- a/net/netfilter/nfnetlink_acct.c
+++ b/net/netfilter/nfnetlink_acct.c
@@ -115,7 +115,7 @@ static int nfnl_acct_new(struct net *net, struct sock *nfnl,
nfacct->flags = flags;
}
- nla_strlcpy(nfacct->name, nla_data(tb[NFACCT_NAME]), NFACCT_NAME_MAX);
+ nla_strlcpy(nfacct->name, tb[NFACCT_NAME], NFACCT_NAME_MAX);
if (tb[NFACCT_BYTES]) {
atomic64_set(&nfacct->bytes,
diff --git a/net/netfilter/nfnetlink_cthelper.c b/net/netfilter/nfnetlink_cthelper.c
index fa026b269b3691d5186e28020eb2b08e93dc3679..cb5b5f2077774c29fb987b7f1eb2d75e9efc2fe1 100644
--- a/net/netfilter/nfnetlink_cthelper.c
+++ b/net/netfilter/nfnetlink_cthelper.c
@@ -150,7 +150,7 @@ nfnl_cthelper_expect_policy(struct nf_conntrack_expect_policy *expect_policy,
return -EINVAL;
nla_strlcpy(expect_policy->name,
- nla_data(tb[NFCTH_POLICY_NAME]), NF_CT_HELPER_NAME_LEN);
+ tb[NFCTH_POLICY_NAME], NF_CT_HELPER_NAME_LEN);
expect_policy->max_expected =
ntohl(nla_get_be32(tb[NFCTH_POLICY_EXPECT_MAX]));
if (expect_policy->max_expected > NF_CT_EXPECT_MAX_CNT)
@@ -235,7 +235,7 @@ nfnl_cthelper_create(const struct nlattr * const tb[],
goto err1;
nla_strlcpy(helper->name,
- nla_data(tb[NFCTH_NAME]), NF_CT_HELPER_NAME_LEN);
+ tb[NFCTH_NAME], NF_CT_HELPER_NAME_LEN);
size = ntohl(nla_get_be32(tb[NFCTH_PRIV_DATA_LEN]));
if (size > FIELD_SIZEOF(struct nf_conn_help, data)) {
ret = -ENOMEM;
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
* Re: [PATCH net] netfilter: provide correct argument to nla_strlcpy()
From: Eric Dumazet @ 2018-05-21 23:34 UTC (permalink / raw)
To: David Miller; +Cc: netdev, Eric Dumazet, Florian Westphal, Pablo Neira Ayuso
In-Reply-To: <20180521233320.66082-1-edumazet@google.com>
On Mon, May 21, 2018 at 4:33 PM Eric Dumazet <edumazet@google.com> wrote:
> Recent patch forgot to remove nla_data(), upsetting syzkaller a bit.
Humpff.
I forgot to add one file in the change.
Will send V2.
^ permalink raw reply
* Re: [Cake] [PATCH net-next v14 4/7] sch_cake: Add NAT awareness to packet classifier
From: Marcelo Ricardo Leitner @ 2018-05-21 23:34 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: netdev, cake, netfilter-devel
In-Reply-To: <152693495866.32668.5164616056948127124.stgit@alrua-kau>
[Cc'ing netfilter-devel@ for awareness]
On Mon, May 21, 2018 at 10:35:58PM +0200, Toke Høiland-Jørgensen wrote:
> When CAKE is deployed on a gateway that also performs NAT (which is a
> common deployment mode), the host fairness mechanism cannot distinguish
> internal hosts from each other, and so fails to work correctly.
>
> To fix this, we add an optional NAT awareness mode, which will query the
> kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
> and use that in the flow and host hashing.
>
> When the shaper is enabled and the host is already performing NAT, the cost
> of this lookup is negligible. However, in unlimited mode with no NAT being
> performed, there is a significant CPU cost at higher bandwidths. For this
> reason, the feature is turned off by default.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
> ---
> net/sched/sch_cake.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 79 insertions(+)
>
> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
> index 92623160d43e..04364993ce19 100644
> --- a/net/sched/sch_cake.c
> +++ b/net/sched/sch_cake.c
> @@ -71,6 +71,12 @@
> #include <net/tcp.h>
> #include <net/flow_dissector.h>
>
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +#include <net/netfilter/nf_conntrack_core.h>
> +#include <net/netfilter/nf_conntrack_zones.h>
> +#include <net/netfilter/nf_conntrack.h>
> +#endif
> +
> #define CAKE_SET_WAYS (8)
> #define CAKE_MAX_TINS (8)
> #define CAKE_QUEUES (1024)
> @@ -516,6 +522,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
> return drop;
> }
>
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +
> +static void cake_update_flowkeys(struct flow_keys *keys,
> + const struct sk_buff *skb)
> +{
> + const struct nf_conntrack_tuple *tuple;
> + enum ip_conntrack_info ctinfo;
> + struct nf_conn *ct;
> + bool rev = false;
> +
> + if (tc_skb_protocol(skb) != htons(ETH_P_IP))
> + return;
> +
> + ct = nf_ct_get(skb, &ctinfo);
> + if (ct) {
> + tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
> + } else {
> + const struct nf_conntrack_tuple_hash *hash;
> + struct nf_conntrack_tuple srctuple;
> +
> + if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
> + NFPROTO_IPV4, dev_net(skb->dev),
> + &srctuple))
> + return;
> +
> + hash = nf_conntrack_find_get(dev_net(skb->dev),
> + &nf_ct_zone_dflt,
> + &srctuple);
> + if (!hash)
> + return;
> +
> + rev = true;
> + ct = nf_ct_tuplehash_to_ctrack(hash);
> + tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
> + }
> +
> + keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
> + keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
> +
> + if (keys->ports.ports) {
> + keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
> + keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
> + }
> + if (rev)
> + nf_ct_put(ct);
> +}
> +#else
> +static void cake_update_flowkeys(struct flow_keys *keys,
> + const struct sk_buff *skb)
> +{
> + /* There is nothing we can do here without CONNTRACK */
> +}
> +#endif
> +
> /* Cake has several subtle multiple bit settings. In these cases you
> * would be matching triple isolate mode as well.
> */
> @@ -543,6 +603,9 @@ static u32 cake_hash(struct cake_tin_data *q, const struct sk_buff *skb,
> skb_flow_dissect_flow_keys(skb, &keys,
> FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
>
> + if (flow_mode & CAKE_FLOW_NAT_FLAG)
> + cake_update_flowkeys(&keys, skb);
> +
> /* flow_hash_from_keys() sorts the addresses by value, so we have
> * to preserve their order in a separate data structure to treat
> * src and dst host addresses as independently selectable.
> @@ -1894,6 +1957,18 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
> if (err < 0)
> return err;
>
> + if (tb[TCA_CAKE_NAT]) {
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> + q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
> + q->flow_mode |= CAKE_FLOW_NAT_FLAG *
> + !!nla_get_u32(tb[TCA_CAKE_NAT]);
> +#else
> + NL_SET_ERR_MSG_ATTR(extack, "No conntrack support in kernel",
> + tb[TCA_CAKE_NAT]);
> + return -EOPNOTSUPP;
> +#endif
> + }
> +
> if (tb[TCA_CAKE_BASE_RATE64])
> q->rate_bps = nla_get_u64(tb[TCA_CAKE_BASE_RATE64]);
>
> @@ -2066,6 +2141,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
> if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
> goto nla_put_failure;
>
> + if (nla_put_u32(skb, TCA_CAKE_NAT,
> + !!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
> + goto nla_put_failure;
> +
> return nla_nest_end(skb, opts);
>
> nla_put_failure:
>
> _______________________________________________
> Cake mailing list
> Cake@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cake
^ permalink raw reply
* [PATCH net] netfilter: provide correct argument to nla_strlcpy()
From: Eric Dumazet @ 2018-05-21 23:33 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Eric Dumazet, Eric Dumazet, Florian Westphal,
Pablo Neira Ayuso
Recent patch forgot to remove nla_data(), upsetting syzkaller a bit.
BUG: KASAN: slab-out-of-bounds in nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
Read of size 1 at addr ffff8801ad1f4fdd by task syz-executor189/4509
CPU: 1 PID: 4509 Comm: syz-executor189 Not tainted 4.17.0-rc6+ #62
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
print_address_description+0x6c/0x20b mm/kasan/report.c:256
kasan_report_error mm/kasan/report.c:354 [inline]
kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
__asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
nla_strlcpy+0x13d/0x150 lib/nlattr.c:314
nfnl_acct_new+0x574/0xc50 net/netfilter/nfnetlink_acct.c:118
nfnetlink_rcv_msg+0xdb5/0xff0 net/netfilter/nfnetlink.c:212
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
nfnetlink_rcv+0x1fe/0x1ba0 net/netfilter/nfnetlink.c:513
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg+0xd5/0x120 net/socket.c:639
sock_write_iter+0x35a/0x5a0 net/socket.c:908
call_write_iter include/linux/fs.h:1784 [inline]
new_sync_write fs/read_write.c:474 [inline]
__vfs_write+0x64d/0x960 fs/read_write.c:487
vfs_write+0x1f8/0x560 fs/read_write.c:549
ksys_write+0xf9/0x250 fs/read_write.c:598
__do_sys_write fs/read_write.c:610 [inline]
__se_sys_write fs/read_write.c:607 [inline]
__x64_sys_write+0x73/0xb0 fs/read_write.c:607
Fixes: 4e09fc873d92 ("netfilter: prefer nla_strlcpy for dealing with NLA_STRING attributes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
---
net/netfilter/nfnetlink_acct.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/netfilter/nfnetlink_acct.c b/net/netfilter/nfnetlink_acct.c
index 6ddf89183e7b47e6c029b28cf5b524c73a790498..a0e5adf0b3b6ddbcd01f956d25dda01c611a7663 100644
--- a/net/netfilter/nfnetlink_acct.c
+++ b/net/netfilter/nfnetlink_acct.c
@@ -115,7 +115,7 @@ static int nfnl_acct_new(struct net *net, struct sock *nfnl,
nfacct->flags = flags;
}
- nla_strlcpy(nfacct->name, nla_data(tb[NFACCT_NAME]), NFACCT_NAME_MAX);
+ nla_strlcpy(nfacct->name, tb[NFACCT_NAME], NFACCT_NAME_MAX);
if (tb[NFACCT_BYTES]) {
atomic64_set(&nfacct->bytes,
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
* Re: [PATCH bpf-next 1/5] bpf: Hooks for sys_sendmsg
From: kbuild test robot @ 2018-05-21 23:33 UTC (permalink / raw)
To: Andrey Ignatov
Cc: kbuild-all, netdev, Andrey Ignatov, davem, ast, daniel,
kernel-team
In-Reply-To: <654b59700a21ed2342b12f30a583ba9329ef850b.1526694154.git.rdna@fb.com>
[-- Attachment #1: Type: text/plain, Size: 11021 bytes --]
Hi Andrey,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on bpf-next/master]
url: https://github.com/0day-ci/linux/commits/Andrey-Ignatov/bpf-Hooks-for-sys_sendmsg/20180522-065614
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-randconfig-x003-201820 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All errors (new ones prefixed by >>):
net/ipv4/udp.c: In function 'udp_sendmsg':
>> net/ipv4/udp.c:1014:44: error: macro "BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK" passed 3 arguments, but takes just 2
(struct sockaddr *)usin, &ipc.addr);
^
>> net/ipv4/udp.c:1013:9: error: 'BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK' undeclared (first use in this function); did you mean 'BPF_CGROUP_UDP4_SENDMSG'?
err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BPF_CGROUP_UDP4_SENDMSG
net/ipv4/udp.c:1013:9: note: each undeclared identifier is reported only once for each function it appears in
--
net/ipv6/udp.c: In function 'udpv6_sendmsg':
>> net/ipv6/udp.c:1320:44: error: macro "BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK" passed 3 arguments, but takes just 2
(struct sockaddr *)sin6, &fl6.saddr);
^
>> net/ipv6/udp.c:1319:9: error: 'BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK' undeclared (first use in this function); did you mean 'BPF_CGROUP_UDP6_SENDMSG'?
err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BPF_CGROUP_UDP6_SENDMSG
net/ipv6/udp.c:1319:9: note: each undeclared identifier is reported only once for each function it appears in
vim +/BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK +1014 net/ipv4/udp.c
898
899 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
900 {
901 struct inet_sock *inet = inet_sk(sk);
902 struct udp_sock *up = udp_sk(sk);
903 DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
904 struct flowi4 fl4_stack;
905 struct flowi4 *fl4;
906 int ulen = len;
907 struct ipcm_cookie ipc;
908 struct rtable *rt = NULL;
909 int free = 0;
910 int connected = 0;
911 __be32 daddr, faddr, saddr;
912 __be16 dport;
913 u8 tos;
914 int err, is_udplite = IS_UDPLITE(sk);
915 int corkreq = up->corkflag || msg->msg_flags&MSG_MORE;
916 int (*getfrag)(void *, char *, int, int, int, struct sk_buff *);
917 struct sk_buff *skb;
918 struct ip_options_data opt_copy;
919
920 if (len > 0xFFFF)
921 return -EMSGSIZE;
922
923 /*
924 * Check the flags.
925 */
926
927 if (msg->msg_flags & MSG_OOB) /* Mirror BSD error message compatibility */
928 return -EOPNOTSUPP;
929
930 ipc.opt = NULL;
931 ipc.tx_flags = 0;
932 ipc.ttl = 0;
933 ipc.tos = -1;
934
935 getfrag = is_udplite ? udplite_getfrag : ip_generic_getfrag;
936
937 fl4 = &inet->cork.fl.u.ip4;
938 if (up->pending) {
939 /*
940 * There are pending frames.
941 * The socket lock must be held while it's corked.
942 */
943 lock_sock(sk);
944 if (likely(up->pending)) {
945 if (unlikely(up->pending != AF_INET)) {
946 release_sock(sk);
947 return -EINVAL;
948 }
949 goto do_append_data;
950 }
951 release_sock(sk);
952 }
953 ulen += sizeof(struct udphdr);
954
955 /*
956 * Get and verify the address.
957 */
958 if (usin) {
959 if (msg->msg_namelen < sizeof(*usin))
960 return -EINVAL;
961 if (usin->sin_family != AF_INET) {
962 if (usin->sin_family != AF_UNSPEC)
963 return -EAFNOSUPPORT;
964 }
965
966 daddr = usin->sin_addr.s_addr;
967 dport = usin->sin_port;
968 if (dport == 0)
969 return -EINVAL;
970 } else {
971 if (sk->sk_state != TCP_ESTABLISHED)
972 return -EDESTADDRREQ;
973 daddr = inet->inet_daddr;
974 dport = inet->inet_dport;
975 /* Open fast path for connected socket.
976 Route will not be used, if at least one option is set.
977 */
978 connected = 1;
979 }
980
981 ipc.sockc.tsflags = sk->sk_tsflags;
982 ipc.addr = inet->inet_saddr;
983 ipc.oif = sk->sk_bound_dev_if;
984 ipc.gso_size = up->gso_size;
985
986 if (msg->msg_controllen) {
987 err = udp_cmsg_send(sk, msg, &ipc.gso_size);
988 if (err > 0)
989 err = ip_cmsg_send(sk, msg, &ipc,
990 sk->sk_family == AF_INET6);
991 if (unlikely(err < 0)) {
992 kfree(ipc.opt);
993 return err;
994 }
995 if (ipc.opt)
996 free = 1;
997 connected = 0;
998 }
999 if (!ipc.opt) {
1000 struct ip_options_rcu *inet_opt;
1001
1002 rcu_read_lock();
1003 inet_opt = rcu_dereference(inet->inet_opt);
1004 if (inet_opt) {
1005 memcpy(&opt_copy, inet_opt,
1006 sizeof(*inet_opt) + inet_opt->opt.optlen);
1007 ipc.opt = &opt_copy.opt;
1008 }
1009 rcu_read_unlock();
1010 }
1011
1012 if (!connected) {
> 1013 err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> 1014 (struct sockaddr *)usin, &ipc.addr);
1015 if (err)
1016 goto out_free;
1017 if (usin) {
1018 if (usin->sin_port == 0) {
1019 /* BPF program set invalid port. Reject it. */
1020 err = -EINVAL;
1021 goto out_free;
1022 }
1023 daddr = usin->sin_addr.s_addr;
1024 dport = usin->sin_port;
1025 }
1026 }
1027
1028 saddr = ipc.addr;
1029 ipc.addr = faddr = daddr;
1030
1031 sock_tx_timestamp(sk, ipc.sockc.tsflags, &ipc.tx_flags);
1032
1033 if (ipc.opt && ipc.opt->opt.srr) {
1034 if (!daddr) {
1035 err = -EINVAL;
1036 goto out_free;
1037 }
1038 faddr = ipc.opt->opt.faddr;
1039 connected = 0;
1040 }
1041 tos = get_rttos(&ipc, inet);
1042 if (sock_flag(sk, SOCK_LOCALROUTE) ||
1043 (msg->msg_flags & MSG_DONTROUTE) ||
1044 (ipc.opt && ipc.opt->opt.is_strictroute)) {
1045 tos |= RTO_ONLINK;
1046 connected = 0;
1047 }
1048
1049 if (ipv4_is_multicast(daddr)) {
1050 if (!ipc.oif)
1051 ipc.oif = inet->mc_index;
1052 if (!saddr)
1053 saddr = inet->mc_addr;
1054 connected = 0;
1055 } else if (!ipc.oif) {
1056 ipc.oif = inet->uc_index;
1057 } else if (ipv4_is_lbcast(daddr) && inet->uc_index) {
1058 /* oif is set, packet is to local broadcast and
1059 * and uc_index is set. oif is most likely set
1060 * by sk_bound_dev_if. If uc_index != oif check if the
1061 * oif is an L3 master and uc_index is an L3 slave.
1062 * If so, we want to allow the send using the uc_index.
1063 */
1064 if (ipc.oif != inet->uc_index &&
1065 ipc.oif == l3mdev_master_ifindex_by_index(sock_net(sk),
1066 inet->uc_index)) {
1067 ipc.oif = inet->uc_index;
1068 }
1069 }
1070
1071 if (connected)
1072 rt = (struct rtable *)sk_dst_check(sk, 0);
1073
1074 if (!rt) {
1075 struct net *net = sock_net(sk);
1076 __u8 flow_flags = inet_sk_flowi_flags(sk);
1077
1078 fl4 = &fl4_stack;
1079
1080 flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
1081 RT_SCOPE_UNIVERSE, sk->sk_protocol,
1082 flow_flags,
1083 faddr, saddr, dport, inet->inet_sport,
1084 sk->sk_uid);
1085
1086 security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
1087 rt = ip_route_output_flow(net, fl4, sk);
1088 if (IS_ERR(rt)) {
1089 err = PTR_ERR(rt);
1090 rt = NULL;
1091 if (err == -ENETUNREACH)
1092 IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
1093 goto out;
1094 }
1095
1096 err = -EACCES;
1097 if ((rt->rt_flags & RTCF_BROADCAST) &&
1098 !sock_flag(sk, SOCK_BROADCAST))
1099 goto out;
1100 if (connected)
1101 sk_dst_set(sk, dst_clone(&rt->dst));
1102 }
1103
1104 if (msg->msg_flags&MSG_CONFIRM)
1105 goto do_confirm;
1106 back_from_confirm:
1107
1108 saddr = fl4->saddr;
1109 if (!ipc.addr)
1110 daddr = ipc.addr = fl4->daddr;
1111
1112 /* Lockless fast path for the non-corking case. */
1113 if (!corkreq) {
1114 struct inet_cork cork;
1115
1116 skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
1117 sizeof(struct udphdr), &ipc, &rt,
1118 &cork, msg->msg_flags);
1119 err = PTR_ERR(skb);
1120 if (!IS_ERR_OR_NULL(skb))
1121 err = udp_send_skb(skb, fl4, &cork);
1122 goto out;
1123 }
1124
1125 lock_sock(sk);
1126 if (unlikely(up->pending)) {
1127 /* The socket is already corked while preparing it. */
1128 /* ... which is an evident application bug. --ANK */
1129 release_sock(sk);
1130
1131 net_dbg_ratelimited("socket already corked\n");
1132 err = -EINVAL;
1133 goto out;
1134 }
1135 /*
1136 * Now cork the socket to pend data.
1137 */
1138 fl4 = &inet->cork.fl.u.ip4;
1139 fl4->daddr = daddr;
1140 fl4->saddr = saddr;
1141 fl4->fl4_dport = dport;
1142 fl4->fl4_sport = inet->inet_sport;
1143 up->pending = AF_INET;
1144
1145 do_append_data:
1146 up->len += ulen;
1147 err = ip_append_data(sk, fl4, getfrag, msg, ulen,
1148 sizeof(struct udphdr), &ipc, &rt,
1149 corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
1150 if (err)
1151 udp_flush_pending_frames(sk);
1152 else if (!corkreq)
1153 err = udp_push_pending_frames(sk);
1154 else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
1155 up->pending = 0;
1156 release_sock(sk);
1157
1158 out:
1159 ip_rt_put(rt);
1160 out_free:
1161 if (free)
1162 kfree(ipc.opt);
1163 if (!err)
1164 return len;
1165 /*
1166 * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space. Reporting
1167 * ENOBUFS might not be good (it's not tunable per se), but otherwise
1168 * we don't have a good statistic (IpOutDiscards but it can be too many
1169 * things). We could add another new stat but at least for now that
1170 * seems like overkill.
1171 */
1172 if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
1173 UDP_INC_STATS(sock_net(sk),
1174 UDP_MIB_SNDBUFERRORS, is_udplite);
1175 }
1176 return err;
1177
1178 do_confirm:
1179 if (msg->msg_flags & MSG_PROBE)
1180 dst_confirm_neigh(&rt->dst, &fl4->daddr);
1181 if (!(msg->msg_flags&MSG_PROBE) || len)
1182 goto back_from_confirm;
1183 err = 0;
1184 goto out;
1185 }
1186 EXPORT_SYMBOL(udp_sendmsg);
1187
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27551 bytes --]
^ permalink raw reply
* Re: [PATCH bpf-next 1/5] bpf: Hooks for sys_sendmsg
From: Martin KaFai Lau @ 2018-05-21 23:16 UTC (permalink / raw)
To: Andrey Ignatov; +Cc: netdev, davem, ast, daniel, kernel-team
In-Reply-To: <654b59700a21ed2342b12f30a583ba9329ef850b.1526694154.git.rdna@fb.com>
On Fri, May 18, 2018 at 07:21:09PM -0700, Andrey Ignatov wrote:
> In addition to already existing BPF hooks for sys_bind and sys_connect,
> the patch provides new hooks for sys_sendmsg.
>
> It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
> that provides access to socket itlself (properties like family, type,
> protocol) and user-passed `struct sockaddr *` so that BPF program can
> override destination IP and port for system calls such as sendto(2) or
> sendmsg(2) and/or assign source IP to the socket.
>
> The hooks are implemented as two new attach types:
> `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
> UDPv6 correspondingly.
>
> UDPv4 and UDPv6 separate attach types for same reason as sys_bind and
> sys_connect hooks, i.e. to prevent reading from / writing to e.g.
> user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound.
>
> The difference with already existing hooks is sys_sendmsg are
> implemented only for unconnected UDP.
>
> For TCP it doesn't make sense to change user-provided `struct sockaddr *`
> at sendto(2)/sendmsg(2) time since socket either was already connected
> and has source/destination set or wasn't connected and call to
> sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
>
> Connected UDP is already handled by sys_connect hooks that can override
> source/destination at connect time and use fast-path later, i.e. these
> hooks don't affect UDP fast-path.
>
> Rewriting source IP is implemented differently than that in sys_connect
> hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
> just bind socket to desired local IP address since source IP can be set
> on per-packet basis by using ancillary data (cmsg(3)). So no matter if
> socket is bound or not, source IP has to be rewritten on every call to
> sys_sendmsg.
>
> To do so two new fields are added to UAPI `struct bpf_sock_addr`;
> * `msg_src_ip4` to set source IPv4 for UDPv4;
> * `msg_src_ip6` to set source IPv6 for UDPv6.
>
> Signed-off-by: Andrey Ignatov <rdna@fb.com>
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/bpf-cgroup.h | 23 +++++++++++++++++------
> include/linux/filter.h | 1 +
> include/uapi/linux/bpf.h | 8 ++++++++
> kernel/bpf/cgroup.c | 11 ++++++++++-
> kernel/bpf/syscall.c | 8 ++++++++
> net/core/filter.c | 39 +++++++++++++++++++++++++++++++++++++++
> net/ipv4/udp.c | 20 ++++++++++++++++++--
> net/ipv6/udp.c | 17 +++++++++++++++++
> 8 files changed, 118 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> index 30d15e6..46f01ba 100644
> --- a/include/linux/bpf-cgroup.h
> +++ b/include/linux/bpf-cgroup.h
> @@ -66,7 +66,8 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
>
> int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> struct sockaddr *uaddr,
> - enum bpf_attach_type type);
> + enum bpf_attach_type type,
> + void *t_ctx);
>
> int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
> struct bpf_sock_ops_kern *sock_ops,
> @@ -120,16 +121,18 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
> ({ \
> int __ret = 0; \
> if (cgroup_bpf_enabled) \
> - __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type); \
> + __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type, \
> + NULL); \
> __ret; \
> })
>
> -#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type) \
> +#define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, type, t_ctx) \
> ({ \
> int __ret = 0; \
> if (cgroup_bpf_enabled) { \
> lock_sock(sk); \
> - __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type); \
> + __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, type, \
> + t_ctx); \
> release_sock(sk); \
> } \
> __ret; \
> @@ -151,10 +154,16 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
> BPF_CGROUP_RUN_SA_PROG(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
>
> #define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) \
> - BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET4_CONNECT)
> + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET4_CONNECT, NULL)
>
> #define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) \
> - BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET6_CONNECT)
> + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_INET6_CONNECT, NULL)
> +
> +#define BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk, uaddr, t_ctx) \
> + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP4_SENDMSG, t_ctx)
> +
> +#define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx) \
> + BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_SENDMSG, t_ctx)
>
> #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) \
> ({ \
> @@ -197,6 +206,8 @@ static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; }
> #define BPF_CGROUP_RUN_PROG_INET4_CONNECT_LOCK(sk, uaddr) ({ 0; })
> #define BPF_CGROUP_RUN_PROG_INET6_CONNECT(sk, uaddr) ({ 0; })
> #define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk, uaddr) ({ 0; })
> +#define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr) ({ 0; })
> #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
> #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index d358d18..d90abda 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1010,6 +1010,7 @@ struct bpf_sock_addr_kern {
> * only two (src and dst) are available at convert_ctx_access time
> */
> u64 tmp_reg;
> + void *t_ctx; /* Attach type specific context. */
> };
>
> struct bpf_sock_ops_kern {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 97446bb..b70ad2c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -158,6 +158,8 @@ enum bpf_attach_type {
> BPF_CGROUP_INET6_CONNECT,
> BPF_CGROUP_INET4_POST_BIND,
> BPF_CGROUP_INET6_POST_BIND,
> + BPF_CGROUP_UDP4_SENDMSG,
> + BPF_CGROUP_UDP6_SENDMSG,
> __MAX_BPF_ATTACH_TYPE
> };
>
> @@ -2247,6 +2249,12 @@ struct bpf_sock_addr {
> __u32 family; /* Allows 4-byte read, but no write */
> __u32 type; /* Allows 4-byte read, but no write */
> __u32 protocol; /* Allows 4-byte read, but no write */
> + __u32 msg_src_ip4; /* Allows 1,2,4-byte read an 4-byte write.
> + * Stored in network byte order.
> + */
> + __u32 msg_src_ip6[4]; /* Allows 1,2,4-byte read an 4-byte write.
> + * Stored in network byte order.
> + */
> };
>
> /* User bpf_sock_ops struct to access socket values and specify request ops
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 43171a0..f7c00bd 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -500,6 +500,7 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
> * @sk: sock struct that will use sockaddr
> * @uaddr: sockaddr struct provided by user
> * @type: The type of program to be exectuted
> + * @t_ctx: Pointer to attach type specific context
> *
> * socket is expected to be of type INET or INET6.
> *
> @@ -508,12 +509,15 @@ EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
> */
> int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> struct sockaddr *uaddr,
> - enum bpf_attach_type type)
> + enum bpf_attach_type type,
> + void *t_ctx)
> {
> struct bpf_sock_addr_kern ctx = {
> .sk = sk,
> .uaddr = uaddr,
> + .t_ctx = t_ctx,
> };
> + struct sockaddr_storage unspec;
> struct cgroup *cgrp;
> int ret;
>
> @@ -523,6 +527,11 @@ int __cgroup_bpf_run_filter_sock_addr(struct sock *sk,
> if (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)
> return 0;
>
> + if (!ctx.uaddr) {
> + memset(&unspec, 0, sizeof(unspec));
> + ctx.uaddr = (struct sockaddr *)&unspec;
> + }
> +
> cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN);
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index bfcde94..11a5a95 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1247,6 +1247,8 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
> case BPF_CGROUP_INET6_BIND:
> case BPF_CGROUP_INET4_CONNECT:
> case BPF_CGROUP_INET6_CONNECT:
> + case BPF_CGROUP_UDP4_SENDMSG:
> + case BPF_CGROUP_UDP6_SENDMSG:
> return 0;
> default:
> return -EINVAL;
> @@ -1563,6 +1565,8 @@ static int bpf_prog_attach(const union bpf_attr *attr)
> case BPF_CGROUP_INET6_BIND:
> case BPF_CGROUP_INET4_CONNECT:
> case BPF_CGROUP_INET6_CONNECT:
> + case BPF_CGROUP_UDP4_SENDMSG:
> + case BPF_CGROUP_UDP6_SENDMSG:
> ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
> break;
> case BPF_CGROUP_SOCK_OPS:
> @@ -1633,6 +1637,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
> case BPF_CGROUP_INET6_BIND:
> case BPF_CGROUP_INET4_CONNECT:
> case BPF_CGROUP_INET6_CONNECT:
> + case BPF_CGROUP_UDP4_SENDMSG:
> + case BPF_CGROUP_UDP6_SENDMSG:
> ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
> break;
> case BPF_CGROUP_SOCK_OPS:
> @@ -1690,6 +1696,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
> case BPF_CGROUP_INET6_POST_BIND:
> case BPF_CGROUP_INET4_CONNECT:
> case BPF_CGROUP_INET6_CONNECT:
> + case BPF_CGROUP_UDP4_SENDMSG:
> + case BPF_CGROUP_UDP6_SENDMSG:
> case BPF_CGROUP_SOCK_OPS:
> case BPF_CGROUP_DEVICE:
> break;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index aec5eba..f696dc9 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5010,6 +5010,7 @@ static bool sock_addr_is_valid_access(int off, int size,
> switch (prog->expected_attach_type) {
> case BPF_CGROUP_INET4_BIND:
> case BPF_CGROUP_INET4_CONNECT:
> + case BPF_CGROUP_UDP4_SENDMSG:
> break;
> default:
> return false;
> @@ -5019,6 +5020,24 @@ static bool sock_addr_is_valid_access(int off, int size,
> switch (prog->expected_attach_type) {
> case BPF_CGROUP_INET6_BIND:
> case BPF_CGROUP_INET6_CONNECT:
> + case BPF_CGROUP_UDP6_SENDMSG:
> + break;
> + default:
> + return false;
> + }
> + break;
> + case bpf_ctx_range(struct bpf_sock_addr, msg_src_ip4):
> + switch (prog->expected_attach_type) {
> + case BPF_CGROUP_UDP4_SENDMSG:
> + break;
> + default:
> + return false;
> + }
> + break;
> + case bpf_ctx_range_till(struct bpf_sock_addr, msg_src_ip6[0],
> + msg_src_ip6[3]):
> + switch (prog->expected_attach_type) {
> + case BPF_CGROUP_UDP6_SENDMSG:
> break;
> default:
> return false;
> @@ -5029,6 +5048,9 @@ static bool sock_addr_is_valid_access(int off, int size,
> switch (off) {
> case bpf_ctx_range(struct bpf_sock_addr, user_ip4):
> case bpf_ctx_range_till(struct bpf_sock_addr, user_ip6[0], user_ip6[3]):
> + case bpf_ctx_range(struct bpf_sock_addr, msg_src_ip4):
> + case bpf_ctx_range_till(struct bpf_sock_addr, msg_src_ip6[0],
> + msg_src_ip6[3]):
> /* Only narrow read access allowed for now. */
> if (type == BPF_READ) {
> bpf_ctx_record_field_size(info, size_default);
> @@ -5783,6 +5805,23 @@ static u32 sock_addr_convert_ctx_access(enum bpf_access_type type,
> *insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg,
> SK_FL_PROTO_SHIFT);
> break;
> +
> + case offsetof(struct bpf_sock_addr, msg_src_ip4):
> + /* Treat t_ctx as struct in_addr for msg_src_ip4. */
> + SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(
> + struct bpf_sock_addr_kern, struct in_addr, t_ctx,
> + s_addr, BPF_SIZE(si->code), 0, tmp_reg);
> + break;
> +
> + case bpf_ctx_range_till(struct bpf_sock_addr, msg_src_ip6[0],
> + msg_src_ip6[3]):
> + off = si->off;
> + off -= offsetof(struct bpf_sock_addr, msg_src_ip6[0]);
> + /* Treat t_ctx as struct in6_addr for msg_src_ip6. */
> + SOCK_ADDR_LOAD_OR_STORE_NESTED_FIELD_SIZE_OFF(
> + struct bpf_sock_addr_kern, struct in6_addr, t_ctx,
> + s6_addr32[0], BPF_SIZE(si->code), off, tmp_reg);
> + break;
> }
>
> return insn - insn_buf;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index ff4d4ba..a1f9ba2 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -900,6 +900,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> {
> struct inet_sock *inet = inet_sk(sk);
> struct udp_sock *up = udp_sk(sk);
> + DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
> struct flowi4 fl4_stack;
> struct flowi4 *fl4;
> int ulen = len;
> @@ -954,8 +955,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> /*
> * Get and verify the address.
> */
> - if (msg->msg_name) {
> - DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
> + if (usin) {
> if (msg->msg_namelen < sizeof(*usin))
> return -EINVAL;
> if (usin->sin_family != AF_INET) {
> @@ -1009,6 +1009,22 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> rcu_read_unlock();
> }
>
> + if (!connected) {
> + err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> + (struct sockaddr *)usin, &ipc.addr);
> + if (err)
> + goto out_free;
> + if (usin) {
> + if (usin->sin_port == 0) {
> + /* BPF program set invalid port. Reject it. */
> + err = -EINVAL;
> + goto out_free;
> + }
> + daddr = usin->sin_addr.s_addr;
> + dport = usin->sin_port;
> + }
> + }
> +
> saddr = ipc.addr;
> ipc.addr = faddr = daddr;
>
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 2839c1b..6f580ea 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1315,6 +1315,22 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> fl6.saddr = np->saddr;
> fl6.fl6_sport = inet->inet_sport;
>
> + if (!connected) {
> + err = BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk,
> + (struct sockaddr *)sin6, &fl6.saddr);
> + if (err)
> + goto out_no_dst;
> + if (sin6) {
> + if (sin6->sin6_port == 0) {
> + /* BPF program set invalid port. Reject it. */
> + err = -EINVAL;
> + goto out_no_dst;
> + }
> + fl6.fl6_dport = sin6->sin6_port;
> + fl6.daddr = sin6->sin6_addr;
Could the bpf_prog change sin6 to a v4mapped address?
> + }
> + }
> +
> final_p = fl6_update_dst(&fl6, opt, &final);
It seems fl6_update_dst() may update fl6->daddr.
Is it fine (or inline with the expectation after running
a bpf_prog that has changed user_ip6)?
> if (final_p)
> connected = false;
> @@ -1394,6 +1410,7 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>
> out:
> dst_release(dst);
> +out_no_dst:
A nit. If dst is init to NULL, can a new exit label
be avoided?
> fl6_sock_release(flowlabel);
> txopt_put(opt_to_free);
> if (!err)
> --
> 2.9.5
>
^ permalink raw reply
* Re: [PATCH v2] ath10k: transmit queued frames after waking queues
From: Rajkumar Manoharan @ 2018-05-21 23:11 UTC (permalink / raw)
To: Niklas Cassel
Cc: Kalle Valo, David S. Miller, ath10k, linux-wireless, netdev,
linux-kernel, linux-wireless-owner
In-Reply-To: <20180521204359.23884-1-niklas.cassel@linaro.org>
On 2018-05-21 13:43, Niklas Cassel wrote:
> The following problem was observed when running iperf:
[...]
>
> In order to avoid trying to flush the queue every time we free a frame,
> only do this when there are 3 or less frames pending, and while we
> actually have frames in the queue. This logic was copied from
> mt76_txq_schedule (mt76), one of few other drivers that are actually
> using wake_tx_queue.
>
> Suggested-by: Toke Høiland-Jørgensen <toke@toke.dk>
> Signed-off-by: Niklas Cassel <niklas.cassel@linaro.org>
> ---
> Changes since V1: use READ_ONCE() to disallow the compiler
> optimizing things in undesirable ways.
>
> drivers/net/wireless/ath/ath10k/txrx.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c
> b/drivers/net/wireless/ath/ath10k/txrx.c
> index cda164f6e9f6..264cf0bd5c00 100644
> --- a/drivers/net/wireless/ath/ath10k/txrx.c
> +++ b/drivers/net/wireless/ath/ath10k/txrx.c
> @@ -95,6 +95,9 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt,
> wake_up(&htt->empty_tx_wq);
> spin_unlock_bh(&htt->tx_lock);
>
> + if (READ_ONCE(htt->num_pending_tx) <= 3 && !list_empty(&ar->txqs))
> + ath10k_mac_tx_push_pending(ar);
> +
Niklas,
Sorry for the late response. ath10k_mac_tx_push_pending is already
called
at the end of NAPI handler. Isn't that enough to process pending frames?
Earlier we observed performance issues in calling push_pending from each
tx completion. IMHO this change may introduce the same problem again.
-Rajkumar
^ permalink raw reply
* Re: [PATCH net-next] r8169: perform reset synchronously in __rtl8169_resume
From: Heiner Kallweit @ 2018-05-21 22:38 UTC (permalink / raw)
To: Francois Romieu
Cc: David Miller, Realtek linux nic maintainers,
netdev@vger.kernel.org
In-Reply-To: <20180521212422.GA25067@electric-eye.fr.zoreil.com>
Am 21.05.2018 um 23:24 schrieb Francois Romieu:
> Heiner Kallweit <hkallweit1@gmail.com> :
> [...]
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index 75dfac024..1eb4f625a 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -7327,9 +7327,9 @@ static void __rtl8169_resume(struct net_device *dev)
>> rtl_lock_work(tp);
>> napi_enable(&tp->napi);
>> set_bit(RTL_FLAG_TASK_ENABLED, tp->wk.flags);
>> + if (netif_running(dev))
>> + rtl_reset_work(tp);
>> rtl_unlock_work(tp);
>> -
>> - rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>> }
>>
>> static int rtl8169_resume(struct device *device)
>
> netif_running() returns true before rtl_open(): issuing rtl_reset_work()
> synchronously against uninitialized rings right at the start of rtl_open()
> won't work as expected.
>
rtl8169_runtime_resume() has a check for tp->TxDescArray at the beginning.
Therefore it will return immediately before even calling
__rtl8169_resume(), because tp->TxDescArray is set in rtl_open() only.
So I don't think the described issue exists.
However I have to admit that I wasn't aware that netif_running() is true
already when entering rtl_open(), so thanks for the hint.
> pm_runtime_get_sync() could be issued later in rtl_open() (but I would
> not mind something different with runtime_{resume/suspend} in open/close).
>
^ permalink raw reply
* Re: [RFC PATCH net-next 10/12] vhost_net: build xdp buff
From: Michael S. Tsirkin @ 2018-05-21 22:21 UTC (permalink / raw)
To: Jesse Brandeburg; +Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20180521095611.00005caa@intel.com>
On Mon, May 21, 2018 at 09:56:11AM -0700, Jesse Brandeburg wrote:
> On Mon, 21 May 2018 17:04:31 +0800 Jason wrote:
> > This patch implement build XDP buffers in vhost_net. The idea is do
> > userspace copy in vhost_net and build XDP buff based on the
> > page. Vhost_net can then submit one or an array of XDP buffs to
> > underlayer socket (e.g TUN). TUN can choose to do XDP or call
> > build_skb() to build skb. To support build skb, vnet header were also
> > stored into the header of the XDP buff.
> >
> > This userspace copy and XDP buffs building is key to achieve XDP
> > batching in TUN, since TUN does not need to care about userspace copy
> > and then can disable premmption for several XDP buffs to achieve
> > batching from XDP.
> >
> > TODO: reserve headroom based on the TUN XDP.
> >
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > ---
> > drivers/vhost/net.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 74 insertions(+)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index f0639d7..1209e84 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -492,6 +492,80 @@ static bool vhost_has_more_pkts(struct vhost_net *net,
> > likely(!vhost_exceeds_maxpend(net));
> > }
> >
> > +#define VHOST_NET_HEADROOM 256
> > +#define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
> > +
> > +static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> > + struct iov_iter *from,
> > + struct xdp_buff *xdp)
> > +{
> > + struct vhost_virtqueue *vq = &nvq->vq;
> > + struct page_frag *alloc_frag = ¤t->task_frag;
> > + struct virtio_net_hdr *gso;
> > + size_t len = iov_iter_count(from);
> > + int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > + int pad = SKB_DATA_ALIGN(VHOST_NET_RX_PAD + VHOST_NET_HEADROOM
> > + + nvq->sock_hlen);
> > + int sock_hlen = nvq->sock_hlen;
> > + void *buf;
> > + int copied;
> > +
> > + if (len < nvq->sock_hlen)
> > + return -EFAULT;
> > +
> > + if (SKB_DATA_ALIGN(len + pad) +
> > + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) > PAGE_SIZE)
> > + return -ENOSPC;
> > +
> > + buflen += SKB_DATA_ALIGN(len + pad);
>
> maybe store the result of SKB_DATA_ALIGN in a local instead of doing
> the work twice?
I don't mind, but I guess gcc can always do it itself?
> > + alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
> > + if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
> > + return -ENOMEM;
> > +
> > + buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> > +
> > + /* We store two kinds of metadata in the header which will be
> > + * used for XDP_PASS to do build_skb():
> > + * offset 0: buflen
> > + * offset sizeof(int): vnet header
> > + */
> > + copied = copy_page_from_iter(alloc_frag->page,
> > + alloc_frag->offset + sizeof(int), sock_hlen, from);
> > + if (copied != sock_hlen)
> > + return -EFAULT;
> > +
> > + gso = (struct virtio_net_hdr *)(buf + sizeof(int));
> > +
> > + if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> > + vhost16_to_cpu(vq, gso->csum_start) +
> > + vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> > + vhost16_to_cpu(vq, gso->hdr_len)) {
> > + gso->hdr_len = cpu_to_vhost16(vq,
> > + vhost16_to_cpu(vq, gso->csum_start) +
> > + vhost16_to_cpu(vq, gso->csum_offset) + 2);
> > +
> > + if (vhost16_to_cpu(vq, gso->hdr_len) > len)
> > + return -EINVAL;
> > + }
> > +
> > + len -= sock_hlen;
> > + copied = copy_page_from_iter(alloc_frag->page,
> > + alloc_frag->offset + pad,
> > + len, from);
> > + if (copied != len)
> > + return -EFAULT;
> > +
> > + xdp->data_hard_start = buf;
> > + xdp->data = buf + pad;
> > + xdp->data_end = xdp->data + len;
> > + *(int *)(xdp->data_hard_start)= buflen;
>
> space before =
>
> > +
> > + get_page(alloc_frag->page);
> > + alloc_frag->offset += buflen;
> > +
> > + return 0;
> > +}
> > +
> > static void handle_tx_copy(struct vhost_net *net)
> > {
> > struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
^ permalink raw reply
* Reply
From: conapreb @ 2018-05-21 15:29 UTC (permalink / raw)
To: Recipients
Are you free for discussion?
^ permalink raw reply
* [PATCH net-next 2/2] tcp: do not aggressively quick ack after ECN events
From: Eric Dumazet @ 2018-05-21 22:08 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180521220857.229273-1-edumazet@google.com>
ECN signals currently forces TCP to enter quickack mode for
up to 16 (TCP_MAX_QUICKACKS) following incoming packets.
We believe this is not needed, and only sending one immediate ack
for the current packet should be enough.
This should reduce the extra load noticed in DCTCP environments,
after congestion events.
This is part 2 of our effort to reduce pure ACK packets.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_input.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2e970e9f4e09d966b703af2d14d521a4328eba7e..1191cac72109f2f7e2b688ddbc1d404151d274d6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -263,7 +263,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
* it is probably a retransmit.
*/
if (tp->ecn_flags & TCP_ECN_SEEN)
- tcp_enter_quickack_mode((struct sock *)tp, TCP_MAX_QUICKACKS);
+ tcp_enter_quickack_mode((struct sock *)tp, 1);
break;
case INET_ECN_CE:
if (tcp_ca_needs_ecn((struct sock *)tp))
@@ -271,7 +271,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
/* Better not delay acks, sender can have a very low cwnd */
- tcp_enter_quickack_mode((struct sock *)tp, TCP_MAX_QUICKACKS);
+ tcp_enter_quickack_mode((struct sock *)tp, 1);
tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
}
tp->ecn_flags |= TCP_ECN_SEEN;
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
* [PATCH net-next 1/2] tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
From: Eric Dumazet @ 2018-05-21 22:08 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Van Jacobson, Neal Cardwell, Yuchung Cheng,
Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20180521220857.229273-1-edumazet@google.com>
We want to add finer control of the number of ACK packets sent after
ECN events.
This patch is not changing current behavior, it only enables following
change.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
net/ipv4/tcp_input.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index aebb29ab2fdf2ceaa182cd11928f145a886149ff..2e970e9f4e09d966b703af2d14d521a4328eba7e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -203,21 +203,23 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
}
}
-static void tcp_incr_quickack(struct sock *sk)
+static void tcp_incr_quickack(struct sock *sk, unsigned int max_quickacks)
{
struct inet_connection_sock *icsk = inet_csk(sk);
unsigned int quickacks = tcp_sk(sk)->rcv_wnd / (2 * icsk->icsk_ack.rcv_mss);
if (quickacks == 0)
quickacks = 2;
+ quickacks = min(quickacks, max_quickacks);
if (quickacks > icsk->icsk_ack.quick)
- icsk->icsk_ack.quick = min(quickacks, TCP_MAX_QUICKACKS);
+ icsk->icsk_ack.quick = quickacks;
}
-static void tcp_enter_quickack_mode(struct sock *sk)
+static void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks)
{
struct inet_connection_sock *icsk = inet_csk(sk);
- tcp_incr_quickack(sk);
+
+ tcp_incr_quickack(sk, max_quickacks);
icsk->icsk_ack.pingpong = 0;
icsk->icsk_ack.ato = TCP_ATO_MIN;
}
@@ -261,7 +263,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
* it is probably a retransmit.
*/
if (tp->ecn_flags & TCP_ECN_SEEN)
- tcp_enter_quickack_mode((struct sock *)tp);
+ tcp_enter_quickack_mode((struct sock *)tp, TCP_MAX_QUICKACKS);
break;
case INET_ECN_CE:
if (tcp_ca_needs_ecn((struct sock *)tp))
@@ -269,7 +271,7 @@ static void __tcp_ecn_check_ce(struct tcp_sock *tp, const struct sk_buff *skb)
if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
/* Better not delay acks, sender can have a very low cwnd */
- tcp_enter_quickack_mode((struct sock *)tp);
+ tcp_enter_quickack_mode((struct sock *)tp, TCP_MAX_QUICKACKS);
tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
}
tp->ecn_flags |= TCP_ECN_SEEN;
@@ -686,7 +688,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
/* The _first_ data packet received, initialize
* delayed ACK engine.
*/
- tcp_incr_quickack(sk);
+ tcp_incr_quickack(sk, TCP_MAX_QUICKACKS);
icsk->icsk_ack.ato = TCP_ATO_MIN;
} else {
int m = now - icsk->icsk_ack.lrcvtime;
@@ -702,7 +704,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
/* Too long gap. Apparently sender failed to
* restart window, so that we send ACKs quickly.
*/
- tcp_incr_quickack(sk);
+ tcp_incr_quickack(sk, TCP_MAX_QUICKACKS);
sk_mem_reclaim(sk);
}
}
@@ -4179,7 +4181,7 @@ static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
- tcp_enter_quickack_mode(sk);
+ tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
if (tcp_is_sack(tp) && sock_net(sk)->ipv4.sysctl_tcp_dsack) {
u32 end_seq = TCP_SKB_CB(skb)->end_seq;
@@ -4706,7 +4708,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
out_of_window:
- tcp_enter_quickack_mode(sk);
+ tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
inet_csk_schedule_ack(sk);
drop:
tcp_drop(sk, skb);
@@ -5790,7 +5792,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
* to stand against the temptation 8) --ANK
*/
inet_csk_schedule_ack(sk);
- tcp_enter_quickack_mode(sk);
+ tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
TCP_DELACK_MAX, TCP_RTO_MAX);
--
2.17.0.441.gb46fe60e1d-goog
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox