Netdev List
 help / color / mirror / Atom feed
* Re: [PATCHv4 net-next 3/6] {IPv4,xfrm} Add ESN support for AH ingress part
From: Steffen Klassert @ 2014-01-16  9:56 UTC (permalink / raw)
  To: Fan Du; +Cc: davem, netdev
In-Reply-To: <1389769995-350-4-git-send-email-fan.du@windriver.com>

On Wed, Jan 15, 2014 at 03:13:12PM +0800, Fan Du wrote:
>  
> -	ahash_request_set_crypt(req, sg, icv, skb->len);
> +	if (x->props.flags & XFRM_STATE_ESN) {
> +		/* Attach seqhi sg right after packet payload */
> +		*seqhi = htonl(XFRM_SKB_CB(skb)->seq.input.hi);

XFRM_SKB_CB(skb)->seq.input.hi is already in network byte order,
please remove the htonl(). The ipv6 patch has the same problem.

^ permalink raw reply

* [PATCH net-next v3 5/5] virtio-net: initial rx sysfs support, export mergeable rx buffer size
From: Michael Dalton @ 2014-01-16  9:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet
In-Reply-To: <1389865126-26225-1-git-send-email-mwdalton@google.com>

Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 drivers/net/virtio_net.c | 66 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 62 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e82311..f315cbb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/average.h>
+#include <linux/seqlock.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -89,6 +90,12 @@ struct receive_queue {
 	/* Average packet length for mergeable receive buffers. */
 	struct ewma mrg_avg_pkt_len;
 
+	/* Sequence counter to allow sysfs readers to safely access stats.
+	 * Assumes a single virtio-net writer, which is enforced by virtio-net
+	 * and NAPI.
+	 */
+	seqcount_t sysfs_seq;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -416,7 +423,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 	}
 
+	write_seqcount_begin(&rq->sysfs_seq);
 	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
+	write_seqcount_end(&rq->sysfs_seq);
 	return head_skb;
 
 err_skb:
@@ -604,18 +613,29 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
 	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
 	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	/* avg_pkt_len is written only in NAPI rx softirq context. We may
+	 * read avg_pkt_len without using the sysfs_seq seqcount, as this code
+	 * is called only in NAPI rx softirq context or when NAPI is disabled.
+	 */
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
 	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
 
@@ -1557,6 +1577,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
+		seqcount_init(&vi->rq[i].sysfs_seq);
 		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
 	}
@@ -1594,6 +1615,39 @@ err:
 	return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+		struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct receive_queue *rq;
+	struct ewma avg;
+	unsigned int start;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	rq = &vi->rq[queue_index];
+	do {
+		start = read_seqcount_begin(&rq->sysfs_seq);
+		avg = rq->mrg_avg_pkt_len;
+	} while (read_seqcount_retry(&rq->sysfs_seq, start));
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(&avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+	__ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+	NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+	.attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
@@ -1708,6 +1762,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (err)
 		goto free_stats;
 
+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes
From: Michael Dalton @ 2014-01-16  9:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet
In-Reply-To: <1389865126-26225-1-git-send-email-mwdalton@google.com>

Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 include/linux/netdevice.h | 40 ++++++++++++++++++++++++++++++++++++----
 net/core/dev.c            | 12 ++++++------
 net/core/net-sysfs.c      | 33 ++++++++++++++++-----------------
 3 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c88ab1..71b8bc4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu *rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 			 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */
 
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
 	struct rps_map __rcu		*rps_map;
 	struct rps_dev_flow_table __rcu	*rps_flow_table;
+#endif
 	struct kobject			kobj;
 	struct net_device		*dev;
 } ____cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
 
 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
 						   unicast) */
 
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	struct netdev_rx_queue	*_rx;
 
 	/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
 	const struct attribute_group *sysfs_groups[4];
+	/* space for optional per-rx queue attributes */
+	const struct attribute_group *sysfs_rx_queue_group;
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
@@ -2374,7 +2389,7 @@ static inline bool netif_is_multiqueue(const struct net_device *dev)
 
 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2393,7 +2408,7 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 					   from_dev->real_num_tx_queues);
 	if (err)
 		return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	return netif_set_real_num_rx_queues(to_dev,
 					    from_dev->real_num_rx_queues);
 #else
@@ -2401,6 +2416,23 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #endif
 }
 
+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+		struct netdev_rx_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int i;
+
+	for (i = 0; i < dev->num_rx_queues; i++)
+		if (queue == &dev->_rx[i])
+			break;
+
+	BUG_ON(i >= dev->num_rx_queues);
+
+	return i;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
 int netif_get_num_default_rss_queues(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 20c834e..4be7931 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2080,7 +2080,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  *	netif_set_real_num_rx_queues - set actual number of RX queues used
  *	@dev: Network device
@@ -5727,7 +5727,7 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
 	unsigned int i, count = dev->num_rx_queues;
@@ -6272,7 +6272,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	if (rxqs < 1) {
 		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
 		return NULL;
@@ -6328,7 +6328,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (netif_alloc_netdev_queues(dev))
 		goto free_all;
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	dev->num_rx_queues = rxqs;
 	dev->real_num_rx_queues = rxqs;
 	if (netif_alloc_rx_queues(dev))
@@ -6348,7 +6348,7 @@ free_all:
 free_pcpu:
 	free_percpu(dev->pcpu_refcnt);
 	netif_free_tx_queues(dev);
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	kfree(dev->_rx);
 #endif
 
@@ -6373,7 +6373,7 @@ void free_netdev(struct net_device *dev)
 	release_net(dev_net(dev));
 
 	netif_free_tx_queues(dev);
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	kfree(dev->_rx);
 #endif
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 49843bf..0193ff3 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -498,17 +498,7 @@ static struct attribute_group wireless_group = {
 #define net_class_groups	NULL
 #endif /* CONFIG_SYSFS */
 
-#ifdef CONFIG_RPS
-/*
- * RX queue sysfs structures and functions.
- */
-struct rx_queue_attribute {
-	struct attribute attr;
-	ssize_t (*show)(struct netdev_rx_queue *queue,
-	    struct rx_queue_attribute *attr, char *buf);
-	ssize_t (*store)(struct netdev_rx_queue *queue,
-	    struct rx_queue_attribute *attr, const char *buf, size_t len);
-};
+#ifdef CONFIG_SYSFS
 #define to_rx_queue_attr(_attr) container_of(_attr,		\
     struct rx_queue_attribute, attr)
 
@@ -543,6 +533,7 @@ static const struct sysfs_ops rx_queue_sysfs_ops = {
 	.store = rx_queue_attr_store,
 };
 
+#ifdef CONFIG_RPS
 static ssize_t show_rps_map(struct netdev_rx_queue *queue,
 			    struct rx_queue_attribute *attribute, char *buf)
 {
@@ -718,16 +709,20 @@ static struct rx_queue_attribute rps_cpus_attribute =
 static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
 	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
 	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
+#endif /* CONFIG_RPS */
 
 static struct attribute *rx_queue_default_attrs[] = {
+#ifdef CONFIG_RPS
 	&rps_cpus_attribute.attr,
 	&rps_dev_flow_table_cnt_attribute.attr,
+#endif
 	NULL
 };
 
 static void rx_queue_release(struct kobject *kobj)
 {
 	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+#ifdef CONFIG_RPS
 	struct rps_map *map;
 	struct rps_dev_flow_table *flow_table;
 
@@ -743,6 +738,7 @@ static void rx_queue_release(struct kobject *kobj)
 		RCU_INIT_POINTER(queue->rps_flow_table, NULL);
 		call_rcu(&flow_table->rcu, rps_dev_flow_table_release);
 	}
+#endif
 
 	memset(kobj, 0, sizeof(*kobj));
 	dev_put(queue->dev);
@@ -767,21 +763,27 @@ static int rx_queue_add_kobject(struct net_device *net, int index)
 		kobject_put(kobj);
 		return error;
 	}
+	if (net->sysfs_rx_queue_group)
+		sysfs_create_group(kobj, net->sysfs_rx_queue_group);
 
 	kobject_uevent(kobj, KOBJ_ADD);
 	dev_hold(queue->dev);
 
 	return error;
 }
-#endif /* CONFIG_RPS */
+#endif /* CONFIG_SYFS */
 
 int
 net_rx_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
 {
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	int i;
 	int error = 0;
 
+#ifndef CONFIG_RPS
+	if (!net->sysfs_rx_queue_group)
+		return 0;
+#endif
 	for (i = old_num; i < new_num; i++) {
 		error = rx_queue_add_kobject(net, i);
 		if (error) {
@@ -1155,9 +1157,6 @@ static int register_queue_kobjects(struct net_device *net)
 	    NULL, &net->dev.kobj);
 	if (!net->queues_kset)
 		return -ENOMEM;
-#endif
-
-#ifdef CONFIG_RPS
 	real_rx = net->real_num_rx_queues;
 #endif
 	real_tx = net->real_num_tx_queues;
@@ -1184,7 +1183,7 @@ static void remove_queue_kobjects(struct net_device *net)
 {
 	int real_rx = 0, real_tx = 0;
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	real_rx = net->real_num_rx_queues;
 #endif
 	real_tx = net->real_num_tx_queues;
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v3 3/5] virtio-net: auto-tune mergeable rx buffer size for improved performance
From: Michael Dalton @ 2014-01-16  9:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet
In-Reply-To: <1389865126-26225-1-git-send-email-mwdalton@google.com>

Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
        base address and truesize into an unsigned long by requiring a
        minimum packet size alignment of 256. Permit attempts to fill
        an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
        mergeable receive buffers. Remove all truesize approximation. Never
        try to fill a full RX ring (required for metadata ring in v2).
 drivers/net/virtio_net.c | 99 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 36cbf06..3e82311 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-                                L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -78,6 +86,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
 	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+	return truesize * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+	return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int offset,
@@ -327,31 +355,33 @@ err:
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
 					 struct receive_queue *rq,
-					 void *buf,
+					 unsigned long ctx,
 					 unsigned int len)
 {
+	void *buf = mergeable_ctx_to_buf_address(ctx);
 	struct skb_vnet_hdr *hdr = buf;
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
+
 	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
 		goto err_skb;
-
 	while (--num_buf) {
 		int num_skb_frags;
 
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers out of %d missing\n",
 				 dev->name, num_buf, hdr->mhdr.num_buffers);
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
 
+		buf = mergeable_ctx_to_buf_address(ctx);
 		page = virt_to_head_page(buf);
 		--rq->num;
 
@@ -369,7 +399,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
-		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+		truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
@@ -386,19 +416,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 	}
 
+	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
 	return head_skb;
 
 err_skb:
 	put_page(page);
 	while (--num_buf) {
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 dev->name, num_buf);
 			dev->stats.rx_length_errors++;
 			break;
 		}
-		page = virt_to_head_page(buf);
+		page = virt_to_head_page(mergeable_ctx_to_buf_address(ctx));
 		put_page(page);
 		--rq->num;
 	}
@@ -419,17 +450,20 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
-		if (vi->mergeable_rx_bufs)
-			put_page(virt_to_head_page(buf));
-		else if (vi->big_packets)
+		if (vi->mergeable_rx_bufs) {
+			unsigned long ctx = (unsigned long)buf;
+			void *base = mergeable_ctx_to_buf_address(ctx);
+			put_page(virt_to_head_page(base));
+		} else if (vi->big_packets) {
 			give_pages(rq, buf);
-		else
+		} else {
 			dev_kfree_skb(buf);
+		}
 		return;
 	}
 
 	if (vi->mergeable_rx_bufs)
-		skb = receive_mergeable(dev, rq, buf, len);
+		skb = receive_mergeable(dev, rq, (unsigned long)buf, len);
 	else if (vi->big_packets)
 		skb = receive_big(dev, rq, buf, len);
 	else
@@ -572,25 +606,36 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
+	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
+	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
+	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
+				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
+
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	ctx = mergeable_buf_to_ctx(buf, len);
 	get_page(alloc_frag->page);
-	len = MERGE_BUFFER_LEN;
 	alloc_frag->offset += len;
 	hole = alloc_frag->size - alloc_frag->offset;
-	if (hole < MERGE_BUFFER_LEN) {
+	if (hole < len) {
+		/* To avoid internal fragmentation, if there is very likely not
+		 * enough space for another buffer, add the remaining space to
+		 * the current buffer. This extra space is not included in
+		 * the truesize stored in ctx.
+		 */
 		len += hole;
 		alloc_frag->offset += hole;
 	}
 
 	sg_init_one(rq->sg, buf, len);
-	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
+	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, (void *)ctx, gfp);
 	if (err < 0)
 		put_page(virt_to_head_page(buf));
 
@@ -1394,12 +1439,15 @@ static void free_unused_bufs(struct virtnet_info *vi)
 		struct virtqueue *vq = vi->rq[i].vq;
 
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (vi->mergeable_rx_bufs)
-				put_page(virt_to_head_page(buf));
-			else if (vi->big_packets)
+			if (vi->mergeable_rx_bufs) {
+				unsigned long ctx = (unsigned long)buf;
+				void *base = mergeable_ctx_to_buf_address(ctx);
+				put_page(virt_to_head_page(base));
+			} else if (vi->big_packets) {
 				give_pages(&vi->rq[i], buf);
-			else
+			} else {
 				dev_kfree_skb(buf);
+			}
 			--vi->rq[i].num;
 		}
 		BUG_ON(vi->rq[i].num != 0);
@@ -1509,6 +1557,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
+		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
 	}
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v3 2/5] virtio-net: use per-receive queue page frag alloc for mergeable bufs
From: Michael Dalton @ 2014-01-16  9:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet
In-Reply-To: <1389865126-26225-1-git-send-email-mwdalton@google.com>

The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v2->v3: Unchanged
v1->v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
        Remove per-netdev GFP_KERNEL page_frag allocator.
 
 drivers/net/virtio_net.c | 69 ++++++++++++++++++++++++------------------------
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b17240..36cbf06 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Page frag for packet buffer allocation. */
+	struct page_frag alloc_frag;
+
 	/* RX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -126,11 +129,6 @@ struct virtnet_info {
 	/* Lock for config space updates */
 	struct mutex config_lock;
 
-	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-	 * low on memory.
-	 */
-	struct page_frag alloc_frag;
-
 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;
 
@@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-					       MERGE_BUFFER_LEN);
+	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
@@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 dev->name);
-			len = MERGE_BUFFER_LEN;
-		}
 
 		page = virt_to_head_page(buf);
 		--rq->num;
@@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
+		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
 		}
 		offset = buf - page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, MERGE_BUFFER_LEN);
+					     len, truesize);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len, MERGE_BUFFER_LEN);
+					offset, len, truesize);
 		}
 	}
 
@@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
-	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
+	struct page_frag *alloc_frag = &rq->alloc_frag;
+	char *buf;
 	int err;
+	unsigned int len, hole;
 
-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp)) {
-			buf = (char *)page_address(vi->alloc_frag.page) +
-			      vi->alloc_frag.offset;
-			get_page(vi->alloc_frag.page);
-			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-		}
-	} else {
-		buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
-	}
-	if (!buf)
+	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
 		return -ENOMEM;
+	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	get_page(alloc_frag->page);
+	len = MERGE_BUFFER_LEN;
+	alloc_frag->offset += len;
+	hole = alloc_frag->size - alloc_frag->offset;
+	if (hole < MERGE_BUFFER_LEN) {
+		len += hole;
+		alloc_frag->offset += hole;
+	}
 
-	sg_init_one(rq->sg, buf, MERGE_BUFFER_LEN);
+	sg_init_one(rq->sg, buf, len);
 	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
 	if (err < 0)
 		put_page(virt_to_head_page(buf));
@@ -617,6 +610,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 	int err;
 	bool oom;
 
+	gfp |= __GFP_COLD;
 	do {
 		if (vi->mergeable_rx_bufs)
 			err = add_recvbuf_mergeable(rq, gfp);
@@ -1377,6 +1371,14 @@ static void free_receive_bufs(struct virtnet_info *vi)
 	}
 }
 
+static void free_receive_page_frags(struct virtnet_info *vi)
+{
+	int i;
+	for (i = 0; i < vi->max_queue_pairs; i++)
+		if (vi->rq[i].alloc_frag.page)
+			put_page(vi->rq[i].alloc_frag.page);
+}
+
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
@@ -1705,9 +1707,8 @@ free_recv_bufs:
 	unregister_netdev(dev);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
+	free_receive_page_frags(vi);
 	virtnet_del_vqs(vi);
-	if (vi->alloc_frag.page)
-		put_page(vi->alloc_frag.page);
 free_stats:
 	free_percpu(vi->stats);
 free:
@@ -1724,6 +1725,8 @@ static void remove_vq_common(struct virtnet_info *vi)
 
 	free_receive_bufs(vi);
 
+	free_receive_page_frags(vi);
+
 	virtnet_del_vqs(vi);
 }
 
@@ -1741,8 +1744,6 @@ static void virtnet_remove(struct virtio_device *vdev)
 	unregister_netdev(vi->dev);
 
 	remove_vq_common(vi);
-	if (vi->alloc_frag.page)
-		put_page(vi->alloc_frag.page);
 
 	flush_work(&vi->config_work);
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v3 1/5] net: allow > 0 order atomic page alloc in skb_page_frag_refill
From: Michael Dalton @ 2014-01-16  9:38 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet

skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}
 
-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;
 
-- 
1.8.5.2

^ permalink raw reply related

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: Willy Tarreau @ 2014-01-16  9:36 UTC (permalink / raw)
  To: David Laight
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Thomas Petazzoni,
	Gregory CLEMENT
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D45D866@AcuExch.aculab.com>

On Thu, Jan 16, 2014 at 09:14:00AM +0000, David Laight wrote:
> From: Of Willy Tarreau
> > calling dma_map_single()/dma_unmap_single() is quite expensive compared
> > to copying a small packet. So let's copy short frames and keep the buffers
> > mapped. We set the limit to 256 bytes which seems to give good results both
> > on the XP-GP board and on the AX3/4.
> 
> Have you tried something similar for transmits?
> Copying short tx into a preallocated memory area that is dma_mapped
> at driver initialisation?

Not yet by lack of time, but I intend to do so. Both the NIC and the driver
are capable of running at line rate (1.488 Mpps) when filling descriptors
directly, so many improvements are still possible.

Regards,
Willy

^ permalink raw reply

* Re: [PATCH net-next 1/2] random32: add prandom_u32_lt_N and convert "misuses" of reciprocal_divide
From: Daniel Borkmann @ 2014-01-16  9:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, linux-kernel, Jakub Zawadzki, Hannes Frederic Sowa
In-Reply-To: <1389842050.31367.396.camel@edumazet-glaptop2.roam.corp.google.com>

On 01/16/2014 04:14 AM, Eric Dumazet wrote:
> On Thu, 2014-01-16 at 00:23 +0100, Daniel Borkmann wrote:
>
>> @@ -1220,7 +1219,7 @@ static unsigned int fanout_demux_hash(struct packet_fanout *f,
>>   				      struct sk_buff *skb,
>>   				      unsigned int num)
>>   {
>> -	return reciprocal_divide(skb->rxhash, num);
>> +	return (u32)(((u64) skb->rxhash * num) >> 32);
>>   }
>>
>
> This is unfortunate.

Hm, I agree, it would be better to have a function here. It was mostly
for the purpose of the 2nd patch to migrate users since the structure
was introduced, and thus they become incompatible.

> (This reverts one of your patch : f55d112e529386 )
>
> Please add a helper to explain what's going on here, and on many other
> spots we do this computation (as in get_rps_cpu()).
> Few people really understand this.

Indeed, that's why we also thought that we should fix some spots in
the second patch, also with the thought in mind, when upcoming users
have to use a structure to enforce this call combination.

Anyway, we'll wait for some more feedback anyway regarding the 2nd patch
and then respin.

> Or keep reciprocal_divide() as is, and introduce a new set of functions
> for people really wanting the precise divides.

That might also be a possible option, we could consider. Thanks!

>
>
>

^ permalink raw reply

* RE: [PATCH net-next 1/2] random32: add prandom_u32_lt_N and convert "misuses" of reciprocal_divide
From: David Laight @ 2014-01-16  9:28 UTC (permalink / raw)
  To: 'Eric Dumazet', Daniel Borkmann
  Cc: davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Jakub Zawadzki,
	Hannes Frederic Sowa
In-Reply-To: <1389842050.31367.396.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet
> On Thu, 2014-01-16 at 00:23 +0100, Daniel Borkmann wrote:
> 
> > @@ -1220,7 +1219,7 @@ static unsigned int fanout_demux_hash(struct packet_fanout *f,
> >  				      struct sk_buff *skb,
> >  				      unsigned int num)
> >  {
> > -	return reciprocal_divide(skb->rxhash, num);
> > +	return (u32)(((u64) skb->rxhash * num) >> 32);
> >  }
> >
> 
> This is unfortunate.
> 
> (This reverts one of your patch : f55d112e529386 )
> 
> Please add a helper to explain what's going on here, and on many other
> spots we do this computation (as in get_rps_cpu()).
> Few people really understand this.
> 
> Or keep reciprocal_divide() as is, and introduce a new set of functions
> for people really wanting the precise divides.

Maybe rename the current reciprocal_divide() as well?

In the above code it isn't immediately obvious that the result of
a division is as useful as that of a modulus!
OTOH the high 32 bits of a 32x32 multiply are (slightly) more obviously
in the required range.

	David


^ permalink raw reply

* Re: [PATCH v4 0/3] Send audit/procinfo/cgroup data in socket-level control message
From: Jan Kaluža @ 2014-01-16  9:29 UTC (permalink / raw)
  To: Tejun Heo, Eric Paris
  Cc: rgb-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	cgroups-u79uwXL29TY76Z2rM5mHXA, David Miller
In-Reply-To: <20140115232345.GA22237-9pTldWuhBndy/B6EtB590w@public.gmane.org>

On 01/16/2014 12:23 AM, Tejun Heo wrote:
> On Wed, Jan 15, 2014 at 06:21:43PM -0500, Eric Paris wrote:
>> Reliably being able to audit what process requested an action is
>> extremely useful.  And I like the audit patch, as it is a couple of ints
>> we are storing.
>>
>> procinfo and cgroup can both be up to 4k of data.
>>
>> Is there an alternative he should consider?  Some way to grab a
>> reference on task_struct and just attach that to the message?
>
> Or maybe it can be made separately optional instead of tagging along
> on an existing option so that it doesn't tax use cases which don't
> care about the new stuff?

Right, I could add new option next to SOCK_PASSCRED which could be used 
to send newly added stuff. Would this be acceptable?

I would still vote for SCM_AUDIT to be part of SOCK_PASSCRED and move 
SCM_CGROUP and SCM_PROCINFO into new option.

> Thanks.
>

Regards,
Jan Kaluza

^ permalink raw reply

* [PATCH] net: fix "queues" uevent between network namespaces
From: Chen Weilong @ 2014-01-16  9:24 UTC (permalink / raw)
  To: davem, gregkh; +Cc: netdev

From: Weilong Chen <chenweilong@huawei.com>

When I create a new namespace with 'ip netns add net0', or add/remove
new links in a namespace with 'ip link add/delete type veth', rx/tx
queues events can be got in all namespaces. That is because rx/tx queue
ktypes do not have namespace support, and their kobj parents are setted to
NULL. This patch is to fix it.

Reported-by: Libo Chen <chenlibo@huawei.com>
Signed-off-by: Libo Chen <chenlibo@huawei.com>
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
---
 lib/kobject_uevent.c | 10 ++++++++--
 net/core/net-sysfs.c | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 52e5abb..5f72767 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -88,11 +88,17 @@ out:
 #ifdef CONFIG_NET
 static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data)
 {
-	struct kobject *kobj = data;
+	struct kobject *kobj = data, *ksobj;
 	const struct kobj_ns_type_operations *ops;
 
 	ops = kobj_ns_ops(kobj);
-	if (ops) {
+	if (!ops && kobj->kset) {
+		ksobj = &kobj->kset->kobj;
+		if (ksobj->parent != NULL)
+			ops = kobj_ns_ops(ksobj->parent);
+	}
+
+	if (ops && ops->netlink_ns && kobj->ktype->namespace) {
 		const void *sock_ns, *ns;
 		ns = kobj->ktype->namespace(kobj);
 		sock_ns = ops->netlink_ns(dsk);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 49843bf..0cabe7c 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -748,10 +748,23 @@ static void rx_queue_release(struct kobject *kobj)
 	dev_put(queue->dev);
 }
 
+static const void *rx_queue_namespace(struct kobject *kobj)
+{
+	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+	struct device *dev = &queue->dev->dev;
+	const void *ns = NULL;
+
+	if (dev->class && dev->class->ns_type)
+		ns = dev->class->namespace(dev);
+
+	return ns;
+}
+
 static struct kobj_type rx_queue_ktype = {
 	.sysfs_ops = &rx_queue_sysfs_ops,
 	.release = rx_queue_release,
 	.default_attrs = rx_queue_default_attrs,
+	.namespace = rx_queue_namespace
 };
 
 static int rx_queue_add_kobject(struct net_device *net, int index)
@@ -1082,10 +1095,23 @@ static void netdev_queue_release(struct kobject *kobj)
 	dev_put(queue->dev);
 }
 
+static const void *netdev_queue_namespace(struct kobject *kobj)
+{
+	struct netdev_queue *queue = to_netdev_queue(kobj);
+	struct device *dev = &queue->dev->dev;
+	const void *ns = NULL;
+
+	if (dev->class && dev->class->ns_type)
+		ns = dev->class->namespace(dev);
+
+	return ns;
+}
+
 static struct kobj_type netdev_queue_ktype = {
 	.sysfs_ops = &netdev_queue_sysfs_ops,
 	.release = netdev_queue_release,
 	.default_attrs = netdev_queue_default_attrs,
+	.namespace = netdev_queue_namespace,
 };
 
 static int netdev_queue_add_kobject(struct net_device *net, int index)
-- 
1.7.12

^ permalink raw reply related

* Re: [PATCH] [RFC] netfilter: nf_conntrack: don't relase a conntrack with non-zero refcnt
From: Florian Westphal @ 2014-01-16  9:23 UTC (permalink / raw)
  To: Andrew Vagin
  Cc: Florian Westphal, Andrey Vagin, netfilter-devel, Eric Dumazet,
	netfilter, netdev, linux-kernel, vvs, Cyrill Gorcunov,
	Vasiliy Averin
In-Reply-To: <20140115180844.GA3605@paralelels.com>

Andrew Vagin <avagin@parallels.com> wrote:
> > I think it would be nice if we could keep it that way.
> > If everything fails we could proably intoduce a 'larval' dummy list
> > similar to the one used by template conntracks?
> 
> I'm not sure, that this is required. Could you elaborate when this can
> be useful?

You can dump the lists via ctnetlink.  Its meant as a debugging aid in
case one suspects refcnt leaks.

Granted, in this situation there should be no leak since we put the newly
allocated entry in the error case.

> Now I see only overhead, because we need to take the nf_conntrack_lock
> lock to add conntrack in a list.

True. I don't have any preference, I guess I'd just do the insertion into the
unconfirmed list when we know we cannot track to keep the "unhashed"
bug trap in the destroy function.

Pablo, any preference?

^ permalink raw reply

* Re: [PATCH net-next 1/2] random32: add prandom_u32_lt_N and convert "misuses" of reciprocal_divide
From: Daniel Borkmann @ 2014-01-16  9:22 UTC (permalink / raw)
  To: Joe Perches
  Cc: davem, netdev, linux-kernel, Jakub Zawadzki, Eric Dumazet,
	Hannes Frederic Sowa
In-Reply-To: <1389832172.14001.41.camel@joe-AO722>

On 01/16/2014 01:29 AM, Joe Perches wrote:
> On Thu, 2014-01-16 at 00:23 +0100, Daniel Borkmann wrote:
>> Many functions have open coded a function that return a random
>> number in range [0,N-1]. Also, only because we have a function
>> that is named reciprocal_divide(), it has not much to do with
>> the pupose where it is being used when a previous reciprocal_value()
>> has not been obtained.
>
> prandom_u32_lt_N?
>
> I do not like the camelcase name and thought the
> prandom_u32_max was better.

Hm, you wanted to have the name intuitive, right ... so
u32 prandom_u32_lt_N(u32 N) suggests "less then N". If you
are saying "_max" here, then you should keep in mind that
the maximum result you get from here is "max-1", not "max".
But whatever, it's just a name.

> How about using
>
> u32 prandom_u32_max(u32 max)
> {
> 	return (u32)(((u64)prandom_u32() * max) >> 32);
> }
>
> u32 prandom_u32_range(u32 a, u32 b)
> {
> 	if (b < a)
> 		swap(a, b);
>
> 	return a + (u32)(((u64)prandom_u32() * (b - a)) >> 32);
> }

I didn't introduce the last one as it wasn't used in the patch.

^ permalink raw reply

* RE: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: David Laight @ 2014-01-16  9:14 UTC (permalink / raw)
  To: 'Willy Tarreau', davem@davemloft.net
  Cc: netdev@vger.kernel.org, Thomas Petazzoni, Gregory CLEMENT
In-Reply-To: <1389856819-6503-12-git-send-email-w@1wt.eu>

From: Of Willy Tarreau
> calling dma_map_single()/dma_unmap_single() is quite expensive compared
> to copying a small packet. So let's copy short frames and keep the buffers
> mapped. We set the limit to 256 bytes which seems to give good results both
> on the XP-GP board and on the AX3/4.

Have you tried something similar for transmits?
Copying short tx into a preallocated memory area that is dma_mapped
at driver initialisation?

	David

^ permalink raw reply

* RE: [PATCH net-next 2/2] r6040: use ETH_ZLEN instead of MISR for SKB length checking
From: David Laight @ 2014-01-16  9:10 UTC (permalink / raw)
  To: 'Ben Hutchings', Florian Fainelli
  Cc: netdev@vger.kernel.org, davem@davemloft.net
In-Reply-To: <1389857001.19462.33.camel@deadeye.wl.decadent.org.uk>

From: Ben Hutchings
> On Wed, 2014-01-15 at 13:04 -0800, Florian Fainelli wrote:
> > Ever since this driver was merged the following code was included:
...
> > -	if (skb->len < MISR)
> > -		descptr->len = MISR;
> > +	if (skb->len < ETH_ZLEN)
> > +		descptr->len = ETH_ZLEN;
> 
> It looks like this is just increasing the TX descriptor length so the
> packet is tail-padded with whatever happens to come next in the skb.
> This is absolutely incorrect behaviour and may leak sensitive
> information.

And possibly page fault if the data is right at the end of a page.

> Presumably it is necessary to pad the frame because the MAC is too lame
> to do it, and the correct way to do that is using skb_padto(skb,
> ETH_ZLEN).  But this may fail as it might have to allocate memory

Alternatively use two ring entries with the 'more' bit set on
the first one and transmit the padding from a permanently allocated
and dma-mapped block of zeros.
Assuming the hardware support SG transmit and doesn't have a
constraint on the length of the first fragment.

Or have a pre-allocated an dma-mapped array of short buffers (one
for each ring slot) and copy short packets into the array instead
of dma-mapping the skb data.

	David


^ permalink raw reply

* [PATCH 0/3] Netfilter updates for net-next
From: Pablo Neira Ayuso @ 2014-01-16  9:10 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

[ Wrong tree, very sorry for that, previous was pointing to net-next
  instead of nftables. Resending pull request ]

Hi David,

This small batch contains several Netfilter fixes for your net-next
tree, more specifically:

* Fix compilation warning in nft_ct in NF_CONNTRACK_MARK is not set,
  from Kristian Evensen.

* Add dependency to IPV6 for NF_TABLES_INET. This one has been reported
  by the several robots that are testing .config combinations, from Paul
  Gortmaker.

* Fix default base chain policy setting in nf_tables, from myself.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables.git master

Thanks!

----------------------------------------------------------------

The following changes since commit 11b57f90257c1d6a91cee720151b69e0c2020cf6:

  xen-netback: stop vif thread spinning if frontend is unresponsive (2014-01-09 23:05:46 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables.git master

for you to fetch changes up to 847c8e2959f7f8f1462e33c0a720c6267b984ed8:

  netfilter: nft_ct: fix compilation warning if NF_CONNTRACK_MARK is not set (2014-01-15 11:00:14 +0100)

----------------------------------------------------------------
Kristian Evensen (1):
      netfilter: nft_ct: fix compilation warning if NF_CONNTRACK_MARK is not set

Pablo Neira Ayuso (1):
      netfilter: nf_tables: fix missing byteorder conversion in policy

Paul Gortmaker (1):
      netfilter: Add dependency on IPV6 for NF_TABLES_INET


^ permalink raw reply

* [PATCH 1/3] netfilter: nf_tables: fix missing byteorder conversion in policy
From: Pablo Neira Ayuso @ 2014-01-16  9:04 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1389863093-4530-1-git-send-email-pablo@netfilter.org>

When fetching the policy attribute, the byteorder conversion was
missing, breaking the chain policy setting.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_tables_api.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 36add31..117bbaa 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -859,7 +859,7 @@ static int nf_tables_newchain(struct sock *nlsk, struct sk_buff *skb,
 		    nla[NFTA_CHAIN_HOOK] == NULL)
 			return -EOPNOTSUPP;
 
-		policy = nla_get_be32(nla[NFTA_CHAIN_POLICY]);
+		policy = ntohl(nla_get_be32(nla[NFTA_CHAIN_POLICY]));
 		switch (policy) {
 		case NF_DROP:
 		case NF_ACCEPT:
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 3/3] netfilter: nft_ct: fix compilation warning if NF_CONNTRACK_MARK is not set
From: Pablo Neira Ayuso @ 2014-01-16  9:04 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1389863093-4530-1-git-send-email-pablo@netfilter.org>

From: Kristian Evensen <kristian.evensen@gmail.com>

net/netfilter/nft_ct.c: In function 'nft_ct_set_eval':
net/netfilter/nft_ct.c:136:6: warning: unused variable 'value' [-Wunused-variable]

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_ct.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index c7c1285..917052e 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -133,7 +133,9 @@ static void nft_ct_set_eval(const struct nft_expr *expr,
 {
 	const struct nft_ct *priv = nft_expr_priv(expr);
 	struct sk_buff *skb = pkt->skb;
+#ifdef CONFIG_NF_CONNTRACK_MARK
 	u32 value = data[priv->sreg].data[0];
+#endif
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct;
 
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 2/3] netfilter: Add dependency on IPV6 for NF_TABLES_INET
From: Pablo Neira Ayuso @ 2014-01-16  9:04 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1389863093-4530-1-git-send-email-pablo@netfilter.org>

From: Paul Gortmaker <paul.gortmaker@windriver.com>

Commit 1d49144c0aa ("netfilter: nf_tables: add "inet" table for
IPv4/IPv6") allows creation of non-IPV6 enabled .config files that
will fail to configure/link as follows:

warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
net/built-in.o: In function `nft_reject_eval':
nft_reject.c:(.text+0x651e8): undefined reference to `nf_ip6_checksum'
nft_reject.c:(.text+0x65270): undefined reference to `ip6_route_output'
nft_reject.c:(.text+0x656c4): undefined reference to `ip6_dst_hoplimit'
make: *** [vmlinux] Error 1

Since the feature is to allow for a mixed IPV4 and IPV6 table, it
seems sensible to make it depend on IPV6.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 37d2092..6941a4f 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -429,7 +429,7 @@ config NF_TABLES
 	  To compile it as a module, choose M here.
 
 config NF_TABLES_INET
-	depends on NF_TABLES
+	depends on NF_TABLES && IPV6
 	select NF_TABLES_IPV4
 	select NF_TABLES_IPV6
 	tristate "Netfilter nf_tables mixed IPv4/IPv6 tables support"
-- 
1.7.10.4


^ permalink raw reply related

* [PATCH 0/3] Netfilter updates for net-next
From: Pablo Neira Ayuso @ 2014-01-16  9:04 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

This small batch contains several Netfilter fixes for your net-next
tree, more specifically:

* Fix compilation warning in nft_ct in NF_CONNTRACK_MARK is not set,
  from Kristian Evensen.

* Add dependency to IPV6 for NF_TABLES_INET. This one has been reported
  by the several robots that are testing .config combinations, from Paul
  Gortmaker.

* Fix default base chain policy setting in nf_tables, from myself.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks!

----------------------------------------------------------------

The following changes since commit 11b57f90257c1d6a91cee720151b69e0c2020cf6:

  xen-netback: stop vif thread spinning if frontend is unresponsive (2014-01-09 23:05:46 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git 

for you to fetch changes up to 847c8e2959f7f8f1462e33c0a720c6267b984ed8:

  netfilter: nft_ct: fix compilation warning if NF_CONNTRACK_MARK is not set (2014-01-15 11:00:14 +0100)

----------------------------------------------------------------
Kristian Evensen (1):
      netfilter: nft_ct: fix compilation warning if NF_CONNTRACK_MARK is not set

Pablo Neira Ayuso (1):
      netfilter: nf_tables: fix missing byteorder conversion in policy

Paul Gortmaker (1):
      netfilter: Add dependency on IPV6 for NF_TABLES_INET

 net/netfilter/Kconfig         |    2 +-
 net/netfilter/nf_tables_api.c |    2 +-
 net/netfilter/nft_ct.c        |    2 ++
 3 files changed, 4 insertions(+), 2 deletions(-)

^ permalink raw reply

* Re: Fwd: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Stefan Hajnoczi @ 2014-01-16  8:52 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: Linux Netdev List, Tom Herbert, Eric Dumazet, David S. Miller,
	Zhi Yong Wu, Michael S. Tsirkin, Rusty Russell, Jason Wang
In-Reply-To: <CAEH94Lg7ygXgHqE8ubE+nuunicDqgmnCfO2AuDSS2nXgv5GikQ@mail.gmail.com>

On Thu, Jan 16, 2014 at 04:34:10PM +0800, Zhi Yong Wu wrote:
> CC: stefanha, MST, Rusty Russel
> 
> ---------- Forwarded message ----------
> From: Jason Wang <jasowang@redhat.com>
> Date: Thu, Jan 16, 2014 at 12:23 PM
> Subject: Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
> To: Zhi Yong Wu <zwu.kernel@gmail.com>
> Cc: netdev@vger.kernel.org, therbert@google.com, edumazet@google.com,
> davem@davemloft.net, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> 
> 
> On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
> >
> > From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
> >
> > HI, folks
> >
> > The patchset is trying to integrate aRFS support to virtio_net. In this case,
> > aRFS will be used to select the RX queue. To make sure that it's going ahead
> > in the correct direction, although it is still one RFC and isn't tested, it's
> > post out ASAP. Any comment are appreciated, thanks.
> >
> > If anyone is interested in playing with it, you can get this patchset from my
> > dev git on github:
> >    git://github.com/wuzhy/kernel.git virtnet_rfs
> >
> > Zhi Yong Wu (3):
> >    virtio_pci: Introduce one new config api vp_get_vq_irq()
> >    virtio_net: Introduce one dummy function virtnet_filter_rfs()
> >    virtio-net: Add accelerated RFS support
> >
> >   drivers/net/virtio_net.c      |   67 ++++++++++++++++++++++++++++++++++++++++-
> >   drivers/virtio/virtio_pci.c   |   11 +++++++
> >   include/linux/virtio_config.h |   12 +++++++
> >   3 files changed, 89 insertions(+), 1 deletions(-)
> >
> 
> Please run get_maintainter.pl before sending the patch. You'd better
> at least cc virtio maintainer/list for this.
> 
> The core aRFS method is a noop in this RFC which make this series no
> much sense to discuss. You should at least mention the big picture
> here in the cover letter. I suggest you should post a RFC which can
> run and has expected result or you can just raise a thread for the
> design discussion.
> 
> And this method has been discussed before, you can search "[net-next
> RFC PATCH 5/5] virtio-net: flow director support" in netdev archive
> for a very old prototype implemented by me. It can work and looks like
> most of this RFC have already done there.
> 
> A basic question is whether or not we need this, not all the mq cards
> use aRFS (see ixgbe ATR). And whether or not it can bring extra
> overheads? For virtio, we want to reduce the vmexits as much as
> possible but this aRFS seems introduce a lot of more of this. Making a
> complex interfaces just for an virtual device may not be good, simple
> method may works for most of the cases.
> 
> We really should consider to offload this to real nic. VMDq and L2
> forwarding offload may help in this case.

Zhi Yong and I had an IRC chat.  I wanted to post my questions on the
list - it's still the same concern I had in the old email thread that
Jason mentioned.

In order for virtio-net aRFS to make sense there needs to be an overall
plan for pushing flow mapping information down to the physical NIC.
That's the only way to actually achieve the benefit of steering:
processing the packet on the CPU where the application is running.

If it's not possible or too hard to implement aRFS down the entire
stack, we won't be able to process the packet on the right CPU.
Then we might as well not bother with aRFS and just distribute uniformly
across the rx virtqueues.

Please post an outline of how rx packets will be steered up the stack so
we can discuss whether aRFS can bring any benefit.

Stefan

^ permalink raw reply

* Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Zhi Yong Wu @ 2014-01-16  8:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: Linux Netdev List, Tom Herbert, Eric Dumazet, David S. Miller,
	Zhi Yong Wu, Stefan Hajnoczi, Michael S. Tsirkin, Rusty Russell
In-Reply-To: <52D75EA5.1050000@redhat.com>

On Thu, Jan 16, 2014 at 12:23 PM, Jason Wang <jasowang@redhat.com> wrote:
> On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
>>
>> From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
>>
>> HI, folks
>>
>> The patchset is trying to integrate aRFS support to virtio_net. In this
>> case,
>> aRFS will be used to select the RX queue. To make sure that it's going
>> ahead
>> in the correct direction, although it is still one RFC and isn't tested,
>> it's
>> post out ASAP. Any comment are appreciated, thanks.
>>
>> If anyone is interested in playing with it, you can get this patchset from
>> my
>> dev git on github:
>>    git://github.com/wuzhy/kernel.git virtnet_rfs
>>
>> Zhi Yong Wu (3):
>>    virtio_pci: Introduce one new config api vp_get_vq_irq()
>>    virtio_net: Introduce one dummy function virtnet_filter_rfs()
>>    virtio-net: Add accelerated RFS support
>>
>>   drivers/net/virtio_net.c      |   67
>> ++++++++++++++++++++++++++++++++++++++++-
>>   drivers/virtio/virtio_pci.c   |   11 +++++++
>>   include/linux/virtio_config.h |   12 +++++++
>>   3 files changed, 89 insertions(+), 1 deletions(-)
>>
>
> Please run get_maintainter.pl before sending the patch. You'd better at
> least cc virtio maintainer/list for this.
Is this one must for virtio stuff?

>
> The core aRFS method is a noop in this RFC which make this series no much
> sense to discuss. You should at least mention the big picture here in the
> cover letter. I suggest you should post a RFC which can run and has expected
> result or you can just raise a thread for the design discussion.
Yes, it currently miss some important stuff as i said in another mail
of this series.

>
> And this method has been discussed before, you can search "[net-next RFC
> PATCH 5/5] virtio-net: flow director support" in netdev archive for a very
> old prototype implemented by me. It can work and looks like most of this RFC
> have already done there.
ah? Can you let me know the result of your discussion? Will checked
it, thanks for your pointer.
>
> A basic question is whether or not we need this, not all the mq cards use
> aRFS (see ixgbe ATR). And whether or not it can bring extra overheads? For
> virtio, we want to reduce the vmexits as much as possible but this aRFS
Good question, i also have concern about this, and don't know if Tom
has good explanation.
> seems introduce a lot of more of this. Making a complex interfaces just for
> an virtual device may not be good, simple method may works for most of the
> cases.
>
> We really should consider to offload this to real nic. VMDq and L2
> forwarding offload may help in this case.

By the way, Stefan, can you let us know your concerns here? as we
talked in irc channel. :)


-- 
Regards,

Zhi Yong Wu

^ permalink raw reply

* Re: [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
From: Veaceslav Falico @ 2014-01-16  8:41 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, Andy Gospodarek, David S. Miller
In-Reply-To: <15764.1389848997@death.nxdomain>

On Wed, Jan 15, 2014 at 09:09:57PM -0800, Jay Vosburgh wrote:
>Veaceslav Falico <vfalico@redhat.com> wrote:
>
>>Currently, if arp_validate is off (0), slave_last_rx() returns the
>>slave->dev->last_rx, which is always updated on *any* packet received by
>>slave, and not only arps. This means that, if the validation of arps is
>>off, we're treating *any* incoming packet as a proof of slave being up, and
>>not only arps.
>
>	The "any incoming packet" part is intentional.
>
>>This might seem logical at the first glance, however it can cause a lot of
>>troubles and false-positives, one example would be:
>>
>>The arp_ip_target is NOT accessible, however someone in the broadcast domain
>>spams with any broadcast traffic. This way bonding will be tricked that the
>>slave is still up (as in - can access arp_ip_target), while it's not.
>
>	This type of situation is why arp_validate was added.
>
>	The specific situation was when multiple hosts using bonding
>with the ARP monitor were set up behind a common gateway (in the same
>Ethernet broadcast domain).  The arp_ip_target is unreachable for
>whatever reason.  In that case, the various bonding instances on the
>different hosts will each issue broadcast ARP requests, and (in the
>absence of arp_validate) those requests would trick the other bonds into
>believing that they are up.
>
>	I don't think this patch set will resolve that problem, since
>you explicitly permit any incoming ARP to count.

I've said it was tricky at first glance :).

For this situation the arp_validate is *indeed* the cure. I'm not disabling
(or even working with) how arp_validate=1/2 works, I'm working with
arp_validate == 0.

Before the patchset (with arp_validate == 0 ):

*Any* packet (arp and non-arp) will signal us that the slave is up -
because we use slave->dev->last_rx (updated on *every* incoming packet in
bond_handle_frame).

After the patchset (with arp_validate == 0 ):

*ONLY ARP* packets signal us that the slave is up - because we use
slave->last_arp_rx that is updated every time we see an ARP packet.


The way the modes work with arp_validate > 0 don't change :), as we're
updating slave->last_arp_rx the old way in this case - after validation.

So that the scenario you've described still works flawlessly, and now
already we won't be tricked by some weird broadcast traffic even with
arp_validate == 0.

>
>>The documentation for arp_validate also states that *ARPs* will (not) be
>>validated if it's on/off, and that the arp monitoring works on arps as
>>traffic generators.
>
>	I wrote most of that text in the documentation, and the intent
>was not to imply that only ARPs should count for "up-ness" even without
>arp_validate enabled.  The intent was to distinguish it from
>"non-validate," in which any incoming traffic counted for "up-ness."
>
>	The main reason for preserving the non-validate behavior (any
>traffic counts) is for the loadbalance (xor and rr) modes.  In those
>modes, the switch decides which slave receives the incoming traffic, and
>so it's to our advantage to permit any incoming traffic to count for
>"up-ness."  The arp_validate option is not allowed in these modes
>because it won't work.
>
>	With these changes, I suspect that the loadbalance ARP monitor
>will be less reliable with these changes (granted that it's already a
>bit dodgy in its dependence on the switch to hit all slaves with
>incoming packets regularly).  Particularly if the switch ports are
>configured into an Etherchannel ("static link aggregation") group, in
>which case only one slave will receive any given frame (broadcast /
>multicast traffic will not be duplicated across all slaves).

The non-AB modes also gave me a headache, however after thinking a bit I've
decided to change them also (mainly, it's the change of
arp_loadbalance_mon function).

The usual usage, however, is to generate traffic via arps. If we don't see
arp replies - this means that arp_ip_target is down, and thus the slave is
down.

>
>	I'm not sure that this change (the "only count ARPs even without
>arp_validate" bit) won't break existing configurations.  Did you test
>the -rr and -xor modes with ARP monitor after your changes (with and
>without configuring a channel group on the switch ports)?

Sure, all works fine, afaics. Obviously, these were basic tests, and bugs
might exist.

The only possible scenario of breakage for someone, from my POV, is:

1) arp monitor is used with loadbalance mode
2) arp_ip_targets are set but _any_ arp replies are never received
3) the user relies on that every slave will receive at least one packet per
arp_interval

This use case:

1) contradicts with documentation
2) contradicts with logic (arp monitor, arp ip targets etc. are used
without, actually, meaning something)
3) is really unstable

In this case, indeed, it won't work. Two points, though:

1) It shouldn't work in the first place, is unstable etc.
2) Can be easily fixed by the following oneliner (though I *really* woudn't
like to do it, as it's useless and dangerous). Basically, for non-ab mode
we set last_arp_rx for every packet. Again, I wouldn't like to do it, but
if you know any use case scenario for this usage (no working arps but
continuous receive traffic) - I can send it as v2 or 7/6 patch (whatever
suits David better, as he already applied it but didn't push).

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 0f613ae..87358e5 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2286,6 +2286,11 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
  	__be32 sip, tip;
  	int alen;
  
+	if (!USES_PRIMARY(bond->params.mode)) {
+		slave->last_arp_rx = jiffies;
+		return RX_HANDLER_ANOTHER;
+	}
+
  	if (skb->protocol != __cpu_to_be16(ETH_P_ARP))
  		return RX_HANDLER_ANOTHER;
  

>
>>Also, the net_device->last_rx is already used in a lot of drivers (even
>>though the comment states to NOT do it :)), and it's also ugly to modify it
>>from bonding.
>
>	I didn't check, but I suspect those are mostly leftovers from
>the distant past, when the drivers were expected to update last_rx, or
>perhaps drivers using it for their own purposes.

It's really a mix. Somebody just updates them, somebody uses it for their
own purposes etc.

>
>	I don't really see an issue in decoupling bonding from the
>net_device->last_rx; it's pretty much the same thing that was done for
>trans_start some time ago.

trans_start removal is also in queue :). Though still needs some
polishing...

>
>	-J
>
>---
>	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
>
>
>>So, to fix this, remove the last_rx from bonding, *always* call
>>bond_arp_rcv() in slave's rx_handler (bond_handle_frame), and if we spot an
>>arp there - update the slave->last_arp_rx - and use it instead of
>>net_device->last_rx. Finally, rename slave_last_rx() to slave_last_arp_rx()
>>to reflect the changes.
>>
>>As the changes touch really sensitive parts, I've tried to split them as
>>much as possible, for easier debugging/bisecting.
>>
>>CC: Jay Vosburgh <fubar@us.ibm.com>
>>CC: Andy Gospodarek <andy@greyhouse.net>
>>CC: "David S. Miller" <davem@davemloft.net>
>>CC: netdev@vger.kernel.org
>>Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
>>
>>---
>> drivers/net/bonding/bond_main.c    | 18 ++++++++----------
>> drivers/net/bonding/bond_options.c | 12 ++----------
>> drivers/net/bonding/bonding.h      | 16 ++++++----------
>> include/linux/netdevice.h          |  8 +-------
>> 4 files changed, 17 insertions(+), 37 deletions(-)
>>
>

^ permalink raw reply related

* Fwd: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
From: Zhi Yong Wu @ 2014-01-16  8:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Linux Netdev List, Tom Herbert, Eric Dumazet, David S. Miller,
	Zhi Yong Wu, Michael S. Tsirkin, Rusty Russell, Jason Wang
In-Reply-To: <52D75EA5.1050000@redhat.com>

CC: stefanha, MST, Rusty Russel

---------- Forwarded message ----------
From: Jason Wang <jasowang@redhat.com>
Date: Thu, Jan 16, 2014 at 12:23 PM
Subject: Re: [RFC PATCH net-next 0/3] virtio_net: add aRFS support
To: Zhi Yong Wu <zwu.kernel@gmail.com>
Cc: netdev@vger.kernel.org, therbert@google.com, edumazet@google.com,
davem@davemloft.net, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>


On 01/15/2014 10:20 PM, Zhi Yong Wu wrote:
>
> From: Zhi Yong Wu<wuzhy@linux.vnet.ibm.com>
>
> HI, folks
>
> The patchset is trying to integrate aRFS support to virtio_net. In this case,
> aRFS will be used to select the RX queue. To make sure that it's going ahead
> in the correct direction, although it is still one RFC and isn't tested, it's
> post out ASAP. Any comment are appreciated, thanks.
>
> If anyone is interested in playing with it, you can get this patchset from my
> dev git on github:
>    git://github.com/wuzhy/kernel.git virtnet_rfs
>
> Zhi Yong Wu (3):
>    virtio_pci: Introduce one new config api vp_get_vq_irq()
>    virtio_net: Introduce one dummy function virtnet_filter_rfs()
>    virtio-net: Add accelerated RFS support
>
>   drivers/net/virtio_net.c      |   67 ++++++++++++++++++++++++++++++++++++++++-
>   drivers/virtio/virtio_pci.c   |   11 +++++++
>   include/linux/virtio_config.h |   12 +++++++
>   3 files changed, 89 insertions(+), 1 deletions(-)
>

Please run get_maintainter.pl before sending the patch. You'd better
at least cc virtio maintainer/list for this.

The core aRFS method is a noop in this RFC which make this series no
much sense to discuss. You should at least mention the big picture
here in the cover letter. I suggest you should post a RFC which can
run and has expected result or you can just raise a thread for the
design discussion.

And this method has been discussed before, you can search "[net-next
RFC PATCH 5/5] virtio-net: flow director support" in netdev archive
for a very old prototype implemented by me. It can work and looks like
most of this RFC have already done there.

A basic question is whether or not we need this, not all the mq cards
use aRFS (see ixgbe ATR). And whether or not it can bring extra
overheads? For virtio, we want to reduce the vmexits as much as
possible but this aRFS seems introduce a lot of more of this. Making a
complex interfaces just for an virtual device may not be good, simple
method may works for most of the cases.

We really should consider to offload this to real nic. VMDq and L2
forwarding offload may help in this case.


-- 
Regards,

Zhi Yong Wu

^ permalink raw reply

* [PATCH net-next] sctp: remove the unnecessary assignment
From: Wang Weidong @ 2014-01-16  8:25 UTC (permalink / raw)
  To: Neil Horman, David Miller, Vlad Yasevich; +Cc: netdev, linux-sctp

When go the right path, the status is 0, no need to assign it again.
So just remove the assignment.

Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
---
 net/sctp/protocol.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 7c16108..d6934dc 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1461,7 +1461,6 @@ static __init int sctp_init(void)
 	if (status)
 		goto err_v6_add_protocol;
 
-	status = 0;
 out:
 	return status;
 err_v6_add_protocol:
-- 
1.7.12

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox