[PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill
@ 2014-01-07  5:25 Michael Dalton
  2014-01-07  5:25 ` [PATCH net-next v2 2/4] virtio-net: use per-receive queue page frag alloc for mergeable bufs Michael Dalton
                   ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-07  5:25 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet

skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 5393b4b..a0d522a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1865,9 +1865,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}

-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;

-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v2 2/4] virtio-net: use per-receive queue page frag alloc for mergeable bufs
  2014-01-07  5:25 [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill Michael Dalton
@ 2014-01-07  5:25 ` Michael Dalton
  2014-01-07  5:25 ` [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance Michael Dalton
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-07  5:25 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet

The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
    Remove per-netdev GFP_KERNEL page_frag allocator.

 drivers/net/virtio_net.c | 69 ++++++++++++++++++++++++------------------------
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c51a988..526dfd8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Page frag for packet buffer allocation. */
+	struct page_frag alloc_frag;
+
 	/* RX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -126,11 +129,6 @@ struct virtnet_info {
 	/* Lock for config space updates */
 	struct mutex config_lock;
 
-	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-	 * low on memory.
-	 */
-	struct page_frag alloc_frag;
-
 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;
 
@@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-					       MERGE_BUFFER_LEN);
+	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
@@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 dev->name);
-			len = MERGE_BUFFER_LEN;
-		}
 
 		page = virt_to_head_page(buf);
 		--rq->num;
@@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
+		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
 		}
 		offset = buf - page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, MERGE_BUFFER_LEN);
+					     len, truesize);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len, MERGE_BUFFER_LEN);
+					offset, len, truesize);
 		}
 	}
 
@@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
-	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
+	struct page_frag *alloc_frag = &rq->alloc_frag;
+	char *buf;
 	int err;
+	unsigned int len, hole;
 
-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp)) {
-			buf = (char *)page_address(vi->alloc_frag.page) +
-			      vi->alloc_frag.offset;
-			get_page(vi->alloc_frag.page);
-			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-		}
-	} else {
-		buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
-	}
-	if (!buf)
+	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
 		return -ENOMEM;
+	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	get_page(alloc_frag->page);
+	len = MERGE_BUFFER_LEN;
+	alloc_frag->offset += len;
+	hole = alloc_frag->size - alloc_frag->offset;
+	if (hole < MERGE_BUFFER_LEN) {
+		len += hole;
+		alloc_frag->offset += hole;
+	}
 
-	sg_init_one(rq->sg, buf, MERGE_BUFFER_LEN);
+	sg_init_one(rq->sg, buf, len);
 	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
 	if (err < 0)
 		put_page(virt_to_head_page(buf));
@@ -617,6 +610,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 	int err;
 	bool oom;
 
+	gfp |= __GFP_COLD;
 	do {
 		if (vi->mergeable_rx_bufs)
 			err = add_recvbuf_mergeable(rq, gfp);
@@ -1377,6 +1371,14 @@ static void free_receive_bufs(struct virtnet_info *vi)
 	}
 }
 
+static void free_receive_page_frags(struct virtnet_info *vi)
+{
+	int i;
+	for (i = 0; i < vi->max_queue_pairs; i++)
+		if (vi->rq[i].alloc_frag.page)
+			put_page(vi->rq[i].alloc_frag.page);
+}
+
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
@@ -1705,9 +1707,8 @@ free_recv_bufs:
 	unregister_netdev(dev);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
+	free_receive_page_frags(vi);
 	virtnet_del_vqs(vi);
-	if (vi->alloc_frag.page)
-		put_page(vi->alloc_frag.page);
 free_stats:
 	free_percpu(vi->stats);
 free:
@@ -1724,6 +1725,8 @@ static void remove_vq_common(struct virtnet_info *vi)
 
 	free_receive_bufs(vi);
 
+	free_receive_page_frags(vi);
+
 	virtnet_del_vqs(vi);
 }
 
@@ -1741,8 +1744,6 @@ static void virtnet_remove(struct virtio_device *vdev)
 	unregister_netdev(vi->dev);
 
 	remove_vq_common(vi);
-	if (vi->alloc_frag.page)
-		put_page(vi->alloc_frag.page);
 
 	flush_work(&vi->config_work);
 
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-07  5:25 [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill Michael Dalton
  2014-01-07  5:25 ` [PATCH net-next v2 2/4] virtio-net: use per-receive queue page frag alloc for mergeable bufs Michael Dalton
@ 2014-01-07  5:25 ` Michael Dalton
  2014-01-08  6:23   ` Jason Wang
                     ` (2 more replies)
  2014-01-07  5:25 ` [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size Michael Dalton
  2014-01-08 18:08 ` [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill Michael S. Tsirkin
  3 siblings, 3 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-07  5:25 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet

Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v2: Add per-receive queue metadata ring to track precise truesize for
    mergeable receive buffers. Remove all truesize approximation. Never
    try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 145 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 107 insertions(+), 38 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 526dfd8..f6e1ee0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,15 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-                                L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -65,11 +70,30 @@ struct send_queue {
 	char name[40];
 };
 
+/* Per-packet buffer context for mergeable receive buffers. */
+struct mergeable_receive_buf_ctx {
+	/* Packet buffer base address. */
+	void *buf;
+
+	/* Original size of the packet buffer for use in SKB truesize. Does not
+	 * include any padding space used to avoid internal fragmentation.
+	 */
+	unsigned int truesize;
+};
+
 /* Internal representation of a receive virtqueue */
 struct receive_queue {
 	/* Virtqueue associated with this receive_queue */
 	struct virtqueue *vq;
 
+	/* Circular buffer of mergeable rxbuf contexts. */
+	struct mergeable_receive_buf_ctx *mrg_buf_ctx;
+
+	/* Number of elements & head index of mrg_buf_ctx. Size must be
+	 * equal to the associated virtqueue's vring size.
+	 */
+	unsigned int mrg_buf_ctx_size, mrg_buf_ctx_head;
+
 	struct napi_struct napi;
 
 	/* Number of input buffers, and max we've ever had. */
@@ -78,6 +102,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -327,32 +354,32 @@ err:
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
 					 struct receive_queue *rq,
-					 void *buf,
+					 struct mergeable_receive_buf_ctx *ctx,
 					 unsigned int len)
 {
-	struct skb_vnet_hdr *hdr = buf;
+	struct skb_vnet_hdr *hdr = ctx->buf;
 	int num_buf = hdr->mhdr.num_buffers;
-	struct page *page = virt_to_head_page(buf);
-	int offset = buf - page_address(page);
-	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	struct page *page = virt_to_head_page(ctx->buf);
+	int offset = ctx->buf - page_address(page);
+	unsigned int truesize = max(len, ctx->truesize);
+
 	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
 		goto err_skb;
-
 	while (--num_buf) {
 		int num_skb_frags;
 
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers out of %d missing\n",
 				 dev->name, num_buf, hdr->mhdr.num_buffers);
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
 
-		page = virt_to_head_page(buf);
+		page = virt_to_head_page(ctx->buf);
 		--rq->num;
 
 		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
@@ -369,13 +396,13 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
-		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+		truesize = max(len, ctx->truesize);
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
 			head_skb->truesize += truesize;
 		}
-		offset = buf - page_address(page);
+		offset = ctx->buf - page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
@@ -386,19 +413,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 	}
 
+	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
 	return head_skb;
 
 err_skb:
 	put_page(page);
 	while (--num_buf) {
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 dev->name, num_buf);
 			dev->stats.rx_length_errors++;
 			break;
 		}
-		page = virt_to_head_page(buf);
+		page = virt_to_head_page(ctx->buf);
 		put_page(page);
 		--rq->num;
 	}
@@ -419,12 +447,14 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
-		if (vi->mergeable_rx_bufs)
-			put_page(virt_to_head_page(buf));
-		else if (vi->big_packets)
+		if (vi->mergeable_rx_bufs) {
+			struct mergeable_receive_buf_ctx *ctx = buf;
+			put_page(virt_to_head_page(ctx->buf));
+		} else if (vi->big_packets) {
 			give_pages(rq, buf);
-		else
+		} else {
 			dev_kfree_skb(buf);
+		}
 		return;
 	}
 
@@ -572,29 +602,43 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
+	const unsigned int ring_size = rq->mrg_buf_ctx_size;
+	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
 	struct page_frag *alloc_frag = &rq->alloc_frag;
-	char *buf;
+	struct mergeable_receive_buf_ctx *ctx;
 	int err;
 	unsigned int len, hole;
 
-	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
+	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
+				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	len = ALIGN(len, L1_CACHE_BYTES);
+	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
-	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+
+	ctx = &rq->mrg_buf_ctx[rq->mrg_buf_ctx_head];
+	ctx->buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	ctx->truesize = len;
 	get_page(alloc_frag->page);
-	len = MERGE_BUFFER_LEN;
 	alloc_frag->offset += len;
 	hole = alloc_frag->size - alloc_frag->offset;
-	if (hole < MERGE_BUFFER_LEN) {
+	if (hole < len) {
+		/* To avoid internal fragmentation, if there is very likely not
+		 * enough space for another buffer, add the remaining space to
+		 * the current buffer. This extra space is not included in
+		 * ctx->truesize.
+		 */
 		len += hole;
 		alloc_frag->offset += hole;
 	}
 
-	sg_init_one(rq->sg, buf, len);
-	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
-	if (err < 0)
-		put_page(virt_to_head_page(buf));
-
-	return err;
+	sg_init_one(rq->sg, ctx->buf, len);
+	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, ctx, gfp);
+	if (err < 0) {
+		put_page(virt_to_head_page(ctx->buf));
+		return err;
+	}
+	rq->mrg_buf_ctx_head = (rq->mrg_buf_ctx_head + 1) & (ring_size - 1);
+	return 0;
 }
 
 /*
@@ -610,6 +654,9 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 	int err;
 	bool oom;
 
+	/* Do not attempt to add a buffer if the RX ring is full. */
+	if (unlikely(!rq->vq->num_free))
+		return true;
 	gfp |= __GFP_COLD;
 	do {
 		if (vi->mergeable_rx_bufs)
@@ -1354,8 +1401,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
 {
 	int i;
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		netif_napi_del(&vi->rq[i].napi);
+		kfree(vi->rq[i].mrg_buf_ctx);
+	}
 
 	kfree(vi->rq);
 	kfree(vi->sq);
@@ -1394,12 +1443,14 @@ static void free_unused_bufs(struct virtnet_info *vi)
 		struct virtqueue *vq = vi->rq[i].vq;
 
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (vi->mergeable_rx_bufs)
-				put_page(virt_to_head_page(buf));
-			else if (vi->big_packets)
+			if (vi->mergeable_rx_bufs) {
+				struct mergeable_receive_buf_ctx *ctx = buf;
+				put_page(virt_to_head_page(ctx->buf));
+			} else if (vi->big_packets) {
 				give_pages(&vi->rq[i], buf);
-			else
+			} else {
 				dev_kfree_skb(buf);
+			}
 			--vi->rq[i].num;
 		}
 		BUG_ON(vi->rq[i].num != 0);
@@ -1509,6 +1560,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
+		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
 	}
 
@@ -1522,7 +1574,8 @@ err_sq:
 
 static int init_vqs(struct virtnet_info *vi)
 {
-	int ret;
+	struct virtio_device *vdev = vi->vdev;
+	int i, ret;
 
 	/* Allocate send & receive queues */
 	ret = virtnet_alloc_queues(vi);
@@ -1533,12 +1586,28 @@ static int init_vqs(struct virtnet_info *vi)
 	if (ret)
 		goto err_free;
 
+	if (vi->mergeable_rx_bufs) {
+		for (i = 0; i < vi->max_queue_pairs; i++) {
+			struct receive_queue *rq = &vi->rq[i];
+			rq->mrg_buf_ctx_size = virtqueue_get_vring_size(rq->vq);
+			rq->mrg_buf_ctx = kmalloc(sizeof(*rq->mrg_buf_ctx) *
+						  rq->mrg_buf_ctx_size,
+						  GFP_KERNEL);
+			if (!rq->mrg_buf_ctx) {
+				ret = -ENOMEM;
+				goto err_del_vqs;
+			}
+		}
+	}
+
 	get_online_cpus();
 	virtnet_set_affinity(vi);
 	put_online_cpus();
 
 	return 0;
 
+err_del_vqs:
+	vdev->config->del_vqs(vdev);
 err_free:
 	virtnet_free_queues(vi);
 err:
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-07  5:25 ` [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance Michael Dalton
@ 2014-01-08  6:23   ` Jason Wang
  2014-01-08 18:28     ` Michael Dalton
  2014-01-08 20:30   ` Michael S. Tsirkin
  2014-01-09  1:42   ` Michael S. Tsirkin
  2 siblings, 1 reply; 39+ messages in thread
From: Jason Wang @ 2014-01-08  6:23 UTC (permalink / raw)
  To: Michael Dalton, David S. Miller
  Cc: netdev, Eric Dumazet, virtualization, Michael S. Tsirkin

On 01/07/2014 01:25 PM, Michael Dalton wrote:
> Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
> allocators") changed the mergeable receive buffer size from PAGE_SIZE to
> MTU-size, introducing a single-stream regression for benchmarks with large
> average packet size. There is no single optimal buffer size for all
> workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
> header-sized buffers are preferred as larger buffers reduce the TCP window
> due to SKB truesize. However, single-stream workloads with large average
> packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
> are used.
>
> This commit auto-tunes the mergeable receiver buffer packet size by
> choosing the packet buffer size based on an EWMA of the recent packet
> sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
> virtio-net header len to PAGE_SIZE. This improves throughput for
> large packet workloads, as any workload with average packet size >=
> PAGE_SIZE will use PAGE_SIZE buffers.
>
> These optimizations interact positively with recent commit
> ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
> which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
> optimizations benefit buffers of any size.
>
> Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
> between two QEMU VMs on a single physical machine. Each VM has two VCPUs
> with all offloads & vhost enabled. All VMs and vhost threads run in a
> single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
> in the system will not be scheduled on the benchmark CPUs. Trunk includes
> SKB rx frag coalescing.
>
> net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
> net-next (MTU-size bufs):  13170.01Gb/s
> net-next + auto-tune: 14555.94Gb/s
>
> Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
> using MTU-sized buffers to about 26Gb/s using auto-tuning.
>
> Signed-off-by: Michael Dalton <mwdalton@google.com>
> ---
> v2: Add per-receive queue metadata ring to track precise truesize for
>     mergeable receive buffers. Remove all truesize approximation. Never
>     try to fill a full RX ring (required for metadata ring in v2).
>
>  drivers/net/virtio_net.c | 145 ++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 107 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 526dfd8..f6e1ee0 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,7 @@
>  #include <linux/if_vlan.h>
>  #include <linux/slab.h>
>  #include <linux/cpu.h>
> +#include <linux/average.h>
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -36,11 +37,15 @@ module_param(gso, bool, 0444);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> -#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
> -                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
> -                                L1_CACHE_BYTES))
>  #define GOOD_COPY_LEN	128
>  
> +/* Weight used for the RX packet size EWMA. The average packet size is used to
> + * determine the packet buffer size when refilling RX rings. As the entire RX
> + * ring may be refilled at once, the weight is chosen so that the EWMA will be
> + * insensitive to short-term, transient changes in packet size.
> + */
> +#define RECEIVE_AVG_WEIGHT 64
> +
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
>  struct virtnet_stats {
> @@ -65,11 +70,30 @@ struct send_queue {
>  	char name[40];
>  };
>  
> +/* Per-packet buffer context for mergeable receive buffers. */
> +struct mergeable_receive_buf_ctx {
> +	/* Packet buffer base address. */
> +	void *buf;
> +
> +	/* Original size of the packet buffer for use in SKB truesize. Does not
> +	 * include any padding space used to avoid internal fragmentation.
> +	 */
> +	unsigned int truesize;
> +};
> +
>  /* Internal representation of a receive virtqueue */
>  struct receive_queue {
>  	/* Virtqueue associated with this receive_queue */
>  	struct virtqueue *vq;
>  
> +	/* Circular buffer of mergeable rxbuf contexts. */
> +	struct mergeable_receive_buf_ctx *mrg_buf_ctx;
> +
> +	/* Number of elements & head index of mrg_buf_ctx. Size must be
> +	 * equal to the associated virtqueue's vring size.
> +	 */
> +	unsigned int mrg_buf_ctx_size, mrg_buf_ctx_head;
> +
>  	struct napi_struct napi;
>  
>  	/* Number of input buffers, and max we've ever had. */
> @@ -78,6 +102,9 @@ struct receive_queue {
>  	/* Chain pages by the private ptr. */
>  	struct page *pages;
>  
> +	/* Average packet length for mergeable receive buffers. */
> +	struct ewma mrg_avg_pkt_len;
> +
>  	/* Page frag for packet buffer allocation. */
>  	struct page_frag alloc_frag;
>  
> @@ -327,32 +354,32 @@ err:
>  
>  static struct sk_buff *receive_mergeable(struct net_device *dev,
>  					 struct receive_queue *rq,
> -					 void *buf,
> +					 struct mergeable_receive_buf_ctx *ctx,
>  					 unsigned int len)
>  {
> -	struct skb_vnet_hdr *hdr = buf;
> +	struct skb_vnet_hdr *hdr = ctx->buf;
>  	int num_buf = hdr->mhdr.num_buffers;
> -	struct page *page = virt_to_head_page(buf);
> -	int offset = buf - page_address(page);
> -	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
> +	struct page *page = virt_to_head_page(ctx->buf);
> +	int offset = ctx->buf - page_address(page);
> +	unsigned int truesize = max(len, ctx->truesize);

This looks unnecessary, we've already had precise truesize for this
buffer. ctx->trusize should be always greater or equal to len here.
> +
>  	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
>  	struct sk_buff *curr_skb = head_skb;
>  
>  	if (unlikely(!curr_skb))
>  		goto err_skb;
> -
>  	while (--num_buf) {
>  		int num_skb_frags;
>  
> -		buf = virtqueue_get_buf(rq->vq, &len);
> -		if (unlikely(!buf)) {
> +		ctx = virtqueue_get_buf(rq->vq, &len);
> +		if (unlikely(!ctx)) {
>  			pr_debug("%s: rx error: %d buffers out of %d missing\n",
>  				 dev->name, num_buf, hdr->mhdr.num_buffers);
>  			dev->stats.rx_length_errors++;
>  			goto err_buf;
>  		}
>  
> -		page = virt_to_head_page(buf);
> +		page = virt_to_head_page(ctx->buf);
>  		--rq->num;
>  
>  		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
> @@ -369,13 +396,13 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			head_skb->truesize += nskb->truesize;
>  			num_skb_frags = 0;
>  		}
> -		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
> +		truesize = max(len, ctx->truesize);

And this.
>  		if (curr_skb != head_skb) {
>  			head_skb->data_len += len;
>  			head_skb->len += len;
>  			head_skb->truesize += truesize;
>  		}
> -		offset = buf - page_address(page);
> +		offset = ctx->buf - page_address(page);
>  		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
>  			put_page(page);
>  			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
> @@ -386,19 +413,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		}
>  	}
>  
> +	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
>  	return head_skb;
>  
>  err_skb:
>  	put_page(page);
>  	while (--num_buf) {
> -		buf = virtqueue_get_buf(rq->vq, &len);
> -		if (unlikely(!buf)) {
> +		ctx = virtqueue_get_buf(rq->vq, &len);
> +		if (unlikely(!ctx)) {
>  			pr_debug("%s: rx error: %d buffers missing\n",
>  				 dev->name, num_buf);
>  			dev->stats.rx_length_errors++;
>  			break;
>  		}
> -		page = virt_to_head_page(buf);
> +		page = virt_to_head_page(ctx->buf);
>  		put_page(page);
>  		--rq->num;
>  	}
> @@ -419,12 +447,14 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
>  		pr_debug("%s: short packet %i\n", dev->name, len);
>  		dev->stats.rx_length_errors++;
> -		if (vi->mergeable_rx_bufs)
> -			put_page(virt_to_head_page(buf));
> -		else if (vi->big_packets)
> +		if (vi->mergeable_rx_bufs) {
> +			struct mergeable_receive_buf_ctx *ctx = buf;
> +			put_page(virt_to_head_page(ctx->buf));
> +		} else if (vi->big_packets) {
>  			give_pages(rq, buf);
> -		else
> +		} else {
>  			dev_kfree_skb(buf);
> +		}
>  		return;
>  	}
>  
> @@ -572,29 +602,43 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  
>  static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
> +	const unsigned int ring_size = rq->mrg_buf_ctx_size;
> +	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>  	struct page_frag *alloc_frag = &rq->alloc_frag;
> -	char *buf;
> +	struct mergeable_receive_buf_ctx *ctx;
>  	int err;
>  	unsigned int len, hole;
>  
> -	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
> +	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
> +				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
> +	len = ALIGN(len, L1_CACHE_BYTES);
> +	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
>  		return -ENOMEM;
> -	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> +
> +	ctx = &rq->mrg_buf_ctx[rq->mrg_buf_ctx_head];
> +	ctx->buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> +	ctx->truesize = len;
>  	get_page(alloc_frag->page);
> -	len = MERGE_BUFFER_LEN;
>  	alloc_frag->offset += len;
>  	hole = alloc_frag->size - alloc_frag->offset;
> -	if (hole < MERGE_BUFFER_LEN) {
> +	if (hole < len) {
> +		/* To avoid internal fragmentation, if there is very likely not
> +		 * enough space for another buffer, add the remaining space to
> +		 * the current buffer. This extra space is not included in
> +		 * ctx->truesize.
> +		 */

What's the reason that this extra space is not accounted for truesize?
>  		len += hole;
>  		alloc_frag->offset += hole;
>  	}
>  
> -	sg_init_one(rq->sg, buf, len);
> -	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
> -	if (err < 0)
> -		put_page(virt_to_head_page(buf));
> -
> -	return err;
> +	sg_init_one(rq->sg, ctx->buf, len);
> +	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, ctx, gfp);
> +	if (err < 0) {
> +		put_page(virt_to_head_page(ctx->buf));
> +		return err;

Should we also roll back the frag offset added above to avoid leaking frags?
> +	}
> +	rq->mrg_buf_ctx_head = (rq->mrg_buf_ctx_head + 1) & (ring_size - 1);
> +	return 0;
>  }
>  
>  /*
> @@ -610,6 +654,9 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>  	int err;
>  	bool oom;
>  
> +	/* Do not attempt to add a buffer if the RX ring is full. */
> +	if (unlikely(!rq->vq->num_free))
> +		return true;

I haven't figured out why this is needed. It seems safe for
virtqueue_add_inbuf() just fail in add_recv_xx()?
>  	gfp |= __GFP_COLD;
>  	do {
>  		if (vi->mergeable_rx_bufs)
> @@ -1354,8 +1401,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
>  {
>  	int i;
>  
> -	for (i = 0; i < vi->max_queue_pairs; i++)
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		netif_napi_del(&vi->rq[i].napi);
> +		kfree(vi->rq[i].mrg_buf_ctx);
> +	}
>  
>  	kfree(vi->rq);
>  	kfree(vi->sq);
> @@ -1394,12 +1443,14 @@ static void free_unused_bufs(struct virtnet_info *vi)
>  		struct virtqueue *vq = vi->rq[i].vq;
>  
>  		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> -			if (vi->mergeable_rx_bufs)
> -				put_page(virt_to_head_page(buf));
> -			else if (vi->big_packets)
> +			if (vi->mergeable_rx_bufs) {
> +				struct mergeable_receive_buf_ctx *ctx = buf;
> +				put_page(virt_to_head_page(ctx->buf));
> +			} else if (vi->big_packets) {
>  				give_pages(&vi->rq[i], buf);
> -			else
> +			} else {
>  				dev_kfree_skb(buf);
> +			}
>  			--vi->rq[i].num;
>  		}
>  		BUG_ON(vi->rq[i].num != 0);
> @@ -1509,6 +1560,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>  			       napi_weight);
>  
>  		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
> +		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
>  		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
>  	}
>  
> @@ -1522,7 +1574,8 @@ err_sq:
>  
>  static int init_vqs(struct virtnet_info *vi)
>  {
> -	int ret;
> +	struct virtio_device *vdev = vi->vdev;
> +	int i, ret;
>  
>  	/* Allocate send & receive queues */
>  	ret = virtnet_alloc_queues(vi);
> @@ -1533,12 +1586,28 @@ static int init_vqs(struct virtnet_info *vi)
>  	if (ret)
>  		goto err_free;
>  
> +	if (vi->mergeable_rx_bufs) {
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
> +			struct receive_queue *rq = &vi->rq[i];
> +			rq->mrg_buf_ctx_size = virtqueue_get_vring_size(rq->vq);
> +			rq->mrg_buf_ctx = kmalloc(sizeof(*rq->mrg_buf_ctx) *
> +						  rq->mrg_buf_ctx_size,
> +						  GFP_KERNEL);
> +			if (!rq->mrg_buf_ctx) {
> +				ret = -ENOMEM;
> +				goto err_del_vqs;
> +			}
> +		}
> +	}
> +
>  	get_online_cpus();
>  	virtnet_set_affinity(vi);
>  	put_online_cpus();
>  
>  	return 0;
>  
> +err_del_vqs:
> +	vdev->config->del_vqs(vdev);
>  err_free:
>  	virtnet_free_queues(vi);
>  err:

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-08  6:23   ` Jason Wang
@ 2014-01-08 18:28     ` Michael Dalton
  2014-01-08 18:44       ` Eric Dumazet
  2014-01-08 19:16       ` Michael S. Tsirkin
  0 siblings, 2 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-08 18:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, netdev, lf-virt, Eric Dumazet,
	David S. Miller

Hi Jason,

On Tue, Jan 7, 2014 at 10:23 PM, Jason Wang <jasowang@redhat.com> wrote:
> What's the reason that this extra space is not accounted for truesize?
The initial rationale was that this extra space is due to
internal fragmentation in the page frag allocator, but I agree with
you -- this code should be changed and the extra space accounted for.
Any internal fragmentation leading to a larger last packet allocated from
the page should be reflected in the SKB truesize of the last packet.

I will do a followup patchset that accounts correctly for the extra
space, which will also me to remove the two max statements you
indicated. Thanks for finding this issue.

>> +     if (err < 0) {
>> +             put_page(virt_to_head_page(ctx->buf));
>> +             return err;
> Should we also roll back the frag offset added above to avoid leaking frags?
I believe the put_page here is sufficient for correctness. When we
allocate a buffer using skb_page_frag_refill, we use get_page/put_page
to allocate/free respectively. For example, if the virtqueue_add_inbuf
succeeded, we would eventually call put_page either in virtio-net
(e.g., page_to_skb for packets <= GOOD_COPY_LEN bytes) or later in
__skb_frag_unref and other functions called during dev_kfree_skb.

However, an offset rollback does allow the space to be reused by the next
allocation, which could be a good optimization. I can do the offset
rollback (with a put_page) in the next patchset. What do you think?

>> +     /* Do not attempt to add a buffer if the RX ring is full. */
>> +     if (unlikely(!rq->vq->num_free))
>> +             return true;
> I haven't figured out why this is needed. It seems safe for
> virtqueue_add_inbuf() just fail in add_recv_xx()?
I think this is safe with one caveat -- we can't modify
rq->mrg_buf_ctx until we know the ring isn't full (otherwise, we
clobber an in-use entry). It is safe to modify rq->mrg_buf_ctx
after we know that virtqueue_add_inbuf has succeeded.

I can remove the rq_num_free check from try_fill_recv, and then
modify virtqueue_add_inbuf to use a local mergeable_receive_buf_ctx.
Once virtqueue_add_inbuf succeeds, the contents of the local variable
can be copied to rq->mrg_buf_ctx[rq->mrg_buf_ctx_head].

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-08 18:28     ` Michael Dalton
@ 2014-01-08 18:44       ` Eric Dumazet
  2014-01-08 19:16       ` Michael S. Tsirkin
  1 sibling, 0 replies; 39+ messages in thread
From: Eric Dumazet @ 2014-01-08 18:44 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Michael S. Tsirkin, netdev, lf-virt, Eric Dumazet,
	David S. Miller

On Wed, 2014-01-08 at 10:28 -0800, Michael Dalton wrote:
> Hi Jason,
> 
> On Tue, Jan 7, 2014 at 10:23 PM, Jason Wang <jasowang@redhat.com> wrote:
> > What's the reason that this extra space is not accounted for truesize?
> The initial rationale was that this extra space is due to
> internal fragmentation in the page frag allocator, but I agree with
> you -- this code should be changed and the extra space accounted for.
> Any internal fragmentation leading to a larger last packet allocated from
> the page should be reflected in the SKB truesize of the last packet.

Thats not a hard requirement.

Lets keep things simple and fast.

Correctness here is not mandatory, as long as we do not underestimate
truesize by a huge value like a factor of 10.

Hint :

When __netdev_alloc_frag() has to allocate a new page, because
remaining space in current page is too small, the lost hole is not
accounted to users of current page.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-08 18:28     ` Michael Dalton
  2014-01-08 18:44       ` Eric Dumazet
@ 2014-01-08 19:16       ` Michael S. Tsirkin
  2014-01-08 19:56         ` Michael Dalton
  1 sibling, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-08 19:16 UTC (permalink / raw)
  To: Michael Dalton; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

On Wed, Jan 08, 2014 at 10:28:09AM -0800, Michael Dalton wrote:
> Hi Jason,
> 
> On Tue, Jan 7, 2014 at 10:23 PM, Jason Wang <jasowang@redhat.com> wrote:
> > What's the reason that this extra space is not accounted for truesize?
> The initial rationale was that this extra space is due to
> internal fragmentation in the page frag allocator, but I agree with
> you -- this code should be changed and the extra space accounted for.
> Any internal fragmentation leading to a larger last packet allocated from
> the page should be reflected in the SKB truesize of the last packet.

I think this is what the original patchset did, but I don't really get
why this is a good idea.
Why should we select a frame at random and make it's truesize bigger?
All frames are to blame for the extra space.
Just ignoring it seems more symmetrical.


> I will do a followup patchset that accounts correctly for the extra
> space, which will also me to remove the two max statements you
> indicated. Thanks for finding this issue.
> 
> >> +     if (err < 0) {
> >> +             put_page(virt_to_head_page(ctx->buf));
> >> +             return err;
> > Should we also roll back the frag offset added above to avoid leaking frags?
> I believe the put_page here is sufficient for correctness. When we
> allocate a buffer using skb_page_frag_refill, we use get_page/put_page
> to allocate/free respectively. For example, if the virtqueue_add_inbuf
> succeeded, we would eventually call put_page either in virtio-net
> (e.g., page_to_skb for packets <= GOOD_COPY_LEN bytes) or later in
> __skb_frag_unref and other functions called during dev_kfree_skb.
> 
> However, an offset rollback does allow the space to be reused by the next
> allocation, which could be a good optimization. I can do the offset
> rollback (with a put_page) in the next patchset. What do you think?

If you intend to repost anyway (for the below wrinkle) then
you can do it right here just as well I guess. Seems a bit prettier.

> >> +     /* Do not attempt to add a buffer if the RX ring is full. */
> >> +     if (unlikely(!rq->vq->num_free))
> >> +             return true;
> > I haven't figured out why this is needed. It seems safe for
> > virtqueue_add_inbuf() just fail in add_recv_xx()?
> I think this is safe with one caveat -- we can't modify
> rq->mrg_buf_ctx until we know the ring isn't full (otherwise, we
> clobber an in-use entry). It is safe to modify rq->mrg_buf_ctx
> after we know that virtqueue_add_inbuf has succeeded.
> 
> I can remove the rq_num_free check from try_fill_recv, and then
> modify virtqueue_add_inbuf to use a local mergeable_receive_buf_ctx.
> Once virtqueue_add_inbuf succeeds, the contents of the local variable
> can be copied to rq->mrg_buf_ctx[rq->mrg_buf_ctx_head].
> 
> Best,
> 
> Mike


You don't have to fill in ctx before calling add_inbuf, do you?
Just fill it afterwards.

-- 
MST

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-08 19:16       ` Michael S. Tsirkin
@ 2014-01-08 19:56         ` Michael Dalton
  0 siblings, 0 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-08 19:56 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

Hi Eric, Michael,

On Wed, Jan 8, 2014 at 11:16 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> Why should we select a frame at random and make it's truesize bigger?
> All frames are to blame for the extra space.
> Just ignoring it seems more symmetrical.
Sounds good, based on Eric's feedback and Michael's feedback above,
I will leave the 'extra space' handling as-is in the followup patchset
and will not track the extra space in ctx->truesize. AFAICT, The two
max() statements will need to remain (as buffer length may exceed
ctx->truesize).  Thanks for the feedback.

> If you intend to repost anyway (for the below wrinkle) then
> you can do it right here just as well I guess. Seems a bit prettier.
Will do.

> You don't have to fill in ctx before calling add_inbuf, do you?
> Just fill it afterwards.
Agreed, ctx does not need to be filled until after add_inbuf.

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-07  5:25 ` [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance Michael Dalton
  2014-01-08  6:23   ` Jason Wang
@ 2014-01-08 20:30   ` Michael S. Tsirkin
  2014-01-09  1:42   ` Michael S. Tsirkin
  2 siblings, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-08 20:30 UTC (permalink / raw)
  To: Michael Dalton
  Cc: David S. Miller, netdev, Eric Dumazet, Rusty Russell, Jason Wang,
	virtualization

On Mon, Jan 06, 2014 at 09:25:54PM -0800, Michael Dalton wrote:
> Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
> allocators") changed the mergeable receive buffer size from PAGE_SIZE to
> MTU-size, introducing a single-stream regression for benchmarks with large
> average packet size. There is no single optimal buffer size for all
> workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
> header-sized buffers are preferred as larger buffers reduce the TCP window
> due to SKB truesize. However, single-stream workloads with large average
> packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
> are used.
> 
> This commit auto-tunes the mergeable receiver buffer packet size by
> choosing the packet buffer size based on an EWMA of the recent packet
> sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
> virtio-net header len to PAGE_SIZE. This improves throughput for
> large packet workloads, as any workload with average packet size >=
> PAGE_SIZE will use PAGE_SIZE buffers.
> 
> These optimizations interact positively with recent commit
> ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
> which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
> optimizations benefit buffers of any size.
> 
> Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
> between two QEMU VMs on a single physical machine. Each VM has two VCPUs
> with all offloads & vhost enabled. All VMs and vhost threads run in a
> single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
> in the system will not be scheduled on the benchmark CPUs. Trunk includes
> SKB rx frag coalescing.
> 
> net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
> net-next (MTU-size bufs):  13170.01Gb/s
> net-next + auto-tune: 14555.94Gb/s
> 
> Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
> using MTU-sized buffers to about 26Gb/s using auto-tuning.
> 
> Signed-off-by: Michael Dalton <mwdalton@google.com>

I like where this series is going.

There are a couple of minor comments by Jason worth addressing, I'm guessing
you are going to post v3 anyway?

> ---
> v2: Add per-receive queue metadata ring to track precise truesize for
>     mergeable receive buffers. Remove all truesize approximation. Never
>     try to fill a full RX ring (required for metadata ring in v2).
> 
>  drivers/net/virtio_net.c | 145 ++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 107 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 526dfd8..f6e1ee0 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,7 @@
>  #include <linux/if_vlan.h>
>  #include <linux/slab.h>
>  #include <linux/cpu.h>
> +#include <linux/average.h>
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -36,11 +37,15 @@ module_param(gso, bool, 0444);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> -#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
> -                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
> -                                L1_CACHE_BYTES))
>  #define GOOD_COPY_LEN	128
>  
> +/* Weight used for the RX packet size EWMA. The average packet size is used to
> + * determine the packet buffer size when refilling RX rings. As the entire RX
> + * ring may be refilled at once, the weight is chosen so that the EWMA will be
> + * insensitive to short-term, transient changes in packet size.
> + */
> +#define RECEIVE_AVG_WEIGHT 64
> +
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
>  struct virtnet_stats {
> @@ -65,11 +70,30 @@ struct send_queue {
>  	char name[40];
>  };
>  
> +/* Per-packet buffer context for mergeable receive buffers. */
> +struct mergeable_receive_buf_ctx {
> +	/* Packet buffer base address. */
> +	void *buf;
> +
> +	/* Original size of the packet buffer for use in SKB truesize. Does not
> +	 * include any padding space used to avoid internal fragmentation.
> +	 */
> +	unsigned int truesize;
> +};
> +
>  /* Internal representation of a receive virtqueue */
>  struct receive_queue {
>  	/* Virtqueue associated with this receive_queue */
>  	struct virtqueue *vq;
>  
> +	/* Circular buffer of mergeable rxbuf contexts. */
> +	struct mergeable_receive_buf_ctx *mrg_buf_ctx;
> +
> +	/* Number of elements & head index of mrg_buf_ctx. Size must be
> +	 * equal to the associated virtqueue's vring size.
> +	 */
> +	unsigned int mrg_buf_ctx_size, mrg_buf_ctx_head;
> +
>  	struct napi_struct napi;
>  
>  	/* Number of input buffers, and max we've ever had. */
> @@ -78,6 +102,9 @@ struct receive_queue {
>  	/* Chain pages by the private ptr. */
>  	struct page *pages;
>  
> +	/* Average packet length for mergeable receive buffers. */
> +	struct ewma mrg_avg_pkt_len;
> +
>  	/* Page frag for packet buffer allocation. */
>  	struct page_frag alloc_frag;
>  
> @@ -327,32 +354,32 @@ err:
>  
>  static struct sk_buff *receive_mergeable(struct net_device *dev,
>  					 struct receive_queue *rq,
> -					 void *buf,
> +					 struct mergeable_receive_buf_ctx *ctx,
>  					 unsigned int len)
>  {
> -	struct skb_vnet_hdr *hdr = buf;
> +	struct skb_vnet_hdr *hdr = ctx->buf;
>  	int num_buf = hdr->mhdr.num_buffers;
> -	struct page *page = virt_to_head_page(buf);
> -	int offset = buf - page_address(page);
> -	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
> +	struct page *page = virt_to_head_page(ctx->buf);
> +	int offset = ctx->buf - page_address(page);
> +	unsigned int truesize = max(len, ctx->truesize);
> +
>  	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
>  	struct sk_buff *curr_skb = head_skb;
>  
>  	if (unlikely(!curr_skb))
>  		goto err_skb;
> -
>  	while (--num_buf) {
>  		int num_skb_frags;
>  
> -		buf = virtqueue_get_buf(rq->vq, &len);
> -		if (unlikely(!buf)) {
> +		ctx = virtqueue_get_buf(rq->vq, &len);
> +		if (unlikely(!ctx)) {
>  			pr_debug("%s: rx error: %d buffers out of %d missing\n",
>  				 dev->name, num_buf, hdr->mhdr.num_buffers);
>  			dev->stats.rx_length_errors++;
>  			goto err_buf;
>  		}
>  
> -		page = virt_to_head_page(buf);
> +		page = virt_to_head_page(ctx->buf);
>  		--rq->num;
>  
>  		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
> @@ -369,13 +396,13 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			head_skb->truesize += nskb->truesize;
>  			num_skb_frags = 0;
>  		}
> -		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
> +		truesize = max(len, ctx->truesize);
>  		if (curr_skb != head_skb) {
>  			head_skb->data_len += len;
>  			head_skb->len += len;
>  			head_skb->truesize += truesize;
>  		}
> -		offset = buf - page_address(page);
> +		offset = ctx->buf - page_address(page);
>  		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
>  			put_page(page);
>  			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
> @@ -386,19 +413,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		}
>  	}
>  
> +	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
>  	return head_skb;
>  
>  err_skb:
>  	put_page(page);
>  	while (--num_buf) {
> -		buf = virtqueue_get_buf(rq->vq, &len);
> -		if (unlikely(!buf)) {
> +		ctx = virtqueue_get_buf(rq->vq, &len);
> +		if (unlikely(!ctx)) {
>  			pr_debug("%s: rx error: %d buffers missing\n",
>  				 dev->name, num_buf);
>  			dev->stats.rx_length_errors++;
>  			break;
>  		}
> -		page = virt_to_head_page(buf);
> +		page = virt_to_head_page(ctx->buf);
>  		put_page(page);
>  		--rq->num;
>  	}
> @@ -419,12 +447,14 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
>  		pr_debug("%s: short packet %i\n", dev->name, len);
>  		dev->stats.rx_length_errors++;
> -		if (vi->mergeable_rx_bufs)
> -			put_page(virt_to_head_page(buf));
> -		else if (vi->big_packets)
> +		if (vi->mergeable_rx_bufs) {
> +			struct mergeable_receive_buf_ctx *ctx = buf;
> +			put_page(virt_to_head_page(ctx->buf));
> +		} else if (vi->big_packets) {
>  			give_pages(rq, buf);
> -		else
> +		} else {
>  			dev_kfree_skb(buf);
> +		}
>  		return;
>  	}
>  
> @@ -572,29 +602,43 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  
>  static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
> +	const unsigned int ring_size = rq->mrg_buf_ctx_size;
> +	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>  	struct page_frag *alloc_frag = &rq->alloc_frag;
> -	char *buf;
> +	struct mergeable_receive_buf_ctx *ctx;
>  	int err;
>  	unsigned int len, hole;
>  
> -	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
> +	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
> +				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
> +	len = ALIGN(len, L1_CACHE_BYTES);
> +	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
>  		return -ENOMEM;
> -	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> +
> +	ctx = &rq->mrg_buf_ctx[rq->mrg_buf_ctx_head];
> +	ctx->buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> +	ctx->truesize = len;
>  	get_page(alloc_frag->page);
> -	len = MERGE_BUFFER_LEN;
>  	alloc_frag->offset += len;
>  	hole = alloc_frag->size - alloc_frag->offset;
> -	if (hole < MERGE_BUFFER_LEN) {
> +	if (hole < len) {
> +		/* To avoid internal fragmentation, if there is very likely not
> +		 * enough space for another buffer, add the remaining space to
> +		 * the current buffer. This extra space is not included in
> +		 * ctx->truesize.
> +		 */
>  		len += hole;
>  		alloc_frag->offset += hole;
>  	}
>  
> -	sg_init_one(rq->sg, buf, len);
> -	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
> -	if (err < 0)
> -		put_page(virt_to_head_page(buf));
> -
> -	return err;
> +	sg_init_one(rq->sg, ctx->buf, len);
> +	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, ctx, gfp);
> +	if (err < 0) {
> +		put_page(virt_to_head_page(ctx->buf));
> +		return err;
> +	}
> +	rq->mrg_buf_ctx_head = (rq->mrg_buf_ctx_head + 1) & (ring_size - 1);
> +	return 0;
>  }
>  
>  /*
> @@ -610,6 +654,9 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>  	int err;
>  	bool oom;
>  
> +	/* Do not attempt to add a buffer if the RX ring is full. */
> +	if (unlikely(!rq->vq->num_free))
> +		return true;
>  	gfp |= __GFP_COLD;
>  	do {
>  		if (vi->mergeable_rx_bufs)
> @@ -1354,8 +1401,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
>  {
>  	int i;
>  
> -	for (i = 0; i < vi->max_queue_pairs; i++)
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		netif_napi_del(&vi->rq[i].napi);
> +		kfree(vi->rq[i].mrg_buf_ctx);
> +	}
>  
>  	kfree(vi->rq);
>  	kfree(vi->sq);
> @@ -1394,12 +1443,14 @@ static void free_unused_bufs(struct virtnet_info *vi)
>  		struct virtqueue *vq = vi->rq[i].vq;
>  
>  		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> -			if (vi->mergeable_rx_bufs)
> -				put_page(virt_to_head_page(buf));
> -			else if (vi->big_packets)
> +			if (vi->mergeable_rx_bufs) {
> +				struct mergeable_receive_buf_ctx *ctx = buf;
> +				put_page(virt_to_head_page(ctx->buf));
> +			} else if (vi->big_packets) {
>  				give_pages(&vi->rq[i], buf);
> -			else
> +			} else {
>  				dev_kfree_skb(buf);
> +			}
>  			--vi->rq[i].num;
>  		}
>  		BUG_ON(vi->rq[i].num != 0);
> @@ -1509,6 +1560,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>  			       napi_weight);
>  
>  		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
> +		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
>  		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
>  	}
>  
> @@ -1522,7 +1574,8 @@ err_sq:
>  
>  static int init_vqs(struct virtnet_info *vi)
>  {
> -	int ret;
> +	struct virtio_device *vdev = vi->vdev;
> +	int i, ret;
>  
>  	/* Allocate send & receive queues */
>  	ret = virtnet_alloc_queues(vi);
> @@ -1533,12 +1586,28 @@ static int init_vqs(struct virtnet_info *vi)
>  	if (ret)
>  		goto err_free;
>  
> +	if (vi->mergeable_rx_bufs) {
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
> +			struct receive_queue *rq = &vi->rq[i];
> +			rq->mrg_buf_ctx_size = virtqueue_get_vring_size(rq->vq);
> +			rq->mrg_buf_ctx = kmalloc(sizeof(*rq->mrg_buf_ctx) *
> +						  rq->mrg_buf_ctx_size,
> +						  GFP_KERNEL);
> +			if (!rq->mrg_buf_ctx) {
> +				ret = -ENOMEM;
> +				goto err_del_vqs;
> +			}
> +		}
> +	}
> +
>  	get_online_cpus();
>  	virtnet_set_affinity(vi);
>  	put_online_cpus();
>  
>  	return 0;
>  
> +err_del_vqs:
> +	vdev->config->del_vqs(vdev);
>  err_free:
>  	virtnet_free_queues(vi);
>  err:
> -- 
> 1.8.5.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-07  5:25 ` [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance Michael Dalton
  2014-01-08  6:23   ` Jason Wang
  2014-01-08 20:30   ` Michael S. Tsirkin
@ 2014-01-09  1:42   ` Michael S. Tsirkin
  2014-01-09  3:16     ` Michael Dalton
  2 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-09  1:42 UTC (permalink / raw)
  To: Michael Dalton; +Cc: netdev, virtualization, Eric Dumazet, David S. Miller

On Mon, Jan 06, 2014 at 09:25:54PM -0800, Michael Dalton wrote:
> Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
> allocators") changed the mergeable receive buffer size from PAGE_SIZE to
> MTU-size, introducing a single-stream regression for benchmarks with large
> average packet size. There is no single optimal buffer size for all
> workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
> header-sized buffers are preferred as larger buffers reduce the TCP window
> due to SKB truesize. However, single-stream workloads with large average
> packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
> are used.
> 
> This commit auto-tunes the mergeable receiver buffer packet size by
> choosing the packet buffer size based on an EWMA of the recent packet
> sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
> virtio-net header len to PAGE_SIZE. This improves throughput for
> large packet workloads, as any workload with average packet size >=
> PAGE_SIZE will use PAGE_SIZE buffers.
> 
> These optimizations interact positively with recent commit
> ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
> which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
> optimizations benefit buffers of any size.
> 
> Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
> between two QEMU VMs on a single physical machine. Each VM has two VCPUs
> with all offloads & vhost enabled. All VMs and vhost threads run in a
> single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
> in the system will not be scheduled on the benchmark CPUs. Trunk includes
> SKB rx frag coalescing.
> 
> net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
> net-next (MTU-size bufs):  13170.01Gb/s
> net-next + auto-tune: 14555.94Gb/s
> 
> Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
> using MTU-sized buffers to about 26Gb/s using auto-tuning.
> 
> Signed-off-by: Michael Dalton <mwdalton@google.com>


Sorry that I didn't notice early, but there seems to be a bug here.
See below.
Also, I think we can simplify code, see suggestion below.

> ---
> v2: Add per-receive queue metadata ring to track precise truesize for
>     mergeable receive buffers. Remove all truesize approximation. Never
>     try to fill a full RX ring (required for metadata ring in v2).
> 
>  drivers/net/virtio_net.c | 145 ++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 107 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 526dfd8..f6e1ee0 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,7 @@
>  #include <linux/if_vlan.h>
>  #include <linux/slab.h>
>  #include <linux/cpu.h>
> +#include <linux/average.h>
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -36,11 +37,15 @@ module_param(gso, bool, 0444);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> -#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
> -                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
> -                                L1_CACHE_BYTES))
>  #define GOOD_COPY_LEN	128
>  
> +/* Weight used for the RX packet size EWMA. The average packet size is used to
> + * determine the packet buffer size when refilling RX rings. As the entire RX
> + * ring may be refilled at once, the weight is chosen so that the EWMA will be
> + * insensitive to short-term, transient changes in packet size.
> + */
> +#define RECEIVE_AVG_WEIGHT 64
> +
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
>  struct virtnet_stats {
> @@ -65,11 +70,30 @@ struct send_queue {
>  	char name[40];
>  };
>  
> +/* Per-packet buffer context for mergeable receive buffers. */
> +struct mergeable_receive_buf_ctx {
> +	/* Packet buffer base address. */
> +	void *buf;
> +
> +	/* Original size of the packet buffer for use in SKB truesize. Does not
> +	 * include any padding space used to avoid internal fragmentation.
> +	 */
> +	unsigned int truesize;

Don't need full int really, it's up to 4K/cache line size,
1 byte would be enough, maximum 2 ...
So if all we want is extra 1-2 bytes per buffer, we don't really
need this extra level of indirection I think.
We can just allocate them before the header together with an skb.



> +};
> +
>  /* Internal representation of a receive virtqueue */
>  struct receive_queue {
>  	/* Virtqueue associated with this receive_queue */
>  	struct virtqueue *vq;
>  
> +	/* Circular buffer of mergeable rxbuf contexts. */
> +	struct mergeable_receive_buf_ctx *mrg_buf_ctx;
> +
> +	/* Number of elements & head index of mrg_buf_ctx. Size must be
> +	 * equal to the associated virtqueue's vring size.
> +	 */
> +	unsigned int mrg_buf_ctx_size, mrg_buf_ctx_head;
> +
>  	struct napi_struct napi;
>  
>  	/* Number of input buffers, and max we've ever had. */
> @@ -78,6 +102,9 @@ struct receive_queue {
>  	/* Chain pages by the private ptr. */
>  	struct page *pages;
>  
> +	/* Average packet length for mergeable receive buffers. */
> +	struct ewma mrg_avg_pkt_len;
> +
>  	/* Page frag for packet buffer allocation. */
>  	struct page_frag alloc_frag;
>  
> @@ -327,32 +354,32 @@ err:
>  
>  static struct sk_buff *receive_mergeable(struct net_device *dev,
>  					 struct receive_queue *rq,
> -					 void *buf,
> +					 struct mergeable_receive_buf_ctx *ctx,
>  					 unsigned int len)
>  {
> -	struct skb_vnet_hdr *hdr = buf;
> +	struct skb_vnet_hdr *hdr = ctx->buf;
>  	int num_buf = hdr->mhdr.num_buffers;
> -	struct page *page = virt_to_head_page(buf);
> -	int offset = buf - page_address(page);
> -	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
> +	struct page *page = virt_to_head_page(ctx->buf);
> +	int offset = ctx->buf - page_address(page);
> +	unsigned int truesize = max(len, ctx->truesize);
> +
>  	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
>  	struct sk_buff *curr_skb = head_skb;
>  
>  	if (unlikely(!curr_skb))
>  		goto err_skb;
> -
>  	while (--num_buf) {
>  		int num_skb_frags;
>  
> -		buf = virtqueue_get_buf(rq->vq, &len);
> -		if (unlikely(!buf)) {
> +		ctx = virtqueue_get_buf(rq->vq, &len);
> +		if (unlikely(!ctx)) {
>  			pr_debug("%s: rx error: %d buffers out of %d missing\n",
>  				 dev->name, num_buf, hdr->mhdr.num_buffers);
>  			dev->stats.rx_length_errors++;
>  			goto err_buf;
>  		}
>  
> -		page = virt_to_head_page(buf);
> +		page = virt_to_head_page(ctx->buf);
>  		--rq->num;
>  
>  		num_skb_frags = skb_shinfo(curr_skb)->nr_frags;
> @@ -369,13 +396,13 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			head_skb->truesize += nskb->truesize;
>  			num_skb_frags = 0;
>  		}
> -		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
> +		truesize = max(len, ctx->truesize);
>  		if (curr_skb != head_skb) {
>  			head_skb->data_len += len;
>  			head_skb->len += len;
>  			head_skb->truesize += truesize;
>  		}
> -		offset = buf - page_address(page);
> +		offset = ctx->buf - page_address(page);
>  		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
>  			put_page(page);
>  			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
> @@ -386,19 +413,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		}
>  	}
>  
> +	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
>  	return head_skb;
>  
>  err_skb:
>  	put_page(page);
>  	while (--num_buf) {
> -		buf = virtqueue_get_buf(rq->vq, &len);
> -		if (unlikely(!buf)) {
> +		ctx = virtqueue_get_buf(rq->vq, &len);
> +		if (unlikely(!ctx)) {
>  			pr_debug("%s: rx error: %d buffers missing\n",
>  				 dev->name, num_buf);
>  			dev->stats.rx_length_errors++;
>  			break;
>  		}
> -		page = virt_to_head_page(buf);
> +		page = virt_to_head_page(ctx->buf);
>  		put_page(page);
>  		--rq->num;
>  	}
> @@ -419,12 +447,14 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
>  		pr_debug("%s: short packet %i\n", dev->name, len);
>  		dev->stats.rx_length_errors++;
> -		if (vi->mergeable_rx_bufs)
> -			put_page(virt_to_head_page(buf));
> -		else if (vi->big_packets)
> +		if (vi->mergeable_rx_bufs) {
> +			struct mergeable_receive_buf_ctx *ctx = buf;
> +			put_page(virt_to_head_page(ctx->buf));
> +		} else if (vi->big_packets) {
>  			give_pages(rq, buf);
> -		else
> +		} else {
>  			dev_kfree_skb(buf);
> +		}
>  		return;
>  	}
>  
> @@ -572,29 +602,43 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  
>  static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
> +	const unsigned int ring_size = rq->mrg_buf_ctx_size;
> +	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>  	struct page_frag *alloc_frag = &rq->alloc_frag;
> -	char *buf;
> +	struct mergeable_receive_buf_ctx *ctx;
>  	int err;
>  	unsigned int len, hole;
>  
> -	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
> +	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
> +				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
> +	len = ALIGN(len, L1_CACHE_BYTES);
> +	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
>  		return -ENOMEM;
> -	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> +
> +	ctx = &rq->mrg_buf_ctx[rq->mrg_buf_ctx_head];
> +	ctx->buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> +	ctx->truesize = len;
>  	get_page(alloc_frag->page);
> -	len = MERGE_BUFFER_LEN;
>  	alloc_frag->offset += len;
>  	hole = alloc_frag->size - alloc_frag->offset;
> -	if (hole < MERGE_BUFFER_LEN) {
> +	if (hole < len) {
> +		/* To avoid internal fragmentation, if there is very likely not
> +		 * enough space for another buffer, add the remaining space to
> +		 * the current buffer. This extra space is not included in
> +		 * ctx->truesize.
> +		 */
>  		len += hole;
>  		alloc_frag->offset += hole;
>  	}
>  
> -	sg_init_one(rq->sg, buf, len);
> -	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
> -	if (err < 0)
> -		put_page(virt_to_head_page(buf));
> -
> -	return err;
> +	sg_init_one(rq->sg, ctx->buf, len);
> +	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, ctx, gfp);
> +	if (err < 0) {
> +		put_page(virt_to_head_page(ctx->buf));
> +		return err;
> +	}
> +	rq->mrg_buf_ctx_head = (rq->mrg_buf_ctx_head + 1) & (ring_size - 1);

Wait a second. this assumes that buffers are consumes in order?
This happens to be the case but is not guaranteed by the spec
at all.

> +	return 0;
>  }
>  
>  /*
> @@ -610,6 +654,9 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>  	int err;
>  	bool oom;
>  
> +	/* Do not attempt to add a buffer if the RX ring is full. */
> +	if (unlikely(!rq->vq->num_free))
> +		return true;
>  	gfp |= __GFP_COLD;
>  	do {
>  		if (vi->mergeable_rx_bufs)
> @@ -1354,8 +1401,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
>  {
>  	int i;
>  
> -	for (i = 0; i < vi->max_queue_pairs; i++)
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		netif_napi_del(&vi->rq[i].napi);
> +		kfree(vi->rq[i].mrg_buf_ctx);
> +	}
>  
>  	kfree(vi->rq);
>  	kfree(vi->sq);
> @@ -1394,12 +1443,14 @@ static void free_unused_bufs(struct virtnet_info *vi)
>  		struct virtqueue *vq = vi->rq[i].vq;
>  
>  		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> -			if (vi->mergeable_rx_bufs)
> -				put_page(virt_to_head_page(buf));
> -			else if (vi->big_packets)
> +			if (vi->mergeable_rx_bufs) {
> +				struct mergeable_receive_buf_ctx *ctx = buf;
> +				put_page(virt_to_head_page(ctx->buf));
> +			} else if (vi->big_packets) {
>  				give_pages(&vi->rq[i], buf);
> -			else
> +			} else {
>  				dev_kfree_skb(buf);
> +			}
>  			--vi->rq[i].num;
>  		}
>  		BUG_ON(vi->rq[i].num != 0);
> @@ -1509,6 +1560,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>  			       napi_weight);
>  
>  		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
> +		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
>  		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
>  	}
>  
> @@ -1522,7 +1574,8 @@ err_sq:
>  
>  static int init_vqs(struct virtnet_info *vi)
>  {
> -	int ret;
> +	struct virtio_device *vdev = vi->vdev;
> +	int i, ret;
>  
>  	/* Allocate send & receive queues */
>  	ret = virtnet_alloc_queues(vi);
> @@ -1533,12 +1586,28 @@ static int init_vqs(struct virtnet_info *vi)
>  	if (ret)
>  		goto err_free;
>  
> +	if (vi->mergeable_rx_bufs) {
> +		for (i = 0; i < vi->max_queue_pairs; i++) {
> +			struct receive_queue *rq = &vi->rq[i];
> +			rq->mrg_buf_ctx_size = virtqueue_get_vring_size(rq->vq);
> +			rq->mrg_buf_ctx = kmalloc(sizeof(*rq->mrg_buf_ctx) *
> +						  rq->mrg_buf_ctx_size,
> +						  GFP_KERNEL);
> +			if (!rq->mrg_buf_ctx) {
> +				ret = -ENOMEM;
> +				goto err_del_vqs;
> +			}
> +		}
> +	}
> +
>  	get_online_cpus();
>  	virtnet_set_affinity(vi);
>  	put_online_cpus();
>  
>  	return 0;
>  
> +err_del_vqs:
> +	vdev->config->del_vqs(vdev);
>  err_free:
>  	virtnet_free_queues(vi);
>  err:
> -- 
> 1.8.5.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-09  1:42   ` Michael S. Tsirkin
@ 2014-01-09  3:16     ` Michael Dalton
  2014-01-09  3:41       ` Michael Dalton
  2014-01-09  6:42       ` Michael S. Tsirkin
  0 siblings, 2 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-09  3:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

Hi Michael,

On Wed, Jan 8, 2014 at 5:42 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> Sorry that I didn't notice early, but there seems to be a bug here.
> See below.
Yes, that is definitely a bug. Virtio spec permits OOO completions,
but current code assumes in-order completion. Thanks for catching this.

> Don't need full int really, it's up to 4K/cache line size,
> 1 byte would be enough, maximum 2 ...
> So if all we want is extra 1-2 bytes per buffer, we don't really
> need this extra level of indirection I think.
> We can just allocate them before the header together with an skb.
I'm not sure if I'm parsing the above correctly, but do you mean using a
few bytes at the beginning of the packet buffer to store truesize? I
think that will break Jason's virtio-net RX frag coalescing
code. To coalesce consecutive RX packet buffers, our packet buffers must
be physically adjacent, and any extra bytes before the start of the
buffer would break that.

We could allocate an SKB per packet buffer, but if we have multi-buffer
packets often(e.g., netperf benefiting from GSO/GRO), we would be
allocating 1 SKB per packet buffer instead of 1 SKB per MAX_SKB_FRAGS
buffers. How do you feel about any of the below alternatives:

(1) Modify the existing mrg_buf_ctx to chain together free entries
We can use the 'buf' pointer in mergeable_receive_buf_ctx to chain
together free entries so that we can support OOO completions. This would
be similar to how virtio-queue manages free sg entries.

(2) Combine the buffer pointer and truesize into a single void* value
Your point about there only being a byte needed to encode truesize is
spot on, and I think we could leverage this to eliminate the out-of-band
metadata ring entirely. If we were willing to change the packet buffer
alignment from L1_CACHE_BYTES to 256 (or min (256, L1_CACHE_SIZE)), we
could encode the truesize in the least significant 8 bits of the buffer
address (encoded as truesize >> 8 as we know all sizes are a multiple
of 256). This would allow packet buffers up to 64KB in length.

Is there another approach you would prefer to any of these? If the
cleanliness issues and larger alignment aren't too bad, I think (2)
sounds promising and allow us to eliminate the metadata ring
entirely while still permitting RX frag coalescing.

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-09  3:16     ` Michael Dalton
@ 2014-01-09  3:41       ` Michael Dalton
  2014-01-09  6:48         ` Michael S. Tsirkin
  2014-01-09  6:42       ` Michael S. Tsirkin
  1 sibling, 1 reply; 39+ messages in thread
From: Michael Dalton @ 2014-01-09  3:41 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

Sorry, forgot to mention - if we want to explore combining the buffer
address and truesize into a single void *, we could also exploit the
fact that our size ranges from aligned GOOD_PACKET_LEN to PAGE_SIZE, and
potentially encode fewer values for truesize (and require a smaller
alignment than 256). The prior e-mails discussion of 256 byte alignment
with 256 values is just one potential design point.

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-09  3:41       ` Michael Dalton
@ 2014-01-09  6:48         ` Michael S. Tsirkin
  2014-01-09  8:28           ` Michael Dalton
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-09  6:48 UTC (permalink / raw)
  To: Michael Dalton; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

On Wed, Jan 08, 2014 at 07:41:58PM -0800, Michael Dalton wrote:
> Sorry, forgot to mention - if we want to explore combining the buffer
> address and truesize into a single void *, we could also exploit the
> fact that our size ranges from aligned GOOD_PACKET_LEN to PAGE_SIZE, and
> potentially encode fewer values for truesize (and require a smaller
> alignment than 256). The prior e-mails discussion of 256 byte alignment
> with 256 values is just one potential design point.
> 
> Best,
> 
> Mike

Good point. I think we should keep the option to
make buffers bigger than 4K, so I think we should start with 256
alignment, then see if there are workloads that are improved by smaller
alignment.

Can you add wrapper inline functions to pack/unpack size and
buffer pointer to/from void *?
This way it will be easy to experiment with different alignments.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-09  6:48         ` Michael S. Tsirkin
@ 2014-01-09  8:28           ` Michael Dalton
  2014-01-09  9:02             ` Michael Dalton
  2014-01-09 13:25             ` Michael S. Tsirkin
  0 siblings, 2 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-09  8:28 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

Hi Michael,

Here's a quick sketch of some code that enforces a minimum buffer
alignment of only 64, and has a maximum theoretical buffer size of
aligned GOOD_PACKET_LEN + (BUF_ALIGN - 1) * BUF_ALIGN, which is at least
1536 + 63 * 64 = 5568. On x86, we already use a 64 byte alignment, and
this code supports all current buffer sizes, from 1536 to PAGE_SIZE.

#if L1_CACHE_BYTES < 64
#define MERGEABLE_BUFFER_ALIGN 64
#define MERGEABLE_BUFFER_SHIFT 6
#else
#define MERGEABLE_BUFFER_ALIGN L1_CACHE_BYTES
#define MERGEABLE_BUFFER_SHIFT L1_CACHE_SHIFT
#endif
#define MERGEABLE_BUFFER_MIN ALIGN(GOOD_PACKET_LEN +
                                   sizeof(virtio_net_hdr_mrg_rbuf),
                                   MERGEABLE_BUFFER_ALIGN)
#define MERGEABLE_BUFFER_MAX min(MERGEABLE_BUFFER_MIN +
                                 (MERGEABLE_BUFFER_ALIGN - 1) *
                                 MERGEABLE_BUFFER_ALIGN, PAGE_SIZE)
/* Extract buffer length from a mergeable buffer context. */
static u16 get_mergeable_buf_ctx_len(void *ctx) {
        u16 len = (uintptr_t)ctx & (MERGEABLE_BUFFER_ALIGN - 1);
        return MERGEABLE_BUFFER_MIN + (len << MERGEABLE_BUFFER_SHIFT);
}
/* Extract buffer base address from a mergeable buffer context. */
static void *get_mergeable_buf_ctx_base(void *ctx) {
        return (void *) ((uintptr)ctx & -MERGEABLE_BUFFER_ALIGN);
}
/* Convert a base address and length to a mergeable buffer context. */
static void *to_mergeable_buf_ctx(void *base, u16 len) {
        len -= MERGEABLE_BUFFER_MIN;
        return (void *) ((uintptr)base | (len >> MERGEABLE_BUFFER_SHIFT));
}
/* Compute the packet buffer length for a receive queue. */
static u16 get_mergeable_buffer_len(struct receive_queue *rq) {
        u16 len = clamp_t(u16, MERGEABLE_BUFFER_MIN,
                          ewma_read(&rq->avg_pkt_len),
                          MERGEABLE_BUFFER_MAX);
        return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
}

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-09  8:28           ` Michael Dalton
@ 2014-01-09  9:02             ` Michael Dalton
  2014-01-09 13:25             ` Michael S. Tsirkin
  1 sibling, 0 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-09  9:02 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

If the prior code snippet looks good to you, I'll use something like
that as a baseline for a v3 patchset. I don't think we need a stricter
alignment than 64 to express values in the range (1536 ... 4096), as
the code snippet shows, which is great for x86 4KB pages.

On other architectures that have larger page sizes > 4KB with <= 64b
cachelines, we may want to increase the alignment so that the max buffer
size will be >= PAGE_SIZE (max size allowed by skb_page_frag_refill).

If we use a minimum alignment of 128, our maximum theoretical
packet buffer length is 1536 + 127 * 128 = 17792. With 256 byte
alignment, we can express a maximum packet buffer size > 65536.

Given the above, I think we want to select the min buffer alignment
based on the PAGE_SIZE:
        <= 4KB PAGE_SIZE: 64b min alignment
        <= 16KB PAGE_SIZE: 128b min alignment
        > 16KB PAGE_SIZE: 256b min alignment

So the prior code snippet would be relatively unchanged, except that
references to the previous minimum alignment of 64 would be replaced by
a #define'd constant derived from PAGE_SIZE as shown above.

This would guarantee that we use the minimum alignment necessary to
ensure that virtio-net can post a max size (PAGE_SIZE) buffer, and for
x86 this means we won't increase the alignment beyond the x86's current
L1_CACHE_BYTES value (64). Also, sorry I haven't had a chance to respond
yet to the debugfs feedback, I will get to that soon (just wanted to do
a further deep dive on some of the sysfs/debugfs tradeoffs).

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-09  8:28           ` Michael Dalton
  2014-01-09  9:02             ` Michael Dalton
@ 2014-01-09 13:25             ` Michael S. Tsirkin
  2014-01-09 19:33               ` Michael Dalton
  1 sibling, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-09 13:25 UTC (permalink / raw)
  To: Michael Dalton; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

On Thu, Jan 09, 2014 at 12:28:50AM -0800, Michael Dalton wrote:
> Hi Michael,
> 
> Here's a quick sketch of some code that enforces a minimum buffer
> alignment of only 64, and has a maximum theoretical buffer size of
> aligned GOOD_PACKET_LEN + (BUF_ALIGN - 1) * BUF_ALIGN, which is at least
> 1536 + 63 * 64 = 5568. On x86, we already use a 64 byte alignment, and
> this code supports all current buffer sizes, from 1536 to PAGE_SIZE.

I really think it's best to start with 256 alignment.
A bit simpler, better than what we had, and will let us go
above PAGE_SIZE long term.  The optimization shrinking
alignment to 64 can be done on top if we see a
work-load that's improved by it, which I doubt.

> 
> #if L1_CACHE_BYTES < 64
> #define MERGEABLE_BUFFER_ALIGN 64
> #define MERGEABLE_BUFFER_SHIFT 6
> #else
> #define MERGEABLE_BUFFER_ALIGN L1_CACHE_BYTES
> #define MERGEABLE_BUFFER_SHIFT L1_CACHE_SHIFT
> #endif
> #define MERGEABLE_BUFFER_MIN ALIGN(GOOD_PACKET_LEN +
>                                    sizeof(virtio_net_hdr_mrg_rbuf),
>                                    MERGEABLE_BUFFER_ALIGN)
> #define MERGEABLE_BUFFER_MAX min(MERGEABLE_BUFFER_MIN +
>                                  (MERGEABLE_BUFFER_ALIGN - 1) *
>                                  MERGEABLE_BUFFER_ALIGN, PAGE_SIZE)
> /* Extract buffer length from a mergeable buffer context. */
> static u16 get_mergeable_buf_ctx_len(void *ctx) {
>         u16 len = (uintptr_t)ctx & (MERGEABLE_BUFFER_ALIGN - 1);
>         return MERGEABLE_BUFFER_MIN + (len << MERGEABLE_BUFFER_SHIFT);

You can just do len * MERGEABLE_BUFFER_ALIGN too - my compiler seems
to be smart enough to see it's same.
This way there's no need for MERGEABLE_BUFFER_SHIFT

> }
> /* Extract buffer base address from a mergeable buffer context. */
> static void *get_mergeable_buf_ctx_base(void *ctx) {
>         return (void *) ((uintptr)ctx & -MERGEABLE_BUFFER_ALIGN);

uintptr_t? or just unsigned long ...

> }
> /* Convert a base address and length to a mergeable buffer context. */
> static void *to_mergeable_buf_ctx(void *base, u16 len) {
>         len -= MERGEABLE_BUFFER_MIN;
>         return (void *) ((uintptr)base | (len >> MERGEABLE_BUFFER_SHIFT));
> }

The APIs are a bit inconsistent in naming.
Also it's unfortunate that we use void * for both virtio buffer
and frame memory.

How about
	mergeable_buf_to_len
	mergeable_buf_to_page
	mergeable_buf_to_offset
	mergeable_page_to_buf

this way we don't use void * for frame memory.

alternatively, cast virtio buffer to unsigned long
immediately and pass that around everywhere.


> /* Compute the packet buffer length for a receive queue. */
> static u16 get_mergeable_buffer_len(struct receive_queue *rq) {
>         u16 len = clamp_t(u16, MERGEABLE_BUFFER_MIN,
>                           ewma_read(&rq->avg_pkt_len),
>                           MERGEABLE_BUFFER_MAX);
>         return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
> }
> 
> Best,
> 
> Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-09 13:25             ` Michael S. Tsirkin
@ 2014-01-09 19:33               ` Michael Dalton
  0 siblings, 0 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-09 19:33 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

Hi Michael,

Your improvements (code changes, more consistent naming, and use of
256-byte alignment only) all sound good to me. I will get started on a v3
patchset in conformance with your recommendations after sorting out
what we want to do with the debugfs/sysfs issues. I will followup soon on
the thread for patch 4/4 so we can close on what changes are needed for
debugfs/sysfs. Thanks!

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance
  2014-01-09  3:16     ` Michael Dalton
  2014-01-09  3:41       ` Michael Dalton
@ 2014-01-09  6:42       ` Michael S. Tsirkin
  1 sibling, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-09  6:42 UTC (permalink / raw)
  To: Michael Dalton; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

On Wed, Jan 08, 2014 at 07:16:18PM -0800, Michael Dalton wrote:
> Hi Michael,
> 
> On Wed, Jan 8, 2014 at 5:42 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > Sorry that I didn't notice early, but there seems to be a bug here.
> > See below.
> Yes, that is definitely a bug. Virtio spec permits OOO completions,
> but current code assumes in-order completion. Thanks for catching this.
> 
> > Don't need full int really, it's up to 4K/cache line size,
> > 1 byte would be enough, maximum 2 ...
> > So if all we want is extra 1-2 bytes per buffer, we don't really
> > need this extra level of indirection I think.
> > We can just allocate them before the header together with an skb.
> I'm not sure if I'm parsing the above correctly, but do you mean using a
> few bytes at the beginning of the packet buffer to store truesize? I
> think that will break Jason's virtio-net RX frag coalescing
> code. To coalesce consecutive RX packet buffers, our packet buffers must
> be physically adjacent, and any extra bytes before the start of the
> buffer would break that.
> 
> We could allocate an SKB per packet buffer, but if we have multi-buffer
> packets often(e.g., netperf benefiting from GSO/GRO), we would be
> allocating 1 SKB per packet buffer instead of 1 SKB per MAX_SKB_FRAGS
> buffers. How do you feel about any of the below alternatives:
> 
> (1) Modify the existing mrg_buf_ctx to chain together free entries
> We can use the 'buf' pointer in mergeable_receive_buf_ctx to chain
> together free entries so that we can support OOO completions. This would
> be similar to how virtio-queue manages free sg entries.
> 
> (2) Combine the buffer pointer and truesize into a single void* value
> Your point about there only being a byte needed to encode truesize is
> spot on, and I think we could leverage this to eliminate the out-of-band
> metadata ring entirely. If we were willing to change the packet buffer
> alignment from L1_CACHE_BYTES to 256 (or min (256, L1_CACHE_SIZE)),

I think you mean max here.

> we
> could encode the truesize in the least significant 8 bits of the buffer
> address (encoded as truesize >> 8 as we know all sizes are a multiple
> of 256). This would allow packet buffers up to 64KB in length.
>
> Is there another approach you would prefer to any of these? If the
> cleanliness issues and larger alignment aren't too bad, I think (2)
> sounds promising and allow us to eliminate the metadata ring
> entirely while still permitting RX frag coalescing.
> 
> Best,
> 
> Mike

I agree, this sounds like a better approach. It's certainly no worse than
current net-next code that always allocates about 1.5K,
and struct sk_buff itself is about 256 bytes on 64 bit.

-- 
MST

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-07  5:25 [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill Michael Dalton
  2014-01-07  5:25 ` [PATCH net-next v2 2/4] virtio-net: use per-receive queue page frag alloc for mergeable bufs Michael Dalton
  2014-01-07  5:25 ` [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance Michael Dalton
@ 2014-01-07  5:25 ` Michael Dalton
  2014-01-08  6:34   ` Jason Wang
  2014-01-08 18:24   ` Michael S. Tsirkin
  2014-01-08 18:08 ` [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill Michael S. Tsirkin
  3 siblings, 2 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-07  5:25 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet

Add initial support for debugfs to virtio-net. Each virtio-net network
device will have a directory under /virtio-net in debugfs. The
per-network device directory will contain one sub-directory per active,
enabled receive queue. If mergeable receive buffers are enabled, each
receive queue directory will contain a read-only file that returns the
current packet buffer size for the receive queue.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 drivers/net/virtio_net.c | 314 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 296 insertions(+), 18 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f6e1ee0..5da18d6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -27,6 +27,9 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 #include <linux/average.h>
+#include <linux/seqlock.h>
+#include <linux/kref.h>
+#include <linux/debugfs.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -35,6 +38,9 @@ static bool csum = true, gso = true;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 
+/* Debugfs root directory for all virtio-net devices. */
+static struct dentry *virtnet_debugfs_root;
+
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
 #define GOOD_COPY_LEN	128
@@ -102,9 +108,6 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
-	/* Average packet length for mergeable receive buffers. */
-	struct ewma mrg_avg_pkt_len;
-
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -115,6 +118,28 @@ struct receive_queue {
 	char name[40];
 };
 
+/* Per-receive queue statistics exported via debugfs. */
+struct receive_queue_stats {
+	/* Average packet length of receive queue (for mergeable rx buffers). */
+	struct ewma avg_pkt_len;
+
+	/* Per-receive queue stats debugfs directory. */
+	struct dentry *dbg;
+
+	/* Reference count for the receive queue statistics, needed because
+	 * an open debugfs file may outlive the receive queue and netdevice.
+	 * Open files will remain in-use until all outstanding file descriptors
+	 * are closed, even after the underlying file is unlinked.
+	 */
+	struct kref refcount;
+
+	/* Sequence counter to allow debugfs readers to safely access stats.
+	 * Assumes a single virtio-net writer, which is enforced by virtio-net
+	 * and NAPI.
+	 */
+	seqcount_t dbg_seq;
+};
+
 struct virtnet_info {
 	struct virtio_device *vdev;
 	struct virtqueue *cvq;
@@ -147,6 +172,15 @@ struct virtnet_info {
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
+	/* Per-receive queue statstics exported via debugfs. Stored in
+	 * virtnet_info to survive freeze/restore -- a task may have a per-rq
+	 * debugfs file open at the time of freeze.
+	 */
+	struct receive_queue_stats **rq_stats;
+
+	/* Per-netdevice debugfs directory. */
+	struct dentry *dbg_dev_root;
+
 	/* Work struct for refilling if we run low on memory. */
 	struct delayed_work refill;
 
@@ -358,6 +392,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 					 unsigned int len)
 {
 	struct skb_vnet_hdr *hdr = ctx->buf;
+	struct virtnet_info *vi = netdev_priv(dev);
+	struct receive_queue_stats *rq_stats = vi->rq_stats[vq2rxq(rq->vq)];
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(ctx->buf);
 	int offset = ctx->buf - page_address(page);
@@ -413,7 +449,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 	}
 
-	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
+	write_seqcount_begin(&rq_stats->dbg_seq);
+	ewma_add(&rq_stats->avg_pkt_len, head_skb->len);
+	write_seqcount_end(&rq_stats->dbg_seq);
 	return head_skb;
 
 err_skb:
@@ -600,18 +638,30 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }
 
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
+{
+	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, L1_CACHE_BYTES);
+}
+
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
 	const unsigned int ring_size = rq->mrg_buf_ctx_size;
-	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
 	struct page_frag *alloc_frag = &rq->alloc_frag;
+	struct virtnet_info *vi = rq->vq->vdev->priv;
 	struct mergeable_receive_buf_ctx *ctx;
 	int err;
 	unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-	len = ALIGN(len, L1_CACHE_BYTES);
+	/* avg_pkt_len is written only in NAPI rx softirq context. We may
+	 * read avg_pkt_len without using the dbg_seq seqcount, as this code
+	 * is called only in NAPI rx softirq context or when NAPI is disabled.
+	 */
+	len = get_mergeable_buf_len(&vi->rq_stats[vq2rxq(rq->vq)]->avg_pkt_len);
 	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
 
@@ -1274,13 +1324,101 @@ static void virtnet_get_drvinfo(struct net_device *dev,
 
 }
 
+static ssize_t mergeable_rx_buffer_size_read(struct file *file,
+					     char __user *userbuf,
+					     size_t count,
+					     loff_t *ppos)
+{
+	struct receive_queue_stats *rq_stats = file->private_data;
+	char buf[32];
+	struct ewma avg;
+	unsigned int start, len;
+
+	/* Don't allow partial reads. */
+	if (*ppos)
+		return 0;
+	do {
+		start = read_seqcount_begin(&rq_stats->dbg_seq);
+		avg = rq_stats->avg_pkt_len;
+	} while (read_seqcount_retry(&rq_stats->dbg_seq, start));
+	len = scnprintf(buf, sizeof(buf), "%u\n", get_mergeable_buf_len(&avg));
+	return simple_read_from_buffer(userbuf, count, ppos, buf, len);
+}
+
+void receive_queue_stats_free(struct kref *ref)
+{
+	struct receive_queue_stats *rq_stats;
+
+	rq_stats = container_of(ref, struct receive_queue_stats, refcount);
+	kfree(rq_stats);
+}
+
+static int receive_queue_stats_debugfs_open(struct inode *inode,
+					    struct file *file)
+{
+	struct receive_queue_stats *rq_stats = inode->i_private;
+	kref_get(&rq_stats->refcount);
+	file->private_data = rq_stats;
+	return 0;
+}
+
+static int receive_queue_stats_debugfs_release(struct inode *inode,
+					       struct file *file)
+{
+	struct receive_queue_stats *rq_stats = inode->i_private;
+	kref_put(&rq_stats->refcount, receive_queue_stats_free);
+	file->private_data = NULL;
+	return 0;
+}
+
+static const struct file_operations mergeable_rx_buffer_size_fops = {
+	.owner = THIS_MODULE,
+	.open = receive_queue_stats_debugfs_open,
+	.read = mergeable_rx_buffer_size_read,
+	.llseek = default_llseek,
+	.release = receive_queue_stats_debugfs_release,
+};
+
+static void receive_queue_debugfs_add(struct receive_queue *rq)
+{
+	struct virtnet_info *vi = rq->vq->vdev->priv;
+	unsigned int rq_index = vq2rxq(rq->vq);
+	struct receive_queue_stats *rq_stats = vi->rq_stats[rq_index];
+	struct dentry *dentry;
+	char name[32];
+
+	if (IS_ERR_OR_NULL(vi->dbg_dev_root))
+		return;
+	scnprintf(name, sizeof(name), "rx-%u", rq_index);
+	dentry = debugfs_create_dir(name, vi->dbg_dev_root);
+	if (IS_ERR_OR_NULL(dentry)) {
+		pr_warn("%s: could not create %s rx queue debugfs dir\n",
+			vi->dev->name, name);
+		return;
+	}
+	rq_stats->dbg = dentry;
+	if (vi->mergeable_rx_bufs)
+		debugfs_create_file("mergeable_rx_buffer_size", S_IRUSR,
+				rq_stats->dbg, rq_stats,
+				&mergeable_rx_buffer_size_fops);
+}
+
+static void receive_queue_debugfs_del(struct receive_queue *rq)
+{
+	struct virtnet_info *vi = rq->vq->vdev->priv;
+	struct receive_queue_stats *rq_stats = vi->rq_stats[vq2rxq(rq->vq)];
+	debugfs_remove_recursive(rq_stats->dbg);
+	rq_stats->dbg = NULL;
+}
+
 /* TODO: Eliminate OOO packets during switching */
 static int virtnet_set_channels(struct net_device *dev,
 				struct ethtool_channels *channels)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	u16 queue_pairs = channels->combined_count;
-	int err;
+	u16 new_queue_pairs = channels->combined_count;
+	u16 old_queue_pairs = vi->curr_queue_pairs;
+	int err, i;
 
 	/* We don't support separate rx/tx channels.
 	 * We don't allow setting 'other' channels.
@@ -1288,14 +1426,21 @@ static int virtnet_set_channels(struct net_device *dev,
 	if (channels->rx_count || channels->tx_count || channels->other_count)
 		return -EINVAL;
 
-	if (queue_pairs > vi->max_queue_pairs)
+	if (new_queue_pairs > vi->max_queue_pairs)
 		return -EINVAL;
 
 	get_online_cpus();
-	err = virtnet_set_queues(vi, queue_pairs);
+	err = virtnet_set_queues(vi, new_queue_pairs);
 	if (!err) {
-		netif_set_real_num_tx_queues(dev, queue_pairs);
-		netif_set_real_num_rx_queues(dev, queue_pairs);
+		if (new_queue_pairs < old_queue_pairs) {
+			for (i = new_queue_pairs; i < old_queue_pairs; i++)
+				receive_queue_debugfs_del(&vi->rq[i]);
+		} else {
+			for (i = old_queue_pairs; i < new_queue_pairs; i++)
+				receive_queue_debugfs_add(&vi->rq[i]);
+		}
+		netif_set_real_num_tx_queues(dev, new_queue_pairs);
+		netif_set_real_num_rx_queues(dev, new_queue_pairs);
 
 		virtnet_set_affinity(vi);
 	}
@@ -1336,7 +1481,44 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+/* Must be called only after the net_device name has been expanded. */
+static void virtnet_debugfs_init(struct virtnet_info *vi)
+{
+	int i;
+
+	if (IS_ERR_OR_NULL(virtnet_debugfs_root))
+		return;
+	vi->dbg_dev_root = debugfs_create_dir(vi->dev->name,
+					      virtnet_debugfs_root);
+	if (IS_ERR_OR_NULL(vi->dbg_dev_root)) {
+		pr_warn("%s: could not create netdevice debugfs dir\n",
+			vi->dev->name);
+		return;
+	}
+	for (i = 0; i < vi->curr_queue_pairs; i++)
+		receive_queue_debugfs_add(&vi->rq[i]);
+}
+
+static void virtnet_debugfs_cleanup(struct virtnet_info *vi)
+{
+	int i;
+
+	for (i = 0; i < vi->max_queue_pairs; i++)
+		receive_queue_debugfs_del(&vi->rq[i]);
+	debugfs_remove_recursive(vi->dbg_dev_root);
+	vi->dbg_dev_root = NULL;
+}
+
+static int virtnet_init(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	virtnet_debugfs_init(vi);
+	return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
+	.ndo_init	     = virtnet_init,
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
 	.ndo_start_xmit      = start_xmit,
@@ -1560,7 +1742,6 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
-		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
 	}
 
@@ -1614,6 +1795,39 @@ err:
 	return ret;
 }
 
+static int virtnet_rename(struct notifier_block *this,
+			  unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct virtnet_info *vi;
+
+	if (event != NETDEV_CHANGENAME || dev->netdev_ops != &virtnet_netdev)
+		return NOTIFY_DONE;
+	vi = netdev_priv(dev);
+	if (IS_ERR_OR_NULL(vi->dbg_dev_root))
+		return NOTIFY_DONE;
+	if (IS_ERR_OR_NULL(debugfs_rename(virtnet_debugfs_root,
+					  vi->dbg_dev_root,
+					  virtnet_debugfs_root, dev->name))) {
+		pr_warn("%s: failed debugfs rename, removing old debugfs dir\n",
+			dev->name);
+		virtnet_debugfs_cleanup(vi);
+	}
+	return NOTIFY_DONE;
+}
+
+static void virtnet_release_receive_queue_stats(struct virtnet_info *vi)
+{
+	int i;
+
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		struct receive_queue_stats *rq_stats = vi->rq_stats[i];
+		if (rq_stats)
+			kref_put(&rq_stats->refcount, receive_queue_stats_free);
+	}
+	kfree(vi->rq_stats);
+}
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
@@ -1723,10 +1937,24 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vi->curr_queue_pairs = 1;
 	vi->max_queue_pairs = max_queue_pairs;
 
+	vi->rq_stats = kzalloc(sizeof(vi->rq_stats[0]) *
+			       vi->max_queue_pairs, GFP_KERNEL);
+	if (!vi->rq_stats)
+		goto free_dev_stats;
+	for (i = 0; i < vi->max_queue_pairs; i++) {
+		vi->rq_stats[i] = kzalloc(sizeof(*vi->rq_stats[0]), GFP_KERNEL);
+		if (!vi->rq_stats[i])
+			goto free_rq_stats;
+		seqcount_init(&vi->rq_stats[i]->dbg_seq);
+		kref_init(&vi->rq_stats[i]->refcount);
+		ewma_init(&vi->rq_stats[i]->avg_pkt_len, 1,
+			  RECEIVE_AVG_WEIGHT);
+	}
+
 	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
 	err = init_vqs(vi);
 	if (err)
-		goto free_stats;
+		goto free_rq_stats;
 
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
@@ -1777,8 +2005,11 @@ free_recv_bufs:
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
 	free_receive_page_frags(vi);
+	virtnet_debugfs_cleanup(vi);
 	virtnet_del_vqs(vi);
-free_stats:
+free_rq_stats:
+	virtnet_release_receive_queue_stats(vi);
+free_dev_stats:
 	free_percpu(vi->stats);
 free:
 	free_netdev(dev);
@@ -1812,10 +2043,12 @@ static void virtnet_remove(struct virtio_device *vdev)
 
 	unregister_netdev(vi->dev);
 
+	virtnet_debugfs_cleanup(vi);
 	remove_vq_common(vi);
 
 	flush_work(&vi->config_work);
 
+	virtnet_release_receive_queue_stats(vi);
 	free_percpu(vi->stats);
 	free_netdev(vi->dev);
 }
@@ -1884,6 +2117,19 @@ static int virtnet_restore(struct virtio_device *vdev)
 }
 #endif
 
+static void virtnet_register_debugfs(void)
+{
+	virtnet_debugfs_root = debugfs_create_dir("virtio-net", NULL);
+	if (IS_ERR_OR_NULL(virtnet_debugfs_root))
+		pr_warn("Could not create virtio-net debugfs dir\n");
+}
+
+static void virtnet_unregister_debugfs(void)
+{
+	debugfs_remove_recursive(virtnet_debugfs_root);
+	virtnet_debugfs_root = NULL;
+}
+
 static struct virtio_device_id id_table[] = {
 	{ VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
 	{ 0 },
@@ -1917,7 +2163,39 @@ static struct virtio_driver virtio_net_driver = {
 #endif
 };
 
-module_virtio_driver(virtio_net_driver);
+static struct notifier_block virtnet_rename_notifier = {
+	.notifier_call = virtnet_rename,
+};
+
+static int __init init(void)
+{
+	int err;
+
+	virtnet_register_debugfs();
+	err = register_netdevice_notifier(&virtnet_rename_notifier);
+	if (err)
+		goto free_debugfs;
+	err = register_virtio_driver(&virtio_net_driver);
+	if (err)
+		goto free_notifier;
+	return 0;
+
+free_notifier:
+	unregister_netdevice_notifier(&virtnet_rename_notifier);
+free_debugfs:
+	virtnet_unregister_debugfs();
+	return err;
+}
+
+static void __exit cleanup(void)
+{
+	unregister_virtio_driver(&virtio_net_driver);
+	unregister_netdevice_notifier(&virtnet_rename_notifier);
+	virtnet_unregister_debugfs();
+}
+
+module_init(init);
+module_exit(cleanup);
 
 MODULE_DEVICE_TABLE(virtio, id_table);
 MODULE_DESCRIPTION("Virtio network driver");
-- 
1.8.5.1

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-07  5:25 ` [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size Michael Dalton
@ 2014-01-08  6:34   ` Jason Wang
  2014-01-08 19:21     ` Michael S. Tsirkin
  2014-01-08 18:24   ` Michael S. Tsirkin
  1 sibling, 1 reply; 39+ messages in thread
From: Jason Wang @ 2014-01-08  6:34 UTC (permalink / raw)
  To: Michael Dalton, David S. Miller
  Cc: netdev, Eric Dumazet, virtualization, Michael S. Tsirkin

On 01/07/2014 01:25 PM, Michael Dalton wrote:
> Add initial support for debugfs to virtio-net. Each virtio-net network
> device will have a directory under /virtio-net in debugfs. The
> per-network device directory will contain one sub-directory per active,
> enabled receive queue. If mergeable receive buffers are enabled, each
> receive queue directory will contain a read-only file that returns the
> current packet buffer size for the receive queue.
>
> Signed-off-by: Michael Dalton <mwdalton@google.com>

This looks more complicated than expected. How about just adding an
entry in sysfs onto the existed network class device which looks more
simpler?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-08  6:34   ` Jason Wang
@ 2014-01-08 19:21     ` Michael S. Tsirkin
  2014-01-11  5:19       ` Michael Dalton
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-08 19:21 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael Dalton, netdev, virtualization, Eric Dumazet,
	David S. Miller

On Wed, Jan 08, 2014 at 02:34:31PM +0800, Jason Wang wrote:
> On 01/07/2014 01:25 PM, Michael Dalton wrote:
> > Add initial support for debugfs to virtio-net. Each virtio-net network
> > device will have a directory under /virtio-net in debugfs. The
> > per-network device directory will contain one sub-directory per active,
> > enabled receive queue. If mergeable receive buffers are enabled, each
> > receive queue directory will contain a read-only file that returns the
> > current packet buffer size for the receive queue.
> >
> > Signed-off-by: Michael Dalton <mwdalton@google.com>
> 
> This looks more complicated than expected. How about just adding an
> entry in sysfs onto the existed network class device which looks more
> simpler?

sysfs is part of userspace ABI, I think that's the main issue: can we
commit to this attribute being there in the future?
If yes we can use sysfs but maybe it seems reasonable to use debugfs for
a while until we are more sure of this.
I don't mind either way.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-08 19:21     ` Michael S. Tsirkin
@ 2014-01-11  5:19       ` Michael Dalton
  2014-01-11  5:36         ` Michael Dalton
  2014-01-12 17:09         ` Michael S. Tsirkin
  0 siblings, 2 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-11  5:19 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

Hi Jason, Michael

Sorry for the delay in response. Jason, I agree this patch ended up
being larger than expected. The major implementation parts are:
(1) Setup directory structure (driver/per-netdev/rx-queue directories)
(2) Network device renames (optional, so debugfs dir has the right name)
(3) Support resizing the # of RX queues (optional - we could just export
    max_queue_pairs files and not delete files if an RX queue is disabled)
(4) Reference counting - used in case someone opens a debugfs
    file and then removes the virtio-net device.
(5) The actual mergeable rx buffer file implementation itself. For now
    I have added a seqcount for memory safety, but if a read-only race
    condition is acceptable we could elide the seqcount. FWIW, the
    seqcount write in receive_mergeable() should, on modern x86,
    translate to two non-atomic adds and two compiler barriers, so
    overhead is not expected to be meaningful.

We can move to sysfs and this would simplify or eliminate much of the
above, including most of (1) - (4). I believe our choices for what to
do for the next patchset include:
(a) Use debugfs as is currently done, removing any optional features
listed above that are deemed unnecessary.

(b) Add a per-netdev sysfs attribute group to net_device->sysfs_groups.
Each attribute would display the mergeable packet buffer size for a given
RX queue, and there would be max_queue_pairs attributes in total. This
is already supported by net/core/net-sysfs.c:netdev_register_kobject(),
but means that we would have a static set of per-RX queue files for
all RX queues supported by the netdev, rather than dynamically displaying
only the files corresponding to enabled RX queues (e.g., when # of RX
queues is changed by  ethtool -L <device>).  For an example of this
approach, see drivers/net/bonding/bond_sysfs.c.

(c) Modify struct netdev_rx_queue to add virtio-net EWMA fields directly,
and modify net-sysfs.c to manage the new fields. Unlike (b), this approach
supports the RX queue resizing in (3) but means putting virtio-net info
in netdev_rx_queue, which currently has only device-independent fields.

My preference would be (b): try using sysfs and adding a device-specific
attribute group to the virtio-net netdevice (stored in the existing
'sysfs_groups' field and supported by net-sysfs).  This would avoid
adding virtio-net specific information to net-sysfs. What would you
prefer (or is there a better way than the approaches above)? Thanks!

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-11  5:19       ` Michael Dalton
@ 2014-01-11  5:36         ` Michael Dalton
  2014-01-12 17:09         ` Michael S. Tsirkin
  1 sibling, 0 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-11  5:36 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

Also, one other note: if we use sysfs, the directory structure will
be different depending on our chosen sysfs strategy. If we augment
netdev_rx_queue, the new attributes will be found in the standard
'rx-<N>' netdev subdirectory, e.g.,
/sys/class/net/eth0/queues/rx-0/mergeable_rx_buffer_size

Whereas if we use per-netdev attributes, our attributes would be in
/sys/class/net/eth0/<group name>/<attribute name>, which may be
less intuitive as AFAICT we'd have to indicate both the queue # and
type of value being reported using the attribute name. E.g.,
/sys/class/net/eth0/virtio-net/rx-0_mergeable_buffer_size.
That's somewhat less elegant.

I don't see an easy way to add new attributes to the 'rx-<N>'
subdirectories without directly modifying struct netdev_rx_queue,
so I think this is another tradeoff between the two sysfs approaches.

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-11  5:19       ` Michael Dalton
  2014-01-11  5:36         ` Michael Dalton
@ 2014-01-12 17:09         ` Michael S. Tsirkin
  2014-01-12 23:32           ` Michael Dalton
  1 sibling, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-12 17:09 UTC (permalink / raw)
  To: Michael Dalton; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

On Fri, Jan 10, 2014 at 09:19:37PM -0800, Michael Dalton wrote:
> Hi Jason, Michael
> 
> Sorry for the delay in response. Jason, I agree this patch ended up
> being larger than expected. The major implementation parts are:
> (1) Setup directory structure (driver/per-netdev/rx-queue directories)
> (2) Network device renames (optional, so debugfs dir has the right name)
> (3) Support resizing the # of RX queues (optional - we could just export
>     max_queue_pairs files and not delete files if an RX queue is disabled)
> (4) Reference counting - used in case someone opens a debugfs
>     file and then removes the virtio-net device.
> (5) The actual mergeable rx buffer file implementation itself. For now
>     I have added a seqcount for memory safety, but if a read-only race
>     condition is acceptable we could elide the seqcount. FWIW, the
>     seqcount write in receive_mergeable() should, on modern x86,
>     translate to two non-atomic adds and two compiler barriers, so
>     overhead is not expected to be meaningful.
> 
> We can move to sysfs and this would simplify or eliminate much of the
> above, including most of (1) - (4). I believe our choices for what to
> do for the next patchset include:
> (a) Use debugfs as is currently done, removing any optional features
> listed above that are deemed unnecessary.
> 
> (b) Add a per-netdev sysfs attribute group to net_device->sysfs_groups.
> Each attribute would display the mergeable packet buffer size for a given
> RX queue, and there would be max_queue_pairs attributes in total. This
> is already supported by net/core/net-sysfs.c:netdev_register_kobject(),
> but means that we would have a static set of per-RX queue files for
> all RX queues supported by the netdev, rather than dynamically displaying
> only the files corresponding to enabled RX queues (e.g., when # of RX
> queues is changed by  ethtool -L <device>).  For an example of this
> approach, see drivers/net/bonding/bond_sysfs.c.
> 
> (c) Modify struct netdev_rx_queue to add virtio-net EWMA fields directly,
> and modify net-sysfs.c to manage the new fields. Unlike (b), this approach
> supports the RX queue resizing in (3) but means putting virtio-net info
> in netdev_rx_queue, which currently has only device-independent fields.

Can't we add struct attribute * to netdevice, and pass that in when
creating the kobj?

> My preference would be (b): try using sysfs and adding a device-specific
> attribute group to the virtio-net netdevice (stored in the existing
> 'sysfs_groups' field and supported by net-sysfs).  This would avoid
> adding virtio-net specific information to net-sysfs. What would you
> prefer (or is there a better way than the approaches above)? Thanks!
> 
> Best,
> 
> Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-12 17:09         ` Michael S. Tsirkin
@ 2014-01-12 23:32           ` Michael Dalton
  2014-01-13  7:36             ` Jason Wang
  2014-01-13  9:40             ` Michael S. Tsirkin
  0 siblings, 2 replies; 39+ messages in thread
From: Michael Dalton @ 2014-01-12 23:32 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

Hi Michael,

On Sun, Jan 12, 2014 at 9:09 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> Can't we add struct attribute * to netdevice, and pass that in when
> creating the kobj?

I like that idea, I think that will work and should be better than
the alternatives. The actual kobjs for RX queues (struct netdev_rx_queue)
are allocated and deallocated by calls to net_rx_queue_update_kobjects,
which resizes RX queue kobjects when the netdev RX queues are resized.

Is this what you had in mind:
(1) Add a pointer to an attribute group to struct net_device, used for
    per-netdev rx queue attributes and initialized before the call to
    register_netdevice().
(2) Declare an attribute group containing the mergeable_rx_buffer_size
    attribute in virtio-net, and initialize the per-netdevice group pointer
    to the address of this group in virtnet_probe before register_netdevice
(3) In net-sysfs, modify net_rx_queue_update_kobjects
    (or rx_queue_add_kobject) to call sysfs_create_group on the
    per-netdev attribute group (if non-NULL), adding the attributes in
    the group to the RX queue kobject.

That should allow us to have per-RX queue attributes that are
device-specific. I'm not a sysfs expert, but it seems that rx_queue_ktype
and rx_queue_sysfs_ops presume that all rx queue sysfs operations are
performed on attributes of type rx_queue_attribute. That type will need
to be moved from net-sysfs.c to a header file like netdevice.h so that
the type can be used in virtio-net when we declare the
mergeable_rx_buffer_size attribute.

The last issue is how the rx_queue_attribute 'show' function
implementation for mergeable_rx_buffer_size will access the appropriate
per-receive queue EWMA data. The arguments to the show function will be
the netdev_rx_queue and the attribute itself. We can get to the
struct net_device from the netdev_rx_queue.  If we extended
netdev_rx_queue to indicate the queue_index or to store a void *priv_data
pointer, that would be sufficient to allow us to resolve this issue.

Please let me know if the above sounds good or if you see a better way
to accomplish this goal. Thanks!

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-12 23:32           ` Michael Dalton
@ 2014-01-13  7:36             ` Jason Wang
  2014-01-13  9:40             ` Michael S. Tsirkin
  1 sibling, 0 replies; 39+ messages in thread
From: Jason Wang @ 2014-01-13  7:36 UTC (permalink / raw)
  To: Michael Dalton, Michael S. Tsirkin
  Cc: netdev, Eric Dumazet, David S. Miller, lf-virt

On 01/13/2014 07:32 AM, Michael Dalton wrote:
> Hi Michael,
>
> On Sun, Jan 12, 2014 at 9:09 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> Can't we add struct attribute * to netdevice, and pass that in when
>> creating the kobj?
> I like that idea, I think that will work and should be better than
> the alternatives. The actual kobjs for RX queues (struct netdev_rx_queue)
> are allocated and deallocated by calls to net_rx_queue_update_kobjects,
> which resizes RX queue kobjects when the netdev RX queues are resized.
>
> Is this what you had in mind:
> (1) Add a pointer to an attribute group to struct net_device, used for
>     per-netdev rx queue attributes and initialized before the call to
>     register_netdevice().
> (2) Declare an attribute group containing the mergeable_rx_buffer_size
>     attribute in virtio-net, and initialize the per-netdevice group pointer
>     to the address of this group in virtnet_probe before register_netdevice
> (3) In net-sysfs, modify net_rx_queue_update_kobjects
>     (or rx_queue_add_kobject) to call sysfs_create_group on the
>     per-netdev attribute group (if non-NULL), adding the attributes in
>     the group to the RX queue kobject.
>
> That should allow us to have per-RX queue attributes that are
> device-specific. I'm not a sysfs expert, but it seems that rx_queue_ktype
> and rx_queue_sysfs_ops presume that all rx queue sysfs operations are
> performed on attributes of type rx_queue_attribute. That type will need
> to be moved from net-sysfs.c to a header file like netdevice.h so that
> the type can be used in virtio-net when we declare the
> mergeable_rx_buffer_size attribute.

There's a possible issue, rx queue sysfs depends on CONFIG_RPS. So we
probably need a dedicated attribute just for virtio-net.
>
> The last issue is how the rx_queue_attribute 'show' function
> implementation for mergeable_rx_buffer_size will access the appropriate
> per-receive queue EWMA data. The arguments to the show function will be
> the netdev_rx_queue and the attribute itself. We can get to the
> struct net_device from the netdev_rx_queue.  If we extended
> netdev_rx_queue to indicate the queue_index or to store a void *priv_data
> pointer, that would be sufficient to allow us to resolve this issue.
>
> Please let me know if the above sounds good or if you see a better way
> to accomplish this goal. Thanks!
>
> Best,
>
> Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-12 23:32           ` Michael Dalton
  2014-01-13  7:36             ` Jason Wang
@ 2014-01-13  9:40             ` Michael S. Tsirkin
  2014-01-13 15:38               ` Ben Hutchings
  1 sibling, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-13  9:40 UTC (permalink / raw)
  To: Michael Dalton; +Cc: netdev, lf-virt, Eric Dumazet, David S. Miller

On Sun, Jan 12, 2014 at 03:32:28PM -0800, Michael Dalton wrote:
> Hi Michael,
> 
> On Sun, Jan 12, 2014 at 9:09 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > Can't we add struct attribute * to netdevice, and pass that in when
> > creating the kobj?
> 
> I like that idea, I think that will work and should be better than
> the alternatives. The actual kobjs for RX queues (struct netdev_rx_queue)
> are allocated and deallocated by calls to net_rx_queue_update_kobjects,
> which resizes RX queue kobjects when the netdev RX queues are resized.
> 
> Is this what you had in mind:
> (1) Add a pointer to an attribute group to struct net_device, used for
>     per-netdev rx queue attributes and initialized before the call to
>     register_netdevice().
> (2) Declare an attribute group containing the mergeable_rx_buffer_size
>     attribute in virtio-net, and initialize the per-netdevice group pointer
>     to the address of this group in virtnet_probe before register_netdevice
> (3) In net-sysfs, modify net_rx_queue_update_kobjects
>     (or rx_queue_add_kobject) to call sysfs_create_group on the
>     per-netdev attribute group (if non-NULL), adding the attributes in
>     the group to the RX queue kobject.


Exactly.

> That should allow us to have per-RX queue attributes that are
> device-specific. I'm not a sysfs expert, but it seems that rx_queue_ktype
> and rx_queue_sysfs_ops presume that all rx queue sysfs operations are
> performed on attributes of type rx_queue_attribute. That type will need
> to be moved from net-sysfs.c to a header file like netdevice.h so that
> the type can be used in virtio-net when we declare the
> mergeable_rx_buffer_size attribute.
> 
> The last issue is how the rx_queue_attribute 'show' function
> implementation for mergeable_rx_buffer_size will access the appropriate
> per-receive queue EWMA data. The arguments to the show function will be
> the netdev_rx_queue and the attribute itself. We can get to the
> struct net_device from the netdev_rx_queue.  If we extended
> netdev_rx_queue to indicate the queue_index or to store a void *priv_data
> pointer, that would be sufficient to allow us to resolve this issue.

Hmm netdev_rx_queue is not defined unless CONFIG_RPS is set.
Maybe we should use a different structure.


> Please let me know if the above sounds good or if you see a better way
> to accomplish this goal. Thanks!
> 
> Best,
> 
> Mike

Sounds good to me.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-13  9:40             ` Michael S. Tsirkin
@ 2014-01-13 15:38               ` Ben Hutchings
  2014-01-13 19:07                 ` Michael Dalton
  0 siblings, 1 reply; 39+ messages in thread
From: Ben Hutchings @ 2014-01-13 15:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michael Dalton, Jason Wang, David S. Miller, netdev, Eric Dumazet,
	Rusty Russell, lf-virt

On Mon, 2014-01-13 at 11:40 +0200, Michael S. Tsirkin wrote:
> On Sun, Jan 12, 2014 at 03:32:28PM -0800, Michael Dalton wrote:
[...]
> > The last issue is how the rx_queue_attribute 'show' function
> > implementation for mergeable_rx_buffer_size will access the appropriate
> > per-receive queue EWMA data. The arguments to the show function will be
> > the netdev_rx_queue and the attribute itself. We can get to the
> > struct net_device from the netdev_rx_queue.  If we extended
> > netdev_rx_queue to indicate the queue_index or to store a void *priv_data
> > pointer, that would be sufficient to allow us to resolve this issue.
> 
> Hmm netdev_rx_queue is not defined unless CONFIG_RPS is set.
> Maybe we should use a different structure.
[...]

I don't think RPS should own this structure.  It's just that there are
currently no per-RX-queue attributes other than those defined by RPS.

By the way, CONFIG_RPS is equivalent to CONFIG_SMP so will likely be
enabled already in most places where virtio_net is used.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-13 15:38               ` Ben Hutchings
@ 2014-01-13 19:07                 ` Michael Dalton
  2014-01-13 19:19                   ` Michael Dalton
  0 siblings, 1 reply; 39+ messages in thread
From: Michael Dalton @ 2014-01-13 19:07 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Michael S. Tsirkin, netdev, lf-virt, Eric Dumazet,
	David S. Miller

On Mon, Jan 13, 2014 at 7:38 AM, Ben Hutchings <bhutchings@solarflare.com>
wrote:
> I don't think RPS should own this structure.  It's just that there are
> currently no per-RX-queue attributes other than those defined by RPS.

Agreed, there is useful attribute-independent functionality already
built around netdev_rx_queue - e.g., dynamically resizing the rx queue
kobjs as the number of RX queues enabled for the netdev is changed. While
the current attributes happen to be used only by RPS, AFAICT it seems
RPS should not own netdev_rx_queue but rather should own the RPS-specific
fields themselves within netdev_rx_queue.

If there are no objections, it seems like I could modify
netdev_rx_queue and related functionality so that their existence does
not depend on CONFIG_RPS, and instead just have CONFIG_RPS control
whether or not the RPS-specific attributes/fields are present.

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-13 19:07                 ` Michael Dalton
@ 2014-01-13 19:19                   ` Michael Dalton
  2014-01-14 21:45                     ` Michael Dalton
  0 siblings, 1 reply; 39+ messages in thread
From: Michael Dalton @ 2014-01-13 19:19 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Michael S. Tsirkin, netdev, lf-virt, Eric Dumazet,
	David S. Miller

Sorry I missed this important piece of information, it appears that
netdev_queue (the TX equivalent of netdev_rx_queue) already has
decoupled itself from CONFIG_XPS due to an attribute,
queue_trans_timeout, that does not depend on XPS functionality. So it
seems that something somewhat equivalent has already happened on the
TX side.

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-13 19:19                   ` Michael Dalton
@ 2014-01-14 21:45                     ` Michael Dalton
  2014-01-14 21:53                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: Michael Dalton @ 2014-01-14 21:45 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Michael S. Tsirkin, netdev, lf-virt, Eric Dumazet,
	David S. Miller

I'd like to confirm the preferred sysfs path structure for mergeable
receive buffers. Is 'mergeable_rx_buffer_size' the right attribute name
to use or is there a strong preference for a different name?

I believe the current approach proposed for the next patchset is to use a
per-netdev attribute group which we will add to the receive
queue kobj (struct netdev_rx_queue). That leaves us with at
least two options:
  (1) Name the attribute group something, e.g., 'virtio-net', in which
      case all virtio-net attributes for eth0 queue N will be of
      the form:
      /sys/class/net/eth0/queues/rx-N/virtio-net/<attribute name>

  (2) Do not name the attribute group (leave the name NULL), in which
      case AFAICT virtio-net and device-independent attributes would be
      mixed without any indication. For example, all virtio-net
      attributes for netdev eth0 queue N would be of the form:
      /sys/class/net/eth0/queues/rx-N/<attribute name>

FWIW, the bonding netdev has a similar sysfs issue and uses a per-netdev
attribute group (stored in the 'sysfs_groups' field of struct netdevice)
In the case of bonding, the attribute group is named, so
device-independent netdev attributes are found in
/sys/class/net/eth0/<attribute name> while bonding attributes are placed
in /sys/class/net/eth0/bonding/<attribute name>.

So it seems like there is some precedent for using an attribute group
name corresponding to the driver name. Does using an attribute group
name of 'virtio-net' sound good or would an empty or different attribute
group name be preferred?

Best,

Mike

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-14 21:45                     ` Michael Dalton
@ 2014-01-14 21:53                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-14 21:53 UTC (permalink / raw)
  To: Michael Dalton
  Cc: netdev, lf-virt, Eric Dumazet, Ben Hutchings, David S. Miller

On Tue, Jan 14, 2014 at 01:45:42PM -0800, Michael Dalton wrote:
> I'd like to confirm the preferred sysfs path structure for mergeable
> receive buffers. Is 'mergeable_rx_buffer_size' the right attribute name
> to use or is there a strong preference for a different name?
> 
> I believe the current approach proposed for the next patchset is to use a
> per-netdev attribute group which we will add to the receive
> queue kobj (struct netdev_rx_queue). That leaves us with at
> least two options:
>   (1) Name the attribute group something, e.g., 'virtio-net', in which
>       case all virtio-net attributes for eth0 queue N will be of
>       the form:
>       /sys/class/net/eth0/queues/rx-N/virtio-net/<attribute name>
> 
>   (2) Do not name the attribute group (leave the name NULL), in which
>       case AFAICT virtio-net and device-independent attributes would be
>       mixed without any indication. For example, all virtio-net
>       attributes for netdev eth0 queue N would be of the form:
>       /sys/class/net/eth0/queues/rx-N/<attribute name>
> 
> FWIW, the bonding netdev has a similar sysfs issue and uses a per-netdev
> attribute group (stored in the 'sysfs_groups' field of struct netdevice)
> In the case of bonding, the attribute group is named, so
> device-independent netdev attributes are found in
> /sys/class/net/eth0/<attribute name> while bonding attributes are placed
> in /sys/class/net/eth0/bonding/<attribute name>.
> 
> So it seems like there is some precedent for using an attribute group
> name corresponding to the driver name. Does using an attribute group
> name of 'virtio-net' sound good or would an empty or different attribute
> group name be preferred?
> 
> Best,
> 
> Mike

I'm guessing we should follow the bonding example.
What do others think?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size
  2014-01-07  5:25 ` [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size Michael Dalton
  2014-01-08  6:34   ` Jason Wang
@ 2014-01-08 18:24   ` Michael S. Tsirkin
  1 sibling, 0 replies; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-08 18:24 UTC (permalink / raw)
  To: Michael Dalton; +Cc: netdev, virtualization, Eric Dumazet, David S. Miller

On Mon, Jan 06, 2014 at 09:25:55PM -0800, Michael Dalton wrote:
> Add initial support for debugfs to virtio-net. Each virtio-net network
> device will have a directory under /virtio-net in debugfs. The
> per-network device directory will contain one sub-directory per active,
> enabled receive queue. If mergeable receive buffers are enabled, each
> receive queue directory will contain a read-only file that returns the
> current packet buffer size for the receive queue.
> 
> Signed-off-by: Michael Dalton <mwdalton@google.com>

thanks, I'll play with it.
Could you tell us meanwhile, what's the typical size that you see?

> ---
>  drivers/net/virtio_net.c | 314 ++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 296 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f6e1ee0..5da18d6 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -27,6 +27,9 @@
>  #include <linux/slab.h>
>  #include <linux/cpu.h>
>  #include <linux/average.h>
> +#include <linux/seqlock.h>
> +#include <linux/kref.h>
> +#include <linux/debugfs.h>
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -35,6 +38,9 @@ static bool csum = true, gso = true;
>  module_param(csum, bool, 0444);
>  module_param(gso, bool, 0444);
>  
> +/* Debugfs root directory for all virtio-net devices. */
> +static struct dentry *virtnet_debugfs_root;
> +
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
>  #define GOOD_COPY_LEN	128
> @@ -102,9 +108,6 @@ struct receive_queue {
>  	/* Chain pages by the private ptr. */
>  	struct page *pages;
>  
> -	/* Average packet length for mergeable receive buffers. */
> -	struct ewma mrg_avg_pkt_len;
> -
>  	/* Page frag for packet buffer allocation. */
>  	struct page_frag alloc_frag;
>  
> @@ -115,6 +118,28 @@ struct receive_queue {
>  	char name[40];
>  };
>  
> +/* Per-receive queue statistics exported via debugfs. */
> +struct receive_queue_stats {
> +	/* Average packet length of receive queue (for mergeable rx buffers). */
> +	struct ewma avg_pkt_len;
> +
> +	/* Per-receive queue stats debugfs directory. */
> +	struct dentry *dbg;
> +
> +	/* Reference count for the receive queue statistics, needed because
> +	 * an open debugfs file may outlive the receive queue and netdevice.
> +	 * Open files will remain in-use until all outstanding file descriptors
> +	 * are closed, even after the underlying file is unlinked.
> +	 */
> +	struct kref refcount;
> +
> +	/* Sequence counter to allow debugfs readers to safely access stats.
> +	 * Assumes a single virtio-net writer, which is enforced by virtio-net
> +	 * and NAPI.
> +	 */
> +	seqcount_t dbg_seq;
> +};
> +
>  struct virtnet_info {
>  	struct virtio_device *vdev;
>  	struct virtqueue *cvq;
> @@ -147,6 +172,15 @@ struct virtnet_info {
>  	/* Active statistics */
>  	struct virtnet_stats __percpu *stats;
>  
> +	/* Per-receive queue statstics exported via debugfs. Stored in
> +	 * virtnet_info to survive freeze/restore -- a task may have a per-rq
> +	 * debugfs file open at the time of freeze.
> +	 */
> +	struct receive_queue_stats **rq_stats;
> +
> +	/* Per-netdevice debugfs directory. */
> +	struct dentry *dbg_dev_root;
> +
>  	/* Work struct for refilling if we run low on memory. */
>  	struct delayed_work refill;
>  
> @@ -358,6 +392,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  					 unsigned int len)
>  {
>  	struct skb_vnet_hdr *hdr = ctx->buf;
> +	struct virtnet_info *vi = netdev_priv(dev);
> +	struct receive_queue_stats *rq_stats = vi->rq_stats[vq2rxq(rq->vq)];
>  	int num_buf = hdr->mhdr.num_buffers;
>  	struct page *page = virt_to_head_page(ctx->buf);
>  	int offset = ctx->buf - page_address(page);
> @@ -413,7 +449,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		}
>  	}
>  
> -	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
> +	write_seqcount_begin(&rq_stats->dbg_seq);
> +	ewma_add(&rq_stats->avg_pkt_len, head_skb->len);
> +	write_seqcount_end(&rq_stats->dbg_seq);
>  	return head_skb;
>  
>  err_skb:
> @@ -600,18 +638,30 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  	return err;
>  }
>  
> +static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
> +{
> +	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	unsigned int len;
> +
> +	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
> +				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
> +	return ALIGN(len, L1_CACHE_BYTES);
> +}
> +
>  static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
>  	const unsigned int ring_size = rq->mrg_buf_ctx_size;
> -	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>  	struct page_frag *alloc_frag = &rq->alloc_frag;
> +	struct virtnet_info *vi = rq->vq->vdev->priv;
>  	struct mergeable_receive_buf_ctx *ctx;
>  	int err;
>  	unsigned int len, hole;
>  
> -	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
> -				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
> -	len = ALIGN(len, L1_CACHE_BYTES);
> +	/* avg_pkt_len is written only in NAPI rx softirq context. We may
> +	 * read avg_pkt_len without using the dbg_seq seqcount, as this code
> +	 * is called only in NAPI rx softirq context or when NAPI is disabled.
> +	 */
> +	len = get_mergeable_buf_len(&vi->rq_stats[vq2rxq(rq->vq)]->avg_pkt_len);
>  	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
>  		return -ENOMEM;
>  
> @@ -1274,13 +1324,101 @@ static void virtnet_get_drvinfo(struct net_device *dev,
>  
>  }
>  
> +static ssize_t mergeable_rx_buffer_size_read(struct file *file,
> +					     char __user *userbuf,
> +					     size_t count,
> +					     loff_t *ppos)
> +{
> +	struct receive_queue_stats *rq_stats = file->private_data;
> +	char buf[32];
> +	struct ewma avg;
> +	unsigned int start, len;
> +
> +	/* Don't allow partial reads. */
> +	if (*ppos)
> +		return 0;
> +	do {
> +		start = read_seqcount_begin(&rq_stats->dbg_seq);
> +		avg = rq_stats->avg_pkt_len;
> +	} while (read_seqcount_retry(&rq_stats->dbg_seq, start));
> +	len = scnprintf(buf, sizeof(buf), "%u\n", get_mergeable_buf_len(&avg));
> +	return simple_read_from_buffer(userbuf, count, ppos, buf, len);
> +}
> +
> +void receive_queue_stats_free(struct kref *ref)
> +{
> +	struct receive_queue_stats *rq_stats;
> +
> +	rq_stats = container_of(ref, struct receive_queue_stats, refcount);
> +	kfree(rq_stats);
> +}
> +
> +static int receive_queue_stats_debugfs_open(struct inode *inode,
> +					    struct file *file)
> +{
> +	struct receive_queue_stats *rq_stats = inode->i_private;
> +	kref_get(&rq_stats->refcount);
> +	file->private_data = rq_stats;
> +	return 0;
> +}
> +
> +static int receive_queue_stats_debugfs_release(struct inode *inode,
> +					       struct file *file)
> +{
> +	struct receive_queue_stats *rq_stats = inode->i_private;
> +	kref_put(&rq_stats->refcount, receive_queue_stats_free);
> +	file->private_data = NULL;
> +	return 0;
> +}
> +
> +static const struct file_operations mergeable_rx_buffer_size_fops = {
> +	.owner = THIS_MODULE,
> +	.open = receive_queue_stats_debugfs_open,
> +	.read = mergeable_rx_buffer_size_read,
> +	.llseek = default_llseek,
> +	.release = receive_queue_stats_debugfs_release,
> +};
> +
> +static void receive_queue_debugfs_add(struct receive_queue *rq)
> +{
> +	struct virtnet_info *vi = rq->vq->vdev->priv;
> +	unsigned int rq_index = vq2rxq(rq->vq);
> +	struct receive_queue_stats *rq_stats = vi->rq_stats[rq_index];
> +	struct dentry *dentry;
> +	char name[32];
> +
> +	if (IS_ERR_OR_NULL(vi->dbg_dev_root))
> +		return;
> +	scnprintf(name, sizeof(name), "rx-%u", rq_index);
> +	dentry = debugfs_create_dir(name, vi->dbg_dev_root);
> +	if (IS_ERR_OR_NULL(dentry)) {
> +		pr_warn("%s: could not create %s rx queue debugfs dir\n",
> +			vi->dev->name, name);
> +		return;
> +	}
> +	rq_stats->dbg = dentry;
> +	if (vi->mergeable_rx_bufs)
> +		debugfs_create_file("mergeable_rx_buffer_size", S_IRUSR,
> +				rq_stats->dbg, rq_stats,
> +				&mergeable_rx_buffer_size_fops);
> +}
> +
> +static void receive_queue_debugfs_del(struct receive_queue *rq)
> +{
> +	struct virtnet_info *vi = rq->vq->vdev->priv;
> +	struct receive_queue_stats *rq_stats = vi->rq_stats[vq2rxq(rq->vq)];
> +	debugfs_remove_recursive(rq_stats->dbg);
> +	rq_stats->dbg = NULL;
> +}
> +
>  /* TODO: Eliminate OOO packets during switching */
>  static int virtnet_set_channels(struct net_device *dev,
>  				struct ethtool_channels *channels)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> -	u16 queue_pairs = channels->combined_count;
> -	int err;
> +	u16 new_queue_pairs = channels->combined_count;
> +	u16 old_queue_pairs = vi->curr_queue_pairs;
> +	int err, i;
>  
>  	/* We don't support separate rx/tx channels.
>  	 * We don't allow setting 'other' channels.
> @@ -1288,14 +1426,21 @@ static int virtnet_set_channels(struct net_device *dev,
>  	if (channels->rx_count || channels->tx_count || channels->other_count)
>  		return -EINVAL;
>  
> -	if (queue_pairs > vi->max_queue_pairs)
> +	if (new_queue_pairs > vi->max_queue_pairs)
>  		return -EINVAL;
>  
>  	get_online_cpus();
> -	err = virtnet_set_queues(vi, queue_pairs);
> +	err = virtnet_set_queues(vi, new_queue_pairs);
>  	if (!err) {
> -		netif_set_real_num_tx_queues(dev, queue_pairs);
> -		netif_set_real_num_rx_queues(dev, queue_pairs);
> +		if (new_queue_pairs < old_queue_pairs) {
> +			for (i = new_queue_pairs; i < old_queue_pairs; i++)
> +				receive_queue_debugfs_del(&vi->rq[i]);
> +		} else {
> +			for (i = old_queue_pairs; i < new_queue_pairs; i++)
> +				receive_queue_debugfs_add(&vi->rq[i]);
> +		}
> +		netif_set_real_num_tx_queues(dev, new_queue_pairs);
> +		netif_set_real_num_rx_queues(dev, new_queue_pairs);
>  
>  		virtnet_set_affinity(vi);
>  	}
> @@ -1336,7 +1481,44 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu)
>  	return 0;
>  }
>  
> +/* Must be called only after the net_device name has been expanded. */
> +static void virtnet_debugfs_init(struct virtnet_info *vi)
> +{
> +	int i;
> +
> +	if (IS_ERR_OR_NULL(virtnet_debugfs_root))
> +		return;
> +	vi->dbg_dev_root = debugfs_create_dir(vi->dev->name,
> +					      virtnet_debugfs_root);
> +	if (IS_ERR_OR_NULL(vi->dbg_dev_root)) {
> +		pr_warn("%s: could not create netdevice debugfs dir\n",
> +			vi->dev->name);
> +		return;
> +	}
> +	for (i = 0; i < vi->curr_queue_pairs; i++)
> +		receive_queue_debugfs_add(&vi->rq[i]);
> +}
> +
> +static void virtnet_debugfs_cleanup(struct virtnet_info *vi)
> +{
> +	int i;
> +
> +	for (i = 0; i < vi->max_queue_pairs; i++)
> +		receive_queue_debugfs_del(&vi->rq[i]);
> +	debugfs_remove_recursive(vi->dbg_dev_root);
> +	vi->dbg_dev_root = NULL;
> +}
> +
> +static int virtnet_init(struct net_device *dev)
> +{
> +	struct virtnet_info *vi = netdev_priv(dev);
> +
> +	virtnet_debugfs_init(vi);
> +	return 0;
> +}
> +
>  static const struct net_device_ops virtnet_netdev = {
> +	.ndo_init	     = virtnet_init,
>  	.ndo_open            = virtnet_open,
>  	.ndo_stop   	     = virtnet_close,
>  	.ndo_start_xmit      = start_xmit,
> @@ -1560,7 +1742,6 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>  			       napi_weight);
>  
>  		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
> -		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
>  		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
>  	}
>  
> @@ -1614,6 +1795,39 @@ err:
>  	return ret;
>  }
>  
> +static int virtnet_rename(struct notifier_block *this,
> +			  unsigned long event, void *ptr)
> +{
> +	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> +	struct virtnet_info *vi;
> +
> +	if (event != NETDEV_CHANGENAME || dev->netdev_ops != &virtnet_netdev)
> +		return NOTIFY_DONE;
> +	vi = netdev_priv(dev);
> +	if (IS_ERR_OR_NULL(vi->dbg_dev_root))
> +		return NOTIFY_DONE;
> +	if (IS_ERR_OR_NULL(debugfs_rename(virtnet_debugfs_root,
> +					  vi->dbg_dev_root,
> +					  virtnet_debugfs_root, dev->name))) {
> +		pr_warn("%s: failed debugfs rename, removing old debugfs dir\n",
> +			dev->name);
> +		virtnet_debugfs_cleanup(vi);
> +	}
> +	return NOTIFY_DONE;
> +}
> +
> +static void virtnet_release_receive_queue_stats(struct virtnet_info *vi)
> +{
> +	int i;
> +
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
> +		struct receive_queue_stats *rq_stats = vi->rq_stats[i];
> +		if (rq_stats)
> +			kref_put(&rq_stats->refcount, receive_queue_stats_free);
> +	}
> +	kfree(vi->rq_stats);
> +}
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>  	int i, err;
> @@ -1723,10 +1937,24 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	vi->curr_queue_pairs = 1;
>  	vi->max_queue_pairs = max_queue_pairs;
>  
> +	vi->rq_stats = kzalloc(sizeof(vi->rq_stats[0]) *
> +			       vi->max_queue_pairs, GFP_KERNEL);
> +	if (!vi->rq_stats)
> +		goto free_dev_stats;
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
> +		vi->rq_stats[i] = kzalloc(sizeof(*vi->rq_stats[0]), GFP_KERNEL);
> +		if (!vi->rq_stats[i])
> +			goto free_rq_stats;
> +		seqcount_init(&vi->rq_stats[i]->dbg_seq);
> +		kref_init(&vi->rq_stats[i]->refcount);
> +		ewma_init(&vi->rq_stats[i]->avg_pkt_len, 1,
> +			  RECEIVE_AVG_WEIGHT);
> +	}
> +
>  	/* Allocate/initialize the rx/tx queues, and invoke find_vqs */
>  	err = init_vqs(vi);
>  	if (err)
> -		goto free_stats;
> +		goto free_rq_stats;
>  
>  	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
>  	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
> @@ -1777,8 +2005,11 @@ free_recv_bufs:
>  free_vqs:
>  	cancel_delayed_work_sync(&vi->refill);
>  	free_receive_page_frags(vi);
> +	virtnet_debugfs_cleanup(vi);
>  	virtnet_del_vqs(vi);
> -free_stats:
> +free_rq_stats:
> +	virtnet_release_receive_queue_stats(vi);
> +free_dev_stats:
>  	free_percpu(vi->stats);
>  free:
>  	free_netdev(dev);
> @@ -1812,10 +2043,12 @@ static void virtnet_remove(struct virtio_device *vdev)
>  
>  	unregister_netdev(vi->dev);
>  
> +	virtnet_debugfs_cleanup(vi);
>  	remove_vq_common(vi);
>  
>  	flush_work(&vi->config_work);
>  
> +	virtnet_release_receive_queue_stats(vi);
>  	free_percpu(vi->stats);
>  	free_netdev(vi->dev);
>  }
> @@ -1884,6 +2117,19 @@ static int virtnet_restore(struct virtio_device *vdev)
>  }
>  #endif
>  
> +static void virtnet_register_debugfs(void)
> +{
> +	virtnet_debugfs_root = debugfs_create_dir("virtio-net", NULL);
> +	if (IS_ERR_OR_NULL(virtnet_debugfs_root))
> +		pr_warn("Could not create virtio-net debugfs dir\n");
> +}
> +
> +static void virtnet_unregister_debugfs(void)
> +{
> +	debugfs_remove_recursive(virtnet_debugfs_root);
> +	virtnet_debugfs_root = NULL;
> +}
> +
>  static struct virtio_device_id id_table[] = {
>  	{ VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
>  	{ 0 },
> @@ -1917,7 +2163,39 @@ static struct virtio_driver virtio_net_driver = {
>  #endif
>  };
>  
> -module_virtio_driver(virtio_net_driver);
> +static struct notifier_block virtnet_rename_notifier = {
> +	.notifier_call = virtnet_rename,
> +};
> +
> +static int __init init(void)
> +{
> +	int err;
> +
> +	virtnet_register_debugfs();
> +	err = register_netdevice_notifier(&virtnet_rename_notifier);
> +	if (err)
> +		goto free_debugfs;
> +	err = register_virtio_driver(&virtio_net_driver);
> +	if (err)
> +		goto free_notifier;
> +	return 0;
> +
> +free_notifier:
> +	unregister_netdevice_notifier(&virtnet_rename_notifier);
> +free_debugfs:
> +	virtnet_unregister_debugfs();
> +	return err;
> +}
> +
> +static void __exit cleanup(void)
> +{
> +	unregister_virtio_driver(&virtio_net_driver);
> +	unregister_netdevice_notifier(&virtnet_rename_notifier);
> +	virtnet_unregister_debugfs();
> +}
> +
> +module_init(init);
> +module_exit(cleanup);
>  
>  MODULE_DEVICE_TABLE(virtio, id_table);
>  MODULE_DESCRIPTION("Virtio network driver");
> -- 
> 1.8.5.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill
  2014-01-07  5:25 [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill Michael Dalton
                   ` (2 preceding siblings ...)
  2014-01-07  5:25 ` [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size Michael Dalton
@ 2014-01-08 18:08 ` Michael S. Tsirkin
  2014-01-08 18:26   ` Eric Dumazet
  3 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-08 18:08 UTC (permalink / raw)
  To: Michael Dalton
  Cc: David S. Miller, netdev, Eric Dumazet, Rusty Russell, Jason Wang,
	virtualization

On Mon, Jan 06, 2014 at 09:25:52PM -0800, Michael Dalton wrote:
> skb_page_frag_refill currently permits only order-0 page allocs
> unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
> higher-order page allocations whether or not GFP_WAIT is used. If
> memory cannot be allocated, the allocator will fall back to
> successively smaller page allocs (down to order-0 page allocs).
> 
> This change brings skb_page_frag_refill in line with the existing
> page allocation strategy employed by netdev_alloc_frag, which attempts
> higher-order page allocations whether or not GFP_WAIT is set, falling
> back to successively lower-order page allocations on failure. Part
> of migration of virtio-net to per-receive queue page frag allocators.
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Michael Dalton <mwdalton@google.com>
> ---
>  net/core/sock.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 5393b4b..a0d522a 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1865,9 +1865,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
>  		put_page(pfrag->page);
>  	}
>  
> -	/* We restrict high order allocations to users that can afford to wait */
> -	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
> -
> +	order = SKB_FRAG_PAGE_ORDER;
>  	do {
>  		gfp_t gfp = prio;


Eric said we also need a patch to add __GFP_NORETRY, right?
Probably before this one in series.

> -- 
> 1.8.5.1

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill
  2014-01-08 18:08 ` [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill Michael S. Tsirkin
@ 2014-01-08 18:26   ` Eric Dumazet
  2014-01-08 19:18     ` Michael S. Tsirkin
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2014-01-08 18:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michael Dalton, netdev, virtualization, Eric Dumazet,
	David S. Miller

On Wed, 2014-01-08 at 20:08 +0200, Michael S. Tsirkin wrote:

> Eric said we also need a patch to add __GFP_NORETRY, right?
> Probably before this one in series.

Nope, this __GFP_NORETRY has nothing to do with this.

I am not yet convinced we want it.

This needs mm guys advice, as its a tradeoff for mm layer more than
networking...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill
  2014-01-08 18:26   ` Eric Dumazet
@ 2014-01-08 19:18     ` Michael S. Tsirkin
  2014-01-08 19:46       ` Eric Dumazet
  0 siblings, 1 reply; 39+ messages in thread
From: Michael S. Tsirkin @ 2014-01-08 19:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael Dalton, netdev, virtualization, Eric Dumazet,
	David S. Miller

On Wed, Jan 08, 2014 at 10:26:03AM -0800, Eric Dumazet wrote:
> On Wed, 2014-01-08 at 20:08 +0200, Michael S. Tsirkin wrote:
> 
> > Eric said we also need a patch to add __GFP_NORETRY, right?
> > Probably before this one in series.
> 
> Nope, this __GFP_NORETRY has nothing to do with this.
> 
> I am not yet convinced we want it.
> 
> This needs mm guys advice, as its a tradeoff for mm layer more than
> networking...

Well maybe Cc linux-mm then?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill
  2014-01-08 19:18     ` Michael S. Tsirkin
@ 2014-01-08 19:46       ` Eric Dumazet
  2014-01-08 21:54         ` Debabrata Banerjee
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2014-01-08 19:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Michael Dalton, netdev, virtualization, Eric Dumazet,
	David S. Miller

On Wed, 2014-01-08 at 21:18 +0200, Michael S. Tsirkin wrote:
> On Wed, Jan 08, 2014 at 10:26:03AM -0800, Eric Dumazet wrote:
> > On Wed, 2014-01-08 at 20:08 +0200, Michael S. Tsirkin wrote:
> > 
> > > Eric said we also need a patch to add __GFP_NORETRY, right?
> > > Probably before this one in series.
> > 
> > Nope, this __GFP_NORETRY has nothing to do with this.
> > 
> > I am not yet convinced we want it.
> > 
> > This needs mm guys advice, as its a tradeoff for mm layer more than
> > networking...
> 
> Well maybe Cc linux-mm then?

Well, I do not care of people mlocking the memory and complaining that
compaction does not work.

If these people care, they should contact mm guys, eventually.

Really this is an issue that has nothing to do with this patch set.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill
  2014-01-08 19:46       ` Eric Dumazet
@ 2014-01-08 21:54         ` Debabrata Banerjee
  2014-01-08 22:01           ` Eric Dumazet
  0 siblings, 1 reply; 39+ messages in thread
From: Debabrata Banerjee @ 2014-01-08 21:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael Dalton, Michael S. Tsirkin, netdev@vger.kernel.org,
	jbaron, virtualization, Eric Dumazet, Joshua Hunt,
	David S. Miller

On Wed, Jan 8, 2014 at 2:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2014-01-08 at 21:18 +0200, Michael S. Tsirkin wrote:
>> On Wed, Jan 08, 2014 at 10:26:03AM -0800, Eric Dumazet wrote:
>> > On Wed, 2014-01-08 at 20:08 +0200, Michael S. Tsirkin wrote:
>> >
>> > > Eric said we also need a patch to add __GFP_NORETRY, right?
>> > > Probably before this one in series.
>> >
>> > Nope, this __GFP_NORETRY has nothing to do with this.
>> >
>> > I am not yet convinced we want it.
>> >
>> > This needs mm guys advice, as its a tradeoff for mm layer more than
>> > networking...
>>
>> Well maybe Cc linux-mm then?
>
> Well, I do not care of people mlocking the memory and complaining that
> compaction does not work.
>
> If these people care, they should contact mm guys, eventually.
>
> Really this is an issue that has nothing to do with this patch set.
>

Actually I have more data on this:

1. __GFP_NORETRY really does help and should go into stable tree.

2. You may want to consider GFP_NOKSWAPD, because even in the
GFP_ATOMIC case you are waking up kswapd to do reclaims on a
continuous basis even when you don't enter direct reclaim.

3. mlocking memory had very little to do with it, that was a
red-herring. I tested out the problem scenario with no mlocks. You
simply need memory pressure from page_cache, and mm ends up constantly
reclaiming and trying to keep another 1-2GB free on our systems (8GB
phys ~4GB left for kernel, ~3GB optimally used for page_cache).

4. I think perhaps using a kmem_cache allocation for this buffer is
the right way to make this work. I am experimenting with a patch to do
this.

-Debabrata

-Debabrata

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill
  2014-01-08 21:54         ` Debabrata Banerjee
@ 2014-01-08 22:01           ` Eric Dumazet
  0 siblings, 0 replies; 39+ messages in thread
From: Eric Dumazet @ 2014-01-08 22:01 UTC (permalink / raw)
  To: Debabrata Banerjee
  Cc: Michael Dalton, Michael S. Tsirkin, netdev@vger.kernel.org,
	jbaron, virtualization, Eric Dumazet, Joshua Hunt,
	David S. Miller

On Wed, 2014-01-08 at 16:54 -0500, Debabrata Banerjee wrote:

> Actually I have more data on this:
> 

Could you please stop polluting this thread ?

> 1. __GFP_NORETRY really does help and should go into stable tree.
> 

Not at all. You are free to patch your kernel if you want.

It helps you workload, and breaks all others.

If compaction is never triggered, we'll never be able to get high order
pages, and performance goes back to what we had 2 years ago.

That discussion does not belong to this thread, again.

> 2. You may want to consider GFP_NOKSWAPD, because even in the
> GFP_ATOMIC case you are waking up kswapd to do reclaims on a
> continuous basis even when you don't enter direct reclaim.
> 
> 3. mlocking memory had very little to do with it, that was a
> red-herring. I tested out the problem scenario with no mlocks. You
> simply need memory pressure from page_cache, and mm ends up constantly
> reclaiming and trying to keep another 1-2GB free on our systems (8GB
> phys ~4GB left for kernel, ~3GB optimally used for page_cache).
> 
> 4. I think perhaps using a kmem_cache allocation for this buffer is
> the right way to make this work. I am experimenting with a patch to do
> this.

Seriously... I think you missed whole point of having frag allocation on
pages, not kmem_cache.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2014-01-14 21:53 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-01-07  5:25 [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill Michael Dalton
2014-01-07  5:25 ` [PATCH net-next v2 2/4] virtio-net: use per-receive queue page frag alloc for mergeable bufs Michael Dalton
2014-01-07  5:25 ` [PATCH net-next v2 3/4] virtio-net: auto-tune mergeable rx buffer size for improved performance Michael Dalton
2014-01-08  6:23   ` Jason Wang
2014-01-08 18:28     ` Michael Dalton
2014-01-08 18:44       ` Eric Dumazet
2014-01-08 19:16       ` Michael S. Tsirkin
2014-01-08 19:56         ` Michael Dalton
2014-01-08 20:30   ` Michael S. Tsirkin
2014-01-09  1:42   ` Michael S. Tsirkin
2014-01-09  3:16     ` Michael Dalton
2014-01-09  3:41       ` Michael Dalton
2014-01-09  6:48         ` Michael S. Tsirkin
2014-01-09  8:28           ` Michael Dalton
2014-01-09  9:02             ` Michael Dalton
2014-01-09 13:25             ` Michael S. Tsirkin
2014-01-09 19:33               ` Michael Dalton
2014-01-09  6:42       ` Michael S. Tsirkin
2014-01-07  5:25 ` [PATCH net-next v2 4/4] virtio-net: initial debugfs support, export mergeable rx buffer size Michael Dalton
2014-01-08  6:34   ` Jason Wang
2014-01-08 19:21     ` Michael S. Tsirkin
2014-01-11  5:19       ` Michael Dalton
2014-01-11  5:36         ` Michael Dalton
2014-01-12 17:09         ` Michael S. Tsirkin
2014-01-12 23:32           ` Michael Dalton
2014-01-13  7:36             ` Jason Wang
2014-01-13  9:40             ` Michael S. Tsirkin
2014-01-13 15:38               ` Ben Hutchings
2014-01-13 19:07                 ` Michael Dalton
2014-01-13 19:19                   ` Michael Dalton
2014-01-14 21:45                     ` Michael Dalton
2014-01-14 21:53                       ` Michael S. Tsirkin
2014-01-08 18:24   ` Michael S. Tsirkin
2014-01-08 18:08 ` [PATCH net-next v2 1/4] net: allow > 0 order atomic page alloc in skb_page_frag_refill Michael S. Tsirkin
2014-01-08 18:26   ` Eric Dumazet
2014-01-08 19:18     ` Michael S. Tsirkin
2014-01-08 19:46       ` Eric Dumazet
2014-01-08 21:54         ` Debabrata Banerjee
2014-01-08 22:01           ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).