Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next v4 5/6] lib: Ensure EWMA does not store wrong intermediate values
From: Michael Dalton @ 2014-01-16 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet, Ben Hutchings
In-Reply-To: <1389901950-3854-1-git-send-email-mwdalton@google.com>

To ensure ewma_read() without a lock returns a valid but possibly
out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
intermediate wrong values from being written to avg->internal.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 lib/average.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/average.c b/lib/average.c
index 99a67e6..114d1be 100644
--- a/lib/average.c
+++ b/lib/average.c
@@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
  */
 struct ewma *ewma_add(struct ewma *avg, unsigned long val)
 {
-	avg->internal = avg->internal  ?
-		(((avg->internal << avg->weight) - avg->internal) +
+	unsigned long internal = ACCESS_ONCE(avg->internal);
+
+	ACCESS_ONCE(avg->internal) = internal ?
+		(((internal << avg->weight) - internal) +
 			(val << avg->factor)) >> avg->weight :
 		(val << avg->factor);
 	return avg;
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v4 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance
From: Michael Dalton @ 2014-01-16 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eric Dumazet, Rusty Russell, Michael S. Tsirkin,
	Jason Wang, Ben Hutchings, virtualization, Michael Dalton
In-Reply-To: <1389901950-3854-1-git-send-email-mwdalton@google.com>

Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
        base address and truesize into an unsigned long by requiring a
        minimum packet size alignment of 256. Permit attempts to fill
        an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
        mergeable receive buffers. Remove all truesize approximation. Never
        try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 99 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 36cbf06..3e82311 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-                                L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -78,6 +86,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
 	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+	return truesize * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+	return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int offset,
@@ -327,31 +355,33 @@ err:
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
 					 struct receive_queue *rq,
-					 void *buf,
+					 unsigned long ctx,
 					 unsigned int len)
 {
+	void *buf = mergeable_ctx_to_buf_address(ctx);
 	struct skb_vnet_hdr *hdr = buf;
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
+
 	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
 		goto err_skb;
-
 	while (--num_buf) {
 		int num_skb_frags;
 
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers out of %d missing\n",
 				 dev->name, num_buf, hdr->mhdr.num_buffers);
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
 
+		buf = mergeable_ctx_to_buf_address(ctx);
 		page = virt_to_head_page(buf);
 		--rq->num;
 
@@ -369,7 +399,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
-		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+		truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
@@ -386,19 +416,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 	}
 
+	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
 	return head_skb;
 
 err_skb:
 	put_page(page);
 	while (--num_buf) {
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 dev->name, num_buf);
 			dev->stats.rx_length_errors++;
 			break;
 		}
-		page = virt_to_head_page(buf);
+		page = virt_to_head_page(mergeable_ctx_to_buf_address(ctx));
 		put_page(page);
 		--rq->num;
 	}
@@ -419,17 +450,20 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
-		if (vi->mergeable_rx_bufs)
-			put_page(virt_to_head_page(buf));
-		else if (vi->big_packets)
+		if (vi->mergeable_rx_bufs) {
+			unsigned long ctx = (unsigned long)buf;
+			void *base = mergeable_ctx_to_buf_address(ctx);
+			put_page(virt_to_head_page(base));
+		} else if (vi->big_packets) {
 			give_pages(rq, buf);
-		else
+		} else {
 			dev_kfree_skb(buf);
+		}
 		return;
 	}
 
 	if (vi->mergeable_rx_bufs)
-		skb = receive_mergeable(dev, rq, buf, len);
+		skb = receive_mergeable(dev, rq, (unsigned long)buf, len);
 	else if (vi->big_packets)
 		skb = receive_big(dev, rq, buf, len);
 	else
@@ -572,25 +606,36 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
+	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
+	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
+	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
+				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
+
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	ctx = mergeable_buf_to_ctx(buf, len);
 	get_page(alloc_frag->page);
-	len = MERGE_BUFFER_LEN;
 	alloc_frag->offset += len;
 	hole = alloc_frag->size - alloc_frag->offset;
-	if (hole < MERGE_BUFFER_LEN) {
+	if (hole < len) {
+		/* To avoid internal fragmentation, if there is very likely not
+		 * enough space for another buffer, add the remaining space to
+		 * the current buffer. This extra space is not included in
+		 * the truesize stored in ctx.
+		 */
 		len += hole;
 		alloc_frag->offset += hole;
 	}
 
 	sg_init_one(rq->sg, buf, len);
-	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
+	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, (void *)ctx, gfp);
 	if (err < 0)
 		put_page(virt_to_head_page(buf));
 
@@ -1394,12 +1439,15 @@ static void free_unused_bufs(struct virtnet_info *vi)
 		struct virtqueue *vq = vi->rq[i].vq;
 
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (vi->mergeable_rx_bufs)
-				put_page(virt_to_head_page(buf));
-			else if (vi->big_packets)
+			if (vi->mergeable_rx_bufs) {
+				unsigned long ctx = (unsigned long)buf;
+				void *base = mergeable_ctx_to_buf_address(ctx);
+				put_page(virt_to_head_page(base));
+			} else if (vi->big_packets) {
 				give_pages(&vi->rq[i], buf);
-			else
+			} else {
 				dev_kfree_skb(buf);
+			}
 			--vi->rq[i].num;
 		}
 		BUG_ON(vi->rq[i].num != 0);
@@ -1509,6 +1557,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
+		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
 	}
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v4 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size
From: Michael Dalton @ 2014-01-16 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eric Dumazet, Rusty Russell, Michael S. Tsirkin,
	Jason Wang, Ben Hutchings, virtualization, Michael Dalton
In-Reply-To: <1389901950-3854-1-git-send-email-mwdalton@google.com>

Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v3->v4: Remove seqcount due to EWMA changes in patch 5.
        Add missing Suggested-By.

 drivers/net/virtio_net.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e82311..968eacd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -604,18 +604,25 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
 	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
 	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
 	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
 
@@ -1594,6 +1601,33 @@ err:
 	return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+		struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct ewma *avg;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	avg = &vi->rq[queue_index].mrg_avg_pkt_len;
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+	__ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+	NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+	.attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
@@ -1708,6 +1742,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (err)
 		goto free_stats;
 
+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2

^ permalink raw reply related

* Re: [PATCH v4 0/2] ipv6 addrconf: add IFA_F_NOPREFIXROUTE flag to suppress creation of IP6 routes
From: David Miller @ 2014-01-16 19:53 UTC (permalink / raw)
  To: hannes; +Cc: jiri, thaller, netdev, stephen, dcbw
In-Reply-To: <20140116133818.GF7436@order.stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 16 Jan 2014 14:38:18 +0100

> Just fyi, I noticed at least one other patch missing from the repos:
> http://patchwork.ozlabs.org/patch/308452/

Thanks for catching that, taking care of it now.

^ permalink raw reply

* Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes
From: Eric Dumazet @ 2014-01-16 20:00 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Michael S. Tsirkin, netdev, lf-virt, Eric Dumazet, Ben Hutchings,
	David S. Miller
In-Reply-To: <CANJ5vPJgQbNPKvbExvpZixKk-xXAkvF0ix0ELW=DpiV0QP5YTA@mail.gmail.com>

On Thu, 2014-01-16 at 11:51 -0800, Michael Dalton wrote:
> On Thu, Jan 16, 2014 Ben Hutchings <bhutchings@solarflare.com> wrote:
> > It's simpler but we don't know if it's faster (and I don't believe that
> > matters for the current usage).
> >
> > If one of these functions starts to be used in the data path, at that
> > point it could be worth optimising, e.g. by doing a test for queue 0 and
> > only then doing the pointer arithmetic with its implicit division.
> Good catch, my statement above is incorrect. We don't know if this change
> is a performance win, and it does introduce an implicit div. I agree this
> function is not currently performance critical and is used only by for
> reporting values to sysfs.

gcc usually do not emit a divide instruction to perform a divide by a
constant, but uses a reciprocal operation (basically a multiply and a
shift)

So I am pretty sure this is faster than the loop.

^ permalink raw reply

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: Willy Tarreau @ 2014-01-16 20:07 UTC (permalink / raw)
  To: David Miller; +Cc: David.Laight, netdev, thomas.petazzoni, gregory.clement
In-Reply-To: <20140116.114905.1582218754259744747.davem@davemloft.net>

Hi David,

On Thu, Jan 16, 2014 at 11:49:05AM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 16 Jan 2014 10:36:40 +0100
> 
> > On Thu, Jan 16, 2014 at 09:14:00AM +0000, David Laight wrote:
> >> From: Of Willy Tarreau
> >> > calling dma_map_single()/dma_unmap_single() is quite expensive compared
> >> > to copying a small packet. So let's copy short frames and keep the buffers
> >> > mapped. We set the limit to 256 bytes which seems to give good results both
> >> > on the XP-GP board and on the AX3/4.
> >> 
> >> Have you tried something similar for transmits?
> >> Copying short tx into a preallocated memory area that is dma_mapped
> >> at driver initialisation?
> > 
> > Not yet by lack of time, but I intend to do so. Both the NIC and the driver
> > are capable of running at line rate (1.488 Mpps) when filling descriptors
> > directly, so many improvements are still possible.
> 
> I don't know if this is relevant for your hardware, but there is a trick
> with IOMMUs that I at least at one point was doing on sparc64.
> 
> I never flushed the IOMMU mappings explicitly on a dma_unmap call.
> 
> Instead I always allocate by moving the allocation cursor to higher
> addresses in the IOMMU range looking for free space.
> 
> Then when I hit the end of the range and had to wrap the allocation
> cursor back to the beginning, I flush the entire IOMMU.
> 
> So for %99 of IOMMU mapping creation and teardown calls, I didn't even
> touch the hardware.

Ah that's an interesting trick! We don't have an IOMMU on this platform
so the call is cheap. However, it relies on an I/O barrier to wait for
a cache snoop completion. From what I read, it's not needed when going
to the device. But when going to the CPU for the Rx case, we're calling
it for every packet which is counter productive because I'd like to do
it only once when entering the rx loop and avoid any other call during
the loop. I measured that I could save 300 ns per packet by doing this
(thus slightly modifying the DMA API to add a new dma_io_barrier function).

I think it could be interesting (at least for this platform, I don't know
if other platforms work with cache coherent DMA which only require to
verify end of snooping), to have two distinct set of DMA calls, something
like this :

    rx_loop(napi, budget)
    {
         dma_wait_for_snoop_completion(dev);
         ...
         for_each_rx_packet(pkt) {
             dma_sync_single_range_for_cpu_unless_completed(dev, pkt->addr);
             /* use packet */
         }
    }

The dma_wait_for_snoop_completion() call would call this I/O barrier on
platforms which provide one, and dma_sync_single_range_unless_completed()
would do the equivalent call only when the I/O barrier is not provided.
However I don't like this principle because it requires adding many new
entries to the struct dma_ops while we could avoid this by just adding
the first entry and changing all drivers at once to add the first call
before the loop. I'm just having a hard time believing that it's possible
anymore to change all drivers at once, especially when there are out of
tree drivers... So such a change would probably require to add a number
of new calls to the DMA API. I really don't know what's the best solution
would be in fact :-/

Thanks,
Willy

^ permalink raw reply

* Re: [PATCH net-next v4 5/6] lib: Ensure EWMA does not store wrong intermediate values
From: Eric Dumazet @ 2014-01-16 20:08 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Michael S. Tsirkin, netdev, virtualization, Eric Dumazet,
	Ben Hutchings, David S. Miller
In-Reply-To: <1389901950-3854-5-git-send-email-mwdalton@google.com>

On Thu, 2014-01-16 at 11:52 -0800, Michael Dalton wrote:
> To ensure ewma_read() without a lock returns a valid but possibly
> out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
> intermediate wrong values from being written to avg->internal.
> 
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Michael Dalton <mwdalton@google.com>
> ---
>  lib/average.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: David Miller @ 2014-01-16 20:11 UTC (permalink / raw)
  To: w; +Cc: David.Laight, netdev, thomas.petazzoni, gregory.clement
In-Reply-To: <20140116200757.GA19969@1wt.eu>

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 16 Jan 2014 21:07:57 +0100

> So such a change would probably require to add a number
> of new calls to the DMA API. I really don't know what's the best solution
> would be in fact :-/

Right, thanks for explaining your situation.

^ permalink raw reply

* [PATCH net-next v5 0/2] stmmac: fix kernel crashes for jumbo frames
From: Vince Bridgers @ 2014-01-16 20:10 UTC (permalink / raw)
  To: devicetree, netdev
  Cc: peppe.cavallaro, robh+dt, pawel.moll, mark.rutland,
	ijc+devicetree, galak, dinguyen, rayagond, vbridgers2013

v5:
* Simplify use notes for max-frame-size in stmmac.txt, in the
  devicetree patch. No change to the stmmac driver patch
  from v4.

v4:
* Address inconsistent comment with use, add comments about inconsistency
  of max-frame-size definition and use in the ePAPR v1.1 specification.

v3:
* change snps,max-frame-size to max-frame-size

v2:
* change snps,max-mtu to snps,max-frame-size

These patches address two kernel crashes seen when using jumbo frames on 
the Synopsys stmmac driver, and adds device tree configurability for the 
maximum mtu. The Synopsys emac fifo sizes can be configured when a logic
design is synthesized, but does not provide a way for a driver to query the
exact fifo size. 

The crashes seen were due to two issues. 

1) The dma buffer size was being set after the dma buffers were allocated.
This caused a crash when changing the mtu since it was possible the buffers
would subsequently be freed using an incorrect dma buffer size. This could
also cause kernel panics due to memory corruption since a large mtu size could
have been configured, but the dma buffers were not sized accordingly. 

2) Jumbo frames were being enabled by default, but the dma buffers were not
sized accordingly. This caused memory corruption in the context of certain
types of network traffic, leading to kernel panics. 

I've tested these changes using automated, reproducible testware. I can
demonstrate the panics described before the fixes and show that the fixes
address the problems described. 

Testing and improvements continue through the use of the mentioned automated
and reproducible testware. 

Vince Bridgers
Vince Bridgers (2):
  dts: Add a binding for Synopsys emac max-frame-size
  stmmac: Fix kernel crashes for jumbo frames

 Documentation/devicetree/bindings/net/stmmac.txt   |    5 ++++-
 drivers/net/ethernet/stmicro/stmmac/common.h       |    4 +++-
 drivers/net/ethernet/stmicro/stmmac/dwmac1000.h    |    7 ++-----
 .../net/ethernet/stmicro/stmmac/dwmac1000_core.c   |    7 ++++++-
 .../net/ethernet/stmicro/stmmac/dwmac100_core.c    |    2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |   11 +++++++----
 .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |   13 +++++++++++++
 include/linux/stmmac.h                             |    1 +
 8 files changed, 37 insertions(+), 13 deletions(-)

-- 
1.7.9.5

^ permalink raw reply

* [PATCH net-next v5 1/2] dts: Add a binding for Synopsys emac max-frame-size
From: Vince Bridgers @ 2014-01-16 20:10 UTC (permalink / raw)
  To: devicetree, netdev
  Cc: peppe.cavallaro, robh+dt, pawel.moll, mark.rutland,
	ijc+devicetree, galak, dinguyen, rayagond, vbridgers2013
In-Reply-To: <1389903026-12064-1-git-send-email-vbridgers2013@gmail.com>

This change adds a parameter for the Synopsys 10/100/1000
stmmac Ethernet driver to configure the maximum frame
size supported by the EMAC driver. Synopsys allows the FIFO
sizes to be configured when the cores are built for a particular
device, but do not provide a way for the driver to read
information from the device about the maximum MTU size
supported as limited by the device's FIFO size.

Signed-off-by: Vince Bridgers <vbridgers2013@gmail.com>
---
V5: simplify comment on use of max-frame-size
V4: add comments to explain use of max-frame-size with respect
    to inconsistent definition and use in the ePAPR v1.1 spec
V3: change snps,max-frame-size to max-frame-size
V2: change snps,max-mtu to snps,max-frame-size
---
 Documentation/devicetree/bindings/net/stmmac.txt |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt
index eba0e5e..95ddb16 100644
--- a/Documentation/devicetree/bindings/net/stmmac.txt
+++ b/Documentation/devicetree/bindings/net/stmmac.txt
@@ -29,7 +29,9 @@ Required properties:
 				ignored if force_thresh_dma_mode is set.
 
 Optional properties:
-- mac-address: 6 bytes, mac address
+- mac-address:		6 bytes, mac address
+- max-frame-size:	Maximum Transfer Unit (IEEE defined MTU), rather
+			than the maximum frame size.
 
 Examples:
 
@@ -40,5 +42,6 @@ Examples:
 		interrupts = <24 23>;
 		interrupt-names = "macirq", "eth_wake_irq";
 		mac-address = [000000000000]; /* Filled in by U-Boot */
+		max-frame-size = <3800>;
 		phy-mode = "gmii";
 	};
-- 
1.7.9.5

^ permalink raw reply related

* Re: [PATCH net] ipv6: simplify detection of first operational link-local address on interface
From: Brian Haley @ 2014-01-16 20:15 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, fbl
In-Reply-To: <20140116191304.GC17529@order.stressinduktion.org>

On 01/16/2014 02:13 PM, Hannes Frederic Sowa wrote:
> In commit 1ec047eb4751e3 ("ipv6: introduce per-interface counter for
> dad-completed ipv6 addresses") I build the detection of the first
> operational link-local address much to complex. Additionally this code
> now has a race condition.
> 
> Replace it with a much simpler variant, which just scans the address
> list when duplicate address detection completes, to check if this is
> the first valid link local address and send RS and MLD reports then.
> 
> Fixes: 1ec047eb4751e3 ("ipv6: introduce per-interface counter for dad-completed ipv6 addresses")
> Reported-by: Jiri Pirko <jiri@resnulli.us>
> Cc: Flavio Leitner <fbl@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---

> +/* ifp->idev must be at least read locked */
> +static bool ipv6_lonely_lladdr(struct inet6_ifaddr *ifp)
> +{
> +	struct inet6_ifaddr *ifpiter;
> +	struct inet6_dev *idev = ifp->idev;
> +
> +	list_for_each_entry(ifpiter, &idev->addr_list, if_list) {
> +		if (ifp != ifpiter && ifpiter->scope == IFA_LINK &&
> +		    (ifpiter->flags & (IFA_F_PERMANENT|IFA_F_TENTATIVE|
> +				       IFA_F_OPTIMISTIC|IFA_F_DADFAILED)) ==
> +		    IFA_F_PERMANENT)
> +			return false;
> +	}
> +	return true;
> +}

Just a nit, but the idev->addr_list is sorted by scope, so you could use
list_for_each_entry_reverse(), and break out once the scope is > IFA_LINK to
reduce the number of compares.

-Brian

^ permalink raw reply

* [PATCH net-next v5 2/2] stmmac: Fix kernel crashes for jumbo frames
From: Vince Bridgers @ 2014-01-16 20:10 UTC (permalink / raw)
  To: devicetree, netdev
  Cc: peppe.cavallaro, robh+dt, pawel.moll, mark.rutland,
	ijc+devicetree, galak, dinguyen, rayagond, vbridgers2013
In-Reply-To: <1389903026-12064-1-git-send-email-vbridgers2013@gmail.com>

These changes correct the following issues with jumbo frames on the
stmmac driver:

1) The Synopsys EMAC can be configured to support different FIFO
sizes at core configuration time. There's no way to query the
controller and know the FIFO size, so the driver needs to get this
information from the device tree in order to know how to correctly
handle MTU changes and setting up dma buffers. The default
max-frame-size is as currently used, which is the size of a jumbo
frame.

2) The driver was enabling Jumbo frames by default, but was not allocating
dma buffers of sufficient size to handle the maximum possible packet
size that could be received. This led to memory corruption since DMAs were
occurring beyond the extent of the allocated receive buffers for certain types
of network traffic.

kernel BUG at net/core/skbuff.c:126!
Internal error: Oops - BUG: 0 [#1] SMP ARM
Modules linked in:
CPU: 0 PID: 563 Comm: sockperf Not tainted 3.13.0-rc6-01523-gf7111b9 #31
task: ef35e580 ti: ef252000 task.ti: ef252000
PC is at skb_panic+0x60/0x64
LR is at skb_panic+0x60/0x64
pc : [<c03c7c3c>]    lr : [<c03c7c3c>]    psr: 60000113
sp : ef253c18  ip : 60000113  fp : 00000000
r10: ef3a5400  r9 : 00000ebc  r8 : ef3a546c
r7 : ee59f000  r6 : ee59f084  r5 : ee59ff40  r4 : ee59f140
r3 : 000003e2  r2 : 00000007  r1 : c0b9c420  r0 : 0000007d
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c5387d  Table: 2e8ac04a  DAC: 00000015
Process sockperf (pid: 563, stack limit = 0xef252248)
Stack: (0xef253c18 to 0xef254000)
3c00:                                                       00000ebc ee59f000
3c20: ee59f084 ee59ff40 ee59f140 c04a9cd8 ee8c50c0 00000ebc ee59ff40 00000000
3c40: ee59f140 c02d0ef0 00000056 ef1eda80 ee8c50c0 00000ebc 22bbef29 c0318f8c
3c60: 00000056 ef3a547c ffe2c716 c02c9c90 c0ba1298 ef3a5838 ef3a5838 ef3a5400
3c80: 000020c0 ee573840 000055cb ef3f2050 c053f0e0 c0319214 22b9b085 22d92813
3ca0: 00001c80 004b8e00 ef3a5400 ee573840 ef3f2064 22d92813 ef3f2064 000055cb
3cc0: ef3f2050 c031a19c ef252000 00000000 00000000 c0561bc0 00000000 ff00ffff
3ce0: c05621c0 ef3a5400 ef3f2064 ee573840 00000020 ef3f2064 000055cb ef3f2050
3d00: c053f0e0 c031cad0 c053e740 00000e60 00000000 00000000 ee573840 ef3a5400
3d20: ef0a6e00 00000000 ef3f2064 c032507c 00010000 00000020 c0561bc0 c0561bc0
3d40: ee599850 c032799c 00000000 ee573840 c055a380 ef3a5400 00000000 ef3f2064
3d60: ef3f2050 c032799c 0101c7c0 2b6755cb c059a280 c030e4d8 000055cb ffffffff
3d80: ee574fc0 c055a380 ee574000 ee573840 00002b67 ee573840 c03fe9c4 c053fa68
3da0: c055a380 00001f6f 00000000 ee573840 c053f0e0 c0304fdc ef0a6e01 ef3f2050
3dc0: ee573858 ef031000 ee573840 c03055d8 c0ba0c40 ef000f40 00100100 c053f0dc
3de0: c053ffdc c053f0f0 00000008 00000000 ef031000 c02da948 00001140 00000000
3e00: c0563c78 ef253e5f 00000020 ee573840 00000020 c053f0f0 ef313400 ee573840
3e20: c053f0e0 00000000 00000000 c05380c0 ef313400 00001000 00000015 c02df280
3e40: ee574000 ef001e00 00000000 00001080 00000042 005cd980 ef031500 ef031500
3e60: 00000000 c02df824 ef031500 c053e390 c0541084 f00b1e00 c05925e8 c02df864
3e80: 00001f5c ef031440 c053e390 c0278524 00000002 00000000 c0b9eb48 c02df280
3ea0: ee8c7180 00000100 c0542ca8 00000015 00000040 ef031500 ef031500 ef031500
3ec0: c027803c ef252000 00000040 000000ec c05380c0 c0b9eb40 c0b9eb48 c02df940
3ee0: ef060780 ffffa4dd c0564a9c c056343c 002e80a8 00000080 ef031500 00000001
3f00: c053808c ef252000 fffec100 00000003 00000004 002e80a8 0000000c c00258f0
3f20: 002e80a8 c005e704 00000005 00000100 c05634d0 c0538080 c05333e0 00000000
3f40: 0000000a c0565580 c05380c0 ffffa4dc c05434f4 00400100 00000004 c0534cd4
3f60: 00000098 00000000 fffec100 002e80a8 00000004 002e80a8 002a20e0 c0025da8
3f80: c0534cd4 c000f020 fffec10c c053ea60 ef253fb0 c0008530 0000ffe2 b6ef67f4
3fa0: 40000010 ffffffff 00000124 c0012f3c 0000ffe2 002e80f0 0000ffe2 00004000
3fc0: becb6338 becb6334 00000004 00000124 002e80a8 00000004 002e80a8 002a20e0
3fe0: becb6300 becb62f4 002773bb b6ef67f4 40000010 ffffffff 00000000 00000000
[<c03c7c3c>] (skb_panic+0x60/0x64) from [<c02d0ef0>] (skb_put+0x4c/0x50)
[<c02d0ef0>] (skb_put+0x4c/0x50) from [<c0318f8c>] (tcp_collapse+0x314/0x3ec)
[<c0318f8c>] (tcp_collapse+0x314/0x3ec) from [<c0319214>]
(tcp_try_rmem_schedule+0x1b0/0x3c4)
[<c0319214>] (tcp_try_rmem_schedule+0x1b0/0x3c4) from [<c031a19c>]
(tcp_data_queue+0x480/0xe6c)
[<c031a19c>] (tcp_data_queue+0x480/0xe6c) from [<c031cad0>]
(tcp_rcv_established+0x180/0x62c)
[<c031cad0>] (tcp_rcv_established+0x180/0x62c) from [<c032507c>]
(tcp_v4_do_rcv+0x13c/0x31c)
[<c032507c>] (tcp_v4_do_rcv+0x13c/0x31c) from [<c032799c>]
(tcp_v4_rcv+0x718/0x73c)
[<c032799c>] (tcp_v4_rcv+0x718/0x73c) from [<c0304fdc>]
(ip_local_deliver+0x98/0x274)
[<c0304fdc>] (ip_local_deliver+0x98/0x274) from [<c03055d8>]
(ip_rcv+0x420/0x758)
[<c03055d8>] (ip_rcv+0x420/0x758) from [<c02da948>]
(__netif_receive_skb_core+0x44c/0x5bc)
[<c02da948>] (__netif_receive_skb_core+0x44c/0x5bc) from [<c02df280>]
(netif_receive_skb+0x48/0xb4)
[<c02df280>] (netif_receive_skb+0x48/0xb4) from [<c02df824>]
(napi_gro_flush+0x70/0x94)
[<c02df824>] (napi_gro_flush+0x70/0x94) from [<c02df864>]
(napi_complete+0x1c/0x34)
[<c02df864>] (napi_complete+0x1c/0x34) from [<c0278524>]
(stmmac_poll+0x4e8/0x5c8)
[<c0278524>] (stmmac_poll+0x4e8/0x5c8) from [<c02df940>]
(net_rx_action+0xc4/0x1e4)
[<c02df940>] (net_rx_action+0xc4/0x1e4) from [<c00258f0>]
(__do_softirq+0x12c/0x2e8)
[<c00258f0>] (__do_softirq+0x12c/0x2e8) from [<c0025da8>] (irq_exit+0x78/0xac)
[<c0025da8>] (irq_exit+0x78/0xac) from [<c000f020>] (handle_IRQ+0x44/0x90)
[<c000f020>] (handle_IRQ+0x44/0x90) from [<c0008530>]
(gic_handle_irq+0x2c/0x5c)
[<c0008530>] (gic_handle_irq+0x2c/0x5c) from [<c0012f3c>]
(__irq_usr+0x3c/0x60)

3) The driver was setting the dma buffer size after allocating dma buffers,
which caused a system panic when changing the MTU.

BUG: Bad page state in process ifconfig  pfn:2e850
page:c0b72a00 count:0 mapcount:0 mapping:  (null) index:0x0
page flags: 0x200(arch_1)
Modules linked in:
CPU: 0 PID: 566 Comm: ifconfig Not tainted 3.13.0-rc6-01523-gf7111b9 #29
[<c001547c>] (unwind_backtrace+0x0/0xf8) from [<c00122dc>]
(show_stack+0x10/0x14)
[<c00122dc>] (show_stack+0x10/0x14) from [<c03c793c>] (dump_stack+0x70/0x88)
[<c03c793c>] (dump_stack+0x70/0x88) from [<c00b2620>] (bad_page+0xc8/0x118)
[<c00b2620>] (bad_page+0xc8/0x118) from [<c00b302c>]
(get_page_from_freelist+0x744/0x870)
[<c00b302c>] (get_page_from_freelist+0x744/0x870) from [<c00b40f4>]
(__alloc_pages_nodemask+0x118/0x86c)
[<c00b40f4>] (__alloc_pages_nodemask+0x118/0x86c) from [<c00b4858>]
(__get_free_pages+0x10/0x54)
[<c00b4858>] (__get_free_pages+0x10/0x54) from [<c00cba1c>]
(kmalloc_order_trace+0x24/0xa0)
[<c00cba1c>] (kmalloc_order_trace+0x24/0xa0) from [<c02d199c>]
(__kmalloc_reserve.isra.21+0x24/0x70)
[<c02d199c>] (__kmalloc_reserve.isra.21+0x24/0x70) from [<c02d240c>]
(__alloc_skb+0x68/0x13c)
[<c02d240c>] (__alloc_skb+0x68/0x13c) from [<c02d3930>]
(__netdev_alloc_skb+0x3c/0xe8)
[<c02d3930>] (__netdev_alloc_skb+0x3c/0xe8) from [<c0279378>]
(stmmac_open+0x63c/0x1024)
[<c0279378>] (stmmac_open+0x63c/0x1024) from [<c02e18cc>]
(__dev_open+0xa0/0xfc)
[<c02e18cc>] (__dev_open+0xa0/0xfc) from [<c02e1b40>]
(__dev_change_flags+0x94/0x158)
[<c02e1b40>] (__dev_change_flags+0x94/0x158) from [<c02e1c24>]
(dev_change_flags+0x18/0x48)
[<c02e1c24>] (dev_change_flags+0x18/0x48) from [<c0337bc0>]
(devinet_ioctl+0x638/0x700)
[<c0337bc0>] (devinet_ioctl+0x638/0x700) from [<c02c7aec>]
(sock_ioctl+0x64/0x290)
[<c02c7aec>] (sock_ioctl+0x64/0x290) from [<c0100890>]
(do_vfs_ioctl+0x78/0x5b8)
[<c0100890>] (do_vfs_ioctl+0x78/0x5b8) from [<c0100e0c>] (SyS_ioctl+0x3c/0x5c)
[<c0100e0c>] (SyS_ioctl+0x3c/0x5c) from [<c000e760>]

The fixes have been verified using reproducible, automated testing.

Signed-off-by: Vince Bridgers <vbridgers2013@gmail.com>
---
V5: no change for this patch, change was in devicetree patch
V4: address inconsistency in comment, add comments explaining inconsistent
    use of max-frame-size when reading device tree max-frame-size
V3: change snps,max-frame-size to max-frame-size
V2: change snps,max-mtu to snps,max-frame-size
---
 drivers/net/ethernet/stmicro/stmmac/common.h       |    4 +++-
 drivers/net/ethernet/stmicro/stmmac/dwmac1000.h    |    7 ++-----
 .../net/ethernet/stmicro/stmmac/dwmac1000_core.c   |    7 ++++++-
 .../net/ethernet/stmicro/stmmac/dwmac100_core.c    |    2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |   11 +++++++----
 .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |   13 +++++++++++++
 include/linux/stmmac.h                             |    1 +
 7 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index fc94f20..97bfb6b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -293,6 +293,8 @@ struct dma_features {
 #define STMMAC_CHAIN_MODE	0x1
 #define STMMAC_RING_MODE	0x2
 
+#define JUMBO_LEN		9000
+
 struct stmmac_desc_ops {
 	/* DMA RX descriptor ring initialization */
 	void (*init_rx_desc) (struct dma_desc *p, int disable_rx_ic, int mode,
@@ -369,7 +371,7 @@ struct stmmac_dma_ops {
 
 struct stmmac_ops {
 	/* MAC core initialization */
-	void (*core_init) (void __iomem *ioaddr);
+	void (*core_init) (void __iomem *ioaddr, int mtu);
 	/* Enable and verify that the IPC module is supported */
 	int (*rx_ipc) (void __iomem *ioaddr);
 	/* Dump MAC registers */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
index c12aabb..f37d90f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000.h
@@ -126,11 +126,8 @@ enum power_event {
 #define GMAC_ANE_PSE		(3 << 7)
 #define GMAC_ANE_PSE_SHIFT	7
 
- /* GMAC Configuration defines */
-#define GMAC_CONTROL_TC	0x01000000	/* Transmit Conf. in RGMII/SGMII */
-#define GMAC_CONTROL_WD	0x00800000	/* Disable Watchdog on receive */
-
 /* GMAC Configuration defines */
+#define GMAC_CONTROL_2K 0x08000000	/* IEEE 802.3as 2K packets */
 #define GMAC_CONTROL_TC	0x01000000	/* Transmit Conf. in RGMII/SGMII */
 #define GMAC_CONTROL_WD	0x00800000	/* Disable Watchdog on receive */
 #define GMAC_CONTROL_JD	0x00400000	/* Jabber disable */
@@ -156,7 +153,7 @@ enum inter_frame_gap {
 #define GMAC_CONTROL_RE		0x00000004	/* Receiver Enable */
 
 #define GMAC_CORE_INIT (GMAC_CONTROL_JD | GMAC_CONTROL_PS | GMAC_CONTROL_ACS | \
-			GMAC_CONTROL_JE | GMAC_CONTROL_BE)
+			GMAC_CONTROL_BE)
 
 /* GMAC Frame Filter defines */
 #define GMAC_FRAME_FILTER_PR	0x00000001	/* Promiscuous Mode */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
index cdd9268..b3e148e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
@@ -32,10 +32,15 @@
 #include <asm/io.h>
 #include "dwmac1000.h"
 
-static void dwmac1000_core_init(void __iomem *ioaddr)
+static void dwmac1000_core_init(void __iomem *ioaddr, int mtu)
 {
 	u32 value = readl(ioaddr + GMAC_CONTROL);
 	value |= GMAC_CORE_INIT;
+	if (mtu > 1500)
+		value |= GMAC_CONTROL_2K;
+	if (mtu > 2000)
+		value |= GMAC_CONTROL_JE;
+
 	writel(value, ioaddr + GMAC_CONTROL);
 
 	/* Mask GMAC interrupts */
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
index 5857d67..2ff767b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c
@@ -32,7 +32,7 @@
 #include <asm/io.h>
 #include "dwmac100.h"
 
-static void dwmac100_core_init(void __iomem *ioaddr)
+static void dwmac100_core_init(void __iomem *ioaddr, int mtu)
 {
 	u32 value = readl(ioaddr + MAC_CONTROL);
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ecdc8ab..58f7744 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -52,7 +52,6 @@
 #include "stmmac.h"
 
 #define STMMAC_ALIGN(x)	L1_CACHE_ALIGN(x)
-#define JUMBO_LEN	9000
 
 /* Module parameters */
 #define TX_TIMEO	5000
@@ -91,7 +90,7 @@ static int tc = TC_DEFAULT;
 module_param(tc, int, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(tc, "DMA threshold control value");
 
-#define DMA_BUFFER_SIZE	BUF_SIZE_2KiB
+#define DMA_BUFFER_SIZE	BUF_SIZE_4KiB
 static int buf_sz = DMA_BUFFER_SIZE;
 module_param(buf_sz, int, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(buf_sz, "DMA buffer size");
@@ -990,6 +989,8 @@ static int init_dma_desc_rings(struct net_device *dev)
 	if (bfsize < BUF_SIZE_16KiB)
 		bfsize = stmmac_set_bfsize(dev->mtu, priv->dma_buf_sz);
 
+	priv->dma_buf_sz = bfsize;
+
 	if (netif_msg_probe(priv))
 		pr_debug("%s: txsize %d, rxsize %d, bfsize %d\n", __func__,
 			 txsize, rxsize, bfsize);
@@ -1079,7 +1080,6 @@ static int init_dma_desc_rings(struct net_device *dev)
 	}
 	priv->cur_rx = 0;
 	priv->dirty_rx = (unsigned int)(i - rxsize);
-	priv->dma_buf_sz = bfsize;
 	buf_sz = bfsize;
 
 	/* Setup the chained descriptor addresses */
@@ -1642,7 +1642,7 @@ static int stmmac_open(struct net_device *dev)
 		priv->plat->bus_setup(priv->ioaddr);
 
 	/* Initialize the MAC Core */
-	priv->hw->mac->core_init(priv->ioaddr);
+	priv->hw->mac->core_init(priv->ioaddr, dev->mtu);
 
 	/* Request the IRQ lines */
 	ret = request_irq(dev->irq, stmmac_interrupt,
@@ -2248,6 +2248,9 @@ static int stmmac_change_mtu(struct net_device *dev, int new_mtu)
 	else
 		max_mtu = SKB_MAX_HEAD(NET_SKB_PAD + NET_IP_ALIGN);
 
+	if (priv->plat->maxmtu < max_mtu)
+		max_mtu = priv->plat->maxmtu;
+
 	if ((new_mtu < 46) || (new_mtu > max_mtu)) {
 		pr_err("%s: invalid MTU, max MTU is: %d\n", dev->name, max_mtu);
 		return -EINVAL;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index 38bd1f4..48ec035 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -52,6 +52,11 @@ static int stmmac_probe_config_dt(struct platform_device *pdev,
 					   sizeof(struct stmmac_mdio_bus_data),
 					   GFP_KERNEL);
 
+	/* Set the maxmtu to a default of JUMBO_LEN in case the
+	 * parameter is not present in the device tree.
+	 */
+	plat->maxmtu = JUMBO_LEN;
+
 	/*
 	 * Currently only the properties needed on SPEAr600
 	 * are provided. All other properties should be added
@@ -60,6 +65,14 @@ static int stmmac_probe_config_dt(struct platform_device *pdev,
 	if (of_device_is_compatible(np, "st,spear600-gmac") ||
 		of_device_is_compatible(np, "snps,dwmac-3.70a") ||
 		of_device_is_compatible(np, "snps,dwmac")) {
+		/* Note that the max-frame-size parameter as defined in the
+		 * ePAPR v1.1 spec is defined as max-frame-size, it's
+		 * actually used as the IEEE definition of MAC Client
+		 * data, or MTU. The ePAPR specification is confusing as
+		 * the definition is max-frame-size, but usage examples
+		 * are clearly MTUs
+		 */
+		of_property_read_u32(np, "max-frame-size", &plat->maxmtu);
 		plat->has_gmac = 1;
 		plat->pmt = 1;
 	}
diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index bb5deb0..9689706 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -110,6 +110,7 @@ struct plat_stmmacenet_data {
 	int force_sf_dma_mode;
 	int force_thresh_dma_mode;
 	int riwt_off;
+	int maxmtu;
 	void (*fix_mac_speed)(void *priv, unsigned int speed);
 	void (*bus_setup)(void __iomem *ioaddr);
 	int (*init)(struct platform_device *pdev);
-- 
1.7.9.5

^ permalink raw reply related

* Re: [BUG] eth_type_trans() can access stale data
From: Eric Dumazet @ 2014-01-16 20:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20140115.154805.1403628531019301276.davem@davemloft.net>

On Wed, 2014-01-15 at 15:48 -0800, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 14 Jan 2014 15:52:21 -0800
> 
> > @@ -194,7 +194,8 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
> >  	 *      layer. We look for FFFF which isn't a used 802.2 SSAP/DSAP. This
> >  	 *      won't work for fault tolerant netware but does for the rest.
> >  	 */
> > -	if (unlikely(skb->len >= 2 && *(unsigned short *)(skb->data) == 0xFFFF))
> > +	if (unlikely(skb_headlen(skb) >= 2 &&
> > +		     *(unsigned short *)(skb->data) == 0xFFFF))
> >  		return htons(ETH_P_802_3);
> 
> I think using skb_header_pointer() will essentially have the same cost,
> why not use it instead?

Yes, we can try this, I'll send a patch, thanks.

^ permalink raw reply

* [PATCH-next v3] netfilter: don't use module_init/exit in core IPV4 code
From: Paul Gortmaker @ 2014-01-16 20:24 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik
  Cc: David S. Miller, Benjamin Poirier, netfilter-devel, netdev,
	Paul Gortmaker
In-Reply-To: <20140116192613.GA23133@d2.synalogic.ca>

The file net/ipv4/netfilter.o is created based on whether
CONFIG_NETFILTER is set.  However that is defined as a bool, and
hence this file with the core netfilter hooks will never be
modular.  So using module_init as an alias for __initcall can be
somewhat misleading.

Fix this up now, so that we can relocate module_init from
init.h into module.h in the future.  If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing.  Also add an inclusion of init.h, as
that was previously implicit here in the netfilter.c file.

Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups.  As __initcall gets
mapped onto device_initcall, our use of subsys_initcall (which
seems to make sense for netfilter code) will thus change this
registration from level 6-device to level 4-subsys (i.e. slightly
earlier).  However no observable impact of that small difference
has been observed during testing, or is expected. (i.e. the
location of the netfilter messages in dmesg remains unchanged
with respect to all the other surrounding messages.)

As for the module_exit, rather than replace it with __exitcall,
we simply remove it, since it appears only UML does anything
with those, and even for UML, there is no relevant cleanup
to be done here.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---

[v3: switch accidental device_initcall back to subsys_initcall
 as spotted by Benjamin Poirier; retest on ppc-sbc8548 net-next]

[v2: Drop __exitcall stuff completely, as per Eric's suggestion
 given for patch at  http://patchwork.ozlabs.org/patch/311164/ ]

 net/ipv4/netfilter.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index c3e0adea9c27..31abf9636ba7 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -197,11 +197,4 @@ static int __init ipv4_netfilter_init(void)
 {
 	return nf_register_afinfo(&nf_ip_afinfo);
 }
-
-static void __exit ipv4_netfilter_fini(void)
-{
-	nf_unregister_afinfo(&nf_ip_afinfo);
-}
-
-module_init(ipv4_netfilter_init);
-module_exit(ipv4_netfilter_fini);
+subsys_initcall(ipv4_netfilter_init);
-- 
1.8.5.2


^ permalink raw reply related

* Re: [PATCH net-next v4 2/6] virtio-net: use per-receive queue page frag alloc for mergeable bufs
From: Michael S. Tsirkin @ 2014-01-16 20:24 UTC (permalink / raw)
  To: Michael Dalton
  Cc: netdev, virtualization, Eric Dumazet, Ben Hutchings,
	David S. Miller
In-Reply-To: <1389901950-3854-2-git-send-email-mwdalton@google.com>

On Thu, Jan 16, 2014 at 11:52:26AM -0800, Michael Dalton wrote:
> The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
> mergeable rx buffer allocations. This commit migrates virtio-net to use
> per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
> mergeable rx buffer memory allocation, which now will use skb_refill_frag()
> for both atomic and GFP-WAIT buffer allocations.
> 
> To address fragmentation concerns, if after buffer allocation there
> is too little space left in the page frag to allocate a subsequent
> buffer, the remaining space is added to the current allocated buffer
> so that the remaining space can be used to store packet data.
> 
> Signed-off-by: Michael Dalton <mwdalton@google.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> v1->v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
>         Remove per-netdev GFP_KERNEL page_frag allocator.
> 
>  drivers/net/virtio_net.c | 69 ++++++++++++++++++++++++------------------------
>  1 file changed, 35 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 7b17240..36cbf06 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -78,6 +78,9 @@ struct receive_queue {
>  	/* Chain pages by the private ptr. */
>  	struct page *pages;
>  
> +	/* Page frag for packet buffer allocation. */
> +	struct page_frag alloc_frag;
> +
>  	/* RX: fragments + linear part + virtio header */
>  	struct scatterlist sg[MAX_SKB_FRAGS + 2];
>  
> @@ -126,11 +129,6 @@ struct virtnet_info {
>  	/* Lock for config space updates */
>  	struct mutex config_lock;
>  
> -	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
> -	 * low on memory.
> -	 */
> -	struct page_frag alloc_frag;
> -
>  	/* Does the affinity hint is set for virtqueues? */
>  	bool affinity_hint_set;
>  
> @@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  	int num_buf = hdr->mhdr.num_buffers;
>  	struct page *page = virt_to_head_page(buf);
>  	int offset = buf - page_address(page);
> -	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
> -					       MERGE_BUFFER_LEN);
> +	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
> +	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
>  	struct sk_buff *curr_skb = head_skb;
>  
>  	if (unlikely(!curr_skb))
> @@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			dev->stats.rx_length_errors++;
>  			goto err_buf;
>  		}
> -		if (unlikely(len > MERGE_BUFFER_LEN)) {
> -			pr_debug("%s: rx error: merge buffer too long\n",
> -				 dev->name);
> -			len = MERGE_BUFFER_LEN;
> -		}
>  
>  		page = virt_to_head_page(buf);
>  		--rq->num;
> @@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			head_skb->truesize += nskb->truesize;
>  			num_skb_frags = 0;
>  		}
> +		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
>  		if (curr_skb != head_skb) {
>  			head_skb->data_len += len;
>  			head_skb->len += len;
> -			head_skb->truesize += MERGE_BUFFER_LEN;
> +			head_skb->truesize += truesize;
>  		}
>  		offset = buf - page_address(page);
>  		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
>  			put_page(page);
>  			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
> -					     len, MERGE_BUFFER_LEN);
> +					     len, truesize);
>  		} else {
>  			skb_add_rx_frag(curr_skb, num_skb_frags, page,
> -					offset, len, MERGE_BUFFER_LEN);
> +					offset, len, truesize);
>  		}
>  	}
>  
> @@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  
>  static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
> -	struct virtnet_info *vi = rq->vq->vdev->priv;
> -	char *buf = NULL;
> +	struct page_frag *alloc_frag = &rq->alloc_frag;
> +	char *buf;
>  	int err;
> +	unsigned int len, hole;
>  
> -	if (gfp & __GFP_WAIT) {
> -		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
> -					 gfp)) {
> -			buf = (char *)page_address(vi->alloc_frag.page) +
> -			      vi->alloc_frag.offset;
> -			get_page(vi->alloc_frag.page);
> -			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
> -		}
> -	} else {
> -		buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
> -	}
> -	if (!buf)
> +	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
>  		return -ENOMEM;
> +	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> +	get_page(alloc_frag->page);
> +	len = MERGE_BUFFER_LEN;
> +	alloc_frag->offset += len;
> +	hole = alloc_frag->size - alloc_frag->offset;
> +	if (hole < MERGE_BUFFER_LEN) {
> +		len += hole;
> +		alloc_frag->offset += hole;
> +	}
>  
> -	sg_init_one(rq->sg, buf, MERGE_BUFFER_LEN);
> +	sg_init_one(rq->sg, buf, len);
>  	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
>  	if (err < 0)
>  		put_page(virt_to_head_page(buf));
> @@ -617,6 +610,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
>  	int err;
>  	bool oom;
>  
> +	gfp |= __GFP_COLD;
>  	do {
>  		if (vi->mergeable_rx_bufs)
>  			err = add_recvbuf_mergeable(rq, gfp);
> @@ -1377,6 +1371,14 @@ static void free_receive_bufs(struct virtnet_info *vi)
>  	}
>  }
>  
> +static void free_receive_page_frags(struct virtnet_info *vi)
> +{
> +	int i;
> +	for (i = 0; i < vi->max_queue_pairs; i++)
> +		if (vi->rq[i].alloc_frag.page)
> +			put_page(vi->rq[i].alloc_frag.page);
> +}
> +
>  static void free_unused_bufs(struct virtnet_info *vi)
>  {
>  	void *buf;
> @@ -1705,9 +1707,8 @@ free_recv_bufs:
>  	unregister_netdev(dev);
>  free_vqs:
>  	cancel_delayed_work_sync(&vi->refill);
> +	free_receive_page_frags(vi);
>  	virtnet_del_vqs(vi);
> -	if (vi->alloc_frag.page)
> -		put_page(vi->alloc_frag.page);
>  free_stats:
>  	free_percpu(vi->stats);
>  free:
> @@ -1724,6 +1725,8 @@ static void remove_vq_common(struct virtnet_info *vi)
>  
>  	free_receive_bufs(vi);
>  
> +	free_receive_page_frags(vi);
> +
>  	virtnet_del_vqs(vi);
>  }
>  
> @@ -1741,8 +1744,6 @@ static void virtnet_remove(struct virtio_device *vdev)
>  	unregister_netdev(vi->dev);
>  
>  	remove_vq_common(vi);
> -	if (vi->alloc_frag.page)
> -		put_page(vi->alloc_frag.page);
>  
>  	flush_work(&vi->config_work);
>  
> -- 
> 1.8.5.2

^ permalink raw reply

* Re: [PATCH net-next v4 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance
From: Michael S. Tsirkin @ 2014-01-16 20:24 UTC (permalink / raw)
  To: Michael Dalton
  Cc: netdev, virtualization, Eric Dumazet, Ben Hutchings,
	David S. Miller
In-Reply-To: <1389901950-3854-3-git-send-email-mwdalton@google.com>

On Thu, Jan 16, 2014 at 11:52:27AM -0800, Michael Dalton wrote:
> Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
> allocators") changed the mergeable receive buffer size from PAGE_SIZE to
> MTU-size, introducing a single-stream regression for benchmarks with large
> average packet size. There is no single optimal buffer size for all
> workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
> header-sized buffers are preferred as larger buffers reduce the TCP window
> due to SKB truesize. However, single-stream workloads with large average
> packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
> are used.
> 
> This commit auto-tunes the mergeable receiver buffer packet size by
> choosing the packet buffer size based on an EWMA of the recent packet
> sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
> virtio-net header len to PAGE_SIZE. This improves throughput for
> large packet workloads, as any workload with average packet size >=
> PAGE_SIZE will use PAGE_SIZE buffers.
> 
> These optimizations interact positively with recent commit
> ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
> which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
> optimizations benefit buffers of any size.
> 
> Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
> between two QEMU VMs on a single physical machine. Each VM has two VCPUs
> with all offloads & vhost enabled. All VMs and vhost threads run in a
> single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
> in the system will not be scheduled on the benchmark CPUs. Trunk includes
> SKB rx frag coalescing.
> 
> net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
> net-next (MTU-size bufs):  13170.01Gb/s
> net-next + auto-tune: 14555.94Gb/s
> 
> Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
> using MTU-sized buffers to about 26Gb/s using auto-tuning.
> 
> Signed-off-by: Michael Dalton <mwdalton@google.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
>         base address and truesize into an unsigned long by requiring a
>         minimum packet size alignment of 256. Permit attempts to fill
>         an already-full RX ring (reverting the change in v2).
> v1->v2: Add per-receive queue metadata ring to track precise truesize for
>         mergeable receive buffers. Remove all truesize approximation. Never
>         try to fill a full RX ring (required for metadata ring in v2).
> 
>  drivers/net/virtio_net.c | 99 ++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 74 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 36cbf06..3e82311 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,7 @@
>  #include <linux/if_vlan.h>
>  #include <linux/slab.h>
>  #include <linux/cpu.h>
> +#include <linux/average.h>
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> -#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
> -                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
> -                                L1_CACHE_BYTES))
>  #define GOOD_COPY_LEN	128
>  
> +/* Weight used for the RX packet size EWMA. The average packet size is used to
> + * determine the packet buffer size when refilling RX rings. As the entire RX
> + * ring may be refilled at once, the weight is chosen so that the EWMA will be
> + * insensitive to short-term, transient changes in packet size.
> + */
> +#define RECEIVE_AVG_WEIGHT 64
> +
> +/* Minimum alignment for mergeable packet buffers. */
> +#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
> +
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
>  struct virtnet_stats {
> @@ -78,6 +86,9 @@ struct receive_queue {
>  	/* Chain pages by the private ptr. */
>  	struct page *pages;
>  
> +	/* Average packet length for mergeable receive buffers. */
> +	struct ewma mrg_avg_pkt_len;
> +
>  	/* Page frag for packet buffer allocation. */
>  	struct page_frag alloc_frag;
>  
> @@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
>  	netif_wake_subqueue(vi->dev, vq2txq(vq));
>  }
>  
> +static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
> +{
> +	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
> +	return truesize * MERGEABLE_BUFFER_ALIGN;
> +}
> +
> +static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
> +{
> +	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
> +
> +}
> +
> +static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
> +{
> +	return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
> +}
> +
>  /* Called from bottom half context */
>  static struct sk_buff *page_to_skb(struct receive_queue *rq,
>  				   struct page *page, unsigned int offset,
> @@ -327,31 +355,33 @@ err:
>  
>  static struct sk_buff *receive_mergeable(struct net_device *dev,
>  					 struct receive_queue *rq,
> -					 void *buf,
> +					 unsigned long ctx,
>  					 unsigned int len)
>  {
> +	void *buf = mergeable_ctx_to_buf_address(ctx);
>  	struct skb_vnet_hdr *hdr = buf;
>  	int num_buf = hdr->mhdr.num_buffers;
>  	struct page *page = virt_to_head_page(buf);
>  	int offset = buf - page_address(page);
> -	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
> +	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
> +
>  	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
>  	struct sk_buff *curr_skb = head_skb;
>  
>  	if (unlikely(!curr_skb))
>  		goto err_skb;
> -
>  	while (--num_buf) {
>  		int num_skb_frags;
>  
> -		buf = virtqueue_get_buf(rq->vq, &len);
> -		if (unlikely(!buf)) {
> +		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
> +		if (unlikely(!ctx)) {
>  			pr_debug("%s: rx error: %d buffers out of %d missing\n",
>  				 dev->name, num_buf, hdr->mhdr.num_buffers);
>  			dev->stats.rx_length_errors++;
>  			goto err_buf;
>  		}
>  
> +		buf = mergeable_ctx_to_buf_address(ctx);
>  		page = virt_to_head_page(buf);
>  		--rq->num;
>  
> @@ -369,7 +399,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  			head_skb->truesize += nskb->truesize;
>  			num_skb_frags = 0;
>  		}
> -		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
> +		truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
>  		if (curr_skb != head_skb) {
>  			head_skb->data_len += len;
>  			head_skb->len += len;
> @@ -386,19 +416,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		}
>  	}
>  
> +	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
>  	return head_skb;
>  
>  err_skb:
>  	put_page(page);
>  	while (--num_buf) {
> -		buf = virtqueue_get_buf(rq->vq, &len);
> -		if (unlikely(!buf)) {
> +		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
> +		if (unlikely(!ctx)) {
>  			pr_debug("%s: rx error: %d buffers missing\n",
>  				 dev->name, num_buf);
>  			dev->stats.rx_length_errors++;
>  			break;
>  		}
> -		page = virt_to_head_page(buf);
> +		page = virt_to_head_page(mergeable_ctx_to_buf_address(ctx));
>  		put_page(page);
>  		--rq->num;
>  	}
> @@ -419,17 +450,20 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
>  		pr_debug("%s: short packet %i\n", dev->name, len);
>  		dev->stats.rx_length_errors++;
> -		if (vi->mergeable_rx_bufs)
> -			put_page(virt_to_head_page(buf));
> -		else if (vi->big_packets)
> +		if (vi->mergeable_rx_bufs) {
> +			unsigned long ctx = (unsigned long)buf;
> +			void *base = mergeable_ctx_to_buf_address(ctx);
> +			put_page(virt_to_head_page(base));
> +		} else if (vi->big_packets) {
>  			give_pages(rq, buf);
> -		else
> +		} else {
>  			dev_kfree_skb(buf);
> +		}
>  		return;
>  	}
>  
>  	if (vi->mergeable_rx_bufs)
> -		skb = receive_mergeable(dev, rq, buf, len);
> +		skb = receive_mergeable(dev, rq, (unsigned long)buf, len);
>  	else if (vi->big_packets)
>  		skb = receive_big(dev, rq, buf, len);
>  	else
> @@ -572,25 +606,36 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  
>  static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>  {
> +	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>  	struct page_frag *alloc_frag = &rq->alloc_frag;
>  	char *buf;
> +	unsigned long ctx;
>  	int err;
>  	unsigned int len, hole;
>  
> -	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
> +	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
> +				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
> +	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
> +	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
>  		return -ENOMEM;
> +
>  	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> +	ctx = mergeable_buf_to_ctx(buf, len);
>  	get_page(alloc_frag->page);
> -	len = MERGE_BUFFER_LEN;
>  	alloc_frag->offset += len;
>  	hole = alloc_frag->size - alloc_frag->offset;
> -	if (hole < MERGE_BUFFER_LEN) {
> +	if (hole < len) {
> +		/* To avoid internal fragmentation, if there is very likely not
> +		 * enough space for another buffer, add the remaining space to
> +		 * the current buffer. This extra space is not included in
> +		 * the truesize stored in ctx.
> +		 */
>  		len += hole;
>  		alloc_frag->offset += hole;
>  	}
>  
>  	sg_init_one(rq->sg, buf, len);
> -	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
> +	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, (void *)ctx, gfp);
>  	if (err < 0)
>  		put_page(virt_to_head_page(buf));
>  
> @@ -1394,12 +1439,15 @@ static void free_unused_bufs(struct virtnet_info *vi)
>  		struct virtqueue *vq = vi->rq[i].vq;
>  
>  		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
> -			if (vi->mergeable_rx_bufs)
> -				put_page(virt_to_head_page(buf));
> -			else if (vi->big_packets)
> +			if (vi->mergeable_rx_bufs) {
> +				unsigned long ctx = (unsigned long)buf;
> +				void *base = mergeable_ctx_to_buf_address(ctx);
> +				put_page(virt_to_head_page(base));
> +			} else if (vi->big_packets) {
>  				give_pages(&vi->rq[i], buf);
> -			else
> +			} else {
>  				dev_kfree_skb(buf);
> +			}
>  			--vi->rq[i].num;
>  		}
>  		BUG_ON(vi->rq[i].num != 0);
> @@ -1509,6 +1557,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>  			       napi_weight);
>  
>  		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
> +		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
>  		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
>  	}
>  
> -- 
> 1.8.5.2

^ permalink raw reply

* Re: [PATCH net-next v4 4/6] net-sysfs: add support for device-specific rx queue sysfs attributes
From: Michael S. Tsirkin @ 2014-01-16 20:25 UTC (permalink / raw)
  To: Michael Dalton
  Cc: netdev, virtualization, Eric Dumazet, Ben Hutchings,
	David S. Miller
In-Reply-To: <1389901950-3854-4-git-send-email-mwdalton@google.com>

On Thu, Jan 16, 2014 at 11:52:28AM -0800, Michael Dalton wrote:
> Extend existing support for netdevice receive queue sysfs attributes to
> permit a device-specific attribute group. Initial use case for this
> support will be to allow the virtio-net device to export per-receive
> queue mergeable receive buffer size.
> 
> Signed-off-by: Michael Dalton <mwdalton@google.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> v3->v4: Simplify by removing loop in get_netdev_rx_queue_index.
> 
>  include/linux/netdevice.h | 35 +++++++++++++++++++++++++++++++----
>  net/core/dev.c            | 12 ++++++------
>  net/core/net-sysfs.c      | 33 ++++++++++++++++-----------------
>  3 files changed, 53 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5c88ab1..38929bc 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu *rps_sock_flow_table;
>  bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
>  			 u16 filter_id);
>  #endif
> +#endif /* CONFIG_RPS */
>  
>  /* This structure contains an instance of an RX queue. */
>  struct netdev_rx_queue {
> +#ifdef CONFIG_RPS
>  	struct rps_map __rcu		*rps_map;
>  	struct rps_dev_flow_table __rcu	*rps_flow_table;
> +#endif
>  	struct kobject			kobj;
>  	struct net_device		*dev;
>  } ____cacheline_aligned_in_smp;
> -#endif /* CONFIG_RPS */
> +
> +/*
> + * RX queue sysfs structures and functions.
> + */
> +struct rx_queue_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct netdev_rx_queue *queue,
> +	    struct rx_queue_attribute *attr, char *buf);
> +	ssize_t (*store)(struct netdev_rx_queue *queue,
> +	    struct rx_queue_attribute *attr, const char *buf, size_t len);
> +};
>  
>  #ifdef CONFIG_XPS
>  /*
> @@ -1313,7 +1326,7 @@ struct net_device {
>  						   unicast) */
>  
>  
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  	struct netdev_rx_queue	*_rx;
>  
>  	/* Number of RX queues allocated at register_netdev() time */
> @@ -1424,6 +1437,8 @@ struct net_device {
>  	struct device		dev;
>  	/* space for optional device, statistics, and wireless sysfs groups */
>  	const struct attribute_group *sysfs_groups[4];
> +	/* space for optional per-rx queue attributes */
> +	const struct attribute_group *sysfs_rx_queue_group;
>  
>  	/* rtnetlink link ops */
>  	const struct rtnl_link_ops *rtnl_link_ops;
> @@ -2374,7 +2389,7 @@ static inline bool netif_is_multiqueue(const struct net_device *dev)
>  
>  int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);
>  
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
>  #else
>  static inline int netif_set_real_num_rx_queues(struct net_device *dev,
> @@ -2393,7 +2408,7 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
>  					   from_dev->real_num_tx_queues);
>  	if (err)
>  		return err;
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  	return netif_set_real_num_rx_queues(to_dev,
>  					    from_dev->real_num_rx_queues);
>  #else
> @@ -2401,6 +2416,18 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
>  #endif
>  }
>  
> +#ifdef CONFIG_SYSFS
> +static inline unsigned int get_netdev_rx_queue_index(
> +		struct netdev_rx_queue *queue)
> +{
> +	struct net_device *dev = queue->dev;
> +	int index = queue - dev->_rx;
> +
> +	BUG_ON(index >= dev->num_rx_queues);
> +	return index;
> +}
> +#endif
> +
>  #define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
>  int netif_get_num_default_rss_queues(void);
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 20c834e..4be7931 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2080,7 +2080,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
>  }
>  EXPORT_SYMBOL(netif_set_real_num_tx_queues);
>  
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  /**
>   *	netif_set_real_num_rx_queues - set actual number of RX queues used
>   *	@dev: Network device
> @@ -5727,7 +5727,7 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
>  }
>  EXPORT_SYMBOL(netif_stacked_transfer_operstate);
>  
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  static int netif_alloc_rx_queues(struct net_device *dev)
>  {
>  	unsigned int i, count = dev->num_rx_queues;
> @@ -6272,7 +6272,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  		return NULL;
>  	}
>  
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  	if (rxqs < 1) {
>  		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
>  		return NULL;
> @@ -6328,7 +6328,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  	if (netif_alloc_netdev_queues(dev))
>  		goto free_all;
>  
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  	dev->num_rx_queues = rxqs;
>  	dev->real_num_rx_queues = rxqs;
>  	if (netif_alloc_rx_queues(dev))
> @@ -6348,7 +6348,7 @@ free_all:
>  free_pcpu:
>  	free_percpu(dev->pcpu_refcnt);
>  	netif_free_tx_queues(dev);
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  	kfree(dev->_rx);
>  #endif
>  
> @@ -6373,7 +6373,7 @@ void free_netdev(struct net_device *dev)
>  	release_net(dev_net(dev));
>  
>  	netif_free_tx_queues(dev);
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  	kfree(dev->_rx);
>  #endif
>  
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 49843bf..0193ff3 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -498,17 +498,7 @@ static struct attribute_group wireless_group = {
>  #define net_class_groups	NULL
>  #endif /* CONFIG_SYSFS */
>  
> -#ifdef CONFIG_RPS
> -/*
> - * RX queue sysfs structures and functions.
> - */
> -struct rx_queue_attribute {
> -	struct attribute attr;
> -	ssize_t (*show)(struct netdev_rx_queue *queue,
> -	    struct rx_queue_attribute *attr, char *buf);
> -	ssize_t (*store)(struct netdev_rx_queue *queue,
> -	    struct rx_queue_attribute *attr, const char *buf, size_t len);
> -};
> +#ifdef CONFIG_SYSFS
>  #define to_rx_queue_attr(_attr) container_of(_attr,		\
>      struct rx_queue_attribute, attr)
>  
> @@ -543,6 +533,7 @@ static const struct sysfs_ops rx_queue_sysfs_ops = {
>  	.store = rx_queue_attr_store,
>  };
>  
> +#ifdef CONFIG_RPS
>  static ssize_t show_rps_map(struct netdev_rx_queue *queue,
>  			    struct rx_queue_attribute *attribute, char *buf)
>  {
> @@ -718,16 +709,20 @@ static struct rx_queue_attribute rps_cpus_attribute =
>  static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
>  	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
>  	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
> +#endif /* CONFIG_RPS */
>  
>  static struct attribute *rx_queue_default_attrs[] = {
> +#ifdef CONFIG_RPS
>  	&rps_cpus_attribute.attr,
>  	&rps_dev_flow_table_cnt_attribute.attr,
> +#endif
>  	NULL
>  };
>  
>  static void rx_queue_release(struct kobject *kobj)
>  {
>  	struct netdev_rx_queue *queue = to_rx_queue(kobj);
> +#ifdef CONFIG_RPS
>  	struct rps_map *map;
>  	struct rps_dev_flow_table *flow_table;
>  
> @@ -743,6 +738,7 @@ static void rx_queue_release(struct kobject *kobj)
>  		RCU_INIT_POINTER(queue->rps_flow_table, NULL);
>  		call_rcu(&flow_table->rcu, rps_dev_flow_table_release);
>  	}
> +#endif
>  
>  	memset(kobj, 0, sizeof(*kobj));
>  	dev_put(queue->dev);
> @@ -767,21 +763,27 @@ static int rx_queue_add_kobject(struct net_device *net, int index)
>  		kobject_put(kobj);
>  		return error;
>  	}
> +	if (net->sysfs_rx_queue_group)
> +		sysfs_create_group(kobj, net->sysfs_rx_queue_group);
>  
>  	kobject_uevent(kobj, KOBJ_ADD);
>  	dev_hold(queue->dev);
>  
>  	return error;
>  }
> -#endif /* CONFIG_RPS */
> +#endif /* CONFIG_SYFS */
>  
>  int
>  net_rx_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
>  {
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  	int i;
>  	int error = 0;
>  
> +#ifndef CONFIG_RPS
> +	if (!net->sysfs_rx_queue_group)
> +		return 0;
> +#endif
>  	for (i = old_num; i < new_num; i++) {
>  		error = rx_queue_add_kobject(net, i);
>  		if (error) {
> @@ -1155,9 +1157,6 @@ static int register_queue_kobjects(struct net_device *net)
>  	    NULL, &net->dev.kobj);
>  	if (!net->queues_kset)
>  		return -ENOMEM;
> -#endif
> -
> -#ifdef CONFIG_RPS
>  	real_rx = net->real_num_rx_queues;
>  #endif
>  	real_tx = net->real_num_tx_queues;
> @@ -1184,7 +1183,7 @@ static void remove_queue_kobjects(struct net_device *net)
>  {
>  	int real_rx = 0, real_tx = 0;
>  
> -#ifdef CONFIG_RPS
> +#ifdef CONFIG_SYSFS
>  	real_rx = net->real_num_rx_queues;
>  #endif
>  	real_tx = net->real_num_tx_queues;
> -- 
> 1.8.5.2

^ permalink raw reply

* Re: [PATCH net-next v4 5/6] lib: Ensure EWMA does not store wrong intermediate values
From: Michael S. Tsirkin @ 2014-01-16 20:25 UTC (permalink / raw)
  To: Michael Dalton
  Cc: netdev, virtualization, Eric Dumazet, Ben Hutchings,
	David S. Miller
In-Reply-To: <1389901950-3854-5-git-send-email-mwdalton@google.com>

On Thu, Jan 16, 2014 at 11:52:29AM -0800, Michael Dalton wrote:
> To ensure ewma_read() without a lock returns a valid but possibly
> out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
> intermediate wrong values from being written to avg->internal.
> 
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Michael Dalton <mwdalton@google.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
>  lib/average.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/average.c b/lib/average.c
> index 99a67e6..114d1be 100644
> --- a/lib/average.c
> +++ b/lib/average.c
> @@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
>   */
>  struct ewma *ewma_add(struct ewma *avg, unsigned long val)
>  {
> -	avg->internal = avg->internal  ?
> -		(((avg->internal << avg->weight) - avg->internal) +
> +	unsigned long internal = ACCESS_ONCE(avg->internal);
> +
> +	ACCESS_ONCE(avg->internal) = internal ?
> +		(((internal << avg->weight) - internal) +
>  			(val << avg->factor)) >> avg->weight :
>  		(val << avg->factor);
>  	return avg;
> -- 
> 1.8.5.2

^ permalink raw reply

* Re: [PATCH net-next v4 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size
From: Michael S. Tsirkin @ 2014-01-16 20:25 UTC (permalink / raw)
  To: Michael Dalton
  Cc: netdev, virtualization, Eric Dumazet, Ben Hutchings,
	David S. Miller
In-Reply-To: <1389901950-3854-6-git-send-email-mwdalton@google.com>

On Thu, Jan 16, 2014 at 11:52:30AM -0800, Michael Dalton wrote:
> Add initial support for per-rx queue sysfs attributes to virtio-net. If
> mergeable packet buffers are enabled, adds a read-only mergeable packet
> buffer size sysfs attribute for each RX queue.
> 
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Michael Dalton <mwdalton@google.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> v3->v4: Remove seqcount due to EWMA changes in patch 5.
>         Add missing Suggested-By.
> 
>  drivers/net/virtio_net.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 42 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 3e82311..968eacd 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -604,18 +604,25 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  	return err;
>  }
>  
> -static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
> +static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
>  {
>  	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	unsigned int len;
> +
> +	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
> +			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
> +	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
> +}
> +
> +static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
> +{
>  	struct page_frag *alloc_frag = &rq->alloc_frag;
>  	char *buf;
>  	unsigned long ctx;
>  	int err;
>  	unsigned int len, hole;
>  
> -	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
> -				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
> -	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
> +	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
>  	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
>  		return -ENOMEM;
>  
> @@ -1594,6 +1601,33 @@ err:
>  	return ret;
>  }
>  
> +#ifdef CONFIG_SYSFS
> +static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
> +		struct rx_queue_attribute *attribute, char *buf)
> +{
> +	struct virtnet_info *vi = netdev_priv(queue->dev);
> +	unsigned int queue_index = get_netdev_rx_queue_index(queue);
> +	struct ewma *avg;
> +
> +	BUG_ON(queue_index >= vi->max_queue_pairs);
> +	avg = &vi->rq[queue_index].mrg_avg_pkt_len;
> +	return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
> +}
> +
> +static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
> +	__ATTR_RO(mergeable_rx_buffer_size);
> +
> +static struct attribute *virtio_net_mrg_rx_attrs[] = {
> +	&mergeable_rx_buffer_size_attribute.attr,
> +	NULL
> +};
> +
> +static const struct attribute_group virtio_net_mrg_rx_group = {
> +	.name = "virtio_net",
> +	.attrs = virtio_net_mrg_rx_attrs
> +};
> +#endif
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
>  	int i, err;
> @@ -1708,6 +1742,10 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto free_stats;
>  
> +#ifdef CONFIG_SYSFS
> +	if (vi->mergeable_rx_bufs)
> +		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
> +#endif
>  	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
>  	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
>  
> -- 
> 1.8.5.2

^ permalink raw reply

* Re: [PATCH net] ipv6: simplify detection of first operational link-local address on interface
From: Hannes Frederic Sowa @ 2014-01-16 20:27 UTC (permalink / raw)
  To: Brian Haley; +Cc: Jiri Pirko, netdev, fbl
In-Reply-To: <52D83DE3.40806@hp.com>

On Thu, Jan 16, 2014 at 03:15:31PM -0500, Brian Haley wrote:
> On 01/16/2014 02:13 PM, Hannes Frederic Sowa wrote:
> > In commit 1ec047eb4751e3 ("ipv6: introduce per-interface counter for
> > dad-completed ipv6 addresses") I build the detection of the first
> > operational link-local address much to complex. Additionally this code
> > now has a race condition.
> > 
> > Replace it with a much simpler variant, which just scans the address
> > list when duplicate address detection completes, to check if this is
> > the first valid link local address and send RS and MLD reports then.
> > 
> > Fixes: 1ec047eb4751e3 ("ipv6: introduce per-interface counter for dad-completed ipv6 addresses")
> > Reported-by: Jiri Pirko <jiri@resnulli.us>
> > Cc: Flavio Leitner <fbl@redhat.com>
> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > ---
> 
> > +/* ifp->idev must be at least read locked */
> > +static bool ipv6_lonely_lladdr(struct inet6_ifaddr *ifp)
> > +{
> > +	struct inet6_ifaddr *ifpiter;
> > +	struct inet6_dev *idev = ifp->idev;
> > +
> > +	list_for_each_entry(ifpiter, &idev->addr_list, if_list) {
> > +		if (ifp != ifpiter && ifpiter->scope == IFA_LINK &&
> > +		    (ifpiter->flags & (IFA_F_PERMANENT|IFA_F_TENTATIVE|
> > +				       IFA_F_OPTIMISTIC|IFA_F_DADFAILED)) ==
> > +		    IFA_F_PERMANENT)
> > +			return false;
> > +	}
> > +	return true;
> > +}
> 
> Just a nit, but the idev->addr_list is sorted by scope, so you could use
> list_for_each_entry_reverse(), and break out once the scope is > IFA_LINK to
> reduce the number of compares.

Very good idea, but I would leave this patch for net as-is. I'll send
one for net-next which does this optimization for the other functions
at the weekend (we can do the same optimization to ipv6_get_lladdr etc.).

Thanks,

  Hannes

^ permalink raw reply

* Re: [PATCH v2 00/16] wl1251 patches from linux-n900 tree
From: John W. Linville @ 2014-01-16 20:29 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Pali Rohár, Luciano Coelho,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, freemangordon-uiMcrn6V0Vs,
	aaro.koskinen-X3B1VOXEql0, sre-GFxCN5SEZAc,
	joni.lapilainen-Re5JQEeQqe8AvxtiuMwx3w, Johannes Berg,
	Felipe Contreras
In-Reply-To: <20140116002121.GA3726-tWAi6jLit6GreWDznjuHag@public.gmane.org>

On Thu, Jan 16, 2014 at 01:21:21AM +0100, Pavel Machek wrote:
> On Mon 2014-01-06 15:00:50, John W. Linville wrote:
> > On Tue, Dec 31, 2013 at 10:47:29AM +0100, Pali Rohár wrote:
> > > On Sunday 08 December 2013 10:24:58 Pali Rohár wrote:
> > > > Hello, I'm sending wl1251 patches from linux-n900 tree [1] for
> > > > comments. More patches come from David's monitor & packet
> > > > injection work. Patches are tested with 3.12 rc5 kernel on
> > > > Nokia N900.
> > > > 
> > > > Second version contains new patch for fixing NULL pointer
> > > > dereference which sometimes cause kernel panic and fixes code
> > > > suggested by Pavel Machek.
> >  
> > > So far, there are no new comments for patches 01, 03-13 and 16. 
> > > Are there any other problems with that patches or can be accepted 
> > > for upstream?
> >  
> > They are probably fine to merge now.  Of course, I was still deleting
> > wl12xx patches when they were originally posted, so I'll need someone
> > to resend the patches to me...
> 
> Ping? Did they get merged? Did you receive the patches?

They are already available in wireless-next.

-- 
John W. Linville		Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org			might be all we have.  Be ready.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH-next v3] netfilter: don't use module_init/exit in core IPV4 code
From: Patrick McHardy @ 2014-01-16 20:32 UTC (permalink / raw)
  To: Paul Gortmaker
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, David S. Miller,
	Benjamin Poirier, netfilter-devel, netdev
In-Reply-To: <1389903843-1633-1-git-send-email-paul.gortmaker@windriver.com>

On Thu, Jan 16, 2014 at 03:24:03PM -0500, Paul Gortmaker wrote:
> The file net/ipv4/netfilter.o is created based on whether
> CONFIG_NETFILTER is set.  However that is defined as a bool, and
> hence this file with the core netfilter hooks will never be
> modular.  So using module_init as an alias for __initcall can be
> somewhat misleading.
> 
> Fix this up now, so that we can relocate module_init from
> init.h into module.h in the future.  If we don't do this, we'd
> have to add module.h to obviously non-modular code, and that
> would be a worse thing.  Also add an inclusion of init.h, as
> that was previously implicit here in the netfilter.c file.
> 
> Note that direct use of __initcall is discouraged, vs. one
> of the priority categorized subgroups.  As __initcall gets
> mapped onto device_initcall, our use of subsys_initcall (which
> seems to make sense for netfilter code) will thus change this
> registration from level 6-device to level 4-subsys (i.e. slightly
> earlier).  However no observable impact of that small difference
> has been observed during testing, or is expected. (i.e. the
> location of the netfilter messages in dmesg remains unchanged
> with respect to all the other surrounding messages.)
> 
> As for the module_exit, rather than replace it with __exitcall,
> we simply remove it, since it appears only UML does anything
> with those, and even for UML, there is no relevant cleanup
> to be done here.

Just to bring up the idea: there's no reason why we couldn't make
netfilter fully modular for individual AFs. I've seen quite a lot
of embedded systems that include the IPv6 netfilter core without
using it.

Its rather complicated to get all the dependencies of xtables
modules right though, so this is really just to bring up the
idea.

Your patch looks fine to me.

^ permalink raw reply

* Re: [PATCH net] ipv6: simplify detection of first operational link-local address on interface
From: Flavio Leitner @ 2014-01-16 21:20 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Jiri Pirko, netdev
In-Reply-To: <20140116191304.GC17529@order.stressinduktion.org>

On Thu, Jan 16, 2014 at 08:13:04PM +0100, Hannes Frederic Sowa wrote:
> In commit 1ec047eb4751e3 ("ipv6: introduce per-interface counter for
> dad-completed ipv6 addresses") I build the detection of the first
> operational link-local address much to complex. Additionally this code
> now has a race condition.
> 
> Replace it with a much simpler variant, which just scans the address
> list when duplicate address detection completes, to check if this is
> the first valid link local address and send RS and MLD reports then.
> 
> Fixes: 1ec047eb4751e3 ("ipv6: introduce per-interface counter for dad-completed ipv6 addresses")
> Reported-by: Jiri Pirko <jiri@resnulli.us>
> Cc: Flavio Leitner <fbl@redhat.com>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---

Nice
Acked-by: Flavio Leitner <fbl@redhat.com>

^ permalink raw reply

* [PATCH v2] ipv6: send Change Status Report after DAD is completed
From: Flavio Leitner @ 2014-01-16 21:27 UTC (permalink / raw)
  To: netdev; +Cc: Hideaki YOSHIFUJI, Hannes Frederic Sowa, Flavio Leitner

The RFC 3810 defines two type of messages for multicast
listeners. The "Current State Report" message, as the name
implies, refreshes the *current* state to the querier.
Since the querier sends Query messages periodically, there
is no need to retransmit the report.

On the other hand, any change should be reported immediately
using "State Change Report" messages. Since it's an event
triggered by a change and that it can be affected by packet
loss, the rfc states it should be retransmitted [RobVar] times
to make sure routers will receive timely.

Currently, we are sending "Current State Reports" after
DAD is completed.  Before that, we send messages using
unspecified address (::) which should be silently discarded
by routers.

This patch changes to send "State Change Report" messages
after DAD is completed fixing the behavior to be RFC compliant
and also to pass TAHI IPv6 testsuite.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
---
 net/ipv6/mcast.c | 57 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 33 insertions(+), 24 deletions(-)

v2: refreshed against current net-next tree
    simplified the code

diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 7ff82b3..e1e4735 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1665,7 +1665,7 @@ static struct sk_buff *add_grhead(struct sk_buff *skb, struct ifmcaddr6 *pmc,
 	skb_tailroom(skb)) : 0)
 
 static struct sk_buff *add_grec(struct sk_buff *skb, struct ifmcaddr6 *pmc,
-	int type, int gdeleted, int sdeleted)
+	int type, int gdeleted, int sdeleted, int crsend)
 {
 	struct inet6_dev *idev = pmc->idev;
 	struct net_device *dev = idev->dev;
@@ -1757,7 +1757,7 @@ empty_source:
 		if (type == MLD2_ALLOW_NEW_SOURCES ||
 		    type == MLD2_BLOCK_OLD_SOURCES)
 			return skb;
-		if (pmc->mca_crcount || isquery) {
+		if (pmc->mca_crcount || isquery || crsend) {
 			/* make sure we have room for group header */
 			if (skb && AVAILABLE(skb) < sizeof(struct mld2_grec)) {
 				mld_sendpack(skb);
@@ -1789,7 +1789,7 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc)
 				type = MLD2_MODE_IS_EXCLUDE;
 			else
 				type = MLD2_MODE_IS_INCLUDE;
-			skb = add_grec(skb, pmc, type, 0, 0);
+			skb = add_grec(skb, pmc, type, 0, 0, 0);
 			spin_unlock_bh(&pmc->mca_lock);
 		}
 	} else {
@@ -1798,7 +1798,7 @@ static void mld_send_report(struct inet6_dev *idev, struct ifmcaddr6 *pmc)
 			type = MLD2_MODE_IS_EXCLUDE;
 		else
 			type = MLD2_MODE_IS_INCLUDE;
-		skb = add_grec(skb, pmc, type, 0, 0);
+		skb = add_grec(skb, pmc, type, 0, 0, 0);
 		spin_unlock_bh(&pmc->mca_lock);
 	}
 	read_unlock_bh(&idev->lock);
@@ -1843,13 +1843,13 @@ static void mld_send_cr(struct inet6_dev *idev)
 		if (pmc->mca_sfmode == MCAST_INCLUDE) {
 			type = MLD2_BLOCK_OLD_SOURCES;
 			dtype = MLD2_BLOCK_OLD_SOURCES;
-			skb = add_grec(skb, pmc, type, 1, 0);
-			skb = add_grec(skb, pmc, dtype, 1, 1);
+			skb = add_grec(skb, pmc, type, 1, 0, 0);
+			skb = add_grec(skb, pmc, dtype, 1, 1, 0);
 		}
 		if (pmc->mca_crcount) {
 			if (pmc->mca_sfmode == MCAST_EXCLUDE) {
 				type = MLD2_CHANGE_TO_INCLUDE;
-				skb = add_grec(skb, pmc, type, 1, 0);
+				skb = add_grec(skb, pmc, type, 1, 0, 0);
 			}
 			pmc->mca_crcount--;
 			if (pmc->mca_crcount == 0) {
@@ -1880,8 +1880,8 @@ static void mld_send_cr(struct inet6_dev *idev)
 			type = MLD2_ALLOW_NEW_SOURCES;
 			dtype = MLD2_BLOCK_OLD_SOURCES;
 		}
-		skb = add_grec(skb, pmc, type, 0, 0);
-		skb = add_grec(skb, pmc, dtype, 0, 1);	/* deleted sources */
+		skb = add_grec(skb, pmc, type, 0, 0, 0);
+		skb = add_grec(skb, pmc, dtype, 0, 1, 0);	/* deleted sources */
 
 		/* filter mode changes */
 		if (pmc->mca_crcount) {
@@ -1889,7 +1889,7 @@ static void mld_send_cr(struct inet6_dev *idev)
 				type = MLD2_CHANGE_TO_EXCLUDE;
 			else
 				type = MLD2_CHANGE_TO_INCLUDE;
-			skb = add_grec(skb, pmc, type, 0, 0);
+			skb = add_grec(skb, pmc, type, 0, 0, 0);
 			pmc->mca_crcount--;
 		}
 		spin_unlock_bh(&pmc->mca_lock);
@@ -1997,27 +1997,36 @@ err_out:
 	goto out;
 }
 
-static void mld_resend_report(struct inet6_dev *idev)
+static void mld_send_initial_cr(struct inet6_dev *idev)
 {
-	if (mld_in_v1_mode(idev)) {
-		struct ifmcaddr6 *mcaddr;
-		read_lock_bh(&idev->lock);
-		for (mcaddr = idev->mc_list; mcaddr; mcaddr = mcaddr->next) {
-			if (!(mcaddr->mca_flags & MAF_NOREPORT))
-				igmp6_send(&mcaddr->mca_addr, idev->dev,
-					   ICMPV6_MGM_REPORT);
-		}
-		read_unlock_bh(&idev->lock);
-	} else {
-		mld_send_report(idev, NULL);
+	struct sk_buff *skb;
+	struct ifmcaddr6 *pmc;
+	int type;
+
+	if (mld_in_v1_mode(idev))
+		return;
+
+	skb = NULL;
+	read_lock_bh(&idev->lock);
+	for (pmc=idev->mc_list; pmc; pmc=pmc->next) {
+		spin_lock_bh(&pmc->mca_lock);
+		if (pmc->mca_sfcount[MCAST_EXCLUDE])
+			type = MLD2_CHANGE_TO_EXCLUDE;
+		else
+			type = MLD2_CHANGE_TO_INCLUDE;
+		skb = add_grec(skb, pmc, type, 0, 0, 1);
+		spin_unlock_bh(&pmc->mca_lock);
 	}
+	read_unlock_bh(&idev->lock);
+	if (skb)
+		mld_sendpack(skb);
 }
 
 void ipv6_mc_dad_complete(struct inet6_dev *idev)
 {
 	idev->mc_dad_count = idev->mc_qrv;
 	if (idev->mc_dad_count) {
-		mld_resend_report(idev);
+		mld_send_initial_cr(idev);
 		idev->mc_dad_count--;
 		if (idev->mc_dad_count)
 			mld_dad_start_timer(idev, idev->mc_maxdelay);
@@ -2028,7 +2037,7 @@ static void mld_dad_timer_expire(unsigned long data)
 {
 	struct inet6_dev *idev = (struct inet6_dev *)data;
 
-	mld_resend_report(idev);
+	mld_send_initial_cr(idev);
 	if (idev->mc_dad_count) {
 		idev->mc_dad_count--;
 		if (idev->mc_dad_count)
-- 
1.8.4.2

^ permalink raw reply related

* Re: [RFC PATCH net-next 3/3] virtio-net: Add accelerated RFS support
From: Ben Hutchings @ 2014-01-16 21:31 UTC (permalink / raw)
  To: Zhi Yong Wu; +Cc: netdev, therbert, edumazet, davem, Zhi Yong Wu
In-Reply-To: <1389795654-28381-4-git-send-email-zwu.kernel@gmail.com>

On Wed, 2014-01-15 at 22:20 +0800, Zhi Yong Wu wrote:
[...]
> +static int virtnet_init_rx_cpu_rmap(struct virtnet_info *vi)
> +{
> +	int rc = 0;
> +
> +#ifdef CONFIG_RFS_ACCEL
> +	struct virtio_device *vdev = vi->vdev;
> +	unsigned int irq;
> +	int i;
> +
> +	if (!vi->affinity_hint_set)
> +		goto out;
> +
> +	vi->dev->rx_cpu_rmap = alloc_irq_cpu_rmap(vi->max_queue_pairs);
> +	if (!vi->dev->rx_cpu_rmap) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < vi->max_queue_pairs; i++) {
> +		irq = virtqueue_get_vq_irq(vdev, vi->rq[i].vq);
> +		if (irq == -1)
> +			goto failed;

Jumping into an if-statement is confusing.  Also do you really want to
return 0 in this case?

Otherwise this looks fine.

Ben.

> +		rc = irq_cpu_rmap_add(vi->dev->rx_cpu_rmap, irq);
> +		if (rc) {
> +failed:
> +			virtnet_free_irq_cpu_rmap(vi);
> +			goto out;
> +		}
> +	}
> +out:
> +#endif
> +	return rc;
> +}
[...]

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox