Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v4 5/6] lib: Ensure EWMA does not store wrong intermediate values
From: Eric Dumazet @ 2014-01-16 20:08 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Michael S. Tsirkin, netdev, virtualization, Eric Dumazet,
	Ben Hutchings, David S. Miller
In-Reply-To: <1389901950-3854-5-git-send-email-mwdalton@google.com>

On Thu, 2014-01-16 at 11:52 -0800, Michael Dalton wrote:
> To ensure ewma_read() without a lock returns a valid but possibly
> out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
> intermediate wrong values from being written to avg->internal.
> 
> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Michael Dalton <mwdalton@google.com>
> ---
>  lib/average.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

^ permalink raw reply

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: Willy Tarreau @ 2014-01-16 20:07 UTC (permalink / raw)
  To: David Miller; +Cc: David.Laight, netdev, thomas.petazzoni, gregory.clement
In-Reply-To: <20140116.114905.1582218754259744747.davem@davemloft.net>

Hi David,

On Thu, Jan 16, 2014 at 11:49:05AM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 16 Jan 2014 10:36:40 +0100
> 
> > On Thu, Jan 16, 2014 at 09:14:00AM +0000, David Laight wrote:
> >> From: Of Willy Tarreau
> >> > calling dma_map_single()/dma_unmap_single() is quite expensive compared
> >> > to copying a small packet. So let's copy short frames and keep the buffers
> >> > mapped. We set the limit to 256 bytes which seems to give good results both
> >> > on the XP-GP board and on the AX3/4.
> >> 
> >> Have you tried something similar for transmits?
> >> Copying short tx into a preallocated memory area that is dma_mapped
> >> at driver initialisation?
> > 
> > Not yet by lack of time, but I intend to do so. Both the NIC and the driver
> > are capable of running at line rate (1.488 Mpps) when filling descriptors
> > directly, so many improvements are still possible.
> 
> I don't know if this is relevant for your hardware, but there is a trick
> with IOMMUs that I at least at one point was doing on sparc64.
> 
> I never flushed the IOMMU mappings explicitly on a dma_unmap call.
> 
> Instead I always allocate by moving the allocation cursor to higher
> addresses in the IOMMU range looking for free space.
> 
> Then when I hit the end of the range and had to wrap the allocation
> cursor back to the beginning, I flush the entire IOMMU.
> 
> So for %99 of IOMMU mapping creation and teardown calls, I didn't even
> touch the hardware.

Ah that's an interesting trick! We don't have an IOMMU on this platform
so the call is cheap. However, it relies on an I/O barrier to wait for
a cache snoop completion. From what I read, it's not needed when going
to the device. But when going to the CPU for the Rx case, we're calling
it for every packet which is counter productive because I'd like to do
it only once when entering the rx loop and avoid any other call during
the loop. I measured that I could save 300 ns per packet by doing this
(thus slightly modifying the DMA API to add a new dma_io_barrier function).

I think it could be interesting (at least for this platform, I don't know
if other platforms work with cache coherent DMA which only require to
verify end of snooping), to have two distinct set of DMA calls, something
like this :

    rx_loop(napi, budget)
    {
         dma_wait_for_snoop_completion(dev);
         ...
         for_each_rx_packet(pkt) {
             dma_sync_single_range_for_cpu_unless_completed(dev, pkt->addr);
             /* use packet */
         }
    }

The dma_wait_for_snoop_completion() call would call this I/O barrier on
platforms which provide one, and dma_sync_single_range_unless_completed()
would do the equivalent call only when the I/O barrier is not provided.
However I don't like this principle because it requires adding many new
entries to the struct dma_ops while we could avoid this by just adding
the first entry and changing all drivers at once to add the first call
before the loop. I'm just having a hard time believing that it's possible
anymore to change all drivers at once, especially when there are out of
tree drivers... So such a change would probably require to add a number
of new calls to the DMA API. I really don't know what's the best solution
would be in fact :-/

Thanks,
Willy

^ permalink raw reply

* Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes
From: Eric Dumazet @ 2014-01-16 20:00 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Michael S. Tsirkin, netdev, lf-virt, Eric Dumazet, Ben Hutchings,
	David S. Miller
In-Reply-To: <CANJ5vPJgQbNPKvbExvpZixKk-xXAkvF0ix0ELW=DpiV0QP5YTA@mail.gmail.com>

On Thu, 2014-01-16 at 11:51 -0800, Michael Dalton wrote:
> On Thu, Jan 16, 2014 Ben Hutchings <bhutchings@solarflare.com> wrote:
> > It's simpler but we don't know if it's faster (and I don't believe that
> > matters for the current usage).
> >
> > If one of these functions starts to be used in the data path, at that
> > point it could be worth optimising, e.g. by doing a test for queue 0 and
> > only then doing the pointer arithmetic with its implicit division.
> Good catch, my statement above is incorrect. We don't know if this change
> is a performance win, and it does introduce an implicit div. I agree this
> function is not currently performance critical and is used only by for
> reporting values to sysfs.

gcc usually do not emit a divide instruction to perform a divide by a
constant, but uses a reciprocal operation (basically a multiply and a
shift)

So I am pretty sure this is faster than the loop.

^ permalink raw reply

* Re: [PATCH v4 0/2] ipv6 addrconf: add IFA_F_NOPREFIXROUTE flag to suppress creation of IP6 routes
From: David Miller @ 2014-01-16 19:53 UTC (permalink / raw)
  To: hannes; +Cc: jiri, thaller, netdev, stephen, dcbw
In-Reply-To: <20140116133818.GF7436@order.stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 16 Jan 2014 14:38:18 +0100

> Just fyi, I noticed at least one other patch missing from the repos:
> http://patchwork.ozlabs.org/patch/308452/

Thanks for catching that, taking care of it now.

^ permalink raw reply

* [PATCH net-next v4 6/6] virtio-net: initial rx sysfs support, export mergeable rx buffer size
From: Michael Dalton @ 2014-01-16 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eric Dumazet, Rusty Russell, Michael S. Tsirkin,
	Jason Wang, Ben Hutchings, virtualization, Michael Dalton
In-Reply-To: <1389901950-3854-1-git-send-email-mwdalton@google.com>

Add initial support for per-rx queue sysfs attributes to virtio-net. If
mergeable packet buffers are enabled, adds a read-only mergeable packet
buffer size sysfs attribute for each RX queue.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v3->v4: Remove seqcount due to EWMA changes in patch 5.
        Add missing Suggested-By.

 drivers/net/virtio_net.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e82311..968eacd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -604,18 +604,25 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	return err;
 }
 
-static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+static unsigned int get_mergeable_buf_len(struct ewma *avg_pkt_len)
 {
 	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	unsigned int len;
+
+	len = hdr_len + clamp_t(unsigned int, ewma_read(avg_pkt_len),
+			GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	return ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+}
+
+static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
+{
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
 	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
-				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
-	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
 	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
 
@@ -1594,6 +1601,33 @@ err:
 	return ret;
 }
 
+#ifdef CONFIG_SYSFS
+static ssize_t mergeable_rx_buffer_size_show(struct netdev_rx_queue *queue,
+		struct rx_queue_attribute *attribute, char *buf)
+{
+	struct virtnet_info *vi = netdev_priv(queue->dev);
+	unsigned int queue_index = get_netdev_rx_queue_index(queue);
+	struct ewma *avg;
+
+	BUG_ON(queue_index >= vi->max_queue_pairs);
+	avg = &vi->rq[queue_index].mrg_avg_pkt_len;
+	return sprintf(buf, "%u\n", get_mergeable_buf_len(avg));
+}
+
+static struct rx_queue_attribute mergeable_rx_buffer_size_attribute =
+	__ATTR_RO(mergeable_rx_buffer_size);
+
+static struct attribute *virtio_net_mrg_rx_attrs[] = {
+	&mergeable_rx_buffer_size_attribute.attr,
+	NULL
+};
+
+static const struct attribute_group virtio_net_mrg_rx_group = {
+	.name = "virtio_net",
+	.attrs = virtio_net_mrg_rx_attrs
+};
+#endif
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
 	int i, err;
@@ -1708,6 +1742,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (err)
 		goto free_stats;
 
+#ifdef CONFIG_SYSFS
+	if (vi->mergeable_rx_bufs)
+		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
+#endif
 	netif_set_real_num_tx_queues(dev, vi->curr_queue_pairs);
 	netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v4 3/6] virtio-net: auto-tune mergeable rx buffer size for improved performance
From: Michael Dalton @ 2014-01-16 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Eric Dumazet, Rusty Russell, Michael S. Tsirkin,
	Jason Wang, Ben Hutchings, virtualization, Michael Dalton
In-Reply-To: <1389901950-3854-1-git-send-email-mwdalton@google.com>

Commit 2613af0ed18a ("virtio_net: migrate mergeable rx buffers to page frag
allocators") changed the mergeable receive buffer size from PAGE_SIZE to
MTU-size, introducing a single-stream regression for benchmarks with large
average packet size. There is no single optimal buffer size for all
workloads.  For workloads with packet size <= MTU bytes, MTU + virtio-net
header-sized buffers are preferred as larger buffers reduce the TCP window
due to SKB truesize. However, single-stream workloads with large average
packet sizes have higher throughput if larger (e.g., PAGE_SIZE) buffers
are used.

This commit auto-tunes the mergeable receiver buffer packet size by
choosing the packet buffer size based on an EWMA of the recent packet
sizes for the receive queue. Packet buffer sizes range from MTU_SIZE +
virtio-net header len to PAGE_SIZE. This improves throughput for
large packet workloads, as any workload with average packet size >=
PAGE_SIZE will use PAGE_SIZE buffers.

These optimizations interact positively with recent commit
ba275241030c ("virtio-net: coalesce rx frags when possible during rx"),
which coalesces adjacent RX SKB fragments in virtio_net. The coalescing
optimizations benefit buffers of any size.

Benchmarks taken from an average of 5 netperf 30-second TCP_STREAM runs
between two QEMU VMs on a single physical machine. Each VM has two VCPUs
with all offloads & vhost enabled. All VMs and vhost threads run in a
single 4 CPU cgroup cpuset, using cgroups to ensure that other processes
in the system will not be scheduled on the benchmark CPUs. Trunk includes
SKB rx frag coalescing.

net-next w/ virtio_net before 2613af0ed18a (PAGE_SIZE bufs): 14642.85Gb/s
net-next (MTU-size bufs):  13170.01Gb/s
net-next + auto-tune: 14555.94Gb/s

Jason Wang also reported a throughput increase on mlx4 from 22Gb/s
using MTU-sized buffers to about 26Gb/s using auto-tuning.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v2->v3: Remove per-receive queue metadata ring. Encode packet buffer
        base address and truesize into an unsigned long by requiring a
        minimum packet size alignment of 256. Permit attempts to fill
        an already-full RX ring (reverting the change in v2).
v1->v2: Add per-receive queue metadata ring to track precise truesize for
        mergeable receive buffers. Remove all truesize approximation. Never
        try to fill a full RX ring (required for metadata ring in v2).

 drivers/net/virtio_net.c | 99 ++++++++++++++++++++++++++++++++++++------------
 1 file changed, 74 insertions(+), 25 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 36cbf06..3e82311 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -26,6 +26,7 @@
 #include <linux/if_vlan.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/average.h>
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -36,11 +37,18 @@ module_param(gso, bool, 0444);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
-#define MERGE_BUFFER_LEN (ALIGN(GOOD_PACKET_LEN + \
-                                sizeof(struct virtio_net_hdr_mrg_rxbuf), \
-                                L1_CACHE_BYTES))
 #define GOOD_COPY_LEN	128
 
+/* Weight used for the RX packet size EWMA. The average packet size is used to
+ * determine the packet buffer size when refilling RX rings. As the entire RX
+ * ring may be refilled at once, the weight is chosen so that the EWMA will be
+ * insensitive to short-term, transient changes in packet size.
+ */
+#define RECEIVE_AVG_WEIGHT 64
+
+/* Minimum alignment for mergeable packet buffers. */
+#define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
+
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 struct virtnet_stats {
@@ -78,6 +86,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Average packet length for mergeable receive buffers. */
+	struct ewma mrg_avg_pkt_len;
+
 	/* Page frag for packet buffer allocation. */
 	struct page_frag alloc_frag;
 
@@ -219,6 +230,23 @@ static void skb_xmit_done(struct virtqueue *vq)
 	netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
+{
+	unsigned int truesize = mrg_ctx & (MERGEABLE_BUFFER_ALIGN - 1);
+	return truesize * MERGEABLE_BUFFER_ALIGN;
+}
+
+static void *mergeable_ctx_to_buf_address(unsigned long mrg_ctx)
+{
+	return (void *)(mrg_ctx & -MERGEABLE_BUFFER_ALIGN);
+
+}
+
+static unsigned long mergeable_buf_to_ctx(void *buf, unsigned int truesize)
+{
+	return (unsigned long)buf | (truesize / MERGEABLE_BUFFER_ALIGN);
+}
+
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct receive_queue *rq,
 				   struct page *page, unsigned int offset,
@@ -327,31 +355,33 @@ err:
 
 static struct sk_buff *receive_mergeable(struct net_device *dev,
 					 struct receive_queue *rq,
-					 void *buf,
+					 unsigned long ctx,
 					 unsigned int len)
 {
+	void *buf = mergeable_ctx_to_buf_address(ctx);
 	struct skb_vnet_hdr *hdr = buf;
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
+
 	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
 		goto err_skb;
-
 	while (--num_buf) {
 		int num_skb_frags;
 
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers out of %d missing\n",
 				 dev->name, num_buf, hdr->mhdr.num_buffers);
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
 
+		buf = mergeable_ctx_to_buf_address(ctx);
 		page = virt_to_head_page(buf);
 		--rq->num;
 
@@ -369,7 +399,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
-		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+		truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
@@ -386,19 +416,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		}
 	}
 
+	ewma_add(&rq->mrg_avg_pkt_len, head_skb->len);
 	return head_skb;
 
 err_skb:
 	put_page(page);
 	while (--num_buf) {
-		buf = virtqueue_get_buf(rq->vq, &len);
-		if (unlikely(!buf)) {
+		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len);
+		if (unlikely(!ctx)) {
 			pr_debug("%s: rx error: %d buffers missing\n",
 				 dev->name, num_buf);
 			dev->stats.rx_length_errors++;
 			break;
 		}
-		page = virt_to_head_page(buf);
+		page = virt_to_head_page(mergeable_ctx_to_buf_address(ctx));
 		put_page(page);
 		--rq->num;
 	}
@@ -419,17 +450,20 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
 	if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
 		pr_debug("%s: short packet %i\n", dev->name, len);
 		dev->stats.rx_length_errors++;
-		if (vi->mergeable_rx_bufs)
-			put_page(virt_to_head_page(buf));
-		else if (vi->big_packets)
+		if (vi->mergeable_rx_bufs) {
+			unsigned long ctx = (unsigned long)buf;
+			void *base = mergeable_ctx_to_buf_address(ctx);
+			put_page(virt_to_head_page(base));
+		} else if (vi->big_packets) {
 			give_pages(rq, buf);
-		else
+		} else {
 			dev_kfree_skb(buf);
+		}
 		return;
 	}
 
 	if (vi->mergeable_rx_bufs)
-		skb = receive_mergeable(dev, rq, buf, len);
+		skb = receive_mergeable(dev, rq, (unsigned long)buf, len);
 	else if (vi->big_packets)
 		skb = receive_big(dev, rq, buf, len);
 	else
@@ -572,25 +606,36 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
+	const size_t hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
 	struct page_frag *alloc_frag = &rq->alloc_frag;
 	char *buf;
+	unsigned long ctx;
 	int err;
 	unsigned int len, hole;
 
-	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
+	len = hdr_len + clamp_t(unsigned int, ewma_read(&rq->mrg_avg_pkt_len),
+				GOOD_PACKET_LEN, PAGE_SIZE - hdr_len);
+	len = ALIGN(len, MERGEABLE_BUFFER_ALIGN);
+	if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
 		return -ENOMEM;
+
 	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	ctx = mergeable_buf_to_ctx(buf, len);
 	get_page(alloc_frag->page);
-	len = MERGE_BUFFER_LEN;
 	alloc_frag->offset += len;
 	hole = alloc_frag->size - alloc_frag->offset;
-	if (hole < MERGE_BUFFER_LEN) {
+	if (hole < len) {
+		/* To avoid internal fragmentation, if there is very likely not
+		 * enough space for another buffer, add the remaining space to
+		 * the current buffer. This extra space is not included in
+		 * the truesize stored in ctx.
+		 */
 		len += hole;
 		alloc_frag->offset += hole;
 	}
 
 	sg_init_one(rq->sg, buf, len);
-	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
+	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, (void *)ctx, gfp);
 	if (err < 0)
 		put_page(virt_to_head_page(buf));
 
@@ -1394,12 +1439,15 @@ static void free_unused_bufs(struct virtnet_info *vi)
 		struct virtqueue *vq = vi->rq[i].vq;
 
 		while ((buf = virtqueue_detach_unused_buf(vq)) != NULL) {
-			if (vi->mergeable_rx_bufs)
-				put_page(virt_to_head_page(buf));
-			else if (vi->big_packets)
+			if (vi->mergeable_rx_bufs) {
+				unsigned long ctx = (unsigned long)buf;
+				void *base = mergeable_ctx_to_buf_address(ctx);
+				put_page(virt_to_head_page(base));
+			} else if (vi->big_packets) {
 				give_pages(&vi->rq[i], buf);
-			else
+			} else {
 				dev_kfree_skb(buf);
+			}
 			--vi->rq[i].num;
 		}
 		BUG_ON(vi->rq[i].num != 0);
@@ -1509,6 +1557,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
+		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
 		sg_init_table(vi->sq[i].sg, ARRAY_SIZE(vi->sq[i].sg));
 	}
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v4 5/6] lib: Ensure EWMA does not store wrong intermediate values
From: Michael Dalton @ 2014-01-16 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet, Ben Hutchings
In-Reply-To: <1389901950-3854-1-git-send-email-mwdalton@google.com>

To ensure ewma_read() without a lock returns a valid but possibly
out of date average, modify ewma_add() by using ACCESS_ONCE to prevent
intermediate wrong values from being written to avg->internal.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 lib/average.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/average.c b/lib/average.c
index 99a67e6..114d1be 100644
--- a/lib/average.c
+++ b/lib/average.c
@@ -53,8 +53,10 @@ EXPORT_SYMBOL(ewma_init);
  */
 struct ewma *ewma_add(struct ewma *avg, unsigned long val)
 {
-	avg->internal = avg->internal  ?
-		(((avg->internal << avg->weight) - avg->internal) +
+	unsigned long internal = ACCESS_ONCE(avg->internal);
+
+	ACCESS_ONCE(avg->internal) = internal ?
+		(((internal << avg->weight) - internal) +
 			(val << avg->factor)) >> avg->weight :
 		(val << avg->factor);
 	return avg;
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v4 4/6] net-sysfs: add support for device-specific rx queue sysfs attributes
From: Michael Dalton @ 2014-01-16 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet, Ben Hutchings
In-Reply-To: <1389901950-3854-1-git-send-email-mwdalton@google.com>

Extend existing support for netdevice receive queue sysfs attributes to
permit a device-specific attribute group. Initial use case for this
support will be to allow the virtio-net device to export per-receive
queue mergeable receive buffer size.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v3->v4: Simplify by removing loop in get_netdev_rx_queue_index.

 include/linux/netdevice.h | 35 +++++++++++++++++++++++++++++++----
 net/core/dev.c            | 12 ++++++------
 net/core/net-sysfs.c      | 33 ++++++++++++++++-----------------
 3 files changed, 53 insertions(+), 27 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c88ab1..38929bc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -668,15 +668,28 @@ extern struct rps_sock_flow_table __rcu *rps_sock_flow_table;
 bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index, u32 flow_id,
 			 u16 filter_id);
 #endif
+#endif /* CONFIG_RPS */
 
 /* This structure contains an instance of an RX queue. */
 struct netdev_rx_queue {
+#ifdef CONFIG_RPS
 	struct rps_map __rcu		*rps_map;
 	struct rps_dev_flow_table __rcu	*rps_flow_table;
+#endif
 	struct kobject			kobj;
 	struct net_device		*dev;
 } ____cacheline_aligned_in_smp;
-#endif /* CONFIG_RPS */
+
+/*
+ * RX queue sysfs structures and functions.
+ */
+struct rx_queue_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, char *buf);
+	ssize_t (*store)(struct netdev_rx_queue *queue,
+	    struct rx_queue_attribute *attr, const char *buf, size_t len);
+};
 
 #ifdef CONFIG_XPS
 /*
@@ -1313,7 +1326,7 @@ struct net_device {
 						   unicast) */
 
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	struct netdev_rx_queue	*_rx;
 
 	/* Number of RX queues allocated at register_netdev() time */
@@ -1424,6 +1437,8 @@ struct net_device {
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
 	const struct attribute_group *sysfs_groups[4];
+	/* space for optional per-rx queue attributes */
+	const struct attribute_group *sysfs_rx_queue_group;
 
 	/* rtnetlink link ops */
 	const struct rtnl_link_ops *rtnl_link_ops;
@@ -2374,7 +2389,7 @@ static inline bool netif_is_multiqueue(const struct net_device *dev)
 
 int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq);
 #else
 static inline int netif_set_real_num_rx_queues(struct net_device *dev,
@@ -2393,7 +2408,7 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 					   from_dev->real_num_tx_queues);
 	if (err)
 		return err;
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	return netif_set_real_num_rx_queues(to_dev,
 					    from_dev->real_num_rx_queues);
 #else
@@ -2401,6 +2416,18 @@ static inline int netif_copy_real_num_queues(struct net_device *to_dev,
 #endif
 }
 
+#ifdef CONFIG_SYSFS
+static inline unsigned int get_netdev_rx_queue_index(
+		struct netdev_rx_queue *queue)
+{
+	struct net_device *dev = queue->dev;
+	int index = queue - dev->_rx;
+
+	BUG_ON(index >= dev->num_rx_queues);
+	return index;
+}
+#endif
+
 #define DEFAULT_MAX_NUM_RSS_QUEUES	(8)
 int netif_get_num_default_rss_queues(void);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 20c834e..4be7931 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2080,7 +2080,7 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 }
 EXPORT_SYMBOL(netif_set_real_num_tx_queues);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 /**
  *	netif_set_real_num_rx_queues - set actual number of RX queues used
  *	@dev: Network device
@@ -5727,7 +5727,7 @@ void netif_stacked_transfer_operstate(const struct net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
 	unsigned int i, count = dev->num_rx_queues;
@@ -6272,7 +6272,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 		return NULL;
 	}
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	if (rxqs < 1) {
 		pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
 		return NULL;
@@ -6328,7 +6328,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (netif_alloc_netdev_queues(dev))
 		goto free_all;
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	dev->num_rx_queues = rxqs;
 	dev->real_num_rx_queues = rxqs;
 	if (netif_alloc_rx_queues(dev))
@@ -6348,7 +6348,7 @@ free_all:
 free_pcpu:
 	free_percpu(dev->pcpu_refcnt);
 	netif_free_tx_queues(dev);
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	kfree(dev->_rx);
 #endif
 
@@ -6373,7 +6373,7 @@ void free_netdev(struct net_device *dev)
 	release_net(dev_net(dev));
 
 	netif_free_tx_queues(dev);
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	kfree(dev->_rx);
 #endif
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 49843bf..0193ff3 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -498,17 +498,7 @@ static struct attribute_group wireless_group = {
 #define net_class_groups	NULL
 #endif /* CONFIG_SYSFS */
 
-#ifdef CONFIG_RPS
-/*
- * RX queue sysfs structures and functions.
- */
-struct rx_queue_attribute {
-	struct attribute attr;
-	ssize_t (*show)(struct netdev_rx_queue *queue,
-	    struct rx_queue_attribute *attr, char *buf);
-	ssize_t (*store)(struct netdev_rx_queue *queue,
-	    struct rx_queue_attribute *attr, const char *buf, size_t len);
-};
+#ifdef CONFIG_SYSFS
 #define to_rx_queue_attr(_attr) container_of(_attr,		\
     struct rx_queue_attribute, attr)
 
@@ -543,6 +533,7 @@ static const struct sysfs_ops rx_queue_sysfs_ops = {
 	.store = rx_queue_attr_store,
 };
 
+#ifdef CONFIG_RPS
 static ssize_t show_rps_map(struct netdev_rx_queue *queue,
 			    struct rx_queue_attribute *attribute, char *buf)
 {
@@ -718,16 +709,20 @@ static struct rx_queue_attribute rps_cpus_attribute =
 static struct rx_queue_attribute rps_dev_flow_table_cnt_attribute =
 	__ATTR(rps_flow_cnt, S_IRUGO | S_IWUSR,
 	    show_rps_dev_flow_table_cnt, store_rps_dev_flow_table_cnt);
+#endif /* CONFIG_RPS */
 
 static struct attribute *rx_queue_default_attrs[] = {
+#ifdef CONFIG_RPS
 	&rps_cpus_attribute.attr,
 	&rps_dev_flow_table_cnt_attribute.attr,
+#endif
 	NULL
 };
 
 static void rx_queue_release(struct kobject *kobj)
 {
 	struct netdev_rx_queue *queue = to_rx_queue(kobj);
+#ifdef CONFIG_RPS
 	struct rps_map *map;
 	struct rps_dev_flow_table *flow_table;
 
@@ -743,6 +738,7 @@ static void rx_queue_release(struct kobject *kobj)
 		RCU_INIT_POINTER(queue->rps_flow_table, NULL);
 		call_rcu(&flow_table->rcu, rps_dev_flow_table_release);
 	}
+#endif
 
 	memset(kobj, 0, sizeof(*kobj));
 	dev_put(queue->dev);
@@ -767,21 +763,27 @@ static int rx_queue_add_kobject(struct net_device *net, int index)
 		kobject_put(kobj);
 		return error;
 	}
+	if (net->sysfs_rx_queue_group)
+		sysfs_create_group(kobj, net->sysfs_rx_queue_group);
 
 	kobject_uevent(kobj, KOBJ_ADD);
 	dev_hold(queue->dev);
 
 	return error;
 }
-#endif /* CONFIG_RPS */
+#endif /* CONFIG_SYFS */
 
 int
 net_rx_queue_update_kobjects(struct net_device *net, int old_num, int new_num)
 {
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	int i;
 	int error = 0;
 
+#ifndef CONFIG_RPS
+	if (!net->sysfs_rx_queue_group)
+		return 0;
+#endif
 	for (i = old_num; i < new_num; i++) {
 		error = rx_queue_add_kobject(net, i);
 		if (error) {
@@ -1155,9 +1157,6 @@ static int register_queue_kobjects(struct net_device *net)
 	    NULL, &net->dev.kobj);
 	if (!net->queues_kset)
 		return -ENOMEM;
-#endif
-
-#ifdef CONFIG_RPS
 	real_rx = net->real_num_rx_queues;
 #endif
 	real_tx = net->real_num_tx_queues;
@@ -1184,7 +1183,7 @@ static void remove_queue_kobjects(struct net_device *net)
 {
 	int real_rx = 0, real_tx = 0;
 
-#ifdef CONFIG_RPS
+#ifdef CONFIG_SYSFS
 	real_rx = net->real_num_rx_queues;
 #endif
 	real_tx = net->real_num_tx_queues;
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v4 2/6] virtio-net: use per-receive queue page frag alloc for mergeable bufs
From: Michael Dalton @ 2014-01-16 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet, Ben Hutchings
In-Reply-To: <1389901950-3854-1-git-send-email-mwdalton@google.com>

The virtio-net driver currently uses netdev_alloc_frag() for GFP_ATOMIC
mergeable rx buffer allocations. This commit migrates virtio-net to use
per-receive queue page frags for GFP_ATOMIC allocation. This change unifies
mergeable rx buffer memory allocation, which now will use skb_refill_frag()
for both atomic and GFP-WAIT buffer allocations.

To address fragmentation concerns, if after buffer allocation there
is too little space left in the page frag to allocate a subsequent
buffer, the remaining space is added to the current allocated buffer
so that the remaining space can be used to store packet data.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
v1->v2: Use GFP_COLD for RX buffer allocations (as in netdev_alloc_frag()).
        Remove per-netdev GFP_KERNEL page_frag allocator.

 drivers/net/virtio_net.c | 69 ++++++++++++++++++++++++------------------------
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b17240..36cbf06 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -78,6 +78,9 @@ struct receive_queue {
 	/* Chain pages by the private ptr. */
 	struct page *pages;
 
+	/* Page frag for packet buffer allocation. */
+	struct page_frag alloc_frag;
+
 	/* RX: fragments + linear part + virtio header */
 	struct scatterlist sg[MAX_SKB_FRAGS + 2];
 
@@ -126,11 +129,6 @@ struct virtnet_info {
 	/* Lock for config space updates */
 	struct mutex config_lock;
 
-	/* Page_frag for GFP_KERNEL packet buffer allocation when we run
-	 * low on memory.
-	 */
-	struct page_frag alloc_frag;
-
 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;
 
@@ -336,8 +334,8 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	int num_buf = hdr->mhdr.num_buffers;
 	struct page *page = virt_to_head_page(buf);
 	int offset = buf - page_address(page);
-	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len,
-					       MERGE_BUFFER_LEN);
+	unsigned int truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
+	struct sk_buff *head_skb = page_to_skb(rq, page, offset, len, truesize);
 	struct sk_buff *curr_skb = head_skb;
 
 	if (unlikely(!curr_skb))
@@ -353,11 +351,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			dev->stats.rx_length_errors++;
 			goto err_buf;
 		}
-		if (unlikely(len > MERGE_BUFFER_LEN)) {
-			pr_debug("%s: rx error: merge buffer too long\n",
-				 dev->name);
-			len = MERGE_BUFFER_LEN;
-		}
 
 		page = virt_to_head_page(buf);
 		--rq->num;
@@ -376,19 +369,20 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			head_skb->truesize += nskb->truesize;
 			num_skb_frags = 0;
 		}
+		truesize = max_t(unsigned int, len, MERGE_BUFFER_LEN);
 		if (curr_skb != head_skb) {
 			head_skb->data_len += len;
 			head_skb->len += len;
-			head_skb->truesize += MERGE_BUFFER_LEN;
+			head_skb->truesize += truesize;
 		}
 		offset = buf - page_address(page);
 		if (skb_can_coalesce(curr_skb, num_skb_frags, page, offset)) {
 			put_page(page);
 			skb_coalesce_rx_frag(curr_skb, num_skb_frags - 1,
-					     len, MERGE_BUFFER_LEN);
+					     len, truesize);
 		} else {
 			skb_add_rx_frag(curr_skb, num_skb_frags, page,
-					offset, len, MERGE_BUFFER_LEN);
+					offset, len, truesize);
 		}
 	}
 
@@ -578,25 +572,24 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 
 static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
 {
-	struct virtnet_info *vi = rq->vq->vdev->priv;
-	char *buf = NULL;
+	struct page_frag *alloc_frag = &rq->alloc_frag;
+	char *buf;
 	int err;
+	unsigned int len, hole;
 
-	if (gfp & __GFP_WAIT) {
-		if (skb_page_frag_refill(MERGE_BUFFER_LEN, &vi->alloc_frag,
-					 gfp)) {
-			buf = (char *)page_address(vi->alloc_frag.page) +
-			      vi->alloc_frag.offset;
-			get_page(vi->alloc_frag.page);
-			vi->alloc_frag.offset += MERGE_BUFFER_LEN;
-		}
-	} else {
-		buf = netdev_alloc_frag(MERGE_BUFFER_LEN);
-	}
-	if (!buf)
+	if (unlikely(!skb_page_frag_refill(MERGE_BUFFER_LEN, alloc_frag, gfp)))
 		return -ENOMEM;
+	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+	get_page(alloc_frag->page);
+	len = MERGE_BUFFER_LEN;
+	alloc_frag->offset += len;
+	hole = alloc_frag->size - alloc_frag->offset;
+	if (hole < MERGE_BUFFER_LEN) {
+		len += hole;
+		alloc_frag->offset += hole;
+	}
 
-	sg_init_one(rq->sg, buf, MERGE_BUFFER_LEN);
+	sg_init_one(rq->sg, buf, len);
 	err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);
 	if (err < 0)
 		put_page(virt_to_head_page(buf));
@@ -617,6 +610,7 @@ static bool try_fill_recv(struct receive_queue *rq, gfp_t gfp)
 	int err;
 	bool oom;
 
+	gfp |= __GFP_COLD;
 	do {
 		if (vi->mergeable_rx_bufs)
 			err = add_recvbuf_mergeable(rq, gfp);
@@ -1377,6 +1371,14 @@ static void free_receive_bufs(struct virtnet_info *vi)
 	}
 }
 
+static void free_receive_page_frags(struct virtnet_info *vi)
+{
+	int i;
+	for (i = 0; i < vi->max_queue_pairs; i++)
+		if (vi->rq[i].alloc_frag.page)
+			put_page(vi->rq[i].alloc_frag.page);
+}
+
 static void free_unused_bufs(struct virtnet_info *vi)
 {
 	void *buf;
@@ -1705,9 +1707,8 @@ free_recv_bufs:
 	unregister_netdev(dev);
 free_vqs:
 	cancel_delayed_work_sync(&vi->refill);
+	free_receive_page_frags(vi);
 	virtnet_del_vqs(vi);
-	if (vi->alloc_frag.page)
-		put_page(vi->alloc_frag.page);
 free_stats:
 	free_percpu(vi->stats);
 free:
@@ -1724,6 +1725,8 @@ static void remove_vq_common(struct virtnet_info *vi)
 
 	free_receive_bufs(vi);
 
+	free_receive_page_frags(vi);
+
 	virtnet_del_vqs(vi);
 }
 
@@ -1741,8 +1744,6 @@ static void virtnet_remove(struct virtio_device *vdev)
 	unregister_netdev(vi->dev);
 
 	remove_vq_common(vi);
-	if (vi->alloc_frag.page)
-		put_page(vi->alloc_frag.page);
 
 	flush_work(&vi->config_work);
 
-- 
1.8.5.2

^ permalink raw reply related

* [PATCH net-next v4 1/6] net: allow > 0 order atomic page alloc in skb_page_frag_refill
From: Michael Dalton @ 2014-01-16 19:52 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet, Ben Hutchings

skb_page_frag_refill currently permits only order-0 page allocs
unless GFP_WAIT is used. Change skb_page_frag_refill to attempt
higher-order page allocations whether or not GFP_WAIT is used. If
memory cannot be allocated, the allocator will fall back to
successively smaller page allocs (down to order-0 page allocs).

This change brings skb_page_frag_refill in line with the existing
page allocation strategy employed by netdev_alloc_frag, which attempts
higher-order page allocations whether or not GFP_WAIT is set, falling
back to successively lower-order page allocations on failure. Part
of migration of virtio-net to per-receive queue page frag allocators.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 net/core/sock.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 85ad6f0..b3f7ee3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1836,9 +1836,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		put_page(pfrag->page);
 	}

-	/* We restrict high order allocations to users that can afford to wait */
-	order = (prio & __GFP_WAIT) ? SKB_FRAG_PAGE_ORDER : 0;
-
+	order = SKB_FRAG_PAGE_ORDER;
 	do {
 		gfp_t gfp = prio;

-- 
1.8.5.2

^ permalink raw reply related

* Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes
From: Michael Dalton @ 2014-01-16 19:51 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Michael S. Tsirkin, netdev, lf-virt, Eric Dumazet,
	David S. Miller
In-Reply-To: <1389901352.11912.73.camel@bwh-desktop.uk.level5networks.com>

On Thu, Jan 16, 2014 Ben Hutchings <bhutchings@solarflare.com> wrote:
> It's simpler but we don't know if it's faster (and I don't believe that
> matters for the current usage).
>
> If one of these functions starts to be used in the data path, at that
> point it could be worth optimising, e.g. by doing a test for queue 0 and
> only then doing the pointer arithmetic with its implicit division.
Good catch, my statement above is incorrect. We don't know if this change
is a performance win, and it does introduce an implicit div. I agree this
function is not currently performance critical and is used only by for
reporting values to sysfs.

Best,

Mike

^ permalink raw reply

* Re: [PATCHv3 net-next 1/2] flowcache: Make flow cache name space aware
From: David Miller @ 2014-01-16 19:50 UTC (permalink / raw)
  To: steffen.klassert; +Cc: fan.du, netdev
In-Reply-To: <20140116101946.GM31491@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Thu, 16 Jan 2014 11:19:46 +0100

> So I would not mind to move this into IPsec, but I want to see an ack
> from David before I take this into ipsec-next.

No objections from me.

And longer term I don't even think the flow cache makes sense as the
most appropriate data structure for fast IPSEC lookups either.

Anything that requires GC has fundamental issues.

^ permalink raw reply

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: David Miller @ 2014-01-16 19:49 UTC (permalink / raw)
  To: w; +Cc: David.Laight, netdev, thomas.petazzoni, gregory.clement
In-Reply-To: <20140116093640.GA16457@1wt.eu>

From: Willy Tarreau <w@1wt.eu>
Date: Thu, 16 Jan 2014 10:36:40 +0100

> On Thu, Jan 16, 2014 at 09:14:00AM +0000, David Laight wrote:
>> From: Of Willy Tarreau
>> > calling dma_map_single()/dma_unmap_single() is quite expensive compared
>> > to copying a small packet. So let's copy short frames and keep the buffers
>> > mapped. We set the limit to 256 bytes which seems to give good results both
>> > on the XP-GP board and on the AX3/4.
>> 
>> Have you tried something similar for transmits?
>> Copying short tx into a preallocated memory area that is dma_mapped
>> at driver initialisation?
> 
> Not yet by lack of time, but I intend to do so. Both the NIC and the driver
> are capable of running at line rate (1.488 Mpps) when filling descriptors
> directly, so many improvements are still possible.

I don't know if this is relevant for your hardware, but there is a trick
with IOMMUs that I at least at one point was doing on sparc64.

I never flushed the IOMMU mappings explicitly on a dma_unmap call.

Instead I always allocate by moving the allocation cursor to higher
addresses in the IOMMU range looking for free space.

Then when I hit the end of the range and had to wrap the allocation
cursor back to the beginning, I flush the entire IOMMU.

So for %99 of IOMMU mapping creation and teardown calls, I didn't even
touch the hardware.

^ permalink raw reply

* Re: [PATCH net-next v4 1/2] dts: Add a binding for Synopsys emac max-frame-size
From: Ben Hutchings @ 2014-01-16 19:45 UTC (permalink / raw)
  To: Rob Herring
  Cc: Vince Bridgers, devicetree@vger.kernel.org, netdev,
	Giuseppe CAVALLARO, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, Dinh Nguyen, Rayagond Kokatanur
In-Reply-To: <CAL_Jsq+81QtmC9dtKRU1b427Y0OjhyxZZ1gZbsvk9WWutM4TVQ@mail.gmail.com>

On Thu, 2014-01-16 at 13:29 -0600, Rob Herring wrote:
> On Thu, Jan 16, 2014 at 12:48 PM, Vince Bridgers
> <vbridgers2013@gmail.com> wrote:
> > On Thu, Jan 16, 2014 at 11:50 AM, Ben Hutchings
> > <bhutchings@solarflare.com> wrote:
> >> On Thu, 2014-01-16 at 08:05 -0600, Vince Bridgers wrote:
> >>> This change adds a parameter for the Synopsys 10/100/1000
> >>> stmmac Ethernet driver to configure the maximum frame
> >>> size supported by the EMAC driver. Synopsys allows the FIFO
> >>> sizes to be configured when the cores are built for a particular
> >>> device, but do not provide a way for the driver to read
> >>> information from the device about the maximum MTU size
> >>> supported as limited by the device's FIFO size.
> >>>
> >>> Signed-off-by: Vince Bridgers <vbridgers2013@gmail.com>
> >>> ---
> >>> V4: add comments to explain use of max-frame-size with respect
> >>>     to inconsistent definition and use in the ePAPR v1.1 spec
> >>
> >> Well, ePAPR does not seem to be consistent with itself. :-)
> >>
> >>> V3: change snps,max-frame-size to max-frame-size
> >>> V2: change snps,max-mtu to snps,max-frame-size
> >>> ---
> >>>  Documentation/devicetree/bindings/net/stmmac.txt |   13 ++++++++++++-
> >>>  1 file changed, 12 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt
> >>> index eba0e5e..d553be2 100644
> >>> --- a/Documentation/devicetree/bindings/net/stmmac.txt
> >>> +++ b/Documentation/devicetree/bindings/net/stmmac.txt
> >>> @@ -29,7 +29,17 @@ Required properties:
> >>>                               ignored if force_thresh_dma_mode is set.
> >>>
> >>>  Optional properties:
> >>> -- mac-address: 6 bytes, mac address
> >>> +- mac-address:               6 bytes, mac address
> >>> +- max-frame-size:    Maximum frame size permitted. This parameter is useful
> >>> +                     since different implementations of the Synopsys MAC may
> >>> +                     have different FIFO sizes depending on the selections
> >>> +                     made in Synopsys Core Consultant. Note that the usage
> >>> +                     is inconsistent with the definition in the ePAPR v1.1
> >>> +                     specification, as it defines max-frame-size inclusive
> >>> +                     of the MAC DA, SA, Ethertype and CRC while usage is
> >>> +                     consistent with the IEEE definition of MAC Client
> >>> +                     Data, which is sans the MAC DA, SA, Ethertype and
> >>> +                     CRC.
> >> [...]
> >>
> >> While this is very precise, I fear that it is now so verbose that it
> >> actually becomes confusing.  Can this not be condensed to 'the maximum
> >> MTU and MRU, rather than the maximum Ethernet frame size'?
> >>
> >> Ben.
> >
> > Sure, I'll cut this down and resubmit as V5. How about
> >
> > "The Maximum Transfer Unit (IEEE defined MTU), rather than the maximum
> > frame size."
> 
> If you are not following the ePAPR definition, then call the property
> something else rather than relying on the documentation to explain the
> different meaning. Why can't the driver adjust the size from the ePAPR
> definition? Otherwise, I would go back to the v2 name of
> "snps,max-mtu".

ePAPR *doesn't have* a clear definition:
http://article.gmane.org/gmane.linux.network/300098

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 0/3] Netfilter updates for net-next
From: David Miller @ 2014-01-16 19:45 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1389863400-4863-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Thu, 16 Jan 2014 10:10:00 +0100

> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables.git master

Yes, that works a lot better.

Pulled, thank you.

^ permalink raw reply

* Re: [PATCH 0/3] Netfilter updates for net-next
From: David Miller @ 2014-01-16 19:44 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1389863093-4530-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Thu, 16 Jan 2014 10:04:50 +0100

>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

I think you didn't push your changes there:

git pull --no-ff git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git
>From git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
 * branch            HEAD       -> FETCH_HEAD
Already up-to-date.

^ permalink raw reply

* Re: [PATCH net-next v3 4/5] net-sysfs: add support for device-specific rx queue sysfs attributes
From: Ben Hutchings @ 2014-01-16 19:42 UTC (permalink / raw)
  To: Michael Dalton
  Cc: Michael S. Tsirkin, netdev, lf-virt, Eric Dumazet,
	David S. Miller
In-Reply-To: <CANJ5vP+nDgi7vWOd=RSuZ=NS5Z_txEkEsoYYm_33sFPh6rr=gg@mail.gmail.com>

On Thu, 2014-01-16 at 11:07 -0800, Michael Dalton wrote:
> On Jan 16, 2014 at 10:57 AM, Ben Hutchings <bhutchings@solarflare.com> wrote:
> > Why write a loop when you can do:
> >         i = queue - dev->_rx;
> Good point, the loop approach was done in get_netdev_queue_index --
> I agree your fix is faster and simpler. I'll fix in next patchset.

It's simpler but we don't know if it's faster (and I don't believe that
matters for the current usage).

If one of these functions starts to be used in the data path, at that
point it could be worth optimising, e.g. by doing a test for queue 0 and
only then doing the pointer arithmetic with its implicit division.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH v2 1/2] net/mlx4_core: clean up cq_res_start_move_to()
From: David Miller @ 2014-01-16 19:39 UTC (permalink / raw)
  To: jackm; +Cc: pebolle, ogerlitz, ronye, hadarh, netdev, linux-kernel
In-Reply-To: <20140116094639.6fe72bcf@jpm-OptiPlex-GX620>

From: Jack Morgenstein <jackm@dev.mellanox.co.il>
Date: Thu, 16 Jan 2014 09:46:39 +0200

> ACK.  OK.

This is not the correct way to ack a patch.

First of all, you should not top post.  Instead you should quote
the relevant parts of the email you are replying to, and then add
your new content underneath.

In this circumstance, the "relevant parts" are just the commit log
message from the patch submitter.  There is no reason ever to quote
the patch itself unless you are making comments on specific parts.

And you specify your ACK using a properly formed "Acked-by: "
line.

Please look at how other reviewers ACK patches.

Thank you.

^ permalink raw reply

* Re: [PATCH-next v2] netfilter: don't use module_init/exit in core IPV4 code
From: Paul Gortmaker @ 2014-01-16 19:38 UTC (permalink / raw)
  To: Benjamin Poirier
  Cc: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	David S. Miller, netfilter-devel, netdev
In-Reply-To: <20140116192613.GA23133@d2.synalogic.ca>

On 14-01-16 02:26 PM, Benjamin Poirier wrote:
> On 2014/01/15 15:57, Paul Gortmaker wrote:
>> The file net/ipv4/netfilter.o is created based on whether
>> CONFIG_NETFILTER is set.  However that is defined as a bool, and
>> hence this file with the core netfilter hooks will never be
>> modular.  So using module_init as an alias for __initcall can be
>> somewhat misleading.
>>
>> Fix this up now, so that we can relocate module_init from
>> init.h into module.h in the future.  If we don't do this, we'd
>> have to add module.h to obviously non-modular code, and that
>> would be a worse thing.  Also add an inclusion of init.h, as
>> that was previously implicit here in the netfilter.c file.
>>
>> Note that direct use of __initcall is discouraged, vs. one
>> of the priority categorized subgroups.  As __initcall gets
>> mapped onto device_initcall, our use of subsys_initcall (which
>> seems to make sense for netfilter code) will thus change this
>> registration from level 6-device to level 4-subsys (i.e. slightly
>> earlier).  However no observable impact of that small difference
>> has been observed during testing, or is expected. (i.e. the
>> location of the netfilter messages in dmesg remains unchanged
>> with respect to all the other surrounding messages.)
>>
>> As for the module_exit, rather than replace it with __exitcall,
>> we simply remove it, since it appears only UML does anything
>> with those, and even for UML, there is no relevant cleanup
>> to be done here.
>>
>> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
>> ---
>>
>> [v2: Drop __exitcall stuff completely, as per Eric's suggestion
>>  given for patch at  http://patchwork.ozlabs.org/patch/311164/ ]
>>
>>  net/ipv4/netfilter.c | 9 +--------
>>  1 file changed, 1 insertion(+), 8 deletions(-)
>>
>> diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
>> index c3e0adea9c27..31abf9636ba7 100644
>> --- a/net/ipv4/netfilter.c
>> +++ b/net/ipv4/netfilter.c
>> @@ -197,11 +197,4 @@ static int __init ipv4_netfilter_init(void)
>>  {
>>  	return nf_register_afinfo(&nf_ip_afinfo);
>>  }
>> -
>> -static void __exit ipv4_netfilter_fini(void)
>> -{
>> -	nf_unregister_afinfo(&nf_ip_afinfo);
>> -}
>> -
>> -module_init(ipv4_netfilter_init);
>> -module_exit(ipv4_netfilter_fini);
>> +device_initcall(ipv4_netfilter_init);
> 
> device_ vs. subsys_initcall changed from v1 of the patch and no longer
> matches the description I think.

Thanks for spotting that ; not sure how I fat fingered that
during the exitcall removal.  Trying to rush things I guess.

P.
--

> 
>> -- 
>> 1.8.5.2
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [patch net-next] neigh: use NEIGH_VAR_INIT in ndo_neigh_setup functions.
From: David Miller @ 2014-01-16 19:37 UTC (permalink / raw)
  To: jiri; +Cc: netdev, jes
In-Reply-To: <20140116071519.GA2815@minipsycho.orion>

From: Jiri Pirko <jiri@resnulli.us>
Date: Thu, 16 Jan 2014 08:15:19 +0100

> Mon, Jan 13, 2014 at 08:35:38PM CET, davem@davemloft.net wrote:
>>From: Jiri Pirko <jiri@resnulli.us>
>>Date: Thu,  9 Jan 2014 14:13:47 +0100
>>
>>> When ndo_neigh_setup is called, the bitfield used by NEIGH_VAR_SET is
>>> not initialized yet. This might cause confusion for the people who use
>>> NEIGH_VAR_SET in ndo_neigh_setup. So rather introduce NEIGH_VAR_INIT for
>>> usage in ndo_neigh_setup.
>>> 
>>> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>>
>>Wouldn't it be better to move the neigh_parms_data_state_cleanall() call
>>before we invoke ->ndo_neigh_setup()?  It seems that this code intended
>>to work that way, no?
> 
> Even moving neigh_parms_data_state_cleanall before ndo_neigh_setup
> would not solve this. In ndo_neigh_setup the original default value is
> changed. If it is changed by NEIGH_VAR_SET, it touches data_state bit
> and it looks as if it was changed by user. That is not desired because
> default value change would be ignored for this device. Therefore there
> is need for NEIGH_VAR_INIT which does not touch data_state bit.

That makes sense, applied, thanks Jiri.

^ permalink raw reply

* Re: [PATCH net-next 2/2] r6040: use ETH_ZLEN instead of MISR for SKB length checking
From: Florian Fainelli @ 2014-01-16 19:30 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Laight, Ben Hutchings, netdev@vger.kernel.org,
	davem@davemloft.net
In-Reply-To: <1389893604.11912.47.camel@bwh-desktop.uk.level5networks.com>

2014/1/16 Ben Hutchings <bhutchings@solarflare.com>:
> On Thu, 2014-01-16 at 09:10 +0000, David Laight wrote:
>> From: Ben Hutchings
>> > On Wed, 2014-01-15 at 13:04 -0800, Florian Fainelli wrote:
>> > > Ever since this driver was merged the following code was included:
>> ...
>> > > - if (skb->len < MISR)
>> > > -         descptr->len = MISR;
>> > > + if (skb->len < ETH_ZLEN)
>> > > +         descptr->len = ETH_ZLEN;
>> >
>> > It looks like this is just increasing the TX descriptor length so the
>> > packet is tail-padded with whatever happens to come next in the skb.
>> > This is absolutely incorrect behaviour and may leak sensitive
>> > information.
>>
>> And possibly page fault if the data is right at the end of a page.
>
> If the CPU was accessing it, it would be fine as the skb linear area
> ends with struct skb_shared_info.  (sfc actually takes advantage of that
> in efx_enqueue_skb_pio() which can add padding (*not* transmitted) to
> ensure that the CPU does write-combining.)
>
> But the NIC, if it's connected through an IOMMU, might get a page fault
> due to the incorrect length of the DMA mapping.
>
>> > Presumably it is necessary to pad the frame because the MAC is too lame
>> > to do it, and the correct way to do that is using skb_padto(skb,
>> > ETH_ZLEN).  But this may fail as it might have to allocate memory
>>
>> Alternatively use two ring entries with the 'more' bit set on
>> the first one and transmit the padding from a permanently allocated
>> and dma-mapped block of zeros.
>> Assuming the hardware support SG transmit and doesn't have a
>> constraint on the length of the first fragment.
>
> Do you see NETIF_F_SG in this driver?

No this driver does not it, your fix padding the SKB to max(skb->len,
ETH_ZLEN) looks good to me. Could you submit these as two separate
patches?

Thanks!
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v4 1/2] dts: Add a binding for Synopsys emac max-frame-size
From: Rob Herring @ 2014-01-16 19:29 UTC (permalink / raw)
  To: Vince Bridgers
  Cc: Ben Hutchings, devicetree@vger.kernel.org, netdev,
	Giuseppe CAVALLARO, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, Dinh Nguyen, Rayagond Kokatanur
In-Reply-To: <CAOwfj2N2p_B-Wq0DTzt76RAbqCu5O088wxWiLEvHCc0qusDq-g@mail.gmail.com>

On Thu, Jan 16, 2014 at 12:48 PM, Vince Bridgers
<vbridgers2013@gmail.com> wrote:
> On Thu, Jan 16, 2014 at 11:50 AM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
>> On Thu, 2014-01-16 at 08:05 -0600, Vince Bridgers wrote:
>>> This change adds a parameter for the Synopsys 10/100/1000
>>> stmmac Ethernet driver to configure the maximum frame
>>> size supported by the EMAC driver. Synopsys allows the FIFO
>>> sizes to be configured when the cores are built for a particular
>>> device, but do not provide a way for the driver to read
>>> information from the device about the maximum MTU size
>>> supported as limited by the device's FIFO size.
>>>
>>> Signed-off-by: Vince Bridgers <vbridgers2013@gmail.com>
>>> ---
>>> V4: add comments to explain use of max-frame-size with respect
>>>     to inconsistent definition and use in the ePAPR v1.1 spec
>>
>> Well, ePAPR does not seem to be consistent with itself. :-)
>>
>>> V3: change snps,max-frame-size to max-frame-size
>>> V2: change snps,max-mtu to snps,max-frame-size
>>> ---
>>>  Documentation/devicetree/bindings/net/stmmac.txt |   13 ++++++++++++-
>>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/devicetree/bindings/net/stmmac.txt b/Documentation/devicetree/bindings/net/stmmac.txt
>>> index eba0e5e..d553be2 100644
>>> --- a/Documentation/devicetree/bindings/net/stmmac.txt
>>> +++ b/Documentation/devicetree/bindings/net/stmmac.txt
>>> @@ -29,7 +29,17 @@ Required properties:
>>>                               ignored if force_thresh_dma_mode is set.
>>>
>>>  Optional properties:
>>> -- mac-address: 6 bytes, mac address
>>> +- mac-address:               6 bytes, mac address
>>> +- max-frame-size:    Maximum frame size permitted. This parameter is useful
>>> +                     since different implementations of the Synopsys MAC may
>>> +                     have different FIFO sizes depending on the selections
>>> +                     made in Synopsys Core Consultant. Note that the usage
>>> +                     is inconsistent with the definition in the ePAPR v1.1
>>> +                     specification, as it defines max-frame-size inclusive
>>> +                     of the MAC DA, SA, Ethertype and CRC while usage is
>>> +                     consistent with the IEEE definition of MAC Client
>>> +                     Data, which is sans the MAC DA, SA, Ethertype and
>>> +                     CRC.
>> [...]
>>
>> While this is very precise, I fear that it is now so verbose that it
>> actually becomes confusing.  Can this not be condensed to 'the maximum
>> MTU and MRU, rather than the maximum Ethernet frame size'?
>>
>> Ben.
>
> Sure, I'll cut this down and resubmit as V5. How about
>
> "The Maximum Transfer Unit (IEEE defined MTU), rather than the maximum
> frame size."

If you are not following the ePAPR definition, then call the property
something else rather than relying on the documentation to explain the
different meaning. Why can't the driver adjust the size from the ePAPR
definition? Otherwise, I would go back to the v2 name of
"snps,max-mtu".

Rob

^ permalink raw reply

* Re: [PATCH-next v2] netfilter: don't use module_init/exit in core IPV4 code
From: Benjamin Poirier @ 2014-01-16 19:26 UTC (permalink / raw)
  To: Paul Gortmaker
  Cc: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	David S. Miller, netfilter-devel, netdev
In-Reply-To: <1389819427-20199-1-git-send-email-paul.gortmaker@windriver.com>

On 2014/01/15 15:57, Paul Gortmaker wrote:
> The file net/ipv4/netfilter.o is created based on whether
> CONFIG_NETFILTER is set.  However that is defined as a bool, and
> hence this file with the core netfilter hooks will never be
> modular.  So using module_init as an alias for __initcall can be
> somewhat misleading.
> 
> Fix this up now, so that we can relocate module_init from
> init.h into module.h in the future.  If we don't do this, we'd
> have to add module.h to obviously non-modular code, and that
> would be a worse thing.  Also add an inclusion of init.h, as
> that was previously implicit here in the netfilter.c file.
> 
> Note that direct use of __initcall is discouraged, vs. one
> of the priority categorized subgroups.  As __initcall gets
> mapped onto device_initcall, our use of subsys_initcall (which
> seems to make sense for netfilter code) will thus change this
> registration from level 6-device to level 4-subsys (i.e. slightly
> earlier).  However no observable impact of that small difference
> has been observed during testing, or is expected. (i.e. the
> location of the netfilter messages in dmesg remains unchanged
> with respect to all the other surrounding messages.)
> 
> As for the module_exit, rather than replace it with __exitcall,
> we simply remove it, since it appears only UML does anything
> with those, and even for UML, there is no relevant cleanup
> to be done here.
> 
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
> ---
> 
> [v2: Drop __exitcall stuff completely, as per Eric's suggestion
>  given for patch at  http://patchwork.ozlabs.org/patch/311164/ ]
> 
>  net/ipv4/netfilter.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
> index c3e0adea9c27..31abf9636ba7 100644
> --- a/net/ipv4/netfilter.c
> +++ b/net/ipv4/netfilter.c
> @@ -197,11 +197,4 @@ static int __init ipv4_netfilter_init(void)
>  {
>  	return nf_register_afinfo(&nf_ip_afinfo);
>  }
> -
> -static void __exit ipv4_netfilter_fini(void)
> -{
> -	nf_unregister_afinfo(&nf_ip_afinfo);
> -}
> -
> -module_init(ipv4_netfilter_init);
> -module_exit(ipv4_netfilter_fini);
> +device_initcall(ipv4_netfilter_init);

device_ vs. subsys_initcall changed from v1 of the patch and no longer
matches the description I think.

> -- 
> 1.8.5.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net: introduce SO_BPF_EXTENSIONS
From: Daniel Borkmann @ 2014-01-16 19:24 UTC (permalink / raw)
  To: Michal Sekletar; +Cc: netdev, Michael Kerrisk, David Miller, Eric Dumazet
In-Reply-To: <52D82626.5060701@redhat.com>

On 01/16/2014 07:34 PM, Daniel Borkmann wrote:
> On 01/16/2014 06:26 PM, Michal Sekletar wrote:
>> On Thu, Jan 16, 2014 at 05:41:46PM +0100, Daniel Borkmann wrote:
>>> On 01/16/2014 05:14 PM, Michal Sekletar wrote:
>>>> userspace packet capturing libraries (e.g. libpcap) don't have a way how to find
>>>> out which BPF extensions are supported by the kernel, except compiling a filter
>>>> which uses extensions and trying to apply filter to the socket. This is very
>>>> inconvenient.
>>>>
>>>> Therefore this commit introduces new option which can be used as an argument for
>>>> getsockopt() and allows one to obtain information about which BPF extensions are
>>>> supported by the kernel.
>>>>
>>>> On Tue, Dec 10, 2013 at 1:25 AM, David Miller <davem@davemloft.net> wrote:
>>>>
>>>>> And therefore, you do not need to define any actual bits yet. The
>>>>> socket option will for now just return "0", and that will mean that
>>>>> all the extensions Linux implements currently are presnet.
>>>>
>>>> Signed-off-by: Michal Sekletar <msekleta@redhat.com>
>>>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>>>> Cc: Daniel Borkmann <dborkman@redhat.com>
>>>> Cc: David Miller <davem@davemloft.net>
>>>> ---
>>>>   arch/alpha/include/uapi/asm/socket.h   | 2 ++
>>>>   arch/avr32/include/uapi/asm/socket.h   | 2 ++
>>>>   arch/cris/include/uapi/asm/socket.h    | 2 ++
>>>>   arch/frv/include/uapi/asm/socket.h     | 2 ++
>>>>   arch/ia64/include/uapi/asm/socket.h    | 2 ++
>>>>   arch/m32r/include/uapi/asm/socket.h    | 2 ++
>>>>   arch/mips/include/uapi/asm/socket.h    | 2 ++
>>>>   arch/mn10300/include/uapi/asm/socket.h | 2 ++
>>>>   arch/parisc/include/uapi/asm/socket.h  | 2 ++
>>>>   arch/powerpc/include/uapi/asm/socket.h | 2 ++
>>>>   arch/s390/include/uapi/asm/socket.h    | 2 ++
>>>>   arch/sparc/include/uapi/asm/socket.h   | 2 ++
>>>>   arch/xtensa/include/uapi/asm/socket.h  | 2 ++
>>>>   include/net/sock.h                     | 5 +++++
>>>>   include/uapi/asm-generic/socket.h      | 2 ++
>>>>   net/core/sock.c                        | 4 ++++
>>>>   16 files changed, 37 insertions(+)
>>>>
>>> ...
>>>> --- a/include/net/sock.h
>>>> +++ b/include/net/sock.h
>>>> @@ -2292,4 +2292,9 @@ extern int sysctl_optmem_max;
>>>>   extern __u32 sysctl_wmem_default;
>>>>   extern __u32 sysctl_rmem_default;
>>>>
>>>> +static inline int bpf_get_extensions(void)
>>>> +{
>>>> +    return 0;
>>>> +}
>>>> +
>>>
>>> This should rather be in: include/linux/filter.h
>>
>> Yes, having that function there makes more sense.
>>
>>>
>>> And then, maybe be renamed into something like bpf_tell_extensions()
>>> for example, but that's just nit.
>>
>> Renamed.
>>
>>>
>>> Also, please add a comment, saying if people add new extensions, they
>>> need to enumerate them within this function from now on so that user
>>> space who enquires for that can be made aware of.
>>
>> Let me know if even more explanatory comment is desired.
>
> I think that looks good.

Ok, then it seems you need to put your changes on top of [1]; and
0x4029 is fine then.

Sine you have to rebase anyway, and since I just noticed in your v2
patch [2] ... In function bpf_tell_extensions(), you need to convert
your whitespaces to tab, see Documentation/CodingStyle. While you're
at it, also try to line-break the comment at around 80 chars.

Otherwise your patch looks good, thanks Michal.

   [1] http://patchwork.ozlabs.org/patch/311829/
   [2] http://patchwork.ozlabs.org/patch/311790/

>>> Still, grepping through latest libpcap sources, tells me, so far over
>>> the years they have included SKF_AD_PROTOCOL and SKF_AD_PKTTYPE, wow!
>>>
>>> That's very disappointing, so I have high hopes with this getsockopt().
>>>
>>> ;)
>>>
>>> Thanks,
>>
>> Thank you for the review. Btw, what do you think about define for parisc? I am
>
> Yes, saw that in the patch and had similar thoughts.
>
>> not sure I grok 0x4048 for SO_MAX_PACING_RATE and hence not sure about 0x4029
>> for SO_BPF_EXTENSIONS.
>
> Hmm, maybe typo; SO_MAX_PACING_RATE should have been a 0x4028 ?
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net v2] tcp: metrics: Avoid duplicate entries with the same destination-IP
From: Eric Dumazet @ 2014-01-16 19:20 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: netdev, David Miller
In-Reply-To: <1389898881-11393-1-git-send-email-christoph.paasch@uclouvain.be>

On Thu, 2014-01-16 at 20:01 +0100, Christoph Paasch wrote:
> Because the tcp-metrics is an RCU-list, it may be that two
> soft-interrupts are inside __tcp_get_metrics() for the same
> destination-IP at the same time. If this destination-IP is not yet part of
> the tcp-metrics, both soft-interrupts will end up in tcpm_new and create
> a new entry for this IP.
> So, we will have two tcp-metrics with the same destination-IP in the list.
> 
> This patch checks twice __tcp_get_metrics(). First without holding the
> lock, then while holding the lock. The second one is there to confirm
> that the entry has not been added by another soft-irq while waiting for
> the spin-lock.
> 
> Fixes: 51c5d0c4b169b (tcp: Maintain dynamic metrics in local cache.)
> Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
> ---
> 
> v2: As requested by Eric D.: Check the cache twice. Once without holding the lock,
>     and then again while holding the lock. That way we avoid taking the lock
>     needlessly.
> 
>  net/ipv4/tcp_metrics.c | 51 +++++++++++++++++++++++++++++++-------------------
>  1 file changed, 32 insertions(+), 19 deletions(-)

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox