Netdev List
 help / color / mirror / Atom feed
* [iproute2] tc: Show classes more hierarchically]
From: vadim4j @ 2014-12-15 22:48 UTC (permalink / raw)
  To: netdev; +Cc: vadim4j

Hi All,

I am playing with showing classes in more hierarchically format and I
have some code and example of output from my TC looks like:

# tc/tc -t class show dev tap0

 \---1:2 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:40 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:50 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:60 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
 \---1:1 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:10 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
               \---1:11 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
                      \---1:111 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:20 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
        \---1:30 (htb) prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 


which in standart output mode it looks like:

# tc/tc class show dev tap0

class htb 1:11 parent 1:10 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:111 parent 1:11 prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:10 parent 1:1 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b 
class htb 1:1 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:20 parent 1:1 leaf 20: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:2 root rate 6Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:30 parent 1:1 leaf 30: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:40 parent 1:2 leaf 40: prio 0 rate 5Mbit ceil 5Mbit burst 15Kb cburst 1600b 
class htb 1:50 parent 1:2 leaf 50: prio 0 rate 3Mbit ceil 6Mbit burst 15Kb cburst 1599b 
class htb 1:60 parent 1:2 leaf 60: prio 0 rate 1Kbit ceil 6Mbit burst 15Kb cburst 1599b 

So I'd like to ask if it might be useful for the TC users (may be
better format ?) to have this ?

Thanks,

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Steven Rostedt @ 2014-12-15 22:29 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Josef Bacik, Eric Dumazet, Alexei Starovoitov, Laurent Chavey,
	Yuchung Cheng, Martin KaFai Lau, netdev@vger.kernel.org,
	David S. Miller, Hannes Frederic Sowa, Lawrence Brakmo,
	Kernel Team
In-Reply-To: <CA+mtBx8tB6EE6i9C5KdOmwJ1D1nnaX3bvia71oj=N9U5h3KKBA@mail.gmail.com>

On Mon, 15 Dec 2014 14:01:43 -0800
Tom Herbert <therbert@google.com> wrote:

> >
> > We've set up something for exactly this case at the end of January but have
> > yet to get a response from Google.  If any of the Google people cc'ed (or
> > really anybody, its not a strictly FB/Google thing) is interested please
> > email me directly and I'll send you the details, we will be meeting face to
> > face in the bay area at the end of January.  Thanks,
> >
> 
> Maybe this would be good for discussion at netdev01?

Is this something I should attend too? For this discussion that is.
Weather permitting, Ottawa is only a 4 1/2 hour drive for me.

-- Steve

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Tom Herbert @ 2014-12-15 22:01 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Eric Dumazet, Alexei Starovoitov, Laurent Chavey, Yuchung Cheng,
	Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
	Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
	Kernel Team
In-Reply-To: <548F0F62.7080704@fb.com>

On Mon, Dec 15, 2014 at 8:42 AM, Josef Bacik <jbacik@fb.com> wrote:
> On 12/15/2014 11:03 AM, Eric Dumazet wrote:
>>
>> On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
>>
>>> I think patches 1 and 3 are good additions, since they establish
>>> few permanent points of instrumentation in tcp stack.
>>> Patches 4-5 look more like use cases of tracepoints established
>>> before. They may feel like simple additions and, no doubt,
>>> they are useful, but since they expose things via tracing
>>> infra they become part of api and cannot be changed later,
>>> when more stats would be needed.
>>> I think systemtap like scripting on top of patches 1 and 3
>>> should solve your use case ?
>>> Also, have you looked at recent eBPF work?
>>> Though it's not completely ready yet, soon it should
>>> be able to do the same stats collection as you have
>>> in 4/5 without adding permanent pieces to the kernel.
>>
>>
>> So it looks like web10g like interfaces are very often requested by
>> various teams.
>>
>> And we have many different views on how to hack this. I am astonished by
>> number of hacks I saw about this stuff going on.
>>
>> What about a clean way, extending current TCP_INFO, which is both
>> available as a getsockopt() for socket owners and ss/iproute2
>> information for 'external entities'
>>
>> If we consider web10g info needed, then adding a ftrace/eBPF like
>> interface is simply yet another piece of code we need to maintain,
>> and the argument of 'this should cost nothing if not activated' is
>> nonsense since major players need to constantly monitor TCP metrics and
>> behavior.
>>
>> It seems both FaceBook and Google are working on a subset of web10g.
>>
>> I suggest we meet together and establish a common ground, preferably
>> after Christmas holidays.
>>
>
> We've set up something for exactly this case at the end of January but have
> yet to get a response from Google.  If any of the Google people cc'ed (or
> really anybody, its not a strictly FB/Google thing) is interested please
> email me directly and I'll send you the details, we will be meeting face to
> face in the bay area at the end of January.  Thanks,
>

Maybe this would be good for discussion at netdev01?

> Josef
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 2/3] vringh: initial virtio 1.0 support
From: Michael S. Tsirkin @ 2014-12-15 21:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, kvm, virtualization
In-Reply-To: <1418678019-31629-1-git-send-email-mst@redhat.com>

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/vringh.h |  33 ++++++++++++++
 drivers/vhost/vringh.c | 121 ++++++++++++++++++++++++++++++-------------------
 2 files changed, 107 insertions(+), 47 deletions(-)

diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index f696dd0..a3fa537 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -24,12 +24,16 @@
 #ifndef _LINUX_VRINGH_H
 #define _LINUX_VRINGH_H
 #include <uapi/linux/virtio_ring.h>
+#include <linux/virtio_byteorder.h>
 #include <linux/uio.h>
 #include <linux/slab.h>
 #include <asm/barrier.h>
 
 /* virtio_ring with information needed for host access. */
 struct vringh {
+	/* Everything is little endian */
+	bool little_endian;
+
 	/* Guest publishes used event idx (note: we always do). */
 	bool event_indices;
 
@@ -222,4 +226,33 @@ static inline void vringh_notify(struct vringh *vrh)
 		vrh->notify(vrh);
 }
 
+static inline u16 vringh16_to_cpu(const struct vringh *vrh, __virtio16 val)
+{
+	return __virtio16_to_cpu(vrh->little_endian, val);
+}
+
+static inline __virtio16 cpu_to_vringh16(const struct vringh *vrh, u16 val)
+{
+	return __cpu_to_virtio16(vrh->little_endian, val);
+}
+
+static inline u32 vringh32_to_cpu(const struct vringh *vrh, __virtio32 val)
+{
+	return __virtio32_to_cpu(vrh->little_endian, val);
+}
+
+static inline __virtio32 cpu_to_vringh32(const struct vringh *vrh, u32 val)
+{
+	return __cpu_to_virtio32(vrh->little_endian, val);
+}
+
+static inline u64 vringh64_to_cpu(const struct vringh *vrh, __virtio64 val)
+{
+	return __virtio64_to_cpu(vrh->little_endian, val);
+}
+
+static inline __virtio64 cpu_to_vringh64(const struct vringh *vrh, u64 val)
+{
+	return __cpu_to_virtio64(vrh->little_endian, val);
+}
 #endif /* _LINUX_VRINGH_H */
diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index ac3fe27..3bb02c6 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -11,6 +11,7 @@
 #include <linux/uaccess.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <uapi/linux/virtio_config.h>
 
 static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
 {
@@ -28,13 +29,14 @@ static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
 
 /* Returns vring->num if empty, -ve on error. */
 static inline int __vringh_get_head(const struct vringh *vrh,
-				    int (*getu16)(u16 *val, const u16 *p),
+				    int (*getu16)(const struct vringh *vrh,
+						  u16 *val, const __virtio16 *p),
 				    u16 *last_avail_idx)
 {
 	u16 avail_idx, i, head;
 	int err;
 
-	err = getu16(&avail_idx, &vrh->vring.avail->idx);
+	err = getu16(vrh, &avail_idx, &vrh->vring.avail->idx);
 	if (err) {
 		vringh_bad("Failed to access avail idx at %p",
 			   &vrh->vring.avail->idx);
@@ -49,7 +51,7 @@ static inline int __vringh_get_head(const struct vringh *vrh,
 
 	i = *last_avail_idx & (vrh->vring.num - 1);
 
-	err = getu16(&head, &vrh->vring.avail->ring[i]);
+	err = getu16(vrh, &head, &vrh->vring.avail->ring[i]);
 	if (err) {
 		vringh_bad("Failed to read head: idx %d address %p",
 			   *last_avail_idx, &vrh->vring.avail->ring[i]);
@@ -144,28 +146,32 @@ static inline bool no_range_check(struct vringh *vrh, u64 addr, size_t *len,
 }
 
 /* No reason for this code to be inline. */
-static int move_to_indirect(int *up_next, u16 *i, void *addr,
+static int move_to_indirect(const struct vringh *vrh,
+			    int *up_next, u16 *i, void *addr,
 			    const struct vring_desc *desc,
 			    struct vring_desc **descs, int *desc_max)
 {
+	u32 len;
+
 	/* Indirect tables can't have indirect. */
 	if (*up_next != -1) {
 		vringh_bad("Multilevel indirect %u->%u", *up_next, *i);
 		return -EINVAL;
 	}
 
-	if (unlikely(desc->len % sizeof(struct vring_desc))) {
+	len = vringh32_to_cpu(vrh, desc->len);
+	if (unlikely(len % sizeof(struct vring_desc))) {
 		vringh_bad("Strange indirect len %u", desc->len);
 		return -EINVAL;
 	}
 
 	/* We will check this when we follow it! */
-	if (desc->flags & VRING_DESC_F_NEXT)
-		*up_next = desc->next;
+	if (desc->flags & cpu_to_vringh16(vrh, VRING_DESC_F_NEXT))
+		*up_next = vringh16_to_cpu(vrh, desc->next);
 	else
 		*up_next = -2;
 	*descs = addr;
-	*desc_max = desc->len / sizeof(struct vring_desc);
+	*desc_max = len / sizeof(struct vring_desc);
 
 	/* Now, start at the first indirect. */
 	*i = 0;
@@ -287,22 +293,25 @@ __vringh_iov(struct vringh *vrh, u16 i,
 		if (unlikely(err))
 			goto fail;
 
-		if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
+		if (unlikely(desc.flags &
+			     cpu_to_vringh16(vrh, VRING_DESC_F_INDIRECT))) {
+			u64 a = vringh64_to_cpu(vrh, desc.addr);
+
 			/* Make sure it's OK, and get offset. */
-			len = desc.len;
-			if (!rcheck(vrh, desc.addr, &len, &range, getrange)) {
+			len = vringh32_to_cpu(vrh, desc.len);
+			if (!rcheck(vrh, a, &len, &range, getrange)) {
 				err = -EINVAL;
 				goto fail;
 			}
 
-			if (unlikely(len != desc.len)) {
+			if (unlikely(len != vringh32_to_cpu(vrh, desc.len))) {
 				slow = true;
 				/* We need to save this range to use offset */
 				slowrange = range;
 			}
 
-			addr = (void *)(long)(desc.addr + range.offset);
-			err = move_to_indirect(&up_next, &i, addr, &desc,
+			addr = (void *)(long)(a + range.offset);
+			err = move_to_indirect(vrh, &up_next, &i, addr, &desc,
 					       &descs, &desc_max);
 			if (err)
 				goto fail;
@@ -315,7 +324,7 @@ __vringh_iov(struct vringh *vrh, u16 i,
 			goto fail;
 		}
 
-		if (desc.flags & VRING_DESC_F_WRITE)
+		if (desc.flags & cpu_to_vringh16(vrh, VRING_DESC_F_WRITE))
 			iov = wiov;
 		else {
 			iov = riov;
@@ -336,12 +345,14 @@ __vringh_iov(struct vringh *vrh, u16 i,
 
 	again:
 		/* Make sure it's OK, and get offset. */
-		len = desc.len;
-		if (!rcheck(vrh, desc.addr, &len, &range, getrange)) {
+		len = vringh32_to_cpu(vrh, desc.len);
+		if (!rcheck(vrh, vringh64_to_cpu(vrh, desc.addr), &len, &range,
+			    getrange)) {
 			err = -EINVAL;
 			goto fail;
 		}
-		addr = (void *)(unsigned long)(desc.addr + range.offset);
+		addr = (void *)(unsigned long)(vringh64_to_cpu(vrh, desc.addr) +
+					       range.offset);
 
 		if (unlikely(iov->used == (iov->max_num & ~VRINGH_IOV_ALLOCATED))) {
 			err = resize_iovec(iov, gfp);
@@ -353,14 +364,16 @@ __vringh_iov(struct vringh *vrh, u16 i,
 		iov->iov[iov->used].iov_len = len;
 		iov->used++;
 
-		if (unlikely(len != desc.len)) {
-			desc.len -= len;
-			desc.addr += len;
+		if (unlikely(len != vringh32_to_cpu(vrh, desc.len))) {
+			desc.len = cpu_to_vringh32(vrh,
+				   vringh32_to_cpu(vrh, desc.len) - len);
+			desc.addr = cpu_to_vringh64(vrh,
+				    vringh64_to_cpu(vrh, desc.addr) + len);
 			goto again;
 		}
 
-		if (desc.flags & VRING_DESC_F_NEXT) {
-			i = desc.next;
+		if (desc.flags & cpu_to_vringh16(vrh, VRING_DESC_F_NEXT)) {
+			i = vringh16_to_cpu(vrh, desc.next);
 		} else {
 			/* Just in case we need to finish traversing above. */
 			if (unlikely(up_next > 0)) {
@@ -387,7 +400,8 @@ fail:
 static inline int __vringh_complete(struct vringh *vrh,
 				    const struct vring_used_elem *used,
 				    unsigned int num_used,
-				    int (*putu16)(u16 *p, u16 val),
+				    int (*putu16)(const struct vringh *vrh,
+						  __virtio16 *p, u16 val),
 				    int (*putused)(struct vring_used_elem *dst,
 						   const struct vring_used_elem
 						   *src, unsigned num))
@@ -420,7 +434,7 @@ static inline int __vringh_complete(struct vringh *vrh,
 	/* Make sure buffer is written before we update index. */
 	virtio_wmb(vrh->weak_barriers);
 
-	err = putu16(&vrh->vring.used->idx, used_idx + num_used);
+	err = putu16(vrh, &vrh->vring.used->idx, used_idx + num_used);
 	if (err) {
 		vringh_bad("Failed to update used index at %p",
 			   &vrh->vring.used->idx);
@@ -433,7 +447,9 @@ static inline int __vringh_complete(struct vringh *vrh,
 
 
 static inline int __vringh_need_notify(struct vringh *vrh,
-				       int (*getu16)(u16 *val, const u16 *p))
+				       int (*getu16)(const struct vringh *vrh,
+						     u16 *val,
+						     const __virtio16 *p))
 {
 	bool notify;
 	u16 used_event;
@@ -447,7 +463,7 @@ static inline int __vringh_need_notify(struct vringh *vrh,
 	/* Old-style, without event indices. */
 	if (!vrh->event_indices) {
 		u16 flags;
-		err = getu16(&flags, &vrh->vring.avail->flags);
+		err = getu16(vrh, &flags, &vrh->vring.avail->flags);
 		if (err) {
 			vringh_bad("Failed to get flags at %p",
 				   &vrh->vring.avail->flags);
@@ -457,7 +473,7 @@ static inline int __vringh_need_notify(struct vringh *vrh,
 	}
 
 	/* Modern: we know when other side wants to know. */
-	err = getu16(&used_event, &vring_used_event(&vrh->vring));
+	err = getu16(vrh, &used_event, &vring_used_event(&vrh->vring));
 	if (err) {
 		vringh_bad("Failed to get used event idx at %p",
 			   &vring_used_event(&vrh->vring));
@@ -478,20 +494,22 @@ static inline int __vringh_need_notify(struct vringh *vrh,
 }
 
 static inline bool __vringh_notify_enable(struct vringh *vrh,
-					  int (*getu16)(u16 *val, const u16 *p),
-					  int (*putu16)(u16 *p, u16 val))
+					  int (*getu16)(const struct vringh *vrh,
+							u16 *val, const __virtio16 *p),
+					  int (*putu16)(const struct vringh *vrh,
+							__virtio16 *p, u16 val))
 {
 	u16 avail;
 
 	if (!vrh->event_indices) {
 		/* Old-school; update flags. */
-		if (putu16(&vrh->vring.used->flags, 0) != 0) {
+		if (putu16(vrh, &vrh->vring.used->flags, 0) != 0) {
 			vringh_bad("Clearing used flags %p",
 				   &vrh->vring.used->flags);
 			return true;
 		}
 	} else {
-		if (putu16(&vring_avail_event(&vrh->vring),
+		if (putu16(vrh, &vring_avail_event(&vrh->vring),
 			   vrh->last_avail_idx) != 0) {
 			vringh_bad("Updating avail event index %p",
 				   &vring_avail_event(&vrh->vring));
@@ -503,7 +521,7 @@ static inline bool __vringh_notify_enable(struct vringh *vrh,
 	 * sure it's written, then check again. */
 	virtio_mb(vrh->weak_barriers);
 
-	if (getu16(&avail, &vrh->vring.avail->idx) != 0) {
+	if (getu16(vrh, &avail, &vrh->vring.avail->idx) != 0) {
 		vringh_bad("Failed to check avail idx at %p",
 			   &vrh->vring.avail->idx);
 		return true;
@@ -516,11 +534,13 @@ static inline bool __vringh_notify_enable(struct vringh *vrh,
 }
 
 static inline void __vringh_notify_disable(struct vringh *vrh,
-					   int (*putu16)(u16 *p, u16 val))
+					   int (*putu16)(const struct vringh *vrh,
+							 __virtio16 *p, u16 val))
 {
 	if (!vrh->event_indices) {
 		/* Old-school; update flags. */
-		if (putu16(&vrh->vring.used->flags, VRING_USED_F_NO_NOTIFY)) {
+		if (putu16(vrh, &vrh->vring.used->flags,
+			   VRING_USED_F_NO_NOTIFY)) {
 			vringh_bad("Setting used flags %p",
 				   &vrh->vring.used->flags);
 		}
@@ -528,14 +548,18 @@ static inline void __vringh_notify_disable(struct vringh *vrh,
 }
 
 /* Userspace access helpers: in this case, addresses are really userspace. */
-static inline int getu16_user(u16 *val, const u16 *p)
+static inline int getu16_user(const struct vringh *vrh, u16 *val, const __virtio16 *p)
 {
-	return get_user(*val, (__force u16 __user *)p);
+	__virtio16 v = 0;
+	int rc = get_user(v, (__force __virtio16 __user *)p);
+	*val = vringh16_to_cpu(vrh, v);
+	return rc;
 }
 
-static inline int putu16_user(u16 *p, u16 val)
+static inline int putu16_user(const struct vringh *vrh, __virtio16 *p, u16 val)
 {
-	return put_user(val, (__force u16 __user *)p);
+	__virtio16 v = cpu_to_vringh16(vrh, val);
+	return put_user(v, (__force __virtio16 __user *)p);
 }
 
 static inline int copydesc_user(void *dst, const void *src, size_t len)
@@ -589,6 +613,7 @@ int vringh_init_user(struct vringh *vrh, u64 features,
 		return -EINVAL;
 	}
 
+	vrh->little_endian = (features & (1ULL << VIRTIO_F_VERSION_1));
 	vrh->event_indices = (features & (1 << VIRTIO_RING_F_EVENT_IDX));
 	vrh->weak_barriers = weak_barriers;
 	vrh->completed = 0;
@@ -729,8 +754,8 @@ int vringh_complete_user(struct vringh *vrh, u16 head, u32 len)
 {
 	struct vring_used_elem used;
 
-	used.id = head;
-	used.len = len;
+	used.id = cpu_to_vringh32(vrh, head);
+	used.len = cpu_to_vringh32(vrh, len);
 	return __vringh_complete(vrh, &used, 1, putu16_user, putused_user);
 }
 EXPORT_SYMBOL(vringh_complete_user);
@@ -792,15 +817,16 @@ int vringh_need_notify_user(struct vringh *vrh)
 EXPORT_SYMBOL(vringh_need_notify_user);
 
 /* Kernelspace access helpers. */
-static inline int getu16_kern(u16 *val, const u16 *p)
+static inline int getu16_kern(const struct vringh *vrh,
+			      u16 *val, const __virtio16 *p)
 {
-	*val = ACCESS_ONCE(*p);
+	*val = vringh16_to_cpu(vrh, ACCESS_ONCE(*p));
 	return 0;
 }
 
-static inline int putu16_kern(u16 *p, u16 val)
+static inline int putu16_kern(const struct vringh *vrh, __virtio16 *p, u16 val)
 {
-	ACCESS_ONCE(*p) = val;
+	ACCESS_ONCE(*p) = cpu_to_vringh16(vrh, val);
 	return 0;
 }
 
@@ -848,6 +874,7 @@ int vringh_init_kern(struct vringh *vrh, u64 features,
 		return -EINVAL;
 	}
 
+	vrh->little_endian = (features & (1ULL << VIRTIO_F_VERSION_1));
 	vrh->event_indices = (features & (1 << VIRTIO_RING_F_EVENT_IDX));
 	vrh->weak_barriers = weak_barriers;
 	vrh->completed = 0;
@@ -962,8 +989,8 @@ int vringh_complete_kern(struct vringh *vrh, u16 head, u32 len)
 {
 	struct vring_used_elem used;
 
-	used.id = head;
-	used.len = len;
+	used.id = cpu_to_vringh32(vrh, head);
+	used.len = cpu_to_vringh32(vrh, len);
 
 	return __vringh_complete(vrh, &used, 1, putu16_kern, putused_kern);
 }
-- 
MST

^ permalink raw reply related

* [PATCH 1/3] vringh: 64 bit features
From: Michael S. Tsirkin @ 2014-12-15 21:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, kvm, virtualization
In-Reply-To: <1418678019-31629-1-git-send-email-mst@redhat.com>

Pass u64 everywhere.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/vringh.h | 4 ++--
 drivers/vhost/vringh.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index 749cde2..f696dd0 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -105,7 +105,7 @@ struct vringh_kiov {
 #define VRINGH_IOV_ALLOCATED 0x8000000
 
 /* Helpers for userspace vrings. */
-int vringh_init_user(struct vringh *vrh, u32 features,
+int vringh_init_user(struct vringh *vrh, u64 features,
 		     unsigned int num, bool weak_barriers,
 		     struct vring_desc __user *desc,
 		     struct vring_avail __user *avail,
@@ -167,7 +167,7 @@ bool vringh_notify_enable_user(struct vringh *vrh);
 void vringh_notify_disable_user(struct vringh *vrh);
 
 /* Helpers for kernelspace vrings. */
-int vringh_init_kern(struct vringh *vrh, u32 features,
+int vringh_init_kern(struct vringh *vrh, u64 features,
 		     unsigned int num, bool weak_barriers,
 		     struct vring_desc *desc,
 		     struct vring_avail *avail,
diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index 5174eba..ac3fe27 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -577,7 +577,7 @@ static inline int xfer_to_user(void *dst, void *src, size_t len)
  * Returns an error if num is invalid: you should check pointers
  * yourself!
  */
-int vringh_init_user(struct vringh *vrh, u32 features,
+int vringh_init_user(struct vringh *vrh, u64 features,
 		     unsigned int num, bool weak_barriers,
 		     struct vring_desc __user *desc,
 		     struct vring_avail __user *avail,
@@ -836,7 +836,7 @@ static inline int xfer_kern(void *src, void *dst, size_t len)
  *
  * Returns an error if num is invalid.
  */
-int vringh_init_kern(struct vringh *vrh, u32 features,
+int vringh_init_kern(struct vringh *vrh, u64 features,
 		     unsigned int num, bool weak_barriers,
 		     struct vring_desc *desc,
 		     struct vring_avail *avail,
-- 
MST

^ permalink raw reply related

* Re: [PATCH] ioc3: fix incorrect use of htons/ntohs
From: Ben Hutchings @ 2014-12-15 21:09 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Lino Sanfilippo, linux-mips, netdev, linux-kernel
In-Reply-To: <20141215181444.GD26674@linux-mips.org>

[-- Attachment #1: Type: text/plain, Size: 2676 bytes --]

On Mon, 2014-12-15 at 19:14 +0100, Ralf Baechle wrote:
> On Mon, Dec 01, 2014 at 04:09:36AM +0000, Ben Hutchings wrote:
> 
> > >  	/* Same as tx - compute csum of pseudo header  */
> > >  	csum = hwsum +
> > > -	       (ih->tot_len - (ih->ihl << 2)) +
> > > -	       htons((uint16_t)ih->protocol) +
> > > +	       (ih->tot_len - (ih->ihl << 2)) + ih->protocol +
> > >  	       (ih->saddr >> 16) + (ih->saddr & 0xffff) +
> > >  	       (ih->daddr >> 16) + (ih->daddr & 0xffff);
> > >
> > 
> > The pseudo-header is specified as:
> > 
> >                      +--------+--------+--------+--------+
> >                      |           Source Address          |
> >                      +--------+--------+--------+--------+
> >                      |         Destination Address       |
> >                      +--------+--------+--------+--------+
> >                      |  zero  |  PTCL  |    TCP Length   |
> >                      +--------+--------+--------+--------+
> > 
> > The current code zero-extends the protocol number to produce the 5th
> > 16-bit word of the pseudo-header, then uses htons() to put it in
> > big-endian order, consistent with the other fields.  (Yes, it's doing
> > addition on big-endian words; this works even on little-endian machines
> > due to the way the checksum is specified.)
> > 
> > The driver should not be doing this at all, though.  It should set
> > skb->csum = hwsum; skb->ip_summed = CHECKSUM_COMPLETE; and let the
> > network stack adjust the hardware checksum.
> 
> Really?  The IOC3 isn't the exactly the smartest NIC around; it does add up
> everything and the kitchen sink, that is ethernet headers, IP headers and
> on RX the frame's trailing CRC.

That is almost exactly what CHECKSUM_COMPLETE means on receive; only the
CRC would need to be subtracted.  Then the driver can validate
{TCP,UDP}/IPv{4,6} checksums without any header parsing.

> All that needs to be subtracted in software
> which is what this does.  I think others NICs are all smarted and don't
> need this particular piece of magic.

It may not be smart, but that allows it to cover more cases than most
smart network controllers!

On transmit, the driver should:
- Calculate the partial checksum of data up to offset csum_start
- Subtract this from the checksum at offset (csum_start + csum_offset)
It should set the NETIF_F_GEN_CSUM feature flag rather than
NETIF_F_IP_CSUM.  Then it will be able to generate {TCP,UDP}/IPv{4,6}
checksums.

Ben.

> I agree with your other comment wrt. to htons().


-- 
Ben Hutchings
The two most common things in the universe are hydrogen and stupidity.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Yuchung Cheng @ 2014-12-15 19:56 UTC (permalink / raw)
  To: Blake Matheny
  Cc: Eric Dumazet, Alexei Starovoitov, Laurent Chavey, Martin Lau,
	netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <D0B44739.74E8A%bmatheny@fb.com>

On Mon, Dec 15, 2014 at 8:08 AM, Blake Matheny <bmatheny@fb.com> wrote:
>
> We have an additional set of patches for web10g that builds on these
> tracepoints. It can be made to work either way, but I agree the idea of
> something like a sockopt would be really nice.

I'd like to compare these patches  with tools that parse pcap files to
generate per-flow counters to collect RTTs, #dupacks, etc. What
additional values or insights do they provide to improve/debug TCP
performance? maybe an example?

IMO these stats provide a general pictures of how TCP works of a
specific network, but not enough to really nail specific bugs in TCP
protocol or implementation. Then SNMP stats or sampling with pcap
traces with offline analysis can achieve the same purpose.

>
>
> -Blake
>
> On 12/15/14, 8:03 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:
>
> >On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
> >
> >> I think patches 1 and 3 are good additions, since they establish
> >> few permanent points of instrumentation in tcp stack.
> >> Patches 4-5 look more like use cases of tracepoints established
> >> before. They may feel like simple additions and, no doubt,
> >> they are useful, but since they expose things via tracing
> >> infra they become part of api and cannot be changed later,
> >> when more stats would be needed.
> >> I think systemtap like scripting on top of patches 1 and 3
> >> should solve your use case ?
> >> Also, have you looked at recent eBPF work?
> >> Though it's not completely ready yet, soon it should
> >> be able to do the same stats collection as you have
> >> in 4/5 without adding permanent pieces to the kernel.
> >
> >So it looks like web10g like interfaces are very often requested by
> >various teams.
> >
> >And we have many different views on how to hack this. I am astonished by
> >number of hacks I saw about this stuff going on.
> >
> >What about a clean way, extending current TCP_INFO, which is both
> >available as a getsockopt() for socket owners and ss/iproute2
> >information for 'external entities'
> >
> >If we consider web10g info needed, then adding a ftrace/eBPF like
> >interface is simply yet another piece of code we need to maintain,
> >and the argument of 'this should cost nothing if not activated' is
> >nonsense since major players need to constantly monitor TCP metrics and
> >behavior.
> >
> >It seems both FaceBook and Google are working on a subset of web10g.
> >
> >I suggest we meet together and establish a common ground, preferably
> >after Christmas holidays.
> >
> >Thanks
> >
> >
>

^ permalink raw reply

* [PATCH 3.13.y-ckt 60/96] net/ping: handle protocol mismatching scenario
From: Kamal Mostafa @ 2014-12-15 19:26 UTC (permalink / raw)
  To: linux-kernel, stable, kernel-team
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, Jane Zhou, Yiwei Zhao,
	Kamal Mostafa
In-Reply-To: <1418671616-25482-1-git-send-email-kamal@canonical.com>

3.13.11-ckt13 -stable review patch.  If anyone has any objections, please let me know.

------------------

From: Jane Zhou <a17711@motorola.com>

commit 91a0b603469069cdcce4d572b7525ffc9fd352a6 upstream.

ping_lookup() may return a wrong sock if sk_buff's and sock's protocols
dont' match. For example, sk_buff's protocol is ETH_P_IPV6, but sock's
sk_family is AF_INET, in that case, if sk->sk_bound_dev_if is zero, a wrong
sock will be returned.
the fix is to "continue" the searching, if no matching, return NULL.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Jane Zhou <a17711@motorola.com>
Signed-off-by: Yiwei Zhao <gbjc64@motorola.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Kamal Mostafa <kamal@canonical.com>
---
 net/ipv4/ping.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
index 3ef2919..8e0f65c 100644
--- a/net/ipv4/ping.c
+++ b/net/ipv4/ping.c
@@ -213,6 +213,8 @@ static struct sock *ping_lookup(struct net *net, struct sk_buff *skb, u16 ident)
 					     &ipv6_hdr(skb)->daddr))
 				continue;
 #endif
+		} else {
+			continue;
 		}
 
 		if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif)
-- 
1.9.1

^ permalink raw reply related

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-15 18:36 UTC (permalink / raw)
  To: John Fastabend, netdev@vger.kernel.org
  Cc: Jamal Hadi Salim, Roopa Prabhu, Jiri Pirko, sfeldma@gmail.com,
	bcrl@kvack.org, tgraf@suug.ch, stephen@networkplumber.org,
	linville@tuxdriver.com, vyasevic@redhat.com, davem@davemloft.net,
	shm@cumulusnetworks.com, gospo@cumulusnetworks.com
In-Reply-To: <548F20FF.9080508@gmail.com>



> -----Original Message-----
> From: John Fastabend [mailto:john.fastabend@gmail.com]
> Sent: Monday, December 15, 2014 7:57 PM
> To: Arad, Ronen
> Cc: Jamal Hadi Salim; Roopa Prabhu; Jiri Pirko; netdev@vger.kernel.org;
> sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 12/15/2014 09:25 AM, Arad, Ronen wrote:
> >
> >
> >> -----Original Message-----
> >> From: netdev-owner@vger.kernel.org [mailto:netdev-
> >> owner@vger.kernel.org] On Behalf Of Jamal Hadi Salim
> >> Sent: Monday, December 15, 2014 5:26 PM
> >> To: Roopa Prabhu; Jiri Pirko
> >> Cc: sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> >> john.fastabend@gmail.com; stephen@networkplumber.org;
> >> linville@tuxdriver.com; vyasevic@redhat.com; netdev@vger.kernel.org;
> >> davem@davemloft.net; shm@cumulusnetworks.com;
> >> gospo@cumulusnetworks.com
> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
> >> del bridge port attributes
> >>
> >>
> >> Sorry - i didnt quiet follow the discussion, but i can see the value
> >> of propagating things from parent to children netdevs as part of the
> >> generic approach. And in that spirit:
> >>
> >> Ben's patches (and I am sure the cumulus folk do this) expose ports.
> >> i.e you boot up the hardware and you see ports. You can then put
> >> these ports in a bridge and you can offload fdbs and do other
> >> parametrization to the ASIC. IOW, this only becomes a bridge because
> >> you created one in the kernel and attached bridge ports to it.
> >>
> >> Lets say i didnt want a bridge. I want instead to take these exposed
> >> ports and create a bond (and maybe play with LACP). How does this
> >> propagation from
> >> parent->child->child work then? I think the idea of just bonding and
> >> parent->child->not
> >> exposing it as a switch is a reasonable use case.
> >
> 
> > Are you saying that the software should reflect the same functionality
> > the HW provides?
> > In other words is creating a bridge device mandatory for supporting
> > standard VLAN-bridging (as in IEEE 802.1Q) in the HW?
> 
> No it shouldn't be mandatory. And I don't think it is at the moment.
> Users are free to manage the FDB tables directly via netlink or configure the
> software bridge to sync them. This seems like a good model to follow to me
> and we should try to get as many features as it makes sense to follow this
> model.
> 
> > VLAN-bridging including port VLAN membership, VLAN filtering, PVID,
> > Egress un-tagging could be supported without an explicit bridge device
> > when port devices implement bridge ndos (ndo_bridge_{set,del,get}link).
> > What is lost is the ability to have common handling of VLAN-aware FDB
> > in the bridge module.
> 
> not sure what is lost here? Its seems the SW bridge does (or at least
> could) support the same vlan capabilities. And the bridge could push these
> into hardware when Roopa's offload bit is set. Or if users want to manage
> everything outside bridge calling the ndo_bridge_ ops directly works as well.
> By the way I believe the software linux bridge supports most of those
> features you listed today. If we missed something we can add it.
> 
> > Do we expect switch port devices to support L2 functionality both ways
> > (with and without an explicit bridge device)?
> 
> My opinion yes. But in fact the driver shouldn't care what is driving it. The
> paths should be the same for direct user manipulation via netlink and
> SELF|MASTER bit, bridge module, or any other in-kernel sub-system.
> 
The behavior of a driver could depend on the presence of a bridge and features such as FDB LEARNING and LEARNING_SYNC.
A switch port driver which is not enslaved to a bridge might need to implement VLAN-aware FDB within the driver and report its content to user-space using ndo_fdb_dump. 
A switch port driver which is enslaved to a bridge could do with only pass through for static FDB configuration to the HW when LEARNING_SYNC is configured. FDB reporting to user-space and soft aging are left to the bridge module FDB.
Such driver, without LEARNING_SYNC could still avoid maintaing in-driver FDB as long as it could dump the HW FDB on demand.
LEARNING_SYNC also requires periodic updates of freshness information from the driver to the bridge module.

> > Will the decision about using a bridge device or avoiding it be left
> > to the end-user?
> 
> Its a user policy decision. Again the offload bit gets us this in a reasonably
> configurable way IMO.
> 
> > (This requires switch port drivers to be able to work and provide
> > similar functionality in both setups).
> 
> Right, but if the drivers "care" who is calling their ndo ops something is
> seriously broken. For the driver it should not need to know anything about
> the callers so it doesn't matter to the driver if its a netlink call from user
> space or an internal call fro bridge.ko

LEARNING_SYNC only makes sense when a switch port driver is enslaved to a bridge. Rocker switch driver indeed monitors upper change notifications and keep track of master bridge presence. So bridge presence is not transparent.
> 
> > I think that we need to outline the handling of L3 as it could
> > determine or at least impact some of the answers to my above questions.
> 
> L3 should follow the same model. Admittedly I've not worked through the
> L3 cases closely but I don't see why we can't apply the same model.
> And maybe this is where we need to introduce a container to hold some
> state as Jamal says. The easiest way to see this will be to look at some
> proposed code.
> 
> 
> > cheers,
> > ronen
> >
> >> Also how does it work when i start doing L3 and the bond's port
> >> doesnt support L3? Is it time to revive the thing we called TheThing in Du?
> >>
> >> cheers,
> >> jamal
> >>
> >> On 12/14/14 14:41, Roopa Prabhu wrote:
> >>> On 12/14/14, 7:35 AM, Jiri Pirko wrote:
> >>
> >> [..chopped off for brevity and saving electrons..]
> >>
> >> cheers,
> >> jamal
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org More majordomo
> >> info at http://vger.kernel.org/majordomo-info.html
> 
> 
> --
> John Fastabend         Intel Corporation

^ permalink raw reply

* Re: [PATCH] ioc3: fix incorrect use of htons/ntohs
From: Ralf Baechle @ 2014-12-15 18:14 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Lino Sanfilippo, linux-mips, netdev, linux-kernel
In-Reply-To: <1417406976.7215.126.camel@decadent.org.uk>

On Mon, Dec 01, 2014 at 04:09:36AM +0000, Ben Hutchings wrote:

> >  	/* Same as tx - compute csum of pseudo header  */
> >  	csum = hwsum +
> > -	       (ih->tot_len - (ih->ihl << 2)) +
> > -	       htons((uint16_t)ih->protocol) +
> > +	       (ih->tot_len - (ih->ihl << 2)) + ih->protocol +
> >  	       (ih->saddr >> 16) + (ih->saddr & 0xffff) +
> >  	       (ih->daddr >> 16) + (ih->daddr & 0xffff);
> >
> 
> The pseudo-header is specified as:
> 
>                      +--------+--------+--------+--------+
>                      |           Source Address          |
>                      +--------+--------+--------+--------+
>                      |         Destination Address       |
>                      +--------+--------+--------+--------+
>                      |  zero  |  PTCL  |    TCP Length   |
>                      +--------+--------+--------+--------+
> 
> The current code zero-extends the protocol number to produce the 5th
> 16-bit word of the pseudo-header, then uses htons() to put it in
> big-endian order, consistent with the other fields.  (Yes, it's doing
> addition on big-endian words; this works even on little-endian machines
> due to the way the checksum is specified.)
> 
> The driver should not be doing this at all, though.  It should set
> skb->csum = hwsum; skb->ip_summed = CHECKSUM_COMPLETE; and let the
> network stack adjust the hardware checksum.

Really?  The IOC3 isn't the exactly the smartest NIC around; it does add up
everything and the kitchen sink, that is ethernet headers, IP headers and
on RX the frame's trailing CRC.  All that needs to be subtracted in software
which is what this does.  I think others NICs are all smarted and don't
need this particular piece of magic.

I agree with your other comment wrt. to htons().

  Ralf

^ permalink raw reply

* GOOD DAY
From: Sage Mothibi @ 2014-12-15 17:15 UTC (permalink / raw)
  To: sagemothibi

[-- Attachment #1: Type: text/plain, Size: 66 bytes --]



Please view the attachment for more details
Thanks
Engr. Mothibi

[-- Attachment #2: Hello.pdf --]
[-- Type: application/pdf, Size: 180137 bytes --]

^ permalink raw reply

* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
From: Wolfgang Walter @ 2014-12-15 18:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
	Steffen Klassert
In-Reply-To: <1418429740.13491.27.camel@edumazet-glaptop2.roam.corp.google.com>

Hello Eric!

Am Freitag, 12. Dezember 2014, 16:15:40 schrieb Eric Dumazet:
> On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> > I can't disable it as the driver will not allow it:
> > # ethtool -K eth0 tx off
> > Cannot change tx-checksumming
> > Could not change any device features
> 
> Sounds a bug in itself :(

I think the "gso disabled for interface" case is broken because:

tcp_sendmsg() sets tcp_gso_segs to zero

tcp_sendmsg() calls tcp_push_one() or __tcp_push_pending_frames()

tcp_push_one() and  __tcp_push_pending_frames() call tcp_write_xmit()

tcp_write_xmit() call tcp_init_tso_segs()

tcp_init_tso_segs() calls tcp_set_skb_tso_segs() because !tso_segs is true

tcp_init_tso_segs() creates a gso-packet if packet size is larger than 
mss_now.


I think tcp_init_tso_segs() assumed that tcp_init_tso_segs() checks if the 
socket supports gso.

I didn't check the other callers of tcp_init_tso_segs().


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

^ permalink raw reply

* Re: Bug: mv643xxx fails with highmem
From: Russell King - ARM Linux @ 2014-12-15 18:04 UTC (permalink / raw)
  To: fugang.duan@freescale.com
  Cc: David Miller, Fabio.Estevam@freescale.com,
	ezequiel.garcia@free-electrons.com, netdev@vger.kernel.org
In-Reply-To: <BLUPR03MB37339496B52755D0F32B002F5600@BLUPR03MB373.namprd03.prod.outlook.com>

On Fri, Dec 12, 2014 at 05:34:01AM +0000, fugang.duan@freescale.com wrote:
> I will submit one patch to fix the issue.

There's more bugs in the FEC driver... here's the relevant bits:

static void
fec_enet_tx_queue(struct net_device *ndev, u16 queue_id)
{
        bdp = txq->dirty_tx;

        bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);

        while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
                /* current queue is empty */
                if (bdp == txq->cur_tx)
                        break;

                skb = txq->tx_skbuff[index];
                txq->tx_skbuff[index] = NULL;
                if (!IS_TSO_HEADER(txq, bdp->cbd_bufaddr))
                        dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
                                        bdp->cbd_datlen, DMA_TO_DEVICE);
                bdp->cbd_bufaddr = 0;
                if (!skb) {
                        bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
                        continue;
                }
...
                txq->dirty_tx = bdp;
                bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
        }

Consider the following code path:
- we enter this function
- get the dirty_tx pointer
- move to the next descriptor (which we'll call descriptor A)
- next descriptor indicates that TX_READY = 0
- bdp != txq->cur_tx
- we unmap if needed
- we set bdp->cmdbufaddr = 0
- assume skb is NULL, so we move to the next descriptor (we'll call this B)
- next descriptor _may_ have TX_READY = 1
- we break out of the loop, and return

Some time later, we re-enter:
- get the dirty_tx pointer
- move to the next descriptor (which is descriptor A above)
- next descriptor indicates that TX_READY = 0
- bdp != txq->cur_tx
- we call dma_unmap_single(..., bdp->cbd_bufaddr, which we previously zeroed
  - the DMA API debugging complains that FEC is unmapping memory which it
    doesn't own

Unfortunately, this does appear to happen - from a paste from Jon
Nettleton from iMX6Q:

 32. [   45.033001] unmapping this address 0x0 size 66
 33. [   45.037470] ------------[ cut here ]------------
 34. [   45.042127] WARNING: CPU: 0 PID: 102 at lib/dma-debug.c:1080 check_unmap+0x784/0x9f4()
 35. [   45.050066] fec 2188000.ethernet: DMA-API: device driver tries to free DMA memory it has not a]

(where the printk at line 32 is something that was added to debug this.)

The sad thing is that the remainder of my FEC patches did go a long way
to clean up these kinds of issues in the driver (and there's /many/ of
them), but unfortunately other conflicting changes got merged before I
could finish rebasing them, I decided to move on to other things and
discard the remainder of my patch set.  Marek showed some interest in
taking the patch set over, but I've not heard anything more - and I'm
not about to resurect my efforts only to get into the same situation
where I'm carrying 50 odd patches which I can't merge back into mainline
without spending weeks endlessly rebasing them.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Benjamin LaHaise @ 2014-12-15 18:03 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jamal Hadi Salim, Jiri Pirko, sfeldma, tgraf, john.fastabend,
	stephen, linville, vyasevic, netdev, davem, shm, gospo
In-Reply-To: <548F1852.7090507@cumulusnetworks.com>

On Mon, Dec 15, 2014 at 09:20:18AM -0800, Roopa Prabhu wrote:
> On 12/15/14, 7:26 AM, Jamal Hadi Salim wrote:
> >
> >Sorry - i didnt quiet follow the discussion, but i can see the value
> >of propagating things from parent to children netdevs as part of the
> >generic approach. And in that spirit:
> >
> >Ben's patches (and I am sure the cumulus folk do this) expose ports.
> >i.e you boot up the hardware and you see ports. You can then put these
> >ports in a bridge and you can offload fdbs and do other parametrization
> >to the ASIC. IOW, this only becomes a bridge because you created one
> >in the kernel and attached bridge ports to it.
> >
> >Lets say i didnt want a bridge. I want instead to take these exposed
> >ports and create a bond (and maybe play with LACP). How does this
> >propagation from parent->child->child work then? I think the idea
> >of just bonding and not exposing it as a switch is a reasonable use
> >case.
> 
> We have not come to pure bonding and lacp yet (but i have mentioned it 
> in many contexts before).
> The use case you mention is offloading bond attributes. This will be 
> addressed as part of ongoing switchdev work
> for all other offloads (bonds, vxlans etc).
> Right now we are only talking bridge port attribute offload 
> (learn/flood/port state etc). This could still be a bridge port 
> attribute on a bond
> when the bond is a bridge port.

This raises the question: do we track which attributes are configured 
onto a port in the switchdev code (in an attempt to maintain the state 
on behalf of the underlying device), or do we simply pass them in as 
attributes of the config request?  With stacking, I can see the need 
for different layers to add different attributes to the config for a 
given switch port.  Things like bonding would need to make note of the 
bond interface with a common identifier so that the switch can figure 
out to put the different ports into the same group.

The rtl8366s has support for one 802.3ad group, so it looks like I will 
need to tackle this.

		-ben

> >Also how does it work when i start doing L3 and the bond's port doesnt
> >support L3? Is it time to revive the thing we called TheThing in Du?
> yes, exactly. This is what i had indicated in my initial emails on this 
> thread when the stacked devices topic came up.
> Since there was reluctance in introducing a switch device (theThing), My 
> current patch tries to do that without a switch device.
> Since this is still l2, and we are dealing with bridge port attributes, 
> my current patch traverses the stacked netdevs to call the 
> ndo_bridge_setlink on the switch port.
> 
> When it comes to l3, we can follow the same.., but as discussed in Du, 
> there are other reasons where a switch device becomes necessary.
> I can submit an alternate series to cover the switch device approach for 
> l2 as well.
> 
> Thanks,
> Roopa
> 

-- 
"Thought is the essence of where you are now."

^ permalink raw reply

* [PATCH net 3/3] net: dsa: bcm_sf2: always select FIXED_PHY
From: Florian Fainelli @ 2014-12-15 17:57 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli
In-Reply-To: <1418666235-3109-1-git-send-email-f.fainelli@gmail.com>

There is no need to do the following:

select FIXED_PHY if NET_DSA_BCM_SF2=y, as this implies that we will not be
able to build and/or run the driver correctly when built as a module,
which is no longer an issue since commit 37e9a6904520 ("net: phy: export
fixed_phy_register()").

Fixes: 246d7f773c13ca ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/dsa/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 7cf8f4ac281f..48e62a34f7f2 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -59,7 +59,7 @@ config NET_DSA_BCM_SF2
 	depends on HAS_IOMEM
 	select NET_DSA
 	select NET_DSA_TAG_BRCM
-	select FIXED_PHY if NET_DSA_BCM_SF2=y
+	select FIXED_PHY
 	select BCM7XXX_PHY
 	select MDIO_BCM_UNIMAC
 	---help---
-- 
2.1.0

^ permalink raw reply related

* [PATCH net 2/3] net: systemport: always select FIXED_PHY
From: Florian Fainelli @ 2014-12-15 17:57 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli
In-Reply-To: <1418666235-3109-1-git-send-email-f.fainelli@gmail.com>

There is no need to do the following:

select FIXED_PHY if SYSTEMPORT=y, as this implies that we will not be able
to build and/or run the driver correctly when built as a module, which
is no longer an issue since commit 37e9a6904520 ("net: phy: export
fixed_phy_register()")

Fixes: a3862db2d3c4 ("net: systemport: hook SYSTEMPORT driver in the build")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/broadcom/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
index f4789a059d54..41a3c9804427 100644
--- a/drivers/net/ethernet/broadcom/Kconfig
+++ b/drivers/net/ethernet/broadcom/Kconfig
@@ -155,7 +155,7 @@ config SYSTEMPORT
 	depends on OF
 	select MII
 	select PHYLIB
-	select FIXED_PHY if SYSTEMPORT=y
+	select FIXED_PHY
 	help
 	  This driver supports the built-in Ethernet MACs found in the
 	  Broadcom BCM7xxx Set Top Box family chipset using an internal
-- 
2.1.0

^ permalink raw reply related

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: John Fastabend @ 2014-12-15 17:57 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: Jamal Hadi Salim, Roopa Prabhu, Jiri Pirko,
	netdev@vger.kernel.org, sfeldma@gmail.com, bcrl@kvack.org,
	tgraf@suug.ch, stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505D999EA@ORSMSX106.amr.corp.intel.com>

On 12/15/2014 09:25 AM, Arad, Ronen wrote:
>
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> owner@vger.kernel.org] On Behalf Of Jamal Hadi Salim
>> Sent: Monday, December 15, 2014 5:26 PM
>> To: Roopa Prabhu; Jiri Pirko
>> Cc: sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> john.fastabend@gmail.com; stephen@networkplumber.org;
>> linville@tuxdriver.com; vyasevic@redhat.com; netdev@vger.kernel.org;
>> davem@davemloft.net; shm@cumulusnetworks.com;
>> gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>>
>> Sorry - i didnt quiet follow the discussion, but i can see the value of
>> propagating things from parent to children netdevs as part of the generic
>> approach. And in that spirit:
>>
>> Ben's patches (and I am sure the cumulus folk do this) expose ports.
>> i.e you boot up the hardware and you see ports. You can then put these
>> ports in a bridge and you can offload fdbs and do other parametrization to
>> the ASIC. IOW, this only becomes a bridge because you created one in the
>> kernel and attached bridge ports to it.
>>
>> Lets say i didnt want a bridge. I want instead to take these exposed ports and
>> create a bond (and maybe play with LACP). How does this propagation from
>> parent->child->child work then? I think the idea of just bonding and not
>> exposing it as a switch is a reasonable use case.
>

> Are you saying that the software should reflect the same
> functionality  the HW provides?
> In other words is creating a bridge device mandatory for supporting
> standard VLAN-bridging (as in IEEE 802.1Q) in the HW?

No it shouldn't be mandatory. And I don't think it is at the moment.
Users are free to manage the FDB tables directly via netlink or
configure the software bridge to sync them. This seems like a good
model to follow to me and we should try to get as many features as
it makes sense to follow this model.

> VLAN-bridging including port VLAN membership, VLAN filtering, PVID,
> Egress un-tagging could be supported without an explicit bridge device
> when port devices implement bridge ndos (ndo_bridge_{set,del,get}link).
> What is lost is the ability to have common handling of VLAN-aware FDB in
> the bridge module.

not sure what is lost here? Its seems the SW bridge does (or at least
could) support the same vlan capabilities. And the bridge could push
these into hardware when Roopa's offload bit is set. Or if users want
to manage everything outside bridge calling the ndo_bridge_ ops directly
works as well. By the way I believe the software linux bridge supports
most of those features you listed today. If we missed something we can
add it.

> Do we expect switch port devices to support L2 functionality both
> ways  (with and without an explicit bridge device)?

My opinion yes. But in fact the driver shouldn't care what is driving
it. The paths should be the same for direct user manipulation via
netlink and SELF|MASTER bit, bridge module, or any other in-kernel
sub-system.

> Will the decision about using a bridge device or avoiding it be left
> to the end-user?

Its a user policy decision. Again the offload bit gets us this in
a reasonably configurable way IMO.

> (This requires switch port drivers to be able to work and provide
> similar functionality in both setups).

Right, but if the drivers "care" who is calling their ndo ops something
is seriously broken. For the driver it should not need to know anything
about the callers so it doesn't matter to the driver if its a netlink
call from user space or an internal call fro bridge.ko

> I think that we need to outline the handling of L3 as it could
> determine or at least impact some of the answers to my above questions.

L3 should follow the same model. Admittedly I've not worked through the
L3 cases closely but I don't see why we can't apply the same model.
And maybe this is where we need to introduce a container to hold some
state as Jamal says. The easiest way to see this will be to look at
some proposed code.


> cheers,
> ronen
>
>> Also how does it work when i start doing L3 and the bond's port doesnt
>> support L3? Is it time to revive the thing we called TheThing in Du?
>>
>> cheers,
>> jamal
>>
>> On 12/14/14 14:41, Roopa Prabhu wrote:
>>> On 12/14/14, 7:35 AM, Jiri Pirko wrote:
>>
>> [..chopped off for brevity and saving electrons..]
>>
>> cheers,
>> jamal
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in the body
>> of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html


-- 
John Fastabend         Intel Corporation

^ permalink raw reply

* [PATCH net 1/3] net: bcmgenet: always select FIXED_PHY
From: Florian Fainelli @ 2014-12-15 17:57 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli
In-Reply-To: <1418666235-3109-1-git-send-email-f.fainelli@gmail.com>

There is no need to do the following:

select FIXED_PHY if BCMGENET=y, as this implies that we will not be able
to build and/or run the driver correctly when built as a module, which
is no longer an issue since commit 37e9a6904520 ("net: phy: export
fixed_phy_register()")

Fixes: b0ba512e225d ("net: bcmgenet: enable driver to work without device tree"
Fixes: bdaa53bde55f ("net: bcmgenet: hook into the build system")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 drivers/net/ethernet/broadcom/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
index 888247ad9068..f4789a059d54 100644
--- a/drivers/net/ethernet/broadcom/Kconfig
+++ b/drivers/net/ethernet/broadcom/Kconfig
@@ -64,7 +64,7 @@ config BCMGENET
 	tristate "Broadcom GENET internal MAC support"
 	select MII
 	select PHYLIB
-	select FIXED_PHY if BCMGENET=y
+	select FIXED_PHY
 	select BCM7XXX_PHY
 	help
 	  This driver supports the built-in Ethernet MACs found in the
-- 
2.1.0

^ permalink raw reply related

* [PATCH net 0/3] net: broadcom: fix FIXED_PHY dependencies
From: Florian Fainelli @ 2014-12-15 17:57 UTC (permalink / raw)
  To: netdev; +Cc: davem, Florian Fainelli

Hi David,

This patch series removes the bogus "select FIXED_PHY if FOO=y" that I have
been using in GENET, SYSTEMPORT and the SF2 DSA switch driver.

Thanks!

Florian Fainelli (3):
  net: bcmgenet: always select FIXED_PHY
  net: systemport: always select FIXED_PHY
  net: dsa: bcm_sf2: always select FIXED_PHY

 drivers/net/dsa/Kconfig               | 2 +-
 drivers/net/ethernet/broadcom/Kconfig | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

-- 
2.1.0

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Benjamin LaHaise @ 2014-12-15 17:57 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: Jamal Hadi Salim, Roopa Prabhu, Jiri Pirko,
	netdev@vger.kernel.org, sfeldma@gmail.com, tgraf@suug.ch,
	john.fastabend@gmail.com, stephen@networkplumber.org,
	linville@tuxdriver.com, vyasevic@redhat.com, davem@davemloft.net,
	shm@cumulusnetworks.com, gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505D999EA@ORSMSX106.amr.corp.intel.com>

On Mon, Dec 15, 2014 at 05:25:00PM +0000, Arad, Ronen wrote:
> > Ben's patches (and I am sure the cumulus folk do this) expose ports.
> > i.e you boot up the hardware and you see ports. You can then put these
> > ports in a bridge and you can offload fdbs and do other parametrization to
> > the ASIC. IOW, this only becomes a bridge because you created one in the
> > kernel and attached bridge ports to it.
> > 
> > Lets say i didnt want a bridge. I want instead to take these exposed ports and
> > create a bond (and maybe play with LACP). How does this propagation from
> > parent->child->child work then? I think the idea of just bonding and not
> > exposing it as a switch is a reasonable use case.
> 
> Are you saying that the software should reflect the same functionality the HW provides?
> In other words is creating a bridge device mandatory for supporting standard VLAN-bridging (as in IEEE 802.1Q) in the HW?

Re-using the existing Linux APIs for bridging is the approach I've taken 
with the rtl8366s driver I've been working on.  It makes no sense to create 
entirely new APIs for these operations.

> VLAN-bridging including port VLAN membership, VLAN filtering, PVID, Egress un-tagging could be supported without an explicit bridge device when port devices implement bridge ndos (ndo_bridge_{set,del,get}link). What is lost is the ability to have common handling of VLAN-aware FDB in the bridge module.
> Do we expect switch port devices to support L2 functionality both ways (with and without an explicit bridge device)?

I've already got egress untagging working if the VLANs are configured using 
vconfig and added to a bridge.  The lack of this functionality in the "new" 
API for bridging VLANs is a huge complaint I have about the new bridge 
configuration API.  I feel that API change was a step backwards in terms 
of functionality.  It also conflates the different FDBs for different VLANs 
into the same hash table, again, something i find rather messy compared to 
the 1 FDB to 1 bridge state the bridge device was in originally.

		-ben

> Will the decision about using a bridge device or avoiding it be left to the end-user?
> (This requires switch port drivers to be able to work and provide similar functionality in both setups).
> I think that we need to outline the handling of L3 as it could determine or at least impact some of the answers to my above questions.
> 
> cheers,
> ronen
>  
> > Also how does it work when i start doing L3 and the bond's port doesnt
> > support L3? Is it time to revive the thing we called TheThing in Du?
> > 
> > cheers,
> > jamal
> > 
> > On 12/14/14 14:41, Roopa Prabhu wrote:
> > > On 12/14/14, 7:35 AM, Jiri Pirko wrote:
> > 
> > [..chopped off for brevity and saving electrons..]
> > 
> > cheers,
> > jamal
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in the body
> > of a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html

-- 
"Thought is the essence of where you are now."

^ permalink raw reply

* Re: [PATCH net-next v9 0/3] add hisilicon hip04 ethernet driver
From: Arnd Bergmann @ 2014-12-15 13:29 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: Ding Tianhong, zhangfei.gao, davem, linux, f.fainelli,
	sergei.shtylyov, mark.rutland, David.Laight, eric.dumazet, xuwei5,
	netdev, devicetree
In-Reply-To: <1418298150-4944-1-git-send-email-dingtianhong@huawei.com>

On Thursday 11 December 2014 19:42:27 Ding Tianhong wrote:
> v9:
> - There is no tx completion interrupts to free DMAd Tx packets, it means taht
>   we rely on new tx packets arriving to run the destructors of completed packets,
>   which open up space in their sockets's send queues. Sometimes we don't get such
>   new packets causing Tx to stall, a single UDP transmitter is a good example of
>   this situation, so we need a clean up workqueue to reclaims completed packets,
>   the workqueue will only free the last packets which is already stay for several jiffies.
>   Also fix some format cleanups.

You must have missed the reply in which David Miller explained why
you can't call skb_orphan and rely on an occasional cleanup of the
queue. Please use something like the patch below:

- drop the broken skb_orphan call
- drop the workqueue
- batch cleanup based on tx_coalesce_frames/usecs for better throughput
- use a reasonable default tx timeout (200us, could be shorted
  based on measurements) with a range timer
- fix napi poll function return value
- use a lockless queue for cleanup

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Feel free to fold this into your patch rather than keeping it separate.
Completely untested, probably contains some bugs. Please ask if you
have questions about this.

diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
index 9d37b670db6a..d85c5287654e 100644
--- a/drivers/net/ethernet/hisilicon/hip04_eth.c
+++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
@@ -12,6 +12,7 @@
 #include <linux/etherdevice.h>
 #include <linux/platform_device.h>
 #include <linux/interrupt.h>
+#include <linux/ktime.h>
 #include <linux/of_address.h>
 #include <linux/phy.h>
 #include <linux/of_mdio.h>
@@ -157,9 +158,11 @@ struct hip04_priv {
 	struct sk_buff *tx_skb[TX_DESC_NUM];
 	dma_addr_t tx_phys[TX_DESC_NUM];
 	unsigned int tx_head;
-	unsigned int tx_tail;
-	int tx_count;
-	unsigned long last_tx;
+
+	/* FIXME: make these adjustable through ethtool */
+	int tx_coalesce_frames;
+	int tx_coalesce_usecs;
+	struct hrtimer tx_coalesce_timer;
 
 	unsigned char *rx_buf[RX_DESC_NUM];
 	dma_addr_t rx_phys[RX_DESC_NUM];
@@ -171,10 +174,15 @@ struct hip04_priv {
 	struct regmap *map;
 	struct work_struct tx_timeout_task;
 
-	struct workqueue_struct *wq;
-	struct delayed_work tx_clean_task;
+	/* written only by tx cleanup */
+	unsigned int tx_tail ____cacheline_aligned_in_smp;
 };
 
+static inline unsigned int tx_count(unsigned int head, unsigned int tail)
+{
+	return (head - tail) % (TX_DESC_NUM - 1);
+}
+
 static void hip04_config_port(struct net_device *ndev, u32 speed, u32 duplex)
 {
 	struct hip04_priv *priv = netdev_priv(ndev);
@@ -351,18 +359,21 @@ static int hip04_set_mac_address(struct net_device *ndev, void *addr)
 	return 0;
 }
 
-static void hip04_tx_reclaim(struct net_device *ndev, bool force)
+static int hip04_tx_reclaim(struct net_device *ndev, bool force)
 {
 	struct hip04_priv *priv = netdev_priv(ndev);
 	unsigned tx_tail = priv->tx_tail;
 	struct tx_desc *desc;
 	unsigned int bytes_compl = 0, pkts_compl = 0;
+	unsigned int count;
 
-	if (priv->tx_count == 0)
+	smp_rmb();
+	count = tx_count(ACCESS_ONCE(priv->tx_head), tx_tail);
+	if (count == 0)
 		goto out;
 
-	while ((tx_tail != priv->tx_head) || (priv->tx_count == TX_DESC_NUM)) {
-		desc = &priv->tx_desc[priv->tx_tail];
+	while (count) {
+		desc = &priv->tx_desc[tx_tail];
 		if (desc->send_addr != 0) {
 			if (force)
 				desc->send_addr = 0;
@@ -381,57 +392,37 @@ static void hip04_tx_reclaim(struct net_device *ndev, bool force)
 		dev_kfree_skb(priv->tx_skb[tx_tail]);
 		priv->tx_skb[tx_tail] = NULL;
 		tx_tail = TX_NEXT(tx_tail);
-		priv->tx_count--;
-
-		if (priv->tx_count <= 0)
-			break;
+		count--;
 	}
 
 	priv->tx_tail = tx_tail;
+	smp_wmb(); /* Ensure tx_tail visible to xmit */
 
-	/* Ensure tx_tail & tx_count visible to xmit */
-	smp_mb();
 out:
-
 	if (pkts_compl || bytes_compl)
 		netdev_completed_queue(ndev, pkts_compl, bytes_compl);
 
-	if (unlikely(netif_queue_stopped(ndev)) &&
-	    (priv->tx_count < TX_DESC_NUM))
+	if (unlikely(netif_queue_stopped(ndev)) && (count < (TX_DESC_NUM - 1)))
 		netif_wake_queue(ndev);
-}
 
-static void hip04_tx_clean_monitor(struct work_struct *work)
-{
-	struct hip04_priv *priv = container_of(work, struct hip04_priv,
-					       tx_clean_task.work);
-	struct net_device *ndev = priv->ndev;
-	int delta_in_ticks = msecs_to_jiffies(1000);
-
-	if (!time_in_range(jiffies, priv->last_tx,
-			   priv->last_tx + delta_in_ticks)) {
-		netif_tx_lock(ndev);
-		hip04_tx_reclaim(ndev, false);
-		netif_tx_unlock(ndev);
-	}
-	queue_delayed_work(priv->wq, &priv->tx_clean_task, delta_in_ticks);
+	return count;
 }
 
 static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 {
 	struct hip04_priv *priv = netdev_priv(ndev);
 	struct net_device_stats *stats = &ndev->stats;
-	unsigned int tx_head = priv->tx_head;
+	unsigned int tx_head = priv->tx_head, count;
 	struct tx_desc *desc = &priv->tx_desc[tx_head];
 	dma_addr_t phys;
 
-	if (priv->tx_count >= TX_DESC_NUM) {
+	smp_rmb();
+	count = tx_count(tx_head, ACCESS_ONCE(priv->tx_tail));
+	if (count == (TX_DESC_NUM - 1)) {
 		netif_stop_queue(ndev);
 		return NETDEV_TX_BUSY;
 	}
 
-	hip04_tx_reclaim(ndev, false);
-
 	phys = dma_map_single(&ndev->dev, skb->data, skb->len, DMA_TO_DEVICE);
 	if (dma_mapping_error(&ndev->dev, phys)) {
 		dev_kfree_skb(skb);
@@ -447,20 +438,33 @@ static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	desc->wb_addr = cpu_to_be32(phys);
 	skb_tx_timestamp(skb);
 
-	/* Don't wait up for transmitted skbs to be freed. */
-	skb_orphan(skb);
-
+	/* FIXME: eliminate this mmio access if xmit_more is set */
 	hip04_set_xmit_desc(priv, phys);
 	priv->tx_head = TX_NEXT(tx_head);
+	count++;
 	netdev_sent_queue(ndev, skb->len);
 
 	stats->tx_bytes += skb->len;
 	stats->tx_packets++;
-	priv->tx_count++;
-	priv->last_tx = jiffies;
 
-	/* Ensure tx_head & tx_count update visible to tx reclaim */
-	smp_mb();
+	/* Ensure tx_head update visible to tx reclaim */
+	smp_wmb();
+
+	/* queue is getting full, better start cleaning up now */
+	if (count >= priv->tx_coalesce_frames) {
+		if (napi_schedule_prep(&priv->napi)) {
+			/* disable rx interrupt and timer */
+			priv->reg_inten &= ~(RCV_INT);
+			writel_relaxed(DEF_INT_MASK & ~RCV_INT,
+				       priv->base + PPE_INTEN);
+			hrtimer_cancel(&priv->tx_coalesce_timer);
+			__napi_schedule(&priv->napi);
+		}
+	} else if (!hrtimer_is_queued(&priv->tx_coalesce_timer)) {
+		/* cleanup not pending yet, start a new timer */
+		hrtimer_start_expires(&priv->tx_coalesce_timer,
+				      HRTIMER_MODE_REL);
+	}
 
 	return NETDEV_TX_OK;
 }
@@ -477,6 +481,7 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
 	bool last = false;
 	dma_addr_t phys;
 	int rx = 0;
+	int tx_remaining;
 	u16 len;
 	u32 err;
 
@@ -513,11 +518,11 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
 
 		buf = netdev_alloc_frag(priv->rx_buf_size);
 		if (!buf)
-			return -ENOMEM;
+			goto done;
 		phys = dma_map_single(&ndev->dev, buf,
 				      RX_BUF_SIZE, DMA_FROM_DEVICE);
 		if (dma_mapping_error(&ndev->dev, phys))
-			return -EIO;
+			goto done;
 		priv->rx_buf[priv->rx_head] = buf;
 		priv->rx_phys[priv->rx_head] = phys;
 		hip04_set_recv_desc(priv, phys);
@@ -537,6 +542,11 @@ static int hip04_rx_poll(struct napi_struct *napi, int budget)
 	}
 	napi_complete(napi);
 done:
+	/* clean up tx descriptors and start a new timer if necessary */
+	tx_remaining = hip04_tx_reclaim(ndev, false);
+	if (rx < budget && tx_remaining)
+		hrtimer_start_expires(&priv->tx_coalesce_timer, HRTIMER_MODE_REL);
+
 	return rx;
 }
 
@@ -547,6 +557,9 @@ static irqreturn_t hip04_mac_interrupt(int irq, void *dev_id)
 	struct net_device_stats *stats = &ndev->stats;
 	u32 ists = readl_relaxed(priv->base + PPE_INTSTS);
 
+	if (!ists)
+		return IRQ_NONE;
+
 	writel_relaxed(DEF_INT_MASK, priv->base + PPE_RINT);
 
 	if (unlikely(ists & DEF_INT_ERR)) {
@@ -560,16 +573,32 @@ static irqreturn_t hip04_mac_interrupt(int irq, void *dev_id)
 		}
 	}
 
-	if (ists & RCV_INT) {
+	if (ists & RCV_INT && napi_schedule_prep(&priv->napi)) {
 		/* disable rx interrupt */
 		priv->reg_inten &= ~(RCV_INT);
-		writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
-		napi_schedule(&priv->napi);
+		writel_relaxed(DEF_INT_MASK & ~RCV_INT, priv->base + PPE_INTEN);
+		hrtimer_cancel(&priv->tx_coalesce_timer);
+		__napi_schedule(&priv->napi);
 	}
 
 	return IRQ_HANDLED;
 }
 
+enum hrtimer_restart tx_done(struct hrtimer *hrtimer)
+{
+	struct hip04_priv *priv;
+	priv = container_of(hrtimer, struct hip04_priv, tx_coalesce_timer);
+
+	if (napi_schedule_prep(&priv->napi)) {
+		/* disable rx interrupt */
+		priv->reg_inten &= ~(RCV_INT);
+		writel_relaxed(DEF_INT_MASK & ~RCV_INT, priv->base + PPE_INTEN);
+		__napi_schedule(&priv->napi);
+	}
+
+	return HRTIMER_NORESTART;
+}
+
 static void hip04_adjust_link(struct net_device *ndev)
 {
 	struct hip04_priv *priv = netdev_priv(ndev);
@@ -589,7 +618,6 @@ static int hip04_mac_open(struct net_device *ndev)
 	priv->rx_head = 0;
 	priv->tx_head = 0;
 	priv->tx_tail = 0;
-	priv->tx_count = 0;
 	hip04_reset_ppe(priv);
 
 	for (i = 0; i < RX_DESC_NUM; i++) {
@@ -612,9 +640,6 @@ static int hip04_mac_open(struct net_device *ndev)
 	hip04_mac_enable(ndev);
 	napi_enable(&priv->napi);
 
-	INIT_DELAYED_WORK(&priv->tx_clean_task, hip04_tx_clean_monitor);
-	queue_delayed_work(priv->wq, &priv->tx_clean_task, 0);
-
 	return 0;
 }
 
@@ -623,8 +648,6 @@ static int hip04_mac_stop(struct net_device *ndev)
 	struct hip04_priv *priv = netdev_priv(ndev);
 	int i;
 
-	cancel_delayed_work_sync(&priv->tx_clean_task);
-
 	napi_disable(&priv->napi);
 	netif_stop_queue(ndev);
 	hip04_mac_disable(ndev);
@@ -725,6 +748,7 @@ static int hip04_mac_probe(struct platform_device *pdev)
 	struct hip04_priv *priv;
 	struct resource *res;
 	unsigned int irq;
+	ktime_t txtime;
 	int ret;
 
 	ndev = alloc_etherdev(sizeof(struct hip04_priv));
@@ -751,6 +775,21 @@ static int hip04_mac_probe(struct platform_device *pdev)
 	priv->port = arg.args[0];
 	priv->chan = arg.args[1] * RX_DESC_NUM;
 
+	hrtimer_init(&priv->tx_coalesce_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+
+	/*
+	 * BQL will try to keep the TX queue as short as possible, but it can't
+	 * be faster than tx_coalesce_usecs, so we need a fast timeout here,
+	 * but also long enough to gather up enough frames to ensure we don't
+	 * get more interrupts than necessary.
+	 * 200us is enough for 16 frames of 1500 bytes at gigabit ethernet rate
+	 */
+	priv->tx_coalesce_frames = TX_DESC_NUM * 3 / 4;
+	priv->tx_coalesce_usecs = 200;
+	/* allow timer to fire after half the time at the earliest */
+	txtime = ktime_set(0, priv->tx_coalesce_usecs * NSEC_PER_USEC / 2);
+	hrtimer_set_expires_range(&priv->tx_coalesce_timer, txtime, txtime);
+
 	priv->map = syscon_node_to_regmap(arg.np);
 	if (IS_ERR(priv->map)) {
 		dev_warn(d, "no syscon hisilicon,hip04-ppe\n");
@@ -788,12 +827,6 @@ static int hip04_mac_probe(struct platform_device *pdev)
 		}
 	}
 
-	priv->wq = create_singlethread_workqueue(ndev->name);
-	if (!priv->wq) {
-		ret = -ENOMEM;
-		goto init_fail;
-	}
-
 	INIT_WORK(&priv->tx_timeout_task, hip04_tx_timeout_task);
 
 	ether_setup(ndev);
@@ -848,8 +881,6 @@ static int hip04_remove(struct platform_device *pdev)
 	free_irq(ndev->irq, ndev);
 	of_node_put(priv->phy_node);
 	cancel_work_sync(&priv->tx_timeout_task);
-	if (priv->wq)
-		destroy_workqueue(priv->wq);
 	free_netdev(ndev);
 
 	return 0;

^ permalink raw reply related

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-15 17:25 UTC (permalink / raw)
  To: Jamal Hadi Salim, Roopa Prabhu, Jiri Pirko,
	netdev@vger.kernel.org
  Cc: sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	john.fastabend@gmail.com, stephen@networkplumber.org,
	linville@tuxdriver.com, vyasevic@redhat.com, davem@davemloft.net,
	shm@cumulusnetworks.com, gospo@cumulusnetworks.com
In-Reply-To: <548EFD94.8070905@mojatatu.com>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of Jamal Hadi Salim
> Sent: Monday, December 15, 2014 5:26 PM
> To: Roopa Prabhu; Jiri Pirko
> Cc: sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> john.fastabend@gmail.com; stephen@networkplumber.org;
> linville@tuxdriver.com; vyasevic@redhat.com; netdev@vger.kernel.org;
> davem@davemloft.net; shm@cumulusnetworks.com;
> gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> 
> Sorry - i didnt quiet follow the discussion, but i can see the value of
> propagating things from parent to children netdevs as part of the generic
> approach. And in that spirit:
> 
> Ben's patches (and I am sure the cumulus folk do this) expose ports.
> i.e you boot up the hardware and you see ports. You can then put these
> ports in a bridge and you can offload fdbs and do other parametrization to
> the ASIC. IOW, this only becomes a bridge because you created one in the
> kernel and attached bridge ports to it.
> 
> Lets say i didnt want a bridge. I want instead to take these exposed ports and
> create a bond (and maybe play with LACP). How does this propagation from
> parent->child->child work then? I think the idea of just bonding and not
> exposing it as a switch is a reasonable use case.

Are you saying that the software should reflect the same functionality the HW provides?
In other words is creating a bridge device mandatory for supporting standard VLAN-bridging (as in IEEE 802.1Q) in the HW?
VLAN-bridging including port VLAN membership, VLAN filtering, PVID, Egress un-tagging could be supported without an explicit bridge device when port devices implement bridge ndos (ndo_bridge_{set,del,get}link). What is lost is the ability to have common handling of VLAN-aware FDB in the bridge module.
Do we expect switch port devices to support L2 functionality both ways (with and without an explicit bridge device)?
Will the decision about using a bridge device or avoiding it be left to the end-user?
(This requires switch port drivers to be able to work and provide similar functionality in both setups).
I think that we need to outline the handling of L3 as it could determine or at least impact some of the answers to my above questions.

cheers,
ronen
 
> Also how does it work when i start doing L3 and the bond's port doesnt
> support L3? Is it time to revive the thing we called TheThing in Du?
> 
> cheers,
> jamal
> 
> On 12/14/14 14:41, Roopa Prabhu wrote:
> > On 12/14/14, 7:35 AM, Jiri Pirko wrote:
> 
> [..chopped off for brevity and saving electrons..]
> 
> cheers,
> jamal
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in the body
> of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH iproute2 v4] ip: Simplify executing ip cmd within network ns
From: Nicolas Dichtel @ 2014-12-15 17:30 UTC (permalink / raw)
  To: Jiri Pirko, vadim4j; +Cc: netdev
In-Reply-To: <20141213152024.GB1849@nanopsycho.orion>

Le 13/12/2014 16:20, Jiri Pirko a écrit :
> Sat, Dec 13, 2014 at 02:32:10PM CET, vadim4j@gmail.com wrote:
>> On Sat, Dec 13, 2014 at 10:58:03AM +0200, vadim4j@gmail.com wrote:
>>> On Sat, Dec 13, 2014 at 10:42:43AM +0200, vadim4j@gmail.com wrote:
>>>> On Sat, Dec 13, 2014 at 09:29:36AM +0100, Jiri Pirko wrote:
>>>>> Fri, Dec 12, 2014 at 11:15:07PM CET, vadim4j@gmail.com wrote:
>>>>>> From: Vadim Kochan <vadim4j@gmail.com>
>>>>>>
>>>>>> Added new '-netns' option to simplify executing following cmd:
>>>>>>
>>>>>>     ip netns exec NETNS ip OPTIONS COMMAND OBJECT
>>>>>>
>>>>>>     to
>>>>>>
>>>>>>     ip -n[etns] NETNS OPTIONS COMMAND OBJECT
>>>>>>
>>>>>> e.g.:
>>>>>>
>>>>>>     ip -net vnet0 link add br0 type bridge
>>>>>>     ip -n vnet0 link
>>>>>>
>>>>>> Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
>>>>>
>>>>>
>>>>> This looks good. I'm still missing support in tc, bridge, etc. I think
>>>>> it would be great to do this in the same patch/patchset.
>>>>>
>>>>   I planned to do this in the future patches after this main
>>>>   changes will be accepted. Actually adding this option to other
>>>>   tools is trivial.
>>>>
>>>>   Anyway may be I will re-send v5 with supporting of these tools if I will have time.
>>>>
>>>>   Regards,
>>>
>>> BTW, some tools already have '-n' option, so I think only '-net' can be
>>> used in such cases.
>
>
> Yep, that is my point. I would like to have the same option for all.
Agreed. The real option name is '-netns'. The fact that '-n' will work comes
from how 'ip' is implemented. This kind of shortcut will depend on each tool
implementation. But again, the *real* option name is '-netns' ;-)


Regards,
Nicolas

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Roopa Prabhu @ 2014-12-15 17:20 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jiri Pirko, sfeldma, bcrl, tgraf, john.fastabend, stephen,
	linville, vyasevic, netdev, davem, shm, gospo
In-Reply-To: <548EFD94.8070905@mojatatu.com>

On 12/15/14, 7:26 AM, Jamal Hadi Salim wrote:
>
> Sorry - i didnt quiet follow the discussion, but i can see the value
> of propagating things from parent to children netdevs as part of the
> generic approach. And in that spirit:
>
> Ben's patches (and I am sure the cumulus folk do this) expose ports.
> i.e you boot up the hardware and you see ports. You can then put these
> ports in a bridge and you can offload fdbs and do other parametrization
> to the ASIC. IOW, this only becomes a bridge because you created one
> in the kernel and attached bridge ports to it.
>
> Lets say i didnt want a bridge. I want instead to take these exposed
> ports and create a bond (and maybe play with LACP). How does this
> propagation from parent->child->child work then? I think the idea
> of just bonding and not exposing it as a switch is a reasonable use
> case.

We have not come to pure bonding and lacp yet (but i have mentioned it 
in many contexts before).
The use case you mention is offloading bond attributes. This will be 
addressed as part of ongoing switchdev work
for all other offloads (bonds, vxlans etc).
Right now we are only talking bridge port attribute offload 
(learn/flood/port state etc). This could still be a bridge port 
attribute on a bond
when the bond is a bridge port.

> Also how does it work when i start doing L3 and the bond's port doesnt
> support L3? Is it time to revive the thing we called TheThing in Du?
yes, exactly. This is what i had indicated in my initial emails on this 
thread when the stacked devices topic came up.
Since there was reluctance in introducing a switch device (theThing), My 
current patch tries to do that without a switch device.
Since this is still l2, and we are dealing with bridge port attributes, 
my current patch traverses the stacked netdevs to call the 
ndo_bridge_setlink on the switch port.

When it comes to l3, we can follow the same.., but as discussed in Du, 
there are other reasons where a switch device becomes necessary.
I can submit an alternate series to cover the switch device approach for 
l2 as well.

Thanks,
Roopa

^ permalink raw reply

* Re: [PATCH 0/2] net/macb: fix multiqueue support patch up
From: David Miller @ 2014-12-15 16:51 UTC (permalink / raw)
  To: cyrille.pitchen
  Cc: nicolas.ferre, linux-arm-kernel, netdev, soren.brinkmann,
	linux-kernel
In-Reply-To: <cover.1418651689.git.cyrille.pitchen@atmel.com>

From: Cyrille Pitchen <cyrille.pitchen@atmel.com>
Date: Mon, 15 Dec 2014 15:13:30 +0100

> This series of patches is a fixup for the multiqueue support patch.
> 
> The first patch fixes a bug introduced by the multiqueue support patch.
> The second one doesn't fix a bug but simplify the source code by removing
> useless calls to devm_free_irq() since we use managed device resources for
> IRQs.
> 
> They were applied on the net-next tree and tested with a sama5d36ek board.

Series applied, thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox