Netdev List
 help / color / mirror / Atom feed
* [PATCH net 2/2] geneve: Fix races between socket add and release.
From: Jesse Gross @ 2014-12-17  2:25 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Andy Zhou
In-Reply-To: <1418783132-99230-1-git-send-email-jesse@nicira.com>

Currently, searching for a socket to add a reference to is not
synchronized with deletion of sockets. This can result in use
after free if there is another operation that is removing a
socket at the same time. Solving this requires both holding the
appropriate lock and checking the refcount to ensure that it
has not already hit zero.

Inspired by a related (but not exactly the same) issue in the
VXLAN driver.

Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver")
CC: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
---
 net/ipv4/geneve.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/geneve.c b/net/ipv4/geneve.c
index 5a47188..95e47c9 100644
--- a/net/ipv4/geneve.c
+++ b/net/ipv4/geneve.c
@@ -296,6 +296,7 @@ struct geneve_sock *geneve_sock_add(struct net *net, __be16 port,
 				    geneve_rcv_t *rcv, void *data,
 				    bool no_share, bool ipv6)
 {
+	struct geneve_net *gn = net_generic(net, geneve_net_id);
 	struct geneve_sock *gs;
 
 	gs = geneve_socket_create(net, port, rcv, data, ipv6);
@@ -305,15 +306,15 @@ struct geneve_sock *geneve_sock_add(struct net *net, __be16 port,
 	if (no_share)	/* Return error if sharing is not allowed. */
 		return ERR_PTR(-EINVAL);
 
+	spin_lock(&gn->sock_lock);
 	gs = geneve_find_sock(net, port);
-	if (gs) {
-		if (gs->rcv == rcv)
-			atomic_inc(&gs->refcnt);
-		else
+	if (gs && ((gs->rcv != rcv) ||
+		   !atomic_add_unless(&gs->refcnt, 1, 0)))
 			gs = ERR_PTR(-EBUSY);
-	} else {
+	spin_unlock(&gn->sock_lock);
+
+	if (!gs)
 		gs = ERR_PTR(-EINVAL);
-	}
 
 	return gs;
 }
-- 
1.9.1

^ permalink raw reply related

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Martin Lau @ 2014-12-17  1:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Blake Matheny, Laurent Chavey, Yuchung Cheng,
	netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <CAADnVQKtEfrEpAqMxf_aOmHmF4WeOQUD-zZpVYR9jnQUjHQgVg@mail.gmail.com>

On Tue, Dec 16, 2014 at 04:15:24PM -0800, Alexei Starovoitov wrote:
> On Tue, Dec 16, 2014 at 10:28 AM, Martin Lau <kafai@fb.com> wrote:
> > We can consider to reuse the events's format (tracing/events/*/format). I think
> > blktrace.c is using similar approach in trace-cmd.
> 
> yes. tcp_trace is a carbon copy of blktrace applied to tcp.
> 
> >> >> I think systemtap like scripting on top of patches 1 and 3
> >> >> should solve your use case ?
> > We have quite a few different versions running in the production.  It may not
> > be operationally easy.
> 
> different versions of kernel or different versions of tcp_tracer ?
Former and we are releasing new kernel pretty often.

> 
> > Having a getsockopt will be useful for the new application/library to take
> > advantage of.
> >
> > For the continuous monitoring/logging purpose, ftrace can provide event
> > triggered tracing instead of periodically consulting ss.
> 
> so both getsockopt tcp_info approach and ftrace+tcp_trace
> approach can provide the same set of stats per flow, right?
> And the only difference is 'ss' needs polling and ftrace
> collects all events?
> Since they're stats anyway, the polling interval
> shouldn't matter. Just like lost trace events?
> 
> from patch 5 commit log:
> "Define probes and register them to the TCP tracepoints.  The probes
> collect the data defined in struct tcp_sk_trace and record them to
> the tracing's ring_buffer.
> "
> so two trace_seq_printf() from patch 5
> and two new 'struct tcp_trace_stats' and 'tcp_trace_basic'
> from patch 4 will become permanent user api.
> 
> At the same time the commit log is saying:
> "It is still missing a few things that
> we currently have, like:
> - why the sender is blocked? and how long for each reason?
> - some TCP Congestion Control data"
> 
> Does it mean that these printf and structs would have
> to change?
How does the current TRACE_EVENT do it when it wants to printf more data?

> Can 'struct tcp_info' be extended instead of
> adding 'struct tcp_trace_stats' ?
> Then getsockopt and ftrace+tcp_trace will be returning
> the same structs.
> 
> It feels that for stats collection only, tracepoints+tcp_trace
> do not add much additional value vs extending tcp_info
> and using ss.
I think we are on the same page. Once 'this should cost nothing if not
activated' proposition was cleared out.  It was what I meant that doing the
collection part in the TCP itself (instead of tracepoints) would be nice.

> I see the value in tracepoints on its own, since we'll
> be able to use dynamic tracing to do event aggregation,
> filtering, etc. That was my alternative suggestion to
> add only tracepoints from patches 1 and 3.

I think going forward, as others have suggested, it may be better to come
together and reach a common ground on what to collect first before I re-work
patch 1 to 3 and repost.

Thanks,
--Martin

^ permalink raw reply

* Re: net: integer overflow in ip_idents_reserve
From: Sasha Levin @ 2014-12-17  1:15 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Eric Dumazet
  Cc: David S. Miller, LKML, netdev, Andrey Ryabinin, Dave Jones
In-Reply-To: <1418771356.3449499.203748285.4B1A82B8@webmail.messagingengine.com>

On 12/16/2014 06:09 PM, Hannes Frederic Sowa wrote:
> 
> On Tue, Dec 16, 2014, at 22:47, Eric Dumazet wrote:
>> > On Tue, 2014-12-16 at 16:19 -0500, Sasha Levin wrote:
>>> > > Hi Eric,
>>> > > 
>>> > > While fuzzing with trinity on a -next kernel with the undefined behaviour
>>> > > sanitizer path, I've observed the following warning in code which was
>>> > > introduced in 04ca6973f7 ("ip: make IP identifiers less predictable"):
>> > 
>> > This is a false positive.
> Also we compile the whole kernel with -fno-strict-overflow, so every
> report of signed overflow leading to undefined behavior is probably a
> false positive. I don't know if it is worth to try to get rid of them, I
> doubt it.

I reported this one because there's usually some code to handle overflow
in code that expects that and here there was none (I could see).

For example, the ntp code had a few cases where a user could generate
overflows and mess up quite a few things (he got what he asked for -
problems).


Thanks,
Sasha

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Alexei Starovoitov @ 2014-12-17  0:15 UTC (permalink / raw)
  To: Martin Lau
  Cc: Eric Dumazet, Blake Matheny, Laurent Chavey, Yuchung Cheng,
	netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team

On Tue, Dec 16, 2014 at 10:28 AM, Martin Lau <kafai@fb.com> wrote:
> We can consider to reuse the events's format (tracing/events/*/format). I think
> blktrace.c is using similar approach in trace-cmd.

yes. tcp_trace is a carbon copy of blktrace applied to tcp.

>> >> I think systemtap like scripting on top of patches 1 and 3
>> >> should solve your use case ?
> We have quite a few different versions running in the production.  It may not
> be operationally easy.

different versions of kernel or different versions of tcp_tracer ?

> Having a getsockopt will be useful for the new application/library to take
> advantage of.
>
> For the continuous monitoring/logging purpose, ftrace can provide event
> triggered tracing instead of periodically consulting ss.

so both getsockopt tcp_info approach and ftrace+tcp_trace
approach can provide the same set of stats per flow, right?
And the only difference is 'ss' needs polling and ftrace
collects all events?
Since they're stats anyway, the polling interval
shouldn't matter. Just like lost trace events?

from patch 5 commit log:
"Define probes and register them to the TCP tracepoints.  The probes
collect the data defined in struct tcp_sk_trace and record them to
the tracing's ring_buffer.
"
so two trace_seq_printf() from patch 5
and two new 'struct tcp_trace_stats' and 'tcp_trace_basic'
from patch 4 will become permanent user api.

At the same time the commit log is saying:
"It is still missing a few things that
we currently have, like:
- why the sender is blocked? and how long for each reason?
- some TCP Congestion Control data"

Does it mean that these printf and structs would have
to change?

Can 'struct tcp_info' be extended instead of
adding 'struct tcp_trace_stats' ?

Then getsockopt and ftrace+tcp_trace will be returning
the same structs.

It feels that for stats collection only, tracepoints+tcp_trace
do not add much additional value vs extending tcp_info
and using ss.

I see the value in tracepoints on its own, since we'll
be able to use dynamic tracing to do event aggregation,
filtering, etc. That was my alternative suggestion to
add only tracepoints from patches 1 and 3.

^ permalink raw reply

* Re: Netlink mmap tx security?
From: Eric Dumazet @ 2014-12-17  0:02 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, dborkman, luto, torvalds, kaber, netdev
In-Reply-To: <20141216.175817.576861457076632402.davem@davemloft.net>

On Tue, 2014-12-16 at 17:58 -0500, David Miller wrote:

> +		__skb_put(skb, nm_len);
> +		memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, nm_len);
> +		netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
>  

Not related to this patch, but it looks like netlink_set_status()
barrier is wrong ?

It seems we need smp_wmb() after the memcpy() and before the
hdr->nm_status = status; 

^ permalink raw reply

* Re: Netlink mmap tx security?
From: Daniel Borkmann @ 2014-12-16 23:58 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, luto, torvalds, kaber, netdev
In-Reply-To: <20141216.175817.576861457076632402.davem@davemloft.net>

On 12/16/2014 11:58 PM, David Miller wrote:
> From: Thomas Graf <tgraf@suug.ch>
> Date: Thu, 16 Oct 2014 08:07:53 +0100
>
>> On 10/16/14 at 08:45am, Daniel Borkmann wrote:
>>> On 10/15/2014 04:09 AM, David Miller wrote:
>>>> That would work as well.
>>>>
>>>> There are pros and cons to all of these approaches.
>>>>
>>>> I was thinking that if we do the "TX mmap --> copy to kernel buffer"
>>>> approach, then if in the future we find a way to make it work
>>>> reliably, we can avoid the copy.  And frankly performance wise it's no
>>>> worse than what happens via normal sendmsg() calls.
>>>>
>>>> And all applications using NETLINK_RX_RING keep working and keep
>>>> getting the performance boost.
>>>
>>> That would be better, yes. This would avoid having such a TPACKET_V*
>>> API chaos we have in packet sockets if this could be fixed for netlink
>>> eventually.
>>
>> Only saw the second part of Dave's message now. I agree that this
>> is even a better option.
>
> Sorry for dropping the ball on this, here is the patch I have right
> now, any comments?
>
> ====================
> [PATCH] netlink: Always copy on mmap TX.
>
> Checking the file f_count and the nlk->mapped count is not completely
> sufficient to prevent the mmap'd area contents from changing from
> under us during netlink mmap sendmsg() operations.
>
> Be careful to sample the header's length field only once, because this
> could change from under us as well.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>

Looks good to me, thanks for following up on this Dave!

Now this also leaves NETLINK_SKB_TX unused afterwards, so we should
get rid of these bits, too. It's only set in the broken 'exclusive'
path and tested for resetting the slot status in the skb destructor,
but _now_ we set it back immediately to NL_MMAP_STATUS_UNUSED instead
of a NL_MMAP_STATUS_RESERVED -> NL_MMAP_STATUS_UNUSED transition via
destructor.

The atomic (tx)ring->pending logic inside the send path seems also
unused now; similarly as in, resp. taken from packet sockets, it was
there to wait for completion (for blocking syscalls) until the
netlink_skb_destructor() has been invoked and thus the slot given
back to user space.

Anyways, above could probably be followed-up; other than that:

Fixes: 5fd96123ee19 ("netlink: implement memory mapped sendmsg()")
Acked-by: Daniel Borkmann <dborkman@redhat.com>

^ permalink raw reply

* Re: net: integer overflow in ip_idents_reserve
From: Eric Dumazet @ 2014-12-16 23:22 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Sasha Levin, David S. Miller, LKML, netdev, Andrey Ryabinin,
	Dave Jones
In-Reply-To: <1418771356.3449499.203748285.4B1A82B8@webmail.messagingengine.com>

On Wed, 2014-12-17 at 00:09 +0100, Hannes Frederic Sowa wrote:


> Also we compile the whole kernel with -fno-strict-overflow, so every
> report of signed overflow leading to undefined behavior is probably a
> false positive. I don't know if it is worth to try to get rid of them, I
> doubt it.

Presumably we could have uatomic_t , or atomic_u32_t, whatever...

This particular xadd() is heavily hit in some cases, we really do not
want a cmpxchg()

^ permalink raw reply

* Re: net: integer overflow in ip_idents_reserve
From: Hannes Frederic Sowa @ 2014-12-16 23:09 UTC (permalink / raw)
  To: Eric Dumazet, Sasha Levin
  Cc: David S. Miller, LKML, netdev, Andrey Ryabinin, Dave Jones
In-Reply-To: <1418766460.9773.48.camel@edumazet-glaptop2.roam.corp.google.com>



On Tue, Dec 16, 2014, at 22:47, Eric Dumazet wrote:
> On Tue, 2014-12-16 at 16:19 -0500, Sasha Levin wrote:
> > Hi Eric,
> > 
> > While fuzzing with trinity on a -next kernel with the undefined behaviour
> > sanitizer path, I've observed the following warning in code which was
> > introduced in 04ca6973f7 ("ip: make IP identifiers less predictable"):
> 
> This is a false positive.

Also we compile the whole kernel with -fno-strict-overflow, so every
report of signed overflow leading to undefined behavior is probably a
false positive. I don't know if it is worth to try to get rid of them, I
doubt it.

Bye,
Hannes

^ permalink raw reply

* Re: Netlink mmap tx security?
From: David Miller @ 2014-12-16 22:58 UTC (permalink / raw)
  To: tgraf; +Cc: dborkman, luto, torvalds, kaber, netdev
In-Reply-To: <20141016070753.GA16738@casper.infradead.org>

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 16 Oct 2014 08:07:53 +0100

> On 10/16/14 at 08:45am, Daniel Borkmann wrote:
>> On 10/15/2014 04:09 AM, David Miller wrote:
>> >That would work as well.
>> >
>> >There are pros and cons to all of these approaches.
>> >
>> >I was thinking that if we do the "TX mmap --> copy to kernel buffer"
>> >approach, then if in the future we find a way to make it work
>> >reliably, we can avoid the copy.  And frankly performance wise it's no
>> >worse than what happens via normal sendmsg() calls.
>> >
>> >And all applications using NETLINK_RX_RING keep working and keep
>> >getting the performance boost.
>> 
>> That would be better, yes. This would avoid having such a TPACKET_V*
>> API chaos we have in packet sockets if this could be fixed for netlink
>> eventually.
> 
> Only saw the second part of Dave's message now. I agree that this
> is even a better option.

Sorry for dropping the ball on this, here is the patch I have right
now, any comments?

====================
[PATCH] netlink: Always copy on mmap TX.

Checking the file f_count and the nlk->mapped count is not completely
sufficient to prevent the mmap'd area contents from changing from
under us during netlink mmap sendmsg() operations.

Be careful to sample the header's length field only once, because this
could change from under us as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 52 +++++++++++++++---------------------------------
 1 file changed, 16 insertions(+), 36 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ef5f77b..a64680a 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -525,14 +525,14 @@ out:
 	return err;
 }
 
-static void netlink_frame_flush_dcache(const struct nl_mmap_hdr *hdr)
+static void netlink_frame_flush_dcache(const struct nl_mmap_hdr *hdr, unsigned int nm_len)
 {
 #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
 	struct page *p_start, *p_end;
 
 	/* First page is flushed through netlink_{get,set}_status */
 	p_start = pgvec_to_page(hdr + PAGE_SIZE);
-	p_end   = pgvec_to_page((void *)hdr + NL_MMAP_HDRLEN + hdr->nm_len - 1);
+	p_end   = pgvec_to_page((void *)hdr + NL_MMAP_HDRLEN + nm_len - 1);
 	while (p_start <= p_end) {
 		flush_dcache_page(p_start);
 		p_start++;
@@ -714,24 +714,16 @@ static int netlink_mmap_sendmsg(struct sock *sk, struct msghdr *msg,
 	struct nl_mmap_hdr *hdr;
 	struct sk_buff *skb;
 	unsigned int maxlen;
-	bool excl = true;
 	int err = 0, len = 0;
 
-	/* Netlink messages are validated by the receiver before processing.
-	 * In order to avoid userspace changing the contents of the message
-	 * after validation, the socket and the ring may only be used by a
-	 * single process, otherwise we fall back to copying.
-	 */
-	if (atomic_long_read(&sk->sk_socket->file->f_count) > 1 ||
-	    atomic_read(&nlk->mapped) > 1)
-		excl = false;
-
 	mutex_lock(&nlk->pg_vec_lock);
 
 	ring   = &nlk->tx_ring;
 	maxlen = ring->frame_size - NL_MMAP_HDRLEN;
 
 	do {
+		unsigned int nm_len;
+
 		hdr = netlink_current_frame(ring, NL_MMAP_STATUS_VALID);
 		if (hdr == NULL) {
 			if (!(msg->msg_flags & MSG_DONTWAIT) &&
@@ -739,35 +731,23 @@ static int netlink_mmap_sendmsg(struct sock *sk, struct msghdr *msg,
 				schedule();
 			continue;
 		}
-		if (hdr->nm_len > maxlen) {
+
+		nm_len = ACCESS_ONCE(hdr->nm_len);
+		if (nm_len > maxlen) {
 			err = -EINVAL;
 			goto out;
 		}
 
-		netlink_frame_flush_dcache(hdr);
+		netlink_frame_flush_dcache(hdr, nm_len);
 
-		if (likely(dst_portid == 0 && dst_group == 0 && excl)) {
-			skb = alloc_skb_head(GFP_KERNEL);
-			if (skb == NULL) {
-				err = -ENOBUFS;
-				goto out;
-			}
-			sock_hold(sk);
-			netlink_ring_setup_skb(skb, sk, ring, hdr);
-			NETLINK_CB(skb).flags |= NETLINK_SKB_TX;
-			__skb_put(skb, hdr->nm_len);
-			netlink_set_status(hdr, NL_MMAP_STATUS_RESERVED);
-			atomic_inc(&ring->pending);
-		} else {
-			skb = alloc_skb(hdr->nm_len, GFP_KERNEL);
-			if (skb == NULL) {
-				err = -ENOBUFS;
-				goto out;
-			}
-			__skb_put(skb, hdr->nm_len);
-			memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
-			netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
+		skb = alloc_skb(nm_len, GFP_KERNEL);
+		if (skb == NULL) {
+			err = -ENOBUFS;
+			goto out;
 		}
+		__skb_put(skb, nm_len);
+		memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, nm_len);
+		netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
 
 		netlink_increment_head(ring);
 
@@ -813,7 +793,7 @@ static void netlink_queue_mmaped_skb(struct sock *sk, struct sk_buff *skb)
 	hdr->nm_pid	= NETLINK_CB(skb).creds.pid;
 	hdr->nm_uid	= from_kuid(sk_user_ns(sk), NETLINK_CB(skb).creds.uid);
 	hdr->nm_gid	= from_kgid(sk_user_ns(sk), NETLINK_CB(skb).creds.gid);
-	netlink_frame_flush_dcache(hdr);
+	netlink_frame_flush_dcache(hdr, hdr->nm_len);
 	netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
 
 	NETLINK_CB(skb).flags |= NETLINK_SKB_DELIVERED;
-- 
2.1.0

^ permalink raw reply related

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Hannes Frederic Sowa @ 2014-12-16 22:50 UTC (permalink / raw)
  To: David Miller, jbaron
  Cc: jbacik, eric.dumazet, alexei.starovoitov, chavey, ycheng, kafai,
	netdev, rostedt, brakmo, Kernel-team, Daniel Borkmann,
	Florian Westphal
In-Reply-To: <20141216.174524.888376617592272638.davem@davemloft.net>

On Tue, Dec 16, 2014, at 23:45, David Miller wrote:
> From: Jason Baron <jbaron@akamai.com>
> Date: Tue, 16 Dec 2014 17:40:47 -0500
> 
> > We are interested in tcp tracing as well. Another requirement that
> > we have that I don't think I saw is the ability to start/stop
> > tracing on sockets (potentially multiple times) during the lifetime
> > of a connection. So for example, the ability to use setsockopt(), to
> > selectively start/stop tracing on a connection, so as not to incur
> > overhead for non-traced sockets.
> 
> This is so backwards.
> 
> You make the tracing cheap enough that this can never be an issue.
> 
> Your requirement can only exist if the implementation is broken
> by design.

An idea I had was to add a proxy tcp congestion control which could be
selectively chosen per destination as soon as Daniel's patchset hits
net-next or one could enable it globally by
sys/net/ipv4/tcp_congestion_control. Needed tracepoints could be
installed in the congestion_ops handlers and the additional storage
could live inside the private data of the congestion control handler.
Further callbacks could go to a chained congestion control handler. The
names could be extended like "t:cubic", t for tracing.

Do the congestion control callbacks provide enough insight to the
connection state?

Bye,
Hannes

^ permalink raw reply

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16 22:46 UTC (permalink / raw)
  To: B Viswanath, netdev@vger.kernel.org
  Cc: John Fastabend, Roopa Prabhu, Jamal Hadi Salim, Jiri Pirko,
	sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <CAN+pFwJgbbgYjU3-P=3h7G1v8Sz=naDAU880-6LJ3LTTciz7nA@mail.gmail.com>



> -----Original Message-----
> From: B Viswanath [mailto:marichika4@gmail.com]
> Sent: Tuesday, December 16, 2014 11:52 PM
> To: Arad, Ronen
> Cc: netdev@vger.kernel.org; John Fastabend; Roopa Prabhu; Jamal Hadi
> Salim; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 17 December 2014 at 02:22, Arad, Ronen <ronen.arad@intel.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: netdev-owner@vger.kernel.org [mailto:netdev-
> >> owner@vger.kernel.org] On Behalf Of B Viswanath
> >> Sent: Tuesday, December 16, 2014 9:24 PM
> >> To: Arad, Ronen
> >> Cc: John Fastabend; netdev@vger.kernel.org; Roopa Prabhu; Jamal Hadi
> >> Salim; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> >> stephen@networkplumber.org; linville@tuxdriver.com;
> >> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com;
> >> gospo@cumulusnetworks.com
> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
> >> del bridge port attributes
> >>
> >> Hi,
> >>
> >> This is my first email on this thread, and on this list. My apologies
> >> if I have not understood something correctly. I would like to
> >> participate  in this discussion, which is one of the reasons I joined
> >> this list recently.  Some feedback inline below.
> >>
> >> On 16 December 2014 at 22:59, Arad, Ronen <ronen.arad@intel.com>
> wrote:
> >> >
> >> >
> > <sniped for brevity>
> >> >>
> >> >> I'm still missing why there is duplicate implementations in the driver.
> >> >> If the driver implements the set of ndo ops why should it care who
> >> >> calls them? I think you tried to explain this already but I'm not seeing it.
> >> >>
> >> >
> >> > Let's consider a bridge property. I'll use the default PVID
> >> > attribute as an
> >> example. This is currently configurable by sysfs only and a netlink
> >> support for that is still due. Let's assume for our discussion that a
> >> DEAFAULT_PVID attribute will be added as a bridge attribute within
> >> AFSPEC nested attribute of AF_BRIDGE SETLINK message.
> >> > When a bridge device is present, this attribute is processed by the
> >> > bridge
> >> module and saved as default_pvid field in net_bridge structure. When
> >> a switch port is enslaved to a bridge, the bridge driver creates a
> >> net_bridge_port instance and assigns it a pvid inherited from the
> >> default_pvid attribute of the bridge. Setting the pvid for a new
> >> enslaved switch port is not done via netlink. It only applies to the
> >> net_bridge_port structure which is internal to the bridge module.
> >> Offloading this to HW is not addressed with current bridge offloading.
> >> >
> >> > When a bridge device is not used, the DEFAULT_PVID will be targeted
> >> > using
> >> the SELF flag to any of the switch ports. The driver will recognize
> >> that as a bridge port and will need to maintain some switch global
> >> structure similar to net_bridge where it could save the default_pvid.
> >> The driver, knowing that the switch port is not enslaved to a bridge,
> >> will have to replicate the same functionality. In the HW case, it
> >> will have to configure default VLAN on all the switch ports.
> >> > This is different from the yet to be defined way of propagating
> >> > default PVID
> >> from a bridge device to offloaded bridge ports.
> >> >
> >> > Another example is STP. STP attributes are bridge attributes which
> >> > are not
> >> offloaded when a bridge device is present. The bridge module handles
> >> STP protocol internally. Without bridge device, STP attributes have
> >> to be targeted at a switch port device and the driver should save
> >> them in driver-specific structures and have proprietary
> >> implementation of STP (as the one in the bridge module is not used).
> >>
> >> In general I feel that the switch-device and port relation should be
> >> that of the 'container-containee'. This is the actual physical
> >> relationship.  Apart from some operations such as vlans and protocol
> >> related, it is tricky to model all operations directly on ports. My
> >> thinking is it is cleaner to have all operations be on switch-device,
> >> which in turn peculates the operations downward, to its contained
> >> ports as applicable.  The offloading is really a property of the
> >> switch device and not individual ports. Similarly the FDB is
> >> maintained by the switch and not the ports. As we extend the current
> >> offloading mechanism to other L2, L3 and other features, we may find it
> easier to have a 'switch- device' in place.
> >>
> >> I am somewhat confused with the notion of bridges though. Many
> >> existing linux-based routers use bridges differently than as a vlan-
> broadcast-domain.
> >> For example it is common to have eth0.334 and
> >> eth1 in the same bridge. What is being done internally is that the
> >> additional vlan tag 334 (which indicates video traffic, say) is
> >> removed and that video traffic is being bridged to eth1. There is no
> >> default vlan for this bridge.  This is a software bridge. I am not
> >> sure how this can be accomplished if there is a need to associate a vlan
> with a bridge.
> >>
> >> Thanks
> >> Viswanath
> >>
> >>
> >
> > Let's say we have three ports sp1, sp2, sp3 and two VLANs 10 and 20.
> > VLAN 10 is allowed on ports sp1 and sp2 and VLAN 20 is allowed on ports
> sp2 and sp3.
> > Let's say sp1 and sp3 are access ports and carry untagged traffic. Sp2 is an
> uplink trunk port and carries tagged traffic.
> >
> > This could be modeled using Linux bridge in at least two ways:
> >
> > 1) Bridge per VLAN
> >         - Two bridge devices are used br10 and br20
> >         - sp2.10 is a vlan interface on sp2 for VLAN 10
> >         - sp2.20 is a vlan interface on sp2 for VLAN 20
> >         - br10 enslaves ports sp1 and sp2.10
> >         - br20 enslaves ports sp3 and sp2.20
> >         - br10 and br20 could be L3 interfaces and have IP address assigned to.
> >                  This allows for routing between VLANs.
> >         - Traffic sent to br10 will egress untagged on sp1 and tagged with
> VID=10 on sp2
> >         - Traffic sent to br20 will egress untagged on sp2 and tagged with
> VID=20 on sp2
> >         - Traffic received on sp1 is delivered to br10 and could be flooded if
> MAC DA is broadcast or unknown unicast.
> >         - Traffic received on sp2 with VID=10 is delivered to br10 after the
> VLAN tag is removed.
> >         - Similarly traffic receive on sp3 or tagged traffic with
> > VID=20 on sp2 is delivered to br20
> 
> Thanks for the explanation. Understood this, and this is how I
> saw/implemented things so far. However, my understanding is that both
> br10 and br20 have nothing to do with vlan 10 and 20 respectively. The traffic
> the actual bridges see (say if we ran tcpdump on br10 or br20) is always
> untagged. The tagging happens while the packets are egressing the sp2 port.
> In that sense, there is no vlan associated with either of the bridges.
> 
> What I was not sure was about how we can associate a vlan with such a
> bridge in this case, and what it really means for the traffic.
> 
> >
> > 2) Single bridge with VLAN filtering
> >         - A single bridge device br0 is used
> >         - All ports sp1, sp2, and sp3 are enslaved to the bridge
> >         - br0 allows both VID=10 and VID=20
> >         - VLAN policy on the ports determines the bridging domains within the
> bridge
> >                 - sp1 allows VID=10, it is untagged on egress, and it is the PVID of
> sp1.
> >                   Linux does not provide a way to only allow untagged traffic to
> enter a port.
> >                   Both tagged traffic with VID=10 and untagged traffic are
> considered received on VLAN 10.
> >                 - sp3 allows VID=20, it is untagged on egress, and it is the PVID of
> sp2.
> >                 - sp2 allows VID=10 and VID=20, both are tagged on egress, and
> there is no PVID so untagged traffic received on sp2 is dropped.
> >
> >         - Above configuration is sufficient for L2 switching with two distinct
> VLANs.
> >         - L3 routing across VLANs in this model is achieved by vlan interfaces
> on the bridge. br0.10 for VLAN 10 and br0.20 for VLAN 20.
> >         - br0.10 and br0.20 are L3 interfaces and have IP addresses assigned.
> >                 - br0.10 allows VID=10
> 
> In this case am I correct in assuming that br0 actually represents the switch
> device ? I don't see a way in which one more such bridge can exist, since the
> forwarding decisions are handled at the switch-device level.
> 

br0 represent the HW switch as far as configuration. Offloading (discussed on this patch-set) will propagate bridge and port attributes to underlying port switch devices for HW configuration.
Bypassing data-path handling by the software bridge is yet to be defined.
On the transmit path (i.e. packet originated by the switch OS or applications) it could be OK to use software bridging but it could be inefficient for multicast or flooding. 
An alternative that hands packets sent to the bridge or to one of its vlan interfaces (br0.10, br0.20) to the HW switch is desirable.
On the receive path, delivering packets received from the HW by the switch driver on a switch port device (such as sp1, sp2, sp3) has to be avoided as it could cause duplicate packets (packets are flooded in the HW and again by the software bridge).

> Thanks
> Viswanath
> 
> >
> >
> >> >
> >> >
> >> >> [...]
> >> >>
> >> >> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa
> >> >> might have worked some of it out.
> >> >>
> >> >> --
> >> >> John Fastabend         Intel Corporation
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe netdev"
> >> > in the body of a message to majordomo@vger.kernel.org More
> >> > majordomo
> >> info
> >> > at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org More majordomo
> >> info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: David Miller @ 2014-12-16 22:45 UTC (permalink / raw)
  To: jbaron
  Cc: jbacik, eric.dumazet, alexei.starovoitov, chavey, ycheng, kafai,
	netdev, hannes, rostedt, brakmo, Kernel-team
In-Reply-To: <5490B4EF.1030803@akamai.com>

From: Jason Baron <jbaron@akamai.com>
Date: Tue, 16 Dec 2014 17:40:47 -0500

> We are interested in tcp tracing as well. Another requirement that
> we have that I don't think I saw is the ability to start/stop
> tracing on sockets (potentially multiple times) during the lifetime
> of a connection. So for example, the ability to use setsockopt(), to
> selectively start/stop tracing on a connection, so as not to incur
> overhead for non-traced sockets.

This is so backwards.

You make the tracing cheap enough that this can never be an issue.

Your requirement can only exist if the implementation is broken
by design.

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: David Miller @ 2014-12-16 22:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rapier, alexei.starovoitov, netdev
In-Reply-To: <1418769212.9773.65.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 16 Dec 2014 14:33:32 -0800

> There is very little chance web10g ~3000 lines of code are added into
> linux TCP stack, by people who did not submit netdev changes in last
> years.

+1

> At Google, we tried the web10g route, but reverted it (today !) in favor
> of tcp_info extensions (ss command from iproute2 can also grab/display
> these), after too many bugs being filled.

+1

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Jason Baron @ 2014-12-16 22:40 UTC (permalink / raw)
  To: Josef Bacik, Eric Dumazet, Alexei Starovoitov, Laurent Chavey,
	Yuchung Cheng
  Cc: Martin KaFai Lau, netdev@vger.kernel.org, David S. Miller,
	Hannes Frederic Sowa, Steven Rostedt, Lawrence Brakmo,
	Kernel Team
In-Reply-To: <548F0F62.7080704@fb.com>

On 12/15/2014 11:42 AM, Josef Bacik wrote:
> On 12/15/2014 11:03 AM, Eric Dumazet wrote:
>> On Sun, 2014-12-14 at 22:55 -0800, Alexei Starovoitov wrote:
>>
>>> I think patches 1 and 3 are good additions, since they establish
>>> few permanent points of instrumentation in tcp stack.
>>> Patches 4-5 look more like use cases of tracepoints established
>>> before. They may feel like simple additions and, no doubt,
>>> they are useful, but since they expose things via tracing
>>> infra they become part of api and cannot be changed later,
>>> when more stats would be needed.
>>> I think systemtap like scripting on top of patches 1 and 3
>>> should solve your use case ?
>>> Also, have you looked at recent eBPF work?
>>> Though it's not completely ready yet, soon it should
>>> be able to do the same stats collection as you have
>>> in 4/5 without adding permanent pieces to the kernel.
>>
>> So it looks like web10g like interfaces are very often requested by
>> various teams.
>>
>> And we have many different views on how to hack this. I am astonished by
>> number of hacks I saw about this stuff going on.
>>
>> What about a clean way, extending current TCP_INFO, which is both
>> available as a getsockopt() for socket owners and ss/iproute2
>> information for 'external entities'
>>
>> If we consider web10g info needed, then adding a ftrace/eBPF like
>> interface is simply yet another piece of code we need to maintain,
>> and the argument of 'this should cost nothing if not activated' is
>> nonsense since major players need to constantly monitor TCP metrics and
>> behavior.
>>
>> It seems both FaceBook and Google are working on a subset of web10g.
>>
>> I suggest we meet together and establish a common ground, preferably
>> after Christmas holidays.
>>
>
> We've set up something for exactly this case at the end of January but
> have yet to get a response from Google.  If any of the Google people
> cc'ed (or really anybody, its not a strictly FB/Google thing) is
> interested please email me directly and I'll send you the details, we
> will be meeting face to face in the bay area at the end of January. 
> Thanks,
>
> Josef

We are interested in tcp tracing as well. Another requirement that we
have that
I don't think I saw is the ability to start/stop tracing on sockets
(potentially
multiple times) during the lifetime of a connection. So for example, the
ability
to use setsockopt(), to selectively start/stop tracing on a connection,
so as not
to incur overhead for non-traced sockets.

Thanks,

-Jason

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: Eric Dumazet @ 2014-12-16 22:33 UTC (permalink / raw)
  To: rapier; +Cc: David Miller, alexei.starovoitov, netdev
In-Reply-To: <54909DD5.5070202@psc.edu>

On Tue, 2014-12-16 at 16:02 -0500, rapier wrote:

> I understand where you are coming from. I do believe that our 
> methodology provides some advantages over a tcp_info solution. I'll 
> provide some information on that tomorrow after I've had a chance to 
> talk about this in more depth with our dev team. That being said, we 
> only really care about the instrument set being incorporated as such we 
> will take a closer look tcp_info shortly.

Hmm...

There is very little chance web10g ~3000 lines of code are added into
linux TCP stack, by people who did not submit netdev changes in last
years.

At Google, we tried the web10g route, but reverted it (today !) in favor
of tcp_info extensions (ss command from iproute2 can also grab/display
these), after too many bugs being filled.

Researchers love/want to have hundred of metrics. This does not mean
linux has to provide them natively, unless we can prove it is really
damn useful.

Sorry, but someone had to raise some reality concerns.

tcp_info _is_ extensible, granted you do not try to push 127 new metrics
in it.

http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/?id=977cb0ecf82eb6d15562573c31edebf90db35163

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: Dominic Hamon @ 2014-12-16 22:09 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20141216.151823.2276708539799601894.davem@davemloft.net>

On Tue, Dec 16, 2014 at 03:18:23PM -0500, David Miller wrote:
> From: rapier <rapier@psc.edu>
> Date: Tue, 16 Dec 2014 15:13:44 -0500
> 
> > On 12/16/14, 3:03 PM, David Miller wrote:
> > 
> >> You shouldn't need to export any symbols.
> > 
> > As a point of clarification - is it acceptable to export symbols for
> > use with in tree modules such as tcp_htcp? We are more than willing to
> > do the work required to bring this in line with best practices.
> 
> I'm saying for data and TCP statistics collection, you shouldn't need
> to add any new symbol exports.
> 
> Keep this in the main kernel, nothing external should be needed.
> 
> Extending tcp_info or similar is the only reasonable way to implement
> this stuff.

There are two aspects of the tcp_estats approach that I think are worth
considering separately:
  - the breadth of instrumentation
  - the access method through kernel module/netlink

I don't see any reason why we can't take the instrumentation parts of
these patches and plumb them in to tcp_info.

I would prefer if we can retain the nested structures and the sysctl
that controls the set of instruments collected as I think this key
feature gives plenty of flexibility to system operators in terms of an
overhead/instrumentation trade-off, however I recognise this complicates
how the socket option is returned, and how ss might report the metrics.

--
Dominic Hamon

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: B Viswanath @ 2014-12-16 21:51 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: netdev@vger.kernel.org, John Fastabend, Roopa Prabhu,
	Jamal Hadi Salim, Jiri Pirko, sfeldma@gmail.com, bcrl@kvack.org,
	tgraf@suug.ch, stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <E4CD12F19ABA0C4D8729E087A761DC3505DB588F@ORSMSX101.amr.corp.intel.com>

On 17 December 2014 at 02:22, Arad, Ronen <ronen.arad@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>> owner@vger.kernel.org] On Behalf Of B Viswanath
>> Sent: Tuesday, December 16, 2014 9:24 PM
>> To: Arad, Ronen
>> Cc: John Fastabend; netdev@vger.kernel.org; Roopa Prabhu; Jamal Hadi
>> Salim; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
>> stephen@networkplumber.org; linville@tuxdriver.com;
>> vyasevic@redhat.com; davem@davemloft.net;
>> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
>> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
>> bridge port attributes
>>
>> Hi,
>>
>> This is my first email on this thread, and on this list. My apologies if I have
>> not understood something correctly. I would like to participate  in this
>> discussion, which is one of the reasons I joined this list recently.  Some
>> feedback inline below.
>>
>> On 16 December 2014 at 22:59, Arad, Ronen <ronen.arad@intel.com> wrote:
>> >
>> >
> <sniped for brevity>
>> >>
>> >> I'm still missing why there is duplicate implementations in the driver.
>> >> If the driver implements the set of ndo ops why should it care who
>> >> calls them? I think you tried to explain this already but I'm not seeing it.
>> >>
>> >
>> > Let's consider a bridge property. I'll use the default PVID attribute as an
>> example. This is currently configurable by sysfs only and a netlink support for
>> that is still due. Let's assume for our discussion that a DEAFAULT_PVID
>> attribute will be added as a bridge attribute within AFSPEC nested attribute
>> of AF_BRIDGE SETLINK message.
>> > When a bridge device is present, this attribute is processed by the bridge
>> module and saved as default_pvid field in net_bridge structure. When a
>> switch port is enslaved to a bridge, the bridge driver creates a
>> net_bridge_port instance and assigns it a pvid inherited from the
>> default_pvid attribute of the bridge. Setting the pvid for a new enslaved
>> switch port is not done via netlink. It only applies to the net_bridge_port
>> structure which is internal to the bridge module. Offloading this to HW is not
>> addressed with current bridge offloading.
>> >
>> > When a bridge device is not used, the DEFAULT_PVID will be targeted using
>> the SELF flag to any of the switch ports. The driver will recognize that as a
>> bridge port and will need to maintain some switch global structure similar to
>> net_bridge where it could save the default_pvid. The driver, knowing that the
>> switch port is not enslaved to a bridge, will have to replicate the same
>> functionality. In the HW case, it will have to configure default VLAN on all the
>> switch ports.
>> > This is different from the yet to be defined way of propagating default PVID
>> from a bridge device to offloaded bridge ports.
>> >
>> > Another example is STP. STP attributes are bridge attributes which are not
>> offloaded when a bridge device is present. The bridge module handles STP
>> protocol internally. Without bridge device, STP attributes have to be targeted
>> at a switch port device and the driver should save them in driver-specific
>> structures and have proprietary implementation of STP (as the one in the
>> bridge module is not used).
>>
>> In general I feel that the switch-device and port relation should be that of the
>> 'container-containee'. This is the actual physical relationship.  Apart from
>> some operations such as vlans and protocol related, it is tricky to model all
>> operations directly on ports. My thinking is it is cleaner to have all operations
>> be on switch-device, which in turn peculates the operations downward, to its
>> contained ports as applicable.  The offloading is really a property of the
>> switch device and not individual ports. Similarly the FDB is maintained by the
>> switch and not the ports. As we extend the current offloading mechanism to
>> other L2, L3 and other features, we may find it easier to have a 'switch-
>> device' in place.
>>
>> I am somewhat confused with the notion of bridges though. Many existing
>> linux-based routers use bridges differently than as a vlan-broadcast-domain.
>> For example it is common to have eth0.334 and
>> eth1 in the same bridge. What is being done internally is that the additional
>> vlan tag 334 (which indicates video traffic, say) is removed and that video
>> traffic is being bridged to eth1. There is no default vlan for this bridge.  This is
>> a software bridge. I am not sure how this can be accomplished if there is a
>> need to associate a vlan with a bridge.
>>
>> Thanks
>> Viswanath
>>
>>
>
> Let's say we have three ports sp1, sp2, sp3 and two VLANs 10 and 20.
> VLAN 10 is allowed on ports sp1 and sp2 and VLAN 20 is allowed on ports sp2 and sp3.
> Let's say sp1 and sp3 are access ports and carry untagged traffic. Sp2 is an uplink trunk port and carries tagged traffic.
>
> This could be modeled using Linux bridge in at least two ways:
>
> 1) Bridge per VLAN
>         - Two bridge devices are used br10 and br20
>         - sp2.10 is a vlan interface on sp2 for VLAN 10
>         - sp2.20 is a vlan interface on sp2 for VLAN 20
>         - br10 enslaves ports sp1 and sp2.10
>         - br20 enslaves ports sp3 and sp2.20
>         - br10 and br20 could be L3 interfaces and have IP address assigned to.
>                  This allows for routing between VLANs.
>         - Traffic sent to br10 will egress untagged on sp1 and tagged with VID=10 on sp2
>         - Traffic sent to br20 will egress untagged on sp2 and tagged with VID=20 on sp2
>         - Traffic received on sp1 is delivered to br10 and could be flooded if MAC DA is broadcast or unknown unicast.
>         - Traffic received on sp2 with VID=10 is delivered to br10 after the VLAN tag is removed.
>         - Similarly traffic receive on sp3 or tagged traffic with VID=20 on sp2 is delivered to br20

Thanks for the explanation. Understood this, and this is how I
saw/implemented things so far. However, my understanding is that both
br10 and br20 have nothing to do with vlan 10 and 20 respectively. The
traffic the actual bridges see (say if we ran tcpdump on br10 or br20)
is always untagged. The tagging happens while the packets are
egressing the sp2 port. In that sense, there is no vlan associated
with either of the bridges.

What I was not sure was about how we can associate a vlan with such a
bridge in this case, and what it really means for the traffic.

>
> 2) Single bridge with VLAN filtering
>         - A single bridge device br0 is used
>         - All ports sp1, sp2, and sp3 are enslaved to the bridge
>         - br0 allows both VID=10 and VID=20
>         - VLAN policy on the ports determines the bridging domains within the bridge
>                 - sp1 allows VID=10, it is untagged on egress, and it is the PVID of sp1.
>                   Linux does not provide a way to only allow untagged traffic to enter a port.
>                   Both tagged traffic with VID=10 and untagged traffic are considered received on VLAN 10.
>                 - sp3 allows VID=20, it is untagged on egress, and it is the PVID of sp2.
>                 - sp2 allows VID=10 and VID=20, both are tagged on egress, and there is no PVID so untagged traffic received on sp2 is dropped.
>
>         - Above configuration is sufficient for L2 switching with two distinct VLANs.
>         - L3 routing across VLANs in this model is achieved by vlan interfaces on the bridge. br0.10 for VLAN 10 and br0.20 for VLAN 20.
>         - br0.10 and br0.20 are L3 interfaces and have IP addresses assigned.
>                 - br0.10 allows VID=10

In this case am I correct in assuming that br0 actually represents the
switch device ? I don't see a way in which one more such bridge can
exist, since the forwarding decisions are handled at the switch-device
level.

Thanks
Viswanath

>
>
>> >
>> >
>> >> [...]
>> >>
>> >> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa
>> >> might have worked some of it out.
>> >>
>> >> --
>> >> John Fastabend         Intel Corporation
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe netdev" in
>> > the body of a message to majordomo@vger.kernel.org More majordomo
>> info
>> > at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in the body
>> of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: net: integer overflow in ip_idents_reserve
From: Eric Dumazet @ 2014-12-16 21:47 UTC (permalink / raw)
  To: Sasha Levin
  Cc: David S. Miller, LKML, netdev@vger.kernel.org, Andrey Ryabinin,
	Dave Jones
In-Reply-To: <5490A1F8.6020207@oracle.com>

On Tue, 2014-12-16 at 16:19 -0500, Sasha Levin wrote:
> Hi Eric,
> 
> While fuzzing with trinity on a -next kernel with the undefined behaviour
> sanitizer path, I've observed the following warning in code which was
> introduced in 04ca6973f7 ("ip: make IP identifiers less predictable"):

This is a false positive.

We don't really care of the value if (now - old) is too big :

No packet was sent recently, so IP ID being X or Y is not a concern.

^ permalink raw reply

* net: integer overflow in ip_idents_reserve
From: Sasha Levin @ 2014-12-16 21:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, LKML, netdev@vger.kernel.org, Andrey Ryabinin,
	Dave Jones

Hi Eric,

While fuzzing with trinity on a -next kernel with the undefined behaviour
sanitizer path, I've observed the following warning in code which was
introduced in 04ca6973f7 ("ip: make IP identifiers less predictable"):

[  234.317163] ================================================================================
[  234.320001] UBSan: Undefined behaviour in ./arch/x86/include/asm/atomic.h:157:9
[  234.321568] signed integer overflow:
[  234.322772] 1678406574 + 641542997 cannot be represented in type 'int'
[  234.324316] CPU: 2 PID: 16819 Comm: trinity-c537 Not tainted 3.18.0-next-20141216-sasha-00065-g3c56201-dirty #1609
[  234.326548]  0000000000000000 0000000000000000 ffffffffbc2e4e10 ffff8802e63137e8
[  234.327837]  ffffffffb126bd68 1ffffffff7aa2c03 ffffffffbc2e5c34 ffff8802e6313808
[  234.329117]  ffffffffb126df6f 1ffffffff7aa2c03 ffffffffbc2e5c34 ffff8802e63138c8
[  234.330755] Call Trace:
[  234.331213] dump_stack (lib/dump_stack.c:52)
[  234.332025] ubsan_epilogue (lib/ubsan.c:159)
[  234.332986] handle_overflow (lib/ubsan.c:191)
[  234.334022] ? preempt_schedule (./arch/x86/include/asm/preempt.h:77 (discriminator 1) kernel/sched/core.c:2898 (discriminator 1))
[  234.334945] ? ___preempt_schedule (arch/x86/lib/thunk_64.S:42)
[  234.335919] __ubsan_handle_add_overflow (lib/ubsan.c:200)
[  234.337211] ip_idents_reserve (./arch/x86/include/asm/atomic.h:157 net/ipv4/route.c:482)
[  234.338935] __ip_select_ident (include/uapi/linux/swab.h:49 (discriminator 3) net/ipv4/route.c:498 (discriminator 3))
[  234.340773] __ip_make_skb (include/net/ip.h:339 include/net/ip.h:345 net/ipv4/ip_output.c:1386)
[  234.342736] ip_push_pending_frames (include/net/ip.h:148 net/ipv4/ip_output.c:1430)
[  234.344707] raw_sendmsg (net/ipv4/raw.c:644)
[  234.346537] ? system_call_fastpath (arch/x86/kernel/entry_64.S:423)
[  234.348431] ? get_parent_ip (kernel/sched/core.c:2564)
[  234.350259] ? preempt_count_sub (kernel/sched/core.c:2620)
[  234.352170] inet_sendmsg (net/ipv4/af_inet.c:734)
[  234.354107] do_sock_sendmsg (net/socket.c:646 (discriminator 4))
[  234.355947] ? retint_restore_args (arch/x86/kernel/entry_64.S:844)
[  234.357962] ___sys_sendmsg (net/socket.c:653 net/socket.c:2094)
[  234.359545] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
[  234.361182] ? __acct_update_integrals (kernel/tsacct.c:147)
[  234.363394] ? acct_account_cputime (kernel/tsacct.c:168)
[  234.365417] __sys_sendmsg (net/socket.c:2131)
[  234.367248] SyS_sendmsg (net/socket.c:2136)
[  234.368925] system_call_fastpath (arch/x86/kernel/entry_64.S:423)
[  234.371038] ================================================================================


Thanks,
Sasha

^ permalink raw reply

* Re: Adding Unisys virtnic.c to the Network Tree
From: Andy Gospodarek @ 2014-12-16 21:13 UTC (permalink / raw)
  To: Erik Arfvidson
  Cc: davem, netdev, benjamin.romer, dzickus, prarit, Bruce.Vessey,
	sparmaintainer
In-Reply-To: <54909B71.10609@redhat.com>

On Tue, Dec 16, 2014 at 03:52:01PM -0500, Erik Arfvidson wrote:
> Hi Dave,
> 
> I'm a partner engineer at Red Hat working for Unisys. Currently most of our
> driver for our system reside in the staging tree Maintained by Greg KH. It
> was suggested by one of the engineers at Red Hat Don Zickus that in order to
> accelerate the process of moving the remaining drivers we pushed directly to
> their specific system in the Linux Kernel. Currently virtnic which is our
> Virtual Network driver resides internally at Unisys and dependencies are in
> the staging tree(drivers/staging/unisys/). So would you be willing to take a
> look at our Network driver in order to add it to your Network tree?

Erik,

I realize I'm not Dave, but I thought I could respond anyway.  :)

I suspect that if you posted the driver to this list, more eyes than
just Dave would be willing to take a look and provide inline feedback.

-andy

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: rapier @ 2014-12-16 21:02 UTC (permalink / raw)
  To: David Miller; +Cc: alexei.starovoitov, netdev
In-Reply-To: <20141216.151823.2276708539799601894.davem@davemloft.net>



On 12/16/14 3:18 PM, David Miller wrote:
> From: rapier <rapier@psc.edu>
> Date: Tue, 16 Dec 2014 15:13:44 -0500
>
>> On 12/16/14, 3:03 PM, David Miller wrote:
>>
>>> You shouldn't need to export any symbols.
>>
>> As a point of clarification - is it acceptable to export symbols for
>> use with in tree modules such as tcp_htcp? We are more than willing to
>> do the work required to bring this in line with best practices.
>
> I'm saying for data and TCP statistics collection, you shouldn't need
> to add any new symbol exports.

We've been able to identify that as the code stands we only need one 
export for the KIS. That being said, I understand if that's one too 
many. The DLKM (which hasn't been submitted) does require two additional 
symbols and that's an oversight on our part. We'll work to eliminate 
those as well.

> Keep this in the main kernel, nothing external should be needed.
>
> Extending tcp_info or similar is the only reasonable way to implement
> this stuff.

I understand where you are coming from. I do believe that our 
methodology provides some advantages over a tcp_info solution. I'll 
provide some information on that tomorrow after I've had a chance to 
talk about this in more depth with our dev team. That being said, we 
only really care about the instrument set being incorporated as such we 
will take a closer look tcp_info shortly.

Chris

^ permalink raw reply

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16 20:52 UTC (permalink / raw)
  To: B Viswanath, netdev@vger.kernel.org
  Cc: John Fastabend, Roopa Prabhu, Jamal Hadi Salim, Jiri Pirko,
	sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <CAN+pFwJuOBb2c49TUqX357DQLY9a7GUs7Kufne0Td=jSJ+nFeA@mail.gmail.com>



> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-
> owner@vger.kernel.org] On Behalf Of B Viswanath
> Sent: Tuesday, December 16, 2014 9:24 PM
> To: Arad, Ronen
> Cc: John Fastabend; netdev@vger.kernel.org; Roopa Prabhu; Jamal Hadi
> Salim; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> Hi,
> 
> This is my first email on this thread, and on this list. My apologies if I have
> not understood something correctly. I would like to participate  in this
> discussion, which is one of the reasons I joined this list recently.  Some
> feedback inline below.
> 
> On 16 December 2014 at 22:59, Arad, Ronen <ronen.arad@intel.com> wrote:
> >
> >
<sniped for brevity>
> >>
> >> I'm still missing why there is duplicate implementations in the driver.
> >> If the driver implements the set of ndo ops why should it care who
> >> calls them? I think you tried to explain this already but I'm not seeing it.
> >>
> >
> > Let's consider a bridge property. I'll use the default PVID attribute as an
> example. This is currently configurable by sysfs only and a netlink support for
> that is still due. Let's assume for our discussion that a DEAFAULT_PVID
> attribute will be added as a bridge attribute within AFSPEC nested attribute
> of AF_BRIDGE SETLINK message.
> > When a bridge device is present, this attribute is processed by the bridge
> module and saved as default_pvid field in net_bridge structure. When a
> switch port is enslaved to a bridge, the bridge driver creates a
> net_bridge_port instance and assigns it a pvid inherited from the
> default_pvid attribute of the bridge. Setting the pvid for a new enslaved
> switch port is not done via netlink. It only applies to the net_bridge_port
> structure which is internal to the bridge module. Offloading this to HW is not
> addressed with current bridge offloading.
> >
> > When a bridge device is not used, the DEFAULT_PVID will be targeted using
> the SELF flag to any of the switch ports. The driver will recognize that as a
> bridge port and will need to maintain some switch global structure similar to
> net_bridge where it could save the default_pvid. The driver, knowing that the
> switch port is not enslaved to a bridge, will have to replicate the same
> functionality. In the HW case, it will have to configure default VLAN on all the
> switch ports.
> > This is different from the yet to be defined way of propagating default PVID
> from a bridge device to offloaded bridge ports.
> >
> > Another example is STP. STP attributes are bridge attributes which are not
> offloaded when a bridge device is present. The bridge module handles STP
> protocol internally. Without bridge device, STP attributes have to be targeted
> at a switch port device and the driver should save them in driver-specific
> structures and have proprietary implementation of STP (as the one in the
> bridge module is not used).
> 
> In general I feel that the switch-device and port relation should be that of the
> 'container-containee'. This is the actual physical relationship.  Apart from
> some operations such as vlans and protocol related, it is tricky to model all
> operations directly on ports. My thinking is it is cleaner to have all operations
> be on switch-device, which in turn peculates the operations downward, to its
> contained ports as applicable.  The offloading is really a property of the
> switch device and not individual ports. Similarly the FDB is maintained by the
> switch and not the ports. As we extend the current offloading mechanism to
> other L2, L3 and other features, we may find it easier to have a 'switch-
> device' in place.
> 
> I am somewhat confused with the notion of bridges though. Many existing
> linux-based routers use bridges differently than as a vlan-broadcast-domain.
> For example it is common to have eth0.334 and
> eth1 in the same bridge. What is being done internally is that the additional
> vlan tag 334 (which indicates video traffic, say) is removed and that video
> traffic is being bridged to eth1. There is no default vlan for this bridge.  This is
> a software bridge. I am not sure how this can be accomplished if there is a
> need to associate a vlan with a bridge.
> 
> Thanks
> Viswanath
> 
>

Let's say we have three ports sp1, sp2, sp3 and two VLANs 10 and 20.
VLAN 10 is allowed on ports sp1 and sp2 and VLAN 20 is allowed on ports sp2 and sp3.
Let's say sp1 and sp3 are access ports and carry untagged traffic. Sp2 is an uplink trunk port and carries tagged traffic.

This could be modeled using Linux bridge in at least two ways:

1) Bridge per VLAN
	- Two bridge devices are used br10 and br20
	- sp2.10 is a vlan interface on sp2 for VLAN 10
	- sp2.20 is a vlan interface on sp2 for VLAN 20
	- br10 enslaves ports sp1 and sp2.10
	- br20 enslaves ports sp3 and sp2.20
	- br10 and br20 could be L3 interfaces and have IP address assigned to.
                 This allows for routing between VLANs.
	- Traffic sent to br10 will egress untagged on sp1 and tagged with VID=10 on sp2
	- Traffic sent to br20 will egress untagged on sp2 and tagged with VID=20 on sp2
	- Traffic received on sp1 is delivered to br10 and could be flooded if MAC DA is broadcast or unknown unicast.
	- Traffic received on sp2 with VID=10 is delivered to br10 after the VLAN tag is removed.
	- Similarly traffic receive on sp3 or tagged traffic with VID=20 on sp2 is delivered to br20

2) Single bridge with VLAN filtering
	- A single bridge device br0 is used
	- All ports sp1, sp2, and sp3 are enslaved to the bridge
	- br0 allows both VID=10 and VID=20
	- VLAN policy on the ports determines the bridging domains within the bridge
		- sp1 allows VID=10, it is untagged on egress, and it is the PVID of sp1.
		  Linux does not provide a way to only allow untagged traffic to enter a port.
		  Both tagged traffic with VID=10 and untagged traffic are considered received on VLAN 10.
		- sp3 allows VID=20, it is untagged on egress, and it is the PVID of sp2.
		- sp2 allows VID=10 and VID=20, both are tagged on egress, and there is no PVID so untagged traffic received on sp2 is dropped.
		
	- Above configuration is sufficient for L2 switching with two distinct VLANs.
	- L3 routing across VLANs in this model is achieved by vlan interfaces on the bridge. br0.10 for VLAN 10 and br0.20 for VLAN 20.
	- br0.10 and br0.20 are L3 interfaces and have IP addresses assigned.
		- br0.10 allows VID=10

 
> >
> >
> >> [...]
> >>
> >> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa
> >> might have worked some of it out.
> >>
> >> --
> >> John Fastabend         Intel Corporation
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org More majordomo
> info
> > at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in the body
> of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Adding Unisys virtnic.c to the Network Tree
From: Erik Arfvidson @ 2014-12-16 20:52 UTC (permalink / raw)
  To: davem, netdev
  Cc: benjamin.romer, dzickus, prarit, Bruce.Vessey, sparmaintainer
In-Reply-To: <5490976C.20409@redhat.com>

Hi Dave,

I'm a partner engineer at Red Hat working for Unisys. Currently most of 
our driver for our system reside in the staging tree Maintained by Greg 
KH. It was suggested by one of the engineers at Red Hat Don Zickus that 
in order to accelerate the process of moving the remaining drivers we 
pushed directly to their specific system in the Linux Kernel. Currently 
virtnic which is our Virtual Network driver resides internally at Unisys 
and dependencies are in the staging tree(drivers/staging/unisys/). So 
would you be willing to take a look at our Network driver in order to 
add it to your Network tree?

Cheers,
Erik Arfvidson

^ permalink raw reply

* Re: [PATCH 2/2] ip_tunnel: Add missing validation of encap type to ip_tunnel_encap_setup()
From: Thomas Graf @ 2014-12-16 20:50 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Linux Netdev List
In-Reply-To: <CA+mtBx_364iHXxLkA=Ytb=34_2zEy9-3hqfOdt96Ckhm8NwFiQ@mail.gmail.com>

On 12/16/14 at 12:23pm, Tom Herbert wrote:
> On Tue, Dec 16, 2014 at 12:05 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > The encap->type comes straight from Netlink. Validate it against
> > max supported encap types just like ip_encap_hlen() already does.
> >
> > Fixes: a8c5f9 ("ip_tunnel: Ops registration for secondary encap (fou, gue)")
> > Signed-off-by: Thomas Graf <tgraf@suug.ch>
> > ---
> >  net/ipv4/ip_tunnel.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
> > index 2f498f8..d3e4479 100644
> > --- a/net/ipv4/ip_tunnel.c
> > +++ b/net/ipv4/ip_tunnel.c
> > @@ -573,6 +573,9 @@ int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t,
> >         if (t->encap.type == TUNNEL_ENCAP_NONE)
> >                 return 0;
> >
> > +       if (t->encap.type >= MAX_IPTUN_ENCAP_OPS)
> > +               return -EINVAL;
> > +
> 
> I don't think this is technically needed, we should have already
> verified the type when setting up the tunnel (ip_encap_hlen).

Right, assuming that every API user always calls ip_tunnel_encap_setup()
on changelink. It's currently the case but since this is a exported
API I figured we better be safe than sorry, in particular as
ip_tunnel_encap() is called before ip_encap_hlen() on xmit.

^ permalink raw reply

* Re: [PATCH net] net: Disallow providing non zero VLAN ID for NIC drivers FDB add flow
From: David Miller @ 2014-12-16 20:41 UTC (permalink / raw)
  To: ogerlitz; +Cc: netdev, jiri, gospo, jhs, john.r.fastabend
In-Reply-To: <1418573945-27840-1-git-send-email-ogerlitz@mellanox.com>

From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Sun, 14 Dec 2014 18:19:05 +0200

> The current implementations all use dev_uc_add_excl() and such whose API
> doesn't support vlans, so we can't make it with NICs HW for now.
> 
> Fixes: f6f6424ba773 ('net: make vid as a parameter for ndo_fdb_add/ndo_fdb_del')
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>

Applied, thanks.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox