Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: David Miller @ 2014-12-16 22:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rapier, alexei.starovoitov, netdev
In-Reply-To: <1418769212.9773.65.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 16 Dec 2014 14:33:32 -0800

> There is very little chance web10g ~3000 lines of code are added into
> linux TCP stack, by people who did not submit netdev changes in last
> years.

+1

> At Google, we tried the web10g route, but reverted it (today !) in favor
> of tcp_info extensions (ss command from iproute2 can also grab/display
> these), after too many bugs being filled.

+1

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: David Miller @ 2014-12-16 22:45 UTC (permalink / raw)
  To: jbaron
  Cc: jbacik, eric.dumazet, alexei.starovoitov, chavey, ycheng, kafai,
	netdev, hannes, rostedt, brakmo, Kernel-team
In-Reply-To: <5490B4EF.1030803@akamai.com>

From: Jason Baron <jbaron@akamai.com>
Date: Tue, 16 Dec 2014 17:40:47 -0500

> We are interested in tcp tracing as well. Another requirement that
> we have that I don't think I saw is the ability to start/stop
> tracing on sockets (potentially multiple times) during the lifetime
> of a connection. So for example, the ability to use setsockopt(), to
> selectively start/stop tracing on a connection, so as not to incur
> overhead for non-traced sockets.

This is so backwards.

You make the tracing cheap enough that this can never be an issue.

Your requirement can only exist if the implementation is broken
by design.

^ permalink raw reply

* RE: [PATCH net-next v2 2/4] swdevice: add new api to set and del bridge port attributes
From: Arad, Ronen @ 2014-12-16 22:46 UTC (permalink / raw)
  To: B Viswanath, netdev@vger.kernel.org
  Cc: John Fastabend, Roopa Prabhu, Jamal Hadi Salim, Jiri Pirko,
	sfeldma@gmail.com, bcrl@kvack.org, tgraf@suug.ch,
	stephen@networkplumber.org, linville@tuxdriver.com,
	vyasevic@redhat.com, davem@davemloft.net, shm@cumulusnetworks.com,
	gospo@cumulusnetworks.com
In-Reply-To: <CAN+pFwJgbbgYjU3-P=3h7G1v8Sz=naDAU880-6LJ3LTTciz7nA@mail.gmail.com>



> -----Original Message-----
> From: B Viswanath [mailto:marichika4@gmail.com]
> Sent: Tuesday, December 16, 2014 11:52 PM
> To: Arad, Ronen
> Cc: netdev@vger.kernel.org; John Fastabend; Roopa Prabhu; Jamal Hadi
> Salim; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> stephen@networkplumber.org; linville@tuxdriver.com;
> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com; gospo@cumulusnetworks.com
> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and del
> bridge port attributes
> 
> On 17 December 2014 at 02:22, Arad, Ronen <ronen.arad@intel.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: netdev-owner@vger.kernel.org [mailto:netdev-
> >> owner@vger.kernel.org] On Behalf Of B Viswanath
> >> Sent: Tuesday, December 16, 2014 9:24 PM
> >> To: Arad, Ronen
> >> Cc: John Fastabend; netdev@vger.kernel.org; Roopa Prabhu; Jamal Hadi
> >> Salim; Jiri Pirko; sfeldma@gmail.com; bcrl@kvack.org; tgraf@suug.ch;
> >> stephen@networkplumber.org; linville@tuxdriver.com;
> >> vyasevic@redhat.com; davem@davemloft.net;
> shm@cumulusnetworks.com;
> >> gospo@cumulusnetworks.com
> >> Subject: Re: [PATCH net-next v2 2/4] swdevice: add new api to set and
> >> del bridge port attributes
> >>
> >> Hi,
> >>
> >> This is my first email on this thread, and on this list. My apologies
> >> if I have not understood something correctly. I would like to
> >> participate  in this discussion, which is one of the reasons I joined
> >> this list recently.  Some feedback inline below.
> >>
> >> On 16 December 2014 at 22:59, Arad, Ronen <ronen.arad@intel.com>
> wrote:
> >> >
> >> >
> > <sniped for brevity>
> >> >>
> >> >> I'm still missing why there is duplicate implementations in the driver.
> >> >> If the driver implements the set of ndo ops why should it care who
> >> >> calls them? I think you tried to explain this already but I'm not seeing it.
> >> >>
> >> >
> >> > Let's consider a bridge property. I'll use the default PVID
> >> > attribute as an
> >> example. This is currently configurable by sysfs only and a netlink
> >> support for that is still due. Let's assume for our discussion that a
> >> DEAFAULT_PVID attribute will be added as a bridge attribute within
> >> AFSPEC nested attribute of AF_BRIDGE SETLINK message.
> >> > When a bridge device is present, this attribute is processed by the
> >> > bridge
> >> module and saved as default_pvid field in net_bridge structure. When
> >> a switch port is enslaved to a bridge, the bridge driver creates a
> >> net_bridge_port instance and assigns it a pvid inherited from the
> >> default_pvid attribute of the bridge. Setting the pvid for a new
> >> enslaved switch port is not done via netlink. It only applies to the
> >> net_bridge_port structure which is internal to the bridge module.
> >> Offloading this to HW is not addressed with current bridge offloading.
> >> >
> >> > When a bridge device is not used, the DEFAULT_PVID will be targeted
> >> > using
> >> the SELF flag to any of the switch ports. The driver will recognize
> >> that as a bridge port and will need to maintain some switch global
> >> structure similar to net_bridge where it could save the default_pvid.
> >> The driver, knowing that the switch port is not enslaved to a bridge,
> >> will have to replicate the same functionality. In the HW case, it
> >> will have to configure default VLAN on all the switch ports.
> >> > This is different from the yet to be defined way of propagating
> >> > default PVID
> >> from a bridge device to offloaded bridge ports.
> >> >
> >> > Another example is STP. STP attributes are bridge attributes which
> >> > are not
> >> offloaded when a bridge device is present. The bridge module handles
> >> STP protocol internally. Without bridge device, STP attributes have
> >> to be targeted at a switch port device and the driver should save
> >> them in driver-specific structures and have proprietary
> >> implementation of STP (as the one in the bridge module is not used).
> >>
> >> In general I feel that the switch-device and port relation should be
> >> that of the 'container-containee'. This is the actual physical
> >> relationship.  Apart from some operations such as vlans and protocol
> >> related, it is tricky to model all operations directly on ports. My
> >> thinking is it is cleaner to have all operations be on switch-device,
> >> which in turn peculates the operations downward, to its contained
> >> ports as applicable.  The offloading is really a property of the
> >> switch device and not individual ports. Similarly the FDB is
> >> maintained by the switch and not the ports. As we extend the current
> >> offloading mechanism to other L2, L3 and other features, we may find it
> easier to have a 'switch- device' in place.
> >>
> >> I am somewhat confused with the notion of bridges though. Many
> >> existing linux-based routers use bridges differently than as a vlan-
> broadcast-domain.
> >> For example it is common to have eth0.334 and
> >> eth1 in the same bridge. What is being done internally is that the
> >> additional vlan tag 334 (which indicates video traffic, say) is
> >> removed and that video traffic is being bridged to eth1. There is no
> >> default vlan for this bridge.  This is a software bridge. I am not
> >> sure how this can be accomplished if there is a need to associate a vlan
> with a bridge.
> >>
> >> Thanks
> >> Viswanath
> >>
> >>
> >
> > Let's say we have three ports sp1, sp2, sp3 and two VLANs 10 and 20.
> > VLAN 10 is allowed on ports sp1 and sp2 and VLAN 20 is allowed on ports
> sp2 and sp3.
> > Let's say sp1 and sp3 are access ports and carry untagged traffic. Sp2 is an
> uplink trunk port and carries tagged traffic.
> >
> > This could be modeled using Linux bridge in at least two ways:
> >
> > 1) Bridge per VLAN
> >         - Two bridge devices are used br10 and br20
> >         - sp2.10 is a vlan interface on sp2 for VLAN 10
> >         - sp2.20 is a vlan interface on sp2 for VLAN 20
> >         - br10 enslaves ports sp1 and sp2.10
> >         - br20 enslaves ports sp3 and sp2.20
> >         - br10 and br20 could be L3 interfaces and have IP address assigned to.
> >                  This allows for routing between VLANs.
> >         - Traffic sent to br10 will egress untagged on sp1 and tagged with
> VID=10 on sp2
> >         - Traffic sent to br20 will egress untagged on sp2 and tagged with
> VID=20 on sp2
> >         - Traffic received on sp1 is delivered to br10 and could be flooded if
> MAC DA is broadcast or unknown unicast.
> >         - Traffic received on sp2 with VID=10 is delivered to br10 after the
> VLAN tag is removed.
> >         - Similarly traffic receive on sp3 or tagged traffic with
> > VID=20 on sp2 is delivered to br20
> 
> Thanks for the explanation. Understood this, and this is how I
> saw/implemented things so far. However, my understanding is that both
> br10 and br20 have nothing to do with vlan 10 and 20 respectively. The traffic
> the actual bridges see (say if we ran tcpdump on br10 or br20) is always
> untagged. The tagging happens while the packets are egressing the sp2 port.
> In that sense, there is no vlan associated with either of the bridges.
> 
> What I was not sure was about how we can associate a vlan with such a
> bridge in this case, and what it really means for the traffic.
> 
> >
> > 2) Single bridge with VLAN filtering
> >         - A single bridge device br0 is used
> >         - All ports sp1, sp2, and sp3 are enslaved to the bridge
> >         - br0 allows both VID=10 and VID=20
> >         - VLAN policy on the ports determines the bridging domains within the
> bridge
> >                 - sp1 allows VID=10, it is untagged on egress, and it is the PVID of
> sp1.
> >                   Linux does not provide a way to only allow untagged traffic to
> enter a port.
> >                   Both tagged traffic with VID=10 and untagged traffic are
> considered received on VLAN 10.
> >                 - sp3 allows VID=20, it is untagged on egress, and it is the PVID of
> sp2.
> >                 - sp2 allows VID=10 and VID=20, both are tagged on egress, and
> there is no PVID so untagged traffic received on sp2 is dropped.
> >
> >         - Above configuration is sufficient for L2 switching with two distinct
> VLANs.
> >         - L3 routing across VLANs in this model is achieved by vlan interfaces
> on the bridge. br0.10 for VLAN 10 and br0.20 for VLAN 20.
> >         - br0.10 and br0.20 are L3 interfaces and have IP addresses assigned.
> >                 - br0.10 allows VID=10
> 
> In this case am I correct in assuming that br0 actually represents the switch
> device ? I don't see a way in which one more such bridge can exist, since the
> forwarding decisions are handled at the switch-device level.
> 

br0 represent the HW switch as far as configuration. Offloading (discussed on this patch-set) will propagate bridge and port attributes to underlying port switch devices for HW configuration.
Bypassing data-path handling by the software bridge is yet to be defined.
On the transmit path (i.e. packet originated by the switch OS or applications) it could be OK to use software bridging but it could be inefficient for multicast or flooding. 
An alternative that hands packets sent to the bridge or to one of its vlan interfaces (br0.10, br0.20) to the HW switch is desirable.
On the receive path, delivering packets received from the HW by the switch driver on a switch port device (such as sp1, sp2, sp3) has to be avoided as it could cause duplicate packets (packets are flooded in the HW and again by the software bridge).

> Thanks
> Viswanath
> 
> >
> >
> >> >
> >> >
> >> >> [...]
> >> >>
> >> >> I'll need to think about the l3 stuff but I think Jiri/Scott/Roopa
> >> >> might have worked some of it out.
> >> >>
> >> >> --
> >> >> John Fastabend         Intel Corporation
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe netdev"
> >> > in the body of a message to majordomo@vger.kernel.org More
> >> > majordomo
> >> info
> >> > at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe netdev" in
> >> the body of a message to majordomo@vger.kernel.org More majordomo
> >> info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Hannes Frederic Sowa @ 2014-12-16 22:50 UTC (permalink / raw)
  To: David Miller, jbaron
  Cc: jbacik, eric.dumazet, alexei.starovoitov, chavey, ycheng, kafai,
	netdev, rostedt, brakmo, Kernel-team, Daniel Borkmann,
	Florian Westphal
In-Reply-To: <20141216.174524.888376617592272638.davem@davemloft.net>

On Tue, Dec 16, 2014, at 23:45, David Miller wrote:
> From: Jason Baron <jbaron@akamai.com>
> Date: Tue, 16 Dec 2014 17:40:47 -0500
> 
> > We are interested in tcp tracing as well. Another requirement that
> > we have that I don't think I saw is the ability to start/stop
> > tracing on sockets (potentially multiple times) during the lifetime
> > of a connection. So for example, the ability to use setsockopt(), to
> > selectively start/stop tracing on a connection, so as not to incur
> > overhead for non-traced sockets.
> 
> This is so backwards.
> 
> You make the tracing cheap enough that this can never be an issue.
> 
> Your requirement can only exist if the implementation is broken
> by design.

An idea I had was to add a proxy tcp congestion control which could be
selectively chosen per destination as soon as Daniel's patchset hits
net-next or one could enable it globally by
sys/net/ipv4/tcp_congestion_control. Needed tracepoints could be
installed in the congestion_ops handlers and the additional storage
could live inside the private data of the congestion control handler.
Further callbacks could go to a chained congestion control handler. The
names could be extended like "t:cubic", t for tracing.

Do the congestion control callbacks provide enough insight to the
connection state?

Bye,
Hannes

^ permalink raw reply

* Re: Netlink mmap tx security?
From: David Miller @ 2014-12-16 22:58 UTC (permalink / raw)
  To: tgraf; +Cc: dborkman, luto, torvalds, kaber, netdev
In-Reply-To: <20141016070753.GA16738@casper.infradead.org>

From: Thomas Graf <tgraf@suug.ch>
Date: Thu, 16 Oct 2014 08:07:53 +0100

> On 10/16/14 at 08:45am, Daniel Borkmann wrote:
>> On 10/15/2014 04:09 AM, David Miller wrote:
>> >That would work as well.
>> >
>> >There are pros and cons to all of these approaches.
>> >
>> >I was thinking that if we do the "TX mmap --> copy to kernel buffer"
>> >approach, then if in the future we find a way to make it work
>> >reliably, we can avoid the copy.  And frankly performance wise it's no
>> >worse than what happens via normal sendmsg() calls.
>> >
>> >And all applications using NETLINK_RX_RING keep working and keep
>> >getting the performance boost.
>> 
>> That would be better, yes. This would avoid having such a TPACKET_V*
>> API chaos we have in packet sockets if this could be fixed for netlink
>> eventually.
> 
> Only saw the second part of Dave's message now. I agree that this
> is even a better option.

Sorry for dropping the ball on this, here is the patch I have right
now, any comments?

====================
[PATCH] netlink: Always copy on mmap TX.

Checking the file f_count and the nlk->mapped count is not completely
sufficient to prevent the mmap'd area contents from changing from
under us during netlink mmap sendmsg() operations.

Be careful to sample the header's length field only once, because this
could change from under us as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/netlink/af_netlink.c | 52 +++++++++++++++---------------------------------
 1 file changed, 16 insertions(+), 36 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ef5f77b..a64680a 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -525,14 +525,14 @@ out:
 	return err;
 }
 
-static void netlink_frame_flush_dcache(const struct nl_mmap_hdr *hdr)
+static void netlink_frame_flush_dcache(const struct nl_mmap_hdr *hdr, unsigned int nm_len)
 {
 #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE == 1
 	struct page *p_start, *p_end;
 
 	/* First page is flushed through netlink_{get,set}_status */
 	p_start = pgvec_to_page(hdr + PAGE_SIZE);
-	p_end   = pgvec_to_page((void *)hdr + NL_MMAP_HDRLEN + hdr->nm_len - 1);
+	p_end   = pgvec_to_page((void *)hdr + NL_MMAP_HDRLEN + nm_len - 1);
 	while (p_start <= p_end) {
 		flush_dcache_page(p_start);
 		p_start++;
@@ -714,24 +714,16 @@ static int netlink_mmap_sendmsg(struct sock *sk, struct msghdr *msg,
 	struct nl_mmap_hdr *hdr;
 	struct sk_buff *skb;
 	unsigned int maxlen;
-	bool excl = true;
 	int err = 0, len = 0;
 
-	/* Netlink messages are validated by the receiver before processing.
-	 * In order to avoid userspace changing the contents of the message
-	 * after validation, the socket and the ring may only be used by a
-	 * single process, otherwise we fall back to copying.
-	 */
-	if (atomic_long_read(&sk->sk_socket->file->f_count) > 1 ||
-	    atomic_read(&nlk->mapped) > 1)
-		excl = false;
-
 	mutex_lock(&nlk->pg_vec_lock);
 
 	ring   = &nlk->tx_ring;
 	maxlen = ring->frame_size - NL_MMAP_HDRLEN;
 
 	do {
+		unsigned int nm_len;
+
 		hdr = netlink_current_frame(ring, NL_MMAP_STATUS_VALID);
 		if (hdr == NULL) {
 			if (!(msg->msg_flags & MSG_DONTWAIT) &&
@@ -739,35 +731,23 @@ static int netlink_mmap_sendmsg(struct sock *sk, struct msghdr *msg,
 				schedule();
 			continue;
 		}
-		if (hdr->nm_len > maxlen) {
+
+		nm_len = ACCESS_ONCE(hdr->nm_len);
+		if (nm_len > maxlen) {
 			err = -EINVAL;
 			goto out;
 		}
 
-		netlink_frame_flush_dcache(hdr);
+		netlink_frame_flush_dcache(hdr, nm_len);
 
-		if (likely(dst_portid == 0 && dst_group == 0 && excl)) {
-			skb = alloc_skb_head(GFP_KERNEL);
-			if (skb == NULL) {
-				err = -ENOBUFS;
-				goto out;
-			}
-			sock_hold(sk);
-			netlink_ring_setup_skb(skb, sk, ring, hdr);
-			NETLINK_CB(skb).flags |= NETLINK_SKB_TX;
-			__skb_put(skb, hdr->nm_len);
-			netlink_set_status(hdr, NL_MMAP_STATUS_RESERVED);
-			atomic_inc(&ring->pending);
-		} else {
-			skb = alloc_skb(hdr->nm_len, GFP_KERNEL);
-			if (skb == NULL) {
-				err = -ENOBUFS;
-				goto out;
-			}
-			__skb_put(skb, hdr->nm_len);
-			memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
-			netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
+		skb = alloc_skb(nm_len, GFP_KERNEL);
+		if (skb == NULL) {
+			err = -ENOBUFS;
+			goto out;
 		}
+		__skb_put(skb, nm_len);
+		memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, nm_len);
+		netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
 
 		netlink_increment_head(ring);
 
@@ -813,7 +793,7 @@ static void netlink_queue_mmaped_skb(struct sock *sk, struct sk_buff *skb)
 	hdr->nm_pid	= NETLINK_CB(skb).creds.pid;
 	hdr->nm_uid	= from_kuid(sk_user_ns(sk), NETLINK_CB(skb).creds.uid);
 	hdr->nm_gid	= from_kgid(sk_user_ns(sk), NETLINK_CB(skb).creds.gid);
-	netlink_frame_flush_dcache(hdr);
+	netlink_frame_flush_dcache(hdr, hdr->nm_len);
 	netlink_set_status(hdr, NL_MMAP_STATUS_VALID);
 
 	NETLINK_CB(skb).flags |= NETLINK_SKB_DELIVERED;
-- 
2.1.0

^ permalink raw reply related

* Re: net: integer overflow in ip_idents_reserve
From: Hannes Frederic Sowa @ 2014-12-16 23:09 UTC (permalink / raw)
  To: Eric Dumazet, Sasha Levin
  Cc: David S. Miller, LKML, netdev, Andrey Ryabinin, Dave Jones
In-Reply-To: <1418766460.9773.48.camel@edumazet-glaptop2.roam.corp.google.com>

On Tue, Dec 16, 2014, at 22:47, Eric Dumazet wrote:
> On Tue, 2014-12-16 at 16:19 -0500, Sasha Levin wrote:
> > Hi Eric,
> > 
> > While fuzzing with trinity on a -next kernel with the undefined behaviour
> > sanitizer path, I've observed the following warning in code which was
> > introduced in 04ca6973f7 ("ip: make IP identifiers less predictable"):
> 
> This is a false positive.

Also we compile the whole kernel with -fno-strict-overflow, so every
report of signed overflow leading to undefined behavior is probably a
false positive. I don't know if it is worth to try to get rid of them, I
doubt it.

Bye,
Hannes

^ permalink raw reply

* Re: net: integer overflow in ip_idents_reserve
From: Eric Dumazet @ 2014-12-16 23:22 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Sasha Levin, David S. Miller, LKML, netdev, Andrey Ryabinin,
	Dave Jones
In-Reply-To: <1418771356.3449499.203748285.4B1A82B8@webmail.messagingengine.com>

On Wed, 2014-12-17 at 00:09 +0100, Hannes Frederic Sowa wrote:


> Also we compile the whole kernel with -fno-strict-overflow, so every
> report of signed overflow leading to undefined behavior is probably a
> false positive. I don't know if it is worth to try to get rid of them, I
> doubt it.

Presumably we could have uatomic_t , or atomic_u32_t, whatever...

This particular xadd() is heavily hit in some cases, we really do not
want a cmpxchg()

^ permalink raw reply

* Re: Netlink mmap tx security?
From: Daniel Borkmann @ 2014-12-16 23:58 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, luto, torvalds, kaber, netdev
In-Reply-To: <20141216.175817.576861457076632402.davem@davemloft.net>

On 12/16/2014 11:58 PM, David Miller wrote:
> From: Thomas Graf <tgraf@suug.ch>
> Date: Thu, 16 Oct 2014 08:07:53 +0100
>
>> On 10/16/14 at 08:45am, Daniel Borkmann wrote:
>>> On 10/15/2014 04:09 AM, David Miller wrote:
>>>> That would work as well.
>>>>
>>>> There are pros and cons to all of these approaches.
>>>>
>>>> I was thinking that if we do the "TX mmap --> copy to kernel buffer"
>>>> approach, then if in the future we find a way to make it work
>>>> reliably, we can avoid the copy.  And frankly performance wise it's no
>>>> worse than what happens via normal sendmsg() calls.
>>>>
>>>> And all applications using NETLINK_RX_RING keep working and keep
>>>> getting the performance boost.
>>>
>>> That would be better, yes. This would avoid having such a TPACKET_V*
>>> API chaos we have in packet sockets if this could be fixed for netlink
>>> eventually.
>>
>> Only saw the second part of Dave's message now. I agree that this
>> is even a better option.
>
> Sorry for dropping the ball on this, here is the patch I have right
> now, any comments?
>
> ====================
> [PATCH] netlink: Always copy on mmap TX.
>
> Checking the file f_count and the nlk->mapped count is not completely
> sufficient to prevent the mmap'd area contents from changing from
> under us during netlink mmap sendmsg() operations.
>
> Be careful to sample the header's length field only once, because this
> could change from under us as well.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>

Looks good to me, thanks for following up on this Dave!

Now this also leaves NETLINK_SKB_TX unused afterwards, so we should
get rid of these bits, too. It's only set in the broken 'exclusive'
path and tested for resetting the slot status in the skb destructor,
but _now_ we set it back immediately to NL_MMAP_STATUS_UNUSED instead
of a NL_MMAP_STATUS_RESERVED -> NL_MMAP_STATUS_UNUSED transition via
destructor.

The atomic (tx)ring->pending logic inside the send path seems also
unused now; similarly as in, resp. taken from packet sockets, it was
there to wait for completion (for blocking syscalls) until the
netlink_skb_destructor() has been invoked and thus the slot given
back to user space.

Anyways, above could probably be followed-up; other than that:

Fixes: 5fd96123ee19 ("netlink: implement memory mapped sendmsg()")
Acked-by: Daniel Borkmann <dborkman@redhat.com>

^ permalink raw reply

* Re: Netlink mmap tx security?
From: Eric Dumazet @ 2014-12-17  0:02 UTC (permalink / raw)
  To: David Miller; +Cc: tgraf, dborkman, luto, torvalds, kaber, netdev
In-Reply-To: <20141216.175817.576861457076632402.davem@davemloft.net>

On Tue, 2014-12-16 at 17:58 -0500, David Miller wrote:

> +		__skb_put(skb, nm_len);
> +		memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, nm_len);
> +		netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);
>  

Not related to this patch, but it looks like netlink_set_status()
barrier is wrong ?

It seems we need smp_wmb() after the memcpy() and before the
hdr->nm_status = status; 

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Alexei Starovoitov @ 2014-12-17  0:15 UTC (permalink / raw)
  To: Martin Lau
  Cc: Eric Dumazet, Blake Matheny, Laurent Chavey, Yuchung Cheng,
	netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team

On Tue, Dec 16, 2014 at 10:28 AM, Martin Lau <kafai@fb.com> wrote:
> We can consider to reuse the events's format (tracing/events/*/format). I think
> blktrace.c is using similar approach in trace-cmd.

yes. tcp_trace is a carbon copy of blktrace applied to tcp.

>> >> I think systemtap like scripting on top of patches 1 and 3
>> >> should solve your use case ?
> We have quite a few different versions running in the production.  It may not
> be operationally easy.

different versions of kernel or different versions of tcp_tracer ?

> Having a getsockopt will be useful for the new application/library to take
> advantage of.
>
> For the continuous monitoring/logging purpose, ftrace can provide event
> triggered tracing instead of periodically consulting ss.

so both getsockopt tcp_info approach and ftrace+tcp_trace
approach can provide the same set of stats per flow, right?
And the only difference is 'ss' needs polling and ftrace
collects all events?
Since they're stats anyway, the polling interval
shouldn't matter. Just like lost trace events?

from patch 5 commit log:
"Define probes and register them to the TCP tracepoints.  The probes
collect the data defined in struct tcp_sk_trace and record them to
the tracing's ring_buffer.
"
so two trace_seq_printf() from patch 5
and two new 'struct tcp_trace_stats' and 'tcp_trace_basic'
from patch 4 will become permanent user api.

At the same time the commit log is saying:
"It is still missing a few things that
we currently have, like:
- why the sender is blocked? and how long for each reason?
- some TCP Congestion Control data"

Does it mean that these printf and structs would have
to change?

Can 'struct tcp_info' be extended instead of
adding 'struct tcp_trace_stats' ?

Then getsockopt and ftrace+tcp_trace will be returning
the same structs.

It feels that for stats collection only, tracepoints+tcp_trace
do not add much additional value vs extending tcp_info
and using ss.

I see the value in tracepoints on its own, since we'll
be able to use dynamic tracing to do event aggregation,
filtering, etc. That was my alternative suggestion to
add only tracepoints from patches 1 and 3.

^ permalink raw reply

* Re: net: integer overflow in ip_idents_reserve
From: Sasha Levin @ 2014-12-17  1:15 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Eric Dumazet
  Cc: David S. Miller, LKML, netdev, Andrey Ryabinin, Dave Jones
In-Reply-To: <1418771356.3449499.203748285.4B1A82B8@webmail.messagingengine.com>

On 12/16/2014 06:09 PM, Hannes Frederic Sowa wrote:
> 
> On Tue, Dec 16, 2014, at 22:47, Eric Dumazet wrote:
>> > On Tue, 2014-12-16 at 16:19 -0500, Sasha Levin wrote:
>>> > > Hi Eric,
>>> > > 
>>> > > While fuzzing with trinity on a -next kernel with the undefined behaviour
>>> > > sanitizer path, I've observed the following warning in code which was
>>> > > introduced in 04ca6973f7 ("ip: make IP identifiers less predictable"):
>> > 
>> > This is a false positive.
> Also we compile the whole kernel with -fno-strict-overflow, so every
> report of signed overflow leading to undefined behavior is probably a
> false positive. I don't know if it is worth to try to get rid of them, I
> doubt it.

I reported this one because there's usually some code to handle overflow
in code that expects that and here there was none (I could see).

For example, the ntp code had a few cases where a user could generate
overflows and mess up quite a few things (he got what he asked for -
problems).

Thanks,
Sasha

^ permalink raw reply

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Martin Lau @ 2014-12-17  1:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Blake Matheny, Laurent Chavey, Yuchung Cheng,
	netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team
In-Reply-To: <CAADnVQKtEfrEpAqMxf_aOmHmF4WeOQUD-zZpVYR9jnQUjHQgVg@mail.gmail.com>

On Tue, Dec 16, 2014 at 04:15:24PM -0800, Alexei Starovoitov wrote:
> On Tue, Dec 16, 2014 at 10:28 AM, Martin Lau <kafai@fb.com> wrote:
> > We can consider to reuse the events's format (tracing/events/*/format). I think
> > blktrace.c is using similar approach in trace-cmd.
> 
> yes. tcp_trace is a carbon copy of blktrace applied to tcp.
> 
> >> >> I think systemtap like scripting on top of patches 1 and 3
> >> >> should solve your use case ?
> > We have quite a few different versions running in the production.  It may not
> > be operationally easy.
> 
> different versions of kernel or different versions of tcp_tracer ?
Former and we are releasing new kernel pretty often.

> 
> > Having a getsockopt will be useful for the new application/library to take
> > advantage of.
> >
> > For the continuous monitoring/logging purpose, ftrace can provide event
> > triggered tracing instead of periodically consulting ss.
> 
> so both getsockopt tcp_info approach and ftrace+tcp_trace
> approach can provide the same set of stats per flow, right?
> And the only difference is 'ss' needs polling and ftrace
> collects all events?
> Since they're stats anyway, the polling interval
> shouldn't matter. Just like lost trace events?
> 
> from patch 5 commit log:
> "Define probes and register them to the TCP tracepoints.  The probes
> collect the data defined in struct tcp_sk_trace and record them to
> the tracing's ring_buffer.
> "
> so two trace_seq_printf() from patch 5
> and two new 'struct tcp_trace_stats' and 'tcp_trace_basic'
> from patch 4 will become permanent user api.
> 
> At the same time the commit log is saying:
> "It is still missing a few things that
> we currently have, like:
> - why the sender is blocked? and how long for each reason?
> - some TCP Congestion Control data"
> 
> Does it mean that these printf and structs would have
> to change?
How does the current TRACE_EVENT do it when it wants to printf more data?

> Can 'struct tcp_info' be extended instead of
> adding 'struct tcp_trace_stats' ?
> Then getsockopt and ftrace+tcp_trace will be returning
> the same structs.
> 
> It feels that for stats collection only, tracepoints+tcp_trace
> do not add much additional value vs extending tcp_info
> and using ss.
I think we are on the same page. Once 'this should cost nothing if not
activated' proposition was cleared out.  It was what I meant that doing the
collection part in the TCP itself (instead of tracepoints) would be nice.

> I see the value in tracepoints on its own, since we'll
> be able to use dynamic tracing to do event aggregation,
> filtering, etc. That was my alternative suggestion to
> add only tracepoints from patches 1 and 3.

I think going forward, as others have suggested, it may be better to come
together and reach a common ground on what to collect first before I re-work
patch 1 to 3 and repost.

Thanks,
--Martin

^ permalink raw reply

* [PATCH net 2/2] geneve: Fix races between socket add and release.
From: Jesse Gross @ 2014-12-17  2:25 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Andy Zhou
In-Reply-To: <1418783132-99230-1-git-send-email-jesse@nicira.com>

Currently, searching for a socket to add a reference to is not
synchronized with deletion of sockets. This can result in use
after free if there is another operation that is removing a
socket at the same time. Solving this requires both holding the
appropriate lock and checking the refcount to ensure that it
has not already hit zero.

Inspired by a related (but not exactly the same) issue in the
VXLAN driver.

Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver")
CC: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
---
 net/ipv4/geneve.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/geneve.c b/net/ipv4/geneve.c
index 5a47188..95e47c9 100644
--- a/net/ipv4/geneve.c
+++ b/net/ipv4/geneve.c
@@ -296,6 +296,7 @@ struct geneve_sock *geneve_sock_add(struct net *net, __be16 port,
 				    geneve_rcv_t *rcv, void *data,
 				    bool no_share, bool ipv6)
 {
+	struct geneve_net *gn = net_generic(net, geneve_net_id);
 	struct geneve_sock *gs;

 	gs = geneve_socket_create(net, port, rcv, data, ipv6);
@@ -305,15 +306,15 @@ struct geneve_sock *geneve_sock_add(struct net *net, __be16 port,
 	if (no_share)	/* Return error if sharing is not allowed. */
 		return ERR_PTR(-EINVAL);

+	spin_lock(&gn->sock_lock);
 	gs = geneve_find_sock(net, port);
-	if (gs) {
-		if (gs->rcv == rcv)
-			atomic_inc(&gs->refcnt);
-		else
+	if (gs && ((gs->rcv != rcv) ||
+		   !atomic_add_unless(&gs->refcnt, 1, 0)))
 			gs = ERR_PTR(-EBUSY);
-	} else {
+	spin_unlock(&gn->sock_lock);
+
+	if (!gs)
 		gs = ERR_PTR(-EINVAL);
-	}

 	return gs;
 }
-- 
1.9.1

^ permalink raw reply related

* [PATCH net 1/2] geneve: Remove socket and offload handlers at destruction.
From: Jesse Gross @ 2014-12-17  2:25 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Andy Zhou

Sockets aren't currently removed from the the global list when
they are destroyed. In addition, offload handlers need to be cleaned
up as well.

Fixes: 0b5e8b8e ("net: Add Geneve tunneling protocol driver")
CC: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
---
 net/ipv4/geneve.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/net/ipv4/geneve.c b/net/ipv4/geneve.c
index a457232..5a47188 100644
--- a/net/ipv4/geneve.c
+++ b/net/ipv4/geneve.c
@@ -159,6 +159,15 @@ static void geneve_notify_add_rx_port(struct geneve_sock *gs)
 	}
 }
 
+static void geneve_notify_del_rx_port(struct geneve_sock *gs)
+{
+	struct sock *sk = gs->sock->sk;
+	sa_family_t sa_family = sk->sk_family;
+
+	if (sa_family == AF_INET)
+		udp_del_offload(&gs->udp_offloads);
+}
+
 /* Callback from net/ipv4/udp.c to receive packets */
 static int geneve_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 {
@@ -312,9 +321,17 @@ EXPORT_SYMBOL_GPL(geneve_sock_add);
 
 void geneve_sock_release(struct geneve_sock *gs)
 {
+	struct net *net = sock_net(gs->sock->sk);
+	struct geneve_net *gn = net_generic(net, geneve_net_id);
+
 	if (!atomic_dec_and_test(&gs->refcnt))
 		return;
 
+	spin_lock(&gn->sock_lock);
+	hlist_del_rcu(&gs->hlist);
+	geneve_notify_del_rx_port(gs);
+	spin_unlock(&gn->sock_lock);
+
 	queue_work(geneve_wq, &gs->del_work);
 }
 EXPORT_SYMBOL_GPL(geneve_sock_release);
-- 
1.9.1

^ permalink raw reply related

* Re: [RFC PATCH net-next 0/5] tcp: TCP tracer
From: Alexei Starovoitov @ 2014-12-17  3:06 UTC (permalink / raw)
  To: Martin Lau
  Cc: Eric Dumazet, Blake Matheny, Laurent Chavey, Yuchung Cheng,
	netdev@vger.kernel.org, David S. Miller, Hannes Frederic Sowa,
	Steven Rostedt, Lawrence Brakmo, Josef Bacik, Kernel Team

On Tue, Dec 16, 2014 at 5:30 PM, Martin Lau <kafai@fb.com> wrote:
>> >> >> I think systemtap like scripting on top of patches 1 and 3
>> >> >> should solve your use case ?
>> > We have quite a few different versions running in the production.  It may not
>> > be operationally easy.
>>
>> different versions of kernel or different versions of tcp_tracer ?
> Former and we are releasing new kernel pretty often.

I see. So for dynamic tracer to be useful in such environment,
the scripts should be compatible across different kernel version
without recompilation. All makes sense.

> How does the current TRACE_EVENT do it when it wants to printf more data?

tracepoints, like any other user interface, shouldn't
break compatibility. With printf it's practically impossible.
Some subsystems may be breaking this rule arguing that
tracepoints is a debug facility, but networking tracepoints don't change.

>> It feels that for stats collection only, tracepoints+tcp_trace
>> do not add much additional value vs extending tcp_info
>> and using ss.
> I think we are on the same page. Once 'this should cost nothing if not
> activated' proposition was cleared out.  It was what I meant that doing the
> collection part in the TCP itself (instead of tracepoints) would be nice.

agree.

> I think going forward, as others have suggested, it may be better to come
> together and reach a common ground on what to collect first before I re-work
> patch 1 to 3 and repost.

I think as a minimum it will be discussed at netdev01 in Feb,
but I suspect not everyone on this list can(want) go to Ottawa,
so would be nice to have a meetup for bay area folks to
discuss this sooner with public g+ hangout.
Thoughts?

^ permalink raw reply

* Re: [PATCH 0/5] tun/macvtap: TUNSETIFF fixes
From: Jason Wang @ 2014-12-17  3:11 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, David Miller, netdev, Dan Carpenter
In-Reply-To: <1418732988-3535-1-git-send-email-mst@redhat.com>



On Tue, Dec 16, 2014 at 9:04 PM, Michael S. Tsirkin <mst@redhat.com> 
wrote:
> Dan Carpenter reported the following:
> 	static checker warning:
> 
> 		drivers/net/tun.c:1694 tun_set_iff()
> 		warn: 0x17100 is larger than 16 bits
> 
> 	drivers/net/tun.c
> 	  1692
> 	  1693          tun->flags = (tun->flags & ~TUN_FEATURES) |
> 	  1694                  (ifr->ifr_flags & TUN_FEATURES);
> 	  1695
> 
> 	It's complaining because the "ifr->ifr_flags" variable is a short
> 	(should it be unsigned?).  The new define:
> 
> 	#define IFF_VNET_LE    0x10000
> 
> 	doesn't fit in two bytes.  Other suspect looking code could be:
> 
> 		return __virtio16_to_cpu(q->flags & IFF_VNET_LE, val);
> 
> And that's true: we have run out of IFF flags in tun.

I don't have objections on this series.
Just note that we still have several bits available.
> 
> So let's not try to add more: add simple GET/SET ioctls
> instead. Easy to test, leads to clear semantics.
> 
> Alternatively we'll have to revert the whole thing for 3.19,
> but that seems more work as this has dependencies
> in other places.
> 
> While here, I noticed that macvtap was actually reading
> ifreq flags as a 32 bit field.
> Fix that up as well.
> 
> Michael S. Tsirkin (5):
>   macvtap: fix uninitialized access on TUNSETIFF
>   if_tun: add TUNSETVNETLE/TUNGETVNETLE
>   tun: drop broken IFF_VNET_LE
>   macvtap: drop broken IFF_VNET_LE
>   if_tun: drop broken IFF_VNET_LE
> 
>  include/uapi/linux/if_tun.h |  3 ++-
>  drivers/net/macvtap.c       | 30 ++++++++++++++++++++++++------
>  drivers/net/tun.c           | 26 +++++++++++++++++++++++---
>  3 files changed, 49 insertions(+), 10 deletions(-)
> 
> -- 
> MST
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 3/5] tun: drop broken IFF_VNET_LE
From: Jason Wang @ 2014-12-17  3:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Miller, netdev, Dan Carpenter, Herbert Xu,
	Tom Herbert, Ben Hutchings, Xi Wang, Masatake YAMATO
In-Reply-To: <1418732988-3535-4-git-send-email-mst@redhat.com>



On Tue, Dec 16, 2014 at 9:05 PM, Michael S. Tsirkin <mst@redhat.com> 
wrote:
> Use TUNSETVNETLE/TUNGETVNETLE instead.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/net/tun.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index c052bd6b..e3e8a0e 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -109,9 +109,11 @@ do {								\
>   * overload it to mean fasync when stored there.
>   */
>  #define TUN_FASYNC	IFF_ATTACH_QUEUE
> +/* High bits in flags field are unused. */
> +#define TUN_VNET_LE     0x80000000
>  
>  #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
> -		      IFF_VNET_LE | IFF_MULTI_QUEUE)
> +		      IFF_MULTI_QUEUE)
>  #define GOODCOPY_LEN 128
>  
>  #define FLT_EXACT_COUNT 8
> @@ -207,12 +209,12 @@ struct tun_struct {
>  
>  static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 
> val)
>  {
> -	return __virtio16_to_cpu(tun->flags & IFF_VNET_LE, val);
> +	return __virtio16_to_cpu(tun->flags & TUN_VNET_LE, val);
>  }
>  
>  static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 
> val)
>  {
> -	return __cpu_to_virtio16(tun->flags & IFF_VNET_LE, val);
> +	return __cpu_to_virtio16(tun->flags & TUN_VNET_LE, val);
>  }
>  
>  static inline u32 tun_hashfn(u32 rxhash)
> @@ -1853,6 +1855,7 @@ static long __tun_chr_ioctl(struct file *file, 
> unsigned int cmd,
>  	int sndbuf;
>  	int vnet_hdr_sz;
>  	unsigned int ifindex;
> +	int le;
>  	int ret;
>  
>  	if (cmd == TUNSETIFF || cmd == TUNSETQUEUE || _IOC_TYPE(cmd) == 
> 0x89) {
> @@ -2052,6 +2055,23 @@ static long __tun_chr_ioctl(struct file *file, 
> unsigned int cmd,
>  		tun->vnet_hdr_sz = vnet_hdr_sz;
>  		break;
>  
> +	case TUNGETVNETLE:
> +		le = !!(tun->flags & TUN_VNET_LE);
> +		if (put_user(le, (int __user *)argp))
> +			ret = -EFAULT;
> +		break;
> +
> +	case TUNSETVNETLE:
> +		if (get_user(le, (int __user *)argp)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +		if (le)
> +			tun->flags |= TUN_VNET_LE;
> +		else
> +			tun->flags &= ~TUN_VNET_LE;
> +		break;
> +

A little bit different from persistent devices:

- TUNSETPERSIST check argp instead
- Userspace may check persist flags through TUNGETIFF

Probably this patch may needs more modifications on userspace.
> 
>  	case TUNATTACHFILTER:
>  		/* Can be set only for TAPs */
>  		ret = -EINVAL;
> -- 
> MST
> 
> --
> To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH net-next 2/3] Implementation of RFC 4898 Extended TCP Statistics (Web10G)
From: Andi Kleen @ 2014-12-17  3:44 UTC (permalink / raw)
  To: rapier; +Cc: netdev
In-Reply-To: <549070D3.5050808@psc.edu>

rapier <rapier@psc.edu> writes:
> +
> +void tcp_estats_update_rtt(struct sock *sk, unsigned long rtt_sample)
> +{
> +	struct tcp_estats *stats = tcp_sk(sk)->tcp_stats;
> +	struct tcp_estats_path_table *path_table = stats->tables.path_table;
> +	unsigned long rtt_sample_msec = rtt_sample/1000;
> +	u32 rto;
> +
> +	if (path_table == NULL)
> +		return;
> +
> +	path_table->SampleRTT = rtt_sample_msec;
> +
> +	if (rtt_sample_msec > path_table->MaxRTT)
> +		path_table->MaxRTT = rtt_sample_msec;
> +	if (rtt_sample_msec < path_table->MinRTT)
> +		path_table->MinRTT = rtt_sample_msec;
> +
> +	path_table->CountRTT++;
> +	path_table->SumRTT += rtt_sample_msec;
> +
> +	rto = jiffies_to_msecs(inet_csk(sk)->icsk_rto);
> +	if (rto > path_table->MaxRTO)
> +		path_table->MaxRTO = rto;
> +	if (rto < path_table->MinRTO)
> +		path_table->MinRTO = rto;

Looking through your hooks it seem that many basically do simple
value profiling in a very open coded way.

Perhaps you could simplify things a lot by just having a couple of trace
points for these values (e.g. trace_change_rtt). Then have a library
of different data profiling types.

Then you could register a new value oriented trace point type with
different backend for whatever you currently need from the value: like
min/max/avg/ or full histogram or even reservoir sampling or EWMA.

I guess such a generic infrastructure would be useful elsewhere too.

One challenge would be how to associate such value profiles with
sockets, but I'm sure this could be done in some nice generic
way too.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply

* panic related to b44_poll
From: Shivaram Lingamneni @ 2014-12-17  5:20 UTC (permalink / raw)
  To: netdev; +Cc: zambrano

I'm experiencing a kernel panic, which I believe is caused by the b44
driver and triggered by the b44_poll call. Here are some pictures of
the panic on the 3.18 mainline kernel:

http://i.imgur.com/v4YPMei.jpg
http://i.imgur.com/8b6Sttw.jpg

Some additional information, including `lshw` output and a theory
about the conditions that cause the panic, is on the kernel bugzilla
issue here:

https://bugzilla.kernel.org/show_bug.cgi?id=89611

I have some additional photos with varying call traces on the RH
bugzilla issue here:

https://bugzilla.redhat.com/show_bug.cgi?id=1147321

however, the traces linked there are against the Fedora kernel, not
against the mainline kernel.

I'm not subscribed to the netdev list, so please cc me on responses.
Thanks very much for your time!

^ permalink raw reply

* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Roopa Prabhu @ 2014-12-17  5:51 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: John Fastabend, Hubert Sokolowski, netdev@vger.kernel.org,
	Vlad Yasevich
In-Reply-To: <54902E5E.2070405@mojatatu.com>

On 12/16/14, 5:06 AM, Jamal Hadi Salim wrote:
> On 12/15/14 19:45, John Fastabend wrote:
>> On 12/15/2014 06:29 AM, Jamal Hadi Salim wrote:
>
>>
>> hmm good question. When I implemented this on the host nics with SR-IOV,
>> VMDQ, etc. The multi/unicast addresses were propagated into the FDB by
>> the driver.
>
> So if i understand correctly, this is a NIC with an FDB. And there is no
> concept of a bridge to which it is attached. To the point of
> classical uni/multicast addresses on a netdev abstraction; these
> are typically stored in *much simpler tables* (used to be IO
> registers back in the day)
> Do these NICs not have such a concept?
> An fdb entry has an egress port column; I have seen cases where the
> port is labeled as "Cpu port" which would mean it belongs to the host;
> but in this case it just seems there is no such concept and as Or
> brought up in another email - what does "VLANid" mean in such a case?
> If we go with a CPU port concept,
> We could then use the concept of a vlan filter on a port basis
> but then what happens when you dont have an fdb (majority of cases)?
>
>> My logic was if some netdev ethx has a set of MAC addresses
>> above it well then any virtual function or virtual device also behind
>> the hardware shouldn't be sending those addresses out the egress switch
>> facing port. Otherwise the switch will see packets it knows are behind
>> that port and drop them. Or flood them if it hasn't learned the address
>> yet. Either way they will never get to the right netdev.
>>
>> Admittedly I wasn't thinking about switches with many ports at the time.
>>
>
> I often struggle with trying to "box" SRIOV into some concept of a
> switch abstraction and sometimes i am puzzled.
> Would exposing the SRIOV underlay as a switch not have solved this
> problem? Then the virtual ports essentially are bridge ports.
> Maybe what we need is a concept of a "edge relay" extended netdev?
> These things would have an fdb as well down and uplink relay ports that
> can be attached to them.
>
>
>>> Some of these drivers may be just doing the LinuxWay(aka cutnpaste what
>>> the other driver did).
>>
>> My original thinking here was... if it didn't implement fdb_add, fdb_del
>> and fdb_dump then if you wanted to think of it as having forwarding
>> database that was fine but it was really just a two port mac relay. In
>> which case just dump all the mac addresses it knows about. In this case
>> if it was something more fancy it could do its own dump like vxlan or
>> macvlan.
>>
>
> The challenge here is lack of separation between a NICs uni/multicast
> ports which it owns - which is a traditional operation regardless of
> what capabilities the NIC has; vs an fdb which has may have many
> other capabilities. Probably all NICs capable of many MACs implement
> fdbs?
>
>> For a host nic ucast/multicast and fdb are the same, I think? The
>> code we had was just short-hand to allow the common case a host nic
>> to work. Notice vxlan and bridge drivers didn't dump there addr lists
>> from fdb_dump until your patch.
>>
>> Perhaps my implementation of macvlan fdb_{add|del|dump} is buggy. And
>> I shouldn't overload the addr lists.
>>
>
> Not just those - I am wondering about the general utility of what
> Hubert was trying to do if all the driver does is call the default
> dumper based on some flags presence and the default dumper
> does a dump of uni/multicast host entries. Those are not really fdb
> entries in the traditional sense.
> Is there no way to get the unicast/multicast mac addresses for such
> a driver?
> I think that would help bring clarity to my confusion.
>
>
>>
>> I'm interested to see what Vlad says as well. But the current situation
>> is previously some drivers dumped their addr lists others didn't.
>> Specifically, the more switch like devices (bridge, vxlan) didn't. Now
>> every device will dump the addr lists. I'm not entirely convinced that
>> is correct.
>>
>
> I am glad this happened ;-> Otherwise we wouldnt be having this
> discussion. When Vlad was asking me I was in a rush to get the patch
> out and didnt question because i thought this was something some crazy
> virtualization people needed.
> If Vlad's use case goes away, then Hubert's little restoration is fine.
>
>
>> It works OK for host nics (NICS that can't forward between ports) and
>> seems at best confusing for real switch asics.
>
> So if these NICs have fdb entries and i programmed it (meaning setting
> which port a given MAC should be sent to), would it not work?
>
>> On a related question do
>> you expect the switch asic to trap any packets with MAC addresses in
>> the multi/unicast address lists and send them to the correct netdev? Or
>> will the switch forward them using normal FDB tables?
>>
>
> I think there would be a separate table for that. Roopa, can you check
> with the ASICs you guys work on? 
Jamal, yes, AFAICS, we do have a separate table where we add some static 
entries
indicating send to  CPU (example IPV4 and IPV6 link local multicast) and 
such
packets are sent to the correct netdev

> The point i was trying to make above
> is today there is a uni/multicast list or table of sorts that all NICs
> expose.
> There's always the hack of a "cpu port". I have also seen the "cpu port"
> being conceptualized in L3 tables to imply "next hop is cpu" where you
> have an IP address owned by the host; so maybe we need a concept of a
> cpu port or again the revival of TheThing class device.
>
> cheers,
> jamal
>

^ permalink raw reply

* Re: [PATCH net-next RESEND] net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined.
From: Roopa Prabhu @ 2014-12-17  5:54 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: John Fastabend, Jamal Hadi Salim, Hubert Sokolowski,
	netdev@vger.kernel.org, Vlad Yasevich
In-Reply-To: <549091DB.6050600@intel.com>

On 12/16/14, 12:11 PM, Samudrala, Sridhar wrote:
>
> On 12/16/2014 11:30 AM, Roopa Prabhu wrote:
>> On 12/16/14, 9:21 AM, Samudrala, Sridhar wrote:
>>>
>>> On 12/16/2014 8:35 AM, John Fastabend wrote:
>>>>
>>>>> Is there no way to get the unicast/multicast mac addresses for such
>>>>> a driver?
>>>>
>>>> You can almost infer it from ip link by looking at all the stacked
>>>> drivers and figuring out how the address are propagated down. Then
>>>> look at the routes and figure out multicast address. But other than
>>>> the fdb dump mechanism I don't think there is anything.
>>>
>>> It looks like we can get the device specific unicast/multicast mac 
>>> addresses via 'ip maddr' too.
>> if i remember correctly, 'ip maddr' was only for multicast list. And 
>> there was no way to dump the unicast list until bridge self was 
>> introduced.
>> the only way to dump unicast addresses today is by using the `bridge 
>> fdb show self`
> Yes. 'ip maddr show' only lists the multicast macs as the name 
> suggests. I stand corrected.
> May be we need 'ip uaddr show' to list unicast macs instead of 
> overloading 'bridge fdb show' to show unicast lists.
>
maybe too late for that ..., 'bridge fdb show self' has been out for 
sometime now and in use

^ permalink raw reply

* Re: [PATCH] dm9000: Add regulator and reset support to dm9000
From: Sascha Hauer @ 2014-12-17  6:19 UTC (permalink / raw)
  To: Zubair Lutfullah Kakakhel
  Cc: davem, devicetree, linux-kernel, netdev, paul.burton
In-Reply-To: <1418747624-2682-1-git-send-email-Zubair.Kakakhel@imgtec.com>

Hi Zubair,

Several comments inline.

On Tue, Dec 16, 2014 at 04:33:44PM +0000, Zubair Lutfullah Kakakhel wrote:
> In boards, the dm9000 chip's power and reset can be controlled by gpio.
> 
> It makes sense to add them to the dm9000 driver and let dt be used to
> enable power and reset the phy.
> 
> Signed-off-by: Zubair Lutfullah Kakakhel <Zubair.Kakakhel@imgtec.com>
> Signed-off-by: Paul Burton <paul.burton@imgtec.com>
> ---
>  .../devicetree/bindings/net/davicom-dm9000.txt     |  4 +++
>  drivers/net/ethernet/davicom/dm9000.c              | 33 ++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/davicom-dm9000.txt b/Documentation/devicetree/bindings/net/davicom-dm9000.txt
> index 28767ed..dba19a2 100644
> --- a/Documentation/devicetree/bindings/net/davicom-dm9000.txt
> +++ b/Documentation/devicetree/bindings/net/davicom-dm9000.txt
> @@ -11,6 +11,8 @@ Required properties:
>  Optional properties:
>  - davicom,no-eeprom : Configuration EEPROM is not available
>  - davicom,ext-phy : Use external PHY
> +- reset-gpio : phandle of gpio that will be used to reset chip during probe
> +- vcc-supply : phandle of regulator that will be used to enable power to chip
>  
>  Example:
>  
> @@ -21,4 +23,6 @@ Example:
>  		interrupts = <7 4>;
>  		local-mac-address = [00 00 de ad be ef];
>  		davicom,no-eeprom;
> +		reset-gpio = <&gpf 12 GPIO_ACTIVE_LOW>;
> +		vcc-supply = <&eth0_power>;
>  	};
> diff --git a/drivers/net/ethernet/davicom/dm9000.c b/drivers/net/ethernet/davicom/dm9000.c
> index ef0bb58..7333b8d 100644
> --- a/drivers/net/ethernet/davicom/dm9000.c
> +++ b/drivers/net/ethernet/davicom/dm9000.c
> @@ -36,6 +36,9 @@
>  #include <linux/platform_device.h>
>  #include <linux/irq.h>
>  #include <linux/slab.h>
> +#include <linux/regulator/consumer.h>
> +#include <linux/gpio.h>
> +#include <linux/of_gpio.h>
>  
>  #include <asm/delay.h>
>  #include <asm/irq.h>
> @@ -1426,11 +1429,41 @@ dm9000_probe(struct platform_device *pdev)
>  	struct dm9000_plat_data *pdata = dev_get_platdata(&pdev->dev);
>  	struct board_info *db;	/* Point a board information structure */
>  	struct net_device *ndev;
> +	struct device *dev = &pdev->dev;
>  	const unsigned char *mac_src;
>  	int ret = 0;
>  	int iosize;
>  	int i;
>  	u32 id_val;
> +	int reset_gpio;
> +	enum of_gpio_flags flags;
> +	struct regulator *power;
> +
> +	power = devm_regulator_get(dev, "vcc");
> +	if (IS_ERR(power)) {
> +		dev_dbg(dev, "no regulator provided\n");

You have to check for errors here. The return value can be -EPROBE_DEFER
in which case you have to return -EPROBE_DEFER from the driver and try
again later.

> +	} else if (!regulator_is_enabled(power)) {

You must enable the regulator unconditionally to increase the reference
counter. When this regulator is turned on because another consumer
enabled it then this other consumer can turn it off and the dm9000 stops
working.

> +		ret = regulator_enable(power);
> +		dev_dgb(dev, "regulator enabled\n");
> +	}
> +
> +	reset_gpio = of_get_named_gpio_flags(dev->of_node, "reset-gpio", 0,
> +					     &flags);

Should be reset-gpios (plural). For some reason this is the established
binding.

> +	if (gpio_is_valid(reset_gpio)) {
> +		ret = devm_gpio_request_one(dev, reset_gpio, flags,
> +					    "dm9000_reset");
> +		if (ret) {
> +			dev_err(dev, "failed to request reset gpio %d: %d\n",
> +				reset_gpio, ret);

I think this is fatal. When A gpio is not registered for this device
then this is fine, but when it's registered and you can't get it then
it's fatal.

> +		} else {
> +			gpio_direction_output(reset_gpio, 0);
> +			/* According to manual PWRST# Low Period Min 1ms */
> +			msleep(2);
> +			gpio_direction_output(reset_gpio, 1);

No need to set the direction again. gpio_set_value should be suffice
here.

Sascha

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply

* [PATCH iproute2] ip link: use addattr_nest()/addattr_nest_end()
From: Duan Jiong @ 2014-12-17  7:28 UTC (permalink / raw)
  To: stephen hemminger; +Cc: netdev


Use addattr_nest() and addattr_nest_end() to simplify the code.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
---
 ip/iplink.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index ce6eb3e..3ce5e39 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -706,11 +706,11 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 	}
 
 	if (type) {
-		struct rtattr *linkinfo = NLMSG_TAIL(&req.n);
+		struct rtattr *linkinfo;
 		char slavebuf[128], *ulinep = strchr(type, '_');
 		int iflatype;
 
-		addattr_l(&req.n, sizeof(req), IFLA_LINKINFO, NULL, 0);
+		linkinfo = addattr_nest(&req.n, sizeof(req), IFLA_LINKINFO);
 		addattr_l(&req.n, sizeof(req), IFLA_INFO_KIND, type,
 			 strlen(type));
 
@@ -728,14 +728,13 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 			iflatype = IFLA_INFO_DATA;
 		}
 		if (lu && argc) {
-			struct rtattr * data = NLMSG_TAIL(&req.n);
-			addattr_l(&req.n, sizeof(req), iflatype, NULL, 0);
+			struct rtattr *data = addattr_nest(&req.n, sizeof(req), iflatype);
 
 			if (lu->parse_opt &&
 			    lu->parse_opt(lu, argc, argv, &req.n))
 				return -1;
 
-			data->rta_len = (void *)NLMSG_TAIL(&req.n) - (void *)data;
+			addattr_nest_end(&req.n, data);
 		} else if (argc) {
 			if (matches(*argv, "help") == 0)
 				usage();
@@ -743,7 +742,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 					"Try \"ip link help\".\n", *argv);
 			return -1;
 		}
-		linkinfo->rta_len = (void *)NLMSG_TAIL(&req.n) - (void *)linkinfo;
+		addattr_nest_end(&req.n, linkinfo);
 	} else if (flags & NLM_F_CREATE) {
 		fprintf(stderr, "Not enough information: \"type\" argument "
 				"is required\n");
-- 
1.8.3.1

^ permalink raw reply related

* RE: Please update van uw Nederlandse e-mail
From: Hickey_Patrick @ 2014-12-17  7:48 UTC (permalink / raw)
  To: Hickey_Patrick
In-Reply-To: <4D40948C871CEF439CFB261A8DE84D5BD47B2863@ITEXCH02.asdk12.org>

________________________________
From: Hickey_Patrick
Sent: Tuesday, December 16, 2014 10:00 PM
To: Hickey_Patrick
Subject: Please update van uw Nederlandse e-mail

Uw postvak heeft overschreden opslaglimiet die wordt ingesteld door de beheerder, en u zal niet zitten kundig voor opkomend posterijen ontvangen totdat u opnieuw te valideren. Om te valideren opnieuw-> Klik hier
http://andre-post7.wix.com/verificatie-team

^ permalink raw reply

* Re: [PATCH netfilter-next] xt_osf: Use continue to reduce indentation
From: Evgeniy Polyakov @ 2014-12-17  8:51 UTC (permalink / raw)
  To: Joe Perches
  Cc: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	netfilter-devel, netdev, LKML
In-Reply-To: <1418761033.14140.5.camel@perches.com>

Hi everyone

16.12.2014, 23:17, "Joe Perches" <joe@perches.com>:
> Invert logic in test to use continue.
>
> This routine already uses continue, use it a bit more to
> minimize > 80 column long lines and unnecessary indentation.
>
> No change in compiled object file.

Looks good. Thank you.
Which tree should this patch go through? Please pull it in.

Acked-by: Evgeniy Polyakov <zbr@ioremap.net>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox