Netdev List
 help / color / mirror / Atom feed
* Re: [Lsf] [Lsf-pc] [LSF/MM TOPIC] Generic page-pool recycle facility?
From: Alexei Starovoitov @ 2016-04-11 22:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Sagi Grimberg, Bart Van Assche, Christoph Hellwig,
	James Bottomley, Tom Herbert, Brenden Blanco,
	lsf@lists.linux-foundation.org, linux-mm, netdev@vger.kernel.org,
	lsf-pc@lists.linux-foundation.org
In-Reply-To: <20160411234157.3fc9c6fe@redhat.com>

On Mon, Apr 11, 2016 at 11:41:57PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Sun, 10 Apr 2016 21:45:47 +0300 Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> > >> This is also very interesting for storage targets, which face the same
> > >> issue.  SCST has a mode where it caches some fully constructed SGLs,
> > >> which is probably very similar to what NICs want to do.  
> > >
> > > I think a cached allocator for page sets + the scatterlists that
> > > describe these page sets would not only be useful for SCSI target
> > > implementations but also for the Linux SCSI initiator. Today the scsi-mq
> > > code reserves space in each scsi_cmnd for a scatterlist of
> > > SCSI_MAX_SG_SEGMENTS. If scatterlists would be cached together with page
> > > sets less memory would be needed per scsi_cmnd.  
> > 
> > If we go down this road how about also attaching some driver opaques
> > to the page sets?
> 
> That was the ultimate plan... to leave some opaques bytes left in the
> page struct that drivers could use.
> 
> In struct page I would need a pointer back to my page_pool struct and a
> page flag.  Then, I would need room to store the dma_unmap address.
> (And then some of the usual fields are still needed, like the refcnt,
> and reusing some of the list constructs).  And a zero-copy cross-domain
> id.

I don't think we need to add anything to struct page.
This is supposed to be small cache of dma_mapped pages with lockless access.
It can be implemented as an array or link list where every element
is dma_addr and pointer to page. If it is full, dma_unmap_page+put_page to
send it to back to page allocator.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH v2 net 0/4] ipv6: datagram: Update dst cache of a connected udp sk during pmtu update
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team

v2:
~ Protect __sk_dst_get() operations with rcu_read_lock in
  release_cb() because another thread may do ip6_dst_store()
  for a udp sk without taking the sk lock (e.g. in sendmsg).
~ Do a ipv6_addr_v4mapped(&sk->sk_v6_daddr) check before
  calling ip6_datagram_dst_update() in patch 3 and 4.  It is
  similar to how __ip6_datagram_connect handles it.
~ One fix in ip6_datagram_dst_update() in patch 2.  It needs
  to check (np->flow_label & IPV6_FLOWLABEL_MASK) before
  doing fl6_sock_lookup.  I was confused with the naming
  of IPV6_FLOWLABEL_MASK and IPV6_FLOWINFO_MASK.
~ Check dst->obsolete just on the safe side, although I think it
  should at least have DST_OBSOLETE_FORCE_CHK by now.
~ Add Fixes tag to patch 3 and 4
~ Add some points from the previous discussion about holding
  sk lock to the commit message in patch 3.

v1:
There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
   logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
   sk->sk_dst_cache instead of the newly created RTF_CACHE clone.

Patch 1 and 2 are the prep work.
Patch 3 and 4 are the fixes.

^ permalink raw reply

* [PATCH v2 net 1/4] ipv6: datagram: Refactor flowi6 init codes to a new function
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team
In-Reply-To: <1460413777-177662-1-git-send-email-kafai@fb.com>

Move flowi6 init codes for connected datagram sk to a newly created
function ip6_datagram_flow_key_init().

Notes:
1. fl6_flowlabel is used instead of fl6.flowlabel in __ip6_datagram_connect
2. ipv6_addr_is_multicast(&fl6->daddr) is used instead of
   (addr_type & IPV6_ADDR_MULTICAST) in ip6_datagram_flow_key_init()

This new function will be reused during pmtu update in the later patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
---
 net/ipv6/datagram.c | 50 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 4281621..f07c1dd 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -40,6 +40,30 @@ static bool ipv6_mapped_addr_any(const struct in6_addr *a)
 	return ipv6_addr_v4mapped(a) && (a->s6_addr32[3] == 0);
 }
 
+static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk)
+{
+	struct inet_sock *inet = inet_sk(sk);
+	struct ipv6_pinfo *np = inet6_sk(sk);
+
+	memset(fl6, 0, sizeof(*fl6));
+	fl6->flowi6_proto = sk->sk_protocol;
+	fl6->daddr = sk->sk_v6_daddr;
+	fl6->saddr = np->saddr;
+	fl6->flowi6_oif = sk->sk_bound_dev_if;
+	fl6->flowi6_mark = sk->sk_mark;
+	fl6->fl6_dport = inet->inet_dport;
+	fl6->fl6_sport = inet->inet_sport;
+	fl6->flowlabel = np->flow_label;
+
+	if (!fl6->flowi6_oif)
+		fl6->flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
+
+	if (!fl6->flowi6_oif && ipv6_addr_is_multicast(&fl6->daddr))
+		fl6->flowi6_oif = np->mcast_oif;
+
+	security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
+}
+
 static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
 	struct sockaddr_in6	*usin = (struct sockaddr_in6 *) uaddr;
@@ -52,6 +76,7 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a
 	struct ipv6_txoptions	*opt;
 	int			addr_type;
 	int			err;
+	__be32			fl6_flowlabel = 0;
 
 	if (usin->sin6_family == AF_INET) {
 		if (__ipv6_only_sock(sk))
@@ -66,11 +91,10 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a
 	if (usin->sin6_family != AF_INET6)
 		return -EAFNOSUPPORT;
 
-	memset(&fl6, 0, sizeof(fl6));
 	if (np->sndflow) {
-		fl6.flowlabel = usin->sin6_flowinfo&IPV6_FLOWINFO_MASK;
-		if (fl6.flowlabel&IPV6_FLOWLABEL_MASK) {
-			flowlabel = fl6_sock_lookup(sk, fl6.flowlabel);
+		fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK;
+		if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) {
+			flowlabel = fl6_sock_lookup(sk, fl6_flowlabel);
 			if (!flowlabel)
 				return -EINVAL;
 		}
@@ -145,7 +169,7 @@ ipv4_connected:
 	}
 
 	sk->sk_v6_daddr = *daddr;
-	np->flow_label = fl6.flowlabel;
+	np->flow_label = fl6_flowlabel;
 
 	inet->inet_dport = usin->sin6_port;
 
@@ -154,21 +178,7 @@ ipv4_connected:
 	 *	destination cache for it.
 	 */
 
-	fl6.flowi6_proto = sk->sk_protocol;
-	fl6.daddr = sk->sk_v6_daddr;
-	fl6.saddr = np->saddr;
-	fl6.flowi6_oif = sk->sk_bound_dev_if;
-	fl6.flowi6_mark = sk->sk_mark;
-	fl6.fl6_dport = inet->inet_dport;
-	fl6.fl6_sport = inet->inet_sport;
-
-	if (!fl6.flowi6_oif)
-		fl6.flowi6_oif = np->sticky_pktinfo.ipi6_ifindex;
-
-	if (!fl6.flowi6_oif && (addr_type&IPV6_ADDR_MULTICAST))
-		fl6.flowi6_oif = np->mcast_oif;
-
-	security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
+	ip6_datagram_flow_key_init(&fl6, sk);
 
 	rcu_read_lock();
 	opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt);
-- 
2.5.1

^ permalink raw reply related

* [PATCH v2 net 2/4] ipv6: datagram: Refactor dst lookup and update codes to a new function
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team
In-Reply-To: <1460413777-177662-1-git-send-email-kafai@fb.com>

This patch moves the route lookup and update codes for connected
datagram sk to a newly created function ip6_datagram_dst_update()

It will be reused during the pmtu update in the later patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
---
 net/ipv6/datagram.c | 103 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 57 insertions(+), 46 deletions(-)

diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index f07c1dd..669585e 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -64,16 +64,65 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk)
 	security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
 }
 
+static int ip6_datagram_dst_update(struct sock *sk)
+{
+	struct ip6_flowlabel *flowlabel = NULL;
+	struct in6_addr *final_p, final;
+	struct ipv6_txoptions *opt;
+	struct dst_entry *dst;
+	struct inet_sock *inet = inet_sk(sk);
+	struct ipv6_pinfo *np = inet6_sk(sk);
+	struct flowi6 fl6;
+	int err = 0;
+
+	if (np->sndflow && (np->flow_label & IPV6_FLOWLABEL_MASK)) {
+		flowlabel = fl6_sock_lookup(sk, np->flow_label);
+		if (!flowlabel)
+			return -EINVAL;
+	}
+	ip6_datagram_flow_key_init(&fl6, sk);
+
+	rcu_read_lock();
+	opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt);
+	final_p = fl6_update_dst(&fl6, opt, &final);
+	rcu_read_unlock();
+
+	dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
+	if (IS_ERR(dst)) {
+		err = PTR_ERR(dst);
+		goto out;
+	}
+
+	if (ipv6_addr_any(&np->saddr))
+		np->saddr = fl6.saddr;
+
+	if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
+		sk->sk_v6_rcv_saddr = fl6.saddr;
+		inet->inet_rcv_saddr = LOOPBACK4_IPV6;
+		if (sk->sk_prot->rehash)
+			sk->sk_prot->rehash(sk);
+	}
+
+	ip6_dst_store(sk, dst,
+		      ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
+		      &sk->sk_v6_daddr : NULL,
+#ifdef CONFIG_IPV6_SUBTREES
+		      ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
+		      &np->saddr :
+#endif
+		      NULL);
+
+out:
+	fl6_sock_release(flowlabel);
+	return err;
+}
+
 static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
 	struct sockaddr_in6	*usin = (struct sockaddr_in6 *) uaddr;
 	struct inet_sock	*inet = inet_sk(sk);
 	struct ipv6_pinfo	*np = inet6_sk(sk);
-	struct in6_addr	*daddr, *final_p, final;
-	struct dst_entry	*dst;
-	struct flowi6		fl6;
-	struct ip6_flowlabel	*flowlabel = NULL;
-	struct ipv6_txoptions	*opt;
+	struct in6_addr		*daddr;
 	int			addr_type;
 	int			err;
 	__be32			fl6_flowlabel = 0;
@@ -91,14 +140,8 @@ static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int a
 	if (usin->sin6_family != AF_INET6)
 		return -EAFNOSUPPORT;
 
-	if (np->sndflow) {
+	if (np->sndflow)
 		fl6_flowlabel = usin->sin6_flowinfo & IPV6_FLOWINFO_MASK;
-		if (fl6_flowlabel & IPV6_FLOWLABEL_MASK) {
-			flowlabel = fl6_sock_lookup(sk, fl6_flowlabel);
-			if (!flowlabel)
-				return -EINVAL;
-		}
-	}
 
 	addr_type = ipv6_addr_type(&usin->sin6_addr);
 
@@ -178,45 +221,13 @@ ipv4_connected:
 	 *	destination cache for it.
 	 */
 
-	ip6_datagram_flow_key_init(&fl6, sk);
-
-	rcu_read_lock();
-	opt = flowlabel ? flowlabel->opt : rcu_dereference(np->opt);
-	final_p = fl6_update_dst(&fl6, opt, &final);
-	rcu_read_unlock();
-
-	dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
-	err = 0;
-	if (IS_ERR(dst)) {
-		err = PTR_ERR(dst);
+	err = ip6_datagram_dst_update(sk);
+	if (err)
 		goto out;
-	}
-
-	/* source address lookup done in ip6_dst_lookup */
-
-	if (ipv6_addr_any(&np->saddr))
-		np->saddr = fl6.saddr;
-
-	if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-		sk->sk_v6_rcv_saddr = fl6.saddr;
-		inet->inet_rcv_saddr = LOOPBACK4_IPV6;
-		if (sk->sk_prot->rehash)
-			sk->sk_prot->rehash(sk);
-	}
-
-	ip6_dst_store(sk, dst,
-		      ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr) ?
-		      &sk->sk_v6_daddr : NULL,
-#ifdef CONFIG_IPV6_SUBTREES
-		      ipv6_addr_equal(&fl6.saddr, &np->saddr) ?
-		      &np->saddr :
-#endif
-		      NULL);
 
 	sk->sk_state = TCP_ESTABLISHED;
 	sk_set_txhash(sk);
 out:
-	fl6_sock_release(flowlabel);
 	return err;
 }
 
-- 
2.5.1

^ permalink raw reply related

* [PATCH v2 net 4/4] ipv6: udp: Do a route lookup and update during release_cb
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team
In-Reply-To: <1460413777-177662-1-git-send-email-kafai@fb.com>

This patch adds a release_cb for UDPv6.  It does a route lookup
and updates sk->sk_dst_cache if it is needed.  It picks up the
left-over job from ip6_sk_update_pmtu() if the sk was owned
by user during the pmtu update.

It takes a rcu_read_lock to protect the __sk_dst_get() operations
because another thread may do ip6_dst_store() without taking the
sk lock (e.g. sendmsg).

Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
---
 include/net/ipv6.h  |  1 +
 net/ipv6/datagram.c | 20 ++++++++++++++++++++
 net/ipv6/udp.c      |  1 +
 3 files changed, 22 insertions(+)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index fd02e90..1be050a 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -960,6 +960,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
 int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
 				 int addr_len);
 int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr);
+void ip6_datagram_release_cb(struct sock *sk);
 
 int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len,
 		    int *addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 59e01f27..9dd3882 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -119,6 +119,26 @@ out:
 	return err;
 }
 
+void ip6_datagram_release_cb(struct sock *sk)
+{
+	struct dst_entry *dst;
+
+	if (ipv6_addr_v4mapped(&sk->sk_v6_daddr))
+		return;
+
+	rcu_read_lock();
+	dst = __sk_dst_get(sk);
+	if (!dst || !dst->obsolete ||
+	    dst->ops->check(dst, inet6_sk(sk)->dst_cookie)) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+
+	ip6_datagram_dst_update(sk, false);
+}
+EXPORT_SYMBOL_GPL(ip6_datagram_release_cb);
+
 static int __ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
 	struct sockaddr_in6	*usin = (struct sockaddr_in6 *) uaddr;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8125931..6bc5c66 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1539,6 +1539,7 @@ struct proto udpv6_prot = {
 	.sendmsg	   = udpv6_sendmsg,
 	.recvmsg	   = udpv6_recvmsg,
 	.backlog_rcv	   = __udpv6_queue_rcv_skb,
+	.release_cb	   = ip6_datagram_release_cb,
 	.hash		   = udp_lib_hash,
 	.unhash		   = udp_lib_unhash,
 	.rehash		   = udp_v6_rehash,
-- 
2.5.1

^ permalink raw reply related

* [PATCH v2 net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
From: Martin KaFai Lau @ 2016-04-11 22:29 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Eric Dumazet, Wei Wang, Kernel Team
In-Reply-To: <1460413777-177662-1-git-send-email-kafai@fb.com>

There is a case in connected UDP socket such that
getsockopt(IPV6_MTU) will return a stale MTU value. The reproducible
sequence could be the following:
1. Create a connected UDP socket
2. Send some datagrams out
3. Receive a ICMPV6_PKT_TOOBIG
4. No new outgoing datagrams to trigger the sk_dst_check()
   logic to update the sk->sk_dst_cache.
5. getsockopt(IPV6_MTU) returns the mtu from the invalid
   sk->sk_dst_cache instead of the newly created RTF_CACHE clone.

This patch updates the sk->sk_dst_cache for a connected datagram sk
during pmtu-update code path.

Note that the sk->sk_v6_daddr is used to do the route lookup
instead of skb->data (i.e. iph).  It is because a UDP socket can become
connected after sending out some datagrams in un-connected state.  or
It can be connected multiple times to different destinations.  Hence,
iph may not be related to where sk is currently connected to.

It is done under '!sock_owned_by_user(sk)' condition because
the user may make another ip6_datagram_connect()  (i.e changing
the sk->sk_v6_daddr) while dst lookup is happening in the pmtu-update
code path.

For the sock_owned_by_user(sk) == true case, the next patch will
introduce a release_cb() which will update the sk->sk_dst_cache.

Test:

Server (Connected UDP Socket):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Route Details:
[root@arch-fb-vm1 ~]# ip -6 r show | egrep '2fac'
2fac::/64 dev eth0  proto kernel  metric 256  pref medium
2fac:face::/64 via 2fac::face dev eth0  metric 1024  pref medium

A simple python code to create a connected UDP socket:

import socket
import errno

HOST = '2fac::1'
PORT = 8080

s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
s.bind((HOST, PORT))
s.connect(('2fac:face::face', 53))
print("connected")
while True:
    try:
	data = s.recv(1024)
    except socket.error as se:
	if se.errno == errno.EMSGSIZE:
		pmtu = s.getsockopt(41, 24)
		print("PMTU:%d" % pmtu)
		break
s.close()

Python program output after getting a ICMPV6_PKT_TOOBIG:
[root@arch-fb-vm1 ~]# python2 ~/devshare/kernel/tasks/fib6/udp-connect-53-8080.py
connected
PMTU:1300

Cache routes after recieving TOOBIG:
[root@arch-fb-vm1 ~]# ip -6 r show table cache
2fac:face::face via 2fac::face dev eth0  metric 0
    cache  expires 463sec mtu 1300 pref medium

Client (Send the ICMPV6_PKT_TOOBIG):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
scapy is used to generate the TOOBIG message.  Here is the scapy script I have
used:

>>> p=Ether(src='da:75:4d:36:ac:32', dst='52:54:00:12:34:66', type=0x86dd)/IPv6(src='2fac::face', dst='2fac::1')/ICMPv6PacketTooBig(mtu=1300)/IPv6(src='2fac::
1',dst='2fac:face::face', nh='UDP')/UDP(sport=8080,dport=53)
>>> sendp(p, iface='qemubr0')

Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reported-by: Wei Wang <weiwan@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
---
 include/net/ipv6.h  |  1 +
 net/ipv6/datagram.c | 20 +++++++++++---------
 net/ipv6/route.c    | 12 ++++++++++++
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index d0aeb97..fd02e90 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -959,6 +959,7 @@ int compat_ipv6_getsockopt(struct sock *sk, int level, int optname,
 int ip6_datagram_connect(struct sock *sk, struct sockaddr *addr, int addr_len);
 int ip6_datagram_connect_v6_only(struct sock *sk, struct sockaddr *addr,
 				 int addr_len);
+int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr);
 
 int ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len,
 		    int *addr_len);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 669585e..59e01f27 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -64,7 +64,7 @@ static void ip6_datagram_flow_key_init(struct flowi6 *fl6, struct sock *sk)
 	security_sk_classify_flow(sk, flowi6_to_flowi(fl6));
 }
 
-static int ip6_datagram_dst_update(struct sock *sk)
+int ip6_datagram_dst_update(struct sock *sk, bool fix_sk_saddr)
 {
 	struct ip6_flowlabel *flowlabel = NULL;
 	struct in6_addr *final_p, final;
@@ -93,14 +93,16 @@ static int ip6_datagram_dst_update(struct sock *sk)
 		goto out;
 	}
 
-	if (ipv6_addr_any(&np->saddr))
-		np->saddr = fl6.saddr;
+	if (fix_sk_saddr) {
+		if (ipv6_addr_any(&np->saddr))
+			np->saddr = fl6.saddr;
 
-	if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-		sk->sk_v6_rcv_saddr = fl6.saddr;
-		inet->inet_rcv_saddr = LOOPBACK4_IPV6;
-		if (sk->sk_prot->rehash)
-			sk->sk_prot->rehash(sk);
+		if (ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
+			sk->sk_v6_rcv_saddr = fl6.saddr;
+			inet->inet_rcv_saddr = LOOPBACK4_IPV6;
+			if (sk->sk_prot->rehash)
+				sk->sk_prot->rehash(sk);
+		}
 	}
 
 	ip6_dst_store(sk, dst,
@@ -221,7 +223,7 @@ ipv4_connected:
 	 *	destination cache for it.
 	 */
 
-	err = ip6_datagram_dst_update(sk);
+	err = ip6_datagram_dst_update(sk, true);
 	if (err)
 		goto out;
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1d8871a..d916d6a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1418,8 +1418,20 @@ EXPORT_SYMBOL_GPL(ip6_update_pmtu);
 
 void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu)
 {
+	struct dst_entry *dst;
+
 	ip6_update_pmtu(skb, sock_net(sk), mtu,
 			sk->sk_bound_dev_if, sk->sk_mark);
+
+	dst = __sk_dst_get(sk);
+	if (!dst || !dst->obsolete ||
+	    dst->ops->check(dst, inet6_sk(sk)->dst_cookie))
+		return;
+
+	bh_lock_sock(sk);
+	if (!sock_owned_by_user(sk) && !ipv6_addr_v4mapped(&sk->sk_v6_daddr))
+		ip6_datagram_dst_update(sk, false);
+	bh_unlock_sock(sk);
 }
 EXPORT_SYMBOL_GPL(ip6_sk_update_pmtu);
 
-- 
2.5.1

^ permalink raw reply related

* Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
From: David Howells @ 2016-04-11 22:29 UTC (permalink / raw)
  Cc: dhowells, David Miller, linux-afs, netdev, linux-kernel
In-Reply-To: <15783.1460412099@warthog.procyon.org.uk>

David Howells <dhowells@redhat.com> wrote:

> > Series applied, but I had to fix up some trailing whitespace (reported by
> > GIT) by hand.
> 
> Do you perchance have a git hook script for checking for trailing whitespace?

No matter - there's already such a script installed by git clone - I just have
to enable it.

David

^ permalink raw reply

* Re: [PATCH RFT 1/2] phylib: add device reset GPIO support
From: Rob Herring @ 2016-04-11 22:46 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: Grant Likely, devicetree@vger.kernel.org, Florian Fainelli,
	netdev, Frank Rowand, Pawel Moll, Mark Rutland, Ian Campbell,
	Kumar Gala, linux-kernel@vger.kernel.org
In-Reply-To: <570BFACF.30507@cogentembedded.com>

On Mon, Apr 11, 2016 at 2:28 PM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> On 04/11/2016 10:25 PM, Rob Herring wrote:
>
>>> The PHY  devices sometimes do have their reset signal (maybe even power
>>> supply?) tied to some GPIO and sometimes it also does happen that a boot
>>> loader does not leave it deasserted. So far this issue has been attacked
>>> from (as I believe) a wrong angle: by teaching the MAC driver to
>>> manipulate
>>> the GPIO in question;  that solution, when  applied to the device trees,
>>> led to adding the PHY reset GPIO properties to the MAC device node, with
>>> one exception: Cadence MACB driver which could handle the "reset-gpios"
>>> prop  in a PHY device  subnode.  I believe that the correct approach is
>>> to
>>> teach the 'phylib' to get the MDIO device reset GPIO from the device tree
>>> node corresponding to this device -- which this patch is doing...
>>>
>>> Note that I had to modify the  AT803x PHY driver as it would stop working
>>> otherwise as it made use of the reset GPIO for its own purposes...
>>
>>
>> Lots of double spaces in here. Please fix.
>
>
>    Oh, it's you again! :-D

Yep, one of those picky kernel maintainers that like a bad rash just
won't go away. :)

>>> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
>>>
>>> ---
>>>   Documentation/devicetree/bindings/net/phy.txt |    2 +
>>>   drivers/net/phy/at803x.c                      |   19 ++------------
>>>   drivers/net/phy/mdio_bus.c                    |    4 +++
>>>   drivers/net/phy/mdio_device.c                 |   27
>>> +++++++++++++++++++--
>>>   drivers/net/phy/phy_device.c                  |   33
>>> ++++++++++++++++++++++++--
>>>   drivers/of/of_mdio.c                          |   16 ++++++++++++
>>>   include/linux/mdio.h                          |    3 ++
>>>   include/linux/phy.h                           |    5 +++
>>>   8 files changed, 89 insertions(+), 20 deletions(-)
>>
>>
>> [...]
>>
>>> Index: net-next/drivers/of/of_mdio.c
>>> ===================================================================
>>> --- net-next.orig/drivers/of/of_mdio.c
>>> +++ net-next/drivers/of/of_mdio.c
>>> @@ -44,6 +44,7 @@ static int of_get_phy_id(struct device_n
>>>   static int of_mdiobus_register_phy(struct mii_bus *mdio, struct
>>> device_node *child,
>>>                                    u32 addr)
>>>   {
>>> +       struct gpio_desc *gpiod;
>>>         struct phy_device *phy;
>>>         bool is_c45;
>>>         int rc;
>>> @@ -52,10 +53,17 @@ static int of_mdiobus_register_phy(struc
>>>         is_c45 = of_device_is_compatible(child,
>>>                                          "ethernet-phy-ieee802.3-c45");
>>>
>>> +       gpiod = fwnode_get_named_gpiod(&child->fwnode, "reset-gpios");
>>
>>
>> Calling fwnode_* functions in a DT specific file/function? That doesn't
>> make sense.
>
>
>    Really?! 8-)
>    Where is a DT-only analog I wonder...

Ah, you're right. NM.

Rob

^ permalink raw reply

* Configuring ethernet link fails with No such device
From: Stefan Agner @ 2016-04-11 22:46 UTC (permalink / raw)
  To: systemd-devel, netdev, davem
  Cc: fabio.estevam, l.stach, u.kleine-koenig, bryan.wu, bob.ham

Hi All,

I traced an issue in which we tried configuring duplex mode and speed in
a systemd-udevd .link file failed with the following error:
"Could not set speed or duplex of eth0 to 0 Mbps (half): No such device"

The FEC driver (fec_main.c) does not initialize phy_dev until the device
has been opened, and therefor the callback fec_enet_(get|set)_settings
returns -19.

What is the expectation/definition when link configuration should be
possible? Only after the network device got opened or before?

Or in other words: Is this a Kernel or systemd issue? At what point is
link configuration "guaranteed" to work?

It seems that the FEC driver used to probe the PHY earlier, it has been
moved to the open callback with commit 418bd0d4df ("netdev/fec: fix
ifconfig eth0 down hang issue")...

There has been a similar report on the systemd-devel mailing list a
while ago, probably the same underlying issue:
https://www.mail-archive.com/systemd-devel@lists.freedesktop.org/msg33186.html

^ permalink raw reply

* Re: Configuring ethernet link fails with No such device
From: Florian Fainelli @ 2016-04-11 23:10 UTC (permalink / raw)
  To: Stefan Agner, systemd-devel, netdev, davem
  Cc: fabio.estevam, l.stach, u.kleine-koenig, bryan.wu, bob.ham,
	Andrew Lunn
In-Reply-To: <67b662523c63277fa95a59472cbf018f@agner.ch>

On 11/04/16 15:46, Stefan Agner wrote:
> Hi All,
> 
> I traced an issue in which we tried configuring duplex mode and speed in
> a systemd-udevd .link file failed with the following error:
> "Could not set speed or duplex of eth0 to 0 Mbps (half): No such device"
> 
> The FEC driver (fec_main.c) does not initialize phy_dev until the device
> has been opened, and therefor the callback fec_enet_(get|set)_settings
> returns -19.
> 
> What is the expectation/definition when link configuration should be
> possible? Only after the network device got opened or before?

It seems to me like systemd tries to configure the device while we are
still opening it, which is arguably something that can be done, but you
are in a transient state, the interface is not completely ready yet.

> 
> Or in other words: Is this a Kernel or systemd issue? At what point is
> link configuration "guaranteed" to work?

Past ndo_open() had a chance to return, therefore connect to the PHY
device, is when the configuration/information querying is able to succeed.

> 
> It seems that the FEC driver used to probe the PHY earlier, it has been
> moved to the open callback with commit 418bd0d4df ("netdev/fec: fix
> ifconfig eth0 down hang issue")...

Part of the reason for doing that, is that probing for the PHY during
the driver's probe function could leave you with unused HW that is
powered on for no good reason. Until we get into ndo_open, we have no
clue if the interface is going to be used or not, so it is better to
move part of the HW initialization as late as possible in the process.

> 
> There has been a similar report on the systemd-devel mailing list a
> while ago, probably the same underlying issue:
> https://www.mail-archive.com/systemd-devel@lists.freedesktop.org/msg33186.html
> 
> --
> Stefan
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH net v2] net: sched: do not requeue a NULL skb
From: Cong Wang @ 2016-04-11 23:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Lars Persson, Lars Persson, Linux Kernel Network Developers,
	Jamal Hadi Salim, LKML
In-Reply-To: <1460399456.6473.566.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Apr 11, 2016 at 11:30 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-04-11 at 11:26 -0700, Eric Dumazet wrote:
>> On Mon, 2016-04-11 at 11:02 -0700, Cong Wang wrote:
>>
>> > I am fine with either way as long as the loop stops on failure.
>
>
> Note that skb that could not be validated is already freed.
>
> So I do not see any value from stopping the loop, since
> we need to schedule the queue to avoid tx hang.
>
> Just process following skb if there is one, fact that skb is sent or
> dropped does not matter.

My point is, for example, in OOM case, we don't know processing
more SKB would make it better or worse. Maybe we really need to
check the error code to decide to continue to exit?

^ permalink raw reply

* Re: [PATCH net v2] net: sched: do not requeue a NULL skb
From: Eric Dumazet @ 2016-04-11 23:48 UTC (permalink / raw)
  To: Cong Wang
  Cc: Lars Persson, Lars Persson, Linux Kernel Network Developers,
	Jamal Hadi Salim, LKML
In-Reply-To: <CAM_iQpVid8=LeO31p_ZiPQkhWtgCuioUzfCdgDPatFxkWrkbCg@mail.gmail.com>

On Mon, 2016-04-11 at 16:19 -0700, Cong Wang wrote:

> My point is, for example, in OOM case, we don't know processin
> more SKB would make it better or worse. Maybe we really need to
> check the error code to decide to continue to exit?

Really, given this bug has been there for a long time (v3.18 ???), I
doubt it matters.

Nothing can tell that following packets in the qdisc need any
transformation, and memory allocations.

So I would just fix the bug in the simplest way.

__qdisc_run() has all the checks needed to yield when needed 
(if (quota <= 0 || need_resched())) , no need to add more complexity
there.

^ permalink raw reply

* Re: [PATCH 0/9] RxRPC: 2nd rewrite part 1
From: David Miller @ 2016-04-12  1:26 UTC (permalink / raw)
  To: dhowells; +Cc: linux-afs, netdev, linux-kernel
In-Reply-To: <15783.1460412099@warthog.procyon.org.uk>

From: David Howells <dhowells@redhat.com>
Date: Mon, 11 Apr 2016 23:01:39 +0100

> David Miller <davem@davemloft.net> wrote:
> 
>> Series applied, but I had to fix up some trailing whitespace (reported by
>> GIT) by hand.
> 
> Do you perchance have a git hook script for checking for trailing whitespace?

Just try to apply the series yourself with "git am", that's what I do.

You can do "git apply --check --whitespace=error-all" on individual
patches, but that doesn't work for a series.

^ permalink raw reply

* Re: [PATCH net-next] net: mdio: Fix lockdep falls positive splat
From: David Miller @ 2016-04-12  1:27 UTC (permalink / raw)
  To: andrew; +Cc: f.fainelli, netdev, vivien.didelot
In-Reply-To: <1460403605-31882-1-git-send-email-andrew@lunn.ch>

From: Andrew Lunn <andrew@lunn.ch>
Date: Mon, 11 Apr 2016 21:40:05 +0200

> MDIO devices can be stacked upon each other. The current code supports
> two levels, which until recently has been enough for a DSA mdio bus on
> top of another bus. Now we have hardware which has an MDIO mux in the
> middle.
> 
> Define an MDIO MUTEX class with three levels.
> 
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Applied, thanks Andrew.

^ permalink raw reply

* Re: Configuring ethernet link fails with No such device
From: David Miller @ 2016-04-12  1:29 UTC (permalink / raw)
  To: stefan
  Cc: systemd-devel, netdev, fabio.estevam, l.stach, u.kleine-koenig,
	bryan.wu, bob.ham
In-Reply-To: <67b662523c63277fa95a59472cbf018f@agner.ch>

From: Stefan Agner <stefan@agner.ch>
Date: Mon, 11 Apr 2016 15:46:08 -0700

> What is the expectation/definition when link configuration should be
> possible? Only after the network device got opened or before?

Only after it is open.  Drivers almost always have the entire chip in
powerdown state when it is not open, so we wouldn't be able to
properly do link settings even if we wanted to when the device is
closed.

^ permalink raw reply

* Re: [PATCH] devlink: add missing install of header
From: David Miller @ 2016-04-12  1:34 UTC (permalink / raw)
  To: stephen; +Cc: jiri, netdev
In-Reply-To: <1460407148-24228-1-git-send-email-stephen@networkplumber.org>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Mon, 11 Apr 2016 13:39:08 -0700

> The new devlink.h in uapi was not being installed by
> make headers_install
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

Applied, thanks Stephen.

I wish we could just have everything under uapi/linux automatically
get this treatment, but I guess there is a very good reason for the
current behavior.

^ permalink raw reply

* Re: [PATCH RFC 1/5] net: phy: sun8i-h3-ephy: Add bindings for Allwinner H3 Ethernet PHY
From: Chen-Yu Tsai @ 2016-04-12  1:38 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Chen-Yu Tsai, Maxime Ripard, Rob Herring, Pawel Moll,
	Mark Rutland, Ian Campbell, Kumar Gala, netdev, linux-arm-kernel,
	linux-kernel, devicetree, LABBE Corentin, Andrew Lunn
In-Reply-To: <570BF9B0.2090602@gmail.com>

On Tue, Apr 12, 2016 at 3:23 AM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 04/04/16 09:22, Chen-Yu Tsai wrote:
>> The Allwinner H3 SoC incorporates an Ethernet PHY. This is enabled and
>> configured through a memory mapped hardware register.
>>
>> This same register also configures the MAC interface mode and TX clock
>> source. Also covered by the register, but not supported in these bindings,
>> are TX/RX clock delay chains and inverters, and an RMII module.
>>
>> Signed-off-by: Chen-Yu Tsai <wens@csie.org>
>> ---
>>  .../bindings/net/allwinner,sun8i-h3-ephy.txt       | 44 ++++++++++++++++++++++
>>  1 file changed, 44 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
>> new file mode 100644
>> index 000000000000..146f227e6d88
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/allwinner,sun8i-h3-ephy.txt
>> @@ -0,0 +1,44 @@
>> +* Allwinner H3 E(thernet) PHY
>> +
>> +The Allwinner H3 integrates an MII ethernet PHY. As with external PHYs,
>> +before it can be configured over the MDIO bus and used, certain hardware
>> +features must be configured, such as the PHY address and LED polarity.
>
> Is the internal PHY address really configurable? Not that there is
> anything wrong with it, this is good.

It is. Things that are configured or provided to a discrete PHY are routed
to registers in the SoC, things such as PHY address, clocks, resets.

>> +The PHY must also be powered on and brought out of reset.
>> +
>> +This is accomplished with regulators and pull-up/downs for external PHYs.
>> +For this internal PHY, a hardware register is programmed.
>> +
>> +The same hardware register also contains clock and interface controls
>> +for the MAC. This is also present in earlier SoCs, and is covered by
>> +"allwinner,sun7i-a20-gmac-clk". The controls in the H3 are slightly
>> +different due to the inclusion of what appears to be an RMII-MII
>> +bridge.
>> +
>> +Required properties:
>> +- compatible: should be "allwinner,sun8i-h3-ephy"
>> +- reg: address and length of the register set for the device
>> +- clocks: A phandle to the reference clock for this device
>> +- resets: A phandle to the reset control for this device
>> +
>> +Ethernet PHY related properties:
>> +- allwinner,ephy-addr: the MDIO bus address the PHY should respond to.
>> +                    If this is not set, the external PHY is used, and
>> +                    everything else in this section is ignored.
>
> So we are going to put the same value here, and in the actual Ethernet
> PHY device tree node that should be hanging off the EMAC/MDIO bus
> controller, this is confusing and error prone.

Yes, that would be an issue when writing the DTS.
>
>> +- allwinner,leds-active-low: LEDs are active low. Without this, LEDs are
>> +                          active high.
>> +
>> +Ethernet MAC clock related properties:
>> +- #clock-cells: should be 0
>> +- clock-output-names: "mac_tx"
>> +
>> +Example:
>> +
>> +ethernet-phy@01c00030 {
>> +     compatible = "allwinner,sun8i-h3-ephy";
>> +     reg = <0x01c00030 0x4>;
>
> Looking at this register space this looks more like an internal PHY SHIM
> that is needed to be configured before the internal PHY can be access,
> not a proper Ethernet PHY per-se, see replies in aptch 2.
>
> Should not this block be a second cell associated with the Ethernet MAC
> block? One or the other are not going to be very useful without
> knowledge of each other.

True. However the lower half of the same register also controls the
MAC interface mode and TX clock source and delays. This we had a clock
driver that was used in conjuction with stmmac on earlier SoCs. I was
hoping to keep that model with the newer EMAC. At the time it was
argued that what seemed like a clock should be handled by a clock
driver, instead of just a "syscon". If this is reaching too far to
handle this new use case, I will happily just provide patches to merge
this into the MAC.

I would like to know how to deal with things like a PHY requiring
some sort of shim driver, be it an internal one, or an external mfd
chip that happens to have an Ethernet PHY included? How do we tie
this into the PHY node under the MDIO bus?


Regards
ChenYu

^ permalink raw reply

* Re: [PATCH RFC] net: decrease the length of backlog queue immediately after it's detached from sk
From: Yang Yingliang @ 2016-04-12  2:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, Ding Tianhong
In-Reply-To: <1460376828.6473.538.camel@edumazet-glaptop3.roam.corp.google.com>



On 2016/4/11 20:13, Eric Dumazet wrote:
> On Mon, 2016-04-11 at 19:57 +0800, Yang Yingliang wrote:
>>
>> On 2016/4/8 22:44, Eric Dumazet wrote:
>>> On Fri, 2016-04-08 at 19:18 +0800, Yang Yingliang wrote:
>>>
>>>> I expand  tcp_adv_win_scale and tcp_rmem. It has no effect.
>>>
>>> Try :
>>>
>>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>>
>>> And restart your flows.
>>>
>> cat /proc/sys/net/ipv4/tcp_rmem
>> 10240 2097152 10485760
>
> What about leaving the default values ?
I tried, it did not work.

>
> $ cat /proc/sys/net/ipv4/tcp_rmem
> 4096	87380	6291456
>
>>
>> echo 102400 20971520 104857600 > /proc/sys/net/ipv4/tcp_rmem
>> echo -2 >/proc/sys/net/ipv4/tcp_adv_win_scale
>>
>> It seems has not effect.
>>
>
> I have no idea what you did on the sender side to allow it to send more
> than 1.5 MB then.

We are doing performance test. The sender send 256KB per-block with 128
threads to one socket. And the receiver uses 10Gb NIC to handle the
data on ARM64. The data flow is driver->ip layer->tcp layer->iscsi.

I added some debug messages and found handling backlog packets in 
__release_sock() cost about 11ms at most. This can cause backlog queue
overflow. The sk_data_ready is re-assigned, it may cost time in our
program. I will check it out.

^ permalink raw reply

* Re: [PATCH net-next 7/7] dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name()
From: Vivien Didelot @ 2016-04-12  3:15 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: Florian Fainelli, netdev, Andrew Lunn
In-Reply-To: <1460404209-32083-8-git-send-email-andrew@lunn.ch>

Andrew Lunn <andrew@lunn.ch> writes:

> mv88e6xxx_lookup_name() returns the model name of a switch at a given
> address on an MII bus. Using mii_bus to identify the bus rather than
> the host device is more logical, so change the parameter.
>
> Signed-off-by: Andrew Lunn <andrew@lunn.ch>

Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

^ permalink raw reply

* Re: [PATCH net-next 0/7] DSA refactoring: set 1
From: Vivien Didelot @ 2016-04-12  3:16 UTC (permalink / raw)
  To: Andrew Lunn, David Miller; +Cc: Florian Fainelli, netdev, Andrew Lunn
In-Reply-To: <1460404209-32083-1-git-send-email-andrew@lunn.ch>

Andrew Lunn <andrew@lunn.ch> writes:

> There has been a long running effort to refractor DSA probing to make
> the switches true linux devices. Here are a small collection of
> patches moving in this direction. Most have been seen before.
>
> We take a little step forward by passing the dsa device point to the
> driver, thus allowing it to perform resource allocations using the
> normal mechanisms. This device structure will later be replaced by the
> devices own device structure.
>
> Future patches will add a true driver probe function, so we rename the
> current probe function, cleaning up the namespace.
>
> phys_port_mask continually confuses me, thinking it is about PHYs. But
> it is actually about ports to the outside world, user ports. So rename
> it.
>
> Lots more patches yet to follow, this is just doing some ground work.
>
> Andrew Lunn (7):
>   net: dsa: Pass the dsa device to the switch drivers
>   net: dsa: Have the switch driver allocate there own private memory
>   net: dsa: Remove allocation of driver private memory
>   net: dsa: Keep the mii bus and address in the private structure
>   net: dsa: Rename DSA probe function.
>   dsa: Rename phys_port_mask to user_port_mask
>   dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name()
>
>  drivers/net/dsa/bcm_sf2.c   | 24 +++++++++++++-------
>  drivers/net/dsa/mv88e6060.c | 47 +++++++++++++++++++++++---------------
>  drivers/net/dsa/mv88e6060.h | 11 +++++++++
>  drivers/net/dsa/mv88e6123.c | 14 +++++++-----
>  drivers/net/dsa/mv88e6131.c | 14 +++++++-----
>  drivers/net/dsa/mv88e6171.c | 14 +++++++-----
>  drivers/net/dsa/mv88e6352.c | 14 +++++++-----
>  drivers/net/dsa/mv88e6xxx.c | 55 +++++++++++++++++++++++++++++++--------------
>  drivers/net/dsa/mv88e6xxx.h | 17 +++++++++++---
>  include/net/dsa.h           | 16 ++++++++-----
>  net/dsa/dsa.c               | 19 +++++++++-------
>  11 files changed, 166 insertions(+), 79 deletions(-)

Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: roopa @ 2016-04-12  3:53 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, jhs, davem, Nikolay Aleksandrov
In-Reply-To: <570A9B4D.80104@cumulusnetworks.com>

On 4/10/16, 11:28 AM, roopa wrote:
> On 4/10/16, 1:16 AM, Thomas Graf wrote:
[snip]
>>
>> This currently ties everything to a net_device with a selector to
>> include certain bits of that net_device. How about we take it half a
>> step further and allow for non net_device stats such as IP, TCP,
>> routing or ipsec stats to be retrieved as well?
> yes, absolutely. and that is also the goal.
>> A simple array of nested attributes replacing IFLA_STATS_* would
>> allow for that, e.g.
>>
>> 1. {.type = ST_IPSTATS, value = { ...} }
>>
>> 2. {.type = ST_LINK, .value = {
>>     {.type = ST_LINK_NAME, .value = "eth0"},
>>     {.type = ST_LINK_Q, .value = 10}
>>   }}
>>
>> 3. ...
> One thing though,  Its unclear to me if we absolutely need the additional nest.
> Every stats netlink msg has an ifindex in the header (if_stats_msg) if the scope
> of the stats is a netdev. If the msg does not have an ifindex in the if_stats_msg,
> it represents a global stat. ie Cant a dump, include other stats netlink msgs after
> all the netdev msgs are done when the filter has global stat filters ?.
> same will apply to RTM_GETSTATS (without NLM_F_DUMP).
>
> Since the msg may potentially have more nest levels
> in the IFLA_EXT_STATS categories, just trying to see if i can avoid adding another
> top-level nest. We can sure add it if there is no other way to include global
> stats in the same dump.
>

Just wanted to elaborate on what i was trying to say:

Top level stats attributes can be netdev or global attributes: We can include string "LINK" in
the names of all stats belonging to a netdev to make it easier to recognize the netdev stats (example):
 IFLA_STATS_LINK64,           (netdev)
 IFLA_STATS_LINK_INET6,    (netdev)
 IFLA_STATS_TCP,                (non-netdev, global tcp stats)

RTM_GETSTATS (NLM_F_DUMP)  with user given filter_mask {
        If filter-mask contains any link stats, start with per netdev stats messages:
        { if_stats_msg.ifindex = 1,
          if_stats_msg.filter_mask = mask of included link stats,
          <stats attributes> }

        { if_stats_msg.ifindex = 2,
          if_stats_msg.filter_mask = mask of included link stats,
          <stats attributes> }

        global stats (if user given filter mask contains global filters.):
        { if_stats_msg.ifindex = 0,
          if_stats_msg.filter_mask = mask of included global stats,
          <stats attributes> }
}

We will need a field in netlink_callback to indicate global or netdev stats when the stats
crosses skb boundaries. A single nlmsg cannot have both netdev and global stats.

Non-dump RTM_GETSTATS examples:

RTM_GETSTATS with valid ifindex and filter_mask {
        filter_mask cannot have global stats (return -EINVAL)
        { if_stats_msg.ifindex = <user_given_ifindex>,
          if_stats_msg.filter_mask = mask of included link stats,
          <stats attributes> }
}

RTM_GETSTATS with ifindex = 0 and filter_mask {
        filter_mask cannot have link stats (return -EINVAL)
        { if_stats_msg.ifindex = 0,
          if_stats_msg.filter_mask = mask of included global link stats,
          <stats attributes> }
}


Will this not work ?

Thanks,
Roopa

^ permalink raw reply

* Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
From: Leon Romanovsky @ 2016-04-12  5:15 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Saeed Mahameed, Saeed Mahameed, Matan Barak, Linux Netdev List,
	linux-rdma@vger.kernel.org, David S. Miller, Doug Ledford,
	Linus Torvalds, Or Gerlitz, Leon Romanovsky, Tal Alon
In-Reply-To: <CAJ3xEMhorMRsxtawjim9zoBn-zzoDXcMAJfHUqCZBNUbOKrMJA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1554 bytes --]

On Tue, Apr 12, 2016 at 12:37:34AM +0300, Or Gerlitz wrote:
> On Tue, Apr 12, 2016 at 12:24 AM, Saeed Mahameed
> <saeedm@dev.mellanox.co.il> wrote:
> > On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> 
> >> feature --> features
> 
> > Correct, will fix.
> 
> >>> * Add vport to steering commands for SRIOV ACL support
> >>> * Add mlcr, pcmr and mcia registers for dump module EEPROM
> >>> * Add support for FCS, baeacon led and disable_link bits to hca caps
> >>> * Add CQE period mode bit in  CQ context for CQE based CQ
> >>>   moderation support
> >>> * Add umr SQ bit for fragmented memory registration
> >>> * Add needed bits and caps for Striding RQ support
> 
> >> AFAIK, all the above are features will go through net-next, what made
> >> you anticipate conflicts with linux-rdma?
> 
> > FCS bit is needed also for rdma, so we took the liberty of updating
> > all the needed HW structs, bits, caps, etc ..
> > at once for all mlx5 features planned for 4.7 regardless of rdma/net conflicts.
> 
> The cover letter states that this series deals with shared code.
> 
> I guess you might also could extend it a bit to deal also with code
> that you suspect could lead to conflicts, but I don't see why it
> evolved to that extent.

Or,
All these micro-optimizations on this shared file can potentially lead
to undesired merge conflicts. Subsystem maintainers and Linus don't need to
deal with these conflicts at all.

It won't help to anyone to split this commit to more than one patch.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* [REGRESSION, bisect] pci: cxgb4 probe fails after commit 104daa71b3961434 ("PCI: Determine actual VPD size on first access")
From: Hariprasad Shenai @ 2016-04-12  5:29 UTC (permalink / raw)
  To: bhelgaas, linux-pci, linux-kernel, davem, netdev
  Cc: swise, leedom, santosh, kumaras, nirranjan

Hi All,


The following patch introduced a regression, causing cxgb4 driver to fail in
PCIe probe.

commit 104daa71b39614343929e1982170d5fcb0569bb5
Author: Hannes Reinecke <hare@suse.de>
Author: Hannes Reinecke <hare@suse.de>
Date:   Mon Feb 15 09:42:01 2016 +0100

    PCI: Determine actual VPD size on first access
    
    PCI-2.2 VPD entries have a maximum size of 32k, but might actually be
    smaller than that.  To figure out the actual size one has to read the VPD
    area until the 'end marker' is reached.
    
    Per spec, reading outside of the VPD space is "not allowed."  In practice,
    it may cause simple read errors or even crash the card.  To make matters
    worse not every PCI card implements this properly, leaving us with no 'end'
    marker or even completely invalid data.
    
    Try to determine the size of the VPD data when it's first accessed.  If no
    valid data can be read an I/O error will be returned when reading or
    writing the sysfs attribute.
    
    As the amount of VPD data is unknown initially the size of the sysfs
    attribute will always be set to '0'.
    
    [bhelgaas: changelog, use 0/1 (not false/true) for bitfield, tweak
    pci_vpd_pci22_read() error checking]
    Tested-by: Shane Seymour <shane.seymour@hpe.com>
    Tested-by: Babu Moger <babu.moger@oracle.com>
    Signed-off-by: Hannes Reinecke <hare@suse.de>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    Cc: Alexander Duyck <alexander.duyck@gmail.com>

The problem is stemming from the fact that the Chelsio adapters actually have
two VPD structures stored in the VPD.  An abbreviated on at Offset 0x0 and the
complete VPD at Offset 0x400.  The abbreviated one only contains the PN, SN and
EC Keywords, while the complete VPD contains those plus various adapter
constants contained in V0, V1, etc.  And it also contains the Base Ethernet MAC
Address in the "NA" Keyword which the cxgb4 driver needs when it can't contact
the adapter firmware.  (We don't have the "NA" Keywork in the VPD Structure at
Offset 0x0 because that's not an allowed VPD Keyword in the PCI-E 3.0
specification.)

  With the new code, the computed size of the VPD is 0x200 and so our efforts
to read the VPD at Offset 0x400 silently fails. We check the result of the
read looking for a signature 0x82 byte but we're checking against random stack
garbage.

  The end result is that the cxgb4 driver now fails the PCI-E Probe.

Thanks,
Hari

^ permalink raw reply

* Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
From: Or Gerlitz @ 2016-04-12  5:36 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Saeed Mahameed, Saeed Mahameed, Matan Barak, Linux Netdev List,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	David S. Miller, Doug Ledford, Linus Torvalds, Or Gerlitz,
	Leon Romanovsky, Tal Alon
In-Reply-To: <20160412051537.GD25242-2ukJVAZIZ/Y@public.gmane.org>

On Tue, Apr 12, 2016 at 8:15 AM, Leon Romanovsky <leon-2ukJVAZIZ/Y@public.gmane.org> wrote:
> On Tue, Apr 12, 2016 at 12:37:34AM +0300, Or Gerlitz wrote:
>> On Tue, Apr 12, 2016 at 12:24 AM, Saeed Mahameed
>> <saeedm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>> > On Tue, Apr 12, 2016 at 12:17 AM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>> >> feature --> features
>>
>> > Correct, will fix.
>>
>> >>> * Add vport to steering commands for SRIOV ACL support
>> >>> * Add mlcr, pcmr and mcia registers for dump module EEPROM
>> >>> * Add support for FCS, baeacon led and disable_link bits to hca caps
>> >>> * Add CQE period mode bit in  CQ context for CQE based CQ
>> >>>   moderation support
>> >>> * Add umr SQ bit for fragmented memory registration
>> >>> * Add needed bits and caps for Striding RQ support
>>
>> >> AFAIK, all the above are features will go through net-next, what made
>> >> you anticipate conflicts with linux-rdma?
>>
>> > FCS bit is needed also for rdma, so we took the liberty of updating
>> > all the needed HW structs, bits, caps, etc ..
>> > at once for all mlx5 features planned for 4.7 regardless of rdma/net conflicts.
>>
>> The cover letter states that this series deals with shared code.
>>
>> I guess you might also could extend it a bit to deal also with code
>> that you suspect could lead to conflicts, but I don't see why it
>> evolved to that extent.
>
> Or,
> All these micro-optimizations on this shared file can potentially lead
> to undesired merge conflicts. Subsystem maintainers and Linus don't need to
> deal with these conflicts at all.

Leon,

Conflicts happens @ all times, life.

We (MLNX) didn't do a good job to minimize them as much as
possible on the 4.5 cycle (understatement) and did vast improvement
in the 4.6 cycle (one or two conflicts AFAIK and communicated  to Linus).

It's correct that Linus got really angry on these two conflicts but I have
communicated to him the fact that we did that improvement and we're talking
on two commits for a fairly large volume of patches, the response was, if you
do things right, we will happily keep working with you, people, go look.

I understand your desire to get it down to zero, but it's not gonna
work, pick another target.

For example, the networking community has a fairly large rc activity
(I would say 10-20x
vs rdma), so when Dave does his "merge-rebases" for net-next over net
and linus tree
(4-5 times in a release), he has to this way or another solve
conflicts, yes! ditto for
Linus during merge windows and to some extent in rc times too.

> It won't help to anyone to split this commit to more than one patch.

The commit change-log should make it clear what this is about, and it doesn't.
If you believe in something, state that clear, be precise.

As Saeed admitted the shared code in the commit spans maybe 2% of it.

The 1st commit deals with a field which is not used in the driver,
this is a cleanup
that you can do in rc (net) patch (remove the field all together) and
overall, w.o seeing
the down-stream patches that depend on the newly introduced fields,
how do you know there aren't such (unused) bits in the 2nd commit?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH for-next 2/2] net/mlx5: Update mlx5_ifc hardware features
From: Leon Romanovsky @ 2016-04-12  6:01 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Saeed Mahameed, Saeed Mahameed, Matan Barak, Linux Netdev List,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	David S. Miller, Doug Ledford, Linus Torvalds, Or Gerlitz,
	Leon Romanovsky, Tal Alon
In-Reply-To: <CAJ3xEMijYrLvSmbNZp0B6AXqKGozz1rdZ58SCUP_09zOB6-2gQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1688 bytes --]

On Tue, Apr 12, 2016 at 08:36:21AM +0300, Or Gerlitz wrote:
> Conflicts happens @ all times, life.
> 

...

> 
> I understand your desire to get it down to zero, but it's not gonna
> work, pick another target.

Maybe you are right and the time will show, but now we (Saeed, Matan and me)
are trying hard to achieve this goal.

> 
> For example, the networking community has a fairly large rc activity
> (I would say 10-20x
> vs rdma), so when Dave does his "merge-rebases" for net-next over net
> and linus tree
> (4-5 times in a release), he has to this way or another solve
> conflicts, yes! ditto for
> Linus during merge windows and to some extent in rc times too.

I don't see any harm in our desire to decrease work overhead from these
busy people.

> 
> > It won't help to anyone to split this commit to more than one patch.
> 
> The commit change-log should make it clear what this is about, and it doesn't.
> If you believe in something, state that clear, be precise.

I agree.

> 
> As Saeed admitted the shared code in the commit spans maybe 2% of it.
> 
> The 1st commit deals with a field which is not used in the driver,
> this is a cleanup
> that you can do in rc (net) patch (remove the field all together) and
> overall, w.o seeing

I don't agree with your point that cleanup should go to RC.

> the down-stream patches that depend on the newly introduced fields,
> how do you know there aren't such (unused) bits in the 2nd commit?

No, I don't know in advance, but the truth is that it doesn't bother
anyone, because we are exposing our internal HW to kernel clients and
doing it with minimal impact on the maintainers.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox