Netdev List
* Re: tg3 pxe weirdness
From: Berend De Schouwer @ 2017-09-27 10:05 UTC (permalink / raw)
  To: Siva Reddy Kallam; +Cc: Linux Netdev List
In-Reply-To: <CAMet4B6mZ8SyCT7W2j4OBEUDiy3ZupYKWXBFti0nXoxm51_6kg@mail.gmail.com>

On Mon, 2017-09-25 at 15:11 +0530, Siva Reddy Kallam wrote:
> On Fri, Sep 22, 2017 at 9:04 PM, Berend De Schouwer
> <berend.de.schouwer@gmail.com> wrote:
> > On Fri, 2017-09-22 at 11:51 +0530, Siva Reddy Kallam wrote:
> > > 
> > > 
> > > Can you please share below details?
> > > 1) Model and Manufacturer of the system
> > > 2) Linux distro/kernel used?
> > 
> > 4.13.3 gets a little further, but after some more data is transferred
> > the tg3 driver still crashes.  This is unfortunately before I've got a
> > writeable filesystem.
> > 
> > The last line is:
> > tg3 0000:01:00.0: tg3_stop_block timed out, ofs=4c00 enable_bit=2
> > 
> > I've got some ideas to get the full dmesg.
> > 
> > As with the other kernels it works OK on 1Gbps, but not slower
> > switches.
> 
> I suspect that, in link-aware mode, the clock speed could be slow and
> the boot code does not complete within the expected time at lower link
> speeds. So I am providing a patch to override the clock.
> Can you please try the attached debug patch and give us feedback with
> a 100M link?
> If it solves this issue, we will work on proper changes.

This does work on 4.13.3 and PXE for me.

I've tested on 1 Gbps, 100 Mbps and 10 Mbps.  I've done some
preliminary testing (e.g. large file copies).


* Re: tc H/W offload issue with vxlan tunnels [was: nfp: flower vxlan tunnel offload]
From: Paolo Abeni @ 2017-09-27  9:46 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Or Gerlitz, Jiri Benc, Simon Horman, David Miller, Jakub Kicinski,
	Linux Netdev List, oss-drivers, John Hurley, Paul Blakey,
	Jiri Pirko, Roi Dayan
In-Reply-To: <20170927091700.GC1944@nanopsycho.orion>

On Wed, 2017-09-27 at 11:17 +0200, Jiri Pirko wrote:
> Wed, Sep 27, 2017 at 10:29:35AM CEST, pabeni@redhat.com wrote:
> > So it looks like the H/W offload hook will still be called with the
> > same arguments in both cases, and the 'bad' rule will still be pushed to
> > the H/W, as the driver itself has no way to distinguish between the two
> > scenarios.
> 
> Why "bad"?

Such a rule is handled differently by the SW and the HW data paths.

a rule like:

tc filter add dev eth0 protocol ip parent ffff: flower \
   enc_key_id 102 enc_dst_port 4789 src_ip 3.4.5.6 skip_hw \
   action mirred egress redirect dev eth0_vf_1

will match 0 packets, while:

tc filter add dev eth0 protocol ip parent ffff: flower \
   enc_key_id 102 enc_dst_port 4789 src_ip 3.4.5.6 skip_sw \
   action mirred egress redirect dev eth0_vf_1

[just flipped 'skip_sw' and 'skip_hw']
will match the vxlan-tunneled packets. I understand that one of the
design goals for the h/w offload path is to be consistent with the sw
one, but that does not hold in the above scenario.

> Regarding the distinction, the driver knows whether the user adds a rule
> directly to eth0, or whether eth0 is the egress device in the action.
> Those are 2 separate driver entrypoints - of course, talking about code
> with my changes.

OK, but then each driver should catch the scenario "rule with tunnel
match over non-tunnel device" and cope with it properly (never match
it). Why not simply avoid pushing such rules to the H/W?

Cheers,

Paolo


* [PATCH v3 1/1] ip_tunnel: add mpls over gre encapsulation
From: Amine Kherbouche @ 2017-09-27  9:37 UTC (permalink / raw)
  To: netdev, xeb, roopa; +Cc: amine.kherbouche, equinox
In-Reply-To: <cover.1506504229.git.amine.kherbouche@6wind.com>

This commit introduces MPLSoGRE support (RFC 4023), using the ip tunnel
API.

Encap:
  - Add a new iptunnel type mpls.
  - Share tx path: gre type mpls loaded from skb->protocol.

Decap:
  - pull gre hdr and call mpls_forward().

Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
---
 include/linux/mpls.h           |  2 ++
 include/uapi/linux/if_tunnel.h |  1 +
 net/ipv4/ip_gre.c              | 11 +++++++++
 net/ipv6/ip6_gre.c             | 11 +++++++++
 net/mpls/af_mpls.c             | 52 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 77 insertions(+)

diff --git a/include/linux/mpls.h b/include/linux/mpls.h
index 384fb22..57203c1 100644
--- a/include/linux/mpls.h
+++ b/include/linux/mpls.h
@@ -8,4 +8,6 @@
 #define MPLS_TC_MASK		(MPLS_LS_TC_MASK >> MPLS_LS_TC_SHIFT)
 #define MPLS_LABEL_MASK		(MPLS_LS_LABEL_MASK >> MPLS_LS_LABEL_SHIFT)
 
+int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len);
+
 #endif  /* _LINUX_MPLS_H */
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 2e52088..a2f48c0 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -84,6 +84,7 @@ enum tunnel_encap_types {
 	TUNNEL_ENCAP_NONE,
 	TUNNEL_ENCAP_FOU,
 	TUNNEL_ENCAP_GUE,
+	TUNNEL_ENCAP_MPLS,
 };
 
 #define TUNNEL_ENCAP_FLAG_CSUM		(1<<0)
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 9cee986..0a898f4 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -32,6 +32,9 @@
 #include <linux/netfilter_ipv4.h>
 #include <linux/etherdevice.h>
 #include <linux/if_ether.h>
+#if IS_ENABLED(CONFIG_MPLS)
+#include <linux/mpls.h>
+#endif
 
 #include <net/sock.h>
 #include <net/ip.h>
@@ -412,6 +415,14 @@ static int gre_rcv(struct sk_buff *skb)
 			return 0;
 	}
 
+	if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC))) {
+#if IS_ENABLED(CONFIG_MPLS)
+		return mpls_gre_rcv(skb, hdr_len);
+#else
+		goto drop;
+#endif
+	}
+
 	if (ipgre_rcv(skb, &tpi, hdr_len) == PACKET_RCVD)
 		return 0;
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index c82d41e..5a0f5e1 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -34,6 +34,9 @@
 #include <linux/hash.h>
 #include <linux/if_tunnel.h>
 #include <linux/ip6_tunnel.h>
+#if IS_ENABLED(CONFIG_MPLS)
+#include <linux/mpls.h>
+#endif
 
 #include <net/sock.h>
 #include <net/ip.h>
@@ -476,6 +479,14 @@ static int gre_rcv(struct sk_buff *skb)
 	if (hdr_len < 0)
 		goto drop;
 
+	if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC))) {
+#if IS_ENABLED(CONFIG_MPLS)
+		return mpls_gre_rcv(skb, hdr_len);
+#else
+		goto drop;
+#endif
+	}
+
 	if (iptunnel_pull_header(skb, hdr_len, tpi.proto, false))
 		goto drop;
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index c5b9ce4..53ec7c0 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -16,6 +16,7 @@
 #include <net/arp.h>
 #include <net/ip_fib.h>
 #include <net/netevent.h>
+#include <net/ip_tunnels.h>
 #include <net/netns/generic.h>
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ipv6.h>
@@ -39,6 +40,36 @@ static int one = 1;
 static int label_limit = (1 << 20) - 1;
 static int ttl_max = 255;
 
+#if IS_ENABLED(CONFIG_NET_IP_TUNNEL)
+size_t ipgre_mpls_encap_hlen(struct ip_tunnel_encap *e)
+{
+	return sizeof(struct mpls_shim_hdr);
+}
+
+static const struct ip_tunnel_encap_ops mpls_iptun_ops = {
+	.encap_hlen	= ipgre_mpls_encap_hlen,
+};
+
+static int ipgre_tunnel_encap_add_mpls_ops(void)
+{
+	return ip_tunnel_encap_add_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
+}
+
+static void ipgre_tunnel_encap_del_mpls_ops(void)
+{
+	ip_tunnel_encap_del_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
+}
+#else
+static int ipgre_tunnel_encap_add_mpls_ops(void)
+{
+	return 0;
+}
+
+static void ipgre_tunnel_encap_del_mpls_ops(void)
+{
+}
+#endif
+
 static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
 		       struct nlmsghdr *nlh, struct net *net, u32 portid,
 		       unsigned int nlm_flags);
@@ -443,6 +474,22 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
 	return NET_RX_DROP;
 }
 
+int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len)
+{
+	if (unlikely(!pskb_may_pull(skb, gre_hdr_len)))
+		goto drop;
+
+	/* Pop GRE hdr and reset the skb */
+	skb_pull(skb, gre_hdr_len);
+	skb_reset_network_header(skb);
+
+	return mpls_forward(skb, skb->dev, NULL, NULL);
+drop:
+	kfree_skb(skb);
+	return NET_RX_DROP;
+}
+EXPORT_SYMBOL(mpls_gre_rcv);
+
 static struct packet_type mpls_packet_type __read_mostly = {
 	.type = cpu_to_be16(ETH_P_MPLS_UC),
 	.func = mpls_forward,
@@ -2485,6 +2532,10 @@ static int __init mpls_init(void)
 		      0);
 	rtnl_register(PF_MPLS, RTM_GETNETCONF, mpls_netconf_get_devconf,
 		      mpls_netconf_dump_devconf, 0);
+	err = ipgre_tunnel_encap_add_mpls_ops();
+	if (err)
+		pr_err("Can't add mpls over gre tunnel ops\n");
+
 	err = 0;
 out:
 	return err;
@@ -2502,6 +2553,7 @@ static void __exit mpls_exit(void)
 	dev_remove_pack(&mpls_packet_type);
 	unregister_netdevice_notifier(&mpls_dev_notifier);
 	unregister_pernet_subsys(&mpls_net_ops);
+	ipgre_tunnel_encap_del_mpls_ops();
 }
 module_exit(mpls_exit);
 
-- 
2.1.4


* [PATCH v3 0/1] Introduce MPLS over GRE
From: Amine Kherbouche @ 2017-09-27  9:37 UTC (permalink / raw)
  To: netdev, xeb, roopa; +Cc: amine.kherbouche, equinox

This series introduces the MPLS over GRE encapsulation (RFC 4023).

Various applications of MPLS make use of label stacks with multiple
entries.  In some cases, it is possible to replace the top label of
the stack with an IP-based encapsulation, so that two LSRs that are
adjacent on an LSP can be separated by an IP network, even if that
IP network does not provide MPLS.

Changes in v3:
  - drop the patch that exported mpls_forward().
  - wrap the mpls iptunnel add/del functions and the functions/structures
    they depend on under ifdef.
  - move mpls_gre_rcv() to af_mpls.c and export it.
  - remove unnecessary functions.
 
Changes in v2:
  - wrap ip tunnel functions under ifdef in mpls file.
  - fix indentation.
  - check return code.

Example configuration:


         node1                LER1                       LER2                node2
        +-----+             +------+                   +------+             +-----+
        |     |             |      |                   |      |             |     |
        |     |             |      |p3  GRE tunnel   p4|      |             |     |
        |     |p1         p2|      +-------------------+      |p5         p6|     |
        |     +-------------+      +-------------------+      +-------------+     |
        |     |10.100.0.0/24|      |                   |      |10.200.0.0/24|     |
        |     |fd00:100::/64|      |  10.125.0.0/24    |      |fd00:200::/64|     |
        |     |             |      |  fd00:125::/64    |      |             |     |
        |     |             |      |                   |      |             |     |
        |     |             |      |                   |      |             |     |
        |     |             |      |                   |      |             |     |
        |     |             |      |                   |      |             |     |
        +-----+             +------+                   +------+             +-----+


		###	node1	###

ip link set p1 up
ip addr add 10.100.0.1/24 dev p1

		###	LER1	###

ip link set p2 up
ip addr add 10.100.0.2/24 dev p2

ip link set p3 up
ip addr add 10.125.0.1/24 dev p3

modprobe mpls_router
sysctl -w net.mpls.conf.p2.input=1
sysctl -w net.mpls.conf.p3.input=1
sysctl -w net.mpls.platform_labels=1000

ip link add gre1 type gre ttl 64 local 10.125.0.1 remote 10.125.0.2 dev p3
ip link set dev gre1 up

ip -M route add 111 as 222 dev gre1
ip -M route add 555 as 666 via inet 10.100.0.1 dev p2

		###	LER2	###

ip link set p5 up
ip addr add 10.200.0.2/24 dev p5

ip link set p4 up
ip addr add 10.125.0.2/24 dev p4

modprobe mpls_router
sysctl -w net.mpls.conf.p4.input=1
sysctl -w net.mpls.conf.p5.input=1
sysctl -w net.mpls.platform_labels=1000

ip link add gre1 type gre ttl 64 local 10.125.0.2 remote 10.125.0.1 dev p4
ip link set dev gre1 up

ip -M route add 444 as 555 dev gre1
ip -M route add 222 as 333 via inet 10.200.0.1 dev p5

		###	node2	###

ip link set p6 up
ip addr add 10.200.0.1/24 dev p6


Now use this scapy script to forge and send packets from port p1 of node1:

from scapy.all import Ether, Raw, sendp
from scapy.contrib.mpls import MPLS

p = Ether(src='de:ed:01:0c:41:09', dst='de:ed:01:2f:3b:ba')
p /= MPLS(s=1, ttl=64, label=111)/Raw(load=b'\xde')
sendp(p, iface="p1", count=20, inter=0.1)
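
For reference, the label stack entry that MPLS(s=1, ttl=64, label=111)
describes can be worked out by hand. The sketch below is only an
illustration of the RFC 3032 field layout (20-bit label, 3-bit traffic
class, 1-bit bottom-of-stack flag, 8-bit TTL); the helper names are
hypothetical and are not part of this patch or of the kernel. The value is
built in host byte order; on the wire it is sent big-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions of the fields inside one 32-bit label stack entry. */
#define LSE_LABEL_SHIFT 12
#define LSE_TC_SHIFT     9
#define LSE_S_SHIFT      8

/* Hypothetical helper: pack label/tc/s/ttl into one entry (host order). */
static uint32_t mpls_lse_encode(uint32_t label, uint8_t tc, uint8_t s,
				uint8_t ttl)
{
	assert(label <= 0xFFFFF && tc <= 7 && s <= 1);
	return (label << LSE_LABEL_SHIFT) |
	       ((uint32_t)tc << LSE_TC_SHIFT) |
	       ((uint32_t)s << LSE_S_SHIFT) |
	       ttl;
}

/* Hypothetical helpers: extract single fields again. */
static uint32_t mpls_lse_label(uint32_t lse)
{
	return lse >> LSE_LABEL_SHIFT;
}

static uint8_t mpls_lse_ttl(uint32_t lse)
{
	return (uint8_t)(lse & 0xFF);
}
```

With label=111, tc=0, s=1 and ttl=64 this gives 0x0006f140, i.e. the four
bytes 00 06 f1 40 in front of the payload, which matches the 4-byte size
of struct mpls_shim_hdr.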

Amine Kherbouche (1):
  ip_tunnel: add mpls over gre encapsulation

 include/linux/mpls.h           |  2 ++
 include/uapi/linux/if_tunnel.h |  1 +
 net/ipv4/ip_gre.c              | 11 +++++++++
 net/ipv6/ip6_gre.c             | 11 +++++++++
 net/mpls/af_mpls.c             | 52 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 77 insertions(+)

-- 
2.1.4


* Re: [PATCH v2 net-next 2/2] net/sched: allow flower to match tunnel options
From: Simon Horman @ 2017-09-27  9:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: David Miller, Jiri Pirko, Jamal Hadi Salim, Cong Wang, netdev,
	oss-drivers
In-Reply-To: <20170927091005.GB1944@nanopsycho.orion>

On Wed, Sep 27, 2017 at 11:10:05AM +0200, Jiri Pirko wrote:
> Wed, Sep 27, 2017 at 10:16:34AM CEST, simon.horman@netronome.com wrote:
> >Allow matching on options in tunnel headers.
> >This makes use of existing tunnel metadata support.
> >
> >Options are a bytestring of up to 256 bytes.
> >Tunnel implementations may support less or more options,
> >or no options at all.
> >
> >e.g.
> > # ip link add name geneve0 type geneve dstport 0 external
> > # tc qdisc add dev geneve0 ingress
> > # tc filter add dev geneve0 protocol ip parent ffff: \
> >     flower \
> >       enc_src_ip 10.0.99.192 \
> >       enc_dst_ip 10.0.99.193 \
> >       enc_key_id 11 \
> >       enc_opts 0102800100800020/fffffffffffffff0 \
> >       ip_proto udp \
> >       action mirred egress redirect dev eth1
> >
> >Signed-off-by: Simon Horman <simon.horman@netronome.com>
> >Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> >
> >---
> >v2
> >* Correct example which was incorrectly described setting rather
> >  than matching tunnel options
> >---
> > include/net/flow_dissector.h | 13 +++++++++++++
> > include/uapi/linux/pkt_cls.h |  3 +++
> > net/sched/cls_flower.c       | 35 ++++++++++++++++++++++++++++++++++-
> > 3 files changed, 50 insertions(+), 1 deletion(-)
> >
> >diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
> >index fc3dce730a6b..43f98bf0b349 100644
> >--- a/include/net/flow_dissector.h
> >+++ b/include/net/flow_dissector.h
> >@@ -183,6 +183,18 @@ struct flow_dissector_key_ip {
> > 	__u8	ttl;
> > };
> > 
> >+/**
> >+ * struct flow_dissector_key_enc_opts:
> >+ * @data: data
> >+ * @len: len
> >+ */
> >+struct flow_dissector_key_enc_opts {
> >+	u8 data[256];	/* Using IP_TUNNEL_OPTS_MAX is desired here
> >+			 * but seems difficult to #include
> >+			 */
> >+	u8 len;
> >+};
> >+
> > enum flow_dissector_key_id {
> > 	FLOW_DISSECTOR_KEY_CONTROL, /* struct flow_dissector_key_control */
> > 	FLOW_DISSECTOR_KEY_BASIC, /* struct flow_dissector_key_basic */
> >@@ -205,6 +217,7 @@ enum flow_dissector_key_id {
> > 	FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
> > 	FLOW_DISSECTOR_KEY_TCP, /* struct flow_dissector_key_tcp */
> > 	FLOW_DISSECTOR_KEY_IP, /* struct flow_dissector_key_ip */
> >+	FLOW_DISSECTOR_KEY_ENC_OPTS, /* struct flow_dissector_key_enc_opts */
> 
> I don't see the actual dissection implementation. Where is it?
> Did you test the patchset?

Yes, I did test it. But it is also possible something went astray along the
way and I will retest.

I think that the code you are looking for is in
fl_classify() in this patch.
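
On the key/mask notation in the quoted example
(enc_opts 0102800100800020/fffffffffffffff0): assuming the part before
the slash is the key and the part after it is the mask, the match can be
pictured as a masked byte comparison. This is a minimal sketch of those
semantics only; opts_match() is a hypothetical helper and is not the code
in fl_classify().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return 1 if pkt equals key in every bit position
 * that is set in mask, 0 otherwise. All three buffers are len bytes long.
 */
static int opts_match(const uint8_t *pkt, const uint8_t *key,
		      const uint8_t *mask, size_t len)
{
	size_t i;

	assert(pkt && key && mask);
	for (i = 0; i < len; i++) {
		/* XOR leaves the differing bits; the mask keeps the ones we care about. */
		if ((pkt[i] ^ key[i]) & mask[i])
			return 0;
	}
	return 1;
}
```

With the mask from the example, the low four bits of the last byte are
ignored, so options ending in 0x2f still match while options ending in
0x30 do not.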


* Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access
From: Jesper Dangaard Brouer @ 2017-09-27  9:26 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, alexei.starovoitov, john.fastabend, peter.waskiewicz.jr,
	jakub.kicinski, netdev, Andy Gospodarek, brouer
In-Reply-To: <59CAB17D.5090204@iogearbox.net>

On Tue, 26 Sep 2017 21:58:53 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> [...]
> > I'm currently implementing a cpumap type, that transfers raw XDP frames
> > to another CPU, and the SKB is allocated on the remote CPU.  (It
> > actually works extremely well).  
> 
> Meaning you let all the XDP_PASS packets get processed on a
> different CPU, so you can reserve the whole CPU just for
> prefiltering, right? 

Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
The trick is using the map-flush point, to transfer packets in bulk to
the remote CPU (single call IPC is too slow), but at the same time
flush single packets if NAPI didn't see a bulk.
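
The bulk-then-flush idea can be sketched in plain userspace C. This is a
simplified model under stated assumptions, not the cpumap code: the
struct and function names are hypothetical, the bulk size of 8 is the
limit mentioned further down, and the actual hand-over to the remote CPU
is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

#define BULK_SIZE 8	/* bulk limit mentioned in this mail */

/* Hypothetical per-RX-CPU staging area for packets bound for one remote CPU. */
struct bulk_queue {
	void *pkts[BULK_SIZE];
	size_t count;		/* packets currently staged */
	size_t flushes;		/* number of transfers to the remote CPU */
	size_t transferred;	/* packets handed over so far */
};

/* Hand all staged packets over in one operation. */
static void bq_flush(struct bulk_queue *bq)
{
	if (bq->count == 0)
		return;
	/* A real implementation would enqueue pkts[] to the remote CPU here. */
	bq->transferred += bq->count;
	bq->flushes++;
	bq->count = 0;
}

/* Stage one packet; flush first if the staging area is already full. */
static void bq_enqueue(struct bulk_queue *bq, void *pkt)
{
	if (bq->count == BULK_SIZE)
		bq_flush(bq);
	assert(bq->count < BULK_SIZE);
	bq->pkts[bq->count++] = pkt;
}

/* Model of one NAPI poll: stage n packets, then hit the flush point. */
static void napi_poll_sim(struct bulk_queue *bq, void **pkts, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		bq_enqueue(bq, pkts[i]);
	bq_flush(bq);	/* flush point: partial bulks, even single packets, go out */
}
```

A poll that sees 20 packets results in three transfers (8 + 8 + 4), while
a poll that sees a single packet still flushes it at the end instead of
waiting for a full bulk.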

> Do you have some numbers to share at this point, just curious when
> you mention it works extremely well.

Sure... I've done a lot of benchmarking on this patchset ;-)
I have a benchmark program called xdp_redirect_cpu [1][2], that collects
stats via tracepoints (atm I'm limiting bulking to 8 packets, and have
tracepoints at the bulk spots, to amortize the tracepoint cost: 25ns/8=3.125ns)

 [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
 [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c

Here I'm installing a DDoS program that drops UDP port 9 (pktgen
packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
the 11.9Mpps DDoS attack is hitting.

Running XDP/eBPF prog_num:4
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          0       12,030,471     11,966,982  0          
XDP-RX          total   12,030,471     11,966,982 
cpumap-enqueue    0:2   63,488         0           0          
cpumap-enqueue  sum:2   63,488         0           0          
cpumap_kthread  2       63,488         0           3          time_exceed
cpumap_kthread  total   63,488         0           0          
redirect_err    total   0              0          

$ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  1024     1024    10.00    12735.97   
16384  87380 

The netperf TCP_CRR performance is the same as without XDP loaded.


> Another test

I've previously shown (and optimized) in commit c0303efeab73 ("net:
reduce cycles spend on ICMP replies that gets rate limited"), that my
system can handle approx 2.7Mpps for UdpNoPorts, before the network
stack chokes.

Thus it is interesting to see, when I get UDP traffic that hits the
same CPU, if I can simply round-robin distribute it to other CPUs.  This
evaluates whether the cross-CPU transfer mechanism is fast enough.
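
The round-robin selection itself can be pictured as below. A minimal
sketch, assuming a fixed list of destination CPUs (1 to 4, as in the run
that follows); struct rr_state and rr_next_cpu() are hypothetical names
and this is not taken from the xdp_redirect_cpu sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical round-robin state: list of destination CPUs plus a cursor. */
struct rr_state {
	const unsigned int *cpus;	/* destination CPU ids */
	size_t ncpus;			/* number of entries in cpus[] */
	size_t next;			/* index of the CPU to use next */
};

/* Return the destination CPU for the next packet and advance the cursor. */
static unsigned int rr_next_cpu(struct rr_state *rr)
{
	unsigned int cpu;

	assert(rr->ncpus > 0);
	cpu = rr->cpus[rr->next];
	rr->next = (rr->next + 1) % rr->ncpus;
	return cpu;
}
```

Each destination CPU gets every fourth packet, which is why the per-CPU
enqueue rates in the output below are almost identical.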

I do have to increase the ixgbe RX-ring size, else the ixgbe recycle
scheme breaks down, and we stall on the page spin_lock (as Tariq has
demonstrated before).

 # ethtool -G ixgbe1 rx 1024 tx 1024

Start RR program and add some CPUs:

 # ./xdp_redirect_cpu --dev ixgbe1 --prog 2 --cpu 1 --cpu 2 --cpu 3 --cpu 4

Running XDP/eBPF prog_num:2
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          0       11,006,992     0           0          
XDP-RX          total   11,006,992     0          
cpumap-enqueue    0:1   2,751,744      0           0          
cpumap-enqueue  sum:1   2,751,744      0           0          
cpumap-enqueue    0:2   2,751,748      0           0          
cpumap-enqueue  sum:2   2,751,748      0           0          
cpumap-enqueue    0:3   2,751,744      35          0          
cpumap-enqueue  sum:3   2,751,744      35          0          
cpumap-enqueue    0:4   2,751,748      0           0          
cpumap-enqueue  sum:4   2,751,748      0           0          
cpumap_kthread  1       2,751,745      0           156        time_exceed
cpumap_kthread  2       2,751,749      0           142        time_exceed
cpumap_kthread  3       2,751,713      0           131        time_exceed
cpumap_kthread  4       2,751,749      0           128        time_exceed
cpumap_kthread  total   11,006,957     0           0          
redirect_err    total   0              0          

$ nstat > /dev/null && sleep 1 && nstat | grep UdpNoPorts
UdpNoPorts                      11042282           0.0

The nstat output shows that the Linux network stack is actually now
processing (SKB alloc + free) 11Mpps.

The generator was sending at 14Mpps, thus the XDP-RX program is
actually the bottleneck here. And I do see some drops at the HW level.
Thus, one CPU was not quite fast enough.

So, let's allocate two CPUs for XDP-RX:

Running XDP/eBPF prog_num:2
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          0       6,352,578      0           0          
XDP-RX          1       6,352,711      0           0          
XDP-RX          total   12,705,289     0          
cpumap-enqueue    0:2   1,588,156      1,351       0          
cpumap-enqueue    1:2   1,588,174      1,330       0          
cpumap-enqueue  sum:2   3,176,331      2,682       0          
cpumap-enqueue    0:3   1,588,157      994         0          
cpumap-enqueue    1:3   1,588,170      912         0          
cpumap-enqueue  sum:3   3,176,327      1,907       0          
cpumap-enqueue    0:4   1,588,157      529         0          
cpumap-enqueue    1:4   1,588,167      514         0          
cpumap-enqueue  sum:4   3,176,324      1,044       0          
cpumap-enqueue    0:5   1,588,159      625         0          
cpumap-enqueue    1:5   1,588,166      614         0          
cpumap-enqueue  sum:5   3,176,326      1,240       0          
cpumap_kthread  2       3,173,642      0           11257      time_exceed
cpumap_kthread  3       3,174,423      0           9779       time_exceed
cpumap_kthread  4       3,175,283      0           3938       time_exceed
cpumap_kthread  5       3,175,083      0           3120       time_exceed
cpumap_kthread  total   12,698,432     0           0          (null)
redirect_err    total   0              0          

In this run, I'm using ./pktgen_sample04_many_flows.sh, and my generator
machine cannot generate more than 12,682,445 tx_packets /sec.
nstat says: UdpNoPorts 12,698,001 pps.  The XDP-RX CPUs actually have
30% idle CPU cycles, as they "only" handle 6.3Mpps each ;-)

Perf top on a CPU(3) that has to alloc and free SKBs etc.

# Overhead  CPU  Symbol                                 
# ........  ...  .......................................
#
    15.51%  003  [k] fib_table_lookup
     8.91%  003  [k] cpu_map_kthread_run
     8.04%  003  [k] build_skb
     7.88%  003  [k] page_frag_free
     5.13%  003  [k] kmem_cache_alloc
     4.76%  003  [k] ip_route_input_rcu
     4.59%  003  [k] kmem_cache_free
     4.02%  003  [k] __udp4_lib_rcv
     3.20%  003  [k] fib_validate_source
     3.02%  003  [k] __netif_receive_skb_core
     3.02%  003  [k] udp_v4_early_demux
     2.90%  003  [k] ip_rcv
     2.80%  003  [k] ip_rcv_finish
     2.26%  003  [k] eth_type_trans
     2.23%  003  [k] __build_skb
     2.00%  003  [k] icmp_send
     1.84%  003  [k] __rcu_read_unlock
     1.30%  003  [k] ip_local_deliver_finish
     1.26%  003  [k] netif_receive_skb_internal
     1.17%  003  [k] ip_route_input_noref
     1.11%  003  [k] make_kuid
     1.09%  003  [k] __udp4_lib_lookup
     1.07%  003  [k] skb_release_head_state
     1.04%  003  [k] __rcu_read_lock
     0.95%  003  [k] kfree_skb
     0.89%  003  [k] __local_bh_enable_ip
     0.88%  003  [k] skb_release_data
     0.71%  003  [k] ip_local_deliver
     0.58%  003  [k] netif_receive_skb

cmdline:
 perf report --sort cpu,symbol --kallsyms=/proc/kallsyms  --no-children  -C3 -g none --stdio

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: tc H/W offload issue with vxlan tunnels [was: nfp: flower vxlan tunnel offload]
From: Jiri Pirko @ 2017-09-27  9:17 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Or Gerlitz, Jiri Benc, Simon Horman, David Miller, Jakub Kicinski,
	Linux Netdev List, oss-drivers, John Hurley, Paul Blakey,
	Jiri Pirko, Roi Dayan
In-Reply-To: <1506500975.2867.19.camel@redhat.com>

Wed, Sep 27, 2017 at 10:29:35AM CEST, pabeni@redhat.com wrote:
>Hi,
>
>Moving to a separate thread, since I think this is more related to the
>flower core infrastructure than to the Netronome patches.
>
>On Wed, 2017-09-27 at 09:40 +0200, Jiri Pirko wrote:
>> This kind of hooks are giving me nightmares. The code is screwed up as
>> it is already. I'm currently working on conversion to callbacks. This
>> part is handled in:
>> https://github.com/jpirko/linux_mlxsw/commits/jiri_devel_egdevcb
>
>Thanks for the pointer.
>
>I skimmed quickly through the code and indeed it cleans up this area a lot.
>If I read it correctly the ('good') command:
>
>tc filter add dev vxlan0 protocol ip parent ffff: flower enc_key_id 102 \
>   enc_dst_port 4789 src_ip 3.4.5.6 skip_sw action [...]

I suppose "action mirred redirect eth0". Then yes, it will generate the
callpath you described below.

>
>will generate a call to:
>
>mlx5e_setup_tc(eth0, TC_SETUP_CLSFLOWER, &cls_flower) via:
>
>fl_hw_replace_filter() ->
>  tc_setup_cb_call() -> 
>    tc_exts_setup_cb_egdev_call() ->
>      tc_setup_cb_egdev_call() ->
>        tcf_action_egdev_cb_call() ->
>          mlx5e_rep_setup_tc_cb()
>
>and the 'bad' command:
>
>tc filter add dev eth0 protocol ip parent ffff: flower enc_key_id 102 \
>   enc_dst_port 4789 src_ip 3.4.5.6 skip_sw action [...]
>
>will also call:
>
>mlx5e_setup_tc(eth0, TC_SETUP_CLSFLOWER, &cls_flower) via:
>
>fl_hw_replace_filter() ->
>  ndo_setup_tc()

Sure. You are adding a rule to eth0, the call goes down to the eth0 driver.
I'm missing why that is a problem. Why should the call not go down to the
eth0 driver?


>
>So it looks like the H/W offload hook will still be called with the
>same arguments in both cases, and the 'bad' rule will still be pushed to
>the H/W, as the driver itself has no way to distinguish between the two
>scenarios.

Why "bad"?

Regarding the distinction, the driver knows whether the user adds a rule
directly to eth0, or whether eth0 is the egress device in the action.
Those are 2 separate driver entrypoints - of course, talking about code
with my changes.


>
>[ Note: I referred to the mlx hook just for convenience, should be the
>same with any driver implementing the same APIs ]
>
>Am I missing something?
>
>Thanks,
>
>Paolo


* Re: [PATCH v2 net-next 2/2] net/sched: allow flower to match tunnel options
From: Jiri Pirko @ 2017-09-27  9:10 UTC (permalink / raw)
  To: Simon Horman
  Cc: David Miller, Jiri Pirko, Jamal Hadi Salim, Cong Wang, netdev,
	oss-drivers
In-Reply-To: <1506500194-17637-3-git-send-email-simon.horman@netronome.com>

Wed, Sep 27, 2017 at 10:16:34AM CEST, simon.horman@netronome.com wrote:
>Allow matching on options in tunnel headers.
>This makes use of existing tunnel metadata support.
>
>Options are a bytestring of up to 256 bytes.
>Tunnel implementations may support less or more options,
>or no options at all.
>
>e.g.
> # ip link add name geneve0 type geneve dstport 0 external
> # tc qdisc add dev geneve0 ingress
> # tc filter add dev geneve0 protocol ip parent ffff: \
>     flower \
>       enc_src_ip 10.0.99.192 \
>       enc_dst_ip 10.0.99.193 \
>       enc_key_id 11 \
>       enc_opts 0102800100800020/fffffffffffffff0 \
>       ip_proto udp \
>       action mirred egress redirect dev eth1
>
>Signed-off-by: Simon Horman <simon.horman@netronome.com>
>Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>
>---
>v2
>* Correct example which was incorrectly described setting rather
>  than matching tunnel options
>---
> include/net/flow_dissector.h | 13 +++++++++++++
> include/uapi/linux/pkt_cls.h |  3 +++
> net/sched/cls_flower.c       | 35 ++++++++++++++++++++++++++++++++++-
> 3 files changed, 50 insertions(+), 1 deletion(-)
>
>diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
>index fc3dce730a6b..43f98bf0b349 100644
>--- a/include/net/flow_dissector.h
>+++ b/include/net/flow_dissector.h
>@@ -183,6 +183,18 @@ struct flow_dissector_key_ip {
> 	__u8	ttl;
> };
> 
>+/**
>+ * struct flow_dissector_key_enc_opts:
>+ * @data: data
>+ * @len: len
>+ */
>+struct flow_dissector_key_enc_opts {
>+	u8 data[256];	/* Using IP_TUNNEL_OPTS_MAX is desired here
>+			 * but seems difficult to #include
>+			 */
>+	u8 len;
>+};
>+
> enum flow_dissector_key_id {
> 	FLOW_DISSECTOR_KEY_CONTROL, /* struct flow_dissector_key_control */
> 	FLOW_DISSECTOR_KEY_BASIC, /* struct flow_dissector_key_basic */
>@@ -205,6 +217,7 @@ enum flow_dissector_key_id {
> 	FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
> 	FLOW_DISSECTOR_KEY_TCP, /* struct flow_dissector_key_tcp */
> 	FLOW_DISSECTOR_KEY_IP, /* struct flow_dissector_key_ip */
>+	FLOW_DISSECTOR_KEY_ENC_OPTS, /* struct flow_dissector_key_enc_opts */

I don't see the actual dissection implementation. Where is it?
Did you test the patchset?


* RE: [PATCH net v2] net: dsa: mv88e6xxx: lock mutex when freeing IRQs
From: David Laight @ 2017-09-27  9:06 UTC (permalink / raw)
  To: 'Vivien Didelot', netdev@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, kernel@savoirfairelinux.com,
	David S. Miller, Florian Fainelli, Andrew Lunn
In-Reply-To: <20170926185721.12187-1-vivien.didelot@savoirfairelinux.com>

From: Vivien Didelot
> Sent: 26 September 2017 19:57
> mv88e6xxx_g2_irq_free locks the registers mutex, but not
> mv88e6xxx_g1_irq_free, which results in a stack trace from
> assert_reg_lock when unloading the mv88e6xxx module. Fix this.
> 
> Fixes: 3460a5770ce9 ("net: dsa: mv88e6xxx: Mask g1 interrupts and free interrupt")
> Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
> ---
>  drivers/net/dsa/mv88e6xxx/chip.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> index c6678aa9b4ef..e7ff7483d2fb 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -3947,7 +3947,9 @@ static void mv88e6xxx_remove(struct mdio_device *mdiodev)
>  	if (chip->irq > 0) {
>  		if (chip->info->g2_irqs > 0)
>  			mv88e6xxx_g2_irq_free(chip);
> +		mutex_lock(&chip->reg_lock);
>  		mv88e6xxx_g1_irq_free(chip);
> +		mutex_unlock(&chip->reg_lock);

Isn't the irq_free code likely to have to sleep waiting for any
ISR to complete??

	David


* RE: [PATCH net-next] liquidio: fix format truncation warning reported by gcc 7.1.1
From: David Laight @ 2017-09-27  9:04 UTC (permalink / raw)
  To: 'Felix Manlunas', davem@davemloft.net
  Cc: netdev@vger.kernel.org, raghu.vatsavayi@cavium.com,
	derek.chickles@cavium.com, satananda.burla@cavium.com
In-Reply-To: <20170926184827.GA3512@felix-thinkpad.cavium.com>

From: Felix Manlunas
> Sent: 26 September 2017 19:48
> gcc 7.1.1 with -Wformat-truncation reports these warnings:
> 
> drivers/net/ethernet/cavium/liquidio/lio_core.c: In function `octeon_setup_interrupt':
> drivers/net/ethernet/cavium/liquidio/lio_core.c:1003:41: warning: `%u' directive output may be
> truncated writing between 1 and 10 bytes into a region of size between 0 and 13 [-Wformat-truncation=]
>        INTRNAMSIZ, "LiquidIO%u-pf%u-rxtx-%u",
...
> Fix them by changing the type of the "i" local variable from int to short.

That probably adds pointless code bloat by forcing the compiler to
keep masking the value with 0xffff after every arithmetic operation.

About the only architecture that doesn't suffer the penalty is x86.

Until the compiler can correctly track the domain of values (and
be given hints about the domains) this warning is, IMHO, OTT.
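
The narrowing can be illustrated with a small standalone sketch. The
helper names are hypothetical and nothing here is taken from the liquidio
driver; an unsigned 16-bit type is used instead of short so that the
wrap-around is well defined. In C, arithmetic on a 16-bit type is carried
out in int and the result is then narrowed back, which on a machine
without 16-bit arithmetic can cost an extra mask or extend instruction.
Whether a given compiler actually emits one depends on the target and the
optimization level.

```c
#include <assert.h>
#include <stdint.h>

/* The example assumes int is wider than 16 bits, as on common targets. */
static_assert(sizeof(unsigned int) > sizeof(uint16_t),
	      "int must be wider than 16 bits for this illustration");

/* 16-bit counter: i + 1 is computed as int, then truncated to 16 bits. */
static uint16_t next_u16(uint16_t i)
{
	return (uint16_t)(i + 1);
}

/* Native-width counter: no narrowing step is needed. */
static unsigned int next_uint(unsigned int i)
{
	return i + 1;
}
```

The two helpers agree for small values and differ only at the 16-bit
boundary, where the narrowed version wraps to 0.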

	David


* tc H/W offload issue with vxlan tunnels [was: nfp: flower vxlan tunnel offload]
From: Paolo Abeni @ 2017-09-27  8:29 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Or Gerlitz, Jiri Benc, Simon Horman, David Miller, Jakub Kicinski,
	Linux Netdev List, oss-drivers, John Hurley, Paul Blakey,
	Jiri Pirko, Roi Dayan

Hi,

Moving to a separate thread, since I think this is more related to the
flower core infrastructure than to the Netronome patches.

On Wed, 2017-09-27 at 09:40 +0200, Jiri Pirko wrote:
> This kind of hooks are giving me nightmares. The code is screwed up as
> it is already. I'm currently working on conversion to callbacks. This
> part is handled in:
> https://github.com/jpirko/linux_mlxsw/commits/jiri_devel_egdevcb

Thanks for the pointer.

I skimmed the code quickly and indeed it cleans up this area a lot.
If I read it correctly the ('good') command:

tc filter add dev vxlan0 protocol ip parent ffff: flower enc_key_id 102 \
   enc_dst_port 4789 src_ip 3.4.5.6 skip_sw action [...]

will generate a call to:

mlx5e_setup_tc(eth0, TC_SETUP_CLSFLOWER, &cls_flower) via:

fl_hw_replace_filter() ->
  tc_setup_cb_call() -> 
    tc_exts_setup_cb_egdev_call() ->
      tc_setup_cb_egdev_call() ->
        tcf_action_egdev_cb_call() ->
          mlx5e_rep_setup_tc_cb()

and the 'bad' command:

tc filter add dev eth0 protocol ip parent ffff: flower enc_key_id 102 \
   enc_dst_port 4789 src_ip 3.4.5.6 skip_sw action [...]

will also call:

mlx5e_setup_tc(eth0, TC_SETUP_CLSFLOWER, &cls_flower) via:

fl_hw_replace_filter() ->
  ndo_setup_tc()

So it looks like the H/W offload hook will still be called with the
same arguments in both cases, and the 'bad' rule will still be pushed to
the H/W, as the driver itself has no way to distinguish between the two
scenarios.

[ Note: I referred to the mlx hook just for convenience, should be the
same with any driver implementing the same APIs ]

Am I missing something?

Thanks,

Paolo


* Re: [PATCH iproute2] tc: fix ipv6 filter selector attribute for some prefix lengths
From: Stephen Hemminger @ 2017-09-27  8:26 UTC (permalink / raw)
  To: Yulia Kartseva; +Cc: netdev, shemminger
In-Reply-To: <CAOgqDSEvuZVExji=NXq0s1Ze9pQTawQGy=OQxLxwU7b52xHNXQ@mail.gmail.com>

On Mon, 25 Sep 2017 11:12:38 -0700
Yulia Kartseva <yulia.kartseva@gmail.com> wrote:

> Wrong TCA_U32_SEL attribute packing if prefixLen AND 0x1f equals 0x1f.
> These are  /31, /63, /95 and /127 prefix lengths.
> 
> Example:
> # tc filter add dev eth0 protocol ipv6 parent b: prio 2307 u32 match
> ip6 dst face:b00f::/31
> # tc filter show dev eth0
> filter parent b: protocol ipv6 pref 2307 u32
> filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
> filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
> key ht 800 bkt 0
>   match faceb00f/ffffffff at 24
> 
> 
> The correct match would be "faceb00e/fffffffe": don't count the last
> bit of the 4th byte as the network prefix. With fix:
> 
> # tc filter show dev eth0
> filter parent b: protocol ipv6 pref 2307 u32
> filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
> filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
> key ht 800 bkt 0
>   match faceb00e/fffffffe at 24
> 
>  tc/f_u32.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/tc/f_u32.c b/tc/f_u32.c
> index 5815be9..14b9588 100644
> --- a/tc/f_u32.c
> +++ b/tc/f_u32.c
> @@ -385,8 +385,7 @@ static int parse_ip6_addr(int *argc_p, char ***argv_p,
> 
>   plen = addr.bitlen;
>   for (i = 0; i < plen; i += 32) {
> - /* if (((i + 31) & ~0x1F) <= plen) { */
> - if (i + 31 <= plen) {
> + if (i + 31 < plen) {
>   res = pack_key(sel, addr.data[i / 32],
>         0xFFFFFFFF, off + 4 * (i / 32), offmask);
>   if (res < 0)

This patch looks correct, but will not apply cleanly because
the mail system that you submitted it with is removing whitespace.
If possible use a different client, or send as an attachment.
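
For readers following the /31 case, here is a minimal userspace sketch
of the per-word mask arithmetic, assuming the partial-word branch of the
loop builds its mask by shifting 0xFFFFFFFF left. The helper name
prefix_word_mask() is made up for illustration and is not part of
tc/f_u32.c; masks are shown in host order, without the byte swapping the
real code would need.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Host-order mask for the 32-bit word starting at bit offset i of a
 * prefix that is plen bits long (illustrative helper, not from f_u32.c).
 * Returns 0 when the word lies entirely outside the prefix.
 */
static uint32_t prefix_word_mask(unsigned int plen, unsigned int i)
{
	assert(i % 32 == 0);

	if (i >= plen)
		return 0;
	if (i + 31 < plen)		/* word fully inside the prefix */
		return 0xFFFFFFFFu;
	/* between 1 and 31 prefix bits remain in this word */
	return 0xFFFFFFFFu << (32 - (plen - i));
}
```

With the "<" comparison a /31 prefix falls through to the partial-word
branch, so 0xfaceb00f is masked with 0xfffffffe and becomes 0xfaceb00e,
which agrees with the "with fix" output quoted above.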


* [PATCH net-next 3/3] tun: introduce cpu id based steering policy
From: Jason Wang @ 2017-09-27  8:23 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang
In-Reply-To: <1506500637-13881-1-git-send-email-jasowang@redhat.com>

This patch introduces a simple queue selection policy which just
chooses the txq based on processor id. This may be useful for
connectionless workloads or when #queues is equal to #cpus.

Redirecting UDP packets generated by MoonGen between two virtio-net
ports through xdp_redirect shows a 37.4% (from 0.8Mpps to 1.1Mpps)
improvement compared to the automatic steering policy, since the
overhead of flow caches/hashing is totally eliminated.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c           | 33 ++++++++++++++++++++++++++++++++-
 include/uapi/linux/if_tun.h |  1 +
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1106521..03b4506 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -190,6 +190,20 @@ struct tun_steering_ops {
 			 u32 data);
 };
 
+void tun_steering_xmit_nop(struct tun_struct *tun, struct sk_buff *skb)
+{
+}
+
+u32 tun_steering_pre_rx_nop(struct tun_struct *tun, struct sk_buff *skb)
+{
+	return 0;
+}
+
+void tun_steering_post_rx_nop(struct tun_struct *tun, struct tun_file *tfile,
+			      u32 data)
+{
+}
+
 struct tun_flow_entry {
 	struct hlist_node hash_link;
 	struct rcu_head rcu;
@@ -571,6 +585,11 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
 	return txq;
 }
 
+static u16 tun_cpu_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+{
+	return smp_processor_id() % tun->numqueues;
+}
+
 static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
 			    void *accel_priv, select_queue_fallback_t fallback)
 {
@@ -2152,6 +2171,13 @@ static struct tun_steering_ops tun_automq_ops = {
 	.post_rx = tun_automq_post_rx,
 };
 
+static struct tun_steering_ops tun_cpu_ops = {
+	.select_queue = tun_cpu_select_queue,
+	.xmit = tun_steering_xmit_nop,
+	.pre_rx = tun_steering_pre_rx_nop,
+	.post_rx = tun_steering_post_rx_nop,
+};
+
 static int tun_flags(struct tun_struct *tun)
 {
 	return tun->flags & (TUN_FEATURES | IFF_PERSIST | IFF_TUN | IFF_TAP);
@@ -2775,6 +2801,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		case TUN_STEERING_AUTOMQ:
 			tun->steering_ops = &tun_automq_ops;
 			break;
+		case TUN_STEERING_CPU:
+			tun->steering_ops = &tun_cpu_ops;
+			break;
 		default:
 			ret = -EFAULT;
 		}
@@ -2784,6 +2813,8 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		ret = 0;
 		if (tun->steering_ops == &tun_automq_ops)
 			steering = TUN_STEERING_AUTOMQ;
+		else if (tun->steering_ops == &tun_cpu_ops)
+			steering = TUN_STEERING_CPU;
 		else
 			BUG();
 		if (copy_to_user(argp, &steering, sizeof(steering)))
@@ -2792,7 +2823,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 
 	case TUNGETSTEERINGFEATURES:
 		ret = 0;
-		steering = TUN_STEERING_AUTOMQ;
+		steering = TUN_STEERING_AUTOMQ | TUN_STEERING_CPU;
 		if (copy_to_user(argp, &steering, sizeof(steering)))
 			ret = -EFAULT;
 		break;
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 109760e..5f71d29 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -112,5 +112,6 @@ struct tun_filter {
 };
 
 #define TUN_STEERING_AUTOMQ 0x01 /* Automatic flow steering */
+#define TUN_STEERING_CPU    0x02 /* Processor id based flow steering */
 
 #endif /* _UAPI__IF_TUN_H */
-- 
2.7.4


* [PATCH net-next 2/3] tun: introduce ioctls to set and get steering policies
From: Jason Wang @ 2017-09-27  8:23 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang
In-Reply-To: <1506500637-13881-1-git-send-email-jasowang@redhat.com>

This patch introduces new ioctls for changing the packet steering
policy for tun. Only automatic flow steering is supported for now;
more policies will come.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c           | 35 ++++++++++++++++++++++++++++++++++-
 include/uapi/linux/if_tun.h |  7 +++++++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index de83e72..1106521 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -122,7 +122,8 @@ do {								\
 #define TUN_VNET_BE     0x40000000
 
 #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
-		      IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
+		      IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS | \
+		      IFF_MULTI_STEERING)
 
 #define GOODCOPY_LEN 128
 
@@ -2506,6 +2507,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 	unsigned int ifindex;
 	int le;
 	int ret;
+	unsigned int steering;
 
 	if (cmd == TUNSETIFF || cmd == TUNSETQUEUE || _IOC_TYPE(cmd) == SOCK_IOC_TYPE) {
 		if (copy_from_user(&ifr, argp, ifreq_len))
@@ -2764,6 +2766,37 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		ret = 0;
 		break;
 
+	case TUNSETSTEERING:
+		ret = -EFAULT;
+		if (copy_from_user(&steering, argp, sizeof(steering)))
+			break;
+		ret = 0;
+		switch (steering) {
+		case TUN_STEERING_AUTOMQ:
+			tun->steering_ops = &tun_automq_ops;
+			break;
+		default:
+			ret = -EFAULT;
+		}
+		break;
+
+	case TUNGETSTEERING:
+		ret = 0;
+		if (tun->steering_ops == &tun_automq_ops)
+			steering = TUN_STEERING_AUTOMQ;
+		else
+			BUG();
+		if (copy_to_user(argp, &steering, sizeof(steering)))
+			ret = -EFAULT;
+		break;
+
+	case TUNGETSTEERINGFEATURES:
+		ret = 0;
+		steering = TUN_STEERING_AUTOMQ;
+		if (copy_to_user(argp, &steering, sizeof(steering)))
+			ret = -EFAULT;
+		break;
+
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 365ade5..109760e 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -56,6 +56,9 @@
  */
 #define TUNSETVNETBE _IOW('T', 222, int)
 #define TUNGETVNETBE _IOR('T', 223, int)
+#define TUNSETSTEERING _IOW('T', 224, unsigned int)
+#define TUNGETSTEERING _IOR('T', 225, unsigned int)
+#define TUNGETSTEERINGFEATURES _IOR('T', 226, unsigned int)
 
 /* TUNSETIFF ifr flags */
 #define IFF_TUN		0x0001
@@ -70,6 +73,8 @@
 #define IFF_MULTI_QUEUE 0x0100
 #define IFF_ATTACH_QUEUE 0x0200
 #define IFF_DETACH_QUEUE 0x0400
+#define IFF_MULTI_STEERING 0x2000
+
 /* read-only flag */
 #define IFF_PERSIST	0x0800
 #define IFF_NOFILTER	0x1000
@@ -106,4 +111,6 @@ struct tun_filter {
 	__u8   addr[0][ETH_ALEN];
 };
 
+#define TUN_STEERING_AUTOMQ 0x01 /* Automatic flow steering */
+
 #endif /* _UAPI__IF_TUN_H */
-- 
2.7.4


* [PATCH net-next 1/3] tun: abstract flow steering logic
From: Jason Wang @ 2017-09-27  8:23 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang
In-Reply-To: <1506500637-13881-1-git-send-email-jasowang@redhat.com>

tun currently uses a flow-cache based automatic queue steering method.
This may not suit all use cases. To allow more flow steering policies,
this patch abstracts the flow steering logic into tun_steering_ops, so
that we can declare and use different methods in the future.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 85 +++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 63 insertions(+), 22 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2c36f6e..de83e72 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -181,6 +181,14 @@ struct tun_file {
 	struct skb_array tx_array;
 };
 
+struct tun_steering_ops {
+	u16 (*select_queue) (struct tun_struct *tun, struct sk_buff *skb);
+	void (*xmit) (struct tun_struct *tun, struct sk_buff *skb);
+	u32 (*pre_rx) (struct tun_struct *tun, struct sk_buff *skb);
+	void (*post_rx) (struct tun_struct *tun, struct tun_file *tfile,
+			 u32 data);
+};
+
 struct tun_flow_entry {
 	struct hlist_node hash_link;
 	struct rcu_head rcu;
@@ -231,6 +239,7 @@ struct tun_struct {
 	u32 rx_batched;
 	struct tun_pcpu_stats __percpu *pcpu_stats;
 	struct bpf_prog __rcu *xdp_prog;
+	struct tun_steering_ops *steering_ops;
 };
 
 static int tun_napi_receive(struct napi_struct *napi, int budget)
@@ -532,10 +541,8 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
  * different rxq no. here. If we could not get rxhash, then we would
  * hope the rxq no. may help here.
  */
-static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
-			    void *accel_priv, select_queue_fallback_t fallback)
+static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
 {
-	struct tun_struct *tun = netdev_priv(dev);
 	struct tun_flow_entry *e;
 	u32 txq = 0;
 	u32 numqueues = 0;
@@ -559,9 +566,18 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
 	}
 
 	rcu_read_unlock();
+
 	return txq;
 }
 
+static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
+			    void *accel_priv, select_queue_fallback_t fallback)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+
+	return tun->steering_ops->select_queue(tun, skb);
+}
+
 static inline bool tun_not_capable(struct tun_struct *tun)
 {
 	const struct cred *cred = current_cred();
@@ -931,24 +947,10 @@ static int tun_net_close(struct net_device *dev)
 	return 0;
 }
 
-/* Net device start xmit */
-static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
+static void tun_automq_xmit(struct tun_struct *tun, struct sk_buff *skb)
 {
-	struct tun_struct *tun = netdev_priv(dev);
-	int txq = skb->queue_mapping;
-	struct tun_file *tfile;
-	u32 numqueues = 0;
-
-	rcu_read_lock();
-	tfile = rcu_dereference(tun->tfiles[txq]);
-	numqueues = ACCESS_ONCE(tun->numqueues);
-
-	/* Drop packet if interface is not attached */
-	if (txq >= numqueues)
-		goto drop;
-
 #ifdef CONFIG_RPS
-	if (numqueues == 1 && static_key_false(&rps_needed)) {
+	if (ACCESS_ONCE(tun->numqueues) == 1 && static_key_false(&rps_needed)) {
 		/* Select queue was not called for the skbuff, so we extract the
 		 * RPS hash and save it into the flow_table here.
 		 */
@@ -964,6 +966,25 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 	}
 #endif
+}
+
+/* Net device start xmit */
+static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+	int txq = skb->queue_mapping;
+	struct tun_file *tfile;
+	u32 numqueues = 0;
+
+	rcu_read_lock();
+	tfile = rcu_dereference(tun->tfiles[txq]);
+	numqueues = ACCESS_ONCE(tun->numqueues);
+
+	/* Drop packet if interface is not attached */
+	if (txq >= numqueues)
+		goto drop;
+
+	tun->steering_ops->xmit(tun, skb);
 
 	tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
 
@@ -1527,6 +1548,17 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	return NULL;
 }
 
+u32 tun_automq_pre_rx(struct tun_struct *tun, struct sk_buff *skb)
+{
+	return __skb_get_hash_symmetric(skb);
+}
+
+void tun_automq_post_rx(struct tun_struct *tun, struct tun_file *tfile,
+			u32 rxhash)
+{
+	tun_flow_update(tun, rxhash, tfile);
+}
+
 /* Get packet from user space buffer */
 static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 			    void *msg_control, struct iov_iter *from,
@@ -1542,7 +1574,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	int copylen;
 	bool zerocopy = false;
 	int err;
-	u32 rxhash;
+	u32 data;
 	int skb_xdp = 1;
 	bool frags = tun_napi_frags_enabled(tun);
 
@@ -1728,7 +1760,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 		rcu_read_unlock();
 	}
 
-	rxhash = __skb_get_hash_symmetric(skb);
+	data = tun->steering_ops->pre_rx(tun, skb);
 
 	if (frags) {
 		/* Exercise flow dissector code path. */
@@ -1772,7 +1804,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	u64_stats_update_end(&stats->syncp);
 	put_cpu_ptr(stats);
 
-	tun_flow_update(tun, rxhash, tfile);
+	tun->steering_ops->post_rx(tun, tfile, data);
 	return total_len;
 }
 
@@ -2112,6 +2144,13 @@ static struct proto tun_proto = {
 	.obj_size	= sizeof(struct tun_file),
 };
 
+static struct tun_steering_ops tun_automq_ops = {
+	.select_queue = tun_automq_select_queue,
+	.xmit = tun_automq_xmit,
+	.pre_rx = tun_automq_pre_rx,
+	.post_rx = tun_automq_post_rx,
+};
+
 static int tun_flags(struct tun_struct *tun)
 {
 	return tun->flags & (TUN_FEATURES | IFF_PERSIST | IFF_TUN | IFF_TAP);
@@ -2268,6 +2307,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 			goto err_free_dev;
 		}
 
+		tun->steering_ops = &tun_automq_ops;
+
 		spin_lock_init(&tun->lock);
 
 		err = security_tun_dev_alloc_security(&tun->security);
-- 
2.7.4


* [PATCH net-next 0/3] support changing steering policies in tuntap
From: Jason Wang @ 2017-09-27  8:23 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: mst, Jason Wang

Hi all:

We use a flow-cache based flow steering policy now. This is good for
connection-oriented communication such as TCP but not for others,
e.g. connectionless unidirectional workloads which care only about
pps. This calls for the ability to change steering policies in
tuntap, which is what this series adds.

Flow steering policy was abstracted into tun_steering_ops in the first
patch. Then new ioctls to set or query the current policy were
introduced, and the last patch introduces a very simple policy that
selects the txq based on processor id as an example.
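
To make the idea concrete outside the kernel, here is a rough userspace
sketch of a steering ops table with two interchangeable policies. All
names in it (steer_dev, steer_ops, hash_select_queue, cpu_select_queue)
are invented for illustration, and the hash policy is only a stand-in
for the flow-cache lookup, which is not reproduced here. Both policies
assume numqueues is non-zero.

```c
#include <assert.h>
#include <stdint.h>

struct steer_dev;

/* One callback per policy; the device points at the active table. */
struct steer_ops {
	uint16_t (*select_queue)(const struct steer_dev *dev,
				 uint32_t flow_hash, unsigned int cpu);
};

struct steer_dev {
	unsigned int numqueues;
	const struct steer_ops *ops;
};

/* Stand-in for automatic steering: scale a 32-bit hash to a queue. */
static uint16_t hash_select_queue(const struct steer_dev *dev,
				  uint32_t flow_hash, unsigned int cpu)
{
	(void)cpu;
	assert(dev->numqueues > 0);
	return (uint16_t)(((uint64_t)flow_hash * dev->numqueues) >> 32);
}

/* Processor id policy: no per-flow state, just cpu modulo #queues. */
static uint16_t cpu_select_queue(const struct steer_dev *dev,
				 uint32_t flow_hash, unsigned int cpu)
{
	(void)flow_hash;
	assert(dev->numqueues > 0);
	return (uint16_t)(cpu % dev->numqueues);
}

static const struct steer_ops hash_ops = { .select_queue = hash_select_queue };
static const struct steer_ops cpu_ops = { .select_queue = cpu_select_queue };

/* Callers never care which policy is active. */
static uint16_t select_queue(const struct steer_dev *dev,
			     uint32_t flow_hash, unsigned int cpu)
{
	return dev->ops->select_queue(dev, flow_hash, cpu);
}
```

Switching policy is then a single pointer assignment on the device,
which is roughly what the set-policy ioctl in this series does with
tun->steering_ops.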

Testing was done by using xdp_redirect to redirect traffic generated by
MoonGen running on a remote machine. I see a 37% improvement for the
processor id policy compared to the automatic flow steering policy.

In the future, both simple and sophisticated policies, like RSS or other
guest driven steering policies, could be done on top.

Thanks

Jason Wang (3):
  tun: abstract flow steering logic
  tun: introduce ioctls to set and get steering policies
  tun: introduce cpu id based steering policy

 drivers/net/tun.c           | 151 +++++++++++++++++++++++++++++++++++++-------
 include/uapi/linux/if_tun.h |   8 +++
 2 files changed, 136 insertions(+), 23 deletions(-)

-- 
2.7.4


* [PATCH v2 net-next 2/2] net/sched: allow flower to match tunnel options
From: Simon Horman @ 2017-09-27  8:16 UTC (permalink / raw)
  To: David Miller, Jiri Pirko
  Cc: Jamal Hadi Salim, Cong Wang, netdev, oss-drivers, Simon Horman
In-Reply-To: <1506500194-17637-1-git-send-email-simon.horman@netronome.com>

Allow matching on options in tunnel headers.
This makes use of existing tunnel metadata support.

Options are a bytestring of up to 256 bytes.
Tunnel implementations may support less or more options,
or no options at all.

e.g.
 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev geneve0 ingress
 # tc filter add dev geneve0 protocol ip parent ffff: \
     flower \
       enc_src_ip 10.0.99.192 \
       enc_dst_ip 10.0.99.193 \
       enc_key_id 11 \
       enc_opts 0102800100800020/fffffffffffffff0 \
       ip_proto udp \
       action mirred egress redirect dev eth1

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

---
v2
* Correct example which was incorrectly described as setting rather
  than matching tunnel options
---
 include/net/flow_dissector.h | 13 +++++++++++++
 include/uapi/linux/pkt_cls.h |  3 +++
 net/sched/cls_flower.c       | 35 ++++++++++++++++++++++++++++++++++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index fc3dce730a6b..43f98bf0b349 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -183,6 +183,18 @@ struct flow_dissector_key_ip {
 	__u8	ttl;
 };
 
+/**
+ * struct flow_dissector_key_enc_opts:
+ * @data: data
+ * @len: len
+ */
+struct flow_dissector_key_enc_opts {
+	u8 data[256];	/* Using IP_TUNNEL_OPTS_MAX is desired here
+			 * but seems difficult to #include
+			 */
+	u8 len;
+};
+
 enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_CONTROL, /* struct flow_dissector_key_control */
 	FLOW_DISSECTOR_KEY_BASIC, /* struct flow_dissector_key_basic */
@@ -205,6 +217,7 @@ enum flow_dissector_key_id {
 	FLOW_DISSECTOR_KEY_MPLS, /* struct flow_dissector_key_mpls */
 	FLOW_DISSECTOR_KEY_TCP, /* struct flow_dissector_key_tcp */
 	FLOW_DISSECTOR_KEY_IP, /* struct flow_dissector_key_ip */
+	FLOW_DISSECTOR_KEY_ENC_OPTS, /* struct flow_dissector_key_enc_opts */
 
 	FLOW_DISSECTOR_KEY_MAX,
 };
diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index d5e2bf68d0d4..7a09a28f21e0 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -467,6 +467,9 @@ enum {
 	TCA_FLOWER_KEY_IP_TTL,		/* u8 */
 	TCA_FLOWER_KEY_IP_TTL_MASK,	/* u8 */
 
+	TCA_FLOWER_KEY_ENC_OPTS,
+	TCA_FLOWER_KEY_ENC_OPTS_MASK,
+
 	__TCA_FLOWER_MAX,
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index d230cb4c8094..e72a17c46f07 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -51,6 +51,7 @@ struct fl_flow_key {
 	struct flow_dissector_key_mpls mpls;
 	struct flow_dissector_key_tcp tcp;
 	struct flow_dissector_key_ip ip;
+	struct flow_dissector_key_enc_opts enc_opts;
 } __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */
 
 struct fl_flow_mask_range {
@@ -181,6 +182,11 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 		skb_key.enc_key_id.keyid = tunnel_id_to_key32(key->tun_id);
 		skb_key.enc_tp.src = key->tp_src;
 		skb_key.enc_tp.dst = key->tp_dst;
+
+		if (info->options_len) {
+			skb_key.enc_opts.len = info->options_len;
+			ip_tunnel_info_opts_get(skb_key.enc_opts.data, info);
+		}
 	}
 
 	skb_key.indev_ifindex = skb->skb_iif;
@@ -421,6 +427,8 @@ static const struct nla_policy fl_policy[TCA_FLOWER_MAX + 1] = {
 	[TCA_FLOWER_KEY_IP_TOS_MASK]	= { .type = NLA_U8 },
 	[TCA_FLOWER_KEY_IP_TTL]		= { .type = NLA_U8 },
 	[TCA_FLOWER_KEY_IP_TTL_MASK]	= { .type = NLA_U8 },
+	[TCA_FLOWER_KEY_ENC_OPTS]	= { .type = NLA_BINARY },
+	[TCA_FLOWER_KEY_ENC_OPTS_MASK]	= { .type = NLA_BINARY },
 };
 
 static void fl_set_key_val(struct nlattr **tb,
@@ -712,6 +720,26 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 		       &mask->enc_tp.dst, TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,
 		       sizeof(key->enc_tp.dst));
 
+	if (tb[TCA_FLOWER_KEY_ENC_OPTS]) {
+		key->enc_opts.len = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS]);
+
+		if (key->enc_opts.len > sizeof(key->enc_opts.data))
+			return -EINVAL;
+
+		/* enc_opts is variable length.
+		 * If present ensure the value and mask are the same length.
+		 */
+		if (tb[TCA_FLOWER_KEY_ENC_OPTS_MASK] &&
+		    nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]) != key->enc_opts.len)
+			return -EINVAL;
+
+		mask->enc_opts.len = key->enc_opts.len;
+		fl_set_key_val(tb, key->enc_opts.data, TCA_FLOWER_KEY_ENC_OPTS,
+			       mask->enc_opts.data,
+			       TCA_FLOWER_KEY_ENC_OPTS_MASK,
+			       key->enc_opts.len);
+	}
+
 	if (tb[TCA_FLOWER_KEY_FLAGS])
 		ret = fl_set_key_flags(tb, &key->control.flags, &mask->control.flags);
 
@@ -804,6 +832,8 @@ static void fl_init_dissector(struct cls_fl_head *head,
 			   enc_control);
 	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
 			     FLOW_DISSECTOR_KEY_ENC_PORTS, enc_tp);
+	FL_KEY_SET_IF_MASKED(&mask->key, keys, cnt,
+			     FLOW_DISSECTOR_KEY_ENC_OPTS, enc_opts);
 
 	skb_flow_dissector_init(&head->dissector, keys, cnt);
 }
@@ -1330,7 +1360,10 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
 			    TCA_FLOWER_KEY_ENC_UDP_DST_PORT,
 			    &mask->enc_tp.dst,
 			    TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK,
-			    sizeof(key->enc_tp.dst)))
+			    sizeof(key->enc_tp.dst)) ||
+	    fl_dump_key_val(skb, key->enc_opts.data, TCA_FLOWER_KEY_ENC_OPTS,
+			    mask->enc_opts.data, TCA_FLOWER_KEY_ENC_OPTS_MASK,
+			    key->enc_opts.len))
 		goto nla_put_failure;
 
 	if (fl_dump_key_flags(skb, key->control.flags, mask->control.flags))
-- 
2.1.4


* [PATCH v2 net-next 1/2] net/sched: add tunnel option support to act_tunnel_key
From: Simon Horman @ 2017-09-27  8:16 UTC (permalink / raw)
  To: David Miller, Jiri Pirko
  Cc: Jamal Hadi Salim, Cong Wang, netdev, oss-drivers, Simon Horman
In-Reply-To: <1506500194-17637-1-git-send-email-simon.horman@netronome.com>

Allow setting tunnel options using the act_tunnel_key action.

Options are a bitwise maskable bytestring of up to 256 bytes.
Tunnel implementations may support less or more options,
or no options at all.

 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev eth0 ingress
 # tc filter add dev eth0 protocol ip parent ffff: \
     flower indev eth0 \
        ip_proto udp \
        action tunnel_key \
            set src_ip 10.0.99.192 \
            dst_ip 10.0.99.193 \
            dst_port 6081 \
            id 11 \
            opts 0102800100800022 \
    action mirred egress redirect dev geneve0

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
v2
* Correct example which was incorrectly described as matching rather
  than setting tunnel options
---
 include/uapi/linux/tc_act/tc_tunnel_key.h |  1 +
 net/sched/act_tunnel_key.c                | 26 +++++++++++++++++++++-----
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/tc_act/tc_tunnel_key.h b/include/uapi/linux/tc_act/tc_tunnel_key.h
index afcd4be953e2..e0cb1121d132 100644
--- a/include/uapi/linux/tc_act/tc_tunnel_key.h
+++ b/include/uapi/linux/tc_act/tc_tunnel_key.h
@@ -35,6 +35,7 @@ enum {
 	TCA_TUNNEL_KEY_PAD,
 	TCA_TUNNEL_KEY_ENC_DST_PORT,	/* be16 */
 	TCA_TUNNEL_KEY_NO_CSUM,		/* u8 */
+	TCA_TUNNEL_KEY_ENC_OPTS,
 	__TCA_TUNNEL_KEY_MAX,
 };
 
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 30c96274c638..77b5890a48b9 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -66,6 +66,7 @@ static const struct nla_policy tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
 	[TCA_TUNNEL_KEY_ENC_KEY_ID]   = { .type = NLA_U32 },
 	[TCA_TUNNEL_KEY_ENC_DST_PORT] = {.type = NLA_U16},
 	[TCA_TUNNEL_KEY_NO_CSUM]      = { .type = NLA_U8 },
+	[TCA_TUNNEL_KEY_ENC_OPTS]     = { .type = NLA_BINARY },
 };
 
 static int tunnel_key_init(struct net *net, struct nlattr *nla,
@@ -81,9 +82,11 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 	struct tcf_tunnel_key *t;
 	bool exists = false;
 	__be16 dst_port = 0;
+	int opts_len = 0;
 	__be64 key_id;
 	__be16 flags;
 	int ret = 0;
+	u8 *opts;
 	int err;
 
 	if (!nla)
@@ -121,6 +124,11 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 		if (tb[TCA_TUNNEL_KEY_ENC_DST_PORT])
 			dst_port = nla_get_be16(tb[TCA_TUNNEL_KEY_ENC_DST_PORT]);
 
+		if (tb[TCA_TUNNEL_KEY_ENC_OPTS]) {
+			opts = nla_data(tb[TCA_TUNNEL_KEY_ENC_OPTS]);
+			opts_len = nla_len(tb[TCA_TUNNEL_KEY_ENC_OPTS]);
+		}
+
 		if (tb[TCA_TUNNEL_KEY_ENC_IPV4_SRC] &&
 		    tb[TCA_TUNNEL_KEY_ENC_IPV4_DST]) {
 			__be32 saddr;
@@ -131,7 +139,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 
 			metadata = __ip_tun_set_dst(saddr, daddr, 0, 0,
 						    dst_port, flags,
-						    key_id, 0);
+						    key_id, opts_len);
 		} else if (tb[TCA_TUNNEL_KEY_ENC_IPV6_SRC] &&
 			   tb[TCA_TUNNEL_KEY_ENC_IPV6_DST]) {
 			struct in6_addr saddr;
@@ -142,9 +150,13 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
 
 			metadata = __ipv6_tun_set_dst(&saddr, &daddr, 0, 0, dst_port,
 						      0, flags,
-						      key_id, 0);
+						      key_id, opts_len);
 		}
 
+		if (metadata && opts_len)
+			ip_tunnel_info_opts_set(&metadata->u.tun_info,
+						opts, opts_len);
+
 		if (!metadata) {
 			ret = -EINVAL;
 			goto err_out;
@@ -264,8 +276,9 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 		goto nla_put_failure;
 
 	if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET) {
-		struct ip_tunnel_key *key =
-			&params->tcft_enc_metadata->u.tun_info.key;
+		struct ip_tunnel_info *info =
+			&params->tcft_enc_metadata->u.tun_info;
+		struct ip_tunnel_key *key = &info->key;
 		__be32 key_id = tunnel_id_to_key32(key->tun_id);
 
 		if (nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id) ||
@@ -273,7 +286,10 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
 					      &params->tcft_enc_metadata->u.tun_info) ||
 		    nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) ||
 		    nla_put_u8(skb, TCA_TUNNEL_KEY_NO_CSUM,
-			       !(key->tun_flags & TUNNEL_CSUM)))
+			       !(key->tun_flags & TUNNEL_CSUM)) ||
+		    (info->options_len &&
+		     nla_put(skb, TCA_TUNNEL_KEY_ENC_OPTS, info->options_len,
+			     info + 1)))
 			goto nla_put_failure;
 	}
 
-- 
2.1.4


* [PATCH v2 net-next 0/2] net/sched: support tunnel options in cls_flower and act_tunnel_key
From: Simon Horman @ 2017-09-27  8:16 UTC (permalink / raw)
  To: David Miller, Jiri Pirko
  Cc: Jamal Hadi Salim, Cong Wang, netdev, oss-drivers, Simon Horman

Allow the flower classifier to match on tunnel options and the
tunnel key action to set them.

Tunnel options are a bytestring of up to 256 bytes.
The flower classifier matches them with an optional bitwise mask.
Tunnel implementations may support more or fewer options,
or none at all.
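
As a conceptual illustration of what a bitwise-masked match on an
options bytestring means, here is a small userspace sketch. The function
opts_match() is hypothetical and is not how cls_flower implements its
lookup; it only shows the comparison semantics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 when pkt and key agree on every bit selected by mask,
 * 0 otherwise. All three buffers are len bytes long.
 */
static int opts_match(const uint8_t *pkt, const uint8_t *key,
		      const uint8_t *mask, size_t len)
{
	size_t i;

	assert(len <= 256);	/* options are at most 256 bytes */

	for (i = 0; i < len; i++)
		if ((pkt[i] ^ key[i]) & mask[i])
			return 0;
	return 1;
}
```

Using the example values from the patches in this series, a packet
carrying options 0102800100800022 matches the filter value
0102800100800020 under mask fffffffffffffff0, because the low four bits
of the last byte are ignored.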


Discussion stemming from review of RFC:

This feature is to be used in conjunction with tunnels in collect metadata
(external) mode. As I understand it there are three tunnel netdevs that use
options metadata in the kernel at this time.

* Geneve

  In the case of Geneve options are TLVs[1]. My reading is that in collect
  metadata mode the kernel does not appear to do anything other than pass
  them around as a bytestring.

  [1] https://tools.ietf.org/html/draft-ietf-nvo3-geneve-05#section-3.5

* VXLAN-GBP

  In the case of VXLAN-GBP on RX in collect metadata mode options are used
  to carry information parsed in vxlan_parse_gbp_hdr() from the VXLAN Group
  Based Policy Extension[2]. On TX the options data is used to create an
  extension (header) by vxlan_build_gbp_hdr().

  [2] https://tools.ietf.org/html/draft-smith-vxlan-group-policy-03#section-2.1

* ERSPAN (GRE)

  In the case of ERSPAN, which is a variant of GRE, on RX in collect
  metadata mode options are used to carry the index parsed from the ERSPAN
  Type II feature header[3] in erspan_rcv().  The converse is true on TX
  and is handled by erspan_fb_xmit().

  [3] https://tools.ietf.org/html/draft-foschiano-erspan-03#section-4.2
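
For the Geneve case above, the following is a minimal sketch of how one
option header could be decoded from the raw bytestring. It assumes the
layout from the Geneve draft referenced in [1]: a 16-bit option class, an
8-bit type whose top bit is the critical flag, three reserved bits and a
5-bit length counted in 4-byte units, excluding the 4-byte option header.
The struct and function names are invented for illustration; the kernel
itself, as noted, just passes the bytes around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct geneve_opt_hdr {
	uint16_t opt_class;	/* option class, host order */
	uint8_t type;		/* type, including the critical bit */
	int critical;		/* 1 if the top bit of type is set */
	size_t data_len;	/* option data length in bytes */
};

/*
 * Decode the option header at buf. Returns 0 on success, -1 if buf is
 * too short for the header or for the data length it declares.
 */
static int geneve_opt_parse(const uint8_t *buf, size_t len,
			    struct geneve_opt_hdr *out)
{
	assert(buf && out);

	if (len < 4)
		return -1;

	out->opt_class = (uint16_t)((buf[0] << 8) | buf[1]);
	out->type = buf[2];
	out->critical = (buf[2] & 0x80) != 0;
	out->data_len = (size_t)(buf[3] & 0x1f) * 4;

	if (len < 4 + out->data_len)
		return -1;
	return 0;
}
```

Read this way, the example bytestring 0102800100800022 used in the
patches would be a single option of class 0x0102, type 0x80 (critical),
with four bytes of data, 00800022.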

Users of options:

* There are eBPF hooks to allow getting and setting tunnel metadata:
  bpf_skb_set_tunnel_opt, bpf_skb_get_tunnel_opt.

* Open vSwitch is able to match and set Geneve and VXLAN-GBP options.

Neither of the above appears to assume any structure for the data.


Changes since RFC:
* Drop RFC prefix
* Correct changelogs and enhance cover letter.


Simon Horman (2):
  net/sched: add tunnel option support to act_tunnel_key
  net/sched: allow flower to match tunnel options

 include/net/flow_dissector.h              | 13 ++++++++++++
 include/uapi/linux/pkt_cls.h              |  3 +++
 include/uapi/linux/tc_act/tc_tunnel_key.h |  1 +
 net/sched/act_tunnel_key.c                | 26 ++++++++++++++++++-----
 net/sched/cls_flower.c                    | 35 ++++++++++++++++++++++++++++++-
 5 files changed, 72 insertions(+), 6 deletions(-)

-- 
2.1.4


* Re: [iproute PATCH v2 0/3] Check user supplied interface name lengths
From: Stephen Hemminger @ 2017-09-27  7:42 UTC (permalink / raw)
  To: Phil Sutter; +Cc: netdev
In-Reply-To: <20170926163548.24347-1-phil@nwl.cc>

On Tue, 26 Sep 2017 18:35:45 +0200
Phil Sutter <phil@nwl.cc> wrote:

> This series adds explicit checks for user-supplied interface names to
> make sure their length fits Linux's requirements.
> 
> The first two patches simplify interface name parsing in some places -
> these are side-effects of working on the actual implementation provided
> in patch three.
> 
> Changes since v1:
> - Patches 1 and 2 introduced.
> - Changes to patch 3 are listed in there.
> 
> Phil Sutter (3):
>   ip{6,}tunnel: Avoid copying user-supplied interface name around
>   tc: flower: No need to cache indev arg
>   Check user supplied interface name lengths
> 
>  include/utils.h |  1 +
>  ip/ip6tunnel.c  |  9 +++++----
>  ip/ipl2tp.c     |  3 ++-
>  ip/iplink.c     | 27 ++++++++-------------------
>  ip/ipmaddr.c    |  1 +
>  ip/iprule.c     |  4 ++++
>  ip/iptunnel.c   | 27 +++++++++++++--------------
>  ip/iptuntap.c   |  4 +++-
>  lib/utils.c     | 10 ++++++++++
>  misc/arpd.c     |  1 +
>  tc/f_flower.c   |  6 ++----
>  11 files changed, 50 insertions(+), 43 deletions(-)
> 

I like the idea, and checking arguments is good.
Why not merge the check and copy and put them in lib/utils.c?

int get_ifname(char *name, const char *arg)
{
...
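
A minimal sketch of what such a merged check-and-copy helper might look
like. This is only an illustration, not the code that was applied: it
assumes the destination buffer holds IFNAMSIZ bytes, that IFNAMSIZ is 16
as in <net/if.h>, and that an empty name should also be rejected (that
last check is an added assumption, not something stated in this thread).

```c
#include <string.h>

#ifndef IFNAMSIZ
#define IFNAMSIZ 16	/* assumed here; normally provided by <net/if.h> */
#endif

/* Validate arg as an interface name and copy it into name.
 * name is assumed to point to a buffer of at least IFNAMSIZ bytes.
 * Returns 0 on success, -1 if arg is empty or too long.
 */
int get_ifname(char *name, const char *arg)
{
	size_t len = strlen(arg);

	/* The name plus its terminating NUL must fit in IFNAMSIZ bytes */
	if (len == 0 || len >= IFNAMSIZ)
		return -1;

	memcpy(name, arg, len + 1);
	return 0;
}
```

With a helper of this shape, a caller would do the check and the copy in
one step and only needs to test the return value.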

^ permalink raw reply

* Re: [PATCH net-next 0/7] nfp: flower vxlan tunnel offload
From: Jiri Pirko @ 2017-09-27  7:40 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Or Gerlitz, Jiri Benc, Simon Horman, David Miller, Jakub Kicinski,
	Linux Netdev List, oss-drivers, John Hurley, Paul Blakey,
	Jiri Pirko, Roi Dayan
In-Reply-To: <1506437410.2643.17.camel@redhat.com>

Tue, Sep 26, 2017 at 04:50:10PM CEST, pabeni@redhat.com wrote:
>On Tue, 2017-09-26 at 17:17 +0300, Or Gerlitz wrote:
>> On Tue, Sep 26, 2017 at 3:51 PM, Jiri Benc <jbenc@redhat.com> wrote:
>> > On Tue, 26 Sep 2017 15:41:37 +0300, Or Gerlitz wrote:
>> > > Please note that the way the rule is being set to the HW driver is by delegation
>> > > done in flower, see these commits (specifically "Add offload support
>> > > using egress Hardware device")
>> > 
>> > It's very well possible the bug is somewhere in net/sched.
>> 
>> maybe before/instead you call it a bug, take a look on the design
>> there and maybe
>> tell us how to possibly do that otherwise?
>
>The problem, AFAICT, is in the API between flower and NIC implementing
>the offload, because in the above example the kernel will call the
>offload hook with exactly the same arguments with the 'bad' rule and
>the 'good' one - but the 'bad' rule should never match any packets.
>
>I think that can be fixed changing the flower code to invoke the
>offload hook for filters with tunnel-based match only if the device
>specified in such match has the appropriate type, e.g. given that
>currently only vxlan is supported with something like the code below
>(very rough and untested, just to give the idea):
>
>Cheers,
>
>Paolo
>
>---
>diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
>index d230cb4c8094..ff8476e56d4e 100644
>--- a/net/sched/cls_flower.c
>+++ b/net/sched/cls_flower.c
>@@ -243,10 +243,11 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
>                                struct fl_flow_key *mask,
>                                struct cls_fl_filter *f)
> {
>-       struct net_device *dev = tp->q->dev_queue->dev;
>+       struct net_device *ingress_dev, *dev = tp->q->dev_queue->dev;
>        struct tc_cls_flower_offload cls_flower = {};
>        int err;
> 
>+       ingress_dev = dev;
>        if (!tc_can_offload(dev)) {
>                if (tcf_exts_get_dev(dev, &f->exts, &f->hw_dev) ||
>                    (f->hw_dev && !tc_can_offload(f->hw_dev))) {
>@@ -259,6 +260,12 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
>                f->hw_dev = dev;
>        }
> 
>+       if ((dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_KEYID) ||
>+            dissector_uses_key(f->dissector, FLOW_DISSECTOR_KEY_ENC_PORTS) ||
>+             // ... list all the others tunnel based keys ...
>+             ) && strcmp(ingress_dev->rtnl_link_ops->kind, "vxlan"))
>+               return tc_skip_sw(f->flags) ? -EINVAL : 0;

These kinds of hooks are giving me nightmares. The code is screwed up as
it is already. I'm currently working on conversion to callbacks. This
part is handled in:
https://github.com/jpirko/linux_mlxsw/commits/jiri_devel_egdevcb

^ permalink raw reply

* Re: [iproute2 net-next 1/3] update headers with CBS API
From: Stephen Hemminger @ 2017-09-27  7:38 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: netdev, intel-wired-lan, jhs, xiyou.wangcong, jiri, andre.guedes,
	ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran, henrik
In-Reply-To: <20170926233958.12027-1-vinicius.gomes@intel.com>

On Tue, 26 Sep 2017 16:39:56 -0700
Vinicius Costa Gomes <vinicius.gomes@intel.com> wrote:

> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>

I won't apply this patch directly, instead will pick up pkt_sched.h on
next update of net-next headers.

^ permalink raw reply

* Re: [PATCH iproute2 1/1] ip: initialize FILE pointer in ip-monitor
From: Stephen Hemminger @ 2017-09-27  7:36 UTC (permalink / raw)
  To: Roman Mashak; +Cc: netdev
In-Reply-To: <1506453236-7034-1-git-send-email-mrv@mojatatu.com>

On Tue, 26 Sep 2017 15:13:56 -0400
Roman Mashak <mrv@mojatatu.com> wrote:

> Since FILE *_fp was not explicitly initialized, all the subsequent print_*()
> calls were failing.
> 
> Signed-off-by: Roman Mashak <mrv@mojatatu.com>

This works, but the later patch by Julien Fortin, which gets rid of the
FILE * argument altogether, is a cleaner solution. I will skip this
patch and apply that one.

^ permalink raw reply

* [PATCH v6 11/11] of: mdio: Prevent of_mdiobus_register from scanning mdio-mux nodes
From: Corentin Labbe @ 2017-09-27  7:34 UTC (permalink / raw)
  To: robh+dt-DgEjT+Ai2ygdnm+yROfE0A, mark.rutland-5wv7dgnIgG8,
	maxime.ripard-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8, wens-jdAy2FN1RRM,
	linux-I+IVW8TIWO2tmTQ+vhA3Yw, catalin.marinas-5wv7dgnIgG8,
	will.deacon-5wv7dgnIgG8, peppe.cavallaro-qxv4g6HH51o,
	alexandre.torgue-qxv4g6HH51o, andrew-g2DYL2Zd6BY,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w,
	frowand.list-Re5JQEeQqe8AvxtiuMwx3w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Corentin Labbe
In-Reply-To: <20170927073414.17361-1-clabbe.montjoie-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Each child node of an MDIO node is scanned as a PHY when calling
of_mdiobus_register(), giving the following result:
[   18.175379] mdio_bus stmmac-0: /soc/ethernet@1c30000/mdio/mdio-mux has invalid PHY address
[   18.175408] mdio_bus stmmac-0: scan phy mdio-mux at address 0
[   18.175450] mdio_bus stmmac-0: scan phy mdio-mux at address 1
[...]
[   18.176420] mdio_bus stmmac-0: scan phy mdio-mux at address 30
[   18.176452] mdio_bus stmmac-0: scan phy mdio-mux at address 31

Since mdio-mux nodes are not PHYs, this patch adds a way to not scan
them.

Signed-off-by: Corentin Labbe <clabbe.montjoie-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 drivers/of/of_mdio.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index d94dd8b77abd..d90ddb0d90f2 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -190,6 +190,10 @@ int of_mdiobus_register(struct mii_bus *mdio, struct device_node *np)
 	struct device_node *child;
 	bool scanphys = false;
 	int addr, rc;
+	static const struct of_device_id do_not_scan[] = {
+		{ .compatible = "mdio-mux" },
+		{}
+	};
 
 	/* Do not continue if the node is disabled */
 	if (!of_device_is_available(np))
@@ -212,6 +216,9 @@ int of_mdiobus_register(struct mii_bus *mdio, struct device_node *np)
 
 	/* Loop over the child nodes and register a phy_device for each phy */
 	for_each_available_child_of_node(np, child) {
+		if (of_match_node(do_not_scan, child))
+			continue;
+
 		addr = of_mdio_parse_addr(&mdio->dev, child);
 		if (addr < 0) {
 			scanphys = true;
@@ -229,6 +236,9 @@ int of_mdiobus_register(struct mii_bus *mdio, struct device_node *np)
 
 	/* auto scan for PHYs with empty reg property */
 	for_each_available_child_of_node(np, child) {
+		if (of_match_node(do_not_scan, child))
+			continue;
+
 		/* Skip PHYs with reg property set */
 		if (of_find_property(child, "reg", NULL))
 			continue;
-- 
2.13.5

^ permalink raw reply related

* [PATCH v6 10/11] net: stmmac: dwmac-sun8i: Handle integrated/external MDIOs
From: Corentin Labbe @ 2017-09-27  7:34 UTC (permalink / raw)
  To: robh+dt-DgEjT+Ai2ygdnm+yROfE0A, mark.rutland-5wv7dgnIgG8,
	maxime.ripard-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8, wens-jdAy2FN1RRM,
	linux-I+IVW8TIWO2tmTQ+vhA3Yw, catalin.marinas-5wv7dgnIgG8,
	will.deacon-5wv7dgnIgG8, peppe.cavallaro-qxv4g6HH51o,
	alexandre.torgue-qxv4g6HH51o, andrew-g2DYL2Zd6BY,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w,
	frowand.list-Re5JQEeQqe8AvxtiuMwx3w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw, Corentin Labbe
In-Reply-To: <20170927073414.17361-1-clabbe.montjoie-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

The Allwinner H3 SoC has two distinct MDIO buses; only one can be
active at a time.
The active MDIO bus is selected via some bits in the EMAC
register of the system controller.

This patch implements this MDIO switch via a custom MDIO-mux.
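
As a rough userspace illustration of the bit selection described above
(not part of the patch itself), the sketch below computes the new
register value for each mux child. The bit positions and child ids are
taken from the defines in this patch; the helper name mux_value() and
the BIT() macro definition are hypothetical and exist only for this
example.

```c
#include <stdint.h>

#define BIT(n)			(UINT32_C(1) << (n))
#define H3_EPHY_SHUTDOWN	BIT(16) /* 1: shutdown, 0: power up */
#define H3_EPHY_SELECT		BIT(15) /* 1: internal PHY, 0: external PHY */
#define H3_EPHY_MUX_MASK	(H3_EPHY_SHUTDOWN | H3_EPHY_SELECT)

#define MUX_INTERNAL_ID		1
#define MUX_EXTERNAL_ID		2

/* Hypothetical helper: compute the syscon value that selects the given
 * child bus, leaving all bits outside H3_EPHY_MUX_MASK untouched.
 * Returns 0 on success, -1 for an unknown child id.
 */
static int mux_value(uint32_t reg, int child, uint32_t *val)
{
	switch (child) {
	case MUX_INTERNAL_ID:
		/* internal PHY selected and powered up */
		*val = (reg & ~H3_EPHY_MUX_MASK) | H3_EPHY_SELECT;
		return 0;
	case MUX_EXTERNAL_ID:
		/* external PHY selected, internal PHY shut down */
		*val = (reg & ~H3_EPHY_MUX_MASK) | H3_EPHY_SHUTDOWN;
		return 0;
	default:
		return -1;
	}
}
```

In the driver the resulting value is written back with regmap_write(),
and the MAC is reset only when the selected bus is the one in use.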

Signed-off-by: Corentin Labbe <clabbe.montjoie-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 drivers/net/ethernet/stmicro/stmmac/Kconfig       |   1 +
 drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 116 +++++++++++++++++++---
 2 files changed, 104 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig b/drivers/net/ethernet/stmicro/stmmac/Kconfig
index 97035766c291..e28c0d2c58e9 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
+++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
@@ -159,6 +159,7 @@ config DWMAC_SUN8I
 	tristate "Allwinner sun8i GMAC support"
 	default ARCH_SUNXI
 	depends on OF && (ARCH_SUNXI || COMPILE_TEST)
+	select MDIO_BUS_MUX
 	---help---
 	  Support for Allwinner H3 A83T A64 EMAC ethernet controllers.
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
index 672553b652bd..8bd500c351b4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
@@ -17,6 +17,7 @@
 #include <linux/clk.h>
 #include <linux/io.h>
 #include <linux/iopoll.h>
+#include <linux/mdio-mux.h>
 #include <linux/mfd/syscon.h>
 #include <linux/module.h>
 #include <linux/of_device.h>
@@ -71,6 +72,7 @@ struct sunxi_priv_data {
 	const struct emac_variant *variant;
 	struct regmap *regmap;
 	bool use_internal_phy;
+	void *mux_handle;
 };
 
 static const struct emac_variant emac_variant_h3 = {
@@ -195,6 +197,9 @@ static const struct emac_variant emac_variant_a64 = {
 #define H3_EPHY_LED_POL		BIT(17) /* 1: active low, 0: active high */
 #define H3_EPHY_SHUTDOWN	BIT(16) /* 1: shutdown, 0: power up */
 #define H3_EPHY_SELECT		BIT(15) /* 1: internal PHY, 0: external PHY */
+#define H3_EPHY_MUX_MASK	(H3_EPHY_SHUTDOWN | H3_EPHY_SELECT)
+#define DWMAC_SUN8I_MDIO_MUX_INTERNAL_ID	1
+#define DWMAC_SUN8I_MDIO_MUX_EXTERNAL_ID	2
 
 /* H3/A64 specific bits */
 #define SYSCON_RMII_EN		BIT(13) /* 1: enable RMII (overrides EPIT) */
@@ -634,6 +639,76 @@ static int sun8i_dwmac_reset(struct stmmac_priv *priv)
 	return 0;
 }
 
+/* MDIO multiplexing switch function
+ * This function is called by the mdio-mux layer when it thinks the mdio bus
+ * multiplexer needs to switch.
+ * 'current_child' is the current value of the mux register
+ * 'desired_child' is the value of the 'reg' property of the target child MDIO
+ * node.
+ * The first time this function is called, current_child == -1.
+ * If current_child == desired_child, then the mux is already set to the
+ * correct bus.
+ *
+ * Note that we do not use reg/mask like mdio-mux-mmioreg because we need to
+ * know easily which bus is used (reset must be done only for desired bus).
+ */
+static int mdio_mux_syscon_switch_fn(int current_child, int desired_child,
+				     void *data)
+{
+	struct stmmac_priv *priv = data;
+	struct sunxi_priv_data *gmac = priv->plat->bsp_priv;
+	u32 reg, val;
+	int ret = 0;
+	bool need_reset = false;
+
+	if (current_child ^ desired_child) {
+		regmap_read(gmac->regmap, SYSCON_EMAC_REG, &reg);
+		switch (desired_child) {
+		case DWMAC_SUN8I_MDIO_MUX_INTERNAL_ID:
+			dev_info(priv->device, "Switch mux to internal PHY\n");
+			val = (reg & ~H3_EPHY_MUX_MASK) | H3_EPHY_SELECT;
+			if (gmac->use_internal_phy)
+				need_reset = true;
+			break;
+		case DWMAC_SUN8I_MDIO_MUX_EXTERNAL_ID:
+			dev_info(priv->device, "Switch mux to external PHY\n");
+			val = (reg & ~H3_EPHY_MUX_MASK) | H3_EPHY_SHUTDOWN;
+			if (!gmac->use_internal_phy)
+				need_reset = true;
+			break;
+		default:
+			dev_err(priv->device, "Invalid child id %x\n", desired_child);
+			return -EINVAL;
+		}
+		regmap_write(gmac->regmap, SYSCON_EMAC_REG, val);
+		/* After changing the syscon value, the MAC needs a reset or it
+		 * will use the last value (and so the last PHY set).
+		 * The reset is necessary only when we reach the needed MDIO;
+		 * it times out in the other case.
+		 */
+		if (need_reset)
+			ret = sun8i_dwmac_reset(priv);
+		else
+			dev_dbg(priv->device, "skipped reset\n");
+	}
+	return ret;
+}
+
+static int sun8i_dwmac_register_mdio_mux(struct stmmac_priv *priv)
+{
+	int ret;
+	struct device_node *mdio_mux;
+	struct sunxi_priv_data *gmac = priv->plat->bsp_priv;
+
+	mdio_mux = of_get_child_by_name(priv->plat->mdio_node, "mdio-mux");
+	if (!mdio_mux)
+		return -ENODEV;
+
+	ret = mdio_mux_init(priv->device, mdio_mux, mdio_mux_syscon_switch_fn,
+			    &gmac->mux_handle, priv, priv->mii);
+	return ret;
+}
+
 static int sun8i_dwmac_set_syscon(struct stmmac_priv *priv)
 {
 	struct sunxi_priv_data *gmac = priv->plat->bsp_priv;
@@ -649,12 +724,7 @@ static int sun8i_dwmac_set_syscon(struct stmmac_priv *priv)
 			 val, reg);
 
 	if (gmac->variant->soc_has_internal_phy) {
-		if (!gmac->use_internal_phy) {
-			/* switch to external PHY interface */
-			reg &= ~H3_EPHY_SELECT;
-		} else {
-			reg |= H3_EPHY_SELECT;
-			reg &= ~H3_EPHY_SHUTDOWN;
+		if (gmac->use_internal_phy) {
 			dev_dbg(priv->device, "Select internal_phy %x\n", reg);
 
 			if (of_property_read_bool(priv->plat->phy_node,
@@ -743,6 +813,8 @@ static void sun8i_dwmac_unset_syscon(struct sunxi_priv_data *gmac)
 {
 	u32 reg = gmac->variant->default_syscon_value;
 
+	if (gmac->variant->soc_has_internal_phy && gmac->mux_handle)
+		mdio_mux_uninit(gmac->mux_handle);
 	regmap_write(gmac->regmap, SYSCON_EMAC_REG, reg);
 }
 
@@ -801,12 +873,6 @@ static int sun8i_power_phy(struct stmmac_priv *priv)
 	if (ret)
 		return ret;
 
-	/* After changing syscon value, the MAC need reset or it will use
-	 * the last value (and so the last PHY set.
-	 */
-	ret = sun8i_dwmac_reset(priv);
-	if (ret)
-		return ret;
 	return 0;
 }
 
@@ -889,6 +955,8 @@ static int sun8i_dwmac_probe(struct platform_device *pdev)
 	struct sunxi_priv_data *gmac;
 	struct device *dev = &pdev->dev;
 	int ret;
+	struct stmmac_priv *priv;
+	struct net_device *ndev;
 
 	ret = stmmac_get_platform_resources(pdev, &stmmac_res);
 	if (ret)
@@ -973,9 +1041,31 @@ static int sun8i_dwmac_probe(struct platform_device *pdev)
 
 	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
 	if (ret)
-		sun8i_dwmac_exit(pdev, plat_dat->bsp_priv);
+		goto dwmac_exit;
+
+	ndev = dev_get_drvdata(&pdev->dev);
+	priv = netdev_priv(ndev);
+	/* The mux must be registered after parent MDIO
+	 * so after stmmac_dvr_probe()
+	 */
+	if (gmac->variant->soc_has_internal_phy) {
+		ret = sun8i_dwmac_register_mdio_mux(priv);
+		if (ret) {
+			dev_err(&pdev->dev, "Failed to register mux\n");
+			goto dwmac_mux;
+		}
+	} else {
+		ret = sun8i_dwmac_reset(priv);
+		if (ret)
+			goto dwmac_exit;
+	}
 
 	return ret;
+dwmac_mux:
+	sun8i_dwmac_unset_syscon(gmac);
+dwmac_exit:
+	sun8i_dwmac_exit(pdev, plat_dat->bsp_priv);
+	return ret;
 }
 
 static const struct of_device_id sun8i_dwmac_match[] = {
-- 
2.13.5

^ permalink raw reply related

